# Data Preprocessing

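This notebook cleans the raw article data: it removes articles with missing text, extracts the year, volume, and issue number from each file name, resolves missing keywords, drops plagiarized and retracted articles, and saves the result.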

In [1]:
import pandas as pd

# load data frame
df = pd.read_csv('data/article_full_text.csv')

# delete the first column, "Unnamed: 0"
df.drop(columns=['Unnamed: 0'], inplace=True)

# display the first 5 rows
df.head(5)

Unnamed: 0,Title,Abstract,Keywords,File Name,URL,Text
0,Predictive Modeling Applied to Structured Clin...,Predictive analysis is one of current importan...,"Electronic Health Record, FI Nnish Diabetes R ...",2021_17_9_jcssp.2021.762.775.pdf,https://thescipub.com/pdf/jcssp.2021.762.775.pdf,Electronic Health Record (EHR) is the set of c...
1,Predicting Risk of Diabetes using a Model base...,Diabetes (diabetes mellitus) is a disease emer...,"Diabetes Risk Prediction, FI Nnish Diabetes R ...",2021_17_9_jcssp.2021.748.761.pdf,https://thescipub.com/pdf/jcssp.2021.748.761.pdf,The diseases prevention is one of the topic of...
2,Impact and Control of Drug Therapy Guidelines ...,"Since December 2019, many unexplained viral pn...","COVID-19, Cancer, Pneumonia and Healthcare",2021_17_8_jcssp.2021.738.747.pdf,https://thescipub.com/pdf/jcssp.2021.738.747.pdf,A. The Possible Impact of NCP Epidemic on Canc...
3,Structural Equation Model (SEM) for Evaluating...,Information Communication Technology for Devel...,"ICT4D, Success Factors, Structural Equation Mo...",2021_17_8_jcssp.2021.724.737.pdf,https://thescipub.com/pdf/jcssp.2021.724.737.pdf,Information Communication Technology for Devel...
4,Extended Fuzzy Decision Support Model for Crop...,Food crops are the preferred crops to be culti...,"Fuzzy Logic, Decision Support Model, Euclidean...",2021_17_8_jcssp.2021.709.723.pdf,https://thescipub.com/pdf/jcssp.2021.709.723.pdf,Food crop productivity is determined by the qu...
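
An `Unnamed: 0` column usually means the file was written with its index and read back without one. As an alternative to dropping the column after loading, the sketch below reads it as the index directly. The `raw_csv` string is a made-up stand-in for the real file, not data from this project.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/article_full_text.csv: the empty leading
# header cell is what pandas names "Unnamed: 0" on load.
raw_csv = ",Title,Text\n0,Paper A,Some text\n1,Paper B,\n"

# Default read: the saved index comes back as an extra column.
with_extra = pd.read_csv(io.StringIO(raw_csv))

# index_col=0 tells pandas to use the first column as the index instead.
toy = pd.read_csv(io.StringIO(raw_csv), index_col=0)

print(list(with_extra.columns))
print(list(toy.columns))
```

Either approach should give the same columns; `index_col=0` just avoids the separate `drop` step.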


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2682 entries, 0 to 2681
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      2682 non-null   object
 1   Abstract   2682 non-null   object
 2   Keywords   2676 non-null   object
 3   File Name  2682 non-null   object
 4   URL        2682 non-null   object
 5   Text       2640 non-null   object
dtypes: object(6)
memory usage: 125.8+ KB


In [3]:
# exclude articles that have missing text
data = df[df['Text'].notnull()].copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2640 entries, 0 to 2679
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Title      2640 non-null   object
 1   Abstract   2640 non-null   object
 2   Keywords   2634 non-null   object
 3   File Name  2640 non-null   object
 4   URL        2640 non-null   object
 5   Text       2640 non-null   object
dtypes: object(6)
memory usage: 144.4+ KB
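
The filtered frame keeps the original index labels, which is why the output above reports 2640 entries with labels still running from 0 to 2679. A minimal sketch on a made-up frame, using `dropna(subset=...)` as an equivalent way to write the same filter:

```python
import pandas as pd

# Toy frame (made-up values) with one missing Text entry at label 1.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Text": ["alpha", None, "gamma"],
})

# Equivalent to toy[toy["Text"].notnull()].copy()
kept = toy.dropna(subset=["Text"])

# The surviving rows keep their original labels, so labels and
# positions no longer line up after the filter.
print(kept.index.tolist())    # labels of the remaining rows
print(kept.iloc[1]["Title"])  # position 1 now holds the row labelled 2
```

This matters later whenever a row is addressed by number: `.loc` uses the label, `.iloc` uses the position.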


## Extract Year, Volume Number, and Issue Number

In [4]:
data['Year'] = data['File Name'].apply(lambda x: x.split('_')[0])
data['Volume#'] = data['File Name'].apply(lambda x: x.split('_')[1])
data['Issue#'] = data['File Name'].apply(lambda x: x.split('_')[2])

# drop column "File Name"
data.drop('File Name', axis=1, inplace=True)

# display the first 5 rows
data.head()

Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
0,Predictive Modeling Applied to Structured Clin...,Predictive analysis is one of current importan...,"Electronic Health Record, FI Nnish Diabetes R ...",https://thescipub.com/pdf/jcssp.2021.762.775.pdf,Electronic Health Record (EHR) is the set of c...,2021,17,9
1,Predicting Risk of Diabetes using a Model base...,Diabetes (diabetes mellitus) is a disease emer...,"Diabetes Risk Prediction, FI Nnish Diabetes R ...",https://thescipub.com/pdf/jcssp.2021.748.761.pdf,The diseases prevention is one of the topic of...,2021,17,9
2,Impact and Control of Drug Therapy Guidelines ...,"Since December 2019, many unexplained viral pn...","COVID-19, Cancer, Pneumonia and Healthcare",https://thescipub.com/pdf/jcssp.2021.738.747.pdf,A. The Possible Impact of NCP Epidemic on Canc...,2021,17,8
3,Structural Equation Model (SEM) for Evaluating...,Information Communication Technology for Devel...,"ICT4D, Success Factors, Structural Equation Mo...",https://thescipub.com/pdf/jcssp.2021.724.737.pdf,Information Communication Technology for Devel...,2021,17,8
4,Extended Fuzzy Decision Support Model for Crop...,Food crops are the preferred crops to be culti...,"Fuzzy Logic, Decision Support Model, Euclidean...",https://thescipub.com/pdf/jcssp.2021.709.723.pdf,Food crop productivity is determined by the qu...,2021,17,8
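
The three `apply` calls split each file name three times. A possible alternative, sketched on two file names taken from the table above, splits once with the vectorized `str.split` and also converts the parts to integers. The integer conversion is an assumption about downstream needs; the notebook itself keeps these columns as strings.

```python
import pandas as pd

# Two file names copied from the first table in this notebook.
toy = pd.DataFrame({
    "File Name": [
        "2021_17_9_jcssp.2021.762.775.pdf",
        "2021_17_8_jcssp.2021.738.747.pdf",
    ]
})

# Split at most three times on "_": parts 0, 1, 2 are year, volume, issue;
# part 3 is the remainder of the file name.
parts = toy["File Name"].str.split("_", n=3, expand=True)

toy["Year"] = parts[0].astype(int)
toy["Volume#"] = parts[1].astype(int)
toy["Issue#"] = parts[2].astype(int)

print(toy[["Year", "Volume#", "Issue#"]])
```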


## Missing Values

In [5]:
# look at the articles that have missing keywords
missing_keywords = data[data['Keywords'].isnull()].copy()
missing_keywords

# after reviewing the PDF files, the following articles need to be removed:
# - the first three (indices 422, 423, 424) have only 1 page and do not contain any article text
# - the article at index 968 has no keywords and does not follow the standard structure of a literature review
# - the article at index 986 has only 1 page and does not contain relevant text

Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
422,Corrigendum: An Efficient Cell Placement Using...,Correction to: Journal of Computer Science htt...,,https://thescipub.com/pdf/jcssp.2018.437.pdf,© 2018 The Author(s). This open access article...,2018,14,3
423,Corrigendum: Adaptive Resonance Theory Trainin...,Correction to: Journal of Computer Science htt...,,https://thescipub.com/pdf/jcssp.2018.436.pdf,© 2018 The Author(s). This open access article...,2018,14,3
424,Corrigendum: Simulated Annealing with Determin...,Correction to: Journal of Computer Science htt...,,https://thescipub.com/pdf/jcssp.2018.435.pdf,© 2018 The Author(s). This open access article...,2018,14,3
771,A MOBILE AGENT BASED APPROACH FOR AUTOMATING &...,"Nowadays, the focus is not only on how to exch...",,https://thescipub.com/pdf/jcssp.2014.1628.1641...,A web service is a computer program for commun...,2014,10,9
968,Data Mining in Time Series: Current Study and ...,Time series represent sequences of data points...,,https://thescipub.com/pdf/jcssp.2014.2358.2359...,Journal of Computer Science 10 (12): 2358-2359...,2014,10,12
986,EMERGING TRENDS IN ADAPTIVE COMPUTATION FOR MO...,The future is in networked embedded systems. T...,,https://thescipub.com/pdf/jcssp.2014.2164.2164...,"Journal of Computer Science 10 (11): 2164, 201...",2014,10,11


In [6]:
# add the missing keywords for the article at index label 771
# label 771 sits at position 770 because rows with missing text were removed earlier,
# so update by label with .loc instead of by position with .iloc
data.loc[771, 'Keywords'] = 'Semantic Web Services, Semantic Web Services Discovery, Semantic Web Services Composition, Ontology, Mobile Agent, JADE'

# make sure the data has been updated
data.loc[771]

Title       A MOBILE AGENT BASED APPROACH FOR AUTOMATING &...
Abstract    Nowadays, the focus is not only on how to exch...
Keywords    Semantic Web Services, Semantic Web Services D...
URL         https://thescipub.com/pdf/jcssp.2014.1628.1641...
Text        A web service is a computer program for commun...
Year                                                     2014
Volume#                                                    10
Issue#                                                      9
Name: 771, dtype: object
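
Label and position differ once rows have been filtered out, which is why the row named 771 is found at position 770. A minimal sketch on a made-up frame of a label-based update with `.loc`, which does not depend on how many earlier rows were removed:

```python
import pandas as pd

# Toy frame whose labels skip 1, mimicking a row removed earlier.
toy = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Keywords": ["k1", None, "k3"]},
    index=[0, 2, 3],
)

# Label-based update: addresses the row labelled 2 regardless of position.
toy.loc[2, "Keywords"] = "filled in"

print(toy.loc[2, "Keywords"])
print(toy.index.get_loc(2))  # position of label 2 within the frame
```

Writing through `.loc[row, column]` in a single step also avoids assigning to a temporary row object, which pandas may treat as a copy.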

In [7]:
# check that row 771 has been updated; it should no longer appear in the table below
# look at the articles that still have missing keywords
missing_keywords = data[data['Keywords'].isnull()].copy()
missing_keywords

Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
422,Corrigendum: An Efficient Cell Placement Using...,Correction to: Journal of Computer Science htt...,,https://thescipub.com/pdf/jcssp.2018.437.pdf,© 2018 The Author(s). This open access article...,2018,14,3
423,Corrigendum: Adaptive Resonance Theory Trainin...,Correction to: Journal of Computer Science htt...,,https://thescipub.com/pdf/jcssp.2018.436.pdf,© 2018 The Author(s). This open access article...,2018,14,3
424,Corrigendum: Simulated Annealing with Determin...,Correction to: Journal of Computer Science htt...,,https://thescipub.com/pdf/jcssp.2018.435.pdf,© 2018 The Author(s). This open access article...,2018,14,3
968,Data Mining in Time Series: Current Study and ...,Time series represent sequences of data points...,,https://thescipub.com/pdf/jcssp.2014.2358.2359...,Journal of Computer Science 10 (12): 2358-2359...,2014,10,12
986,EMERGING TRENDS IN ADAPTIVE COMPUTATION FOR MO...,The future is in networked embedded systems. T...,,https://thescipub.com/pdf/jcssp.2014.2164.2164...,"Journal of Computer Science 10 (11): 2164, 201...",2014,10,11


In [8]:
# get the index labels of the rows that still have missing keywords
removed_rows = list(missing_keywords.index)
print(removed_rows)

[422, 423, 424, 968, 986]


In [9]:
# drop the rows in the removed_rows list
data.drop(index=removed_rows, inplace=True)

# check if there are any missing values left
data[data['Keywords'].isnull()]

Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2635 entries, 0 to 2679
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Title     2635 non-null   object
 1   Abstract  2635 non-null   object
 2   Keywords  2635 non-null   object
 3   URL       2635 non-null   object
 4   Text      2635 non-null   object
 5   Year      2635 non-null   object
 6   Volume#   2635 non-null   object
 7   Issue#    2635 non-null   object
dtypes: object(8)
memory usage: 185.3+ KB


## Plagiarized / Retracted Articles

In [11]:
data.describe()

Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
count,2635,2635,2635,2635,2635,2635,2635,2635
unique,2635,2633,2630,2635,2635,17,17,12
top,Improving Response Time of Authorization Proce...,Publication of this article is cancelled due t...,"Plastic Optical Fiber, Demultiplexer, Spectral...",https://thescipub.com/pdf/jcssp.2007.134.137.pdf,Speech and natural language understanding are ...,2014,10,1
freq,1,2,3,1,1,290,290,239


In [12]:
data.describe().Abstract.top

'Publication of this article is cancelled due to plagiarism.'
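
`describe()` reports only the single most frequent value per column. To list every repeated abstract at once, one option is `value_counts` together with `duplicated(keep=False)`. The sketch below uses a made-up frame; only the notice text is taken from the output above.

```python
import pandas as pd

# Toy frame (made-up titles and abstracts) with one repeated notice.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Abstract": [
        "Publication of this article is cancelled due to plagiarism.",
        "A real abstract.",
        "Publication of this article is cancelled due to plagiarism.",
        "Another real abstract.",
    ],
})

# Abstract texts that occur more than once, with their counts.
counts = toy["Abstract"].value_counts()
repeated = counts[counts > 1]

# All rows that share an abstract with at least one other row.
dupes = toy[toy["Abstract"].duplicated(keep=False)]

print(repeated)
print(dupes["Title"].tolist())
```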

In [13]:
# look at the plagiarized articles
data[data['Abstract'] == 'Publication of this article is cancelled due to plagiarism.']

Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
2179,RETRACTED: Object Oriented and Multi-Scale Ima...,Publication of this article is cancelled due t...,"Object based image analysis, hierarchical netw...",https://thescipub.com/pdf/jcssp.2008.706.712.pdf,What is OBIA?: In the absence of a formal defi...,2008,4,9
2389,RETRACTED: A Bayesian Networks in Intrusion De...,Publication of this article is cancelled due t...,"Computer network, Security, Intrusion detectio...",https://thescipub.com/pdf/jcssp.2007.259.265.pdf,Intrusion detection can be defined as the proc...,2007,3,5


In [14]:
# exclude plagiarized articles from the data
data = data[data['Abstract'] != 'Publication of this article is cancelled due to plagiarism.']

# display data info
data.info()

# describe data
data.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2633 entries, 0 to 2679
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Title     2633 non-null   object
 1   Abstract  2633 non-null   object
 2   Keywords  2633 non-null   object
 3   URL       2633 non-null   object
 4   Text      2633 non-null   object
 5   Year      2633 non-null   object
 6   Volume#   2633 non-null   object
 7   Issue#    2633 non-null   object
dtypes: object(8)
memory usage: 185.1+ KB


Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
count,2633,2633,2633,2633,2633,2633,2633,2633
unique,2633,2632,2628,2633,2633,17,17,12
top,Improving Response Time of Authorization Proce...,This article has been retracted at the request...,"Plastic Optical Fiber, Demultiplexer, Spectral...",https://thescipub.com/pdf/jcssp.2007.134.137.pdf,Speech and natural language understanding are ...,2014,10,1
freq,1,2,3,1,1,290,290,239


In [15]:
data.describe().Abstract.top

'This article has been retracted at the request of the authors.'

In [16]:
# exclude retracted articles from the data
data = data[data['Abstract'] != 'This article has been retracted at the request of the authors.']

# display data info
data.info()

# describe data
data.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2631 entries, 0 to 2679
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Title     2631 non-null   object
 1   Abstract  2631 non-null   object
 2   Keywords  2631 non-null   object
 3   URL       2631 non-null   object
 4   Text      2631 non-null   object
 5   Year      2631 non-null   object
 6   Volume#   2631 non-null   object
 7   Issue#    2631 non-null   object
dtypes: object(8)
memory usage: 185.0+ KB


Unnamed: 0,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#
count,2631,2631,2631,2631,2631,2631,2631,2631
unique,2631,2631,2626,2631,2631,17,17,12
top,Improving Response Time of Authorization Proce...,Problem statement: The phenomenon of flashover...,"Plastic Optical Fiber, Demultiplexer, Spectral...",https://thescipub.com/pdf/jcssp.2007.134.137.pdf,Speech and natural language understanding are ...,2014,10,1
freq,1,1,3,1,1,290,290,239
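
The two exclusion steps above could also be combined into one filter with `isin`. A minimal sketch on a made-up frame; the `NOTICES` list is a name introduced here for illustration and holds the two notice strings found in this section.

```python
import pandas as pd

# The two placeholder abstracts identified above.
NOTICES = [
    "Publication of this article is cancelled due to plagiarism.",
    "This article has been retracted at the request of the authors.",
]

# Toy frame (made-up titles) containing one of each notice.
toy = pd.DataFrame({
    "Title": ["RETRACTED: X", "Y", "Z"],
    "Abstract": [NOTICES[0], "A real abstract.", NOTICES[1]],
})

# Keep only rows whose abstract is not one of the notices.
clean = toy[~toy["Abstract"].isin(NOTICES)].copy()

print(clean["Title"].tolist())
```

Keeping the notices in one list makes it straightforward to add another string if a further placeholder abstract turns up.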


## Save Data

In [17]:
data.to_csv('data/article_fulltext_preprocessed.csv', index=False)
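
Saving with `index=False` means the file has no index column, so reloading it does not produce an `Unnamed: 0` column, and the rows get a fresh 0-based index. A minimal round-trip sketch on a made-up frame, written to an in-memory buffer rather than to disk:

```python
import io
import pandas as pd

# Toy frame (made-up values) with gaps in the index and string years,
# similar in shape to the preprocessed data.
toy = pd.DataFrame(
    {"Title": ["A", "B"], "Year": ["2021", "2014"]},
    index=[0, 5],
)

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
print(reloaded.index.tolist())
print(reloaded["Year"].dtype)
```

Note that `read_csv` infers column types on load, so digit-only columns such as `Year` come back as integers even though they were strings when saved.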