# Data Preprocessing

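This notebook cleans the raw article data in `data/article_full_text.csv`: it removes incomplete, retracted, and invalid records, derives metadata and length features, and normalizes the full texts, abstracts, and keywords.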

In [1]:
import pandas as pd

# load data frame
df = pd.read_csv('data/article_full_text.csv')

# display the first 5 rows
df.head()

Unnamed: 0,Date,Title,Abstract,Keywords,File Name,URL,Text
0,Published: 8 January 2021,A Systematic Literature Review on English and ...,Due to the enormous growth of information and ...,"English Bangla Comparison, Latent Dirichlet Al...",2021_17_1_jcssp.2021.1.18.pdf,https://thescipub.com/pdf/jcssp.2021.1.18.pdf,Because of the rapid development of Informatio...
1,Published: 21 January 2021,DAD: A Detailed Arabic Dataset for Online Text...,This paper presents a novel Arabic dataset tha...,"Arabic Dataset, Arabic Benchmark, Arabic Recog...",2021_17_1_jcssp.2021.19.32.pdf,https://thescipub.com/pdf/jcssp.2021.19.32.pdf,"In the literature, many papers that focus on A..."
2,Published: 20 January 2021,Collision Avoidance Modelling in Airline Traff...,An Air Traffic Controller (ATC) system aims to...,"Air Traffic Control, Collision Avoidance, Conf...",2021_17_1_jcssp.2021.33.43.pdf,https://thescipub.com/pdf/jcssp.2021.33.43.pdf,Collision avoidance on air traffic becomes ver...
3,Published: 20 January 2021,Fine-Tuned MobileNet Classifier for Classifica...,"This paper proposed an accurate, fast and reli...","Strawberry, Cherry Fruit, Accuracy, MobileNet,...",2021_17_1_jcssp.2021.44.54.pdf,https://thescipub.com/pdf/jcssp.2021.44.54.pdf,"In recent years, farmers in India eventually l..."
4,Published: 21 January 2021,A Content Filtering from Spam Posts on Social ...,The system for filtering spam posts on social ...,"Content Filtering, Spam Detection, Multimodal ...",2021_17_1_jcssp.2021.55.66.pdf,https://thescipub.com/pdf/jcssp.2021.55.66.pdf,Spam is the use of electronic devices to trans...


## Remove Rows with Missing Text and Keywords

In [2]:
# check for missing values in the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2691 entries, 0 to 2690
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       2691 non-null   object
 1   Title      2691 non-null   object
 2   Abstract   2691 non-null   object
 3   Keywords   2686 non-null   object
 4   File Name  2691 non-null   object
 5   URL        2691 non-null   object
 6   Text       2651 non-null   object
dtypes: object(7)
memory usage: 147.3+ KB


In [3]:
# exclude articles that have missing text
data = df[df['Text'].notnull()].copy()

# exclude articles that have missing keywords
data = data[data['Keywords'].notnull()]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2646 entries, 0 to 2690
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       2646 non-null   object
 1   Title      2646 non-null   object
 2   Abstract   2646 non-null   object
 3   Keywords   2646 non-null   object
 4   File Name  2646 non-null   object
 5   URL        2646 non-null   object
 6   Text       2646 non-null   object
dtypes: object(7)
memory usage: 165.4+ KB


## Extract Year, Month, Volume Number, and Issue Number

In [4]:
# clean Date field
data['Date'] = data['Date'].apply(lambda x: x.split(':')[1].strip())

# extract year, volume, and issue from the file name, and month from the cleaned date
data['Year'] = data['File Name'].apply(lambda x: x.split('_')[0])
data['Volume#'] = data['File Name'].apply(lambda x: x.split('_')[1])
data['Issue#'] = data['File Name'].apply(lambda x: x.split('_')[2])
data['Month'] = data['Date'].apply(lambda x: x.split(' ')[-2])

# drop column "File Name"
data.drop('File Name', axis=1, inplace=True)

# display the first 5 rows
data.head()

Unnamed: 0,Date,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#,Month
0,8 January 2021,A Systematic Literature Review on English and ...,Due to the enormous growth of information and ...,"English Bangla Comparison, Latent Dirichlet Al...",https://thescipub.com/pdf/jcssp.2021.1.18.pdf,Because of the rapid development of Informatio...,2021,17,1,January
1,21 January 2021,DAD: A Detailed Arabic Dataset for Online Text...,This paper presents a novel Arabic dataset tha...,"Arabic Dataset, Arabic Benchmark, Arabic Recog...",https://thescipub.com/pdf/jcssp.2021.19.32.pdf,"In the literature, many papers that focus on A...",2021,17,1,January
2,20 January 2021,Collision Avoidance Modelling in Airline Traff...,An Air Traffic Controller (ATC) system aims to...,"Air Traffic Control, Collision Avoidance, Conf...",https://thescipub.com/pdf/jcssp.2021.33.43.pdf,Collision avoidance on air traffic becomes ver...,2021,17,1,January
3,20 January 2021,Fine-Tuned MobileNet Classifier for Classifica...,"This paper proposed an accurate, fast and reli...","Strawberry, Cherry Fruit, Accuracy, MobileNet,...",https://thescipub.com/pdf/jcssp.2021.44.54.pdf,"In recent years, farmers in India eventually l...",2021,17,1,January
4,21 January 2021,A Content Filtering from Spam Posts on Social ...,The system for filtering spam posts on social ...,"Content Filtering, Spam Detection, Multimodal ...",https://thescipub.com/pdf/jcssp.2021.55.66.pdf,Spam is the use of electronic devices to trans...,2021,17,1,January
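
The string splitting used above can be checked on a single record without pandas. The following is a minimal sketch; the helper names `parse_date` and `parse_file_name` are hypothetical and do not appear in the notebook, and the sample values are taken from the first row of the data.

```python
def parse_date(raw):
    """Return (clean_date, month) from a string like 'Published: 8 January 2021'."""
    # drop the 'Published:' prefix, as the notebook does with split(':')[1]
    clean = raw.split(':', 1)[1].strip()
    # the month is the second-to-last space-separated token
    month = clean.split(' ')[-2]
    return clean, month


def parse_file_name(file_name):
    """Return (year, volume, issue) from a name like '2021_17_1_jcssp.2021.1.18.pdf'."""
    year, volume, issue = file_name.split('_')[:3]
    return year, volume, issue


print(parse_date('Published: 8 January 2021'))
# ('8 January 2021', 'January')
print(parse_file_name('2021_17_1_jcssp.2021.1.18.pdf'))
# ('2021', '17', '1')
```

Note that all extracted values are strings, which matches the `object` dtype reported for `Year`, `Volume#`, and `Issue#` later in the notebook.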


## Remove Plagiarized / Retracted Articles

In [5]:
data.describe()

Unnamed: 0,Date,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#,Month
count,2646,2646,2646,2646,2646,2646,2646,2646,2646,2646
unique,1040,2646,2644,2641,2646,2646,17,17,12,12
top,31 December 2009,A Proposed Framework for Reducing Electricity ...,Publication of this article is cancelled due t...,"Plastic Optical Fiber, Demultiplexer, Spectral...",https://thescipub.com/pdf/jcssp.2014.614.622.pdf,The signal processing plays an important role ...,2014,10,1,November
freq,30,1,2,3,1,1,290,290,240,253


In [6]:
data.describe().Abstract.top

'Publication of this article is cancelled due to plagiarism.'

In [7]:
# look at the plagiarized articles
data[data['Abstract'] == 'Publication of this article is cancelled due to plagiarism.']

Unnamed: 0,Date,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#,Month
2277,30 September 2008,RETRACTED: Object Oriented and Multi-Scale Ima...,Publication of this article is cancelled due t...,"Object based image analysis, hierarchical netw...",https://thescipub.com/pdf/jcssp.2008.706.712.pdf,What is OBIA?: In the absence of a formal defi...,2008,4,9,September
2372,31 May 2007,RETRACTED: A Bayesian Networks in Intrusion De...,Publication of this article is cancelled due t...,"Computer network, Security, Intrusion detectio...",https://thescipub.com/pdf/jcssp.2007.259.265.pdf,Intrusion detection can be defined as the proc...,2007,3,5,May


In [8]:
# exclude plagiarized articles from the data
data = data[data['Abstract'] != 'Publication of this article is cancelled due to plagiarism.']

# display data info
data.info()

# describe data
data.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2644 entries, 0 to 2690
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      2644 non-null   object
 1   Title     2644 non-null   object
 2   Abstract  2644 non-null   object
 3   Keywords  2644 non-null   object
 4   URL       2644 non-null   object
 5   Text      2644 non-null   object
 6   Year      2644 non-null   object
 7   Volume#   2644 non-null   object
 8   Issue#    2644 non-null   object
 9   Month     2644 non-null   object
dtypes: object(10)
memory usage: 227.2+ KB


Unnamed: 0,Date,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#,Month
count,2644,2644,2644,2644,2644,2644,2644,2644,2644,2644
unique,1040,2644,2643,2639,2644,2644,17,17,12,12
top,31 December 2009,A Proposed Framework for Reducing Electricity ...,This article has been retracted at the request...,"Plastic Optical Fiber, Demultiplexer, Spectral...",https://thescipub.com/pdf/jcssp.2014.614.622.pdf,The signal processing plays an important role ...,2014,10,1,November
freq,30,1,2,3,1,1,290,290,240,253


In [9]:
data.describe().Abstract.top

'This article has been retracted at the request of the authors.'

In [10]:
# exclude retracted articles from the data
data = data[data['Abstract'] != 'This article has been retracted at the request of the authors.']

# display data info
data.info()

# describe data
data.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2642 entries, 0 to 2690
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      2642 non-null   object
 1   Title     2642 non-null   object
 2   Abstract  2642 non-null   object
 3   Keywords  2642 non-null   object
 4   URL       2642 non-null   object
 5   Text      2642 non-null   object
 6   Year      2642 non-null   object
 7   Volume#   2642 non-null   object
 8   Issue#    2642 non-null   object
 9   Month     2642 non-null   object
dtypes: object(10)
memory usage: 227.0+ KB


Unnamed: 0,Date,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#,Month
count,2642,2642,2642,2642,2642,2642,2642,2642,2642,2642
unique,1039,2642,2642,2637,2642,2642,17,17,12,12
top,31 December 2009,A Proposed Framework for Reducing Electricity ...,Problem statement: This study described the an...,"Plastic Optical Fiber, Demultiplexer, Spectral...",https://thescipub.com/pdf/jcssp.2014.614.622.pdf,The signal processing plays an important role ...,2014,10,1,November
freq,30,1,1,3,1,1,290,290,240,253


## Derived Features

In [11]:
# title length per article, in characters
data['Title Length'] = data.Title.str.len()

# abstract length per article, in characters
data['Abstract Length'] = data.Abstract.str.len()

# full-text length per article, in characters
data['Text Length'] = data.Text.str.len()

# number of comma-separated keywords per article
data['Number of Keywords'] = data.Keywords.apply(lambda keywords: len(keywords.split(',')))
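
Counting keywords with `len(keywords.split(','))` also counts empty entries, for example after a trailing comma. Whether the `Keywords` column contains such entries is not shown in this notebook, so the following is only a sketch of a stricter count; the helper name `count_keywords` is hypothetical.

```python
def count_keywords(keywords):
    """Count comma-separated keywords, ignoring empty or whitespace-only entries."""
    return len([kw for kw in keywords.split(',') if kw.strip()])


print(count_keywords('Strawberry, Cherry Fruit, Accuracy'))   # 3
print(count_keywords('Strawberry, Cherry Fruit, '))           # 2
print(len('Strawberry, Cherry Fruit, '.split(',')))           # 3 (plain split counts the empty entry)
```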

## Text Length

In [12]:
data['Text Length'].describe()

count      2642.000000
mean      17972.945117
std       11124.030230
min          49.000000
25%       10128.250000
50%       16710.000000
75%       23664.500000
max      107028.000000
Name: Text Length, dtype: float64

In [13]:
# keep only articles whose full text starts with a capital letter;
# this excludes articles with invalid text
data = data[data['Text'].str.contains(r'^[A-Z]', na=False)].reset_index(drop=True)

# exclude articles whose text is 1000 characters or shorter
# ('Text Length' counts characters, not words)
data = data[data['Text Length'] > 1000].reset_index(drop=True)

# print data's shape and show the first 5 rows
print("Data shape:", data.shape)
data.head()

Data shape: (1811, 14)


Unnamed: 0,Date,Title,Abstract,Keywords,URL,Text,Year,Volume#,Issue#,Month,Title Length,Abstract Length,Text Length,Number of Keywords
0,8 January 2021,A Systematic Literature Review on English and ...,Due to the enormous growth of information and ...,"English Bangla Comparison, Latent Dirichlet Al...",https://thescipub.com/pdf/jcssp.2021.1.18.pdf,Because of the rapid development of Informatio...,2021,17,1,January,67,2773,48221,6
1,21 January 2021,DAD: A Detailed Arabic Dataset for Online Text...,This paper presents a novel Arabic dataset tha...,"Arabic Dataset, Arabic Benchmark, Arabic Recog...",https://thescipub.com/pdf/jcssp.2021.19.32.pdf,"In the literature, many papers that focus on A...",2021,17,1,January,96,2553,37984,9
2,20 January 2021,Collision Avoidance Modelling in Airline Traff...,An Air Traffic Controller (ATC) system aims to...,"Air Traffic Control, Collision Avoidance, Conf...",https://thescipub.com/pdf/jcssp.2021.33.43.pdf,Collision avoidance on air traffic becomes ver...,2021,17,1,January,109,3375,30346,4
3,20 January 2021,Fine-Tuned MobileNet Classifier for Classifica...,"This paper proposed an accurate, fast and reli...","Strawberry, Cherry Fruit, Accuracy, MobileNet,...",https://thescipub.com/pdf/jcssp.2021.44.54.pdf,"In recent years, farmers in India eventually l...",2021,17,1,January,87,3283,29159,5
4,21 January 2021,A Content Filtering from Spam Posts on Social ...,The system for filtering spam posts on social ...,"Content Filtering, Spam Detection, Multimodal ...",https://thescipub.com/pdf/jcssp.2021.55.66.pdf,Spam is the use of electronic devices to trans...,2021,17,1,January,86,2745,30537,5
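
The two filters above can be expressed as a single predicate on one text, which makes the boundary cases easy to check. This is a minimal sketch using only the standard library; the name `is_valid_text` and the `min_chars` parameter are hypothetical, and the length is measured in characters because `Text Length` comes from `str.len()`.

```python
import re

# same test as the pandas filter: the first character must be an uppercase ASCII letter
STARTS_WITH_CAPITAL = re.compile(r'[A-Z]')


def is_valid_text(text, min_chars=1000):
    """Return True if text starts with a capital letter and is longer than min_chars characters."""
    return bool(STARTS_WITH_CAPITAL.match(text)) and len(text) > min_chars


print(is_valid_text('A' + 'b' * 1000))   # True: 1001 characters
print(is_valid_text('A' * 1000))         # False: exactly 1000 characters is not enough
print(is_valid_text('a' + 'b' * 2000))   # False: starts with a lowercase letter
```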


In [14]:
data['Text Length'].describe()

count      1811.000000
mean      22214.085036
std       10103.870359
min        2033.000000
25%       15473.000000
50%       20505.000000
75%       26867.500000
max      107028.000000
Name: Text Length, dtype: float64

## Text Processing

- Remove urls, figure texts, and footnote texts
- Convert text to lowercase
- Remove stopwords
- Lemmatize full texts and abstracts
- Apply stemming to keywords

In [15]:
import re

def clean_text(text):
    '''
    Strip URLs, DOI strings, figure labels, and footer text,
    then convert the text to lowercase
    '''
    
    patterns = [r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',    # remove urls
                r'\s*\d+www\.\w+\d*\.\w+/\w+',                          # remove www addresses preceded by digits
                r'DOI: \d+\.\d+/\w+\.\d+\.\d+\.\d+\s*',                 # remove DOI texts
                r'Fig\.\s\d+:\s*',                                      # remove Fig labels (literal dot)
                r'Journal of Computer Science \d+,\s\d+\s\(\d+\):\s\d+\.\d+\s\d+\s*'] # remove footer text
    
    text_temp = text
    for pattern in patterns:
        text_temp = re.sub(pattern, '', text_temp, flags=re.MULTILINE)
    
    # convert text to lower case
    return text_temp.lower()
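
A quick way to sanity-check the patterns is to run them on a short made-up string. The sketch below uses only two of the patterns (the URL pattern and the figure-label pattern, with the dot escaped), so it is a subset of `clean_text`, not a replacement; the function name `clean_sample` and the sample sentence are hypothetical.

```python
import re

# subset of the cleaning patterns: URLs and figure labels only
SAMPLE_PATTERNS = [
    r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',   # urls
    r'Fig\.\s\d+:\s*',                                     # figure labels such as 'Fig. 2: '
]


def clean_sample(text):
    """Apply the sample patterns in order, then lowercase the result."""
    for pattern in SAMPLE_PATTERNS:
        text = re.sub(pattern, '', text)
    return text.lower()


print(repr(clean_sample('Fig. 2: Model Overview, see https://example.org/a/b.pdf')))
# 'model overview, see '
```

The trailing space in the result shows that removed fragments leave their surrounding whitespace behind, which is why extra spaces are collapsed in a later step.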

In [16]:
import nltk

# get unique English stopwords
# (requires the NLTK stopwords corpus: nltk.download('stopwords'))
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(text):
    '''Remove stopwords from text (expects lowercased text)'''
    return ' '.join([word for word in text.split(' ') if word not in stopwords])
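
Because the text is split on single spaces and compared against lowercase stopwords, a stopword is only removed when it appears in lowercase with no punctuation attached. The sketch below illustrates this with a small hypothetical stopword set (`STOPWORDS`) instead of the NLTK list, so it runs without any downloads.

```python
# hypothetical stand-in for the NLTK English stopword list
STOPWORDS = {'the', 'of', 'is', 'a', 'and'}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop space-separated tokens that exactly match a stopword."""
    return ' '.join(word for word in text.split(' ') if word not in stopwords)


print(remove_stopwords('the growth of data is rapid'))
# growth data rapid
print(remove_stopwords('The growth of data, and the, rest'))
# The growth data, the, rest
```

In the second call, `The` and `the,` survive because of the capital letter and the attached comma. This is why the notebook lowercases the text before removing stopwords.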

In [17]:
regex = re.compile(r'[^\w\s]+')

def clean_text_2(text):
    '''Collapse repeated whitespace and remove non-word characters'''
    
    # remove any extra spaces
    text_temp = re.sub(r'\s{2,}', ' ', text, flags=re.MULTILINE)
    
    # remove any non-word characters, sentence by sentence
    sentences = []
    for sent in text_temp.split('. '):
        sentences.append(' '.join([regex.sub('', word) for word in sent.split(' ')]))
    
    return '. '.join(sentences)
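
When a token consists only of punctuation, such as a standalone `(` produced by tokenization, removing its non-word characters leaves an empty string, and joining with spaces then yields doubled spaces (these are visible in the cleaned outputs later in the notebook). The following sketch is a variant, not the notebook's function, that drops such empty tokens; the name `strip_non_word` is hypothetical.

```python
import re

NON_WORD = re.compile(r'[^\w\s]+')


def strip_non_word(text):
    """Remove non-word characters per word and drop tokens that become empty."""
    sentences = []
    for sent in text.split('. '):
        # split() with no argument also collapses repeated whitespace
        words = (NON_WORD.sub('', word) for word in sent.split())
        sentences.append(' '.join(word for word in words if word))
    return '. '.join(sentences)


print(strip_non_word('data ( nlp ) grows. it scale , too'))
# data nlp grows. it scale too
```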

In [18]:
import spacy

# Load spaCy's small English model; disable the parser and NER components, which lemmatization does not need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_text(text):
    '''Lemmatize text sentence by sentence, splitting sentences on ". "'''
    # first split text into sentences based on the period
    sentences = text.split('. ')
    lemmas = []   # a list of lemmatized sentences
    
    for sent in sentences:
        # process the sentence with the loaded model object `nlp`
        doc = nlp(sent)
        
        # extract the lemma for each token and join
        lemmas.append(" ".join([token.lemma_ for token in doc]))
    
    # return the lemmatized text
    return ". ".join(lemmas)

### Article's Full Text

In [19]:
# remove urls, DOI strings, figure labels, and footer text; convert to lower case
data['Text_Cleaned'] = data.Text.apply(clean_text)

# remove stopwords from text
data['Text_Cleaned'] = data['Text_Cleaned'].apply(remove_stopwords)

In [20]:
# perform lemmatization on text
data['Text_Cleaned'] = data['Text_Cleaned'].apply(lemmatize_text)

In [21]:
# remove non-word characters
data['Text_Cleaned'] = data['Text_Cleaned'].apply(clean_text_2)

# look at the first article's cleaned text
data['Text_Cleaned'].values[0]

'rapid development information technology  eg.  internet  social medium  online database  etc .   amount datum generate exponentially exacerbate recent year. vast accumulation datum provide essential support training machine learning model easy access search engine query. hand  massive flourish information  extract knowledge interest datum become matter general concern  xu et al .  2019 . accord study domo  a cloud  base business service system   roughly 25 quintilian byte datum produce daily 90  datum world create last two year  accord 2018 study   al helal mouhoub  2018 . feasible person sieve useful information vast amount datum manually. moreover  national science foundation scale scientific datum management data  intensive challenge area future study  karami et al .  2018 . crucial precisely efficiently estimate numerical characteristic  determine appropriate statistical distribution model text corpora  jiang et al .  2017 . topic model probabilistic approach observe instrument me

### Abstracts

In [22]:
# convert text to lower case
data['Abstract_Cleaned'] = data['Abstract'].map(lambda text: text.lower())

# remove stopwords
data['Abstract_Cleaned'] = data['Abstract_Cleaned'].apply(remove_stopwords)

In [23]:
# perform lemmatization
data['Abstract_Cleaned'] = data['Abstract_Cleaned'].apply(lemmatize_text)

In [24]:
# remove any non-word characters
data['Abstract_Cleaned'] = data['Abstract_Cleaned'].apply(clean_text_2)

# look at the first abstract
data['Abstract_Cleaned'].values[0]

'due enormous growth information technology  digitize text datum immensely generate. therefore  identify main topic vast collection document human merely impossible. topic model statistical framework infer latent underlie topic text document  corpus  electronic archive probabilistic approach. promise field natural language processing  nlp . though many researcher research field  significant research do bangla. literature review paper  follow systematic approach review topic model study publish 2003 2020. analyze topic modeling method different aspect identify research gap topic model english bangla language. analyze paper  identify several type topic modeling technique  latent dirichlet allocation  lda   latent semantic analysis  lsa   support vector machine  svm   bi  term topic modeling  btm . furthermore  review paper also highlight real  world application topic model. several evaluation method use evaluate model  performance  discuss study. conclude mention huge future research sco

### Keywords

In [25]:
from nltk.stem.snowball import SnowballStemmer

# create a Snowball Stemmer object for stemming
stemmer = SnowballStemmer("english")

def stem_text(text):
    '''
    Apply stemming to each space-separated word of a keyword string.

    Note: the text is split on spaces only, so a word followed by a
    comma (e.g. 'avoidance,') reaches the stemmer with the comma attached.
    '''
    return ' '.join([stemmer.stem(word) for word in text.split(' ')])

In [26]:
# convert text to lower case
data['Keywords_Cleaned'] = data['Keywords'].map(lambda text: text.lower())

# remove dash from text
data['Keywords_Cleaned'] = data['Keywords_Cleaned'].map(lambda text: re.sub(r'-', ' ', text))

# apply stemming
data['Keywords_Cleaned'] = data['Keywords_Cleaned'].apply(stem_text)

# look at the cleaned keywords of the first five articles
data['Keywords_Cleaned'].head()

0    english bangla comparison, latent dirichlet al...
1    arab dataset, arab benchmark, arab recognition...
2    air traffic control, collis avoidance, conflic...
3    strawberry, cherri fruit, accuracy, mobilenet,...
4    content filtering, spam detection, multimod da...
Name: Keywords_Cleaned, dtype: object
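
In the output above, words that end a keyword (`avoidance,`, `accuracy,`, `strawberry,`) appear unchanged while other words are stemmed (`collis`, `cherri`), which suggests that words carrying a trailing comma are not being stemmed. One possible variant, sketched below, splits on commas first and stems the words of each keyword separately. The name `stem_keywords` and the `toy_stem` stand-in are hypothetical; with the notebook's stemmer one would pass `stemmer.stem` as the `stem` argument, and the resulting stems would differ from the recorded output.

```python
def stem_keywords(text, stem):
    """Stem every word of every comma-separated keyword using the callable `stem`."""
    keywords = [kw.strip() for kw in text.split(',')]
    stemmed = [' '.join(stem(word) for word in kw.split()) for kw in keywords if kw]
    return ', '.join(stemmed)


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith('s') else word


print(stem_keywords('neural networks, support vector machines', toy_stem))
# neural network, support vector machine
```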

## Save Data

In [27]:
data.to_csv('data/data_cleaned.csv', index=False)