# Data Cleaning

Clean the articles' full texts, abstracts, and keywords.

In [1]:
import pandas as pd

# load data
df = pd.read_csv('data/article_fulltext_preprocessed.csv')
df.head()

| Unnamed: 0 | Title | Abstract | Keywords | URL | Text | Year | Volume# | Issue# |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Predictive Modeling Applied to Structured Clin... | Predictive analysis is one of current importan... | Electronic Health Record, FI Nnish Diabetes R ... | https://thescipub.com/pdf/jcssp.2021.762.775.pdf | Electronic Health Record (EHR) is the set of c... | 2021 | 17 | 9 |
| 1 | Predicting Risk of Diabetes using a Model base... | Diabetes (diabetes mellitus) is a disease emer... | Diabetes Risk Prediction, FI Nnish Diabetes R ... | https://thescipub.com/pdf/jcssp.2021.748.761.pdf | The diseases prevention is one of the topic of... | 2021 | 17 | 9 |
| 2 | Impact and Control of Drug Therapy Guidelines ... | Since December 2019, many unexplained viral pn... | COVID-19, Cancer, Pneumonia and Healthcare | https://thescipub.com/pdf/jcssp.2021.738.747.pdf | A. The Possible Impact of NCP Epidemic on Canc... | 2021 | 17 | 8 |
| 3 | Structural Equation Model (SEM) for Evaluating... | Information Communication Technology for Devel... | ICT4D, Success Factors, Structural Equation Mo... | https://thescipub.com/pdf/jcssp.2021.724.737.pdf | Information Communication Technology for Devel... | 2021 | 17 | 8 |
| 4 | Extended Fuzzy Decision Support Model for Crop... | Food crops are the preferred crops to be culti... | Fuzzy Logic, Decision Support Model, Euclidean... | https://thescipub.com/pdf/jcssp.2021.709.723.pdf | Food crop productivity is determined by the qu... | 2021 | 17 | 8 |


## Derived Features

In [2]:
# title length in characters
df['Title Length'] = df.Title.str.len()

# abstract length in characters
df['Abstract Length'] = df.Abstract.str.len()

# number of comma-separated keywords per article
df['Number of Keywords'] = df.Keywords.apply(lambda keywords: len(keywords.split(',')))
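
The keyword count above assumes every `Keywords` cell is a string. Whether this dataset contains empty cells is not shown here, so the following is only a defensive sketch: `count_keywords` is a hypothetical helper, not part of the notebook, that returns 0 for non-string values (such as the NaN pandas produces for an empty CSV cell) and ignores empty entries.

```python
def count_keywords(value):
    """Count non-empty, comma-separated keywords; non-string input counts as 0."""
    if not isinstance(value, str):
        return 0  # e.g. NaN read from an empty CSV cell
    # ignore empty entries such as the one after a trailing comma
    return sum(1 for keyword in value.split(',') if keyword.strip())

print(count_keywords('COVID-19, Cancer, Pneumonia and Healthcare'))  # 3
print(count_keywords(float('nan')))                                  # 0
```

It could be used in place of the lambda, for example `df.Keywords.apply(count_keywords)`.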

## Text Processing

- Remove URLs, DOI strings, figure labels, and footnote text
- Convert text to lowercase
- Remove stopwords
- Lemmatize the article texts and abstracts; stem the keywords
- Remove non-word characters

In [3]:
import re

def clean_text(text):
    '''
    Strip URLs, DOI strings, figure labels, and footnote text, then lowercase
    '''

    patterns = [r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',    # URLs with a scheme
                r'\s*\d+www\.\w+\d*\.\w+/\w+',                          # digit-prefixed www links
                r'DOI: \d+\.\d+/\w+\.\d+\.\d+\.\d+\s*',                 # DOI texts
                r'Fig\.\s\d+:\s*',                                      # Fig labels (dot escaped to match a literal '.')
                r'Journal of Computer Science \d+,\s\d+\s\(\d+\):\s\d+\.\d+\s\d+\s*'] # footnote text

    text_temp = text
    for pattern in patterns:
        text_temp = re.sub(pattern, '', text_temp)

    # convert text to lower case
    return text_temp.lower()
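
An unescaped `.` in a regular expression matches any character, which matters for a pattern meant to match the literal label `Fig.`. The comparison below is a small illustration on a made-up sample string (not taken from the dataset); the names `loose` and `strict` are only for this sketch.

```python
import re

sample = 'Figs 3: overview. Fig. 3: Model architecture'

# '.' unescaped: also matches 'Figs 3: ' because '.' accepts the 's'
loose = re.sub(r'Fig.\s\d+:\s*', '', sample)

# '\.' escaped: matches only the literal 'Fig. 3: '
strict = re.sub(r'Fig\.\s\d+:\s*', '', sample)

print(loose)   # overview. Model architecture
print(strict)  # Figs 3: overview. Model architecture
```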

In [4]:
import nltk

# get unique English stopwords (requires a one-time nltk.download('stopwords'))
stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(text):
    '''Remove stopwords from space-separated text'''
    return ' '.join(word for word in text.split(' ') if word not in stopwords)
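
Because punctuation is still present at this stage, a token such as `too.` or `(the` is compared with its punctuation attached and so does not match the stopword list. The sketch below is one possible variant, not the notebook's method: it compares the token with edge punctuation stripped and, when the core is a stopword, keeps only the punctuation so that sentence-ending periods survive. `STOPWORDS` is a tiny stand-in for the NLTK list and `remove_stopwords_keep_punct` is a hypothetical name.

```python
import re

# Tiny stand-in for the NLTK list, used only to keep this sketch self-contained
STOPWORDS = {'the', 'is', 'too'}

def remove_stopwords_keep_punct(text, stopwords=STOPWORDS):
    """Drop stopwords even when punctuation is attached, keeping the punctuation."""
    kept = []
    for token in text.split():  # split() also collapses runs of whitespace
        # split the token into leading punctuation, core, trailing punctuation
        lead, core, trail = re.fullmatch(r'(\W*)(.*?)(\W*)', token).groups()
        if core in stopwords:
            token = lead + trail  # keep only the punctuation, if any
        if token:
            kept.append(token)
    return ' '.join(kept)

print(remove_stopwords_keep_punct('the record is useful, too. the end'))
# record useful, . end
```

The leftover punctuation tokens are harmless if a later step removes non-word characters, as `clean_text_2` does.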

In [5]:
regex = re.compile(r'[^\w\s]+')

def clean_text_2(text):
    '''Remove non-word characters'''
    
    # collapse runs of whitespace into a single space
    text_temp = re.sub(r'\s{2,}', ' ', text)
    
    # remove any non-word characters, sentence by sentence
    sentences = []
    for sent in text_temp.split('. '):
        sentences.append(' '.join([regex.sub('', word) for word in sent.split(' ')]))
    
    return '. '.join(sentences)
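
Collapsing whitespace before deleting punctuation means that a punctuation-only token leaves two adjacent spaces behind. A minimal sketch of the reverse order follows, assuming double spaces are not wanted downstream; `strip_non_word` and `NON_WORD` are hypothetical names, and using this variant would change the cleaned output shown in this notebook.

```python
import re

NON_WORD = re.compile(r'[^\w\s]+')

def strip_non_word(text):
    """Remove non-word characters per sentence, then collapse leftover whitespace."""
    sentences = []
    for sent in text.split('. '):
        cleaned = NON_WORD.sub('', sent)
        sentences.append(' '.join(cleaned.split()))  # collapse runs of whitespace
    # drop sentences that became empty after cleaning
    return '. '.join(s for s in sentences if s)

print(strip_non_word('health record ( ehr ) set clinical , health information. datum two state'))
# health record ehr set clinical health information. datum two state
```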

In [6]:
import spacy

# Load the small English model; the parser and NER are disabled because lemmatization does not need them
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_text(text):
    # first split text into sentences based on the period
    sentences = text.split('. ')
    lemmas = []   # a list of lemmatized sentences
    
    for sent in sentences:
        # Parse the text using the loaded 'en' model object `nlp`
        doc = nlp(sent)
        
        # Extract the lemma for each token and join
        lemmas.append(" ".join([token.lemma_ for token in doc]))
    
    # return the lemmatized text
    return ". ".join(lemmas)

### Article's Text

In [7]:
df['Text_Cleaned'] = df.Text.apply(clean_text)

In [8]:
# remove stopwords from text
df['Text_Cleaned'] = df['Text_Cleaned'].apply(remove_stopwords)

In [9]:
# perform lemmatization on text
df['Text_Cleaned'] = df['Text_Cleaned'].apply(lemmatize_text)

In [10]:
# remove non-word characters
df['Text_Cleaned'] = df['Text_Cleaned'].apply(clean_text_2)
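
The four cleaning steps always run in the same order, so they can also be chained into a single callable. This is a sketch only: `make_pipeline` is a hypothetical helper, and the demonstration uses built-in string methods as stand-in steps so that it runs without the notebook's functions.

```python
from functools import reduce

def make_pipeline(*steps):
    """Return a function that applies `steps` left to right to a single string."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

# Stand-in steps for demonstration; the notebook's own functions would go here
pipeline = make_pipeline(str.lower, str.strip)
print(pipeline('  Electronic Health Record  '))  # electronic health record
```

With the functions defined in this notebook, the equivalent call would be `df.Text.apply(make_pipeline(clean_text, remove_stopwords, lemmatize_text, clean_text_2))`.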

In [11]:
df['Text_Cleaned'].values[0]

'electronic health record  ehr  set clinical  health information useful patient treatment  eg.  clinical laboratory report  discharge letter  emergency report produce hospital patient summary  ps  produce general practitioner  gp . datum two different state  validate  eg.  document digitally sign doctor  validate  eg.  health datum pressure  record patient autonomously  scenario  typically talk personal health record  phr . contexts interest point view. patient  often refer concept relate health datum document alessandra pieroni et al.  associate to   1  whole hospitalization.  2  ward  also consider hospitalization outpatient episode within ward itself .  3  gp.  4  whole hospital  consider episode different ward hospitalization too .  5  health datum document patient regardless structure create. contexts  typically electronic patient record  epr   electronic medical record  emr   ehr  phr acronym use different meaning. study  mainly use ehr term. ehrs important contain history patien

### Abstracts



In [12]:
# convert text to lower case
df['Abstract_Cleaned'] = df['Abstract'].map(lambda text: text.lower())

# remove stopwords
df['Abstract_Cleaned'] = df['Abstract_Cleaned'].apply(remove_stopwords)

In [13]:
# perform lemmatization
df['Abstract_Cleaned'] = df['Abstract_Cleaned'].apply(lemmatize_text)

In [14]:
# remove any non-word characters
df['Abstract_Cleaned'] = df['Abstract_Cleaned'].apply(clean_text_2)

In [15]:
# look at the first abstract
df['Abstract_Cleaned'].values[0]

'predictive analysis one current important issue healthcare context. lot patient  input datum obtain electronic health record. research  propose general architecture name health prediction architecture. initially  consider datum refer strongly structured health dataset  no free text . objective relate explore problem prediction context healthcare. particular  consider dataset heterogeneity  accuracy together explain ability  dataset benchmarke. presentation electronic health record useful related standard  propose architecture base two principal module. first module produce feature extraction implement convolutional neural network alternatively multi  layer perceptron. second module produce prediction implement alternatively one graph convolutional network  simplify graph transduction game  near node class graph. define dataset randomly possibility manage datum sufficiently heterogeneous useful benchmarking  without privacy problem too. study  experiment first instantiation architectur

### Keywords

In [23]:
from nltk.stem.snowball import SnowballStemmer

# create a Snowball Stemmer object for stemming
stemmer = SnowballStemmer("english")

def stem_text(text):
    '''Stem each space-separated token of a keyword string'''
    # NOTE: tokens are split on spaces only, so a token that still carries a
    # trailing comma (e.g. 'prediction,') comes back unstemmed
    return ' '.join(stemmer.stem(word) for word in text.split(' '))
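
Splitting on spaces alone leaves the comma attached to the last word of each keyword, so that word reaches the stemmer with its punctuation (see `prediction,` and `factors,` in the output further down). A minimal sketch that splits on commas first is given below. `stem_keywords` is a hypothetical helper that accepts any stemming callable, and `toy_stem` is a deliberately crude stand-in so the example runs without NLTK; in the notebook, `stemmer.stem` would be passed instead, which would change the saved keyword strings.

```python
def stem_keywords(text, stem):
    """Stem every word of every comma-separated keyword.

    `stem` is any callable mapping a word to its stem, e.g. `stemmer.stem`.
    """
    stemmed = []
    for keyword in text.split(','):
        words = keyword.split()  # also trims spaces around the keyword
        if words:                # skip empty entries such as a trailing comma
            stemmed.append(' '.join(stem(word) for word in words))
    return ', '.join(stemmed)

# Toy stemmer for demonstration only: drops a trailing 's'
def toy_stem(word):
    return word[:-1] if word.endswith('s') else word

print(stem_keywords('risk factors, decision trees', toy_stem))
# risk factor, decision tree
```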

In [32]:
# convert text to lower case
df['Keywords_Cleaned'] = df['Keywords'].map(lambda text: text.lower())

# replace hyphens with spaces
df['Keywords_Cleaned'] = df['Keywords_Cleaned'].map(lambda text: re.sub(r'-', ' ', text))

# apply stemming
df['Keywords_Cleaned'] = df['Keywords_Cleaned'].apply(stem_text)

In [33]:
df['Keywords_Cleaned'].head()

0    electron health record, fi nnish diabet r isk ...
1    diabet risk prediction, fi nnish diabet r isk ...
2            covid 19, cancer, pneumonia and healthcar
3    ict4d, success factors, structur equat model a...
4    fuzzi logic, decis support model, euclidean di...
Name: Keywords_Cleaned, dtype: object

## Save Data

In [34]:
df.to_csv('data/data_cleaned.csv', index=False)