# Data Pre-Processing

Clean the consolidated dataset in five steps:
<ol>
    <li>Convert all text to lower case</li>
    <li>Remove line-break characters and punctuation</li>
    <li>Lemmatize the text</li>
    <li>Remove stopwords</li>
    <li>Encode the categories</li>
</ol>

The output is a cleaned dataset, saved to `Pickles/all_articles_processed.pickle`.

In [1]:
import pandas as pd
import pickle

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('stopwords')  # needed for stopwords.words('english') in step 4

%matplotlib inline

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\darry\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
path_df = "./Pickles/all_articles_raw.pickle"

with open(path_df, 'rb') as data:
    articles = pickle.load(data)

## 1. Convert All Text to Lower Case

In [3]:
articles['article']=articles['article'].str.lower()

In [4]:
articles['article'].iloc[0]

"singapore - a man's body was found on the ground floor rubbish chute area at block 677 woodlands avenue 6 on monday afternoon (aug 12).\n\nthe police said they were alerted to a case of unnatural death at 12.05pm.\n\na 64-year-old man was found motionless and was pronounced dead by paramedics at the scene, the police said.\n\nphotographer vivian low said she noticed the cordoned area when she walked past the block.\n\nms low, 29, added that she heard the housing estate's cleaning crew found the man when they opened the door in the morning.\n\nthe police are investigating the incident."

## 2. Remove Line-Break Characters and Punctuation

In [5]:
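# Replace newline characters with spaces (literal match, no regex needed)
articles['article']=articles['article'].str.replace("\n", " ", regex=False)
# Strip punctuation; regex=True is explicit because pandas 2.0+ treats the pattern as a literal string by default
articles['article']=articles['article'].str.replace(r'[^\w\s]+', '', regex=True)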

In [6]:
articles['article'].iloc[0]

'singapore  a mans body was found on the ground floor rubbish chute area at block 677 woodlands avenue 6 on monday afternoon aug 12  the police said they were alerted to a case of unnatural death at 1205pm  a 64yearold man was found motionless and was pronounced dead by paramedics at the scene the police said  photographer vivian low said she noticed the cordoned area when she walked past the block  ms low 29 added that she heard the housing estates cleaning crew found the man when they opened the door in the morning  the police are investigating the incident'
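The cleaned text above still contains double spaces where `\n\n` used to be, and hyphenated terms are fused into one token (`64yearold`). The following is a minimal sketch of one possible variant, using only the standard library `re` module. The helper name `clean_text` and the choice to turn hyphens into spaces are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper: strip punctuation and normalise whitespace."""
    text = text.replace("-", " ")         # keep the parts of hyphenated words as separate tokens
    text = re.sub(r"[^\w\s]+", "", text)  # drop remaining punctuation (same pattern as the cell above)
    text = re.sub(r"\s+", " ", text)      # collapse newlines and repeated spaces into one space
    return text.strip()

sample = "singapore - a man's body was found.\n\na 64-year-old man was found motionless."
print(clean_text(sample))
```

On a pandas column this could be applied with `articles['article'].apply(clean_text)`. Whether hyphens should be split or dropped depends on how the downstream features are built.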

## 3. Lemmatization of text

In [7]:
lemmatizer = WordNetLemmatizer() 

In [8]:
def lemmatize_text(raw_text):
    raw_text_words = raw_text.split(" ")
    
    lemmatized_text_list = []
    
    for word in raw_text_words:
        # pos="v" treats every token as a verb, so some nouns change too (e.g. "ground" becomes "grind")
        lemmatized_text_list.append(lemmatizer.lemmatize(word, pos="v"))
        
    lemmatized_text = " ".join(lemmatized_text_list)
    
    return lemmatized_text

In [9]:
articles['article']=articles['article'].apply(lemmatize_text)

In [10]:
articles['article'].iloc[0]

'singapore  a man body be find on the grind floor rubbish chute area at block 677 woodlands avenue 6 on monday afternoon aug 12  the police say they be alert to a case of unnatural death at 1205pm  a 64yearold man be find motionless and be pronounce dead by paramedics at the scene the police say  photographer vivian low say she notice the cordoned area when she walk past the block  ms low 29 add that she hear the house estates clean crew find the man when they open the door in the morning  the police be investigate the incident'

## 4. Remove Stopwords

In [11]:
stop_words = set(stopwords.words('english'))  # a set gives constant-time membership tests

In [12]:
def remove_stopwords(raw_text):
    
    raw_text_words = raw_text.split(" ")
    
    stopwords_removed_text_list = []
    
    for word in raw_text_words:
        if word.lower() not in stop_words:
            stopwords_removed_text_list.append(word)
        
    stopwords_removed_text = " ".join(stopwords_removed_text_list)
    
    return stopwords_removed_text

In [13]:
articles['article']=articles['article'].apply(remove_stopwords)

In [14]:
articles['article'].iloc[0]

'singapore  man body find grind floor rubbish chute area block 677 woodlands avenue 6 monday afternoon aug 12  police say alert case unnatural death 1205pm  64yearold man find motionless pronounce dead paramedics scene police say  photographer vivian low say notice cordoned area walk past block  ms low 29 add hear house estates clean crew find man open door morning  police investigate incident'
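The output above keeps the double spaces from the earlier steps, because `split(" ")` yields empty-string tokens that are not stopwords and are joined back in. Below is a minimal sketch of a variant that drops them. The name `remove_stopwords_compact` and the small `TOY_STOP_WORDS` set are hypothetical stand-ins so the example runs without NLTK data. In the notebook the stopword list would come from `stopwords.words('english')`.

```python
# Stand-in stopword set for illustration only (an assumption, not the NLTK list).
TOY_STOP_WORDS = frozenset({"a", "the", "was", "on", "at", "and", "by"})

def remove_stopwords_compact(raw_text, stop_words=TOY_STOP_WORDS):
    """Hypothetical variant of remove_stopwords that also drops empty tokens."""
    # split() with no argument splits on any run of whitespace,
    # so double spaces never produce empty-string tokens.
    kept = [word for word in raw_text.split() if word not in stop_words]
    return " ".join(kept)

text = "singapore  a man body was found  the police say"
print(remove_stopwords_compact(text))
```

Passing a `set` or `frozenset` rather than a list keeps each membership test constant-time, which matters when the function is applied to every word of every article.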

## 5. Category Encoding

In [15]:
articles['category'].unique()

array(['Singapore', 'Sports', 'Lifestyle', 'World', 'Business',
       'Technology'], dtype=object)

In [18]:
category_mapping = {
    'Singapore': 1,
    'Sports': 2,
    'Lifestyle': 3,
    'World': 4,
    'Business': 5,
    'Technology': 6
}
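Hand-typed mappings make it easy to reuse a code by accident. As a minimal sketch, the mapping could instead be generated from the category list and checked for uniqueness. The names `categories` and `generated_mapping` are assumptions for illustration. The category order follows the `unique()` output above.

```python
categories = ['Singapore', 'Sports', 'Lifestyle', 'World', 'Business', 'Technology']

# Assign consecutive integer codes starting at 1, in the order listed.
generated_mapping = {name: code for code, name in enumerate(categories, start=1)}

# Guard: every category must receive its own distinct code.
assert len(set(generated_mapping.values())) == len(generated_mapping)

print(generated_mapping)
```

The same uniqueness check can be run against a hand-written dictionary before it is used for encoding.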

In [23]:
articles['category_code']=articles['category']
processed_articles = articles.replace({'category_code':category_mapping})
processed_articles.head()

| | source | title | article | category | category_code |
|---|---|---|---|---|---|
| 0 | The Straits Times | Body found in garbage chute area of Woodlands ... | singapore man body find grind floor rubbish c... | Singapore | 1 |
| 1 | The Straits Times | Formula One: Thai Alexander Albon given chance... | london afp thai formula one driver alexander ... | Sports | 2 |
| 2 | The Straits Times | The Straits Times bags 8 wins at Asian Digital... | singapore straits time bag eight award 8th as... | Singapore | 1 |
| 3 | The Straits Times | Games | ready challenge try daily sudoku crossword puz... | Lifestyle | 3 |
| 4 | The Straits Times | Hong Kong cancels all remaining Monday flights... | hong kong bloomberg hong kong airport authori... | World | 4 |


In [24]:
# Export the processed DataFrame as a serialized (pickle) object
with open('Pickles/all_articles_processed.pickle', 'wb') as output:
    pickle.dump(processed_articles, output)
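The export cell assumes the `Pickles` folder already exists. A minimal sketch of a save and reload round trip that creates the folder when needed is shown below. The helpers `save_pickle` and `load_pickle` and the stand-in `records` list are hypothetical. In the notebook the object would be `processed_articles`.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Hypothetical helper: pickle obj to path, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as output:
        pickle.dump(obj, output)

def load_pickle(path):
    """Hypothetical helper: load a pickled object from path."""
    with Path(path).open("rb") as data:
        return pickle.load(data)

# Stand-in records for illustration only.
records = [{"category": "Singapore", "category_code": 1}]

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Pickles" / "example.pickle"
    save_pickle(records, target)
    restored = load_pickle(target)

print(restored == records)
```

Reloading immediately after writing is a cheap check that the file is complete before the next notebook depends on it.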