# Data Pre-Processing

Clean the consolidated dataset by applying the following steps in order:
<ol>
    <li>Convert all text to lowercase</li>
    <li>Remove line-break characters and punctuation</li>
    <li>Remove numbers</li>
    <li>Lemmatize the text</li>
    <li>Remove stopwords</li>
    <li>Encode categories as integers</li>
</ol>

The output is a cleaned dataset, saved as `Pickles/all_articles_processed.pickle`.

In [1]:
import pandas as pd
import pickle
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\darry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
path_df = "./Pickles/sampled_articles_raw.pickle"

with open(path_df, 'rb') as data:
    articles = pickle.load(data)

## 1. Convert all text to lowercase

In [3]:
articles['article']=articles['article'].str.lower()

In [4]:
articles['article'].iloc[0]

'\nsingapore - from the complicated life story of mexican painter frida kahlo to an exhibition on effigies and the mythic female, the third edition of annual storytelling festival storyfest aims to disrupt and deconstruct the notion of a single narrative.\nthe festival will run at the arts house from june 21 to 24 with 22 programmes.\nthese include the asian premieres of american storyteller david novak\'s rendition of the ancient sumerian epic, gilgamesh, and reflecting fridas, in which brazilian storyteller ana maria lines weaves stories from kahlo\'s life with her own.\n\n\n\n\nlines, 55, has been fascinated by kahlo\'s ability to turn adversity into art since she was a child and was shown a picture of a kahlo painting in class. to create the show, she went to mexico to visit the places where kahlo lived and talk to artists and locals about her.\n"the process of creating a show inspired by frida kahlo helped me to have another understanding about death and how important it is to rem

## 2. Remove line-break characters and punctuation

In [5]:
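# Replace newlines with spaces, then drop everything that is not a word character or whitespace.
# regex is set explicitly because the default of str.replace differs between pandas versions.
articles['article'] = articles['article'].str.replace("\n", " ", regex=False)
articles['article'] = articles['article'].str.replace(r'[^\w\s]+', '', regex=True)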

In [6]:
articles['article'].iloc[0]

' singapore  from the complicated life story of mexican painter frida kahlo to an exhibition on effigies and the mythic female the third edition of annual storytelling festival storyfest aims to disrupt and deconstruct the notion of a single narrative the festival will run at the arts house from june 21 to 24 with 22 programmes these include the asian premieres of american storyteller david novaks rendition of the ancient sumerian epic gilgamesh and reflecting fridas in which brazilian storyteller ana maria lines weaves stories from kahlos life with her own     lines 55 has been fascinated by kahlos ability to turn adversity into art since she was a child and was shown a picture of a kahlo painting in class to create the show she went to mexico to visit the places where kahlo lived and talk to artists and locals about her the process of creating a show inspired by frida kahlo helped me to have another understanding about death and how important it is to remember and celebrate our ances
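
To make the effect of the two replacements concrete, here is a minimal sketch on a made-up string. It uses Python's `re` module instead of the pandas `.str` accessor, and both the sample text and the helper name `strip_breaks_and_punctuation` are illustrative assumptions, not part of the notebook.

```python
import re

def strip_breaks_and_punctuation(text):
    """Replace newlines with spaces, then delete every character that is
    neither a word character (letter, digit, underscore) nor whitespace."""
    text = text.replace("\n", " ")
    return re.sub(r"[^\w\s]+", "", text)

# Hypothetical sample, loosely modelled on the article shown above.
sample = "\nsingapore - david novak's epic, gilgamesh.\n"
cleaned = strip_breaks_and_punctuation(sample)
print(repr(cleaned))
```

Two side effects are visible in the result: deleting the hyphen leaves a double space behind, and deleting the apostrophe fuses "novak's" into "novaks", which matches the "david novaks" seen in the notebook output.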

## 3. Remove Numbers

In [7]:
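# Raw string avoids an invalid-escape warning; regex=True is required for pattern matching.
articles['article'] = articles['article'].str.replace(r'\d+', '', regex=True)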

In [8]:
articles['article'].iloc[0]

' singapore  from the complicated life story of mexican painter frida kahlo to an exhibition on effigies and the mythic female the third edition of annual storytelling festival storyfest aims to disrupt and deconstruct the notion of a single narrative the festival will run at the arts house from june  to  with  programmes these include the asian premieres of american storyteller david novaks rendition of the ancient sumerian epic gilgamesh and reflecting fridas in which brazilian storyteller ana maria lines weaves stories from kahlos life with her own     lines  has been fascinated by kahlos ability to turn adversity into art since she was a child and was shown a picture of a kahlo painting in class to create the show she went to mexico to visit the places where kahlo lived and talk to artists and locals about her the process of creating a show inspired by frida kahlo helped me to have another understanding about death and how important it is to remember and celebrate our ancestors she
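
Removing digits leaves gaps such as "june  to  with  programmes" in the output above. The sketch below reproduces this on a short string and shows one possible follow-up, collapsing runs of whitespace. The whitespace step and both helper names are assumptions added for illustration; the notebook itself does not normalize whitespace.

```python
import re

def remove_digits(text):
    """Delete every run of digits."""
    return re.sub(r"\d+", "", text)

def collapse_whitespace(text):
    """Optional extra step (not in the notebook): squeeze each run of
    whitespace into a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "from june 21 to 24 with 22 programmes"
no_digits = remove_digits(sample)
collapsed = collapse_whitespace(no_digits)
print(repr(no_digits))
print(repr(collapsed))
```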

## 4. Lemmatization of text

In [9]:
lemmatizer = WordNetLemmatizer() 

In [10]:
def lemmatize_text(raw_text):
    """Lemmatize each space-separated token, treating every token as a verb (pos="v")."""
    raw_text_words = raw_text.split(" ")

    # Tokens that are not verbs are generally returned unchanged by the lemmatizer.
    lemmatized_text_list = [lemmatizer.lemmatize(word, pos="v") for word in raw_text_words]

    return " ".join(lemmatized_text_list)

In [11]:
articles['article']=articles['article'].apply(lemmatize_text)

In [12]:
articles['article'].iloc[0]

' singapore  from the complicate life story of mexican painter frida kahlo to an exhibition on effigies and the mythic female the third edition of annual storytelling festival storyfest aim to disrupt and deconstruct the notion of a single narrative the festival will run at the arts house from june  to  with  program these include the asian premier of american storyteller david novaks rendition of the ancient sumerian epic gilgamesh and reflect fridas in which brazilian storyteller ana maria line weave stories from kahlos life with her own     line  have be fascinate by kahlos ability to turn adversity into art since she be a child and be show a picture of a kahlo paint in class to create the show she go to mexico to visit the place where kahlo live and talk to artists and locals about her the process of create a show inspire by frida kahlo help me to have another understand about death and how important it be to remember and celebrate our ancestors she say in an email message the fest

## 5. Remove Stopwords

In [13]:
stop_words = set(stopwords.words('english'))  # a set gives constant-time membership tests

In [14]:
def remove_stopwords(raw_text):
    """Drop every space-separated token that appears in the stopword collection."""
    raw_text_words = raw_text.split(" ")

    stopwords_removed_text_list = [
        word for word in raw_text_words if word.lower() not in stop_words
    ]

    return " ".join(stopwords_removed_text_list)

In [15]:
articles['article']=articles['article'].apply(remove_stopwords)

In [16]:
articles['article'].iloc[0]

' singapore  complicate life story mexican painter frida kahlo exhibition effigies mythic female third edition annual storytelling festival storyfest aim disrupt deconstruct notion single narrative festival run arts house june    program include asian premier american storyteller david novaks rendition ancient sumerian epic gilgamesh reflect fridas brazilian storyteller ana maria line weave stories kahlos life     line  fascinate kahlos ability turn adversity art since child show picture kahlo paint class create show go mexico visit place kahlo live talk artists locals process create show inspire frida kahlo help another understand death important remember celebrate ancestors say email message festivals creative producer kamini ramachandran  say audience grow   begin  last year  attend ticket program      observe increasingly audiences include parent children also professionals keen hone communication skills say somebody tell epic  minutes skills apply pitch market find spend lot time 
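
The output above still contains long runs of spaces, because `split(" ")` yields an empty token for every repeated space and those tokens are joined back in. A minimal variant is sketched below, assuming a tiny hand-written stopword set (`toy_stop_words`, hypothetical) in place of the NLTK list; `remove_stopwords_sketch` is likewise an illustrative name.

```python
# Hypothetical, tiny stopword set for illustration only; the notebook uses
# the full NLTK English stopword list.
toy_stop_words = {"from", "the", "of", "to", "an", "and", "a"}

def remove_stopwords_sketch(text, stop_words):
    """Keep only tokens that are not stopwords. Calling split() with no
    argument also discards the empty tokens caused by repeated spaces."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

sample = " singapore  from the complicate life story of mexican painter"
filtered = remove_stopwords_sketch(sample, toy_stop_words)
print(repr(filtered))
```

Using `split()` changes the output relative to the notebook (single spaces, no leading blank), so it is a design choice rather than a drop-in replacement.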

## 6. Category Encoding

In [17]:
articles['category'].unique()

array(['Lifestyle', 'World', 'Technology', 'Business', 'Singapore',
       'Sports'], dtype=object)

In [18]:
category_mapping = {
    'Singapore': 1,
    'Sports': 2,
    'Lifestyle': 3,
    'World': 4,
    'Business': 5,
    'Technology': 6
}

In [19]:
# Map each category label to its integer code, then reset the index.
articles['category_code'] = articles['category'].map(category_mapping)
processed_articles = articles.reset_index(drop=True)
processed_articles.head()

|   | source | title | article | category | length_characters | length_words | category_code |
|---|--------|-------|---------|----------|-------------------|--------------|---------------|
| 0 | The Straits Times | Myth, magic and memoirs at storytelling festival | singapore complicate life story mexican pain... | Lifestyle | 3603 | 589 | 3 |
| 1 | AsiaOne | Beyond the ski slopes: Why the wealthy have ey... | hot sunny day im look hokkaidos mighty usu vol... | Lifestyle | 9642 | 1500 | 3 |
| 2 | The Straits Times | Critics gush over the spectacle and story of A... | los angeles unite state reuters film critics... | Lifestyle | 2390 | 374 | 3 |
| 3 | AsiaOne | Spaces we love: Singaporean homes washed in na... | want home bath light adjustments need make get... | Lifestyle | 1645 | 286 | 3 |
| 4 | Channel News Asia |  | amid colossal portraits clean line black white... | Lifestyle | 1985 | 300 | 3 |
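
A label missing from `category_mapping` would not raise an error on its own, so it can help to validate the labels and to keep an inverse lookup for decoding predictions later. The following is a minimal sketch in plain Python; `encode_categories` and `code_to_category` are hypothetical names introduced here, not part of the notebook.

```python
category_mapping = {
    "Singapore": 1, "Sports": 2, "Lifestyle": 3,
    "World": 4, "Business": 5, "Technology": 6,
}

# Inverse lookup (integer code -> category label).
code_to_category = {code: name for name, code in category_mapping.items()}

def encode_categories(labels, mapping):
    """Encode labels as integers, raising if any label has no code."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Unmapped categories: {unknown}")
    return [mapping[label] for label in labels]

labels = ["Lifestyle", "World", "Technology"]
codes = encode_categories(labels, category_mapping)
decoded = [code_to_category[c] for c in codes]
print(codes, decoded)
```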


In [20]:
# Export the processed DataFrame as a serialized (pickle) object
with open('Pickles/all_articles_processed.pickle', 'wb') as output:
    pickle.dump(processed_articles, output)