# Data Pre-Processing

Perform data cleaning on the consolidated dataset, including the following:
<ol>
    <li>Convert all text to lower case</li>
    <li>Remove line-break characters and punctuation</li>
    <li>Remove numbers</li>
    <li>Lemmatize the text</li>
    <li>Remove stopwords</li>
    <li>Encode the categories</li>
</ol>

The output is a cleaned dataset.

In [1]:
import pandas as pd
import pickle
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\darry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
path_df = "./Pickles/sampled_articles_raw.pickle"

with open(path_df, 'rb') as data:
    articles = pickle.load(data)

## 1. Convert All text to Lower Case

In [3]:
articles['article']=articles['article'].str.lower()

In [4]:
articles['article'].iloc[0]

"whether you’re buying an hdb resale flat as a first-timer or just looking to “upgrade” now that your first bto has passed the 5 year mark, choosing a resale home can be a daunting affair.\nunlike the hdb bto route where you basically don’t have a say in anything, now the options are literally endless.\xa0\n\n\nbut that doesn’t mean you shouldn’t spend an entire year searching for your dream home. \n\n\nbecause after your 12th or 13th viewing, everything will look the same to you.\ninstead, you should make a list of as many criteria as possible to narrow down your search. (and no, silly, we don’t mean en bloc potential.) here’s a list of key factors to look at when choosing a resale flat.\n1. price\nthe very first thing to consider when hunting for a resale flat is your budget. if your budget is $400,000, that’s going to impact a lot of other factors: location, flat size, age and so on.\nof course, what counts as “within budget” really varies from person to person. some people want a h

## 2. Remove Line-Break Characters and Punctuation

In [5]:
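# Replace line breaks with spaces (literal match, no regex needed)
articles['article'] = articles['article'].str.replace("\n", " ", regex=False)
# Drop every character that is neither a word character nor whitespace
articles['article'] = articles['article'].str.replace(r'[^\w\s]+', '', regex=True)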

In [6]:
articles['article'].iloc[0]

'whether youre buying an hdb resale flat as a firsttimer or just looking to upgrade now that your first bto has passed the 5 year mark choosing a resale home can be a daunting affair unlike the hdb bto route where you basically dont have a say in anything now the options are literally endless\xa0   but that doesnt mean you shouldnt spend an entire year searching for your dream home    because after your 12th or 13th viewing everything will look the same to you instead you should make a list of as many criteria as possible to narrow down your search and no silly we dont mean en bloc potential heres a list of key factors to look at when choosing a resale flat 1 price the very first thing to consider when hunting for a resale flat is your budget if your budget is 400000 thats going to impact a lot of other factors location flat size age and so on of course what counts as within budget really varies from person to person some people want a home thats roughly the equivalent of the proceeds 

## 3. Remove Numbers

In [7]:
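# Raw string avoids an invalid escape warning; regex=True is required on recent pandas versions
articles['article'] = articles['article'].str.replace(r'\d+', '', regex=True)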

In [8]:
articles['article'].iloc[0]

'whether youre buying an hdb resale flat as a firsttimer or just looking to upgrade now that your first bto has passed the  year mark choosing a resale home can be a daunting affair unlike the hdb bto route where you basically dont have a say in anything now the options are literally endless\xa0   but that doesnt mean you shouldnt spend an entire year searching for your dream home    because after your th or th viewing everything will look the same to you instead you should make a list of as many criteria as possible to narrow down your search and no silly we dont mean en bloc potential heres a list of key factors to look at when choosing a resale flat  price the very first thing to consider when hunting for a resale flat is your budget if your budget is  thats going to impact a lot of other factors location flat size age and so on of course what counts as within budget really varies from person to person some people want a home thats roughly the equivalent of the proceeds from their f
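
Steps 1 to 3 can also be folded into a single function and applied once per article. The sketch below is one possible arrangement, not the notebook's own code: the names `clean_text`, `PUNCTUATION`, `DIGITS`, and `WHITESPACE` are assumptions made for illustration. Unlike the cells above, it also collapses runs of whitespace (including `\xa0`), which removes the gaps visible in the output where digits were deleted.

```python
import re

# Patterns compiled once; all names here are illustrative, not from the notebook.
PUNCTUATION = re.compile(r"[^\w\s]+")  # anything that is not a word character or whitespace
DIGITS = re.compile(r"\d+")            # runs of digits
WHITESPACE = re.compile(r"\s+")        # runs of whitespace, including "\n" and "\xa0"


def clean_text(raw_text: str) -> str:
    """Lower-case, strip punctuation and digits, and collapse whitespace."""
    text = raw_text.lower()
    text = PUNCTUATION.sub("", text)
    text = DIGITS.sub("", text)
    # Collapsing whitespace last also removes the gaps left behind by deleted digits.
    return WHITESPACE.sub(" ", text).strip()


sample = "Your first BTO has passed the 5 year mark.\nThe options are endless.\xa0\n\n"
print(clean_text(sample))
# → your first bto has passed the year mark the options are endless
```

It could then be used as `articles['article'].apply(clean_text)`. Note that ordinals such as `12th` still leave a stray `th` behind, as they do in the output above.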

## 4. Lemmatization of text

In [9]:
lemmatizer = WordNetLemmatizer() 

In [10]:
def lemmatize_text(raw_text):
    """Lemmatize every space-separated word in raw_text, treating each word as a verb."""
    raw_text_words = raw_text.split(" ")

    # pos="v" asks WordNet for the verb lemma (e.g. "buying" -> "buy")
    lemmatized_text_list = [lemmatizer.lemmatize(word, pos="v") for word in raw_text_words]

    return " ".join(lemmatized_text_list)

In [11]:
articles['article']=articles['article'].apply(lemmatize_text)

In [12]:
articles['article'].iloc[0]

'whether youre buy an hdb resale flat as a firsttimer or just look to upgrade now that your first bto have pass the  year mark choose a resale home can be a daunt affair unlike the hdb bto route where you basically dont have a say in anything now the options be literally endless\xa0   but that doesnt mean you shouldnt spend an entire year search for your dream home    because after your th or th view everything will look the same to you instead you should make a list of as many criteria as possible to narrow down your search and no silly we dont mean en bloc potential heres a list of key factor to look at when choose a resale flat  price the very first thing to consider when hunt for a resale flat be your budget if your budget be  thats go to impact a lot of other factor location flat size age and so on of course what count as within budget really vary from person to person some people want a home thats roughly the equivalent of the proceed from their freshlysold bto while others dont 
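
Because `lemmatize_text` splits on a single space, consecutive spaces produce empty tokens that are joined back in, and `\xa0` stays attached to the neighbouring word (see `endless\xa0` in the output above). A minimal sketch of a whitespace-agnostic variant follows. It takes the lemmatizing function as a parameter so that it runs without the WordNet data; `lemmatize_tokens`, `toy_lemmas`, and `toy_lemmatize` are hypothetical names, and the toy dictionary is only a stand-in for a real lemmatizer.

```python
from typing import Callable


def lemmatize_tokens(raw_text: str, lemmatize: Callable[[str], str]) -> str:
    """Apply `lemmatize` to every whitespace-separated token of `raw_text`."""
    # str.split() with no argument splits on any run of whitespace
    # (including "\xa0") and never yields empty tokens.
    return " ".join(lemmatize(word) for word in raw_text.split())


# Hypothetical stand-in for a real lemmatizer, used only to keep the sketch runnable.
toy_lemmas = {"buying": "buy", "looking": "look", "passed": "pass"}


def toy_lemmatize(word: str) -> str:
    return toy_lemmas.get(word, word)


print(lemmatize_tokens("youre buying\xa0   a flat", toy_lemmatize))
# → youre buy a flat
```

With the notebook's lemmatizer, the second argument would be something like `lambda w: lemmatizer.lemmatize(w, pos="v")`.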

## 5. Remove Stopwords

In [13]:
# A set gives constant-time membership checks in remove_stopwords
stop_words = set(stopwords.words('english'))

In [14]:
def remove_stopwords(raw_text):
    """Return raw_text without the words listed in stop_words."""
    raw_text_words = raw_text.split(" ")

    # Keep only the words that are not stopwords
    stopwords_removed_text_list = [word for word in raw_text_words if word.lower() not in stop_words]

    return " ".join(stopwords_removed_text_list)

In [15]:
articles['article']=articles['article'].apply(remove_stopwords)

In [16]:
articles['article'].iloc[0]

'whether youre buy hdb resale flat firsttimer look upgrade first bto pass  year mark choose resale home daunt affair unlike hdb bto route basically dont say anything options literally endless\xa0   doesnt mean shouldnt spend entire year search dream home    th th view everything look instead make list many criteria possible narrow search silly dont mean en bloc potential heres list key factor look choose resale flat  price first thing consider hunt resale flat budget budget  thats go impact lot factor location flat size age course count within budget really vary person person people want home thats roughly equivalent proceed freshlysold bto others dont mind increase home loans\xa0     read also        afford hdb flat youre  single    bear mind though probably dont want overstretch might lot financial commitments kid grow parent grow old come point stick budget sure property agent might upsell luxurious  million dbss unit go pay extra  pocket yeah didnt think think budget also consider 
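
Words such as `youre` and `dont` survive in the output above because punctuation was stripped in step 2, while stop-word lists usually store contractions with their apostrophes. One possible remedy, sketched below under that assumption, is to strip punctuation from the stop words as well before filtering. The list `sample_stop_words` is a small hand-picked stand-in for `stopwords.words('english')`, and `remove_stopwords_normalized` is a hypothetical name.

```python
import re

# Small hand-picked sample standing in for stopwords.words('english').
sample_stop_words = ["the", "a", "to", "don't", "you're", "doesn't"]

# Strip everything except word characters and whitespace, mirroring step 2,
# so that "don't" becomes "dont" and matches the cleaned text.
normalized_stop_words = {re.sub(r"[^\w\s]+", "", word) for word in sample_stop_words}


def remove_stopwords_normalized(raw_text: str) -> str:
    """Drop tokens that appear in the punctuation-stripped stop-word set."""
    return " ".join(w for w in raw_text.split() if w not in normalized_stop_words)


print(remove_stopwords_normalized("youre buy a flat dont want to overstretch"))
# → buy flat want overstretch
```

An alternative with the same effect would be to remove stopwords before removing punctuation.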

## 6. Category Encoding

In [17]:
articles['category'].unique()

array(['Lifestyle', 'World', 'Technology', 'Business', 'Singapore',
       'Sports'], dtype=object)

In [18]:
category_mapping = {
    'Singapore': 1,
    'Sports': 2,
    'Lifestyle': 3,
    'World': 4,
    'Business': 5,
    'Technology': 6
}

In [19]:
# Map each category label to its integer code
processed_articles = articles.copy()
processed_articles['category_code'] = processed_articles['category'].map(category_mapping)
processed_articles = processed_articles.reset_index(drop=True)
processed_articles.head()

| Unnamed: 0 | source | title | article | category | length_characters | length_words | category_code |
|---|---|---|---|---|---|---|---|
| 0 | AsiaOne | 7 factors to consider when looking for an HDB ... | whether youre buy hdb resale flat firsttimer l... | Lifestyle | 9771 | 1769 | 3 |
| 1 | The Straits Times | Jung Joon-young first to be charged in K-pop s... | seoul first arrest kpop scandal singer jung ... | Lifestyle | 2162 | 324 | 3 |
| 2 | The Straits Times | Music mogul Dr Dre gets flak over boast that h... | los angeles look talk lack class netizens im... | Lifestyle | 1108 | 182 | 3 |
| 3 | Channel News Asia | | think colonel sanders quirky enough apparentl... | Lifestyle | 1193 | 207 | 3 |
| 4 | Channel News Asia | | mexican government question louis vuittons use... | Lifestyle | 1484 | 231 | 3 |
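
A category label that is missing from `category_mapping` would not raise an error during encoding, so it can be worth checking for unmapped labels explicitly. The sketch below uses `Series.map`, which yields `NaN` for keys absent from the dictionary. The `toy` frame and the `Opinion` label are made up for illustration and do not come from the dataset.

```python
import pandas as pd

category_mapping = {
    "Singapore": 1, "Sports": 2, "Lifestyle": 3,
    "World": 4, "Business": 5, "Technology": 6,
}

# Toy frame standing in for `articles`; "Opinion" is deliberately unmapped.
toy = pd.DataFrame({"category": ["Lifestyle", "World", "Opinion"]})

# Series.map returns NaN for labels that are missing from the mapping.
toy["category_code"] = toy["category"].map(category_mapping)

# Collect the labels that did not receive a code.
unmapped = toy.loc[toy["category_code"].isna(), "category"].unique().tolist()
print(unmapped)
# → ['Opinion']
```

An empty `unmapped` list indicates that every row received a code.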


In [20]:
# Export the processed DataFrame as a pickle
with open('Pickles/all_articles_processed.pickle', 'wb') as output:
    pickle.dump(processed_articles, output)
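
To confirm that an export like this one can be read back unchanged, a round-trip check is a cheap safeguard. The following is a minimal sketch on a toy DataFrame written to a temporary directory. The file name `toy_processed.pickle` and the frame contents are placeholders, not the notebook's data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for `processed_articles`.
toy = pd.DataFrame({"article": ["buy hdb resale flat"], "category_code": [3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_processed.pickle")
    # Write the frame, then read it straight back.
    with open(path, "wb") as output:
        pickle.dump(toy, output)
    with open(path, "rb") as data:
        restored = pickle.load(data)

# DataFrame.equals compares values, dtypes, and index.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
# → True
```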