# Data Cleaning & Splitting

In this step we clean up the data based on the insights gained during __EDA__. Since no separate test dataset is provided, we also split the data into two representative parts (train & test). Finally, we save the cleaned data files separately so they can be loaded easily in the next phase (model training).

In [1]:
# required library imports & initial settings

import re
import string
import wordninja
import pandas as pd
from sklearn.model_selection import train_test_split


RANDOM_SEED = 1024
test_size_fraction = 0.2
column_target = 'category'

## Reading Raw Data

In [2]:
# define the data path (relative to the location of this notebook) and the data file name
DATA_PATH = './scripts/data'
FILE_NAME = 'dataset_all.csv.gz'

df_all = pd.read_csv(f'{DATA_PATH}/{FILE_NAME}')
df_all

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
...,...,...,...,...,...,...
200848,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,"Reuters, Reuters",https://www.huffingtonpost.com/entry/rim-ceo-t...,Verizon Wireless and AT&T are already promotin...,2012-01-28
200849,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,,https://www.huffingtonpost.com/entry/maria-sha...,"Afterward, Azarenka, more effusive with the pr...",2012-01-28
200850,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...",,https://www.huffingtonpost.com/entry/super-bow...,"Leading up to Super Bowl XLVI, the most talked...",2012-01-28
200851,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,,https://www.huffingtonpost.com/entry/aldon-smi...,CORRECTION: An earlier version of this story i...,2012-01-28


## Dealing with Missing Values

In [3]:
# remove samples with missing values in 'headline' column
df_all = df_all.dropna(subset=['headline']).reset_index(drop=True)
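Before dropping rows it can help to see how many values are actually missing in each column. The following is a minimal sketch on a toy dataframe (the rows and the names `toy`, `missing_per_column` and `cleaned` are made up for illustration and are not taken from the real dataset):

```python
import pandas as pd

# toy frame (hypothetical rows) with one missing headline
toy = pd.DataFrame({
    'headline': ['Some headline', None, 'Another one'],
    'category': ['CRIME', 'TECH', 'SPORTS'],
})

# number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# drop only the rows where 'headline' is missing, then rebuild the index
cleaned = toy.dropna(subset=['headline']).reset_index(drop=True)
print(len(toy), '->', len(cleaned))
```

Restricting `dropna` to `subset=['headline']` keeps rows that are missing values in other columns only.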

## Fixing Inconsistencies of Target Column Classes

In [4]:
# create a category transformer function based on lexical similarity scores and the intuition gained from EDA
def transform_category(value):
    if value in ['ARTS', 'CULTURE & ARTS']:
        return 'ARTS & CULTURE'
    elif value in ['THE WORLDPOST', 'WORLDPOST']:
        return 'WORLD NEWS'
    elif value in ['STYLE']:
        return 'STYLE & BEAUTY'
    elif value in ['PARENTS']:
        return 'PARENTING'
    elif value in ['TASTE']:
        return 'FOOD & DRINK'
    elif value in ['GREEN', 'ENVIRONMENT']:
        return 'GREEN & ENVIRONMENT'
    else:
        return value


df_all[column_target] = df_all[column_target].map(transform_category)
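As an alternative sketch, the same merges can be written as a lookup table, which keeps all the rules in one place. `CATEGORY_MAP`, `toy` and `merged` are hypothetical names introduced here; the mapping itself mirrors the branches of `transform_category`:

```python
import pandas as pd

# same merges as transform_category, expressed as a lookup table
CATEGORY_MAP = {
    'ARTS': 'ARTS & CULTURE',
    'CULTURE & ARTS': 'ARTS & CULTURE',
    'THE WORLDPOST': 'WORLD NEWS',
    'WORLDPOST': 'WORLD NEWS',
    'STYLE': 'STYLE & BEAUTY',
    'PARENTS': 'PARENTING',
    'TASTE': 'FOOD & DRINK',
    'GREEN': 'GREEN & ENVIRONMENT',
    'ENVIRONMENT': 'GREEN & ENVIRONMENT',
}

# toy category values (hypothetical sample)
toy = pd.Series(['ARTS', 'WORLDPOST', 'CRIME', 'TASTE'])

# values without an entry in the table are returned unchanged
merged = toy.map(lambda value: CATEGORY_MAP.get(value, value))
print(merged.tolist())
# ['ARTS & CULTURE', 'WORLD NEWS', 'CRIME', 'FOOD & DRINK']
```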

## Dropping Unnecessary Columns

In [5]:
columns_desired = ['headline', 'category']

df_all = df_all[columns_desired]
df_all

Unnamed: 0,headline,category
0,There Were 2 Mass Shootings In Texas Last Week...,CRIME
1,Will Smith Joins Diplo And Nicky Jam For The 2...,ENTERTAINMENT
2,Hugh Grant Marries For The First Time At Age 57,ENTERTAINMENT
3,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,ENTERTAINMENT
4,Julianna Margulies Uses Donald Trump Poop Bags...,ENTERTAINMENT
...,...,...
200842,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,TECH
200843,Maria Sharapova Stunned By Victoria Azarenka I...,SPORTS
200844,"Giants Over Patriots, Jets Over Colts Among M...",SPORTS
200845,Aldon Smith Arrested: 49ers Linebacker Busted ...,SPORTS


## Splitting Data into Representative Train & Test Sets
This should take place before going any further, because the additional data cleaning that follows should only be applied to the training data.

In [6]:
# let's shuffle the whole dataframe before proceeding
df_all = df_all.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)
# second shuffle, seeded with the square of the random seed :D
df_all = df_all.sample(frac=1, random_state=(RANDOM_SEED**2)).reset_index(drop=True)

# split the data into representative training and test sets
df_train, df_test = train_test_split(df_all, test_size=test_size_fraction, stratify=df_all[column_target], random_state=RANDOM_SEED)

# reset the index of the dataframes
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)
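To illustrate what `stratify` does, here is a minimal sketch on a toy dataframe with an 80/20 class imbalance. The data and the names `df_toy`, `toy_train` and `toy_test` are made up for illustration; only the `train_test_split` call mirrors the cell above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame (hypothetical data): 80 'A' rows and 20 'B' rows
df_toy = pd.DataFrame({
    'headline': [f'headline {i}' for i in range(100)],
    'category': ['A'] * 80 + ['B'] * 20,
})

toy_train, toy_test = train_test_split(
    df_toy, test_size=0.2, stratify=df_toy['category'], random_state=1024
)

# the class proportions of both parts should match those of the full frame
print(df_toy['category'].value_counts(normalize=True).to_dict())
print(toy_train['category'].value_counts(normalize=True).to_dict())
print(toy_test['category'].value_counts(normalize=True).to_dict())
```

Without `stratify`, a rare category could end up over- or under-represented in the test set purely by chance.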

__Note:__ It is common advice to use the entire dataset for the final training. For that reason we also apply every train-set cleaning step to a separate copy of the entire dataset (`df_all`), so that it is ready to use for the final training. No test set exists in that scenario; the test cases would be unseen, out-of-sample data.

## Preprocessing Train Data
We need to do some preprocessing on the training data, which will help us further clean the training samples.

In [7]:
def preprocess_text(text: str) -> str:
    '''
    Applies text preprocessing to input text.

    Args:
        text (str): example => 'This, is the #TEXT   that needs   to be preprocessed.  '

    Returns:
        str: example => 'this is the text that needs to be preprocessed'
    '''
    text = text.lower()  # convert to lowercase
    text = re.sub(r'(#)(\S+)', r' \2', text)  # remove the hashtag sign but keep the tag text
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(' +', ' ', text)  # replace multiple whitespaces with a single space
    text = text.strip()  # remove leading and trailing whitespaces
    return text


# create a new column in train dataframe which contains headline in preprocessed form
df_train['headline_preprocessed'] = df_train.apply(lambda row: preprocess_text(str(row['headline'])), axis=1)
# do the same for our separate 'entire' train dataframe
df_all['headline_preprocessed'] = df_all.apply(lambda row: preprocess_text(str(row['headline'])), axis=1)

# re-order columns
columns_order = ['headline', 'headline_preprocessed', 'category']
df_train = df_train[columns_order]
df_all = df_all[columns_order]


df_train.head()

Unnamed: 0,headline,headline_preprocessed,category
0,Medicare Supplemental Policies: Do You Need One?,medicare supplemental policies do you need one,WELLNESS
1,7 Tips For You And Your Dog This July 4th,7 tips for you and your dog this july 4th,GREEN & ENVIRONMENT
2,The Best Hotel-Hosted Super Bowl Parties In La...,the best hotelhosted super bowl parties in las...,TRAVEL
3,"Even If You Lose The Weight, Obesity May Still...",even if you lose the weight obesity may still ...,HEALTHY LIVING
4,Cocaine Cowboy 'White Boy Rick' Could Be Relea...,cocaine cowboy white boy rick could be release...,CRIME
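To make the effect of each step visible, the sketch below applies the same five operations one at a time to the docstring example. The intermediate names `step1` to `step5` and the second sample string are only illustrative:

```python
import re
import string

sample = 'This, is the #TEXT   that needs   to be preprocessed.  '

step1 = sample.lower()                                              # lowercase
step2 = re.sub(r'(#)(\S+)', r' \2', step1)                          # drop '#', keep the tag text
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2)   # strip punctuation
step4 = re.sub(' +', ' ', step3)                                    # collapse repeated spaces
step5 = step4.strip()                                               # trim both ends

print(repr(step5))
# 'this is the text that needs to be preprocessed'

# punctuation is deleted rather than replaced by a space,
# so hyphenated words are joined together
hyphenated = re.sub('[%s]' % re.escape(string.punctuation), '', 'hotel-hosted parties')
print(hyphenated)
# hotelhosted parties
```

The second example explains why `Hotel-Hosted` appears as `hotelhosted` in the output above.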


## Apply Further Data Cleaning on Train Data
This is where we check character length and word count of the headlines and remove any rows that do not meet the requirements we identified during the __EDA__.

In [8]:
def split_concatenated(text: str) -> str:
    '''
    splits words of a concatenated English text string.

    Args:
        text (str): example => 'thistextstringisconcatenated'

    Returns:
        str: example => 'this text string is concatenated'
    '''
    text = ' '.join(wordninja.split(text))
    return text


# apply splitting function to headlines with just 1 word
df_train['headline_preprocessed'] = df_train['headline_preprocessed'].map(lambda value: split_concatenated(value) if (len(str(value).split()) < 2) else value)
# apply the same process to our separate 'entire' train dataframe
df_all['headline_preprocessed'] = df_all['headline_preprocessed'].map(lambda value: split_concatenated(value) if (len(str(value).split()) < 2) else value)

# drop remaining rows with fewer than 2 words in headline
# (the comparison is parenthesized because `~` binds more tightly than `<`)
df_train = df_train.loc[~(df_train['headline_preprocessed'].str.split().str.len() < 2)].reset_index(drop=True)
# the same for our separate 'entire' train dataframe
df_all = df_all.loc[~(df_all['headline_preprocessed'].str.split().str.len() < 2)].reset_index(drop=True)
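One detail worth keeping in mind when building word-count masks is operator precedence: in Python, `~` binds more tightly than `<`. The toy sketch below (hypothetical headlines; `mask_unparenthesized` and `mask_intended` are illustrative names) shows how the two spellings differ:

```python
import pandas as pd

# toy preprocessed headlines (hypothetical)
toy = pd.Series(['breaking', 'breaking news', 'one more headline'])
word_counts = toy.str.split().str.len()

# without parentheses, `~` is applied to the integer counts first
# (bitwise NOT: ~n == -n - 1), so every value is below 2 and the mask is all True
mask_unparenthesized = ~word_counts < 2
print(mask_unparenthesized.tolist())
# [True, True, True]

# parenthesizing the comparison yields the "at least 2 words" mask
mask_intended = ~(word_counts < 2)
print(mask_intended.tolist())
# [False, True, True]
```

Writing the condition positively, as `word_counts >= 2`, avoids the negation altogether.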

In [9]:
# keep only samples whose headline has more than 10 characters and more than 2 words (`~` negates just the first condition)
df_train = df_train.loc[~(df_train['headline_preprocessed'].str.len() <= 10) & (df_train['headline_preprocessed'].str.split().str.len() > 2)].reset_index(drop=True)
# the same for our separate 'entire' train dataframe
df_all = df_all.loc[~(df_all['headline_preprocessed'].str.len() <= 10) & (df_all['headline_preprocessed'].str.split().str.len() > 2)].reset_index(drop=True)
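The combined mask in the cell above can be checked on a few toy headlines. This is a sketch with made-up data; `char_len`, `word_count`, `mask_cell` and `mask_keep` are illustrative names. As written, the expression keeps a row only when the headline is longer than 10 characters and has more than 2 words:

```python
import pandas as pd

# toy preprocessed headlines (hypothetical)
toy = pd.Series(['hi there', 'a b c', 'short one here', 'twowordsheadline here'])

char_len = toy.str.len()
word_count = toy.str.split().str.len()

# same shape as the expression used in the cell above
mask_cell = ~(char_len <= 10) & (word_count > 2)

# equivalent form without the negation
mask_keep = (char_len > 10) & (word_count > 2)

print(toy[mask_keep].tolist())
# ['short one here']
```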

## Saving Cleaned Versions of Data

In [10]:
df_train.to_csv(f'{DATA_PATH}/train_cleaned.csv.gz', compression='gzip', index=False)
df_test.to_csv(f'{DATA_PATH}/test_cleaned.csv.gz', compression='gzip', index=False)
df_all.to_csv(f'{DATA_PATH}/train_cleaned_entire.csv.gz', compression='gzip', index=False)
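A quick way to confirm that a file was written correctly is to read it back and compare. The sketch below does this for a toy dataframe in a temporary directory, so it does not touch `DATA_PATH`; the file name `toy_cleaned.csv.gz` and the rows are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# toy frame (hypothetical rows) with the same columns as the cleaned train data
toy = pd.DataFrame({
    'headline': ['Some Headline Here!', 'Another Longer Headline'],
    'headline_preprocessed': ['some headline here', 'another longer headline'],
    'category': ['CRIME', 'TECH'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_cleaned.csv.gz')
    toy.to_csv(path, compression='gzip', index=False)
    # read_csv infers gzip compression from the '.gz' extension
    restored = pd.read_csv(path)

print(restored.equals(toy))
```

Because the files are saved with `index=False`, no extra index column appears when they are loaded in the next phase.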