# Text Preprocessing 

<br/><br/>In this notebook we are going to perform a few basic text preprocessing steps on the extracted data. 

In [14]:
# Import some import-worthy things 
import numpy as np
import pandas as pd 
import re

# Some stuff to help me with debugging 
_GLOBAL_DEBUG_ = True 
# Note to self - I really need to start using Python's logging module!!! 
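
The note above about switching to a logger could look roughly like the sketch below. It is only an illustration: the logger name `text_preprocessing` and the helper `report_shape` are hypothetical and do not appear anywhere else in this notebook.

```python
import logging

# Hypothetical logger name; any string works.
logger = logging.getLogger("text_preprocessing")
# DEBUG plays the role of _GLOBAL_DEBUG_ = True; use logging.INFO to silence it.
logger.setLevel(logging.DEBUG)

# Guard against adding a second handler when the notebook cell is re-run.
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)


def report_shape(n_rows, n_cols):
    """Log a dataframe-like shape instead of printing it behind a flag."""
    logger.debug("shape = (%d, %d)", n_rows, n_cols)


report_shape(1000, 3)
```

With this in place, the `if _GLOBAL_DEBUG_:` checks would become plain `logger.debug(...)` calls, and the verbosity could be changed in one spot.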

### Load the data in a Pandas DataFrame 

<br/>Initially we will work with just 1,000 entries and scale up later. 

In [23]:
# Let's load a small subset of the June data. 
# We will work on it, test it, verify it
# Once confident, we will apply all the steps to the whole dataset.

"""
header : int, list of int, default ‘infer’
    Row number(s) to use as the column names, and the start of the data. 
index_col : int, str, sequence of int / str, or False, default None
    Column(s) to use as the row labels of the DataFrame, either given as string name or column index.
usecols : list-like or callable, optional
    Return a subset of the columns. If list-like, all elements must either be positional 
    (i.e. integer indices into the document columns) or strings that correspond to column names 
    provided either by the user in names or inferred from the document header row(s).    
prefix : str, optional
    Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
nrows : int, optional
    Number of rows of file to read. Useful for reading pieces of large files.    
"""
raw_text_df = pd.read_csv('./data/60_days_of_udacity_june.csv', 
                          header=0, usecols=['message', 'username', 'datetime'], nrows=1000)

# This little bit of pandas hackery will ensure that we get to see all the columns
# and rows, whenever a dataframe is printed 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

if _GLOBAL_DEBUG_:
    print(raw_text_df.shape)

raw_text_df.head(10)

(1000, 3)


Unnamed: 0,message,username,datetime
0,\n@akshit has joined the channel\n \n,akshit,2019-06-21 16:58:00
1,\n@aleksandra.mozejko has joined the channel\n,aleksandra.mozejko,2019-06-21 17:07:00
2,\n@161210032 has joined the channel\n \n@16121...,161210032,2019-06-21 17:09:00
3,\n@ziad.esam.ezat has joined the channel\n \n@...,ziad.esam.ezat,2019-06-21 17:26:00
4,"\n@ziad.esam.ezat No , i didn't find a way . i...",161210032,2019-06-21 17:40:00
5,\n@rekha.chandrasekaran2 has joined the channel\n,rekha.chandrasekaran2,2019-06-21 17:44:00
6,\n@aleksandra.mozejko has joined the channel\n,aleksandra.mozejko,2019-06-21 17:49:00
7,\n@abhinav.raj116 has joined the channel\n \nT...,abhinav.raj116,2019-06-21 18:23:00
8,\n@rupeshpurum has joined the channel\n,rupeshpurum,2019-06-21 18:47:00
9,\n@amatseshe has joined the channel\n \nI'm ex...,amatseshe,2019-06-21 18:53:00


### Preprocess Text
<br/>We are going to preprocess the text now. The following preprocessing steps will be applied: 
<br/>
1. Lowercase all text messages 
2. Replace URLs with a space 
3. Replace usernames with a space 
4. Replace all special characters with a space 

In [34]:
# 1
# Lowercase all message strings 
raw_text_df['message'] = raw_text_df['message'].str.lower()
raw_text_df.head()

Unnamed: 0,message,username,datetime
0,\n@akshit has joined the channel\n \n,akshit,2019-06-21 16:58:00
1,\n@aleksandra.mozejko has joined the channel\n,aleksandra.mozejko,2019-06-21 17:07:00
2,\n@161210032 has joined the channel\n \n@16121...,161210032,2019-06-21 17:09:00
3,\n@ziad.esam.ezat has joined the channel\n \n@...,ziad.esam.ezat,2019-06-21 17:26:00
4,"\n@ziad.esam.ezat no , i didn't find a way . i...",161210032,2019-06-21 17:40:00


In [40]:
# 2 
# Replace URLs with a space " "
# This nasty regex string covers URLs that start with http://, https:// or www. 
# The outer group is non-capturing so that str.contains() does not warn about match groups 
re_string = r'(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})'

# Before we go around removing links from the data, 
# we need a way to VERIFY whether our regex removal 
# has worked or not. 
# VERIFICATION is a very important aspect of data preprocessing. 

# A simple method is to find the messages containing links using the same regex string. 
# After removal, we should not find anything using the same string. 
print(f'Messages containing links = {sum(raw_text_df["message"].str.contains(re_string, regex=True).astype(int))}')

# Let's remove the links 
raw_text_df['message'] = raw_text_df['message'].replace(to_replace=re_string, value=' ', regex=True)

# Let's try again and see if we can find any messages containing links to verify 
# A value of 0 should indicate that we have been successful! 
print(f'Messages containing links after removal = {sum(raw_text_df["message"].str.contains(re_string, regex=True).astype(int))}')


Messages containing links = 76
Messages containing links after removal = 0
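
One caveat with this check: a count of 0 only shows that the pattern no longer matches anything, not that every link is gone. A complementary check is to run the pattern over a few hand-written strings whose expected result is known. The sketch below does that with plain `re`; the sample messages and helper names are invented for illustration.

```python
import re

# URL pattern copied from this notebook, split over several lines for readability.
URL_PATTERN = re.compile(
    r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
)


def count_with_links(messages):
    """Number of messages in which the pattern finds at least one URL."""
    return sum(1 for message in messages if URL_PATTERN.search(message))


def strip_links(messages):
    """Replace every matched URL with a single space."""
    return [URL_PATTERN.sub(' ', message) for message in messages]


messages = [
    "see https://www.udacity.com/course now",  # scheme + www: matched
    "docs at www.github.com/x",                # www only: matched
    "no links here",                           # nothing to match
    "visit pytorch.org",                       # bare domain: NOT matched
]

before = count_with_links(messages)
cleaned = strip_links(messages)
after = count_with_links(cleaned)

print(before, after)  # 2 0
print(cleaned[3])     # visit pytorch.org
```

The last sample passes through untouched because the pattern requires either a scheme or a leading `www.`, so bare domains survive this step.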




In [42]:
# 3
# Usernames don't really add any semantic meaning for the machine 
# Therefore, replace all usernames with a space 
raw_text_df['message'] = raw_text_df['message'].replace(to_replace=r'@(\S)*', value=' ', regex=True)

raw_text_df['message'].head(10)

0                      \n  has joined the channel\n \n
1                         \n  has joined the channel\n
2    \n  has joined the channel\n \n  set the chann...
3    \n  has joined the channel\n \n  isn't there a...
4    \n  no , i didn't find a way . if   can help i...
5                         \n  has joined the channel\n
6                         \n  has joined the channel\n
7    \n  has joined the channel\n \nthis is my firs...
8                         \n  has joined the channel\n
9    \n  has joined the channel\n \ni'm excited for...
Name: message, dtype: object
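
To see what the username pattern does on individual strings, here is a small standalone sketch. It uses `@\S*`, which matches the same text as the `@(\S)*` pattern above, just without the capture group; the helper name `strip_mentions` and the sample strings are made up.

```python
import re

# "@" followed by any run of non-whitespace characters.
USERNAME_PATTERN = re.compile(r'@\S*')


def strip_mentions(text):
    """Replace each @mention with a single space."""
    return USERNAME_PATTERN.sub(' ', text)


print(repr(strip_mentions("@akshit has joined the channel")))
# '  has joined the channel'

# Side effect worth knowing about: the domain part of an email address
# also starts with "@", so it is removed as well.
print(repr(strip_mentions("mail bob@example.com today")))
# 'mail bob  today'
```

The first result has two leading spaces, which matches the output shown above: one from the replacement and one that was already in the message.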

In [43]:
# 4
# Replace every run of characters other than lowercase letters and digits with a space
raw_text_df['message'] = raw_text_df['message'].replace(to_replace=r'[^0-9a-z]+', value=' ', regex=True)

raw_text_df['message'].head(10)

0                              has joined the channel 
1                              has joined the channel 
2     has joined the channel set the channel topic ...
3     has joined the channel isn t there a way we c...
4     no i didn t find a way if can help in finding...
5                              has joined the channel 
6                              has joined the channel 
7     has joined the channel this is my first time ...
8                              has joined the channel 
9     has joined the channel i m excited for 60 day...
Name: message, dtype: object

### Text Preprocessing Routine 
<br/>Looks like we did a good job up there. Time to put everything together into a 
_one-routine-to-process-them-all_. 

In [96]:
def text_preprocess(df, column='message'):
    """
    Text preprocessing routine. Performs the following 
    steps on the data,
    1. Lowercase all text messages
    2. Replace URLs with a space
    3. Replace usernames with a space
    4. Replace all special characters with a space    
    
    Parameters
    ----------
    df : Pandas DataFrame   
        Unprocessed dataframe
    
    column : string 
        Column name that is going to be processed
   
    Returns
    -------
    df : Pandas DataFrame
        The same dataframe object, with `column` processed in place
    """
    # Debug Flag
    _LOCAL_DEBUG_ = True
    
    # 1
    # Lowercase all message strings 
    df[column] = df[column].astype(str).str.lower()
    if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_:
        print(df[column][1000:1020])
    # 2 
    # Replace URLs with a space " "
    re_string = r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
    df[column] = df[column].replace(to_replace=re_string, value=' ', regex=True)
    if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_:
        print(df[column][1000:1020])
    # 3
    # Replace all usernames with a space 
    df[column] = df[column].replace(to_replace=r'@(\S)*', value=' ', regex=True)
    if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_:
        print(df[column][1000:1020])
    # 4
    # Replace every run of characters other than lowercase letters and digits with a space
    df[column] = df[column].replace(to_replace=r'[^0-9a-z]+', value=' ', regex=True)
    if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_:
        print(df[column][1000:1020])

    if _GLOBAL_DEBUG_ and _LOCAL_DEBUG_:
        print(df[column].head(25))
        print(df[column][1000:1020])
  
    return df

Let's do a quick test run of our new and shiny text_preprocess() routine. 

In [54]:
filtered_df = text_preprocess(raw_text_df, 'message')

filtered_df.head(10)

Unnamed: 0,message,username,datetime
0,has joined the channel,akshit,2019-06-21 16:58:00
1,has joined the channel,aleksandra.mozejko,2019-06-21 17:07:00
2,has joined the channel set the channel topic ...,161210032,2019-06-21 17:09:00
3,has joined the channel isn t there a way we c...,ziad.esam.ezat,2019-06-21 17:26:00
4,no i didn t find a way if can help in finding...,161210032,2019-06-21 17:40:00
5,has joined the channel,rekha.chandrasekaran2,2019-06-21 17:44:00
6,has joined the channel,aleksandra.mozejko,2019-06-21 17:49:00
7,has joined the channel this is my first time ...,abhinav.raj116,2019-06-21 18:23:00
8,has joined the channel,rupeshpurum,2019-06-21 18:47:00
9,has joined the channel i m excited for 60 day...,amatseshe,2019-06-21 18:53:00


# Preprocessing for GPT-2

<br/>With our text_preprocess routine working, we are left with the following tasks: 
<br/>
1. Load the full text corpus from both the files
2. Preprocess the complete text corpus 
3. Combine the processed text corpus into a single file 

In [97]:
# 1
# Load the data from both the files in two separate dataframes
raw_text_df_1 = pd.read_csv('./data/60_days_of_udacity_june.csv', 
                          header=0, usecols=['message', 'username', 'datetime'], nrows=3500)
raw_text_df_2 = pd.read_csv('./data/60_days_of_udacity_july_13.csv', 
                          header=0, usecols=['message', 'username', 'datetime'], nrows=15000)

if _GLOBAL_DEBUG_:
    print(raw_text_df_1.shape)
    print(raw_text_df_2.shape)
    print(f'Total Messages in the corpus = {raw_text_df_1.shape[0]+raw_text_df_2.shape[0]}')


(3406, 3)
(9612, 3)
Total Messages in the corpus = 13018


In [98]:
# 2
# Preprocess both the dataframes separately 
filtered_df_1 = text_preprocess(raw_text_df_1, 'message')
filtered_df_2 = text_preprocess(raw_text_df_2, 'message')

print('Done!')

1000    \npytorch + pysyft in raspbery pi3b ??  update...
1001    \n@george.christ1987 when i figure out how to ...
1002    \n#60daysofudacity\nday 1/60\n1. pledged \n2. ...
1003          \n@pratikthakare65 has joined the channel\n
1004    \nday 1\ntook the pledge\ntalked with @meijoke...
1005    \nhey @edgarinvillegas i did joined, i wonder ...
1006    \nday 1\n1. i took the pledge\n2. completed le...
1007    \nday 1\n1. took the pledge\n2. revised lesson...
1008    \n*day 1/60!*\n1. pledged\n2. started working ...
1009    \nday 1\n1. took the pledge\n2. attended udaci...
1010    \n:thank-you: and good luck for tomorrow @kshn...
1011    \n@quratfatima581 thanks for the encouragement...
1012                                         \nday 1/60\n
1013    \n*day 1*\n1. took the #60daysofudacity pledge...
1014    \nday 1/60\n-took the pledge\n-did a video/exe...
1015    \nday 1/60 (6/27/2019)\nstep 1: pledge made 6/...
1016    \nday 1\n1. took the pledge\n2. revised lesson...
1017    \nday 

1000                     \ncongratulations! keep going!\n
1001    \nday 6:\n1) 30 minutes of coding\n2) meetup s...
1002    \n*day 2*\n- i forgot to post this yesterday \...
1003    \nday 5:\n1. started this tensorflow course i ...
1004                       \nok, thanks, @anjumercian85\n
1005                            \n@cioloboc.florin :+1:\n
1006    \nlooks like this was just the motivation i ne...
1007                \nhi just joined ! #60daysofudacity\n
1008    \ngood morning, how are you doing? i'll only g...
1009    \nday 6:\ngot familiar with fastai library as ...
1010    \nhello @zakiya.fathima27\nif you haven't star...
1011    \nday 1 of #60daysofudacity:\n1. building  dee...
1012    \nday 1 of #60daysofudacity\n1. 30 minutes of ...
1013          \nnice! i'm glad people is still joining.\n
1014    \nday 2: completed #3diff priv: towards evalua...
1015                 \ngreat, i'll add her to the group\n
1016    \nwhat happens if one day i don't have interne...
1017    \nwe a

In [99]:
# 3
# Combine both the dataframes 
text_corpus_df = pd.concat([filtered_df_1, filtered_df_2], ignore_index=True)

# Trust, but verify! 
if _GLOBAL_DEBUG_:
    print(f'Total Messages in filtered_df_1 and filtered_df_2 = {filtered_df_1.shape[0]+filtered_df_2.shape[0]}')
    print(f'Total Messages in the text_corpus_df = {text_corpus_df.shape[0]}')
    print('The above numbers should be the same.')

Total Messages in filtered_df_1 and filtered_df_2 = 13018
Total Messages in the text_corpus_df = 13018
The above numbers should be the same.


In [100]:
# Also, we save our processed data 
text_corpus_df.to_csv('./data/complete_text_corpus.csv')

print('File Write Complete!')

File Write Complete!


### Filtering out the Messages 

<br/>Our work is not done yet. We need to filter out messages that are too long or too short. 


In [101]:
# Calculate the length of each message 
text_corpus_df['m_length'] = text_corpus_df['message'].astype(str).str.len()

text_corpus_df[['message', 'm_length']].head(20)

Unnamed: 0,message,m_length
0,has joined the channel,24
1,has joined the channel,24
2,has joined the channel set the channel topic ...,273
3,has joined the channel isn t there a way we c...,201
4,no i didn t find a way if can help in finding...,90
5,has joined the channel,24
6,has joined the channel,24
7,has joined the channel this is my first time ...,120
8,has joined the channel,24
9,has joined the channel i m excited for 60 day...,59


In [104]:
# 'Describe' our m_length column 
# Divide the message lengths into quintiles
text_corpus_df['m_length'].describe(percentiles=[0.2, 0.4, 0.6, 0.8])


count    13018.000000
mean       185.176371
std        445.287703
min          1.000000
20%         30.000000
40%         70.000000
50%         94.000000
60%        127.000000
80%        239.600000
max      13522.000000
Name: m_length, dtype: float64
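
The summary shows a long tail (the maximum is 13,522 characters against a median of 94), so some length cut-off is needed. The notebook does not state which thresholds are used, so the sketch below simply assumes the 20% and 80% values from the summary, rounded to 30 and 240. It is written in plain Python so it runs without the dataset; `filter_by_length`, `MIN_LEN` and `MAX_LEN` are hypothetical names.

```python
# Assumed bounds, taken from the 20% and 80% rows of the summary above.
MIN_LEN = 30
MAX_LEN = 240


def filter_by_length(messages, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only messages whose character count lies within [min_len, max_len]."""
    return [m for m in messages if min_len <= len(m) <= max_len]


messages = [
    " has joined the channel ",  # 24 characters: too short
    "x" * 100,                   # within bounds
    "y" * 500,                   # too long
]

kept = filter_by_length(messages)
print(len(kept))  # 1
```

Note that a lower bound of 30 would drop every " has joined the channel " message, since those are 24 characters long after preprocessing.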