# Text Cleaning

In this notebook we perform essential text data cleaning.

The implemented methods are general and can be applied to a variety of datasets. If necessary, you can implement additional cleaning methods and easily add them to the final cleaning method.
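
As a rough sketch of how an extra step could be plugged in, the snippet below chains cleaning functions from a list. The names `remove_urls`, `CLEANING_STEPS`, and `apply_steps` are hypothetical and are not part of this notebook; the other two steps mirror functions defined further down.

```python
import re
from functools import reduce


def remove_urls(text):
    # Hypothetical extra step: replace anything that looks like an http(s) URL with a space.
    return re.sub(r'https?://\S+', ' ', text)


def make_lower_case(text):
    return text.lower()


def remove_whitespace(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


# Order matters: each step receives the output of the previous one.
CLEANING_STEPS = [remove_urls, make_lower_case, remove_whitespace]


def apply_steps(text, steps=CLEANING_STEPS):
    # Feed the text through every step in turn.
    return reduce(lambda acc, step: step(acc), steps, text)


print(apply_steps("See  https://example.com  for MORE info"))
```

Keeping the steps in a list means a new cleaning function only has to be appended at the right position.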

### Setup

In [0]:
from google.colab import drive
import sys, os

import numpy as np
import pandas as pd
import re
import time
import joblib  # sklearn.externals.joblib is no longer available in newer scikit-learn releases

%reload_ext autoreload
%autoreload 2

import nltk
from nltk.corpus import stopwords

pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated

In [0]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
os.getcwd()

In [0]:
# Create a symbolic link to omit issues with whitespace in "My Drive"
!ln -s /content/gdrive/"My Drive"/ MyDrive

In [0]:
# Use an absolute path so it stays valid after os.chdir() below
PROJECT_HOME_PATH = os.path.abspath(os.path.join('MyDrive', 'LSTM-light'))

In [0]:
os.chdir(PROJECT_HOME_PATH)

### Load dataset

In [0]:
df_data = pd.read_csv(os.path.join(PROJECT_HOME_PATH, 'DATA', 'quora.csv'))

### Look at examples

In [0]:
text = df_data['question_text']

In [0]:
# Display some examples
for row in text[100:110]:
  print(row)

What do physicists, mathematicians, computer scientists and philosophers think of David Deutsch's 'Constructor Theory'?
Why are old scriptures from eastern cultures appear lost in the current culture?
Can I know my I.Q, even if I hate numbers?
How can I really make up my mind and get rid of my bad habits like procrastination?
Was there any relationship between Napoleon and Ali Pasha of Tepelene?
Where are presynaptic neurons found?
What ways will a narcissist mother punish her child for going no contact if child goes back to contact with her?
Can I start freelancing after finishing Udacity's Android basic nanodegree?
What is the reason why we really need Bitcoin?
What are some good songs for a long journey?


### Data cleaning

In [0]:
# Run this cell if you want to use the English stopwords from NLTK
nltk.download('stopwords')

In [0]:
stopwords_from_file = False

In [0]:
if stopwords_from_file:
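  # One stopword per line. strip() removes line endings without cutting the
  # last character when the file has no trailing newline, and using a new
  # name avoids shadowing the nltk `stopwords` import.
  with open(os.path.join(PROJECT_HOME_PATH, 'DATA', 'stopwords.txt'), encoding="utf-8") as f:
    stopwords_list = [line.strip() for line in f if line.strip()]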

else:
  stopwords_list = list(set(stopwords.words('english')))

In [0]:
def compareChanges(text, func):
    """Compare texts before and after applying a cleaning function.

    Args:
        text, pd.Series: strings to modify.
        func, callable: function which takes a string and returns the modified string.

    Returns:
        text_mod, pd.DataFrame: result df with columns: TextBefore, TextAfter, Changed.
    """

    text_mod = pd.DataFrame({'TextBefore': text.copy()})
    # apply() writes the results into the frame; assigning to the rows yielded
    # by iterrows() is not guaranteed to modify the original DataFrame.
    text_mod['TextAfter'] = text_mod['TextBefore'].apply(func)

    text_mod['Changed'] = np.where(text_mod['TextBefore'] == text_mod['TextAfter'], 0, 1)

    return text_mod
  
  
def removeNumbers(text):
    text = re.sub('[0-9]', '', text)
    return text

  
def removeAllOtherThanLetters(text):
    text = re.sub('[^a-zA-Z]', ' ', text)
    return text
  
  
def makeLowerCase(text):
    return text.lower()
  
  
def removeSpecial(text):
    puncts = ['&']
    
    
    for punct in puncts:
        text = text.replace(punct, '')
        
    return text
  
  
def removeWhitespace(text):
    text = " ".join(text.split())
    return text
  

def removeStopwords(text, stopwords_list):
    final_tokens = []
    tokens = text.split()
    
    for word in tokens:
        if (word not in stopwords_list):
            final_tokens.append(word)
            
    text = " ".join(final_tokens)
  
    return text
  
  
def removeSelectedWords(text):
    final_tokens = []
    tokens = text.split()
    selected_words = []  # placeholder: add dataset-specific words to drop here
    for word in tokens:
        if (word not in selected_words):
            final_tokens.append(word)
            
    text = " ".join(final_tokens)
    
    return text
  
  
def removeShortWords(text):
    final_tokens = []
    tokens = text.split()
    
    for word in tokens:
        if len(word) > 1:
            final_tokens.append(word)
            
    text = " ".join(final_tokens)
    
    return text
    
    
def replacePolishLetters(text):
  
  
    patterns = [ (r'ł', 'l'), 
                 (r'ó', 'o'),
                 (r'ą', 'a'),
                 (r'ć', 'c'),
                 (r'ę', 'e'),
                 (r'ń', 'n'),
                 (r'ś', 's'),
                 (r'ź', 'z'),
                 (r'ż', 'z')]
  
    for (pattern, repl) in patterns:
        text = re.sub(pattern, repl, text)
    return text
            
            
def cleanData(text):
  
    final_text = []
    
    for w in text:
        w = removeAllOtherThanLetters(w)
        w = makeLowerCase(w)
        w = removeStopwords(w, stopwords_list)
        w = removeSelectedWords(w)
        w = removeShortWords(w)
        w = removeWhitespace(w)    

        
        final_text.append(w)
        
    return pd.Series(final_text, index=text.index)
  
  
def cleanDataToDF(text):
    """Return a DataFrame with the original text and its fully cleaned version."""

    text_mod = pd.DataFrame({'TextBefore': text.copy()})

    # Reuse cleanData so that every step is chained on the previous step's
    # output instead of being re-applied to the raw text.
    text_mod['TextAfter'] = cleanData(text)

    return text_mod
  
  

#### Use this cell to verify how a particular cleaning method works:

In [0]:
#text_mod = compareChanges(text[:20], removeAllOtherThanLetters)
#text_mod.head()
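
A quick way to see what individual steps do, without building the full DataFrame, is to run them on one of the sample questions shown earlier. The sketch below re-declares two of the steps under hypothetical snake_case names so that it runs on its own; the regex and the length rule are the same as in `removeAllOtherThanLetters` and `removeShortWords`.

```python
import re


def remove_all_other_than_letters(text):
    # Same regex as removeAllOtherThanLetters: every non-letter becomes a space.
    return re.sub('[^a-zA-Z]', ' ', text)


def remove_short_words(text):
    # Same rule as removeShortWords: keep only tokens longer than one character.
    return " ".join(word for word in text.split() if len(word) > 1)


sample = "Can I know my I.Q, even if I hate numbers?"
step1 = remove_all_other_than_letters(sample)
step2 = remove_short_words(step1.lower())
print(step2)
```

Note that "I.Q" is split into two single letters by the first step and then dropped by the second, so abbreviations of this kind do not survive the cleaning.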

### Run cleaning

In [0]:
# Fill null values with the placeholder string 'empty string'
text = text.fillna('empty string')


# Run data cleaning
text_mod = cleanData(text)

In [0]:
# If no record was omitted, add a column with the cleaned text

if len(text) == len(text_mod):
    df_data['question_text_mod'] = text_mod

In [0]:
# Rearrange column order
df_data = df_data[['qid', 'question_text', 'question_text_mod', 'target']]


# Drop rows with missing values and rows whose cleaned text is an empty string ''
df_data = df_data.dropna()
df_data = df_data.drop(df_data[df_data.question_text_mod == ''].index)
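
The effect of this filtering can be checked on a toy frame. The rows below are made up for illustration, and the boolean mask is an alternative spelling that gives the same result as the `drop(...index)` call when the index is unique.

```python
import pandas as pd

# Made-up rows: two of them end up with an empty cleaned text.
toy = pd.DataFrame({
    'question_text': ['Is foam insulation toxic?', '???', 'A b c'],
    'question_text_mod': ['foam insulation toxic', '', ''],
})

# Keep only rows whose cleaned text is not the empty string.
toy = toy[toy['question_text_mod'] != '']
print(len(toy))
```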

In [0]:
df_data.tail(5)

| | qid | question_text | question_text_mod | target |
|---|---|---|---|---|
| 1306117 | ffffcc4e2331aaf1e41e | What other technical skills do you need as a computer science undergrad other than c and c++? | technical skills need computer science undergrad | 0 |
| 1306118 | ffffd431801e5a2f4861 | Does MS in ECE have good job prospects in USA or like India there are more IT jobs present? | ms ece good job prospects usa like india jobs present | 0 |
| 1306119 | ffffd48fb36b63db010c | Is foam insulation toxic? | foam insulation toxic | 0 |
| 1306120 | ffffec519fa37cf60c78 | How can one start a research project based on biochemistry at UG level? | one start research project based biochemistry ug level | 0 |
| 1306121 | ffffed09fedb5088744a | Who wins in a battle between a Wolverine and a Puma? | wins battle wolverine puma | 0 |


#### Save cleaned data

In [0]:
joblib.dump(df_data, os.path.join(PROJECT_HOME_PATH, 'DATA', 'interim', 'quora_mod.dat'))

['/content/gdrive/My Drive/LSTM-light/DATA/interim/quora_mod.dat']
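
To read the saved object back in a later notebook, `joblib.load` can be used. The round trip below is a minimal sketch on a made-up two-row frame written to a temporary directory; the file name `toy_mod.dat` is hypothetical and is not the project file.

```python
import os
import tempfile

import joblib
import pandas as pd

# Made-up frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    'qid': ['a1', 'b2'],
    'question_text_mod': ['foam insulation toxic', 'wins battle wolverine puma'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_mod.dat')
    joblib.dump(toy, path)        # returns a list with the written file name(s)
    restored = joblib.load(path)  # read the object back from disk

print(restored.equals(toy))
```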