# Data preprocessing

## Import packages

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from collections import Counter

## Read in data

If just using a subset of all the available rows:

In [2]:
#n_rows = 100
#complaints_df = pd.read_csv('~/documents/data/consumer_complaints/consumer_complaints_clean.csv', \
#                            index_col = 0, nrows = n_rows)

If using all rows:

In [3]:
complaints_df = pd.read_csv('~/documents/data/consumer_complaints/consumer_complaints_clean.csv',
                            index_col = 0)
n_rows = complaints_df.shape[0]

First complaint before pre-processing:

In [4]:
print(complaints_df.iloc[0, 2])

received capital one charge card offer xxxx. applied, was accepted ( {$500.00} limit ), activated card and used for xxxx presents. charge card # xxxx. right after activating card ... capital one sent me another card with same {$500.00} limit ... never activated ... never used that card. first bill from above card # came due xxxx and minimum payment due was {$15.00}. i sent in {$20.00} via uspmo and sent in before due date. with the xxxx non-activated, non used credit card ... ..they also sent me bill for some yearly fees when never even activated the card. so called them up ... ... .told them did not want the card and sent back to them. well ... .get my next bill from the card # above ( xxxx ) ... .they did not credit me for the {$20.00} payment and charged me outrageous over the limit fees, late fees, etc ... and now {$70.00} payment due. so, i called up, their rep stated they accidentally applied my {$20.00} payment to wrong account number and would be corrected. so, i sent in a {$70

## Make lower case

Words that differ only in capitalization are treated as different words by the nltk package. I can simplify the problem by converting all letters to lower case.

In [5]:
# Every column holds strings, so the vectorized .str accessor can be used
for col in complaints_df.columns:
    complaints_df[col] = complaints_df[col].str.lower()
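
To illustrate why case matters, here is a minimal sketch using only the standard library. The token list is made up for illustration and does not come from the complaints data.

```python
from collections import Counter

# Hypothetical tokens: the same word in three different casings
tokens = ["Capital", "capital", "CAPITAL", "card"]

# Without lower-casing, each casing is counted as its own word
raw_counts = Counter(tokens)

# After lower-casing, the three variants collapse into one entry
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts), raw_counts["capital"])      # 4 1
print(len(lower_counts), lower_counts["capital"])  # 2 3
```

The vocabulary shrinks from four entries to two, which is the simplification this step is after.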

## Tokenize, lemmatize and remove stop words

Here I tokenize each complaint by splitting it into a list of separate words, i.e. a 'bag of words'. I also discard punctuation and numerics at this point, which makes the remaining steps easier.
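
As a rough sketch of what this step does to a fragment of the first complaint, the snippet below uses `re.findall` from the standard library as a stand-in for `regexp_tokenize`. I am assuming the two behave the same for a simple pattern like `\w+`; the notebook itself uses the NLTK function.

```python
import re

# Fragment adapted from the first complaint
text = "applied, was accepted ( {$500.00} limit ), activated card # xxxx."

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
pattern = r"\w+"
tokens = re.findall(pattern, text)

# isalpha() then removes the purely numeric tokens left over from "$500.00"
alpha_tokens = [token for token in tokens if token.isalpha()]

print(tokens)
print(alpha_tokens)
```

Note that "500.00" is split into two numeric tokens by the pattern before the `isalpha()` filter removes them.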

I lemmatize each word, reducing it to its root form, to save space, speed up the following analysis, and minimise overfitting to non-meaningful words. There is also an option to use stemming instead if lemmatization turns out to be too slow.

Stop words are commonly occurring words that take up space and add little meaning. I remove them to reduce the size of the problem even further. "xxxx" is a string that replaces confidential words in the consumer complaints. It occurs frequently, but because its meaning is obscured it adds no value, so I add it to the stop words.

In [6]:
# A set gives constant-time membership tests in the filtering loop below
stop_words = set(stopwords.words('english'))
stop_words.add("xxxx")
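
A minimal sketch of the filtering itself. The stop-word set here is a small hand-picked example, not the NLTK English list, so that the snippet runs without any downloaded data.

```python
# Illustrative stop words only; the notebook uses NLTK's English list plus "xxxx"
example_stop_words = {"i", "the", "was", "and", "xxxx"}

tokens = ["applied", "was", "accepted", "limit", "activated", "card", "xxxx"]

# Keep only the tokens that are not stop words
kept = [token for token in tokens if token not in example_stop_words]

print(kept)  # ['applied', 'accepted', 'limit', 'activated', 'card']
```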

In [7]:
# Function borrowed from https://www.kaggle.com/alvations/basic-nlp-with-nltk
def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB':'r'}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        return 'n' # if mapping isn't found, fall back to Noun.
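
To show what the mapping does, here is a self-contained copy of the function applied to a few Penn Treebank tags. This version uses `dict.get` for the fallback, which should be equivalent to the try/except above for string tags.

```python
def penn2morphy(penntag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    morphy_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}
    # Only the first two characters matter, so 'VBD' and 'VBZ' both map to 'v'
    return morphy_tag.get(penntag[:2], 'n')

for tag in ['NNS', 'VBD', 'JJR', 'RB', 'IN']:
    print(tag, '->', penn2morphy(tag))
```

Tags with no entry in the mapping, such as 'IN' (preposition), fall back to 'n'.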

In [8]:
pattern = r"\w+"
complaints_list = []

# For stemming
# ps = PorterStemmer()

# For lemmatization
lem = WordNetLemmatizer()

for i in range(n_rows):
    complaint = regexp_tokenize(complaints_df.iloc[i, 2], pattern)
    tags = nltk.pos_tag(complaint)
    # For stemming
    # complaints_list.append([ps.stem(word) for word in complaint if word not in stop_words and word.isalpha()])
    
    # For lemmatization
    complaints_list.append([lem.lemmatize(word, pos = penn2morphy(tag)) for word, tag in tags
                            if word not in stop_words and word.isalpha()])

complaints_df['Consumer complaint narrative'] = complaints_list

del complaints_list
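
The body of the loop could also be written as a standalone function, which makes the order of the steps easier to see and to test. This is only a sketch: `preprocess`, `fake_tagger` and `fake_lemmatize` are hypothetical names, and the two stand-ins are crude placeholders for `nltk.pos_tag` and `WordNetLemmatizer.lemmatize` so that the snippet runs without NLTK data.

```python
import re

def preprocess(text, stop_words, tag_tokens, lemmatize):
    """Tokenize, tag, filter and lemmatize a single complaint."""
    tokens = re.findall(r"\w+", text.lower())
    # Tagging happens before filtering, as in the loop above
    return [lemmatize(word, tag)
            for word, tag in tag_tokens(tokens)
            if word not in stop_words and word.isalpha()]

# Placeholder tagger: labels every token as a noun (NOT a real POS tagger)
def fake_tagger(tokens):
    return [(token, 'NN') for token in tokens]

# Placeholder lemmatizer: strips a trailing "s" (NOT real lemmatization)
def fake_lemmatize(word, tag):
    return word[:-1] if word.endswith('s') else word

result = preprocess("Fees charged on xxxx cards, {$20.00}!",
                    {"on", "xxxx"}, fake_tagger, fake_lemmatize)
print(result)  # ['fee', 'charged', 'card']
```

With the real NLTK callables passed in, a function like this could be applied to the narrative column row by row.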

First complaint after pre-processing:

In [11]:
print(complaints_df.iloc[0, 2])

['receive', 'capital', 'one', 'charge', 'card', 'offer', 'apply', 'accept', 'limit', 'activate', 'card', 'use', 'present', 'charge', 'card', 'right', 'activate', 'card', 'capital', 'one', 'sent', 'another', 'card', 'limit', 'never', 'activate', 'never', 'use', 'card', 'first', 'bill', 'card', 'come', 'due', 'minimum', 'payment', 'due', 'send', 'via', 'uspmo', 'send', 'due', 'date', 'non', 'activate', 'non', 'use', 'credit', 'card', 'also', 'send', 'bill', 'yearly', 'fee', 'never', 'even', 'activate', 'card', 'call', 'tell', 'want', 'card', 'send', 'back', 'well', 'get', 'next', 'bill', 'card', 'credit', 'payment', 'charge', 'outrageous', 'limit', 'fee', 'late', 'fee', 'etc', 'payment', 'due', 'call', 'rep', 'state', 'accidentally', 'apply', 'payment', 'wrong', 'account', 'number', 'would', 'correct', 'sent', 'payment', 'via', 'uspmo', 'along', 'note', 'make', 'sure', 'account', 'correct', 'payment', 'apply', 'correct', 'account', 'number', 'minimum', 'due', 'want', 'keep', 'card', 'als

## Save pre-processed data to file

In [12]:
complaints_df.to_csv('~/documents/data/consumer_complaints/consumer_complaints_pre-processed.csv')
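
One thing to keep in mind when loading this file later: CSV has no list type, so each token list is stored as its string representation. The sketch below shows the round trip with the standard library's `csv` module as a stand-in for pandas (an assumption for illustration), and uses `ast.literal_eval` to turn the string back into a list.

```python
import ast
import csv
import io

tokens = ['receive', 'capital', 'one']

# Write one row to an in-memory CSV; the list is stringified on the way out
buffer = io.StringIO()
csv.writer(buffer).writerow([0, tokens])

# Read it back: the second field is now a str, not a list
buffer.seek(0)
row = next(csv.reader(buffer))
raw = row[1]

# literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(raw)

print(type(raw).__name__, type(restored).__name__)  # str list
```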