# Data preprocessing

## Import packages

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from collections import Counter

## Read in data

If just using a subset of all the available rows:

In [None]:
n_rows = 100
complaints_df = pd.read_csv('~/documents/data/consumer_complaints/consumer_complaints_clean.csv',
                            index_col=0, nrows=n_rows)

If using all rows:

In [2]:
complaints_df = pd.read_csv('~/documents/data/consumer_complaints/consumer_complaints_clean.csv',
                            index_col=0)
n_rows = complaints_df.shape[0]

First complaint before pre-processing:

In [3]:
print(complaints_df.iloc[0, 2])

received capital one charge card offer xxxx. applied, was accepted ( {$500.00} limit ), activated card and used for xxxx presents. charge card # xxxx. right after activating card ... capital one sent me another card with same {$500.00} limit ... never activated ... never used that card. first bill from above card # came due xxxx and minimum payment due was {$15.00}. i sent in {$20.00} via uspmo and sent in before due date. with the xxxx non-activated, non used credit card ... ..they also sent me bill for some yearly fees when never even activated the card. so called them up ... ... .told them did not want the card and sent back to them. well ... .get my next bill from the card # above ( xxxx ) ... .they did not credit me for the {$20.00} payment and charged me outrageous over the limit fees, late fees, etc ... and now {$70.00} payment due. so, i called up, their rep stated they accidentally applied my {$20.00} payment to wrong account number and would be corrected. so, i sent in a {$70

## Make lower case

Words with and without capital letters are considered to be different words by the nltk package. I can simplify the problem by making all letters lower case.

In [4]:
for col in complaints_df.columns:
    # The vectorized .str.lower() leaves missing values as NaN instead of raising
    complaints_df[col] = complaints_df[col].str.lower()
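
To illustrate why case matters, here is a minimal sketch using `Counter` from the standard library on a made-up sentence (the text and variable names are illustrative, not taken from the data set). Counting the raw tokens treats each capitalisation as a separate word, while counting after `lower()` merges them.

```python
from collections import Counter

# Hypothetical example text, not from the complaints data
text = "Card card CARD fee Fee"

raw_counts = Counter(text.split())
lower_counts = Counter(text.lower().split())

print(len(raw_counts))    # number of distinct tokens before lower-casing
print(len(lower_counts))  # number of distinct tokens after lower-casing
```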

## Tokenize, lemmatize and remove stop words

Here I tokenize each of the complaints by splitting it into a list of separate words, i.e. a 'bag of words'. I also discard punctuation and numerics at this point, which makes the remaining steps easier.
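
As a rough sketch of this step, the snippet below applies the `\w+` pattern used later in this notebook to two short fragments of the first complaint. It uses `re.findall` from the standard library as a stand-in for `regexp_tokenize` (an assumption made so the example runs without NLTK), then keeps only purely alphabetic tokens.

```python
import re

pattern = r"\w+"
fragment = "i sent in {$20.00} via uspmo. charge card # xxxx."

# Match runs of word characters; punctuation and symbols are dropped
tokens = re.findall(pattern, fragment)

# Keep alphabetic tokens only, which discards numerics such as '20' and '00'
alpha_tokens = [token for token in tokens if token.isalpha()]
print(alpha_tokens)
```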

I lemmatize each word, reducing it to its root form, to save space, speed up the subsequent analysis, and minimise overfitting to non-meaningful word variants. Stemming is available as a faster alternative if lemmatization ends up being too slow.

Stop words are commonly occurring words that take up space and add little meaning. Here I remove the stop words to reduce the size of the problem even further. "xxxx" is a string that replaces words in the consumer complaints that raise confidentiality issues. This string occurs frequently, but as its meaning is obscured it adds no value, so I add it to the list of stop words.
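
A minimal sketch of stop-word removal, assuming a small hand-written stop list in place of NLTK's English list (the words in `example_stop_words` and the token list are illustrative only):

```python
# Hypothetical stop list; the notebook itself uses NLTK's English stop words
example_stop_words = {"i", "in", "the", "for", "and"}
example_stop_words.add("xxxx")  # the redaction placeholder carries no meaning

tokens = ["i", "sent", "in", "payment", "for", "card", "xxxx"]

# Keep only the tokens that are not stop words
filtered = [token for token in tokens if token not in example_stop_words]
print(filtered)
```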

Actually, seeing as I will do the final tokenization with scikit-learn's CountVectorizer and TfidfVectorizer classes, I will make the output here a single string per complaint rather than a list of separate tokens.
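
For context, here is a hedged sketch of how space-joined strings can be passed to scikit-learn's vectorizers. The two documents are short excerpts from the pre-processed first complaint, and the default vectorizer settings are an assumption rather than the configuration used later in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two short pre-processed strings, taken from the start of the first complaint
docs = ["receive capital one charge card offer",
        "charge card right activate card"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)   # sparse matrix: documents x vocabulary

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)    # same shape, TF-IDF weighted

print(counts.shape)
print(count_vec.vocabulary_["card"])     # column index assigned to 'card'
```

Both vectorizers split on whitespace and punctuation themselves, which is why a plain string per complaint is the more convenient input.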

In [5]:
# A set gives constant-time membership tests when filtering tokens
stop_words = set(stopwords.words('english'))
stop_words.add("xxxx")

In [6]:
# Function borrowed from https://www.kaggle.com/alvations/basic-nlp-with-nltk
def penn2morphy(penntag):
    """Convert a Penn Treebank POS tag to a WordNet POS tag."""
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    # If no mapping is found, fall back to noun.
    return morphy_tag.get(penntag[:2], 'n')
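
As a quick illustration of the mapping, the sketch below restates the function compactly (so the snippet runs on its own) and applies it to a few standard Penn Treebank tags. Only the first two characters of a tag are used, so all noun, verb, adjective and adverb variants collapse to one WordNet tag each.

```python
def penn2morphy(penntag):
    """Map a Penn Treebank tag to a WordNet POS tag, defaulting to noun."""
    morphy_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}
    return morphy_tag.get(penntag[:2], 'n')

# Plural noun, past-tense verb, comparative adjective, adverb, and
# preposition (no entry in the mapping, so it falls back to noun)
for tag in ['NNS', 'VBD', 'JJR', 'RB', 'IN']:
    print(tag, '->', penn2morphy(tag))
```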

In [23]:
pattern = r"\w+"
complaints = []

# For stemming
# ps = PorterStemmer()

# For lemmatization
lem = WordNetLemmatizer()

for i in range(n_rows):
    complaint = regexp_tokenize(complaints_df.iloc[i, 2], pattern)
    tags = nltk.pos_tag(complaint)
    # For stemming
    # complaint_list = [ps.stem(word) for word in complaint if word not in stop_words and word.isalpha()]
    
    # For lemmatisation
    complaint_list = [lem.lemmatize(word, pos=penn2morphy(tag)) for word, tag in tags
                      if word not in stop_words and word.isalpha()]

    complaints.append(' '.join(complaint_list))

    # Report progress every 1,000 complaints
    if i % 1000 == 0:
        print('Complaint:', i, 'complete')

complaints_df['Consumer complaint narrative'] = complaints

del complaint_list

Complaint: 0 complete
Complaint: 1000 complete
Complaint: 2000 complete
Complaint: 3000 complete
Complaint: 4000 complete
Complaint: 5000 complete
Complaint: 6000 complete
Complaint: 7000 complete
Complaint: 8000 complete
Complaint: 9000 complete
Complaint: 10000 complete
Complaint: 11000 complete
Complaint: 12000 complete
Complaint: 13000 complete
Complaint: 14000 complete
Complaint: 15000 complete
Complaint: 16000 complete
Complaint: 17000 complete
Complaint: 18000 complete
Complaint: 19000 complete
Complaint: 20000 complete
Complaint: 21000 complete
Complaint: 22000 complete
Complaint: 23000 complete
Complaint: 24000 complete
Complaint: 25000 complete
Complaint: 26000 complete
Complaint: 27000 complete
Complaint: 28000 complete
Complaint: 29000 complete
Complaint: 30000 complete
Complaint: 31000 complete
Complaint: 32000 complete
Complaint: 33000 complete
Complaint: 34000 complete
Complaint: 35000 complete
Complaint: 36000 complete
Complaint: 37000 complete
Complaint: 38000 complete

First complaint after pre-processing:

In [28]:
print(complaints_df.iloc[0, 2])

receive capital one charge card offer apply accept limit activate card use present charge card right activate card capital one sent another card limit never activate never use card first bill card come due minimum payment due send via uspmo send due date non activate non use credit card also send bill yearly fee never even activate card call tell want card send back well get next bill card credit payment charge outrageous limit fee late fee etc payment due call rep state accidentally apply payment wrong account number would correct sent payment via uspmo along note make sure account correct payment apply correct account number minimum due want keep card also repair credit bankruptcy bill come mail apply payment previous payment state would correct charge outrageous limit fee late fee etc along stupid note spread thin think wow total b call numerous time write numerous time success correct account acknowledge mistake want minimum payment think totally illegal send payment time send mini

## Save pre-processed data to file

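Each complaint is now a single string rather than a list of tokens, so to_csv() is sufficient here. to_pickle() would only be needed to preserve a list structure in the complaints column.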

In [25]:
complaints_df.to_csv('~/documents/data/consumer_complaints/consumer_complaints_pre-processed.csv')
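
To illustrate how the two formats differ for list-valued columns, here is a small sketch that round-trips a toy DataFrame through both. The data, column name and file names are hypothetical, and a temporary directory is used so nothing is written to the project folders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical token lists, standing in for a tokenized complaints column
df = pd.DataFrame({'tokens': [['late', 'fee'], ['wrong', 'account']]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'example.csv')
    pkl_path = os.path.join(tmp_dir, 'example.pkl')

    df.to_csv(csv_path)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pkl = pd.read_pickle(pkl_path)

print(type(from_csv.loc[0, 'tokens']))  # CSV keeps only the list's text form
print(type(from_pkl.loc[0, 'tokens']))  # pickle restores the Python list
```

A column of plain strings survives the CSV round trip unchanged, which is why CSV works for the space-joined complaints.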