# What is the problem?

We start this introduction to natural language processing by looking at the problem we are going to solve. We have a rich dataset of consumer complaints about various financial products and services. Each row describes one complaint and the features associated with it. In this concept, we'll first construct the features and then build a model that predicts the category into which the complaint falls. You can read more about this dataset here.

Along with the complaint narrative, the data contains the issue, the category of the complaint, the date it was received, the ZIP code, details of the customer who filed the complaint, and the current status of the complaint. The end goal is to build a model that assigns each customer's complaint to a product category.

To understand how text processing works, we will work with only two columns of this dataset. Adding more features may well make the model more accurate and robust, but two columns are enough to illustrate the text-processing steps.

# Brief explanation of the dataset & features

Consumer Complaint Narrative: A paragraph of free text written by the customer explaining the complaint in detail. It is stored as a string.
Product: This is the category we want to assign each complaint to. The labels that appear in the Product column are:
'Mortgage', 'Student loan', 'Credit card or prepaid card', 'Credit card', 'Debt collection', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Bank account or service', 'Consumer Loan', 'Money transfers', 'Vehicle loan or lease', 'Money transfer, virtual currency, or money service', 'Checking or savings account', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Other financial service', 'Prepaid card'



# What do we want as the outcome?

We want to classify each complaint into its category so that it can be routed to the right vertical.

In [1]:
import pandas as pd

In [3]:
# In this task you will load Consumer_complaints.csv into a dataframe 
# using pandas and explore the column Consumer Complaint Narrative.

full_data = pd.read_csv("Consumer_complaints.csv")
full_data.head(2)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,03/12/2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382.0,,,Referral,03/17/2014,Closed with explanation,Yes,No,759217
1,01/19/2017,Student loan,Federal student loan servicing,Dealing with my lender or servicer,Received bad information about my loan,When my loan was switched over to Navient i wa...,,"Navient Solutions, LLC.",LA,,,Consent provided,Web,01/19/2017,Closed with explanation,Yes,No,2296496


In [4]:
# keeping the relevant columns
data = full_data[["Consumer complaint narrative", "Product"]]

In [5]:
data.head(2)

Unnamed: 0,Consumer complaint narrative,Product
0,,Mortgage
1,When my loan was switched over to Navient i wa...,Student loan


In [9]:
# Rename the columns to X (complaint text) and y (product label).
# Reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning.
data = data.rename(columns={"Consumer complaint narrative": "X", "Product": "y"})
data.head(2)


Unnamed: 0,X,y
0,,Mortgage
1,When my loan was switched over to Navient i wa...,Student loan


In [13]:
data.isnull().sum()

X    8194
y       0
dtype: int64

In [14]:
data.shape

(10000, 2)

In [17]:
# Keep only rows that have a complaint narrative; .copy() gives an independent DataFrame
dataSub = data[data['X'].notna()].copy()
dataSub.head(3)

Unnamed: 0,X,y
1,When my loan was switched over to Navient i wa...,Student loan
2,I tried to sign up for a spending monitoring p...,Credit card or prepaid card
7,"My mortgage is with BB & T Bank, recently I ha...",Mortgage


In [18]:
dataSub.shape

(1806, 2)

In [31]:
import nltk
from nltk.tokenize import word_tokenize

In [25]:
first_complaint = dataSub['X'].iloc[0]
first_complaint

'When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.'

In [28]:
# convert to lowercase
def convertToLowerCase(text):
    return text.lower()

first_complaint = convertToLowerCase(first_complaint)

# Pre Processing

## Tokenizing with NLTK - The problem intuition

We first need a way to convert the text to numbers so that an algorithm can be applied to it. This is similar to scikit-learn, whose estimators require non-numeric data to be encoded (label or one-hot) before it enters the pipeline.

Intuitively, it would make sense to divide each paragraph of text to its basic form (words) and then convert each of those words to numbers. We could assign a particular number to each word, in which case a sentence could look like a set of numbers to us, each number representing a particular word.

The first step is to break the text down into words, which is what tokenization does. NLTK has built-in tokenizers, which we will use here.
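The idea of assigning a number to each word can be sketched in a few lines. This is a minimal illustration, not the approach used in the rest of this notebook: it splits on whitespace instead of using NLTK, and the names `build_vocabulary` and `encode` are hypothetical helpers introduced only for this example.

```python
def build_vocabulary(sentences):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(sentence, vocabulary):
    """Replace each known word by its id; unknown words are skipped."""
    return [vocabulary[word] for word in sentence.lower().split() if word in vocabulary]


# Two made-up sentences, used only for illustration
sentences = ["i paid the balance", "the balance was paid"]
vocabulary = build_vocabulary(sentences)
encoded = [encode(sentence, vocabulary) for sentence in sentences]
print(vocabulary)
print(encoded)
```

Each sentence becomes a list of integers, one per word, which is the kind of numeric form a learning algorithm can work with.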

In [29]:
first_complaint

'when my loan was switched over to navient i was never told that i had a deliquint balance because with xxxx i did not. when going to purchase a vehicle i discovered my credit score had been dropped from the xxxx into the xxxx. i have been faithful at paying my student loan. i was told that navient was the company i had delinquency with. i contacted navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. i was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. i have had so much trouble bringing my credit score back up.'

### Word Tokenizer

In [33]:
def tokenizeWord(text):
    return word_tokenize(text)

tokenizeWord(first_complaint)

['when',
 'my',
 'loan',
 'was',
 'switched',
 'over',
 'to',
 'navient',
 'i',
 'was',
 'never',
 'told',
 'that',
 'i',
 'had',
 'a',
 'deliquint',
 'balance',
 'because',
 'with',
 'xxxx',
 'i',
 'did',
 'not',
 '.',
 'when',
 'going',
 'to',
 'purchase',
 'a',
 'vehicle',
 'i',
 'discovered',
 'my',
 'credit',
 'score',
 'had',
 'been',
 'dropped',
 'from',
 'the',
 'xxxx',
 'into',
 'the',
 'xxxx',
 '.',
 'i',
 'have',
 'been',
 'faithful',
 'at',
 'paying',
 'my',
 'student',
 'loan',
 '.',
 'i',
 'was',
 'told',
 'that',
 'navient',
 'was',
 'the',
 'company',
 'i',
 'had',
 'delinquency',
 'with',
 '.',
 'i',
 'contacted',
 'navient',
 'to',
 'resolve',
 'this',
 'issue',
 'you',
 'and',
 'kept',
 'being',
 'told',
 'to',
 'just',
 'contact',
 'the',
 'credit',
 'bureaus',
 'and',
 'expalin',
 'the',
 'situation',
 'and',
 'maybe',
 'they',
 'could',
 'help',
 'me',
 '.',
 'i',
 'was',
 'so',
 'angry',
 'that',
 'i',
 'just',
 'hurried',
 'and',
 'paid',
 'the',
 'balance',
 'o

### Sentence Tokenizer

In [34]:
from nltk.tokenize import sent_tokenize

def tokenizeSent(text):
    return sent_tokenize(text)

tokenizeSent(first_complaint)

['when my loan was switched over to navient i was never told that i had a deliquint balance because with xxxx i did not.',
 'when going to purchase a vehicle i discovered my credit score had been dropped from the xxxx into the xxxx.',
 'i have been faithful at paying my student loan.',
 'i was told that navient was the company i had delinquency with.',
 'i contacted navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me.',
 'i was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus.',
 'i have had so much trouble bringing my credit score back up.']

## Stop Words Removal

In [36]:
first_complaint

'when my loan was switched over to navient i was never told that i had a deliquint balance because with xxxx i did not. when going to purchase a vehicle i discovered my credit score had been dropped from the xxxx into the xxxx. i have been faithful at paying my student loan. i was told that navient was the company i had delinquency with. i contacted navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. i was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. i have had so much trouble bringing my credit score back up.'

In [62]:
from nltk.corpus import stopwords
stopWords = stopwords.words('english')
def stopWordRemoval(text):
    text = convertToLowerCase(text)
    text = tokenizeWord(text)
    text = ' '.join([word for word in text if word not in stopWords])
    return text
stopWordRemoval(first_complaint)

'loan switched navient never told deliquint balance xxxx . going purchase vehicle discovered credit score dropped xxxx xxxx . faithful paying student loan . told navient company delinquency . contacted navient resolve issue kept told contact credit bureaus expalin situation maybe could help . angry hurried paid balance tried dispute delinquency credit bureaus . much trouble bringing credit score back .'
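Note that the output above no longer contains "not" (from "i did not"), because negations are part of NLTK's English stop list. The sketch below shows one possible way to keep selected words. It is only an illustration: `TOY_STOP_WORDS` is a tiny hand-picked list and `remove_stop_words` with its `keep` parameter is a hypothetical helper, not part of NLTK.

```python
# A tiny hand-picked stop list for illustration (an assumption);
# NLTK's English list is much longer.
TOY_STOP_WORDS = {"i", "my", "the", "a", "to", "was", "had", "that", "not"}


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS, keep=frozenset()):
    """Drop tokens found in stop_words unless they are listed in keep."""
    return [token for token in tokens if token not in stop_words or token in keep]


tokens = "i was not told that i had a balance".split()
print(remove_stop_words(tokens))                # negation is dropped
print(remove_stop_words(tokens, keep={"not"}))  # negation is kept
```

Storing the stop words in a set also makes each membership test cheaper than scanning a list.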

## Stemming

Stemming reduces each word to its stem, the part that stays the same across its inflected forms. Stems are produced by stripping affixes with fixed rules (the stemmers used below work mainly on suffixes), so a stem is not always an actual word.

For example: likes, liked, liking ---> like
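To make the rule-based idea concrete, here is a deliberately crude suffix stripper. It is a toy sketch under stated assumptions: the suffix list and the `toy_stem` helper are invented for this example and are far simpler than the Porter, Lancaster, or Snowball rules.

```python
# Toy suffix list, chosen for illustration only (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ["likes", "liked", "likely", "paying", "balances"]:
    print(word, "->", toy_stem(word))
```

With these toy rules, "likes" and "liked" both collapse to "lik" and "balances" becomes "balanc", which shows how a stem can be shared across forms without being a real word.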

### Porter Stemmer (implementations exist in most programming languages)





In [40]:
#Breaking the sentence to words
words = word_tokenize(first_complaint)

#Defining Porter Stemmer object
porter = nltk.PorterStemmer()

#Applying the stemming
porter_stem = [porter.stem(i) for i in words]
print(porter_stem)

['when', 'my', 'loan', 'wa', 'switch', 'over', 'to', 'navient', 'i', 'wa', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balanc', 'becaus', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'go', 'to', 'purchas', 'a', 'vehicl', 'i', 'discov', 'my', 'credit', 'score', 'had', 'been', 'drop', 'from', 'the', 'xxxx', 'into', 'the', 'xxxx', '.', 'i', 'have', 'been', 'faith', 'at', 'pay', 'my', 'student', 'loan', '.', 'i', 'wa', 'told', 'that', 'navient', 'wa', 'the', 'compani', 'i', 'had', 'delinqu', 'with', '.', 'i', 'contact', 'navient', 'to', 'resolv', 'thi', 'issu', 'you', 'and', 'kept', 'be', 'told', 'to', 'just', 'contact', 'the', 'credit', 'bureau', 'and', 'expalin', 'the', 'situat', 'and', 'mayb', 'they', 'could', 'help', 'me', '.', 'i', 'wa', 'so', 'angri', 'that', 'i', 'just', 'hurri', 'and', 'paid', 'the', 'balanc', 'off', 'and', 'then', 'after', 'tri', 'to', 'disput', 'the', 'delinqu', 'with', 'the', 'credit', 'bureau', '.', 'i', 'have', 'had', 'so', 'much', 'troubl', 'br

### Lancaster Stemmer

In [45]:
#Defining Lancaster Stemmer object
lancaster = nltk.LancasterStemmer()

#Applying the stemming
lancaster_stem = [lancaster.stem(i) for i in words]
print(lancaster_stem)

['when', 'my', 'loan', 'was', 'switch', 'ov', 'to', 'navy', 'i', 'was', 'nev', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'bal', 'becaus', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'going', 'to', 'purchas', 'a', 'vehic', 'i', 'discov', 'my', 'credit', 'scor', 'had', 'been', 'drop', 'from', 'the', 'xxxx', 'into', 'the', 'xxxx', '.', 'i', 'hav', 'been', 'faith', 'at', 'pay', 'my', 'stud', 'loan', '.', 'i', 'was', 'told', 'that', 'navy', 'was', 'the', 'company', 'i', 'had', 'delinqu', 'with', '.', 'i', 'contact', 'navy', 'to', 'resolv', 'thi', 'issu', 'you', 'and', 'kept', 'being', 'told', 'to', 'just', 'contact', 'the', 'credit', 'burea', 'and', 'expalin', 'the', 'situ', 'and', 'mayb', 'they', 'could', 'help', 'me', '.', 'i', 'was', 'so', 'angry', 'that', 'i', 'just', 'hurry', 'and', 'paid', 'the', 'bal', 'off', 'and', 'then', 'aft', 'tri', 'to', 'disput', 'the', 'delinqu', 'with', 'the', 'credit', 'burea', '.', 'i', 'hav', 'had', 'so', 'much', 'troubl', 'bring', 'my', 'credit',

### Snowball Stemmer

In [46]:
sno = nltk.stem.SnowballStemmer('english')
#Applying the stemming
sno_stem = [sno.stem(i) for i in words]
print(sno_stem)

['when', 'my', 'loan', 'was', 'switch', 'over', 'to', 'navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balanc', 'becaus', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'go', 'to', 'purchas', 'a', 'vehicl', 'i', 'discov', 'my', 'credit', 'score', 'had', 'been', 'drop', 'from', 'the', 'xxxx', 'into', 'the', 'xxxx', '.', 'i', 'have', 'been', 'faith', 'at', 'pay', 'my', 'student', 'loan', '.', 'i', 'was', 'told', 'that', 'navient', 'was', 'the', 'compani', 'i', 'had', 'delinqu', 'with', '.', 'i', 'contact', 'navient', 'to', 'resolv', 'this', 'issu', 'you', 'and', 'kept', 'be', 'told', 'to', 'just', 'contact', 'the', 'credit', 'bureaus', 'and', 'expalin', 'the', 'situat', 'and', 'mayb', 'they', 'could', 'help', 'me', '.', 'i', 'was', 'so', 'angri', 'that', 'i', 'just', 'hurri', 'and', 'paid', 'the', 'balanc', 'off', 'and', 'then', 'after', 'tri', 'to', 'disput', 'the', 'delinqu', 'with', 'the', 'credit', 'bureaus', '.', 'i', 'have', 'had', 'so', 'much', 'trou

## Lemmatization

Lemmatization is a more refined way of reducing words, using a vocabulary and morphological analysis. The aim is to return the base (dictionary) form of a word, known as its lemma.

Consider the following words:<br>
'Studied', 'Studious', 'Studying'<br>
A stemmer reduces 'studied' and 'studying' to 'studi', which is not a word<br>
A lemmatiser maps them (as verbs) to 'study'; 'studious' is an adjective with its own dictionary form<br>

Difference<br>

Stemming: fast, rule based<br>
Lemmatisation: slower, dictionary based<br>
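The rule-based versus dictionary-based contrast can be sketched with two toy functions. Everything here is an assumption made for illustration: `LEMMA_DICTIONARY` is a hand-written stand-in for a real lexical database, and `toy_lemmatize` and `toy_rule_stem` are hypothetical helpers, not NLTK functions.

```python
# Hand-written lemma dictionary for illustration only; a real lemmatizer
# such as WordNetLemmatizer consults a much larger lexical database.
LEMMA_DICTIONARY = {
    ("studied", "v"): "study",
    ("studying", "v"): "study",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up (word, part of speech); fall back to the lowercased word."""
    return LEMMA_DICTIONARY.get((word.lower(), pos), word.lower())


def toy_rule_stem(word):
    """Rule-based counterpart: chop a trailing 'ing' or 'ed' with no lookup."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for word, pos in [("studied", "v"), ("meeting", "n"), ("meeting", "v")]:
    print(word, pos, "lemma:", toy_lemmatize(word, pos), "stem:", toy_rule_stem(word))
```

The lookup needs a dictionary entry and a part of speech but returns real words; the rule needs neither and is cheaper, at the cost of outputs such as "studi".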

In [48]:
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
lemma_result = [lemma.lemmatize(i) for i in words]
print(lemma_result)

['when', 'my', 'loan', 'wa', 'switched', 'over', 'to', 'navient', 'i', 'wa', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', 'my', 'credit', 'score', 'had', 'been', 'dropped', 'from', 'the', 'xxxx', 'into', 'the', 'xxxx', '.', 'i', 'have', 'been', 'faithful', 'at', 'paying', 'my', 'student', 'loan', '.', 'i', 'wa', 'told', 'that', 'navient', 'wa', 'the', 'company', 'i', 'had', 'delinquency', 'with', '.', 'i', 'contacted', 'navient', 'to', 'resolve', 'this', 'issue', 'you', 'and', 'kept', 'being', 'told', 'to', 'just', 'contact', 'the', 'credit', 'bureau', 'and', 'expalin', 'the', 'situation', 'and', 'maybe', 'they', 'could', 'help', 'me', '.', 'i', 'wa', 'so', 'angry', 'that', 'i', 'just', 'hurried', 'and', 'paid', 'the', 'balance', 'off', 'and', 'then', 'after', 'tried', 'to', 'dispute', 'the', 'delinquency', 'with', 'the', 'credit', 'bureau', '.',

##### Difference 
The word "meeting" can be a noun or a form of the verb "to meet" (e.g., "in last week's meeting" (noun) or "We are meeting next month" (verb)). Given the part of speech, lemmatisation can return the appropriate lemma for each use; stemming applies the same rules regardless of context.

In [49]:
word1 = "in last week's meeting"
word2 = "We are meeting next month"

In [53]:
# WordNetLemmatizer works on single words and takes the part of speech;
# passing a whole sentence returns it unchanged.
print(lemma.lemmatize("meeting", pos="n"))  # noun reading, as in word1
print(lemma.lemmatize("meeting", pos="v"))  # verb reading, as in word2

# The stemmer has no notion of part of speech and gives a single answer
porter_stem = porter.stem("meeting")
print(porter_stem)

meeting
meet
meet


# Pre process

In [64]:
specialChars = [',',';','.','#','$']
def specialCharRemoval(text):
    for char in specialChars:
        text = text.replace(char,'')
    return text
specialCharRemoval(first_complaint)



from nltk.stem import PorterStemmer
ps = PorterStemmer()
#ps.stem('giving')
def stemData(text):
    text = convertToLowerCase(text)
    text = word_tokenize(text)
    newText = []
    for word in text:
        newText.append(ps.stem(word))
    return ' '.join(newText)
stemData(first_complaint)


'when my loan wa switch over to navient i wa never told that i had a deliquint balanc becaus with xxxx i did not . when go to purchas a vehicl i discov my credit score had been drop from the xxxx into the xxxx . i have been faith at pay my student loan . i wa told that navient wa the compani i had delinqu with . i contact navient to resolv thi issu you and kept be told to just contact the credit bureau and expalin the situat and mayb they could help me . i wa so angri that i just hurri and paid the balanc off and then after tri to disput the delinqu with the credit bureau . i have had so much troubl bring my credit score back up .'

In [65]:
def preprocess(text):
    try:
        text = convertToLowerCase(text)
        text = tokenizeWord(text)
        text = specialCharRemoval(' '.join(text))
        text = stopWordRemoval(text)
        text = stemData(text)
    except AttributeError:
        # Non-string input (for example NaN) has no .lower(); return it unchanged.
        pass
    return text
preprocess(first_complaint)

'loan switch navient never told deliquint balanc xxxx go purchas vehicl discov credit score drop xxxx xxxx faith pay student loan told navient compani delinqu contact navient resolv issu kept told contact credit bureau expalin situat mayb could help angri hurri paid balanc tri disput delinqu credit bureau much troubl bring credit score back'

In [67]:
dataSub.head(2)

Unnamed: 0,X,y
1,When my loan was switched over to Navient i wa...,Student loan
2,I tried to sign up for a spending monitoring p...,Credit card or prepaid card


In [68]:
# Assign through a new DataFrame to avoid pandas' SettingWithCopyWarning
dataSub = dataSub.assign(X=dataSub['X'].apply(preprocess))


In [69]:
dataSub.head(2)

Unnamed: 0,X,y
1,loan switch navient never told deliquint balan...,Student loan
2,tri sign spend monitor program capit one let a...,Credit card or prepaid card
