# Clustering and Topic Modeling of Consumer Mortgage Complaints

## Background

The Consumer Financial Protection Bureau (CFPB) was created as part of the Dodd-Frank Wall Street Reform and Consumer Protection Act, and its mission is to empower and educate consumers about financial products. CFPB is also responsible for enforcing financial regulations. Consumer complaints therefore hold useful information both for CFPB and for the consumers it serves. In the following sections we connect to the complaints data and do an exploratory analysis of the narratives provided by consumers.

Along the way we'll highlight findings that may be useful to CFPB in educating and empowering consumers, both within its Division of Consumer Education and Engagement and within its Division of Research, Markets, and Regulations.


## Getting the Data

After importing each package, we'll query the consumer complaints database provided by CFPB by

1. Connecting to the database file
2. Creating a cursor for the connection
3. Creating and executing a SQL query, saving the results to the 'complaints' variable
4. Closing the connection.

In [1]:
import sqlite3
import os 
import nltk
import numpy as np
import scipy

# Raw string so the backslashes in the Windows path are not treated as escape sequences
os.chdir(r"C:\kaggle\cfpb")
con = sqlite3.connect("database.sqlite")
cur = con.cursor()
sqlString = """ 
            SELECT complaint_id, date_received, consumer_complaint_narrative, company
            FROM consumer_complaints
            WHERE product = 'Mortgage' AND
                            consumer_complaint_narrative != ''
            """
cur.execute(sqlString)
complaints = cur.fetchall()
con.close()

### Peeking at the Data

The columns selected -- complaint_id, date_received, consumer_complaint_narrative, company -- are stored as a list of tuples, one tuple per row. As an example, let's randomly (well, sort of...) select a complaint and print its ID, date, narrative, and company.

In [2]:
import random
random.seed(7040)
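# randrange excludes the upper bound, so the result is always a valid index
# (randint(0, len(complaints)) could return len(complaints) and raise an IndexError)
rand_complaint = random.randrange(len(complaints))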
print(rand_complaint)
print(len(complaints))
print(complaints[rand_complaint])

2
14919
(1292137, '03/19/2015', 'XXXX was submitted XX/XX/XXXX. At the time I submitted this complaint, I had dealt with Rushmore Mortgage directly endeavoring to get them to stop the continuous daily calls I was receiving trying to collect on a mortgage for which I was not responsible due to bankruptcy. They denied having knowledge of the bankruptcy, even though I had spoken with them about it repeatedly and had written them repeatedly referencing the bankruptcy requesting them to cease the pursuit, they continued to do so. When they were unable to trick me into paying, force me into paying in retaliation they placed reported to my credit bureaus a past due mortgage amount that had been discharged in Federal Court. On XX/XX/XXXX Rushmore responded the referenced complaint indicating that they would remove the reporting from my bureau, yet it is still there now in XX/XX/XXXX. I would like them to remove it immediately and send me a letter indicating that it should not have been there i

## Some Questions

Our random complaint is voyeuristically interesting (if a little disheartening), but reading it with CFPB in mind raises some questions: "Are there other, similar narratives?". Another que

In [3]:
# Count whitespace-separated words in each narrative (index 2 of each tuple)
total_words = 0
for complaint in complaints:
    total_words += len(complaint[2].split())
average = round(total_words / len(complaints), 2)
print(average, " Average words per complaint")

260.34  Average words per complaint
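An average is only one summary of how long the narratives are. As a minimal sketch, assuming we want a slightly fuller picture of the distribution, the same whitespace split can feed a few NumPy summaries. The `toy_narratives` list below is hypothetical stand-in text, not CFPB data; with the real data you would build `word_counts` from `complaints` instead.

```python
import numpy as np

# Hypothetical stand-ins for complaint narratives (not CFPB data)
toy_narratives = [
    "my lender lost my payment",
    "the servicer reported a late payment to the credit bureaus after my bankruptcy",
    "escrow error",
]

# Words per narrative, using the same whitespace split as the cell above
word_counts = np.array([len(text.split()) for text in toy_narratives])

print("mean:   ", round(float(word_counts.mean()), 2))
print("median: ", float(np.median(word_counts)))
print("min/max:", int(word_counts.min()), int(word_counts.max()))
```

A median well below the mean would suggest that a few very long narratives are pulling the average up.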


So we have a first answer to our question about the distribution of words in complaints, but what about the other questions: can we extract deeper meaning from the text of consumers' narratives to learn about complaint topics? Maybe, but first we need to represent each narrative's text as a vector of word counts, a so-called 'bag-of-words', and stack those vectors into a matrix.

## Document-Term Matrix

The first step in answering the second question is to take our raw text and process it. For each narrative we want two pieces of information

1. What words appear, and
2. How many times each of the words appears

An efficient way to do this is with a Document-Term Matrix (DTM) and a vocabulary. The DTM will have a row for each mortgage complaint and a column for each word in the vocabulary. This gives us an M x N matrix, where M is the number of complaints, N is the number of words in the vocabulary, and the (i, j)th entry is the count of the jth vocabulary word in the ith narrative. That may seem a bit abstract if you're unfamiliar with text analysis and/or linear algebra, so we'll go through an example or two to make it more concrete.
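To make the rows-and-columns picture concrete, here is a small sketch that builds a DTM for three made-up one-line "complaints". The documents and the `toy_` names are hypothetical; only `CountVectorizer` and its `vocabulary_` attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical mini-narratives (not CFPB data)
toy_docs = [
    "foreclosure notice received",
    "refinance denied refinance delayed",
    "foreclosure after refinance denied",
]

toy_vectorizer = CountVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_dtm)   # one row per document, one column per vocabulary word
```

Here M is 3 and N is 7. The second row has a 2 in the 'refinance' column because that word appears twice in the second document.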

First, we extract each complaint so we have a list of complaints.

In [4]:
complaint_list = []
for i in range(len(complaints)):
    complaint_list.append(complaints[i][2])

### Stop Words

In extracting a vocabulary for our text, we want balance: Including all words used is more than we need, but too few and we won't extract any meaningful information. Once we have a vocabulary we'll count up how many times each narrative uses each word in the vocabulary. Those counts will make up the DTM.

The big idea behind creating a DTM is that each document -- in our case mortgage complaint narratives -- can be represented as a vector. Using vectors we can compute things like distance and similarity between narratives. But some words -- like 'the', 'a', 'it' -- occur so frequently in English text they'll be in nearly 100% of the narratives and therefore don't add much value to our DTM. Think of it like this -- if I tell you two narratives use the word 'the' and 'it' five times you likely haven't learned anything about their content, but if I tell you two narratives contain the words 'refinance' and 'foreclosure' 5 times you can begin to make some inferences about what other words they include. 

In text analysis, these frequently occurring terms are known as 'stop words'. There's a list of them in the nltk package that we'll use, but we'll also add some words, picked out while reading the example narrative above, that we want the DTM counter to ignore.

In [5]:
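# Requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing
stopwords = nltk.corpus.stopwords.words('english')
# 'fargo' and 'bank' are separate entries: stop words are matched one token at a time
stopwords.extend(['wells', 'fargo', 'bank', 'america', 'chase', 'x', 'xx', 'xxx', 'xxxx', 'xxxxx',
                'mortgage', 'x/xx/xxxx', '00'])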
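To see what a stop word list does to the vocabulary, the sketch below fits `CountVectorizer` twice on two made-up sentences, once without and once with a tiny stop list. The sentences and the `toy_stopwords` list are hypothetical and much smaller than the NLTK list used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-narratives and a hypothetical three-word stop list
toy_docs = ["the bank lost it", "it was the foreclosure"]
toy_stopwords = ["the", "it", "was"]

with_all = CountVectorizer().fit(toy_docs)
without_stops = CountVectorizer(stop_words=toy_stopwords).fit(toy_docs)

print(sorted(with_all.vocabulary_))        # every word becomes a column
print(sorted(without_stops.vocabulary_))   # only the content words remain
```

The filtered vocabulary keeps just 'bank', 'foreclosure', and 'lost', which are the words that actually distinguish the two sentences.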

### Words to Vectors
Now that we've got our stop words, we can create a CountVectorizer object with our stop word list and feed it our complaint list. Then we'll coerce the matrix and vocabulary to numpy arrays because they have more methods that we'll use in later computations. 

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words = stopwords)
dtm = vectorizer.fit_transform(complaint_list)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out()
dtm = dtm.toarray()
vocab = np.array(vocab)

## Using the DTM 

Now that we've processed and vectorized our text some doors have opened. For example, for any two narratives we could compute the

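- **Cosine Similarity**: A measure of how similar two narratives are, computed as the cosine of the angle between their vectors. A value of 1 corresponds to an angle of 0 degrees (vectors pointing in the same direction), and a value of 0 corresponds to an angle of 90 degrees (orthogonal vectors, meaning the narratives share no vocabulary words).
- **Euclidean Distance**: The straight-line distance between two narrative vectors. A value of 0 means the two narratives have identical word counts, and larger values mean they differ more.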
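To make the two measures concrete, here is a small NumPy sketch on three made-up count vectors. The vectors and the helper names `cosine_sim` and `euclidean_dist` are hypothetical and written out by hand for illustration; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

# Hypothetical word-count vectors over a 4-word vocabulary
a = np.array([2, 0, 1, 0])
b = np.array([4, 0, 2, 0])   # same words in the same proportions as a, but twice as long
c = np.array([0, 3, 0, 1])   # shares no words with a

def cosine_sim(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_dist(u, v):
    """Straight-line distance between u and v."""
    return float(np.linalg.norm(u - v))

print(cosine_sim(a, b), euclidean_dist(a, b))
print(cosine_sim(a, c), euclidean_dist(a, c))
```

Note that `a` and `b` have a cosine similarity of (approximately) 1 even though their Euclidean distance is not 0. Cosine similarity ignores document length, which is one reason it is a common choice when narratives vary a lot in word count.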

We could use the DTM, for example, to find the narratives which are most similar to each other. As an example, we'll loop through the DTM, measure the similarity of each narrative to the Rushmore complaint we started with (the randomly selected one from the beginning), and return the complaint that's most similar. 

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
 
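sim = 0
closest = None  # stays None if no narrative shares a word with complaint 2
for i in range(len(dtm)):
    if i == 2:
        continue  # skip comparing the complaint with itself
    s = cosine_similarity(dtm[2].reshape(1,-1), dtm[i].reshape(1,-1))[0][0]
    if s > sim:
        sim = s
        closest = i
print(sim, closest)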

0.321863154789 11740


So we have the complaint we started with and another that is the most similar according to the similarity measurement we defined. But what about a human reader? Would a person reading the texts notice any similarities? Let's print them out one after the other and see.

In [8]:
print("COMPLAINT 2: ",complaints[2])
print("COMPLAINT ", closest,": ", complaints[closest])

COMPLAINT 2:  (1292137, '03/19/2015', 'XXXX was submitted XX/XX/XXXX. At the time I submitted this complaint, I had dealt with Rushmore Mortgage directly endeavoring to get them to stop the continuous daily calls I was receiving trying to collect on a mortgage for which I was not responsible due to bankruptcy. They denied having knowledge of the bankruptcy, even though I had spoken with them about it repeatedly and had written them repeatedly referencing the bankruptcy requesting them to cease the pursuit, they continued to do so. When they were unable to trick me into paying, force me into paying in retaliation they placed reported to my credit bureaus a past due mortgage amount that had been discharged in Federal Court. On XX/XX/XXXX Rushmore responded the referenced complaint indicating that they would remove the reporting from my bureau, yet it is still there now in XX/XX/XXXX. I would like them to remove it immediately and send me a letter indicating that it should not have been t

The narratives are definitely unique, but it is reassuring -- at least in validating our similarity measurement -- that both complaints stem from the same issue: lenders failing to stop reporting to credit bureaus after a customer's bankruptcy. So how could CFPB use this information?

1. When complaints are filed the most similar complaint -- or even the 10 most similar complaints -- could be used for outreach and connecting consumers. When filing a complaint you're probably interested in how similar complaints were resolved. 

2. Clustering topics to learn more about categories of consumer complaints. Knowing about clusters would allow CFPB to identify systemic issues. 

3. Using the vector representations CFPB could identify words distinctive to particular categories. 

## Applying the Similarity Measure

### Finding Nearest-Neighbors

Now that we've defined a measure of 'similarity', we can use it to sort the narratives into buckets based on their proximity to each other. Again, the DTM is an array of arrays, and each inner array tallies the number of times each vocab word occurs in one narrative. Using the Rushmore complaint as an example, here are the words it contains and their counts:

In [9]:
for i in range(len(vocab)):
    if dtm[2][i] != 0:
        print(vocab[i], dtm[2][i])

agency 1
amount 1
anyone 1
assist 1
away 1
bankruptcy 3
bullying 1
bureau 1
bureaus 2
calls 1
cease 1
cfpb 1
collect 1
complaint 2
continued 1
continues 1
continuous 1
court 1
credit 2
daily 1
damaging 1
dealt 1
denied 1
denying 1
directly 1
discharged 1
discrimination 1
due 2
endeavoring 1
even 1
exposed 1
federal 1
first 1
force 1
get 1
going 1
holder 1
identified 1
immediately 1
indicating 2
intent 1
involved 1
kind 1
knowledge 1
letter 1
like 1
needs 1
new 1
note 1
one 1
past 1
paying 2
penalties 1
place 1
placed 1
please 1
practices 1
procuring 1
pursuit 1
racial 1
reasons 1
receiving 1
referenced 1
referencing 1
remove 3
repeatedly 2
reported 1
reporting 1
represented 1
requesting 1
resolution 1
responded 1
responsible 1
retaliation 1
rushmore 3
send 1
servicing 1
speaking 1
spoken 1
still 1
stop 2
submitted 2
tactics 1
though 1
time 1
trick 1
trying 1
unable 1
walking 1
would 2
written 1
yet 1


With that in mind, let's take a look at the 5 nearest neighbors to the Rushmore complaint and see what, if anything, they have in common. 

In [10]:
# Add (itemValue, itemIndex) to lst and return the 5 entries with the largest
# values, sorted from least to most similar
def addItem(itemValue, itemIndex, lst):
    newList = lst + [(itemValue, itemIndex)]
    newList = sorted(newList)
    while len(newList) > 5:
        newList.pop(0)
    return newList

In [11]:
nearestNeighbors=[]
for i in range(len(dtm)):
    if i == 2:
        continue
    value = cosine_similarity(dtm[2].reshape(1,-1),dtm[i].reshape(1,-1))[0][0]
    nearestNeighbors = addItem(value, i, nearestNeighbors)

In [12]:
nearestNeighbors

[(0.26159188238936149, 2234),
 (0.26276513239651561, 12074),
 (0.26901283624863742, 8960),
 (0.2837634597609297, 4600),
 (0.32186315478935912, 11740)]
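The loop above calls `cosine_similarity` once per narrative. As a sketch of an alternative, assuming the same DTM layout, a single call can score every row against the query row and `np.argsort` can pick out the top k. The `toy_dtm`, `query`, and `k` values below are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 5-document DTM over a 4-word vocabulary
toy_dtm = np.array([
    [1, 1, 0, 0],   # row 0: the query document
    [1, 1, 0, 0],   # same counts as the query
    [1, 0, 0, 0],
    [0, 0, 1, 1],   # shares no words with the query
    [2, 1, 0, 0],
])
query = 0
k = 2

# One call scores the query row against every row of the DTM
sims = cosine_similarity(toy_dtm[query].reshape(1, -1), toy_dtm)[0]
sims[query] = -1.0                     # exclude the query itself
top_k = np.argsort(sims)[-k:][::-1]    # indices of the k largest, most similar first

print(top_k, sims[top_k])
```

Unlike the sorted list built by `addItem`, this returns the neighbors with the most similar one first.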

In [13]:
for tpl in nearestNeighbors:
    print(complaints[tpl[1]])

(1398458, '05/29/2015', "My mortgage is being serviced through Seterus Inc . They have reported to the credit reporting agencies as a Chapter XXXX bankruptcy. I am not in bankruptcy and have never been late. My husband and I work in the XXXX and can loose our jobs for poor credit and bankruptcy. I called seterus and they told me that it was a debtor mismatch and it would be corrected. I received a letter from them on XXXX XXXX stating it has been corrected. As of my new credit report is it still showing chapter XXXX and now they have the loan as closed/no balance. I have called XXXX XXXX ( the Seterus rep that sent me a letter ) and she will not return my calls. My credit score has dropped over XXXX points in their reporting error and I can not get any help in correcting. My livelihood is at risk, we ca n't move to a new house and I ca n't re-fi my mortgage to get away from them due to their error. \n", 'Seterus, Inc.')
(1736556, '01/12/2016', "I had filed and was awarded bankruptcy in

In [22]:
companies = np.array([complaints[i][3] for i in range(len(complaints))])
companies_unique = sorted(set(companies))
print(len(companies_unique))

504


Now we'll create an array of zeros the size of our vocabulary for each of the 504 companies. Then, for each company, we'll fill its array with the sum of the company's individual complaint vectors from the DTM we created earlier.

In [30]:
# Start with an empty array for each company
dtm_companies = np.zeros((len(companies_unique), len(vocab)))
# Now, for each company we'll store the sum of the frequency of each vocab
# word in the dtm_companies array
for i, company in enumerate(companies_unique):
    dtm_companies[i, :] = np.sum(dtm[companies == company, :], axis=0) 
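The expression `dtm[companies == company, :]` relies on NumPy boolean masking: the comparison yields one True/False per complaint, and only the True rows are kept before summing. Here is the same pattern on a tiny example, where the company names and counts are made up.

```python
import numpy as np

# Hypothetical DTM: 3 complaints over a 3-word vocabulary
toy_dtm = np.array([
    [1, 0, 2],
    [0, 1, 0],
    [3, 0, 1],
])
# Hypothetical company for each complaint (rows 0 and 2 share a company)
toy_companies = np.array(["A Bank", "B Lender", "A Bank"])
toy_unique = sorted(set(toy_companies))

toy_dtm_companies = np.zeros((len(toy_unique), toy_dtm.shape[1]))
for i, company in enumerate(toy_unique):
    mask = toy_companies == company          # one boolean per complaint
    toy_dtm_companies[i, :] = toy_dtm[mask].sum(axis=0)

print(toy_unique)
print(toy_dtm_companies)
```

The first output row is the column-wise sum of the two "A Bank" complaints, and the second is just the single "B Lender" complaint.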

In [34]:
dist = 1 - cosine_similarity(dtm_companies)

504


In [42]:
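from scipy.cluster.hierarchy import ward, dendrogram
# Note: ward() treats a 2-D array as one observation vector per row, so each
# company is clustered on its row of cosine distances to every other company
linkage_matrix = ward(dist)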

In [46]:
import matplotlib.pyplot as plt
plt.figure()
dgram = dendrogram(linkage_matrix, orientation="right")
plt.show()
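A dendrogram with 504 leaves can be hard to read, so one option is to cut the tree into a fixed number of flat clusters. The sketch below is an alternative under stated assumptions, not the Ward call used above: it computes condensed cosine distances with `pdist`, uses average linkage, and asks `fcluster` for two clusters. The `toy_company_vectors` array is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical per-company word-count vectors (4 companies, 4 vocabulary words)
toy_company_vectors = np.array([
    [5, 1, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 3, 2],
    [0, 1, 4, 2],
])

# Condensed pairwise cosine distances, the input format linkage() expects
condensed = pdist(toy_company_vectors, metric="cosine")

# Average linkage is a swapped-in choice here; it accepts arbitrary distances
toy_linkage = linkage(condensed, method="average")

# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(toy_linkage, t=2, criterion="maxclust")
print(labels)
```

Companies that receive the same label use similar vocabulary in their complaints, which gives a starting point for reading the narratives in each group.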