## Consumer Financial Protection Bureau (CFPB) - Consumer Complaint Database
The Consumer Complaint Database is a collection of complaints about consumer financial products and services that we sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database.

[CFPB](https://www.consumerfinance.gov/data-research/consumer-complaints/)

### What are we solving?

#### Classifying unclassified issues based on the complaint narrative by topic modeling

In [200]:
# import required packaages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings


import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
from nltk.stem import PorterStemmer 
ps= PorterStemmer()
from nltk.corpus import stopwords 

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, TfidfTransformer

from sklearn.model_selection import train_test_split 

In [155]:
# import the dataset - this is a filtered dataset(2018,2019) for easier processing for the training
df = pd.read_csv("complaints_2018_2019.csv")
predict_df = pd.read_csv("predict.csv")
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,09/17/16,Credit card,,Other,,Credit privilege are restricted on my existing...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",FL,328XX,Older American,Consent provided,Web,09/17/16,Closed with explanation,Yes,No,2117531
1,01/22/16,Credit card,,Other,,I have a questions.Last year I received notice...,Company chooses not to provide a public response,"CITIBANK, N.A.",WI,,,Consent provided,Web,01/22/16,Closed with explanation,Yes,No,1753260
2,05/22/15,Credit card,,Other,,My client whom I have POA for has been signed ...,,"Paypal Holdings, Inc",AZ,852XX,Older American,Consent provided,Web,05/27/15,Closed with explanation,Yes,No,1389183
3,01/08/17,Credit card,,Other,,"Approximately XX/XX/2016, I was contacted via ...",,Affiliates Management Company,FL,,,Consent provided,Web,01/10/17,Closed with explanation,No,Yes,2279929
4,03/30/17,Credit card,,Other,,I contacted the billing department regarding t...,Company has responded to the consumer and the ...,SYNCHRONY FINANCIAL,GA,303XX,,Consent provided,Web,03/30/17,Closed with non-monetary relief,Yes,Yes,2408837


In [156]:
# To simplify our dataset we keep only relevant text data and extract complaint id's to a new df

complaint_ids = df.filter(['Complaint ID'], axis=1)
df = df[['Consumer complaint narrative','Issue']] # dropping irrelevant columns for easier demo
df = df.dropna(subset=['Consumer complaint narrative']) # drop any rows where the narrative is NaN
df = df.replace(np.nan, '', regex=True) ## replacing Nan with Empty String

display(df.count())
display(df.head())

Consumer complaint narrative    238771
Issue                           238771
dtype: int64

Unnamed: 0,Consumer complaint narrative,Issue
0,I opened a credit account at MACYs Department ...,Incorrect information on your report
1,"According to the Fair Credit Reporting Act, Se...",Incorrect information on your report
2,"I am submitting this complaint against, XXXX X...",Took or threatened to take negative or legal a...
4,"Hi, My Name is XXXX XXXX. \n\nI took a master ...",Problem with a credit reporting company's inve...
5,I have am looking into my credit report and I ...,Incorrect information on your report


## Preprocessing:

* Convert all text to lowercase
* Remove punctuation
* Remove any numbers if unnecessary
* Remove Stopwords
* Perform Stemming/Lemmetization
* Perform any dataset specific cleanup. Ex: Removing X's from this dataset (X's are used to mask any personal information)

**Do explore the dataset to identify additional dataset specific pre-processing**

### Other recommended common pre-processing techniques:
* Removing most frequent words that does not add much value to the analysis
* Normalizing

etc.,

In [186]:
# preprocessing text
def Pre_Process_Text(raw):
    """
    preprocessing such as lower case conversion, removal of numbers, punctuations stopwords,stemming are performed
    """

    lower_case = raw.lower() #convert to lower case
    
    no_punctuation = ''.join(c for c in lower_case if c not in punctuation) #remove any punctuation

    no_digit = ''.join([c for c in no_punctuation if not c.isdigit()]) #remove any digit

    stemming = ps.stem(no_digit).split() # stemming
    #lemmatize = wordnet_lemmatizer.lemmatize(word) for word in no_digit
    
    
    stopWords = set(stopwords.words('english'))      #remove common words using nltk dictionary
    processed_text = [j for j in stemming if not j in stopWords] 
    
    processed_text = " ".join(processed_text)
    #processed_text = nltk.word_tokenize(processed_text) #tokenization


    return(processed_text)

In [187]:
df_small = df.iloc[:10,]
df_small_clean = df_small.applymap(lambda x: Pre_Process_Text(x))
#df_clean = df.applymap(lambda x: Pre_Process_Text(x))
display(df_small.head())
display(df_clean.head())

Unnamed: 0,Consumer complaint narrative,Issue
0,I opened a credit account at MACYs Department ...,Incorrect information on your report
1,"According to the Fair Credit Reporting Act, Se...",Incorrect information on your report
2,"I am submitting this complaint against, XXXX X...",Took or threatened to take negative or legal a...
4,"Hi, My Name is XXXX XXXX. \n\nI took a master ...",Problem with a credit reporting company's inve...
5,I have am looking into my credit report and I ...,Incorrect information on your report


Unnamed: 0,Consumer complaint narrative,Issue
0,"[opened, credit, account, macys, department, s...","[incorrect, information, report]"
1,"[according, fair, credit, reporting, act, sect...","[incorrect, information, report]"
2,"[submitting, complaint, xxxx, xxxx, rereaging,...","[took, threatened, take, negative, legal, act]"
4,"[hi, name, xxxx, xxxx, took, master, credit, c...","[problem, credit, reporting, companys, investi..."
5,"[looking, credit, report, found, errors, addre...","[incorrect, information, report]"


In [182]:
##TODO: Identify and remove continiously occuring XXXXs

test = 'XXXX'
import collections
def test_counter(xs):
    freq = collections.Counter(xs)
    for k in freq:
        if freq[k] > 1:
            return k
        
print(test_counter(test))

X


In [216]:
# Term Frequency - Inverse Document Frequency using scikit-learn
cv = CountVectorizer()
count_vector = cv.fit_transform(df_small_clean['Consumer complaint narrative'])
#count_vector = pd.DataFrame(tfidf.toarray(), columns=cv.get_feature_names())
#count_vector.index = df_small_clean.index
#print(count_vector)
'''
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf = tfidf_transformer.fit(count_vector)
# print idf values
df_tfidf = pd.DataFrame(tfidf.idf_, index=cv.get_feature_names(),columns=["idf_weights"])
 
df_tfidf
'''

# Create feature vectors 
vectorizer = TfidfVectorizer()
# Train the feature vectors
tfidf = vectorizer.fit_transform(df_small_clean['Consumer complaint narrative'])
# Apply model on test data 
#test_vectors = vectorizer.transform(test_data)
tfidf

<10x567 sparse matrix of type '<class 'numpy.float64'>'
	with 763 stored elements in Compressed Sparse Row format>