# Mario Alberto Moctezuma Salazar

In [3]:
import pandas as pd
import sklearn
import numpy as np

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [8]:
Complaints = pd.read_csv('Consumer_Complaints.csv', low_memory=False)

In [11]:
Complaints.sample(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
81040,09/16/2015,Credit reporting,,Incorrect information on credit report,Account terms,The company is disputing an item from XX/XX/XX...,Company chooses not to provide a public response,"TransUnion Intermediate Holdings, Inc.",NY,112XX,,Consent provided,Web,09/16/2015,Closed with explanation,Yes,No,1567736
8207,11/12/2013,Credit reporting,,Incorrect information on credit report,Information is not mine,,,Equifax,TX,76901,,,Web,11/12/2013,Closed with non-monetary relief,Yes,No,593418
387598,06/25/2013,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,Ocwen,NY,10901,,,Referral,07/03/2013,Closed with explanation,Yes,No,441794
579658,05/03/2016,Credit reporting,,Credit reporting company's investigation,No notice of investigation status/result,I 'm filing this complaint Experian has ignore...,Company has responded to the consumer and the ...,Experian,DE,198XX,,Consent provided,Web,05/03/2016,Closed with non-monetary relief,Yes,No,1907353
179439,09/19/2014,Credit reporting,,Incorrect information on credit report,Information is not mine,,,Equifax,FL,33317,,,Web,09/19/2014,Closed with explanation,Yes,Yes,1036021


<hr>
<h1> <center> ML for product prediction </center> </h1>
<hr>

## *Approach 1:* MultinomialNB applied to `Consumer complaint narrative` 

### *Why this model?* Multinomial naive Bayes treats each text as a bag of words (word counts, ignoring word order) and predicts the product under which those words are most likely.
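As a minimal sketch of the bag-of-words idea, the cell below fits `MultinomialNB` on word counts from four toy complaints. The texts and labels are hypothetical and are not taken from `Consumer_Complaints.csv`; the default smoothing (`alpha=1.0`) is used here only to keep the toy example well behaved.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy complaints, not rows from Consumer_Complaints.csv
texts = [
    "wrong balance on my credit report",
    "credit report shows an account that is not mine",
    "mortgage payment was applied late",
    "foreclosure started during loan modification",
]
labels = ["Credit reporting", "Credit reporting", "Mortgage", "Mortgage"]

# Bag of words: one row per text, one column per word, entries are counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

# Multinomial naive Bayes on the counts (default Laplace smoothing)
model = MultinomialNB()
model.fit(counts, labels)

# Words never seen in training ("has", "error") are simply ignored
new_text = ["my credit report has an error"]
predicted = model.predict(vectorizer.transform(new_text))[0]
print(predicted)
```

The words shared with the credit-reporting examples ("credit", "report", "my", "an") drive the prediction; word order plays no role.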

### First we create a `DataFrame` containing only the rows whose `Consumer complaint narrative` is not `NaN`.
### Here we take
``
X1 = Nar_complaints['Consumer complaint narrative']
y1 = Nar_complaints.Product
``
### Then we build a pipeline that vectorizes the text with TF-IDF, removes English stop words, and applies `MultinomialNB`.

In [4]:
Nar_complaints = Complaints.dropna(subset=['Consumer complaint narrative'])
Nar_complaints = Nar_complaints.reset_index(drop=True)

X1 = Nar_complaints['Consumer complaint narrative']
y1 = Nar_complaints.Product

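# TF-IDF features: drop English stop words, terms found in more than 95% of
# the narratives, and terms found in fewer than 2; keep the 1000 most frequent.
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')),
    ('clf', MultinomialNB(alpha=1.0e-10)),  # almost no smoothing
])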
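To see what the vocabulary filters in the vectorizer do, here is a small sketch on three hypothetical sentences (not rows from the data set). It keeps `stop_words='english'` and `min_df=2` from the pipeline above and leaves out `max_df` and `max_features`, which would have no effect on such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the filters
docs = [
    "the bank charged a late fee",
    "the bank closed my account",
    "a late payment on my loan",
]

# stop_words='english' removes words such as "the", "my", "on";
# min_df=2 then drops every term that appears in fewer than 2 documents
vec = TfidfVectorizer(min_df=2, stop_words='english')
tfidf = vec.fit_transform(docs)

kept_terms = sorted(vec.vocabulary_)
print(kept_terms)
print(tfidf.shape)
```

Only the terms that survive both filters become columns of the TF-IDF matrix, so rare words and very common words never reach the classifier.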

### Next we write our own function that fits the pipeline on each training fold and computes the accuracy on the held-out fold, so that every fold's score is printed explicitly (scikit-learn's `cross_val_score` can also run a `Pipeline` directly on raw text). In the second cell below we can see the accuracy score of each fold and their mean:

In [5]:
kf = KFold(n_splits=5, random_state=1, shuffle=True)


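def kfold(X, y, kf, pipe):
    """Fit `pipe` on each training fold, print each fold's accuracy, return the mean."""
    acc_list = []
    for train_idx, test_idx in kf.split(X):
        # kf.split yields positional indices, so select rows with .iloc
        pipe.fit(X.iloc[train_idx], y.iloc[train_idx])
        predd = pipe.predict(X.iloc[test_idx])
        acc = accuracy_score(y.iloc[test_idx], predd)
        print(acc)
        acc_list.append(acc)  # collect the score of every fold
    return np.mean(acc_list)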
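For comparison, the same per-fold accuracy can be obtained with scikit-learn's `cross_val_score`, which accepts a `Pipeline` and raw text. The sketch below uses ten hypothetical toy complaints and default vectorizer settings, so its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the narratives and products
texts = [
    "credit report shows wrong balance",
    "account on my credit report is not mine",
    "credit bureau ignored my dispute",
    "incorrect information on credit report",
    "credit report investigation never finished",
    "mortgage payment applied late",
    "loan modification was denied",
    "foreclosure notice on my mortgage",
    "escrow error on mortgage statement",
    "mortgage servicer lost my payment",
]
labels = ["Credit reporting"] * 5 + ["Mortgage"] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# One accuracy value per fold; the pipeline is refit on every training fold
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
print(scores)
print(scores.mean())
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned from the training fold only, just as in the hand-written loop.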

In [6]:
kfold(X1, y1, kf, text_clf) 

0.7874983653720413
0.7918573732618456
0.7915086526306613
0.7918573732618456
0.7884045335658239


0.7902252596184435

### We see that the accuracy scores above differ by less than half a percentage point, which suggests that the result does not depend much on which subset is held out for testing. Based on the mean accuracy over the folds, we can predict the product from the text of the `Consumer complaint narrative` with about 79% accuracy.
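To put a number on how close the folds are, the short sketch below recomputes the mean and the range from the five fold accuracies printed above (the values are copied from that output).

```python
import numpy as np

# Fold accuracies copied from the output of kfold(X1, y1, kf, text_clf)
fold_acc = np.array([
    0.7874983653720413,
    0.7918573732618456,
    0.7915086526306613,
    0.7918573732618456,
    0.7884045335658239,
])

mean_acc = fold_acc.mean()
spread = fold_acc.max() - fold_acc.min()
print(f"mean = {mean_acc:.4f}, max - min = {spread:.4f}")
# prints: mean = 0.7902, max - min = 0.0044
```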

## *Approach 2:* MultinomialNB applied to `Issue`

### Since the `Consumer complaint narrative` column has many `NaN` values, we now use the `Issue` column with `MultinomialNB`:
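The share of missing values per column can be checked with `isna().mean()`. The sketch below uses a toy frame that mirrors the five sample rows shown earlier (narratives shortened), so the numbers are illustrative and are not the shares in the full data set.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the five sampled rows: two narratives present, three missing
toy = pd.DataFrame({
    "Consumer complaint narrative": [
        "The company is disputing an item",
        np.nan,
        np.nan,
        "I'm filing this complaint",
        np.nan,
    ],
    "Issue": [
        "Incorrect information on credit report",
        "Incorrect information on credit report",
        "Loan modification,collection,foreclosure",
        "Credit reporting company's investigation",
        "Incorrect information on credit report",
    ],
})

# Fraction of NaN entries in each column
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data the same call would be `Complaints[['Consumer complaint narrative', 'Issue']].isna().mean()`.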


### Here we take
``
X2 = Complaints.Issue
y2 = Complaints.Product
``
### Then we build a pipeline that vectorizes with `CountVectorizer()` followed by `TfidfTransformer()`, without removing stop words, since the values of the `Issue` column are quite standardized. Applying the method we get:

In [7]:
X2 = Complaints.Issue
y2 = Complaints.Product
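As a side note, `CountVectorizer` followed by `TfidfTransformer` should produce the same matrix as a single `TfidfVectorizer` with default settings, so the two approaches differ in their vectorizer options rather than in the kind of features. A quick check on issue strings taken from the sample rows shown earlier:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Issue values copied from the sampled rows
issues = [
    "Incorrect information on credit report",
    "Loan modification,collection,foreclosure",
    "Credit reporting company's investigation",
    "Incorrect information on credit report",
]

# Two-step version: raw counts, then TF-IDF weighting
two_step = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer())])
# One-step version with default settings
one_step = TfidfVectorizer()

a = two_step.fit_transform(issues).toarray()
b = one_step.fit_transform(issues).toarray()

same = np.allclose(a, b)
print(same)
```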

In [8]:
text_clf2 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha=1.0e-10)),
])

kfold(X2, y2, kf, text_clf2)

0.9844915001491202
0.9851774530271399
0.9842156277960036
0.9849834848157234
0.9846703300800036


0.9847076791735981

### The accuracy scores above again differ very little from fold to fold (about a tenth of a percentage point), but they are *substantially* better than those of the first approach. This is possibly due to the "standard" values of the `Issue` column.

### Here we predict the product with about 98.5% accuracy.
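One way to probe the "standard values" explanation is to count how many distinct products each `Issue` value occurs with: if most issues belong to a single product, the issue text alone nearly determines the label. The sketch below uses a toy frame. The first four rows follow the sample rows shown earlier, while the two "Billing disputes" rows and their products are hypothetical, added only to show an ambiguous case.

```python
import pandas as pd

toy = pd.DataFrame({
    "Issue": [
        "Incorrect information on credit report",
        "Incorrect information on credit report",
        "Loan modification,collection,foreclosure",
        "Credit reporting company's investigation",
        "Billing disputes",  # hypothetical ambiguous issue
        "Billing disputes",  # hypothetical ambiguous issue
    ],
    "Product": [
        "Credit reporting",
        "Credit reporting",
        "Mortgage",
        "Credit reporting",
        "Credit card",
        "Bank account",
    ],
})

# Number of distinct products seen with each issue
products_per_issue = toy.groupby("Issue")["Product"].nunique()

# Share of issues that map to exactly one product
share_single = (products_per_issue == 1).mean()
print(products_per_issue)
print(share_single)
```

On the real data the same check would be `Complaints.groupby('Issue')['Product'].nunique()`; a share close to 1 would support the explanation above.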