<a href="https://colab.research.google.com/github/jyamaoka/Misc-Projects/blob/master/fr_dc_20190525.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# THE DATA


---


```
Each week we send thousands of consumer complaints about financial products and services to companies for response. Data from those complaints helps us understand the financial marketplace and protect consumers.
```
From: https://www.consumerfinance.gov/complaint/data-use/





# GET DATA


---


Kaggle datasets can be downloaded through the Kaggle API, which authenticates with a token stored in `kaggle.json`.

In [0]:
from google.colab import files
files.upload() # upload kaggle token json

In [0]:
#
# install and set up kaggle api software
!pip install -q kaggle
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # restrict the token file to the current user

In [0]:
#
# get the data
!kaggle datasets download -d selener/consumer-complaint-database -p /content
!unzip consumer-complaint-database.zip 


# LOAD THE DATA


---



In [0]:
import pandas as pd
import numpy as np

In [0]:
df = pd.read_csv('rows.csv',low_memory=False)

In [8]:
df.head(2).T

| | 0 | 1 |
|---|---|---|
| Date received | 05/10/2019 | 05/10/2019 |
| Product | Checking or savings account | Checking or savings account |
| Sub-product | Checking account | Other banking product or service |
| Issue | Managing an account | Managing an account |
| Sub-issue | Problem using a debit or ATM card | Deposits and withdrawals |
| Consumer complaint narrative | | |
| Company public response | | |
| Company | NAVY FEDERAL CREDIT UNION | BOEING EMPLOYEES CREDIT UNION |
| State | FL | WA |
| ZIP code | 328XX | 98204 |




# QUESTION 1: 


---


## Please identify the data columns from the following set that can benefit from classification?

Several columns could benefit from classification:


1.   **Product**/**Sub-product**
2.   **Issue**/**Sub-issue**
3.  **Consumer disputed?**
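
One quick way to spot columns like these is to count distinct values per column: a column with few distinct values relative to the number of rows behaves like a label, while free text is nearly unique per row. The sketch below assumes a small made-up frame (`toy`) that borrows the column names from the complaint data; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with the same column names as the complaint data
toy = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Mortgage", "Credit card"],
    "Issue": ["Late fee", "Wrong amount", "Escrow", "Late fee"],
    "Consumer disputed?": ["No", "Yes", "No", "No"],
    "Consumer complaint narrative": ["text a", "text b", "text c", "text d"],
})

# Distinct values per column; few distinct values relative to the row count
# suggests a categorical label rather than free text
cardinality = toy.nunique().sort_values()
print(cardinality)
```

On the real data the same call would be `df.nunique()`; where to draw the line between "label" and "free text" is a judgment call.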








# QUESTION 2: 


---


## Choose one of the columns, identified in #1 and create a taxonomy for classified types.
Let's create a taxonomy for the Product column

In [10]:
#
# clean up data to have just product, sub-product and consumer complaint narrative	
df_slim = df[['Product','Sub-product','Consumer complaint narrative']].copy()


df_slim.groupby(['Product']).count()[['Consumer complaint narrative']]

| Product | Consumer complaint narrative |
|---|---|
| Bank account or service | 14885 |
| Checking or savings account | 12881 |
| Consumer Loan | 9474 |
| Credit card | 18838 |
| Credit card or prepaid card | 21379 |
| Credit reporting | 31588 |
| Credit reporting, credit repair services, or other personal consumer reports | 92378 |
| Debt collection | 86710 |
| Money transfer, virtual currency, or money service | 5466 |
| Money transfers | 1497 |


Several classes overlap and can sensibly be merged, for instance 'Money transfer, virtual currency, or money service' and 'Money transfers'. I make this simplification below.

In [0]:
#
# merge legacy product names into their current, broader equivalents
df_slim.replace({'Product': {'Credit reporting':'Credit reporting, credit repair services, or other personal consumer reports', 
                             'Credit card': 'Credit card or prepaid card',
                             'Prepaid card': 'Credit card or prepaid card',
                             'Payday loan': 'Payday loan, title loan, or personal loan',
                             'Money transfers': 'Money transfer, virtual currency, or money service',
                             'Virtual currency': 'Money transfer, virtual currency, or money service'}}, inplace=True)
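
As a quick sanity check on the merge above, the following sketch applies the same kind of mapping to a small made-up frame and confirms that no legacy label survives. The toy data and the names `toy` and `merge_map` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df_slim
toy = pd.DataFrame({
    "Product": ["Credit card", "Prepaid card", "Money transfers",
                "Credit card or prepaid card", "Mortgage"],
})

# Same idea as the mapping above: legacy label -> current label
merge_map = {
    "Credit card": "Credit card or prepaid card",
    "Prepaid card": "Credit card or prepaid card",
    "Money transfers": "Money transfer, virtual currency, or money service",
}

# Series.replace with a dict maps exact matches and leaves other values untouched
toy["Product"] = toy["Product"].replace(merge_map)

# Legacy labels that still appear after the merge (should be empty)
leftover = set(toy["Product"]) & set(merge_map)
n_classes = toy["Product"].nunique()
print(leftover, n_classes)
```

Running the same two checks on `df_slim['Product']` would show whether any legacy label was missed and how many classes remain.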

## Our taxonomy will be these 12 classes.   

In [12]:
df_slim.groupby(['Product']).count()[['Consumer complaint narrative']]

| Product | Consumer complaint narrative |
|---|---|
| Bank account or service | 14885 |
| Checking or savings account | 12881 |
| Consumer Loan | 9474 |
| Credit card or prepaid card | 41667 |
| Credit reporting, credit repair services, or other personal consumer reports | 123966 |
| Debt collection | 86710 |
| Money transfer, virtual currency, or money service | 6979 |
| Mortgage | 52987 |
| Other financial service | 292 |
| Payday loan, title loan, or personal loan | 6168 |






# QUESTION 3: 


---


## Based on a created taxonomy, describe the algorithm for classification.
We now have a multiclass classification problem.  Using the Consumer complaint narrative, we can analyze the text to create features for a classifier.  There are several options for how to create these features.  I suggest trying term frequency-inverse document frequency (TF-IDF) features with an ensemble tree method such as XGBoost.
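
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed-IDF, sublinear-TF, L2-normalized variant (the formulas scikit-learn's `TfidfVectorizer` uses when `sublinear_tf=True` and the other options are left at their defaults). The mini-corpus and the helper `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; real narratives are far longer
docs = [
    "late fee on my credit card",
    "debt collector called about a debt i do not owe",
    "credit report shows a debt i already paid",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def tfidf_vector(tokens):
    """TF-IDF weights for one document, L2-normalized."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sublinear_tf = 1.0 + math.log(tf)  # dampen repeated terms
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed IDF
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[1])
top_term = max(vec, key=vec.get)
print(top_term, round(vec[top_term], 3))
```

Terms that are frequent within one complaint but rare across the corpus get the largest weights, which is what lets a tree model split on them.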





# EXTRA CREDIT: 


---


## Let's see if we can make it work.


In [0]:
# drop those with no narrative
df_cleaned = df_slim.dropna()

In [0]:
from sklearn import model_selection, preprocessing

# label encode the target variable 
encoder = preprocessing.LabelEncoder()
y       = encoder.fit_transform(df_cleaned['Product'])
X       = df_cleaned['Consumer complaint narrative']

#train test split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state = 42)

### TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2),
                        sublinear_tf=True, min_df=5,
                        stop_words='english')

X_train_tfidf =  tfidf.fit_transform(X_train)
X_test_tfidf  =  tfidf.transform(X_test)

### XGBoost

In [0]:
import xgboost as xgb

# use DMatrix for xgboost
dtrain = xgb.DMatrix(X_train_tfidf, label=y_train)
dtest = xgb.DMatrix(X_test_tfidf, label=y_test)

In [0]:
param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # learning rate (step-size shrinkage applied after each boosting round)
    'silent': 0,  # logging mode: 0 prints messages, 1 is quiet (newer XGBoost releases use 'verbosity' instead)
    'objective': 'multi:softprob',  # multiclass objective that outputs a probability per class
    'num_class': 12}  # the number of classes in this dataset
num_round = 5  # number of boosting rounds

bst = xgb.train(param, dtrain, num_round)

In [0]:
preds      = bst.predict(dtest)
best_preds = np.argmax(preds, axis=1)  # convert per-class probabilities back to a class number
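
With `multi:softprob`, each row of `preds` holds one probability per class, so the predicted class is the column with the largest value. The sketch below shows that step and a plain accuracy calculation on made-up numbers; `toy_preds` and `toy_true` are hypothetical and assume a 2-D array of shape (samples, classes).

```python
import numpy as np

# Hypothetical probabilities for 3 samples over 4 classes (each row sums to 1)
toy_preds = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.40, 0.10, 0.30, 0.20],
    [0.05, 0.05, 0.10, 0.80],
])
toy_true = np.array([1, 2, 3])

# Index of the highest probability in each row = predicted class number
toy_labels = toy_preds.argmax(axis=1)

# Fraction of samples where the predicted class matches the true class
toy_accuracy = (toy_labels == toy_true).mean()
print(toy_labels, toy_accuracy)
```

The same comparison between `best_preds` and `y_test` gives the overall accuracy of the model.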

### Metrics

In [46]:
from sklearn import metrics

tn = list(encoder.classes_)  # class names in encoded order
print(metrics.classification_report(y_test, best_preds, target_names=tn))

                                                                              precision    recall  f1-score   support

                                                     Bank account or service       0.49      0.46      0.48      4420
                                                 Checking or savings account       0.56      0.34      0.42      3941
                                                               Consumer Loan       0.47      0.23      0.31      2806
                                                 Credit card or prepaid card       0.62      0.61      0.62      6907
Credit reporting, credit repair services, or other personal consumer reports       0.67      0.81      0.73     27726
                                                             Debt collection       0.72      0.76      0.74     26060
                          Money transfer, virtual currency, or money service       0.80      0.56      0.66      2133
                                                       

Here you can see we have about 70% accuracy.  Certainly more can be done to tune the model.  It would also be desirable to run cross-validation to test the stability and validity of the model.
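
A minimal sketch of what that cross-validation could look like is shown below. It runs on a tiny made-up corpus and uses `LogisticRegression` as a lightweight stand-in classifier, not the XGBoost model trained above; `xgboost.XGBClassifier` follows the scikit-learn estimator interface and could be swapped in. The texts, labels, and fold count are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy narratives and labels; replace with X and y from above
texts = [
    "collector keeps calling about a debt", "debt collector threatened to sue",
    "collection agency reported an old debt", "debt was already paid to collector",
    "error on my credit report", "credit report lists wrong account",
    "credit bureau will not fix report", "inaccurate item on credit report",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so no vocabulary or IDF statistics leak from the held-out fold.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(max_iter=1000),  # stand-in classifier, not the XGBoost model
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

A small spread across folds would suggest the accuracy estimate is stable; a large spread would suggest it depends heavily on which complaints land in the test set.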