### Project Problem and Hypothesis

+ What's the project about? What problem are you solving?

This project focuses on predicting: 1) the likelihood that a consumer disputes a financial institution's resolution of a complaint, 2) the likelihood that a complaint results in monetary restitution to the consumer, and 3) the type of response a consumer will receive, based on various inputs (e.g. financial institution, type of complaint, how the complaint was submitted, state).

Problems being solved:
     - Allow financial institutions to predict the top 3 complaints by state and the "success" rate of their resolutions (success rate = few or no consumer disputes regarding the resolution)
     - Allow financial institutions to determine whether resolutions to the same complaint vary by state and whether that, in turn, affects resolution success
     - Allow consumers to identify, by state, the financial institutions with the greatest number of complaints and the highest consumer resolution rate
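
The first item in this list could be approached with a pandas groupby. The following is only a minimal sketch on a tiny made-up sample that reuses the dataset's column names (`State`, `Issue`, `Consumer disputed?`). The helper name `top_issues_by_state` is hypothetical, and the sketch assumes `Consumer disputed?` holds `Yes`/`No` strings, which should be checked against the full file.

```python
import pandas as pd

# Tiny made-up sample using the dataset's column names (not real complaint data).
sample = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA", "NY", "NY", "NY"],
    "Issue": ["Billing disputes", "Billing disputes", "Other",
              "Communication tactics", "Other", "Other", "Billing disputes"],
    "Consumer disputed?": ["Yes", "No", "No", "No", "No", "Yes", None],
})


def top_issues_by_state(df, n=3):
    """Return the n most frequent issues per state with their 'success' rate.

    Success rate = share of complaints with a known outcome that the consumer
    did NOT dispute. Assumes 'Yes'/'No' values; rows with a missing value in
    'Consumer disputed?' are left out of both the count and the rate.
    """
    known = df.dropna(subset=["Consumer disputed?"]).copy()
    known["not_disputed"] = known["Consumer disputed?"].eq("No")
    summary = (
        known.groupby(["State", "Issue"])["not_disputed"]
        .agg(complaints="size", success_rate="mean")
        .reset_index()
        .sort_values(["State", "complaints"], ascending=[True, False])
    )
    # Keep the first n rows of each state (already sorted by complaint count).
    return summary.groupby("State").head(n).reset_index(drop=True)


result = top_issues_by_state(sample)
print(result)
```

Dropping rows with an unknown dispute status is a design choice made for the sketch; counting them as "not disputed" would inflate the success rate.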


+ Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?

This is a classification problem. I plan to use a logistic regression model, possibly with decision tree modeling as a precursor.

Some variables (or factors) will have a binary value (such as whether the consumer disputed the resolution). Other factors will reflect a continuous number (such as the number of complaints of type A over the past 4 years).
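
To make the framing concrete, here is a minimal sketch of how the three prediction targets might be derived from the raw columns: the dispute and monetary-restitution targets are binary, while the response type has more than two classes. The response labels in the toy sample (e.g. `Closed with monetary relief`) are assumed values and should be checked with `df['Company response to consumer'].value_counts()`. The names `build_targets`, `disputed`, `monetary_relief`, and `response_code` are hypothetical.

```python
import pandas as pd

# Made-up rows; the response labels are assumptions, not taken from the output below.
sample = pd.DataFrame({
    "Consumer disputed?": ["Yes", "No", None, "No"],
    "Company response to consumer": [
        "Closed with explanation",
        "Closed with monetary relief",
        "Closed with explanation",
        "Closed with non-monetary relief",
    ],
})


def build_targets(df):
    """Derive two binary targets and one multi-class target."""
    out = pd.DataFrame(index=df.index)
    # Target 1 (binary): 1.0 = disputed, 0.0 = not disputed, NaN = unknown.
    out["disputed"] = df["Consumer disputed?"].map({"Yes": 1.0, "No": 0.0})
    # Target 2 (binary): 1 if the response was monetary relief, else 0.
    out["monetary_relief"] = (
        df["Company response to consumer"]
        .eq("Closed with monetary relief")
        .astype(int)
    )
    # Target 3 (multi-class): one integer code per response type.
    response = df["Company response to consumer"].astype("category")
    out["response_code"] = response.cat.codes
    return out, list(response.cat.categories)


targets, response_labels = build_targets(sample)
print(targets)
print(response_labels)
```

Rows where `disputed` is NaN would have to be excluded before fitting a model on that target.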


+ What kind of impact do you think it could have?

Using only this dataset, the impact could be significant from an investor's point of view. For example, if Wells Fargo sees a spike in complaints, investors may lower the investment grade (e.g. from B to D).

It would be more helpful to combine findings from this dataset with economic data by state and ZIP code. During the recession and housing bubble, states and even certain ZIP codes were affected differently. Foreclosures were particularly high in certain areas and this would have an impact on mortgage rate complaints by consumers (one of the variables in my dataset).


+ What do you think will have the most impact in predicting the value you are interested in solving for?

Financial institution: companies handle complaints differently, and some companies are more customer-focused than others

States: geographic economic differences are significant within the U.S.


### Data Sets
These are the fields available in the dataset:
    + Date received
    + Product
    + Sub-product
    + Issue
    + Sub-issue
    + Consumer complaint narrative
    + Company public response
    + Company
    + State
    + ZIP code
    + Tags
    + Consumer consent provided?
    + Submitted via
    + Date sent to company
    + Company response to consumer
    + Timely response?
    + Consumer disputed?
    + Complaint ID


### Domain Knowledge

I don't have any experience in the area of consumer credit complaints. The dataset is interesting to me because I dealt with a significant amount of stolen credit card fraud in a previous job.

The dataset I downloaded from data.gov might only go through mid-summer 2016. However, data through the present may be downloadable from here: https://data.consumerfinance.gov/dataset/Consumer-Complaints/s6ew-h6mp/data It looks very similar in terms of factors and the data within each factor.

Each month, the Consumer Financial Protection Bureau writes a report based on the past month's data. These reports highlight a particular area of complaint (e.g. mortgages, loans) and a specific state. The reports are a summary of the data with MoM and YoY trends.

*There is no prediction analysis provided in the CFPB monthly reports.*


### Project Concerns

+ I may not choose the correct predictive model.

+ Almost all of the data has type `object` (only `Complaint ID` is `int64`). I tried to find a way to change certain columns to string and certain columns to float but couldn't do it (after trying for over an hour). The errors are shown in the code below.

+ I'm not exactly sure where to start with data munging. I have identified that I'm missing data (see below), that some fields are categorical and will have to be made into dummy variables, and that I don't know whether some fields should be made 'str' and some 'int', etc.

+ I need to understand what to do with empty fields in the data. Could I calculate the mean or median and use it to fill the empty fields?

    + Result of `df.isnull().sum()`:

    * Date received                        0
    * Product                              0
    * Sub-product                     201117
    * Issue                                1
    * Sub-issue                       405537
    * Consumer complaint narrative    561564
    * Company public response         529434
    * Company                              0
    * State                             5368
    * ZIP code                          5381
    * Tags                            583715
    * Consumer consent provided?      464520
    * Submitted via                        1
    * Date sent to company                 0
    * Company response to consumer         0
    * Timely response?                     0
    * Consumer disputed?               40594
    * Complaint ID                         0


+ I attempted to go through the credit complaint process myself. Errors or misinformation are possible if the consumer does not understand part of the form or chooses the wrong multiple-choice item while completing it. Errors are unlikely from the gov't (am I being too trusting?) but could easily occur by accident on the consumer's side.
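
On the concern about categorical fields: one common way to turn them into dummy variables is `pd.get_dummies`. The following is a minimal sketch on made-up rows. The `Submitted via` values (`Web`, `Phone`) and the `Complaint ID` numbers are placeholders, and filling missing categories with an `Unknown` label is an assumption about how to treat them, not a tested choice for this dataset.

```python
import pandas as pd

# Made-up rows that reuse three of the dataset's column names.
sample = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Mortgage"],
    "Submitted via": ["Web", "Phone", None],
    "Complaint ID": [1, 2, 3],
})

categorical_cols = ["Product", "Submitted via"]

# Fill gaps first so a missing category gets its own dummy column
# instead of a row of all zeros.
filled = sample[categorical_cols].fillna("Unknown")

# One 0/1 column per distinct value, named "<column>_<value>".
dummies = pd.get_dummies(filled, prefix=categorical_cols, dtype=int)

# Replace the original text columns with their dummy columns.
features = pd.concat([sample.drop(columns=categorical_cols), dummies], axis=1)
print(features.columns.tolist())
```

For a field with many distinct values, such as `Company`, this would create one column per value, so rare values may need to be grouped before encoding.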

### Outcomes

+ I expect the outcome to be several prediction models that do/do not validate the hypotheses above, as well as a summary for each predictor.

+ The financial institution audience will expect an overview of how the analysis was conducted, what variables could affect the model(s), summary of predictions, and how external factors not included in the model could impact the predictions.

+ The consumer audience will expect an overview of what a financial institution's response is likely to be based on: geography, financial institution, type of credit complaint, etc.

+ I actually don't know how complicated my model has to be at this point. 

+ I will consider this a success if I can build all the prediction models and create a summary that's presentable to financial institutions and consumers.

+ I am not going to let this project be a bust.





In [1]:
import pandas as pd
import os

In [14]:
df = pd.read_csv('Consumer_Complaints1.csv')

In [15]:
for x in df.columns.values:
    print(x)

Date received
Product
Sub-product
Issue
Sub-issue
Consumer complaint narrative
Company public response
Company
State
ZIP code
Tags
Consumer consent provided?
Submitted via
Date sent to company
Company response to consumer
Timely response?
Consumer disputed?
Complaint ID


In [16]:
print(df.head())

  Date received                  Product                  Sub-product  \
0    07/29/2013            Consumer Loan                 Vehicle loan   
1    07/29/2013  Bank account or service             Checking account   
2    07/29/2013  Bank account or service             Checking account   
3    07/29/2013  Bank account or service             Checking account   
4    07/29/2013                 Mortgage  Conventional fixed mortgage   

                                      Issue Sub-issue  \
0                Managing the loan or lease       NaN   
1                 Using a debit or ATM card       NaN   
2   Account opening, closing, or management       NaN   
3                  Deposits and withdrawals       NaN   
4  Loan servicing, payments, escrow account       NaN   

  Consumer complaint narrative Company public response  \
0                          NaN                     NaN   
1                          NaN                     NaN   
2                          NaN              

In [18]:
#This is a problem
df.dtypes

Date received                   object
Product                         object
Sub-product                     object
Issue                           object
Sub-issue                       object
Consumer complaint narrative    object
Company public response         object
Company                         object
State                           object
ZIP code                        object
Tags                            object
Consumer consent provided?      object
Submitted via                   object
Date sent to company            object
Company response to consumer    object
Timely response?                object
Consumer disputed?              object
Complaint ID                     int64
dtype: object

In [9]:
print(len(df))

679879


In [13]:
df.isnull().sum()

Date received                        0
Product                              0
Sub-product                     201117
Issue                                1
Sub-issue                       405537
Consumer complaint narrative    561564
Company public response         529434
Company                              0
State                             5368
ZIP code                          5381
Tags                            583715
Consumer consent provided?      464520
Submitted via                        1
Date sent to company                 0
Company response to consumer         0
Timely response?                     0
Consumer disputed?               40594
Complaint ID                         0
dtype: int64
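
Given these counts, it may help to look at the share of missing values per column rather than the raw count. For example, `Tags` is empty in 583715 of 679879 rows, roughly 86%. The sketch below shows one possible approach and is an assumption, not a tested recipe for this dataset: it drops columns that are mostly empty, drops rows where the target is missing, and fills the remaining categorical gaps with an `Unknown` label, since a mean or median is not defined for text categories. The 0.5 threshold and the helper name `handle_missing` are hypothetical choices, and the sample rows are made up.

```python
import pandas as pd

# Made-up rows that reuse four of the dataset's column names.
sample = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Mortgage", "Debt collection"],
    "Sub-product": [None, None, "Conventional fixed mortgage", None],
    "State": ["CA", None, "NY", "TX"],
    "Consumer disputed?": ["No", "Yes", None, "No"],
})


def handle_missing(df, target="Consumer disputed?", max_missing_share=0.5):
    """Drop sparse columns, drop rows without a target, label other gaps."""
    # Fraction of missing values per column (0.0 to 1.0).
    missing_share = df.isnull().mean()
    too_sparse = [
        col for col in df.columns
        if col != target and missing_share[col] > max_missing_share
    ]
    cleaned = df.drop(columns=too_sparse).dropna(subset=[target])
    # Fill gaps in the remaining (categorical) feature columns only.
    feature_cols = [col for col in cleaned.columns if col != target]
    cleaned = cleaned.fillna({col: "Unknown" for col in feature_cols})
    return cleaned, missing_share


cleaned, missing_share = handle_missing(sample)
print(missing_share)
print(cleaned)
```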

In [14]:
df.Product.describe()

count       679879
unique          12
top       Mortgage
freq        212178
Name: Product, dtype: object

In [18]:
df.Product.value_counts()

Mortgage                   212178
Debt collection            126369
Credit reporting           120998
Credit card                 80119
Bank account or service     77253
Consumer Loan               27101
Student loan                22083
Payday loan                  4893
Money transfers              4792
Prepaid card                 3242
Other financial service       836
Virtual currency               15
Name: Product, dtype: int64

In [17]:
df.Tags.describe()

count              96164
unique                 3
top       Older American
freq               55639
Name: Tags, dtype: object

In [25]:
df.Issue.describe()

count                                       679878
unique                                          95
top       Loan modification,collection,foreclosure
freq                                        107093
Name: Issue, dtype: object

In [26]:
df.Issue.value_counts()

Loan modification,collection,foreclosure    107093
Incorrect information on credit report       88243
Loan servicing, payments, escrow account     70979
Cont'd attempts collect debt not owed        52502
Account opening, closing, or management      33832
Disclosure verification of debt              25173
Communication tactics                        21621
Deposits and withdrawals                     20618
Application, originator, mortgage broker     15702
Credit reporting company's investigation     14178
Billing disputes                             13374
Other                                        13262
Managing the loan or lease                   12973
Problems caused by my funds being low        10785
Dealing with my lender or servicer           10546
False statements or representation           10174
Unable to get credit report/credit score      9870
Improper contact or sharing of info           8938
Problems when you are unable to pay           8281
Settlement process and costs   

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
df1 = pd.read_csv('Consumer_Complaints1.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [25]:
df1.dtypes

AttributeError: 'str' object has no attribute 'dtypes'

In [22]:
df1 = pd.to_numeric(errors='ignore')

In [23]:
df1.dtypes

AttributeError: 'str' object has no attribute 'dtypes'
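
The `AttributeError` suggests that `df1` was holding a string rather than a DataFrame when `df1.dtypes` ran, possibly because cells were executed out of order; that is a guess from the message, not something the output confirms. Separately, `pd.to_numeric` expects the data to convert as its first argument and works on one column (Series) at a time, not on a whole DataFrame in a single call. Below is a minimal sketch of per-column conversion, assuming the dates use the `MM/DD/YYYY` format seen in the `df.head()` output. The sample rows, the ZIP values, the `ZIP numeric` column, and the helper name `convert_types` are hypothetical.

```python
import pandas as pd

# Made-up rows that reuse four of the dataset's column names.
sample = pd.DataFrame({
    "Date received": ["07/29/2013", "07/29/2013", "07/30/2013"],
    "Product": ["Consumer Loan", "Mortgage", "Mortgage"],
    "ZIP code": ["12345", "9021X", None],
    "Timely response?": ["Yes", "No", "Yes"],
})


def convert_types(df):
    """Return a copy with more specific dtypes than 'object'."""
    out = df.copy()
    # Dates: parse the MM/DD/YYYY strings into datetime64 values.
    out["Date received"] = pd.to_datetime(out["Date received"], format="%m/%d/%Y")
    # Low-cardinality text: store as a pandas categorical.
    out["Product"] = out["Product"].astype("category")
    # Yes/No flags: map to booleans.
    out["Timely response?"] = out["Timely response?"].map({"Yes": True, "No": False})
    # to_numeric takes the Series as its first argument; with errors="coerce",
    # values that cannot be parsed become NaN instead of raising an error.
    out["ZIP numeric"] = pd.to_numeric(out["ZIP code"], errors="coerce")
    return out


converted = convert_types(sample)
print(converted.dtypes)
```

ZIP codes are labels rather than quantities, so keeping `ZIP code` as text is probably the better choice for modeling; it is converted here only to show the `pd.to_numeric` call. Recent pandas releases also deprecate `errors='ignore'`, so `errors='coerce'` is used instead.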
