In [1]:
##This is a project I've undertaken to explore a dataset of consumer complaints from www.kaggle.com.
##I have eliminated most of the companies based on their number of complaints, narrowing the field
##down to a 'top ten'. So to make it into my data analysis, a company really needs to have made
##a lot of people mad.

In [None]:
#Standard
import numpy as np
import pandas as pd
from numpy.random import randn
#Stats
from scipy import stats
#Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#for inline output in iPython Notebook:
%matplotlib inline



In [None]:
## read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapping is needed
cc_df = pd.read_csv('Consumer_Complaints.csv')
cc_df.head(15)
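
In [None]:
## A minimal sketch of a leaner load, assuming we only care about a few columns. The inline
## sample below is made up and merely stands in for Consumer_Complaints.csv; the column names
## are the ones used elsewhere in this notebook. usecols skips unwanted columns at read time
## and the category dtype keeps repeated strings compact.
```python
import io

import pandas as pd

# Made-up two-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    "Date received,Product,Company,State,Complaint ID\n"
    "03/12/2014,Mortgage,Company A,CA,1\n"
    "12/01/2011,Debt collection,Company B,FL,2\n"
)

wanted = ["Date received", "Product", "Company", "State"]
sample_df = pd.read_csv(
    sample_csv,
    usecols=wanted,                # columns not listed here are never loaded
    dtype={"State": "category"},   # repeated codes stored once
)
print(sample_df.dtypes)
```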

In [None]:
cc_df.info()

In [None]:
## Grouping by company 
g = cc_df.groupby('Company')

In [None]:
## Figure out the top ten offenders by number of complaints. The cutoff is 10,000 complaints: 
## if you pissed off more people than that, congratulations, you're in!
size = g.size()
most_comps = size[size > 10000]
most_comps
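
In [None]:
## A threshold only gives a 'top ten' if exactly ten companies happen to clear it. A minimal
## sketch of asking for the N largest groups directly, on a made-up toy frame (the company
## names and counts here are invented for illustration):
```python
import pandas as pd

# Toy data: company A has 5 complaints, B has 3, C has 1
toy = pd.DataFrame({"Company": ["A"] * 5 + ["B"] * 3 + ["C"] * 1})

# Count complaints per company, then keep the N largest counts
counts = toy["Company"].value_counts()
top_n = counts.nlargest(2)
print(top_n)
```
## On the real data this would be something like size.nlargest(10).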

In [None]:
cc_df = g.filter(lambda x: len(x) > 10000)
cc_df.head()
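
In [None]:
## The filter above calls a Python lambda once per company. A sketch of an equivalent
## boolean-mask approach, shown on made-up toy data with a toy threshold of 1; it should
## select the same rows without the per-group function call.
```python
import pandas as pd

toy = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "C", "C"],
    "Product": ["Mortgage", "Mortgage", "Credit card",
                "Mortgage", "Mortgage", "Credit card"],
})
threshold = 1  # toy stand-in for the 10,000 cutoff

# Size of each row's company group, aligned with the original rows
group_sizes = toy.groupby("Company")["Company"].transform("size")
kept = toy[group_sizes > threshold]

# Same selection via groupby.filter, for comparison
filtered = toy.groupby("Company").filter(lambda x: len(x) > threshold)
print(kept.index.tolist() == filtered.index.tolist())
```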

In [None]:
## Cleaning out some unnecessary stuff, stuff we can't really graph anyway. 
cc_df = cc_df.drop(['Sub-product', 'Company public response', 'Consumer complaint narrative',
                    'ZIP code', 'Submitted via', 'Complaint ID'], axis=1)
cc_df.head()

In [None]:
cc_df.shape

In [None]:
## Looks like crummy mortgages were the biggest complaint in 2015. Let's find out who really stood 
## out here... 

In [None]:
cc_df['Product'].value_counts()

In [None]:
## sort_index(by=...) is no longer supported in pandas; sort_values(by=...) does the same job
cc_df.sort_values(by=['Product','Company']).head()
## Ok so now I'm sorted by the product and then by the company that received complaints on it. Now, 
## how to graph the companies by the individual product? That way the graphs will be readable and 
## provide some clarity from the data. Right now there are too many products.
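
In [None]:
## Before graphing, one way to see companies against products is a plain count table. A
## minimal sketch with pd.crosstab on made-up toy rows (company names are invented):
```python
import pandas as pd

toy = pd.DataFrame({
    "Company": ["A", "A", "B", "B", "B"],
    "Product": ["Mortgage", "Credit card", "Mortgage", "Mortgage", "Credit card"],
})

# Rows are companies, columns are products, cells are complaint counts
by_product = pd.crosstab(toy["Company"], toy["Product"])
print(by_product)
```
## Picking a single column of such a table is the same idea as narrowing down to one product.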

In [None]:
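## Okay, now we have a DataFrame with only Mortgage complaints! Let's see who the bad boys really 
## were in this field
## .copy() makes this an independent DataFrame, so adding columns to it later won't trigger 
## a SettingWithCopyWarning
mort_df = cc_df[cc_df['Product'] == 'Mortgage'].copy()
mort_df.head()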

In [None]:
mort_df.shape
## My DataFrame shape matches the value_counts above for mortgages...good. 

In [None]:
##So here we can see that BofA came in first for mortgage complaints (who knew?) with Wells Fargo
##in a fairly distant second place. Ocwen is third.
##set the font size to something readable (the rc key is 'lines.linewidth'):
sns.set_context('notebook', font_scale=1.3, rc={'lines.linewidth': 2.5})
##factorplot was renamed catplot, and its size argument became height, in seaborn 0.9
sns.catplot(x='Product', data=mort_df, hue='Company', kind='count', height=4, aspect=1.5, palette="deep")

In [None]:
sns.set_context('paper', font_scale=0.9, rc={'lines.linewidth': 2.5})
sns.catplot(x='State', data=mort_df, hue='Product', kind='count', height=4, aspect=2.7, palette="deep")

In [None]:
## Above, we see that California was hit hardest, with nearly twice as many complaints as 
## second-place Florida, which in turn had twice as many as New York, in 3rd place for consumer 
## complaints in the mortgage industry. Is this due in part to population? Yes, but not entirely: 
## Texas is the second-most populous state, yet only 5th in complaints. Also, I wonder which state 
## has the abbreviation 'PW'?

In [None]:
mort_df['State'].unique()

In [None]:
###...and where the heck is 'GU'??? 
len(mort_df['State'].unique())
###must be some new states that have joined the union unbeknownst to me...or I have a dirty, dirty dataset. 
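
In [None]:
## A sketch of checking which codes fall outside the 50 states plus DC. USPS also assigns
## two-letter codes to territories and associated states (GU is Guam, PW is Palau), so extra
## codes are not necessarily dirty data. The toy Series below is made up for illustration.
```python
import pandas as pd

# The 50 states plus the District of Columbia
STATES_AND_DC = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO "
    "MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY DC".split()
)

# Made-up stand-in for mort_df['State'], including a missing value
toy_states = pd.Series(["CA", "GU", "PW", "TX", None, "CA"])

# Drop missing values, then list every code that is not a state or DC
codes = toy_states.dropna().unique()
extras = sorted(set(codes) - STATES_AND_DC)
print(extras)
```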

In [None]:
###Well at least most people got a 'timely response'. I just wonder if it was the response they were hoping for...
sns.set_context('notebook', font_scale=1.1, rc={'lines.linewidth': 1.5})
###factorplot was renamed catplot, and its size argument became height, in seaborn 0.9
sns.catplot(x='Company', hue='Timely response?', data=mort_df, kind='count', height=4.5, aspect=2.8, palette="dark")
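
In [None]:
## Raw counts favor whoever has the most complaints, so a share per company can be easier to
## compare. A minimal sketch on made-up toy rows; the 'Yes'/'No' values are an assumption
## about how the 'Timely response?' column is coded.
```python
import pandas as pd

toy = pd.DataFrame({
    "Company": ["A", "A", "A", "A", "B", "B"],
    "Timely response?": ["Yes", "Yes", "Yes", "No", "Yes", "No"],
})

# normalize="index" turns each company's row of counts into fractions summing to 1
timely_share = pd.crosstab(toy["Company"], toy["Timely response?"], normalize="index")
print(timely_share)
```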

In [None]:
## Kinda figures the top three would have the most consumer-disputed cases, which is 
## borne out here. I wonder why the clear majority of complaints are labeled as "not 
## consumer disputed", unless that criterion has only to do with the final outcome. 
## In which case, if the data is correct, the companies did a fairly good job of dealing 
## with their complaints. 
sns.set_style('whitegrid')
sns.set_context('notebook', font_scale=1.1, rc={'lines.linewidth': 1.5})
## factorplot was renamed catplot, and its size argument became height, in seaborn 0.9
sns.catplot(x='Company', hue='Consumer disputed?', data=mort_df, kind='count', height=4.0, aspect=2.9, palette="bright")

In [None]:
## So in order to get a view for when the complaints rolled in and for which company, 
## I need to figure out how to rename the dates to years instead of the mm-dd-yyyy format. 
## Without doing this, we'd get data for every day, which is too much to graph.

In [None]:
date = mort_df['Date received']
date.shape

In [None]:
str_date = []

In [None]:
# In an mm-dd-yyyy string the four-digit year sits at positions 6 through 9
for d in date:
    str_date.append(d[6:10])
str_date[112599:112604]

In [None]:
#Double-checking the length of the list 
len(str_date)
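
In [None]:
## A sketch of getting the year without a manual loop, assuming the dates look like
## mm/dd/yyyy (the separator and the sample dates below are assumptions; adjust the format
## string to match the real column). Parsing also fails loudly on malformed dates, which
## plain slicing would not.
```python
import pandas as pd

# Made-up stand-in for mort_df['Date received']
dates = pd.Series(["12/01/2011", "03/12/2014", "07/30/2015"])

# String slicing, as in the loop: works only if every value has the same layout
years_sliced = [d[6:10] for d in dates]

# Parsing: convert to real datetimes, then read the year off the .dt accessor
years_parsed = pd.to_datetime(dates, format="%m/%d/%Y").dt.year
print(years_parsed.tolist())
```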

In [None]:
#Pass the list into the mortgage DataFrame
mort_df['year'] = str_date

In [None]:
##Got 'er did. Now to figure out how to graph the data.
mort_df.head()

In [None]:
###For 2011, the dataset only has data from December, so let's get rid of that year entirely:
mort_df = mort_df[mort_df['year'] != '2011']

In [None]:
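## (a bare mort_df.groupby('year') whose result is discarded does nothing, so it is dropped;
## the count plot does the grouping itself)
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 1.5})
## factorplot was renamed catplot, and its size argument became height, in seaborn 0.9
sns.catplot(x='year', hue='Company', data=mort_df, kind='count', height=4, aspect=2.3, palette="bright")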
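
In [None]:
## The numbers behind a count plot like this can also be put in a table. A minimal sketch
## on made-up toy rows, grouping by year and company and pivoting companies into columns:
```python
import pandas as pd

toy = pd.DataFrame({
    "year": ["2012", "2012", "2013", "2013", "2013"],
    "Company": ["A", "B", "A", "A", "B"],
})

# size() counts rows per (year, Company); unstack moves Company into the columns
yearly = toy.groupby(["year", "Company"]).size().unstack(fill_value=0)
print(yearly)
```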

In [None]:
## We can basically say things really got bad in 2012 and 2013 and then seemed 
## to calm down. I wonder if some laws were passed around that time that tried to 
## curb some of the most egregious abuses...?