## Consumers who are more likely to dispute a conclusion

In this case we are given detailed consumer complaints along with whether the consumer disputed the conclusion. If we can predict this, consumers who are more likely to dispute a conclusion can be given more attention, both in how their complaints are handled and in how persuasively the final conclusions are conveyed to them.

We're going to take the following approach:
1. Problem definition
2. Data
3. Evaluation
4. Modelling

## 1. Problem Definition

In a statement,
> Given the details of a consumer complaint, can we predict whether or not the customer is going to dispute?

## 2. Data

- For training the model: `Consumer_Complaints_train.csv`
- For testing the model: `Consumer_Complaints_test_share.csv`

## 3. Evaluation

> We aim for an AUC score of at least 0.54 at predicting whether or not a customer will dispute.
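
To make the metric concrete, here is a minimal sketch of how an AUC score is computed with `roc_auc_score`. The labels and probabilities are made-up toy values, not taken from the complaint data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: 1 = consumer disputed, 0 = did not dispute
y_true = np.array([0, 0, 1, 1])
# Hypothetical predicted probabilities of a dispute
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is the share of (disputed, not disputed) pairs that the scores rank correctly
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking, so a target of 0.54 asks for a model that ranks disputers only slightly better than chance.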


## Preparing the tools

We're going to use `pandas` and `numpy` for data analysis and manipulation, and `scikit-learn` for modelling and evaluation.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import roc_auc_score, confusion_matrix, mean_squared_error

import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

## Load Data

In [2]:
df_train = pd.read_csv('Consumer_Complaints_train.csv')
df_test = pd.read_csv('Consumer_Complaints_test_share.csv')

In [20]:
len(df_train['Consumer complaint narrative'].value_counts())

74019

## Data Cleaning

In [29]:
df_train.info()

# Dropping -> Sub-issue, Complaint narrative, Public response, ZIP code, Tags, Consumer consent
# (axis is passed by keyword; the positional form is not accepted by recent pandas versions)

df_train.drop(['Sub-issue','Consumer complaint narrative','Company public response',
               'ZIP code','Tags','Consumer consent provided?'],
              axis=1,inplace=True)

# Drop the same columns from the test set so both frames keep matching features

df_test.drop(['Sub-issue','Consumer complaint narrative','Company public response',
               'ZIP code','Tags','Consumer consent provided?'],
             axis=1,inplace=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478421 entries, 0 to 478420
Data columns (total 18 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Date received                 478421 non-null  object
 1   Product                       478421 non-null  object
 2   Sub-product                   339948 non-null  object
 3   Issue                         478421 non-null  object
 4   Sub-issue                     185796 non-null  object
 5   Consumer complaint narrative  75094 non-null   object
 6   Company public response       90392 non-null   object
 7   Company                       478421 non-null  object
 8   State                         474582 non-null  object
 9   ZIP code                      474573 non-null  object
 10  Tags                          67206 non-null   object
 11  Consumer consent provided?    135487 non-null  object
 12  Submitted via                 478421 non-null  object
 13 
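
The `info()` output shows that several of the dropped columns are mostly empty (for example, `Tags` has 67206 non-null values out of 478421 rows). A minimal sketch of how the missing share per column could be checked is shown below; the toy frame and the 0.5 cut-off are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Mortgage", "Debt collection"],
    "Tags": [np.nan, "Older American", np.nan, np.nan],
    "State": ["CA", "NY", np.nan, "TX"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns where more than half of the rows are missing (0.5 is an assumed cut-off)
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(sparse_cols)
```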

In [30]:
# Convert dates: 'Date received' and 'Date sent to company'
# Unparseable values become NaT because of errors='coerce'

df_train['Date received'] = pd.to_datetime(df_train['Date received'], errors='coerce')

df_train['Date sent to company'] = pd.to_datetime(df_train['Date sent to company'], errors='coerce')

# Number of days between receiving the complaint and forwarding it to the company
df_train['days_diff'] = df_train['Date sent to company'] - df_train['Date received']

df_train.drop(['Date received','Date sent to company'],axis=1,inplace=True)

df_train['days_diff'] = pd.to_numeric(df_train['days_diff'].dt.days, downcast='integer')

In [31]:
# Same date handling for the test set: 'Date received' and 'Date sent to company'
df_test['Date received'] = pd.to_datetime(df_test['Date received'], errors='coerce')

df_test['Date sent to company'] = pd.to_datetime(df_test['Date sent to company'], errors='coerce')

# Number of days between receiving the complaint and forwarding it to the company
df_test['days_diff'] = df_test['Date sent to company'] - df_test['Date received']

df_test.drop(['Date received','Date sent to company'],axis=1,inplace=True)

df_test['days_diff'] = pd.to_numeric(df_test['days_diff'].dt.days, downcast='integer')
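
The two cells above turn a pair of date columns into a single numeric feature. Here is a small self-contained sketch of the same idea on toy data. The ISO date strings are an assumption (the real files may use another format), and the third row shows how `errors='coerce'` turns an unparseable value into a missing one.

```python
import pandas as pd

# Hypothetical toy rows; the date format is assumed for illustration
toy = pd.DataFrame({
    "Date received": ["2015-03-01", "2015-03-10", "not a date"],
    "Date sent to company": ["2015-03-04", "2015-03-10", "2015-03-12"],
})

for col in ["Date received", "Date sent to company"]:
    # Invalid strings become NaT instead of raising an error
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Whole days between the two dates; NaN where either date is missing
toy["days_diff"] = (toy["Date sent to company"] - toy["Date received"]).dt.days
print(toy["days_diff"])
```

Rows with a missing `days_diff` are the ones that a later `dropna` call would remove.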

In [32]:
df_train['Consumer disputed?'].value_counts()

df_train['Y'] = np.where(df_train['Consumer disputed?'] == 'Yes',1,0)
del df_train['Consumer disputed?']

In [33]:
#product, submitted via, company response, timely response
conv_dummies = ['Product','Submitted via','Company response to consumer','Timely response?']

for col in conv_dummies:
    dummy=pd.get_dummies(df_train[col],prefix=col,drop_first=True)
    df_train=pd.concat([df_train,dummy],axis=1)
    print(col)
    del df_train[col]
del dummy

Product
Submitted via
Company response to consumer
Timely response?


In [34]:
#product, submitted via, company response, timely response
conv_dummies = ['Product','Submitted via','Company response to consumer','Timely response?']

for col in conv_dummies:
    dummy=pd.get_dummies(df_test[col],prefix=col,drop_first=True)
    df_test=pd.concat([df_test,dummy],axis=1)
    print(col)
    del df_test[col]
del dummy

Product
Submitted via
Company response to consumer
Timely response?
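
Because the dummies are built separately for the train and test frames, the two sets of columns can differ when a category appears in only one file. One possible way to guard against this, sketched below on hypothetical toy values, is to build the test dummies without `drop_first` and then reindex them to the training columns. This is an illustration of the idea, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy values for the 'Submitted via' column
train = pd.DataFrame({"Submitted via": ["Email", "Web", "Phone", "Fax"]})
test = pd.DataFrame({"Submitted via": ["Web", "Referral", "Phone"]})

# Training dummies define the reference set of columns
train_d = pd.get_dummies(train["Submitted via"], prefix="Submitted via", drop_first=True)

# Keep every test category, then align to the training columns:
# columns unseen in test are filled with 0, categories unseen in train are discarded
test_d = pd.get_dummies(test["Submitted via"], prefix="Submitted via")
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(test_d)
```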


## Dummies

In [35]:
k=df_train['State'].value_counts()
for val in k.axes[0][0:15]:
    varname='State_'+val.replace(',','_').replace(' ','_')
    df_train[varname]=np.where(df_train['State']==val,1,0)
    df_test[varname]=np.where(df_test['State']==val,1,0)
del df_train['State']
del df_test['State']

In [36]:
k=df_train['Company'].value_counts()
for val in k.axes[0][0:40]:
    varname='Company_'+val.replace(',','_').replace(' ','_')
    df_train[varname]=np.where(df_train['Company']==val,1,0)
    df_test[varname]=np.where(df_test['Company']==val,1,0)
del df_train['Company']
del df_test['Company']

In [37]:
k=df_train['Issue'].value_counts()
for val in k.axes[0][0:30]:
    varname='Issue_'+val.replace(',','_').replace(' ','_')
    df_train[varname]=np.where(df_train['Issue']==val,1,0)
    df_test[varname]=np.where(df_test['Issue']==val,1,0)
del df_train['Issue']
del df_test['Issue']

In [38]:
k=df_train['Sub-product'].value_counts()
for val in k.axes[0][0:30]:
    varname='Sub-product_'+val.replace(',','_').replace(' ','_')
    df_train[varname]=np.where(df_train['Sub-product']==val,1,0)
    df_test[varname]=np.where(df_test['Sub-product']==val,1,0)
del df_train['Sub-product']
del df_test['Sub-product']
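
The four cells above repeat the same pattern: take the most frequent values of a column in the training data and create one 0/1 indicator per value in both frames. A possible refactoring into a helper is sketched below. The function name `add_top_n_indicators` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def add_top_n_indicators(train, test, col, n):
    """Add 0/1 columns for the n most frequent values of `col` in `train`.

    The indicator columns are added to both frames in place; the returned
    frames are copies without the original `col`.
    """
    top_values = train[col].value_counts().index[:n]
    for val in top_values:
        name = f"{col}_" + str(val).replace(",", "_").replace(" ", "_")
        train[name] = np.where(train[col] == val, 1, 0)
        test[name] = np.where(test[col] == val, 1, 0)
    return train.drop(columns=col), test.drop(columns=col)

# Hypothetical toy data
train = pd.DataFrame({"State": ["CA", "CA", "CA", "NY", "NY", "TX"]})
test = pd.DataFrame({"State": ["NY", "FL"]})

train, test = add_top_n_indicators(train, test, "State", n=2)
print(train.columns.tolist())
print(test)
```

Since the list of top values always comes from the training frame, the train and test frames end up with the same indicator columns.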

In [39]:
df_train.dropna(inplace=True)

df_test.dropna(inplace=True)

## Modelling

In [40]:
x=df_train.drop(['Y','Complaint ID'],axis=1)
y=df_train['Y']

In [41]:
clf=LogisticRegression()

In [42]:
clf.fit(x,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [43]:
prediction=np.where(clf.predict(df_test.drop(['Complaint ID'],axis=1))==1,"Yes","No")

In [44]:
submission=pd.DataFrame(list(zip(df_test['Complaint ID'],list(prediction))),
                       columns=['Complaint ID','Consumer disputed?'])

In [45]:
submission.head(4)

Unnamed: 0,Complaint ID,Consumer disputed?
0,675956,No
1,1858795,No
2,32637,No
3,1731374,No


In [46]:
#train score
train_score=clf.predict_proba(x)[:,1] 

train_score

array([0.304956  , 0.21837561, 0.09678504, ..., 0.24199153, 0.29827327,
       0.15781488])

In [47]:
#test score
x_test=df_test.drop('Complaint ID',axis=1)

test_score=clf.predict_proba(x_test)[:,1] 

test_score

array([0.30636425, 0.22085381, 0.34360414, ..., 0.19834864, 0.21275703,
       0.21427519])
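
The baseline above is fit on the full training set, so no held-out AUC is reported for it. One way to estimate it would be `cross_val_score` (already imported at the top) with `scoring="roc_auc"`. The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y`; the sample size, class weights, and `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the complaint features and target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# 5-fold cross-validated AUC: each fold is scored on data the model did not see
cv_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="roc_auc"
)
print(cv_auc.mean(), cv_auc.std())
```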

## Logistic Regression - Hyperparameter tuning

In [48]:
x=df_train.drop(['Y','Complaint ID'],axis=1)
y=df_train['Y']


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [49]:
roc_dict={}

c_range=[0.001,0.01,0.1,1.0,10]

# Fit one model per C value and record its AUC on the held-out 20% split
for i in c_range:
    logr_ht=LogisticRegression(C=i)
    logr_ht.fit(x_train, y_train)
    pred=logr_ht.predict_proba(x_test)[:,1]
    r1=roc_auc_score(y_test, pred)
    roc_dict[i]=r1
    print(i , ' ' ,  r1)

# C value with the highest validation AUC
Keymax = max(roc_dict, key=roc_dict.get)
print(Keymax)

0.001   0.6173063220944757
0.01   0.6208255734373733
0.1   0.6215246635891455
1.0   0.6210660249567241
10   0.6210952280830376
0.1
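
The loop above picks `C` from a single 80/20 split, and the AUC differences between the candidates are small. A possible alternative is `GridSearchCV` (already imported at the top), which averages the score over several folds. The sketch below runs on synthetic data as a stand-in for `x` and `y`; the fold count and `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the complaint features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Same candidate values for C as in the manual loop
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 10]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",  # select C by cross-validated AUC
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

By default `GridSearchCV` refits the best model on all the data passed to `fit`, so `search.best_estimator_` could be used directly for prediction.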


In [50]:
clf=LogisticRegression(C=0.1)
clf.fit(x,y)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [51]:
prediction=np.where(clf.predict(df_test.drop(['Complaint ID'],axis=1))==1,"Yes","No")

submission=pd.DataFrame(list(zip(df_test['Complaint ID'],list(prediction))),
                       columns=['Complaint ID','Consumer disputed?'])

submission.head(4)

Unnamed: 0,Complaint ID,Consumer disputed?
0,675956,No
1,1858795,No
2,32637,No
3,1731374,No
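
For a binary logistic regression, `clf.predict` labels a row "Yes" only when its predicted probability exceeds 0.5, and the scores printed in this notebook mostly sit below that. If more "Yes" labels are wanted, the cut-off can be applied to the probabilities by hand. The scores and the 0.3 threshold below are hypothetical values for illustration.

```python
import numpy as np

# Hypothetical dispute probabilities, similar in range to the scores shown above
scores = np.array([0.31, 0.22, 0.36, 0.55])

# Default behaviour: label "Yes" only above 0.5
labels_default = np.where(scores > 0.5, "Yes", "No")

# Assumed lower cut-off that flags more consumers for attention
threshold = 0.3
labels_custom = np.where(scores > threshold, "Yes", "No")

print(labels_default)
print(labels_custom)
```

Changing the threshold does not affect the AUC, which depends only on how the scores rank the rows.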


In [52]:
submission.to_csv('Project1_submission_classes.csv',index=False)

#train score
train_score=clf.predict_proba(x)[:,1] 

train_score

#test score
x_test=df_test.drop('Complaint ID',axis=1)

test_score=clf.predict_proba(x_test)[:,1] 

test_score

array([0.30051791, 0.21698454, 0.36007386, ..., 0.195078  , 0.20972367,
       0.21230586])

In [53]:
pd.DataFrame(test_score).to_csv("Project1_submission_score.csv",index=False)
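
The score file written above contains only the probabilities, so each row can be matched to a complaint only by position. A minimal sketch of keeping the identifier next to each score is shown below. The IDs are the first three from the submission preview; the score values and the column name `dispute_probability` are made up for illustration.

```python
import numpy as np
import pandas as pd

# First three complaint IDs from the preview above, with hypothetical scores
complaint_ids = pd.Series([675956, 1858795, 32637], name="Complaint ID")
scores = np.array([0.30, 0.22, 0.36])

# One row per complaint: identifier plus predicted probability of a dispute
scored = pd.DataFrame({
    "Complaint ID": complaint_ids.to_numpy(),
    "dispute_probability": scores,
})
print(scored)
```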