
# Finance Complaint Project
## Exploratory Data Analysis

### Problem Statement
The Consumer Financial Protection Bureau (CFPB) is a federal U.S. agency that acts as a mediator when disputes arise between financial institutions and consumers. Via a web form, consumers can send the agency a narrative of their dispute. 

This project uses Natural Language Processing (NLP) together with machine learning models to process the issue text written in each complaint, along with other features in the dataset, and predict whether the consumer will dispute.


*Industry use case:* An NLP and machine learning model would classify whether a consumer is likely to dispute with the company, helping the company prioritize complaints based on the prediction.


## Import

In [1]:
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import warnings
import os
warnings.filterwarnings("ignore")

%matplotlib inline
pd.set_option("display.max_columns", 50)

In [2]:
df = pd.read_csv("data/complaints.csv")

In [3]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2024-07-03,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,Experian Information Solutions Inc.,IL,60160,,,Web,2024-07-03,In progress,Yes,,9419821
1,2024-07-03,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,Experian Information Solutions Inc.,IL,60626,,,Web,2024-07-03,In progress,Yes,,9419624
2,2024-07-04,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,Experian Information Solutions Inc.,VA,23228,,,Web,2024-07-04,In progress,Yes,,9422216
3,2024-07-04,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",NE,68801,,,Web,2024-07-04,In progress,Yes,,9422225
4,2024-07-04,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,Experian Information Solutions Inc.,CA,91789,,,Web,2024-07-04,In progress,Yes,,9422229


In [4]:
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')

*target: Consumer disputed?*

In [5]:
df["Consumer disputed?"].value_counts(normalize=True)*100

No     80.687894
Yes    19.312106
Name: Consumer disputed?, dtype: float64
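
Note that `value_counts` ignores missing values by default, so the split above describes only the rows where the target is recorded (roughly 14% of the dataset, given the null counts reported further down). The following is a minimal sketch on a toy frame, with made-up values used purely for illustration, showing how `dropna=False` exposes the missing share alongside the two classes:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame(
    {"Consumer disputed?": ["No", "No", "Yes", np.nan, np.nan, np.nan, "No", np.nan]}
)

# Default behaviour: NaN rows are excluded before the shares are computed.
labelled_share = toy["Consumer disputed?"].value_counts(normalize=True) * 100

# dropna=False keeps NaN as its own category, so shares are over all rows.
full_share = toy["Consumer disputed?"].value_counts(normalize=True, dropna=False) * 100

print(labelled_share)
print(full_share)
```

Comparing the two outputs makes it clear whether a class ratio refers to labelled rows or to the whole dataset.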

In [6]:
df.shape

(5585491, 18)

In [10]:
df.isna().sum()

Date received                         0
Product                               0
Sub-product                      235295
Issue                                 6
Sub-issue                        740880
Consumer complaint narrative    3639944
Company public response         2884778
Company                               0
State                             46478
ZIP code                          30226
Tags                            5086244
Consumer consent provided?      1072644
Submitted via                         0
Date sent to company                  0
Company response to consumer         17
Timely response?                      0
Consumer disputed?              4817175
Complaint ID                          0
dtype: int64

In [11]:
df.replace('', np.nan, inplace=True)
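
The call above only matches cells that are exactly the empty string. If the data might also contain whitespace-only cells (an assumption, not something the outputs here confirm), a regex-based replacement is one possible variant. A minimal sketch on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

# Toy frame with an empty string and a whitespace-only string (illustrative values).
toy = pd.DataFrame(
    {
        "State": ["IL", "", "  "],
        "Tags": ["", "Older American", "Servicemember"],
    }
)

# ^\s*$ matches empty and whitespace-only strings; regex=True enables pattern matching.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

print(cleaned.isna().sum())
```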

In [14]:
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Date received,5585491.0,4599.0,2024-06-13,9241.0,,,,,,,
Product,5585491.0,21.0,"Credit reporting, credit repair services, or o...",2163873.0,,,,,,,
Sub-product,5350196.0,86.0,Credit reporting,3457073.0,,,,,,,
Issue,5585485.0,178.0,Incorrect information on your report,1671998.0,,,,,,,
Sub-issue,4844611.0,272.0,Information belongs to someone else,1107538.0,,,,,,,
Consumer complaint narrative,1945547.0,1553853.0,In accordance with the Fair Credit Reporting a...,9381.0,,,,,,,
Company public response,2700713.0,11.0,Company has responded to the consumer and the ...,2445853.0,,,,,,,
Company,5585491.0,7289.0,"EQUIFAX, INC.",1191654.0,,,,,,,
State,5539013.0,63.0,FL,679476.0,,,,,,,
ZIP code,5555265.0,33783.0,XXXXX,125043.0,,,,,,,


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5585491 entries, 0 to 5585490
Data columns (total 18 columns):
 #   Column                        Dtype 
---  ------                        ----- 
 0   Date received                 object
 1   Product                       object
 2   Sub-product                   object
 3   Issue                         object
 4   Sub-issue                     object
 5   Consumer complaint narrative  object
 6   Company public response       object
 7   Company                       object
 8   State                         object
 9   ZIP code                      object
 10  Tags                          object
 11  Consumer consent provided?    object
 12  Submitted via                 object
 13  Date sent to company          object
 14  Company response to consumer  object
 15  Timely response?              object
 16  Consumer disputed?            object
 17  Complaint ID                  int64 
dtypes: int64(1), object(17)
memory usage: 767.
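
Both date columns are loaded as `object`. The sketch below shows one way they could be parsed, assuming the ISO `YYYY-MM-DD` format seen in the `df.head()` output holds for every row. The toy frame and the `days_to_forward` column are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Toy frame mimicking the two date columns (illustrative values).
toy = pd.DataFrame(
    {
        "Date received": ["2024-07-03", "2024-07-04"],
        "Date sent to company": ["2024-07-03", "2024-07-06"],
    }
)

# Parse the strings into datetime64 values using an explicit format.
for col in ["Date received", "Date sent to company"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d")

# Hypothetical derived feature: days between receipt and forwarding to the company.
toy["days_to_forward"] = (toy["Date sent to company"] - toy["Date received"]).dt.days

print(toy.dtypes)
print(toy)
```

On the full dataset, passing `parse_dates=["Date received", "Date sent to company"]` to `pd.read_csv` would be an alternative to converting after loading.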

## Exploring the Data

In [16]:
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == "O"]

print(f"We have {len(numeric_features)} numeric features: {numeric_features}")
print(f"We have {len(categorical_features)} categorical features: {categorical_features}")

We have 1 numeric features: ['Complaint ID']
We have 17 categorical features: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?']


In [18]:
df.shape

(5585491, 18)

## Null Values per column

In [19]:
missing = df.isnull().sum().div(df.shape[0]).mul(100).to_frame().sort_values(by=0, ascending=False)
missing

Unnamed: 0,0
Tags,91.061717
Consumer disputed?,86.244432
Consumer complaint narrative,65.167843
Company public response,51.647707
Consumer consent provided?,19.204113
Sub-issue,13.264367
Sub-product,4.212611
State,0.83212
ZIP code,0.541152
Company response to consumer,0.000304
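
The missing-value percentages can also be used to pick drop candidates programmatically. The sketch below is not the rule used in this notebook: the 50% threshold and the toy data are assumptions chosen for illustration, and the target column is excluded explicitly because it is mostly missing as well.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (illustrative values).
toy = pd.DataFrame(
    {
        "Tags": [np.nan, np.nan, np.nan, "Servicemember"],
        "State": ["IL", "VA", np.nan, "CA"],
        "Company": ["A", "B", "C", "D"],
        "Consumer disputed?": [np.nan, np.nan, np.nan, "No"],
    }
)

# isna().mean() gives the fraction of missing values per column.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

THRESHOLD = 50  # hypothetical cut-off, in percent
TARGET = "Consumer disputed?"  # never drop the target, however sparse it is

to_drop = [col for col in missing_pct.index if missing_pct[col] > THRESHOLD and col != TARGET]

print(missing_pct)
print(to_drop)
```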


In [20]:
drop_columns = ["Tags", "Consumer complaint narrative", "Company public response", "Sub-issue", "Sub-product", "ZIP code", "Complaint ID"]
df.drop(drop_columns, axis=1, inplace=True)

Dropping columns with a lot of missing values, along with `ZIP code` and `Complaint ID`.

Number of unique values in each column

In [21]:
for col in df.columns:
    print(col, df[col].nunique())

Date received 4599
Product 21
Issue 178
Company 7289
State 63
Consumer consent provided? 4
Submitted via 7
Date sent to company 4548
Company response to consumer 8
Timely response? 2
Consumer disputed? 2
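
The same counts can be obtained without an explicit loop through `DataFrame.nunique`, which returns a Series that can be sorted. A minimal sketch on a toy frame with invented values:

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Product": ["Mortgage", "Mortgage", "Debt collection"],
        "Timely response?": ["Yes", "No", "Yes"],
        "State": ["IL", None, "CA"],
    }
)

# nunique() skips missing values by default, like Series.nunique() in the loop.
unique_counts = toy.nunique().sort_values(ascending=False)

# dropna=False counts NaN as an additional distinct value.
unique_counts_with_nan = toy.nunique(dropna=False)

print(unique_counts)
print(unique_counts_with_nan)
```

Sorting the result helps separate high-cardinality columns such as `Company` from low-cardinality ones such as `Timely response?` when choosing an encoding.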
