###### The Consumer Financial Protection Bureau (CFPB) is a federal U.S. agency that acts as a mediator when disputes arise between financial institutions and consumers. Via a web form, consumers can send the agency a narrative of their dispute.


###### This project uses Natural Language Processing (NLP) together with machine learning models to process the issue text written in each complaint, along with other features in the dataset, and predict whether the consumer will dispute or not.


###### Industry use case: An NLP + machine learning model would classify whether the consumer will dispute with the company or not, helping the company prioritize complaints based on the prediction.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

pd.set_option("display.max_columns", 50)


In [2]:
df=pd.read_csv('complaints.csv')

In [8]:
df.sample(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
1758277,2023-05-25,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,I am writing to dispute the following accounts...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,CA,93638,,Consent provided,Web,2023-05-25,Closed with explanation,Yes,,7030559
3354625,2024-02-18,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,According to the Fair Credit Reporting Act 15 ...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",CA,90018,,Consent provided,Web,2024-02-18,Closed with non-monetary relief,Yes,,8367896
4722910,2024-04-16,Credit reporting or other personal consumer re...,Credit reporting,Problem with a company's investigation into an...,Problem with personal statement of dispute,,,"EQUIFAX, INC.",FL,33545,,Consent not provided,Web,2024-04-16,Closed with non-monetary relief,Yes,,8784799
1400104,2024-02-20,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,LEXISNEXIS,IL,60411,,Consent not provided,Web,2024-02-20,Closed with non-monetary relief,Yes,,8382984
1611666,2024-03-15,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",TX,77545,,Consent not provided,Web,2024-03-15,Closed with non-monetary relief,Yes,,8548497


In [6]:
round(df['Consumer disputed?'].value_counts(normalize=True)*100)

Consumer disputed?
No     81.0
Yes    19.0
Name: proportion, dtype: float64
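
Note that `value_counts` skips missing values by default, so the 81/19 split above describes only the rows where `Consumer disputed?` is filled in. A minimal sketch on a made-up toy frame (the `toy` values are an assumption for illustration, not taken from `complaints.csv`) shows how passing `dropna=False` makes the unlabeled share visible:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame(
    {"Consumer disputed?": ["No", "No", "No", "Yes", np.nan, np.nan, np.nan, np.nan]}
)

# Default behaviour: NaN rows are excluded before the shares are computed
labeled_share = toy["Consumer disputed?"].value_counts(normalize=True).mul(100)

# dropna=False keeps NaN as its own category, so unlabeled rows are counted too
full_share = toy["Consumer disputed?"].value_counts(normalize=True, dropna=False).mul(100)

print(labeled_share)
print(full_share)
```

On the real data, the second form would show how much of the file carries no dispute label at all, which matters when deciding which rows can be used for training.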

In [10]:
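# Treat the literal string 'NaN' as a real missing value.
# Assigning the result back avoids in-place modification of a column selection,
# which triggers chained-assignment warnings in recent pandas versions.
df['Consumer complaint narrative'] = df['Consumer complaint narrative'].replace('NaN', np.nan)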

In [12]:
df.sample(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
4858387,2012-02-10,Mortgage,Second mortgage,"Loan modification,collection,foreclosure",,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",CA,91411,,,Referral,2012-02-10,Closed without relief,No,No,22595
2654106,2023-03-04,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",GA,30035,,Consent not provided,Web,2023-03-04,Closed with non-monetary relief,Yes,,6649196
4355824,2024-05-20,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",FL,33178,,,Web,2024-05-20,In progress,Yes,,9051865
2976059,2022-03-06,Checking or savings account,Checking account,Managing an account,Banking errors,On XX/XX/XXXX I applied and was approved for a...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",KY,41051,,Consent provided,Web,2022-03-06,Closed with monetary relief,Yes,,5290832
3780397,2021-06-08,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",NY,11239,,Consent not provided,Web,2021-06-08,Closed with explanation,Yes,,4441921


In [15]:
df.fillna('', inplace=True)

In [16]:
df.sample(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
3722235,2021-07-15,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Was not notified of investigation status or re...,These accounts are not mine. According to XXXX...,,"EQUIFAX, INC.",FL,34786,,Consent provided,Web,2021-07-15,Closed with explanation,Yes,,4546018
5278640,2024-03-19,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",FL,33805,Servicemember,Consent not provided,Web,2024-03-19,Closed with non-monetary relief,Yes,,8583424
4798715,2017-01-22,Student loan,Non-federal student loan,Dealing with my lender or servicer,Trouble with how payments are handled,I have been with Sallie Mae since XX/XX/XXXX w...,Company believes it acted appropriately as aut...,SLM CORPORATION,MO,64119,,Consent provided,Web,2017-01-22,Closed with monetary relief,Yes,No,2302991
4454330,2020-05-07,Checking or savings account,Checking account,Managing an account,Deposits and withdrawals,,,JPMORGAN CHASE & CO.,NY,10031,,,Referral,2020-05-07,Closed with explanation,Yes,,3641729
1269732,2023-07-25,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,I have previously disputed XXXX XXXX Bankruptc...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,MI,48197,,Consent provided,Web,2023-07-25,Closed with explanation,Yes,,7294815


In [17]:
df.replace('', np.nan, inplace=True)
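
The `fillna('')` and `replace('', np.nan)` round trip in cells 15 and 17 leaves every missing value as NaN again (plus any cells that were already empty strings). If the goal is only to make sure that placeholder strings such as `''` or `'NaN'` are treated as missing, a single `replace` call may be enough. A sketch on toy data, where both the column values and the list of placeholder spellings are assumptions:

```python
import numpy as np
import pandas as pd

# Invented rows that mix real text, empty strings, the literal 'NaN', and None
toy = pd.DataFrame({
    "Consumer complaint narrative": ["I am writing to dispute...", "", "NaN", None],
    "Tags": ["Servicemember", "", None, "NaN"],
})

# Assumed placeholder spellings; extend the list if the data uses others
placeholders = ["", "NaN"]

# One pass: every placeholder becomes a real missing value; toy itself is untouched
cleaned = toy.replace(placeholders, np.nan)

print(cleaned.isna().sum())
```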

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5479762 entries, 0 to 5479761
Data columns (total 18 columns):
 #   Column                        Dtype 
---  ------                        ----- 
 0   Date received                 object
 1   Product                       object
 2   Sub-product                   object
 3   Issue                         object
 4   Sub-issue                     object
 5   Consumer complaint narrative  object
 6   Company public response       object
 7   Company                       object
 8   State                         object
 9   ZIP code                      object
 10  Tags                          object
 11  Consumer consent provided?    object
 12  Submitted via                 object
 13  Date sent to company          object
 14  Company response to consumer  object
 15  Timely response?              object
 16  Consumer disputed?            object
 17  Complaint ID                  int64 
dtypes: int64(1), object(17)
memory usage: 752.

In [22]:
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

print('We have {} numerical features: {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features: {}'.format(len(categorical_features), categorical_features))


We have 1 numerical features: ['Complaint ID']

We have 17 categorical features: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?']
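
Because every column except `Complaint ID` is read as `object`, the two date columns are plain strings at this point. The sketch below, on invented rows, shows how the same numeric/non-numeric split can be obtained with `select_dtypes` and how the dates could be parsed with `pd.to_datetime`. The derived `days_to_send` column is a hypothetical name, not part of the dataset:

```python
import pandas as pd

# Invented rows using the notebook's column names
toy = pd.DataFrame({
    "Date received": ["2023-05-25", "2012-02-10"],
    "Date sent to company": ["2023-05-25", "2012-02-13"],
    "Product": ["Credit reporting", "Mortgage"],
    "Complaint ID": [1, 2],
})

# Split columns by dtype without looping over them
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Parse the ISO-formatted date strings into datetime64 columns
for col in ["Date received", "Date sent to company"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d")

# Hypothetical derived feature: days between receipt and forwarding to the company
toy["days_to_send"] = (toy["Date sent to company"] - toy["Date received"]).dt.days

print(numeric_cols, non_numeric_cols)
print(toy.dtypes)
```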


In [24]:
missing=df.isnull().sum().div(df.shape[0]).mul(100).to_frame().sort_values(by=0, ascending=True)
missing 

Unnamed: 0,0
Date received,0.0
Timely response?,0.0
Date sent to company,0.0
Submitted via,0.0
Company,0.0
Complaint ID,0.0
Product,0.0
Issue,0.000109
Company response to consumer,0.000292
ZIP code,0.551593


In [25]:
# Define a dictionary with the proportion of NaN values
nan_proportions = {
    'Issue': 0.000109,
    'Company response to consumer': 0.000292,
    'ZIP code': 0.551593,
    'State': 0.843960,
    'Sub-product': 4.293891,
    'Sub-issue': 13.494856,
    'Consumer consent provided?': 19.467926,
    'Company public response': 51.788910,
    'Consumer complaint narrative': 65.134927,
    'Consumer disputed?': 85.979026,
    'Tags': 90.962454
}
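
The percentages above are typed in by hand from the output of cell 24. An alternative that may be less error-prone is to derive the same mapping from the frame itself. The sketch below uses a made-up four-row frame, and `nan_proportions_from` is a hypothetical helper name:

```python
import pandas as pd

def nan_proportions_from(frame: pd.DataFrame) -> dict:
    """Return {column: percent missing} for columns with at least one NaN, sorted ascending."""
    pct = frame.isna().mean().mul(100)
    return pct[pct > 0].sort_values().to_dict()

# Invented rows: 'Company' is complete, 'State' has 1 gap, 'Tags' has 3 gaps
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "State": ["CA", None, "TX", "FL"],
    "Tags": [None, None, None, "Servicemember"],
})

print(nan_proportions_from(toy))
```

Computing the mapping this way keeps it in step with the data if `complaints.csv` is refreshed.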


In [26]:

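# Apply a rule per column based on its share of missing values
for column, proportion in nan_proportions.items():
    if proportion < 5:
        # Under 5% missing: label the gaps explicitly
        df[column] = df[column].fillna('Unknown')
    elif proportion < 50:
        # 5% to 50% missing: fill with the mode (text) or the mean (numeric)
        if df[column].dtype == 'O':
            df[column] = df[column].fillna(df[column].mode()[0])
        else:
            df[column] = df[column].fillna(df[column].mean())
    else:
        # 50% or more missing: drop the column.
        # Note: this includes the target 'Consumer disputed?' (about 86% missing).
        df = df.drop(columns=[column])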

print("\nDataFrame after handling NaN values based on specific proportions:")
print(df)


DataFrame after handling NaN values based on specific proportions:
        Date received                                            Product  \
0          2024-06-19  Credit reporting or other personal consumer re...   
1          2024-06-19  Credit reporting or other personal consumer re...   
2          2024-06-19  Credit reporting or other personal consumer re...   
3          2024-06-19  Credit reporting or other personal consumer re...   
4          2024-06-16                          Debt or credit management   
...               ...                                                ...   
5479757    2013-06-04                                        Credit card   
5479758    2011-12-30                                        Credit card   
5479759    2013-04-23                                           Mortgage   
5479760    2013-03-05                                           Mortgage   
5479761    2012-05-01                                        Credit card   
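
One side effect of the thresholds used above is worth noting: `Consumer disputed?` is about 86% missing, so it lands in the drop branch even though it is the target. A possible variant, sketched here on toy data, wraps the same thresholds in a function and lets the caller protect columns from being changed. The function name `handle_missing`, its parameters, and the `protect` idea are assumptions for illustration, not the notebook's method:

```python
import pandas as pd

def handle_missing(frame, low=5.0, high=50.0, protect=()):
    """Fill or drop columns by percent missing; columns in `protect` are left untouched."""
    out = frame.copy()
    pct_missing = out.isna().mean().mul(100)
    for column, pct in pct_missing.items():
        if pct == 0 or column in protect:
            continue
        if pct < low:
            # Few gaps: label them explicitly
            out[column] = out[column].fillna("Unknown")
        elif pct < high:
            # Moderate gaps: mean for numeric columns, mode otherwise
            if pd.api.types.is_numeric_dtype(out[column]):
                out[column] = out[column].fillna(out[column].mean())
            else:
                out[column] = out[column].fillna(out[column].mode()[0])
        else:
            # Mostly empty: drop the column
            out = out.drop(columns=[column])
    return out

# Invented 40-row frame covering each branch
toy = pd.DataFrame({
    "State": ["CA"] * 39 + [None],                                    # 2.5% missing
    "Sub-issue": ["Banking errors"] * 20 + ["Other"] * 10 + [None] * 10,  # 25% missing
    "Tags": ["Servicemember"] * 4 + [None] * 36,                      # 90% missing
    "Consumer disputed?": ["No"] * 5 + ["Yes"] + [None] * 34,         # 85% missing, protected
})

result = handle_missing(toy, protect=("Consumer disputed?",))
print(result.columns.tolist())
```

Whether unlabeled rows should then be filtered out or kept for another purpose is a separate modeling decision that this sketch does not make.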

                   

# Visualize the Target Feature (Customer Disputed)

In [None]:
percentage=df['']