# CFPB Consumer Complaints Modeling – Exploratory Data Analysis 

This notebook focuses on the analysis of financial complaints data provided by the Consumer Financial Protection Bureau (CFPB), which is publicly available on the CFPB's official website at https://www.consumerfinance.gov/data-research/consumer-complaints/. The dataset contains information about consumer complaints filed against financial companies and banks.



Understanding the dynamics of consumer complaints helps financial institutions improve customer experience and control operational costs. For each complaint, the dataset records details such as the date received, the company involved, the nature of the complaint, and the outcome. This analysis aims to identify patterns and factors that contribute to complaint disputes, so that financial institutions and policymakers can reduce operational costs and improve customer satisfaction.




The original dataset contains over 2 million recent records covering 6000+ financial providers. For this analysis, we will focus on data up to 2017 and specifically consider complaints related to the top 5 banks in the US. The dataset used for this project can be found in the JupyterHub shared folder (complaints_25Nov21.csv).




The cost structure associated with managing complaints is as follows:




- Resolving and closing a complaint that is not disputed costs banks $100 on average.
- If the customer disputes the resolution, resolving the complaint costs an additional $500, for a total of $600.
- To prevent disputes, banks can perform "extra diligence" at an additional cost of $90 per complaint. This cost is only justified if the complaint is likely to be disputed.
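
The cost structure above implies a break-even point for extra diligence. The sketch below works through that arithmetic. It assumes, as the confusion-matrix cost formula later in this notebook does, that extra diligence always prevents the dispute; the function name `expected_cost` is illustrative and not part of the original analysis.

```python
# Per-complaint costs taken from the cost structure above (USD).
BASE_COST = 100        # resolve and close an undisputed complaint
DISPUTE_COST = 500     # additional cost when the resolution is disputed
DILIGENCE_COST = 90    # extra diligence applied up front


def expected_cost(p_dispute: float, extra_diligence: bool) -> float:
    """Expected cost of one complaint given its dispute probability.

    Assumption: extra diligence always prevents the dispute.
    """
    if extra_diligence:
        return BASE_COST + DILIGENCE_COST
    return BASE_COST + p_dispute * DISPUTE_COST


# Diligence pays off once p * DISPUTE_COST exceeds DILIGENCE_COST.
break_even = DILIGENCE_COST / DISPUTE_COST
print(f"Break-even dispute probability: {break_even:.2f}")

for p in (0.10, 0.18, 0.30):
    print(p, expected_cost(p, False), expected_cost(p, True))
```

Under these assumptions, diligence is worthwhile only for complaints whose dispute probability exceeds 90 / 500 = 0.18. If the model's predicted probabilities were well calibrated, a cost-minimizing threshold near this value would be expected.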




The objective is to develop a model that can predict which complaints are likely to be disputed, allowing banks to proactively perform extra diligence on those complaints to reduce overall costs.




By leveraging predictive modeling techniques and data analysis, we aim to identify key indicators and patterns that can help predict dispute likelihood, thereby enabling banks to optimize their response strategies and minimize operational expenses.

- Explored the dataset and selected specific variables as predictors ('Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?') with 'Consumer disputed?' as the target variable.
- Converted the target variable into binary format (0 for non-disputed and 1 for disputed).
- Split the data into training (80%) and testing (20%) sets using train_test_split from sklearn.model_selection.
- Checked the proportion of disputed complaints in the training dataset.
- Applied random undersampling (RandomUnderSampler) if the proportion of disputed complaints was less than 30%.
- Trained an XGBoost Classifier (XGBClassifier) to predict complaint disputes on the training data.
- Evaluated the model performance on the test set using a classification report and confusion matrix.
- Calculated the total cost based on the default model predictions (using a fixed threshold of 0.5 for binary classification).
- Explored changing the classification threshold to minimize the total cost, considering the cost structure provided ($600 for disputed complaints, $100 for non-disputed complaints, and $90 for extra diligence per complaint).

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier

In [18]:
df = pd.read_csv('complaints_25Nov21.csv')
df

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2016-10-26,Money transfers,International money transfer,Other transaction issues,,"To whom it concerns, I would like to file a fo...",Company has responded to the consumer and the ...,"CITIBANK, N.A.",,,,Consent provided,Web,2016-10-29,Closed with explanation,Yes,No,2180490
1,2015-03-27,Bank account or service,Other bank product/service,"Account opening, closing, or management",,My name is XXXX XXXX XXXX and huband name is X...,Company chooses not to provide a public response,"CITIBANK, N.A.",PA,151XX,Older American,Consent provided,Web,2015-03-27,Closed with explanation,Yes,No,1305453
2,2015-04-20,Bank account or service,Other bank product/service,"Making/receiving payments, sending money",,XXXX 2015 : I called to make a payment on XXXX...,Company chooses not to provide a public response,U.S. BANCORP,PA,152XX,,Consent provided,Web,2015-04-22,Closed with monetary relief,Yes,No,1337613
3,2013-04-29,Mortgage,Conventional fixed mortgage,"Application, originator, mortgage broker",,,,JPMORGAN CHASE & CO.,VA,22406,Servicemember,,Phone,2013-04-30,Closed with explanation,Yes,Yes,393900
4,2013-05-29,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",GA,30044,,,Referral,2013-05-31,Closed with explanation,Yes,No,418647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207255,2015-05-24,Debt collection,Credit card,Taking/threatening an illegal action,Sued w/o proper notification of suit,,,JPMORGAN CHASE & CO.,FL,33133,,Consent not provided,Web,2015-05-24,Closed with explanation,Yes,No,1390395
207256,2012-01-10,Mortgage,Conventional fixed mortgage,"Loan modification,collection,foreclosure",,,,JPMORGAN CHASE & CO.,NY,10312,,,Referral,2012-01-11,Closed without relief,Yes,Yes,12192
207257,2012-07-17,Student loan,Non-federal student loan,Repaying your loan,,,,"BANK OF AMERICA, NATIONAL ASSOCIATION",NH,032XX,,,Web,2012-07-18,Closed with explanation,Yes,No,118351
207258,2016-09-29,Bank account or service,Checking account,"Account opening, closing, or management",,Near the end of XXXX 2016 I opened a Citigold ...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",CA,900XX,,Consent provided,Web,2016-09-29,Closed with non-monetary relief,Yes,No,2138969


In [19]:
# Load the dataset
complaints = pd.read_csv('complaints_25Nov21.csv')

In [20]:
# Select predictors and target variable
X = complaints[['Product', 'Sub-product', 'Issue', 'State', 'Tags', 
                'Submitted via', 'Company response to consumer', 'Timely response?']]
y = complaints['Consumer disputed?']

# Convert target variable to binary (0 for non-disputed, 1 for disputed)
le = LabelEncoder()
y = le.fit_transform(y)

# Split data into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Define categorical preprocessing steps
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Preprocess data using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the XGBoost model
model_xgb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(random_state=123))
])

# Train the XGBoost Classifier
model_xgb.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model_xgb.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Calculate total cost based on the default model predictions
# Base case (no model): $100 per undisputed complaint, $600 per disputed complaint
base_cost = (sum(y_test == 0) * 100) + (sum(y_test == 1) * 600)

# Model case: complaints predicted as disputed (fp, tp) receive extra diligence
# ($100 + $90 = $190), which is assumed to prevent the dispute; missed disputes (fn) cost $600
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
model_cost = tn * 100 + fp * 190 + fn * 600 + tp * 190

print(f"Base Case Total Cost: ${base_cost}")
print(f"Model Total Cost: ${model_cost}")

              precision    recall  f1-score   support

           0       0.78      1.00      0.88     32504
           1       0.49      0.00      0.01      8948

    accuracy                           0.78     41452
   macro avg       0.64      0.50      0.44     41452
weighted avg       0.72      0.78      0.69     41452

[[32463    41]
 [ 8909    39]]
Base Case Total Cost: $8619200
Model Total Cost: $8606900
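
The class-1 figures in the classification report follow directly from the confusion matrix printed above. As a small check, the sketch below recomputes them from the four counts; only the matrix values are taken from the output, everything else is illustrative.

```python
import numpy as np

# Confusion matrix from the output above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[32463, 41],
               [8909, 39]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # share of predicted disputes that were real
recall = tp / (tp + fn)           # share of real disputes that were caught
accuracy = (tn + tp) / cm.sum()   # share of all complaints classified correctly

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

The recall is close to zero because the default model flags only 80 of the 41,452 test complaints as disputed, which is why the accuracy of 0.78 is misleading for this problem.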


## Proportion of consumers who raised a dispute

In [22]:
# Calculate the total number of consumers in the test set
total_consumers = len(y_test)

# Count the number of consumers who raised a dispute in the test set
disputed_consumers = sum(y_test == 1)

# Calculate the proportion of consumers who raised a dispute
proportion_disputed = disputed_consumers / total_consumers

print(f"Proportion of consumers who raised a dispute in the test set: {proportion_disputed:.6f}")

Proportion of consumers who raised a dispute in the test set: 0.215864


## Proportion of consumers in the training dataset who raised a dispute

In [88]:
# Check the proportion of disputed complaints in the training set
disputed_proportion = (y_train == 1).mean()

# Apply random undersampling when disputed complaints make up less than 30%
# (RandomUnderSampler is already imported at the top of the notebook)
if disputed_proportion < 0.30:
    undersampler = RandomUnderSampler(random_state=123)
    X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)
    resampled_proportion = (y_train_resampled == 1).mean()
else:
    X_train_resampled, y_train_resampled = X_train, y_train
    resampled_proportion = disputed_proportion

X_train_resampled.shape, y_train_resampled.shape, resampled_proportion

((71910, 8), (71910,), 0.5)
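
The resampled proportion of 0.5 reflects what random undersampling does by default: it keeps every minority-class row and draws an equally sized random sample from the majority class. The sketch below illustrates that idea with plain NumPy on hypothetical toy labels; it is not the imblearn implementation, only a minimal illustration of the technique.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical toy labels: 80 non-disputed (0) and 20 disputed (1).
y_toy = np.array([0] * 80 + [1] * 20)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep all minority rows plus an equally sized random sample of majority rows.
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
kept = np.sort(np.concatenate([kept_majority, minority_idx]))

y_balanced = y_toy[kept]
print(y_balanced.size, y_balanced.mean())
```

The trade-off is that most majority-class rows are discarded, so the model sees a balanced but much smaller training set.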

In [25]:
## Recall for 'Consumer disputed?' = 'Yes' on the test set

In [47]:
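# The raw predictors are strings, so reuse the one-hot preprocessor defined earlier
undersampler = RandomUnderSampler(random_state=123)
X_train_resampled, y_train_resampled = undersampler.fit_resample(X_train, y_train)
model_xgb = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', XGBClassifier(random_state=123))])
model_xgb.fit(X_train_resampled, y_train_resampled)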

In [50]:
y_pred = model_xgb.predict(X_test)

# Generate classification report
report = classification_report(y_test, y_pred, target_names=le.classes_)
print(report)

# Compute recall for the positive class ('Consumer disputed?' = 'Yes', encoded as 1)
# directly from the predictions instead of parsing the report text
from sklearn.metrics import recall_score

recall_value = recall_score(y_test, y_pred, pos_label=1)
print(f"Recall for 'Consumer disputed?' = 'Yes' on the test set: {recall_value:.2f}")

              precision    recall  f1-score   support

          No       0.84      0.52      0.64     32504
         Yes       0.26      0.63      0.37      8948

    accuracy                           0.54     41452
   macro avg       0.55      0.57      0.51     41452
weighted avg       0.71      0.54      0.58     41452

Recall for 'Consumer disputed?' = 'Yes' on the test set: 0.63


In [51]:
## Total estimated cost to banks (no model)

In [56]:
# Reload the complaints data so this cell can run on its own
data = pd.read_csv('complaints_25Nov21.csv')

# Define the predictors (X) and target variable (y)
X = data[['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?']]
y = data['Consumer disputed?']

# Split data into train and test sets (80/20 split, same seed as before)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Keep only the complaints that fall in the test set
test_data = data.loc[X_test.index]

# Without a model, every complaint costs $100 to resolve,
# and each disputed complaint costs an additional $500
n_disputed = (test_data['Consumer disputed?'] == 'Yes').sum()
total_cost_no_model = 100 * len(test_data) + 500 * n_disputed

# Display the total cost without using a model
print(f"Total cost without using a model: ${total_cost_no_model}")

Total cost without using a model: $8619200


In [33]:
## Total cost to the banks (based on model predictions)

In [64]:
# Define categorical columns
cat_columns = ['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?']

# Extract categorical features and target
X_cat = data[cat_columns]
y = (data['Consumer disputed?'] == 'Yes').astype(int)

# Use pandas get_dummies to one-hot encode categorical columns
X_encoded = pd.get_dummies(X_cat, drop_first=True)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=123)

# Train XGBoost classifier
model_xgb = XGBClassifier(random_state=123)
model_xgb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_xgb.predict(X_test)

# Calculate total cost based on model predictions
total_cost = 0

for i in range(len(X_test)):
    if y_pred[i] == 1:  # Predicted as disputed
        total_cost += 90  # Cost of extra diligence to avoid dispute
        if y_test.iloc[i] == 1:  # Actual dispute
            total_cost += 500  # Additional cost if the dispute occurs
    else:  # Predicted as not disputed
        total_cost += 100  # Base resolution cost for non-disputed complaints

# Print the total cost
print(f"Total cost to the banks based on model predictions: ${total_cost}")

Total cost to the banks based on model predictions: $4157090


In [108]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Select features and target variable from the complaints DataFrame loaded earlier
X = complaints[['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?']]
y = complaints['Consumer disputed?']

# Map string labels to integer labels
y = y.map({'No': 0, 'Yes': 1})

# Perform one-hot encoding for categorical features
encoder = OneHotEncoder(drop='first')
X_encoded = encoder.fit_transform(X)

# Split data into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=123)

# Fit the XGBClassifier model
model_xgb = XGBClassifier(random_state=123)
model_xgb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_xgb.predict(X_test)

# Create the classification report and confusion matrix
report = classification_report(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

# Print the classification report and confusion matrix
print("Classification Report:")
print(report)
print("\nConfusion Matrix:")
print(matrix)

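# Calculate the total cost: complaints predicted as disputed (fp, tp) receive
# extra diligence ($100 + $90 = $190), missed disputes (fn) cost $600,
# and correctly predicted non-disputes (tn) cost $100
tn, fp, fn, tp = matrix.ravel()
total_cost = tn * 100 + fp * 190 + fn * 600 + tp * 190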
print(f'Total cost: ${total_cost}')

Classification Report:
              precision    recall  f1-score   support

           0       0.78      1.00      0.88     32504
           1       0.47      0.00      0.01      8948

    accuracy                           0.78     41452
   macro avg       0.63      0.50      0.44     41452
weighted avg       0.72      0.78      0.69     41452


Confusion Matrix:
[[32459    45]
 [ 8908    40]]
Total cost: $8606850


In [None]:
## The lowest total cost & the best threshold to the banks based on the observations in the test set

In [99]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix

# Define categorical columns that need to be encoded
categorical_columns = ['Product', 'Sub-product', 'Issue', 'State', 'Tags', 'Submitted via', 'Company response to consumer', 'Timely response?']

# Copy the original data and apply one-hot encoding
X_encoded = pd.get_dummies(X, columns=categorical_columns, drop_first=True)

# Split the encoded data into train and test sets
X_train_encoded, X_test_encoded, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=123)

# Train XGBoost Model
model_xgb = XGBClassifier(random_state=123)
model_xgb.fit(X_train_encoded, y_train)

# Predict probabilities on the test set
y_probs = model_xgb.predict_proba(X_test_encoded)[:, 1]

# Find the threshold that minimizes the total cost
min_cost = float('inf')
best_threshold = None

for threshold in np.linspace(0, 1, 101):
    y_pred_threshold = (y_probs > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold).ravel()
    current_cost = tn * 100 + fp * 190 + fn * 600 + tp * 190
    
    if current_cost < min_cost:
        min_cost = current_cost
        best_threshold = threshold

print(f"Lowest total cost: {min_cost} at threshold: {best_threshold}")

Lowest total cost: 7589900 at threshold: 0.19
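
To put the tuned threshold in context, the short calculation below compares the lowest cost found by the sweep with the no-model cost reported earlier in this notebook. Both figures are copied from the printed outputs; the variable names are illustrative.

```python
base_cost = 8_619_200  # no-model cost on the test set, from the earlier output
min_cost = 7_589_900   # lowest cost found by the threshold sweep above

# Absolute and relative savings from acting on the tuned threshold.
savings = base_cost - min_cost
print(f"Savings: ${savings:,} ({savings / base_cost:.1%} of the no-model cost)")
```

Because the threshold was selected on the same test set used to report the cost, this saving is likely optimistic; a separate validation split would give a less biased estimate.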
