## Introduction

In response to frequent customer complaints posted on social media platforms, our store has taken a proactive approach to mitigate dissatisfaction. To address this issue, we have devised a campaign aimed at reaching out to customers who are more likely to complain, with the intention of gathering their valuable feedback on improving our services.

To identify these customers, we will leverage their transaction history from the last 30 days and develop a machine learning model using Python. This model will predict whether a customer is likely to file a complaint by the end of the month or not. By utilizing historical transaction data, we aim to proactively engage with customers who are more prone to expressing their concerns, enabling us to address any issues and enhance their overall experience.

Through the application of machine learning techniques, we anticipate gaining valuable insights that will facilitate targeted communication and support our ongoing efforts to provide exceptional customer service.

## The Plan

1. Load our dataset
2. Data Exploration
3. Data pre-processing
4. Machine Learning
   1. Random Forest
   2. XGBoost (gradient boosting)
5. Conclusion

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import f_classif, mutual_info_classif
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix,classification_report
from sklearn.preprocessing import LabelEncoder

import statsmodels.api as sm
import scipy.stats as stats

from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import plotly.graph_objects as go

from scipy.stats import zscore
from sklearn.metrics import roc_curve, auc

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import cross_val_score, cross_val_predict


In [2]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

### 1. Dataset

In [3]:
df = pd.read_csv("data/store_complains_dataset.csv")
df.shape

(31924, 22)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31924 entries, 0 to 31923
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   customer_registration_number  31924 non-null  object 
 1   merchandize_category          26843 non-null  object 
 2   amount_deposited_via_counter  31924 non-null  float64
 3   amount_deposited_via_card     31924 non-null  float64
 4   balance_on_complaign_date     31924 non-null  float64
 5   transaction_date              31924 non-null  object 
 6   complaint_date                31924 non-null  object 
 7   restaurant_points             31924 non-null  int64  
 8   fuel_points                   31924 non-null  int64  
 9   groceries_points              31924 non-null  int64  
 10  toys_points                   31924 non-null  int64  
 11  cash_back_points              31924 non-null  int64  
 12  electronics                   31924 non-null  int64  
 13  c

In [5]:
df.head()

Unnamed: 0,customer_registration_number,merchandize_category,amount_deposited_via_counter,amount_deposited_via_card,balance_on_complaign_date,transaction_date,complaint_date,restaurant_points,fuel_points,groceries_points,toys_points,cash_back_points,electronics,complained,Order_tyPe,amount,quantity,card_vendor,used_coupon,product_discounted,cust_age,cust_gender
0,64257fd79a53006421b72c3f,Breakfast,0.0,779.0,1272.0,2023-03-15,2023-03-29,0,0,0,0,0,0,YES,Other,1.0,1.0,Visa,No,no,49,Female
1,64257ffc9a53006421b72c40,Frozen,0.0,536.0,928.0,2023-03-15,2023-03-29,0,1,0,0,0,0,YES,Pickup,3000.0,1.0,Visa,No,no,82,Female
2,642580229a53006421b72c41,Alcohol,0.0,330225.0,177850.0,2023-03-15,2023-03-29,0,0,0,0,1,0,NO,Walk In,22000.0,1.0,Visa,No,no,35,Male
3,642580469a53006421b72c42,Baking,0.0,6215561.04,301542.04,2023-03-15,2023-03-29,3,0,1,0,1,0,NO,Walk In,1600000.0,5.0,Mastercard,No,YEs,95,Female
4,642580469a53006421b72c42,Alcohol,0.0,6215561.04,301542.04,2023-03-15,2023-03-29,3,0,1,0,1,0,NO,Walk In,1600000.0,5.0,Mastercard,No,YEs,95,Female


In [6]:
df.describe()

Unnamed: 0,amount_deposited_via_counter,amount_deposited_via_card,balance_on_complaign_date,restaurant_points,fuel_points,groceries_points,toys_points,cash_back_points,electronics,amount,quantity,cust_age
count,31924.0,31924.0,31924.0,31924.0,31924.0,31924.0,31924.0,31924.0,31924.0,31711.0,31675.0,31924.0
mean,113725.2,674415.1,680655.4,0.242639,0.192488,0.37251,0.002913,0.334263,0.012874,3136570.0,3.137889,49.368688
std,4752873.0,6580391.0,6771707.0,1.016028,0.90984,1.345659,0.136032,1.328839,0.294554,328770000.0,4.782462,28.832312
min,-5900.0,-199679.3,-11567.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
25%,0.0,4100.0,1140.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,25.0
50%,0.0,4100.0,4793.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,49.0
75%,0.0,104100.0,122153.8,0.0,0.0,0.0,0.0,0.0,0.0,26000.0,3.25,75.0
max,700000000.0,700000000.0,700000000.0,59.0,23.0,27.0,22.0,26.0,16.0,50154410000.0,155.0,99.0


In [7]:
duplicate_rows_count = df.duplicated().sum()

print(f'Duplicates rows {duplicate_rows_count}')

Duplicates rows 0


In [8]:
df = df.dropna()
missing_values_count = df.isnull().sum()
missing_values_count

customer_registration_number    0
merchandize_category            0
amount_deposited_via_counter    0
amount_deposited_via_card       0
balance_on_complaign_date       0
transaction_date                0
complaint_date                  0
restaurant_points               0
fuel_points                     0
groceries_points                0
toys_points                     0
cash_back_points                0
electronics                     0
complained                      0
Order_tyPe                      0
amount                          0
quantity                        0
card_vendor                     0
used_coupon                     0
product_discounted              0
cust_age                        0
cust_gender                     0
dtype: int64

In [9]:
# Calculate Z-scores
df['Z_score_amount_deposited_via_counter'] = zscore(df['amount_deposited_via_counter'])
df['Z_score_amount_deposited_via_card'] = zscore(df['amount_deposited_via_card'])
df['Z_score_amount'] = zscore(df['amount'])

# Remove outliers
df = df[(np.abs(df['Z_score_amount_deposited_via_counter']) <= 3) & (np.abs(df['Z_score_amount_deposited_via_card']) <= 3) & (np.abs(df['Z_score_amount']) <= 3)]


df.describe()


Unnamed: 0,amount_deposited_via_counter,amount_deposited_via_card,balance_on_complaign_date,restaurant_points,fuel_points,groceries_points,toys_points,cash_back_points,electronics,amount,quantity,cust_age,Z_score_amount_deposited_via_counter,Z_score_amount_deposited_via_card,Z_score_amount
count,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0,26211.0
mean,15574.51,313081.4,389065.5,0.260845,0.201137,0.390447,0.002518,0.354088,0.012476,315940.2,2.767811,49.470451,-0.031203,-0.060673,-0.009001
std,193338.6,1167428.0,2773856.0,1.048056,0.897825,1.370454,0.058545,1.351031,0.295517,3637791.0,4.33326,28.864571,0.109883,0.242568,0.010112
min,-5900.0,-199679.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,-0.043408,-0.167214,-0.00988
25%,0.0,4100.0,1053.65,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,25.0,-0.040055,-0.124873,-0.00988
50%,0.0,4100.0,4100.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,49.0,-0.040055,-0.124873,-0.00988
75%,0.0,90590.0,104100.0,0.0,0.0,0.0,0.0,0.0,0.0,28000.0,3.0,75.0,-0.040055,-0.106902,-0.009802
max,5025500.0,15014650.0,220990800.0,59.0,22.0,27.0,3.0,23.0,16.0,190000000.0,155.0,99.0,2.816156,2.994015,0.51825
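
The 3-standard-deviation rule applied in the cell above can be sketched on a toy array. The values below are made up for illustration; the only thing carried over from the notebook is the standardization itself (population standard deviation, which is the default of `scipy.stats.zscore`).

```python
import numpy as np

# Hypothetical purchase amounts: twenty ordinary values and one extreme value
amounts = np.array([10.0] * 20 + [1000.0])

# Standardize with the population standard deviation (ddof=0),
# matching the default behaviour of scipy.stats.zscore
z_scores = (amounts - amounts.mean()) / amounts.std(ddof=0)

# Keep only the observations within 3 standard deviations of the mean
kept = amounts[np.abs(z_scores) <= 3]
print(len(amounts), len(kept))
```

Because the mean and standard deviation are themselves pulled by extreme values, a heavily skewed column such as `amount` can still contain very large values after a single pass, as the `max` row above suggests.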


### 2. Data Exploration

In [10]:
complained_counts = df['complained'].value_counts()

# Create the traces
trace = go.Bar(
    x=complained_counts.index,
    y=complained_counts.values
    # marker=dict(color='blue'),
)

data = [trace]

# Create the layout
layout = go.Layout(
    title='Distribution of Complaints',
    xaxis=dict(title='Complained'),
    yaxis=dict(title='Count'),
)

# Create the figure
fig = go.Figure(data=data, layout=layout)

# Show the plot
fig.show()

In our dataset, customers who complained outnumber those who did not. The distribution is not perfectly even, but the imbalance is not a concern at this stage. As the analysis progresses, we will assess whether any action is needed to address it.

#### 1. Deposit Method and Complaints

In [11]:
le = LabelEncoder()
df['complained_num'] = le.fit_transform(df['complained'])

df.loc[df['amount_deposited_via_card'] > df['amount_deposited_via_counter'], 'deposit_method'] = 'card'
df.loc[df['amount_deposited_via_card'] < df['amount_deposited_via_counter'], 'deposit_method'] = 'counter'

complaints_by_deposit_method = df.groupby('deposit_method')['complained_num'].mean() * 100
complaints_by_deposit_method = complaints_by_deposit_method.reset_index()

fig = px.bar(complaints_by_deposit_method, x='deposit_method', y='complained_num', 
             labels={'deposit_method': 'Deposit Method', 'complained_num': 'Complaint(%)'},
             title='Complaint Percentage by Deposit Method')

fig.show()


Based on the above visualization, customers who deposited mostly by card are more likely to complain than those who deposited mostly at the counter. This could be related to the technology involved in card transactions, which might still be new or unfamiliar to some of our customers. However, this observation alone does not imply that counter deposits are superior to card deposits.
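
One detail of the labelling cell above: rows where the card and counter amounts are equal match neither condition, so their `deposit_method` stays missing and `groupby` leaves them out by default. A minimal sketch that makes ties explicit with `np.select`, using made-up rows; the `'equal'` label is an illustrative assumption, not a category from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names follow the notebook
toy = pd.DataFrame({
    "amount_deposited_via_card": [500.0, 0.0, 100.0],
    "amount_deposited_via_counter": [0.0, 300.0, 100.0],
})

conditions = [
    toy["amount_deposited_via_card"] > toy["amount_deposited_via_counter"],
    toy["amount_deposited_via_card"] < toy["amount_deposited_via_counter"],
]

# 'equal' is an illustrative label for ties, which would otherwise stay unlabelled
toy["deposit_method"] = np.select(conditions, ["card", "counter"], default="equal")
print(toy["deposit_method"].tolist())
```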

#### 2. Spending Behavior and Complaints

In [12]:
# Boxplot of amount for complainers vs non-complainers
fig3 = px.box(df, x='complained', y='amount',
              # Label keys must match the column names exactly ('amount', not 'Amount')
              labels={'complained': 'Complained',
                      'amount': 'Amount'},
              title='Purchase amount for Complainers vs Non-Complainers')
fig3.show()

We observe that customers with smaller purchases are more likely to complain. One possible explanation is that they have higher expectations for the quality of lower-cost products, which, if not met, could lead to dissatisfaction and complaints. There is also a common perception that lower spending corresponds to lower quality, which might increase complaints. These are assumptions, however; a more detailed analysis of the data and its context would be required to determine the actual cause.

#### 3. Point Accumulation and Complaints

In [13]:
# Create a new column for total points
df['total_points'] = df['restaurant_points'] + df['fuel_points'] + df['groceries_points'] + df['toys_points'] + df['cash_back_points']

# Plot the data
fig = px.scatter(df, x='total_points', y='complained_num',
                 labels={'total_points': 'Total Points', 'complained_num': 'Complaint Status'},
                 title='Complaint Status by Total Points',
                 color='complained_num',  # This will color the points based on complaint status
                 opacity=0.7)  # This makes points semi-transparent to see overlap

fig.show()


Customers with more points may be more satisfied, engaged, and loyal, because these points could be viewed as rewards or discounts. They might also be enjoying a higher quality of service or product, which comes with earning more points. All of these factors could lead to fewer complaints. However, remember that correlation does not imply causation, and further investigation is necessary to confirm these hypotheses.

#### 4. Card Vendor and Complaints

In [14]:
# Group by card vendor and calculate the percentage of complaints for each
complaints_by_card_vendor = df.groupby('card_vendor')['complained_num'].mean() * 100
complaints_by_card_vendor = complaints_by_card_vendor.reset_index()

# Plot the data
fig = px.bar(complaints_by_card_vendor, x='card_vendor', y='complained_num',
             labels={'card_vendor': 'Card Vendor', 'complained_num': 'Complaint Percentage'},
             title='Complaint Percentage by Card Vendor')

fig.show()


This visualization presents intriguing insights: customers using Visa appear to be much more likely to lodge complaints compared to those using Mastercard. The gap is substantial, warranting further exploration. While it's only speculative at this point, one hypothesis could be that the technology differences between Visa and Mastercard may be contributing to this discrepancy in complaint rates.

#### 5. Usage of Coupons and Complaints

In [15]:
df['used_coupon_num'] = le.fit_transform(df['used_coupon'])

# Group by coupon usage and calculate the percentage of complaints for each
complaints_by_coupon_usage = df.groupby('used_coupon_num')['complained_num'].mean() * 100
complaints_by_coupon_usage = complaints_by_coupon_usage.reset_index()

# Plot the data
fig = px.bar(complaints_by_coupon_usage, x='used_coupon_num', y='complained_num',
             labels={'used_coupon_num': 'Used Coupon', 'complained_num': 'Complaint(%)'},
             title='Complaint Percentage by Coupon Usage',
             category_orders={'used_coupon_num': [0, 1]})  # This ensures that the order of the x-axis is [No, Yes]

fig.show()


The observation that customers who don't use coupons complain more might be attributed to several factors. Those who use coupons might perceive that they're getting a better deal and hence have a more positive experience, reducing the likelihood of complaints. Coupon users could also be more price-sensitive and feel less inclined to complain if they believe they're receiving a bargain. Moreover, using a coupon might indicate a higher level of engagement with the product or service, and a more rewarding experience, both of which could lead to fewer complaints. To validate these assumptions and gain deeper insights, further investigation such as customer surveys could be beneficial.

#### 6. Product Discount and Complaints

In [16]:
df['product_discounted_num'] = le.fit_transform(df['product_discounted'])

# Group by product discount status and calculate the percentage of complaints for each
complaints_by_discount_status = df.groupby('product_discounted_num')['complained_num'].mean() * 100
complaints_by_discount_status = complaints_by_discount_status.reset_index()

# Plot the data
fig = px.bar(complaints_by_discount_status, x='product_discounted_num', y='complained_num',
             labels={'product_discounted_num': 'Product Discounted', 'complained_num': 'Complaint Percentage'},
             title='Complaint Percentage by Product Discount Status',
             category_orders={'product_discounted_num': [0, 1]})  # This ensures that the order of the x-axis is [No, Yes]

fig.show()


Customers buying discounted products may be more price-sensitive, have higher expectations, or perceive the product to be of lower quality, leading to more complaints. They may also be under financial stress, making them more likely to complain about any perceived faults. Remember, correlation does not imply causation, and other factors could influence this observation.
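
The sample rows printed by `df.head()` show `product_discounted` spelled both `no` and `YEs`, so the column may mix letter cases. `LabelEncoder` assigns one integer per distinct string, which means differently cased spellings of the same answer would end up as separate classes. A minimal sketch of normalizing before encoding follows; the values are made up and the explicit `{"no": 0, "yes": 1}` mapping is an assumption about what the column contains.

```python
import pandas as pd

# Hypothetical raw values with inconsistent casing and stray whitespace
raw = pd.Series(["no", "YEs", "No", "yes", " NO "])
print(raw.nunique())  # number of distinct spellings before cleaning

# Trim whitespace and lower-case so each answer has a single spelling
normalized = raw.str.strip().str.lower()

# Explicit mapping keeps the meaning of 0 and 1 visible in the code
encoded = normalized.map({"no": 0, "yes": 1})
print(encoded.tolist())
```

Checking `df['product_discounted'].unique()` on the real data would show whether this step is needed at all.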

#### 7. Merchandize Category and Complaints

In [17]:
# Group by merchandize category and calculate the percentage of complaints for each
complaints_by_merchandize = df.groupby('merchandize_category')['complained_num'].mean() * 100
complaints_by_merchandize = complaints_by_merchandize.reset_index()

# Sort the DataFrame in ascending order of complaint percentage
complaints_by_merchandize = complaints_by_merchandize.sort_values('complained_num')

# Plot the data
fig = px.bar(complaints_by_merchandize, x='merchandize_category', y='complained_num',
             labels={'merchandize_category': 'Merchandize Category', 'complained_num': 'Complaint(%)'},
             title='Complaint Percentage by Merchandize Category')

fig.show()


The plot indicates a higher complaint rate among customers who purchase breakfast products. This trend warrants a deeper examination, such as identifying the specific breakfast items most frequently linked to customer dissatisfaction. By pinpointing the problematic products, we can better understand and address the root causes of these complaints.

#### 8. Order Type and Complaints

In [18]:
# Group by order type and calculate the percentage of complaints for each
complaints_by_order_type = df.groupby('Order_tyPe')['complained_num'].mean() * 100
complaints_by_order_type = complaints_by_order_type.reset_index()

# Sort the DataFrame in ascending order of complaint percentage
complaints_by_order_type = complaints_by_order_type.sort_values('complained_num')

# Plot the data
fig = px.bar(complaints_by_order_type, x='Order_tyPe', y='complained_num',
             labels={'Order_tyPe': 'Order Type', 'complained_num': 'Complaint(%)'},
             title='Complaint Percentage by Order Type')

fig.show()


The visualizations above present noteworthy insights. First, it's promising to see fewer complaints from customers who engage with us directly - either by using our application, visiting us in person, or utilizing our website for pick-ups - compared to those who order via third-party platforms. However, a category labeled 'Other' appears to generate a significant number of complaints, warranting further investigation. These observations suggest a potential strategy to reduce complaints: actively encourage customers to utilize our app for their orders.

#### 9. Customer Demographics and Complaints

In [19]:

df_complained = df[df['complained_num'] == 1]

# Plot the density of age for customers who complained
fig = px.histogram(df_complained, x="cust_age", histnorm='probability density',
                   labels={'cust_age': 'Customer Age'},
                   title='Density of Customer Age for Complaints')

fig.update_traces(opacity=0.7) # adjust the opacity of the area to your liking
fig.show()


# Calculate the percentage of complaints for each gender
complaints_by_gender = df.groupby('cust_gender')['complained_num'].mean() * 100
complaints_by_gender = complaints_by_gender.reset_index()

# Plot the percentage of complaints by gender
fig = px.bar(complaints_by_gender, x='cust_gender', y='complained_num',
             labels={'cust_gender': 'Customer Gender', 'complained_num': 'Complaint Percentage'},
             title='Complaint Percentage by Customer Gender')
fig.show()


### 3. Data pre-processing

#### 1. Null Values

In [20]:
df = pd.read_csv("data/store_complains_dataset.csv")
df.shape

(31924, 22)

In [21]:
missing_values_count = df.isnull().sum()
missing_values_count

customer_registration_number       0
merchandize_category            5081
amount_deposited_via_counter       0
amount_deposited_via_card          0
balance_on_complaign_date          0
transaction_date                   0
complaint_date                     0
restaurant_points                  0
fuel_points                        0
groceries_points                   0
toys_points                        0
cash_back_points                   0
electronics                        0
complained                         0
Order_tyPe                         0
amount                           213
quantity                         249
card_vendor                        0
used_coupon                        0
product_discounted                 0
cust_age                           0
cust_gender                        0
dtype: int64

In [22]:
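# Imputation of null values

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original DataFrame
df['merchandize_category'] = df['merchandize_category'].fillna('Missing')

mean = df['amount'].mean()
df['amount'] = df['amount'].fillna(mean)

mean = df['quantity'].mean()
df['quantity'] = df['quantity'].fillna(mean)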


missing_values_count = df.isnull().sum()
missing_values_count

customer_registration_number    0
merchandize_category            0
amount_deposited_via_counter    0
amount_deposited_via_card       0
balance_on_complaign_date       0
transaction_date                0
complaint_date                  0
restaurant_points               0
fuel_points                     0
groceries_points                0
toys_points                     0
cash_back_points                0
electronics                     0
complained                      0
Order_tyPe                      0
amount                          0
quantity                        0
card_vendor                     0
used_coupon                     0
product_discounted              0
cust_age                        0
cust_gender                     0
dtype: int64

#### 2. Remove Outliers

In [23]:
# Calculate Z-scores
df['Z_score_amount_deposited_via_counter'] = zscore(df['amount_deposited_via_counter'])
df['Z_score_amount_deposited_via_card'] = zscore(df['amount_deposited_via_card'])
df['Z_score_amount'] = zscore(df['amount'])
df['Z_score_cust_age'] = zscore(df['cust_age'])
df['Z_score_balance_on_complaign_date'] = zscore(df['balance_on_complaign_date'])


# Remove outliers: keep rows where every screened column is within 3 standard deviations
df = df[(np.abs(df['Z_score_balance_on_complaign_date']) <= 3)
        & (np.abs(df['Z_score_cust_age']) <= 3)
        & (np.abs(df['Z_score_amount_deposited_via_counter']) <= 3)
        & (np.abs(df['Z_score_amount_deposited_via_card']) <= 3)
        & (np.abs(df['Z_score_amount']) <= 3)]


# Drop the helper columns once the filter has been applied
df = df.drop(['Z_score_amount_deposited_via_card', 'Z_score_amount_deposited_via_counter', 'Z_score_amount', 'Z_score_cust_age', 'Z_score_balance_on_complaign_date'], axis=1)

df.shape

(31690, 22)

#### 3. Feature Engineering

In [24]:
# Adding column total points and total deposit
df['total_points'] = df['restaurant_points'] + df['fuel_points'] + df['groceries_points'] + df['toys_points'] + df['cash_back_points']

df['total_deposit'] = df['amount_deposited_via_counter'] + df['amount_deposited_via_card']

In [25]:
# Create column days_between_transaction_complaint from the transaction and complaint dates

df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df['complaint_date'] = pd.to_datetime(df['complaint_date'])

# Extract month and day of the week from transaction_date
df['transaction_month'] = df['transaction_date'].dt.month
df['transaction_day_of_week'] = df['transaction_date'].dt.dayofweek

# Extract month and day of the week from complaint_date
df['complaint_month'] = df['complaint_date'].dt.month
df['complaint_day_of_week'] = df['complaint_date'].dt.dayofweek

# Calculate the difference in days between transaction_date and complaint_date
df['days_between_transaction_complaint'] = (df['complaint_date'] - df['transaction_date']).dt.days

df = df.drop(['complaint_day_of_week', 'complaint_month', 'transaction_day_of_week', 'transaction_month'], axis=1)


In [26]:
# One hot encoding for categorical columns
df = pd.get_dummies(df, columns=['merchandize_category', 'card_vendor', 'cust_gender', 'Order_tyPe'], drop_first=True)

In [27]:
df.shape

(31690, 52)

#### 4. Cleanup and Dataset Split

In [28]:
le = LabelEncoder()
df['used_coupon_num'] = le.fit_transform(df['used_coupon'])
df['complained_num'] = le.fit_transform(df['complained'])
df['product_discounted_num'] = le.fit_transform(df['product_discounted'])


In [29]:
df = df.drop(['customer_registration_number', 'transaction_date', 'complaint_date',  'complained', 'used_coupon' , 'product_discounted'], axis=1)

In [30]:
df.shape

(31690, 49)

In [31]:
# Splitting our dataset to train and test
X_train, X_test, y_train, y_test = train_test_split(df, df["complained_num"], test_size=0.3, random_state=22)

X_train.groupby(['complained_num',]).size()

complained_num
0     8882
1    13301
dtype: int64
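
The class counts above give a useful reference point: a model that always predicts the majority class would already be right for roughly 60% of the training rows. A small sketch of that arithmetic, using the counts printed above:

```python
# Training-set class counts taken from the output above
n_no_complaint = 8882
n_complaint = 13301

total = n_no_complaint + n_complaint

# Accuracy of a trivial model that always predicts the larger class
majority_baseline = max(n_no_complaint, n_complaint) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Accuracy figures reported later are easier to interpret against this baseline.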

In [32]:
X_train = X_train.drop('complained_num', axis=1)
X_test = X_test.drop('complained_num', axis=1)

#### 5. Resampling

We can balance the dataset by either oversampling the minority class or undersampling the majority class. Oversampling can be done by duplicating examples from the minority class, whereas undersampling deletes instances from the majority class. A popular oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which synthesizes new minority-class examples rather than duplicating existing ones. The SMOTE step in the next cell is commented out, so the models in this notebook are trained on the original class distribution.

In [33]:
# X = df.drop('complained_num', axis=1)  # Replace 'complaint' with the actual name of the target variable column
# y = df['complained_num']

# Oversampling using SMOTE
# smote = SMOTE(random_state=42)
# X_train, y_train = smote.fit_resample(X_train, y_train)
# X_train.shape
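
The simplest form of oversampling mentioned above, duplicating minority-class rows, can be sketched with pandas alone. This is plain random oversampling, not SMOTE, and the toy frame and its `feature` column are made up for illustration. Resampling would normally be applied to the training split only, so that the test set keeps the original class distribution.

```python
import pandas as pd

# Hypothetical training frame: 3 rows of class 0 and 7 rows of class 1
train = pd.DataFrame({
    "feature": range(10),
    "complained_num": [0] * 3 + [1] * 7,
})

counts = train["complained_num"].value_counts()
minority_label = counts.idxmin()
n_extra = counts.max() - counts.min()

# Draw minority rows with replacement until both classes have the same size
extra = train[train["complained_num"] == minority_label].sample(
    n=n_extra, replace=True, random_state=42
)
balanced = pd.concat([train, extra], ignore_index=True)
print(balanced["complained_num"].value_counts().to_dict())
```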


### 4. Machine Learning

#### 1. Random Forest with Cross Validation

In [34]:
# Define the random forest model
rf = RandomForestClassifier()

# Define the grid search parameters
param_grid = {'n_estimators': [50, 100, 150],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}

# Define the cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Perform grid search with cross-validation
rf_model = GridSearchCV(estimator=rf, param_grid=param_grid, cv=cv, n_jobs=-1, verbose=1)

rf_model.fit(X_train, y_train)

Fitting 10 folds for each of 81 candidates, totalling 810 fits
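
The candidate count in the log line above follows directly from the grid: every combination of the listed values is tried, and each combination is fitted once per fold. A short sketch of that arithmetic, reusing the parameter grid from the cell above:

```python
import math

# Same grid as in the random forest cell above
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Number of combinations is the product of the number of values per parameter
n_candidates = math.prod(len(values) for values in param_grid.values())
n_folds = 10
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best combination once on the full training set after the search.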


In [35]:
y_rf_pred = rf_model.predict(X_test)
print(classification_report(y_test,y_rf_pred))

              precision    recall  f1-score   support

           0       0.79      0.66      0.72      3792
           1       0.80      0.88      0.84      5715

    accuracy                           0.79      9507
   macro avg       0.79      0.77      0.78      9507
weighted avg       0.79      0.79      0.79      9507



In [36]:
print (f'Train Accuracy - : {rf_model.score(X_train,y_train):.3f}')
print (f'Test Accuracy - : {rf_model.score(X_test,y_test):.3f}')

Train Accuracy - : 0.910
Test Accuracy - : 0.794


In [37]:
# Get predicted probabilities for the positive class
y_score = rf_model.predict_proba(X_test)[:,1]

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

# Create a trace for the ROC curve
trace0 = go.Scatter(x=fpr, y=tpr, mode='lines', name='ROC curve (area = %0.2f)' % roc_auc)

# Create a trace for the random guessing line
trace1 = go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Random Guessing')

# Define the layout
layout = go.Layout(title='ROC Curve - Random Forest',
                   xaxis=dict(title='False Positive Rate'),
                   yaxis=dict(title='True Positive Rate'),
                   showlegend=True)

# Define the figure
fig = go.Figure(data=[trace0, trace1], layout=layout)

# Plot the figure
fig.show()

In [38]:
# Get the best estimator
best_rf_model = rf_model.best_estimator_

# Get feature importances from the best model
importances = best_rf_model.feature_importances_

# Convert the importances into a pandas Series indexed by the feature names
f_importances = pd.Series(importances, X_train.columns)

# Create a DataFrame for the importances
df_importances = pd.DataFrame({'Features': f_importances.index, 'Importance': f_importances.values})

# Sort the DataFrame by importance and get the top 10
df_importances = df_importances.sort_values(by='Importance', ascending=False).head(10)

# Plot using Plotly Express
fig = px.bar(df_importances, x='Importance', y='Features', orientation='h', title='Feature Importances')

# Show the plot
fig.show()

#### 2. XGBoost

In [39]:
xg_model = XGBClassifier(
    objective= 'binary:logistic',
    nthread=4,
    seed=42
)


parameters = {
    'max_depth': range (2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}

xg_classfier = GridSearchCV(
    estimator=xg_model,
    param_grid=parameters,
    scoring = 'roc_auc',
    n_jobs = 10,
    cv = 10,
    verbose=True
)

xg_classfier.fit(X_train,y_train)

Fitting 10 folds for each of 96 candidates, totalling 960 fits


In [40]:
y_xg_pred = xg_classfier.predict(X_test)
print(classification_report(y_test,y_xg_pred))

              precision    recall  f1-score   support

           0       0.80      0.68      0.74      3792
           1       0.81      0.89      0.85      5715

    accuracy                           0.81      9507
   macro avg       0.81      0.79      0.79      9507
weighted avg       0.81      0.81      0.80      9507



In [41]:
# GridSearchCV was configured with scoring='roc_auc', so score() returns ROC AUC, not accuracy
print (f'Train ROC AUC - : {xg_classfier.score(X_train,y_train):.3f}')
print (f'Test ROC AUC - : {xg_classfier.score(X_test,y_test):.3f}')

Train ROC AUC - : 0.959
Test ROC AUC - : 0.860
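
When a `GridSearchCV` object is created with a `scoring` argument, its `score()` method reports that metric rather than accuracy. Since the XGBoost search above uses `scoring='roc_auc'`, the two numbers printed by this cell are ROC AUC values. The sketch below illustrates the behaviour on synthetic data; the dataset, the logistic regression estimator, and the tiny grid are stand-ins chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)

# score() follows the `scoring` argument, so it matches ROC AUC, not accuracy
reported = search.score(X_te, y_te)
auc_value = roc_auc_score(y_te, search.decision_function(X_te))
acc_value = accuracy_score(y_te, search.predict(X_te))
print(f"score()={reported:.3f}  ROC AUC={auc_value:.3f}  accuracy={acc_value:.3f}")
```

To report accuracy for a search configured this way, `accuracy_score(y_test, predictions)` can be called on the predicted labels directly.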


In [42]:
# Predict probabilities
probs = xg_classfier.predict_proba(X_test)
probs = probs[:, 1] # Keep probabilities for the positive outcome only

# Compute ROC curve (Receiver Operating Characteristic)
fpr, tpr, thresholds = roc_curve(y_test, probs)

# Get the best threshold: Youden's J statistic
J = tpr - fpr
idx = np.argmax(J)
best_thresh = thresholds[idx]
print('Best Threshold=%f' % (best_thresh))

# Now, to make predictions based on the new threshold
new_predictions = (xg_classfier.predict_proba(X_test)[:,1] >= best_thresh).astype(int)

print(classification_report(y_test,new_predictions))

Best Threshold=0.575830
              precision    recall  f1-score   support

           0       0.76      0.73      0.75      3792
           1       0.83      0.85      0.84      5715

    accuracy                           0.80      9507
   macro avg       0.80      0.79      0.79      9507
weighted avg       0.80      0.80      0.80      9507



In [43]:
# Get predicted probabilities for the positive class
y_score = xg_classfier.predict_proba(X_test)[:,1]

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

# Create a trace for the ROC curve
trace0 = go.Scatter(x=fpr, y=tpr, mode='lines', name='ROC curve (area = %0.2f)' % roc_auc)

# Create a trace for the random guessing line
trace1 = go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Random Guessing')

# Define the layout
layout = go.Layout(title='ROC Curve - XGBoost',
                   xaxis=dict(title='False Positive Rate'),
                   yaxis=dict(title='True Positive Rate'),
                   showlegend=True)

# Define the figure
fig = go.Figure(data=[trace0, trace1], layout=layout)

# Plot the figure
fig.show()

In [44]:
# Get the best estimator
best_xg_model = xg_classfier.best_estimator_

# Get feature importances from the best model
importances = best_xg_model.feature_importances_

# Convert the importances into a pandas Series indexed by the feature names
f_importances = pd.Series(importances, X_train.columns)

# Create a DataFrame for the importances
df_importances = pd.DataFrame({'Features': f_importances.index, 'Importance': f_importances.values})

# Sort the DataFrame by importance and get the top 10
df_importances = df_importances.sort_values(by='Importance', ascending=False).head(10)

# Plot using Plotly Express
fig = px.bar(df_importances, x='Importance', y='Features', orientation='h', title='Feature Importances')

# Show the plot
fig.show()

In [46]:
# GridSearchCV was configured with scoring='roc_auc', so score() returns ROC AUC, not accuracy
print (f'Train ROC AUC - : {xg_classfier.score(X_train,y_train):.3f}')
print (f'Test ROC AUC - : {xg_classfier.score(X_test,y_test):.3f}')

# accuracy = accuracy_score(y_test, new_predictions)


Train ROC AUC - : 0.959
Test ROC AUC - : 0.860


### 5. Conclusion
Our XGBoost model shows reasonably strong performance. Because the grid search was scored with ROC AUC, the 0.860 reported on the test set is a ROC AUC rather than an accuracy. The classification report at the default threshold gives an overall accuracy of 81%, meaning that 81% of all test predictions were correct. A more nuanced view comes from the precision, recall, and F1-scores of each class.

Class 0, presumably representing customers who will not complain, displays a precision of 0.80, which indicates that when the model predicts that a customer will not complain, it is correct 80% of the time. This class has a recall of 0.68, which means that out of all the customers who did not complain, the model successfully identified 68% of them. The F1-score for this class, a harmonic mean of precision and recall, is 0.74.

Class 1, presumably representing customers who will complain, has a higher performance. It exhibits a precision of 0.81, indicating that the model's prediction is correct 81% of the time when it predicts that a customer will complain. The recall is 0.89, meaning the model identified 89% of all the customers who complained. The F1-score for this class is 0.85.
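
The F1-scores quoted above can be reproduced from the reported precision and recall, since F1 is their harmonic mean. A small sketch using the rounded values from the classification report (the helper name `f1_score_from` is illustrative):

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall taken from the XGBoost classification report
f1_no_complaint = f1_score_from(0.80, 0.68)
f1_complaint = f1_score_from(0.81, 0.89)
print(f"{f1_no_complaint:.2f} {f1_complaint:.2f}")
```

The harmonic mean sits closer to the smaller of the two inputs, which is why the lower recall of class 0 pulls its F1-score below its precision.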

Overall, the model appears to be more successful at identifying customers who will complain compared to those who won't. The model's lower recall for the 'no complaint' category (0.68) suggests it may be falsely identifying some customers who won't complain as those who will. It might be beneficial to look into ways of improving this. Nevertheless, the results show promising utility for this model in predicting customer complaints and can be a valuable asset in proactive customer service measures. However, as with all models, its performance should be continuously monitored and tuned over time to ensure its effectiveness.