In [None]:
#pip install pandas

In [None]:
#pip install matplotlib

In [None]:
#pip install kagglehub

In [None]:
#pip install seaborn

In [None]:
#pip install scikit-learn

In [None]:
#pip install nbstripout


In [None]:
#data eda/visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#data modeling
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

#for github
import nbstripout

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("willianoliveiragibin/customer-churn")

print("Path to dataset files:", path)

#resources:
# https://www.datacamp.com/tutorial/understanding-logistic-regression-python

EDA (Exploratory Data Analysis)

In [None]:
data = pd.read_csv('Customer Churn new.csv')

In [None]:
data.head()

The counts below indicate that no customer appears in more than one row. Each customer has exactly one row of data.

In [None]:
data.CustomerId.nunique()

In [None]:
data.RowNumber.nunique()
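A unique count only confirms one row per customer when it equals the number of rows. As a minimal sketch of a more direct check, assuming a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Made-up data: customer 103 appears twice on purpose.
toy = pd.DataFrame({
    "CustomerId": [101, 102, 103, 103],
    "Balance": [0.0, 1500.0, 320.5, 320.5],
})

# True only when every CustomerId occurs exactly once.
has_one_row_per_customer = toy["CustomerId"].is_unique

# Show every row that belongs to a repeated customer.
repeated_rows = toy[toy["CustomerId"].duplicated(keep=False)]

print(has_one_row_per_customer)
print(repeated_rows)
```

The same two calls could be pointed at `data['CustomerId']` in this notebook.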

Statistics

In [None]:
print(data.describe())

Statistical narrative:

Looking across our quantitative columns, it appears to me that we have an even spread of information. Starting with credit score, the lowest value is 350 and the highest is 850, so the data covers a wide range of financial health. Our age range is also wide, which is useful if we mix in K Nearest Neighbors, a classification model. The tenure range is only 0 to 10, and its unit may need to be researched more: 10 days versus 10 years vastly changes the way we would think about the predictions. The balance represents the amount of money in the customer's bank account, which again seems to have a healthy spread. Lastly, estimated salary, which I hypothesize to be our leading predictor, has an interesting minimum value of 11. I assume this is an error, as making $11 a year is not feasible; however, the 25%, 50%, and 75% values seem to be in line with a typical yearly salary.

Independent Variables (x values)

In [None]:
data_x = data[['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']]
data_x.head()

In [None]:
sns.pairplot(data_x, markers='o', diag_kind='hist', plot_kws={'color': 'red'}, diag_kws={'color': 'red'})

**Further analysis needed**

After reviewing the pairplot, we do not see any clear linear relationships; however, we also do not see any drastic outliers.

#note to self:

May need to do more EDA on this dataset, for example a heatmap, or grouping our salary, balance, or age data.

Idea: graph the salary and balance by age to see if there is a relationship.

Idea: graph the number of people by age, gender, and geography to see if there are relationships or trends.
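One way the grouping idea might look is sketched below. It uses randomly generated stand-in data, so the numbers mean nothing; the band edges and the names `toy`, `AgeBand`, and `summary` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (not the churn dataset).
rng = np.random.default_rng(10)
toy = pd.DataFrame({
    "Age": rng.integers(18, 93, size=200),
    "Balance": rng.uniform(0, 250_000, size=200),
    "EstimatedSalary": rng.uniform(0, 200_000, size=200),
})

# Bucket ages into bands; the edges here are arbitrary choices.
toy["AgeBand"] = pd.cut(
    toy["Age"],
    bins=[17, 30, 40, 50, 60, 100],
    labels=["18-30", "31-40", "41-50", "51-60", "61+"],
)

# Average balance and salary within each age band.
summary = toy.groupby("AgeBand", observed=True)[["Balance", "EstimatedSalary"]].mean()
print(summary)
```

Calling `summary.plot(kind="bar")` would turn the table into a quick chart.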

Data Cleaning

In [None]:
data.dtypes

In [None]:
#checking our different options (for automation we can code to have it find all the options and assign a number if needed)
data['Gender'].unique()

In [None]:
data['Geography'].unique()

In [None]:
#for modeling we usually only want to use numeric, meaning our string type values will have to be mapped to numeric (example: Male 0, Female 1)

data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})
data['Geography'] = data['Geography'].map({'France': 0, 'Spain': 1, 'Germany': 2})
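Two alternatives to the hard-coded mapping are sketched here on a made-up frame (`toy` is hypothetical). The first builds the integer mapping from the data, as the automation comment suggests. The second one-hot encodes the columns, which avoids implying an order such as France < Spain < Germany, something a distance-based model like KNN would otherwise treat as meaningful.

```python
import pandas as pd

# Toy frame reusing the category labels seen above (rows are made up).
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Option 1: derive the integer mapping from the data instead of typing it out.
codes, labels = pd.factorize(toy["Geography"])
geography_mapping = {label: code for code, label in enumerate(labels)}

# Option 2: one-hot encode, producing one 0/1 column per category.
one_hot = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)

print(geography_mapping)
print(one_hot)
```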



In [None]:
#Percentage of customers who exited and didn't 

exited_percentage = data['Exited'].value_counts(normalize=True)
print(exited_percentage)



From this we can keep in mind that about 80% of our dataset has not churned (stayed a customer) and about 20% has churned.

Normalization

Normalization changes the values of numeric columns in the dataset to a common scale without distorting the differences in the ranges of values. This makes the data easier for our model to work with and can improve the accuracy of our predictions. It is also good practice because it keeps a variable with a large range from dominating the others.





StandardScaler subtracts the column mean from each data point and then divides the result by the column's standard deviation. Values below the mean therefore become negative.
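To make the StandardScaler calculation concrete, here is a minimal sketch done by hand with NumPy. The credit scores are made up for illustration and are not rows from the dataset.

```python
import numpy as np

# Made-up credit scores, for illustration only.
credit_scores = np.array([350.0, 600.0, 650.0, 700.0, 850.0])

mean = credit_scores.mean()
# Population standard deviation (ddof=0), which is what StandardScaler uses.
std = credit_scores.std()

# z = (x - mean) / std
z_scores = (credit_scores - mean) / std

print(z_scores)
```

The result has mean 0 and standard deviation 1, and every score below the mean comes out negative.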

In [None]:
data.head()

In [None]:
scaler = StandardScaler()
numeric_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']
df_normalized = scaler.fit_transform(data[numeric_columns])

In [None]:
# Convert normalized array to DataFrame with column names
df_normalized = pd.DataFrame(df_normalized, 
                           columns=numeric_columns, 
                           index=data.index)

In [None]:
df_normalized.head()

In [None]:
df_normalized.describe()

Another method of normalization is MinMaxScaler.

In [None]:

minmax_scaler = MinMaxScaler()
numeric_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']
# Store the MinMax output under its own name so the StandardScaler DataFrame in df_normalized is not overwritten
minmax_array = minmax_scaler.fit_transform(data[numeric_columns])

In [None]:
df_normalized_2 = pd.DataFrame(minmax_array,
                           columns=numeric_columns,
                           index=data.index)
df_normalized_2.head()


In [None]:
df_normalized_2.describe()

The MinMaxScaler did as expected: each data point was mapped to a value between 0 and 1 that represents where it lies between the minimum and maximum value of its column.
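The min-max mapping can be reproduced by hand, as in the minimal sketch below. The tenure values are made up and only span the 0 to 10 range noted earlier.

```python
import numpy as np

# Made-up tenure values spanning 0 to 10.
tenure = np.array([0.0, 2.0, 5.0, 7.0, 10.0])

# x_scaled = (x - min) / (max - min)
scaled = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(scaled)
```

The minimum maps to 0, the maximum maps to 1, and everything else lands proportionally in between.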

*** I will continue with the MinMaxScaler option, but could do some A/B testing with the other normalized set to see if we gain or lose performance.

Modeling

In [None]:
data.head()

K Nearest Neighbors

Normalized MinMaxScaler vs StandardScaler KNN Comparison

In [None]:
#Add normalized data to our other indicators and define the feature (X) and target (y) data:
extra_variables = ['CustomerId', 'Surname', 'Geography', 'Gender'] #keep in mind that customerid and surname should be unique
#note: CustomerId is an unscaled identifier, so it will dominate the KNN distance calculation; consider dropping it from model_x_extra
model_x_extra = ['CustomerId', 'Gender', 'Geography']
X_data = pd.concat([df_normalized_2, data[model_x_extra]], axis=1)
y_data = data['Exited']
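One thing worth checking with this feature set: KNN ranks neighbors by distance, and CustomerId is not scaled like the other columns. The sketch below uses three hypothetical rows (all values invented) to show how a large raw identifier can decide who counts as the nearest neighbor.

```python
import numpy as np

# Hypothetical rows: [scaled_age, scaled_balance, CustomerId]
a = np.array([0.20, 0.10, 15_600_000.0])
b = np.array([0.90, 0.95, 15_600_050.0])  # very different customer, nearby ID
c = np.array([0.21, 0.11, 15_700_000.0])  # nearly identical customer, distant ID

# Euclidean distances with the ID column included.
dist_with_id_ab = np.linalg.norm(a - b)
dist_with_id_ac = np.linalg.norm(a - c)

# Euclidean distances using only the scaled features.
dist_without_id_ab = np.linalg.norm(a[:2] - b[:2])
dist_without_id_ac = np.linalg.norm(a[:2] - c[:2])

print("with ID:   ", dist_with_id_ab, dist_with_id_ac)
print("without ID:", dist_without_id_ab, dist_without_id_ac)
```

With the ID included, `b` looks closer to `a` than `c` does, even though `c` is the more similar customer on the scaled features.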




In [None]:
#KNN, Logistic Regression, Decision Tree/random forests, XGBoost
#start with KNN
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=10)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

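The KNN model reaches an accuracy of about 77%. Accuracy alone does not show that the model is statistically significant, and because roughly 80% of customers did not churn, a model that always predicts "not churned" would already score around 80%. The accuracy should be judged against that baseline.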

*Relearn precision, recall, f1-score.

With Exited = 1 (churned) as the positive class, scikit-learn lays out the confusion matrix as [[TN, FP], [FN, TP]], so we have 1522 true negatives, 100 false positives, 351 false negatives, and 27 true positives. The model correctly identifies about 94% of the customers who stayed (1522 of 1622), which is very good. However, it only catches about 7% of the customers who churned (27 of 378), which is much lower than I would have expected.
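As a refresher on precision, recall, and F1, the sketch below recomputes them from the four counts reported above. It assumes scikit-learn's documented layout (rows are true labels, columns are predicted labels, labels sorted as 0 then 1) and that the counts were read off the matrix row by row, so that churn (Exited = 1) is the positive class.

```python
# Counts from the confusion matrix above, read as [[TN, FP], [FN, TP]] (assumed layout).
tn, fp, fn, tp = 1522, 100, 351, 27
total = tn + fp + fn + tp

accuracy = (tp + tn) / total       # share of all predictions that were right
precision = tp / (tp + fp)         # of predicted churners, how many really churned
recall = tp / (tp + fn)            # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "not churned" on the same test set.
baseline_accuracy = (tn + fp) / total

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
print(f"baseline:  {baseline_accuracy:.3f}")
```

Under that reading, recall for the churn class is the number to watch, since it measures how many churners the model actually finds.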


In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='d')  # fmt='d' prints whole counts instead of scientific notation

A 77% accuracy rating is a starting point. Next we can test multiple neighbor values to find the optimal number of neighbors.

In [None]:
# quick test to see if the different normalization methods affect our accuracy at all.

extra_variables = ['CustomerId', 'Surname', 'Geography', 'Gender'] #keep in mind that customerid and surname should be unique
model_x_extra = ['CustomerId', 'Gender', 'Geography']
X_data = pd.concat([df_normalized, data[model_x_extra]], axis=1)
y_data = data['Exited']

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=10)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='d')  # fmt='d' prints whole counts instead of scientific notation

In [None]:
#stick with the MinMaxScaler dataset:

extra_variables = ['CustomerId', 'Surname', 'Geography', 'Gender'] #keep in mind that customerid and surname should be unique
model_x_extra = ['CustomerId', 'Gender', 'Geography']
X_data = pd.concat([df_normalized_2, data[model_x_extra]], axis=1)
y_data = data['Exited']

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=10)

In [None]:
knn_accur = []

for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    knn_accur.append((i, accuracy_score(y_test, y_pred)))

knn_df = pd.DataFrame(knn_accur, columns=['n_neighbors', 'accuracy'])

plt.plot(knn_df.n_neighbors, knn_df.accuracy)
plt.show()
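The loop above picks `n_neighbors` by scoring on the test set, which can make the final accuracy look better than it would on unseen data. A possible alternative is sketched below: cross-validation on the training data only, with the scaler inside a pipeline so it is fit on each training fold. The data comes from `make_classification` as a stand-in, so the chosen k and scores say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with roughly an 80/20 class split.
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.8, 0.2], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y)

# Scaling happens inside the pipeline, so each fold fits its own scaler.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# 5-fold cross-validation over candidate neighbor counts, training data only.
search = GridSearchCV(pipeline,
                      param_grid={"knn__n_neighbors": list(range(1, 30))},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # test set is used once, at the end

print("best n_neighbors:", best_k)
print("held-out accuracy:", test_accuracy)
```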


In [None]:
#top 5 neighbor accuracy values.
knn_df_sorted = knn_df.sort_values(['accuracy'], ascending=False)
print(knn_df_sorted.head())


In [None]:
#best neighbor accuracy value example:

knn = KNeighborsClassifier(n_neighbors=14)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix, annot=True, fmt='d')  # fmt='d' prints whole counts instead of scientific notation

In [None]:
#graphing the different classes (KNN decision regions)

#start here: https://plotly.com/python/knn-classification/

Logistic Regression

In [None]:
X_data = pd.concat([df_normalized_2, data[model_x_extra]], axis=1)
y_data = data['Exited']

print(X_data.head())

In [None]:
#continue with https://www.datacamp.com/tutorial/understanding-logistic-regression-python

#work on training the model and making sure to increase accuracy.
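As a possible starting point for this step, the sketch below fits a logistic regression on synthetic stand-in data from `make_classification`. The `class_weight="balanced"` setting is one assumed way to handle an 80/20 class split, not a choice made in this notebook, and the printed scores do not reflect the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly an 80/20 class split.
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.8, 0.2], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y)

# class_weight="balanced" reweights the classes inversely to their frequency.
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print(classification_report(y_test, y_pred))
```

Swapping in `X_data` and `y_data` from the cell above would be the next step.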