Tasks:
1. Data Preprocessing:
Load the provided dataset and perform initial data exploration.
Handle missing data and outliers.
Prepare the data for machine learning by encoding categorical variables and splitting it into 
training and testing sets.
2. Feature Engineering:
Generate relevant features from the dataset that can help improve the model's prediction 
accuracy.
Apply feature scaling or normalization if necessary.
3. Model Building:
Choose appropriate machine learning algorithms (e.g., logistic regression, random forest, or 
neural networks).
Train and validate the selected model on the training dataset.
Evaluate the model's performance using appropriate metrics (e.g., accuracy, precision, recall, 
F1-score).
4. Model Optimization:
Fine-tune the model parameters to improve its predictive performance.
Explore techniques like cross-validation and hyperparameter tuning.
5. Model Deployment:
Once satisfied with the model's performance, deploy it into a production-like 
environment (you can simulate this in a development environment).
Ensure the model can take new customer data as input and provide churn predictions.

In [33]:
import pandas as pd
import numpy as np
%pip install matplotlib
import matplotlib.pyplot as plt
%pip install seaborn
import seaborn as sns
import os


In [34]:
#load dataset
%pip install openpyxl
import openpyxl
df = pd.read_excel('customer_churn_large_dataset.xlsx')
#initial data exploration
df.head()


Unnamed: 0,CustomerID,Name,Age,Gender,Location,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Churn
0,1,Customer_1,63,Male,Los Angeles,17,73.36,236,0
1,2,Customer_2,62,Female,New York,1,48.76,172,0
2,3,Customer_3,24,Female,Los Angeles,5,85.47,460,0
3,4,Customer_4,36,Female,Miami,3,97.94,297,1
4,5,Customer_5,46,Female,Miami,19,58.14,266,0


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   CustomerID                  100000 non-null  int64  
 1   Name                        100000 non-null  object 
 2   Age                         100000 non-null  int64  
 3   Gender                      100000 non-null  object 
 4   Location                    100000 non-null  object 
 5   Subscription_Length_Months  100000 non-null  int64  
 6   Monthly_Bill                100000 non-null  float64
 7   Total_Usage_GB              100000 non-null  int64  
 8   Churn                       100000 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 6.9+ MB


So there are 9 columns: CustomerID, Age, Subscription_Length_Months, Total_Usage_GB and Churn are integers, Monthly_Bill is a float, and Name, Gender and Location are strings (object).
I am assuming this is for some internet service the client is providing.
The client wants to know whether a user's subscription length, monthly bill, total GB usage, location, age or gender has anything to do with the user discontinuing the service.

In [36]:
#let us check some basic statistics
df.describe()

Unnamed: 0,CustomerID,Age,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Churn
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,50000.5,44.02702,12.4901,65.053197,274.39365,0.49779
std,28867.657797,15.280283,6.926461,20.230696,130.463063,0.499998
min,1.0,18.0,1.0,30.0,50.0,0.0
25%,25000.75,31.0,6.0,47.54,161.0,0.0
50%,50000.5,44.0,12.0,65.01,274.0,0.0
75%,75000.25,57.0,19.0,82.64,387.0,1.0
max,100000.0,70.0,24.0,100.0,500.0,1.0


The dataset consists of one hundred thousand (100,000) entries, which is quite a lot.
All columns have 100,000 non-null values, so there are no null entries.
That makes my life easier, since I don't have to impute or fill any missing values.
Also, the average age is 44, so the client has a relatively old userbase.
The average subscription length is about 12 months (the maximum is 24), which suggests they have fairly loyal customers.
Their monthly bill averages about 65 (USD, presumably). That seems decent compared to other prices I have seen after some research; India has it way cheaper though.
The churn rate is about 0.50 (49.8%), which is very high. According to some websites, "Churn rates for subscription-based services can range from around 5% to 15% annually",
so it is indeed very high.
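The task list also asks about outliers. Plugging the quartiles from df.describe() into the usual 1.5 * IQR rule gives bounds that are wider than the observed min and max for Age, Subscription_Length_Months, Monthly_Bill and Total_Usage_GB, so that rule would flag nothing here. A minimal sketch of the check, assuming the 1.5 * IQR rule is an acceptable definition of an outlier; the helper name iqr_outlier_counts and the small demo frame are made up for illustration and are not the real data:

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((frame[col] < lower) | (frame[col] > upper)).sum())
    return pd.Series(counts, name="outliers")


# Hypothetical demo frame: the 950.0 bill is planted so the check has something to find
demo = pd.DataFrame({
    "Monthly_Bill": [30.0, 47.5, 65.0, 82.6, 100.0, 950.0],
    "Total_Usage_GB": [50, 161, 274, 387, 500, 300],
})
outlier_counts = iqr_outlier_counts(demo, ["Monthly_Bill", "Total_Usage_GB"])
print(outlier_counts)
```

On the real data the same call would be iqr_outlier_counts(df, [...]) with the four numeric feature columns.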

In [37]:
#just checking again whether there are any null values
df.isnull().sum()

CustomerID                    0
Name                          0
Age                           0
Gender                        0
Location                      0
Subscription_Length_Months    0
Monthly_Bill                  0
Total_Usage_GB                0
Churn                         0
dtype: int64

In [38]:
#Prepare the data for machine learning by encoding categorical variables and splitting it into training and testing sets.

#Encoding categorical variables

#let us check the number of unique values in the categorical columns
df.select_dtypes(include=['object']).nunique()


Name        100000
Gender           2
Location         5
dtype: int64

OK, we can ignore Name. Since Gender has only two values, we can replace it with 0 for male and 1 for female.
For Location we can likewise use the values 1-5 for the five cities.

In [39]:
df['Location'].value_counts()

Location
Houston        20157
Los Angeles    20041
Miami          20031
Chicago        19958
New York       19813
Name: count, dtype: int64

OK, let's take Houston as 1, Los Angeles as 2, Miami as 3, Chicago as 4 and New York as 5.

In [40]:
#map each city to an integer code: Houston=1, Los Angeles=2, Miami=3, Chicago=4, New York=5
df['Location'] = df['Location'].map({'Houston':1,'Los Angeles':2,'Miami':3,'Chicago':4,'New York':5})
#also replace Gender with 0 (Male) and 1 (Female)
df['Gender'] = df['Gender'].map({'Male':0,'Female':1})
df.head()

Unnamed: 0,CustomerID,Name,Age,Gender,Location,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Churn
0,1,Customer_1,63,0,2,17,73.36,236,0
1,2,Customer_2,62,1,5,1,48.76,172,0
2,3,Customer_3,24,1,2,5,85.47,460,0
3,4,Customer_4,36,1,3,3,97.94,297,1
4,5,Customer_5,46,1,3,19,58.14,266,0
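One thing to keep in mind: coding the cities as 1-5 tells a linear model that New York (5) is somehow "more" than Houston (1), even though the cities have no natural order. A possible alternative is one-hot encoding. This is only a sketch on a tiny made-up frame (demo is hypothetical, not the real dataset), using pd.get_dummies:

```python
import pandas as pd

# Hypothetical three-row frame with the same two categorical columns
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Location": ["Los Angeles", "New York", "Miami"],
})

# One 0/1 column per city instead of a single 1-5 code
encoded = pd.get_dummies(demo, columns=["Location"], dtype=int)

# Gender has only two values, so a single 0/1 column is enough
encoded["Gender"] = encoded["Gender"].map({"Male": 0, "Female": 1})
print(encoded)
```

Tree-based models such as XGBoost are usually less sensitive to this choice than logistic regression, so whether it matters depends on the model.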


Feature engineering?
Hmm, after thinking a lot and staring at the spreadsheet for a long time,
I have come up with an idea to understand the correlation between the age of the user,
their monthly bill, and the total GB used.
My assumption is that younger consumers tend to use more internet but have a smaller budget,
so getting enough value for the money they pay is crucial.
So let's calculate a feature called usage/bill.


In [41]:
#feature engineering
#let's calculate a feature called usage/bill (GB used per unit of monthly bill)
df['usage/bill'] = df['Total_Usage_GB'] / df['Monthly_Bill']

#assumption: younger users with a higher usage/bill are more likely to churn,
#so let's add age to the metric
df['usage/bill/age'] = df['usage/bill'] / df['Age']

#feature scaling
#let's scale the data using StandardScaler
#we will use this if model accuracy without it isn't any good
from sklearn.preprocessing import StandardScaler
features = df.drop(['Churn','CustomerID','Name'],axis=1)
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(features),columns=features.columns)
df_scaled.head()

Unnamed: 0,Age,Gender,Location,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,usage/bill,usage/bill/age
0,1.24167,-1.004329,-0.701878,0.651115,0.410606,-0.294289,-0.514146,-0.747695
1,1.176226,0.995689,1.420116,-1.658879,-0.805374,-0.784852,-0.407508,-0.687717
2,-1.310651,0.995689,-0.701878,-1.08138,1.009204,1.422681,0.229471,1.033661
3,-0.525321,0.995689,0.005454,-1.370129,1.625597,0.173279,-0.577532,-0.4065
4,0.12912,0.995689,0.005454,0.939864,-0.34172,-0.064338,-0.047657,-0.249901
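Before training anything, it may be worth checking whether the new ratio features actually move with Churn. A rough sketch, assuming Pearson correlation is a good enough first look; the demo frame below is randomly generated (so its correlations will be near zero by construction) and only borrows the column names from the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical random data with the same column names and value ranges
demo = pd.DataFrame({
    "Age": rng.integers(18, 71, n),
    "Monthly_Bill": rng.uniform(30, 100, n).round(2),
    "Total_Usage_GB": rng.integers(50, 501, n),
    "Churn": rng.integers(0, 2, n),
})
demo["usage/bill"] = demo["Total_Usage_GB"] / demo["Monthly_Bill"]
demo["usage/bill/age"] = demo["usage/bill"] / demo["Age"]

# Correlation of every feature with the target, strongest first
target_corr = demo.corr()["Churn"].drop("Churn").sort_values(key=abs, ascending=False)
print(target_corr)
```

On the real data the equivalent would be df.drop(['CustomerID','Name'],axis=1).corr()['Churn']. Values close to zero for every column would be a hint that the features carry little linear signal.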


In [42]:
#let's split the data into train and test sets
%pip install scikit-learn
from sklearn.model_selection import train_test_split
#Churn is the variable we have to predict
#let's remove CustomerID and Name since they can't be used to predict anything
X = df.drop(['Churn','CustomerID','Name'],axis=1)
X_scaled = df_scaled
y = df['Churn']
#taking test size as 0.3 because it seems like a good choice, since the churn rate is at almost 50%
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

#let's check the shape of the data
X_train.shape,X_test.shape,y_train.shape,y_test.shape



((70000, 8), (30000, 8), (70000,), (30000,))
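Note that the scaler earlier was fit on all 100,000 rows before the split, and the models in this notebook are trained on the unscaled X. If scaling is used, one common way to keep test rows out of the scaler's statistics is to put the scaler and the model in a single pipeline and fit it on the training split only. A minimal sketch on made-up data (X_demo and y_demo are hypothetical; the label is deliberately built from the first feature so there is something to learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: 500 rows, 3 numeric features, label driven by feature 0 plus noise
X_demo = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=500) > 50).astype(int)

# stratify keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler's mean and std are learned from X_tr only, then reused on X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Pipeline accuracy on held-out rows: {test_accuracy:.2f}")
```

The same pipeline object can later be passed to cross-validation or a hyperparameter search, which refit the scaler inside each fold.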

In [43]:
#model building
#let's try a simple logistic regression model
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

#let's predict the values
predictions = logmodel.predict(X_test)

#let's check precision, recall, F1, the confusion matrix and accuracy
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logmodel.score(X_test, y_test)))


              precision    recall  f1-score   support

           0       0.51      0.54      0.52     15152
           1       0.50      0.46      0.48     14848

    accuracy                           0.50     30000
   macro avg       0.50      0.50      0.50     30000
weighted avg       0.50      0.50      0.50     30000

[[8154 6998]
 [7947 6901]]
Accuracy of logistic regression classifier on test set: 0.50


Yikes, 0.5 is about the same as a coin flip. We can do better.


In [44]:
%pip install xgboost
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
predictions = xgb.predict(X_test)
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
print('Accuracy of XGBoost classifier on test set: {:.2f}'.format(xgb.score(X_test, y_test)))


              precision    recall  f1-score   support

           0       0.50      0.51      0.51     15152
           1       0.49      0.48      0.49     14848

    accuracy                           0.50     30000
   macro avg       0.50      0.50      0.50     30000
weighted avg       0.50      0.50      0.50     30000

[[7787 7365]
 [7647 7201]]
Accuracy of XGBoost classifier on test set: 0.50


Hmm, I am not quite sure why I am getting 0.5 accuracy on everything.
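One possible explanation, which is only a guess and not something the outputs above prove, is that the features simply carry no information about Churn. In that case every model should land near the score of a trivial baseline. A sketch of how that comparison could look with 5-fold cross-validation, run here on pure noise (X_noise and y_noise are made up, with labels drawn independently of the features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data where the labels are independent of the features
X_noise = rng.normal(size=(2000, 5))
y_noise = rng.integers(0, 2, size=2000)

# Baseline: always predict the most frequent class
baseline_acc = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_noise, y_noise, cv=5
).mean()

# A real model on the same folds
model_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_noise, y_noise, cv=5
).mean()

print(f"baseline: {baseline_acc:.3f}  logistic regression: {model_acc:.3f}")
```

If the real X and y give a model score that is indistinguishable from the baseline across folds, then tuning hyperparameters is unlikely to help much, and the effort is better spent on the features or on questioning the data itself.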