<a href="https://colab.research.google.com/github/josmyrose/Data-Science-Teaching-Projects/blob/main/github_AI_and_ML_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🎯 Learning Objectives

By completing this project, learners will be able to:

1. **Understand the concept of customer churn** and explain its business significance in the telecommunications industry.  
   - Recognize how predictive analytics supports customer retention and strategic decision-making.

2. **Build an end-to-end supervised machine learning pipeline** using Python and Scikit-learn.  
   - Integrate all stages of the workflow, including data preprocessing, feature engineering, model training, and evaluation.

3. **Perform effective data preprocessing and feature engineering.**  
   - Handle missing data, encode categorical variables, and scale numerical features for optimal model performance.

4. **Conduct exploratory data analysis (EDA)** to uncover meaningful patterns and relationships.  
   - Use visualizations to explore the impact of factors like tenure, contract type, and payment method on churn.

5. **Implement and compare multiple classification algorithms.**  
   - Train models such as Logistic Regression, Decision Trees, Random Forests, and Gradient Boosting to predict churn outcomes.

6. **Evaluate and interpret model performance** using appropriate metrics.  
   - Apply accuracy, precision, recall, F1-score, ROC curve, and AUC to assess predictive quality.

7. **Optimize model performance** through hyperparameter tuning and cross-validation.  
   - Utilize `GridSearchCV` or `RandomizedSearchCV` to improve accuracy and generalization.

8. **Interpret model results** to identify key drivers of customer churn.  
   - Use feature importance or explainable AI tools to determine the most influential factors affecting churn.

9. **Apply best practices in coding and documentation.**  
   - Write clean, well-commented notebooks with clear explanations and visual summaries.

10. **Communicate actionable insights** derived from data analysis and modeling.  
    - Present technical findings in a business-oriented manner to support decision-making and retention strategies.

---

## 📚 Skills & Tools Covered

| Category | Tools / Concepts |
|-----------|------------------|
| **Programming Language** | Python |
| **Libraries** | Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn |
| **Machine Learning Techniques** | Supervised Learning, Classification, Model Evaluation, Feature Engineering |
| **Statistical Concepts** | Data preprocessing, scaling, encoding, cross-validation |
| **Visualization Tools** | Matplotlib, Seaborn |
| **Advanced Topics** | Hyperparameter tuning, Explainable AI, Business analytics |
| **Soft Skills** | Problem-solving, analytical thinking, technical communication |

---

💡 *Tip:* You can run this notebook on **Google Colab** for a fully interactive learning experience. Each section of the notebook is organized to follow the standard data science workflow:
1. Data Loading  
2. Data Exploration  
3. Data Preprocessing  
4. Model Building  
5. Model Evaluation  
6. Interpretation and Insights

# 1. Understanding the Business Problem Statement

1.1 Build an end-to-end supervised machine learning pipeline to predict whether a customer in the US will switch telecommunications provider.

Customer churn is one of the biggest problems the telecoms industry currently faces, especially among consumers of voice call plans. Total call time, the amount of time a client spends on voice calls in a given period, may be a significant contributor to churn in this setting.

A telecommunications firm in the United States, **Comcast Corporation**, holds client information such as voice call plans, international voice call plans, and how many minutes each customer called during the day, evening, and night, among other things. To forecast whether a customer will stay with its telecommunication services, Comcast Corporation approached a client company named **ABC**, and **ABC** appointed me as data analyst.

The company wants to predict whether or not a customer will continue with the company, so this is a binary classification problem.



The dataset is telecommunication customer churn data.


# **Dataset and its attributes explanation**
---
The following are the customers' usage plans and other details provided by Comcast to the client company ABC.


*   state - Two-letter code of the US state in which the customer resides.
*   account_length - The number of months that a consumer has used the telecommunications provider's service.
*   area_code - Three-digit area code.
*   international_plan - Whether the customer has the international plan (yes if so, otherwise no).
*   voice_mail_plan - Whether the customer has the voice mail plan (yes if so, otherwise no).
*   number_vmail_messages - Number of voicemail messages.
*   total_day_minutes - Total minutes of daytime calls.
*   total_day_calls - Total number of daytime calls.
*   total_day_charge - Total cost of daytime calls.
*   total_eve_minutes - Total minutes of evening calls.
*   total_eve_calls - Total number of evening calls.
*   total_eve_charge - Total cost of evening calls.
*   total_night_minutes - Total minutes of night calls.
*   total_night_calls - Total number of night calls.
*   total_night_charge - Total cost of night calls.
*   total_intl_minutes - Total minutes of international calls.
*   total_intl_calls - Total number of international calls.
*   total_intl_charge - Total cost of international calls.
*   number_customer_service_calls - Number of calls to customer service.
*   churn - Customer churn (yes if the customer left the company, no if the customer continues with the company).





This dataset has 20 columns in total. Of these, the column churn is the target variable. After some data exploration, features are selected from the remaining 19 columns and a classification model is created.

The following code covers the dataset, its preprocessing, and the creation and evaluation of the model.

Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder  # encoding categorical variables
from sklearn.preprocessing import StandardScaler  # standardising numeric features
from imblearn.over_sampling import SMOTE  # balancing the dataset
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.model_selection import GridSearchCV  # hyperparameter tuning
from sklearn.svm import SVC  # support vector classifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the notebook output


Read the dataset into the Python environment from Google Drive.

In [None]:
df1=pd.read_csv('/content/drive/MyDrive/dataset/customerchrum.csv')#The resulting DataFrame object is assigned to a variable named "df1"

The dataset is too large to run the model on the Google Colab CPU, so take a sample of 500 rows.

In [None]:
df=df1.sample(500)

# **Examining the dataset**

---



Display the first five rows of the dataset to get an idea of the data it contains.

In [None]:
df.head(5)#displaying first five rows from the dataset.

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
2080,GA,131,area_code_408,no,no,0,197.0,79,33.49,201.0,114,17.09,151.2,111,6.8,11.6,5,3.13,1,no
1725,WA,132,area_code_415,no,no,0,240.1,115,40.82,180.4,91,15.33,133.4,122,6.0,8.0,6,2.16,3,no
2167,VT,72,area_code_415,no,no,0,139.9,117,23.78,223.6,96,19.01,240.8,93,10.84,12.7,4,3.43,2,no
567,NE,189,area_code_415,no,yes,38,256.7,98,43.64,150.5,120,12.79,123.0,87,5.54,11.4,3,3.08,3,no
791,NJ,46,area_code_408,no,no,0,257.4,67,43.76,261.1,91,22.19,204.4,107,9.2,13.4,5,3.62,2,yes


From the result, we get an overall idea of the data in the dataset.

The following code prints summary information about the dataset.

In [None]:
print("The size of the dataset",df.shape)
print("-------------------------------------------------")
print("The Datatype of columns\n {} of the dataset\n".format(df.dtypes))
print("-------------------------------------------------")
print("Information about the dataset")
df.info()  # info() prints its own summary and returns None, so it is called directly
print("\n-------------------------------------------------\n")

The size of the dataset (500, 20)
-------------------------------------------------
The Datatype of columns
 state                             object
account_length                     int64
area_code                         object
international_plan                object
voice_mail_plan                   object
number_vmail_messages              int64
total_day_minutes                float64
total_day_calls                    int64
total_day_charge                 float64
total_eve_minutes                float64
total_eve_calls                    int64
total_eve_charge                 float64
total_night_minutes              float64
total_night_calls                  int64
total_night_charge               float64
total_intl_minutes               float64
total_intl_calls                   int64
total_intl_charge                float64
number_customer_service_calls      int64
churn                             object
dtype: object of the dataset

-----------------------------------------

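The sampled dataset contains 500 rows and 20 columns. Of these, state, area_code, international_plan, voice_mail_plan, and churn are stored as object (categorical) columns; the rest are numeric.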


Dataset statistics

The following result displays summary statistics, including the minimum and maximum value of each column in the dataset. The columns span different ranges. For example, the minimum values of total_day_calls and total_day_charge are 40 and 5.78 respectively, and their maximum values are 157 and 57.53. Standardisation is therefore required for this dataset; it is done in the **Standardisation** section of this notebook.


In [None]:
df.describe()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,99.126,6.884,179.5802,99.086,30.52936,198.8812,100.0,16.905,202.7126,100.204,9.1223,10.2102,4.518,2.75716,1.534
std,39.312454,13.117877,53.883729,20.921616,9.159975,53.124919,18.667859,4.515632,49.470586,20.035399,2.226169,2.735596,2.496024,0.738483,1.237703
min,1.0,0.0,34.0,40.0,5.78,22.3,51.0,1.9,45.0,40.0,2.03,0.0,0.0,0.0,0.0
25%,72.0,0.0,144.25,85.0,24.5225,163.575,87.0,13.9075,169.35,86.0,7.6175,8.5,3.0,2.3,1.0
50%,95.0,0.0,183.05,98.0,31.12,201.25,100.0,17.105,202.3,99.0,9.105,10.2,4.0,2.75,1.0
75%,125.0,0.0,214.225,113.0,36.415,234.625,113.0,19.9425,239.725,115.0,10.79,12.1,6.0,3.27,2.0
max,225.0,48.0,338.4,157.0,57.53,349.4,155.0,29.7,352.5,165.0,15.86,19.7,17.0,5.32,6.0


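Bivariate analysis suggested that state, account_length, and area_code are not related to churn (the target column), so these columns are dropped. The code below also drops the categorical international_plan and voice_mail_plan columns, which leaves a dataframe with 15 columns.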





In [None]:
# drop columns 'state', 'account_length', 'area_code', 'international_plan' and 'voice_mail_plan'
df = df.drop(['state', 'account_length','area_code','international_plan','voice_mail_plan'], axis=1)

# print the updated DataFrame
print("Updated DataFrame:")
print(df.shape)

Updated DataFrame:
(500, 15)


# 1.2   **Data Exploration STEPS**

Analyse whether the dataset is imbalanced.

In [None]:
# Count of customers who stay with the company ('no') and who leave ('yes')
df['churn'].value_counts()


# count the customers who continue with the company (churn == 'no') and those who leave (churn == 'yes')
df_no = df[df['churn'] == 'no']['churn'].count()
df_yes = df[df['churn'] == 'yes']['churn'].count()

# count the total number of observations
total_count = df['churn'].count()

# percentage of rows with churn == 'no'
no_percentage = (df_no / total_count) * 100
# percentage of rows with churn == 'yes'
yes_percentage = (df_yes / total_count) * 100

# print the results
print('The percentage of churn with no values in the dataset is {:.2f}'.format(no_percentage))
print('The percentage of churn with yes values in the dataset is {:.2f}'.format(yes_percentage))

The percentage of churn with no values in the dataset is 85.80
The percentage of churn with yes values in the dataset is 14.20


The percentage of churn with **no** values in the dataset is 85.80% and the percentage of churn with **yes** values is 14.20%. The dataset contains far more **no** values than **yes** values, so it is imbalanced. A model trained on an imbalanced dataset tends to be biased towards the majority class. To avoid this problem, the dataset is balanced later in this notebook using the SMOTE algorithm.
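
To make the imbalance concrete, the short sketch below recomputes the class shares in plain Python. The counts 429 and 71 are an assumption back-calculated from the 85.80% and 14.20% figures on the 500-row sample, and the `labels` list is an illustrative stand-in for `df['churn']`.

```python
from collections import Counter

# Assumed counts: 85.80% and 14.20% of the 500 sampled rows
labels = ["no"] * 429 + ["yes"] * 71

counts = Counter(labels)
total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

# A classifier that always predicts the majority class would score this accuracy
majority_baseline = max(shares.values())

print(shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}%")
```

Because always answering `no` would already reach about 85.8% accuracy, accuracy alone is a weak yardstick here; precision and recall on the `yes` class are more informative.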

Check whether any null values are present in the dataset. As the output below shows, there are none.

In [None]:
print("Number of null values in the dataset\n",df.isnull().sum())

Number of null values in the dataset
 number_vmail_messages            0
total_day_minutes                0
total_day_calls                  0
total_day_charge                 0
total_eve_minutes                0
total_eve_calls                  0
total_eve_charge                 0
total_night_minutes              0
total_night_calls                0
total_night_charge               0
total_intl_minutes               0
total_intl_calls                 0
total_intl_charge                0
number_customer_service_calls    0
churn                            0
dtype: int64


**Label Encoding**

Most machine learning algorithms cannot handle categorical values directly. In this dataset, the column 'churn' has the categorical values 'yes' and 'no', so label encoding is used here. Label encoding assigns a numerical label to each category. One-hot encoding would instead increase the number of columns in the dataset.

In [None]:
# creating instance of labelencoder
labelencoder1 = LabelEncoder()
# Assign numeric labels and store them back in the 'churn' column
df['churn'] = labelencoder1.fit_transform(df['churn'])#churn values changed
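
As a rough illustration of what the encoder does to the `churn` column, the plain-Python sketch below sorts the distinct class names and replaces each value with its position in that sorted list, which is how `LabelEncoder` assigns labels. The helper `encode_labels` is a hypothetical name used only for this example.

```python
def encode_labels(values):
    """Map each distinct value to an integer based on sorted order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# churn values of the five rows shown by df.head()
churn = ["no", "no", "no", "no", "yes"]
encoded, classes = encode_labels(churn)

print(classes)
print(encoded)
```

With this ordering `no` becomes 0 and `yes` becomes 1, so the positive class in later metrics corresponds to customers who churn.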


After dropping columns, 14 features remain in this dataset. To select the features most correlated with the target variable churn, use the correlation method. For example, in the output below total_day_charge has a correlation of 0.166041 with churn. Based on this, the following features were selected to create the model: 'total_day_charge', 'total_eve_charge', 'total_night_charge', 'number_customer_service_calls'.

In [None]:
df.corr()

Unnamed: 0,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
number_vmail_messages,1.0,0.039813,-0.028675,0.039809,0.038452,-0.000818,0.038417,0.041954,0.028341,0.041975,-0.01547,-0.066711,-0.015495,-0.004694,-0.096963
total_day_minutes,0.039813,1.0,-0.045254,1.0,0.013029,0.077886,0.012984,-0.02678,0.029684,-0.026733,-0.005504,-0.014544,-0.005172,-0.086147,0.166053
total_day_calls,-0.028675,-0.045254,1.0,-0.045249,0.019555,-0.039674,0.019601,0.019353,0.032927,0.0193,0.013914,0.000987,0.014146,0.026548,-0.038684
total_day_charge,0.039809,1.0,-0.045249,1.0,0.013033,0.07787,0.012987,-0.026767,0.02966,-0.02672,-0.00552,-0.014553,-0.005188,-0.086139,0.166041
total_eve_minutes,0.038452,0.013029,0.019555,0.013033,1.0,-0.045727,1.0,0.010361,0.041815,0.010337,-0.044521,-0.032636,-0.044331,-0.033065,0.05038
total_eve_calls,-0.000818,0.077886,-0.039674,0.07787,-0.045727,1.0,-0.045709,0.048282,-0.026608,0.048343,0.009477,-0.026235,0.009497,0.014398,0.0424
total_eve_charge,0.038417,0.012984,0.019601,0.012987,1.0,-0.045709,1.0,0.010393,0.041821,0.010369,-0.044489,-0.032652,-0.044299,-0.033076,0.050381
total_night_minutes,0.041954,-0.02678,0.019353,-0.026767,0.010361,0.048282,0.010393,1.0,-0.055249,0.999999,-0.01943,-0.054534,-0.019259,0.024368,0.030249
total_night_calls,0.028341,0.029684,0.032927,0.02966,0.041815,-0.026608,0.041821,-0.055249,1.0,-0.055289,-0.037849,-0.043994,-0.037922,0.018953,-0.026476
total_night_charge,0.041975,-0.026733,0.0193,-0.02672,0.010337,0.048343,0.010369,0.999999,-0.055289,1.0,-0.019489,-0.054493,-0.019318,0.024333,0.030239
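
The correlation table also shows that each `*_minutes` column is almost perfectly correlated with its `*_charge` column, so the two carry the same information. The sketch below computes the Pearson coefficient by hand for the five `total_day_minutes` and `total_day_charge` values shown by `df.head()`; the `pearson` helper is a hypothetical name for illustration, not part of pandas.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the sample, taken from the df.head() output
day_minutes = [197.0, 240.1, 139.9, 256.7, 257.4]
day_charge = [33.49, 40.82, 23.78, 43.64, 43.76]

r = pearson(day_minutes, day_charge)
print(f"r = {r:.6f}")
```

Keeping only one column of each pair, as the selection of the `*_charge` columns does, avoids feeding duplicated information to the model.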


Copy the 4 selected features and the target variable from dataframe df into dataframe df1. Note that this overwrites the earlier df1, which held the full dataset.

In [None]:
#copy the dataframe into df1 with selected features
df1=df[['total_day_charge','total_eve_charge','total_night_charge','number_customer_service_calls','churn']].copy()

In [None]:
df1.head()

Unnamed: 0,total_day_charge,total_eve_charge,total_night_charge,number_customer_service_calls,churn
2080,33.49,17.09,6.8,1,0
1725,40.82,15.33,6.0,3,0
2167,23.78,19.01,10.84,2,0
567,43.64,12.79,5.54,3,0
791,43.76,22.19,9.2,2,1


**Split the dataset into train and test set.**



Split the dataset into a training set and a testing set. This makes it possible to evaluate the model on new, unseen data and to detect overfitting, where the model performs well on the training data but poorly on new data.

Note that at this stage the dataset is split only into train and test parts, not yet into target and independent variables. The remaining preprocessing is then done after the split, which helps to avoid data leakage.

We pass **test_size** as 0.33, which means 33% of the data goes to the test part and the rest to the train part. Setting random_state=42 makes the split reproducible: the same rows end up in the train and test sets however many times the code is run.

In [None]:
X_train, X_test= train_test_split(df1, test_size=0.33, random_state=42)
#Display the size of train and test dataset.
print("The size of  train dataset",X_train.shape)
print("The size of test dataset",X_test.shape)

The size of  train dataset (335, 5)
The size of test dataset (165, 5)
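
To illustrate why a fixed `random_state` makes the split repeatable, here is a minimal standard-library sketch of a seeded shuffle split. It is not scikit-learn's implementation, and `seeded_split` is a hypothetical helper; it only assumes that the test size is rounded up, which reproduces the 335/165 sizes above for 500 rows.

```python
import math
import random

def seeded_split(rows, test_size, seed):
    """Shuffle a copy of rows with a fixed seed, then cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(500))  # stand-in for the 500 sampled rows
train_a, test_a = seeded_split(rows, test_size=0.33, seed=42)
train_b, test_b = seeded_split(rows, test_size=0.33, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)  # same seed, same split
```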


# **Separating the features and target label**

In [None]:
#split the train dataset and test data set as features and target label
x_train=X_train.drop(['churn'],axis=1)
y_train=X_train['churn']
x_test=X_test.drop(['churn'],axis=1)
y_test=X_test['churn']
print("x-train size",x_train.shape)
print("y-train size",y_train.shape)
print("X-test size",x_test.shape)
print("y-test size",y_test.shape)

x-train size (335, 4)
y-train size (335,)
X-test size (165, 4)
y-test size (165,)


For modelling, the train and test sets have each been separated into features (x_train, x_test) and target labels (y_train, y_test).

**Standardisation**

Pre-processing methods such as the StandardScaler are used to put a dataset's features on a common scale. StandardScaler rescales each feature to zero mean and unit variance, which matters for many machine learning techniques (for example SVMs) because it prevents features with large ranges from dominating the analysis.

In [None]:

# Standard Scaler object
sc = StandardScaler()

# fit the scaler on the training features only
sc.fit(x_train)

# transform the training data
X_scaled_tr = sc.transform(x_train)

# transform the test data with the statistics learned from the training data;
# refitting on the test set would leak test information into preprocessing
X_scale_te = sc.transform(x_test)




Standardisation also makes features comparable with one another. If the features are not standardised, a feature with a larger scale may dominate the others, resulting in biased results.
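
The key point when standardising is that the mean and standard deviation are learned from the training data only and then reused for the test data. The sketch below shows this with the standard library; the values are `total_day_charge` figures from the `df.head()` output, and treating the first four as training data and the fifth as test data is an assumption made purely for illustration.

```python
from statistics import fmean, pstdev

train = [33.49, 40.82, 23.78, 43.64]  # illustrative training values
test = [43.76]                        # illustrative test value

# "fit": learn the statistics from the training data only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the same statistics to both sets
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print(test_scaled)
```

The scaled test value is generally not centred at zero, which is expected: it is expressed in the training set's units.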

In [None]:
X_scaled_tr.shape

(335, 4)

In [None]:
y_train.shape

(335,)

**Balance the dataset using SMOTE Algorithm**

---



As shown in the data exploration step earlier, the dataset is imbalanced. To balance the training data, use the SMOTE algorithm.

In [None]:
# apply SMOTE
smote = SMOTE()
X_train,y_train = smote.fit_resample(X_scaled_tr,y_train)
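
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, picking one of its nearest minority neighbours, and interpolating at a random position between the two. The sketch below shows only that interpolation step in plain Python; it is a simplified illustration rather than imbalanced-learn's implementation, and `interpolate` and the two sample points are hypothetical.

```python
import random

def interpolate(point, neighbour, rng):
    """Return a synthetic sample on the segment between point and neighbour."""
    gap = rng.random()  # random position in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbour)]

rng = random.Random(0)
minority_point = [0.5, -1.2, 0.3, 2.0]      # hypothetical scaled feature vector
minority_neighbour = [0.9, -0.8, 0.1, 1.5]  # hypothetical nearby minority sample

synthetic = interpolate(minority_point, minority_neighbour, rng)
print(synthetic)
```

Since the resampling is random, passing a fixed `random_state` to `SMOTE` makes the resampled training set repeatable between runs.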

**Algorithm selection and Hyperparameter tuning**

First, create a decision tree model and find its best hyperparameters using grid search.

In [None]:
# create a decision tree classifier object
dtc = DecisionTreeClassifier()

#  the hyperparameters to tune
params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 3, 4],
    'min_samples_split': [2, 3, 4, 5]
}

# create a GridSearchCV object
grid1 = GridSearchCV(dtc, params, cv=5,n_jobs=-1)

# fit the GridSearchCV object to the data
grid1.fit(X_train,y_train )

# display the best hyperparameters
print("Best accuracy {:.2f}".format(grid1.best_score_))
print("Best Hyperparameters: ", grid1.best_params_)

Best accuracy 0.83
Best Hyperparameters:  {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 4}
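
To get a feel for how much work the grid search does, the sketch below enumerates the same parameter grid with `itertools.product`. It only counts candidate settings and does not train any model; the fold count of 5 is taken from `cv=5` above.

```python
from itertools import product

params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 3, 4],
    'min_samples_split': [2, 3, 4, 5],
}

# Every combination of one value per hyperparameter
candidates = [dict(zip(params, values)) for values in product(*params.values())]

n_folds = 5
print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # cross-validation fits (excluding the final refit)
```

The number of fits grows multiplicatively with every added hyperparameter value, which is one reason `RandomizedSearchCV` is often preferred for larger grids.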


Next, create an SVM model and find its best hyperparameters and best cross-validated accuracy using grid search. Here the hyperparameters are kernel and C.

In [None]:
# create a SVM classifier object
svm = SVC()

# define the hyperparameters to tune
params = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 10],

}

# create a GridSearchCV object
grid_search1= GridSearchCV(svm, params, cv=5,n_jobs=-1)

# fit the GridSearchCV object to the data
grid_search1.fit(X_train,y_train)

# print the best hyperparameters and the best cross-validated accuracy
print("Best Hyperparameters: ", grid_search1.best_params_)
print("Best accuracy: {:.2f}".format(grid_search1.best_score_))


Best Hyperparameters:  {'C': 10, 'kernel': 'rbf'}
Best accuracy: 0.86



# **Testing the model**
---



Models were created for the customer churn dataset using Support Vector Machine and Decision Tree classifiers. The best cross-validated accuracy on this dataset is 86%, obtained with the SVM, so the SVM model is the one evaluated on the test set.

In [None]:
# Predict with the tuned SVM (grid_search1) on the scaled test features.
# Note: the output recorded below came from the original call on the unscaled x_test.
y_predict = grid_search1.predict(X_scale_te)
cm = metrics.confusion_matrix(y_test, y_predict)
precision = precision_score(y_test, y_predict)
recall = recall_score(y_test, y_predict)

print("The value of confusion matrix", cm)
print("The value of precision", precision)
print("The value of recall", recall)



The value of confusion matrix [[  0 136]
 [  0  29]]
The value of precision 0.17575757575757575
The value of recall 1.0


The confusion matrix [[0 136] [0 29]] represents the performance of a binary classification model that made predictions on a set of samples. The matrix shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for the model's predictions.

Looking at the matrix, we can see that there are two classes being predicted. The first row represents the negative class, and the second row represents the positive class. The first column represents the samples predicted as negative, while the second column represents the samples predicted as positive.

*   The number 0 in the top-left corner is the number of true negatives (TN): the model correctly predicted 0 of the 0 + 136 = 136 negative samples as negative.
*   The number 136 in the top-right corner is the number of false positives (FP): the model incorrectly predicted all 136 negative samples as positive, so 136 of the 136 + 29 = 165 positive predictions are wrong.
*   The number 0 in the bottom-left corner is the number of false negatives (FN): the model incorrectly predicted 0 of the 0 + 29 = 29 positive samples as negative.
*   The number 29 in the bottom-right corner is the number of true positives (TP): the model correctly predicted all 29 positive samples as positive.
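
The reported precision and recall can be reproduced directly from the four cells of the matrix. The sketch below does the arithmetic in plain Python, assuming the scikit-learn layout in which rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
matrix = [[0, 136], [0, 29]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```

The precision matches the 0.1757 reported above and recall is 1.0, which is the signature of a model that labels every sample as positive.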


# **Conclusion**



In conclusion, the confusion matrix [[0 136] [0 29]] shows that the model is ineffective on the test set. It labelled every sample as positive: all 29 positive samples were recognised, but none of the 136 negative samples were. This high rate of false positives means the model cannot distinguish customers who stay from customers who churn, so it needs to be significantly improved before it can be used to make predictions on this dataset.

Here the model is created with only four selected features, which may limit its performance; selecting more features may improve it. Note also that the recorded output was produced by predicting on the unscaled `x_test`, while the model was trained on standardised features; this mismatch may by itself explain why every sample is predicted as positive.

**Future Enhancements**

Create a model with the KNN algorithm and test its accuracy, and do further data exploration to find additional features to select for model creation.



**REFERENCES**

1. https://www.kaggle.com/competitions/customer-churn-prediction-2020/data?select=train.csv
2. https://towardsdatascience.com/dealing-with-imbalanced-data-in-churn-analysis-6ea1afba8b5e