## Introduction: Telecom Customer Retention

Imagine you are a data scientist in the marketing department of a telecom company. In a recent department meeting, leadership emphasized the critical importance of **customer retention**. In your role, you have access to a comprehensive dataset from your organization's database. It covers everything from basic customer demographics to usage frequency and whether each customer churned.

You're tasked with delving into this dataset to uncover insights that can help improve customer retention. Think of yourself as a **data scientist** who provides actionable, data-driven suggestions based on quantitative analysis.

You have at your disposal various data analysis techniques, ranging from basic statistical analysis to complex predictive modeling. Your challenge is to use these tools effectively to extract meaningful insights from the dataset.

```
Anonymous Customer ID
Call Failures: number of call failures
Complains: binary (0: No complaint, 1: complaint)
Subscription Length: total months of subscription
Charge Amount: Ordinal attribute (0: lowest amount, 9: highest amount)
Seconds of Use: total seconds of calls
Frequency of use: total number of calls
Frequency of SMS: total number of text messages
Distinct Called Numbers: total number of distinct phone numbers called
Age Group: ordinal attribute (1: younger age, 5: older age)
Tariff Plan: binary (1: Pay as you go, 2: contractual)
Status: binary (1: active, 2: non-active)
Churn: binary (1: churn, 0: non-churn) - Class label (Outcome Variable)
Customer Value: The calculated value of the customer
```

Tools at Your Disposal:

- **Python/Anaconda Distribution**: You have access to Python, a powerful programming language, and Anaconda, a popular platform for data science. This combination provides a robust environment for data analysis and machine learning.

- **Internet Connection**: You can use online resources like search engines and forums (e.g., Stack Overflow, but **NOT** generative AI such as ChatGPT) to assist in building your solution. These resources can offer valuable insights, solutions to coding challenges, and best practices in data analysis.


Files provided:

- **train.csv**, the training dataset
- **Final Exam.ipynb**, this notebook
- **test_no_label.csv**, the test dataset without labels

### Predictive Analysis Task: Train Logistic Regression Models

Your first challenge is to train a few **Logistic Regression** models to predict customer churn based on the dataset features. Each model should be capable of predicting whether unseen customers will exit based on their respective data points.

**Task Details:**

1. **Model Development:** Using the customer churn dataset, fit a Logistic Regression model that predicts whether a customer will exit. You are required to fit the model under three regularization settings (no regularization, L1, L2). Log and report your model selection procedure, and answer the rest of the questions using your selected model.

2. **Saving Your Prediction Output:** "test_no_label.csv" contains the input data of the test dataset without labels. The column 'Churn' is empty. **Use your selected model to predict the labels of records in the test set, save, and submit the updated test.csv with labels.**

Make sure your submission has the predicted values for the given test set.

Example:
```python
>>> test['Churn'] = model.predict(preprocessing(test))
>>> test.to_csv('test.csv', index=False)
```

### Step 1: Preprocessing Data

Examine the data and check the datatypes of the columns. Determine if and how you will transform the columns, especially the categorical ones:
 - Age Group (1 to 5, young to old)
 - Tariff Plan (1 Pay as you go,  2 Contractual)
 - Complaints (0 No complaint, 1 Complaints)
 - Status (1 Active, 2 Non-active)
 - Charge Amount (0 to 10, low to high)

Consider how to preprocess the rest of the numeric/boolean columns, and transform them appropriately.

After preprocessing, you should have the input variables `X` and outcome variable `y` ready.
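
One way to organize these transformations is a single `ColumnTransformer`, so that the same fitted object can later be applied to the test set. The sketch below is only an illustration on a few toy rows, not the preprocessing used in this notebook; the toy values and the choice to pass `Age Group` through unchanged are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration, using a subset of the dataset's column names.
toy = pd.DataFrame({
    "Tariff Plan": [1, 2, 1, 2],
    "Complains": [0, 0, 1, 0],
    "Status": [1, 1, 2, 2],
    "Seconds of Use": [6143, 1933, 2618, 2330],
    "Customer Value": [277.68, 92.715, 152.1, 927.32],
    "Age Group": [3, 5, 2, 3],
})

categorical = ["Tariff Plan", "Complains", "Status"]
continuous = ["Seconds of Use", "Customer Value"]

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(drop="first"), categorical),
        ("scale", StandardScaler(), continuous),
    ],
    remainder="passthrough",  # ordinal columns such as Age Group pass through unchanged
    sparse_threshold=0,       # always return a dense array
)

# Fit on the training rows; call preprocess.transform(...) on test rows later.
X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)
```

The output columns follow the order of the transformers: three one-hot indicators, two scaled columns, then the passed-through `Age Group`.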

In [532]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [533]:
import os
os.chdir("/content/drive/MyDrive")

In [534]:
#Importing libraries
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import KBinsDiscretizer, FunctionTransformer, StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,precision_score,recall_score,f1_score
import pickle
from sklearn.linear_model import Lasso, Ridge, LogisticRegression,SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.impute import SimpleImputer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder

In [535]:
#Load the train and test CSV files
df_train = pd.read_csv("train-1.csv")
df_test = pd.read_csv("test_no_label.csv")

In [536]:
#Display first few rows of train data using head()
df_train.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
0,10,0,36,1,6143,99,7,10,3,1,1,30,277.68,0
1,12,0,30,2,1933,48,42,72,5,1,1,55,92.715,0
2,7,0,36,1,2618,62,7,31,2,1,1,25,152.1,0
3,15,0,30,3,2330,53,208,13,3,2,1,30,927.32,0
4,7,0,33,0,6638,73,98,33,2,1,1,25,742.995,0


In [537]:
#Display first few rows of test data using head()
df_test.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value
0,14,0,40,3,7515,103,201,28,3,1,1,30,1108.72
1,3,0,37,0,7508,127,384,43,2,1,1,25,2071.575
2,0,0,28,0,3153,66,0,20,2,1,1,25,144.855
3,21,0,33,3,15850,234,3,82,2,1,1,25,737.28
4,23,0,18,2,9947,188,88,42,5,1,1,55,284.025


In [538]:
df_train.shape

(2520, 14)

In [539]:
df_test.shape

(630, 13)

In [540]:
df_train.dtypes

Call  Failure                int64
Complains                    int64
Subscription  Length         int64
Charge  Amount               int64
Seconds of Use               int64
Frequency of use             int64
Frequency of SMS             int64
Distinct Called Numbers      int64
Age Group                    int64
Tariff Plan                  int64
Status                       int64
Age                          int64
Customer Value             float64
Churn                        int64
dtype: object

In [541]:
df_test.dtypes

Call  Failure                int64
Complains                    int64
Subscription  Length         int64
Charge  Amount               int64
Seconds of Use               int64
Frequency of use             int64
Frequency of SMS             int64
Distinct Called Numbers      int64
Age Group                    int64
Tariff Plan                  int64
Status                       int64
Age                          int64
Customer Value             float64
dtype: object

In [542]:
# Checking for null values in train data
df_train.isna().sum()

Call  Failure              0
Complains                  0
Subscription  Length       0
Charge  Amount             0
Seconds of Use             0
Frequency of use           0
Frequency of SMS           0
Distinct Called Numbers    0
Age Group                  0
Tariff Plan                0
Status                     0
Age                        0
Customer Value             0
Churn                      0
dtype: int64

In [543]:
# Checking for null values in test data
df_test.isna().sum()

Call  Failure              0
Complains                  0
Subscription  Length       0
Charge  Amount             0
Seconds of Use             0
Frequency of use           0
Frequency of SMS           0
Distinct Called Numbers    0
Age Group                  0
Tariff Plan                0
Status                     0
Age                        0
Customer Value             0
dtype: int64

In [544]:
# drop = Frequency of SMS,Distinct Called Numbers,Age
# cont =  Customer Value,  Seconds of Use, Frequency of use, Subscription  Length, Call  Failure
# ordinal enc = Charge Amount, Age Group
# onehot enc =  Tariff Plan, Complains, Status

In [545]:
# Onehotencoding categorical columns : Tariff Plan, Complains, Status
# Scaling numerical continuous columns :  Customer Value,  Seconds of Use, Frequency of use, Subscription  Length, Call  Failure
# Ordinal encoding Ordinal attribute : Charge Amount, Age Group
# Dropping unnecessary columns for my analysis : Frequency of SMS,Distinct Called Numbers,Age

In [546]:
#Dropping irrelevant columns from train data
df_train.drop(["Frequency of SMS","Distinct Called Numbers","Age"],axis = 1,inplace=True)
df_train.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Tariff Plan,Status,Customer Value,Churn
0,10,0,36,1,6143,99,3,1,1,277.68,0
1,12,0,30,2,1933,48,5,1,1,92.715,0
2,7,0,36,1,2618,62,2,1,1,152.1,0
3,15,0,30,3,2330,53,3,2,1,927.32,0
4,7,0,33,0,6638,73,2,1,1,742.995,0


In [547]:
#Dropping irrelevant columns from test data
df_test.drop(["Frequency of SMS","Distinct Called Numbers","Age"],axis = 1,inplace=True)
df_test.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Tariff Plan,Status,Customer Value
0,14,0,40,3,7515,103,3,1,1,1108.72
1,3,0,37,0,7508,127,2,1,1,2071.575
2,0,0,28,0,3153,66,2,1,1,144.855
3,21,0,33,3,15850,234,2,1,1,737.28
4,23,0,18,2,9947,188,5,1,1,284.025


In [548]:
#Preprocessing for train data

#one hot encoding for train data for column Tariff Plan, Complains, Status

#Encode categorical variables for train data

#Apply one hot encoding (sparse_output replaces the older sparse argument in scikit-learn >= 1.2)
one_hot_encoder = OneHotEncoder(drop="first", sparse_output=False, dtype=int)

#Fit and transform the Tariff Plan, Complains, Status column needed for training data
col_enc_df = one_hot_encoder.fit_transform(df_train[["Tariff Plan", "Complains", "Status"]])

# Get the column names for the one-hot encoded variables
churn_columns = one_hot_encoder.get_feature_names_out(["Tariff Plan", "Complains", "Status"])

# Create a DataFrame with the encoded data
col_enc_df1 = pd.DataFrame(col_enc_df, columns=churn_columns)

# Drop the original Tariff Plan, Complains, Status column and concatenate the new encoded columns
df_encoded = pd.concat([df_train.drop(columns=["Tariff Plan", "Complains", "Status"]), col_enc_df1], axis=1)

# Display the encoded DataFrame
df_encoded.head()



Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Churn,Tariff Plan_2,Complains_1,Status_2
0,10,36,1,6143,99,3,277.68,0,0,0,0
1,12,30,2,1933,48,5,92.715,0,0,0,0
2,7,36,1,2618,62,2,152.1,0,0,0,0
3,15,30,3,2330,53,3,927.32,0,1,0,0
4,7,33,0,6638,73,2,742.995,0,0,0,0


In [549]:
#Ordinal encoding for train data for "Age Group","Charge  Amount"

# Select the column to encode
ord_col = df_encoded[["Age Group","Charge  Amount"]]

# Initialize the OrdinalEncoder
encoder = OrdinalEncoder()

# Fit and transform the column
encoded_ord_col = encoder.fit_transform(ord_col)

# Replace the original column with the encoded values
df_encoded[["Age Group","Charge  Amount"]] = encoded_ord_col

# Display the first few rows to confirm the changes
df_encoded.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Churn,Tariff Plan_2,Complains_1,Status_2
0,10,36,1.0,6143,99,2.0,277.68,0,0,0,0
1,12,30,2.0,1933,48,4.0,92.715,0,0,0,0
2,7,36,1.0,2618,62,1.0,152.1,0,0,0,0
3,15,30,3.0,2330,53,2.0,927.32,0,1,0,0
4,7,33,0.0,6638,73,1.0,742.995,0,0,0,0


In [550]:
# Scaling for train data for columns "Customer Value",  "Seconds of Use", "Frequency of use", "Subscription  Length", "Call  Failure"

#Identify columns to scale
continuous_features = ["Customer Value",  "Seconds of Use", "Frequency of use", "Subscription  Length", "Call  Failure"]

#Apply scaling only to the numerical features

#Fit and Transform needed for training data
scaler = StandardScaler()
df_encoded[continuous_features] = scaler.fit_transform(df_encoded[continuous_features])
df_encoded.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Churn,Tariff Plan_2,Complains_1,Status_2
0,0.315152,0.406674,1.0,0.383934,0.502544,2.0,-0.383844,0,0,0,0
1,0.586342,-0.292622,2.0,-0.612745,-0.382852,4.0,-0.737532,0,0,0,0
2,-0.091634,0.406674,1.0,-0.450577,-0.139802,1.0,-0.623976,0,0,0,0
3,0.993128,-0.292622,3.0,-0.518759,-0.296048,2.0,0.858391,0,1,0,0
4,-0.091634,0.057026,0.0,0.501121,0.051166,1.0,0.505926,0,0,0,0


In [551]:
#Encoding test data for columns Tariff Plan, Complains, Status

# Transform the Tariff Plan, Complains, Status column for test data
encoded_new = one_hot_encoder.transform(df_test[["Tariff Plan", "Complains", "Status"]])

# Get the column names for the one-hot encoded variables
columns_new = one_hot_encoder.get_feature_names_out(["Tariff Plan", "Complains", "Status"])

# Create a DataFrame with the encoded data
encoded_df_new = pd.DataFrame(encoded_new, columns=columns_new)

# Drop the original Tariff Plan, Complains, Status column and concatenate the new encoded columns
df_encoded_new = pd.concat([df_test.drop(columns=["Tariff Plan", "Complains", "Status"]), encoded_df_new], axis=1)
df_encoded_new.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Tariff Plan_2,Complains_1,Status_2
0,14,40,3,7515,103,3,1108.72,0,0,0
1,3,37,0,7508,127,2,2071.575,0,0,0
2,0,28,0,3153,66,2,144.855,0,0,0
3,21,33,3,15850,234,2,737.28,0,0,0
4,23,18,2,9947,188,5,284.025,0,0,0


In [552]:
#Ordinal encoding for test data for columns "Age Group","Charge  Amount"

# Select the column to encode
ord_col_test = df_encoded_new[["Age Group","Charge  Amount"]]

# Reuse the OrdinalEncoder fitted on the training data (transform only, no refit)
encoded_ord_col_test = encoder.transform(ord_col_test)

# Replace the original column with the encoded values
df_encoded_new[["Age Group","Charge  Amount"]] = encoded_ord_col_test

# Display the first few rows to confirm the changes
df_encoded_new.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Tariff Plan_2,Complains_1,Status_2
0,14,40,3.0,7515,103,2.0,1108.72,0,0,0
1,3,37,0.0,7508,127,1.0,2071.575,0,0,0
2,0,28,0.0,3153,66,1.0,144.855,0,0,0
3,21,33,3.0,15850,234,1.0,737.28,0,0,0
4,23,18,2.0,9947,188,4.0,284.025,0,0,0


In [553]:
#Scaling test data for columns "Customer Value",  "Seconds of Use", "Frequency of use", "Subscription  Length", "Call  Failure"

df_encoded_new[continuous_features] = scaler.transform(df_encoded_new[continuous_features])
df_encoded_new.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Tariff Plan_2,Complains_1,Status_2
0,0.857533,0.872872,3.0,0.708742,0.571987,2.0,1.205262,0,0,0
1,-0.634016,0.523223,0.0,0.707085,0.988645,1.0,3.046422,0,0,0
2,-1.040802,-0.525721,0.0,-0.323921,-0.070359,1.0,-0.63783,0,0,0
3,1.8067,0.057026,3.0,2.681977,2.846241,1.0,0.494998,0,0,0
4,2.077891,-1.691215,2.0,1.284496,2.047648,4.0,-0.371711,0,0,0
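
The test columns above are scaled with statistics learned from the training data only, which is why `scaler.transform` is called without refitting. A minimal plain-Python sketch of that fit-then-transform idea follows; the numbers are made up for illustration, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Hypothetical call-failure counts, made up for illustration.
train_values = [10, 12, 7, 15, 7]
test_values = [14, 3, 0]

# Fit: learn the mean and population standard deviation from the training data only.
mu = fmean(train_values)
sigma = pstdev(train_values)

# Transform: apply the training statistics to both sets.
train_scaled = [(v - mu) / sigma for v in train_values]
test_scaled = [(v - mu) / sigma for v in test_values]

print("train mean/std used:", mu, round(sigma, 4))
print("scaled test values:", [round(v, 3) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the scaled test values generally do not, and that is expected.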


In [554]:
# Separate features and target variable for train data
X = df_encoded.drop(["Churn"],axis = 1)
y = df_encoded["Churn"]

In [555]:
#Features for test data
X_test = df_encoded_new

In [556]:
X.shape

(2520, 10)

In [557]:
X_test.shape

(630, 10)

### Step 2: Fitting a Logistic Regression on Your Training Data

After defining `X` and `y`, create the following models and train them:
 1. LogisticRegression with No Regularization
 2. LogisticRegression with L1 and C=10
 3. LogisticRegression with L2 and C=10

You should use k-fold cross-validation to assess the performance of the models.

Report the performance (accuracy, recall, precision, f-1 score) for each model.

Select the final model for interpretation/prediction and clearly justify your selection.
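
The comparison described above can be sketched with `cross_validate`, which reports several metrics per fold for each model. This is a hedged illustration on synthetic data from `make_classification`, not the churn data; the variable names are hypothetical. The unregularized model is approximated with a very large `C`, which works across scikit-learn versions (on scikit-learn 1.2 or later, `penalty=None` is the direct route).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the churn data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

models = {
    # A very large C makes the L2 penalty negligible (approximately unregularized).
    "no_reg": LogisticRegression(C=1e10, solver="liblinear", max_iter=1000),
    "l1_C10": LogisticRegression(penalty="l1", C=10, solver="liblinear", max_iter=1000),
    "l2_C10": LogisticRegression(penalty="l2", C=10, solver="liblinear", max_iter=1000),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    # Average each metric over the five held-out folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Because every metric here comes from held-out folds, the three models are compared on data they were not fitted on.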

In [558]:
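#1)LogisticRegression with No Regularization
# Note: LogisticRegression defaults to penalty='l2' with C=1.0, so the fit below is still L2-regularized;
# pass penalty=None (scikit-learn >= 1.2) for a truly unregularized model.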

log_noreg = LogisticRegression(random_state=0)

log_noreg.fit(X, y)

print(log_noreg.predict_proba(X)[:,1]/log_noreg.predict_proba(X)[:,0])

# #Predict on the test set
# y_test_pred_log = best_log_reg_new.predict(X_test)
# # y_test_pred_log = best_log_reg_new.predict(X_test)
# X_test['Churn_log']=  y_test_pred_log

# #Save the predictions to CSV files
# X_test.to_csv('predictions_churn.csv', index=False)

#Predict on the train set
y_train_pred_new_noreg = log_noreg.predict(X)

#Evaluation
accuracy_log_noreg = accuracy_score(y, y_train_pred_new_noreg)
conf_matrix_log_noreg = confusion_matrix(y, y_train_pred_new_noreg)
class_report_log_noreg = classification_report(y, y_train_pred_new_noreg)

print("Logistic Regression Accuracy:", accuracy_log_noreg)
print("Logistic Regression Confusion Matrix:\n", conf_matrix_log_noreg)
print("Logistic Regression Classification Report:\n", class_report_log_noreg)

[0.01569416 0.04201523 0.03032332 ... 0.0090033  0.20813596 0.01365188]
Logistic Regression Accuracy: 0.9011904761904762
Logistic Regression Confusion Matrix:
 [[2098   37]
 [ 212  173]]
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.98      0.94      2135
           1       0.82      0.45      0.58       385

    accuracy                           0.90      2520
   macro avg       0.87      0.72      0.76      2520
weighted avg       0.90      0.90      0.89      2520



In [559]:
log_noreg.coef_

array([[ 0.82447653, -0.24172643, -0.47809573,  1.18848558, -2.38058706,
        -0.15134897, -1.27141193,  0.20732586,  3.82616146,  1.24999239]])

In [560]:
log_noreg.feature_names_in_

array(['Call  Failure', 'Subscription  Length', 'Charge  Amount',
       'Seconds of Use', 'Frequency of use', 'Age Group',
       'Customer Value', 'Tariff Plan_2', 'Complains_1', 'Status_2'],
      dtype=object)

In [561]:
# Report the performance (accuracy, recall, precision, f-1 score) for each model.

# Logistic Regression Accuracy: 0.9011904761904762
# Logistic Regression Confusion Matrix:
#  [[2098   37]
#  [ 212  173]]
# Logistic Regression Classification Report:
#                precision    recall  f1-score   support

#            0       0.91      0.98      0.94      2135
#            1       0.82      0.45      0.58       385

#     accuracy                           0.90      2520
#    macro avg       0.87      0.72      0.76      2520
# weighted avg       0.90      0.90      0.89      2520

# Accuracy: The training-set accuracy of 90.12% means the model correctly classifies about 90 of every 100 training instances. For reference, always predicting class 0 would score 84.72% (2135/2520).

# Class 0 (Not Exited):
# High precision (0.91) and recall (0.98) indicate that the model is very good at correctly identifying customers who do not exit.

# Class 1 (Exited):
# Precision of 0.82 is strong, but recall of 0.45 means the model misses 212 of the 385 customers who actually exit. The low recall pulls the F1-score down to 0.58.
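
The class 1 figures in the report can be reproduced by hand from the confusion matrix, which makes the gap between precision and recall easier to see. A small plain-Python check, using the four counts printed above:

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes.
tn, fp = 2098, 37
fn, tp = 212, 173

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9012 precision=0.82 recall=0.45 f1=0.58
```

Recall is low because 212 of the 385 actual churners fall in the false-negative cell.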

In [562]:
#2)LogisticRegression with L1 and C=10

# Ensure correct parameter grid
log_reg_params = {
    'C': [10],
    'penalty': ['l1'],
    'solver': ['liblinear']
    }

# Logistic Regression with GridSearchCV (The GridSearchCV is configured to try these combinations with 5-fold cross-validation)
# Error Handling parameter is set to 'raise' to immediately raise an error if a fit fails, which can help in debugging.
log_reg = LogisticRegression()
log_reg_grid = GridSearchCV(log_reg, log_reg_params, cv=5, error_score='raise')
log_reg_grid.fit(X, y)

print(log_reg_grid.cv_results_)

best_log_reg = log_reg_grid.best_estimator_

print(best_log_reg.predict_proba(X)[:,1]/best_log_reg.predict_proba(X)[:,0])

#Best parameters and scores
print("Best Logistic Regression Parameters:", log_reg_grid.best_params_)
print("Best Logistic Regression Score:", log_reg_grid.best_score_)

# #Predict on the test set
# y_test_pred_log = best_log_reg.predict(X_test)
# # y_test_pred_log = best_log_reg.predict(X_test)
# X_test['Churn_log']=  y_test_pred_log

# #Save the predictions to CSV files
# X_test.to_csv('predictions_churn.csv', index=False)

#Predict on the train set
y_train_pred_new = best_log_reg.predict(X)

#Evaluation
accuracy_log_reg = accuracy_score(y, y_train_pred_new)
conf_matrix_log_reg = confusion_matrix(y, y_train_pred_new)
class_report_log_reg = classification_report(y, y_train_pred_new)

print("Logistic Regression Accuracy:", accuracy_log_reg)
print("Logistic Regression Confusion Matrix:\n", conf_matrix_log_reg)
print("Logistic Regression Classification Report:\n", class_report_log_reg)

{'mean_fit_time': array([0.02726932]), 'std_fit_time': array([0.00329513]), 'mean_score_time': array([0.00536723]), 'std_score_time': array([0.00141021]), 'param_C': masked_array(data=[10],
             mask=[False],
       fill_value='?',
            dtype=object), 'param_penalty': masked_array(data=['l1'],
             mask=[False],
       fill_value='?',
            dtype=object), 'param_solver': masked_array(data=['liblinear'],
             mask=[False],
       fill_value='?',
            dtype=object), 'params': [{'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}], 'split0_test_score': array([0.90674603]), 'split1_test_score': array([0.88888889]), 'split2_test_score': array([0.90674603]), 'split3_test_score': array([0.89880952]), 'split4_test_score': array([0.90277778]), 'mean_test_score': array([0.90079365]), 'std_test_score': array([0.00664016]), 'rank_test_score': array([1], dtype=int32)}
[0.01270004 0.02792979 0.0207546  ... 0.00720396 0.2111285  0.01123837]
Best Logistic Regre

In [563]:
best_log_reg.coef_

array([[ 0.95831767, -0.24342226, -0.61339298,  2.02641223, -3.20206651,
        -0.17360408, -1.4226166 ,  0.38491335,  4.2688069 ,  1.24335414]])

In [564]:
best_log_reg.feature_names_in_

array(['Call  Failure', 'Subscription  Length', 'Charge  Amount',
       'Seconds of Use', 'Frequency of use', 'Age Group',
       'Customer Value', 'Tariff Plan_2', 'Complains_1', 'Status_2'],
      dtype=object)

In [565]:
# Report the performance (accuracy, recall, precision, f-1 score) for each model.

# Best Logistic Regression Score: 0.9007936507936508
# Logistic Regression Accuracy: 0.9011904761904762
# Logistic Regression Confusion Matrix:
#  [[2094   41]
#  [ 208  177]]
# Logistic Regression Classification Report:
#                precision    recall  f1-score   support

#            0       0.91      0.98      0.94      2135
#            1       0.81      0.46      0.59       385

#     accuracy                           0.90      2520
#    macro avg       0.86      0.72      0.77      2520
# weighted avg       0.89      0.90      0.89      2520

# Accuracy: The training-set accuracy of 90.12% means the model correctly classifies about 90 of every 100 training instances. For reference, always predicting class 0 would score 84.72% (2135/2520).

# Class 0 (Not Exited):
# High precision (0.91) and recall (0.98) indicate that the model is very good at correctly identifying customers who do not exit.

# Class 1 (Exited):
# Precision of 0.81 is strong, but recall of 0.46 means the model misses 208 of the 385 customers who actually exit. The low recall pulls the F1-score down to 0.59.

In [566]:
#3)LogisticRegression with L2 and C=10

# Ensure correct parameter grid
log_reg_params_new = {
    'C': [10],
    'penalty': ['l2'],
    'solver': ['liblinear']
    }

# Logistic Regression with GridSearchCV (The GridSearchCV is configured to try these combinations with 5-fold cross-validation)
# Error Handling parameter is set to 'raise' to immediately raise an error if a fit fails, which can help in debugging.
# log_reg = LogisticRegression()
log_reg_grid_new = GridSearchCV(log_reg, log_reg_params_new, cv=5, error_score='raise')
log_reg_grid_new.fit(X, y)

print(log_reg_grid_new.cv_results_)

best_log_reg_new = log_reg_grid_new.best_estimator_

print(best_log_reg_new.predict_proba(X)[:,1]/best_log_reg_new.predict_proba(X)[:,0])

#Best parameters and scores
print("Best Logistic Regression Parameters:", log_reg_grid_new.best_params_)
print("Best Logistic Regression Score:", log_reg_grid_new.best_score_)



#Predict on the train set
y_train_pred_new_one = best_log_reg_new.predict(X)

#Evaluation
accuracy_log_reg_one = accuracy_score(y, y_train_pred_new_one)
conf_matrix_log_reg_one = confusion_matrix(y, y_train_pred_new_one)
class_report_log_reg_one = classification_report(y, y_train_pred_new_one)

print("Logistic Regression Accuracy:", accuracy_log_reg_one)
print("Logistic Regression Confusion Matrix:\n", conf_matrix_log_reg_one)
print("Logistic Regression Classification Report:\n", class_report_log_reg_one)

{'mean_fit_time': array([0.01730528]), 'std_fit_time': array([0.00341344]), 'mean_score_time': array([0.00648866]), 'std_score_time': array([0.00226063]), 'param_C': masked_array(data=[10],
             mask=[False],
       fill_value='?',
            dtype=object), 'param_penalty': masked_array(data=['l2'],
             mask=[False],
       fill_value='?',
            dtype=object), 'param_solver': masked_array(data=['liblinear'],
             mask=[False],
       fill_value='?',
            dtype=object), 'params': [{'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}], 'split0_test_score': array([0.90674603]), 'split1_test_score': array([0.88888889]), 'split2_test_score': array([0.90674603]), 'split3_test_score': array([0.8968254]), 'split4_test_score': array([0.90277778]), 'mean_test_score': array([0.90039683]), 'std_test_score': array([0.00680414]), 'rank_test_score': array([1], dtype=int32)}
[0.01306876 0.0286241  0.02153918 ... 0.00740227 0.21246034 0.01155649]
Best Logistic Regres

In [567]:
best_log_reg_new.coef_

array([[ 0.94792512, -0.244346  , -0.6059585 ,  1.9675402 , -3.13203633,
        -0.17537921, -1.41569356,  0.40916276,  4.21893744,  1.24335733]])

In [568]:
best_log_reg_new.feature_names_in_

array(['Call  Failure', 'Subscription  Length', 'Charge  Amount',
       'Seconds of Use', 'Frequency of use', 'Age Group',
       'Customer Value', 'Tariff Plan_2', 'Complains_1', 'Status_2'],
      dtype=object)

In [569]:
# Report the performance (accuracy, recall, precision, f-1 score) for each model.

# Best Logistic Regression Score: 0.9003968253968253
# Logistic Regression Accuracy: 0.901984126984127
# Logistic Regression Confusion Matrix:
#  [[2094   41]
#  [ 206  179]]
# Logistic Regression Classification Report:
#                precision    recall  f1-score   support

#            0       0.91      0.98      0.94      2135
#            1       0.81      0.46      0.59       385

#     accuracy                           0.90      2520
#    macro avg       0.86      0.72      0.77      2520
# weighted avg       0.90      0.90      0.89      2520

# Accuracy: The training-set accuracy of 90.20% means the model correctly classifies about 90 of every 100 training instances. For reference, always predicting class 0 would score 84.72% (2135/2520).

# Class 0 (Not Exited):
# High precision (0.91) and recall (0.98) indicate that the model is very good at correctly identifying customers who do not exit.

# Class 1 (Exited):
# Precision of 0.81 is strong, but recall of 0.46 means the model misses 206 of the 385 customers who actually exit. The low recall pulls the F1-score down to 0.59.

In [570]:
# Select the final model for interpretation/prediction and clearly justify your selection.

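# The final model is best_log_reg_new (LogisticRegression with L2 and C=10), chosen for the highest training-set accuracy of the three models (90.20%).
# In other words, about 90.20% of its predictions on the training data are correct.
# The 5-fold cross-validation accuracies of the L1 (0.9008) and L2 (0.9004) models are nearly identical.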

final_model = best_log_reg_new
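
Since the task asks for k-fold cross-validation, the held-out fold scores are a natural basis for this choice. The sketch below simply summarizes the per-fold accuracies printed in `cv_results_` for the L1 and L2 models; the first model is omitted because no cross-validation scores were computed for it in this notebook.

```python
from statistics import fmean, pstdev

# Per-fold accuracies copied from the cv_results_ output printed above.
fold_scores = {
    "l1_C10": [0.90674603, 0.88888889, 0.90674603, 0.89880952, 0.90277778],
    "l2_C10": [0.90674603, 0.88888889, 0.90674603, 0.8968254, 0.90277778],
}

# Mean and population standard deviation across the five folds.
summary = {name: (fmean(s), pstdev(s)) for name, s in fold_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")

best = max(summary, key=lambda name: summary[name][0])
gap = abs(summary["l1_C10"][0] - summary["l2_C10"][0])
print("best by CV mean:", best, "| gap between models:", round(gap, 4))
```

The gap between the two means is far smaller than the fold-to-fold spread, so cross-validation alone does not clearly separate the L1 and L2 models.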

### Step 3: Predict with Your Selected Model, and Save the Results

Use your selected model to predict the test set, and save the results as a CSV file and upload for grading.

In [571]:
#Predict on the test set
y_test_pred_log_final = final_model.predict(X_test)
X_test['Churn_log'] = y_test_pred_log_final

#Save the predictions to CSV files
X_test.to_csv('predictions_churn.csv', index=False)
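
The task statement asks for the predictions to be stored in a `Churn` column of the test file. One possible way to do that, sketched here with toy stand-ins rather than the notebook's variables, is to attach the predictions to a copy of the raw test rows so that the feature matrix itself is left unchanged. The names `raw_test`, `predictions`, and `submission` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins (made up): the raw, unprocessed test rows and the model's predictions.
raw_test = pd.DataFrame({"Call  Failure": [14, 3, 0], "Complains": [0, 0, 1]})
predictions = [0, 0, 1]

# Attach predictions to a copy so the original frame is not modified.
submission = raw_test.copy()
submission["Churn"] = predictions

# Written to a temporary directory here; in the notebook the target would be 'test.csv'.
out_path = os.path.join(tempfile.mkdtemp(), "test.csv")
submission.to_csv(out_path, index=False)

print(pd.read_csv(out_path).columns.tolist())
```

Keeping the predictions out of `X_test` also means `final_model.predict(X_test)` can be called again later without an extra column getting in the way.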

### Analytical Task: Interpreting Your Model

Based on your model and analysis, answer the following 4 questions.

1. What is the accuracy of your selected model? Which model has the best performance (list the performance of all three models and show which is the best)?
2. What are the top three driving factors influencing customers' exit decisions? Does it make business sense? What business insights do the logistic model coefficients reveal?
3. Interpret the coefficients of Complaints, Tariff Plan, and Seconds of Use. How do they influence a customer's probability of churn? Hint: Use the concept of Odds in your answer.
4. Given the below data point/case, what is your selected model's prediction and why? Based on this data point, what recommendations would you suggest to encourage this customer to stay?

```
{
    "Call  Failure": 80,
    "Complains": 1,
    "Subscription  Length": 32,
    "Charge  Amount": 0,
    "Seconds of Use": 1519,
    "Frequency of use": 29,
    "Frequency of SMS": 15,
    "Distinct Called Numbers": 12,
    "Age Group": 3,
    "Tariff Plan": 1,
    "Status": 2,
    "Age": 31,
    "Customer Value": 120.86
}
```

In [None]:
#1)What is the accuracy of your selected model? Which model has the best performance (list all three model performance and show which is the best model)?

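# The selected final model is LogisticRegression with L2 and C=10, which has the highest training-set accuracy of the three logistic regression models trained.
#
# Model                  Train accuracy   CV accuracy (5-fold)   Precision (1)   Recall (1)   F1 (1)
# 1) No regularization   0.9012           not computed           0.82            0.45         0.58
# 2) L1, C=10            0.9012           0.9008                 0.81            0.46         0.59
# 3) L2, C=10 (chosen)   0.9020           0.9004                 0.81            0.46         0.59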

# Logistic Regression Accuracy: 0.901984126984127
# Logistic Regression Confusion Matrix:
#  [[2094   41]
#  [ 206  179]]
# Logistic Regression Classification Report:
#                precision    recall  f1-score   support

#            0       0.91      0.98      0.94      2135
#            1       0.81      0.46      0.59       385

#     accuracy                           0.90      2520
#    macro avg       0.86      0.72      0.77      2520
# weighted avg       0.90      0.90      0.89      2520

# Accuracy: The training-set accuracy of 90.20% means the model correctly classifies about 90 of every 100 training instances. For reference, always predicting class 0 would score 84.72% (2135/2520).

# Class 0 (Not Exited):
# High precision (0.91) and recall (0.98) indicate that the model is very good at correctly identifying customers who do not exit.

# Class 1 (Exited):
# Precision of 0.81 is strong, but recall of 0.46 means the model misses 206 of the 385 customers who actually exit. The low recall pulls the F1-score down to 0.59.

In [None]:
#2)What are the top three driving factors influencing customers' exit decisions? Does it make business sense? What business insights do the logistic model coefficients reveal?


# Based on the coefficients of the final model selected, we can see the below result

# array([[ 0.94792512, -0.244346  , -0.6059585 ,  1.9675402 , -3.13203633,
#         -0.17537921, -1.41569356,  0.40916276,  4.21893744,  1.24335733]])

# array(['Call  Failure', 'Subscription  Length', 'Charge  Amount',
#        'Seconds of Use', 'Frequency of use', 'Age Group',
#        'Customer Value', 'Tariff Plan_2', 'Complains_1', 'Status_2'],
#       dtype=object)

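# Top three key drivers (largest absolute coefficients) and business sense
# Complains_1 (4.21893744)
# Frequency of use (-3.13203633)
# Seconds of Use (1.9675402)

# Pattern: Complaints is the most influential factor in predicting customer exit. Frequency of use has the largest negative coefficient, so customers who make more calls are less likely to exit. Status_2 (1.24335733) and Call  Failure (0.94792512) also push toward exit.
# Seconds of Use and Frequency of use are both usage measures and are likely correlated, so their opposite signs should be read with caution.

# This makes business sense: a customer who has filed a complaint is far more likely to leave than one who has not, and an actively engaged customer is more likely to stay.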
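
One way to rank drivers is by the absolute size of each coefficient, since a large negative coefficient influences the prediction as much as a large positive one. The sketch below copies the feature names and coefficients from the output above; it assumes the features are on roughly comparable scales (the continuous ones were standardized, while the ordinal ones were not, so the ranking is approximate).

```python
# Feature names and coefficients copied from the model output above.
features = [
    "Call  Failure", "Subscription  Length", "Charge  Amount",
    "Seconds of Use", "Frequency of use", "Age Group",
    "Customer Value", "Tariff Plan_2", "Complains_1", "Status_2",
]
coefs = [
    0.94792512, -0.244346, -0.6059585, 1.9675402, -3.13203633,
    -0.17537921, -1.41569356, 0.40916276, 4.21893744, 1.24335733,
]

# Sort by absolute magnitude, largest first.
ranked = sorted(zip(features, coefs), key=lambda fc: abs(fc[1]), reverse=True)

for name, coef in ranked:
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name:22s} {coef:+.4f}  ({direction} the odds of churn)")
```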

In [None]:
# 3)Interpret the coefficients of Complaints, Tariff Plan, and Seconds of Use. How do they influence customer's probability of churn? Hint: Use the concept of Odds in your answer.

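# Tariff Plan_2 (contractual) - 0.40916276
# Complains_1 (complaint filed) - 4.21893744
# Seconds of Use - 1.9675402

# Each coefficient is the change in the log-odds of churn for a one-unit increase in the feature, holding the other features fixed, so exp(coefficient) is the factor by which the odds of churn are multiplied.
# Tariff Plan: exp(0.409) is about 1.51, so a contractual customer has roughly 1.5 times the odds of churn of a pay-as-you-go customer.
# Complaints: exp(4.219) is about 68, so a customer who has complained has roughly 68 times the odds of churn of one who has not.
# Seconds of Use: exp(1.968) is about 7.2, so each one-unit increase in the scaled feature multiplies the odds of churn by about 7.2.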
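
The odds interpretation can be checked numerically: exponentiating a logistic regression coefficient gives the multiplicative change in the odds of churn for a one-unit increase in that feature. This is a minimal sketch using the three coefficients listed above; the dictionary name is illustrative.

```python
import math

# Coefficients copied from the final model's output.
coefficients = {
    "Tariff Plan_2": 0.40916276,
    "Complains_1": 4.21893744,
    "Seconds of Use": 1.9675402,   # feature was scaled before fitting
}

# exp(coefficient) = odds ratio for a one-unit increase in the feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:15s} odds ratio = {ratio:.2f}")
```

An odds ratio above 1 means the feature raises the odds of churn; it is a multiplier on the odds, not on the probability itself.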


In [572]:
#4)Given the below data point/case, what is your selected model's prediction and why? Based on this data point, what recommendations would you suggest to encourage this customer to stay?

In [573]:
new_cust = {
    "Call  Failure": 80,
    "Complains": 1,
    "Subscription  Length": 32,
    "Charge  Amount": 0,
    "Seconds of Use": 1519,
    "Frequency of use": 29,
    "Frequency of SMS": 15,
    "Distinct Called Numbers": 12,
    "Age Group": 3,
    "Tariff Plan": 1,
    "Status": 2,
    "Age": 31,
    "Customer Value": 120.86
}

df_new = pd.DataFrame([new_cust])
df_new.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value
0,80,1,32,0,1519,29,15,12,3,1,2,31,120.86


In [574]:
df_new.isna().sum()
df_new.shape
df_new.dtypes

Call  Failure                int64
Complains                    int64
Subscription  Length         int64
Charge  Amount               int64
Seconds of Use               int64
Frequency of use             int64
Frequency of SMS             int64
Distinct Called Numbers      int64
Age Group                    int64
Tariff Plan                  int64
Status                       int64
Age                          int64
Customer Value             float64
dtype: object

In [575]:
#Dropping irrelevant columns (not used by the model) from new data
df_new.drop(["Frequency of SMS","Distinct Called Numbers","Age"],axis = 1,inplace=True)
df_new.head()

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Tariff Plan,Status,Customer Value
0,80,1,32,0,1519,29,3,1,2,120.86


In [576]:
#Encoding test data for columns Tariff Plan, Complains, Status

# Transform the Tariff Plan, Complains, and Status columns of the new data with the already-fitted encoder
encoded_new = one_hot_encoder.transform(df_new[["Tariff Plan", "Complains", "Status"]])

# Get the column names for the one-hot encoded variables
columns_new = one_hot_encoder.get_feature_names_out(["Tariff Plan", "Complains", "Status"])

# Create a DataFrame with the encoded data
encoded_df_new = pd.DataFrame(encoded_new, columns=columns_new)

# Drop the original Tariff Plan, Complains, and Status columns and concatenate the new encoded columns
df_encoded_new_test = pd.concat([df_new.drop(columns=["Tariff Plan", "Complains", "Status"]), encoded_df_new], axis=1)
df_encoded_new_test.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Tariff Plan_2,Complains_1,Status_2
0,80,32,0,1519,29,3,120.86,0,1,1
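
To see why the encoded row reads `0, 1, 1`, here is a plain-Python sketch of one-hot encoding with the first level dropped. It is not the fitted `one_hot_encoder` itself: the category levels come from the data dictionary, and dropping the first level is an assumption inferred from the output column names (`Tariff Plan_2`, `Complains_1`, `Status_2`). The helper name `one_hot_drop_first` is hypothetical.

```python
# Levels per column from the data dictionary; the first level is the
# dropped baseline (assumption inferred from the encoded column names).
levels = {
    "Tariff Plan": [1, 2],   # 1: pay as you go, 2: contractual
    "Complains": [0, 1],     # 0: no complaint, 1: complaint
    "Status": [1, 2],        # 1: active, 2: non-active
}

def one_hot_drop_first(row):
    """Return indicator columns for every level except the first."""
    encoded = {}
    for column, values in levels.items():
        for value in values[1:]:
            encoded[f"{column}_{value}"] = int(row[column] == value)
    return encoded

# Categorical values of the new customer.
new_cust_cats = {"Tariff Plan": 1, "Complains": 1, "Status": 2}
encoded = one_hot_drop_first(new_cust_cats)
print(encoded)
```

Each indicator is 1 only when the customer is in the non-baseline level, which is why the pay-as-you-go plan maps to `Tariff Plan_2 = 0`.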


In [577]:
#Ordinal encoding for test data for columns "Age Group","Charge  Amount"

# Select the columns to encode
ord_col_test = df_encoded_new_test[["Age Group","Charge  Amount"]]

# Reuse the OrdinalEncoder already fitted on the training data; do not re-initialize or refit it on the new data

# Transform the columns with the fitted encoder
encoded_ord_col_test = encoder.transform(ord_col_test)

# Replace the original columns with the encoded values
df_encoded_new_test[["Age Group","Charge  Amount"]] = encoded_ord_col_test

# Display the first few rows to confirm the changes
df_encoded_new_test.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Tariff Plan_2,Complains_1,Status_2
0,80,32,0.0,1519,29,2.0,120.86,0,1,1


In [578]:
#Scaling test data for columns "Customer Value",  "Seconds of Use", "Frequency of use", "Subscription  Length", "Call  Failure"

df_encoded_new_test[continuous_features] = scaler.transform(df_encoded_new_test[continuous_features])
df_encoded_new_test.head()

Unnamed: 0,Call Failure,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Age Group,Customer Value,Tariff Plan_2,Complains_1,Status_2
0,9.806824,-0.059523,0.0,-0.710756,-0.712706,2.0,-0.683713,0,1,1


In [579]:
X_new = df_encoded_new_test

In [580]:
# Predict the new data point using logistic regression

# Predict on the new data point (predict returns a one-element array)
y_test_pred_log_new_test = final_model.predict(X_new)
X_new['churn__log']=  y_test_pred_log_new_test

# Extract the scalar label so the comparisons below do not rely on array truthiness
pred_new = int(y_test_pred_log_new_test[0])

print(f"Prediction: {'Exited' if pred_new == 1 else 'Not Exited'}")

# Recommendation
if pred_new == 1:
    print("Recommendation: Engage with the customer through personalized offers and improved service to retain them.")
else:
    print("Recommendation: Continue to monitor customer satisfaction and engagement levels.")


Prediction: Exited
Recommendation: Engage with the customer through personalized offers and improved service to retain them.


In [None]:
# The logistic regression model predicts that this customer will exit: based on the features provided (Call Failure, Subscription Length,
# Charge Amount, Seconds of Use, etc.), the model judges the customer unlikely to stay with the company.
# The main reasons are visible in the data point itself: the customer has filed a complaint (Complains_1 = 1), is non-active (Status_2 = 1),
# and has 80 call failures (9.8 after scaling), and all three features have positive coefficients in the final model.

# Recommendation: Engage with the customer through personalized offers and improved service to retain them.

# Since the prediction indicates that the customer will not stay, the recommendation focuses on resolving the complaint and reducing call failures, then on restoring satisfaction and engagement so that the customer stays loyal.
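
To see which features push this customer toward "Exited", each coefficient can be multiplied by the corresponding preprocessed feature value. The sketch below copies both from the outputs above. The model's intercept is not shown in the notebook output, so the sum is only the feature part of the log-odds and no probability is derived from it.

```python
# Feature names and coefficients copied from the final model's output.
features = [
    "Call  Failure", "Subscription  Length", "Charge  Amount",
    "Seconds of Use", "Frequency of use", "Age Group",
    "Customer Value", "Tariff Plan_2", "Complains_1", "Status_2",
]
coefs = [
    0.94792512, -0.244346, -0.6059585, 1.9675402, -3.13203633,
    -0.17537921, -1.41569356, 0.40916276, 4.21893744, 1.24335733,
]
# Preprocessed row for the new customer, copied from the scaled output.
x_new = [9.806824, -0.059523, 0.0, -0.710756, -0.712706,
         2.0, -0.683713, 0, 1, 1]

# Contribution of each feature to the log-odds (intercept excluded).
contributions = {name: c * x for name, c, x in zip(features, coefs, x_new)}
score_without_intercept = sum(contributions.values())

for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {value:+.3f}")
print(f"Sum (without intercept): {score_without_intercept:+.3f}")
```

Under these values the largest positive contributions come from call failures and the complaint, which is where a retention effort for this customer would most plausibly start.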

### Deliverables to be uploaded to Canvas for Grading
1. Jupyter Notebook with code and all questions answered in Markdown cells
2. test.csv file with the predicted label/column (Churn)