## Instructions

In this lab, we first examine the degree of class imbalance in the data and then correct it using the techniques we learned in class.

Follow the steps below to build a simple model without balancing the data:

## Round 1

- Import the required libraries and modules that you would need

- Read the data into Python and name the dataframe churnData

- Check the datatypes of all the columns in the data. You will see that the column TotalCharges is of object type. Convert this column to a numeric type using the pd.to_numeric function

- Check for null values in the dataframe. Replace the null values

- Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges

- Split the data into a training set and a test set

- Scale the features using either a normalizer or a standard scaler

- Fit a logistic regression model on the training data

- Fit a KNN classifier (NOT a KNN regressor, please!) on the training data

## Data Preprocessing

In [2]:
import pandas as pd 

In [3]:
churnData = pd.read_csv('DATA_Customer-Churn.csv') # import csv, create dataframe 

In [4]:
churnData # display dataframe 

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


In [5]:
churnData.info() # get overview of data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [6]:
# Count null values in each column of the DataFrame
null_counts = churnData.isnull().sum()

# Print the null value counts
print(null_counts)

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


In [7]:
# errors='coerce' converts non-numeric values to NaN,
# so the conversion proceeds without raising an error.
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')
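
To make the effect of errors='coerce' concrete, here is a minimal sketch on a made-up Series (the values and the blank entry are assumptions for illustration, not rows taken from the dataset). It shows a string that cannot be parsed turning into NaN while the rest become floats.

```python
import pandas as pd

# Hypothetical sample: three parseable strings and one blank entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# The blank string cannot be parsed, so it is coerced to NaN
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())

print(converted.dtype, n_missing)
```

This is also why the null check has to be repeated after the conversion: the missing values only become visible as NaN once the column is numeric.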

In [8]:
# Recount the null values after the conversion (null_counts from In [6] is stale)
null_counts = churnData.isnull().sum()

# Print the null value counts
print(null_counts)

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [9]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory

In [10]:
# Filter the DataFrame to show only entirely duplicated rows
duplicate_rows = churnData[churnData.duplicated(keep=False)]

# Print the duplicated rows
print(duplicate_rows)

      gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
22      Male              0      No         No       1          Yes   
33      Male              0      No         No       1          Yes   
100     Male              0      No         No       1          Yes   
128     Male              0      No         No       1          Yes   
211   Female              0      No         No       1           No   
...      ...            ...     ...        ...     ...          ...   
6706  Female              0      No         No       1          Yes   
6764  Female              0      No         No       1          Yes   
6774  Female              0      No         No       1          Yes   
6789  Female              0      No         No       1          Yes   
6924    Male              0      No         No       1          Yes   

           OnlineSecurity         OnlineBackup     DeviceProtection  \
22    No internet service  No internet service  No internet service   
33   

In [11]:
# Drop entirely duplicated rows from the DataFrame
churnData.drop_duplicates(inplace=True)

In [12]:
# Reset the index of the DataFrame
churnData.reset_index(drop=True, inplace=True)
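
The duplicate handling above can be sketched on a tiny made-up frame (the column values are assumptions for illustration). duplicated(keep=False) flags every copy of a repeated row, while drop_duplicates keeps the first copy and reset_index(drop=True) renumbers the remaining rows.

```python
import pandas as pd

# Hypothetical frame: rows 0, 1 and 3 are identical
df = pd.DataFrame({
    "tenure": [1, 1, 34, 1],
    "MonthlyCharges": [20.0, 20.0, 56.95, 20.0],
})

# keep=False marks all members of a duplicated group, not just the later ones
all_copies = df[df.duplicated(keep=False)]

# Keep the first copy of each row and renumber the index from 0
deduped = df.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(deduped))
```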

In [13]:
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.50,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6989,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.50,No
6990,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.90,No
6991,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
6992,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.60,Yes


In [14]:
# Churn is the target variable; count its distinct classes
# (nunique is a method, so it must be called with parentheses)

churnData['Churn'].nunique()

2
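
Since the lab is about class imbalance, a quick way to see the degree of imbalance is value_counts with normalize=True. The sketch below uses a made-up label Series (an assumption for illustration); on the real data the same calls would be applied to churnData['Churn'], and the proportions would differ.

```python
import pandas as pd

# Hypothetical labels; the real proportions come from the dataset itself
churn = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"])

n_classes = churn.nunique()                       # number of distinct classes
class_share = churn.value_counts(normalize=True)  # fraction of rows per class

print(n_classes)
print(class_share)
```

A share far from 0.5 for either class suggests that accuracy alone will be a misleading metric, which is why precision and recall are reported later.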

In [15]:
# List of columns to keep
columns_to_keep = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges', 'Churn']

# Drop all columns except the ones in columns_to_keep
churnData.drop(columns=churnData.columns.difference(columns_to_keep), inplace=True)
churnData

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
0,0,1,29.85,29.85,No
1,0,34,56.95,1889.50,No
2,0,2,53.85,108.15,Yes
3,0,45,42.30,1840.75,No
4,0,2,70.70,151.65,Yes
...,...,...,...,...,...
6989,0,24,84.80,1990.50,No
6990,0,72,103.20,7362.90,No
6991,0,11,29.60,346.45,No
6992,1,4,74.40,306.60,Yes


In [16]:
churnData['Churn']=churnData['Churn'].map({'No': 0, 'Yes': 1})

In [17]:
churnData

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
0,0,1,29.85,29.85,0
1,0,34,56.95,1889.50,0
2,0,2,53.85,108.15,1
3,0,45,42.30,1840.75,0
4,0,2,70.70,151.65,1
...,...,...,...,...,...
6989,0,24,84.80,1990.50,0
6990,0,72,103.20,7362.90,0
6991,0,11,29.60,346.45,0
6992,1,4,74.40,306.60,1


In [24]:
churnData = churnData.dropna()
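
The instructions ask for null values to be replaced, while the cell above drops the affected rows instead. As a hedged alternative, the sketch below fills missing values with the column median on a made-up frame (the values are assumptions for illustration); whether the median is the right fill value for TotalCharges is a judgment call, not something the lab prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"TotalCharges": [29.85, np.nan, 108.15, 1840.75]})

# Option A: replace NaN with the median of the observed values
median_value = df["TotalCharges"].median()
filled = df.assign(TotalCharges=df["TotalCharges"].fillna(median_value))

# Option B: drop the rows with missing values, as the notebook does
dropped = df.dropna()

print(len(filled), len(dropped))
```

Filling keeps every row, which matters more when missing values are frequent; with only a handful of missing rows, dropping them changes little.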

## Split data into train and test set and apply StandardScaler

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

X = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y = churnData['Churn']

log_model = LogisticRegression()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=11)

# Fit the scaler on the training set only, then apply it to both sets
ss = StandardScaler()
ss.fit(X_train)
X_train_log = ss.transform(X_train)
X_test_log = ss.transform(X_test)


## Fit a logistic regression model on the training data 

In [28]:
log_model.fit(X_train_log, y_train)

y_pred_train_log = log_model.predict(X_train_log)
y_pred_test_log = log_model.predict(X_test_log)

performance_log = pd.DataFrame({'Error_metric': ['Accuracy','Precision','Recall'],
                               'Train': [accuracy_score(y_train, y_pred_train_log),
                                         precision_score(y_train, y_pred_train_log),
                                         recall_score(y_train, y_pred_train_log)],
                               'Test': [accuracy_score(y_test, y_pred_test_log),
                                        precision_score(y_test, y_pred_test_log),
                                        recall_score(y_test, y_pred_test_log)]})

display(performance_log)


|   | Error_metric | Train | Test |
|---|---|---|---|
| 0 | Accuracy | 0.790548 | 0.801718 |
| 1 | Precision | 0.647239 | 0.715356 |
| 2 | Recall | 0.434156 | 0.487245 |
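
Precision and recall are easier to interpret next to the confusion matrix they are derived from. The sketch below uses made-up labels and predictions (assumptions for illustration, not the model's actual output) to show how the two metrics follow from the four cells of the matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = churn)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of actual churners the model caught

print(tn, fp, fn, tp, precision, recall)
```

On the real data, the same call would be confusion_matrix(y_test, y_pred_test_log). A low recall corresponds to a large fn cell, meaning many churners are predicted as non-churners.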


## Fit a KNN classifier

In [27]:
from sklearn.neighbors import KNeighborsClassifier

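# Note: this KNN is fit on the unscaled features, not on the standardized arrays
# X_train_log / X_test_log. KNN is distance-based, so the feature with the largest
# range (TotalCharges) dominates the distances here.
model = KNeighborsClassifier(n_neighbors=5, weights='uniform')  # KNN classification model
model.fit(X_train.values, y_train)  # train on plain arrays so fit and predict inputs match
y_pred = model.predict(X_test.values)  # predict test
y_pred_train = model.predict(X_train.values)  # predict train (for sanity checks)

# Use a separate name so the logistic regression results are not overwritten
performance_knn = pd.DataFrame({'Error_metric': ['Accuracy','Precision','Recall'],
                               'Train': [accuracy_score(y_train, y_pred_train),
                                         precision_score(y_train, y_pred_train),
                                         recall_score(y_train, y_pred_train)],
                               'Test': [accuracy_score(y_test, y_pred),
                                        precision_score(y_test, y_pred),
                                        recall_score(y_test, y_pred)]})

display(performance_knn)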



|   | Error_metric | Train | Test |
|---|---|---|---|
| 0 | Accuracy | 0.832796 | 0.762348 |
| 1 | Precision | 0.739488 | 0.598684 |
| 2 | Recall | 0.55487 | 0.464286 |
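
Because KNN relies on distances, it is usually fit on scaled features. One way to guarantee that the scaler and the classifier are always applied together is a scikit-learn pipeline. The sketch below is an illustration on synthetic data (the features, labels and random seed are assumptions, not the churn dataset), so its score says nothing about the results above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)

# Synthetic stand-in: one small-scale feature and one large-scale feature
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] > 0).astype(int)  # the label depends only on the small-scale feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11
)

# The pipeline fits the scaler on the training data only, then fits KNN on the scaled values
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)

test_accuracy = knn_scaled.score(X_test, y_test)
print(test_accuracy)
```

In the notebook, the equivalent without a pipeline would be to fit and predict with X_train_log and X_test_log; the metrics would then differ from the unscaled run reported above.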
