#### Scenario

You are working as an analyst at an internet service provider. You have been given historical data about the company's customers and their churn trends. Your task is to build a machine learning model that helps the company identify customers who are more likely to default or churn, so that it can prevent losses from such customers.

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

**Import the required libraries and modules that you would need.**

In [1]:
import pandas as pd
import numpy as np

import warnings

warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")


from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


**Read that data into Python and call the dataframe churnData.**

In [2]:
churnData = pd.read_csv('Customer-Churn.csv') 
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [3]:
churnData.shape

(7043, 16)

In [4]:
churnData.columns = churnData.columns.str.lower()
churnData.head()

Unnamed: 0,gender,seniorcitizen,partner,dependents,tenure,phoneservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,contract,monthlycharges,totalcharges,churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


**Check the datatypes of all the columns in the data. You would see that the column TotalCharges is object type. Convert this column into numeric type using pd.to_numeric function.**

In [5]:
churnData.dtypes

gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [7]:
churnData['totalcharges'] = pd.to_numeric(churnData['totalcharges'])

ValueError: Unable to parse string " " at position 488

Let's check the values in 'TotalCharges' and specifically check the value at position 488 to see its type.

In [8]:
churnData['totalcharges'].value_counts()

          11
20.2      11
19.75      9
20.05      8
19.9       8
          ..
6849.4     1
692.35     1
130.15     1
3211.9     1
6844.5     1
Name: totalcharges, Length: 6531, dtype: int64

In [9]:
missing_value = churnData['totalcharges'].iat[488]
missing_value

' '

In [10]:
is_null = pd.isnull(missing_value)
is_null

False

In [11]:
# Replace the blank strings (' ') with the mean value of TotalCharges.
# errors='coerce' converts non-numeric values to NaN, so they are skipped when the mean is computed.

mean_value = pd.to_numeric(churnData['totalcharges'], errors='coerce').mean()


churnData['totalcharges'] = churnData['totalcharges'].replace(' ', mean_value)


In [12]:
churnData['totalcharges'].value_counts() # so the space is converted to mean of the col

2283.3004408418697    11
20.2                  11
19.75                  9
20.05                  8
19.9                   8
                      ..
6849.4                 1
692.35                 1
130.15                 1
3211.9                 1
6844.5                 1
Name: totalcharges, Length: 6531, dtype: int64

In [13]:
churnData.dtypes

gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [14]:
# dtype is still object, so convert the whole column to numeric again

churnData['totalcharges'] = pd.to_numeric(churnData['totalcharges'])

In [15]:
churnData.dtypes

gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
monthlycharges      float64
totalcharges        float64
churn                object
dtype: object
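
The replace-then-convert steps above can also be collapsed into a single pass. The following is a minimal sketch on a toy Series, not the lab data; the helper name `to_numeric_fill_mean` is hypothetical and introduced only for illustration.

```python
import pandas as pd


def to_numeric_fill_mean(series: pd.Series) -> pd.Series:
    """Coerce a column to float and fill unparseable entries with the column mean."""
    # errors="coerce" turns values such as ' ' into NaN instead of raising
    numeric = pd.to_numeric(series, errors="coerce")
    # mean() skips NaN by default, so the fill value comes from valid rows only
    return numeric.fillna(numeric.mean())


# Toy stand-in for churnData['totalcharges'] (assumption, not the real column)
sample = pd.Series(["29.85", " ", "108.15"])
cleaned = to_numeric_fill_mean(sample)
print(cleaned.dtype)
```

Because the result is already float64, no second `pd.to_numeric` call is needed with this approach.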

**Check for null values in the dataframe. Replace the null values.**

In [16]:
null_counts = churnData.isnull().sum()
null_counts

gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

**Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:**
        
Scale the features either by using normalizer or a standard scaler.<hr>
Split the data into a training set and a test set.<hr>
Fit a logistic regression model on the training data.<hr>
Check the accuracy on the test data.

In [17]:
new_data = churnData[['tenure', 'seniorcitizen', 'monthlycharges', 'totalcharges']]
new_data.head()

Unnamed: 0,tenure,seniorcitizen,monthlycharges,totalcharges
0,1,0,29.85,29.85
1,34,0,56.95,1889.5
2,2,0,53.85,108.15
3,45,0,42.3,1840.75
4,2,0,70.7,151.65


In [18]:
new_data.dtypes

tenure              int64
seniorcitizen       int64
monthlycharges    float64
totalcharges      float64
dtype: object

In [19]:
X = new_data
y = churnData['churn']

In [20]:
normalizer = MinMaxScaler()
X_normalized = normalizer.fit_transform(X)
X_normalized = pd.DataFrame(X_normalized, columns=X.columns)
X_normalized


Unnamed: 0,tenure,seniorcitizen,monthlycharges,totalcharges
0,0.013889,0.0,0.115423,0.001275
1,0.472222,0.0,0.385075,0.215867
2,0.027778,0.0,0.354229,0.010310
3,0.625000,0.0,0.239303,0.210241
4,0.027778,0.0,0.521891,0.015330
...,...,...,...,...
7038,0.333333,0.0,0.662189,0.227521
7039,1.000000,0.0,0.845274,0.847461
7040,0.152778,0.0,0.112935,0.037809
7041,0.055556,1.0,0.558706,0.033210


In [21]:
churnData['churn'].unique()

array(['No', 'Yes'], dtype=object)

In [22]:
churnData['churn'] = churnData['churn'].apply(lambda x: 1 if x == "Yes" else 0)

In [23]:
y= churnData[['churn']]
y.head()

Unnamed: 0,churn
0,0
1,0
2,1
3,0
4,1


In [24]:
churnData['churn'].unique()

array([0, 1])
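
The lambda above encodes every value other than "Yes" as 0, which would silently hide any unexpected label. As a hedged alternative, an explicit mapping leaves unknown labels as NaN so they can be inspected. The toy Series below, including the deliberately odd value "Maybe", is an assumption for illustration only.

```python
import pandas as pd

# Toy labels; "Maybe" is a deliberately unexpected value (not present in the lab data)
labels = pd.Series(["No", "Yes", "No", "Maybe"])

# Values missing from the dictionary become NaN instead of being forced to 0
encoded = labels.map({"No": 0, "Yes": 1})

# Collect the raw labels that could not be encoded
unexpected = labels[encoded.isna()]
print(unexpected.tolist())
```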

In [25]:
# train-test split using X_normalized

X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)

In [26]:
#Fit a logistic regression model on the training data.

model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [27]:
y_pred = model.predict(X_test)

In [28]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7950780880265026


In [29]:
from sklearn.metrics import classification_report

In [30]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.93      0.87      1539
           1       0.69      0.44      0.54       574

    accuracy                           0.80      2113
   macro avg       0.75      0.69      0.70      2113
weighted avg       0.78      0.80      0.78      2113
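
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows influence the scaling. The sketch below shows one way to chain scaling and fitting so that the scaler only sees the training split. It runs on synthetic data from `make_classification` (an assumption standing in for the four churn features), so its accuracy is not comparable to the result above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four churn features (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=42,
)

# Split first; stratify keeps the class ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits MinMaxScaler on X_tr only, then applies it to X_te at predict time
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

acc_demo = accuracy_score(y_te, pipe.predict(X_te))
print("Accuracy on synthetic data:", acc_demo)
```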



**Managing imbalance in the dataset**

Check for the imbalance.<hr>
Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.<hr>
Each time, fit the model and check its accuracy.<hr>

In [31]:
# Checking how imbalanced the churn data is

y.value_counts()

churn
0        5174
1        1869
dtype: int64
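
To put the class counts above in context, the short calculation below derives the majority-class share, which is the accuracy a model would reach by always predicting "no churn". It uses only the two counts printed above.

```python
# Class counts taken from the y.value_counts() output above
counts = {0: 5174, 1: 1869}

total = sum(counts.values())
# Accuracy of a trivial model that always predicts the majority class
majority_share = max(counts.values()) / total
# Number of non-churners per churner
imbalance_ratio = counts[0] / counts[1]

print(f"Majority-class share: {majority_share:.4f}")
print(f"Imbalance ratio (0:1): {imbalance_ratio:.2f}")
```

A share of roughly 0.73 means that the baseline accuracy of about 0.80 is only a modest improvement over always predicting class 0, which is why the per-class recall is worth tracking alongside accuracy.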

#### SMOTE

In [32]:
from imblearn.over_sampling import SMOTE

In [33]:
smote = SMOTE()
X_smote, y_smote = smote.fit_resample(X_normalized, y)

In [34]:
y_smote.value_counts()

churn
0        5174
1        5174
dtype: int64
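
For intuition about where the extra 3,305 minority rows come from: SMOTE creates synthetic points by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration of that idea, not imblearn's implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new points by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point
    base = rng.integers(0, len(X_min), size=n_new)  # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: the four corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two existing minority points, so no new region of feature space is invented.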

#### TomekLinks

In [35]:
from imblearn.under_sampling import TomekLinks

In [36]:
tomek = TomekLinks(sampling_strategy='majority')

X_tomek, y_tomek = tomek.fit_resample(X_normalized, y)

In [37]:
y_tomek.value_counts()

churn
0        4651
1        1869
dtype: int64
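
The counts above show that TomekLinks removed only 523 majority rows (5174 to 4651) and did not equalize the classes. That follows from what a Tomek link is: a pair of points from different classes that are each other's nearest neighbour. The sketch below is a simplified NumPy illustration of that definition, not imblearn's implementation; `tomek_link_pairs` and the toy data are hypothetical.

```python
import numpy as np


def tomek_link_pairs(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # ignore self-distances
    nearest = dists.argmin(axis=1)   # nearest neighbour of each point
    pairs = []
    for i, j in enumerate(nearest):
        # i < j avoids reporting the same pair twice
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, int(j)))
    return pairs


# Toy 1-D data: points 2 (class 0) and 3 (class 1) sit right next to each other
X_toy = np.array([[0.0], [0.1], [1.0], [1.05], [3.0]])
y_toy = np.array([0, 0, 0, 1, 1])

links = tomek_link_pairs(X_toy, y_toy)
# With sampling_strategy='majority', only the majority-class member of each link is dropped
majority_to_drop = [i if y_toy[i] == 0 else j for i, j in links]
print(links)  # [(2, 3)]
```

Only majority points that sit on the class boundary are removed, so the technique cleans the boundary rather than balancing the class counts.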

#### Reevaluating the model balanced with SMOTE

In [38]:
# train-test split using normalized and oversampled data

X_train, X_test, y_train, y_test= train_test_split(X_smote, y_smote, test_size=0.3, random_state=42)


In [39]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [40]:
y_pred_smote = model.predict(X_test)

In [41]:
accuracy_smote = accuracy_score(y_test, y_pred_smote)
# Print accuracy_smote rather than the baseline `accuracy` variable from In [28]
print("Accuracy_smote:", round(accuracy_smote, 2))

Accuracy_smote: 0.75


In [42]:
print(classification_report(y_test, y_pred_smote))

              precision    recall  f1-score   support

           0       0.76      0.74      0.75      1574
           1       0.74      0.76      0.75      1531

    accuracy                           0.75      3105
   macro avg       0.75      0.75      0.75      3105
weighted avg       0.75      0.75      0.75      3105
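
Note that the cells above resample before splitting, so the test split contains synthetic rows and may share information with the training rows. A more conservative order is to split first and resample only the training split. The sketch below illustrates that order on synthetic data, using random upsampling via `sklearn.utils.resample` as a plain stand-in for SMOTE; the data and variable names are assumptions, and the printed score is not comparable to the report above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption, not the churn dataset)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=0,
)

# 1) Split first, so the test split keeps the original class distribution
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

# 2) Upsample the minority class in the training split only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

# 3) Fit on the balanced training data, evaluate on the untouched test split
clf = LogisticRegression().fit(X_bal, y_bal)
recall_minority = recall_score(y_te, clf.predict(X_te))
print("Minority recall on untouched test split:", recall_minority)
```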



#### Reevaluating the model balanced with TomekLinks

In [43]:
# train-test split using normalized and undersampled data

X_train, X_test, y_train, y_test = train_test_split(X_tomek, y_tomek, test_size=0.3, random_state=42)

In [44]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [45]:
y_pred_tomek = model.predict(X_test)

In [46]:
accuracy = accuracy_score(y_test, y_pred_tomek)
print("Accuracy_tomek:", accuracy)

Accuracy_tomek: 0.7847648261758691


In [47]:
print(classification_report(y_test, y_pred_tomek))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85      1368
           1       0.69      0.52      0.59       588

    accuracy                           0.78      1956
   macro avg       0.75      0.71      0.72      1956
weighted avg       0.78      0.78      0.78      1956



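#### The base model trained on normalized data reached an accuracy of about 0.80, while the model trained on SMOTE-balanced data reached about 0.75 according to its classification report. Precision, recall and f1-score all decreased for class 0 and increased for class 1, with class 1 recall rising from 0.44 to 0.76. The SMOTE test split is itself resampled, so the two reports are not measured on the same data.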

#### Modeling the normalized data undersampled with TomekLinks gave a slightly lower accuracy than the base model (0.78 vs. 0.80). Precision, recall and f1-score did not change significantly for class 0, but recall and f1-score for class 1 increased a bit (0.44 to 0.52 and 0.54 to 0.59).
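
To make the comparison easier to scan, the sketch below collects the rounded figures from the three classification reports above into one table and computes the change relative to the baseline. The numbers are copied from the reports; the column names `recall_1` and `f1_1` are labels chosen here for class 1.

```python
import pandas as pd

# Rounded values copied from the three classification reports above
summary = pd.DataFrame(
    {
        "accuracy": [0.80, 0.75, 0.78],
        "recall_1": [0.44, 0.76, 0.52],
        "f1_1": [0.54, 0.75, 0.59],
    },
    index=["baseline", "SMOTE", "TomekLinks"],
)

# Difference of each row from the baseline row (aligned on column names)
delta = (summary - summary.loc["baseline"]).round(2)
print(delta)
```

Because each model was evaluated on a different test split, these differences are indicative rather than a strict like-for-like comparison.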