Follow these steps to build a simple model without balancing the data:

- Import the required libraries and modules that you would need.
- Read the data into Python as a dataframe named `churnData`.
- Check the data types of all the columns. The `TotalCharges` column has type `object`; convert it to a numeric type with the `pd.to_numeric` function.
- Check for null values in the dataframe and replace them.
- Use the following features: `tenure`, `SeniorCitizen`, `MonthlyCharges` and `TotalCharges`:
  - Scale the features using either a normalizer or a standard scaler.
  - Split the data into a training set and a test set.
  - Fit a logistic regression model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

Managing imbalance in the dataset

- Check for the imbalance.
- Apply the upsampling and downsampling strategies covered in class to balance the two classes.
- After each resampling, fit the model again and check how its accuracy changes.

In [21]:
import pandas as pd

In [22]:
churnData = pd.read_csv("./Customer-Churn.csv")

In [23]:
churnData

,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


In [24]:
# check for data types
print(churnData.dtypes)


gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


In [25]:
# Convert the 'TotalCharges' column to numeric type
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')


In [26]:
# check for null values
churnData.isnull().sum()


gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
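
The 11 nulls in `TotalCharges` come from the conversion step: with `errors='coerce'`, `pd.to_numeric` turns any value it cannot parse into `NaN`. The minimal sketch below illustrates this on a hypothetical series; the blank-string entry is an assumption about what the raw file might contain, not something confirmed by the output above.

```python
import pandas as pd

# Hypothetical values mimicking a numeric column stored as text.
# The blank entry is an assumed example of an unparseable value.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" replaces unparseable entries with NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)           # float64
print(converted.isnull().sum())  # 1
```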

In [27]:
# Replace null values in "TotalCharges" with the column mean.
# Assign the result back instead of calling fillna(..., inplace=True) on the
# selected column: that form raises a FutureWarning and will stop working in pandas 3.0.
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(churnData['TotalCharges'].mean())


In [28]:
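# Apply StandardScaler to the features tenure, SeniorCitizen, MonthlyCharges and TotalCharges.
# Each feature is rescaled to mean 0 and standard deviation 1, which puts the features on a
# comparable scale and helps the logistic regression solver converge.
# Note: the scaler is fitted on the full dataset here, before the train/test split.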

from sklearn.preprocessing import StandardScaler

# Select the features to scale
features_to_scale = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']

# Initialize the scaler
scaler = StandardScaler()

# Scale the features
churnData[features_to_scale] = scaler.fit_transform(churnData[features_to_scale])

# Verify the scaling
print(churnData[features_to_scale].head())

     tenure  SeniorCitizen  MonthlyCharges  TotalCharges
0 -1.277445      -0.439916       -1.160323     -0.994971
1  0.066327      -0.439916       -0.259629     -0.173876
2 -1.236724      -0.439916       -0.362660     -0.960399
3  0.514251      -0.439916       -0.746535     -0.195400
4 -1.236724      -0.439916        0.197365     -0.941193
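
Fitting the scaler on the full dataset lets the test rows influence the scaling statistics. One common alternative, sketched here under stated assumptions, is to split first and wrap the scaler and the model in a scikit-learn pipeline so the scaler is fitted on the training rows only. The data below is synthetic and the names `X_demo`, `y_demo` and `model` are hypothetical; this is not the notebook's own method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the churn file): 4 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=20.0, size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=10.0, size=200) > 50.0).astype(int)

# Split before any scaling so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# statistics to transform X_te at prediction time
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Accuracy on held-out data: {test_accuracy:.2f}")
```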


In [29]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define the features and target variable
X = churnData[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y = churnData['Churn']  # Target: 'Yes' / 'No' churn label

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the logistic regression model
logreg = LogisticRegression()

# Fit the model on the training data
logreg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = logreg.predict(X_test)

# Check the accuracy on the test data
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on test data: {accuracy:.2f}")

Accuracy on test data: 0.81
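
Accuracy alone does not show how well the minority class is predicted. As a small illustration, the sketch below computes a confusion matrix and the recall for the `Yes` class on ten hand-made labels. These labels are hypothetical and are not taken from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted labels, chosen by hand for illustration
y_true = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_hat = ["No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes"]

# Rows are true classes, columns are predicted classes, in the order given
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
acc = accuracy_score(y_true, y_hat)
recall_yes = recall_score(y_true, y_hat, pos_label="Yes")

print(cm)
# [[5 1]
#  [2 2]]
print(f"Accuracy: {acc:.2f}")             # Accuracy: 0.70
print(f"Recall (Yes): {recall_yes:.2f}")  # Recall (Yes): 0.50
```

In this toy case the accuracy is 0.70, yet only half of the true `Yes` cases are found.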


In [30]:
# Check for class imbalance
print(churnData['Churn'].value_counts())

Churn
No     5174
Yes    1869
Name: count, dtype: int64
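
These counts give a useful reference point: a model that always predicts `No` would already be right for every majority-class row. The short calculation below works out that majority-class baseline from the counts printed above.

```python
# Class counts copied from the value_counts() output above
n_no, n_yes = 5174, 1869

# Accuracy of a model that always predicts the majority class ("No")
baseline_accuracy = n_no / (n_no + n_yes)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
# prints: Majority-class baseline accuracy: 0.735
```

An accuracy measured on data with this class distribution should be read against that baseline of roughly 0.735.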


In [31]:
# Upsampling: resample the minority class (with replacement) up to the majority class size

from sklearn.utils import resample

churnData['Churn'] = churnData['Churn'].map({'Yes': 1, 'No': 0})

# Separate majority and minority classes
majority = churnData[churnData['Churn'] == 0]
minority = churnData[churnData['Churn'] == 1]

# Upsample the minority class
minority_upsampled = resample(minority, 
                              replace=True,     # Sample with replacement
                              n_samples=len(majority),  # Match majority class size
                              random_state=42)  # Reproducibility

# Combine majority and upsampled minority
upsampled = pd.concat([majority, minority_upsampled])

# Check new class distribution
print(upsampled['Churn'].value_counts())

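# Split the data (note: because the split happens after upsampling,
# copies of the same minority row can appear in both the training and test sets)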
X = upsampled[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y = upsampled['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit logistic regression and evaluate
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy after upsampling: {accuracy}")

Churn
0    5174
1    5174
Name: count, dtype: int64
Accuracy after upsampling: 0.7323671497584541
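
Note that this accuracy is measured on a balanced test set, so it is not directly comparable to the 0.81 obtained on the original class distribution. A common variant, sketched here on synthetic data, splits first and upsamples only the training split, so the test set keeps the original distribution and contains no duplicated rows. The dataframe `demo` and its columns `x1`, `x2` and `target` are hypothetical stand-ins, not the churn data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (an assumption, not the churn file)
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["target"] = (demo["x1"] + rng.normal(scale=1.0, size=n) > 1.0).astype(int)

# Split first so the test set keeps the original class distribution
train, test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["target"]
)

# Upsample the minority class inside the training split only
train_majority = train[train["target"] == 0]
train_minority = train[train["target"] == 1]
train_minority_up = resample(
    train_minority,
    replace=True,                    # sample with replacement
    n_samples=len(train_majority),   # match the majority class size
    random_state=42,
)
train_balanced = pd.concat([train_majority, train_minority_up])

features = ["x1", "x2"]
clf = LogisticRegression()
clf.fit(train_balanced[features], train_balanced["target"])
pred = clf.predict(test[features])

print(f"Accuracy: {accuracy_score(test['target'], pred):.2f}")
print(f"Recall (minority class): {recall_score(test['target'], pred):.2f}")
```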


In [32]:
# Downsampling example:

majority_downsampled = resample(majority, 
                                replace=False,    # Sample without replacement
                                n_samples=len(minority),  # Match minority class size
                                random_state=42)  # Reproducibility

# Combine downsampled majority and minority
downsampled = pd.concat([majority_downsampled, minority])

# Check new class distribution
print(downsampled['Churn'].value_counts())

# Split the data
X = downsampled[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]
y = downsampled['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit logistic regression and evaluate
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy after downsampling: {accuracy}")

Churn
0    1869
1    1869
Name: count, dtype: int64
Accuracy after downsampling: 0.7553475935828877
