Let's load our data and do some feature engineering so that every feature is either numeric or boolean.

In [1]:
import pandas as pd

DATA = '/kaggle/input/telecom-customers/Telecom Customers Churn.csv'
TARGET = 'Churn'

df = pd.read_csv(filepath_or_buffer=DATA, index_col=['customerID'])
# Map Yes/No columns to booleans; any value other than 'Yes' becomes False.
for column in ['Partner', 'Dependents', 'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']:
    df[column] = df[column] == 'Yes'
# One-hot encode the remaining categorical columns.
df = pd.get_dummies(data=df, columns=['gender', 'Contract', 'MultipleLines', 'InternetService', 'PaymentMethod'])
# Blank TotalCharges entries are treated as zero before converting to float.
df['TotalCharges'] = df['TotalCharges'].replace(' ', '0').astype(float)
df.head()

customerID,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,...,MultipleLines_No,MultipleLines_No phone service,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,InternetService_No,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
7590-VHVEG,0,True,False,1,False,False,True,False,False,False,...,False,True,False,True,False,False,False,False,True,False
5575-GNVDE,0,False,False,34,True,True,False,True,False,False,...,True,False,False,True,False,False,False,False,False,True
3668-QPYBK,0,False,False,2,True,True,True,False,False,False,...,True,False,False,True,False,False,False,False,False,True
7795-CFOCW,0,False,False,45,False,True,False,True,True,False,...,False,True,False,True,False,False,True,False,False,False
9237-HQITU,0,False,False,2,True,False,False,False,False,False,...,True,False,False,False,True,False,False,False,True,False
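
To make the three transformations above concrete, here is a minimal sketch on a hypothetical three-row frame. The `toy` frame and its values are made up for illustration and are not taken from the dataset; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        'Partner': ['Yes', 'No', 'Yes'],
        'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
        'TotalCharges': ['29.85', ' ', '108.15'],
    }
)

# Yes/No column -> boolean (anything other than 'Yes' becomes False).
toy['Partner'] = toy['Partner'] == 'Yes'
# Categorical column -> one indicator column per distinct value.
toy = pd.get_dummies(data=toy, columns=['Contract'])
# Blank string -> '0', then parse the whole column as float.
toy['TotalCharges'] = toy['TotalCharges'].replace(' ', '0').astype(float)

print(toy)
```

The `Contract` column is replaced by one indicator column per distinct value, and the blank `TotalCharges` entry ends up as 0.0.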


Are the classes in our target variable balanced?

In [2]:
df[TARGET].value_counts().to_dict()

{False: 5174, True: 1869}
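
As a quick check on those counts, this sketch works out the class ratio and the share of the majority class. The names `ratio` and `majority_share` are introduced here for illustration; the counts are copied from the output above.

```python
# Class counts copied from the value_counts() output above.
counts = {False: 5174, True: 1869}

total = sum(counts.values())
# How many non-churners there are per churner.
ratio = counts[False] / counts[True]
# Accuracy of a model that always predicts the majority class.
majority_share = max(counts.values()) / total

print('negative-to-positive ratio: {:.2f}'.format(ratio))
print('majority class share: {:.4f}'.format(majority_share))
```

The ratio comes out at roughly 2.77, and always predicting the majority class would already be right about 73% of the time, which is a useful floor when reading accuracy figures on this dataset.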

One class outnumbers the other about three to one. Let's use dimension reduction to visualize our data.

In [3]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df.drop(columns=[TARGET]))
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:42.633292


In [4]:
from plotly import express

express.scatter(data_frame=df, x='x', y='y', color=TARGET)

We see some local clustering of the two classes, but not much. Let's build a model.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['x', 'y', TARGET]), df[TARGET], test_size=0.2, random_state=2024, stratify=df[TARGET])

logreg = LogisticRegression(max_iter=1000, tol=1e-12).fit(X_train, y_train)
# Predict once and reuse the result for every metric.
y_pred = logreg.predict(X=X_test)
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=y_pred)))
print(classification_report(y_true=y_test, y_pred=y_pred))

model fit in 193 iterations
accuracy: 0.7878
f1: 0.7812
              precision    recall  f1-score   support

       False       0.84      0.89      0.86      1035
        True       0.62      0.52      0.56       374

    accuracy                           0.79      1409
   macro avg       0.73      0.70      0.71      1409
weighted avg       0.78      0.79      0.78      1409
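
To see how the two averaged rows of the report relate to the per-class rows, here is a minimal sketch that recomputes them by hand. The `report` dictionary is a hypothetical container introduced for illustration; its numbers are the rounded values printed above, so the results agree with the report only to two decimals.

```python
# Per-class F1 and support, copied from the classification report above.
report = {
    False: {'f1': 0.86, 'support': 1035},
    True: {'f1': 0.56, 'support': 374},
}

total_support = sum(row['support'] for row in report.values())
# Macro average: plain mean of the per-class scores.
macro_f1 = sum(row['f1'] for row in report.values()) / len(report)
# Weighted average: per-class scores weighted by class support.
weighted_f1 = sum(row['f1'] * row['support'] for row in report.values()) / total_support

print('macro f1: {:.2f}'.format(macro_f1))
print('weighted f1: {:.2f}'.format(weighted_f1))
```

Because the majority class carries most of the weight, the weighted F1 sits well above the F1 of the churn class itself, which is the number to watch if catching churners is the goal.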

