<h1>Baseline Models</h1>

<h2>Load Processed Data</h2>

In [69]:
import pandas as pd

# load data
df = pd.read_csv('../../data/processed/telco_churn_clean.csv')

# check if data is correct
df.head() # no ID column
df.isnull().sum() # no missing values
df['Churn'].value_counts() # 'Churn' is binary of type int64
df.tail()
len(df.columns)

20
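In a notebook cell only the last expression is displayed, so the checks noted in the comments above are not enforced when the cell runs. A minimal sketch of the same checks as assertions follows. The helper name `check_clean`, the ID column name `customerID`, and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def check_clean(frame: pd.DataFrame, target: str = "Churn") -> None:
    """Raise AssertionError if the frame fails any of the checks noted above."""
    # no ID column ("customerID" is an assumed name for it)
    assert "customerID" not in frame.columns
    # no missing values anywhere in the frame
    assert frame.isnull().sum().sum() == 0
    # target is an integer column holding only 0 and 1
    assert pd.api.types.is_integer_dtype(frame[target])
    assert set(frame[target].unique()) <= {0, 1}


# toy stand-in for the processed data (made-up values)
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": [0, 0, 1],
})
check_clean(toy)  # passes silently when every check holds
```

On the real data this would be called as `check_clean(df)` right after loading.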

<h2>Encode Categorical Columns</h2>

In [70]:
# one-hot encode every categorical (object-dtype) column
categorical_cols = df.select_dtypes(include='object').columns
for c in categorical_cols:

    # one-hot encode column c
    hot_encoded_c = pd.get_dummies(df[c])

    # rename each dummy column to "<category> (<original column>)"
    # so that names stay unique across columns
    hot_encoded_c.columns = [f"{new_c} ({c})" for new_c in hot_encoded_c.columns]

    # add the dummy columns to df and drop the original column
    df[hot_encoded_c.columns] = hot_encoded_c
    df = df.drop(columns=[c])

len(df.columns)
df.tail()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,Female (gender),Male (gender),No (Partner),Yes (Partner),No (Dependents),...,Yes (StreamingMovies),Month-to-month (Contract),One year (Contract),Two year (Contract),No (PaperlessBilling),Yes (PaperlessBilling),Bank transfer (automatic) (PaymentMethod),Credit card (automatic) (PaymentMethod),Electronic check (PaymentMethod),Mailed check (PaymentMethod)
7027,0,24,84.8,1990.5,0,False,True,False,True,False,...,True,False,True,False,False,True,False,False,False,True
7028,0,72,103.2,7362.9,0,True,False,False,True,False,...,True,False,True,False,False,True,False,True,False,False
7029,0,11,29.6,346.45,0,True,False,False,True,False,...,False,True,False,False,False,True,False,False,True,False
7030,1,4,74.4,306.6,1,False,True,False,True,True,...,False,True,False,False,False,True,False,False,False,True
7031,0,66,105.65,6844.5,0,False,True,True,False,True,...,True,False,False,True,False,True,True,False,False,False
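The loop above can also be written as a single `pd.get_dummies` call on the whole frame. The sketch below uses a made-up toy frame rather than the real data. Note that the generated names take the form `gender_Female` rather than `Female (gender)`, so this is an alternative and not a drop-in replacement.

```python
import pandas as pd

# toy stand-in for the processed data (made-up values)
toy = pd.DataFrame({
    "tenure": [1, 34],
    "gender": ["Female", "Male"],
    "Partner": ["Yes", "No"],
})

# encode the listed columns in one call; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["gender", "Partner"])
print(list(encoded.columns))

# optional: drop the first level of each column to avoid redundant dummies
encoded_reduced = pd.get_dummies(toy, columns=["gender", "Partner"], drop_first=True)
print(list(encoded_reduced.columns))
```

Whether to use `drop_first=True` is a judgment call: it removes one redundant column per feature, which can make linear-model coefficients easier to interpret.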


<h2>Train Test Split</h2>

In [81]:
# features and target 'Churn'
X = df.drop('Churn', axis=1)
y = df['Churn']
y  # display the target

0       0
1       0
2       1
3       0
4       1
       ..
7027    0
7028    0
7029    0
7030    1
7031    0
Name: Churn, Length: 7032, dtype: int64

In [83]:
from sklearn.model_selection import train_test_split

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
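Because the target is imbalanced, a plain random split can leave the train and test sets with slightly different churn rates. A minimal sketch of a stratified split follows, using a made-up toy target with 27% positives. The `stratify` argument is a real `train_test_split` parameter, but using it here is a suggestion and not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 1000 rows, 270 positives (made-up class balance)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 270 + [0] * 730)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"positive rate, train: {y_tr.mean():.3f}")
print(f"positive rate, test:  {y_te.mean():.3f}")
```

On the real data the equivalent call would pass `stratify=y`. The reported metrics would then change slightly, since the test rows would differ.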

<h2>Logistic Regression</h2>

In [87]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# training model
model1 = LogisticRegression()
model1.fit(X_train, y_train)

# model evaluation
y_preds = model1.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_preds)}")
print(classification_report(y_test, y_preds))

Accuracy: 0.7853589196872779
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.62      0.51      0.56       374

    accuracy                           0.79      1407
   macro avg       0.72      0.70      0.71      1407
weighted avg       0.78      0.79      0.78      1407



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
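The warning above says the solver hit the default limit of 100 iterations, and it suggests scaling the data or raising `max_iter`. A minimal sketch of both remedies follows, run on synthetic data from `make_classification` rather than the churn data. The stretched first feature is an assumption meant to mimic a large-valued column such as TotalCharges.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.73, 0.27], random_state=42
)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# the scaler is fit on the training data only, then applied before the model
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(f"iterations used: {n_iter}")
print(f"test accuracy: {scaled_model.score(X_te, y_te):.3f}")
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the scaling step. On the real data, `scaled_model.fit(X_train, y_train)` would take the place of `model1.fit(X_train, y_train)`.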


<h2>Day 4 Conclusions</h2>
<ol>
    <li>The Churn "Yes" class (1) has much worse recall than the "No" class (0): 0.51 versus 0.89.</li>
    <li>Based on the precision, recall, and F1-score for each class (0.83, 0.89, 0.86 for class 0 versus 0.62, 0.51, 0.56 for class 1), the logistic regression model is better at predicting when a customer has not churned.</li>
    <li>I would say that missing a customer who churns (a false negative) is the more problematic error, because it means we fail to detect a loss of revenue.</li>
</ol>
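If missed churners are the costlier error, one option is to lower the decision threshold so that more customers are flagged, trading precision for recall. The sketch below illustrates the idea on synthetic data. The thresholds 0.5 and 0.3 are arbitrary illustrative values, not tuned for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.3):
    # flag a customer as churned when the probability reaches the threshold
    preds = (churn_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
    precision = precision_score(y_te, preds, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}, precision={precision:.2f}")
```

Lowering the threshold can only keep or raise recall, because every customer flagged at 0.5 is still flagged at 0.3. Passing `class_weight="balanced"` to `LogisticRegression` is another way to shift the model toward the minority class.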