# Data Processing

## Baseline Model

This notebook builds a simple baseline model to validate insights from the EDA and to identify features associated with customer churn.

### 1. Import and Load dataset

In [26]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv('../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')

### 2. Select Features

The following features were selected to build a baseline model based on the insights obtained during the EDA.

#### Target Feature

- `Churn`

#### Numerical

- `MonthlyCharges`: Amount that a customer pays every month.
- `tenure`: How long the customer has been with the company (in months).

#### Categorical

- `TechSupport`: Whether the customer has technical support.
- `Contract`: Type of contract the customer has (month-to-month, one year, or two year).

### 3. Define `X` and `y`

#### Create target `y`

In [11]:
y = df['Churn'].map({'Yes': 1, 'No': 0})
y.head()
y.value_counts(dropna=False)

Churn
0    5174
1    1869
Name: count, dtype: int64
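
`Series.map` with a dict returns `NaN` for any label that is missing from the dict, so an unexpected value such as a lowercase `'yes'` would silently become a missing target. The sketch below shows one possible guard. It runs on a small made-up Series, and `encode_target`, `raw`, and `y_toy` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

def encode_target(labels: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0 and fail loudly on any other label."""
    mapping = {"Yes": 1, "No": 0}
    encoded = labels.map(mapping)
    # Labels that are not keys of the mapping come back as NaN
    unmapped = labels[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unexpected target labels: {list(unmapped)}")
    return encoded.astype(int)

# Toy data for illustration only (not taken from the Telco dataset)
raw = pd.Series(["Yes", "No", "No", "Yes"], name="Churn")
y_toy = encode_target(raw)
print(y_toy.tolist())
```

In the notebook itself, the `value_counts(dropna=False)` output already shows that no `NaN` was produced, so the guard is only a safety net for future data.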

#### Create `X`

In [30]:
selected_features = ['tenure', 'TechSupport', 'Contract', 'MonthlyCharges']
X = df[selected_features]
X.head()

| | tenure | TechSupport | Contract | MonthlyCharges |
|---|---|---|---|---|
| 0 | 1 | No | Month-to-month | 29.85 |
| 1 | 34 | No | One year | 56.95 |
| 2 | 2 | No | Month-to-month | 53.85 |
| 3 | 45 | Yes | One year | 42.3 |
| 4 | 2 | No | Month-to-month | 70.7 |


In [31]:
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Any missing in X?", X.isna().sum().sum())
print("Any missing in y?", y.isna().sum())

X shape: (7043, 4)
y shape: (7043,)
Any missing in X? 0
Any missing in y? 0


### 3. Minimal Preprocessing (Baseline Model)

Make the data model-ready. At the end of this step we have:

- `X`: Original features
- `X_encoded`: One-hot-encoded, fully numeric version of `X`
- `y`: Binary target (0/1)

#### One-hot-encoding

In [38]:
X_encoded = pd.get_dummies(
    X,
    columns = ['TechSupport', 'Contract'],
    drop_first = True
)
X_encoded.head()


| | tenure | MonthlyCharges | TechSupport_No internet service | TechSupport_Yes | Contract_One year | Contract_Two year |
|---|---|---|---|---|---|---|
| 0 | 1 | 29.85 | False | False | False | False |
| 1 | 34 | 56.95 | False | False | True | False |
| 2 | 2 | 53.85 | False | False | False | False |
| 3 | 45 | 42.3 | False | True | True | False |
| 4 | 2 | 70.7 | False | False | False | False |


### 4. Train / Test Split

In this step the data will be split into:

- Training set: To fit the model 
- Test set: To evaluate on unseen data 

In [46]:
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size = 0.2,
    random_state = 42,
    stratify = y
)

#### Quick checks

In [41]:
print('X_train:', X_train.shape)
print('X_test:', X_test.shape)
print('y_train:', y_train.shape)
print('y_test:', y_test.shape)

print('Churn rate (train):', y_train.mean())
print('Churn rate (test):', y_test.mean())

X_train: (5634, 6)
X_test: (1409, 6)
y_train: (5634,)
y_test: (1409,)
Churn rate (train): 0.2653532126375577
Churn rate (test): 0.2654364797728886


**Results:**

The dataset was split 80/20 into a training set (5,634 rows) and a test set (1,409 rows) using a stratified split to preserve the churn distribution. Both sets have a churn rate of about 26.5%, so the test metrics are computed on the same class balance the model was trained on.
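
To illustrate what `stratify` contributes, the sketch below splits synthetic labels with and without it. The data is generated for illustration with roughly the notebook's class ratio (an assumption, not the real Telco rows), and `X_toy`, `y_toy`, `y_test_plain`, and `y_test_strat` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 26.5% positives (illustrative assumption)
rng = np.random.default_rng(0)
y_toy = np.array([1] * 265 + [0] * 735)
X_toy = rng.normal(size=(1000, 2))

# Plain random split: the test-set class ratio can drift from the overall ratio
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Stratified split: the class ratio is preserved as closely as possible
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Overall positive rate:   ", y_toy.mean())
print("Test rate, no stratify:  ", y_test_plain.mean())
print("Test rate, with stratify:", y_test_strat.mean())
```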

### 5. Train the baseline model (Logistic Regression)

Train one simple model to:
- validate EDA insights
- get a reference performance
- inspect feature importance (coefficients)

Not tuning. Not optimizing. Just learning.

#### 5.1 Train the model

In [47]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


#### 5.2 Make predictions on the test set

In [44]:
y_pred = model.predict(X_test)

#### 5.3 Evaluate the baseline

In [48]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      1035
           1       0.60      0.49      0.54       374

    accuracy                           0.78      1409
   macro avg       0.72      0.69      0.70      1409
weighted avg       0.77      0.78      0.77      1409



**Baseline Performance:**  
The logistic regression baseline reaches an overall accuracy of 78% on the test set. It identifies non-churned customers well (precision 0.83, recall 0.88), but recall for churned customers is only 0.49 at a precision of 0.60, so roughly half of the actual churners are missed. This pattern is typical for imbalanced classification problems, where accuracy is dominated by the majority class, and it motivates further model tuning and feature engineering.
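
A quick way to put the 78% accuracy in context is to compare it with a trivial model that always predicts "no churn". The arithmetic below is a sketch that uses only the support counts and the rounded recall from the classification report above, so the derived numbers are approximate. The variable names are illustrative.

```python
# Support counts copied from the classification report above
n_no_churn = 1035
n_churn = 374
n_total = n_no_churn + n_churn

# Accuracy of a trivial model that always predicts class 0 ("no churn")
majority_accuracy = n_no_churn / n_total

# Recall for class 1 is TP / (TP + FN); with the reported (rounded) recall
# of 0.49, about (1 - 0.49) of the 374 churners in the test set are missed
reported_recall = 0.49
approx_missed = round((1 - reported_recall) * n_churn)

print(f"Majority-class accuracy: {majority_accuracy:.3f}")
print(f"Approx. churners missed: {approx_missed} of {n_churn}")
```

Under these numbers the baseline improves on the always-"no churn" model by only a few accuracy points, which is why recall on the churn class is the more informative metric here.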

### 6. Feature Importance

Understand which features increase or decrease churn risk, and by how much.
Logistic regression is well suited here because:
- it is interpretable
- each coefficient has a direction (its sign) and a magnitude (the change in the log-odds of churn per unit of the feature)

#### 6.1 Extract coefficients


In [50]:
feature_importance = pd.Series(
    model.coef_[0],
    index = X_train.columns
)

feature_importance

tenure                            -0.033594
MonthlyCharges                     0.026383
TechSupport_No internet service   -0.565708
TechSupport_Yes                   -0.652630
Contract_One year                 -0.929921
Contract_Two year                 -1.682631
dtype: float64

**Baseline Feature Importance Interpretation**

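The baseline logistic regression model assigns the largest coefficients to contract type. Relative to month-to-month customers (the reference category dropped during encoding), customers with one-year (-0.93) and two-year (-1.68) contracts have substantially lower log-odds of churn. Having technical support (-0.65) is also associated with lower churn risk compared with customers without it. Higher monthly charges are associated with higher churn (+0.026 per unit of `MonthlyCharges`), and longer tenure with lower churn (-0.034 per month). Because the numerical features were not scaled, their coefficients are per-unit effects and cannot be compared directly in magnitude with the coefficients of the dummy variables.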
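
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below hardcodes the coefficients printed above instead of reading them from `model`. The 12-month step for `tenure` is an arbitrary illustrative choice, and `coefs` and `odds_ratios` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the notebook output above
coefs = pd.Series({
    "tenure": -0.033594,
    "MonthlyCharges": 0.026383,
    "TechSupport_No internet service": -0.565708,
    "TechSupport_Yes": -0.652630,
    "Contract_One year": -0.929921,
    "Contract_Two year": -1.682631,
})

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = np.exp(coefs).sort_values()
print(odds_ratios.round(3))

# For numeric features a one-unit step is small, so also show the effect
# of a larger, hypothetical step of 12 units of tenure
tenure_12 = float(np.exp(12 * coefs["tenure"]))
print("Odds multiplier for +12 tenure:", round(tenure_12, 3))
```

An odds ratio below 1 means lower odds of churn than the reference, and a value above 1 means higher odds.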

## Baseline Model Conclusions

The baseline model is consistent with the key insights obtained during the EDA. Contract type shows the strongest association with churn, followed by technical support availability, with tenure and monthly charges contributing in the expected directions. While the baseline model shows reasonable overall performance (78% accuracy), recall for churned customers is limited (0.49), indicating potential for improvement through feature engineering, model tuning, or alternative algorithms.



