## Model Training

### Steps:
1. **Load and Explore the Data**:
   - Identify data types and handle any inconsistencies.

2. **Split the Data**:
   - Separate the target (`Churn`) and features. Split into training and test sets.

3. **Data Preprocessing and Preprocessing Pipeline**:
   - Encode binary, nominal, and ordinal columns.
   - Scale numerical features.
   - Handle missing values, if any.

4. **Train a Model**:
   - Use a simple model (Logistic Regression) as a baseline.

5. **Evaluate the Model**:
   - Check accuracy, precision, recall, or other relevant metrics on the test set.

1. **Load and Explore the Data**:

-  Import the pandas, NumPy, Matplotlib, seaborn, and warnings libraries.

In [249]:
# Basic Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings


-  Import the CSV data as a pandas DataFrame and show the top 5 rows.

In [250]:
df = pd.read_csv('Telco-Customer-Churn.csv')
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


- Identify data types, nulls and handle any inconsistencies

In [251]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [252]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].dtype

dtype('float64')
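To see what `errors='coerce'` does, here is a minimal sketch on a toy Series. The values are made up for illustration, and the assumption is that unparseable entries such as blank strings are what kept the column from being numeric. Anything that cannot be parsed becomes `NaN`, which the median imputer later in the notebook can fill.

```python
import pandas as pd

# Toy stand-in for the raw column (assumed values, not taken from the dataset).
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# Unparseable entries (here the blank string) are turned into NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# Count how many values were lost to coercion.
n_missing = int(coerced.isna().sum())
print(coerced.dtype, n_missing)
```

Running the same `isna().sum()` check on the real column shows how many rows the imputer will have to fill.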

In [253]:
df = df.drop('customerID', axis=1)

2. **Split the Data**:
   - Separate the target (`Churn`) and features. 
   - Split into training and test sets.

In [254]:
# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

In [255]:
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
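The split above is purely random. Because churners are the minority class, one option (not used in this notebook) is to pass `stratify=y` so that both sets keep the same class ratio. A minimal sketch on toy data, where the 70/30 class balance is an assumption for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an assumed 70/30 class balance.
X_toy = pd.DataFrame({"tenure": range(100)})
y_toy = pd.Series(["No"] * 70 + ["Yes"] * 30, name="Churn")

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = (y_tr == "Yes").mean()
test_rate = (y_te == "Yes").mean()
print(train_rate, test_rate)
```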


3. **Data Preprocessing**:
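   - Binary columns are one-hot encoded with `OneHotEncoder(drop='if_binary')`, which keeps a single 0/1 column per feature.
   - Nominal columns are one-hot encoded.
   - Ordinal columns (`Contract`) are encoded with `OrdinalEncoder` using a custom category order.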
   - Numerical columns are imputed (missing values filled with the median) and scaled using `StandardScaler`.

In [256]:
# Identify categorical columns

import sys  
import os  

### Set the functions directory 
sys.path.append(os.path.abspath('../src'))
from components.eda_functions import print_categories

### categorical columns
cate_cols = df.select_dtypes(include='object')
print_categories(cate_cols) 

Categories in 'gender' there are 2 categories: ['Female' 'Male']
Categories in 'Partner' there are 2 categories: ['Yes' 'No']
Categories in 'Dependents' there are 2 categories: ['No' 'Yes']
Categories in 'PhoneService' there are 2 categories: ['No' 'Yes']
Categories in 'MultipleLines' there are 3 categories: ['No phone service' 'No' 'Yes']
Categories in 'InternetService' there are 3 categories: ['DSL' 'Fiber optic' 'No']
Categories in 'OnlineSecurity' there are 3 categories: ['No' 'Yes' 'No internet service']
Categories in 'OnlineBackup' there are 3 categories: ['Yes' 'No' 'No internet service']
Categories in 'DeviceProtection' there are 3 categories: ['No' 'Yes' 'No internet service']
Categories in 'TechSupport' there are 3 categories: ['No' 'Yes' 'No i

In [257]:
# Define column groups
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
nominal_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'PaymentMethod']
ordinal_cols = ['Contract']
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Custom ordinal mapping for 'Contract'
contract_mapping = [['Month-to-month', 'One year', 'Two year']]
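To illustrate what the custom order does, here is a small sketch on a toy column. Passing the category list to `OrdinalEncoder` fixes the integer codes by position in that list rather than by alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same order as contract_mapping above: shortest commitment first.
contract_order = [["Month-to-month", "One year", "Two year"]]

# Toy column with illustrative values.
toy = pd.DataFrame({"Contract": ["Two year", "Month-to-month", "One year"]})

codes = OrdinalEncoder(categories=contract_order).fit_transform(toy).ravel().tolist()
print(codes)
```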

- **Preprocessing**

-  Define preprocessors

In [258]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline  # needed here: the transformers below are Pipelines
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.impute import SimpleImputer

# Define preprocessors
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='if_binary', dtype=int)) 
])

nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=contract_mapping))
])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Encode target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)  # Encode 'No' -> 0, 'Yes' -> 1
y_test_encoded = label_encoder.transform(y_test)
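`LabelEncoder` assigns codes in sorted order of the class labels, which is why `'No'` maps to 0 and `'Yes'` to 1. A short sketch on toy labels that also shows how to map predictions back to the original strings:

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()

# Classes are sorted, so 'No' -> 0 and 'Yes' -> 1.
codes = toy_encoder.fit_transform(["No", "Yes", "No", "Yes"]).tolist()
classes = toy_encoder.classes_.tolist()

# inverse_transform recovers the original labels from numeric predictions.
decoded = toy_encoder.inverse_transform([1, 0]).tolist()
print(codes, classes, decoded)
```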


-  Combine preprocessors

In [259]:
# Combine preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('binary', binary_transformer, binary_cols),
        ('nominal', nominal_transformer, nominal_cols),
        ('ordinal', ordinal_transformer, ordinal_cols),
        ('numerical', numerical_transformer, numerical_cols)
    ])
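`ColumnTransformer` defaults to `remainder='drop'`, so any feature not named in one of the column groups is silently discarded. The sketch below is a quick sanity check using the column names shown by `df.head()` and `df.info()` above; `uncovered_columns` is a hypothetical helper written for this illustration.

```python
# Feature columns from the outputs above, minus customerID and Churn.
feature_cols = [
    "gender", "SeniorCitizen", "Partner", "Dependents", "tenure",
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod",
    "MonthlyCharges", "TotalCharges",
]

# Column groups as defined earlier in the notebook.
binary_cols = ["gender", "Partner", "Dependents", "PhoneService", "PaperlessBilling"]
nominal_cols = ["MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup",
                "DeviceProtection", "TechSupport", "PaymentMethod"]
ordinal_cols = ["Contract"]
numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]


def uncovered_columns(all_cols, *groups):
    """Return the columns that appear in none of the groups (hypothetical helper)."""
    covered = {col for group in groups for col in group}
    return [col for col in all_cols if col not in covered]


dropped = uncovered_columns(feature_cols, binary_cols, nominal_cols,
                            ordinal_cols, numerical_cols)
print(dropped)
```

If the listed columns are meant to reach the model, they could be added to one of the groups, or `remainder='passthrough'` could be set on the `ColumnTransformer`.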

-  **Build the Preprocessing Pipeline and train the base model**:
  -  Combines preprocessing and modeling into one pipeline for reproducibility and ease of deployment.
  -  Logistic Regression is used as a baseline.

In [260]:
from sklearn.pipeline import Pipeline

# Define the pipeline function 
def build_pipeline(model):
    return Pipeline(steps=[
        ('preprocessor', preprocessor),  
        ('model', model)
    ])
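One practical benefit of bundling preprocessing and model into a single pipeline is that it can be cross-validated as a unit: the scaler and encoders are refit on each training fold, so no statistics leak from the held-out fold. This is a minimal sketch on synthetic data; the toy columns, the label rule, and the names `toy_pre` and `toy_pipe` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the Telco dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
y_toy = (toy["tenure"] < 24).astype(int)  # synthetic label, for illustration only

toy_pre = ColumnTransformer([
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
toy_pipe = Pipeline([("preprocessor", toy_pre), ("model", LogisticRegression())])

# Each fold refits the preprocessing on its own training portion.
scores = cross_val_score(toy_pipe, toy, y_toy, cv=5)
print(scores.mean())
```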

4. **Train a Model**:
   - Use a Logistic Regression model as a baseline.

In [261]:
from sklearn.linear_model import LogisticRegression

# Build the pipeline
pipeline = build_pipeline(LogisticRegression(random_state=42))

# Train the model
pipeline.fit(X_train, y_train_encoded)

5. **Evaluate the Model**:
   - Print the accuracy and a classification report (precision, recall, F1) on the test set to assess the model's performance.

In [262]:
from sklearn.metrics import classification_report, accuracy_score

# Evaluate the model
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test_encoded, y_pred))
print("Classification Report:\n", classification_report(y_test_encoded, y_pred))

Accuracy: 0.8197303051809794
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.90      0.88      1036
           1       0.68      0.61      0.64       373

    accuracy                           0.82      1409
   macro avg       0.77      0.75      0.76      1409
weighted avg       0.81      0.82      0.82      1409
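The precision and recall figures in the report are derived from the confusion matrix. The sketch below shows the arithmetic for the positive class on a small set of made-up labels; these are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration (1 = churn).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

precision = tp / (tp + fp)  # share of predicted churners who actually churned
recall = tp / (tp + fn)     # share of actual churners the model caught
print(tn, fp, fn, tp, precision, recall)
```

For a churn model, the recall of class 1 is often the number to watch, since a missed churner is a customer nobody tried to retain.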



### Alternative models

In [263]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, f1_score

In [264]:
# Define the pipeline function for alternative models
def build_pipeline(model):
    return Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])


In [265]:
# Define models to experiment with
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Classifier': SVC(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, eval_metric='logloss'), 
    'CatBoost': CatBoostClassifier(verbose=0, random_state=42)
}

In [266]:
# Train and evaluate models
for name, model in models.items():
    print(f"Training {name}...")
    pipeline = build_pipeline(model)
    pipeline.fit(X_train, y_train_encoded)  # Use encoded target variable
    predictions = pipeline.predict(X_test)  # Predictions will be numeric labels

    # Evaluation
    accuracy = accuracy_score(y_test_encoded, predictions)
    f1 = f1_score(y_test_encoded, predictions, average='weighted')
    print(f"{name} - Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}\n")

Training Logistic Regression...
Logistic Regression - Accuracy: 0.8197, F1 Score: 0.8163

Training Random Forest...
Random Forest - Accuracy: 0.7984, F1 Score: 0.7883

Training Gradient Boosting...
Gradient Boosting - Accuracy: 0.8105, F1 Score: 0.8025

Training K-Nearest Neighbors...
K-Nearest Neighbors - Accuracy: 0.7906, F1 Score: 0.7876

Training Support Vector Classifier...
Support Vector Classifier - Accuracy: 0.8112, F1 Score: 0.8017

Training Decision Tree...
Decision Tree - Accuracy: 0.7083, F1 Score: 0.7103

Training XGBoost...
XGBoost - Accuracy: 0.7928, F1 Score: 0.7852

Training CatBoost...
CatBoost - Accuracy: 0.8027, F1 Score: 0.7952
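The printed results are easier to compare when collected into a sorted table. The sketch below uses the scores reported above; in the training loop itself, the same rows could be appended to a list instead of printed.

```python
import pandas as pd

# (model, accuracy, weighted F1) as printed by the loop above.
results = [
    ("Logistic Regression", 0.8197, 0.8163),
    ("Random Forest", 0.7984, 0.7883),
    ("Gradient Boosting", 0.8105, 0.8025),
    ("K-Nearest Neighbors", 0.7906, 0.7876),
    ("Support Vector Classifier", 0.8112, 0.8017),
    ("Decision Tree", 0.7083, 0.7103),
    ("XGBoost", 0.7928, 0.7852),
    ("CatBoost", 0.8027, 0.7952),
]

# Sort by weighted F1, best model first.
summary = (
    pd.DataFrame(results, columns=["model", "accuracy", "f1_weighted"])
    .sort_values("f1_weighted", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

With default hyperparameters, the Logistic Regression baseline ranks first on both metrics for this split.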

