# Customer Churn

- [1 - Introduction](#Introduction)
    - [1.1 - Project Overview](#Project-Overview)
    - [1.2 - Problem Statement](#Problem-Statement)
    - [1.3 - Dataset Description](#Dataset-Description)

- [2 - Import Libraries](#Import-Libraries)

- [3 - Data Loading and Exploration](#Data-Loading-and-Exploration)
    - [3.1 - Load the Dataset](#Load-the-Dataset)
    - [3.2 - Display Basic Information](#Display-Basic-Information)
    - [3.3 - Show Summary Statistics](#Show-Summary-Statistics)
    - [3.4 - Visualize Key Features](#Visualize-Key-Features)

- [4 - Data Preprocessing](#Data-Preprocessing)
    - [4.1 - Handle Missing Values](#Handle-Missing-Values)
    - [4.2 - Encode Categorical Variables](#Encode-Categorical-Variables)
    - [4.3 - Feature Scaling/Normalization](#Feature-Scaling/Normalization)
    - [4.4 - Split Data into Features (X) and Target (y)](#Split-Data-into-Features-X-and-Target-y)

- [5 - Data Splitting](#Data-Splitting)
    - [5.1 - Split into Train, Validation, and Test Sets](#Split-into-Train-Validation-and-Test-Sets)
    - [5.2 - Convert to PyTorch Tensors](#Convert-to-PyTorch-Tensors)

- [6 - Model Definition](#Model-Definition)
    - [6.1 - Define the Logistic Regression Model using PyTorch](#Define-the-Logistic-Regression-Model-using-PyTorch)

- [7 - Model Training](#Model-Training)
    - [7.1 - Set Up Loss Function and Optimizer](#Set-Up-Loss-Function-and-Optimizer)
    - [7.2 - Training Loop](#Training-Loop)
    - [7.3 - Validation During Training](#Validation-During-Training)

- [8 - Model Evaluation](#Model-Evaluation)

- [9 - Conclusion and Future Work](#Conclusion-and-Future-Work)

- [10 - References](#References)



# [1 - Introduction](#Introduction)

## [1.1 - Project Overview](#Project-Overview)
The goal of this project is to develop a predictive model that identifies customers who are likely to churn. Customer churn, or customer attrition, is the loss of clients or customers. Predicting churn matters to businesses because it lets them act before a customer leaves, which supports retention, customer satisfaction, and profitability. Using customer account, service, and demographic data, we aim to build a machine learning model that predicts the likelihood of a customer leaving the service.

## [1.2 - Problem Statement](#Problem-Statement)
Customer churn is a significant problem for businesses, leading to a loss in revenue and increased costs for acquiring new customers. The challenge lies in identifying which customers are at risk of churning before they actually do. This project aims to address the following questions:

1. Can we build an accurate model to predict customer churn using historical customer data?
2. How can we interpret the model's predictions to provide actionable insights for the business to reduce churn rates?

By addressing these questions, we aim to provide a valuable tool for businesses to proactively manage customer relationships and improve retention strategies.

## [1.3 - Dataset Description](#Dataset-Description)
The dataset used in this project is sourced from Kaggle and pertains to a fictional telco company that provided home phone and Internet services to 7043 customers in California.

### Telco Customer Churn
Each row in the dataset represents a customer, and each column contains various attributes describing the customers, as detailed in the column metadata.

- **Number of Rows:** 7043 (customers)
- **Number of Columns:** 21 (including the customer ID and the target column)
- **Target Column:** "Churn"

### Data Composition
The dataset includes the following information:

- **Churn Information:**
  - Customers who left within the last month (indicated in the "Churn" column).

- **Services Signed Up:**
  - Phone service, multiple lines, internet service, online security, online backup, device protection, tech support, streaming TV, and streaming movies.

- **Customer Account Information:**
  - Duration of customer relationship, contract type, payment method, paperless billing, monthly charges, and total charges.

- **Demographic Information:**
  - Gender, age range, and whether the customer has a partner and dependents.



# [2 - Import Libraries](#Import-Libraries)

In this section, we import the libraries needed for data manipulation, visualization, and building a machine learning model with PyTorch.


In [1]:
# Basic libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# PyTorch libraries for building and training the model
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split

# Sklearn for data preprocessing and evaluation
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

# [3 - Data Loading and Exploration](#Data-Loading-and-Exploration)

## [3.1 - Load the Dataset](#Load-the-Dataset)

In this section, we will load the Telco Customer Churn dataset into a pandas DataFrame for further exploration and analysis.

In [2]:
# Load the dataset into a pandas DataFrame
data_path = './WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset to verify loading
df.head()

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
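
Beyond eyeballing the first rows, a couple of programmatic checks can confirm that the file loaded as expected. The following is a minimal sketch, not part of the original notebook: `load_and_check` is a hypothetical helper, and the in-memory CSV is a small stand-in that reuses a few columns and values from the preview so the example runs without the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; only a few of the 21 columns are included.
sample_csv = io.StringIO(
    "customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "7590-VHVEG,Female,1,29.85,29.85,No\n"
    "5575-GNVDE,Male,34,56.95,1889.5,No\n"
    "3668-QPYBK,Male,2,53.85,108.15,Yes\n"
)


def load_and_check(source, target="Churn", id_column="customerID"):
    """Hypothetical helper: read the CSV and run basic sanity checks."""
    frame = pd.read_csv(source)
    # The target column must be present for any later modelling step.
    if target not in frame.columns:
        raise ValueError(f"expected target column {target!r} is missing")
    # Each row is supposed to describe one distinct customer.
    if frame[id_column].duplicated().any():
        raise ValueError(f"{id_column} contains duplicate values")
    return frame


sample_df = load_and_check(sample_csv)
print(sample_df.shape)
```

With the real file, the same helper could be called as `load_and_check(data_path)`, and the resulting shape compared against the 7043 rows and 21 columns given in the dataset description.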


## [3.2 - Display Basic Information](#Display-Basic-Information)

In this section, we will display basic information about the dataset to understand its structure and contents.


In [3]:
# Display the basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
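
The `df.info()` listing can be complemented with a compact per-column summary of dtype, missing values, and distinct values. The sketch below is an illustration under assumptions: it uses a toy frame built from a few columns and values in the preview rather than the full dataset, and the names `toy`, `summary`, and `non_numeric_columns` are introduced here for the example.

```python
import pandas as pd

# Toy frame using a few columns and values from the preview rows.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 0],
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# One row per column: storage type, number of nulls, number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),
})
print(summary)

# Non-numeric columns are the candidates for categorical encoding later on.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

Columns with only a handful of distinct values are likely categorical, while a numeric-looking column reported with a non-numeric dtype may deserve a closer look before preprocessing.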


## [3.3 - Show Summary Statistics](#Show-Summary-Statistics)

In this section, we present the percentage distribution of each categorical customer attribute in the dataset.
These distributions describe the characteristics and demographics of the customer base and show which types of customers are represented.


In [7]:
# Prepare data for plotting
categories = [
    "Gender", "Senior Citizen", "Partner", "Dependents",
    "Phone Service", "Multiple Lines", "Internet Service", 
    "Online Security", "Device Protection", "Tech Support", 
    "Streaming TV", "Streaming Movies", "Contract Type", 
    "Paperless Billing", "Payment Method"
]

# Create a dictionary to store distributions
distributions = {
    "Gender": df['gender'].value_counts(normalize=True) * 100,
    "Senior Citizen": df['SeniorCitizen'].replace({1: 'Yes', 0: 'No'}).value_counts(normalize=True) * 100,
    "Partner": df['Partner'].value_counts(normalize=True) * 100,
    "Dependents": df['Dependents'].value_counts(normalize=True) * 100,
    "Phone Service": df['PhoneService'].value_counts(normalize=True) * 100,
    "Multiple Lines": df['MultipleLines'].value_counts(normalize=True) * 100,
    "Internet Service": df['InternetService'].value_counts(normalize=True) * 100,
    "Online Security": df['OnlineSecurity'].value_counts(normalize=True) * 100,
    "Device Protection": df['DeviceProtection'].value_counts(normalize=True) * 100,
    "Tech Support": df['TechSupport'].value_counts(normalize=True) * 100,
    "Streaming TV": df['StreamingTV'].value_counts(normalize=True) * 100,
    "Streaming Movies": df['StreamingMovies'].value_counts(normalize=True) * 100,
    "Contract Type": df['Contract'].value_counts(normalize=True) * 100,
    "Paperless Billing": df['PaperlessBilling'].value_counts(normalize=True) * 100,
    "Payment Method": df['PaymentMethod'].value_counts(normalize=True) * 100,
}

# Plot the distributions
fig, axs = plt.subplots(8, 2, figsize=(15, 30))
axs = axs.flatten()

for i, category in enumerate(categories):
    distribution = distributions[category]
    axs[i].barh(distribution.index, distribution.values, color='skyblue')
    axs[i].set_title(category)
    axs[i].set_xlim(0, 100)
    for j in range(len(distribution)):
        axs[i].text(distribution.values[j] + 1, j, f"{distribution.values[j]:.2f}%", va='center')

# Hide the last subplot if categories are odd in number
if len(categories) % 2 != 0:
    axs[-1].axis('off')

plt.tight_layout()
plt.show()

Gender:
Male - 50.48%
Female - 49.52%

Senior Citizen:
Senior - 16.21%
Non-Senior - 83.79%

Partner:
Yes - 48.30%
No - 51.70%

Dependents:
Yes - 29.96%
No - 70.04%

Phone Service:
Yes - 90.32%
No - 9.68%

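Multiple Lines:
Yes - 42.18%
No - 48.13%
No phone service - 9.68%

Internet Service:
DSL - 34.37%
Fiber optic - 43.96%
No - 21.67%

Online Security:
Yes - 28.67%
No - 49.67%
No internet service - 21.67%

Device Protection:
Yes - 34.39%
No - 43.94%
No internet service - 21.67%

Tech Support:
Yes - 29.02%
No - 49.31%
No internet service - 21.67%

Streaming TV:
Yes - 38.44%
No - 39.90%
No internet service - 21.67%

Streaming Movies:
Yes - 38.79%
No - 39.54%
No internet service - 21.67%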

Contract Type:
Month-to-month - 55.02%
One year - 20.91%
Two year - 24.07%

Paperless Billing:
Yes - 59.22%
No - 40.78%

Payment Method:
Electronic check - 33.58%
Mailed check - 22.89%
Bank transfer (automatic) - 21.92%
Credit card (automatic) - 21.61%
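
The per-column percentages can also be computed with a small helper instead of a hand-written dictionary. This is a sketch only: `percentage_distributions` is a hypothetical function, and the toy values below are illustrative, not taken from the real dataset.

```python
import pandas as pd

# Toy data; values are illustrative only, not taken from the real dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "PaperlessBilling": ["Yes", "Yes", "Yes", "No"],
})


def percentage_distributions(frame, columns):
    """Return {column: Series of category percentages} for the given columns."""
    return {
        col: frame[col].value_counts(normalize=True).mul(100).round(2)
        for col in columns
    }


dists = percentage_distributions(toy, ["Contract", "PaperlessBilling"])

# Print each distribution in the same "label - xx.xx%" style as the summary.
for name, dist in dists.items():
    print(f"{name}:")
    for label, pct in dist.items():
        print(f"{label} - {pct:.2f}%")
    print()
```

Because `value_counts` reports every category present in a column, values such as "No phone service" or "No internet service" appear automatically, so each column's percentages add up to 100 (up to rounding).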

