## STOCKHOLM TEAM

# Classification -- Predicting Customer Churn

# Business Understanding

Customer attrition is one of the biggest costs an organization faces. Customer churn, also known as customer attrition or customer turnover, is the percentage of customers who stopped using a company's product or service within a specified timeframe.
For instance, if you began the year with 500 customers and ended it with 480, the percentage of customers that left would be 4% (20 out of 500). If we could determine, with reasonable accuracy, why and when a customer leaves, the organization could plan its retention initiatives far more effectively.
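
The arithmetic behind that 4% figure can be written as a small helper. This is a minimal sketch; the function name `churn_rate` is illustrative and not part of the project code, and it assumes no new customers joined during the period.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return churn as a percentage of the customers present at the start of the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100 * customers_lost / customers_at_start


# The example above: 500 customers at the start, 480 at the end, so 20 were lost
rate = churn_rate(500, 500 - 480)
print(f"{rate:.1f}%")  # prints "4.0%"
```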

In this project, we aim to find the likelihood of a customer leaving the organization, the key indicators of churn as well as the retention strategies that can be implemented to avert this problem.

# The project aims to achieve the following objectives:

**Data Understanding and Preprocessing:** Gain a comprehensive understanding of the provided data, ensuring its quality and suitability for building classification models. Cleanse and preprocess the data to address any issues and prepare it for analysis.

**Exploratory Data Analysis (EDA):** Conduct in-depth exploratory data analysis to unveil insights into customer churn patterns. Identify distribution patterns, correlations, and potential data challenges that may impact the modeling process.

**Feature Engineering:** Select relevant features from the dataset and engineer new features to enhance the model's predictive capabilities. Transform the data into a format that maximizes the model's accuracy and performance.

**Model Selection and Evaluation:** Evaluate multiple classification algorithms, including Logistic Regression, Random Forest, Support Vector Machines, and Gradient Boosting, to identify the most suitable model for predicting customer churn. Use appropriate evaluation metrics to compare model performance.

**Model Deployment:** Deploy the final classification model to enable real-time predictions on new customer data. Develop a user-friendly interface or API for seamless integration with the company's existing systems.

Upon successful completion of the project, the company will have a well-documented, deployed machine learning model capable of predicting customer churn in real time. This predictive tool will help the company take proactive measures to retain valuable customers and make informed business decisions.

## Data Understanding

The data for this project comes from three sources: a SQL Server table, an Excel file, and a CSV file. The following describes the columns present in the data.

**Gender --** Whether the customer is male or female

**SeniorCitizen --** Whether a customer is a senior citizen or not

**Partner --** Whether the customer has a partner or not (Yes, No)

**Dependents --** Whether the customer has dependents or not (Yes, No)

**Tenure --** Number of months the customer has stayed with the company

**PhoneService --** Whether the customer has a phone service or not (Yes, No)

**MultipleLines --** Whether the customer has multiple lines or not

**InternetService --** Customer's internet service provider (DSL, Fiber Optic, No)

**OnlineSecurity --** Whether the customer has online security or not (Yes, No, No internet service)

**OnlineBackup --** Whether the customer has online backup or not (Yes, No, No internet service)

**DeviceProtection --** Whether the customer has device protection or not (Yes, No, No internet service)

**TechSupport --** Whether the customer has tech support or not (Yes, No, No internet service)

**StreamingTV --** Whether the customer has streaming TV or not (Yes, No, No internet service)

**StreamingMovies --** Whether the customer has streaming movies or not (Yes, No, No internet service)

**Contract --** The contract term of the customer (Month-to-Month, One year, Two year)

**PaperlessBilling --** Whether the customer has paperless billing or not (Yes, No)

**PaymentMethod --** The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

**MonthlyCharges --** The amount charged to the customer monthly

**TotalCharges --** The total amount charged to the customer

**Churn --** Whether the customer churned or not (Yes or No)

# Hypothesis

## Null Hypothesis (H0): 
### There is no relationship between tenure and churn: customers with shorter tenure are no more likely to churn than customers with longer tenure.

## Alternative Hypothesis (HA): 
### Customers with shorter tenure are more likely to churn than customers with longer tenure.
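
One possible way to test whether tenure and churn are related is a Pearson chi-square test of independence on a 2x2 table of tenure group against churn outcome. The sketch below is an assumption about how this could be done, not the project's chosen method: the counts are made up, the split into "short" and "long" tenure is hypothetical, and the helper `chi_square_2x2` is illustrative. The test detects association in either direction, so the direction still has to be read from the observed churn rates.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: rows = tenure group (short, long), columns = (churned, stayed)
toy_table = [[60, 40], [20, 80]]
chi2, p_value = chi_square_2x2(toy_table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

A small p-value (for example, below 0.05) would be evidence against the null hypothesis of no relationship.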

## Exploratory Data Analysis

# Questions

1. What is the distribution of customers by gender?

2. What is the distribution of customers based on their SeniorCitizen status?

3. What is the distribution of customers based on their tenure with the company?

4. Is there any correlation between MonthlyCharges and TotalCharges?

5. What percentage of customers have churned (Yes) versus those who haven't (No)?

6. Which payment method is preferred by most customers?

7. What percentage of customers are on month-to-month, one-year, or two-year contracts?

8. What is the churn rate for customers grouped by their tenure?

9. What are the average monthly charges for each tenure group?

10. What is the distribution of customers based on their payment methods (Electronic check, Mailed check, Bank transfer, Credit card)?

11. What is the distribution of the total charges incurred by customers?
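
As an illustration of how question 8 might be approached, the sketch below bins tenure into groups and computes the churn rate per group with pandas. It runs on a tiny made-up DataFrame; the column names `tenure` and `Churn`, the "Yes"/"No" coding, and the bin edges are all assumptions that would need to be adjusted to the real data.

```python
import pandas as pd

# Made-up rows standing in for the real customer data
toy = pd.DataFrame({
    "tenure": [1, 5, 10, 14, 30, 40, 60, 70],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Hypothetical tenure bins in months; include_lowest keeps tenure == 0 in the first bin
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
toy["tenure_group"] = pd.cut(toy["tenure"], bins=bins, labels=labels, include_lowest=True)

# Share of churned customers in each tenure group, as a percentage
churn_rate_by_group = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("tenure_group", observed=False)["churned"]
       .mean()
       .mul(100)
)
print(churn_rate_by_group.round(1))
```

The same `tenure_group` column could be reused for question 9 by averaging `MonthlyCharges` instead of the churn flag.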


# Setup

# Installation

In [None]:
%pip install pandas
%pip install pyodbc  
%pip install python-dotenv 
%pip install openpyxl
%pip install seaborn
%pip install matplotlib

# Importation

In [2]:
import pyodbc  # ODBC interface used to connect to SQL Server
from dotenv import dotenv_values  # reads key-value pairs from a .env file into a dictionary
import pandas as pd
import warnings

# Suppress all warnings to keep the notebook output clean
warnings.filterwarnings('ignore')

# Data Loading

In [3]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")



connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
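
If a key is missing from the `.env` file, `.get()` returns `None` and the connection string silently contains the text "None". A small guard can catch this before connecting. The helper below is a hypothetical sketch, not part of the original notebook; it assumes the same four keys used above and is demonstrated with dummy values rather than real credentials.

```python
def build_connection_string(env):
    """Build the SQL Server connection string, failing early if a credential is missing."""
    required = ("SERVER", "DATABASE", "USERNAME", "PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(f"Missing .env keys: {', '.join(missing)}")
    return (
        f"DRIVER={{SQL Server}};SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['USERNAME']};PWD={env['PASSWORD']}"
    )


# Dummy values for illustration only; in the notebook this would be the dotenv_values('.env') dictionary
example_env = {"SERVER": "my-server", "DATABASE": "my-db", "USERNAME": "user", "PASSWORD": "secret"}
example_connection_string = build_connection_string(example_env)
```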

In [None]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)

In [None]:
query = "Select * from dbo.LP2_Telco_churn_first_3000"
df1 = pd.read_sql(query, connection)

In [None]:
# pd.read_sql already returns a DataFrame, so this re-wrap is redundant but harmless
df1 = pd.DataFrame(df1)

In [None]:
df1.head()

## Test dataset.

In [None]:
excel_file = 'Telco-churn-second-2000.xlsx'

# Read the Excel file into a Pandas DataFrame
df2 = pd.read_excel(excel_file, engine='openpyxl')

# Save the DataFrame as a CSV file
df2.to_csv('data1.csv', index=False)

In [None]:
df2.head()

In [None]:
df3 = pd.read_csv('LP2_Telco-churn-last-2000.csv')

In [None]:
df3.head()
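
Because the data arrives from three different sources, it may be worth checking that the column names line up before any later step compares or combines the frames. The helper below is a hypothetical sketch shown on toy frames; in the notebook it could be called with `{"df1": df1, "df2": df2, "df3": df3}`.

```python
import pandas as pd


def column_differences(frames):
    """Map each frame name to the columns it lacks relative to the union of all columns."""
    all_columns = set().union(*(frame.columns for frame in frames.values()))
    return {
        name: sorted(all_columns - set(frame.columns))
        for name, frame in frames.items()
    }


# Toy frames standing in for the real data
toy_frames = {
    "a": pd.DataFrame({"tenure": [1], "Churn": ["No"]}),
    "b": pd.DataFrame({"tenure": [2]}),
}
differences = column_differences(toy_frames)
print(differences)  # prints "{'a': [], 'b': ['Churn']}"
```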

# Exploratory Data Analysis: EDA