### Project Title
Classification Project: Customer Churn Prediction for a Telecommunications Company

### Problem Statement
This project seeks to estimate the likelihood of customer turnover at a telecommunications company. It analyzes the possible causes of churn, predicts whether a customer will churn in the future, and suggests strategies to retain customers.

### Introduction
Customer churn, also known as customer turnover, is the fraction of customers that stop patronizing a company's products or services during a specific period.
This is a classification project that will help a telecom company understand its data, identify what is being done right or wrong, predict the likelihood of customers churning, and make better-informed decisions for its development. We will build a machine learning model to predict whether a customer is likely to churn.

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import re

# Graphs and visualizations
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# impute missing values
from sklearn.impute import SimpleImputer
# Dataset splitting
from sklearn.model_selection import train_test_split

# For feature encoding
from sklearn.preprocessing import OneHotEncoder

# For feature scaling
from sklearn.preprocessing import StandardScaler

# For class imbalance
from imblearn import over_sampling

# For Machine learning modelling
#from lazypredict.Supervised import LazyClassifier
from sklearn.metrics import recall_score

### Loading Data set

In [None]:
df = pd.read_csv(r"C:\Users\Nathaniel Havim\Desktop\Azubi_Projects\LP3_DAP\Telco-Customer-Churn.csv")

### Data Overview

In [None]:
df.head()

In [None]:
# the size of the data set
df.shape

In [None]:
#concise summary of a DataFrame.
df.info()

### Identified Issue(s) with the data set

1. TotalCharges is in the wrong datatype (object)
2. The TotalCharges column contains non-numeric characters

### How to handle the identified issue(s)

1. TotalCharges will be converted to the float data type
2. We will replace non-numeric characters with the appropriate values
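
A minimal sketch of the conversion described above, run on a small made-up Series rather than the real dataset. The sample values and the names `raw` and `cleaned` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: numeric strings mixed with blank entries
raw = pd.Series(["29.85", "1889.5", " ", "108.15", ""])

# errors="coerce" turns anything that cannot be parsed into NaN
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned.dtype)         # float dtype after conversion
print(cleaned.isna().sum())  # number of entries that were not numeric
```

The NaN values produced this way can then be counted with `isnull().sum()` and either dropped or imputed.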

##### Checking for unique values in the TotalCharges column

In [None]:
# Checking for unique values in TotalCharges column
df.TotalCharges.unique()

In [None]:
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')
df.isnull().sum()

In [None]:
df[df['tenure'] == 0].index

In [None]:
df.TotalCharges.describe()

In [None]:
df.drop(labels=df[df['tenure'] == 0].index, axis=0, inplace=True)
df[df['tenure'] == 0].index

In [None]:
# description of the data set
df.describe()

In [None]:
# Detect missing values
df.isna().sum().sum()

In [None]:
# Detect duplicate values
df.duplicated().sum()

In [None]:
target = df.Churn.copy()
target_label = 'Churn'

In [None]:
cats = [col for col in df.columns if (df[col].dtype == 'object')& (col not in ['customerID'])]
nums = [col for col in df.columns if df[col].dtype != 'object']
print(cats)
print(nums)

In [None]:
df[cats].describe()

In [None]:
df[nums].describe()

### Hypotheses
*H1*

Null Hypothesis (H0): There is no relationship between tenure and monthly charges.
 
Alternative Hypothesis (H1): There is a relationship between tenure and monthly charges.

*H2*

Null Hypothesis (H0): Customer satisfaction has no direct effect on customer churn.

Alternative Hypothesis (H1): Customer satisfaction has a direct effect on customer churn.
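
One possible way to test the first hypothesis is a rank correlation test. The sketch below uses Spearman's test from SciPy on synthetic numbers. The choice of test, the 0.05 significance level and the generated data are all assumptions, not part of the original analysis.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-ins for the tenure and MonthlyCharges columns
rng = np.random.default_rng(42)
tenure = rng.integers(1, 73, size=200)
monthly_charges = rng.uniform(18, 120, size=200)

# Spearman rank correlation and the p-value for H0: no monotonic relationship
rho, p_value = spearmanr(tenure, monthly_charges)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"rho={rho:.3f}, p={p_value:.3f}: {decision}")
```

On the real data, `tenure` and `monthly_charges` would be replaced by `df['tenure']` and `df['MonthlyCharges']`.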

### Questions
1. Does tenure affect a customer's monthly charges?

2. Are customers with dependents likely to have multiple lines?

3. Which service generates the most revenue for the telco?

4. How many customers have access to internet service?

5. What is the churn rate of the telco?

6. What percentage of customers do not have network access?

7. Does having dependents affect monthly charges?
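
As an illustration of how a question such as the churn rate could be answered, here is a short sketch on a made-up `Churn` column. The eight values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the Churn column
churn = pd.Series(["No", "Yes", "No", "No", "Yes", "No", "No", "No"], name="Churn")

# Share of customers in each class; the "Yes" share is the churn rate
shares = churn.value_counts(normalize=True)
churn_rate = shares["Yes"]
print(f"Churn rate: {churn_rate:.1%}")
```

On the real data the same call would be `df['Churn'].value_counts(normalize=True)`.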

## Exploratory Data Analysis

This section explores the data visually using descriptive methods, including univariate, bivariate and multivariate analysis.

### Univariate Analysis

In [None]:
for col in cats:
    print(col)
    sns.countplot(data=df, x=col)
    plt.title(col)
    plt.tight_layout()
    plt.show()

In [None]:
for col in nums:
    print(col)
    sns.histplot(data=df, x=col, kde = True)
    plt.title(col)
    plt.tight_layout()
    plt.show()

In [None]:
for col in nums:
    print(col)
    sns.boxplot(data=df, x=col)
    plt.title(col)
    plt.tight_layout()
    plt.show()

In [None]:
for col in nums:
    print(col)
    sns.kdeplot(data=df, x=col, hue='Churn', fill=True)
    plt.title(col)
    plt.tight_layout()
    plt.show()

In [None]:
for col in nums:
    print(col)
    sns.boxplot(data=df, x=col, y='Churn')
    plt.title(col)
    plt.tight_layout()
    plt.show()

### Bivariate Analysis
This section examines the relationship between pairs of variables to answer different questions.

In [None]:
sns.catplot(data=df, x='SeniorCitizen',y='MonthlyCharges', kind='bar')

**tenure vs MonthlyCharges**

In [None]:
fig = px.density_heatmap(data_frame=df, x='MonthlyCharges',y='tenure', color_continuous_scale='PuBu')
fig.show()

**Interpret graph**
The graph shows that when tenure was 0-9 and monthly charges were 70-79.99, the telco had 348 customers, whereas when tenure was 70-79 and monthly charges were 110-119.99, it had 100. This suggests that monthly charges are not based on how long a customer has been with the telco.

In [None]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="Churn", color="Dependents", barmode="group", title="<b>Dependents distribution</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

**Dependents vs MultipleLines**

In [None]:
dep_mtline=df.groupby(['Dependents']).count()['MultipleLines'].sort_values(ascending=False).reset_index()

In [None]:
fig=px.bar(dep_mtline, x='Dependents', y='MultipleLines', title="Customers with Dependents and MultipleLines")
fig.update_layout(width=500, height=500, bargap=0.1)
fig.show()

**Interpret graph**
The MultipleLines counts show 3390 customers without multiple lines, 2971 with multiple lines, and 682 with no phone service. Based on this, we conclude that customers with dependents are not more likely to have multiple lines.
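
To compare customers with and without dependents directly, a row-normalised cross-tabulation is one option. The sketch below uses a small hypothetical frame, so the printed shares say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows standing in for df[['Dependents', 'MultipleLines']]
toy = pd.DataFrame({
    "Dependents":    ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
    "MultipleLines": ["Yes", "No", "No", "Yes", "No phone service", "No", "Yes", "No"],
})

# Share of each MultipleLines value within each Dependents group (rows sum to 1)
shares = pd.crosstab(toy["Dependents"], toy["MultipleLines"], normalize="index")
print(shares.round(2))
```

If the "Yes" share of MultipleLines is similar in both rows, dependents are unlikely to be a strong driver of multiple lines.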

In [None]:
g_labels = ['Male', 'Female']
c_labels = ['No', 'Yes']
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=g_labels, values=df['gender'].value_counts(), name="Gender"),
              1, 1)
fig.add_trace(go.Pie(labels=c_labels, values=df['Churn'].value_counts(), name="Churn"),
              1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name", textfont_size=16)

fig.update_layout(
    title_text="Gender and Churn Distributions",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Gender', x=0.16, y=0.5, font_size=20, showarrow=False),
                 dict(text='Churn', x=0.84, y=0.5, font_size=20, showarrow=False)])
fig.show()

**Question. What are the total charges of the telco by gender?**

In [None]:
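# Sum TotalCharges per gender and keep gender as a regular column for plotting
genderTotalcharges = df.groupby('gender')['TotalCharges'].sum().reset_index()

In [None]:
color_map = {"Female": 'aliceblue', "Male": 'bisque'}
fig = px.bar(genderTotalcharges, x='gender', y='TotalCharges', color='gender',
             color_discrete_map=color_map, title='Total Revenue by gender')
fig.update_layout(width=500, height=500, bargap=0.1)
fig.show()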

In [None]:
color_map = {"Yes": 'aliceblue', "No": 'bisque'}
fig = px.histogram(df, x="Churn", color="Partner", barmode="group", title="<b>Churn distribution vrs Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=500, height=500, bargap=0.1)
fig.show()

In [None]:
color_map = {"Yes": "aqua", "No": "azure"}
fig = px.histogram(df, x="Churn", color="OnlineSecurity", barmode="group", title="<b>Churn vrs Online Security</b>", color_discrete_map=color_map)
fig.update_layout(width=800, height=500, bargap=0.1)
fig.show()

In [None]:
color_map = {"Yes": 'lavenderblush', "No": 'lavender'}
fig = px.histogram(df, x="Churn", color="PaperlessBilling",  title="<b>Churn distribution vrs Paperless Billing</b>", color_discrete_map=color_map)
fig.update_layout(width=500, height=500, bargap=0.1)
fig.show()

In [None]:
color_map = {"Yes": 'cadetblue', "No": 'cyan'}
fig = px.histogram(df, x="Churn", color="PhoneService", title="<b>Churn distribution w.r.t. Phone Service</b>", color_discrete_map=color_map)
fig.update_layout(width=500, height=500, bargap=0.1)
fig.show()

### Multivariate Analysis

This section examines the relationships that exist among three or more variables in the data set.

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(
  x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
       ["Female", "Male", "Female", "Male"]],
  y = [965, 992, 219, 240],
  name = 'DSL',
))

fig.add_trace(go.Bar(
  x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
       ["Female", "Male", "Female", "Male"]],
  y = [889, 910, 664, 633],
  name = 'Fiber optic',
))

fig.add_trace(go.Bar(
  x = [['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
       ["Female", "Male", "Female", "Male"]],
  y = [690, 717, 56, 57],
  name = 'No Internet',
))

fig.update_layout(title_text="<b>Churn Distribution with Internet Service and Gender</b>")

fig.show()
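
The bar heights above are typed in by hand. A hedged alternative is to compute them with a `groupby`, sketched here on a six-row hypothetical frame that reuses the column names `Churn`, `gender` and `InternetService`.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use df instead of toy
toy = pd.DataFrame({
    "Churn": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "gender": ["Female", "Male", "Female", "Male", "Female", "Female"],
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic", "No", "Fiber optic"],
})

# One row per observed (InternetService, Churn, gender) combination with its count
counts = (
    toy.groupby(["InternetService", "Churn", "gender"])
       .size()
       .reset_index(name="count")
)
print(counts)
```

Computing the counts this way keeps the chart in step with the data after rows are dropped or cleaned.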

In [None]:
df["Churn"][df["Churn"]=="No"].groupby(by=df["gender"]).count()

In [None]:
df["Churn"][df["Churn"]=="Yes"].groupby(by=df["gender"]).count()

In [None]:
# Take labels from value_counts() so each label lines up with its count
values = df['PaymentMethod'].value_counts()
labels = values.index

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(title_text="<b>Payment Method Distribution</b>")
fig.show()

In [None]:
plt.figure(figsize=(12,12))
# Correlate the numeric columns only; corr() cannot handle string columns
ax = sns.heatmap(df[nums].corr(method='kendall'), annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')
plt.tight_layout()
plt.show()

### Feature Processing & Engineering
This section cleans and processes the dataset and creates new features.

### Impute missing values

In [None]:
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())

## Feature Creation

#### Arrange Data into Feature and target

In [None]:
# Take other columns as features and churn column as target but ignore the customerID column
features = [col for col in df.columns if col not in ['customerID', target_label]]

In [None]:
df[target_label]

In [None]:
df[features]

In [None]:
# Assign x and y to features and target respectively
X = df.loc[:,features]
y = df.loc[:,target_label]

### Dataset Splitting

In [None]:
# split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Features Encoding

**ONEHOT ENCODING**

In [None]:
# list the columns to apply OneHotEncoding on
categorical = ['gender', 'Partner', 'Dependents', 'PhoneService', 
            'MultipleLines', 'InternetService', 'OnlineSecurity',
            'OnlineBackup','DeviceProtection','TechSupport','StreamingTV',
           'StreamingMovies','Contract','PaperlessBilling','PaymentMethod']

#### Data Transformation (Train Data)

In [None]:
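# Fit the encoder on the training categorical columns only
onehotenc = OneHotEncoder(handle_unknown='ignore').fit(X_train[categorical])
onehot_cols = onehotenc.get_feature_names_out(categorical)
enc_train = pd.DataFrame(onehotenc.transform(X_train[categorical]).toarray(),
                         columns=onehot_cols, index=X_train.index)
# Replace the raw categorical columns with their one-hot encoded versions
X_train = pd.concat([X_train.drop(columns=categorical), enc_train], axis=1)

In [None]:
enc_train

**Data Transformation (Test Data)**

In [None]:
# Reuse the encoder fitted on the training data; do not refit on the test set
enctest = pd.DataFrame(onehotenc.transform(X_test[categorical]).toarray(),
                       columns=onehot_cols, index=X_test.index)
X_test = pd.concat([X_test.drop(columns=categorical), enctest], axis=1)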

In [None]:
enctest

### Features Scaling - Train Data

In [None]:
sc = StandardScaler()

In [None]:
# Fit the scaler on the numeric training columns only
X_train[nums] = sc.fit_transform(X_train[nums])
print("\nAfter Standardisation : \n", X_train[nums].head())

### Features Scaling - Test Data

In [None]:
# Apply the training-set mean and standard deviation to the test data
X_test[nums] = sc.transform(X_test[nums])
X_test[:5]
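
A small sketch of why the scaler is fitted once on the training data and only applied to the test data. The two arrays are made-up numbers standing in for numeric columns such as tenure and MonthlyCharges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: column 0 ~ tenure, column 1 ~ MonthlyCharges
X_tr = np.array([[1.0, 20.0], [12.0, 50.0], [48.0, 110.0], [72.0, 90.0]])
X_te = np.array([[5.0, 30.0], [60.0, 100.0]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows
X_te_scaled = scaler.transform(X_te)      # reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # training columns are centred at 0
print(X_te_scaled)
```

Refitting on the test set would scale it with different statistics, so the model would see test features on a different scale from the one it was trained on.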

### Optional: Train set Balancing (for Classification only)

**CLASS IMBALANCE**

In [None]:
# Use Over-sampling/Under-sampling methods, more details here: https://imbalanced-learn.org/stable/install.htm
pd.Series(y_train).value_counts()

In [None]:
X_over_SMOTE, y_over_SMOTE = over_sampling.SMOTE(random_state=42, sampling_strategy=0.6).fit_resample(X_train, y_train)

In [None]:
print('SMOTE')
print(pd.Series(y_over_SMOTE).value_counts())

In [None]:
X_train, y_train = X_over_SMOTE, y_over_SMOTE
X_train.shape, y_train.shape
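
For a binary target, imbalanced-learn documents a float `sampling_strategy` as the desired ratio of minority to majority samples after resampling. The arithmetic can be sketched without the library. The helper `smote_targets` and the class counts below are hypothetical.

```python
def smote_targets(n_majority, n_minority, sampling_strategy):
    """Return (minority count after resampling, synthetic samples to create)."""
    target_minority = round(n_majority * sampling_strategy)
    return target_minority, max(target_minority - n_minority, 0)

# Hypothetical class counts, not taken from the dataset
after, synthetic = smote_targets(n_majority=4000, n_minority=1500, sampling_strategy=0.6)
print(after, synthetic)
```

With `sampling_strategy=0.6`, the minority class is therefore grown to about 60% of the majority class rather than to full parity.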

**FEATURE IMPORTANCE**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

In [None]:
rfc = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
rfc.fit(X_train, y_train)

In [None]:
bota = BorutaPy(rfc, n_estimators='auto', random_state=42)
bota.fit(np.array(X_train), np.array(y_train))
bota_ranking = bota.ranking_

In [None]:
plt.figure(figsize=(8,6))
sns.barplot(y=[col for col in X_train.columns.values], x=bota_ranking, hue=bota_ranking)

In [None]:
selected_features = {}
for i, col in enumerate(X_train.columns):
    if bota_ranking[i] <=2:
        selected_features[col] = bota_ranking[i]
selected_features

In [None]:
features = [k for k in selected_features.keys()]

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
# Keep only the one-hot encoded columns that survived the Boruta selection
features_cat = [col for col in onehotenc.get_feature_names_out(categorical) if col in features]
select = SelectKBest(score_func=chi2, k='all')
selector = select.fit(X_train[features_cat], y_train)

In [None]:
scores = pd.DataFrame(features_cat)
scores['score'] = selector.scores_
scores = scores.sort_values('score', ascending=False)

In [None]:
sns.barplot(data=scores, x='score', y=0)
plt.title('Chi-Squared Test Statistic')
plt.ylabel('')
plt.tight_layout()
plt.show()

In [None]:
filtered_score = scores[scores['score']<=200]
filtered_score[0].values

In [None]:
features_1 = [col for col in features if col not in filtered_score[0].values]
features_1

**MULTICOLLINEARITY**

In [None]:
temper = X_train[features_1].copy()
# Encode the target as 0/1 so it can take part in the correlation matrix
temper[target_label] = (y_train == 'Yes').astype(int)
corr = temper.corr(method='pearson')

In [None]:
plt.figure(figsize=(14,12))
ax = sns.heatmap(corr, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')
plt.tight_layout()
plt.show()

In [None]:
# Every entry needs a trailing comma; adjacent string literals are silently concatenated
drop_columns = ['OnlineSecurity_No internet service',
                'OnlineBackup_No internet service',
                'DeviceProtection_No internet service',
                'TechSupport_No internet service',
                'StreamingTV_No internet service',
                'StreamingMovies_No internet service',
                'tenure', 'TotalCharges', 'MonthlyCharges',
]

features_2 = [f for f in features_1 if f not in drop_columns]

In [None]:
temper = X_train[features_2].copy()
# Encode the target as 0/1 so it can take part in the correlation matrix
temper[target_label] = (y_train == 'Yes').astype(int)
corr = temper.corr(method='pearson')

In [None]:
plt.figure(figsize=(12,10))
ax = sns.heatmap(corr, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='Blues')
plt.tight_layout()
plt.show()

In [None]:
drop_columns = ['InternetService_Fiber optic', 'InternetService_No', 'YearlyCharges']

In [None]:
features_3 = [f for f in features_2 if f not in drop_columns]

In [None]:
temp = X_train[features_3].copy()
# Encode the target as 0/1 so it can take part in the correlation matrix
temp[target_label] = (y_train == 'Yes').astype(int)
corr = temp.corr(method='pearson')

In [None]:
plt.figure(figsize=(10,10))
ax=sns.heatmap(corr, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')
plt.tight_layout()
plt.show()

In [None]:
feat = features_3

### Machine Learning Modeling

##### Simple Model #001
Please keep the following structure to try all the models you want.

In [None]:
# clf = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=recall_score)
# models, predictions = clf.fit(X_train[feat], X_test[feat], y_train, y_test)
# print(models)

In [None]:
X_train

### Another model?

In [None]:
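# Candidate: XGBClassifier (xgboost)

In [None]:
# Candidate: KNeighborsClassifier (sklearn.neighbors)

In [None]:
# Candidate: RandomForestClassifier (sklearn.ensemble)

In [None]:
# Candidate: DecisionTreeClassifier (sklearn.tree)

In [None]:
# Candidate: BaggingClassifier (sklearn.ensemble)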

##### Create the Model

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

##### Train the Model

In [None]:
clf = DecisionTreeClassifier(random_state=42, class_weight='balanced', max_depth=11)

In [None]:
# Use the .fit method on the selected features
clf = clf.fit(X_train[feat], y_train)

### Evaluate the Model on the Evaluation dataset (Evalset)

In [None]:
feat

##### Predict on an unknown dataset (Testset)

In [None]:
# Use the .predict method; .predict_proba is also available for classifiers
# Predict with the same feature columns the model was trained on
Y_pred = clf.predict(X_test[feat])
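
Once predictions exist, they can be scored. The sketch below computes recall for the churn class and prints a classification report on hypothetical labels; `y_true` and `y_hat` stand in for `y_test` and `Y_pred` and are not real results.

```python
from sklearn.metrics import classification_report, recall_score

# Hypothetical true and predicted labels standing in for y_test and Y_pred
y_true = ["No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]
y_hat = ["No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"]

# Recall for the churn class: share of actual churners the model caught
churn_recall = recall_score(y_true, y_hat, pos_label="Yes")
print(f"Recall (Churn = Yes): {churn_recall:.2f}")
print(classification_report(y_true, y_hat))
```

Recall is a reasonable headline metric here because a missed churner is usually more costly than a false alarm, which matches the `recall_score` import at the top of the notebook.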

### Simple Model #002

##### Create the Model

In [None]:
# code 

##### Train the Model

In [None]:
# Use the .fit method

### Evaluate the Model on the Evaluation dataset (Evalset)

In [None]:
# Compute the valid metrics for the use case # Optional: show the classification report

##### Predict on an unknown dataset (Testset)

In [None]:
# Use .predict method # .predict_proba is available just for classification

### Models comparison
Create a pandas dataframe that will allow you to compare your models.
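
A possible shape for such a comparison table is sketched below. The scores are placeholders for illustration only, not measured results, and should be replaced with metrics computed on the test set.

```python
import pandas as pd

# Placeholder scores for illustration only; replace with real test-set metrics
results = [
    {"model": "DecisionTreeClassifier", "recall": 0.70, "precision": 0.55},
    {"model": "Model #002", "recall": 0.75, "precision": 0.50},
]

# One row per model, best recall first
comparison = (
    pd.DataFrame(results)
      .set_index("model")
      .sort_values("recall", ascending=False)
)
print(comparison)
```

Appending one dictionary per trained model to `results` keeps the comparison in a single place as more models are tried.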