## Project Title : 
Churn Prediction Project 

## Project Description: 
This project builds a churn prediction model for a telecom company. Imagine that we work at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They stop using our services and move to a different provider. We would like to prevent that, so we develop a system that identifies these customers and offers them an incentive to stay: we target them with promotional messages and give them a discount. We would also like to understand why the model thinks a customer will churn, and for that we need to be able to interpret the model’s predictions.
 
We have collected a dataset where we’ve recorded some information about our customers: what type of services they used, how much they paid, and how long they stayed with us. We also know who canceled their contracts and stopped using our services (churned). We will use this information as the target variable in the machine learning model and predict it from all the other available information.

The project plan is as follows: 
- First, we download the dataset and do some initial preparation: rename columns and change values inside columns to be consistent throughout the entire dataset.
- Then we split the data into train, validation, and test so we can validate our models.
- As part of the initial data analysis, we look at feature importance to identify which features are important in our data.
- We transform categorical variables into numeric variables so we can use them in the model.
- Finally, we train a logistic regression model.

## Dataset Description
- URL: https://www.kaggle.com/blastchar/telco-customer-churn

- Column description
    - CustomerID: the ID of the customer
    - Gender: male/female
    - SeniorCitizen: whether the customer is a senior citizen (0/1)
    - Partner: whether they live with a partner (yes/no)
    - Dependents: whether they have dependents (yes/no)
    - Tenure: number of months since the start of the contract
    - PhoneService: whether they have phone service (yes/no)
    - MultipleLines: whether they have multiple phone lines (yes/no/no phone service)
    - InternetService: the type of internet service (no/DSL/fiber optic)
    - OnlineSecurity: if online security is enabled (yes/no/no internet)
    - OnlineBackup: if online backup service is enabled (yes/no/no internet)
    - DeviceProtection: if the device protection service is enabled (yes/no/no internet)
    - TechSupport: if the customer has tech support (yes/no/no internet)
    - StreamingTV: if the TV streaming service is enabled (yes/no/no internet)
    - StreamingMovies: if the movie streaming service is enabled (yes/no/no internet)
    - Contract: the type of contract (monthly/yearly/two years)
    - PaperlessBilling: if the billing is paperless (yes/no)
    - PaymentMethod: payment method (electronic check, mailed check, bank transfer, credit card)
    - MonthlyCharges: the amount charged monthly (numeric)
    - TotalCharges: the total amount charged (numeric)
    - Churn: if the client has canceled the contract (yes/no)

## Environment Configuration
- Creating the virtual environment
    - python -m venv venv

- Activating the virtual environment
    - Windows (Git Bash): source venv/Scripts/activate
    - Linux/macOS: source venv/bin/activate

- Installing packages
    - pip install jupyter notebook pandas pyarrow numpy matplotlib seaborn scikit-learn

- Starting the notebook
    - jupyter notebook

- Stopping the notebook
    - Ctrl+C

- Deactivating the virtual environment
    - deactivate

## Importing Libraries

In [None]:
## libraries for loading and preprocessing
import numpy as np
import pandas as pd

## libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

## library for building a validation framework
from sklearn.model_selection import train_test_split

## library for feature engineering 
from sklearn.feature_extraction import DictVectorizer

## library for ml algorithms
from sklearn.linear_model import LogisticRegression

## library for ml metrics 
from sklearn.metrics import accuracy_score


## Loading And Data Overview

In [None]:
## load dataset
data = pd.read_csv("dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv")

## create a copy of the dataset so the original stays untouched
df = data.copy()

In [None]:
## view the first five rows 
df.head()

In [None]:
## last five rows 
df.tail().T

In [None]:
## check for the total rows and columns 
print(f'total number of rows: {df.shape[0]} => total number of columns: {df.shape[1]}')

In [None]:
## check for the brief column summary 
df.info()

In [None]:
## check for the datatypes in each column 
df.dtypes

In [None]:
## check for missing values 
df.isnull().sum()

In [None]:
## let's count the duplicated rows
df.duplicated().sum()

In [None]:
## check the unique values in a column (the same approach works for the other columns)
np.unique(df['TotalCharges'])

## Data Preprocessing 
- Normalizing the column names
- Replacing empty strings with NaN and filling in the missing values
- Deleting the customer ID column
- Changing the data types of the columns
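
The second step can be illustrated on a toy column before applying it to the real data. This is a minimal sketch; the values below are assumed for illustration and only mimic numbers stored as text with one blank entry.

```python
import pandas as pd

# Toy column mimicking numbers stored as text, with one blank entry (assumed values)
raw = pd.Series(["29.85", " ", "108.15"])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isnull().sum())

# Replace the NaN values with zero
filled = numeric.fillna(0)
print(n_missing)
print(filled.tolist())
```

Checking `isnull().sum()` right after the conversion shows how many entries could not be parsed, which is why the notebook repeats the missing-value check below.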

In [None]:
## let's convert the column names to lower case
df.columns = df.columns.str.lower()
df.columns

In [None]:
## preview the columns
df.head()

In [None]:
## convert the totalcharges column to numeric; values that cannot be parsed become NaN
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce') 

In [None]:
## check for missing values 
df.isnull().sum()

In [None]:
## fill in the missing values in the totalcharges column with zero
df.totalcharges = df.totalcharges.fillna(0)

In [None]:
## delete the customer id column 
del df['customerid']

# ## delete the customer id column 
# df = df.drop(['customerid'], axis=1) 

In [None]:
## display the first five rows, transposed so every column is visible
df.head().T

In [None]:
## lets change the datatype of 'object' columns to category datatypes.


In [None]:
## let's convert the target column: 'Yes' becomes 1 and 'No' becomes 0
df.churn = (df.churn == 'Yes').astype(int)

In [None]:
## let's preview the churn column
df.churn.head()


## Exploratory Data Analysis
- Target Variable Analysis 
- Outlier analysis 

In [None]:
## lets display the distribution of the target column (churn)

In [None]:
## compute the total counts of each category in the target column
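
One possible way to inspect the target is `value_counts`. The sketch below runs on a hypothetical toy column rather than on `df.churn`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy target column already encoded as 0/1 (hypothetical values)
churn = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="churn")

# Absolute count of each class
counts = churn.value_counts()

# Share of each class; for a 0/1 column the mean equals the churn rate
shares = churn.value_counts(normalize=True)
churn_rate = churn.mean()
print(counts.to_dict(), shares.to_dict(), churn_rate)
```

On the real data, the same calls on `df.churn` would show how imbalanced the two classes are.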

## Building a validation framework
- Let’s split the DataFrame such that
    - 20% of data goes to validation.
    - 20% goes to test.
    - The remaining 60% goes to train.

In [None]:
## split the dataset into training, validation, and test sets
df_train_full , df_test = train_test_split(df, test_size=0.2, random_state=11) 
df_train, df_valid = train_test_split(df_train_full, test_size=0.25, random_state=11)


## print the output of the train, validation, and test data sample
print(f'Training dataset: {len(df_train)}')
print(f'Validation dataset: {len(df_valid)}')
print(f'Test dataset: {len(df_test)}')
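
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data, and 25% of 80% is 20% of the original. A small sketch on a synthetic 1000-row frame (an assumption made only to keep the arithmetic readable) confirms the 60/20/20 proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 1000 rows, used only to make the arithmetic easy to read
toy = pd.DataFrame({"x": np.arange(1000)})

# Step 1: hold out 20% of all rows for the test set
toy_train_full, toy_test = train_test_split(toy, test_size=0.2, random_state=11)

# Step 2: 25% of the remaining 80% is 20% of the original data
toy_train, toy_valid = train_test_split(toy_train_full, test_size=0.25, random_state=11)

print(len(toy_train), len(toy_valid), len(toy_test))
```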

In [None]:
## select the target column from each dataframe and convert it to a NumPy array
y_train = df_train['churn'].values
y_valid = df_valid['churn'].values
y_test = df_test['churn'].values

In [None]:
## delete the target column from the rest of the dataframe 
del df_train['churn']
del df_valid['churn']
del df_test['churn']

## Baseline Training of a Logistic Regression Model
- To build a baseline, we train a simple logistic regression model on the numerical features only.

In [None]:
## select only the numerical features
df_train_bl = df_train.select_dtypes(include='number')
df_valid_bl = df_valid.select_dtypes(include='number')

In [None]:
## convert the numerical features into numpy array
X_train_bl = df_train_bl.values
X_valid_bl = df_valid_bl.values

In [None]:
## instantiate a logistic regression algorithm 
bl_model = LogisticRegression(solver='liblinear', random_state=1)

## fit the training data to the algorithm 
bl_model.fit(X_train_bl, y_train)

In [None]:
## generate the validation predictions 
y_valid_pred_bl = bl_model.predict_proba(X_valid_bl)

In [None]:
## preview the validation predictions
y_valid_pred_bl

- The model's predictions form a two-column matrix with one row per customer.
- The first column contains the probability that the target is zero (the client won’t churn).
- The second column contains the opposite probability (the target is one, and the client will churn).
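
The layout of this matrix can be checked on a tiny synthetic problem. This is a sketch under assumed toy data, not the notebook's model: the column order follows `classes_`, and each row sums to one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic problem: one feature, label is 1 for the larger values
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(solver="liblinear", random_state=1).fit(X, y)
proba = model.predict_proba(X)

# Columns follow the order of model.classes_, here [0, 1]
print(model.classes_)
# Each row sums to one, so the second column is one minus the first
print(proba.sum(axis=1))
```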

In [None]:
## lets select the data in the second column
y_valid_pred = bl_model.predict_proba(X_valid_bl)[:, 1]

- These probabilities are often called soft predictions.
- They give the probability of churning as a number between zero and one. It’s up to us to decide how to interpret this number and how to use it.
- The probability alone is not enough to decide whether to send a promotional letter to a customer.
- We need hard predictions: binary values of True (churn, so send the mail) or False (not churn, so don’t send the mail).
- To get the binary predictions, we compare each probability with a threshold: values at or above it become True, and the rest become False.
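
Turning soft predictions into hard ones is a single comparison. The sketch below uses hypothetical probabilities and labels to show that step, and how accuracy follows from comparing the hard predictions with the true labels.

```python
import numpy as np

# Hypothetical soft predictions and true labels for five customers
soft = np.array([0.10, 0.80, 0.50, 0.35, 0.65])
actual = np.array([0, 1, 0, 0, 1])

# Hard predictions: True where the probability reaches the threshold
threshold = 0.5
hard = soft >= threshold

# Accuracy is the share of hard predictions that match the true labels
accuracy = (hard == actual).mean()
print(hard.astype(int), accuracy)
```

The threshold of 0.5 is a common default rather than a requirement. Raising or lowering it trades missed churners against unnecessary promotional mail.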

In [None]:
## let's set the prediction threshold to 0.5
churn = y_valid_pred >= 0.5

In [None]:
# # display the output
# (y_valid == churn).mean()

In [None]:
## let's compute the accuracy by comparing the true validation labels with the hard predictions
acc_score = accuracy_score(y_valid, churn)

## display the output
print(f'Baseline Validation Accuracy Score: {round(acc_score * 100, 1)}%')