# Project Plan
***

The purpose of this notebook is to perform preliminary EDA on the datasets provided for the final project and to complete a work plan for training, validating, and testing machine learning models. The data is provided by the telecom operator Interconnect for the purpose of forecasting client churn. The company would like to predict which users are likely to stop using its services so that it can offer those customers promotions and special plan options and retain them.

**Some notes about the dataset:**
- The contract information is valid as of February 1, 2020.
- The data consists of files obtained from different sources:
    - `contract.csv` — contract information
    - `personal.csv` — the client's personal data
    - `internet.csv` — information about Internet services
    - `phone.csv` — information about telephone services
- In each file, the column `customerID` contains a unique code assigned to each client.

**Target feature:**
- The `EndDate` column equals 'No'.

The metrics that have been selected for use will be:

**Primary metric:** AUC-ROC.

**Additional metric:** Accuracy.
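
As a minimal sketch of how the two metrics differ in practice, the toy example below computes both with scikit-learn. The labels and probabilities are made-up illustrative values, not project data, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative values, not project data)
y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8])

# AUC-ROC is computed from scores or probabilities, not from hard labels
auc = roc_auc_score(y_true, y_proba)

# Accuracy needs hard labels, so a threshold has to be chosen (0.5 assumed here)
y_pred = (y_proba >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

print(f'AUC-ROC: {auc:.2f}, accuracy: {acc:.2f}')
```

Because AUC-ROC is threshold-independent, it is less sensitive to the class imbalance than accuracy, which is one reason it suits a primary metric here.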

## Initialization

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_path = 'D://Code/FinalProject/final_provider/'

try:
    contract_df = pd.read_csv(data_path + 'contract.csv')
    internet_df = pd.read_csv(data_path + 'internet.csv')
    personal_df = pd.read_csv(data_path + 'personal.csv')
    phone_df = pd.read_csv(data_path + 'phone.csv')
# Fall back to the hosted dataset path when the local files are absent
except FileNotFoundError:
    contract_df = pd.read_csv('/datasets/final_provider/contract.csv')
    internet_df = pd.read_csv('/datasets/final_provider/internet.csv')
    personal_df = pd.read_csv('/datasets/final_provider/personal.csv')
    phone_df = pd.read_csv('/datasets/final_provider/phone.csv')
else:
    print('Datasets Loaded')

Datasets Loaded


## Contract Dataset

In [3]:
contract_df.info()
contract_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB


|  | customerID | BeginDate | EndDate | Type | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | 2020-01-01 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 |
| 1 | 5575-GNVDE | 2017-04-01 | No | One year | No | Mailed check | 56.95 | 1889.5 |
| 2 | 3668-QPYBK | 2019-10-01 | 2019-12-01 00:00:00 | Month-to-month | Yes | Mailed check | 53.85 | 108.15 |
| 3 | 7795-CFOCW | 2016-05-01 | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 |
| 4 | 9237-HQITU | 2019-09-01 | 2019-11-01 00:00:00 | Month-to-month | Yes | Electronic check | 70.7 | 151.65 |


In [9]:
contract_df['BeginDate'].value_counts()

BeginDate
2014-02-01    366
2019-10-01    237
2019-11-01    237
2019-09-01    237
2020-01-01    233
             ... 
2020-02-01     11
2014-01-01      7
2013-10-01      3
2013-12-01      3
2013-11-01      2
Name: count, Length: 77, dtype: int64

In [7]:
class_balance = len(contract_df[contract_df['EndDate'] == 'No']) / len(contract_df['EndDate'])
print('Share of No entries:', class_balance)

Share of No entries: 0.7346301292063041


### Notes on Contract Dataset

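The contract dataset contains every customer ID (7043 total). The classes are imbalanced: about 73% of customers have not left. No null values are reported, although `TotalCharges` is stored as `object` and needs to be converted to a numeric type.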

**Preprocessing:**
- Change column names
- Change `BeginDate` to datetime
- Change `TotalCharges` to float64
- OHE / Data encoding 

**Feature Engineering:**
- Create boolean target column based on `EndDate`
- Create Average Monthly Charge column
- Check for seasonality in `BeginDate` and `EndDate`
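
The preprocessing and feature engineering steps listed above might look roughly like the sketch below. It uses a two-row stand-in built from the `head()` output rather than the full `contract_df`. The column names `churned` and `tenure_days` are hypothetical, and the choice of 1 = "has left" for the target is an assumption about polarity. The snapshot date comes from the dataset notes.

```python
import pandas as pd

# Two-row stand-in for contract_df, taken from the head() output above
sample = pd.DataFrame({
    'customerID': ['7590-VHVEG', '3668-QPYBK'],
    'BeginDate': ['2020-01-01', '2019-10-01'],
    'EndDate': ['No', '2019-12-01 00:00:00'],
    'MonthlyCharges': [29.85, 53.85],
    'TotalCharges': ['29.85', '108.15'],
})

sample['BeginDate'] = pd.to_datetime(sample['BeginDate'])

# errors='coerce' turns any non-numeric strings into NaN instead of raising
sample['TotalCharges'] = pd.to_numeric(sample['TotalCharges'], errors='coerce')

# Hypothetical target column: 1 = customer has left, 0 = still active
sample['churned'] = (sample['EndDate'] != 'No').astype(int)

# Contract information is valid as of this date (from the dataset notes)
snapshot = pd.Timestamp('2020-02-01')

# Active customers ('No') get the snapshot date as their end of observation
end = pd.to_datetime(sample['EndDate'].where(sample['EndDate'] != 'No')).fillna(snapshot)

# Hypothetical helper feature, e.g. as an input to an average monthly charge
sample['tenure_days'] = (end - sample['BeginDate']).dt.days

print(sample[['customerID', 'churned', 'tenure_days', 'TotalCharges']])
```

Any `NaN` produced by the numeric conversion would still need a decision (drop or impute) once the real data is inspected.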

## Internet Dataset

In [4]:
internet_df.info()
internet_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB


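|  | customerID | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | DSL | No | Yes | No | No | No | No |
| 1 | 5575-GNVDE | DSL | Yes | No | Yes | No | No | No |
| 2 | 3668-QPYBK | DSL | Yes | Yes | No | No | No | No |
| 3 | 7795-CFOCW | DSL | Yes | No | Yes | Yes | No | No |
| 4 | 9237-HQITU | Fiber optic | No | No | No | No | No | No |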


In [10]:
display(internet_df['InternetService'].value_counts())

InternetService
Fiber optic    3096
DSL            2421
Name: count, dtype: int64

### Notes on Internet Dataset

This dataset only contains entries for 5517 users. Every column except `customerID` is binary. No missing values.

**Preprocessing:** 
- Change column names
- OHE / Data Encoding
- Merge to Contract dataset

**Feature Engineering:**
- Analyse each column to determine applicability
- Fill in missing data after merging process
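
Since only 5517 of the 7043 customers appear here, merging onto the contract data will produce gaps. A minimal sketch of the merge-and-fill step follows, using toy frames with hypothetical customer IDs. Treating a missing row as "no internet service" and the fill label `'No internet'` are assumptions to be confirmed during EDA.

```python
import pandas as pd

# Toy frames with hypothetical IDs; customer 'B' has no internet record
contract = pd.DataFrame({'customerID': ['A', 'B', 'C']})
internet = pd.DataFrame({
    'customerID': ['A', 'C'],
    'InternetService': ['DSL', 'Fiber optic'],
    'OnlineSecurity': ['Yes', 'No'],
})

# A left merge keeps every contract row; unmatched customers get NaN
merged = contract.merge(internet, on='customerID', how='left')

# Assumption: a missing row means the customer does not use the service
merged['InternetService'] = merged['InternetService'].fillna('No internet')
merged['OnlineSecurity'] = merged['OnlineSecurity'].fillna('No')

print(merged)
```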

## Personal Dataset

In [5]:
personal_df.info()
personal_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB


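|  | customerID | gender | SeniorCitizen | Partner | Dependents |
| --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No |
| 1 | 5575-GNVDE | Male | 0 | No | No |
| 2 | 3668-QPYBK | Male | 0 | No | No |
| 3 | 7795-CFOCW | Male | 0 | No | No |
| 4 | 9237-HQITU | Female | 0 | No | No |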


In [12]:
display(personal_df['gender'].value_counts())
display(personal_df['SeniorCitizen'].value_counts())
print(len(personal_df[personal_df['SeniorCitizen']==1])/len(personal_df))

gender
Male      3555
Female    3488
Name: count, dtype: int64

SeniorCitizen
0    5901
1    1142
Name: count, dtype: int64

0.1621468124378816


### Notes on Personal Dataset

The personal dataset contains entries for all 7043 users. Genders are balanced. Only about 16% of customers are senior citizens, so they form a small group that could behave as outliers. No missing values.

**Preprocessing:**
- Change column names
- OHE / Data Encoding
- Merge to Contract dataset

**Feature Engineering:**
- Create `Divorced/Widowed` column using `Partner` and `Dependents`
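
The encoding step that recurs in these lists could be sketched with `pd.get_dummies`. The rows below are toy values with hypothetical IDs, and using `drop_first=True` is an assumption, chosen here so that each binary column becomes a single indicator.

```python
import pandas as pd

# Toy rows with hypothetical IDs; each category takes both of its values
personal = pd.DataFrame({
    'customerID': ['A', 'B', 'C'],
    'gender': ['Female', 'Male', 'Male'],
    'SeniorCitizen': [0, 0, 1],
    'Partner': ['Yes', 'No', 'No'],
    'Dependents': ['No', 'No', 'Yes'],
})

# customerID is an identifier, not a feature, so it is left out of the encoding
encoded = pd.get_dummies(
    personal.drop(columns='customerID'),
    columns=['gender', 'Partner', 'Dependents'],
    drop_first=True,  # keep one indicator per binary column
    dtype=int,
)

print(encoded)
```

Note that `drop_first=True` removes a column entirely if only one value is present, so it is worth checking `value_counts()` on each column first.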

## Phone Dataset

In [6]:
phone_df.info()
phone_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB


|  | customerID | MultipleLines |
| --- | --- | --- |
| 0 | 5575-GNVDE | No |
| 1 | 3668-QPYBK | No |
| 2 | 9237-HQITU | No |
| 3 | 9305-CDSKC | Yes |
| 4 | 1452-KIOVK | Yes |


### Notes on Phone Dataset

The phone dataset only contains data for 6361 users. It has one relevant column, `MultipleLines`. No missing values.

**Preprocessing:**
- Change column names
- OHE / Data Encoding
- Merge to Contract dataset
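
The "change column names" step applies to all four datasets. One possible approach is to convert the mixed-case names to snake_case. The helper `to_snake_case` below is hypothetical, and snake_case itself is an assumed naming convention, not one the plan has fixed.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Insert '_' where a lowercase letter or digit meets a capital, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()


# Toy frame mirroring the phone dataset's columns
phone = pd.DataFrame({'customerID': ['5575-GNVDE'], 'MultipleLines': ['No']})
phone.columns = [to_snake_case(col) for col in phone.columns]

print(list(phone.columns))
```

Renaming before the merge keeps the join key consistent (`customer_id` in every frame under this convention).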

# Proposed Work Plan
***
In the telecom project, our goal is to develop a model that can predict whether a user will leave the service. The steps to achieve that are:

1. Download the data.

2. Explore the data in order to determine how to best treat the data during preprocessing.

3. Perform preprocessing on the data, including:
    - Merge the data into one main DataFrame
    - Change all column names to a consistent and readable format
    - Convert data to appropriate datatypes
    - Feature Engineering
    - OHE / Feature Encoding
    - Scaling Data

4. Perform EDA to explore the data in depth.

5. Train multiple models to ensure the best model is chosen:
    - Sanity Check
    - Decision Tree / Random Forest Models
    - Gradient Boosting Models

6. Test the best-performing model on a test dataset and draw conclusions.
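
Steps 5 and 6 could be sketched as follows. This is only an outline under stated assumptions: the data is synthetic (generated with `make_classification` at roughly the 73/27 class split seen above), the 60/20/20 split sizes and all hyperparameters are placeholders, and `DummyClassifier` stands in for the sanity check.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged, encoded features (not project data)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.27, 0.73], random_state=0,
)

# Hold out a test set first, then carve a validation set from the remainder
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)

models = {
    'sanity_check': DummyClassifier(strategy='most_frequent'),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
}

# Compare models on the validation set using the primary metric
valid_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    valid_auc[name] = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

# Only the selected model is scored on the untouched test set
best_name = max(valid_auc, key=valid_auc.get)
test_auc = roc_auc_score(y_test, models[best_name].predict_proba(X_test)[:, 1])

print(valid_auc)
print('Best model:', best_name, 'test AUC-ROC:', round(test_auc, 3))
```

The constant-prediction baseline scores an AUC-ROC of 0.5, so any candidate model should clear that before it is considered further.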