# Final Project

## Project Description


The telecom operator Interconnect wants to forecast client churn. If a client appears likely to leave, they will be offered promotional codes and special plan options. Interconnect's marketing team has collected some of its clients' personal data, including information about their plans and contracts.



**Interconnect's Services**

Interconnect mainly provides two types of services:

1. Landline communication. The telephone can be connected to several lines simultaneously.
2. Internet. The network can be set up via a telephone line (DSL, *digital subscriber line*) or through a fiber optic cable.

Some other services the company provides include:

- Internet security: antivirus software (*DeviceProtection*) and a malicious website blocker (*OnlineSecurity*)
- A dedicated technical support line (*TechSupport*)
- Cloud file storage and data backup (*OnlineBackup*)
- TV streaming (*StreamingTV*) and a movie directory (*StreamingMovies*)

Clients can either pay monthly or sign a 1- or 2-year contract. They can use various payment methods and receive an electronic invoice after a transaction.

**Data Description**

The data consists of files obtained from different sources:

- `contract.csv` — contract information
- `personal.csv` — the client's personal data
- `internet.csv` — information about Internet services
- `phone.csv` — information about telephone services

In each file, the column `customerID` contains a unique code assigned to each client.

The contract information is valid as of February 1, 2020.
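
Because every file shares the `customerID` key, the four sources can be consolidated with left joins onto the contract table. The following is a minimal sketch of that idea, not the project's actual preprocessing code. The toy frames and their values are made up for illustration, and the `has_internet` and `has_phone` flags are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the four files (illustrative values, not the real data)
contract = pd.DataFrame({"customerID": ["A", "B", "C"],
                         "Type": ["Month-to-month", "One year", "Two year"]})
personal = pd.DataFrame({"customerID": ["A", "B", "C"],
                         "gender": ["Female", "Male", "Male"]})
internet = pd.DataFrame({"customerID": ["A", "B"],
                         "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["B", "C"],
                      "MultipleLines": ["No", "Yes"]})

# Left-join everything onto the contract table so no client is dropped.
# validate="one_to_one" raises if customerID is duplicated in either frame.
merged = contract
for other in (personal, internet, phone):
    merged = merged.merge(other, on="customerID", how="left", validate="one_to_one")

# Clients absent from a service file end up with NaN in its columns;
# hypothetical flags make that explicit before any imputation.
merged["has_internet"] = merged["InternetService"].notna()
merged["has_phone"] = merged["MultipleLines"].notna()

print(merged)
```

A left join keeps one row per contract, so the row count of the merged frame should equal that of the contract table.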

## Setup

### Library Imports

In [33]:
# Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import warnings

from datetime import datetime
from IPython.display import display
from matplotlib import pyplot as plt

# Gradient Boosting
import lightgbm as lgb
import xgboost as xgb

# Sklearn
from sklearn.utils import shuffle, resample
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import auc, roc_curve, roc_auc_score, f1_score, confusion_matrix, classification_report, precision_recall_curve, accuracy_score, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder, StandardScaler

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

### Data Imports

In [3]:
# Load data -- provide feedback if unsuccessful

try: 
    contract_df = pd.read_csv('/datasets/final_provider/contract.csv')
    personal_df = pd.read_csv('/datasets/final_provider/personal.csv')
    internet_df = pd.read_csv('/datasets/final_provider/internet.csv')
    phone_df = pd.read_csv('/datasets/final_provider/phone.csv')
except FileNotFoundError as e:
    print(f'The datasets were not imported: Error {e}')
else:
    print('The datasets were imported successfully')
    
    # Turn off warning
    warnings.filterwarnings('ignore')

The datasets were imported successfully


## Initial Data Exploration

In [31]:
# A dict of datasets to easily print initial info 

datasets = {
    'contract_df': contract_df,
    'personal_df': personal_df,
    'internet_df': internet_df,
    'phone_df': phone_df
}

for name, dataset in datasets.items():
    print(f"Dataset: {name}\n")
    dataset.info()
    display(dataset.head())
    print("\n---------------------------------------------")

Dataset: contract_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB


|   | customerID | BeginDate | EndDate | Type | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 2020-01-01 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 |
| 1 | 5575-GNVDE | 2017-04-01 | No | One year | No | Mailed check | 56.95 | 1889.5 |
| 2 | 3668-QPYBK | 2019-10-01 | 2019-12-01 00:00:00 | Month-to-month | Yes | Mailed check | 53.85 | 108.15 |
| 3 | 7795-CFOCW | 2016-05-01 | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 |
| 4 | 9237-HQITU | 2019-09-01 | 2019-11-01 00:00:00 | Month-to-month | Yes | Electronic check | 70.7 | 151.65 |



---------------------------------------------
Dataset: personal_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB


|   | customerID | gender | SeniorCitizen | Partner | Dependents |
|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No |
| 1 | 5575-GNVDE | Male | 0 | No | No |
| 2 | 3668-QPYBK | Male | 0 | No | No |
| 3 | 7795-CFOCW | Male | 0 | No | No |
| 4 | 9237-HQITU | Female | 0 | No | No |



---------------------------------------------
Dataset: internet_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB


|   | customerID | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies |
|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | DSL | No | Yes | No | No | No | No |
| 1 | 5575-GNVDE | DSL | Yes | No | Yes | No | No | No |
| 2 | 3668-QPYBK | DSL | Yes | Yes | No | No | No | No |
| 3 | 7795-CFOCW | DSL | Yes | No | Yes | Yes | No | No |
| 4 | 9237-HQITU | Fiber optic | No | No | No | No | No | No |



---------------------------------------------
Dataset: phone_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB


|   | customerID | MultipleLines |
|---|---|---|
| 0 | 5575-GNVDE | No |
| 1 | 3668-QPYBK | No |
| 2 | 9237-HQITU | No |
| 3 | 9305-CDSKC | Yes |
| 4 | 1452-KIOVK | Yes |



---------------------------------------------


### Notes on the Datasets

1) Contract Database
    - There are no missing values
    - Preprocessing:
        - Change column names to lowercase with underscores
        - Check for duplicates in the dataset
        - Convert the `BeginDate` column to datetime
        - Convert `TotalCharges` to float64

    - Feature Engineering:
        - Derive `date`, `month`, and `year` columns from the `BeginDate` and `EndDate` columns
        - Create a new column from `EndDate` indicating whether the customer has left the service
        - Check seasonality in the `BeginDate` column (month, day of the week, hour)
        - One hot encode the PaymentMethod, PaperlessBilling, and Type columns.


2) Personal Database
    - There are no missing values
    - Preprocessing:
        - Change column names to underscore lowercase
        - Check for duplicates in the dataset

    - Feature Engineering:
        - One hot encode the gender, Partner, and Dependents columns

3) Internet Database
    - There are no missing values
    - There are 1,526 fewer customer IDs than in the contract and personal DataFrames (5,517 vs. 7,043).
        - The likely explanation is that these customers did not subscribe to any internet service.
        
    - Preprocessing:
        - Change column names to underscore lowercase
        - Check for duplicates in the dataset

    - Feature Engineering:
        - One hot encode all columns except for customer ID.

4) Phone Database
    - There are no missing values
    - There are 682 fewer customer IDs than in the contract and personal DataFrames (6,361 vs. 7,043).
        - The likely explanation is that these customers did not subscribe to any phone service.
        
    - Preprocessing:
        - Change column names to underscore lowercase
        - Check for duplicates in the dataset

    - Feature Engineering:
        - One hot encode the MultipleLines column
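
The contract notes above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the final preprocessing code. The helper `to_snake_case`, the columns `churned` and `tenure_days`, and the third sample row (including its blank `TotalCharges`) are hypothetical. The first two rows reuse values from the `contract_df` preview.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Hypothetical helper: 'BeginDate' -> 'begin_date', 'customerID' -> 'customer_id'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

sample = pd.DataFrame({
    "customerID": ["7590-VHVEG", "3668-QPYBK", "0000-DEMO"],  # last row is made up
    "BeginDate": ["2020-01-01", "2019-10-01", "2020-02-01"],
    "EndDate": ["No", "2019-12-01 00:00:00", "No"],
    "Type": ["Month-to-month", "Month-to-month", "One year"],
    "TotalCharges": ["29.85", "108.15", " "],  # assumed non-numeric entry
})

clean = sample.rename(columns=to_snake_case)

# Target: a client has churned when end_date holds a date instead of "No"
clean["churned"] = (clean["end_date"] != "No").astype(int)

# Parse dates; "No" becomes NaT so the column has a proper datetime dtype
clean["begin_date"] = pd.to_datetime(clean["begin_date"])
clean["end_date"] = pd.to_datetime(clean["end_date"].where(clean["end_date"] != "No"))

# Non-numeric strings are coerced to NaN rather than raising an error
clean["total_charges"] = pd.to_numeric(clean["total_charges"], errors="coerce")

# Tenure in days, measured up to the snapshot date for active clients
snapshot = pd.Timestamp("2020-02-01")
clean["tenure_days"] = (clean["end_date"].fillna(snapshot) - clean["begin_date"]).dt.days

# One-hot encode a categorical column, dropping the first level
clean = pd.get_dummies(clean, columns=["type"], drop_first=True)

print(clean)
```

Using `errors="coerce"` surfaces problematic `TotalCharges` entries as NaN, so they can be counted and handled explicitly instead of silently breaking the type conversion.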

## Proposed Work Plan

**Goal:** Develop a model to predict user churn

Steps:
1. Download the dataset.
2. Explore the dataset to determine the appropriate preprocessing methods.
3. Preprocess the data:
    - Consolidate all dataframes into a single main dataframe.
    - Standardize column names to underscore-lowercase format.
    - Convert columns to the required data types.
4. Conduct an in-depth exploratory data analysis (EDA).
5. Perform feature engineering.
6. Train and evaluate models:
    - Test various classification models against a dummy baseline, using Logistic Regression as a starting point.
    - Evaluate models using ROC-AUC and accuracy scores.
    - Fine-tune models with cross-validation, incorporating gradient boosting techniques.
7. Assess the final model on the test set.
8. Write the conclusion.
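
As a rough illustration of step 6, the sketch below compares a dummy baseline with Logistic Regression using cross-validated ROC-AUC and hold-out accuracy. It runs on synthetic data from `make_classification` with an arbitrary class imbalance, not on the Interconnect data, so the scores say nothing about the real problem. The names `models` and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and churn target
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    # Always predicts the majority class; any useful model should beat it
    "dummy": DummyClassifier(strategy="most_frequent"),
    # Scaling lives inside the pipeline so it is refit on each CV fold
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

results = {}
for name, model in models.items():
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "cv_auc": cv_auc,
        "test_auc": roc_auc_score(y_test, proba),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

A constant predictor yields a ROC-AUC of 0.5, which is why the dummy model is a useful floor: its accuracy can look respectable on imbalanced data while its ROC-AUC shows that it ranks clients no better than chance.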