## Predict Customer Churn
The goal is to predict customer behavior so that churn can be reduced and customers retained. I will analyze all relevant customer data and develop focused customer retention programs.

#### Data Source
[Dataset available on IBM Watson Analytics Sample Dataset > "WA_Fn UseC_ Telco Customer Churn.csv"](https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/)

#### Feature description: 
The raw data contains 7043 rows (customers) and 21 columns (features). The “Churn” column is our target/dependent variable. 

#### Feature list: 
customerID

gender (female, male)

SeniorCitizen (Whether the customer is a senior citizen or not (1, 0))

Partner (Whether the customer has a partner or not (Yes, No))

Dependents (Whether the customer has dependents or not (Yes, No))

tenure (Number of months the customer has stayed with the company)

PhoneService (Whether the customer has a phone service or not (Yes, No))

MultipleLines (Whether the customer has multiple lines or not (Yes, No, No phone service))

InternetService (Customer’s internet service provider (DSL, Fiber optic, No))

OnlineSecurity (Whether the customer has online security or not (Yes, No, No internet service))

OnlineBackup (Whether the customer has online backup or not (Yes, No, No internet service))

DeviceProtection (Whether the customer has device protection or not (Yes, No, No internet service))

TechSupport (Whether the customer has tech support or not (Yes, No, No internet service))

StreamingTV (Whether the customer has streaming TV or not (Yes, No, No internet service))

StreamingMovies (Whether the customer has streaming movies or not (Yes, No, No internet service))

Contract (The contract term of the customer (Month-to-month, One year, Two year))

PaperlessBilling (Whether the customer has paperless billing or not (Yes, No))

PaymentMethod (The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)))

MonthlyCharges (The amount charged to the customer monthly — numeric)

TotalCharges (The total amount charged to the customer — numeric)

Churn (Whether the customer churned or not (Yes, No))



#### Tech Details
Language: Python

Algorithms: Logistic Regression, Decision Tree, Random Forest and XGBoost

### Structure of the doc below:

1. Import data
2. Data wrangling 
3. Exploratory data analysis
4. Model Training
5. Model Testing
6. Inferencing
7. Data storytelling

### Libraries Imported below

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [16]:
churndata = pd.read_csv('Telco-Customer-Churn.csv')

In [17]:
churndata.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [18]:
churndata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)

#### My observations on the data types:
1. There are no null values! How is this data so clean?! Something is weird. In particular, why is TotalCharges an "object"? This needs to be investigated further.

2. It's interesting that SeniorCitizen is of type "int64". I might look into converting it to a string.

3. I also need to explore later whether string is the right data type for training the models.

4. I also need to explore whether tenure should stay int64 or whether there is scope to create buckets and assign it a string data type.
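Observations 2 and 4 could be explored along the following lines. This is only a sketch on a small made-up frame, not the notebook's actual processing: the `toy` frame, the `tenure_group` column name, and the bucket edges are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for churndata; the values are illustrative only.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0, 1],
    "tenure": [1, 34, 45, 72],
})

# Observation 2: recode the 0/1 flag with the same Yes/No labels
# that the other categorical columns use.
toy["SeniorCitizen"] = toy["SeniorCitizen"].map({0: "No", 1: "Yes"})

# Observation 4: bucket tenure (in months) into bands.
# The edges below are an assumption, not something derived from the data.
tenure_bins = [0, 12, 24, 48, 72]
tenure_labels = ["0-12", "13-24", "25-48", "49-72"]
toy["tenure_group"] = pd.cut(
    toy["tenure"],
    bins=tenure_bins,
    labels=tenure_labels,
    include_lowest=True,  # so that a tenure of 0 falls into the first band
)

print(toy)
```

Whether bucketing actually helps depends on the model: tree-based algorithms can split on the raw integer directly, so the bands are probably more useful for plotting and for describing customer segments.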


#### 1. Why is Total Charges an "object"?

I had to open the file in Excel to realize that there were blank values in the TotalCharges column that, for some reason, were not being picked up as missing. Further research, based on this post: https://stackoverflow.com/questions/30604893/pandas-not-recognizing-nan-as-null, showed that pandas does not treat a blank (single-space) string as an NA value by default, so I adjusted the CSV reader to accommodate that and re-ran it.

In [33]:
churndata = pd.read_csv('Telco-Customer-Churn.csv', keep_default_na=False, na_values=[' '])
churndata.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
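To see why the blanks went unnoticed, here is a minimal sketch that reproduces the behavior on a three-row, made-up CSV held in memory (the `csv_text` contents are an assumption, not rows from the real file). It also shows `pd.to_numeric(..., errors="coerce")` as one possible alternative when the file has already been loaded.

```python
import io
import pandas as pd

# Three made-up rows; the second has a single space in TotalCharges,
# mimicking the blanks found in the real file.
csv_text = "customerID,TotalCharges\nA,29.85\nB, \nC,108.15\n"

# Default read: " " is not in pandas' default NA list, so the whole
# column is kept as text and no nulls are reported.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["TotalCharges"].dtype)
print(raw["TotalCharges"].isnull().sum())

# Declaring " " as an NA marker lets the column parse as float.
fixed = pd.read_csv(io.StringIO(csv_text), na_values=[" "])
print(fixed["TotalCharges"].dtype)
print(fixed["TotalCharges"].isnull().sum())

# Alternative after loading: coerce anything non-numeric to NaN.
coerced = pd.to_numeric(raw["TotalCharges"], errors="coerce")
print(coerced.isnull().sum())
```

Both routes should flag the same rows as missing; the `na_values` route only catches the exact markers listed, while the coercion route turns any unparseable value into NaN.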

In [34]:
churndata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7032 non-null float64
Churn               7043 non-null object
dtypes: float64(2), int64(2), object(17)

#### There we go, TotalCharges is now a float64 type and there are 11 missing values!

In [35]:
churndata.describe()

| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7032.0 |
| mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
| std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
| min | 0.0 | 0.0 | 18.25 | 18.8 |
| 25% | 0.0 | 9.0 | 35.5 | 401.45 |
| 50% | 0.0 | 29.0 | 70.35 | 1397.475 |
| 75% | 0.0 | 55.0 | 89.85 | 3794.7375 |
| max | 1.0 | 72.0 | 118.75 | 8684.8 |


#### Dropping rows with missing values

In [39]:
churndata = churndata[churndata['TotalCharges'].notnull()]

In [40]:
churndata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7032 non-null object
gender              7032 non-null object
SeniorCitizen       7032 non-null int64
Partner             7032 non-null object
Dependents          7032 non-null object
tenure              7032 non-null int64
PhoneService        7032 non-null object
MultipleLines       7032 non-null object
InternetService     7032 non-null object
OnlineSecurity      7032 non-null object
OnlineBackup        7032 non-null object
DeviceProtection    7032 non-null object
TechSupport         7032 non-null object
StreamingTV         7032 non-null object
StreamingMovies     7032 non-null object
Contract            7032 non-null object
PaperlessBilling    7032 non-null object
PaymentMethod       7032 non-null object
MonthlyCharges      7032 non-null float64
TotalCharges        7032 non-null float64
Churn               7032 non-null object
dtypes: float64(2), int64(2), object(17)
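The `info()` output above reports 7032 entries with index labels still running from 0 to 7042, so the filtered frame keeps the original row labels with gaps. As a sketch on a made-up three-row frame (the `toy` data is an assumption), the boolean-mask filter is equivalent to `dropna(subset=...)`, and `reset_index(drop=True)` is one optional way to get a contiguous index back:

```python
import numpy as np
import pandas as pd

# Made-up rows; the first one has a missing TotalCharges value.
toy = pd.DataFrame({
    "tenure": [0, 1, 34],
    "TotalCharges": [np.nan, 29.85, 1889.5],
})

# The boolean-mask filter used in the notebook ...
by_mask = toy[toy["TotalCharges"].notnull()]

# ... and the equivalent dropna call restricted to one column.
by_dropna = toy.dropna(subset=["TotalCharges"])
print(by_mask.equals(by_dropna))

# Optional: renumber the rows so the index has no gaps.
cleaned = by_dropna.reset_index(drop=True)
print(cleaned.index.tolist())
```

Resetting the index is not required for model training; it mainly avoids surprises later when rows are selected by position.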

#### Next, let's inspect unique values in each column and see if we need to do any further data wrangling

In [42]:
churndata.gender.unique()

churndata.SeniorCitizen.unique()

churndata.Partner.unique()


array(['Yes', 'No'], dtype=object)

Why is only the last result being printed? How do I display all of them?
I found a solution here: https://stackoverflow.com/questions/36786722/how-to-display-full-output-in-jupyter-not-only-last-result 

In [43]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [45]:
churndata.gender.unique()

churndata.SeniorCitizen.unique()

churndata.Partner.unique()

churndata.Dependents.unique()

churndata.PhoneService.unique()

churndata.MultipleLines.unique()

churndata.InternetService.unique()

churndata.OnlineSecurity.unique()

array(['Female', 'Male'], dtype=object)

array([0, 1], dtype=int64)

array(['Yes', 'No'], dtype=object)

array(['No', 'Yes'], dtype=object)

array(['No', 'Yes'], dtype=object)

array(['No phone service', 'No', 'Yes'], dtype=object)

array(['DSL', 'Fiber optic', 'No'], dtype=object)

array(['No', 'Yes', 'No internet service'], dtype=object)

I got bored of typing each column and figured there had to be a better way, which I found here: https://stackoverflow.com/questions/27241253/print-the-unique-values-in-every-column-in-a-pandas-dataframe 

In [52]:
for col in churndata:
    col
    churndata[col].unique()

'customerID'

array(['7590-VHVEG', '5575-GNVDE', '3668-QPYBK', ..., '4801-JZAZL',
       '8361-LTMKD', '3186-AJIEK'], dtype=object)

'gender'

array(['Female', 'Male'], dtype=object)

'SeniorCitizen'

array([0, 1], dtype=int64)

'Partner'

array(['Yes', 'No'], dtype=object)

'Dependents'

array(['No', 'Yes'], dtype=object)

'tenure'

array([ 1, 34,  2, 45,  8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 52, 71,
       21, 12, 30, 47, 72, 17, 27,  5, 46, 11, 70, 63, 43, 15, 60, 18, 66,
        9,  3, 31, 50, 64, 56,  7, 42, 35, 48, 29, 65, 38, 68, 32, 55, 37,
       36, 41,  6,  4, 33, 67, 23, 57, 61, 14, 20, 53, 40, 59, 24, 44, 19,
       54, 51, 26, 39], dtype=int64)

'PhoneService'

array(['No', 'Yes'], dtype=object)

'MultipleLines'

array(['No phone service', 'No', 'Yes'], dtype=object)

'InternetService'

array(['DSL', 'Fiber optic', 'No'], dtype=object)

'OnlineSecurity'

array(['No', 'Yes', 'No internet service'], dtype=object)

'OnlineBackup'

array(['Yes', 'No', 'No internet service'], dtype=object)

'DeviceProtection'

array(['No', 'Yes', 'No internet service'], dtype=object)

'TechSupport'

array(['No', 'Yes', 'No internet service'], dtype=object)

'StreamingTV'

array(['No', 'Yes', 'No internet service'], dtype=object)

'StreamingMovies'

array(['No', 'Yes', 'No internet service'], dtype=object)

'Contract'

array(['Month-to-month', 'One year', 'Two year'], dtype=object)

'PaperlessBilling'

array(['Yes', 'No'], dtype=object)

'PaymentMethod'

array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
       'Credit card (automatic)'], dtype=object)

'MonthlyCharges'

array([29.85, 56.95, 53.85, ..., 63.1 , 44.2 , 78.7 ])

'TotalCharges'

array([  29.85, 1889.5 ,  108.15, ...,  346.45,  306.6 , 6844.5 ])

'Churn'

array(['No', 'Yes'], dtype=object)
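The loop above also prints long arrays for columns such as customerID, MonthlyCharges and TotalCharges, where listing every value is not very informative. One possible refinement, sketched here on a made-up frame, is to count unique values per column first and list them only for low-cardinality columns. The `toy` data and the `MAX_LEVELS` cutoff are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for churndata.
toy = pd.DataFrame({
    "customerID": ["A", "B", "C", "D"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "OnlineSecurity": ["No", "Yes", "No internet service", "No"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Hypothetical cutoff: columns with more distinct values than this are
# treated as identifiers or continuous measures and are not listed.
MAX_LEVELS = 3

# Number of distinct values in every column.
unique_counts = toy.nunique()
print(unique_counts)

# Collect and print the distinct values for low-cardinality columns only.
low_card = {
    col: toy[col].unique().tolist()
    for col in toy.columns
    if unique_counts[col] <= MAX_LEVELS
}
for col, values in low_card.items():
    print(col, values)
```

Using explicit `print` calls also means the output does not depend on the `InteractiveShell.ast_node_interactivity` setting.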