# Telco Churn Classification Project

___________________________________

## Project Planning:
**Plan** --- Acquire --- Prepare --- Explore --- Model --- Deliver


1. Create README.md with data dictionary, project and business goals, and an initial hypothesis.
2. Acquire data from the Codeup Database and create a function to automate this process. Save the function in an acquire.py file to import into the Final Report Notebook.
3. Clean and prepare data for the first iteration through the pipeline, MVP preparation. Create a function to automate the process, store the function in a prepare.py module, and prepare data in the Final Report Notebook by importing and using the function.
4. Clearly define two hypotheses, set an alpha, run the statistical tests needed, reject or fail to reject the Null Hypothesis, and document findings and takeaways.
5. Establish a baseline accuracy and document well.
6. Train three different classification models.
7. Evaluate models on train and validate datasets.
8. Choose the model that performs the best and evaluate that single model on the test dataset.
9. Create csv file with the customer id, the probability of the target values, and the model's prediction for each observation in my test dataset.
10. Document conclusions, takeaways, and next steps in the Final Report Notebook.

____________________________

## Business Goals:
- Find a driver of churn for Telco customers
- Construct an ML classification model that accurately predicts customer churn.
- Document your process well enough to be presented or read like a report

______________________________

## Executive Summary- Conclusions and Next Steps:
- My findings are:
    - I will be using the decision tree model as my best model for predicting my target value, churn, because:
        - it has an accuracy of 79.36% on the train set and 78.95% on the validate set
        - this model outperformed my baseline score of 73.12%
        - there is not a large drop off of accuracy between the two sets (thus it is not overfit)
       
       <br>
- Next Steps/If I had more time:
    - I would run more models and change the hyperparameters on several different versions
    - I would look into surveying exiting customers to better understand their actual cause of churn
    - We can then target the true reason to reduce churn in future customers

_________________________

In [1]:
#import needed libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from pydataset import data

#import ignore warnings
import warnings
warnings.filterwarnings("ignore")


In [2]:
from sklearn.model_selection import train_test_split  # split into train, validate, test
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text


# Acquire

Plan --- **Acquire** --- Prepare --- Explore --- Model --- Deliver

In [3]:
# acquire
from env import host, user, password
import acquire

In [4]:
# call acquire function and take a look at data
df = acquire.get_telco_churn_data()

In [5]:
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type_id,contract_type,payment_type_id.1,payment_type,internet_service_type_id.1,internet_service_type
0,0016-QLJIS,Female,0,Yes,Yes,65,Yes,Yes,1,Yes,...,2,90.45,5957.9,No,3,Two year,2,Mailed check,1,DSL
1,0017-DINOC,Male,0,No,No,54,No,No phone service,1,Yes,...,4,45.2,2460.55,No,3,Two year,4,Credit card (automatic),1,DSL
2,0019-GFNTW,Female,0,No,No,56,No,No phone service,1,Yes,...,3,45.05,2560.1,No,3,Two year,3,Bank transfer (automatic),1,DSL
3,0056-EPFBG,Male,0,Yes,Yes,20,No,No phone service,1,Yes,...,4,39.4,825.4,No,3,Two year,4,Credit card (automatic),1,DSL
4,0078-XZMHT,Male,0,Yes,No,72,Yes,Yes,1,No,...,3,85.15,6316.2,No,3,Two year,3,Bank transfer (automatic),1,DSL


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               7043 non-null   object 
 1   gender                    7043 non-null   object 
 2   senior_citizen            7043 non-null   int64  
 3   partner                   7043 non-null   object 
 4   dependents                7043 non-null   object 
 5   tenure                    7043 non-null   int64  
 6   phone_service             7043 non-null   object 
 7   multiple_lines            7043 non-null   object 
 8   internet_service_type_id  7043 non-null   int64  
 9   online_security           7043 non-null   object 
 10  online_backup             7043 non-null   object 
 11  device_protection         7043 non-null   object 
 12  tech_support              7043 non-null   object 
 13  streaming_tv              7043 non-null   object 
 14  streamin

## Takeaways of the Acquire process:
- I wrote a SQL query to create the acquire function
- I joined 3 tables together to get all data needed
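
The acquire step described above could be sketched roughly as follows. This is not the contents of `acquire.py`, which is not shown in this notebook. The table names, the `USING` join keys, the pymysql driver, and the helper names `get_connection_url` and `get_cached` are all assumptions for illustration.

```python
import os
import pandas as pd

# Assumed schema: a customers table joined to lookup tables on the *_type_id columns.
TELCO_QUERY = """
SELECT *
FROM customers
JOIN contract_types USING (contract_type_id)
JOIN payment_types USING (payment_type_id)
JOIN internet_service_types USING (internet_service_type_id)
"""


def get_connection_url(db, user, host, password):
    """Build a SQLAlchemy-style MySQL URL (the pymysql driver is an assumption)."""
    return f"mysql+pymysql://{user}:{password}@{host}/{db}"


def get_cached(filename, loader):
    """Return the CSV cache if it exists; otherwise call loader() and cache its result."""
    if os.path.isfile(filename):
        return pd.read_csv(filename, index_col=0)
    df = loader()
    df.to_csv(filename)
    return df
```

With real credentials, the `loader` argument might be something like `lambda: pd.read_sql(TELCO_QUERY, get_connection_url('telco_churn', user, host, password))`, where the database name `telco_churn` is also an assumption. Caching to CSV avoids querying the database on every notebook run.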


<hr style="border:2px solid black"> </hr>

# Prepare

Plan --- Acquire --- **Prepare** --- Explore --- Model --- Deliver

In [7]:
import prepare

In [8]:
#run the prepare function and reassign the result to df
df = prepare.prep_telco_churn(df)

In [9]:
#call df and look at it
df.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn,contract_type,payment_type,internet_service_type
0,0016-QLJIS,Female,0,Yes,Yes,65,Yes,Yes,1,Yes,...,Yes,3,Yes,2,90.45,5957.9,0,Two year,Mailed check,DSL
1,0017-DINOC,Male,0,No,No,54,No,No phone service,1,Yes,...,No,3,No,4,45.2,2460.55,0,Two year,Credit card (automatic),DSL
2,0019-GFNTW,Female,0,No,No,56,No,No phone service,1,Yes,...,No,3,No,3,45.05,2560.1,0,Two year,Bank transfer (automatic),DSL
3,0056-EPFBG,Male,0,Yes,Yes,20,No,No phone service,1,Yes,...,No,3,Yes,4,39.4,825.4,0,Two year,Credit card (automatic),DSL
4,0078-XZMHT,Male,0,Yes,No,72,Yes,Yes,1,No,...,Yes,3,Yes,3,85.15,6316.2,0,Two year,Bank transfer (automatic),DSL


In [18]:
df.phone_service.unique()

array(['Yes', 'No'], dtype=object)

In [13]:
#split data
train, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train, test_size=.3, random_state=123)

In [14]:
print(f'train shape: {train.shape}')
print(f'validate shape: {validate.shape}')
print(f'test shape: {test.shape}')

train shape: (3943, 24)
validate shape: (1691, 24)
test shape: (1409, 24)
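
The shapes printed above follow from the two split fractions: 20% of all rows go to test, then 30% of the remainder goes to validate, which leaves roughly 56% / 24% / 20% for train / validate / test. The arithmetic can be sketched as below, assuming the test portion of each split is rounded up, which is how I understand `train_test_split` to size a float `test_size`. The function name `split_sizes` is made up for this example.

```python
from math import ceil


def split_sizes(n_rows, test_size=0.2, validate_size=0.3):
    """Row counts for a two-step train/validate/test split."""
    n_test = ceil(test_size * n_rows)          # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_validate = ceil(validate_size * n_remaining)  # second split: hold out validate
    n_train = n_remaining - n_validate
    return n_train, n_validate, n_test


print(split_sizes(7043))  # (3943, 1691, 1409)
```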


## Takeaways of the Prepare process:

- prep_telco_churn(df) function was created to:
    - change data types 
    - remove duplicates (if any) 
   
- the output of "prep_telco_churn(df)" was then reassigned to "df"
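
A minimal sketch of what a prepare function along these lines might do is shown below. It is not the actual `prepare.py`, which is not shown here. The specific conversions (numeric `total_charges`, `churn` mapped from Yes/No to 1/0) are assumptions inferred from the outputs above, and `prep_telco_sketch` is a hypothetical name.

```python
import pandas as pd


def prep_telco_sketch(df):
    """Drop duplicate rows and convert a few columns to model-friendly types."""
    df = df.drop_duplicates().copy()
    # Assumed: total_charges may arrive as text; unparseable values become NaN.
    df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")
    # Assumed: churn arrives as 'Yes'/'No' and is encoded as 1/0.
    df["churn"] = df["churn"].map({"Yes": 1, "No": 0})
    return df
```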


<hr style="border:2px solid black"> </hr>

# Explore

Plan --- Acquire --- Prepare --- **Explore** --- Model --- Deliver

In [15]:
#take a look at the data
train.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,streaming_movies,contract_type_id,paperless_billing,payment_type_id,monthly_charges,total_charges,churn,contract_type,payment_type,internet_service_type
4604,5564-NEMQO,Female,1,No,No,1,Yes,No,2,No,...,No,1,Yes,3,75.3,75.3,1,Month-to-month,Bank transfer (automatic),Fiber optic
5566,0825-CPPQH,Female,0,Yes,No,71,Yes,No,3,No internet service,...,No internet service,3,No,4,19.1,1372.45,0,Two year,Credit card (automatic),
6204,1561-BWHIN,Male,0,Yes,Yes,19,Yes,No,3,No internet service,...,No internet service,2,No,2,19.8,344.5,0,One year,Mailed check,
5837,4979-HPRFL,Male,0,Yes,Yes,56,Yes,Yes,3,No internet service,...,No internet service,3,No,3,24.15,1402.25,0,Two year,Bank transfer (automatic),
1276,0749-IRGQE,Female,1,Yes,No,13,No,No phone service,1,No,...,Yes,1,No,1,45.3,528.45,0,Month-to-month,Electronic check,DSL


In [16]:
#data has int and object data types
#look at column names
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3943 entries, 4604 to 6958
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               3943 non-null   object 
 1   gender                    3943 non-null   object 
 2   senior_citizen            3943 non-null   int64  
 3   partner                   3943 non-null   object 
 4   dependents                3943 non-null   object 
 5   tenure                    3943 non-null   int64  
 6   phone_service             3943 non-null   object 
 7   multiple_lines            3943 non-null   object 
 8   internet_service_type_id  3943 non-null   int64  
 9   online_security           3943 non-null   object 
 10  online_backup             3943 non-null   object 
 11  device_protection         3943 non-null   object 
 12  tech_support              3943 non-null   object 
 13  streaming_tv              3943 non-null   object 
 14  strea

### data summary:
- 16 object data types
- 6 integer data types
- 2 float data types

In [None]:
#descriptive statistics
train.describe()

In [None]:
#we are trying to determine churn... look into that
df.churn.value_counts()

#this shows 26.54% of all customers churn (1869 out of 7043)

In [None]:
#we are trying to determine churn... look into that
train.churn.value_counts()

#this shows 26.88% of customers in the train data set churn (1060 out of 3943)

In [None]:
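#find correlation- I think age (senior_citizen) correlates the most with churn
#use numeric columns only; newer pandas versions raise an error on object columns
telco_correlation = train.select_dtypes('number').corr()
telco_correlation
#this shows that senior_citizen and monthly_charges have the highest positive correlation with churn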

In [None]:
#this gives the correlation of every numeric column with JUST churn
telco_corr_churn = telco_correlation['churn'].sort_values(ascending=False)
telco_corr_churn

## again, senior_citizen is the second highest
## THIS will indicate my DRIVER of churn!!

In [None]:
#heatmap to show correlation of all data
plt.figure(figsize=(16,9))

sns.heatmap(telco_correlation, cmap='YlGnBu', center=0, annot=True)

plt.title('Correlation of Telco Data')

plt.show()

In [None]:
#visualize churn data using a countplot
plt.figure(figsize=(10,6))
sns.countplot(x='churn', data=train)
plt.show()

In [None]:
#average monthly charges, max monthly charges, min monthly charges
df.monthly_charges.mean(), df.monthly_charges.max(), df.monthly_charges.min()

In [None]:
#only using TRAIN data average monthly charges, max monthly charges, min monthly charges
train.monthly_charges.mean(), train.monthly_charges.max(), train.monthly_charges.min()

In [None]:
#visualize monthly charge data
plt.figure(figsize=(16,9))
sns.countplot(x='monthly_charges', data=train)
plt.show()

In [None]:
#find out how many are senior citizens and how many are not
df.senior_citizen.value_counts()

#this shows that 16.21% of all customers are seniors (1142 out of 7043)

In [None]:
#find out how many are senior citizens and how many are not
train.senior_citizen.value_counts()

#this shows that 16.21% of customers in the train dataset are seniors (639 out of 3943)

In [None]:
#visualize senior_citizen data using countplot
plt.figure(figsize=(10,6))
sns.countplot(x='senior_citizen', data=train)
plt.show()

In [None]:
#this plot shows senior citizen (1) vs non-senior citizen (0)
## who churn (1) vs do not churn (0)
plt.figure(figsize=(10,6))
sns.countplot(x='churn', hue='senior_citizen', data=train)
plt.show()

In [None]:
#took a look at tenure out of curiosity
#average tenure, max tenure, min tenure
df.tenure.mean(), df.tenure.max(), df.tenure.min()

In [None]:
#average tenure, max tenure, min tenure??
train.tenure.mean(), train.tenure.max(), train.tenure.min()

In [None]:
#visualize tenure data
plt.figure(figsize=(16,9))
sns.countplot(x='tenure', data=train)
plt.show()

In [None]:
#find the actual count for the top 5 tenures
tenure_df = train['tenure'].value_counts().sort_values(ascending=False).head()
tenure_df

#this shows that 8.34% of customers in the train dataset only have ONE month of tenure (329 out of 3943)

In [None]:
#find the actual count for the top 5 monthly_charges
monthly_charges_df = train['monthly_charges'].value_counts().sort_values(ascending=False).head()
monthly_charges_df

## Takeaways of the Explore process:

- 7043 total customers
    - of those: 
        - 5901 are NOT senior_citizen (83.79%) while 1142 ARE senior_citizen (16.21%)
        - 5174 do NOT churn (73.46%) while 1869 DO churn (26.54%)
- Positive correlation with churn is strongest for 'monthly_charges' and 'senior_citizen'
    - 'senior_citizen' is what interests me most to further explore/test


_______________

### Now.. I will find the appropriate statistical test to use
- we are using the two following variables: churn (discrete/categorical) and senior_citizen (discrete/categorical)
- these are 2 discrete/categorical variables
<br>

- **Therefore, I will be using $\chi^2$ testing**

In [None]:
#create contingency table (crosstab of observed counts)
observed = pd.crosstab(df.churn, df.senior_citizen)
observed

In [None]:
#set alpha
alpha = 0.05

In [None]:
#chi2 contingency returns 4 different values
chi2, p, degf, expected = stats.chi2_contingency(observed)
chi2, p, degf, expected

In [None]:
## make it easier to read
print('Observed\n')
print(observed.values)
print('---------------------\nExpected\n')
print(expected.astype(int))
print('---------------------\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')
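
To make the "Expected" table less of a black box, here is a small worked sketch of how expected counts and the $\chi^2$ statistic come out of an observed table. The counts in `toy` are made up for illustration and are not the Telco numbers. Note that `stats.chi2_contingency` applies Yates' continuity correction by default on a 2x2 table, so its statistic corresponds to `yates=True` below; the function names are my own.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi2_statistic(observed, yates=False):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Continuity correction: shrink each difference by 0.5 (not below 0).
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / e
    return total


# Made-up 2x2 table: rows = churn (0/1), columns = senior_citizen (0/1).
toy = [[40, 10], [20, 30]]
print(expected_counts(toy))                   # [[30.0, 20.0], [30.0, 20.0]]
print(round(chi2_statistic(toy), 4))          # 16.6667
print(round(chi2_statistic(toy, yates=True), 4))  # 15.0417
```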

## Hypothesis:

- **$H_{0}$**: there is no relationship between churn and customer age (senior_citizen status)
<br>

- **$H_{a}$**: there is a relationship between churn and customer age (senior_citizen status)
<br>


- **True Positive**: I predict there is a relationship between churn and customer age and there is a relationship
<br>

- **True Negative**: I predict there is no relationship between churn and customer age and there is no relationship
<br>

- **False Positive** (Type I error): I predict there is a relationship between churn and customer age and there is no relationship
<br>

- **False Negative** (Type II error): I predict there is no relationship between churn and customer age and there is a relationship

## Further takeaways of the Explore process:

- $\chi^2$ test was used because we were dealing with two discrete variables (churn and senior_citizen)


In [None]:
print(f'The p-value is less than the alpha: {p < alpha}')

if p < alpha:
    print('Conclusion: We reject the null hypothesis')
    print('Takeaways: There is evidence of a relationship between customer age and churn')
else:
    print('Conclusion: We fail to reject the null hypothesis')
    print('Takeaways: There is not enough evidence of a relationship between customer age and churn')

<hr style="border:2px solid black"> </hr>

# Model and Evaluate

Plan --- Acquire --- Prepare --- Explore --- **Model** --- Deliver

In [None]:
#we've already split the data in previous steps
#train, test = train_test_split(df, test_size=.2, random_state=123)
#train, validate = train_test_split(train, test_size=.3, random_state=123)

In [None]:
#get value count to determine what baseline will be equal to
train.churn.value_counts()

In [None]:
#create baseline
#because the majority (in value count) was '0', we will use this as our baseline
train['baseline_pred'] = 0

In [None]:
##print statement for accuracy of baseline
baseline_accuracy = (train.churn == train.baseline_pred).mean()
print(f'The baseline accuracy is: {baseline_accuracy:.2%}')
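
The baseline is simply the share of the majority class, so it can be checked by hand from the train counts reported earlier (1060 churners out of 3943 train rows). A quick sketch of that arithmetic:

```python
# Train-set class counts from the Explore section: 3943 rows, 1060 of them churn.
train_counts = {0: 3943 - 1060, 1: 1060}

# Always predicting the majority class is right exactly as often as that class occurs.
majority_class = max(train_counts, key=train_counts.get)
baseline_accuracy = train_counts[majority_class] / sum(train_counts.values())

print(majority_class)               # 0
print(f"{baseline_accuracy:.2%}")   # 73.12%
```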

## MVP Models
- I'm going to try Logistic Regression, DecisionTree, and RandomForest Models with
    - 'senior_citizen'
    - 'tenure'
    - 'monthly_charges'
- My goal is to beat my 73.12% baseline accuracy.
- Hyperparameters I've adjusted are:
    - setting the max_depth for the DecisionTree model to max_depth=5 to avoid overfitting
    - setting the max_depth for the Random Forest Model to max_depth=10 to avoid overfitting
    - setting the random_state=123 for DecisionTree, RandomForest, and LogisticRegression models.

In [None]:
#specify columns to use
X_col= ['senior_citizen','tenure', 'monthly_charges']
y_col= 'churn'

In [None]:
#specify train, validate, test
X_train = train[X_col]
y_train= train[y_col]

X_validate = validate[X_col]
y_validate= validate[y_col]

X_test = test[X_col]
y_test= test[y_col]

In [None]:
#get shape of sets
train.shape, validate.shape, test.shape

In [None]:
#shape of train set
X_train.shape, y_train.shape

In [None]:
#take a look at X_train data
X_train.head()

_______________________

# Evaluate MVP Models

## Logistic Regression Model 

In [None]:
#import
from sklearn.linear_model import LogisticRegression

In [None]:
#Define the logistic regression model
logit_model = LogisticRegression(C=0.1, random_state= 123)

In [None]:
#fit the model with train data 
logit_model.fit(X_train, y_train)

In [None]:
#now use the model to make predictions
y_pred = logit_model.predict(X_train)

In [None]:
#classification report
pd.DataFrame(classification_report(y_train, y_pred, output_dict=True))

In [None]:
#import function created to give score, rates, confusion matrix, and classification report
import model_func

In [None]:
model_func.model_performs(X_train, y_train, logit_model)

In [None]:
model_func.model_performs(X_validate, y_validate, logit_model)
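
The `model_func` module is not shown in this notebook. As a rough idea of what a helper like `model_performs` could compute, here is a hypothetical pure-Python sketch for a binary target coded 0/1. The name `model_performs_sketch`, the returned dictionary, and its keys are my assumptions, not the real function's interface.

```python
def model_performs_sketch(X, y, model):
    """Accuracy, confusion matrix, and basic rates for a fitted binary classifier."""
    pairs = list(zip(list(y), list(model.predict(X))))
    tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
    fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

    def safe_div(numerator, denominator):
        # Avoid ZeroDivisionError when a class is never present or never predicted.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        # Rows are actual classes, columns are predicted classes.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "true_positive_rate": safe_div(tp, tp + fn),
        "false_positive_rate": safe_div(fp, fp + tn),
        "precision": safe_div(tp, tp + fp),
    }
```

Any object with a `predict` method works here, so the same call pattern applies to the logistic regression, random forest, and decision tree models.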

### Logistic Regression Model Takeaways: 

- Logistic Regression Model has an accuracy of 79.10% using the train set
- Logistic Regression Model has an accuracy of 78.83% using the validate set
- Both are **higher** than the baseline accuracy of 73.12% 
    

___________

## Random Forest Model

In [None]:
#import
from sklearn.ensemble import RandomForestClassifier

In [None]:
#create the random forest model
rf_model= RandomForestClassifier(min_samples_leaf = 1, max_depth = 10, random_state= 123)

In [None]:
#fit the model (ONLY on the train set!)
rf_model.fit(X_train, y_train)

In [None]:
#evaluate the model
#train data set score, confusion matrix and classification report
model_func.model_performs(X_train, y_train, rf_model)

In [None]:
#evaluate the model
#validate data set score, confusion matrix and classification report
model_func.model_performs(X_validate, y_validate, rf_model)

### Random Forest Model Takeaways: 

- Random Forest Model has an accuracy of 87.29% using the train set
- Random Forest Model has an accuracy of 78.12% using the validate set
- Both are **higher** than the baseline accuracy of 73.12% 
- **but** there is *too much* difference between the train and validate sets (the model is overfit)


___________________

## Decision Tree Model

In [None]:
#create model (random_state set as described in the MVP Models section)
dt_model = DecisionTreeClassifier(max_depth=5, random_state=123)

In [None]:
#fit model
dt_model.fit(X_train, y_train)

In [None]:
#evaluate the model
#train data set score, confusion matrix and classification report
model_func.model_performs(X_train, y_train, dt_model)

In [None]:
#visualize the decision tree that was fit on the train set
plt.figure(figsize=(24,12))

plot_tree(dt_model, feature_names=X_train.columns.tolist(), filled=True, rounded=True, class_names=['no churn', 'churn'])
plt.show()

In [None]:
#evaluate the model
#validate data set score, confusion matrix and classification report
model_func.model_performs(X_validate, y_validate, dt_model)

In [None]:
#plot the tree again for reference (it is the same fitted tree; validate has the same feature columns)
plt.figure(figsize=(24,12))

plot_tree(dt_model, feature_names=X_validate.columns.tolist(), filled=True, rounded=True, class_names=['no churn', 'churn'])
plt.show()

### Decision Tree Model Takeaways: 

- Decision Tree Model has an accuracy of 79.36% on the train set with max_depth=5
- Decision Tree Model has an accuracy of 78.95% on the validate set with max_depth=5
- Both are **higher** than the baseline accuracy of 73.12% 
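
The comparison across the three sets of takeaways can be written down as a small selection rule: discard any model whose train accuracy exceeds its validate accuracy by too much, then pick the highest validate accuracy among the rest. The sketch below uses the accuracies reported in this notebook; the 2-point tolerance `MAX_GAP` and the function name `pick_model` are assumptions for illustration, not thresholds I actually tuned.

```python
# Accuracies (in %) as reported in the takeaways for each model.
results = {
    "logistic_regression": {"train": 79.10, "validate": 78.83},
    "random_forest": {"train": 87.29, "validate": 78.12},
    "decision_tree": {"train": 79.36, "validate": 78.95},
}

MAX_GAP = 2.0  # assumed tolerance, in percentage points, before calling a model overfit


def pick_model(results, max_gap):
    """Best validate accuracy among models whose train/validate gap is acceptable."""
    eligible = {
        name: scores
        for name, scores in results.items()
        if scores["train"] - scores["validate"] <= max_gap
    }
    return max(eligible, key=lambda name: eligible[name]["validate"])


best = pick_model(results, MAX_GAP)
print(best)  # decision_tree
```

Under this rule the random forest is excluded (a gap of about 9 points), and the decision tree edges out logistic regression on validate accuracy.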


### Run Decision Tree Model on Test Dataset

In [None]:
#evaluate the model
#test data set score, confusion matrix and classification report
model_func.model_performs(X_test, y_test, dt_model)

### Decision Tree Model with Test Dataset: 

- Decision Tree Model has an accuracy of 78.70% on the test set with max_depth=5
- This is **higher** than the baseline accuracy of 73.12% 


<hr style="border:2px solid black"> </hr>

## Create CSV for Predictions

In [None]:
import prepare

In [None]:
df_all_data = prepare.prep_telco_churn(acquire.get_telco_churn_data())

In [None]:
#create column that has prediction based on decision tree model
df_all_data ['predictions'] = dt_model.predict(df_all_data[X_col])

In [None]:
#create dataframe that shows if that particular customer_id will churn or not
df_predictions = df_all_data[['customer_id', 'predictions']]

In [None]:
#take a look at this prediction data
df_predictions.head()

In [None]:
df_predictions.predictions.value_counts()
# based on predictions, we have 1592 customers who have the potential to churn!!

In [None]:
#turn this new dataframe of customer_id and predictions into a CSV file
df_predictions.to_csv('telco_churn_predictions.csv')
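
The project plan also calls for the probability of the target value alongside each prediction. One way that could look is sketched below, assuming a fitted classifier that exposes `predict_proba` and a 0/1 target so that the second probability column is the churn class. The function name `build_prediction_frame` and the output column names are hypothetical.

```python
import pandas as pd


def build_prediction_frame(model, features, customer_ids):
    """One row per customer: id, probability of churn, and predicted class."""
    probabilities = model.predict_proba(features)
    return pd.DataFrame({
        "customer_id": list(customer_ids),
        # Assumes classes are ordered [0, 1], so column 1 is the probability of churn.
        "probability_of_churn": [row[1] for row in probabilities],
        "prediction": list(model.predict(features)),
    })
```

For the test set this might be used as `build_prediction_frame(dt_model, X_test, test.customer_id).to_csv('predictions.csv', index=False)`, where the file name is just a placeholder.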

<hr style="border:2px solid black"> </hr>

## Conclusion/Final Takeaway:
- Overall, the decision tree model performed best

    - there is acceptable accuracy on both the train set and the validate set
    - the score is higher than the baseline
    - there is not a large drop off of accuracy between the two sets (thus it is not overfit)

<br>
- Use the new dataframe of predictions to target those specific customers that have the most potential to churn

<hr style="border:2px solid black"> </hr>

## Next Steps/If I had more time:
- I would run more models and change the hyperparameters on several different versions
- I would look into surveying exiting customers to better understand their actual cause of churn
- We can then target the true reason to reduce churn in future customers
