# Why Are Customers Churning?

Project goals:
* Discover drivers of customers churning and other key insights.
* Use main drivers to develop a machine learning model to predict churning customers.
* Use insights and model evaluation to identify features to address when a customer is predicted to churn.

* Random state of 125 for reproducibility

## Imports

In [1]:
import acquire as a
import prepare as p
import explore as e
import mmodel as m

In [2]:
import pandas as pd
pd.set_option('display.max_columns', 100)

## Acquire

Use pandas to run a SQL query against our MySQL database, where the Telco data is stored.

In [3]:
telco_raw = a.get_telco_data()
print(f'Number of Rows: {telco_raw.shape[0]}\nNumber of Columns: {telco_raw.shape[1]}')
telco_raw.head(3)

Number of Rows: 7043
Number of Columns: 24


Unnamed: 0,payment_type_id,internet_service_type_id,contract_type_id,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,2,1,2,0002-ORFBO,Female,0,Yes,Yes,9,Yes,No,No,Yes,No,Yes,Yes,No,Yes,65.6,593.3,No,One year,DSL,Mailed check
1,2,1,1,0003-MKNFE,Male,0,No,No,9,Yes,Yes,No,No,No,No,No,Yes,No,59.9,542.4,No,Month-to-month,DSL,Mailed check
2,1,2,1,0004-TLHLJ,Male,0,No,No,4,Yes,No,No,No,Yes,No,No,No,Yes,73.9,280.85,Yes,Month-to-month,Fiber optic,Electronic check
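The body of `a.get_telco_data()` is not shown in this notebook. As a rough sketch of the read-a-query-with-pandas pattern, the example below uses a hypothetical helper, `read_query`, and an in-memory SQLite database as a stand-in for the MySQL server. The table name, column names, and rows are made up for illustration.

```python
import sqlite3

import pandas as pd


def read_query(query: str, connection) -> pd.DataFrame:
    """Run a SQL query and return the result as a DataFrame (hypothetical helper)."""
    return pd.read_sql(query, connection)


# Stand-in database: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, tenure INTEGER, churn TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("A-001", 9, "No"), ("A-002", 4, "Yes")],
)
conn.commit()

demo = read_query("SELECT * FROM customers", conn)
print(demo.shape)
```

Against MySQL, the connection would typically be a SQLAlchemy engine built from a connection URL. The credentials and schema name are not given here, so they are left out.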


## Prepare

Data Transformations

1. Inspected raw data
    * Dropped foreign key columns
    * Checked for missing values
       * Found `' '` in the `total_charges` column and replaced each one with `np.nan`
       * Converted the `total_charges` column to a numeric data type and imputed each `np.nan` with the median total charges
    * Checked for duplicate rows and duplicate customer IDs and found none
    * Encoded categorical columns

1. Inspected clean data
    * Ensured data was tidy:
        * one value per cell
        * each observation is one and only one row
        * each feature is one and only one column

1. Split the data
    * Saw class imbalance (73% to 27%) for the target, `churn`.
    * Performed a 60/20/20 (train/validate/test) stratified split
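
The cleaning steps for `total_charges` and the duplicate checks can be sketched as follows. This is not the code inside `p.prep_telco()`, which is not shown here; the rows are made up and only the column names `customer_id` and `total_charges` come from the data.

```python
import numpy as np
import pandas as pd

# Made-up rows; the blank string mimics the ' ' values found in total_charges.
raw = pd.DataFrame({
    "customer_id": ["A-001", "A-002", "A-003", "A-004"],
    "total_charges": ["593.3", " ", "280.85", "542.4"],
})

clean = raw.copy()
# Replace the blank strings with NaN, then convert to a numeric dtype.
clean["total_charges"] = pd.to_numeric(clean["total_charges"].replace(" ", np.nan))
# Impute the missing values with the column median.
median_charges = clean["total_charges"].median()
clean["total_charges"] = clean["total_charges"].fillna(median_charges)

# Count duplicate rows and duplicate customer IDs (both should be zero).
n_duplicate_rows = int(clean.duplicated().sum())
n_duplicate_ids = int(clean["customer_id"].duplicated().sum())
print(clean["total_charges"].tolist(), n_duplicate_rows, n_duplicate_ids)
```

The median is used rather than the mean because it is less sensitive to a few very large totals.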
    

Clean raw data

In [4]:
telco_clean = p.prep_telco()
print(f"Number of Customers: {telco_clean.shape[0]}\
        \nNumber of Columns: {telco_clean.shape[1]}")
telco_clean.head()

Number of Customers: 7043        
Number of Columns: 42


Unnamed: 0,customer_id,gender_male,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,multiple_lines_no_phone_service,multiple_lines_yes,online_security_no_internet_service,online_security_yes,online_backup_no_internet_service,online_backup_yes,device_protection_no_internet_service,device_protection_yes,tech_support_no_internet_service,tech_support_yes,streaming_tv_no_internet_service,streaming_tv_yes,streaming_movies_no_internet_service,streaming_movies_yes,contract_type_one_year,contract_type_two_year,internet_service_type_fiber_optic,internet_service_type_none,payment_type_credit_card_(automatic),payment_type_electronic_check,payment_type_mailed_check
0,0002-ORFBO,0,0,1,1,9,1,No,No,Yes,No,Yes,Yes,No,1,65.6,593.3,0,One year,DSL,Mailed check,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1
1,0003-MKNFE,1,0,0,0,9,1,Yes,No,No,No,No,No,Yes,0,59.9,542.4,0,Month-to-month,DSL,Mailed check,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
2,0004-TLHLJ,1,0,0,0,4,1,No,No,No,Yes,No,No,No,1,73.9,280.85,1,Month-to-month,Fiber optic,Electronic check,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0
3,0011-IGKFF,1,1,1,0,13,1,No,No,Yes,Yes,No,Yes,Yes,1,98.0,1237.85,1,Month-to-month,Fiber optic,Electronic check,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0
4,0013-EXCHZ,0,1,1,0,3,1,No,No,No,No,Yes,Yes,No,1,83.9,267.4,1,Month-to-month,Fiber optic,Mailed check,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1


Perform a 60/20/20 train/validate/test split.

In [6]:
telco_clean['churn'].value_counts(normalize=True)

0    0.73463
1    0.26537
Name: churn, dtype: float64

Stratify on `churn` column since we have a 73/27 class imbalance.

In [29]:
train, validate, test = p.split_data(telco_clean, stratify_col='churn', random_state=125)

Verify Split

In [30]:
# check set sizes
print(len(train), len(validate), len(test))
# check target proportions match original data (stratified)
train['churn'].value_counts(normalize=True)

4225 1409 1409


0    0.734675
1    0.265325
Name: churn, dtype: float64
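
The set sizes printed above (4225, 1409, 1409) are roughly 60%, 20%, and 20% of the 7043 customers. The implementation of `p.split_data` is not shown, so the following is only a sketch of one way to get a stratified three-way split in those proportions with scikit-learn. The function name `split_three_ways` and the synthetic frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, stratify_col, random_state=125):
    """Hypothetical 60/20/20 stratified split; not the project's p.split_data."""
    # First carve off 20% of the rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=0.2, stratify=df[stratify_col], random_state=random_state
    )
    # 25% of the remaining 80% is 20% of the original data.
    train, validate = train_test_split(
        train_validate,
        test_size=0.25,
        stratify=train_validate[stratify_col],
        random_state=random_state,
    )
    return train, validate, test


# Synthetic frame with a 73/27 imbalance similar to the churn target.
demo = pd.DataFrame({"tenure": range(1000), "churn": [0] * 730 + [1] * 270})
demo_train, demo_validate, demo_test = split_three_ways(demo, "churn")
print(len(demo_train), len(demo_validate), len(demo_test))
```

Stratifying both calls on the target keeps the churn rate nearly identical across the three subsets, which is what the `value_counts(normalize=True)` check verifies.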

## Explore

Explore data in search of drivers of churn 
   1. General Inspect
       - `.info()` and `.describe()`
       - identify continuous and categorical columns
   1. Univariate Stats: 
       - Categorical
       - Numerical
   1. Bivariate Stats:
       - Categorical features to target relationships
       - Continuous features to target relationship

   1. Ask and answer specific questions:
       - Which drivers appear to relate to churn the most?

Key Insights:
- Strong Drivers of Churn:
    - Tenure
    - Contract Type
    - Payment by Electronic Check
- Weak Drivers of Churn:
    - Gender

Load the data. Explore only the training data.

In [None]:
telco, _, _ = p.split_data(p.prep_telco(), stratify_col='churn', random_state=125)




### 1) Question

### 2) Visualization

### 3) Statistical test
* Test Name
* Hypotheses and significance level
* Verify Assumptions
* Run test
* Interpret the results of the test
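
As an illustration of this template, here is a sketch of a chi-square test of independence between a categorical feature and `churn`, using `scipy.stats.chi2_contingency`. The observations are synthetic and the 0.05 significance level is an assumption, so the numbers it prints say nothing about the Telco data.

```python
import pandas as pd
from scipy import stats

# Synthetic observations, for illustration only (not the Telco results).
demo = pd.DataFrame({
    "contract_type": ["Month-to-month"] * 100 + ["Two year"] * 100,
    "churn": [1] * 45 + [0] * 55 + [1] * 5 + [0] * 95,
})

alpha = 0.05  # significance level, assumed here
# H0: churn is independent of contract type.
# Ha: churn is not independent of contract type.
observed = pd.crosstab(demo["contract_type"], demo["churn"])
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Assumption check: every expected cell count should be at least 5.
assumptions_met = bool((expected >= 5).all())
reject_null = bool(p_value < alpha)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")
print("assumptions met:", assumptions_met, "| reject H0:", reject_null)
```

For a continuous feature such as `tenure`, a two-sample test comparing churned and retained customers would fill the same template.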

### 4) Insight/Conclusion


## Exploration Summary

## Modeling

* We will evaluate the model performances based on their accuracy scores.

### Baseline 
* create and interpret baseline model

In [None]:
baseline_scores = m.run_baseline_model(train, validate, 'churn')
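
The internals of `m.run_baseline_model` are not shown. A common baseline for classification is to always predict the most frequent class in the training data, which would give an accuracy near the majority-class share. A minimal sketch under that assumption, with a hypothetical helper `baseline_accuracy` and synthetic targets:

```python
import pandas as pd


def baseline_accuracy(train: pd.DataFrame, other: pd.DataFrame, target: str) -> float:
    """Accuracy of always predicting the most common class seen in train."""
    most_common = train[target].mode()[0]
    return float((other[target] == most_common).mean())


# Synthetic targets with a 73/27 imbalance similar to churn.
demo_train = pd.DataFrame({"churn": [0] * 73 + [1] * 27})
demo_validate = pd.DataFrame({"churn": [0] * 73 + [1] * 27})
demo_baseline = baseline_accuracy(demo_train, demo_validate, "churn")
print(demo_baseline)
```

Any candidate model has to beat this number to be worth using, since it can be reached without looking at a single feature.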

### Propose best models

In [None]:
knn_scores = m.run_knn_models(train, validate, 'churn')

lr_scores = m.run_lr_models(train, validate, 'churn')

dt_scores = m.run_dt_models(train, validate, 'churn')

scores = pd.concat([baseline_scores, knn_scores, lr_scores, dt_scores])

### Evaluate best model on test data


In [10]:
best_model = m.run_best_model(train, validate,
                                features=['contract_type_one_year', 
                                          'internet_service_type_fiber_optic',
                                          'online_security_yes',
                                          'online_security_no_internet_service',
                                          'payment_type_electronic_check',
                                          'tenure'])
best_model

### Modeling wrap and recommendations

* Our best model, a decision tree classifier using 6 features with a max depth of 8, reached an accuracy of 81%, up from the 73% accuracy of our baseline model.
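
For reference, a decision tree with a max depth of 8 can be fit and scored with scikit-learn roughly as sketched below. This is not the code behind `m.run_best_model`, which is not shown. The data is synthetic, only two of the six feature names are used, and the churn rule is invented so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(125)
n = 500
# Synthetic stand-ins for two of the six features; not the Telco data.
X = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),
    "payment_type_electronic_check": rng.integers(0, 2, size=n),
})
# Made-up rule: short-tenure customers paying by electronic check churn.
y = ((X["tenure"] < 12) & (X["payment_type_electronic_check"] == 1)).astype(int)

X_train, y_train = X.iloc[:400], y.iloc[:400]
X_validate, y_validate = X.iloc[400:], y.iloc[400:]

tree = DecisionTreeClassifier(max_depth=8, random_state=125)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
validate_accuracy = tree.score(X_validate, y_validate)
print(f"train: {train_accuracy:.2f}, validate: {validate_accuracy:.2f}")
```

Comparing train and validate accuracy side by side is a quick check for overfitting: a large gap suggests the tree is too deep for the data.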


## Conclusion

### Summary
* summarize findings and answers

### Recommendations
* Recommend stakeholder actions

### Next Steps