1. [Prep Your Repo](#prep-your-repo)
2. [Import](#import)
3. [Acquire Data](#acquire-data)
4. [Clean, Prep & Split Data](#clean-prep-and-split-df)
5. [Explore Data](#explore-data)
6. [Create a Baseline Model]
7. [Create and Compare Different Models]
8. [Predict on Test Model]
9. [Exporting CSV with Predictions]

# <a name="prep-your-repo"></a>1. Prep Your Repo
1. Create new repo and name it: 'classification_proj'
    - clone
2. Create .gitignore that includes env.py
    - push
3. Create an env.py file that stores the MySQL login credentials needed to obtain the TELCO data.
    - save
    - confirm it is ignored (git status)
4. Create a README.md file to document the steps taken so far.
    - save
    - push
5. Create a JupyterLab environment to continue working in.
6. Create a Jupyter Notebook to begin the data pipeline: 'telco001'
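
The `env.py` file from step 3 is not shown in this README. A minimal sketch of what such a file might contain is below. The placeholder values, the `get_db_url` helper, and the `mysql+pymysql` driver prefix are all assumptions for illustration, not the project's actual contents.

```python
# env.py -- hypothetical contents; keep this file listed in .gitignore
# and never commit real credentials.
host = "db.example.com"      # placeholder host
user = "your_username"       # placeholder user
password = "your_password"   # placeholder password


def get_db_url(db_name, user=user, host=host, password=password):
    """Build a SQLAlchemy-style connection URL for a MySQL database.

    The pymysql driver is an assumption; swap in whichever driver is installed.
    """
    return f"mysql+pymysql://{user}:{password}@{host}/{db_name}"
```

A call such as `get_db_url("telco_churn")` (the database name is also an assumption) would return a URL string that `pandas.read_sql` can accept alongside a query.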

# <a name="import"></a>2. Import 
Import all necessary libraries and functions. 

In [1]:
# pd & np libraries to make life easier
import pandas as pd
import numpy as np

# visualizers I'll be using
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
from graphviz import Graph

# to perform stats tests
from scipy import stats

# sklearn classes and functions used in this project
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

# to view Zach's threshold graph
import logistic_regression_util

# project modules: acquire, prepare, explore, and model
import prepare
import acquire
import explore
import model


# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# <a name="acquire-data"></a>3. Acquire Data
Read the `TELCO` data from SQL.

In [2]:
# read TELCO data from SQL
df = acquire.get_telco_data()

df.head()

|   | payment_type_id | internet_service_type_id | contract_type_id | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | ... | tech_support | streaming_tv | streaming_movies | paperless_billing | monthly_charges | total_charges | churn | contract_type | internet_service_type | payment_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 0003-MKNFE | Male | 0 | No | No | 9 | Yes | ... | No | No | Yes | No | 59.9 | 542.4 | No | Month-to-month | DSL | Mailed check |
| 1 | 4 | 1 | 1 | 0013-MHZWF | Female | 0 | No | Yes | 9 | Yes | ... | Yes | Yes | Yes | Yes | 69.4 | 571.45 | No | Month-to-month | DSL | Credit card (automatic) |
| 2 | 1 | 1 | 1 | 0015-UOCOJ | Female | 1 | No | No | 7 | Yes | ... | No | No | No | Yes | 48.2 | 340.35 | No | Month-to-month | DSL | Electronic check |
| 3 | 1 | 1 | 1 | 0023-HGHWL | Male | 1 | No | No | 1 | No | ... | No | No | No | Yes | 25.1 | 25.1 | Yes | Month-to-month | DSL | Electronic check |
| 4 | 3 | 1 | 1 | 0032-PGELS | Female | 0 | Yes | Yes | 1 | No | ... | No | No | No | No | 30.5 | 30.5 | Yes | Month-to-month | DSL | Bank transfer (automatic) |
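
`acquire.get_telco_data()` lives in `acquire.py`, which is not shown here. One common pattern for a helper like this is to cache the query result in a local CSV so the database is only queried once. The sketch below illustrates that pattern only; `get_cached_data`, `cache_path`, and `load_fn` are hypothetical names and not part of the project.

```python
import os

import pandas as pd


def get_cached_data(cache_path, load_fn):
    """Return the cached CSV if it exists; otherwise load, cache, and return.

    load_fn is any zero-argument callable that returns a DataFrame,
    for example a wrapper around pd.read_sql.
    """
    if os.path.isfile(cache_path):
        # The first CSV column holds the saved index.
        return pd.read_csv(cache_path, index_col=0)
    df = load_fn()
    df.to_csv(cache_path)
    return df
```

In use, `load_fn` could be something like `lambda: pd.read_sql(query, url)`, where `query` and `url` are whatever the project defines.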


# <a name="clean-prep-and-split-df"></a>4. Clean, Prep and Split `df`
This step uses three functions in the `prepare.py` file that build on each other:
- [`clean_telco()`](prepare.py)
- [`prep_telco()`](prepare.py)
- [`train_validate_test()`](prepare.py)
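
The bodies of these functions are in `prepare.py` and are not reproduced here. As a rough illustration of the kind of encoding a prep step performs (Yes/No columns mapped to 1/0, a categorical column expanded into dummy columns), here is a minimal sketch. The function name `encode_for_modeling`, the choice of columns, and the `pymt_type` prefix are assumptions; the project's own functions may work differently.

```python
import pandas as pd


def encode_for_modeling(df):
    """Illustrative encoding only; not the project's prep_telco()."""
    out = df.copy()
    # Map binary Yes/No columns to 1/0.
    for col in ["partner", "churn"]:
        out[col] = out[col].map({"Yes": 1, "No": 0})
    # Expand payment_type into one indicator column per category.
    dummies = pd.get_dummies(out["payment_type"], prefix="pymt_type")
    return pd.concat([out.drop(columns="payment_type"), dummies], axis=1)
```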

In [3]:
# test prep_telco & train_validate_test
train, validate, test = prepare.train_validate_test(df)

Verify the size of each split.

In [4]:
train.shape, validate.shape, test.shape

((3943, 34), (1691, 34), (1409, 34))
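
These row counts sum to 7,043 and are consistent with holding out 20% of the rows for test, then 30% of the remainder for validate. A minimal sketch of such a two-stage split follows. The function name `split_train_validate_test`, the seed, and stratifying on `churn` are assumptions; the project's `train_validate_test()` may be implemented differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_validate_test(df, target="churn", seed=123):
    """Two-stage split: 20% test, then 30% of the remainder as validate."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test
```

Stratifying keeps the churn rate roughly equal across the three sets, which matters when the target classes are imbalanced.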

In [5]:
# look at the split df
train.head()

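|   | customer_id | gender | senior_citizen | tenure | phone_service | multiple_lines | online_security | online_backup | device_protection | tech_support | ... | phone_multi_line | phone_sgl_line | sgl_dependents | sgl_no_dep | fam_house | stream_media | online_feats | auto_billpay | sen_int | sen_int_techsup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 477 | 3969-JQABI | 0 | 0 | 58 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 183 | 1600-DILPE | 0 | 0 | 12 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2252 | 7359-SSBJK | 0 | 1 | 64 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 5556 | 0883-EIBTI | 0 | 0 | 2 | 1 | 0 | 3 | 3 | 3 | 3 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3626 | 5519-NPHVG | 0 | 0 | 12 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |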


In [6]:
# view split df's columns/features
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3943 entries, 477 to 5742
Data columns (total 34 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        3943 non-null   object 
 1   gender             3943 non-null   int64  
 2   senior_citizen     3943 non-null   int64  
 3   tenure             3943 non-null   int64  
 4   phone_service      3943 non-null   int64  
 5   multiple_lines     3943 non-null   int64  
 6   online_security    3943 non-null   int64  
 7   online_backup      3943 non-null   int64  
 8   device_protection  3943 non-null   int64  
 9   tech_support       3943 non-null   int64  
 10  paperless_billing  3943 non-null   int64  
 11  monthly_charges    3943 non-null   float64
 12  total_charges      3943 non-null   float64
 13  churn              3943 non-null   int64  
 14  pymt_type_abt      3943 non-null   uint8  
 15  pymt_type_acc      3943 non-null   uint8  
 16  pymt_type_echk     394