## Model Data

### Acquire the Data

In [1]:
import numpy as np
import pandas as pd
import acquire
import prepare
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [2]:
telco = acquire.get_telco_data()
telco.head()

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,internet_service_type_id,online_security,...,payment_type_id,monthly_charges,total_charges,churn,contract_type_id.1,contract_type,internet_service_type_id.1,internet_service_type,payment_type_id.1,payment_type
0,0016-QLJIS,Female,0,Yes,Yes,65,Yes,Yes,1,Yes,...,2,90.45,5957.9,No,3,Two year,1,DSL,2,Mailed check
1,0017-DINOC,Male,0,No,No,54,No,No phone service,1,Yes,...,4,45.2,2460.55,No,3,Two year,1,DSL,4,Credit card (automatic)
2,0019-GFNTW,Female,0,No,No,56,No,No phone service,1,Yes,...,3,45.05,2560.1,No,3,Two year,1,DSL,3,Bank transfer (automatic)
3,0056-EPFBG,Male,0,Yes,Yes,20,No,No phone service,1,Yes,...,4,39.4,825.4,No,3,Two year,1,DSL,4,Credit card (automatic)
4,0078-XZMHT,Male,0,Yes,No,72,Yes,Yes,1,No,...,3,85.15,6316.2,No,3,Two year,1,DSL,3,Bank transfer (automatic)


### Prepare the Data

In [3]:
train, validate, test = prepare.prep_with_encoding(telco)

In [4]:
#Split into X and y groups
X_train, y_train = train.drop('churn', axis = 1), train.churn
X_validate, y_validate = validate.drop('churn', axis = 1), validate.churn
X_test, y_test = test.drop('churn', axis = 1), test.churn

### Set Baseline Accuracy

In [5]:
#I will use the most common entry as the baseline for predicting churn.
#What is the most common value?
train.churn.value_counts()

0    2891
1    1046
Name: churn, dtype: int64

In [6]:
#The most common entry is 0, or False for churn.
baseline = DummyClassifier(strategy = 'constant', constant = 0)
baseline.fit(X_train, y_train)
baseline.score(X_train, y_train)

0.7343154686309372

#### Baseline accuracy on the training data set is about 73%.
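
As a quick check, the baseline score can be reproduced by hand: a classifier that always predicts the majority class is right exactly as often as that class occurs. The short sketch below uses only the class counts printed above; the variable names are illustrative and not part of the notebook's modules.

```python
# Class counts copied from the train.churn.value_counts() output above.
class_counts = {0: 2891, 1: 1046}

# The majority class is the label with the largest count.
majority_class = max(class_counts, key=class_counts.get)

# Always predicting the majority class is correct for every row of that class.
baseline_accuracy = class_counts[majority_class] / sum(class_counts.values())

print(majority_class)               # 0
print(round(baseline_accuracy, 4))  # 0.7343
```

The result matches the value returned by `baseline.score(X_train, y_train)`.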

### Random Forest Classifiers
I will use a Random Forest Classifier for my MVP. To find the best version, I will loop through 25 values of max_depth (1 through 25) and look for the one with the best validate accuracy.

In [7]:
model_dicts = []

for num in range(1, 26):
    #Instantiate new model
    clf = RandomForestClassifier(random_state = 123, max_depth = num)
    
    #Fit the model
    clf.fit(X_train, y_train)
    
    #Score the model on training data
    train_score = clf.score(X_train, y_train)
    
    #Make predictions on validate data to use in confusion matrix
    clf_preds = clf.predict(X_validate)
    
    #Use confusion matrix to find TP, FP, TN, FN
    #For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_validate, clf_preds).ravel()
    #Score the model on validate data
    validate_score = clf.score(X_validate, y_validate)
    
    #Create a dictionary for model values
    output = {
        'max_depth':num,
        'True Positves': tp,
        'False Positives': fp,
        'True Negatives': tn,
        'False Negatvies': fn,
        'Precision': tp / (tp + fp),
        'Recall': tp / (tp + fn),
        'Training Score': train_score,
        'Validate Score': validate_score,
        'Score Difference': train_score - validate_score
    }
    
    model_dicts.append(output)

#Print out all the info in a dataframe
models = pd.DataFrame(model_dicts)


In [8]:
models

Unnamed: 0,max_depth,True Positves,False Positives,True Negatives,False Negatvies,Precision,Recall,Training Score,Validate Score,Score Difference
0,1,0,0,1239,449,,0.0,0.734315,0.734005,0.000311
1,2,39,7,1232,410,0.847826,0.08686,0.753874,0.752962,0.000911
2,3,132,57,1182,317,0.698413,0.293987,0.785878,0.778436,0.007442
3,4,167,72,1167,282,0.698745,0.371938,0.795276,0.790284,0.004991
4,5,197,87,1152,252,0.693662,0.438753,0.803912,0.799171,0.004741
5,6,197,92,1147,252,0.681661,0.438753,0.812294,0.796209,0.016085
6,7,213,104,1135,236,0.671924,0.474388,0.822962,0.798578,0.024383
7,8,215,113,1126,234,0.655488,0.478842,0.838456,0.794431,0.044024
8,9,223,117,1122,226,0.655882,0.496659,0.85903,0.796801,0.062229
9,10,225,124,1115,224,0.644699,0.501114,0.887224,0.793839,0.093385
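
The empty Precision value in the max_depth = 1 row comes from the model predicting no positives at all, so `tp + fp` is 0. With NumPy integers that division evaluates to NaN and emits a RuntimeWarning. One way to make the intent explicit is a small guard like the sketch below; `safe_precision` is a hypothetical helper, not something defined in the notebook, and returning NaN (rather than 0) is a choice made here to keep the table unchanged.

```python
import math

def safe_precision(tp, fp):
    """Return tp / (tp + fp), or NaN when the model predicted no positives."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        # Precision is undefined when nothing was predicted positive.
        return math.nan
    return tp / predicted_positives

print(safe_precision(0, 0))    # max_depth = 1 row: no positive predictions
print(safe_precision(39, 7))   # max_depth = 2 row
```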


### Key Takeaways:
The first thing I noticed is that after a max_depth of about 10, the classifier becomes overfit and the validate accuracy begins dropping. At max_depth 10, the training score already exceeds the validate score by about 0.09. So far, the best-performing model is the one with a max_depth of only 5: it has the highest validate accuracy (about 0.799) and a difference of less than 0.005 between the training score and validate score.
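
To make the comparison concrete, the sketch below recomputes the train/validate gap for the ten rows shown in the table and picks the depth with the highest validate score. The scores are copied from the output above; the variable names are illustrative.

```python
# (max_depth, training score, validate score) copied from the table above.
rows = [
    (1, 0.734315, 0.734005), (2, 0.753874, 0.752962),
    (3, 0.785878, 0.778436), (4, 0.795276, 0.790284),
    (5, 0.803912, 0.799171), (6, 0.812294, 0.796209),
    (7, 0.822962, 0.798578), (8, 0.838456, 0.794431),
    (9, 0.859030, 0.796801), (10, 0.887224, 0.793839),
]

# Gap between training and validate accuracy at each depth.
gaps = {depth: train - validate for depth, train, validate in rows}

# Depth with the highest validate accuracy.
best_depth = max(rows, key=lambda row: row[2])[0]

for depth, gap in gaps.items():
    print(depth, round(gap, 6))
print("best max_depth by validate score:", best_depth)
```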

### Next:
Next, I would like to loop through the first 10 max_depth values again and, for each one, vary only min_samples_leaf (from 1 to 25).

In [9]:
model_dicts = []

for num in range(1, 11):
    #Now create a new loop that runs through different min_samples_leaf values
    for val in range(1, 26):
        #Instantiate new model
        clf = RandomForestClassifier(random_state = 123, max_depth = num, min_samples_leaf = val)
    
        #Fit the model
        clf.fit(X_train, y_train)
    
        #Score the model on training data
        train_score = clf.score(X_train, y_train)
    
        #Make predictions on validate data to use in confusion matrix
        clf_preds = clf.predict(X_validate)
        
        #Use confusion matrix to find TP, FP, TN, FN
        #For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
        tn, fp, fn, tp = confusion_matrix(y_validate, clf_preds).ravel()
        #Score the model on validate data
        validate_score = clf.score(X_validate, y_validate)
    
        #Create a dictionary for model values
        output = {
            'max_depth':num,
            'min_samples_leaf': val,
            'True Positves': tp,
            'False Positives': fp,
            'True Negatives': tn,
            'False Negatvies': fn,
            'Precision': tp / (tp + fp),
            'Recall': tp / (tp + fn),
            'Training Score': train_score,
            'Validate Score': validate_score,
            'Score Difference': train_score - validate_score
        }
        
        model_dicts.append(output)

models = pd.DataFrame(model_dicts)


In [10]:
models

Unnamed: 0,max_depth,min_samples_leaf,True Positves,False Positives,True Negatives,False Negatvies,Precision,Recall,Training Score,Validate Score,Score Difference
0,1,1,0,0,1239,449,,0.000000,0.734315,0.734005,0.000311
1,1,2,0,0,1239,449,,0.000000,0.734315,0.734005,0.000311
2,1,3,0,0,1239,449,,0.000000,0.734315,0.734005,0.000311
3,1,4,0,0,1239,449,,0.000000,0.734315,0.734005,0.000311
4,1,5,0,0,1239,449,,0.000000,0.734315,0.734005,0.000311
...,...,...,...,...,...,...,...,...,...,...,...
245,10,21,215,109,1130,234,0.663580,0.478842,0.817628,0.796801,0.020827
246,10,22,221,108,1131,228,0.671733,0.492205,0.817374,0.800948,0.016426
247,10,23,218,105,1134,231,0.674923,0.485523,0.817882,0.800948,0.016934
248,10,24,215,106,1133,234,0.669782,0.478842,0.814834,0.798578,0.016255


In [11]:
models[models['Validate Score'] > 0.80]

Unnamed: 0,max_depth,min_samples_leaf,True Positves,False Positives,True Negatives,False Negatvies,Precision,Recall,Training Score,Validate Score,Score Difference
114,5,15,201,86,1153,248,0.700348,0.447661,0.801626,0.802133,-0.000507
154,7,5,220,107,1132,229,0.672783,0.489978,0.820676,0.800948,0.019728
159,7,10,218,103,1136,231,0.679128,0.485523,0.818898,0.802133,0.016765
163,7,14,223,107,1132,226,0.675758,0.496659,0.816866,0.802725,0.014141
164,7,15,220,107,1132,229,0.672783,0.489978,0.815342,0.800948,0.014394
172,7,23,214,100,1139,235,0.681529,0.476615,0.816612,0.80154,0.015071
174,7,25,216,104,1135,233,0.675,0.481069,0.812548,0.800355,0.012192
185,8,11,222,105,1134,227,0.678899,0.494432,0.818898,0.803318,0.01558
186,8,12,223,108,1131,226,0.673716,0.496659,0.821438,0.802133,0.019305
192,8,18,222,110,1129,227,0.668675,0.494432,0.817628,0.800355,0.017272


### Key Takeaways:
Changing min_samples_leaf produced no major improvement in performance. However, the model with a max_depth of 5 got a slight boost when min_samples_leaf was set to 15: its validate score rose from about 0.799 to 0.802, and its training and validate scores are nearly identical. I will use this model for my MVP.
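
The selection can be sketched as a small filter over the rows shown in the filtered table: among models with a validate score above 0.80, prefer the one whose training and validate scores are closest. The scores below are copied from that output; ranking by the smallest absolute gap is an assumption made here about how to formalize "best fit", not a rule stated in the notebook.

```python
# (max_depth, min_samples_leaf, training score, validate score)
# copied from the rows with a validate score above 0.80.
candidates = [
    (5, 15, 0.801626, 0.802133), (7, 5, 0.820676, 0.800948),
    (7, 10, 0.818898, 0.802133), (7, 14, 0.816866, 0.802725),
    (7, 15, 0.815342, 0.800948), (7, 23, 0.816612, 0.801540),
    (7, 25, 0.812548, 0.800355), (8, 11, 0.818898, 0.803318),
    (8, 12, 0.821438, 0.802133), (8, 18, 0.817628, 0.800355),
]

# Model with the smallest absolute train/validate gap (least overfit).
least_overfit = min(candidates, key=lambda row: abs(row[2] - row[3]))

# Model with the highest validate score, for comparison.
highest_validate = max(candidates, key=lambda row: row[3])

print("smallest gap:", least_overfit[:2])
print("highest validate score:", highest_validate[:2])
```

The model with the highest validate score (max_depth 8, min_samples_leaf 11) is only about 0.001 ahead, while its gap to the training score is roughly 0.016, so the shallower model is a reasonable choice.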

### Next:
Next, I will score the chosen model on the test data set.

In [12]:
#Instantiate the model
clf = RandomForestClassifier(random_state = 123, max_depth = 5, min_samples_leaf = 15)

#Fit the model
clf.fit(X_train, y_train)

#Score the model on training data
train_score = clf.score(X_train, y_train)

#Score the model on validate data
validate_score = clf.score(X_validate, y_validate)

#Score the model on test data
test_score = clf.score(X_test, y_test)

#Make predictions on test data to use in confusion matrix
clf_preds = clf.predict(X_test)

#Use confusion matrix to find TP, FP, TN, FN
#For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, clf_preds).ravel()


#Create a dictionary for model values
output = {
    'max_depth':5,
    'min_samples_leaf': 15,
    'True Positves': tp,
    'False Positives': fp,
    'True Negatives': tn,
    'False Negatvies': fn,
    'Precision': tp / (tp + fp),
    'Recall': tp / (tp + fn),
    'Training Score': train_score,
    'Validate Score': validate_score,
    'Test Score': test_score,
    'Score Difference': validate_score - test_score
}

test_set = []
test_set.append(output)

test_model = pd.DataFrame(test_set)
test_model

Unnamed: 0,max_depth,min_samples_leaf,True Positves,False Positives,True Negatives,False Negatvies,Precision,Recall,Training Score,Validate Score,Test Score,Score Difference
0,5,15,171,77,956,203,0.689516,0.457219,0.801626,0.802133,0.800995,0.001138
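
The metrics in this row follow directly from the four confusion-matrix counts. The sketch below recomputes them by hand as a sanity check, using the counts copied from the output above.

```python
# Confusion-matrix counts on the test set, copied from the row above.
tp, fp, tn, fn = 171, 77, 956, 203

precision = tp / (tp + fp)                  # share of predicted churners who churned
recall = tp / (tp + fn)                     # share of actual churners who were caught
accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that were correct

print(round(precision, 6))  # 0.689516
print(round(recall, 6))     # 0.457219
print(round(accuracy, 6))   # 0.800995
```

Accuracy matches the Test Score column. The recall shows that the model identifies fewer than half of the customers who actually churned, which is worth keeping in mind alongside the 80% accuracy.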


### Next:
I will create a function to run these models and print out the results in a similar format. However, to limit the number of rows displaying, I will only show those that have a validate score higher than .80. I will save this function in model.py.

### Function is complete.

### Next:
I will create a function that creates a df containing customer_id, churn probability, and churn prediction for the test data set.

In [20]:
#Since I need customer_id, I must get it from the explore version of test.
train_explore, validate_explore, test_explore = prepare.prep_without_encoding(telco)
test_explore

Unnamed: 0,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,online_backup,...,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
2897,0864-FVJNJ,Female,No,Yes,Yes,64,Yes,Yes,Yes,Yes,...,Yes,Yes,Yes,No,113.35,7222.75,No,One year,Fiber optic,Electronic check
6407,6870-ECSHE,Female,No,No,No,2,Yes,No,No internet service,No internet service,...,No internet service,No internet service,No internet service,No,20.45,34.80,No,One year,,Mailed check
6272,3452-FLHYD,Male,No,Yes,No,25,Yes,No,No internet service,No internet service,...,No internet service,No internet service,No internet service,Yes,20.95,495.15,No,One year,,Bank transfer (automatic)
5638,1927-QEWMY,Female,No,Yes,No,72,Yes,No,No internet service,No internet service,...,No internet service,No internet service,No internet service,No,20.50,1502.25,No,Two year,,Credit card (automatic)
903,4872-VXRIL,Male,No,No,No,56,Yes,Yes,Yes,Yes,...,Yes,No,No,Yes,64.65,3665.55,No,One year,DSL,Bank transfer (automatic)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1524,2833-SLKDQ,Male,No,No,No,1,Yes,No,No,No,...,No,No,No,No,45.05,45.05,Yes,Month-to-month,DSL,Mailed check
3536,0644-OQMDK,Male,Yes,No,No,4,Yes,No,No,No,...,No,No,No,Yes,70.65,293.85,No,Month-to-month,Fiber optic,Electronic check
3927,2632-UCGVD,Male,Yes,Yes,No,66,Yes,Yes,No,No,...,Yes,Yes,Yes,Yes,100.05,6871.90,Yes,Month-to-month,Fiber optic,Credit card (automatic)
263,4671-VJLCL,Female,No,No,No,63,Yes,Yes,Yes,Yes,...,Yes,Yes,No,Yes,79.85,4861.45,No,Two year,DSL,Credit card (automatic)


In [21]:
#Build a df containing churn proba
churn_proba = clf.predict_proba(X_test)
proba_df = pd.DataFrame(churn_proba, columns = ['probability_not_churned', 'probability_churned'])
proba_df

Unnamed: 0,probability_not_churned,probability_churned
0,0.789310,0.210690
1,0.891388,0.108612
2,0.953725,0.046275
3,0.987700,0.012300
4,0.896498,0.103502
...,...,...
1402,0.510527,0.489473
1403,0.362178,0.637822
1404,0.741420,0.258580
1405,0.941308,0.058692


In [22]:
#Add each column one at a time
reset_test = test_explore.reset_index()
reset_test['probability_not_churned'] = proba_df['probability_not_churned']
reset_test['probability_churned'] = proba_df['probability_churned']
reset_test

Unnamed: 0,index,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,...,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,probability_not_churned,probability_churned
0,2897,0864-FVJNJ,Female,No,Yes,Yes,64,Yes,Yes,Yes,...,Yes,No,113.35,7222.75,No,One year,Fiber optic,Electronic check,0.789310,0.210690
1,6407,6870-ECSHE,Female,No,No,No,2,Yes,No,No internet service,...,No internet service,No,20.45,34.80,No,One year,,Mailed check,0.891388,0.108612
2,6272,3452-FLHYD,Male,No,Yes,No,25,Yes,No,No internet service,...,No internet service,Yes,20.95,495.15,No,One year,,Bank transfer (automatic),0.953725,0.046275
3,5638,1927-QEWMY,Female,No,Yes,No,72,Yes,No,No internet service,...,No internet service,No,20.50,1502.25,No,Two year,,Credit card (automatic),0.987700,0.012300
4,903,4872-VXRIL,Male,No,No,No,56,Yes,Yes,Yes,...,No,Yes,64.65,3665.55,No,One year,DSL,Bank transfer (automatic),0.896498,0.103502
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1402,1524,2833-SLKDQ,Male,No,No,No,1,Yes,No,No,...,No,No,45.05,45.05,Yes,Month-to-month,DSL,Mailed check,0.510527,0.489473
1403,3536,0644-OQMDK,Male,Yes,No,No,4,Yes,No,No,...,No,Yes,70.65,293.85,No,Month-to-month,Fiber optic,Electronic check,0.362178,0.637822
1404,3927,2632-UCGVD,Male,Yes,Yes,No,66,Yes,Yes,No,...,Yes,Yes,100.05,6871.90,Yes,Month-to-month,Fiber optic,Credit card (automatic),0.741420,0.258580
1405,263,4671-VJLCL,Female,No,No,No,63,Yes,Yes,Yes,...,No,Yes,79.85,4861.45,No,Two year,DSL,Credit card (automatic),0.941308,0.058692


In [23]:
#Now add the test predictions
reset_test['predicted'] = clf_preds
reset_test

Unnamed: 0,index,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,...,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type,probability_not_churned,probability_churned,predicted
0,2897,0864-FVJNJ,Female,No,Yes,Yes,64,Yes,Yes,Yes,...,No,113.35,7222.75,No,One year,Fiber optic,Electronic check,0.789310,0.210690,0
1,6407,6870-ECSHE,Female,No,No,No,2,Yes,No,No internet service,...,No,20.45,34.80,No,One year,,Mailed check,0.891388,0.108612,0
2,6272,3452-FLHYD,Male,No,Yes,No,25,Yes,No,No internet service,...,Yes,20.95,495.15,No,One year,,Bank transfer (automatic),0.953725,0.046275,0
3,5638,1927-QEWMY,Female,No,Yes,No,72,Yes,No,No internet service,...,No,20.50,1502.25,No,Two year,,Credit card (automatic),0.987700,0.012300,0
4,903,4872-VXRIL,Male,No,No,No,56,Yes,Yes,Yes,...,Yes,64.65,3665.55,No,One year,DSL,Bank transfer (automatic),0.896498,0.103502,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1402,1524,2833-SLKDQ,Male,No,No,No,1,Yes,No,No,...,No,45.05,45.05,Yes,Month-to-month,DSL,Mailed check,0.510527,0.489473,0
1403,3536,0644-OQMDK,Male,Yes,No,No,4,Yes,No,No,...,Yes,70.65,293.85,No,Month-to-month,Fiber optic,Electronic check,0.362178,0.637822,1
1404,3927,2632-UCGVD,Male,Yes,Yes,No,66,Yes,Yes,No,...,Yes,100.05,6871.90,Yes,Month-to-month,Fiber optic,Credit card (automatic),0.741420,0.258580,0
1405,263,4671-VJLCL,Female,No,No,No,63,Yes,Yes,Yes,...,Yes,79.85,4861.45,No,Two year,DSL,Credit card (automatic),0.941308,0.058692,0


In [24]:
#Now select only the columns required in a new data frame.
csv_df = reset_test[['customer_id', 'probability_not_churned', 'probability_churned', 'predicted']]
csv_df

Unnamed: 0,customer_id,probability_not_churned,probability_churned,predicted
0,0864-FVJNJ,0.789310,0.210690,0
1,6870-ECSHE,0.891388,0.108612,0
2,3452-FLHYD,0.953725,0.046275,0
3,1927-QEWMY,0.987700,0.012300,0
4,4872-VXRIL,0.896498,0.103502,0
...,...,...,...,...
1402,2833-SLKDQ,0.510527,0.489473,0
1403,0644-OQMDK,0.362178,0.637822,1
1404,2632-UCGVD,0.741420,0.258580,0
1405,4671-VJLCL,0.941308,0.058692,0


### Final Step:
The last step is to take this df and convert it to a csv file. This is easy to do with the DataFrame's to_csv() method (for example, csv_df.to_csv()). I won't do it here, only in the final report.
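
A minimal sketch of how the steps above could be wrapped into one function is shown below. `build_prediction_frame` is a hypothetical name, not the function saved in model.py, and the three example rows are copied from the test-set output. Passing plain lists positionally sidesteps index alignment, which is why the notebook had to call `reset_index()` before adding the probability columns.

```python
import pandas as pd

def build_prediction_frame(customer_ids, churn_proba, predictions):
    """Combine ids, class probabilities, and predicted labels into one frame."""
    frame = pd.DataFrame(
        churn_proba,
        columns=["probability_not_churned", "probability_churned"],
    )
    # Insert the ids first so customer_id is the leading column.
    frame.insert(0, "customer_id", list(customer_ids))
    frame["predicted"] = list(predictions)
    return frame

# Three rows copied from the test-set output above.
example = build_prediction_frame(
    ["0864-FVJNJ", "6870-ECSHE", "0644-OQMDK"],
    [[0.789310, 0.210690], [0.891388, 0.108612], [0.362178, 0.637822]],
    [0, 0, 1],
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = example.to_csv(index=False)
print(csv_text)
```

In the notebook, the arguments would be `test_explore.customer_id`, `clf.predict_proba(X_test)`, and `clf.predict(X_test)`, and passing a file path to `to_csv` would write the file to disk.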