# Instructions
- **Apply the Random Forest algorithm, but this time balance the data only by upsampling with SMOTE.**
- **Note that since SMOTE works on numerical data only, we first need to encode the categorical variables.**
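
As a reminder of what the encoding step looks like, here is a minimal sketch using `pd.get_dummies`. The toy frame and its values are made up for illustration; only the column naming mimics the churn data.

```python
import pandas as pd

# Hypothetical toy frame: the values are invented, only the column names resemble the churn data
raw = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# One-hot encode the categorical column so that every feature is numeric
encoded = pd.get_dummies(raw, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```

After this step every column is numeric, which is what SMOTE needs in order to compute distances between samples.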

In [1]:
import pandas as pd
import numpy as np
import warnings

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score # accepts only one metric
from sklearn.model_selection import cross_validate # accepts more than one metric 
from sklearn.model_selection import GridSearchCV 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score 

warnings.filterwarnings('ignore')

In [2]:
# Load the data that was already cleaned and encoded in the previous lab

X = pd.read_csv("X.csv")
y = pd.read_csv("y.csv")

In [3]:
X.head()

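| | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MonthlyCharges | TotalCharges | gender_Female | gender_Male | OnlineSecurity_No | ... | TechSupport_Yes | StreamingTV_No | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_Month-to-month | Contract_One year | Contract_Two year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 29.85 | 29.85 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 34 | 1 | 56.95 | 1889.5 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 2 | 1 | 53.85 | 108.15 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 45 | 0 | 42.3 | 1840.75 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 2 | 1 | 70.7 | 151.65 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |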


In [4]:
y.head()

| | Churn |
|---|---|
| 0 | No |
| 1 | No |
| 2 | Yes |
| 3 | No |
| 4 | Yes |


In [5]:
# Also encode the target (Yes -> 1, No -> 0): the "f1" scorer used in the cross-validation expects 1 as the positive label

y["Churn"] = y["Churn"].apply(lambda x: 1 if x == "Yes" else 0)

In [6]:
y.value_counts()

Churn
0        5163
1        1869
dtype: int64

In [7]:
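# Reduce imbalance in the target using SMOTE.
# Note: the whole dataset is resampled here, before the train-test split,
# so the test set created below also contains synthetic samples.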

smote = SMOTE()
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

Churn
0        5163
1        5163
dtype: int64
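
To give an intuition for where the extra 3294 "Yes" rows come from: SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The following is a simplified NumPy sketch of that idea, not imblearn's actual implementation; the function name, the toy points and `k=2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_sample(minority, k=2):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    i = rng.integers(len(minority))
    x = minority[i]
    dists = np.linalg.norm(minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, which is the point itself
    x_nn = minority[rng.choice(neighbours)]
    gap = rng.random()  # interpolation factor in [0, 1)
    return x + gap * (x_nn - x)

# Hypothetical minority-class points with two numeric features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_point = synthetic_sample(minority)
print(new_point)
```

Because each new point lies on the segment between two existing samples, one-hot encoded columns can take fractional values in the resampled data.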

In [8]:
# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.3, random_state=11)

In [9]:
# First run a baseline model with default settings, using 10-fold cross-validation

model = RandomForestClassifier(random_state=100)

cross_val_scores = cross_validate(model, X_train, y_train, cv=10, scoring=["accuracy", "f1"], error_score="raise")
cross_val_scores

{'fit_time': array([0.83047891, 0.85589004, 0.89093018, 0.84129691, 0.79818058,
        0.8249042 , 0.80801964, 0.73386931, 0.58877492, 0.59314632]),
 'score_time': array([0.04645324, 0.03638792, 0.03890443, 0.02825594, 0.03200173,
        0.03375816, 0.03181458, 0.01847386, 0.01951385, 0.02445579]),
 'test_accuracy': array([0.8340249 , 0.84094053, 0.84094053, 0.86168741, 0.85615491,
        0.87275242, 0.85753804, 0.84370678, 0.85734072, 0.84210526]),
 'test_f1': array([0.82708934, 0.83870968, 0.84005563, 0.85875706, 0.85310734,
        0.86931818, 0.85472496, 0.83834049, 0.85222382, 0.83806818])}

Here we look at both the accuracy and the F1 score, because we are interested not only in the overall performance of the model (accuracy) but also in how well it predicts the "Yes" class, that is, the customers who have churned. The F1 score, the harmonic mean of precision and recall, gives us that information. <br>
Note: by default, the `"f1"` scorer computes the F1 score of the positive class, 1 ("Yes"). More info: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
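
As a small worked example of the F1 formula, F1 = 2 · precision · recall / (precision + recall), the sketch below uses the precision and recall reported for class 1 in the final classification report of this notebook (0.86 and 0.85), plus a made-up lopsided pair to show how the harmonic mean penalises imbalance between the two. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall close together: F1 is close to both
balanced_f1 = f1_from(0.86, 0.85)

# Made-up lopsided case: the arithmetic mean would be 0.5, F1 is far lower
lopsided_f1 = f1_from(0.9, 0.1)

print(round(balanced_f1, 3), round(lopsided_f1, 3))
```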

In [31]:
accuracy_cv = cross_val_scores["test_accuracy"] 

print(f"""The mean accuracy is {round(np.mean(accuracy_cv), 3)}
The standard deviation of the accuracy is {round(np.std(accuracy_cv), 3)}
The minimum accuracy is {round(min(accuracy_cv), 3)}
The maximum accuracy is {round(max(accuracy_cv), 3)}""")

The mean accuracy is 0.851
The standard deviation of the accuracy is 0.011
The minimum accuracy is 0.834
The maximum accuracy is 0.873


In [32]:
f1_cv = cross_val_scores["test_f1"] 

print(f"""The mean F1 score is {round(np.mean(f1_cv), 3)}
The standard deviation of the F1 score is {round(np.std(f1_cv), 3)}
The minimum F1 score is {round(min(f1_cv), 3)}
The maximum F1 score is {round(max(f1_cv), 3)}""")

The mean F1 score is 0.847
The standard deviation of the F1 score is 0.012
The minimum F1 score is 0.827
The maximum F1 score is 0.869
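
The two summaries above can also be produced in one step. A possible sketch, with the score arrays copied from the `cross_validate` output shown earlier (the name `cv_demo` is just a placeholder for `cross_val_scores`):

```python
import numpy as np
import pandas as pd

# Values copied from the cross_validate output above
cv_demo = {
    "test_accuracy": np.array([0.8340249, 0.84094053, 0.84094053, 0.86168741, 0.85615491,
                               0.87275242, 0.85753804, 0.84370678, 0.85734072, 0.84210526]),
    "test_f1": np.array([0.82708934, 0.83870968, 0.84005563, 0.85875706, 0.85310734,
                         0.86931818, 0.85472496, 0.83834049, 0.85222382, 0.83806818]),
}

# One column per metric, one row per summary statistic
summary = pd.DataFrame({
    name: {"mean": s.mean(), "std": s.std(), "min": s.min(), "max": s.max()}
    for name, s in cv_demo.items()
}).round(3)
print(summary)
```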


With the default parameters we obtained a mean cross-validated accuracy and F1 score of about 0.85. <br>
Now we will try to improve these results via hyperparameter tuning.

In [21]:
# Grid search for hyperparameter tuning

param_grid = {
    "n_estimators": [50, 100], 
    "max_depth": [5, 25, 50, None],
    "min_samples_split": [2, 4], 
    "min_samples_leaf" : [1, 2, 5], 
    "max_features": ["sqrt", "log2"]
    }

grid_search = GridSearchCV(model, param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
grid_search.best_params_ 

{'max_depth': 25,
 'max_features': 'log2',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 100}
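
To get a feel for the cost of this search, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below repeats the first grid under the placeholder name `param_grid_demo`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the first grid above
param_grid_demo = {
    "n_estimators": [50, 100],
    "max_depth": [5, 25, 50, None],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2"],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # 2 * 4 * 2 * 3 * 2 combinations
n_fits = n_candidates * 5  # cv=5; the final refit on the full training set adds one more
print(n_candidates, n_fits)
```

Every value added to a list multiplies the number of fits, which is why the later grids narrow the ranges instead of widening them.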

In [22]:
grid_search.best_score_

0.8521032606354721
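
Since no `scoring` argument was passed, `GridSearchCV` falls back on the classifier's default scorer, so `best_score_` here is the mean cross-validated accuracy. If the F1 score of the "Yes" class should drive the selection instead, something along these lines could be used. This is a sketch on synthetic data with a deliberately tiny grid, not the search that was run above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=100),
    {"max_depth": [3, None]},
    cv=3,
    scoring=["accuracy", "f1"],  # track both metrics
    refit="f1",                  # pick and refit the best model according to F1
)
search.fit(X_demo, y_demo)

results = search.cv_results_
print(search.best_params_, results["mean_test_accuracy"], results["mean_test_f1"])
```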

In [19]:
# Second grid: narrow the ranges around the best parameters found above
param_grid = {
    "n_estimators": [100, 150], 
    "max_depth": [15, 25, 35],
    "min_samples_split": [2, 3], 
    "min_samples_leaf" : [1, 2, 3], 
    "max_features": ["sqrt", "log2"]
    }

grid_search = GridSearchCV(model, param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
grid_search.best_params_ 

{'max_depth': 15,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 100}

In [20]:
grid_search.best_score_

0.8532101441992467

In [25]:
# Third grid: leave n_estimators at the model's default and explore values around max_depth=15
param_grid = {
    "max_depth": [10, 15, 20],
    "min_samples_split": [2, 3, 5], 
    "min_samples_leaf" : [1, 2, 3, 5], 
    "max_features": ["sqrt", "log2"]
    }

grid_search = GridSearchCV(model, param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
grid_search.best_params_ 

{'max_depth': 20,
 'max_features': 'log2',
 'min_samples_leaf': 1,
 'min_samples_split': 5}

In [26]:
grid_search.best_score_

0.853348456785692
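
The searches above set `return_train_score=True`, which stores the training scores in `cv_results_`. Comparing them with the validation scores is a quick way to spot overfitting. A minimal sketch on synthetic data (the grid and data are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=100),
    {"max_depth": [2, None]},
    cv=3,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# One row per candidate; a large gap suggests the candidate overfits
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_train_score", "mean_test_score"]
].copy()
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results)
```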

In [27]:
# Evaluate the best estimator from the last grid search on the held-out test set
model = grid_search.best_estimator_
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
print("The Kappa of your model is: %4.2f" % (cohen_kappa_score(y_test, predictions)))

              precision    recall  f1-score   support

           0       0.84      0.86      0.85      1513
           1       0.86      0.85      0.85      1585

    accuracy                           0.85      3098
   macro avg       0.85      0.85      0.85      3098
weighted avg       0.85      0.85      0.85      3098

The Kappa of your model is: 0.70
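
Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below works this out by hand for a hypothetical confusion matrix. The counts are made up for illustration and are not the confusion matrix of the model above.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[40, 10],
               [5, 45]])

n = cm.sum()
p_o = np.trace(cm) / n                                 # observed agreement (the accuracy)
p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2   # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)
print(round(p_o, 2), round(p_e, 2), round(kappa, 2))
```

With two balanced classes p_e is close to 0.5, so an accuracy of about 0.85 translates into a kappa of about 0.70.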


In [28]:
# Check the importance of the features

feature_names = X.columns
feature_names = list(feature_names)

df = pd.DataFrame(list(zip(feature_names, model.feature_importances_)))
df.columns = ['columns_name', 'score_feature_importance']
df.sort_values(by=['score_feature_importance'], ascending = False)

| | columns_name | score_feature_importance |
|---|---|---|
| 3 | tenure | 0.125843 |
| 6 | TotalCharges | 0.125448 |
| 5 | MonthlyCharges | 0.120008 |
| 29 | Contract_Two year | 0.066263 |
| 11 | OnlineSecurity_Yes | 0.058945 |
| 20 | TechSupport_Yes | 0.052421 |
| 27 | Contract_Month-to-month | 0.046095 |
| 28 | Contract_One year | 0.039796 |
| 2 | Dependents | 0.035535 |
| 1 | Partner | 0.028835 |
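
The same ranking can be obtained more compactly with a `pd.Series` indexed by feature name. The sketch below fits a small forest on synthetic data, and the feature names are hypothetical. The impurity-based importances of a random forest sum to 1, so each value can be read as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_arr, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

forest = RandomForestClassifier(n_estimators=20, random_state=100).fit(X_demo, y_demo)

# Series indexed by feature name, sorted from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```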


### Conclusion
Hyperparameter tuning raised the mean cross-validated accuracy only marginally, from 0.851 to 0.853, so the tuned model performs virtually the same as the default one. <br>
In the final model, precision, recall and F1-score for both classes, "No" and "Yes", all lie between 0.84 and 0.86, which indicates balanced performance across the two classes.
Regarding the "Yes" class (customers that have churned), precision is 0.86 and recall is 0.85: when the model predicts "Yes" it is correct about 86% of the time, and it identifies 85% of the actual "Yes" customers.
The test accuracy is 0.85, so the model correctly classifies 85% of the cases overall.
Lastly, the kappa coefficient of 0.70 suggests a substantial level of agreement beyond chance between the predictions and the actual observations. <br>
Overall, the random forest performs well on both classes. Keep in mind that the test set was drawn from the SMOTE-balanced data, so these figures describe performance on balanced data that includes synthetic samples. <br>
**Next steps**: To continue improving the model, we could perform feature selection (so far we have used the full dataframe) based on the feature importances above or on one of the other techniques we have seen (correlation, variance threshold, recursive feature elimination, SelectKBest, ANOVA...), and then continue with the hyperparameter tuning.
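
As a starting point for that next step, here is a hedged sketch of feature selection with `SelectKBest` and the ANOVA F-test (`f_classif`), placed in a pipeline so that the selection is re-fitted inside each cross-validation fold. It uses synthetic data, and `k=6` is an arbitrary choice that would need tuning on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 12 features, of which 4 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=6)),  # keep the 6 best features by ANOVA F-score
    ("rf", RandomForestClassifier(n_estimators=25, random_state=100)),
])
scores = cross_val_score(pipe, X_demo, y_demo, cv=3)

# Fit once on all the data to inspect which features were kept
pipe.fit(X_demo, y_demo)
kept = int(pipe.named_steps["select"].get_support().sum())
print(scores.mean(), kept)
```

The value of `k` could itself be added to a grid search as `select__k`.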