## Review

Hi Jing! My name is Soslan. I'm reviewing your work. I've added all my comments to new cells with the title "Review". My apologies for the delay in the review. We will be faster next time :)

```diff
+ If you did something great, I use green for my comment.
- If a topic needs some extra work before I can accept it, the comment is red.
```

Your project is of high quality. You demonstrate knowledge of techniques such as cross-validation and grid search, and your code is well designed. So I'm accepting your project.

---

# Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would recommend the user to switch to a newer plan: Smart or Ultra.
- You have access to data about subscribers who have already switched to the new plans.
- For this classification task, you need to develop a model that will pick the right plan.
- Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
- Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75.

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
import itertools
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Step 1. Data Preparation

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')
display(df)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [3]:
# train-test split / 80-20

train, val = train_test_split(df, test_size=0.20, random_state=123)
x_train, x_val = train.drop('is_ultra', axis=1), val.drop('is_ultra', axis=1)
y_train, y_val = train['is_ultra'], val['is_ultra']
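
A possible refinement of the split above, shown as a minimal sketch on synthetic data: passing `stratify` to `train_test_split` keeps the share of each class the same in both parts. The `toy` frame and its roughly 30% positive rate are assumptions made for illustration; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: about 30% of rows are class 1).
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    'calls': rng.integers(0, 200, size=1000),
    'is_ultra': (rng.random(1000) < 0.3).astype(int),
})

# stratify=... asks for the same class proportions in the train and validation parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.20, random_state=123, stratify=toy['is_ultra'])

# Share of class 1 in the full data, the train part and the validation part.
print(toy['is_ultra'].mean(), toy_train['is_ultra'].mean(), toy_val['is_ultra'].mean())
```

With the real data this would amount to adding `stratify=df['is_ultra']` to the existing call.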

# Step 2. Baseline Performance
Models used:
- Decision Tree
- Random Forest
- Gradient Boosting

In [4]:
# Decision Tree

tree = DecisionTreeClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)
tree_result = cross_val_score(tree, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (tree_result.mean(), tree_result.std()))

Accuracy: 0.724 (0.025)


In [5]:
# Random Forest

rf = RandomForestClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)
rf_result = cross_val_score(rf, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (rf_result.mean(), rf_result.std()))

Accuracy: 0.792 (0.025)


In [6]:
# Gradient Boosting

gre = GradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)
gre_result = cross_val_score(gre, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (gre_result.mean(), gre_result.std()))

Accuracy: 0.806 (0.023)
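
To put these accuracies in context, it can help to compare them with a model that always predicts the most frequent class. The sketch below uses `DummyClassifier` on synthetic labels; the 70/30 class split is an assumption for illustration, not a figure measured on `users_behavior.csv`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic labels only (assumption: 70% class 0, 30% class 1).
# The dummy model ignores the features, so a column of zeros is enough.
y_toy = np.array([0] * 700 + [1] * 300)
X_toy = np.zeros((len(y_toy), 1))

# Always predict the majority class and score it with 10-fold cross-validation.
dummy = DummyClassifier(strategy='most_frequent')
dummy_result = cross_val_score(dummy, X_toy, y_toy, scoring='accuracy', cv=10)
print('Accuracy: %.3f (%.3f)' % (dummy_result.mean(), dummy_result.std()))
```

On the real data, the same estimator could be passed to the `cross_val_score` calls above in place of `tree`, `rf` or `gre`.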


## Review

```diff
+ Good start, you're using cv, this is OK :)
```

---

# Step 3. Hyperparameter Tuning
Tuning Steps:
- find the right combination of n_estimators and learning rate
- tune tree related parameters
- re-adjust boosting related parameters

Since our dataset is small enough, I will use exhaustive search as often as possible at each step.
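
The cost of an exhaustive search can be estimated before running it: the number of model fits is the number of parameter combinations times the number of cross-validation folds. A minimal sketch, using the values of the first tuning step (the name `example_grid` is introduced here only for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same values as the learning_rate / n_estimators grid of the first tuning step.
example_grid = {'learning_rate': [0.5, 0.2, 0.1, 0.05],
                'n_estimators': [10, 20, 50, 100, 200]}

n_candidates = len(ParameterGrid(example_grid))  # 4 * 5 combinations
n_folds = 10 * 5                                 # n_splits * n_repeats
n_fits = n_candidates * n_folds                  # fits done during the search

print(n_candidates, n_fits)
```

The final refit on the whole training set that `GridSearchCV` performs by default is not counted here.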

In [7]:
# tune learning rate and tune n_estimators, this is a preparation step for tuning tree parameters

grid = {'learning_rate':[0.5, 0.2, 0.1, 0.05], 'n_estimators':[10, 20, 50, 100, 200]}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)
tune = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=grid, scoring='accuracy', n_jobs=-1, cv=cv)
tune.fit(x_train, y_train)

GridSearchCV(cv=<sklearn.model_selection._split.RepeatedStratifiedKFold object at 0x7f0d847ec450>,
             error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=...
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
                                   

In [8]:
tune.best_params_, tune.best_score_

({'learning_rate': 0.05, 'n_estimators': 200}, 0.8070011668611435)

In [None]:
# fix learning rate and n_estimators, tune tree-related parameters
# since the number of features is small, I will not tune max_features

grid = {'max_depth': [3, 5, 7, 9], 'min_samples_split': [2, 4, 6, 8, 10, 20, 40, 60, 100], 'min_samples_leaf': [1, 3, 5, 7, 9]}
tune = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.05, n_estimators=200), param_grid=grid,
                    scoring='accuracy', n_jobs=-1, cv=5)
tune.fit(x_train, y_train)

In [None]:
tune.best_params_, tune.best_score_

In [None]:
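# re-adjust boosting parameters
# since this step is really a fine-tuning step, I will use random search instead
# note: min_samples_split=5 and min_samples_leaf=2 below differ from the values used in Step 4 (min_samples_split=2, min_samples_leaf=5)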

grid = {'subsample':[0.7,0.75,0.8,0.85,0.9,0.95,1], 'n_estimators':[100, 150, 200, 250, 300, 350, 400, 500]}
tune = RandomizedSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.05, max_depth=3, min_samples_split=5, 
                                                               min_samples_leaf=2), 
                          param_distributions=grid, n_iter=100, scoring='accuracy', n_jobs=-1, cv=5)
tune.fit(x_train, y_train)

In [12]:
tune.best_params_, tune.best_score_

({'subsample': 0.9, 'n_estimators': 150}, 0.8121353558926487)
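
One detail worth checking for a random search over lists of values is how the number of distinct combinations compares with `n_iter`. The sketch below counts the combinations for the search space used above (the name `space` is introduced here for illustration). When every parameter is given as a list and `n_iter` is at least as large as this count, my understanding is that scikit-learn samples without replacement, so the search ends up visiting each combination once, much like a grid search.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the subsample / n_estimators search space above.
space = {'subsample': [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1],
         'n_estimators': [100, 150, 200, 250, 300, 350, 400, 500]}

n_combinations = len(ParameterGrid(space))  # 7 * 8 distinct candidates
n_iter = 100                                # value passed to RandomizedSearchCV

print(n_combinations, n_iter >= n_combinations)
```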

## Review

```diff
+ And again very good work.
```

---

# Step 4. Model Refitting and Assessing Results

In [None]:
params = {'learning_rate': 0.05, 'n_estimators': 150,
          'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 2,
          'subsample': 0.9}

final = GradientBoostingClassifier(**params)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=123)
final_result = cross_val_score(final, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (final_result.mean(), final_result.std()))

In [None]:
# Confusion matrix plot

def plot_confusion_matrix(cm, classes, normalize=False):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
y_pred = GradientBoostingClassifier(**params).fit(x_train, y_train).predict(x_val)
cfm = confusion_matrix(y_val, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
# the target is is_ultra, so class 0 is the Smart plan and class 1 is the Ultra plan
plot_confusion_matrix(cfm, classes=["Smart", "Ultra"], normalize=True)
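
With `normalize=True`, each row of the matrix is divided by its row sum, so the diagonal holds the recall of each class. A small worked sketch with hypothetical counts (the numbers in `cm_toy` are made up for illustration and are not this model's results):

```python
import numpy as np

# Hypothetical counts: rows are true classes, columns are predicted classes.
cm_toy = np.array([[90, 10],
                   [30, 20]])

# Same row normalization as in plot_confusion_matrix.
cm_norm = cm_toy.astype('float') / cm_toy.sum(axis=1)[:, np.newaxis]

# After row normalization the diagonal is the per-class recall.
recall_per_class = np.diag(cm_norm)

print(cm_norm)
print(recall_per_class)
```

In this toy case the second class has a recall of 20 / 50 = 0.4 even though 110 of the 150 predictions are correct.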

In [None]:
# calculate the fpr and tpr for all thresholds of the classification

probs = GradientBoostingClassifier(**params).fit(x_train, y_train).predict_proba(x_val)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_val, preds)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
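
The AUC reported in the legend can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal sketch on four made-up scores (not model output) compares that pairwise count with `roc_auc_score`:

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, chosen for illustration only.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]

# Fraction of (positive, negative) pairs ranked correctly.
# Ties would count as half a pair; there are none in this toy data.
pairs = list(itertools.product(pos, neg))
auc_by_pairs = sum(p > n for p, n in pairs) / len(pairs)

auc_sklearn = roc_auc_score(y_true_toy, scores_toy)
print(auc_by_pairs, auc_sklearn)
```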

# Conclusion
- The final model is a gradient boosting model, as it showed the most potential among the three baselines.
- The final accuracy after tuning is 0.808.
- However, the recall rate is extremely low, and this could be a very costly problem for many companies. Relying solely on accuracy is sometimes not very informative, because the dataset could be imbalanced.
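
The gap between accuracy and recall on imbalanced data can be illustrated with a minimal sketch. The labels below are made up (8 negatives, 2 positives) and are not this model's predictions:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels: 8 negatives and 2 positives.
y_true_toy = [0] * 8 + [1] * 2
# All negatives are predicted correctly, but only one of the two positives is found.
y_pred_toy = [0] * 8 + [1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)               # 9 of 10 correct
rec = recall_score(y_true_toy, y_pred_toy)                 # 1 of 2 positives found
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)  # mean of per-class recall

print(acc, rec, bal_acc)
```

Here accuracy is 0.9 while recall for the positive class is only 0.5, which is why a metric such as recall or balanced accuracy is worth reporting next to accuracy.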

## Review

```diff
+ I'm just noting that you did everything at a very high level :)
```

---