Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

In [1]:
import pandas as pd

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [3]:
df = pd.read_csv('/datasets/users_behavior.csv')
df

      calls  minutes  messages   mb_used  is_ultra
0      40.0   311.90      83.0  19915.42         0
1      85.0   516.75      56.0  22696.96         0
2      77.0   467.66      86.0  21060.45         0
3     106.0   745.53      81.0   8437.39         1
4      66.0   418.74       1.0  14502.75         0
...     ...      ...       ...       ...       ...
3209  122.0   910.98      20.0  35124.90         1
3210   25.0   190.36       0.0   3275.61         0
3211   97.0   634.44      70.0  13974.06         0
3212   64.0   462.32      90.0  31239.78         0


The goal is to predict the plan from usage, so 'is_ultra' is the target and the remaining columns (calls, minutes, messages, mb_used) are the features.

In [4]:
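# Keep 60% of the rows for training and set aside the other 40%
df_train, df_other = train_test_split(df, test_size=0.4, random_state=42)
# Split the remaining 40% in half: 20% test, 20% validation
df_test, df_valid = train_test_split(df_other, test_size=0.5, random_state=42)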

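The data is split into three sets (training, test, and validation) at a ratio of 3:1:1, i.e. 60% / 20% / 20%. In this notebook the test set is used to compare models and the validation set is held back for the final check of the chosen model.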
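The two-step split can be checked on a small stand-in table. This is a minimal sketch, not the notebook's data: the `toy` frame and its 100 rows are made up for illustration, and only the `train_test_split` calls mirror the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the subscriber table: 100 rows, one feature, one target.
toy = pd.DataFrame({"calls": range(100), "is_ultra": [0, 1] * 50})

# Step 1: keep 60% for training, set aside 40%.
toy_train, toy_other = train_test_split(toy, test_size=0.4, random_state=42)
# Step 2: split the 40% in half, giving 20% + 20% of the original rows.
toy_test, toy_valid = train_test_split(toy_other, test_size=0.5, random_state=42)

sizes = (len(toy_train), len(toy_test), len(toy_valid))
print(sizes)

# The three parts should not share any rows.
all_index = list(toy_train.index) + list(toy_test.index) + list(toy_valid.index)
no_overlap = len(set(all_index)) == len(toy)
print(no_overlap)
```

Because each row lands in exactly one part, scores on the test and validation parts are not inflated by rows the model saw during training.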

In [5]:
# Separate the feature columns from the 'is_ultra' target in each split
train_features = df_train.drop(['is_ultra'], axis=1)
train_target = df_train['is_ultra']
test_features = df_test.drop(['is_ultra'], axis=1)
test_target = df_test['is_ultra']
valid_features = df_valid.drop(['is_ultra'], axis=1)
valid_target = df_valid['is_ultra']

In [6]:
# No random_state is set here, so this score can vary slightly between runs
model_dt = DecisionTreeClassifier()
model_dt.fit(train_features, train_target)
model_dt.score(test_features, test_target)

0.7309486780715396

Accuracy for the decision tree is about .73, just below the .75 threshold.

In [7]:
for estim in range(5, 101, 5):
    model_rf = RandomForestClassifier(n_estimators=estim, max_depth=10, random_state=42)
    model_rf.fit(train_features, train_target)
    accuracy = model_rf.score(test_features, test_target)
    print('N-Estimator:', estim, 'Accuracy:', accuracy)

N-Estimator: 5 Accuracy: 0.7993779160186625
N-Estimator: 10 Accuracy: 0.7962674961119751
N-Estimator: 15 Accuracy: 0.8009331259720062
N-Estimator: 20 Accuracy: 0.8009331259720062
N-Estimator: 25 Accuracy: 0.8009331259720062
N-Estimator: 30 Accuracy: 0.7993779160186625
N-Estimator: 35 Accuracy: 0.7993779160186625
N-Estimator: 40 Accuracy: 0.7993779160186625
N-Estimator: 45 Accuracy: 0.8009331259720062
N-Estimator: 50 Accuracy: 0.8009331259720062
N-Estimator: 55 Accuracy: 0.7978227060653188
N-Estimator: 60 Accuracy: 0.7962674961119751
N-Estimator: 65 Accuracy: 0.7993779160186625
N-Estimator: 70 Accuracy: 0.8040435458786936
N-Estimator: 75 Accuracy: 0.8055987558320373
N-Estimator: 80 Accuracy: 0.8087091757387247
N-Estimator: 85 Accuracy: 0.8040435458786936
N-Estimator: 90 Accuracy: 0.8040435458786936
N-Estimator: 95 Accuracy: 0.8040435458786936
N-Estimator: 100 Accuracy: 0.8040435458786936


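Looking for the highest accuracy on the test set, Random Forest with n_estimators=80 is the winner at about .809. Every setting from 5 to 100 trees clears the .75 threshold, and the spread between them is small (roughly .796 to .809).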
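Rather than reading the best value off the printed list, the loop can keep track of it. The sketch below assumes synthetic data from `make_classification` in place of the subscriber table; the names `X_eval`, `best_estim`, and `best_accuracy` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 rows, 4 features, binary target.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=42)

best_estim, best_accuracy = None, 0.0
for estim in range(5, 51, 5):
    model = RandomForestClassifier(n_estimators=estim, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    accuracy = model.score(X_eval, y_eval)
    # Strict ">" keeps the smallest forest when several sizes tie.
    if accuracy > best_accuracy:
        best_estim, best_accuracy = estim, accuracy

print('Best N-Estimator:', best_estim, 'Accuracy:', best_accuracy)
```

Preferring the smallest forest on ties is one possible design choice; it keeps training and prediction cheaper when extra trees do not improve the score.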

In [8]:
model_rf = RandomForestClassifier(n_estimators=80, max_depth=10, random_state=42)
model_rf.fit(train_features, train_target)
accuracy = model_rf.score(valid_features, valid_target)
accuracy

0.8149300155520995

Accuracy of the chosen model on the validation set is about .815, above the threshold of .75.

Sanity checking the models

In [9]:
dt_predict_valid = model_dt.predict(valid_features)
rmse_dt = mean_squared_error(valid_target, dt_predict_valid)**0.5
rmse_dt

0.5034870627881349

In [10]:
rf_predict_valid = model_rf.predict(valid_features)
rmse_rf = mean_squared_error(valid_target, rf_predict_valid)**0.5
rmse_rf

0.43019761092769965

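Random Forest still wins in the sanity check: its RMSE on the validation set is lower (.43 vs. .50), and RMSE should be as close to 0 as possible. For 0/1 labels, RMSE is the square root of the misclassification rate, so this result agrees with the accuracy comparison rather than adding independent evidence.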
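The link between RMSE and accuracy for binary labels can be verified directly. This is a small sketch with made-up labels and predictions (not taken from the dataset), using the same `mean_squared_error(...) ** 0.5` form as the cells above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions for illustration: 2 of the 8 are wrong.
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 1, 0, 0, 1])

accuracy = accuracy_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5

# With 0/1 labels every squared error is 0 or 1, so MSE equals the error rate
# and RMSE equals sqrt(1 - accuracy).
matches = bool(np.isclose(rmse, (1 - accuracy) ** 0.5))
print(accuracy, rmse, matches)
```

The same relation holds for the numbers above: the Random Forest's validation accuracy of 0.8149 gives sqrt(1 - 0.8149) ≈ 0.4302, which is the RMSE reported in the last cell.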