### **Business Problem**
Mobile carrier Megaline has found that many of its subscribers still use legacy plans. The company wants a model that analyzes subscriber behavior and recommends one of its newer plans, Smart or Ultra, with an accuracy score of at least 75%. For this binary classification task I will train models with three classification algorithms and deliver the one with the highest accuracy score to Megaline. The three learning algorithms are:

1. Decision Tree Classifier
2. Random Forest Classifier
3. Logistic Regression Classifier

The data Megaline has provided represents the behavior of users who have already switched from a legacy plan to one of the new plans. The data was pre-processed in a previous analysis I did for Megaline. Every observation in the dataset contains one user's behavior for one month. The information given is as follows:

- calls: number of calls
- minutes: total call duration in minutes
- messages: number of text messages
- mb_used: Internet traffic used in MB
- is_ultra: plan for the current month (Ultra = 1, Smart = 0)

I will use the accuracy_score function from the sklearn.metrics module to evaluate the trained models.
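To make the metric concrete, here is a minimal sketch of what an accuracy score measures: the fraction of predictions that match the true labels. The `accuracy` helper and the label lists are made up for illustration; the notebook itself relies on sklearn's accuracy_score, which should compute the same quantity for this kind of binary target.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one label")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels for illustration only (1 = Ultra, 0 = Smart).
true_plans = [0, 0, 1, 1, 0, 1, 0, 0]
predicted_plans = [0, 1, 1, 1, 0, 0, 0, 0]

example_accuracy = accuracy(true_plans, predicted_plans)
print(example_accuracy)  # 6 of 8 predictions match -> 0.75
```

An accuracy of 0.75 on this toy example would sit exactly at Megaline's minimum threshold.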

### Import Libraries & Dataset

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('data/TripleTen user behavior.csv')

### Descriptive Statistics & Data Preparation
First let's inspect the first five rows of the dataset.

In [3]:
df.head()

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [4]:
def summary_stats(df):
    stats = pd.DataFrame(index = list(df))
    stats['type'] = df.dtypes
    stats['count'] = df.count()
    stats['nunique'] = df.nunique()
    stats['%unique'] = stats['nunique'] / len(df) * 100
    stats['null'] = df.isnull().sum()
    stats['%null'] = stats['null'] / len(df) * 100
    stats['min'] = df.min()
    stats['max'] = df.max()
    return stats

summary_stats(df)

|   | type | count | nunique | %unique | null | %null | min | max |
|---|---|---|---|---|---|---|---|---|
| calls | float64 | 3214 | 184 | 5.724953 | 0 | 0.0 | 0.0 | 244.0 |
| minutes | float64 | 3214 | 3149 | 97.977598 | 0 | 0.0 | 0.0 | 1632.06 |
| messages | float64 | 3214 | 180 | 5.600498 | 0 | 0.0 | 0.0 | 224.0 |
| mb_used | float64 | 3214 | 3203 | 99.657747 | 0 | 0.0 | 0.0 | 49745.73 |
| is_ultra | int64 | 3214 | 2 | 0.062228 | 0 | 0.0 | 0.0 | 1.0 |


The data has **3,214 rows & 5 columns**, including the target variable, which means I will have **4 features** to train the models on. There are no missing values and all 4 features are numeric. The target variable here is the **is_ultra** column which contains a **1** if the user is on the Ultra phone plan, or a **0** if the user is not on the Ultra phone plan.

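I will now split the data into training, validation, & test sets. No separate test data exists, so we have to partition our own. The first split uses test_size=0.6, which holds out 60% of the rows, & the second split divides that held-out portion in half, so the resulting ratio is roughly 4:3:3 (40% training, 30% validation, 30% test) rather than 3:1:1. A 3:1:1 split would need test_size=0.4 in the first call. To do the split we will first need to separate the target variable & features from each other.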

In [5]:
# separating the features into their own variable
features = df.drop(['is_ultra'], axis=1)

# separating the target variable from the features
target = df['is_ultra']

# test_size=0.6 holds out 60% of the rows, leaving 40% for training
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.6, random_state=54321
)

# splitting the held-out data in half: 30% for testing, 30% for validation
features_test, features_valid, target_test, target_valid = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=54321
)



In [26]:
print(features_test.shape, target_test.shape)
print(features_valid.shape, target_valid.shape)
print(features_train.shape, target_train.shape)

(964, 4) (964,)
(965, 4) (965,)
(1285, 4) (1285,)
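The row counts above can be reproduced with a little arithmetic. The sketch below assumes that train_test_split rounds the held-out share up and gives the remainder to the other part, which is consistent with the shapes printed above. The helper name `two_stage_split_sizes` is hypothetical and not part of the notebook.

```python
import math


def two_stage_split_sizes(n_rows, first_test_size, second_test_size):
    """Return (train, first_part, second_part) row counts for two chained splits.

    Assumption: the held-out share of each split is rounded up and the
    remainder goes to the other part.
    """
    held_out = math.ceil(n_rows * first_test_size)
    n_train = n_rows - held_out
    second_part = math.ceil(held_out * second_test_size)
    first_part = held_out - second_part
    return n_train, first_part, second_part


# The split as run in the notebook: test_size=0.6, then test_size=0.5.
as_run = two_stage_split_sizes(3214, 0.6, 0.5)
print(as_run)  # (1285, 964, 965) -> train, test, validation

# What a 3:1:1 split would look like: test_size=0.4, then test_size=0.5.
three_one_one = two_stage_split_sizes(3214, 0.4, 0.5)
print(three_one_one)  # (1928, 643, 643)
```

With test_size=0.6 only about 40% of the rows (1,285) are used for training. Holding out 40% instead would leave 1,928 rows for training.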


### Decision Tree Algorithm

First I will train the model using a decision tree. The most important hyperparameter of a decision tree is the tree's maximum depth. To find the best value, I will create a loop that trains the model with each maximum depth from 1 to 9, makes predictions on the validation data with the trained model, & calculates the model's accuracy. Once the loop terminates I will have saved the best performing model, its accuracy, & its value of the max_depth hyperparameter.

In [49]:
best_tree_model = None
best_tree_score = 0
best_depth = 0

for depth in range(1, 10):
    # training model
    model = DecisionTreeClassifier(max_depth=depth, random_state=54321, criterion='gini')
    model.fit(features_train, target_train)
    # using trained model to make predictions on validation data
    tree_prediction = model.predict(features_valid)
    # calculating the accuracy of the trained model on validation data
    valid_accuracy = accuracy_score(target_valid, tree_prediction)
    # checking if this iteration's trained model performed better than the last
    if valid_accuracy > best_tree_score:
        best_tree_model = model
        best_tree_score = valid_accuracy
        best_depth = depth
        
print(f"""  
Accuracy on Validation Set: {best_tree_score}
Optimal Depth Value: {best_depth}
""")

  
Accuracy on Validation Set: 0.805181347150259
Optimal Depth Value: 5



It appears that the optimal maximum depth of the tree is 5. This model correctly predicted a user's phone plan about 80.5% of the time on the validation data. No other maximum depth from 1 to 9 scored higher.
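One detail of the selection loop above is worth spelling out: it replaces the stored model only when the new score is strictly higher, so a deeper tree that merely ties the current best is not selected. A minimal sketch of that rule follows; `select_best` is a hypothetical helper and the scores are invented for illustration, not results from this notebook.

```python
def select_best(scores_by_value):
    """Keep the first hyperparameter value whose score is strictly the highest so far."""
    best_value, best_score = None, 0
    for value, score in scores_by_value.items():
        if score > best_score:  # a tie does not replace the current best
            best_value, best_score = value, score
    return best_value, best_score


# Hypothetical validation accuracies keyed by max_depth (not real results).
hypothetical_scores = {3: 0.79, 4: 0.80, 5: 0.81, 6: 0.81, 7: 0.80}

print(select_best(hypothetical_scores))  # (5, 0.81): depth 6 ties but is not selected
```

Preferring the smaller depth on a tie is a reasonable default, since the shallower tree is the simpler model.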

In [38]:
tree_test_prediction = best_tree_model.predict(features_test)
tree_accuracy_test = accuracy_score(target_test, tree_test_prediction)
print(f"Accuracy on Test Data: {tree_accuracy_test}")

Accuracy on Test Data: 0.7800829875518672


The model was less accurate on the test data, with an accuracy of 78%.

### Random Forest Algorithm

Let's move on to Random Forest and find the best value for the n_estimators hyperparameter with a loop similar to the one we used to tune the decision tree's max_depth hyperparameter. The loop tries every value from 1 to 49.

In [54]:
# random forest model
best_forest_model = None
best_forest_score = 0
best_est = 0

for est in range(1, 50):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)
    prediction = model.predict(features_valid)
    score = accuracy_score(target_valid, prediction)
    if score > best_forest_score:
        best_forest_score = score
        best_forest_model = model
        best_est = est

print(f"""     
Accuracy on Validation Data: {best_forest_score}
Best Number of Estimators for Random Forest Model: {best_est}
""")

     
Accuracy on Validation Data: 0.8186528497409327
Best Number of Estimators for Random Forest Model: 20



It appears that the optimal number of estimators for our Random Forest model is 20, which accurately predicted the user's plan about 82% of the time on the validation data.
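The loop above reports only the single best n_estimators value. To see how much validation accuracy varies across candidates, one option is to keep every score. The sketch below does that on synthetic data generated with make_classification, not on Megaline's dataset. The helper `tune_n_estimators` and all variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def tune_n_estimators(X_train, y_train, X_valid, y_valid, candidates, seed=54321):
    """Train one forest per candidate and return (best_n, all validation scores)."""
    scores = {}
    for n in candidates:
        model = RandomForestClassifier(n_estimators=n, random_state=seed)
        model.fit(X_train, y_train)
        scores[n] = accuracy_score(y_valid, model.predict(X_valid))
    # max() returns the first maximum, so ties go to the smallest candidate
    best_n = max(scores, key=scores.get)
    return best_n, scores


# Synthetic stand-in data with 4 features (illustration only).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

best_n, scores = tune_n_estimators(X_train, y_train, X_valid, y_valid, range(1, 11))
print(best_n, scores[best_n])
```

If the scores for neighbouring values differ by only a fraction of a percentage point, the exact "best" value is probably not very meaningful.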

In [55]:
forest_test_predictions = best_forest_model.predict(features_test)
forest_test_accuracy = accuracy_score(target_test, forest_test_predictions)
print(f"Accuracy on Test Data: {forest_test_accuracy}")

Accuracy on Test Data: 0.770746887966805


The random forest model was also less accurate on the test data, with an accuracy of 77%.

### Logistic Regression Algorithm

In [48]:
# training logistic regression model
log_reg_model = LogisticRegression(random_state=54321, solver='liblinear')
log_reg_model.fit(features_train, target_train)
prediction = log_reg_model.predict(features_valid)

score_valid = accuracy_score(target_valid, prediction)

print(f"Accuracy of Logistic Regression Model on Validation Set: {score_valid}")

Accuracy of Logistic Regression Model on Validation Set: 0.7274611398963731


The Logistic Regression model performed the worst on the validation data, with an accuracy score of about 73%, which is below the 75% minimum threshold set by Megaline. Let's see how it performs on the test data.

In [47]:
log_reg_test_prediction = log_reg_model.predict(features_test)
log_reg_test_accuracy = accuracy_score(target_test, log_reg_test_prediction)

print(f'Accuracy of Logistic Regression Model on Test Set: {log_reg_test_accuracy}')

Accuracy of Logistic Regression Model on Test Set: 0.6939834024896265


The Logistic Regression model's accuracy fell further below the minimum threshold with an accuracy of 69% on the test data.
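One way to put accuracy figures like these in context is to compare them with a baseline that always predicts the most common plan. The sketch below uses made-up labels, not Megaline's data, and the helper `majority_baseline_accuracy` is hypothetical. On the real data the same idea could be applied to the target column.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return (most common label, accuracy of always predicting that label)."""
    if len(labels) == 0:
        raise ValueError("need at least one label")
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Made-up target values for illustration only (1 = Ultra, 0 = Smart).
hypothetical_targets = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

print(majority_baseline_accuracy(hypothetical_targets))  # (0, 0.7)
```

A model whose accuracy is close to this baseline has learned little beyond the class proportions.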

### Conclusion

The best performing model was, surprisingly, the Decision Tree Classifier with a maximum depth of 5, which correctly predicted the user's phone plan 78% of the time on the test data. The second best model was the Random Forest with 20 estimators, with an accuracy of 77% on the test data, & the third best was Logistic Regression with an accuracy of 69%, which falls below Megaline's minimum threshold of 75%.
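The comparison above can be written as a short check against Megaline's threshold. The accuracies are the test-set values printed earlier in this notebook. The variable names are chosen for this sketch only.

```python
THRESHOLD = 0.75  # Megaline's minimum accuracy

# Test-set accuracies as printed in the sections above.
test_accuracy = {
    "Decision Tree": 0.7800829875518672,
    "Random Forest": 0.770746887966805,
    "Logistic Regression": 0.6939834024896265,
}

# Keep only the models that meet the threshold, then pick the most accurate one.
passing = {name: acc for name, acc in test_accuracy.items() if acc >= THRESHOLD}
best_model = max(passing, key=passing.get)

print(sorted(passing))  # ['Decision Tree', 'Random Forest']
print(best_model)       # Decision Tree
```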