# Megaline Phone Plan Recommendation

# Introduction

This project builds on an earlier project (Sprint 3), in which I processed the call/text/internet data for Megaline phone plan customers and ran statistical analysis to determine the most lucrative phone plans. This time, I am using this data to recommend a new phone plan to customers based on their calls/texts/internet data.

There are two possible plans. I will try decision tree classification, random forest classification, and logistic regression models, and use accuracy to select the best-performing model. An accuracy of at least 0.75 is needed for Megaline to be able to offer robust recommendations. I will tune the hyperparameters of these models to obtain the best performance.

I am going to split the data into a training set, a validation set, and a test set up front and use the same sets throughout this project. I can use the validation set while I tweak hyperparameters, checking accuracy for each iteration, and then use the test set once at the very end of the evaluation to test the final model's performance. Then, I'll train the highest-performing model with the highest-performing hyperparameters on the entirety of the dataset to get an even better model. 

I will also perform a sanity check to ensure that the chosen model performs better than chance. To do this, I will estimate the accuracy obtained by guessing and compare the model's accuracy to it.

Note: in the `is_ultra` column, 1 indicates the Ultra plan and 0 indicates the Smart plan.

In [1]:
# Import necessary libraries and models/functions

import pandas as pd

# from sklearnex import patch_sklearn # Enhanced performance package for Intel processors
# patch_sklearn()

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from joblib import dump

# EDA

In [2]:
try:
    df = pd.read_csv('users_behavior.csv')
except FileNotFoundError: # Fall back to the alternate path only if the local file is missing
    df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.head() # Look at dataset firsthand

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [4]:
df.info() # Check for datatypes, dataset shape/size, any missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df.describe(include='all') # Make sure numbers are reasonable

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


It appears from the low (<0.5) mean and the median of 0.0 that the Smart plan is most common. Let's quantify how many users there are for each plan, just out of curiosity.

In [6]:
ultra_users = df.is_ultra.sum()
print("Number of Ultra plan users:", ultra_users)
print("Fraction of users with Ultra plan:", ultra_users / len(df))

Number of Ultra plan users: 985
Fraction of users with Ultra plan: 0.30647168637212197


So, out of ~3200 total users, a little under a third of them are Ultra plan users. The rest are Smart plan users.

# Model Testing

Now we will split the data into three different sets for training, validation, and testing, before employing a decision tree classification, a random forest classification, and a logistic regression classification model and ultimately choosing a top model.

## Split data

Let's collect the features into one dataset by dropping the column with the target feature. And let's also collect the target feature into a series of its own.

In [7]:
features = df.drop('is_ultra', axis=1)
target = df.is_ultra

I want 60% of the data to be used for training, and 20% for the validation and test sets each. 
We will first split the features and target datasets into a training set and a test set, at an 80/20 ratio. 
Then, we can further divide this new training set into an ultimate training set and a validation set. In this case, the validation set needs to come from 25% of the training set to give us the 60/20/20 ratio. This is because 25% of 80 equals 20.

In [8]:
x_train, features_test, y_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=123)

features_train, features_valid, target_train, target_valid = train_test_split(
    x_train, y_train, test_size=0.25, random_state=123)
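As a quick check of the 60/20/20 arithmetic described above, the sketch below computes the expected row counts for the 3214-row dataset. It assumes that the held-out share is rounded up to a whole number of rows at each step, and the helper name `split_sizes` is hypothetical, introduced only for this illustration.

```python
import math


def split_sizes(n_rows, test_share=0.2, valid_share_of_rest=0.25):
    """Return (train, valid, test) row counts for a two-step split.

    Assumption: the held-out share is rounded up at each step.
    """
    n_test = math.ceil(n_rows * test_share)             # first split: 80/20
    n_rest = n_rows - n_test
    n_valid = math.ceil(n_rest * valid_share_of_rest)   # second split: 75/25 of the rest
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(3214)
print("train / valid / test rows:", n_train, n_valid, n_test)
print("shares:", [round(n / 3214, 3) for n in (n_train, n_valid, n_test)])
```

Under that rounding assumption, the validation and test sets end up the same size, so accuracies measured on them are directly comparable.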

## Decision tree

Let's first try out the decision tree. Its main hyperparameter is the maximum tree depth. Let's write a loop to try out a variety of tree depths and find the one with the highest accuracy.

In [9]:
best_depth = 0
best_score = 0
model_tree = None

for depth in range(1,11):
    model = DecisionTreeClassifier(max_depth=depth, random_state = 123) # create instance of class
    model.fit(features_train, target_train) # Fit model with training data
    score = model.score(features_valid, target_valid) # Calculate accuracy
    
    # Document the highest-performing hyperparameters, along with their corresponding accuracy
    if score > best_score: 
        best_score = score
        best_depth = depth
        model_tree = model
        
print(f'Best accuracy: {best_score} obtained using max depth {best_depth}.') 

Best accuracy: 0.7947122861586314 obtained using max depth 9.


It looks like the simple decision tree has an accuracy of 79.5% when using the validation set.
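When tuning tree depth, it can also help to record training accuracy next to validation accuracy, since a widening gap between the two is a common sign of overfitting. The sketch below does this on synthetic data generated with `make_classification`. It is a stand-in for illustration, not the Megaline dataset, and its numbers say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Megaline dataset), with 4 features like the original.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=123)

results = []  # (depth, training accuracy, validation accuracy)
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    results.append(
        (depth, tree.score(X_train, y_train), tree.score(X_valid, y_valid)))

# Print both accuracies side by side so the gap is easy to read off
for depth, train_acc, valid_acc in results:
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")
```

The same two-column view could be produced for the real data by swapping in `features_train`, `target_train`, `features_valid`, and `target_valid`.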

## Random forest

The random forest model's main hyperparameters are maximum tree depth and the number of estimators. Let's iterate through both.

In [10]:
best_score = 0
best_est = 0
best_depth = 0
model_forest = None

for est in range (10, 81, 10):
    for depth in range (1, 11):
        model = RandomForestClassifier(
            max_features=1.0, # The lack of this hyperparameter was causing warnings.
            n_estimators=est, max_depth=depth, random_state=123)
        model.fit(features_train, target_train)
        score = model.score(features_valid, target_valid)
        
        if score > best_score:
            best_score = score
            best_est = est
            best_depth = depth
            model_forest = model
            
print(f"Best accuracy: {best_score} obtained using {best_est} trees with\
 max_depth {best_depth}.")

Best accuracy: 0.7978227060653188 obtained using 40 trees with max_depth 6.


Hmm, at 79.8% the validation accuracy of this model is only slightly better than the decision tree's 79.5%, but it is better.

## Logistic regression

I am not tuning any hyperparameters for logistic regression here, so an intermediate validation set isn't strictly needed for this model. Still, the code below fits the model on the 60% training set and scores it on the validation set, which keeps the comparison with the other two models on equal footing and leaves the test set untouched.

In [11]:
model = LogisticRegression(solver='liblinear', random_state=123) # Using liblinear for small dataset
model.fit(features_train, target_train)

score_train = model.score(features_train, target_train)
print(f'Training accuracy: {score_train}')

score_valid = model.score(features_valid, target_valid)
print(f'Validation accuracy: {score_valid}')

Training accuracy: 0.7142116182572614
Validation accuracy: 0.702954898911353


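This logistic regression model performs below both of the other models we tested, and below the cutoff of 75%. The validation accuracy is only about one percentage point lower than the training accuracy, so there is at most mild overfitting; the low accuracy on both sets suggests the model is underfitting instead.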

## Top model

We found the top-performing model in terms of validation accuracy to be the random forest classifier, using 40 trees with a max depth of 6.

In [12]:
score_test = model_forest.score(features_test, target_test)
print(f'Testing accuracy: {score_test}')

Testing accuracy: 0.8133748055987559


The tested accuracy is 81.3%, and it has exceeded the validation set accuracy. It was slower to train than the other models, so for a larger dataset, this may become too large of a drawback - but it worked quickly enough for this small dataset.

Let's do a sanity check. With two plan options, guessing each plan with equal probability would be right about 50% of the time. The plans are not evenly split, though: it is approximately a 70/30 split (Smart/Ultra), so simply guessing "Smart" for every observation would yield approximately 70% accuracy. That is the bar the model needs to clear. At 81.3% test accuracy, our model performs better than both chance and this majority-class baseline.
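The baselines in this sanity check can be worked out directly from the counts reported earlier. The sketch below is a rough calculation on the full dataset's class balance; the class balance of the test set alone may differ slightly, so treat the figures as approximate. The "proportional guess" baseline is an extra illustration that is not part of the analysis above.

```python
n_users = 3214                         # rows in the dataset (from df.info())
n_ultra = 985                          # Ultra plan users (counted earlier)
model_accuracy = 0.8133748055987559    # test accuracy reported above

p_ultra = n_ultra / n_users

coin_flip = 0.5                                  # guess each plan with equal probability
proportional = p_ultra**2 + (1 - p_ultra)**2     # guess in proportion to class frequencies
majority = max(p_ultra, 1 - p_ultra)             # always guess the more common plan (Smart)

baselines = [
    ("coin flip", coin_flip),
    ("proportional guess", proportional),
    ("always Smart", majority),
]
for name, acc in baselines:
    print(f"{name}: {acc:.3f} (model beats it: {model_accuracy > acc})")
```

The "always Smart" baseline is the strictest of the three, which is why it is the one to compare against.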

In [13]:
# Save the model to a joblib file for sharing
dump(model_forest, 'MegalinePlanRecommender.joblib')

['MegalinePlanRecommender.joblib']
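The introduction planned a final refit of the chosen model on the whole dataset, and a saved model is only useful if it can be loaded again. The sketch below shows one way both steps might look. It runs on synthetic stand-in data rather than the Megaline dataset, the labelling rule and the file name `demo_recommender.joblib` are made up for the demo, and only the hyperparameters (40 trees, max depth 6) come from the tuning results above.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the Megaline dataset): four columns that mimic
# the ranges of calls, minutes, messages, and mb_used.
rng = np.random.default_rng(123)
X_all = rng.uniform(low=[0, 0, 0, 0], high=[244, 1632, 224, 49745], size=(300, 4))
y_all = (X_all[:, 3] > 25000).astype(int)  # arbitrary labelling rule for the demo

# Stand-in for the tuned model: 40 trees with max depth 6, as selected above.
tuned = RandomForestClassifier(
    max_features=1.0, n_estimators=40, max_depth=6, random_state=123)

# clone() copies the hyperparameters without any fitted state,
# so the copy can be refit on every available row.
final_model = clone(tuned).fit(X_all, y_all)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_recommender.joblib")  # hypothetical file name
    dump(final_model, path)
    restored = load(path)

new_customer = np.array([[60.0, 420.0, 30.0, 17000.0]])  # one made-up customer
prediction = restored.predict(new_customer)[0]
print("Recommended plan:", "Ultra" if prediction == 1 else "Smart")
```

With the real model, which was fitted on a DataFrame, new observations should be passed as a DataFrame with the same column names (`calls`, `minutes`, `messages`, `mb_used`) to avoid feature-name warnings.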

# Conclusion

I have attempted to offer recommendations to customers for a new phone plan (Ultra vs Smart) based on their calls/texts/internet data, using supervised learning classifier models. I tried out a decision tree classifier, a random forest classifier, and a logistic regression classifier, and split my data 60/20/20 (training/validation/testing). The random forest model performed best in terms of accuracy, with a tested accuracy of 81.3%, though it was the slowest by far to train. For this small dataset, the training speed was very much adequate. I included code to save the model into a joblib file.