# Megaline Phone Plan Recommendation

# Introduction

This project builds on an earlier project (Sprint 3), in which I processed call, text, and internet data for Megaline phone plan customers and ran a statistical analysis to determine the most lucrative phone plans. This time, I am using the same kind of data to recommend a new phone plan to each customer based on their call, text, and internet usage.

There are two possible plans, so this is a binary classification problem. I will try decision tree, random forest, and logistic regression classifiers, and use accuracy to select the best-performing model. An accuracy of at least 0.75 is required for Megaline to be able to offer robust recommendations. I will tune the hyperparameters of each model to obtain the best performance.

I am going to split the data into a training set, a validation set, and a test set up front and use the same sets throughout this project. I will use the validation set while tuning hyperparameters, checking accuracy at each iteration, and use the test set only once per model, at the very end of its evaluation, to measure final performance. Then I will retrain the highest-performing model, with its best hyperparameters, on the entire dataset, with the aim of getting an even better model.
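The workflow described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the 60/20/20 split ratios, the `random_state` value, and the `max_depth` search range are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = (X[:, 0] + 0.3 * rng.random(1000) > 0.8).astype(int)

# Split once, up front: 60% train, then halve the remaining 40%
# into validation and test (ratios are assumed, not from the project).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest)

# Tune one hyperparameter against the validation set only.
best_depth, best_valid_acc = None, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    if valid_acc > best_valid_acc:
        best_depth, best_valid_acc = depth, valid_acc

# Touch the test set exactly once, with the chosen hyperparameter.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=12345)
final_model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))

print("Set sizes:", len(X_train), len(X_valid), len(X_test))
print("Best max_depth:", best_depth)
print(f"Validation accuracy: {best_valid_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Keeping the test set out of the tuning loop means the final test accuracy is not biased by the hyperparameter search.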

I will also perform a sanity check to ensure that the chosen model performs better than chance. To do this, I will randomly assign a 0 or 1 to each observation and compare the accuracy of these random labels with the accuracy of the model's predictions, which also take the form of 0's and 1's.
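A random baseline of this kind might look like the sketch below. The labels here are synthetic stand-ins (an assumption, with roughly 30% ones); in the project they would be the real target values, and the resulting accuracy would be compared with the chosen model's accuracy.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for the true labels: ~30% Ultra (1), ~70% Smart (0).
y_true = (rng.random(10_000) < 0.3).astype(int)

# Coin-flip "predictions": each observation gets 0 or 1 with equal probability.
y_random = rng.integers(0, 2, size=y_true.size)

# Accuracy is the fraction of observations where the guess matches the label.
random_accuracy = float(np.mean(y_random == y_true))
print(f"Random-guess accuracy: {random_accuracy:.3f}")
```

With equal-probability guesses the expected accuracy is 0.5 regardless of the class balance, so a useful model should score clearly above that.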

Note: a value of 1 in `is_ultra` indicates the Ultra plan, whereas 0 indicates the Smart plan.

In [16]:
# Import necessary libraries and models/functions

import pandas as pd

from sklearnex import patch_sklearn # Enhanced performance package for Intel processors
patch_sklearn()

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from joblib import dump, load

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


# EDA

In [2]:
df = pd.read_csv('users_behavior.csv')

In [3]:
df.head() # Look at dataset firsthand

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [4]:
df.info() # Check for datatypes, dataset shape/size, any missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [7]:
df.describe(include='all') # Make sure numbers are reasonable

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


The low mean (<0.5) and the median of 0.0 for `is_ultra` indicate that the Smart plan is the more common of the two. Let's count how many users there are on each plan, just out of curiosity.

In [18]:
ultra_users = df.is_ultra.sum()
print("Number of Ultra plan users:", ultra_users)

Number of Ultra plan users: 985


So, out of 3214 total users, 985 (a little under a third) are Ultra plan users.
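The class proportions follow directly from the counts reported above. The short calculation below uses only those two numbers. As a side observation, not something the original analysis states, the Smart share is also the accuracy a model would get by always predicting Smart, which gives some context for the 0.75 target.

```python
total_users = 3214  # from df.info()
ultra_users = 985   # from df.is_ultra.sum()
smart_users = total_users - ultra_users

ultra_share = ultra_users / total_users
smart_share = smart_users / total_users

print("Smart plan users:", smart_users)      # 2229
print(f"Ultra share: {ultra_share:.3f}")     # 0.306
print(f"Smart share: {smart_share:.3f}")     # 0.694
```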

# Model Testing

Now we will split the data into three sets for training, validation, and testing. We will then fit a decision tree, a random forest, and a logistic regression classifier, and ultimately choose the best-performing model.

## Split data