# Which Model Predicts Megaline Phone Plan Preference Most Accurately?

## Project Description

This project uses pre-processed data from the Megaline mobile phone company describing its customers' monthly behavior (number of calls made, data used, etc.). The data comes from customers who have already switched to one of Megaline's new phone plans (Smart or Ultra) and is used to train several classification models. Each model is fitted to predict which of the new plans should be advertised to customers who have not yet switched to Smart or Ultra. The goal is to tune each model by iterating through combinations of hyperparameters, and to provide Megaline with the most accurate and time-efficient model for targeted advertising.

## Import Necessary Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.dummy import DummyClassifier

## Open Dataset

In [2]:
url = 'https://raw.githubusercontent.com/pvnkd0v3/megaline_model_training_tt_project/main/users_behavior.csv'
megaline = pd.read_csv(url)

## View Data

In [3]:
megaline

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


The megaline dataframe contains observations of 3214 customers of Megaline, a mobile carrier company. Each observation represents the monthly behavior of a single customer: the number of calls made, total call duration (in minutes), number of text messages sent, data used (in MB), and which of Megaline's new plans the user is subscribed to (0 for Smart, 1 for Ultra). **Note:** This dataset was processed previously in preparation for this project.
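Since the dataset is described as already processed, a quick check for missing values and unexpected labels is a cheap way to confirm that before modeling. The sketch below is an illustration only: `sample` is a hypothetical stand-in built from the first four rows displayed above, and the same calls could be run on the full `megaline` dataframe.

```python
import pandas as pd

# Hypothetical stand-in: the first four rows shown in the output above.
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0, 106.0],
    'minutes': [311.90, 516.75, 467.66, 745.53],
    'messages': [83.0, 56.0, 86.0, 81.0],
    'mb_used': [19915.42, 22696.96, 21060.45, 8437.39],
    'is_ultra': [0, 0, 0, 1],
})

# Count missing values per column; a processed dataset should have none.
missing_per_column = sample.isna().sum()

# Collect the distinct target labels; only 0 (Smart) and 1 (Ultra) are expected.
labels = {int(value) for value in sample['is_ultra'].unique()}

print(missing_per_column)
print('Labels found:', labels)
```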

## Define Model Features and Targets

In [4]:
features = megaline.drop('is_ultra', axis=1) #User behavior that will be used to predict target (new phone plan)
target = megaline['is_ultra'] #New phone plans to be predicted using features (user behavior)

## Create Training, Test, and Validation Datasets

In [5]:
#Hold out 20% of the data as the test dataset
features_train, features_test, target_train, target_test = train_test_split(features,
                                                                            target,
                                                                            test_size=0.20,
                                                                            random_state=12345)

#Split the remaining 80% into training (75%) and validation (25%) datasets
features_train, features_valid, target_train, target_valid = train_test_split(features_train,
                                                                              target_train,
                                                                              test_size=0.25,
                                                                              random_state=12345)

#Confirm sizes of datasets
print('Training dataset:')
print('X:', features_train.shape)
print('Y:', target_train.shape)
print()
print('Validation dataset:')
print('X:', features_valid.shape)
print('Y:', target_valid.shape)
print()
print('Test dataset:')
print('X:', features_test.shape)
print('Y:', target_test.shape)

Training dataset:
X: (1928, 4)
Y: (1928,)

Validation dataset:
X: (643, 4)
Y: (643,)

Test dataset:
X: (643, 4)
Y: (643,)
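The printed shapes correspond to a roughly 60/20/20 split: 20% is held out first, and then 25% of the remaining 80% (another 20% of the total) becomes the validation set. The arithmetic can be sketched as below. `split_sizes` is a hypothetical helper, and it assumes the held-out count is rounded up at each step, which reproduces the shapes printed above.

```python
import math

def split_sizes(n_samples, test_size, valid_size):
    """Return (train, valid, test) counts for a two-step split.

    Assumes each held-out count is rounded up and the remainder stays
    in the training portion.
    """
    n_test = math.ceil(test_size * n_samples)
    n_remaining = n_samples - n_test
    n_valid = math.ceil(valid_size * n_remaining)
    n_train = n_remaining - n_valid
    return n_train, n_valid, n_test

sizes = split_sizes(3214, test_size=0.20, valid_size=0.25)
print(sizes)
```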


## Test Accuracy of Different Models

In [6]:
#Takes a model class, a grid of hyperparameters to iterate through, and the
#features and targets of the training and validation datasets. Fits the model
#class with every hyperparameter combination in the grid, scores each fitted
#model on the validation dataset, and prints the combination that produced
#the highest accuracy along with that accuracy.

def find_best_model(model_class, 
                    param_grid, 
                    features_train, 
                    target_train, 
                    features_valid, 
                    target_valid):  
    
    best_score = 0  #Highest validation accuracy seen so far
    best_params = None #Hyperparameters that produced best_score
    
    for params in ParameterGrid(param_grid): #Iterate through every combination of hyperparameters in the grid
        model = model_class(**params, random_state=12345) #Build the model with the current combination
        model.fit(features_train, target_train) #Fit the model on the training dataset
        
        score = model.score(features_valid, target_valid) #Accuracy of the current model on the validation dataset
        if score > best_score: #Keep the current score and parameters if they beat the best so far
            best_score = score
            best_params = params

    print(f'Best parameters: {best_params}. Best accuracy: {best_score}') #Report the most accurate combination and its score
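A variant that returns the best parameters and score, rather than only printing them, would let later cells reuse the result instead of copying values by hand. The following is a minimal sketch of that idea, not the notebook's own function: `find_best_params` is a hypothetical name, and the data is synthetic (generated with `make_classification`) so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

def find_best_params(model_class, param_grid, X_train, y_train, X_valid, y_valid):
    """Return (best_params, best_score) over every combination in param_grid."""
    best_score, best_params = -1.0, None
    for params in ParameterGrid(param_grid):
        model = model_class(**params, random_state=12345)
        model.fit(X_train, y_train)
        score = model.score(X_valid, y_valid)  # validation accuracy
        if score > best_score:  # ties keep the earlier combination
            best_score, best_params = score, params
    return best_params, best_score

# Synthetic stand-in data with four features (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_params, best_score = find_best_params(
    DecisionTreeClassifier, {'max_depth': range(1, 6)},
    X_train, y_train, X_valid, y_valid)
print(best_params, best_score)
```

With the returned dictionary, the final model could be built as `model_class(**best_params, random_state=12345)` without retyping the values.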

### Decision Tree Model

In [7]:
param_grid_dt = {'max_depth' : range(1, 51)} #Chosen parameter grid for DecisionTreeClassifier model class

find_best_model(DecisionTreeClassifier, 
               param_grid_dt,
               features_train,
               target_train,
               features_valid,
               target_valid)

Best parameters: {'max_depth': 7}. Best accuracy: 0.7744945567651633


Over depths from 1 to 50 (capped to limit run time), the Decision Tree model was most accurate at a depth of 7, which returned an accuracy of approximately 0.774 on the validation dataset.

### Random Forest Model

In [8]:
param_grid_rf = {'n_estimators': range(1, 51, 10), #Chosen parameter grid for RandomForestClassifier model class (n_estimators: 1, 11, 21, 31, 41)
                'max_depth': range(1, 51)}

find_best_model(RandomForestClassifier,
               param_grid_rf,
               features_train,
               target_train,
               features_valid,
               target_valid)


Best parameters: {'max_depth': 15, 'n_estimators': 21}. Best accuracy: 0.8009331259720062


Over estimator counts from 1 to 41 in steps of 10 and depths from 1 to 50 (both ranges capped to limit run time), the Random Forest model was most accurate with 21 estimators and a depth of 15, which returned an accuracy of approximately 0.801 on the validation dataset. Both models exceed the 0.75 accuracy threshold, but the Random Forest model is more accurate than the Decision Tree model.

### Logistic Regression Model

In [9]:
param_grid_lr = {'solver': ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']} #Chosen parameter grid for LogisticRegression model class

find_best_model(LogisticRegression,
               param_grid_lr,
               features_train,
               target_train,
               features_valid,
               target_valid)

Best parameters: {'solver': 'lbfgs'}. Best accuracy: 0.7262830482115086




The Logistic Regression model was most accurate when using the 'lbfgs' solver. On the validation dataset it returned an accuracy of approximately 0.726, making it the least accurate of the three models and the only one that falls short of the 0.75 accuracy threshold.

## Test Best Model Using New Data

In [10]:
best_model = RandomForestClassifier(n_estimators=21, 
                          max_depth=15, 
                          random_state=12345)

In [11]:
model = best_model.fit(features_train, target_train)

score = model.score(features_test, target_test)
print(f'Accuracy score of best model when tested using test dataset: {score}')

Accuracy score of best model when tested using test dataset: 0.7822706065318819


When evaluated on unseen data from the test dataset, the RandomForestClassifier model with 21 estimators and a depth of 15 returned an accuracy of approximately 0.782, which still exceeds the 0.75 threshold.

## Sanity Check Best Model

In [12]:
dummy_clf = DummyClassifier(strategy='most_frequent', random_state=0) #Create a dummy model that predicts most frequent class (phone plan)
dummy_model = dummy_clf.fit(features_train, target_train)

score = dummy_model.score(features_test, target_test) #Obtain and print accuracy score of dummy model
print(f'Dummy model accuracy: {score}.')

Dummy model accuracy: 0.6951788491446346.


Because the original Megaline data is skewed toward the Smart plan, a dummy model using the 'most_frequent' strategy was fitted on the training data as a baseline. It scored approximately 0.695 on the test dataset, below both the 0.75 accuracy threshold and the 0.782 achieved by the best RandomForestClassifier model. This confirms that the Random Forest model is doing more than exploiting the class imbalance and that, within the hyperparameter ranges searched (chosen to balance accuracy and run time), it is the most accurate model for predicting which new phone plan a Megaline customer is most likely to switch to.
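The baseline score has a simple interpretation: a 'most_frequent' dummy model always predicts the majority class of the training data, so its accuracy equals the share of that class in the data it is scored on. A minimal sketch with hypothetical toy labels (not the Megaline data) illustrates this:

```python
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: class 0 is the majority in the training data.
X_train = [[0]] * 10
y_train = [0] * 7 + [1] * 3
X_test = [[0]] * 5
y_test = [0, 0, 0, 1, 1]

# The dummy model ignores the features and always predicts class 0.
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_accuracy = dummy.score(X_test, y_test)

# Share of the training-majority class (0) among the test labels.
majority_share = y_test.count(0) / len(y_test)

print(dummy_accuracy, majority_share)
```

Read this way, the 0.695 baseline indicates that roughly 69.5% of the customers in the test dataset are on the majority plan.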

## Conclusion

Three classification models (DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression) were tested on a validation dataset for their accuracy in predicting which Megaline phone plan a customer will be drawn to, based on monthly usage data. Each model was tested with different hyperparameter arguments: the DecisionTreeClassifier with depths from 1 to 50, the RandomForestClassifier with depths from 1 to 50 and estimator counts from 1 to 41 in steps of 10, and LogisticRegression with different solvers.

With the accuracy threshold set at 0.75, the RandomForestClassifier with a depth of 15 and 21 estimators returned the highest validation accuracy, approximately 0.801. It was followed by the DecisionTreeClassifier with a depth of 7 (approximately 0.774) and the LogisticRegression model using the lbfgs solver (approximately 0.726). The RandomForestClassifier with these hyperparameters was then evaluated on the test dataset, where it returned a sufficient accuracy of approximately 0.782. As a sanity check, it was also compared to a baseline dummy model that always predicts the most frequent phone plan in the training set. The Random Forest model was more accurate than this baseline (0.782 versus 0.695).

Widening the hyperparameter ranges might yield higher accuracy at the cost of longer run time. Within the ranges searched here, the RandomForestClassifier with a depth of 15 and 21 estimators is the model Megaline should use to determine which of its new phone plans to advertise to customers who have not yet switched.
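The trade-off between accuracy and run time mentioned above could be made concrete by timing how long each candidate model takes to fit. The sketch below shows one way to do that with `time.perf_counter`. It runs on synthetic data from `make_classification` rather than the Megaline data, and `time_fit` is a hypothetical helper, so the timings it prints only illustrate the approach.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

def time_fit(model, X, y):
    """Return the wall-clock seconds taken by a single call to model.fit."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

fit_times = {
    'decision_tree': time_fit(
        DecisionTreeClassifier(max_depth=7, random_state=12345), X, y),
    'random_forest': time_fit(
        RandomForestClassifier(n_estimators=21, max_depth=15, random_state=12345), X, y),
}

for name, seconds in fit_times.items():
    print(f'{name}: {seconds:.4f} s')
```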