# Predicting Megaline Plan Preferences With Machine Learning

## PROJECT DESCRIPTION

This project focuses on developing and optimizing machine learning models to predict customer behavior based on pre-processed data from Megaline. The dataset provides records of customer activities, such as call durations, data usage, and messaging patterns. The goal is to utilize this information to train classification models that can accurately forecast which of the company's new service plans would best suit each customer. The final model will enable the company to target customers with personalized plan suggestions, improving customer satisfaction and optimizing the company’s marketing efforts.

## DATA PREPROCESSING

In [12]:
# Importing necessary libraries
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.dummy import DummyClassifier

In [13]:
# Loading the dataset
megaline = pd.read_csv('/datasets/users_behavior.csv')

In [14]:
megaline

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


Each row of the Megaline dataframe reflects the monthly activity of an individual customer, including:
- The number of calls made
- Total call duration (in minutes)
- Number of text messages sent
- Data usage (in MB)
- Customer’s subscription to one of two plans (0 for Smart, 1 for Ultra). 

In [15]:
# Creating features and target variables
features = megaline.drop('is_ultra', axis=1)
target = megaline['is_ultra'] 

In [16]:
# Create Train, Test, and Validation Datasets
features_train, features_test, target_train, target_test = train_test_split(features, 
                                                                              target, 
                                                                              test_size=0.25, 
                                                                              random_state=12345)

features_train, features_valid, target_train, target_valid = train_test_split(features_train, 
                                                                             target_train,
                                                                             test_size=0.25, 
                                                                             random_state=12345)


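Each split uses test_size=0.25. The first call holds out 25% of the data as the test set; the second call takes 25% of the remaining 75% as the validation set. The final proportions are therefore about 56% training, 19% validation, and 25% test. A 25% test size is commonly used because it leaves enough data for testing while preserving most of the data for training.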
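The effect of chaining two 25% splits can be checked on placeholder data. This is a minimal sketch: the 1600-row `demo` frame and its column names are assumptions for illustration, not the Megaline file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1600 rows, not the Megaline dataset
demo = pd.DataFrame({'x': np.arange(1600), 'y': np.arange(1600) % 2})
demo_features = demo[['x']]
demo_target = demo['y']

# Same two-step split as in the cell above
f_rest, f_test, t_rest, t_test = train_test_split(
    demo_features, demo_target, test_size=0.25, random_state=12345)
f_train, f_valid, t_train, t_valid = train_test_split(
    f_rest, t_rest, test_size=0.25, random_state=12345)

split_sizes = {'train': len(f_train), 'valid': len(f_valid), 'test': len(f_test)}
print(split_sizes)  # {'train': 900, 'valid': 300, 'test': 400}
```

With 1600 rows, the splits come out at 900, 300, and 400 rows, that is 56.25%, 18.75%, and 25% of the data.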

## DEFINING THE BEST MODEL

This function takes a model class, a parameter grid, and the features and targets of the training and validation datasets. It fits the model once for every combination of hyperparameters in the grid and scores each fit for accuracy on the validation set.

It prints the hyperparameters that produced the highest validation accuracy, along with that accuracy.

In [17]:
# Defining the 'find_best_model' function
def find_best_model(model_class, 
                    param_grid, 
                    features_train, 
                    target_train, 
                    features_valid, 
                    target_valid):  
    
    best_score = 0  
    best_params= None 
    
    for params in ParameterGrid(param_grid): 
        model = model_class(**params, random_state=12345) 
        model.fit(features_train, target_train) 
        
        score = model.score(features_valid, target_valid) 
        if score > best_score: 
            best_score = score
            best_params = params

    print(f'Best parameters: {best_params}. Best accuracy: {best_score}')
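Since `find_best_model` only prints its result, the winning model has to be rebuilt by hand afterwards. One possible alternative is a variant that returns the fitted model together with its parameters and score. The sketch below is an illustration, not the notebook's own code: the name `find_best_model_returning` and the synthetic demo data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier


def find_best_model_returning(model_class, param_grid,
                              features_train, target_train,
                              features_valid, target_valid,
                              random_state=12345):
    """Hypothetical variant of find_best_model that returns its results."""
    best_score, best_params, best_model = float('-inf'), None, None
    for params in ParameterGrid(param_grid):
        model = model_class(**params, random_state=random_state)
        model.fit(features_train, target_train)
        score = model.score(features_valid, target_valid)
        # Strict comparison: on ties, the earliest combination is kept
        if score > best_score:
            best_score, best_params, best_model = score, params, model
    return best_model, best_params, best_score


# Demo on synthetic data (an assumption; not the Megaline dataset)
X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)
demo_model, demo_params, demo_score = find_best_model_returning(
    DecisionTreeClassifier, {'max_depth': range(1, 6)}, X_tr, y_tr, X_va, y_va)
print(demo_params, demo_score)
```

Returning the fitted model means the exact estimator chosen on the validation set can be scored on the test set, without retyping its hyperparameters.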

### Chosen parameter grid for DecisionTreeClassifier

In [18]:
param_grid_dt = {'max_depth' : range(1, 51)} 

find_best_model(DecisionTreeClassifier, 
               param_grid_dt,
               features_train,
               target_train,
               features_valid,
               target_valid)

Best parameters: {'max_depth': 3}. Best accuracy: 0.7943615257048093


The Decision Tree model was tested with max_depth values from 1 to 50. It reached its highest validation accuracy, approximately 0.794, at a depth of 3. Because find_best_model replaces the best score only on a strict improvement, no deeper tree beat this result.

### Chosen parameter grid for RandomForestClassifier 

In [19]:
param_grid_rf = {'n_estimators': range(1, 51, 10),
                'max_depth': range(1, 51)}

find_best_model(RandomForestClassifier,
               param_grid_rf,
               features_train,
               target_train,
               features_valid,
               target_valid)

Best parameters: {'max_depth': 9, 'n_estimators': 21}. Best accuracy: 0.8341625207296849


The Random Forest model was tested with max_depth values from 1 to 50 and n_estimators values of 1, 11, 21, 31, and 41, a coarse grid chosen to keep run time manageable. It performed best with max_depth set to 9 and n_estimators at 21, reaching an accuracy of approximately 0.834 on the validation dataset.
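The cost of this search follows from the grid size: every combination of the two ranges is fitted once. A short sketch, reusing the `param_grid_rf` definition from the cell above, lists the estimator counts and counts the combinations:

```python
from sklearn.model_selection import ParameterGrid

param_grid_rf = {'n_estimators': range(1, 51, 10),
                 'max_depth': range(1, 51)}

grid = ParameterGrid(param_grid_rf)
n_estimator_values = list(param_grid_rf['n_estimators'])

print(n_estimator_values)  # [1, 11, 21, 31, 41]
print(len(grid))           # 250 combinations, so 250 forests are fitted
```

A step of 10 keeps the search at 250 fits rather than the 2,500 that every estimator count from 1 to 50 would require.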

### Chosen parameter grid for LogisticRegression 

In [20]:
param_grid_lr = {'solver': ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']}

find_best_model(LogisticRegression,
               param_grid_lr,
               features_train,
               target_train,
               features_valid,
               target_valid)

Best parameters: {'solver': 'lbfgs'}. Best accuracy: 0.7412935323383084




The Logistic Regression model was tested with five different solvers. The best performance was achieved with the lbfgs solver, resulting in an accuracy of approximately 0.741 on the validation dataset.

- The RandomForestClassifier model was the most accurate, with a validation accuracy of about 83%.
- The LogisticRegression model was the least accurate, with a validation accuracy of about 74%.
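The solver comparison above was run on unscaled features. Solvers such as sag and saga are known to converge more reliably when features share a similar scale, so standardizing the inputs first may make the comparison fairer. The sketch below shows one way to do this, under stated assumptions: the synthetic data, the column scaling factors, and `max_iter=1000` are illustrative choices, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with columns on very different scales
X, y = make_classification(n_samples=600, n_features=4, random_state=0)
X = X * [1, 10, 100, 10000]
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=12345)

solver_scores = {}
for solver in ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']:
    # Standardize each feature before fitting so every solver sees comparable scales
    pipeline = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, max_iter=1000, random_state=12345))
    pipeline.fit(X_tr, y_tr)
    solver_scores[solver] = pipeline.score(X_va, y_va)

print(solver_scores)
```

Placing the scaler inside the pipeline means it is fitted on the training rows only, so no information from the validation set leaks into the preprocessing.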

## SANITY CHECKING BEST MODEL

In [21]:
# Evaluate the Random Forest on the held-out test set
# (this run uses max_depth=15; the grid search above reported max_depth=9)
best_model = RandomForestClassifier(n_estimators=21, max_depth=15, random_state=12345)
model = best_model.fit(features_train, target_train)

score = model.score(features_test, target_test)
print(f'Accuracy score of best model: {score}')

Accuracy score of best model: 0.7935323383084577


In [22]:
# Obtain and print accuracy score of dummy model
dummy_clf = DummyClassifier(strategy='most_frequent', random_state=0)
dummy_model = dummy_clf.fit(features_train, target_train)

score = dummy_model.score(features_test, target_test) 
print(f'Dummy model accuracy: {score}.')

Dummy model accuracy: 0.7002487562189055.


A dummy model using the 'most frequent' strategy scored 0.70 on the test set, below the 0.79 scored by the Random Forest (n_estimators=21, max_depth=15). The Random Forest therefore beats the naive baseline by about nine percentage points, which indicates that it has learned real signal from the usage features. This comparison does not by itself show that the Random Forest is the best of the three models; that follows from the validation results above.
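The baseline figure has a simple interpretation: a 'most frequent' dummy always predicts the majority class of the training set, so its test accuracy equals the share of test rows that belong to that class. A minimal sketch with made-up labels (the 70/30 class balance is an assumption chosen for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: class 0 is the majority in both sets
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 14 + [1] * 6)

# The dummy ignores the features, so constant placeholders are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_accuracy = dummy.score(X_test, y_test)
majority_share = np.mean(y_test == 0)

print(dummy_accuracy, majority_share)
```

Any model worth deploying needs to clear this majority-class share, which is why it serves as the sanity threshold here.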

## CONCLUSION

Three classification models (DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression) were assessed in this project to predict which Megaline phone plan a customer might choose based on their monthly usage data. Each model was tuned over a grid of hyperparameters to find the most effective configuration. The DecisionTreeClassifier was tested with depths from 1 to 50, the RandomForestClassifier with depths from 1 to 50 and five estimator counts (1, 11, 21, 31, and 41), and LogisticRegression with five different solvers.

The RandomForestClassifier emerged as the top performer, achieving the highest validation accuracy of approximately 0.834 with a max depth of 9 and 21 estimators. The DecisionTreeClassifier followed with a validation accuracy of approximately 0.794 at a depth of 3, while the LogisticRegression model, using the lbfgs solver, achieved approximately 0.741. The RandomForestClassifier was then evaluated on the test dataset in a run with a max depth of 15 and 21 estimators, where it scored an accuracy of approximately 0.794.

To ensure its effectiveness, the RandomForestClassifier was compared to a baseline dummy model, which predicted the most frequent plan from the training set. The RandomForestClassifier outperformed the dummy model, confirming its superior accuracy. While expanding the range of hyperparameters could potentially improve accuracy further, the RandomForestClassifier with the selected parameters strikes a balance between time efficiency and predictive accuracy, making it the recommended model for Megaline's targeted marketing efforts.