# Megaline Recommended Plan Based on User Behavior

In this project, we are working with a mobile carrier named Megaline. The company offers its customers two prepaid plans, “Smart” and “Ultra”. Many of the company’s subscribers use legacy plans, and Megaline wants to encourage them to switch to these newer plans.

The main objective of the project is to develop a machine learning model that can analyze the behavior of the subscribers and predict which plan is more suitable for them. The model will use data about the subscribers’ behavior, such as the number of calls they made, total call duration in minutes, number of text messages sent, and internet traffic used in MB.

The model will be trained on a dataset containing behavior data about subscribers who have already switched to the new plans. The goal is to build a model with the highest possible accuracy, with a minimum acceptable accuracy of 0.75.

The project involves several steps, including data preprocessing, model training, hyperparameter tuning, and model evaluation. We will also perform a sanity check on the model.

## Initialization

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# from sklearn.metrics import mean_squared_error

## Load, examine, and clean the data

In [2]:
# Load the data and examine it
try:
    behavior = pd.read_csv('./datasets/users_behavior.csv')
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is not available
    behavior = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/users_behavior.csv')

# Show the information of the dataframe
behavior.info()

# Show statistical summary of the data
print('\nStatistical summary:\n', behavior.describe())

# See if there are any correlations
print('\nCorrelations:\n', behavior.corr())
display(behavior.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

Statistical summary:
              calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.0000

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1834,62.0,448.76,62.0,15522.01,0
2665,47.0,378.52,47.0,9804.44,0
730,36.0,259.5,8.0,7417.14,0
2294,76.0,484.49,11.0,22454.35,0
46,76.0,535.91,65.0,11968.22,0


#### Convert columns into appropriate data types

In [3]:
# Convert calls and messages from float to int
if np.array_equal(behavior['calls'], behavior['calls'].astype('int')):
  behavior['calls'] = behavior['calls'].astype('int')
if np.array_equal(behavior['messages'], behavior['messages'].astype('int')):
  behavior['messages'] = behavior['messages'].astype('int')

behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int32  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int32  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int32(2), int64(1)
memory usage: 100.6 KB


#### Check for missing values

In [4]:
print(behavior.isna().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


It looks like there are no missing values in the dataset.

#### Check for duplicates

In [5]:
# Check for duplicates
duplicates = behavior.duplicated()
print(f"Number of duplicate rows = {duplicates.sum()}")

Number of duplicate rows = 0


There are no duplicates in the dataset.

## Choose a model

#### Split the source data into data sets

In [6]:
# Split the source data into a training set, a validation set, and a test set.

# Split the data into a training set and a temporary set with a 75-25 split
train, temp = train_test_split(behavior, test_size=0.25, random_state=54321)

# Split the temporary set in half into a validation set and a test set
# This gives the data a 75-12.5-12.5 split with equal validation and test sizes
valid, test = train_test_split(temp, test_size=0.5, random_state=54321)
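
As a quick check of the proportions, the sizes of the three sets can be worked out by hand. This is a minimal sketch, not part of the original notebook. It assumes the second part of each split is rounded up to a whole row, which is consistent with the 402 test rows counted in the confusion matrix later in this notebook. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional split.

    Assumption: the second part is rounded up to a whole row
    and the first part receives the remainder.
    """
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second


n_total = 3214  # number of rows reported by behavior.info()

# First split: training set versus temporary set
n_train, n_temp = split_sizes(n_total, 0.25)

# Second split: the temporary set is halved into validation and test sets
n_valid, n_test = split_sizes(n_temp, 0.5)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%} of the data)")
```

With `test_size=0.25` followed by `test_size=0.5`, the validation and test sets each receive about one eighth of the rows.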

#### Prepare the features and target variables

In [7]:
# Create the features variables
features_train = train.drop('is_ultra', axis=1) # X_train
features_valid = valid.drop('is_ultra', axis=1) # X_valid

# Create the target variables:
target_train = train['is_ultra'] # y_train
target_valid = valid['is_ultra'] # y_valid

In [8]:
# Decision Tree with different max depths
for depth in range(1,6):
  model = DecisionTreeClassifier(random_state=54321, max_depth=depth)
  model.fit(features_train, target_train)
  predictions_valid = model.predict(features_valid)
  print("Max Depth =", depth, ": ", end='')
  print(accuracy_score(target_valid, predictions_valid))

Max Depth = 1 : 0.7313432835820896
Max Depth = 2 : 0.7611940298507462
Max Depth = 3 : 0.7736318407960199
Max Depth = 4 : 0.7711442786069652
Max Depth = 5 : 0.7786069651741293


For the decision tree, validation accuracy generally rises with max depth over the range tested, with a small dip at depth 4. At a max depth of 5, the model is about 77.9% accurate on the validation set.
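
To see how accuracy changes from one depth to the next, the scores printed above can be compared step by step. This is a small sketch that only reuses the numbers from the output; the variable names are illustrative and not part of the original notebook.

```python
# Validation accuracies copied from the output above, keyed by max_depth
depth_scores = {
    1: 0.7313432835820896,
    2: 0.7611940298507462,
    3: 0.7736318407960199,
    4: 0.7711442786069652,
    5: 0.7786069651741293,
}

# Change in accuracy when moving from each depth to the next one
depths = sorted(depth_scores)
changes = {
    depth: depth_scores[depth] - depth_scores[previous]
    for previous, depth in zip(depths, depths[1:])
}

for depth, change in changes.items():
    direction = "up" if change > 0 else "down"
    print(f"max_depth {depth - 1} -> {depth}: {direction} by {abs(change):.4f}")

# Depth with the highest validation accuracy
best_depth = max(depth_scores, key=depth_scores.get)
print("Best max_depth on the validation set:", best_depth)
```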

In [9]:
# Random Forest
for n in range(10, 101, 10):
  model = RandomForestClassifier(random_state=54321, n_estimators=n)
  model.fit(features_train, target_train)
  predictions_valid = model.predict(features_valid)
  print("n_estimators =", n, ": ", end='')
  print(accuracy_score(target_valid, predictions_valid))

n_estimators = 10 : 0.7860696517412935
n_estimators = 20 : 0.7960199004975125
n_estimators = 30 : 0.7985074626865671
n_estimators = 40 : 0.7935323383084577
n_estimators = 50 : 0.7960199004975125
n_estimators = 60 : 0.7985074626865671
n_estimators = 70 : 0.8009950248756219
n_estimators = 80 : 0.8009950248756219
n_estimators = 90 : 0.8059701492537313
n_estimators = 100 : 0.8059701492537313


Even with only 10 estimators, the random forest has a higher validation accuracy than the best decision tree above (0.786 versus 0.779). It is slower to train and predict because it combines many trees, but it is more accurate.
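
Because larger forests are slower, one reasonable way to read these results is to look for the smallest forest that is as accurate as the best one, or close to it. The sketch below reuses the scores printed above; the helper `smallest_good_n` and its `tolerance` parameter are hypothetical and not part of the original notebook.

```python
# Validation accuracies copied from the output above, keyed by n_estimators
forest_scores = {
    10: 0.7860696517412935,
    20: 0.7960199004975125,
    30: 0.7985074626865671,
    40: 0.7935323383084577,
    50: 0.7960199004975125,
    60: 0.7985074626865671,
    70: 0.8009950248756219,
    80: 0.8009950248756219,
    90: 0.8059701492537313,
    100: 0.8059701492537313,
}


def smallest_good_n(scores, tolerance=0.0):
    """Return the smallest n_estimators whose score is within
    `tolerance` of the best score in `scores`."""
    best_score = max(scores.values())
    return min(n for n, score in scores.items() if score >= best_score - tolerance)


print("Smallest forest with the best score:", smallest_good_n(forest_scores))
print("Smallest forest within 0.01 of the best:", smallest_good_n(forest_scores, 0.01))
```

A nonzero tolerance trades a little validation accuracy for a faster model; how much to give up is a judgment call.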

In [10]:
# Initialize the Logistic Regression constructor
model = LogisticRegression(random_state=54321, solver='liblinear')

# Train the model on the training set
model.fit(features_train, target_train)

# Calculate the accuracy score on the training set
score_train = model.score(features_train, target_train)

# Calculate the accuracy score on the validation set
score_valid = model.score(features_valid, target_valid)

# Print the accuracy of the logistic regression model on the training set
print("Accuracy of the logistic regression model on the training set:", score_train)

# Print the accuracy of the logistic regression model on the validation set
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7178423236514523
Accuracy of the logistic regression model on the validation set: 0.6940298507462687


For the logistic regression model, we got about 72% and 69% accuracy on the training set and validation set respectively. Although logistic regression is typically faster than a random forest, here it is less accurate.

We will use the random forest model for our test.

## Test the chosen model

In [11]:
# We are using the Random Forest model since it is the most accurate.
# Create the variables for the test set
features_test = test.drop('is_ultra', axis=1)
target_test = test['is_ultra']

model = RandomForestClassifier(random_state=54321, n_estimators=70)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
print("Test Accuracy: ", accuracy_score(target_test, predictions_test))

Test Accuracy:  0.8009950248756219


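The random forest model reaches an accuracy of about 80% on the test set, which clears the 0.75 threshold. In other words, roughly 80% of its predictions on this held-out data were correct, and we would expect similar performance on comparable new data. Note that this run uses n_estimators=70, although 90 and 100 scored slightly higher on the validation set (0.806 versus 0.801).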
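
Since the test set is fairly small, it can help to attach a rough uncertainty range to this accuracy. The sketch below is an addition, not part of the original analysis. It assumes a normal approximation to the binomial distribution and takes the test set size of 402 rows from the sum of the confusion matrix entries reported in the sanity check.

```python
import math

accuracy = 0.8009950248756219  # test accuracy printed above
n_test = 402                   # number of test rows (sum of the confusion matrix)

# Standard error of a proportion under the normal approximation (an assumption)
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% confidence interval
z = 1.96
lower = accuracy - z * std_err
upper = accuracy + z * std_err

print(f"Approximate 95% interval: {lower:.3f} to {upper:.3f}")
print("Lower bound above the 0.75 threshold:", lower > 0.75)
```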

## Sanity check the model

#### Check that the model is better than a random model.

In [12]:
# Generate random predictions based on the class distribution of the training set
random_predictions = np.random.choice([0, 1], size=len(target_test), p=[1 - target_train.mean(), target_train.mean()])

# Calculate the accuracy of the random predictions
random_accuracy = accuracy_score(target_test, random_predictions)
print(f"Random Model Accuracy: {random_accuracy}")

Random Model Accuracy: 0.5870646766169154


The random model's accuracy fluctuates between roughly 0.53 and 0.63 from run to run because no random seed is set. Since the random forest model has a clearly higher accuracy than the random model, there's a good chance the model is learning something from the data and not just making random guesses.
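
The centre of that range can also be estimated without sampling. A guess drawn with probability p of being 1 matches a test set whose share of 1s is q with probability p*q + (1 - p)*(1 - q). The sketch below is illustrative only: it uses the overall `is_ultra` mean from the statistical summary as a stand-in for `target_train.mean()` (an assumption), and takes the test set share from the confusion matrix shown later (58 + 70 Ultra users out of 402).

```python
# Assumption: the overall is_ultra mean from describe() approximates target_train.mean()
p_guess_ultra = 0.306472

# Share of Ultra users in the test set, from the confusion matrix (58 + 70 of 402)
q_actual_ultra = (58 + 70) / 402

# A random guess is correct when it says 1 and the truth is 1,
# or when it says 0 and the truth is 0
expected_accuracy = (
    p_guess_ultra * q_actual_ultra
    + (1 - p_guess_ultra) * (1 - q_actual_ultra)
)

print(f"Expected accuracy of the random model: {expected_accuracy:.3f}")
```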

#### Check that the model is better than a simple model

In [13]:
# Generate simple predictions by always predicting the most common class
simple_predictions = np.full_like(target_test, target_train.mode()[0])

# Calculate the accuracy of the simple predictions
simple_accuracy = accuracy_score(target_test, simple_predictions)
print(f"Simple Model Accuracy: {simple_accuracy}")

Simple Model Accuracy: 0.681592039800995


This is the accuracy of a model that always predicts the most common class. Its accuracy is 0.68, which is lower than the random forest's test accuracy of 0.80. This is a good sign that the model is learning from the data rather than just predicting the most common class.

#### Check the confusion matrix

In [14]:
# Calculate the confusion matrix
conf_mat = confusion_matrix(target_test, model.predict(features_test))

# Print the confusion matrix
print(conf_mat)

[[252  22]
 [ 58  70]]


The rows are the actual classes and the columns are the predicted classes, laid out as:

[[True Negatives (TN)  False Positives (FP)]\
 [False Negatives (FN) True Positives (TP)]]

From the output, the model has:

- 252 True Negatives where the model correctly predicted 0 (Smart plan) when the actual is 0.
- 22 False Positives where the model incorrectly predicted 1 (Ultra plan) when the actual is 0.
- 58 False Negatives where the model incorrectly predicted 0 when the actual is 1.
- 70 True Positives where the model correctly predicted 1 when the actual is 1.
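
These four counts can be turned into per-class rates, which show where the errors are concentrated. The sketch below is an addition that uses only the counts from the matrix above; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above
tn, fp = 252, 22
fn, tp = 58, 70

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the predicted Ultra users, how many are Ultra
recall = tp / (tp + fn)        # of the actual Ultra users, how many were found
specificity = tn / (tn + fp)   # of the actual Smart users, how many were found

print(f"Accuracy:    {accuracy:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
```

The recall for the Ultra class is noticeably lower than the specificity, so the model is better at recognizing Smart users than Ultra users.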

#### Check feature importances

In [15]:
# Get feature importances
importances = model.feature_importances_

# Print feature importances
for feature, importance in zip(features_train.columns, importances):
    print(f"{feature}: {importance}")

calls: 0.20454833414323342
minutes: 0.26672329580491727
messages: 0.2004702288813261
mb_used: 0.32825814117052327


This shows that the most important feature for making predictions is mb_used (about 0.33), followed by minutes (about 0.27). The calls and messages columns each contribute about 0.20.
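
For easier reading, the importances can be sorted from highest to lowest. The sketch below reuses the values printed above rather than a fitted model, so it runs on its own; the variable names are illustrative.

```python
# Feature importances copied from the output above
importances = {
    "calls": 0.20454833414323342,
    "minutes": 0.26672329580491727,
    "messages": 0.2004702288813261,
    "mb_used": 0.32825814117052327,
}

# Sort features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for feature, importance in ranked:
    print(f"{feature:<10}{importance:.3f}")

# The importances are normalized, so they should add up to about 1
total_importance = sum(importances.values())
print(f"Sum of importances: {total_importance:.3f}")
```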

## Conclusion

In this project, we aimed to develop a model that could analyze the behavior of Megaline’s subscribers and recommend one of Megaline’s newer plans: Smart or Ultra. The dataset provided contained monthly behavior information about each user, including the number of calls, total call duration in minutes, number of text messages, internet traffic used in MB, and the plan for the current month.

The data was already clean, with no missing or duplicate values. We split it into a training set, a validation set, and a test set. We then trained a Decision Tree model, a Random Forest model, and a Logistic Regression model, varying max_depth for the tree and n_estimators for the forest, and compared their accuracy on the validation set.

The Random Forest model achieved the highest accuracy on the validation set, so we chose it as our final model. We then checked the quality of this model using the test set and achieved an accuracy of about 0.80, which is above the project’s threshold of 0.75.

We also performed a sanity check on the model by comparing its performance to a random model and a simple model, checking the confusion matrix, and checking the feature importances. The results showed that our model was learning from the data and making sensible predictions.

Overall, this project demonstrated the effectiveness of machine learning models in analyzing user behavior and making recommendations. It also highlighted the importance of model selection, hyperparameter tuning, and model evaluation in the machine learning workflow.