# Mobile Carrier Megaline Subscribers Behavior

## Introduction

This project uses behavior data from subscribers of the mobile carrier Megaline to build a model that analyzes a subscriber's behavior and recommends one of Megaline's newer plans: Smart or Ultra. Because there are only two plans to choose from, this is a binary classification task. The goal is the highest possible accuracy, with a minimum acceptable accuracy of 0.75.

## Data Overview

In [1]:
# Import necessary libraries
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats as st

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
# Load the datasets
users_behavior = pd.read_csv('users_behavior.csv')

# Display the first few rows of the dataset
display(users_behavior.head())

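|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |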


### Data preprocessing

In [3]:
# Display the column names of the dataset
column_names = users_behavior.columns.tolist()
display(column_names)

['calls', 'minutes', 'messages', 'mb_used', 'is_ultra']

In [4]:
# Display the shape of the dataset
n_rows, n_cols = users_behavior.shape
print(f"The DataFrame has {n_rows} rows and {n_cols} columns")

The DataFrame has 3214 rows and 5 columns


In [5]:
# Display the informative summary of the dataset
users_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [6]:
# Display the descriptive statistics of the dataset
users_behavior.describe()

|       | calls | minutes | messages | mb_used | is_ultra |
|-------|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [7]:
# Display the number of duplicates in the dataset
duplicated_users_behavior = users_behavior[users_behavior.duplicated()]
display(f"Number of duplicated data: {duplicated_users_behavior.shape[0]}")

'Number of duplicated data: 0'

In [8]:
# Display the number of missing values in the dataset
display(users_behavior.isna().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [9]:
# Display the number of '0' and '1' values in the 'is_ultra' column
users_behavior.value_counts('is_ultra')

is_ultra
0    2229
1     985
Name: count, dtype: int64
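
The class counts above also give a reference point for the 0.75 accuracy threshold. As a minimal sketch (the variable names are illustrative, and the counts are copied from the output above), the accuracy of a constant model that always predicts the majority class can be computed directly:

```python
# Class counts copied from the value_counts output above
n_smart = 2229  # rows with is_ultra == 0
n_ultra = 985   # rows with is_ultra == 1

# A constant model that always predicts the majority class is correct
# on exactly the majority-class share of the rows
baseline_accuracy = max(n_smart, n_ultra) / (n_smart + n_ultra)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A useful model should beat this baseline of roughly 0.69, so accuracy figures in this project are best read relative to it rather than relative to 0.5.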

## Machine Learning

I chose a Random Forest for this project because it is typically more accurate than a single decision tree: it combines the votes of an ensemble of trees instead of relying on just one. The trade-off is speed, since training and prediction take longer as the number of trees grows.
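
The accuracy and speed trade-off described above can be checked with a small experiment. The sketch below is an illustration under assumptions, not part of the project pipeline: it uses synthetic data from `make_classification` instead of the Megaline dataset, and the candidate models and sizes are arbitrary choices. Results on the real data may differ.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Megaline dataset)
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidates = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "forest, 10 trees": RandomForestClassifier(n_estimators=10, random_state=0),
    "forest, 100 trees": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Record validation accuracy and fit time for each candidate
results = {}
for name, clf in candidates.items():
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    results[name] = (clf.score(X_valid, y_valid), fit_seconds)
    print(f"{name}: accuracy={results[name][0]:.3f}, fit time={fit_seconds:.3f}s")
```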

In [10]:
# Define the feature and target variables
features = users_behavior.drop(['is_ultra'], axis=1)
print(features.shape)

target = users_behavior['is_ultra']
print(target.shape)

(3214, 4)
(3214,)


### Splitting the source dataset: Training, Test and Validation Datasets

The task requires a training set, a validation set, and a test set, so I split the source data into three parts. The validation and test sets are the same size, which gives a 3:1:1 ratio (60% training, 20% validation, 20% test).

In [11]:
# Split the dataset into training and temporary sets
users_behavior_train, users_behavior_temp = train_test_split(users_behavior, test_size=0.40, random_state=7)

In [12]:
# Define the feature and target variables for the training set
features_train = users_behavior_train.drop(['is_ultra'], axis=1)
print(features_train.shape)

target_train = users_behavior_train['is_ultra']
print(target_train.shape)

(1928, 4)
(1928,)


In [13]:
# Split the temporary dataset into validation and test sets
users_behavior_valid, users_behavior_test = train_test_split(users_behavior_temp, test_size=0.50, random_state=7)

In [14]:
# Define the feature and target variables for the validation set
features_valid = users_behavior_valid.drop(['is_ultra'], axis=1)
print(features_valid.shape)

target_valid = users_behavior_valid['is_ultra']
print(target_valid.shape)

(643, 4)
(643,)


In [15]:
# Define the feature and target variables for the test set
features_test = users_behavior_test.drop(['is_ultra'], axis=1)
print(features_test.shape)

target_test = users_behavior_test['is_ultra']
print(target_test.shape)

(643, 4)
(643,)
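
The two-step split above can also be wrapped in one helper, which avoids repeating the `drop` and column-selection code for each subset. This is a minimal sketch: `split_train_valid_test` is a hypothetical helper name, and the demo frame contains made-up random values with the same shape as `users_behavior`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_valid_test(df, target_col, random_state=7):
    """Split df 60/20/20 and return (features, target) pairs per subset."""
    train, temp = train_test_split(df, test_size=0.40, random_state=random_state)
    valid, test = train_test_split(temp, test_size=0.50, random_state=random_state)
    return {
        name: (part.drop(columns=[target_col]), part[target_col])
        for name, part in {"train": train, "valid": valid, "test": test}.items()
    }


# Synthetic frame with the same shape as users_behavior (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((3214, 4)),
                    columns=["calls", "minutes", "messages", "mb_used"])
demo["is_ultra"] = rng.integers(0, 2, size=3214)

splits = split_train_valid_test(demo, "is_ultra")
for name, (features, target) in splits.items():
    print(name, features.shape, target.shape)
```

Because the classes are imbalanced, passing `stratify=` to `train_test_split` would be one way to keep the Smart/Ultra proportions the same in every subset; the notebook does not do this, so it is left out of the sketch.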


### Model Training

In [16]:
# Try n_estimators from 1 to 10 and keep the model with the best validation accuracy
best_score = 0
best_est = 0
best_model = None

for est in range(1,11):
    model = RandomForestClassifier(random_state=57, 
                                   n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est
        best_model = model
        
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 10): 0.7962674961119751


The best model found with n_estimators between 1 and 10 reaches a validation accuracy of about 0.796, which is above the 0.75 threshold, so it is already acceptable. Since the goal is the highest possible accuracy, the next step is to investigate model quality with other hyperparameter values.

#### Investigate the quality of different models by changing hyperparameters

In [17]:
# Repeat the search over a wider range of n_estimators (1 to 50)
best_score = 0
best_est = 0
best_model = None

for est in range(1,51):
    model = RandomForestClassifier(random_state=57,
                                   n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est
        best_model = model
        
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 23): 0.807153965785381


Compared with the previous search, widening the range of n_estimators to 1 through 50 raises the best validation accuracy to about 0.807 (at n_estimators = 23). This is above both the threshold and the previous result, and the search still runs quickly, so I consider this model better than the previous one.

I also tried widening the range to 1 through 100, but the best accuracy did not change and the search became noticeably slower. The model above therefore offers a good balance of accuracy and speed.
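
The number of trees is only one of the hyperparameters that affect a random forest. As a hedged sketch, the same validation loop could also vary `max_depth`. The grid values below are arbitrary assumptions, and synthetic data stands in for the Megaline features so that the block runs on its own; it does not reproduce the results reported above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Megaline dataset)
X, y = make_classification(n_samples=1500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=7)

best_score, best_params, best_model = 0.0, None, None

# Arbitrary illustrative grid: a few forest sizes and tree depths
for est, depth in product([10, 20, 30], [4, 8, None]):
    model = RandomForestClassifier(n_estimators=est, max_depth=depth,
                                   random_state=57)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score, best_params, best_model = score, (est, depth), model

print(f"Best (n_estimators, max_depth): {best_params}, "
      f"validation accuracy: {best_score:.3f}")
```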

### Check the quality of the model using the test set

In [18]:
# Checking the quality of the model on the test set
test_accuracy = best_model.score(features_test, target_test) 

print("Best model's accuracy on the validation set (n_estimators = {}): {}".format(best_est, best_score))

print("Test set accuracy of the best model: {}".format(test_accuracy))

Best model's accuracy on the validation set (n_estimators = 23): 0.807153965785381
Test set accuracy of the best model: 0.7807153965785381


The test-set accuracy of the best model is about 0.781, which is above the 0.75 threshold and close to the validation accuracy (about 0.807). A slight drop from validation to test is expected, because n_estimators was chosen to maximize validation accuracy. The small gap suggests that the model generalizes reasonably well to new data and is not severely overfitting.
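
One rough way to judge whether the gap between validation and test accuracy is meaningful is to estimate the sampling uncertainty of the test accuracy. The sketch below uses the normal approximation for a proportion, which is an assumption added here and not part of the original analysis; the accuracy values and the test-set size are copied from the outputs above.

```python
import math

# Values copied from the outputs above
test_accuracy = 0.7807153965785381
valid_accuracy = 0.807153965785381
n_test = 643

# Standard error of a proportion under the normal approximation
std_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)
margin = 1.96 * std_error  # half-width of an approximate 95% interval
low, high = test_accuracy - margin, test_accuracy + margin

print(f"Test accuracy: {test_accuracy:.3f} +/- {margin:.3f}")
print(f"Validation accuracy inside interval: {low <= valid_accuracy <= high}")
```

With 643 test rows the margin is roughly 0.03, so the validation accuracy falls inside the interval, while the lower end of the interval sits very close to the 0.75 threshold.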