# Project Objective

The mobile carrier Megaline has found that many of its subscribers use legacy plans. The company wants to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.

Using monthly behavioral data about subscribers who have already switched to the new plans, the objective of this project is to develop a machine learning model that helps pick the right plan for legacy plan users.

I will tune several models suitable for a classification task, with the goal of producing the highest possible accuracy score for each model. The chosen model must have an accuracy score of at least 75%.

# Data description

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- calls — number of calls
- minutes — total call duration in minutes
- messages — number of text messages
- mb_used — Internet traffic used in MB
- is_ultra — plan for the current month (Ultra - 1, Smart - 0)

# Project Plan
- Load and inspect the data.
- Split the data into train, validate, and test sets then specify features and a target for each.
- Train the Decision Tree Classifier, Random Forest Classifier, and Logistic Regression models using the training data set.
- Optimize the accuracy score of the validation data set for each model.
- Assess the accuracy score of the testing data set for the best instance of each model.
- Assess the best model to use in order to help Megaline pick the right plan for its legacy plan users.


In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


### Load and Inspect the Data

In [2]:
# read csv to dataframe
df = pd.read_csv('/datasets/users_behavior.csv')

# preview data
print(f"{df.sample(10)}\n")

# get info about the data
df.info()

# validate that it's safe to convert calls and messages to int without losing data
print("\nIt's safe to convert df['calls'] to int:",np.array_equal(df['calls'],df['calls'].astype(int)))
print("It's safe to convert df['messages'] to int:", np.array_equal(df['messages'],df['messages'].astype(int)))

# convert calls and messages to int
df['calls'] = df['calls'].astype(int)
df['messages'] = df['messages'].astype(int)

      calls  minutes  messages   mb_used  is_ultra
2024   27.0   147.66      39.0   7545.04         0
1822  115.0   679.27       1.0  28668.40         1
3095   63.0   409.35       0.0   4300.48         1
1732   34.0   217.52      23.0  14040.66         0
1768   70.0   497.55      66.0  24918.49         0
1714   56.0   324.99      78.0  10977.57         0
3053   35.0   219.89      40.0  13664.23         0
764    66.0   488.95      46.0  17383.22         0
946    33.0   247.29       0.0  30996.30         1
720     1.0    14.91      55.0  23828.60         0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

It's s

The data is mostly as expected. The 'calls' and 'messages' columns hold whole-number counts, so storing them as floats is unnecessary. I converted them to int after validating that it was safe to do so (no data would be lost).
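The safety check used before the conversion can be wrapped in a small helper. This is only a minimal sketch: the function name `is_lossless_int` and the toy Series are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def is_lossless_int(series: pd.Series) -> bool:
    """Return True if casting the series to int changes none of its values."""
    # NaN cannot be cast to int, so treat any missing value as unsafe
    if series.isna().any():
        return False
    # compare the original values with their integer-cast counterparts
    return bool(np.array_equal(series, series.astype(int)))


whole_counts = pd.Series([27.0, 115.0, 63.0])  # whole numbers stored as floats
fractional = pd.Series([147.66, 679.27])       # values with a real fractional part

print("whole_counts is safe to convert:", is_lossless_int(whole_counts))
print("fractional is safe to convert:", is_lossless_int(fractional))
```

A column passes the check only when every value is already a whole number, which is the condition relied on for 'calls' and 'messages'.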

### Split the data into train, validate, and test sets then specify features and a target for each

In [3]:
# split the source data into a training set and validation/testing set
df_train, df_test_valid = train_test_split(df, test_size=0.4, random_state=12345)

# split the validation/testing df into a separate validation set and testing set
df_test, df_valid = train_test_split(df_test_valid, test_size=0.5, random_state=12345)

# designate a list of features and a target in separate variables for df_train
train_features = df_train.drop(['is_ultra'], axis=1)
train_target = df_train['is_ultra']

# designate a list of features and a target in separate variables for df_test
test_features = df_test.drop(['is_ultra'], axis=1)
test_target = df_test['is_ultra']

# designate a list of features and a target in separate variables for df_valid
valid_features = df_valid.drop(['is_ultra'], axis=1)
valid_target = df_valid['is_ultra']

No separate test set is provided, so the source data (df) has to be split into three parts: training, validation, and test. The validation and test sets are usually the same size, which gives a 3:1:1 split of the source data.

Consequently, I first split the source data into two parts: 60% training data and 40% validation/testing data.
Next, I split the validation/testing data into two equal halves.
Overall, this yields the desired 3:1:1 ratio, with the majority of the data (a sufficient amount) available for training the models.
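The resulting proportions can be checked without the real data. The sketch below assumes a placeholder frame named `toy` with the same number of rows as the dataset (3214) and applies the same two-step split. The `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# placeholder frame with as many rows as the source data (3214)
toy = pd.DataFrame({"row_id": np.arange(3214)})

# step 1: 60% training, 40% held out for validation/testing
toy_train, toy_rest = train_test_split(toy, test_size=0.4, random_state=12345)

# step 2: split the held-out part into two equal halves
toy_test, toy_valid = train_test_split(toy_rest, test_size=0.5, random_state=12345)

# report the size and share of each part
for name, part in [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(toy):.1%})")
```

If class balance across the three parts matters, `train_test_split` also accepts a `stratify` argument. Whether using it would change the results reported here has not been tested.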

### Optimize the Decision Tree Classifier Model for the Validation Data Set

In [13]:
# initialize values for the best_model and best_result to use in the subsequent loop
best_model = None
best_result = 0

# tune the hyperparameters of the Decision Tree Classifier model to find the best model
for depth in range(1, 9): # loop through values 1-8 for the max_depth= parameter
    model_dt = DecisionTreeClassifier( # initialize the Decision Tree Classifier model
        random_state=12345, 
        max_depth=depth) 
    model_dt.fit(train_features, train_target) # train the model
    valid_result = model_dt.score(valid_features, valid_target) # calculate the accuracy score for the validation data
    train_result = model_dt.score(train_features, train_target) # calculate the accuracy score for the training data
    if valid_result > best_result: # store the best model, its best accuracy score, and the max_depth parameter value for its best score in variables
        best_model = model_dt
        best_result = valid_result
        best_depth = depth
        best_train_result = train_result # also store the training data's accuracy score for the best model

# print the accuracy score of the best model and the value of its max_depth parameter
print("Accuracy of the best model:", best_result)
print("Max_depth of best model:", best_depth)

# print the accuracy score of the best model for the training data as well
print("\nAccuracy of the model (training data):", best_train_result)


Accuracy of the best model: 0.7993779160186625
Max_depth of best model: 7

Accuracy of the model (training data): 0.8558091286307054


With a max_depth of 7, the model shows signs of overfitting: its training set accuracy score (85.58%) is notably higher than its validation set accuracy score (79.94%).
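One way to see how the gap between training and validation accuracy relates to tree depth is to print both scores for every depth. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, not the Megaline dataset), so its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with four features (NOT the Megaline dataset)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

gaps = {}  # max_depth -> (training accuracy - validation accuracy)
for depth in range(1, 9):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    valid_acc = model.score(X_valid, y_valid)
    gaps[depth] = train_acc - valid_acc
    print(f"max_depth={depth}: train={train_acc:.3f}, "
          f"valid={valid_acc:.3f}, gap={gaps[depth]:+.3f}")
```

A gap that keeps widening while validation accuracy stalls is the usual sign that extra depth is memorizing the training set rather than generalizing.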

### Optimize the Random Forest Classifier Model for the Validation Data Set

In [10]:
# initialize values for best_score and best_est to use in the subsequent loop
best_score = 0
best_est = 0

# tune the hyperparameters of the Random Forest Classifier model to find the best model
for est in range(1, 10): # choose hyperparameter range
    model_rf = RandomForestClassifier( # initialize the Random Forest Classifier model
        random_state=54321, 
        n_estimators=est # set number of trees
    )
    model_rf.fit(train_features, train_target) # train model on training set
    score = model_rf.score(valid_features, valid_target) # calculate accuracy score on validation set
    score_train = model_rf.score(train_features, train_target) # calculate accuracy score on the training set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score
        best_train_score = score_train # save accuracy score for the training set for the validation set's best model

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))
print("Accuracy of the best model on the training set:", best_train_score)

Accuracy of the best model on the validation set (n_estimators = 6): 0.7807153965785381
Accuracy of the best model on the training set: 0.966804979253112


The model is overfitted: its training set accuracy score of 96.68% is much higher than its validation set accuracy score of 78.07%. With 6 estimators, it may also run more slowly than a single decision tree.

### Build a Logistic Regression Model for the Validation Data Set

In [12]:
# initialize the Logistic Regression model
model_lg =  LogisticRegression(
    random_state=54321,
    solver='liblinear'
)

# fit the model
model_lg.fit(train_features, train_target)  

# score the model on the training data
score_train = model_lg.score(train_features, train_target)

# score the model on the validation data
score_valid = model_lg.score(valid_features, valid_target)  

# print the accuracy of the model on both the training and validation data sets
print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)


Accuracy of the logistic regression model on the training set: 0.7505186721991701
Accuracy of the logistic regression model on the validation set: 0.7402799377916018


The model is not overfitted (training set accuracy of 75.05% vs. validation set accuracy of 74.03%) and runs quickly.

### Calculate the accuracy score for the best Decision Tree Classifier model using the test set data

In [15]:
# check the accuracy of the best Decision Tree Classifier model on the test data set
best_dt_model = DecisionTreeClassifier( # initialize the Decision Tree Classifier model
        random_state=12345, 
        max_depth=7) 

# train the model
best_dt_model.fit(train_features, train_target) 

# get the accuracy score of the model on the test data set
test_result = best_dt_model.score(test_features, test_target) 

# print the accuracy score of the best Decision Tree Classifier model for the testing data set
print("Accuracy of the best Decision Tree Classifier model (testing set):", test_result)

Accuracy score of the best Decision Tree Classifier model (testing set): 0.7822706065318819


The accuracy score of the model on the testing data is lower than on the validation data, but at 78.23% it is still the highest test accuracy score of the three models.

### Calculate the accuracy score for the best Random Forest Classifier model using the test set data

In [17]:
# check the accuracy of the best Random Forest Classifier model on the test data set
best_model_rf = RandomForestClassifier( # initialize the Random Forest Classifier model
        random_state=54321, 
        n_estimators=6 
    ) 

# fit the model
best_model_rf.fit(train_features, train_target)

# get the accuracy score of the model on the test data set
test_score = best_model_rf.score(test_features, test_target) # calculate accuracy score on the test set

# print the accuracy score of the best Random Forest Classifier model for the testing data set
print("Accuracy of the best Random Forest Classifier model (testing set):", test_score)

Accuracy of the best Random Forest Classifier model (testing set): 0.7573872472783826


The accuracy score of the model on the testing data is lower than on the validation data by about 2.3 percentage points.

### Calculate the accuracy score for the Logistic Regression model using the test set data

In [16]:
# check the accuracy of the Logistic Regression model on the test data set
best_model_lg =  LogisticRegression(
    random_state=54321,
    solver='liblinear'
)

# fit the model
best_model_lg.fit(train_features, train_target)  

# score the model on the test data
score_test = best_model_lg.score(test_features, test_target)

# print the accuracy score of the best Logistic Regression model for the testing data set
print("Accuracy of the best Logistic Regression model (testing set):", score_test)

Accuracy of the best Logistic Regression model (testing set): 0.7589424572317263


The accuracy score of the model on the testing data is slightly higher than on the validation data, which is a solid result.

### Sanity check the model

In [23]:
# store df's features and target in variables
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# set predictions equal to the target median
median_value = pd.Series(target.median(), index=target.index)

# sanity check accuracy score of the target vs. the target median value
sanity_check_result = accuracy_score(target, median_value) 

# print sanity check accuracy score
print("Sanity check accuracy score:", sanity_check_result)

Sanity check accuracy score: 0.693528313627878


The accuracy score of the best model is higher than the 69.35% obtained by simply predicting the median value of the target for every user (for this binary target, the median is the most common class). The best model (the Decision Tree Classifier model) therefore works better than the simple approach of always guessing the most common plan.
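The same kind of baseline can also be expressed with scikit-learn's `DummyClassifier`, swapped in here for the median-based approach used in this notebook. The sketch below runs on a small toy target (an assumption for illustration, not the project data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# toy target: 7 users on Smart (0) and 3 users on Ultra (1)
toy_target = np.array([0] * 7 + [1] * 3)
# the dummy model ignores feature values, so a column of zeros is enough
toy_features = np.zeros((len(toy_target), 1))

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features, toy_target)

baseline_accuracy = accuracy_score(toy_target, baseline.predict(toy_features))
print("Baseline accuracy:", baseline_accuracy)
```

Fitting such a baseline on the training set and scoring it on the test set would make the comparison with the models' test scores like-for-like, since those scores were also measured on the test set.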

# Conclusion

The decision tree classifier model is the best model overall with a test set accuracy score of 78.23%.

The logistic regression model performed slightly better than the random forest classifier model on the test set, with an accuracy score of 75.89% vs. 75.74%, respectively.

All three models beat our project's minimum accuracy score requirement of 75%.
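The comparison can be tabulated directly from the test set accuracy scores printed earlier. The sketch below copies those three values by hand. The names `test_accuracy`, `THRESHOLD`, and `summary` are illustrative.

```python
import pandas as pd

# test set accuracy scores copied from the printed outputs in this notebook
test_accuracy = pd.Series({
    "Decision Tree": 0.7822706065318819,
    "Random Forest": 0.7573872472783826,
    "Logistic Regression": 0.7589424572317263,
})
THRESHOLD = 0.75  # minimum accuracy required by the project

# express each score as a percentage and flag whether it meets the requirement
summary = pd.DataFrame({
    "test_accuracy_pct": (test_accuracy * 100).round(2),
    "meets_threshold": test_accuracy >= THRESHOLD,
}).sort_values("test_accuracy_pct", ascending=False)

print(summary)
```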