# Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible _accuracy_. In this project, the threshold for _accuracy_ is 0.75. Check the _accuracy_ using the test dataset.

## Opening and looking through the data file

In [1]:
# Importing libraries relevant to the project
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error 

In [3]:
# Reading the csv
df = pd.read_csv('users_behavior.csv')

In [5]:
# Taking a closer look at the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [7]:
# Previewing the first 10 rows
display(df.head(10))

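|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |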


## Splitting source data into training set, validation set, and test set

In [9]:
# Splitting the data into training, validation, and test sets at a 3:1:1 ratio (60%:20%:20%)

# Defining features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Holding out 20% of the data as the test set
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=12345)

# Splitting the remaining 80% into training and validation sets (25% of 80% = 20% of the total)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.25, random_state=12345)

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)
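
The set sizes printed above follow from the two `test_size` values. The sketch below reproduces the arithmetic in plain Python, assuming the held-out share is rounded up to a whole row, which is consistent with the shapes shown. `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_share, valid_share_of_rest):
    """Return (train, valid, test) row counts for a two-step split."""
    # Assumption: each held-out part is rounded up to a whole number of rows.
    n_test = math.ceil(n_rows * test_share)
    n_rest = n_rows - n_test
    n_valid = math.ceil(n_rest * valid_share_of_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(3214, 0.2, 0.25)
print(n_train, n_valid, n_test)  # 1928 643 643
```

Taking 25% of the remaining 80% is what makes the validation set the same size as the test set.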


## Investigating the Quality of Different Models


In [11]:
# Model 1: Decision Tree Classifier, tuning the max_depth parameter

for depth in range(1,6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train) # training the model
    predictions_valid = model.predict(features_valid)
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7387247278382582
max_depth = 2 : 0.7573872472783826
max_depth = 3 : 0.7651632970451011
max_depth = 4 : 0.7636080870917574
max_depth = 5 : 0.7589424572317263


The highest validation accuracy among the Decision Tree Classifier models was 0.7651632970451011, at max_depth = 3.
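
For reference, the accuracy reported by `accuracy_score` is the share of predictions that match the true labels. Below is a minimal plain-Python sketch with made-up labels; the `accuracy` helper and the toy data are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Toy labels, made up for illustration: 1 = Ultra, 0 = Smart
toy_true = [0, 0, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0]
print(accuracy(toy_true, toy_pred))  # 0.8 (4 of 5 predictions match)
```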

In [13]:
# Model 2: Random Forest Classifier, tuning the n_estimators parameter

best_score = 0
best_est = 0
for est in range(1, 50): # hyperparameter range
    model = RandomForestClassifier(random_state=12345, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # training model on the training set
    score = model.score(features_valid, target_valid) # calculating the accuracy score on the validation set
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))


Accuracy of the best model on the validation set (n_estimators = 44): 0.7947122861586314
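
Because the loop above updates the best result only when `score > best_score`, a later setting that merely ties the current best does not replace it, so the smallest `n_estimators` among tied settings is the one reported. The sketch below isolates that selection logic. The `pick_best` helper and the scores are hypothetical, chosen only to show the tie behaviour.

```python
def pick_best(scores_by_param):
    """Replicate the selection loop: keep the first parameter with the top score."""
    best_score = 0
    best_param = None
    for param, score in scores_by_param:
        if score > best_score:  # strict comparison: ties keep the earlier value
            best_score = score
            best_param = param
    return best_param, best_score

# Hypothetical (n_estimators, accuracy) pairs; 10 and 30 tie for the top score
hypothetical_scores = [(1, 0.70), (10, 0.78), (20, 0.77), (30, 0.78)]
print(pick_best(hypothetical_scores))  # (10, 0.78)
```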


In [34]:
# Model 3: Logistic Regression 

model = LogisticRegression(random_state=54321, solver='liblinear')
model.fit(features_train, target_train) # training model on the training set
score_train = model.score(features_train, target_train) # calculating accuracy score on training set
score_valid = model.score(features_valid, target_valid) # calculating accuracy score on validation set

print(
    "Accuracy of the logistic regression model on the training set:", score_train)

print(
    "Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7422199170124482
Accuracy of the logistic regression model on the validation set: 0.7293934681181959
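
One way to read the two logistic regression scores is to look at the gap between them. The short calculation below uses the values printed above. The interpretation in the comments is a rule of thumb, not a result measured in this notebook.

```python
score_train = 0.7422199170124482  # training accuracy reported above
score_valid = 0.7293934681181959  # validation accuracy reported above

gap = score_train - score_valid
print(f"Train/validation gap: {gap * 100:.2f} percentage points")

# A small gap with both scores below the 0.75 target hints at underfitting
# rather than overfitting (a rule of thumb, not a measured result).
below_target = score_train < 0.75 and score_valid < 0.75
print("Both scores below 0.75:", below_target)
```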


## Checking the Quality of the Model Using the Test Set 

In [15]:
# Applying the Random Forest Classifier to the test set

test_model = RandomForestClassifier(random_state=12345, n_estimators=44)
test_model.fit(features_train, target_train) # Training the model on the training set
test_model_predictions = test_model.predict(features_test)
test_model_accuracy = accuracy_score(target_test, test_model_predictions)

print('Accuracy Score of the Final Model:', test_model_accuracy)

Accuracy Score of the Final Model: 0.7916018662519441
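
The test accuracy can be translated back into a count of correct predictions, given the 643 test rows shown in the split output. This is plain arithmetic on the reported figures, not an additional model evaluation.

```python
n_test_rows = 643                   # size of the test set from the split output
test_accuracy = 0.7916018662519441  # test accuracy reported above

n_correct = round(test_accuracy * n_test_rows)
n_wrong = n_test_rows - n_correct
print(n_correct, n_wrong)  # 509 134
```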


## Conclusion

Of the three classification models, the RandomForestClassifier achieved the highest accuracy on the validation set: 79.47% with n_estimators = 44. The DecisionTreeClassifier peaked at 76.52% on the validation set, at max_depth = 3. The LogisticRegression model scored lowest, at 72.94%, which is below the 0.75 threshold.

While the RandomForestClassifier had the best results, it is typically also the slowest of the three to train, and it beat the DecisionTreeClassifier by only about three percentage points (79.47% vs. 76.52%).

On the test set, the RandomForestClassifier (n_estimators = 44) reached an accuracy of 79.16%, which clears the 0.75 threshold. Megaline is therefore advised to use the RandomForestClassifier model.
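
The comparison above can be condensed into a small check against the project's 0.75 accuracy threshold, using the validation scores reported in this notebook. The dictionary layout and variable names are illustrative.

```python
THRESHOLD = 0.75  # accuracy threshold from the project description

# Validation accuracies reported in the notebook
validation_scores = {
    "DecisionTreeClassifier (max_depth=3)": 0.7651632970451011,
    "RandomForestClassifier (n_estimators=44)": 0.7947122861586314,
    "LogisticRegression": 0.7293934681181959,
}

best_model = max(validation_scores, key=validation_scores.get)

# List the models from best to worst and flag whether each meets the threshold
ranked = sorted(validation_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    status = "meets" if score >= THRESHOLD else "misses"
    print(f"{name}: {score:.2%} ({status} the threshold)")
print("Best on validation:", best_model)
```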