# MACHINE LEARNING PROJECT

# Project description

- Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

- You have access to behavior data about subscribers who have already switched to the new plans. For this classification task, you need to develop a model that will pick the right plan. 

- Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

# Project instructions

- Open and look through the data file. Path to the file: datasets/users_behavior.csv

- Split the source data into a training set, a validation set, and a test set.

- Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

- Check the quality of the model using the test set.

- Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

# Data description

- Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
    - calls — number of calls,
    - minutes — total call duration in minutes,
    - messages — number of text messages,
    - mb_used — Internet traffic used in MB,
    - is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')
print(df.head())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.describe()

             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000     1.000000
max     244.000000  1632.060000   224.000000  49745.730000     1.000000


In [5]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

print(features.shape)
print(target.shape)

(3214, 4)
(3214,)


In [48]:
# Splitting data into training, validation and test set
features_1, features_test, target_1, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
features_train, features_valid, target_train, target_valid = train_test_split(features_1, target_1, test_size=0.25, random_state=42)


In [49]:
features_test.shape

(643, 4)

In [50]:
features_train.shape

(1928, 4)

In [51]:
features_valid.shape

(643, 4)
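
The two-step split gives roughly a 60/20/20 division, because 25% of the remaining 80% is 20% of the original data. The sketch below reproduces only the arithmetic on placeholder arrays with the same row count as the dataset (3214); the names `features_demo`, `target_demo`, `split_sizes` and `split_shares` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the real dataset (3214).
# The values are dummies; only the resulting shapes matter here.
n_rows = 3214
features_demo = np.zeros((n_rows, 4))
target_demo = np.zeros(n_rows, dtype=int)

# Step 1: hold out 20% of all rows as the test set.
f_rest, f_test, t_rest, t_test = train_test_split(
    features_demo, target_demo, test_size=0.2, random_state=42
)
# Step 2: take 25% of the remaining 80% as the validation set.
f_train, f_valid, t_train, t_valid = train_test_split(
    f_rest, t_rest, test_size=0.25, random_state=42
)

split_sizes = {"train": len(f_train), "valid": len(f_valid), "test": len(f_test)}
split_shares = {name: round(size / n_rows, 3) for name, size in split_sizes.items()}
print(split_sizes)
print(split_shares)
```

The sizes match the shapes printed in the cells above: 1928 training rows, 643 validation rows and 643 test rows.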

In [53]:
# Decision Tree Classifier Model
model = DecisionTreeClassifier(random_state=12345)

model.fit(features_train, target_train)
model.score(features_valid, target_valid)

0.7231726283048211

With default hyperparameters, the decision tree reaches about 72% accuracy on the validation set, which does not meet the 75% threshold.

In [54]:
# Random Forest Classifier Model
rf_model = RandomForestClassifier(random_state=12345)
rf_model.fit(features_train, target_train)
rf_model.score(features_valid, target_valid)

0.7947122861586314

With default hyperparameters, the RandomForestClassifier model reaches about 79% accuracy on the validation set, which meets the 75% threshold.

In [70]:
# Decision Tree Classifier with different max_depth values
# Note: accuracy is measured on the training set here, so it rewards deeper trees
# and should not be used to pick max_depth
best_model = None
best_result = 0
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)  # create a model with the given depth
    model.fit(features_train, target_train)  # train the model
    predictions = model.predict(features_train)  # get the model's predictions on the training set
    result = accuracy_score(target_train, predictions)  # calculate the training accuracy
    if result > best_result:
        best_model = model
        best_result = result

In [71]:
print("Accuracy of the best model on the training set:", best_result)

Accuracy of the best model on the training set: 0.8293568464730291


In [72]:
# Determine best max_depth parameter for DecisionTreeClassifier model
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7418351477449455
max_depth = 2 : 0.7744945567651633
max_depth = 3 : 0.7744945567651633
max_depth = 4 : 0.7807153965785381
max_depth = 5 : 0.7713841368584758


A max_depth of 4 gives the highest validation accuracy, about 78%.
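
The depth search can also be wrapped in a small helper that keeps every validation score and returns the best depth. This is a minimal sketch, not the notebook's own code: the function name `sweep_max_depth` is hypothetical, and the data comes from `make_classification` as a synthetic stand-in for the Megaline features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def sweep_max_depth(f_train, t_train, f_valid, t_valid, depths):
    """Fit one tree per depth and score each on the validation set."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(f_train, t_train)
        scores[depth] = model.score(f_valid, t_valid)  # validation accuracy
    # On ties, max() keeps the first (shallowest) depth with the top score.
    best_depth = max(scores, key=scores.get)
    return best_depth, scores[best_depth], scores


# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
f_train, f_valid, t_train, t_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

best_depth, best_acc, depth_scores = sweep_max_depth(
    f_train, t_train, f_valid, t_valid, range(1, 6)
)
for depth, acc in depth_scores.items():
    print("max_depth =", depth, ":", acc)
print("Best max_depth:", best_depth, "with validation accuracy", best_acc)
```

Scoring on the validation set rather than the training set keeps the choice of depth from simply favoring the deepest tree.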

In [73]:
best_score = 0
best_est = 0

In [74]:
for est in range(1, 11): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

In [75]:
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 8): 0.7900466562986003


In [77]:
# Random Forest Classifier selected as the final model because it gives the best validation accuracy
final_model = RandomForestClassifier(random_state=54321, n_estimators=8)  # n_estimators=8 was the best value in the range tested
final_model.fit(features_train, target_train)
final_model.score(features_valid, target_valid)

0.7900466562986003

Among the values tested (1 to 10), n_estimators = 8 gives the highest validation accuracy, about 79%.

In [66]:
# Logistic regression with the liblinear solver
model = LogisticRegression(random_state=54321, solver='liblinear')
model.fit(features_train, target_train)  # train model on training set
score_train = model.score(features_train, target_train)  # accuracy on training set
score_valid = model.score(features_valid, target_valid)  # accuracy on validation set

In [67]:
print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.703838174273859
Accuracy of the logistic regression model on the validation set: 0.7216174183514774


In [80]:
final_model.score(features_test, target_test)

0.7978227060653188
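
The additional task asks for a sanity check of the model. One common approach, sketched here as an assumption rather than as the notebook's own method, is to compare the model with a constant baseline. In the `describe()` output the mean of `is_ultra` is about 0.306, so always predicting Smart would be correct for roughly 69% of all rows (the share in the test split may differ slightly). The sketch below uses scikit-learn's `DummyClassifier` on synthetic stand-in data with a similar class balance; replacing `f_train`, `f_test`, `t_train` and `t_test` with the notebook's own splits would run the check on the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 70% / 30% classes
# (an assumption; not the Megaline dataset).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
f_train, f_test, t_train, t_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(f_train, t_train)
baseline_acc = baseline.score(f_test, t_test)

# Same settings as the final model in the notebook.
forest = RandomForestClassifier(random_state=54321, n_estimators=8)
forest.fit(f_train, t_train)
forest_acc = forest.score(f_test, t_test)

print("Baseline accuracy:", baseline_acc)
print("Random forest accuracy:", forest_acc)
print("Beats baseline:", forest_acc > baseline_acc)
```

A model passes this check only if its test accuracy is clearly above the baseline; a score close to the majority-class share would suggest the model has learned little beyond the class balance.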

# Conclusion

After comparing several machine learning models, we can conclude that the Random Forest Classifier is the best choice for this dataset: it gave the highest validation accuracy, about 79%, ahead of the decision tree (about 78% at max_depth = 4) and logistic regression (about 72%). Tuning the "n_estimators" hyperparameter over the range 1 to 10 showed that 8 estimators give the highest validation accuracy in that range.

The project sets an accuracy threshold of 75%. The Random Forest Classifier with 8 estimators surpasses it, scoring about 80% on the test set.