<div style="border-radius: 15px; border: 3px solid indigo; padding: 15px;">
<b> Reviewer's comment</b>
    
Hi Elizabeth, I am a reviewer on this project. Congratulations on submitting your first machine learning project! 🎉
    

Before we start, I want to draw your attention to the color marking:
    

   
    
<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">

Great solutions and ideas that can and should be used in the future are in green comments. Some of them are: 
    
    
    
- You have added a good introduction that describes the project, well done! 
    
    
    
- You have successfully prepared the subsets. It is important to split the data correctly in order to ensure there's no intersection;    
    
    
    
- Trained several models; 


    
- Compared the results; 

    

- Tuned hyperparameters. We tune them to identify the hyperparameters that will yield the desired metric value;


- Analyzed accuracy values. It is not enough to just fit the model and print the result. Instead, we have to analyze the results as it helps us identify what can be improved;

   
    
- Wrote an excellent conclusion! A well-written conclusion shows how the project met its objectives and provides a concise and understandable summary for those who may not have been involved in the details of the project. Good job! 

</div>
    
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<b> Reviewer's comment </b>

Yellow color indicates what should be optimized. This is not necessary, but it will be great if you make changes to this project. There are only two such recommendations: it would be perfect if you added a title that describes the core goals of your project, and you can also create a variable with the best model in the hyperparameter tuning loop, which will save you some time, since you will not need to fit the model again to conduct the final test. 
 
</div>
<div class="alert alert-danger" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<b> Reviewer's comment </b>

Issues that must be corrected to achieve accurate results are indicated in red comments. However, there are no issues that need to be fixed, well done!  
</div>        
<hr>
    
<font color='dodgerblue'>**To sum up:**</font>  thank you very much for submitting the project! You did a fantastic job here. You have correctly split the data, successfully trained several models, implemented hyperparameter tuning and conducted the final test, which is great! I do not have any questions, so the project can be accepted. The next sprints will cover more advanced machine learning methods, and I hope you will like them. Good luck! 
    

<hr>
    
Please use some color other than those listed to highlight answers to my comments.
I would also ask you **not to change, move or delete my comments** to make it easier for me to navigate during the next review.
    
In addition, my comments are defined as headings. 
They can mess up the content; however, they are convenient, since you can immediately go to them. I will remove the headings from my comments in the next review if you ask me to. 


<hr>
    
    
✍️ Here's a nice article about [5 hyperparameter tuning techniques](https://www.run.ai/guides/hyperparameter-tuning) that you may find interesting.  
    
<hr>
    
📌 Please feel free to schedule a 1:1 with our tutors or TAs, join daily coworking sessions, or ask questions in the sprint channels on Discord if you need assistance. 
</div>

# Machine Learning Project

Our goal for this project is to help customers by recommending one of Megaline's newer plans, Smart or Ultra, based on their behavior (how they use their phone). To do this we can use the data of customers who have already switched to new plans (from our previous project with Megaline). Using this data we can create a model that will help us pick the right plan for future subscribers who want to switch.

Each row of the users_behavior dataset contains monthly behavior information about one user. The features of this dataset are as follows:
1. calls — number of calls,
2. minutes — total call duration in minutes,
3. messages — number of text messages,
4. mb_used — Internet traffic used in MB,
5. is_ultra — plan for the current month (Ultra - 1, Smart - 0).

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
There's an introduction, which is good. It is important to write an introductory part, because it gives an idea about the content of the project.
</div>
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Please don't forget about the project title :) A title should reflect the core goals.
    
</div>

## Importing the Libraries and DataSet

In [1]:
#import necessary libraries

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
#import data set

df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# We don't have to preprocess the data, as that was done previously, but it is still good to see what data we have.
df.info()
df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 552 | 16.0 | 132.8 | 107.0 | 21893.35 | 1 |
| 2855 | 79.0 | 528.2 | 56.0 | 11925.48 | 0 |
| 2950 | 97.0 | 705.23 | 30.0 | 14103.81 | 0 |
| 728 | 43.0 | 298.42 | 12.0 | 18311.24 | 0 |
| 886 | 83.0 | 558.94 | 62.0 | 24179.89 | 0 |


Looking at the data, we can already see which column will be our target and which columns will be the features used to predict it. Since the goal is to predict which plan a user should pick, the is_ultra column is our target and the remaining four columns are our features. We will separate the features and target when we prepare each of our sets.

## Splitting the Data into Multiple Sets

In the next steps we will split the data into a training set, a validation set and a test set. A common ratio for these three sets is 3:1:1, so our training set will be the largest at 60%, and our validation and test sets will each be 20%.

In [4]:
#splitting the data into a training set, a validation set and a test set (this will take 2 steps)

# 1- split into training (60%) and a temp set (40%)
df_train, df_temp = train_test_split(df, test_size=0.40, random_state=12345)

# 2- split the temp set into validation (20%) and test (20%)
df_valid, df_test = train_test_split(df_temp, test_size=0.50, random_state=12345)

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Correct! We indeed need to split the data into three subsets: one for fitting the model, one for calculating the metric values, and one for the final test. 
    
</div>
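
The two-step split above can be checked on a small synthetic frame. This is only a sketch: the `demo` DataFrame and its size are made up for illustration and stand in for `df`, while the `train_test_split` calls mirror the ones used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 1000 rows with a single column
demo = pd.DataFrame({"value": range(1000)})

# Step 1: 60% training, 40% temporary
demo_train, demo_temp = train_test_split(demo, test_size=0.40, random_state=12345)
# Step 2: split the temporary set in half -> 20% validation, 20% test
demo_valid, demo_test = train_test_split(demo_temp, test_size=0.50, random_state=12345)

# Sizes should follow the 3:1:1 ratio
sizes = (len(demo_train), len(demo_valid), len(demo_test))
print(sizes)

# No row index should appear in more than one subset
overlap = (
    set(demo_train.index) & set(demo_valid.index)
    | set(demo_train.index) & set(demo_test.index)
    | set(demo_valid.index) & set(demo_test.index)
)
print(len(overlap))
```

Comparing the index sets is a quick way to confirm that the subsets do not intersect, which is what keeps the validation and test scores honest.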

In [5]:
#verify that these have been split appropriately
df_train.info()
df_valid.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 3027 to 482
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     1928 non-null   float64
 1   minutes   1928 non-null   float64
 2   messages  1928 non-null   float64
 3   mb_used   1928 non-null   float64
 4   is_ultra  1928 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 90.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1386 to 3197
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
 4   is_ultra  643 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 30.1 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 160 to 2313
Data columns (total 5 columns):
 #   Column    Non-Null Count

## Identifying our Target

As stated above when we loaded our data, our goal is to use user behavior to predict which plan people should pick, Ultra (1) or Smart (0), based on their phone usage. Here we will separate the features and target for each of our split datasets.

In [6]:
# Training set features and target
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

# Validation set features and target
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

# Test set features and target
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Good! 
    
</div>
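
Because the same two lines are repeated for each subset, the separation could also be wrapped in a small helper. The sketch below is one possible way to do it: `split_features_target` is a hypothetical name, and the two-row `sample` frame contains made-up values that only borrow the column names from users_behavior.

```python
import pandas as pd

def split_features_target(frame, target_column="is_ultra"):
    """Return (features, target) for one subset (hypothetical helper)."""
    features = frame.drop(columns=[target_column])
    target = frame[target_column]
    return features, target

# Tiny made-up sample with the same columns as users_behavior
sample = pd.DataFrame({
    "calls": [40.0, 85.0],
    "minutes": [311.9, 516.75],
    "messages": [83.0, 56.0],
    "mb_used": [19915.42, 22696.96],
    "is_ultra": [0, 1],
})

features_sample, target_sample = split_features_target(sample)
print(list(features_sample.columns))
print(target_sample.tolist())
```

With such a helper, each subset would need a single call, for example `features_train, target_train = split_features_target(df_train)`.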

## Testing Different Models

The first model type we will investigate is the Decision Tree. First we are going to run through different tree depths to determine which model version of the Decision tree would fit our needs best. 

In [7]:
for depth in range(1, 10):
    dt_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)  # create a model with the given depth
    dt_model.fit(features_train, target_train)  # train the model on the training set
    predictions_valid = dt_model.predict(features_valid)  # predict on the validation set

    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7542768273716952
max_depth = 2 : 0.7822706065318819
max_depth = 3 : 0.7853810264385692
max_depth = 4 : 0.7791601866251944
max_depth = 5 : 0.7791601866251944
max_depth = 6 : 0.7838258164852255
max_depth = 7 : 0.7822706065318819
max_depth = 8 : 0.7791601866251944
max_depth = 9 : 0.7822706065318819


<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
You have successfully implemented hyperparameter tuning, well done! We need to tune them to identify the hyperparameters that will yield the desired metric value.
</div>

Based on these results, a max_depth of 3 gives the best decision tree model, as it has the highest validation accuracy of 0.785 (rounded). Therefore we will create our final decision tree model with a max_depth of 3. It is also worth noting that every depth we tried reached our accuracy threshold of 0.75.

In [8]:
# final decision tree model

dt_final_model = DecisionTreeClassifier(random_state=12345, max_depth = 3) # change max_depth to get best model
dt_final_model.fit(features_train, target_train)

DecisionTreeClassifier(max_depth=3, random_state=12345)

<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
A piece of advice: instead of fitting the model again, you can create a variable with the best model found after tuning in the loop above. 
</div>
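
The advice above could look roughly like the following sketch. It runs on synthetic data from `make_classification` rather than the Megaline data, and the names `best_model`, `best_depth` and `best_score` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (4 features, binary target), for illustration only
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_model = None
best_depth = 0
best_score = 0.0
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # accuracy on the validation set
    if score > best_score:
        # Keep the fitted model itself, not only its hyperparameter
        best_model, best_depth, best_score = model, depth, score

print("best max_depth:", best_depth)
print("validation accuracy:", best_score)
```

After the loop, `best_model` is already fitted, so it can be passed straight to the final test without another call to `fit`.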

Next we will look into Random Forest Models.  Like the Decision Tree we want to find the best version of this model so we will run it through different numbers of trees in the forest with the n_estimators hyperparameter.

In [9]:
best_score = 0
best_est = 0
for est in range(1, 30):  # try 1 to 29 trees
    rf_model = RandomForestClassifier(random_state=12345, n_estimators=est)  # set number of trees
    rf_model.fit(features_train, target_train)  # train model on training set
    score = rf_model.score(features_valid, target_valid)  # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est  # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 23): 0.7947122861586314


We ran this loop with upper bounds of 10, 20, 30, 40 and 50. Beyond 23 trees the validation accuracy did not improve any further, so I left the upper bound at 30 to keep the run time short, since more trees take longer to train. As we can see, Random Forest is more accurate than Decision Tree on the validation set: the best Random Forest model has an accuracy of 0.795 (rounded), compared to 0.785 (rounded) for the best Decision Tree model.

In [10]:
rf_final_model = RandomForestClassifier(random_state=12345, n_estimators=23) # change n_estimators to get best model
rf_final_model.fit(features_train, target_train)

RandomForestClassifier(n_estimators=23, random_state=12345)

Finally, we will look at Logistic Regression models.

In [11]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')  # initialize logistic regression with random_state=12345 and solver='liblinear'
lr_model.fit(features_train , target_train)  # train model on training set
score_train = lr_model.score(features_train, target_train)  
score_valid = lr_model.score(features_valid , target_valid)  

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,)

Accuracy of the logistic regression model on the training set: 0.7505186721991701
Accuracy of the logistic regression model on the validation set: 0.7589424572317263


Here we can see that the logistic regression model had an accuracy of 0.751 (rounded) on the training set and a slightly better 0.759 (rounded) on the validation set. Comparing all models on the validation set, our best random forest model (n_estimators = 23) has the greatest accuracy at 0.795 (rounded), followed by our best decision tree model (max_depth = 3) at 0.785 (rounded), and finally our logistic regression model at 0.759 (rounded). All our models reach our threshold accuracy of 0.75. 

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Good, you compared three models with different hyperparameters!


</div>
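
For a side-by-side view, the validation accuracies printed in the cell outputs above can be collected and ranked. This is just a small sketch: the numbers are copied from those outputs, and the labels and variable names are chosen here for illustration.

```python
import pandas as pd

# Validation accuracies copied from the cell outputs above
validation_accuracy = pd.Series({
    "decision tree (max_depth=3)": 0.7853810264385692,
    "random forest (n_estimators=23)": 0.7947122861586314,
    "logistic regression": 0.7589424572317263,
})

# Rank the models from most to least accurate
ranking = validation_accuracy.sort_values(ascending=False).round(3)
print(ranking)

best_name = ranking.index[0]
print("Best on validation:", best_name)
```

Ranking on the validation set, and not on the test set, keeps the test data untouched until the final check.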

## Checking the Quality of our Best Model Using the Test Set

In this step we will use our trained random forest model, run our test data set through it and test its accuracy.

In [12]:
# Predict on the test set
test_predictions = rf_final_model.predict(features_test)

# Calculate accuracy
test_accuracy = accuracy_score(target_test, test_predictions)

print(f"Test Set Accuracy: {test_accuracy:.4f}")

Test Set Accuracy: 0.7807


Here we can see that our model, when tested, continues to exceed our accuracy threshold of 0.75 with a score of 0.78.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Correct. Here we indeed use the best model (it's one model usually) among all models we trained and tuned to run the final test using the test subset. It helps us evaluate the model's generalization ability. 

</div>

## Sanity Check
Finally, we will sanity check the model. To do this we will create a baseline model that ignores the features and always predicts the most frequent plan in the training data. Then we can compare the accuracy of this baseline to that of our best model on the test set.

In [13]:
from sklearn.dummy import DummyClassifier

# Create a baseline classifier
baseline_model = DummyClassifier(strategy="most_frequent")

# Fit the model on the training data
baseline_model.fit(features_train, target_train)

# Predict on the test data
baseline_predictions = baseline_model.predict(features_test)

# Calculate accuracy
baseline_accuracy = accuracy_score(target_test, baseline_predictions)

print(f"Baseline Model Test Set Accuracy: {baseline_accuracy:.2f}")

# Compare with your model's accuracy
print(f"Your Model Test Set Accuracy: {test_accuracy:.2f}")

Baseline Model Test Set Accuracy: 0.68
Your Model Test Set Accuracy: 0.78


Here we can see that our model reaches a higher test accuracy (0.78) than the baseline model (0.68), which always predicts the most frequent plan.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Good idea to check it!    
</div>
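
To see what a `most_frequent` baseline measures, the sketch below uses made-up labels (70% class 0 in training, 14 of 20 in the test labels). Under these assumptions, the baseline's accuracy is simply the share of the majority class in the test labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: class 0 is the majority
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 14 + [1] * 6)

# DummyClassifier ignores the features, so placeholder zeros are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_predictions = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_predictions)
majority_share = (y_test == 0).mean()

print(baseline_accuracy, majority_share)
```

Any useful model should beat this number, because it can be reached without looking at the features at all.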

# Conclusion

In conclusion, we can use our selected model (random forest with n_estimators = 23) to predict which plan customers will choose based on how they use their phone. We can also use this model to help direct customers to the best plan for them with fairly reliable accuracy.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Excellent job! 
    
</div>