<div style="border-radius: 15px; border: 3px solid indigo; padding: 15px;">
<b> Reviewer's comment</b>
    
Hi Max, I am a reviewer on this project. Congratulations on submitting your first machine learning project! 🎉
    

Before we start, I want to draw your attention to the color marking:
    

   
    
<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">

Great solutions and ideas that can and should be used in the future are in green comments. Some of them are: 
    
    
    
- You have added a good introduction and title that describe the project and its key goal, well done! 
    
    
    
- You have successfully prepared the subsets. It is important to split the data correctly in order to ensure there's no intersection;    
    
    
- Trained several models; 


- Compared the results; 

    

- Tuned hyperparameters. We tune them to identify the hyperparameters that will yield the desired metric value;


- Analyzed accuracy values. It is not enough to just fit the model and print the result. Instead, we have to analyze the results as it helps us identify what can be improved;

   
    
- Wrote an excellent conclusion! A well-written conclusion shows how the project met its objectives and provides a concise and understandable summary for those who may not have been involved in the details of the project. Good job! 

</div>
    
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<b> Reviewer's comment </b>

Yellow color indicates what should be optimized. This is not necessary, but it will be great if you make changes to this project. I've left several recommendations throughout the project. For instance, it will be great if you also add a title to your project and avoid any code repetitions.
 
</div>
<div class="alert alert-danger" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<b> Reviewer's comment </b>

Issues that must be corrected to achieve accurate results are indicated in red comments. However, there are no issues that need to be fixed, well done!  
   
</div>        
<hr>
    
<font color='dodgerblue'>**To sum up:**</font> thank you very much for submitting the project! You did a fantastic job here. You correctly split the data, trained several models, tuned their hyperparameters, and conducted the final test, which is great! I do not have any questions, so the project can be accepted. The next sprints will cover more advanced machine learning methods; I hope you will like them. Good luck!
    

<hr>
    
Please use some color other than those listed to highlight answers to my comments.
I would also ask you **not to change, move or delete my comments** to make it easier for me to navigate during the next review.
    
In addition, my comments are defined as headings. 
They can mess up the content; however, they are convenient, since you can immediately go to them. I will remove the headings from my comments in the next review if you ask me to. 


<hr>
    
    
✍️ Here's a nice article about [5 hyperparameter tuning techniques](https://www.run.ai/guides/hyperparameter-tuning) that you may find interesting.  
    
<hr>
    
📌 Please feel free to schedule a 1:1 with our tutors or TAs, join daily coworking sessions, or ask questions in the sprint channels on Discord if you need assistance with further projects. 
</div>

# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

## Goal

Develop a model with the highest possible accuracy that will pick the right plan: Smart or Ultra.

## Data description

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

- `calls`: number of calls
- `minutes`: total call duration in minutes
- `messages`: number of text messages
- `mb_used`: Internet traffic used, in MB
- `is_ultra`: plan for the current month (Ultra: 1, Smart: 0)

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
There's an introduction, which is good. It is important to write an introductory part, because it gives an idea about the content of the project.
</div>
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Please don't forget about the project title :) A title should reflect the core goals.
    
</div>

In [1]:
# Import libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Load the data and store it to df

df = pd.read_csv('/datasets/users_behavior.csv')

### Data Exploration

In [3]:
df.head(10)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Conclusions:

- The target variable `is_ultra` is binary.
- The data types are valid and need no conversion.
- There are no null or missing values.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Very good! </div>

# Data Modeling

The following two machine learning algorithms will be used:

- Decision Tree
- Random Forest

Each algorithm will be tuned on the validation set. The configuration with the highest validation accuracy will then be evaluated on the test set to estimate how well it performs on unseen data.

**Splitting the data**

The dataset includes a column called `is_ultra`, which identifies the customer's plan—`0` for the Smart plan and `1` for the Ultra plan. Since our goal is to recommend the most suitable plan based on a customer's usage patterns, `is_ultra` will serve as the target variable. The other columns (`calls`, `minutes`, `messages`, and `mb_used`) represent each customer's usage behavior and play a role in determining whether they choose the Smart or Ultra plan. These columns will be used as features for analysis.

In [5]:
# The features of the DataFrame include all columns except for 'is_ultra'
features = df.drop('is_ultra', axis=1)

# The target of the DataFrame is the 'is_ultra' column
target = df['is_ultra']

The features DataFrame and target Series will be split into training, validation, and test sets using a 3:1:1 ratio. This means that 60% of the data will be used for training, while 20% will be allocated to both the validation and test sets. This ensures a balanced approach for model training, fine-tuning, and final evaluation.
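
As a rough check of the arithmetic, the sketch below works out how many rows each subset should receive for the 3,214 rows reported by `df.info()`. It assumes that `train_test_split` rounds the held-out share up to a whole row and gives the remainder to the other part. The helper `split_sizes` is hypothetical and only illustrates the calculation.

```python
import math


def split_sizes(n_rows, holdout_fraction):
    """Return (kept, held_out) row counts for a single split.

    Assumption: the held-out part is rounded up to a whole row and
    the kept part receives the remainder.
    """
    held_out = math.ceil(n_rows * holdout_fraction)
    return n_rows - held_out, held_out


n_rows = 3214  # number of entries reported by df.info()

# First split: 60% training, 40% "other"
n_train, n_other = split_sizes(n_rows, 0.4)

# Second split: halve "other" into validation and test
n_valid, n_test = split_sizes(n_other, 0.5)

for name, size in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {size} rows ({size / n_rows:.1%})")
```

Comparing these numbers with `features_train.shape`, `features_valid.shape`, and `features_test.shape` is a quick way to confirm that the split behaved as intended.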

In [6]:
# Split the data into training, validation, and test sets.
# First, hold out 40% of the rows as "other"; the remaining 60% is the training set.
features_train, features_other, target_train, target_other = train_test_split(
    features, target, test_size=0.4, random_state=12345
)

# Then split "other" in half (test_size=0.5) so that the validation and
# test sets each contain 20% of the original data.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_other, target_other, test_size=0.5, random_state=12345
)

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Correct! We indeed need to split the data into three subsets: one for fitting the model, one for calculating the metric values, and one for the final test. 
    
</div>

**Decision Tree Model**  

Since the target column contains binary values (0 or 1), this is a binary classification task. A decision tree is an appropriate algorithm for this type of prediction. One of the key hyperparameters in a decision tree is its maximum depth, which controls how complex the model can become. To find the optimal depth, multiple decision tree models will be trained with varying maximum depths. Each model's accuracy will be evaluated, and the one with the highest accuracy will be selected as the best-performing model.

In [7]:
# Initialize variables to store the best model details
best_DT_model = None
best_DT_depth = 0
best_DT_accuracy = 0

# Iterate over depth values from 1 to 40
for depth in range(1, 41):
    # Create and train the Decision Tree model
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    DT_model.fit(features_train, target_train)
    
    # Make predictions on the validation set
    DT_predictions_valid = DT_model.predict(features_valid)
    
    # Calculate accuracy
    accuracy = accuracy_score(target_valid, DT_predictions_valid)

    # Update best model if current model is better
    if accuracy > best_DT_accuracy:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_accuracy = accuracy

# Display the best model details
print(f'Best Decision Tree Model Depth: {best_DT_depth}')
print(f'Best Validation Accuracy: {best_DT_accuracy:.2%}')


Best Decision Tree Model Depth: 3
Best Validation Accuracy: 78.54%


The decision tree with the highest accuracy has a maximum depth of 3 and achieved a validation accuracy of 78.54%. This model is stored as `best_DT_model` and will be used when the best models are compared.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
You have successfully implemented hyperparameter tuning, well done! We need to tune them to identify the hyperparameters that will yield the desired metric value.
</div>

 **Random Forest Model**  

Next, we will use a Random Forest model to predict the target values. Two key hyperparameters significantly impact the model's performance:  

1. Maximum Depth (`max_depth`) – Determines how deep each decision tree can grow.  
2. Number of Estimators (`n_estimators`) – Represents the total number of decision trees in the forest.  

To find the best combination of these hyperparameters, we will train and test multiple models with varying values for both maximum depth and number of estimators. This will be achieved using nested loops, iterating over a range of values for each parameter. After evaluating all models based on accuracy, the model with the highest accuracy score will be selected as the best-performing Random Forest model.
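
To give a sense of the size of this search, the nested loops can be viewed as a flat grid of (`n_estimators`, `max_depth`) pairs. The sketch below only enumerates the grid and trains nothing. The ranges match the loops in the next cell, and the variable names are illustrative.

```python
from itertools import product

estimator_values = range(1, 21)  # n_estimators: 1 to 20
depth_values = range(1, 41)      # max_depth: 1 to 40

# Every (n_estimators, max_depth) pair the nested loops visit, in the same order
grid = list(product(estimator_values, depth_values))

print(f"Configurations to train: {len(grid)}")
print(f"First: {grid[0]}, last: {grid[-1]}")
```

The number of models to fit is the product of the two range lengths, so the cost grows multiplicatively with every additional hyperparameter or wider range.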

In [8]:
# Initialize variables to store the best model details
best_RF_model = None
best_RF_accuracy = 0
best_RF_depth = 0
best_est = 0

# Iterate over different numbers of estimators (1 to 20)
for est in range(1, 21):
    
    # Iterate over different depth values (1 to 40)
    for depth in range(1, 41):
        
        # Create and train the Random Forest model
        RF_model = RandomForestClassifier(max_depth=depth, n_estimators=est, random_state=12345)
        RF_model.fit(features_train, target_train)

        # Make predictions on the validation set
        RF_predictions_valid = RF_model.predict(features_valid)

        # Calculate accuracy
        accuracy = accuracy_score(target_valid, RF_predictions_valid)

        # Update best model if current model performs better
        if accuracy > best_RF_accuracy:
            best_RF_model = RF_model
            best_RF_accuracy = accuracy
            best_RF_depth = depth
            best_est = est

# Display the best model details
print('Best Random Forest Model:')
print(f'  - Depth: {best_RF_depth}')
print(f'  - Estimators: {best_est}')
print(f'  - Validation Accuracy: {best_RF_accuracy:.2%}')


Best Random Forest Model:
  - Depth: 12
  - Estimators: 17
  - Validation Accuracy: 80.56%


The random forest with the highest accuracy has a maximum depth of 12 and 17 estimators, and achieved a validation accuracy of 80.56%. This model is stored as `best_RF_model` and will be used when the best models are compared.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Good, you compared a couple of models with different hyperparameters!


</div>

**Comparing the Best Models**

The optimal hyperparameters for both the decision tree and random forest models have been identified, and the best model of each type has been saved as `best_DT_model` and `best_RF_model`. The two are now compared on the validation set; the one with the higher validation accuracy is then evaluated on the test set.

**Best decision tree model**

In [9]:
# Predict target values on the validation set using the best Decision Tree model
DT_validation_predictions = best_DT_model.predict(features_valid)

# Calculate accuracy
DT_validation_accuracy = accuracy_score(target_valid, DT_validation_predictions)

# Display the accuracy
print(f'Validation Accuracy of Best Decision Tree Model: {DT_validation_accuracy:.2%}')

Validation Accuracy of Best Decision Tree Model: 78.54%


The best decision tree model achieved an accuracy of 78.54% on the validation dataset, which is higher than the 75% threshold for model accuracy.

**Best random forest model**

In [10]:
# Predict target values on the validation set using the best Random Forest model
RF_validation_predictions = best_RF_model.predict(features_valid)

# Calculate accuracy
RF_validation_accuracy = accuracy_score(target_valid, RF_validation_predictions)

# Display the accuracy
print(f'Validation Accuracy of Best Random Forest Model: {RF_validation_accuracy:.2%}')


Validation Accuracy of Best Random Forest Model: 80.56%


<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment  </h2>
    
Aren't you repeating the predictions you've already made in the loops above?     
</div>
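
One way to avoid recomputing predictions is to keep the validation accuracies recorded during tuning and choose between the models from those stored values. The sketch below is a minimal illustration that uses the two accuracies printed by the tuning cells; the dictionary `validation_accuracy` and its keys are hypothetical names, not part of the notebook.

```python
# Validation accuracies already reported by the tuning loops
validation_accuracy = {
    "decision_tree": 0.7854,  # max_depth=3
    "random_forest": 0.8056,  # max_depth=12, n_estimators=17
}

# Pick the model type with the highest stored accuracy; nothing is re-predicted
best_name = max(validation_accuracy, key=validation_accuracy.get)

print(f"Best model on validation: {best_name} "
      f"({validation_accuracy[best_name]:.2%})")
```

In the notebook itself, the same idea amounts to comparing `best_DT_accuracy` with `best_RF_accuracy` directly.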

The random forest outperformed the decision tree, achieving 80.56% accuracy on the validation dataset. This exceeds both the 75% threshold and the decision tree's 78.54%, so the random forest is the stronger candidate for the final test.

# **Final model**

In [11]:
# Select the best model based on validation accuracy
best_model = best_DT_model if DT_validation_accuracy > RF_validation_accuracy else best_RF_model

# Predict target values on the test set using the best model
test_predictions = best_model.predict(features_test)

# Calculate accuracy on the test set
test_accuracy = accuracy_score(target_test, test_predictions)

# Display the results
print(f'Test Accuracy of the Best Model: {test_accuracy:.2%}')


Test Accuracy of the Best Model: 79.94%


The final selected model, the random forest (maximum depth 12, 17 estimators), achieved an accuracy of 79.94% on the test dataset. This exceeds the 75% threshold and is close to its validation accuracy of 80.56%, which suggests that the model generalizes well to unseen data.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Correct. Here we indeed use the best model (it's one model usually) among all models we trained and tuned to run the final test using the test subset. It helps us evaluate the model's generalization ability. 

</div>


# **Sanity Check**

The recommendation task involves classifying users into either the Smart or Ultra plan, represented as binary values (0 or 1). In binary classification, a baseline accuracy can be calculated by simply predicting the majority class. For this dataset, the baseline accuracy stands at approximately 70%, representing the proportion of the majority class.

The performance of our trained models clearly exceeds this baseline. On the validation set, the decision tree achieved 78.54% accuracy and the random forest reached 80.56%; the random forest also scored 79.94% on the test set. Both models therefore predict better than simply defaulting to the most common plan, and either offers a useful basis for recommending a plan to customers.

<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
I recommend that you show the distribution here as well. You can also conduct the sanity check using some constant model. 

</div>
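
Following the suggestion above, a constant-model baseline can be sketched with nothing more than the label counts. The example uses only the ten `is_ultra` values shown by `df.head(10)`, so the resulting share is illustrative and is not the baseline of the full dataset. The function name `majority_baseline` is hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# is_ultra values from the first ten rows displayed by df.head(10)
sample_target = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

label, accuracy = majority_baseline(sample_target)
print(f"Class distribution: {dict(Counter(sample_target))}")
print(f"Always predicting {label} gives {accuracy:.0%} accuracy")
```

On the full data, the same calculation could be applied to `target`, for example with `target.value_counts(normalize=True)`, or with scikit-learn's `DummyClassifier(strategy='most_frequent')` scored on the test set.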

# **Conclusion**

Megaline sought a predictive model to recommend either the Ultra or Smart plan to customers still using legacy services. This project focused on a binary classification challenge, with two primary machine learning approaches: decision tree and random forest models.

The data was partitioned into three subsets: a training set (60%), a validation set (20%), and a test set (20%). We explored multiple model configurations by systematically varying hyperparameters. On the validation set, the best decision tree (maximum depth 3) achieved 78.54% accuracy, while the best random forest (maximum depth 12, 17 estimators) outperformed it with 80.56%.

Both models exceeded the 75% accuracy threshold on the validation set. The random forest was therefore selected as the final model and reached 79.94% accuracy on the test set, which makes it the more suitable of the two for Megaline's plan recommendations.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment</h2>
    
Excellent job! 
    
</div>