## Reviewer Introduction

Hello STUDENT NAME!

Hello, my name is Juan Miguel Gutierrez, also known as Juanmi!

I'm delighted to assist you with your project today. As part of my role, I will review your work and provide feedback. Initially, if I notice any mistakes, I will simply point them out, allowing you to identify and correct them on your own. This approach is aimed at helping you develop the skills required for a career as a Data Scientist.

In a real job setting, it's common for a team lead or supervisor to follow a similar approach, encouraging you to troubleshoot and find solutions independently. However, if you find the task challenging and need further guidance, I will offer more precise hints and suggestions in subsequent iterations.

Please feel free to ask for assistance or clarification whenever needed. I'm here to support you in your journey towards becoming a skilled Data Scientist.

You will find my comments below - please do not move, modify or delete them.
You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:
<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>
</div>


### General feedback V1


The analysis for the project was excellent, and I thoroughly enjoyed reading the entire project. There were several aspects that stood out to me:

* The project demonstrated a solid understanding of the hyperparameters used in the decision tree model.
* The data was correctly split and a sanity check was performed on the shape, ensuring the data was handled appropriately.
* The quality check of the data was thorough and well-executed, resulting in clean and reliable data.
* The code itself was clean and well-organized, making it easy to follow and understand.

However, a few issues need to be addressed before the project can be concluded. They are marked in the code with "Warning" and "Danger" tags, and I recommend focusing on resolving them:

* The data split may be unbalanced across the two classes, which could affect the model's performance. Please refer to [Warning 1](#warning1) and [Warning 0](#warning0) in the code for more details.
* The hyperparameter search was too narrow. It is recommended to try at least three different parameter combinations to find the optimal configuration. Please refer to [Danger 1](#danger1) in the code for more information.
* The accuracy on the train set and other important sanity checks were not displayed. It would be beneficial to include these metrics for a comprehensive evaluation. Please see [Warning 2](#warning2) in the code for further guidance.
* A proper sanity check of the model, such as comparing it with baselines or using a confusion matrix, was not performed. This step is crucial in assessing the model's performance. Please refer to [Warning 3](#warning3) in the code for suggestions on how to address this.

In general, with these fixes, the project will be fantastic and well-rounded. I appreciate the ordered code, excellent explanations, and your proficiency in Python. Don't forget to implement the necessary fixes and update the conclusions accordingly.

If you have any further questions or need assistance, feel free to ask. Good luck with the project!


**Review Checklist**

- [x] Data is loaded
- [x] Data is split into three sets
- [x] Sets' sizes are chosen correctly
- [x] Model tuning is conducted
- [x] Tuning is conducted correctly
- [ ] At least 2 algorithms and at least 3 values of hyperparameters are considered
- [x] Study findings are complete 
- [x] Model testing is complete
- [x] Testing is done correctly 
- [x] Accuracy is at least 0.75

# Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

I will finish the project using the following steps:
    
- Open and look through the data file. Path to the file: /datasets/users_behavior.csv.

- Split the source data into a training set, a validation set, and a test set.

- Investigate the quality of different models by changing hyperparameters, briefly describing the findings of the study.

- Check the quality of the model using the test set.

- Additional task: sanity check the model. This data is more complex than what I am used to working with, so it's not an easy task. We'll take a closer look at it later.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Review-Iterations-1" data-toc-modified-id="Review-Iterations-1-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Review Iterations 1</a></span></li><li><span><a href="#Importing-files" data-toc-modified-id="Importing-files-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing files</a></span></li><li><span><a href="#Spliting-the-source-data-into-a-training-set,-a-validation-set,-and-a-test-set." data-toc-modified-id="Spliting-the-source-data-into-a-training-set,-a-validation-set,-and-a-test-set.-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Spliting the source data into a training set, a validation set, and a test set.</a></span></li><li><span><a href="#Investigate-the-quality-of-different-models-by-changing-hyperparameters.-While-Briefly-describing-the-findings-of-the-study." data-toc-modified-id="Investigate-the-quality-of-different-models-by-changing-hyperparameters.-While-Briefly-describing-the-findings-of-the-study.-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Investigate the quality of different models by changing hyperparameters. While Briefly describing the findings of the study.</a></span></li><li><span><a href="#Checking-the-quality-of-the-model-using-the-test-set." data-toc-modified-id="Checking-the-quality-of-the-model-using-the-test-set.-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Checking the quality of the model using the test set.</a></span></li><li><span><a href="#Additional-task:-sanity-check-the-model." data-toc-modified-id="Additional-task:-sanity-check-the-model.-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Additional task: sanity check the model.</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Importing files
Importing files and checking data for any incorrect data types, missing values or duplicate values.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_csv("https://code.s3.yandex.net/datasets/users_behavior.csv")

In [3]:
display(data.head())

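|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |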


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

In [5]:
print(data["is_ultra"].unique())

[0 1]


All data types seem correct. The only column that might cause problems is is_ultra, which acts as a boolean value where 0 is False and 1 is True. This should not cause problems for the model that will be developed, but if any issues do pop up, this column is one of the first places to look.

In [6]:
display(data.describe())

|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


<a id='warning0'></a>
<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
I found this table very valuable. The mean of is_ultra shows a possible imbalance between the Ultra and Smart plans: about 30% of rows are Ultra and 70% are Smart. This can affect the accuracy within each group for your future model.
</div>
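One quick way to confirm the imbalance noted above is to look at the class shares directly. The sketch below uses a small hypothetical Series as a stand-in for `data["is_ultra"]` (the values are made up for illustration), and also derives the accuracy of always predicting the most frequent class, which is a more realistic bar than 50/50 guessing when classes are uneven.

```python
import pandas as pd

# Hypothetical stand-in for data["is_ultra"]: 7 Smart (0) and 3 Ultra (1) users.
is_ultra = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="is_ultra")

# Share of each class; on the real dataset this would be
# data["is_ultra"].value_counts(normalize=True).
class_share = is_ultra.value_counts(normalize=True)
print(class_share)

# Accuracy obtained by always predicting the most frequent class.
# A useful model should beat this number, not just 0.5.
majority_baseline = class_share.max()
print("Majority-class baseline accuracy:", majority_baseline)
```

On the project data, the same two lines applied to `data["is_ultra"]` should give a baseline close to 0.69, judging from the mean of 0.306 in the table above.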

In [7]:
print(data.isnull().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [8]:
print(data.duplicated().sum())

0


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
You followed the standard approach, which is good. Remember that duplicates and null values are not the only enemies of data quality. In this case everything seems OK, but always double-check that the data makes sense by inspecting a few random rows. Going row by row, you may sometimes find values that do not make sense and that lead you to filter out more data.
</div>

There are no missing values or duplicate rows in the dataset, so it is safe to continue.

##  Spliting the source data into a training set, a validation set, and a test set.
Splitting the source data into a training set, a validation set and a test set for the model that will be selected. 

I will split the data as follows: 50% in the training set, 25% in the validation set, and 25% in the test set. It is standard to use the largest portion of the data to train the model, so it can be as accurate as possible.

In [9]:
data_train, data_valid = train_test_split(data, test_size=0.5, random_state=12345)
data_valid, data_test = train_test_split(data_valid, test_size=0.5, random_state=12345)

In [20]:
features_train = data_train.drop(["is_ultra"], axis=1)
target_train = data_train['is_ultra']
features_valid = data_valid.drop(["is_ultra"], axis=1)
target_valid = data_valid['is_ultra']
features_test = data_test.drop(["is_ultra"], axis=1)
target_test = data_test["is_ultra"]

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1607, 4)
(1607,)
(803, 4)
(803,)
(804, 4)
(804,)


The half of the data left after the first split has an odd number of rows (1607), hence the 1-row difference between the validation and test sets.

<a id='warning1'></a>
<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
Remember to review the balance of the classes before splitting. In an extreme case, one class (possibly the Ultra class) could end up almost entirely in the test set. Remember to stratify when you split between train and test; check the ```stratify``` parameter.

Remember from the lesson that the recommended split is 60% train, 20% validation, 20% test. Since you only have about 3000 rows, I would increase the train percentage and reduce the validation and test sets.

</div>
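The stratified 60/20/20 split suggested above could look like the following sketch. The DataFrame `demo` is a hypothetical stand-in with a 70/30 class mix, and the names `train`, `valid`, and `test` are illustrative; in the project the same calls would be applied to `data` with `stratify=data["is_ultra"]`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with roughly the same 70/30 class mix as the project.
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "is_ultra": [0] * 700 + [1] * 300,
})

# Step 1: hold out 20% for the test set, keeping class proportions.
train_valid, test = train_test_split(
    demo, test_size=0.2, random_state=12345, stratify=demo["is_ultra"]
)
# Step 2: 25% of the remaining 80% is 20% of the total, giving 60/20/20.
train, valid = train_test_split(
    train_valid, test_size=0.25, random_state=12345, stratify=train_valid["is_ultra"]
)

# Each part should show (almost) the same share of Ultra users.
for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, part.shape, round(part["is_ultra"].mean(), 3))
```

Splitting in two steps is needed because `train_test_split` only produces two parts per call; the second `test_size` is expressed relative to what is left after the first split.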

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
You separated features and target like a master, and made a sanity check of the shapes!
</div>

## Investigate the quality of different models by changing hyperparameters. While Briefly describing the findings of the study.

I will compare several models by their accuracy score on the validation set, changing their hyperparameters along the way. The accuracy threshold to reach is 0.75.

In [11]:
for depth in range(1, 6):
    model1 = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model1.fit(features_train, target_train)
    predictions_valid1 = model1.predict(features_valid)
    print("max_depth =", depth, ": ", end="")
    print(accuracy_score(target_valid, predictions_valid1))

max_depth = 1 : 0.7571606475716065
max_depth = 2 : 0.7808219178082192
max_depth = 3 : 0.7870485678704857
max_depth = 4 : 0.7820672478206725
max_depth = 5 : 0.7820672478206725


A max_depth of 3 gives the best result on the validation set, 0.787. This is good considering that it beats the 50/50 odds of guessing, and a shallower tree is also preferable to a max_depth of 5, since deeper trees are more prone to overfitting.

In [12]:
best_est=0
best_score=0

for est in range(1, 11): 
    model2 = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model2.fit(features_train, target_train)
    score2 = model2.score(features_valid, target_valid)
    if score2 > best_score:
        best_score = score2
        best_est = est
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 8): 0.7858032378580324


An n_estimators of 8 gives the best score, but it does not do better than the decision tree: the random forest scores 0.786 and the decision tree 0.787.

In [13]:
model3 = LogisticRegression(random_state=12345, solver="liblinear")
model3.fit(features_train, target_train)
score_train = model3.score(features_train, target_train)
score_valid = model3.score(features_valid, target_valid)
print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7423771001866832
Accuracy of the logistic regression model on the validation set: 0.75093399750934


The logistic regression model did the worst, with a validation accuracy of 0.751 (0.742 on the training set), only barely meeting the accuracy threshold of 0.75.

All three models reached the 0.75 threshold on the validation set, and the best one was the decision tree with a score of 0.787. This model will be used for the project. This makes sense, since the is_ultra column is like a boolean value, being only 1 or 0.

<a id='danger1'></a>
<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

1. I recommend tuning at least 3 hyperparameters. You are already using 2 (n_estimators for the random forest and max_depth for the decision tree); check at least one more hyperparameter: min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, min_impurity_decrease.

2. Also apply them to the random forest model; the more tools you have to attack a problem, the better.

</div>
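As an illustration of tuning one more hyperparameter, the sketch below loops over `max_depth` together with `min_samples_leaf` for a decision tree. The data comes from `make_classification` and is only a hypothetical stand-in; in the project the loop would use `features_train`, `target_train`, `features_valid`, and `target_valid`. The candidate values (1, 5, 20) are an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data with 4 features and a 70/30 class mix.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# Try every pair (max_depth, min_samples_leaf) and keep the best on validation.
best = {"accuracy": 0.0}
for depth in range(1, 6):
    for leaf in (1, 5, 20):
        model = DecisionTreeClassifier(
            random_state=12345, max_depth=depth, min_samples_leaf=leaf
        )
        model.fit(X_train, y_train)
        acc = accuracy_score(y_valid, model.predict(X_valid))
        if acc > best["accuracy"]:
            best = {"max_depth": depth, "min_samples_leaf": leaf, "accuracy": acc}

print(best)
```

The same pattern extends to `min_samples_split` or `max_features` by adding another loop or another entry to the candidate values.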

<a id='warning2'></a>
<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

1. When iterating over hyperparameters, also consider combining them: the n_estimators and max_depth ranges in your loops give 10 × 5 = 50 combinations. In the future you can try Optuna, a library dedicated to fast hyperparameter tuning and used by expert Data Scientists.

2. I recommend plotting the confusion matrix for the train and validation sets to see whether the classification is unbalanced between classes. There may be room for improvement here.

3. It is a good idea to collect all the validation accuracy results in a summary table, ideally disaggregated.

4. It is also sensible to show the accuracy on the training set.
</div>
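A combined search with a summary table could be sketched as follows. It trains a random forest for each pair of `n_estimators` and `max_depth`, and records train and validation accuracy side by side so that overfitting is visible. The data and the candidate values are hypothetical stand-ins chosen to keep the example fast; they are not the project's settings.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the project use the existing train/valid sets.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# One row per hyperparameter combination.
rows = []
for est, depth in product((5, 10, 20), (2, 4, 6)):
    model = RandomForestClassifier(
        random_state=12345, n_estimators=est, max_depth=depth
    )
    model.fit(X_train, y_train)
    rows.append({
        "n_estimators": est,
        "max_depth": depth,
        "train_accuracy": model.score(X_train, y_train),
        "valid_accuracy": model.score(X_valid, y_valid),
    })

# Best validation accuracy first.
summary = pd.DataFrame(rows).sort_values("valid_accuracy", ascending=False)
print(summary.to_string(index=False))
```

A large gap between `train_accuracy` and `valid_accuracy` in a row is a hint that the combination overfits, even if its validation score looks competitive.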

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
It's nice that you followed the recommendations to use the validation set to select the best model and to set random_state for reproducibility.
</div>

## Checking the quality of the model using the test set.

I will check the quality of the final model with the use of the test set.

In [14]:
final_model = DecisionTreeClassifier(random_state=12345, max_depth=3)
final_model.fit(features_train, target_train)

In [15]:
predictions = final_model.predict(features_test)

In [16]:
def error_count(answers, predictions):
    count = 0
    for i in range(len(answers)):
        if answers[i] != predictions[i]:
            count += 1
    return count
target_test = target_test.reset_index(drop=True)
print('Errors:', error_count(target_test, predictions))

Errors: 167


target_test was giving issues with the function and would not work without resetting the index. This was most likely because its indices did not align with the positions in the predictions array, so I reset the index of target_test to align with predictions.
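The index issue described above can be reproduced on a tiny example. After a shuffled split, a pandas Series keeps its original row labels, so `answers[i]` looks up the label `i` rather than the i-th position. The sketch below uses hypothetical values and compares the arrays position by position, which is one way to avoid resetting the index.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of target_test: labels 17, 3, 42, 8 survive the split.
answers = pd.Series([0, 1, 0, 1], index=[17, 3, 42, 8])
predictions = np.array([0, 1, 1, 1])

# answers[0] would search for the *label* 0, which does not exist here,
# so a loop over range(len(answers)) fails unless the index is reset.
# Converting to a NumPy array compares values by position instead.
errors = int((answers.to_numpy() != predictions).sum())
accuracy = float((answers.to_numpy() == predictions).mean())

print("Errors:", errors)
print("Accuracy:", accuracy)
```

The vectorized comparison gives the same counts as an explicit loop and is what `accuracy_score` from sklearn computes for this kind of input.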

In [17]:
def accuracy(answers, predictions):
    new = len(answers) - error_count(answers, predictions)
    new = new/len(answers)
    return new

print('Accuracy:', accuracy(target_test, predictions))

Accuracy: 0.7922885572139303


The accuracy is about 8/10. This model isn't perfect, but it definitely does better than guessing the outcome (5/10).

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
You got a good model that has more than 75% accuracy on the test set!
</div>

## Additional task: sanity check the model.

Checking how far off we are by using the mean squared error function in sklearn and taking the square root of the result.

In [18]:
result = mean_squared_error(target_test, predictions)
print(result)

0.20771144278606965


In [19]:
rmse = result**0.5
print(round(rmse, 4))

0.4558


The RMSE, the square root of the MSE of 0.2077, is roughly 0.456.
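For a 0/1 target, every squared error is either 0 or 1, so the MSE is simply the share of wrong predictions, that is, 1 minus the accuracy. In this notebook, 1 - 0.7923 = 0.2077, which matches the MSE printed above, and the RMSE is the square root of that value. The sketch below checks the identity on hypothetical labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical true labels and predictions: 2 mistakes out of 10.
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0, 0, 0])

mse = mean_squared_error(y_true, y_pred)   # share of wrong predictions
acc = accuracy_score(y_true, y_pred)       # share of correct predictions
rmse = mse ** 0.5                          # square root, not the square

print("MSE:", mse)
print("1 - accuracy:", 1 - acc)
print("RMSE:", rmse)
```

Because the MSE carries the same information as accuracy here, it does not work as an independent sanity check for a classifier.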

<a id='warning3'></a>


<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

1. RMSE is not a metric for classification models. There are three possible approaches to sanity checking the model. Since this is a bonus task, complete one of them in case you want the bonus!

* Compare with a constant prediction (always 1 or always 0).
* Compare with a random prediction.
* Plot a confusion matrix to check that accuracy is acceptable within each group. For the sake of example, suppose 70% of all rows are correctly classified Smart users and only 5% are correctly classified Ultra users, leading to 75% accuracy overall. In reality you would have 100% accuracy in the Smart group but only 5%/30% ≈ 17% accuracy in the Ultra group. That would mean the model is biased, which you should correct.
</div>
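The three sanity checks listed above could be sketched as follows, using sklearn's `DummyClassifier` for the constant and random baselines and `confusion_matrix` for the per-class view. The data is a hypothetical stand-in generated with `make_classification`; in the project the fitted `final_model` and the existing train and test sets would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data with a 70/30 class mix.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

model = DecisionTreeClassifier(random_state=12345, max_depth=3).fit(X_train, y_train)
# Baseline 1: always predict the most frequent class seen in training.
constant = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Baseline 2: predict each class uniformly at random.
random_guess = DummyClassifier(strategy="uniform", random_state=12345).fit(X_train, y_train)

scores = {}
for name, clf in [("decision tree", model), ("constant", constant), ("random", random_guess)]:
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, round(scores[name], 3))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Share of each true class that was predicted correctly (per-class recall).
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

If the model's accuracy is close to the constant baseline, or if one entry of `per_class_recall` is far below the other, overall accuracy is hiding a weakness on the minority class.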

## Conclusion

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

I have successfully developed a model that assigns the appropriate plan based on existing user behavior. Conclusions are as follows:

- There are no missing values or duplicate rows in the dataset, and all data types are correct.

- I split the data as follows: 50% in the training set, 25% in the validation set, and 25% in the test set. It is standard to use the largest portion of the data to train the model, so it can be as accurate as possible.

- All three models reached the 0.75 threshold on the validation set, and the best one was the decision tree with a max_depth of 3, with a score of 0.787. This model was used for the project. This makes sense, since the is_ultra column is like a boolean value, being only 1 or 0.

- The model's accuracy on the test set is about 8/10. This model isn't perfect, but it definitely does better than guessing the outcome (5/10).

- The MSE on the test set is 0.2077, which gives an RMSE of roughly 0.456.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
Good conclusion, concise and to the point.
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>
I enjoyed your approach, the preprocessing, and the clean code! The whole project is amazing; correct the tuning section with more hyperparameters and we are set. Keep going!
</div>