<div style="border:solid green 2px; padding: 20px">
    
<b>Hello, James!</b> We're glad to see you in code-reviewer territory. You've done a great job on the project, but let's get to know each other and make it even better! We have our own atmosphere here and a few rules:


1. My name is Alexander Matveevsky. I work as a code reviewer, and my main goal is not to point out your mistakes, but to share my experience and help you become a data analyst.
2. We speak informally, on a first-name basis.
3. If you want to write or ask a question, don't be shy. Just choose your color for your comment.  
4. This is a training project, so you don't have to be afraid of making a mistake.  
5. You have an unlimited number of attempts to pass the project.  
6. Let's Go!


---
I'll be color-coding my comments. Please don't delete them:

<div class="alert alert-block alert-danger">✍
    

__Reviewer's comment №1__

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>
    
---

<div class="alert alert-block alert-warning">📝
    

__Reviewer's comment №1__


Remarks. Some recommendations.
</div>

---

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

Success. Everything is done successfully.
</div>
    
---
    
I suggest that we work on the project in dialogue: if you change something in the project or respond to my comments, write about it. It will be easier for me to track changes if you highlight your comments:   
    
<div class="alert alert-info"> <b>Student comments:</b> Student answer.</div>
    
All this will help to make the recheck of your project faster. If you have any questions about my comments, let me know, we'll figure it out together :)   
    
---

# Content

    Introduction
    Data Initiation
        . Examine Data
        . Data Description
    Machine Learning
        . Data Segmentation
        . Training
            . Algorithm DecisionTreeClassifier
            . Algorithm RandomForestClassifier
            . Algorithm LogisticRegression
        . Model Quality
        . Sanity Check
    General Conclusion

# Introduction
The mobile carrier Megaline is dissatisfied that many of its customers still use legacy plans. It wants to develop a model that analyzes customer behavior and recommends one of Megaline's new plans: Smart or Ultra. We have access to behavior data for subscribers who have already switched to the new plans.

For this classification task, we will create a model that selects the correct plan.

We will develop a model with the highest possible accuracy, with a minimum acceptable accuracy of 75%. We will verify the accuracy on the test dataset.


# Objective
The objective is to obtain a model capable of predicting the type of plan, Ultra or Smart, that users need.

# Data Initiation
We import the libraries:


<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__
An excellent practice is to describe the goal and main steps in your own words (a skill that will help a lot on a final project). It would be good to add the progress and purpose of the study.
</div>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__
    
Great, the libraries are loaded    
</div>

We load the dataset:

In [2]:
users = pd.read_csv(r'/datasets/users_behavior.csv')

# Examine Data

In [3]:
users.head()

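|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |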


In [4]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
users.duplicated().sum()

0

The dataset contains 3214 records and 5 columns. The data types are correct, and there are no missing values or duplicate records.
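The checks above can be bundled into one small helper. This is only a minimal sketch, not part of the original notebook: the function name `quality_report` is hypothetical, and the toy frame simply reuses the five rows printed by `users.head()`. It also lists the float columns that hold only whole numbers, since count-like columns such as calls and messages are stored as float64.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values, duplicate rows, and whole-valued float columns."""
    float_cols = df.select_dtypes(include="float64").columns
    return {
        "rows": len(df),
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        # Float columns whose values are all whole numbers (candidates for int).
        "whole_valued_floats": [c for c in float_cols if (df[c] % 1 == 0).all()],
    }


# Toy data: the five rows shown by users.head() above.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

report = quality_report(sample)
print(report)
```

On the full dataset the same call would be `quality_report(users)`. Whether to cast the whole-valued columns to integers is a design choice; the models used here work with either type.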

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__
    
Duplicate checking is the basis of data preprocessing
</div>

# Data Description
Each observation in the dataset contains information on the monthly behavior of a user. The information provided is as follows:

    . calls: number of calls
    . minutes: total call duration in minutes
    . messages: number of text messages
    . mb_used: Internet traffic used in MB
    . is_ultra: plan for the current month (Ultra - 1, Smart - 0)
The is_ultra column will be our dependent/target variable, and our independent variables will be the other columns: calls, minutes, messages, and mb_used.

# Machine Learning
# Data Segmentation
Before anything else, we need to segment our users dataset into 3 parts with proportions of 3:1:1 as follows:

    . 60% of the data for training
    . 20% of the data for validation
    . 20% of the data for testing

In [6]:
features = users.drop('is_ultra', axis=1)
target = users['is_ultra']
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.40, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, target_valid, test_size=0.50, random_state=12345)

print('dataset', users.shape)
print('features_train:', features_train.shape)
print('target_train:', target_train.shape)

print('features_valid:', features_valid.shape)
print('target_valid:', target_valid.shape)

print('features_test:', features_test.shape)
print('target_test:', target_test.shape)

dataset (3214, 5)
features_train: (1928, 4)
target_train: (1928,)
features_valid: (643, 4)
target_valid: (643,)
features_test: (643, 4)
target_test: (643,)
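The two-stage split can also be wrapped in a reusable function. The sketch below is an illustration under assumptions: the name `split_60_20_20` and the optional `stratify` flag are hypothetical additions, and the data is a synthetic stand-in with the same number of rows as the dataset (3214), not the Megaline data itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, seed=12345, stratify=False):
    """Two-stage split: 60% train, then the remaining 40% halved into valid/test."""
    strat = target if stratify else None
    f_train, f_rest, t_train, t_rest = train_test_split(
        features, target, test_size=0.40, random_state=seed, stratify=strat)
    strat_rest = t_rest if stratify else None
    f_valid, f_test, t_valid, t_test = train_test_split(
        f_rest, t_rest, test_size=0.50, random_state=seed, stratify=strat_rest)
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Synthetic stand-in with 3214 rows (an assumption made for this example).
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(
    rng.normal(size=(3214, 4)),
    columns=["calls", "minutes", "messages", "mb_used"])
toy_target = pd.Series(rng.integers(0, 2, size=3214), name="is_ultra")

train, valid, test = split_60_20_20(toy_features, toy_target)
sizes = [len(part[0]) for part in (train, valid, test)]
print(sizes)
```

Passing `stratify=True` keeps the class proportions of the target roughly equal across the three parts, which can matter when one plan is much more common than the other.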



## Training
We will create models using 3 training algorithms for classification and hyperparameters that will allow us to learn from the data and predict new observations.

We will use:

    . Decision tree (DecisionTreeClassifier): we will vary max_depth to try several tree depths.
    . Random forest (RandomForestClassifier): we will vary n_estimators (the number of trees) and max_depth.
    . Logistic regression (LogisticRegression): we will use solver='liblinear' and random_state=12345.

### Algorithm DecisionTreeClassifier
We are going to train a model with the decision tree algorithm. The hyperparameters to use are:

    . random_state=12345
    . max_depth= 1 to 5

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

1. It is good that random_state is fixed here. This ensures reproducibility when splitting the sample into training / validation / test subsets, so the subsamples will be identical in all subsequent runs of the code.
    
2. Fraction of train/valid/test sizes 3:1:1 is good.


</div>

In [7]:
best_est = 0
best_score = 0

for depth in range(1, 6):
    print('depth', depth)

    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)

    model.fit(features_train, target_train)

    prediction_train = model.predict(features_train)
    score_train = accuracy_score(target_train, prediction_train)
    print('Train data accuracy ', score_train)

    prediction_valid = model.predict(features_valid)
    score = accuracy_score(target_valid, prediction_valid)
    print('Valid data accuracy ', score)
    print()

    if score > best_score:
        best_score = score
        best_est = depth


print('RESULT: Best model with DecisionTreeClassifier on the validation dataset with depth = {} and accuracy {}'.format(best_est, best_score)) 

depth 1
Train data accuracy  0.7577800829875518
Valid data accuracy  0.7542768273716952

depth 2
Train data accuracy  0.7878630705394191
Valid data accuracy  0.7822706065318819

depth 3
Train data accuracy  0.8075726141078838
Valid data accuracy  0.7853810264385692

depth 4
Train data accuracy  0.8106846473029046
Valid data accuracy  0.7791601866251944

depth 5
Train data accuracy  0.8200207468879668
Valid data accuracy  0.7791601866251944

RESULT: Best model with DecisionTreeClassifier on the validation dataset with depth = 3 and accuracy 0.7853810264385692


We trained DecisionTreeClassifier models while varying the tree depth from 1 to 5. The best model has a depth of 3, with a validation accuracy of about 78.5%. Comparing the two scores at each depth also shows that the deeper the tree, the more it overfits: training accuracy keeps rising (from 75.8% to 82.0%) while validation accuracy stalls at around 78%, so the gap between them widens.
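To make the overfitting trend concrete, the gap between training and validation accuracy can be computed for each depth. The sketch below only reuses the numbers printed in the output above; the variable names are illustrative.

```python
# (train accuracy, validation accuracy) per depth, copied from the output above.
scores = {
    1: (0.7577800829875518, 0.7542768273716952),
    2: (0.7878630705394191, 0.7822706065318819),
    3: (0.8075726141078838, 0.7853810264385692),
    4: (0.8106846473029046, 0.7791601866251944),
    5: (0.8200207468879668, 0.7791601866251944),
}

# A growing positive gap means the tree fits the training data better
# than it generalizes to the validation data.
gaps = {depth: train - valid for depth, (train, valid) in scores.items()}
best_depth = max(scores, key=lambda d: scores[d][1])

for depth, gap in gaps.items():
    print(f"depth {depth}: train - valid = {gap:.4f}")
print("best depth by validation accuracy:", best_depth)
```

The gap grows from well under one percentage point at depth 1 to about four percentage points at depth 5, while the validation score peaks at depth 3.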

### Algorithm RandomForestClassifier
We will use the random forest algorithm, with the hyperparameters:

    . random_state=12345
    . n_estimators= 10 to 20
    . max_depth= 1 to 3

In [8]:
best_score = 0
best_tree = 0
best_depth = 0

for tree in range(10, 21):
    for depth in range(1, 4):
        
        print('----------', tree,'tree -', 'depth', depth,'----------')

        model = RandomForestClassifier(
            random_state=12345, n_estimators=tree, max_depth=depth)

        model.fit(features_train, target_train)

        score_train = model.score(features_train, target_train)
        print('Train data accuracy', score_train)

        score = model.score(features_valid, target_valid)
        print('Valid data accuracy', score)
        print()

        if score > best_score:
            best_score = score
            best_tree = tree
            best_depth = depth

print('RESULT: Best model with RandomForestClassifier on the validation dataset with trees:{}-depth:{} and accuracy {}'.format(best_tree, best_depth, best_score))

---------- 10 tree - depth 1 ----------
Train data accuracy 0.7442946058091287
Valid data accuracy 0.7558320373250389

---------- 10 tree - depth 2 ----------
Train data accuracy 0.7785269709543569
Valid data accuracy 0.7776049766718507

---------- 10 tree - depth 3 ----------
Train data accuracy 0.8101659751037344
Valid data accuracy 0.7853810264385692

---------- 11 tree - depth 1 ----------
Train data accuracy 0.7448132780082988
Valid data accuracy 0.7542768273716952

---------- 11 tree - depth 2 ----------
Train data accuracy 0.7883817427385892
Valid data accuracy 0.7853810264385692

---------- 11 tree - depth 3 ----------
Train data accuracy 0.8101659751037344
Valid data accuracy 0.7838258164852255

---------- 12 tree - depth 1 ----------
Train data accuracy 0.7442946058091287
Valid data accuracy 0.7527216174183515

---------- 12 tree - depth 2 ----------
Train data accuracy 0.7883817427385892
Valid data accuracy 0.7838258164852255

---------- 12 tree - depth 3 ----------
Train da

We tested several RandomForestClassifier models and obtained the best one, with a validation accuracy of 79%, using 13 trees with a depth of 2. Accuracy on the training set was also about 79%, so the training and validation scores are close. This suggests that overfitting is under control, unlike in the deeper decision trees above.
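The nested loops above can also be written as a single pass over the hyperparameter grid that stores every result, which makes it easier to compare runs afterwards. This is a sketch under assumptions: it uses synthetic data from `make_classification` instead of the Megaline dataset, so the scores it prints say nothing about the real project.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = []
# Same grid as above: 10 to 20 trees (outer), depth 1 to 3 (inner).
for n_trees, depth in product(range(10, 21), range(1, 4)):
    model = RandomForestClassifier(
        random_state=12345, n_estimators=n_trees, max_depth=depth)
    model.fit(X_train, y_train)
    results.append({
        "n_estimators": n_trees,
        "max_depth": depth,
        "train": model.score(X_train, y_train),
        "valid": model.score(X_valid, y_valid),
    })

# max() returns the first maximal entry, like the strict `>` comparison above.
best = max(results, key=lambda r: r["valid"])
print(best)
```

Keeping all results in a list also makes it possible to load them into a DataFrame and inspect ties, since several combinations can share the same validation accuracy.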

### Algorithm LogisticRegression
The last learning algorithm we will try is logistic regression. The hyperparameters to use are:

    . random_state=12345
    . solver= liblinear

In [9]:
model = LogisticRegression(random_state=12345, solver='liblinear')

model.fit(features_train, target_train)

score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

print("Accuracy of the Logistic Regression Model in the training dataset", score_train)
print("Accuracy of the Logistic Regression Model in the validation dataset:", score_valid)

Accuracy of the Logistic Regression Model in the training dataset 0.7505186721991701
Accuracy of the Logistic Regression Model in the validation dataset: 0.7589424572317263


With logistic regression, the model reached a validation accuracy of 76%, which is lower than the previous models. Training accuracy (75%) does not exceed validation accuracy, so there is no sign of overfitting.

Based on the models previously reviewed, we determined that the best model has the following characteristics:

    . Learning algorithm: RandomForestClassifier
    . Hyperparameters to use: 13 trees and depth of 2.

## Model Quality
We will verify the quality of our best model now using our test set.

In [11]:
best_model = RandomForestClassifier(random_state=12345, n_estimators=13, max_depth=2)

best_model.fit(features_train, target_train)

predictions_test = best_model.predict(features_test)

score_test = accuracy_score(target_test, predictions_test)
print('Accuracy of the best model RandomForestClassifier on the test dataset', score_test)

Accuracy of the best model RandomForestClassifier on the test dataset 0.7744945567651633


We have used our model (RandomForestClassifier with 13 trees and depth of 2) and obtained predictions from the test dataset, achieving a 77% accuracy, which is 2 percentage points lower than what was obtained in our validation dataset.
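To judge whether a drop of about 2 percentage points is meaningful, it helps to estimate how much the test accuracy could vary from sampling alone. The sketch below is a rough estimate that assumes the normal approximation for a proportion; it uses only the test set size and the accuracy printed above.

```python
import math

n_test = 643                     # size of the test set (from the split above)
accuracy = 0.7744945567651633    # test accuracy printed above

# Standard error of a proportion under the normal approximation (an assumption).
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
margin = 1.96 * std_err          # half-width of an approximate 95% interval

print(f"standard error: {std_err:.4f}")
print(f"approx. 95% interval: {accuracy - margin:.3f} to {accuracy + margin:.3f}")
```

With 643 test observations the margin is roughly 3 percentage points, so a difference of about 2 points between validation and test accuracy is compatible with sampling noise.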

## Sanity Check
To demonstrate that our model performs better than chance, we will conduct a sanity check. This test will look for classification issues in our model, so we will compare our predictions with predictions based on randomness to demonstrate that our model is better.

With that said, we will simulate responses randomly to assess their accuracy. Then we will create a series of the same size as our test target set and fill it with 0s and 1s randomly.

In [12]:
np.random.seed(12345)

random_test =  pd.Series(np.random.choice([0, 1], size=len(target_test)))

print(random_test.value_counts(dropna=False))
print()
print(random_test)

1    328
0    315
dtype: int64

0      0
1      1
2      1
3      1
4      0
      ..
638    0
639    1
640    1
641    0
642    1
Length: 643, dtype: int64


As we can see, random_test contains responses of 0s and 1s in a random manner. To complete our sanity check, we will measure the accuracy of these random predictions. This will help us to determine if our model's predictions are indeed better than random chance, thereby validating the effectiveness of the model.

In [13]:
print('Accuracy of the random responses:', accuracy_score(target_test, random_test))
print('Accuracy of the model RandomForestClassifier', score_test)

Accuracy of the random responses: 0.4821150855365474
Accuracy of the model RandomForestClassifier 0.7744945567651633


We measured the accuracy of the random responses and obtained 48%, clearly demonstrating that our prediction model is better as it has an accuracy of 77%. This means our model has passed the sanity check as it predicts better than random guesses.
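Uniformly random guesses are expected to score about 50% on any binary target, so a stricter reference point is a baseline that always predicts the most frequent class. The following sketch shows one way to compute it with scikit-learn's `DummyClassifier`. The labels are hypothetical (70 zeros and 30 ones) and do not reflect the real class proportions, which are not reported in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 70 zeros and 30 ones (NOT the real class proportions).
toy_target = np.array([0] * 70 + [1] * 30)
toy_features = np.zeros((len(toy_target), 1))  # the dummy model ignores features

# Always predicts the class seen most often during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features, toy_target)
baseline_accuracy = accuracy_score(toy_target, baseline.predict(toy_features))

print("Accuracy of the majority-class baseline:", baseline_accuracy)
```

In the project, the baseline would be fitted on `features_train` and `target_train` and scored on the test set. The model passes this stricter check only if its accuracy exceeds the share of the most frequent plan.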

# General Conclusion
We conducted research on classification algorithms to create predictive models by varying their hyperparameters. From this, we obtained:

1. With DecisionTreeClassifier, we varied the depth of the tree from 1 to 5. The best model has a depth of 3 and a validation accuracy of about 78.5%. However, deeper trees overfitted the training set.
2. With RandomForestClassifier, we obtained the best model with a 79% accuracy using 13 trees with a depth of 2. Training and validation accuracy were close, which indicates little overfitting.
3. With LogisticRegression, our model achieved a 76% accuracy, which is lower than the previous models. Training accuracy does not exceed validation accuracy, so there is no sign of overfitting.

Therefore, we select the RandomForestClassifier model, a forest of 13 trees with a depth of 2. On the test set it reached an accuracy of 77%, exceeding the 75% threshold set at the start of this analysis and clearly beating the random baseline of the sanity check.

We have obtained a model capable of predicting the type of plan (Ultra or Smart) that users of the Megaline company need.

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

Here's the great thing: we picked the best hyperparameters for all our models (in this case, maximizing the accuracy_score metric). Here we also identified the MOST optimal model. On validation, it turned out to be the "random forest" model.

After the hyperparameters are selected on the validation set, we test the models on the test data. Based on the results of testing on the test set (sorry for the tautology), we choose a model that we can pass to production.
</div>

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__


Otherwise it's great😊. Your project is begging for github =)   
    
Congratulations on the successful completion of the project 😊👍
And I wish you success in new works 😊
</div>