## Review (2)

Thank you for the fast and correct update. Everything is great now. I also answered your question at the very end of the project.

---

## Review

Hi Julia. As always I've added all my comments to new cells with different coloring.

<div class="alert alert-success" role="alert">
  If you did something great, I'll use green for my comment.
</div>

<div class="alert alert-warning" role="alert">
If I want to give you advice or think that something can be improved, then I'll use yellow. This is an optional recommendation.
</div>

<div class="alert alert-danger" role="alert">
  If a topic requires some extra work before I can accept the project, the color will be red.
</div>

You did the project correctly, but you chose unsuitable models for it. Two of your three models are regression models, while our task is classification. That is why they showed such poor results. Could you please replace them with classification models?

---

Hi, Julia. This is Soslan. You sent an empty notebook, or something went wrong during submission. Can you please resubmit your project?

---

Hi. I renamed my project and then for some reason it did not submit that renamed one and submitted an empty one. Sorry. Now, I know not to rename my projects.

# Introduction to Machine Learning: Project

Mobile carrier Megaline has found out that many of their subscribers use
legacy plans. They want to develop a model that would analyze subscribers'
behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already
switched to the new plans (from the project for the Statistical Data Analysis
course). For this classification task, you need to develop a model that will pick
the right plan. Since you’ve already performed the data preprocessing step,
you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the
threshold for accuracy is 0.75. Check the accuracy using the test dataset.

#### Data description
Every observation in the dataset contains monthly behavior information about
one user. 

The information given is as follows:

calls — number of calls,

minutes — total call duration in minutes,

messages — number of text messages,

mb_used — Internet traffic used in MB,

is_ultra — plan for the current month (Ultra - 1, Smart - 0)

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


##  1) Open and look through the data file.

In [25]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [26]:
df.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


### No missing values and data types are okay. This is a classification task and the target is is_ultra.

## 2) Split the source data into a training set, a validation set, and a test set.

In [28]:
df_train, df_valid = train_test_split(df, test_size=0.40, random_state=12345)

In [29]:
df_valid, df_test = train_test_split(df_valid, test_size=0.50, random_state=12345)

In [30]:
print('training set:', (len(df_train)/len(df))*100, '%')

training set: 59.98755444928439 %


In [31]:
print('validation set:', (len(df_valid)/len(df))*100, '%')

validation set: 20.00622277535781 %


In [32]:
print('test set:', (len(df_test)/len(df))*100, '%')

test set: 20.00622277535781 %


<div class="alert alert-success" role="alert">
Good start. Correct split.</div>


### 60% of the data is the training set, 20% is the validation set, and 20% is the test set. The small deviation from these exact percentages is expected: 3214 rows cannot be divided into exactly 60/20/20 parts.
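
The row counts behind these percentages can be checked by hand. The sketch below assumes that the held-out part of each split is rounded up to a whole row, which is consistent with the percentages printed above. The variable names are illustrative and not part of the project code.

```python
import math

n_total = 3214  # rows reported by df.info()

# First split: 40% held out (assumed to be rounded up to a whole row).
n_holdout = math.ceil(0.40 * n_total)
n_train = n_total - n_holdout

# Second split: half of the held-out rows become the test set.
n_test = math.ceil(0.50 * n_holdout)
n_valid = n_holdout - n_test

for name, n in [("training", n_train), ("validation", n_valid), ("test", n_test)]:
    print(f"{name} set: {n} rows, {100 * n / n_total:.2f} %")
```

Since the three parts must be whole numbers of rows, the shares can only approximate 60/20/20.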

In [33]:
df_train_features = df_train.drop(['is_ultra'], axis=1)
df_train_target = df_train['is_ultra']
df_valid_features = df_valid.drop(['is_ultra'], axis=1)
df_valid_target = df_valid['is_ultra']
df_test_features = df_test.drop(['is_ultra'], axis=1)
df_test_target = df_test['is_ultra']

##  3) Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

<div class="alert alert-danger" role="alert">
<s>As this is a classification task (our targets are classes 0 and 1), it is better to use DecisionTreeClassifier here, not DecisionTreeRegressor.</s>
</div>

<div class="alert alert-success" role="alert">
Fixed</div>

In [34]:
for depth in range(1,21):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(df_train_features, df_train_target)

    print(depth, 'depth', model.score(df_valid_features, df_valid_target), 'accuracy')

1 depth 0.7542768273716952 accuracy
2 depth 0.7822706065318819 accuracy
3 depth 0.7853810264385692 accuracy
4 depth 0.7791601866251944 accuracy
5 depth 0.7791601866251944 accuracy
6 depth 0.7838258164852255 accuracy
7 depth 0.7822706065318819 accuracy
8 depth 0.7791601866251944 accuracy
9 depth 0.7822706065318819 accuracy
10 depth 0.7744945567651633 accuracy
11 depth 0.7620528771384136 accuracy
12 depth 0.7620528771384136 accuracy
13 depth 0.7558320373250389 accuracy
14 depth 0.7589424572317263 accuracy
15 depth 0.7465007776049767 accuracy
16 depth 0.7340590979782271 accuracy
17 depth 0.7356143079315708 accuracy
18 depth 0.7309486780715396 accuracy
19 depth 0.7278382581648523 accuracy
20 depth 0.7216174183514774 accuracy


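The Decision Tree with a depth of 3 has the best validation accuracy at 78.5%. Beyond a depth of about 10, validation accuracy generally declines, which suggests overfitting.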
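
Reading the maximum off a long printed list by eye is error-prone, so one option is to let Python pick it. A minimal sketch, with the validation accuracies copied by hand from the output above (the dictionary name `valid_accuracy` is illustrative):

```python
# Validation accuracy per max_depth, copied from the printed output.
valid_accuracy = {
    1: 0.7542768273716952, 2: 0.7822706065318819,
    3: 0.7853810264385692, 4: 0.7791601866251944,
    5: 0.7791601866251944, 6: 0.7838258164852255,
    7: 0.7822706065318819, 8: 0.7791601866251944,
    9: 0.7822706065318819, 10: 0.7744945567651633,
    11: 0.7620528771384136, 12: 0.7620528771384136,
    13: 0.7558320373250389, 14: 0.7589424572317263,
    15: 0.7465007776049767, 16: 0.7340590979782271,
    17: 0.7356143079315708, 18: 0.7309486780715396,
    19: 0.7278382581648523, 20: 0.7216174183514774,
}

# Pick the depth whose validation accuracy is highest.
best_depth = max(valid_accuracy, key=valid_accuracy.get)
print(best_depth, 'depth', valid_accuracy[best_depth], 'accuracy')
```

In the notebook itself, the same idea amounts to storing each `model.score(...)` result in a dictionary inside the loop instead of only printing it.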

<div class="alert alert-danger" role="alert">
<s>Same here: RandomForestClassifier will work much better.</s></div>

<div class="alert alert-success" role="alert">
Fixed</div>

In [36]:
for trees in range(100,1001,100):

    model = RandomForestClassifier(random_state=12345, n_estimators=trees)
    model.fit(df_train_features, df_train_target)

    print(trees, 'trees:', model.score(df_valid_features, df_valid_target), 'accuracy')

100 trees: 0.7853810264385692 accuracy
200 trees: 0.7869362363919129 accuracy
300 trees: 0.7869362363919129 accuracy
400 trees: 0.7853810264385692 accuracy
500 trees: 0.7853810264385692 accuracy
600 trees: 0.7838258164852255 accuracy
700 trees: 0.7822706065318819 accuracy
800 trees: 0.7838258164852255 accuracy
900 trees: 0.7838258164852255 accuracy
1000 trees: 0.7853810264385692 accuracy


The Random Forest with 200 trees has the best validation accuracy at 78.7% (300 trees gives the same score).

In [37]:
model = LogisticRegression(random_state=12345)
model.fit(df_train_features, df_train_target)

model.score(df_valid_features, df_valid_target)



0.7589424572317263

<div class="alert alert-success" role="alert">
LogisticRegression is the only model I know of with the word Regression in its name that is used for classification.</div>

### The Random Forest with 200 trees has the highest accuracy, so I will use this one.

## 4) Check the quality of the model using the test set.

In [38]:
model = RandomForestClassifier(random_state=12345, n_estimators=200)
model.fit(df_train_features, df_train_target)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

In [39]:
model.score(df_test_features, df_test_target)

0.7869362363919129

### This is pretty good accuracy for the test set. The model predicts the correct plan about 78.7% of the time, which is above the 0.75 threshold.

## 5) Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it’s okay if it doesn’t work out. We'll take a closer look at it later. 

In [40]:
model.score(df_test_features, df_test_target)

0.7869362363919129

In [41]:
df['is_ultra'].unique()

array([0, 1])

In [42]:
import numpy as np
choices = [0,1]
random_predictions = np.random.choice(choices, size=len(df_test_target))
accuracy = accuracy_score(df_test_target, random_predictions)

In [43]:
accuracy

0.5241057542768274
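
Because `np.random.choice` is not seeded here, the random baseline changes on every run. A sketch of a reproducible version follows. The label array is made up for illustration (roughly 68% zeros, similar to the target), and using `default_rng` with seed 12345 is an assumption, not what the notebook did.

```python
import numpy as np

# Seeded generator so the baseline is the same on every run.
rng = np.random.default_rng(12345)

# Hypothetical labels: 684 zeros and 316 ones.
y_true = np.array([0] * 684 + [1] * 316)

# Uniform random guess between class 0 and class 1 (upper bound is exclusive).
random_predictions = rng.integers(0, 2, size=len(y_true))

# Share of guesses that match the true labels.
random_accuracy = np.mean(random_predictions == y_true)
print(random_accuracy)
```

A uniform random guess is right about half the time regardless of class balance, so the result should land near 0.5.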

<div class="alert alert-success" role="alert">
Correct, but for a sanity check you can also use a model that always predicts the larger class, 0. Its accuracy will be about 0.68.</div>


In [53]:
# what does 'model which predicts constantly bigger class' mean?

<div class="alert alert-warning" role="alert">
As we have more 0s than 1s in the target, we can predict only 0s every time and still get a fairly high accuracy. That is why accuracy isn't a suitable metric for imbalanced classes. Yes, it is a dummy model :) Here is some code that explains it.</div>

In [57]:
## reviewer's code

zero_predictions = np.zeros(len(df_test_target))
accuracy = accuracy_score(df_test_target, zero_predictions)
accuracy

0.6842923794712286
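
scikit-learn also ships a ready-made constant baseline, `DummyClassifier`. The sketch below uses a small made-up dataset rather than the project data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: 7 samples of class 0 and 3 of class 1.
# The dummy model ignores the features, so zeros are enough.
X = np.zeros((10, 1))
y = np.array([0] * 7 + [1] * 3)

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

The accuracy of such a baseline equals the share of the larger class, which is the number a real model has to beat.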

### My model (78.7% accuracy) performs better than both random guessing (about 52%) and always predicting 0 (about 68%), so it passes the sanity check.