**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job! The project is accepted. Good luck on the next sprint!

# PROJECT DESCIPTION

Megaline, a mobile carrier, found that most of their customers are using the legacy plans and wants to recommend one of their newer plans (Smart or Ultra) to their customers. They want to us to develop a model based on customer behavior to pick the apprioprate plan with the highest possible accuracy. The threshold for accuracy is 0.75. This is a binary classification task; Decision Tree, Random Forest, and Logistic Regression models will be tested to determine the highest accuracy. 

## Importing Data

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

In [2]:
df=pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [6]:
df.shape

(3214, 5)

The data does not need any cleanup. There are no missing values, datatype is correct, and names are named correctly. I will leave the data as it is. 

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected!

</div>

## Splitting Data 

In [7]:
#the source data will be spit into 3: training, validation, and test dataset
df_train , df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(df_valid_test, test_size = 0.5, random_state=12345)
print(f'Train dataset:{len(df_train)/len(df):.0%}')
print(f'Valid dataset:{len(df_valid)/len(df):.0%}')
print(f'Test dataset:{len(df_test)/len(df):.0%}')

Train dataset:60%
Valid dataset:20%
Test dataset:20%


Because there is no existing test dataste, the source is splitted into 3 datatsets with the a 3:1:1 ratio. Train, valid, and test data is 60%,20% and 20% of the source dataset, respectively. 

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train, validation and test sets. The proportions are reasonable.

</div>

In [8]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)


## Testing Models

# Decision Tree

In [24]:
for depth in range(1,11): 
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    print('max_depth=',depth,':', end='')
    print(score)

max_depth= 1 :0.7542768273716952
max_depth= 2 :0.7822706065318819
max_depth= 3 :0.7853810264385692
max_depth= 4 :0.7791601866251944
max_depth= 5 :0.7791601866251944
max_depth= 6 :0.7838258164852255
max_depth= 7 :0.7822706065318819
max_depth= 8 :0.7791601866251944
max_depth= 9 :0.7822706065318819
max_depth= 10 :0.7744945567651633


With the decision tree model, the max_depth = 3 has the highest accuracy of 78.5%. As the max_depth increases, the accuracy fluctuates up and down with no general pattern.  

# Random Forest

In [36]:
for est in range(1,30,5): 
    model = RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    print("n_estimators = {}: {}".format(est, score)) 

n_estimators = 1: 0.7107309486780715
n_estimators = 6: 0.7807153965785381
n_estimators = 11: 0.7838258164852255
n_estimators = 16: 0.7869362363919129
n_estimators = 21: 0.7931570762052877
n_estimators = 26: 0.7853810264385692


With the Random Forest model, the highest accuracy is 79.3% at n_estimators = 21. On average, the accuracy increases as the n_estimator increases. 

# Logistic Regression 

In [19]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_train, target_train)
print('Accuracy on training set:', score_train)
print('Accuracy on valid set:', score_valid)

Accuracy on training set: 0.7505186721991701
Accuracy on valid set: 0.7505186721991701


The Logistic Regression accuracy for training and valid sets is 75%, which bareley mets the accuracy threshold. 

CONCLUSION: The Random Forest has the highest accuracy at the n_estimator = 21, follow by Decision Tree with 78.5% at max_depth=3, and Logistic Regression at the lowest with 75%. Random Forest is expected to have the highest accuracy because it uses an ensemble of trees instead of one tree. With Decision Tree, it is likely to undergo underfitting and overfitting, reducing accuracy of prediction. I'll use the Random Tree model to test my test dataset. 

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tried a couple of different models and did some hyperparameter tuning using the validation set

</div>

## Test Dataset 

In [40]:
model = RandomForestClassifier(random_state = 12345, n_estimators=21)
model.fit(features_train, target_train)
print('Acurracy of train dataset:',model.score(features_train, target_train))
print('Acurracy of valid dataset:',model.score(features_valid, target_valid))
print('Acurracy of test dataset:',model.score(features_test, target_test))

Acurracy of train dataset: 0.9922199170124482
Acurracy of valid dataset: 0.7931570762052877
Acurracy of test dataset: 0.776049766718507


CONCLUSION: Using Random Forest model, the test data has an accuracy of 77.6% which is above the accuracy threshold of 75%. 

<div class="alert alert-success">
<b>Reviewer's comment</b>

The final model was evaluated on the test set for an unbiased estimate of its generalization performancce

</div>