# Megaline ML Model Report

**The objective of this analysis is to develop a model that recommends the correct phone plan (smart or ultra) based on existing customer behavior data. The accuracy threshold for this model is 0.75, measured on a test dataset. The analysis proceeds through the following steps:**

Step 1: Open and look through the data file (/datasets/users_behavior.csv)

Step 2: Split the source data into a training set, a validation set, and a test set

Step 3: Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study

Step 4: Check the quality of the model using the test set

Step 5: Form a conclusion

## Open and look through the data file 

In [1]:
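import pandas as pd

# Read inside the try block so that a missing or malformed file is actually caught
try:
    user_data = pd.read_csv('/datasets/users_behavior.csv')
    # info() prints its summary directly and returns None, so call it on its own
    user_data.info()
    display(user_data.head(10))
except (FileNotFoundError, pd.errors.ParserError):
    print('cannot read csv')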

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


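| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.90 | 1 |
| 7 | 15.0 | 132.40 | 6.0 | 21911.60 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |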


### Conclusion: Opening the data

1) The data is readable and non-corrupt

2) There are 3214 rows and 5 columns: 4 quantitative features and is_ultra, the categorical target

3) All data types are acceptable

4) The data is now ready to split 
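
The checks behind these conclusions can be sketched on the ten rows printed above. This is only an illustration, not the notebook's own code: the `sample` frame is a hypothetical stand-in rebuilt by hand from the `head(10)` output, so the class share it reports describes those ten rows and not the full dataset. Reading `is_ultra` as 1 for ultra and 0 for smart is also an assumption based on the column name.

```python
import pandas as pd

# Hypothetical stand-in for user_data: the ten rows shown in the head(10) output
sample = pd.DataFrame({
    'calls':    [40.0, 85.0, 77.0, 106.0, 66.0, 58.0, 57.0, 15.0, 7.0, 90.0],
    'minutes':  [311.90, 516.75, 467.66, 745.53, 418.74,
                 344.56, 431.64, 132.40, 43.39, 665.41],
    'messages': [83.0, 56.0, 86.0, 81.0, 1.0, 21.0, 20.0, 6.0, 3.0, 38.0],
    'mb_used':  [19915.42, 22696.96, 21060.45, 8437.39, 14502.75,
                 15823.37, 3738.90, 21911.60, 2538.67, 17358.61],
    'is_ultra': [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Total number of missing values across all columns
missing_total = int(sample.isna().sum().sum())

# The target should contain only two labels (assumed: 0 = smart, 1 = ultra)
target_values = sorted(sample['is_ultra'].unique().tolist())

# Share of rows labelled 1 in this ten-row sample
ultra_share = sample['is_ultra'].mean()

print(missing_total, target_values, ultra_share)
```

Running the same three checks on the full `user_data` frame would also show how balanced the two plans are, which is useful context for the 0.75 accuracy threshold.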

## Split the source data into a training set, a validation set, and a test set

In [2]:
from sklearn.model_selection import train_test_split

# Only a source dataset exists, so a 3:1:1 (60:20:20) split is needed
train_size = 0.6

features = user_data.drop('is_ultra', axis=1)
target = user_data['is_ultra']

# The 'rem' datasets are the remaining 40% of the data after allocating 60% to the train dataset
features_train, features_rem, target_train, target_rem = train_test_split(
    features, target, train_size=train_size, random_state=12345)

# Split the 'rem' dataset (40%) in half so that each half is 20% of the source data
test_size = 0.5
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rem, target_rem, test_size=test_size, random_state=12345)

display(features_train.shape, target_train.shape, 
        features_valid.shape, target_valid.shape,
        features_test.shape, target_test.shape)


(1928, 4)

(1928,)

(643, 4)

(643,)

(643, 4)

(643,)

### Conclusion: Splitting into train, valid, and test sets

1) The data was successfully split into train (60%), valid (20%), and test (20%) datasets

2) There are 1928 rows in the training set and 643 rows each in the validation and test sets
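
The row counts follow from how the two fractions are applied to 3214 rows. The arithmetic can be sketched as below, assuming scikit-learn rounds a fractional `train_size` down and a fractional `test_size` up, which is consistent with the shapes printed above. The variable names are illustrative and do not come from the notebook.

```python
import math

n_rows = 3214  # rows in the source dataset

# First split: train_size=0.6, everything else goes to the 'rem' set
n_train = math.floor(0.6 * n_rows)
n_rem = n_rows - n_train

# Second split: test_size=0.5 of the remainder, the rest becomes validation
n_test = math.ceil(0.5 * n_rem)
n_valid = n_rem - n_test

print(n_train, n_valid, n_test)  # 1928 643 643
```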

## Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

In [13]:
#First, experiment with decision tree hyperparameters
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

results=[]
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    preds = model.predict(features_valid)
    results.append({
                'Accuracy': accuracy_score(target_valid, preds),
                'Max_depth': depth})
    
results = pd.DataFrame(results)
results = results.sort_values(by='Accuracy', ascending=False)
results


| | Accuracy | Max_depth |
|---|---|---|
| 2 | 0.785381 | 3 |
| 1 | 0.782271 | 2 |
| 3 | 0.779160 | 4 |
| 4 | 0.779160 | 5 |
| 0 | 0.754277 | 1 |


### Conclusion: Tuning hyperparameters -- Decision Tree

1) For a decision tree model, a max_depth of 3 produces the highest validation accuracy (0.785)

2) With the most accurate decision tree identified, random forest models are analyzed next

In [20]:
#import random forest 
from sklearn.ensemble import RandomForestClassifier
results=[]

for estimator in range(1, 6):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimator)
    model.fit(features_train, target_train)
    preds = model.predict(features_valid)
    results.append({
                'Accuracy': accuracy_score(target_valid, preds),
                'n_estimator': estimator})
    
results = pd.DataFrame(results)
results = results.sort_values(by='Accuracy', ascending=False)
results


| | Accuracy | n_estimator |
|---|---|---|
| 3 | 0.771384 | 4 |
| 1 | 0.763608 | 2 |
| 4 | 0.749611 | 5 |
| 2 | 0.738725 | 3 |
| 0 | 0.710731 | 1 |


### Conclusion: Tuning hyperparameters -- Random Forest

1) The number of estimators that gave the best validation accuracy was 4, at 0.771

2) The next model to test is logistic regression

In [21]:
#logistic regression
from sklearn.linear_model import LogisticRegression 

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
preds = model.predict(features_valid)

print(accuracy_score(target_valid, preds))

0.7542768273716952


### Tuning hyperparameters general conclusion

1) The most accurate model was the decision tree: max_depth 3 and 2 gave validation accuracies of 0.785 and 0.782

2) The next best model was the random forest, which reached 0.771 at n_estimators 4

3) The least accurate model was logistic regression at 0.754
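
The ranking above can be reproduced from the validation accuracies reported in this section. The snippet below is a small sketch that only restates those printed figures; the dictionary keys and the `threshold` name are illustrative.

```python
# Best validation accuracy reported above for each model family
validation_accuracy = {
    'decision tree (max_depth=3)': 0.785381,
    'random forest (n_estimators=4)': 0.771384,
    'logistic regression (liblinear)': 0.754277,
}
threshold = 0.75  # accuracy threshold stated in the objective

# Sort from most to least accurate and show the margin over the threshold
ranked = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f'{name}: {accuracy:.3f} (margin over threshold: {accuracy - threshold:+.3f})')

best_model = ranked[0][0]
```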

<div class="alert alert-block alert-info">
<b>Improve: </b> You could tune more than one parameter for some models (loop in loop).
</div>

<div class="alert alert-block alert-success">
<b>Success:</b> Great that you've tried several models!
</div>
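
The reviewer's suggestion to tune more than one parameter can be sketched as a nested loop over `n_estimators` and `max_depth`. This is a minimal sketch on synthetic data, not a result for the Megaline dataset: `make_classification` stands in for the real features, and the parameter ranges are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 4 features and a binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.75, random_state=12345)

results = []
# Outer loop over forest size, inner loop over tree depth
for n_estimators in (10, 30, 50):
    for max_depth in range(2, 7):
        model = RandomForestClassifier(random_state=12345,
                                       n_estimators=n_estimators,
                                       max_depth=max_depth)
        model.fit(X_train, y_train)
        preds = model.predict(X_valid)
        results.append({'n_estimators': n_estimators,
                        'max_depth': max_depth,
                        'accuracy': accuracy_score(y_valid, preds)})

# Combination with the highest validation accuracy
best = max(results, key=lambda row: row['accuracy'])
print(best)
```

In the notebook itself, `features_train`, `target_train`, `features_valid`, and `target_valid` would replace the synthetic arrays, and the test set would stay untouched until the best combination is chosen.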

## Check the quality of the model using the test set

In [24]:
#using max_depth 3
model = DecisionTreeClassifier(random_state=12345, max_depth=3)
model.fit(features_train, target_train)
preds = model.predict(features_test)

print(accuracy_score(target_test, preds))


0.7791601866251944


### Conclusion: quality check with test set

1) The accuracy on the test set (0.779) is very close to the validation result for the same model (0.785)

2) The quality of the model is acceptable: the test accuracy exceeds the 0.75 threshold
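
One way to put a test accuracy in context is to compare it with a constant model that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the 70/30 class split is an assumption chosen for illustration and is not taken from the Megaline data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 70% class 0 and 30% class 1 in both sets
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 14 + [1] * 6)

# DummyClassifier ignores the features, so zero-filled arrays are enough here
X_train = np.zeros((len(y_train), 4))
X_test = np.zeros((len(y_test), 4))

# Always predict the most frequent training class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print(baseline_accuracy)  # 0.7
```

On the real data, `features_train`, `target_train`, `features_test`, and `target_test` would take the place of the synthetic arrays. The decision tree adds value only to the extent that its accuracy exceeds this baseline.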

## General Conclusion

1) The best model for recommending plans is a decision tree with a max depth of 3

2) This model scores 0.779 on the test set, surpassing the accuracy threshold of 0.75

3) Megaline can use this model to recommend phone plans to new users