# Intro to Machine Learning Project
***
In this project we develop a model for the mobile carrier Megaline, which wants to analyze subscriber behavior and recommend one of its newer plans, Smart or Ultra, to subscribers on legacy plans. The `users_behavior.csv` dataset contains data on subscribers who have already switched to one of the new plans.

To find the best model for Megaline, we compare several candidates on a validation set. Because there are only two plans to choose from, this is a binary classification task, and we test three classification models.

**Classification Models To Be Tested:**
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression

While the accuracy threshold for the final test will be 0.75, the goal is to develop a model with the highest possible accuracy.

First, importing libraries:

In [1]:
# Importing libraries and programs
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Importing the `users_behavior.csv` dataset:

In [2]:
# Load data into a DataFrame then display general info
df = pd.read_csv('users_behavior.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


There are no missing values, since preliminary EDA was already performed. The dataset is still checked for duplicate rows, which could otherwise end up in both the training and evaluation splits and inflate the measured accuracy.

In [3]:
df.duplicated().sum()

0

Splitting the dataset into training, validation, and test sets at a ratio of 3:1:1, respectively.
Features and targets are then selected for each of the three sets.

In [4]:
# Splitting data into training, validation and test datasets
df_train, df_test_valid = train_test_split(df, test_size=0.4, random_state=12345)
df_test, df_valid = train_test_split(df_test_valid, test_size=0.5, random_state=12345)

# Further splitting data into features and target
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

# Verifying 3:1:1 ratio
print(features_train.shape,
      target_train.shape,
      features_valid.shape,
      target_valid.shape,
      features_test.shape,
      target_test.shape,
      sep='\n')
print('Ratio:', len(features_train)/len(df), len(features_valid)/len(df), len(features_test)/len(df))

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)
Ratio: 0.5998755444928439 0.2000622277535781 0.2000622277535781
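
The split sizes printed above follow from simple arithmetic. Below is a minimal sketch, assuming that `train_test_split` rounds a fractional `test_size` up to a whole number of rows (our understanding of scikit-learn's behavior, not something stated in this notebook). The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size, rounding the second part up."""
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second

n_total = 3214  # rows reported by df.info()
n_train, n_holdout = split_sizes(n_total, 0.4)  # first split: about 60% / 40%
n_test, n_valid = split_sizes(n_holdout, 0.5)   # second split: halves the holdout

print(n_train, n_valid, n_test)
print([round(n / n_total, 4) for n in (n_train, n_valid, n_test)])
```

Because 3214 is not divisible by 5, the ratio is only approximately 3:1:1, which is why the printed fractions are not exactly 0.6, 0.2, and 0.2.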


## Decision Tree Classifier

To choose the best `max_depth` for the `DecisionTreeClassifier` model, we loop over depths 1 to 10 and keep the model with the highest validation accuracy.

In [5]:
# Evaluating DecisionTreeClassifier
best_depth = 0
best_result = 0
best_tree = None
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(features_train, target_train)
    result = tree.score(features_valid, target_valid)
    # Keep the tree with the highest validation accuracy
    if result > best_result:
        best_depth = depth
        best_result = result
        best_tree = tree
print(f'Best Depth: {best_depth} Accuracy Score: {best_result}')

Best Depth: 7 Accuracy Score: 0.7993779160186625


The best tree scored 0.799 on the validation dataset with `max_depth=7`, which clears the 0.75 threshold. Because the depth was chosen using this same validation set, the score is likely optimistic, and the model may score lower on the test dataset.
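
Tracking training accuracy alongside validation accuracy helps show where deeper trees start to overfit. Below is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the Megaline dataset); the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Megaline dataset): 4 features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Record (depth, training accuracy, validation accuracy) for each depth.
depth_scores = []
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    depth_scores.append((depth,
                         model.score(X_train, y_train),
                         model.score(X_valid, y_valid)))

for depth, train_acc, valid_acc in depth_scores:
    print(f'depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}')
```

A widening gap between the two columns as depth grows is the usual sign that extra depth is fitting noise in the training data.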

## Random Forest Classifier
As with the `DecisionTreeClassifier` model, we must determine the best hyperparameters for the `RandomForestClassifier` model. This requires looping over both the `max_depth` and `n_estimators` hyperparameters to find the best combination.

In [6]:
# Evaluating RandomForestClassifier
best_score = 0
best_est = 0
forest_depth = 0
for est in range(10, 51, 10):
    for depth in range(1, 16):
        forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        forest.fit(features_train, target_train)
        score = forest.score(features_valid, target_valid)
        if score > best_score:
            best_score = score
            best_forest = forest
            forest_depth = depth
            best_est = est
print(f'Best Number of Estimators: {best_est} Best Depth: {forest_depth} Accuracy: {best_score}')

Best Number of Estimators: 10 Best Depth: 9 Accuracy: 0.8133748055987559


The best `RandomForestClassifier` model returned an accuracy score of 0.813 on the validation dataset with `n_estimators=10` and `max_depth=9`. This is the highest validation score so far, so it is the leading candidate for the final evaluation. We still evaluate the `LogisticRegression` model for comparison.
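
The nested loops above are a small grid search. Below is a minimal sketch of the same pattern as a reusable helper. Both `grid_search` and `toy_score` are hypothetical names, and `toy_score` is an arbitrary stand-in for "fit a forest and score it on the validation set", not a real result.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) over the Cartesian product of param_grid.

    `evaluate` is a hypothetical callback that accepts the parameters as
    keyword arguments and returns a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        # Strict '>' keeps the first combination found when scores tie.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Same grid as the loop above: 5 values of n_estimators x 15 depths.
forest_grid = {'n_estimators': range(10, 51, 10), 'max_depth': range(1, 16)}
n_combinations = len(list(product(*forest_grid.values())))

# Arbitrary toy scoring function, for illustration only.
def toy_score(n_estimators, max_depth):
    return -abs(max_depth - 4) - abs(n_estimators - 30) / 100

best_params, best_score = grid_search(forest_grid, toy_score)
print(n_combinations, best_params, best_score)
```

Counting the combinations up front (75 here) makes the cost of the search explicit, since each combination means fitting one forest.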

## Logistic Regression

Unlike the other models, we do not tune `LogisticRegression` here. It does have hyperparameters, such as the regularization strength `C`, but we keep the defaults and simply compare its accuracy on the training dataset and the validation dataset.

In [7]:
# Evaluating LogisticRegression
log = LogisticRegression(random_state=12345, solver='liblinear')
log.fit(features_train, target_train)
score_train = log.score(features_train, target_train)
score_valid = log.score(features_valid, target_valid)
print(f'Training Set Accuracy: {score_train}', f'Validation Set Accuracy: {score_valid}', sep='\n')

Training Set Accuracy: 0.7510373443983402
Validation Set Accuracy: 0.7402799377916018


The model barely passes the accuracy threshold on the training dataset (0.751) and falls short of it on the validation dataset (0.740).
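
`LogisticRegression` can be tuned in the same loop style as the other models. Below is a minimal sketch, assuming we standardize the features and try a few values of `C`. It runs on synthetic data (not the Megaline dataset), and the candidate values of `C` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Megaline dataset): 4 features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Validation accuracy for each candidate regularization strength.
c_scores = {}
for c in (0.01, 0.1, 1.0, 10.0):
    # The scaler is fitted on the training data only, inside the pipeline.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c, solver='liblinear', random_state=12345))
    model.fit(X_train, y_train)
    c_scores[c] = model.score(X_valid, y_valid)

best_c = max(c_scores, key=c_scores.get)
print(c_scores, best_c)
```

Scaling is included because the notebook's features sit on very different scales (`mb_used` is in the tens of thousands, `calls` in the tens), which can affect a regularized linear model. Whether it would change the result here is untested.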

## Final Test

The final test will be performed using the `RandomForestClassifier` model that performed best on the validation dataset.

In [8]:
# Testing the model
high_score = best_forest.score(features_test, target_test)
print(f'Accuracy of best model with test dataset: {high_score}')

Accuracy of best model with test dataset: 0.7853810264385692


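The model scored 0.785 on the test dataset, which clears the 0.75 accuracy threshold. This is lower than its validation score of 0.813, which is expected for a model selected on validation accuracy.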
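
With only 643 test rows, the test score carries some sampling noise. Below is a minimal sketch of a rough 95% interval around it, assuming a normal approximation to the binomial. The helper name `accuracy_interval` and the choice of `z=1.96` are assumptions, not part of the original analysis.

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p - half_width, p + half_width

n_total = 643    # rows in the test set
n_correct = 505  # 505 / 643 reproduces the reported accuracy of 0.785
low, high = accuracy_interval(n_correct, n_total)
print(f'accuracy={n_correct / n_total:.3f}  95% interval=({low:.3f}, {high:.3f})')
```

Under these assumptions the lower end of the interval stays above the 0.75 threshold, but only narrowly, so the margin is thinner than the point estimate alone suggests.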

## Sanity Check

To check that the model performs better than random chance, we make random predictions and score them against the `target_test` data.

In [9]:
# Performing sanity check
sanity_predictions = pd.Series(np.random.default_rng(seed=12345).choice([0,1], size=len(target_test)), index=target_test.index)
sanity_score = accuracy_score(target_test, sanity_predictions)
print(f'Random Chance Score: {sanity_score}')

Random Chance Score: 0.5069984447900466
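
A uniform coin flip scores about 0.5 regardless of class balance, so it is an easy bar to clear. A stricter baseline is to always predict the most frequent class. Below is a minimal sketch on toy labels; the helper name `majority_baseline_accuracy` is hypothetical, and the toy labels are not the Megaline target.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels for illustration only (NOT the Megaline target): 7 zeros, 3 ones.
toy_labels = [0] * 7 + [1] * 3
print(majority_baseline_accuracy(toy_labels))
```

On the real data this would be called as `majority_baseline_accuracy(target_test)`. scikit-learn's `DummyClassifier(strategy='most_frequent')` provides the same baseline as an estimator.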


## Conclusion

We compared three machine learning models for Megaline and found that the Random Forest Classifier performed best, with `n_estimators=10` and `max_depth=9`. It scored 0.813 on the validation dataset and 0.785 on the test dataset, above the 0.75 threshold. The Decision Tree Classifier came close on the validation dataset (0.799) but did not beat the forest, and Logistic Regression fell short of the threshold (0.740). The sanity check shows that random predictions score only 0.507 on the test dataset, so the selected model performs well above a coin flip.