# Intro to Machine Learning Project - Choosing the best phone plan

<b><u>Project description:

<i>Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

<i>You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.  

<i>Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

<b><u>Project instructions:

1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

<b>Step 1: Open and look through the data file.

In [16]:
#import necessary packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [17]:
# Open data file and look at the dataset
users = pd.read_csv("/datasets/users_behavior.csv")
users

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [18]:
# Take a more detailed look at the data file, check for Null Values
users.info()
users.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [19]:
# Check for Duplicates
users.duplicated().sum()

0

<span style= "color: blue"> Comments:
    Upon taking a closer look at the data file, we see there are 5 columns, 3214 rows and 0 null values. Checking for duplicates, we see there are no duplicate entries in the dataset. This indicates the data is clean and no further actions are necessary. We can now move on to Step 2.

<b> Step 2: Split the source data into a training set, a validation set, and a test set.

In [20]:
# Set up the Features and Target
features = users.drop(['is_ultra'], axis = 1)
target = users['is_ultra']

In [21]:
# Split into training set, validation set and test set

## First let's split the dataset into 60% training set and a 40% temporary set (we will use this in the next step)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.40, random_state=12345, stratify = target)

## Next let's split the temporary dataset we just created into 2 equal parts for validation and test sets
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.50, random_state=12345, stratify = target_temp)

In [22]:
# let's verify the datasets are created properly by checking their sizes match the proportions we created in the previous step
print(features.shape, features_temp.shape, features_test.shape, features_valid.shape)
print(target.shape, target_temp.shape, target_test.shape, target_valid.shape)

(3214, 4) (1286, 4) (643, 4) (643, 4)
(3214,) (1286,) (643,) (643,)


<span style= "color: blue"> Comments: 
    The dataset was split into 3 sub datasets, namely training set, validation set and test set. The sizes of these sets were determined based on the breakdown as shown in "Chapter 4: Model Improvements, Lesson 2: Validation Datasets", where the Lesson says that general practice is to use 60% for training set, 20% for validation set and 20% for test set. Once the data is split, we can move on to step 3 in choosing the right model.

<b>Step 3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

<span style= "color: blue"> Comments: The first step is to choose between a Classifier and Regressor. Since the target variable is the column, 'is_ultra', which only contains 0 and 1 values representating True and False, we conclude the target is categorical and therefore we should use a Classifier as opposed to a Regressor. Next we need to choose a model that will yield the highest accuracy (as per the prompt) but since we don't know which model that is based on just the dataset, let's create 3 different models, than tune the hyperparameters and calculate the accuracy for each model. Finally, we can compare the results from each model and choose the model with the highest accuracy.

In [23]:
# Decision Tree Model - Building the Model, Tuning Hyperparameters & Testing Accuracy on Validation Set

dt_best_model = None
dt_best_accuracy = 0

for depth in range(1, 11): 
    dt_model = DecisionTreeClassifier(random_state = 1234, max_depth = depth)
    dt_model.fit(features_train, target_train) 
    dt_predictions_valid = dt_model.predict(features_valid)
    dt_accuracy = accuracy_score(target_valid, dt_predictions_valid)
    print(f"At max_depth = {depth}, Accuracy = {dt_accuracy}")
    if dt_accuracy > dt_best_accuracy:
        dt_best_accuracy = dt_accuracy
        dt_best_model = dt_model
        
print()
print("Best Validation Set Accuracy:", dt_best_accuracy)
    

At max_depth = 1, Accuracy = 0.7402799377916018
At max_depth = 2, Accuracy = 0.7729393468118196
At max_depth = 3, Accuracy = 0.7776049766718507
At max_depth = 4, Accuracy = 0.7542768273716952
At max_depth = 5, Accuracy = 0.7869362363919129
At max_depth = 6, Accuracy = 0.776049766718507
At max_depth = 7, Accuracy = 0.7931570762052877
At max_depth = 8, Accuracy = 0.7962674961119751
At max_depth = 9, Accuracy = 0.7916018662519441
At max_depth = 10, Accuracy = 0.7744945567651633

Best Validation Set Accuracy: 0.7962674961119751


<b>Step 4. Check the quality of the model using the test set.

In [24]:
# Decision Tree Model - Test the Model on Test Set

threshold = 0.75
dt_predictions_test = dt_model.predict(features_test)
dt_accuracy_test = accuracy_score(target_test, dt_predictions_test)
print(f'Decision Tree Model - Test Set Accuracy: {dt_accuracy_test}')
if dt_accuracy_test > threshold:
    print("Accuracy Greater Than Threshold, Model Passes")
else:
    print("Model Fails")

Decision Tree Model - Test Set Accuracy: 0.7931570762052877
Accuracy Greater Than Threshold, Model Passes


<span style='color:blue'> Comments: Now let's repeat Steps 3 and 4 for 2 more models.

In [25]:
# Random Forest Model - Building the Model, Tuning Hyperparameters & Testing Accuracy on Validation Set

rf_best_model = None
rf_best_accuracy = 0

for est in [10, 50, 100, 150, 200]:
    rf_model = RandomForestClassifier(n_estimators = est, random_state=1234)
    rf_model.fit(features_train, target_train)
    rf_predictions_valid = rf_model.predict(features_valid)
    rf_accuracy = accuracy_score(target_valid, rf_predictions_valid)
    print(f"n_estimators = {est}, Accuracy = {rf_accuracy}")
    if rf_accuracy > rf_best_accuracy:
        rf_best_accuracy = rf_accuracy
        rf_best_model = rf_model
        
print()
print("Best Validation Set Accuracy:", rf_best_accuracy)

n_estimators = 10, Accuracy = 0.7931570762052877
n_estimators = 50, Accuracy = 0.7978227060653188
n_estimators = 100, Accuracy = 0.7993779160186625
n_estimators = 150, Accuracy = 0.8009331259720062
n_estimators = 200, Accuracy = 0.8009331259720062

Best Validation Set Accuracy: 0.8009331259720062


In [26]:
# Random Forest Model - Test the Model on Test Set
threshold = 0.75
rf_predictions_test = rf_model.predict(features_test)
rf_accuracy_test = accuracy_score(target_test, rf_predictions_test)
print(f'Random Forest Model - Test Set Accuracy: {rf_accuracy_test}')
if rf_accuracy_test > threshold:
    print("Accuracy Greater Than Threshold, Model Passes")
else:
    print("Model Fails")

Random Forest Model - Test Set Accuracy: 0.8055987558320373
Accuracy Greater Than Threshold, Model Passes


In [27]:
# Logistic Regression Model - Building the Model, Tuning Hyperparameters & Testing Accuracy on Validation Set

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_train_scaled = scaler.fit_transform(features_train)
features_valid_scaled = scaler.transform(features_valid)

lr_best_model = None
lr_best_accuracy = 0

for c in [0.01, 0.1, 1, 10, 100]:
    lr_model = LogisticRegression(C=c, penalty = 'l2', max_iter=100, solver = 'liblinear')
    lr_model.fit(features_train_scaled, target_train)
    lr_predictions_valid = lr_model.predict(features_valid_scaled)
    lr_accuracy = accuracy_score(target_valid, lr_predictions_valid)
    print(f"C = {c}, Accuracy = {lr_accuracy}")
    if lr_accuracy > lr_best_accuracy:
        lr_best_accuracy = lr_accuracy
        lr_best_model = lr_model
print()
print("Best Validation Set Accuracy:", lr_best_accuracy)

C = 0.01, Accuracy = 0.7433903576982893
C = 0.1, Accuracy = 0.7418351477449455
C = 1, Accuracy = 0.7387247278382582
C = 10, Accuracy = 0.7387247278382582
C = 100, Accuracy = 0.7387247278382582

Best Validation Set Accuracy: 0.7433903576982893


In [28]:
# Logistic Regression Model - Testing the Model on Test Set

features_test_scaled = scaler.transform(features_test)

threshold = 0.75
lr_predictions_test = lr_model.predict(features_test_scaled)
lr_accuracy_test = accuracy_score(target_test, lr_predictions_test)
print("Logistic Regression Model - Test Set Accuracy", lr_accuracy_test)
if lr_accuracy_test > threshold:
    print("Accuracy Greater Than Threshold, Model Passes")
else:
    print("Model Fails")

Logistic Regression Model - Test Set Accuracy 0.7465007776049767
Model Fails


In [29]:
# Select the model with the best accuracy

print(f"Decision Tree Accuracy: {dt_accuracy_test}\nRandom Forest Accuracy: {rf_accuracy_test} \nLogistic Regression Accuracy: {lr_accuracy_test}")

Decision Tree Accuracy: 0.7931570762052877
Random Forest Accuracy: 0.8055987558320373 
Logistic Regression Accuracy: 0.7465007776049767


<span style= "color: blue"> Comments: We completed Steps 3 and 4 by building 3 different model, tuning their hyperparameters to yield the highest accuracy on the validation set and then finally testing the quality of the model by checking for accuracy on the test set. Based on the results from each model, Random Forest generates the highest accuracy followed by Decision Tree and then Logistic Regression. As such, the best model for this scenario is the Random Forest Classifier Model. Before completing the project, we should perform a Sanity Check and we can do this in many ways. I chose to build a dummy model and compare my results to that of the dummy model. Seeing that Random Forest model yields a much higher accuracy than the dummy model, we can safely check sanity off the list.

<b> Step 5. Sanity check the model.

In [30]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Create a Dummy Classifier that predicts the most frequent class
baseline_model = DummyClassifier(strategy='most_frequent')

# Fit and predict
baseline_model.fit(features_train, target_train)
baseline_predictions = baseline_model.predict(features_test)

# Calculate accuracy
baseline_accuracy = accuracy_score(target_test, baseline_predictions)
print(f"Baseline Accuracy: {baseline_accuracy}")

Baseline Accuracy: 0.6936236391912908


<span style = 'color:Blue'>CONCLUSION: This was a great project as an Intro to Machine Learning because it required building a model based on data we already worked on and are familiar with from previous projects. This gave a good insight onto how model building can help further analyze a dataset and predict the right parameters as desired. In this project, the goal was to simply build the right model based on the highest accuracy. In order to accomplish this task, we needed to first distinugish between Classifier and Regressor models since both are used in different cases. Given that our target is a categorical column, we need to use a classifier model but we still don't know which one to choose from Decision Tree, Random Forest and Logistic Regression (These are the 3 we learned about in this Sprint). As such, we needed to build 3 different models and test the accuracy for each and choose the one with the highest accuracy as that will have the greatest chance of picking the optimum results. In building each model it was important to keep in mind the hyperparameters as tuning them could help improve the accuracy of the model. After picking the model based on the best yielding hyperparameters, each model was then test on the Test Set against the threshold accuracy of 0.75. Based on the results, we see that only Decision Tree and Random Forest yield a higher accuracy than the threshold, Logistic Regression does not regardless of the tuning of hyperparameters. Furthermore, comparing the test accuracy from each model, we see that Random Forest Classifier Model yields the highest accuracy and is the winner in this contest of models. We finally perform a Sanity Check by creating a DummyClassifier model and testing the accuracy on the Test Set the same as we did for the 3 models before and the results show that our built models do have a higher accuracy than the Dummy Model. This implies that the models we created is indeed learning rather than guessing. In Conclusion, this project showed how to develop different models for the same dataset and choosing the best one based on Accuracy tested on a Test set and also comparing it to a Dummy Model.