# Machine Learning Project <a id='intro'></a>

## Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans. For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

## Project Instructions

1) Open and look through the data file. Path to the file: `datasets/users_behavior.csv`
2) Split the source data into a training set, a validation set, and a test set.
3) Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4) Check the quality of the model using the test set.
5) Sanity check the model.

## Data Description
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows: 
- **сalls** — number of calls
- **minutes** — total call duration in minutes
- **messages** — number of text messages
- **mb_used** — Internet traffic used in MB
- **is_ultra** — plan for the current month (Ultra - 1, Smart - 0)


## Initialization

In [1]:
# Loading all the libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Load data

In [2]:
# Load 'users_behavior.csv' into a data frame
try:
    # Try to read the CSV file from the local path.
    users_behavior_df = pd.read_csv('/Users/benjaminstephen/Documents/TripleTen/Sprint_7/Machine_Learning_Project/datasets/users_behavior.csv')
except FileNotFoundError:
    try:
        # Try to read the CSV file from the server path
        users_behavior_df = pd.read_csv('/datasets/users_behavior.csv')
        print("CSV file successfully read from the server path.")
    except FileNotFoundError:
        print("CSV file not found. Please check the file paths.")
else:
    print("CSV file successfully read from the local path.")

CSV file successfully read from the local path.


In [3]:
# Print the general/summary information about 'users_behavior_df'
print("USERS BEHAVIOR DATA FRAME FRAME INFO:")
users_behavior_df.info()
print()

print("PERCENTAGE OF NULL VALUES:")
print(users_behavior_df.isnull().sum()/len(users_behavior_df))
print()

print("USERS BEHAVIOR FRAME:")
display(users_behavior_df)
print()

USERS BEHAVIOR DATA FRAME FRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

PERCENTAGE OF NULL VALUES:
calls       0.0
minutes     0.0
messages    0.0
mb_used     0.0
is_ultra    0.0
dtype: float64

USERS BEHAVIOR DATA FRAME:


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0





The 'Users Behaviors' DataFrame consists of 3214 entries with 5 columns: calls, minutes, messages, mb_used, and is_ultra. All columns are fully populated with no missing values, as indicated by the non-null counts matching the total entries. The data types are appropriate for the variables, with four columns as float64 and one as int64. The memory usage is 125.7 KB, which is efficient for this data size. There are no immediate errors in the data, as all columns are of expected types and fully populated, suggesting data integrity is maintained.

## Split Data

In [4]:
# Extract the features
features = users_behavior_df.drop(['is_ultra'], axis=1)

# Extract the target
target = users_behavior_df['is_ultra']

# Split 60% training and 40% temp (validation + test)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, 
    target, 
    test_size=0.4, 
    random_state=12345)

# Split 50% validation and 50% test from the temp set
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, 
    target_temp, 
    test_size=0.5, 
    random_state=12345)

# Print the dimensions of eacg dataset
print('Training Feature Set Size:', features_train.shape)
print('Training Target Set Size:', target_train.shape)
print()
print('Validation Feature Set Size:', features_valid.shape)
print('Validation Target Set Size:', target_valid.shape)
print()
print('Test Feature Set Size:', features_test.shape)
print('Test Target Set Size:', target_test.shape)

Training Feature Set Size: (1928, 4)
Training Target Set Size: (1928,)

Validation Feature Set Size: (643, 4)
Validation Target Set Size: (643,)

Test Feature Set Size: (643, 4)
Test Target Set Size: (643,)


Here we are splitting the source data into training (60%), validation (20%), and test (20%) sets. We split the data in a 3:1:1 ratio because the test data is derived from the source data itself. If the test data were separate, we would only need to split the source data into training and validation sets, making the validation set 25% of the source data to maintain a comparable split ratio.

Now that the data is split properly we will investigate the quality of different classifer models.

## Model Investigation
### Decision Tree

In [5]:
# Create a loop for max_depth from 1 to 21
dt_best_score = 0
best_depth = 0
for depth in range(1, 21):
        # Create a Decision Tree model, specify max_depth=depth
        dt_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    
        # Train the model on training set
        dt_model.fit(features_train, target_train)

        # Calculate accuracy score on validation set
        dt_score = dt_model.score(features_valid, target_valid)
        if dt_score > dt_best_score:
                dt_best_score = dt_score # save best accuracy score on validation set
                best_depth = depth # save the max tree depth corresponding to best accuracy score

print("Accuracy of the best Decisiotn Tree model on the validation set (max_depth = {}): {}".format(best_depth, dt_score))

Accuracy of the best Decisiotn Tree model on the validation set (max_depth = 3): 0.7216174183514774


### Random Forest

In [6]:
# Create a loop for n_estimators from 1 to 101
rf_best_score = 0
best_est = 0
for est in range(1, 101): # choose hyperparameter range
    # Create a Random Forest model, specify n_estimators=est
    rf_model = RandomForestClassifier(random_state=12345, n_estimators=est)
    
    # Train the model training set
    rf_model.fit(features_train, target_train)

    # Calculate accuracy score on validation set
    rf_score = rf_model.score(features_valid, target_valid)
    if rf_score > rf_best_score:
        rf_best_score = rf_score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

print("Accuracy of the best Random Forest model on the validation set (n_estimators = {}): {}".format(best_est, rf_best_score))

Accuracy of the best Random Forest model on the validation set (n_estimators = 23): 0.7947122861586314


### Logistics Regression

In [7]:
# Create a Logistics Regression model, specify solver='liblinear'
lr_model = LogisticRegression(random_state=12345, solver='liblinear')

# Train model on training set
lr_model.fit(features_train, target_train)

# Calculate accuracy score on validation set
lr_score = lr_model.score(features_valid, target_valid)

print("Accuracy of the Logistic Regression model on the validation set:", lr_score)

Accuracy of the Logistic Regression model on the validation set: 0.7589424572317263


Based on the validation set accuracies of the three models tested, the Random Forest model, with an accuracy of 0.7947, outperforms the Decision Tree (0.7138) and Logistic Regression (0.7589) models. Therefore, the Random Forest model is the best choice based on accuracy. 

We will conduct a final test using the Random Forest model, with the hyperparameter 'n_estimators' set to 23, as this configuration yielded the highest accuracy during validation.

## Final Test

In [8]:
# Create a model, specify n_estimators=23 (best_est)
final_model = RandomForestClassifier(random_state=12345, n_estimators=23)

# Train final model on training set
final_model.fit(features_train, target_train)

# Calculate accuracy score on test set
final_score = final_model.score(features_test, target_test)

print("Accuracy of the final Random Forest model on the test set:", final_score)

Accuracy of the final Random Forest model on the test set: 0.7807153965785381


The final Random Forest model, with n_estimators set to 23, achieved a slightly lower accuracy of 0.7807 on the test set compared to 0.7947 on the validation set, indicating good but slightly reduced performance on unseen data. Still, the model performs reasonably well in predicting whether a customer will opt for the Ultra plan or not based on the given features.

Let's perform a quick sanity check to ensure the accuracy of the results.

## Sanity Check

In [9]:
# Add the actual target values and predictions to  the features_test data frame
predictions_test = final_model.predict(features_test)
features_test['is_ultra'] = target_test
features_test['predictions'] = predictions_test
display(features_test)

# Manually calculate number of correct predictions and overall accuracy of the final model
count = 0
for i in range(len(features_test)):
    if features_test.iloc[i]['predictions'] == features_test.iloc[i]['is_ultra']:
        count += 1

print("Error Count:", len(features_test) - count)
print("Accuracy:", count/len(features_test))


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra,predictions
160,61.0,495.11,8.0,10891.23,0,0
2498,80.0,555.04,28.0,28083.58,0,1
1748,87.0,697.23,0.0,8335.70,0,1
1816,41.0,275.80,9.0,10032.39,0,1
1077,60.0,428.49,20.0,29389.52,1,0
...,...,...,...,...,...,...
2401,55.0,446.06,79.0,26526.28,0,1
2928,102.0,742.65,58.0,16089.24,1,0
1985,52.0,349.94,42.0,12150.72,0,0
357,39.0,221.18,59.0,17865.23,0,0


Error Count: 141
Accuracy: 0.7807153965785381


The accuracy of the final Random Forest model on the test set is approximately 78.07%, indicating that it correctly classified 78.07% of the instances in the test data.

## Conclusion

In conclusion, the analysis of subscriber behavior data from Megaline suggests that a Random Forest model with 23 estimators is the most effective in predicting whether customers will opt for the Ultra plan. This model demonstrated a solid accuracy of 78.07% on unseen test data, meeting the project's threshold. Despite a slight drop in performance from validation to test, the model's reliability in recommending appropriate plans remains evident. Moving forward, implementing this model could significantly aid Megaline in transitioning subscribers from legacy plans to newer, more suitable options, ultimately improving customer satisfaction and company revenue.