**Review**

Hello Matthew!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a few problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Project: Predicting Mobile Plan Selection for Megaline

## Introduction

Megaline, a major mobile carrier, is aiming to optimize its customer service by recommending new mobile plans to users who are currently on legacy plans. The goal of this project is to develop a machine learning model that can predict which of Megaline's two modern plans—**Smart** or **Ultra**—a customer is most likely to choose based on their behavior.

The dataset provided includes information on users who have already switched to one of the two plans. The features include the number of calls made, total call duration, number of text messages sent, and internet data usage for each user in a given month. The target variable indicates which plan the user chose: **Smart** (0) or **Ultra** (1).

### Objectives
- Preprocess the data to prepare it for machine learning.
- Split the data into training, validation, and test sets.
- Train and evaluate different machine learning models with the goal of achieving an accuracy of at least **75%**.
- Fine-tune hyperparameters to optimize model performance.
- Evaluate the model's quality using the test dataset.
  
By the end of this project, we aim to develop a model that helps Megaline recommend the most suitable plan for its users, improving customer satisfaction and potentially increasing revenue.

Let's begin by loading and exploring the dataset.


In [1]:
# Step 1: Project Setup & Data Loading

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
df = pd.read_csv('/datasets/users_behavior.csv')

# Display basic info and check for missing values
print(df.info())
print(df.describe())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000  

In [2]:
# Check for fully duplicated rows
duplicates = df[df.duplicated(keep=False)]  # `keep=False` marks all duplicates

# Display duplicated rows
print(f'Number of fully duplicated rows: {duplicates.shape[0]}')
print(duplicates)


Number of fully duplicated rows: 0
Empty DataFrame
Columns: [calls, minutes, messages, mb_used, is_ultra]
Index: []


<div class="alert alert-success">
<b>Reviewer's comment</b>

Good job!

</div>

In [3]:
# Step 2: Split the Data into Training, Validation, and Test Sets

# Define features and target variable
X = df.drop(columns=['is_ultra'])
y = df['is_ultra']

# Split data into training (80%), validation (10%), and test (10%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Print the shapes of the resulting datasets
print(f'Training set: {X_train.shape}, {y_train.shape}')
print(f'Validation set: {X_val.shape}, {y_val.shape}')
print(f'Test set: {X_test.shape}, {y_test.shape}')



Training set: (2571, 4), (2571,)
Validation set: (321, 4), (321,)
Test set: (322, 4), (322,)


<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct

</div>
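
The target is imbalanced (the mean of `is_ultra` is about 0.31), so a plain random split can leave slightly different class shares in each subset. The sketch below shows a stratified 80/10/10 split as one possible refinement. It runs on synthetic stand-in data with the same number of rows, not on the real file, and the `stratify` arguments are an addition that the cell above does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (assumption: same row count, ~31% Ultra)
rng = np.random.default_rng(42)
n_rows = 3214
X = pd.DataFrame(
    rng.random((n_rows, 4)),
    columns=["calls", "minutes", "messages", "mb_used"],
)
y = pd.Series((rng.random(n_rows) < 0.31).astype(int), name="is_ultra")

# First split: 80% train, 20% held out; stratify keeps the Ultra share similar
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: halve the held-out part into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: {len(part)} rows, Ultra share {part.mean():.3f}")
```

The subset sizes stay the same as in the unstratified split; only the class proportions become more uniform across the three sets.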

In [7]:
# Step 3: Train and Compare Models

# Initialize models
log_reg = LogisticRegression(random_state=12345)
tree_clf = DecisionTreeClassifier(max_depth=5, random_state=12345)
rf_clf = RandomForestClassifier(random_state=12345)

# Train models
log_reg.fit(X_train, y_train)
tree_clf.fit(X_train, y_train)

# Hyperparameter tuning for RandomForestClassifier
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Get the best parameters
best_rf_clf = grid_search.best_estimator_

# Make predictions
log_reg_pred = log_reg.predict(X_val)
tree_clf_pred = tree_clf.predict(X_val)
rf_clf_pred = best_rf_clf.predict(X_val)

# Evaluate accuracy
log_reg_accuracy = accuracy_score(y_val, log_reg_pred)
tree_clf_accuracy = accuracy_score(y_val, tree_clf_pred)
rf_clf_accuracy = accuracy_score(y_val, rf_clf_pred)

print(f'Logistic Regression Accuracy: {log_reg_accuracy:.4f}')
print(f'Decision Tree Accuracy: {tree_clf_accuracy:.4f}')
print(f'Random Forest Accuracy (Best Parameters): {rf_clf_accuracy:.4f}')
print(f'Best Parameters for Random Forest: {grid_search.best_params_}')


Fitting 5 folds for each of 27 candidates, totalling 135 fits
Logistic Regression Accuracy: 0.7227
Decision Tree Accuracy: 0.8193
Random Forest Accuracy (Best Parameters): 0.8349
Best Parameters for Random Forest: {'max_depth': 15, 'min_samples_split': 5, 'n_estimators': 150}


<div class="alert alert-danger">
<b>Reviewer's comment</b>

Everything is correct. Well done! But you need to do two more things:
1. Try at least one more model
2. Tune hyperparameters for at least one model

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Everything is correct. Good job!

</div>
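
The "27 candidates, totalling 135 fits" line in the output above follows directly from the grid: three values for each of three hyperparameters, times five folds. A short sketch that reproduces the count with `ParameterGrid` (the grid is copied from the cell above; the variable names `n_candidates`, `n_folds`, and `total_fits` are only illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
n_folds = 5                                    # cv=5 in GridSearchCV
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also trains the winning combination once more on the full training set; that extra fit is not part of the 135.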

###### Conclusions: 
The validation accuracy indicates how well a model is likely to perform on unseen data.

The tuned Random Forest reached a validation accuracy of 0.8349, meaning it predicted the correct plan (Smart or Ultra) for about 83.49% of the users in the validation set. That puts it ahead of the Decision Tree (0.8193) and Logistic Regression (0.7227).
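
To put these scores in perspective, it can help to convert them back into numbers of correctly classified users. The sketch below does that arithmetic for the 321-row validation set; the counts are inferred from the rounded accuracies printed above, not read from the models themselves.

```python
# Reported validation accuracies and validation set size from the cells above
n_val = 321
reported = {
    "Logistic Regression": 0.7227,
    "Decision Tree": 0.8193,
    "Random Forest": 0.8349,
}

# Accuracy = correct / total, so correct is approximately accuracy * total
correct = {name: round(acc * n_val) for name, acc in reported.items()}

for name, count in correct.items():
    print(f"{name}: about {count} of {n_val} correct ({count / n_val:.4f})")

gap = correct["Random Forest"] - correct["Decision Tree"]
print(f"Random Forest vs. Decision Tree: {gap} users")
```

Seen this way, the gap between the Random Forest and the Decision Tree amounts to a handful of validation users, so the ranking of these two models should be read with some caution.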

In [5]:
# Step 4: Evaluate the Model on the Test Set

# Make predictions on the test set
# Note: this evaluates the decision tree (tree_clf), not the tuned random forest (best_rf_clf)
y_test_pred = tree_clf.predict(X_test)

# Evaluate the accuracy of the model on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Test Accuracy: {test_accuracy:.4f}')


Test Accuracy: 0.7857


<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct

</div>
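
The test set is most informative when it is used once, on the model that won the validation comparison. The sketch below shows one way to wire that up, assuming synthetic data from `make_classification` and two illustrative candidates; the names `candidates`, `val_scores`, and `best_model` are hypothetical and do not appear in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 4 features, binary target)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=0
)

candidates = {
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Compare candidates on the validation set only
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Evaluate only the validation winner on the test set
best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))

print(f"Best on validation: {best_name} ({val_scores[best_name]:.4f})")
print(f"Test accuracy of {best_name}: {test_accuracy:.4f}")
```

In this notebook the equivalent step would be to call `best_rf_clf.predict(X_test)`, since the tuned random forest had the highest validation accuracy.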

In [6]:
# Step 5: Sanity Check

# Retrain the decision tree with a different random_state to check that the result is stable
tree_clf_2 = DecisionTreeClassifier(max_depth=5, random_state=10)
tree_clf_2.fit(X_train, y_train)
y_test_pred_2 = tree_clf_2.predict(X_test)
test_accuracy_2 = accuracy_score(y_test, y_test_pred_2)
print(f'Test Accuracy with different random state: {test_accuracy_2:.4f}')


Test Accuracy with different random state: 0.7857


<div class="alert alert-warning">
<b>Reviewer's comment</b>

A sanity check is a slightly different thing. You need to compare the quality of your best model with the quality of the best constant model. Your model should be at least a bit better than the constant model. You can take a constant model from here: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html. It isn't necessary to fix this, because it is an additional task. You will study this topic in the next sprint.

</div>
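
A constant-model baseline of the kind the reviewer describes could look like the sketch below. It assumes class counts of 985 Ultra and 2229 Smart users, inferred from the `is_ultra` mean of 0.306472 over 3214 rows in the `describe()` output; the feature matrix is a placeholder because `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target with class counts inferred from the describe() output (assumption)
y = np.array([1] * 985 + [0] * 2229)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# Constant model: always predict the most frequent class (Smart = 0)
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
baseline_accuracy = accuracy_score(y, dummy.predict(X))

print(f"Constant-model accuracy: {baseline_accuracy:.4f}")
```

On the real data the baseline would be fit on `X_train`, `y_train` and scored on `X_test`, `y_test`, then compared with the test accuracy of the chosen model. Always predicting Smart is right for roughly 69% of users overall, so that is the level a useful model needs to beat.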

## Conclusion

In this project, we developed a machine learning model to predict which of Megaline's two modern mobile plans, **Smart** or **Ultra**, a user is most likely to choose, based on their monthly behavior. After checking the data for missing values and duplicates and splitting it into training, validation, and test sets, we compared three machine learning models.

### Key Findings:
1. **Data Exploration**:
   - The dataset contains four behavioral features (number of calls, total call duration, number of text messages, and internet data usage), which served as the model inputs.
   - We confirmed that the dataset has no missing values and no fully duplicated rows, so no cleaning was needed before modeling.

2. **Model Selection**:
   - We experimented with multiple machine learning models, including Decision Trees, Random Forests, and Logistic Regression.
   - Hyperparameter tuning was performed using GridSearchCV for the Random Forest model. GridSearchCV evaluated 27 combinations of hyperparameters using 5-fold cross-validation, for a total of 135 fits. The search selected the best combination within this grid (`max_depth=15`, `min_samples_split=5`, `n_estimators=150`) by cross-validated accuracy.
   - After tuning, the **Random Forest model** showed the best performance with a validation accuracy of **83.49%**, surpassing the project's target accuracy of **75%**.

3. **Test Performance**:
   - The test-set evaluation in Step 4 used the Decision Tree (`max_depth=5`) and yielded an accuracy of **78.57%**, above the 75% target but below its validation accuracy of 81.93%. The tuned Random Forest was not evaluated on the test set.

### Conclusion:
The Random Forest model, optimized through hyperparameter tuning, achieved the highest validation accuracy, **83.49%**, when predicting a user's choice between the **Smart** and **Ultra** plans. The Decision Tree evaluated on the test set reached 78.57%. Both results exceed the 75% target, which gives Megaline a usable starting point for plan recommendations.

GridSearchCV automated the hyperparameter search for the Random Forest. By comparing 27 combinations with 5-fold cross-validation, it selected the configuration with the best cross-validated accuracy on the training data.

In future iterations, further enhancements could be made by incorporating additional features, such as customer demographics or location, to refine the predictions. Nonetheless, the current model provides a strong foundation for plan recommendation.

This concludes the project, and the results demonstrate the potential for machine learning to enhance decision-making in telecommunications.
