# Plan Recommendation Model (Smart or Ultra)

Megaline, a telecommunications company, aims to modernize how mobile plans are assigned to its customers. A significant portion of its users are still on legacy plans, which limits operational efficiency and profitability. To address this, the company proposes a machine learning model that analyzes each user's monthly behavior (calls, text messages, and mobile data usage) and recommends one of the two new plans: Smart or Ultra.

This project implements a supervised classification approach using historical data from customers who have already migrated to one of the new plans. The main objective is to build a predictive model with at least 75% accuracy that automates plan recommendation and improves customer experience. To achieve this, a structured methodology will be followed, including:

- Initial data exploration: The file users_behavior.csv will be analyzed to understand its structure and quality, ensuring there are no null values or anomalies that could affect the model.
- Data preparation: Independent variables (such as calls, minutes, messages, and data usage) and the target variable (is_ultra) will be identified. Then, data will be split into three sets: training (60%), validation (20%), and testing (20%).
- Model training: Different classification algorithms will be trained and compared, including decision trees, random forests, and logistic regression, tuning their hyperparameters to optimize performance.
- Performance evaluation: The accuracy metric will be used to measure each model's quality on the validation set. The best performing model will be selected for the final test.
- Final validation: The final model will be evaluated on the test set to measure its generalization ability on new data.
- Sanity check: It will be verified that the model performs significantly better than a trivial baseline, such as always choosing the majority class or a model trained on randomly shuffled labels.

With this approach, Megaline seeks a practical and effective solution that contributes to improving business decision-making and optimizing the customer experience through the use of artificial intelligence.

## 1.1 Initialization

In [7]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## 1.2 Load data

In [8]:
# Load the dataset and split it into features and target
df = pd.read_csv('../data/users_behavior.csv')
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

## 1.3 Prepare the Data

In [9]:
# Split into training+validation and test sets (80% train+val, 20% test)
features_temp, features_test, target_temp, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)

# Then split training+validation into training and validation sets (75% train, 25% val of 80%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_temp, target_temp, test_size=0.25, random_state=12345)  # 0.25 x 0.8 = 0.2

In [10]:
print(f"Training set size: {features_train.shape[0]}")
print(f"Validation set size: {features_valid.shape[0]}")
print(f"Test set size: {features_test.shape[0]}")

# Show first rows of the dataset
print("First rows of the dataset:")
print(df.head())

# Show general info about the dataset
print("\nDataset info:")
print(df.info())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Descriptive statistics
print("\nDescriptive statistics:")
print(df.describe())

Training set size: 1928
Validation set size: 643
Test set size: 643
First rows of the dataset:
   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

Missing values per column:
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

Descriptive statistics:
             calls      minute
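
The exploration cell above does not report how the two classes of is_ultra are balanced, which matters later when accuracy is compared against a trivial baseline. The sketch below shows one way to check it. It uses only the five rows printed by df.head() as a stand-in for the full file, so the resulting share is illustrative and says nothing about the real proportion in users_behavior.csv.

```python
import pandas as pd

# The five rows shown in the df.head() output above, used only as a
# stand-in for the full users_behavior.csv file.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Share of each class. The largest share equals the accuracy of a model
# that always predicts the majority class.
class_share = sample["is_ultra"].value_counts(normalize=True)
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline: {majority_baseline:.2f}")
```

On the real data, the same two lines applied to df["is_ultra"] would give the baseline that any useful model has to beat.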

## 2. Split Source Data into Three Sets (Training, Validation, Test)

In [11]:
# Separate features and target variable
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

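# Split 80% train+validation and 20% test
# Note: this split (random_state=42) only illustrates the 60/20/20 proportions;
# the models below use features_train, features_valid and features_test from section 1.3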
X_temp, X_test, y_temp, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Split the 80% into 60% training and 20% validation
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

In [12]:
# Show sizes of subsets
print("Training set size:", X_train.shape[0])
print("Validation set size:", X_valid.shape[0])
print("Test set size:", X_test.shape[0])

Training set size: 1928
Validation set size: 643
Test set size: 643


Analysis: The dataset was successfully split into three subsets:

- Training set: 1,928 records (~60%)
- Validation set: 643 records (~20%)
- Test set: 643 records (~20%)

This division allows training models on a representative sample (60%), tuning hyperparameters and comparing models on a validation set (20%), and finally assessing generalization on an independent test set (20%). Keeping the test set untouched until the end helps detect overfitting to the training and validation data and gives a more reliable estimate of performance on new customers.
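
The two-step split can be sketched on synthetic data to confirm the arithmetic (0.25 of the remaining 80% is 20% of the total). The sketch also passes `stratify`, which the notebook does not use, so that each subset keeps the same class proportions. The random features and the 30% positive share are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 3214 rows (the dataset size reported above).
# The features and the 30% positive share are assumptions, not real data.
rng = np.random.default_rng(12345)
X = rng.normal(size=(3214, 4))
y = (rng.random(3214) < 0.3).astype(int)

# Step 1: hold out 20% of all rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Step 2: 25% of the remaining 80% becomes validation (0.25 x 0.8 = 0.2).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=12345, stratify=y_temp)

sizes = {"train": len(y_train), "valid": len(y_valid), "test": len(y_test)}
print(sizes)
for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(f"{name}: share of total={len(part) / len(y):.3f}, "
          f"positive share={part.mean():.3f}")
```

Stratifying is a design choice rather than a requirement. It mainly helps when one class is much rarer than the other, because it keeps the validation and test accuracy comparable.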

## 3. Investigate the quality of different models by tuning hyperparameters

This step consists of:
- Training three classification models: Decision Tree, Random Forest, and Logistic Regression.
- Adjusting their hyperparameters.
- Evaluating performance on the validation set.
- Choosing the best model.

In [None]:
# Train models with different hyperparameters

# Decision Tree with varying depths
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    print(f"Decision Tree (max_depth={depth}) - Accuracy: {acc:.4f}")

# Random Forest with different numbers of trees
for est in range(10, 101, 10):
    model = RandomForestClassifier(n_estimators=est, random_state=12345)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    print(f"Random Forest (n_estimators={est}) - Accuracy: {acc:.4f}")

# Logistic Regression
model = LogisticRegression(solver='liblinear', random_state=12345)
model.fit(features_train, target_train)
predictions = model.predict(features_valid)
acc = accuracy_score(target_valid, predictions)
print(f"Logistic Regression - Accuracy: {acc:.4f}")

Conclusion: The Random Forest model was the most accurate on the validation set. With 80 trees (n_estimators=80) it achieved the highest accuracy, 0.7994. This model is therefore selected for the final evaluation on the test set.
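
The loops above only print each accuracy, so the best setting has to be read off the output by eye. A minimal sketch of tracking the best hyperparameter programmatically is shown below. It runs on synthetic data from `make_classification`, and the names `results`, `best_est` and `best_acc` are hypothetical helpers, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four behavior features (an assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = {}                       # n_estimators -> validation accuracy
best_est, best_acc = None, 0.0

for est in range(10, 51, 10):
    model = RandomForestClassifier(n_estimators=est, random_state=12345)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_valid, model.predict(X_valid))
    results[est] = acc
    # Keep the first setting that reaches the highest accuracy so far.
    if acc > best_acc:
        best_est, best_acc = est, acc

print(f"Best n_estimators={best_est} with validation accuracy {best_acc:.4f}")
```

Using a strict `>` means ties are resolved in favor of the smaller forest, which is cheaper to train and to serve.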

## 4. Verify model quality using the test set

To verify the quality of the selected model, Random Forest with n_estimators=80, the test set features and targets will be used (features_test and target_test).

In [None]:
# Train the selected model on the training set
final_model = RandomForestClassifier(n_estimators=80, random_state=12345)
final_model.fit(features_train, target_train)

# Predict on the test set
test_predictions = final_model.predict(features_test)

# Calculate accuracy
test_accuracy = accuracy_score(target_test, test_predictions)

print(f"Model accuracy on test set: {test_accuracy:.4f}")

Conclusion: The final model, a Random Forest with 80 trees, achieved an accuracy of 78.69% on the test set, above the required 75% threshold. It therefore meets the project's accuracy target for recommending Smart or Ultra from a customer's past behavior, and it is a reasonable candidate for assisting Megaline's commercial team in plan recommendations.
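
Because the test set contains only 643 records, the measured accuracy carries some sampling uncertainty. A rough way to gauge it is a normal-approximation (Wald) interval for a proportion. This is a swapped-in back-of-the-envelope check, not something the notebook computes, and it assumes the test rows are independent.

```python
import math

accuracy = 0.7869   # test accuracy reported above
n_test = 643        # test set size reported above

# Standard error of a proportion: sqrt(p * (1 - p) / n)
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval using the normal quantile 1.96
margin = 1.96 * std_error
low, high = accuracy - margin, accuracy + margin

print(f"Standard error: {std_error:.4f}")
print(f"Approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the margin is roughly three percentage points, so the lower end of the interval sits only slightly above the 75% target.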

## 5. Sanity check on the model

A sanity check verifies the model is not learning by chance or making meaningless predictions. It helps confirm that there is a real relationship between data and predictions.

Strategy: train with a randomized target. We train the same model (Random Forest) with the target variable shuffled randomly, which breaks any link between features and labels. If accuracy stays close to that of the real model, that is a warning sign. If it drops to roughly the share of the majority class (0.5 only when the two classes are balanced), the original model learned meaningful patterns.

In [None]:
# Shuffle the full target to break any link between features and labels
shuffled_target = target.sample(frac=1, random_state=12345).reset_index(drop=True)

# Train the model on random labels; only the first len(features_train)
# shuffled values are used so the label vector matches the training rows
model_sanity = RandomForestClassifier(n_estimators=80, random_state=12345)
model_sanity.fit(features_train, shuffled_target[:len(features_train)])

# Evaluate on the real validation set
predictions_sanity = model_sanity.predict(features_valid)
accuracy_sanity = accuracy_score(target_valid, predictions_sanity)

print(f"Accuracy of model with shuffled target: {accuracy_sanity:.4f}")

Conclusion: The model trained on shuffled labels reached an accuracy of 0.6703 on the validation set, well below the 0.7994 the real model achieved on the same set (and the 0.7869 it achieved on the test set). This confirms the original model learned real patterns rather than noise. The shuffled-label score is above 0.5, but that does not by itself indicate data leakage: a model trained on random labels tends to predict the more frequent class, so its accuracy approaches that class's share of the data. Comparing 0.6703 against the majority-class share of is_ultra would confirm this.
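
A simpler baseline than the shuffled-label check is a classifier that always predicts the most frequent class, which is the trivial strategy named in the introduction. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data. The random features and the 30% positive share are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic imbalanced data (assumed ~30% positives); not the real dataset.
rng = np.random.default_rng(12345)
X_train = rng.normal(size=(800, 4))
y_train = (rng.random(800) < 0.3).astype(int)
X_valid = rng.normal(size=(200, 4))
y_valid = (rng.random(200) < 0.3).astype(int)

# Always predict the class seen most often during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_valid)
baseline_acc = accuracy_score(y_valid, baseline_pred)

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
print(f"Share of class 0 in validation:   {np.mean(y_valid == 0):.4f}")
```

The two printed numbers coincide, which shows why a label-blind model on imbalanced data scores well above 0.5. In the notebook, the same baseline could be fitted on features_train and target_train and scored on features_valid.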

## 5.1 Check feature importance (feature_importances_)

In [None]:
# Get importance of each feature
importances = final_model.feature_importances_

# Create a DataFrame with feature names
feature_importance_df = pd.DataFrame({
    'feature': features.columns,
    'importance': importances
}).sort_values(by='importance', ascending=False)

# Display most important features
print(feature_importance_df)

Conclusion: The most relevant feature for predicting the plan type is mobile data usage (mb_used), followed by call duration (minutes), and then the number of calls and messages. This indicates that the model relies mainly on mobile data consumption to distinguish the two plans.
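
The feature_importances_ attribute is computed from the training data and can favor features with many distinct values. One possible cross-check, not used in the notebook, is permutation importance measured on held-out data with `sklearn.inspection.permutation_importance`. The sketch below borrows the dataset's column names for readability only. The values are synthetic, so the ranking it prints says nothing about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data; column names are reused from the dataset for readability.
names = ["calls", "minutes", "messages", "mb_used"]
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X = pd.DataFrame(X, columns=names)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

model = RandomForestClassifier(n_estimators=80, random_state=12345)
model.fit(X_train, y_train)

# Shuffle one column at a time on the validation set and record the
# average drop in accuracy over 10 repeats.
result = permutation_importance(model, X_valid, y_valid,
                                n_repeats=10, random_state=12345)

perm_df = pd.DataFrame({
    "feature": names,
    "importance": result.importances_mean,
}).sort_values(by="importance", ascending=False)
print(perm_df)
```

If both methods rank the same feature first on the real data, the conclusion about that feature is on firmer ground.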

## 5.2 Cross-validation with cross_val_score

In [None]:
# Define model
final_model = RandomForestClassifier(n_estimators=80, random_state=12345)

# Calculate cross-validation scores (with no scoring argument, the classifier's
# own score method is used, which is accuracy)
cv_scores = cross_val_score(final_model, features_train, target_train, cv=5)

# Show results
print("Cross-validation scores:", cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")

Conclusion: With 5-fold cross-validation on the training set, the RandomForestClassifier achieved a mean accuracy of 80.34% with a standard deviation of 1.22%. The small spread across folds suggests that performance is stable and does not depend on one particular split, which is consistent with the 79.94% validation accuracy and the 78.69% test accuracy. Together with the feature importance result, in which mobile data usage ranked first, this gives Megaline a useful basis for plan recommendations.
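
When `cv=5` is passed with a classifier, scikit-learn already uses stratified folds, but without shuffling the rows first. If the row order might carry structure, an explicit `StratifiedKFold` with shuffling is one option. The sketch below shows it on synthetic data with an assumed 70/30 class balance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumed 70/30 split); not the real dataset.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3],
                           random_state=12345)

# Five stratified folds, shuffled once with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)
model = RandomForestClassifier(n_estimators=80, random_state=12345)

cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Cross-validation scores:", cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")
```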