# Megaline Plans

## Introduction

The telecommunications company Megaline seeks to optimize its customers' transition to the new "Smart" and "Ultra" plans, as a significant number of users remain on older plans. The objective of this project is to develop a model that, based on the analysis of customers' monthly usage behavior, can predict which of the two new plans would be most suitable for them.

To achieve this, a dataset describing the usage of subscribers who have already adopted these plans is available. Using this information, we will build a classification model with the highest possible accuracy. The minimum performance threshold for this project is 75% accuracy, measured on a held-out test split of the provided dataset. Since the data has already been preprocessed, the work focuses directly on developing and validating the predictive model.

The dataset contains the following columns:

- `calls`: number of calls.
- `minutes`: total call duration in minutes.
- `messages`: number of text messages.
- `mb_used`: Internet traffic used in MB.
- `is_ultra`: plan for the current month (Ultra = 1, Smart = 0).


## Import libraries and inspect data

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('/Users/pauli/Documents/Data/megaline_plans_ML/users_behavior.csv')

df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
df.info()
df.duplicated().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


0

The dataset has 3214 rows and 5 columns. All columns are numeric, there are no missing values, and there are no duplicate rows.

## Develop the most accurate model possible

In [4]:
# Split the dataset into training, validation, and test sets.

features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# First, split into a training set (60%) and a temporary set (40%)
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=15)

# Then split the temporary set in half: validation (20%) and test (20%)

features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=15)

# Display the size of each set

print("Training set:", features_train.shape)
print("Valid set:", features_valid.shape)
print("Test set:", features_test.shape)

Training set: (1928, 4)
Valid set: (643, 4)
Test set: (643, 4)
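
The two-step split works because half of the 40% temporary set is 0.4 × 0.5 = 20% of the data. The following is a minimal sketch of that arithmetic on a synthetic stand-in: the `demo` frame and its random values are assumptions made only for illustration, and only the row count (3214) is taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (values are random).
rng = np.random.default_rng(15)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})
X, y = demo.drop(columns="is_ultra"), demo["is_ultra"]

# Step 1: hold out 40% of the rows.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=15)
# Step 2: split the held-out 40% in half, i.e. 0.4 * 0.5 = 20% of the data each.
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=15)

split_sizes = {"train": len(X_train), "valid": len(X_valid), "test": len(X_test)}
split_shares = {name: round(n / len(demo), 3) for name, n in split_sizes.items()}
print(split_sizes)
print(split_shares)
```

The row counts match the shapes printed above (1928, 643, and 643), which correspond to roughly 60%, 20%, and 20% of the data.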


### Decision Tree

In [5]:
# Decision tree: loop over max_depth to find the depth with the best validation accuracy.

best_depth = None
best_accuracy = 0

for depth in range(1, 51):  # Try depths 1 through 50
    model = DecisionTreeClassifier(max_depth=depth, random_state=15)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions)
    
    if accuracy > best_accuracy:
        best_depth = depth
        best_accuracy = accuracy
        
print(f"Best depth: {best_depth}")
print(f"Best accuracy on validation set: {best_accuracy:.4f}")

# After the loop, `model` is the last tree fitted (max_depth=50), not the best one.
# It is scored here to show how a very deep tree behaves.
full_predictions = model.predict(features)
valid_predictions = model.predict(features_valid)

print('Depth-50 tree, accuracy on the full dataset:', accuracy_score(target, full_predictions))
print('Depth-50 tree, accuracy on the validation set:', accuracy_score(target_valid, valid_predictions))

Best depth: 2
Best accuracy on validation set: 0.7698
Depth-50 tree, accuracy on the full dataset: 0.8746110765401369
Depth-50 tree, accuracy on the validation set: 0.6889580093312597


With `max_depth=2`, the decision tree reaches a validation accuracy of about 0.77, just above the required threshold of 0.75. <br>
The tree with `max_depth=50` scores 0.87 on the full dataset, most of which it was trained on, but only 0.69 on the validation set. This gap shows that deep trees overfit this data and do not generalize to new observations. <br>
Let's test other models to see whether they improve on the shallow tree.
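
A depth search like the one above is easier to reuse if it returns the best fitted tree, so that later scoring cannot accidentally use the last tree from the loop. The following is a minimal sketch under stated assumptions: `search_tree_depth` is a hypothetical helper, and the data comes from `make_classification`, so the printed numbers say nothing about the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def search_tree_depth(X_train, y_train, X_valid, y_valid, depths, random_state=15):
    """Return (best_model, best_depth, best_accuracy) over the given depths."""
    best_model, best_depth, best_accuracy = None, None, -1.0
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=random_state)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_valid, model.predict(X_valid))
        # Keep the fitted model together with its score.
        if accuracy > best_accuracy:
            best_model, best_depth, best_accuracy = model, depth, accuracy
    return best_model, best_depth, best_accuracy


# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=15)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=15)

best_tree, best_depth, best_accuracy = search_tree_depth(
    X_train, y_train, X_valid, y_valid, depths=range(1, 11)
)
print(f"Best depth: {best_depth}, validation accuracy: {best_accuracy:.4f}")
```

Because the helper returns `best_tree`, training and validation accuracy can then be reported for the same model that won the search.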

### Logistic Regression

Next we try logistic regression, a linear model designed for binary classification.

In [6]:
model = LogisticRegression(random_state=15, max_iter=1000)
model.fit(features_train, target_train)

valid_predictions = model.predict(features_valid)
valid_accuracy = accuracy_score(target_valid, valid_predictions)

train_predictions = model.predict(features_train)
train_accuracy = accuracy_score(target_train, train_predictions)

print(f"Accuracy on training set: {train_accuracy:.4f}")
print(f"Accuracy on validation set: {valid_accuracy:.4f}")


Accuracy on training set: 0.7583
Accuracy on validation set: 0.7294

Logistic regression scores about 0.76 on the training set and 0.73 on the validation set. There is little sign of overfitting, but the validation accuracy is below the 0.75 threshold.
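
The features are on very different scales (`mb_used` is in the tens of thousands, `messages` in the tens), which can slow down or destabilize the solver in logistic regression. One option, not used in this notebook, is to standardize the features in a pipeline. The sketch below uses synthetic data with one artificially inflated column, so it only illustrates the mechanics; whether scaling would change the accuracy on the Megaline data is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=15)
X[:, 3] *= 10_000  # inflate one column to mimic a feature measured in large units

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=15)

# The scaler is fitted on the training data only, then applied to the validation data.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=15, max_iter=1000),
)
scaled_logreg.fit(X_train, y_train)

valid_accuracy = scaled_logreg.score(X_valid, y_valid)
print(f"Validation accuracy with scaling: {valid_accuracy:.4f}")
```

Keeping the scaler inside the pipeline means the validation set never influences the scaling statistics.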


### Random Forest

This model tends to be more robust and less prone to overfitting than a simple decision tree.

In [7]:
# Random forest: search over the number of trees and the maximum depth
best_n_estimators = None
best_max_depth = None
best_accuracy = 0
best_train_accuracy = None

for n_estimators in [50, 100, 150]:  # Iterate over different numbers of trees
    for max_depth in range(5, 26, 5):  # Depths 5, 10, 15, 20, 25
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=15)
        model.fit(features_train, target_train)

        train_predictions = model.predict(features_train)
        train_accuracy = accuracy_score(target_train, train_predictions)
        valid_predictions = model.predict(features_valid)
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        
        if valid_accuracy > best_accuracy:
            best_n_estimators = n_estimators
            best_max_depth = max_depth
            best_accuracy = valid_accuracy
            best_train_accuracy = train_accuracy
    
print(f"Best number of estimators (n_estimators): {best_n_estimators}")
print(f"Best maximum depth (max_depth): {best_max_depth}")
print(f"Accuracy on the training set: {best_train_accuracy:.4f}")
print(f"Accuracy on the validation set: {best_accuracy:.4f}")

Best number of estimators (n_estimators): 50
Best maximum depth (max_depth): 10
Accuracy on the training set: 0.8999
Accuracy on the validation set: 0.7854


The training accuracy (0.90) is clearly higher than the validation accuracy (0.79), so the forest fits the training data more closely than unseen data. <br>
Even so, its validation accuracy is the highest of the three models tested. <br>
Because the validation accuracy is above the 0.75 threshold set for the project, the model generalizes well enough to be evaluated on the test set.
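
The nested loops above enumerate every combination of `n_estimators` and `max_depth`. A minimal sketch of the same search using scikit-learn's `ParameterGrid`, which keeps the grid in one place and records every result, could look like this. The data is synthetic (`make_classification`), so the chosen parameters here are unrelated to the ones found for the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Same grid as in the notebook: 3 forest sizes x 5 depths = 15 combinations.
param_grid = ParameterGrid({
    "n_estimators": [50, 100, 150],
    "max_depth": list(range(5, 26, 5)),
})

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=15)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=15)

results = []
for params in param_grid:
    model = RandomForestClassifier(random_state=15, **params)
    model.fit(X_train, y_train)
    results.append({
        **params,
        "train_accuracy": model.score(X_train, y_train),
        "valid_accuracy": model.score(X_valid, y_valid),
    })

# Select the combination with the highest validation accuracy.
best = max(results, key=lambda row: row["valid_accuracy"])
print(best)
```

Keeping all rows in `results` makes it possible to see how close the runner-up combinations are, instead of only the winner.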

### Testing the model on the test set

In the previous analysis, we concluded that the best model for this case is the random forest.

In [8]:
# The final model uses the hyperparameters that gave the best validation accuracy

final_model = RandomForestClassifier(random_state=15, n_estimators=50, max_depth=10)
final_model.fit(features_train, target_train)

# score() returns the mean accuracy on the given data
score_final = final_model.score(features_test, target_test)
print('The accuracy of the model on the test set is:', score_final)

The accuracy of the model on the test set is: 0.7884914463452566


### Sanity check of the selected model

Sanity testing helps confirm that a model is performing reasonably and not simply exploiting accidental patterns or artifacts in the data. <br>
In this case, we train the same model on randomly generated labels and score it against the true validation labels. If the accuracy were well above chance (about 50%), it could indicate a problem with the data or the training process.

In [9]:
# Create random labels
random_labels = np.random.randint(0, 2, size=target_train.shape)

# Train a model with random labels
random_forest = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=15)
random_forest.fit(features_train, random_labels)


random_valid_predictions = random_forest.predict(features_valid)
random_valid_accuracy = accuracy_score(target_valid, random_valid_predictions)


print(f"Model accuracy with random labels: {random_valid_accuracy}")

Model accuracy with random labels: 0.47433903576982894


The accuracy with random labels, about 47%, is close to 50%, which is what we would expect from a binary classifier whose predictions carry no information about the true labels.<br>This shows that the pipeline cannot reach a high score without real labels, so the accuracy of the trained model reflects patterns in the data rather than an artifact of the training or evaluation process.
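
A complementary sanity check, not part of this notebook, is to compare the model against a constant baseline that always predicts the most frequent class. When classes are imbalanced, that baseline can score well above 50%, so it is a stricter reference than random labels. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data with an assumed 70/30 class split; the class proportions of `is_ultra` are not shown in this notebook, so the baseline value for the Megaline data is not stated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (the 70/30 split is an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=15)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=15)

# Always predict the class that is most frequent in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_valid, y_valid)
majority_class = np.bincount(y_train).argmax()
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.4f}")
```

A trained model is only useful if its validation or test accuracy is clearly above this baseline.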

## Conclusions

In this project, we developed and compared machine learning models that recommend a plan (Smart or Ultra) to Megaline subscribers based on their usage behavior.

We trained three different models using the sklearn library: decision tree, logistic regression, and random forest. These models were compared on their accuracy on the validation set.

The best-performing model was the random forest with 50 trees and a maximum depth of 10. It achieved an accuracy of approximately 79% on both the validation set and the test set, above the required threshold of 75%. Its test accuracy is also far above the roughly 47% obtained in the random-label sanity check. For these reasons, we recommend that the company use this model to analyze customer behavior and recommend plans.