# Megaline: Smart or Ultra

The mobile carrier Megaline is dissatisfied that many of its customers still use legacy plans. It wants a model that analyzes customer behavior and recommends one of Megaline's new plans: Smart or Ultra.

You have access to behavioral data from subscribers who have already switched to the new plans (from the Data Statistical Analysis course project). For this classification task, you should create a model that chooses the right plan. Since you have already gone through the data processing step, you can jump straight into creating the model.

Develop a model with the highest possible accuracy. The accuracy threshold for this project is 0.75. Check the accuracy on the test portion of the dataset.

# Data Description

Each observation in the dataset contains monthly behavior information about a user. The provided information is as follows:

- `calls` — number of calls.
- `minutes` — total call duration in minutes.
- `messages` — number of text messages.
- `mb_used` — Internet traffic used in MB.
- `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).


# Initialization

In [14]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Load data

In [15]:
# Load the data into DataFrames
df = pd.read_csv('../datasets/users_behavior.csv')

In [16]:
# Print the general/summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [17]:
# Print a random sample of 5 rows from the DataFrame
df.sample(5)

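|      | calls | minutes | messages | mb_used  | is_ultra |
|------|-------|---------|----------|----------|----------|
| 1330 | 56.0  | 333.23  | 0.0      | 15373.48 | 0        |
| 1749 | 34.0  | 248.65  | 18.0     | 12741.28 | 0        |
| 125  | 6.0   | 36.45   | 7.0      | 4617.1   | 0        |
| 3119 | 89.0  | 627.32  | 88.0     | 16627.19 | 0        |
| 1784 | 67.0  | 356.05  | 48.0     | 19909.97 | 0        |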


# Segment data

In [18]:
# Separate the data into features and target
features = df.drop("is_ultra", axis=1)
target = df["is_ultra"]

In [19]:
# Divide the data into training, validation, and test subsets (3:1:1)
features_train, features_valid_test, target_train, target_valid_test = train_test_split(features, target, test_size=.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_valid_test, target_valid_test, test_size=.5, random_state=12345)
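The two-step split above can be checked for the intended 3:1:1 proportions. The following is a minimal sketch, assuming a synthetic array with the same number of rows (3214) as a stand-in for the real features; the names `X`, `y`, `sizes`, and `shares` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the dataset (assumption: not the real data)
n_rows = 3214
X = np.arange(n_rows * 4).reshape(n_rows, 4)
y = np.arange(n_rows) % 2

# First split: 60% training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345
)
# Second split: the held-out 40% is halved into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345
)

sizes = (len(X_train), len(X_valid), len(X_test))
shares = tuple(round(size / n_rows, 3) for size in sizes)
print("Rows per subset (train, valid, test):", sizes)
print("Share of all rows:", shares)
```

With 3214 rows, the shares come out close to 0.6, 0.2, and 0.2, which matches the 3:1:1 ratio stated in the cell comment.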

# Investigate the quality of different models

In [20]:
# Select a random seed for the experiments
random_state = 12345

## Decision Tree

In [21]:
# Train several decision tree models with different maximum depths and find the best one
best_score = 0
best_depth = 0
for depth in range(1, 50):
    model = DecisionTreeClassifier(random_state=random_state, max_depth=depth)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Depth of the best model:", best_depth)
print("Accuracy of the best model on the validation set:", best_score)

Depth of the best model: 3
Accuracy of the best model on the validation set: 0.7853810264385692
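The depth search above relies on a single validation split, so the winning depth can depend on how the rows happened to be divided. One possible cross-check, not used in this notebook, is to average accuracy over several folds. The sketch below assumes synthetic data from `make_classification` as a stand-in for the Megaline features, and `best_depth_by_cv` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features/target (assumption: not the Megaline data)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0, random_state=12345
)


def best_depth_by_cv(X, y, depths, folds=5, seed=12345):
    """Return (depth, mean accuracy) for the depth with the highest mean CV accuracy."""
    results = {}
    for depth in depths:
        model = DecisionTreeClassifier(random_state=seed, max_depth=depth)
        fold_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[depth] = fold_scores.mean()
    best = max(results, key=results.get)
    return best, results[best]


cv_depth, cv_score = best_depth_by_cv(X, y, range(1, 11))
print("Depth with the best mean CV accuracy:", cv_depth)
print("Mean CV accuracy at that depth:", cv_score)
```

If the depth chosen this way agrees with the one from the single split, that adds some confidence in the choice; if it differs, the single-split result may be sensitive to the particular partition.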


## Random Forest

In [22]:
# Train several random forest models with different numbers of estimators and find the best one
best_score = 0
best_est = 0
for est in range(1, 50):
    model = RandomForestClassifier(random_state=random_state, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("Number of estimators of the best model:", best_est)
print("Accuracy of the best model on the validation set:", best_score)

Number of estimators of the best model: 23
Accuracy of the best model on the validation set: 0.7947122861586314


## Logistic Regression

In [23]:
# Train a logistic regression model and find the accuracy on the validation set
model = LogisticRegression(random_state=random_state, solver='liblinear')
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7510373443983402
Accuracy of the logistic regression model on the validation set: 0.7573872472783826


## Intermediate Conclusion

Of the 3 types of classifiers, compared by accuracy on the validation set:
- The `decision tree` model had the second highest score (0.785 at `max_depth = 3`) and the second shortest execution time.
- The `random forest` model had the highest score (0.795 at `n_estimators = 23`) and the longest execution time.
- The `logistic regression` model had the lowest score (0.757) and the shortest execution time.

The **`Decision Tree`** model with the hyperparameter `max_depth = 3` is the chosen model, as it has the best balance between score and execution time: its validation accuracy is only about 0.009 below that of the random forest.
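The comparison above refers to execution time, which the cells shown do not record. A minimal sketch of how fit times could be measured with `time.perf_counter`, assuming synthetic data from `make_classification` as a stand-in for the training set; `candidates` and `fit_seconds` are hypothetical names, and the hyperparameters are the ones reported above.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption: not the Megaline data)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=12345
)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=12345, max_depth=3),
    "random forest": RandomForestClassifier(random_state=12345, n_estimators=23),
    "logistic regression": LogisticRegression(random_state=12345, solver="liblinear"),
}

# Time a single fit of each model
fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[name] = time.perf_counter() - start

# Report from fastest to slowest
for name, seconds in sorted(fit_seconds.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```

A single fit is a noisy measurement, so repeating each fit several times and comparing averages would give a steadier ranking.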

# Model Quality Assessment

In [24]:
# Train a decision tree model with the maximum depth found earlier and determine the accuracy on the test set
model = DecisionTreeClassifier(random_state=random_state, max_depth=best_depth)
model.fit(features_train, target_train)
score_test = model.score(features_test, target_test)

print("Accuracy of the decision tree model on the test set:", score_test)

Accuracy of the decision tree model on the test set: 0.7791601866251944


## Sanity Check

In [25]:
# Create a dummy classifier that always predicts the most frequent class
dummy = DummyClassifier(strategy='most_frequent', random_state=random_state)
dummy.fit(features_train, target_train)

# Calculate the score on the test data
dummy_score = dummy.score(features_test, target_test)

print(f'Dummy Classifier Score: {dummy_score}')

Dummy Classifier Score: 0.6842923794712286


## Intermediate Conclusion

- The `Dummy Classifier` model, which always predicts the most frequent class, is correct on 68.4% of the test observations.
- The `Decision Tree` model is correct on 77.9% of the test observations, about 9.5 percentage points above the baseline. This indicates that the model learns from the data rather than just predicting the most frequent class.

**Note**: The classes are imbalanced. On a balanced dataset, a `Dummy Classifier` that always predicts the most frequent class would have an accuracy of about 50%; here it reaches 68.4%, so roughly 68% of the test observations belong to the majority class.
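The link between class balance and the most-frequent baseline can be illustrated without scikit-learn. The sketch below uses small made-up label lists, not the Megaline data, and `majority_baseline_accuracy` is a hypothetical helper that mirrors what the `most_frequent` strategy does.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Imbalanced toy labels: 70% of each list is class 0 (made-up values)
imbalanced_train = [0] * 7 + [1] * 3
imbalanced_test = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
imbalanced_score = majority_baseline_accuracy(imbalanced_train, imbalanced_test)

# Balanced toy test labels: half class 0, half class 1
balanced_test = [0, 1] * 5
balanced_score = majority_baseline_accuracy(imbalanced_train, balanced_test)

print("Baseline on imbalanced test labels:", imbalanced_score)  # 0.7
print("Baseline on balanced test labels:", balanced_score)      # 0.5
```

The baseline accuracy equals the share of the majority class in the evaluated labels, which is why any model should be compared against that share rather than against 50%.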

# General Conclusion

The tested classification models included the `decision tree`, `random forest`, and `logistic regression`. The analysis revealed that:

- The `decision tree` model had the second highest score and the second shortest execution time.
- The `random forest` model had the highest score and the longest execution time.
- The `logistic regression` model had the lowest score and the shortest execution time.

The `decision tree` model with the hyperparameter `max_depth = 3` was chosen for its balance between accuracy and execution time. It reached an accuracy of `0.779` on the test set, which is above the project threshold of 0.75. It also outperformed the Dummy Classifier (`0.684`), indicating that the model learns from the data rather than merely predicting the most frequent class.
