# Project Summary

This project analyzes user behavior data to predict whether a user is classified as "is_ultra". The goals are to explore and clean the data, engineer consistent feature scales, establish baseline performance, and compare several classification methods to identify a practical model for deployment.

Methods used: data import and EDA; unit-based feature rescaling; a 60/20/20 train/validation/test split (an 80/20 split, then a 75/25 split of the larger part); baseline checks (rounded training mean and a uniform DummyClassifier); and model comparisons (Logistic Regression, Random Forest with n_estimators search, Decision Tree with max_depth search). Hyperparameters were chosen on validation accuracy, and the final models were compared on test accuracy. The Decision Tree is recommended for initial deployment because it matches the Random Forest's test accuracy and, as a single tree rather than an ensemble, is expected to be faster at inference.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

Added required libraries.

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

Imported dataset.

In [None]:
print(df.head())
print(df.info())
print(df.describe())
print(df['is_ultra'].value_counts())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246  

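The summary statistics show that mb_used (mean about 17,208) and minutes (mean about 438) span much larger ranges than calls and messages. I will convert these two columns to larger units so the feature scales are closer together. This mainly matters for scale-sensitive models such as Logistic Regression; tree-based models are unaffected by this kind of rescaling.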

In [None]:
# Convert megabytes to gigabytes; dividing by 1000 reproduces the gb_used values printed below.
df['mb_used'] = df['mb_used'] / 1000
df = df.rename(columns = {'mb_used' : 'gb_used'})
# Convert minutes to hours.
df['minutes'] = df['minutes'] / 60
df = df.rename(columns = {'minutes' : 'hours'})
print(df.head())

   calls      hours  messages   gb_used  is_ultra
0   40.0   5.198333      83.0  19.91542         0
1   85.0   8.612500      56.0  22.69696         0
2   77.0   7.794333      86.0  21.06045         0
3  106.0  12.425500      81.0   8.43739         1
4   66.0   6.979000       1.0  14.50275         0


Converted MB to GB and minutes to hours to produce more comparable feature scales.
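
Unit conversion brings the columns closer together but does not put them on a common scale. A common alternative is standardization inside a pipeline. The following is a minimal sketch under stated assumptions: the data is synthetic (only the column names mirror the notebook), the target rule is hypothetical, and the names demo_features, demo_target, and scaled_logistic are illustrative rather than part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features (values are made up).
rng = np.random.default_rng(17)
demo_features = pd.DataFrame({
    "calls": rng.normal(63, 33, 500).clip(0),
    "hours": rng.normal(7.3, 3.9, 500).clip(0),
    "messages": rng.normal(38, 36, 500).clip(0),
    "gb_used": rng.normal(17, 7.5, 500).clip(0),
})
# Hypothetical target: heavier data users are more likely to be labelled 1.
demo_target = (demo_features["gb_used"] + rng.normal(0, 5, 500) > 20).astype(int)

# The scaler is fitted inside the pipeline, so it only sees the rows passed to fit().
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=17, solver="liblinear"),
)
scaled_logistic.fit(demo_features, demo_target)

# After standardization each column has mean ~0 and standard deviation ~1.
scaled_values = scaled_logistic.named_steps["standardscaler"].transform(demo_features)
column_means = scaled_values.mean(axis=0)
column_stds = scaled_values.std(axis=0)
print(column_means.round(3), column_stds.round(3))
```

Fitting the scaler inside the pipeline keeps validation and test rows out of the scaling statistics, which a manual rescale of the whole frame does not guarantee.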

In [5]:
df_tv, df_test = train_test_split(df, test_size = 0.20, random_state = 17)
df_train, df_valid = train_test_split(df_tv, test_size = 0.25, random_state = 17)
feature_train = df_train.drop(['is_ultra'], axis = 1)
target_train = df_train['is_ultra']
feature_valid = df_valid.drop(['is_ultra'], axis = 1)
target_valid = df_valid['is_ultra']
feature_test = df_test.drop(['is_ultra'], axis = 1)
target_test = df_test['is_ultra']

Split the dataset into training (60%), validation (20%), and test (20%) sets, then separated the features from the is_ultra target in each set.
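
The two-step split can be checked with a quick size calculation. This sketch uses a placeholder frame with the same number of rows as the dataset (3214); the column values are dummies and the demo_ names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 3214 rows; the values themselves are meaningless.
demo_df = pd.DataFrame({"feature": range(3214), "is_ultra": [0, 1] * 1607})

# Same two-step split as above: 20% for test, then 25% of the remainder for validation.
demo_tv, demo_test = train_test_split(demo_df, test_size=0.20, random_state=17)
demo_train, demo_valid = train_test_split(demo_tv, test_size=0.25, random_state=17)

# 25% of the remaining 80% is 20% of the full data, so the result is roughly 60/20/20.
split_sizes = {"train": len(demo_train), "valid": len(demo_valid), "test": len(demo_test)}
split_shares = {name: size / len(demo_df) for name, size in split_sizes.items()}
print(split_sizes)
print(split_shares)
```

Passing stratify=demo_df["is_ultra"] to the first call and stratify=demo_tv["is_ultra"] to the second would keep the class shares equal across the three sets; the notebook does not do this.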

In [6]:
baseline_prediction = target_train.mean()
baseline_predictions = pd.Series(baseline_prediction, index = target_valid.index)
baseline_accuracy = accuracy_score(target_valid, baseline_predictions.round())
print(f"Baseline accuracy: {baseline_accuracy}")

Baseline accuracy: 0.6765163297045101


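Baseline using the training-set mean of is_ultra. Because the share of positives is well below 0.5 (about 0.31 in the full dataset), the mean rounds to 0, so this baseline always predicts the majority class and scores about 0.677 on the validation set. A useful model needs to beat this figure.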
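
The rounded-mean baseline behaves like scikit-learn's majority-class dummy model. A minimal sketch on made-up labels, assuming roughly 30% positives; the demo_ names and sample sizes are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with roughly 30% positives.
rng = np.random.default_rng(17)
demo_target_train = rng.binomial(1, 0.3, size=1000)
demo_target_valid = rng.binomial(1, 0.3, size=400)
# DummyClassifier ignores feature values, so a column of zeros is enough.
demo_features_train = np.zeros((1000, 1))
demo_features_valid = np.zeros((400, 1))

# Rounded-mean baseline: a training mean below 0.5 rounds to class 0.
rounded_mean = int(round(demo_target_train.mean()))
mean_accuracy = accuracy_score(demo_target_valid, np.full(400, rounded_mean))

# Majority-class baseline via DummyClassifier.
majority = DummyClassifier(strategy="most_frequent")
majority.fit(demo_features_train, demo_target_train)
majority_accuracy = majority.score(demo_features_valid, demo_target_valid)

print(mean_accuracy, majority_accuracy)
```

Both numbers equal the share of the majority class in the validation labels, which is why this baseline is a natural floor for accuracy on imbalanced data.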

In [7]:
dummy_classifier = DummyClassifier(strategy = 'uniform', random_state = 17)
dummy_classifier.fit(feature_train, target_train)
random_accuracy = dummy_classifier.score(feature_valid, target_valid)
print(f"Random guessing accuracy: {random_accuracy}")

Random guessing accuracy: 0.4727838258164852


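A second reference point: DummyClassifier with the uniform strategy guesses each class with equal probability, so its expected accuracy is 0.5 regardless of class balance. This particular run scored about 0.473 on the validation set.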
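
How far a single run of uniform guessing can drift from 0.5 can be estimated by simulation. The sketch below assumes 643 validation rows (the size implied by a 60/20/20 split of 3214 rows) and made-up labels; it uses NumPy directly rather than DummyClassifier.

```python
import numpy as np

# Made-up validation labels with about 30% positives.
rng = np.random.default_rng(17)
demo_labels = rng.binomial(1, 0.3, size=643)

# Repeat uniform guessing many times to see how much a single run varies.
n_repeats = 2000
guesses = rng.integers(0, 2, size=(n_repeats, demo_labels.size))
accuracies = (guesses == demo_labels).mean(axis=1)

mean_accuracy = accuracies.mean()
spread = accuracies.std()
# Theoretical standard deviation of the accuracy of 643 fair coin flips.
expected_spread = np.sqrt(0.25 / demo_labels.size)

print(f"mean={mean_accuracy:.3f}  std={spread:.3f}  theory={expected_spread:.3f}")
```

With a standard deviation of about 0.02, a single score of 0.473 is within ordinary run-to-run noise around 0.5.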

In [8]:
model_logistic = LogisticRegression(random_state = 17, solver = 'liblinear')
model_logistic.fit(feature_train, target_train) 
score_train_logistic = model_logistic.score(feature_train, target_train)  
score_valid_logistic = model_logistic.score(feature_valid, target_valid)  

print('Accuracy of the logistic regression model on the training set:', score_train_logistic)
print('Accuracy of the logistic regression model on the validation set:', score_valid_logistic)

Accuracy of the logistic regression model on the training set: 0.7520746887966805
Accuracy of the logistic regression model on the validation set: 0.7262830482115086


Logistic Regression reaches about 0.726 validation accuracy, roughly five percentage points above the majority-class baseline (0.677). Training accuracy (0.752) is close to validation accuracy, so the model is not overfitting heavily. The limited gain more likely reflects that a linear decision boundary cannot capture non-linear relationships in the usage features; the tree-based models evaluated below score higher.

In [9]:
final_score_logistic = model_logistic.score(feature_test, target_test)
print(f"Logistic Regression Final Test Score: {final_score_logistic}")

Logistic Regression Final Test Score: 0.76049766718507


Evaluated the Logistic Regression model on the held-out test set: about 0.760 accuracy.

In [10]:
# Try n_estimators from 1 to 24 and keep the value with the highest validation accuracy.
best_score_forest = 0
best_est_forest = 0
for est in range(1, 25):
    model_forest = RandomForestClassifier(random_state = 17, n_estimators = est)
    model_forest.fit(feature_train, target_train)
    score_forest = model_forest.score(feature_valid, target_valid)
    if score_forest > best_score_forest:
        best_score_forest = score_forest
        best_est_forest = est
# Refit a forest with the best n_estimators on the training set.
final_forest = RandomForestClassifier(random_state = 17, n_estimators = best_est_forest)
final_forest.fit(feature_train, target_train)
print(f"Random Forest Classifier Best Score: {best_score_forest}")
print(f"Random Forest Best n_estimators: {best_est_forest}")


Random Forest Classifier Best Score: 0.7682737169517885
Random Forest Best n_estimators: 24


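The Random Forest reached about 0.768 validation accuracy, with the best result at n_estimators = 24. Because 24 is the upper edge of the tested range (1–24), larger forests may score higher, and the search range is worth extending.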
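
A wider search with cross-validation could be sketched as follows. This is not the method used in the notebook: it swaps the single validation split for GridSearchCV with 5-fold cross-validation, runs on synthetic data from make_classification, and uses a hypothetical grid of n_estimators values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic four-feature data standing in for the real features (an assumption).
demo_X, demo_y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0, random_state=17
)

# Hypothetical grid that extends past the 1-24 range tried above.
param_grid = {"n_estimators": [10, 25, 50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=17),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(demo_X, demo_y)

best_n_estimators = search.best_params_["n_estimators"]
best_cv_accuracy = search.best_score_
# With refit=True (the default), best_estimator_ is already trained on all rows passed to fit().
best_forest = search.best_estimator_
print(best_n_estimators, round(best_cv_accuracy, 3))
```

Cross-validation averages over several splits, so the chosen value is less sensitive to one particular validation set than a single-split search.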

In [11]:
final_score_forest = final_forest.score(feature_test, target_test)
print(f"Random Forest Classifier Final Score: {final_score_forest}")

Random Forest Classifier Final Score: 0.7978227060653188


Evaluated the Random Forest on the held-out test set: about 0.798 accuracy.

In [12]:
# Try max_depth from 1 to 20 and keep the depth with the highest validation accuracy.
best_score_tree = 0
best_depth_tree = 0
for depth in range(1, 21):
    model_tree = DecisionTreeClassifier(random_state=17, max_depth = depth)
    model_tree.fit(feature_train, target_train)
    score_tree = model_tree.score(feature_valid, target_valid)
    if score_tree > best_score_tree:
        best_score_tree = score_tree
        best_depth_tree = depth
# Refit a tree with the best max_depth on the training set.
final_tree = DecisionTreeClassifier(random_state=17, max_depth=best_depth_tree)
final_tree.fit(feature_train, target_train)

print(f"Decision Tree Best Score: {best_score_tree}")
print(f"Decision Tree Best max_depth: {best_depth_tree}")

Decision Tree Best Score: 0.7776049766718507
Decision Tree Best max_depth: 7


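The Decision Tree reached about 0.778 validation accuracy at max_depth = 7 (tested range 1–20), slightly above the Random Forest's 0.768. The best depth falls well inside the tested range, and deeper trees did not improve validation accuracy.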
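
Recording training accuracy next to validation accuracy for each depth makes overfitting visible. The sketch below repeats the depth search on synthetic data (an assumption; make_classification with some label noise), and the names depth_scores and best_depth are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, standing in for the real features.
demo_X, demo_y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=17,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=17
)

# Collect (depth, training accuracy, validation accuracy) for every depth.
depth_scores = []
for depth in range(1, 21):
    tree = DecisionTreeClassifier(random_state=17, max_depth=depth)
    tree.fit(X_train, y_train)
    depth_scores.append(
        (depth, tree.score(X_train, y_train), tree.score(X_valid, y_valid))
    )

# max() returns the first maximum, so ties go to the shallowest tree.
best_depth, best_train_acc, best_valid_acc = max(depth_scores, key=lambda row: row[2])
for depth, train_acc, valid_acc in depth_scores:
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")
```

A training score that keeps rising while the validation score stalls or drops is the usual sign that additional depth is fitting noise.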

In [13]:
final_score_tree = final_tree.score(feature_test, target_test)
print(f"Decision Tree Final Test Score: {final_score_tree}")

Decision Tree Final Test Score: 0.7978227060653188


Evaluated the Decision Tree on the held-out test set: about 0.798 accuracy, identical to the Random Forest.

In [14]:
print(f"Random guessing accuracy: {random_accuracy}")
print(f"Baseline accuracy: {baseline_accuracy}")
print(f"Logistic Regression Final Test Score: {final_score_logistic}")
print(f"Random Forest Classifier Final Score: {final_score_forest}")
print(f"Decision Tree Final Test Score: {final_score_tree}")

Random guessing accuracy: 0.4727838258164852
Baseline accuracy: 0.6765163297045101
Logistic Regression Final Test Score: 0.76049766718507
Random Forest Classifier Final Score: 0.7978227060653188
Decision Tree Final Test Score: 0.7978227060653188


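Recommendation: Deploy the Decision Tree Classifier initially. It matches the Random Forest's test accuracy (about 0.798, versus 0.760 for Logistic Regression) and, as a single tree of depth 7 rather than an ensemble of 24 trees, should be faster at inference. Note that the random-guessing and baseline figures above were measured on the validation set, while the three model scores are test-set results. Revisit the Random Forest as the dataset grows, since it may become more accurate with additional data.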
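
The inference-speed argument can be checked with a simple timing comparison. This is a rough sketch, not a benchmark: the data is synthetic, the hyperparameters mirror the best values reported above, and mean_predict_seconds is a hypothetical helper introduced here for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-feature data standing in for the real features.
demo_X, demo_y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0, random_state=17
)

# Hyperparameters mirror the best values reported above.
demo_tree = DecisionTreeClassifier(random_state=17, max_depth=7).fit(demo_X, demo_y)
demo_forest = RandomForestClassifier(random_state=17, n_estimators=24).fit(demo_X, demo_y)


def mean_predict_seconds(model, features, repeats=20):
    """Average wall-clock time of model.predict(features) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(features)
    return (time.perf_counter() - start) / repeats


tree_seconds = mean_predict_seconds(demo_tree, demo_X)
forest_seconds = mean_predict_seconds(demo_forest, demo_X)
print(f"tree: {tree_seconds:.6f} s per batch, forest: {forest_seconds:.6f} s per batch")
```

Timings depend on hardware and batch size, so the comparison is best repeated on the deployment machine with realistic request sizes.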