<style>
.callout {
  border: 2px solid green;
  padding: 10px;
  /* background-color: #F0FFF0; */
  font-family: Arial, sans-serif;
  font-size: 14px;
  color: #008000;
  text-align: justify;
}
</style>
<div class="callout">
  <p>Hi! Let's keep things informal, if you don't mind.</p>
  <p>Below you'll see something unusual: the whole project is written in English. Please don't be surprised! I work abroad and am preparing these projects for my portfolio on `github`. I write in English from the start so that I don't have to translate them later. Thanks for understanding!</p>
  <p>Misha</p>
</div>

# Tariff recommendation

**Project description**

The mobile telecommunications operator, "Megaline," has discovered that many customers are still using outdated tariffs. They intend to develop a system capable of analyzing customer behavior and proposing new tariffs, namely "Smart" or "Ultra," to the users.

You have been provided with data regarding the behavior of customers who have already switched to these tariffs, obtained from the "Statistical Data Analysis" course project. Your task is to construct a classification model that can accurately recommend the appropriate tariff. Data preprocessing is not required as it has already been completed.

Construct a model with the highest achievable accuracy. To successfully complete the project, you must achieve a minimum accuracy of 0.75. Independently evaluate the model's accuracy using the test dataset.

Project execution instructions:

1. Divide the initial data into training, validation, and test datasets
2. Explore the performance of different models by varying hyperparameters. Summarize the findings of the investigation
3. Assess the quality of the model using the test dataset
4. Additional task: sanity-check the models

**Data description**

Each entry in the dataset represents information about the behavior of an individual user over the course of one month. 

The following information is available:

1. `calls` - number of phone calls made
2. `minutes` - total duration of phone calls in minutes
3. `messages` - number of SMS messages sent
4. `mb_used` - amount of internet traffic used in megabytes
5. `is_ultra` - the tariff used by the customer during the month ("Ultra" - 1, "Smart" - 0).

In [1]:
import pandas as pd
import plotly.express as px
from IPython.display import display
from tqdm import tqdm

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:
    df_users = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    # Fall back to the alternative dataset location.
    df_users = pd.read_csv('/datasets/users_behavior.csv')
    
FIG_WIDTH = 8
FIG_HEIGHT = 5
RANDOM_SEED = 42


## Exploratory data analysis

Let's examine the main relationships in the data before feeding it to the ML algorithms.


In [2]:
round(df_users.describe().T, 2)


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| calls | 3214.0 | 63.04 | 33.24 | 0.0 | 40.0 | 62.0 | 82.0 | 244.0 |
| minutes | 3214.0 | 438.21 | 234.57 | 0.0 | 274.58 | 430.6 | 571.93 | 1632.06 |
| messages | 3214.0 | 38.28 | 36.15 | 0.0 | 9.0 | 30.0 | 57.0 | 224.0 |
| mb_used | 3214.0 | 17207.67 | 7570.97 | 0.0 | 12491.9 | 16943.24 | 21424.7 | 49745.73 |
| is_ultra | 3214.0 | 0.31 | 0.46 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


Usage varies widely across users: the standard deviation is roughly half the mean for calls, minutes, and traffic, and almost equal to the mean for messages.

Less than a third (~31%) of the users are on the "Ultra" tariff, as the mean of `is_ultra` shows.

Data consumption is widely spread too: half of the users consume less than about 17 GB per month, while the heaviest user consumes nearly 50 GB.
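
The ~31% share of "Ultra" users also gives a reference point for the models that follow: always predicting the majority tariff ("Smart") would already be right about 69% of the time. Here is a minimal sketch of that baseline. It uses invented labels with the same 31/69 split rather than the real data, and the names `target`, `features`, and `baseline` are placeholders chosen for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with the class balance described above:
# 31 "Ultra" (1) and 69 "Smart" (0) users out of 100.
target = np.array([1] * 31 + [0] * 69)
# DummyClassifier ignores the features, so a placeholder column is enough.
features = np.zeros((len(target), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features, target)

baseline_accuracy = accuracy_score(target, baseline.predict(features))
print(f'Majority-class baseline accuracy: {baseline_accuracy:.2f}')
```

A trained model should beat this figure to be considered useful; the project's 0.75 threshold sits only a few points above it.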

In [3]:
for column in ['calls', 'minutes', 'messages', 'mb_used']:
    fig = px.histogram(
        df_users,
        x=column,
        color='is_ultra',
        facet_col='is_ultra',
        title=f'Histogram of {column}',
        width=FIG_WIDTH * 100,
        height=FIG_HEIGHT * 100,
        template='plotly_white',
    )
    fig.update_traces(showlegend=False)
    fig.show()
    

Beautiful plots. The two user groups show noticeably different usage patterns. Let's see how the ML algorithms pick that up.
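
The visual impression from the histograms could also be backed by numbers, for example by comparing per-tariff medians and standard deviations. The sketch below assumes the same column layout as `users_behavior.csv`, but the rows in `sample` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout of the real data.
sample = pd.DataFrame({
    'calls': [40, 55, 60, 90, 120, 20],
    'minutes': [300.0, 410.5, 450.0, 700.2, 900.0, 150.0],
    'messages': [10, 30, 25, 80, 0, 5],
    'mb_used': [12000.0, 16000.0, 17500.0, 25000.0, 30000.0, 8000.0],
    'is_ultra': [0, 0, 0, 1, 1, 1],
})

# Median and standard deviation of every feature, split by tariff.
group_stats = sample.groupby('is_ultra').agg(['median', 'std']).round(2)
print(group_stats)
```

On the real data, the same call would be made on `df_users` instead of `sample`.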

## Machine learning

First, let's separate the features from the target variable and split the dataset into three parts: one for training, one for validation, and one for testing.

In [4]:
# Hold out 40% of the rows; the remaining 60% form the training set.
features_train, features_test, target_train, target_test = train_test_split(
    df_users.drop('is_ultra', axis=1), df_users.is_ultra, test_size=0.4, random_state=RANDOM_SEED
)

# Split the held-out 40% in half: 20% for testing and 20% for validation.
features_test, features_valid, target_test, target_valid = train_test_split(
    features_test, target_test, test_size=0.5, random_state=RANDOM_SEED
)
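
The two-step split yields a 60/20/20 division. The sketch below checks the resulting sizes on invented stand-in data. It also passes `stratify`, an addition that the cell above does not use, to keep the share of `is_ultra` similar in every part. The name `SEED` and the generated values are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical seed for this sketch

# Invented stand-in data: 1000 rows, about 31% positives.
rng = np.random.default_rng(SEED)
features = pd.DataFrame({'mb_used': rng.normal(17000, 7500, size=1000)})
target = pd.Series((rng.random(1000) < 0.31).astype(int), name='is_ultra')

# Step 1: hold out 40% of the rows, keeping the class ratio via `stratify`.
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, random_state=SEED, stratify=target
)
# Step 2: split the held-out part in half into validation and test sets.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=SEED, stratify=target_rest
)

for name, part in [('train', target_train), ('valid', target_valid), ('test', target_test)]:
    print(f'{name}: {len(part)} rows, share of is_ultra = {part.mean():.3f}')
```

With only about a third of the users in one class, stratification guards against a validation or test set that happens to contain an unrepresentative share of "Ultra" users.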

Now let's train the models, starting with a decision tree.

In [5]:
dt_best_model = None
dt_best_accuracy = 0
dt_best_depth = 0

depths = range(1, 30)
dt_accuracies = []

for depth in tqdm(depths):
    dt_model = DecisionTreeClassifier(random_state=RANDOM_SEED, max_depth=depth)
    dt_model.fit(features_train, target_train)
    dt_prediction = dt_model.predict(features_valid)
    dt_accuracy = accuracy_score(target_valid, dt_prediction)
    dt_accuracies.append(dt_accuracy)
    if dt_best_accuracy < dt_accuracy:
        dt_best_model = dt_model
        dt_best_accuracy = dt_accuracy
        dt_best_depth = depth


100%|██████████| 29/29 [00:00<00:00, 168.09it/s]


In [6]:
fig = px.line(
    x=depths,
    y=dt_accuracies,
    markers=True,
    title='DecisionTreeClassifier accuracy depending on depth',
    labels=dict(x='Depth', y='Accuracy'),
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100,
    template='plotly_white'
)
fig.show()

Based on the plot, we can draw the following conclusions about the decision tree:

1. Effect of depth: accuracy improves with depth only up to a point and declines once the depth exceeds 10. This is a sign of overfitting: a deeper tree fits the training data more closely and performs worse on new data.

2. Maximum accuracy: the highest validation accuracy is about 0.8118, at a depth of 3. Deeper trees did not improve on it, so this is the best depth found for this dataset.

3. Minimum accuracy: the lowest validation accuracy is about 0.7216, at a depth of 22, and accuracy stays near that level for greater depths. Beyond a certain point, adding depth does not help and in fact reduces accuracy.
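
One way to confirm that the drop at larger depths is overfitting is to track training accuracy next to validation accuracy: a widening gap between the two is the typical symptom. The sketch below does this on synthetic, noisy data from `make_classification`, which is an assumption standing in for the real dataset, so the numbers it prints say nothing about the project itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # hypothetical seed for this sketch

# Synthetic data with label noise (flip_y) so that deep trees can overfit.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=SEED,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

# Map each depth to a (training accuracy, validation accuracy) pair.
scores = {}
for depth in (1, 3, 5, 10, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=SEED)
    model.fit(X_train, y_train)
    scores[depth] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_valid, model.predict(X_valid)),
    )
    print(f'depth={depth:>2}  train={scores[depth][0]:.3f}  valid={scores[depth][1]:.3f}')
```

Applied to the notebook, the same idea would mean also scoring each `dt_model` on `features_train` inside the loop and plotting both curves.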

Random forest next.

In [12]:
rf_best_model = None
rf_best_accuracy = 0
rf_best_depth = 0
rf_best_estimator = 0

depths = range(1, 30)
estimators = range(1, 15)
rf_accuracies = pd.DataFrame()

for depth in tqdm(depths):
    for estimator in estimators:
        rf_model = RandomForestClassifier(
            random_state=RANDOM_SEED, max_depth=depth, n_estimators=estimator
        )
        rf_model.fit(features_train, target_train)
        rf_prediction = rf_model.predict(features_valid)
        rf_accuracy = accuracy_score(target_valid, rf_prediction)
        rf_accuracies.loc[depth, estimator] = rf_accuracy
        if rf_best_accuracy < rf_accuracy:
            rf_best_model = rf_model
            rf_best_accuracy = rf_accuracy
            rf_best_depth = depth
            rf_best_estimator = estimator


100%|██████████| 29/29 [00:07<00:00,  3.75it/s]


In [14]:
fig = px.line(
    rf_accuracies,
    y=rf_accuracies.columns,
    markers=True,
    title='RandomForestClassifier accuracy depending on depth and # of estimators',
    labels=dict(index='Depth', value='Accuracy', variable='# of estimators'),
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100 * 1.6,
    template='plotly_white'
)
# fig.update_layout(hovermode='x unified')
fig.show()

Here are some conclusions we could draw:

1. Performance improves with depth and the number of estimators only up to a point: deeper trees make the model more complex and more trees make it more robust, but past certain values the accuracy no longer increases substantially. For depth, this may be due to overfitting: the model starts fitting noise in the training data, which reduces its ability to generalize to unseen data.

2. Depth and the number of estimators need to be chosen with care: values that are too low lead to underfitting, where the model is too simple to capture the patterns in the data. Excessive depth can lead to overfitting, while extra trees mostly add training time for little gain. Suitable values can be found with model tuning techniques such as cross-validation.

3. The highest accuracy seems to be achieved with a moderate depth and a high number of estimators, although the specific values depend on the data.
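
The cross-validation mentioned in point 2 could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV` instead of the manual loop over a single validation set. The data is synthetic and the grid values are arbitrary choices for illustration, not the ranges explored in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

SEED = 42  # hypothetical seed for this sketch

# Synthetic stand-in data (an assumption, not the project dataset).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=SEED,
)

# A small illustrative grid over the two hyperparameters tuned above.
param_grid = {'max_depth': [3, 6, 9], 'n_estimators': [5, 10, 20]}

# Each combination is scored by mean accuracy over 5 folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid,
    scoring='accuracy',
    cv=5,
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print(f'Best mean CV accuracy: {search.best_score_:.3f}')
```

Averaging over several folds makes the choice of hyperparameters less dependent on one particular validation split, at the cost of fitting each candidate several times.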