# Recommendation of tariffs

## Project Description

Mobile operator "Megaline" initiated a project to analyze customer behavior and propose a new tariff plan based on their usage patterns. The project aims to build a classification model that will determine the appropriate tariff plan for each customer. The dataset used in the project includes information on customer behavior for two existing tariff plans: "Smart" and "Ultra".

The project involves developing a classification model with the highest possible accuracy. The minimum required accuracy for successful project completion is set at 0.75. The data preprocessing stage has already been completed, and the main focus is on building and evaluating the model.

To ensure project success, the model's accuracy is tested on a separate test dataset. The model's accuracy is compared to the target threshold of 0.75 to assess its effectiveness in classifying customers and recommending the suitable tariff plan.

## 01 Open the data file and review the general information

In [22]:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv(r'datasets/05_users_behavior.csv')
display(df)
df.info()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The dataset has 5 columns with no missing values. All data types are appropriate except the last column: the tariff is either Ultra or Smart, so `is_ultra` is a binary categorical variable and is converted to `bool`.

In [3]:
df['is_ultra'] = df['is_ultra'].astype('bool')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   bool   
dtypes: bool(1), float64(4)
memory usage: 103.7 KB


Data description

Each object in the dataset describes the behavior of one user over one month. The following is known:

calls — number of calls,
minutes — total duration of calls in minutes,
messages — the number of SMS messages,
mb_used — consumed Internet traffic in MB,
is_ultra — what tariff was used during the month ("Ultra" — 1, "Smart" — 0).

## 02 Let's split the data into samples

In [5]:
features = df.drop('is_ultra', axis=1)

In [6]:
target = df['is_ultra']

The samples should be divided in the ratio: training sample - 0.6, validation sample - 0.2, test sample - 0.2

In [7]:
features_train, features_test, target_train, target_test = train_test_split(features, 
                                                                            target, 
                                                                            test_size=0.6, 
                                                                            random_state=12345
                                                                           ) 

In [8]:
features_valid, features_test, target_valid, target_test = train_test_split(features_test, 
                                                                            target_test, 
                                                                            test_size=0.5, 
                                                                            random_state=12345
                                                                           ) 

In [9]:
print(features_train.shape)
print(target_train.shape)

(1285, 4)
(1285,)


In [10]:
print(features_valid.shape)
print(target_valid.shape)

(964, 4)
(964,)


In [11]:
print(features_test.shape)
print(target_test.shape)

(965, 4)
(965,)


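The resulting samples contain 1285, 964 and 965 rows, which is roughly 40% / 30% / 30% rather than the intended 60% / 20% / 20%: `test_size=0.6` in the first call holds out 60% of the data, whereas `test_size=0.4` would give the intended ratio. All results below were obtained with this 40/30/30 split.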
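For reference, a minimal sketch of how a 60/20/20 split could be obtained with two `train_test_split` calls: hold out 40% first, then halve the held-out part. The helper name `split_60_20_20` and the synthetic frame are illustrative assumptions, not part of the notebook; only the row count (3214) is taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, random_state=12345):
    """Hypothetical helper: split into 60% train, 20% validation, 20% test."""
    # First call: keep 60% for training, hold out the remaining 40%.
    features_train, features_rest, target_train, target_rest = train_test_split(
        features, target, test_size=0.4, random_state=random_state
    )
    # Second call: halve the held-out 40% into validation and test.
    features_valid, features_test, target_valid, target_test = train_test_split(
        features_rest, target_rest, test_size=0.5, random_state=random_state
    )
    return (features_train, features_valid, features_test,
            target_train, target_valid, target_test)


# Synthetic stand-in with the same number of rows as the dataset (3214).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.random((3214, 4)),
    columns=["calls", "minutes", "messages", "mb_used"],
)
demo_target = pd.Series(rng.integers(0, 2, size=3214), name="is_ultra")

parts = split_60_20_20(demo_features, demo_target)
split_sizes = [len(part) for part in parts[:3]]
print(split_sizes)
```

Because the accuracies reported in this notebook come from the 40/30/30 split, they would need to be recomputed after switching to this ratio.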

## 03 We investigate the quality of different types of models

### DecisionTreeClassifier

In [15]:
best_model = None
best_result = 0
best_depth = 0
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth

print("Accuracy of the best model on the validation set:", best_result, "Best depth:", best_depth)

max_depth = 1 : 0.7769709543568465
max_depth = 2 : 0.7821576763485477
max_depth = 3 : 0.8101659751037344
max_depth = 4 : 0.803941908713693
max_depth = 5 : 0.7946058091286307
max_depth = 6 : 0.8049792531120332
max_depth = 7 : 0.7956431535269709
max_depth = 8 : 0.8008298755186722
max_depth = 9 : 0.7904564315352697
max_depth = 10 : 0.7769709543568465
Accuracy of the best model on the validation set: 0.8101659751037344 Best depth: 3


A tree with a depth of 3 gives the most accurate predictions on the validation set. Its accuracy of 0.81 exceeds the required 0.75.

### RandomForestClassifier

In [18]:
best_model = None
best_result = 0
best_depth = 0
best_est = 0
for est in range(1, 11):
    for depth in range(1, 11):
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth = depth)
        model.fit(features_train, target_train)  # train the model
        predictions_valid = model.predict(features_valid)
        result = accuracy_score(target_valid, predictions_valid)
        if result > best_result:
            best_model = model
            best_depth = depth
            best_result = result
            best_est = est
        

print("Accuracy of the best model on the validation set:", best_result,  "Best number of estimators:", best_est, "Best depth:", best_depth)

Accuracy of the best model on the validation set: 0.8174273858921162 Best number of estimators: 2 Best depth: 5


The search over random forest settings selected 2 estimators and a depth of 5. With these settings the model reaches a validation accuracy of 0.82, which is above the 0.75 threshold and slightly better than the decision tree (0.81).
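The nested loops above amount to an exhaustive search over a small grid, scored on the validation set. A minimal sketch of the same idea using `itertools.product` follows; the helper `sweep_random_forest` and the `make_classification` toy data are assumptions for illustration, so the numbers it prints are unrelated to the results above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def sweep_random_forest(x_train, y_train, x_valid, y_valid,
                        estimators=range(1, 11), depths=range(1, 11)):
    """Hypothetical helper: return (best_accuracy, best_params, best_model)."""
    best_accuracy, best_params, best_model = 0.0, None, None
    for n_estimators, max_depth in product(estimators, depths):
        model = RandomForestClassifier(
            random_state=12345, n_estimators=n_estimators, max_depth=max_depth
        )
        model.fit(x_train, y_train)
        accuracy = accuracy_score(y_valid, model.predict(x_valid))
        # Strict '>' keeps the first setting encountered among ties.
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {"n_estimators": n_estimators, "max_depth": max_depth}
            best_model = model
    return best_accuracy, best_params, best_model


# Toy data only; this is not the Megaline dataset.
x, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
x_train, x_valid, y_train, y_valid = train_test_split(
    x, y, test_size=0.25, random_state=12345
)
best_accuracy, best_params, best_model = sweep_random_forest(
    x_train, y_train, x_valid, y_valid,
    estimators=range(1, 6), depths=range(1, 6),
)
print(best_accuracy, best_params)
```

Returning the fitted model together with its parameters avoids retraining with hand-copied settings when moving on to the test set.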

### LogisticRegression

In [19]:
model = LogisticRegression(random_state=12345, solver='lbfgs', max_iter=100)
model.fit(features, target)
predictions_valid = model.predict(features_valid) 
result = accuracy_score(target_valid, predictions_valid) 
print("Accuracy of the best model on the validation set:", result)

Accuracy of the best model on the validation set: 0.7759336099585062


In [20]:
model = LogisticRegression(random_state=12345, solver='sag', max_iter = 10000)
model.fit(features, target)
predictions_valid = model.predict(features_valid)  # get the model's predictions on the validation set
result = accuracy_score(target_valid, predictions_valid)
print("Accuracy of the best model on the validation set:", result)

Accuracy of the best model on the validation set: 0.7261410788381742


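Logistic regression clears the 0.75 threshold with `solver='lbfgs'` (0.78) but is less accurate than the tree-based models; with `solver='sag'` it drops to 0.73 even with `max_iter=10000`. Note that both models above were fitted on the full dataset (`features`, `target`), which includes the validation rows, so these validation accuracies are optimistic; fitting on `features_train` and `target_train` would give a fair comparison.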
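The scikit-learn documentation notes that `sag` converges quickly only when the features are on roughly the same scale, which may explain its weaker result here (`mb_used` is in the tens of thousands while `messages` is in the tens). A minimal sketch of fitting on the training sample only, with standardization in a pipeline; the toy data and the names `scaled_logreg` and `valid_accuracy` are assumptions, and the printed value says nothing about the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data only; this is not the Megaline dataset.
x, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
# Stretch one column to mimic a feature measured on a much larger scale.
x[:, 3] *= 10_000

x_train, x_valid, y_train, y_valid = train_test_split(
    x, y, test_size=0.25, random_state=12345
)

# The scaler is fitted inside the pipeline, so it only sees training rows.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="sag", max_iter=1000),
)
scaled_logreg.fit(x_train, y_train)

valid_accuracy = accuracy_score(y_valid, scaled_logreg.predict(x_valid))
print(valid_accuracy)
```

Keeping the scaler inside the pipeline means the validation rows influence neither the scaling statistics nor the model coefficients.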

### Conclusion by stage
We considered three types of models; the random forest showed the best accuracy on the validation set. The search selected 2 trees and a depth of 5, with which the model reaches a validation accuracy of 0.82, a satisfactory result.

What remains is to check the model on the test set.

## 04 Checking the quality of the model on a test sample

In [21]:
model = RandomForestClassifier(random_state=12345, n_estimators=8, max_depth = 8)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
result = accuracy_score(target_test, predictions_test) 
print("Accuracy of the best model on the test set:", result)

Accuracy of the best model on the test set: 0.7896373056994819


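The test accuracy of 0.79 is about 0.03 lower than the random forest's validation accuracy of 0.82 and remains above the 0.75 threshold. Note that the tested model uses `n_estimators=8` and `max_depth=8`, whereas the validation search reported 2 estimators and a depth of 5 as the best settings. We take the random forest as the most successful model.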

## 05 Checking the model for adequacy

In [23]:
model = DummyClassifier(strategy="most_frequent", random_state=0)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
result = accuracy_score(target_test, predictions_test)  # arguments are (y_true, y_pred)
print("Accuracy of the most-frequent-class baseline on the test set:", result)

Accuracy of the most-frequent-class baseline on the test set: 0.6984455958549223


The random forest scores 0.79 on the test set, about 0.09 above the 0.70 obtained by always predicting the most frequent class. Thus, the chosen model is adequate.
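The check rests on a simple fact: a `most_frequent` dummy model always predicts the majority class of the training labels, so its accuracy equals that class's share in the evaluated sample. A small sketch with made-up labels (illustration only, not the notebook's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 7 of 10 training rows and 6 of 10 test rows are class 0.
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# The dummy model ignores the features, so a column of zeros is enough.
x_train = np.zeros((len(y_train), 1))
x_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(x_test))

# Share of the training-majority class (0) among the test labels.
majority_share = np.mean(y_test == 0)
print(baseline_accuracy, majority_share)
```

Any model worth keeping should beat this share on the same sample.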

## 06 Conclusion

- The loaded data required no preprocessing
- The tariff column was converted to a Boolean data type, since it contains only two values (0 and 1)
- To select the optimal model, three types of models were considered: decision tree, random forest, logistic regression
- Random forest showed the highest validation accuracy (0.82), which satisfies the required 0.75
- The test accuracy (0.79) is close to the validation accuracy, with a difference of about 0.03
- The model was checked for adequacy against a most-frequent-class baseline (0.70), which it clearly outperforms
- The accuracy of the model on the test set is 0.79, which meets the conditions of the task.