# Recommendation of mobile tariffs

We have data on the behavior of customers who have already switched to one of the mobile tariffs. The task is to build a classification model that recommends the appropriate tariff for a user.

The goal is a model with the highest possible *accuracy*.

## Open and study data

Import necessary libraries:

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('/Users/a1/Desktop/data_folder/users_behavior.csv')
df

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


Each row in the dataset describes the behavior of one user over one month:
* calls: number of calls,
* minutes: total duration of calls in minutes,
* messages: number of SMS messages,
* mb_used: internet traffic consumed, in MB,
* is_ultra: the tariff used during the month ("Ultra" = 1, "Smart" = 0).

## Preparing datasets:

To get three samples (training, validation, and test) in a 60% / 20% / 20% ratio, apply the train_test_split function twice:

In [3]:
df_train, temp_df = train_test_split(df, test_size = 0.4, random_state=12345)
df_valid, df_test = train_test_split(temp_df, test_size = 0.5, random_state=12345)
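The two-step split can be checked on a small stand-in table. The sketch below is not the notebook's own code: the `demo` frame and its values are hypothetical, and the `stratify` argument is an optional addition that keeps the class ratio similar across the three samples.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical values, same column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "is_ultra": rng.integers(0, 2, size=1000),
})

# First split: 60% train, 40% held out; stratify keeps the class ratio similar
demo_train, demo_temp = train_test_split(
    demo, test_size=0.4, random_state=12345, stratify=demo["is_ultra"]
)
# Second split: halve the held-out part into validation and test
demo_valid, demo_test = train_test_split(
    demo_temp, test_size=0.5, random_state=12345, stratify=demo_temp["is_ultra"]
)

split_sizes = (len(demo_train), len(demo_valid), len(demo_test))
print(split_sizes)
```

Splitting off 40% first and then halving it gives the 20% / 20% validation and test parts, since 0.5 × 0.4 = 0.2 of the original rows.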

Separate features and target in each of the samples:

In [4]:
features_train = df_train.drop(['is_ultra'], axis =1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis =1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

## Study models

### Decision tree:

In [5]:
result_tree = 0
tree_depth = 0

for value in range(1,6):
    decision_tree_m = DecisionTreeClassifier(random_state=12345, max_depth = value)
    decision_tree_m.fit(features_train, target_train)
    decision_tree_m_predictions = decision_tree_m.predict(features_valid)
    accuracy_decision_tree = accuracy_score(target_valid, decision_tree_m_predictions)
    if accuracy_decision_tree > result_tree:
        result_tree = accuracy_decision_tree
        tree_depth = value

print("Tree depth:", tree_depth)
print("Accuracy: ",result_tree)

Tree depth: 3
Accuracy:  0.7853810264385692
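The depth search above can also be wrapped in a small helper. This is only a sketch under stated assumptions: the data comes from `make_classification` rather than the tariff file, and `best_tree_depth` is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the tariff dataset (an assumption for this sketch)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

def best_tree_depth(depths):
    """Return (depth, validation accuracy) for the best-scoring max_depth."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(X_train, y_train)
        scores[depth] = accuracy_score(y_valid, model.predict(X_valid))
    # max() returns the first maximum, so ties go to the shallowest tree
    best = max(scores, key=scores.get)
    return best, scores[best]

depth, acc = best_tree_depth(range(1, 6))
print("Tree depth:", depth)
print("Accuracy:", acc)
```

Keeping every score in a dictionary makes it easy to inspect how accuracy changes with depth, not just which depth won.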


### Random forest:

In [6]:
result_forest = 0
result_est = 0

# Note: est sets both n_estimators and min_samples_leaf, so the two vary together
for est in range(1,11):
    forest_m = RandomForestClassifier(random_state=12345, n_estimators = est, min_samples_leaf=est)
    forest_m.fit(features_train, target_train)
    forest_m_predictions = forest_m.predict(features_valid)
    accuracy_forest = accuracy_score(target_valid, forest_m_predictions)
    if accuracy_forest > result_forest:
        result_forest = accuracy_forest
        result_est = est

print("Estimators amount:", result_est)
print("Accuracy: ",result_forest)

Estimators amount: 6
Accuracy:  0.7962674961119751
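The loop above ties `n_estimators` and `min_samples_leaf` to the same value and stores only the best score, not the fitted model that produced it. A possible variation, sketched on synthetic data with hypothetical grid values, searches the two parameters independently and keeps the best model object for later use:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the tariff dataset (an assumption for this sketch)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_forest, best_params, best_acc = None, None, 0.0
for n_estimators in (5, 10, 20):          # hypothetical grid values
    for min_samples_leaf in (1, 3, 5):    # hypothetical grid values
        model = RandomForestClassifier(
            random_state=12345,
            n_estimators=n_estimators,
            min_samples_leaf=min_samples_leaf,
        )
        model.fit(X_train, y_train)
        acc = accuracy_score(y_valid, model.predict(X_valid))
        if acc > best_acc:
            # Keep the fitted model itself, not only its score
            best_forest, best_params, best_acc = model, (n_estimators, min_samples_leaf), acc

print("Best (n_estimators, min_samples_leaf):", best_params)
print("Accuracy:", best_acc)
```

Because `best_forest` is the fitted model with the best validation score, it can be passed directly to a later test-set check.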


### Logistic regression:

In [7]:
lr_m = LogisticRegression(random_state=12345, solver='lbfgs', max_iter=1000)
lr_m.fit(features_train, target_train)
lr_m_predictions = lr_m.predict(features_valid)
accuracy_lr_m = accuracy_score(target_valid, lr_m_predictions)
print("Accuracy: ",accuracy_lr_m)

Accuracy:  0.7107309486780715
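Logistic regression can be sensitive to features on very different scales, and here mb_used is in the tens of thousands while the other columns are in the tens or hundreds. One thing that might be worth trying is standardizing the features first. The sketch below uses synthetic data with one artificially inflated column; whether scaling improves accuracy on the real tariff data is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the tariff dataset (an assumption for this sketch)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X[:, 3] *= 10_000  # mimic one large-scale feature such as mb_used
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training sample only, then applied before the classifier
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, max_iter=1000),
)
scaled_lr.fit(X_train, y_train)
scaled_acc = accuracy_score(y_valid, scaled_lr.predict(X_valid))
print("Accuracy:", scaled_acc)
```

Wrapping both steps in a pipeline avoids leaking validation statistics into the scaler.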


Of the three classifiers, the "Random Forest" model showed the best accuracy on the validation sample (0.796). It is worth noting that, contrary to expectations, the "Decision Tree" model (0.785) outperformed the "Logistic Regression" model (0.711).

## Check model on test sample

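Check the random forest on the test sample. Note that `forest_m` here is the model left over from the last loop iteration (n_estimators=10, min_samples_leaf=10), not the configuration that scored best on validation (6), so the score below belongs to that last model: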

In [8]:
test_forest_m_predictions = forest_m.predict(features_test)
accuracy_test_forest = accuracy_score(target_test, test_forest_m_predictions)
print(accuracy_test_forest)

0.7978227060653188


## Check the models for adequacy

In [9]:
from sklearn.dummy import DummyClassifier
# Baseline that predicts each class uniformly at random, ignoring the features
dummy_model = DummyClassifier(strategy='uniform', random_state=0)
dummy_model.fit(features_train, target_train)
predictions_dummy_test = dummy_model.predict(features_test)
dummy_score = accuracy_score(target_test, predictions_dummy_test)
print('Accuracy of the random dummy model:', dummy_score)

Accuracy of the random dummy model: 0.5303265940902022
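A uniform random baseline scores about 0.5 on a binary target regardless of class balance. When the classes are imbalanced, always predicting the majority class is a stricter adequacy check. The sketch below illustrates this with a hypothetical 70/30 target; the actual class shares in the tariff data are not stated in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced target: 70% class 0, 30% class 1
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((100, 1))  # DummyClassifier ignores the feature values

# Always predict the most frequent class seen during fit
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_demo, y_demo)
majority_acc = accuracy_score(y_demo, majority.predict(X_demo))
print("Accuracy of the majority-class baseline:", majority_acc)
```

On this toy target the baseline accuracy equals the majority class share, which is the number a real model would need to beat.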


## Conclusion

Thus, the most effective model in this case is the "Random Forest" model. Its accuracy on the test sample (0.7978) is slightly higher than the best accuracy on the validation sample (0.7963) and well above the accuracy of the random dummy model (0.5303).