# Step 1. Open the data file and read the general information

## Project description
Megaline, a mobile carrier, wants a model that analyzes subscribers' behavior and recommends one of its newer plans: Smart or Ultra. Our goal is to build a classification model that picks the right plan with the highest possible accuracy. The accuracy threshold for this project is 0.75, and we will verify it on the test dataset.

## Import

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import joblib
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Load data

In [3]:
try:
    df = pd.read_csv('users_behavior.csv')  # local copy of the data
except FileNotFoundError:
    df = pd.read_csv('datasets/users_behavior.csv')  # fall back to the datasets/ folder

## Step 2. Check the data

- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


There are no missing values: every column has 3214 non-null entries.

In [7]:
df.sample(5)

      calls  minutes  messages   mb_used  is_ultra
365    20.0   131.97      23.0  26720.17         1
2242    9.0    72.66       6.0   1315.81         1
657    82.0   466.26      59.0  16996.03         0
1875   64.0   418.58      43.0  24877.83         0
1024  100.0   709.40      17.0  16964.33         0


In [10]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

There are more than twice as many Smart subscribers as Ultra subscribers. We should check that the split into training, validation, and test sets keeps roughly the same proportion. We will create a function to calculate this proportion:

In [16]:
def proportion_of_ultra(df):
    # share of Ultra rows (is_ultra == 1) in the given DataFrame
    value_count = df['is_ultra'].value_counts()
    proportion = value_count[1] / (value_count[1] + value_count[0])
    return round(proportion, 2)

# Split the source data into a training set, a validation set, and a test set.

In [12]:
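# 60% of the rows go to training; the remaining 40% are held out
df_train, df_valid_and_test = train_test_split(df, test_size=0.4, random_state=12345)
# the held-out part is split 60/40 again: about 24% validation and 16% test overall
df_valid, df_test = train_test_split(df_valid_and_test, test_size=0.4, random_state=12345)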

Check whether the proportion of the Ultra plan stays the same after the split:

In [20]:
print('df_train proportion:', proportion_of_ultra(df_train))
print('df_valid proportion:', proportion_of_ultra(df_valid))
print('df_test proportion:', proportion_of_ultra(df_test))

df_train proportion: 0.31
df_valid proportion: 0.3
df_test proportion: 0.32


Yes, the proportion of Ultra subscribers stays close to 0.31 in all three sets.
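The split above relies on random shuffling, so the proportions match only approximately. If a closer match is wanted, `train_test_split` accepts a `stratify` argument that keeps the class shares nearly identical in each part. Below is a minimal sketch on toy data; the frame `toy` and its 700/300 class counts are made up for illustration and are not the Megaline data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (an assumption):
# 700 Smart (0) rows and 300 Ultra (1) rows.
toy = pd.DataFrame({
    'mb_used': range(1000),
    'is_ultra': [0] * 700 + [1] * 300,
})

# Same two-step split as in the notebook, but stratified on the target
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=12345, stratify=toy['is_ultra'])
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.4, random_state=12345, stratify=toy_rest['is_ultra'])

# Print the size and the Ultra share of each part
for name, part in [('train', toy_train), ('valid', toy_valid), ('test', toy_test)]:
    print(name, len(part), round(part['is_ultra'].mean(), 2))
```

With stratification, the Ultra share in each part should come out at or very near the overall share of the toy data (0.3), regardless of the random seed.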

# Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

Our task is to recommend one of two plans, so any wrong recommendation counts as an error, whether we recommend Ultra when the target was Smart or the opposite. Therefore, we will evaluate the models with the accuracy metric.
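To make the metric concrete, here is a small sketch of how accuracy is computed: the number of correct recommendations divided by the total number of observations. The helper `accuracy` and the five-row example are hypothetical; the class counts in the second part are the ones shown by `value_counts()` earlier in this notebook.

```python
def accuracy(targets, predictions):
    """Share of observations where the prediction equals the target."""
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)

# Hypothetical mini-example: 1 = Ultra, 0 = Smart
targets = [0, 0, 1, 1, 0]
predictions = [0, 1, 1, 0, 0]
print(accuracy(targets, predictions))  # 3 of 5 correct -> 0.6

# Baseline from the class counts of the full dataset:
# a "model" that always recommends Smart
smart, ultra = 2229, 985
baseline = smart / (smart + ultra)
print(round(baseline, 3))  # 0.694
```

The baseline is a useful reference point: always recommending Smart would already be right for about 69% of the full dataset, so the 0.75 threshold asks for a model that is clearly better than this constant answer.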

This is a classification task, so we will check which classification algorithm yields the best accuracy. The models we will check are decision tree, random forest, and logistic regression.

Create the features and target for evaluating the models:

In [26]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

## Decision tree

Loop over the tree depth to find the value with the best validation accuracy:

In [27]:
for depth in range(1,6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) # create a model with the given depth
    model.fit(features_train, target_train) # train the model
    predictions_valid = model.predict(features_valid) # get the model's predictions
    result = accuracy_score(target_valid, predictions_valid) # calculate the accuracy
    print("max_depth =", depth, ": ", end='')
    print(result)

max_depth = 1 : 0.7509727626459144
max_depth = 2 : 0.7821011673151751
max_depth = 3 : 0.7872892347600519
max_depth = 4 : 0.7833981841763943
max_depth = 5 : 0.7808041504539559


Depth 3 gives the highest validation accuracy: 0.787.
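Instead of reading the best depth off the printed list, the scores can be collected and the best one picked programmatically. The sketch below shows the idea on synthetic data from `make_classification`; the data, the names `scores` and `best_depth`, and the 75/25 split are assumptions for illustration, not part of the Megaline analysis.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Megaline set)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Validation accuracy for each candidate depth
scores = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_valid, tree.predict(X_valid))

# Depth with the highest validation accuracy (the smallest depth wins ties)
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

Because `max` returns the first key with the highest score, ties are resolved in favor of the shallower, simpler tree.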

## Random forest

Loop over the number of trees to find the model with the best validation accuracy:

In [33]:
best_score = 0
best_est = 0
for est in range(1, 10): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est  # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 8): 0.7717250324254216


In [21]:
df_train

      calls  minutes  messages   mb_used  is_ultra
3027   60.0   431.56      26.0  14751.26         0
434    33.0   265.17      59.0  17398.02         0
1226   52.0   341.83      68.0  15462.38         0
1054   42.0   226.18      21.0  13243.48         0
1842   30.0   198.42       0.0   8189.53         0
...     ...      ...       ...       ...       ...
2817   12.0    86.62      22.0  36628.85         1
546    65.0   458.46       0.0  15214.25         1
382   144.0   906.18       0.0  25002.44         1
2177   38.0   301.27      37.0  28914.24         1
