# Mobile plan recommendation.

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. You have access to behavior data about subscribers who have already switched to the new plans. For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model. Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

**Description of the data**

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

## Open the file

In [1]:
#Libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression


In [2]:
#Open the file

df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# Check the first 5 rows in a dataset

df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
#Check general information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
# Check duplicates

df.duplicated().sum()

0

There are no missing values or duplicates in the dataset. 
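`df.info()` shows the absence of missing values only indirectly, through the non-null counts. A more direct check counts missing cells and duplicate rows explicitly. The sketch below runs on the five rows printed by `df.head()` rather than on the full file; the `sample` frame is an illustrative stand-in, not the real dataset.

```python
import pandas as pd

# Stand-in frame built from the five rows printed by df.head() above
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Number of missing cells per column, then in total
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("Missing cells:", total_missing, "| duplicate rows:", duplicate_rows)
```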

In [6]:
# Change the datatype of the "calls" and "messages" columns to int

df['calls'] = df['calls'].astype('int')
df['messages'] = df['messages'].astype('int')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null int64
minutes     3214 non-null float64
messages    3214 non-null int64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


The data contains 3214 observations with 4 features (the "calls", "minutes", "messages" and "mb_used" columns) and one target, the plan (the "is_ultra" column). The target takes the values 0 or 1, so this is a binary classification task.
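Casting with `astype('int')` drops any fractional part without warning, so it can be worth confirming first that a float column holds whole numbers only. A minimal sketch of such a check, using values from the first rows shown earlier; the helper name `is_whole_number_column` is hypothetical.

```python
import pandas as pd

def is_whole_number_column(series: pd.Series) -> bool:
    """Return True if no value in the series has a fractional part."""
    return bool((series % 1 == 0).all())

# Values taken from the first rows of the "calls" and "minutes" columns
calls = pd.Series([40.0, 85.0, 77.0, 106.0, 66.0], name="calls")
minutes = pd.Series([311.9, 516.75, 467.66, 745.53, 418.74], name="minutes")

calls_is_whole = is_whole_number_column(calls)      # safe to cast to int
minutes_is_whole = is_whole_number_column(minutes)  # casting would lose data

# Cast only when the check passes
if calls_is_whole:
    calls = calls.astype("int")

print(calls_is_whole, minutes_is_whole, calls.dtype)
```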

## Create train and test split.

Split the data into training, validation and test samples in proportions of 75%, 12.5% and 12.5% (a 6:1:1 ratio).

In [7]:
# Split the data into a training sample and data for validation and test

df_train, df_other = train_test_split(df, test_size=0.25, random_state=12345)

In [8]:
# Split the df_other data into two separate samples for validation and testing of the model

df_valid, df_test = train_test_split(df_other, test_size=0.5, random_state=12345)

In [9]:
# Select the feature columns for each sample

#Select a set of features for training
features_train = df_train.drop('is_ultra', axis=1)

#Select a set of features for validation
features_valid = df_valid.drop('is_ultra', axis=1)

#Select a set of features for testing
features_test = df_test.drop('is_ultra', axis=1)


In [10]:
# Select the target feature in each dataset

#Select the target feature for training
target_train = df_train['is_ultra']

#Select the target feature for validation
target_valid = df_valid['is_ultra']

#Select the target feature for testing
target_test = df_test['is_ultra']
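The two-step split is easy to get wrong, so it helps to confirm the resulting sample sizes. The sketch below repeats the same two `train_test_split` calls on a plain list of 3214 row indices, a stand-in for the real table, and reports the size and share of each sample.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 3214 rows of the dataset
rows = list(range(3214))

# Same two-step split as above: 25% held out, then halved
train, other = train_test_split(rows, test_size=0.25, random_state=12345)
valid, test = train_test_split(other, test_size=0.5, random_state=12345)

sizes = {"train": len(train), "valid": len(valid), "test": len(test)}
shares = {name: round(n / len(rows), 3) for name, n in sizes.items()}

print(sizes)   # {'train': 2410, 'valid': 402, 'test': 402}
print(shares)  # {'train': 0.75, 'valid': 0.125, 'test': 0.125}
```

The same check on the real frames would compare `df_train.shape`, `df_valid.shape` and `df_test.shape`.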

## Models with various hyperparameters.

The following models will be used for the prediction:
- Decision Tree Classifier,
- Random Forest Classifier,
- Logistic Regression models. 

The models are built with a selection of hyperparameters.  
Accuracy is used as the metric for model selection.


In [11]:
#Build a Decision Tree model and check its accuracy with a tree depth ranging from 1 to 5

best_depth = 0
best_result = 0

for depth in range(1,6):
    #Train the model with a given tree depth
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) 
    model.fit(features_train, target_train)
    #Get the prediction of the model on the validation data 
    predictions = model.predict(features_valid)
    #Calculate accuracy on validation data
    result = accuracy_score(target_valid,predictions) 
    if result > best_result:
        best_depth = depth
        best_result = result
print('Best model accuracy:', best_result)
print('Tree depth:', best_depth)

Best model accuracy: 0.7985074626865671
Tree depth: 3


In [12]:
#Build a Random Forest model and check its accuracy with the number of trees ranging from 1 to 10 and the depth of the tree ranging from 1 to 5
best_est = 0
best_depth = 0
best_result = 0

for depth in range (1,6):
    for est in range(1,11):
        #Train the model with a given number of trees and a tree depth
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth = depth) 
        model.fit(features_train, target_train) 
        #Calculate accuracy on validation data
        result = model.score(features_valid,target_valid) 
        if result > best_result:
            best_est = est
            best_result = result
            best_depth = depth
print('Best model accuracy:', best_result)
print('Number of trees:', best_est)
print('Tree depth:', best_depth)

Best model accuracy: 0.8134328358208955
Number of trees: 9
Tree depth: 3


In [13]:
#Build a Logistic regression model and check its accuracy

#Train the model 
model = LogisticRegression(random_state=12345) 
model.fit(features_train, target_train) 
#Calculate accuracy on validation data
result = model.score(features_valid,target_valid) 

print('Logistic regression accuracy on the validation sample:', result)


Logistic regression accuracy on the validation sample: 0.6990049751243781
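Logistic regression is sensitive to feature scale, and the columns here differ by orders of magnitude (tens of calls against thousands of megabytes). One common option, not tried in this notebook, is to standardize the features inside a pipeline. The sketch below shows the mechanics on synthetic data from `make_classification`; the data, the column scaling factors and the resulting accuracy are assumptions for illustration and say nothing about the result on the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 features and a binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
# Put the columns on very different scales, loosely imitating calls vs. mb_used
X = X * [1, 10, 1, 1000]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training sample only, then reused for validation
scaled_model = make_pipeline(StandardScaler(),
                             LogisticRegression(random_state=12345))
scaled_model.fit(X_train, y_train)
valid_accuracy = scaled_model.score(X_valid, y_valid)

print("Validation accuracy with scaling:", valid_accuracy)
```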




Three models were considered for this binary classification problem: a decision tree, a random forest and logistic regression. Within the hyperparameter ranges searched, they reached the following accuracy on the validation sample:

- Decision tree: 0.7985 (tree depth 3)

- Random forest: 0.8134 (tree depth 3, number of trees 9)

- Logistic regression: 0.6990

Thus, the best validation accuracy was shown by the Random Forest model with a tree depth of 3 and 9 trees.
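The nested loops used for the random forest search can also be written as a single pass over all hyperparameter combinations, keeping every score so that runners-up can be inspected too. A minimal sketch on synthetic data from `make_classification`, an assumption standing in for the real features, with the same ranges of depth and number of trees:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features and a binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Score every (depth, number of trees) pair on the validation sample
results = []
for depth, n_trees in product(range(1, 6), range(1, 11)):
    model = RandomForestClassifier(random_state=12345,
                                   n_estimators=n_trees, max_depth=depth)
    model.fit(X_train, y_train)
    results.append((model.score(X_valid, y_valid), depth, n_trees))

# Highest accuracy wins; ties go to the shallower, then the smaller forest
best_accuracy, best_depth, best_trees = min(
    results, key=lambda r: (-r[0], r[1], r[2]))

print("Best validation accuracy:", best_accuracy)
print("Tree depth:", best_depth, "| number of trees:", best_trees)
```

The tie-breaking rule matches the strict `>` comparison in the loops above, which keeps the first combination that reaches the top score.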

## Checking the model on test data.

In [14]:
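#Check the Random Forest model with a tree depth of 3 and a number of trees of 9 on a test sample

#Train the model on the training sample; the test sample is used only for scoring
model = RandomForestClassifier(random_state=12345, n_estimators=9, max_depth = 3) 
model.fit(features_train, target_train) 
#Calculate accuracy on the test sample
result = model.score(features_test,target_test) 

print('Best model accuracy:', result)

Best model accuracy: 0.8084577114427861


The recorded accuracy of the Random Forest model on the test data is 0.8085, above the 0.75 threshold. This figure was recorded when the model was fitted on the test sample itself, so it should be re-measured with the model fitted on the training sample, as in the cell above, before it is treated as a held-out estimate.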
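A test-set score is a fair estimate only if the model never saw the test rows during fitting. The sketch below illustrates the difference on synthetic data from `make_classification`, an assumption and not the Megaline data: one decision tree is fitted on the training rows and scored on the test rows, while another is fitted and scored on the same test rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 features and a binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Held-out estimate: fit on the training rows, score on unseen test rows
held_out_model = DecisionTreeClassifier(random_state=12345).fit(X_train, y_train)
held_out_accuracy = held_out_model.score(X_test, y_test)

# Leaky estimate: fit and score on the very same test rows
leaky_model = DecisionTreeClassifier(random_state=12345).fit(X_test, y_test)
leaky_accuracy = leaky_model.score(X_test, y_test)

print("Held-out accuracy:", held_out_accuracy)
print("Leaky accuracy:", leaky_accuracy)
```

An unrestricted tree can memorize the rows it was fitted on, so the leaky score is typically at or near 1.0 regardless of how well the model generalizes.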

## Sanity test

As a sanity check, calculate the share of Smart and Ultra plans in the dataset.

In [15]:
# Check the number of smart and ultra tariffs in the data set

df['is_ultra'].value_counts(normalize=True)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

About 69% of the observations in the dataset are on the Smart plan, so a constant model that always predicts Smart would reach about 69% accuracy. The Random Forest model reaches about 81%, which is clearly above this baseline, so the model passes the sanity check.
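The baseline used in this comparison is the accuracy of a constant model that always predicts the most frequent class. A minimal sketch of that calculation in plain Python; the function name `majority_baseline_accuracy` and the ten example labels are hypothetical, chosen to give roughly the same 70/30 balance as the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Hypothetical labels: 7 Smart (0) and 3 Ultra (1)
labels = [0] * 7 + [1] * 3
label, accuracy = majority_baseline_accuracy(labels)

print(label, accuracy)  # prints: 0 0.7
```

Applied to the real target, `majority_baseline_accuracy(df['is_ultra'])` would return the Smart class and the 0.69 share shown above.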