# Megaline plan recommendation <a class='tocSkip' ></a>

The purpose of this project is to classify the clients of the cellular telephone company **Megaline** to offer them one of the tariff plans (**Smart** or **Ultimate**) according to their consumption behavior. Such behavior takes into account the amount and number of calls made, number of messages sent and megabits (Mb) used monthly. We want to build a model that has an **accuracy of at least 75%** to classify each customer.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-and-data-examination" data-toc-modified-id="Load-and-data-examination-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load and data examination</a></span></li><li><span><a href="#Slicing-data-into-training,-validation-and-test-sets" data-toc-modified-id="Slicing-data-into-training,-validation-and-test-sets-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Slicing data into training, validation and test sets</a></span></li><li><span><a href="#Construction-and-evaluation-of-the-quality-of-different-models" data-toc-modified-id="Construction-and-evaluation-of-the-quality-of-different-models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Construction and evaluation of the quality of different models</a></span><ul class="toc-item"><li><span><a href="#Decision-Tree" data-toc-modified-id="Decision-Tree-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Decision Tree</a></span></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Random Forest</a></span></li><li><span><a href="#Logistic-regression" data-toc-modified-id="Logistic-regression-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Logistic regression</a></span><ul class="toc-item"><li><span><a href="#Conclusion,-choosing-the-best-model" data-toc-modified-id="Conclusion,-choosing-the-best-model-3.3.1"><span class="toc-item-num">3.3.1&nbsp;&nbsp;</span>Conclusion, choosing the best model</a></span></li></ul></li></ul></li><li><span><a href="#Check-the-quality-of-the-final-model-with-the-test-set" data-toc-modified-id="Check-the-quality-of-the-final-model-with-the-test-set-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Check the quality of the final model with the test set</a></span></li><li><span><a href="#Sanity-test-of-the-final-model" data-toc-modified-id="Sanity-test-of-the-final-model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Sanity test of the final model</a></span></li><li><span><a href="#Final-conclusions" data-toc-modified-id="Final-conclusions-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Final conclusions</a></span></li></ul></div>

# Load and data examination

The data is stored in the file `'users_behavior_upd.csv'` this data will be loaded into the DataFrame **`df`**. The DataFrame **`df`** consists of the following columns:

* `calls`: number of calls
* `minutes`: total duration of calls in minutes
* `messages`: number of text messages
* `mb_used`: Internet traffic used in megabits (MB)
* `is_ultimate`: plan for the current month (Ultimate - `'1'`, Smart - `'0'`)

Next, the data is loaded into **`df`**, its information and the first rows of the DataFrame are displayed.

In [2]:
# Libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier      # Decision tree
from sklearn.ensemble import RandomForestClassifier  # Random Forest
from sklearn.linear_model import LogisticRegression  # Logistic regression
from sklearn.model_selection import train_test_split # Subset data for train and test
from sklearn.metrics import accuracy_score           # Accuracy score

In [3]:
# Set working directory
%cd '/Users/jesusrfl/Yandex_coding_projects/phone_plan_recommendation'

# Load data into DataFrame 'df'
df = pd.read_csv('datasets/users_behavior.csv')

df.rename(columns={'is_ultra':'is_ultimate'}, inplace=True)

# DataFrame info
print(df.info())

# Head df
df.head()

/Users/jesusrfl/Yandex_coding_projects/phone_plan_recommendation
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   calls        3214 non-null   float64
 1   minutes      3214 non-null   float64
 2   messages     3214 non-null   float64
 3   mb_used      3214 non-null   float64
 4   is_ultimate  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


Unnamed: 0,calls,minutes,messages,mb_used,is_ultimate
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


Our DataFrame **`df`** consists of 3214 rows and 5 variables described above. There are no missing values so we will go straight to creating the classification models.

# Slicing data into training, validation and test sets

The models are intended to classify customers into two categories (i.e. binary classification): **Ultimate** and **Ultra**. This classification is encoded in the `is_ultimate` column so this is our `target` variable. The other variables are the characteristics (`features`) of each client that will be used by the models to automatically classify the clients.

To train, validate, and test our classification models, we are going to split the data into three sets. Because we do not have a test data set, the ratio for the training, validation, and test data will be 3:1:1, that is, 60, 20, and 20% respectively. According to the purpose of each set, the suffix of each dataset is the following:

* **`_train`**: training data set which corresponds to 60% of the original data
* **`_valid`**: validation set which corresponds to 20% of the original data
* **`_test`**: test set which corresponds to 20% of the original data

Since we require three data sets, we are going to split the original data twice. In the first section, 60% of the data is reserved for the training set. The remaining 40% of data will be divided in half (ie 20% and 20%) to originate the validation and training data.

In [4]:
# Train slicing 60%
df_train, df_valid_test = train_test_split(df, test_size = 0.4, random_state = 54321)

# Data slicing ('df_valid_test') validation (20%) y test (20%)
df_valid, df_test = train_test_split(df_valid_test, test_size = 0.5, random_state= 54321)

# tamaño de cada conjunto
print('Training set:  ', len(df_train))
print('Validation set:', len(df_valid))
print('Test set:      ', len(df_test))

Training set:   1928
Validation set: 643
Test set:       643


Once the sets have been obtained, we will proceed to select the variables to build and evaluate the models (`features`) as well as the objective variable of each set (`target`) which in this case is `is_ultimate`.

In [5]:
# Selection of features 'features_' from each set
features_train = df_train.drop('is_ultimate', axis=1) 
features_valid = df_valid.drop('is_ultimate', axis=1)
features_test =  df_test.drop('is_ultimate', axis=1)

# Selection of objective variable 'target' of each set
target_train = df_train['is_ultimate']
target_valid = df_valid['is_ultimate']
target_test =  df_test['is_ultimate']

Once the characteristics and objective of each data set have been defined, we proceed to build our classification models.

# Construction and evaluation of the quality of different models

Since the purpose of the project is a classification task, we are going to build three classification models:
1. [Decision Tree](#3.1)
2. [Random Forest](#3.2)
3. [Logistic Regression](#3.3)

Once built we will select the best model according to its accuracy percentage. We will seek to optimize the decision tree and random forest models by modifying their hyperparameters. The model with the best accuracy will be trained again with the training and validation data set and will be finally evaluated with the test set.

## Decision Tree

The first model we are going to build is a decision tree. This model will seek to optimize by modifying one of its hyperparameters: `max_depth` which determines how deep the decision tree is, that is, how many branches the algorithm makes to make a decision to classify each customer. The depth of the tree will be iterated with values from 1 to 10; its precision will be determined to select the tree whose depth of greatest precision.

In [6]:
# Characteristics of the optimal model
best_ad_model = None # Select the model with the optimal hyperparameters
best_ad_score = 0    # Calculate the precision of the optimal model
best_ad_depth = 0    # Indicates the optimal depth of the model

# Build classification trees iterating depths from 1 to 10
for depth in range(1, 11):  # Set the range of the max_depht hyperparameter
    ad_model = DecisionTreeClassifier(max_depth=depth, random_state=54321) # Set max_depth
    ad_model.fit(features_train, target_train) # Fits model to the training data
    predictions = ad_model.predict(features_train) # Predictions
    score = accuracy_score(target_train, predictions) # Accuracy score
    if score > best_ad_score:
        best_ad_model = ad_model
        best_ad_score = score
        best_ad_depth = depth

print("Accuracy of the best model in the validation set:", best_ad_score) 
print("Optimum depth:", best_ad_depth)

Accuracy of the best model in the validation set: 0.883298755186722
Optimum depth: 10


The best-fit decision tree has a depth of 10 and offers an **88% accuracy**.

## Random Forest

In this section we are going to build models by modifying the hyperparameters that determine the number of trees in the forest (`n_estimators`) and the maximum depth of each tree (`max_depth`). The `n_estimators` hyperparameter will evaluate from 10 to 100 in intervals of 10 ie (10, 20, 30, . . . 100 trees) and `max_depth` will have a range from 1 to 10.

To determine the model with the appropriate hyperparameters, the accuracy of each model will be compared using the validation data.

In [7]:
# Characteristics of the optimal model
best_ba_score = 0 # Accuracy
best_est = 0      # Number of estimators (trees)
best_ba_depth = 0 # Optimum depth

for est in range(1, 101, 10): # Select n_estimators
    for depth in range(1,11): # Select max_depht
        ba_model = RandomForestClassifier(random_state=54321, n_estimators=est, max_depth=depth)
        ba_model.fit(features_train, target_train) # Model fitting
        score = ba_model.score(features_valid, target_valid) # Accuracy over valid set
        if score > best_ba_score:
            best_ba_score = score # Best model according its accuracy
            best_est = est        # Optimal number of trees
            best_ba_depth = depth # Optimal depth

print("Accuracy of the best model in the validation set (n_estimators = {}, max_depth = {}): {}".format(best_est, best_ba_depth, best_ba_score))

Accuracy of the best model in the validation set (n_estimators = 11, max_depth = 8): 0.7978227060653188


The optimal random forest model consists of **11** decision trees (`n_estimators`) with a maximum depth of **8** (`max_depth`). This model offers **~80% accuracy**.

## Logistic regression

Finally we are going to build a logistic regression model. In this model we are not going to change any hyperparameter, we are only going to define the `solver` hyperparameter as `'liblinear'`. Once our model is built we will determine its precision and choose the model with the highest precision.

In [8]:
# Model build
rl_model = LogisticRegression(random_state=54321, solver='liblinear') # Define model
rl_model.fit(features_train, target_train) # Model fit
rl_predictions = rl_model.predict(features_valid) # Predictions
rl_score = accuracy_score(target_valid, rl_predictions) # Accuracy 
print('Exactitud del modelo de regresión logística:', rl_score)

Exactitud del modelo de regresión logística: 0.6780715396578538


### Conclusion, choosing the best model

The **logistic regression model has an accuracy of ~68%** so it does not exceed the required threshold of 75%. **The random forest model had the highest accuracy score at 88%** beating the random forest model which scored 80% accurate.

In the next section, we are going to check the quality of our random forest model by training it with the combined training and validation datasets and then evaluate its accuracy with the test dataset.

# Check the quality of the final model with the test set

We have already found that the decision tree model with a depth of 10 is the model that offers the highest precision; now we are going to train it again but using the combined training and validation sets to later evaluate the model using the test data set (`_test`).

First, let's concatenate the training (`df_train`) and validation (`df_valid`) data into the DataFrame (`df_tv`) and get the feature (`tv_features`) and target (`tv_target`) variables. The prefix or suffix `'tv'` refers to the combined set 'train-valid'.

In [9]:
# Concatenation of training and validation sets in 'df_tv'
df_tv = pd.concat([df_train, df_valid], axis=0)
print('Tamaño de conjunto de entrenamiento para modelo final:', len(df_tv))

# Selection of features 'features_tv'
features_tv = df_tv.drop('is_ultimate', axis=1)

# Selection of target 'target_tv'
target_tv = df_tv['is_ultimate']

Tamaño de conjunto de entrenamiento para modelo final: 2571


Now we proceed to fit the final model **`final_model`** with the combined training and validation data and check its accuracy with the test data set.

In [10]:
# Definition of the decision tree model with a depth of 10
final_ad_model = DecisionTreeClassifier(max_depth=10, random_state=54321)

# Model fitting with combined data 
final_ad_model.fit(features_tv, target_tv) 

# Model accuracy over test set
print(final_ad_model.score(features_test, target_test))

0.80248833592535


Our final decision tree model, re-trained with the combined training and validation set, is **80% accurate**. This accuracy percentage exceeds the 75% accuracy threshold required for our project.

# Sanity test of the final model

We are going to perform a sanity test of our final decision tree model. To carry out this test, we are going to verify that the accuracy of the final model of the classification tree shows a similar accuracy with the training set and the test set.

First we are going to look at the proportion of customers for each plan using our test dataset (`target_test`).

In [11]:
# How many customers are in each category (Smart - 0, Ultimate - 1)
print(pd.DataFrame(target_test).value_counts('is_ultimate', normalize = True))

is_ultimate
0    0.724728
1    0.275272
dtype: float64


The proportions of the categories predicted by our model are similar. We are going to analyze if the accuracy score differs between the training set and the test set.

In [13]:
train_predictions = final_ad_model.predict(features_tv)
test_predictions = final_ad_model.predict(features_test)

print('Accuracy')
print('Training set:', accuracy_score(target_tv, train_predictions))
print('Test set:    ', accuracy_score(target_test, test_predictions))

Accuracy
Training set: 0.8774795799299884
Test set:     0.80248833592535


Our model is better with the training set than with the test set, however it still exceeds the accuracy threshold required for the project.

# Final conclusions

We carry out three classification models: Decision Tree, Random Forest and Logistic Regression to automatically classify the appropriate plan for each client according to their consumption behavior. The customer data was divided into subsets to train, validate, and test each model. The model with the best precision was the Decision Tree with an accuracy over the test set of 80%, exceeding the threshold of 75% required. The proportions of customers under each category ('Smart' or 'Ultimate') predicted by our final model are similar to those observed in the test set passing the sanity test.