<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-description" data-toc-modified-id="Project-description-1">Project description</a></span></li><li><span><a href="#Getting-to-know-data" data-toc-modified-id="Getting-to-know-data-2">Getting to know data</a></span></li><li><span><a href="#Model-selection" data-toc-modified-id="Model-selection-3">Model selection</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Decision-Tree-Classifier" data-toc-modified-id="Decision-Tree-Classifier-3.0.1">Decision Tree Classifier</a></span></li><li><span><a href="#Random-Forest-Classifier" data-toc-modified-id="Random-Forest-Classifier-3.0.2">Random Forest Classifier</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.0.3">Logistic Regression</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-3.0.4">Conclusion</a></span></li></ul></li></ul></li><li><span><a href="#Accuracy-test-with-test-dataset" data-toc-modified-id="Accuracy-test-with-test-dataset-4">Accuracy test with test dataset</a></span></li><li><span><a href="#Sanity-check" data-toc-modified-id="Sanity-check-5">Sanity check</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6">Conclusion</a></span></li></ul></div>

# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.<br>
The dataset has behavior data about subscribers who have already switched to the new plans. The assumption is that the data doesn't need preparation.<br>
<b>The task</b> is to develop a model with the highest possible accuracy. In this project, the accuracy threshold is 0.75.<br>
<b>Data description</b><br>
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:<br>
<i>calls</i> — number of calls<br>
<i>minutes</i> — total call duration in minutes<br>
<i>messages</i> — number of text messages<br>
<i>mb_used</i> — Internet traffic used in MB<br>
<i>is_ultra</i> — plan for the current month (Ultra - 1, Smart - 0).

# Getting to know data

In [1]:
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve

In [2]:
df = pd.read_csv('users_behavior.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.sample(5)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 2214 | 69.0 | 538.58 | 64.0 | 28639.49 | 0 |
| 65 | 98.0 | 677.18 | 46.0 | 16285.67 | 0 |
| 822 | 48.0 | 369.4 | 106.0 | 4520.18 | 0 |
| 3015 | 42.0 | 285.25 | 34.0 | 11348.42 | 0 |
| 1789 | 17.0 | 136.21 | 48.0 | 24729.33 | 1 |


Let's convert the "calls" and "messages" columns to integers, since both are counts.

In [5]:
df['calls'] = df['calls'].astype('int')
df['messages'] = df['messages'].astype('int')
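
Before casting, it can be worth confirming that the columns hold only whole numbers, because `astype('int')` silently truncates any fractional part. The sketch below is not part of the original analysis; it uses a toy frame built from the five sampled rows shown above, and the name `sample` is illustrative.

```python
import pandas as pd

# Toy frame built from the five sampled rows (illustrative, not the full dataset)
sample = pd.DataFrame({
    'calls': [69.0, 98.0, 48.0, 42.0, 17.0],
    'messages': [64.0, 46.0, 106.0, 34.0, 48.0],
})

for column in ['calls', 'messages']:
    # True only if every value has no fractional part
    is_whole = (sample[column] % 1 == 0).all()
    print(column, 'contains only whole numbers:', is_whole)
    if is_whole:
        sample[column] = sample[column].astype('int')
```

The same check can be run on `df` before the cast in the cell above.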

Let's check for duplicates in the dataset.

In [6]:
df.duplicated().sum()

0

<b>Conclusion</b>: the dataset has 3214 observations and 5 columns: 4 features and the target, "is_ultra". There are no missing or duplicated values. We converted the count columns ("calls" and "messages") to integers and left "minutes" and "mb_used" as floats to preserve their fractional parts for the model.

# Model selection

We will split the data into training, validation and test sets in the following ratio: 3:1:1 respectively.

In [7]:
# Splitting into 80% and 20%
remainder, test = train_test_split(df, test_size=0.2, random_state=1)
# splitting 80% into 75% and 25%, thus getting 60% and 20% of total respectively
train, valid = train_test_split(remainder, test_size=0.25, random_state=1)
print(train.shape, test.shape, valid.shape)

(1928, 5) (643, 5) (643, 5)
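
The arithmetic behind the two-step split can be checked by hand. The sketch below assumes that the test part of each split is rounded up to a whole row, which is consistent with the shapes printed above; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (rows kept, rows split off), rounding the split-off part up."""
    n_split_off = math.ceil(n_rows * test_fraction)
    return n_rows - n_split_off, n_split_off

n_total = 3214
n_remainder, n_test = split_sizes(n_total, 0.2)     # first split: 80% / 20%
n_train, n_valid = split_sizes(n_remainder, 0.25)   # second split: 75% / 25% of the remainder

print(n_train, n_valid, n_test)  # 1928 643 643
print(round(n_train / n_total, 3), round(n_valid / n_total, 3), round(n_test / n_total, 3))
```

Taking 25% of the remaining 80% gives 20% of the total, which is how the 3:1:1 ratio comes about.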


In [8]:
features_train = train.drop(['is_ultra'], axis=1)
features_test = test.drop(['is_ultra'], axis=1)
features_valid = valid.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']
target_test = test['is_ultra']
target_valid = valid['is_ultra']

### Decision Tree Classifier

Let's consider the Decision Tree Classifier model. We will compare the quality of the model for different values of the 'max_depth' hyper-parameter, which is the maximum depth of the tree.<br>
Note: the quality of the model here is its accuracy, i.e. the ratio of correct predictions to the total number of predictions.
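
The quality measure described in the note can be sketched in a few lines of plain Python. The labels below are made up for illustration; for scikit-learn classifiers, `model.score` reports the same quantity (mean accuracy).

```python
def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels for illustration only (1 = Ultra, 0 = Smart)
true_labels = [1, 0, 0, 1, 0, 1, 0, 0]
predicted_labels = [1, 0, 1, 1, 0, 0, 0, 0]

result = accuracy(true_labels, predicted_labels)
print(result)  # 0.75: 6 of 8 predictions are correct
```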

In [9]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=123, max_depth=depth)
    model.fit(features_train, target_train)
    print('Tree depth', depth)
    print('Model score on the training set  ', '{:.3}'.format(model.score(features_train, target_train)))
    print('Model score on the validation set', '{:.3}'.format(model.score(features_valid, target_valid)))
    #print('Precision / recall pair:', precision_recall_curve(target_test, model.predict(features_test)), sep='\n')

Tree depth 1
Model score on the training set   0.768
Model score on the validation set 0.742
Tree depth 2
Model score on the training set   0.794
Model score on the validation set 0.762
Tree depth 3
Model score on the training set   0.805
Model score on the validation set 0.785
Tree depth 4
Model score on the training set   0.821
Model score on the validation set 0.795
Tree depth 5
Model score on the training set   0.828
Model score on the validation set 0.792


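The best validation score (0.795) was achieved with the 'max_depth' parameter set to 4. At depth 5 the training score keeps rising (0.828) while the validation score dips to 0.792, which suggests the tree is starting to overfit.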
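
The comparison can also be read off programmatically. The sketch below copies the scores printed above into a dictionary (the name `scores` is illustrative) and selects the depth by validation score rather than training score, since the training score rises with depth regardless of how well the tree generalizes.

```python
# Scores copied from the output above: depth -> (training score, validation score)
scores = {
    1: (0.768, 0.742),
    2: (0.794, 0.762),
    3: (0.805, 0.785),
    4: (0.821, 0.795),
    5: (0.828, 0.792),
}

# Select the depth with the highest validation score
best_depth = max(scores, key=lambda depth: scores[depth][1])
print('Best depth by validation score:', best_depth)

# The gap between training and validation scores hints at overfitting
for depth, (train_score, valid_score) in scores.items():
    print('depth', depth, 'gap', round(train_score - valid_score, 3))
```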

### Random Forest Classifier

The next model that we will try is the Random Forest Classifier. For this model, we will compare the model quality for different values of the 'n_estimators' hyper-parameter, which is the number of trees in the forest.

In [10]:
for estim in range(1, 30, 4):
    model = RandomForestClassifier(random_state=12345, n_estimators=estim)
    model.fit(features_train, target_train)
    print('Selected "n-estimators" hyper-parameter:', estim)
    print('Model score on the training set  ', '{:.3}'.format(model.score(features_train, target_train)))
    print('Model score on the validation set', '{:.3}'.format(model.score(features_valid, target_valid)))

Selected "n-estimators" hyper-parameter: 1
Model score on the training set   0.912
Model score on the validation set 0.72
Selected "n-estimators" hyper-parameter: 5
Model score on the training set   0.97
Model score on the validation set 0.77
Selected "n-estimators" hyper-parameter: 9
Model score on the training set   0.988
Model score on the validation set 0.792
Selected "n-estimators" hyper-parameter: 13
Model score on the training set   0.993
Model score on the validation set 0.796
Selected "n-estimators" hyper-parameter: 17
Model score on the training set   0.994
Model score on the validation set 0.804
Selected "n-estimators" hyper-parameter: 21
Model score on the training set   0.997
Model score on the validation set 0.793
Selected "n-estimators" hyper-parameter: 25
Model score on the training set   0.997
Model score on the validation set 0.804
Selected "n-estimators" hyper-parameter: 29
Model score on the training set   0.998
Model score on the validation set 0.795


The best validation score (0.804) was achieved with the 'n_estimators' parameter set to both 17 and 25. We will proceed with 25.
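
Because several settings can share the top score, it helps to list every value that reaches the maximum instead of stopping at the first one. A small sketch using the validation scores printed above (the variable names are illustrative):

```python
# Validation scores copied from the output above: n_estimators -> score
valid_scores = {
    1: 0.72, 5: 0.77, 9: 0.792, 13: 0.796,
    17: 0.804, 21: 0.793, 25: 0.804, 29: 0.795,
}

best_score = max(valid_scores.values())
# All settings that reach the best score, in ascending order
tied = sorted(n for n, score in valid_scores.items() if score == best_score)
print(best_score, tied)  # 0.804 [17, 25]
```

Which of the tied values to keep is a judgment call; this notebook continues with 25.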

### Logistic Regression

Another model that we will try is Logistic Regression. We will iterate over the possible values of the 'solver' hyper-parameter, which selects the optimization algorithm used to fit the model.

In [11]:
for solver in {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}:
    model = LogisticRegression(random_state=123, solver=solver)
    model.fit(features_train, target_train)
    print('Selected "solver" hyper-parameter:', solver)
    print('Model score on the training set  ', '{:.3}'.format(model.score(features_train, target_train)))
    print('Model score on the validation set', '{:.3}'.format(model.score(features_valid, target_valid)))

Selected "solver" hyper-parameter: newton-cg
Model score on the training set   0.751
Model score on the validation set 0.756
Selected "solver" hyper-parameter: saga
Model score on the training set   0.707
Model score on the validation set 0.68
Selected "solver" hyper-parameter: sag
Model score on the training set   0.707
Model score on the validation set 0.68
Selected "solver" hyper-parameter: lbfgs
Model score on the training set   0.751
Model score on the validation set 0.756
Selected "solver" hyper-parameter: liblinear
Model score on the training set   0.751
Model score on the validation set 0.75


The best validation score (0.756) was achieved with both the 'newton-cg' and 'lbfgs' solvers. The 'sag' and 'saga' solvers scored noticeably lower (0.68).
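
One possible reason for the weaker 'sag' and 'saga' results, not verified in this notebook, is feature scale: the scikit-learn documentation notes that fast convergence of these solvers is only guaranteed when features have approximately the same scale, and convergence warnings are suppressed here. The sketch below shows how scaling could be added with a pipeline. It runs on synthetic data, and the feature values and the labeling rule are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales
X = np.column_stack([rng.normal(60, 30, 500), rng.normal(17000, 7000, 500)])
# Made-up labeling rule, used only to get a learnable target
y = (X[:, 1] + 100 * X[:, 0] > 23000).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Standardize features before fitting so that 'saga' sees comparable scales
scaled_saga = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', max_iter=1000, random_state=123),
)
scaled_saga.fit(X_train, y_train)
valid_accuracy = scaled_saga.score(X_valid, y_valid)
print('Validation accuracy with scaling:', round(valid_accuracy, 3))
```

Whether scaling would close the gap on the Megaline data is an open question that would need to be checked on `features_train` and `features_valid`.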

### Conclusion

Among the three models that we have tried, the Random Forest Classifier achieved the best validation accuracy (0.804) with the number of trees in the forest set to 25.

# Accuracy test with test dataset

Now we will test the selected model on the test dataset.

In [12]:
model = RandomForestClassifier(random_state=12345, n_estimators=25)
model.fit(features_train, target_train)
print('Model score on the test set  ', '{:.3}'.format(model.score(features_test, target_test)))

Model score on the test set   0.801


The model scored 0.801 on the test set, which satisfies our threshold of 0.75.

# Sanity check

Now we will perform a sanity check: we train an arbitrarily chosen alternative model and check whether the scores of the Random Forest Classifier are plausible by comparison. We will use the Gaussian Naive Bayes classifier for this task.

In [13]:
model = GaussianNB()
model.fit(features_train, target_train)
print('Model score on the training set  ', '{:.3}'.format(model.score(features_train, target_train)))
print('Model score on the validation set', '{:.3}'.format(model.score(features_valid, target_valid)))
print('Model score on the test set      ', '{:.3}'.format(model.score(features_test, target_test)))

Model score on the training set   0.789
Model score on the validation set 0.785
Model score on the test set       0.778


The scores we got (0.778 to 0.789) are decent and in line with the scores of the other models. The Random Forest Classifier that we chose scores somewhat higher on the test set (0.801 vs. 0.778), so its result looks plausible.
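
A common complement to this kind of check, not performed in this notebook, is a constant baseline that always predicts the most frequent class. A model is only useful if it beats that baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the class balance shown is illustrative and is not the balance of the Megaline data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: the features are ignored by the baseline, only the labels matter
features = np.zeros((10, 4))
target = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features, target)
baseline_accuracy = baseline.score(features, target)
print(baseline_accuracy)  # 0.7, the share of the majority class
```

Fitting the same baseline on `features_train` and scoring it on `features_test` would give the figure to compare against 0.801.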

# Conclusion

Among the three chosen models, <b>Logistic Regression</b> is the least accurate and barely passes our 0.75 accuracy threshold (0.756 on the validation set with the 'newton-cg' and 'lbfgs' solvers). The <b>Random Forest Classifier</b> is the most accurate, reaching 0.804 on the validation set with the "n_estimators" parameter set to 25 (and also to 17). The <b>Decision Tree Classifier</b> clears the 0.75 threshold on the validation set with the "max_depth" hyper-parameter set to 2 and above. Its highest validation accuracy, 0.795, was observed with the tree depth set to 4.<br>
We select the Random Forest Classifier, which scored 0.801 on the test set. It is slower than the Decision Tree Classifier, but speed was not a limiting factor in this exercise.