# Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

## 1. General View of the Behavior Data
**Data description:**

Every observation in the dataset contains monthly behavior information about one user. 

The dataset contains the following columns:

- calls: number of calls
- minutes: total call duration in minutes
- messages: number of text messages
- mb_used: Internet traffic used, in MB
- is_ultra: plan for the current month (Ultra = 1, Smart = 0)

In [1]:
# Import the packages
import pandas as pd 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from joblib import dump
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Open the dataset
data = pd.read_csv('users_behavior.csv')
data.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [3]:
# Look at the Information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# No null values spotted
# Check duplicates
data.duplicated().sum()

0

In [5]:
# Check the correlations between the features
data.corr()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| calls | 1.0 | 0.982083 | 0.177385 | 0.286442 | 0.207122 |
| minutes | 0.982083 | 1.0 | 0.17311 | 0.280967 | 0.206955 |
| messages | 0.177385 | 0.17311 | 1.0 | 0.195721 | 0.20383 |
| mb_used | 0.286442 | 0.280967 | 0.195721 | 1.0 | 0.198568 |
| is_ultra | 0.207122 | 0.206955 | 0.20383 | 0.198568 | 1.0 |


In [6]:
# calls and minutes are almost perfectly correlated (about 0.98), so one of them
# carries little extra information; drop calls to simplify the feature set
data = data.drop(['calls'], axis = 1)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   minutes   3214 non-null   float64
 1   messages  3214 non-null   float64
 2   mb_used   3214 non-null   float64
 3   is_ultra  3214 non-null   int64  
dtypes: float64(3), int64(1)
memory usage: 100.6 KB
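The choice of which column to drop was made by reading the correlation table. A minimal sketch of doing the same check programmatically is given below. The helper name `highly_correlated_pairs`, the 0.95 threshold and the toy values are assumptions for illustration only; they are not taken from the Megaline dataset.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| above threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so that each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    pairs = []
    for col_a in upper.index:
        for col_b in upper.columns:
            r = upper.loc[col_a, col_b]
            if pd.notna(r) and r > threshold:
                pairs.append((col_a, col_b, float(r)))
    return pairs

# Hypothetical toy data: minutes is roughly proportional to calls
toy = pd.DataFrame({
    'calls': [10, 20, 30, 40, 50, 60],
    'minutes': [71, 139, 212, 275, 356, 418],
    'messages': [5, 40, 12, 33, 8, 21],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data, calling the helper on the original frame should flag only the calls/minutes pair, since every other correlation in the table is below 0.3.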


## 2. Training, Testing and Validation Sets

In [7]:
data_train_validation, data_test = train_test_split(data, test_size = 0.2, random_state = 12345)
data_train, data_validation = train_test_split(data_train_validation, test_size = 0.25, random_state = 12345)

In [8]:
print('The size of training, testing and validation sets:',len(data_train), len( data_test), len(data_validation))

The size of training, testing and validation sets: 1928 643 643


In [9]:
# Define features and target of the sets
# Training set
train_features = data_train.drop(['is_ultra'], axis=1)
train_target = data_train['is_ultra']
# Testing set
test_features = data_test.drop(['is_ultra'], axis=1)
test_target = data_test['is_ultra']
# Validation set
validation_features = data_validation.drop(['is_ultra'], axis=1)
validation_target = data_validation['is_ultra']
# Training + Validation
train_validation_features = data_train_validation.drop(['is_ultra'], axis=1)
train_validation_target = data_train_validation['is_ultra']

**Description:**

The data has been split into 3 datasets: *training, testing and validation*, with sizes in the ratio *3:1:1* (1928, 643 and 643 rows).
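The 3:1:1 ratio follows from the two consecutive splits: 20% of the rows are held out for testing, and 25% of the remaining 80% (that is, 20% of the whole) are held out for validation. The arithmetic can be sketched as follows. The function name is hypothetical, and rounding the held-out share up is an assumption that happens to reproduce the sizes printed above.

```python
import math

def two_step_split_sizes(n_rows, test_share=0.2, validation_share=0.25):
    """Sizes of (train, validation, test) after two consecutive hold-out splits."""
    # Assumption: the held-out share is rounded up, as the printed sizes suggest
    n_test = math.ceil(n_rows * test_share)
    n_rest = n_rows - n_test
    n_validation = math.ceil(n_rest * validation_share)
    n_train = n_rest - n_validation
    return n_train, n_validation, n_test

sizes = two_step_split_sizes(3214)
print(sizes)  # (1928, 643, 643)
```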

## 3. Sanity Check

In [10]:
data['is_ultra'].value_counts(normalize=True)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

In [11]:
# Fit the dummy model to the training dataset
dummy = DummyClassifier(strategy = 'constant', constant = 0)
dummy.fit(train_features, train_target)
# Test its accuracy with validation set
validation_predictions = dummy.predict(validation_features)
accuracy_dummy = accuracy_score(validation_target, validation_predictions)
print('dummy model:', accuracy_dummy)

dummy model: 0.6889580093312597


In [12]:
# Test its accuracy with testing set
dummy = DummyClassifier(strategy = 'constant', constant = 0)
dummy.fit(train_validation_features, train_validation_target)
test_predictions = dummy.predict(test_features)
accuracy_dummy = accuracy_score(test_target, test_predictions)
print('dummy model:', accuracy_dummy)

dummy model: 0.6951788491446346


**Description:**

The **dummy** model with the *constant* strategy (always predicting 0) reaches an accuracy of **69.52%** on the test set, close to the share of class 0 in the whole dataset **(69.35%)**.

This is expected: *a constant model that always predicts the majority class has an accuracy equal to the share of that class in the set it is evaluated on*. Any useful model has to beat this baseline.
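The relationship between a constant model and the class share can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels, not on the Megaline data; the function name is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    predictions = [majority_label] * len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)
    # By construction the accuracy equals the majority-class share
    assert accuracy == majority_count / len(labels)
    return majority_label, accuracy

# Hypothetical labels: 7 Smart (0) and 3 Ultra (1)
toy_labels = [0] * 7 + [1] * 3
label, acc = majority_baseline_accuracy(toy_labels)
print(label, acc)  # 0 0.7
```

The validation and test accuracies of the dummy model (68.90% and 69.52%) differ slightly from 69.35% because each subset has its own share of class 0.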

## 4. Testing the Models

### 4.1 DecisionTreeClassifier

In [13]:
# DecisionTreeClassifier
for i in range(1, 11):
    # Train the models of different max_depth values with training set
    model = DecisionTreeClassifier(random_state = 12345, max_depth = i)
    model.fit(train_features, train_target)
    # check the models of different max_depth values with the validation set
    validation_predictions = model.predict(validation_features)
    accuracy = accuracy_score(validation_target, validation_predictions)
    print('max_depth =', i)
    print('Accuracy based on validation set:', accuracy)
    print()

max_depth = 1
Accuracy based on validation set: 0.7387247278382582

max_depth = 2
Accuracy based on validation set: 0.7573872472783826

max_depth = 3
Accuracy based on validation set: 0.7651632970451011

max_depth = 4
Accuracy based on validation set: 0.7651632970451011

max_depth = 5
Accuracy based on validation set: 0.7667185069984448

max_depth = 6
Accuracy based on validation set: 0.7667185069984448

max_depth = 7
Accuracy based on validation set: 0.7682737169517885

max_depth = 8
Accuracy based on validation set: 0.7636080870917574

max_depth = 9
Accuracy based on validation set: 0.7573872472783826

max_depth = 10
Accuracy based on validation set: 0.76049766718507



In [14]:
# Choose max_depth = 7 as the final model to be tested
model_DecisionTree =  DecisionTreeClassifier(random_state = 12345, max_depth = 7)
model_DecisionTree.fit(train_validation_features, train_validation_target)
# Test its accuracy with testing set
test_predictions = model_DecisionTree.predict(test_features)
accuracy_DecisionTree = accuracy_score(test_target, test_predictions)
print('DecisionTreeClassifier:', accuracy_DecisionTree)

DecisionTreeClassifier: 0.776049766718507


**Description:**

The **DecisionTreeClassifier** model with *max_depth = 7* gave the highest validation accuracy (76.83%) and reaches **77.60%** on the test set after retraining on the combined training and validation data.
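Instead of reading the best depth off the printed list, the search can keep the scores and pick the maximum itself. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so the selected depth will not match the notebook), and the helper name `best_depth` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset)
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

def best_depth(depths):
    """Return (depth, accuracy) for the depth with the highest validation accuracy."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(X_train, y_train)
        scores[depth] = accuracy_score(y_valid, model.predict(X_valid))
    # max() returns the first maximum, so ties resolve to the shallowest tree
    depth = max(scores, key=scores.get)
    return depth, scores[depth]

depth, score = best_depth(range(1, 11))
print(depth, round(score, 4))
```

Resolving ties in favor of the shallowest tree is a design choice: a smaller tree with the same validation accuracy is usually less prone to overfitting.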

### 4.2 RandomForestClassifier

In [15]:
# RandomForestClassifier
for i in range(1, 11):
    # Train the models of different n_estimators values with training set
    model = RandomForestClassifier(random_state = 12345, n_estimators = i * 10)
    model.fit(train_features, train_target)
    # check the models of different n_estimators values with the validation set
    validation_predictions = model.predict(validation_features)
    accuracy = accuracy_score(validation_target, validation_predictions)
    print('n_estimators =', i * 10)
    print('Accuracy based on validation set:', accuracy)
    print()

n_estimators = 10
Accuracy based on validation set: 0.7869362363919129

n_estimators = 20
Accuracy based on validation set: 0.7978227060653188

n_estimators = 30
Accuracy based on validation set: 0.7947122861586314

n_estimators = 40
Accuracy based on validation set: 0.7931570762052877

n_estimators = 50
Accuracy based on validation set: 0.7900466562986003

n_estimators = 60
Accuracy based on validation set: 0.7900466562986003

n_estimators = 70
Accuracy based on validation set: 0.7916018662519441

n_estimators = 80
Accuracy based on validation set: 0.7900466562986003

n_estimators = 90
Accuracy based on validation set: 0.7900466562986003

n_estimators = 100
Accuracy based on validation set: 0.7869362363919129



In [16]:
# RandomForestClassifier
# Specify the n_estimators = 20
for i in range(1, 11):
    # Train the models of different max_depth values with training set
    model = RandomForestClassifier(random_state = 12345, n_estimators = 20, max_depth = i)
    model.fit(train_features, train_target)
    # check the models of different max_depth values with the validation set
    validation_predictions = model.predict(validation_features)
    accuracy = accuracy_score(validation_target, validation_predictions)
    print('max_depth =', i)
    print('Accuracy based on validation set:', accuracy)
    print()

max_depth = 1
Accuracy based on validation set: 0.7200622083981337

max_depth = 2
Accuracy based on validation set: 0.7713841368584758

max_depth = 3
Accuracy based on validation set: 0.7698289269051322

max_depth = 4
Accuracy based on validation set: 0.7729393468118196

max_depth = 5
Accuracy based on validation set: 0.7698289269051322

max_depth = 6
Accuracy based on validation set: 0.7838258164852255

max_depth = 7
Accuracy based on validation set: 0.7776049766718507

max_depth = 8
Accuracy based on validation set: 0.7869362363919129

max_depth = 9
Accuracy based on validation set: 0.7947122861586314

max_depth = 10
Accuracy based on validation set: 0.7947122861586314



In [17]:
# Choose n_estimators = 20, max_depth = 9 & 10 as the final models to be tested
for i in [9, 10]:
    model_RandomForest =  RandomForestClassifier(random_state = 12345, n_estimators = 20, max_depth = i)
    model_RandomForest.fit(train_validation_features, train_validation_target)
    # Test its accuracy with testing set
    test_predictions = model_RandomForest.predict(test_features)
    accuracy_RandomForest = accuracy_score(test_target, test_predictions)
    print('n_estimators = 20, max_depth =', i)
    print('RandomForestClassifier:', accuracy_RandomForest)

n_estimators = 20, max_depth = 9
RandomForestClassifier: 0.7978227060653188
n_estimators = 20, max_depth = 10
RandomForestClassifier: 0.7884914463452566


**Description:**

The **RandomForestClassifier** model with *n_estimators = 20* and *max_depth = 9* has an accuracy of **79.78%** on the test set.
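The forest was tuned in two passes: first *n_estimators* with unlimited depth, then *max_depth* with *n_estimators* fixed at 20. An alternative is to evaluate the two parameters jointly, which can find combinations the sequential search never visits. The sketch below is a minimal illustration on synthetic data with a deliberately small grid; both the data and the grid values are assumptions.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Megaline dataset)
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = []
# Evaluate every (n_estimators, max_depth) combination on the validation set
for n_estimators, max_depth in product([10, 20, 30], [3, 6, 9]):
    model = RandomForestClassifier(random_state=12345,
                                   n_estimators=n_estimators,
                                   max_depth=max_depth)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    results.append((n_estimators, max_depth, accuracy))

best_n, best_depth_rf, best_accuracy = max(results, key=lambda r: r[2])
print(best_n, best_depth_rf, round(best_accuracy, 4))
```

The cost grows with the product of the grid sizes, so the full 10 x 10 grid from the notebook would train 100 forests instead of 20.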

### 4.3 LogisticRegression

In [18]:
# LogisticRegression
for method in ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']:
    # Train the models of different solver values with training set
    model = LogisticRegression(random_state = 12345, solver = method)
    model.fit(train_features, train_target)
    # check the models of different solver values with the validation set
    validation_predictions = model.predict(validation_features)
    accuracy = accuracy_score(validation_target, validation_predictions)
    print('solver =', method)
    print('Accuracy based on validation set:', accuracy)
    print()

solver = newton-cg
Accuracy based on validation set: 0.7262830482115086

solver = lbfgs
Accuracy based on validation set: 0.7262830482115086

solver = liblinear
Accuracy based on validation set: 0.7013996889580093

solver = sag
Accuracy based on validation set: 0.6920684292379471

solver = saga
Accuracy based on validation set: 0.6920684292379471



In [19]:
# Choose solver = newton-cg and solver = lbfgs as the final models to be tested
for method in ['newton-cg', 'lbfgs']:
    model_LogisticRegression =  LogisticRegression(random_state = 12345, solver = method)
    model_LogisticRegression.fit(train_validation_features, train_validation_target)
    # Test its accuracy with testing set
    test_predictions = model_LogisticRegression.predict(test_features)
    accuracy_LogisticRegression = accuracy_score(test_target, test_predictions)
    print('solver =', method)
    print('LogisticRegression:', accuracy_LogisticRegression)

solver = newton-cg
LogisticRegression: 0.76049766718507
solver = lbfgs
LogisticRegression: 0.76049766718507


**Description:**

The **LogisticRegression** models with *solver = newton-cg* and *solver = lbfgs* both have an accuracy of **76.05%** on the test set.
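The *sag* and *saga* solvers scored close to the dummy baseline on the validation set. One plausible reason, offered here as an assumption rather than a verified diagnosis, is feature scale: the scikit-learn documentation notes that fast convergence of these two solvers is only guaranteed when features have approximately the same scale, and *mb_used* is orders of magnitude larger than *messages*. The sketch below shows how standardization could be added with a pipeline. It runs on synthetic data with one artificially inflated column, and whether it helps on the Megaline data has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Megaline dataset)
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
# Inflate one column to mimic a feature on a much larger scale
X[:, 2] *= 10000
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Standardize the features before they reach the solver
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='saga', max_iter=1000),
)
scaled_model.fit(X_train, y_train)
scaled_accuracy = accuracy_score(y_valid, scaled_model.predict(X_valid))
print('saga with scaling:', round(scaled_accuracy, 4))
```

Because the scaler sits inside the pipeline, it is fitted on the training rows only, so no information from the validation set leaks into the preprocessing.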

## 5. Save the Model with the Highest Accuracy

The **RandomForestClassifier** model with *n_estimators = 20* and *max_depth = 9* has the highest accuracy on the test set among all the models studied.

In [20]:
# Save model
model_RandomForest =  RandomForestClassifier(random_state = 12345, n_estimators = 20, max_depth = 9)
model_RandomForest.fit(train_validation_features, train_validation_target)
dump(model_RandomForest, 'model_RandomForest.joblib')

['model_RandomForest.joblib']
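A saved model can be restored later with `joblib.load` and used for predictions without retraining. The round trip is sketched below on synthetic data and in a temporary directory, both of which are assumptions made to keep the example self-contained.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset)
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
model = RandomForestClassifier(random_state=12345, n_estimators=20, max_depth=9)
model.fit(X, y)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_RandomForest.joblib')
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)  # True
```

The file should be loaded with a scikit-learn version compatible with the one used for saving, and only from a trusted source, since loading executes pickled code.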

## 6. Conclusion

1. Three models were compared with different hyperparameter settings: **DecisionTreeClassifier**, **RandomForestClassifier** and **LogisticRegression**.

2. A **dummy** model with a *constant* strategy served as the sanity check: it reaches **69.52%** accuracy on the test set, and all three models exceed this baseline.

3. The **RandomForestClassifier** model with *n_estimators = 20* and *max_depth = 9* has the highest accuracy on the test set *(79.78%)* and was saved as the final model.
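The comparison can be summarized by measuring each model against the dummy baseline. The snippet below only rearranges the test-set accuracies printed in the outputs above; the variable names are illustrative.

```python
# Test-set accuracies copied from the outputs above
test_accuracy = {
    'DummyClassifier': 0.6951788491446346,
    'DecisionTreeClassifier': 0.776049766718507,
    'RandomForestClassifier': 0.7978227060653188,
    'LogisticRegression': 0.76049766718507,
}

baseline = test_accuracy['DummyClassifier']
ranked = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    # Gain over the constant baseline, in percentage points
    gain = (acc - baseline) * 100
    print(f'{name}: {acc:.2%} ({gain:+.2f} points vs. baseline)')

best_model = max(test_accuracy, key=test_accuracy.get)
print('Best model:', best_model)
```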