# Introduction to Machine Learning Project

## Project Description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

## Data description

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

- `calls` is the number of calls. 
- `minutes` is the total call duration in minutes.
- `messages` is the number of text messages.
- `mb_used` is the Internet traffic used in MB.
- `is_ultra` is the plan for the current month (Ultra - 1, Smart - 0).

## Purpose of the project
- import and study the datasets
- tidy up the datasets when needed
- split the data into train, test, and validation sets
- develop different classification models and fine-tune parameters and hyperparameters
- evaluate the performance and quality of different classification models using the test set
- pick out the best model, draw conclusions and explain the results

## Initialization

In [7]:
# Loading all the libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score


## Load data

In [9]:
# Load the data file into a DataFrame

try:
    df = pd.read_csv('datasets/users_behavior.csv')
except FileNotFoundError:
    # Fall back to the absolute path when the relative one does not exist
    df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
display(df.head(10))

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |


The dataset has 3214 rows and five columns (four features plus the target `is_ultra`) with no missing values. The data types have no obvious issues.
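The checks behind that statement can be made explicit. The sketch below is only an illustration: it rebuilds the first three rows of the preview as a stand-in for the full file (the name `sample` is hypothetical) and counts missing values, duplicate rows, and whether the count-like columns hold whole numbers.

```python
import pandas as pd

# First three rows of the preview above, used only as a stand-in for the full file.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "minutes": [311.9, 516.75, 467.66],
    "messages": [83.0, 56.0, 86.0],
    "mb_used": [19915.42, 22696.96, 21060.45],
    "is_ultra": [0, 0, 0],
})

# Number of NaN values in each column.
missing_per_column = sample.isna().sum()
# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())
# calls and messages are stored as floats; check that they hold whole numbers.
counts_are_whole = bool((sample[["calls", "messages"]] % 1 == 0).all().all())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("calls/messages are whole numbers:", counts_are_whole)
```

On the real data, the same three expressions could be applied to `df` directly.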

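Next, we will split the dataset into training, validation, and test sets. The two-step split below puts 20% of the rows in the validation set and 20% of the remainder (16% of all rows) in the test set, which leaves 64% for training: 2056, 643, and 515 rows respectively.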

In [5]:
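# First split: 20% of all rows become the validation set
df_train, df_valid = train_test_split(df, test_size=0.2, random_state=12345)
# Second split: 20% of the remaining rows (16% of all rows) become the test set
df_train, df_test = train_test_split(df_train, test_size=0.2, random_state=12345)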

In [6]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']
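
Calling `train_test_split` twice with `test_size=0.2` yields roughly 64% / 20% / 16% of the rows for training, validation, and test. If an exact 3:1:1 (60% / 20% / 20%) split is wanted, one possible approach is to use `test_size=0.25` in the second call. The sketch below assumes a synthetic DataFrame named `demo` with the same column names as the real data; its values are random and are not Megaline data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as the real dataset (random values).
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000).astype(float),
    "minutes": rng.uniform(0, 1500, size=1000),
    "messages": rng.integers(0, 200, size=1000).astype(float),
    "mb_used": rng.uniform(0, 40000, size=1000),
    "is_ultra": rng.integers(0, 2, size=1000),
})

# Step 1: hold out 20% of all rows as the test set.
demo_rest, demo_test = train_test_split(demo, test_size=0.2, random_state=12345)
# Step 2: 25% of the remaining 80% equals 20% of the whole, leaving 60% for training.
demo_train, demo_valid = train_test_split(demo_rest, test_size=0.25, random_state=12345)

print(len(demo_train), len(demo_valid), len(demo_test))
```

Changing the split would change every accuracy reported in this notebook, so the results below still refer to the original split.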

## Models with decision tree classifier

In [7]:
# Compare validation accuracy for max_depth from 1 to 5

for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)

    predictions_valid = model.predict(features_valid)
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid)) 
        

max_depth = 1 : 0.7480559875583204
max_depth = 2 : 0.7838258164852255
max_depth = 3 : 0.7869362363919129
max_depth = 4 : 0.7884914463452566
max_depth = 5 : 0.7791601866251944


Among the decision tree classifiers, validation accuracy peaks at `max_depth=4` (about 0.788). The model with `max_depth=2` scores about 0.784, within half a percentage point of the peak with a shallower tree, and it is the one evaluated on the test set below.
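
Reading the best depth off a printed list is easy to get wrong, so it may help to let the code pick it. The sketch below is one way to do that. It assumes synthetic data from `make_classification` (not the Megaline file), and the names `scores` and `best_depth` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: four numeric features and a binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Validation accuracy for each candidate depth.
scores = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    scores[depth] = tree.score(X_valid, y_valid)

# max() returns the first key that reaches the top accuracy,
# so ties resolve toward the shallower tree.
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth, "validation accuracy:", scores[best_depth])
```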

Next, we will fit the decision tree classifier with `max_depth=2` and evaluate it on the test data.

In [8]:
n=2
model = DecisionTreeClassifier(random_state=12345, max_depth=n)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
print("max_depth =", n, ": ", end='')
print(accuracy_score(target_test, predictions_test)) 

max_depth = 2 : 0.7475728155339806


With the selected decision tree classifier, the test accuracy is about 0.748, just below the 0.75 threshold.

Let's try other types of models and see if we can do better.

## Random forest models


In [9]:
for n in range(1, 11):
    
    model = RandomForestClassifier(random_state=12345, n_estimators=n)
    model.fit(features_train, target_train) 
    
    print("n_estimators =", n, ": ", end='')
    print(model.score(features_valid, target_valid)) 

n_estimators = 1 : 0.7216174183514774
n_estimators = 2 : 0.744945567651633
n_estimators = 3 : 0.749611197511664
n_estimators = 4 : 0.7542768273716952
n_estimators = 5 : 0.7682737169517885
n_estimators = 6 : 0.7698289269051322
n_estimators = 7 : 0.7698289269051322
n_estimators = 8 : 0.7807153965785381
n_estimators = 9 : 0.7713841368584758
n_estimators = 10 : 0.7744945567651633


Among the random forest models, `n_estimators=8` gives the highest validation accuracy (about 0.781), so we use it as the best random forest model.
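
The loop above varies only `n_estimators`. A possible extension is to search `n_estimators` and `max_depth` together on the validation set. The sketch below illustrates that idea on synthetic data from `make_classification`; the grid values are arbitrary assumptions, not settings tried in this project.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: four numeric features and a binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_params, best_score = None, -1.0
# Try every combination of the two hyperparameters (grid values are illustrative).
for n_trees, depth in product([5, 10, 20], [3, 5, None]):
    forest = RandomForestClassifier(random_state=12345,
                                    n_estimators=n_trees, max_depth=depth)
    forest.fit(X_train, y_train)
    score = forest.score(X_valid, y_valid)
    # Keep the first combination that reaches the highest validation accuracy.
    if score > best_score:
        best_params = {"n_estimators": n_trees, "max_depth": depth}
        best_score = score

print(best_params, best_score)
```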

Next, we will fit the best random forest model and test it using the test data. 

In [10]:
n=8

model = RandomForestClassifier(random_state=12345, n_estimators=n)
model.fit(features_train, target_train) 
predictions_test = model.predict(features_test)

print("n_estimators =", n, ": ", end='')
print(accuracy_score(target_test, predictions_test)) 

n_estimators = 8 : 0.7669902912621359


With the selected random forest model, the test accuracy is about 0.767, which is above the 0.75 threshold.

Let's try one more type of model, logistic regression, and see if we can do better.

## The logistic regression model

In [11]:
model = LogisticRegression(random_state=12345, solver='liblinear') 
model.fit(features_train, target_train) 
    
print(model.score(features_valid, target_valid)) 
print(model.score(features_test, target_test)) 

0.749611197511664
0.7203883495145631


The logistic regression model reaches a validation accuracy of about 0.75 and a test accuracy of about 0.72. Its test accuracy is below both the 0.75 threshold and the test accuracies of the other two models.
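
The features here are on very different scales (in the preview, `mb_used` is in the thousands while `calls` is in the tens), and regularized logistic regression can be sensitive to that. One thing that could be tried is standardizing the features inside a pipeline. The sketch below uses synthetic data with one column rescaled by an arbitrary factor to imitate the mismatch; it makes no claim about how the real data would respond.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: four numeric features and a binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
# Assumption for illustration: blow up one column's scale, loosely imitating mb_used.
X[:, 3] *= 10000
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Same model as in the notebook, fitted on the raw features.
plain = LogisticRegression(random_state=12345, solver='liblinear')
plain.fit(X_train, y_train)

# The scaler is fitted on the training data only, then applied before the model.
scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='liblinear'),
)
scaled.fit(X_train, y_train)

plain_acc = plain.score(X_valid, y_valid)
scaled_acc = scaled.score(X_valid, y_valid)
print("raw features:   ", plain_acc)
print("scaled features:", scaled_acc)
```

Putting the scaler in the pipeline keeps validation and test rows out of the scaling statistics.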

## General conclusion

In this project, we analyzed a user behavior dataset from the mobile carrier Megaline. The major steps were:

- import and study the datasets
- tidy up the datasets when needed
- split the data into train, validation, and test sets (roughly 64%, 20%, and 16% of the rows)
- develop three types of classification models and fine-tune parameters and hyperparameters
- evaluate the performance and quality of the three classification models using the test dataset
- pick out the best model based on test accuracy

The main findings are:
- the raw dataset was clean and required minimal processing
- the best model for this classification task is the random forest model with `n_estimators=8`, which achieved a test accuracy of about 0.767, above the 0.75 threshold
- the decision tree classifier with `max_depth=2` reached a test accuracy of about 0.748, just below the threshold
- the logistic regression model reached a test accuracy of about 0.72, below the threshold
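
To put these accuracies in context, it can help to compare them with a baseline that always predicts the most frequent class. The sketch below shows one way to compute such a baseline with `DummyClassifier`. It runs on synthetic data whose 70/30 class balance is an assumption chosen for illustration, not a measured property of the Megaline dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the 70/30 class balance is an illustrative assumption.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Always predicts the class that is most frequent in the training targets.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# The baseline accuracy equals the share of that class in the test set.
majority_class = np.bincount(y_train).argmax()
print("majority class:", majority_class, "baseline accuracy:", baseline_acc)
```

A model is only informative to the extent that it beats this number on the same test set.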

