# Project 7 - Analyze Consumer Behavior and Recommend Megaline packages

## Project Description

The mobile operator Megaline is dissatisfied that many of its customers are still on legacy plans. The company wants a model that analyzes customer behavior and recommends one of Megaline's two new plans: Smart or Ultra.
The data consists of behavioral records for customers who have already switched to the new plans (from the Statistical Data Analysis course project). This is a classification task: the model must select the appropriate plan for each customer. Because the data has already been preprocessed, we can go straight to building the model.
The goal is a model with the highest possible accuracy. The accuracy threshold for this project is 0.75, and the model's accuracy must be checked on the test dataset.

### Steps of The Project
1. Initialization
2. Data Overview
3. Machine Learning Preparation
4. Check Quality Model

**Data Description**

Each observation in our dataset contains monthly behavioral information about a single user. This information includes:

- `calls`: number of calls
- `minutes`: total call duration in minutes
- `messages`: number of text messages
- `mb_used`: internet traffic used, in MB
- `is_ultra`: plan for the current month (Ultra = 1, Smart = 0)
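
The layout described above can be verified before modeling. The following is a minimal sketch, not part of the original notebook: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the sample rows are the first rows shown by `data.head()` later in this notebook.

```python
import pandas as pd

# Column names taken from the data description above.
EXPECTED_COLUMNS = ["calls", "minutes", "messages", "mb_used", "is_ultra"]


def check_schema(df: pd.DataFrame) -> None:
    """Raise ValueError if df does not match the described layout (hypothetical helper)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # is_ultra is a binary label: Ultra = 1, Smart = 0.
    bad_labels = set(df["is_ultra"].unique()) - {0, 1}
    if bad_labels:
        raise ValueError(f"unexpected is_ultra values: {sorted(bad_labels)}")


# Small sample frame in the described layout.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 106.0],
    "minutes": [311.9, 516.75, 745.53],
    "messages": [83.0, 56.0, 81.0],
    "mb_used": [19915.42, 22696.96, 8437.39],
    "is_ultra": [0, 0, 1],
})
check_schema(sample)  # passes silently when the layout matches
```

In the notebook itself, the same check could be applied to `data` right after `pd.read_csv`.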

## Initialization

In [None]:
# Import general-purpose and machine learning libraries

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Data Overview

In [2]:
data = pd.read_csv('users_behavior.csv')
data.head()

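|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |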


In [3]:
data['is_ultra'].value_counts()/data.shape[0]*100

0    69.352831
1    30.647169
Name: is_ultra, dtype: float64

In [4]:
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


**Findings :**

- About 30.6% of observations are on the Ultra plan and 69.4% on Smart, so the two classes are imbalanced
- The dataset has 3214 rows and 5 columns: `calls`, `minutes`, `messages`, `mb_used`, and `is_ultra`
- There are no missing values, and the data types do not need to be changed
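
Because roughly 69% of observations belong to the Smart plan, a model that always predicts Smart would already reach about 69% accuracy. The sketch below illustrates this majority-class baseline with scikit-learn's `DummyClassifier`. It is not part of the original notebook, and it uses synthetic labels with approximately the reported class ratio as a stand-in for `data['is_ultra']`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels (assumption): about 30.65% Ultra (1), the rest Smart (0).
rng = np.random.default_rng(0)
target = (rng.random(3214) < 0.3065).astype(int)
features = np.zeros((target.size, 1))  # this baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)
baseline_accuracy = accuracy_score(target, baseline.predict(features))

# The baseline accuracy equals the share of the majority class.
majority_share = max(target.mean(), 1 - target.mean())
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"majority share:    {majority_share:.3f}")
```

Any model worth keeping should beat this baseline by a clear margin, which puts the 0.75 threshold in context.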

## Machine Learning Preparation

### Divide Train, Valid, and Test Data

The test set takes 20% of the full dataset, so that most of the data remains available for training. The remaining 80% is then split again: 75% of it becomes the training set and 25% the validation set, which gives an overall 60/20/20 split between training, validation, and test data.
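
The proportions can be checked with a short sketch. It is an illustration under stated assumptions rather than the notebook's own code: the frame `demo` is synthetic, and the `random_state` and `stratify` arguments are additions that make the split reproducible and keep the class ratio similar in each part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic placeholder with the same number of rows as the dataset (3214).
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "mb_used": rng.random(3214),
    "is_ultra": (rng.random(3214) < 0.3065).astype(int),
})

# Step 1: hold out 20% of all rows as the test set.
train_valid, test = train_test_split(
    demo, test_size=0.2, random_state=12345, stratify=demo["is_ultra"]
)
# Step 2: 25% of the remaining 80% is 20% of the whole; 60% is left for training.
train, valid = train_test_split(
    train_valid, test_size=0.25, random_state=12345,
    stratify=train_valid["is_ultra"],
)

print(len(train), len(valid), len(test))  # 1928 643 643
for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, round(part["is_ultra"].mean(), 3))  # Ultra share per part
```

The row counts follow from scikit-learn rounding the test size up: 20% of 3214 gives 643 test rows, and 25% of the remaining 2571 gives 643 validation rows.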

In [5]:
train_valid, test = train_test_split(data, test_size=0.2)
train, valid = train_test_split(train_valid, test_size=0.25)

In [6]:
#train
features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']

#valid
features_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']

#test
features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

In [7]:
features_train.shape, features_valid.shape, features_test.shape

((1928, 4), (643, 4), (643, 4))

## Check Quality Model

### Without Hyperparameter Tuning

In [8]:
logistic_regression = LogisticRegression()
logistic_regression.fit(features_train, target_train)

predict_log_reg_valid = logistic_regression.predict(features_valid)
predict_log_reg_test = logistic_regression.predict(features_test)

print('valid accuracy :', accuracy_score(target_valid, predict_log_reg_valid)*100)
print('test accuracy :', accuracy_score(target_test, predict_log_reg_test)*100)

valid accuracy : 71.07309486780716
test accuracy : 73.71695178849144


In [9]:
dec_tree_clas = DecisionTreeClassifier()
dec_tree_clas.fit(features_train, target_train)

predict_dec_tree_clas_valid = dec_tree_clas.predict(features_valid)
predict_dec_tree_clas_test = dec_tree_clas.predict(features_test)

print('valid accuracy :', accuracy_score(target_valid, predict_dec_tree_clas_valid)*100)
print('test accuracy :', accuracy_score(target_test, predict_dec_tree_clas_test)*100)

valid accuracy : 73.09486780715396
test accuracy : 74.80559875583204


In [10]:
ran_for_clas = RandomForestClassifier()
ran_for_clas.fit(features_train, target_train)

predict_ran_for_clas_valid = ran_for_clas.predict(features_valid)
predict_ran_for_clas_test = ran_for_clas.predict(features_test)

print('valid accuracy :', accuracy_score(target_valid, predict_ran_for_clas_valid)*100)
print('test accuracy :', accuracy_score(target_test, predict_ran_for_clas_test)*100)

valid accuracy : 79.00466562986003
test accuracy : 82.89269051321928


**Findings :**

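- The Random Forest model has the highest accuracy of the three (79.0% on validation data, 82.9% on test data) and is the only one that clears the 0.75 threshold with default settings. This is in line with expectations, since averaging many trees typically reduces the variance of a single decision tree
- For each model, validation and test accuracy differ by less than 4 percentage points, so the validation score is a reasonable estimate of performance on unseen data. This comparison alone does not rule out overfitting, which would require comparing against accuracy on the training data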
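
A common way to check for overfitting is to compare accuracy on the training data with accuracy on the validation data: a large gap suggests the model has memorized the training set. The sketch below shows the idea on synthetic data from `make_classification`, used here as an assumed stand-in for the four usage features and the `is_ultra` target. The numbers it prints say nothing about the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): 4 features, binary target, some label noise.
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.2, random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

results = {}
for depth in (3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    valid_acc = tree.score(X_valid, y_valid)
    results[depth] = (train_acc, valid_acc)
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"valid={valid_acc:.3f} gap={train_acc - valid_acc:.3f}")
```

In the notebook, the same comparison could be made by calling `score(features_train, target_train)` on each fitted model next to its validation score.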

### With Hyperparameter Tuning

***Depth***

In [11]:
best_score = 0
best_depth = 0

# Try each depth and keep the one with the best validation accuracy
for depth in range(1, 9):
    dtc = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    dtc.fit(features_train, target_train)
    score = dtc.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Best accuracy based on validation set:", best_score, "best_depth:", best_depth)

Best accuracy based on validation set: 0.7838258164852255 best_depth: 8


In [12]:
dtc_test = DecisionTreeClassifier(max_depth=8)
dtc_test.fit(features_train, target_train)
predict_test_dtc = dtc_test.predict(features_test)
accuracy_score(target_test, predict_test_dtc)*100

83.98133748055989

In [13]:
best_score = 0
best_depth = 0

# Try each depth and keep the one with the best validation accuracy
for depth in range(1, 9):
    rfc = RandomForestClassifier(max_depth=depth, random_state=12345)
    rfc.fit(features_train, target_train)
    score = rfc.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("Best accuracy based on validation set:", best_score, "best_depth:", best_depth)

Best accuracy based on validation set: 0.7978227060653188 best_depth: 8


In [14]:
rfc_test = RandomForestClassifier(max_depth=8)
rfc_test.fit(features_train, target_train)
predict_test_rfc = rfc_test.predict(features_test)
accuracy_score(target_test, predict_test_rfc)*100

84.44790046656298
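
The two depth searches above share the same loop, so they can be expressed as one function that returns the best depth together with its score. This is a minimal sketch: `search_max_depth` is a hypothetical helper, not something from the original notebook, and the data is synthetic so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def search_max_depth(model_cls, X_train, y_train, X_valid, y_valid,
                     depths=range(1, 9)):
    """Return (best_depth, best_score) by validation accuracy (hypothetical helper)."""
    best_depth, best_score = None, 0.0
    for depth in depths:
        model = model_cls(max_depth=depth, random_state=12345)
        model.fit(X_train, y_train)
        score = model.score(X_valid, y_valid)
        if score > best_score:
            best_depth, best_score = depth, score
    return best_depth, best_score


# Synthetic stand-in (assumption) for the training and validation sets.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

found = {}
for model_cls in (DecisionTreeClassifier, RandomForestClassifier):
    found[model_cls.__name__] = search_max_depth(
        model_cls, X_train, y_train, X_valid, y_valid
    )
    print(model_cls.__name__, found[model_cls.__name__])
```

Returning the selected depth from the function makes it straightforward to pass the same value to the model that is later evaluated on the test set.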

***Estimator and Depth***

In [15]:
best_score = 0
best_est = 0
best_depth = 0

# Try every combination of n_estimators and max_depth,
# keeping the pair with the best validation accuracy
for est in range(100, 501, 100):
    for depth in range(1, 9):
        rfc1 = RandomForestClassifier(max_depth=depth, n_estimators=est, random_state=12345)
        rfc1.fit(features_train, target_train)
        score = rfc1.score(features_valid, target_valid)
        if score > best_score:
            best_score = score
            best_est = est
            best_depth = depth

print("Best accuracy based on validation set:", best_score, "n_estimators:", best_est, "best_depth:", best_depth)

Best accuracy based on validation set: 0.7947122861586314 n_estimators: 100 best_depth: 8


In [16]:
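# n_estimators=500 is used here, while the validation search above reported n_estimators=100 as best
rfc1_test = RandomForestClassifier(n_estimators=500, max_depth=8)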
rfc1_test.fit(features_train, target_train)
predict_test_rfc1 = rfc1_test.predict(features_test)
accuracy_score(target_test, predict_test_rfc1)*100

85.06998444790047
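
As an alternative to nested loops over a single validation split, scikit-learn's `GridSearchCV` can search the same two hyperparameters with cross-validation. This is a different technique from the one used in this notebook, shown only as a sketch: the data is synthetic, and the grid is deliberately smaller than the one above to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in (assumption) for the features and the is_ultra target.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

# Reduced grid (assumption); the notebook searches 100..500 trees and depths 1..8.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4, 8]}
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation replaces the separate validation set
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
# By default the best model is refit on all training data, then scored here.
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```

Because the refit model uses exactly the selected parameters, the hyperparameters evaluated on the test set always match the ones the search chose.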

**Findings :**

- With `max_depth` tuned, the Random Forest model is more accurate than the Decision Tree model, but the gap is small: about 1.4 percentage points on the validation set (0.798 vs 0.784) and about 0.5 percentage points on the test set (84.4% vs 84.0%)
- With `max_depth` and `n_estimators` set, the Random Forest model reaches about 85% accuracy on the test set, well above the 0.75 threshold, which makes it a suitable model for recommending the right plan