# Contents <a id='back'></a>

* [Introduction](#intro)
* [1. Data Overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [2. Splitting the Data](#splitting_data)
* [3. Assessing Model Quality](#model_quality)
    * [3.1 Logistic Regression](#initial_lr)
    * [3.2 Decision Tree](#initial_dtree)
    * [3.3 Random Forest](#initial_rf)
* [4. Hyperparameter Tuning](#hyperparameter_tuning)
    * [4.1 Logistic Regression](#tuning_lr)
    * [4.2 Decision Tree](#tuning_dtree)
    * [4.3 Random Forest](#tuning_rf)
* [5. Sanity Check](#sanity_check)
* [General Conclusion](#end)

# Introduction <a id='intro'></a>

In this project, I will train a model that analyzes consumer behavior and recommends one of two new packages: Smart or Ultra. The accuracy threshold for this project is 0.75, and the model's accuracy will be evaluated on the test dataset.

**Objective:**

To train a model that can recommend a package based on consumer behavior with a minimum accuracy score of 0.75.


**This project will consist of five steps:**

1. Data Overview
2. Splitting the Data
3. Assessing Model Quality
4. Hyperparameter Tuning
5. Sanity Check


[Back to Contents](#back)

## 1. Data Overview <a id='data_review'></a>

The steps to be performed are as follows:
1. Checking the number of rows and columns.
2. Checking for missing values.
3. Checking for duplicate data.
4. Checking statistical information in columns with numerical data types.
5. Checking values in columns with categorical data types.

[Back to Contents](#back)

In [1]:
# load libraries

# data handling
import pandas as pd

# scientific computing
import numpy as np

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# splitting data
from sklearn.model_selection import train_test_split

# accuracy score
from sklearn.metrics import accuracy_score

# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [3]:
# load dataset
path = 'data/users_behavior.csv'
df = pd.read_csv(path)

### 1.1 Data Exploration: users_behavior dataset

In [4]:
df.shape

(3214, 5)

In [5]:
df.head()

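|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |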


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [7]:
# checking missing values
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64
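The overview plan also lists a check for duplicate data, which is not shown in the cells above. A minimal sketch of how such a check could look is given below. It runs on a small made-up frame that only reuses the dataset's column names, so the values and the result are illustrative and say nothing about `users_behavior.csv` itself.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names; values are illustrative only.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 40.0],
    "minutes": [311.90, 516.75, 311.90],
    "messages": [83.0, 56.0, 83.0],
    "mb_used": [19915.42, 22696.96, 19915.42],
    "is_ultra": [0, 0, 0],
})

# duplicated() flags every row that repeats an earlier row across all columns
n_duplicates = int(sample.duplicated().sum())
print("duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```

On the real data, the same two calls would be applied to `df` directly.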

In [8]:
# checking statistic information for numerical variables
df.describe()

|       | calls | minutes | messages | mb_used | is_ultra |
|-------|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [9]:
# checking data composition of target column
df['is_ultra'].value_counts()

is_ultra
0    2229
1     985
Name: count, dtype: int64

In [11]:
# checking data composition percentage
df['is_ultra'].value_counts()/df.shape[0] * 100

is_ultra
0    69.352831
1    30.647169
Name: count, dtype: float64

**Conclusion** <a id='data_review_conclusions'></a>

1. There are no missing values.
2. The data types in the columns are correct.
3. The target is imbalanced: about 69% of the rows belong to class 0 and about 31% to class 1. A model trained on this data may lean toward predicting 0, and accuracy alone can look acceptable even when class 1 is predicted poorly (always predicting 0 would already score about 69%). The imbalance can be addressed by upsampling (repeating rows of class 1) or downsampling (dropping rows of class 0). Plain upsampling duplicates existing rows rather than adding new information, and downsampling discards data, so both come at a cost.
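The two rebalancing techniques mentioned above can be sketched with `sklearn.utils.resample`. The frame below is made up (7 rows of class 0, 3 rows of class 1) and only illustrates the mechanics. In practice, resampling would normally be applied to the training set only, so that the validation and test sets keep the original class ratio.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced frame: 7 rows of class 0 and 3 rows of class 1
toy = pd.DataFrame({
    "calls": range(10),
    "is_ultra": [0] * 7 + [1] * 3,
})
majority = toy[toy["is_ultra"] == 0]
minority = toy[toy["is_ultra"] == 1]

# Upsampling: draw minority rows with replacement until both classes match
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Downsampling: draw a subset of majority rows without replacement
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["is_ultra"].value_counts().to_dict())
print(downsampled["is_ultra"].value_counts().to_dict())
```

Every upsampled row is a copy of an existing minority row, which is why this kind of upsampling adds no new information.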

## 2. Splitting the Data <a id='splitting_data'></a>

The data will be divided into three parts in a 60/20/20 ratio:

1. Training Set: This data is used to train and build the model.
2. Validation Set: This data is used to optimize the model during its construction. It helps assess the model's ability to recognize patterns in a general sense. The validation set is also used to evaluate the accuracy of the created model. If the accuracy is not satisfactory, hyperparameter tuning can be performed.
3. Test Set: This data is used to test the model's performance.

[Back to Contents](#back)

In [12]:
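# Split dataset: hold out 20% of the rows as the test set
df_train_valid, df_test = train_test_split(df, test_size=0.2)

# Split the remaining 80% into training (75%) and validation (25%) sets,
# giving 60/20/20 overall. The training set is used to build the model.
# No random_state is passed here, so the split changes between runs;
# pass random_state=42 to both calls to obtain the same results across executions.
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25)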

# features and target for training dataset
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']

# features and target for validation dataset
features_valid = df_valid.drop('is_ultra', axis=1)
target_valid = df_valid['is_ultra']

# features and target for test dataset
features_test = df_test.drop('is_ultra', axis=1)
target_test = df_test['is_ultra']

In [13]:
# checking row for each dataset
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)
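Because the target is imbalanced, a seeded and stratified variant of the split may be worth considering. The sketch below is an assumption about how that could look, not the split used in this notebook. It runs on made-up data with roughly the same share of positives and uses the `stratify` and `random_state` parameters of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 30% positives; the values are illustrative only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "is_ultra": [1] * 300 + [0] * 700,
})

# First cut: 20% test. stratify keeps the class ratio, random_state fixes the shuffle
train_valid, test = train_test_split(
    toy, test_size=0.2, stratify=toy["is_ultra"], random_state=42
)
# Second cut: 25% of the remaining 80% gives a 60/20/20 split overall
train, valid = train_test_split(
    train_valid, test_size=0.25, stratify=train_valid["is_ultra"], random_state=42
)

for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, len(part), round(part["is_ultra"].mean(), 3))
```

With stratification, each part keeps close to the same share of class 1, so validation and test scores are measured on comparable class mixes.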


## 3. Assessing Model Quality <a id='model_quality'></a>

Each model is first trained with its default settings, without hyperparameter tuning.

[Back to Contents](#back)

### 3.1 Logistic Regression <a id='initial_lr'></a>

In [38]:
initial_model_logReg = LogisticRegression()

# train training set
initial_model_logReg.fit(features_train, target_train)

# predict using training set
y_predict_train_lr = initial_model_logReg.predict(features_train)

# predict using validation set
y_predict_valid_lr = initial_model_logReg.predict(features_valid)

# test accuracy score in training set
print("accuracy score on training set:", accuracy_score(target_train, y_predict_train_lr))
# test accuracy score in validation set
print("accuracy score on validation set: ", accuracy_score(target_valid, y_predict_valid_lr))

accuracy score on training set: 0.7484439834024896
accuracy score on validation set:  0.7682737169517885


### 3.2 Decision Tree <a id='initial_dtree'></a>

In [20]:
initial_model_dTree = DecisionTreeClassifier()

# train training set
initial_model_dTree.fit(features_train, target_train)

# predict using training set
y_predict_train_dtree = initial_model_dTree.predict(features_train)

# predict using validation set
y_predict_valid_dtree = initial_model_dTree.predict(features_valid)

# test accuracy score in training set
print("accuracy score on training set:", accuracy_score(target_train, y_predict_train_dtree))

# test accuracy score in validation set
print("accuracy score on validation set: ", accuracy_score(target_valid, y_predict_valid_dtree))

accuracy score on training set: 1.0
accuracy score on validation set:  0.7527216174183515


### 3.3 Random Forest <a id='initial_rf'></a>

In [21]:
initial_model_rf = RandomForestClassifier()

# train training set
initial_model_rf.fit(features_train, target_train)

# predict using training set
y_predict_train_rf = initial_model_rf.predict(features_train)

# predict using validation set
y_predict_valid_rf = initial_model_rf.predict(features_valid)

# test accuracy score in training set
print("accuracy score on training set:", accuracy_score(target_train, y_predict_train_rf))
# test accuracy score in validation set
print("accuracy score on validation set:", accuracy_score(target_valid, y_predict_valid_rf))

accuracy score on training set: 1.0
accuracy score on validation set: 0.8055987558320373


**Conclusion**

1. Overfitting is observed on the training dataset: the decision tree and random forest both reach an accuracy of 1.0 on the training set but score much lower on the validation set. With default settings the trees grow without a depth limit and can memorize the training data; the small dataset and the class imbalance may add to the effect.
2. Based on the accuracy on the validation dataset (the test dataset has not been used yet):
    - Decision Tree has the lowest accuracy (75.3%), only just above the 75% threshold.
    - Logistic Regression has a moderate accuracy (76.8%), above the 75% threshold.
    - Random Forest has the highest accuracy (80.6%), clearly above the 75% threshold.
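The link between unlimited tree depth and a training accuracy of 1.0 can be illustrated with a small sketch. It uses synthetic data from `make_classification` as a stand-in (not the `users_behavior` dataset), so the exact scores are not comparable with the ones above; only the pattern matters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default settings: max_depth=None, so the tree keeps splitting until leaves are pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth limit forces the tree to stop early
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("default", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train:", tree.score(X_train, y_train),
          "valid:", tree.score(X_valid, y_valid))
```

The unrestricted tree fits the training rows perfectly, including the noisy labels, which is the behavior that a depth limit is meant to curb.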

## 4. Hyperparameter Tuning <a id='hyperparameter_tuning'></a>

[Back to Contents](#back)

### 4.1 Logistic Regression  <a id='tuning_lr'></a>

The hyperparameters that will be set are:
- solver: The algorithm used for optimization. liblinear is used here because it is suited to small datasets.
- random_state: Fixes the seed for the estimator's randomness so that results are reproducible.

In [29]:
final_model_lr = LogisticRegression(solver = 'liblinear', random_state = 42)
final_model_lr.fit(features_train, target_train)

# training dataset
y_predict_train_final_lr = final_model_lr.predict(features_train)
print("In the final logistic regression model, the accuracy score on the training set:", accuracy_score(target_train, y_predict_train_final_lr))

# dataset validation
y_predict_valid_final_lr = final_model_lr.predict(features_valid)
print("In the final logistic regression model, the accuracy score on the validation set:", accuracy_score(target_valid, y_predict_valid_final_lr))

In the final logistic regression model, the accuracy score on the training set: 0.745850622406639
In the final logistic regression model, the accuracy score on the validation set: 0.7620528771384136


In [25]:
# predict using test dataset
y_predict_test_final_lr = final_model_lr.predict(features_test)

# test accuracy score on test set
print(accuracy_score(target_test, y_predict_test_final_lr))

0.744945567651633


**Conclusion - Logistic Regression Model**

The breakdown of accuracy results before and after hyperparameter tuning is as follows:

1. Training dataset --> Before: 74.8%, After: 74.6%. Decreased by about 0.3 percentage points.
2. Validation dataset --> Before: 76.8%, After: 76.2%. Decreased by about 0.6 percentage points.
3. Test dataset has an accuracy of 74.5%, which is just below the 75% threshold.

### 4.2 Decision Tree <a id='tuning_dtree'></a>

The hyperparameters that will be tuned are:
- max_depth: Limits the depth of the tree. Setting it too high allows overfitting; setting it too low causes underfitting.
- random_state: Fixes the seed for the estimator's randomness so that results are reproducible.

In [41]:
for depth in range(1, 15):
    model_dtree = DecisionTreeClassifier(max_depth=depth, random_state = 42)
    model_dtree.fit(features_train, target_train)
    predictions_valid_dtree = model_dtree.predict(features_valid)
    print("At", "max_depth", depth, "the accuracy score on validation set is ", end='')
    print(accuracy_score(target_valid, predictions_valid_dtree))

At max_depth 1 the accuracy score on validation set is 0.7682737169517885
At max_depth 2 the accuracy score on validation set is 0.7869362363919129
At max_depth 3 the accuracy score on validation set is 0.8180404354587869
At max_depth 4 the accuracy score on validation set is 0.8164852255054432
At max_depth 5 the accuracy score on validation set is 0.8164852255054432
At max_depth 6 the accuracy score on validation set is 0.8211508553654744
At max_depth 7 the accuracy score on validation set is 0.80248833592535
At max_depth 8 the accuracy score on validation set is 0.8055987558320373
At max_depth 9 the accuracy score on validation set is 0.7978227060653188
At max_depth 10 the accuracy score on validation set is 0.8055987558320373
At max_depth 11 the accuracy score on validation set is 0.7729393468118196
At max_depth 12 the accuracy score on validation set is 0.7869362363919129
At max_depth 13 the accuracy score on validation set is 0.776049766718507
At max_depth 14 the accuracy score on

In [42]:
best_max_depth = 6
final_model_dtree = DecisionTreeClassifier(max_depth=best_max_depth, random_state = 42)
final_model_dtree.fit(features_train, target_train)

# dataset training
y_predict_train_final_dtree = final_model_dtree.predict(features_train)
print("In the final decision tree model, the accuracy score on the training set:", accuracy_score(target_train, y_predict_train_final_dtree))

# dataset validation
y_predict_valid_final_dtree = final_model_dtree.predict(features_valid)
print("In the final decision tree model, the accuracy score on the validation set:", accuracy_score(target_valid, y_predict_valid_final_dtree))

In the final decision tree model, the accuracy score on the training set: 0.8319502074688797
In the final decision tree model, the accuracy score on the validation set: 0.8211508553654744


In [23]:
# predict using test dataset
y_predict_test_final_dtree = final_model_dtree.predict(features_test)

# test accuracy score on test set
print(accuracy_score(target_test, y_predict_test_final_dtree))

0.7776049766718507


**Conclusion - Decision Tree Model**


The breakdown of accuracy results before and after hyperparameter tuning is as follows:
1. Training dataset --> Before: 100%, After: 83.2%. The model no longer memorizes the training data.
2. Validation dataset --> Before: 75.3%, After: 82.1%. Increased by about 6.8 percentage points.
3. Test dataset has an accuracy of 77.8%, which meets the accuracy threshold.
4. Hyperparameters that were set:
    - max_depth, with a value of 6
    - random_state, with a value of 42
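Reading the best depth off the printed scores by hand is easy to get wrong, so the search can also record the scores and pick the winner in code. The following is a minimal sketch of that idea on synthetic stand-in data; the names `scores` and `best_depth` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the users_behavior dataset)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Validation accuracy for each candidate depth
scores = {}
for depth in range(1, 15):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_valid, model.predict(X_valid))

# max() returns the first depth with the highest score, so ties go to the shallower tree
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth, "validation accuracy:", scores[best_depth])
```

The selected value can then be passed straight to the final model, which keeps the code and the written conclusion in agreement.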

### 4.3 Random Forest <a id='tuning_rf'></a>

The hyperparameters that will be tuned are:
- n_estimators: Controls the number of trees in the random forest. More trees tend to reduce the variance of the predictions, but the gains level off while training time keeps growing.
- max_depth: Limits the depth of each tree.
- random_state: Fixes the seed for the estimator's randomness so that results are reproducible.

In [40]:
max_depth_list = list(range(1, 16))
n_estimator_list = [100, 200, 300, 400, 500]

for depth in max_depth_list:
    for est in n_estimator_list:
        model_rf = RandomForestClassifier(max_depth=depth, n_estimators = est, random_state = 42)
        model_rf.fit(features_train, target_train)
        predictions_valid_rf = model_rf.predict(features_valid)
        print("At", "max_depth", depth, "and n_estimators", est, "the accuracy score on validation set is ", end='')
        print(accuracy_score(target_valid, predictions_valid_rf))

At max_depth 1 and n_estimators 100 the accuracy score on validation set is 0.7651632970451011
At max_depth 1 and n_estimators 200 the accuracy score on validation set is 0.7682737169517885
At max_depth 1 and n_estimators 300 the accuracy score on validation set is 0.7667185069984448
At max_depth 1 and n_estimators 400 the accuracy score on validation set is 0.7667185069984448
At max_depth 1 and n_estimators 500 the accuracy score on validation set is 0.7667185069984448
At max_depth 2 and n_estimators 100 the accuracy score on validation set is 0.8009331259720062
At max_depth 2 and n_estimators 200 the accuracy score on validation set is 0.80248833592535
At max_depth 2 and n_estimators 300 the accuracy score on validation set is 0.8009331259720062
At max_depth 2 and n_estimators 400 the accuracy score on validation set is 0.8009331259720062
At max_depth 2 and n_estimators 500 the accuracy score on validation set is 0.80248833592535
At max_depth 3 and n_estimators 100 the accuracy score

In [43]:
# final model with the selected hyperparameters
best_max_depth_rf = 9
best_est = 200
final_model_rf = RandomForestClassifier(max_depth=best_max_depth_rf, n_estimators = best_est,random_state = 42)
final_model_rf.fit(features_train, target_train)

# dataset training
y_predict_train_final_rf = final_model_rf.predict(features_train)
print("In the final random forest model, the accuracy score on the training set:", accuracy_score(target_train, y_predict_train_final_rf))

# dataset validation
y_predict_valid_final_rf = final_model_rf.predict(features_valid)
print("In the final random forest model, the accuracy score on the validation set:", accuracy_score(target_valid, y_predict_valid_final_rf))

In the final random forest model, the accuracy score on the training set: 0.8848547717842323
In the final random forest model, the accuracy score on the validation set: 0.8413685847589425


In [44]:
# predict using test dataset
y_predict_test_final_rf = final_model_rf.predict(features_test)
# test accuracy score on test set
print(accuracy_score(target_test, y_predict_test_final_rf))

0.8055987558320373


**Conclusion - Random Forest Model**

The breakdown of accuracy results before and after hyperparameter tuning for the Random Forest model is as follows:

1. Training dataset --> Before: 100%, After: 88.5%. The model no longer memorizes the training data.
2. Validation dataset --> Before: 80.6%, After: 84.1%. Increased by about 3.6 percentage points.
3. Test dataset has an accuracy of 80.6%, exceeding the threshold.
4. Hyperparameters that were set:
    - max_depth, with a value of 9
    - n_estimators, with a value of 200
    - random_state, with a value of 42
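The two-parameter search can also be written so that the best combination is selected in code instead of being read from the printed log. The sketch below uses `ParameterGrid` from scikit-learn on synthetic stand-in data, with a deliberately smaller grid than the one in the notebook to keep it fast. The grid values and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (not the users_behavior dataset)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Reduced grid for illustration; the notebook searches depths 1-15 and 100-500 trees
grid = ParameterGrid({"max_depth": [3, 6, 9], "n_estimators": [50, 100]})

results = []
for params in grid:
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_train, y_train)
    # score() returns accuracy for classifiers
    results.append((model.score(X_valid, y_valid), params))

# Pick the combination with the highest validation accuracy
best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, best_score)
```

Storing `best_params` and reusing it for the final model avoids a mismatch between the values in the code and the values quoted in the conclusion.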

## 5. Sanity Check <a id='sanity_check'></a>

[Back to Contents](#back)

In [46]:
# check composition data of target feature (in percentage)
df['is_ultra'].value_counts()/df.shape[0] * 100

is_ultra
0    69.352831
1    30.647169
Name: count, dtype: float64

**Conclusion**

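The target is dominated by class 0: a trivial model that always predicts 0 would already reach about 69% accuracy. That baseline falls short of the 0.75 threshold, so any useful model has to beat it. All three tuned models score above it on the test set (74.5%, 77.8%, and 80.6%), so they pass the sanity check.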
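The majority-class baseline can be made explicit with scikit-learn's `DummyClassifier`. The sketch below rebuilds the target from the class counts printed earlier (2229 zeros and 985 ones) and uses a placeholder feature column, since the dummy model ignores the features. In a fuller check, the baseline would be fitted on the training set and scored on the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts taken from the value_counts() output earlier in the notebook
target = np.array([0] * 2229 + [1] * 985)
# DummyClassifier ignores the features, so a placeholder column is enough
features = np.zeros((len(target), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)

baseline_accuracy = accuracy_score(target, baseline.predict(features))
print(round(baseline_accuracy, 4))
```

Any model whose test accuracy does not clearly exceed this value has learned little beyond the class ratio.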

# General Conclusion <a id='end'></a>

The accuracy values for each final model after hyperparameter tuning are as follows:

1. Logistic Regression: 
   - Training set: 0.745850622406639
   - Validation set: 0.7620528771384136
   - Test set: 0.744945567651633

2. Decision Tree:
   - Training set: 0.8319502074688797
   - Validation set: 0.8211508553654744
   - Test set: 0.7776049766718507

3. Random Forest:
   - Training set: 0.8848547717842323
   - Validation set: 0.8413685847589425
   - Test set: 0.8055987558320373

Among the three models, the Random Forest achieves the highest accuracy on the test set. The gap between its training and validation accuracy shrank from about 19 percentage points before tuning to about 4 after tuning, and its test accuracy is about 8 points below its training accuracy, so overfitting has been reduced, although not eliminated.

Therefore, I recommend that the client use the **Random Forest model** to determine the appropriate package choice between "Smart" and "Ultra".
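The comparison behind this recommendation can be tabulated in a few lines. The sketch below copies the accuracy values listed above into a DataFrame and adds two derived columns; the column names `train_test_gap` and `meets_threshold` are illustrative.

```python
import pandas as pd

# Accuracy values copied from the list above
results = pd.DataFrame(
    {
        "train": [0.745850622406639, 0.8319502074688797, 0.8848547717842323],
        "valid": [0.7620528771384136, 0.8211508553654744, 0.8413685847589425],
        "test": [0.744945567651633, 0.7776049766718507, 0.8055987558320373],
    },
    index=["Logistic Regression", "Decision Tree", "Random Forest"],
)

# Gap between training and test accuracy, a rough indicator of overfitting
results["train_test_gap"] = results["train"] - results["test"]
# Whether the test accuracy meets the project threshold of 0.75
results["meets_threshold"] = results["test"] >= 0.75

print(results.round(3))
print("best on test:", results["test"].idxmax())
```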


[Back to Contents](#back)