
## Project Description
Mobile operator Megaline is dissatisfied because many of its customers still use legacy plans. The company wants a model that analyzes customer behavior and recommends one of Megaline's two newer plans: Smart or Ultra.

* You have access to behavior data for customers who have already switched to the new plans (from the Statistical Data Analysis course project).
* In this classification task, you need to develop a model that picks the right plan. Since the data preprocessing step is already done, you can go straight to building the model.

What to do:
* Develop a model with the highest possible accuracy.
* In this project, the accuracy threshold is 0.75.
* Check the model's accuracy on the test dataset.

In [1]:
# load all libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import levene

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows',100)

In [2]:
#import libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

### Step 1
Open the data file and study it carefully. File path: /datasets/users_behavior.csv.

In [3]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [4]:
# show the first and last rows
display(df.head(), df.tail())

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 3209 | 122.0 | 910.98 | 20.0 | 35124.9 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |
| 3213 | 80.0 | 566.09 | 6.0 | 29480.52 | 1 |


**Data Description**

Each observation in the dataset holds one user's monthly behavior. The columns are:
* `calls`: number of calls
* `minutes`: total call duration in minutes
* `messages`: number of text messages
* `mb_used`: internet traffic used, in MB
* `is_ultra`: plan for the current month (Ultra = 1, Smart = 0)

In [5]:
# get general information about the data in df
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [6]:
# count users on the Ultra and Smart plans (Ultra = 1, Smart = 0)
func_percent = lambda x: round(100*x.count()/df.shape[0])
display(df.pivot_table(index='is_ultra',values=['calls','minutes','messages','mb_used'],aggfunc=['mean','sum','count']))
display(df.pivot_table(index='is_ultra',values=['calls'],aggfunc=[func_percent]))

| is_ultra | mean calls | mean mb_used | mean messages | mean minutes | sum calls | sum mb_used | sum messages | sum minutes | count calls | count mb_used | count messages | count minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58.463437 | 16208.466949 | 33.384029 | 405.942952 | 130315.0 | 36128672.83 | 74413.0 | 904846.84 | 2229 | 2229 | 2229 | 2229 |
| 1 | 73.392893 | 19468.823228 | 49.363452 | 511.224569 | 72292.0 | 19176790.88 | 48623.0 | 503556.2 | 985 | 985 | 985 | 985 |

| is_ultra | `<lambda>` calls (% of rows) |
|---|---|
| 0 | 69.0 |
| 1 | 31.0 |


**Conclusion:**
* The data types are appropriate.
* The data frame has no null values.
* There are 2229 (69%) Smart plan users and 985 (31%) Ultra plan users.
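
The percentages in this conclusion follow directly from the row counts. A minimal sketch of that arithmetic, using only the counts reported in the pivot table (the variable names are illustrative and not part of the notebook):

```python
# Row counts per plan, taken from the pivot table above
counts = {"smart": 2229, "ultra": 985}
total = sum(counts.values())  # 3214 rows in the dataset

# Share of each plan as a percentage, rounded to the nearest integer
shares = {plan: round(100 * n / total) for plan, n in counts.items()}
print(shares)
```

Because the classes are imbalanced, a model that always predicts Smart would already be right for roughly 69% of the rows in the full dataset, which is useful context for the 0.75 accuracy threshold.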

### Step 2
Split the source data into a training set, a validation set, and a test set.

In [7]:
# split the dataset into training, validation, and test sets
df_train, df_valid_test = train_test_split(df, test_size=0.40, random_state=12345)
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)
total_size = len(df)

# train
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
# valid
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
# test
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

print('training set  : {0:.0%}'.format(len(df_train)/total_size),df_train.shape)
print('validation set: {0:.0%}'.format(len(df_valid)/total_size),df_valid.shape)
print('test set      : {0:.0%}'.format(len(df_test)/total_size),df_test.shape)

training set  : 60% (1928, 5)
validation set: 20% (643, 5)
test set      : 20% (643, 5)


In [8]:
# Training dataset
func_percent_train = lambda x: round(100*x.count()/df_train.shape[0])
display(df_train.pivot_table(index='is_ultra',values=['calls'],aggfunc=['count',func_percent_train]))

| is_ultra | count calls | `<lambda>` calls (% of rows) |
|---|---|---|
| 0 | 1335 | 69.0 |
| 1 | 593 | 31.0 |


In [9]:
# Validation dataset
func_percent_valid = lambda x: round(100*x.count()/df_valid.shape[0])
display(df_valid.pivot_table(index='is_ultra',values=['calls'],aggfunc=['count',func_percent_valid]))

| is_ultra | count calls | `<lambda>` calls (% of rows) |
|---|---|---|
| 0 | 454 | 71.0 |
| 1 | 189 | 29.0 |


In [10]:
# Test dataset
func_percent_test = lambda x: round(100*x.count()/df_test.shape[0])
display(df_test.pivot_table(index='is_ultra',values=['calls'],aggfunc=['count',func_percent_test]))

| is_ultra | count calls | `<lambda>` calls (% of rows) |
|---|---|---|
| 0 | 440 | 68.0 |
| 1 | 203 | 32.0 |


**Conclusion**
* The main dataset is split into three sets: training, validation, and test.

Split between Smart and Ultra plan users:
- Main dataset: Smart 69%, Ultra 31%
- Training set: same split as the main dataset
- Validation set: Smart 71%, Ultra 29%
- Test set: Smart 68%, Ultra 32%

The class shares differ by a few percentage points across the three sets because the split is not stratified. The validation set has the highest share of Smart plan users (71%).
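
One way to keep the class shares identical across the three sets is the `stratify` argument of `train_test_split`. The sketch below is an illustration on synthetic labels (1000 rows with 30% positives, an assumed stand-in rather than the Megaline data) and is not how the notebook performed its split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target column: 1000 rows, 30% positives (assumed values)
target = np.array([1] * 300 + [0] * 700)
features = np.arange(len(target)).reshape(-1, 1)

# 60/20/20 split; stratify keeps the class ratio the same in every subset
X_train, X_rest, y_train, y_rest = train_test_split(
    features, target, test_size=0.4, random_state=12345, stratify=target
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest
)

for name, y in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(f"{name}: n={len(y)}, share of class 1 = {y.mean():.2f}")
```

With stratification, validation and test accuracy are measured on the same class mix the model was trained on, which makes the scores easier to compare.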

In [11]:
# create a dataframe to store each model, its accuracy scores, and whether it meets or exceeds the accuracy threshold
column_names = ["function","hyperparameters","accuracy_score validation","accuracy_score test",'above_threshold?']
df_results = pd.DataFrame(columns = column_names)
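
An alternative way to collect results is to gather plain dictionaries in a list and build the DataFrame once at the end. The sketch below assumes a hypothetical helper named `record_result` and uses placeholder scores; neither comes from this notebook:

```python
import pandas as pd

COLUMNS = ["function", "hyperparameters", "accuracy_score validation",
           "accuracy_score test", "above_threshold?"]


def record_result(rows, name, params, acc_valid, acc_test, threshold=0.75):
    """Append one result row (as a dict) to a plain Python list."""
    rows.append({
        "function": name,
        "hyperparameters": params,
        "accuracy_score validation": acc_valid,
        "accuracy_score test": acc_test,
        "above_threshold?": acc_test >= threshold,
    })


rows = []
# Scores below are illustrative placeholders, not results from this notebook
record_result(rows, "ModelA", "None", 0.70, 0.72)
record_result(rows, "ModelB", "max_depth=3", 0.78, 0.77)

# Build the DataFrame once, after all rows are collected
df_demo = pd.DataFrame(rows, columns=COLUMNS)
print(df_demo.round(4))
```

Collecting rows first avoids growing a DataFrame row by row and gives each row a distinct index.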

### Step 3
Check the quality of different models by changing their hyperparameters. Briefly describe the findings of this study.

In [12]:
# Decision Tree Classifier with default hyperparameters

model = DecisionTreeClassifier()
model.fit(features_train, target_train)

predictions_valid = model.predict(features_valid)
predictions_test = model.predict(features_test)

accuracy_valid = accuracy_score(target_valid, predictions_valid)
accuracy_test = accuracy_score(target_test, predictions_test)

print(accuracy_valid, accuracy_test)

0.7231726283048211 0.744945567651633


In [13]:
# check model quality against the accuracy threshold
accuracy_threshold = 0.75
accu_threshold = accuracy_test >= accuracy_threshold
row = pd.DataFrame([['DecisionTree', 'None', accuracy_valid, accuracy_test, accu_threshold]], columns=df_results.columns)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_results = pd.concat([df_results, row])
df_results.round(4)

| | function | hyperparameters | accuracy_score validation | accuracy_score test | above_threshold? |
|---|---|---|---|---|---|
| 0 | DecisionTree | None | 0.7232 | 0.7449 | False |


**Conclusion:**
* The Decision Tree Classifier with default hyperparameters scores 72.3% on the validation set and 74.5% on the test set.
* The accuracy threshold is 0.75, so this model does not yet reach it.
* To reach that accuracy, the hyperparameters need to be tuned.

### Step 4
Check the quality of the model using the test set.

In [14]:
# Decision Tree Classifier, tuning the 'max_depth' hyperparameter on the validation set

highest_score=0
for depth in range(1,6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)

    accuracy_valid  = model.score(features_valid, target_valid)
    
    print('--------------------------------------------------------')
    print("max_depth =",depth,": ", 'accuracy score:',accuracy_valid)

    if (accuracy_valid > highest_score):
        highest_score = accuracy_valid
        highest_depth = depth
 
print('Highest depth: ', highest_depth,'Highest Accuracy Score', highest_score)

model = DecisionTreeClassifier(random_state=12345, max_depth=highest_depth)
model.fit(features_train, target_train)

accuracy_test = model.score(features_test, target_test)
print('Test Data Accuracy Score', accuracy_test,'Depth: ', highest_depth)

--------------------------------------------------------
max_depth = 1 :  accuracy score: 0.7542768273716952
--------------------------------------------------------
max_depth = 2 :  accuracy score: 0.7822706065318819
--------------------------------------------------------
max_depth = 3 :  accuracy score: 0.7853810264385692
--------------------------------------------------------
max_depth = 4 :  accuracy score: 0.7791601866251944
--------------------------------------------------------
max_depth = 5 :  accuracy score: 0.7791601866251944
Highest depth:  3 Highest Accuracy Score 0.7853810264385692
Test Data Accuracy Score 0.7791601866251944 Depth:  3


In [15]:
# check model quality against the accuracy threshold
accu_threshold = accuracy_test >= accuracy_threshold
row = pd.DataFrame([['DecisionTree', 'max_depth =' + str(highest_depth), highest_score, accuracy_test, accu_threshold]], columns=df_results.columns)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_results = pd.concat([df_results, row])
df_results.round(4)

| | function | hyperparameters | accuracy_score validation | accuracy_score test | above_threshold? |
|---|---|---|---|---|---|
| 0 | DecisionTree | None | 0.7232 | 0.7449 | False |
| 0 | DecisionTree | max_depth =3 | 0.7854 | 0.7792 | True |


**Conclusion:**
* The Decision Tree Classifier reaches its highest validation accuracy, 78.5%, at max_depth = 3.
* With max_depth = 3, the test accuracy is 77.9%.
* On the test set, the model exceeds the 0.75 accuracy threshold (0.7792).

In [16]:
# Random Forest Classifier, tuning the 'n_estimators' hyperparameter on the validation set

highest_score = 0.0
for forest_estimator in (10, 50, 100, 200, 300, 400):
    model = RandomForestClassifier(random_state=12345, n_estimators=forest_estimator)
    model.fit(features_train, target_train)
    accuracy_valid = model.score(features_valid, target_valid)
    print('--------------------------------------------------------')
    print('Estimator: ', forest_estimator,' Accuracy Score', accuracy_valid)
        
    if (accuracy_valid > highest_score):
        highest_score = accuracy_valid
        highest_estimator = forest_estimator
 
print('Highest Estimator: ', highest_estimator,'Highest Accuracy Score', highest_score)
 
model = RandomForestClassifier(random_state=12345, n_estimators=highest_estimator)
model.fit(features_train, target_train)

accuracy_test = model.score(features_test, target_test)
print('Test Data Accuracy Score', accuracy_test,'Estimator: ', highest_estimator)

--------------------------------------------------------
Estimator:  10  Accuracy Score 0.7853810264385692
--------------------------------------------------------
Estimator:  50  Accuracy Score 0.7916018662519441
--------------------------------------------------------
Estimator:  100  Accuracy Score 0.7853810264385692
--------------------------------------------------------
Estimator:  200  Accuracy Score 0.7869362363919129
--------------------------------------------------------
Estimator:  300  Accuracy Score 0.7869362363919129
--------------------------------------------------------
Estimator:  400  Accuracy Score 0.7853810264385692
Highest Estimator:  50 Highest Accuracy Score 0.7916018662519441
Test Data Accuracy Score 0.7931570762052877 Estimator:  50


In [17]:
# check model quality against the accuracy threshold
accu_threshold = accuracy_test >= accuracy_threshold
row = pd.DataFrame([['RandomForest', 'n_estimators =' + str(highest_estimator), highest_score, accuracy_test, accu_threshold]], columns=df_results.columns)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_results = pd.concat([df_results, row])
df_results.round(4)

| | function | hyperparameters | accuracy_score validation | accuracy_score test | above_threshold? |
|---|---|---|---|---|---|
| 0 | DecisionTree | None | 0.7232 | 0.7449 | False |
| 0 | DecisionTree | max_depth =3 | 0.7854 | 0.7792 | True |
| 0 | RandomForest | n_estimators =50 | 0.7916 | 0.7932 | True |


**Conclusion:**
* The Random Forest Classifier reaches its highest validation accuracy, 79.2%, at n_estimators = 50.
* With n_estimators = 50, the test accuracy is 79.3%.
* The test result is close to the validation result.
* On the test set, the model exceeds the 0.75 accuracy threshold (0.7932).
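
The loop above varies only `n_estimators`. The same validation-based selection can cover two hyperparameters at once. The following is a minimal sketch on synthetic data from `make_classification` (an assumed stand-in, not the Megaline dataset), with a small illustrative grid of values:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, roughly 70/30 class balance
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

best_score, best_params = 0.0, None
# Try every combination of the two hyperparameters and keep the best on validation
for n_estimators, max_depth in product((10, 50), (3, 5, None)):
    model = RandomForestClassifier(random_state=12345,
                                   n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score, best_params = score, (n_estimators, max_depth)

print("best (n_estimators, max_depth):", best_params,
      "validation accuracy:", round(best_score, 4))
```

As in the notebook, the test set would be scored only once, with the combination chosen on the validation set.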

In [18]:
# LogisticRegression with default hyperparameters (liblinear solver)

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)

predictions_valid = model.predict(features_valid)
predictions_test = model.predict(features_test)

accuracy_valid = accuracy_score(target_valid, predictions_valid)
accuracy_test = accuracy_score(target_test, predictions_test)

print(accuracy_valid, accuracy_test)

0.7589424572317263 0.7402799377916018


In [19]:
# check model quality against the accuracy threshold
accu_threshold = accuracy_test >= accuracy_threshold
row = pd.DataFrame([['LogisticRegression', 'None', accuracy_valid, accuracy_test, accu_threshold]], columns=df_results.columns)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_results = pd.concat([df_results, row])
df_results.round(4)

| | function | hyperparameters | accuracy_score validation | accuracy_score test | above_threshold? |
|---|---|---|---|---|---|
| 0 | DecisionTree | None | 0.7232 | 0.7449 | False |
| 0 | DecisionTree | max_depth =3 | 0.7854 | 0.7792 | True |
| 0 | RandomForest | n_estimators =50 | 0.7916 | 0.7932 | True |
| 0 | LogisticRegression | None | 0.7589 | 0.7403 | False |


**Conclusion:**
* Logistic Regression scores 75.9% on the validation set and 74.0% on the test set.
* On the test set, the model falls just short of the 0.75 accuracy threshold (0.7403).
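
The features here sit on very different scales (`calls` in the tens, `mb_used` in the tens of thousands), and regularized logistic regression can be sensitive to that. A possible next experiment is to standardize the features inside a pipeline. The sketch below uses synthetic data as an assumed stand-in, so it does not show whether scaling would lift the score on the Megaline data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the Megaline dataset
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
# Stretch the columns to very different scales, loosely mimicking calls vs. mb_used
X = X * np.array([1.0, 10.0, 1.0, 1000.0])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# The scaler is fitted on the training data only, then applied to validation data
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="liblinear"),
)
scaled_logreg.fit(X_train, y_train)
accuracy_scaled = scaled_logreg.score(X_valid, y_valid)
print("validation accuracy with scaling:", round(accuracy_scaled, 4))
```

Wrapping the scaler and the model in one pipeline keeps validation and test statistics out of the fitting step.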

### Step 5
Additional task: perform a sanity check on the model. This data is more complex than the data you have worked with before, so this is not an easy task. We will study it in more depth later.

In [20]:
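# Sanity check with DummyClassifier: always predict the most frequent class

dummy_clf = DummyClassifier(strategy='most_frequent', random_state=1234)
dummy_clf.fit(features_train, target_train)

# score the dummy model itself (not `model`) on the test set
accuracy_dummy = dummy_clf.score(features_test, target_test)
print(round(accuracy_dummy, 4))

0.6843


**Conclusion:**
* The sanity check uses DummyClassifier with the `most_frequent` strategy, which always predicts Smart (0). Its test accuracy is about 68.4%, which is the share of Smart users in the test set (440 of 643).
* The tuned Decision Tree (0.7792) and Random Forest (0.7932) clearly beat this baseline. The default Decision Tree (0.7449) and Logistic Regression (0.7403) beat it by a smaller margin.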
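
A most-frequent baseline can also be checked by hand, because its accuracy is just the share of the training set's majority class in the test set. The sketch below uses only the class counts from the pivot tables in Step 2; the variable names are illustrative:

```python
# Class counts taken from the pivot tables in Step 2 (0 = Smart, 1 = Ultra)
train_counts = {0: 1335, 1: 593}
test_counts = {0: 440, 1: 203}

# The "most frequent" strategy always predicts the majority class of the training set
majority_class = max(train_counts, key=train_counts.get)

# Its test accuracy is simply the share of that class in the test set
baseline_accuracy = test_counts[majority_class] / sum(test_counts.values())
print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should score above this value on the test set.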

In [21]:
print('Classification Model Analysis Report')
display(df_results.round(4))

Classification Model Analysis Report


| | function | hyperparameters | accuracy_score validation | accuracy_score test | above_threshold? |
|---|---|---|---|---|---|
| 0 | DecisionTree | None | 0.7232 | 0.7449 | False |
| 0 | DecisionTree | max_depth =3 | 0.7854 | 0.7792 | True |
| 0 | RandomForest | n_estimators =50 | 0.7916 | 0.7932 | True |
| 0 | LogisticRegression | None | 0.7589 | 0.7403 | False |


## Conclusion

* The dataset was split into three sets: training, validation, and test.
    - training set  : 60% (1928, 5)
    - validation set: 20% (643, 5)
    - test set      : 20% (643, 5)
* Split between Smart and Ultra plan users:
    - Main dataset: Smart 69%, Ultra 31%
    - Training set: same split as the main dataset
    - Validation set: Smart 71%, Ultra 29%
    - Test set: Smart 68%, Ultra 32%
* The validation set has the highest share of Smart plan users (71%) of the three sets.
* The Decision Tree with default hyperparameters does not meet the threshold: its test accuracy is 74.5%.
* The highest test accuracy belongs to RandomForest (n_estimators = 50) at 79.3%.
* The lowest test accuracy belongs to Logistic Regression at 74.0%.