## Model Prediktif Megaline

Dari data pelanggan yang telah beralih menggunakan salah satu dari dua paket Megaline yang lebih baru `Smart atau Ultra`, kami akan membagi data menjadi rangkaian pelatihan, validasi, dan pengujian untuk membuat sejumlah model prediktif dan memilih model terbaik yang akan membantu dalam menjual paket baru kepada pelanggan lama yang masih menggunakan paket lama.

Ambang batas minimum untuk tingkat akurasi prediksi ditetapkan pada `0,75` meskipun kami akan membuat sejumlah model berbeda untuk mencoba dan mendapatkan akurasi setinggi mungkin supaya dapat lebih membantu Megaline dalam penjualan dan pemasaran mereka.

In [1]:
#importing libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

In [2]:
#importing data

user_behav = pd.read_csv('/datasets/users_behavior.csv')

## Processing Data

In [3]:
#process data

user_behav.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
user_behav.head(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


In [5]:
user_behav.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

Terlihat dari data diatas tidak terdapat missing value

## Memperbaiki Tipe Data

In [6]:
user_behav['calls'] = user_behav['calls'].astype('int64')
user_behav['messages'] = user_behav['messages'].astype('int64')
user_behav.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


Tipe data untuk kolom `calls` dan `messages` telah diperbaiki ke tipe `int64`, dan hal tersebut dapat mempemudahkan untuk membuat model prediktif Megaline.

## Mempersiapkan Model

Metrik evaluasi tidak memerlukan kolom baru seperti kolom `is_ultra`, dengan Boolean 1 atau 0 menunjukkan hal tersebut. Ini akan digunakan sebagai target pemasangan model yang akan datang.

In [7]:
# Split training (60%), validation (20%), and testing (20%) sets

user_behav_train, user_behav_valid = train_test_split(user_behav, test_size=0.2, random_state=759638)

user_behav_train, user_behav_test = train_test_split(user_behav_train, test_size=0.25, random_state=759638)

## Mendeklerasi Variabel

Kami telah membagi data menjadi tiga set; `pelatihan, validasi, dan pengujian` dengan rasio 3:1:1. Pertama kita mengambil 20% data untuk membuat set validasi, kemudian dari 80% sisanya kita mengambil 25% untuk membuat set pengujian. Ini mengikuti rasio 3:1:1 atau 60%, 20%, dan 20% karena setelah pemisahan pertama, 80% dari data awal tetap ada, dan 0,8 * 0,25 = 0,2 mewakili 20% bagian kedua dari keseluruhan himpunan data

In [8]:
#Mendeklerasikan variabel untuk features dan targets


features_train = user_behav_train.drop(['is_ultra'], axis=1)
target_train = user_behav_train['is_ultra']
features_valid = user_behav_valid.drop(['is_ultra'], axis=1)
target_valid = user_behav_valid['is_ultra']
features_test = user_behav_test.drop(['is_ultra'], axis=1)
target_test = user_behav_test['is_ultra']

Kumpulan dari data pelatihan, validasi, dan pengujian telah dibagi menjadi beberapa fitur; yang mencakup jumlah `panggilan, menit, SMS, dan penggunaan data` serta hasilnya; apakah paketnya `Smart atau Ultra`.

##  Test Model


### Model A - Decision Tree

In [9]:
best_model = None
best_result = 0
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=759638, max_depth=depth) #membuat kedalaman model tertentu
    model.fit(features_train, target_train) # train the model
    result = model.score(features_valid, target_valid) # menghitung tingkat akurasi
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth
        
predictions_test = model.predict(features_test)
test_result = accuracy_score(target_test, predictions_test)
        
print("Depth of best model:", best_depth)
print("Accuracy of the decision tree model on the validation set:", best_result)
print("Accuracy of the decision tree model on the test set:", test_result)

Depth of best model: 3
Accuracy of the decision tree model on the validation set: 0.7978227060653188
Accuracy of the decision tree model on the test set: 0.8133748055987559


### Model B - Random Forest

In [10]:
best_score = 0
best_est = 0
for est in range(1, 11): #memilih rentan hyperparameter 
    model = RandomForestClassifier(random_state=759638, n_estimators=est) # mengatur nomor pohon
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) #menghitung nilai akurasi dari validation set
    if score > best_score:
        best_score = score
        best_est = est

print("n_estimator value of best model:", best_est)
print("Accuracy of the random forest on the model validation set:", best_score)
        
final_model = RandomForestClassifier(random_state=759638, n_estimators=est) # merubah n_estimators ke best model
final_model.fit(features_train, target_train)
test_score = model.score(features_test, target_test)

print("Accuracy of the random forest model on the test set:", test_score)

n_estimator value of best model: 4
Accuracy of the random forest on the model validation set: 0.7776049766718507
Accuracy of the random forest model on the test set: 0.8009331259720062


### Model C - Logistic Regression

In [11]:
model = LogisticRegression(random_state=759638, solver='liblinear')  # inisialisasi logistic regression constructor with parameters random_state=54321 and solver='liblinear'
model.fit(features_train, target_train)  # train model on training set
score_valid = model.score(features_valid, target_valid) # menghitung nilai akurasi dari validation set
score_test = model.score(features_test, target_test) # menghitung nilai akurasi dari validation set

print("Accuracy of the logistic regression model on the validation set:", score_valid)
print("Accuracy of the logistic regression model on the test set:", score_test)

Accuracy of the logistic regression model on the validation set: 0.7122861586314152
Accuracy of the logistic regression model on the test set: 0.7169517884914464


### Model D - Decision Tree Regression 

In [12]:
best_model = None
best_score = 10000
best_depth = 0
for depth in range(1, 6): # memilih rentan hyperparameter
    model = DecisionTreeRegressor(max_depth=depth, random_state=759638) # train model on training set
    model.fit(features_train, target_train) # train model on training set
    predictions_valid = model.predict(features_valid) # mendapatkan model predictions dari validation set
    result = mean_squared_error(target_valid, predictions_valid)**0.5 # menghitung RMSE dari validation set
    if result < best_result:
        best_model = model
        best_result = result
        best_depth = depth

model = RandomForestClassifier(random_state=759638, n_estimators=best_depth) 
model.fit(features_train, target_train)
score_valid = model.score(features_valid, target_valid)
score_test = model.score(features_test, target_test)
        
print("Best model depth:", best_depth)
print("Accuracy of the decision tree regression model on the validation set:", score_valid)
print("Accuracy of the decision tree regression model on the test set:", score_test)

Best model depth: 5
Accuracy of the decision tree regression model on the validation set: 0.7465007776049767
Accuracy of the decision tree regression model on the test set: 0.7729393468118196


### Model E - Random Forest Regression

In [13]:
best_model = None
best_result = 10000
best_est = 0
best_depth = 0
for est in range(10, 51, 10):
    for depth in range (1, 11):
        model = RandomForestRegressor(random_state=759638, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train) # train model on training set
        predictions_valid = model.predict(features_valid) # mendapatkan model predictions dari validation set
        result = mean_squared_error(target_valid, predictions_valid)**0.5 
        if result < best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth

model = RandomForestRegressor(random_state=759638, n_estimators=est, max_depth=depth)
model.fit(features_train, target_train)
score_valid = model.score(features_valid, target_valid)
score_test = model.score(features_test, target_test)

print("Best model depth:", best_depth)
print("n_estimator value of best model:", best_est)
print("Accuracy of the random forest regression model on the validation set:", score_valid)
print("Accuracy of the random forest regression model on the test set:", score_test)

Best model depth: 7
n_estimator value of best model: 50
Accuracy of the random forest regression model on the validation set: 0.27790191886395654
Accuracy of the random forest regression model on the test set: 0.3308555611521775


### Model F - Linear Regression

In [14]:
model = LinearRegression()
model.fit(features_train, target_train) # train model on training set
predictions_valid = model.predict(features_valid) # mendapatkan prediksi model dari validation set
score_valid = model.score(features_valid, target_valid)
score_test = model.score(features_test, target_test)

print("Accuracy of the linear regression model on the validation set:", score_valid)
print("Accuracy of the linear regression model on the test set:", score_test)

Accuracy of the linear regression model on the validation set: 0.07155124406025604
Accuracy of the linear regression model on the test set: 0.06460707768249263


# Kesimpulan

Setelah menguji keakuratan dari enam model yang berbeda, ditentukan model dengan akurasi tertinggi pada `Model A - Decision Tree`. Dengan kedalaman 3 pohon, kami bisa mendapatkan akurasi sebesar `0,7978227060653188` dengan kumpulan data validasi, dan akurasi yang lebih tinggi yaitu `0,8133748055987559` dengan kumpulan data pengujian.

Karena model ini dikembangkan menggunakan penggunaan paket dari pelanggan yang telah beralih ke paket baru Megaline, model ini seharusnya dapat memberikan prediksi yang cukup akurat mengenai paket `Smart atau Ultra`, yang harus direkomendasikan kepada pelanggan lama yang masih menggunakan paket lama. Memprediksi secara akurat paket mana yang lebih menarik bagi pelanggan tertentu akan membantu meyakinkan mereka untuk beralih ke alternatif yang lebih modern dibandingkan pilihan mereka saat ini.