<a href="https://colab.research.google.com/github/ikrahmi/Artificial-Intelligence/blob/main/Tugas_Kel3_Real_time_AQI_Prediction_Model_using_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Machine learning model for real-time AQI prediction in Jakarta**

**1. Import Packages**

In [77]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [78]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.svm import SVC

**2. Import Dataset**

The dataset contains the Air Pollutant Standard Index (Indeks Standar Pencemar Udara, ISPU) for DKI Jakarta from 2010 to 2021.

In [79]:
data = pd.read_csv('/content/drive/MyDrive/Semester Satu/DataScience/AQI_jakarta_2010_2021.csv')
data.head()

Unnamed: 0,tanggal,stasiun,pm10,so2,co,o3,no2,max,critical,categori
0,1/1/2010,DKI1 (Bunderan HI),60.0,4.0,73.0,27.0,14.0,73,CO,SEDANG
1,1/2/2010,DKI1 (Bunderan HI),32.0,2.0,16.0,33.0,9.0,33,O3,BAIK
2,1/3/2010,DKI1 (Bunderan HI),27.0,2.0,19.0,20.0,9.0,27,PM10,BAIK
3,1/4/2010,DKI1 (Bunderan HI),22.0,2.0,16.0,15.0,6.0,22,PM10,BAIK
4,1/5/2010,DKI1 (Bunderan HI),25.0,2.0,17.0,15.0,8.0,25,PM10,BAIK


In [80]:
data.shape          # check the dimensions of the data

(4272, 10)

**3. Exploratory Data Analysis (EDA)**

*3.1. Drop the features that do not directly influence the target: tanggal, stasiun, max, and critical*

In [81]:
data.drop(columns=['tanggal','stasiun', 'max','critical'], inplace=True)

*3.2. Check the metadata*

In [82]:
data.dtypes

pm10        float64
so2         float64
co          float64
o3          float64
no2         float64
categori     object
dtype: object

In [83]:
# check the classes of the target variable
data.categori.unique()

array(['SEDANG', 'BAIK', 'TIDAK SEHAT'], dtype=object)

*3.3. Check whether the data is imbalanced*

In [84]:
data.categori.value_counts()

SEDANG         3064
BAIK           1054
TIDAK SEHAT     154
Name: categori, dtype: int64
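The counts above can be turned into class shares to make the degree of imbalance explicit. The following is a minimal sketch that only re-uses the three counts printed by `value_counts()`; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"SEDANG": 3064, "BAIK": 1054, "TIDAK SEHAT": 154})

# Share of each class in the dataset
proportions = counts / counts.sum()
print(proportions.round(3))

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

The smallest class, TIDAK SEHAT, makes up under 4% of the rows, which is why a stratified split is a reasonable choice later on.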

*3.4. Check for missing values*

In [85]:
data.isna().sum()

pm10        102
so2          65
co           40
o3           93
no2          83
categori      0
dtype: int64

*3.5. Check for duplicate rows and drop them if any exist*

In [86]:
data.duplicated().sum()         # count duplicate rows

3

In [87]:
data = data.drop_duplicates()   # drop duplicate rows

**4. Splitting Dataset**

Split the dataset into features (X) and target (y), then split again into training data and test data.

*For regression, use a shuffle split; for classification, use a stratified shuffle split. Commonly used values for the shuffling seed (random_state) are 0, 1, or 42.*

In [88]:
# split the data into X and y
X = data.drop(columns='categori')
y = data.categori

# split X and y into train and test sets
# stratify=y keeps the class proportions of y (the target) the same in both splits;
# random_state fixes the shuffling so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, stratify=y,  random_state=42)

# show the shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3415, 5), (854, 5), (3415,), (854,))
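To illustrate what `stratify=y` does, the sketch below runs the same `train_test_split` call on a small, hypothetical label set (not the real data) with a similar imbalance. The names `X_toy` and `y_toy` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels that mimic the class imbalance (not the real data)
y_toy = pd.Series(["SEDANG"] * 70 + ["BAIK"] * 25 + ["TIDAK SEHAT"] * 5)
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Class shares should be nearly identical in both splits
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without stratification, a rare class could end up over- or under-represented in the test set purely by chance.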

**5. Preprocessor dan Pipeline**

In [89]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

*5.1. Prepare the pipelines for numerical and categorical data*

In [90]:
# pipeline for numerical data: imputation followed by scaling

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())
])
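As a rough illustration of what the two steps do, the sketch below applies an identically configured pipeline to a tiny, hypothetical array with one missing value. The array values and the name `toy_pipeline` are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

# Hypothetical values; the NaN stands in for a missing sensor reading
toy = np.array([[10.0, 1.0],
                [np.nan, 3.0],
                [30.0, 5.0]])

# The NaN is replaced by the column mean (20.0), then each column is scaled to [0, 1]
transformed = toy_pipeline.fit_transform(toy)
print(transformed)
```

Because the imputer and scaler live inside a pipeline, their statistics are learned from the training folds only when the pipeline is later used in cross-validation.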

In [15]:
'''
# pipeline for categorical data: imputation followed by encoding

kategorikal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoding', OneHotEncoder())
])
'''

"\n# pipeline for categorical data: imputation followed by encoding\n\nkategorikal_pipeline = Pipeline([\n    ('imputer', SimpleImputer(strategy='most_frequent')),\n    ('encoding', OneHotEncoder())\n])\n"

*5.2. Prepare a preprocessor pipeline that holds the numerical and categorical pipelines*

In [91]:
pipeline_preprocessor = ColumnTransformer([
    ('numeric', numerical_pipeline,['pm10','so2','co','o3','no2'])       # columns routed to the numerical pipeline
])

# note: the dataset has no categorical features, so the categorical pipeline is not needed
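A `ColumnTransformer` selects columns by name, and with the default `remainder='drop'` any column that is not listed is discarded. The sketch below shows this on a hypothetical three-column frame; the data and the name `toy_preprocessor` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame with one extra column that is not listed in the transformer
toy = pd.DataFrame({
    "pm10": [20.0, 40.0, 60.0],
    "so2": [2.0, 4.0, 6.0],
    "stasiun": ["A", "B", "C"],
})

toy_preprocessor = ColumnTransformer([
    ("numeric", MinMaxScaler(), ["pm10", "so2"]),
])

# Only the two listed columns survive; 'stasiun' is dropped
out = toy_preprocessor.fit_transform(toy)
print(out.shape)
```

This behaviour also means that extra columns in data passed to the pipeline later are ignored rather than causing an error.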

*5.3. Prepare a model pipeline that holds the preprocessor pipeline and the algorithm to be used*

In [92]:
pipeline_model = Pipeline([
    ('preprocessor', pipeline_preprocessor),
    ('algoritma', SVC())
])

**6. SVM parameter tuning with GridSearchCV and model training**

In [93]:
# check which parameters of the model can be tuned

pipeline_model.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(transformers=[('numeric',
                                    Pipeline(steps=[('imputer', SimpleImputer()),
                                                    ('scaler', MinMaxScaler())]),
                                    ['pm10', 'so2', 'co', 'o3', 'no2'])])),
  ('algoritma', SVC())],
 'verbose': False,
 'preprocessor': ColumnTransformer(transformers=[('numeric',
                                  Pipeline(steps=[('imputer', SimpleImputer()),
                                                  ('scaler', MinMaxScaler())]),
                                  ['pm10', 'so2', 'co', 'o3', 'no2'])]),
 'algoritma': SVC(),
 'preprocessor__n_jobs': None,
 'preprocessor__remainder': 'drop',
 'preprocessor__sparse_threshold': 0.3,
 'preprocessor__transformer_weights': None,
 'preprocessor__transformers': [('numeric',
   Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', MinMaxScaler())]),
   ['pm10', 'so2', 'co', 'o3', 'no2

In [94]:
parameter = {
    'algoritma__gamma': [1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03],
    'algoritma__C': [1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]
    }

model = GridSearchCV(pipeline_model, parameter,cv=3, n_jobs=-1, verbose=1)
model.fit(X_train, y_train)

# note: model.fit result --> 49 candidates (7 gamma values x 7 C values = 49)
#                        --> 147 fits (49 candidates x 3 CV folds = 147)

Fitting 3 folds for each of 49 candidates, totalling 147 fits
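The candidate and fit counts reported above can be reproduced without training anything. This is a small sketch using scikit-learn's `ParameterGrid`; it also assumes `np.logspace(-3, 3, 7)` as a shorter way to write the same seven values.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# np.logspace(-3, 3, 7) yields 1e-3, 1e-2, ..., 1e3, the same values as the lists above
values = np.logspace(-3, 3, 7)

grid = ParameterGrid({
    "algoritma__gamma": values,
    "algoritma__C": values,
})

n_candidates = len(grid)
n_fits = n_candidates * 3  # cv=3
print(n_candidates, n_fits)
```

Counting the grid first is a cheap way to estimate how long a search will take before launching it.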


**7. Model Evaluation**

In [95]:
model.best_params_      # show the best parameter combination found by GridSearchCV

{'algoritma__C': 1000.0, 'algoritma__gamma': 1.0}

In [96]:
# show the training, validation, and testing scores

print(f'Training score: {model.score(X_train,y_train)}')
print(f'Validation score: {model.best_score_}')
print(f'Testing score: {model.score(X_test, y_test)}')

Training score: 0.9628111273792094
Validation score: 0.9540265178809766
Testing score: 0.949648711943794
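With the default scoring, these scores are accuracies, and with imbalanced classes accuracy alone can hide errors on the rare class. The sketch below shows how per-class results could be inspected; the label lists are hypothetical and stand in for `y_test` and `model.predict(X_test)`.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["BAIK", "SEDANG", "TIDAK SEHAT"]

# Hypothetical true and predicted labels, for illustration only
y_true = ["SEDANG", "SEDANG", "SEDANG", "BAIK", "BAIK", "TIDAK SEHAT", "TIDAK SEHAT"]
y_hat = ["SEDANG", "SEDANG", "SEDANG", "BAIK", "SEDANG", "SEDANG", "TIDAK SEHAT"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Precision, recall, and F1 per class
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```

On the real data, the same two calls would presumably take `y_test` and the model's predictions for `X_test`.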


*Full GridSearchCV report, sorted by test-score rank*

In [97]:
gs_report = pd.DataFrame(model.cv_results_).sort_values("rank_test_score")
gs_report

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_algoritma__C,param_algoritma__gamma,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
45,0.335465,0.014763,0.080524,0.019518,1000.0,1.0,"{'algoritma__C': 1000.0, 'algoritma__gamma': 1.0}",0.953468,0.954306,0.954306,0.954027,0.000395,1
32,0.13907,0.026193,0.072481,0.023714,10.0,10.0,"{'algoritma__C': 10.0, 'algoritma__gamma': 10.0}",0.955224,0.956063,0.949912,0.953733,0.002723,2
39,0.115598,0.003624,0.044534,0.001652,100.0,10.0,"{'algoritma__C': 100.0, 'algoritma__gamma': 10.0}",0.95698,0.948155,0.94464,0.949925,0.005191,3
38,0.108619,0.00489,0.05441,0.000991,100.0,1.0,"{'algoritma__C': 100.0, 'algoritma__gamma': 1.0}",0.951712,0.943761,0.950791,0.948755,0.003551,4
46,0.50807,0.054336,0.068749,0.020022,1000.0,10.0,"{'algoritma__C': 1000.0, 'algoritma__gamma': 1...",0.941176,0.950791,0.945518,0.945829,0.003931,5
25,0.197076,0.011404,0.148206,0.00309,1.0,10.0,"{'algoritma__C': 1.0, 'algoritma__gamma': 10.0}",0.947322,0.936731,0.942004,0.942019,0.004324,6
33,0.493691,0.140609,0.261052,0.072221,10.0,100.0,"{'algoritma__C': 10.0, 'algoritma__gamma': 100.0}",0.934153,0.943761,0.93761,0.938508,0.003974,7
26,0.618342,0.027681,0.299987,0.01394,1.0,100.0,"{'algoritma__C': 1.0, 'algoritma__gamma': 100.0}",0.933275,0.941125,0.929701,0.9347,0.004771,8
31,0.161493,0.070066,0.102882,0.029615,10.0,1.0,"{'algoritma__C': 10.0, 'algoritma__gamma': 1.0}",0.937665,0.92355,0.941125,0.934113,0.007602,9
44,0.216797,0.031513,0.118834,0.022533,1000.0,0.1,"{'algoritma__C': 1000.0, 'algoritma__gamma': 0.1}",0.936787,0.92355,0.941125,0.933821,0.007475,10


**8. Using the model to predict new data**

New data to be predicted must have the same format (data structure) as the data used to train the model. In this case, that is a pandas DataFrame with the same feature columns.
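One way to check this requirement before calling `predict` is to compare the column names of the new data with the feature list used in the preprocessor. The helper `missing_columns` below is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Feature columns used to train the model (from the ColumnTransformer above)
EXPECTED_COLUMNS = ["pm10", "so2", "co", "o3", "no2"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Hypothetical new sample that lacks the 'no2' column
incomplete = pd.DataFrame({"pm10": [38], "so2": [29], "co": [6], "o3": [31]})
print(missing_columns(incomplete))
```

An empty list means every required feature is present and the frame can be passed to the model.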

In [98]:
# 9 new samples to predict

data_pred=pd.read_csv('/content/drive/MyDrive/Semester Satu/DataScience/new_data.csv')
data_pred

Unnamed: 0,pm10,so2,co,o3,no2
0,38,29,6,31,13
1,27,27,7,47,7
2,44,25,7,40,13
3,30,24,4,32,7
4,38,24,6,31,9
5,41,23,13,46,13
6,35,22,6,39,10
7,37,26,16,17,10
8,47,16,27,22,12


In [99]:
# run the prediction

hasil_prediksi= model.predict(data_pred)
hasil_prediksi

array(['SEDANG', 'BAIK', 'SEDANG', 'BAIK', 'BAIK', 'BAIK', 'BAIK', 'BAIK',
       'BAIK'], dtype=object)

In [100]:
# add the predictions to the dataframe

data_pred['categori'] = model.predict(data_pred)
data_pred

Unnamed: 0,pm10,so2,co,o3,no2,categori
0,38,29,6,31,13,SEDANG
1,27,27,7,47,7,BAIK
2,44,25,7,40,13,SEDANG
3,30,24,4,32,7,BAIK
4,38,24,6,31,9,BAIK
5,41,23,13,46,13,BAIK
6,35,22,6,39,10,BAIK
7,37,26,16,17,10,BAIK
8,47,16,27,22,12,BAIK


**USING THE RANDOM FOREST ALGORITHM**

In [101]:
from sklearn.ensemble import RandomForestClassifier

In [104]:
pipeline_model = Pipeline([
    ('preprocessor', pipeline_preprocessor),
    ('algoritma', RandomForestClassifier(n_jobs=-1, random_state=42))
])

In [107]:
parameter = {
    'algoritma__n_estimators': [100,150,200],                         # number of trees (decision trees) in the forest
    'algoritma__max_depth': [20,50,80],                               # maximum depth of each tree
    'algoritma__max_features':[0.3,0.6,0.8],                          # fraction of features considered at each split
    'algoritma__min_samples_leaf':[1,5,10]                            # minimum number of samples required at a leaf
    }

model =  GridSearchCV(pipeline_model, parameter,cv=3, n_jobs=-1, verbose=1)
model.fit(X_train,y_train)

# note: model.fit result --> 81 candidates (3 n_estimators x 3 max_depth x 3 max_features x 3 min_samples_leaf = 81)
#                        --> 243 fits (81 candidates x 3 CV folds = 243)

Fitting 3 folds for each of 81 candidates, totalling 243 fits


In [108]:
model.best_params_

{'algoritma__max_depth': 20,
 'algoritma__max_features': 0.6,
 'algoritma__min_samples_leaf': 1,
 'algoritma__n_estimators': 100}

In [109]:
print(f'Training score: {model.score(X_train,y_train)}')
print(f'Validation score: {model.best_score_}')
print(f'Testing score: {model.score(X_test, y_test)}')

Training score: 1.0
Validation score: 0.9856511919879565
Testing score: 0.9859484777517564
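To compare the two models, the printed scores can be converted into train-test gaps and into the number of misclassified test rows. This sketch only re-uses numbers already shown in this notebook (the scores and the test-set size of 854); the dictionary layout is an illustrative choice.

```python
# Scores copied from the printed outputs of the two models in this notebook
scores = {
    "SVM": {"train": 0.9628111273792094, "test": 0.949648711943794},
    "Random Forest": {"train": 1.0, "test": 0.9859484777517564},
}
n_test = 854  # number of rows in X_test

summary = {}
for name, s in scores.items():
    gap = s["train"] - s["test"]                 # difference between train and test accuracy
    errors = round((1 - s["test"]) * n_test)     # misclassified test rows
    summary[name] = (gap, errors)
    print(f"{name}: train-test gap = {gap:.4f}, test errors = {errors}")
```

Both gaps are small, around 0.013 to 0.014, while the random forest makes 12 test errors compared with 43 for the SVM.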


*Full GridSearchCV report, sorted by test-score rank*

In [110]:
report = pd.DataFrame(model.cv_results_).sort_values("rank_test_score")
report

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_algoritma__max_depth,param_algoritma__max_features,param_algoritma__min_samples_leaf,param_algoritma__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
36,0.569277,0.029221,0.053642,0.000854,50,0.6,1,100,"{'algoritma__max_depth': 50, 'algoritma__max_f...",0.986831,0.987698,0.982425,0.985651,0.002308,1
63,0.914827,0.258996,0.073824,0.011406,80,0.6,1,100,"{'algoritma__max_depth': 80, 'algoritma__max_f...",0.986831,0.987698,0.982425,0.985651,0.002308,1
9,1.423827,0.255942,0.109509,0.006533,20,0.6,1,100,"{'algoritma__max_depth': 20, 'algoritma__max_f...",0.986831,0.987698,0.982425,0.985651,0.002308,1
19,1.784666,0.502025,0.109394,0.014528,20,0.8,1,150,"{'algoritma__max_depth': 20, 'algoritma__max_f...",0.985075,0.986819,0.984183,0.985359,0.001095,4
46,0.879229,0.009042,0.071649,0.001732,50,0.8,1,150,"{'algoritma__max_depth': 50, 'algoritma__max_f...",0.985075,0.986819,0.984183,0.985359,0.001095,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,0.829036,0.012968,0.097819,0.008235,80,0.3,10,200,"{'algoritma__max_depth': 80, 'algoritma__max_f...",0.970149,0.957821,0.949912,0.959294,0.008327,76
35,0.889607,0.016056,0.106976,0.016403,50,0.3,10,200,"{'algoritma__max_depth': 50, 'algoritma__max_f...",0.970149,0.957821,0.949912,0.959294,0.008327,76
33,0.423331,0.009053,0.055013,0.000888,50,0.3,10,100,"{'algoritma__max_depth': 50, 'algoritma__max_f...",0.967515,0.956063,0.951670,0.958416,0.006680,79
6,0.420243,0.007441,0.059983,0.004753,20,0.3,10,100,"{'algoritma__max_depth': 20, 'algoritma__max_f...",0.967515,0.956063,0.951670,0.958416,0.006680,79
