# Modelling

- Uses the Random Forest and Logistic Regression algorithms.
- Goal: test which of the two datasets, remove-overlapping or replace-overlapping, yields the best model accuracy.
- In this case, Severity Level 4 is ignored because it has very few samples compared with the other levels.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression

In [2]:
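# Function for ML Report Classification
def ml_report_classification(model, X_train, y_train, X_test, y_test, name):
    """Print 4-fold cross-validation scores and hold-out test metrics for a fitted model."""
    print("====================\n" + name + "\n====================\n")
    # cross_val_score refits clones of the model on 4 folds of the training split,
    # so the "Training" scores printed below are cross-validation accuracies
    scores = cross_val_score(model, X_train, y_train, cv=4)
    print("List Traning Scores :", scores)
    print("Training Average : {} +- {}".format(scores.mean(), scores.std()))
    print("\n---------------------\nValidation Score\n----------------------\n")
    # The "Validation" metrics are computed on the held-out test split
    y_pred = model.predict(X_test)
    # Predictions are passed as the first argument here, so in the report "support"
    # counts predicted labels, precision and recall are swapped relative to the usual
    # reading, and confusion-matrix rows are predicted labels (columns are test labels)
    print("Accuracy Score :", accuracy_score(y_pred, y_test))
    print("F1-Score       :", f1_score(y_pred, y_test, average='macro'))
    print(classification_report(y_pred, y_test))
    print(confusion_matrix(y_pred, y_test))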
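
The helper above passes `y_pred` before `y_test`, whereas scikit-learn documents these metric functions as taking `(y_true, y_pred)`. Accuracy is unaffected by the order, but the confusion matrix is transposed. A minimal sketch of the difference, using made-up toy labels (not data from this notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration only (hypothetical, not from the dataset)
y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 2, 2, 2, 3, 1])

# Documented order: rows are true labels, columns are predicted labels
cm_conventional = confusion_matrix(y_true, y_pred)

# Swapped order (as in the helper): rows are predicted labels
cm_swapped = confusion_matrix(y_pred, y_true)

# Accuracy is symmetric in its two arguments
acc_conventional = accuracy_score(y_true, y_pred)
acc_swapped = accuracy_score(y_pred, y_true)

print(cm_conventional)
print(cm_swapped)
print(acc_conventional, acc_swapped)
```

When reading the reports in this notebook, the rows of each confusion matrix should therefore be interpreted as predicted labels.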

## Read Data

In [3]:
# Read original data
df_ori1234 = pd.read_csv("results/ground_vi_data.csv")
df_ori123 = df_ori1234[df_ori1234['severity_level'] < 4]
print(df_ori123.head())
print("\nShape of dataframe :", df_ori123.shape)

       date hemispherical   longitude  latitude   region      ndre       lci   
0  19/01/22    IMG_3761_2  104.541217 -2.925328  1-bpm24  0.179929  0.263949  \
1  19/01/22    IMG_3754_2  104.541035 -2.925307  1-bpm24  0.186941  0.282940   
2  19/01/22    IMG_3746_2  104.541010 -2.925442  1-bpm24  0.182272  0.266318   
3  19/01/22    IMG_3737_2  104.541267 -2.925553  1-bpm24  0.188025  0.278577   
4  19/01/22    IMG_3731_2  104.541405 -2.925567  1-bpm24  0.178248  0.266711   

   severity_level  
0               2  
1               3  
2               3  
3               2  
4               2  

Shape of dataframe : (132, 8)


In [4]:
# Read remove overlapping data
df_remove1234 = pd.read_csv("results/remove_overlapping_data.csv")
df_remove123 = df_remove1234[df_remove1234['severity_level'] < 4]
print(df_remove123.head())
print("\nShape of dataframe :", df_remove123.shape)

       date hemispherical   longitude  latitude   region  period      ndre   
0  19/01/22    IMG_3761_2  104.541217 -2.925328  1-bpm24       1  0.179929  \
1  19/01/22    IMG_3737_2  104.541267 -2.925553  1-bpm24       1  0.188025   
2  19/01/22    IMG_3731_2  104.541405 -2.925567  1-bpm24       1  0.178248   
3  19/01/22    IMG_3723_2  104.541402 -2.925407  1-bpm24       1  0.184802   
4  19/01/22    IMG_3718_2  104.541495 -2.925234  1-bpm24       1  0.186362   

        lci  severity_level  
0  0.263949               2  
1  0.278577               2  
2  0.266711               2  
3  0.272173               2  
4  0.276429               2  

Shape of dataframe : (74, 9)


In [5]:
# Read replace overlapping data
df_replace1234 = pd.read_csv("results/replace_overlapping_data.csv")
df_replace123 = df_replace1234[df_replace1234['severity_level'] < 4]
print(df_replace123.head())
print("\nShape of dataframe :", df_replace123.shape)

       date hemispherical   longitude  latitude   region  period      ndre   
0  19/01/22    IMG_3761_2  104.541217 -2.925328  1-bpm24       1  0.179929  \
1  19/01/22    IMG_3754_2  104.541035 -2.925307  1-bpm24       1  0.186941   
2  19/01/22    IMG_3746_2  104.541010 -2.925442  1-bpm24       1  0.182272   
3  19/01/22    IMG_3737_2  104.541267 -2.925553  1-bpm24       1  0.188025   
4  19/01/22    IMG_3731_2  104.541405 -2.925567  1-bpm24       1  0.178248   

        lci  severity_level  
0  0.263949               2  
1  0.282940               2  
2  0.266318               2  
3  0.278577               2  
4  0.266711               2  

Shape of dataframe : (138, 9)


## Modelling : Original Data (0)

In [6]:
# Split dataframe into X and y
y0 = np.array(df_ori123['severity_level'])
X0 = np.array(df_ori123[['ndre', 'lci']])

In [7]:
# Split data into train and test
X_train0, X_test0, y_train0, y_test0 = train_test_split(X0, y0, test_size=0.2, stratify=y0, random_state=0)
X_train0.shape, X_test0.shape, y_train0.shape, y_test0.shape

((105, 2), (27, 2), (105,), (27,))

### Random Forest

In [8]:
# Train a model
rf0 = RandomForestClassifier(random_state=0)
rf0.fit(X_train0, y_train0)

In [9]:
# Show ML Report
ml_report_classification(rf0, X_train0, y_train0, X_test0, y_test0, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.40740741 0.42307692 0.26923077 0.46153846]
Training Average : 0.3903133903133903 +- 0.07262862107807724

---------------------
Validation Score
----------------------

Accuracy Score : 0.5925925925925926
F1-Score       : 0.5789669267930136
              precision    recall  f1-score   support

           1       0.38      0.60      0.46         5
           2       0.60      0.75      0.67         8
           3       0.78      0.50      0.61        14

    accuracy                           0.59        27
   macro avg       0.58      0.62      0.58        27
weighted avg       0.65      0.59      0.60        27

[[3 1 1]
 [1 6 1]
 [4 3 7]]


The resulting model accuracy is still low. One contributing factor is that part of the dataset has overlapping labels.
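
To put a cross-validated accuracy of about 0.39 in context, it can help to compare it against a majority-class baseline. The sketch below is illustrative only: it uses synthetic stand-in data whose labels are unrelated to the features (an assumption, not the notebook's data), and the names `X_demo`, `y_demo`, and `baseline` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the two features; labels are drawn independently of them
X_demo = rng.normal(size=(105, 2))
y_demo = rng.choice([1, 2, 3], size=105, p=[0.2, 0.3, 0.5])

# Baseline that always predicts the most frequent class in the training fold
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=4)

# Random Forest evaluated with the same 4-fold scheme
rf_scores = cross_val_score(RandomForestClassifier(random_state=0), X_demo, y_demo, cv=4)

print("Baseline CV accuracy      :", baseline_scores.mean())
print("Random Forest CV accuracy :", rf_scores.mean())
```

On the real data, passing `X_train0` and `y_train0` instead of the synthetic arrays would show how far the Random Forest is from this baseline.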

## Modelling : Remove Overlapping Data (1)

In [11]:
# Split dataframe into X and y
y1 = np.array(df_remove123['severity_level'])
X1 = np.array(df_remove123[['ndre', 'lci']])
y1a = np.array(df_remove1234['severity_level'])
X1a = np.array(df_remove1234[['ndre', 'lci']])

In [12]:
# Split data into train and test --> Level 1-3
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, stratify=y1, random_state=0)
X_train1.shape, X_test1.shape, y_train1.shape, y_test1.shape

((59, 2), (15, 2), (59,), (15,))

In [13]:
# Split data into train and test --> Level 1-4
X_train1a, X_test1a, y_train1a, y_test1a = train_test_split(X1a, y1a, test_size=0.2, stratify=y1a, random_state=0)
X_train1a.shape, X_test1a.shape, y_train1a.shape, y_test1a.shape

((64, 2), (16, 2), (64,), (16,))

### Random Forest (Level 1-3)

In [14]:
# Train a model
rf1 = RandomForestClassifier(random_state=0)
rf1.fit(X_train1, y_train1)

In [15]:
# Show ML Report
ml_report_classification(rf1, X_train1, y_train1, X_test1, y_test1, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [1.         0.93333333 1.         1.        ]
Training Average : 0.9833333333333334 +- 0.02886751345948128

---------------------
Validation Score
----------------------

Accuracy Score : 1.0
F1-Score       : 1.0
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         5
           2       1.00      1.00      1.00         6
           3       1.00      1.00      1.00         4

    accuracy                           1.00        15
   macro avg       1.00      1.00      1.00        15
weighted avg       1.00      1.00      1.00        15

[[5 0 0]
 [0 6 0]
 [0 0 4]]


Removing the overlapping data produces a very large improvement in the training score: the mean 4-fold cross-validation accuracy rises from 0.3903 to 0.9833. In addition, the validation score on the held-out test split is perfect (1.0).
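
The hold-out split here contains only 15 samples, so one way to get a steadier estimate is repeated stratified cross-validation over the whole dataset. The following is a minimal sketch on synthetic data generated with `make_classification` (an assumption standing in for the `ndre`/`lci` features); the choice of 4 splits and 10 repeats is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic 3-class, 2-feature data of the same size as the remove-overlapping set
X_demo, y_demo = make_classification(
    n_samples=74,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# 4-fold stratified CV repeated 10 times with different shuffles: 40 scores in total
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X_demo, y_demo, cv=cv)

print("Number of scores :", scores.shape[0])
print("Mean accuracy    : {:.4f} +- {:.4f}".format(scores.mean(), scores.std()))
```

With the notebook's data, `X1` and `y1` would take the place of `X_demo` and `y_demo`.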

### Random Forest (All Levels)

In [16]:
# Train a model
rf1a = RandomForestClassifier(random_state=0)
rf1a.fit(X_train1a, y_train1a)

In [17]:
# Show ML Report
ml_report_classification(rf1a, X_train1a, y_train1a, X_test1a, y_test1a, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [1.     0.9375 0.875  0.875 ]
Training Average : 0.921875 +- 0.05182226234930312

---------------------
Validation Score
----------------------

Accuracy Score : 0.9375
F1-Score       : 0.7222222222222222
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         5
           2       1.00      1.00      1.00         6
           3       1.00      0.80      0.89         5
           4       0.00      0.00      0.00         0

    accuracy                           0.94        16
   macro avg       0.75      0.70      0.72        16
weighted avg       1.00      0.94      0.97        16

[[5 0 0 0]
 [0 6 0 0]
 [0 0 4 1]
 [0 0 0 0]]
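
In this all-levels report the accuracy is 0.9375 but the macro F1 is only 0.72, because the macro average weights all four classes equally and level 4 contributes an F1 of 0. The sketch below reconstructs label arrays from the confusion matrix above (read as rows = predicted, columns = actual, since the helper passes predictions first) and recomputes both numbers with the documented `(y_true, y_pred)` order. The arrays are a reconstruction for illustration, not the original test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Reconstructed from the confusion matrix: 5, 6 and 4 correct predictions for
# levels 1, 2 and 3, plus one level-4 sample that was predicted as level 3
y_true = np.array([1] * 5 + [2] * 6 + [3] * 4 + [4])
y_pred = np.array([1] * 5 + [2] * 6 + [3] * 4 + [3])

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 sets any undefined per-class metric to 0 without a warning
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("Accuracy     :", accuracy)
print("Per-class F1 :", per_class_f1)
print("Macro F1     :", macro_f1)
```

The macro F1 is the plain mean of the per-class values, (1 + 1 + 0.889 + 0) / 4, which is why a single missed level-4 sample lowers it so sharply.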




### Logistic Regression (Level 1-3)

In [None]:
# Train a model
lr1 = LogisticRegression(random_state=0)
lr1.fit(X_train1, y_train1)

In [None]:
# Show ML Report
ml_report_classification(lr1, X_train1, y_train1, X_test1, y_test1, "LOGISTIC REGRESSION")

The Logistic Regression model cannot identify samples with any severity level other than level 2, so its confusion matrix contains no predictions for severity levels 1, 3, and 4.\
One contributing factor is that the `ndre` and `lci` variables are very strongly correlated, whereas logistic regression assumes that there is no multicollinearity among its independent variables.
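
The strength of the correlation between the two predictors can be checked directly. The sketch below computes the Pearson correlation and the variance inflation factor (VIF), which for exactly two predictors reduces to 1 / (1 - r²). The data is synthetic: `lci` is generated as a noisy linear function of `ndre`, which is an assumption made only so the example runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: lci is constructed to be almost linear in ndre (assumption)
ndre = rng.uniform(0.15, 0.25, size=200)
lci = 1.5 * ndre + rng.normal(scale=0.002, size=200)
demo = pd.DataFrame({"ndre": ndre, "lci": lci})

# Pearson correlation between the two predictors
r = demo["ndre"].corr(demo["lci"])

# With two predictors, each one's VIF equals 1 / (1 - r^2)
vif = 1.0 / (1.0 - r**2)

print("Pearson r = {:.3f}, VIF = {:.1f}".format(r, vif))
```

On the notebook's data, `df_remove123[['ndre', 'lci']]` would replace `demo`; a VIF far above the commonly used thresholds of 5 or 10 would support the multicollinearity explanation.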

## Modelling : Replace Overlapping Data (2)

In [18]:
# Split dataframe into X and y
y2 = np.array(df_replace123['severity_level'])
X2 = np.array(df_replace123[['ndre', 'lci']])
y2a = np.array(df_replace1234['severity_level'])
X2a = np.array(df_replace1234[['ndre', 'lci']])

In [19]:
# Split data into train and test --> Level 1-3
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, stratify=y2, random_state=0)
X_train2.shape, X_test2.shape, y_train2.shape, y_test2.shape

((110, 2), (28, 2), (110,), (28,))

In [20]:
# Split data into train and test --> All Levels
X_train2a, X_test2a, y_train2a, y_test2a = train_test_split(X2a, y2a, test_size=0.2, stratify=y2a, random_state=0)
X_train2a.shape, X_test2a.shape, y_train2a.shape, y_test2a.shape

((115, 2), (29, 2), (115,), (29,))

### Random Forest (Level 1-3)

In [22]:
# Train a model
rf2 = RandomForestClassifier(random_state=0)
rf2.fit(X_train2, y_train2)

In [23]:
# Show ML Report
ml_report_classification(rf2, X_train2, y_train2, X_test2, y_test2, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [1.         1.         0.96296296 1.        ]
Training Average : 0.9907407407407407 +- 0.01603750747748963

---------------------
Validation Score
----------------------

Accuracy Score : 1.0
F1-Score       : 1.0
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         7
           2       1.00      1.00      1.00        15
           3       1.00      1.00      1.00         6

    accuracy                           1.00        28
   macro avg       1.00      1.00      1.00        28
weighted avg       1.00      1.00      1.00        28

[[ 7  0  0]
 [ 0 15  0]
 [ 0  0  6]]


### Random Forest (All Levels)

In [24]:
# Train a model
rf2a = RandomForestClassifier(random_state=0)
rf2a.fit(X_train2a, y_train2a)

In [25]:
# Show ML Report
ml_report_classification(rf2a, X_train2a, y_train2a, X_test2a, y_test2a, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.86206897 0.89655172 0.93103448 0.92857143]
Training Average : 0.9045566502463054 +- 0.028049364467808823

---------------------
Validation Score
----------------------

Accuracy Score : 0.9655172413793104
F1-Score       : 0.7307692307692307
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         7
           2       1.00      1.00      1.00        15
           3       1.00      0.86      0.92         7
           4       0.00      0.00      0.00         0

    accuracy                           0.97        29
   macro avg       0.75      0.71      0.73        29
weighted avg       1.00      0.97      0.98        29

[[ 7  0  0  0]
 [ 0 15  0  0]
 [ 0  0  6  1]
 [ 0  0  0  0]]
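
To compare the three datasets against the stated goal, the Random Forest (Level 1-3) results reported in this notebook can be collected in one table. The values below are copied (rounded to four decimals) from the reports above; the column names are chosen here for illustration.

```python
import pandas as pd

# Values copied from the Random Forest (Level 1-3) reports in this notebook
summary = pd.DataFrame(
    {
        "dataset": ["original", "remove overlapping", "replace overlapping"],
        "cv_mean": [0.3903, 0.9833, 0.9907],        # mean 4-fold CV accuracy
        "cv_std": [0.0726, 0.0289, 0.0160],         # std of the 4 fold scores
        "test_accuracy": [0.5926, 1.0, 1.0],        # hold-out accuracy
        "n_test": [27, 15, 28],                     # hold-out sample count
    }
)

# Sort so the dataset with the highest CV mean comes first
ranked = summary.sort_values("cv_mean", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

The gap between the remove and replace datasets in CV mean (about 0.007) is smaller than either standard deviation, so on these numbers the two are hard to separate by accuracy alone.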


