# Modelling

- Uses the Random Forest and Logistic Regression algorithms.
- Goal: determine which dataset, the one with overlapping samples removed or the one with overlapping samples replaced, yields the best model accuracy.
- In this case, Severity Level 4 is ignored because it has very few samples compared with the other levels.
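The class-imbalance argument for dropping Severity Level 4 can be checked with a per-level count before filtering. The sketch below uses a small hypothetical frame (`toy`) rather than the real CSV files, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels; the real values come from the CSV files read below.
toy = pd.DataFrame({"severity_level": [1, 2, 2, 3, 3, 3, 4]})

# Count samples per severity level to see how rare level 4 is.
counts = toy["severity_level"].value_counts().sort_index()
print(counts)

# Keep only levels 1-3, mirroring the `< 4` filter used in this notebook.
kept = toy[toy["severity_level"] < 4]
print(sorted(kept["severity_level"].unique()))
```

Running the same `value_counts()` call on the full data frames would show how many rows the filter actually discards.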

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression

In [2]:
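# Report cross-validation scores and held-out test metrics for a fitted classifier
def ml_report_classification(model, X_train, y_train, X_test, y_test, name):
    print("====================\n" + name + "\n====================\n")
    # 4-fold cross-validation on the training split (fits clones of the model)
    scores = cross_val_score(model, X_train, y_train, cv=4)
    print("List Traning Scores :", scores)
    print("Training Average : {} +- {}".format(scores.mean(), scores.std()))
    print("\n---------------------\nValidation Score\n----------------------\n")
    # Evaluate the already-fitted model on the held-out test split
    y_pred = model.predict(X_test)
    # NOTE: predictions are passed first, so in the report and matrix below the
    # rows (and "support") describe predicted classes, and precision and recall
    # are interchanged relative to scikit-learn's (y_true, y_pred) convention
    print("Accuracy Score :", accuracy_score(y_pred, y_test))
    print("F1-Score       :", f1_score(y_pred, y_test, average='macro'))
    print(classification_report(y_pred, y_test))
    print(confusion_matrix(y_pred, y_test))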
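
Because `ml_report_classification` passes `y_pred` before `y_test`, the confusion matrices printed in this notebook are transposed relative to scikit-learn's documented `(y_true, y_pred)` order. The sketch below illustrates the effect with hypothetical labels; only the label arrays are invented, the metric functions are the ones imported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, chosen only to make the asymmetry visible.
y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([2, 2, 2, 2, 2, 2])  # a model that always predicts class 2

# Documented order: rows are true classes, columns are predicted classes.
cm_documented = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Order used in this notebook: rows are predicted classes, columns are true classes.
cm_swapped = confusion_matrix(y_pred, y_true, labels=[1, 2, 3])

print(cm_documented)
print(cm_swapped)

# Accuracy is symmetric in its arguments, so it does not depend on the order.
acc_same = accuracy_score(y_true, y_pred) == accuracy_score(y_pred, y_true)
print(acc_same)
```

With the swapped order, a single non-zero row means every prediction fell into one class, which is how the matrices printed in this notebook should be read.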

## Read Data

In [3]:
# Read original data
df_ori_all = pd.read_csv("results/ground_vi_data.csv")
df_ori = df_ori_all[df_ori_all['severity_level'] < 4]
print(df_ori.head())
print("\nShape of dataframe :", df_ori.shape)

       date hemispherical   longitude  latitude   region      ndre       lci   
0  19/01/22    IMG_3761_2  104.541217 -2.925328  1-bpm24  0.179929  0.263949  \
1  19/01/22    IMG_3754_2  104.541035 -2.925307  1-bpm24  0.186941  0.282940   
2  19/01/22    IMG_3746_2  104.541010 -2.925442  1-bpm24  0.182272  0.266318   
3  19/01/22    IMG_3737_2  104.541267 -2.925553  1-bpm24  0.188025  0.278577   
4  19/01/22    IMG_3731_2  104.541405 -2.925567  1-bpm24  0.178248  0.266711   

   severity_level  
0               2  
1               3  
2               3  
3               2  
4               2  

Shape of dataframe : (132, 8)


In [4]:
# Read the data with overlapping samples removed
df_final_all = pd.read_csv("results/final_ground_vi.csv")
df_final = df_final_all[df_final_all['severity_level'] < 4]
print(df_final.head())
print("\nShape of dataframe :", df_final.shape)

       date hemispherical   longitude  latitude   region  period      ndre   
0  19/01/22    IMG_3761_2  104.541217 -2.925328  1-bpm24       1  0.179929  \
1  19/01/22    IMG_3754_2  104.541035 -2.925307  1-bpm24       1  0.186941   
2  19/01/22    IMG_3746_2  104.541010 -2.925442  1-bpm24       1  0.182272   
3  19/01/22    IMG_3737_2  104.541267 -2.925553  1-bpm24       1  0.188025   
4  19/01/22    IMG_3731_2  104.541405 -2.925567  1-bpm24       1  0.178248   

        lci  severity_level  
0  0.263949               2  
1  0.282940               3  
2  0.266318               3  
3  0.278577               2  
4  0.266711               2  

Shape of dataframe : (138, 9)


## Modelling : Original Data (0)

In [5]:
# Split dataframe into X and y
X0 = np.array(df_ori[['ndre', 'lci']])
y0 = np.array(df_ori['severity_level'])

In [6]:
# Extract each feature as a single-column array
X0_ndre = np.array(df_ori['ndre']).reshape(-1, 1)
X0_lci = np.array(df_ori['lci']).reshape(-1, 1)

In [None]:
# Split data into train and test
X0_train_ndre, X0_test_ndre, y0_train_ndre, y0_test_ndre = train_test_split(X0_ndre, y0, test_size=0.2, 
                                                                            stratify=y0, random_state=0)
X0_train_ndre.shape, X0_test_ndre.shape, y0_train_ndre.shape, y0_test_ndre.shape

In [None]:
# Split data into train and test
X0_train_lci, X0_test_lci, y0_train_lci, y0_test_lci = train_test_split(X0_lci, y0, test_size=0.2, 
                                                                        stratify=y0, random_state=0)
X0_train_lci.shape, X0_test_lci.shape, y0_train_lci.shape, y0_test_lci.shape

In [14]:
# Split data into train and test
X_train0, X_test0, y_train0, y_test0 = train_test_split(X0, y0, test_size=0.2, stratify=y0, random_state=0)
X_train0.shape, X_test0.shape, y_train0.shape, y_test0.shape

((105, 2), (27, 2), (105,), (27,))

### Random Forest

#### Univariate - NDRE

In [8]:
# Train a model
rf0_ndre = RandomForestClassifier(random_state=0)
rf0_ndre.fit(X0_train_ndre, y0_train_ndre)

In [9]:
# Show ML Report
ml_report_classification(rf0_ndre, X0_train_ndre, y0_train_ndre, X0_test_ndre, y0_test_ndre, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.33333333 0.23076923 0.30769231 0.5       ]
Training Average : 0.34294871794871795 +- 0.09821508619259334

---------------------
Validation Score
----------------------

Accuracy Score : 0.48148148148148145
F1-Score       : 0.4304891015417331
              precision    recall  f1-score   support

           1       0.12      0.33      0.18         3
           2       0.70      0.50      0.58        14
           3       0.56      0.50      0.53        10

    accuracy                           0.48        27
   macro avg       0.46      0.44      0.43        27
weighted avg       0.58      0.48      0.52        27

[[1 1 1]
 [4 7 3]
 [3 2 5]]


#### Univariate - LCI

In [12]:
# Train a model
rf0_lci = RandomForestClassifier(random_state=0)
rf0_lci.fit(X0_train_lci, y0_train_lci)

In [13]:
# Show ML Report
ml_report_classification(rf0_lci, X0_train_lci, y0_train_lci, X0_test_lci, y0_test_lci, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.44444444 0.42307692 0.38461538 0.42307692]
Training Average : 0.4188034188034188 +- 0.02158013875718391

---------------------
Validation Score
----------------------

Accuracy Score : 0.48148148148148145
F1-Score       : 0.47348484848484845
              precision    recall  f1-score   support

           1       0.50      0.50      0.50         8
           2       0.30      0.50      0.37         6
           3       0.67      0.46      0.55        13

    accuracy                           0.48        27
   macro avg       0.49      0.49      0.47        27
weighted avg       0.54      0.48      0.49        27

[[4 3 1]
 [1 3 2]
 [3 4 6]]


#### Multivariate

In [15]:
# Train a model
rf0 = RandomForestClassifier(random_state=0)
rf0.fit(X_train0, y_train0)

In [16]:
# Show ML Report
ml_report_classification(rf0, X_train0, y_train0, X_test0, y_test0, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.40740741 0.42307692 0.26923077 0.46153846]
Training Average : 0.3903133903133903 +- 0.07262862107807724

---------------------
Validation Score
----------------------

Accuracy Score : 0.5925925925925926
F1-Score       : 0.5789669267930136
              precision    recall  f1-score   support

           1       0.38      0.60      0.46         5
           2       0.60      0.75      0.67         8
           3       0.78      0.50      0.61        14

    accuracy                           0.59        27
   macro avg       0.58      0.62      0.58        27
weighted avg       0.65      0.59      0.60        27

[[3 1 1]
 [1 6 1]
 [4 3 7]]


The resulting model accuracy is still low. One contributing factor is that part of the dataset has overlapping labels.
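
One possible way to inspect label overlap is to compare the per-class value range of a feature. This is only a sketch of one reading of "overlapping": the frame `toy` and the helper `ranges_overlap` are hypothetical, and the notebook itself does not show how the overlap was identified.

```python
import pandas as pd

# Hypothetical toy values; in the notebook the frame would be df_ori.
toy = pd.DataFrame({
    "ndre": [0.17, 0.18, 0.19, 0.18, 0.19, 0.20],
    "severity_level": [1, 1, 2, 2, 3, 3],
})

# Minimum and maximum of the feature within each severity level.
ranges = toy.groupby("severity_level")["ndre"].agg(["min", "max"])
print(ranges)


def ranges_overlap(a, b):
    """Return True when the value ranges of levels a and b intersect."""
    return bool(
        ranges.loc[a, "min"] <= ranges.loc[b, "max"]
        and ranges.loc[b, "min"] <= ranges.loc[a, "max"]
    )


print(ranges_overlap(1, 2), ranges_overlap(1, 3))
```

When the ranges of neighbouring levels intersect heavily, a classifier using that feature alone has little to separate them with.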

### Logistic Regression

#### Univariate - NDRE

In [33]:
# Train a model
lr0_ndre = LogisticRegression(random_state=0)
lr0_ndre.fit(X0_train_ndre, y0_train_ndre)

In [35]:
# Show ML Report
ml_report_classification(lr0_ndre, X0_train_ndre, y0_train_ndre, X0_test_ndre, y0_test_ndre, "LOGISTIC REGRESSION")

LOGISTIC REGRESSION

List Traning Scores : [0.55555556 0.34615385 0.34615385 0.34615385]
Training Average : 0.39850427350427353 +- 0.09067359996888355

---------------------
Validation Score
----------------------

Accuracy Score : 0.37037037037037035
F1-Score       : 0.1801801801801802
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         0
           2       1.00      0.37      0.54        27
           3       0.00      0.00      0.00         0

    accuracy                           0.37        27
   macro avg       0.33      0.12      0.18        27
weighted avg       1.00      0.37      0.54        27

[[ 0  0  0]
 [ 8 10  9]
 [ 0  0  0]]




#### Univariate - LCI

In [38]:
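# Train a model
# NOTE: as executed, this cell uses the final-data split (X1_train_lci etc., created
# under "Modelling : Final Data (1)"), so the report below describes that split
lr1_lci = LogisticRegression(random_state=0)
lr1_lci.fit(X1_train_lci, y1_train_lci)

In [39]:
# Show ML Report
ml_report_classification(lr1_lci, X1_train_lci, y1_train_lci, X1_test_lci, y1_test_lci, "LOGISTIC REGRESSION")

LOGISTIC REGRESSION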

List Traning Scores : [0.39285714 0.35714286 0.37037037 0.37037037]
Training Average : 0.3726851851851851 +- 0.012837333957527558

---------------------
Validation Score
----------------------

Accuracy Score : 0.39285714285714285
F1-Score       : 0.18803418803418803
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         0
           2       1.00      0.39      0.56        28
           3       0.00      0.00      0.00         0

    accuracy                           0.39        28
   macro avg       0.33      0.13      0.19        28
weighted avg       1.00      0.39      0.56        28

[[ 0  0  0]
 [ 8 11  9]
 [ 0  0  0]]




## Modelling : Final Data (1)

In [17]:
# Split dataframe into X and y
y1 = np.array(df_final['severity_level'])
X1 = np.array(df_final[['ndre', 'lci']])
y1a = np.array(df_final_all['severity_level'])
X1a = np.array(df_final_all[['ndre', 'lci']])

In [18]:
# Extract features
X1_ndre = np.array(df_final['ndre']).reshape(-1,1)
X1_lci = np.array(df_final['lci']).reshape(-1,1)

In [19]:
# Split data into train and test --> ndre
X1_train_ndre, X1_test_ndre, y1_train_ndre, y1_test_ndre = train_test_split(X1_ndre, y1, test_size=0.2, 
                                                                            stratify=y1, random_state=0)

In [20]:
# Split data into train and test --> lci
X1_train_lci, X1_test_lci, y1_train_lci, y1_test_lci = train_test_split(X1_lci, y1, test_size=0.2, 
                                                                        stratify=y1, random_state=0)

In [21]:
# Split data into train and test --> Level 1-3
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, stratify=y1, random_state=0)
X1_train.shape, X1_test.shape, y1_train.shape, y1_test.shape

((110, 2), (28, 2), (110,), (28,))

In [22]:
# Split data into train and test --> Level 1-4
X1a_train, X1a_test, y1a_train, y1a_test = train_test_split(X1a, y1a, test_size=0.2, stratify=y1a, random_state=0)
X1a_train.shape, X1a_test.shape, y1a_train.shape, y1a_test.shape

((115, 2), (29, 2), (115,), (29,))

### Random Forest

#### Univariate - NDRE

In [23]:
# Train a model
rf1_ndre = RandomForestClassifier(random_state=0)
rf1_ndre.fit(X1_train_ndre, y1_train_ndre)

In [24]:
# Show ML Report
ml_report_classification(rf1_ndre, X1_train_ndre, y1_train_ndre, X1_test_ndre, y1_test_ndre, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.42857143 0.46428571 0.44444444 0.33333333]
Training Average : 0.4176587301587301 +- 0.05030260846860008

---------------------
Validation Score
----------------------

Accuracy Score : 0.5
F1-Score       : 0.4974424552429668
              precision    recall  f1-score   support

           1       0.50      0.50      0.50         8
           2       0.55      0.50      0.52        12
           3       0.44      0.50      0.47         8

    accuracy                           0.50        28
   macro avg       0.50      0.50      0.50        28
weighted avg       0.50      0.50      0.50        28

[[4 2 2]
 [3 6 3]
 [1 3 4]]


#### Univariate - LCI

In [25]:
# Train a model
rf_lci = RandomForestClassifier(random_state=0)
rf_lci.fit(X1_train_lci, y1_train_lci)

In [26]:
# Show ML Report
ml_report_classification(rf_lci, X1_train_lci, y1_train_lci, X1_test_lci, y1_test_lci, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.5        0.53571429 0.51851852 0.51851852]
Training Average : 0.5181878306878307 +- 0.012631236279619226

---------------------
Validation Score
----------------------

Accuracy Score : 0.39285714285714285
F1-Score       : 0.3874883286647992
              precision    recall  f1-score   support

           1       0.62      0.38      0.48        13
           2       0.27      0.50      0.35         6
           3       0.33      0.33      0.33         9

    accuracy                           0.39        28
   macro avg       0.41      0.41      0.39        28
weighted avg       0.46      0.39      0.40        28

[[5 4 4]
 [1 3 2]
 [2 4 3]]


#### Multivariate

In [27]:
# Train a model
rf1 = RandomForestClassifier(random_state=0)
rf1.fit(X1_train, y1_train)

In [28]:
# Show ML Report
ml_report_classification(rf1, X1_train, y1_train, X1_test, y1_test, "RANDOM FOREST")

RANDOM FOREST

List Traning Scores : [0.5        0.53571429 0.51851852 0.48148148]
Training Average : 0.5089285714285714 +- 0.020263907010388268

---------------------
Validation Score
----------------------

Accuracy Score : 0.42857142857142855
F1-Score       : 0.4292929292929293
              precision    recall  f1-score   support

           1       0.50      0.50      0.50         8
           2       0.45      0.45      0.45        11
           3       0.33      0.33      0.33         9

    accuracy                           0.43        28
   macro avg       0.43      0.43      0.43        28
weighted avg       0.43      0.43      0.43        28

[[4 2 2]
 [2 5 4]
 [2 4 3]]


The training score improves markedly after the overlapping data are removed: the mean cross-validation score of the multivariate Random Forest rises from 0.3903 to 0.5089. The held-out test accuracy does not follow the same trend (0.5926 on the original data, 0.4286 on the final data).
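
The size of that change can be worked out directly from the two "Training Average" values printed by the multivariate Random Forest reports. The variable names below are introduced only for this illustration.

```python
# Mean 4-fold cross-validation scores copied from the two multivariate
# Random Forest reports (original data, then final data).
cv_original = 0.3903133903133903
cv_final = 0.5089285714285714

# Absolute gain in accuracy points and gain relative to the original score.
absolute_gain = cv_final - cv_original
relative_gain = absolute_gain / cv_original

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
```

The absolute gain is a little under 12 accuracy points, which is roughly a 30% improvement relative to the original score.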

### Logistic Regression

#### Univariate - NDRE

In [29]:
# Train a model
lr_ndre = LogisticRegression(random_state=0)
lr_ndre.fit(X1_train_ndre, y1_train_ndre)

In [30]:
# Show ML Report
ml_report_classification(lr_ndre, X1_train_ndre, y1_train_ndre, X1_test_ndre, y1_test_ndre, "LOGISTIC REGRESSION")

LOGISTIC REGRESSION

List Traning Scores : [0.39285714 0.35714286 0.37037037 0.37037037]
Training Average : 0.3726851851851851 +- 0.012837333957527558

---------------------
Validation Score
----------------------

Accuracy Score : 0.39285714285714285
F1-Score       : 0.18803418803418803
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         0
           2       1.00      0.39      0.56        28
           3       0.00      0.00      0.00         0

    accuracy                           0.39        28
   macro avg       0.33      0.13      0.19        28
weighted avg       1.00      0.39      0.56        28

[[ 0  0  0]
 [ 8 11  9]
 [ 0  0  0]]




The Logistic Regression model cannot identify samples with any severity level other than level 2, so its confusion matrix contains no predictions for severity levels 1, 3, and 4.\
One contributing factor is that the ndre and lci variables are very strongly correlated, whereas logistic regression assumes that there is no multicollinearity among its independent variables.
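
The correlation between the two indices can be measured with a Pearson coefficient. The sketch below uses only the five rows printed by `df_final.head()` earlier in this notebook, so the number it produces is an illustration rather than the correlation of the full dataset. Multicollinearity matters when both features enter the same model.

```python
import numpy as np

# The five (ndre, lci) pairs shown by df_final.head(); far too few rows to
# characterise the whole dataset, used here only to demonstrate the check.
ndre = np.array([0.179929, 0.186941, 0.182272, 0.188025, 0.178248])
lci = np.array([0.263949, 0.282940, 0.266318, 0.278577, 0.266711])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two features.
r = np.corrcoef(ndre, lci)[0, 1]
print(f"Pearson r between ndre and lci: {r:.3f}")
```

On the full data the same call could be written as `np.corrcoef(df_final['ndre'], df_final['lci'])[0, 1]`.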

#### Univariate - LCI

In [31]:
# Train a model
lr_lci = LogisticRegression(random_state=0)
lr_lci.fit(X1_train_lci, y1_train_lci)

In [32]:
# Show ML Report
ml_report_classification(lr_lci, X1_train_lci, y1_train_lci, X1_test_lci, y1_test_lci, "LOGISTIC REGRESSION")

LOGISTIC REGRESSION

List Traning Scores : [0.39285714 0.35714286 0.37037037 0.37037037]
Training Average : 0.3726851851851851 +- 0.012837333957527558

---------------------
Validation Score
----------------------

Accuracy Score : 0.39285714285714285
F1-Score       : 0.18803418803418803
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         0
           2       1.00      0.39      0.56        28
           3       0.00      0.00      0.00         0

    accuracy                           0.39        28
   macro avg       0.33      0.13      0.19        28
weighted avg       1.00      0.39      0.56        28

[[ 0  0  0]
 [ 8 11  9]
 [ 0  0  0]]




## Summary

1. Univariate --> LCI has higher training and testing scores than NDRE.
2. Combining the LCI and NDRE variables lowers accuracy, but not significantly (about 1%).
3. After relabelling the samples whose severity levels differ by more than 2, accuracy improves fairly significantly (about 10%).
4. When Logistic Regression is used for modelling, it predicts only severity level 2.
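
Point 4 can be put in context with a majority-class baseline. The sketch below rebuilds a test label vector from the class counts visible in the final-data Logistic Regression confusion matrices (8, 11, and 9 samples for levels 1, 2, and 3) and scores scikit-learn's `DummyClassifier`. Fitting the baseline on these same labels is a simplification made for illustration, and the placeholder features are an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# True test labels reconstructed from the confusion-matrix counts: 8, 11, 9.
y_test = np.array([1] * 8 + [2] * 11 + [3] * 9)
# Placeholder features; the most_frequent strategy ignores them.
X_test = np.zeros((len(y_test), 1))

# Baseline that always predicts the most frequent class it was fitted on.
baseline = DummyClassifier(strategy="most_frequent").fit(X_test, y_test)
y_pred = baseline.predict(X_test)

classes, counts = np.unique(y_pred, return_counts=True)
acc = accuracy_score(y_test, y_pred)
print(classes, counts, acc)
```

The baseline accuracy is 11/28, the same value as the Logistic Regression test accuracy reported above, which suggests those models behave like a constant predictor on this split.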