# **Machine Learning Project**

Name: Melanie Sayyidina Sabrina Refman

## **Import Dataset**

In [None]:
from google.colab import files

# Upload the Kaggle API credentials file
files.upload()

Saving kaggle (3).json to kaggle (3).json


{'kaggle (3).json': b'{"username":"melanierefman","key":"<redacted>"}'}

In [None]:
# Configure the Kaggle API in the Colab environment.
# Colab saved the upload as "kaggle (3).json", so copy it under the name the Kaggle CLI expects.
!mkdir -p ~/.kaggle
!cp "kaggle (3).json" ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
# Download the dataset from Kaggle using the Kaggle API
!kaggle datasets download -d thedevastator/weather-prediction

Dataset URL: https://www.kaggle.com/datasets/thedevastator/weather-prediction
License(s): other
Downloading weather-prediction.zip to /content
  0% 0.00/936k [00:00<?, ?B/s]
100% 936k/936k [00:00<00:00, 123MB/s]


In [None]:
from zipfile import ZipFile

# Extract the zip archive (the handle is named zf to avoid shadowing the built-in zip)
file_name = "/content/weather-prediction.zip"
with ZipFile(file_name, 'r') as zf:
  zf.extractall()
  print('Extraction Completed')

Extraction Completed


## **Import Library**

This section imports the libraries used for data processing, exploration, modeling, and model evaluation.

In [None]:
# Import Library
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout

*   NumPy and Pandas: data manipulation and numerical operations.
*   Scikit-learn: data preprocessing, modeling, and model evaluation.
*   TensorFlow and Keras: building and training the RNN model.


## **Initial Dataset Exploration**

The dataset is loaded with Pandas from a CSV file named weather_prediction_dataset.csv.

In [None]:
# Dataset path
dataset = "weather_prediction_dataset.csv"
df = pd.read_csv(dataset)

The shape of the dataset (number of rows and columns) is displayed.

In [None]:
print("Dataset Shape:", df.shape)

Dataset Shape: (3654, 165)


The first few rows of the dataset are shown to give an initial picture of the data.

In [None]:
# Show the first few rows
print("Preview dataset:")
print(df.head())

Preview dataset:
       DATE  MONTH  BASEL_cloud_cover  BASEL_humidity  BASEL_pressure  \
0  20000101      1                  8            0.89          1.0286   
1  20000102      1                  8            0.87          1.0318   
2  20000103      1                  5            0.81          1.0314   
3  20000104      1                  7            0.79          1.0262   
4  20000105      1                  5            0.90          1.0246   

   BASEL_global_radiation  BASEL_precipitation  BASEL_sunshine  \
0                    0.20                 0.03             0.0   
1                    0.25                 0.00             0.0   
2                    0.50                 0.00             3.7   
3                    0.63                 0.35             6.9   
4                    0.51                 0.07             3.7   

   BASEL_temp_mean  BASEL_temp_min  ...  STOCKHOLM_temp_min  \
0              2.9             1.6  ...                -9.3   
1              3.6   

Information about the dataset, such as data types and the number of available values, is displayed.

In [None]:
# Dataset information (df.info() prints its report directly and returns None, so it is not wrapped in print)
print("\nDataset information:")
df.info()


Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3654 entries, 0 to 3653
Columns: 165 entries, DATE to TOURS_temp_max
dtypes: float64(150), int64(15)
memory usage: 4.6 MB


#### **Checking for Missing Data**

This step counts the missing values in each column of the dataset. The count indicates the quality of the data and determines whether missing values need to be handled before modeling.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)


Missing values per column:
DATE                      0
MONTH                     0
BASEL_cloud_cover         0
BASEL_humidity            0
BASEL_pressure            0
                         ..
TOURS_global_radiation    0
TOURS_precipitation       0
TOURS_temp_mean           0
TOURS_temp_min            0
TOURS_temp_max            0
Length: 165, dtype: int64
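
Because Pandas abbreviates a long Series when printing it, the output above shows only the first and last few of the 165 columns. A more compact check is to collapse the per-column counts into one total and list only the columns that contain gaps. The following is a minimal sketch on a small hypothetical frame (`demo` and its values are illustrative assumptions, not the weather data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data (not from the notebook).
demo = pd.DataFrame({
    "temp_mean": [2.9, 3.6, np.nan, 4.1],
    "humidity": [0.89, 0.87, 0.81, 0.79],
    "pressure": [1.0286, np.nan, np.nan, 1.0262],
})

missing_per_column = demo.isnull().sum()

# Total number of missing cells across the whole frame.
total_missing = int(missing_per_column.sum())

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]

print("Total missing values:", total_missing)
print(columns_with_missing)
```

Running the same two expressions on `df` instead of `demo` would show at a glance whether any of the hidden columns needs imputation; a total of 0 would mean none does.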


## **Data Preparation**



#### **Data Preprocessing and Augmentation**

The data is transformed with MinMaxScaler, which rescales every column to the 0-1 range. Putting all features on a common scale mainly benefits gradient-based models such as the LSTM used later; tree-based models are largely insensitive to it.

In [None]:
# Data Transformation
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Use 'BASEL_temp_mean' (mean temperature in Basel) as the target variable
X = data_scaled.drop(columns=['BASEL_temp_mean'])
y = data_scaled['BASEL_temp_mean']

# Splitting into Train, Validation, and Test Sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

The target column (BASEL_temp_mean) is separated from the dataset, and the remaining columns are used as features (X).

The dataset is then split into three parts:
* Training set (70%) for fitting the models.
* Validation set (15%) for monitoring performance during training.
* Test set (15%) for the final evaluation on data the models have not seen.
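
The two-stage split can be checked on a toy array. The sketch below also fits the scaler on the training rows only, which is a variation on the notebook (the cell above fits `MinMaxScaler` on the full dataset before splitting) and a common way to keep validation and test statistics out of the preprocessing. The synthetic data and all variable names here are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 1000 rows, 4 features (hypothetical, not the weather data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.normal(size=1000)

# Stage 1: hold out 30% of the rows.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
# Stage 2: split the held-out 30% in half, giving 15% validation and 15% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the other sets.
feature_scaler = MinMaxScaler().fit(X_tr)
X_tr_s = feature_scaler.transform(X_tr)
X_va_s = feature_scaler.transform(X_va)
X_te_s = feature_scaler.transform(X_te)

print(len(X_tr), len(X_va), len(X_te))
```

Applied to the 3,654 rows reported earlier, the same two-stage arithmetic should give about 2,557 training, 548 validation, and 549 test rows, which is consistent with the 80 batches per epoch shown in the training log (batch size 32).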

## **Modeling**

#### **Random Forest Regressor:**

The Random Forest Regressor serves as the baseline. It is trained on the training set and generates predictions on the test set.

In [None]:
# Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)

#### **Gradient Boosting Regressor:**

The Gradient Boosting Regressor is applied as a boosting-based machine learning model. It is also trained on the training set and evaluated on the test set.

In [None]:
# Gradient Boosting Regressor
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
gb_preds = gb_model.predict(X_test)

#### **Recurrent Neural Network (RNN):**

The RNN is built with an LSTM architecture, which is designed to capture patterns in sequential (time series) data.

Model architecture:
* Two stacked LSTM layers with 50 units each, each followed by a dropout layer (rate 0.2) to reduce overfitting.
* A Dense layer with a single unit that predicts the target value.

Additional preprocessing: the feature data (X) is reshaped into three dimensions, (samples, features, 1), to match the input shape the LSTM expects.

The model is trained with the Adam optimizer and Mean Squared Error (MSE) as the loss function.
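
The reshaping step described above can be illustrated in isolation. This sketch uses a small hypothetical array in place of `X_train.values` and shows that `np.expand_dims(..., axis=2)` turns a `(samples, features)` matrix into `(samples, features, 1)`:

```python
import numpy as np

# Hypothetical 2-D feature matrix: 5 samples, 3 features.
X_2d = np.arange(15, dtype=float).reshape(5, 3)

# Add a trailing axis so the shape becomes (samples, timesteps, channels).
X_3d = np.expand_dims(X_2d, axis=2)

print(X_2d.shape)  # (5, 3)
print(X_3d.shape)  # (5, 3, 1)

# The values are unchanged; each feature becomes one step of the sequence.
print(X_3d[0, :, 0])  # [0. 1. 2.]
```

With this layout the LSTM reads each row as a sequence whose steps are the feature columns, one value per step, so the sequence axis runs over features rather than over consecutive days.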

In [None]:
# RNN
rnn_model = Sequential([
    LSTM(50, activation='relu', input_shape=(X_train.shape[1], 1), return_sequences=True),
    Dropout(0.2),
    LSTM(50, activation='relu'),
    Dropout(0.2),
    Dense(1)
])
rnn_model.compile(optimizer='adam', loss='mse')

# Reshape data for RNN
X_train_rnn = np.expand_dims(X_train.values, axis=2)
X_val_rnn = np.expand_dims(X_val.values, axis=2)
X_test_rnn = np.expand_dims(X_test.values, axis=2)

# Training the RNN
rnn_model.fit(X_train_rnn, y_train, validation_data=(X_val_rnn, y_val), epochs=50, batch_size=32)
rnn_preds = rnn_model.predict(X_test_rnn)

Epoch 1/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 16s 149ms/step - loss: 0.1144 - val_loss: 0.0116
Epoch 2/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 23s 290ms/step - loss: 0.0159 - val_loss: 0.0063
Epoch 3/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 23s 288ms/step - loss: 0.0120 - val_loss: 0.0055
Epoch 4/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 33s 190ms/step - loss: 0.0099 - val_loss: 0.0052
Epoch 5/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 17s 144ms/step - loss: 0.0090 - val_loss: 0.0057
Epoch 6/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 14s 169ms/step - loss: 0.0089 - val_loss: 0.0050
Epoch 7/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 19s 157ms/step - loss: 0.0091 - val_loss: 0.0048
Epoch 8/50
80/80 ━━━━━━━━━━━━━━━━━━━━ 12s 149ms/step - loss: 0.0094 - val_loss: 0.0047
Epoch 9/50
80/80 ━━━━━━━━━━━━━

## **Evaluation**

An evaluation function computes three main performance metrics:

* Mean Absolute Error (MAE): the average absolute prediction error.
* Mean Squared Error (MSE): the average squared prediction error.
* R-squared (R²): how well the model's predictions fit the actual data.

Each model is evaluated on the test set so that their performance can be compared.
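
To make the three metrics concrete, the sketch below computes them by hand with NumPy and compares the results with the scikit-learn functions. The `y_true` and `y_pred` values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions (not taken from the notebook).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

errors = y_true - y_pred

mae_manual = np.mean(np.abs(errors))             # mean of |error|
mse_manual = np.mean(errors ** 2)                # mean of error squared
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot                # share of variance explained

print("MAE:", mae_manual, mean_absolute_error(y_true, y_pred))
print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R^2:", r2_manual, r2_score(y_true, y_pred))
```

MAE and MSE are expressed in the units of the target (and its square), so they depend on how the target is scaled, while R² is a unitless ratio.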

In [None]:
# Evaluation function
def evaluate_model(name, y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} Evaluation:\nMAE: {mae}\nMSE: {mse}\nR^2: {r2}\n")

# Evaluate the models
evaluate_model("Random Forest", y_test, rf_preds)
evaluate_model("Gradient Boosting", y_test, gb_preds)
evaluate_model("RNN", y_test, rnn_preds)

Random Forest Evaluation:
MAE: 0.010620116328287372
MSE: 0.00019772468576288951
R^2: 0.9945941293679008

Gradient Boosting Evaluation:
MAE: 0.010223996372961218
MSE: 0.0001809642687199338
R^2: 0.9950523657581859

RNN Evaluation:
MAE: 0.036825949161540514
MSE: 0.0022658183835689515
R^2: 0.9380516347255979
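
The MAE and MSE values above are measured on the min-max scaled target (0-1 range), not in degrees. Because min-max scaling is linear, an absolute error can be mapped back to the original units by multiplying it by the target's range. The sketch below shows the idea on hypothetical temperatures; `temps_c`, `temp_scaler`, and `mae_scaled` are illustrative assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical daily mean temperatures in degrees Celsius (not the real data).
temps_c = np.array([[-5.0], [0.0], [10.0], [25.0]])

temp_scaler = MinMaxScaler().fit(temps_c)

# data_range_ holds (max - min) per column; here 25 - (-5) = 30 degrees.
temp_range = temp_scaler.data_range_[0]

# An error measured on the 0-1 scale (illustrative value only).
mae_scaled = 0.01

# Absolute errors scale by the range, squared errors by the range squared.
mae_celsius = mae_scaled * temp_range
mse_scale_factor = temp_range ** 2

print("Target range:", temp_range)
print("MAE in degrees Celsius:", mae_celsius)
print("Factor to convert scaled MSE:", mse_scale_factor)
```

In the notebook itself, the corresponding range could presumably be read from the fitted scaler with `scaler.data_range_[df.columns.get_loc("BASEL_temp_mean")]`. R² needs no conversion, since it is unchanged when the true and predicted values are rescaled by the same linear transformation.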



## **Testing**

This section re-runs each model on the same held-out test set (X_test) used in the Evaluation section, which the models did not see during training, so the reported metrics are identical to those above.

In [None]:
# Testing phase (testing the models with unseen data)
print("Testing Random Forest Model:")
rf_test_preds = rf_model.predict(X_test)
evaluate_model("Random Forest Test", y_test, rf_test_preds)

print("Testing Gradient Boosting Model:")
gb_test_preds = gb_model.predict(X_test)
evaluate_model("Gradient Boosting Test", y_test, gb_test_preds)

print("Testing RNN Model:")
rnn_test_preds = rnn_model.predict(X_test_rnn)
evaluate_model("RNN Test", y_test, rnn_test_preds)

Testing Random Forest Model:
Random Forest Test Evaluation:
MAE: 0.010620116328287372
MSE: 0.00019772468576288951
R^2: 0.9945941293679008

Testing Gradient Boosting Model:
Gradient Boosting Test Evaluation:
MAE: 0.010223996372961218
MSE: 0.0001809642687199338
R^2: 0.9950523657581859

Testing RNN Model:
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 96ms/step
RNN Test Evaluation:
MAE: 0.036825949161540514
MSE: 0.0022658183835689515
R^2: 0.9380516347255979

