# FINAL PROJECT 2
# Daily Weather Prediction in Australia
Group 8 Members:
- Rijal Muhammad Farizky
- Diva Nabila

## Project Overview
### Background
`Weather` is an important factor that affects daily life and many economic sectors, including agriculture, tourism, and disaster management. In Australia, extreme weather such as droughts and floods can have a significant impact on people's lives and on the economy. An accurate weather prediction system built on historical data is therefore an urgent need.

The `Rain in Australia` dataset contains about 10 years of daily weather observations from weather stations across Australia. With this dataset we can explore the data and build machine learning models for weather prediction.

### Goals
1. `Exploratory Data Analysis`: Explore the data in depth to understand weather patterns in Australia over the roughly 10 years covered by the dataset, identifying trends, seasonal variability, and relationships between weather variables.

2. `Daily Weather Prediction`: Develop a daily weather prediction model using two approaches: `Logistic Regression` and `Support Vector Machine (SVM)`.

3. `Model Evaluation`: Measure model performance with evaluation metrics such as accuracy, precision, recall, and F1-score, and compare the Logistic Regression and SVM models to determine which one gives the best predictions.

4. `Implementation and Testing`: Deploy the Logistic Regression and Support Vector Machine (SVM) models into a system that can provide real-time daily weather predictions based on the latest weather data.
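
The comparison described in goals 2 and 3 could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual modelling code: it assumes scikit-learn is installed, uses synthetic data from `make_classification` in place of the weather dataset, and the helper name `compare_models` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_models(X, y, random_state=0):
    """Fit both classifiers on the same split and return their test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    # Both models are scaled first; SVMs in particular are sensitive to feature scale.
    models = {
        "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
        }
    return results


# Synthetic binary data standing in for the rain / no-rain target (assumption).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.78, 0.22], random_state=0)
results = compare_models(X, y)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Evaluating both models on the same held-out split keeps the comparison fair. Reporting precision, recall, and F1 alongside accuracy matters when one class is much rarer than the other.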

## Libraries Used

In [1]:
# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Render matplotlib figures inline in the notebook
%matplotlib inline

# Data handling, statistics, and plotting
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import pickle  # object serialization

# scikit-learn: feature selection, preprocessing, pipelines, models, metrics
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score

## Data Loading
**Load Dataset**

In [5]:
df_wth = pd.read_csv('../dataset/weatherAUS.csv')

**Row and Column of Dataset**

In [6]:
df_wth.shape

(145460, 23)

**The Data**

In [40]:
df_wth

| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albury | 13.4 | 22.9 | 0.6 | 4.8 | 8.4 | W | 44.0 | W | WNW | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | 5.0 | 16.9 | 21.8 | No | No |
| 1 | Albury | 7.4 | 25.1 | 0.0 | 4.8 | 8.4 | WNW | 44.0 | NNW | WSW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | 5.0 | 5.0 | 17.2 | 24.3 | No | No |
| 2 | Albury | 12.9 | 25.7 | 0.0 | 4.8 | 8.4 | WSW | 46.0 | W | WSW | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | 5.0 | 2.0 | 21.0 | 23.2 | No | No |
| 3 | Albury | 9.2 | 28.0 | 0.0 | 4.8 | 8.4 | NE | 24.0 | SE | E | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | 5.0 | 5.0 | 18.1 | 26.5 | No | No |
| 4 | Albury | 17.5 | 32.3 | 1.0 | 4.8 | 8.4 | W | 41.0 | ENE | NW | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145454 | Uluru | 3.5 | 21.8 | 0.0 | 4.8 | 8.4 | E | 31.0 | ESE | E | ... | 59.0 | 27.0 | 1024.7 | 1021.2 | 5.0 | 5.0 | 9.4 | 20.9 | No | No |
| 145455 | Uluru | 2.8 | 23.4 | 0.0 | 4.8 | 8.4 | E | 31.0 | SE | ENE | ... | 51.0 | 24.0 | 1024.6 | 1020.3 | 5.0 | 5.0 | 10.1 | 22.4 | No | No |
| 145456 | Uluru | 3.6 | 25.3 | 0.0 | 4.8 | 8.4 | NNW | 22.0 | SE | N | ... | 56.0 | 21.0 | 1023.5 | 1019.1 | 5.0 | 5.0 | 10.9 | 24.5 | No | No |
| 145457 | Uluru | 5.4 | 26.9 | 0.0 | 4.8 | 8.4 | N | 37.0 | SE | WNW | ... | 53.0 | 24.0 | 1021.0 | 1016.8 | 5.0 | 5.0 | 12.5 | 26.1 | No | No |


**Detailed information about the dataset**

In [41]:
df_wth.info()

<class 'pandas.core.frame.DataFrame'>
Index: 123710 entries, 0 to 145458
Data columns (total 22 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Location       123710 non-null  object 
 1   MinTemp        123710 non-null  float64
 2   MaxTemp        123710 non-null  float64
 3   Rainfall       123710 non-null  float64
 4   Evaporation    123710 non-null  float64
 5   Sunshine       123710 non-null  float64
 6   WindGustDir    123710 non-null  object 
 7   WindGustSpeed  123710 non-null  float64
 8   WindDir9am     123710 non-null  object 
 9   WindDir3pm     123710 non-null  object 
 10  WindSpeed9am   123710 non-null  float64
 11  WindSpeed3pm   123710 non-null  float64
 12  Humidity9am    123710 non-null  float64
 13  Humidity3pm    123710 non-null  float64
 14  Pressure9am    123710 non-null  float64
 15  Pressure3pm    123710 non-null  float64
 16  Cloud9am       123710 non-null  float64
 17  Cloud3pm       123710 non-null  float64
 ...

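The raw dataset consists of `145460 rows` and `23 columns`. It still contains missing values: of the 23 columns, only 2 are completely non-null, namely Date and Location. Data cleaning is therefore required. Note that the `df_wth` preview and the `info()` output shown above were captured after the cleaning cells below had already been run (execution counts In [40] and In [41]), which is why they report 123710 entries, 22 columns, and no null values.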

## Data Cleaning

### Examining Duplicated Values

In [10]:
df_wth.duplicated().sum()

0

### Examining Missing Values

In [13]:
df_wth.isnull().sum()

Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64
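
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get a per-column percentage with `isnull().mean()`; the `toy` frame holds made-up values and only stands in for `df_wth`. The last lines apply the same arithmetic to two counts from the output above, which gives roughly 48% missing for Sunshine and 43% for Evaporation.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for df_wth (assumption).
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Uluru", "Uluru"],
    "MinTemp": [13.4, np.nan, 3.5, 2.8],
    "Sunshine": [np.nan, np.nan, 8.4, np.nan],
})

# isnull().mean() is the fraction of missing entries in each column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# The same arithmetic applied to two of the counts reported above.
n_rows = 145460
sunshine_pct = 69835 / n_rows * 100
evaporation_pct = 62790 / n_rows * 100
print(round(sunshine_pct, 1), round(evaporation_pct, 1))
```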

In [29]:
df_wth.drop("Date", axis=1, inplace=True)

### Filling Missing Values
Filling Numerical Columns With Median

In [30]:
numerical = [col for col in df_wth.columns if df_wth[col].dtype!='O']
numerical

['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm']
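
An alternative way to build the same list is `DataFrame.select_dtypes`, which picks columns by numeric kind instead of excluding the object dtype. The sketch below uses a made-up `toy` frame rather than `df_wth`, and the `categorical` name is introduced here only for illustration.

```python
import pandas as pd

# Toy frame with made-up values, standing in for df_wth (assumption).
toy = pd.DataFrame({
    "Location": ["Albury", "Uluru"],
    "MinTemp": [13.4, 3.5],
    "RainToday": ["No", "No"],
})

# Select column names by dtype kind.
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical, categorical)
```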

In [31]:
df_wth.describe()

| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 143975.0 | 144199.0 | 142199.0 | 82670.0 | 75625.0 | 135197.0 | 143693.0 | 142398.0 | 142806.0 | 140953.0 | 130395.0 | 130432.0 | 89572.0 | 86102.0 | 143693.0 | 141851.0 |
| mean | 12.194034 | 23.221348 | 2.360918 | 5.468232 | 7.611178 | 40.03523 | 14.043426 | 18.662657 | 68.880831 | 51.539116 | 1017.64994 | 1015.255889 | 4.447461 | 4.50993 | 16.990631 | 21.68339 |
| std | 6.398495 | 7.119049 | 8.47806 | 4.193704 | 3.785483 | 13.607062 | 8.915375 | 8.8098 | 19.029164 | 20.795902 | 7.10653 | 7.037414 | 2.887159 | 2.720357 | 6.488753 | 6.93665 |
| min | -8.5 | -4.8 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 980.5 | 977.1 | 0.0 | 0.0 | -7.2 | -5.4 |
| 25% | 7.6 | 17.9 | 0.0 | 2.6 | 4.8 | 31.0 | 7.0 | 13.0 | 57.0 | 37.0 | 1012.9 | 1010.4 | 1.0 | 2.0 | 12.3 | 16.6 |
| 50% | 12.0 | 22.6 | 0.0 | 4.8 | 8.4 | 39.0 | 13.0 | 19.0 | 70.0 | 52.0 | 1017.6 | 1015.2 | 5.0 | 5.0 | 16.7 | 21.1 |
| 75% | 16.9 | 28.2 | 0.8 | 7.4 | 10.6 | 48.0 | 19.0 | 24.0 | 83.0 | 66.0 | 1022.4 | 1020.0 | 7.0 | 7.0 | 21.6 | 26.4 |
| max | 33.9 | 48.1 | 371.0 | 145.0 | 14.5 | 135.0 | 130.0 | 87.0 | 100.0 | 100.0 | 1041.0 | 1039.6 | 9.0 | 9.0 | 40.2 | 46.7 |


In [36]:
df_wth[numerical] = df_wth[numerical].fillna(df_wth[numerical].median())
df_wth.isna().sum()

Location             0
MinTemp              0
MaxTemp              0
Rainfall             0
Evaporation          0
Sunshine             0
WindGustDir      10326
WindGustSpeed        0
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am         0
WindSpeed3pm         0
Humidity9am          0
Humidity3pm          0
Pressure9am          0
Pressure3pm          0
Cloud9am             0
Cloud3pm             0
Temp9am              0
Temp3pm              0
RainToday         3261
RainTomorrow      3267
dtype: int64
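
The cell above fills every numerical column with a median computed over the whole dataset. If the data is later split for modelling, one possible variant is to compute the medians on the training part only and reuse them for the test part, so that test rows do not influence the fill values. The sketch below illustrates this on a made-up `toy` frame with a hypothetical positional split; it is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for df_wth (assumption).
toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, 9.2, 17.5, 7.4, np.nan, 2.8],
    "Rainfall": [0.6, 0.0, np.nan, 0.0, 1.0, 0.0, np.nan, 0.0],
})

# Hypothetical split by position: first six rows for training, the rest for testing.
train = toy.iloc[:6]
test = toy.iloc[6:]

# Medians come from the training rows only; NaN values are skipped by default.
medians = train.median()

# fillna with a Series aligns on column names, so each column gets its own median.
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)
print(medians)
print(test_filled)
```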

### Dropping Missing Values in Categorical Columns

In [37]:
df_wth.dropna(subset=['WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow'], axis=0, inplace=True)
df_wth.isna().sum()

Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

In [39]:
df_wth.shape

(123710, 22)

After dropping the rows with missing values in the categorical columns, the dataset shrinks from `145460` rows to `123710` rows.
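
The two row counts imply that 21750 rows, about 15% of the original data, were removed. The arithmetic is sketched below using only the `df_wth.shape` values reported in this notebook.

```python
# Row counts taken from the df_wth.shape outputs before and after cleaning.
rows_before = 145460
rows_after = 123710

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before
print(rows_dropped, f"{share_dropped:.1%}")
```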