# About The Dataset


This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | Hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |




## **Import the required libraries**


In [113]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [114]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [115]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

In [116]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [117]:
await download(path, "Weather_Data.csv")
filename ="Weather_Data.csv"

In [118]:
df = pd.read_csv("Weather_Data.csv")

In [120]:
df.head()

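|   | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| - | -------- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0 | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | W           | 41            | S          | SSW        | ... | 92          | 84          | 1017.6      | 1017.4      | 8        | 8        | 20.7    | 20.9    | Yes       | Yes          |
| 1 | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | W           | 41            | W          | E          | ... | 83          | 73          | 1017.9      | 1016.4      | 7        | 7        | 22.4    | 24.8    | Yes       | Yes          |
| 2 | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | W           | 41            | ESE        | ESE        | ... | 88          | 86          | 1016.7      | 1015.6      | 7        | 8        | 23.5    | 23.0    | Yes       | Yes          |
| 3 | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | W           | 41            | NNE        | E          | ... | 83          | 90          | 1014.2      | 1011.8      | 8        | 8        | 21.4    | 20.9    | Yes       | Yes          |
| 4 | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | W           | 41            | NNE        | W          | ... | 88          | 74          | 1008.3      | 1004.8      | 8        | 8        | 22.5    | 25.5    | Yes       | Yes          |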


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [121]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
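
To see what `get_dummies` produces on a small scale, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are invented for illustration and are not taken from the dataset. Each listed categorical column is replaced by one indicator column per category, while the untouched columns are kept.

```python
import pandas as pd

# Invented toy data, not rows from Weather_Data.csv
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
    "MinTemp": [19.5, 21.6, 20.2],
})

# Encode only the categorical columns; MinTemp is left as-is
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])
print(list(encoded.columns))
print(encoded)
```

The indicator columns are named `<column>_<category>`, which is why the processed frame ends up with names such as `RainToday_Yes` or `WindGustDir_W`.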

Next, we replace the values of the 'RainTomorrow' column, converting it from a categorical column to a binary one. We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [122]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
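
Note that calling `replace` on the whole frame substitutes 'No' and 'Yes' wherever they occur, in any column. A column-scoped alternative is to map only the target column. The sketch below shows this on an invented toy frame and is an assumption about how one might write it, not the notebook's own method.

```python
import pandas as pd

# Invented toy target column
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", "No", "Yes"]})

# Map only this column; any value other than 'No'/'Yes' would become NaN
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})
print(toy["RainTomorrow"].tolist())
```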

### Training Data and Test Data


Now we define our features (the x values) and our target variable, Y.


In [123]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [124]:
df_sydney_processed = df_sydney_processed.astype(float)

In [125]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [126]:
x_train, x_test, y_train, y_test = train_test_split(features , Y , test_size=0.2 , random_state=10)
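
As a quick sanity check on what `test_size` and `random_state` do, the following sketch splits ten invented samples. The arrays `X_toy` and `y_toy` are hypothetical. With `test_size=0.2`, two of the ten samples go to the test set, and repeating the call with the same `random_state` yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 invented samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10
)
print(a_train.shape, a_test.shape)

# Same seed, same split
_, a_test_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10
)
print(np.array_equal(a_test, a_test_again))
```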


#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [127]:
LinearReg = LinearRegression()
LinearReg.fit(x_train , y_train)

#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [129]:
y_pred = LinearReg.predict(x_test)

#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [130]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Regression metrics are computed on the continuous predictions
LinearRegression_MAE = mean_absolute_error(y_test, y_pred)
LinearRegression_MSE = mean_squared_error(y_test, y_pred)
LinearRegression_R2 = r2_score(y_test, y_pred)

# Threshold at 0.5 to turn the continuous output into 0/1 labels,
# so the classification metrics can be compared with the other models
y_pred = (y_pred > 0.5).astype(int)
linear_Accuracy_Score = accuracy_score(y_test, y_pred)
linear_JaccardIndex = jaccard_score(y_test, y_pred)
linear_F1_Score = f1_score(y_test, y_pred)
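
To make the two kinds of metric concrete, here is a small worked sketch on invented numbers (`y_true` and `y_cont` are hypothetical). MAE and MSE are computed by hand from the continuous outputs, and the same outputs are then thresholded at 0.5 to get labels for accuracy. Note that a linear regression can output values outside the range 0 to 1, as the last entry illustrates.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

y_true = np.array([0, 1, 1, 0])
y_cont = np.array([0.2, 0.7, 0.4, -0.1])  # invented continuous outputs

# Hand-computed regression metrics
mae = np.mean(np.abs(y_true - y_cont))
mse = np.mean((y_true - y_cont) ** 2)

# Thresholding turns the continuous outputs into class labels
labels = (y_cont > 0.5).astype(int)
acc = accuracy_score(y_true, labels)

print(mae, mean_absolute_error(y_true, y_cont))
print(mse, mean_squared_error(y_true, y_cont))
print(labels, acc)
```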


#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.


In [132]:
results = pd.DataFrame({
    "Metric": ["Mean Absolute Error (MAE)", "Mean Squared Error (MSE)", "R² Score"],
    "Value": [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
})

# Display the DataFrame
print(results)

                      Metric     Value
0  Mean Absolute Error (MAE)  0.256319
1   Mean Squared Error (MSE)  0.115721
2                   R² Score  0.427130


### KNN


#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [134]:
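# Note: Q6 asks for n_neighbors=4, but this cell relies on the scikit-learn
# default (n_neighbors=5); the output recorded below comes from this cell as written.
KNN = KNeighborsClassifier()

KNN.fit(x_train, y_train)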
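
For reference, the `n_neighbors` parameter is passed to the constructor. The sketch below sets it to `4` on an invented one-dimensional toy problem (`X_toy`, `y_toy`, and `knn_toy` are hypothetical names); it illustrates the parameter only and does not reproduce any result from this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented 1-D toy data: two well-separated clusters
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=4)
knn_toy.fit(X_toy, y_toy)

# Each query point takes the majority label of its 4 nearest neighbours
toy_pred = knn_toy.predict(np.array([[1.5], [11.5]]))
print(knn_toy.get_params()["n_neighbors"], toy_pred)
```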

#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [136]:
y_pred = KNN.predict(x_test)

#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [138]:
KNN_Accuracy_Score = accuracy_score(y_test, y_pred)
KNN_JaccardIndex = jaccard_score(y_test , y_pred)
KNN_F1_Score = f1_score(y_test , y_pred)

print(KNN_Accuracy_Score,KNN_JaccardIndex,KNN_F1_Score)

0.8198473282442749 0.4660633484162896 0.6358024691358025
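
For a binary target, the Jaccard index and the F1 score are both built from the same counts: `J = TP / (TP + FP + FN)` and `F1 = 2*TP / (2*TP + FP + FN)`, so `J = F1 / (2 - F1)`. The values printed above satisfy this: 0.6358 / (2 - 0.6358) ≈ 0.4661. The sketch below checks the relationship on invented labels (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

# Invented labels: TP = 2, FN = 2, FP = 1
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 1, 0, 0, 0, 1, 0, 0])

jac = jaccard_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

print(jac, f1, f1 / (2 - f1))
```

Because one is a monotone function of the other, the two metrics always rank models the same way on a given test set.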


### Decision Tree


#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [140]:
Tree = DecisionTreeClassifier()
Tree.fit(x_train , y_train)

#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [142]:
y_pred = Tree.predict(x_test)


#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [144]:
Tree_Accuracy_Score = accuracy_score(y_test , y_pred)
Tree_JaccardIndex = jaccard_score(y_test , y_pred)
Tree_F1_Score = f1_score(y_test , y_pred)
print(Tree_Accuracy_Score,Tree_JaccardIndex,Tree_F1_Score)

0.7435114503816794 0.38461538461538464 0.5555555555555556


### Logistic Regression


#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [145]:
# Note: Q12 asks for random_state=1, but this cell uses random_state=10,
# which reproduces the same split as in Q1.
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)

#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [146]:
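# Note: Q13 asks for solver='liblinear', but this cell uses the default
# solver instead ('lbfgs' in scikit-learn 0.22 and later).
LR = LogisticRegression()
LR.fit(x_train, y_train)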

#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.


In [148]:
y_pred = LR.predict(x_test)

In [149]:
y_pred_proba = LR.predict_proba(x_test)
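
To illustrate how `predict` and `predict_proba` relate, here is a minimal sketch on invented data, with the `liblinear` solver passed explicitly as Q13 describes. The names `X_toy`, `y_toy`, and `lr_toy` are hypothetical. `predict_proba` returns one column per class, each row sums to 1, and the predicted label is the class with the larger probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 1-D toy data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

lr_toy = LogisticRegression(solver="liblinear")
lr_toy.fit(X_toy, y_toy)

toy_labels = lr_toy.predict(X_toy)
toy_proba = lr_toy.predict_proba(X_toy)  # shape (n_samples, n_classes)

# Recover the labels from the probabilities
labels_from_proba = lr_toy.classes_[np.argmax(toy_proba, axis=1)]
print(toy_proba.shape)
print(np.array_equal(toy_labels, labels_from_proba))
```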

#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [151]:
LR_Accuracy_Score = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {LR_Accuracy_Score:.2f}")

LR_JaccardIndex = jaccard_score(y_test, y_pred)
print(f"Jaccard Index: {LR_JaccardIndex:.2f}")

LR_F1_Score = f1_score(y_test, y_pred)
print(f"F1 Score: {LR_F1_Score:.2f}")

LR_Log_Loss = log_loss(y_test, y_pred_proba)
print(f"Log Loss: {LR_Log_Loss:.2f}")

Accuracy Score: 0.84
Jaccard Index: 0.52
F1 Score: 0.68
Log Loss: 0.37
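
Unlike the other three metrics, log loss is computed from the predicted probabilities rather than the predicted labels: it is the mean negative log of the probability assigned to the true class. The sketch below computes it by hand on invented probabilities (`y_true` and `proba` are hypothetical) and compares the result with `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
# Invented probabilities, columns ordered as [P(class 0), P(class 1)]
proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.7, 0.3],
])

# Probability the model assigned to the true class of each sample
p_true = proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(p_true))

print(manual, log_loss(y_true, proba))
```

A confident wrong prediction is penalised heavily, which is why log loss can separate two models that have the same accuracy.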


### SVM


#### Q16) Create and train an SVM model called SVM using the training data (`x_train`, `y_train`).


In [153]:
SVM = svm.SVC()
SVM.fit(x_train , y_train)

#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [155]:
y_pred = SVM.predict(x_test)

#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [156]:
SVM_Accuracy_Score = accuracy_score(y_test , y_pred)
SVM_JaccardIndex = jaccard_score(y_test , y_pred)
SVM_F1_Score = f1_score(y_test , y_pred )

#### Q19) Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using a data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [158]:
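def calculate_metrics(acc, jaccard, f1, logloss=None, model_name=""):
    """Collect already-computed scores into one row of the results table."""
    # Named `row` rather than `metrics` to avoid shadowing the
    # `sklearn.metrics` alias imported at the top of the notebook
    row = {
        "Model": model_name,
        "Accuracy": acc,
        "Jaccard Index": jaccard,
        "F1 Score": f1,
        "Log Loss": logloss if logloss is not None else "N/A"
    }
    return row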



lr_metrics = calculate_metrics(linear_Accuracy_Score, linear_JaccardIndex,linear_F1_Score, model_name="Linear Regression")

knn_metrics = calculate_metrics(KNN_Accuracy_Score, KNN_JaccardIndex,KNN_F1_Score, model_name="KNN")

svm_metrics = calculate_metrics(SVM_Accuracy_Score, SVM_JaccardIndex,SVM_F1_Score, model_name="SVM")

dt_metrics = calculate_metrics(Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, model_name="Decision Trees")

log_reg_metrics = calculate_metrics(LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score , LR_Log_Loss, model_name="Logistic Regression")

results = pd.DataFrame([lr_metrics, knn_metrics, svm_metrics, dt_metrics, log_reg_metrics])

print(results)


                 Model  Accuracy  Jaccard Index  F1 Score  Log Loss
0    Linear Regression  0.836641       0.513636  0.678679       N/A
1                  KNN  0.819847       0.466063  0.635802       N/A
2                  SVM  0.719084       0.000000  0.000000       N/A
3       Decision Trees  0.743511       0.384615  0.555556       N/A
4  Logistic Regression  0.844275       0.518868  0.683230  0.366987
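
The SVM row, with a Jaccard Index and F1 Score of exactly zero alongside a non-trivial accuracy, is the pattern we would see if a classifier never predicted the positive class. We have not verified this against the actual `y_pred` from the SVM cell, so treat it as a hypothesis. The sketch below reproduces the pattern on invented labels (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Invented labels: 7 negatives, 3 positives
y_true = np.array([0] * 7 + [1] * 3)
# A classifier that always predicts the negative class
y_hat = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_hat)
jac = jaccard_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat, zero_division=0)

print(acc, jac, f1)
```

Here the accuracy simply equals the share of negatives in the labels. Running `np.unique(y_pred)` right after the SVM prediction cell would confirm or rule out this explanation.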
