# Weather Classification and Rainfall Prediction in Australia

## Table of Contents

1. [About the Dataset](#About-The-Dataset)
2. [Importing Data](#Importing-Data)
3. [Data Preprocessing](#Data-Preprocessing)
4. [One Hot Encoding](#One-Hot-Encoding)
5. [Train and Test Data Split](#Train-and-Test-Data-Split)
6. [Model Training and Evaluation](#Model-Training-and-Evaluation)
    - Linear Regression
    - K-Nearest Neighbors (KNN)
    - Decision Tree
    - Logistic Regression
    - Support Vector Machine (SVM)
7. [Model Comparison](#Model-Comparison)

# About The Dataset


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [4]:
import requests

def download(url, filename):
    # Fetch the file and raise on HTTP errors instead of silently skipping the write
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)

In [5]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
filename = "Weather_Data.csv"

download(path, filename)

df = pd.read_csv(filename)

df.head()

### Data Preprocessing


#### One Hot Encoding

We perform one hot encoding to convert categorical variables to binary variables.

In [9]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

df_sydney_processed.drop('Date',axis=1,inplace=True)

df_sydney_processed = df_sydney_processed.astype(float)

features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']
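
To make the effect of one hot encoding concrete, here is a minimal sketch on a hypothetical three-row frame. The values are illustrative only and are not taken from Weather_Data.csv; the column names simply mirror a few of the real ones.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "RainToday": ["No", "Yes", "No"],
    "WindGustDir": ["W", "NNE", "W"],
    "MaxTemp": [22.4, 25.6, 21.0],
})

# Each listed column is replaced by one indicator column per distinct category.
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"]).astype(float)

print(sorted(encoded.columns))
print(encoded)
```

Each original categorical column becomes a group of 0/1 columns in which exactly one entry per row is 1, while numeric columns such as MaxTemp pass through unchanged.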

### Linear Regression



Splitting the data into training and testing sets

We use train_test_split from sklearn.model_selection to split our data with the following arguments:
- features: input variables (X)
- Y: target variable
- test_size=0.2: 20% of the data will be used for testing, 80% for training
- random_state=10: for reproducibility of the split

This split allows us to train our model on one portion of the data
and then test its performance on unseen data, helping to evaluate
how well the model generalizes.


In [14]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
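
As a quick check of what test_size and random_state do, the following sketch splits a hypothetical random dataset (the names X_toy and y_toy are placeholders, not part of the notebook) and confirms the 80/20 proportions and the reproducibility of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, binary target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

print(X_tr.shape, X_te.shape)        # 80 training rows, 20 test rows
print(np.array_equal(X_te, X_te2))   # same seed gives the same split
```

For an imbalanced target such as RainTomorrow, passing stratify=Y is one option for keeping the class proportions similar in both parts; the notebook itself does not do this.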


Training the model

In [15]:
from sklearn.linear_model import LinearRegression

LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

Testing the model

In [18]:
# Use the predict method on the testing data (x_test)
predictions = LinearReg.predict(x_test)

Evaluating the Linear Regression Model

After training and testing our Linear Regression model, we need to evaluate its performance.
We'll use three common metrics for regression problems:

1. Mean Absolute Error (MAE): Average of absolute differences between predicted and actual values.
   - Gives an idea of how far off our predictions are on average, in the same units as the target variable.

2. Mean Squared Error (MSE): Average of squared differences between predicted and actual values.
   - Penalizes larger errors more heavily, useful when large errors are particularly undesirable.

3. R-squared (R2) Score: Proportion of variance in the dependent variable predictable from the independent variable(s).
   - At most 1, with 1 indicating perfect prediction and 0 indicating the model predicts no better than a horizontal line at the mean of the target. On test data it can be negative when the model does worse than that baseline.

These metrics will help us understand how well our Linear Regression model is performing on the test data.
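
The three definitions can be checked by hand on a tiny example. The sketch below uses hypothetical targets and predictions (not values from the weather data), computes each metric directly from its formula, and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical 0/1 targets and continuous predictions.
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.2, 0.7, 0.1, 0.4])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))

# Predictions that are worse than always guessing the mean give a negative R2.
y_bad = np.array([1.0, 0.0, 1.0, 0.0])
print(r2_score(y_true, y_bad))
```

The last line illustrates that R2 is not bounded below by 0 on held-out data.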



In [19]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate Mean Absolute Error (MAE)
LinearRegression_MAE = mean_absolute_error(y_test, predictions)

# Calculate Mean Squared Error (MSE)
LinearRegression_MSE = mean_squared_error(y_test, predictions)

# Calculate R-squared (R2) score
LinearRegression_R2 = r2_score(y_test, predictions)

metrics_df = pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'R2'],
    'Value': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
})

metrics_df

### KNN


Train a KNN model called KNN using the training data

KNN (K-Nearest Neighbors) is a simple, non-parametric classification algorithm.
It works by finding the K nearest neighbors to a given data point in the feature space,
and then classifying the point based on the majority class of its neighbors.

Steps:
1. Choose the number of neighbors (K)
2. Find the K nearest neighbors of the sample to be classified
3. Assign the class label by majority vote of the K neighbors

Advantages:
- Simple to understand and implement
- No assumptions about data distribution
- Can be effective for multi-class problems

Disadvantages:
- Computationally expensive for large datasets
- Sensitive to irrelevant features and the scale of the data
- Requires feature scaling for best performance
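
The three steps listed above can be sketched from scratch in a few lines. This is an illustrative implementation on hypothetical 2-D points, assuming Euclidean distance; the function name knn_predict is a placeholder and this is not how scikit-learn implements KNeighborsClassifier internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.1]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3))
```

Because the vote depends directly on raw distances, a feature measured in large units (for example pressure in hectopascals) can dominate one measured in small units, which is why scaling matters for this algorithm. With an even k, ties are possible; in this sketch they are resolved by whichever label was counted first.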

In the following code, we'll create and train a KNN model with 4 neighbors.

In [21]:
from sklearn.neighbors import KNeighborsClassifier

# Create and train a KNN model
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(x_train, y_train)

Predict using the KNN model

After training the KNN model, the next step is to use it for making predictions
on the test data. This allows us to evaluate the model's performance on unseen data.

The following cell will:
1. Convert the test data (x_test) to a contiguous numpy array
2. Use the trained KNN model to make predictions on this test data

Note: Converting to a contiguous array ensures that the data is stored in a
continuous block of memory, which can improve performance for some operations.



In [24]:
# Convert x_test to a numpy array and ensure it's contiguous
x_test_array = np.ascontiguousarray(x_test)

# Use the predict method on the testing data (x_test_array)
predictions = KNN.predict(x_test_array)
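
What "contiguous" means can be seen on a small array. The sketch below uses a hypothetical 2x3 array whose transpose is a view with a non-C-contiguous layout, and shows that np.ascontiguousarray copies it into C order without changing the values.

```python
import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)
view = a.T                           # transposed view: same data, not C-contiguous
fixed = np.ascontiguousarray(view)   # copy laid out row by row in memory

print(view.flags["C_CONTIGUOUS"])
print(fixed.flags["C_CONTIGUOUS"])
print(np.array_equal(view, fixed))
```

When the input is a pandas DataFrame, as x_test is here, the result is a plain NumPy array, so column names are no longer attached to the data passed to predict.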

Evaluate KNN Model Performance

After making predictions with our KNN model, we need to evaluate its performance.
We'll use three common metrics for classification tasks:

1. Accuracy Score: Measures the overall correctness of predictions.
2. Jaccard Index: Measures the similarity between predicted and actual label sets.
3. F1 Score: Harmonic mean of precision and recall, balancing both metrics.

The next cell will calculate and print these metrics for our KNN model.


In [25]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Calculate accuracy score
KNN_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate Jaccard index
KNN_JaccardIndex = jaccard_score(y_test, predictions, average='weighted')

# Calculate F1 score
KNN_F1_Score = f1_score(y_test, predictions, average='weighted')

# Print the results
print("KNN Accuracy Score:", KNN_Accuracy_Score)
print("KNN Jaccard Index:", KNN_JaccardIndex)
print("KNN F1 Score:", KNN_F1_Score)

KNN Accuracy Score: 0.8183206106870229
KNN Jaccard Index: 0.6875883517104892
KNN F1 Score: 0.802374933635524
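
The cell above passes average='weighted' to jaccard_score and f1_score. To show how much that choice matters on an imbalanced target, here is a small sketch on hypothetical labels (7 dry days, 3 rainy days) that compares 'weighted' with 'binary' averaging.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Hypothetical labels: 7 dry days (0) and 3 rainy days (1); only one rainy day is caught.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)

# 'binary' scores only the positive class (label 1).
jaccard_binary = jaccard_score(y_true, y_pred, average="binary")
f1_binary = f1_score(y_true, y_pred, average="binary")

# 'weighted' averages the per-class scores, weighted by each class's support.
jaccard_weighted = jaccard_score(y_true, y_pred, average="weighted")
f1_weighted = f1_score(y_true, y_pred, average="weighted")

print(acc, jaccard_binary, jaccard_weighted, f1_binary, f1_weighted)
```

On these labels the weighted scores come out noticeably higher than the binary ones, because the well-predicted majority class contributes most of the weight. Scores computed with different average settings should therefore not be compared directly.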


### Decision Tree


Decision Tree Classifier

In this section, we'll implement a Decision Tree classifier.
Decision Trees are versatile machine learning algorithms that can be used for both classification and regression tasks.
They work by creating a model that predicts the target variable by learning simple decision rules inferred from the data features.

Key characteristics of Decision Trees:
1. Non-linear relationships: Can capture complex, non-linear relationships in the data.
2. Feature importance: Provides a measure of importance for each feature in making decisions.
3. Interpretability: The decision-making process can be easily visualized and understood.
4. No feature scaling required: Unlike some other algorithms, decision trees don't require feature scaling.

We'll use scikit-learn's DecisionTreeClassifier to create and train our model, then evaluate its performance using the same metrics as before.


In [26]:
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree classifier
Tree = DecisionTreeClassifier(random_state=1)

# Train the model using the training data
Tree.fit(x_train, y_train)
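
The feature importance mentioned above is exposed by a fitted tree through its feature_importances_ attribute. The following sketch uses hypothetical random data in which the label depends only on the first feature, so that feature should receive essentially all of the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# Hypothetical rule: the label depends only on the first feature.
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

importances = toy_tree.feature_importances_
print(importances)           # one value per feature, summing to 1
print(toy_tree.get_depth())  # a single split is enough for this rule
```

For the weather model, one way to read the same attribute would be pd.Series(Tree.feature_importances_, index=features.columns).sort_values(ascending=False), assuming Tree and features are defined as in the cells above.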

Evaluate Decision Tree Model Performance

After training the Decision Tree model, we need to evaluate its performance
on the test set. We'll calculate the following metrics:

1. Accuracy Score: Measures the overall correctness of predictions
2. Jaccard Index: Measures the similarity between predicted and actual labels
3. F1 Score: Harmonic mean of precision and recall

These metrics will help us assess how well our Decision Tree model performs
compared to other models we've implemented, such as KNN.


In [27]:
# Use the trained Decision Tree model to make predictions on the test data
predictions = Tree.predict(x_test)

Evaluate the Model

In [28]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions, average='weighted')
Tree_F1_Score = f1_score(y_test, predictions, average='weighted')

### Logistic Regression


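Create new training and test sets. This split uses random_state=1, whereas the earlier split used random_state=10, so the models below are evaluated on a different test set from KNN and the Decision Tree.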

In [29]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

Train the model

In [30]:
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train, y_train)

Make predictions using the Logistic Regression model

This step involves using the trained Logistic Regression model (LR)
to make predictions on the test data (x_test). We'll generate both
class predictions and probability estimates.


In [31]:
predictions = LR.predict(x_test)
predict_proba = LR.predict_proba(x_test)

Evaluate the model

In [32]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions, average='binary')
LR_F1_Score = f1_score(y_test, predictions, average='binary')
LR_Log_Loss = log_loss(y_test, predict_proba)
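
Log loss is the one metric here that uses the probability estimates rather than the class predictions. As a sketch of what log_loss computes, the example below evaluates the binary cross-entropy formula by hand on hypothetical probabilities and compares it with scikit-learn.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
# Hypothetical predicted probabilities of the positive class.
p_pos = np.array([0.1, 0.8, 0.6, 0.3])

# Binary cross-entropy: average negative log-probability assigned to the true class.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

# Same two-column layout that predict_proba returns: [P(class 0), P(class 1)].
proba = np.column_stack([1 - p_pos, p_pos])
print(manual, log_loss(y_true, proba))

# For a binary model, thresholding P(class 1) at 0.5 yields the class predictions.
print((p_pos >= 0.5).astype(int))
```

Lower values are better: confident correct probabilities push the loss toward 0, while confident wrong ones are penalized heavily.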

### SVM


Train SVM model

In [36]:
from sklearn.svm import SVC

SVM = SVC()
SVM.fit(x_train, y_train)

Make predictions using the SVM model

This step involves using the trained Support Vector Machine (SVM) model
to make predictions on the test data (x_test). We'll generate class predictions.

Note: Unlike Logistic Regression, SVM doesn't provide probability estimates
by default, so we're only generating class predictions here.

In [37]:
predictions = SVM.predict(x_test)
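
Although a default SVC does not return probabilities, it does expose signed scores through decision_function. The sketch below, on hypothetical two-cluster data, shows those scores and also counts the predicted labels, which is a quick way to spot a model that predicts only one class.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical two-cluster data, 50 points per class.
X_toy = np.vstack([rng.normal(-2, 1, size=(50, 2)),
                   rng.normal(2, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

toy_svm = SVC().fit(X_toy, y_toy)

scores = toy_svm.decision_function(X_toy)   # signed scores, not probabilities
labels = toy_svm.predict(X_toy)

print(scores[:3])
# For a binary problem, a positive score corresponds to the second class.
print(np.array_equal(labels, (scores > 0).astype(int)))
# How many samples were assigned to each class.
print(np.bincount(labels, minlength=2))
```

Running np.bincount on the predictions from the weather model (after casting them to integers) would show directly whether the SVM ever predicts rain.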

Evaluate the model

In [38]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions, average='binary')
SVM_F1_Score = f1_score(y_test, predictions, average='binary')

Compare Model Performances

We have calculated three key metrics for all our models (KNN, Decision Tree, Logistic Regression, and SVM):
1. Accuracy Score: Measures the overall correctness of predictions
2. Jaccard Index: Measures the similarity between predicted and actual labels
3. F1-Score: Harmonic mean of precision and recall, useful for imbalanced datasets

Additionally, we have calculated Log Loss for the Logistic Regression model.

In the next cell, we'll create a comparison table to evaluate and compare the performance of all these models.


In [39]:
Report = pd.DataFrame({
    'Model': ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],
    'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1-Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    'LogLoss': [np.nan, np.nan, LR_Log_Loss, np.nan]
})

Report

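|     | Model               | Accuracy | Jaccard Index | F1-Score | LogLoss  |
| --- | ------------------- | -------- | ------------- | -------- | -------- |
| 0   | KNN                 | 0.818321 | 0.687588      | 0.802375 | NaN      |
| 1   | Decision Tree       | 0.769466 | 0.638747      | 0.770754 | NaN      |
| 2   | Logistic Regression | 0.836641 | 0.509174      | 0.674772 | 0.380451 |
| 3   | SVM                 | 0.722137 | 0.0           | 0.0      | NaN      |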


# Conclusion

Based on the comparison table of model performances, we can draw the following conclusions:

1. Accuracy: The Logistic Regression model performed the best with an accuracy of 83.66%, followed closely by KNN at 81.83%. The SVM model had the lowest accuracy at 72.21%.

2. Jaccard Index: KNN showed the highest Jaccard Index of 0.6876, indicating the closest match between predicted and actual labels. The Decision Tree (0.6387) and Logistic Regression (0.5092) followed, while SVM scored 0.

3. F1-Score: KNN again performed the best with an F1-Score of 0.8024, followed by the Decision Tree (0.7708) and Logistic Regression (0.6748). SVM had an F1-Score of 0, which under average='binary' means it produced no true positives for the rain class, most likely because it predicted "no rain" for every test day.

4. LogLoss: This metric was only calculated for the Logistic Regression model, which had a log loss of 0.3805. Lower is better; for reference, always predicting a probability of 0.5 would give about 0.693.

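Overall, the KNN model has the highest reported Jaccard Index and F1-Score and Logistic Regression has the highest accuracy, while the SVM model underperformed in this specific case. The Decision Tree model performed moderately well across all metrics. These rankings should be read with care: the KNN and Decision Tree scores were computed with average='weighted' on the random_state=10 split, whereas the Logistic Regression and SVM scores use average='binary' on the random_state=1 split, so the Jaccard Index and F1-Score columns are not directly comparable between the two groups.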

For this rainfall prediction task, based on these results, we would recommend using either the KNN or Logistic Regression model. However, further tuning of hyperparameters and cross-validation could potentially improve the performance of all models, including SVM.
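
As a starting point for the tuning and cross-validation suggested above, here is a minimal sketch using an SVM inside a scaling pipeline. The data is synthetic (make_classification stands in for the weather features), and the grid over C is a small illustrative choice rather than a recommended search space.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical imbalanced data standing in for the weather features.
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   weights=[0.7, 0.3], random_state=0)

# Scaling inside the pipeline keeps each fold's held-out part out of the scaler fit.
pipe = make_pipeline(StandardScaler(), SVC())

# 5-fold cross-validated F1 (positive class) for the default settings.
baseline = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="f1")

# Small illustrative grid over the SVC regularization strength.
search = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5, scoring="f1")
search.fit(X_toy, y_toy)

print(baseline.mean())
print(search.best_params_, search.best_score_)
```

Evaluating every model with the same folds and the same scoring argument would also make the resulting numbers directly comparable across models.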
