
# Tomorrow Rain Predictions


In this notebook, we train several classification algorithms on the training data and evaluate them on the test data using standard evaluation metrics. The goal is to predict next-day rain, using `RainTomorrow` as the target variable.

We will use the following algorithms:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
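
As a rough illustration of what the error-based metrics in this list measure, the sketch below computes MAE, MSE, R2, and binary log loss by hand. The labels and probabilities are made-up toy values, not taken from the weather dataset, and the variable names are illustrative only.

```python
import math

# Toy labels and predicted probabilities of the positive class (illustrative values only).
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.6, 0.3]
n = len(y_true)

# Mean absolute error and mean squared error between labels and predictions.
mae = sum(abs(t - p) for t, p in zip(y_true, y_prob)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / n

# R2 = 1 - (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_prob))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Binary log loss: average negative log-probability assigned to the true class.
log_loss_value = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, y_prob)
) / n

print(f"MAE={mae:.3f} MSE={mse:.3f} R2={r2:.3f} LogLoss={log_loss_value:.3f}")
```

Lower is better for MAE, MSE, and log loss, while higher is better for R2.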



# About The Dataset


The data originally comes from the Australian Government's Bureau of Meteorology, and the latest data is available at [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was obtained from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [3]:
filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"
df = pd.read_csv(filepath)

In [4]:
df.head()

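| Unnamed: 0 | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |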


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [5]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
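
To make the effect of `get_dummies` concrete, here is a small sketch on a made-up three-row frame. The frame `toy` and its values are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; only the column names mirror the weather data.
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
    "MinTemp": [19.5, 21.6, 20.2],
})

# Each listed categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])

print(list(encoded.columns))
```

Columns that are not listed in `columns` (here `MinTemp`) are passed through unchanged, and each encoded column is named `<original>_<category>`.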

Next, we convert the 'RainTomorrow' column from categorical values to binary values. We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [6]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
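
The `replace` call above acts on every column of the frame. As an alternative sketch, assuming only the target column needs converting, the mapping can be limited to that column with `Series.map`. The frame `toy` below is made up for illustration.

```python
import pandas as pd

# Hypothetical target column with the same Yes/No labels as the dataset.
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", "No", "Yes"]})

# Map only the target column to 0/1, leaving any other columns untouched.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})

print(toy["RainTomorrow"].tolist())
```

With `map`, any label missing from the dictionary becomes `NaN`, which can make unexpected values easier to spot.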

### Training Data and Test Data


Now, we set our features (the X values) and our target variable Y. Before doing so, we drop the 'Date' column and cast the remaining columns to float.


In [7]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [8]:
df_sydney_processed = df_sydney_processed.astype(float)

In [9]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [10]:
from sklearn.model_selection import train_test_split



In [11]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
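
The sketch below shows what `test_size` and `random_state` do, using a made-up array of ten samples rather than the weather data. The names `X`, `y`, `X_tr`, and `X_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each, plus alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)
print(np.array_equal(X_te, X_te2))
```

In this notebook the same 20% hold-out gives a test set of 655 rows, which is the total of the entries in each confusion matrix printed later.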

#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [13]:


# Create the Linear Regression model
LinearReg = LinearRegression()

# Train the model using the training data
LinearReg.fit(x_train, y_train)

# Display the coefficients and intercept
print("Coefficients:", LinearReg.coef_)
print("Intercept:", LinearReg.intercept_)


Coefficients: [-2.36934040e-02  1.29995560e-02  7.29625159e-04  6.48758392e-03
 -3.51687983e-02  4.23655223e-03  1.83079049e-03  7.89537750e-04
  9.55405652e-04  8.56305789e-03  7.71427992e-03 -9.25247903e-03
 -8.86171626e-03  1.00332952e-02  1.44743785e-02 -3.48417548e-03
 -5.05116631e+10 -5.05116631e+10 -1.09342204e+10 -1.09342204e+10
 -1.09342204e+10 -1.09342204e+10 -1.09342204e+10 -1.09342204e+10
 -1.09342204e+10 -1.09342204e+10 -1.09342204e+10 -1.09342204e+10
 -1.09342204e+10 -1.09342204e+10 -1.09342204e+10 -1.09342204e+10
 -1.09342204e+10 -1.09342204e+10 -2.79026559e+09 -2.79026559e+09
 -2.79026559e+09 -2.79026559e+09 -2.79026559e+09 -2.79026559e+09
 -2.79026559e+09 -2.79026559e+09 -2.79026559e+09 -2.79026559e+09
 -2.79026559e+09 -2.79026559e+09 -2.79026559e+09 -2.79026559e+09
 -2.79026559e+09 -2.79026559e+09 -5.02087788e+09 -5.02087788e+09
 -5.02087788e+09 -5.02087788e+09 -5.02087788e+09 -5.02087788e+09
 -5.02087788e+09 -5.02087788e+09 -5.02087788e+09 -5.02087788e+09
 -5.0208778
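
One likely explanation for the very large, repeated coefficients above is that the one-hot encoded groups are perfectly collinear with the intercept: within each encoded group the indicator columns always sum to 1. This is an interpretation, not something the notebook states. The sketch below illustrates the effect on a made-up design matrix.

```python
import numpy as np

# Toy design matrix: an intercept column plus BOTH indicator columns
# of a two-level category (for example RainToday_No and RainToday_Yes).
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
], dtype=float)

# The two indicators always sum to the intercept column, so the rank
# is lower than the number of columns.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Dropping one indicator per category restores full column rank.
X_reduced = X[:, :2]
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

If this is the cause, passing `drop_first=True` to `pd.get_dummies` is one common way to avoid it. The predictions are usually still usable even when the individual coefficients are not interpretable.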

#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [45]:
# Predict using the testing data
predictions = LinearReg.predict(x_test)




#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [16]:

# Calculate evaluation metrics
jaccard = jaccard_score(y_test, predictions.round())
f1 = f1_score(y_test, predictions.round())
logloss = log_loss(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions.round())
accuracy = accuracy_score(y_test, predictions.round())

# Display the results
print("Jaccard Score:", jaccard)
print("F1 Score:", f1)
print("Log Loss:", logloss)
print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)


Jaccard Score: 0.5136363636363637
F1 Score: 0.6786786786786786
Log Loss: 0.5088747601196979
Confusion Matrix:
 [[435  36]
 [ 71 113]]
Accuracy: 0.8366412213740458
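
The scores above can be cross-checked against the confusion matrix. The sketch below recomputes accuracy, the Jaccard index, and the F1 score from the four printed counts, assuming scikit-learn's layout of actual classes in rows and predicted classes in columns.

```python
# Counts copied from the confusion matrix printed above
# (rows: actual class, columns: predicted class).
tn, fp = 435, 36
fn, tp = 71, 113

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Jaccard index for the positive class: TP / (TP + FP + FN).
jaccard = tp / (tp + fp + fn)

# F1 score: harmonic mean of precision and recall, written in terms of counts.
f1 = 2 * tp / (2 * tp + fp + fn)

print("Accuracy:", accuracy)
print("Jaccard:", jaccard)
print("F1:", f1)
```

These values agree with the printed output, which confirms that the Jaccard index and F1 score here describe the positive ("rain") class only.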


In [18]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate MAE
LinearRegression_MAE = mean_absolute_error(y_test, predictions)

# Calculate MSE
LinearRegression_MSE = mean_squared_error(y_test, predictions)

# Calculate R2
LinearRegression_R2 = r2_score(y_test, predictions)



#### Q5) Show the MAE, MSE, and R2 in a tabular format using a DataFrame for the linear model.


In [19]:


# Create a DataFrame for the linear model metrics
linear_metrics = pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'R2'],
    'Value': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
})

# Display the DataFrame
print(linear_metrics)


  Metric     Value
0    MAE  0.256309
1    MSE  0.115719
2     R2  0.427138


### KNN


#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [21]:

# Create the KNN model
KNN = KNeighborsClassifier(n_neighbors=4)

# Train the model using the training data
KNN.fit(x_train, y_train)


#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [44]:
predictions = KNN.predict(x_test)

#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [27]:
# Calculate evaluation metrics
jaccard = jaccard_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

# Display the results
print("Jaccard Score:", jaccard)
print("F1 Score:", f1)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)


Jaccard Score: 0.4251207729468599
F1 Score: 0.5966101694915255
Accuracy: 0.8183206106870229
Confusion Matrix:
 [[448  23]
 [ 96  88]]


### Decision Tree


#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [28]:

# Create the Decision Tree model
Tree = DecisionTreeClassifier()

# Train the model using the training data
Tree.fit(x_train, y_train)


#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [43]:
# Predict using the testing data
predictions = Tree.predict(x_test)




#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [30]:
# Calculate evaluation metrics
jaccard = jaccard_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

# Display the results
print("Jaccard Score:", jaccard)
print("F1 Score:", f1)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)


Jaccard Score: 0.4
F1 Score: 0.5714285714285714
Accuracy: 0.7572519083969466
Confusion Matrix:
 [[390  81]
 [ 78 106]]


### Logistic Regression


#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [32]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [33]:
# Create the Logistic Regression model
LR = LogisticRegression(solver='liblinear')

# Train the model using the training data
LR.fit(x_train, y_train)

#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.


In [42]:
# Predict using the testing data
predictions = LR.predict(x_test)

# Predict probabilities using the testing data
predict_proba = LR.predict_proba(x_test)
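
To show how `predict` and `predict_proba` relate, here is a minimal sketch on synthetic data. The data, the seed, and the names `model`, `proba`, and `labels` are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a simple linear decision rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(solver="liblinear").fit(X, y)

# predict_proba returns one column per class; predict returns hard labels.
proba = model.predict_proba(X)
labels = model.predict(X)

print(proba.shape)
# Each row of probabilities sums to 1.
print(np.allclose(proba.sum(axis=1), 1.0))
# For a binary model, the label is 1 when the class-1 probability exceeds 0.5.
print(np.array_equal(labels, (proba[:, 1] > 0.5).astype(int)))
```

This is why `log_loss` is given the probabilities while the other metrics are given the hard labels.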




#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [35]:


# Calculate evaluation metrics
jaccard = jaccard_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
logloss = log_loss(y_test, predict_proba)
conf_matrix = confusion_matrix(y_test, predictions)

# Display the results
print("Jaccard Score:", jaccard)
print("F1 Score:", f1)
print("Accuracy:", accuracy)
print("Log Loss:", logloss)
print("Confusion Matrix:\n", conf_matrix)


Jaccard Score: 0.5091743119266054
F1 Score: 0.6747720364741641
Accuracy: 0.8366412213740458
Log Loss: 0.3812590636097066
Confusion Matrix:
 [[437  36]
 [ 71 111]]


### SVM


#### Q16) Create and train an SVM model called SVM using the training data (`x_train`, `y_train`).


In [36]:


# Create the SVM model
SVM = svm.SVC()

# Train the model using the training data
SVM.fit(x_train, y_train)


#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [41]:
# Predict using the testing data
predictions = SVM.predict(x_test)




#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [38]:

# Calculate evaluation metrics
jaccard = jaccard_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

# Display the results
print("Jaccard Score:", jaccard)
print("F1 Score:", f1)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)


Jaccard Score: 0.0
F1 Score: 0.0
Accuracy: 0.7221374045801526
Confusion Matrix:
 [[473   0]
 [182   0]]
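
The confusion matrix above shows that this SVM predicts "no rain" for every test sample. A quick sketch, using only the printed counts, confirms that its accuracy equals the majority-class baseline:

```python
# Counts from the SVM confusion matrix above: every sample was predicted as class 0.
actual_no, actual_yes = 473, 182
total = actual_no + actual_yes

# Accuracy obtained by always predicting the more frequent class.
majority_baseline = max(actual_no, actual_yes) / total

print("Majority-class baseline accuracy:", majority_baseline)
```

Because no positive cases are predicted, the Jaccard index and F1 score are both 0 even though accuracy looks reasonable. Options that may be worth trying, offered here as suggestions rather than tested fixes, include scaling the features (for example with `StandardScaler`) or passing `class_weight='balanced'` to `svm.SVC`.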
