# Rain Prediction: Classification Approach

#### In this project, we predict whether it will rain tomorrow using Australian weather data.

##### About the dataset:

The data originate from the Australian Government's Bureau of Meteorology; the latest data are available at [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here adds columns such as 'RainToday' and the target 'RainTomorrow'; it was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).



This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | float  |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss

In [2]:
# Loading the dataset.
df = pd.read_csv(r"C:\Users\mayan\Downloads\Weather_Data.csv")
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


In [3]:
df.isnull().sum()

Date             0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

In [4]:
# Converting categorical columns to binary values.
df_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir','WindDir9am', 'WindDir3pm'])
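To make the one-hot encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the values are made up for illustration and are not taken from the weather data). Each listed column is replaced by one indicator column per category, named `<column>_<category>`.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({"RainToday": ["Yes", "No", "Yes"],
                    "WindGustDir": ["W", "E", "W"]})

# Every listed column becomes one indicator column per distinct category.
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Note that `RainToday_No` and `RainToday_Yes` carry the same information; passing `drop_first=True` to `pd.get_dummies` is one way to drop such redundant columns if they are not wanted.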

In [5]:
df_processed.head(5)

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,41,17,20,92,...,False,False,False,False,False,True,False,False,False,False
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,41,9,13,83,...,False,False,False,False,False,False,False,False,False,False
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,41,17,2,88,...,False,False,False,False,False,False,False,False,False,False
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,41,22,20,83,...,False,False,False,False,False,False,False,False,False,False
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,41,11,6,88,...,False,False,False,False,False,False,False,True,False,False


In [6]:
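# RainToday is already one-hot encoded, so this maps the remaining Yes/No column (RainTomorrow) to 1/0.
df_processed.replace(['No','Yes'], [0,1], inplace = True)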

In [7]:
df_processed.columns

Index(['Date', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm', 'RainTomorrow', 'RainToday_No', 'RainToday_Yes',
       'WindGustDir_E', 'WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N',
       'WindGustDir_NE', 'WindGustDir_NNE', 'WindGustDir_NNW',
       'WindGustDir_NW', 'WindGustDir_S', 'WindGustDir_SE', 'WindGustDir_SSE',
       'WindGustDir_SSW', 'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW',
       'WindGustDir_WSW', 'WindDir9am_E', 'WindDir9am_ENE', 'WindDir9am_ESE',
       'WindDir9am_N', 'WindDir9am_NE', 'WindDir9am_NNE', 'WindDir9am_NNW',
       'WindDir9am_NW', 'WindDir9am_S', 'WindDir9am_SE', 'WindDir9am_SSE',
       'WindDir9am_SSW', 'WindDir9am_SW', 'WindDir9am_W', 'WindDir9am_WNW',
       'WindDir9am_WSW', 'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE',
       'WindDir3

In [8]:
df_processed.drop('Date',axis=1,inplace=True)

In [9]:
df_processed = df_processed.astype(float)

In [10]:
df_processed.corrwith(df_processed['RainTomorrow'])

MinTemp           0.082804
MaxTemp          -0.152525
Rainfall          0.296120
Evaporation      -0.070145
Sunshine         -0.529112
                    ...   
WindDir3pm_SSW    0.171661
WindDir3pm_SW     0.046124
WindDir3pm_W     -0.046956
WindDir3pm_WNW   -0.095579
WindDir3pm_WSW   -0.037785
Length: 67, dtype: float64
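The full correlation output is long, so it can help to rank features by the absolute value of their correlation with the target. The sketch below does this on a small hypothetical frame (made-up values, with column names borrowed from the dataset only for readability).

```python
import pandas as pd

# Hypothetical values, chosen so that Sunshine is strongly tied to the target.
toy = pd.DataFrame({
    "Sunshine":     [0.0, 1.0, 9.0, 10.0, 0.5, 8.0],
    "Rainfall":     [5.0, 0.0, 0.0, 1.0, 0.0, 2.0],
    "RainTomorrow": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Correlate every column with the target, then drop the target's self-correlation.
corr = toy.corrwith(toy["RainTomorrow"]).drop("RainTomorrow")

# Rank by strength, ignoring the sign.
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

On the notebook's data the same idea would read `df_processed.corrwith(df_processed['RainTomorrow']).drop('RainTomorrow').abs().sort_values(ascending=False)`.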

In [11]:
# Creating features and label.
X = df_processed.drop(columns='RainTomorrow')
Y = df_processed['RainTomorrow']

In [12]:
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state = 10)
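If the two classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. One option, which the notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below shows it on hypothetical data with a 70/30 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 35 of class 0 and 15 of class 1.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 35 + [1] * 15)

# stratify=y_demo keeps the 70/30 class ratio in both the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)
print(len(y_te), int(y_te.sum()), int(y_tr.sum()))
```

Applied to the notebook's data this would be `train_test_split(X, Y, test_size=0.2, random_state=10, stratify=Y)`; the scores reported in the following cells come from the unstratified split and would change.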

### Decision Tree:

In [13]:
Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)

Tree.fit(x_train, y_train)

In [14]:
predictions = Tree.predict(x_test)

In [15]:
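Tree_Accuracy_Score = accuracy_score(y_test, predictions)
# pos_label = 0 computes the Jaccard index for the "no rain" class.
Tree_JaccardIndex = jaccard_score(y_test, predictions, pos_label = 0)
# average = 'weighted' averages the per-class F1 scores, weighted by class support.
Tree_F1_Score = f1_score(y_test, predictions, average = 'weighted')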

print("Accuracy Score:",Tree_Accuracy_Score)
print("Jaccard Index:",Tree_JaccardIndex)
print("F1 Score:",Tree_F1_Score)

Accuracy Score: 0.8183206106870229
Jaccard Index: 0.781651376146789
F1 Score: 0.8132626923421479
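To show what these two scores measure, here is a small worked example on hypothetical labels (five made-up samples). The Jaccard index for class 0 is the number of samples that are both truly 0 and predicted 0, divided by the number that are either; the weighted F1 is the per-class F1 averaged by class support.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

# Hypothetical labels, not taken from the weather data.
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0])

# Jaccard for class 0: |true 0 AND predicted 0| / |true 0 OR predicted 0|.
both = np.sum((y_true == 0) & (y_pred == 0))
either = np.sum((y_true == 0) | (y_pred == 0))
jaccard_manual = both / either

jaccard_sklearn = jaccard_score(y_true, y_pred, pos_label=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(jaccard_manual, jaccard_sklearn, f1_weighted)
```

Here class 0 has an F1 of 2/3 with support 3 and class 1 has an F1 of 1/2 with support 2, so the weighted F1 is (3 × 2/3 + 2 × 1/2) / 5 = 0.6.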


### Logistic Regression:

In [16]:
LR = LogisticRegression(C=0.01, solver='liblinear')

LR.fit(x_train, y_train)

In [17]:
predictions = LR.predict(x_test)

In [18]:
predict_proba = LR.predict_proba(x_test)

In [19]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions, pos_label = 0)
LR_F1_Score = f1_score(y_test, predictions, average = 'weighted')
LR_Log_Loss = log_loss(y_test, predict_proba)

print("Accuracy Score:",LR_Accuracy_Score)
print("Jaccard Index:",LR_JaccardIndex)
print("F1 Score:",LR_F1_Score)
print("Log Loss:", LR_Log_Loss)

Accuracy Score: 0.8427480916030534
Jaccard Index: 0.8103130755064457
F1 Score: 0.8361692056982617
Log Loss: 0.3587932863375965
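Log loss is the mean negative log of the probability the model assigned to the true class, so confident wrong predictions are penalised heavily. The sketch below checks this by hand against `log_loss` on three hypothetical samples (made-up probabilities).

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities for classes [0, 1].
y_true = np.array([0, 1, 1])
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

# Pick the probability assigned to each sample's true class.
p_true = proba[np.arange(len(y_true)), y_true]

# Mean negative log-likelihood of the true labels.
manual_loss = -np.mean(np.log(p_true))
sklearn_loss = log_loss(y_true, proba)
print(manual_loss, sklearn_loss)
```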


### SVM:

In [20]:
SVM = svm.SVC(kernel='rbf')

SVM.fit(x_train,y_train)

In [21]:
predictions = SVM.predict(x_test)

In [22]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions, pos_label = 0)
SVM_F1_Score = f1_score(y_test, predictions, average = 'weighted')

print("Accuracy Score:",SVM_Accuracy_Score)
print("Jaccard Index:",SVM_JaccardIndex)
print("F1 Score:",SVM_F1_Score)

Accuracy Score: 0.7190839694656489
Jaccard Index: 0.7190839694656489
F1 Score: 0.6015782408851165
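The SVM's accuracy is identical to its Jaccard index for class 0, which can only happen (short of a perfect score) when the model has no correct "rain" predictions on the test set. One possible cause, stated here as an assumption rather than a diagnosis, is that the RBF kernel is distance-based and the features are on very different scales (pressure around 1000, indicator columns at 0 or 1). A minimal sketch of scaling before the SVM, on synthetic stand-in data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: column 0 is on a pressure-like scale, column 1 is 0/1.
X_demo = np.column_stack([rng.normal(1015, 7, 200), rng.integers(0, 2, 200)])
y_demo = X_demo[:, 1].astype(int)  # the label depends only on the small-scale column

# Standardise each feature to zero mean and unit variance, then fit the RBF SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_demo, y_demo)
demo_predictions = scaled_svm.predict(X_demo)
print(np.unique(demo_predictions))
```

In the notebook, the equivalent would be `make_pipeline(StandardScaler(), svm.SVC(kernel='rbf'))` fitted on `x_train` and `y_train`; whether it improves the scores above would need to be checked by rerunning the evaluation cell.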


In [23]:
Report = pd.DataFrame([{"Model":"Decision Tree","Accuracy Score":Tree_Accuracy_Score,"Jaccard Index":Tree_JaccardIndex,"F1 Score":Tree_F1_Score},
                      {"Model":"Logistic Regression","Accuracy Score":LR_Accuracy_Score,"Jaccard Index":LR_JaccardIndex,"F1 Score":LR_F1_Score,"Log Loss":LR_Log_Loss},
                      {"Model":"SVM","Accuracy Score":SVM_Accuracy_Score,"Jaccard Index":SVM_JaccardIndex,"F1 Score":SVM_F1_Score}]).set_index('Model')
Report.index.name = None
print(Report)

                     Accuracy Score  Jaccard Index  F1 Score  Log Loss
Decision Tree              0.818321       0.781651  0.813263       NaN
Logistic Regression        0.842748       0.810313  0.836169  0.358793
SVM                        0.719084       0.719084  0.601578       NaN



##### On this test split, Logistic Regression scores highest on accuracy, Jaccard index, and F1 score, so it is the best of the three models in this scenario.