# Rainfall prediction using the Australian Government's Bureau of Meteorology dataset



## About the Dataset

The weather dataset comes from the Australian Government's Bureau of Meteorology. It was gathered and prepared by IBM's Machine Learning course on Coursera. The course describes the dataset as follows:

- The original source of the data is from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

- Extra columns such as 'RainToday' and 'RainTomorrow', which record Yes/No for rain on a particular day, were taken from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).


A detailed description of the dataset can be found at [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01). A screenshot from the website is shown below:

![](../img/Australia_weather_columns.png)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score, jaccard_score, f1_score, log_loss, confusion_matrix

## Importing the Dataset

In [5]:
url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
df = pd.read_csv(url)

## Exploratory Data Analysis

How many rows and columns are there?

In [7]:
df.shape

(3271, 22)

In [6]:
df.head()

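| | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |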


In [8]:
df.columns

Index(['Date', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

Are there any missing values?

In [10]:
df.isnull().sum()

Date             0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64
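The all-zero counts above mean this file needs no imputation. For a dataset where that does not hold, counts and percentages can be reported together. The following is a minimal sketch on a toy frame; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest count first."""
    counts = frame.isnull().sum()
    percent = 100 * counts / len(frame)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


# Toy data (hypothetical values) with one gap in each of two columns.
toy = pd.DataFrame(
    {
        "MinTemp": [19.5, np.nan, 21.6, 20.2],
        "Rainfall": [15.6, 6.0, 6.6, 18.8],
        "RainToday": ["Yes", "Yes", None, "Yes"],
    }
)
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would be expected to report zero for every column, matching the output of `df.isnull().sum()`.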

In [18]:
df.dtypes

Date              object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed      int64
WindDir9am        object
WindDir3pm        object
WindSpeed9am       int64
WindSpeed3pm       int64
Humidity9am        int64
Humidity3pm        int64
Pressure9am      float64
Pressure3pm      float64
Cloud9am           int64
Cloud3pm           int64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
dtype: object
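`Date` is stored as `object` (plain strings) even though it holds calendar values. If a later step needs real dates, the column can be parsed. The sketch below assumes a month/day/year layout, which the consecutive rows `2/1/2008`, `2/2/2008`, `2/3/2008` in the `df.head()` output suggest but do not prove; the toy frame is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical) reusing the date strings seen in df.head().
toy = pd.DataFrame({"Date": ["2/1/2008", "2/2/2008", "2/3/2008"]})

# Assumption: month/day/year. An explicit format avoids silent guessing.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

print(toy.dtypes)
print(toy["Date"].dt.day.tolist())
```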

In [14]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MinTemp | 3271.0 | 14.877102 | 4.55471 | 4.3 | 11.0 | 14.9 | 18.8 | 27.6 |
| MaxTemp | 3271.0 | 23.005564 | 4.483752 | 11.7 | 19.6 | 22.8 | 26.0 | 45.8 |
| Rainfall | 3271.0 | 3.342158 | 9.917746 | 0.0 | 0.0 | 0.0 | 1.4 | 119.4 |
| Evaporation | 3271.0 | 5.175787 | 2.757684 | 0.0 | 3.2 | 4.8 | 7.0 | 18.4 |
| Sunshine | 3271.0 | 7.16897 | 3.815966 | 0.0 | 4.25 | 8.3 | 10.2 | 13.6 |
| WindGustSpeed | 3271.0 | 41.476307 | 10.806951 | 17.0 | 35.0 | 41.0 | 44.0 | 96.0 |
| WindSpeed9am | 3271.0 | 15.077041 | 7.043825 | 0.0 | 11.0 | 15.0 | 20.0 | 54.0 |
| WindSpeed3pm | 3271.0 | 19.294405 | 7.453331 | 0.0 | 15.0 | 19.0 | 24.0 | 57.0 |
| Humidity9am | 3271.0 | 68.243962 | 15.086127 | 19.0 | 58.0 | 69.0 | 80.0 | 100.0 |
| Humidity3pm | 3271.0 | 54.698563 | 16.279241 | 10.0 | 44.0 | 56.0 | 64.0 | 99.0 |


In [15]:
df.describe(include="all").T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | 3271.0 | 3271.0 | 2/1/2008 | 1.0 | | | | | | | |
| MinTemp | 3271.0 | | | | 14.877102 | 4.55471 | 4.3 | 11.0 | 14.9 | 18.8 | 27.6 |
| MaxTemp | 3271.0 | | | | 23.005564 | 4.483752 | 11.7 | 19.6 | 22.8 | 26.0 | 45.8 |
| Rainfall | 3271.0 | | | | 3.342158 | 9.917746 | 0.0 | 0.0 | 0.0 | 1.4 | 119.4 |
| Evaporation | 3271.0 | | | | 5.175787 | 2.757684 | 0.0 | 3.2 | 4.8 | 7.0 | 18.4 |
| Sunshine | 3271.0 | | | | 7.16897 | 3.815966 | 0.0 | 4.25 | 8.3 | 10.2 | 13.6 |
| WindGustDir | 3271.0 | 16.0 | W | 1425.0 | | | | | | | |
| WindGustSpeed | 3271.0 | | | | 41.476307 | 10.806951 | 17.0 | 35.0 | 41.0 | 44.0 | 96.0 |
| WindDir9am | 3271.0 | 16.0 | W | 1260.0 | | | | | | | |
| WindDir3pm | 3271.0 | 16.0 | E | 624.0 | | | | | | | |


## Data Preprocessing

One-hot encoding to binarize the categorical columns

In [37]:
categorical_columns = ['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm']
df_cat_encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)
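To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The values are hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and two categorical columns.
toy = pd.DataFrame(
    {
        "MinTemp": [19.5, 21.6, 20.2],
        "RainToday": ["Yes", "No", "Yes"],
        "WindGustDir": ["W", "E", "W"],
    }
)

# Each listed categorical column is replaced by one 0/1 column per category,
# named "<column>_<category>"; the other columns pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"], dtype=int)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

On the real frame, each of the three wind-direction columns has 16 distinct values (per the `describe(include="all")` output above), so they alone contribute 48 indicator columns, plus one per distinct `RainToday` value.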

Convert the target RainTomorrow from Yes/No to binary 0/1

In [47]:
# Target: 0 for "No", 1 for anything else (here, "Yes").
y = df_cat_encoded.RainTomorrow
y = y.apply(lambda x: 0 if str(x)=="No" else 1)
# Features: every column except the target.
features = df_cat_encoded.drop(columns=["RainTomorrow"])
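The `else 1` fallback in the lambda maps every label other than "No" to 1, including typos or unexpected values. A stricter alternative is an explicit dictionary mapping. This is a sketch on a toy series with hypothetical values, not the notebook's own method.

```python
import pandas as pd

# Toy target column (hypothetical values).
rain_tomorrow = pd.Series(["Yes", "No", "No", "Yes"], name="RainTomorrow")

# Explicit mapping: labels missing from the dictionary become NaN
# instead of silently turning into 1.
y_toy = rain_tomorrow.map({"No": 0, "Yes": 1})

# Fail early if an unexpected label slipped through.
if y_toy.isnull().any():
    raise ValueError("unexpected label in RainTomorrow")

y_toy = y_toy.astype(int)
print(y_toy.tolist())
```

Since the missing-value check earlier found no gaps, both approaches should give the same result on this dataset as long as `RainTomorrow` contains only "Yes" and "No".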