# Australian Weather Prediction

### Objective
- This project predicts next-day rain by training a classification model (logistic regression) on the target variable RainTomorrow, which answers the question: did it rain the next day, Yes or No?


### The dataset

- The dataset contains about 10 years of daily weather observations from many locations across Australia.


### Features description
- RainTomorrow is the target variable to predict. It means -- did it rain the next day, Yes or No? This column is Yes if the rain for that day was 1mm or more.

- Date: the date of the observation

- Location: The common name of the location of the weather station.
 
- MinTemp: The minimum temperature in degrees Celsius

- MaxTemp: The maximum temperature in degrees Celsius

- Rainfall: The amount of rainfall recorded for the day in mm

- Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am
 
- Sunshine: The number of hours of bright sunshine in the day
 
- WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight
 
- WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight
 
- WindDir9am: Direction of the wind at 9am
 
- WindDir3pm: Direction of the wind at 3pm
 
- WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9am
 
- WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3pm
 
- Humidity9am: Humidity (percent) at 9am
 
- Humidity3pm: Humidity (percent) at 3pm
 
- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am

- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm

- Cloud9am: Fraction of the sky obscured by cloud at 9am. This is measured in “oktas”, which are units of eighths.

- Cloud3pm: Fraction of the sky obscured by cloud at 3pm. This is measured in “oktas”, which are units of eighths.
 
- Temp9am: Temperature (degrees C) at 9am
 
- Temp3pm: Temperature (degrees C) at 3pm
 
- RainToday: Yes if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise No
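
The rain flags can be derived from the rainfall amount. The following is a minimal sketch, assuming a single station with consecutive days and a made-up `Rainfall` series (the values are illustrative, not taken from the dataset). The two descriptions above differ on whether exactly 1mm counts as rain; the sketch assumes "1mm or more".

```python
import pandas as pd

# Illustrative rainfall amounts in mm (made-up values, not from the dataset)
toy = pd.DataFrame({"Rainfall": [0.0, 0.6, 1.0, 5.2]})

# Flag a day as rainy when at least 1 mm was recorded (assumed threshold)
toy["RainToday"] = (toy["Rainfall"] >= 1.0).map({True: "Yes", False: "No"})

# RainTomorrow is the next day's RainToday; the last day has no next day
toy["RainTomorrow"] = toy["RainToday"].shift(-1)

print(toy)
```

The last row has no RainTomorrow value, which is one way missing targets can arise in data of this kind.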

### Source & Acknowledgements

Observations were drawn from numerous weather stations. 
- The daily observations are available from http://www.bom.gov.au/climate/data. 
- An example of latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml
- Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
- Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data.

###### Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

### Importing libraries 

In [1]:
# packages/Libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#For splitting the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
import math 

#Performance metrics
from sklearn.metrics import roc_auc_score,roc_curve,auc,accuracy_score, classification_report

#For encoding the features
from sklearn.preprocessing import LabelEncoder,LabelBinarizer

#For ignoring warnings
import warnings
warnings.filterwarnings("ignore") 

### Loading the raw data 

In [2]:
# Loading the data 
data = pd.read_csv('weatherAUS.csv')

### Data Exploration   

In [3]:
# printing the shape 
print(data.shape)

(145460, 23)


##### There are 145,460 observations and 23 columns 

In [4]:
# running the first 5 rows
data.head()

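| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |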


In [5]:
# checking descriptive statistics 
data.describe()

| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 143975.0 | 144199.0 | 142199.0 | 82670.0 | 75625.0 | 135197.0 | 143693.0 | 142398.0 | 142806.0 | 140953.0 | 130395.0 | 130432.0 | 89572.0 | 86102.0 | 143693.0 | 141851.0 |
| mean | 12.194034 | 23.221348 | 2.360918 | 5.468232 | 7.611178 | 40.03523 | 14.043426 | 18.662657 | 68.880831 | 51.539116 | 1017.64994 | 1015.255889 | 4.447461 | 4.50993 | 16.990631 | 21.68339 |
| std | 6.398495 | 7.119049 | 8.47806 | 4.193704 | 3.785483 | 13.607062 | 8.915375 | 8.8098 | 19.029164 | 20.795902 | 7.10653 | 7.037414 | 2.887159 | 2.720357 | 6.488753 | 6.93665 |
| min | -8.5 | -4.8 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 980.5 | 977.1 | 0.0 | 0.0 | -7.2 | -5.4 |
| 25% | 7.6 | 17.9 | 0.0 | 2.6 | 4.8 | 31.0 | 7.0 | 13.0 | 57.0 | 37.0 | 1012.9 | 1010.4 | 1.0 | 2.0 | 12.3 | 16.6 |
| 50% | 12.0 | 22.6 | 0.0 | 4.8 | 8.4 | 39.0 | 13.0 | 19.0 | 70.0 | 52.0 | 1017.6 | 1015.2 | 5.0 | 5.0 | 16.7 | 21.1 |
| 75% | 16.9 | 28.2 | 0.8 | 7.4 | 10.6 | 48.0 | 19.0 | 24.0 | 83.0 | 66.0 | 1022.4 | 1020.0 | 7.0 | 7.0 | 21.6 | 26.4 |
| max | 33.9 | 48.1 | 371.0 | 145.0 | 14.5 | 135.0 | 130.0 | 87.0 | 100.0 | 100.0 | 1041.0 | 1039.6 | 9.0 | 9.0 | 40.2 | 46.7 |


In [6]:
# Checking data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

##### There are both numeric and string variables

### Data Cleaning 


In [7]:
# Checking for NULL values 
data.isnull().sum()

Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

##### From the results, 'Evaporation', 'Sunshine', 'Cloud9am' and 'Cloud3pm' are each missing more than 55,000 of the 145,460 values (roughly 38% to 48%). Imputing that many values would be unreliable, so these columns will be dropped.
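
The drop decision above can also be expressed as a rule on the share of missing values per column. This is a minimal sketch on a made-up frame; the column names `a`, `b`, `c` and the 0.35 cut-off are assumptions for illustration, not values used by this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names are hypothetical
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],   # 50% missing
    "b": [1.0, 2.0, np.nan, 4.0],      # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],         # complete
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds an assumed cut-off
threshold = 0.35
to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(to_drop)
```

Applied to the weather data, `data.isnull().mean()` would give the same shares directly instead of raw counts.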

In [8]:
# dropping columns 
data = data.drop(['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm'], axis = 1)

In [9]:
# checking columns after operation 
data.columns

Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'WindGustDir',
       'WindGustSpeed', 'WindDir9am', 'WindDir3pm', 'WindSpeed9am',
       'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am',
       'Pressure3pm', 'Temp9am', 'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

In [10]:
# review the datatypes 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   WindGustDir    135134 non-null  object 
 6   WindGustSpeed  135197 non-null  float64
 7   WindDir9am     134894 non-null  object 
 8   WindDir3pm     141232 non-null  object 
 9   WindSpeed9am   143693 non-null  float64
 10  WindSpeed3pm   142398 non-null  float64
 11  Humidity9am    142806 non-null  float64
 12  Humidity3pm    140953 non-null  float64
 13  Pressure9am    130395 non-null  float64
 14  Pressure3pm    130432 non-null  float64
 15  Temp9am        143693 non-null  float64
 16  Temp3pm        141851 non-null  float64
 17  RainToday      142199 non-nul

In [11]:
# probing the string variables 
data[['Date', 'Location','WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow']].head(10)

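| | Date | Location | WindGustDir | WindDir9am | WindDir3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | W | W | WNW | No | No |
| 1 | 2008-12-02 | Albury | WNW | NNW | WSW | No | No |
| 2 | 2008-12-03 | Albury | WSW | W | WSW | No | No |
| 3 | 2008-12-04 | Albury | NE | SE | E | No | No |
| 4 | 2008-12-05 | Albury | W | ENE | NW | No | No |
| 5 | 2008-12-06 | Albury | WNW | W | W | No | No |
| 6 | 2008-12-07 | Albury | W | SW | W | No | No |
| 7 | 2008-12-08 | Albury | W | SSE | W | No | No |
| 8 | 2008-12-09 | Albury | NNW | SE | NW | No | Yes |
| 9 | 2008-12-10 | Albury | W | S | SSE | Yes | No |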


##### The output shows several categorical variables. This project focuses on binary classification, so the features with many categories (Date, Location, WindGustDir, WindDir9am and WindDir3pm) are treated as not relevant and are dropped.
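
One way to see which string columns have more than two categories is to count their distinct values. A minimal sketch, using four rows copied from the preview above (rows 0, 3, 8 and 9); on the full data the counts would be larger.

```python
import pandas as pd

# Rows 0, 3, 8 and 9 of the preview above
toy = pd.DataFrame({
    "WindGustDir": ["W", "NE", "NNW", "W"],
    "RainToday": ["No", "No", "No", "Yes"],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# Number of distinct values in each column
counts = toy.nunique()

# Columns with more than two categories are not binary
multi_category = counts[counts > 2].index.tolist()

print(counts)
print(multi_category)
```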

In [12]:
# Removing unwanted features 
data.drop(['Date', 'Location','WindGustDir','WindDir9am', 'WindDir3pm'],axis=1,inplace=True)

In [13]:
# reviewing columns after operation 
data.columns

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am',
       'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am',
       'Pressure3pm', 'Temp9am', 'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

##### Two categorical variables remain, 'RainToday' and 'RainTomorrow'; 'RainTomorrow' is the target variable.

### Dealing with NULL values 

In [14]:
# Checking for NULL values 
data.isnull().sum()

MinTemp           1485
MaxTemp           1261
Rainfall          3261
WindGustSpeed    10263
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

In [15]:
# the data has both numeric and categorical variables, so build a boolean mask for each type
list_num = data.dtypes != 'object'
list_obj = data.dtypes == 'object'
list_num

MinTemp           True
MaxTemp           True
Rainfall          True
WindGustSpeed     True
WindSpeed9am      True
WindSpeed3pm      True
Humidity9am       True
Humidity3pm       True
Pressure9am       True
Pressure3pm       True
Temp9am           True
Temp3pm           True
RainToday        False
RainTomorrow     False
dtype: bool

##### Imputers are used to replace NULL values. NULL values in numeric columns are replaced with the column mean, and NULL values in categorical columns are replaced with the most frequently occurring value.
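
To make the two strategies concrete, here is a minimal sketch on a made-up frame. The column names `Temp` and `Rain` and their values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column (made-up values)
toy = pd.DataFrame({
    "Temp": [10.0, np.nan, 20.0, 30.0],
    "Rain": pd.Series(["No", "Yes", "No", np.nan], dtype=object),
})

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2D array, so flatten it before assigning to a column
toy["Temp"] = num_imputer.fit_transform(toy[["Temp"]]).ravel()
toy["Rain"] = cat_imputer.fit_transform(toy[["Rain"]]).ravel()

print(toy)
```

The missing `Temp` becomes 20.0, the mean of 10, 20 and 30, and the missing `Rain` becomes "No", the most frequent label.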

In [16]:
# boolean masks for the numeric and categorical columns
list_num = data.dtypes != 'object'
list_obj = data.dtypes == 'object'
# indexing the column index with the masks to get the column names
num_col = data.columns[list_num].tolist()
cat_col = data.columns[list_obj].tolist()

# SimpleImputer is used to fill the NULLs
imputer_num_mean = SimpleImputer(strategy = 'mean')
imputer_category = SimpleImputer(strategy = 'most_frequent')

In [17]:
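# For loop to apply imputation column by column
# fit_transform returns a 2D array, so ravel() flattens it to one value per row
for col in data.columns.tolist():
    if col in num_col:
        data[col] = imputer_num_mean.fit_transform(data[[col]]).ravel()
    else:
        data[col] = imputer_category.fit_transform(data[[col]]).ravel()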

In [18]:
#Check null values after operation 
data.isnull().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

##### NULL values have been successfully replaced

### Feature Engineering 

##### 'RainToday' and 'RainTomorrow' are categorical variables and hence have to be converted to numerical ones 

In [19]:
# Converting 'RainTomorrow' to numeric (map avoids chained assignment)
data['RainTomorrow'] = data['RainTomorrow'].map({'Yes': 1, 'No': 0})

In [20]:
# Converting 'RainToday' to numeric (map avoids chained assignment)
data['RainToday'] = data['RainToday'].map({'Yes': 1, 'No': 0})

In [21]:
# displaying the first 10 rows of the two features
data[['RainToday','RainTomorrow']].head(10)

| | RainToday | RainTomorrow |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 1 |
| 9 | 1 | 0 |


##### All categorical variables were successfully converted to numeric


### Model Training 

In [22]:
# splitting the data into features (X) and target (Y); the dependent variable is RainTomorrow
Y= data['RainTomorrow']
X = data.drop(['RainTomorrow'],axis=1)
Y=Y.astype(int)


In [23]:
# printing the shape 
print(X.shape)
print(Y.shape)

(145460, 13)
(145460,)


In [24]:
# training data is set at 80% and test data at 20%
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size = 0.8, test_size=0.2, random_state=15)
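
As a quick sanity check on the split sizes, the row counts can be worked out by hand. A small sketch using the 145,460 rows reported by `data.shape`; the variable names are illustrative.

```python
# Rows reported by data.shape earlier in the notebook
n_rows = 145460

# 80% of the rows go to training, the remainder to testing
n_train = n_rows * 8 // 10
n_test = n_rows - n_train

print(n_train, n_test)
```

The training count of 116,368 matches the total support shown in the classification report in the evaluation section.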

In [25]:
# fitting model
log_reg = LogisticRegression(random_state=10, solver = 'lbfgs')

log_reg.fit(X_train, y_train)

LogisticRegression(random_state=10)

### Model Evaluation 

In [26]:
# predict - Predict class labels for samples in X (here, the training data)
y_pred = log_reg.predict(X_train)

# predict_proba - Probability estimates
pred_proba = log_reg.predict_proba(X_train)
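
The relationship between `predict` and `predict_proba` for a binary logistic regression can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not on the weather data; the names `X_toy`, `y_toy` and `model` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the weather dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_toy, y_toy)

labels = model.predict(X_toy)        # hard class labels, 0 or 1
proba = model.predict_proba(X_toy)   # one column of probabilities per class

# Column 1 holds the probability of class 1; a 0.5 cut-off recovers predict()
labels_from_proba = (proba[:, 1] > 0.5).astype(int)

print(proba.shape)
print(np.array_equal(labels, labels_from_proba))
```

Keeping the probabilities makes it possible to choose a cut-off other than 0.5, for example to catch more rainy days at the cost of more false alarms.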

In [27]:
# Accuracy on Train
print("The Training Accuracy is: ", log_reg.score(X_train, y_train))

# Accuracy on Test
print("The Testing Accuracy is: ", log_reg.score(X_test, y_test))


# Classification Report (computed on the training data)
print(classification_report(y_train, y_pred))

The Training Accuracy is:  0.8383662175168431
The Testing Accuracy is:  0.8349374398460058
              precision    recall  f1-score   support

           0       0.86      0.95      0.90     90827
           1       0.72      0.43      0.54     25541

    accuracy                           0.84    116368
   macro avg       0.79      0.69      0.72    116368
weighted avg       0.83      0.84      0.82    116368



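### The model is about 84% accurate in predicting next-day rain on the training data and about 83% accurate on the test data. Recall for rainy days is only 0.43 on the training data, so more than half of the rainy days are missed.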
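
To put the accuracy figures in context, it helps to compare them with a baseline that always predicts "No". The sketch below uses only the support counts and the rounded precision and recall printed in the classification report above; the variable names are illustrative.

```python
# Support counts copied from the classification report (training data)
n_no, n_yes = 90827, 25541
n_total = n_no + n_yes

# Accuracy of always predicting "No", the majority class
baseline_accuracy = n_no / n_total

# F1 for the "Yes" class from the rounded precision and recall in the report
precision, recall = 0.72, 0.43
f1_yes = 2 * precision * recall / (precision + recall)

print(round(baseline_accuracy, 3), round(f1_yes, 2))
```

The baseline reaches about 78% accuracy on the training data, so the model's gain over it is around six percentage points. The F1 value of 0.54 agrees with the report.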
