# Predicting Rain in AU
CS 6140 Midterm - Problem 2
Author: Sid Nagaich

Problem 2: Using the Rain in Australia data set, build a system that predicts whether it is going to rain tomorrow.

# Loading the Data
Here we import some libraries, suppress warnings, and view the provided data.

In [1]:
# import libraries
import numpy as np
import pandas as pd

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# read data
data = pd.read_csv("weatherAUS.csv")

# show data
data

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,0.0,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,25.0,1010.6,1007.8,,,17.2,24.3,No,0.0,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,0.0,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,16.0,1017.6,1012.8,,,18.1,26.5,No,1.0,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,0.2,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145455,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,SE,...,24.0,1024.6,1020.3,,,10.1,22.4,No,0.0,No
145456,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,...,21.0,1023.5,1019.1,,,10.9,24.5,No,0.0,No
145457,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,SE,...,24.0,1021.0,1016.8,,,12.5,26.1,No,0.0,No
145458,2017-06-24,Uluru,7.8,27.0,0.0,,,SE,28.0,SSE,...,24.0,1019.4,1016.5,3.0,2.0,15.1,26.0,No,0.0,No


# Cleaning the Data
We drop the RISK_MM column, because our model should not use the amount of rain on the next day to predict whether it will rain on the next day. We then determine which columns hold numerical data and which hold categorical data, and encode the categorical columns as integer codes. We drop all rows in which RainToday or RainTomorrow is missing, as these are related to our labels. We also do not use Evaporation, Sunshine, Cloud9am, or Cloud3pm as features, as these fields contain significant amounts of missing data. We take the encoded Date value modulo 365 to roughly capture the time of year. Finally, we replace missing values with the mean of each column. This could be improved further by subgrouping the data and imputing each missing value with the mean of its subgroup.

In [2]:
# CLEANING DATA

# drop amount of rain next day
data = data.drop(columns='RISK_MM')

# grab headers
headers = [h for h in data.columns]

# these columns contain numeric data
n_cols = data[headers].select_dtypes(include=np.number).columns.tolist()
print("\nnumeric columns: ", n_cols)

# these columns contain categorical data
c_cols = data[headers].select_dtypes('object').columns.tolist()
print("\ncategorical columns: ", c_cols)

# drop rows where RainToday or RainTomorrow is missing
data.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)

# drop columns with large amounts of missing data
data = data.drop(columns=['Evaporation','Sunshine','Cloud9am','Cloud3pm'])

# encode categorical data using pandas 
data[c_cols] = data[c_cols].astype('category')
data[c_cols] = data[c_cols].apply(lambda x: x.cat.codes)

# convert date to modulo integer (slightly imperfect due to leap years)
data['Date'] = data['Date'] % 365

# replace missing data with the average of the column
data = data.fillna(data.mean())

# show cleaned data
data



numeric columns:  ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']

categorical columns:  ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']


Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,31,2,13.4,22.9,0.6,13,44.0,13,14,20.0,24.0,71.0,22.0,1007.7,1007.1,16.9,21.8,0,0
1,32,2,7.4,25.1,0.0,14,44.0,6,15,4.0,22.0,44.0,25.0,1010.6,1007.8,17.2,24.3,0,0
2,33,2,12.9,25.7,0.0,15,46.0,13,15,19.0,26.0,38.0,30.0,1007.6,1008.7,21.0,23.2,0,0
3,34,2,9.2,28.0,0.0,4,24.0,9,0,11.0,9.0,45.0,16.0,1017.6,1012.8,18.1,26.5,0,0
4,35,2,17.5,32.3,1.0,13,41.0,1,7,7.0,20.0,82.0,33.0,1010.8,1006.0,17.8,29.7,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,145,41,3.5,21.8,0.0,0,31.0,2,0,15.0,13.0,59.0,27.0,1024.7,1021.2,9.4,20.9,0,0
145455,146,41,2.8,23.4,0.0,0,31.0,9,1,13.0,11.0,51.0,24.0,1024.6,1020.3,10.1,22.4,0,0
145456,147,41,3.6,25.3,0.0,6,22.0,9,3,13.0,9.0,56.0,21.0,1023.5,1019.1,10.9,24.5,0,0
145457,148,41,5.4,26.9,0.0,3,37.0,9,14,9.0,9.0,53.0,24.0,1021.0,1016.8,12.5,26.1,0,0
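
The subgroup-mean idea mentioned in the cleaning notes could look roughly like the sketch below. It is not part of the submitted solution: the `toy` frame and its values are made up for illustration, and grouping by `Location` and calendar month is only one assumed choice of subgroup.

```python
import numpy as np
import pandas as pd

# toy frame standing in for weatherAUS.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Date": ["2010-01-05", "2010-01-20", "2010-07-03", "2010-07-18",
             "2010-01-07", "2010-01-22"],
    "Location": ["Albury", "Albury", "Albury", "Albury", "Uluru", "Uluru"],
    "MaxTemp": [30.0, np.nan, 12.0, np.nan, 40.0, np.nan],
})

# derive the calendar month so that seasons form natural subgroups
toy["Month"] = pd.to_datetime(toy["Date"]).dt.month

# mean of each (Location, Month) subgroup, aligned to the original rows
group_mean = toy.groupby(["Location", "Month"])["MaxTemp"].transform("mean")

# fill from the subgroup mean first, then fall back to the overall column mean
overall_mean = toy["MaxTemp"].mean()
toy["MaxTemp"] = toy["MaxTemp"].fillna(group_mean).fillna(overall_mean)

print(toy)
```

Each missing `MaxTemp` here is filled from the same location and month, so a summer gap in Albury is not filled with a value pulled toward winter readings.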


# Partitioning our Data
Since we have a lot of data, we are going to use an 80% / 20% split between training and testing data. Future considerations would be to use a further split for validation data.

In [3]:
# RainTomorrow is what we want to predict
y = data.loc[:,'RainTomorrow']

# remove RainTomorrow from the data we will use to build our model
del data["RainTomorrow"]

# update headers
headers = [h for h in data.columns]

# x is data with which we will predict y
x = data.loc[:, headers]

In [4]:
from sklearn.model_selection import train_test_split

# shuffle data before splitting 80/20
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.20, random_state=13)
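
The further validation split mentioned above could be sketched as two successive calls to `train_test_split`. This is an illustration on synthetic data, not the split used for the results in this notebook; the names `x_demo` and `y_demo`, the 60/20/20 proportions, and the use of `stratify` are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data with roughly 22% positive labels
rng = np.random.default_rng(13)
x_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.22).astype(int)

# first hold out 20% of all rows as the test set
x_rest, x_test, y_rest, y_test = train_test_split(
    x_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=13)

# then take 25% of the remaining 80% (20% overall) as the validation set
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=13)

print(len(x_train), len(x_val), len(x_test))  # 600 200 200
```

Passing `stratify` keeps the rain / no-rain ratio similar in every partition, which may matter here because the classes are imbalanced.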

# First Model - Logistic Regression
We use logistic regression as a binary classifier to predict whether or not it will rain the next day. We standardize the features with StandardScaler and rely on L2 regularization, which shrinks the coefficients toward zero without setting them exactly to zero and helps keep the model from overfitting. Overfitting is not a large concern for this type of model.

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# use L2 regularization and Logistic Regression to fit a classifier model to our training data
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(train_x, train_y)

# score the model on our test data
clf.score(test_x, test_y)

# use model to predict
clf_predictions = clf.predict(test_x)

# show P, R, F1 for each class (0 = no rain, 1 = rain)
print(classification_report(test_y, clf_predictions))
print("model score: " + str(clf.score(test_x, test_y)))

              precision    recall  f1-score   support

           0       0.86      0.95      0.90     21821
           1       0.72      0.47      0.57      6337

    accuracy                           0.84     28158
   macro avg       0.79      0.71      0.73     28158
weighted avg       0.83      0.84      0.83     28158

model score: 0.8391576106257547
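
To make the effect of the L2 penalty concrete, the sketch below fits the same pipeline on synthetic data with three values of `C`, scikit-learn's inverse regularization strength. The data and the chosen `C` values are assumptions for illustration only; the model above uses the default `C=1.0`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic binary classification problem (not the weather data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=13)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # LogisticRegression applies an L2 penalty by default; smaller C = stronger penalty
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_demo, y_demo)
    # model[-1] is the fitted LogisticRegression step of the pipeline
    coef_norms[C] = float(np.linalg.norm(model[-1].coef_))
    print("C =", C, "-> coefficient norm", round(coef_norms[C], 3))
```

A stronger penalty (smaller `C`) pulls the coefficient vector closer to zero, which is the shrinkage behaviour described above.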


# Second Model - Gradient Boosted Trees
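We use gradient boosted trees as a classifier to predict whether or not it will rain the next day. To help prevent overfitting, we use shallow trees, subsample the training rows, and stop early when a held-out validation fraction stops improving. The StandardScaler step is kept in the pipeline, although tree-based models are largely insensitive to feature scaling. The hyperparameters below have been tuned.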

In [6]:
from sklearn.ensemble import GradientBoostingClassifier

### takes a couple of minutes to train ###

# Gradient Boosted Trees
gbt = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(n_estimators=1500, learning_rate=0.09, max_depth=2,
                               subsample=0.85, validation_fraction=0.1, n_iter_no_change=20,
                               max_features=4),
).fit(train_x, train_y)

gbt.score(test_x,test_y)

gbt_predictions = gbt.predict(test_x)
print(classification_report(test_y, gbt_predictions))
print("model score: " + str(gbt.score(test_x, test_y)))

              precision    recall  f1-score   support

           0       0.87      0.95      0.91     21821
           1       0.75      0.51      0.61      6337

    accuracy                           0.85     28158
   macro avg       0.81      0.73      0.76     28158
weighted avg       0.84      0.85      0.84     28158

model score: 0.851125790183962
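
Because `n_iter_no_change` is set, `n_estimators=1500` acts as an upper bound rather than the number of trees actually built. The sketch below, on synthetic data with an assumed `random_state`, shows how the fitted `n_estimators_` attribute reports where early stopping ended; it does not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# synthetic stand-in data (not the weather data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     n_informative=5, random_state=13)

# same hyperparameters as the model above, plus a fixed seed for repeatability
gbt_demo = GradientBoostingClassifier(
    n_estimators=1500, learning_rate=0.09, max_depth=2, subsample=0.85,
    validation_fraction=0.1, n_iter_no_change=20, max_features=4,
    random_state=13,
).fit(X_demo, y_demo)

# training stops once the score on the 10% validation fraction
# has not improved for 20 consecutive boosting iterations
print("upper bound on trees:", gbt_demo.n_estimators)
print("trees actually fitted:", gbt_demo.n_estimators_)
```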


# Third Model - Stochastic Gradient Descent 

*I was simply curious to see how this model would perform.*

We use SGD to train a binary linear classifier to predict whether or not it will rain the next day. We standardize the features to help speed up convergence.

In [7]:
from sklearn.linear_model import SGDClassifier

# Stochastic Gradient Descent
sgd = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3)).fit(train_x, train_y)

sgd.score(test_x,test_y)

sgd_predictions = sgd.predict(test_x)
print(classification_report(test_y, sgd_predictions))
print("model score: " + str(sgd.score(test_x, test_y)))

              precision    recall  f1-score   support

           0       0.86      0.94      0.90     21821
           1       0.69      0.48      0.57      6337

    accuracy                           0.84     28158
   macro avg       0.78      0.71      0.73     28158
weighted avg       0.82      0.84      0.82     28158

model score: 0.8358548192343206
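
One way to look at the effect of standardization on SGD is to compare the number of epochs run (`n_iter_`) and the test accuracy with and without a `StandardScaler` step. The sketch below does this on synthetic data whose features are deliberately put on very different scales; the data, the scale factors, and the fixed `random_state` are assumptions, and the exact numbers will differ from run to run and from the weather data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic, well-separated binary problem (not the weather data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=6, n_informative=4,
                                     n_redundant=0, n_clusters_per_class=1,
                                     class_sep=2.0, flip_y=0.0, random_state=13)

# stretch the columns so they live on very different scales
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 1.0])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=13)

# SGDClassifier's default loss is the hinge loss, i.e. a linear SVM
raw = SGDClassifier(max_iter=1000, tol=1e-3, random_state=13).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(),
                       SGDClassifier(max_iter=1000, tol=1e-3, random_state=13)).fit(X_tr, y_tr)

raw_acc = raw.score(X_te, y_te)
scaled_acc = scaled.score(X_te, y_te)
print("unscaled: epochs =", raw.n_iter_, "accuracy =", raw_acc)
print("scaled:   epochs =", scaled[-1].n_iter_, "accuracy =", scaled_acc)
```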


# Discussion
Our data had to be cleaned fairly thoroughly in this exercise, as the large dataset had many missing values. Replacing missing values with a statistic other than the mean could perhaps be justified depending on the type of data. As mentioned above, further improvement might be achieved by replacing missing data with the means of subgroups of the data (e.g., average temperature in winter vs. summer). The features were standardized in all of our models. In logistic regression, L2 regularization penalizes large coefficients heavily and shrinks them toward zero without setting them exactly to zero. In GBTs, regularization (shallow trees, subsampling, early stopping) helps prevent the model from overfitting, and in SGD, standardized features decrease the time it takes to converge.
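
As one example of replacing missing values with a measure other than the mean, the sketch below uses scikit-learn's `SimpleImputer` with the median, placed inside the pipeline so that the fill values are learned from the training rows only. The synthetic data and the 10% missing rate are assumptions for illustration; this was not run on the weather data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic features and labels (not the weather data)
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# knock out roughly 10% of the feature values to simulate missing data
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# impute with the per-column median, then standardize, then classify
model = make_pipeline(SimpleImputer(strategy="median"),
                      StandardScaler(),
                      LogisticRegression())

# fit on the first 240 rows; medians are computed from these rows only
model.fit(X_demo[:240], y_demo[:240])
demo_score = model.score(X_demo[240:], y_demo[240:])
print("held-out accuracy:", demo_score)
```

The median is less sensitive to extreme values than the mean, which may suit skewed columns such as Rainfall.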