# Problem 2 -- Rain in Australia

In this problem, we will build models to predict whether it's going to rain tomorrow.

## 1. Data loading
Unlike Problem 1, this data set contains three types of features: dates, numeric values and categories. We list the columns of each type here and preprocess each type separately before training.

In [1]:
DATE_COLUMNS = ['Date']

NUMERIC_COLUMNS = ['MinTemp','MaxTemp','Rainfall','Evaporation','Sunshine',
  'WindGustSpeed','WindSpeed9am','WindSpeed3pm','Humidity9am','Humidity3pm',
  'Pressure9am','Pressure3pm','Cloud9am','Cloud3pm','Temp9am','Temp3pm']

CATEGORICAL_COLUMNS = ['Location','WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow']

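Now we load the data. The first line of the CSV file is the header, which is dropped on loading because we supply our own column names. Note that `header=1` combined with `names` also discards the first data row; `header=0` would keep it. We also remove the column "RISK_MM", which is excluded from the problem.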

In [2]:
import pandas as pd

dftrain = pd.read_csv(
  'weatherAUS.csv',
  header=1,
  names=['Date','Location','MinTemp','MaxTemp','Rainfall','Evaporation','Sunshine',
    'WindGustDir','WindGustSpeed','WindDir9am','WindDir3pm','WindSpeed9am','WindSpeed3pm',
    'Humidity9am','Humidity3pm','Pressure9am','Pressure3pm','Cloud9am','Cloud3pm',
    'Temp9am','Temp3pm','RainToday','RISK_MM','RainTomorrow'])

dftrain.pop('RISK_MM')

0         0.0
1         0.0
2         1.0
3         0.2
4         0.0
         ... 
145454    0.0
145455    0.0
145456    0.0
145457    0.0
145458    NaN
Name: RISK_MM, Length: 145459, dtype: float64
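
To illustrate how the `header` argument interacts with `names`, here is a minimal sketch on a made-up three-line CSV. The text `csv_text` and the column names `x` and `y` are assumptions for illustration only; they are not part of the weather data.

```python
import io
import pandas as pd

# Hypothetical CSV: one header line followed by two data rows.
csv_text = "a,b\n1,2\n3,4\n"

# header=0: line 0 is the header and is replaced by `names`; both data rows survive.
keep_all = pd.read_csv(io.StringIO(csv_text), header=0, names=['x', 'y'])

# header=1: line 1 is treated as the header, so lines 0 and 1 are both discarded.
drop_one = pd.read_csv(io.StringIO(csv_text), header=1, names=['x', 'y'])

print(len(keep_all), len(drop_one))
```

With `header=1` only the last row remains, which matches the 145459 rows reported above.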

Now the data looks like this.

In [3]:
print(dftrain.head(5))

         Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
0  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN   
1  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN   
2  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN   
3  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN   
4  2008-12-06   Albury     14.6     29.7       0.2          NaN       NaN   

  WindGustDir  WindGustSpeed WindDir9am  ... Humidity9am  Humidity3pm  \
0         WNW           44.0        NNW  ...        44.0         25.0   
1         WSW           46.0          W  ...        38.0         30.0   
2          NE           24.0         SE  ...        45.0         16.0   
3           W           41.0        ENE  ...        82.0         33.0   
4         WNW           56.0          W  ...        55.0         23.0   

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0       1010.6    

## 2. Data preprocessing

### 2.1. Date type
scikit-learn cannot train on date-typed columns. To make the column usable, we either (1) convert each date into a numeric value (e.g. "2019-01-01" to int(20190101)); or (2) convert each date into its ordinal, a day count that increases by one per calendar day, so the values sort in date order.

Here we pick (2).

In [4]:
import datetime as dt

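for date_column in DATE_COLUMNS:
  # Parse the strings as timestamps, then map each one to its proleptic Gregorian ordinal.
  dftrain[date_column] = pd.to_datetime(dftrain[date_column])
  dftrain[date_column] = dftrain[date_column].map(dt.datetime.toordinal)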
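
To make the difference between options (1) and (2) concrete, the sketch below encodes two consecutive days both ways. The two dates are made up for illustration and do not come from the data set.

```python
import datetime as dt
import pandas as pd

# Two consecutive calendar days that straddle a year boundary (illustrative values).
dates = pd.Series(pd.to_datetime(['2018-12-31', '2019-01-01']))

# Option (1): pack year, month and day into one integer, e.g. 20190101.
as_int = dates.dt.strftime('%Y%m%d').astype(int)

# Option (2): proleptic Gregorian ordinal, which grows by exactly 1 per day.
as_ordinal = dates.map(dt.datetime.toordinal)

int_gap = int(as_int.iloc[1] - as_int.iloc[0])
ord_gap = int(as_ordinal.iloc[1] - as_ordinal.iloc[0])
print(int_gap, ord_gap)
```

The packed integer jumps by 8870 across the year boundary although the dates are one day apart, while the ordinal gap is 1. This is one reason to prefer the ordinal when the model should see evenly spaced time.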


### 2.2. Numeric type
For numeric columns, replace NaN and inf/-inf with the column mean. Again, we leave generalization to training time.

In [5]:
import numpy as np

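for column in NUMERIC_COLUMNS:
  # Treat +/-inf as missing first, so that they cannot distort the column mean.
  dftrain[column] = dftrain[column].replace([np.inf, -np.inf], np.nan)
  # Assign the result back instead of using inplace=True on a column selection.
  dftrain[column] = dftrain[column].fillna(dftrain[column].mean())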
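
The order of the two replacements matters when a column actually contains infinities. The following sketch uses a made-up four-value series (an assumption for illustration) to show that the mean is only meaningful once inf has been turned into NaN.

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing value and one infinity.
toy = pd.Series([1.0, 3.0, np.nan, np.inf])

# The raw mean skips NaN but not inf, so it is itself infinite.
raw_mean = toy.mean()

# Convert inf to NaN first; the mean is then taken over the finite values only.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
cleaned = cleaned.fillna(cleaned.mean())

print(raw_mean, cleaned.tolist())
```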

### 2.3. Categorical type
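For categorical columns, we apply label encoding. To save space, we don't use one-hot encoding here. Missing values are cast to the string 'nan' by `astype(str)` and therefore receive a label of their own.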

In [6]:
from sklearn import preprocessing

for column in CATEGORICAL_COLUMNS:
  dftrain[column] = preprocessing.LabelEncoder().fit_transform(dftrain[column].astype(str))
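
For comparison, here is a small sketch of label encoding next to one-hot encoding. The list `directions` is a made-up sample, and `pd.get_dummies` is one possible way to one-hot encode; the notebook itself only uses `LabelEncoder`.

```python
import pandas as pd
from sklearn import preprocessing

# Illustrative sample of wind directions.
directions = ['N', 'E', 'W', 'N']

# Label encoding: one integer per category, assigned in sorted order of the category names.
encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(directions)

# One-hot encoding: one indicator column per category, at the cost of more columns.
one_hot = pd.get_dummies(pd.Series(directions))

print(list(encoder.classes_), codes.tolist(), one_hot.shape)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories (here E < N < W), which linear models may interpret as a magnitude. Tree-based models are usually less sensitive to this.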

### 2.4. Shuffling and split
Here we shuffle the data, separate the label column and split the rows in an 80:20 (train:test) ratio.

In [7]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

dftrain = shuffle(dftrain, random_state=8)

dflabel = dftrain.pop('RainTomorrow')

x_train, x_test, y_train, y_test = train_test_split(dftrain, dflabel, test_size=0.2, random_state=42)
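
As a side note, `train_test_split` shuffles rows by default, so the 80:20 split can be sketched on its own. The arrays below are made up for illustration, and the `stratify` argument is an optional addition that the notebook does not use; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical rows with two features each, and a balanced binary label.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# test_size=0.2 reserves 20% of the rows for testing.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

print(len(x_tr), len(x_te))
```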

The training data look like this.

In [8]:
print('Training features are:\n', x_train.head(3))
print('Training labels are:\n', y_train.head(3))

Training features are:
           Date  Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
57727   736092         5     12.7     17.2       0.4     5.468232  7.611178   
48586   736070         9     10.2     23.0       0.6     5.468232  7.611178   
100241  734106        22      9.7     20.9       0.0     3.600000  3.100000   

        WindGustDir  WindGustSpeed  WindDir9am  ...  WindSpeed3pm  \
57727             5           54.0           5  ...          30.0   
48586             6           30.0          11  ...          19.0   
100241            2           48.0           2  ...          30.0   

        Humidity9am  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  \
57727         100.0        100.0       1010.4       1004.4       7.0   
48586          87.0         50.0       1022.3       1019.3       7.0   
100241         68.0         59.0       1018.9       1016.2       8.0   

        Cloud3pm  Temp9am  Temp3pm  RainToday  
57727        8.0     13.0     14.5          0 

## 3. Model training: a classification problem

### 3.1. Logistic regression
We first apply logistic regression. Since no `penalty` argument is passed, scikit-learn applies its default L2 regularization. Note that training may be slower than in Problem 1 because the data set is larger.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

logistic_reg = LogisticRegression(solver='liblinear', max_iter=500, multi_class ='auto', random_state=23)

logistic_reg.fit(x_train, y_train)

print('logistic reg mean squared error for train: %.4f' % mean_squared_error(logistic_reg.predict(x_train), y_train))
print('logistic reg mean squared error for test: %.4f' % mean_squared_error(logistic_reg.predict(x_test), y_test))

print('logistic_reg score for train: %.4f' % logistic_reg.score(x_train, y_train))
print('logistic_reg score for test: %.4f' % logistic_reg.score(x_test, y_test))

logistic reg mean squared error for train: 0.3078
logistic reg mean squared error for test: 0.3136
logistic_reg score for train: 0.7592
logistic_reg score for test: 0.7550
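
The mean squared error above is not simply one minus the accuracy. A likely reason is that the encoded label has more than two values (the string 'nan' gets its own code), so some errors are squared distances larger than 1. The sketch below illustrates the effect on made-up label vectors; the numbers are not taken from the data set.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Binary labels: every error has squared distance 1, so MSE equals 1 - accuracy.
binary_true = [0, 1, 1, 0]
binary_pred = [0, 0, 1, 0]
binary_acc = accuracy_score(binary_true, binary_pred)
binary_mse = mean_squared_error(binary_true, binary_pred)

# Three label codes: predicting 0 for a true 2 costs 4, so MSE exceeds 1 - accuracy.
multi_true = [0, 1, 2, 0]
multi_pred = [0, 0, 0, 0]
multi_acc = accuracy_score(multi_true, multi_pred)
multi_mse = mean_squared_error(multi_true, multi_pred)

print(binary_acc, binary_mse, multi_acc, multi_mse)
```

For a classifier, accuracy (or a confusion matrix) is usually easier to interpret than a squared error over arbitrary label codes.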


We also run 5-fold cross validation to check that the accuracy is stable across splits.

In [10]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import cross_val_score

scores = cross_val_score(logistic_reg, dftrain, dflabel, cv=5, scoring='accuracy')

print('Scores for accuracy: ', scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Scores for accuracy:  [0.75836112 0.75836112 0.75841325 0.75841325 0.75841325]
Accuracy: 0.76 (+/- 0.00)
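
One refinement worth considering is to run the imputation inside each fold, so that column means are computed from the training part only. The sketch below does this with a scikit-learn pipeline on synthetic data. The data set from `make_classification`, the injected NaN values and the pipeline itself are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic binary classification data with some values knocked out.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[::10, 0] = np.nan

# The imputer is refit on the training part of every fold before the classifier.
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    LogisticRegression(solver='liblinear'))

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))
```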


### 3.2. Gradient boosted tree classifier

In [11]:
from sklearn import ensemble

gb_tree_classifier = ensemble.GradientBoostingClassifier(
  loss='deviance',  # renamed to 'log_loss' in scikit-learn 1.1; 'deviance' was removed in 1.3
  learning_rate=0.1,
  n_estimators=200,
  random_state=24,
  max_depth=3,
  max_features=None)

gb_tree_classifier.fit(x_train, y_train)

print('gb_tree mean squared error for train: %.4f' % mean_squared_error(gb_tree_classifier.predict(x_train), y_train))
print('gb_tree mean squared error for test: %.4f' % mean_squared_error(gb_tree_classifier.predict(x_test), y_test))

print('gb_tree score for train: %.4f' % gb_tree_classifier.score(x_train, y_train))
print('gb_tree score for test: %.4f' % gb_tree_classifier.score(x_test, y_test))

gb_tree mean squared error for train: 0.1790
gb_tree mean squared error for test: 0.1895
gb_tree score for train: 0.8495
gb_tree score for test: 0.8421
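
A fitted gradient boosted model also exposes `feature_importances_`, which can hint at which columns drive the prediction. The sketch below shows the idea on synthetic data with a smaller model. The data, the feature indices and the reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
from sklearn import ensemble
from sklearn.datasets import make_classification

# Synthetic binary classification data with five features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = ensemble.GradientBoostingClassifier(
    n_estimators=20, max_depth=3, random_state=24)
clf.fit(X, y)

# Importances are normalized to sum to 1; larger means the feature was used for more useful splits.
importances = clf.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)

for index, weight in ranked:
    print('feature %d: %.3f' % (index, weight))
```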


Cross validation checks whether the boosted-tree result holds across folds.

In [12]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import cross_val_score

# Cross-validate the boosted-tree model on the full data set.
scores = cross_val_score(gb_tree_classifier, dftrain, dflabel, cv=5, scoring='accuracy')

print('Scores for accuracy: ', scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


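As a result, the gradient boosted tree model predicts better than logistic regression: its test accuracy is 0.84 versus 0.76, about 9 percentage points higher. Its train and test scores differ by less than one point (0.8495 vs. 0.8421), which suggests a reasonable balance between bias and variance.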