# Rain Classifier


<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/drive/1J-AiKNtBcl4ZbYlKNxtkhJ3_Nnr35fgz"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/inaki-gonzalo/rain_predictor/blob/main/Rain_Classifier.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
</table>

## Introduction

The goal of this notebook is to build a binary classifier to predict whether it is going to rain.

I know nothing about meteorology, so I start with raw data from NOAA. It contains daily precipitation along with other variables. I choose the variables that are correlated with precipitation as my features (see the Visualize section).

Based on a Medium [article](https://medium.com/datadriveninvestor/building-neural-network-using-keras-for-classification-3a3656c726c1).


## Import libraries

In [None]:
from keras import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


## Download data

Download and clean the data.

In [None]:
# Original source of the data.
# https://www.ncdc.noaa.gov/cdo-web/search

data_columns = ['DATE', 'DailyAverageDewPointTemperature', 'DailyAverageRelativeHumidity',
                'DailyPrecipitation', 'DailyHeatingDegreeDays', 'DailyDepartureFromNormalAverageTemperature']
data_url = 'https://raw.githubusercontent.com/inaki-gonzalo/rain_predictor/main/sj_data.csv'
dataset = pd.read_csv(data_url, sep=',', header=0, parse_dates=[
                      'DATE'], usecols=data_columns, na_values=['T'])

dataset.dropna(inplace=True)

# Convert into correct data types.
float_columns = {}
for c in data_columns:
    if c != 'DATE':
        float_columns[c] = np.float16
dataset = dataset.astype(float_columns)

# Convert precipitation into a binary category.
dataset['DailyPrecipitation'] = np.where(
    dataset['DailyPrecipitation'] > 0.01, 1, 0)

# Make day of year a feature.
dataset['dayofyear'] = dataset['DATE'].dt.dayofyear

## Visualize

Explore relationships between variables in your data. The goal here is to find a plot where the dots of different colors form separate clusters.

In [None]:
sns.pairplot(dataset, hue='DailyPrecipitation')

In the following heatmap, I look at the DailyPrecipitation row. I want to choose features that have a high correlation with it.

In [None]:
sns.heatmap(dataset.corr(), annot=True)
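
The idea of reading one row of the heatmap can also be sketched numerically. The snippet below is a minimal illustration, not part of the original notebook: it ranks columns by the absolute Pearson correlation with a binary target. The helper name `rank_features_by_correlation`, the column names, and the toy values are all hypothetical.

```python
import numpy as np


def rank_features_by_correlation(features, target, names):
    """Return (name, |r|) pairs sorted by absolute Pearson correlation with target."""
    scores = []
    for name, column in zip(names, features.T):
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
        r = np.corrcoef(column, target)[0, 1]
        scores.append((name, abs(r)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (made-up values, not the NOAA dataset).
target = np.array([0, 0, 1, 1, 0, 1], dtype=float)
features = np.column_stack([
    [10, 12, 30, 28, 11, 31],  # moves together with the target
    [5, 1, 4, 2, 3, 6],        # mostly unrelated to the target
]).astype(float)

ranking = rank_features_by_correlation(
    features, target, ["humidity_like", "noise_like"])
for name, score in ranking:
    print(name, round(score, 3))
```

Columns near the top of such a ranking are the ones that would show up as strongly colored cells in the DailyPrecipitation row.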

## Create and split data
Remove the daily precipitation from the input data and use it as the output.

The features are all in different units. To make it easier for the model to learn from them, we pass the inputs through a standard scaler.

Split the data into training and test sets.


In [None]:
del data_columns[data_columns.index('DATE')]
del data_columns[data_columns.index('DailyPrecipitation')]
x = dataset.loc[:, data_columns]
y = dataset.loc[:, ['DailyPrecipitation']]

# Standardize the input features (zero mean, unit variance per column).
sc = StandardScaler()
x = sc.fit_transform(x)

# Split data into train and test.
# Note: test_size=0.7 holds out 70% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.7)
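
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization does to each column: subtract the column mean and divide by the column standard deviation (population standard deviation, the same convention `StandardScaler` uses). The helper name `standardize_columns` and the toy values are hypothetical.

```python
import numpy as np


def standardize_columns(values):
    """Scale each column to zero mean and unit variance."""
    values = np.asarray(values, dtype=np.float64)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population std (ddof=0)
    return (values - mean) / std


# Two made-up columns in very different units.
toy = np.array([
    [50.0, 0.30],
    [60.0, 0.50],
    [70.0, 0.90],
    [40.0, 0.70],
])
scaled = standardize_columns(toy)
print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns live on a comparable range, so no single feature dominates the weight updates just because of its units.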

## Defining the model

This simple three-layer model seems to be sufficient for this problem.

In [None]:
classifier = Sequential()
classifier.add(Dense(4, activation='relu',
                     kernel_initializer='random_normal', input_dim=len(data_columns)))
classifier.add(Dense(4, activation='relu', kernel_initializer='random_normal'))
classifier.add(Dense(1, activation='sigmoid',
                     kernel_initializer='random_normal'))
classifier.compile(
    optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
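
To get a feel for how small this network is, the sketch below counts its trainable parameters by hand. It assumes four input features (the six entries of `data_columns` minus `DATE` and `DailyPrecipitation`) and the layer sizes 4, 4, 1 used above; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        # Each Dense layer has fan_in * units weights plus one bias per unit.
        total += fan_in * units + units
        fan_in = units
    return total


n_features = 4  # assumption: four input columns remain after the deletions
total_params = dense_param_count(n_features, [4, 4, 1])
print(total_params)  # (4*4+4) + (4*4+4) + (4*1+1)
```

If the model has been built, `classifier.summary()` reports the same kind of per-layer breakdown.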

## Training
The data is imbalanced: it rains on far fewer than half of the days.
To compensate, we give the rain class a larger weight with class_weight.

In [None]:
class_weight = {0: 1,
                1: 15}
classifier.fit(x_train, y_train, batch_size=10, epochs=100, class_weight=class_weight)
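
The weights above are chosen by hand. One common alternative, shown here only as a sketch, is to derive them from the class frequencies with the "balanced" heuristic `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the toy labels are hypothetical, and the result is not necessarily what the notebook's dataset would produce.

```python
import numpy as np


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    labels = np.asarray(labels).ravel().astype(int)
    counts = np.bincount(labels)  # counts[i] = number of samples of class i
    n_samples = labels.size
    n_classes = counts.size
    return {cls: float(n_samples / (n_classes * count))
            for cls, count in enumerate(counts)}


# Made-up labels: 15 dry days for every rainy day.
toy_labels = [0] * 15 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary of this shape can be passed as the `class_weight` argument of `fit`, in place of the hand-picked values.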

## Evaluate the model

In [None]:
eval_model = classifier.evaluate(x_test, y_test)
print(eval_model)

y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)

# confusion_matrix rows are true labels and columns are predicted labels,
# so for 0/1 labels the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
print('\nNumber of samples:{}'.format(len(y_pred)))
print('True negatives:{}'.format(cm[0][0]))
print('False positives:{}'.format(cm[0][1]))
print('False negatives:{}'.format(cm[1][0]))
print('True positives:{}'.format(cm[1][1]))
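
With imbalanced classes, accuracy alone can be misleading, so it may help to look at precision and recall as well. The sketch below counts the four outcomes directly with NumPy, using the same convention as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels, giving `[[TN, FP], [FN, TP]]` for 0/1 labels). The helper name `binary_confusion_counts` and the toy labels are hypothetical.

```python
import numpy as np


def binary_confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_pred = np.asarray(y_pred).ravel().astype(bool)
    tn = int(np.sum(~y_true & ~y_pred))  # dry day, predicted dry
    fp = int(np.sum(~y_true & y_pred))   # dry day, predicted rain
    fn = int(np.sum(y_true & ~y_pred))   # rainy day, predicted dry
    tp = int(np.sum(y_true & y_pred))    # rainy day, predicted rain
    return tn, fp, fn, tp


# Made-up labels and predictions.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 0, 1, 0, 1, 0]
tn, fp, fn, tp = binary_confusion_counts(toy_true, toy_pred)

precision = tp / (tp + fp)  # share of rain predictions that were right
recall = tp / (tp + fn)     # share of rainy days that were caught
print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```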