#### Part 3: 
#### This data set contains weather data collected over several decades. It is not yet ready for analysis because some outliers still need to be handled.

#### After this is done, I will prepare the training, test and validation sets.

The '2_Weather_Proc' data set contains ~460 monthly records, recorded at the Changi Climate Station. Here are the fields:

| Field          | Description|
|----------------|--------------------------------------------------------|
| Year-month                | Month and Year of Data Point                |
| temp_mean_daily_min       | The monthly and annual mean daily minimum temperature                                                               |
| temp_extremes_min         | The absolute extreme minimum air temperature|
| temp_mean_daily_max       | The monthly and annual mean daily maximum temperature                                                               |
| mean_temp                 | The monthly mean air temperature            |
| max_temperature           | The monthly extreme maximum air temperature |
| mean_sunshine_hrs         | The monthly mean sunshine hours in a day    |
| wet_bulb_temperature      | The hourly wet bulb temperature             |
| maximum_rainfall_in_a_day | The highest daily total rainfall            |
| total_rainfall            | The total monthly rainfall                  |
| rh_extremes_minimum       | The absolute extreme minimum relative humidity                                                                  |
| mean_rh                   | The monthly mean relative humidity          |
| no_of_rainy_days          | The number of rain days (day with rainfall amount of 0.2mm or more)                                                  |

### Import Libraries

In [None]:
# General Libraries
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import time
import re
import requests
import pickle
import seaborn as sns
import os
import glob
import sys
sns.set()

# Sklearn Libraries
from sklearn import preprocessing

from datetime import timedelta, date 
start = time.time()
%matplotlib inline

# Forces the print statement to show everything and not truncate
# np.set_printoptions(threshold=sys.maxsize) 
print('Libraries imported')

### Load Data from CSV File

In [None]:
df_pre_proc = pd.read_csv('2_Weather_Proc.csv')
print(df_pre_proc.shape)
print(df_pre_proc.info())
df_pre_proc.describe(include='all')

In [None]:
df_pre_proc.head(5)

In [None]:
df_pre_proc.tail(5)

### Dealing with outliers

In [None]:
(df_pre_proc == 0).astype(int).sum(axis=0)

There are 250 cells in the 'temp_mean_daily_min' column (roughly half of the data) that contain 0, which indicates that no data was recorded. There are also 55 cells in 'wet_bulb_temperature' that are 0. A value of 0 degrees in tropical Singapore is impossible, so these values are outliers that have to be dealt with.
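
The count above compares each cell against 0, which is not the same as counting missing (NaN) cells. A minimal sketch of the difference, using a made-up frame (the values are hypothetical and only the column names come from this data set):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): each column has one 0 placeholder and one true NaN
toy = pd.DataFrame({
    'temp_mean_daily_min': [24.8, 0.0, 25.1, np.nan],
    'wet_bulb_temperature': [25.3, 25.0, 0.0, np.nan],
})

zero_counts = (toy == 0).sum()   # cells holding the placeholder value 0
null_counts = toy.isna().sum()   # cells that are genuinely missing (NaN)

print(zero_counts)
print(null_counts)
```

Running both counts side by side helps confirm whether missing readings were encoded as 0, as NaN, or as a mix of the two.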

In [None]:
df_pre_proc['temp_mean_daily_min'].dropna().value_counts().sort_index()

For 'temp_mean_daily_min', the non-zero points mostly lie in the range 24.5 to 25.4 (a 0.9 degree spread).

In [None]:
df_pre_proc['wet_bulb_temperature'].dropna().round(1).value_counts().sort_index()

For 'wet_bulb_temperature', the non-zero points mostly lie in the range 24.6 to 26.1 (a 1.5 degree spread).

The narrow range of valid data points suggests two ways of dealing with the '0' outliers: either I drop the whole column, or I statistically replace each '0' value with a random number from a plausible range like the ones above. I will apply the latter.

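This statistical replacement is done by (a) building a boolean mask that separates the zero cells from the non-zero cells and (b) drawing, for each row, a random factor from a normal distribution centred on 1 (standard deviation 0.01) and multiplying it by another random number drawn uniformly from a narrow range (about 23.5 to 27.1 overall; in the code, the bounds are the mean and the maximum of each column's non-zero values). This emulates noise while keeping the data within a tight bound.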
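
As a rough illustration of the idea, here is a minimal, self-contained sketch on a made-up series. It uses a seeded generator and `Series.where` instead of the mask multiplication used in this notebook, so it is an equivalent variant under those assumptions, not the exact code that follows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical readings; 0 marks a missing measurement
s = pd.Series([24.6, 0.0, 25.2, 0.0, 24.9])

is_zero = s == 0
valid = s[~is_zero]

# One base value drawn uniformly between the mean and max of the valid readings
base = rng.uniform(valid.mean(), valid.max())
# Per-row multiplicative noise centred on 1 with a 1% standard deviation
noise = rng.normal(1.0, 0.01, size=len(s))

# Keep the valid readings, substitute base * noise where the reading was 0
filled = s.where(~is_zero, base * noise).round(1)
print(filled)
```

Because the noise factor stays close to 1, the substituted values remain near the chosen base value, while the valid readings pass through unchanged.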

In [None]:
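def rand_gen(a, b, df):
    # One multiplicative noise factor per row, normally distributed around 1 (std 0.01)
    rand_norm = np.random.normal(1, 0.01, df.shape[0])
    # A single base value drawn uniformly from [a, b), shared by every row
    rand_uni = np.random.uniform(a, b)
    out = rand_norm * rand_uni
    return out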

This is the process to replace the 0-cells in 'temp_mean_daily_min'

In [None]:
# Create a boolean mask: True where the value is 0
mask_1 = df_pre_proc['temp_mean_daily_min'] == 0
mask_2 = ~mask_1

# Keep only the rows whose value is greater than 0
df_filt = df_pre_proc[df_pre_proc['temp_mean_daily_min'] > 0]

In [None]:
# Generate the numbers that are the replacement for 0.
# Multiplying by a boolean mask treats (True, False) as (1, 0)
# Note: despite its name, df_min holds the mean of the non-zero values, not the minimum
df_min = df_pre_proc[df_pre_proc['temp_mean_daily_min'] > 0]['temp_mean_daily_min'].mean()
df_max = df_pre_proc[df_pre_proc['temp_mean_daily_min'] > 0]['temp_mean_daily_min'].max()
replace_1 = mask_1 * rand_gen(df_min, df_max, df_pre_proc)
replace_2 = mask_2 * df_pre_proc['temp_mean_daily_min']

In [None]:
# Each row is non-zero in only one of the two series, so the element-wise maximum merges them
df_pre_proc['temp_mean_daily_min'] = np.maximum(replace_1,replace_2).round(1)
df_pre_proc['temp_mean_daily_min']

This is the process to replace the 0-cells in 'wet_bulb_temperature'

In [None]:
# Create a boolean mask: True where the value is 0
mask_3 = df_pre_proc['wet_bulb_temperature'] == 0
mask_4 = ~mask_3

# Keep only the rows whose value is greater than 0
df_filt2 = df_pre_proc[df_pre_proc['wet_bulb_temperature'] > 0]

In [None]:
# Generate the numbers that are the replacement for 0.
# Multiplying by a boolean mask treats (True, False) as (1, 0)
# Note: despite its name, df_min2 holds the mean of the non-zero values, not the minimum
df_min2 = df_pre_proc[df_pre_proc['wet_bulb_temperature'] > 0]['wet_bulb_temperature'].mean()
df_max2 = df_pre_proc[df_pre_proc['wet_bulb_temperature'] > 0]['wet_bulb_temperature'].max()
replace_3 = mask_3 * rand_gen(df_min2, df_max2, df_pre_proc)
replace_4 = mask_4 * df_pre_proc['wet_bulb_temperature']

In [None]:
# Each row is non-zero in only one of the two series, so the element-wise maximum merges them
df_pre_proc['wet_bulb_temperature'] = np.maximum(replace_3,replace_4).round(1)
df_pre_proc['wet_bulb_temperature']

In [None]:
df_pre_proc.columns.values

In [None]:
ncols = 3
fig, axes = plt.subplots(ncols=ncols)
fig.set_figwidth(20)

# Group the columns by theme so each panel can be used for QC at a glance
plot_groups = [
    ('Temperature', 'Temperature Data (Deg C)',
     ['temp_mean_daily_min', 'temp_extremes_min', 'temp_mean_daily_max',
      'mean_temp', 'max_temperature', 'wet_bulb_temperature']),
    ('Rainfall', 'Rainfall Data (mm, or days for no_of_rainy_days)',
     ['maximum_rainfall_in_a_day', 'total_rainfall', 'no_of_rainy_days']),
    ('Humidity', 'Humidity Data (%)',
     ['rh_extremes_minimum', 'mean_rh']),
]

for ax, (title, xlabel, cols) in zip(axes, plot_groups):
    for col in cols:
        # distplot is deprecated in recent seaborn releases; histplot(kde=True) is its successor
        sns.distplot(df_pre_proc[col], hist = True, label=col, ax=ax)
    ax.set_title('Histogram - %s Data Distribution' % title)
    ax.set_xlabel(xlabel)
    # distplot normalizes the histogram, so the y-axis shows density rather than counts
    ax.set_ylabel('Density')
    ax.legend()

In [None]:
# For empty Dataframe - testing purposes
# column_names = [ ]
# df = pd.DataFrame(columns = column_names)

df_pre_proc['Year-month']=pd.to_datetime(df_pre_proc['Year-month'])

df_pre_proc['month'] = df_pre_proc['Year-month'].dt.month
df_pre_proc['Year'] = df_pre_proc['Year-month'].dt.year
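# drop() returns a new frame, so assign the result back to make the change stick
df_pre_proc = df_pre_proc.drop(['Year-month'], axis = 1)
df_pre_proc.head()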
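
The date handling in this cell can be sketched on a small made-up frame. The `YYYY-MM` string format is an assumption about how 'Year-month' is stored, and the dates themselves are hypothetical:

```python
import pandas as pd

# Hypothetical 'Year-month' strings in YYYY-MM form (format assumed, not confirmed)
toy = pd.DataFrame({'Year-month': ['1982-01', '1982-02', '1983-12']})

toy['Year-month'] = pd.to_datetime(toy['Year-month'], format='%Y-%m')
toy['month'] = toy['Year-month'].dt.month
toy['Year'] = toy['Year-month'].dt.year

# drop() returns a new frame; the original column only disappears once the result is assigned
toy = toy.drop(columns=['Year-month'])
print(toy)
```

Keeping 'month' as a separate numeric column lets a model pick up seasonality without needing the raw datetime.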

Now the data set is ready for interpretation and use. First, let's split the data set into a training/test set and a validation set. The validation set ('validate_set') will be treated as an "out-of-sample" set for the final model evaluation.

### Preparing the Data Sets for Predictive Modelling

In [None]:
msk = np.random.rand(len(df_pre_proc))<0.8
train_test_set = df_pre_proc[msk]
validate_set = df_pre_proc[~msk]
print(train_test_set.shape)
print(validate_set.shape)
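
Because the mask is drawn at random, the two shapes printed above change on every run. If repeatable splits are wanted, one option is a seeded generator. A minimal sketch, assuming an arbitrary seed of 42 and a stand-in frame in place of `df_pre_proc`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for illustration

toy = pd.DataFrame({'value': np.arange(1000)})  # stand-in for df_pre_proc

msk = rng.random(len(toy)) < 0.8   # True for roughly 80% of the rows
train_test_part = toy[msk]
validate_part = toy[~msk]

print(train_test_part.shape, validate_part.shape)
```

The split is still only approximately 80/20, since each row is assigned independently, but the same rows land in the same set on every run.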

#### Feature Selection on 'train_test_set'

In [None]:
Feature = train_test_set[[
    'temp_mean_daily_min', 'temp_extremes_min', 'temp_mean_daily_max', 'mean_temp', 
    'max_temperature', 'mean_sunshine_hrs', 'wet_bulb_temperature', 'maximum_rainfall_in_a_day', 
    'total_rainfall', 'rh_extremes_minimum', 'mean_rh', 'month'
]]
x=Feature
x.head()

#### Label data for Machine Learning

In [None]:
y = train_test_set['no_of_rainy_days'].values
print(y[0:5])
print(x.shape, y.shape)

Now I split the 'train_test_set' into a training and testing set. I will do this with a 70-30 split.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size = 0.3, random_state = 42
)

print('Train Set: ', x_train.shape, y_train.shape)
print('Test Set: ', x_test.shape, y_test.shape)
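
Since the 70-30 split is applied to the roughly 80% of rows kept in 'train_test_set', the overall shares of the original data work out as follows. This is a small arithmetic sketch; the row count of 460 is the approximate figure quoted in the introduction, not an exact value:

```python
n_rows = 460            # approximate row count quoted in the introduction
holdout_frac = 0.2      # expected share set aside as validate_set
test_frac = 0.3         # test_size passed to train_test_split

train_share = (1 - holdout_frac) * (1 - test_frac)
test_share = (1 - holdout_frac) * test_frac

for name, share in [('train', train_share), ('test', test_share), ('validate', holdout_frac)]:
    print(f'{name}: {share:.0%} of rows, about {round(n_rows * share)} rows')
```

In other words, about 56% of the original rows are used for training, 24% for testing, and 20% are held back for validation.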

#### Normalize Data to give zero mean and unit variance. 

This is only done to the features

In [None]:
# Fit the scaler on the training features only, then reuse it for the test features
scaler = preprocessing.StandardScaler().fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
print('Normalized X Training Set: ', X_train[0:5])
print('Normalized X Testing Set: ', X_test[0:5])
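
A common convention is to estimate the mean and variance from the training features only and apply that same transformation to the test features, so that no information from the test set leaks into preprocessing. A minimal sketch on made-up arrays (the names `x_tr` and `x_te` and their values are hypothetical):

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
x_tr = rng.normal(25.0, 1.0, size=(70, 2))   # hypothetical training features
x_te = rng.normal(25.0, 1.0, size=(30, 2))   # hypothetical test features

scaler = preprocessing.StandardScaler().fit(x_tr)  # statistics come from training data only
X_tr = scaler.transform(x_tr)
X_te = scaler.transform(x_te)                      # test data reuses the training mean and std

print(X_tr.mean(axis=0).round(3), X_tr.std(axis=0).round(3))
print(X_te.mean(axis=0).round(3), X_te.std(axis=0).round(3))
```

The scaled training set has exactly zero mean and unit variance per column, while the scaled test set is only close to that, which is expected when the statistics come from the training data.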

#### Pickle the Data Sets

In [None]:
# Pickle all the training (post normalization) 
# and testing data sets
with open('X_train', 'wb') as file:
    pickle.dump(X_train, file)
with open('X_test', 'wb') as file:
    pickle.dump(X_test, file)
with open('y_train', 'wb') as file:
    pickle.dump(y_train, file)
with open('y_test', 'wb') as file:
    pickle.dump(y_test, file)
with open('validate_set', 'wb') as file:
    pickle.dump(validate_set, file)
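
To use these files in a later notebook, they can be read back with `pickle.load`. A minimal round-trip sketch, using a temporary directory and a small stand-in array so that it does not touch the files written above:

```python
import os
import pickle
import tempfile

import numpy as np

arr = np.array([[0.1, -1.2], [1.3, 0.4]])  # stand-in for X_train

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'X_train')   # same extension-less naming as above
    with open(path, 'wb') as file:
        pickle.dump(arr, file)
    with open(path, 'rb') as file:        # 'rb' mirrors the 'wb' used for writing
        restored = pickle.load(file)

print(np.array_equal(arr, restored))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.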

In [None]:
count = 'Completed Process'
elapsed = (time.time() - start)
print ("%s in %s seconds" % (count,elapsed))