# Weather Classification -- Sydney, Australia
## Scott Campbell, Matthew Triebes

### Dataset description
We sourced our dataset from Kaggle.com: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

When we examined the dataset, we found that it was a little too large to analyze in full. Because of this, we decided to focus our efforts on the rain data for Sydney, Australia. That limits the data to about 3400 instances, which makes it a lot easier to analyze.

With that taken into account, we’ve made separate files based on the 9am and 3pm data. Comparing those results should be interesting, and we’ll see if there is much of a difference.

### Imports

In [11]:
import importlib
import os  # needed for os.path.join when pickling the best classifier
import pickle

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier, MyDecisionTreeClassifier, MyRandomForestClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

## Loading the Dataset into Data Science Table

In [12]:
table = MyPyTable()
table.load_from_file("Sydney_weather.csv")

<mysklearn.mypytable.MyPyTable at 0x7f86533d5d00>

### Discretizing Continuous Attributes

The data table has several continuous attributes that must first be discretized before they can be used by the classifiers below, including the Random Forest Classifier.

In [13]:
# Define the cutoffs and labels for the continuous attributes
temp_cutoffs = [0, 5, 10, 15, 20, 25, 30]
temp_labels = [kk+1 for kk in range(len(temp_cutoffs))]
humidity_cutoffs = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
humidity_labels = [kk+1 for kk in range(len(humidity_cutoffs))]
pressure_cutoffs = [950, 960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040, 1050]
pressure_labels = [kk+1 for kk in range(len(pressure_cutoffs))]
# Get the attributes of interest from the datatable
subdataset = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "RainToday"])
new_table = MyPyTable(data=subdataset, column_names=["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "RainToday"])
# Remove all instances with NA
new_table.remove_rows_with_missing_values()
# Discretize the continuous attributes (temperature, humidity, pressure)
min_temp = new_table.get_column("MinTemp")
min_temp = myutils.classify_continuous_data(min_temp, temp_cutoffs, temp_labels, lower_inclusive_upper_exclusive=False)
max_temp = new_table.get_column("MaxTemp")
max_temp = myutils.classify_continuous_data(max_temp, temp_cutoffs, temp_labels, lower_inclusive_upper_exclusive=False)
humid9am = new_table.get_column("Humidity9am")
humid9am = myutils.classify_continuous_data(humid9am, humidity_cutoffs, humidity_labels, lower_inclusive_upper_exclusive=False)
humid3pm = new_table.get_column("Humidity3pm")
humid3pm = myutils.classify_continuous_data(humid3pm, humidity_cutoffs, humidity_labels, lower_inclusive_upper_exclusive=False)
pressure9am = new_table.get_column("Pressure9am")
pressure9am = myutils.classify_continuous_data(pressure9am, pressure_cutoffs, pressure_labels, lower_inclusive_upper_exclusive=False)
pressure3pm = new_table.get_column("Pressure3pm")
pressure3pm = myutils.classify_continuous_data(pressure3pm, pressure_cutoffs, pressure_labels, lower_inclusive_upper_exclusive=False)
windGust = new_table.get_column("WindGustDir")
rainToday = new_table.get_column("RainToday")

# Create the final table with the conditioning finished
dataset = [ [str(min_temp[kk])] + [str(max_temp[kk])] + [str(humid9am[kk])] + [str(humid3pm[kk])] + [str(pressure9am[kk])] + [str(pressure3pm[kk])] + [str(windGust[kk])] + [str(rainToday[kk])] for kk in range(len(new_table.data))]
table = MyPyTable(data=dataset, column_names=["MinTemp", "MaxTemp", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm",  "WindGustDir", "RainToday"])
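
The helper `myutils.classify_continuous_data` is not shown in this notebook. As a rough sketch of what such a binning step might look like, the hypothetical `bin_value` function below maps a value to a 1-based label using intervals that exclude the lower cutoff and include the upper one, which is one reading of `lower_inclusive_upper_exclusive=False`. The clamping of values above the last cutoff is an assumption, not the helper's confirmed behavior.

```python
from bisect import bisect_left


def bin_value(value, cutoffs):
    """Return a 1-based bin label for value (hypothetical helper).

    Bins are (lower, upper] intervals: a value equal to a cutoff falls
    in the bin that ends at that cutoff.
    """
    # bisect_left gives the index of the first cutoff >= value
    index = bisect_left(cutoffs, value)
    # Assumption: values above the last cutoff share the final label
    index = min(index, len(cutoffs) - 1)
    return index + 1


temp_cutoffs = [0, 5, 10, 15, 20, 25, 30]
binned = [bin_value(t, temp_cutoffs) for t in [-2, 5, 5.1, 35]]
```

With this scheme there are exactly as many labels as cutoffs, which matches how the label lists are built in the cell above.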

## Feature Selection using a Decision Tree Classifier

A preliminary round of feature selection was based on how many NA values each attribute column contained. We chose to consider only the attributes with a small percentage of NA values in order to keep a larger usable dataset. We chose not to replace the missing values, as the classification problem is not favorable to such a large number of replacements.
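
The NA screening itself is not shown in the notebook. A minimal sketch of how the share of missing values per column could be computed is given here, assuming the rows are plain lists and that missing entries are stored as the string "NA"; the function name `missing_fraction` is hypothetical.

```python
def missing_fraction(rows, column_names, missing_token="NA"):
    """Return {column name: fraction of rows whose value is missing}.

    Assumes rows is a non-empty list of equal-length lists.
    """
    n_rows = len(rows)
    return {
        name: sum(1 for row in rows if row[col] == missing_token) / n_rows
        for col, name in enumerate(column_names)
    }


# Toy data, not taken from the Sydney dataset
toy_rows = [["1", "NA"], ["2", "3"], ["NA", "NA"], ["4", "5"]]
fractions = missing_fraction(toy_rows, ["a", "b"])
```

Columns whose fraction exceeds a chosen threshold would then be left out before calling `remove_rows_with_missing_values()`.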

In [14]:
# Create the X and y data
X_data = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"])
y_data = table.get_column("RainToday")

# Create the classifier and fit it
decisionTreeClassifier = MyDecisionTreeClassifier()
myevaluation.perform_holdout_method(decisionTreeClassifier, X_data, y_data, random_state=None, shuffle=False, normalize_X=False)

# Make a visualization of the decision tree to determine the attributes of the most importance 
# in determining whether or not it will rain that day. This is to be used for feature selection
# 
# decisionTreeClassifier.visualize_tree("SydneyAUS_DecisionTree_DOT", "SydneyAUS_DecisionTree", attribute_names=["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "RainToday"])

decisionTreeClassifier.print_decision_rules(attribute_names=["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"], class_name="RainToday")


Decision Rules:

IF Humidity9am = 10 AND WindGustDir = E THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = ENE THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = ESE THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = N THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = NE THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = NNE THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = NNW THEN RainToday = No
IF Humidity9am = 10 AND WindGustDir = NW THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = S THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = SE THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = SSE THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = SSW THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = SW THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = W THEN RainToday = Yes
IF Humidity9am = 10 AND WindGustDir = WNW THEN RainToday = No
IF Humidity9am = 10 AND WindGustDir = WSW THEN Rain

### Feature Selection

Based on the decision tree classifier using entropy, the most important attributes for classification appear to be the following:

* Humidity9am
* WindGustDir
* Pressure9am
* MaxTemp

Moving forward with classification, we will use our originally determined set of attributes for the Random Forest classifier, given that, apart from WindGustDir, they all come in "pairs": Min/Max temperature, and pressure and humidity at both 9am and 3pm. We believe this will be important for generating a reliable Random Forest classifier. However, for the Decision Tree Classifier and the Naive Bayes Classifier, the shortened attribute list above will be used.
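
For reference, the entropy criterion mentioned above can be sketched as follows. This is a generic illustration of entropy and information gain for categorical attributes, not the code inside `MyDecisionTreeClassifier`; the function names and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(X, y, att_index):
    """Entropy reduction from partitioning on one categorical attribute."""
    partitions = defaultdict(list)
    for row, label in zip(X, y):
        partitions[row[att_index]].append(label)
    # Entropy of each partition, weighted by the partition's share of rows
    weighted = sum(
        len(part) / len(y) * entropy(part) for part in partitions.values()
    )
    return entropy(y) - weighted


# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not
X_toy = [["high", "E"], ["high", "W"], ["low", "E"], ["low", "W"]]
y_toy = ["Yes", "Yes", "No", "No"]
gains = [information_gain(X_toy, y_toy, i) for i in range(2)]
```

An attribute with a higher gain is split on earlier, which is why attributes near the root of the tree are read as the more important ones.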

# Classification Using Multiple Classifier Types

## Decision Tree Classifier

To begin, we chose to use a decision tree classifier and stratified k-fold validation using k=10. 

In [15]:
k_cross_validation = 10 # The number of folds for the (stratified) cross-validation

# Create the X and y data
X_train = table.get_multiple_columns(["MaxTemp", "WindGustDir", "Humidity9am", "Pressure9am"])
y_train = table.get_column("RainToday")

# Create the classifier and fit it using k-fold cross validation
decisionTreeClassifier = MyDecisionTreeClassifier()
correct_sum = 0
# Get the train-test indices for cross-validation
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, y_train, n_splits=k_cross_validation)
# Run the fitting
all_y_pred, all_y_actual = [], []
for kk in range(k_cross_validation):
    # Get the X,y train/test indices
    train_indices, test_indices = train_folds[kk], test_folds[kk]
    # Fit the classifier
    X_test_indices, y_test_indices, y_test, y_test_pred = myevaluation.fit_classifier(decisionTreeClassifier, X_train, y_train, train_indices, test_indices, train_indices, test_indices, normalize_X=False)
    # Fetch the y_test_actual values
    y_test_actual = [y_train[idx] for idx in test_indices]
    # Append these to their respective arrays
    all_y_actual += y_test_actual
    all_y_pred += y_test_pred
    # Add this fold's accuracy to the running total
    correct_sum += myutils.get_percent_correct(y_test_pred, y_test_actual)

predictive_accuracy = correct_sum / k_cross_validation
myutils.print_stratified_crossVal_results([predictive_accuracy], ["Decision Tree"], k_cross_validation, title="Accuracy Results")

Accuracy Results
Stratified 10-Fold Cross Validation
Decision Tree: accuracy = 0.7959618208516888, error rate = 0.20403817914831124
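
The function `myevaluation.stratified_kfold_cross_validation` is not shown here. One common way to build stratified folds is sketched below, assuming the same return shape as used above (a list of training index lists and a list of test index lists); the name `stratified_folds` and the round-robin dealing are assumptions rather than the library's actual implementation.

```python
from collections import defaultdict


def stratified_folds(y, n_splits):
    """Split row indices into n_splits folds with similar class proportions."""
    # Group the row indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # Deal the indices of each class round-robin across the folds
    test_folds = [[] for _ in range(n_splits)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            test_folds[position % n_splits].append(index)
            position += 1

    # The training set of fold k is every index not in test fold k
    train_folds = [
        [i for fold_no, fold in enumerate(test_folds) if fold_no != k for i in fold]
        for k in range(n_splits)
    ]
    return train_folds, test_folds


y_toy = ["Yes"] * 4 + ["No"] * 8
train_folds_toy, test_folds_toy = stratified_folds(y_toy, 4)
```

In the toy example each test fold ends up with one "Yes" and two "No" rows, mirroring the 1:2 class ratio of the full list.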


### Notes on Decision Tree Classifier

The decision tree classifier did a good job of classifying the dataset, in the sense that its predictive accuracy is higher than the proportion of the most common class label in our binary classification problem.
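
The baseline referred to here is the accuracy of a classifier that always predicts the most common label. A small sketch of that calculation follows; the function name and the toy labels are illustrative only and do not reflect the actual class balance of the Sydney data.

```python
from collections import Counter


def majority_class_baseline(y):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(y).most_common(1)[0]
    return label, count / len(y)


# Toy labels, not the real RainToday column
baseline_label, baseline_accuracy = majority_class_baseline(["No", "No", "No", "Yes"])
```

Calling `majority_class_baseline(y_train)` on the real labels would give the number that the cross-validated accuracy needs to beat.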

## Naive Bayes Classifier

Next, we move to using a Naive Bayes Classifier, also using a stratified 10-fold cross-validation scheme.

In [16]:
k_cross_validation = 10 # The number of folds for the (stratified) cross-validation

# Create the X and y data
X_train = table.get_multiple_columns(["MaxTemp", "WindGustDir", "Humidity9am", "Pressure9am"])
y_train = table.get_column("RainToday")

# create the classifier
naiveBayesClassifier = MyNaiveBayesClassifier()
correct_sum = 0
all_y_pred, all_y_actual = [], []
# Get the train-test indices for stratified cross-validation
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, y_train, n_splits=k_cross_validation)

# Fit the classifier
for kk in range(k_cross_validation):
    # Get the X,y train/test indices
    train_indices, test_indices = train_folds[kk], test_folds[kk]
    # Fit the  classifier
    X_test_indices, y_test_indices, y_test, y_test_pred =  myevaluation.fit_classifier(naiveBayesClassifier, X_train, y_train, train_indices, test_indices, train_indices, test_indices, normalize_X=False)
    # Fetch the y_test_actual values
    y_test_actual = [y_train[idx] for idx in test_indices]
    # Append these to their respective arrays
    all_y_actual += y_test_actual
    all_y_pred += y_test_pred
    # Add this fold's accuracy to the running total
    correct_sum += myutils.get_percent_correct(y_test_pred, y_test_actual)

predictive_accuracy = correct_sum / k_cross_validation
myutils.print_stratified_crossVal_results([predictive_accuracy], ["Naive Bayes"], k_cross_validation, title="Accuracy Results")

Accuracy Results
Stratified 10-Fold Cross Validation
Naive Bayes: accuracy = 0.7985895355127908, error rate = 0.20141046448720923
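
For readers unfamiliar with the method, a categorical Naive Bayes classifier of the kind used above can be sketched as follows. This is a generic illustration without smoothing, not the code inside `MyNaiveBayesClassifier`; all names and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(X, y):
    """Count class frequencies and per-class attribute value frequencies."""
    class_totals = Counter(y)
    priors = {label: count / len(y) for label, count in class_totals.items()}
    # value_counts[label][attribute index][value] = number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(X, y):
        for att_index, value in enumerate(row):
            value_counts[label][att_index][value] += 1
    return priors, value_counts, class_totals


def predict_naive_bayes(model, row):
    """Return the label with the largest prior times product of likelihoods."""
    priors, value_counts, class_totals = model
    scores = {}
    for label, prior in priors.items():
        score = prior
        for att_index, value in enumerate(row):
            # Unseen values give a likelihood of zero (no smoothing here)
            score *= value_counts[label][att_index][value] / class_totals[label]
        scores[label] = score
    return max(scores, key=scores.get)


X_toy = [["high", "E"], ["high", "W"], ["low", "E"], ["low", "W"]]
y_toy = ["Yes", "Yes", "No", "No"]
model = fit_naive_bayes(X_toy, y_toy)
prediction = predict_naive_bayes(model, ["high", "E"])
```

Because the attributes were discretized into string labels earlier, they can be treated as categorical values in exactly this way.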


### Notes on Naive Bayes Classifier

The Naive Bayes classifier did a good job of classifying the dataset, similar to the Decision Tree Classifier, in the sense that its predictive accuracy is higher than the proportion of the most common class label in our binary classification problem.

## Random Forest Classifier

Here, we generate multiple random forest classifiers while varying the variables N, M, and F defined in the code below. Several trials were run, and a summary of the results follows the output.


In [19]:
# Define the 3 random forest variables
N = 30 # Total number of decision trees to generate
M = 13 # Number of "best" trees to keep
F = 6 # Number of random attributes to select from

k_cross_validation = 10 # The number of folds for the (stratified) cross-validation

# Create the X and y data (all values were already converted to strings above)
X_train = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"])
y_train = table.get_column("RainToday")

# Create the classifier and fit it using k-fold cross validation
randForestClassifier = MyRandomForestClassifier(N, M, F)
correct_sum = 0
# Get the train-test indices for cross-validation
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, y_train, n_splits=k_cross_validation)
# Run the fitting
all_y_pred, all_y_actual = [], []
for kk in range(k_cross_validation):
    # Get the X,y train/test indices
    train_indices, test_indices = train_folds[kk], test_folds[kk]
    # Fit the classifier
    X_test_indices, y_test_indices, y_test, y_test_pred = myevaluation.fit_classifier(randForestClassifier, X_train, y_train, train_indices, test_indices, train_indices, test_indices, normalize_X=False)
    # Fetch the y_test_actual values
    y_test_actual = [y_train[idx] for idx in test_indices]
    # Append these to their respective arrays
    all_y_actual += y_test_actual
    all_y_pred += y_test_pred
    # Add this fold's accuracy to the running total
    correct_sum += myutils.get_percent_correct(y_test_pred, y_test_actual)

predictive_accuracy = correct_sum / k_cross_validation
myutils.print_stratified_crossVal_results([predictive_accuracy], ["Random Forest"], k_cross_validation, title="Accuracy Results")

Accuracy Results
Stratified 10-Fold Cross Validation
Random Forest: accuracy = 0.7964023494860498, error rate = 0.20359765051395018
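
The internals of `MyRandomForestClassifier` are not shown, so the snippet below only sketches the usual building blocks behind N, M, and F: a bootstrap sample for each of the N trees, a random subset of F candidate attributes, and a majority vote over the M retained trees. The function names are hypothetical, and how the "best" M trees are ranked (for example, by accuracy on the out-of-bag rows) is an assumption.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows, rng):
    """Sample n_rows indices with replacement; also return the unused rows."""
    sampled = [rng.randrange(n_rows) for _ in range(n_rows)]
    sampled_set = set(sampled)
    out_of_bag = [i for i in range(n_rows) if i not in sampled_set]
    return sampled, out_of_bag


def random_attribute_subset(attribute_indices, F, rng):
    """Pick F distinct candidate attributes (all of them if fewer exist)."""
    return rng.sample(attribute_indices, min(F, len(attribute_indices)))


def majority_vote(predictions):
    """Combine the predictions of several trees for a single instance."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled, out_of_bag = bootstrap_sample(10, rng)
candidates = random_attribute_subset(list(range(7)), 6, rng)
vote = majority_vote(["Yes", "No", "Yes"])
```

Since both the bootstrap samples and the attribute subsets are random, two runs with the same N, M, and F can give slightly different accuracies unless a seed is fixed.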


### Variable Modification Notes

In this case, we vary N, M, and F in an attempt to generate the best random forest classifier. A summary of the results is given below:

* N=20, M=7, F=2
    * Accuracy = 0.752 -- Our thought is that F=2 is a little too small. Increasing F for the next trial.
* N=20, M=7, F=3
    * Accuracy = 0.769 -- Only a slight improvement. Continuing with experimental changes.
* N=20, M=9, F=5
    * Accuracy = 0.792 -- Noticeable improvement from increasing F. Increasing N and M as well.
* N=30, M=11, F=5
    * Accuracy = 0.774 -- No improvement (slightly worse). Next, increasing M and F.
* N=30, M=13, F=6
    * Accuracy = 0.801 -- Better than the single decision tree classifier. Increasing N and M a little more.
* N=40, M=17, F=6
    * Accuracy = 0.801 -- No improvement and a longer run time. Sticking with the previous trial. Out of curiosity, trying lower N and M.
* N=20, M=7, F=6
    * Accuracy = 0.796

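From the above trials, the best combination of variables for our dataset appears to be N=30, M=13, and F=6. At 0.801, this has the best accuracy of all the classifiers that were tried, and it is what will be used for our Flask app deployment. Note that the run shown above with these same settings reported 0.796, so the accuracy varies somewhat between runs.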


# Saving the Best Classifier

We want to upload our best classifier for use in a Heroku web app, so we first save it here. Note that the type of classifier determines how the weather_app.py file needs to be structured.

The app's Heroku Git repository is at: https://git.heroku.com/sydney-aus-weather-predictor.git

In [10]:
# Run the best classifier on all of the available data
bestClassifier = MyRandomForestClassifier(30, 13, 6)

X_train = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"])
y_train = table.get_column("RainToday")
bestClassifier.fit(X_train, y_train)

# Pickle the forest (header and trees) to use in the Heroku Web App
packaged_object = [bestClassifier.header, bestClassifier.trees]
# Write the packaged object to disk; the with-block closes the file even on error
with open(os.path.join("Heroku_Files", "bestClassifier.p"), "wb") as outfile:
    pickle.dump(packaged_object, outfile)
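
On the app side, the pickled `[header, trees]` list has to be read back before predictions can be made. The round trip below is a minimal sketch using placeholder values and a temporary directory; the helper names `save_model` and `load_model` are hypothetical, and the real weather_app.py may be organized differently.

```python
import os
import pickle
import tempfile


def save_model(path, header, trees):
    """Pickle the header and trees together as a two-element list."""
    with open(path, "wb") as outfile:
        pickle.dump([header, trees], outfile)


def load_model(path):
    """Read the header and trees back in the same order they were saved."""
    with open(path, "rb") as infile:
        header, trees = pickle.load(infile)
    return header, trees


# Round trip with placeholder values instead of a fitted forest
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "bestClassifier.p")
    save_model(model_path, ["att0", "att1"], [["Leaf", "Yes"]])
    loaded_header, loaded_trees = load_model(model_path)
```

Storing only the header and trees keeps the pickle free of the classifier class itself; the loading side then needs its own code to walk the trees and take the majority vote.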