# Weather Classification -- Sydney, Australia
## Scott Campbell, Matthew Triebes

### Dataset description
We sourced our dataset from Kaggle.com: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

When we examined the dataset, we found it too large to analyze in full. We therefore focus on the rain data from Sydney, Australia, which limits the data to about 3400 instances and makes the analysis much more manageable.

With that in mind, we made separate files for the 9am and 3pm data. Comparing the two sets of results should show whether there is much of a difference between them.

### Imports

In [78]:
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier, MyDecisionTreeClassifier, MyRandomForestClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

## Loading the Dataset into a Data Science Table

In [79]:
table = MyPyTable()

table.load_from_file("Sydney_weather.csv")

x = table.get_column("MinTemp", False)
x_float = []
y = table.get_column("Rainfall", False)
y_float = []

# Keep only the rows where both values are present, converting them to floats
for x_val, y_val in zip(x, y):
    if x_val != 'NA' and y_val != 'NA':
        x_float.append(float(x_val))
        y_float.append(float(y_val))

### Discretizing Continuous Attributes

Several attributes in the data table are continuous and must be discretized before they can be used by the classifiers, including the Random Forest Classifier.

In [80]:
# Define the cutoffs and labels for the continuous attributes
temp_cutoffs = [0, 5, 10, 15, 20, 25, 30]
temp_labels = [kk+1 for kk in range(len(temp_cutoffs))]
humidity_cutoffs = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
humidity_labels = [kk+1 for kk in range(len(humidity_cutoffs))]
pressure_cutoffs = [950, 960, 970, 980, 990, 1000, 1010, 1020, 1030, 1040, 1050]
pressure_labels = [kk+1 for kk in range(len(pressure_cutoffs))]
# Get the attributes of interest from the datatable
subdataset = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "RainToday"])
new_table = MyPyTable(data=subdataset, column_names=["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "RainToday"])
# Remove all instances with NA
new_table.remove_rows_with_missing_values()
# Discretize the continuous attributes (temperature, humidity, pressure) into labeled bins
min_temp = new_table.get_column("MinTemp")
min_temp = myutils.classify_continuous_data(min_temp, temp_cutoffs, temp_labels, lower_inclusive_upper_exclusive=False)
max_temp = new_table.get_column("MaxTemp")
max_temp = myutils.classify_continuous_data(max_temp, temp_cutoffs, temp_labels, lower_inclusive_upper_exclusive=False)
humid9am = new_table.get_column("Humidity9am")
humid9am = myutils.classify_continuous_data(humid9am, humidity_cutoffs, humidity_labels, lower_inclusive_upper_exclusive=False)
humid3pm = new_table.get_column("Humidity3pm")
humid3pm = myutils.classify_continuous_data(humid3pm, humidity_cutoffs, humidity_labels, lower_inclusive_upper_exclusive=False)
pressure9am = new_table.get_column("Pressure9am")
pressure9am = myutils.classify_continuous_data(pressure9am, pressure_cutoffs, pressure_labels, lower_inclusive_upper_exclusive=False)
pressure3pm = new_table.get_column("Pressure3pm")
pressure3pm = myutils.classify_continuous_data(pressure3pm, pressure_cutoffs, pressure_labels, lower_inclusive_upper_exclusive=False)
windGust = new_table.get_column("WindGustDir")
rainToday = new_table.get_column("RainToday")
# Create the final table with the conditioning finished
# (the value order in each row must match the column_names order used below)
dataset = [[min_temp[kk], max_temp[kk], windGust[kk], humid9am[kk], humid3pm[kk], pressure9am[kk], pressure3pm[kk], rainToday[kk]] for kk in range(len(new_table.data))]
table = MyPyTable(data=dataset, column_names=["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "RainToday"])
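
The discretization step above maps each continuous value to the label of the bin it falls in. As a rough illustration, here is a minimal standalone sketch using only the standard library. The helper name `discretize` is hypothetical, and its upper-inclusive binning is an assumption about what `lower_inclusive_upper_exclusive=False` means; it is not necessarily how `myutils.classify_continuous_data` behaves.

```python
from bisect import bisect_left


def discretize(values, cutoffs, labels):
    """Map each numeric value to the label of the bin it falls in.

    Hypothetical helper (not the myutils implementation). Bins are
    upper-inclusive: a value v receives labels[i] for the smallest i
    with v <= cutoffs[i]. Values above the last cutoff receive the
    last label.
    """
    if len(cutoffs) != len(labels):
        raise ValueError("cutoffs and labels must have the same length")
    binned = []
    for v in values:
        i = bisect_left(cutoffs, v)  # first index with cutoffs[i] >= v
        binned.append(labels[min(i, len(labels) - 1)])
    return binned


temp_cutoffs = [0, 5, 10, 15, 20, 25, 30]
temp_labels = [kk + 1 for kk in range(len(temp_cutoffs))]
print(discretize([-2.0, 4.9, 5.0, 5.1, 31.0], temp_cutoffs, temp_labels))
# prints [1, 2, 2, 3, 7]
```

Under this assumption a temperature of exactly 5.0 shares a bin with 4.9, while 5.1 moves to the next bin.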

## Classification using a Decision Tree Classifier

To begin, we use a decision tree classifier evaluated with stratified k-fold cross-validation (k=10).

In [81]:
k_cross_validation = 10 # The number of folds for the (stratified) cross-validation

# Create the X and y data
X_train = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"])
y_train = table.get_column("RainToday")

# Create the classifier and fit it using k-fold cross validation
decisionTreeClassifier = MyDecisionTreeClassifier()
correct_sum = 0
# Get the train-test indices for cross-validation
train_folds, test_folds = myevaluation.stratified_kfold_cross_validation(X_train, y_train, n_splits=k_cross_validation)
# Run the fitting
all_y_pred, all_y_actual = [], []
for kk in range(k_cross_validation):
    # Get the X,y train/test indices
    train_indices, test_indices = train_folds[kk], test_folds[kk]
    # Fit the classifier
    X_test_indices, y_test_indices, y_test, y_test_pred = myevaluation.fit_classifier(decisionTreeClassifier, X_train, y_train, train_indices, test_indices, train_indices, test_indices, normalize_X=False)
    # Fetch the y_test_actual values
    y_test_actual = [y_train[idx] for idx in test_indices]
    # Append these to their respective arrays
    all_y_actual += y_test_actual
    all_y_pred += y_test_pred
    # Add this fold's accuracy to the running total
    correct_sum += myutils.get_percent_correct(y_test_pred, y_test_actual)

predictive_accuracy = correct_sum / k_cross_validation
myutils.print_stratified_crossVal_results([predictive_accuracy], ["Decision Tree"], k_cross_validation, title="Accuracy Results")

Accuracy Results
Stratified 10-Fold Cross Validation
Decision Tree: accuracy = 0.7950865600123658, error rate = 0.20491343998763423
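
For readers unfamiliar with stratified k-fold cross-validation, the idea is that every fold keeps roughly the same class proportions as the full dataset. The sketch below is a minimal, deterministic illustration using only the standard library. The function name `stratified_kfold_indices` is hypothetical, and this is not the implementation behind `myevaluation.stratified_kfold_cross_validation`, which may shuffle or partition differently.

```python
from collections import defaultdict


def stratified_kfold_indices(y, n_splits):
    """Return (train_folds, test_folds), each a list of n_splits index lists.

    Hypothetical sketch: indices are grouped by class label and then dealt
    round-robin into the folds, so each fold receives a near-equal share
    of every class. No shuffling is done, to keep the example deterministic.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(y):
        by_label[label].append(idx)

    test_folds = [[] for _ in range(n_splits)]
    position = 0
    for label in sorted(by_label):
        for idx in by_label[label]:
            test_folds[position % n_splits].append(idx)
            position += 1

    train_folds = []
    for k in range(n_splits):
        held_out = set(test_folds[k])
        train_folds.append([i for i in range(len(y)) if i not in held_out])
    return train_folds, test_folds


# Made-up labels for illustration only: six "No" and three "Yes"
labels = ["No"] * 6 + ["Yes"] * 3
train_folds, test_folds = stratified_kfold_indices(labels, n_splits=3)
for test in test_folds:
    print(test, [labels[i] for i in test])
```

With these made-up labels, each of the three test folds holds two "No" instances and one "Yes" instance, mirroring the 2:1 ratio of the full list.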


### Notes on Decision Tree Classifier

The decision tree classifier did a good job of classifying the dataset, in the sense that its predictive accuracy (about 0.795) exceeds the share of the most common class label in our binary classification problem. That share is the accuracy a classifier would achieve by always predicting the majority class.
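
The comparison made above, predictive accuracy versus the share of the most common class label, can be checked with a small helper such as the one sketched here. The name `majority_class_baseline` is hypothetical and the toy labels are made up for illustration; they are not the Sydney class distribution.

```python
from collections import Counter


def majority_class_baseline(y):
    """Return the most common label and the fraction of instances carrying it.

    The fraction equals the accuracy of a classifier that always predicts
    the majority class.
    """
    counts = Counter(y)
    label, count = counts.most_common(1)[0]
    return label, count / len(y)


# Toy labels for illustration only
toy_labels = ["No", "No", "No", "Yes"]
label, share = majority_class_baseline(toy_labels)
print(label, share)  # prints: No 0.75
```

Applied to the notebook's data, the call would be `majority_class_baseline(y_train)`, and the classifier is only informative if its cross-validated accuracy exceeds the returned share.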

## Classification using a Random Forest Classifier


# IN PROGRESS


In [82]:
# Define the 3 random forest variables
N = 20 # Total number of decision trees to generate
M = 7 # Number of "best" trees to keep
F = 2 # Number of random attributes to select from

# First, make sure everything is represented as a string
X = table.get_multiple_columns(["MinTemp", "MaxTemp", "WindGustDir", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"])
y = table.get_column("RainToday")

for row in range(len(X)):
    y[row] = str(y[row])
    for col in range(len(X[0])):
        X[row][col] = str(X[row][col])

# Generate a random stratified test set consisting of one third of the original data set, 
# with the remaining two thirds of the instances forming the "remainder set".
test_X, test_y, remainder_X, remainder_y = myevaluation.random_stratified_train_test_split(X, y, test_size=0.33)

# Create a classifier and fit it
randForestClassifier = MyRandomForestClassifier(N, M, F)
randForestClassifier.fit(remainder_X, remainder_y)

NameError: name 'M' is not defined
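
The roles of `N`, `M`, and `F` defined above can be illustrated without a full tree learner. The sketch below shows three building blocks commonly used in a random forest: drawing a bootstrap sample for each of the N trees, picking F random candidate attributes, and keeping the M most accurate trees. All function names are hypothetical, and this is not the internals of `MyRandomForestClassifier`.

```python
import random


def bootstrap_sample(X, y, rng):
    """Draw len(X) instances with replacement.

    Returns the sampled X, the sampled y, and the out-of-bag indices
    (instances never drawn), which can serve as a validation set.
    """
    n = len(X)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    out_of_bag = [i for i in range(n) if i not in chosen_set]
    return [X[i] for i in chosen], [y[i] for i in chosen], out_of_bag


def random_attribute_subset(attribute_indices, F, rng):
    """Pick F distinct attributes to consider for a split (all if F is too large)."""
    if F >= len(attribute_indices):
        return list(attribute_indices)
    return rng.sample(attribute_indices, F)


def keep_best(trees_with_accuracy, M):
    """Keep the M (tree, accuracy) pairs with the highest accuracy."""
    ranked = sorted(trees_with_accuracy, key=lambda pair: pair[1], reverse=True)
    return ranked[:M]


rng = random.Random(0)  # seeded only so repeated runs draw the same sample
# Made-up toy data for illustration only
X_toy = [[str(i), str(i % 2)] for i in range(6)]
y_toy = ["Yes" if i % 2 else "No" for i in range(6)]

X_s, y_s, oob = bootstrap_sample(X_toy, y_toy, rng)
candidates = random_attribute_subset(list(range(7)), 2, rng)
print(len(X_s), oob, candidates)
```

In a full forest, each of the N trees would be fit on its own bootstrap sample, scored on a validation set, and passed through `keep_best` so that only M trees vote on a prediction.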