# Transformations

* Use LogEvaporation instead of Evaporation
* It may be worthwhile to add Month as a feature
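
A minimal sketch of both transformations on a toy frame. The column name `Date`, the sample values, and the use of `np.log1p` (rather than a plain log, so that zero evaporation stays finite) are assumptions for illustration, not something these notes have settled on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; the values are made up.
df = pd.DataFrame({
    "Date": ["2010-01-15", "2010-07-03", "2011-12-24"],
    "Evaporation": [0.0, 4.2, np.nan],
})

# Assumption: log1p(x) = log(1 + x), so Evaporation == 0 maps to 0
# instead of -inf. Missing values stay missing.
df["LogEvaporation"] = np.log1p(df["Evaporation"])

# Parse the date string and keep only the month number (1-12).
df["Month"] = pd.to_datetime(df["Date"]).dt.month

print(df)
```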

# Thoughts

* Use random forest method
* Extract Month from the date, and use that instead of the full date
* Advice for handling missing data: https://www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation
* sklearn has methods for random forest classification and k-fold cross validation
* Compute percentage missing: https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
* Use k-fold cross validation (maybe set aside 20% as test set, and use k=5 for rest?)
* Features with a lot of missing data (Sunshine, Evaporation, Cloud9am, Cloud3pm):
    - Either remove these features, or keep them and impute. With this much data missing, imputation might be risky
* Other features with a small amount of missing data:
    - Either impute with the mean, or remove the rows with missing data
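
The missing-data options above can be sketched roughly as follows. The toy values and the 40% cut-off are hypothetical. The point is only to show how the percentage missing per column can drive the remove-or-impute choice.

```python
import numpy as np
import pandas as pd

# Toy frame: Sunshine is mostly missing, MinTemp has one gap.
df = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, 7.1, np.nan],
    "MinTemp": [8.4, 10.2, np.nan, 20.3],
    "MaxTemp": [16.7, 15.0, 23.6, 26.2],
})

# Percentage of missing values per column, highest first.
pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing)

# Hypothetical cut-off: drop columns with more than 40% missing.
threshold = 40.0
sparse_cols = pct_missing[pct_missing > threshold].index.tolist()
reduced = df.drop(columns=sparse_cols)

# Option A: fill the remaining gaps with the column mean.
imputed = reduced.fillna(reduced.mean())

# Option B: drop the rows that still contain a missing value.
complete_rows = reduced.dropna()
```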

# Initial preprocessing

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

* Don't use Rainfall, use RainToday as a categorical variable
* If using Evaporation, apply a log transformation. Include either Evaporation or LogEvaporation in the model, not both, and see which one performs better
* Use Month (categorical/factor variable)
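
One way these choices might look in pandas, as a sketch on made-up rows. The `"Yes"`/`"No"` coding of RainToday and the `log1p` variant of the log transform are assumptions. The two feature lists exist only so that the same model can later be fit once with each evaporation variant.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the raw data.
df = pd.DataFrame({
    "Month": [1, 7, 12, 7],
    "RainToday": ["No", "Yes", None, "No"],  # assumed coding
    "Rainfall": [0.0, 5.4, np.nan, 0.2],
    "Evaporation": [2.0, 0.0, 6.4, np.nan],
})
df["LogEvaporation"] = np.log1p(df["Evaporation"])  # assumed log variant

# Drop the numeric Rainfall; RainToday carries the rain information.
df = df.drop(columns=["Rainfall"])

# Mark Month and RainToday as categorical rather than numeric/text.
df["Month"] = df["Month"].astype("category")
df["RainToday"] = df["RainToday"].astype("category")

# Two candidate feature sets that differ only in the evaporation variant.
shared = ["Month", "RainToday"]
features_raw = shared + ["Evaporation"]
features_log = shared + ["LogEvaporation"]

print(df[features_raw].dtypes)
print(df[features_log].dtypes)
```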

In [3]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
import pandas, numpy

pandas.options.display.max_columns = 500


# Model 1

* Don't use Sunshine, Evaporation, Cloud9am, or Cloud3pm
* For remaining numerical features, impute missing data 
* For remaining categorical features, use dummy indicator variables for each category; ignore (don't remove) if data missing
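
A rough sketch of the three steps on a toy frame. It is not the contents of `produce_dataframe`: mean imputation is assumed here because the notes do not fix the method, and `WindGustDir` is a hypothetical categorical column used for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MinTemp": [8.4, np.nan, 8.5, 20.3],
    "Humidity9am": [68.0, 76.0, np.nan, 82.0],
    "Sunshine": [np.nan, np.nan, 7.1, np.nan],
    "WindGustDir": ["N", None, "SW", "N"],  # hypothetical categorical column
})

# Step 1: drop the features with a lot of missing data (if present).
dropped = ["Sunshine", "Evaporation", "Cloud9am", "Cloud3pm"]
df = df.drop(columns=[c for c in dropped if c in df.columns])

# Step 2: impute numerical features (assumption: column mean).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Step 3: one indicator column per category. A missing value is not
# removed; that row simply gets 0 in every indicator column.
categorical_cols = [c for c in df.columns if c not in numeric_cols]
model_df = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(model_df)
```

Computing the means on the training split only, then reusing them for the test split, would avoid leaking test information into the imputation.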

### Model 1 preprocessing

In [4]:
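import os
import sys

# Make the sibling `models` directory importable from this notebook
sys.path.append(os.path.join(os.getcwd(), '..', 'models'))
from model1.build_features import produce_dataframe

# Build the Model 1 feature table and hold out 20% of the rows as a test set
df_model1 = produce_dataframe()
train1, test1 = train_test_split(df_model1, test_size=0.2)

# Show the first rows and summary statistics of each split
print('========\n Train \n========')
print(train1.head())
print(train1.describe(include='all'))
print('========\n Test \n========')
print(test1.head())
print(test1.describe(include='all'))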


 Train 
       MinTemp  MaxTemp  WindGustSpeed  WindSpeed9am  WindSpeed3pm  \
67613      8.4     16.7           41.0          19.0          22.0   
77180     10.2     15.0           30.0          13.0          15.0   
67126      8.5     23.6           26.0           9.0          15.0   
37651     20.3     26.2           31.0          11.0          15.0   
73823      9.9     20.2           46.0          17.0          26.0   

       Humidity9am  Humidity3pm  Pressure9am  Pressure3pm  Temp9am  Temp3pm  \
67613         68.0         50.0       1027.3       1024.9     10.9     15.8   
77180         76.0         77.0       1028.7       1027.3     12.9     14.3   
67126         83.0         46.0       1024.7       1021.5     14.0     22.7   
37651         82.0         77.0       1016.4       1014.6     21.2     24.4   
73823         75.0         51.0       1014.2       1015.5     15.1     17.6   

       Month_1  Month_2  Month_3  Month_4  Month_5  Month_6  Month_7  Month_8  \
67613        0 

### Model 1 fitting

* k-fold cross validation: https://scikit-learn.org/0.20/modules/generated/sklearn.model_selection.KFold.html
* Random forest
    * https://scikit-learn.org/0.20/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    * https://machinelearningmastery.com/random-forest-ensemble-in-python/ (tune the number of trees and the depth of the trees)
* Use a confusion matrix to assess the model (false positive/negative rates):
    * https://scikit-learn.org/0.20/modules/generated/sklearn.metrics.confusion_matrix.html

In [None]:
# for number_of_trees in 10, 50, 100, 500, 1000, 1500:
#     k-fold cross validation on train1 with k=10
#     for each of the splits:
#         get X, y from the dataframe
#         fit the random forest classifier
#         show the confusion matrix, then calculate the F1 score
# plot F1 score against number_of_trees
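
The plan above might be fleshed out along these lines. This is a sketch on synthetic data from `make_classification`, since the layout of `train1` and its target column are not shown here. The list of tree counts is also shortened so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import KFold

# Synthetic, imbalanced stand-in for (X, y) taken from train1.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

tree_counts = [10, 50, 100]  # shortened from 10, 50, 100, 500, 1000, 1500
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

mean_f1 = {}
for n_trees in tree_counts:
    fold_scores = []
    total_cm = np.zeros((2, 2), dtype=int)  # confusion matrix summed over folds
    for train_idx, val_idx in kfold.split(X):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[val_idx])
        total_cm += confusion_matrix(y[val_idx], pred, labels=[0, 1])
        fold_scores.append(f1_score(y[val_idx], pred))
    mean_f1[n_trees] = float(np.mean(fold_scores))
    print(n_trees, total_cm.tolist(), round(mean_f1[n_trees], 3))
```

The `mean_f1` dictionary holds one averaged F1 score per tree count, which is the series to plot against `number_of_trees`.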

# Model 2

* Use all or some of Sunshine, (Log)Evaporation, Cloud9am, Cloud3pm. Impute missing data
* For remaining numerical features, impute missing data 
* For remaining categorical features, use dummy indicator variables for each category; ignore (don't remove) if data missing
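
A possible sketch for the sparse features. Median imputation and the extra `*_missing` flag columns are assumptions added here for illustration; the flags are one way to let the model see which values were imputed, given how much of these columns is missing. The rows are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sunshine": [np.nan, 9.1, np.nan, 3.0],
    "Cloud9am": [7.0, np.nan, np.nan, 1.0],
    "MaxTemp": [16.7, 15.0, 23.6, 26.2],
})

# "All or some" of the sparse features; this subset is arbitrary.
sparse_cols = ["Sunshine", "Cloud9am"]

for col in sparse_cols:
    # Record which rows were originally missing before filling them in.
    df[col + "_missing"] = df[col].isna().astype(int)
    # Assumption: impute with the median of the observed values.
    df[col] = df[col].fillna(df[col].median())

print(df)
```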