# CoSMO Machine Learning Workshop

## Prerequisites:

## Kaggle
We will be using Kaggle for this workshop, so head to [kaggle.com](https://www.kaggle.com) and sign up for an account.    
You can sign up with Google or Facebook (or just manually create an account).    
We will be using the "Rain in Australia" [dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package), so search for that and select it (or choose your own if you want to be creative!)

#### Making a Kaggle Kernel
Once you've selected a Kaggle dataset, choose "New Kernel" and then choose "Notebook".   
Notebooks are a great way to work on a machine learning project iteratively. You run code in chunks called cells, one cell at a time.   
Because each cell runs on its own, you can make quick changes without rerunning all of your code.   
You can also add Markdown cells (like this one) to explain to your audience what you are doing.

## Python
We will be using Python to do data processing and model training.    
If you aren't familiar with Python, you can check out the basics from one of our previous [workshops](https://github.com/NUCoSMO/Workshops/blob/master/PythonBeginnerWorkshop.md).    
Three extremely common Python libraries for machine learning are NumPy, pandas, and scikit-learn (imported as `sklearn`).

#### Numpy
NumPy provides fast n-dimensional arrays plus a large set of mathematical functions that operate on whole arrays, matrices, and pandas DataFrames at once.   

Check out numpy [documentation](https://docs.scipy.org/doc/numpy/reference/index.html), and don't be afraid to Google!
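
As a small illustration of applying functions to a whole array at once, here is a minimal sketch. The temperature values are made up for this example and are not taken from the workshop dataset.

```python
import numpy as np

# Hypothetical daily maximum temperatures in Celsius (illustrative values only)
temps = np.array([22.9, 25.1, 25.7, 28.0, 32.3])

mean_temp = np.mean(temps)      # average over all values
hot_days = temps > 26.0         # element-wise comparison gives a boolean array
temps_f = temps * 9 / 5 + 32    # arithmetic is applied to every element at once

print(mean_temp)
print(hot_days)
print(temps_f)
```

No explicit loop is needed: each operation works on every element of the array.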

#### Pandas
Pandas defines "DataFrames", one of the most common data types when working on machine learning projects.   
DataFrames are basically tables of data, where each row is a data point and each column is a feature of that data point.   
Many machine learning models available in Python accept DataFrames as input.

Check out pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/reference/)!
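
To make the "rows are data points, columns are features" idea concrete, here is a minimal sketch with a tiny hand-made DataFrame. The column names mirror the weather dataset, but the rows are illustrative only.

```python
import pandas as pd

# A tiny hand-made table; each row is one observation, each column one feature
df = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney"],
    "MaxTemp": [22.9, 25.1, 27.4],
    "RainToday": ["No", "No", "Yes"],
})

print(df.shape)                        # (number of rows, number of columns)
print(df["MaxTemp"].mean())            # a single column supports array-style math
rainy = df[df["RainToday"] == "Yes"]   # boolean filtering keeps the matching rows
print(rainy)
```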

#### Sklearn
Sklearn (scikit-learn) is one of Python's most commonly used packages for building machine learning models.   
Sklearn implements a number of machine learning models, as well as additional functions for training and evaluation.

Check out sklearn [documentation](https://scikit-learn.org/stable/)!
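
Most sklearn models follow the same two-step pattern: `fit` on training data, then `predict` on new data. The sketch below shows that pattern on toy data, using a decision tree as a stand-in; it is not the model used later in this workshop.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature per row; the label is 1 when the feature is large
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)                          # learn from features and labels
preds = clf.predict([[0.5], [9.5]])    # predict labels for unseen points
print(preds)
```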

In [10]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['weatherAUS.csv']


# Step 0: Have a data source


In [11]:
## Load your dataset, use pd.read_csv(...) if working with csvs
all_data = pd.read_csv('../input/weatherAUS.csv')

## Inspect your data set (dimensions, columns, first few rows)
print(all_data.shape)
print(all_data.columns)
all_data.head(5)

(142193, 24)
Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RISK_MM', 'RainTomorrow'],
      dtype='object')


Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,0.0,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,0.0,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,0.0,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,1.0,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,0.2,No


In [12]:
## Look for null values in your columns (see which columns you want to use)

pd.isnull(all_data).sum(axis=0) * 100 / all_data.shape[0]

Date              0.000000
Location          0.000000
MinTemp           0.447983
MaxTemp           0.226453
Rainfall          0.988797
Evaporation      42.789026
Sunshine         47.692924
WindGustDir       6.561504
WindGustSpeed     6.519308
WindDir9am        7.041838
WindDir3pm        2.656952
WindSpeed9am      0.948007
WindSpeed3pm      1.849599
Humidity9am       1.247600
Humidity3pm       2.538803
Pressure9am       9.855619
Pressure3pm       9.832411
Cloud9am         37.735332
Cloud3pm         40.152469
Temp9am           0.635756
Temp3pm           1.917113
RainToday         0.988797
RISK_MM           0.000000
RainTomorrow      0.000000
dtype: float64

# Step 1: Generate features

In [13]:
### Generate features from total dataset (subset columns, apply transformations, etc)

## Determine which raw feature columns to use
feat_cols = ['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 
            'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
            'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
            'Pressure9am', 'Pressure3pm','Temp9am','Temp3pm', 'RainToday']

## Determine your label column
output_col = "RainTomorrow"

## Split up features (X) from labels (Y)
features = all_data[feat_cols]
labels = all_data[output_col]
print(features.shape)

## Remove rows with nulls (there are other ways to deal with null values, but this is simplest for now)
non_null_rows = pd.isnull(features).sum(axis=1) == 0
features = features[non_null_rows].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the column assignments below
labels = labels[non_null_rows]

## Map categorical value to binary
features["RainToday"] = features["RainToday"].map({"Yes":1, "No":0})
labels = labels.map({"Yes":1, "No":0})

## Extract feature from existing column (Month from Date)
features["Date"] = pd.to_datetime(features["Date"])
features["Month"] = features["Date"].dt.month

## Drop extra column
features.drop("Date", axis=1, inplace=True)

print(features.shape)
features.head(10)

(142193, 18)
(112925, 18)


Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainToday,Month
0,Albury,13.4,22.9,0.6,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,16.9,21.8,0,12
1,Albury,7.4,25.1,0.0,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,17.2,24.3,0,12
2,Albury,12.9,25.7,0.0,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,21.0,23.2,0,12
3,Albury,9.2,28.0,0.0,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,18.1,26.5,0,12
4,Albury,17.5,32.3,1.0,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,17.8,29.7,0,12
5,Albury,14.6,29.7,0.2,WNW,56.0,W,W,19.0,24.0,55.0,23.0,1009.2,1005.4,20.6,28.9,0,12
6,Albury,14.3,25.0,0.0,W,50.0,SW,W,20.0,24.0,49.0,19.0,1009.6,1008.2,18.1,24.6,0,12
7,Albury,7.7,26.7,0.0,W,35.0,SSE,W,6.0,17.0,48.0,19.0,1013.4,1010.1,16.3,25.5,0,12
8,Albury,9.7,31.9,0.0,NNW,80.0,SE,NW,7.0,28.0,42.0,9.0,1008.9,1003.6,18.3,30.2,0,12
9,Albury,13.1,30.1,1.4,W,28.0,S,SSE,15.0,11.0,58.0,27.0,1007.0,1005.7,20.1,28.2,1,12
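
The cell above drops every row that has a null. One common alternative is to fill the gaps instead (imputation). The following is a minimal sketch on a made-up toy frame, assuming median fill for numeric columns and most-frequent fill for categorical ones; the workshop itself does not use this approach.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric value and one missing category
toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "WindGustDir": ["W", "WNW", None, "W"],
})

filled = toy.copy()
# Numeric column: replace nulls with the column median
filled["MinTemp"] = filled["MinTemp"].fillna(filled["MinTemp"].median())
# Categorical column: replace nulls with the most frequent value
filled["WindGustDir"] = filled["WindGustDir"].fillna(filled["WindGustDir"].mode()[0])

print(filled)
```

In a real project the fill values would normally be computed from the training split only, so that no information from the test set leaks into training.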


In [14]:
## Convert categorical columns to dummy variables
dummy_cols = ["Location", "WindGustDir", "WindDir9am", "WindDir3pm", "Month"]
features = pd.get_dummies(data=features, columns=dummy_cols)

print(features.shape)
features.head()

(112925, 117)


Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainToday,Location_Adelaide,Location_Albury,Location_AliceSprings,Location_BadgerysCreek,Location_Ballarat,Location_Bendigo,Location_Brisbane,Location_Cairns,Location_Canberra,Location_Cobar,Location_CoffsHarbour,Location_Dartmoor,Location_Darwin,Location_GoldCoast,Location_Hobart,Location_Katherine,Location_Launceston,Location_Melbourne,Location_MelbourneAirport,Location_Mildura,Location_Moree,Location_MountGambier,Location_Nhil,Location_NorahHead,Location_NorfolkIsland,Location_Nuriootpa,Location_PearceRAAF,...,WindDir9am_NE,WindDir9am_NNE,WindDir9am_NNW,WindDir9am_NW,WindDir9am_S,WindDir9am_SE,WindDir9am_SSE,WindDir9am_SSW,WindDir9am_SW,WindDir9am_W,WindDir9am_WNW,WindDir9am_WSW,WindDir3pm_E,WindDir3pm_ENE,WindDir3pm_ESE,WindDir3pm_N,WindDir3pm_NE,WindDir3pm_NNE,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW,Month_1,Month_2,Month_3,Month_4,Month_5,Month_6,Month_7,Month_8,Month_9,Month_10,Month_11,Month_12
0,13.4,22.9,0.6,44.0,20.0,24.0,71.0,22.0,1007.7,1007.1,16.9,21.8,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,7.4,25.1,0.0,44.0,4.0,22.0,44.0,25.0,1010.6,1007.8,17.2,24.3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
2,12.9,25.7,0.0,46.0,19.0,26.0,38.0,30.0,1007.6,1008.7,21.0,23.2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
3,9.2,28.0,0.0,24.0,11.0,9.0,45.0,16.0,1017.6,1012.8,18.1,26.5,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,17.5,32.3,1.0,41.0,7.0,20.0,82.0,33.0,1010.8,1006.0,17.8,29.7,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
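
To see what `pd.get_dummies` does on a smaller scale, here is a minimal sketch with a made-up three-row frame. Each distinct category becomes its own indicator column, while numeric columns pass through untouched. Recent pandas versions return these indicator columns as booleans rather than 0/1, so the sketch converts them to integers for display.

```python
import pandas as pd

toy = pd.DataFrame({
    "WindGustDir": ["W", "NE", "W"],
    "MinTemp": [13.4, 9.2, 17.5],
})

# One new column per distinct wind direction, named <column>_<value>
encoded = pd.get_dummies(data=toy, columns=["WindGustDir"])

print(encoded.columns.tolist())
print(encoded["WindGustDir_W"].astype(int).tolist())
```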


# Step 2: Split up your data

In [15]:
## import sklearn.model_selection.train_test_split

from sklearn.model_selection import train_test_split

In [16]:
## Split your features and labels into training and testing sets 

x_train, x_test, y_train, y_test = train_test_split(features, labels, train_size=0.80, test_size=0.20)

print("X train: {}  X test:{}".format(x_train.shape, x_test.shape))
print("Y train: {}  Y test:{}".format(y_train.shape, y_test.shape))

X train: (90340, 117)  X test:(22585, 117)
Y train: (90340,)  Y test:(22585,)
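
The split above is random, so the exact rows in each set change on every run. Two optional arguments of `train_test_split` are worth knowing: `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both sets. A minimal sketch on synthetic data (the arrays here are made up, not the weather features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 20% positive labels
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # positive rate is preserved in both splits
```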


# Step 3: Pick a model and hyperparameters

In [18]:
## import sklearn.ensemble.RandomForestClassifier
## import sklearn.model_selection.GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [19]:
## For this workshop we're using Random Forest, define hyperparameters and grid search
rf_params = {"n_estimators":[10, 50, 100], "max_depth":[3, 6]}

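## GridSearchCV combines grid search for choosing hyperparameters with cross validation
## This one line of code is incredibly powerful: fitting it trains 3 x 2 hyperparameter combinations x 3 folds = 18 models,
## then refits the best combination once more on the full training set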
grid_search = GridSearchCV(RandomForestClassifier(class_weight="balanced"), rf_params, cv=3)

# Step 4: Train your model

In [20]:
## Fit your grid search object (this can take a while)
## For all 6 combinations of hyperparameters, trains 3 fold cross validation (trains 18 total models)

grid_search.fit(x_train, y_train)

## Get the estimator with the best performance from the grid search
model = grid_search.best_estimator_

model

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=6, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)
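
Besides `best_estimator_`, a fitted grid search also exposes `best_params_`, `best_score_`, and a `cv_results_` table with one row per hyperparameter combination. The sketch below shows how to inspect them, assuming a small synthetic dataset from `make_classification` and a reduced grid so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem (not the weather data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

params = {"n_estimators": [5, 10], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
search.fit(X, y)

print(search.best_params_)   # winning hyperparameter combination
print(search.best_score_)    # its mean cross-validated score

# One row per combination, with the mean score across folds and its rank
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
```

Unless a `scoring` argument is passed, `GridSearchCV` ranks combinations by the estimator's default score, which is accuracy for classifiers.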

# Step 5: Evaluate your model

In [21]:
## Predict labels and/or probabilities for your test set

y_pred = model.predict(x_test)
y_probs = model.predict_proba(x_test)

print(y_pred.size)

22585
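
`predict` returns hard 0/1 labels, while `predict_proba` returns one column of probabilities per class. For 0/1 labels, column 1 is the probability of class 1 (rain). Working with the probabilities lets you choose your own decision threshold. A minimal sketch, using a made-up probability array in place of `y_probs` and a hypothetical threshold of 0.3:

```python
import numpy as np

# Hypothetical predict_proba output: columns are P(class 0), P(class 1)
probs = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.65, 0.35],
    [0.20, 0.80],
])

rain_prob = probs[:, 1]                           # probability of the positive class
default_pred = (rain_prob >= 0.5).astype(int)     # the usual 0.5 cut-off
cautious_pred = (rain_prob >= 0.3).astype(int)    # lower threshold flags more rain

print(default_pred)
print(cautious_pred)
```

Lowering the threshold generally raises recall at the cost of precision.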


In [22]:
## Import sklearn.metrics:confusion_matrix, precision_score, recall_score, and accuracy_score
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

In [24]:
## Make a confusion matrix to look at binary classification performance

cm = confusion_matrix(y_test, y_pred)
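## sklearn's confusion_matrix puts true labels in rows and predicted labels in columns,
## so cm[1][0] is a false negative (rain predicted as no rain) and cm[0][1] is a false positive
print("True Positive: {}  False Negative: {}".format(cm[1][1], cm[1][0]))
print("False Positive: {}  True Negative: {}".format(cm[0][1], cm[0][0]))


True Positive: 3814  False Negative: 1311
False Positive: 3633  True Negative: 13827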
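
For binary labels, a compact way to pull the four counts out of the matrix is `ravel()`, which flattens it row by row into the order TN, FP, FN, TP. A minimal sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 1, 0, 1, 1, 0, 1]

# Rows are true labels, columns are predictions, so flattening gives tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print("TP: {}  FP: {}  FN: {}  TN: {}".format(tp, fp, fn, tn))
```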


In [25]:
## Calculate 3 other common classification metrics

metrics = {"precision":precision_score, "recall":recall_score, 
           "accuracy":accuracy_score}

In [26]:
for name in metrics:
    print(name)
    print(metrics[name](y_test, y_pred))

precision
0.5121525446488518
recall
0.7441951219512195
accuracy
0.7810936462253708
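
These three numbers can be reproduced by hand from the confusion matrix counts. Reading the matrix in sklearn's layout (rows are true labels, columns are predictions), the test set has 3814 true positives, 3633 false positives, 1311 false negatives, and 13827 true negatives. The sketch below is plain arithmetic on those counts:

```python
# Counts read from the confusion matrix (rows = true labels, columns = predictions)
tp, fp, fn, tn = 3814, 3633, 1311, 13827

precision = tp / (tp + fp)                    # of the days predicted rainy, how many were rainy
recall = tp / (tp + fn)                       # of the rainy days, how many were caught
accuracy = (tp + tn) / (tp + fp + fn + tn)    # fraction of all predictions that were right

print(precision)
print(recall)
print(accuracy)
```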


### A warning about "accuracy":

Accuracy is not a great measure when you have imbalanced classes.   
If you're predicting cancer and it is only present in 1% of people, your model could always predict "No Cancer" and it would have an accuracy of 99%.   
This is misleading, because such a model never finds a single case of cancer.
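
The cancer example can be checked directly. This minimal sketch builds made-up labels where 1% of 1000 people are positive, then scores a "model" that always predicts the negative class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 10 positives out of 1000 (1% prevalence)
y_true = np.array([1] * 10 + [0] * 990)
# A "model" that always predicts the negative class
y_hat = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)

print(acc)   # high, despite the model being useless
print(rec)   # zero: not a single positive case is found
```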

In [28]:
## What accuracy could our model achieve if it only predicted "No Rain"?
## (y_test holds 0/1 labels, so its mean is the fraction of rainy days)
1 - np.mean(y_test)

0.7730794775293336

Always predicting "No Rain" would already score about 77.3% accuracy on this test set, so the model's 78.1% accuracy is only a marginal improvement by that measure. Precision and recall say more about how well rainy days are actually detected.

## Optional: Feature Importance
Some models assign "importance" to the various feature columns in your training data.   
If the model you used calculates those, you can look at the learned importance of each variable.

In [29]:
feat_importances = pd.DataFrame({"Feature":x_train.columns, "Importance":model.feature_importances_})

In [30]:
feat_importances.sort_values("Importance", ascending=False).head(15)

|    | Feature       | Importance |
|----|---------------|------------|
| 7  | Humidity3pm   | 0.266797   |
| 2  | Rainfall      | 0.148645   |
| 6  | Humidity9am   | 0.11122    |
| 12 | RainToday     | 0.0985     |
| 3  | WindGustSpeed | 0.074194   |
| 8  | Pressure9am   | 0.065966   |
| 9  | Pressure3pm   | 0.06148    |
| 11 | Temp3pm       | 0.040103   |
| 1  | MaxTemp       | 0.031291   |
| 0  | MinTemp       | 0.019379   |
