<img src=images/gdd-logo.png width=300px align=right>

# Hawks Hackathon

## Machine Learning Self Study

## About the data
Students and faculty at Cornell College in Mount Vernon, Iowa, collected data over many years at the hawk blind at Lake MacBride near Iowa City, Iowa. The data set that we are analyzing here is a subset of the original data set, using only those species for which there were more than 10 observations. Data were collected on random samples of three different species of hawks: Red-tailed, Sharp-shinned, and Cooper's hawks. Professor Bob Black at Cornell College shared the data. 

The dataset is well suited to data exploration and visualisation.

|Field|Description|
|:---|:---|
|month|	8=August to 12=December|
|day|	Date in the month|
|year|	Year: 1992-2003|
|capturetime|	Time of capture (HH:MM)|
|releasetime|	Time of release (HH:MM)|
|bandnumber|	ID band code|
|species|	CH=Cooper's, RT=Red-tailed, SS=Sharp-Shinned|
|age|	A=Adult or I=Immature|
|wing|	Length (in mm) of the primary wing feather, from the tip to the wrist it attaches to|
|weight|	Body weight (in gm)|
|culmen|	Length (in mm) of the upper bill from the tip to where it bumps into the fleshy part of the bird|
|hallux|	Length (in mm) of the killing talon|
|tail|	Measurement (in mm) related to the length of the tail (invented at the MacBride Raptor Center)|
|standardtail|	Standard measurement of tail length (in mm)|
|tarsus|	Length of the basic foot bone (in mm)|
|wingpitfat|	Amount of fat in the wing pit|
|keelfat|	Amount of fat on the breastbone (measured by feel)|
|crop|	Amount of material in the crop, coded from 1=full to 0=empty|

## Hackathon

The goal of this hackathon is to explore the data and practice building a model with scikit-learn (`sklearn`).

<img src='images/hawkscropped.png'> 
    
Once you have your business goal, work through the following steps:

1. [**Load the data**](#one)
2. [**Exploratory Analysis**](#two)
3. [**Build a model**](#three)
4. [**Summary & Next Steps**](#four)

```python
import pandas as pd
import numpy as np
```

<a id = 'one'></a>
## 1. Loading our data

Your data can originate from many places. Maybe you want to load it from an Excel file stored locally on your system, or maybe you have a .csv file stored online somewhere. Scikit-learn also comes with various standard datasets for practice, which can be loaded once scikit-learn is installed on your system.

However, the dataset we will be using today (the R hawks dataset) does not come from scikit-learn, but from a package in R. Luckily for us, we can access that data [from here](https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Hawks.csv).

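The data has been downloaded and stored in the `data` folder. Adjust the relative path (for example `'data/hawks.csv'` or `'../data/hawks.csv'`) to match where your notebook sits.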

```python
hawks = pd.read_csv('data/hawks.csv')
hawks.head()
```
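
Because `pd.read_csv` accepts either a local path or a URL, one possible way to make the loading step more robust is to fall back to the online copy when the local file is missing. This is only a sketch: `load_hawks` and `HAWKS_URL` are hypothetical names introduced here, and the default local path is assumed to match the cell above.

```python
from pathlib import Path

import pandas as pd

# URL quoted in the text above
HAWKS_URL = "https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Hawks.csv"


def load_hawks(local_path="data/hawks.csv", url=HAWKS_URL):
    """Read the local copy if it exists, otherwise fall back to the URL.

    `load_hawks` is a hypothetical helper written for this sketch.
    """
    source = local_path if Path(local_path).exists() else url
    return pd.read_csv(source)
```

Calling `hawks = load_hawks()` would then replace the direct `pd.read_csv` call. Note that the fallback needs an internet connection.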

<a id = 'two'></a>
## 2. Exploratory Analysis

Before tackling the overall goal, start by asking small questions to understand the data. Answer the following questions (and come up with any of your own related to your overall goal):

1. How many missing values are there in each column?

2. What is the mean length (in mm) of the tail of the hawks?

3. What is the median tail by species?

4. How many observations do you have for each year?

5. What were the most common times of capture?
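
The questions above map onto a handful of pandas calls. The sketch below shows them on a small made-up DataFrame rather than on the hawks data, so the exercise itself stays open. The column names follow the field table and are an assumption: the capitalisation in the actual CSV may differ, so check `hawks.columns` first.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are not values from the hawks data.
toy = pd.DataFrame({
    "year": [1992, 1992, 1993, 1993, 1993, 1994],
    "species": ["RT", "SS", "RT", "CH", "SS", "RT"],
    "tail": [220.0, 150.0, np.nan, 200.0, 160.0, 230.0],
    "capturetime": ["10:30", "11:15", "10:30", "13:00", "10:30", "11:15"],
})

# 1. Missing values per column
missing_per_column = toy.isna().sum()

# 2. Mean of a numeric column (missing values are skipped by default)
mean_tail = toy["tail"].mean()

# 3. Median of a numeric column within each group
median_tail_by_species = toy.groupby("species")["tail"].median()

# 4. Number of rows per year, in chronological order
observations_per_year = toy["year"].value_counts().sort_index()

# 5. Most frequent values of a column (value_counts sorts by frequency)
common_capture_times = toy["capturetime"].value_counts().head(3)

print(missing_per_column)
print(mean_tail)
print(median_tail_by_species)
print(observations_per_year)
print(common_capture_times)
```

The same calls should work on `hawks` once the column names are adjusted.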

<a id = 'three'></a>
## 3. Build a Decision Tree model

1. Remove any columns with a missing value ratio of 10% or higher
2. Drop the rows that still contain missing values
3. Remove any categorical feature columns (keep your target column, e.g. `species`)
4. Split the data into `X` and `y`
5. Use the `train_test_split` function with a fixed `random_state`
6. Instantiate a `DecisionTreeClassifier` with a `max_depth` of 5
7. Fit the model to `X_train` and `y_train`
8. Calculate the `accuracy_score` on the test set and print the `classification_report`
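
The steps above can be sketched end to end on synthetic data. Everything in the toy frame (`target`, `feature_1`, `feature_2`, `mostly_missing`, `category`) is invented for illustration, and `test_size=0.25` and the `random_state` values are arbitrary choices, not requirements of the exercise. The target is set aside before the categorical columns are removed, since it is itself categorical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only; not the hawks measurements.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "target": np.repeat(["A", "B", "C"], n // 3),
    "feature_1": rng.normal(loc=np.repeat([100.0, 200.0, 300.0], n // 3), scale=5.0),
    "feature_2": rng.normal(size=n),
    "mostly_missing": np.where(np.arange(n) % 2 == 0, np.nan, 1.0),
    "category": np.tile(["x", "y"], n // 2),
})
toy.loc[0, "feature_2"] = np.nan  # one stray missing value

# 1. Drop columns whose missing-value ratio is 10% or higher
missing_ratio = toy.isna().mean()
data = toy.loc[:, missing_ratio < 0.10]

# 2. Drop rows that still contain missing values
data = data.dropna()

# 3. and 4. Set the target aside, then keep only numeric feature columns
y = data["target"]
X = data.drop(columns="target").select_dtypes(include="number")

# 5. Hold out a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 6. and 7. Instantiate and fit the tree
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# 8. Evaluate on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
```

Scoring on `X_test` rather than `X_train` gives an estimate of how the tree behaves on unseen rows. On the real data, expect less clean results than on this deliberately well-separated toy example.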

<a id = 'four'></a>
## 4. Summary & Next Steps

What have you learned about this data? What do you need to or want to look at next?

