In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
import matplotlib.pyplot as plt
import numpy as np
import os
import seaborn as sns
import sklearn 
import pandas as pd

import shelter
from shelter.config import data_dir

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

%matplotlib inline

# Machine Learning Model

In this hackathon we'll try to predict the outcome of animals (adoption, etc.) at the Austin Animal Center using intake data (breed, age, etc.).
We'll use the data from [this Kaggle competition](https://www.kaggle.com/c/shelter-animal-outcomes).
At the end of the hackathon you should be able to send your own submission to Kaggle!

Make sure you've followed the instructions from the `README.md`.
You will probably run into problems if you haven't!

To start, read the documentation on the [Kaggle competition](https://www.kaggle.com/c/shelter-animal-outcomes) and download the [data](https://www.kaggle.com/c/shelter-animal-outcomes/data).
Unzip the data in the folder `data/`.
There should be (at least) three files: `sample_submission.csv`, `train.csv` and `test.csv`.

Mac users may find that some of the downloaded files lack the correct extension; if so, add `.gz` to the file name.
(Ask the instructors if you get any weird errors.)

Load the data with the functions from our own `shelter` package:

In [None]:
train = shelter.data.load_data(os.path.join(data_dir, 'train.csv'))
test = shelter.data.load_data(os.path.join(data_dir, 'test.csv'))

train.head()

Let's check how often each outcome occurs:

In [None]:
n_per_outcome = train['outcome_type'].value_counts()
ax = n_per_outcome.plot(kind='bar', rot=45)
ax.set_ylabel('# animals')
ax.set_title('Occurrence of outcome types')

Now that you've got the data, try to create a model that predicts the `outcome_type` given the intake data.
Our final metric is the [multi-class logarithmic loss](https://www.kaggle.com/c/shelter-animal-outcomes#evaluation) over all classes.
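
For intuition: the metric averages the negative log of the probability your model assigned to the true class of each animal, so confident wrong predictions are punished heavily. A minimal sketch on made-up numbers (the labels and probabilities below are illustrative and not taken from the data) compares `sklearn.metrics.log_loss` with a manual computation:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example (assumption): 3 animals, 3 outcome classes in alphabetical order,
# which is the column order sklearn expects for the probability matrix.
labels = ['Adoption', 'Euthanasia', 'Transfer']
y_true = ['Adoption', 'Transfer', 'Adoption']
y_prob = np.array([
    [0.7, 0.1, 0.2],  # each row sums to 1
    [0.2, 0.2, 0.6],
    [0.5, 0.1, 0.4],
])

# Library computation.
loss = log_loss(y_true, y_prob, labels=labels)

# Manual computation: pick the probability of the true class per row,
# take the log, and average with a minus sign.
true_idx = [labels.index(y) for y in y_true]
manual = -np.mean(np.log(y_prob[np.arange(len(y_true)), true_idx]))

print(loss, manual)
```

Both values should agree; lower is better.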

> #### Tips
> 
* Start with exploring the data and building your first hypotheses.
What is the input data? 
Are there any missing values? 
What do you think will predict the outcome type? 
* Know what random performance looks like.
Create a baseline model that randomly predicts an outcome type in proportion to how often each type occurs.
Check the [`DummyClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html).
* `sklearn` doesn't work with string values. You probably want to look at [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), `sklearn`'s [`LabelEncoder`](http://scikit-learn.org/stable/modules/preprocessing_targets.html) or [`OneHotEncoder`](http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) to convert the strings to numeric values.
* Try to create a model that predicts only one outcome type (e.g. `Adoption`) before focussing on all outcomes.
* `sklearn` has many models for [supervised learning](http://scikit-learn.org/stable/supervised_learning.html), try to find one that fits the problem.
* Look at [Kaggle Kernels](https://www.kaggle.com/c/shelter-animal-outcomes/kernels) for inspiration.
* You will get better performance with some [feature engineering](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/).
* Once you've got your first model working, generate predictions for `test.csv` and submit them on Kaggle.
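
As a starting point for the baseline and encoding tips above, here is a minimal sketch on a tiny made-up frame. Only `outcome_type` is a column name from this notebook; `animal_type` and `sex` are hypothetical stand-ins for whatever intake columns `train` actually has. It uses `strategy='prior'`, which returns the class frequencies as probabilities; `strategy='stratified'` instead draws labels at random in proportion to those frequencies, which matches the tip more literally but scores poorly under log loss because its probabilities are all 0 or 1.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Hypothetical toy data; replace with columns from the real `train` frame.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog', 'Cat', 'Dog', 'Cat'],
    'sex': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
    'outcome_type': ['Adoption', 'Transfer', 'Adoption',
                     'Transfer', 'Adoption', 'Euthanasia'],
})

# One-hot encode the string features so sklearn can consume them.
X = pd.get_dummies(toy[['animal_type', 'sex']])
y = toy['outcome_type']

# Baseline that ignores the features and predicts the class frequencies.
baseline = DummyClassifier(strategy='prior')
baseline.fit(X, y)

proba = baseline.predict_proba(X)
baseline_loss = log_loss(y, proba, labels=baseline.classes_)
print(baseline.classes_, proba[0], baseline_loss)
```

Any real model should beat this baseline's log loss; if it doesn't, the features aren't adding information yet.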