# Shelter Animals - Technical Delivery

I will briefly summarise the steps I took and their outcomes. I've written this document with the README file in mind: it is aimed at giving data science recruiters and other interested readers a quick overview of what the project entails. For a more detailed view, I refer you to the Proposal, Data Preparation, Exploratory Data Analysis and Modelling notebooks, in that order.

- What kind of technologies or methodologies were used?
- What were the brief outcomes of these steps?
- How can I try out the final result?


# Shelter Animals: Project Overview

- Created a tool that predicts how long it takes for cats and dogs to be adopted (accuracy ~51%) based on 5 categories.
- Researched the domain to get a greater understanding of potentially influential factors.
- Merged intake and outcome datasets to extract length of stay.
- Extracted features from the available data to quantify the importance potential adopters place on fur colour, gender, castration, breed and age.
- Investigated correlations between adoption speed and other characteristics such as age, intake type, intake condition and animal type.
- Optimised Decision Tree, Random Forest, K-Nearest Neighbor and Support Vector Machine Classification using GridSearchCV and feature selection to reach the best model.
- Built a client-facing API using Flask.


# Code and Resources Used

**Python version**: 3.7  
**Packages**: pandas, numpy, matplotlib, seaborn, plotly, wordcloud, sklearn, time, datetime, dateutil, calendar, re  
**For Web Framework Requirements**: `pip install -r requirements.txt`  
**Flask Production**: https://medium.com/@nutanbhogendrasharma/deploy-machine-learning-model-with-flask-on-heroku-cd079b692b1d   
**Technical Documentation**: https://www.youtube.com/watch?v=agHKuUoMwvY&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

### Data Collection 

Combined the Austin Shelter Intake and Outcome datasets. For each animal we have the following information:

- Animal ID
- Animal Type
- Breed
- Color
- Found Location
- Date of Birth
- Intake Name
- Outcome Name
- Intake DateTime
- Outcome Datetime
- Sex upon Intake
- Sex upon Outcome
- Age upon Intake
- Age upon Outcome
- Intake Type
- Intake Condition
- Outcome Type
- Outcome Subtype
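
As a rough illustration of how the two datasets could be combined, here is a minimal pandas sketch. The toy rows and the choice of an inner join on `Animal ID` are assumptions made for the example; the real data has many more columns, and animals with repeat visits would need extra handling that is not shown here.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome datasets (values are made up).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A3"],
    "Animal Type": ["Dog", "Cat", "Dog"],
    "Intake DateTime": pd.to_datetime(
        ["2019-01-01 10:00", "2019-01-05 09:30", "2019-02-01 14:00"]
    ),
    "Intake Type": ["Stray", "Owner Surrender", "Stray"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2"],
    "Outcome DateTime": pd.to_datetime(["2019-01-04 16:00", "2019-02-10 11:00"]),
    "Outcome Type": ["Adoption", "Adoption"],
})

# An inner join keeps only animals that appear in both tables,
# so A3 (no outcome recorded yet) is dropped.
merged = intakes.merge(outcomes, on="Animal ID", how="inner")
print(merged[["Animal ID", "Intake DateTime", "Outcome DateTime"]])
```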

### Data Cleaning
After merging the data I needed to clean it up and extract various information. I made the following changes:
- Calculated Days in Shelter.
- Bucketed the Days in Shelter into Adoption Speed groups. 


- Selected only cats and dogs that were adopted.
- Removed rows with unknown gender.
- Removed or corrected rows with typographical errors.
- Renamed the Name column and filled missing names with `Unknown`.
- Calculated Intake and Outcome age to have it in a consistent format.


- Made columns for gender and sterilization intake/outcome from the Gender intake/outcome.
- Made columns for fur colour based on the most commonly found colours, allowing for mixed colours.
- Narrowed down the breeds to the most commonly found family names. For example: `Alaskan Husky` becomes `Husky`.
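
Several of the cleaning steps above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the notebook code: the bin edges, bucket labels, the `families` list and the new column names are all hypothetical, and the sex values are assumed to look like `Neutered Male` or `Intact Female`.

```python
import pandas as pd

df = pd.DataFrame({
    "Intake DateTime": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-02-01"]),
    "Outcome DateTime": pd.to_datetime(["2019-01-04", "2019-02-10", "2019-06-01"]),
    "Sex upon Outcome": ["Neutered Male", "Spayed Female", "Intact Male"],
    "Breed": ["Alaskan Husky", "Domestic Shorthair", "Siberian Husky"],
})

# Length of stay in whole days.
df["Days in Shelter"] = (df["Outcome DateTime"] - df["Intake DateTime"]).dt.days

# Bucket the stay into adoption-speed groups (edges and labels are hypothetical).
bins = [-1, 7, 30, 90, 365, float("inf")]
labels = ["<1 week", "1-4 weeks", "1-3 months", "3-12 months", ">1 year"]
df["Adoption Speed"] = pd.cut(df["Days in Shelter"], bins=bins, labels=labels)

# Split the combined sex column into sterilization status and gender.
parts = df["Sex upon Outcome"].str.split(" ", n=1, expand=True)
df["Sterilized"] = parts[0].isin(["Neutered", "Spayed"])
df["Gender"] = parts[1]

# Reduce breeds to a common family name (hypothetical list of families).
families = ["Husky", "Shorthair"]

def breed_family(breed):
    """Return the first family name found in the breed, else 'Other'."""
    for family in families:
        if family.lower() in breed.lower():
            return family
    return "Other"

df["Breed Family"] = df["Breed"].apply(breed_family)
print(df[["Days in Shelter", "Adoption Speed", "Sterilized", "Gender", "Breed Family"]])
```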

### EDA
I looked at the distributions of the data, the value counts of the categories and the influence the features have on the target variable. I used graphs and visualisations to draw conclusions. Below are a few highlights.

<img src='https://i.imgur.com/xHLFFW1.png' width=450px align="left">
<img src='https://i.imgur.com/4XFXDbz.png' width=450px align="left">

<img src='https://i.imgur.com/OBZYo6O.png' width=450px align="left">
<img src='https://i.imgur.com/XOhicKJ.png' width=450px align="left">

### Model Building
First I one-hot encoded the categorical variables, label encoded the target variable and converted dates to categories for years, months, days and day of the week. To even out the weight of the numerical values I scaled them using a MinMaxScaler. I also split the data into a training set and a testing (or validation) set with a test size of 20%.
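
The preparation described above might look roughly like this with pandas and scikit-learn. The toy columns, the `random_state` and the choice to fit the scaler on the training split only are assumptions for the sketch; the notebooks may order these steps differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "Animal Type": ["Dog", "Cat"] * 5,
    "Intake Type": ["Stray", "Owner Surrender", "Stray", "Stray", "Public Assist"] * 2,
    "Age upon Intake (days)": [30.0, 400.0, 1200.0, 90.0, 2500.0,
                               60.0, 700.0, 365.0, 45.0, 3000.0],
    "Adoption Speed": ["fast", "slow", "medium", "fast", "slow"] * 2,
})

# One-hot encode the categorical features and label encode the target.
X = pd.get_dummies(df.drop(columns="Adoption Speed"),
                   columns=["Animal Type", "Intake Type"])
y = LabelEncoder().fit_transform(df["Adoption Speed"])

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale numerical columns to [0, 1]; the scaler is fitted on the training
# split only and then reused on the test split.
num_cols = ["Age upon Intake (days)"]
scaler = MinMaxScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```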

I set the baseline accuracy to the share of the largest group, which was approximately 23.5% of the total: this is the score a model would reach by always predicting the most common class. I tried four different models and evaluated them on their cross-validation accuracy and their accuracy on the testing set. When more insight was necessary I used the classification report to look at recall and precision, or made use of the confusion matrix.
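
A majority-class baseline of this kind can be reproduced with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels, so its baseline value differs from the 23.5% reported for the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: class 0 is the largest group (4 of 10 samples).
y = np.array([0] * 4 + [1] * 3 + [2] * 3)
X = np.zeros((len(y), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print(f"Baseline accuracy: {baseline_acc:.2f}")
```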

The four models I tried were:
1. Decision Tree
2. Random Forest
3. K-Nearest Neighbor
4. Support Vector Machine
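
To show how a grid search over one of these models could be set up, here is a small sketch using `GridSearchCV` with a Random Forest. The synthetic data, the parameter grid and the number of folds are assumptions chosen to keep the example quick; they are not the settings used in the Modelling notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multi-class data standing in for the prepared shelter features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical, deliberately small grid.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Cross-validated score of the best setting and accuracy on the held-out set.
print("Best parameters:", search.best_params_)
print("Cross-validation accuracy:", search.best_score_)
test_acc = search.score(X_test, y_test)
print("Test accuracy:", test_acc)
```

The same pattern applies to the other three models by swapping the estimator and its parameter grid.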

### Model Performance

The Random Forest model far outperformed the other approaches, both in cross-validation and on the testing (validation) set. Below I've listed the score of each model on the validation set.

- **Decision Tree** : 44.92%
- **Random Forest** : 51.00% 
- **K-Nearest Neighbor** : 40.06%
- **Support Vector Machine** : 39.12% 


### Productionization
To put the model into production I built a Flask API endpoint that is hosted on a local webserver. I followed a tutorial that is listed under Code and Resources Used. The API endpoint takes a POST request in JSON format, then transforms and scales the data the same way I prepared it during the Modelling stage. It then returns the estimated time it takes for the animal to get adopted.
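
A minimal sketch of such an endpoint is shown below. The `/predict` route, the `animal_type` field, the response key and the `predict_speed` helper are hypothetical: the helper is a stand-in for loading the trained model and applying the same transformations and scaling as in the Modelling stage.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_speed(features):
    """Hypothetical stand-in for the preprocessing and trained model."""
    return "1-4 weeks" if features.get("animal_type") == "Dog" else "1-3 months"


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; reject anything that is not a JSON object.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify(error="expected a JSON object"), 400
    return jsonify(adoption_speed=predict_speed(payload))


# Exercise the endpoint with Flask's built-in test client (no server needed).
with app.test_client() as client:
    response = client.post("/predict", json={"animal_type": "Dog"})
    result = response.get_json()
    bad_response = client.post("/predict", data="not json")

print(response.status_code, result)
```

To serve it locally instead of using the test client, the app could be started with `app.run()` or the `flask run` command.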