# Homework2

## Introduction

- This notebook will apply the cleaned dataset from homework 1 and use it to create 3 different models to predict whether an applicant for credit will repay their credit within 2 years.
    - The value 0 indicates that an animal will have a positive outcome (adopted/returned to owner).
    - The value 1 indicates that an animal will have a negative outcome (death).
- Each model will be a supervised learning model and the output will be a binary classification. 
- For each applicant a number of features will be given (independent variables) and the target, risk performance (dependent variable), will be predicted.

This homework will be broken down into 4 main parts:
1. Review the dataset from homework one and decide on which features to use to build our model
2. Create a Linear Regression model and analyse
3. Create a Logistical Regression model and analyse
4. Create a Random Forest model and analyse
5. Try to optimize each model

We will begin by importing the packages needed for this assignment. 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from patsy import dmatrices

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.tree import export_graphviz

# import graphviz
# from graphviz import Source

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## 1) Data Understanding and Prep

The original dataset was cleaned in homework 1 and will now be imported as a starting point for this homework. There are some points to note before proceeding: 

- The cleaned dataset will be a starting point however, some additional cleaning steps will be performed before proceeding with homework2. 
- The data quality report for the cleaned dataset has been provided as a reference. 
- A summary of the data quality plan is seen below. 
- Based on the findings in homework1, four additional features were added. These are: 
    - 'SexKnown' which indicates whether the sex of an animal is known or not. 
    - 'CatOrDog' which indicates whether an animal is either a cat or a dog or not. 
    - 'AgeIntake_bins which grouped the ages of animals upon intake into four equal frequency bins.
    - 'SickOrInjured' which indicates whether an animal was sick/injured or not. 

## Insert data quality plan table here!

We will begin by importing the cleaned dataset from homework 1. 

In [2]:
# read in the cleaned csv
df = pd.read_csv("19205514_cleaned_new_features_added.csv")

We will now look at the first five lines of this dataset. 

In [3]:
df.head(5)

Unnamed: 0,AnimalID,Name_Provided,DateTime_Intake,FoundLocation,IntakeType,IntakeCondition,AnimalType_Intake,SexuponIntake,AgeuponIntake,Breed_Intake,...,DateTime_Outcome,DateofBirth,SexuponOutcome,AgeuponOutcome,binary_outcome,percent,SexKnown,CatOrDog,AgeIntake_bins,SickOrInjured
0,A687076,Yes,2014-08-30 17:55:00,Austin (TX),Stray,Normal,Cat,Intact Female,76,Domestic Shorthair,...,2014-09-05 14:33:00,2014-06-15 00:00:00,Spayed Female,82,0,0.109769,1,1,"(55.8, 92.0]",0
1,A685139,No,2014-08-03 11:23:00,Austin (TX),Stray,Other,Dog,Intact Female,92,Other Dog Breeds,...,2014-08-07 18:09:00,2014-05-03 00:00:00,Spayed Female,96,0,0.109769,1,1,"(55.8, 92.0]",0
2,A741039,Yes,2016-12-27 11:18:00,Austin (TX),Stray,Normal,Dog,Intact Male,366,Other Dog Breeds,...,2017-01-01 15:27:00,2015-12-27 00:00:00,Intact Male,371,0,0.109769,1,1,"(365.0, 427.6]",0
3,A759166,Yes,2017-09-28 11:03:00,Austin (TX),Stray,Normal,Cat,Intact Male,62,Domestic Shorthair,...,2017-11-04 18:11:00,2017-07-28 00:00:00,Neutered Male,99,0,0.109769,1,1,"(55.8, 92.0]",0
4,A696479,Yes,2015-02-05 11:51:00,Travis (TX),Stray,Normal,Cat,Intact Male,365,Domestic Medium Hair,...,2015-02-15 11:39:00,2014-02-05 00:00:00,Neutered Male,375,0,0.109769,1,1,"(212.0, 365.0]",0


In [4]:
#look at the shape of the dataset
df.shape

(1000, 21)

We will now look at the datatypes of our features after importing the csv. 

In [5]:
df.dtypes

AnimalID              object
Name_Provided         object
DateTime_Intake       object
FoundLocation         object
IntakeType            object
IntakeCondition       object
AnimalType_Intake     object
SexuponIntake         object
AgeuponIntake          int64
Breed_Intake          object
Color_Intake          object
DateTime_Outcome      object
DateofBirth           object
SexuponOutcome        object
AgeuponOutcome         int64
binary_outcome         int64
percent              float64
SexKnown               int64
CatOrDog               int64
AgeIntake_bins        object
SickOrInjured          int64
dtype: object

We can see that the datatpyes of some features have reverted back to type "object". We will now convert these features back to their appropriate types. 

In [6]:
#convert all objects to categories
object_columns = df.select_dtypes(['object']).columns
for column in object_columns:
    df[column] = df[column].astype('category')

#convert animal ID to an object type
df['AnimalID'] = df['AnimalID'].astype('object')  

#convert the new features 'SexKnown','CatOrDog' and 'SickOrInjured' back to category type as this 
#is how they were implemented in homework1.
new_features = ['SexKnown','CatOrDog', 'SickOrInjured']
for column in new_features:
    df[column] = df[column].astype('category')
    
#convert all date features to datetime types
date_columns = ['DateTime_Intake', 'DateTime_Outcome', 'DateofBirth']
for column in date_columns: 
    df[column] = df[column].astype('datetime64')

We will now remove the feature 'percent'. This was generating while implementing bar plots in homework1 and should not be part of the dataset. 

In [8]:
#drop the feature 'percent'
df = df.drop("percent", 1)

### Further cleaning steps

As mentioned above, before going any further there is another cleaning step required. In homework1 the features *DateTime_Intake* and *DateTime_Outcome* were explored. It was found that there were some data integrity issues and these were dealt with in homework1. However, it was not deemed necessary at the time to split these features into categorical features representing the year, month and day. Before proceeding with predictive modelling we will now carry out this step. This step will make it easier to work with the high cardinality of the datetime features and it will convert them to a format which is easier to work with for predictive modelling. 

It is necessary to note that some investigation was done into the month and day of intakei in a temporary dataframe in homework1 and no clear relationship was seen with the binary outcome. However, at this stage in homework2 it is necessary to explore these features further before deciding which ones we will use for predictive modelling. 

We begin this step by converting the feature *DateTime_Intake* into year, month and day features. 

In [9]:
#extract the year, month and day from 'DateTime_Intake'
df['Intake_Year']=df['DateTime_Intake'].dt.year
df['Intake_Month']=df['DateTime_Intake'].dt.month
df['Intake_Day']=df['DateTime_Intake'].dt.day

We will now convert *DateTime_Intake* into year, month and day features. 

In [10]:
#extract the year, month and day from 'DateTime_Outcome'
df['Outcome_Year']=df['DateTime_Outcome'].dt.year
df['Outcome_Month']=df['DateTime_Outcome'].dt.month
df['Outcome_Day']=df['DateTime_Outcome'].dt.day

We will now convert these new features to type 'category'.

In [11]:
#convert new features to category type
new_date_features = ['Intake_Year','Intake_Month', 'Intake_Day', 'Outcome_Year', 'Outcome_Month', 'Outcome_Day']
for column in new_date_features: 
    df[column] = df[column].astype('category')

In [12]:
df.dtypes

AnimalID                     object
Name_Provided              category
DateTime_Intake      datetime64[ns]
FoundLocation              category
IntakeType                 category
IntakeCondition            category
AnimalType_Intake          category
SexuponIntake              category
AgeuponIntake                 int64
Breed_Intake               category
Color_Intake               category
DateTime_Outcome     datetime64[ns]
DateofBirth          datetime64[ns]
SexuponOutcome             category
AgeuponOutcome                int64
binary_outcome                int64
SexKnown                   category
CatOrDog                   category
AgeIntake_bins             category
SickOrInjured              category
Intake_Year                category
Intake_Month               category
Intake_Day                 category
Outcome_Year               category
Outcome_Month              category
Outcome_Day                category
dtype: object

The features *DateTime_Intake* and *DateTime_Outcome* were of type datetime64. The new year, month and day features extracted from these have a much lower cardinality and are in a format that is much more useful for predictive modelling. 

We can now drop the original features *DateTime_Intake* and *DateTime_Outcome* as we have extracted all important information and no information will be lost. 

In [13]:
#drop the features 'DateTime_Intake' and'DateTime_Outcome'
df = df.drop('DateTime_Intake', 1)
df = df.drop('DateTime_Outcome', 1)

Furthermore, the feature *DateofBirth* was used to calculate the features *AgeuponIntake* and *AgeuponOutcome*. This feature will not be needed for predictive modelling as the information we need is represented by the age features. As a result, we will now drop this feature. 

In [16]:
#drop 'DateofBirth'
df = df.drop('DateofBirth', 1)

Missing values were dealt with in homework1. However, we will now check to ensure that there are currently no missing values in our dataset. 

In [17]:
df.isna().sum()

AnimalID             0
Name_Provided        0
FoundLocation        0
IntakeType           0
IntakeCondition      0
AnimalType_Intake    0
SexuponIntake        0
AgeuponIntake        0
Breed_Intake         0
Color_Intake         0
SexuponOutcome       0
AgeuponOutcome       0
binary_outcome       0
SexKnown             0
CatOrDog             0
AgeIntake_bins       0
SickOrInjured        0
Intake_Year          0
Intake_Month         0
Intake_Day           0
Outcome_Year         0
Outcome_Month        0
Outcome_Day          0
dtype: int64

We can see that there are no missing values. We will now check the datatypes again before proceeding with plots.

In [18]:
df.dtypes

AnimalID               object
Name_Provided        category
FoundLocation        category
IntakeType           category
IntakeCondition      category
AnimalType_Intake    category
SexuponIntake        category
AgeuponIntake           int64
Breed_Intake         category
Color_Intake         category
SexuponOutcome       category
AgeuponOutcome          int64
binary_outcome          int64
SexKnown             category
CatOrDog             category
AgeIntake_bins       category
SickOrInjured        category
Intake_Year          category
Intake_Month         category
Intake_Day           category
Outcome_Year         category
Outcome_Month        category
Outcome_Day          category
dtype: object

## Setup continuous and categorical types for plotting later? 

### (1.1) Randomly shuffle the rows of your dataset and split the dataset into two datasets: 70% training and 30% test. Keep the test set aside. 

Sklearn train_test_split randomly shuffles the dataset. However, we will implement an additional shuffling step beforehand as specified in the requirements. 

In [19]:
# randomly generate a sequence based on the dataframe index. Set this to be the new index.
df.set_index(np.random.permutation(df.index))
# sort the resulting random index
df.sort_index(inplace=True)
#Look at the first five rows of the randomly shuffled dataset
df.head(5)

Unnamed: 0,AnimalID,Name_Provided,FoundLocation,IntakeType,IntakeCondition,AnimalType_Intake,SexuponIntake,AgeuponIntake,Breed_Intake,Color_Intake,...,SexKnown,CatOrDog,AgeIntake_bins,SickOrInjured,Intake_Year,Intake_Month,Intake_Day,Outcome_Year,Outcome_Month,Outcome_Day
0,A687076,Yes,Austin (TX),Stray,Normal,Cat,Intact Female,76,Domestic Shorthair,Mixed Pattern,...,1,1,"(55.8, 92.0]",0,2014,8,30,2014,9,5
1,A685139,No,Austin (TX),Stray,Other,Dog,Intact Female,92,Other Dog Breeds,White,...,1,1,"(55.8, 92.0]",0,2014,8,3,2014,8,7
2,A741039,Yes,Austin (TX),Stray,Normal,Dog,Intact Male,366,Other Dog Breeds,Black,...,1,1,"(365.0, 427.6]",0,2016,12,27,2017,1,1
3,A759166,Yes,Austin (TX),Stray,Normal,Cat,Intact Male,62,Domestic Shorthair,Black,...,1,1,"(55.8, 92.0]",0,2017,9,28,2017,11,4
4,A696479,Yes,Travis (TX),Stray,Normal,Cat,Intact Male,365,Domestic Medium Hair,Black,...,1,1,"(212.0, 365.0]",0,2015,2,5,2015,2,15


**The dataset will now be split into two separate datasets - 70% training and 30% test.**
- We set the target feature "y" to be "binary_outcome"
- We feature "x" to be all remaining features in the dataset. The feature "binary_outcome" will be excluded.

In [20]:
y = pd.DataFrame(df["binary_outcome"])
X = df.drop(["binary_outcome"],1)

We will now split the dataset. The parameter 'test-size' will determine the size of the training and test datasets. We will set this to 0.3 in order to split into 70% training and 30% test. The parameter 'random_state' sets a seed to the train_test_split random generator. We will set this parameter to 1 to ensure that the train/test split is the same each time this code is executed.  

In [21]:
# Split the dataset into two datasets: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=1)

print("original range is: ",df.shape[0])
print("training range (70%):\t rows 0 to", round(X_train.shape[0]))
print("test range (30%): \t rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])

original range is:  1000
training range (70%):	 rows 0 to 700
test range (30%): 	 rows 700 to 1000


In [None]:
We are now ready to plot the features. 

#### (1.2) On the training set - For each continuous feature, plot its interaction with the target feature (a plot for each pair of continuous feature and target feature). Discuss what you observe from these plots, e.g., which continuous features seem to be better at predicting the target feature? Choose a subset of continuous features you find promising (if any). Justify your choices.