## All you need is love… And a pet!

<img src="img/dataset-cover.jpg" width="920">

Here we are going to build a classifier to predict whether an animal from an animal shelter will be adopted or not (aac_intakes_outcomes.csv, available at: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/version/1#aac_intakes_outcomes.csv). You will be working with the following features:

1. *animal_type:* Type of animal. May be one of 'cat', 'dog', 'bird', etc.
2. *intake_year:* Year of intake
3. *intake_condition:* The intake condition of the animal. Can be one of 'normal', 'injured', 'sick', etc.
4. *intake_number:* The intake number denoting the number of occurrences the animal has been brought into the shelter. Values higher than 1 indicate the animal has been taken into the shelter on more than one occasion.
5. *intake_type:* The type of intake, for example, 'stray', 'owner surrender', etc.
6. *sex_upon_intake:* The gender of the animal and if it has been spayed or neutered at the time of intake
7. *age_upon\_intake_(years):* The age of the animal upon intake represented in years
8. *time_in_shelter_days:* Numeric value denoting the number of days the animal remained at the shelter from intake to outcome.
9. *sex_upon_outcome:* The gender of the animal and if it has been spayed or neutered at time of outcome
10. *age_upon\_outcome_(years):* The age of the animal upon outcome represented in years
11. *outcome_type:* The outcome type. Can be one of 'adopted', 'transferred', etc.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from itertools import combinations 
import ast
# added 
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
# from sklearn.preprocessing import OneHotEncoder
# from pandas.plotting import scatter_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
#
import seaborn as sn
%matplotlib inline

data_folder = './data/'

### A) Load the dataset and convert categorical features to a suitable numerical representation (use dummy-variable encoding). 
- Split the data into a training set (80%) and a test set (20%). Pair each feature vector with the corresponding label, i.e., whether the outcome_type is adoption or not. 
- Standardize the values of each feature in the data to have mean 0 and variance 1.

The use of external libraries is not permitted in part A, except for numpy and pandas. 
You can drop entries with missing values.

In [2]:
columns = ['animal_type', 'intake_year', 'intake_condition', 'intake_number', 'intake_type', 'sex_upon_intake', \
          'age_upon_intake_(years)', 'time_in_shelter_days', 'sex_upon_outcome', 'age_upon_outcome_(years)', \
          'outcome_type']
original_data = pd.read_csv(data_folder+'aac_intakes_outcomes.csv', usecols=columns)

In [3]:
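# dropna() returns a new frame, so reassign it; by default it drops every row with a missing value.
# reset_index keeps the row labels contiguous for the positional split later on.
original_data = original_data.dropna().reset_index(drop=True)
original_data.head()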

Unnamed: 0,outcome_type,sex_upon_outcome,age_upon_outcome_(years),animal_type,intake_condition,intake_type,sex_upon_intake,age_upon_intake_(years),intake_year,intake_number,time_in_shelter_days
0,Return to Owner,Neutered Male,10.0,Dog,Normal,Stray,Neutered Male,10.0,2017,1.0,0.588194
1,Return to Owner,Neutered Male,7.0,Dog,Normal,Public Assist,Neutered Male,7.0,2014,2.0,1.259722
2,Return to Owner,Neutered Male,6.0,Dog,Normal,Public Assist,Neutered Male,6.0,2014,3.0,1.113889
3,Transfer,Neutered Male,10.0,Dog,Normal,Owner Surrender,Neutered Male,10.0,2014,1.0,4.970139
4,Return to Owner,Neutered Male,16.0,Dog,Injured,Public Assist,Neutered Male,16.0,2013,1.0,0.119444


In [4]:
df = pd.get_dummies(original_data)
df.head()

Unnamed: 0,age_upon_outcome_(years),age_upon_intake_(years),intake_year,intake_number,time_in_shelter_days,outcome_type_Adoption,outcome_type_Died,outcome_type_Disposal,outcome_type_Euthanasia,outcome_type_Missing,...,intake_type_Euthanasia Request,intake_type_Owner Surrender,intake_type_Public Assist,intake_type_Stray,intake_type_Wildlife,sex_upon_intake_Intact Female,sex_upon_intake_Intact Male,sex_upon_intake_Neutered Male,sex_upon_intake_Spayed Female,sex_upon_intake_Unknown
0,10.0,10.0,2017,1.0,0.588194,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,7.0,7.0,2014,2.0,1.259722,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
2,6.0,6.0,2014,3.0,1.113889,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,10.0,10.0,2014,1.0,4.970139,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
4,16.0,16.0,2013,1.0,0.119444,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0


In [5]:
# sklearn.model_selection.train_test_split would normally do the train/test split,
# but part A allows only numpy and pandas, so the split is done by hand below.

In [6]:
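# Standardization (z-score) of the numeric columns; the dummy columns stay 0/1
df_std = df.copy()  # copy, so that df keeps the raw values instead of being overwritten
for x in ['age_upon_outcome_(years)', 'age_upon_intake_(years)', 'intake_year', 'intake_number', 'time_in_shelter_days']:
    df_std[x] = (df[x] - df[x].mean()) / df[x].std()
df_std.head()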

Unnamed: 0,age_upon_outcome_(years),age_upon_intake_(years),intake_year,intake_number,time_in_shelter_days,outcome_type_Adoption,outcome_type_Died,outcome_type_Disposal,outcome_type_Euthanasia,outcome_type_Missing,...,intake_type_Euthanasia Request,intake_type_Owner Surrender,intake_type_Public Assist,intake_type_Stray,intake_type_Wildlife,sex_upon_intake_Intact Female,sex_upon_intake_Intact Male,sex_upon_intake_Neutered Male,sex_upon_intake_Spayed Female,sex_upon_intake_Unknown
0,2.709378,2.727873,1.200085,-0.278079,-0.387936,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,1.674923,1.69095,-1.102017,1.914629,-0.371824,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
2,1.330105,1.345309,-1.102017,4.107338,-0.375323,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,2.709378,2.727873,-1.102017,-0.278079,-0.282801,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
4,4.778288,4.801719,-1.869384,-0.278079,-0.399183,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0


In [7]:
# Splitting into training (80%) and test (20%) sets
indices = np.random.permutation(df_std.shape[0])
n_train = round(df_std.shape[0] * 0.8)

# iloc selects by position, so the split also works when the index labels have gaps
training_idx = indices[:n_train]
df_80 = df_std.iloc[training_idx]

test_idx = indices[n_train:]
df_20 = df_std.iloc[test_idx]
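
Standardizing before the split, as in cell In [6], lets the test rows influence the mean and standard deviation. The sketch below shows one possible alternative using only numpy and pandas: shuffle, split 80/20 by position, then scale both parts with statistics computed on the training part. The helper name `split_and_standardize`, the toy frame, and the fixed seed are illustrative assumptions and not part of the assignment.

```python
import numpy as np
import pandas as pd

def split_and_standardize(frame, numeric_cols, train_frac=0.8, seed=0):
    """Shuffle rows, split by position, scale numeric columns with training statistics."""
    rng = np.random.default_rng(seed)           # seed is an assumption, for reproducibility
    order = rng.permutation(len(frame))
    n_train = round(len(frame) * train_frac)
    train = frame.iloc[order[:n_train]].copy()  # iloc: positions, not index labels
    test = frame.iloc[order[n_train:]].copy()
    mean = train[numeric_cols].mean()
    std = train[numeric_cols].std()
    train[numeric_cols] = (train[numeric_cols] - mean) / std
    test[numeric_cols] = (test[numeric_cols] - mean) / std  # reuse the training statistics
    return train, test

# Toy data standing in for the shelter table (hypothetical values).
toy = pd.DataFrame({
    "age": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "is_dog": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})
toy_train, toy_test = split_and_standardize(toy, ["age"])
print(len(toy_train), len(toy_test))
```

With this ordering the training columns have mean 0 and variance 1 exactly, while the test columns are only approximately standardized, which is the expected behaviour.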

In [117]:
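import statsmodels.formula.api as smf  # this import was missing, hence the NameError

# Q("...") quotes column names containing parentheses, which a bare formula cannot parse.
# astype(float) gives the formula interface plain numeric columns for the dummy variables.
mod = smf.logit(formula='outcome_type_Adoption ~ Q("age_upon_outcome_(years)") + Q("age_upon_intake_(years)") + intake_year + intake_number + time_in_shelter_days', data=df_std.astype(float))

res = mod.fit()

# Extract the estimated propensity scores (kept in a separate variable so no extra column ends up in df)
propensity_score = res.predict()

print(res.summary())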

### B) Train a logistic regression classifier on your training set. Logistic regression returns probabilities as predictions, so in order to arrive at a binary prediction, you need to put a threshold on the predicted probabilities. 
- For the decision threshold of 0.5, present the performance of your classifier on the test set by displaying the confusion matrix. Based on the confusion matrix, manually calculate accuracy, precision, recall, and F1-score with respect to the positive and the negative class. 

In [18]:
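# Training features exclude every outcome_type_* dummy: they encode the label and would leak it.
outcome_cols = [c for c in df_80.columns if c.startswith('outcome_type_')]
y = df_80['outcome_type_Adoption']
X_80 = df_80.drop(columns=outcome_cols)
# The test frame keeps the label columns for evaluation; select X_80.columns before predicting.
X_20 = df_20.copy()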

In [12]:
logistic = LogisticRegression(solver='lbfgs')
logistic.fit(X_80, y)

LogisticRegression()

In [38]:
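# Placeholder column for the thresholded predictions. Assigning a scalar avoids the index
# misalignment (and the resulting NaN values) of assigning a freshly indexed DataFrame.
X_20['test_res'] = 0.0
X_20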

Unnamed: 0,age_upon_outcome_(years),age_upon_intake_(years),intake_year,intake_number,time_in_shelter_days,outcome_type_Adoption,outcome_type_Died,outcome_type_Disposal,outcome_type_Euthanasia,outcome_type_Missing,...,intake_type_Owner Surrender,intake_type_Public Assist,intake_type_Stray,intake_type_Wildlife,sex_upon_intake_Intact Female,sex_upon_intake_Intact Male,sex_upon_intake_Neutered Male,sex_upon_intake_Spayed Female,sex_upon_intake_Unknown,test_res
54320,-0.049168,-0.037255,0.432718,-0.278079,1.449688,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0.0
78284,-0.393987,-0.382896,1.967453,-0.278079,-0.396500,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0.0
19680,-0.393987,-0.382896,-1.102017,-0.278079,-0.112720,1,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0.0
53781,-0.725579,-0.715280,0.432718,-0.278079,-0.400782,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0.0
9527,0.295650,0.308386,-1.102017,-0.278079,-0.311626,1,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57388,-0.393987,-0.382896,0.432718,-0.278079,-0.385703,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0.0
54051,0.640468,0.654027,0.432718,-0.278079,-0.305278,1,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0.0
53138,-0.393987,-0.382896,0.432718,-0.278079,-0.308643,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0.0
38290,-0.568758,-0.586493,-0.334649,-0.278079,-0.301679,1,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0.0


In [39]:
th = 0.5 # threshold probability

# Predict using only the columns the model was trained on (X_20 may hold extra columns)
proba = logistic.predict_proba(X_20[X_80.columns])[:, 1]  # P(adopted) for each test row
# logistic.predict(...) would give the same labels, since its default threshold is 0.5
X_20['test_res'] = (proba > th).astype(int)

pred = X_20['test_res'] == 1
actual = X_20['outcome_type_Adoption'] == 1

TP = int((pred & actual).sum())    # True positive: predicted adopted, was adopted
FP = int((pred & ~actual).sum())   # False positive: predicted adopted, was not adopted
TN = int((~pred & ~actual).sum())  # True negative: predicted not adopted, was not adopted
FN = int((~pred & actual).sum())   # False negative: predicted not adopted, was adopted

# Confusion matrix: rows are the actual classes, columns are the predicted classes
pd.DataFrame([[TN, FP], [FN, TP]], index=['Actual NO', 'Actual YES'], columns=['Pred NO', 'Pred YES'])

In [None]:
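acc = (TP+TN)/(TP+TN+FP+FN) # accuracy: fraction of all predictions that are correct
precision = TP/(TP+FP) # fraction of predicted positives that are actually positive
recall = TP/(TP+FN) # fraction of actual positives that are recognized as such
f1 = 2*precision*recall/(precision+recall) # harmonic mean of precision and recall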
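
Part B asks for the metrics with respect to both the positive and the negative class. A minimal sketch of how the four counts could be turned into both sets of metrics follows. The function name `class_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def class_metrics(tp, fp, tn, fn):
    """Return accuracy plus precision, recall and F1 for the positive and negative class."""
    def prf(hit, false_alarm, miss):
        precision = hit / (hit + false_alarm) if hit + false_alarm else 0.0
        recall = hit / (hit + miss) if hit + miss else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "positive": prf(tp, fp, fn),
        # For the negative class the roles swap: TN are the hits, FN are the
        # false alarms (predicted negative, actually positive), FP are the misses.
        "negative": prf(tn, fn, fp),
    }

# Hypothetical counts, chosen only to exercise the function.
m = class_metrics(tp=40, fp=10, tn=45, fn=5)
print(m["accuracy"], m["positive"]["precision"], m["negative"]["precision"])
```

Accuracy is the same for both classes, while precision, recall, and F1 generally differ between them.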

### C) Vary the value of the threshold in the range from 0 to 1 and visualize the value of accuracy, precision, recall, and F1-score (with respect to both classes) as a function of the threshold.
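
One way to approach this part is to recompute the confusion counts at each threshold and collect the metrics in an array. The sketch below does this for the positive class on synthetic labels and scores. The names `metrics_vs_threshold` and `plot_curves`, the synthetic data, and the zero-denominator convention are assumptions. With the real model, the inputs would be the test labels and the predicted probabilities.

```python
import numpy as np

def metrics_vs_threshold(y_true, proba, thresholds):
    """Accuracy, precision, recall and F1 (positive class) at each threshold."""
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    rows = []
    for th in thresholds:
        y_pred = proba > th
        tp = np.sum(y_pred & y_true)
        fp = np.sum(y_pred & ~y_true)
        tn = np.sum(~y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append(((tp + tn) / y_true.size, precision, recall, f1))
    return np.array(rows)  # columns: accuracy, precision, recall, F1

def plot_curves(thresholds, curves):
    """Draw one line per metric; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    for column, name in enumerate(["accuracy", "precision", "recall", "F1"]):
        plt.plot(thresholds, curves[:, column], label=name)
    plt.xlabel("decision threshold")
    plt.ylabel("score")
    plt.legend()
    plt.show()

# Synthetic labels and scores (assumption), standing in for the test set.
rng = np.random.default_rng(0)
toy_true = rng.integers(0, 2, size=200)
toy_proba = np.clip(0.5 * toy_true + rng.normal(0.25, 0.2, size=200), 0, 1)
thresholds = np.linspace(0, 1, 21)
curves = metrics_vs_threshold(toy_true, toy_proba, thresholds)
# plot_curves(thresholds, curves)  # uncomment to draw the figure
```

Recall can only fall as the threshold rises, because fewer rows are predicted positive. The negative-class curves follow from the same counts with the roles of TP/TN and FP/FN swapped.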

### D) Plot in a bar chart the coefficients of the logistic regression sorted by their contribution to the prediction.
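
A possible starting point for this part is to order the coefficients by absolute value and plot them as bars. Treating the absolute coefficient as the "contribution" is an assumption that is only meaningful when the features are on comparable scales. The helper names and the toy coefficients below are hypothetical. With the fitted model one would pass `X_80.columns` and `logistic.coef_[0]` instead.

```python
import numpy as np

def sort_by_contribution(feature_names, coefficients):
    """Order features by absolute coefficient, largest first."""
    coefficients = np.asarray(coefficients, dtype=float)
    order = np.argsort(-np.abs(coefficients))
    return [feature_names[i] for i in order], coefficients[order]

def plot_coefficients(names, values):
    """Bar chart of signed coefficients; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(values)), values)
    plt.xticks(range(len(values)), names, rotation=90)
    plt.ylabel("coefficient")
    plt.tight_layout()
    plt.show()

# Hypothetical coefficients, not results from the shelter data.
toy_names = ["intake_year", "time_in_shelter_days", "animal_type_Dog"]
toy_coefs = [0.1, -1.5, 0.7]
names_sorted, coefs_sorted = sort_by_contribution(toy_names, toy_coefs)
print(names_sorted)
# plot_coefficients(names_sorted, coefs_sorted)  # uncomment to draw the figure
```

Keeping the sign in the plot shows whether a feature pushes the prediction towards or away from adoption, while the ordering reflects magnitude only.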

# Quiz

## Question 1: Which of the following metrics is most suitable when you are dealing with unbalanced classes?

- a) F1 Score
- b) Recall
- c) Precision
- d) Accuracy

Answer: **a)**. Accuracy is informative only when the classes are roughly balanced: it weights both kinds of error (FN and FP) equally, so on a skewed distribution a classifier that always predicts the majority class still scores high. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
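
The point can be illustrated with a small made-up example: 95 negatives, 5 positives, and a classifier that always predicts the negative class. The data and the convention of setting precision and F1 to 0.0 when nothing is predicted positive are assumptions for illustration.

```python
# 95 negatives and 5 positives (made-up data); the classifier always predicts "negative".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions at all
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, recall, f1)  # prints: 0.95 0.0 0.0
```

Accuracy looks excellent even though not a single positive is found, whereas F1 drops to zero.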

## Question 2: You are working on a binary classification problem. You trained a model on a training dataset and got the following confusion matrix on the test dataset. What is true about the evaluation metrics (rounded to the second decimal point):

|            | Pred = NO | Pred = YES |
|------------|-----------|------------|
| Actual NO  |    50     |     10     |
| Actual YES |    5      |    100     |

- a) Accuracy is 0.95
- b) Accuracy is 0.85
- c) False positive rate is 0.95
- d) True positive rate is 0.95

Accuracy is (100+50)/165 = 90.9%, so neither a) nor b) holds. The true positive rate is 100/(100+5) = 95.2% (correctly predicted positives / all actual positives). The false positive rate is 10/(10+50) = 16.7% (FP/(FP+TN), the "probability of falsely rejecting the null hypothesis for a particular test"). Hence, **d)**.
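
The arithmetic can be checked in a few lines, reading the four counts directly from the confusion matrix in the question. The variable names are chosen here for illustration.

```python
# Counts from the confusion matrix: rows are actual classes, columns are predictions.
tn, fp = 50, 10   # Actual NO row
fn, tp = 5, 100   # Actual YES row

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate

print(round(accuracy, 2), round(tpr, 2), round(fpr, 2))  # prints: 0.91 0.95 0.17
```

Only the true positive rate rounds to 0.95.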