#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

In this project you will be divided into small groups (two or three people). You will be pointed to a dataset and asked to create a model to solve a problem. Over the course of the day, your team will explore the data and train the best model you can for solving the problem. At the end of the day, your team will give a short presentation about your solution.

## Overview

### Learning Objectives

* Apply scikit-learn or TensorFlow to a dataset to create a regression model.
* Preprocess data for feeding into a model.
* Use a hand-built model to make predictions.
* Measure the quality of predictions from your model.

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Intermediate Pandas
* Visualizations
* Regression
* Regression with scikit-learn
* Regression with TensorFlow

### Estimated Duration

330 minutes (285 minutes working time, 45 minutes for presentations)

### Deliverables

1. A copy of this Colab notebook containing your code and responses to the ethical considerations below.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Building and Using a Model (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Building and Using a Model

There are 6 demonstrations of competency listed in the problem statement below. Each competency is graded on a 3-point scale for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |

The demonstrations of competency show that the team knows how to use the tools of a data scientist, but they are not a good measure of "thinking like a data scientist". Three additional points will be awarded for the team's skillful application of data science concepts, graded on the following rubric:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Created a generic model with little insight |
| 2      | Performed some basic data science processes and patterns |
| 3      | Demonstrated mastery of data science and exploration concepts learned so far |

#### Ethical Implications

There are six questions in the **Ethical Implications** section. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately considered ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Jesse Stewart*
*   *Huize Huang*
*   *Narangerel Boldbaatar*



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing intake and outcome data](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes) for the [Austin Animal Center](http://www.austintexas.gov/department/aac). In this project we will **use intake data to predict the number of days that an animal is likely to stay in the shelter before being adopted**.

You are free to use any toolkit that we have covered in this class to solve the problem. So far, that includes at least scikit-learn and TensorFlow.

Important details:

* The [dataset](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes) offers three files, one for intakes, one for outcomes, and one that joins the two and adds some additional columns. Feel free to use any combination of the files.
* The column we are trying to predict is 'time_in_shelter_days'.
* Do not use any outcome data as features for training the model. We want to be able to predict the time in shelter for any given animal at intake.
* Not all animals have outcomes. Not all outcomes are adoption.
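
The last two points above can be sketched on a toy frame. This is only an illustration: the column names (`outcome_type`, `outcome_subtype`) and the `"Adoption"` label are assumptions about the joined file and should be checked against the real data before reuse.

```python
import pandas as pd

# Toy stand-in for the joined file. The column names and the "Adoption"
# label are assumptions and should be verified against the real CSV.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Bird"],
    "intake_type": ["Stray", "Stray", "Owner Surrender", "Wildlife"],
    "outcome_type": ["Adoption", "Transfer", "Adoption", None],
    "outcome_subtype": [None, "Partner", None, None],
    "time_in_shelter_days": [4.5, 1.0, 12.25, None],
})

TARGET = "time_in_shelter_days"

# Keep only animals whose recorded outcome is an adoption.
adopted = toy[toy["outcome_type"] == "Adoption"]

# Drop every outcome-derived column (and the target) from the features so
# that no outcome information leaks into training.
outcome_cols = [c for c in adopted.columns if c.startswith("outcome_")]
features = adopted.drop(columns=outcome_cols + [TARGET])
target = adopted[TARGET]

print(features)
print(target)
```

Filtering on a column name prefix is a convenience for this sketch; with the real file it is safer to list the outcome columns explicitly.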

**Graded** demonstrations of competency:
1. Get the data into a Python object.
1. The ability to examine the data programmatically and visually.
1. Perform at least one preprocessing transformation on the data.
1. Creation and training of a regression model.
1. Testing and/or scoring of a model.
1. Model experimentation and tuning: record parameters and objects used along with resulting scores.

### Student Solution

In [0]:
import altair as alt
import numpy as np 
import pandas as pd
import math
import matplotlib
import matplotlib.pyplot as plt
import re
import seaborn as sns
from google.colab import files 

filename = './aac_intakes.csv'
aac_intake = pd.read_csv(filename)

filename = './aac_intakes_outcomes.csv'
aac_outake = pd.read_csv(filename)


In [0]:
aac_intake.describe()

In [0]:
aac_outake.describe()

In [0]:
aac_intake.head()

In [0]:
aac_outake.head()

In [0]:
aac_intake.columns

In [0]:
aac_intake.dtypes

In [0]:
aac_outake.dtypes

# Examine the data

Check for null values

In [0]:
# Count missing values in each column of interest.
# isnull().sum() counts nulls directly and stays correct even when a column is mostly null.
for col in ['age_upon_intake', 'animal_type', 'intake_condition', 'intake_type', 'datetime']:
  print(col + ": " + str(aac_intake[col].isnull().sum()))

In [0]:
# Count missing values in the target and in the numeric age columns
for col in ['time_in_shelter_days', 'age_upon_intake_(years)', 'age_upon_intake_(days)']:
  print(col + ": " + str(aac_outake[col].isnull().sum()))

# Data Preprocessing

Unify age upon intake to days

In [0]:
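def change_intake_age(age):
  # Split strings such as '2 years' into a count and a unit
  count, unit = age.split()
  if 'day' in unit:
    days = int(count)
  elif 'week' in unit:
    days = int(count)*7
  elif 'month' in unit:
    days = int(count)*30  # approximate month length
  elif 'year' in unit:
    days = int(count)*365  # ignores leap years
  else:
    # Fail loudly instead of hitting an undefined variable
    raise ValueError('Unrecognized age unit: ' + age)
  return days

aac_intake['age_upon_intake'] = aac_intake['age_upon_intake'].apply(change_intake_age)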
# aac_outake['age_upon_intake'] = aac_outake['age_upon_intake'].apply(lambda x: change_intake_age(x))
 

Unify age upon intake to months

In [0]:
filename = './aac_intakes.csv'
aac_intake = pd.read_csv(filename)

filename = './aac_intakes_outcomes.csv'
aac_outake = pd.read_csv(filename)

def change_intake_age_month(i):
  i,j = i.split()
  if 'day' in j:
    k = float(i)/30
  elif 'week' in j:
    k = float(i)/4
  elif 'month' in j:
    k = float(i)
  elif 'year' in j:
    k = float(i)*12
  return k

# aac_intake['age_upon_intake'] = aac_intake['age_upon_intake'].apply(lambda x: change_intake_age(x))
aac_outake['age_upon_intake'] = aac_outake['age_upon_intake'].apply(lambda x: change_intake_age_month(x))
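
As an alternative to the row-by-row functions above, the conversion to days can be sketched with vectorized string operations. This is a minimal sketch on a toy Series: `DAYS_PER_UNIT` and `age_to_days` are names made up for this example, and the 30 and 365 day factors are the same approximations used above. Strings that do not match the pattern become `NaN` rather than raising an error.

```python
import pandas as pd

# Approximate days per unit (hypothetical helper table for this sketch).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age_series):
    """Convert strings such as '2 years' to an approximate number of days."""
    # Capture the count and the singular unit name; a trailing 's' is optional.
    parts = age_series.str.extract(
        r"^\s*(?P<count>\d+)\s+(?P<unit>[a-z]+?)s?\s*$"
    )
    count = pd.to_numeric(parts["count"])
    days_per_unit = parts["unit"].map(DAYS_PER_UNIT)
    return count * days_per_unit

ages = pd.Series(["2 years", "1 month", "3 weeks", "5 days", "unknown"])
age_days = age_to_days(ages)
print(age_days)
```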
 

# Data Visualization

In [0]:
plt.plot(aac_outake['age_upon_intake'],aac_outake['time_in_shelter_days'],'b.')

In [0]:
animal_type_uniqueValues, animal_type_occurCount = np.unique(aac_intake.animal_type, return_counts=True)
print("animal_type Unique Values : " , animal_type_uniqueValues)
print("animal_type Occurrence Count : ", animal_type_occurCount)
# The bar chart for this column is drawn in the 2x2 grid further down, once `axes` exists

In [0]:
intake_condition_uniqueValues, intake_condition_occurCount = np.unique(aac_intake.intake_condition, return_counts=True)
print("intake_condition Unique Values : " , intake_condition_uniqueValues)
print("intake_condition Occurrence Count : ", intake_condition_occurCount)
# intake_condition_plot = sns.barplot(intake_condition_uniqueValues,intake_condition_occurCount)

In [0]:
intake_type_uniqueValues, intake_type_occurCount = np.unique(aac_intake.intake_type, return_counts=True)
print("intake_type Unique Values : " , intake_type_uniqueValues)
print("intake_type Occurrence Count : ", intake_type_occurCount)
# intake_type_plot = sns.barplot(intake_type_uniqueValues, intake_type_occurCount)
# intake_type_plot.set_xticklabels(intake_type.get_xticklabels(),rotation=45)

Convert datetime to month only

In [0]:
aac_intake.datetime = aac_intake.datetime.apply(lambda x:int(x.split('-')[1])) 
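
The string split above relies on every value having a `YYYY-MM-...` layout. A slightly more defensive sketch parses the column with `pd.to_datetime` and reads the month from the result. The sample timestamps below are made up, and their format is an assumption about the `datetime` column.

```python
import pandas as pd

# Made-up sample values; the real column format should be checked first.
stamps = pd.Series(["2017-12-07 14:07:00", "2014-04-02 15:55:00", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(stamps, errors="coerce")

# Month number (1-12); rows that failed to parse become NaN.
month = parsed.dt.month
print(month)
```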

In [0]:
datetime_uniqueValues, datetime_occurCount = np.unique(aac_intake.datetime, return_counts=True)
print("datetime Unique Values : " , datetime_uniqueValues)
print("datetime Occurrence Count : ", datetime_occurCount)
# datetime_plot = sns.barplot(datetime_uniqueValues, datetime_occurCount)

In [0]:
# Draw the four category counts in a 2x2 grid.
# Recent seaborn versions require x and y to be passed as keyword arguments.
f, axes = plt.subplots(2, 2, figsize=(15,10))
sns.barplot(x=animal_type_uniqueValues, y=animal_type_occurCount, ax=axes[0,0])
axes[0,0].title.set_text("animal_type")
sns.barplot(x=intake_condition_uniqueValues, y=intake_condition_occurCount, ax=axes[0,1])
axes[0,1].title.set_text("intake_condition")
sns.barplot(x=intake_type_uniqueValues, y=intake_type_occurCount, ax=axes[1,0])
axes[1,0].set_xticklabels(axes[1,0].get_xticklabels(),rotation=45)
axes[1,0].title.set_text("intake_type")
sns.barplot(x=datetime_uniqueValues, y=datetime_occurCount, ax=axes[1,1])
axes[1,1].title.set_text("datetime")
plt.tight_layout()

# Convert String columns to useful numeric features

Convert intake_condition to scale 0-10

In [0]:
aac_outake["intake_condition"].value_counts()

In [0]:
intake_condition_nums = {"Normal": 10, "Injured": 8, "Sick": 6, "Nursing": 7,
                                  "Aged": 9, "Other": 3, "Feral":2,"Pregnant":5}
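# Assign the result back; in-place replace on an attribute-accessed column may not update the frame in newer pandas
aac_outake["intake_condition"] = aac_outake["intake_condition"].replace(intake_condition_nums)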

Convert intake_type to scale 1-5

In [0]:
aac_outake.intake_type.value_counts()

In [0]:
intake_type_nums = {"Stray": 3, "Owner Surrender": 4, "Public Assist": 5, "Wildlife": 2,
                                  "Euthanasia Request": 1}
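# Assign the result back; in-place replace on an attribute-accessed column may not update the frame in newer pandas
aac_outake["intake_type"] = aac_outake["intake_type"].replace(intake_type_nums)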

Convert animal_type to scale 0-4

In [0]:
aac_outake.animal_type.value_counts()

In [0]:
animal_type_nums = {"Dog": 3, "Cat": 4, "Other": 0, "Bird": 1}
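# Assign the result back; in-place replace on an attribute-accessed column may not update the frame in newer pandas
aac_outake["animal_type"] = aac_outake["animal_type"].replace(animal_type_nums)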
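
The numeric scales above impose an order and a spacing on categories that have neither. One common alternative, which is not what this notebook does, is one-hot encoding with `pd.get_dummies`, which gives each category its own 0/1 column. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy rows using category labels that appear in the mappings above.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Bird", "Dog"],
    "intake_type": ["Stray", "Owner Surrender", "Wildlife", "Stray"],
})

# One indicator column per category; no ordering is implied between them.
encoded = pd.get_dummies(toy, columns=["animal_type", "intake_type"], dtype=int)
print(encoded)
```

The trade-off is more columns, and categories that only appear in new data need separate handling.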

# Train Model

**Linear Regression model**

Use age_upon_intake_(days), intake_condition, intake_type, animal_type as features

In [0]:
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(
    aac_outake[['age_upon_intake_(days)','intake_condition','intake_type','animal_type']], 
                                                aac_outake['time_in_shelter_days'],
                                                test_size = 0.2, random_state = 0)

Train the model

In [0]:
from sklearn.linear_model import LinearRegression

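# The `normalize` argument was removed in scikit-learn 1.2. Ordinary least
# squares predictions do not depend on feature scaling, so it is omitted here.
lin_reg = LinearRegression()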
lin_reg.fit(xTrain, yTrain)
lin_reg.coef_, lin_reg.intercept_


Score the model

In [0]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# rmse for test data
days_predict = lin_reg.predict(xTest)
rmse = sqrt(mean_squared_error(yTest, days_predict))
rmse

In [0]:
days_predict.mean()

In [0]:
# rmse for whole data
days_predict = lin_reg.predict(aac_outake[['age_upon_intake_(days)','intake_condition','intake_type','animal_type']])
rmse = sqrt(mean_squared_error(aac_outake['time_in_shelter_days'], days_predict))
rmse

In [0]:
lin_reg.score(aac_outake[['age_upon_intake_(days)','intake_condition','intake_type','animal_type']],aac_outake['time_in_shelter_days'])
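
An RMSE or R^2 value is easier to interpret next to a trivial baseline, and when it is computed on rows the model did not see during training. The following sketch uses synthetic data, not the shelter data, and compares a linear model against scikit-learn's `DummyRegressor`, which always predicts the training mean. The helper `rmse` is defined only for this example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Score both on the held-out rows only.
baseline_rmse = rmse(y_test, baseline.predict(X_test))
model_rmse = rmse(y_test, model.predict(X_test))
model_r2 = r2_score(y_test, model.predict(X_test))

print("baseline RMSE:", baseline_rmse)
print("model RMSE:", model_rmse)
print("model R^2:", model_r2)
```

An R^2 near zero, like the values recorded further down in this notebook, means the model is doing little better than this mean baseline.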

**SGD Regression**

Use age_upon_intake_(years), intake_condition, intake_type, animal_type as features

In [0]:
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(
    aac_outake[['age_upon_intake_(years)','intake_condition','intake_type','animal_type']], 
                                                aac_outake['time_in_shelter_days'],
                                                test_size = 0.2, random_state = 0)

Train the model

In [0]:
from sklearn.linear_model import SGDRegressor

# Create a new Stochastic Gradient Descent regressor
sgd_reg = SGDRegressor(max_iter = 17)

# Fit the model
sgd_reg.fit(xTrain, yTrain)

# Display the slope and intercept
sgd_reg.coef_, sgd_reg.intercept_
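
The scikit-learn documentation notes that stochastic gradient descent is sensitive to feature scaling, which matters when one feature is an age and the others are small integer codes. A minimal sketch, on synthetic data with made-up coefficients, wraps the regressor in a pipeline that standardizes the features first. The `max_iter` and `tol` values here are illustrative, not tuned.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales.
age_days = rng.uniform(0, 5000, size=1000)
code = rng.integers(0, 5, size=1000)
X = np.column_stack([age_days, code])
y = 0.01 * age_days + 3.0 * code + rng.normal(scale=1.0, size=1000)

# Standardize each feature, then fit SGD on the scaled values.
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
scaled_sgd.fit(X, y)

r2_scaled = scaled_sgd.score(X, y)
print("R^2 with scaling:", r2_scaled)
```

Because the scaler lives inside the pipeline, the same transformation is applied automatically whenever `predict` or `score` is called.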

Score the model

In [0]:
# rmse for test data
days_predict = sgd_reg.predict(xTest)
rmse = sqrt(mean_squared_error(yTest, days_predict))
rmse

In [0]:
# rmse for whole data
days_predict = sgd_reg.predict(aac_outake[['age_upon_intake_(years)','intake_condition','intake_type','animal_type']])
rmse = sqrt(mean_squared_error(aac_outake['time_in_shelter_days'], days_predict))
rmse

In [0]:
sgd_reg.score(aac_outake[['age_upon_intake_(years)','intake_condition','intake_type','animal_type']],aac_outake['time_in_shelter_days'])

**Iterations**

Record different attempts at model configurations here:

| Model                    | Parameters    | Score          |
|--------------------------|---------------|----------------|
| sklearn LinearRegression | none          | R^2 = 0.010854 |
| sklearn SGDRegressor     | max_iter = 17 | R^2 = 0.010355 |
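
A table like this can also be built programmatically, which makes it easier to record many attempts consistently. The sketch below is one possible approach on synthetic data: the candidate names and the `alpha` values are hypothetical, and the scores come from 5-fold cross-validation rather than a single split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

# Hypothetical configurations to compare.
candidates = {
    "LinearRegression": LinearRegression(),
    "SGD alpha=0.0001": make_pipeline(
        StandardScaler(), SGDRegressor(alpha=0.0001, random_state=0)),
    "SGD alpha=0.1": make_pipeline(
        StandardScaler(), SGDRegressor(alpha=0.1, random_state=0)),
}

rows = []
for name, estimator in candidates.items():
    # Five R^2 values, one per held-out fold.
    scores = cross_val_score(estimator, X, y, cv=5, scoring="r2")
    rows.append({"model": name, "mean_r2": scores.mean(), "std_r2": scores.std()})

results = pd.DataFrame(rows).sort_values("mean_r2", ascending=False)
print(results.to_string(index=False))
```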

## Exercise 2: Ethical Implications

Even the most basic of models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negatively affect different types of users.

In this section of the project you will reflect on the positive and negative implications of your model.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit from that problem being solved and write a brief narrative about how the model will help.

---

Animals who are less likely to be adopted could benefit from our model. Our model shows that animals might be more likely to be discriminated against because of their age, condition and type upon entering the shelter. By highlighting discrimination, staff can take appropriate measures to promote the adoption of animals that might spend more time in the shelter than their counterparts.

**Negative Impact**

Models don't often have universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

Animals with a normal intake condition might be less likely to be adopted due to our model. This is because our model uses an arbitrary scale that rates the normal intake condition higher than all other conditions. Therefore, the shelter staff may make less of an effort to promote the adoption of animals in this group.

**Bias**

Models can be biased for many reasons. The bias can come from the data used to build the model (e.g. sampling, data collection methods, available sources) and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain both below.

---

One way bias might have been introduced to our model would be through the scale we used for intake condition:

![alt text](https://i.imgur.com/LOiybJ8.png)

This scale introduces experimenter's bias because it assigns arbitrary numeric values to categories that have no natural order.

Another way bias might have been introduced to our model would be through the arbitrary scale we used for intake type:

![alt text](https://i.imgur.com/kTWIAIk.png)

This scale also introduces experimenter's bias because it assigns arbitrary numeric values to categories that have no natural order.
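
The effect of an arbitrary scale can be illustrated with a small synthetic example. The categories `A`, `B`, `C` and their stay lengths are made up, and `r2_for_coding` is a helper defined only for this sketch. The same data is fit twice with a one-feature linear model, changing nothing but the numbers assigned to the categories.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: three categories with noise-free stays of 1, 5 and 2 days.
categories = np.array(["A", "B", "C"] * 20)
stay_days = np.array([1.0, 5.0, 2.0] * 20)

def r2_for_coding(coding):
    """Fit a one-feature linear model using the given category-to-number map."""
    x = np.array([coding[c] for c in categories], dtype=float).reshape(-1, 1)
    return LinearRegression().fit(x, stay_days).score(x, stay_days)

# Same data, two different arbitrary scales.
r2_first = r2_for_coding({"A": 1, "B": 2, "C": 3})
r2_second = r2_for_coding({"A": 1, "B": 3, "C": 2})

print("R^2 with first coding:", r2_first)
print("R^2 with second coding:", r2_second)
```

The two scores differ widely even though the underlying data is identical, so the choice of scale by the experimenter directly shapes what the model can learn.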

**Changing the Dataset to Mitigate Bias**

Biased datasets are one of the primary ways in which bias is introduced to a machine learning model. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What change or changes could you make to your dataset to make it less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of changes that could be made to your input data.

---

Since the data has potential reporting bias, we could change the column for intake condition. Only animals that appear sick may be recorded with the sick condition, which could produce false negatives for animals that look normal but are in fact sick. Instead, the column could explicitly record the animal's perceived appearance.

**Changing the Model to Mitigate Bias**

Is there any way to reduce bias by changing the model itself? This could include modifying algorithmic choices, tweaking hyperparameters, etc.

Write a brief summary of changes that you could make to help reduce bias in your model.

---

Since the model has potential experimenter's bias, we could remove the arbitrary scales we used for intake type and intake condition. Instead, we could rely strictly on naturally numeric data such as datetime or age upon intake.

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

People and systems interpreting and acting on the results of our model should make it a rule to evenly promote the adoption of all shelter animals regardless of the scales our model used for intake condition and intake type.