#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

In this project you will be divided into small groups (two or three people). You will be pointed to a dataset and asked to create a model to solve a problem. Over the course of the day, your team will explore the data and train the best model you can for solving the problem. At the end of the day, your team will give a short presentation about your solution.

## Overview

### Learning Objectives

* Apply scikit-learn or TensorFlow to a dataset to create a regression model.
* Preprocess data for feeding into a model.
* Use a hand-built model to make predictions.
* Measure the quality of predictions from your model.

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Intermediate Pandas
* Visualizations
* Regression
* Regression with scikit-learn
* Regression with TensorFlow

### Estimated Duration

330 minutes (285 minutes working time, 45 minutes for presentations)

### Deliverables

1. A copy of this Colab notebook containing your code and responses to the ethical considerations below.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Building and Using a Model (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Building and Using a Model

There are six demonstrations of competency listed in the problem statement below. Each competency is graded on a 3-point scale, for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |

The demonstrations of competency show that the team knows how to use the tools of a data scientist, but they are not a good measure of "thinking like a data scientist". Three additional points are awarded for the team's skillful application of data science concepts, graded on the following rubric:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Created a generic model with little insight |
| 2      | Performed some basic data science processes and patterns |
| 3      | Demonstrated mastery of data science and exploration concepts learned so far |

#### Ethical Implications

There are six questions in the **Ethical Implications** section. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately considered ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Team Member Placeholder*
*   *Team Member Placeholder*
*   *Team Member Placeholder*



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing intake and outcome data](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes) for the [Austin Animal Center](http://www.austintexas.gov/department/aac). In this project we will **use intake data to predict the number of days that an animal is likely to stay in the shelter before being adopted**.

You are free to use any toolkit that we have covered in this class to solve the problem. At a minimum, that includes scikit-learn and TensorFlow.

Important details:

* The [dataset](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes) offers three files, one for intakes, one for outcomes, and one that joins the two and adds some additional columns. Feel free to use any combination of the files.
* The column we are trying to predict is 'time_in_shelter_days'.
* Do not use any outcome data as features for training the model. We want to be able to predict the time in shelter for any given animal at intake.
* Not all animals have outcomes. Not all outcomes are adoption.
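
The last two points can be sketched on a tiny made-up frame. This is only an illustration, not the required approach: the rows are invented, and the column `age_upon_outcome_(days)` is an assumed name used to stand in for an outcome-derived column. The outcome type is used here only to select rows and to define the label, never as a feature.

```python
import pandas as pd

# Invented rows for illustration. 'age_upon_outcome_(days)' is an assumed
# column name standing in for any outcome-derived column.
sample = pd.DataFrame({
    "age_upon_intake_(days)": [365, 30, 730],
    "intake_month": [1, 6, 11],
    "intake_year": [2015, 2016, 2017],
    "outcome_type": ["Adoption", "Transfer", "Adoption"],
    "age_upon_outcome_(days)": [375, 32, 790],
    "time_in_shelter_days": [10.0, 2.0, 60.0],
})

target = "time_in_shelter_days"

# Keep only animals that were adopted; other outcomes are out of scope.
adopted = sample[sample["outcome_type"] == "Adoption"]

# Treat every column with "outcome" in its name as unavailable at intake,
# and never use the target itself as a feature.
feature_columns = [
    column for column in adopted.columns
    if "outcome" not in column and column != target
]

print(feature_columns)
```

A name-based filter like this is only a first pass. In the real files, some outcome-derived columns may not contain the word "outcome", so the remaining columns still deserve a manual review.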

**Graded** demonstrations of competency:
1. Load the data into a Python object.
1. Examine the data programmatically and visually.
1. Perform at least one preprocessing transformation on the data.
1. Create and train a regression model.
1. Test and/or score the model.
1. Experiment with and tune the model: record the parameters and objects used along with the resulting scores.

### Student Solution

In [0]:
# Use as many text and code blocks as you need to create your solution and take notes.

print("Good luck!")

**Iterations**

Record different attempts at model configurations here:

| Model                        | Parameters                | Score         |
|------------------------------|---------------------------|---------------|
| sklearn LinearRegression     | none                      | R^2 = 0.00123 |
| sklearn SGDRegressor         | batch_size=50, epochs=100 | R^2 = 0.00011 |
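
One possible way to keep a table like this up to date is to log each attempt from code. The sketch below is an assumption about how a team might do it, not a required workflow: the data is synthetic, the parameter descriptions are free-form strings, and the `SGDRegressor` is wrapped in a scaling pipeline because it is sensitive to feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each entry: (label, estimator, human-readable parameter description).
experiments = [
    ("LinearRegression", LinearRegression(), "defaults"),
    (
        "SGDRegressor",
        make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0)),
        "StandardScaler, max_iter=1000",
    ),
]

rows = []
for label, estimator, parameters in experiments:
    estimator.fit(X_train, y_train)
    # score() returns R^2 on the held-out rows for regressors.
    rows.append({
        "model": label,
        "parameters": parameters,
        "r2": estimator.score(X_test, y_test),
    })

iterations = pd.DataFrame(rows)
print(iterations)
```

The resulting `iterations` frame can be copied into the table above by hand or rendered directly in a notebook cell.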

### Answer Key

**Solution**

**Unzip the file**

Students might not have this step codified if they unzipped the file on another machine and uploaded the uncompressed file.

In [0]:
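import zipfile

filename = "aac_intakes_outcomes.csv.zip"

# Extract the archive contents into the current working directory.
with zipfile.ZipFile(filename) as f:
  f.extractall()

# Strip the trailing ".zip" to get the name of the extracted CSV file.
filename = filename[:-4]

print(filename)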

**Load the file into a `DataFrame`**

This isn't strictly required, but so far in the class we have been loading data into `DataFrame`s, so expect students to continue with this pattern.

In [0]:
import pandas as pd

dataframe = pd.read_csv(filename)
dataframe

**Explore the data**

Every project will likely be different, but look for evidence that the students looked at the data. There should be at least one visualization.

In [0]:
dataframe.describe()

In [0]:
dataframe.columns

In [0]:
dataframe.dtypes

In [0]:
dataframe['time_in_shelter_days'].describe()

In [0]:
dataframe['age_upon_intake_(days)'].describe()

In [0]:
import matplotlib.pyplot as plt

plt.plot(dataframe['age_upon_intake_(days)'], dataframe['time_in_shelter_days'], 'b.')
plt.show()

**Preprocess data**

Student solutions will vary, but there should be some data preprocessing. Many of the columns are strings that could be converted to numeric columns.

At the very least, the students should recognize that there are multiple outcome types and that we are only interested in predicting adoption.

In [0]:
dataframe = dataframe[dataframe['outcome_type'] == 'Adoption']
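
Beyond filtering rows, a student might convert a string column into numeric columns. The following is a minimal sketch of one-hot encoding with `pd.get_dummies`; the rows are made up, and `animal_type` is an assumed column name used only for illustration.

```python
import pandas as pd

# Made-up rows; 'animal_type' is an assumed categorical column name.
sample = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Bird"],
    "age_upon_intake_(days)": [365, 30, 730, 90],
})

# Replace the string column with one 0/1 indicator column per category.
encoded = pd.get_dummies(sample, columns=["animal_type"], dtype=int)

print(encoded.columns.tolist())
```

One-hot encoding works best for columns with a small number of distinct values. Columns with many distinct strings would produce a very wide frame and may need grouping first.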

**Build and train the model**

The students are welcome to use scikit-learn and/or TensorFlow. They need to create a regression model, but aren't limited to linear regression. Expect a wide variety of choices at this point.

In the example below, we fit on the full dataset using a closed-form solution. Students might use other models that rely on train/test splits.

In [0]:
from sklearn.linear_model import LinearRegression

features = ['age_upon_intake_(days)', 'intake_month', 'intake_year']
target = 'time_in_shelter_days'

lin_reg = LinearRegression()
lin_reg.fit(dataframe[features], dataframe[target])

**Validate**

Run a test, validation, or scoring check on the model.

In [0]:
lin_reg.score(dataframe[features], dataframe[target])
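
The check above scores the model on the same rows it was trained on. Students who use a train/test split might instead do something like the sketch below. It runs on synthetic data that reuses the three feature names from this notebook; the relationship between age and time in shelter is invented so that the example has something to fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the shelter data; all values are invented.
rng = np.random.default_rng(42)
n = 300
sample = pd.DataFrame({
    "age_upon_intake_(days)": rng.integers(1, 3000, size=n),
    "intake_month": rng.integers(1, 13, size=n),
    "intake_year": rng.integers(2013, 2019, size=n),
})
sample["time_in_shelter_days"] = (
    5 + 0.01 * sample["age_upon_intake_(days)"] + rng.normal(scale=3.0, size=n)
)

features = ["age_upon_intake_(days)", "intake_month", "intake_year"]
target = "time_in_shelter_days"

# Hold out 20% of the rows so the score reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    sample[features], sample[target], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"held-out MAE: {mae:.2f} days, R^2: {r2:.3f}")
```

Mean absolute error is reported alongside R^2 because it is expressed in days, which makes it easier to explain during the presentation.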

**Validation**

In [0]:
# TODO

## Exercise 2: Ethical Implications

Even the most basic of models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negatively affect different types of users.

In this section of the project you will reflect on the positive and negative implications of your model.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit from that problem being solved and write a brief narrative about how the model will help.

---

*Hypothetical entities will benefit because...*

**Negative Impact**

Models don't often have universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

*Hypothetical entity will be negatively impacted because...*

**Bias**

Models can be biased for many reasons. The bias can come from the data used to build the model (e.g., sampling, data collection methods, available sources) and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain both below.

---

*One source of bias in the model could be...*

*Another source of bias in the model could be...*

**Changing the Dataset to Mitigate Bias**

Biased datasets are one of the primary ways in which bias is introduced to a machine learning model. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What change or changes could you make to make your dataset less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of changes that could be made to your input data.

---

*Since the data has potential bias A we can adjust...*

**Changing the Model to Mitigate Bias**

Is there any way to reduce bias by changing the model itself? This could include modifying algorithmic choices, tweaking hyperparameters, etc.

Write a brief summary of changes that you could make to help reduce bias in your model.

---

*Since the model has potential bias A we can adjust...*

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

*Since the predictions have potential bias A we can adjust...*

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# N/A