#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Exploration


For this project, you will be divided into small groups of 2-3 people. You will be given a dataset and an associated problem. Over the course of the day, your team will explore the dataset and train the best model you can in order to solve the problem. At the end of the day, your team will give a short presentation about your model and solution.

## Overview

### Learning Objectives

* Acquire and load dataset(s) into Pandas data structures
* Inspect column descriptions and summary statistics
* Explore data to understand relationships between features
* Draw data insights

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Intermediate Pandas
* Visualizations
* Data Exploration

### Estimated Duration

240 minutes

### Deliverables

1. A **copy of this Colab notebook** containing your code and responses to the ethical considerations below.
1. At the end of the day, we will ask you and your group to stand in front of the class and give a brief **presentation about what you have done**. The presentation can be a code walkthrough, a group discussion, a slide show, or anything else that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Explore data to gain insights (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Explore Data to Gain Insights

There are 6 demonstrations of competency listed in the problem statement below. Each competency is graded on a scale from 0 to 3 for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successfully demonstrated competency |

The demonstrations of competency show that the team knows how to use the tools of a data scientist, but they are not a good measure of "thinking like a data scientist". Three additional points will be awarded based on the team's skillful application of data science concepts, using the following rubric:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Created a generic model with little insight |
| 2      | Performed some basic data science processes and patterns |
| 3      | Demonstrated mastery of data science and exploration concepts learned so far |

#### Ethical Implications

There are six questions in the **Ethical Implications** section. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was irrelevant |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately addressed ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Team Member Placeholder*
*   *Team Member Placeholder*
*   *Team Member Placeholder*



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing US airline on-time statistics and delay data](https://www.kaggle.com/giovamata/airlinedelaycauses) from the [US Department of Transportation's Bureau of Transportation Statistics (BTS)](https://www.bts.gov/). In this project we will *use flight statistics data to gain insights into US airports and airline flights in 2008*.

You are free to use any toolkit that we have covered in class to solve the problem (e.g. Pandas, Matplotlib, Seaborn).

**Graded** demonstrations of competency:
1. Get the data into a Python object.
1. Inspect the data for each column's datatype and summary statistics.
1. Explore the data programmatically and visually.
1. Produce an answer and visualization where applicable for at least 3 questions from the list below (or feel free to come up with your own), and discuss any relevant insights.

  * Which US airport is the busiest? You can decide how you'd like to measure "busyness", e.g. annually, monthly, or daily.
  * Of the 2008 flights that are *actually delayed*, think about:
    * Which 10 US airlines have the most delays?
    * Which 10 US airlines have the longest average delay time?
    * Which 10 US airports have the most delays?
    * Which 10 US airports have the longest average delay time?
  * More analysis:
    * Are there patterns in how flight delays are distributed across different hours of the day?
    * How about across months or seasons? Can you think of any reasons for these seasonal delays?
    * If you look at average delay time or number of delays by airport, does the data show linearity? Does any subset of the data show linearity?
    * Add the reason for delay to your delay analysis above.
    * Examine flight frequencies, delays, time of day or year, etc. for a specific airport, airline, or origin-destination airport pair.

### Student Solution

In [0]:
# Use as many text and code blocks as you need to create your solution.
# Make sure to take notes and add lots of code comments, so your instructor
# understands what you are doing!

print("Good luck!")

### Answer Key

Solutions may vary widely. Below is an example solution.

**Solution**

**Upload dataset**

In [0]:
# Upload the file from your computer to the Colab runtime.

from google.colab import files

uploaded = files.upload()

for name, content in uploaded.items():
  print(f'User uploaded file "{name}" with length {len(content)} bytes')

**Unzip the file**

Students might not have this step codified if they unzipped the file on another machine and uploaded the uncompressed file.

In [0]:
import zipfile

filename = "airlinedelaycauses.zip"

with zipfile.ZipFile(filename) as f:
  f.extractall()

filename = "DelayedFlights.csv"

print(filename)

In [0]:
!ls

**Load the file into a `DataFrame`**

This isn't strictly required, but so far in the class we have been using dataframes, so expect that the students will continue with this pattern.

In [0]:
import pandas as pd

dataframe = pd.read_csv(filename)
dataframe.head()
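
The CSV can be large, so loading only the columns an analysis needs may save memory. The sketch below is one possible approach and not part of the original solution. It uses the `usecols` parameter of `pd.read_csv` on a tiny in-memory CSV with made-up rows so that it runs on its own. The `Origin` column name is an assumption about the dataset; check `dataframe.columns` before relying on it.

```python
import io

import pandas as pd

# Made-up rows standing in for DelayedFlights.csv (illustration only).
csv_text = """Month,UniqueCarrier,Origin,Dest,DepDelay,ArrDelay
1,WN,IAD,TPA,8.0,-14.0
1,WN,IND,BWI,34.0,34.0
2,AA,JFK,LAX,25.0,
"""

# Hypothetical subset of columns needed for an arrival-delay analysis.
wanted_columns = ['Month', 'UniqueCarrier', 'Dest', 'ArrDelay']

# usecols tells read_csv to parse only the listed columns.
small_df = pd.read_csv(io.StringIO(csv_text), usecols=wanted_columns)
print(small_df)
```

With the real file, `io.StringIO(csv_text)` would be replaced by `filename`.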

**Examine the dataset statistics**

In [0]:
dataframe.describe()

In [0]:
dataframe.columns

In [0]:
dataframe.dtypes
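
Students may also want to see missing values next to each column's datatype. The following is a minimal sketch under that assumption. `summarize_columns` is a hypothetical helper, and the sample rows are made up so the block runs without the dataset.

```python
import pandas as pd


def summarize_columns(df):
  """Return one row per column: dtype, missing count, and unique count."""
  return pd.DataFrame({
      'dtype': df.dtypes.astype(str),
      'missing': df.isna().sum(),
      'unique': df.nunique(),
  })


# Made-up sample rows (illustration only).
sample = pd.DataFrame({
    'UniqueCarrier': ['WN', 'WN', 'AA'],
    'DepDelay': [8.0, 34.0, 25.0],
    'ArrDelay': [-14.0, 34.0, None],
})

summary = summarize_columns(sample)
print(summary)
```

In the notebook, `summarize_columns(dataframe)` would give the same overview for the full dataset.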

**Explore the data**

Every project will likely be different, but look for evidence that the students looked at the data. There should be at least one visualization.

In [0]:
dataframe['DepDelay'].describe()

In [0]:
dataframe['ArrTime'].describe()

In [0]:
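import matplotlib.pyplot as plt

# Scatter plot of departure delay against arrival delay (both in minutes).
plt.plot(dataframe['DepDelay'], dataframe['ArrDelay'], 'b.')
plt.xlabel('DepDelay (minutes)')
plt.ylabel('ArrDelay (minutes)')
plt.show()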
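
A scatter plot of departure delay against arrival delay hints at linearity, and a number can back up the visual impression. The sketch below is one way to do that, not the only one. It fits a straight line with `np.polyfit` and reports the Pearson correlation. `linear_fit` is a hypothetical helper and the sample values are made up.

```python
import numpy as np
import pandas as pd


def linear_fit(df, x_col, y_col):
  """Return (slope, intercept, correlation) for two numeric columns."""
  # Drop rows where either value is missing; polyfit cannot handle NaN.
  clean = df[[x_col, y_col]].dropna()
  slope, intercept = np.polyfit(clean[x_col], clean[y_col], deg=1)
  correlation = clean[x_col].corr(clean[y_col])
  return slope, intercept, correlation


# Made-up sample values (illustration only).
sample = pd.DataFrame({
    'DepDelay': [10.0, 20.0, 30.0, 40.0, None],
    'ArrDelay': [5.0, 15.0, 25.0, 35.0, 50.0],
})

slope, intercept, correlation = linear_fit(sample, 'DepDelay', 'ArrDelay')
print(f'slope={slope:.2f} intercept={intercept:.2f} r={correlation:.2f}')
```

A correlation close to 1 together with a slope close to 1 would suggest that arrival delay tracks departure delay almost one-to-one.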

In [0]:
dataframe.groupby(['UniqueCarrier'])[['UniqueCarrier']].count().rename(
    columns={'UniqueCarrier': 'FlightCount'}).sort_values(
        'FlightCount', ascending=False)
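
The same counting pattern can answer the "busiest airport" question. The sketch below is one possible definition of busy: departures plus arrivals per airport. It assumes the dataset has an `Origin` column alongside `Dest`. `airport_traffic` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def airport_traffic(df):
  """Count flights per airport, adding departures and arrivals together."""
  departures = df['Origin'].value_counts()
  arrivals = df['Dest'].value_counts()
  # add(..., fill_value=0) keeps airports that appear in only one column.
  total = departures.add(arrivals, fill_value=0).astype(int)
  return total.sort_values(ascending=False)


# Made-up sample rows (illustration only).
sample = pd.DataFrame({
    'Origin': ['ATL', 'ATL', 'ORD', 'DFW'],
    'Dest': ['ORD', 'DFW', 'ATL', 'LAX'],
})

traffic = airport_traffic(sample)
print(traffic.head(10))
```

Because the file is named `DelayedFlights.csv`, it may not include every 2008 flight, so counts computed this way should be read with that in mind.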

**Searching for insights**

Student solutions will vary, but there should be some preprocessing of the data. Many of the columns contain strings that could be converted to numeric columns.

At the very least, students should recognize that the dataset records more than one kind of delay (for example `DepDelay` and `ArrDelay`), and they should state which one they use and how they decide that a flight is *actually delayed*.

In [0]:
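# Count flights per carrier, then keep the 10 carriers with the most flights.
grouped_df = dataframe.groupby('UniqueCarrier').size().to_frame('FlightCount')
top10_df = grouped_df.sort_values(by='FlightCount', ascending=False).head(10)
top10_df.index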
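
For the "most delays" and "longest average delay" questions, both numbers can come from a single grouped aggregation. The sketch below is an illustration under stated assumptions. It treats a flight as delayed when `ArrDelay > 0`, which is only one possible definition (the answer-key cells filter with `>= 0`). `delay_summary` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def delay_summary(df, group_col, delay_col='ArrDelay'):
  """Count delayed flights and average their delay for each group."""
  # Assumption: a flight counts as delayed when its delay is above zero.
  delayed = df[df[delay_col] > 0]
  summary = delayed.groupby(group_col)[delay_col].agg(
      DelayCount='count', MeanDelay='mean')
  return summary.sort_values('DelayCount', ascending=False)


# Made-up sample rows (illustration only).
sample = pd.DataFrame({
    'UniqueCarrier': ['WN', 'WN', 'WN', 'AA', 'AA'],
    'ArrDelay': [10.0, 30.0, -5.0, 60.0, 0.0],
})

by_carrier = delay_summary(sample, 'UniqueCarrier')
print(by_carrier.head(10))
```

Passing `'Dest'` instead of `'UniqueCarrier'` gives the same summary per destination airport, and sorting by `MeanDelay` ranks groups by average delay instead of by count.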

In [0]:
# Mean arrival delay per month, excluding early arrivals (negative ArrDelay).
delayByMonth = dataframe[dataframe.ArrDelay >= 0].groupby(
    ['Month'])[['ArrDelay']].mean()
delayByMonth
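
The monthly grouping above extends naturally to hours of the day. The sketch below assumes the dataset has a scheduled departure time column named `CRSDepTime` stored as an `hhmm` integer; both the name and the encoding should be verified against `dataframe.columns` and `dataframe.head()`. `mean_delay_by_hour` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def mean_delay_by_hour(df, time_col='CRSDepTime', delay_col='DepDelay'):
  """Average delay for each scheduled departure hour (0-23)."""
  # Assumption: a flight counts as delayed when its delay is above zero.
  delayed = df[df[delay_col] > 0]
  # Times are assumed to be hhmm integers, e.g. 1745 for 5:45 PM.
  hour = (delayed[time_col] // 100).rename('Hour')
  return delayed.groupby(hour)[delay_col].mean()


# Made-up sample rows (illustration only).
sample = pd.DataFrame({
    'CRSDepTime': [615, 640, 1745, 1750, 2305],
    'DepDelay': [10.0, 20.0, 40.0, 60.0, -3.0],
})

by_hour = mean_delay_by_hour(sample)
print(by_hour)
```

Calling `by_hour.plot(kind='bar')` on the result is one way to turn it into the visualization the exercise asks for.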

In [0]:
dataframe[dataframe.ArrDelay >= 0].groupby(
    ['Dest'])[['ArrDelay']].mean().sort_values('ArrDelay', ascending=False)
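
To bring the reason for delay into the analysis, one option is to compare how many delay minutes are attributed to each cause. The sketch below assumes the dataset has the cause columns named in `CAUSE_COLUMNS`; these names are an assumption and should be checked against `dataframe.columns`. `delay_cause_share` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

# Assumed names of the delay-cause columns; verify with dataframe.columns.
CAUSE_COLUMNS = ['CarrierDelay', 'WeatherDelay', 'NASDelay',
                 'SecurityDelay', 'LateAircraftDelay']


def delay_cause_share(df, cause_cols=CAUSE_COLUMNS):
  """Return each cause's share of the total attributed delay minutes."""
  # Missing values are treated as zero minutes for that cause.
  minutes = df[cause_cols].fillna(0).sum()
  return (minutes / minutes.sum()).sort_values(ascending=False)


# Made-up sample rows (illustration only).
sample = pd.DataFrame({
    'CarrierDelay': [10.0, 0.0, None],
    'WeatherDelay': [0.0, 20.0, None],
    'NASDelay': [0.0, 0.0, None],
    'SecurityDelay': [0.0, 0.0, None],
    'LateAircraftDelay': [30.0, 40.0, None],
})

shares = delay_cause_share(sample)
print(shares)
```

Applying the same function to a filtered frame, such as the rows for one carrier or one airport, shows how the mix of causes differs between groups.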

## Exercise 2: Ethical Implications

Even the most basic models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negatively affect different types of users.

In this section of the project, you will reflect on the ethical implications of your model.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit if the problem is solved, and write a brief narrative about how the model will help.

---

*Hypothetical entities will benefit because...*

**Negative Impact**

Models usually don't have a universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

*Hypothetical entity will be negatively impacted because...*

**Bias**

Models can be biased for many reasons. The bias can come from the data used to build the model (e.g. sampling, data collection methods, available sources), and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain them below.

---

*One source of bias in the model could be...*

*Another source of bias in the model could be...*

**Changing the Dataset to Mitigate Bias**

The most common way that a model is biased is when the dataset itself is biased. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What changes could you make to make your dataset less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.

---

*Since the data has potential bias X, we can adjust...*

**Changing the Model to Mitigate Bias**

Are there any ways to reduce bias by changing the model itself? This could include modifying the choice of algorithm, tweaking hyperparameters, etc.

Write a brief summary of any changes that you could make to help reduce bias in your model.

---

*Since the model has potential bias X, we can adjust...*

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

*Since the predictions have potential bias X, we can implement processes...*