#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Exploration


In this project you will be divided into small groups (two or three people). You will be pointed to a dataset and asked to create a model to solve a problem. Over the course of the day, your team will explore the data and train the best model you can for solving the problem. At the end of the day, your team will give a short presentation about your solution.

## Overview

### Learning Objectives

* Acquire and load dataset(s) into Pandas data structures.
* Inspect column descriptions and statistics.
* Explore data to understand relationships between features.
* Draw data insights.

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Intermediate Pandas
* Visualizations
* Data Exploration

### Estimated Duration

240 minutes

### Deliverables

1. A copy of this Colab notebook containing your code and responses to the ethical considerations below.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Explore data to gain insights (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Explore Data to Gain Insights

There are 6 demonstrations of competency listed in the problem statement below. Each competency is graded on a 3 point scale for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |

The demonstrations of competency show that the team knows how to use the tools of a data scientist, but they are not a good measure of "thinking like a data scientist". Three additional points will be awarded for the team's skillful application of data science concepts, graded on the following rubric:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Created a generic model with little insight |
| 2      | Performed some basic data science processes and patterns |
| 3      | Demonstrated mastery of data science and exploration concepts learned so far |

#### Ethical Implications

There are six questions in the **Ethical Implications** section. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately considered ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Team Member Placeholder*
*   *Team Member Placeholder*
*   *Team Member Placeholder*



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing US airline on-time statistics and delay data](https://www.kaggle.com/giovamata/airlinedelaycauses) from the [US Department of Transportation's Bureau of Transportation Statistics (BTS)](https://www.bts.gov/). In this project we will **use flight statistics data to gain insights into US airports and airline flights in 2008**.

You are free to use any toolkit that we have covered in this class to solve the problem. At a minimum, that should include Pandas and either Matplotlib or Seaborn.

**Graded** demonstrations of competency:
1. Get the data into a Python object.
1. Inspect the data for columns' datatype and statistics.
1. Explore the data programmatically and visually.
1. Produce an answer, and a visualization where applicable, for at least 3 questions. Pick from the list of questions below or come up with your own, and discuss any insights you find:

  * Which US airports are the busiest? Decide how you'd like to measure it, e.g., by annual, monthly, or daily flight traffic.
  * Of the 2008 flights that were __actually delayed__, think about:
    * Which 10 US airlines have the most delays, measured by flight count?
    * Which 10 US airlines have the most delays, measured by average length of delay?
    * Similarly, you can rank US airports instead of airlines for the previous questions. Which 10 US airports have the most delays, measured by flight count?
    * Which 10 US airports have the most delays, measured by average length of delay?
  * More analysis:
    * Are there patterns in how flight delays are distributed across different hours of the day?
    * Similarly, how about across months or seasons? Consider correlating with seasonal weather impact, holiday traffic, etc.
    * If you look beyond the top 10 US airlines or airports, does the data show a linear trend as you examine the top 40?
    * Reexamine the figures you worked on above, broken down by reason for delay.
    * Drill down on a particular airport, airline, or even origin and destination airport pair, and examine flight frequencies, delays, time of day or year, etc.
  * Or any other question that your team comes up with.
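
Several of the ranking questions above follow the same filter, group, and sort pattern. The sketch below illustrates it on a small invented table. The column names `UniqueCarrier`, `Origin`, and `ArrDelay` match those used elsewhere in this notebook, but the rows are made up, and "busiest" is assumed here to mean the number of departing flights.

```python
import pandas as pd

# Invented sample rows for illustration only; not taken from the real dataset.
flights = pd.DataFrame({
    'UniqueCarrier': ['WN', 'WN', 'AA', 'AA', 'AA', 'DL'],
    'Origin':        ['ATL', 'ORD', 'ORD', 'ORD', 'DFW', 'ATL'],
    'ArrDelay':      [12.0, -3.0, 45.0, 30.0, 0.0, 8.0],
})

# "Busiest" measured here as the number of departing flights per airport.
busiest = flights['Origin'].value_counts()

# Keep only flights that actually arrived late.
delayed = flights[flights['ArrDelay'] > 0]

# Rank carriers by number of delayed flights and by mean delay length.
by_count = delayed.groupby('UniqueCarrier').size().sort_values(ascending=False)
by_mean = delayed.groupby('UniqueCarrier')['ArrDelay'].mean().sort_values(ascending=False)

print(busiest.head(10))
print(by_count.head(10))
print(by_mean.head(10))
```

Swapping `UniqueCarrier` for `Origin` or `Dest` in the `groupby` calls gives the airport versions of the same rankings.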

### Student Solution

In [0]:
# Use as many text and code blocks as you need to create your solution and take notes.

print("Good luck!")

### Answer Key

Solutions will vary widely. Below is an example solution.

**Solution**

**Upload dataset**

In [0]:
# Upload the file from your computer to the colab runtime

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

**Unzip the file**

Students might not have this step codified if they unzipped the file on another machine and uploaded the uncompressed file.

In [0]:
import zipfile

filename = "airlinedelaycauses.zip"

with zipfile.ZipFile(filename) as f:
  f.extractall()

filename = "DelayedFlights.csv"


print(filename)

In [0]:
!ls

**Load the file into a `DataFrame`**

This isn't strictly required, but so far in the class we have been loading the data into `DataFrame`s so expect that the students will continue with this pattern.

In [0]:
import pandas as pd

dataframe = pd.read_csv(filename)
dataframe.head()

**Examine the dataset statistics**

In [0]:
dataframe.describe()

In [0]:
dataframe.columns

In [0]:
dataframe.dtypes
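
`describe()` leaves missing values out of its statistics, so it can help to count them explicitly as part of inspecting the data. A minimal sketch on invented values (the column names match this notebook, the rows do not come from the dataset):

```python
import numpy as np
import pandas as pd

# Invented sample rows for illustration only.
sample = pd.DataFrame({
    'DepDelay': [5.0, np.nan, 20.0, 15.0],
    'ArrDelay': [3.0, 10.0, np.nan, np.nan],
    'UniqueCarrier': ['WN', 'AA', 'AA', 'DL'],
})

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Fraction of rows that are missing in each column.
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```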

**Explore the data**

Every project will likely be different, but look for evidence that the students looked at the data. There should be at least one visualization.

In [0]:
dataframe['DepDelay'].describe()

In [0]:
dataframe['ArrTime'].describe()

In [0]:
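import matplotlib.pyplot as plt

# Scatter plot of departure delay against arrival delay (both in minutes).
plt.plot(dataframe['DepDelay'], dataframe['ArrDelay'], 'b.')
plt.xlabel('DepDelay (minutes)')
plt.ylabel('ArrDelay (minutes)')
plt.show()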
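
A scatter plot shows the shape of a relationship; a correlation coefficient summarizes its strength in a single number. As a rough sketch, assuming the two delay columns are numeric, the Pearson correlation can be computed as follows (the values below are invented):

```python
import pandas as pd

# Invented sample values for illustration only.
sample = pd.DataFrame({
    'DepDelay': [10.0, 20.0, 30.0, 40.0],
    'ArrDelay': [12.0, 18.0, 33.0, 41.0],
})

# Pearson correlation between the two columns; pairs containing NaN are ignored.
corr = sample['DepDelay'].corr(sample['ArrDelay'])
print(round(corr, 3))
```

A value close to 1 would suggest that flights departing late tend to arrive late by a similar amount.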

In [0]:
# Number of flights per carrier, sorted from most to fewest.
dataframe.groupby(['UniqueCarrier'])[['UniqueCarrier']].count().rename(columns={'UniqueCarrier':'FlightCount'}).sort_values('FlightCount',ascending=False)

**Searching for insights**

Student solutions will vary, but there should be some data pre-processing. Many of the columns are strings that could be converted to numeric columns.

At the very least, the students should recognize that delays can be measured in more than one way (for example, `DepDelay` versus `ArrDelay`) and state which measure they are using.
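
Converting string columns to numeric ones might look like the sketch below. The rows are invented, and `DepTimeText` is a hypothetical column used only to show parsing of digit strings; it is not claimed to exist in the dataset.

```python
import pandas as pd

# Invented sample rows; 'DepTimeText' is a hypothetical column for illustration.
sample = pd.DataFrame({
    'UniqueCarrier': ['WN', 'AA', 'WN', 'DL'],
    'DepTimeText': ['0930', '1745', '0615', '2359'],
})

# Map each carrier string to an integer code via a categorical dtype.
sample['CarrierCode'] = sample['UniqueCarrier'].astype('category').cat.codes

# Parse digit strings into numbers; values that cannot be parsed become NaN.
sample['DepTimeNum'] = pd.to_numeric(sample['DepTimeText'], errors='coerce')

print(sample)
```

Category codes are arbitrary labels, so they suit grouping and plotting better than arithmetic.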

In [0]:
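# Count flights per carrier, then keep the 10 carriers with the most flights.
grouped_df = dataframe.groupby('UniqueCarrier').size().to_frame('Flight Count')
top10_df = grouped_df.sort_values(by=['Flight Count'], ascending=False).head(10)
top10_df.index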


In [0]:
# Mean arrival delay per month, counting only flights that arrived late (ArrDelay > 0).
delayByMonth = dataframe[dataframe.ArrDelay > 0].groupby(['Month'])[['ArrDelay']].mean()

In [0]:
# Mean arrival delay per destination airport for late flights, worst first.
dataframe[dataframe.ArrDelay > 0].groupby(['Dest'])[['ArrDelay']].mean().sort_values('ArrDelay', ascending=False)
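
The same grouping idea extends to the hour-of-day question. The sketch below assumes a scheduled departure time column named `CRSDepTime` stored as an integer in hhmm form (for example, 1745 for 5:45 pm); both the column name and the format are assumptions to verify against the actual data, and the rows are invented.

```python
import pandas as pd

# Invented sample rows; CRSDepTime is assumed to be an integer in hhmm form.
sample = pd.DataFrame({
    'CRSDepTime': [615, 630, 1745, 1750, 2359],
    'ArrDelay':   [5.0, 15.0, 40.0, 60.0, 20.0],
})

# Integer-divide hhmm by 100 to get the scheduled departure hour (0-23).
sample['DepHour'] = sample['CRSDepTime'] // 100

# Mean arrival delay for each scheduled departure hour.
delay_by_hour = sample.groupby('DepHour')['ArrDelay'].mean()

print(delay_by_hour)
```

Calling `delay_by_hour.plot(kind='bar')` on the result is one way to turn the table into a visualization.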

**Validation**

In [0]:
# TODO

## Exercise 2: Ethical Implications

Even the most basic models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negatively affect different types of users.

In this section of the project you will reflect on the positive and negative implications of your model.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit from that problem being solved and write a brief narrative about how the model will help.

---

*Hypothetical entities will benefit because...*

**Negative Impact**

Models don't often have universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

*Hypothetical entity will be negatively impacted because...*

**Bias**

Models can be biased for many reasons. The bias can come from the data used to build the model (e.g., sampling, data collection methods, available sources) and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain both below.

---

*One source of bias in the model could be...*

*Another source of bias in the model could be...*

**Changing the Dataset to Mitigate Bias**

Biased datasets are one of the primary ways in which bias is introduced to a machine learning model. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What change or changes could you make to your dataset to make it less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.

---

*Since the data has potential bias A we can adjust...*

**Changing the Model to Mitigate Bias**

Is there any way to reduce bias by changing the model itself? This could include modifying algorithmic choices, tweaking hyperparameters, etc.

Write a brief summary of changes that you could make to help reduce bias in your model.

---

*Since the model has potential bias A we can adjust...*

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

*Since the predictions have potential bias A we can adjust...*

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# N/A