
#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Exploration


In [0]:
import pandas as pd
import numpy as np
import altair as alt
import math
import matplotlib
import matplotlib.pyplot as plt
import re
import seaborn as sns
import calendar
from vega_datasets import data

In this project you will be divided into small groups (two or three people). You will be pointed to a dataset and asked to create a model to solve a problem. Over the course of the day, your team will explore the data and train the best model you can for solving the problem. At the end of the day, your team will give a short presentation about your solution.

## Overview

### Learning Objectives

* Acquire and load dataset(s) into Pandas structures.
* Inspect column descriptions and statistics.
* Explore the data to understand relationships between features.
* Draw insights from the data.

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Intermediate Pandas
* Visualizations
* Data Exploration

### Estimated Duration

240 minutes

### Deliverables

1. A copy of this Colab notebook containing your code and responses to the ethical considerations below.
1. A group presentation. After everyone is done, we will ask each group to stand in front of the class and give a brief presentation about what they have done in this lab. The presentation can be a code walkthrough, a group discussion, a slide show, or any other means that conveys what you did over the course of the day and what you learned. If you do create any artifacts for your presentation, please share them in the class folder.

### Grading Criteria

This project is graded in separate sections that each contribute a percentage of the total score:

1. Explore data to gain insights (80%)
1. Ethical Implications (10%)
1. Project Presentation (10%)

#### Building and Using a Model

There are 6 demonstrations of competency listed in the problem statement below. Each competency is graded on a 3 point scale for a total of 18 available points. The following rubric will be used:

| Points | Description |
|--------|-------------|
| 0      | No attempt at the competency |
| 1      | Attempted competency, but in an incorrect manner |
| 2      | Attempted competency correctly, but sub-optimally |
| 3      | Successful demonstration of competency |

The demonstrations of competency show that the team knows how to use the tools of a data scientist, but they are not a good measure of "thinking like a data scientist". 3 additional points will be awarded for the team's demonstration of skillful application of data science concepts, graded on the following rubric:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Created a generic model with little insight |
| 2      | Performed some basic data science processes and patterns |
| 3      | Demonstrated mastery of data science and exploration concepts learned so far |

#### Ethical Implications

There are six questions in the **Ethical Implications** section. Each question is worth 2 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at question or answer was off-topic or didn't make sense |
| 1      | Question was answered, but answer missed important considerations  |
| 2      | Answer adequately considered ethical implications |

#### Project Presentation

The project presentation will be graded on participation. All members of a team should actively participate.

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Team Member Placeholder*
*   *Team Member Placeholder*
*   *Team Member Placeholder*



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing US airline on-time statistics and delay data](https://www.kaggle.com/giovamata/airlinedelaycauses) from the [US Department of Transportation's Bureau of Transportation Statistics (BTS)](https://www.bts.gov/). In this project we will **use flight statistics data to gain insights into US airports and airline flights in 2008**.

You are free to use any toolkit that we have covered in this class to solve the problem. That should include at least Pandas and either Matplotlib or Seaborn.

Important details:

* The [dataset](https://www.kaggle.com/giovamata/airlinedelaycauses) consists of one file, DelayedFlights.csv.

**Graded** demonstrations of competency:
1. Get the data into a Python object.
1. Inspect the data for columns' datatype and statistics.
1. Explore the data programmatically and visually.
1. Produce an answer, and a visualization where applicable, for at least 3 questions. Pick from the list of questions below or come up with your own, and discuss any insights you find:

  * Which US airports are the busiest? Decide how you'd like to measure it, e.g., by annual, monthly, or daily flight traffic.
  * Of the 2008 flights that are __actually delayed__, think about:
    * Which 10 US airlines have the most delays, measured by flight count?
    * Which 10 US airlines have the most delays, measured by average length of delay?
    * Similarly, you can rank the top 10 US airports instead of airlines for the previous questions. Which 10 US airports have the most delays, measured by flight count?
    * Which 10 US airports have the most delays, measured by average length of delay?
  * More analysis:
    * Are there patterns in how flight delays are distributed across different hours of the day?
    * Similarly, how about across months or seasons? Maybe correlate with seasonal weather impact, holiday traffic, etc.
    * If you look beyond the top 10 US airlines or airports, does the data show linearity as you examine the top 40 US airlines or airports?
    * Reexamine the figures you worked on above by reason for delay.
    * Drill down on a particular airport, airline, or even origin and arrival airport pair, and examine flight frequencies, delays, time of day or year, etc.
  * Or any questions that your team comes up with.
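
As a starting point for the "top 10" questions, the ranking can be sketched with a Pandas `groupby`. This is only an illustration on made-up rows: the column names `UniqueCarrier` and `ArrDelay` and the 15-minute cutoff for "actually delayed" are assumptions, so check them against the real file and your own definition of a delay.

```python
import pandas as pd

# Toy stand-in for the flight data; column names and values are assumptions.
flights = pd.DataFrame({
    "UniqueCarrier": ["WN", "WN", "AA", "AA", "AA", "UA"],
    "ArrDelay": [20.0, 40.0, 15.0, 90.0, 30.0, 5.0],
})

# Assumed definition: a flight is "actually delayed" at 15+ minutes late.
delayed = flights[flights["ArrDelay"] >= 15]

# One row per airline: number of delayed flights and their mean delay.
summary = (
    delayed.groupby("UniqueCarrier")["ArrDelay"]
    .agg(delayed_flights="count", mean_delay="mean")
)

# Rank the same summary two ways and keep at most 10 airlines each time.
top_by_count = summary.sort_values("delayed_flights", ascending=False).head(10)
top_by_mean = summary.sort_values("mean_delay", ascending=False).head(10)

print(top_by_count)
print(top_by_mean)
```

Swapping `UniqueCarrier` for an airport column gives the airport version of the same ranking.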

### Student Solution

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Read the dataset into a pandas dataframe



In [0]:
dataset_filename = "./DelayedFlights.csv"
flight_df = pd.read_csv(dataset_filename, encoding='latin-1')


In [0]:
df = flight_df
print(df.shape)
df = df.dropna()
df.head(10)
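
Called without arguments, `dropna()` removes every row that has a missing value in any column, which can discard far more rows than intended. The sketch below uses a made-up frame (the `WeatherDelay` column and all values are illustrative assumptions, not taken from the file) to show how `subset` limits the check to the columns the analysis needs.

```python
import numpy as np
import pandas as pd

# Made-up rows; a delay-cause column is assumed to be blank for some flights.
toy = pd.DataFrame({
    "DepDelay": [12.0, 45.0, np.nan, 30.0],
    "Origin": ["ATL", "ORD", "DFW", "ATL"],
    "WeatherDelay": [np.nan, 10.0, 0.0, np.nan],
})

# Drops any row containing at least one NaN, in any column.
all_complete = toy.dropna()

# Drops a row only if DepDelay or Origin is missing.
needed_only = toy.dropna(subset=["DepDelay", "Origin"])

print(len(toy), len(all_complete), len(needed_only))
```

Comparing `df.shape` before and after the call is a quick way to see how many rows a given choice removes.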

In [0]:
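# Keep only the columns needed to rank departure delays by origin airport
origin_delay = df[['DepDelay', 'Origin', 'Month', 'DayofMonth']]

# Distinct origin airports (sorted) and months present in the data
airports = list(set(df['Origin']))
months = list(set(df['Month']))
airports.sort()

# Sort by month, then by departure delay, both descending, so the longest
# delays come first within each month
origin_delay = origin_delay.set_index(['Month', 'DepDelay'])
origin_delay = origin_delay.sort_index(ascending = False)
#origin_delay = origin_delay.groupby(['Month', 'Origin'])[['DepDelay']].mean()

# Index by month alone so origin_delay.loc[month] returns that month's rows
origin_delay = origin_delay.reset_index()
origin_delay = origin_delay.set_index('Month')
origin_delay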


In [0]:
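# Keep only the columns needed to rank arrival delays by destination airport
dest_delay = df[['ArrDelay', 'Dest', 'Month', 'DayofMonth']]

# Distinct destination airports (sorted) and months present in the data
airports = list(set(df['Dest']))
months = list(set(df['Month']))
airports.sort()

# Sort by month, then by arrival delay, both descending, so the longest
# delays come first within each month
dest_delay = dest_delay.set_index(['Month', 'ArrDelay'])
dest_delay = dest_delay.sort_index(ascending = False)
#dest_delay = dest_delay.groupby(['Month', 'Dest'])[['ArrDelay']].mean()

# Index by month alone so dest_delay.loc[month] returns that month's rows
dest_delay = dest_delay.reset_index()
dest_delay = dest_delay.set_index('Month')
dest_delay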

In [0]:
reduced_delay_df = pd.DataFrame()
for month in range(12,0,-1):
  reduced_delay_df = pd.concat([reduced_delay_df, origin_delay.loc[month][:50]])

reduced_delay_df = reduced_delay_df.reset_index()
#d = dict(enumerate(calendar.month_abbr))
#reduced_delay_df['Month'] = reduced_delay_df['Month'].map(d)

reduced_delay_df

In [0]:
reduced_delay_df = pd.DataFrame()
for month in range(12,0,-1):
  reduced_delay_df = pd.concat([reduced_delay_df, origin_delay.loc[month][:50]])

reduced_delay_df = reduced_delay_df.reset_index()

reduced_delay_df['Origin'].nunique()

In [0]:
reduced_arr_delay = pd.DataFrame()
for month in range(12,0,-1):
  reduced_arr_delay = pd.concat([reduced_arr_delay, dest_delay.loc[month][:100]])

reduced_arr_delay = reduced_arr_delay.reset_index()

reduced_arr_delay['Dest'].nunique()

We are dealing with 302 airports. To rank the worst delays, we keep the 50 longest departure delays (and the 100 longest arrival delays) in each month.
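
The per-month selection done by the loops above can also be written without a loop. The following is a minimal sketch on made-up rows, with `N = 2` standing in for the 50 used in this notebook. It is an alternative formulation, not the code that produced the charts below.

```python
import pandas as pd

# Made-up departure delays for two months and three airports.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 2],
    "Origin": ["ATL", "ORD", "DFW", "ATL", "ORD", "DFW"],
    "DepDelay": [30.0, 120.0, 60.0, 15.0, 10.0, 200.0],
})

N = 2  # rows to keep per month (the notebook uses 50)

# Sort so the longest delays come first within each month, then take the
# first N rows of every month group.
top_n = (
    toy.sort_values(["Month", "DepDelay"], ascending=[True, False])
    .groupby("Month")
    .head(N)
    .reset_index(drop=True)
)

print(top_n)
```

Because `groupby(...).head(N)` keeps rows in their sorted order, the result is already ranked within each month.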

In [0]:
alt.Chart(reduced_delay_df.reset_index()).mark_point().encode(
     alt.X('Month:Q'),
     y='DepDelay:Q',
     color='Origin:N',
     tooltip = ['Origin', 'DepDelay']
 ).interactive()

In [0]:
airports = data.airports()
airports = airports[['iata', 'latitude', 'longitude']]
combi = reduced_delay_df.merge(airports, how = 'left', left_on = 'Origin', right_on = 'iata')
combi = combi.drop(columns=['iata', 'DayofMonth'])

airports_arr = data.airports()
airports_arr = airports_arr[['iata', 'latitude', 'longitude']]
arr_combi = reduced_arr_delay.merge(airports_arr, how = 'left', left_on = 'Dest', right_on = 'iata')
arr_combi = arr_combi.drop(columns=['iata', 'DayofMonth'])
#combi

In [0]:
# US states background
states = alt.topo_feature(data.us_10m.url, feature='states')

background = alt.Chart(states).mark_geoshape(
    fill='darkorange',
    stroke='white'
).properties(
    width=550,
    height=600
).project('albersUsa')

#slider = alt.binding_range(list)
months =["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# input_slider = alt.binding_select(options = months)

slider = alt.binding_range(min=1, max=12, step=1)
selection = alt.selection_single(fields=['Month'], bind=slider, name = "Current", init={'Month': 1})

color_1 = alt.condition(selection, alt.Color('DepDelay', aggregate='mean', type='quantitative'),alt.value('black'))
color_2 = alt.condition(selection, alt.Color('ArrDelay', aggregate='mean', type='quantitative'),alt.value('black'))


points_delay = alt.Chart(combi).mark_circle(
    size=70,
).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    tooltip=['Origin', 'DepDelay', 'Month'],
    color = color_1
).add_selection(
    selection
).transform_filter(
    selection
)

points_arrive = alt.Chart(arr_combi).mark_circle(
    size=80,
).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    tooltip=['Dest', 'ArrDelay', 'Month'],
    color = color_2
).add_selection(
    selection
).transform_filter(
    selection
)


(background + points_delay) | (background + points_arrive)

## Exercise 2: Ethical Implications

Even the most basic of models have the potential to affect segments of the population in different ways. It is important to consider how your model might positively and negatively affect different types of users.

In this section of the project you will reflect on the positive and negative implications of your model.

### Student Solution

**Positive Impact**

Your model is trying to solve a problem. Think about who will benefit from that problem being solved and write a brief narrative about how the model will help.

---

This model would let users see the typical delays at specific airports. That can help people, especially frequent travelers, plan their trips better and potentially schedule around particular times. It can also help airlines plan their schedules, routing flights to specific places at certain times (for example, to avoid long weather delays).

**Negative Impact**

Models don't often have universal benefit. Think about who might be negatively impacted by the predictions your model is making. This person or persons might not be directly using the model, but instead might be impacted indirectly.

---

Showing how many delays occur at certain airports could cause an airline or airport to lose passengers. Fewer people departing from one airport would also affect neighboring airports, which would see more travelers and would have to adjust accordingly. Other airlines would likewise have to adjust for more passengers. A secondary impact would be on airport employees: more delays can lead to fewer flights out of the airport, which can result in fewer workers being needed. Finally, some of the outputs of this model depend on many variables, such as weather and season, that are beyond an airport's control.

**Bias**

Models can be biased for many reasons. The bias can come from the data used to build the model (e.g., sampling, data collection methods, available sources) and from the interpretation of the predictions generated by the model.

Think of at least two ways that bias might have been introduced to your model and explain both below.

---

One source of bias in the model could be selection bias. Flights out of a small airport should be viewed differently from flights out of a very large international airport that sees hundreds of flights daily; however, this dataset treats them equally. This skews our data by weighting every flight the same even though the airports differ in scale. Smaller airports also have a greater chance of being in a location with weather that would delay a flight. A second issue is that the data contains little information about longer flights (flights outside the contiguous United States, such as international flights or flights to AK and HI). International flights typically use larger aircraft, which can cause a different sort of delay in taxiing because of a larger passenger count, longer refueling, and thus a longer preparation and turnaround time.

This dataset also doesn't indicate whether any of these flights are connecting flights. If we could follow trend lines for specific airlines, we might be able to understand how the airports those flights take off from influence the rest of their trips that day.

A flight delay is when an airline flight takes off and/or lands later than its scheduled time. The Federal Aviation Administration (FAA) considers a flight to be delayed when it is 15 minutes later than its scheduled time. However, the smallest delay in our data is 6 minutes. This suggests that some automation bias may be involved. (Automation bias is a tendency to favor results generated by automated systems over those generated by non-automated systems, irrespective of the error rates of each.) If employees favor results generated by automated systems, then the systems and models in place could be inaccurate because no one is checking them. This could introduce bias against specific airlines if a departure less than 15 minutes late is still counted as a flight delay by the U.S. Department of Transportation (DOT).
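
One way to check how much the 15-minute definition matters is to count the delays that fall below it. The sketch below uses made-up delay values in minutes. Only the 15-minute threshold comes from the text above, and the constant name is a hypothetical label.

```python
import pandas as pd

# Made-up departure delays in minutes.
dep_delay = pd.Series([6, 9, 14, 15, 22, 47, 130], name="DepDelay")

FAA_THRESHOLD_MIN = 15  # a flight counts as delayed at 15+ minutes late

# Boolean mask: True where the delay meets the 15-minute definition.
is_faa_delayed = dep_delay >= FAA_THRESHOLD_MIN

# Share of recorded delays that would not count under that definition.
share_below = (~is_faa_delayed).mean()

print(is_faa_delayed.sum(), "of", len(dep_delay), "meet the threshold")
print(f"{share_below:.0%} of recorded delays fall below it")
```

Running the same comparison per airline would show whether short delays are concentrated in particular carriers.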

**Changing the Dataset to Mitigate Bias**

Biased datasets are one of the primary ways in which bias is introduced to a machine learning model. Look back at the input data that you fed to your model. Think about how you might change something about the data to reduce bias in your model.

What change or changes could you make to your dataset to make it less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.

---

One change that could be made to the dataset to mitigate bias is to specify how the data was obtained. The Bureau of Transportation Statistics (BTS) wrote that the major data gaps identified in the Statement included statistics on domestic and international flows of freight and passenger traffic by all modes, the extent and performance of intermodal connections, the financial and operating characteristics of smaller carriers, and the costs of both for-hire and private transportation incurred by each sector of the economy. Filling these gaps would make our dataset less biased, as we would have a more specific view of why a flight was delayed and whether the cause was human, weather, or something else. This data is also from 2008, yet the BTS offers a more recent source from 2015. New reports are released every month, so an up-to-date version should usually be available. The final change we could make to mitigate bias is to gain a better understanding of the airlines, since many parent companies have smaller companies underneath them. This would help our analysis by showing whether delays differ between airlines.

**Changing the Model to Mitigate Bias**

Is there any way to reduce bias by changing the model itself? This could include modifying algorithmic choices, tweaking hyperparameters, etc.

Write a brief summary of changes that you could make to help reduce bias in your model.

---

Our model only looked at flight delay times along with the airline, airplane number, and times of arrival and departure, among other details. There is a slight bias in the airline information since there is very little data on overall popularity and traffic. One way to reduce bias is to add a column with the average traffic passing through each airport. A second way to mitigate the bias in our model is to join in a dataset about weather patterns. Such a dataset could be obtained from NOAA, a weather-related news source, or GitHub/Kaggle. Adding this data would help our model identify specific causes of delays and when they occur, which could be used to show when and where to fly.
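
Adding a traffic column amounts to a merge followed by a ratio. Here is a minimal sketch, assuming a separate table of total flights per airport is available. Both tables, their column names, the numbers, and the airport code `XYZ` are hypothetical.

```python
import pandas as pd

# Hypothetical delayed-flight counts per origin airport.
delayed_counts = pd.DataFrame({
    "Origin": ["ATL", "ORD", "XYZ"],
    "delayed_flights": [500, 400, 20],
})

# Hypothetical total traffic per airport, from some other source.
traffic = pd.DataFrame({
    "Origin": ["ATL", "ORD", "XYZ"],
    "total_flights": [5000, 2000, 50],
})

# Left join keeps every airport from the delay table.
merged = delayed_counts.merge(traffic, on="Origin", how="left")

# Share of each airport's flights that were delayed.
merged["delayed_share"] = merged["delayed_flights"] / merged["total_flights"]

ranked = merged.sort_values("delayed_share", ascending=False)
print(ranked)
```

In this toy data the small airport ranks worst by share even though it has the fewest delayed flights, which is the distortion that raw counts hide. A weather table could be joined the same way on airport and date.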

**Mitigating Bias Downstream**

Models make predictions. Downstream processes make decisions. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your model to reduce the bias? Describe these below.

---

One process (and rule) that should be in place for people interpreting these predictions is not to ignore outliers. Numerous flights skewed our results because of extremely long delays (the longest being almost 48 hours). By having humans review these outliers and understand their causes, we could help lower the number of extreme delays. A second rule for people or systems is to review the data annually. Reviewing past data reveals quite a few patterns, including popular days to travel and common days for delays. This would help cross-check a model's predictions, using human judgment to ensure the model is working properly.
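
Flagging outliers for human review can be as simple as an interquartile-range rule. The sketch below uses made-up delays in minutes. The 1.5 × IQR fence is a common convention assumed here for illustration, not something prescribed by this project.

```python
import pandas as pd

# Made-up departure delays in minutes; the last value is about 47 hours.
dep_delay = pd.Series([10, 12, 15, 18, 20, 25, 30, 2800], name="DepDelay")

# First and third quartiles of the delay distribution.
q1, q3 = dep_delay.quantile([0.25, 0.75])

# Conventional upper fence: anything above it is treated as an outlier.
upper_fence = q3 + 1.5 * (q3 - q1)

# Rows to hand to a person for review rather than drop silently.
needs_review = dep_delay[dep_delay > upper_fence]

print(f"upper fence: {upper_fence} minutes")
print(needs_review)
```

Keeping the flagged rows in a separate table, instead of deleting them, preserves the information reviewers need to find the causes.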

