<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/05c-flights-intro.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>


# 5c -- flights intro

Case study: [nycflights13 dataset](https://github.com/tidyverse/nycflights13) -- larger multi-table database with ER diagram.

### References

* [R4DS -- Ch 15: Relational data](https://r4ds.had.co.nz/relational-data.html) -- r4ds.had.co.nz
  * Original data source: [Bureau of Transportation Statistics](https://www.bts.gov/) -- bts.gov
* [sqlite3](https://docs.python.org/3/library/sqlite3.html) API reference -- python.org


In [None]:
import sqlite3
import pandas as pd

# From sqlite3 to pandas

* [Reading tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-tables) -- pandas.pydata.org
* [`pandas.read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table) -- pandas.pydata.org
  * SQLAlchemy provides database abstraction if it's installed.
* [`read_sql_table` needs SQLAlchemy](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-tables) -- pandas.pydata.org
  * SQLite support (the `sqlite3` module) is part of Python’s standard library.
  * Other databases need a driver library:
    * [psycopg2](https://www.psycopg.org/) for PostgreSQL
    * [pymysql](https://github.com/PyMySQL/PyMySQL) for MySQL
* [SQL queries with pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql)
* [write dataframe to sqlite3](https://stackoverflow.com/questions/14431646/how-to-write-pandas-dataframe-to-sqlite-with-index) -- stackoverflow
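
As a minimal sketch of the sqlite3-to-pandas path described above: `pandas.read_sql_query` accepts a plain `sqlite3` connection, so no SQLAlchemy is needed for SQLite. The in-memory database and the three-row `airlines` table are illustrative stand-ins, not the course data.

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch leaves no files behind (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airlines (carrier TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO airlines VALUES (?, ?)",
    [
        ("UA", "United Air Lines Inc."),
        ("AA", "American Airlines Inc."),
        ("B6", "JetBlue Airways"),
    ],
)
conn.commit()

# read_sql_query runs the SQL and returns the result set as a DataFrame.
airlines_df = pd.read_sql_query(
    "SELECT carrier, name FROM airlines ORDER BY carrier", conn
)
conn.close()

print(airlines_df)
```

`read_sql_table`, by contrast, reads a whole table by name and only works through SQLAlchemy.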

# Load a database with pandas

* [.to_sql()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) API reference docs -- pandas.pydata.org
  * Support is provided for sqlite3.Connection objects. 
  * For another RDBMS, use SQLAlchemy.
* [SQLAlchemy](https://docs.sqlalchemy.org/en/13/core/connections.html)
  * SQLAlchemy makes it possible to use any DB supported by that library. 
  * You are responsible for engine disposal and connection closure when using SQLAlchemy.
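
A minimal sketch of loading a DataFrame into SQLite with `.to_sql()`, assuming a plain `sqlite3.Connection`. The three sample rows are copied from the first rows of the `flights` output shown later in this notebook; the table name `flights` and the in-memory database are choices made for illustration.

```python
import sqlite3
import pandas as pd

# Three rows taken from the head of the flights table (subset of columns).
sample = pd.DataFrame(
    {
        "carrier": ["UA", "UA", "AA"],
        "origin": ["EWR", "LGA", "JFK"],
        "dep_delay": [2.0, 4.0, 2.0],
    }
)

conn = sqlite3.connect(":memory:")
try:
    # index=False keeps the DataFrame index out of the SQL table;
    # if_exists="replace" drops and recreates the table if it already exists.
    sample.to_sql("flights", conn, index=False, if_exists="replace")

    n_rows = conn.execute("SELECT COUNT(*) FROM flights").fetchone()[0]
    round_trip = pd.read_sql_query("SELECT * FROM flights", conn)
finally:
    # Close the connection explicitly, even if a statement above fails.
    conn.close()

print(n_rows)
print(round_trip)
```

With SQLAlchemy the pattern is the same, except that an engine replaces the connection and should be disposed of when the work is done.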



## Big data considerations

* Pandas doesn't ["scale" to large datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html)
  * Pandas provides data structures for in-memory analytics
  * For large datasets, you need a parallelization strategy and an appropriate tool.
* [Parquet](https://parquet.apache.org/) works when big problems can be divided into chunks
    * Each chunk is stored as a file that fits into memory
* [Dask](https://dask.org/) has a dataframe API similar to Pandas
    * Dask can use multithreading
    * Dask can scale to distribute jobs on clusters
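    * Dask can sidestep the Python Global Interpreter Lock (GIL) by running tasks in separate processes; its threaded scheduler is still subject to the GIL for pure-Python code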
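
The chunking idea can be sketched with pandas alone. This is neither Parquet nor Dask: it swaps in `read_csv(chunksize=...)` to show the same pattern of processing pieces that each fit in memory and then combining partial results. The inline CSV text is a tiny stand-in built from the first five `origin` and `dep_delay` values of the flights table shown later.

```python
import io
import pandas as pd

# Stand-in for a file too large to load at once (illustrative only).
csv_text = (
    "origin,dep_delay\n"
    "EWR,2.0\n"
    "LGA,4.0\n"
    "JFK,2.0\n"
    "JFK,-1.0\n"
    "LGA,-6.0\n"
)

total = 0.0
count = 0
# chunksize=2 yields DataFrames of at most two rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["dep_delay"].sum()
    count += chunk["dep_delay"].count()  # count() skips missing values

# Combine the partial sums into a global mean.
mean_delay = total / count
print(count, mean_delay)
```

Tools like Dask automate this split-apply-combine bookkeeping and can run the pieces in parallel.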


# nycflights13 dataset

* flights departing NYC in 2013
* [tidyverse GitHub site](https://github.com/tidyverse/nycflights13/raw/master/data-raw/) -- doesn't have a `flights.csv`
* [R script that creates the flights table](https://github.com/tidyverse/nycflights13/blob/master/data-raw/flights.R) -- GitHub
* The script shows (reproducibly) how to recreate the table.
* The script points to the authoritative data source (bts.gov)
* Infinitely better than simply posting a CSV file!

In [None]:
# Course-hosted CSV exports of the nycflights13 tables
base = "http://pbogden.github.io/ds5110/data/nycflights13/"
# Each CSV stores its saved row index as an "Unnamed: 0" column; drop it
flights = pd.read_csv(base + "flights.csv").drop("Unnamed: 0", axis=1)
airlines = pd.read_csv(base + "airlines.csv").drop("Unnamed: 0", axis=1)
weather = pd.read_csv(base + "weather.csv").drop("Unnamed: 0", axis=1)

flights

,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30 14:00:00
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-09-30 22:00:00
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30 12:00:00
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30 11:00:00


In [None]:
flights['origin'].unique()

array(['EWR', 'LGA', 'JFK'], dtype=object)

# EDA of flights

* Interested in flight delays
* Can delays be predicted?
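
A first look at delays might start with simple summaries. The sketch below uses only the five head rows displayed in the `flights` output above, so the numbers say nothing about the full dataset; on the real `flights` DataFrame the same calls apply, and `mean()` skips the missing delays of cancelled flights by default.

```python
import pandas as pd

# Five rows copied from the head of the flights table (subset of columns).
sample = pd.DataFrame(
    {
        "carrier": ["UA", "UA", "AA", "B6", "DL"],
        "origin": ["EWR", "LGA", "JFK", "JFK", "LGA"],
        "dep_delay": [2.0, 4.0, 2.0, -1.0, -6.0],
        "arr_delay": [11.0, 20.0, 33.0, -18.0, -25.0],
    }
)

# Mean departure delay (minutes) per origin airport.
mean_by_origin = sample.groupby("origin")["dep_delay"].mean()

# Fraction of flights that arrived late (positive arrival delay).
share_late = (sample["arr_delay"] > 0).mean()

print(mean_by_origin)
print(share_late)
```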
