## Part 0: Setup

*The Task*     
Most of us were likely on a plane over the holidays, and chances are that at least one of our flights was delayed. So how well do different airlines perform?  We’ll answer that question in the following sections.

*Terminology*      
We’ll generally use `field`, `column`, and `attribute` interchangeably to mean a named column in a DataFrame.  We’ll also generally assume that `table`, `DataFrame`, and `relation` mean the same thing.

## Part 1: Data Extraction and Cleaning
We'll begin by importing the data and doing some basic information extraction and cleaning. This helps prepare our dataset for analysis later on, in Part 2: Data Wrangling and Analysis.

The data files, whose contents are described in the provided notebook _Dataset Descriptions_, are:

* `airports.dat.txt` - data on airports, in comma-separated values (CSV) format with no header row

* `airlines.dat.txt` - data on airlines, in CSV with no header row

* `routes.dat.txt` - data on flight routes, in CSV with no header row

* http://docs.google.com/uc?export=download&id=1PPtjGx8lr_cDUfVa3qwlk1W8yY6hY91n - remote file with data on actual flights, with performance info, with a header row

* `aircraft_incidents.htm` - webpage that lists commercial aircraft incidents by year

## Step 1: Importing CSV Data

The first task will be to import tabular data.

In [1]:
import csv
import pandas as pd

In [2]:
airports_df = pd.DataFrame({})
airlines_df = pd.DataFrame({})
flights_df = pd.DataFrame({})
routes_df = pd.DataFrame({})

## 1.1 Importing Tabular Data

In [3]:
column_names = ['airport_id', 'airport_name', 'city', 'country', 'iata', 'icao', 'lat', 'lon', 'alt', 'timezone', 'dst', 'tz']
airports_df = pd.read_csv('airports.dat.txt', header = None, names = column_names)

Inspect the DataFrame's schema:

In [4]:
airports_df.dtypes

airport_id        int64
airport_name     object
city             object
country          object
iata             object
icao             object
lat             float64
lon             float64
alt               int64
timezone        float64
dst              object
tz               object
dtype: object
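
As a small illustration of what `header = None` and `names` do, the sketch below parses a two-row, in-memory CSV. The sample rows and the shortened column list are made up for the example and are not taken from the data files.

```python
from io import StringIO

import pandas as pd

# Two made-up rows in the same headerless style as airports.dat.txt.
sample_csv = StringIO(
    '1,"Goroka","Papua New Guinea",-6.08\n'
    '2,"Madang","Papua New Guinea",-5.21\n'
)
sample_cols = ["airport_id", "airport_name", "country", "lat"]

# header=None stops pandas from treating the first row as column labels;
# names supplies the labels instead.
sample_df = pd.read_csv(sample_csv, header=None, names=sample_cols)

print(sample_df.shape)
print(list(sample_df.columns))
```

Without `header=None`, the first airport would be consumed as the header row and silently disappear from the data.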

In [5]:
column_names = ['airline_id', 'airline_name', 'alias', 'iata', 'icao', 'airline_callsign', 'country', 'active']
airlines_df = pd.read_csv('airlines.dat.txt', header = None, names = column_names)

In [6]:
airlines_df.dtypes

airline_id           int64
airline_name        object
alias               object
iata                object
icao                object
airline_callsign    object
country             object
active              object
dtype: object

In [7]:
column_names = ['airline', 'airline_id', 'src_iata_icao', 'source_id', 'target_iata_icao', 'target_id', 'code_share', 
               'stops', 'equipment']
routes_df = pd.read_csv('routes.dat.txt', header = None, names = column_names)

In [8]:
routes_df.dtypes

airline             object
airline_id          object
src_iata_icao       object
source_id           object
target_iata_icao    object
target_id           object
code_share          object
stops                int64
equipment           object
dtype: object
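
The `object` dtype on `airline_id`, `source_id`, and `target_id` suggests that at least one value in each of those columns is not numeric. One possible alternative, sketched here on made-up rows, is to let `read_csv` treat the `\N` marker as missing through its `na_values` parameter. This is not what the notebook does (it handles the marker in Step 2 instead), so take it as an illustration only.

```python
from io import StringIO

import pandas as pd

# Made-up rows; the second one uses the \N marker in place of an ID.
rows = "2B,410,2990\n2B,410,\\N\n"
cols = ["airline", "airline_id", "target_id"]

# Default parsing: the \N marker is kept as text, so the column is read as strings.
as_text = pd.read_csv(StringIO(rows), header=None, names=cols)

# With na_values, \N becomes NaN and the column can be parsed as numeric.
as_missing = pd.read_csv(StringIO(rows), header=None, names=cols, na_values=["\\N"])

print(as_text["target_id"].tolist())
print(as_missing["target_id"].isna().sum())
```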

In [9]:
import requests
from io import StringIO

remote = requests.get('https://docs.google.com/uc?export=download&id=1PPtjGx8lr_cDUfVa3qwlk1W8yY6hY91n').content

flights_df = pd.read_csv(StringIO(remote.decode('utf-8')), 
                         usecols = ['YEAR', 'MONTH', 'DAY_OF_MONTH', 'CARRIER', 'FL_NUM','ORIGIN', 'DEST', 'ARR_DELAY_NEW', 'CANCELLED'])
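
`usecols` restricts parsing to the listed columns, which keeps the flights table small. Below is a minimal sketch on in-memory bytes that stand in for the downloaded content; the `TAIL_NUM` column and the row values are invented for the example.

```python
from io import StringIO

import pandas as pd

# Stand-in for the bytes returned by the download; contents are made up.
fake_remote = (
    b"YEAR,CARRIER,TAIL_NUM,ARR_DELAY_NEW\n"
    b"2018,WN,N123,9.0\n"
    b"2018,WN,N456,0.0\n"
)

subset = pd.read_csv(
    StringIO(fake_remote.decode("utf-8")),
    usecols=["CARRIER", "YEAR", "ARR_DELAY_NEW"],
)

# Columns come back in file order, regardless of the order given in usecols.
print(list(subset.columns))
```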


## 1.1 Final Results

In [10]:
airports_df.head(10)

Unnamed: 0,airport_id,airport_name,city,country,iata,icao,lat,lon,alt,timezone,dst,tz
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.7887,20,10.0,U,Pacific/Port_Moresby
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby
3,4,Nadzab,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569828,146.726242,239,10.0,U,Pacific/Port_Moresby
4,5,Port Moresby Jacksons Intl,Port Moresby,Papua New Guinea,POM,AYPY,-9.443383,147.22005,146,10.0,U,Pacific/Port_Moresby
5,6,Wewak Intl,Wewak,Papua New Guinea,WWK,AYWK,-3.583828,143.669186,19,10.0,U,Pacific/Port_Moresby
6,7,Narsarsuaq,Narssarssuaq,Greenland,UAK,BGBW,61.160517,-45.425978,112,-3.0,E,America/Godthab
7,8,Nuuk,Godthaab,Greenland,GOH,BGGH,64.190922,-51.678064,283,-3.0,E,America/Godthab
8,9,Sondre Stromfjord,Sondrestrom,Greenland,SFJ,BGSF,67.016969,-50.689325,165,-3.0,E,America/Godthab
9,10,Thule Air Base,Thule,Greenland,THU,BGTL,76.531203,-68.703161,251,-4.0,E,America/Thule


In [11]:
airlines_df.head(10)

Unnamed: 0,airline_id,airline_name,alias,iata,icao,airline_callsign,country,active
0,1,Private flight,\N,-,,,,Y
1,2,135 Airways,\N,,GNL,GENERAL,United States,N
2,3,1Time Airline,\N,1T,RNX,NEXTIME,South Africa,Y
3,4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
4,5,213 Flight Unit,\N,,TFU,,Russia,N
5,6,223 Flight Unit State Airline,\N,,CHD,CHKALOVSK-AVIA,Russia,N
6,7,224th Flight Unit,\N,,TTF,CARGO UNIT,Russia,N
7,8,247 Jet Ltd,\N,,TWF,CLOUD RUNNER,United Kingdom,N
8,9,3D Aviation,\N,,SEC,SECUREX,United States,N
9,10,40-Mile Air,\N,Q5,MLA,MILE-AIR,United States,Y


In [12]:
routes_df.head(10)

Unnamed: 0,airline,airline_id,src_iata_icao,source_id,target_iata_icao,target_id,code_share,stops,equipment
0,2B,410,AER,2965,KZN,2990,,0,CR2
1,2B,410,ASF,2966,KZN,2990,,0,CR2
2,2B,410,ASF,2966,MRV,2962,,0,CR2
3,2B,410,CEK,2968,KZN,2990,,0,CR2
4,2B,410,CEK,2968,OVB,4078,,0,CR2
5,2B,410,DME,4029,KZN,2990,,0,CR2
6,2B,410,DME,4029,NBC,6969,,0,CR2
7,2B,410,DME,4029,TGK,\N,,0,CR2
8,2B,410,DME,4029,UUA,6160,,0,CR2
9,2B,410,EGO,6156,KGD,2952,,0,CR2


In [13]:
flights_df.head(10)

Unnamed: 0,YEAR,MONTH,DAY_OF_MONTH,CARRIER,FL_NUM,ORIGIN,DEST,ARR_DELAY_NEW,CANCELLED
0,2018,1,2,WN,1325,SJU,MCO,0.0,0.0
1,2018,1,2,WN,5159,SJU,MCO,0.0,0.0
2,2018,1,2,WN,5890,SJU,MCO,9.0,0.0
3,2018,1,2,WN,6618,SJU,MCO,0.0,0.0
4,2018,1,2,WN,1701,SJU,MDW,8.0,0.0
5,2018,1,2,WN,844,SJU,TPA,23.0,0.0
6,2018,1,2,WN,4679,SJU,TPA,0.0,0.0
7,2018,1,2,WN,6294,SLC,BUR,20.0,0.0
8,2018,1,2,WN,5245,SLC,DAL,0.0,0.0
9,2018,1,2,WN,2278,SLC,DEN,0.0,0.0


## 1.2 Importing Text Data

We are going to scrape data from the `aircraft_incidents.htm` webpage using the *beautifulsoup4* package.

In [14]:
from bs4 import BeautifulSoup
input_html = "aircraft_incidents.htm"

# Open file I/O
with open(input_html, "r") as ifile:
    # soup is the bs4 object 
    soup = BeautifulSoup(ifile, 'html.parser')

In [15]:
# Select Year and Incident description. It can be seen that they are usually <h3> or <li> tags.

# Assign the results to variable selected_data. 
selected_data = []

# Query the document once and keep the text of each matching tag
selected_data = [tag.get_text() for tag in soup.find_all(['h3', 'li'])]

Finally, check that all the HTML tags have been removed. Output the message 'No Tag Found!' if successful.

In [16]:
import re

total = 0
for i in selected_data:
    tag = len(re.findall(r'<.*>', i))
    total += tag
    
if total == 0:
    print('No Tag Found!')

No Tag Found!
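
Note that `<.*>` is greedy, so a line containing several tags counts as a single match; that is fine for a zero versus non-zero check. A sketch of a stricter per-tag count follows, wrapped in a hypothetical helper `count_tags` that is not part of the notebook.

```python
import re

# Matches one tag at a time: '<', then any characters except '>', then '>'.
TAG_PATTERN = re.compile(r"<[^>]+>")


def count_tags(lines):
    """Return the total number of tag-like substrings across all lines."""
    return sum(len(TAG_PATTERN.findall(line)) for line in lines)


sample = ["1997[edit]", "a <b>bold</b> word"]
print(count_tags(sample))                    # 2
print(len(re.findall(r"<.*>", sample[1])))   # 1 (the greedy match spans both tags)
```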


In [17]:
# Write selected_data to incidents_raw.txt

with open('incidents_raw.txt', 'w') as file:
    for item in selected_data:
        file.write(item + '\n')


## Step 2: Simple Data Cleaning

Now we need to do some further cleaning to both the CSV and text data.

## 2.1 Cleaning Tabular Data
We are going to clean the `airlines_df`, `airports_df`, and `routes_df` DataFrames.

In [18]:
# Replace NaNs with blanks if the column is a string
# Everything should be of a consistent type

def fillna_col(series):
    # String columns are stored with dtype object
    if series.dtype == object:
        return series.fillna('')
    else:
        return series

In [19]:
## Define a second function called nullify that takes a 
## single parameter x. Given the parameter value \N it 
## returns NaN, otherwise it returns the value of the parameter.

import numpy as np

def nullify(x):
    if x == '\\N':
        return np.nan
    else:
        return x


### 2.1.1 Regularizing and removing nulls

In [20]:
# Regularize and remove nulls according to Step 2

airlines_df = airlines_df.applymap(nullify)
airports_df = airports_df.applymap(nullify)
routes_df = routes_df.applymap(nullify)

airlines_df = airlines_df.apply(fillna_col)
airports_df = airports_df.apply(fillna_col)


routes_df = routes_df.dropna(subset = ['airline_id', 'source_id', 'target_id'])
routes_df = routes_df.apply(fillna_col)
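
To make the two helpers concrete, here is a sketch of the same regularization on a made-up three-row table. It uses `DataFrame.replace` in place of `applymap(nullify)`, a swapped-in alternative that should behave the same for this marker; the column values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "airline_name": pd.Series(["Private flight", "135 Airways", "1Time Airline"], dtype=object),
    "alias": pd.Series(["\\N", "Nextime", None], dtype=object),
    "airline_id": [1, 2, 3],
})

# Step 1: turn the literal \N marker into a real missing value.
toy = toy.replace("\\N", np.nan)

# Step 2: blank out missing values, but only in text (object) columns.
text_cols = [col for col in toy.columns if toy[col].dtype == object]
toy[text_cols] = toy[text_cols].fillna("")

print(toy["alias"].tolist())
print(toy["airline_id"].tolist())
```

Numeric columns are left alone so that a missing number stays `NaN` rather than becoming an empty string mixed in with numbers.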

### 2.1.2 Changing column types

After all of this, `routes_df.airline_id`, `routes_df.source_id`, and `routes_df.target_id` will contain only integer-valued entries, but will still have their existing type of object. Let’s convert them to integers.

In [21]:
routes_df['airline_id'] = routes_df['airline_id'].astype(int)
routes_df['source_id'] = routes_df['source_id'].astype(int)
routes_df['target_id'] = routes_df['target_id'].astype(int)
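
The earlier `dropna` call matters here: `astype(int)` raises an error if any value is still missing. A sketch on made-up ID strings:

```python
import numpy as np
import pandas as pd

ids = pd.Series(["2965", np.nan, "2990"], dtype=object)

# Converting while a NaN is still present fails.
try:
    ids.astype(int)
    failed = False
except (ValueError, TypeError):
    failed = True

# After dropping the missing row, the conversion succeeds.
clean_ids = ids.dropna().astype(int)

print(failed)
print(clean_ids.tolist())
```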

## 2.2 Cleaning the Text Data

We will clean the raw text data we stored in `incidents_raw.txt`. For each incident, we want it in the form:

```
1997 January 9 , Comair Flight 3272, an Embraer EMB 120 Brasília, crashes near Ida, Michigan, during a snowstorm, killing all 29 on board.
```

Points to follow during cleaning:

* Remove `[edit]` from the year

* Only select incidents that occurred in 1997 or later

* Since we extracted the data using tags `<h3>` and `<li>`, it is possible that there was other data extracted too. So filter out unwanted data.
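
One way to picture the first two rules is as a single pass that remembers the most recent year heading and prefixes each following line with it. The sketch below is an alternative formulation under stated assumptions (a year heading is a line that is exactly four digits once `[edit]` is stripped; the function name `label_with_year` and the sample lines are made up). It is not the approach used in the next cell, and it does not filter out unrelated `<li>` text.

```python
def label_with_year(lines, first_year=1997):
    """Prefix each non-heading line with the year heading that precedes it."""
    current_year = None
    labeled = []
    for line in lines:
        text = line.split("[edit]")[0].strip()
        if len(text) == 4 and text.isdigit():
            # A year heading such as "1997[edit]".
            current_year = int(text)
        elif current_year is not None and current_year >= first_year:
            labeled.append(f"{current_year} {text}")
    return labeled


raw = [
    "1996[edit]",
    "January 8 – An older incident that should be skipped.",
    "1997[edit]",
    "January 9 – Comair Flight 3272 crashes near Ida, Michigan.",
]
print(label_with_year(raw))
```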

In [22]:
with open('incidents_raw.txt') as file:
    incidents_raw = file.read().splitlines()


incidents_raw2 = []
for item in incidents_raw:
    if '[edit]' in item:
        incidents_raw2.append(item.split('[edit]')[0])
    else:
        incidents_raw2.append(item)

incidents_raw2 = incidents_raw2[incidents_raw2.index('1997'):]


year = 1997
incidents_raw2.append('2019')
incidents_raw3 = []

while year <= 2018:
    for item in incidents_raw2[(incidents_raw2.index(str(year))+1) : (incidents_raw2.index(str(year+1)))]:
        incidents_raw3.append(str(year) + ' ' + item)
    year = year + 1


months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
        'August', 'September', 'October', 'November', 'December']
incidents_raw4 = []

for item in incidents_raw3:
    if any(month in item for month in months) and ('Flight' in item) and ('–' in item):
        incidents_raw4.append(item)


clean_incidents = []

for item in incidents_raw4:
    clean_incidents.append(item.replace('–', ','))

Now that we have all the aircraft incidents since 1997, we need to convert them into a Pandas DataFrame.

In [23]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
# Convert clean_incidents into dataframe incidents_df

stop_words = set(stopwords.words('english'))
rows = []

for item in clean_incidents:
    whichdate = item.split(',')[0].strip()

    tokens = word_tokenize(item)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    whichairline = ' '.join(filtered_tokens[filtered_tokens.index(',')+1 : filtered_tokens.index('Flight')])
    whichnumber = filtered_tokens[filtered_tokens.index('Flight') + 1]

    rows.append({'Date': whichdate, 'Airline': whichairline, 'FlightNum': whichnumber})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
incidents_df = pd.DataFrame(rows, columns = ['Date', 'Airline', 'FlightNum'])
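
For comparison, the same three fields can be pulled out with a regular expression instead of tokenization. This is a sketch under the assumption that every cleaned line looks like `<date> , <airline> Flight <number>, ...`; the pattern and the helper name `parse_incident` are illustrative only, and unlike the cell above it does not drop stop words from the airline name.

```python
import re

INCIDENT_PATTERN = re.compile(
    r"^(?P<date>\d{4} \w+ \d{1,2})\s*,\s*(?P<airline>.+?) Flight (?P<number>\w+)"
)


def parse_incident(line):
    """Return (date, airline, flight number), or None if the line does not match."""
    match = INCIDENT_PATTERN.match(line)
    if match is None:
        return None
    return match.group("date"), match.group("airline"), match.group("number")


example = "1997 January 9 , Comair Flight 3272, an Embraer EMB 120 Brasília, crashes near Ida, Michigan."
print(parse_incident(example))
```

Returning `None` for non-matching lines makes it easy to spot entries that do not follow the assumed layout instead of failing partway through the loop.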


We want to change the column type and clean the data a bit further.

In [25]:
incidents_df['Date'] = pd.to_datetime(incidents_df['Date'])
incidents_df.drop_duplicates(inplace = True)
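
As a sketch of what this cell does to the `Date` strings, the example below parses a few made-up values with an explicit format. Passing `format` is an optional variation, not something the cell above does, and it assumes English month names.

```python
import pandas as pd

dates = pd.Series(["1997 January 9", "1997 March 18", "1997 January 9"])

# An explicit format avoids relying on pandas to guess the layout.
parsed = pd.to_datetime(dates, format="%Y %B %d")

# Duplicate dates collapse to a single entry.
unique_dates = parsed.drop_duplicates().dt.strftime("%Y-%m-%d").tolist()
print(unique_dates)
```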

## Final Output

The following cells just show what the data looks like.

In [26]:
airports_df.head(10)

Unnamed: 0,airport_id,airport_name,city,country,iata,icao,lat,lon,alt,timezone,dst,tz
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.7887,20,10.0,U,Pacific/Port_Moresby
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby
3,4,Nadzab,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569828,146.726242,239,10.0,U,Pacific/Port_Moresby
4,5,Port Moresby Jacksons Intl,Port Moresby,Papua New Guinea,POM,AYPY,-9.443383,147.22005,146,10.0,U,Pacific/Port_Moresby
5,6,Wewak Intl,Wewak,Papua New Guinea,WWK,AYWK,-3.583828,143.669186,19,10.0,U,Pacific/Port_Moresby
6,7,Narsarsuaq,Narssarssuaq,Greenland,UAK,BGBW,61.160517,-45.425978,112,-3.0,E,America/Godthab
7,8,Nuuk,Godthaab,Greenland,GOH,BGGH,64.190922,-51.678064,283,-3.0,E,America/Godthab
8,9,Sondre Stromfjord,Sondrestrom,Greenland,SFJ,BGSF,67.016969,-50.689325,165,-3.0,E,America/Godthab
9,10,Thule Air Base,Thule,Greenland,THU,BGTL,76.531203,-68.703161,251,-4.0,E,America/Thule


In [27]:
airlines_df.head(10)

Unnamed: 0,airline_id,airline_name,alias,iata,icao,airline_callsign,country,active
0,1,Private flight,,-,,,,Y
1,2,135 Airways,,,GNL,GENERAL,United States,N
2,3,1Time Airline,,1T,RNX,NEXTIME,South Africa,Y
3,4,2 Sqn No 1 Elementary Flying Training School,,,WYT,,United Kingdom,N
4,5,213 Flight Unit,,,TFU,,Russia,N
5,6,223 Flight Unit State Airline,,,CHD,CHKALOVSK-AVIA,Russia,N
6,7,224th Flight Unit,,,TTF,CARGO UNIT,Russia,N
7,8,247 Jet Ltd,,,TWF,CLOUD RUNNER,United Kingdom,N
8,9,3D Aviation,,,SEC,SECUREX,United States,N
9,10,40-Mile Air,,Q5,MLA,MILE-AIR,United States,Y


In [28]:
routes_df.head(10)

Unnamed: 0,airline,airline_id,src_iata_icao,source_id,target_iata_icao,target_id,code_share,stops,equipment
0,2B,410,AER,2965,KZN,2990,,0,CR2
1,2B,410,ASF,2966,KZN,2990,,0,CR2
2,2B,410,ASF,2966,MRV,2962,,0,CR2
3,2B,410,CEK,2968,KZN,2990,,0,CR2
4,2B,410,CEK,2968,OVB,4078,,0,CR2
5,2B,410,DME,4029,KZN,2990,,0,CR2
6,2B,410,DME,4029,NBC,6969,,0,CR2
8,2B,410,DME,4029,UUA,6160,,0,CR2
9,2B,410,EGO,6156,KGD,2952,,0,CR2
10,2B,410,EGO,6156,KZN,2990,,0,CR2


In [29]:
incidents_df.head(10)

Unnamed: 0,Date,Airline,FlightNum
0,1997-01-09,Comair,3272
1,1997-03-18,Stavropolskaya Aktsionernaya Avia,1023
2,1997-04-19,Merpati Nusantara Airlines,106
3,1997-05-08,China Southern Airlines,3456
4,1997-07-31,FedEx Express,14
5,1997-07-17,Sempati Air,304
6,1997-08-06,Korean Air,801
7,1997-08-10,Formosa Airlines,7601
8,1997-09-03,Vietnam Airlines,815
9,1997-09-06,Royal Brunei Airlines,238


## Step 3: Making Data “Persistent”

Now let’s actually save the data in a persistent way, specifically using a relational database.  For simplicity we’ll use SQLite here.

In [30]:
import sqlite3
engine = sqlite3.connect('HW1_DB')

In [31]:
# Use to_sql to save your Dataframes to the HW1_DB

airlines_df.to_sql('airlines', engine, index = False, if_exists = 'replace')
airports_df.to_sql('airports', engine, index = False, if_exists = 'replace')
flights_df.to_sql('flights', engine, index = False, if_exists = 'replace')
routes_df.to_sql('routes', engine, index = False, if_exists = 'replace')
incidents_df.to_sql('incidents', engine, index = False, if_exists = 'replace')
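
A quick way to check that a table was stored is to read it back with a query. The sketch below does a round trip on a made-up two-row table, using an in-memory database so that it does not touch `HW1_DB`; the names `conn`, `toy_routes`, and `round_trip` are illustrative.

```python
import sqlite3

import pandas as pd

# An in-memory database keeps the example from touching HW1_DB on disk.
conn = sqlite3.connect(":memory:")

toy_routes = pd.DataFrame({"airline": ["2B", "2B"], "stops": [0, 0]})
toy_routes.to_sql("routes", conn, index=False, if_exists="replace")

# Read the table back to confirm the rows were stored.
round_trip = pd.read_sql_query("SELECT airline, stops FROM routes", conn)
conn.close()

print(len(round_trip))
```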