# 1. Exploratory Data Analysis

## 1.2. Air Carrier Financial Reports (Form 41 Financial Data)

### 1.2.2. Schedule P-12(a) Fuel

Source: https://www.transtats.bts.gov/Tables.asp?DB_ID=135&DB_Name=Air%20Carrier%20Financial%20Reports%20%28Form%2041%20Financial%20Data%29

<em>Note</em>: Over time, both the code and the name of a carrier may change, and the same code or name may be assumed by a different airline. To ensure that you are analyzing data from the same airline, TranStats provides four airline-specific variables that identify one and only one carrier or its entity: Airline ID (AirlineID), Unique Carrier Code (UniqueCarrier), Unique Carrier Name (UniqueCarrierName), and Unique Entity (UniqCarrierEntity). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.

US Airways and America West started to report combined on-time data in January 2006 and combined traffic and financial data in October 2007 following their 2005 merger announcement. Delta and Northwest began reporting jointly in January 2010 following their 2008 merger announcement. Continental Micronesia was combined into Continental Airlines in December 2010 and joint reporting began in January 2011. Atlantic Southeast and ExpressJet began reporting jointly in January 2012. United and Continental began reporting jointly in January 2012 following their 2010 merger announcement. Endeavor (9E) operated as Pinnacle prior to August 2013. Envoy (MQ) operated as American Eagle prior to April 2014. Southwest (WN) and AirTran (FL) began reporting jointly in January 2015 following their 2011 merger announcement. American (AA) and US Airways (US) began reporting jointly as AA in July 2015 following their 2013 merger announcement. Alaska (AS) and Virgin America (VX) began reporting jointly as AS in April 2018 following their 2016 merger announcement.
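
The note above suggests keying any per-airline analysis on the airline identifier and not on the carrier code. A minimal sketch of that idea follows, using made-up records: the `GALLONS` column, the code `XX`, and the ID values are assumptions for illustration only, while `CARRIER` and `AIRLINE_ID` mirror column names used in TranStats extracts.

```python
import pandas as pd

# Toy records (made up for illustration): the code "XX" is reused by two
# different certificate holders, identified by distinct AIRLINE_ID values.
toy = pd.DataFrame({
    "YEAR": [2005, 2006, 2014, 2015],
    "CARRIER": ["XX", "XX", "XX", "XX"],
    "AIRLINE_ID": [10001, 10001, 10002, 10002],
    "GALLONS": [100.0, 110.0, 50.0, 55.0],  # hypothetical measure
})

# Carrier codes that map to more than one AIRLINE_ID are ambiguous.
ids_per_code = toy.groupby("CARRIER")["AIRLINE_ID"].nunique()
ambiguous_codes = ids_per_code[ids_per_code > 1].index.tolist()

# Aggregating on AIRLINE_ID keeps the two airlines apart.
totals_by_airline = toy.groupby("AIRLINE_ID")["GALLONS"].sum()

print(ambiguous_codes)
print(totals_by_airline)
```

Grouping on `CARRIER` alone would have merged both airlines into a single total, which is the pitfall the note warns about.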
        
- **P12A_fuel**:
    - **Summary**:
        - Air Carrier Financial : Schedule P-12(a)
    - **Description**:
        - This table contains monthly reported fuel costs, and gallons of fuel consumed, by air carrier and category of fuel use, including scheduled and non-scheduled service for domestic and international traffic regions. Data since 2000 are available for major, national, and regional air carriers subject to reporting requirements. For earlier data, go to [Fuel Cost and Consumption 1977-1999](http://www.bts.gov/xml/fuel/report/src/index.xml).
    - **File**:
        - 494124489_T_F41SCHEDULE_P12A.zip
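
Since the table reports both fuel cost and gallons consumed, a unit cost can be derived from it. The sketch below uses hypothetical records and placeholder column names (`FUEL_COST_USD`, `FUEL_GALLONS`, `COST_PER_GALLON`); the actual P-12(a) field names are not shown in this section and may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly records; column names are placeholders, not the
# actual P-12(a) field names.
fuel = pd.DataFrame({
    "YEAR": [2019, 2019, 2019],
    "MONTH": [1, 2, 3],
    "AIRLINE_ID": [10001, 10001, 10001],
    "FUEL_COST_USD": [200.0, 330.0, 0.0],
    "FUEL_GALLONS": [100.0, 150.0, 0.0],
})

# Treat zero gallons as missing so the ratio becomes NaN and not inf.
gallons = fuel["FUEL_GALLONS"].replace(0, np.nan)
fuel["COST_PER_GALLON"] = fuel["FUEL_COST_USD"] / gallons

print(fuel[["YEAR", "MONTH", "COST_PER_GALLON"]])
```

Months with no reported consumption end up with a missing unit cost, so they do not distort later averages.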

___

In [1]:
# Import libraries to be used:

import pandas as pd
import numpy as np
import os.path
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import warnings # warnings.filterwarnings(action='ignore') # https://docs.python.org/3/library/warnings.html#the-warnings-filter
# from zipfile import ZipFile # Not needed so far

In [2]:
# Show all columns and rows in DataFrames
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None) # It greatly slows down the output display and freezes the kernel

# Show in notebook
%matplotlib inline

# style -> plt.style.available
# plt.style.use('seaborn')
plt.style.use('ggplot')

# theme
sns.set_theme(context='notebook',
              style="darkgrid") # {darkgrid, whitegrid, dark, white, ticks}

# color_palette -> https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette
palette = sns.color_palette("flare", as_cmap=True);

In [3]:
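if os.name == 'nt': # Windows
    root = r"C:\Users\turge\CompartidoVM\0.TFM"
    print("Running on Windows.")
elif os.name == 'posix': # Linux/macOS (Ubuntu in this setup)
    root = "/home/dsc/shared/0.TFM"
    print("Running on Ubuntu.")
else: # Fail early instead of hitting a NameError on 'root' below
    raise RuntimeError(f"Unsupported platform: {os.name!r}")
print("root path\t", root)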

Running on Windows.
root path	 C:\Users\turge\CompartidoVM\0.TFM


___

In [140]:
csv_path = os.path.join(root,
                        "Raw_Data",
                        "US_DoT",
                        "494124489_T_F41SCHEDULE_P12A.zip")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\494124489_T_F41SCHEDULE_P12A.zip'

In [141]:
# 'pd.read_csv' reads zipped CSV files directly, so no manual extraction is needed.
# If an undesired extra blank column gets loaded, read the header first and drop it via 'usecols':
# cols = pd.read_csv(csv_path, nrows=1).columns
df4 = pd.read_csv(csv_path,
                  encoding='latin1')
                  # usecols=cols[:-1]) # This way, the extra column is skipped at load time
df4

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER
0,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7858,N202PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-28,20397.0,16
1,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7860,N206PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-30,20397.0,16
2,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7873,N207PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-11-26,20397.0,16
3,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7874,N209PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-04,20397.0,16
4,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7879,N213PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-16,20397.0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100489,2019,YX,Republic Airline,2008,Republic Airline,17000256,N244JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-11-04,20452.0,YX
100490,2019,YX,Republic Airline,2008,Republic Airline,17000214,N227JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-11-26,20452.0,YX
100491,2019,YX,Republic Airline,2008,Republic Airline,17000236,N236JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-10-31,20452.0,YX
100492,2019,YX,Republic Airline,2008,Republic Airline,17000259,N245JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-12-03,20452.0,YX
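
The commented-out `usecols` trick in the cell above can be tried in isolation. The following is a minimal sketch on a made-up zipped CSV whose lines end with a trailing comma, a common cause of the extra blank column. The file name, the sample rows, and the choice to filter on the `Unnamed` prefix (instead of dropping the last column by position) are assumptions for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Made-up CSV whose lines end with a trailing comma, which makes pandas
# create an extra, empty "Unnamed: N" column.
raw_csv = "YEAR,CARRIER,\n2006,16,\n2019,YX,\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("example.csv", raw_csv)

    # Read only the header to learn the column names (compression is
    # inferred from the ".zip" extension).
    header_cols = pd.read_csv(zip_path, nrows=0).columns
    wanted_cols = [c for c in header_cols if not c.startswith("Unnamed")]

    # Load the file again, keeping only the named columns. Reading CARRIER
    # as text avoids mixing numeric-looking codes ("16") with "YX".
    example_df = pd.read_csv(zip_path, usecols=wanted_cols,
                             dtype={"CARRIER": str})

print(list(header_cols))
print(example_df)
```

Filtering by name is slightly more defensive than `cols[:-1]`, because it leaves the columns untouched when the file has no trailing blank column.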


In [142]:
df4.describe()

Unnamed: 0,YEAR,MANUFACTURE_YEAR,NUMBER_OF_SEATS,CAPACITY_IN_POUNDS,AIRLINE_ID
count,100494.0,100494.0,100487.0,100393.0,100486.0
mean,2012.490706,1998.59524,108.11703,67318.04842,20050.33132
std,4.071978,39.631486,76.979477,82383.16767,374.019353
min,2006.0,0.0,0.0,0.0,19386.0
25%,2009.0,1993.0,50.0,34100.0,19805.0
50%,2012.0,2001.0,124.0,41226.0,19977.0
75%,2016.0,2005.0,155.0,75445.0,20366.0
max,2019.0,2019.0,737.0,875000.0,21974.0


In [143]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100494 entries, 0 to 100493
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   YEAR                 100494 non-null  int64  
 1   CARRIER              100445 non-null  object 
 2   CARRIER_NAME         100486 non-null  object 
 3   MANUFACTURE_YEAR     100494 non-null  int64  
 4   UNIQUE_CARRIER_NAME  100486 non-null  object 
 5   SERIAL_NUMBER        100494 non-null  object 
 6   TAIL_NUMBER          100494 non-null  object 
 7   AIRCRAFT_STATUS      100494 non-null  object 
 8   OPERATING_STATUS     100494 non-null  object 
 9   NUMBER_OF_SEATS      100487 non-null  float64
 10  MANUFACTURER         100494 non-null  object 
 11  MODEL                100483 non-null  object 
 12  CAPACITY_IN_POUNDS   100393 non-null  float64
 13  ACQUISITION_DATE     99917 non-null   object 
 14  AIRLINE_ID           100486 non-null  float64
 15  UNIQUE_CARRIER       100427 non-null  object 

Several columns contain missing values. Let's look at them more closely:

In [144]:
# Absolute number of missing values by column:
df4.isna().sum()

YEAR                     0
CARRIER                 49
CARRIER_NAME             8
MANUFACTURE_YEAR         0
UNIQUE_CARRIER_NAME      8
SERIAL_NUMBER            0
TAIL_NUMBER              0
AIRCRAFT_STATUS          0
OPERATING_STATUS         0
NUMBER_OF_SEATS          7
MANUFACTURER             0
MODEL                   11
CAPACITY_IN_POUNDS     101
ACQUISITION_DATE       577
AIRLINE_ID               8
UNIQUE_CARRIER          67
dtype: int64

In [145]:
# Percentage of missing values by column:
df4.isna().sum() / len(df4) * 100

YEAR                   0.000000
CARRIER                0.048759
CARRIER_NAME           0.007961
MANUFACTURE_YEAR       0.000000
UNIQUE_CARRIER_NAME    0.007961
SERIAL_NUMBER          0.000000
TAIL_NUMBER            0.000000
AIRCRAFT_STATUS        0.000000
OPERATING_STATUS       0.000000
NUMBER_OF_SEATS        0.006966
MANUFACTURER           0.000000
MODEL                  0.010946
CAPACITY_IN_POUNDS     0.100504
ACQUISITION_DATE       0.574164
AIRLINE_ID             0.007961
UNIQUE_CARRIER         0.066671
dtype: float64
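
The two views above (absolute counts and percentages) can also be combined into a single table that lists only the affected columns. A minimal sketch on a toy DataFrame follows; the sample values and the names `n_missing` and `pct_missing` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps (made up for illustration).
toy = pd.DataFrame({
    "YEAR": [2006, 2007, 2008, 2009],
    "CARRIER": ["16", None, "YX", "YX"],
    "NUMBER_OF_SEATS": [50.0, np.nan, np.nan, 76.0],
})

# Count and percentage of missing values per column, restricted to
# columns that actually have gaps and sorted by severity.
missing_summary = (
    pd.DataFrame({
        "n_missing": toy.isna().sum(),
        "pct_missing": toy.isna().mean() * 100,
    })
    .query("n_missing > 0")
    .sort_values("n_missing", ascending=False)
)

print(missing_summary)
```

Here `isna().mean()` is equivalent to `isna().sum() / len(toy)`, since the mean of a boolean column is the share of `True` values.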