# TODO: UPDATE HEADER !!!

# 1. Exploratory Data Analysis

## 1.2. Air Carrier Financial Reports (Form 41 Financial Data)

### 1.2.2. Schedule P-12(a) Fuel

Source: https://www.transtats.bts.gov/Tables.asp?DB_ID=135&DB_Name=Air%20Carrier%20Financial%20Reports%20%28Form%2041%20Financial%20Data%29

<em>Note</em>: Over time both the code and the name of a carrier may change and the same code or name may be assumed by a different airline. To ensure that you are analyzing data from the same airline, TranStats provides four airline-specific variables that identify one and only one carrier or its entity: Airline ID (AirlineID), Unique Carrier Code (UniqueCarrier), Unique Carrier Name (UniqueCarrierName), and Unique Entity (UniqCarrierEntity). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. US Airways and America West started to report combined on-time data in January 2006 and combined traffic and financial data in October 2007 following their 2005 merger announcement. Delta and Northwest began reporting jointly in January 2010 following their 2008 merger announcement. Continental Micronesia was combined into Continental Airlines in December 2010 and joint reporting began in January 2011. Atlantic Southeast and ExpressJet began reporting jointly in January 2012. United and Continental began reporting jointly in January 2012 following their 2010 merger announcement. Endeavor (9E) operated as Pinnacle prior to August 2013. Envoy (MQ) operated as American Eagle prior to April 2014. Southwest (WN) and AirTran (FL) began reporting jointly in January 2015 following their 2011 merger announcement. American (AA) and US Airways (US) began reporting jointly as AA in July 2015 following their 2013 merger announcement. Alaska (AS) and Virgin America (VX) began reporting jointly as AS in April 2018 following their 2016 merger announcement.
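As a rough illustration of the note above, the following sketch shows one way to detect carrier codes that have been used by more than one airline, by counting distinct `AIRLINE_ID` values per `CARRIER` code. The column names match those in the Form 41 tables; the rows, carrier names, and IDs are hypothetical and only serve to make the example self-contained.

```python
import pandas as pd

# Hypothetical toy rows: column names follow the Form 41 tables, values are invented
toy = pd.DataFrame({
    "AIRLINE_ID":   [10001, 10001, 10002, 10003],
    "CARRIER":      ["XX", "XX", "XX", "YY"],
    "CARRIER_NAME": ["Old Air", "Old Air", "New Air", "Other Air"],
    "YEAR":         [1995, 1996, 2010, 2010],
})

# Number of distinct airline IDs behind each carrier code
ids_per_code = toy.groupby("CARRIER")["AIRLINE_ID"].nunique()

# Codes that have been used by more than one airline over time
reused_codes = ids_per_code[ids_per_code > 1].index.tolist()
print(reused_codes)
```

If such codes exist in the real data, grouping by `AIRLINE_ID` instead of `CARRIER` is likely the safer choice for per-airline aggregates.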
        
- **P12A_fuel**:
    - **Summary**:
        - Air Carrier Financial : Schedule P-12(a)
    - **Description**:
        - This table contains monthly reported fuel costs, and gallons of fuel consumed, by air carrier and category of fuel use, including scheduled and non-scheduled service for domestic and international traffic regions. Data since 2000 are available for major, national, and regional air carriers subject to reporting requirements. For earlier data, go to [Fuel Cost and Consumption 1977-1999](http://www.bts.gov/xml/fuel/report/src/index.xml).
    - **File**:
        - 494124489_T_F41SCHEDULE_P12A.zip

___

In [1]:
# Import libraries to be used:

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import warnings # warnings.filterwarnings(action='ignore') # https://docs.python.org/3/library/warnings.html#the-warnings-filter
# from zipfile import ZipFile # Not needed so far

In [2]:
# Show all columns and rows in DataFrames
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None) # It greatly slows down the output display and freezes the kernel

# Show in notebook
%matplotlib inline

# style -> plt.style.available
# plt.style.use('seaborn')
plt.style.use('ggplot')

# theme
sns.set_theme(context='notebook',
              style="darkgrid") # {darkgrid, whitegrid, dark, white, ticks}

# color_palette -> https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette
palette = sns.color_palette("flare", as_cmap=True);

In [3]:
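if os.name == 'nt': # Windows
    root = r"C:\Users\turge\CompartidoVM\0.TFM"
    print("Running on Windows.")
elif os.name == 'posix': # Ubuntu
    root = "/home/dsc/shared/0.TFM"
    print("Running on Ubuntu.")
else: # Fail early instead of hitting a NameError on 'root' below
    raise OSError(f"Unsupported platform: {os.name}")
print("root path\t", root)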

Running on Windows.
root path	 C:\Users\turge\CompartidoVM\0.TFM


___

In [4]:
csv_path = os.path.join(root,
                        "Raw_Data",
                        "US_DoT",
                        "494124489_T_F41SCHEDULE_P52.zip")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\494124489_T_F41SCHEDULE_P52.zip'

In [5]:
# 'pd.read_csv' reads zipped CSV files directly, so no manual extraction is needed.
# A plain import loads an unwanted blank trailing column, so read the header row first:
cols = pd.read_csv(csv_path, nrows=1).columns
df5 = pd.read_csv(csv_path,
                  encoding='latin1',
                  usecols=cols[:-1]) # Load every column except the trailing blank one
df5

Unnamed: 0,PILOT_FLY_OPS,OTH_FLT_FLY_OPS,TRAIN_FLY_OPS,PERS_EXP_FLY_OPS,PRO_FLY_OPS,INTERCHG_FLY_OPS,FUEL_FLY_OPS,OIL_FLY_OPS,RENTAL_FLY_OPS,OTHER_FLY_OPS,INS_FLY_OPS,BENEFITS_FLY_OPS,INCIDENT_FLY_OPS,PAY_TAX_FLY_OPS,OTH_TAX_FLY_OPS,OTHER_EXP_FLY_OPS,TOT_FLY_OPS,AIRFRAME_LABOR,ENGINE_LABOR,AIRFRAME_REPAIR,ENGINE_REPAIRS,INTERCHG_CHARG,AIRFRAME_MATERIALS,ENGINE_MATERIALS,AIRFRAME_ALLOW,AIRFRAME_OVERHAULS,ENGINE_ALLOW,ENGINE_OVERHAULS,TOT_DIR_MAINT,AP_MT_BURDEN,TOT_FLT_MAINT_MEMO,NET_OBSOL_PARTS,AIRFRAME_DEP,ENGINE_DEP,PARTS_DEP,ENG_PARTS_DEP,OTH_FLT_EQUIP_DEP,OTH_FLT_EQUIP_DEP_GRP_I,FLT_EQUIP_A_EXP,FLY_OPS_EXP_I_A,TOT_AIR_OP_EXPENSES,DEV_N_PREOP_EXP,OTH_INTANGIBLES,EQUIP_N_HANGAR_DEP,G_PROP_DEP,CAP_LEASES_DEP,TOTAL_AIR_HOURS,AIR_DAYS_ASSIGN,AIR_FUELS_ISSUED,AIRCRAFT_CONFIG,AIRCRAFT_GROUP,AIRCRAFT_TYPE,AIRLINE_ID,UNIQUE_CARRIER,UNIQUE_CARRIER_NAME,CARRIER,CARRIER_NAME,UNIQUE_CARRIER_ENTITY,REGION,CARRIER_GROUP_NEW,CARRIER_GROUP,YEAR,QUARTER
0,,,,,,,,,587.0,,,,,,,,587.0,,,,,,,,,,,,,,,,,,,,,,,,587.0,,,,,,,,,1,4,441,19704,CO,Continental Air Lines Inc.,CO,Continental Air Lines Inc.,01220,D,3,3,1990,1
1,,,,,,,,,4.0,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,1,4,450,19704,CO,Continental Air Lines Inc.,CO,Continental Air Lines Inc.,01220,D,3,3,1990,1
2,,,,,,,,,13.0,,,,,,,,13.0,,,,,,,,,,,,,,,,,,,,,,,,13.0,,,,,,,,,1,4,454,19704,CO,Continental Air Lines Inc.,CO,Continental Air Lines Inc.,01220,D,3,3,1990,1
3,,,,,,,,,58.0,,,,,,,,58.0,,,,,,,,,,,,,,,,,,,,,,,,58.0,,,,,,,,,1,4,461,19704,CO,Continental Air Lines Inc.,CO,Continental Air Lines Inc.,01220,D,3,3,1990,1
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1446.0,2014.0,843.0,11775.0,55.0,,,,9,9,999,19704,CO,Continental Air Lines Inc.,CO,Continental Air Lines Inc.,01220,D,3,3,1990,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54910,376414.0,0.0,2721.0,9422.0,0.0,0.0,140325.0,682.0,18811.0,0.0,3919.0,61224.0,0.0,16865.0,3846.0,0.0,634229.0,12912.0,-38.0,18324.0,709.0,0.0,9904.0,1032.0,0.0,0.0,0.0,0.0,42843.0,38044.0,80887.0,1364.0,16588.0,34862.0,24.0,0.0,759.0,,0.0,0.0,768713.0,,,,,,148.95,19.02,122635.08,1,6,614,19393,WN,Southwest Airlines Co.,WN,Southwest Airlines Co.,06725,D,3,3,2020,3
54911,409660.0,0.0,4092.0,23247.0,0.0,0.0,696526.0,1687.0,50509.0,0.0,5801.0,130419.0,0.0,13015.0,33283.0,0.0,1368239.0,63297.0,826.0,96858.0,133045.0,0.0,36999.0,1775.0,0.0,0.0,0.0,0.0,332800.0,70992.0,403792.0,-457.0,-22154.0,128383.0,5256.0,2581.0,1698.0,,28124.0,0.0,1915462.0,,,,,,403.01,45.16,348603.29,1,6,612,19393,WN,Southwest Airlines Co.,WN,Southwest Airlines Co.,06725,D,3,3,2019,4
54912,423946.0,0.0,2492.0,18821.0,0.0,0.0,589673.0,1471.0,37355.0,0.0,4158.0,88069.0,0.0,7487.0,28377.0,0.0,1201849.0,64065.0,749.0,63804.0,114369.0,0.0,29058.0,134.0,0.0,0.0,0.0,0.0,272179.0,57573.0,329752.0,347.0,91031.0,19969.0,2587.0,1206.0,940.0,,14684.0,0.0,1662365.0,,,,,,388.00,42.89,337720.65,1,6,612,19393,WN,Southwest Airlines Co.,WN,Southwest Airlines Co.,06725,D,3,3,2016,3
54913,447696.0,0.0,4739.0,14028.0,0.0,0.0,150549.0,688.0,30456.0,0.0,6139.0,67859.0,0.0,22234.0,1973.0,0.0,746361.0,43251.0,5836.0,43233.0,41971.0,0.0,28336.0,1682.0,0.0,0.0,0.0,0.0,164309.0,51127.0,215436.0,-2030.0,-21651.0,130835.0,5728.0,2653.0,1792.0,,28230.0,0.0,1107354.0,,,,,,160.83,45.30,124112.53,1,6,612,19393,WN,Southwest Airlines Co.,WN,Southwest Airlines Co.,06725,D,3,3,2020,2
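The header-first loading trick used above can be reproduced on a small in-memory file. This is a minimal sketch, assuming the extra blank column comes from a trailing comma at the end of every line; the `raw` string is a hypothetical miniature and not an excerpt of the actual download.

```python
import io
import pandas as pd

# Hypothetical miniature of the export: every line ends with a comma
raw = "YEAR,QUARTER,CARRIER,\n2020,1,AA,\n2020,2,WN,\n"

# Step 1: read a single row just to obtain the column labels.
# The trailing comma yields an extra, automatically named 'Unnamed: N' column.
header = pd.read_csv(io.StringIO(raw), nrows=1).columns
print(list(header))

# Step 2: reload the file, keeping every column except the last one
clean = pd.read_csv(io.StringIO(raw), usecols=list(header[:-1]))
print(list(clean.columns))
```

Reading only one row in the first step keeps the extra pass cheap, even for large zipped files.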


In [6]:
df5.describe()

Unnamed: 0,PILOT_FLY_OPS,OTH_FLT_FLY_OPS,TRAIN_FLY_OPS,PERS_EXP_FLY_OPS,PRO_FLY_OPS,INTERCHG_FLY_OPS,FUEL_FLY_OPS,OIL_FLY_OPS,RENTAL_FLY_OPS,OTHER_FLY_OPS,INS_FLY_OPS,BENEFITS_FLY_OPS,INCIDENT_FLY_OPS,PAY_TAX_FLY_OPS,OTH_TAX_FLY_OPS,OTHER_EXP_FLY_OPS,TOT_FLY_OPS,AIRFRAME_LABOR,ENGINE_LABOR,AIRFRAME_REPAIR,ENGINE_REPAIRS,INTERCHG_CHARG,AIRFRAME_MATERIALS,ENGINE_MATERIALS,AIRFRAME_ALLOW,AIRFRAME_OVERHAULS,ENGINE_ALLOW,ENGINE_OVERHAULS,TOT_DIR_MAINT,AP_MT_BURDEN,TOT_FLT_MAINT_MEMO,NET_OBSOL_PARTS,AIRFRAME_DEP,ENGINE_DEP,PARTS_DEP,ENG_PARTS_DEP,OTH_FLT_EQUIP_DEP,OTH_FLT_EQUIP_DEP_GRP_I,FLT_EQUIP_A_EXP,FLY_OPS_EXP_I_A,TOT_AIR_OP_EXPENSES,DEV_N_PREOP_EXP,OTH_INTANGIBLES,EQUIP_N_HANGAR_DEP,G_PROP_DEP,CAP_LEASES_DEP,TOTAL_AIR_HOURS,AIR_DAYS_ASSIGN,AIR_FUELS_ISSUED,AIRCRAFT_CONFIG,AIRCRAFT_GROUP,AIRCRAFT_TYPE,AIRLINE_ID,CARRIER_GROUP_NEW,CARRIER_GROUP,YEAR,QUARTER
count,42100.0,21536.0,33917.0,41002.0,20272.0,10900.0,42433.0,27711.0,40725.0,21849.0,40029.0,40700.0,16681.0,40386.0,33283.0,29983.0,44763.0,40861.0,33590.0,41377.0,37683.0,12185.0,42214.0,39019.0,20957.0,12768.0,20128.0,13230.0,43070.0,37860.0,43113.0,24838.0,36727.0,37846.0,38007.0,30069.0,25353.0,27280.0,19781.0,6241.0,45228.0,4399.0,5833.0,8175.0,9279.0,5700.0,26521.0,26447.0,26075.0,54915.0,54915.0,54915.0,54915.0,54915.0,54915.0,54915.0,54915.0
mean,6919.929186,154.491979,421.430504,751.50214,71.487035,8.364261,18747.709669,27.176289,4591.492173,13.630997,150.34131,1733.208156,11.85497,414.427624,822.188672,252.018413,32466.08,1402.811539,251.685987,1649.518982,1998.412978,33.513957,1124.439733,675.073206,423.301202,41.181449,583.069683,143.827134,7117.888638,3025.367654,9767.706669,209.65268,2638.10265,817.075932,221.692088,184.467465,187.192853,3120.346899,558.278331,0.259253,45040.19,1387.362576,1420.452133,754.712904,4275.361168,284.538319,12.009913,1.641377,13207.202828,2.661623,6.743167,726.198944,20043.952235,2.660384,2.660384,2005.516598,2.485878
std,16550.68523,520.253289,970.088722,1446.996484,479.862218,195.696009,45294.830703,91.878626,9407.542904,55.265392,351.998034,4496.868879,92.129184,984.814095,2019.914025,1045.219531,70964.34,3261.057786,647.367842,3917.781937,5800.407649,337.10541,2281.669358,2062.838801,1804.358826,561.023724,2799.578366,975.690565,14503.999242,6705.957426,19775.370193,1244.615854,5868.679181,2877.95352,574.669941,590.244606,613.680178,6328.676413,1595.529415,13.148307,95540.57,4851.603169,4574.536871,2558.756029,11970.052742,801.306937,27.049857,3.44988,25347.7965,2.953946,1.458833,164.189405,300.290441,0.508481,0.508481,8.785579,1.11439
min,-5705.0,-960.38,-901.0,-1729.73,-846.08,-536.0,-13910.0,-627.0,-73273.0,-1097.0,-2156.06,-14277.0,-2503.28,-7748.0,-4179.18,-16940.21,-71274.0,-4340.0,-3773.0,-9702.0,-26376.88,-5476.0,-31192.0,-11468.0,-22158.0,-9145.0,-19047.36,-6310.8,-15747.0,-15476.0,-27593.0,-27902.0,-25381.0,-4886.0,-7982.0,-15900.0,-28951.0,-26863.41,-2879.0,-32.0,-30056.0,-8838.0,-1114.0,-6063.5,-3421.0,-1481.44,-56.86,-0.07,-1137.43,0.0,0.0,7.0,19386.0,1.0,1.0,1990.0,1.0
25%,299.0,0.0,5.81,36.6125,0.0,0.0,418.0,0.0,75.78,0.0,6.0,33.335,0.0,19.9925,0.69,0.0,1413.23,52.0,0.73,36.89,7.0,0.0,36.8125,1.59,0.0,0.0,0.0,0.0,366.0,61.0,487.44,0.0,30.58,11.0,6.0,0.0,0.0,151.0,0.0,0.0,1960.935,0.0,0.86,9.0,89.32,0.0,0.37,0.07,424.26,1.0,6.0,625.0,19805.0,2.0,2.0,1998.0,1.0
50%,1817.0,0.0,83.0,238.43,0.0,0.0,4030.7,6.0,1028.63,0.14,43.0,314.035,0.0,115.0,137.0,5.12,9009.0,358.0,34.0,366.0,271.0,0.0,315.405,61.0,0.0,0.0,0.0,0.0,2295.0,708.985,3050.0,39.0,537.0,149.395,58.88,13.0,12.74,809.13,4.12,0.0,13037.81,10.0,114.0,63.0,539.07,18.0,2.99,0.42,4027.96,1.0,6.0,693.0,20007.0,3.0,3.0,2006.0,2.0
75%,6955.25,66.0,415.0,798.4475,22.5025,0.0,17509.0,26.0,4639.94,6.0,152.0,1553.0,0.0,421.0,760.0,72.0,32775.1,1412.0,227.0,1550.0,1524.0,0.0,1232.0,434.4,109.0,0.0,110.935,0.0,7707.2775,2828.0,10327.16,196.0,2714.995,741.0,229.01,99.0,128.84,3455.0775,359.09,0.0,46847.06,506.0,967.33,385.0,3091.54,184.9625,12.46,1.63,16138.24,2.0,8.0,819.0,20344.0,3.0,3.0,2013.0,3.0
max,606178.0,9612.0,16781.0,26642.0,49729.47,9289.55,930873.0,8040.0,148227.0,1707.0,14930.0,130419.0,3539.0,27838.0,37992.0,20647.17,1368239.0,78353.0,15344.0,98442.0,137362.0,17940.16,59425.0,51883.56,26584.0,12979.03,98468.82,17077.68,350739.0,127336.0,425747.0,36537.0,201304.0,133277.0,41010.15,19387.0,13285.0,294309.0,45475.0,853.0,1915462.0,68903.0,214043.03,29537.0,548352.0,10568.0,735.67,45.64,357903.47,9.0,9.0,999.0,21712.0,3.0,3.0,2020.0,4.0


In [7]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54915 entries, 0 to 54914
Data columns (total 63 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PILOT_FLY_OPS            42100 non-null  float64
 1   OTH_FLT_FLY_OPS          21536 non-null  float64
 2   TRAIN_FLY_OPS            33917 non-null  float64
 3   PERS_EXP_FLY_OPS         41002 non-null  float64
 4   PRO_FLY_OPS              20272 non-null  float64
 5   INTERCHG_FLY_OPS         10900 non-null  float64
 6   FUEL_FLY_OPS             42433 non-null  float64
 7   OIL_FLY_OPS              27711 non-null  float64
 8   RENTAL_FLY_OPS           40725 non-null  float64
 9   OTHER_FLY_OPS            21849 non-null  float64
 10  INS_FLY_OPS              40029 non-null  float64
 11  BENEFITS_FLY_OPS         40700 non-null  float64
 12  INCIDENT_FLY_OPS         16681 non-null  float64
 13  PAY_TAX_FLY_OPS          40386 non-null  float64
 14  OTH_TAX_FLY_OPS       

Additional information on the meaning of each column can be found [here](https://www.transtats.bts.gov/Fields.asp?Table_ID=297&SYS_Table_Name=T_F41SCHEDULE_P52&User_Table_Name=Schedule%20P-5.2&Year_Info=1&First_Year=1990&Last_Year=2020&Rate_Info=0&Frequency=Quarterly&Data_Frequency=Annual,Quarterly).

There are some missing values. Let's look at them in more detail:

In [24]:
# Absolute number of missing values by column:
for column, number in df5.isna().sum().items(): # Compute the counts once, then iterate
    print(column, "\t", number)

PILOT_FLY_OPS 	 12815
OTH_FLT_FLY_OPS 	 33379
TRAIN_FLY_OPS 	 20998
PERS_EXP_FLY_OPS 	 13913
PRO_FLY_OPS 	 34643
INTERCHG_FLY_OPS 	 44015
FUEL_FLY_OPS 	 12482
OIL_FLY_OPS 	 27204
RENTAL_FLY_OPS 	 14190
OTHER_FLY_OPS 	 33066
INS_FLY_OPS 	 14886
BENEFITS_FLY_OPS 	 14215
INCIDENT_FLY_OPS 	 38234
PAY_TAX_FLY_OPS 	 14529
OTH_TAX_FLY_OPS 	 21632
OTHER_EXP_FLY_OPS 	 24932
TOT_FLY_OPS 	 10152
AIRFRAME_LABOR 	 14054
ENGINE_LABOR 	 21325
AIRFRAME_REPAIR 	 13538
ENGINE_REPAIRS 	 17232
INTERCHG_CHARG 	 42730
AIRFRAME_MATERIALS 	 12701
ENGINE_MATERIALS 	 15896
AIRFRAME_ALLOW 	 33958
AIRFRAME_OVERHAULS 	 42147
ENGINE_ALLOW 	 34787
ENGINE_OVERHAULS 	 41685
TOT_DIR_MAINT 	 11845
AP_MT_BURDEN 	 17055
TOT_FLT_MAINT_MEMO 	 11802
NET_OBSOL_PARTS 	 30077
AIRFRAME_DEP 	 18188
ENGINE_DEP 	 17069
PARTS_DEP 	 16908
ENG_PARTS_DEP 	 24846
OTH_FLT_EQUIP_DEP 	 29562
OTH_FLT_EQUIP_DEP_GRP_I 	 27635
FLT_EQUIP_A_EXP 	 35134
FLY_OPS_EXP_I_A 	 48674
TOT_AIR_OP_EXPENSES 	 9687
DEV_N_PREOP_EXP 	 50516
OTH_INTANGIBLES 	 4

In [25]:
# Relative frequency of missing values by column:
for column, number in (df5.isna().sum() / len(df5) * 100).items(): # Compute the percentages once
    print(column, "\t", number)

PILOT_FLY_OPS 	 23.33606482746062
OTH_FLT_FLY_OPS 	 60.783028316489116
TRAIN_FLY_OPS 	 38.23727578985705
PERS_EXP_FLY_OPS 	 25.335518528635166
PRO_FLY_OPS 	 63.08476736775016
INTERCHG_FLY_OPS 	 80.15114267504325
FUEL_FLY_OPS 	 22.72967313120277
OIL_FLY_OPS 	 49.53837749248839
RENTAL_FLY_OPS 	 25.839934444140944
OTHER_FLY_OPS 	 60.21305654192843
INS_FLY_OPS 	 27.1073477192024
BENEFITS_FLY_OPS 	 25.885459346262408
INCIDENT_FLY_OPS 	 69.62396430847674
PAY_TAX_FLY_OPS 	 26.45725211690795
OTH_TAX_FLY_OPS 	 39.39178730765729
OTHER_EXP_FLY_OPS 	 45.401074387690066
TOT_FLY_OPS 	 18.486752253482656
AIRFRAME_LABOR 	 25.592278976600202
ENGINE_LABOR 	 38.83274150960575
AIRFRAME_REPAIR 	 24.652644996813255
ENGINE_REPAIRS 	 31.379404534280255
INTERCHG_CHARG 	 77.81116270600018
AIRFRAME_MATERIALS 	 23.12847127378676
ENGINE_MATERIALS 	 28.946553764909407
AIRFRAME_ALLOW 	 61.83738504962214
AIRFRAME_OVERHAULS 	 76.74952198852772
ENGINE_ALLOW 	 63.346990803969774
ENGINE_OVERHAULS 	 75.90822179732314
TOT_
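The absolute and relative listings above can also be combined into a single table sorted by the share of missing values, which makes the sparsest columns easier to spot. The sketch below uses a hypothetical toy DataFrame; the names `toy`, `n_missing`, and `pct_missing` are illustrative and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values per column
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [np.nan, np.nan, np.nan, 4.0],
})

# isna().mean() is the fraction of missing rows, equivalent to isna().sum() / len(df)
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(missing)
```

The same expression should work unchanged on `df5` or `df6` by replacing `toy`.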

In [10]:
csv_path = os.path.join(root,
                        "Raw_Data",
                        "US_DoT",
                        "494124489_T_F41SCHEDULE_P7.zip")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\494124489_T_F41SCHEDULE_P7.zip'

In [11]:
# 'pd.read_csv' reads zipped CSV files directly, so no manual extraction is needed.
# A plain import loads an unwanted blank trailing column, so read the header row first:
cols = pd.read_csv(csv_path, nrows=1).columns
df6 = pd.read_csv(csv_path,
                  encoding='latin1',
                  usecols=cols[:-1]) # Load every column except the trailing blank one
df6

Unnamed: 0,AIR_OP_EXPENSE,FL_ATT_EXPENSE,FOOD_EXPENSE,OTH_IN_FL_EXPENSE,PAX_SVC_EXPENSE,LINE_SVC_EXPENSE,CONTROL_EXPENSE,LANDING_FEES,AIR_SVC_EXPENSE,TRAFFIC_EXP_PAX,TRAFFIC_EXP_CARGO,TRAFFIC_EXP_OTH,TRAFFIC_EXPENSE,RES_EXP_PAX,RES_EXP_CARGO,RES_EXP_OTH,RES_EXPENSE,AD_EXP_PAX,AD_EXP_CARGO,AD_EXP_INST,AD_EXPENSE,ADMIN_EXPENSE,DEPR_EXP_MAINT,AMORTIZATION,TRANSPORT_EXP,TOTAL_OP_EXPENSE,MAINT_PROP_EQUIP,DEPR_PROP_EQUIP,MAINT_DEPR,SVC_SALES_OP_EXP,AIRLINE_ID,UNIQUE_CARRIER,UNIQUE_CARRIER_NAME,CARRIER,CARRIER_NAME,UNIQUE_CARRIER_ENTITY,REGION,CARRIER_GROUP_NEW,CARRIER_GROUP,YEAR,QUARTER
0,,,,,,,,15.00,15.00,,,-15.00,-15.00,,,,,,,,,,,,,,,,,,20437,FL,AirTran Airways Corporation,FL,AirTran Airways Corporation,16030,L,3,3,2007,4
1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,20.76,0.00,0.00,20.76,-0.03,0.0,0.00,-0.03,-0.01,0.00,0.00,0.00,20.72,0.00,0.00,0.00,20.72,20436,F9,Frontier Airlines Inc.,F9,Frontier Airlines Inc.,16461,L,3,3,2020,2
2,0.03,,,,,,,,,,,,,,,,,,,,,,,,,0.03,,,,,20357,ER,"Astar USA, LLC",ER,Astar Air Cargo Inc.,16170,A,2,2,2007,4
3,0.42,,,,,,,,,,,,,,,,,,,,,,,,,0.42,,,,,20357,ER,"Astar USA, LLC",ER,Astar Air Cargo Inc.,16170,A,2,2,2010,3
4,0.42,,,,,,,,,,,,,,,,,,,,,,,,,0.42,,,,,20357,ER,"Astar USA, LLC",ER,Astar Air Cargo Inc.,16170,A,2,2,2008,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5705,2977005.64,387471.90,130599.42,8923.99,526995.32,50117.81,54787.40,96158.85,201064.06,131728.88,278175.82,66711.12,476615.82,111296.20,16.90,126347.03,237660.13,0.00,0.0,12987.04,12987.04,1656594.97,41.49,8159.56,1909550.94,8150097.69,42900.71,100522.01,143422.72,3263541.11,19805,AA,American Airlines Inc.,AA,American Airlines Inc.,0A050,D,3,3,2020,1
5706,2984295.08,370915.94,141048.56,12357.77,524322.27,57564.04,63269.47,99961.44,220794.94,127982.19,264274.52,56760.88,449017.59,110922.71,548.11,148172.09,259642.91,0.00,0.0,23552.08,23552.08,814814.79,88.88,5327.42,1812828.39,7197052.62,33002.72,69365.55,102368.27,2399929.15,19805,AA,American Airlines Inc.,AA,American Airlines Inc.,0A050,D,3,3,2018,4
5707,2986050.69,365429.63,146419.92,11748.53,523598.08,58714.92,70358.73,102693.85,231767.50,123268.99,257721.01,60811.62,441801.61,111212.24,129.42,146381.34,257723.00,0.00,0.0,20562.55,20562.55,841453.58,481.18,7481.25,1833740.96,7241794.71,26527.03,70607.28,97134.31,2422003.06,19805,AA,American Airlines Inc.,AA,American Airlines Inc.,0A050,D,3,3,2018,3
5708,3024564.95,381697.12,145684.79,12985.41,540367.31,59500.73,61954.50,108392.15,229847.38,126759.31,270740.55,62223.74,459723.60,116778.66,17.44,153560.24,270356.34,0.00,0.0,31799.13,31799.13,834190.13,83.56,7506.67,1924763.38,7445532.76,34303.82,88026.50,122330.32,2496204.43,19805,AA,American Airlines Inc.,AA,American Airlines Inc.,0A050,D,3,3,2019,4


In [12]:
df6.describe()

Unnamed: 0,AIR_OP_EXPENSE,FL_ATT_EXPENSE,FOOD_EXPENSE,OTH_IN_FL_EXPENSE,PAX_SVC_EXPENSE,LINE_SVC_EXPENSE,CONTROL_EXPENSE,LANDING_FEES,AIR_SVC_EXPENSE,TRAFFIC_EXP_PAX,TRAFFIC_EXP_CARGO,TRAFFIC_EXP_OTH,TRAFFIC_EXPENSE,RES_EXP_PAX,RES_EXP_CARGO,RES_EXP_OTH,RES_EXPENSE,AD_EXP_PAX,AD_EXP_CARGO,AD_EXP_INST,AD_EXPENSE,ADMIN_EXPENSE,DEPR_EXP_MAINT,AMORTIZATION,TRANSPORT_EXP,TOTAL_OP_EXPENSE,MAINT_PROP_EQUIP,DEPR_PROP_EQUIP,MAINT_DEPR,SVC_SALES_OP_EXP,AIRLINE_ID,CARRIER_GROUP_NEW,CARRIER_GROUP,YEAR,QUARTER
count,5709.0,4713.0,4709.0,4698.0,4746.0,5605.0,5609.0,5681.0,5703.0,4659.0,5205.0,3231.0,5673.0,4502.0,4803.0,4055.0,5549.0,4210.0,4065.0,2671.0,5320.0,5699.0,5363.0,5133.0,4831.0,5709.0,5153.0,5654.0,5680.0,5703.0,5710.0,5710.0,5710.0,5710.0,5710.0
mean,326833.0,39214.027087,14876.207018,6975.267746,60606.316091,16600.333936,9886.19192,13087.106052,39074.936602,35916.627358,33944.51458,9127.772581,65839.696148,45212.166513,2608.395953,14467.970064,49511.79011,7334.256086,1075.699717,1329.063253,7293.207934,49587.64,1082.930436,2935.070425,131866.4,714589.3,7178.218147,6837.544232,13318.458551,276458.3,20005.52014,2.916287,2.916287,2005.24063,2.486165
std,463845.2,61205.916004,23821.95255,11682.445074,91120.048244,38043.147556,14977.327697,20371.373436,65710.617763,70011.026797,59782.782417,20980.269636,104810.39458,79788.804012,5433.327394,30377.454314,85640.858422,15338.199679,7562.179503,3917.273331,12689.034612,96043.67,3080.420082,6759.298808,356986.0,1102459.0,14470.193376,14772.787702,25371.519573,428343.9,299.515497,0.276981,0.276981,8.678886,1.113808
min,0.0,-668.43,-5699.67,-13516.0,0.0,-3964.0,-4851.0,-4739.0,-1166.0,-189762.0,-188054.0,-17091.0,-13858.6,-378.99,-2192.37,-8725.0,-387.0,-509296.0,-4155.0,-77802.0,-265125.0,-201595.0,-6063.5,-6834.0,-7589.0,0.03,-11106.0,-632.56,-8823.0,-51296.3,19386.0,2.0,2.0,1990.0,1.0
25%,43994.05,3768.43,720.93,459.4025,5942.66,752.54,969.0,1146.0,4591.785,1127.685,1124.0,39.96,4075.0,1764.5,59.225,31.875,2134.63,82.715,0.0,0.0,291.9375,3928.0,16.0,64.17,267.16,78353.0,243.0,395.25,926.75,29111.85,19805.0,3.0,3.0,1998.0,1.0
50%,161183.0,13078.91,4458.0,2379.5,21261.705,4526.0,4407.87,5426.19,15764.87,9142.0,7920.0,1366.76,23811.0,12801.635,442.0,1477.0,13061.0,1915.0,27.0,29.0,2516.0,17170.0,112.74,666.0,6036.65,293404.5,1768.0,2034.96,3794.775,106464.0,19977.0,3.0,3.0,2006.0,2.0
75%,383831.0,48277.0,20285.0,8129.75,79178.2325,15797.0,12528.0,15645.0,46507.5,33735.0,35866.0,10649.335,73926.0,47094.75,2471.24,15579.0,52789.49,9271.25,368.0,900.0,8846.805,50793.0,700.46,3074.0,56969.5,773864.0,6969.0,6704.3825,12946.5075,319506.0,20304.0,3.0,3.0,2013.0,3.0
max,3499002.0,561094.0,205255.0,126133.0,685815.0,465796.0,146610.59,154716.56,611323.0,698226.0,466375.27,207845.5,766481.39,496628.0,61714.0,215893.93,501141.0,104013.0,291462.0,43387.0,175577.0,2152874.0,29537.0,229197.95,2544532.0,8150098.0,159091.0,548352.0,611175.0,3400202.0,21171.0,3.0,3.0,2020.0,4.0


In [13]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5710 entries, 0 to 5709
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AIR_OP_EXPENSE         5709 non-null   float64
 1   FL_ATT_EXPENSE         4713 non-null   float64
 2   FOOD_EXPENSE           4709 non-null   float64
 3   OTH_IN_FL_EXPENSE      4698 non-null   float64
 4   PAX_SVC_EXPENSE        4746 non-null   float64
 5   LINE_SVC_EXPENSE       5605 non-null   float64
 6   CONTROL_EXPENSE        5609 non-null   float64
 7   LANDING_FEES           5681 non-null   float64
 8   AIR_SVC_EXPENSE        5703 non-null   float64
 9   TRAFFIC_EXP_PAX        4659 non-null   float64
 10  TRAFFIC_EXP_CARGO      5205 non-null   float64
 11  TRAFFIC_EXP_OTH        3231 non-null   float64
 12  TRAFFIC_EXPENSE        5673 non-null   float64
 13  RES_EXP_PAX            4502 non-null   float64
 14  RES_EXP_CARGO          4803 non-null   float64
 15  RES_

Additional information on the meaning of each column can be found [here](https://www.transtats.bts.gov/Fields.asp?Table_ID=278&SYS_Table_Name=T_F41SCHEDULE_P7&User_Table_Name=Schedule%20P-7&Year_Info=1&First_Year=1990&Last_Year=2020&Rate_Info=0&Frequency=Quarterly&Data_Frequency=Annual,Quarterly).

There are some missing values. Let's look at them in more detail:

In [14]:
# Absolute number of missing values by column:
df6.isna().sum()

AIR_OP_EXPENSE              1
FL_ATT_EXPENSE            997
FOOD_EXPENSE             1001
OTH_IN_FL_EXPENSE        1012
PAX_SVC_EXPENSE           964
LINE_SVC_EXPENSE          105
CONTROL_EXPENSE           101
LANDING_FEES               29
AIR_SVC_EXPENSE             7
TRAFFIC_EXP_PAX          1051
TRAFFIC_EXP_CARGO         505
TRAFFIC_EXP_OTH          2479
TRAFFIC_EXPENSE            37
RES_EXP_PAX              1208
RES_EXP_CARGO             907
RES_EXP_OTH              1655
RES_EXPENSE               161
AD_EXP_PAX               1500
AD_EXP_CARGO             1645
AD_EXP_INST              3039
AD_EXPENSE                390
ADMIN_EXPENSE              11
DEPR_EXP_MAINT            347
AMORTIZATION              577
TRANSPORT_EXP             879
TOTAL_OP_EXPENSE            1
MAINT_PROP_EQUIP          557
DEPR_PROP_EQUIP            56
MAINT_DEPR                 30
SVC_SALES_OP_EXP            7
AIRLINE_ID                  0
UNIQUE_CARRIER             12
UNIQUE_CARRIER_NAME         0
CARRIER   

In [15]:
# Relative frequency of missing values by column:
df6.isna().sum() / len(df6) * 100

AIR_OP_EXPENSE            0.017513
FL_ATT_EXPENSE           17.460595
FOOD_EXPENSE             17.530648
OTH_IN_FL_EXPENSE        17.723292
PAX_SVC_EXPENSE          16.882662
LINE_SVC_EXPENSE          1.838879
CONTROL_EXPENSE           1.768827
LANDING_FEES              0.507881
AIR_SVC_EXPENSE           0.122592
TRAFFIC_EXP_PAX          18.406305
TRAFFIC_EXP_CARGO         8.844133
TRAFFIC_EXP_OTH          43.415061
TRAFFIC_EXPENSE           0.647986
RES_EXP_PAX              21.155867
RES_EXP_CARGO            15.884413
RES_EXP_OTH              28.984238
RES_EXPENSE               2.819615
AD_EXP_PAX               26.269702
AD_EXP_CARGO             28.809107
AD_EXP_INST              53.222417
AD_EXPENSE                6.830123
ADMIN_EXPENSE             0.192644
DEPR_EXP_MAINT            6.077058
AMORTIZATION             10.105079
TRANSPORT_EXP            15.394046
TOTAL_OP_EXPENSE          0.017513
MAINT_PROP_EQUIP          9.754816
DEPR_PROP_EQUIP           0.980736
MAINT_DEPR          

In [16]:
df5.columns

Index(['PILOT_FLY_OPS', 'OTH_FLT_FLY_OPS', 'TRAIN_FLY_OPS', 'PERS_EXP_FLY_OPS',
       'PRO_FLY_OPS', 'INTERCHG_FLY_OPS', 'FUEL_FLY_OPS', 'OIL_FLY_OPS',
       'RENTAL_FLY_OPS', 'OTHER_FLY_OPS', 'INS_FLY_OPS', 'BENEFITS_FLY_OPS',
       'INCIDENT_FLY_OPS', 'PAY_TAX_FLY_OPS', 'OTH_TAX_FLY_OPS',
       'OTHER_EXP_FLY_OPS', 'TOT_FLY_OPS', 'AIRFRAME_LABOR', 'ENGINE_LABOR',
       'AIRFRAME_REPAIR', 'ENGINE_REPAIRS', 'INTERCHG_CHARG',
       'AIRFRAME_MATERIALS', 'ENGINE_MATERIALS', 'AIRFRAME_ALLOW',
       'AIRFRAME_OVERHAULS', 'ENGINE_ALLOW', 'ENGINE_OVERHAULS',
       'TOT_DIR_MAINT', 'AP_MT_BURDEN', 'TOT_FLT_MAINT_MEMO',
       'NET_OBSOL_PARTS', 'AIRFRAME_DEP', 'ENGINE_DEP', 'PARTS_DEP',
       'ENG_PARTS_DEP', 'OTH_FLT_EQUIP_DEP', 'OTH_FLT_EQUIP_DEP_GRP_I',
       'FLT_EQUIP_A_EXP', 'FLY_OPS_EXP_I_A', 'TOT_AIR_OP_EXPENSES',
       'DEV_N_PREOP_EXP', 'OTH_INTANGIBLES', 'EQUIP_N_HANGAR_DEP',
       'G_PROP_DEP', 'CAP_LEASES_DEP', 'TOTAL_AIR_HOURS', 'AIR_DAYS_ASSIGN',
       'AIR_FUELS_I

In [17]:
df6.columns

Index(['AIR_OP_EXPENSE', 'FL_ATT_EXPENSE', 'FOOD_EXPENSE', 'OTH_IN_FL_EXPENSE',
       'PAX_SVC_EXPENSE', 'LINE_SVC_EXPENSE', 'CONTROL_EXPENSE',
       'LANDING_FEES', 'AIR_SVC_EXPENSE', 'TRAFFIC_EXP_PAX',
       'TRAFFIC_EXP_CARGO', 'TRAFFIC_EXP_OTH', 'TRAFFIC_EXPENSE',
       'RES_EXP_PAX', 'RES_EXP_CARGO', 'RES_EXP_OTH', 'RES_EXPENSE',
       'AD_EXP_PAX', 'AD_EXP_CARGO', 'AD_EXP_INST', 'AD_EXPENSE',
       'ADMIN_EXPENSE', 'DEPR_EXP_MAINT', 'AMORTIZATION', 'TRANSPORT_EXP',
       'TOTAL_OP_EXPENSE', 'MAINT_PROP_EQUIP', 'DEPR_PROP_EQUIP', 'MAINT_DEPR',
       'SVC_SALES_OP_EXP', 'AIRLINE_ID', 'UNIQUE_CARRIER',
       'UNIQUE_CARRIER_NAME', 'CARRIER', 'CARRIER_NAME',
       'UNIQUE_CARRIER_ENTITY', 'REGION', 'CARRIER_GROUP_NEW', 'CARRIER_GROUP',
       'YEAR', 'QUARTER'],
      dtype='object')

In [18]:
df5_2020 = df5.loc[df5['YEAR'] == 2020, ['UNIQUE_CARRIER', 'CARRIER_NAME', 'TOTAL_DIRECT', 'TOTAL_INDIRECT', 'TOTAL_OP_EXPENSE', 'AIRCRAFT_FUEL']]
df5_2020

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['TOTAL_DIRECT', 'TOTAL_INDIRECT', 'TOTAL_OP_EXPENSE', 'AIRCRAFT_FUEL'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
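The `KeyError` above is raised because four of the requested labels are not columns of `df5`. Judging by the column listings printed earlier, `TOTAL_OP_EXPENSE` belongs to `df6`, while `df5` has similarly named columns such as `TOT_AIR_OP_EXPENSES` and `FUEL_FLY_OPS`; whether those are the intended substitutes is an assumption. A minimal sketch of checking labels before selecting, on a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy frame; only the column names resemble the real table
toy = pd.DataFrame({
    "CARRIER_NAME": ["Carrier A", "Carrier B"],
    "TOT_AIR_OP_EXPENSES": [10.0, 20.0],
    "YEAR": [2020, 2019],
})

wanted = ["CARRIER_NAME", "TOT_AIR_OP_EXPENSES", "AIRCRAFT_FUEL"]

# Split the requested labels into those that exist and those that do not
present = [c for c in wanted if c in toy.columns]
absent = [c for c in wanted if c not in toy.columns]
print("missing labels:", absent)

# Selecting only existing labels avoids the KeyError
subset = toy.loc[toy["YEAR"] == 2020, present]
print(subset)
```

Reporting the absent labels explicitly, rather than silently dropping them, makes it easier to notice when a column name was taken from a different schedule.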

In [None]:
df5g = df5_2020.groupby('CARRIER_NAME').sum().sort_values(by='TOTAL_OP_EXPENSE', ascending=False).head(10)
df5g

In [None]:
sorted(list(df5['CARRIER_NAME'].dropna().unique()))