# 1. Exploratory Data Analysis

## 1.2. Air Carrier Financial Reports (Form 41 Financial Data)

### 1.2.1. Schedule B-43 Inventory

Source: https://www.transtats.bts.gov/Tables.asp?DB_ID=135&DB_Name=Air%20Carrier%20Financial%20Reports%20%28Form%2041%20Financial%20Data%29

<em>Note</em>: Over time both the code and the name of a carrier may change and the same code or name may be assumed by a different airline. To ensure that you are analyzing data from the same airline, TranStats provides four airline-specific variables that identify one and only one carrier or its entity: Airline ID (AirlineID), Unique Carrier Code (UniqueCarrier), Unique Carrier Name (UniqueCarrierName), and Unique Entity (UniqCarrierEntity). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. US Airways and America West started to report combined on-time data in January 2006 and combined traffic and financial data in October 2007 following their 2005 merger announcement. Delta and Northwest began reporting jointly in January 2010 following their 2008 merger announcement. Continental Micronesia was combined into Continental Airlines in December 2010 and joint reporting began in January 2011. Atlantic Southeast and ExpressJet began reporting jointly in January 2012. United and Continental began reporting jointly in January 2012 following their 2010 merger announcement. Endeavor (9E) operated as Pinnacle prior to August 2013. Envoy (MQ) operated as American Eagle prior to April 2014. Southwest (WN) and AirTran (FL) began reporting jointly in January 2015 following their 2011 merger announcement. American (AA) and US Airways (US) began reporting jointly as AA in July 2015 following their 2013 merger announcement. Alaska (AS) and Virgin America (VX) began reporting jointly as AS in April 2018 following their 2016 merger announcement.

- **B43_inventory**:
    - **Summary**:
        - Air Carrier Financial : Schedule B-43 Inventory
    - **Description**:
        - Annual Inventory of Airframe and Aircraft Engines
    - **File**:
        - 494124489_T_F41SCHEDULE_B43.zip

___

In [1]:
# Import libraries to be used:

import pandas as pd
import numpy as np
import os.path
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import warnings # warnings.filterwarnings(action='ignore') # https://docs.python.org/3/library/warnings.html#the-warnings-filter
# from zipfile import ZipFile # Not needed so far

In [2]:
# Show all columns and rows in DataFrames
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None) # It greatly slows down the output display and freezes the kernel

# Show in notebook
%matplotlib inline

# style -> plt.style.available
# plt.style.use('seaborn')
plt.style.use('ggplot')

# theme
sns.set_theme(context='notebook',
              style="darkgrid") # {darkgrid, whitegrid, dark, white, ticks}

# color_palette -> https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette
palette = sns.color_palette("flare", as_cmap=True);

In [3]:
if os.name == 'nt': # Windows
    root = r"C:\Users\turge\CompartidoVM\0.TFM"
    print("Running on Windows.")
elif os.name == 'posix': # Ubuntu
    root = "/home/dsc/shared/0.TFM"
    print("Running on Ubuntu.")
print("root path\t", root)

Running on Windows.
root path	 C:\Users\turge\CompartidoVM\0.TFM


___

In [4]:
csv_path = os.path.join(root,
                        "Raw_Data",
                        "US_DoT",
                        "494124489_T_F41SCHEDULE_B43.zip")
csv_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Raw_Data\\US_DoT\\494124489_T_F41SCHEDULE_B43.zip'

In [5]:
# Since 'pd.read_csv' works fine with zipped csv files, we can proceed directly:
cols = pd.read_csv(csv_path, nrows=1).columns # After normally importing it, an undesired extra blank column is loaded
df3 = pd.read_csv(csv_path,
                  encoding='latin1',
                  usecols=cols[:-1]) # This way, the extra column is disregarded for the loading process
df3

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER
0,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7858,N202PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-28,20397.0,16
1,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7860,N206PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-30,20397.0,16
2,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7873,N207PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-11-26,20397.0,16
3,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7874,N209PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-04,20397.0,16
4,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7879,N213PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-16,20397.0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100489,2019,YX,Republic Airline,2008,Republic Airline,17000256,N244JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-11-04,20452.0,YX
100490,2019,YX,Republic Airline,2008,Republic Airline,17000214,N227JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-11-26,20452.0,YX
100491,2019,YX,Republic Airline,2008,Republic Airline,17000236,N236JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-10-31,20452.0,YX
100492,2019,YX,Republic Airline,2008,Republic Airline,17000259,N245JQ,b,Y,76.0,Embraer,ERJ-170-200LR,85517.0,2019-12-03,20452.0,YX
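The two-step read above can be illustrated on a tiny in-memory file. This is only a sketch under the assumption that the extra blank column comes from a trailing comma on every line; the `raw` string below is hypothetical and is not taken from the actual BTS file.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file: every line ends with a trailing comma
raw = "YEAR,CARRIER,\n2006,16,\n2019,YX,\n"

# A plain read picks up a third, empty column with an auto-generated name
naive = pd.read_csv(io.StringIO(raw))

# Read only the header, then load every column except the last one
header = pd.read_csv(io.StringIO(raw), nrows=0).columns
clean = pd.read_csv(io.StringIO(raw), usecols=list(header[:-1]))

print(list(naive.columns))
print(list(clean.columns))
```

Reading the header separately keeps the column selection independent of the actual column names, at the cost of opening the file twice.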


In [6]:
df3.describe()

Unnamed: 0,YEAR,MANUFACTURE_YEAR,NUMBER_OF_SEATS,CAPACITY_IN_POUNDS,AIRLINE_ID
count,100494.0,100494.0,100487.0,100393.0,100486.0
mean,2012.490706,1998.59524,108.11703,67318.04842,20050.33132
std,4.071978,39.631486,76.979477,82383.16767,374.019353
min,2006.0,0.0,0.0,0.0,19386.0
25%,2009.0,1993.0,50.0,34100.0,19805.0
50%,2012.0,2001.0,124.0,41226.0,19977.0
75%,2016.0,2005.0,155.0,75445.0,20366.0
max,2019.0,2019.0,737.0,875000.0,21974.0


In [7]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100494 entries, 0 to 100493
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   YEAR                 100494 non-null  int64  
 1   CARRIER              100445 non-null  object 
 2   CARRIER_NAME         100486 non-null  object 
 3   MANUFACTURE_YEAR     100494 non-null  int64  
 4   UNIQUE_CARRIER_NAME  100486 non-null  object 
 5   SERIAL_NUMBER        100494 non-null  object 
 6   TAIL_NUMBER          100494 non-null  object 
 7   AIRCRAFT_STATUS      100494 non-null  object 
 8   OPERATING_STATUS     100494 non-null  object 
 9   NUMBER_OF_SEATS      100487 non-null  float64
 10  MANUFACTURER         100494 non-null  object 
 11  MODEL                100483 non-null  object 
 12  CAPACITY_IN_POUNDS   100393 non-null  float64
 13  ACQUISITION_DATE     99917 non-null   object 
 14  AIRLINE_ID           100486 non-null  float64
 15  UNIQUE_CARRIER   

There are some missing values. Let's look into them further:

In [8]:
# Absolute number of missing values by column:
df3.isna().sum()

YEAR                     0
CARRIER                 49
CARRIER_NAME             8
MANUFACTURE_YEAR         0
UNIQUE_CARRIER_NAME      8
SERIAL_NUMBER            0
TAIL_NUMBER              0
AIRCRAFT_STATUS          0
OPERATING_STATUS         0
NUMBER_OF_SEATS          7
MANUFACTURER             0
MODEL                   11
CAPACITY_IN_POUNDS     101
ACQUISITION_DATE       577
AIRLINE_ID               8
UNIQUE_CARRIER          67
dtype: int64

In [9]:
# Relative frequency of missing values by column:
df3.isna().sum() / len(df3) * 100

YEAR                   0.000000
CARRIER                0.048759
CARRIER_NAME           0.007961
MANUFACTURE_YEAR       0.000000
UNIQUE_CARRIER_NAME    0.007961
SERIAL_NUMBER          0.000000
TAIL_NUMBER            0.000000
AIRCRAFT_STATUS        0.000000
OPERATING_STATUS       0.000000
NUMBER_OF_SEATS        0.006966
MANUFACTURER           0.000000
MODEL                  0.010946
CAPACITY_IN_POUNDS     0.100504
ACQUISITION_DATE       0.574164
AIRLINE_ID             0.007961
UNIQUE_CARRIER         0.066671
dtype: float64

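At this early point in the analysis, we assume that roughly 0.1% of missing data per column is acceptable. Most columns are well below this level; the exceptions are CAPACITY_IN_POUNDS (about 0.10%) and ACQUISITION_DATE (about 0.57%), which should be kept in mind if these columns are used later.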
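The 0.1% assumption can be checked column by column. The sketch below re-enters the missing-value counts printed above by hand (columns with zero missing values are omitted); the names `threshold_pct` and `above_threshold` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above
missing = pd.Series({
    'CARRIER': 49, 'CARRIER_NAME': 8, 'UNIQUE_CARRIER_NAME': 8,
    'NUMBER_OF_SEATS': 7, 'MODEL': 11, 'CAPACITY_IN_POUNDS': 101,
    'ACQUISITION_DATE': 577, 'AIRLINE_ID': 8, 'UNIQUE_CARRIER': 67,
})
n_rows = 100494        # total rows reported by df3.info()
threshold_pct = 0.1    # assumed acceptance threshold, in percent

# Convert counts to percentages and keep only the columns above the threshold
missing_pct = missing / n_rows * 100
above_threshold = missing_pct[missing_pct > threshold_pct].sort_values(ascending=False)
print(above_threshold.round(3))
```

On the real DataFrame, the same check would start from `df3.isna().mean() * 100` instead of the hand-copied counts.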

In [10]:
df3.columns

Index(['YEAR', 'CARRIER', 'CARRIER_NAME', 'MANUFACTURE_YEAR',
       'UNIQUE_CARRIER_NAME', 'SERIAL_NUMBER', 'TAIL_NUMBER',
       'AIRCRAFT_STATUS', 'OPERATING_STATUS', 'NUMBER_OF_SEATS',
       'MANUFACTURER', 'MODEL', 'CAPACITY_IN_POUNDS', 'ACQUISITION_DATE',
       'AIRLINE_ID', 'UNIQUE_CARRIER'],
      dtype='object')

In [11]:
df3['OPERATING_STATUS'].unique()

array(['Y', 'N', 'y', ' '], dtype=object)

In [12]:
operating = df3[df3['OPERATING_STATUS'].isin(['Y', 'y'])]
not_operating = df3[df3['OPERATING_STATUS'].isin(['N'])]
unknown = df3[df3['OPERATING_STATUS'].isin([' '])]
print("Operating:\t", len(operating))
print("Not operating:\t", len(not_operating))
print("Unknown:\t", len(unknown))

Operating:	 96493
Not operating:	 3991
Unknown:	 10


Since only 4001 records (3991 not operating plus 10 with a blank status, about 4% of the dataset) are affected, every A/C that is not known to be operating is dropped.

In [13]:
df3.drop(index=list(not_operating.index) + list(unknown.index), axis=0, inplace=True)

In [14]:
df3.shape

(96493, 16)
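An alternative way to reach the same selection is to normalize the status flag once and then filter on a single value. The following is a minimal sketch on a toy frame that reuses the four raw values reported by `unique()` above; `toy` and `operating_only` are illustrative names.

```python
import pandas as pd

# Toy frame using the four raw values reported by unique() above
toy = pd.DataFrame({'OPERATING_STATUS': ['Y', 'N', 'y', ' ', 'Y']})

# Normalize once: strip blanks and upper-case, so 'y' and 'Y' collapse into 'Y'
status = toy['OPERATING_STATUS'].str.strip().str.upper()

# Keep only rows explicitly flagged as operating; the blank status becomes '' and is excluded
operating_only = toy[status == 'Y']
print(status.value_counts())
```

Filtering on a positive condition (`status == 'Y'`) avoids having to enumerate every spelling of the values to be dropped.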

Due to the high cardinality of A/C manufacturers and models, a reduction is required in order to focus on the most widely used models.

However, many values are recorded in inconsistent forms; therefore, removing leading and trailing blank spaces and converting all labels to upper case is necessary first.

In [15]:
cols = ['MANUFACTURER', 'MODEL', 'TAIL_NUMBER', 'AIRCRAFT_STATUS']
for col in cols:
    df3[col] = df3[col].str.strip().str.upper()
df3

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER
0,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7858,N202PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-28,20397.0,16
1,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7860,N206PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-30,20397.0,16
2,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7873,N207PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-11-26,20397.0,16
3,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7874,N209PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-04,20397.0,16
4,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7879,N213PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-16,20397.0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100489,2019,YX,Republic Airline,2008,Republic Airline,17000256,N244JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-11-04,20452.0,YX
100490,2019,YX,Republic Airline,2008,Republic Airline,17000214,N227JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-11-26,20452.0,YX
100491,2019,YX,Republic Airline,2008,Republic Airline,17000236,N236JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-10-31,20452.0,YX
100492,2019,YX,Republic Airline,2008,Republic Airline,17000259,N245JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-12-03,20452.0,YX


Even after applying these two cleaning methods, there are still many different labels to refer to the same manufacturer. Hence, a new transformation will be applied to group all these labels under a single one.

In [16]:
manufacturers = ['AIRBUS', 'BOEING']

for manufacturer in manufacturers:
    manufacturer_df = df3[df3['MANUFACTURER'].str.contains(manufacturer)]
    labels_manufacturer = manufacturer_df['MANUFACTURER'].unique()
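    df3['MANUFACTURER'] = df3['MANUFACTURER'].replace(to_replace=list(labels_manufacturer), value=manufacturer) # Assign the result back; in-place replace on a selected column is unreliable in recent pandas versions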
    print("Initial manufacturer labels:")
    print(labels_manufacturer)
    print("Final manufacturer label:")
    print(df3[df3['MANUFACTURER'].str.contains(manufacturer)]['MANUFACTURER'].unique(), "\n")

Initial manufacturer labels:
['AIRBUS' 'AIRBUSINDUSTRIES' 'AIRBUSINDUSTRIE' 'AIRBUSCOMPANY']
Final manufacturer label:
['AIRBUS'] 

Initial manufacturer labels:
['BOEING' 'THEBOEINGCOMPANY' 'BOEINGCO' 'BOEINGCOMPANY' 'THEBOEINGCO'
 'BOEING747-446' 'BOEINGCO.']
Final manufacturer label:
['BOEING'] 
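The same grouping can also be written with a boolean mask and `.loc`, which avoids collecting the label variants first. This is a sketch on a toy frame built from some of the labels printed above plus one unrelated manufacturer; it is not the notebook's own implementation.

```python
import pandas as pd

# Toy labels taken from the printed output above, plus one unrelated manufacturer
toy = pd.DataFrame({'MANUFACTURER': ['AIRBUSINDUSTRIE', 'THEBOEINGCOMPANY',
                                     'BOEING747-446', 'EMBRAER', 'AIRBUS']})

for name in ['AIRBUS', 'BOEING']:
    # regex=False treats the name as a plain substring; na=False skips missing values
    mask = toy['MANUFACTURER'].str.contains(name, regex=False, na=False)
    toy.loc[mask, 'MANUFACTURER'] = name

print(toy['MANUFACTURER'].tolist())
```

Note that any label containing the substring is relabelled, so a label such as `BOEING747-446` (which looks like a model typed into the manufacturer field) is also folded into `BOEING`.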



Now, let's explore which A/C models remain with these two manufacturers:

In [17]:
df3

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER
0,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7858,N202PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-28,20397.0,16
1,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7860,N206PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-30,20397.0,16
2,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7873,N207PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-11-26,20397.0,16
3,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7874,N209PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-04,20397.0,16
4,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7879,N213PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-16,20397.0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100489,2019,YX,Republic Airline,2008,Republic Airline,17000256,N244JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-11-04,20452.0,YX
100490,2019,YX,Republic Airline,2008,Republic Airline,17000214,N227JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-11-26,20452.0,YX
100491,2019,YX,Republic Airline,2008,Republic Airline,17000236,N236JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-10-31,20452.0,YX
100492,2019,YX,Republic Airline,2008,Republic Airline,17000259,N245JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-12-03,20452.0,YX


A first attempt used plain boolean indexing:

```python
df3_man = df3[df3['MANUFACTURER'].isin(manufacturers)]
df3_man
```

The following warning was raised later on:

```
<ipython-input-76-ea644e03b3f0>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame
```

Therefore, after reviewing the [recommended official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy) on views versus copies, the selection was rewritten with `.loc`. Adding an explicit `.copy()` makes `df3_man` an independent DataFrame, so that later assignments on it are unambiguous.

In [18]:
mask = df3['MANUFACTURER'].isin(manufacturers)
df3_man = df3.loc[mask, :].copy()
df3_man

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER
76,2006,5X,United Parcel Service,2000,United Parcel Service,805,N120UP,A,Y,0.0,AIRBUS,A300-6,118673.0,2000-07-25,19917.0,5X
77,2006,5X,United Parcel Service,2000,United Parcel Service,806,N121UP,A,Y,0.0,AIRBUS,A300-6,118673.0,2000-09-21,19917.0,5X
78,2006,5X,United Parcel Service,2000,United Parcel Service,807,N122UP,O,Y,0.0,AIRBUS,A300-6,118673.0,2000-09-12,19917.0,5X
79,2006,5X,United Parcel Service,2000,United Parcel Service,808,N124UP,O,Y,0.0,AIRBUS,A300-6,118673.0,2000-10-20,19917.0,5X
80,2006,5X,United Parcel Service,2000,United Parcel Service,809,N125UP,O,Y,0.0,AIRBUS,A300-6,118673.0,2000-10-30,19917.0,5X
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100141,2019,X9,Omni Air International LLC,1997,Omni Air International LLC,27613,N495AX,B,Y,86.0,BOEING,BOEINGB767-316ERPAX,90129.0,2016-05-16,20377.0,X9
100142,2019,X9,Omni Air International LLC,2004,Omni Air International LLC,33681,N819AX,B,Y,381.0,BOEING,BOEINGB777-222ERPAX,119668.0,2017-01-09,20377.0,X9
100143,2019,X9,Omni Air International LLC,2005,Omni Air International LLC,33682,N828AX,B,Y,381.0,BOEING,BOEINGB777-222ERPAX,121426.0,2017-01-10,20377.0,X9
100144,2019,X9,Omni Air International LLC,2007,Omni Air International LLC,36124,N846AX,B,Y,381.0,BOEING,BOEINGB777-222ERPAX,121426.0,2017-01-10,20377.0,X9
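As a brief illustration of the view-versus-copy issue mentioned above, the sketch below takes an explicit copy of a filtered toy frame and modifies it. The frame and the `'TBD'` placeholder value are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'MANUFACTURER': ['AIRBUS', 'CANADAIR', 'BOEING'],
                    'MODEL': ['A300-6', 'CRJ-2/4', 'B737-7']})

mask = toy['MANUFACTURER'].isin(['AIRBUS', 'BOEING'])
subset = toy.loc[mask, :].copy()   # explicit copy: subset owns its data

# Adding a column to the copy leaves the parent frame untouched
subset['AC_FAMILY'] = 'TBD'
print(list(toy.columns))
print(list(subset.columns))
```

With the explicit copy, it is clear that changes to `subset` are never meant to propagate back to `toy`.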


After selecting only the two major worldwide manufacturers (i.e. Airbus and Boeing), the data is reduced to slightly more than half the size of the original dataset, going from 100494 to 56552 records. This means the subset keeps about 56% of the original records (about 59% of the 96493 operating records).

This reduction has two main implications:
- On the one hand, there will inevitably be a loss of variability in A/C models.
    - For example, most turboprops (and even turbojets) would be excluded from this analysis, since Airbus and Boeing mainly manufacture turbofan-powered aircraft. Manufacturers like ATR would be left out.
- On the other hand, the analysis would be much clearer, as the assessment would focus only on the two major worldwide manufacturers and therefore on the major carriers.

Should this analysis yield incomplete results and/or conclusions at the end of the project, an additional phase could be easily implemented, taking into account other important manufacturers such as:
- Embraer
- Bombardier
- Comac (Commercial Aircraft Corporation of China)
- Etc.
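The retention figures for the Airbus and Boeing subset can be reproduced with a few lines of arithmetic. The three counts below are the row counts reported earlier in this section; the variable names are illustrative.

```python
n_raw = 100494            # rows in the raw file
n_operating = 96493       # rows left after dropping non-operating A/C
n_airbus_boeing = 56552   # rows left after keeping Airbus and Boeing only

# Share of the subset relative to each of the two possible baselines
share_of_raw = n_airbus_boeing / n_raw * 100
share_of_operating = n_airbus_boeing / n_operating * 100
print(f"{share_of_raw:.1f}% of raw rows, {share_of_operating:.1f}% of operating rows")
```

Which baseline is quoted matters slightly, since the manufacturer filter is applied after the non-operating rows have already been removed.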

In [19]:
print(df3_man['MODEL'].nunique())
sorted(list(df3_man['MODEL'].unique()))

554


['0',
 '318-111',
 '319-111',
 '319-112',
 '320-112',
 '320-114',
 '320-211',
 '320-211N',
 '320-214',
 '320-251',
 '320-251N',
 '321-211',
 '717-200-PSGR',
 '717-200PASSENGERONLY',
 '727-200',
 '727-200CARGO',
 '727-200F',
 '727-212CARGO',
 '727-223',
 '727-224',
 '727-225',
 '727-231ACARGO',
 '727-232',
 '727-233ACARGO',
 '727-264',
 '727-277-CARGO',
 '727-281-CARGO',
 '727-2B6',
 '727-2F9CARGO',
 '727-2H3',
 '727-2M7',
 '727-2S2F-CARGO',
 '727225-CARGO',
 '72722C-CARGO',
 '727277-CARGO',
 '737-200',
 '737-200C',
 '737-200CARGO',
 '737-205',
 '737-2H4',
 '737-300',
 '737-300PASSENGERONLY',
 '737-300SF',
 '737-301',
 '737-301F',
 '737-306',
 '737-319',
 '737-330',
 '737-3B7',
 '737-400',
 '737-400-PSGR',
 '737-401',
 '737-402',
 '737-403',
 '737-404',
 '737-405',
 '737-436',
 '737-45D',
 '737-484',
 '737-48E',
 '737-490',
 '737-4B7',
 '737-4C9',
 '737-4Q8',
 '737-4YO',
 '737-500',
 '737-500PASSENGERONLY',
 '737-700',
 '737-700PASSENGERONLY',
 '737-732',
 '737-732-PSGR',
 '737-75V',
 '

The resulting number of unique models (554) for the selected manufacturers is unmanageable as is. Therefore, some transformation must be performed before continuing with the analysis:
- Some of them actually refer to the same A/C model, but are labelled differently
- Some of them represent cargo/freighter variants of original airframes
- Even for main models, there are variants to cover

1. First, drawing upon regular expressions, an attempt will be made to relabel the models so that each group falls under a single reference.

Below are the A/C families and models according to each manufacturer's portfolio:
- [Airbus](https://www.airbus.com/aircraft/passenger-aircraft.html):
    - Previous-generation aircraft:
        - [A300-600](https://www.airbus.com/aircraft/previous-generation-aircraft/a300-600.html#details)
        - [A300-600F](https://www.airbus.com/aircraft/previous-generation-aircraft/a300-600/a300-600f.html#details)
        - [A310](https://www.airbus.com/aircraft/previous-generation-aircraft/a310.html#details)
        - [A340 Family](https://www.airbus.com/aircraft/previous-generation-aircraft/a340-family.html)
            - [A340-200](https://www.airbus.com/aircraft/previous-generation-aircraft/a340-family/a340-200.html#details)
            - [A340-300](https://www.airbus.com/aircraft/previous-generation-aircraft/a340-family/a340-300.html#details)
            - [A340-500](https://www.airbus.com/aircraft/previous-generation-aircraft/a340-family/a340-500.html#details)
            - [A340-600](https://www.airbus.com/aircraft/previous-generation-aircraft/a340-family/a340-600.html#details)        
    - A220 Family:
        - [A220-100](https://www.airbus.com/aircraft/passenger-aircraft/a220-family/a220-100.html)
        - [A220-300](https://www.airbus.com/aircraft/passenger-aircraft/a220-family/a220-300.html)
    - A320 Family:
        - [A318](https://www.airbus.com/aircraft/passenger-aircraft/a320-family/a318.html)
        - [A319neo](https://www.airbus.com/aircraft/passenger-aircraft/a320-family/a319neo.html)
        - [A320neo](https://www.airbus.com/aircraft/passenger-aircraft/a320-family/a320neo.html)
        - [A321neo](https://www.airbus.com/aircraft/passenger-aircraft/a320-family/a321neo.html)
    - A330
        - [A330-200](https://www.airbus.com/aircraft/passenger-aircraft/a330-family/a330-200.html#details)
        - [A330-300](https://www.airbus.com/aircraft/passenger-aircraft/a330-family/a330-300.html#details)
        - [A330-800](https://www.airbus.com/aircraft/passenger-aircraft/a330-family/a330-800.html#details)
        - [A330-900](https://www.airbus.com/aircraft/passenger-aircraft/a330-family/a330-900.html)
    - A350
        - [A350-900](https://www.airbus.com/aircraft/passenger-aircraft/a350xwb-family/a350-900.html#details)
        - [A350-1000](https://www.airbus.com/aircraft/passenger-aircraft/a350xwb-family/a350-1000.html#details)
    - A380
        - [A380](https://www.airbus.com/aircraft/passenger-aircraft/a380.html)
- [Boeing](https://www.boeing.com/commercial/):
    - [B737-NG](https://www.boeing.com/commercial/737ng/)
        - 737-700
        - 737-800
        - 737-900
    - [B737-MAX](https://www.boeing.com/commercial/737max/)
        - 737-MAX-7
        - 737-MAX-8
        - 737-MAX-9
        - 737-MAX-10
    - [B747-8](https://www.boeing.com/commercial/747//)
        - 747-8 Intercontinental
    - [B767](https://www.boeing.com/commercial/767/)
        - 767F
    - [B777](https://www.boeing.com/commercial/777/)
        - 777-200LR
        - 777-300ER
    - [B777X](https://www.boeing.com/commercial/777x/)
        - 777-8
        - 777-9
    - [B787](https://www.boeing.com/commercial/787/)
        - B787-8 Dreamliner
        - B787-9 Dreamliner
        - B787-10 Dreamliner

Summarizing, these different models add up to:
- Airbus: 20
- Boeing: 16

So, with a total of 36 different A/C models, a dictionary may be needed to replace each label by its corresponding one.

Notes to group labels:
- "PSGR" / "PAX" = Passenger (variant)
- "ER" = Extended Range (a modified aircraft whose range is increased, e.g. B777-300ER)
- "LR" = Long Range
- "F" = Freighter
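One possible shape for the relabelling dictionary is a small set of regular expressions that map each cleaned label to a canonical model name. The sketch below is hypothetical: the two patterns, the `canonical_model` helper, and the choice to leave MAX variants unmapped are assumptions for illustration, and only the sample labels come from the dataset shown in this section.

```python
import re

# Hypothetical patterns; anchoring at the start avoids matching Boeing labels such as '737-319'
AIRBUS_A320_FAMILY = re.compile(r'^(?:AIRBUS)?A?-?(31[89]|32[01])(?!\d)')
BOEING_737 = re.compile(r'^(?:BOEING)?B?-?737[- ]*(\d)')

def canonical_model(label):
    """Return a canonical model name for a cleaned (stripped, upper-case) label, or None."""
    if 'MAX' in label:
        return None  # MAX variants are deliberately left unmapped in this sketch
    match = AIRBUS_A320_FAMILY.match(label)
    if match:
        return 'A' + match.group(1)
    match = BOEING_737.match(label)
    if match:
        # The first digit after '737' is taken as the series (7 -> 737-700)
        return '737-' + match.group(1) + '00'
    return None

for raw in ['A319-132', 'A-321-PSGR', 'B-737-8H4', '737-832-PSGR', '737-319', '717-200PASSENGERONLY']:
    print(raw, '->', canonical_model(raw))
```

A function like this could be applied with `df3_man['MODEL'].map(canonical_model)`, leaving unmatched labels as missing values for manual review.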

In [20]:
models_airbus_a320fam = ['A318', 'A319neo', 'A320neo', 'A321neo']
models_boeing_b737fam = ['737-700', '737-800', '737-900', '737-MAX-7', '737-MAX-8', '737-MAX-9', '737-MAX-10']
models_a320fam_b737fam = models_airbus_a320fam + models_boeing_b737fam
models_a320fam_b737fam

['A318',
 'A319neo',
 'A320neo',
 'A321neo',
 '737-700',
 '737-800',
 '737-900',
 '737-MAX-7',
 '737-MAX-8',
 '737-MAX-9',
 '737-MAX-10']

In [21]:
a318 = df3_man[df3_man['MODEL'].str.contains(r'318')]
print("Number of A318:", len(a318))
a319 = df3_man[df3_man['MODEL'].str.contains(r'319')]
print("Number of A319:", len(a319))
a320 = df3_man[df3_man['MODEL'].str.contains(r'320')]
print("Number of A320:", len(a320))
a321 = df3_man[df3_man['MODEL'].str.contains(r'321')]
print("Number of A321:", len(a321))

a320_family = pd.concat([a318, a319, a320, a321]) # DataFrame.append was removed in pandas 2.0
print("Total number of A320 Family:", len(a320_family))

Number of A318: 52
Number of A319: 4299
Number of A320: 6248
Number of A321: 2313
Total number of A320 Family: 12912
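One caveat of plain substring matching is worth checking: a pattern such as `'319'` also matches Boeing labels like `'737-319'`, which appears in the model list above. The sketch below shows the effect on a toy frame and one possible guard (restricting the match to Airbus rows); it is an illustration, not a correction of the counts reported above.

```python
import pandas as pd

toy = pd.DataFrame({
    'MANUFACTURER': ['AIRBUS', 'AIRBUS', 'BOEING', 'BOEING'],
    'MODEL': ['A319-132', '319-111', '737-319', None],
})

# Substring match only; na=False treats missing models as non-matching
naive = toy['MODEL'].str.contains('319', na=False)

# Same match, restricted to rows whose manufacturer is Airbus
strict = naive & (toy['MANUFACTURER'] == 'AIRBUS')

print(toy.loc[naive, 'MODEL'].tolist())
print(toy.loc[strict, 'MODEL'].tolist())
```

Whether such collisions are numerous enough to matter in the real data would have to be verified against `df3_man` itself.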


In [22]:
b737_700 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*7') & ~df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-700:", len(b737_700))
b737_800 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*8') & ~df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-800:", len(b737_800))
b737_900 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*9') & ~df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-900:", len(b737_900))

b737_MAX_7 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*7') & df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-MAX-7:", len(b737_MAX_7))
b737_MAX_8 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*8') & df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-MAX-8:", len(b737_MAX_8))
b737_MAX_9 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*9') & df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-MAX-9:", len(b737_MAX_9))
b737_MAX_10 = df3_man[df3_man['MODEL'].str.contains(r'737[- ]*10') & df3_man['MODEL'].str.contains(r'MAX')]
print("Number of 737-MAX-10:", len(b737_MAX_10))

b737_family = pd.concat([b737_700, b737_800, b737_900, b737_MAX_7, b737_MAX_8, b737_MAX_9, b737_MAX_10]) # DataFrame.append was removed in pandas 2.0
print("Total number of B737 Family:", len(b737_family))

Number of 737-700: 6886
Number of 737-800: 7567
Number of 737-900: 2178
Number of 737-MAX-7: 0
Number of 737-MAX-8: 91
Number of 737-MAX-9: 0
Number of 737-MAX-10: 0
Total number of B737 Family: 16722


***NOTE:*** This transformation is somewhat inaccurate, since there are older models that are no longer present in the manufacturers' catalogues but are still operating (e.g. A320-CEO). By performing this modification, we may be replacing the actual model with a different one, thus distorting the results of the analysis.

At this point, it is worth showing the resulting dataset after keeping only these two A/C families, which account for 11 different A/C models.

In [23]:
# Add the two lengths directly; `a320_family + b737_family` would perform an element-wise DataFrame addition
n_selected = len(a320_family) + len(b737_family)
print("The total number of Airbus and Boeing A/C adds up to {}.".format(len(df3_man)))
print("The total number of A320 and B737 families is {}.".format(n_selected))
print("This results in these families accounting for {:.0f}% of the subset.".format(n_selected * 100 / len(df3_man)))
print("So, at this point, {} rows remain out of an original dataset of {} rows ({:.0f}%).".format(n_selected, len(df3), n_selected * 100 / len(df3)))

The total number of Airbus and Boeing A/C adds up to 56552.
The total number of A320 and B737 families is 29634.
This results in these families accounting for 52% of the subset.
So, at this point, 29634 rows remain out of an original dataset of 96493 rows (31%).


Although retaining only 31% of the 96493 operating records may seem low, the reader should consider the following points:
- These two A/C families (A320 and B737) cover most short- to medium-haul flights worldwide.
- For a first assessment, it is reasonable to keep only 11 A/C models, since this will in turn result in more manageable comparisons.

However, once again, should this analysis seem incomplete at the end, additional A/C models could be added to carry out more complex studies. New lines of research could group similar A/C models according to new criteria, such as:
- Payload
- Distance covered (based on A/C range)
- Geographical operating area
- Propulsion type (turbojet, turboprop, turbofan, etc.)
- Models for particular carriers
- Etc.

In [24]:
df3['AC_FAMILY'] = None
df3

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER,AC_FAMILY
0,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7858,N202PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-28,20397.0,16,
1,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7860,N206PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-10-30,20397.0,16,
2,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7873,N207PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-11-26,20397.0,16,
3,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7874,N209PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-04,20397.0,16,
4,2006,16,PSA Airlines Inc.,2003,PSA Airlines Inc.,7879,N213PS,B,Y,50.0,CANADAIR,CRJ-2/4,47000.0,2003-12-16,20397.0,16,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100489,2019,YX,Republic Airline,2008,Republic Airline,17000256,N244JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-11-04,20452.0,YX,
100490,2019,YX,Republic Airline,2008,Republic Airline,17000214,N227JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-11-26,20452.0,YX,
100491,2019,YX,Republic Airline,2008,Republic Airline,17000236,N236JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-10-31,20452.0,YX,
100492,2019,YX,Republic Airline,2008,Republic Airline,17000259,N245JQ,B,Y,76.0,EMBRAER,ERJ-170-200LR,85517.0,2019-12-03,20452.0,YX,


In [25]:
df3.loc[a320_family.index, 'AC_FAMILY'] = 'A320'
df3.loc[b737_family.index, 'AC_FAMILY'] = 'B737'
df3.loc[list(b737_family.index) + list(a320_family.index), :]

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER,AC_FAMILY
1340,2006,AQ,Aloha Airlines Inc.,1999,Aloha Air Cargo,28499,N738AL,B,Y,124.0,BOEING,B737-7,32254.0,1999-10-18,19678.0,KH,B737
1341,2006,AQ,Aloha Airlines Inc.,1999,Aloha Air Cargo,28500,N739AL,B,Y,124.0,BOEING,B737-7,31980.0,1999-11-12,19678.0,KH,B737
1342,2006,AQ,Aloha Airlines Inc.,2001,Aloha Air Cargo,28640,N740AL,B,Y,124.0,BOEING,B737-7,32172.0,2001-03-23,19678.0,KH,B737
1343,2006,AQ,Aloha Airlines Inc.,2001,Aloha Air Cargo,28641,N741AL,B,Y,124.0,BOEING,B737-7,32254.0,2001-03-29,19678.0,KH,B737
1344,2006,AQ,Aloha Airlines Inc.,1999,Aloha Air Cargo,29905,N746AL,B,Y,124.0,BOEING,B737-7,32132.0,2002-10-01,19678.0,KH,B737
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97774,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7857,N678NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-09-04,20416.0,NK,A320
97775,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7825,N679NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-10-29,20416.0,NK,A320
97776,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7870,N680NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-11-18,20416.0,NK,A320
97777,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7908,N681NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-12-14,20416.0,NK,A320
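A quick sanity check after this kind of index-based labelling is to count the labels, including the rows that received none. The sketch below uses a toy frame; the index lists are hypothetical stand-ins for `a320_family.index` and `b737_family.index`.

```python
import pandas as pd

toy = pd.DataFrame({'MODEL': ['B737-7', 'A319-132', 'CRJ-2/4', 'B-737-8H4']},
                   index=[1340, 62328, 0, 70743])
toy['AC_FAMILY'] = None

# Hypothetical index selections standing in for a320_family.index / b737_family.index
a320_idx = [62328]
b737_idx = [1340, 70743]
toy.loc[a320_idx, 'AC_FAMILY'] = 'A320'
toy.loc[b737_idx, 'AC_FAMILY'] = 'B737'

# dropna=False also counts the rows that received no family label
counts = toy['AC_FAMILY'].value_counts(dropna=False)
print(counts)
```

On the real data, the labelled counts would be expected to match the family totals printed earlier (12912 and 16722), provided no row was selected for both families.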


In [26]:
df3.sample(20)

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER,AC_FAMILY
70743,2015,WN,Southwest Airlines Co.,2012,Southwest Airlines Co.,36680,N8302F,O,Y,143.0,BOEING,B-737-8H4,43800.0,2012-03-27,19393.0,WN,B737
81725,2017,EV,ExpressJet Airlines Inc.,2002,ExpressJet Airlines Inc.,145571,N26549,B,Y,50.0,EMBRAER,ERJ-145,48501.0,2010-11-12,20366.0,EV,
16128,2008,AA,American Airlines Inc.,1999,American Airlines Inc.,53623,N973TW,A,Y,140.0,MCDONNELL-DOUGLAS,MD-80,37200.0,1999-09-10,19805.0,AA,
92127,2018,WN,Southwest Airlines Co.,2001,Southwest Airlines Co.,29836,N415WN,O,Y,143.0,BOEING,B-737-7H4,36200.0,2002-04-02,19393.0,WN,B737
54595,2013,UA,United Air Lines Inc.,1999,United Air Lines Inc.,28807,N36247,B,Y,154.0,BOEING,B737-800PAX,174200.0,1999-12-13,19977.0,UA,B737
62328,2014,US,US Airways Inc.,2000,US Airways Inc.,1375,N818AW,A,Y,124.0,AIRBUS,A319-132,36700.0,2000-11-30,20355.0,US,A320
25072,2009,FX,Federal Express Corporation,1984,Federal Express Corporation,191,N401FE,B,Y,0.0,AIRBUS,A310-2CF,75303.0,1995-01-27,20107.0,FX,
73662,2016,DL,Delta Air Lines Inc.,1999,Delta Air Lines Inc.,30345,N382DA,O,Y,160.0,BOEING,737-832-PSGR,43070.0,1999-10-12,19790.0,DL,B737
40399,2011,FL,AirTran Airways Corporation,2001,AirTran Airways Corporation,55127,N603AT,B,Y,117.0,BOEING,717-200PASSENGERONLY,25000.0,2001-06-26,20437.0,FL,
15649,2008,AA,American Airlines Inc.,1998,American Airlines Inc.,29425,N675AN,O,Y,188.0,BOEING,B757-2,49500.0,1998-08-26,19805.0,AA,


In [27]:
df3_output = df3.loc[list(b737_family.index) + list(a320_family.index), :].copy()
df3_output

Unnamed: 0,YEAR,CARRIER,CARRIER_NAME,MANUFACTURE_YEAR,UNIQUE_CARRIER_NAME,SERIAL_NUMBER,TAIL_NUMBER,AIRCRAFT_STATUS,OPERATING_STATUS,NUMBER_OF_SEATS,MANUFACTURER,MODEL,CAPACITY_IN_POUNDS,ACQUISITION_DATE,AIRLINE_ID,UNIQUE_CARRIER,AC_FAMILY
1340,2006,AQ,Aloha Airlines Inc.,1999,Aloha Air Cargo,28499,N738AL,B,Y,124.0,BOEING,B737-7,32254.0,1999-10-18,19678.0,KH,B737
1341,2006,AQ,Aloha Airlines Inc.,1999,Aloha Air Cargo,28500,N739AL,B,Y,124.0,BOEING,B737-7,31980.0,1999-11-12,19678.0,KH,B737
1342,2006,AQ,Aloha Airlines Inc.,2001,Aloha Air Cargo,28640,N740AL,B,Y,124.0,BOEING,B737-7,32172.0,2001-03-23,19678.0,KH,B737
1343,2006,AQ,Aloha Airlines Inc.,2001,Aloha Air Cargo,28641,N741AL,B,Y,124.0,BOEING,B737-7,32254.0,2001-03-29,19678.0,KH,B737
1344,2006,AQ,Aloha Airlines Inc.,1999,Aloha Air Cargo,29905,N746AL,B,Y,124.0,BOEING,B737-7,32132.0,2002-10-01,19678.0,KH,B737
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97774,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7857,N678NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-09-04,20416.0,NK,A320
97775,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7825,N679NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-10-29,20416.0,NK,A320
97776,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7870,N680NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-11-18,20416.0,NK,A320
97777,2019,NK,Spirit Air Lines,2017,Spirit Air Lines,7908,N681NK,O,Y,228.0,AIRBUS,A-321-PSGR,43600.0,2017-12-14,20416.0,NK,A320


In [28]:
csv_output_path = os.path.join(root,
                               "Output_Data",
                               "US_DoT",
                               "B43_inventory_output.csv")
csv_output_path

'C:\\Users\\turge\\CompartidoVM\\0.TFM\\Output_Data\\US_DoT\\B43_inventory_output.csv'

In [29]:
df3_output.to_csv(csv_output_path, encoding='latin1')
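Since the file is written with its row index, reading it back requires `index_col=0` to recover the original row labels. The round trip below is a sketch on a toy frame written to a temporary directory; the file name `toy_output.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'YEAR': [2006, 2019], 'AC_FAMILY': ['B737', 'A320']},
                   index=[1340, 97777])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_output.csv')   # hypothetical file name
    toy.to_csv(path, encoding='latin1')
    # index_col=0 restores the saved row labels instead of creating an extra unnamed column
    restored = pd.read_csv(path, encoding='latin1', index_col=0)

print(restored)
```

Keeping the same `encoding` on both the write and the read avoids surprises with non-ASCII characters in carrier names.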

___