# Feature selection

The original dataset, US Domestic Flights Delay Prediction (2013 - 2018) (Source: [Kaggle](https://www.kaggle.com/datasets/gabrielluizone/us-domestic-flights-delay-prediction-2013-2018)), is provided as a 1.54 GB zip archive. Once decompressed, it contains 60 files with a total size of 13.6 GB. Each file represents one month of data, spanning from January 2014 to December 2018.



## Modules Initialisation

In [1]:
from pathlib import Path  # Path is used below to build the data paths
from DataWrangling import *
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Loading the first file

In [2]:
home_dir = Path.cwd().parent
source_zip_file = home_dir / 'data' / 'interim' / 'csv_flight.zip'
data_file = 'csv_flight/report_2014_1.csv'

# reading the first file to evaluate the data
flights_2014_1 = load_data_from(source_zip_file, data_file)

## Data size

In [3]:
print('Shape of the one-month dataset:', flights_2014_1.shape)
print('It uses {:.2f} MB'.format(memory_usage(flights_2014_1)))
memory_usage_per_type(flights_2014_1)

Shape of the one-month dataset: (471949, 110)
It uses 717.16 MB
memory usage for number columns: 316.860 MB
memory usage for object columns: 395.352 MB
memory usage for datetimetz columns: 0.000 MB
memory usage for category columns: 0.000 MB
memory usage for bool columns: 1.350 MB


One month of data contains 471,949 rows and 110 columns and uses about 717 MB of memory. At that rate, the full 60-month dataset, without reorganization or cleaning, would need roughly 43 GB (60 × 717 MB), which would be challenging to process locally. Therefore, a key goal of data preparation is to reduce the dataset size without compromising its quality.
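As a back-of-the-envelope check, the estimate can be reproduced from the memory usage printed above. This is only a sketch: it assumes that all 60 monthly files are about the same size as January 2014, which has not been verified here.

```python
# Figure taken from the printed output for the one-month dataset
one_month_mb = 717.16
# Assumption: 60 monthly files of similar size
n_files = 60

total_mb = one_month_mb * n_files
print(f"Estimated in-memory size of the full dataset: {total_mb / 1000:.1f} GB")
```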

## Data structure and check
### Date of the flight

In [4]:
flights_2014_1.iloc[:,:6].head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate
0,2014,1,1,30,4,2014-01-30
1,2014,1,1,31,5,2014-01-31
2,2014,1,1,1,3,2014-01-01
3,2014,1,1,2,4,2014-01-02
4,2014,1,1,3,5,2014-01-03


0. **Year**
    * **Description:** Year
    * **Data type:** int16
    * **Drop**
    * **Comment:** We have a flight date which includes this data

1. **Quarter**  
    * **Description:** Quarter (1-4)  
    * **Data type:** int8
    * **Drop**
    * **Comment:** We have a flight date which includes this data

2. **Month**  
    * **Description:** Month (1-12)  
    * **Data type:** int8
    * **Drop**
    * **Comment:** We have a flight date which includes this data

3. **DayofMonth**  
    * **Description:** Day of month  
    * **Data type:** int8
    * **Drop**
    * **Comment:** We have a flight date which includes this data

4. **DayOfWeek**  
    * **Description:** Day of week  
    * **Data type:** int8
    * **Drop**
    * **Comment:** We have a flight date which includes this data

5. **FlightDate**  
    * **Description:** Flight Date (yyyymmdd)
    * **Data type:** datetime
    * **Keep**

Values check for the date of the flight: the data is accurate, and there are no outliers or NA values.

In [5]:
flights_2014_1['FlightDate'].describe()

count                           471949
mean     2014-01-15 22:29:46.967659264
min                2014-01-01 00:00:00
25%                2014-01-08 00:00:00
50%                2014-01-16 00:00:00
75%                2014-01-24 00:00:00
max                2014-01-31 00:00:00
Name: FlightDate, dtype: object

During the data transformation, I need to change the data type of this field from object to datetime to use memory efficiently.
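The conversion, and the reason the other calendar fields can be dropped, can be sketched as follows. The `sample` frame is synthetic and only mimics the first rows shown above; the day-of-week convention (Monday = 1) is inferred from those rows rather than taken from the data provider's documentation.

```python
import pandas as pd

# Synthetic sample mimicking FlightDate as loaded (strings, dtype object)
sample = pd.DataFrame({"FlightDate": ["2014-01-30", "2014-01-31", "2014-01-01"]})

# Parse once; datetime64 stores a fixed 8 bytes per value instead of a Python string
sample["FlightDate"] = pd.to_datetime(sample["FlightDate"], format="%Y-%m-%d")

# The dropped calendar fields can be rebuilt on demand from the date
derived = pd.DataFrame({
    "Year": sample["FlightDate"].dt.year,
    "Quarter": sample["FlightDate"].dt.quarter,
    "Month": sample["FlightDate"].dt.month,
    "DayofMonth": sample["FlightDate"].dt.day,
    # Assumption: Monday=1 ... Sunday=7, which matches the sample rows above
    "DayOfWeek": sample["FlightDate"].dt.dayofweek + 1,
})
print(derived)
```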

In [6]:
flights_2014_1['FlightDate'].isna().sum()

0

### Airline and flight details

In [7]:
flights_2014_1.iloc[:, 6:10].head()

Unnamed: 0,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number
0,AA,19805,AA,N006AA
1,AA,19805,AA,N003AA
2,AA,19805,AA,N002AA
3,AA,19805,AA,N002AA
4,AA,19805,AA,N014AA



6. **Reporting_Airline**
    * **Description:**  Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years. 
    * **Data type:** category 
    * **Keep**
    * **Comment:** I decided to keep this field and drop DOT_ID_Reporting_Airline and IATA_CODE_Reporting_Airline because it is declared unique and is recommended as such by the data provider.

7. **DOT_ID_Reporting_Airline**
    * **Description:** An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
    * **Drop**
    * **Comment:** Not unique
    
8. **IATA_CODE_Reporting_Airline**
    * **Description:** Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
    * **Drop**
    * **Comment:** Not unique

9. **Tail_Number**
    * **Description:** Tail Number
    * **Drop**
    * **Comment:** Irrelevant to the purposes of the project

10. **Flight_Number_Reporting_Airline**
    * **Description:** Flight Number
    * **Data type:** int16
    * **Keep**
    * **Comment:** Unique number of the flight at a specific day/time

In [8]:
flights_2014_1['Reporting_Airline'].unique()

array(['AA', 'AS', 'DL', 'EV', 'B6', 'F9', 'FL', 'HA', 'MQ', 'US', 'OO',
       'VX', 'WN', 'UA'], dtype=object)

In [9]:
flights_2014_1['Flight_Number_Reporting_Airline'].describe()

count    471949.000000
mean       2304.719605
std        1857.378202
min           1.000000
25%         707.000000
50%        1780.000000
75%        3583.000000
max        8404.000000
Name: Flight_Number_Reporting_Airline, dtype: float64

In [10]:
flights_2014_1[['Reporting_Airline', 'Flight_Number_Reporting_Airline']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 471949 entries, 0 to 471948
Data columns (total 2 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Reporting_Airline                471949 non-null  object
 1   Flight_Number_Reporting_Airline  471949 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 7.2+ MB


In [11]:
flights_2014_1[['Reporting_Airline', 'Flight_Number_Reporting_Airline']].isna().sum()

Reporting_Airline                  0
Flight_Number_Reporting_Airline    0
dtype: int64

The data in the fields Reporting_Airline and Flight_Number_Reporting_Airline is accurate: there are no outliers or NA values. However, I need to change the data types during the data transformation.
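A minimal sketch of the planned type change and its effect on memory. The `sample` frame is synthetic (carrier codes and flight numbers are picked from the outputs above), so the byte counts are illustrative only.

```python
import pandas as pd

# Synthetic sample: a few carrier codes and flight numbers repeated many times
sample = pd.DataFrame({
    "Reporting_Airline": ["AA", "AA", "DL", "WN", "DL"] * 1000,
    "Flight_Number_Reporting_Airline": [1, 707, 1780, 3583, 8404] * 1000,
})
before = sample.memory_usage(deep=True).sum()

# category stores each distinct code once; int16 is enough for values up to 32767
converted = sample.astype({
    "Reporting_Airline": "category",
    "Flight_Number_Reporting_Airline": "int16",
})
after = converted.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

The int16 choice relies on the maximum flight number seen here (8404) staying below 32767; other months would need the same check before converting.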

### Origin and Destination details

In [12]:
DestFields = ['DestAirportID', 
                'DestAirportSeqID', 
                'DestCityMarketID',
                'Dest',
                'DestCityName',
                'DestState',
                'DestStateFips',
                'DestStateName',
                'DestWac']
flights_2014_1.loc[:5, DestFields]

Unnamed: 0,DestAirportID,DestAirportSeqID,DestCityMarketID,Dest,DestCityName,DestState,DestStateFips,DestStateName,DestWac
0,12278,1227802,30928,ICT,"Wichita, KS",KS,20,Kansas,62
1,12278,1227802,30928,ICT,"Wichita, KS",KS,20,Kansas,62
2,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74
3,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74
4,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74
5,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74


In [13]:
OriginFields = ['OriginAirportID', 
                'OriginAirportSeqID', 
                'OriginCityMarketID',
                'Origin',
                'OriginCityName',
                'OriginState',
                'OriginStateFips',
                'OriginStateName',
                'OriginWac']
flights_2014_1.loc[:5, OriginFields]

Unnamed: 0,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,Origin,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac
0,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74
1,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74
2,12278,1227802,30928,ICT,"Wichita, KS",KS,20,Kansas,62
3,12278,1227802,30928,ICT,"Wichita, KS",KS,20,Kansas,62
4,12278,1227802,30928,ICT,"Wichita, KS",KS,20,Kansas,62
5,12278,1227802,30928,ICT,"Wichita, KS",KS,20,Kansas,62


Origin and Destination data has the same structure so I will treat it the same way.

11.	**OriginAirportID / 20. DestAirportID**  
    * **Description:** Origin/Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.  
    * **Data type:** category  
    * **Keep**  
    * **Comment:** I decided to keep these codes because of their uniqueness, assured by the data provider.  
    
12.	**OriginAirportSeqID / 21. DestAirportSeqID**  
    * **Description:** Origin/Destination Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
    * **Drop**
13.	**OriginCityMarketID / 22. DestCityMarketID**
    * **Description:** Origin/Destination Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
    * **Drop**
14.	**Origin / 23. Dest**
    * **Description:** Origin/Destination Airport
    * **Data type:** category
    * **Keep**
    * **Comment:** This code is the IATA code of the airport, which appears in most travel documents. It could be useful in the model.
15.	**OriginCityName / 24. DestCityName**
    * **Description:** Origin/Destination Airport, City Name
    * **Drop**
16.	**OriginState / 25. DestState**
    * **Description:** Origin/Destination Airport, State Code
    * **Drop**
17.	**OriginStateFips / 26. DestStateFips**
    * **Description:** Origin/Destination Airport, State Fips
    * **Drop**
18.	**OriginStateName / 27. DestStateName**
    * **Description:** Origin/Destination Airport, State Name
    * **Drop**
19.	**OriginWac / 28. DestWac**
    * **Description:** Origin/Destination Airport, World Area Code
    * **Drop**

In [14]:
flights_2014_1.loc[:, 'OriginAirportID'].unique()

array([11298, 12278, 13303, 10666, 14057, 13830, 13796, 14893, 12758,
       12173, 14831, 14747, 14679, 12982, 10299, 11278, 11618, 12892,
       13204, 10721, 13930, 11697, 13487, 14100, 10551, 10170, 14709,
       10754, 11630, 12819, 12523, 10926, 15991, 14828, 14256, 15841,
       13873, 13970, 14107, 14771, 14262, 14908, 10800, 13891, 12889,
       15376, 10423, 11292, 14683, 11884, 14869, 12266, 10397, 15016,
       13198, 10165, 11193, 10713, 15624, 14730, 14027, 12953, 12478,
       10994, 10693, 10781, 10599, 11481, 15304, 14635, 11433, 10821,
       10868, 13342, 11042, 11066, 13244, 10257, 11109, 13485, 10529,
       11775, 11996, 12448, 14193, 11641, 11057, 12206, 12264, 10208,
       10792, 11267, 12992, 14843, 11624, 10874, 14122, 15412, 14524,
       11973, 10140, 13871, 13495, 15370, 14576, 11721, 13851, 13931,
       15249, 12217, 11252, 15024, 13232, 10849, 12451, 14685, 12339,
       14492, 13577, 11503, 13360, 11995, 14986, 12191, 15323, 14570,
       11986, 15096,

In [15]:
flights_2014_1.loc[:, 'DestAirportID'].unique()

array([12278, 11298, 15304, 13830, 14831, 12758, 14893, 14747, 14057,
       13796, 12173, 14679, 10299, 12982, 10666, 11278, 12892, 11618,
       13204, 10721, 13930, 14100, 11697, 13487, 10551, 10170, 14709,
       10754, 11630, 12819, 12523, 15991, 10926, 14828, 14256, 15841,
       13873, 13970, 14107, 14771, 14262, 14908, 10800, 13891, 12889,
       15376, 10423, 11292, 11884, 14683, 14869, 12266, 10397, 15016,
       13198, 10165, 12953, 10713, 12478, 10821, 15624, 14027, 11433,
       13495, 14492, 11066, 13232, 14986, 11996, 11986, 10868, 10994,
       13342, 12448, 11995, 10599, 13244, 14321, 11481, 12264, 11042,
       11057, 14635, 13303, 12339, 11193, 13485, 10529, 14193, 14843,
       12206, 10792, 11267, 10693, 12992, 11624, 10874, 15412, 14122,
       14524, 11973, 10140, 13871, 15370, 14576, 15024, 10781, 11721,
       13851, 13931, 15249, 12217, 11252, 14685, 10849, 12451, 14730,
       11503, 13577, 11641, 13360, 11637, 11540, 12191, 14570, 14307,
       15096, 13422,

In [16]:
flights_2014_1.loc[:, 'Origin'].unique()

array(['DFW', 'ICT', 'MIA', 'BLI', 'PDX', 'OGG', 'OAK', 'SMF', 'KOA',
       'HNL', 'SJC', 'SEA', 'SAN', 'LIH', 'ANC', 'DCA', 'EWR', 'LAX',
       'MCO', 'BOS', 'ORD', 'FLL', 'MSP', 'PHL', 'BET', 'ADQ', 'SCC',
       'BRW', 'FAI', 'KTN', 'JNU', 'CDV', 'YAK', 'SIT', 'PSG', 'WRG',
       'OME', 'OTZ', 'PHX', 'SFO', 'PSP', 'SNA', 'BUR', 'ONT', 'LAS',
       'TUS', 'AUS', 'DEN', 'SAT', 'GEG', 'SLC', 'IAH', 'ATL', 'STL',
       'MCI', 'ADK', 'CVG', 'BOI', 'VPS', 'SDF', 'PBI', 'LGA', 'JFK',
       'CHS', 'BNA', 'BTR', 'BHM', 'ECP', 'TPA', 'RSW', 'DTW', 'BWI',
       'CAE', 'MKE', 'CLE', 'CMH', 'MEM', 'ALB', 'COS', 'MSN', 'BDL',
       'FSD', 'GSP', 'JAN', 'PNS', 'FAY', 'CLT', 'HRL', 'IAD', 'AGS',
       'BUF', 'DAY', 'LIT', 'SJU', 'EYW', 'CAK', 'PIT', 'TYS', 'RIC',
       'GPT', 'ABQ', 'OMA', 'MSY', 'TUL', 'ROC', 'FNT', 'OKC', 'ORF',
       'TLH', 'HSV', 'DAB', 'STT', 'MDW', 'BZN', 'JAX', 'SAV', 'IND',
       'RDU', 'MYR', 'EGE', 'MLB', 'GSO', 'SRQ', 'HOU', 'TRI', 'RNO',
       'GRR', 'SYR',

In [17]:
flights_2014_1.loc[:, 'Dest'].unique()

array(['ICT', 'DFW', 'TPA', 'OGG', 'SJC', 'KOA', 'SMF', 'SEA', 'PDX',
       'OAK', 'HNL', 'SAN', 'ANC', 'LIH', 'BLI', 'DCA', 'LAX', 'EWR',
       'MCO', 'BOS', 'ORD', 'PHL', 'FLL', 'MSP', 'BET', 'ADQ', 'SCC',
       'BRW', 'FAI', 'KTN', 'JNU', 'YAK', 'CDV', 'SIT', 'PSG', 'WRG',
       'OME', 'OTZ', 'PHX', 'SFO', 'PSP', 'SNA', 'BUR', 'ONT', 'LAS',
       'TUS', 'AUS', 'DEN', 'GEG', 'SAT', 'SLC', 'IAH', 'ATL', 'STL',
       'MCI', 'ADK', 'LGA', 'BOI', 'JFK', 'BWI', 'VPS', 'PBI', 'DTW',
       'MSY', 'RDU', 'CMH', 'MDW', 'SRQ', 'GSP', 'GRR', 'CAE', 'CHS',
       'MKE', 'JAN', 'GSO', 'BHM', 'MEM', 'PWM', 'ECP', 'IAD', 'CLE',
       'CLT', 'RSW', 'MIA', 'IND', 'CVG', 'MSN', 'BDL', 'PNS', 'SJU',
       'HRL', 'BUF', 'DAY', 'BNA', 'LIT', 'EYW', 'CAK', 'TYS', 'PIT',
       'RIC', 'GPT', 'ABQ', 'OMA', 'TUL', 'ROC', 'STT', 'BTR', 'FNT',
       'OKC', 'ORF', 'TLH', 'HSV', 'DAB', 'SAV', 'BZN', 'JAX', 'SDF',
       'EGE', 'MYR', 'FAY', 'MLB', 'FAR', 'ELP', 'HOU', 'RNO', 'PVD',
       'SYR', 'MOB',

In [18]:
# Check for equal number of Origin Airport IDs and Origin (IATA codes)
print('Number of unique OriginAirport IDs:', flights_2014_1.loc[:, 'OriginAirportID'].nunique(),
      '\nNumber of unique Origin codes:', flights_2014_1.loc[:, 'Origin'].nunique())

Number of unique OriginAirport IDs: 301 
Number of unique Origin codes: 301


In [19]:
# Check for equal number of Dest Airport IDs and Dest (IATA codes)
print('Number of unique DestAirport IDs:', flights_2014_1.loc[:, 'DestAirportID'].nunique(),
      '\nNumber of unique Destination codes:', flights_2014_1.loc[:, 'Dest'].nunique())

Number of unique DestAirport IDs: 301 
Number of unique Destination codes: 301


In [20]:
# Check for NA values in the fields I plan to keep
flights_2014_1[['OriginAirportID', 'Origin', 'DestAirportID', 'Dest']].isna().sum()

OriginAirportID    0
Origin             0
DestAirportID      0
Dest               0
dtype: int64

The numbers of unique codes for OriginAirportID/DestAirportID and Origin/Dest are equal, and none of these fields has NA values.
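Equal counts of unique values are a necessary but not a sufficient sign that each airport ID maps to exactly one IATA code. A stricter check could look like the sketch below; `is_one_to_one` is a hypothetical helper, not part of the DataWrangling module, and both sample frames are synthetic.

```python
import pandas as pd

def is_one_to_one(df, id_col, code_col):
    """Return True if every id maps to exactly one code and vice versa."""
    pairs = df[[id_col, code_col]].drop_duplicates()
    return bool(pairs[id_col].is_unique and pairs[code_col].is_unique)

# Synthetic sample with a clean one-to-one mapping
good = pd.DataFrame({"OriginAirportID": [11298, 11298, 12278, 13303],
                     "Origin": ["DFW", "DFW", "ICT", "MIA"]})

# Synthetic sample: 2 unique IDs and 2 unique codes, but ID 1 maps to both codes
bad = pd.DataFrame({"OriginAirportID": [1, 2, 1],
                    "Origin": ["AAA", "BBB", "BBB"]})

print(is_one_to_one(good, "OriginAirportID", "Origin"))
print(is_one_to_one(bad, "OriginAirportID", "Origin"))
```

On the real data the same call would be applied to `flights_2014_1` for both the Origin and the Dest pairs.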

###  Departure and Arrival times (scheduled and actual)

In [21]:
# Sample of the block of data for departure times
DepFields = ['CRSDepTime', 
             'DepTime',
             'DepDelay',
             'DepDelayMinutes',
             'DepDel15',
             'DepartureDelayGroups',
             'DepTimeBlk']
flights_2014_1.loc[:5, DepFields]

Unnamed: 0,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups,DepTimeBlk
0,940,935.0,-5.0,0.0,0.0,-1.0,0900-0959
1,940,951.0,11.0,11.0,0.0,0.0,0900-0959
2,1135,1144.0,9.0,9.0,0.0,0.0,1100-1159
3,1135,1134.0,-1.0,0.0,0.0,-1.0,1100-1159
4,1135,1129.0,-6.0,0.0,0.0,-1.0,1100-1159
5,1135,1141.0,6.0,6.0,0.0,0.0,1100-1159


In [22]:
# Sample of the block of data for arrival times
ArrFields = ['CRSArrTime', 
             'ArrTime',
             'ArrDelay',
             'ArrDelayMinutes',
             'ArrDel15',
             'ArrivalDelayGroups',
             'ArrTimeBlk']
flights_2014_1.loc[:5, ArrFields]

Unnamed: 0,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups,ArrTimeBlk
0,1055,1051.0,-4.0,0.0,0.0,-1.0,1000-1059
1,1055,1115.0,20.0,20.0,1.0,1.0,1000-1059
2,1300,1302.0,2.0,2.0,0.0,0.0,1300-1359
3,1300,1253.0,-7.0,0.0,0.0,-1.0,1300-1359
4,1300,1244.0,-16.0,0.0,0.0,-2.0,1300-1359
5,1300,1301.0,1.0,1.0,0.0,0.0,1300-1359


Two groups of data, for Departure and Arrival times, share the same structure and have similar meanings, differing only in whether they refer to departure or arrival. Therefore, I will group and describe them together. 

All time data in the dataset is stored as numbers in HHMM format. For further analysis, it would be useful to convert these values into the number of minutes from the start of the day.

29.	**CRSDepTime / 40. CRSArrTime** 
    * **Description:** CRS Departure/Arrival Time (local time: hhmm)
    * **Data type:** float
    * **Keep**
    * **Comment:** CRS (Computerized Reservation System) time is the scheduled time of the flight. I decided to keep this data at this stage because it is not yet clear which will drive the future model: the time itself or the categorical time blocks, such as DepTimeBlk or ArrTimeBlk. I will make this decision later. Needs to be converted to the number of minutes.
30.	**DepTime / 41. ArrTime**  
    * **Description:** Actual Departure/Arrival Time (local time: hhmm)
    * **Data type:** float
    * **Keep**
    * **Comment:** Needs to be converted to the number of minutes.
31.	**DepDelay / 42. ArrDelay**  
    * **Description:** Difference in minutes between scheduled and actual departure/arrival time. Early departures/arrivals show negative numbers.
    * **Data type:** float
    * **Keep**
32.	**DepDelayMinutes / 43. ArrDelayMinutes**
    * **Description:** Difference in minutes between scheduled and actual departure/arrival time. Early departures/arrivals set to 0.
    * **Drop**
    * **Comment:** This data partially duplicates the fields DepDelay and ArrDelay. The only difference is that it does not show negative values, that is, early departures or arrivals.
33.	**DepDel15 / 44. ArrDel15**
    * **Description:** Departure/Arrival Delay Indicator, 15 Minutes or More (1=Yes)
    * **Drop**
    * **Comment:** This field is a boolean indicating whether or not the flight was delayed. The same information, in more detail, is available in the fields with the delay in minutes.
34.	**DepartureDelayGroups / 45. ArrivalDelayGroups**
    * **Description:** Departure/Arrival Delay intervals, every 15 minutes from <-15 to >180
    * **Data type:** category
    * **Keep**
    * **Comment:** This categorical data can be useful for the prediction model instead of the actual delay time. We have to decide between them later.
35.	**DepTimeBlk / 46. ArrTimeBlk**
    * **Description:** CRS Departure/Arrival Time Block, Hourly Intervals
    * **Data type:** category
    * **Keep**
    * **Comment:** This categorical data will probably be more useful for the prediction model than the actual departure or arrival times in minutes. 
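To illustrate why the dropped delay fields carry no extra information, the sketch below rebuilds them from the delay in minutes. The input values are the DepDelay values from the sample rows above plus one NA. The delay-group formula (floor of the delay divided by 15, limited to the range -2 to 12) is an assumption inferred from the sample rows and the observed group values, not a documented definition.

```python
import numpy as np
import pandas as pd

# DepDelay values from the sample rows above, plus one missing value
dep_delay = pd.Series([-5.0, 11.0, 9.0, -1.0, -6.0, 6.0, np.nan])

# DepDelayMinutes: early departures are set to 0
dep_delay_minutes = dep_delay.clip(lower=0)

# DepDel15: 1.0 if delayed by 15 minutes or more, NA where the delay is NA
dep_del15 = (dep_delay >= 15).astype(float).where(dep_delay.notna())

# Assumed rule for DepartureDelayGroups: 15-minute bins limited to [-2, 12]
dep_delay_groups = np.floor(dep_delay / 15).clip(lower=-2, upper=12)

print(pd.DataFrame({
    "DepDelay": dep_delay,
    "DepDelayMinutes": dep_delay_minutes,
    "DepDel15": dep_del15,
    "DepartureDelayGroups": dep_delay_groups,
}))
```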

The following data represents various times and durations for processes that I believe are highly correlated with the data described above. I don’t think this data adds value to the prediction model, so I plan to drop it:
36.	TaxiOut: Taxi Out Time, in Minutes  
37.	WheelsOff: Wheels Off Time (local time: hhmm)  
38.	WheelsOn: Wheels On Time (local time: hhmm)  
39.	TaxiIn: Taxi In Time, in Minutes  



Let’s check the fields that are supposed to be kept for null values and inconsistent data.

In [23]:
# Check for values of the Departure block fields
flights_2014_1.loc[:, DepFields].describe()

Unnamed: 0,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,DepDel15,DepartureDelayGroups
count,471949.0,441622.0,441622.0,441622.0,441622.0,441622.0
mean,1324.137998,1340.554716,16.049397,18.382259,0.26782,0.46893
std,459.332673,473.999078,45.539116,44.463942,0.442824,2.503865
min,5.0,1.0,-112.0,0.0,0.0,-2.0
25%,930.0,940.0,-4.0,0.0,0.0,-1.0
50%,1320.0,1333.0,0.0,0.0,0.0,0.0
75%,1715.0,1729.0,17.0,17.0,1.0,1.0
max,2359.0,2400.0,1560.0,1560.0,1.0,12.0


In [24]:
# Check for values of the Arrival block fields
flights_2014_1.loc[:, ArrFields].describe()

Unnamed: 0,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes,ArrDel15,ArrivalDelayGroups
count,471949.0,440453.0,439620.0,439620.0,439620.0,439620.0
mean,1510.874029,1501.132577,12.745314,18.775929,0.272949,0.314458
std,470.669308,496.412439,48.05817,44.886828,0.445476,2.645731
min,1.0,1.0,-112.0,0.0,0.0,-2.0
25%,1130.0,1128.0,-11.0,0.0,0.0,-1.0
50%,1527.0,1528.0,-1.0,0.0,0.0,-1.0
75%,1910.0,1915.0,17.0,17.0,1.0,1.0
max,2359.0,2400.0,1530.0,1530.0,1.0,12.0


In [25]:
# Check for NA-values in the Departure block fields
flights_2014_1.loc[:, DepFields].isna().sum()

CRSDepTime                  0
DepTime                 30327
DepDelay                30327
DepDelayMinutes         30327
DepDel15                30327
DepartureDelayGroups    30327
DepTimeBlk                  0
dtype: int64

In [26]:
# Check for NA-values in the Arrival block fields
flights_2014_1.loc[:, ArrFields].isna().sum()

CRSArrTime                0
ArrTime               31496
ArrDelay              32329
ArrDelayMinutes       32329
ArrDel15              32329
ArrivalDelayGroups    32329
ArrTimeBlk                0
dtype: int64

In [27]:
print(flights_2014_1['DepartureDelayGroups'].unique())
print(flights_2014_1['ArrivalDelayGroups'].unique())

[-1.  0.  1.  2.  6.  4.  3.  5. -2. nan 12.  9. 11.  8. 10.  7.]
[-1.  1.  0. -2.  2.  6.  3.  5. nan  7.  4.  8. 12.  9. 10. 11.]


In [28]:
print(sorted(flights_2014_1['DepTimeBlk'].unique()))
print(sorted(flights_2014_1['ArrTimeBlk'].unique()))

['0001-0559', '0600-0659', '0700-0759', '0800-0859', '0900-0959', '1000-1059', '1100-1159', '1200-1259', '1300-1359', '1400-1459', '1500-1559', '1600-1659', '1700-1759', '1800-1859', '1900-1959', '2000-2059', '2100-2159', '2200-2259', '2300-2359']
['0001-0559', '0600-0659', '0700-0759', '0800-0859', '0900-0959', '1000-1059', '1100-1159', '1200-1259', '1300-1359', '1400-1459', '1500-1559', '1600-1659', '1700-1759', '1800-1859', '1900-1959', '2000-2059', '2100-2159', '2200-2259', '2300-2359']


#### Problems to Analyze and Solve:

1.	**DepTime** and **ArrTime** contain time values of ‘2400’ (while the corresponding CRS times only go up to 2359). These need to be converted to 00:00 of the next day.
2.	The time values in HHMM format should be converted into the number of minutes from the start of the day.
3.	**DepTime** has the same number of missing values (NA) as DepDelay, but **ArrTime** and **ArrDelay** have different numbers of NA values compared to **DepTime** and **DepDelay**. This discrepancy might be related to canceled or diverted flights and needs further investigation.
4.	Delay Groups (Arrival and Departure) contain NaN values, which also need to be examined.
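Problems 1 and 2 could be handled together along the following lines. `hhmm_to_minutes` is a hypothetical helper, not part of the DataWrangling module, and returning a separate next-day flag is just one possible way to keep the rollover information.

```python
import numpy as np
import pandas as pd

def hhmm_to_minutes(hhmm):
    """Convert HHMM-encoded times to minutes after midnight; NaN stays NaN.

    Returns the minutes and a boolean flag marking values encoded as 2400,
    which are wrapped to 0 (00:00 of the next day).
    """
    hhmm = pd.Series(hhmm, dtype="float64")
    minutes = (hhmm // 100) * 60 + hhmm % 100
    next_day = minutes == 1440          # 2400 -> 24 * 60
    return minutes.where(~next_day, 0.0), next_day

# Values taken from the outputs above, plus a missing value
minutes, next_day = hhmm_to_minutes([940, 2400, 5, 2359, np.nan])
print(minutes.tolist())
print(next_day.tolist())
```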

In [29]:
# Test that all flights with NA Departure Time (DepTime) were cancelled
print('The Cancelled field has only these values:', flights_2014_1['Cancelled'].unique())
print('In this dataset there are', int(flights_2014_1['Cancelled'].sum()), 'cancelled flights in total')
# .sum() counts the True values; .count() would only count non-null rows
print('Among', flights_2014_1.DepTime.isna().sum(), 'flights with NA DepTime there are', 
      int(flights_2014_1[flights_2014_1.DepTime.isna()]['Cancelled'].sum()), 'cancelled flights')

The Cancelled field has only these values: [False  True]
In this dataset there are 30852 cancelled flights in total
Among 30327 flights with NA DepTime there are 30327 cancelled flights


Okay, we see that all NA values in the **DepTime** field are explained by flight cancellations. However, there’s something interesting: the total number of canceled flights is higher than the number of NA values in the **DepTime** field. Does this mean that some flights were canceled but still have a recorded departure time? Let’s take a closer look.

In [30]:
# How many Cancelled and Diverted flights are there in total?
flights_2014_1[['Cancelled', 'Diverted']].agg('sum')

Cancelled    30852
Diverted      1477
dtype: int64

In [31]:
# Split of Cancelled and Diverted flights by whether DepTime is present (True) or NA (False)
flights_2014_1.groupby(~flights_2014_1['DepTime'].isna())[['Cancelled', 'Diverted']].agg('sum')

Unnamed: 0_level_0,Cancelled,Diverted
DepTime,Unnamed: 1_level_1,Unnamed: 2_level_1
False,30327,0
True,525,1477


We see that 525 flights have a recorded departure time but were canceled.

In [32]:
# Check that all cancelled flights with an existing departure time have no Time in Air (flight time)
departured_cancelled_flights = (~flights_2014_1['DepTime'].isna()) & (flights_2014_1['Cancelled'] == 1)
print('Do all flights that departed but were cancelled have NA as AirTime?',
      flights_2014_1[departured_cancelled_flights]['AirTime'].isna().all())

Do all flights that departed but were cancelled have NA as AirTime? True


So, all flights that departed but were canceled never actually took off (as they don’t have AirTime). I can conclude that they left the gate but were canceled before takeoff and returned to the gate.

In [33]:
# Investigating which flights have NA DepartureDelayGroups
NA_DepDelay_group = flights_2014_1['DepartureDelayGroups'].isna()
print('Number of NA values in DepartureDelayGroups:', NA_DepDelay_group.sum())

cancelled_before_depurture = flights_2014_1['DepTime'].isna() & (flights_2014_1['Cancelled'] == 1)
print('Do all flights cancelled prior to departure have an NA value in DepartureDelayGroups?',
      flights_2014_1[cancelled_before_depurture]['DepartureDelayGroups'].isna().all())

Number of NA values in DepartureDelayGroups: 30327
Do all flights cancelled prior to departure have an NA value in DepartureDelayGroups? True


Conclusions:

1.	Flights that have left the gate can still be canceled before takeoff (and return to the gate), or they can take off and be diverted to a different destination.
2.	All flights with NA values in the actual Departure Time (‘DepTime’ field) were canceled and also have NA values in the DepartureDelayGroups.

Let’s examine the Arrival times in more detail.

In [34]:
# Split of Cancelled and Diverted flights by whether ArrTime is present (True) or NA (False)
flights_2014_1.groupby(~flights_2014_1['ArrTime'].isna())[['Cancelled', 'Diverted']].agg('sum')

Unnamed: 0_level_0,Cancelled,Diverted
ArrTime,Unnamed: 1_level_1,Unnamed: 2_level_1
False,30852,644
True,0,833


In this split, it’s interesting to note that some diverted flights (which were directed to other airports) still have an Arrival time. What does this Arrival time indicate? Is it the arrival time at the destination airport or the airport where the flight was diverted? Let’s examine this question using the 'DivReachedDest' field, which indicates whether the flights reached their destination after being diverted.

In [35]:
# Compare diverted flights that have an ArrTime with those flagged as having reached their destination
diverted_but_arrived = (~flights_2014_1['ArrTime'].isna()) & (flights_2014_1['Diverted'] == 1)
print('Number of diverted flights that have ArrTime', 
      flights_2014_1[diverted_but_arrived]['Flight_Number_Reporting_Airline'].count())
print('Number of diverted flights that reached their initial destination', 
      flights_2014_1['DivReachedDest'].sum())
print('Are all these the same flights?',
      flights_2014_1[diverted_but_arrived]['DivReachedDest'].sum() == flights_2014_1['DivReachedDest'].sum())

Number of diverted flights that have ArrTime 833
Number of diverted flights that reached their initial destination 833
Are all these the same flights? True


So, yes, all diverted flights that ultimately reached their initial destination have an Arrival time.

1.	All flights with an actual Arrival Time (‘ArrTime’) either flew directly from the Origin to the Destination or were diverted but eventually reached the Destination.
2.	All flights with NA in the ArrTime were either canceled or diverted and landed at a different airport (not the Destination airport).

In [36]:
# Investigating which flights have NA ArrivalDelayGroups
NA_ArrivalDelay_groups = flights_2014_1['ArrivalDelayGroups'].isna()
flight_cancelled_OR_diverted = (flights_2014_1['Cancelled'] == 1) | (flights_2014_1['Diverted'] == 1)
print('The number of flights with NA value in ArrivalDelayGroups', NA_ArrivalDelay_groups.sum())
print('Were all flights with an NA value in ArrivalDelayGroups canceled or diverted?',
      flights_2014_1[NA_ArrivalDelay_groups & flight_cancelled_OR_diverted]['Flight_Number_Reporting_Airline'].count() ==
      flights_2014_1[NA_ArrivalDelay_groups]['Flight_Number_Reporting_Airline'].count())

The number of flights with NA value in ArrivalDelayGroups 32329
Were all flights with an NA value in ArrivalDelayGroups canceled or diverted? True


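All flights with an NA value in ArrivalDelayGroups were either canceled or diverted. The counts also match in the other direction: 30,852 canceled plus 1,477 diverted flights give exactly the 32,329 NA values.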
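The findings about missing departure and arrival times can be summarised as a single outcome label per flight. The sketch below is one possible encoding: `flight_outcome` and the label names are hypothetical, and the `sample` frame is synthetic.

```python
import numpy as np
import pandas as pd

def flight_outcome(df):
    """Label each flight using the Cancelled/Diverted flags and the NA pattern of the times."""
    cancelled = df["Cancelled"].astype(bool).to_numpy()
    diverted = df["Diverted"].astype(bool).to_numpy()
    has_dep = df["DepTime"].notna().to_numpy()
    has_arr = df["ArrTime"].notna().to_numpy()
    conditions = [
        cancelled & ~has_dep,   # never left the gate
        cancelled & has_dep,    # left the gate, cancelled before takeoff
        diverted & has_arr,     # diverted, but reached the destination
        diverted & ~has_arr,    # diverted and landed elsewhere
    ]
    labels = [
        "cancelled_before_departure",
        "cancelled_after_gate_departure",
        "diverted_reached_destination",
        "diverted_not_reached",
    ]
    return pd.Series(np.select(conditions, labels, default="completed"), index=df.index)

# Synthetic sample with one flight of each kind
sample = pd.DataFrame({
    "Cancelled": [False, True, True, False, False],
    "Diverted": [False, False, False, True, True],
    "DepTime": [935.0, np.nan, 1210.0, 800.0, 815.0],
    "ArrTime": [1051.0, np.nan, np.nan, 1130.0, np.nan],
})
outcome = flight_outcome(sample)
print(outcome.tolist())
```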

### Flight Status and Reasons for Delay  

I plan to keep all of the following fields because they help interpret the departure and arrival time fields and calculate the actual elapsed time:

47.	**Cancelled**
    * **Description:** Cancelled Flight Indicator (1=Yes)  
    * **Data type:** boolean
    * **Keep**

48.	**CancellationCode**
    * **Description:** Specifies The Reason For Cancellation
    * **Data type:**  category
    * **Keep**

49.	**Diverted**
    * **Description:** Diverted Flight Indicator (1=Yes)  
    * **Data type:** boolean
    * **Keep**
    

I plan to keep the next fields to analyse the correlation between these reasons for delay and specific airlines, airports, or states:

56.	**CarrierDelay**
    * **Description:** Carrier Delay, in Minutes  
    * **Data type:** float
    * **Keep**

57.	**WeatherDelay**  
    * **Description:** Weather Delay, in Minutes  
    * **Data type:** float
    * **Keep**

58.	**NASDelay** 
    * **Description:** National Air System Delay, in Minutes  
    * **Data type:** float
    * **Keep**

59.	**SecurityDelay**  
    * **Description:** Security Delay, in Minutes  
    * **Data type:** float
    * **Keep**

60.	**LateAircraftDelay**
    * **Description:** Late Aircraft Delay, in Minutes  
    * **Data type:** float
    * **Keep**

In [37]:
# Discovering the values and NA-values
flight_status_fields = ['Cancelled',
                        'CancellationCode',
                        'Diverted']
print(flights_2014_1[flight_status_fields].describe())
print()
print(flights_2014_1[flight_status_fields].isna().sum())

       Cancelled CancellationCode Diverted
count     471949            30852   471949
unique         2                3        2
top        False                B    False
freq      441097            19108   470472

Cancelled                0
CancellationCode    441097
Diverted                 0
dtype: int64


In [38]:
# Which values does the Cancellation Code field have?
flights_2014_1['CancellationCode'].unique()

array([nan, 'B', 'A', 'C'], dtype=object)

In [39]:
# Examining the presence of Cancellation Codes for all Cancelled flights
print(flights_2014_1.groupby(['Cancelled', 'Diverted'])['CancellationCode'].agg('count'))
no_cancellation_code = flights_2014_1['CancellationCode'].isna()
# Compare the column, not the string 'Cancelled', and select the flights that were cancelled
was_cancelled = flights_2014_1['Cancelled'] == 1
print('\nNumber of records (flights) that were cancelled but have no Cancellation Code:',
      len(flights_2014_1[no_cancellation_code & was_cancelled]))

Cancelled  Diverted
False      False           0
           True            0
True       False       30852
Name: CancellationCode, dtype: int64

Number of records (flights) that were cancelled but have no Cancellation Code: 0


The conclusion is that the data represented in the Flight Status and Cancellation Code fields is accurate and comprehensive.
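
The consistency rules behind this conclusion can be written as explicit checks. The following is a minimal sketch on a toy frame that reuses the three field names above; the variable names are illustrative only.

```python
import pandas as pd

# Toy data mirroring the three flight status fields
sample = pd.DataFrame({
    'Cancelled': [False, True, True, False],
    'CancellationCode': [None, 'A', 'B', None],
    'Diverted': [False, False, False, True],
})

has_code = sample['CancellationCode'].notna()

# Cancelled flights that lack a Cancellation Code
cancelled_without_code = int((sample['Cancelled'] & ~has_code).sum())
# Flights that were not cancelled but still carry a Cancellation Code
code_without_cancellation = int((~sample['Cancelled'] & has_code).sum())
# Flights flagged as both cancelled and diverted
cancelled_and_diverted = int((sample['Cancelled'] & sample['Diverted']).sum())

print(cancelled_without_code, code_without_cancellation, cancelled_and_diverted)
```

All three counts being zero corresponds to what the grouped counts for January 2014 show.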

### Elapsed time  

The scheduled and actual Elapsed Time data are the primary indicators for assessing the accuracy of the dataset. This is why it is important to retain this data, at least in the initial stage. AirTime is also valuable, as it helps identify flights that were canceled after departure, as we have already noted.

50.	**CRSElapsedTime**
    * **Description:** CRS Elapsed Time of Flight, in Minutes  
    * **Data type:** float
    * **Keep**

51.	**ActualElapsedTime**  
    * **Description:** Elapsed Time of Flight, in Minutes  
    * **Data type:** float
    * **Keep**

52.	**AirTime**  
    * **Description:** Flight Time, in Minutes  
    * **Data type:** float
    * **Keep**


The following fields can be dropped because they seem irrelevant or can be derived from other fields:  

53.	**Flights**  
    * **Description:** Number of Flights  
    * **Drop**  

54.	**Distance**  
    * **Description:** Distance between airports (miles)  
    * **Drop**  

55.	**DistanceGroup**
    * **Description:**  Distance Intervals, every 250 Miles, for Flight Segment  
    * **Drop**
 
61.	**FirstDepTime**  
    * **Description:** First Gate Departure Time at Origin Airport  
    * **Drop**

62.	**TotalAddGTime**  
    * **Description:** Total Ground Time Away from Gate for Gate Return or Cancelled Flight  
    * **Drop**

63.	**LongestAddGTime**  
    * **Description:** Longest Time Away from Gate for Gate Return or Cancelled Flight  
    * **Drop**

64.	**DivAirportLandings**
    * **Description:** Number of Diverted Airport Landings  
    * **Drop**

As noted earlier, the following fields are useful for evaluating the actual elapsed time of diverted flights, so we need to keep them:  

65.	**DivReachedDest**  
    * **Description:** Diverted Flight Reaching Scheduled Destination Indicator (1=Yes)  
    * **Data type:** boolean
    * **Keep**

66.	**DivActualElapsedTime**
    * **Description:** Elapsed Time of Diverted Flight Reaching Scheduled Destination, in Minutes. The ActualElapsedTime column remains NULL for all diverted flights.  
    * **Data type:** float
    * **Keep**

67.	**DivArrDelay**  
    * **Description:** Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.  
    * **Data type:** float
    * **Keep**
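
Because `ActualElapsedTime` and `ArrDelay` are NULL for diverted flights, one possible way to use the diverted fields is to fall back to them when the regular value is missing. This is only a sketch on toy data; the output column names `TotalElapsedTime` and `TotalArrDelay` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: a regular flight, a diverted flight that reached its destination,
# and a diverted flight that did not
sample = pd.DataFrame({
    'Diverted': [False, True, True],
    'DivReachedDest': [False, True, False],
    'ActualElapsedTime': [120.0, np.nan, np.nan],
    'DivActualElapsedTime': [np.nan, 310.0, np.nan],
    'ArrDelay': [5.0, np.nan, np.nan],
    'DivArrDelay': [np.nan, 180.0, np.nan],
})

# Prefer the regular value and fall back to the diverted value when it is missing
sample['TotalElapsedTime'] = sample['ActualElapsedTime'].fillna(sample['DivActualElapsedTime'])
sample['TotalArrDelay'] = sample['ArrDelay'].fillna(sample['DivArrDelay'])
print(sample[['TotalElapsedTime', 'TotalArrDelay']])
```

A diverted flight that never reached its scheduled destination stays NaN in both combined columns.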

The distance in miles between the scheduled destination and the final diverted airport is unimportant for the proposed model: if a flight landed at a different location, it is reasonable for the model to treat that flight as not having arrived at its destination.

68.	**DivDistance**
    * **Description:** Distance between scheduled destination and final diverted airport (miles). Value will be 0 for diverted flight reaching scheduled destination.  
    * **Drop**

In [40]:
# Checking value ranges and NA values for the Elapsed time data block
elapsed_time_fields = ['CRSElapsedTime',
                       'ActualElapsedTime',
                       'AirTime']
print(flights_2014_1[elapsed_time_fields].describe())
print(flights_2014_1[elapsed_time_fields].isna().sum())

       CRSElapsedTime  ActualElapsedTime        AirTime
count   471949.000000      439620.000000  439620.000000
mean       137.400615         135.086875     111.636413
std         74.130674          74.329305      72.117051
min         19.000000          16.000000       7.000000
25%         85.000000          80.000000      59.000000
50%        118.000000         116.000000      92.000000
75%        170.000000         168.000000     142.000000
max        670.000000         764.000000     688.000000
CRSElapsedTime           0
ActualElapsedTime    32329
AirTime              32329
dtype: int64
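
The 32,329 missing values appear to be fully explained by cancelled and diverted flights. The arithmetic below uses only the counts printed above and assumes that no flight is both cancelled and diverted, which matches the grouped counts shown earlier.

```python
total_flights = 471_949
not_cancelled = 441_097    # frequency of Cancelled == False
not_diverted = 470_472     # frequency of Diverted == False
missing_elapsed = 32_329   # NA count of ActualElapsedTime and AirTime

cancelled = total_flights - not_cancelled
diverted = total_flights - not_diverted

# If the two groups do not overlap, their sum should equal the NA count
print(cancelled, diverted, cancelled + diverted)
```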


In [41]:
# Checking value ranges and NA values for the Diverted flights data block
diverted_flights_fields = ['DivReachedDest',
                           'DivActualElapsedTime',
                           'DivArrDelay']
print(flights_2014_1[diverted_flights_fields].describe())
print()
print(flights_2014_1[diverted_flights_fields].isna().sum())

       DivActualElapsedTime  DivArrDelay
count            833.000000   833.000000
mean             364.955582   228.453782
std              195.155613   183.260456
min               90.000000     9.000000
25%              250.000000   129.000000
50%              313.000000   178.000000
75%              417.000000   251.000000
max             1968.000000  1651.000000

DivReachedDest               0
DivActualElapsedTime    471116
DivArrDelay             471116
dtype: int64


### Other information about diverted flights

The rest of the dataset contains five identical blocks, one for each of up to five airports to which a flight can be diverted consecutively. Each block contains:
* Diverted Airport Code,  
* Airport ID of Diverted Airport,  
* Airport Sequence ID of Diverted Airport,  
* Wheels On Time (local time: hhmm) at Diverted Airport Code,  
* Total Ground Time Away from Gate at Diverted Airport Code,  
* Longest Ground Time Away from Gate at Diverted Airport Code,  
* Wheels Off Time (local time: hhmm) at Diverted Airport Code,  
* Aircraft Tail Number for Diverted Airport Code

None of this information is relevant to the project, so I plan to drop it.
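
One way to drop these five blocks without listing every column is to match them by prefix. The sketch below assumes the block columns are named `Div1...` to `Div5...` (for example `Div1Airport`); these names are not shown in this section and should be checked against the actual header.

```python
import pandas as pd

# Empty toy frame: three fields to keep plus a few assumed block columns
sample = pd.DataFrame(columns=[
    'DivReachedDest', 'DivActualElapsedTime', 'DivArrDelay',
    'Div1Airport', 'Div1AirportID', 'Div1TailNum',
    'Div2Airport', 'Div5WheelsOn',
])

# Match only the numbered per-airport blocks, so DivReachedDest,
# DivActualElapsedTime and DivArrDelay are not affected
div_block_columns = sample.filter(regex=r'^Div[1-5]').columns
trimmed = sample.drop(columns=div_block_columns)
print(list(trimmed.columns))
```

Requiring a digit right after `Div` is what keeps the three retained diverted fields out of the match.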

### Conclusion About Data Structure and Quality:

1.	The data needed for the project has been selected.
2.	The proposed data type for each field has been determined.
3.	The quality of the data is good. The information represented in the dataset is mostly comprehensive, but I will examine it in more detail in the next section.
4.	The logic regarding canceled and diverted flights, as well as the implications for actual departure and arrival times, time delays, and elapsed times, has been identified and documented.
5.	During the data transformation stage, I need to change the data types for the selected fields in the model and address the ‘time-2400’ issue.
6.	I plan to use delay times as a predictive variable in the model. Rather than relying on the provided delay times, I will calculate them directly by subtracting the scheduled arrival time from the actual arrival time. To ensure the accuracy of these times, I will conduct an extensive check in the following sections by comparing them with elapsed times.
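
Points 5 and 6 can be sketched together. The example below assumes that times are stored as local `hhmm` integers in columns named `CRSArrTime` and `ArrTime`, that a value of 2400 means midnight, and that the true delay lies within 12 hours of the schedule so that arrivals crossing midnight can be wrapped. The helper names and the `ArrDelayCalc` column are hypothetical.

```python
import pandas as pd

def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert local hhmm values (e.g. 1435) to minutes after midnight; 2400 becomes 0."""
    hhmm = hhmm % 2400
    return (hhmm // 100) * 60 + hhmm % 100

def arrival_delay(crs_arr: pd.Series, arr: pd.Series) -> pd.Series:
    """Actual minus scheduled arrival, in minutes, wrapped across midnight."""
    diff = hhmm_to_minutes(arr) - hhmm_to_minutes(crs_arr)
    # Assumption: the delay is between -720 and +719 minutes
    return (diff + 720) % 1440 - 720

# Toy data: a late flight, a late flight crossing midnight,
# an early flight crossing midnight, and the 2400 case
sample = pd.DataFrame({
    'CRSArrTime': [1430, 2350, 10, 2400],
    'ArrTime': [1445, 15, 2355, 2400],
})
sample['ArrDelayCalc'] = arrival_delay(sample['CRSArrTime'], sample['ArrTime'])
print(sample)
```

Delays longer than 12 hours would be wrapped incorrectly by this shortcut, so a date-aware calculation would be needed for those flights.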