# Data Collection

The original dataset, US Domestic Flights Delay Prediction (2013 - 2018) (Source: [Kaggle](https://www.kaggle.com/datasets/gabrielluizone/us-domestic-flights-delay-prediction-2013-2018)), is provided as a zip archive of 1.54 GB. When decompressed, the archive contains 60 files with a total size of 13.6 GB. Each file corresponds to one month of data, starting from January 2014 and ending in December 2018.

Steps for collecting and cleaning the data:

*	Loading the first file
*	Evaluating the data’s size and quality
*	Defining the specific data needed for the project
*	Developing a process for data organization and cleaning
*	Testing the code on one month’s data file
*	Applying the code to all files
*	Accumulating the complete dataset


The archive file (csv_flight.zip) was downloaded and saved locally.

## Loading the first file

In [None]:
# import required modules
import zipfile
import os
import pandas as pd
os.chdir('/Users/a.kholodov/Documents/02. Personal/20. Education/50. Universities/Springboard/Springboard_git/Springboard _repo/CS2-flights-delay-REPO')
os.getcwd()

In [2]:

source_zip_file = 'data/interim/csv_flight.zip'
data_file = 'csv_flight/report_2014_1.csv'

# read the first file to evaluate the data
with zipfile.ZipFile(source_zip_file) as zip_source:
    # use a separate name for the file handle so the data_file path is not overwritten
    with zip_source.open(data_file) as csv_file:
        flights_2014_1 = pd.read_csv(csv_file, low_memory=False)

## Data size

In [None]:
print(flights_2014_1.shape)
print(flights_2014_1.info())

The data for one month contains 471,949 rows and 110 columns, with a total memory size of 396 MB. The estimated size of the entire dataset, without reorganization or cleaning, may exceed 23 GB, which could be challenging to process locally. Therefore, one of the goals of data preparation will be to reduce the dataset size without compromising quality.
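The extrapolation above is simple arithmetic: 396 MB × 60 files ≈ 23.8 GB. As a minimal sketch, the same estimate can be computed from any loaded month. The helper name `estimate_total_memory_gb` and the toy frame are illustrative assumptions, not part of the original notebook, and `memory_usage(deep=True)` also counts string contents, so it can report more than `info()` does by default.

```python
import pandas as pd

def estimate_total_memory_gb(df, n_files):
    """Extrapolate the in-memory size of n_files similar DataFrames, in GB."""
    bytes_per_file = df.memory_usage(deep=True).sum()
    return bytes_per_file * n_files / 1e9

# Toy frame standing in for one month of data (hypothetical values)
sample = pd.DataFrame({'Year': [2014] * 1000, 'Origin': ['JFK'] * 1000})
print(f'{estimate_total_memory_gb(sample, 60):.4f} GB for 60 similar files')
```

In the notebook, the call would be `estimate_total_memory_gb(flights_2014_1, 60)`.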

## Data structure and check
### Date of the flight

In [None]:
flights_2014_1.iloc[:,:6].head()

0. **Year**
    * **Description:** Year
    * **Data type:** int16
    * **Keep**
    * **Comment:** Although there is a FlightDate field, I decided to keep the separate fields because they may correlate with flight delays.

1. **Quarter**  
    * **Description:** Quarter (1-4)  
    * **Data type:** int8
    * **Keep**

2. **Month**  
    * **Description:** Month (1-12)  
    * **Data type:** int8
    * **Keep**

3. **DayofMonth**  
    * **Description:** Day of month  
    * **Data type:** int8
    * **Keep**

4. **DayOfWeek**  
    * **Description:** Day of week  
    * **Data type:** int8
    * **Keep**  

5. **FlightDate**  
    * **Description:** Flight Date (yyyymmdd)
    * **Data type:** datetime
    * **Keep**

Value check for the flight date fields: the data is accurate, with no outliers or NA values.

In [None]:
flights_2014_1.loc[:, ['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate']].describe()

In [None]:
flights_2014_1.iloc[:,5].unique()

However, during the data transformation we need to change the current data types to the ones listed above.

In [None]:
flights_2014_1.iloc[:,:6].info()

In [None]:
flights_2014_1.iloc[:,:6].isna().sum()

### Airline and flight details

In [None]:
flights_2014_1.iloc[:, 6:10].head()


6. **Reporting_Airline**
    * **Description:**  Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years. 
    * **Data type:** category 
    * **Keep**
    * **Comment:** I decided to keep this field and drop DOT_ID_Reporting_Airline and IATA_CODE_Reporting_Airline because the data provider declares it unique and recommends it for analysis.

7. **DOT_ID_Reporting_Airline**
    * **Description:** An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
    * **Drop**
    * **Comment:** Not unique
    
8. **IATA_CODE_Reporting_Airline**
    * **Description:** Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
    * **Drop**
    * **Comment:** Not unique

9. **Tail_Number**
    * **Description:** Tail Number
    * **Drop**
    * **Comment:** Irrelevant to the purposes of the project

10. **Flight_Number_Reporting_Airline**
    * **Description:** Flight Number
    * **Data type:** int16
    * **Keep**
    * **Comment:** Unique number of the flight at a specific day/time

In [None]:
flights_2014_1.iloc[:, 6].unique()

In [None]:
flights_2014_1['Flight_Number_Reporting_Airline'].describe()

In [None]:
flights_2014_1[['Reporting_Airline', 'Flight_Number_Reporting_Airline']].info()

In [None]:
flights_2014_1[['Reporting_Airline', 'Flight_Number_Reporting_Airline']].isna().sum()

The data in the Reporting_Airline and Flight_Number_Reporting_Airline fields is accurate, with no outliers or NA values, but the data types need to be changed during the data transformation.

### Origin and Destination details

In [None]:
DestFields = ['DestAirportID', 
                'DestAirportSeqID', 
                'DestCityMarketID',
                'Dest',
                'DestCityName',
                'DestState',
                'DestStateFips',
                'DestStateName',
                'DestWac']
flights_2014_1.loc[:5, DestFields]

In [None]:
OriginFields = ['OriginAirportID', 
                'OriginAirportSeqID', 
                'OriginCityMarketID',
                'Origin',
                'OriginCityName',
                'OriginState',
                'OriginStateFips',
                'OriginStateName',
                'OriginWac']
flights_2014_1.loc[:5, OriginFields]

Origin and Destination data have the same structure, so I will treat them the same way.

11.	**OriginAirportID / 20. DestAirportID**  
    * **Description:** Origin/Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.  
    * **Data type:** category  
    * **Keep**  
    * **Comment:** I decided to keep these codes because their uniqueness is assured by the data provider.  
    
12.	**OriginAirportSeqID / 21. DestAirportSeqID**  
    * **Description:** Origin/Destination Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
    * **Drop**
13.	**OriginCityMarketID / 22. DestCityMarketID**
    * **Description:** Origin/Destination Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
    * **Drop**
14.	**Origin / 23. Dest**
    * **Description:** Origin/Destination Airport
    * **Data type:** category
    * **Keep**
    * **Comment:** This code is the IATA code of the airport, which appears on most travel documents. It could be useful in the model.
15.	**OriginCityName / 24. DestCityName**
    * **Description:** Origin/Destination Airport, City Name
    * **Drop**
16.	**OriginState / 25. DestState**
    * **Description:** Origin/Destination Airport, State Code
    * **Drop**
17.	**OriginStateFips / 26. DestStateFips**
    * **Description:** Origin/Destination Airport, State Fips
    * **Drop**
18.	**OriginStateName / 27. DestStateName**
    * **Description:** Origin/Destination Airport, State Name
    * **Drop**
19.	**OriginWac / 28. DestWac**
    * **Description:** Origin/Destination Airport, World Area Code
    * **Drop**

In [None]:
flights_2014_1.loc[:, 'OriginAirportID'].unique()

In [None]:
flights_2014_1.loc[:, 'DestAirportID'].unique()

In [None]:
flights_2014_1.loc[:, 'Origin'].unique()

In [None]:
flights_2014_1.loc[:, 'Dest'].unique()

In [None]:
# Check for equal number of Origin Airport IDs and Origin (IATA codes)
print('Number of unique OriginAirport IDs:', flights_2014_1.loc[:, 'OriginAirportID'].nunique(),
      '\nNumber of unique Origin codes:', flights_2014_1.loc[:, 'Origin'].nunique())

In [None]:
# Check for equal number of Dest Airport IDs and Dest (IATA codes)
print('Number of unique DestAirport IDs:', flights_2014_1.loc[:, 'DestAirportID'].nunique(),
      '\nNumber of unique Destination codes:', flights_2014_1.loc[:, 'Dest'].nunique())

In [None]:
# Check for NA values in the fields I plan to keep
flights_2014_1[['OriginAirportID', 'Origin', 'DestAirportID', 'Dest']].isna().sum()

The numbers of unique codes for OriginAirportID/DestAirportID and Origin/Dest are equal, and these fields have no NA values.
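Equal counts of unique values suggest, but do not prove, that each airport ID corresponds to exactly one IATA code. A stricter check could look at the distinct (ID, code) pairs, as in the minimal sketch below. The helper `is_one_to_one` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def is_one_to_one(df, left, right):
    """Return True if every `left` value maps to exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Made-up rows for illustration only
toy = pd.DataFrame({'OriginAirportID': [10397, 12478, 10397],
                    'Origin': ['ATL', 'JFK', 'ATL']})
print(is_one_to_one(toy, 'OriginAirportID', 'Origin'))
```

On the real data, the call would be `is_one_to_one(flights_2014_1, 'OriginAirportID', 'Origin')`, and likewise for the destination fields.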

###  Departure and Arrival times (scheduled and actual)

In [None]:
# Sample of the block of data for departure times
DepFields = ['CRSDepTime', 
             'DepTime',
             'DepDelay',
             'DepDelayMinutes',
             'DepDel15',
             'DepartureDelayGroups',
             'DepTimeBlk']
flights_2014_1.loc[:5, DepFields]

In [None]:
# Sample of the block of data for arrival times
ArrFields = ['CRSArrTime', 
             'ArrTime',
             'ArrDelay',
             'ArrDelayMinutes',
             'ArrDel15',
             'ArrivalDelayGroups',
             'ArrTimeBlk']
flights_2014_1.loc[:5, ArrFields]

The two groups of fields (for departure and arrival times) have the same structure and similar meaning, differing only in whether they describe departure or arrival, so I group and describe them together.

All time values in the dataset are integers in hhmm format. For further analysis, it is worth converting them into the number of minutes from the start of the day.
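As a small worked example of this conversion: the hours are the integer part of hhmm divided by 100, and the minutes are the remainder. The function name `hhmm_to_minutes` is an illustrative assumption; NA values pass through unchanged because the array is treated as float.

```python
import numpy as np

def hhmm_to_minutes(hhmm):
    """Convert hhmm-encoded times (e.g. 930 for 09:30) to minutes since midnight."""
    hhmm = np.asarray(hhmm, dtype='float64')
    return (hhmm // 100) * 60 + (hhmm % 100)

# 00:05 -> 5, 09:30 -> 570, 23:59 -> 1439; NaN stays NaN
print(hhmm_to_minutes([5, 930, 2359, np.nan]))
```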

29.	**CRSDepTime / 40. CRSArrTime**
    * **Description:** CRS Departure/Arrival Time (local time: hhmm)
    * **Data type:** int16
    * **Keep**
    * **Comment:** CRS (Computer Reservation System) represents the scheduled time for the flight. I decided to keep this data at this stage because it's not yet clear which will be the drivers of the future model: the time itself or the categorical time blocks, such as DepTimeBlk or ArrTimeBlk. I plan to make this decision later. Needs to be converted to the number of minutes.
30.	**DepTime / 41. ArrTime**  
    * **Description:** Actual Departure/Arrival Time (local time: hhmm)
    * **Data type:** float
    * **Keep**
    * **Comment:** Needs to be converted to the number of minutes.
31.	**DepDelay / 42. ArrDelay**  
    * **Description:** Difference in minutes between scheduled and actual departure/arrival time. Early departures/arrivals show negative numbers.
    * **Data type:** float
    * **Keep**
32.	**DepDelayMinutes / 43. ArrDelayMinutes**
    * **Description:** Difference in minutes between scheduled and actual departure/arrival time. Early departures/arrivals set to 0.
    * **Drop**
    * **Comment:** This data partially duplicates the DepDelay and ArrDelay fields; the only difference is that it doesn't show negative values (early departures or arrivals).
33.	**DepDel15 / 44. ArrDel15**
    * **Description:** Departure/Arrival Delay Indicator, 15 Minutes or More (1=Yes)
    * **Drop**
    * **Comment:** This field is a boolean indicating whether or not the flight was delayed. We have the same, and even more detailed, information in the fields with the delay in minutes.
34.	**DepartureDelayGroups / 45. ArrivalDelayGroups**
    * **Description:** Departure/Arrival Delay intervals, every 15 minutes from <-15 to >180
    * **Data type:** category
    * **Keep**
    * **Comment:** This categorical data can be useful for the prediction model instead of the actual delay time. We have to decide on this later.
35.	**DepTimeBlk / 46. ArrTimeBlk**
    * **Description:** CRS Departure/Arrival Time Block, Hourly Intervals
    * **Data type:** category
    * **Keep**
    * **Comment:** This categorical data will probably be more useful for the prediction model than the actual departure or arrival times in minutes.

The following fields represent times and durations that I think are highly correlated with the data described above. I assume they add no value to the prediction model, so I am going to DROP them.  
36.	TaxiOut: Taxi Out Time, in Minutes  
37.	WheelsOff: Wheels Off Time (local time: hhmm)  
38.	WheelsOn: Wheels On Time (local time: hhmm)  
39.	TaxiIn: Taxi In Time, in Minutes  



Let's check the fields to be kept for null values and inconsistent data.

In [None]:
# Check for values of the Departure block fields
flights_2014_1.loc[:, DepFields].describe()

In [None]:
# Check for values of the Arrival block fields
flights_2014_1.loc[:, ArrFields].describe()

In [None]:
# Check for NA-values in the Departure block fields
flights_2014_1.loc[:, DepFields].isna().sum()

In [None]:
# Check for NA-values in the Arrival block fields
flights_2014_1.loc[:, ArrFields].isna().sum()

In [None]:
print(flights_2014_1['DepartureDelayGroups'].unique())
print(flights_2014_1['ArrivalDelayGroups'].unique())

In [None]:
print(sorted(flights_2014_1['DepTimeBlk'].unique()))
print(sorted(flights_2014_1['ArrTimeBlk'].unique()))

#### Problems to analyse and solve:
1. **DepTime** and **ArrTime** contain the time value '2400' (while the CRS times only go up to 2359). It has to be converted into 00:00 of the next day.
2. Times need to be converted into the number of minutes from the start of the day.
3. DepTime has the same number of NA values as DepDelay, but ArrTime and ArrDelay have a different number of NA values than DepTime and DepDelay. The reason may be related to cancelled and diverted flights. This needs to be checked.
4. The delay groups (arrival and departure) contain NaN values. This needs to be investigated.

In [None]:
# Test that all flights with NA Departure Time (DepTime) were cancelled
print('The Cancelled field has only these values:', flights_2014_1['Cancelled'].unique())
print('In this dataset there are', int(flights_2014_1['Cancelled'].sum()), 'cancelled flights in total')
# sum() counts the cancelled flights; count() would only count the non-NA rows
print('Among', flights_2014_1.DepTime.isna().sum(), 'flights with NA DepTime there are', 
      int(flights_2014_1[flights_2014_1.DepTime.isna()]['Cancelled'].sum()), 'cancelled flights')

OK. We see that all NA values in the DepTime field are explained by flight cancellations.
However, there is one more interesting thing: the total number of cancelled flights is higher than the number of NA values in the DepTime field. Does this mean that some flights were cancelled but still have a departure time? Let's take a look.

In [None]:
# How many Cancelled and Diverted flights are there in total?
flights_2014_1[['Cancelled', 'Diverted']].agg('sum')

In [None]:
# Split of flights with/without NA Departure Time vs. Cancelled and Diverted flights
flights_2014_1.groupby(flights_2014_1['DepTime'].isna())[['Cancelled', 'Diverted']].agg('sum')

In [None]:
# Check that no cancelled flight with an existing departure time has Air Time (flight time)
departured_cancelled_flights = (~flights_2014_1['DepTime'].isna()) & (flights_2014_1['Cancelled'] == 1)
print('Do all flights that departed but were cancelled have NA as AirTime?',
      flights_2014_1[departured_cancelled_flights]['AirTime'].isna().all())

In [None]:
# Investigating which flights have NA DepartureDelayGroups
NA_DepDelay_group = flights_2014_1['DepartureDelayGroups'].isna()
print('Number of NA values in DepartureDelayGroups:', NA_DepDelay_group.sum())

cancelled_before_departure = flights_2014_1['DepTime'].isna() & (flights_2014_1['Cancelled'] == 1)
print('Do all flights cancelled before departure have an NA value in DepartureDelayGroups?',
      flights_2014_1[cancelled_before_departure]['DepartureDelayGroups'].isna().all())

1. Flights that departed could still be cancelled after departure (did not take off, returned to the gate) OR diverted to a different destination.
2. All flights with an NA actual departure time ('DepTime' field) were cancelled and have an NA value in DepartureDelayGroups.

Let’s examine the Arrival times in more detail.

In [None]:
# What is a split of flights with/NA Arrival Time vs. Cancelled and Diverted flights 
flights_2014_1.groupby(flights_2014_1['ArrTime'].isna())[['Cancelled', 'Diverted']].agg('sum')

This split shows that some diverted flights (those directed to other airports) still have an arrival time. What does this arrival time mean? Is it the arrival time at the Destination airport or at the airport to which the flight was diverted?
Let's examine this question using the 'DivReachedDest' field, which marks the flights that reached the Destination after being diverted.

In [None]:
# Compare the diverted flights that have ArrTime with those that reached the destination
diverted_but_arrived = (~flights_2014_1['ArrTime'].isna()) & (flights_2014_1['Diverted'] == 1)
print('Number of diverted flights that have ArrTime:', 
      flights_2014_1[diverted_but_arrived]['Flight_Number_Reporting_Airline'].count())
print('Number of diverted flights that reached the initial destination:', 
      flights_2014_1['DivReachedDest'].sum())
print('Are these all the same flights?',
      flights_2014_1[diverted_but_arrived]['DivReachedDest'].sum() == flights_2014_1['DivReachedDest'].sum())

So, yes, all diverted flights that finally reached their initial destination have an arrival time.

1. All flights with an actual arrival time ('ArrTime') either flew directly from Origin to Destination OR were diverted but finally reached the Destination.
2. All flights with NA ArrTime were either cancelled or diverted and landed at a different destination.
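The logic found so far can be summarised as a single outcome label per flight. The sketch below is only an illustration of that logic, assuming the Cancelled, Diverted and DivReachedDest fields behave as observed in this month of data; the function `flight_outcome` and the label names are hypothetical.

```python
import numpy as np
import pandas as pd

def flight_outcome(df):
    """Label each flight by how it ended, based on the status indicator fields."""
    conditions = [
        df['Cancelled'] == 1,
        (df['Diverted'] == 1) & (df['DivReachedDest'] == 1),
        df['Diverted'] == 1,
    ]
    labels = ['cancelled', 'diverted_reached_dest', 'diverted_elsewhere']
    # conditions are evaluated in order; flights matching none of them are 'completed'
    return pd.Series(np.select(conditions, labels, default='completed'), index=df.index)

# Toy rows: a normal flight, a cancelled one, and two diverted ones
toy = pd.DataFrame({
    'Cancelled':      [0, 1, 0, 0],
    'Diverted':       [0, 0, 1, 1],
    'DivReachedDest': [np.nan, np.nan, 1, 0],
})
print(flight_outcome(toy).tolist())
```

Counting these labels with `value_counts()` gives a quick cross-check against the NA counts in ArrTime and DepTime.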

In [None]:
# Investigating which flights have NA ArrivalDelayGroups
NA_ArrivalDelay_groups = flights_2014_1['ArrivalDelayGroups'].isna()
flight_cancelled_OR_diverted = (flights_2014_1['Cancelled'] == 1) | (flights_2014_1['Diverted'] == 1)
print('The number of flights with an NA value in ArrivalDelayGroups:', NA_ArrivalDelay_groups.sum())
print('Were all flights with an NA value in ArrivalDelayGroups cancelled or diverted?',
      flights_2014_1[NA_ArrivalDelay_groups & flight_cancelled_OR_diverted]['Flight_Number_Reporting_Airline'].count() ==
      flights_2014_1[NA_ArrivalDelay_groups]['Flight_Number_Reporting_Airline'].count())

All flights with an NA value in ArrivalDelayGroups were either cancelled or diverted.

### Flight Status and Reasons for Delay  

I’m going to keep the following fields because they contain information that is useful for interpreting the departure and arrival time fields and for calculating the actual elapsed time:

47.	**Cancelled**
    * **Description:** Cancelled Flight Indicator (1=Yes)  
    * **Data type:** boolean
    * **Keep**

48.	**CancellationCode**
    * **Description:** Specifies The Reason For Cancellation
    * **Data type:**  category
    * **Keep**

49.	**Diverted**
    * **Description:** Diverted Flight Indicator (1=Yes)  
    * **Data type:** boolean
    * **Keep**
    

I plan to keep the next fields to analyse the correlation between these reasons for delay and specific airlines, airports or states:

56.	**CarrierDelay**
    * **Description**: Carrier Delay, in Minutes  
    * **Data type:** float
    * **Keep**

57.	**WeatherDelay**  
    * **Description**: Weather Delay, in Minutes  
    * **Data type:** float
    * **Keep**

58.	**NASDelay** 
    * **Description**: National Air System Delay, in Minutes  
    * **Data type:** float
    * **Keep**

59.	**SecurityDelay**  
    * **Description**: Security Delay, in Minutes  
    * **Data type:** float
    * **Keep**

60.	**LateAircraftDelay**
    * **Description**: Late Aircraft Delay, in Minutes  
    * **Data type:** float
    * **Keep**

In [None]:
# Discovering the values and NA-values
flight_status_fields = ['Cancelled',
                        'CancellationCode',
                        'Diverted']
print(flights_2014_1[flight_status_fields].describe())
print(flights_2014_1[flight_status_fields].isna().sum())

In [None]:
# Which values does the Cancellation Code field have?
flights_2014_1['CancellationCode'].unique()

In [None]:
# Examining the presence of Cancellation Codes for all Cancelled flights
print(flights_2014_1.groupby(['Cancelled', 'Diverted'])['CancellationCode'].agg('count'))
no_cancellation_code = flights_2014_1['CancellationCode'].isna()
# compare the Cancelled column (not the string 'Cancelled') to find cancelled flights without a code
print('Number of records (flights) that were cancelled but have no Cancellation Code:',
      len(flights_2014_1[no_cancellation_code & (flights_2014_1['Cancelled'] == 1)]))

The conclusion is that the data represented in the Flight Status and Cancellation Code fields is accurate and comprehensive.

### Elapsed time  

The scheduled and actual Elapsed Time data are the primary candidates for the outcome of the proposed prediction model, or at least one of the main components for constructing such a variable. This is why it is important to keep this data. Air Time can also be useful because, as we have already seen, it helps with identifying flights that were cancelled after departure.

50.	**CRSElapsedTime**
    * **Description:** CRS Elapsed Time of Flight, in Minutes  
    * **Data type:** float
    * **Keep**

51.	**ActualElapsedTime**  
    * **Description:** Elapsed Time of Flight, in Minutes  
    * **Data type:** float
    * **Keep**

52.	**AirTime**  
    * **Description:** Flight Time, in Minutes  
    * **Data type:** float
    * **Keep**


The following fields can be dropped because they seem irrelevant or can be obtained from other fields:  

53.	**Flights**  
    * **Description:** Number of Flights  
    * **Drop**  

54.	**Distance**  
    * **Description:** Distance between airports (miles)  
    * **Drop**  

55.	**DistanceGroup**
    * **Description:**  Distance Intervals, every 250 Miles, for Flight Segment  
    * **Drop**
 
61.	**FirstDepTime**  
    * **Description:** First Gate Departure Time at Origin Airport  
    * **Drop**

62.	**TotalAddGTime**  
    * **Description:** Total Ground Time Away from Gate for Gate Return or Cancelled Flight  
    * **Drop**

63.	**LongestAddGTime**  
    * **Description:** Longest Time Away from Gate for Gate Return or Cancelled Flight  
    * **Drop**

64.	**DivAirportLandings**
    * **Description:** Number of Diverted Airport Landings  
    * **Drop**

As we already know, the next fields can be useful for evaluating the actual elapsed time when the flight was diverted, so we need to keep them:  

65.	**DivReachedDest**  
    * **Description:** Diverted Flight Reaching Scheduled Destination Indicator (1=Yes)  
    * **Data type:** boolean
    * **Keep**

66.	**DivActualElapsedTime**
    * **Description:** Elapsed Time of Diverted Flight Reaching Scheduled Destination, in Minutes. The ActualElapsedTime column remains NULL for all diverted flights.  
    * **Data type:** float
    * **Keep**

67.	**DivArrDelay**  
    * **Description:** Difference in minutes between scheduled and actual arrival time for a diverted flight reaching scheduled destination. The ArrDelay column remains NULL for all diverted flights.  
    * **Data type:** float
    * **Keep**

The distance in miles between the scheduled destination and the final diverted airport is unimportant for the proposed model: if the flight landed at a different location, it is reasonable for the purposes of the model to consider it as not having arrived at the destination.

68.	**DivDistance**
    * **Description:** Distance between scheduled destination and final diverted airport (miles). Value will be 0 for diverted flight reaching scheduled destination.  
    * **Drop**

In [None]:
# Checking for possible and NA values for Elapsed time data block
elapsed_time_fields = ['CRSElapsedTime',
                       'ActualElapsedTime',
                       'AirTime']
print(flights_2014_1[elapsed_time_fields].describe())
print(flights_2014_1[elapsed_time_fields].isna().sum())

In [None]:
# Checking for possible and NA values for Diverted flights data block
diverted_flights_fields = ['DivReachedDest',
                       'DivActualElapsedTime',
                       'DivArrDelay']
print(flights_2014_1[diverted_flights_fields].describe())
print(flights_2014_1[diverted_flights_fields].isna().sum())

### Other information about diverted flights

The rest of the dataset contains five identical blocks, one for each of up to five airports to which a flight can be diverted consecutively. Each block contains:
* Diverted Airport Code, Airport ID of Diverted Airport,  
* Airport Sequence ID of Diverted Airport,  
* Wheels On Time (local time: hhmm) at Diverted Airport Code,  
* Total Ground Time Away from Gate at Diverted Airport Code,  
* Longest Ground Time Away from Gate at Diverted Airport Code,  
* Wheels Off Time (local time: hhmm) at Diverted Airport Code,  
* Aircraft Tail Number for Diverted Airport Code

None of this information is relevant to the project.

### Conclusion about data structure and quality:  

1. The data needed for the project has been selected.
2. The proposed data type for each field has been selected.
3. The quality of the data is good, and the dataset is comprehensive.
4. The logic of the data for cancelled and diverted flights, and its consequences for actual departure and arrival times, delays and elapsed times, has been identified and recorded.
5. At the data transformation stage, I have to change the data types of the fields selected for the model and solve the 'time-2400' problem.
6. The calculation of the resulting elapsed time, which I plan to use as the predicted variable of the model, will be done later, at the feature engineering stage.
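One possible way to build such a resulting elapsed time, given that ActualElapsedTime stays NULL for diverted flights while DivActualElapsedTime is filled for diverted flights that reached the destination, is to combine the two columns. This is only a sketch under that assumption, not the final feature definition; the function name `resulting_elapsed_time` is hypothetical.

```python
import numpy as np
import pandas as pd

def resulting_elapsed_time(df):
    """Use ActualElapsedTime where present, otherwise fall back to DivActualElapsedTime."""
    return df['ActualElapsedTime'].fillna(df['DivActualElapsedTime'])

# Toy rows: a direct flight, a diverted flight that reached the destination, a cancelled flight
toy = pd.DataFrame({'ActualElapsedTime':    [120.0, np.nan, np.nan],
                    'DivActualElapsedTime': [np.nan, 310.0, np.nan]})
print(resulting_elapsed_time(toy).tolist())
```

Flights that never reached the destination keep NA in the result and would need separate treatment.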

## Data transformation

Let's begin with one of the 60 files to test the approach. After testing, we will use the same process for the rest of the files.

In [44]:
import numpy as np

data_types = {
    'Year':                 np.int16,               
    'Quarter':              np.int8,                
    'Month':                np.int8,                
    'DayofMonth':           np.int8,
    'DayOfWeek':            np.int8,
    'FlightDate':           'str',
    'Reporting_Airline':    'category',
    'Flight_Number_Reporting_Airline':  np.int16,
    'OriginAirportID':      'category',
    'Origin':               'category',
    'DestAirportID':        'category',
    'Dest':                 'category',
    'CRSDepTime':           np.int16,
    'DepTime':              np.float32,
    'DepDelay':             np.float32,
    'DepartureDelayGroups': 'category',
    'DepTimeBlk':           'category',
    'CRSArrTime':           np.int16,
    'ArrTime':              np.float32,
    'ArrDelay':             np.float32,
    'ArrivalDelayGroups':   'category',
    'ArrTimeBlk':           'category',
    'Cancelled':            np.int8,        # boolean
    'CancellationCode':     'category',
    'Diverted':             np.int8,        # boolean
    'CarrierDelay':         np.float32,
    'WeatherDelay':         np.float32,
    'NASDelay':             np.float32,
    'SecurityDelay':        np.float32,
    'LateAircraftDelay':    np.float32,
    'CRSElapsedTime':       np.float32,
    'ActualElapsedTime':    np.float32,
    'AirTime':              np.float32,
    'DivReachedDest':       np.float32,        # boolean
    'DivActualElapsedTime': np.float32,
    'DivArrDelay':          np.float32,}

In [35]:
def transform_data_from(zip_file, data_file, field_type=None):
    """Read one monthly CSV from the zip archive and apply basic type conversions."""
    with zipfile.ZipFile(zip_file) as zip_source:
        with zip_source.open(data_file) as file:
            if field_type is not None:
                # load only the selected columns with their compact data types
                df = pd.read_csv(file, header=0,
                                 usecols=list(field_type),
                                 dtype=field_type)
            else:
                df = pd.read_csv(file, header=0, low_memory=False)

    # Converting dates and booleans
    if 'FlightDate' in df.columns:
        df['FlightDate'] = pd.to_datetime(df['FlightDate'])
    if 'DivReachedDest' in df.columns:
        # missing values are treated as False
        df['DivReachedDest'] = df['DivReachedDest'].fillna(0).astype('bool')
    for flag in ('Cancelled', 'Diverted'):
        if flag in df.columns:
            df[flag] = df[flag].astype('bool')
    return df

In [46]:
source_zip = 'data/interim/csv_flight.zip'
source_data = 'csv_flight/report_2014_1.csv'

flights_2014_1_new = transform_data_from(source_zip, source_data, data_types)

In [None]:
pd.set_option('display.max_columns', 200)
flights_2014_1_new.head()

In [None]:
flights_2014_1_new.info()

In [None]:
# Just to see what was the original size of the dataset
flights_2014_1.info()

I converted the data into a new dataset. Eliminating redundant data and changing the data types reduced the memory usage by more than nine times.
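To illustrate where such savings come from, the sketch below compares a toy frame with default data types against the same frame with compact ones. The column values are made up, and the resulting factor depends on the data, so it is not meant to reproduce the figure above.

```python
import numpy as np
import pandas as pd

n = 100_000
# Toy frame with the default (wide) data types
raw = pd.DataFrame({
    'Month': np.ones(n, dtype='int64'),
    'Origin': ['ATL', 'JFK'] * (n // 2),
    'DepDelay': np.zeros(n, dtype='float64'),
})
# Same data with compact types, as in the data_types mapping
compact = raw.astype({'Month': 'int8', 'Origin': 'category', 'DepDelay': 'float32'})

raw_bytes = raw.memory_usage(deep=True).sum()
compact_bytes = compact.memory_usage(deep=True).sum()
print(f'Reduction factor: {raw_bytes / compact_bytes:.1f}x')
```

Most of the gain typically comes from the string column: a category stores each distinct value once plus a small integer code per row.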

In [50]:
# To free memory, I'm deleting the datasets used for exploring the data and testing the data transformation ideas
del flights_2014_1, flights_2014_1_new

In [None]:
source_zip = 'data/interim/csv_flight.zip'
source_path = 'csv_flight/report_'

# collect the monthly frames and concatenate once to avoid repeated copying
monthly_frames = []
for year in range(2014, 2019):
    for month in range(1, 13):
        source_data = f'{source_path}{year}_{month}.csv'
        print(source_data)
        monthly_frames.append(transform_data_from(source_zip, source_data, data_types))
flights = pd.concat(monthly_frames, ignore_index=True)
del monthly_frames

In [None]:
flights.info()

In [None]:
len(flights)

In [None]:
# Check the plausibility of elapsed times
flights[elapsed_time_fields].describe()

In [None]:
# Flights with a non-positive scheduled elapsed time
flights[flights['CRSElapsedTime'] <= 0]

In [None]:
# Inspect the same rows with all original fields for February 2018
test = transform_data_from(source_zip, source_path + '2018_2.csv')
test[test['CRSElapsedTime'] <= 0][DepFields + ArrFields + elapsed_time_fields]

In [None]:
# Inspect the same rows with all original fields for October 2018
test = transform_data_from(source_zip, source_path + '2018_10.csv')
test[test['CRSElapsedTime'] <= 0][DepFields + ArrFields + elapsed_time_fields]

In [None]:
flights.isna().any()

The most important variables, which characterise the date and time, the origin and the destination of the flight, don't have NA values, so nothing needs to be done with them at this stage.

The fields that I plan to use to engineer the predicted variable (such as delays and arrival times) still have NA values, but these values are not errors and have some logic behind them. For example, as we have seen above, the actual arrival time and delay are NA when the flight was cancelled. I have to take this into account when I engineer the predicted variable.

In [None]:
import matplotlib.pyplot as plt

bins = np.arange(1, 9) - 0.5
plt.hist(flights['DayOfWeek'], bins = bins)
plt.show()

We can see that Saturday has the fewest flights. On other days, the average airport load across the country appears to be similar.

In [None]:
bins = np.arange(1, 14) - 0.5
plt.hist(flights['Month'], bins = bins)
plt.show()

The least busy month is February, while the summer months have the highest number of flights with peak in July.

In [None]:
plt.hist(flights['Year'], bins = [2013.5, 2014.5, 2015.5, 2016.5, 2017.5, 2018.5])

The number of flights increased significantly in 2018.

Now that we have the whole dataset, with more than 30 million flights over a five-year period, let's check the data quality again.

In [None]:
flights[['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate']].describe()

All fields related to the date and time of the flights are correct and don't have outliers.

In [None]:
flights[['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek']].isna().sum()

In [None]:
flights['Year'].unique()

There are also no missing values in the fields related to the date and time of the flight.

We need to correct the actual Departure and Arrival Time fields (‘DepTime’ and ‘ArrTime’) because some entries show a time of 2400, which does not align with the scheduled time format (the maximum time is 2359). These should be adjusted to 00:00 of the next day.

In [114]:
# [TEMP] Backup copy of the dataset so the correction below can be rerun; delete before sending
flights_etalon = flights.copy()

In [115]:
flights = flights_etalon.copy()

In [None]:
# Correction of 2400 time (it works after the dtype of the datetime fields has been changed)

import datetime
time_2400_filter = (flights.ArrTime == 2400) | (flights.DepTime == 2400)
print('There were', time_2400_filter.sum(), 'rows having 2400 as Departure or Arrival Time')

def fix_time_2400(df, field_name):
    # Move rows with a 2400 time to 00:00 of the next day and update the date fields
    row_filter = (df[field_name] == 2400)
    for ind in df[row_filter].index:
        df.loc[ind, 'FlightDate'] += datetime.timedelta(days=1)
        df.loc[ind, 'Month'] = df.loc[ind, 'FlightDate'].month
        df.loc[ind, 'Quarter'] = (df.loc[ind, 'FlightDate'].month - 1) // 3 + 1
        df.loc[ind, 'DayofMonth'] = df.loc[ind, 'FlightDate'].day
        # isoweekday() returns 1 (Monday) to 7 (Sunday), matching 'DayOfWeek' in the dataset
        df.loc[ind, 'DayOfWeek'] = df.loc[ind, 'FlightDate'].isoweekday()
        df.loc[ind, 'Year'] = df.loc[ind, 'FlightDate'].year
        df.loc[ind, field_name] = 0
    return df

flights = fix_time_2400(flights, 'ArrTime')
flights = fix_time_2400(flights, 'DepTime')

# Recompute the filter; the old one would report the count from before the correction
time_2400_filter = (flights.ArrTime == 2400) | (flights.DepTime == 2400)
print('There are now', time_2400_filter.sum(), 'rows having 2400 as Departure or Arrival Time')
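
The row-by-row loop above can be slow on a dataset of this size. Below is a minimal vectorized sketch of the same correction, assuming 'FlightDate' has a datetime dtype and 'DayOfWeek' is coded from 1 (Monday) to 7 (Sunday). The function name `fix_time_2400_vectorized` and the demo table are hypothetical and are not part of the project code.

```python
import pandas as pd

def fix_time_2400_vectorized(df, field_name):
    """Roll rows whose `field_name` equals 2400 to 00:00 of the next day."""
    df = df.copy()
    mask = df[field_name] == 2400
    df.loc[mask, 'FlightDate'] = df.loc[mask, 'FlightDate'] + pd.Timedelta(days=1)
    shifted = df.loc[mask, 'FlightDate']
    df.loc[mask, 'Year'] = shifted.dt.year
    df.loc[mask, 'Month'] = shifted.dt.month
    df.loc[mask, 'DayofMonth'] = shifted.dt.day
    # dt.dayofweek is 0 (Monday) to 6 (Sunday); the dataset uses 1 to 7
    df.loc[mask, 'DayOfWeek'] = shifted.dt.dayofweek + 1
    df.loc[mask, field_name] = 0
    return df

# Made-up rows: one arrival recorded as 2400 on New Year's Eve, one ordinary arrival
demo = pd.DataFrame({
    'FlightDate': pd.to_datetime(['2018-12-31', '2018-06-15']),
    'Year': [2018, 2018],
    'Month': [12, 6],
    'DayofMonth': [31, 15],
    'DayOfWeek': [1, 5],
    'ArrTime': [2400, 1315],
})
fixed = fix_time_2400_vectorized(demo, 'ArrTime')
print(fixed)
```

The first demo row moves to 1 January 2019 with an arrival time of 0, while the second row is left untouched.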

In [117]:
# Convert clock times from HHMM format to minutes since midnight
def change_time_format(name):
    time_array = np.array(flights[name])
    flights[name] = (time_array // 100) * 60 + (time_array % 100)

for fld_name in ['ArrTime', 'DepTime', 'CRSArrTime', 'CRSDepTime']:
    change_time_format(fld_name)
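
For reference, the conversion above turns a clock time in HHMM format into minutes since midnight. Here is a minimal sketch of the same arithmetic for single values; the helper names `hhmm_to_minutes` and `minutes_to_hhmm` are hypothetical.

```python
def hhmm_to_minutes(hhmm):
    """Convert a clock time such as 2350 (23:50) to minutes since midnight."""
    hours, minutes = divmod(int(hhmm), 100)
    return hours * 60 + minutes

def minutes_to_hhmm(total_minutes):
    """Inverse conversion, useful when comparing with the raw files."""
    hours, minutes = divmod(int(total_minutes), 60)
    return hours * 100 + minutes

for raw in (2350, 222, 0):
    print(raw, '->', hhmm_to_minutes(raw))
```

After this conversion the time fields can be subtracted directly, but they can no longer be read as clock times without converting back.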

In [118]:
import pickle

with open('data/processed/processed_flights.pickle', 'wb') as output_file:
    pickle.dump(flights, output_file, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# [TEMP] Check whether scheduled arrival minus scheduled departure matches the scheduled elapsed time
wrong_CRSET_filter = (flights['CRSArrTime'] - flights['CRSDepTime'] - flights['CRSElapsedTime']) != 0
a = flights[wrong_CRSET_filter][['CRSDepTime', 'CRSArrTime', 'CRSElapsedTime']]
a.CRSArrTime - a.CRSDepTime - a.CRSElapsedTime

In [None]:
flights[['ArrTime', 'DepTime', 'CRSArrTime', 'CRSDepTime']].describe()

In [None]:
flights.info()

In [None]:
flights[['ArrTime', 'DepTime', 'CRSArrTime', 'CRSDepTime']].isna().sum()

Let's examine the scheduled and actual elapsed times, because I plan to use them in building a predictive model.

In [None]:
flights[['CRSElapsedTime', 'ActualElapsedTime']].describe()

In [124]:
# Analysis of difference between 'ActualElapsedTime' and 'CRSElapsedTime'
flights['ElapsedTimeDiff'] = flights['ActualElapsedTime'] - flights['CRSElapsedTime']

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(18)
gauss_distr = rng.normal(flights['ElapsedTimeDiff'].mean(),
                        flights['ElapsedTimeDiff'].std(),
                        size=1000000)

bins = np.arange(-50,50)

plt.hist(flights['ElapsedTimeDiff'], bins=bins, density=True)
plt.hist(gauss_distr, histtype='step', bins=bins, color='r', density=True)
plt.xlim(-50,50)
plt.show()


In [None]:
print('Mean difference between scheduled and actual time', round(flights['ElapsedTimeDiff'].mean(), 2))
print('Standard deviation of the difference', round(flights['ElapsedTimeDiff'].std(), 2))

The difference between the scheduled and actual elapsed time is not normally distributed (although it is very close). On average, the actual elapsed time is less than the scheduled one (-4.73 min). However, this is the duration of the flight, not the arrival time.
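
Besides comparing histograms by eye, closeness to a normal distribution can be summarized with skewness and excess kurtosis, both of which are near zero for normal data. The sketch below runs on synthetic samples only. The mean of -4.73 comes from the result above, while the standard deviation of 12 and the gamma parameters are made-up values for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(18)

# Synthetic stand-ins for 'ElapsedTimeDiff' (not the real data)
normal_like = pd.Series(rng.normal(loc=-4.73, scale=12.0, size=200_000))
right_skewed = pd.Series(rng.gamma(shape=4.0, scale=6.0, size=200_000) - 28.73)

for label, sample in [('normal', normal_like), ('right-skewed', right_skewed)]:
    # Series.skew() and Series.kurt() return sample skewness and excess kurtosis
    print(label, 'skewness:', round(sample.skew(), 2),
          'excess kurtosis:', round(sample.kurt(), 2))
```

The same two calls can be applied to flights['ElapsedTimeDiff'] to quantify how far the real distribution is from normal.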

It is interesting to examine the most negative values of this difference: how much shorter than scheduled can the actual duration of a flight be?

In [None]:
flights['ElapsedTimeDiff'].describe()

Interesting! Some flights lasted 250 minutes less than scheduled. Is it possible that these data are wrong?

How many flights have an elapsed time difference that is an outlier?
To answer this question, let's first calculate the relative difference of elapsed time, because on long-distance flights the absolute difference in minutes can be higher.

In [128]:
# Calculating the relative difference of elapsed time
flights['RelElapTimeDiff'] = flights['ElapsedTimeDiff'] / flights['CRSElapsedTime']


In [None]:
print('Mean relative difference between scheduled and actual time (%)', round(flights['RelElapTimeDiff'].mean() * 100, 2))
print('Standard deviation of the relative difference (%)', round(flights['RelElapTimeDiff'].std() * 100, 2))

In [None]:
rng = np.random.default_rng(18)
gauss_distr = rng.normal(flights['RelElapTimeDiff'].mean(),
                        flights['RelElapTimeDiff'].std(),
                        size=1000000)

bins = np.arange(-1, 1, 0.01)

plt.hist(flights['RelElapTimeDiff'], bins=bins, density=True)
plt.hist(gauss_distr, histtype='step', bins=bins, color='r', density=True)
plt.xlim(-0.5, 0.5)
plt.show()

In [None]:

# Lower outlier limit by the 1.5 * IQR rule (expected to be negative here)
q25, q50, q75 = flights['RelElapTimeDiff'].quantile([0.25, 0.5, 0.75])
outlier_limit = q25 - (q75 - q25) * 1.5
print('All flights whose actual elapsed time is shorter than scheduled by more than',
      round(-outlier_limit * 100, 2), 'percent are definitely outliers')
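
The limit above follows the usual 1.5 * IQR rule for the lower tail. A small worked sketch on made-up relative differences, using only the standard library (the 'inclusive' method of `statistics.quantiles` interpolates linearly between data points), looks like this. The helper name `lower_outlier_limit` is hypothetical.

```python
from statistics import quantiles

def lower_outlier_limit(values, k=1.5):
    """Lower fence of the k * IQR rule: Q1 - k * (Q3 - Q1)."""
    q25, _, q75 = quantiles(values, n=4, method='inclusive')
    return q25 - k * (q75 - q25)

# Made-up relative elapsed time differences (illustration only)
rel_diff = [-0.30, -0.10, -0.08, -0.05, -0.04, -0.02, 0.00, 0.03, 0.06]

limit = lower_outlier_limit(rel_diff)
outliers = [value for value in rel_diff if value < limit]
print('Lower limit:', round(limit, 2), 'outliers:', outliers)
```

Only values below the lower fence are flagged, which corresponds to flights that were much faster than scheduled.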

In [None]:
flights[elapsed_time_fields].describe()

In [None]:
# Filtering flights whose actual elapsed time was too short relative to the scheduled one
actual_time_less_than_CRRS = flights['RelElapTimeDiff'] < outlier_limit

# Printing the 10 fastest flights
SuperFastFlights = flights[actual_time_less_than_CRRS]
print(SuperFastFlights[['Origin', 
                        'Dest',
                        'ActualElapsedTime', 
                        'CRSElapsedTime', 
                        'ElapsedTimeDiff',
                        'RelElapTimeDiff']]
      .sort_values('RelElapTimeDiff', ascending=True).head(10))

# Printing a summary for these flights
print('In 2014-2018 there were', 
      SuperFastFlights['Flight_Number_Reporting_Airline'].count(), 
      'flights whose actual elapsed time was shorter than scheduled by more than',
      round(-outlier_limit * 100, 2), '% (', 
      round(SuperFastFlights['Flight_Number_Reporting_Airline'].count() / len(flights) * 100, 2),
      '% of total flights)')

In [None]:
# Details for the flight at position 28530440, which has the most significant difference between
# scheduled and actual elapsed times
data_details = ['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate']
route_details = ['Origin', 'Dest']
DepArr_details = ['CRSDepTime', 'CRSArrTime', 'DepTime', 'ArrTime']

print(flights.iloc[28530440][data_details + route_details + DepArr_details + elapsed_time_fields])
print('Flight number:', flights.iloc[28530440]['Flight_Number_Reporting_Airline'])

It seems strange that this flight was scheduled with 302 min of elapsed time (more than 5 hours!). A Google [search](https://www.google.com/maps/dir/SBP,+Airport+Drive,+San+Luis+Obispo,+CA,+USA/San+Francisco+International+Airport+(SFO),+San+Francisco,+CA,+USA/@36.4210897,-122.8306609,8z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x80ecf6bf3876b9f1:0xf486acd07a0f3bd2!2m2!1d-120.6420121!2d35.2375699!1m5!1m1!1s0x808f778c55555555:0xa4f25c571acded3f!2m2!1d-122.3816274!2d37.6191145!3e4?entry=ttu&g_ep=EgoyMDI0MTAwMi4xIKXMDSoASAFQAw%3D%3D) shows that these airports are quite close to each other and estimates the direct flight time at 1h 5min, which is much closer to the actual elapsed time.

In [None]:
# Details for the flight at position 21568647, which has the second most significant difference between
# scheduled and actual elapsed times
print(flights.iloc[21568647][data_details + route_details + DepArr_details + elapsed_time_fields])
print('Flight number:', flights.iloc[21568647]['Flight_Number_Reporting_Airline'])


The flight with the second largest difference has the same story: a Google [search](https://www.google.com/maps/dir/LaGuardia+Airport+(LGA),+East+Elmhurst,+NY,+USA/Dallas+Fort+Worth+International+Airport+(DFW),+2400+Aviation+Dr,+Dallas,+TX+75261,+United+States/@36.3681193,-96.0279203,5z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x89c25f8983424db5:0x772fc4660e9666b3!2m2!1d-73.8742467!2d40.7766422!1m5!1m1!1s0x864c2a660d222aa7:0x73323f5e067d201c!2m2!1d-97.0335765!2d32.8990434!3e4?entry=ttu&g_ep=EgoyMDI0MTAwMi4xIKXMDSoASAFQAw%3D%3D) shows an estimated direct flight time of 3h 40m, which is again closer to the actual elapsed time and far from the scheduled 431 min (more than 7 hours!).

In [None]:
# load the source file to access some details that were deleted in the resulting dataset
_fl = transform_data_from(source_zip, source_path + '2017_10.csv')

# filtering the flight of interest 
ind = _fl[(_fl['Flight_Number_Reporting_Airline'] == 5028) 
    & (_fl['DayofMonth'] == 9) 
    & (_fl['Origin'] == 'SBP')
    & (_fl['Dest'] == 'SFO')].index
# ind holds index labels, so take the first match with .loc
flight_top1 = _fl.loc[ind[0]]

# Printing results 
pd.set_option('display.max_rows', 115)
print(flight_top1[:70])

In [None]:
# load the source file to access some details that were deleted in the resulting dataset
_fl = transform_data_from(source_zip, source_path + '2018_7.csv')

# filtering the flight of interest 
ind = _fl[(_fl['Flight_Number_Reporting_Airline'] == 5935) 
    & (_fl['DayofMonth'] == 25) 
    & (_fl['Origin'] == 'DFW')
    & (_fl['Dest'] == 'LGA')].index
# ind holds index labels, so take the first match with .loc
flight_top2 = _fl.loc[ind[0]]

# Printing results 
pd.set_option('display.max_rows', 115)
print(flight_top2[:70])

These two flights were quite ordinary: they were not cancelled or diverted, they consisted of just one flight (field 'Flights'), and, as we know from the Google searches above, their actual elapsed time seems reasonable, but their CRSElapsedTime doesn't.

Could the CRS Elapsed Time have been calculated incorrectly due to timezone differences?
Let's check.

In [1]:
# [TEMP][LOAD] Reload the processed dataset saved earlier

import os
import pickle
import pandas as pd
os.chdir('/Users/a.kholodov/Documents/02. Personal/20. Education/50. Universities/Springboard/Springboard_git/Springboard _repo/CS2-flights-delay-REPO')

with open('data/processed/processed_flights.pickle', 'rb') as source_file:
    flights = pickle.load(source_file)

In [2]:
from datetime import datetime, timedelta, timezone
from dateutil import tz

# Loading timezones for IATA codes of airports
IATAtz_df = pd.read_csv('https://raw.githubusercontent.com/hroptatyr/dateutils/tzmaps/iata.tzmap', 
                        sep = '\t', 
                        index_col=0, 
                        header=None)
IATAtz = IATAtz_df.to_dict('dict')[1]

# Adding timezone to the local time
def airport_time_tz(dt, IATA_code: str):
    return dt.replace(tzinfo=tz.gettz(IATAtz[IATA_code]))

# Converting timezone time to UTC
def airport_time_UTC(dt, IATA_code: str):
    return dt.replace(tzinfo=tz.gettz(IATAtz[IATA_code])).astimezone(tz.UTC)

In [3]:
airports_IATACodes = set(flights['Dest'].unique())
airports_IATACodes = airports_IATACodes.union(set(flights['Origin']))
print('There are', len(airports_IATACodes), 'unique airports in the dataset (both in Origin and Destination)')

There are 372 unique airports in the dataset (both in Origin and Destination)


In [4]:
[ap_code for ap_code in airports_IATACodes if ap_code not in IATAtz.keys()]

[]

All airports are present in our IATA timezones dictionary.

In [None]:
# Compare the scheduled (CRS) elapsed time with the one derived from UTC times
def time_chech(list_of_flights):
    for a in list_of_flights:
        CRSDepTime = datetime(a.Year, a.Month, a.DayofMonth, a.CRSDepTime // 100, a.CRSDepTime % 100)
        CRSArrTime = datetime(a.Year, a.Month, a.DayofMonth, a.CRSArrTime // 100, a.CRSArrTime % 100)

        print('Flight #', a.Flight_Number_Reporting_Airline)
        print('CRS: ')
        print(a.Origin, '\nTime zone: ', tz.gettz(IATAtz[a.Origin]))
        print('Local departure time:      ', CRSDepTime)
        print('Local departure time (tz): ', airport_time_tz(CRSDepTime, a.Origin))
        print('UTC departure time:        ', airport_time_UTC(CRSDepTime, a.Origin))
        print('')
        print(a.Dest, '\nTime zone: ', tz.gettz(IATAtz[a.Dest]))
        print('Local arrival time:        ', CRSArrTime)
        print('Local arrival time (tz):   ', airport_time_tz(CRSArrTime, a.Dest))
        print('UTC arrival time:          ', airport_time_UTC(CRSArrTime, a.Dest))
        print('')
        print('Elapsed time (CRS):', round(a.CRSElapsedTime / 60, 2))
        # .seconds wraps modulo 24 hours, so arrivals after midnight are handled as well
        print('Elapsed time (UTC):', round((airport_time_UTC(CRSArrTime, a.Dest) - airport_time_UTC(CRSDepTime, a.Origin)).seconds / 3600, 2))
        print('\n')

time_chech([flight_top1, flight_top2])



As we can see, there is no error caused by calculating the elapsed time incorrectly for the different time zones of Origin and Destination, so the data for the flights with the earliest arrival are consistent.

Let's check the correctness of the elapsed times for all flights that were not cancelled or diverted.

In [5]:
# Check for correctness of CRS ET (elapsed times) for all records (rows)
from datetime import datetime

# filtering only flights that were not cancelled or diverted
not_canceled_or_diverted = (flights['Cancelled'] != 1) & (flights['Diverted'] != 1)
ordinary_flights = flights[not_canceled_or_diverted]
print('Number of not cancelled or diverted flights', len(ordinary_flights))

Number of not cancelled or diverted flights 29588947


In [6]:
del(flights)

In [7]:
# Rename 'DayofMonth' to 'Day' for pd.to_datetime
ordinary_flights = ordinary_flights.rename(columns= {'DayofMonth': 'Day'})

In [8]:

# Create temporary field 'Minute' from 'CRSDepTime' to use with pd.to_datetime
ordinary_flights['Minute'] = ordinary_flights['CRSDepTime']
ordinary_flights['CRSDepDT'] = pd.to_datetime(ordinary_flights[['Year', 'Month', 'Day', 'Minute']])

In [9]:
ordinary_flights['CRSDepDT'] = [row.CRSDepDT.tz_localize(tz.gettz(IATAtz[row.Origin]), ambiguous=True).
                                astimezone(tz.UTC) for row in ordinary_flights[['CRSDepDT', 'Origin']].itertuples()] 

In [11]:
# Create temporary field 'Minute' from 'CRSArrTime' to use with pd.to_datetime
ordinary_flights['Minute'] = ordinary_flights['CRSArrTime']
ordinary_flights['CRSArrDT'] = pd.to_datetime(ordinary_flights[['Year', 'Month', 'Day', 'Minute']])

In [22]:
ordinary_flights['CRSArrDT'] = [row.CRSArrDT.tz_localize(tz.gettz(IATAtz[row.Dest]), ambiguous=True, 
                                                         nonexistent = 'NaT') #.astimezone(tz.UTC) 
                                for row in ordinary_flights[['CRSArrDT', 'Dest']].itertuples()]

In [28]:
wrong_DST_filter = ordinary_flights['CRSArrDT'].isna()
print('There are', wrong_DST_filter.sum(), 'flights with wrong spring DST')

There are 41 flights with wrong spring DST


OK, there are 41 flights with a wrong spring DST time. I have saved them for possible future analysis. Let's try shifting all these flights forward, converting them to UTC, and then checking the difference between the scheduled and actual elapsed time.

In [32]:
# At first, restore the 'CRSArrDT'
ordinary_flights['Minute'] = ordinary_flights['CRSArrTime']
ordinary_flights['CRSArrDT'] = pd.to_datetime(ordinary_flights[['Year', 'Month', 'Day', 'Minute']])

# Taking only the flights with incorrect DST
flights_with_wrong_DST = ordinary_flights[wrong_DST_filter].copy()

# Shifting forward and converting to UTC
flights_with_wrong_DST['CRSArrDT'] = [row.CRSArrDT.tz_localize(tz.gettz(IATAtz[row.Dest]), ambiguous=True, 
                                                         nonexistent = 'shift_forward').astimezone(tz.UTC) 
                                for row in flights_with_wrong_DST[['CRSArrDT', 'Dest']].itertuples()]

In [53]:
flights_with_wrong_DST[['Flight_Number_Reporting_Airline', 'CRSDepTime', 'CRSArrTime', 'CRSDepDT', 'CRSArrDT', 'CRSElapsedTime']].head(10)

|  | Flight_Number_Reporting_Airline | CRSDepTime | CRSArrTime | CRSDepDT | CRSArrDT | CRSElapsedTime |
|---|---|---|---|---|---|---|
| 988959 | 121 | 1430 | 142 | 2014-03-10 06:50:00+00:00 | 2014-03-09 12:00:00+00:00 | 212.0 |
| 1310801 | 414 | 1295 | 120 | 2014-03-10 04:35:00+00:00 | 2014-03-09 08:00:00+00:00 | 145.0 |
| 6874992 | 121 | 1435 | 147 | 2015-03-09 06:55:00+00:00 | 2015-03-08 12:00:00+00:00 | 212.0 |
| 6874996 | 127 | 1410 | 128 | 2015-03-09 06:30:00+00:00 | 2015-03-08 12:00:00+00:00 | 218.0 |
| 7186735 | 855 | 1305 | 129 | 2015-03-09 04:45:00+00:00 | 2015-03-08 08:00:00+00:00 | 144.0 |
| 12650587 | 1248 | 1423 | 174 | 2016-03-14 04:43:00+00:00 | 2016-03-13 07:00:00+00:00 | 131.0 |
| 12667064 | 2394 | 1420 | 161 | 2016-03-14 04:40:00+00:00 | 2016-03-13 07:00:00+00:00 | 121.0 |
| 12668617 | 2490 | 1429 | 133 | 2016-03-14 03:49:00+00:00 | 2016-03-13 08:00:00+00:00 | 204.0 |
| 12670641 | 2724 | 1435 | 126 | 2016-03-14 04:55:00+00:00 | 2016-03-13 10:00:00+00:00 | 251.0 |
| 12670665 | 2727 | 1430 | 132 | 2016-03-14 04:50:00+00:00 | 2016-03-13 08:00:00+00:00 | 142.0 |


In [37]:
import zipfile
source_zip = 'data/interim/csv_flight.zip'
source_path = 'csv_flight/report_'

test = transform_data_from(source_zip, source_path + '2014_3.csv')

In [56]:
Date_details = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate']
DepTime_details = ['CRSDepTime', 'DepTime']
ArrTime_details = ['CRSArrTime', 'ArrTime']
ElapsedTime_details = ['CRSElapsedTime', 'ActualElapsedTime']
CRS_details = ['CRSDepTime', 'CRSArrTime', 'CRSElapsedTime']
Route_datails = ['Origin', 'Dest']
test[(test['Flight_Number_Reporting_Airline'] == 121) & 
     (test['DayofMonth'] == 9)][Date_details + 
     CRS_details +
     Route_datails]

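|  | Year | Month | DayofMonth | DayOfWeek | FlightDate | CRSDepTime | CRSArrTime | CRSElapsedTime | Origin | Dest |
|---|---|---|---|---|---|---|---|---|---|---|
| 86408 | 2014 | 3 | 9 | 7 | 2014-03-09 | 2350 | 222 | 212.0 | SEA | ANC |
| 106306 | 2014 | 3 | 9 | 7 | 2014-03-09 | 843 | 1203 | 200.0 | BOS | PBI |
| 264710 | 2014 | 3 | 9 | 7 | 2014-03-09 | 850 | 940 | 50.0 | ITO | HNL |
| 464191 | 2014 | 3 | 9 | 7 | 2014-03-09 | 850 | 1100 | 130.0 | STL | SAT |
| 465257 | 2014 | 3 | 9 | 7 | 2014-03-09 | 720 | 825 | 65.0 | MDW | STL |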

In [None]:

# Create temporary field 'Minute' from 'DepTime' to use with pd.to_datetime
ordinary_flights['Minute'] = ordinary_flights['DepTime']
ordinary_flights['DepDT'] = pd.to_datetime(ordinary_flights[['Year', 'Month', 'Day', 'Minute']])

# Create temporary field 'Minute' from 'ArrTime' to use with pd.to_datetime
ordinary_flights['Minute'] = ordinary_flights['ArrTime']
ordinary_flights['ArrDT'] = pd.to_datetime(ordinary_flights[['Year', 'Month', 'Day', 'Minute']])

In [None]:
# Difference (in hours) between the reported elapsed time and the one derived from UTC times.
# Note: airport_time_UTC expects naive local datetimes, and row-by-row .loc access
# is slow on a dataset of this size.
def utc_elapsed_hours(row, dep_field, arr_field):
    delta = airport_time_UTC(row[arr_field], row['Dest']) - airport_time_UTC(row[dep_field], row['Origin'])
    return round(delta.seconds / 3600, 2)

for ind in ordinary_flights.index:
    row = ordinary_flights.loc[ind]
    ordinary_flights.loc[ind, 'CRSET_Diff'] = round(row['CRSElapsedTime'] / 60, 2) - \
                                              utc_elapsed_hours(row, 'CRSDepDT', 'CRSArrDT')
    ordinary_flights.loc[ind, 'ActET_Diff'] = round(row['ActualElapsedTime'] / 60, 2) - \
                                              utc_elapsed_hours(row, 'DepDT', 'ArrDT')
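
Since a Python loop over almost 30 million rows is very slow, here is a hedged sketch of a vectorized alternative. Instead of localizing the arrival time, it converts the scheduled departure to UTC, adds CRSElapsedTime, converts the result to the destination time zone, and compares the expected local arrival with CRSArrTime. The function name `expected_arrival_minutes`, the small `tz_by_airport` dictionary (a stand-in for IATAtz), and the demo rows are assumptions for illustration. Times are in minutes since midnight.

```python
import pandas as pd

# Hypothetical airport -> time zone mapping (stands in for the IATAtz dictionary)
tz_by_airport = {'SEA': 'America/Los_Angeles', 'ANC': 'America/Anchorage',
                 'STL': 'America/Chicago', 'SAT': 'America/Chicago'}

# Two scheduled flights on 2014-03-09, with times as minutes since midnight
demo = pd.DataFrame({
    'FlightDate': pd.to_datetime(['2014-03-09', '2014-03-09']),
    'Origin': ['SEA', 'STL'],
    'Dest': ['ANC', 'SAT'],
    'CRSDepTime': [1430, 530],
    'CRSArrTime': [142, 660],
    'CRSElapsedTime': [212.0, 130.0],
})

def expected_arrival_minutes(df, tz_by_airport):
    """Local arrival time (minutes since midnight) implied by departure + elapsed time."""
    dep_local = df['FlightDate'] + pd.to_timedelta(df['CRSDepTime'], unit='min')
    elapsed = pd.to_timedelta(df['CRSElapsedTime'], unit='min')
    origin_tz = df['Origin'].map(tz_by_airport)
    dest_tz = df['Dest'].map(tz_by_airport)

    # Localize departures one origin time zone at a time, then convert to UTC
    dep_utc = pd.concat([
        dep_local[origin_tz == name]
        .dt.tz_localize(name, ambiguous='NaT', nonexistent='shift_forward')
        .dt.tz_convert('UTC')
        for name in origin_tz.unique()
    ]).reindex(df.index)
    arr_utc = dep_utc + elapsed

    # Convert the expected arrivals to destination local time, one zone at a time
    result = pd.Series(index=df.index, dtype='float64')
    for name in dest_tz.unique():
        mask = dest_tz == name
        arr_local = arr_utc[mask].dt.tz_convert(name)
        result[mask] = arr_local.dt.hour * 60 + arr_local.dt.minute
    return result

# A zero difference means CRSArrTime agrees with departure time plus elapsed time
demo['CRSArr_check'] = expected_arrival_minutes(demo, tz_by_airport) - demo['CRSArrTime']
print(demo[['Origin', 'Dest', 'CRSArrTime', 'CRSArr_check']])
```

Because only the departure time is localized, arrival times that fall into the skipped spring DST hour never have to be interpreted, and the loops run over time zones rather than over flights.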

In [None]:
ordinary_flights[:10]