### Table **flights_test**

This table consists of a subset of the columns from the flights table. It represents flights from January 2020, which will be used for evaluation. It therefore omits features that cannot be known before the flight lands.

Variables:

- **fl_date**: Flight Date (yyyy-mm-dd)
- **mkt_unique_carrier**: Unique Marketing Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
- **branded_code_share**: Reporting Carrier Operated or Branded Code Share Partners
- **mkt_carrier**: Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
- **mkt_carrier_fl_num**: Flight Number
- **op_unique_carrier**: Unique Scheduled Operating Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
- **tail_num**: Tail Number
- **op_carrier_fl_num**: Flight Number
- **origin_airport_id**: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- **origin**: Origin Airport
- **origin_city_name**: Origin Airport, City Name
- **dest_airport_id**: Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- **dest**: Destination Airport
- **dest_city_name**: Destination Airport, City Name
- **crs_dep_time**: CRS Departure Time (local time: hhmm)
- **crs_arr_time**: CRS Arrival Time (local time: hhmm)
- **dup**: Duplicate flag marked Y if the flight is swapped based on Form-3A data
- **crs_elapsed_time**: CRS Elapsed Time of Flight, in Minutes
- **flights**: Number of Flights
- **distance**: Distance between airports (miles)
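
The `hhmm` fields are usually read by pandas as integers, so a value such as `830` stands for 08:30. A minimal sketch of converting such values to time objects follows. The helper name `hhmm_to_time` is hypothetical, and treating `2400` as midnight is an assumption rather than something the data description states.

```python
import pandas as pd

def hhmm_to_time(series: pd.Series) -> pd.Series:
    """Convert hhmm values such as 830 or '1409' to datetime.time objects."""
    # Pad to four digits so that 830 becomes '0830' and 5 becomes '0005'.
    padded = series.astype(str).str.zfill(4)
    # Assumption: '2400' means midnight; '%H%M' only accepts hours 00-23.
    padded = padded.replace("2400", "0000")
    return pd.to_datetime(padded, format="%H%M").dt.time

times = hhmm_to_time(pd.Series([830, 1409, 5, 2400]))
print(times.tolist())
```

Without the `2400` replacement, `pd.to_datetime` would raise an error on that value, because the `%H` directive does not accept hour 24.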

The following script restructures the test data so that it matches the training dataset used to predict arrival delay.

The expected outcome is a dataframe with the following features:

- **fl_date**: Flight Date (yyyy-mm-dd)
- **mkt_unique_carrier**: Unique Marketing Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
- **mkt_carrier_fl_num**: Flight Number
- **origin_airport_id**: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- **route**: Origin-Dest Airport
- **dest_airport_id**: Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- **crs_dep_time**: CRS Departure Time (local time: hhmm)
- **crs_arr_time**: CRS Arrival Time (local time: hhmm)
- **crs_elapsed_time**: CRS Elapsed Time of Flight, in Minutes
- **distance**: Distance between airports (miles)
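
One way to confirm that a formatted dataframe carries the expected features is to compare its column names against a reference list. The sketch below uses a hypothetical helper, `compare_columns`, and toy column lists chosen only for illustration; it is not part of the original script.

```python
import pandas as pd

def compare_columns(df: pd.DataFrame, expected: list) -> tuple:
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = set(df.columns)
    wanted = set(expected)
    # Columns that should be present but are not.
    missing = [c for c in expected if c not in actual]
    # Columns that are present but were not asked for.
    unexpected = [c for c in df.columns if c not in wanted]
    return missing, unexpected

# Toy frame with no rows; only the column names matter here.
sample = pd.DataFrame(columns=["fl_date", "route", "flights"])
missing, unexpected = compare_columns(sample, ["fl_date", "route", "distance"])
print("missing:", missing)
print("unexpected:", unexpected)
```

Running the same check against the training dataframe's columns would flag any feature that the test formatting drops or leaves behind.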

In [1]:
import pandas as pd

def test_data_format(file_name):
    """
    Read the test CSV file `file_name` and return it as a dataframe
    formatted to match the training features.
    """
    try:
        df = pd.read_csv(file_name)
    except FileNotFoundError:
        # report the problem, then re-raise instead of returning an undefined df
        print("File name does not exist or file is not in the same directory")
        raise

    # a list of features to drop
    drop_list = ['branded_code_share','mkt_carrier','op_unique_carrier','tail_num',
                 'origin_city_name','dest_city_name','op_carrier_fl_num','dup','flights']
    df = df.drop(columns=drop_list)

    # combine origin & dest as route (e.g. DTW-SFO), keeping the column position of origin
    df['origin'] = df['origin']+'-'+df['dest']
    df = df.drop(columns=['dest'])
    df = df.rename(columns={'origin':'route'})

    # convert fl_date into date format (values are read as epoch milliseconds)
    df['fl_date'] = pd.to_datetime(df['fl_date'], unit='ms').dt.date

    # convert crs_dep_time, crs_arr_time into time type
    time_features = ['crs_dep_time','crs_arr_time']
    for col in time_features:
        # convert into string and pad to 4 digits, so 830 becomes '0830'
        padded = df[col].astype(str).str.zfill(4)
        # '%H%M' only accepts hours 00-23, so replace every 2400 with 0000
        padded = padded.replace('2400', '0000')
        df[col] = pd.to_datetime(padded, format='%H%M').dt.time

    return df
    

In [2]:
name = 'flight_test.csv'

In [3]:
df = test_data_format(name)

In [4]:
df

| Unnamed: 0 | fl_date | mkt_unique_carrier | mkt_carrier_fl_num | origin_airport_id | route | dest_airport_id | crs_dep_time | crs_arr_time | crs_elapsed_time | distance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-15 | DL | 745 | 11433 | DTW-SFO | 14771 | 08:30:00 | 11:09:00 | 339 | 2079 |
| 1 | 2020-01-04 | DL | 1089 | 11433 | DTW-ORD | 13930 | 12:25:00 | 12:53:00 | 88 | 235 |
| 2 | 2020-01-04 | UA | 5234 | 11540 | ELP-DEN | 11292 | 06:35:00 | 08:40:00 | 125 | 563 |
| 3 | 2020-01-09 | AA | 4756 | 10361 | ART-PHL | 14100 | 14:37:00 | 15:49:00 | 72 | 287 |
| 4 | 2020-01-11 | AA | 2047 | 15376 | TUS-DFW | 11298 | 10:45:00 | 13:52:00 | 127 | 813 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 2020-01-09 | DL | 2160 | 11433 | DTW-TPA | 15304 | 14:09:00 | 16:54:00 | 165 | 983 |
| 9996 | 2020-01-08 | DL | 3865 | 12889 | LAS-LGB | 12954 | 13:15:00 | 14:33:00 | 78 | 231 |
| 9997 | 2020-01-01 | HA | 108 | 12173 | HNL-KOA | 12758 | 05:10:00 | 05:55:00 | 45 | 163 |
| 9998 | 2020-01-05 | WN | 3855 | 10693 | BNA-BWI | 10821 | 19:45:00 | 22:20:00 | 95 | 587 |
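
The `Unnamed: 0` column in the output is what pandas produces when a CSV that was written together with its index is read back without `index_col`. That is probably what happened with the test file, although how the file was created is not shown here. A small sketch with made-up data illustrates the effect and one way to avoid it:

```python
import io
import pandas as pd

# Toy frame standing in for the real data.
original = pd.DataFrame({"distance": [2079, 235]})

# to_csv writes the index by default, as a first column with an empty header.
csv_text = original.to_csv()

# Reading it back naively turns that column into 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Passing index_col=0 restores it as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

If the extra column is not wanted as a feature, reading the file with `index_col=0` or dropping `Unnamed: 0` after loading would remove it.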
