### Description

**Airport**: The IATA airport code of the selected airport.  
**Terminal**: Terminal(s) associated with the selected airport.  
**Date**: Arrival date(s) of flights associated with the airport/terminal/date range selected.  
**Hour**: Hour(s) of arriving flights during a 24-hour period for the airport/terminal/date range selected.  
**U.S. Citizen Wait Times**:

- Average: The average wait time for all U.S. Citizen passengers on flights arriving in the one hour increment.
- Max: The highest wait time for all U.S. Citizen passengers on a flight arriving in the one hour increment.

**Non U.S. Citizen Wait Times**:

- Average: The average wait time for all Non U.S. Citizen passengers on flights arriving in the one hour increment.
- Max: The highest wait time for all Non U.S. Citizen passengers on a flight arriving in the one hour increment.

**Wait Times**:

- Average: The average wait time for all passengers on flights arriving in the one hour increment.
- Max: The highest wait time for any passenger on a flight arriving in the one hour increment.

**Number of Passengers Time Interval**:

- 0-15: The actual number of passengers on flights which arrived in the selected hour who were processed in (0-15) minutes.
- 16-30: The actual number of passengers on flights which arrived in the selected hour who were processed in (16-30) minutes.
- 31-45: The actual number of passengers on flights which arrived in the selected hour who were processed in (31-45) minutes.
- 46-60: The actual number of passengers on flights which arrived in the selected hour who were processed in (46-60) minutes.
- 61-90: The actual number of passengers on flights which arrived in the selected hour who were processed in (61-90) minutes.
- 91-120: The actual number of passengers on flights which arrived in the selected hour who were processed in (91-120) minutes.
- 121 Plus: The actual number of passengers on flights which arrived in the selected hour who were processed in (121 Plus) minutes.

**Excluded**: The actual number of passengers on flights which arrived in the selected hour who were excluded from wait time reporting.  
**Total**: Total number of passengers aboard arriving flights during the one hour increment.  
**Flights**: Number of flights arriving during the time period.  
**Booths**: Number of staffed primary inspection booths open to process flights which arrived during the selected hour.  


### Problems
- read the dataset and convert the column headers to the correct format (the first 4 rows are headers, but not all rows contain column names)
- check the datatype of each column
- find out which columns have NA values
- check the unique values in the first two columns
- based on the above, are these two columns useful?   
  
  
- check whether all hour values are present
- drop the upper bound of the hour column
  Ex: replace 0100 - 0200 with 0100 ...
- check whether the hour values lie in the range 0-2300
- convert the hour column from string type to datetime type (if this doesn't work out easily, solve the next problem first)
- check that the length of each hour value is 4; if not, find out why, fix it, and check again   
  
  
- convert Date column to datetime type
- check data types now
- replace the Hour column with only the time part and the Date column with only the date part (both should be converted to datetime type before this)
- check data types again
- merge date and hour column into a new column date_time (type of date_time should be datetime not str)
                                                          

In [50]:
#read dataset and convert column headers to correct format (the first 4 rows are headers, but not all rows contain column names)
import pandas as pd
df=pd.read_excel('AWT.xls',header=[0,1,2,3])
col_list=[]
for col in df.columns:
    col=list(col)
    col_list.append(" ".join(filter(lambda x:not x.startswith('Unnamed') and not x.startswith('All') 
                                    and not x.startswith('Number'),col)))
df.columns=col_list
print(df.columns)
df.head()
    

Index(['Airport', 'Terminal', 'Date', 'Hour', 'U.S. Citizen Average Wait Time',
       'U.S. Citizen Max Wait Time', 'Non U.S. Citizen Average Wait Time',
       'Non U.S. Citizen Max Wait Time', 'Wait Times Average Wait Time',
       'Wait Times Max Wait Time', '0-15', '16-30', '31-45', '46-60', '61-90',
       '91-120', '120 plus', 'Excluded', 'Total', 'Flights', 'Booths'],
      dtype='object')


| | Airport | Terminal | Date | Hour | U.S. Citizen Average Wait Time | U.S. Citizen Max Wait Time | Non U.S. Citizen Average Wait Time | Non U.S. Citizen Max Wait Time | Wait Times Average Wait Time | Wait Times Max Wait Time | ... | 16-30 | 31-45 | 46-60 | 61-90 | 91-120 | 120 plus | Excluded | Total | Flights | Booths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ORD | Terminal 5 | 2017-01-01 00:00:00 | 0300 - 0400 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 7 | 1 | 0 |
| 1 | ORD | Terminal 5 | 2017-01-01 00:00:00 | 0400 - 0500 | 18 | 51 | 28 | 51 | 20 | 51 | ... | 160 | 60 | 8 | 0 | 0 | 0 | 13 | 362 | 3 | 10 |
| 2 | ORD | Terminal 5 | 2017-01-01 00:00:00 | 0600 - 0700 | 11 | 49 | 27 | 49 | 20 | 49 | ... | 81 | 64 | 19 | 0 | 0 | 0 | 10 | 328 | 1 | 10 |
| 3 | ORD | Terminal 5 | 2017-01-01 00:00:00 | 0700 - 0800 | 7 | 24 | 12 | 26 | 10 | 26 | ... | 88 | 0 | 0 | 0 | 0 | 0 | 10 | 370 | 2 | 12 |
| 4 | ORD | Terminal 5 | 2017-01-01 00:00:00 | 0800 - 0900 | 5 | 25 | 13 | 37 | 10 | 37 | ... | 121 | 16 | 0 | 0 | 0 | 0 | 11 | 446 | 2 | 12 |

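The header-flattening step can be tried without the spreadsheet. The sketch below uses a made-up two-level header (the real file has four levels) and a hypothetical helper named `flatten`; it is only meant to illustrate joining the header levels while skipping placeholder labels such as `Unnamed: ...`.

```python
import pandas as pd

# Toy two-level header imitating the spreadsheet layout (made-up data).
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Airport"),
    ("U.S. Citizen", "Average Wait Time"),
    ("U.S. Citizen", "Max Wait Time"),
    ("Number Of Passengers Time Interval", "0-15"),
])
toy = pd.DataFrame([["ORD", 18, 51, 121]], columns=columns)


def flatten(levels, skip_prefixes=("Unnamed", "Number")):
    """Join the header levels, dropping placeholder or grouping labels."""
    kept = [str(part) for part in levels if not str(part).startswith(skip_prefixes)]
    return " ".join(kept)


toy.columns = [flatten(col) for col in toy.columns]
flat_columns = list(toy.columns)
print(flat_columns)
```

Passing a tuple to `str.startswith` checks all prefixes in one call, which keeps the filter shorter than chaining several `not x.startswith(...)` conditions.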

In [51]:
# check datatype of each column
df.dtypes

Airport                               object
Terminal                              object
Date                                  object
Hour                                  object
U.S. Citizen Average Wait Time         int64
U.S. Citizen Max Wait Time             int64
Non U.S. Citizen Average Wait Time     int64
Non U.S. Citizen Max Wait Time         int64
Wait Times Average Wait Time           int64
Wait Times Max Wait Time               int64
0-15                                   int64
16-30                                  int64
31-45                                  int64
46-60                                  int64
61-90                                  int64
91-120                                 int64
120 plus                               int64
Excluded                               int64
Total                                  int64
Flights                                int64
Booths                                 int64
dtype: object

In [4]:
#find out columns which have na values
df.isna().any()

Airport                               False
Terminal                              False
Date                                  False
Hour                                  False
U.S. Citizen Average Wait Time        False
U.S. Citizen Max Wait Time            False
Non U.S. Citizen Average Wait Time    False
Non U.S. Citizen Max Wait Time        False
Wait Times Average Wait Time          False
Wait Times Max Wait Time              False
0-15                                  False
16-30                                 False
31-45                                 False
46-60                                 False
61-90                                 False
91-120                                False
120 plus                              False
Excluded                              False
Total                                 False
Flights                               False
Booths                                False
dtype: bool
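With 21 columns, a list of only the affected column names can be easier to read than the full boolean Series. A minimal sketch on a made-up frame (the real data has no NA values, so the gap here is invented for illustration):

```python
import pandas as pd

# Made-up frame with one missing value in "Total".
toy = pd.DataFrame({
    "Total": [7, None, 328],
    "Flights": [1, 3, 1],
})

# Boolean mask per column, then keep only the column names that are True.
na_columns = toy.columns[toy.isna().any()].tolist()

# Number of missing values per column, for a quick size check.
na_counts = toy.isna().sum()

print(na_columns)
print(na_counts)
```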

In [7]:
# check the unique values in the first two columns
# based on above, are these two columns useful?
df.Airport.unique()


array(['ORD'], dtype=object)

In [8]:
df.Terminal.unique()

array(['Terminal 5'], dtype=object)
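Both columns hold a single value, so they carry no information that distinguishes one row from another. One way to find such constant columns in a single pass is `nunique()`; the sketch below uses a small made-up frame, and dropping the columns is a design choice rather than something the exercise requires.

```python
import pandas as pd

# Made-up frame mirroring the first columns of the dataset.
toy = pd.DataFrame({
    "Airport": ["ORD", "ORD", "ORD"],
    "Terminal": ["Terminal 5", "Terminal 5", "Terminal 5"],
    "Hour": ["0300", "0400", "0600"],
})

# Count distinct values per column and keep the columns with exactly one.
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)

# Optionally drop them if they are not needed downstream.
trimmed = toy.drop(columns=constant_columns)
```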

In [9]:
# check if all hour values are present or not

df.Hour.unique()

array(['0300 - 0400', '0400 - 0500', '0600 - 0700', '0700 - 0800',
       '0800 - 0900', '0900 - 1000', '1000 - 1100', '1100 - 1200',
       '1200 - 1300', '1300 - 1400', '1400 - 1500', '1500 - 1600',
       '1600 - 1700', '1700 - 1800', '1800 - 1900', '1900 - 2000',
       '2000 - 2100', '2100 - 2200', '2200 - 2300', '2300 - 0000',
       '0000 - 0100', '0100 - 0200', '0500 - 0600', '0200 - 0300'],
      dtype=object)
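Reading 24 labels by eye is error-prone, so the presence check can also be done by comparing against the full set of expected labels. The sketch below copies the labels from the output above; `missing_hours` is a hypothetical helper name.

```python
# Labels copied from the df.Hour.unique() output.
observed = [
    "0300 - 0400", "0400 - 0500", "0600 - 0700", "0700 - 0800",
    "0800 - 0900", "0900 - 1000", "1000 - 1100", "1100 - 1200",
    "1200 - 1300", "1300 - 1400", "1400 - 1500", "1500 - 1600",
    "1600 - 1700", "1700 - 1800", "1800 - 1900", "1900 - 2000",
    "2000 - 2100", "2100 - 2200", "2200 - 2300", "2300 - 0000",
    "0000 - 0100", "0100 - 0200", "0500 - 0600", "0200 - 0300",
]


def missing_hours(labels):
    """Return the expected hourly labels that do not appear in `labels`."""
    expected = [f"{h:02d}00 - {(h + 1) % 24:02d}00" for h in range(24)]
    seen = set(labels)
    return [label for label in expected if label not in seen]


print(missing_hours(observed))       # every hour is present
print(missing_hours(observed[1:]))   # drop one label to see a reported gap
```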

In [52]:
# drop the upper bound of hour column Ex: replace 0100 - 0200 with 0100 ...
df.Hour=df.Hour.str[0:4].astype('str')
df.Hour

0       0300
1       0400
2       0600
3       0700
4       0800
        ... 
6699    1800
6700    1900
6701    2000
6702    2100
6703    2300
Name: Hour, Length: 6704, dtype: object

In [53]:
# check that hour values lie in range 0-2300 or not
h=pd.to_numeric(df["Hour"])
h.between(0,2300).all()

True

In [12]:
# check that length of each hour value is 4, if not, check why and fix it and then check again
(df.Hour.str.len()==4).all()

False
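A `False` here means at least one `Hour` value is not exactly four characters long. The sketch below shows one way to inspect the offending values and one possible fix. It uses made-up values and assumes the cause is dropped leading zeros, which the notebook does not confirm; if the values still contain the full `0300 - 0400` range text, slicing with `.str[0:4]` is the fix instead.

```python
import pandas as pd

# Made-up hour strings; "400" and "0" stand in for values that lost leading zeros.
hours = pd.Series(["0300", "400", "0", "2300"])

# How many values there are of each length.
length_counts = hours.str.len().value_counts()
print(length_counts)

# The values that fail the check.
offenders = hours[hours.str.len() != 4]
print(offenders.tolist())

# Assumed fix: left-pad with zeros to four characters, then re-check.
fixed = hours.str.zfill(4)
all_four = bool((fixed.str.len() == 4).all())
print(all_four)
```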

In [54]:
# convert hour column from string type to datetime type. (If this doesn't work out easily, then first solve next problem)
df.Hour=pd.to_datetime(df.Hour, format='%H%M')
df.Hour.dt.time

0       03:00:00
1       04:00:00
2       06:00:00
3       07:00:00
4       08:00:00
          ...   
6699    18:00:00
6700    19:00:00
6701    20:00:00
6702    21:00:00
6703    23:00:00
Name: Hour, Length: 6704, dtype: object
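Note that parsing with `format='%H%M'` supplies no date, so pandas fills in 1900-01-01 for every row; only the time part is meaningful. A small sketch with made-up values:

```python
import datetime

import pandas as pd

hours = pd.Series(["0300", "2300"])

# Parsing a time-only format fills the date part with 1900-01-01.
parsed = pd.to_datetime(hours, format="%H%M")
print(parsed)

# .dt.time keeps just the time-of-day as datetime.time objects.
times = parsed.dt.time
print(times.tolist())
```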

In [70]:
# convert Date column to datetime type
#df.Date=pd.to_datetime(pd.Series(df.Date),format="%Y-%m-%d")
df.Date=pd.to_datetime(df.Date)
# awt=df
# a1=awt["Date"].apply(lambda x:pd.to_datetime(x) if isinstance(x,str) else pd.to_datetime(x,format='%Y-%d-%m%H:%M:%S'))
# awt["Date"]=a1
# awt.dtypes


In [71]:
# check data types now
df.dtypes

Airport                                       object
Terminal                                      object
Date                                  datetime64[ns]
Hour                                  datetime64[ns]
U.S. Citizen Average Wait Time                 int64
U.S. Citizen Max Wait Time                     int64
Non U.S. Citizen Average Wait Time             int64
Non U.S. Citizen Max Wait Time                 int64
Wait Times Average Wait Time                   int64
Wait Times Max Wait Time                       int64
0-15                                           int64
16-30                                          int64
31-45                                          int64
46-60                                          int64
61-90                                          int64
91-120                                         int64
120 plus                                       int64
Excluded                                       int64
Total                                         

In [72]:
# replace Hours with only time part and Date column with only date part (both should be converted to datetime type before this)

df.Date=df.Date.dt.date
df.Hour=df.Hour.dt.time

In [73]:
# check data types again
df.dtypes

Airport                               object
Terminal                              object
Date                                  object
Hour                                  object
U.S. Citizen Average Wait Time         int64
U.S. Citizen Max Wait Time             int64
Non U.S. Citizen Average Wait Time     int64
Non U.S. Citizen Max Wait Time         int64
Wait Times Average Wait Time           int64
Wait Times Max Wait Time               int64
0-15                                   int64
16-30                                  int64
31-45                                  int64
46-60                                  int64
61-90                                  int64
91-120                                 int64
120 plus                               int64
Excluded                               int64
Total                                  int64
Flights                                int64
Booths                                 int64
dtype: object
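`Date` and `Hour` show as `object` again because `.dt.date` and `.dt.time` return plain Python `datetime.date` and `datetime.time` objects, which pandas stores in an object column. If a datetime dtype is wanted for the date, `.dt.normalize()` is one alternative that sets the time to midnight instead. A sketch on made-up timestamps:

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2017-01-01 03:00:00", "2017-01-02 04:00:00"]))

# .dt.date yields datetime.date objects, so the dtype falls back to object.
as_dates = stamps.dt.date
print(as_dates.dtype)

# .dt.normalize() keeps a datetime dtype and resets the time to 00:00:00.
normalized = stamps.dt.normalize()
print(normalized.dtype)
```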

In [85]:
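# merge date and hour column into a new column date_time (type of date_time should be datetime not str)
# Date now holds datetime.date objects and Hour holds datetime.time objects, so df['Date'] + ' ' + df['Hour'] fails with:
# TypeError: unsupported operand type(s) for +: 'datetime.date' and 'str'
# Casting both columns to str first gives text like '2017-01-01 03:00:00' that pd.to_datetime can parse.
df['date_time'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Hour'].astype(str), format="%Y-%m-%d %H:%M:%S")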
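One way to build `date_time` without adding `datetime.date` objects to strings is to keep `Date` as a datetime column and add the hour as a timedelta. The sketch below uses a made-up frame and assumes `Hour` is still a four-character `HHMM` string and `Date` is still `datetime64`, that is, the state before the `.dt.date` / `.dt.time` step.

```python
import pandas as pd

# Made-up frame: Date as datetime64, Hour as 'HHMM' strings.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-01"]),
    "Hour": ["0300", "2300"],
})

# Take the first two characters as the hour and turn them into a timedelta.
hour_offset = pd.to_timedelta(toy["Hour"].str[:2].astype(int), unit="h")

# datetime64 + timedelta64 stays a datetime column, so no string parsing is needed.
toy["date_time"] = toy["Date"] + hour_offset
print(toy["date_time"])
```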