# ABOUT THE PROJECT

The purpose of this project is to predict the arrival delay of flights based on selected flight characteristics.

The last piece of information available about a particular flight is its departure delay. We will build a model using this feature and the features that can be obtained before it.

From a business perspective, this lets us predict a delay before the aircraft leaves the ground. The average time between departure (the moment the aircraft leaves the gate) and takeoff is about 16 minutes. The predicted arrival delay could be useful, for example, to air traffic controllers, airports, crew members, or passengers.

Besides this introduction, the analysis of arrival delay is divided into 4 parts:
* data preprocessing,
* data exploration,
* feature engineering,
* model building.


### 1. Data preprocessing
In this part we will convert the data into a usable format that will allow us to take the next steps. We will also create normalized departure and arrival times that let us sort flights by the moment of occurrence.

### 2. Data exploration
In the second stage we will visualize data and examine basic statistics in order to determine which features can be useful in predicting arrival delay.

### 3. Feature engineering
We will create new features based mainly on aggregates of recent flights and on the average delay grouped by different features.
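The two kinds of features mentioned above can be sketched as follows. This is only an illustration, not necessarily the approach used later: the sample values are the airline codes and arrival delays of the five flights printed by `data.head()` in this notebook, and the column names `AIRLINE_MEAN_DELAY` and `AIRLINE_PREV_MEAN_DELAY` are hypothetical.

```python
import pandas as pd

# Airline and arrival delay of the five flights shown by data.head()
flights = pd.DataFrame({
    "AIRLINE": ["AS", "AA", "US", "AA", "AS"],
    "ARRIVAL_DELAY": [-22.0, -9.0, 5.0, -9.0, -21.0],
})

# Average delay grouped by a feature, broadcast back to every row
flights["AIRLINE_MEAN_DELAY"] = (
    flights.groupby("AIRLINE")["ARRIVAL_DELAY"].transform("mean")
)

# Aggregate of earlier flights only: shift() drops the current flight,
# so each row sees the mean of the preceding flights of the same airline
flights["AIRLINE_PREV_MEAN_DELAY"] = (
    flights.groupby("AIRLINE")["ARRIVAL_DELAY"]
    .transform(lambda s: s.shift().expanding().mean())
)

print(flights)
```

The shifted variant assumes the rows are already sorted chronologically. In a real setting one would also have to make sure that the earlier flights have actually landed before the current flight departs; otherwise their arrival delay would not be known at prediction time.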

### 4. Model building
The purpose of data modelling is to predict arrival delays as accurately as possible. To achieve that we will build a LightGBM model. This part also includes hyperparameter optimization and feature selection.

# DATA
Though the data contains flights from the whole of 2015, only the first 4 months will be used due to the computational complexity.

The data used in this project can be found on the following website: https://www.kaggle.com/usdot/flight-delays

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.options.display.max_columns = 100

We will be mainly interested in the flights dataset, though the airports dataset will also be useful.

In [3]:
data = pd.read_csv('flights.csv')
airports = pd.read_csv('airports.csv')

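When pandas reads a large CSV it infers column types chunk by chunk, which can leave a column with mixed types and produce a `DtypeWarning`. A minimal sketch of one way to avoid this, assuming the airport code columns should always be read as strings; the inline CSV and the numeric code `10135` are made-up stand-ins for `flights.csv`:

```python
import io

import pandas as pd

# Made-up stand-in for flights.csv: one airport code looks like text,
# another looks like a number
csv_text = """MONTH,ORIGIN_AIRPORT,DESTINATION_AIRPORT
1,ANC,SEA
1,10135,LAX
"""

# Declaring the dtype up front keeps both columns as text for every row
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ORIGIN_AIRPORT": str, "DESTINATION_AIRPORT": str},
)

print(sample["ORIGIN_AIRPORT"].tolist())
```

The same `dtype` argument can be passed when reading the real file by path.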


# FLIGHTS DATASET

In [4]:
data.shape

(5819079, 31)

The data is composed of over 5,800,000 flights and 31 columns. However, we will only consider the first 4 months.

In [5]:
data = data[data.MONTH <= 4]
data.shape

(1888622, 31)

The data from January to April consists of nearly 1,900,000 rows and 31 columns.

In [6]:
data.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,2354.0,-11.0,21.0,15.0,205.0,194.0,169.0,1448,404.0,4.0,430,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,2.0,-8.0,12.0,14.0,280.0,279.0,263.0,2330,737.0,4.0,750,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,18.0,-2.0,16.0,34.0,286.0,293.0,266.0,2296,800.0,11.0,806,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,15.0,-5.0,15.0,30.0,285.0,281.0,258.0,2342,748.0,8.0,805,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,24.0,-1.0,11.0,35.0,235.0,215.0,199.0,1448,254.0,5.0,320,259.0,-21.0,0,0,,,,,,


In [7]:
data.nunique()

YEAR                      1
MONTH                     4
DAY                      31
DAY_OF_WEEK               7
AIRLINE                  14
FLIGHT_NUMBER          6592
TAIL_NUMBER            4601
ORIGIN_AIRPORT          315
DESTINATION_AIRPORT     315
SCHEDULED_DEPARTURE    1258
DEPARTURE_TIME         1438
DEPARTURE_DELAY         957
TAXI_OUT                179
WHEELS_OFF             1437
SCHEDULED_TIME          486
ELAPSED_TIME            704
AIR_TIME                667
DISTANCE               1288
WHEELS_ON              1440
TAXI_IN                 180
SCHEDULED_ARRIVAL      1371
ARRIVAL_TIME           1440
ARRIVAL_DELAY           984
DIVERTED                  2
CANCELLED                 2
CANCELLATION_REASON       4
AIR_SYSTEM_DELAY        469
SECURITY_DELAY          105
AIRLINE_DELAY           781
LATE_AIRCRAFT_DELAY     563
WEATHER_DELAY           491
dtype: int64

# DESCRIPTION OF FLIGHTS FEATURES
* YEAR - only 2015
* MONTH - from January to April (1 to 4)
* DAY
* DAY_OF_WEEK
* AIRLINE - 14 different airlines
* FLIGHT_NUMBER - the number identifying the flight, stored as an integer. It contains over 6,500 unique values.
* TAIL_NUMBER - aircraft identification number. It contains about 4,600 unique values.
* ORIGIN_AIRPORT - starting airport of a flight
* DESTINATION_AIRPORT - airport of arrival
* SCHEDULED_DEPARTURE - the planned departure time in the format of HHMM, without any separators
* DEPARTURE_TIME - the actual time of departure from the gate in the format of HHMM
* DEPARTURE_DELAY - the delay at the moment of departure
* TAXI_OUT - the time elapsed between departure time and takeoff
* WHEELS_OFF - the time of takeoff in the format of HHMM
* SCHEDULED_TIME - the expected duration of the flight
* ELAPSED_TIME - the actual duration of the flight (from departure time to arrival time)
* AIR_TIME - the actual time spent in the air (from wheels off to wheels on)
* DISTANCE - the distance between airports
* WHEELS_ON - the time when the aircraft touches the ground in the format of HHMM
* TAXI_IN - the time elapsed between wheels on and arrival time
* SCHEDULED_ARRIVAL - the planned arrival time in the format of HHMM
* ARRIVAL_TIME - the actual time of arrival in the format of HHMM
* ARRIVAL_DELAY - the target of our predictions
* DIVERTED - whether a flight was diverted or not
* CANCELLED - whether a flight was cancelled or not
* CANCELLATION_REASON - there are 4 possible reasons: A - caused by airline/carrier; B - caused by weather; C - caused by National Air System; D - caused by security
* AIR_SYSTEM_DELAY - amount of delay caused by air system
* SECURITY_DELAY - amount of delay caused by security
* AIRLINE_DELAY - amount of delay caused by airline
* LATE_AIRCRAFT_DELAY - amount of delay caused by a late aircraft
* WEATHER_DELAY - amount of delay caused by weather
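The relationships between these columns can be checked on the five flights printed by `data.head()`. The sketch below is an illustration only: the helper names `hhmm_to_minutes` and `wrapped_difference` are hypothetical, and the wrap-around step assumes that no delay exceeds 12 hours in either direction.

```python
import pandas as pd

# Values copied from the five rows shown by data.head()
sample = pd.DataFrame({
    "SCHEDULED_DEPARTURE": [5, 10, 20, 20, 25],
    "DEPARTURE_TIME": [2354.0, 2.0, 18.0, 15.0, 24.0],
    "DEPARTURE_DELAY": [-11.0, -8.0, -2.0, -5.0, -1.0],
    "SCHEDULED_ARRIVAL": [430, 750, 806, 805, 320],
    "ARRIVAL_TIME": [408.0, 741.0, 811.0, 756.0, 259.0],
    "ARRIVAL_DELAY": [-22.0, -9.0, 5.0, -9.0, -21.0],
    "TAXI_OUT": [21.0, 12.0, 16.0, 15.0, 11.0],
    "AIR_TIME": [169.0, 263.0, 266.0, 258.0, 199.0],
    "TAXI_IN": [4.0, 4.0, 11.0, 8.0, 5.0],
    "ELAPSED_TIME": [194.0, 279.0, 293.0, 281.0, 215.0],
})


def hhmm_to_minutes(hhmm):
    """Convert HHMM clock values (e.g. 430 means 04:30) to minutes since midnight."""
    return (hhmm // 100) * 60 + hhmm % 100


def wrapped_difference(actual, scheduled):
    """Difference in minutes, wrapped into [-720, 720) to handle midnight."""
    diff = hhmm_to_minutes(actual) - hhmm_to_minutes(scheduled)
    return (diff + 720) % 1440 - 720


dep_delay = wrapped_difference(sample["DEPARTURE_TIME"], sample["SCHEDULED_DEPARTURE"])
arr_delay = wrapped_difference(sample["ARRIVAL_TIME"], sample["SCHEDULED_ARRIVAL"])
elapsed = sample["TAXI_OUT"] + sample["AIR_TIME"] + sample["TAXI_IN"]

print((dep_delay == sample["DEPARTURE_DELAY"]).all())
print((arr_delay == sample["ARRIVAL_DELAY"]).all())
print((elapsed == sample["ELAPSED_TIME"]).all())
```

The first flight shows why the wrap-around matters: it was scheduled for 00:05 but left the gate at 23:54 on the previous day, so a plain subtraction of clock times would give 1429 minutes instead of -11.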

# AIRPORTS DATASET

In [8]:
airports.shape

(322, 7)

There are 322 unique airports.

In [9]:
airports.head()

Unnamed: 0,IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE
0,ABE,Lehigh Valley International Airport,Allentown,PA,USA,40.65236,-75.4404
1,ABI,Abilene Regional Airport,Abilene,TX,USA,32.41132,-99.6819
2,ABQ,Albuquerque International Sunport,Albuquerque,NM,USA,35.04022,-106.60919
3,ABR,Aberdeen Regional Airport,Aberdeen,SD,USA,45.44906,-98.42183
4,ABY,Southwest Georgia Regional Airport,Albany,GA,USA,31.53552,-84.19447


In [10]:
airports.nunique()

IATA_CODE    322
AIRPORT      322
CITY         308
STATE         54
COUNTRY        1
LATITUDE     319
LONGITUDE    319
dtype: int64

# DESCRIPTION OF AIRPORTS FEATURES
* IATA_CODE - the identifier of an airport. It is in the same format as ORIGIN_AIRPORT and DESTINATION_AIRPORT from flights.
* AIRPORT - the full name of an airport. Every airport has an associated IATA_CODE.
* CITY - the city of an airport. There is mostly one airport per city, but a few cities have more than one airport.
* STATE - the state of an airport
* COUNTRY - the only country in our dataset is the United States
* LATITUDE - the latitude of an airport. There are 3 airports without an assigned latitude.
* LONGITUDE - the longitude of an airport. There are again 3 airports without an assigned longitude.

In [11]:
airports[airports.LATITUDE.isna()]

Unnamed: 0,IATA_CODE,AIRPORT,CITY,STATE,COUNTRY,LATITUDE,LONGITUDE
96,ECP,Northwest Florida Beaches International Airport,Panama City,FL,USA,,
234,PBG,Plattsburgh International Airport,Plattsburgh,NY,USA,,
313,UST,Northeast Florida Regional Airport (St. August...,St. Augustine,FL,USA,,
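Because IATA_CODE shares its format with ORIGIN_AIRPORT and DESTINATION_AIRPORT, airport coordinates can be attached to flights with a left merge, and the airports without coordinates then show up as missing values. A minimal sketch under stated assumptions: the airport rows are taken from the outputs above, while the three routes and the `ORIGIN_LATITUDE` / `ORIGIN_LONGITUDE` column names are invented purely for illustration.

```python
import pandas as pd

# Two airports with coordinates and two without, as printed above
airports = pd.DataFrame({
    "IATA_CODE": ["ABE", "ABI", "ECP", "PBG"],
    "LATITUDE": [40.65236, 32.41132, None, None],
    "LONGITUDE": [-75.4404, -99.6819, None, None],
})

# Hypothetical routes, made up only to exercise the join
flights = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ABE", "ECP", "ABI"],
    "DESTINATION_AIRPORT": ["ABI", "ABE", "PBG"],
})

# Rename the airport columns so they line up with the origin side of a flight
origin_coords = airports.rename(columns={
    "IATA_CODE": "ORIGIN_AIRPORT",
    "LATITUDE": "ORIGIN_LATITUDE",
    "LONGITUDE": "ORIGIN_LONGITUDE",
})
flights = flights.merge(origin_coords, on="ORIGIN_AIRPORT", how="left")

# Flights that start or end at an airport with missing coordinates
missing_codes = set(airports.loc[airports["LATITUDE"].isna(), "IATA_CODE"])
affected = flights[
    flights["ORIGIN_AIRPORT"].isin(missing_codes)
    | flights["DESTINATION_AIRPORT"].isin(missing_codes)
]

print(flights)
print(len(affected))
```

A left merge keeps every flight, so the number of rows does not change; only the coordinate columns are empty where the airport has none.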
