Install R using Anaconda
https://docs.anaconda.com/anaconda/navigator/tutorials/create-r-environment/

## Ask the right Question

"Predict if a flight is on-time"
Need a statement to direct and validate the work 
Define end goal, starting point, and how to achieve the end goal

Solution Statement
	- Define scope (including data sources)
	- Define target performance
	- Define context for usage
	- Define how solution will be created
	
Scope and Data Sources
	- US flights only
	- Flights between US airports only
	- US Dept of Transport (DOT) database is a good source
	- Using US DOT data, predict if a flight would be on-time

Data
	- Preliminary data review
	- Delays tracked, not on-time
	- Using US DOT data, predict if a flight would be delayed

Performance Targets
	- Binary result (True or False)
	- Coin Flip = 50% Accuracy
	- 70% Accuracy is a common target
	- Using US DOT data, predict with 70+% accuracy if a flight would be delayed

Context 
	- DOT "delayed" => 15 or more minutes after schedule
	- Using US DOT data, predict with 70+% accuracy if a flight would arrive 15+ minutes after the scheduled arrival time
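
The delay definition and accuracy target above can be sketched in a few lines of base R. This is only an illustration: the delay values and predictions are made up rather than taken from the DOT file, and the names `arr_delay_min`, `predicted`, and `target_accuracy` are hypothetical.

```r
# Hypothetical arrival delays in minutes (negative means early); not real DOT data
arr_delay_min <- c(-5, 0, 14, 15, 42)

# Label per the statement above: 1 if the flight arrives 15+ minutes late, else 0
arr_del15 <- as.integer(arr_delay_min >= 15)

# Hypothetical model output, compared against the 70% accuracy target
predicted <- c(0L, 0L, 0L, 1L, 0L)
accuracy <- mean(predicted == arr_del15)
target_accuracy <- 0.70
meets_target <- accuracy >= target_accuracy
```

Here four of the five made-up predictions match the label, so `meets_target` is `TRUE`.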

ML Workflow
	- Process DOT data
	- Transform data as required
	- Use the Machine Learning Workflow to process and transform US DOT data to create a prediction model. This model must predict whether a flight would arrive 15+ minutes after the scheduled arrival time with 70+% accuracy.

Final Solution Statement
"Use the Machine Learning Workflow to process and transform US DOT data to create a prediction model. This model must predict whether a flight would arrive 15+ minutes after the scheduled arrival time with 70+% accuracy."


# Preparing the Data

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure:
	- Each variable is a column
	- Each observation is a row
	- Each type of observational unit is a table
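
To make the three rules concrete, here is a minimal base R sketch that reshapes a small "wide" table into tidy form. The table and its column names (`day`, `airport`, `delayed_flights`) are hypothetical and are not part of the DOT data.

```r
# Hypothetical "wide" table: one column per airport, so the airport
# variable is hidden in the column headers
wide <- data.frame(day = 1:2, JFK = c(3, 5), LAX = c(4, 1))

# Tidy form: each variable is a column, each observation (day + airport) is a row
tidy <- data.frame(
  day = rep(wide$day, times = 2),
  airport = rep(c("JFK", "LAX"), each = nrow(wide)),
  delayed_flights = c(wide$JFK, wide$LAX)
)
```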

Source: http://bit.ly/DOT_OnTime
	- Select Jan 2015 
	- Select DayOfWeek & DayOfMonth
	- Select Reporting_Airline, DOT_ID_Reporting_Airline, IATA_Code_Reporting_Airline, TailNum, FlightNum
	- Select OriginAirportID, OriginAirportSeqID, Origin
	- Select DestAirportID, DestAirportSeqID, Dest
	- Select DepTime, DepDel15, DepTimeBlk
	- Select ArrTime, ArrDel15
	- Select Cancelled, Diverted
	- Select Distance

## Load Data

In [1]:
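# Load the data CSV into an R data frame named origData
# Note: read.csv2 defaults to dec=",", so decimal values such as 0.00 load as strings (they are converted later)
origData <- read.csv2('C:\\Jupyter Notebooks\\datasets\\DOT\\jan_2015_ontime.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)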

In [2]:
# Now that we have the data frame loaded, let's check how many rows we have
nrow(origData)

In [3]:
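# Too much data for quick testing, so let's keep only certain main airports
# We do this by creating an R character vector of airport codes
# NOTE: 'CTL' looks like a typo for 'CLT' (Charlotte); it is kept as written because the outputs below were produced with it
airports <- c('ATL', 'LAX', 'ORD', 'DFW', 'JFK', 'SFO', 'CTL', 'LAS', 'PHX')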

In [4]:
# We do a subset function to only get the data from flights between these airports using a select operation
origData <- subset(origData, DEST %in% airports & ORIGIN %in% airports)

# Now check the number of rows again
nrow(origData)

## Prepare & Clean data

In [5]:
# Visually inspect the data using the head command to look at the first two rows
head(origData, 2)

DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN,...,DEST,DEP_TIME,DEP_DEL15,DEP_TIME_BLK,ARR_TIME,ARR_DEL15,CANCELLED,DIVERTED,DISTANCE,X
1,4,AA,19805,AA,N787AA,1,12478,1247802,JFK,...,LAX,855,0.0,0900-0959,1237,0.0,0.0,0.0,2475.0,
1,4,AA,19805,AA,N795AA,2,12892,1289203,LAX,...,JFK,856,0.0,0900-0959,1651,0.0,0.0,0.0,2475.0,


In [6]:
# It looks like the X column can be dropped as it only contains NA, but to confirm let's check the end of the data
tail(origData, 2)

,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN,...,DEST,DEP_TIME,DEP_DEL15,DEP_TIME_BLK,ARR_TIME,ARR_DEL15,CANCELLED,DIVERTED,DISTANCE,X
469666,31,6,WN,19393,WN,N659SW,3841,14771,1477101,SFO,...,PHX,1109,0.0,1100-1159,1417,0.0,0.0,0.0,651.0,
469667,31,6,WN,19393,WN,N218WN,4481,14771,1477101,SFO,...,PHX,1426,0.0,1400-1459,1721,0.0,0.0,0.0,651.0,


In [7]:
# Let's remove the X column by setting its value to NULL and check that it is gone
origData$X <- NULL
head(origData, 2)

DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN,...,DEST_AIRPORT_SEQ_ID,DEST,DEP_TIME,DEP_DEL15,DEP_TIME_BLK,ARR_TIME,ARR_DEL15,CANCELLED,DIVERTED,DISTANCE
1,4,AA,19805,AA,N787AA,1,12478,1247802,JFK,...,1289203,LAX,855,0.0,0900-0959,1237,0.0,0.0,0.0,2475.0
1,4,AA,19805,AA,N795AA,2,12892,1289203,LAX,...,1247802,JFK,856,0.0,0900-0959,1651,0.0,0.0,0.0,2475.0


In [8]:
# Check the origin and destination airport columns for correlated data to remove - the closer to 1, the more correlated
cor(origData[c("ORIGIN_AIRPORT_SEQ_ID", "ORIGIN_AIRPORT_ID")])
cor(origData[c("DEST_AIRPORT_SEQ_ID", "DEST_AIRPORT_ID")])

,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_AIRPORT_ID
ORIGIN_AIRPORT_SEQ_ID,1,1
ORIGIN_AIRPORT_ID,1,1


,DEST_AIRPORT_SEQ_ID,DEST_AIRPORT_ID
DEST_AIRPORT_SEQ_ID,1,1
DEST_AIRPORT_ID,1,1


In [9]:
# These are correlated, so we can drop the SEQ_ID for both as it is extraneous
origData$ORIGIN_AIRPORT_SEQ_ID <- NULL
origData$DEST_AIRPORT_SEQ_ID <- NULL

In [10]:
# The 'cor' function works for numeric columns but not for string columns
# We can check the carrier-related string columns by filtering for rows where they differ
mismatch <- origData[origData$OP_CARRIER != origData$OP_UNIQUE_CARRIER,]
nrow(mismatch)

In [11]:
# There are no mismatches, so we can drop one of the columns
origData$OP_UNIQUE_CARRIER <- NULL

In [12]:
# Check the changes we have made
head(origData, 2)

DAY_OF_MONTH,DAY_OF_WEEK,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DEP_TIME,DEP_DEL15,DEP_TIME_BLK,ARR_TIME,ARR_DEL15,CANCELLED,DIVERTED,DISTANCE
1,4,19805,AA,N787AA,1,12478,JFK,12892,LAX,855,0.0,0900-0959,1237,0.0,0.0,0.0,2475.0
1,4,19805,AA,N795AA,2,12892,LAX,12478,JFK,856,0.0,0900-0959,1651,0.0,0.0,0.0,2475.0


## Molding Data
- Dropping Rows
- Adjusting data types
- Creating new columns, if required

In [13]:
# ARR_DEL15 is 1 if the arrival was delayed and 0 if not; DEP_DEL15 works the same way for departures
# Remove all rows where either value is NA or "" and assign the result to a new data frame
onTimeData <- origData[!is.na(origData$ARR_DEL15) & origData$ARR_DEL15!="" & !is.na(origData$DEP_DEL15) & origData$DEP_DEL15!="",]

In [14]:
# Now let's compare the number of rows in the new and old data frames
nrow(origData)
nrow(onTimeData)
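
The row filter used above can be hard to read in one line. A minimal sketch of the same idea on a toy data frame follows; the `is_blank` helper and the toy values are hypothetical and only stand in for the real columns.

```r
# Toy rows (hypothetical) with the same kinds of gaps: NA and empty strings
toy <- data.frame(
  ARR_DEL15 = c("0.00", "", NA, "1.00", "1.00"),
  DEP_DEL15 = c("0.00", "1.00", "0.00", "", "1.00"),
  stringsAsFactors = FALSE
)

# TRUE where a value is missing or an empty string
is_blank <- function(x) is.na(x) | x == ""

# Keep only rows where both delay flags are present
complete <- toy[!is_blank(toy$ARR_DEL15) & !is_blank(toy$DEP_DEL15), ]
```

Only the first and last toy rows survive, since each of the others has a gap in one of the two columns.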

In [15]:
# Let's change the format of some of the columns from string to numeric
onTimeData$DISTANCE <- as.integer(onTimeData$DISTANCE)
onTimeData$CANCELLED <- as.integer(onTimeData$CANCELLED)
onTimeData$DIVERTED <- as.integer(onTimeData$DIVERTED)

In [16]:
# Let's change the format of some of the columns from string to factor
onTimeData$ARR_DEL15 <- as.factor(onTimeData$ARR_DEL15)
onTimeData$DEP_DEL15 <- as.factor(onTimeData$DEP_DEL15)
onTimeData$DEST_AIRPORT_ID <- as.factor(onTimeData$DEST_AIRPORT_ID)
onTimeData$ORIGIN_AIRPORT_ID <- as.factor(onTimeData$ORIGIN_AIRPORT_ID)
onTimeData$DAY_OF_WEEK <- as.factor(onTimeData$DAY_OF_WEEK)
onTimeData$DEST <- as.factor(onTimeData$DEST)
onTimeData$ORIGIN <- as.factor(onTimeData$ORIGIN)
onTimeData$DEP_TIME_BLK <- as.factor(onTimeData$DEP_TIME_BLK)
onTimeData$OP_CARRIER<- as.factor(onTimeData$OP_CARRIER)

In [17]:
# We need to check that the distribution of the data will allow training of the algorithm
# See how many flights were delayed, where 1 = True and 0 = False
t( tapply(onTimeData$ARR_DEL15, onTimeData$ARR_DEL15, length))

0.00,1.00
21769,5624


In [18]:
# We have delayed flights, so let's check their prevalence in the data
(5624 / (21769 + 5624))
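
The same prevalence can be derived without typing the counts by hand. The sketch below rebuilds a label vector from the two counts reported above and uses `table` with `prop.table`; in the notebook the vector would be `onTimeData$ARR_DEL15` instead, and the variable names here are hypothetical.

```r
# Rebuild a label vector with the class counts reported above
arr_del15 <- factor(rep(c("0.00", "1.00"), times = c(21769, 5624)))

# Counts per class, then each class as a share of all rows
class_counts <- table(arr_del15)
class_share <- prop.table(class_counts)

# Share of delayed flights (class "1.00")
delayed_share <- class_share[["1.00"]]
```

With these counts roughly one flight in five is delayed, so a model that always predicts "on time" is already right about 79% of the time.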

# Selecting the Algorithm 
- Learning Type: Solution Statement identifies a Prediction Model => Supervised Machine Learning
- Result: Regression (Continuous Values) or Classification (Discrete Values) => Classification
- Complexity: Ensemble or Non-Ensemble Algorithms => Initially Non-Ensemble
- Basic vs Enhanced => Initially Basic

Candidate Algorithms 
- Naive Bayes
- Logistic Regression 
- Decision Trees

# Training the Model

Select features
- Origin and Destinations
- Day of the Week
- Carrier
- Departure Time Block
- Arrival Delay 15 (the target to predict, kept alongside the features)

Caret Package - Classification And Regression Training
- Data Splitting
- Pre-processing
- Feature selection
- Model tuning
- Common interface across algorithms 

In [19]:
# use library to load package into current R session
library(caret)

Loading required package: lattice
Loading required package: ggplot2


In [20]:
# set the seed for random number generation so the results are repeatable
set.seed(122515)

In [21]:
# create a Vector of the subset of features needed to train the model - so we can add or remove columns
featureCols <- c("ARR_DEL15", "DAY_OF_WEEK", "OP_CARRIER", "DEST", "ORIGIN", "DEP_TIME_BLK")

In [22]:
# create dataframe out of the subset of onTimeData that contains only these columns
onTimeDataFiltered <- onTimeData[,featureCols]

In [23]:
# Split the data into training and test using Caret to ensure correct % split and correct distribution between the data sets
inTrainRows <- createDataPartition(onTimeDataFiltered$ARR_DEL15, p=0.70, list=FALSE)

In [24]:
# Let's check the data with the head function
head(inTrainRows, 10)

Resample1
1
3
4
5
6
7
9
10
11
12


In [25]:
# Use the row vector as the indices to select the rows that make up the training data
trainDataFiltered <- onTimeDataFiltered[inTrainRows,]

In [26]:
# Simply put a minus (-) in front of the indices to select the non-training data, i.e. the test data
testDataFiltered <- onTimeDataFiltered[-inTrainRows,]

In [27]:
# Now we need to check that the data is split in the correct proportions between training and test
nrow(trainDataFiltered)/(nrow(testDataFiltered) + nrow(trainDataFiltered))
nrow(testDataFiltered)/(nrow(testDataFiltered) + nrow(trainDataFiltered))
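
To show roughly what a class-preserving split involves, here is a base R sketch of stratified sampling on a toy label vector. It is an approximation of the idea, not caret's actual implementation, and `stratified_indices` is a hypothetical helper.

```r
set.seed(122515)

# Toy labels (hypothetical): 80% on time, 20% delayed
labels <- factor(rep(c("0.00", "1.00"), times = c(800, 200)))

# Sample a proportion p of the row indices separately within each class
stratified_indices <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_indices(labels, p = 0.70)
train_labels <- labels[train_idx]
test_labels <- labels[-train_idx]
```

Because sampling happens per class, both parts keep the 80/20 class balance of the full toy vector.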

In [28]:
# To train the model we use the Caret train function
logisticRegModel <- train(ARR_DEL15 ~ ., data=trainDataFiltered, method="glm", family="binomial")
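
For intuition about the model type being fitted, the sketch below fits a logistic regression with base R's `glm` on a tiny toy data frame. The data and the column names `time_block` and `delayed` are hypothetical, and this is a simplified stand-in for the caret call, not a reproduction of it.

```r
# Toy data (hypothetical): delays are rarer in the morning block
toy <- data.frame(
  time_block = factor(c("am", "am", "am", "am", "pm", "pm", "pm", "pm")),
  delayed = factor(c("0", "0", "0", "1", "1", "1", "1", "0"))
)

# With a factor response, glm models the probability of the second level ("1")
fit <- glm(delayed ~ time_block, data = toy, family = binomial)

# Predicted probability of delay, turned into a class with a 0.5 cut-off
prob_delayed <- predict(fit, newdata = toy, type = "response")
predicted <- ifelse(prob_delayed > 0.5, "1", "0")
```

The fitted probabilities are simply the delay rate in each block (0.25 for "am", 0.75 for "pm"), so every "pm" row is predicted as delayed.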

In [29]:
logisticRegModel

Generalized Linear Model 

19176 samples
    5 predictor
    2 classes: '0.00', '1.00' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 19176, 19176, 19176, 19176, 19176, 19176, ... 
Resampling results:

  Accuracy   Kappa     
  0.7950431  0.02468547


# Testing the Model

In [30]:
# Now we will test the Model using the test data and return an object containing the predictions
logRegPrediction <- predict(logisticRegModel, testDataFiltered)

In [31]:
# To evaluate the model we use the Caret confusionMatrix function
logRegConfMat <- confusionMatrix(logRegPrediction, testDataFiltered[,"ARR_DEL15"])

In [32]:
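# The confusion matrix provides groups of performance statistics on the model's predictive capabilities
# Rows are predictions, columns are the reference (actual) values; the positive class is 0.00 (not delayed)
# A True Positive   B False Positive  (predicted NOT DELAYED)
# C False Negative  D True Negative   (predicted DELAYED)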
logRegConfMat

Confusion Matrix and Statistics

          Reference
Prediction 0.00 1.00
      0.00 6494 1653
      1.00   36   34
                                          
               Accuracy : 0.7945          
                 95% CI : (0.7855, 0.8031)
    No Information Rate : 0.7947          
    P-Value [Acc > NIR] : 0.5283          
                                          
                  Kappa : 0.0227          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.99449         
            Specificity : 0.02015         
         Pos Pred Value : 0.79710         
         Neg Pred Value : 0.48571         
             Prevalence : 0.79469         
         Detection Rate : 0.79031         
   Detection Prevalence : 0.99148         
      Balanced Accuracy : 0.50732         
                                          
       'Positive' Class : 0.00            
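
The headline statistics in this output follow directly from the four cell counts. The base R sketch below recomputes them by hand; the variable names are hypothetical, and the counts are copied from the matrix above.

```r
# Cell counts from the logistic regression confusion matrix above
# (rows = prediction, columns = reference, positive class = "0.00")
tp <- 6494  # predicted on time, actually on time
fp <- 1653  # predicted on time, actually delayed
fn <- 36    # predicted delayed, actually on time
tn <- 34    # predicted delayed, actually delayed
total <- tp + fp + fn + tn

metrics <- c(
  accuracy = (tp + tn) / total,
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  pos_pred_value = tp / (tp + fp),
  neg_pred_value = tn / (tn + fn),
  # Accuracy of always predicting the most common reference class
  no_information_rate = max(tp + fn, fp + tn) / total
)
metrics["balanced_accuracy"] <- mean(metrics[c("sensitivity", "specificity")])
round(metrics, 5)
```

Laid out this way, it is easy to see that accuracy and the no information rate are almost identical, while specificity is close to zero.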
                                          

### Improve Model Performance
Overall accuracy (0.7945) clears the 70% target but is no better than the No Information Rate (0.7947), i.e. always predicting on time
On-time arrival predictions are fairly good (Pos Pred Value : 0.79710, Sensitivity : 0.99449)
Delayed arrival predictions are poor (Neg Pred Value : 0.48571, Specificity : 0.02015)
Options for improvement 
- Add Additional Columns
- Adjust Training Settings
- Select Better Algorithm (Try ensemble - Random Forest Algorithm)

In [33]:
# Load Random Forest Library
library(randomForest)

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin



In [34]:
# Let's check the R memory limit, as I hit "Error: cannot allocate vector of size 1.4 Gb"
# (memory.limit() is Windows-only and no longer supported from R 4.2 onward)
memory.limit()

In [35]:
# Let's increase the limit
memory.limit(size=15000)

In [36]:
# We will use the raw Random Forest constructor instead of the caret train method to see the difference
# proximity = TRUE stores an n-by-n matrix over the training rows, which is memory-hungry
# This may take some time ...
rfModel <- randomForest(trainDataFiltered[-1], trainDataFiltered$ARR_DEL15, proximity = TRUE, importance = TRUE)

In [37]:
# Now let's use the new Random Forest model to predict which flights will be delayed
rfValidation <- predict(rfModel, testDataFiltered)

In [38]:
# Let's evaluate the performance of the model using the confusion matrix
rfConfMat <- confusionMatrix(rfValidation, testDataFiltered[,"ARR_DEL15"])
rfConfMat

Confusion Matrix and Statistics

          Reference
Prediction 0.00 1.00
      0.00 6242 1520
      1.00  288  167
                                          
               Accuracy : 0.78            
                 95% CI : (0.7709, 0.7889)
    No Information Rate : 0.7947          
    P-Value [Acc > NIR] : 0.9995          
                                          
                  Kappa : 0.0753          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.95590         
            Specificity : 0.09899         
         Pos Pred Value : 0.80417         
         Neg Pred Value : 0.36703         
             Prevalence : 0.79469         
         Detection Rate : 0.75964         
   Detection Prevalence : 0.94463         
      Balanced Accuracy : 0.52744         
                                          
       'Positive' Class : 0.00            
                                          

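The Random Forest finds more delayed flights: true negatives (correctly predicted delays) rose from 34 to 167 and Specificity from 0.02015 to 0.09899
Overall Accuracy fell slightly (0.7945 to 0.78) and stays below the No Information Rate (0.7947), so delayed-flight prediction is still weak
This can be improved further with a different view of the data, such as including weather data
Will leave this ML case here for now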
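
One way to compare the two models on equal terms is Cohen's kappa, which both confusion matrix outputs report. The sketch below recomputes it in base R from the two matrices shown earlier; `cohen_kappa` is a hypothetical helper, not a caret function.

```r
# Cohen's kappa: agreement beyond what chance alone would produce
cohen_kappa <- function(m) {
  n <- sum(m)
  observed <- sum(diag(m)) / n
  expected <- sum(rowSums(m) * colSums(m)) / n^2
  (observed - expected) / (1 - expected)
}

classes <- c("0.00", "1.00")
dims <- list(Prediction = classes, Reference = classes)

# Counts copied from the two confusion matrices (matrix() fills column by column)
log_reg_mat <- matrix(c(6494, 36, 1653, 34), nrow = 2, dimnames = dims)
rf_mat      <- matrix(c(6242, 288, 1520, 167), nrow = 2, dimnames = dims)

kappa_log_reg <- cohen_kappa(log_reg_mat)
kappa_rf <- cohen_kappa(rf_mat)
```

Both values are close to zero, which suggests that neither model does much better than chance at separating delayed from on-time flights with the current features.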