The 2006 and 2007 datasets obtained from the Harvard Dataverse (https://doi.org/10.7910/DVN/HG7NV7) were used for this project. The data was cleaned before the in-depth analysis was conducted. It was then visualised in both Python and R to answer the following research questions:
- When is the best time of day, day of the week, and time of year to fly to minimise delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- Can you detect cascading failures as delays in one airport create delays in others?
- Use the available variables to construct a model that predicts delays.

A detailed analysis was produced using line graphs, scatter plots, box plots, spatio-temporal heat maps, hypothesis tests and correlation heat maps to answer the above questions. Supervised classification models (Decision Tree, Logistic Regression and Random Forest) were built to predict delay status, that is, whether a flight will arrive late or not. A multiple linear regression model was also built to predict future arrival delays in real-world scenarios. A report was produced to present the conclusions reached in both Python and R and to show that the two analyses lead to similar conclusions.
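As a rough illustration of the first research question, the sketch below groups flights by scheduled departure hour and compares mean arrival delay. It is not the project's actual code. The column names `CRSDepTime` and `ArrDelay`, the `"NA"` missing-value marker, and the helper names are assumptions, and the sample rows are synthetic values made up for the example, not taken from the 2006 or 2007 data.

```python
from collections import defaultdict
from statistics import mean

# Synthetic rows for illustration only (NOT real records from the datasets).
# Assumed columns: CRSDepTime = scheduled departure as HHMM text,
# ArrDelay = arrival delay in minutes, "NA" when missing.
SAMPLE_ROWS = [
    {"CRSDepTime": "615", "ArrDelay": "-4"},
    {"CRSDepTime": "0650", "ArrDelay": "2"},
    {"CRSDepTime": "1730", "ArrDelay": "35"},
    {"CRSDepTime": "1745", "ArrDelay": "15"},
    {"CRSDepTime": "2210", "ArrDelay": "NA"},
]


def mean_delay_by_hour(rows):
    """Return {departure hour: mean arrival delay}, skipping missing delays."""
    delays = defaultdict(list)
    for row in rows:
        if row["ArrDelay"] in ("", "NA"):
            continue  # basic cleaning step: drop rows with no recorded delay
        # Pad to four digits so "615" is read as 06:15, then take the hour.
        hour = int(row["CRSDepTime"].zfill(4)[:2])
        delays[hour].append(float(row["ArrDelay"]))
    return {hour: mean(values) for hour, values in sorted(delays.items())}


def best_hour(rows):
    """Return the departure hour with the lowest mean arrival delay."""
    by_hour = mean_delay_by_hour(rows)
    return min(by_hour, key=by_hour.get)


if __name__ == "__main__":
    print(mean_delay_by_hour(SAMPLE_ROWS))
    print(best_hour(SAMPLE_ROWS))
```

The same grouping idea would extend to day of the week or month by swapping the key that is extracted from each row; on the full datasets a data-frame library would likely be more practical than plain dictionaries.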