### Analysis and Prediction of Air Traffic Volume and Pattern in the Netherlands Post-COVID-19

**Members & Student numbers**
- Minghao Li 6212999
- Xinyu Yang 6302750
- Yilin Shi 6140343
- Yue Guo 6147275
- Yumeng Pan 6134130

### Introduction

The COVID-19 pandemic had profound impacts on global air traffic. In the Netherlands, travel restrictions and behavioral changes led to substantial shifts in air traffic volume. As the country gradually reopened, the recovery process revealed varying patterns across different types of flights and travel groups. 

By analyzing historical traffic patterns, this project will explore the variations in recovery across different flight modes, while also visualizing the comparison between actual and predicted values. Understanding these changes and predicting future traffic volume in the face of possible new pandemics or other disruptions is crucial for air traffic management and policy-making. 



### Research Objectives

This project aims to analyze these changes by comparing traffic volumes before, during, and after the pandemic, focusing on different types of flights (e.g. domestic, international, passenger, cargo). Additionally, it will forecast the potential impact of future pandemics on air traffic based on historical data, providing insights for aviation management and policy decisions.
 

 
### Research Questions (RQs)
- RQ1: What is the impact of the monthly number of new cases before, during, and after the COVID-19 pandemic on different flight types (e.g. cross-country and local flights) in the Netherlands?
- RQ2: What is the impact of the monthly number of new cases before, during, and after the COVID-19 pandemic on the numbers of passengers, cargo, and mail in the Netherlands?
- RQ3: How can future air traffic volume and patterns be predicted in response to potential pandemics or other disruptions, based on existing data?



### Data Sources
- Our World in Data (OWID) COVID-19 Dataset:
    - https://ourworldindata.org/coronavirus
    - Pandemic-related data, including the number of new cases in each month, will be correlated with variation in air traffic patterns.
- CBS (Statistics Netherlands) Open Data:
    - CBS Transport and Mobility Dataset: 
    - https://opendata.cbs.nl/statline/#/CBS/en/dataset/37478eng/table?ts=1728287180831
    - This source is used to research the variation across different flight modes and the numbers of passengers, cargo, and mail through the COVID period.

### Analysis and Visualization of Question 1 and 2

In this section, we analyze the global COVID-19 infection data and its impact on aviation activities. The analysis includes the following steps:

1. **Data Reading and Preprocessing**:
    - Read the global COVID-19 infection data from an Excel file.
    - Convert the 'date' column to datetime format.
    - Group the data by month and calculate the total number of new cases for each month.

2. **Merging with Aviation Data**:
    - Read the monthly aviation data from a CSV file.
    - Parse the date column and group the data by month.
    - Merge the COVID-19 data with the aviation data on the 'month' column.

3. **Visualization**:
    - Plot the number of new COVID-19 cases and various aviation-related variables over time.
    - Use line charts to visualize the trends and relationships between the variables.
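
The preprocessing and merging steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the column names (`date`, `new_cases`, `period`, `total_passengers`) and the tiny in-memory frames are assumptions standing in for the Excel and CSV files, which in practice would be loaded with `pd.read_excel` and `pd.read_csv`.

```python
import pandas as pd


def monthly_new_cases(covid: pd.DataFrame) -> pd.DataFrame:
    """Convert 'date' to datetime and sum new cases per calendar month."""
    covid = covid.copy()
    covid["date"] = pd.to_datetime(covid["date"])
    covid["month"] = covid["date"].dt.to_period("M")
    return covid.groupby("month", as_index=False)["new_cases"].sum()


def merge_with_aviation(cases: pd.DataFrame, aviation: pd.DataFrame) -> pd.DataFrame:
    """Parse the aviation date column and join both tables on 'month'."""
    aviation = aviation.copy()
    aviation["month"] = pd.to_datetime(aviation["period"]).dt.to_period("M")
    return cases.merge(aviation.drop(columns="period"), on="month", how="inner")


# Synthetic stand-ins for the two input files (values are made up).
covid = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-15", "2020-04-01"],
    "new_cases": [10, 20, 5],
})
aviation = pd.DataFrame({
    "period": ["2020-03-01", "2020-04-01"],
    "total_passengers": [1_000_000, 130_000],
})

merged = merge_with_aviation(monthly_new_cases(covid), aviation)
print(merged)
```

An inner join keeps only months present in both sources, which avoids missing values in later correlation and regression steps.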

The following plots are generated to visualize the data:

1. **Aircraft Movements and COVID-19 New Cases Over Time**:
    - Cross-country flights and local flights are plotted on the primary y-axis.
    - New COVID-19 cases are plotted on the secondary y-axis.

2. **Commercial Air Traffic and COVID-19 New Cases Over Time**:
    - Total passengers, total cargo, and total mail are plotted on the primary y-axis.
    - New COVID-19 cases are plotted on the secondary y-axis.

These visualizations help us understand the impact of the COVID-19 pandemic on aviation activities and identify any potential correlations between the variables.
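
A dual-axis line chart of this kind could be produced roughly as below. The helper name `plot_dual_axis`, the column names, and the demo values are hypothetical; the sketch only shows the primary/secondary y-axis layout described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd


def plot_dual_axis(df, left_cols, right_col, left_label, title):
    """Plot left_cols on the primary y-axis and right_col on a secondary y-axis."""
    fig, ax_left = plt.subplots(figsize=(10, 4))
    for col in left_cols:
        ax_left.plot(df["month"], df[col], label=col)
    ax_left.set_xlabel("Month")
    ax_left.set_ylabel(left_label)

    # twinx() creates a second y-axis that shares the same x-axis.
    ax_right = ax_left.twinx()
    ax_right.plot(df["month"], df[right_col], color="red", linestyle="--", label=right_col)
    ax_right.set_ylabel("New COVID-19 cases")

    # Merge the line handles of both axes into a single legend.
    lines = list(ax_left.get_lines()) + list(ax_right.get_lines())
    ax_left.legend(lines, [line.get_label() for line in lines], loc="upper left")
    ax_left.set_title(title)
    fig.tight_layout()
    return fig, ax_left, ax_right


# Made-up demo values, only to exercise the function.
demo = pd.DataFrame({
    "month": pd.date_range("2020-01-01", periods=4, freq="MS"),
    "cross_country_flights": [40000, 38000, 20000, 9000],
    "local_flights": [12000, 11000, 8000, 6000],
    "new_cases": [0, 10, 12000, 30000],
})
fig, ax_left, ax_right = plot_dual_axis(
    demo,
    ["cross_country_flights", "local_flights"],
    "new_cases",
    "Aircraft movements",
    "Aircraft Movements and COVID-19 New Cases Over Time",
)
```

A secondary axis is useful here because case counts and flight counts differ by orders of magnitude, but it also means the visual overlap of the curves should not be read as evidence of correlation on its own.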

### Research Question 1: The Impact of COVID-19 on Cross-country and Local Flight Trends

![image.png](attachment:image.png)
The image shows that during non-pandemic periods, both cross-country and local flights follow a seasonal pattern, with more flights in the summer and fewer in the winter. Cross-country flights were hardest hit during the early stages of the pandemic, with numbers sharply declining from November 2019 and reaching a low in May 2020, dropping below 10,000. As the pandemic eased, cross-country flights gradually recovered and returned to near pre-pandemic levels by 2023.

Local flights also declined, though less dramatically, and recovered more slowly, with overall numbers remaining lower. The image suggests no clear relationship between global COVID-19 cases and flight trends. This is likely because flight numbers were influenced more by government policies than by the direct rise in cases. Many countries imposed strict travel restrictions, including border closures and flight cancellations, early in the pandemic, often before case numbers peaked.

In conclusion, the pandemic significantly impacted both cross-country and local flight numbers. However, the impact was more closely tied to government policies and travel restrictions than to the rise in case numbers. Flight trends aligned more with the timing and adjustments of these preventive measures.

### Research Question 2: The Impact of COVID-19 on the Numbers of Passengers, Cargo, and Mail


![image.png](attachment:image.png)
Before the pandemic in 2019, the number of passengers fluctuated between 5 million and 8 million, with higher numbers during the summer, likely due to tourism. However, by April 2020, passenger numbers had plummeted to 133,752 due to the outbreak of COVID-19. A slight recovery occurred between July and September 2020, reaching approximately 2 million passengers, although this was still lower than pre-pandemic levels. This period showed a consistent seasonal pattern of higher passenger numbers in the summer. Following this, passenger numbers declined again from November 2020 to March 2021, then rebounded to around 4 million by September 2021. After that, the number of passengers gradually increased year over year, while the cyclical pattern of more passengers in summer and fewer in winter persisted.

![image.png](attachment:image.png)
Before the pandemic, cargo volumes fluctuated between 130,000 and 160,000 tons, with relatively stable cargo operations. Following the outbreak of COVID-19 in early 2020, cargo volumes dropped significantly to a low of 103,420 tons in May, likely due to reduced flights and logistical constraints. From mid-2020 to early 2021, cargo volumes started to recover but continued to fluctuate significantly, possibly due to the resurgence of the pandemic and varying lockdown measures across regions. By 2022 and 2023, cargo volumes had decreased slightly, but after 2023, a gradual increase was observed, although not reaching pre-pandemic levels.

![image.png](attachment:image.png)
From 2019 to January 2020, mail volume showed considerable fluctuations but remained relatively high, between 1,250 and 2,000 tons. With the onset of the pandemic in early 2020, mail volume fell sharply, reaching a low of 382 tons between March and May 2020, likely due to global postal disruptions and international lockdowns. From May 2020 onward, mail shipments gradually increased, reflecting a recovery in air mail transport. However, after May 2021, mail volume began to decline again, with some fluctuations. By 2023, mail shipments had stabilized but remained well below pre-pandemic levels.

**Summary**

In the early stages of the pandemic, despite the number of new cases being relatively low, the number of passengers, cargo, and mail dropped significantly. From mid-2020 to mid-2021, even as the number of new cases rose, overall air transport volumes began to recover. In early 2022, as new case numbers spiked, cargo volumes showed a notable decrease, while the number of passengers and mail experienced smaller drops. Similarly, during subsequent peaks in new cases, air transport volumes saw little variation. This suggests a weak correlation between air transport volumes and global new COVID-19 cases. The correlation coefficients support this: the absolute values of the coefficients between the number of new cases and the number of passengers, cargo, and mail are all less than 0.5.


### Research Question 3: Analysis and Prediction


### Plot 

We introduced two new COVID-19 data variables: monthly deaths and monthly vaccinations. Next, we will plot the five aviation-related variables (cross-country flights, local flights, number of passengers, amount of cargo, and amount of mail) alongside the monthly deaths and monthly vaccinations respectively. These plots will help visualize the trends and relationships between the aviation variables and the COVID-19 data over time.

**The first part is to analyze the relationship between monthly deaths and aviation data.**


![image.png](attachment:image.png)
After the outbreak of the pandemic in 2020, monthly deaths surged rapidly, reaching a peak in early 2021. Subsequently, the number of deaths gradually decreased and stabilized by 2022, almost reaching zero. Local flights and cross-country flights saw significant declines during the outbreak, especially cross-country flights (green line). As monthly deaths decreased in early 2022, cross-country flight numbers began to recover. However, once monthly deaths approached zero, the fluctuations in flight numbers no longer tracked them. Thus, there is no evident direct relationship between monthly deaths and local flights or cross-country flights.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
Passenger numbers dropped sharply during the initial COVID-19 outbreak and recovered as deaths decreased, but there is no strong correlation between the two in the later stages. Cargo volume showed smaller fluctuations compared to passenger numbers and remained relatively stable, with no clear relationship to monthly deaths. Mail volume showed larger fluctuations during the initial stages of the pandemic, especially when the number of deaths peaked. However, mail transport gradually returned to normal while monthly deaths dropped and remained stable. In all cases, monthly deaths did not show a clear long-term direct correlation with the aviation variables.

**The next part is to analyze the relationship between monthly cumulative vaccinations and aviation data.**


![image.png](attachment:image.png)
Monthly cumulative vaccinations (red line) began to rise rapidly in mid-2021 and continued steadily, reaching a stable level by 2022. This indicates the widespread rollout and coverage of vaccinations. The increase in cumulative vaccinations is somewhat correlated with the recovery of cross-country flights over time because the number of cross-country flights rose during the period of the fastest vaccination growth. However, there is no strong direct relationship. Over the entire pandemic period, cumulative vaccinations and flight numbers show no clear connection.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
Passenger numbers gradually recovered as vaccinations increased, indicating some correlation. Although passenger recovery showed fluctuations, it generally increased during the period of rising vaccinations. Cargo volume experienced slight fluctuations early in the pandemic, but after vaccinations began, cargo volume remained relatively stable, showing minimal impact from vaccination growth. Mail volume showed some recovery after vaccination growth, but the fluctuations remained significant and no sustained recovery trend was observed, so the relationship between vaccinations and mail volume is weak. Vaccinations had an impact on passenger numbers in specific periods, but overall, the correlation between vaccinations and aviation traffic (including passenger, cargo, and mail transport volumes) is weak.

### Correlations with New Cases, Deaths, and Vaccinations

In this section, we will incorporate vaccination and COVID-19 death data to explore potential correlations with aviation-related variables. By examining these correlations, we aim to uncover insights into how the pandemic and vaccination efforts have impacted aviation activities. First, we calculate the correlation coefficient between each variable and the number of deaths as well as the number of vaccinations and new cases. The variables analyzed are as follows: local flights, cross-country flights, total passengers, total cargo, and total mail. These coefficients indicate the strength and direction of the linear relationship between each variable and the number of deaths as well as the number of vaccinations.
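
As a rough sketch of this step, the Pearson correlation coefficients can be read off a pandas correlation matrix. The function name `correlation_table`, the column names, and the demo values below are assumptions for illustration, not the notebook's data.

```python
import pandas as pd


def correlation_table(df, aviation_cols, covid_cols):
    """Return Pearson r for every (aviation variable, COVID variable) pair."""
    cols = list(aviation_cols) + list(covid_cols)
    corr = df[cols].corr(method="pearson")
    # Keep only the aviation rows and COVID columns of the full matrix.
    return corr.loc[list(aviation_cols), list(covid_cols)]


# Made-up values; total_passengers is constructed to be perfectly
# anti-correlated with new_cases so the result is easy to check.
demo = pd.DataFrame({
    "new_cases": [1, 2, 3, 4, 5],
    "monthly_deaths": [2, 1, 4, 3, 5],
    "total_passengers": [10, 8, 6, 4, 2],
    "local_flights": [3, 1, 4, 1, 5],
})

table = correlation_table(
    demo,
    aviation_cols=["total_passengers", "local_flights"],
    covid_cols=["new_cases", "monthly_deaths"],
)
weak = table.abs() < 0.5  # flags pairs with |r| below 0.5
print(table.round(3))
```

Pearson's r only measures linear association, so a small coefficient does not rule out a nonlinear or lagged relationship.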


![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Attempting Multiple Linear Regression

In this section, we will attempt to use a Multiple Linear Regression model to predict various aviation-related variables based on the number of new COVID-19 cases, monthly deaths, and monthly vaccinations. The variables we will analyze include:

- Local flights (number)
- Cross-country flights (number)
- Total passengers (number)
- Total cargo (ton)
- Total mail (ton)

We will split the data into training and testing sets, fit the Multiple Linear Regression model, and evaluate its performance using metrics such as Mean Squared Error (MSE) and R-squared (R²). Additionally, we will visualize the actual vs. predicted values for each variable.
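
The procedure can be sketched with scikit-learn as below. The data are synthetic, and the 80/20 random split with a fixed seed is an assumption, since the split used in the notebook is not stated here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, test_size=0.2, seed=42):
    """Fit ordinary least squares on a training split and score it on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred)


# Synthetic stand-in: three predictors (e.g. cases, deaths, vaccinations)
# and one target with a known linear relationship plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

model, mse, r2 = fit_and_score(X, y)
print(f"MSE={mse:.4f}  R2={r2:.4f}  coefficients={model.coef_.round(2)}")
```

For monthly time series, a chronological split (training on earlier months, testing on later ones) is a common alternative to a random split, because it avoids evaluating the model on months that lie between training observations.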

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

The first two graphs show the residuals for the number of local flights and the quantiles of the residuals versus a theoretical normal distribution. Although the residuals cluster around zero, their spread and the presence of outliers suggest that the model may not capture some of the variability in the local flights data. The low R² (0.324) is also consistent with a weak relationship between the number of local flights and the COVID-related variables.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

These graphs, along with the R² value of 0.621, indicate a stronger relationship between the number of cross-country flights and COVID-related variables than was found for local flights. This result is plausible considering the Netherlands' significant role in international air traffic.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

These graphs illustrate that total passenger numbers are closely related to COVID-related variables, as reflected by an R² value of 0.689. This relatively strong correlation suggests that the pandemic has had a notable impact on passenger aviation, with fluctuations in COVID-19 cases, deaths, and vaccinations likely influencing travel demand.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In comparison, these figures show that total cargo has a much weaker relationship with COVID-related indicators, as indicated by an R² value of 0.349. This suggests that cargo transport has been less affected by pandemic factors such as case numbers, deaths, and vaccinations. The weaker relationship may be due to the essential nature of cargo transport, which likely continued relatively unaffected during the pandemic to meet global supply chain demands, regardless of fluctuations in COVID-19 variables.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

These graphs reveal a comparatively strong relationship between total mail and COVID-related variables, demonstrated by an R² value of 0.743. This indicates that the pandemic influenced mail volumes, likely due to shifts in consumer behavior and increased reliance on delivery services during lockdowns.

**Summary**

While the multiple linear regression model performs reasonably well for predicting some variables, it falls short in capturing the relationships for others. This suggests that a direct linear approach may not fully reflect the complexities within the data. To address this, we plan to apply a transformation to the variables and investigate whether this approach can reveal stronger or more linear relationships.

### Attempting Multiple Linear Regression with Log Transformation

In this section, we apply a log transformation to the variables to see whether this approach can reveal stronger or more linear relationships. By transforming the data, we aim to enhance the model's accuracy and better account for potential nonlinearities, leading to more robust predictions and deeper insights.
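
One way to set this up is sketched below. It assumes `log1p` (the logarithm of 1 + x) is applied to both predictors and target, which tolerates months with zero counts; the exact transformation used in the notebook is not stated, and the data here are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_log_model(X, y):
    """Fit OLS on log1p-transformed data; return the model and in-sample R² on the log scale."""
    X_log = np.log1p(X)
    y_log = np.log1p(y)
    model = LinearRegression().fit(X_log, y_log)
    return model, model.score(X_log, y_log)


def predict_original_scale(model, X):
    """Predict on the log scale, then invert the transform with expm1."""
    return np.expm1(model.predict(np.log1p(X)))


# Synthetic data with a relationship that is linear only after taking logs.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=5.0, sigma=2.0, size=(60, 1))
y = np.expm1(10.0 - 0.5 * np.log1p(X[:, 0]) + rng.normal(scale=0.01, size=60))

log_model, r2_log = fit_log_model(X, y)
preds = predict_original_scale(log_model, X)
print(f"R2 on log scale: {r2_log:.4f}")
```

An R² computed on the log scale is not strictly comparable to one computed on the original scale, so comparisons between the two models are best made on back-transformed predictions.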

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In these two graphs, the log transformation does not appear to improve the relationship between local flights and the variables. This conclusion is supported by a low R² value of 0.305, indicating that the log transformation fails to uncover a stronger connection. The result suggests a weak or nonexistent relationship between local flights and the COVID-related factors.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

Similarly, the log transformation does not appear to enhance the relationship between cross-country flights and the variables. This is indicated by the relatively low R² value of 0.528, which is lower than that of the original model. This finding suggests that the transformation failed to improve the model's ability to capture the underlying relationship, highlighting the limitations of both the original and transformed models in this context.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)

The same situation applies to the relationships between total passengers and total cargo with the variables. The R² values are 0.561 and 0.339, respectively, both of which are lower than those of the original models. This indicates that the log transformation did not improve the model's ability to capture the underlying relationships between total passengers, total cargo, and the COVID-related factors.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

The relationship between total mail and the variables is the only exception: its R² value of 0.750 is slightly higher than that of the original model (0.743). This small improvement may reflect the comparatively close relationship between total mail and the COVID-related factors.

**Summary**

Based on these results, the log transformation does not lead to a meaningful improvement in the model's performance: R² decreased for four of the five variables and rose only marginally for total mail (from 0.743 to 0.750). As a result, we have decided to discontinue its use. Finally, we present the actual vs. predicted value graphs for the five indicators, illustrating the model's predictions compared to the observed data for each variable.

### Actual vs. Predicted Value Graphs

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)
The findings in this section reveal that while multiple linear regression offers some insight into aviation indicators, it does not accurately capture their relationship with COVID-related variables. To address this limitation, we plan to apply alternative predictive models, such as a random forest model, to further investigate the connection between aviation performance and the pandemic. This change in methodology aims to provide a more comprehensive understanding of the relationships and improve prediction accuracy.

### Attempting Random Forest Model

In this section, we will attempt to use a Random Forest model to predict various aviation-related variables based on the three pandemic variables: the number of new COVID-19 cases, monthly deaths, and monthly vaccinations. The variables we will analyze include:

- Local flights (number)
- Cross-country flights (number)
- Total passengers (number)
- Total cargo (ton)
- Total mail (ton)

We will split the data into training and testing sets, fit the Random Forest model, and evaluate its performance using metrics such as Mean Squared Error (MSE) and R-squared (R²). Additionally, we will visualize the actual vs. predicted values for each variable.
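
A minimal sketch of this workflow with scikit-learn follows. The feature names `new_cases` and `monthly_deaths` appear in this report; `monthly_vaccinations`, the hyperparameters, the random 80/20 split, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

FEATURES = ["new_cases", "monthly_deaths", "monthly_vaccinations"]


def fit_random_forest(df, target, seed=42):
    """Fit a random forest on a training split; return test MSE, R² and feature importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df[target], test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    importances = pd.Series(model.feature_importances_, index=FEATURES)
    return model, mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred), importances


# Synthetic data in which the target depends only on monthly_deaths.
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "new_cases": rng.uniform(0, 1e6, n),
    "monthly_deaths": rng.uniform(0, 100, n),
    "monthly_vaccinations": rng.uniform(0, 1e6, n),
})
demo["total_passengers"] = 5e6 - 4e4 * demo["monthly_deaths"] + rng.normal(0, 1e4, n)

rf_model, rf_mse, rf_r2, importances = fit_random_forest(demo, "total_passengers")
print(f"MSE={rf_mse:.1f}  R2={rf_r2:.3f}")
print(importances.sort_values(ascending=False))
```

Impurity-based importances sum to 1 and describe how much each feature contributes to the splits of the fitted trees; they do not by themselves establish a causal effect.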

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)
The Random Forest model shows an R² value of 0.6337, indicating that the model explains about 63.37% of the variance. While it demonstrates some predictive ability, it is not highly accurate. The model's MSE is 215,993,945,235.1363, which corresponds to a root mean squared error of roughly 465,000 in the units of the target and signifies a high prediction error on the test set. These results suggest the model captures some patterns, but the three pandemic variables used are insufficient to fully explain the fluctuations in the aviation industry.

Feature importance analysis shows that `monthly_deaths` has the highest influence, with an importance score of 0.8393, indicating that death counts significantly affect aviation variables. In contrast, `new_cases` has a much smaller impact, with an importance of only 0.0289. 

Visual comparisons reveal that the model performs well during certain periods, but shows larger discrepancies in others. To improve performance, it may be necessary to include additional variables, such as lockdown policy indices, vaccination coverage, and mobility data, to better capture the factors influencing aviation trends.


### Time Series Analysis Steps


1. **Stationarity Testing**:
    - Use tests like the Augmented Dickey-Fuller (ADF) test or the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to check the stationarity of the time series.
    - If the time series is not stationary, make it stationary through differencing, log transformation, etc.

2. **Model Selection and Training**:
    - Choose an appropriate time series model, such as ARIMA, SARIMA, Holt-Winters, etc.
    - Fit the model using training data and adjust model parameters for the best fit.

3. **Model Evaluation**:
    - Evaluate the model's predictive performance using test data.
    - Calculate evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), etc.

4. **Forecasting and Validation**:
    - Use the trained model to forecast future data.
    - Compare the forecasted results with actual data to validate the model's accuracy.
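
The evaluation metrics named in these steps can be computed directly with NumPy, as in the sketch below. The helper names `chronological_split` and `forecast_errors` and the example numbers are illustrative only.

```python
import numpy as np


def chronological_split(values, n_test):
    """Hold out the last n_test observations as the test set (no shuffling)."""
    return values[:-n_test], values[-n_test:]


def forecast_errors(actual, predicted):
    """Return MSE, RMSE and MAE between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(err))),
    }


train, test = chronological_split(list(range(10)), n_test=3)
metrics = forecast_errors([100, 120, 90], [110, 115, 95])
print(train, test)
print(metrics)
```

RMSE and MAE are expressed in the same units as the series (for example flights or tons), which makes them easier to interpret than MSE when the series has a large scale.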


### Local flight

This time series analysis focuses on local flight volume data, employing various methods including time series decomposition, stationarity tests, SARIMAX model fitting, and performance evaluation. Below is a detailed analysis of the results and potential improvements.

**Time Series Decomposition**

The original data was decomposed into trend, seasonal, and residual components.

![image.png](attachment:image.png)

- Trend Component: The data initially exhibited an upward trend, followed by stabilization between time points 30 and 40, likely due to supply constraints or market saturation.
- Seasonal Component: The seasonal component revealed periodic fluctuations in flight volume, such as peaks between time points 10 and 20, possibly related to holiday travel demand.
- Residual Component: Significant residual deviations were observed between time points 20 and 30, suggesting that the model did not fully capture the data dynamics, potentially indicating unconsidered exogenous variables or nonlinear characteristics.


**Stationarity Test**

Stationarity was tested using the ADF and KPSS tests.


![stationarytest1.png](attachment:stationarytest1.png)


ADF and KPSS tests indicated that the original data was non-stationary, requiring differencing. After first-order differencing, stationarity improved significantly, though residuals still exhibited some structural bias, suggesting the consideration of higher-order differencing or other transformation methods.


**SARIMAX Model Evaluation**

A SARIMAX model with an ARIMA(1,1,1) structure was fitted to the data.


![image.png](attachment:image.png)
- Autoregressive Term (AR): The high p-value suggests that the AR term is not significant. It may be advisable to simplify the model by removing this component to reduce complexity and prevent overfitting.
- Moving Average Term (MA): The MA term had a significant p-value, indicating its effectiveness in reducing short-term noise and improving prediction reliability.
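- Model Selection Metrics (AIC and BIC): The AIC and BIC values were relatively high. Because these criteria are only meaningful relative to competing models fitted to the same data, parameter optimization via grid search over the model orders could identify a specification with lower values and better fitting and prediction performance.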


**Model Prediction Performance Analysis**

**Comparison of Prediction Results with Actual Data**

Test set performance was suboptimal, particularly between time points 50 and 60, where the model failed to capture rapid data fluctuations.
- Prediction Curve Smoothness: To improve responsiveness to short-term fluctuations, it may be worthwhile to increase model flexibility by incorporating additional MA or AR terms or including exogenous variables.
- Impact of External Variables: Flight volume is influenced by factors such as fuel prices, weather, and policy changes. Including these factors to construct a multivariate time series model could enhance predictive accuracy.

**Directions for Improving Prediction Performance**
- More Complex Models: Consider employing higher-order ARIMA or nonlinear models such as LSTM to address high volatility.
- Feature Engineering: Extract additional features related to seasonality and external variables to enhance prediction accuracy.


**Residual Analysis**

![image-2.png](attachment:image-2.png)

**Residual Distribution**

Ideally, residuals should be randomly distributed with a mean close to zero. However, at specific time points (e.g., 55 and 65), residuals significantly deviated from zero.


![image.png](attachment:image.png)

**Autocorrelation Analysis**

ACF and PACF: Significant autocorrelations were observed at multiple lags, suggesting the addition of lag terms or the inclusion of more explanatory variables.


### Monthly death

This analysis focuses on the time series of monthly death counts, using methods such as time series decomposition, stationarity testing, SARIMAX model fitting, and prediction performance evaluation. The following sections present a concise evaluation of the results and potential improvements.


**Time Series Decomposition**

The monthly death data was decomposed into trend, seasonal, and residual components.


![image.png](attachment:image.png)
- Trend Component: The trend shows an increase in deaths, peaking around time point 30, then declining beyond time point 40, reflecting the epidemic's progression and possible interventions.
- Seasonal Component: Seasonal fluctuations indicate periods with consistently higher death counts, likely linked to seasonal factors like increased winter vulnerabilities.
- Residual Component: Significant residual fluctuations suggest exogenous factors or noise not captured by the trend and seasonal components, particularly between time points 20 to 30.


**Stationarity Test**

Stationarity was tested using the ADF and KPSS tests.


![stationarytest2.png](attachment:stationarytest2.png)

The ADF test indicates non-stationarity. The KPSS test supports this. First-order differencing improved stationarity, though higher-order differencing or alternative transformations might help further.


**SARIMAX Model Evaluation**

A SARIMAX model with an ARIMA(1,1,1) structure was fitted to the data.


![image.png](attachment:image.png)
- Autoregressive Term (AR): The AR term's high p-value suggests it may not be significant, and removing it could reduce overfitting.
- Moving Average Term (MA): The MA term is significant and helps capture short-term irregularities.
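- Model Selection Metrics (AIC and BIC): The high AIC and BIC values suggest room for optimization. Because these criteria are only meaningful relative to competing models fitted to the same data, a grid search over the model orders could identify a better specification.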


**Model Prediction Performance Analysis**

**Comparison of Prediction Results with Actual Data**

The model's predictive performance is suboptimal, particularly during the testing phase, where predictions are too flat and fail to capture actual volatility.
- Prediction Curve Smoothness: The lack of sensitivity to fluctuations suggests adding more AR or MA terms or including exogenous regressors.
- External Variables Impact: Including variables like healthcare capacity and policy interventions could improve accuracy, especially during extreme changes.

**Directions for Improving Prediction Performance**
- Model Complexity: Advanced models like higher-order ARIMA or LSTM could capture complex temporal patterns better.
- Feature Engineering: Include features such as temperature, healthcare utilization, and population movement.


**Residual Analysis**

![image.png](attachment:image.png)

**Residual Distribution**
Residuals show systematic deviations between time points 54 and 64, indicating that the model has not fully captured underlying patterns.



![image-2.png](attachment:image-2.png)

**Autocorrelation Analysis**
ACF and PACF: Significant autocorrelations suggest adding lagged variables or using higher-order differencing.


### Conclusion