
# Temporal Data

Temporal data refers to observations collected over time, and is common in healthcare, economics, environmental monitoring and operational research. What distinguishes temporal data from other forms is the **time-dependence** between observations.

For example, a patient's heart rate recorded at 10:00 a.m. is not independent of their reading at 10:15 a.m. This violates the independence assumption that many classical statistical models rely on, and it calls for time-aware techniques.
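To make the dependence concrete, the sketch below simulates a hypothetical heart-rate-like series and compares its lag-1 autocorrelation with that of the same values in shuffled order. The series, the AR(1) coefficient `phi = 0.8` and the baseline of 70 are illustrative assumptions, not real patient data.

```r
set.seed(1)

n   <- 500
phi <- 0.8  # assumed AR(1) coefficient; larger values mean stronger dependence

# Simulated series: each value depends on the previous one (AR(1)),
# shifted to a heart-rate-like baseline of 70 (illustrative only)
hr <- as.numeric(arima.sim(model = list(ar = phi), n = n)) + 70

# Lag-1 autocorrelation of the ordered series
# (element 1 of $acf is lag 0, element 2 is lag 1)
r1 <- acf(hr, lag.max = 5, plot = FALSE)$acf[2]

# Shuffling destroys the time ordering, so the dependence disappears
shuffled    <- sample(hr)
r1_shuffled <- acf(shuffled, lag.max = 5, plot = FALSE)$acf[2]

round(c(ordered = r1, shuffled = r1_shuffled), 2)
```

The ordered series should show a lag-1 autocorrelation close to `phi`, while the shuffled copy should be near zero. That gap is what independence-based methods ignore.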

Understanding and handling temporal data requires us to think in terms of **structure over time**, particularly:

- **Lag**: How past values influence current outcomes.

In this tutorial, we will walk through:

1. Loading and handling of temporal data
2. Identifying lags and periodicity


## 1. Loading a Sample Dataset

In this example, we will download daily climate data from the [Ogimet](https://ogimet.com/index.phtml.en) server through the `climate` package in R. Ogimet covers stations worldwide ([station list](https://opendata.dwd.de/climate_environment/CDC/help/stations_list_CLIMAT_data.txt)).

For Singapore, two stations are available:

- **48694**: Singapore/ Paya Lebar (1.37, 103.92)
- **48698**: Singapore Changi Airport (1.37, 103.98)

We will retrieve data from **Changi Airport (station 48698)** for the five years from 1 January 2020 to 1 January 2025.

In [None]:
install.packages("GGally")

In [None]:
#library(climate)
#changi <- meteo_ogimet(date = c(as.Date("2020-01-01"), as.Date("2025-01-01")),
#                       station = 48698, interval="daily")

### To avoid long loading times and potential server issues, we will use pre-downloaded data saved as `changi.csv`.

library(tidyverse)

changi <- read_csv("https://raw.githubusercontent.com/minerva79/GMS5204/refs/heads/main/changi.csv")

head(changi)

Load ED data

The emergency department (ED) data used here are synthetic temporal counts based on the PAROS dataset.

- Ong et al. (2011) Pan-Asian Resuscitation Outcome Study (PAROS). Academic Emergency Medicine. https://www.scri.edu.sg/paros/about-paros/



In [None]:
# loading ED visit data

ed <- read_csv("https://raw.githubusercontent.com/minerva79/GMS5204/refs/heads/main/Synthetic_Sampled_Temporal_Data.csv")

temporal_df <- changi %>% left_join(ed, by=c("Date"="date"))

head(temporal_df %>% select(Date, TemperatureCAvg, synthetic_count))
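A `left_join` keeps every weather row and fills `NA` wherever no ED record shares the date, so it is worth checking how many dates failed to match. The sketch below illustrates one way to do this on toy data. The `weather` and `visits` tables and their values are hypothetical stand-ins for `changi` and `ed`.

```r
library(dplyr)

# Hypothetical toy tables mirroring the key names used above (Date vs date)
weather <- tibble(
  Date = as.Date("2024-01-01") + 0:3,
  temp = c(27.0, 28.0, 27.5, 28.2)
)
visits <- tibble(
  date  = as.Date("2024-01-01") + c(0, 1, 3),  # 3 Jan has no ED record
  count = c(110, 125, 118)
)

# Left join: all weather rows are kept, unmatched dates get NA counts
joined <- weather %>% left_join(visits, by = c("Date" = "date"))

# anti_join lists the weather dates with no matching ED row
unmatched <- weather %>% anti_join(visits, by = c("Date" = "date"))

joined
unmatched
```

Both key columns need to be of class `Date` for the match to work. A character date on one side would silently produce an all-`NA` join or an error, depending on the types involved.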

In [None]:
# Visualising temperature and ED visits

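# Reshape to long format: one row per date and variable
# (pivot_longer supersedes gather; the group_by step was not needed)
plotdat <- temporal_df %>%
  select(Date, TemperatureCAvg, synthetic_count) %>%
  pivot_longer(-Date, names_to = "variable", values_to = "value")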

ggplot(plotdat, aes(x = Date, y = value)) +
  geom_line(colour = "steelblue") +
  geom_smooth(method = "loess", se=FALSE, color = "darkred") +
  facet_wrap(~variable, nrow=2, scales="free_y") +
  labs(title = "", x = "Date", y = "") +
  theme_minimal()

## 2. Exploring Lag Effects

Lag refers to how previous observations influence current outcomes. In time-series modeling, it is common to create lagged variables to account for delayed effects, for example yesterday’s weather influencing today’s ED visits.

We will now create lagged versions of both the ED visits and temperature variables.

In [None]:
temporal_df <- temporal_df %>%
  arrange(Date) %>%
  mutate(
    lag1_temp = lag(TemperatureCAvg, 1),
    lag1_ed = lag(synthetic_count, 1),
    lag7_temp = lag(TemperatureCAvg, 7),
    lag7_ed = lag(synthetic_count, 7)
  )

head(temporal_df %>% select(Date, TemperatureCAvg, lag1_temp, synthetic_count, lag1_ed))
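One caveat: `lag()` shifts by rows, not by calendar days, so a "1-day lag" is only a true 1-day lag when every date is present. The sketch below shows the effect on a toy table with one missing day, and one possible remedy using `tidyr::complete()`. The `toy` data and the `temp` column are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy series with 3 January missing
toy <- tibble(
  Date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-04")),
  temp = c(27.1, 27.8, 28.4)
)

# Row-based lag: on 4 Jan, "lag1_temp" is actually the value from 2 Jan
naive <- toy %>%
  arrange(Date) %>%
  mutate(lag1_temp = dplyr::lag(temp, 1))

# Fill in the missing calendar day first, so one row equals one day
filled <- toy %>%
  complete(Date = seq(min(Date), max(Date), by = "day")) %>%
  arrange(Date) %>%
  mutate(lag1_temp = dplyr::lag(temp, 1))

naive
filled
```

After completing the date sequence, the lagged value on 4 January is `NA` rather than a value from two days earlier. This is the honest answer when the previous day was not observed.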

Here we plot lagged temperature and lagged ED counts against the current ED count to assess possible correlations.

In [None]:
library(GGally)

lag_plot_data <- temporal_df %>%
  select(synthetic_count, lag1_temp, lag1_ed, lag7_temp, lag7_ed) %>%
  drop_na()

ggpairs(lag_plot_data,
        upper = list(continuous = wrap("cor", size = 3)),
        title = "Lagged Variable Correlations with ED Visits")

Cross-correlation plots can be useful for determining at which lag the relationship between two series is strongest.

In [None]:
# Drop rows where either variable is NA before computing cross-correlation
ccf_data <- temporal_df %>%
  select(Date, TemperatureCAvg, synthetic_count) %>%
  drop_na()

ccf_result <- ccf(
  x = ccf_data$TemperatureCAvg,
  y = ccf_data$synthetic_count,
  plot = FALSE,
  lag.max = 30,
  na.action = na.omit
)

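# Keep non-negative lags only.
# Note: ccf(x, y) at lag k estimates cor(x[t+k], y[t]), so with temperature as x
# these lags pair ED visits with temperature k days later.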
pos_lags <- ccf_result$lag >= 0
plot(
  ccf_result$lag[pos_lags],
  ccf_result$acf[pos_lags],
  type = "h", lwd = 2,
  xlab = "Lag (days)",
  ylab = "Cross-Correlation",
  main = "CCF (0 to +30 days): Temperature vs ED Visits"
)
abline(h = c(-1.96/sqrt(nrow(ccf_data)), 1.96/sqrt(nrow(ccf_data))), col = "blue", lty = "dashed")
abline(h = 0, col = "black")

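The cross-correlation function (CCF) between daily average temperature and synthetic ED visits showed positive correlations exceeding the approximate 95% bounds at lags between 10 and 25 days. The direction needs care: at lag `k`, R's `ccf(x, y)` estimates the correlation between `x[t+k]` and `y[t]`, so with temperature as `x`, positive lags pair ED visits with temperature `k` days later. An association between temperature and ED utilization 1 to 3 weeks afterwards would instead appear at negative lags of this call, or at positive lags if the two arguments are swapped. Because the ED counts are synthetic and the dashed bounds assume uncorrelated series, these findings are exploratory and warrant further exploration using distributed lag models.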
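The sign convention of `ccf()` is easy to misread, so a small simulation with a known lead-lag structure can help check which side of zero to look at. In the sketch below, the series `x` and `y`, the 3-step delay and the noise level are illustrative assumptions. `y` is constructed to follow `x` by three time steps.

```r
set.seed(42)

n     <- 500
delay <- 3  # assumed: y responds to x three steps later

x <- rnorm(n)                                    # driver series (white noise)
y <- c(rep(NA, delay), head(x, -delay)) +        # y[t] = x[t - 3] + noise
  rnorm(n, sd = 0.5)

# Drop the first rows where y is undefined (alignment is preserved)
keep <- !is.na(y)
x <- x[keep]
y <- y[keep]

# ccf(x, y) at lag k estimates cor(x[t + k], y[t])
cc_xy   <- ccf(x, y, lag.max = 10, plot = FALSE)
peak_xy <- cc_xy$lag[which.max(cc_xy$acf)]

# Swapping the arguments mirrors the lag axis
cc_yx   <- ccf(y, x, lag.max = 10, plot = FALSE)
peak_yx <- cc_yx$lag[which.max(cc_yx$acf)]

c(peak_xy = peak_xy, peak_yx = peak_yx)
```

With `x` leading `y`, the peak of `ccf(x, y)` falls at a negative lag and the peak of `ccf(y, x)` at the matching positive lag. If the response series is passed as the first argument, "response follows driver" reads as a positive lag.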