# Time Series Analysis I

It is quite common for our data to be in the form of time-ordered observations or data points.  While we can often treat these data without addressing the time component itself (the correlation between two time series, for instance, doesn't know anything about 'time', just the pairing of the $X$ and $Y$ values), the existence of a temporal order and potential temporal relationships (covariance, etc.) does present some unique issues and concerns (as we'll discuss briefly in lecture).  There are entire courses on time series analysis and we'll just scratch the surface here.  

In this notebook, we'll look at stationarity and the decomposition of time series into trend, seasonality, and residual components using Python and specifically statsmodels. 

## Stationarity

Let's start by looking at a simple example of a stationary vs. non-stationary time series (for a discussion of this specifically with respect to statsmodels, see [here](https://www.statsmodels.org/dev/examples/notebooks/generated/stationarity_detrending_adf_kpss.html)). First, let's get some libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

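# you can omit the line below if you'd like, but I really don't like the default fonts in Matplotlib, so I switch to Helvetica
# (if Helvetica isn't installed on your machine, Matplotlib will warn and fall back to its default font)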
plt.rcParams['font.family'] = 'Helvetica'


We'll now create two time series.  Recall from lecture that a Gaussian (white) noise time series will be stationary -- it will have a mean, variance, and other statistical properties (including a lack of temporal covariance) that do not change and are not dependent on time.  We can also create a non-stationary time series by calculating a cumulative sum of that time series.  This is essentially a discrete random walk or Brownian motion, since the value at any time step depends strongly on the earlier values through the summation process.  We would a priori expect this to be non-stationary:

In [None]:
np.random.seed(1999)
n_samples = 128

# make a random normal (Gaussian) time series - this will be stationary
data1 = pd.Series(np.random.randn(n_samples))

# calculate the cumulative sum of a random Gaussian series - this will be non-stationary
data2 = pd.Series(np.cumsum(data1))

# plot both series to see what they look like, each on a single subplot
fig, ax = plt.subplots(1, 2, figsize=(6, 3))

ax[0].plot(data1)
ax[0].set_title('Random Series (Stationary)')
ax[0].set_xlabel('Time')
ax[0].set_ylabel('Value')

ax[1].plot(data2)
ax[1].set_title('Cumulative Sum (Non-Stationary)')
ax[1].set_xlabel('Time')
ax[1].set_ylabel('Value')

plt.tight_layout()
plt.show()
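One informal way to see the difference before running a formal test is to follow a rolling mean through each series. The sketch below is an illustration only: it generates its own pair of series with `np.random.default_rng` (so the values differ from those plotted above), and the 16-point window is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# illustrative data only: a fresh white-noise series and its cumulative sum
rng = np.random.default_rng(0)
noise = pd.Series(rng.standard_normal(128))
walk = noise.cumsum()

window = 16  # arbitrary window length chosen for this sketch

# rolling mean of each series; the first window-1 values are NaN
noise_rolling_mean = noise.rolling(window).mean()
walk_rolling_mean = walk.rolling(window).mean()

# how much does the local mean wander over the length of the record?
spread_noise = noise_rolling_mean.std()
spread_walk = walk_rolling_mean.std()

print(f'Spread of rolling mean, white noise: {spread_noise:.2f}')
print(f'Spread of rolling mean, random walk: {spread_walk:.2f}')
```

For the white noise the local mean stays close to zero, while for the random walk it drifts, which is the behavior the formal test is designed to detect.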


We can now call the Augmented Dickey-Fuller (ADF) test from statsmodels ([see here](https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html)).  The test returns various results, but we'll want to look at the 0th (the ADF statistic itself), the 1st (the p value), and the 4th (which shows the critical values, providing some context for the ADF statistic which is otherwise somewhat abstracted from the data):

In [None]:
# get the Augmented Dickey-Fuller (ADF) test function from statsmodels
from statsmodels.tsa.stattools import adfuller

# Perform the Augmented Dickey-Fuller (ADF) test on the random normal time series
result1 = adfuller(data1)

print('ADF Statistic:', result1[0])  # compare to the critical values below
print('p-value:', result1[1]) # this will be very SMALL, so we REJECT the null hypothesis of non-stationarity - i.e. the series is probably STATIONARY
print('Critical Values:', result1[4])


We can see that the p value is very small, which would lead us to REJECT the null hypothesis.  Somewhat oddly, the null in this case is that the time series is _non-stationary_ (formally, that it has a unit root), so rejecting that null means the series is most likely stationary (I have to think this through each time I do it).

In [None]:
# Perform the Augmented Dickey-Fuller (ADF) test on the cumulative sum time series
result2 = adfuller(data2)

print('ADF Statistic:', result2[0]) # compare to the critical values below
print('p-value:', result2[1]) # this will be LARGER than p=0.05, so we FAIL TO REJECT the null hypothesis of non-stationarity - i.e. the series is probably NON-STATIONARY
print('Critical Values:', result2[4])


We now see that the random walk (cumulative sum) time series has a p value larger than 0.05, so we FAIL TO REJECT the null and conclude that the series is most likely non-stationary. 

## Mauna Loa CO2 record

Below we're going to use the monthly atmospheric CO2 record from Mauna Loa (Hawaii) to look at ways we can use Python for time series decomposition and analysis.  The data come from [here](https://gml.noaa.gov/ccgg/trends/) and specifically [here](https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.txt).  The monthly data made available by NOAA are the average of much more frequent observations that have been quality controlled. 

First, let's read the data into a Pandas DataFrame and take a look: 

In [None]:
df = pd.read_csv('co2_mm_mlo.csv')
df

In [None]:
df.info()

### An aside: using `datetime` in Pandas and beyond

Numpy, Pandas, and xarray (amongst other libraries) have the ability to use `datetime64` and `timedelta64` types to assist with indexing temporal data (there are some limitations, however, that climate scientists should be aware of, see: https://discourse.pangeo.io/t/pandas-dtypes-now-free-from-nanosecond-limitation/3106).  There is a very clear discussion of this with respect to Pandas [here](https://pandas.pydata.org/docs/user_guide/timeseries.html). As we will see in greater detail later in the class, this provides better ways to index time series data, but also to automate processes like cross-correlation.  

Here, we're going to do a very simple datetime operation.  As you see above, the data from Mauna Loa comes with a year and month column as well as a decimal year column.  For our purposes but also for practice, let's see how we can turn the 'year' and 'month' columns (which are just integers at the moment) into a true datetime value that Pandas recognizes as providing a set of calendar dates.

First, let's use Pandas' indexing to create a new column called `date` in our DataFrame and populate it with a datetime consisting of the year column, the month column, and an added day.  Adding a day gives Pandas the complete calendar date it needs (I've used day=1, the first day of the month, but you could use e.g. day=15, the middle of the month, etc.). 

In [None]:
# combine 'year' and 'month' into a single new datetime column called 'date'
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))  # required to have 'day=1' here, which makes it a proper datetime value 
df.info() # note that the date has been created as datetime64, but the index still says 'RangeIndex: 780 entries, 0 to 779'

Now, we will set the DataFrame's index (which is currently just the row number) to use the new date column we created.  We use `inplace=True` so that this changes the DataFrame itself:

In [None]:
# now set the 'date' column as the index
df.set_index('date', inplace=True)
df.info() # note that the index now says 'DatetimeIndex: 780 entries, 1959-01-01 to 2023-12-01'

Finally, let's drop the year and month columns, since they've served their purpose.  I'm going to leave the `decimal_date` column for now:

In [None]:
# now we can drop the columns that we aren't going to use - note you need to specify 'axis' for this drop operation
df.drop(['year', 'month'], axis=1, inplace=True) 
df
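To give a sense of what a `DatetimeIndex` makes possible, here is a small sketch on a toy table. The DataFrame `toy` and its placeholder `co2` values are hypothetical and stand in for the Mauna Loa file only in layout; the indexing calls themselves are standard Pandas.

```python
import pandas as pd

# hypothetical toy table with the same year/month layout as the Mauna Loa file;
# the 'co2' values here are placeholders, not real observations
toy = pd.DataFrame({
    'year':  [1959] * 12 + [1960] * 12,
    'month': list(range(1, 13)) * 2,
    'co2':   [float(v) for v in range(24)],
})

# same steps as above: build the date, make it the index, drop the helper columns
toy['date'] = pd.to_datetime(toy[['year', 'month']].assign(day=1))
toy = toy.set_index('date').drop(columns=['year', 'month'])

# partial string indexing: all rows from one calendar year
rows_1960 = toy.loc['1960']

# a slice between two months (both endpoints are included)
spring_1959 = toy.loc['1959-03':'1959-05']

# the index exposes calendar fields, so grouping by year needs no extra column
annual_mean = toy['co2'].groupby(toy.index.year).mean()

print(rows_1960.shape, spring_1959.shape)
print(annual_mean)
```

None of these selections would work on the original integer `RangeIndex` without first filtering on the separate year and month columns.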

Let's plot our data now.  The data column co2 has an index which is a datetime `date` as you see above, so Matplotlib will automatically use that datetime for the x axis:

In [None]:
plt.plot(df['co2'],'k')
plt.xlabel('YEAR')
plt.ylabel('CO2 (ppm)')
plt.show()

For fun, let's do our stationarity test, although with a trend and seasonal component we should already anticipate the answer:

In [None]:
stationarity_test = adfuller(df['co2'])

print('ADF Statistic:', stationarity_test[0]) # compare to the critical values below
print('p-value:', stationarity_test[1]) # this will be (much!) larger than p=0.05, so we fail to reject the null hypothesis of non-stationarity - i.e. the series is non-stationary
print('Critical Values:', stationarity_test[4])

We're now going to use statsmodels' `seasonal_decompose` from its time series analysis sub-library (see [here](https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html)).  This function performs a very simple operation of identifying the trend and seasonal components and calculating the residual.  This can be useful for exploratory data analysis or for filtering, something we'll talk about on Wednesday:

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(df['co2'], model='additive', extrapolate_trend='freq', period=12)

fig = decomposition.plot() # by assigning this to a figure object, we can edit the size of the figure from the default, which is hard to read IMO
fig.set_size_inches((5, 8))
fig.tight_layout()
plt.show()
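To make the "very simple operation" concrete, here is a hand-rolled sketch of a classical additive decomposition on synthetic data. It follows the moving-average approach described in the statsmodels documentation, but it is not the library's code, it skips the `extrapolate_trend` step (so the trend is undefined for the first and last six months), and all of the series parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

# synthetic monthly series (illustrative only): linear trend + annual cycle + noise
rng = np.random.default_rng(0)
index = pd.date_range('2000-01-01', periods=240, freq='MS')
t = np.arange(240)
true_seasonal = 3.0 * np.sin(2 * np.pi * t / 12)
series = pd.Series(300 + 0.15 * t + true_seasonal + 0.1 * rng.standard_normal(240),
                   index=index)

# 1. trend: centered 12-month moving average (half weight on the two end points)
weights = np.r_[0.5, np.ones(11), 0.5] / 12
trend = pd.Series(np.nan, index=index)
trend.iloc[6:-6] = np.convolve(series.to_numpy(), weights, mode='valid')

# 2. seasonal: average the detrended values for each calendar month,
#    then shift so the twelve monthly values sum to zero
detrended = series - trend
monthly_mean = detrended.groupby(detrended.index.month).mean()
monthly_mean = monthly_mean - monthly_mean.mean()
seasonal = pd.Series(monthly_mean.loc[index.month].to_numpy(), index=index)

# 3. residual: whatever is left (NaN at the ends where the trend is undefined)
residual = series - trend - seasonal

print(monthly_mean.round(2))
```

The recovered monthly values should sit close to the sine wave that was built into the synthetic series, with the residual containing mostly the added noise.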

Once this code has been run, we can extract the components (or the results in general) from the `decomposition` object. For instance, here is the mean annual cycle, which we get by averaging each month (remember, datetime tells Python which month each data point corresponds to) from the seasonal time series:

In [None]:
# Extract the seasonal component
seasonal = decomposition.seasonal

# Calculate the mean seasonal cycle
mean_seasonal_cycle = seasonal.groupby(seasonal.index.month).mean()

# Plot the mean seasonal cycle
months = ['January', 'February', 'March', 'April', 'May', 'June', 
          'July', 'August', 'September', 'October', 'November', 'December']
plt.plot(mean_seasonal_cycle,'-o')
plt.xlabel("MONTH")
plt.xticks(ticks=range(1, 13), labels=months, rotation=45)
plt.ylabel("Mean Seasonal Component (CO2 ppm)")
plt.grid()

plt.show()

We can do something similar with the trend component.  As an example of what we might do with this, let's use Numpy to fit a linear and 2nd order polynomial line to the trend component:

In [None]:
# Extract the trend component
trend = decomposition.trend
time = df['decimal_date']

# Fit a linear trend (degree 1)
linear_fit = np.polyfit(time, trend, deg=1) # first order (linear) polynomial fit
linear_trend = np.polyval(linear_fit, time)

# Fit a quadratic trend (degree 2)
quadratic_fit = np.polyfit(time, trend, deg=2) # 2nd order polynomial fit
quadratic_trend = np.polyval(quadratic_fit, time)

# Plot the trend component together with the linear and quadratic fits
plt.plot(time, trend, label='CO2 Trend Component', color='black')
plt.plot(time, linear_trend, label='Linear Model Fit', color='red',linestyle='--')
plt.plot(time, quadratic_trend, label='Quadratic Model Fit', color='blue',linestyle='--')
plt.xlabel("YEAR")
plt.ylabel("CO2 (ppm)")

plt.legend()
plt.show()


We can do a bit of simple maths here to see which of the fits has the lowest residual (or error).  This kind of simple calculation could be used to determine or justify an interpretation of the trend -- in this case, is the trend in atmospheric CO2 linear, or is it accelerating? 

In [None]:
# Residual sum of squares for linear fit
rss_linear = np.sum((trend - linear_trend) ** 2)

# Residual sum of squares for quadratic fit
rss_quadratic = np.sum((trend - quadratic_trend) ** 2)

print(f'Residual sum-of-squares for linear fit: {rss_linear}')
print(f'Residual sum-of-squares for quadratic fit: {rss_quadratic}')

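We see that the quadratic fit has the lower residual sum-of-squares.  Keep in mind that a quadratic can never fit worse than a line by this measure, because the linear model is a special case of the quadratic one, so it is the size of the reduction that matters.  A large reduction could support an interpretation that there has been an acceleration in the rate of CO2 increase. 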
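The coefficients returned by `np.polyfit` can also be read directly: for a quadratic $a x^2 + b x + c$ the second derivative is $2a$, which is the acceleration. The sketch below shows this on synthetic data with a known, made-up acceleration (it does not use the Mauna Loa record). Centering the time axis before fitting is a choice made for this sketch to keep the powers of `x` small.

```python
import numpy as np

# synthetic trend (illustrative only): known linear and quadratic terms plus noise
rng = np.random.default_rng(0)
years = 1959 + np.arange(780) / 12          # monthly steps in decimal years
t0 = years - 1959
true_accel = 0.024                          # ppm per year^2, a made-up value
values = 315 + 0.8 * t0 + 0.5 * true_accel * t0**2 + 0.1 * rng.standard_normal(780)

# center the time axis so the powers of x stay small and well conditioned
x = years - years.mean()

linear_coeffs = np.polyfit(x, values, deg=1)      # [slope, intercept]
quadratic_coeffs = np.polyfit(x, values, deg=2)   # [a, b, c] for a*x**2 + b*x + c

rss_linear = np.sum((values - np.polyval(linear_coeffs, x)) ** 2)
rss_quadratic = np.sum((values - np.polyval(quadratic_coeffs, x)) ** 2)

# the second derivative of a*x**2 + b*x + c is 2*a
estimated_accel = 2 * quadratic_coeffs[0]

print(f'RSS linear: {rss_linear:.1f}, RSS quadratic: {rss_quadratic:.1f}')
print(f'Estimated acceleration: {estimated_accel:.4f} ppm/yr^2')
```

Note that `np.polyfit` returns the highest-order coefficient first, so the quadratic term is at index 0.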


### Can we make this record stationary? 

We saw above that the CO2 record is non-stationary (no surprise, given the obvious trend and seasonality components).  As discussed in class, there are ways we can attempt to make a series stationary.  Here, we'll do three things:  First, we'll take the log of the CO2 series - if there is a change in variance through time, this can address that component of non-stationarity.  Second, we'll take the first difference, which will remove the simple trend.  Finally, we'll do another differencing where each point is differenced relative to the point 12 values (12 months) before it.  This will remove the seasonal cycle (note that this is not the only way, or even the best way in some cases, to do this, but it does demonstrate a nice feature of the `.diff` function!)

In [None]:
# Log transformation to stabilize variance - question for you: does the variance change at Mauna Loa?
co2_log = np.log(df['co2'])

# First differencing to remove the trend
co2_diff = co2_log.diff()

# 12-month (seasonal) differencing to remove the remaining seasonal elements - the two differencing steps leave 13 NaN values at the start (1 + 12), so we use .dropna()
co2_seasonal_diff = co2_diff.diff(12).dropna()

# Plotting the transformed data (after differencing)
plt.figure(figsize=(10, 6))
plt.plot(co2_seasonal_diff, label='Stationary CO2 Data')
plt.title('Stationary CO2 Data After Differencing')
plt.xlabel('Time')
plt.ylabel('Differenced Log CO2 Concentration')
plt.legend()
plt.show()
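To see why the two differencing steps work, it helps to apply them to a series where the answer is known. The sketch below uses a synthetic, noise-free series (a straight line plus a pattern that repeats exactly every 12 months); the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# synthetic monthly series (illustrative only): straight-line trend plus a
# pattern that repeats exactly every 12 months, with no noise
index = pd.date_range('2000-01-01', periods=48, freq='MS')
t = np.arange(48)
pattern = np.tile([0, 1, 3, 4, 3, 1, -1, -3, -4, -3, -1, 0], 4)
series = pd.Series(10 + 0.5 * t + pattern, index=index)

first_diff = series.diff()           # x[t] - x[t-1]: the trend becomes a constant step
seasonal_diff = first_diff.diff(12)  # compare each month with the same month a year earlier

print(seasonal_diff.isna().sum())          # 13 (1 from the first difference + 12 from the seasonal one)
print(seasonal_diff.dropna().abs().max())  # 0.0: trend and seasonal pattern are both gone
```

With real data the result is not exactly zero, of course; what remains is the part of the series that is neither a steady trend nor a repeating annual cycle.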

Did these various transformations make the CO2 series stationary? Let's use the Augmented Dickey-Fuller test one more time to see:

In [None]:
# Apply the Augmented Dickey-Fuller test
co2_adf_result = adfuller(co2_seasonal_diff)
print(f'ADF Statistic: {co2_adf_result[0]}')
print(f'p-value: {co2_adf_result[1]}') # very very small, can REJECT null hypothesis, which means series is likely STATIONARY