#### Introduction to Statistical Models 

**Statistical models** in time series analysis are mathematical frameworks used to analyze and forecast data points collected or recorded at specific time intervals. Time series data is characterized by its temporal ordering, meaning that the order of observations is crucial, and the data often exhibits patterns over time, such as trends, seasonality, and cyclic behavior.

**Key Concepts**

 **Time Series Data:** A time series is a sequence of data points indexed in time order. Examples include daily stock prices, monthly sales figures, and annual temperature records.

**Components of Time Series:**
* Trend: The long-term movement in the data, indicating a general direction (upward or downward) over time.

* Seasonality: Regular, periodic fluctuations that occur at specific intervals (e.g., sales spikes during holidays).

* Cyclic Patterns: Rises and falls that have no fixed period, usually last longer than a seasonal cycle, and are often influenced by economic or business cycles.

* Irregular/Random Component: Unpredictable variations that cannot be attributed to trend, seasonality, or cycles.

**Stationarity:** A stationary time series has statistical properties (mean, variance, autocorrelation) that do not change over time. Many statistical models assume stationarity, and non-stationary data often needs to be transformed (e.g., differencing) to achieve stationarity.
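
As a minimal sketch of the differencing transformation mentioned above, the following helper subtracts each value from the one `lag` steps later. The function name `difference` and the trended example series are illustrative assumptions, not part of any particular library.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] = x[t] - x[t - lag]."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series with a linear trend of +2 per step (non-stationary mean).
trended = [10 + 2 * t for t in range(8)]

# First differencing removes the linear trend and leaves a constant series.
differenced = difference(trended)
print(differenced)  # [2, 2, 2, 2, 2, 2, 2]
```

Using `lag` equal to the seasonal period (for example 12 for monthly data) gives seasonal differencing under the same definition.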

**Autocorrelation:** Autocorrelation measures the correlation of a time series with its own past values. It helps identify patterns and dependencies in the data.
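
The sample autocorrelation at a given lag can be sketched in a few lines. This version uses the common estimator that divides the lagged covariance by the total sum of squared deviations; the function name and the alternating example series are assumptions made for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Hypothetical series that flips sign at every step.
alternating = [1, -1, 1, -1, 1, -1]
print(round(autocorrelation(alternating, 1), 3))  # -0.833 (strong negative dependence)
print(round(autocorrelation(alternating, 2), 3))  # 0.667
```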

**Common Statistical Models**

* Autoregressive Integrated Moving Average (ARIMA)
* Seasonal ARIMA (SARIMA)
* Exponential Smoothing
* Vector Autoregression (VAR)
* State Space Models

**Model Evaluation**
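* **Goodness of Fit:** Information criteria (e.g., AIC, BIC) and diagnostic plots (e.g., residual plots) are used to evaluate how well a model fits the data. AIC and BIC penalize model complexity, so lower values indicate a better trade-off between fit and parsimony.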

* **Forecast Accuracy:** Metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) are used to assess the accuracy of forecasts.
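
The three accuracy metrics above follow directly from their definitions. The sketch below assumes two equal-length lists of actual and predicted values; the function names and the sample numbers are made up for illustration.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent. Undefined if any actual value is 0."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual values and forecasts.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 380]

print(mae(actual, predicted))             # 17.5
print(round(rmse(actual, predicted), 2))  # 19.36
print(round(mape(actual, predicted), 2))  # 7.5
```

Because MAPE divides by the actual values, it is unstable for series that pass through or near zero, such as temperatures in degrees Celsius; MAE or RMSE may be a safer choice there.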

**Difference between deterministic and stochastic models**

**Deterministic models** 
* Deterministic models are mathematical models in which the outcome is precisely determined by the input parameters and initial conditions. There is no randomness involved; given the same initial conditions, the model will always produce the same output.

**Stochastic Models**
* Stochastic models incorporate randomness and uncertainty. The outcome is not fixed and can vary even with the same initial conditions due to the influence of random variables or processes.
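
The contrast can be illustrated with two toy generators: a deterministic linear trend and a random walk. Both functions, their parameters, and the seeds are assumptions chosen for this sketch.

```python
import random


def deterministic_trend(n, start=0.0, slope=1.0):
    """Deterministic model: the output depends only on the inputs."""
    return [start + slope * t for t in range(n)]


def random_walk(n, start=0.0, sigma=1.0, seed=None):
    """Stochastic model: each step adds a Gaussian random shock."""
    rng = random.Random(seed)
    values = [start]
    for _ in range(n - 1):
        values.append(values[-1] + rng.gauss(0.0, sigma))
    return values


# Same inputs always give the same deterministic path.
print(deterministic_trend(5) == deterministic_trend(5))  # True

# Same starting point, different random draws: the paths differ.
print(random_walk(50, seed=1) == random_walk(50, seed=2))  # False
```

Fixing the seed makes a stochastic simulation reproducible, but the model itself still describes a distribution of possible paths rather than a single outcome.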

**Conditions For Using Autoregressive Models (AR)**

**Autoregressive (AR)** models are a class of time series models that use past values of a variable to predict its future values. They are widely used in time series analysis due to their simplicity and effectiveness. However, certain conditions must be met for AR models to be appropriate and effective. The key conditions are:

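* Stationarity: The statistical properties of the series (mean, variance, and autocorrelation) do not change over time. Stationarity can be tested with the *Augmented Dickey-Fuller (ADF) test* and the *Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test*. The two tests have opposite null hypotheses: the ADF null is a unit root (non-stationarity), while the KPSS null is stationarity.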

* Linearity: AR models assume a linear relationship between the current value of the series and its past values. Residual analysis and statistical tests (e.g., Ramsey's RESET test) can help assess whether the relationship is linear.

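* No Autocorrelation in Residuals: The residuals (errors) from the AR model should not exhibit autocorrelation. The Ljung-Box test can be used to check for autocorrelation in residuals. The Durbin-Watson test is also commonly cited, but it is biased toward finding no autocorrelation when lagged values of the series are used as regressors, so it should be interpreted with caution for AR models.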

* Sufficient Data Points: A sufficient number of observations is required to estimate the parameters of the AR model reliably. *Rule of Thumb:* A common guideline is to have at least 10-15 observations for each parameter being estimated.

* No Multicollinearity: In AR models, multicollinearity refers to high correlations between the lagged values of the series, which make the individual coefficients hard to estimate precisely. The Variance Inflation Factor (VIF) can be used to assess multicollinearity among the lagged variables.

* Normality of Residuals: While not strictly necessary for the estimation of AR models, the residuals are often assumed to be normally distributed. *Testing for Normality:* The Shapiro-Wilk test, Kolmogorov-Smirnov test, and visual inspections (e.g., Q-Q plots) can be used to assess the normality of residuals.
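
As one illustration of the residual checks listed above, the Ljung-Box Q statistic can be computed directly from the sample autocorrelations: Q = n(n + 2) Σ r_k² / (n − k), summed over lags 1 to h. The sketch below computes only the statistic, not a p-value; the function names and the example series are assumptions. Under the null of no autocorrelation, Q is compared against a chi-square distribution, with the degrees of freedom reduced by the number of fitted AR parameters when the input is a set of model residuals.

```python
def sample_acf(values, lag):
    """Sample autocorrelation at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((x - mean) ** 2 for x in values)
    num = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(values, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(values)
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must satisfy 1 <= max_lag < len(values)")
    total = sum(sample_acf(values, k) ** 2 / (n - k) for k in range(1, max_lag + 1))
    return n * (n + 2) * total


# Hypothetical, strongly autocorrelated series (sign flips at every step).
series = [1.0, -1.0] * 10
q = ljung_box_q(series, max_lag=1)

# For a raw series and one lag, the 5% chi-square critical value (1 degree of
# freedom) is about 3.84, so a Q this large signals autocorrelation.
print(round(q, 1))  # 20.9
```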

### Real-World Dataset: Global Temperature Data

One widely used real-world dataset is the **Global Temperature dataset**, which contains historical temperature records from various locations around the world. This dataset is often sourced from organizations such as NASA, NOAA (National Oceanic and Atmospheric Administration), and the Hadley Centre. The dataset typically includes:

- **Date**: The time period of the recorded temperature (daily, monthly, or yearly).
- **Temperature**: Average temperature readings (in degrees Celsius or Fahrenheit).
- **Location**: Geographic coordinates or specific locations where the measurements were taken.

### How Statistical Models Could Be Used

Statistical models can be applied to the Global Temperature dataset in various ways to analyze trends, make predictions, and understand the underlying factors affecting temperature changes. Here are some specific applications:

#### 1. **Trend Analysis**

- Identify long-term trends in global temperatures over time.
- **Statistical Models**: 
  - **Linear Regression**: Fit a linear regression model to the temperature data to quantify the trend over time. This can help determine whether temperatures are increasing, decreasing, or remaining stable.
  - **Polynomial Regression**: If the trend is nonlinear, polynomial regression can be used to capture more complex patterns.
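
A linear trend can be fitted by ordinary least squares with time as the only regressor. The sketch below uses the closed-form slope and intercept; the function name and the synthetic anomaly values are assumptions, not real temperature records.

```python
def fit_linear_trend(values):
    """Fit values[t] ~ intercept + slope * t by ordinary least squares."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


# Made-up yearly temperature anomalies rising by 0.02 degrees per year.
anomalies = [0.10 + 0.02 * year for year in range(30)]
intercept, slope = fit_linear_trend(anomalies)
print(round(intercept, 3), round(slope, 3))  # approximately 0.1 and 0.02
```

A positive slope indicates warming over the sample period; on real data the slope should be reported with a standard error, which this sketch omits.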

#### 2. **Seasonality Analysis**

- Examine seasonal patterns in temperature data (e.g., warmer summers and colder winters).
- **Statistical Models**:
  - **Seasonal and Trend Decomposition Using Loess (STL)**: Decompose the time series into trend, seasonal, and residual components to analyze seasonal effects.
  - **SARIMA (Seasonal ARIMA)**: Use SARIMA models to account for both trend and seasonal effects in the data, allowing for more accurate forecasting.

#### 3. **Forecasting Future Temperatures**

- Predict future temperature values based on historical data.
- **Statistical Models**:
  - **ARIMA Models**: Use ARIMA or SARIMA models to forecast future temperatures based on past observations. This can help in understanding potential future climate scenarios.
  - **Exponential Smoothing**: Apply exponential smoothing methods to generate forecasts that give more weight to recent observations.
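
Simple exponential smoothing, the most basic member of the exponential smoothing family, can be sketched as a single recursive update. The function name, the choice to initialize the level with the first observation, and the sample values are assumptions for illustration; it handles neither trend nor seasonality.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    level[t] = alpha * x[t] + (1 - alpha) * level[t - 1], with level[0] = x[0].
    The one-step-ahead forecast is the last smoothed level.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    smoothed = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
        smoothed.append(level)
    return smoothed


# Hypothetical readings.
readings = [10.0, 12.0, 14.0]
levels = simple_exponential_smoothing(readings, alpha=0.5)
print(levels)      # [10.0, 11.0, 12.5]
print(levels[-1])  # 12.5, the forecast for the next period
```

A larger `alpha` puts more weight on recent observations, while a smaller `alpha` produces a smoother, slower-reacting forecast.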

#### 4. **Impact of External Factors**

- Investigate how external factors (e.g., CO2 emissions, land use changes) affect temperature changes.
- **Statistical Models**:
  - **Multiple Regression Analysis**: Use multiple regression to model the relationship between temperature and various independent variables (e.g., CO2 levels, urbanization, deforestation).
  - **Time Series Regression**: Combine time series analysis with regression to assess the impact of time-varying factors on temperature.

#### 5. **Anomaly Detection**

- Identify unusual temperature spikes or drops that may indicate extreme weather events or climate anomalies.
- **Statistical Models**:
  - **Control Charts**: Use control charts to monitor temperature data and identify points that fall outside of expected ranges.
  - **Z-Score Analysis**: Calculate z-scores to identify temperature anomalies based on standard deviations from the mean.
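
Z-score screening can be sketched with the standard library alone. The function name, the threshold, and the sample readings below are assumptions; the example uses a threshold of 2 because in a sample of only eight points a single outlier cannot reach a z-score of 3.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # a constant series has no anomalies under this definition
    return [(i, x) for i, x in enumerate(values) if abs((x - mean) / std) > threshold]


# Hypothetical temperature readings with one obvious spike.
temps = [14.1, 14.3, 13.9, 14.0, 14.2, 19.5, 14.1, 13.8]
print(zscore_anomalies(temps, threshold=2.0))  # [(5, 19.5)]
```

For series with trend or seasonality, z-scores are more meaningful when computed on residuals after removing those components, since the raw mean and standard deviation are then not constant over time.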

#### 6. **Climate Change Studies**

- Analyze the long-term effects of climate change on global temperatures.
- **Statistical Models**:
  - **Time Series Analysis**: Use time series models to analyze historical temperature data and assess the rate of change over decades.
  - **Machine Learning Models**: Implement machine learning techniques (e.g., Random Forest, Gradient Boosting) to predict temperature changes based on a variety of climate-related features.



### Definition of AR(p) Models

**Autoregressive (AR) Models** are a class of time series models that express the current value of a time series as a linear combination of its previous values (lags) and a stochastic error term. The notation **AR(p)** indicates that the model uses **p** lagged values of the time series for prediction.

The general form of an AR(p) model can be expressed mathematically as:

\[
X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \ldots + \phi_p X_{t-p} + \epsilon_t
\]

Where:
- \(X_t\) is the value of the time series at time \(t\).
- \(c\) is a constant (intercept).
- \(\phi_1, \phi_2, \ldots, \phi_p\) are the parameters (coefficients) of the model that represent the influence of the past values on the current value.
- \(\epsilon_t\) is a white noise error term: uncorrelated over time, with a mean of zero and constant variance. It is often additionally assumed to be normally distributed.
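
To make the equation concrete, the sketch below simulates an AR(p) process with Gaussian noise. The function name, the zero start-up values, the burn-in length, and the example coefficients are all assumptions chosen for illustration.

```python
import random


def simulate_ar(coeffs, n, c=0.0, sigma=1.0, burn_in=200, seed=None):
    """Simulate n values of X_t = c + sum_k phi_k * X_{t-k} + eps_t.

    coeffs[0] is phi_1 (weight on X_{t-1}), coeffs[1] is phi_2, and so on.
    """
    rng = random.Random(seed)
    p = len(coeffs)
    values = [0.0] * p  # arbitrary start-up values, discarded with the burn-in
    for _ in range(burn_in + n):
        # values[-k] is the value k steps back, i.e. X_{t-k}.
        lagged = sum(phi * values[-k] for k, phi in enumerate(coeffs, start=1))
        values.append(c + lagged + rng.gauss(0.0, sigma))
    return values[p + burn_in:]


# Hypothetical stationary AR(2) process with phi_1 = 0.6 and phi_2 = -0.2.
series = simulate_ar([0.6, -0.2], n=5, seed=42)
print(len(series))  # 5
```

The burn-in discards early values so that the returned sample is less affected by the arbitrary zero initialization.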

### Role of Past Values in Prediction

The past values of the time series play a crucial role in the AR(p) model for several reasons:

1. **Capturing Temporal Dependencies**:
   - The AR(p) model leverages the idea that past values of a time series can provide valuable information about its future values. By including lagged values, the model captures the temporal dependencies inherent in the data. For example, if a time series exhibits a strong correlation with its previous values, the AR model can effectively utilize this information for forecasting.

2. **Parameter Estimation**:
   - The coefficients \(\phi_1, \phi_2, \ldots, \phi_p\) are estimated from the historical data. These parameters quantify the strength and direction of the relationship between the current value and its past values. A positive coefficient indicates that, holding the other lags fixed, a higher past value is associated with a higher current value, while a negative coefficient suggests the opposite.

3. **Lag Selection**:
   - The choice of the number of lags \(p\) is critical in AR(p) models. A higher value of \(p\) allows the model to capture more complex relationships in the data, but it also increases the risk of overfitting. Various criteria, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), can be used to determine the optimal number of lags.

4. **Forecasting**:
   - When making predictions, the AR(p) model uses the most recent \(p\) observations to compute the forecast. For instance, to predict \(X_{t+1}\), the model would use \(X_t, X_{t-1}, \ldots, X_{t-p+1}\). The forecast is the constant \(c\) plus a weighted sum of these past values, where the weights are the estimated coefficients.

5. **Dynamic Nature**:
   - The AR(p) model is dynamic, meaning that as new data points become available, the model can be updated to reflect the most recent information. This adaptability is essential for time series forecasting, as it allows the model to respond to changes in the underlying process generating the data.
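
The forecasting step can be sketched as follows, assuming the coefficients have already been estimated. Multi-step forecasts are produced by feeding each forecast back in as if it were an observation, with the error term set to its mean of zero. The function name and the example numbers are hypothetical.

```python
def forecast_ar(history, coeffs, c=0.0, steps=1):
    """Iterated point forecasts from an AR(p) model with known coefficients.

    coeffs[0] is phi_1 (weight on the most recent value), coeffs[1] is phi_2, etc.
    """
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("history must contain at least p observations")
    window = list(history[-p:]) if p else []
    forecasts = []
    for _ in range(steps):
        # window[-k] is the value k steps back; the future error is replaced by 0.
        nxt = c + sum(phi * window[-k] for k, phi in enumerate(coeffs, start=1))
        forecasts.append(nxt)
        window.append(nxt)
    return forecasts


# Hypothetical AR(2) model: X_t = 1.0 + 0.5 * X_{t-1} + 0.25 * X_{t-2} + eps_t
history = [1.0, 2.0, 3.0]
print(forecast_ar(history, coeffs=[0.5, 0.25], c=1.0, steps=2))  # [3.0, 3.25]
```

The first forecast is 1.0 + 0.5 × 3.0 + 0.25 × 2.0 = 3.0; the second reuses that forecast in place of the unobserved value.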

### Example

Consider a simple AR(1) model, which uses only the most recent past value:

\[
X_t = c + \phi_1 X_{t-1} + \epsilon_t
\]

In this case, the current value \(X_t\) is directly influenced by the immediately preceding value \(X_{t-1}\). If \(\phi_1\) is positive, it indicates that if \(X_{t-1}\) was high, \(X_t\) is likely to be high as well. Conversely, if \(\phi_1\) is negative, a high \(X_{t-1}\) would suggest a lower \(X_t\).
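
As a rough sketch of how \(c\) and \(\phi_1\) might be estimated, the AR(1) model can be fitted by regressing each value on its predecessor with ordinary least squares. The function name, the simulated data, and the true values 1.0 and 0.7 are assumptions for illustration; dedicated time series libraries offer more complete estimators.

```python
import random


def fit_ar1(series):
    """Estimate (c, phi_1) by least squares regression of X_t on X_{t-1}."""
    prev, curr = series[:-1], series[1:]
    n = len(curr)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    sxy = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    sxx = sum((a - mean_prev) ** 2 for a in prev)
    phi = sxy / sxx
    c = mean_curr - phi * mean_prev
    return c, phi


# Simulate a hypothetical AR(1) process with c = 1.0 and phi_1 = 0.7.
rng = random.Random(0)
series = [0.0]
for _ in range(2000):
    series.append(1.0 + 0.7 * series[-1] + rng.gauss(0.0, 1.0))

c_hat, phi_hat = fit_ar1(series)
print(round(c_hat, 2), round(phi_hat, 2))  # estimates should land near 1.0 and 0.7
```

With a positive estimate of \(\phi_1\), values above the series mean tend to be followed by values that are also above the mean, which matches the interpretation given above.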