In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

In [24]:
# Read the dataset from 'CO2_Data.csv' into a DataFrame called df
df = pd.read_csv('CO2_Data.csv')
df

Unnamed: 0,CO2,Year&Month,Year,Month
0,333.13,1974.38,1974,5
1,332.09,1974.46,1974,6
2,331.10,1974.54,1974,7
3,329.14,1974.63,1974,8
4,327.36,1974.71,1974,9
...,...,...,...,...
156,351.71,1987.38,1987,5
157,350.94,1987.46,1987,6
158,349.10,1987.54,1987,7
159,346.77,1987.63,1987,8


In [25]:
# Drop rows with missing values
df.dropna(inplace=True)

In [26]:
df.set_index(['Year&Month', 'Year', 'Month'], inplace=True)
co2_data = df['CO2']
co2_data

Year&Month  Year  Month
1974.38     1974  5        333.13
1974.46     1974  6        332.09
1974.54     1974  7        331.10
1974.63     1974  8        329.14
1974.71     1974  9        327.36
                            ...  
1987.38     1987  5        351.71
1987.46     1987  6        350.94
1987.54     1987  7        349.10
1987.63     1987  8        346.77
1987.71     1987  9        345.73
Name: CO2, Length: 161, dtype: float64

    Creating Lag Features: The loop for i in range(1, 4) iterates over the values 1, 2, and 3 to create lag features, which are previous values of the CO2 column. In each iteration, the line df[f'CO2_Lag{i}'] = df['CO2'].shift(i) adds a new column to df named CO2_Lag{i} (CO2_Lag1, CO2_Lag2, CO2_Lag3). Each new column holds the 'CO2' values shifted down by i rows, so every row lines up with the value observed i time steps earlier. This gives a sliding window of size 3: 'CO2_Lag1' is the CO2 value from the previous time step, 'CO2_Lag2' is the value from two time steps ago, and 'CO2_Lag3' is the value from three time steps ago. The first i rows of each lag column have no earlier value to draw on, so they are NaN.


    Splitting the Data: Before the dataset can be split into training and test sets, the input features and the target have to be separated. The line X = df[['CO2_Lag1', 'CO2_Lag2', 'CO2_Lag3']] selects the three lag columns from df and assigns them to X, the input features for the model. The line y = df['CO2'] selects the 'CO2' column as the target variable and assigns it to y. X therefore contains the lagged CO2 values and y the corresponding actual CO2 values. The train/test split itself is performed in a later cell with train_test_split.

By performing these steps, you drop rows with missing values from the raw data, create the lagged features, and separate the inputs from the target for the autoregression model.

In [13]:
# Create lag features using a sliding window of 3
for i in range(1, 4):
    df[f'CO2_Lag{i}'] = df['CO2'].shift(i)

In [14]:
# Select the lag features as inputs (X) and CO2 as the target (y)
X = df[['CO2_Lag1', 'CO2_Lag2', 'CO2_Lag3']]
y = df['CO2']

In the context of time series forecasting, lags refer to using previous observations of a variable to predict its future values. Lagged features capture the relationship between a variable and its past values, allowing the model to incorporate the temporal dependencies and patterns in the data.

When creating lagged features, you shift the values of the variable backward in time by a certain number of time steps. For example, if you have monthly data, shifting the variable by one time step creates a lag of one month, and shifting it by two time steps creates a lag of two months.

By including lagged features in a time series forecasting model, you enable the model to use the historical values of the variable as input features to predict its future values. The idea is that the past behavior and patterns of the variable can provide valuable information for predicting its future behavior.
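To make the shifting concrete, here is a minimal sketch on a made-up five-row series (the values are illustrative and do not come from CO2_Data.csv). It applies the same shift(i) call used in this notebook and shows that each lag column lines up with the value from i rows earlier, leaving NaN where no earlier value exists.

```python
import pandas as pd

# Toy monthly series; the values are invented for illustration only
toy = pd.DataFrame({"CO2": [330.0, 331.0, 332.0, 333.0, 334.0]})

# shift(i) moves each value i rows down, so row t holds the value from row t - i
for i in range(1, 4):
    toy[f"CO2_Lag{i}"] = toy["CO2"].shift(i)

# The first i rows of CO2_Lag{i} are NaN because nothing precedes them
print(toy)
```

In row 3, for example, CO2 is 333.0 while the three lag columns hold the three values that came before it.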

Shifting leaves NaN in the first rows of each lag column (one row for CO2_Lag1, up to three rows for CO2_Lag3). These missing values are filled with the mean of the corresponding column using the fillna method. The lines X = X.replace([np.inf, -np.inf], np.nan) and y = y.replace([np.inf, -np.inf], np.nan) first replace any infinite values with NaN, so they are filled with the mean as well. This ensures that the dataset contains only valid values for training the autoregression model.

In [17]:
# Check for and handle any remaining missing or invalid values
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(X.mean())
y = y.replace([np.inf, -np.inf], np.nan)
y = y.fillna(y.mean())
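The effect of the cell above can be checked on a small made-up column (toy_X and its values are hypothetical, chosen only for illustration). The sketch uses the same replace and fillna calls and shows that the imputed rows receive the column average rather than a real past observation.

```python
import numpy as np
import pandas as pd

# Toy lag column with one missing and one infinite entry (invented values)
toy_X = pd.DataFrame({"CO2_Lag1": [np.nan, 330.0, 332.0, np.inf]})

# Step 1: turn infinities into NaN so they are treated as missing
toy_X = toy_X.replace([np.inf, -np.inf], np.nan)

# Step 2: fill every NaN with the mean of the remaining values in the column
filled = toy_X.fillna(toy_X.mean())

print(filled)
```

Here the mean is taken over the two valid entries only, since pandas skips NaN when computing it.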

In [18]:
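# Split chronologically: shuffle=False keeps the first 80% for training and the last 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)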
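Because this is time series data, shuffle=False matters: it keeps the rows in their original order, so the model is trained on earlier observations and evaluated on later ones. A minimal sketch on a made-up ten-row dataset (toy_X, toy_y, and the column names are hypothetical) illustrates this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows whose index doubles as a time order (invented values)
toy_X = pd.DataFrame({"lag1": range(10)})
toy_y = pd.Series(range(10, 20), name="target")

# Same call as in the notebook: 20% test size, no shuffling
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=False
)

# Every training row comes before every test row
print(list(X_tr.index))
print(list(X_te.index))
```

With shuffling enabled, later observations could end up in the training set, and the test error would no longer reflect forecasting unseen future values.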


In [19]:
# Train the autoregression model
model = LinearRegression()
model.fit(X_train, y_train)


LinearRegression()

In [22]:
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

In [23]:
# Calculate metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)

print('Training set metrics:')
print('MSE:', mse_train)
print('MAE:', mae_train)
print('---')
print('Test set metrics:')
print('MSE:', mse_test)
print('MAE:', mae_test)

Training set metrics:
MSE: 1.0030192411108287
MAE: 0.7172473435727573
---
Test set metrics:
MSE: 0.8035712164526403
MAE: 0.7225679101518674
