### **Notebook 4 - Forecasting and saving the model: SARIMA**

### Introduction

In the previous notebooks, I cleaned the data, carried out exploratory data analysis, and fitted several models using autoregression, linear regression, and decision trees. The SARIMA model performed best. In this notebook, I use that model to forecast future AQI values and save the fitted model as a pickle file.

Table of contents:

- Importing libraries and loading dataset
- Forecasting
- Saving the final model

***
### Importing libraries and loading dataset

In [2]:
# Importing libraries

import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
from statsmodels.api import tsa
import statsmodels.api as sm
from datetime import datetime
from statsmodels.tsa.statespace.sarimax import SARIMAX

In [3]:
# Loading dataset

aqi_df = pd.read_csv('data/cleaned_aqi.csv', parse_dates=['Date'], index_col='Date')
aqi_df.sample(5)

| Date | AQI | Category | Defining Parameter | Number of Sites Reporting | city_ascii | state_name | lat | lng | population | density | timezone | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1984-02-01 | 79 | Moderate | NO2 | 20 | Los Angeles | California | 34.1141 | -118.4068 | 12531334.0 | 3267.0 | America/Los_Angeles | 1984 | 2 | 1 |
| 2006-12-11 | 54 | Moderate | PM2.5 | 20 | Los Angeles | California | 34.1141 | -118.4068 | 12531334.0 | 3267.0 | America/Los_Angeles | 2006 | 12 | 11 |
| 2004-02-01 | 66 | Moderate | PM2.5 | 19 | Los Angeles | California | 34.1141 | -118.4068 | 12531334.0 | 3267.0 | America/Los_Angeles | 2004 | 2 | 1 |
| 1997-12-22 | 74 | Moderate | NO2 | 18 | Los Angeles | California | 34.1141 | -118.4068 | 12531334.0 | 3267.0 | America/Los_Angeles | 1997 | 12 | 22 |
| 2007-06-13 | 150 | Unhealthy for Sensitive Groups | Ozone | 19 | Los Angeles | California | 34.1141 | -118.4068 | 12531334.0 | 3267.0 | America/Los_Angeles | 2007 | 6 | 13 |


In [4]:
# Creating a new dataframe with target variable only and resampling to monthly averages

dates_aqi_df = aqi_df[['AQI']]

air_quality_monthly = dates_aqi_df.resample('M').mean()

In [5]:
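# Train-test split: monthly means are labelled by month end, so the training set
# runs through December 2012 and the test set starts in January 2013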

train = pd.DataFrame(air_quality_monthly.loc[air_quality_monthly.index <= '2013-01-01', 'AQI'])
test = pd.DataFrame(air_quality_monthly.loc[air_quality_monthly.index > '2013-01-01', 'AQI'])
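
Because the monthly resample labels each observation with the last day of its month, the `'2013-01-01'` cutoff places December 2012 in the training set and January 2013 in the test set. A minimal sketch of that boundary behaviour, using a hypothetical four-month index and made-up AQI values (`monthly_demo`, `train_demo` and `test_demo` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical month-end index standing in for air_quality_monthly.index;
# the AQI values are made up for illustration.
idx = pd.to_datetime(["2012-11-30", "2012-12-31", "2013-01-31", "2013-02-28"])
monthly_demo = pd.DataFrame({"AQI": [58.0, 55.0, 57.0, 59.0]}, index=idx)

cutoff = "2013-01-01"
train_demo = monthly_demo.loc[monthly_demo.index <= cutoff]
test_demo = monthly_demo.loc[monthly_demo.index > cutoff]

# Last training month and first test month
print(train_demo.index.max().date(), test_demo.index.min().date())
```

The same check on the real data would be `train.index.max()` and `test.index.min()`.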

***
### Forecasting

In [6]:
# Building and fitting the SARIMA model with the optimized hyperparameters

p, d, q = 0, 1, 1
P, D, Q, s = 0, 1, 1, 12

sarima_model = SARIMAX(train, order=(p, d, q), seasonal_order=(P, D, Q, s))
sarima_model_fit = sarima_model.fit()

sarima_model_fit.summary()

 This problem is unconstrained.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  4.35273D+00    |proj g|=  4.39696D-02

At iterate    5    f=  4.28297D+00    |proj g|=  9.78999D-03

At iterate   10    f=  4.27488D+00    |proj g|=  2.49279D-05

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     11     13      1     0     0   2.163D-06   4.275D+00
  F =   4.2748785648708374     

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL            


| | | | |
|---|---|---|---|
| Dep. Variable: | AQI | No. Observations: | 396.0 |
| Model: | SARIMAX(0, 1, 1)x(0, 1, 1, 12) | Log Likelihood | -1692.852 |
| Date: | Sun, 14 Apr 2024 | AIC | 3391.704 |
| Time: | 16:38:57 | BIC | 3403.548 |
| Sample: | 01-31-1980 - 12-31-2012 | HQIC | 3396.402 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ma.L1 | -0.9121 | 0.026 | -35.572 | 0.000 | -0.962 | -0.862 |
| ma.S.L12 | -0.7679 | 0.039 | -19.706 | 0.000 | -0.844 | -0.691 |
| sigma2 | 390.7269 | 25.286 | 15.452 | 0.000 | 341.168 | 440.286 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 8.36 | Jarque-Bera (JB): | 7.67 |
| Prob(Q): | 0.0 | Prob(JB): | 0.02 |
| Heteroskedasticity (H): | 0.35 | Skew: | -0.14 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 3.64 |
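
For reference, the specification fitted above can be written out with the backshift operator $B$ (where $B\,y_t = y_{t-1}$). This sketch assumes statsmodels' sign convention, in which moving-average coefficients enter with a plus sign, and uses $y_t$ for the monthly mean AQI:

$$
(1 - B)(1 - B^{12})\, y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)
$$

Under that convention, the estimates in the summary correspond to $\theta_1 \approx -0.9121$ (`ma.L1`), $\Theta_1 \approx -0.7679$ (`ma.S.L12`) and $\sigma^2 \approx 390.73$ (`sigma2`).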


In [7]:
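# Generating in-sample predictions (from the second training observation onward)
# and out-of-sample forecasts through one month past the end of the test period

forecast = sarima_model_fit.predict(start=1, end=len(train)+len(test))

# AQI cannot be negative, so any negative predictions are clipped to zero

forecast[forecast < 0] = 0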
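
The `predict` call above returns in-sample predictions together with the out-of-sample horizon, so comparing it with `test` means aligning the two series on their dates first. A minimal sketch of that alignment with two common error metrics follows; `forecast_errors` is a hypothetical helper and the numbers are made up, so this is not the evaluation used in the earlier notebooks:

```python
import numpy as np
import pandas as pd


def forecast_errors(actual: pd.Series, predicted: pd.Series) -> dict:
    """Return MAE and RMSE over the dates that both series share."""
    aligned = pd.concat(
        [actual.rename("actual"), predicted.rename("predicted")],
        axis=1,
        join="inner",  # keep only dates present in both series
    )
    err = aligned["actual"] - aligned["predicted"]
    return {
        "mae": float(err.abs().mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
    }


# Made-up values: the prediction series starts one month before the actuals,
# mimicking a forecast that also covers part of the training period.
dates = pd.to_datetime(["2012-12-31", "2013-01-31", "2013-02-28", "2013-03-31"])
predicted_demo = pd.Series([61.0, 52.0, 57.0, 74.0], index=dates)
actual_demo = pd.Series([50.0, 60.0, 70.0], index=dates[1:])

metrics = forecast_errors(actual_demo, predicted_demo)
print(metrics["mae"])  # 3.0
```

With the notebook's variables, the equivalent call would be `forecast_errors(test['AQI'], forecast)`.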

In [27]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=train.index, y=train['AQI'], mode='lines', name='Train'))
fig.add_trace(go.Scatter(x=test.index, y=test['AQI'], mode='lines', name='Test'))
fig.add_trace(go.Scatter(x=forecast.index, y=forecast.values, mode='lines', name='Forecast'))
fig.update_xaxes(rangeslider_visible=True)

fig.update_layout(
    yaxis_title='Monthly mean AQI',
    xaxis_title='Date',
    title='Train and test sets with SARIMA forecast')

fig.show()

***
### Saving the final model

In [32]:
import pickle

# Save the fitted SARIMA model as a pickle file
model_pkl_file = "sarima_model.pkl"

with open(model_pkl_file, 'wb') as file:
    pickle.dump(sarima_model_fit, file)
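
To reuse the saved model later, the pickle file can be read back with `pickle.load`. The sketch below shows the round trip with two hypothetical helpers, `save_model` and `load_model`, and a plain dictionary standing in for the fitted results object so that it runs without refitting anything. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable model object to `path`."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object written by save_model. Only unpickle trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object so the sketch runs on its own; in the notebook this
# would be sarima_model_fit.
stand_in = {"order": (0, 1, 1), "seasonal_order": (0, 1, 1, 12)}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "sarima_model.pkl"
    save_model(stand_in, pkl_path)
    restored = load_model(pkl_path)

print(restored == stand_in)  # True
```

For the model saved above, `loaded = load_model("sarima_model.pkl")` followed by `loaded.forecast(steps=12)` should return the 12 monthly values after the end of the training sample, assuming a compatible statsmodels version is installed in the environment that loads the file.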