## Baseline Forecasts - Naive and Seasonal Naive

Before moving on to other algorithms, I want to establish a baseline using simple naive forecasting techniques. The naive forecast uses the previous week's value as the prediction, and the seasonal naive forecast uses the value from the same week of the previous year (a 52-week lag). I then evaluate the performance of both.

In [45]:
#loading libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


In [46]:
#loading data
df = pd.read_csv('air_pollution.csv')

In [47]:
#converting to datetime
df['date'] = pd.to_datetime(df['date'])

In [48]:
# scaler = MinMaxScaler()

# for column in df.columns[1:]:
    
#     df[column] = scaler.fit_transform(df[[column]])

In [49]:
#melting the dataframe so that each row holds the data for a specific pollutant, as seen below
df = pd.melt(df, id_vars='date', value_vars=df.columns[1:]).reset_index(drop=True)

#renaming columns
df.columns = ['date','pollutant','quantity']

#sorting based on date and pollutant
df = df.sort_values(['date','pollutant']).reset_index(drop=True)

df.head()

| | date | pollutant | quantity |
|---|---|---|---|
| 0 | 2004-02-15 | CO | 1.068245 |
| 1 | 2004-02-15 | NO_2 | 79.180782 |
| 2 | 2004-02-15 | O_3 | 12.400776 |
| 3 | 2004-02-15 | PM10 | 51.815004 |
| 4 | 2004-02-15 | PM25 | 27.801169 |


In [50]:
#this function creates a lag feature
def create_lag(df,lag):
    
    #the new column is named after the lag input (e.g. lag_1) and is shifted within each pollutant
    column_name = 'lag_'+str(lag)
    df[column_name] = df.groupby(['pollutant'])['quantity'].shift(lag)
    
    #the first `lag` rows of each pollutant have no earlier value, so they are dropped
    df = df.dropna()
    
    return df

In [51]:
#creating naive forecast (lag of 1 time step - 1 week)
df = create_lag(df,1)
#creating seasonal naive forecast (lag of 1 year - 52 time steps - 52 weeks)
df = create_lag(df,52)
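
To make the lag step concrete, here is a minimal sketch on a toy dataframe. The values and the names `toy` and `toy_lagged` are made up for illustration and are not part of the dataset. It shows that shifting within each pollutant group keeps the series separate, and that rows without an earlier observation are dropped.

```python
import pandas as pd

# Toy data (hypothetical values): two pollutants observed over three weeks.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2004-02-15', '2004-02-15',
                            '2004-02-22', '2004-02-22',
                            '2004-02-29', '2004-02-29']),
    'pollutant': ['CO', 'NO_2'] * 3,
    'quantity': [1.0, 80.0, 1.2, 75.0, 0.9, 78.0],
})

# Shift within each pollutant so a CO row never receives a NO_2 value.
toy['lag_1'] = toy.groupby('pollutant')['quantity'].shift(1)

# The first week has no previous observation, so its lag is NaN and is dropped.
toy_lagged = toy.dropna().reset_index(drop=True)
print(toy_lagged)
```

With a lag of 52 the same logic applies, so the first 52 weeks of each pollutant are lost to the `dropna` call.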

In [52]:
#For all the forecasts, train is the data before 2017 and test is the data from 2017 onwards
#although no training is required here, the test set should be consistent with the other models
train = df[df['date'].dt.year<2017]
test = df[df['date'].dt.year>=2017]

In [53]:
#evaluation metrics are SMAPE and RMSE
def smape(y_true, pred):
    return 100/len(y_true) * np.sum(2 * np.abs(pred - y_true) / (np.abs(y_true) + np.abs(pred)))

def rmse(y_true,pred):
    return np.sqrt(np.mean((pred-y_true)**2))
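
As a sanity check on the two metrics, here is a small worked example. The arrays are made-up numbers chosen so the arithmetic can be followed by hand, and the names `smape_value` and `rmse_value` are only for illustration. The two functions are repeated so the snippet runs on its own.

```python
import numpy as np

def smape(y_true, pred):
    return 100/len(y_true) * np.sum(2 * np.abs(pred - y_true) / (np.abs(y_true) + np.abs(pred)))

def rmse(y_true, pred):
    return np.sqrt(np.mean((pred - y_true)**2))

# Hypothetical values: two observations and their forecasts.
y_true = np.array([100.0, 200.0])
pred = np.array([110.0, 180.0])

# SMAPE terms: 2*10/210 and 2*20/380, averaged and multiplied by 100.
smape_value = smape(y_true, pred)

# RMSE: sqrt((10**2 + 20**2) / 2) = sqrt(250).
rmse_value = rmse(y_true, pred)

print('SMAPE: ', smape_value)
print('RMSE: ', rmse_value)
```

Note that the SMAPE ratio is undefined when the actual value and the prediction are both zero, since the denominator is then zero.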

#### For Naive Forecast (1 week lag)

In [54]:
print('SMAPE: ',smape(np.array(test['quantity']).ravel(), np.array(test['lag_1']).ravel()))
print('RMSE: ',rmse(np.array(test['quantity']).ravel(), np.array(test['lag_1']).ravel()))

SMAPE:  24.015337719465975
RMSE:  7.744063138965667


#### For Seasonal Naive Forecast (52 week lag value)

In [55]:
print('SMAPE: ',smape(np.array(test['quantity']).ravel(), np.array(test['lag_52']).ravel()))
print('RMSE: ',rmse(np.array(test['quantity']).ravel(), np.array(test['lag_52']).ravel()))

SMAPE:  29.4866089085294
RMSE:  9.080814752942782
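
The scores above are pooled over all pollutants, which are on very different scales (CO is around 1 while NO_2 is around 80 in the sample rows), so the pooled RMSE is likely dominated by the larger-valued pollutants. A per-pollutant breakdown could be sketched as follows. This is only an illustration on a made-up dataframe, `toy_test`, standing in for `test`; it is not something run in this notebook.

```python
import numpy as np
import pandas as pd

def rmse(y_true, pred):
    return np.sqrt(np.mean((pred - y_true) ** 2))

# Hypothetical stand-in for `test`, with the same column names.
toy_test = pd.DataFrame({
    'pollutant': ['CO', 'CO', 'NO_2', 'NO_2'],
    'quantity': [1.0, 1.2, 80.0, 70.0],
    'lag_1': [1.1, 1.0, 74.0, 78.0],
})

# RMSE computed separately for each pollutant.
per_pollutant_rmse = {
    name: rmse(group['quantity'].to_numpy(), group['lag_1'].to_numpy())
    for name, group in toy_test.groupby('pollutant')
}

# RMSE over all rows together, as in the cells above.
pooled_rmse = rmse(toy_test['quantity'].to_numpy(), toy_test['lag_1'].to_numpy())

print(per_pollutant_rmse)
print('Pooled RMSE: ', pooled_rmse)
```

In this toy case the pooled value sits close to the NO_2 error and says little about CO, which is why a per-pollutant view can be a useful complement.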
