# Model Development and Evaluation using Mean Daily Values
This notebook explores how well a multiple linear regression can model the daily mean concentration of four pollutants using the daily mean weather data and the upstream oil and gas activity data.

## Definition of Models
A set of features was selected for each pollutant based on the correlations shown in the heat map and the scatter plots.

**Y1: SO2**
X1: Temperature, X2: Wind speed, X3: Humidity, X4: Gas produced

**Y2: TRS**
X1: Wind speed, X2: Wind direction, X3: Depth drilled, X4: Gas produced

**Y3: NO2**
X1: Temperature, X2: Wind direction, X3: Wind speed, X4: Humidity, X5: Depth drilled, X6: Gas produced

**Y4: O3**
X1: Temperature, X2: Wind direction, X3: Wind speed, X4: Humidity, X5: Depth drilled, X6: Gas produced
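
The four feature sets above can be kept in one mapping so that each model is built from the same definition. This is only a sketch: the column names (`temperature`, `wind_speed`, and so on) and the helper `select_xy` are hypothetical and would need to match the actual headers in `daily_mean.csv`.

```python
import pandas as pd

# Hypothetical column names; replace with the real headers in daily_mean.csv.
FEATURES = {
    "SO2": ["temperature", "wind_speed", "humidity", "gas_produced"],
    "TRS": ["wind_speed", "wind_direction", "depth_drilled", "gas_produced"],
    "NO2": ["temperature", "wind_direction", "wind_speed", "humidity",
            "depth_drilled", "gas_produced"],
    "O3": ["temperature", "wind_direction", "wind_speed", "humidity",
           "depth_drilled", "gas_produced"],
}


def select_xy(df: pd.DataFrame, pollutant: str):
    """Return (X, y) for one pollutant, dropping rows with missing values."""
    cols = FEATURES[pollutant]
    subset = df[cols + [pollutant]].dropna()
    return subset[cols], subset[pollutant]
```

With the real column names in place, `select_xy(daily_mean, "SO2")` would return the feature matrix and target for the SO2 model.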

### Data Splitting, Loss Functions & Cross-Validation

**Data Splitting**
The dataset will be split into a training set and a testing set. The test set will hold approximately 1/3 of the data so that both the training and the testing sets remain representative of this relatively small dataset.
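
A minimal sketch of such a split with `train_test_split`, run on synthetic stand-in data rather than the real daily means. The column names, the 90-row size, and `random_state=42` are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the daily data (90 days, 2 features).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "temperature": rng.normal(10, 5, 90),
    "wind_speed": rng.uniform(0, 20, 90),
})
y = pd.Series(0.5 * X["temperature"] + rng.normal(0, 1, 90), name="SO2")

# Hold out roughly one third of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing `random_state` keeps the split reproducible between runs, which makes the later model comparisons easier to interpret.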

**Feature Selection**
The correlation heat maps will be used to select the independent variables (features). The features will not be scaled because these are not large datasets, and therefore the time to converge is not a concern.

**Loss Functions**
The sklearn LinearRegression fits a linear model by ordinary least squares, which minimizes the residual sum of squares (equivalently, the mean squared error and therefore the Root Mean Squared Error, RMSE). This loss function cannot be changed. Because the errors are squared, large residuals are penalized heavily, so outlier data points have a strong influence on the fitted coefficients.
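
To make the loss concrete, the sketch below fits `LinearRegression` on synthetic data and checks that it agrees with a direct least-squares solve using `np.linalg.lstsq`. The data and coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 3 features, known coefficients, small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# Solve the same least-squares problem directly,
# using a column of ones to represent the intercept.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residual sum of squares and RMSE of the fitted model.
rss = np.sum((y - model.predict(X)) ** 2)
rmse = np.sqrt(rss / len(y))
print(np.allclose(model.coef_, beta[1:]), np.isclose(model.intercept_, beta[0]))
```

Since RMSE is a monotonic function of the residual sum of squares, the coefficients that minimize one also minimize the other.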

**Cross Validation**
A train-validation-test split will not be used. Because the model is working with a smaller dataset, introducing a validation set would make the training, validation, and testing sets each smaller and less representative of the sample data. The data will therefore only be split into a training set and a testing set.
A Grid Search Cross-Validation will be used instead. This grid will explore different parameters for the Linear Regression, including whether to fit the intercept and whether the regressors X are normalized before regression. A 5-fold split will be used to cross-validate within this grid search.
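
A hedged sketch of such a grid search on synthetic data. Recent scikit-learn releases no longer accept a `normalize` argument on `LinearRegression` (it was removed in version 1.2), so this sketch assumes the normalization choice is expressed as an optional `StandardScaler` step in a `Pipeline`. The step names `scale` and `reg` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a non-zero intercept.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 5.0 + rng.normal(scale=0.5, size=100)

pipe = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])

# Grid: scale the regressors or pass them through unchanged,
# and fit the intercept or force it to zero.
param_grid = {
    "scale": [StandardScaler(), "passthrough"],
    "reg__fit_intercept": [True, False],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each of the four parameter combinations is scored as the mean R^2^ over the five folds, and `search.best_estimator_` is refit on all of the data passed to `fit`.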

**Evaluation & Testing**
The model will be evaluated using the R^2^ score. The R^2^ measures the proportion of variance in the dependent variable that is predictable from the independent variables and is commonly used for linear regression. A higher R^2^ value indicates a better fit. Residual plots will also be used to check that no transformations of the data are required and that a linear model is an appropriate choice (versus a polynomial).
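
The sketch below illustrates the R^2^ calculation and the residuals on synthetic data: it compares `r2_score` with the definition 1 - SS_res / SS_tot evaluated on a held-out test set. The data and the split seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, genuinely linear data.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(80, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 from sklearn and from its definition.
r2 = r2_score(y_test, y_pred)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Residuals to plot against the predictions.
residuals = y_test - y_pred
print(round(r2, 3), round(r2_manual, 3))
```

Plotting `residuals` against `y_pred` (for example with `plt.scatter`) gives the residual plot: a structureless band around zero supports the linear model, while a curved pattern would suggest a transformation or a polynomial term.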

**Other Possible Models for Future Consideration**
Sklearn offers other linear models: the Huber Regressor, a linear regression that is more robust to outliers, and the Lasso model, which adds an L1 penalty to the least-squares loss (Elastic Net is the variant that combines L1 and L2 penalties).
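
As a rough illustration of the robustness difference, the sketch below fits `LinearRegression` and `HuberRegressor` to synthetic data in which a few high-x points have been shifted upward. The data, the number of outliers, and their size are all made up; the default `HuberRegressor` settings are used.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic linear data with true slope 3.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Inject outliers at the five largest x values so they tilt the slope.
outlier_idx = np.argsort(X[:, 0])[-5:]
y[outlier_idx] += 100.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", round(ols.coef_[0], 2))
print("Huber slope:", round(huber.coef_[0], 2))
```

The Huber loss grows linearly rather than quadratically for large residuals, so the shifted points pull its slope away from the true value far less than they pull the least-squares slope.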

In [None]:
# Import standard-library modules
import math
import os

# Import third-party libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import circmean
from sklearn import linear_model as lm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

In [None]:
# Read in daily df
daily_mean = pd.read_csv('daily_mean.csv')

As observed in the EDA section, the data is very noisy in both the hourly and the daily aggregations. Daily models will therefore be fit using the raw daily values, as well as daily values that have been smoothed with a moving average.
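
A minimal sketch of moving-average smoothing with pandas `rolling`, applied to a synthetic daily series. The 7-day centred window and `min_periods=1` are assumptions chosen for illustration; the source does not state the window used.

```python
import numpy as np
import pandas as pd

# Synthetic noisy daily series standing in for one pollutant.
rng = np.random.default_rng(5)
dates = pd.date_range("2020-01-01", periods=60, freq="D")
raw = pd.Series(
    np.sin(np.arange(60) / 10) + rng.normal(scale=0.5, size=60),
    index=dates,
    name="SO2",
)

# Assumed 7-day centred window; min_periods=1 avoids NaN at the edges.
smoothed = raw.rolling(window=7, center=True, min_periods=1).mean()

print(raw.diff().std(), smoothed.diff().std())
```

The day-to-day variation of the smoothed series is much lower than that of the raw series, which is the effect the smoothed daily models rely on. A wider window smooths more but also blurs short pollution events.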