# Exercise Questions

This version of the exercises uses the [Air Quality](https://archive.ics.uci.edu/ml/datasets/Air+Quality#) dataset, which includes observations of various air quality and pollutant indicators. It's commonly used in regression tutorials because several of its predictors have linear relationships with multiple pollutants. We will use it here as a simple modeling example that lets us focus on model metrics and evaluation procedures.

The dataset includes the following information:
- date/time indicators
- weather indicators
- several pollutants of potential interest

In [None]:
# Import packages and dataset
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv("air_quality.csv")
df = df.fillna(0)

## Explore the Dataset

Take a few minutes to explore the dataset. How many potential predictors are there for a model predicting `benzene` (a compound that reacts with other chemicals to create smog) quantities in the air?

In [None]:
df.info()
#df.head()

In [None]:
# Create a correlation matrix to identify potential predictors
# numeric_only=True skips any non-numeric columns (e.g. date/time strings)
df_corr = df.corr(numeric_only=True).round(2)
df_corr

In [None]:
# Setting a threshold for correlation values; optional
y_corr = df_corr["benzene"][np.absolute(df_corr["benzene"])>0.5]
y_corr

In [None]:
# Explore the shape of relationships 
sns.pairplot(df[y_corr.index]);

## Set Up for the Model

1. Compare at least two models using scikit-learn. You can view the list of available linear models with scikit-learn [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). Use at least 2 metrics from scikit-learn's regression evaluation metrics [here](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
# Set up X to only include selected column names
X = df[y_corr.index]

# Fill missing data with 0; you can change this if you choose
X = X.fillna(0)

# Drop outcome variable
X = X.drop("benzene", axis = "columns")

# Get y
y = df["benzene"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

# Standardize the data (fit the scaler on the training set only)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Fit the Model

Fit the first model and evaluate the model metrics. Compare it to at least one other model and select one going forward.

In [None]:
lr = LinearRegression()

# Fit the model
lr.fit(X_train, y_train)

# Generate predictions based on the test set 
y_preds = lr.predict(X_test)

In [None]:
# Format with :.2f rather than .round(2), which fails if the score is a plain Python float
print(f"{lr} training set r2 score: {lr.score(X_train, y_train):.2f}")
print(f"{lr} test set r2 score: {r2_score(y_test, y_preds):.2f}")
# Choose another metric
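
As one possible starting point for a second metric, the sketch below computes mean absolute error (MAE) and root mean squared error (RMSE) alongside R². The arrays `y_true_demo` and `y_pred_demo` are made-up stand-ins for `y_test` and `y_preds`, not values from the air quality data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only; substitute y_test and y_preds from the cells above
y_true_demo = np.array([3.0, 5.0, 7.5, 10.0])
y_pred_demo = np.array([2.5, 5.0, 8.0, 9.0])

# MAE: average absolute error, in the same units as the outcome
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# RMSE: square root of the mean squared error; penalizes large errors more heavily
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# R2: proportion of variance in the outcome explained by the predictions
r2 = r2_score(y_true_demo, y_pred_demo)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2:   {r2:.2f}")
```

MAE and RMSE are reported in the units of the outcome, which can make them easier to explain to a non-technical audience than R² alone.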

In [None]:
# Compare to another model
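
A comparison could be structured roughly as follows. This is a minimal sketch on synthetic data, assuming `Ridge` as the second model; the choice of model, the `alpha` value, and every variable name ending in `_demo` are illustrative assumptions rather than part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the selected predictors and the outcome
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=9
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Fit each candidate on the same split and score it with the same metrics
candidates = {"LinearRegression": LinearRegression(), "Ridge": Ridge(alpha=1.0)}
comparison = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    comparison[name] = {
        "r2": r2_score(y_te, preds),
        "mae": mean_absolute_error(y_te, preds),
    }

for name, scores in comparison.items():
    print(f"{name}: R2 = {scores['r2']:.3f}, MAE = {scores['mae']:.3f}")
```

Keeping the split and the metrics identical across candidates ensures that any difference in scores comes from the models themselves.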

## Modeling for Explanation

Modeling for explanation generally requires more detailed information on individual predictors than scikit-learn provides. In order to model for explanation, we will leverage the `statsmodels` library to explore our model further.

In [None]:
import statsmodels.api as sm

In [None]:
x_train = sm.add_constant(X_train)
model = sm.OLS(y_train, x_train)
results = model.fit()

In [None]:
print(results.summary(xname = ["const"] + list(X.columns)))

2. Explore the `statsmodels` OLS regression summary above. Are there features that are _not_ statistically significant predictors of `benzene` in this model? If so, try removing them (drop them and recreate the training/test sets), then re-evaluate the magnitude and sign of the t-values for the remaining features. If every feature is significant, consider which predictors might be more _actionable_ to include.

3. The local government wants to identify two or three actions it can take to reduce smog in the city related to benzene content in the atmosphere. Generate (1) a summary of results, (2) a series of visualizations of key predictors, and (3) an overall recommendation on which environmental issues a program should target.