<a href="https://colab.research.google.com/github/pearllpatell/BusinessAnalyticsIS4487_Patel/blob/main/lab10_air_quality_fit_model_patel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS 4487 Lab 10

## Outline

Repeat exercises from Lab 9, but with the *Air Quality Daily AQI* dataset.

Pull the latest "Daily AQI by County" file from this link: https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI

Your target variable is *CATEGORY*, indicating if the air is healthy.  You can focus on either the entire country, split the country in regions, or focus on just one area (ex. Utah).   You can reduce noise by aggregating the data to the month of season level.   

Can you predict the category based on the location and time of year?  

The AQI is divided into six categories:

*Air Quality Index*

|(AQI) Values	|Levels of Health Concern	        |
|---------------|--------|
|0-50	        |Good	 |
|51-100	        |Moderate	 |
|101-150	    |Unhealthy for Sensitive Groups	|
|151 to 200	    |Unhealthy	 |
|201 to 300	    |Very Unhealthy	 |
|301 to 500	    |Hazardous	 |

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/Scripts/lab10_air_quality_fit_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load Libraries

➡️ Assignment Tasks
- Load any necessary libraries

In [19]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn as sl
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import seaborn as sns

## Import Data into Dataframe

➡️ Assignment Tasks
- Import data from the air quality dataset into a dataframe (in GitHub go to Labs > DataSets)
- Describe or profile the dataframe

In [3]:
df = pd.read_csv ('daily_aqi.csv')

print(df)

       State Name county Name  State Code  County Code        Date  AQI  \
0         Alabama     Baldwin           1            3  2023-01-10   35   
1         Alabama     Baldwin           1            3  2023-01-11   28   
2         Alabama     Baldwin           1            3  2023-01-12   23   
3         Alabama     Baldwin           1            3  2023-01-13   18   
4         Alabama     Baldwin           1            3  2023-01-14   20   
...           ...         ...         ...          ...         ...  ...   
185012    Wyoming      Weston          56           45  2023-06-26   46   
185013    Wyoming      Weston          56           45  2023-06-27   50   
185014    Wyoming      Weston          56           45  2023-06-28   48   
185015    Wyoming      Weston          56           45  2023-06-29   47   
185016    Wyoming      Weston          56           45  2023-06-30   48   

       Category Defining Parameter Defining Site  Number of Sites Reporting  
0          Good      

## Prepare Data

➡️ Assignment Tasks
- Create one dummy variable (true/false) for each of the Defining Parameter values    
- Create variables for month and season
- Perform any other data cleanup needed (remove outliers, nulls, etc.)
- Select the data you would like to use in the model.  If you aggregate data, you will have to decide whether to use the min, max or mean value for AQI

In [5]:
#Create dummy variables for "Defining Parameter"
df_dummy = pd.get_dummies(df['Defining Parameter'], prefix='DefParam')

# Create variables for month and season
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
# Define function to map month to season
def month_to_season(month):
    if month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    elif month in [9, 10, 11]:
        return 'Fall'
    else:
        return 'Winter'
df['Season'] = df['Month'].apply(month_to_season)

# Remove rows with null values
df = df.dropna()
# Remove outliers in AQI
q_low = df['AQI'].quantile(0.01)
q_hi  = df['AQI'].quantile(0.99)
df = df[(df['AQI'] > q_low) & (df['AQI'] < q_hi)]

# 4. Select data for the model - Assuming a decision to use mean AQI for monthly aggregation
df['Year'] = df['Date'].dt.year
# Aggregate AQI by mean, min, and max values per month for potential model input choices
monthly_aqi_stats = df.groupby(['Year', 'Month']).agg(
    Monthly_Mean_AQI=('AQI', 'mean'),
    Monthly_Min_AQI=('AQI', 'min'),
    Monthly_Max_AQI=('AQI', 'max')
).reset_index()

# Combine the prepared dataset
df_final = df.merge(monthly_aqi_stats, how='left', on=['Year', 'Month'])

# Include dummy variables
df_final = pd.concat([df_final, df_dummy], axis=1)

# Selecting final columns, excluding 'Date' but including month, season, and aggregated AQI stats
final_columns = ['Month', 'Season', 'Monthly_Mean_AQI', 'Monthly_Min_AQI', 'Monthly_Max_AQI'] + list(df_dummy.columns)
model_data = df_final[final_columns].drop_duplicates().reset_index(drop=True)

# Display the prepared data
print(model_data.head())


   Month  Season  Monthly_Mean_AQI  Monthly_Min_AQI  Monthly_Max_AQI  \
0    1.0  Winter         35.227229              9.0            104.0   
1    2.0  Winter         36.700433              9.0            104.0   
2    3.0  Spring         40.322500              9.0            104.0   
3    3.0  Spring         40.322500              9.0            104.0   
4    4.0  Spring         43.676444              9.0            104.0   

   DefParam_CO  DefParam_NO2  DefParam_Ozone  DefParam_PM10  DefParam_PM2.5  
0          0.0           0.0             0.0            0.0             1.0  
1          0.0           0.0             0.0            0.0             1.0  
2          0.0           0.0             0.0            0.0             1.0  
3          0.0           0.0             1.0            0.0             0.0  
4          0.0           0.0             1.0            0.0             0.0  


## Create Regression

➡️ Assignment Tasks
- Create a simple linear regression to predict AQI based on as many variables as you can use or derive.
- Visualize the regression with at least one of the variables

In [21]:
# Split the data into features (X) and target variable (y)
X = df.drop('AQI', axis=1)  # Features
y = df['AQI']  # Target variable

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting the Test set results
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Model Performance Metrics:\nRMSE: {rmse}\nR^2: {r2}")

# Coefficients and intercept
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)

# Visualizing the model predictions
# Let's choose 'Temperature' for visualization
plt.scatter(X_test['Temperature'], y_test, color='blue', label='Actual AQI')
plt.scatter(X_test['Temperature'], y_pred, color='red', alpha=0.5, label='Predicted AQI')
plt.title('AQI Prediction vs Actual AQI (Temperature)')
plt.xlabel('Temperature')
plt.ylabel('AQI')
plt.legend()
plt.show()

ValueError: could not convert string to float: 'Tennessee'

## Make a prediction

➡️ Assignment Tasks
- What would you predict the average AQI to be in the month of January?  

Taking these factors into account, the model might predict a lower to moderate AQI for January compared to warmer, drier months, or months with higher levels of specific pollutants, assuming all other factors remain constant.

## OPTIONAL: Compare Air Quality

➡️ Assignment Tasks
- Download the data from a year 20 years prior, using this website: https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI
- Append the new data to the previous dataframe
- Use the year as a variable in your regression.  Is year a significant factor in predicting AQI?

In [None]:
#import, append and create new model