<a href="https://colab.research.google.com/github/jccrews256/ST-554-Project1-Template/blob/main/Task3/Task3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```
Project 1_task3
Author: Joy Zhou
Reviewed by: Case Crews, Trevor Lillywhite
Date: 2/15/2026
```

In [None]:
!git clone https://github.com/jccrews256/ST-554-Project1-Template.git

fatal: destination path 'ST-554-Project1-Template' already exists and is not an empty directory.


# Introduction
We will work with the `Air Quality` dataset available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/360/air+quality). This time-series dataset contains 9,358 hourly air-quality measurement records collected from March 2004 to February 2005 in an urban area in Italy. It includes the responses of five metal-oxide chemical sensors as well as ground-truth hourly concentrations of CO, non-methane hydrocarbons, benzene, nitrogen oxides (NOx), and nitrogen dioxide (NO₂). Missing values in the dataset are encoded as -200.    
The variables used in this analysis are described below:   
C6H6(GT): True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)   
CO(GT): True hourly averaged concentration CO in mg/m^3 (reference analyzer)   
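Date: date of the measurement (documented as DD/MM/YYYY; the copy loaded below displays month first, e.g. 3/10/2004 for 10 March 2004)   
Time: time of the measurement (documented as HH.MM.SS; the copy loaded below displays HH:MM:SS)   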
T: Temperature (°C)   
RH: Relative Humidity (%)   
AH: Absolute Humidity   


# Read in data and data cleaning

- Install the `ucimlrepo` library

In [22]:
!pip install ucimlrepo
# import key modules
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import math
import ucimlrepo as uci
import datetime



- Read in data

In [23]:
import ucimlrepo as uci
#fetch dataset
air_quality = uci.fetch_ucirepo(id=360)
# data (as pandas dataframes)
df = air_quality.data.features
#check a few obs
df.head()

,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,3/10/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578
1,3/10/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255
2,3/10/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502
3,3/10/2004,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867
4,3/10/2004,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888


We will remove any observations where `C6H6(GT)`, `CO(GT)`, `T`, `RH`, or `AH` has a value of -200, since this code represents missing data.
- We will replace -200 with NaN in the columns `C6H6(GT)`, `CO(GT)`, `T`, `RH`, and `AH`, then drop the rows with missing values

In [24]:
cols = ['C6H6(GT)', 'CO(GT)', 'T', 'RH', 'AH'] #columns to check
#replace missing value code(-200) with NaN in those columns
df[cols]= df[cols].replace(-200, np.nan)
# drop rows that have NaN in any of these columns and explicitly make a copy
subset = df.dropna(subset=cols).copy()
#inspect the subset
subset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7344 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           7344 non-null   object 
 1   Time           7344 non-null   object 
 2   CO(GT)         7344 non-null   float64
 3   PT08.S1(CO)    7344 non-null   int64  
 4   NMHC(GT)       7344 non-null   int64  
 5   C6H6(GT)       7344 non-null   float64
 6   PT08.S2(NMHC)  7344 non-null   int64  
 7   NOx(GT)        7344 non-null   int64  
 8   PT08.S3(NOx)   7344 non-null   int64  
 9   NO2(GT)        7344 non-null   int64  
 10  PT08.S4(NO2)   7344 non-null   int64  
 11  PT08.S5(O3)    7344 non-null   int64  
 12  T              7344 non-null   float64
 13  RH             7344 non-null   float64
 14  AH             7344 non-null   float64
dtypes: float64(5), int64(8), object(2)
memory usage: 918.0+ KB
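
As a small illustration of the cleaning step above, the sketch below applies the same replace-then-drop pattern to a made-up three-column frame. The `toy` data and the variable names are hypothetical and only serve to show which rows are removed. Note that the -200 code is only converted in the listed columns, so it can still be present in columns that were not checked.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -200 marks a missing reading
toy = pd.DataFrame({
    "CO(GT)": [2.6, -200.0, 2.2, 1.6],
    "T": [13.6, 13.3, -200.0, 11.2],
    "NMHC(GT)": [150, 112, 88, -200],
})

cols = ["CO(GT)", "T"]  # columns to check (assumed subset for this sketch)

cleaned = toy.copy()
# convert the missing-value code to NaN in the checked columns only
cleaned[cols] = cleaned[cols].replace(-200, np.nan)
# drop rows with NaN in any checked column
cleaned = cleaned.dropna(subset=cols)

n_dropped = len(toy) - len(cleaned)
# count how many -200 codes remain in the unchecked column
n_left_in_nmhc = int(cleaned["NMHC(GT)"].eq(-200).sum())
print(n_dropped, n_left_in_nmhc)
```

Comparing the row count before and after, as `n_dropped` does here, is a quick way to see how much data the filter removes.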


- Then, we compute daily averages of `C6H6(GT)`, `CO(GT)`, `T`, `RH`, and `AH` by `Date` to create a new dataset with one row per day.
    - To do this, we need to convert the `Date` and `Time` columns to date/time data types
    - The `Date_Time` column is created by combining the `Date` and `Time` columns, giving a single, accurate timestamp for each observation to facilitate time-based calculations.
    - Then we group by `Date` to get the mean of these five variables

In [None]:
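# parse the date strings into datetime values
subset['Date'] = pd.to_datetime(subset['Date'])
# normalise any 'HH.MM.SS' times to 'HH:MM:SS' so they can be parsed as timedeltas
subset['Time'] = subset['Time'].str.replace('.', ':', regex=False)
# add the time of day to the date to get one full timestamp per observation
subset['Date_Time'] = subset['Date'] + pd.to_timedelta(subset['Time'])
subset.head()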

,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Date_Time
0,2004-03-10,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578,2004-03-10 18:00:00
1,2004-03-10,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255,2004-03-10 19:00:00
2,2004-03-10,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502,2004-03-10 20:00:00
3,2004-03-10,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867,2004-03-10 21:00:00
4,2004-03-10,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888,2004-03-10 22:00:00
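
The timestamp construction above can be sketched on a couple of made-up rows. This version passes an explicit `format` to `pd.to_datetime`, which is an assumption based on the month-first dates displayed in the output (for example `3/10/2004`), not something the original cell does. Stating the format makes the parse independent of pandas' format inference. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical rows; one time uses dots, one uses colons
toy = pd.DataFrame({
    "Date": ["3/10/2004", "3/11/2004"],
    "Time": ["18.00.00", "09:00:00"],
})

# assumed month/day/year layout, matching the displayed data
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")
# make the separator uniform before parsing as a duration
toy["Time"] = toy["Time"].str.replace(".", ":", regex=False)
# date (midnight) + time of day = full timestamp
toy["Date_Time"] = toy["Date"] + pd.to_timedelta(toy["Time"])

print(toy["Date_Time"])
```

If the format did not match the data, `pd.to_datetime` would raise an error rather than silently swapping day and month, which is the main reason to spell it out.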


- Then we group by `Date` to get the mean of these five variables and save them to a new dataset `air_quality_clean`

In [None]:
air_quality_clean = subset.groupby('Date')[['C6H6(GT)', 'CO(GT)', 'T', 'RH', 'AH']].mean().round(4)
air_quality_clean.info() #check the new dataset
air_quality_clean.head() #check few rows

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 347 entries, 2004-03-10 to 2005-04-04
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   C6H6(GT)  347 non-null    float64
 1   CO(GT)    347 non-null    float64
 2   T         347 non-null    float64
 3   RH        347 non-null    float64
 4   AH        347 non-null    float64
dtypes: float64(5)
memory usage: 16.3 KB


Date,C6H6(GT),CO(GT),T,RH,AH
2004-03-10,8.45,1.9667,12.0333,54.9,0.7656
2004-03-11,8.2696,2.2391,9.8261,64.2304,0.777
2004-03-12,12.1773,2.8045,11.6182,50.1909,0.6652
2004-03-13,11.1217,2.6957,13.1217,50.6826,0.733
2004-03-14,9.8304,2.4696,16.1826,48.3174,0.8492
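
Because rows with missing values were dropped before averaging, different days may be based on different numbers of hourly records. The sketch below, on hypothetical toy data, shows one way to compute the daily means together with the number of contributing hours using named aggregation. The output column names (`co_mean`, `benzene_mean`, `n_hours`) are made up for this illustration.

```python
import pandas as pd

# Hypothetical hourly records: two on the first day, one on the second
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2004-03-10", "2004-03-10", "2004-03-11"]),
    "CO(GT)": [2.0, 3.0, 1.5],
    "C6H6(GT)": [9.0, 11.0, 6.5],
})

# one row per day: means plus the count of hourly rows behind them
daily = toy.groupby("Date").agg(
    co_mean=("CO(GT)", "mean"),
    benzene_mean=("C6H6(GT)", "mean"),
    n_hours=("CO(GT)", "size"),
)

print(daily)
```

A column like `n_hours` makes it possible to spot days whose average rests on only a few observations.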


# Modeling
`C6H6(GT)` is our response variable. We will fit a simple linear regression (SLR) model using `CO(GT)` to predict `C6H6(GT)`, and an MLR model to