# Read in Hourly Air Data
Yo Kimura generated files of hourly air measures. The files are very large, so the first task is to check whether they can be read in and cleaned.

Help from MS Copilot LLM
Annual Inhaled Mass (AIM): How much of the chemical a person would inhale if they stayed in that cell all year. (µg)

Annual Absorbed Dose Index (ADI): How much is taken up into blood (not just inhaled). (µg)
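
The two definitions above can be sketched as a small calculation. This is a minimal illustration, not the project's method: the function names are hypothetical, the hourly concentrations are assumed to already be in µg/m³, and the inhalation rate and absorbed fraction are placeholder inputs that would need to come from a vetted source.

```python
def annual_inhaled_mass(hourly_conc_ug_m3, inhalation_m3_per_hour):
    """AIM sketch: sum over hours of concentration times air volume breathed (µg).

    Assumes each value is the mean concentration (µg/m³) for one hour and
    that the person breathes at a constant rate (m³/hour).
    """
    return sum(conc * inhalation_m3_per_hour for conc in hourly_conc_ug_m3)


def annual_absorbed_dose_index(aim_ug, absorbed_fraction):
    """ADI sketch: the share of the inhaled mass taken up into blood (µg)."""
    if not 0.0 <= absorbed_fraction <= 1.0:
        raise ValueError("absorbed_fraction must be between 0 and 1")
    return aim_ug * absorbed_fraction


# Toy inputs for illustration only (not measured or recommended values)
toy_hourly_conc = [1.0, 2.0, 3.0]   # µg/m³, three hours
toy_inhalation_rate = 0.5           # m³/hour, placeholder
toy_absorbed_fraction = 0.5         # unitless, placeholder

aim = annual_inhaled_mass(toy_hourly_conc, toy_inhalation_rate)
adi = annual_absorbed_dose_index(aim, toy_absorbed_fraction)
print(aim, adi)
```

With a constant inhalation rate, AIM is just the per-cell sum of hourly concentrations scaled by that rate, so a per-cell sum carries the same spatial pattern.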

It may also make sense to simply sum across all of the values for each grid cell, or across each hour for each grid cell.

Question: if we know the pressure, the temperature, and the age of the person, we could make AIM and ADI more detailed.
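
One place where pressure and temperature would matter is unit conversion. If the model output is a mixing ratio (for example ppb), turning it into a mass concentration uses the ideal gas law. The sketch below assumes ppb input, which the source files may or may not use; the function name is hypothetical. Age would enter separately, through the inhalation rate, and is not modeled here.

```python
R_GAS = 8.314462618   # universal gas constant, J/(mol*K)
BENZENE_MW = 78.11    # molar mass of benzene, g/mol


def ppb_to_ug_m3(ppb, temp_k=298.15, pressure_pa=101_325.0, molar_mass=BENZENE_MW):
    """Convert a mixing ratio in ppb to µg/m³ using the ideal gas law.

    Defaults are 25 °C and 1 atm. Assumes the input really is ppb by volume.
    """
    mol_air_per_m3 = pressure_pa / (R_GAS * temp_k)   # moles of air in 1 m³
    mol_gas_per_m3 = ppb * 1e-9 * mol_air_per_m3      # moles of the pollutant
    return mol_gas_per_m3 * molar_mass * 1e6          # g -> µg


at_25c = ppb_to_ug_m3(1.0)
at_0c = ppb_to_ug_m3(1.0, temp_k=273.15)
print(round(at_25c, 2), round(at_0c, 2))
```

Colder or higher-pressure air is denser, so the same mixing ratio corresponds to more mass per cubic metre.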



## Description of Program
- program:    ip1_2cv3_HourlyAir
- task:       Read Air files with hourly data
- Version:    2026-01-09
- v2:         Consolidate code and prepare to loop
- v3:         Explore options for aggregated annual values
- project:    Southeast Texas Urban Integrated Field Lab
- funding:    DOE
- author:     Nathanael Rosenheim

## Step 0: Good Housekeeping

In [None]:
# 1. Import all packages
import pandas as pd     # For obtaining and cleaning tabular data
import os # For saving output to path
import zipfile # For handling zip files

In [None]:
# 2. Check versions
import sys
print("Python Version     ", sys.version)
print("pandas version:    ", pd.__version__)

In [None]:
# 3. Check working directory
# Get information on current working directory (getcwd)
os.getcwd()

In [None]:
# 4. Store program name so output files share the same name
programname = "ip1_2cv3_hourlyair"
# Make directory to save output (no error if it already exists)
os.makedirs(programname, exist_ok=True)

## Step 1: Obtain Data
Obtain CSV Files

Posted CSV dump of the CAMx model output.
https://utexas.app.box.com/folder/359619230313

Nathanael saved an example file (a small one) on his local machine.
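
Because the full files are very large, one option is to stream the CSV in chunks and aggregate as it is read, rather than loading everything into memory. The sketch below is an assumption about how that could look, not something the notebook does: the function name and chunk size are hypothetical, and the toy data only borrows the `ROW`, `COL`, `TSTEP`, and `BENZ` column names used later in this notebook.

```python
import io

import pandas as pd


def sum_by_cell_in_chunks(csv_source, value_col="BENZ", chunksize=100_000):
    """Sum value_col per (ROW, COL) cell while reading the CSV in chunks."""
    partial_sums = []
    reader = pd.read_csv(
        csv_source,
        usecols=["ROW", "COL", value_col],  # skip columns the sum does not need
        chunksize=chunksize,
    )
    for chunk in reader:
        # Each chunk contributes a partial sum per cell
        partial_sums.append(chunk.groupby(["ROW", "COL"])[value_col].sum())
    # A cell can appear in several chunks, so combine the partial sums
    combined = pd.concat(partial_sums).groupby(level=["ROW", "COL"]).sum()
    return combined.reset_index()


# Toy data: two cells, two hours each
toy_csv = io.StringIO(
    "ROW,COL,TSTEP,BENZ\n"
    "1,1,0,0.5\n"
    "1,2,0,1.0\n"
    "1,1,1,1.5\n"
    "1,2,1,2.0\n"
)
toy_sums = sum_by_cell_in_chunks(toy_csv, chunksize=2)
print(toy_sums)
```

The same function should accept the file handle returned by `zipfile.ZipFile.open`, as long as the archive stays open while the chunks are read.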

## Step 2: Clean Data

In [None]:
def obtain_hourly_air_quality_data(folder_name, pollutant_name="benz", resolution="1km"):
    """Read one hourly CSV from its zip archive under SourceData into a DataFrame."""
    # Example archive: SourceData/Kimura_Hourly_2026-01-08/hourly_benz_1km.zip
    file_stem = f"hourly_{pollutant_name}_{resolution}"
    zip_path = os.path.join("SourceData", folder_name, f"{file_stem}.zip")
    # The CSV inside the archive is expected to share the archive's base name
    with zipfile.ZipFile(zip_path, 'r') as z:
        with z.open(f"{file_stem}.csv") as f:
            hourly_df = pd.read_csv(f)

    return hourly_df

hourly_benz_df = obtain_hourly_air_quality_data("Kimura_Hourly_2026-01-08", pollutant_name="benz", resolution="1km")


In [None]:
hourly_benz_df.head()

In [None]:
# Summary statistics for the time step and benzene columns
hourly_benz_df[['TSTEP','BENZ']].describe().T

In [None]:
# how many hours are in the data?
hourly_benz_df['tstamp'].nunique()

In [None]:
# Convert 5136 hours to days (= 214 days)
5136/24

In [None]:
# Group by ROW and COL and sum the BENZ values
summed_benz_df = hourly_benz_df.groupby(['ROW', 'COL'])['BENZ'].sum().reset_index()

# Display the first few rows of the new dataframe
summed_benz_df.head()

In [None]:
# Get unique ROW, COL, x, and y combinations from the original dataframe
coords_df = hourly_benz_df[['ROW', 'COL', 'x', 'y']].drop_duplicates()

# Merge the coordinates back into the summed dataframe
summed_benz_with_coords_df = pd.merge(summed_benz_df, coords_df, on=['ROW', 'COL'])

# convert ROW and COL to integer
summed_benz_with_coords_df['ROW'] = summed_benz_with_coords_df['ROW'].astype(int)
summed_benz_with_coords_df['COL'] = summed_benz_with_coords_df['COL'].astype(int)

# Display the first few rows of the merged dataframe
summed_benz_with_coords_df.head()

In [None]:
# descriptive stats for BENZ
summed_benz_with_coords_df['BENZ'].describe().T

In [None]:
# add unique id based off ROW and COL
def generate_grid_id(row, col, resolution="1km"):
    return f"air{resolution}_{int(row):04d}_{int(col):04d}"

summed_benz_with_coords_df['grid_id'] = summed_benz_with_coords_df.apply(lambda row: 
                            generate_grid_id(row['ROW'], row['COL'], resolution="1km"), 
                            axis=1)
summed_benz_with_coords_df.head() 

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.scatter(summed_benz_with_coords_df['x'], summed_benz_with_coords_df['y'], c=summed_benz_with_coords_df['BENZ'], cmap='viridis', s=10)
plt.colorbar(label='Sum of BENZ')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sum of BENZ by Location')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# output to csv
summed_benz_with_coords_df.to_csv(os.path.join(programname, f"{programname}_summed_benz_1km.csv"), index=False)

**LISA Analysis Summary**

LISA (Local Indicators of Spatial Association) identifies spatial clusters:
- **Hot spots (High-High)**: High values surrounded by high neighbors
- **Cold spots (Low-Low)**: Low values surrounded by low neighbors  
- **Outliers**: High-Low or Low-High combinations

**Why LISA for benzene grid data?**
- Grid structure is ideal for neighbor definitions
- Provides statistical significance testing
- Standard method in environmental epidemiology

**Implementation**: Use PySAL library (`esda.Moran_Local()` with `libpysal.weights.lat2W()` for grid weights)
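
To make the cluster labels concrete, here is a NumPy-only sketch of the local Moran statistic on a regular lattice with rook (edge-sharing) neighbors and row-standardised weights. It is an illustration of the idea, not PySAL's implementation: the function names are hypothetical, and it skips the permutation-based significance testing that `esda.Moran_Local()` provides.

```python
import numpy as np


def rook_lag(z):
    """Mean of the up/down/left/right neighbors for every cell of a 2-D grid.

    Assumes the grid has at least two cells, so every cell has a neighbor.
    """
    total = np.zeros(z.shape, dtype=float)
    count = np.zeros(z.shape, dtype=float)
    total[1:, :] += z[:-1, :]; count[1:, :] += 1    # neighbor in the row above
    total[:-1, :] += z[1:, :]; count[:-1, :] += 1   # neighbor in the row below
    total[:, 1:] += z[:, :-1]; count[:, 1:] += 1    # neighbor to the left
    total[:, :-1] += z[:, 1:]; count[:, :-1] += 1   # neighbor to the right
    return total / count


def local_moran_quadrants(values):
    """Return local Moran values and HH/LL/HL/LH labels (no significance test)."""
    z = (values - values.mean()) / values.std()   # standardise the variable
    lag = rook_lag(z)                             # average standardised neighbor
    local_i = z * lag                             # positive where cell and neighbors agree
    labels = np.full(values.shape, "LH", dtype=object)
    labels[(z > 0) & (lag > 0)] = "HH"
    labels[(z <= 0) & (lag <= 0)] = "LL"
    labels[(z > 0) & (lag <= 0)] = "HL"
    return local_i, labels


# Toy grid: a block of high values in the top-left corner
toy = np.zeros((4, 4))
toy[:2, :2] = 10.0
local_i, labels = local_moran_quadrants(toy)
print(labels)
```

In the real analysis only cells whose statistic is significant under permutation would be mapped as hot spots, cold spots, or outliers; this sketch labels every cell.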

In [None]:
# Import spatial analysis libraries
import geopandas as gpd
from libpysal.weights import lat2W
from esda.moran import Moran, Moran_Local
import numpy as np

# Import custom LISA analysis functions
from ip1_3cv1_LISA_analysis import (
    create_spatial_weights,
    create_spatial_weights_knn,
    calculate_global_morans_i,
    plot_morans_i_scatterplot,
    calculate_local_morans_i,
    plot_lisa_cluster_map
)

In [None]:
w, df_sorted = create_spatial_weights(summed_benz_with_coords_df)

In [None]:
df_sorted.head()

In [None]:
moran_global = calculate_global_morans_i(df_sorted, 'BENZ', w)

In [None]:
plot_morans_i_scatterplot(df_sorted, 'BENZ', w, moran_global)

In [None]:
df_sorted, moran_local = calculate_local_morans_i(df_sorted, 'BENZ', w, significance_level=0.05)

In [None]:
df_sorted.head()

In [None]:
plot_lisa_cluster_map(df_sorted, 'BENZ')

In [None]:
# Rerun without the round-1 cold spots (removes the ocean cells to see what happens)
df_sorted_no_ocean = df_sorted[~df_sorted['cluster_label'].isin(['LL (Cold spot)'])].copy()
# Use KNN weights instead of grid-based weights since we filtered cells
w_knn, df_sorted_no_ocean = create_spatial_weights_knn(df_sorted_no_ocean, x_col='x', y_col='y', k=8)
df_sorted_no_ocean, moran_local_no_ocean = calculate_local_morans_i(df_sorted_no_ocean, 'BENZ', w_knn, significance_level=0.05)
plot_lisa_cluster_map(df_sorted_no_ocean, 'BENZ')