# Air Quality Dataset Aggregator

This notebook merges air quality data from two CSV files (`station_hour.csv` and `stations.csv`), filters the result by station ID, adds user-defined attribute columns, and splits the output into files of at most 10,000 rows each.

## Section 1: Import Required Libraries

In [1]:
import os
import warnings

import pandas as pd

# Silence all warnings to keep the notebook output tidy.
# Note that this also hides pandas warnings such as DtypeWarning.
warnings.filterwarnings('ignore')

## Section 2: Define User Input Parameters

Enter the station ID and the values for the new columns to be added to the dataset.

In [2]:
# User Input Parameters
station_id = "AP001"  # Change this to your desired station ID
boundary = "value1"  # Add your boundary value here
building = "value2"  # Add your building value here
geological = "value3"  # Add your geological value here
highway = "value4"  # Add your highway value here
landuse = "value5"  # Add your landuse value here
natural = "value6"  # Add your natural value here

print("Input Parameters:")
print(f"Station ID: {station_id}")
print(f"Boundary: {boundary}")
print(f"Building: {building}")
print(f"Geological: {geological}")
print(f"Highway: {highway}")
print(f"Landuse: {landuse}")
print(f"Natural: {natural}")

Input Parameters:
Station ID: AP001
Boundary: value1
Building: value2
Geological: value3
Highway: value4
Landuse: value5
Natural: value6


## Section 3: Load CSV Files

In [3]:
# Set the path to the CSV files
current_directory = os.getcwd()
station_hour_path = os.path.join(current_directory, "station_hour.csv")
stations_path = os.path.join(current_directory, "stations.csv")

print(f"Loading CSV files from: {current_directory}")
print(f"Station Hour CSV path: {station_hour_path}")
print(f"Stations CSV path: {stations_path}")

# Load the CSV files
station_hour_df = pd.read_csv(station_hour_path)
stations_df = pd.read_csv(stations_path)

print(f"\nStation Hour DataFrame shape: {station_hour_df.shape}")
print(f"Stations DataFrame shape: {stations_df.shape}")
print(f"\nStation Hour DataFrame columns: {list(station_hour_df.columns)}")
print(f"Stations DataFrame columns: {list(stations_df.columns)}")

Loading CSV files from: /Users/likhithkanigolla/IIITH/code-files/Digital-Twin/Air-Quality-Dataset-Seggregator-DT
Station Hour CSV path: /Users/likhithkanigolla/IIITH/code-files/Digital-Twin/Air-Quality-Dataset-Seggregator-DT/station_hour.csv
Stations CSV path: /Users/likhithkanigolla/IIITH/code-files/Digital-Twin/Air-Quality-Dataset-Seggregator-DT/stations.csv

Station Hour DataFrame shape: (2589083, 16)
Stations DataFrame shape: (230, 5)

Station Hour DataFrame columns: ['StationId', 'Datetime', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket']
Stations DataFrame columns: ['StationId', 'StationName', 'City', 'State', 'Status']
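
With roughly 2.6 million hourly rows, loading all of `station_hour.csv` just to keep one station can use a lot of memory. As a minimal sketch, assuming only a single station is needed, the file could instead be read in pieces with the `chunksize` argument of `pd.read_csv` and filtered as it streams. The helper name `load_station_rows`, the chunk size, and the in-memory sample data are illustrative assumptions and not part of this notebook.

```python
import io

import pandas as pd


def load_station_rows(csv_source, station_id, chunksize=100_000):
    """Read a large CSV piece by piece, keeping only one station's rows.

    Hypothetical helper: assumes the file has a 'StationId' column.
    """
    matches = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        matches.append(chunk[chunk["StationId"] == station_id])
    return pd.concat(matches, ignore_index=True)


# Tiny in-memory stand-in for station_hour.csv (made-up values, illustration only)
sample_csv = io.StringIO(
    "StationId,Datetime,PM2.5\n"
    "AP001,2017-11-24 17:00:00,60.5\n"
    "AP002,2017-11-24 17:00:00,41.0\n"
    "AP001,2017-11-24 18:00:00,65.5\n"
)

sample_rows = load_station_rows(sample_csv, "AP001", chunksize=2)
print(sample_rows.shape)  # (2, 3)
```

In the notebook itself, `csv_source` would be `station_hour_path`; only the matching rows are held in memory at the end.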


## Section 4: Merge and Filter Data by StationId

In [4]:
# Verify that the 'StationId' join key exists in both DataFrames before merging
print("Checking for StationId column in both files...")

for name, df in (("station_hour.csv", station_hour_df), ("stations.csv", stations_df)):
    if 'StationId' in df.columns:
        print(f"✓ StationId found in {name}")
    else:
        print(f"✗ StationId not found in {name}")
        print(f"Available columns: {list(df.columns)}")
        # Stop here: the merge below cannot work without the join key
        raise KeyError(f"'StationId' column missing from {name}")

# Merge the two DataFrames on StationId
merged_df = station_hour_df.merge(stations_df, on='StationId', how='inner')
print(f"\nMerged DataFrame shape: {merged_df.shape}")

# Filter by the specified station ID
filtered_df = merged_df[merged_df['StationId'] == station_id].copy()
print(f"Filtered DataFrame shape (for station {station_id}): {filtered_df.shape}")

if filtered_df.shape[0] == 0:
    print(f"⚠️ Warning: No data found for station ID {station_id}")
else:
    print(f"✓ Found {filtered_df.shape[0]} rows for station {station_id}")

Checking for StationId column in both files...
✓ StationId found in station_hour.csv
✓ StationId found in stations.csv

Merged DataFrame shape: (2589083, 20)
Filtered DataFrame shape (for station AP001): (22784, 20)
✓ Found 22784 rows for station AP001
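
The cell above merges every hourly row and only then filters. A possible alternative, sketched here under the assumption that only one station is needed, is to filter the hourly data first so the merge touches far fewer rows; it should yield the same rows for that station. The function name `filter_then_merge` and the small sample frames are hypothetical. The `validate="many_to_one"` argument of `DataFrame.merge` makes pandas raise an error if a station ID appears more than once in the station metadata.

```python
import pandas as pd


def filter_then_merge(hourly_df, stations_df, station_id):
    """Filter hourly rows to one station, then attach station metadata.

    Hypothetical helper: both frames are assumed to share a 'StationId' column.
    """
    for name, df in (("hourly", hourly_df), ("stations", stations_df)):
        if "StationId" not in df.columns:
            raise KeyError(f"'StationId' column missing from {name} data")

    subset = hourly_df[hourly_df["StationId"] == station_id]
    # many_to_one: many hourly rows per station, one metadata row per station
    return subset.merge(
        stations_df, on="StationId", how="inner", validate="many_to_one"
    )


# Made-up sample data for illustration only
hourly = pd.DataFrame(
    {"StationId": ["AP001", "AP001", "AP002"], "PM2.5": [60.5, 65.5, 41.0]}
)
stations = pd.DataFrame(
    {"StationId": ["AP001", "AP002"], "City": ["City A", "City B"]}
)

merged_sample = filter_then_merge(hourly, stations, "AP001")
print(merged_sample.shape)  # (2, 3)
```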


## Section 5: Add User-Defined Columns

In [5]:
# Add the user-defined columns to the filtered DataFrame
filtered_df['boundary'] = boundary
filtered_df['building'] = building
filtered_df['geological'] = geological
filtered_df['highway'] = highway
filtered_df['landuse'] = landuse
filtered_df['natural'] = natural

print("User-defined columns added:")
print(f"- boundary: {boundary}")
print(f"- building: {building}")
print(f"- geological: {geological}")
print(f"- highway: {highway}")
print(f"- landuse: {landuse}")
print(f"- natural: {natural}")
print(f"\nUpdated DataFrame shape: {filtered_df.shape}")
print(f"Updated DataFrame columns: {list(filtered_df.columns)}")

User-defined columns added:
- boundary: value1
- building: value2
- geological: value3
- highway: value4
- landuse: value5
- natural: value6

Updated DataFrame shape: (22784, 26)
Updated DataFrame columns: ['StationId', 'Datetime', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket', 'StationName', 'City', 'State', 'Status', 'boundary', 'building', 'geological', 'highway', 'landuse', 'natural']
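
The six assignments above all follow the same pattern, so they could be driven from a single dictionary. The following is a minimal sketch, not the notebook's own code: the name `extra_attributes` and the two-row sample frame are assumptions made for illustration. `DataFrame.assign` broadcasts each scalar value to every row and returns a new DataFrame.

```python
import pandas as pd

# Hypothetical container for the user-defined values from Section 2
extra_attributes = {
    "boundary": "value1",
    "building": "value2",
    "geological": "value3",
    "highway": "value4",
    "landuse": "value5",
    "natural": "value6",
}

# Small stand-in for filtered_df (illustration only)
sample_df = pd.DataFrame({"StationId": ["AP001", "AP001"]})

# Each scalar is repeated on every row; columns are added in dictionary order
sample_df = sample_df.assign(**extra_attributes)
print(list(sample_df.columns))
```

Keeping the values in one dictionary also means the list of added columns only has to be maintained in one place.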


## Section 6: Split Data into Chunks and Save to Output Files

This section splits the filtered data into chunks of at most 10,000 rows and saves each chunk as a separate CSV file in a folder named after the station ID.

In [6]:
# Create output folder named after the station ID
output_folder = os.path.join(current_directory, station_id)
os.makedirs(output_folder, exist_ok=True)
print(f"Output folder created: {output_folder}")

# Parameters for chunking
chunk_size = 10000
total_rows = filtered_df.shape[0]
num_chunks = (total_rows + chunk_size - 1) // chunk_size  # Ceiling division

print(f"\nTotal rows: {total_rows}")
print(f"Chunk size: {chunk_size}")
print(f"Number of chunks: {num_chunks}")

# Split the data into chunks and save each chunk as a CSV file
output_files = []
for i in range(num_chunks):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, total_rows)
    chunk_df = filtered_df.iloc[start_idx:end_idx]
    
    # Create filename with chunk number
    file_number = i + 1
    filename = f"{station_id}_chunk_{file_number}.csv"
    filepath = os.path.join(output_folder, filename)
    
    # Save the chunk to CSV
    chunk_df.to_csv(filepath, index=False)
    output_files.append(filename)
    
    print(f"Saved: {filename} ({chunk_df.shape[0]} rows)")

print("\n✓ Processing complete!")
print(f"\nOutput files created in folder '{station_id}':")
for i, file in enumerate(output_files, 1):
    print(f"  {i}. {file}")

Output folder created: /Users/likhithkanigolla/IIITH/code-files/Digital-Twin/Air-Quality-Dataset-Seggregator-DT/AP001

Total rows: 22784
Chunk size: 10000
Number of chunks: 3
Saved: AP001_chunk_1.csv (10000 rows)
Saved: AP001_chunk_2.csv (10000 rows)
Saved: AP001_chunk_3.csv (2784 rows)

✓ Processing complete!

Output files created in folder 'AP001':
  1. AP001_chunk_1.csv
  2. AP001_chunk_2.csv
  3. AP001_chunk_3.csv
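
The chunk arithmetic above can also be expressed as a small generator that steps through the DataFrame in strides of `chunk_size`. This is a sketch rather than the notebook's own code; the name `iter_chunks` and the synthetic frame are assumptions. The frame has the same row count as station AP001 (22,784), so the chunk sizes can be compared against the output above.

```python
import math

import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(df), chunk_size):
        # iloc slicing stops at the end of the frame, so the last chunk may be shorter
        yield df.iloc[start:start + chunk_size]


# Synthetic frame with the same number of rows as the AP001 data
demo_df = pd.DataFrame({"row": range(22784)})

chunk_sizes = [len(chunk) for chunk in iter_chunks(demo_df, 10_000)]
print(chunk_sizes)                       # [10000, 10000, 2784]
print(math.ceil(len(demo_df) / 10_000))  # 3
```

An empty DataFrame produces no chunks at all, which matches the behavior of the loop in the cell above when `total_rows` is 0.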


## Summary

The notebook has successfully:
1. ✓ Loaded CSV files from the current working directory
2. ✓ Merged data from station_hour.csv and stations.csv
3. ✓ Filtered the data by station ID
4. ✓ Added user-defined columns (boundary, building, geological, highway, landuse, natural)
5. ✓ Split the data into chunks of at most 10,000 rows
6. ✓ Saved each chunk as a separate CSV file in a folder named after the station ID

The output files are now ready for use!
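
As an optional sanity check, the chunk files can be read back and concatenated to confirm that no rows were lost. The sketch below is an assumption-laden illustration: the helper `read_chunks_back` is hypothetical, and it writes a small made-up frame to a temporary folder instead of touching the real output. Files are sorted by their numeric suffix because a plain alphabetical sort would place `chunk_10` before `chunk_2`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_chunks_back(folder, station_id):
    """Reload '<station_id>_chunk_<n>.csv' files in numeric order and combine them."""
    paths = sorted(
        Path(folder).glob(f"{station_id}_chunk_*.csv"),
        key=lambda p: int(p.stem.rsplit("_", 1)[1]),
    )
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)


# Made-up data: 25 rows written in chunks of 10, mirroring the naming used above
original = pd.DataFrame({"StationId": ["AP001"] * 25, "AQI": range(25)})

with tempfile.TemporaryDirectory() as tmp:
    for number, start in enumerate(range(0, len(original), 10), start=1):
        chunk_path = Path(tmp) / f"AP001_chunk_{number}.csv"
        original.iloc[start:start + 10].to_csv(chunk_path, index=False)
    restored = read_chunks_back(tmp, "AP001")

print(len(restored) == len(original))  # True
```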