# 1. Introduction

This project applies the full exploratory data analysis (EDA) workflow covered in Data Analysis and Modern Tools (INSY 6500) to a real-world dataset, using NumPy, pandas, visualization tools, and best practices from the lectures.

The dataset used is sourced from the **World Air Quality Index (WAQI)** and is publicly available on Kaggle: [Global Air Quality Dataset – Kaggle](https://www.kaggle.com/datasets/waqi786/global-air-quality-dataset).   
It includes 10,000 observations collected across multiple international cities, combining pollutant concentrations (PM2.5, PM10, NO₂, SO₂, CO, and O₃) with weather variables recorded at the time of observation (temperature, humidity, and wind speed). The dataset mixes numerical readings, dates, and categorical geographic identifiers (city and country), making it well suited to exploratory analysis.

This notebook demonstrates key ideas from the course, including:

- Named column access and explicit indexing
- Handling missing, noisy, and inconsistent data
- Method chaining for clearer data pipelines
- Descriptive statistics and summarization 
- Visualization
- Feature engineering for improved interpretability
- Narrative interpretation and analysis
- Organization and reproducibility
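
As a minimal sketch of how named column access and method chaining can work together, the snippet below runs a short pipeline on a toy frame. The rows are made-up values that only borrow the dataset's column names, and `pm_ratio`, `mean_pm25`, and `mean_ratio` are hypothetical names chosen for illustration, not features defined by this project.

```python
import pandas as pd

# Toy rows with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "City": ["Paris", "Paris", "Mumbai", "Mumbai"],
    "Country": ["France", "France", "India", "India"],
    "PM2.5": [50.0, 60.0, 120.0, None],
    "PM10": [30.0, 40.0, 130.0, 110.0],
})

city_summary = (
    toy
    .dropna(subset=["PM2.5"])                            # drop rows with missing PM2.5
    .assign(pm_ratio=lambda d: d["PM2.5"] / d["PM10"])   # hypothetical derived feature
    .groupby("City", as_index=False)
    .agg(mean_pm25=("PM2.5", "mean"), mean_ratio=("pm_ratio", "mean"))
    .sort_values("mean_pm25", ascending=False)           # most polluted city first
    .reset_index(drop=True)
)

print(city_summary)
```

Each step returns a new DataFrame, so the pipeline reads top to bottom and leaves `toy` unchanged.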

## 1.1 Project objectives

By the end of the project, we aim to:

- Analyze global air quality patterns across countries and cities 
- Investigate and understand relationships among pollutant types
- Explore environmental factors and how they interact with air pollution
- Engineer derived features that support new insights
- Communicate findings using a clear structure and explanation
- Prepare artifacts for a Streamlit dashboard (graduate requirement)

# 2. Research Questions

We defined the following research questions:

## 2.1 Geography and Pollution Levels
- Which cities and countries exhibit the highest levels of air pollution?
- How do pollution levels differ across global regions?

## 2.2 Environmental and Temporal Patterns
- How do pollutant levels change over time within a location?
- How are pollutant levels related to weather factors such as temperature, humidity, and wind speed?

## 2.3 Pollutant Interactions and Derived Insights
- Are pollutant types correlated with each other, and if so, how strongly?
- Can engineered metrics provide insight beyond the raw measurements?
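
One way the correlation question might be approached is with a pairwise Pearson correlation matrix. The sketch below uses synthetic data, not the WAQI values: PM10 is deliberately constructed to track PM2.5, while O3 is drawn independently, so the output only illustrates how strong and weak relationships would appear. The variable names `toy`, `corr`, and `pm_link` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
base = rng.normal(size=200)

# Synthetic columns: PM10 is built from PM2.5 plus small noise, O3 is independent.
toy = pd.DataFrame({
    "PM2.5": base,
    "PM10": 2 * base + rng.normal(scale=0.1, size=200),
    "O3": rng.normal(size=200),
})

corr = toy.corr(method="pearson")       # symmetric matrix with 1.0 on the diagonal
pm_link = corr.loc["PM2.5", "PM10"]     # named access to a single pair

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. Pearson correlation only captures linear association, so `method="spearman"` may be worth comparing when a relationship looks monotonic but not linear.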



# 3. Data Setup and Loading (Initial Structure)
In this step, we import the required Python packages and load the air quality dataset. These packages are used for analysis, visualization, statistics, and feature engineering throughout the project.


In [5]:
# 3. Import the required libraries and load the dataset

# numpy and pandas libraries
import pandas as pd
import numpy as np

# visualization libraries (matplotlib and seaborn)
import matplotlib.pyplot as plt
import seaborn as sns

# load the dataset (pd.read_csv)
df = pd.read_csv("../data/global_air_quality_data_10000.csv")

# preview first five rows
df.head()  # explore the data

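|   | City | Country | Date | PM2.5 | PM10 | NO2 | SO2 | CO | O3 | Temperature | Humidity | Wind Speed |
|---|------|---------|------|-------|------|-----|-----|----|----|-------------|----------|------------|
| 0 | Bangkok | Thailand | 2023-03-19 | 86.57 | 25.19 | 99.88 | 30.63 | 4.46 | 36.29 | 17.67 | 59.35 | 13.76 |
| 1 | Istanbul | Turkey | 2023-02-16 | 50.63 | 97.39 | 48.14 | 8.71 | 3.4 | 144.16 | 3.46 | 67.51 | 6.36 |
| 2 | Rio de Janeiro | Brazil | 2023-11-13 | 130.21 | 57.22 | 98.51 | 9.92 | 0.12 | 179.31 | 25.29 | 29.3 | 12.87 |
| 3 | Mumbai | India | 2023-03-16 | 119.7 | 130.52 | 10.96 | 33.03 | 7.74 | 38.65 | 23.15 | 99.97 | 7.71 |
| 4 | Paris | France | 2023-04-04 | 55.2 | 36.62 | 76.85 | 21.85 | 2.0 | 67.09 | 16.02 | 90.28 | 14.16 |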
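
A quick structural check right after loading can catch problems early. The sketch below is one possible version, run on the five preview rows only: the inline CSV text stands in for the real file, and `parse_dates` is shown as an optional alternative to converting the `Date` column afterwards. The names `csv_text`, `preview`, `missing_per_column`, and `n_duplicates` are illustrative.

```python
import io
import pandas as pd

# The five preview rows, inlined so the sketch runs without the data file.
csv_text = """City,Country,Date,PM2.5,PM10,NO2,SO2,CO,O3,Temperature,Humidity,Wind Speed
Bangkok,Thailand,2023-03-19,86.57,25.19,99.88,30.63,4.46,36.29,17.67,59.35,13.76
Istanbul,Turkey,2023-02-16,50.63,97.39,48.14,8.71,3.4,144.16,3.46,67.51,6.36
Rio de Janeiro,Brazil,2023-11-13,130.21,57.22,98.51,9.92,0.12,179.31,25.29,29.3,12.87
Mumbai,India,2023-03-16,119.7,130.52,10.96,33.03,7.74,38.65,23.15,99.97,7.71
Paris,France,2023-04-04,55.2,36.62,76.85,21.85,2.0,67.09,16.02,90.28,14.16
"""

# parse_dates converts the Date column while reading.
preview = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

missing_per_column = preview.isna().sum()   # count of missing values per column
n_duplicates = preview.duplicated().sum()   # count of fully duplicated rows

print(preview.shape)
print(missing_per_column)
print(f"Duplicate rows: {n_duplicates}")
```

The same three checks (shape, missing values per column, duplicate rows) can be applied to the full `df` once it is loaded.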


## 3.1 Data Types, Parsing, and Initial Structure

In [12]:
# Convert date to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Convert city and country to categorical types
df['City'] = df['City'].astype('category')
df['Country'] = df['Country'].astype('category')

# Identify numeric columns
numeric_cols = ['PM2.5','PM10','NO2','SO2','CO','O3','Temperature','Humidity','Wind Speed']

# Ensure numeric columns are numeric
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Verify 
df.info()

# memory usage 
memory_usage = df.memory_usage(deep=True).sum()
print(f"\nTotal memory usage (deep=True): {memory_usage / (1024 * 1024):.2f} MB")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   City         10000 non-null  category      
 1   Country      10000 non-null  category      
 2   Date         10000 non-null  datetime64[ns]
 3   PM2.5        10000 non-null  float64       
 4   PM10         10000 non-null  float64       
 5   NO2          10000 non-null  float64       
 6   SO2          10000 non-null  float64       
 7   CO           10000 non-null  float64       
 8   O3           10000 non-null  float64       
 9   Temperature  10000 non-null  float64       
 10  Humidity     10000 non-null  float64       
 11  Wind Speed   10000 non-null  float64       
dtypes: category(2), datetime64[ns](1), float64(9)
memory usage: 802.3 KB

Total memory usage (deep=True): 0.79 MB
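
Because `errors='coerce'` silently turns unparseable entries into `NaT` or `NaN`, it can be useful to count how many missing values the conversion itself introduces. The sketch below illustrates the idea on a small made-up frame containing two deliberately malformed entries. It is not the project's data, and `raw`, `clean`, and `introduced` are illustrative names.

```python
import pandas as pd

# Made-up raw strings with one bad date and one bad number.
raw = pd.DataFrame({
    "Date": ["2023-03-19", "not a date", "2023-11-13"],
    "PM2.5": ["86.57", "n/a", "130.21"],
})

missing_before = raw.isna().sum()

clean = raw.copy()
clean["Date"] = pd.to_datetime(clean["Date"], errors="coerce")    # bad dates become NaT
clean["PM2.5"] = pd.to_numeric(clean["PM2.5"], errors="coerce")   # bad numbers become NaN

# Missing values created by coercion, per column.
introduced = clean.isna().sum() - missing_before
print(introduced)
```

In the `df.info()` output above, every column reports 10000 non-null values, which suggests that coercion introduced no missing values in this dataset.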
