## Brief Insights of Air Quality Index in the US
This project uses air quality index (AQI) data for New York City, sourced from NYC Open Data.
According to the description on the website, the dataset contains enough information for this analysis.
To keep the analysis focused and relevant, only data from the past five years is used.

### 1. Read data and initial info

The first step is to download the dataset from the website and save it under a recognizable name.
Here, I named the file “Air_Quality.csv” to make it easy to locate and reference later.
After that, I decided to use the pandas library, which provides a convenient and efficient way to read, explore, and manipulate the data for further analysis.

In [54]:
# Import library and load data
import pandas as pd

AQI = pd.read_csv("Air_Quality.csv")
AQI.head()

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,336867,375,Nitrogen dioxide (NO2),Mean,ppb,CD,407,Flushing and Whitestone (CD7),Winter 2014-15,12/01/2014,23.97,
1,336741,375,Nitrogen dioxide (NO2),Mean,ppb,CD,107,Upper West Side (CD7),Winter 2014-15,12/01/2014,27.42,
2,550157,375,Nitrogen dioxide (NO2),Mean,ppb,CD,414,Rockaway and Broad Channel (CD14),Annual Average 2017,01/01/2017,12.55,
3,412802,375,Nitrogen dioxide (NO2),Mean,ppb,CD,407,Flushing and Whitestone (CD7),Winter 2015-16,12/01/2015,22.63,
4,412803,375,Nitrogen dioxide (NO2),Mean,ppb,CD,407,Flushing and Whitestone (CD7),Summer 2016,06/01/2016,14.0,


If we are not sure about the data's contents and types, we can always check them with the DataFrame's info() method.

In [55]:
# Check data info to understand the structure
AQI.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unique ID       18862 non-null  int64  
 1   Indicator ID    18862 non-null  int64  
 2   Name            18862 non-null  object 
 3   Measure         18862 non-null  object 
 4   Measure Info    18862 non-null  object 
 5   Geo Type Name   18862 non-null  object 
 6   Geo Join ID     18862 non-null  int64  
 7   Geo Place Name  18862 non-null  object 
 8   Time Period     18862 non-null  object 
 9   Start_Date      18862 non-null  object 
 10  Data Value      18862 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(2), int64(3), object(7)
memory usage: 1.7+ MB


Next, we filter the data down to the five most recent years.
Always double-check the date values first, because different datasets write dates in different formats.

In [61]:
# Filter to only use 5 most recent years
AQI["Start_Date"] = pd.to_datetime(AQI["Start_Date"], format="%m/%d/%Y")
AQI_5y = AQI[AQI["Start_Date"] >= "2018/01/01"]
AQI_5y

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
7,603044,375,Nitrogen dioxide (NO2),Mean,ppb,CD,314,Flatbush and Midwood (CD14),Annual Average 2018,2018-01-01,17.280000,
9,825832,375,Nitrogen dioxide (NO2),Mean,ppb,CD,107,Upper West Side (CD7),Winter 2021-22,2021-12-01,22.075270,
10,741291,375,Nitrogen dioxide (NO2),Mean,ppb,CD,414,Rockaway and Broad Channel (CD14),Annual Average 2021,2021-01-01,11.337294,
15,741290,375,Nitrogen dioxide (NO2),Mean,ppb,CD,414,Rockaway and Broad Channel (CD14),Summer 2021,2021-06-01,4.850858,
16,603098,375,Nitrogen dioxide (NO2),Mean,ppb,CD,414,Rockaway and Broad Channel (CD14),Annual Average 2018,2018-01-01,10.410000,
...,...,...,...,...,...,...,...,...,...,...,...,...
18850,741180,375,Nitrogen dioxide (NO2),Mean,ppb,CD,207,Kingsbridge Heights and Bedford (CD7),Annual Average 2021,2021-01-01,16.857154,
18853,651005,386,Ozone (O3),Mean,ppb,CD,107,Upper West Side (CD7),Summer 2019,2019-06-01,25.570000,
18854,651017,386,Ozone (O3),Mean,ppb,CD,207,Kingsbridge Heights and Bedford (CD7),Summer 2019,2019-06-01,29.230000,
18859,651029,386,Ozone (O3),Mean,ppb,CD,307,Sunset Park (CD7),Summer 2019,2019-06-01,28.780000,

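Before trusting a date filter like the one above, it can help to confirm that every date string actually matches the expected format. The following is a minimal sketch on a made-up toy frame (the `sample` values, including the deliberately broken date, are assumptions for illustration and not part of the real dataset). It also shows one possible alternative to a hard-coded cutoff: deriving it from the newest date in the data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
sample = pd.DataFrame({
    "Start_Date": ["12/01/2014", "01/01/2018", "06/01/2021", "not a date"],
    "Data Value": [23.97, 17.28, 4.85, 1.0],
})

# errors="coerce" turns unparseable strings into NaT instead of raising an error
parsed = pd.to_datetime(sample["Start_Date"], format="%m/%d/%Y", errors="coerce")
bad_rows = sample[parsed.isna()]
print(f"Rows with unparseable dates: {len(bad_rows)}")

# Derive the cutoff from the newest date rather than hard-coding it
cutoff = parsed.max() - pd.DateOffset(years=5)
recent = sample[parsed >= cutoff]
print(recent)
```

Whether a fixed cutoff or a derived one is more appropriate depends on how "the past five years" is meant to be defined for the analysis.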

### 2. Computing mean, median, mode
For the next step, we compute the mean, median, and mode of the air quality index.
From the data info, we know that the Data Value column is stored as float64, so a computed result can have many decimal places.
We can round it to two decimal places.

In [63]:
# Count mean, median, and mode of 'Data Value' column
mean_AQI = AQI_5y["Data Value"].mean()
mean_AQI = round(mean_AQI, 2)
median_AQI = AQI_5y["Data Value"].median()
mode_AQI = AQI_5y["Data Value"].mode()[0]
print(f"Mean: {mean_AQI}, Median: {median_AQI}, Mode: {mode_AQI}")

Mean: 17.12, Median: 12.83, Mode: 6.73

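One detail worth keeping in mind is that `Series.mode()` returns every tied value in ascending order, so taking `[0]` reports only the smallest of them. A small sketch with made-up numbers (the `values` series below is an assumption for illustration, not taken from the dataset):

```python
import pandas as pd

# Two values are tied for the highest frequency
values = pd.Series([6.73, 6.73, 12.83, 12.83, 40.0])

all_modes = values.mode()     # every tied mode, sorted ascending
first_mode = all_modes[0]     # what .mode()[0] would report

print("All modes:", all_modes.tolist())
print("First mode only:", first_mode)
print("Mean:", round(values.mean(), 2), "Median:", values.median())
```

If ties matter for the analysis, printing the full result of `mode()` avoids silently dropping them.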

### 3. Data Visualization
Creating data visualizations is always a fascinating process because it lets us combine creativity with analysis, producing visuals that reflect both insight and personal taste.
We could use the dedicated functions in Python libraries such as plotly, matplotlib, or seaborn, but this time we will take a more playful approach and visualize the data using cute emojis!

Before we begin, ensure that the data has been grouped by location and the average AQI for each place has been calculated.
Once that’s done, we’re ready to display the top 10 locations, brought to life with emojis!

In [64]:
# Group by Geo Place Name and get the average AQI for each place
AQI_sorted = AQI_5y.groupby("Geo Place Name").agg(average=("Data Value", "mean"))
AQI_sorted = AQI_sorted.reset_index().sort_values(by="average", ascending=False)
AQI_sorted = AQI_sorted.round(2)
AQI_sorted.head(10)

Unnamed: 0,Geo Place Name,average
42,Gramercy Park - Murray Hill,25.81
95,Stuyvesant Town and Turtle Bay (CD6),25.25
20,Clinton and Chelsea (CD4),23.96
18,Chelsea - Clinton,23.42
45,Greenwich Village - SoHo,22.84
63,Midtown (CD5),22.59
46,Greenwich Village and Soho (CD2),21.87
103,Upper East Side (CD8),21.67
60,Lower East Side and Chinatown (CD3),21.61
40,Fort Greene and Brooklyn Heights (CD2),21.55

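To see what the named aggregation and sorting do on a small scale, here is a sketch on a toy frame. The place names "A", "B", "C" and their values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy data: two readings each for A and B, one for C
toy = pd.DataFrame({
    "Geo Place Name": ["A", "A", "B", "B", "C"],
    "Data Value": [10.0, 20.0, 30.0, 40.0, 5.0],
})

ranked = (
    toy.groupby("Geo Place Name")
    .agg(average=("Data Value", "mean"))   # named aggregation: new column "average"
    .reset_index()                         # turn the group key back into a column
    .sort_values(by="average", ascending=False)
    .round(2)                              # only numeric columns are rounded
)
print(ranked)
```

The row labels in the printed result keep their pre-sort positions, which is why the index column of a sorted table looks out of order.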

In [66]:
# Data Visualization
print("Top 10 Places with Highest Average AQI in the Last 5 Years")
print("✦" * 40)

# One emoji per place; the list gets its own name so the loop variable does not overwrite it
emojis = ["🍃", "🌿", "🌱", "🌳", "🌲", "🍂", "🍁", "🌾", "🌻", "🌼"]

for (place, value), emoji in zip(AQI_sorted.head(10).values, emojis):
    bar = emoji * int(value // 5)  # one emoji per 5 full units of AQI
    print(f"{place:<37} {bar:<5} {value}")

Top 10 Places with Highest Average AQI in the Last 5 Years
✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦✦
Gramercy Park - Murray Hill           🍃🍃🍃🍃🍃 25.81
Stuyvesant Town and Turtle Bay (CD6)  🌿🌿🌿🌿🌿 25.25
Clinton and Chelsea (CD4)             🌱🌱🌱🌱  23.96
Chelsea - Clinton                     🌳🌳🌳🌳  23.42
Greenwich Village - SoHo              🌲🌲🌲🌲  22.84
Midtown (CD5)                         🍂🍂🍂🍂  22.59
Greenwich Village and Soho (CD2)      🍁🍁🍁🍁  21.87
Upper East Side (CD8)                 🌾🌾🌾🌾  21.67
Lower East Side and Chinatown (CD3)   🌻🌻🌻🌻  21.61
Fort Greene and Brooklyn Heights (CD2) 🌼🌼🌼🌼  21.55

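The bar length comes from floor division: each emoji stands for 5 full units, so 25.81 yields five symbols while 23.96 yields four. A minimal sketch of that scaling, using a hypothetical helper named `emoji_bar` (not part of the original notebook):

```python
def emoji_bar(value, symbol, step=5):
    """Return `symbol` repeated once per full `step` units of `value`."""
    return symbol * int(value // step)

# Two rows taken from the table above
rows = [("Gramercy Park - Murray Hill", 25.81), ("Clinton and Chelsea (CD4)", 23.96)]
for place, value in rows:
    bar = emoji_bar(value, "🍃")
    # The width in :<5 counts characters, not the on-screen width of each emoji
    print(f"{place:<37} {bar:<5} {value}")
```

A smaller `step` would give longer bars and make the differences between places easier to see, at the cost of wider output.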

### 4. The hard way

Now comes the hardest part.
While it challenges our thought process, it helps us understand the algorithm behind each function.
In this step, we begin by opening and reading the dataset using the csv module.
The header row is skipped to avoid including column names in the analysis, and the selected column (in this case, Data Value at index 10) is converted into numerical values and stored in a list called data.
Note that this version reads the whole file without the five-year filter, so its results differ from those in section 2.

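Rather than counting columns by hand, the index can also be looked up from the header row. The sketch below assumes a hypothetical three-column excerpt (the `raw` string) in place of the real file, so the index it finds is 2 rather than 10.

```python
import csv
import io

# Hypothetical three-column excerpt standing in for Air_Quality.csv
raw = (
    "Unique ID,Start_Date,Data Value\n"
    "336867,12/01/2014,23.97\n"
    "336741,12/01/2014,27.42\n"
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)              # read the header instead of discarding it
col = header.index("Data Value")   # position of the column we want
data = [float(row[col]) for row in reader if row[col] != ""]  # skip empty cells

print(col, data)
```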
Next, we compute three key statistical measures:

Mean: divide the total sum of all values by the number of observations, then round to two decimal places.

Median: sort the data and find the middle value. If the dataset contains an even number of entries, the two central values are averaged to represent the median.

Mode: count how often each value appears and identify the most frequent one. A small dictionary is used to accumulate occurrences, and the resulting modes are rounded.

Finally, we can print the mean, median, and mode(s) of the air quality data.
If there happens to be more than one mode (a tie in frequency), all of them are displayed together for completeness.

In [67]:
# Open the data
import csv

with open("Air_Quality.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # Skip header
    data = [float(row[10]) for row in reader]  # Adjust column index as needed

# Calculate mean
mean_value = sum(data) / len(data)
mean_value = round(mean_value, 2)

# Calculate median
data.sort()
n = len(data)
if n % 2 == 0:
    median_value = (data[n // 2 - 1] + data[n // 2]) / 2
else:
    median_value = data[n // 2]
median_value = round(median_value, 2)

# Calculate mode
counts = {}
for num in data:
    counts[num] = counts.get(num, 0) + 1
max_count = max(counts.values())
modes = [k for k, v in counts.items() if v == max_count]
modes = [round(mode, 2) for mode in modes]

print("Mean:", mean_value)
print("Median:", median_value)
# Print all modes in case there is more than one mode in the data
if len(modes) == 1:
    print(f"Mode: {modes[0]}")
else:
    print(f"Modes: {', '.join(map(str, modes))}")

Mean: 21.05
Median: 14.79
Mode: 2.0

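As a cross-check, Python's standard `statistics` module computes the same three measures, which makes it easy to verify a hand-written version on a small list. The sketch below uses a made-up `sample` list and a hypothetical `manual_median` helper; it illustrates the even-length case, where the two central values sit at positions `n // 2 - 1` and `n // 2` of the sorted data.

```python
import statistics

# Made-up sample with an even number of entries and two tied modes
sample = [2.0, 2.0, 3.5, 7.0, 7.0, 10.0]

def manual_median(values):
    """Median by sorting: average the two central values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

print("Manual median:", manual_median(sample))
print("statistics.median:", statistics.median(sample))
print("statistics.multimode:", statistics.multimode(sample))  # all tied modes
print("statistics.mean:", round(statistics.mean(sample), 2))
```

Running the same comparison on the real `data` list is a quick way to confirm that the manual calculations agree with the library.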