#### Importing the packages used throughout this notebook

In [132]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

## Background

### About the Dataset
This dataset contains daily records of air pollutant concentrations: PM2.5, PM10, NO₂, SO₂, CO, and Ozone. It also records the Air Quality Index (AQI), the date (day, month, year), a holiday count, and a numeric day-of-week code. It is well suited to building models that predict AQI, studying pollution patterns, and analyzing how temporal or holiday factors affect air quality.

##### (Loading the dataset that the analysis below draws on)

In [133]:
df = pd.read_csv("final_dataset.csv")
df.head()

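| Unnamed: 0 | Date | Month | Year | Holidays_Count | Days | PM2.5 | PM10 | NO2 | SO2 | CO | Ozone | AQI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2021 | 0 | 5 | 408.8 | 442.42 | 160.61 | 12.95 | 2.77 | 43.19 | 462 |
| 1 | 2 | 1 | 2021 | 0 | 6 | 404.04 | 561.95 | 52.85 | 5.18 | 2.6 | 16.43 | 482 |
| 2 | 3 | 1 | 2021 | 1 | 7 | 225.07 | 239.04 | 170.95 | 10.93 | 1.4 | 44.29 | 263 |
| 3 | 4 | 1 | 2021 | 0 | 1 | 89.55 | 132.08 | 153.98 | 10.42 | 1.01 | 49.19 | 207 |
| 4 | 5 | 1 | 2021 | 0 | 2 | 54.06 | 55.54 | 122.66 | 9.7 | 0.64 | 48.88 | 149 |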


### Research Question
Is more CO produced on average by the residents and businesses of the Delhi region on weekdays (coded 1-5 in the "Days" column) or on weekends (coded 6-7)?

### Procedure

The data in this table were collected over a span of 5 years, 2021 through 2025, and record the concentrations of several air pollutants: PM2.5, PM10, NO₂, SO₂, CO, and Ozone. This study focuses only on the concentrations of CO.

To measure the difference in CO concentration between weekdays and weekends, we will average CO over the days numbered 1-5 in the "Days" column (Monday-Friday) and over the days numbered 6-7 (Saturday-Sunday). We will then plot the two distributions to inform a decision on the null hypothesis, which will be tested at a significance level of α = 0.05.
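As a small illustration of the split described above, the sketch below applies it to the five preview rows shown earlier, using only the "Days" and "CO" values from that output. The names `preview_rows`, `weekday_co`, and `weekend_co` are introduced here for illustration and are not part of the notebook.

```python
from statistics import mean

# (Days, CO) pairs copied from the five preview rows of df.head().
preview_rows = [(5, 2.77), (6, 2.60), (7, 1.40), (1, 1.01), (2, 0.64)]

# Days 1-5 are Monday-Friday; Days 6-7 are Saturday-Sunday.
weekday_co = [co for day, co in preview_rows if 1 <= day <= 5]
weekend_co = [co for day, co in preview_rows if day in (6, 7)]

print(f"Weekday mean CO: {mean(weekday_co):.2f}")  # 1.47
print(f"Weekend mean CO: {mean(weekend_co):.2f}")  # 2.00
```

Five rows are far too few to support any conclusion; the example only shows the arithmetic that is later applied to the full dataset.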

### Null Hypothesis
There is no difference in the average CO concentration between weekdays and weekends in the Delhi region.
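Reading the hypothesis as a statement about average CO, as the research question does, it can be written in symbols. The notation $\mu_{\text{weekday}}$ and $\mu_{\text{weekend}}$ for the mean CO concentration in each group is introduced here, and the two-sided alternative is an assumption, since the text does not state one:

$$
H_0:\ \mu_{\text{weekday}} = \mu_{\text{weekend}}
\qquad\text{versus}\qquad
H_1:\ \mu_{\text{weekday}} \neq \mu_{\text{weekend}}
$$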


In [134]:
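df.info()      # print column dtypes and non-null counts
df.describe()  # summary statistics (not displayed: a cell shows only its last expression)
df['Days'].value_counts()  # number of rows for each day-of-week code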

Days
5    209
6    209
7    209
1    209
2    209
3    208
4    208
Name: count, dtype: int64

In [135]:
# Label each row 'Weekday' (Days 1-5) or 'Weekend' (Days 6-7)
df['DayType'] = df['Days'].apply(lambda x: 'Weekday' if x in [1, 2, 3, 4, 5] else 'Weekend')

In [136]:
# Mean, standard deviation, and variance of CO on weekdays
weekday_CO_mean = df[df['DayType'] == 'Weekday']['CO'].mean()
weekday_CO_dev = df[df['DayType'] == 'Weekday']['CO'].std()
weekday_CO_var = df[df['DayType'] == 'Weekday']['CO'].var()

# Mean, standard deviation, and variance of CO on weekends
weekend_CO_mean = df[df['DayType'] == 'Weekend']['CO'].mean()
weekend_CO_dev = df[df['DayType'] == 'Weekend']['CO'].std()
weekend_CO_var = df[df['DayType'] == 'Weekend']['CO'].var()

# Grouped means, standard deviations, and variances
co_means = df.groupby('DayType')['CO'].mean()
co_dev = df.groupby('DayType')['CO'].std()
co_var = df.groupby('DayType')['CO'].var()

print("\nGrouped Means:")
print(co_means)

print("\nGrouped Standard Deviations:")
print(co_dev)

print("\nGrouped Variances:")
print(co_var)


Grouped Means:
DayType
Weekday    1.028408
Weekend    1.019402
Name: CO, dtype: float64

Grouped Standard Deviations:
DayType
Weekday    0.618021
Weekend    0.584032
Name: CO, dtype: float64

Grouped Variances:
DayType
Weekday    0.381949
Weekend    0.341093
Name: CO, dtype: float64
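One possible way to turn these summary statistics into a decision at the 0.05 level is Welch's t-test. This is a choice made here for illustration, not a method the notebook states. The sketch assumes the group sizes can be read off the `value_counts()` output (Days 1-5 versus Days 6-7) and that no CO readings are missing. It uses a normal approximation for the p-value, which is reasonable for samples this large.

```python
import math

# Summary statistics copied from the grouped output above.
weekday_mean, weekday_var = 1.028408, 0.381949
weekend_mean, weekend_var = 1.019402, 0.341093

# Assumption: group sizes derived from the value_counts() output,
# with no missing CO readings in either group.
n_weekday = 209 + 209 + 208 + 208 + 209  # Days 1-5
n_weekend = 209 + 209                    # Days 6-7

# Welch's t statistic: difference in means divided by its standard error.
std_err = math.sqrt(weekday_var / n_weekday + weekend_var / n_weekend)
t_stat = (weekday_mean - weekend_mean) / std_err

# Two-sided p-value from the standard normal approximation.
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.3f}, approximate p = {p_value:.3f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0 at the 0.05 level")
```

Under these assumptions the statistic comes out at roughly 0.26, well inside the range expected under the null hypothesis, so the sketch does not reject it. A test run on the raw column values would be the more careful check.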
