# AQI Dataset Cleaning Notebook


## Introduction to Cleaning and Data Preparation Functions

### Cleaning the Datasets

The following code uses a loop to clean each dataset, then provides an overview immediately after each is cleaned. 

The methods of cleaning these datasets include:
- Stripping whitespace from the column names.
- Creating a dataframe using only the "date" and "pm25" columns (the columns of interest).
- Converting the "pm25" column values to a numeric datatype for accurate calculations and analysis.
- Dropping any rows where "pm25" values are missing, to maintain the integrity of analyses related to air quality measurements.
- Converting the "date" column to a properly formatted datetime, to facilitate time series analysis across one or more datasets.
- Saving the cleaned dataset as a .csv file under a new filename. This preserves the original data to ensure reproducibility of the analysis.
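
As a rough illustration of these steps, the sketch below applies them to a small made-up table. The values, the padded column names, and the extra `pm10` column are invented for the example and are not taken from the AQI files; the save step is left out.

```python
import pandas as pd

# Hypothetical raw data: padded column names and one blank pm25 reading.
demo = pd.DataFrame({
    "date": ["2022-09-10", "2022-09-11", "2022-09-12"],
    " pm25": ["22", " ", "131"],
    " pm10": ["5", "7", "9"],
})

demo.columns = demo.columns.str.strip()      # " pm25" becomes "pm25"
demo = demo[["date", "pm25"]].copy()         # keep only the columns of interest
# errors="coerce" turns entries that cannot be parsed as numbers into NaN
demo["pm25"] = pd.to_numeric(demo["pm25"], errors="coerce")
demo = demo.dropna(subset=["pm25"])          # drop the row with the blank reading
demo["date"] = pd.to_datetime(demo["date"])  # text dates become datetime64 values
print(demo)
```

The row with the blank reading is removed because `errors="coerce"` first converts it to `NaN`, which `dropna()` then drops.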

### Dataset Overview/Summary

The summary is produced by the dataset_overview() function, which is called on each dataframe right after clean_data() returns it.
It prints a quick outline of each dataframe, including:
- Dataframe shape
- Data types
- Missing value count
- Dataframe head() preview
- Summary statistics (count, mean, standard deviation, minimum, maximum, and quartiles, where the 50% row is the median), which describe the distribution and central tendency of the data
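
A minimal sketch of how the summary statistics relate to one another, using made-up readings rather than the AQI data. It shows that the `50%` row reported by `describe()` is the median:

```python
import pandas as pd

# Made-up pm25 readings, for illustration only.
sample = pd.DataFrame({"pm25": [10.0, 20.0, 30.0, 100.0]})

stats = sample["pm25"].describe()
print(stats["mean"])             # arithmetic mean of the readings
print(stats["50%"])              # 50% quantile, i.e. the median
print(sample["pm25"].median())   # same value, computed directly
```

When a few very high readings are present, as in this sample, the mean sits well above the median, so comparing the two rows gives a quick hint of skew.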

### Example: Loading the Datasets

Here's how the `datasets` dictionary is structured:

```python

datasets = {
    "edmonton-central, alberta-air-quality.csv": "cleaned_edm_df.csv",
    "calgary-central 2, alberta, canada-air-quality.csv": "cleaned_cgy_df.csv",
    "fort-mcmurray athabasca valley, alberta-air-quality.csv": "cleaned_fort_df.csv",
    "lethbridge,-alberta, canada-air-quality.csv": "cleaned_leth_df.csv"
}
```

In this dictionary:

- The key is the path and filename of the original dataset. For example, `"edmonton-central, alberta-air-quality.csv"` is the original file containing air quality data for Edmonton-Central. In the code cell under "Performing Cleaning and Summary Functions", each key also carries the `Data/` folder prefix.
- The value is the filename to assign to the cleaned version of the dataset. For instance, `"cleaned_edm_df.csv"` will be the new file containing the cleaned data derived from the original Edmonton-Central dataset.

### Looping the Dataset Dictionary through Each Function

The `datasets` dictionary is used to pass each dataset file through the clean_data() function and then the dataset_overview() function. Each pass writes a cleaned .csv file and prints an overview of the cleaned dataframe.
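
Because the loop reads every key as a file path, one missing file stops the whole run. The sketch below is an optional pre-check, not part of the notebook: `find_missing_inputs` is a hypothetical helper, and the example mapping is invented to mirror the shape of the `datasets` dictionary.

```python
from pathlib import Path


def find_missing_inputs(datasets):
    """Return the input paths (dictionary keys) that do not exist on disk."""
    return [path for path in datasets if not Path(path).is_file()]


# Hypothetical mapping: original file path -> cleaned file name.
example = {"no-such-dir/example-air-quality.csv": "cleaned_example_df.csv"}

missing = find_missing_inputs(example)
if missing:
    print("Missing input files:", missing)
```

Running a check like this before the loop reports every missing file at once instead of failing on the first one.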

## Performing Cleaning and Summary Functions

In [2]:
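import pandas as pd

def clean_data(file_path, new_file_name):
    """Clean one raw AQI CSV, save it as new_file_name, and return the dataframe."""
    df = pd.read_csv(file_path)
    # Strip stray whitespace from the column names
    df.columns = df.columns.str.strip()
    # Keep only the columns of interest; .copy() avoids a SettingWithCopyWarning
    df = df[["date", "pm25"]].copy()
    # Non-numeric pm25 entries become NaN, then those rows are dropped
    df['pm25'] = pd.to_numeric(df['pm25'], errors='coerce')
    df = df.dropna(subset=['pm25'])
    df['date'] = pd.to_datetime(df['date'])
    # Save under a new filename so the original file is left untouched
    df.to_csv(new_file_name, index=False)
    return df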

def dataset_overview(df, file_name):
    print(f"Overview of {file_name}:")
    print("Shape:", df.shape)
    print("Data Types:", df.dtypes)
    print("Missing Values:", df.isnull().sum())
    print("Data Preview (head)", df.head(), "\n")
    print("Summary Statistics:", df.describe(), "\n")

# Define the datasets
datasets = {
    "Data/edmonton-central, alberta-air-quality.csv": "cleaned_edm_df.csv",
    "Data/calgary-central 2, alberta, canada-air-quality.csv": "cleaned_cgy_df.csv",
    "Data/fort-mcmurray athabasca valley, alberta-air-quality.csv": "cleaned_fort_df.csv",
    "Data/lethbridge,-alberta, canada-air-quality.csv": "cleaned_leth_df.csv"
}

# Clean each dataset and get an overview
for original_file, cleaned_file in datasets.items():
    cleaned_df = clean_data(original_file, cleaned_file)
    dataset_overview(cleaned_df, cleaned_file)

Overview of cleaned_edm_df.csv:
Shape: (2267, 2)
Data Types: date    datetime64[ns]
pm25           float64
dtype: object
Missing Values: date    0
pm25    0
dtype: int64
Data Preview (head)         date   pm25
0 2022-09-10   22.0
1 2022-09-11   89.0
2 2022-09-12  131.0
3 2022-09-13   35.0
4 2022-09-14   70.0 

Summary Statistics:                                 date         pm25
count                           2267  2267.000000
mean   2017-12-12 23:18:42.717247488    30.345831
min              2014-08-09 00:00:00     4.000000
25%              2016-05-02 12:00:00    17.000000
50%              2017-12-07 00:00:00    25.000000
75%              2019-07-19 12:00:00    37.000000
max              2022-09-15 00:00:00   239.000000
std                              NaN    20.496925 

Overview of cleaned_cgy_df.csv:
Shape: (2974, 2)
Data Types: date    datetime64[ns]
pm25           float64
dtype: object
Missing Values: date    0
pm25    0
dtype: int64
Data Preview (head)         date  pm25
0 2024-