# Ydata-profiling Tutorial
The Flash team is excited to share a short tutorial on ydata-profiling.
Before starting, we recommend reading the [README](README.md) to get familiar with ydata-profiling and its pros and cons.

With that said, let's walk through a small example that explores the US Pollution dataset.

## Import libraries

In [None]:
import pandas as pd
from ydata_profiling.utils.cache import cache_file
from ydata_profiling import ProfileReport

## Import data

In this example we will use the US Pollution dataset. 

This dataset provides information about air quality in the United States, with an emphasis on pollutants like Nitrogen Dioxide (NO2), Sulphur Dioxide (SO2), Carbon Monoxide (CO), and Ozone (O3). 

In [None]:
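# Download the dataset once and reuse the cached copy on later runs
file_name = cache_file(
    "pollution_us_2000_2016.csv",
    "https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb",
)

# Load the CSV, using its first column as the index
df_raw = pd.read_csv(file_name, index_col=[0])
# Parse the date column so it can be used to order the time series
df_raw["Date Local"] = pd.to_datetime(df_raw["Date Local"])
# Keep the rows of a single monitoring site
site = df_raw[df_raw["Site Num"] == 9997]
site.head()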

In [None]:
profile = ProfileReport(
    site,
    tsmode=True,  # enable time-series analysis
    title="US Pollution Report",
    sortby="Date Local",  # order the rows by date before profiling
)
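
The next cell writes the report into a `reports/` folder. As far as we can tell, `to_file` does not create missing parent folders, so a minimal precaution, assuming you start from a fresh working directory, is to create the folder first with the standard library:

```python
from pathlib import Path

# Folder used by the export cells in this tutorial
output_dir = Path("reports")
# Create it if needed; do nothing if it already exists
output_dir.mkdir(parents=True, exist_ok=True)
```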

In [None]:
profile.to_file("reports/us_pollution_dataset_report.html")

## Dataset comparison

You can also compare two (or more) datasets with `compare`.
This is useful, for example, when comparing two time periods or two populations.

In [None]:
us_pollution_site_5005_report = ProfileReport(
    df_raw[df_raw["Site Num"] == 5005],
    tsmode=True,
    sortby="Date Local",
    type_schema={"SO2 1st Max Hour": "TimeSeries"},  # force this column to be treated as a time series
    title="Site 5005 US Pollution Report",
)

us_pollution_site_9997_report = ProfileReport(
    df_raw[df_raw["Site Num"] == 9997],
    tsmode=True,
    sortby="Date Local",
    type_schema={"SO2 1st Max Hour": "TimeSeries"},  # force this column to be treated as a time series
    title="Site 9997 US Pollution Report",
)

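# Build a single report that shows both sites side by side
comparison_report = us_pollution_site_5005_report.compare(us_pollution_site_9997_report)
# Assign one color to each report in the comparison
comparison_report.config.html.style.primary_colors = ["#FCC445", "#57ACD9"]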

In [None]:
comparison_report.to_file("reports/us_pollution_dataset_comparison.html")
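
The comparison above contrasts two populations (two sites). For the other case mentioned, two time periods, the data first has to be split on a date. Here is a minimal sketch using pandas only. The helper `split_by_date`, the toy DataFrame, and the cutoff date are hypothetical and only illustrate the idea; they are not part of ydata-profiling.

```python
import pandas as pd


def split_by_date(df, date_column, cutoff):
    """Return (rows strictly before cutoff, all remaining rows)."""
    cutoff = pd.Timestamp(cutoff)
    is_before = df[date_column] < cutoff
    # Rows with a missing date compare as False and end up in the second frame
    return df[is_before], df[~is_before]


# Tiny synthetic stand-in for one site's measurements
toy = pd.DataFrame(
    {
        "Date Local": pd.to_datetime(
            ["2007-12-30", "2007-12-31", "2008-01-01", "2008-01-02"]
        ),
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

before, after = split_by_date(toy, "Date Local", "2008-01-01")
print(len(before), len(after))
```

Each half can then be passed to its own `ProfileReport` (with the same `tsmode` and `sortby` settings) and the two reports compared with `compare`, exactly as in the cells above.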