## Download Data from an AWS S3 Bucket

For this tutorial, we will use the NOAA Global Historical Climatology Network Daily (GHCN-D) data available on AWS S3.
You can read more about the data on the Registry of Open Data on AWS [here](https://registry.opendata.aws/noaa-ghcn/).

More information about the dataset, including the metadata descriptions, is available on [NOAA's website](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily). 

GHCN-D contains **daily observations** from land-based stations worldwide, about two thirds of which measure precipitation only. Some records are more than *175 years* old.

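This dataset is very large, and the full set of files may not fit in memory on a typical machine. To analyze it in Python, this tutorial uses Dask DataFrame, which splits the data into partitions and processes them lazily.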

In [None]:
# Download one year of data
! aws s3 cp --no-sign-request s3://noaa-ghcn-pds/csv/by_year/2022.csv .

In [None]:
# Download all data since 2020
! aws s3 cp --no-sign-request s3://noaa-ghcn-pds/csv/by_year/ . --recursive --exclude="*" --include="202*"

## Import Packages

In [None]:
import dask.dataframe as dd

In [None]:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
client

## Read One CSV file

In [None]:
# Read a CSV file
df = dd.read_csv("2022.csv", dtype={'Q_FLAG': 'object'})

In [None]:
df.npartitions

In [None]:
# Read the same CSV file with a smaller block size (25 MB) to get more partitions
df = dd.read_csv("2022.csv", dtype={'Q_FLAG': 'object'}, blocksize=25e6)

In [None]:
df.npartitions
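
The partition count reported above is driven mainly by file size and `blocksize`. As a rough, Dask-free sketch, the count for one file is approximately its size in bytes divided by the block size. The helper `estimate_partitions` and the 1 GB file size are hypothetical, chosen only for illustration, and the number Dask actually reports can differ slightly depending on the version.

```python
import math


def estimate_partitions(file_size_bytes, blocksize_bytes):
    """Roughly estimate how many partitions one file is split into.

    Hypothetical helper for illustration only; it is not part of Dask.
    """
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / blocksize_bytes)


# Assumed file size of 1 GB; check the real size with os.path.getsize("2022.csv").
assumed_size = 1_000_000_000

print(estimate_partitions(assumed_size, 25e6))   # 25 MB blocks
print(estimate_partitions(assumed_size, 100e6))  # 100 MB blocks
```

Smaller blocks give more, smaller partitions, which lowers the memory needed per task at the cost of more scheduling overhead.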

In [None]:
# Load the whole year into memory as a pandas DataFrame
df.compute()

## Read Multiple CSVs

In [None]:
large_df = dd.read_csv("*.csv", dtype={'Q_FLAG': 'object'}, blocksize=25e6)

In [None]:
large_df.npartitions

In [None]:
large_df

In [None]:
# This will likely fail: compute() would load every file into memory as one pandas DataFrame
# large_df.compute()

In [None]:
mean_values = large_df.groupby("ELEMENT")["DATA_VALUE"].mean()

In [None]:
mean_values.compute()
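
The grouped mean above can be computed without ever holding all rows in memory, because a mean decomposes into per-partition sums and counts that are combined at the end. The following is a conceptual sketch in plain Python, not Dask's actual implementation; the two small `partitions` lists and their values are made up for illustration.

```python
from collections import defaultdict

# Two made-up "partitions" of (ELEMENT, DATA_VALUE) rows.
partitions = [
    [("PRCP", 10), ("TMAX", 200), ("PRCP", 30)],
    [("PRCP", 20), ("TMAX", 100)],
]


def partial_sums(rows):
    """Reduce one partition to {element: [sum, count]}."""
    acc = defaultdict(lambda: [0, 0])
    for element, value in rows:
        acc[element][0] += value
        acc[element][1] += 1
    return acc


def combine_means(partials):
    """Merge per-partition [sum, count] pairs and divide once at the end."""
    total = defaultdict(lambda: [0, 0])
    for part in partials:
        for element, (s, n) in part.items():
            total[element][0] += s
            total[element][1] += n
    return {element: s / n for element, (s, n) in total.items()}


means = combine_means(partial_sums(p) for p in partitions)
print(means)
```

Averaging the per-partition means directly would give the wrong answer when partitions hold different numbers of rows, which is why sums and counts are carried separately.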

In [None]:
# Without compute(), this displays only the lazy Dask Series, not the values
mean_values

In [None]:
# Keep only the rows for a single station, selected by its ID
worcester_df = large_df[large_df["ID"].isin(["US1MAWR0097"])]

In [None]:
worcester_df

In [None]:
worcester_mean = worcester_df.groupby("ELEMENT")["DATA_VALUE"].mean()

In [None]:
worcester_mean

In [None]:
worcester_mean.compute()

In [None]:
# Task: find the station with the highest number of snow days in a year
