# Get Hourly Station Data from GHCNh
## Written By Jared Rennie (@jjrennie)

Taps into the NCEI website to get hourly station data for a particular year. This repo also has a file titled 'get_ghcnh_por.sh' that gets data for a station's entire period of record, but fair warning: those files are very large (hundreds of MBs).

- GHCNh Info: https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly
- GHCNh Documentation: https://www.ncei.noaa.gov/oa/global-historical-climatology-network/hourly/doc/ghcnh_DOCUMENTATION.pdf

### What You Need

First off, the entire codebase works in Python 3. In addition to base Python, you will need the following packages installed:
- pandas (to slice and dice the data)
- pyarrow or fastparquet (for the parquet data)

The "easiest" way to install these is by installing <a href='https://www.anaconda.com' target="_blank">anaconda</a>, and then adding the <a href='https://conda-forge.org/' target="_blank">conda-forge</a> channel. Afterward, you can install the above packages.

### Importing Packages
Assuming you did the above, it should (in theory) import everything no problem:

In [1]:
# Import Packages
import sys
import pandas as pd

print("SUCCESS!")

SUCCESS!


If you made it this far, great!
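One catch: importing pandas does not confirm that a parquet engine is installed, and you only find out when you try to read a file. Here is a minimal sketch of a check you could run first. The helper name `available_parquet_engines` is made up for this example, and it only tests whether the packages can be found, not whether they work.

```python
from importlib.util import find_spec


def available_parquet_engines():
    """Return the names of the parquet engines Python can locate."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in ('pyarrow', 'fastparquet') if find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print('Parquet engine(s) found:', ', '.join(engines))
else:
    print('No parquet engine found; install pyarrow or fastparquet.')
```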

### Insert Arguments
Since a station's entire period of record is very large, the database not only splits up data by year, but also uses the parquet file format. No need to worry though, we can read this data using pandas.

As such, the codebase requires two inputs. The first is the station ID, and the second is the year. The ID is 11 characters long, and you can find the list of stations <a href='https://www.ncei.noaa.gov/oa/global-historical-climatology-network/hourly/doc/ghcnh-station-list.csv' target='blank'>here</a>.
- example: Asheville Airport is USW00003812

If you're not sure, you can refer to the documentation above.

 **Change the arguments below to your liking**

In [2]:
# Insert Arguments Here
stationID = 'USW00003812'
inYear=2023
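If you want a quick sanity check before hitting the website, something like the sketch below can catch obvious typos. The function `check_inputs` is a hypothetical helper, not part of this repo, and the year bounds (a four-digit year no later than the current one) are an assumption. It does not confirm that the station or year actually exists in GHCNh.

```python
from datetime import date


def check_inputs(station_id, year):
    """Return a list of problems found with the inputs (empty if none)."""
    problems = []
    # Station IDs are 11 characters long, e.g. USW00003812
    if len(station_id) != 11:
        problems.append(f'station ID should be 11 characters, got {len(station_id)}')
    # Assumed bounds: a four-digit integer year that is not in the future
    if not isinstance(year, int) or not 1000 <= year <= date.today().year:
        problems.append(f'year looks wrong: {year!r}')
    return problems


for problem in check_inputs('USW00003812', 2023):
    print('CHECK YOUR INPUTS:', problem)
```

An empty list means nothing obvious is wrong, so the example above prints nothing.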

The rest of the code should work without making any changes to it, but if you're interested, keep on reading to see how the sausage is made.

This next block of code will attempt to access the data we want from the NCEI website

In [3]:
# DEFINE URL (parquet)
ghcnh_url = ('https://www.ncei.noaa.gov/oa/global-historical-climatology-network/hourly/access/by-year/'
             f'{inYear}/parquet/GHCNh_{stationID}_{inYear}.parquet')

# Pull Data into Pandas DataFrame
try:
  ghcnhPandas = pd.read_parquet(ghcnh_url)
except Exception as e:
  # sys.exit accepts a single argument, so build one message string
  sys.exit(f'SOMETHING WENT WRONG: {e}')
print('SUCCESS')

SUCCESS


If it says "SUCCESS" then congrats, you got the data!

### Let's check the data!
How does it look? 

In [4]:
print(ghcnhPandas)

        Station_ID  Station_name                 DATE   Latitude  Longitude  \
0      USW00003812  ASHEVILLE AP  2023-01-01T00:00:00  35.431702   -82.5378   
1      USW00003812  ASHEVILLE AP  2023-01-01T00:43:00  35.431702   -82.5378   
2      USW00003812  ASHEVILLE AP  2023-01-01T00:52:00  35.431702   -82.5378   
3      USW00003812  ASHEVILLE AP  2023-01-01T00:54:00  35.431702   -82.5378   
4      USW00003812  ASHEVILLE AP  2023-01-01T01:10:00  35.431702   -82.5378   
...            ...           ...                  ...        ...        ...   
13475  USW00003812  ASHEVILLE AP  2023-12-31T20:54:00  35.431702   -82.5378   
13476  USW00003812  ASHEVILLE AP  2023-12-31T21:00:00  35.431702   -82.5378   
13477  USW00003812  ASHEVILLE AP  2023-12-31T21:54:00  35.431702   -82.5378   
13478  USW00003812  ASHEVILLE AP  2023-12-31T22:54:00  35.431702   -82.5378   
13479  USW00003812  ASHEVILLE AP  2023-12-31T23:54:00  35.431702   -82.5378   

        Elevation  temperature temperature_Measurem

Whoa! That's a lot of columns! How much data are we talking about here?

In [5]:
for column in ghcnhPandas.columns:
    print(column)

Station_ID
Station_name
DATE
Latitude
Longitude
Elevation
temperature
temperature_Measurement_Code
temperature_Quality_Code
temperature_Report_Type
temperature_Source_Code
temperature_Source_Station_ID
dew_point_temperature
dew_point_temperature_Measurement_Code
dew_point_temperature_Quality_Code
dew_point_temperature_Report_Type
dew_point_temperature_Source_Code
dew_point_temperature_Source_Station_ID
station_level_pressure
station_level_pressure_Measurement_Code
station_level_pressure_Quality_Code
station_level_pressure_Report_Type
station_level_pressure_Source_Code
station_level_pressure_Source_Station_ID
sea_level_pressure
sea_level_pressure_Measurement_Code
sea_level_pressure_Quality_Code
sea_level_pressure_Report_Type
sea_level_pressure_Source_Code
sea_level_pressure_Source_Station_ID
wind_direction
wind_direction_Measurement_Code
wind_direction_Quality_Code
wind_direction_Report_Type
wind_direction_Source_Code
wind_direction_Source_Station_ID
wind_speed
wind_speed_Measurement_Co

OK, that seems excessive. How about we only get the columns we want? Feel free to edit this block to get the data you specifically want.

In [6]:
# Get a subset of Data based on User inputs
subsetCols=['Station_ID','Station_name','DATE','Latitude','Longitude','Elevation',
            'temperature','dew_point_temperature','sea_level_pressure','wind_direction',
            'wind_speed','wind_gust','precipitation','visibility','remarks']
subsetPandas = ghcnhPandas[subsetCols] 
print(subsetPandas)

        Station_ID  Station_name                 DATE   Latitude  Longitude  \
0      USW00003812  ASHEVILLE AP  2023-01-01T00:00:00  35.431702   -82.5378   
1      USW00003812  ASHEVILLE AP  2023-01-01T00:43:00  35.431702   -82.5378   
2      USW00003812  ASHEVILLE AP  2023-01-01T00:52:00  35.431702   -82.5378   
3      USW00003812  ASHEVILLE AP  2023-01-01T00:54:00  35.431702   -82.5378   
4      USW00003812  ASHEVILLE AP  2023-01-01T01:10:00  35.431702   -82.5378   
...            ...           ...                  ...        ...        ...   
13475  USW00003812  ASHEVILLE AP  2023-12-31T20:54:00  35.431702   -82.5378   
13476  USW00003812  ASHEVILLE AP  2023-12-31T21:00:00  35.431702   -82.5378   
13477  USW00003812  ASHEVILLE AP  2023-12-31T21:54:00  35.431702   -82.5378   
13478  USW00003812  ASHEVILLE AP  2023-12-31T22:54:00  35.431702   -82.5378   
13479  USW00003812  ASHEVILLE AP  2023-12-31T23:54:00  35.431702   -82.5378   

        Elevation  temperature  dew_point_temperatu

That's better! Keep in mind the values are in metric units, so if you want imperial, you'll need to convert them.
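As a rough sketch of that conversion, here are three small helpers. They assume temperature is in degrees Celsius, wind speed in meters per second, and precipitation in millimeters, so check the GHCNh documentation linked at the top before trusting them. The function names are made up for this example.

```python
def celsius_to_fahrenheit(temp_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32


def ms_to_mph(speed_ms):
    """Convert meters per second to miles per hour."""
    return speed_ms * 2.236936


def mm_to_inches(depth_mm):
    """Convert millimeters to inches."""
    return depth_mm / 25.4


print(celsius_to_fahrenheit(0))   # 32.0
print(mm_to_inches(25.4))         # 1.0
```

Because these use plain arithmetic, they should also work on a whole column at once, for example `celsius_to_fahrenheit(subsetPandas['temperature'])`.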

### Output to CSV
The last step outputs the data to a CSV file so you can load it and play with it elsewhere. Or you can keep playing with it here. You do you!

In [7]:
# Send to CSV
outFile='./'+stationID+'_Hourly_'+str(inYear)+'.csv'
subsetPandas.to_csv(outFile,index=False)

# If you made it this far. Success!
print('SUCCESSFULLY GOT DATA AND PUT IN:',outFile)

SUCCESSFULLY GOT DATA AND PUT IN: ./USW00003812_Hourly_2023.csv


That's it! Now you have data for 1 station/1 year from the database. What if you want to get multiple stations? Multiple years? It's doable!
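One way to do it is to loop over every station/year combination and reuse the same URL pattern as above. This is only a sketch: `ghcnh_urls` and `fetch_many` are hypothetical helpers, not part of this repo, and the `reader` argument exists just so the loop can be tried without downloading anything.

```python
from itertools import product

BASE_URL = ('https://www.ncei.noaa.gov/oa/global-historical-climatology-network/'
            'hourly/access/by-year')


def ghcnh_urls(station_ids, years):
    """Yield (station_id, year, url) for every station/year combination."""
    for station_id, year in product(station_ids, years):
        url = f'{BASE_URL}/{year}/parquet/GHCNh_{station_id}_{year}.parquet'
        yield station_id, year, url


def fetch_many(station_ids, years, reader=None):
    """Read each file with `reader`, collecting failures instead of stopping."""
    if reader is None:
        # Imported here so the URL helper above works without pandas installed
        import pandas as pd
        reader = pd.read_parquet
    frames, failed = {}, []
    for station_id, year, url in ghcnh_urls(station_ids, years):
        try:
            frames[(station_id, year)] = reader(url)
        except Exception as e:
            # A missing station/year should not stop the other downloads
            failed.append((station_id, year, str(e)))
    return frames, failed


# Show the URLs that would be requested for one station over three years
for station_id, year, url in ghcnh_urls(['USW00003812'], range(2021, 2024)):
    print(station_id, year, url)
```

Calling `frames, failed = fetch_many(['USW00003812'], range(2021, 2024))` would then download the three files, and `pd.concat(frames.values(), ignore_index=True)` would stack them into one DataFrame. Remember that each file can be large, so start with a short list.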

**Congrats on completing this notebook! Now go forth and get your data!**