# GSOD ETL Harvester Framework - Download Station Location Metadata File

This notebook downloads the station metadata file from NOAA, cleans it up, and stores it in MongoDB.

https://www.ncei.noaa.gov/pub/data/noaa/isd-history.csv

## Weather Station Location File - Global Surface Summary of the Day (GSOD)

The station metadata history file contains the location and other details of each station around the world, which can be used to fuse weather data with COVID data.
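
As a rough illustration of what one record looks like, the sketch below parses a single row with the standard `csv` module. The header names are an assumption about the layout of `isd-history.csv`, the data row is invented rather than taken from the real file, and `parse_station` and `to_float` are hypothetical helpers.

```python
import csv
import io

# Assumed header layout of isd-history.csv; the data row is invented
# for illustration and does not describe a real station.
SAMPLE = '''"USAF","WBAN","STATION NAME","CTRY","STATE","ICAO","LAT","LON","ELEV(M)","BEGIN","END"
"999999","00001","EXAMPLE STATION","US","VA","","+38.000","-077.500","+0050.0","20000101","20201231"
'''

def to_float(text):
    """Return a float, or None when the field is empty."""
    text = text.strip()
    return float(text) if text else None

def parse_station(row):
    """Keep the identifying and location fields of one CSV row."""
    return {
        "usaf": row["USAF"],
        "wban": row["WBAN"],
        "name": row["STATION NAME"],
        "country": row["CTRY"],
        "lat": to_float(row["LAT"]),
        "lon": to_float(row["LON"]),
        "elev_m": to_float(row["ELEV(M)"]),
    }

stations = [parse_station(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(stations[0])
```

Keeping the station codes as strings preserves leading zeros, which matters when they are later used as join keys.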

# Changelog / To-Do  

 * **2020-05-17**: extracted file
 * **2020-06-11**: uploaded data to MongoDB

**To-do**

* Figure out how to match the station ID and name from the GSOD data to this data set
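
One possible starting point for the matching to-do, offered as an assumption rather than a confirmed fix: GSOD daily files appear to identify a station by an 11-character ID made of the 6-character USAF code followed by the 5-digit WBAN code. `pd.read_csv` may parse those columns as integers and drop leading zeros, so the sketch below re-pads them. `make_station_key` is a hypothetical helper name.

```python
def make_station_key(usaf, wban):
    """Build an 11-character station key from USAF and WBAN codes.

    Assumes the GSOD station ID is USAF (6 chars) + WBAN (5 digits);
    verify this against the GSOD files before relying on it.
    """
    usaf_text = str(usaf).strip().zfill(6)
    wban_text = str(wban).strip().zfill(5)
    return usaf_text + wban_text

# Works for values that were parsed as integers or kept as strings
print(make_station_key(7018, 99999))
print(make_station_key("A00001", "03812"))
```

Reading the CSV with `pd.read_csv(local_pathname, dtype={"USAF": str, "WBAN": str})` would keep the codes as text and make the padding step unnecessary.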


In [9]:
%%bash
python -V
pwd

Python 3.10.2
/Users/owenmccusker/Documents/repos-git/covid-fusion/notebooks/ETL


In [2]:
import os
from os import path
import logging
import requests
import pandas as pd
import pymongo
import json

gsod_noaa_url = "https://www.ncei.noaa.gov/pub/data/noaa/"
gsod_data_dir = "../../data/interim/weather/gsod/"
gsod_metadata_history_filename = "isd-history.csv"
database_name = 'covid_fusion'
collection_name = 'gsod_station_metadata_isd_history'

########################################################    
#
def setup_custom_logger(name):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Attach a handler only once: re-running this cell would otherwise add
    # another handler and print every log line twice.
    if not logger.handlers:
        formatter = logging.Formatter(fmt='%(asctime)s - %(levelname)s - %(funcName)s - %(module)s - %(message)s')
        handler = logging.StreamHandler()
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger

########################################################    
#
def shutdown_etl_harvester():
    # remember to close the handlers
    for handler in logger.handlers:
        handler.close()

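def get_gsod_station_location_file(url_pathname, local_pathname):
    # A timeout keeps the notebook from hanging on a stalled connection
    r = requests.get(url_pathname, timeout=60)

    # Write the file only on a successful response; checking for != 404
    # alone would also save error pages such as a 500 response.
    if r.ok:
        with open(local_pathname, 'wb') as fp:
            fp.write(r.content)
    else:
        logger.error("Response %s: Extracting gsod data file: (%s)", r.status_code, url_pathname)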

def store_metadata_file(db_name, collection_name, local_pathname):
    #    mg_client = pymongo.MongoClient('localhost', 27012)
    # Connect to MongoDB using the client defaults
    mg_client = pymongo.MongoClient()
    mg_db = mg_client[db_name]
    db_cm = mg_db[collection_name]
    
    data = pd.read_csv(local_pathname)
    # Round-trip through JSON so NaN becomes None and values are plain Python types
    data_json = json.loads(data.to_json(orient='records'))
    # Replace the collection contents: delete all documents, then insert the new ones
    result = db_cm.delete_many({})
    logger.info('delete_many: %s', result.acknowledged)
    logger.info('delete_many: num docs deleted: %s', result.deleted_count)
    result = db_cm.insert_many(data_json)
    mg_client.close()
        
########################################################    
# Download and clean station meta data file
########################################################
local_pathname =  gsod_data_dir + gsod_metadata_history_filename
url_pathname = gsod_noaa_url + gsod_metadata_history_filename

logger = setup_custom_logger('GSOD-ETL-Station-Metadata')

logger.info('Starting ETL of station metadata file')

# you can call logger = logging.getLogger('GSOD-ETL-Station-Metadata') if needed to initialize logger

if not path.exists(local_pathname):
    logger.info("Downloading station location metadata file from: " + url_pathname)
    get_gsod_station_location_file(url_pathname, local_pathname)
else:
    logger.info("File %s already exists", local_pathname) 
    
store_metadata_file(database_name, collection_name, local_pathname)

logger.info('Finished ETL of station metadata file')

# Close the handlers last so the final message is logged before they are closed
shutdown_etl_harvester()


2024-10-27 21:43:13,995 - INFO - <module> - 3213728827 - Starting ETL of station metadata file
2024-10-27 21:43:13,995 - INFO - <module> - 3213728827 - Starting ETL of station metadata file
2024-10-27 21:43:13,997 - INFO - <module> - 3213728827 - File ../../data/interim/weather/gsod/isd-history.csv already exists
2024-10-27 21:43:13,997 - INFO - <module> - 3213728827 - File ../../data/interim/weather/gsod/isd-history.csv already exists
2024-10-27 21:43:14,417 - INFO - store_metadata_file - 3213728827 - delete_many: True
2024-10-27 21:43:14,417 - INFO - store_metadata_file - 3213728827 - delete_many: True
2024-10-27 21:43:14,418 - INFO - store_metadata_file - 3213728827 - delete_many: num docs deleted: 29561
2024-10-27 21:43:14,418 - INFO - store_metadata_file - 3213728827 - delete_many: num docs deleted: 29561
2024-10-27 21:43:14,785 - INFO - <module> - 3213728827 - Finished ETL of station metadata file
2024-10-27 21:43:14,785 - INFO - <module> - 3213728827 - Finished ETL of station me
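

The download step in the cell above writes straight to the target path, so an interrupted download can leave a partial file that the `path.exists` check later treats as complete. A minimal sketch of one way to avoid that, assuming it is acceptable to write to a temporary file in the same directory and rename it afterwards; `save_download` is a hypothetical helper and is not part of the notebook.

```python
import os
import tempfile

def save_download(content, local_pathname):
    """Write bytes to local_pathname via a temp file and an atomic rename."""
    target_dir = os.path.dirname(local_pathname) or "."
    os.makedirs(target_dir, exist_ok=True)
    # Create the temp file in the target directory so os.replace
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(content)
        os.replace(tmp_path, local_pathname)
    except BaseException:
        # Remove the partial file if writing or renaming failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage with throwaway data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "gsod", "isd-history.csv")
    save_download(b"USAF,WBAN\n", demo_path)
    with open(demo_path, "rb") as fp:
        saved = fp.read()
    leftovers = [n for n in os.listdir(os.path.dirname(demo_path)) if n.endswith(".part")]
print(saved, leftovers)
```

In the notebook, `get_gsod_station_location_file` could pass `r.content` to such a helper instead of opening the target file directly.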