An R workflow for curation of Philippine Atmospheric, Geophysical, and Astronomical Services Administration (PAGASA) datasets
This repository is a docker-containerised, {targets}-based,
{renv}-enabled R workflow for the retrieval, processing, and curation
of various publicly available Philippine Atmospheric, Geophysical, and
Astronomical Services Administration (PAGASA) datasets.
The word paglaom (pronounced as /paɡˈlaʔom/, [pʌɡˈl̪a.ʔɔm]) is
Bisaya (one of up to 187 languages spoken in the Philippines in addition
to Filipino, which is the national language, and English, which is the
language of instruction in the country) for hope.
PAGASA, the national meteorological and
hydrological services agency of the Philippines, draws its name from the
Filipino word pag-asa, which means hope. The repository name is hence
a play on these words and also a way to showcase the richness
and diversity that exist in the Philippines.
The paglaom project aims to maintain a database of curated datasets on
various atmospheric, geophysical, and astronomical phenomena that are
made publicly available by PAGASA on their website. These datasets tend
to be summaries of the multitude of data that PAGASA collects on a
high-frequency basis. They also tend to be in formats that are not
machine-readable (e.g., PDF, PNG, HTML formats) and are meant for
reporting to the Philippine population. PAGASA does provide more granular
and expansive datasets through a specific data request process. The
paglaom project aims to showcase publicly available PAGASA data that
can be used for various purposes, some of which are:
- for students who need to make a report on topics covered by PAGASA’s summarised data for a school assignment or project;
- for individuals who have a specific interest in one of the natural phenomena that PAGASA monitors and would like to get raw summarised data in a format that is usable and transferable into other formats;
- for data visualisation learners and aficionados who want to try working with data about the various natural phenomena available from PAGASA and create unique and interesting plots and graphics.
The broader and more blue-skies vision of the paglaom project is to
contribute to the increasing interest in science, technology,
engineering, and mathematics (STEM) subjects, particularly in the
Philippines, with a collection that showcases topics and data that are
homegrown and embedded into the fabric of Philippine life.
Whilst the paglaom project, by its name and the nature of the data it
curates, has an inherent Filipino audience, it is hoped that those
outside of the Philippines will also find the information within useful
in contexts similar to those described above.
The project repository is structured as follows:
paglaom
|-- .github/
|-- data/
|-- data-raw/
|-- outputs/
|-- R/
|-- reports/
|-- renv/
|-- renv.lock
|-- .Rprofile
|-- packages.R
|-- _targets_climate.R
|-- _targets_cyclones.R
|-- _targets_dam.R
|-- _targets_heat.R
|-- _targets_setup.R
|-- _targets.R
- .github/ contains workflows for project testing and automated deployment of outputs via continuous integration and continuous deployment (CI/CD) using GitHub Actions.
- data/ contains intermediate and final data outputs produced by the workflow.
- data-raw/ contains raw datasets downloaded from publicly available PAGASA sources that are used in the project. This directory is empty in the repository because the raw datasets from PAGASA come in large file formats that are not ideal for git versioning, so they are git-ignored. The directory is kept to preserve the project directory structure and to ensure that the workflow runs as expected when run locally.
- outputs/ contains compiled reports and figures produced by the workflow.
- R/ contains functions created specifically for use in this workflow.
- reports/ contains literate code for R Markdown reports rendered in the workflow.
- renv/ contains renv package-specific files and directories used by the package for maintaining R package dependencies within the project. The directory renv/library is a library that contains all packages currently used by the project. This directory, and all files and sub-directories within it, are generated and managed by the renv package. Users should not edit these manually.
- renv.lock is the renv lockfile, which records enough metadata about every package used in this project that it can be re-installed on a new machine. This file is generated by the renv package and should not be edited manually.
- .Rprofile is a project R profile generated when initiating renv for the first time. This file is run automatically every time R is started within this project, and renv uses it to configure the R session to use the renv project library.
- packages.R lists all R package dependencies required by the workflow.
- _targets*.R files define the steps in the workflow’s data ingest, data processing, data analysis, and reporting pipelines.
This project was built using R 4.4.1. This project uses the renv
framework to record R package dependencies and versions. Packages and
versions used are recorded in renv.lock, and code used to manage
dependencies is in renv/ and other files in the root project
directory. After cloning this repository, start an R session in the
project’s working directory and then run
renv::restore()
to install all R package dependencies.
Currently, the project has workflows that curate the following datasets:
- Tropical cyclones data for various cyclones entering the Philippine area of responsibility since 2017;
- Daily heat index data from various data collection points in the Philippines;
- Climatological extremes and normals data over time;
- Daily dam water level data; and,
- Daily weather forecasts.
The following diagram illustrates these workflows:
graph LR
style Graph fill:#FFFFFF00,stroke:#000000;
subgraph Graph
direction LR
x28812d8ea4d86f19(["forecasts_archive_pdfs"]):::outdated --> xcfcf871ca951fc3f["forecasts_data_raw"]:::outdated
x567709ab5f0adc71(["heat_index_pubfiles_url"]):::uptodate --> x56bd7c118ed46a38(["heat_index_links"]):::outdated
xb48a3b157c96bffd(["climate_pubfiles_url"]):::uptodate --> xa37a01adfb45bd68(["climate_directory_urls"]):::uptodate
xc77bb431ac7c3081(["dam_level_url"]):::uptodate --> x0de96327cc07b160(["dam_level_data"]):::outdated
x56bd7c118ed46a38(["heat_index_links"]):::outdated --> xc432bd4e21a7b9fa(["heat_index_links_dates"]):::outdated
xa37a01adfb45bd68(["climate_directory_urls"]):::uptodate --> x26b861c7a0a21b52["climate_pdf_urls"]:::uptodate
x6f87cfcc96bb274d(["cyclone_reports_links"]):::outdated --> x1cf596d0c4f824b5["cyclone_reports_download_files"]:::outdated
xc044beb81380bb4a["climate_download_files"]:::uptodate --> x42da7c0722c063a6(["climate_data_normals_1991_2020"]):::outdated
x56bd7c118ed46a38(["heat_index_links"]):::outdated --> x113a83dcec46090f(["heat_index_links_urls"]):::outdated
xc11d790635c14420(["forecasts_pubfiles_urls"]):::uptodate --> x7dd7e06e8ff4bc40["forecasts_download_files"]:::uptodate
xc044beb81380bb4a["climate_download_files"]:::uptodate --> xe06460aefd475ca2(["climate_data_extremes_2020"]):::outdated
xc044beb81380bb4a["climate_download_files"]:::uptodate --> xc83b489a1c433852(["climate_data_extremes_2021"]):::outdated
xc044beb81380bb4a["climate_download_files"]:::uptodate --> x9b64b30afbfc8ff9(["climate_data_extremes_2022"]):::outdated
xc044beb81380bb4a["climate_download_files"]:::uptodate --> xf620d5783ff15609(["climate_data_extremes_2023"]):::outdated
x7255575025352eb6(["dam_level_data_files"]):::uptodate --> xd2c3c65ab78d2c70(["dam_level_data_processed"]):::uptodate
x26b861c7a0a21b52["climate_pdf_urls"]:::uptodate --> xc044beb81380bb4a["climate_download_files"]:::uptodate
xc432bd4e21a7b9fa(["heat_index_links_dates"]):::outdated --> x4f749438c4164b8e["heat_index_download_files"]:::outdated
x113a83dcec46090f(["heat_index_links_urls"]):::outdated --> x4f749438c4164b8e["heat_index_download_files"]:::outdated
xd2c3c65ab78d2c70(["dam_level_data_processed"]):::uptodate --> xfa0b497de91938bb(["dam_level_data_csv"]):::uptodate
x0de96327cc07b160(["dam_level_data"]):::outdated --> x202d34e7af3ea1c1(["dam_level_data_raw_csv"]):::outdated
x6e2ddacb746493ac(["forecasts_agriculture_archive_pdfs"]):::uptodate --> x6e2ddacb746493ac(["forecasts_agriculture_archive_pdfs"]):::uptodate
x4b1f33bb14e8a195(["forecasts_agriculture_download_files"]):::outdated --> x4b1f33bb14e8a195(["forecasts_agriculture_download_files"]):::outdated
end
To run any of these workflows, run the following command in the R console:
targets::tar_make(dplyr::starts_with("PREFIX"))
replacing "PREFIX" with the keyword for the type of data. For example,
to run the cyclones workflow from the R console:
targets::tar_make(dplyr::starts_with("cyclone"))
or from the command line/terminal as follows:
Rscript -e "targets::tar_make(dplyr::starts_with('cyclone'))"
Running specific components of a workflow involves specifying a target name or target names of the components you want to run. You should be able to run a full workflow path by just specifying the name of the last target in the workflow sequence. For example, the following will run the entire cyclones data workflow (as an alternative to what is shown above):
targets::tar_make(cyclones_peak_data_csv)
or from the command line/terminal as follows:
Rscript -e "targets::tar_make(cyclones_peak_data_csv)"
The target cyclones_peak_data_csv is the last target of the cyclones
data workflow. Hence, producing the cyclones_peak_data_csv target
requires running the whole series of linked targets upstream of it.
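The idea that a final target pulls in everything upstream of it can be illustrated with a small base R sketch. This is not how {targets} resolves dependencies internally; it only walks an edge list by hand. The edges are copied from the climate branch of the workflow diagram in this README (the cyclones edges are not shown there), and upstream_of() is a hypothetical helper that is not part of this project.

```r
# Edges (from -> to) copied from the climate branch of the workflow diagram
edges <- data.frame(
  from = c("climate_pubfiles_url", "climate_directory_urls",
           "climate_pdf_urls", "climate_download_files"),
  to   = c("climate_directory_urls", "climate_pdf_urls",
           "climate_download_files", "climate_data_normals_1991_2020"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect every target that sits upstream of `target`
upstream_of <- function(target, edges) {
  found <- character(0)
  queue <- target
  while (length(queue) > 0) {
    # Direct parents of the targets currently in the queue
    parents <- edges$from[edges$to %in% queue]
    # Keep only parents that have not been visited yet
    parents <- setdiff(parents, found)
    found <- c(found, parents)
    queue <- parents
  }
  found
}

upstream <- upstream_of("climate_data_normals_1991_2020", edges)
print(upstream)
```

Under these assumptions, asking for the last climate target lists the four targets that would have to be up to date before it can be built.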
If you would like to run a set of interrelated but not fully linked
targets, you will need to specify more than one target name. For this,
you can use tidyselect approaches to name the targets to be run. For
example:
targets::tar_make(dplyr::starts_with(c("cyclone", "dam")))
will run all targets in the cyclones and dam levels data workflows.
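To preview which target names a set of prefixes would pick up before running anything, the matching can be approximated in base R. This is only a sketch of prefix matching, not the tidyselect implementation itself. The target names are a subset copied from the workflow diagram in this README, and match_prefixes() is a hypothetical helper that is not part of this project.

```r
# A subset of target names copied from the workflow diagram
target_names <- c(
  "heat_index_links", "dam_level_url", "dam_level_data",
  "climate_pdf_urls", "cyclone_reports_links",
  "cyclone_reports_download_files"
)

# Hypothetical helper: keep names that start with any of the given prefixes
match_prefixes <- function(names, prefixes) {
  # One logical vector per prefix, combined with element-wise OR
  keep <- Reduce(`|`, lapply(prefixes, function(p) startsWith(names, p)))
  names[keep]
}

selected <- match_prefixes(target_names, c("cyclone", "dam"))
print(selected)
```

With these names, the two prefixes select the dam level and cyclone report targets and leave the heat index and climate targets untouched.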
The project also has a workflow for weekly GitHub release of the various raw datasets.
graph LR
style Graph fill:#FFFFFF00,stroke:#000000;
subgraph Graph
direction LR
x8e2305bde709e13c(["paglaom_weekly_release_tag"]):::outdated --> x2ee1cead5469690e(["paglaom_weekly_release"]):::outdated
end
All code created through this project (found in this repository) is released under a GPL-3.0 license.
Data provided through this project are released under a CC0 license.