Split up overall ts files #647

Closed
AugustinMortier opened this issue May 13, 2022 · 5 comments · Fixed by #707
Labels
aeroval-tools Issues related to AeroVal web tools

Comments

@AugustinMortier
Member

The overall timeseries (located in hm/ts/) should be split into multiple files, as they can become very large (>100 MB) when an experiment covers many models and observations. All models should stay in the same file, but we could split the data at least per observation and perhaps per region.

@AugustinMortier AugustinMortier added the aeroval-tools Issues related to AeroVal web tools label May 13, 2022
@jgriesfeller
Member

A common practice from SQL databases may help here: keep record sizes as small as possible (and as big as necessary).
In our case, the files we create should contain no more information than the current visualisation needs.
Maybe we should review how the data is organised. Does anybody besides me have some database experience? Remember that we might want to put all the data into a database at some point, so that effort would not be lost.
In addition, I need to know the data format to improve the parallelism.

@lewisblake lewisblake self-assigned this May 30, 2022
@jgriesfeller
Member

Since I just had a look at the structure of all the json files for my parallelization efforts, here's a little documentation:
The file hm/ts/stats_ts.json looks like this:

{
  "concpm25": {
    "AN-EEA-MP": {
      "Surface": {
        "IFS-OSUITE": {
          "sconcpm25": {
            "WORLD": {
              "1610668800000": {
                "totnum": 62403,
                "num_valid": 38742,
                "refdata_mean": 25.35994,
                "refdata_std": 29.8055,
                "data_mean": 28.44732,
                "data_std": 38.01789,
                "weighted": 0,
                "rms": 30.18373,
                "R": 0.63196,
                "R_spearman": 0.81003,
                "R_kendall": 0.60445,
                "nmb": 0.12174,
                "mnmb": -0.04012,
                "fge": 0.54547,
                "num_coords_tot": 2013,
                "num_coords_with_data": 1881
              },

So it should be easy to split that by obs network and variable, as is already done for e.g. the map files.
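As a rough illustration of such a split, here is a minimal sketch. It assumes the nesting levels of the excerpt above are variable, obs network, vertical layer, model, model variable, region and timestamp; the function name `split_stats_ts` and the `obsnetwork-variable-layer.json` file name pattern are hypothetical and not taken from the pyaerocom code.

```python
import json
from pathlib import Path


def split_stats_ts(stats: dict, out_dir: Path) -> list:
    """Write one JSON file per (obs network, variable, vertical layer)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for var_name, by_obs in stats.items():
        for obs_name, by_layer in by_obs.items():
            for layer, by_model in by_layer.items():
                # All models stay together in the same file.
                # Hypothetical file name pattern, chosen for illustration.
                path = out_dir / f"{obs_name}-{var_name}-{layer}.json"
                path.write_text(json.dumps(by_model))
                written.append(path)
    return written
```

Each output file then starts at the model level, so a reader only has to load the file for the obs network, variable and layer currently displayed.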

@avaldebe
Collaborator

Nice, here is the pydantic model generated by jsontopydantic:

from __future__ import annotations

from pydantic import BaseModel, Field


class Field1610668800000(BaseModel):
    totnum: int
    num_valid: int
    refdata_mean: float
    refdata_std: float
    data_mean: float
    data_std: float
    weighted: int
    rms: float
    r: float = Field(..., alias='R')
    r_spearman: float = Field(..., alias='R_spearman')
    r_kendall: float = Field(..., alias='R_kendall')
    nmb: float
    mnmb: float
    fge: float
    num_coords_tot: int
    num_coords_with_data: int


class World(BaseModel):
    field_1610668800000: Field1610668800000 = Field(..., alias='1610668800000')


class Sconcpm25(BaseModel):
    world: World = Field(..., alias='WORLD')


class IfsOsuite(BaseModel):
    sconcpm25: Sconcpm25


class Surface(BaseModel):
    ifs_osuite: IfsOsuite = Field(..., alias='IFS-OSUITE')


class AnEeaMp(BaseModel):
    surface: Surface = Field(..., alias='Surface')


class Concpm25(BaseModel):
    an_eea_mp: AnEeaMp = Field(..., alias='AN-EEA-MP')


class Model(BaseModel):
    concpm25: Concpm25
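
The generated classes hard-code the keys of this one sample (e.g. "IFS-OSUITE" or "1610668800000"), so they would not match a file with other models, regions or timestamps. A key-agnostic alternative could look like the sketch below. It uses standard-library dataclasses rather than pydantic, so value types are not validated, and it assumes seven nesting levels and timestamp keys in milliseconds since the Unix epoch. `StatsEntry`, `walk` and `iter_entries` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatsEntry:
    """Statistics stored under one timestamp key (field names as in the JSON)."""
    totnum: int
    num_valid: int
    refdata_mean: float
    refdata_std: float
    data_mean: float
    data_std: float
    weighted: int
    rms: float
    R: float
    R_spearman: float
    R_kendall: float
    nmb: float
    mnmb: float
    fge: float
    num_coords_tot: int
    num_coords_with_data: int


def walk(node, depth, keys=()):
    """Yield (key path, value) pairs for everything `depth` levels below `node`."""
    if depth == 0:
        yield keys, node
        return
    for key, child in node.items():
        yield from walk(child, depth - 1, keys + (key,))


def iter_entries(stats):
    """Yield (six-level key path, UTC datetime, StatsEntry) for each timestamp."""
    for keys, values in walk(stats, 7):
        # Assumption: timestamp keys are milliseconds since the Unix epoch.
        when = datetime.fromtimestamp(int(keys[-1]) / 1000, tz=timezone.utc)
        # Raises TypeError if a statistic is missing or unexpected.
        yield keys[:-1], when, StatsEntry(**values)
```

Iterating over `iter_entries(json.load(f))` would then give flat records, which is also closer to the row layout a database table would need.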

@lewisblake
Member

I implemented a first attempt at splitting up the timeseries files this morning, and will be testing it out this afternoon.

This was referenced May 31, 2022
lewisblake added a commit that referenced this issue Jun 7, 2022
@AugustinMortier
Member Author

The files are still quite big when considering the cams2-83 last-seasons experiment.

ubuntu@aeroval:/var/www/web/data/cams2-83/last-seasons/hm/ts$ ll -h
total 605M
drwxrwsr-x 1 ubuntu ubuntu  314 Jul 11 17:43 ./
drwxrwsr-x 1 ubuntu ubuntu  136 Jul  9 05:40 ../
-rw-rw-r-- 1 ubuntu ubuntu  42M Jul  9 16:35 Obs-concco-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  63M Jul  9 16:31 Obs-concno2-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  66M Jul  9 16:37 Obs-conco3-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  57M Jul  9 16:44 Obs-concpm10-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  49M Jul  9 16:47 Obs-concpm25-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  55M Jul  9 16:41 Obs-concso2-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu 276M Jun 27 22:40 stats_ts.json

I suggest also splitting by region, which would be consistent with other timeseries files (e.g. in the forecast directory: region_obsnetwork-variable_verticallayer.json):

ubuntu@aeroval:/var/www/web/data/cams2-83/last-seasons/forecast$ ls
 ALL_Obs-concco_Surface.json
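
A possible way to do the additional region split is sketched below. It assumes the content of one per-observation file is nested as model, model variable, region, timestamp, and it reuses the region_obsnetwork-variable_verticallayer.json naming of the forecast directory. `split_by_region` is a hypothetical helper, not the implementation in the linked pull request.

```python
import json
from pathlib import Path


def split_by_region(by_model: dict, obs: str, var: str, layer: str, out_dir: Path) -> list:
    """Regroup model -> model variable -> region into one file per region."""
    per_region = {}
    for model, by_modvar in by_model.items():
        for modvar, by_region in by_modvar.items():
            for region, by_ts in by_region.items():
                # Keep all models of a region together in the same file.
                per_region.setdefault(region, {}).setdefault(model, {})[modvar] = by_ts
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for region, payload in per_region.items():
        path = out_dir / f"{region}_{obs}-{var}_{layer}.json"
        path.write_text(json.dumps(payload))
        written.append(path)
    return written
```

Region names are used in the file name unchanged here; names containing characters such as "/" would need to be sanitised first.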

@lewisblake lewisblake linked a pull request Jul 18, 2022 that will close this issue

4 participants