Split up overall ts files #647

AugustinMortier · 2022-05-13T19:30:11Z

The overall timeseries files (located in hm/ts/) should be split into multiple files, as they can become very large (>100 MB) when an experiment covers multiple models, observations, etc. All models should stay in the same file, but we could split at least per observation and perhaps per region.

jgriesfeller · 2022-05-16T08:15:05Z

Maybe a common practice from SQL databases helps here: the principle there is to make record sizes as small as possible (and as big as necessary).
In our case, the principle should be that the files we create do not contain more information than what is needed for the current visualisation.
Maybe we should go over how the data is organised. Is there anybody besides me with some database experience? Remember that we might want to put the whole data into a database at some point, so that effort is not lost.
In addition, I need to know the data format to improve the parallelism.

jgriesfeller · 2022-05-31T08:43:35Z

Since I just had a look at the structure of all the json files for my parallelization efforts, here's a little documentation:
The file hm/ts/stats_ts.json looks like this:

{
  "concpm25": {
    "AN-EEA-MP": {
      "Surface": {
        "IFS-OSUITE": {
          "sconcpm25": {
            "WORLD": {
              "1610668800000": {
                "totnum": 62403,
                "num_valid": 38742,
                "refdata_mean": 25.35994,
                "refdata_std": 29.8055,
                "data_mean": 28.44732,
                "data_std": 38.01789,
                "weighted": 0,
                "rms": 30.18373,
                "R": 0.63196,
                "R_spearman": 0.81003,
                "R_kendall": 0.60445,
                "nmb": 0.12174,
                "mnmb": -0.04012,
                "fge": 0.54547,
                "num_coords_tot": 2013,
                "num_coords_with_data": 1881
              },

So it should be easy to split that by obs network and variable, as is already done e.g. for the map files.
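The split by obs network and variable described above could look roughly like the following. This is only a minimal sketch, not the code used in the project: the function name `split_stats_ts` and the `<network>-<variable>-<layer>.json` naming scheme are assumptions made for illustration, and the nesting order (variable, obs network, vertical layer, then everything else) is taken from the excerpt of stats_ts.json shown above.

```python
from __future__ import annotations

import json
from pathlib import Path


def split_stats_ts(stats_ts: dict, out_dir: Path) -> list[Path]:
    """Write one JSON file per (obs network, variable, layer) combination.

    Hypothetical helper: everything below the layer level (models, model
    variables, regions, timestamps) is copied unchanged into each file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for obs_var, networks in stats_ts.items():
        for network, layers in networks.items():
            for layer, models in layers.items():
                # Assumed naming scheme, chosen for illustration only.
                path = out_dir / f"{network}-{obs_var}-{layer}.json"
                with path.open("w") as fh:
                    json.dump(models, fh)
                written.append(path)
    return written
```

Because each output file keeps all models for one network, variable and layer, a page that shows a single observation network only needs to fetch one of these smaller files.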

avaldebe · 2022-05-31T10:00:42Z

Nice, here is the pydantic model generated by jsontopydantic:

from __future__ import annotations

from pydantic import BaseModel, Field


class Field1610668800000(BaseModel):
    totnum: int
    num_valid: int
    refdata_mean: float
    refdata_std: float
    data_mean: float
    data_std: float
    weighted: int
    rms: float
    r: float = Field(..., alias='R')
    r_spearman: float = Field(..., alias='R_spearman')
    r_kendall: float = Field(..., alias='R_kendall')
    nmb: float
    mnmb: float
    fge: float
    num_coords_tot: int
    num_coords_with_data: int


class World(BaseModel):
    field_1610668800000: Field1610668800000 = Field(..., alias='1610668800000')


class Sconcpm25(BaseModel):
    world: World = Field(..., alias='WORLD')


class IfsOsuite(BaseModel):
    sconcpm25: Sconcpm25


class Surface(BaseModel):
    ifs_osuite: IfsOsuite = Field(..., alias='IFS-OSUITE')


class AnEeaMp(BaseModel):
    surface: Surface = Field(..., alias='Surface')


class Concpm25(BaseModel):
    an_eea_mp: AnEeaMp = Field(..., alias='AN-EEA-MP')


class Model(BaseModel):
    concpm25: Concpm25
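Since the generator only saw one sample, the model above hard-codes every key (variable, network, region, even the single timestamp) as its own class. As a complement, here is a dependency-free sketch that treats those keys as data and flattens the nested structure into records, which may also be closer to what a database table would need. The names `StatsRecord` and `iter_records` are made up for this example, and reading the timestamp keys as epoch milliseconds in UTC is an assumption.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Iterator, NamedTuple


class StatsRecord(NamedTuple):
    """One flat row: the seven nesting levels plus the statistics dict."""

    obs_var: str
    network: str
    layer: str
    model: str
    model_var: str
    region: str
    time: datetime
    stats: dict


def iter_records(stats_ts: dict) -> Iterator[StatsRecord]:
    """Walk the nested stats_ts structure and yield one record per timestamp."""
    for obs_var, networks in stats_ts.items():
        for network, layers in networks.items():
            for layer, models in layers.items():
                for model, model_vars in models.items():
                    for model_var, regions in model_vars.items():
                        for region, series in regions.items():
                            for ts_ms, stats in series.items():
                                # Assumption: keys are epoch milliseconds (UTC).
                                time = datetime.fromtimestamp(
                                    int(ts_ms) / 1000, tz=timezone.utc
                                )
                                yield StatsRecord(
                                    obs_var, network, layer, model,
                                    model_var, region, time, stats,
                                )
```

Each record carries the statistics dict untouched (including keys such as `R` and `R_spearman`), so no aliasing is needed at this level.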

lewisblake · 2022-05-31T10:47:31Z

I implemented a first attempt at splitting up the timeseries files this morning, and will be testing it out this afternoon.

Split ts files #647

AugustinMortier · 2022-07-11T17:57:47Z

The files are still quite big when considering the cams2-83 last-seasons experiment.

ubuntu@aeroval:/var/www/web/data/cams2-83/last-seasons/hm/ts$ ll -h
total 605M
drwxrwsr-x 1 ubuntu ubuntu  314 Jul 11 17:43 ./
drwxrwsr-x 1 ubuntu ubuntu  136 Jul  9 05:40 ../
-rw-rw-r-- 1 ubuntu ubuntu  42M Jul  9 16:35 Obs-concco-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  63M Jul  9 16:31 Obs-concno2-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  66M Jul  9 16:37 Obs-conco3-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  57M Jul  9 16:44 Obs-concpm10-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  49M Jul  9 16:47 Obs-concpm25-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu  55M Jul  9 16:41 Obs-concso2-Surface.json
-rw-rw-r-- 1 ubuntu ubuntu 276M Jun 27 22:40 stats_ts.json

I suggest also splitting by region, which would also be consistent with other timeseries files (e.g. in the forecast directory: region_obsnetwork-variable_verticallayer.json):

ubuntu@aeroval:/var/www/web/data/cams2-83/last-seasons/forecast$ ls
 ALL_Obs-concco_Surface.json
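A possible way to regroup the data by region is sketched below, using the region_obsnetwork-variable_verticallayer.json pattern mentioned above for the file names. The helper names `group_by_region` and `write_region_files` are hypothetical, and the layout inside each file (model, then model variable, then timestamps) is an assumption, not necessarily what the merged implementation writes.

```python
from __future__ import annotations

import json
from pathlib import Path


def group_by_region(stats_ts: dict) -> dict[str, dict]:
    """Map file name -> {model: {model_var: {timestamp: stats}}}.

    The region level moves from inside the structure into the file name,
    so all models for one region end up in the same file.
    """
    files: dict[str, dict] = {}
    for obs_var, networks in stats_ts.items():
        for network, layers in networks.items():
            for layer, models in layers.items():
                for model, model_vars in models.items():
                    for model_var, regions in model_vars.items():
                        for region, series in regions.items():
                            name = f"{region}_{network}-{obs_var}_{layer}.json"
                            by_model = files.setdefault(name, {})
                            by_model.setdefault(model, {})[model_var] = series
    return files


def write_region_files(stats_ts: dict, out_dir: Path) -> None:
    """Write each regrouped entry to its own JSON file in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, content in group_by_region(stats_ts).items():
        (out_dir / name).write_text(json.dumps(content))
```

Region names are used verbatim in the file name here; names containing characters that are awkward in paths would need to be sanitised first.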

AugustinMortier added the aeroval-tools Issues related to AeroVal web tools label May 13, 2022

lewisblake self-assigned this May 30, 2022

This was referenced May 31, 2022

Split ts files 647 #660

Merged

format with black #662

Merged

Split ts files 647 #666

Merged

lewisblake added a commit that referenced this issue Jun 7, 2022

Merge pull request #666 from lewisblake/split-ts-files-647

fac79da

Split ts files #647

lewisblake closed this as completed Jun 7, 2022

AugustinMortier reopened this Jul 11, 2022

lewisblake linked a pull request Jul 18, 2022 that will close this issue

Split ts files #707

Merged

lewisblake closed this as completed in #707 Aug 3, 2022
