
EPIC: as traffic manager I would like to integrate A22 "traffic events" in the Open Data Hub #263

Closed
7 tasks done
rcavaliere opened this issue Mar 4, 2021 · 33 comments

@rcavaliere (Member) commented Mar 4, 2021

This user story deals with the integration of the A22 traffic events in the ODH. The implementation is carried out by 1006.org and Catch&Solve. As far as the ODH data model is concerned, my suggestion is to treat an "event" like a station (since it is characterized by coordinates) and to use the metadata to characterize it. In this case we could have the situation that this type of station has no types and measurements associated with it.

Specification document:
201202_RichiestaOfferte_EventiStradali.pdf

Tasks covered by this epic:

@rcavaliere rcavaliere changed the title EPIC: as traffic manager I would like to integrated "traffic events" in the Open Data Hub EPIC: as traffic manager I would like to integrate "traffic events" in the Open Data Hub Mar 4, 2021
@rcavaliere rcavaliere changed the title EPIC: as traffic manager I would like to integrate "traffic events" in the Open Data Hub EPIC: as traffic manager I would like to integrate A22 "traffic events" in the Open Data Hub Apr 16, 2021
@bertolla bertolla removed their assignment May 25, 2021
@bertolla bertolla transferred this issue from noi-techpark/bdp-core May 25, 2021
@davidebz (Contributor) commented Jun 4, 2021

As discussed in the email between Roberto and Davide (31/05/21, 18:22), we can't modify the active flag in the data collector; this has to be done by the "writer".

Note: for A22 events the active flag should not depend on the stations during the syncStation call, but should be derived from the metadata data_inizio and data_fine.
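
A minimal sketch of that rule, assuming the flag is computed purely from the two metadata timestamps (class and method names are hypothetical, not actual writer code):

```java
import java.time.Instant;

// Hypothetical sketch: derive the active flag of an A22 event from its
// data_inizio / data_fine metadata instead of from the syncStation call.
public class A22EventState {

    /**
     * An event counts as active if it has started and has not ended yet.
     * dataInizio / dataFine come from the event metadata; dataFine may be
     * null for events that are still open.
     */
    public static boolean isActive(Instant dataInizio, Instant dataFine, Instant now) {
        boolean started = !now.isBefore(dataInizio);
        boolean ended = dataFine != null && now.isAfter(dataFine);
        return started && !ended;
    }
}
```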

@rcavaliere (Member Author)

@davidebz yes, correct! Do you have a PR ready for this DC as well? Then we should foresee a little development on the writer side in order to handle the status of the events provided by this DC, which can be done by @bertolla.

@noctho (Contributor) commented Jun 4, 2021

I have created a PR (#280), but I have marked it as WIP for the following reasons:

  • The specification says that we should leave the name of the station empty. But it's a mandatory field, so we have to specify a value (either the id, or something like "<description of idtipoevento> <id>", or something else). At the moment, I'm using the id as a placeholder.
  • I have found an idtipoevento that is not specified: 24. Looking at its idsottotipoevento, it seems to be equivalent to 12 = Informazione. In addition, I haven't found any event with idtipoevento = 12 from 2015 until now.
  • You have to decide how often you want to schedule this task. As a placeholder, I have specified every 10 minutes.

As @davidebz said before, at the moment the data collector does not handle the active flag correctly. I've implemented it this way (a sketch of the loop follows at the end of this comment):

  • I start from a starting point specified in the properties and fetch all events from the A22 API within a specified time window. Then I synchronize the retrieved stations and fetch the data for the next time window. I repeat this process until I have reached the current timestamp.
  • I synchronize the stations after each window, because I have noticed that tens of thousands of stations accumulate over all windows. This reduces the workload of the writer, which otherwise gets overwhelmed on my testing machine...
  • Then I fetch all active events from the corresponding API (/eventi/lista/attivi) and synchronize them. Should we set only these events as active in our database? Because they are officially active according to the A22 API.

The specification also says something about planned events, but I didn't find any such event via the API ... I've set the range of the historical API (/eventi/lista/storici) from now to the end of the year, but I only found some events where data_inizio is in the past and data_fine is in the future. And these events are missing from the result of /eventi/lista/attivi ...
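
A sketch of the windowed import loop described above; the A22Client interface and all names here are invented placeholders, only the control flow matches the description:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of the windowed import loop; not the actual PR code.
public class A22EventImporter {

    /** Placeholder for the actual A22 web-service wrapper. */
    interface A22Client {
        List<Object> fetchHistoricEvents(Instant from, Instant to); // /eventi/lista/storici
        List<Object> fetchActiveEvents();                           // /eventi/lista/attivi
        void syncStations(List<Object> events);
    }

    private final A22Client client;
    private final Duration window = Duration.ofDays(7); // width of one time window

    public A22EventImporter(A22Client client) {
        this.client = client;
    }

    public void importHistory(Instant startingPoint) {
        Instant from = startingPoint;
        while (from.isBefore(Instant.now())) {
            Instant to = from.plus(window);
            // fetch one time window and synchronize right away, so that the
            // writer is not hit with tens of thousands of stations at once
            client.syncStations(client.fetchHistoricEvents(from, to));
            from = to;
        }
        // finally, synchronize the officially active events
        client.syncStations(client.fetchActiveEvents());
    }
}
```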

@rcavaliere (Member Author)

@noctho thanks for the inputs. My feedback on your points:

  • name of the station: I would suggest creating it automatically from other fields, namely "id_idtipoevento_idsottotipoevento"
  • idtipoevento not specified: thanks for the notice, I will ask A22 for more information
  • scheduler: I would suggest something shorter, e.g. 5 minutes, so that in case of an event we are quick to provide this info
  • active / not active events: in my view we should try to provide to the ODH all events available in the A22 systems, not only the active ones. It's clear that the first ingestion will be quite heavy, but this also depends on the width of the time window we are going to request. If possible, I would ask for all events from 1.1.2018 onwards, since this could be useful for some analyses related to the A22. But once we are aligned with the real-time situation I don't expect any critical issues here: the writer is already designed not to update stations that are unmodified with respect to the current saved state. This means that just the "new" or "modified" events will be saved, which is the intended behavior. As for planned events, that's fine, probably A22 does not manage them (currently).

@bertolla do you see any particular adaptation needs at the writer level in order to properly handle the integration of this new dataset? Can the "active" field already be managed in the expected way as currently implemented, or do we need some refactoring?

@rcavaliere rcavaliere assigned bertolla, Piiit and rcavaliere and unassigned davidebz Jun 25, 2021
@Piiit Piiit added the blocked label Jun 30, 2021
@bertolla (Contributor) commented Aug 4, 2021

First test release running on the test server, in version 5.3.0-SNAPSHOT of bdp.

@rcavaliere (Member Author)

@bertolla please integrate this PR by Catch&Solve: #280
@Piiit when could we have the upgrade of the Ninja API so that event data are also available?

@bertolla bertolla removed their assignment Aug 31, 2021
@Piiit (Contributor) commented Sep 29, 2021

@rcavaliere I merged PR #280, but we do not have a pipeline yet... that would be needed to test it properly... Then we need to deploy the new core to production, and finally implement the event API on Ninja... Hopefully I can finalize this in October.

@rcavaliere (Member Author) commented Sep 29, 2021

@Piiit thanks for the info! The steps would be:

  • put the DC in the testing environment (implement the associated pipeline), and test it with the ODH core extension (already running in the testing environment!)
  • put the DC and the new ODH core version in production
  • remember to copy the data stored in the testing environment to the production environment

UPDATE by @Piiit
I copied these points into the main comment section... checks are only handled there!

@RudiThoeni (Member)

Planned to work on this in the 2nd half of November.

@Piiit (Contributor) commented Nov 24, 2021

@rcavaliere I'm moving this back to TODO, since we still need to implement the pipelines etc....

@Piiit (Contributor) commented Nov 24, 2021

@rcavaliere I need the following information:

* change the file
  `src/main/resources/it/bz/noi/a22/events/a22connector.properties`: set the url,
  the username and the password to connect to the A22 API (or configure them
  within a CI tool)

url=
user=
password=

@rcavaliere (Member Author)

@Piiit I have tested the Data Collector. For the missing provenance_id, can we fix it on our side, or does it need to be fixed by @noctho? @noctho we also need to check how "data_fine" and "data_inizio" are returned to the ODH; we don't see this data correctly stored in event_interval.

@Piiit (Contributor) commented Nov 26, 2021

We can fix the provenance on our end

@noctho (Contributor) commented Nov 26, 2021

I set the eventStart and eventEnd properties of EventDto using the epoch-second values data_inizio and data_fine that I get from the data service. But as I see here, it should probably be the timestamp in millis:
https://github.com/noi-techpark/bdp-core/blob/adaf491bbc30611392f1fbe41f3d782f3d169e63/dal/src/main/java/it/bz/idm/bdp/dal/Event.java#L182
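
For illustration, the suspected unit mismatch boils down to a factor of 1000 (a sketch; variable names are illustrative only):

```java
// data_inizio/data_fine arrive as epoch seconds from the A22 service, while
// EventDto (see Event.java linked above) works with epoch milliseconds.
public class EpochUnits {
    public static void main(String[] args) {
        long dataInizioSeconds = 1_622_217_600L;            // example epoch seconds
        long eventStartMillis = dataInizioSeconds * 1000L;  // epoch millis for EventDto
        System.out.println(eventStartMillis);               // 1622217600000
    }
}
```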

@rcavaliere (Member Author)

@Piiit looking at the code, everything seems to be OK. Can we debug this a little in order to understand what happens here?

@Piiit (Contributor) commented Nov 29, 2021

@rcavaliere Provenance insertion fixed; now I will look into the start and end points of each interval, which are stored as bytea and not as a range type... I think this also needs to be fixed on our side.

@rcavaliere (Member Author)

@Piiit very good, let us know; also @noctho, so that he is aware in case he has to do something there.

@Piiit (Contributor) commented Nov 30, 2021

> I set the eventStart and eventEnd properties of EventDto using the epoch-second values data_inizio and data_fine that I get from the data service. But as I see here, it should probably be the timestamp in millis: https://github.com/noi-techpark/bdp-core/blob/adaf491bbc30611392f1fbe41f3d782f3d169e63/dal/src/main/java/it/bz/idm/bdp/dal/Event.java#L182

@noctho You were right, I fixed that... also the coordinates were switched... fixed that too.

@Piiit (Contributor) commented Nov 30, 2021

> @Piiit very good, let us know; also @noctho, so that he is aware in case he has to do something there.

The intervals work now, but I need to know two things to conclude it:

  1. Shall we also use timezones with each interval? I suggest yes... also for all other time fields, but that would be a major enhancement in the core
  2. I suggest having half-open intervals, with the boundaries [lower, upper)... because calculations with OLAP queries would be easier, and if a user wants a moving window across the timeline, he does not need to think about points which might have already been used in statistics etc. (see the sketch below)
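
A small illustration of the half-open convention [lower, upper): the lower bound is included, the upper bound excluded, so adjacent windows tile the timeline without double-counting boundary points (illustrative code, not from the core):

```java
import java.time.Instant;

// Half-open interval: lower bound included, upper bound excluded.
public class HalfOpenInterval {
    public static boolean contains(Instant lower, Instant upper, Instant t) {
        return !t.isBefore(lower) && t.isBefore(upper);
    }

    public static void main(String[] args) {
        Instant lower = Instant.parse("2021-11-30T10:00:00Z");
        Instant upper = Instant.parse("2021-11-30T11:00:00Z");
        System.out.println(contains(lower, upper, lower)); // true  (lower included)
        System.out.println(contains(lower, upper, upper)); // false (upper excluded)
    }
}
```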

Finally, I get many duplicate key errors: ERROR: duplicate key value violates unique constraint "uc_event_uuid"

Is the uuid field unique? And if so, why does the data collector create it and not the writer itself? I think we need to rethink this... and maybe have a compound key to be checked for uniqueness....

@Piiit (Contributor) commented Nov 30, 2021

@rcavaliere I had a second look into the uuid problem and suggest renaming it to name and using the (category, name) tuple as a compound unique constraint, and as the related key... what do you think?

For users of the API this might be clearer: they insert and retrieve events, which must have a name and a category, and within that pair they are unique...

@rcavaliere (Member Author) commented Nov 30, 2021

> Shall we also use timezones with each interval? I suggest yes... also for all other time fields, but that would be a major enhancement in the core

I suggest following the same approach as for measurements, where we don't have timezones, right? That way it's clear that all timestamps are provided in the same way.

@rcavaliere (Member Author) commented Nov 30, 2021

@Piiit regarding the uuid field: the idea we consolidated with Patrick B. was to use this field to avoid inserting multiple records of the same information. For an event record to be unique, we considered the following data (coming from the A22 web-service):

  • id
  • data_inizio
  • idtipoevento
  • idsottotipoevento
  • lat_fine
  • lat_inizio
  • lon_fine
  • lon_inizio

What is strange are the errors we get. The expected behaviour should be:

  • if the record does not exist -> save it
  • if the record already exists -> do not save anything

As you say, this logic should be covered by the writer and not by the Data Collector, which should only pass on the retrieved data. Can we adjust this behaviour?
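
A minimal sketch of that expected writer behaviour, with a plain map standing in for the event table (hypothetical code, not the actual writer):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: insert an event record only if its uuid is not stored
// yet, instead of raising a duplicate-key error.
public class EventWriterSketch {

    private final Map<String, Object> storedEvents = new ConcurrentHashMap<>();

    /** Returns true if the record was new and has been saved. */
    public boolean saveIfAbsent(String uuid, Object eventRecord) {
        // record already exists -> do nothing; otherwise -> save it
        return storedEvents.putIfAbsent(uuid, eventRecord) == null;
    }
}
```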

@Piiit (Contributor) commented Dec 1, 2021

UPDATED VERSION

@rcavaliere OK. The errors come from the fact that the code was not complete... The check to see whether an event already exists was not done; there was just a hard-coded "do nothing", i.e. it always reported that the record does not exist.

So if these things should not be duplicated, we could already enforce that inside the DB itself... Also the UUID could be created with a Postgres native generated column. We should go with constraints etc., not re-invent that on the client side.

@rcavaliere (Member Author)

@Piiit ok, please have a look at this and make a proposal on how you suggest managing it. As said, the DC should not contain complex logic, but just provide the data retrieved from the API. All the intelligence related to data storage should be on the ODH core side.

@Piiit Piiit added the epic label Dec 23, 2021
@Piiit (Contributor) commented Jan 13, 2022

@rcavaliere I suggest the following:

  • A single event is identifiable by the uuid field, which is unique in the whole table
  • I would also add a human-readable name field, which is unique inside the origin + category hierarchy (see below)
  • An event series is identifiable by a new field called event_series_id, which is arbitrary and comes from the client application. I think this is needed, even if I thought differently in previous comments, because the writer cannot always understand automatically whether events belong together; that is only possible for simple scenarios...
  • I would also add a new origin field, which tells us where a certain event came from, for example A22
  • The origin, category and event_series_id together form a unique constraint, that is, within each origin = A22 and category = incident, for example, there can be no duplicate event series ids
  • Finally, as a nice-to-have, we could also add a list of tags to each event, which can then be searched and filtered via the new Ninja API calls later (could also be added as metadata JSON as a simpler but slower solution)

For the UUID problem, I checked the code again, also on the event-a22 DC side... I think the approach taken is not bad. I could not figure out another way to do it, because the writer cannot determine whether something already exists in such a complex JSON structure with so many fields without a major effort and thus performance problems.

I would implement the logic of unique id generation inside the dc-interface though, so that it is reusable in all DCs... for the UUID and the event_series_id, if not otherwise given.

Here I have two ideas:

  1. Encode the original data into a unique UUID field, as is done now in the event-a22 DC
  2. Alternatively, generate the UUID automatically as a real UUID field, and have another field that is optional but unique, called checksum

What do you think about this?

@rcavaliere (Member Author)

@Piiit thanks for the proposals, I agree with them.

I think that having two IDs is the right way to go: a UUID, which is something that we can generate based on the values of the following fields:

  • `id`
  • `data_inizio`
  • `idtipoevento`
  • `idsottotipoevento`
  • `lat_fine`
  • `lat_inizio`
  • `lon_fine`
  • `lon_inizio`
and an event_series_id in order to save the event ID provided by A22 (contained in the field id).

The generation of the UUID value should be kept in the events Data Collector and be reusable by other similar Data Collectors, as you suggest. The exact way it is generated makes no difference to me; I would suggest something simple and quick, also for performance reasons.
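
One possible way to generate such a deterministic UUID (a sketch, assuming a name-based UUID over the concatenated field values is acceptable; not the actual DC code):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Sketch: the same input record always yields the same UUID, so duplicates
// can be detected with a simple equality check. Parameter names mirror the
// A22 field names listed above.
public class EventUuidGenerator {

    public static UUID fromA22Event(String id, long dataInizio,
                                    String idtipoevento, String idsottotipoevento,
                                    double latInizio, double latFine,
                                    double lonInizio, double lonFine) {
        String key = String.join("_",
                id, Long.toString(dataInizio), idtipoevento, idsottotipoevento,
                Double.toString(latInizio), Double.toString(latFine),
                Double.toString(lonInizio), Double.toString(lonFine));
        // name-based (MD5) UUID: simple and quick, as suggested above
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
    }
}
```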

In my view the Data Collector should be "stupid", i.e. always provide the event records available from the A22 API. The writer should just implement a logic that checks whether it already has a record with the same event_series_id and UUID: if it is already stored, the newly provided record should not be stored.

As a data consumer, I can then filter and order by timestamp to get the history of the evolution of that event. The single record, identified by the UUID field, represents a specific "state" of the event.

@Piiit (Contributor) commented Jan 28, 2022

@rcavaliere The DC works now and uses the new writer backend, which is running on our staging server... Let's keep it there for some time to check what lands inside Postgres... I will implement the API side now, so we can check it further.

Just a remark: the event ID that comes from A22 is unique, so we cannot derive event series ids from it... Is there maybe another field we could use to concatenate events into series?

@rcavaliere (Member Author)

@Piiit we have to check this by looking at the data; thanks for the update indeed.

@rcavaliere (Member Author)

@Piiit I wrote to A22 in order to better understand how the "id" field provided by A22 is managed. One question: I currently see in "name" a unique number (e.g. 329404). Where does this number come from, in relation to the fields exposed by the A22 web-service?

@Piiit Piiit closed this as completed Feb 8, 2022
@rcavaliere (Member Author) commented Feb 17, 2022

@Piiit I am not sure that we still have the desired mapping as far as the A22 events data is concerned. I verified with A22, and they confirm that in case of a queue we should have something like this:

| ID | Data versione | Sottotipo evento | Km inizio | Località inizio | Km fine | Località fine |
|---|---|---|---|---|---|---|
| 496655 | 01/30/2022 16:15:54 | Traffico rallentato | 157,900 | ROVERETO NORD | 206,700 | AFFI |
| 496655 | 01/30/2022 16:31:58 | Traffico rallentato | 131,400 | TRENTO NORD | 206,700 | AFFI |
| 496655 | 01/30/2022 16:59:34 | Traffico rallentato con code | 131,400 | TRENTO NORD | 206,700 | AFFI |
| 496655 | 01/30/2022 19:33:01 | Traffico rallentato con code | 157,900 | ROVERETO NORD | 206,700 | AFFI |
| 496655 | 01/30/2022 19:59:05 | Traffico rallentato | 157,900 | ROVERETO NORD | 206,700 | AFFI |
| 496655 | 01/30/2022 20:20:20 | Traffico rallentato | 179,100 | ALA/AVIO | 206,700 | AFFI |
| 496655 | 01/30/2022 20:39:34 | Traffico rallentato | 179,100 | ALA/AVIO | 206,700 | AFFI |

The ID field exposed by A22 should be stored 1:1 in the field event_series_uuid (table event); I see, however, that at present we do some kind of hashing, as for the uuid field. Can you please check what the Data Collector is doing? If needed, we can ask Catch&Solve to change this, in case it was implemented differently from what was originally indicated.

@Piiit (Contributor) commented Feb 18, 2022

Ok. I'll have to check it on the data collector's side... Let's open a new issue for that

@Piiit (Contributor) commented Feb 18, 2022

About the hashing: that is correct, in the sense that if the same id comes from A22, the same hash should always come up... I'll explain that later.

@Piiit (Contributor) commented Feb 18, 2022

@rcavaliere The new issue is #426 ... archiving this one, therefore.
