Duplicates rows in the db #386

Closed
EyalBerger opened this issue Mar 9, 2022 · 1 comment

@EyalBerger
Collaborator

Hi, I think that for some reason there are many duplicated rows in the db. From an analytical point of view this can be worked around, but I assume it's more efficient to drop the duplicates.

I've extracted https://openbus-stride-public.s3.eu-west-1.amazonaws.com/stride-siri-requester/2022/03/01/07/00.br and filtered it down to one line_ref:
image

As can be seen, the source itself seems to hold one duplicated row, which could probably be removed.

In the db, this duplicated row seems to grow to 8 records:
image

Code for getting the db data:

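import datetime

import pandas as pd
import stride
from dateutil import tz

# Fetch one minute of vehicle locations for line_ref 12328 (08:55 to 08:56 Israel time).
siri_vehicle_locations = pd.DataFrame(stride.iterate('/siri_vehicle_locations/list', {
    'recorded_at_time_from': datetime.datetime(2022, 3, 1, 8, 55, tzinfo=tz.gettz('Israel')),
    'recorded_at_time_to': datetime.datetime(2022, 3, 1, 8, 56, tzinfo=tz.gettz('Israel')),
    'siri_routes__line_ref': '12328',
    'order_by': 'recorded_at_time desc'
}, limit=10000))

display(siri_vehicle_locations.shape)

# Keep only the rows recorded at 06:55:00 (this appears to be UTC, i.e. 08:55:00 Israel time).
siri_vehicle_locations[(pd.to_datetime(siri_vehicle_locations.recorded_at_time).dt.time.astype(str)=='06:55:00')]
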
@OriHoch

OriHoch commented Mar 10, 2022

The data is duplicated in the source data: this vehicle location appears in multiple siri snapshots, as indicated by the siri_snapshot_id column. The siri snapshot you extracted is 268538, but the same data also appears in other siri snapshots.
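
To check how many snapshots each vehicle location appears in, something like the following rough sketch could be used. The sample rows are made up for illustration (only snapshot id 268538 comes from this thread), and the column names other than siri_snapshot_id and recorded_at_time are assumptions about what the API returns.

```python
import pandas as pd


def count_snapshots_per_location(df, key_columns):
    """Count the distinct siri snapshots in which each location (as defined by key_columns) appears."""
    return (
        df.groupby(list(key_columns))["siri_snapshot_id"]
        .nunique()
        .reset_index(name="snapshot_count")
    )


# Made-up sample: one location repeated across three snapshots, plus one unique location.
sample = pd.DataFrame({
    "siri_snapshot_id": [268538, 268539, 268540, 268538],
    "recorded_at_time": ["2022-03-01T06:55:00+00:00"] * 3 + ["2022-03-01T06:55:30+00:00"],
    "lon": [34.78, 34.78, 34.78, 34.79],  # assumed column name
    "lat": [32.08, 32.08, 32.08, 32.09],  # assumed column name
})

counts = count_snapshots_per_location(sample, ["recorded_at_time", "lon", "lat"])
print(counts)
```

A snapshot_count greater than 1 means the same location was reported in more than one snapshot.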

I think it's better to reflect all source data as-is, because the duplication is itself important information which we don't want to lose (e.g. it can be used to find errors on the MOT side). Also, the process that loads the siri data should be as minimal as possible, so that we can process and load all the source data as quickly as possible and with minimal strain on the DB.

Given the above points, and given that it's easy to fix this when getting the data from the API, I think we should not fix it for now. If this does require a fix, we would need to find a solution which does not modify the source data, and that may require significant development effort. Closing for now because I don't think we will get to it, but feel free to reopen if you think it's critical.
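
For reference, a minimal sketch of what the fix on the API-consumer side might look like with pandas, assuming the only columns that differ between the duplicated rows are the row id and siri_snapshot_id. The function name and the sample data are hypothetical, and the id, lon and lat column names are assumptions; adjust ignore_columns to whatever actually varies in your result.

```python
import pandas as pd


def drop_snapshot_duplicates(df, ignore_columns=("id", "siri_snapshot_id")):
    """Keep the first row of each group that is identical in every column except ignore_columns."""
    key_columns = [column for column in df.columns if column not in ignore_columns]
    return df.drop_duplicates(subset=key_columns, keep="first").reset_index(drop=True)


# Made-up sample: rows 1 to 3 are the same location stored once per snapshot.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],  # assumed column name
    "siri_snapshot_id": [268538, 268539, 268540, 268538],
    "recorded_at_time": ["2022-03-01T06:55:00+00:00"] * 3 + ["2022-03-01T06:55:30+00:00"],
    "lon": [34.78, 34.78, 34.78, 34.79],  # assumed column name
    "lat": [32.08, 32.08, 32.08, 32.09],  # assumed column name
})

deduplicated = drop_snapshot_duplicates(sample)
print(deduplicated)
```

This runs on the client only, so the db still holds every row as it appeared in the source.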

@OriHoch OriHoch closed this as completed Mar 10, 2022
@OriHoch OriHoch added this to Done in open-bus-stride Mar 10, 2022