[Feature] Utilities for extracting fields, filtering pipeline #43
Conversation
The most important change that this PR introduces is the filtering pipeline, which lets us reorganize, add, and remove filters as simple objects; a future demonstration will be worked on. This means that if a filtering mechanism breaks, or if we want to add a new one, it will be a simple change, and any future user or potential developer can reuse this pipeline component.
A few comments for clarification, but no technical issues that I see. I think the module format looks good in case someone wants to use the specific utilities for anything, as you mention, by importing the utils module to access its functions directly.
for property in properties:
    extracted_fields[property] = []

for feature in geojson["features"]:
This looks like it works well. I am wondering, however, if a list comprehension may be faster, though I cannot clearly see how best to do it. Maybe review some examples here for ideas: https://realpython.com/list-comprehension-python/
But maybe the list comprehension does not apply since it is more complicated with multiple properties, so whatever you decide after considering this is good!
Sadly, I don't really think it can be applied here. The internal variable is different from the iterable that is being looped over, thus preventing a neat list comprehension loop.
What do you think, @OmarMuhammedAli ?
mapillary/utils/filter.py
Outdated
if component["filter"] == "max_date":
    try:
        temp_data = max_date(
            data=temp_data, max_timestamp=component["max_timestamp"]
Should this be checking max_timestamp against min_timestamp, opposite of line 45 above? I am not sure but assume this is making sure that min_date and max_date are not equal to each other, is that right?
It actually selects only `features` items that have timestamps below this `max_timestamp`.
mapillary/utils/filter.py
Outdated
if component["filter"] == "min_date":
    try:
        temp_data = min_date(
            data=temp_data, min_timestamp=component["max_timestamp"]
Related to the comment below: should this be min and max being compared, or assigning min from the input? So maybe it should not be max here? Just double-checking, as you will understand the context better.
You're right, this is a typo. Fixed.
mapillary/utils/filter.py
Outdated
                f" {exception}"
            )

    if component["filter"] == "coverage":
"image_type" instead of "coverage" will probably be preferred, but we can update later if "coverage" is currently the consistent name.
Noted.
mapillary/utils/extract.py
Outdated
    """


def extract_properties(geojson, properties):
Can you add a docstring to clarify the parameters and the expected return from this utility? It can be helpful to clarify these details for future developers.
Consider it done, boss.
mapillary/utils/filter.py
Outdated
except TypeError as exception:
    logger.warning(
        f"[pipeline - haversine_dist, TypeError] Filter not applied, exception thrown,"
        f" {exception}"
A minor bug here, strings should be concatenated
Strings are concatenated in this way, right? I tested this out in the Python terminal:
>>> a = 10
>>> b = 20
>>> (
... f"{a}, "
... f"{b}"
... )
'10, 20'
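For reference, Python joins adjacent string literals (f-strings included) at compile time, so the two fragments in the diff do form a single message. A minimal sketch, mirroring the warning string from the diff:

```python
# Adjacent string literals, f-strings included, are concatenated
# by the parser, so this parenthesized pair is one single string.
exception = TypeError("bad input")
message = (
    f"[pipeline - coverage, TypeError] Filter not applied, exception thrown,"
    f" {exception}"
)
print(message)
# → [pipeline - coverage, TypeError] Filter not applied, exception thrown, bad input
```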
mapillary/utils/filter.py
Outdated
except TypeError as exception:
    logger.warning(
        f"[pipeline - coverage, TypeError] Filter not applied, exception thrown,"
        f" {exception}"
Concatenate strings. Also, you can create a function for throwing that exception to avoid such bugs and to abide by the DRY principle.
Strings are concatenated in this way, right? I tested this out in the Python terminal:
>>> a = 10
>>> b = 20
>>> (
... f"{a}, "
... f"{b}"
... )
'10, 20'
mapillary/utils/filter.py
Outdated
except TypeError as exception:
    logger.warning(
        f"[pipeline - haversine_dist, TypeError] Filter not applied, exception thrown,"
        f" {exception}"
Same as the comment above.
mapillary/utils/filter.py
Outdated
if component["filter"] == "min_date":
    try:
        temp_data = min_date(
            data=temp_data, min_timestamp=component["max_timestamp"]
Suggested change:
- data=temp_data, min_timestamp=component["max_timestamp"]
+ data=temp_data, min_timestamp=component["min_timestamp"]
mapillary/utils/filter.py
Outdated
# ? This function could use refactoring, but how?

temp_data = data.copy()
Why are you making a copy of `data` here if the original `data` parameter is never used again? Why not just use `data` instead of `temp_data`?
The `data` object being passed in is a `dict`, which is understood as passed by reference to functions.
I wrote a small piece of code to test out what I had in mind when writing this line. The code is:
test_dict = {'a': 10, 'b': 20, 'c': 30}

def change_dict(new_dict: dict) -> dict:
    new_dict['a'] = 40

change_dict(test_dict)
print(test_dict)
The result of this is,
python test.py
{'a': 40, 'b': 20, 'c': 30}
So the `dict` got modified by reference. However, there isn't really a reason why `data` itself should be modified in the pipeline function. By making a copy of the `data` object, I can make this function idempotent: it can be passed the same data over and over again and always give the same output. If the pipeline changed the input data it is fed, it may not make sense and may be buggy, because ideally this function should only care about its own state, no matter how many filters I apply, and never modify my input object.
I was thinking about it in that way. What do you think?
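One related caveat, not raised in the thread: `dict.copy()` is shallow, so the nested "features" list and its dicts are still shared with the caller, and a filter that mutated a feature in place would still leak through; `copy.deepcopy` avoids that. A quick sketch with made-up sample data:

```python
import copy

data = {"type": "FeatureCollection", "features": [{"properties": {"is_pano": True}}]}

shallow = data.copy()      # new outer dict, but nested objects are shared
deep = copy.deepcopy(data)  # fully independent copy

# Mutating a feature through the shallow copy also changes the original...
shallow["features"][0]["properties"]["is_pano"] = False
print(data["features"][0]["properties"]["is_pano"])  # → False

# ...while the deep copy is unaffected.
print(deep["features"][0]["properties"]["is_pano"])  # → True
```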
mapillary/utils/filter.py
Outdated
def max_date(data, max_timestamp):
    max_timestamp = date_to_unix_epoch(max_timestamp)
    return [
        feature
        for feature in data["features"]
        if feature["properties"]["captured_at"] <= max_timestamp
    ]


def min_date(data, min_timestamp):
    min_timestamp = date_to_unix_epoch(min_timestamp)
    return [
        feature
        for feature in data["features"]
        if feature["properties"]["captured_at"] >= min_timestamp
    ]


def params(data, values, properties):
    return [
        feature
        for feature in data["features"]
        if feature["properties"][properties] in values
    ]


def haversine_dist(data, radius, coords, unit="m"):

    # Define an empty geojson
    output = {"type": "FeatureCollection", "features": []}

    for feature in data["features"]:
        distance = haversine(coords, feature["geometry"]["coordinates"], unit=unit)

        print(distance)
        if distance < radius:
            output["features"].append(feature)

    return output


def coverage(data, tile):
    """The parameter might be 'all' (both is_pano == true and false), 'pano'
    (is_pano == true only), or 'flat' (is_pano == false only)"""

    bool_for_pano_filtering = (
        True if tile == "pano" else False if tile == "flat" else None
    )

    if bool_for_pano_filtering is not None:
        return [
            feature
            for feature in data["features"]
            if feature["properties"]["is_pano"] is bool_for_pano_filtering
        ]


def organization_id(data, organization_ids: list):
    return [
        feature
        for feature in data["features"]
        if feature["properties"]["organization_id"] in organization_ids
    ]
Although some of the utils are self-explanatory, it is a good practice to add docstrings for each of them. This will not only add clarification to future contributors but will also set a good example to abide by as ongoing development is expected to this module in particular.
The pipeline function is a brilliant idea for simplifying the filtering process of different GeoJSON inputs.
Two changes/additions that I think are important:
- Create a function that throws the TypeError exception, to avoid repeating yourself and to avoid the frequent bug mentioned in the comments above.
- Add inline documentation for the different utilities.
For the above commits, CC: @OmarMuhammedAli - review please.
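A sketch of the helper suggested above; the function names and message format here are assumptions that mirror the warning strings in the diff, not the actual implementation:

```python
import logging

logger = logging.getLogger("pipeline-logger")


def filter_failure_message(filter_name: str, exception: Exception) -> str:
    # One place defines the message format, so the string-concatenation
    # bug cannot recur separately in each filter branch.
    return (
        f"[pipeline - {filter_name}, {type(exception).__name__}]"
        f" Filter not applied, exception thrown, {exception}"
    )


def log_filter_failure(filter_name: str, exception: Exception) -> None:
    logger.warning(filter_failure_message(filter_name, exception))


# Each except block then collapses to a single call:
try:
    raise TypeError("unexpected argument")
except TypeError as exception:
    log_filter_failure("haversine_dist", exception)
```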
extracted_fields = {}

for property in properties:
    extracted_fields[property] = []

for feature in geojson["features"]:
    for property in properties:
        if property in feature["properties"]:
            extracted_fields[property].append(feature["properties"][property])
Referencing @cbeddow's comment and @Rubix982's reply, a suggested change, replacing:

extracted_fields = {}

for property in properties:
    extracted_fields[property] = []

for feature in geojson["features"]:
    for property in properties:
        if property in feature["properties"]:
            extracted_fields[property].append(feature["properties"][property])

with:

extracted_fields = {
    property: [
        feature["properties"][property]
        for feature in geojson["features"]
        if property in feature["properties"]
    ]
    for property in properties
}
Using a mix of dict comprehension and list comprehension, this works, but is it more efficient?
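One way to settle the equivalence question is to run both versions against the same input; a small sketch with made-up sample data:

```python
geojson = {
    "features": [
        {"properties": {"id": 1, "captured_at": 100}},
        {"properties": {"id": 2}},  # missing "captured_at" on purpose
    ]
}
properties = ["id", "captured_at"]

# Loop version, as in the PR.
loop_result = {}
for prop in properties:
    loop_result[prop] = []
for feature in geojson["features"]:
    for prop in properties:
        if prop in feature["properties"]:
            loop_result[prop].append(feature["properties"][prop])

# Dict-plus-list comprehension version from the suggestion.
comp_result = {
    prop: [
        feature["properties"][prop]
        for feature in geojson["features"]
        if prop in feature["properties"]
    ]
    for prop in properties
}

print(loop_result == comp_result)  # → True
```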
This feels hard to read, though. I'm not sure about efficiency.
I mean, they both give the same results. So whatever you guys decide on works! I'll look more into the efficiency aspect though
Yeah, if the comprehensions do not give a vastly faster response, then ease of reading is good in my view; if I were a newcomer to the library looking to contribute, I think that would be appreciated.
Noted.
Amazing work, @Rubix982 !!!!! I really enjoyed reviewing this PR. Can you share some resources that inspired you to apply the design decisions you used here?
logger = logging.getLogger("pipeline-logger")


def pipeline_component(func, data, exception_message, args):
All hail functional programming
All hail the Haskell!
Just the DRY principle suggested by someone called "Omar". I think you know him 😉
[Feature] Utilities for extracting fields, filtering pipeline
Pull Request Type
Purpose
To simplify the filtering process, and for users who might want to extract only certain fields from a GeoJSON. I ran into a problem when testing requirement#2 where I just wanted to check the final `id` property across all `features` in a GeoJSON and did not need everything else, so I wrote a simple function that does it for me.
I also noticed that the DRY principle was being violated with the filtering ideas. As we go through the requirements, there are fields that are common across requirements, and that may also be common with future requirements, for which we would have to write the same code over and over again - what if we can automate that?
A pipeline does just that. It passes the data to the first filter, then the output of that filter gets passed as the input to the second filter, then the output of the second filter gets passed as the input to the third, and so on.
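The chaining described above can be sketched with `functools.reduce`; the stand-in filters below are hypothetical and operate on a plain list of timestamps, not the actual GeoJSON utilities:

```python
from functools import reduce


def pipeline(data, components):
    """Feed data through each filter in order; each filter's output
    becomes the next filter's input."""
    return reduce(
        lambda acc, component: component["func"](acc, **component["kwargs"]),
        components,
        data,
    )


# Hypothetical stand-in filters: keep timestamps within [10, 30].
components = [
    {"func": lambda d, min_ts: [t for t in d if t >= min_ts], "kwargs": {"min_ts": 10}},
    {"func": lambda d, max_ts: [t for t in d if t <= max_ts], "kwargs": {"max_ts": 30}},
]

print(pipeline([5, 10, 20, 30, 40], components))  # → [10, 20, 30]
```

Because each filter is a plain object, reordering, adding, or removing one is just a change to the `components` list.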
Why?
The most important change that this PR introduces is the filtering pipeline, which lets us reorganize, add, and remove filters as simple objects; a future demonstration will be worked on. This means that if a filtering mechanism breaks, or if we want to add a new one, it will be a simple change, and any future user or potential developer can reuse this pipeline component.
Feedback required over
Feedback required when
Whenever feasible
Mentions