Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Utilities for extracing fields, filtering pipeline #43

Merged
merged 5 commits into from
Jul 11, 2021

Conversation

Rubix982
Copy link
Contributor

@Rubix982 Rubix982 commented Jul 9, 2021

[Feature] Utilities for extracting fields, filtering pipeline)

Pull Request Type

  • 🏆 Enhancements - adds utilities for filter pipeline, extracting fields

Purpose

To simply the filtering purpose, and for users who might want to extract only certain fields from a GeoJSON. I ran into a problem when testing requirement#2 where I just wanted to check the final id property across all features in a GeoJSON and did not need everything else, so I came up to write a simple function that does it for me.

I noticed that the DRY principle was being violated with the filtering ideas. As we go on in the requirements, there are fields that are common across requirements, and can be also common with future requirements, that we would have to write the same code for over and over again - what if can automate that?

A pipeline does just that. It passes the data to the first filter it has received, then that output of that gets passed to the 2nd filter, then the output of the 2nd filter gets passed as the input to the 3rd filter.

Why?

The most important change that this PR introduces is the filtering pipeline which lets us reorganize, add, remove filters as simple objects, future demonstration will be worked on. This means if a filtering mechanism breaks or if we want to add a new one, it will be a simple addition, and then any future user or potential developer can just reuse this pipeline component

Feedback required over

  • A quick pair of 👀 on the code
  • Discussion on the technical approach
  • Critique on design

Feedback required when

Whenever feasible

Mentions

  • @cbeddow what do you think about this approach? Can you think of some other utilities like these that can we can write that might be helpful to either in development or to the users? A user can always import the utilities by writing code such as,
from mapillary.utils.filter import pipeline.
pipeline(...)

The most important change that this PR introduces is the filtering pipeline which lets us reorganize, add, remove filters as simple objects, future demonstration will be worked on. This means if a filtering mechanism breaks or if we want to add a new one, it will be a simple addition, and then any future user or potential developer can just reuse this pipeline component
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 9, 2021
Copy link
Member

@cbeddow cbeddow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments for clarification, but no technical issues that I see. I think the module format looks good in case someone does want to use the specific utilities for anything, as you mention by importing the utils module to access its functions directly

for property in properties:
extracted_fields[property] = []

for feature in geojson["features"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it works well. I am wondering however if a list comprehension may be faster, though I am not able to clearly think how to best do it. Maybe review some examples here for ideas: https://realpython.com/list-comprehension-python/

But maybe the list comprehension does not apply since it is more complicated with multiple properties, so whatever you decide after considering this is good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, I don't really think it can be applied here. The internal variable is different from the iterable that is being looped over, thus preventing a neat list comprehension loop.

What do you think, @OmarMuhammedAli ?

if component["filter"] == "max_date":
try:
temp_data = max_date(
data=temp_data, max_timestamp=component["max_timestamp"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be checking max_timestamp against min_timestamp, opposite of line 45 above? I am not sure but assume this is making sure that min_date and max_date are not equal to each other, is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually selects only features items that have timestamps below this max_timestamp.

if component["filter"] == "min_date":
try:
temp_data = min_date(
data=temp_data, min_timestamp=component["max_timestamp"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to comment below: should this be min and max being compared, or assigning min from the inpurt? So maybe it should not be max here? Just double checking as you will understand the context better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this is a typo. Fixed.

f" {exception}"
)

if component["filter"] == "coverage":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"imagae_type" instead of coverage will probably be preferred, but can update later if "coverage" is currently the consistent name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

"""


def extract_properties(geojson, properties):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring to clarify the parameters and the expected return from this utility? It can be helpful to clarify these details for future developers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider it donee, boss.

except TypeError as exception:
logger.warning(
f"[pipeline - haversine_dist, TypeError] Filter not applied, exception thrown,"
f" {exception}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor bug here, strings should be concatenated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are concatenated in this way, right? I tested this out in the python terminal,

>>> a = 10
>>> b = 20
>>> (
... f"{a}, "
... f"{b}"
... )
'10, 20'

except TypeError as exception:
logger.warning(
f"[pipeline - coverage, TypeError] Filter not applied, exception thrown,"
f" {exception}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concatenate strings. Also, you can create a function for throwing that exception to avoid such bugs and to abide by the DRY principle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are concatenated in this way, right? I tested this out in the python terminal,

>>> a = 10
>>> b = 20
>>> (
... f"{a}, "
... f"{b}"
... )
'10, 20'

except TypeError as exception:
logger.warning(
f"[pipeline - haversine_dist, TypeError] Filter not applied, exception thrown,"
f" {exception}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as the comment above.

if component["filter"] == "min_date":
try:
temp_data = min_date(
data=temp_data, min_timestamp=component["max_timestamp"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data=temp_data, min_timestamp=component["max_timestamp"]
data=temp_data, min_timestamp=component["min_timestamp"]


# ? This function could use refactoring, but how?

temp_data = data.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you making a copy of the data here if the original data parameter is never used again? why not just use data instead of temp_data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data object being passed in is a dict, which is understood as passed by reference to functions.

I wrote a small code to test out what I had in mind when writing this line of code. The code is,

test_dict = {'a': 10, 'b': 20, 'c': 30}

def change_dict(new_dict: dict) -> dict:
    new_dict['a'] = 40

change_dict(test_dict)

print(test_dict)

The result of this is,

python test.py
{'a': 40, 'b': 20, 'c': 30}

So the dict got modified by reference. However, there isn't really a reason why data should be modified itself in the pipeline function. By making a copy of the data object, I can make this function idempotent, as in, it can be passed the same data over and over again and always give the same output. If the pipeline changes the input data it is fed, it may not make sense and maybe buggy because ideally, this function should only care about its own state, no matter how many filters I apply, and never really modify my input object.

I was thinking about it in that way. What do you think?

Comment on lines 117 to 179
def max_date(data, max_timestamp):
max_timestamp = date_to_unix_epoch(max_timestamp)
return [
feature
for feature in data["features"]
if feature["properties"]["captured_at"] <= max_timestamp
]


def min_date(data, min_timestamp):
min_timestamp = date_to_unix_epoch(min_timestamp)
return [
feature
for feature in data["features"]
if feature["properties"]["captured_at"] >= min_timestamp
]


def params(data, values, properties):
return [
feature
for feature in data["features"]
if feature["properties"][properties] in values
]


def haversine_dist(data, radius, coords, unit="m"):

# Define an empty geojson
output = {"type": "FeatureCollection", "features": []}

for feature in data["features"]:
distance = haversine(coords, feature["geometry"]["coordinates"], unit=unit)

print(distance)
if distance < radius:
output["features"].append(feature)

return output


def coverage(data, tile):
"""The parameter might be 'all' (both is_pano == true and false), 'pano'
(is_pano == true only), or 'flat' (is_pano == false only)"""

bool_for_pano_filtering = (
True if tile == "pano" else False if tile == "flat" else None
)

if bool_for_pano_filtering is not None:
return [
feature
for feature in data["features"]
if feature["properties"]["is_pano"] is bool_for_pano_filtering
]


def organization_id(data, organization_ids: list):
return [
feature
for feature in data["features"]
if feature["properties"]["organization_id"] in organization_ids
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although some of the utils are self-explanatory, it is a good practice to add docstrings for each of them. This will not only add clarification to future contributors but will also set a good example to abide by as ongoing development is expected to this module in particular.

Copy link
Contributor

@OmarMuhammedAli OmarMuhammedAli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pipeline function is a brilliant idea for simplifying the filtering process of different GeoJSON inputs.

Two changes/additions that I think are important:

  • Create a function that throws the TypeError exception to avoid repeating yourself and to avoid the frequent bug mentioned in the above comments.
  • Add inline documentation for different utilities.

@Rubix982
Copy link
Contributor Author

For the above commits,

  • Tests added for both pipeline and extract_properties, and as such, examples that
  • Refactored pipeline extensively, there is the only part that will have to be configured going into the future, and that is the list of filter functions in the variable function_mapping
  • Added DocStrings to each function in the above commits, along with Usage, and Imports, and Test, wherever possible.
  • Moved tests from utils under /tests/utils/
  • Statically typed function parameters for clarity of use

CC: @OmarMuhammedAli review please

@Rubix982 Rubix982 added this to Pull Requests - In Progress in MLH Fellowship Jul 11, 2021
Comment on lines +36 to +44
extracted_fields = {}

for property in properties:
extracted_fields[property] = []

for feature in geojson["features"]:
for property in properties:
if property in feature["properties"]:
extracted_fields[property].append(feature["properties"][property])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referencing @cbeddow 's comment and @Rubix982 's reply:

Suggested change
extracted_fields = {}
for property in properties:
extracted_fields[property] = []
for feature in geojson["features"]:
for property in properties:
if property in feature["properties"]:
extracted_fields[property].append(feature["properties"][property])
extracted_fields = {
property: [
feature["properties"][property]
for feature in geojson["features"]
for property in properties
if property in feature["properties"]
] for property in properties
})

Using a mix of dict comprehension and list comprehension, this works, but is it more efficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels hard to read, though. I'm not sure about efficiency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, they both give the same results. So whatever you guys decide on works! I'll look more into the efficiency aspect though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if it's not a vastly faster response with the comprehensions then ease of reading is good in my view, if I were a newcomer to the library looking to contribute I think that would be appreciated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

Copy link
Contributor

@OmarMuhammedAli OmarMuhammedAli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work, @Rubix982 !!!!! I really enjoyed reviewing this PR. Can you share some resources that inspired you to apply the design decisions you used here?

Comment on lines +36 to +44
extracted_fields = {}

for property in properties:
extracted_fields[property] = []

for feature in geojson["features"]:
for property in properties:
if property in feature["properties"]:
extracted_fields[property].append(feature["properties"][property])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, they both give the same results. So whatever you guys decide on works! I'll look more into the efficiency aspect though


logger = logging.getLogger("pipeline-logger")

def pipeline_component(func, data, exception_message, args):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All hail functional programming

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All hail the Haskell!

@OmarMuhammedAli OmarMuhammedAli merged commit 812bd74 into main Jul 11, 2021
@OmarMuhammedAli OmarMuhammedAli deleted the Utilities-Proposal-1 branch July 11, 2021 14:44
@Rubix982
Copy link
Contributor Author

Amazing work, @Rubix982 !!!!! I really enjoyed reviewing this PR. Can you share some resources that inspired you to apply the design decisions you used here?

Just the DRY principle suggested by someone called "Omar". I think you know him 😉

@OmarMuhammedAli OmarMuhammedAli moved this from Pull Requests - In Progress to Done in MLH Fellowship Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. utility
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants