[Feature] Utilities for extracing fields, filtering pipeline #43

Rubix982 · 2021-07-09T11:16:44Z

[Feature] Utilities for extracting fields, filtering pipeline)

Pull Request Type

🏆 Enhancements - adds utilities for filter pipeline, extracting fields

Purpose

To simply the filtering purpose, and for users who might want to extract only certain fields from a GeoJSON. I ran into a problem when testing requirement#2 where I just wanted to check the final id property across all features in a GeoJSON and did not need everything else, so I came up to write a simple function that does it for me.

I noticed that the DRY principle was being violated with the filtering ideas. As we go on in the requirements, there are fields that are common across requirements, and can be also common with future requirements, that we would have to write the same code for over and over again - what if can automate that?

A pipeline does just that. It passes the data to the first filter it has received, then that output of that gets passed to the 2nd filter, then the output of the 2nd filter gets passed as the input to the 3rd filter.

Why?

The most important change that this PR introduces is the filtering pipeline which lets us reorganize, add, remove filters as simple objects, future demonstration will be worked on. This means if a filtering mechanism breaks or if we want to add a new one, it will be a simple addition, and then any future user or potential developer can just reuse this pipeline component

Feedback required over

A quick pair of 👀 on the code
Discussion on the technical approach
Critique on design

Feedback required when

Whenever feasible

Mentions

@cbeddow what do you think about this approach? Can you think of some other utilities like these that can we can write that might be helpful to either in development or to the users? A user can always import the utilities by writing code such as,

from mapillary.utils.filter import pipeline.
pipeline(...)

@OmarMuhammedAli what do you think about this design aspect of the code?

The most important change that this PR introduces is the filtering pipeline which lets us reorganize, add, remove filters as simple objects, future demonstration will be worked on. This means if a filtering mechanism breaks or if we want to add a new one, it will be a simple addition, and then any future user or potential developer can just reuse this pipeline component

cbeddow

A few comments for clarification, but no technical issues that I see. I think the module format looks good in case someone does want to use the specific utilities for anything, as you mention by importing the utils module to access its functions directly

cbeddow · 2021-07-09T20:03:27Z

mapillary/utils/extract.py

+    for property in properties:
+        extracted_fields[property] = []
+
+    for feature in geojson["features"]:


This looks like it works well. I am wondering however if a list comprehension may be faster, though I am not able to clearly think how to best do it. Maybe review some examples here for ideas: https://realpython.com/list-comprehension-python/

But maybe the list comprehension does not apply since it is more complicated with multiple properties, so whatever you decide after considering this is good!

Sadly, I don't really think it can be applied here. The internal variable is different from the iterable that is being looped over, thus preventing a neat list comprehension loop.

What do you think, @OmarMuhammedAli ?

cbeddow · 2021-07-09T20:06:06Z

mapillary/utils/filter.py

+        if component["filter"] == "max_date":
+            try:
+                temp_data = max_date(
+                    data=temp_data, max_timestamp=component["max_timestamp"]


Should this be checking max_timestamp against min_timestamp, opposite of line 45 above? I am not sure but assume this is making sure that min_date and max_date are not equal to each other, is that right?

It actually selects only features items that have timestamps below this max_timestamp.

cbeddow · 2021-07-09T20:07:28Z

mapillary/utils/filter.py

+        if component["filter"] == "min_date":
+            try:
+                temp_data = min_date(
+                    data=temp_data, min_timestamp=component["max_timestamp"]


Related to comment below: should this be min and max being compared, or assigning min from the inpurt? So maybe it should not be max here? Just double checking as you will understand the context better

You're right, this is a typo. Fixed.

cbeddow · 2021-07-09T20:08:40Z

mapillary/utils/filter.py

+                    f" {exception}"
+                )
+
+        if component["filter"] == "coverage":


"imagae_type" instead of coverage will probably be preferred, but can update later if "coverage" is currently the consistent name

OmarMuhammedAli · 2021-07-09T23:16:56Z

mapillary/utils/extract.py

+"""
+
+
+def extract_properties(geojson, properties):


Can you add a docstring to clarify the parameters and the expected return from this utility? It can be helpful to clarify these details for future developers.

Consider it donee, boss.

mapillary/utils/filter.py

OmarMuhammedAli · 2021-07-09T23:31:03Z

mapillary/utils/filter.py

+            except TypeError as exception:
+                logger.warning(
+                    f"[pipeline - haversine_dist, TypeError] Filter not applied, exception thrown,"
+                    f" {exception}"


A minor bug here, strings should be concatenated

Strings are concatenated in this way, right? I tested this out in the python terminal,

>>> a = 10 >>> b = 20 >>> ( ... f"{a}, " ... f"{b}" ... ) '10, 20'

OmarMuhammedAli · 2021-07-09T23:34:04Z

mapillary/utils/filter.py

+            except TypeError as exception:
+                logger.warning(
+                    f"[pipeline - coverage, TypeError] Filter not applied, exception thrown,"
+                    f" {exception}"


Concatenate strings. Also, you can create a function for throwing that exception to avoid such bugs and to abide by the DRY principle.

Strings are concatenated in this way, right? I tested this out in the python terminal,

>>> a = 10 >>> b = 20 >>> ( ... f"{a}, " ... f"{b}" ... ) '10, 20'

OmarMuhammedAli · 2021-07-09T23:34:33Z

mapillary/utils/filter.py

+            except TypeError as exception:
+                logger.warning(
+                    f"[pipeline - haversine_dist, TypeError] Filter not applied, exception thrown,"
+                    f" {exception}"


same as the comment above.

OmarMuhammedAli · 2021-07-09T23:39:24Z

mapillary/utils/filter.py

+        if component["filter"] == "min_date":
+            try:
+                temp_data = min_date(
+                    data=temp_data, min_timestamp=component["max_timestamp"]


Suggested change

data=temp_data, min_timestamp=component["max_timestamp"]

data=temp_data, min_timestamp=component["min_timestamp"]

OmarMuhammedAli · 2021-07-09T23:42:59Z

mapillary/utils/filter.py

+
+    # ? This function could use refactoring, but how?
+
+    temp_data = data.copy()


Why are you making a copy of the data here if the original data parameter is never used again? why not just use data instead of temp_data?

The data object being passed in is a dict, which is understood as passed by reference to functions.

I wrote a small code to test out what I had in mind when writing this line of code. The code is,

test_dict = {'a': 10, 'b': 20, 'c': 30} def change_dict(new_dict: dict) -> dict: new_dict['a'] = 40 change_dict(test_dict) print(test_dict)

The result of this is,

python test.py {'a': 40, 'b': 20, 'c': 30}

So the dict got modified by reference. However, there isn't really a reason why data should be modified itself in the pipeline function. By making a copy of the data object, I can make this function idempotent, as in, it can be passed the same data over and over again and always give the same output. If the pipeline changes the input data it is fed, it may not make sense and maybe buggy because ideally, this function should only care about its own state, no matter how many filters I apply, and never really modify my input object.

I was thinking about it in that way. What do you think?

OmarMuhammedAli · 2021-07-09T23:49:06Z

mapillary/utils/filter.py

+def max_date(data, max_timestamp):
+    max_timestamp = date_to_unix_epoch(max_timestamp)
+    return [
+        feature
+        for feature in data["features"]
+        if feature["properties"]["captured_at"] <= max_timestamp
+    ]
+
+
+def min_date(data, min_timestamp):
+    min_timestamp = date_to_unix_epoch(min_timestamp)
+    return [
+        feature
+        for feature in data["features"]
+        if feature["properties"]["captured_at"] >= min_timestamp
+    ]
+
+
+def params(data, values, properties):
+    return [
+        feature
+        for feature in data["features"]
+        if feature["properties"][properties] in values
+    ]
+
+
+def haversine_dist(data, radius, coords, unit="m"):
+
+    # Define an empty geojson
+    output = {"type": "FeatureCollection", "features": []}
+
+    for feature in data["features"]:
+        distance = haversine(coords, feature["geometry"]["coordinates"], unit=unit)
+
+        print(distance)
+        if distance < radius:
+            output["features"].append(feature)
+
+    return output
+
+
+def coverage(data, tile):
+    """The parameter might be 'all' (both is_pano == true and false), 'pano'
+    (is_pano == true only), or 'flat' (is_pano == false only)"""
+
+    bool_for_pano_filtering = (
+        True if tile == "pano" else False if tile == "flat" else None
+    )
+
+    if bool_for_pano_filtering is not None:
+        return [
+            feature
+            for feature in data["features"]
+            if feature["properties"]["is_pano"] is bool_for_pano_filtering
+        ]
+
+
+def organization_id(data, organization_ids: list):
+    return [
+        feature
+        for feature in data["features"]
+        if feature["properties"]["organization_id"] in organization_ids
+    ]


Although some of the utils are self-explanatory, it is a good practice to add docstrings for each of them. This will not only add clarification to future contributors but will also set a good example to abide by as ongoing development is expected to this module in particular.

OmarMuhammedAli

The pipeline function is a brilliant idea for simplifying the filtering process of different GeoJSON inputs.

Two changes/additions that I think are important:

Create a function that throws the TypeError exception to avoid repeating yourself and to avoid the frequent bug mentioned in the above comments.
Add inline documentation for different utilities.

…pipeline_components to for code readable

Rubix982 · 2021-07-11T09:31:05Z

For the above commits,

Tests added for both pipeline and extract_properties, and as such, examples that
Refactored pipeline extensively, there is the only part that will have to be configured going into the future, and that is the list of filter functions in the variable function_mapping
Added DocStrings to each function in the above commits, along with Usage, and Imports, and Test, wherever possible.
Moved tests from utils under /tests/utils/
Statically typed function parameters for clarity of use

CC: @OmarMuhammedAli review please

OmarMuhammedAli · 2021-07-11T14:10:16Z

mapillary/utils/extract.py

+    extracted_fields = {}
+
+    for property in properties:
+        extracted_fields[property] = []
+
+    for feature in geojson["features"]:
+        for property in properties:
+            if property in feature["properties"]:
+                extracted_fields[property].append(feature["properties"][property])


Referencing @cbeddow 's comment and @Rubix982 's reply:

Suggested change

extracted_fields = {}

for property in properties:

extracted_fields[property] = []

for feature in geojson["features"]:

for property in properties:

if property in feature["properties"]:

extracted_fields[property].append(feature["properties"][property])

extracted_fields = {

property: [

feature["properties"][property]

for feature in geojson["features"]

for property in properties

if property in feature["properties"]

] for property in properties

})

Using a mix of dict comprehension and list comprehension, this works, but is it more efficient?

This feels hard to read, though. I'm not sure about efficiency.

I mean, they both give the same results. So whatever you guys decide on works! I'll look more into the efficiency aspect though

Yeah if it's not a vastly faster response with the comprehensions then ease of reading is good in my view, if I were a newcomer to the library looking to contribute I think that would be appreciated

OmarMuhammedAli

Amazing work, @Rubix982 !!!!! I really enjoyed reviewing this PR. Can you share some resources that inspired you to apply the design decisions you used here?

OmarMuhammedAli · 2021-07-11T14:14:54Z

mapillary/utils/extract.py

+    extracted_fields = {}
+
+    for property in properties:
+        extracted_fields[property] = []
+
+    for feature in geojson["features"]:
+        for property in properties:
+            if property in feature["properties"]:
+                extracted_fields[property].append(feature["properties"][property])


I mean, they both give the same results. So whatever you guys decide on works! I'll look more into the efficiency aspect though

OmarMuhammedAli · 2021-07-11T14:43:20Z

mapillary/utils/filter.py

+
+logger = logging.getLogger("pipeline-logger")
+
+def pipeline_component(func, data, exception_message, args):


All hail functional programming

All hail the Haskell!

Rubix982 · 2021-07-11T14:56:16Z

Amazing work, @Rubix982 !!!!! I really enjoyed reviewing this PR. Can you share some resources that inspired you to apply the design decisions you used here?

Just the DRY principle suggested by someone called "Omar". I think you know him 😉

Rubix982 added the utility label Jul 9, 2021

Rubix982 requested review from cbeddow and OmarMuhammedAli July 9, 2021 11:16

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 9, 2021

cbeddow reviewed Jul 9, 2021

View reviewed changes

OmarMuhammedAli reviewed Jul 9, 2021

View reviewed changes

mapillary/utils/filter.py Show resolved Hide resolved

OmarMuhammedAli reviewed Jul 9, 2021

View reviewed changes

OmarMuhammedAli suggested changes Jul 9, 2021

View reviewed changes

Rubix982 added 4 commits July 11, 2021 13:56

Refactor pipeline logic, add DocString to each filter module, create …

0bce580

…pipeline_components to for code readable

Change coverage to image_type for readability purposes

0b00ad6

Added test for filter.py

1036588

Added documentation, create test, for extract_properties

af62670

Rubix982 added this to Pull Requests - In Progress in MLH Fellowship Jul 11, 2021

OmarMuhammedAli reviewed Jul 11, 2021

View reviewed changes

OmarMuhammedAli approved these changes Jul 11, 2021

View reviewed changes

OmarMuhammedAli reviewed Jul 11, 2021

View reviewed changes

OmarMuhammedAli merged commit 812bd74 into main Jul 11, 2021

OmarMuhammedAli deleted the Utilities-Proposal-1 branch July 11, 2021 14:44

OmarMuhammedAli moved this from Pull Requests - In Progress to Done in MLH Fellowship Jul 12, 2021

Rubix982 mentioned this pull request Jul 12, 2021

[SECURITY - ACCESS_TOKEN] Access Token Introduced, PR#44 #51

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Utilities for extracing fields, filtering pipeline #43

[Feature] Utilities for extracing fields, filtering pipeline #43

Rubix982 commented Jul 9, 2021

cbeddow left a comment

cbeddow Jul 9, 2021

Rubix982 Jul 11, 2021

cbeddow Jul 9, 2021

Rubix982 Jul 11, 2021

cbeddow Jul 9, 2021

Rubix982 Jul 11, 2021

cbeddow Jul 9, 2021

Rubix982 Jul 11, 2021

OmarMuhammedAli Jul 9, 2021

Rubix982 Jul 11, 2021

OmarMuhammedAli Jul 9, 2021

Rubix982 Jul 11, 2021

OmarMuhammedAli Jul 9, 2021

Rubix982 Jul 11, 2021

OmarMuhammedAli Jul 9, 2021

OmarMuhammedAli Jul 9, 2021

OmarMuhammedAli Jul 9, 2021

Rubix982 Jul 11, 2021

OmarMuhammedAli Jul 9, 2021

OmarMuhammedAli left a comment

Rubix982 commented Jul 11, 2021

OmarMuhammedAli Jul 11, 2021

Rubix982 Jul 11, 2021

OmarMuhammedAli Jul 11, 2021

cbeddow Jul 13, 2021

Rubix982 Jul 13, 2021

OmarMuhammedAli left a comment

OmarMuhammedAli Jul 11, 2021

OmarMuhammedAli Jul 11, 2021

Rubix982 Jul 11, 2021

Rubix982 commented Jul 11, 2021

	data=temp_data, min_timestamp=component["max_timestamp"]
	data=temp_data, min_timestamp=component["min_timestamp"]


		# ? This function could use refactoring, but how?

		temp_data = data.copy()


		logger = logging.getLogger("pipeline-logger")

		def pipeline_component(func, data, exception_message, args):

[Feature] Utilities for extracing fields, filtering pipeline #43

[Feature] Utilities for extracing fields, filtering pipeline #43

Conversation

Rubix982 commented Jul 9, 2021

[Feature] Utilities for extracting fields, filtering pipeline)

Pull Request Type

Purpose

Why?

Feedback required over

Feedback required when

Mentions

cbeddow left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OmarMuhammedAli left a comment

Choose a reason for hiding this comment

Rubix982 commented Jul 11, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OmarMuhammedAli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Rubix982 commented Jul 11, 2021