This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Add facility/processing type taxonomy and look-up function #1601

Merged
caseycesari merged 2 commits into develop from cpc/add-facility-production-type on Jan 31, 2022

caseycesari
Contributor

@caseycesari commented Jan 25, 2022

Overview

The look-up function takes the input value, cleans it up by removing non-letter characters and extra spaces, and then attempts to find a match in the taxonomy using various methods.
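The cleaning step described above could look roughly like the following. This is only a sketch: `clean_value` is a hypothetical name, not the project's actual helper, and lowercasing the result is an assumption that the description does not state.

```python
import re


def clean_value(value):
    """Hypothetical stand-in for the cleaning step described above."""
    # Replace every character that is not an ASCII letter or whitespace
    # with a space, so that punctuation and digits cannot join two words.
    letters_only = re.sub(r'[^A-Za-z\s]', ' ', value)
    # Collapse runs of whitespace, trim the ends, and lowercase (assumed).
    return ' '.join(letters_only.split()).lower()


print(clean_value('  Wet-Roller   Printing (2)! '))
```

Replacing stripped characters with a space rather than deleting them keeps hyphenated words apart; whether the real helper does the same is not shown here.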

Connects #1582

Demo

>>> get_facility_and_processing_type('shredding')
('PROCESSING_TYPE', 'EXACT', 'Raw Material Processing or Production', 'Shredding')
>>> get_facility_and_processing_type('pulp')
('PROCESSING_TYPE', 'FUZZY', 'Raw Material Processing or Production', 'Pulp making')
>>> get_facility_and_processing_type('wet painting')
('PROCESSING_TYPE', 'FUZZY', 'Printing, Product Dyeing and Laundering', 'Wet roller printing')
>>> get_facility_and_processing_type('wet printing')
('PROCESSING_TYPE', 'ALIAS', 'Printing, Product Dyeing and Laundering', 'Wet roller printing')
>>> get_facility_and_processing_type('assembly')
('PROCESSING_TYPE', 'EXACT', 'Final Product Assembly', 'Assembly')
>>> get_facility_and_processing_type('assemble')
('PROCESSING_TYPE', 'FUZZY', 'Final Product Assembly', 'Assembly')
>>> get_facility_and_processing_type('final assembly')
('FACILITY_TYPE', 'ALIAS', 'Final Product Assembly', 'Final Product Assembly')

Note

FUZZY_ALIAS_MATCH was not implemented because it ended up not being useful. All non-exact entries were matched using either the alias or the fuzzy method.
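For readers who want to see how the match methods fit together, here is a minimal sketch of an exact, then alias, then fuzzy look-up. Everything in it is an assumption made for illustration: the `lookup` function, the three-entry taxonomy, and the alias table are hypothetical, and the standard library's `difflib` stands in for whichever fuzzy-matching dependency the PR installs. Only the 85-point threshold and the four-item result shape come from this PR.

```python
from difflib import SequenceMatcher

# Tiny illustrative taxonomy: cleaned name -> (facility type, processing type).
PROCESSING_TYPES = {
    'shredding': ('Raw Material Processing or Production', 'Shredding'),
    'wet roller printing': ('Printing, Product Dyeing and Laundering',
                            'Wet roller printing'),
    'assembly': ('Final Product Assembly', 'Assembly'),
}
# Alternate names that map onto a taxonomy key (hypothetical entries).
ALIASES = {'wet printing': 'wet roller printing'}
NO_MATCH = (None, None, None, None)


def lookup(value):
    key = ' '.join(value.lower().split())
    # 1. Exact match on the cleaned value.
    if key in PROCESSING_TYPES:
        return ('PROCESSING_TYPE', 'EXACT', *PROCESSING_TYPES[key])
    # 2. Alias match.
    if key in ALIASES:
        return ('PROCESSING_TYPE', 'ALIAS', *PROCESSING_TYPES[ALIASES[key]])
    # 3. Fuzzy match: keep the best-scoring taxonomy key on a 0-100 scale.
    best_name, best_score = None, 0.0
    for name in PROCESSING_TYPES:
        score = 100 * SequenceMatcher(None, key, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    # A match must score 85 or higher to be considered usable.
    if best_name is None or best_score < 85:
        return NO_MATCH
    return ('PROCESSING_TYPE', 'FUZZY', *PROCESSING_TYPES[best_name])
```

Because `difflib` scores whole-string similarity, this sketch is stricter than the demo above: 'assemble' still matches 'Assembly', but inputs such as 'pulp' or 'wet painting' would not reach the threshold here.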

Testing Instructions

  • Run ./scripts/update to install the new pip dependency.
  • Run ./scripts/test to ensure the new tests pass.
  • Start a shell (./scripts/manage shell) and manually experiment with the matching function to check that it otherwise works as intended.
  • Review the taxonomy and suggest any edits.

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

@caseycesari force-pushed the cpc/add-facility-production-type branch 5 times, most recently from 106e151 to e7c1f37 on January 28, 2022 16:03
@caseycesari changed the title from "WIP: Add facility type and production type taxonomy" to "Add facility/processing type taxonomy and look-up function" on Jan 28, 2022
@caseycesari marked this pull request as ready for review on January 28, 2022 16:13
@@ -0,0 +1,397 @@
from api.matching import clean
Contributor Author

I put this stuff in its own file since there are so many items.

@@ -600,6 +608,68 @@ def test_incorrect_country_code_raises_error(self, mock_get):
)


class FacilityAndProcessingTypeTest(TestCase):
Contributor Author

I know the issue said to create tests using a loop, but I found it easier to incrementally add functionality to the look-up function with individual tests. I can condense if need be.

Contributor

@jwalgran left a comment

Nice implementation. I was excited to see how well this works. I made one suggestion on how we return when there is no match, but we could defer that until we actually have client code using it.


# Match must score 85 or higher to be considered usable.
if not matched_value or matched_value[1] < 85:
    return None
Contributor

We may want to consider returning (None, None, None, None) here so that the "shape" of the return values from the function is always the same, which may make the calling code a little clearer.

>>> (a, b, c, d) = get_facility_and_processing_type('WILL NOT MATCH ANYTHING')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: cannot unpack non-iterable NoneType object

Contributor Author

I had that at first, but changed it to None because (None, None, None, None) is itself not None (and is truthy), so a simple check for a failed match no longer works. But as you point out, this can wait for the client code; I have some questions about its implementation that may require changing this.
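The trade-off discussed in this thread can be illustrated with a small sketch. `find_type`, `NO_MATCH`, and the single known entry are hypothetical and only stand in for the real look-up function; the point is how a caller would detect a miss once the function always returns four items.

```python
NO_MATCH = (None, None, None, None)


def find_type(value):
    """Hypothetical look-up that always returns a four-item tuple."""
    known = {
        'assembly': ('PROCESSING_TYPE', 'EXACT',
                     'Final Product Assembly', 'Assembly'),
    }
    return known.get(value, NO_MATCH)


result = find_type('will not match anything')

# A non-empty tuple is truthy even when every item is None, so
# `if result:` cannot be used to detect a failed match.
tuple_is_truthy = bool(result)

# Unpacking always works because the shape is constant; the caller
# checks one field (or compares against the sentinel) instead.
kind, match_type, facility_type, processing_type = result
matched = match_type is not None
```

With this shape the unpacking from the traceback above no longer raises, at the cost of a slightly more explicit "no match" check in the caller.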

Contributor Author

Updated this. Thanks.

FUZZY_MATCH = 'FUZZY'

ALL_PROCESSING_TYPES = {
    **OFFICE_PROCESSING_TYPES,
Contributor

Nice use of ** here to compactly build these derived data structures.
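As a short illustration of the pattern being praised, dictionary unpacking merges several mappings into one literal. The name OFFICE_PROCESSING_TYPES comes from the diff above, but its contents here, along with ASSEMBLY_PROCESSING_TYPES, are made up for the example.

```python
# Contents are illustrative only; the real taxonomy is much larger.
OFFICE_PROCESSING_TYPES = {'headquarters': 'Headquarters'}
ASSEMBLY_PROCESSING_TYPES = {'assembly': 'Assembly', 'sewing': 'Sewing'}

# Each ** copies the key/value pairs of one dict into the new dict.
ALL_PROCESSING_TYPES = {
    **OFFICE_PROCESSING_TYPES,
    **ASSEMBLY_PROCESSING_TYPES,
}

# If the same key appears more than once, the later mapping wins.
overridden = {**{'assembly': 'Old name'}, **ASSEMBLY_PROCESSING_TYPES}
```

The source dictionaries are left unchanged, so each group can still be used on its own.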

@jwalgran assigned caseycesari and unassigned jwalgran on Jan 31, 2022
@caseycesari force-pushed the cpc/add-facility-production-type branch from e7c1f37 to 29de8fa on January 31, 2022 20:07
@caseycesari merged commit 5593ecd into develop on Jan 31, 2022
@caseycesari deleted the cpc/add-facility-production-type branch on January 31, 2022 20:16

2 participants