Add extended field reports #1826

TaiWilkin · 2022-04-29T18:52:11Z

Overview

Adds new reports for extended fields:

Contributors With Extended Fields Cumulative: This report lists
the number of contributors who have submitted each field type. It is
organized into months, where each month's count includes all extended
fields with an update date of that month or earlier.
Facilities With Extended Fields Cumulative: This report lists
the number of facilities that have each field type. It is
organized into months, where each month's count includes all extended
fields with an update date of that month or earlier.
Processing Type Facility Type Matched Values: This report lists each
taxonomy value that has been assigned to an extended field, accompanied
by a list of the raw values that have been matched to that value (separated
by '|').
Processing Type Facility Type Unmatched Values: This report lists
each raw value that was submitted as a Processing_Type_Facility_Type
and wasn't matched to a taxonomy value, as well as a count of the number
of times that raw value was submitted.
Submitted Product Type Values: This report lists each raw value that
was submitted for a product_type extended field, as well as a count of
the number of times that raw value was submitted.

Connects #1767

Demo

Notes

The issue requested one report with "Report that lists all taxonomy values for product_type, processing_type_facility_type, and the raw values that were fuzzy matched to these values",
but product_type doesn't use any fuzzy matching. It could be that this was meant to be processing_type
(which is already included under the processing_type_facility_type report), but because the
OAR team has also mentioned wanting to evaluate the product_type values being submitted,
I included the product_type values as a separate report.

Testing Instructions

Run ./scripts/server
Visit contributors_with_extended_fields_cumulative
Visit facilities_with_extended_fields_cumulative
Visit processing_type_facility_type_matched_values
Visit processing_type_facility_type_unmatched_values
Visit submitted_product_type_values
Confirm each report contains the expected values and that the format makes sense

Checklist

fixup! commits have been squashed
CI passes after rebase
CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

jwalgran

The logic in these Python reports was easy for me to follow. I ran into a few issues in testing.

src/django/api/reports.py

jwalgran · 2022-04-29T20:51:43Z

src/django/api/reports.py

+        'value__raw_values', 'value__matched_values'
+    ).values_list('value__raw_values', 'value__matched_values').iterator():
+        for i, raw_value in enumerate(raw_values.split('|')):
+            if matched_values[i][3] is not None:


When testing this on staging there was an index out of range exception on this line

https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1549/

I pushed a commit to to the test branch on staging to get more information

diff --git a/src/django/api/reports.py b/src/django/api/reports.py index c0bc97a6..dfab169b 100644 --- a/src/django/api/reports.py +++ b/src/django/api/reports.py @@ -143,9 +143,16 @@ def processing_type_facility_type_matched(): 'value__raw_values', 'value__matched_values' ).values_list('value__raw_values', 'value__matched_values').iterator(): for i, raw_value in enumerate(raw_values.split('|')): - if matched_values[i][3] is not None: - temp_data[matched_values[i][3]].add(raw_value.strip()) - temp_data[matched_values[i][2]].add(raw_value.strip()) + try: + if matched_values[i][3] is not None: + temp_data[matched_values[i][3]].add(raw_value.strip()) + temp_data[matched_values[i][2]].add(raw_value.strip()) + except IndexError: + raise Exception('matched_values has unexpected shape.\n' + 'i = {}\n' + 'raw_value = {}\n' + 'matched_values = {}'.format( + i, raw_value, matched_values)) data = dict() for (raw_value, match_values) in temp_data.items(): @@ -164,8 +171,15 @@ def processing_type_facility_type_unmatched(): 'value__raw_values', 'value__matched_values' ).values_list('value__raw_values', 'value__matched_values').iterator(): for i, raw_value in enumerate(raw_values.split('|')): - if matched_values[i][3] is None: - data[raw_value.strip()] = data[raw_value.strip()] + 1 + try: + if matched_values[i][3] is None: + data[raw_value.strip()] = data[raw_value.strip()] + 1 + except IndexError: + raise Exception('matched_values has unexpected shape.\n' + 'i = {}\n' + 'raw_value = {}\n' + 'matched_values = {}'.format( + i, raw_value, matched_values)) rows = sort_by_first_column(data.items()) return [['processing_type_facility_type', 'times_submitted'], rows]

Here is the additional details showed up in Roolbar

https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1551/

Exception: matched_values has unexpected shape. i = 5 raw_value = Review matched_values = [['PROCESSING_TYPE', 'EXACT', 'Printing, Product Dyeing and Laundering', 'Finishing'], [None, None, None, None], ['PROCESSING_TYPE', 'EXACT', 'Final Product Assembly', 'Sewing'], ['PROCESSING_TYPE', 'EXACT', 'Warehousing / Distribution', 'Packing'], [None, None, None, None]] (Most recent call last)

There are only 5 values in matched_values, so there points to there being record in the database where there are more raw_values than matched_values, which is unexpected.

Looking at the staging database with

select value->'raw_values', value->'matched_values' from api_extendedfield where field_name = 'processing_type';

I see that value-'raw_values' contains a mix of

Single strings ("Manufacturing")

Arrays (["cutting", "packing", "sewing"])

Pipe-delimited strings ("Finishing|Electronic Embroidery|Sewing|Printing|Weaving")

On reflection this makes sense. API users can send literal arrays, but list uploaders will submit single strings (perhaps containing pipes)

After adding some single string and array values to the facility_type and processing_type values in the database, I was able to replicate these errors locally. I updated the report to process the raw_values in the same way as the extended fields script here. Noting the removal of duplicate values there, I tested submitting a facility_type with repeated values and confirmed that this caused an indexing error. Using the same process of processing and deduplication as the extended_fields processing in report resulted in the duplicate and variously formatted values being handled correctly without error.

jwalgran · 2022-04-29T22:54:39Z

src/django/api/reports.py

+            updated_at__year__lte=month.year,
+            updated_at__month__lte=month.month)
+
+        for field_name, _ in ExtendedField.FIELD_CHOICES:


Not a problem introduced in this PR, but adding this use of FIELD_CHOICES exposed the fact that facility type and processing type are repeated . We will want to remove them, which will likely require creating a migration since is "changes" the choices attached to a model field.

https://github.com/open-apparel-registry/open-apparel-registry/blame/a070e8be9dce56112de4ebb757f477c374b0e595/src/django/api/models.py#L2387-L2398

I made a model change and a migration in 2ae5f45. I checked my local data after migrating to confirm that removing two of the values didn't have any impact on my processing_type and facility_type fields, just in case, and confirmed there were no empty values for the field_name.

TaiWilkin · 2022-05-02T19:21:23Z

Thanks for the thorough review and for testing on staging. This is ready for another look.

jwalgran

I ran into a crash when testing on staging. My research into this is included below but the short version is that there are 4 processing type extended fields created from claims have a mismatch between the count of raw and matched values. I propose working around this by reporting the error to Rollbar rather than crash and open a follow up issue to resolve it.

https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1549/

Using Rollbar to look at the local values at the point of the exception there are 7 deduped_values and 6 matched_values. A

[
"Denim Services",
"Fabric Dyeing",
"Pleating",
"Printing",
"Sewing",
"Spinning",
"Trimming/Embellishment"
]
[
[
"PROCESSING_TYPE",
"FUZZY",
"Printing, Product Dyeing and Laundering",
"Dyeing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Final Product Assembly",
"Pleating"
],
[
"PROCESSING_TYPE",
"EXACT",
"Printing, Product Dyeing and Laundering",
"Printing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Final Product Assembly",
"Sewing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Raw Material Processing or Production",
"Spinning"
],
[
"PROCESSING_TYPE",
"FUZZY",
"Textile or Material Production",
"Embellishment"
]
]

This should not happen. The unmatched "Denim Services" should result in a [None, None, None, None] match row. I tested this by POSTing a facility to the API with

"facility_type_processing_type": [
"Denim Services",
"Fabric Dyeing",
"Pleating",
"Printing",
"Sewing",
"Spinning",
"Trimming/Embellishment"
]
And it did create the None item in matched_values

I ran this query on staging to try and get more information

select * from (
  select facility_id, facility_claim_id, facility_list_item_id,
         CASE WHEN jsonb_typeof(value->'raw_values') = 'array'
           THEN jsonb_array_length(value->'raw_values')
         ELSE -1
         END as raw_count,
         jsonb_array_length(value->'matched_values') as matched_count
    from api_extendedfield
   where field_name = 'processing_type'
  ) s
where raw_count > 0 and raw_count != matched_count;

There were 4 extended fields created from claims that had a count mismatch between raw and matched

openapparelregistry-> where raw_count > 0 and raw_count != matched_count;
   facility_id   | facility_claim_id | facility_list_item_id | raw_count | matched_count
-----------------+-------------------+-----------------------+-----------+---------------
 IT2019311JZR7ZX |                56 |                       |         7 |             6
 CN201928823FEJ3 |                53 |                       |        12 |            11
 PK2021341XSCC8Y |               333 |                       |        19 |            17
 IN202108462CDJV |               350 |                       |        24 |            22

When I ran the query against production I got the same 4 rows.

In staging there are 66 other processing type extended field rows from claims with matching counts.

Here is a partial WIP commit that I added to my staging test branch that adds a try/catch with reporting to Rollbar. This fixes the report on staging and logs 4 errors to Rollbar, which reinforces the finding that the report code handles the other cases properly.

1352f2b

TaiWilkin · 2022-05-04T15:23:02Z

@jwalgran I'm pretty sure I know exactly where this error is coming from: this PR #1677, when we dropped Denim Services and several other values. We decided at the time to treat them as invalid values vs set them to None, but that's causing issues with the logic here.
I'd like to suggest we filter those values out here as well and don't consider them. Then in addition I think we also should still add the try/catch so if there are future misalignments we are alerted but don't crash.

jwalgran

Thank you for looking up and referencing that previous PR. Yes. I agree with your plan to filter out these values but also use the try/catch.

jwalgran · 2022-05-04T16:07:44Z

src/django/api/reports.py

+    for field_name, _ in ExtendedField.FIELD_CHOICES:
+        columns.append(field_name)


The OAR team request that we add a "(claim)" suffix to the country, name and address column headers to clearly indicate that they come from the claim process, bot facility data submission.

Followup question: We had added country as one of the initial fields, but I don't believe that the claim form actually includes country as a value for the facility (only for the Office Information section). Do we want to remove this as well as the two duplicate values, since it's not actually a possible field type at the moment?

Good idea. Having the country column filled with all zeros all the time just adds noise to the report.

TaiWilkin · 2022-05-05T12:07:33Z

Updated processing_type reports (I only have this month's data in my system right now):

This is ready for another look.

jwalgran

Looks good, and error free on staging! Thanks for working through the revisions with me.

- Contributors With Extended Fields Cumulative: This report lists the number of contributors who have submitted each field type. It is organized into months, where each month's count includes all extended fields with an update date of that month or earlier. - Facilities With Extended Fields Cumulative: This report lists the number of facilities that have each field type. It is organized into months, where each month's count includes all extended fields with an update date of that month or earlier. - Processing Type Facility Type Matched Values: This report lists each taxonomy value that has been assigned to an extended field, accompanied by a list of the raw values that have been matched to that value. - Processing Type Facility Type Unmatched Values: This report lists each raw value that was submitted as a Processing_Type_Facility_Type and wasn't matched to a taxonomy value, as well as a count of the number of times that raw value was submitted. - Submitted Product Type Values: This report lists each raw value that was submitted for a product_type extended field, as well as a count of the number of times that raw value was submitted.

Two of the choices for ExtendedField.field_name were mistakenly duplicated (facility_type and processing_type). Removes the duplicates via migration and model change.

TaiWilkin · 2022-05-06T12:39:55Z

Thank you for the debugging help and detailed feedback!

TaiWilkin requested a review from jwalgran April 29, 2022 18:52

TaiWilkin assigned jwalgran Apr 29, 2022

TaiWilkin force-pushed the tw/ep-reports branch from 4d660cb to 55de5c2 Compare April 29, 2022 18:56

jwalgran reviewed Apr 29, 2022

View reviewed changes

jwalgran assigned TaiWilkin Apr 29, 2022

TaiWilkin requested a review from jwalgran May 2, 2022 19:20

jwalgran reviewed May 4, 2022

View reviewed changes

TaiWilkin requested a review from jwalgran May 5, 2022 12:05

jwalgran approved these changes May 5, 2022

View reviewed changes

jwalgran removed their assignment May 5, 2022

TaiWilkin added 2 commits May 6, 2022 08:25

Remove duplicate field_name choices

f2f13ed

Two of the choices for ExtendedField.field_name were mistakenly duplicated (facility_type and processing_type). Removes the duplicates via migration and model change.

TaiWilkin force-pushed the tw/ep-reports branch from bed4a7f to f2f13ed Compare May 6, 2022 12:26

TaiWilkin merged commit 810947d into develop May 6, 2022

TaiWilkin deleted the tw/ep-reports branch May 6, 2022 12:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add extended field reports #1826

Add extended field reports #1826

TaiWilkin commented Apr 29, 2022 •

edited

jwalgran left a comment

jwalgran Apr 29, 2022

TaiWilkin May 2, 2022 •

edited

jwalgran Apr 29, 2022

TaiWilkin May 2, 2022

TaiWilkin commented May 2, 2022

jwalgran left a comment

TaiWilkin commented May 4, 2022 •

edited

jwalgran left a comment

jwalgran May 4, 2022

TaiWilkin May 4, 2022

jwalgran May 4, 2022

TaiWilkin commented May 5, 2022 •

edited

jwalgran left a comment

TaiWilkin commented May 6, 2022

		for field_name, _ in ExtendedField.FIELD_CHOICES:
		columns.append(field_name)

Add extended field reports #1826

Add extended field reports #1826

Conversation

TaiWilkin commented Apr 29, 2022 • edited

Overview

Demo

Notes

Testing Instructions

Checklist

jwalgran left a comment

Choose a reason for hiding this comment

jwalgran Apr 29, 2022

Choose a reason for hiding this comment

TaiWilkin May 2, 2022 • edited

Choose a reason for hiding this comment

jwalgran Apr 29, 2022

Choose a reason for hiding this comment

TaiWilkin May 2, 2022

Choose a reason for hiding this comment

TaiWilkin commented May 2, 2022

jwalgran left a comment

Choose a reason for hiding this comment

TaiWilkin commented May 4, 2022 • edited

jwalgran left a comment

Choose a reason for hiding this comment

jwalgran May 4, 2022

Choose a reason for hiding this comment

TaiWilkin May 4, 2022

Choose a reason for hiding this comment

jwalgran May 4, 2022

Choose a reason for hiding this comment

TaiWilkin commented May 5, 2022 • edited

jwalgran left a comment

Choose a reason for hiding this comment

TaiWilkin commented May 6, 2022

TaiWilkin commented Apr 29, 2022 •

edited

TaiWilkin May 2, 2022 •

edited

TaiWilkin commented May 4, 2022 •

edited

TaiWilkin commented May 5, 2022 •

edited