Conversation
4d660cb
to
55de5c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in these Python reports was easy for me to follow. I ran into a few issues in testing.
src/django/api/reports.py
Outdated
'value__raw_values', 'value__matched_values' | ||
).values_list('value__raw_values', 'value__matched_values').iterator(): | ||
for i, raw_value in enumerate(raw_values.split('|')): | ||
if matched_values[i][3] is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When testing this on staging there was an index out of range exception on this line
https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1549/
I pushed a commit to to the test branch on staging to get more information
diff --git a/src/django/api/reports.py b/src/django/api/reports.py
index c0bc97a6..dfab169b 100644
--- a/src/django/api/reports.py
+++ b/src/django/api/reports.py
@@ -143,9 +143,16 @@ def processing_type_facility_type_matched():
'value__raw_values', 'value__matched_values'
).values_list('value__raw_values', 'value__matched_values').iterator():
for i, raw_value in enumerate(raw_values.split('|')):
- if matched_values[i][3] is not None:
- temp_data[matched_values[i][3]].add(raw_value.strip())
- temp_data[matched_values[i][2]].add(raw_value.strip())
+ try:
+ if matched_values[i][3] is not None:
+ temp_data[matched_values[i][3]].add(raw_value.strip())
+ temp_data[matched_values[i][2]].add(raw_value.strip())
+ except IndexError:
+ raise Exception('matched_values has unexpected shape.\n'
+ 'i = {}\n'
+ 'raw_value = {}\n'
+ 'matched_values = {}'.format(
+ i, raw_value, matched_values))
data = dict()
for (raw_value, match_values) in temp_data.items():
@@ -164,8 +171,15 @@ def processing_type_facility_type_unmatched():
'value__raw_values', 'value__matched_values'
).values_list('value__raw_values', 'value__matched_values').iterator():
for i, raw_value in enumerate(raw_values.split('|')):
- if matched_values[i][3] is None:
- data[raw_value.strip()] = data[raw_value.strip()] + 1
+ try:
+ if matched_values[i][3] is None:
+ data[raw_value.strip()] = data[raw_value.strip()] + 1
+ except IndexError:
+ raise Exception('matched_values has unexpected shape.\n'
+ 'i = {}\n'
+ 'raw_value = {}\n'
+ 'matched_values = {}'.format(
+ i, raw_value, matched_values))
rows = sort_by_first_column(data.items())
return [['processing_type_facility_type', 'times_submitted'], rows]
Here is the additional details showed up in Roolbar
https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1551/
Exception:
matched_values has unexpected shape.
i = 5
raw_value = Review
matched_values = [['PROCESSING_TYPE', 'EXACT', 'Printing, Product Dyeing and Laundering', 'Finishing'], [None, None, None, None], ['PROCESSING_TYPE', 'EXACT', 'Final Product Assembly', 'Sewing'], ['PROCESSING_TYPE', 'EXACT', 'Warehousing / Distribution', 'Packing'], [None, None, None, None]]
(Most recent call last)
There are only 5 values in matched_values
, so there points to there being record in the database where there are more raw_values
than matched_values
, which is unexpected.
Looking at the staging database with
select value->'raw_values', value->'matched_values'
from api_extendedfield
where field_name = 'processing_type';
I see that value-'raw_values'
contains a mix of
- Single strings (
"Manufacturing"
) - Arrays (
["cutting", "packing", "sewing"]
) - Pipe-delimited strings (
"Finishing|Electronic Embroidery|Sewing|Printing|Weaving"
)
On reflection this makes sense. API users can send literal arrays, but list uploaders will submit single strings (perhaps containing pipes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After adding some single string and array values to the facility_type and processing_type values in the database, I was able to replicate these errors locally. I updated the report to process the raw_values in the same way as the extended fields script here. Noting the removal of duplicate values there, I tested submitting a facility_type with repeated values and confirmed that this caused an indexing error. Using the same process of processing and deduplication as the extended_fields processing in report resulted in the duplicate and variously formatted values being handled correctly without error.
updated_at__year__lte=month.year, | ||
updated_at__month__lte=month.month) | ||
|
||
for field_name, _ in ExtendedField.FIELD_CHOICES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a problem introduced in this PR, but adding this use of FIELD_CHOICES
exposed the fact that facility type and processing type are repeated . We will want to remove them, which will likely require creating a migration since is "changes" the choices attached to a model field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a model change and a migration in 2ae5f45. I checked my local data after migrating to confirm that removing two of the values didn't have any impact on my processing_type and facility_type fields, just in case, and confirmed there were no empty values for the field_name.
Thanks for the thorough review and for testing on staging. This is ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into a crash when testing on staging. My research into this is included below but the short version is that there are 4 processing type extended fields created from claims have a mismatch between the count of raw and matched values. I propose working around this by reporting the error to Rollbar rather than crash and open a follow up issue to resolve it.
https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1549/
Using Rollbar to look at the local values at the point of the exception there are 7 deduped_values
and 6 matched_values
. A
[
"Denim Services",
"Fabric Dyeing",
"Pleating",
"Printing",
"Sewing",
"Spinning",
"Trimming/Embellishment"
]
[
[
"PROCESSING_TYPE",
"FUZZY",
"Printing, Product Dyeing and Laundering",
"Dyeing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Final Product Assembly",
"Pleating"
],
[
"PROCESSING_TYPE",
"EXACT",
"Printing, Product Dyeing and Laundering",
"Printing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Final Product Assembly",
"Sewing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Raw Material Processing or Production",
"Spinning"
],
[
"PROCESSING_TYPE",
"FUZZY",
"Textile or Material Production",
"Embellishment"
]
]
This should not happen. The unmatched "Denim Services" should result in a [None, None, None, None] match row. I tested this by POSTing a facility to the API with
"facility_type_processing_type": [
"Denim Services",
"Fabric Dyeing",
"Pleating",
"Printing",
"Sewing",
"Spinning",
"Trimming/Embellishment"
]
And it did create the None item in matched_values
I ran this query on staging to try and get more information
select * from (
select facility_id, facility_claim_id, facility_list_item_id,
CASE WHEN jsonb_typeof(value->'raw_values') = 'array'
THEN jsonb_array_length(value->'raw_values')
ELSE -1
END as raw_count,
jsonb_array_length(value->'matched_values') as matched_count
from api_extendedfield
where field_name = 'processing_type'
) s
where raw_count > 0 and raw_count != matched_count;
There were 4 extended fields created from claims that had a count mismatch between raw and matched
openapparelregistry-> where raw_count > 0 and raw_count != matched_count;
facility_id | facility_claim_id | facility_list_item_id | raw_count | matched_count
-----------------+-------------------+-----------------------+-----------+---------------
IT2019311JZR7ZX | 56 | | 7 | 6
CN201928823FEJ3 | 53 | | 12 | 11
PK2021341XSCC8Y | 333 | | 19 | 17
IN202108462CDJV | 350 | | 24 | 22
When I ran the query against production I got the same 4 rows.
In staging there are 66 other processing type extended field rows from claims with matching counts.
Here is a partial WIP commit that I added to my staging test branch that adds a try/catch with reporting to Rollbar. This fixes the report on staging and logs 4 errors to Rollbar, which reinforces the finding that the report code handles the other cases properly.
@jwalgran I'm pretty sure I know exactly where this error is coming from: this PR #1677, when we dropped Denim Services and several other values. We decided at the time to treat them as invalid values vs set them to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking up and referencing that previous PR. Yes. I agree with your plan to filter out these values but also use the try/catch.
src/django/api/reports.py
Outdated
for field_name, _ in ExtendedField.FIELD_CHOICES: | ||
columns.append(field_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OAR team request that we add a "(claim)" suffix to the country
, name
and address
column headers to clearly indicate that they come from the claim process, bot facility data submission.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followup question: We had added country
as one of the initial fields, but I don't believe that the claim form actually includes country
as a value for the facility (only for the Office Information section). Do we want to remove this as well as the two duplicate values, since it's not actually a possible field type at the moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Having the country column filled with all zeros all the time just adds noise to the report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, and error free on staging! Thanks for working through the revisions with me.
- Contributors With Extended Fields Cumulative: This report lists the number of contributors who have submitted each field type. It is organized into months, where each month's count includes all extended fields with an update date of that month or earlier. - Facilities With Extended Fields Cumulative: This report lists the number of facilities that have each field type. It is organized into months, where each month's count includes all extended fields with an update date of that month or earlier. - Processing Type Facility Type Matched Values: This report lists each taxonomy value that has been assigned to an extended field, accompanied by a list of the raw values that have been matched to that value. - Processing Type Facility Type Unmatched Values: This report lists each raw value that was submitted as a Processing_Type_Facility_Type and wasn't matched to a taxonomy value, as well as a count of the number of times that raw value was submitted. - Submitted Product Type Values: This report lists each raw value that was submitted for a product_type extended field, as well as a count of the number of times that raw value was submitted.
Two of the choices for ExtendedField.field_name were mistakenly duplicated (facility_type and processing_type). Removes the duplicates via migration and model change.
Thank you for the debugging help and detailed feedback! |
Overview
Adds new reports for extended fields:
the number of contributors who have submitted each field type. It is
organized into months, where each month's count includes all extended
fields with an update date of that month or earlier.
the number of facilities that have each field type. It is
organized into months, where each month's count includes all extended
fields with an update date of that month or earlier.
taxonomy value that has been assigned to an extended field, accompanied
by a list of the raw values that have been matched to that value (separated
by '|').
each raw value that was submitted as a Processing_Type_Facility_Type
and wasn't matched to a taxonomy value, as well as a count of the number
of times that raw value was submitted.
was submitted for a product_type extended field, as well as a count of
the number of times that raw value was submitted.
Connects #1767
Demo
Notes
The issue requested one report with "Report that lists all taxonomy values for product_type, processing_type_facility_type, and the raw values that were fuzzy matched to these values",
but
product_type
doesn't use any fuzzy matching. It could be that this was meant to beprocessing_type
(which is already included under the
processing_type_facility_type
report), but because theOAR team has also mentioned wanting to evaluate the
product_type
values being submitted,I included the
product_type
values as a separate report.Testing Instructions
./scripts/server
Checklist
fixup!
commits have been squashed