Skip to content
This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Add extended field reports #1826

Merged
merged 2 commits into from May 6, 2022
Merged

Add extended field reports #1826

merged 2 commits into from May 6, 2022

Conversation

TaiWilkin
Copy link
Contributor

@TaiWilkin TaiWilkin commented Apr 29, 2022

Overview

Adds new reports for extended fields:

  • Contributors With Extended Fields Cumulative: This report lists
    the number of contributors who have submitted each field type. It is
    organized into months, where each month's count includes all extended
    fields with an update date of that month or earlier.
  • Facilities With Extended Fields Cumulative: This report lists
    the number of facilities that have each field type. It is
    organized into months, where each month's count includes all extended
    fields with an update date of that month or earlier.
  • Processing Type Facility Type Matched Values: This report lists each
    taxonomy value that has been assigned to an extended field, accompanied
    by a list of the raw values that have been matched to that value (separated
    by '|').
  • Processing Type Facility Type Unmatched Values: This report lists
    each raw value that was submitted as a Processing_Type_Facility_Type
    and wasn't matched to a taxonomy value, as well as a count of the number
    of times that raw value was submitted.
  • Submitted Product Type Values: This report lists each raw value that
    was submitted for a product_type extended field, as well as a count of
    the number of times that raw value was submitted.

Connects #1767

Demo

Screen Shot 2022-04-29 at 2 43 15 PM

Screen Shot 2022-04-29 at 2 43 24 PM

Screen Shot 2022-04-29 at 2 43 35 PM

Screen Shot 2022-04-29 at 2 43 45 PM

Screen Shot 2022-04-29 at 2 43 58 PM

Notes

The issue requested one report with "Report that lists all taxonomy values for product_type, processing_type_facility_type, and the raw values that were fuzzy matched to these values",
but product_type doesn't use any fuzzy matching. It could be that this was meant to be processing_type
(which is already included under the processing_type_facility_type report), but because the
OAR team has also mentioned wanting to evaluate the product_type values being submitted,
I included the product_type values as a separate report.

Testing Instructions

Checklist

  • fixup! commits have been squashed
  • CI passes after rebase
  • CHANGELOG.md updated with summary of features or fixes, following Keep a Changelog guidelines

Copy link
Contributor

@jwalgran jwalgran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in these Python reports was easy for me to follow. I ran into a few issues in testing.

src/django/api/reports.py Outdated Show resolved Hide resolved
src/django/api/reports.py Outdated Show resolved Hide resolved
src/django/api/reports.py Outdated Show resolved Hide resolved
src/django/api/reports.py Outdated Show resolved Hide resolved
src/django/api/reports.py Show resolved Hide resolved
'value__raw_values', 'value__matched_values'
).values_list('value__raw_values', 'value__matched_values').iterator():
for i, raw_value in enumerate(raw_values.split('|')):
if matched_values[i][3] is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing this on staging there was an index out of range exception on this line

https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1549/

I pushed a commit to to the test branch on staging to get more information

diff --git a/src/django/api/reports.py b/src/django/api/reports.py
index c0bc97a6..dfab169b 100644
--- a/src/django/api/reports.py
+++ b/src/django/api/reports.py
@@ -143,9 +143,16 @@ def processing_type_facility_type_matched():
         'value__raw_values', 'value__matched_values'
     ).values_list('value__raw_values', 'value__matched_values').iterator():
         for i, raw_value in enumerate(raw_values.split('|')):
-            if matched_values[i][3] is not None:
-                temp_data[matched_values[i][3]].add(raw_value.strip())
-                temp_data[matched_values[i][2]].add(raw_value.strip())
+            try:
+                if matched_values[i][3] is not None:
+                    temp_data[matched_values[i][3]].add(raw_value.strip())
+                    temp_data[matched_values[i][2]].add(raw_value.strip())
+            except IndexError:
+                raise Exception('matched_values has unexpected shape.\n'
+                                'i = {}\n'
+                                'raw_value = {}\n'
+                                'matched_values = {}'.format(
+                                    i, raw_value, matched_values))
 
     data = dict()
     for (raw_value, match_values) in temp_data.items():
@@ -164,8 +171,15 @@ def processing_type_facility_type_unmatched():
         'value__raw_values', 'value__matched_values'
     ).values_list('value__raw_values', 'value__matched_values').iterator():
         for i, raw_value in enumerate(raw_values.split('|')):
-            if matched_values[i][3] is None:
-                data[raw_value.strip()] = data[raw_value.strip()] + 1
+            try:
+                if matched_values[i][3] is None:
+                    data[raw_value.strip()] = data[raw_value.strip()] + 1
+            except IndexError:
+                raise Exception('matched_values has unexpected shape.\n'
+                                'i = {}\n'
+                                'raw_value = {}\n'
+                                'matched_values = {}'.format(
+                                    i, raw_value, matched_values))
 
     rows = sort_by_first_column(data.items())
     return [['processing_type_facility_type', 'times_submitted'], rows]

Here is the additional details showed up in Roolbar

https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1551/

Exception:

matched_values has unexpected shape.
i = 5
raw_value = Review
matched_values = [['PROCESSING_TYPE', 'EXACT', 'Printing, Product Dyeing and Laundering', 'Finishing'], [None, None, None, None], ['PROCESSING_TYPE', 'EXACT', 'Final Product Assembly', 'Sewing'], ['PROCESSING_TYPE', 'EXACT', 'Warehousing / Distribution', 'Packing'], [None, None, None, None]]

(Most recent call last)

There are only 5 values in matched_values, so there points to there being record in the database where there are more raw_values than matched_values, which is unexpected.

Looking at the staging database with

select value->'raw_values', value->'matched_values'
  from api_extendedfield
 where field_name = 'processing_type';

I see that value-'raw_values' contains a mix of

  • Single strings ("Manufacturing")
  • Arrays (["cutting", "packing", "sewing"])
  • Pipe-delimited strings ("Finishing|Electronic Embroidery|Sewing|Printing|Weaving")

On reflection this makes sense. API users can send literal arrays, but list uploaders will submit single strings (perhaps containing pipes)

Copy link
Contributor Author

@TaiWilkin TaiWilkin May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After adding some single string and array values to the facility_type and processing_type values in the database, I was able to replicate these errors locally. I updated the report to process the raw_values in the same way as the extended fields script here. Noting the removal of duplicate values there, I tested submitting a facility_type with repeated values and confirmed that this caused an indexing error. Using the same process of processing and deduplication as the extended_fields processing in report resulted in the duplicate and variously formatted values being handled correctly without error.

updated_at__year__lte=month.year,
updated_at__month__lte=month.month)

for field_name, _ in ExtendedField.FIELD_CHOICES:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem introduced in this PR, but adding this use of FIELD_CHOICES exposed the fact that facility type and processing type are repeated . We will want to remove them, which will likely require creating a migration since is "changes" the choices attached to a model field.

166006960-ad98c044-0077-4889-b8cf-7c31f04cc1a9

https://github.com/open-apparel-registry/open-apparel-registry/blame/a070e8be9dce56112de4ebb757f477c374b0e595/src/django/api/models.py#L2387-L2398

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a model change and a migration in 2ae5f45. I checked my local data after migrating to confirm that removing two of the values didn't have any impact on my processing_type and facility_type fields, just in case, and confirmed there were no empty values for the field_name.

@TaiWilkin
Copy link
Contributor Author

Thanks for the thorough review and for testing on staging. This is ready for another look.

Copy link
Contributor

@jwalgran jwalgran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a crash when testing on staging. My research into this is included below but the short version is that there are 4 processing type extended fields created from claims have a mismatch between the count of raw and matched values. I propose working around this by reporting the error to Rollbar rather than crash and open a follow up issue to resolve it.

https://rollbar.com/OpenApparelRegistry/OpenApparelRegistry/items/1549/

Using Rollbar to look at the local values at the point of the exception there are 7 deduped_values and 6 matched_values. A

[
"Denim Services",
"Fabric Dyeing",
"Pleating",
"Printing",
"Sewing",
"Spinning",
"Trimming/Embellishment"
]
[
[
"PROCESSING_TYPE",
"FUZZY",
"Printing, Product Dyeing and Laundering",
"Dyeing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Final Product Assembly",
"Pleating"
],
[
"PROCESSING_TYPE",
"EXACT",
"Printing, Product Dyeing and Laundering",
"Printing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Final Product Assembly",
"Sewing"
],
[
"PROCESSING_TYPE",
"EXACT",
"Raw Material Processing or Production",
"Spinning"
],
[
"PROCESSING_TYPE",
"FUZZY",
"Textile or Material Production",
"Embellishment"
]
]

This should not happen. The unmatched "Denim Services" should result in a [None, None, None, None] match row. I tested this by POSTing a facility to the API with

"facility_type_processing_type": [
"Denim Services",
"Fabric Dyeing",
"Pleating",
"Printing",
"Sewing",
"Spinning",
"Trimming/Embellishment"
]
And it did create the None item in matched_values

I ran this query on staging to try and get more information

select * from (
  select facility_id, facility_claim_id, facility_list_item_id,
         CASE WHEN jsonb_typeof(value->'raw_values') = 'array'
           THEN jsonb_array_length(value->'raw_values')
         ELSE -1
         END as raw_count,
         jsonb_array_length(value->'matched_values') as matched_count
    from api_extendedfield
   where field_name = 'processing_type'
  ) s
where raw_count > 0 and raw_count != matched_count;

There were 4 extended fields created from claims that had a count mismatch between raw and matched

openapparelregistry-> where raw_count > 0 and raw_count != matched_count;
   facility_id   | facility_claim_id | facility_list_item_id | raw_count | matched_count
-----------------+-------------------+-----------------------+-----------+---------------
 IT2019311JZR7ZX |                56 |                       |         7 |             6
 CN201928823FEJ3 |                53 |                       |        12 |            11
 PK2021341XSCC8Y |               333 |                       |        19 |            17
 IN202108462CDJV |               350 |                       |        24 |            22

When I ran the query against production I got the same 4 rows.

In staging there are 66 other processing type extended field rows from claims with matching counts.

Here is a partial WIP commit that I added to my staging test branch that adds a try/catch with reporting to Rollbar. This fixes the report on staging and logs 4 errors to Rollbar, which reinforces the finding that the report code handles the other cases properly.

1352f2b

Screen Shot 2022-05-04 at 12 06 41 AM

@TaiWilkin
Copy link
Contributor Author

TaiWilkin commented May 4, 2022

@jwalgran I'm pretty sure I know exactly where this error is coming from: this PR #1677, when we dropped Denim Services and several other values. We decided at the time to treat them as invalid values vs set them to None, but that's causing issues with the logic here.
I'd like to suggest we filter those values out here as well and don't consider them. Then in addition I think we also should still add the try/catch so if there are future misalignments we are alerted but don't crash.

Copy link
Contributor

@jwalgran jwalgran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking up and referencing that previous PR. Yes. I agree with your plan to filter out these values but also use the try/catch.

Comment on lines 204 to 205
for field_name, _ in ExtendedField.FIELD_CHOICES:
columns.append(field_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OAR team request that we add a "(claim)" suffix to the country, name and address column headers to clearly indicate that they come from the claim process, bot facility data submission.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup question: We had added country as one of the initial fields, but I don't believe that the claim form actually includes country as a value for the facility (only for the Office Information section). Do we want to remove this as well as the two duplicate values, since it's not actually a possible field type at the moment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Having the country column filled with all zeros all the time just adds noise to the report.

@TaiWilkin TaiWilkin requested a review from jwalgran May 5, 2022 12:05
@TaiWilkin
Copy link
Contributor Author

TaiWilkin commented May 5, 2022

Updated processing_type reports (I only have this month's data in my system right now):
Screen Shot 2022-05-05 at 8 04 53 AM
Screen Shot 2022-05-05 at 8 04 43 AM
This is ready for another look.

Copy link
Contributor

@jwalgran jwalgran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and error free on staging! Thanks for working through the revisions with me.

@jwalgran jwalgran removed their assignment May 5, 2022
- Contributors With Extended Fields Cumulative: This report lists
the number of contributors who have submitted each field type. It is
organized into months, where each month's count includes all extended
fields with an update date of that month or earlier.
- Facilities With Extended Fields Cumulative: This report lists
the number of facilities that have each field type. It is
organized into months, where each month's count includes all extended
fields with an update date of that month or earlier.
- Processing Type Facility Type Matched Values: This report lists each
taxonomy value that has been assigned to an extended field, accompanied
by a list of the raw values that have been matched to that value.
- Processing Type Facility Type Unmatched Values: This report lists
each raw value that was submitted as a Processing_Type_Facility_Type
and wasn't matched to a taxonomy value, as well as a count of the number
of times that raw value was submitted.
- Submitted Product Type Values: This report lists each raw value that
was submitted for a product_type extended field, as well as a count of
the number of times that raw value was submitted.
Two of the choices for ExtendedField.field_name were mistakenly
duplicated (facility_type and processing_type). Removes the duplicates
via migration and model change.
@TaiWilkin
Copy link
Contributor Author

Thank you for the debugging help and detailed feedback!

@TaiWilkin TaiWilkin merged commit 810947d into develop May 6, 2022
@TaiWilkin TaiWilkin deleted the tw/ep-reports branch May 6, 2022 12:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants