Refactor of Image Matching Algorithm #29

clarakosi · 2021-09-01T17:12:03Z

Changes:

Adds a Spark UDF
Modifies the schema of the top_candidates column so that null image suggestions are represented as an empty array
Saves the output as Parquet in HDFS

Does not enable cluster mode in Spark, because that does not appear to be possible with Jupyter notebooks.


The output is now slightly different: an array of JSON rather than an array of objects. This change will need to be accounted for in the ETL pipeline.
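To illustrate what the ETL pipeline might need to account for, here is a minimal sketch. It assumes that "an array of json" means each element of top_candidates is a JSON-encoded string. The helper name `parse_candidates` and the keys `image` and `rating` are hypothetical placeholders, not taken from this PR.

```python
import json
from typing import Any, Dict, List, Optional


def parse_candidates(top_candidates: Optional[List[str]]) -> List[Dict[str, Any]]:
    """Decode a list of JSON strings into a list of dicts.

    Assumption: each element is a JSON-encoded object. A missing (None)
    value is treated like an empty array, in line with the schema change
    described in this PR.
    """
    if not top_candidates:
        return []
    return [json.loads(candidate) for candidate in top_candidates]


# Hypothetical row value; the keys are placeholders, not the real schema.
row = ['{"image": "Example.jpg", "rating": 3.0}']
candidates = parse_candidates(row)
print(candidates[0]["image"])
```

A consumer that previously read struct fields directly would, under this assumption, need a decoding step like this before accessing individual fields.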

gmodena

LGTM!

I have a small comment re schema that can be addressed once we move those bits of code out of this repo.

The notebook & udf refactoring look great! There are a couple of things that will need changes once we convert it to a script, but we can discuss those separately.

gmodena · 2021-11-11T22:30:25Z

etl/schema.py

    schema = (
        StructType()
-        .add("pandas_idx", StringType(), True)


Nice that we got rid of this!

FYI: these changes to spark & hql code will need to be incorporated into https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/tree/multi-project-dags-repo/image-matching

I'm actually not going to merge this PR into this repo, but will keep this repo consistent with version one of the algorithm. I'll make a PR on your GitLab repo.

gmodena · 2021-11-11T22:33:39Z

etl/transform.py

@@ -62,11 +62,11 @@ def __init__(self, dataFrame: DataFrame):

    def transform(self) -> DataFrame:
        with_recommendations = (
-            self.dataFrame.where(~F.col("top_candidates").isNull())
+            self.dataFrame.where(F.size(F.col("top_candidates")) > 0)


Here you removed a null check, but the top_candidates schema still says the column could be nullable.
I think your approach here is the correct one, but could we make the schema consistent?
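For context, the difference between the two filters in the diff can be modelled in plain Python. This is only a sketch of the intended row-level semantics, not Spark code: `None` stands in for a SQL null, and the names `old_filter` and `new_filter` are hypothetical. Spark's `size` of a null array should evaluate to either -1 or null depending on configuration, so the `> 0` comparison would not keep such a row in either case.

```python
from typing import List, Optional

Candidates = Optional[List[str]]


def old_filter(top_candidates: Candidates) -> bool:
    # Models `~F.col("top_candidates").isNull()`: keeps empty arrays too.
    return top_candidates is not None


def new_filter(top_candidates: Candidates) -> bool:
    # Models `F.size(F.col("top_candidates")) > 0`: drops nulls and empty arrays.
    return top_candidates is not None and len(top_candidates) > 0


rows = [None, [], ["Example.jpg"]]
print([old_filter(r) for r in rows])  # [False, True, True]
print([new_filter(r) for r in rows])  # [False, False, True]
```

Since the new filter already excludes nulls, one way to make the schema consistent would be to declare the column as non-nullable, assuming nulls are always written as empty arrays upstream.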

Initial draft of refactoring efforts

a622448

Changes: * Adds spark udf * Modifies schema for top_candidates column to now view null image suggestions as an empty array * Saves output as parquet in hdfs

clarakosi force-pushed the refactoring branch from 3053b67 to 9d5d259 Compare September 29, 2021 22:50

Add algorithm_v3 which uses pandas udf instead of spark udf

7259a13

Output is slightly modified now from an array of objects to that of an array of json. This change will need to be accounted for in the ETL pipeline

clarakosi force-pushed the refactoring branch from 9d5d259 to 7259a13 Compare September 29, 2021 22:50

Update pipeline and tests to work with algorithm_v2

ceb3db0

clarakosi changed the title ~~Initial draft of refactoring efforts~~ Refactor of Image Matching Algorithm Nov 9, 2021

clarakosi requested a review from gmodena November 9, 2021 14:23

gmodena approved these changes Nov 11, 2021


clarakosi closed this Nov 12, 2021


clarakosi commented Sep 1, 2021

gmodena left a comment

gmodena Nov 11, 2021

clarakosi Nov 12, 2021

gmodena Nov 11, 2021

clarakosi Nov 12, 2021
