Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor of Image Matching Algorithm #29

Closed
wants to merge 3 commits into from

Conversation

clarakosi
Copy link
Collaborator

Changes:

  • Adds spark udf
  • Modifies schema for top_candidates column to now view null
    image suggestions as an empty array
  • Saves output as parquet in hdfs

Does not enable cluster mode in spark because it does not appear to be possible with jupyter notebooks

Changes:
* Adds spark udf
* Modifies schema for top_candidates column to now view null
image suggestions as an empty array
* Saves output as parquet in hdfs
Output is slightly modified now from an array of objects to that of
an array of json. This change will need to be accounted for in the
ETL pipeline
@clarakosi clarakosi changed the title Initial draft of refactoring efforts Refactor of Image Matching Algorithm Nov 9, 2021
Copy link
Collaborator

@gmodena gmodena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I have a small comment re schema, that can be adressed once we move those bits of code out of this repo.

The notebook & udf refactoring look great! There's a couple of things that will need changes once we convert it to a script, but we can discuss those separarely.

schema = (
StructType()
.add("pandas_idx", StringType(), True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that we got rid of this!

FYI: these changes to spark & hql code will need to be incorporated into https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/tree/multi-project-dags-repo/image-matching

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not going to merge this PR into this repo but keep this one consistent with version one of the algorithm. I'll make PR on your GitLab repo

@@ -62,11 +62,11 @@ def __init__(self, dataFrame: DataFrame):

def transform(self) -> DataFrame:
with_recommendations = (
self.dataFrame.where(~F.col("top_candidates").isNull())
self.dataFrame.where(F.size(F.col("top_candidates")) > 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you removed a null check, but top_candidates schema says the column could be nullable.
I think your approach here is the correct one, but could we make the schema consistent?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@clarakosi clarakosi closed this Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants