New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor of Image Matching Algorithm #29
Conversation
Changes: * Adds spark udf * Modifies schema for top_candidates column to now view null image suggestions as an empty array * Saves output as parquet in hdfs
3053b67
to
9d5d259
Compare
Output is slightly modified now from an array of objects to that of an array of json. This change will need to be accounted for in the ETL pipeline
9d5d259
to
7259a13
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I have a small comment re schema, that can be adressed once we move those bits of code out of this repo.
The notebook & udf refactoring look great! There's a couple of things that will need changes once we convert it to a script, but we can discuss those separarely.
schema = ( | ||
StructType() | ||
.add("pandas_idx", StringType(), True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that we got rid of this!
FYI: these changes to spark & hql code will need to be incorporated into https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/tree/multi-project-dags-repo/image-matching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not going to merge this PR into this repo but keep this one consistent with version one of the algorithm. I'll make PR on your GitLab repo
@@ -62,11 +62,11 @@ def __init__(self, dataFrame: DataFrame): | |||
|
|||
def transform(self) -> DataFrame: | |||
with_recommendations = ( | |||
self.dataFrame.where(~F.col("top_candidates").isNull()) | |||
self.dataFrame.where(F.size(F.col("top_candidates")) > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you removed a null check, but top_candidates
schema says the column could be nullable.
I think your approach here is the correct one, but could we make the schema consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
Changes:
image suggestions as an empty array
Does not enable cluster mode in spark because it does not appear to be possible with jupyter notebooks