OSM-POI: Include brand property [DRAFT] #69
base: master
The result is looking good! 👍🏽 Just added a few naming and coding convention comments.
@IritaSee as discussed, here's some more background and assistance for task 3:
To add a new column to the PySpark data frame, use
The UDF itself is applied to each row. So when you pass the
Since the UDF is executed many times, we don't want to, and can't, read the CSV with the brand names a thousand times within the UDF. That's what 'broadcasts' in PySpark are for. They wrap a serializable value and make it available from inside the UDF. This is an example of how to use a broadcast from inside a UDF:
To make the main function more readable, you can create a static function that reads the CSV with the brand names, creates a broadcast from it, adds the two columns
The static function would be applied after the way, node, and relation data frames have been joined. It could look something like this:
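The broadcast pattern described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: the function names `best_match` and `add_matched_columns`, the `cutoff` of 0.8, and passing the names as an in-memory list instead of reading them from the CSV are all hypothetical. The standard library's `difflib` stands in for fuzzywuzzy so that the matcher runs without extra dependencies; with fuzzywuzzy, `process.extractOne` would take its place. The input columns `brand` and `operator` and the output columns `brand_matched` and `operator_matched` follow the PR description.

```python
from difflib import get_close_matches


def best_match(value, names, cutoff=0.8):
    """Return the entry of `names` closest to `value`, or None."""
    if not value:
        return None
    # Compare case-insensitively but return the original spelling.
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def add_matched_columns(df, spark, names):
    """Add brand_matched and operator_matched to a PySpark data frame."""
    # Imported here so best_match stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Ship the name list to the executors once instead of once per row.
    names_broadcast = spark.sparkContext.broadcast(list(names))

    @F.udf(returnType=StringType())
    def match_udf(value):
        # .value unwraps the broadcast inside the UDF.
        return best_match(value, names_broadcast.value)

    return (
        df.withColumn("brand_matched", match_udf(F.col("brand")))
        .withColumn("operator_matched", match_udf(F.col("operator")))
    )
```

Keeping the matching logic in a plain function means it can be unit-tested without a Spark session, while the UDF stays a thin wrapper around the broadcast value.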
From looking at the code, it seems basically ready. What is still missing, @IritaSee?
I don't feel ready with the PR yet because of a testing issue... I tested everything, but I feel the branch did not run smoothly... I think we need to discuss this, but not for too long :") @mattigrthr
# Conflicts:
#   kuwala/common/python_utils/src/FileSelector.py
#   kuwala/common/python_utils/src/spark_udfs.py
#   kuwala/core/database/importer/src/population_density_importer.py
#   kuwala/pipelines/osm-poi/src/Downloader.py
#   kuwala/pipelines/osm-poi/src/Processor.py
#   kuwala/pipelines/osm-poi/src/main.py
This PR is currently on hold. See: #25 (comment)
In this PR, I want to solve issue #25 by creating a CSV file that lists the known brand names and supports the brand property matching process. This PR includes:
Compile a list of all brand names and operator names from the name-suggestion-index as a CSV with the columns id, display_name, and wiki_data in the tmp folder.
Deadline: 05.01.2022
Create a PySpark UDF similar to the ones in osm-poi/src/Processor.py that takes a string and returns the best match against the name-suggestion-index using fuzzywuzzy.
Deadline: 06.01.2022
Apply that function to the brand and operator columns, which creates two new columns, brand_matched and operator_matched.
Deadline: 07.01.2022
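As a rough illustration of the first task, the CSV could be produced along these lines. This is a sketch under assumptions rather than the PR's implementation: the function name `write_names_csv` is hypothetical, and the item shape (an `id`, a `displayName`, and a `tags` dict carrying a `brand:wikidata` or `operator:wikidata` entry) is an assumption about the name-suggestion-index JSON that should be checked against the actual files. The sample record is made up.

```python
import csv
import io


def write_names_csv(items, out):
    """Write id, display_name, and wiki_data rows for NSI-like items."""
    writer = csv.writer(out)
    writer.writerow(["id", "display_name", "wiki_data"])
    for item in items:
        tags = item.get("tags", {})
        # Assumption: the Wikidata ID sits in one of these two tags.
        wiki_data = tags.get("brand:wikidata") or tags.get("operator:wikidata") or ""
        writer.writerow([item["id"], item["displayName"], wiki_data])


# Made-up records in the assumed shape, not real index entries.
sample = [
    {
        "id": "examplebrand-abc123",
        "displayName": "Example Brand",
        "tags": {"brand": "Example Brand", "brand:wikidata": "Q000000"},
    },
    {"id": "nowiki-def456", "displayName": "No Wiki", "tags": {}},
]

buffer = io.StringIO()
write_names_csv(sample, buffer)
print(buffer.getvalue())
```

Writing to any file-like object keeps the function easy to test; in the pipeline, `out` would be a file opened in the tmp folder.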