OSM-POI: Include brand property [DRAFT] #69

IritaSee · 2022-01-03T09:11:26Z

In this PR, I want to solve issue #25 by creating a CSV file that lists the brand names and supports the brand property matching process. This PR includes:

1. Compile a list of all brand names and operator names from the name-suggestion-index as a CSV with the columns id, display_name, and wiki_data in the tmp folder.
   Deadline: 05.01.2022
2. Create a PySpark UDF similar to the ones in osm-poi/src/Processor.py that takes a string and returns the best match against the name-suggestion-index using fuzzywuzzy.
   Deadline: 06.01.2022
3. Apply that function to the brand and operator columns, which creates two new columns: brand_matched and operator_matched.
   Deadline: 07.01.2022
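
The first task (compiling the CSV) could be sketched roughly as follows. This assumes each category file in the name-suggestion-index parses to a dict with an `items` list whose entries carry `id`, `displayName`, and `tags`, with the Wikidata ID stored under `brand:wikidata` or `operator:wikidata`. That layout and the helper names `collect_names` and `write_names_csv` are assumptions for illustration, not code from this PR.

```python
import csv
import io


def collect_names(category, kind):
    """Turn one parsed category file into CSV rows.

    `category` is the parsed JSON of one file (assumed layout, see above).
    `kind` is 'brand' or 'operator' and selects the wikidata tag to read.
    """
    rows = []
    for item in category.get('items', []):
        tags = item.get('tags', {})
        rows.append({
            'id': item.get('id'),
            'display_name': item.get('displayName'),
            'wiki_data': tags.get(f'{kind}:wikidata'),  # None if the tag is missing
        })
    return rows


def write_names_csv(rows, file_obj):
    """Write the rows sorted by id, with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=['id', 'display_name', 'wiki_data'])
    writer.writeheader()
    writer.writerows(sorted(rows, key=lambda row: row['id'] or ''))


# Usage with a tiny in-memory sample instead of the real index files.
sample = {'items': [{'id': 'b-1', 'displayName': 'B', 'tags': {'brand:wikidata': 'Q2'}}]}
buffer = io.StringIO()
write_names_csv(collect_names(sample, 'brand'), buffer)
```

Missing Wikidata IDs are written as empty cells, since `csv.DictWriter` renders `None` as an empty string.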

mattigrthr

The result is looking good! 👍🏽 Just added a few naming and coding convention comments.

kuwala/pipelines/osm-poi/src/Downloader.py

mattigrthr · 2022-01-11T10:28:19Z

@IritaSee as discussed, here's some more background and assistance for task 3:

To add a new column to the PySpark data frame, use withColumn().
For example, df_pois = df_pois.withColumn('brand_matched', match_brand_name(col('brand'))) would add a new column called 'brand_matched' to df_pois, containing the value returned by the PySpark UDF match_brand_name.

The UDF itself is applied to each row. So when you pass col('brand') to the UDF, it works on each value individually (e.g., if there are 1,000 rows in that data frame, the UDF is executed 1,000 times). It effectively works on one string, e.g., 'McDonalds', and simply returns the best match from the brand name list.

Since the UDF is executed many times, we don't want to (and can't) read the CSV with the brand names a thousand times within the UDF. That's what 'broadcasts' in PySpark are for. They wrap a serializable value and make it available from inside the UDF. This is an example of how to use a broadcast from inside a UDF:
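
A minimal sketch of such a UDF, assuming the reference names are already loaded into a Python list. The helper names `best_match` and `make_match_udf` and the 0.8 threshold are hypothetical, and `difflib` from the standard library stands in for the fuzzy matcher (fuzzywuzzy/thefuzz) so the matching part runs without extra dependencies.

```python
from difflib import SequenceMatcher


def best_match(value, names, min_ratio=0.8):
    """Return the entry of `names` most similar to `value`, or None."""
    if not value:
        return None
    needle = value.lower()
    best_name, best_ratio = None, 0.0
    for name in names:
        ratio = SequenceMatcher(None, needle, name.lower()).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    # Reject weak matches instead of returning an unrelated name.
    return best_name if best_ratio >= min_ratio else None


def make_match_udf(spark, names):
    """Broadcast `names` once and return a UDF that reads the broadcast."""
    # Imported lazily so best_match stays usable without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    names_broadcast = spark.sparkContext.broadcast(list(names))

    @udf(returnType=StringType())
    def match_brand_name(value):
        # .value gives the executor-local copy of the broadcast list.
        return best_match(value, names_broadcast.value)

    return match_brand_name
```

With a running SparkSession, this would be used as match_brand_name = make_match_udf(spark, names), followed by the withColumn() call shown above. The list is shipped to each executor once, while the UDF still runs once per row.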

To make the main function more readable, you can create a static function that reads the CSV with the brand names, creates a broadcast from it, adds the two columns brand_matched and operator_matched by applying the UDF and returns the result.

The static function would be applied after the way, node, and relation data frames have been joined. It could look something like this: df_pois = Processor.match_brand_and_operator_names(df_pois)
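
One possible shape for that static function, as a sketch only. The default path 'tmp/names.csv', the helper `load_reference_names`, and the `match_fn` parameter are assumptions for illustration. `exact_match` is a deliberately simple placeholder for the fuzzy matcher, so the sketch can focus on the wiring: read the CSV, broadcast it, add both columns, return the result.

```python
import csv


def exact_match(value, names):
    """Placeholder matcher: case-insensitive exact lookup, None if not found."""
    if not value:
        return None
    for name in names:
        if name.lower() == value.lower():
            return name
    return None


class Processor:
    @staticmethod
    def load_reference_names(csv_path):
        """Read the display_name column of the reference CSV, skipping empty cells."""
        with open(csv_path, newline='', encoding='utf-8') as csv_file:
            return [row['display_name'] for row in csv.DictReader(csv_file)
                    if row['display_name']]

    @staticmethod
    def match_brand_and_operator_names(df_pois, csv_path='tmp/names.csv',
                                       match_fn=exact_match):
        # Imported lazily so the CSV helper works without a Spark installation.
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, udf
        from pyspark.sql.types import StringType

        spark = SparkSession.builder.getOrCreate()
        names = spark.sparkContext.broadcast(Processor.load_reference_names(csv_path))

        @udf(returnType=StringType())
        def match_name(value):
            return match_fn(value, names.value)

        return (df_pois
                .withColumn('brand_matched', match_name(col('brand')))
                .withColumn('operator_matched', match_name(col('operator'))))
```

Keeping the matcher as a parameter means the main function stays a single call, and the matching strategy can be swapped without touching the Spark code.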

mattigrthr

From looking at the code, it seems basically ready. What is still missing, @IritaSee?

kuwala/pipelines/osm-poi/src/Processor.py

kuwala/scripts/initialize_windows.sh

IritaSee · 2022-01-27T07:12:21Z

> From looking at the code, it seems basically ready. What is still missing, @IritaSee?

I don't feel the PR is ready yet because of a testing issue... I tested everything, but the branch did not run smoothly... I think we need to discuss this, but it shouldn't take long :") @mattigrthr

# Conflicts:
#   kuwala/common/python_utils/src/FileSelector.py
#   kuwala/common/python_utils/src/spark_udfs.py
#   kuwala/core/database/importer/src/population_density_importer.py
#   kuwala/pipelines/osm-poi/src/Downloader.py
#   kuwala/pipelines/osm-poi/src/Processor.py
#   kuwala/pipelines/osm-poi/src/main.py


mattigrthr · 2022-02-03T13:46:31Z

This PR is currently on hold. See: #25 (comment)

After doing some initial tests with the brand and operator name matching, it turns out that including the matching directly in the OSM-POI pipeline would increase the runtime significantly. Therefore, we have decided to store the consolidated list of brand and operator names in a separate table in Postgres. That table can then be used later in transformation blocks, e.g., on a filtered set of POIs, which drastically reduces the runtime.

Since the canvas development currently has a higher priority for the core team, this issue is up for grabs again.

add name-suggestion-index download

86ef2c5

IritaSee requested a review from mattigrthr January 3, 2022 09:11

IritaSee added the enhancement, good first issue, and pipeline/osm-poi labels Jan 3, 2022

IritaSee linked an issue Jan 3, 2022 that may be closed by this pull request

OSM-POI: Include brand property #25

Open

IritaSee added 4 commits January 3, 2022 16:18

add repo checking

1c62f71

[undone] interating over json

b74d373

add brand_name_downloader

d6fb766

add name downloader to main, add operator naming

ac14e90

mattigrthr reviewed Jan 5, 2022


IritaSee added 14 commits January 6, 2022 15:32

rename to more suitable function names

f5d401f

add staticmethod for download_names

600c3ef

add staticmethod to download_names

017b99d

fix variable names to add context

96ae427

fix variable typo

69ac660

add debug venv folder to ignore

d2c2460

change - to None as default value

f173ec5

add operator:wikidata

2ff61be

add func to match brands and operators, then add to spark

3c93476

fix algorithm

2f3896f

remove unused code

a572b30

update fuzzywuzzy to thefuzz in osm-poi related

d2467a7

fix nan processing

e74f64b

fix: change search to brand and operator

1290708

mattigrthr mentioned this pull request Jan 11, 2022

OSM-POI: Include brand property #25

Open

IritaSee added 2 commits January 12, 2022 06:17

recreate matching function

6230a6b

apply withcolumn in main matching function

6519b76

IritaSee added 7 commits January 15, 2022 09:54

Merge branch 'master' into feature/include-brand-property

32e767f

add brand_matched operator_matched name_matched

848f984

fix run_cli convert add extra cd

d101e2d

readjust dowloader to new temp folder

724ac03

add default statement

c2bb65c

update temp dir

898945b

add downloading message

f08ea40

mattigrthr reviewed Jan 26, 2022


kuwala/pipelines/osm-poi/src/Processor.py (outdated, resolved)

kuwala/scripts/initialize_windows.sh (outdated, resolved)

add empty as return

1c153fb

IritaSee added 6 commits January 27, 2022 15:20

fix missleadnig var name

1c4b9dc

revert irrelevant change to this branch

2489e94

code cleanup

c01806f

delete reference repo, rename reference file

a149f1b

add id sorting for reference file, change print to log

f4e651b

remove reference repo, ignore reference file

fae5761

IritaSee marked this pull request as ready for review January 27, 2022 14:27

IritaSee requested a review from mattigrthr January 27, 2022 14:27

Matti Grotheer added 2 commits February 2, 2022 17:34

Resolve formatting and linting errors; Remove name matching UDF from …

0661a62

…OSM pipeline

mattigrthr marked this pull request as draft February 3, 2022 13:47

mattigrthr requested review from mattigrthr and removed request for mattigrthr February 3, 2022 13:48

mattigrthr unassigned IritaSee Feb 3, 2022

IritaSee changed the title ~~OSM-POI: including brand property~~ OSM-POI: Include brand property [DRAFT] Mar 2, 2022

arifluthfi16 closed this May 3, 2022

arifluthfi16 deleted the feature/include-brand-property branch May 3, 2022 08:59

arifluthfi16 restored the feature/include-brand-property branch May 3, 2022 11:32

arifluthfi16 reopened this May 3, 2022
