
matchUrls #7133

Closed
Cj-Malone opened this issue Oct 3, 2022 · 15 comments · Fixed by #8312
Labels
enhancement Actionable - add an enhancement to the source code question Not Actionable - just a question about something

Comments

@Cj-Malone
Collaborator

I've discussed this before, but I don't think I have here.

It would be nice if NSI stored the valid URLs for features. For example, Bank of Scotland could get something along the lines of "matchUrls": ["^https:\/\/branches\.bankofscotland\.co\.uk\/([-\w]+\/[-\/'\w]+)$"].

It would allow downstream data users like iD to perform validation on the website, and it would help All The Places match its output to NSI.
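A minimal sketch of how a downstream consumer might use such a field; the entry shape, the helper, and the branch path are hypothetical, only the Bank of Scotland pattern comes from the example above:

```python
import re

# Hypothetical shape for an NSI entry carrying the proposed matchUrls field
entry = {
    "displayName": "Bank of Scotland",
    "tags": {"brand:wikidata": "Q627381"},
    "matchUrls": [r"^https://branches\.bankofscotland\.co\.uk/([-\w]+/[-/'\w]+)$"],
}

def website_matches(entry, url):
    # True if the url fits any of the entry's matchUrls patterns
    return any(re.match(pattern, url) for pattern in entry.get("matchUrls", []))

# A plausible branch URL (path invented for illustration) matches...
assert website_matches(entry, "https://branches.bankofscotland.co.uk/edinburgh/the-mound")
# ...while an unrelated brand's URL does not
assert not website_matches(entry, "https://www.rbs.co.uk/Branch/160038-London%20Drummonds.html")
```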

@LaoshuBaby LaoshuBaby added the i dont know ¯\_(ツ)_/¯ label Oct 3, 2022
@LaoshuBaby
Collaborator

Maybe this is an OSM data import problem more than an NSI preset problem.

NSI currently doesn't collect that kind of data. Even setting aside the licensing questions around requesting a brand's website for detailed branch information, iD should remain a generic editor and can't develop a specific feature for validating by parsing webpages; its rules are prepared in advance.

@Cj-Malone
Collaborator Author

I'm not suggesting that iD should parse a webpage. I'm suggesting that it could validate that a POI has a website tag that matches the brand it has.

@Cj-Malone
Collaborator Author

So, to continue my example, it would tell us the following is incorrect.

brand:wikidata=Q627381 (Bank of Scotland)
website=https://www.rbs.co.uk/Branch/160038-London%20Drummonds.html

It either needs to be updated to brand:wikidata=Q160126 (Royal Bank of Scotland) or a valid Bank of Scotland website needs to be added.
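Sketched as a validator, this mismatch could be caught automatically. The MATCH_URLS lookup and the RBS pattern below are illustrative assumptions, not real NSI data; only the Bank of Scotland pattern appears earlier in this thread:

```python
import re

# Illustrative mapping from brand:wikidata QID to matchUrls patterns
MATCH_URLS = {
    "Q627381": [r"^https://branches\.bankofscotland\.co\.uk/[-\w]+/[-/'\w]+$"],  # Bank of Scotland
    "Q160126": [r"^https://www\.rbs\.co\.uk/Branch/.+$"],  # Royal Bank of Scotland (assumed pattern)
}

def website_warning(tags):
    # Return a warning when a POI's website doesn't fit its tagged brand
    patterns = MATCH_URLS.get(tags.get("brand:wikidata"), [])
    website = tags.get("website")
    if website and patterns and not any(re.match(p, website) for p in patterns):
        return "website does not match the tagged brand"
    return None

poi = {
    "brand:wikidata": "Q627381",  # Bank of Scotland
    "website": "https://www.rbs.co.uk/Branch/160038-London%20Drummonds.html",
}
assert website_warning(poi) is not None  # flagged: RBS URL on a Bank of Scotland POI
```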

@LaoshuBaby
Collaborator

LaoshuBaby commented Oct 3, 2022

I'm not suggesting that iD should parse a webpage. I'm suggesting that it could validate that a POI has a website tag that matches the brand it has.

Sorry, I misunderstood your previous suggestion; I wrongly thought you wanted to do something like
parse_branch_detail( request( url ) )

But if you want to match a brand, the current NSI can match it from name:*/brand:*, and if it gets a wikidata value, that is an excellent indicator. Those keys and tags are more common and more convenient than website, I think. In your example, brand:wikidata=Q627381 is enough.

And this would not only relate to NSI, but would also mean changing nsi.js in iD's repo, which may be much harder than only modifying this repo.


However, I'm not a core maintainer, only a PR merger; you can wait for other people's opinions.

@Cj-Malone
Collaborator Author

I'm not suggesting that iD should parse a webpage. I'm suggesting that it could validate that a POI has a website tag that matches the brand it has.

Sorry, I misunderstood your previous suggestion; I wrongly thought you wanted to do something like parse_branch_detail( request( url ) )

No worries. That's All The Places' domain, and I would definitely recommend everyone check it out.

But if you want to match a brand, the current NSI can match it from name:*/brand:*, and if it gets a wikidata value, that is an excellent indicator. Those keys and tags are more common and more convenient than website, I think. In your example, brand:wikidata=Q627381 is enough.

So in All The Places, we have the issue that we can't just match on brand:wikidata=Q627381, because that gives us both the ATM and the bank. It's not ideal, but we can work around it by explicitly setting it to look for brand:wikidata=Q627381 + amenity=atm. However, there are times when that doesn't work either, like Gap: there are several Gap formats, all with shop=clothes and the same wikidata. Having the website regex in NSI is just another metric for us. I think iD gets the most benefit from this.
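The ATM/bank ambiguity above can be illustrated like this; the entries and the helper are made up for the example, only the QID and the amenity=atm workaround come from the thread:

```python
# Made-up NSI-style entries sharing one wikidata QID, showing why the QID
# alone is ambiguous and why an extra tag (or a matchUrls regex) helps.
entries = [
    {"displayName": "Bank of Scotland", "tags": {"brand:wikidata": "Q627381", "amenity": "bank"}},
    {"displayName": "Bank of Scotland ATM", "tags": {"brand:wikidata": "Q627381", "amenity": "atm"}},
]

def candidates(qid, extra_tags=None):
    # Filter entries by QID, optionally narrowed by additional tags
    extra_tags = extra_tags or {}
    return [
        e for e in entries
        if e["tags"].get("brand:wikidata") == qid
        and all(e["tags"].get(k) == v for k, v in extra_tags.items())
    ]

assert len(candidates("Q627381")) == 2                      # ambiguous on QID alone
assert len(candidates("Q627381", {"amenity": "atm"})) == 1  # disambiguated
```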

And this would not only relate to NSI, but would also mean changing nsi.js in iD's repo, which may be much harder than only modifying this repo.

Yeah, they need the data first though.

However, I'm not a core maintainer, only a PR merger; you can wait for other people's opinions.

Hopefully @bhousel or another core maintainer will have input.

@bhousel
Member

bhousel commented Oct 5, 2022

Hopefully @bhousel or another core maintainer will have input.

Sorry to say I am confused by this issue too.

We don't actually check the brands' websites on this project, so I don't think NSI is a good place to store regex to validate those websites.

Doesn't AllThePlaces already have this - they are a collection of website scrapers, so they know better what all the websites are?

I like the idea of connecting NSI and ATP, but using the website regex seems a brittle way to connect the two projects.

@Cj-Malone
Collaborator Author

We don't actually check the brands' websites on this project,

I'm not asking it to, just to extend matchNames and matchTags with matchUrls. Nor do I think it should be a required field, just available to contributors and in turn available to data consumers.

Doesn't AllThePlaces already have this - they are a collection of website scrapers, so they know better what all the websites are?

We have a collection of them already, and are always collecting more. I'd be happy to do the work to format them for NSI and open a PR once I get the OK.

I like the idea of connecting NSI and ATP, but using the website regex seems a brittle way to connect the two projects.

For 90% of the ATP spiders there are 0 or 1 matches to NSI using the wikidata alone. This will help with some of the edge cases, and it will not help with others. But as I said, I don't think ATP is the only place this benefits.

@Cj-Malone Cj-Malone changed the title Website Regex matchUrls Oct 10, 2022
@UKChris-osm
Collaborator

Would the aim of matchURL be for the brand / operator entry to point ATP in the right place to start scraping data, rather than ATP creating a spider.py for each brand?

It seems like a reasonable way to link NSI / ATP, as we are aiming to help map physical locations, as is ATP as far as I can tell.

The only downside I see from this end is validation. Would the NSI scripts need to ping every matchURL to see if it gets a response?

@UKChris-osm UKChris-osm added question Not Actionable - just a question about something and removed i dont know ¯\_(ツ)_/¯ labels Oct 31, 2022
@Cj-Malone
Collaborator Author

Would the aim of matchURL be for the brand / operator entry to point ATP in the right place to start scraping data, rather than ATP creating a spider.py for each brand?

I have toyed with the idea of NSI being the source of spiders; some of our recent spiders are just config files, and matchUrls would make them entirely populatable from NSI. Unfortunately not all sites follow SEO specs.

The only downside I see from this end is validation. Would the NSI scripts need to ping every matchURL to see if it gets a response?

I don't think NSI needs to validate these URLs at all. Does NSI validate matchNames or matchTags in any automated way? This is ATP's domain; it's run weekly, and if we find a URL structure has changed, we can fix it in NSI.

We could kickstart matchUrls to some degree from ATP output, or even current OSM data, but it probably needs human review.

@Cj-Malone
Collaborator Author

@rjw62 you might have some input, since you've both got a collection of regexes for different brands and have recently been pointing out mistakes in the ATP regexes.

I'd really like for the whole OSM community to share one source of these. It avoids us all duplicating effort, it means everyone knows where to fix things and benefits from improvements or new data, and it means more people will be aware of the data and use it in ways we haven't thought of yet.

@UKChris-osm
Collaborator

Unfortunately not all sites follow SEO specs.

What aspect of Sainsbury's is causing an issue? Would https://stores.sainsburys.co.uk/sitemap.xml not be usable?

Generally speaking, I think including a URL reference for POI locations within the NSI would be a good idea.

@UKChris-osm UKChris-osm added the enhancement Actionable - add an enhancement to the source code label Dec 11, 2022
@Cj-Malone
Collaborator Author

What aspect of Sainsbury's is causing an issue? Would https://stores.sainsburys.co.uk/sitemap.xml not be usable?

Sorry, this is me vocalising it badly again. Sainsbury's would absolutely work with matchUrls. It'd be something like matchUrls = [ "^https://stores\.sainsburys\.co\.uk/(\d+)/[-\w]+$" ]
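A quick check that the pattern behaves as intended; the store number and slug below are invented for illustration, only the regex comes from the comment above:

```python
import re

pattern = r"^https://stores\.sainsburys\.co\.uk/(\d+)/[-\w]+$"

# An invented branch URL in the expected shape matches, and the store
# number is captured by the first group.
m = re.match(pattern, "https://stores.sainsburys.co.uk/1234/some-town-high-street")
assert m is not None and m.group(1) == "1234"

# The brand's main site should not match a branch-page pattern.
assert re.match(pattern, "https://www.sainsburys.co.uk/") is None
```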

I was rambling about a fantasy that one day we wouldn't need to write ATP spiders; ATP could just parse standard SEO data using NSI entries alone. Sainsbury's wouldn't work with that, right now at least, maybe one day when the standards are better adopted.

Greggs or Lidl wouldn't "work" as they don't have individual webpages for each store, although you could argue that something like matchUrls = [ "^$" ] would be right for them, and editors should warn users who set any website value.
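The "^$" idea works because that pattern matches only the empty string, so any website value set on such a brand could be flagged; a sketch, with a hypothetical helper name:

```python
import re

# "^$" matches only the empty string: a brand with no per-store pages
# should have no website tag at all on its POIs.
match_urls = [r"^$"]

def should_warn(website):
    # Warn on any non-empty website value that no pattern accepts
    return bool(website) and not any(re.match(p, website) for p in match_urls)

assert should_warn("https://example.com/store/1")  # any URL triggers a warning
assert not should_warn("")                         # no website set, nothing to flag
```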

Generally speaking, I think including a URL reference for POI locations within the NSI would be a good idea.

:)

@UKChris-osm
Collaborator

I just had a look at the Sainsbury's website and it seems the store pages are built with JavaScript on the fly. Is there no option for ATP to build a DOM first and then extract the data needed for each store?

Greggs seem to have a URL for each store, which is https://www.greggs.co.uk/shop-finder?shop-code=1259 where the code at the end is different for each store.

@Cj-Malone
Collaborator Author

Cj-Malone commented Dec 11, 2022

I just had a look at the Sainsbury's web site and it seems the store pages are built using JavaScript on the fly. Is there no option for ATP to build a DOM first then extract the data needed for each store?

We don't execute JS at the moment, but we get the Sainsbury's data via the API.

Greggs seem to have a URL for each store, which is https://www.greggs.co.uk/shop-finder?shop-code=1259 where the code at the end is different for each store.

Thank you; I didn't check Lidl either, I just recalled the two from memory, so it may have per-store URLs now too. I'll update ATP to publish the Greggs website, thank you. I don't think that URL existed when I wrote the spider 4 months ago, which is all the more reason to get these regexes into NSI where people are looking, and out of ATP where nobody is looking.

This is getting a bit off topic. What are the next steps? Could we just start adding matchUrls to NSI entries, or do we need to update the NSI code too?

@Cj-Malone
Collaborator Author

This did get quite off topic, but I'm always happy to talk about ATP if anyone has questions.

For the matchUrls idea, I got a "location URL match pattern" property on Wikidata. Since the above PR, NSI now caches it in wikidata.json :)

Thanks everyone.
