matchUrls #7133
Comments
Maybe this is an OSM data import problem more than an NSI preset problem. NSI currently doesn't collect that kind of data. Even setting aside the licensing problems of requesting a brand's website for detailed branch information, iD should be a generic editor and can't develop a specific feature for validating via parsing webpages; its rules are prepared in advance.
I'm not suggesting that iD should parse a webpage. I'm suggesting that it could validate that a POI has a `website` value matching the brand's pattern.
So, to continue my example, it would tell us the following is incorrect: it either needs to be updated to a URL matching the pattern, or flagged for review.
Sorry, I misunderstood your previous suggestion; I wrongly thought you wanted to do something like that. But if you want to match a brand, the current NSI can already match it via names and tags. And this would not only involve NSI, but would also mean changing nsi.js in iD's repo, which may be much harder than only modifying this repo. However, I'm not a core maintainer, only a PR merger, so you may want to wait for other people's opinions.
No worries. That's the All The Places domain, and I would definitely recommend everyone check it out.
So in All The Places, we're about to have the issue that we can't just match on wikidata alone.
Yeah, they need the data first though.
Hopefully @bhousel or another core maintainer will have input.
Sorry to say I am confused by this issue too. We don't actually check the brands' websites on this project, so I don't think NSI is a good place to store regexes for validating websites. Doesn't AllThePlaces already have this? They are a collection of website scrapers, so they know better what all the websites are. I like the idea of connecting NSI and ATP, but using the website regex seems a brittle way to connect the two projects.
I'm not asking it to, just to extend matchNames and matchTags to matchUrls. Nor do I think it should be a required field, just available to contributors and in turn available to data consumers.
We already have a collection of them, and are always collecting more. I'd be happy to do the work to format them for NSI and open a PR once I've got the OK.
For 90% of the ATP spiders there are 0 or 1 matches to NSI using the wikidata alone. This will help with some of the edge cases, and it will not help with others. But as I said, I don't think ATP is the only place this benefits.
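As a loose illustration of the matching described above, a consumer could prefer the wikidata QID and fall back to a URL pattern only for the ambiguous cases. All entry data, field names, and the helper below are invented for this sketch; this is not real NSI or ATP code.

```javascript
// Illustrative only: match a point of interest to an NSI-style entry,
// preferring the wikidata QID and falling back to a matchUrls pattern
// when the QID alone is ambiguous.
const nsiEntries = [
  { id: 'a', wikidata: 'Q42', matchUrls: ['^https://stores\\.example-a\\.com/'] },
  { id: 'b', wikidata: 'Q42', matchUrls: ['^https://shops\\.example-b\\.com/'] }
];

function matchToNsi(poi) {
  const byQid = nsiEntries.filter(e => e.wikidata === poi.wikidata);
  if (byQid.length === 1) return byQid[0];          // unambiguous: done
  // Ambiguous QID: disambiguate using the website, if present.
  return byQid.find(e =>
    poi.website && e.matchUrls.some(p => new RegExp(p).test(poi.website))
  ) || null;
}
```

The point of the sketch is only that the URL pattern is a tie-breaker, not the primary key.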
What would the aim of matchUrls be? It seems like a reasonable way to link NSI / ATP, as we are aiming to help map physical locations, as is ATP as far as I can tell. The only downside I see from this end is validation. Would the NSI scripts need to ping every URL?
I have toyed with the idea of NSI being the source of spiders; some of our recent spiders are just config files.
I don't follow that NSI needs to validate these URLs at all. Does NSI validate its other fields like this? We could kickstart matchUrls with the collection ATP already has.
@rjw62 you might have some input, since you've both got a collection of regexes for different brands and have recently been pointing out mistakes in the ATP regexes. I'd really like for all of the OSM community to share one source of these. It avoids us all duplicating effort, it means everyone will know where to fix it and benefit from improvements or new data, and it means more people will be aware of the data and use it in ways we haven't thought of yet.
What aspect of Sainsbury's is causing an issue? Would https://stores.sainsburys.co.uk/sitemap.xml not be usable? Generally speaking, I think including a URL reference for POI locations within NSI would be a good idea.
Sorry, this is me badly vocalising it again. Sainsbury's would absolutely work with matchUrls. I was rambling about a fantasy that one day we wouldn't need to write ATP spiders; ATP could just parse standard SEO data using NSI entries. Sainsbury's wouldn't work with that... right now at least, maybe one day when the standards are better adopted. Greggs or Lidl wouldn't "work" as they don't have individual webpages for each store, although you could argue they should have something like that. :)
I just had a look at the Sainsbury's web site and it seems the store pages are built with JavaScript on the fly. Is there no option for ATP to build a DOM first and then extract the data needed for each store? Greggs seems to have a URL for each store, e.g. https://www.greggs.co.uk/shop-finder?shop-code=1259, where the code at the end is different for each store.
We don't execute JS at the moment, but we get the Sainsbury's data via the API.
Thank you. I didn't check Lidl either, just recalled the two from memory, so Lidl may have per-store URLs now too. I'll update ATP to publish the Greggs website though, thank you. I don't think that URL existed when I wrote the spider 4 months ago, all the more reason to get these regexes into NSI where people are looking, and out of ATP where nobody is looking. This is getting a bit off topic. What are the next steps? Could we just start adding matchUrls entries?
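A store-finder scheme like the Greggs URL quoted above could be captured with a pattern along these lines. This is my guess at a plausible pattern for illustration, not an existing NSI entry:

```javascript
// Hypothetical pattern for Greggs per-store URLs of the form
// https://www.greggs.co.uk/shop-finder?shop-code=1259
// (the query string's numeric shop code varies per store).
const greggsUrl = /^https:\/\/www\.greggs\.co\.uk\/shop-finder\?shop-code=\d+$/;

// Example check against the URL quoted in the comment above.
const ok = greggsUrl.test('https://www.greggs.co.uk/shop-finder?shop-code=1259');
```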
This did get quite off topic, but I'm always happy to talk about ATP if anyone has questions. Thanks everyone.
I've discussed this before, but I don't think I have here.
It would be nice if NSI stored the valid URLs for features. For example, Bank of Scotland could get something along the lines of:

```json
"matchUrls": ["^https:\/\/branches\.bankofscotland\.co\.uk\/([-\w]+\/[-\/'\w]+)$"]
```
It would allow downstream data users like iD to perform validation on the website tag, and it would help All The Places match its output to NSI.
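A minimal sketch of how a data consumer such as iD might use that field, reusing the Bank of Scotland pattern proposed above. The helper name is invented for illustration and is not a real NSI or iD API:

```javascript
// Sketch only: validate a feature's `website` tag against a
// hypothetical matchUrls entry. The pattern is the one proposed
// in the issue body for Bank of Scotland branch pages.
const matchUrls = [/^https:\/\/branches\.bankofscotland\.co\.uk\/([-\w]+\/[-\/'\w]+)$/];

// A website is considered valid if any pattern accepts it.
function websiteIsValid(website, patterns) {
  return patterns.some(re => re.test(website));
}
```

A validator built this way could warn on a branded POI whose website tag matches none of the brand's patterns, without ever fetching the page.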