Retrieve http links

Chrome blocks the download of some resources: external resources whose URL uses HTTP instead of HTTPS. More details about this change in Chrome: https://blog.chromium.org/2020/02/protecting-users-from-insecure.html

This only affects resources that the browser has to download. If the resource is a web page (e.g. documentation), the problem does not occur. The idea is therefore to find all resources with an external HTTP URL.

Connect to the mongodb host.
Run this command: mongo --quiet udata --eval 'printjson(db.dataset.find({"resources.url":/^http:/}, {slug:1, _id:1, title:1, "resources.url":1}).toArray())' > output_http.json
Copy the generated JSON file to your workspace (scp mongodb_host:output_http.json .)
Edit the file to replace all matches of the regex [^\s]+\("(.+?)"\) with "$1". This replaces the ObjectId and ISODate wrappers, which JSON does not support, with their string values.
Make sure the script scripts/filterhttp.py is in the same directory as the JSON file, then run it: python filterhttp.py. This generates output_http_parsed.json, which you can reformat with an IDE.
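The manual search-and-replace step can also be scripted. The following is a minimal sketch, assuming the regex above is meant to match Mongo shell wrappers such as ObjectId("...") and ISODate("..."). The function name strip_shell_wrappers and the sample document are hypothetical and are not part of the repository scripts.

```python
import json
import re

# Matches a Mongo shell wrapper such as ObjectId("...") or ISODate("...")
# and captures the quoted value inside the parentheses.
WRAPPER = re.compile(r'[^\s]+\("(.+?)"\)')


def strip_shell_wrappers(text: str) -> str:
    """Replace each wrapper with its quoted value so the text is valid JSON."""
    return WRAPPER.sub(r'"\1"', text)


# Hypothetical sample in the style of printjson output.
sample = '{ "_id" : ObjectId("5e4d0c1f"), "created" : ISODate("2020-02-06T10:00:00Z") }'
cleaned = strip_shell_wrappers(sample)

# The cleaned text can now be parsed by a standard JSON parser.
document = json.loads(cleaned)
```

To apply it to the real export, read output_http.json as text, pass it through strip_shell_wrappers, and write the result back before running the filter script.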

Once the list is available, it can be filtered using a few heuristics:

Remove all private datasets.
Remove all REST web services (such a web service very rarely forces a download).
Check URLs by extension: XML files, PDF files and web pages rarely force a download. They are displayed in the browser, which is not an issue for Chrome.
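The extension heuristic can be sketched as follows. This is only an illustration: the extension set DISPLAYED_INLINE, the function name needs_review and the example URLs are assumptions, and URLs without an extension are kept for manual review because the extension alone cannot tell a web page from a web service.

```python
import posixpath
from urllib.parse import urlparse

# Assumed set of extensions that browsers usually display instead of downloading.
DISPLAYED_INLINE = {".xml", ".pdf", ".html", ".htm"}


def needs_review(url: str) -> bool:
    """Return True unless the URL path ends with an extension usually displayed inline."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension not in DISPLAYED_INLINE


# Hypothetical URLs for illustration.
urls = [
    "http://example.org/data/report.pdf",
    "http://example.org/data/table.csv",
    "http://example.org/page.html",
    "http://example.org/archive.ZIP?x=1",
]
to_review = [url for url in urls if needs_review(url)]
```

The result still has to be checked by hand, since a server can force a download for any file type.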

Retrieve broken links

The list of broken links is based on Croquemort checks. Some false positives are possible (some websites block bots), as well as some false negatives, for example when an old URL on a website is redirected to the homepage instead of returning a 404.

Connect to the mongodb host.
Run this command: mongo --quiet udata --eval 'printjson(db.dataset.find({"resources.extras.check:available":false}, {slug:1, _id:1, title:1, "resources.extras":1}).toArray())' > output.json
Copy the generated JSON file to your workspace (scp mongodb_host:output.json .)
Edit the file to replace all matches of the regex [^\s]+\("(.+?)"\) with "$1". This replaces the ObjectId and ISODate wrappers, which JSON does not support, with their string values.
Make sure the script scripts/filterlinks.py is in the same directory as the JSON file, then run it: python filterlinks.py. This generates brokenlinks.json, which you can reformat with an IDE.
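Note that the query returns whole datasets as soon as one of their resources is flagged, together with the extras of all their resources. The sketch below shows one way to keep only the flagged resources. It assumes the document layout implied by the query (a resources array whose items carry an extras object with a check:available key). The function name unavailable_resources and the sample data are hypothetical, and this is not necessarily what filterlinks.py does.

```python
import json


def unavailable_resources(datasets):
    """Yield (slug, title, extras) for each resource flagged as unavailable."""
    for dataset in datasets:
        for resource in dataset.get("resources", []):
            extras = resource.get("extras", {})
            # Only an explicit false counts; a missing key means "not checked".
            if extras.get("check:available") is False:
                yield dataset.get("slug"), dataset.get("title"), extras


# Hypothetical sample in the shape of the cleaned output.json.
sample = json.loads("""
[
  {
    "_id": "5e4d0c1f",
    "slug": "some-dataset",
    "title": "Some dataset",
    "resources": [
      {"extras": {"check:available": true}},
      {"extras": {"check:available": false}},
      {"extras": {}}
    ]
  }
]
""")
rows = list(unavailable_resources(sample))
```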

Don't forget to remove all private datasets from the list.
