7084 crawlable file access #7579

Merged
merged 17 commits into from Feb 17, 2021

Contributor

@landreev landreev commented Feb 4, 2021

What this PR does / why we need it:

This API provides an HTML view of a dataset as a set of directory indexes for its folder tree, which you can click through or crawl with wget.
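
To make "click through or crawl" concrete, here is a minimal sketch of how a client might pull the links out of one directory index page. The markup of the real index pages is not shown in this PR, so the sample HTML, the base URL, and the names `IndexLinkParser` and `extract_links` are all assumptions for illustration; the only thing relied on is that an index page contains ordinary `<a href>` links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IndexLinkParser(HTMLParser):
    """Collect href targets from <a> tags in a directory index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(index_html, page_url):
    """Return absolute URLs for every link found on one index page."""
    parser = IndexLinkParser()
    parser.feed(index_html)
    return [urljoin(page_url, href) for href in parser.links]


# Hypothetical index page and base URL; not taken from the actual API output.
sample = (
    '<html><body>'
    '<a href="?folder=subdir1">subdir1</a> '
    '<a href="/files/1">a.txt</a>'
    '</body></html>'
)
links = extract_links(sample, "https://example.org/base/dirindex/")
```

A crawler such as wget does essentially this for every page it fetches, following folder links recursively and saving the files they lead to.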

Which issue(s) this PR closes:

Closes #7084

Special notes for your reviewer:

There's a long-ish discussion in the issue. It may be easier to read it starting with the last comments, as is always the case with these things.
I suggest starting with the documentation (in the API guide, doc/sphinx-guides/source/api/native-api.rst), then consulting the discussion in the issue if necessary.

Note that the guide entry has images. They will show up when rendered, but not in GitHub previews. They can be viewed in doc/sphinx-guides/source/api/img.

Suggestions on how to test this:

See the note above. The guide should have enough information to be able to test it.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

no

Is there a release notes update needed for this change?:

yes.

Additional documentation:

@sekmiller sekmiller self-assigned this Feb 5, 2021
Contributor

@sekmiller sekmiller left a comment

Looks good. Needs to be brought up to date with dev.

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Review 🦁 to QA 🔎✅ Feb 8, 2021
@sekmiller sekmiller removed their assignment Feb 8, 2021
@landreev
Contributor Author

landreev commented Feb 8, 2021

> Looks good. Needs to be brought up to date with dev.

OK, will do.

@kcondon kcondon self-assigned this Feb 8, 2021
@landreev
Contributor Author

landreev commented Feb 9, 2021

Going to move this back into development temporarily. Per QA, I want to investigate (a) direct downloading of sub-subfolders and (b) possibly returning a 404 when a nonexistent folder is requested.
I would also like to add an example robots.txt entry to the guide.
(With sub-subfolders, there's a chance that the only thing needed is another clarification in the guide; I will confirm.)

@landreev landreev moved this from Review 🦁 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Feb 9, 2021
@kcondon
Contributor

kcondon commented Feb 9, 2021

Tested basic functionality and it works. I had a couple of questions/issues:

  1. Filenames for the index are in the form of the command line.
  2. I cannot figure out how to address a sub-subdirectory directly.
  3. Specifying a nonexistent directory results in an index file anyway (I believe it is essentially empty).

@landreev landreev self-assigned this Feb 9, 2021
@djbrooke djbrooke added this to the 5.4 milestone Feb 11, 2021
@landreev
Contributor Author

@kcondon
Putting the PR back into QA. The following changes have been made:

  • It's actually very easy to tell wget not to check robots.txt; much easier than asking admins to change their system robots.txt. I modified the recommended wget command line in the API guide accordingly.
  • Modified the guide to consistently use the path with the trailing slash, i.e. .../dirindex/ or .../dirindex/?folder=..., since this is the correct form.
  • .../dirindex/?folder=subfolder/subsubfolder should now be working.
  • Made the names of the saved directory index files less messy.
  • A call to list a nonexistent folder will now result in a 404/NOT FOUND.
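
The URL forms in the list above can be sketched as follows. This is an illustration only: the part of the path before `dirindex/` is elided in this thread, so the caller supplies it as a hypothetical `dataset_base`, and the helper names `dirindex_url` and `wget_command` are made up here. The exact wget command recommended in the guide is not reproduced in this conversation; the sketch uses only wget's standard `-r` (recursive) option and `-e robots=off`, which tells wget not to honor robots.txt.

```python
from urllib.parse import quote


def dirindex_url(dataset_base, folder=None):
    """Build a directory index URL that always ends in 'dirindex/'.

    dataset_base is whatever precedes 'dirindex/' for a given dataset
    (hypothetical here; the full path is not shown in this thread).
    """
    base = dataset_base.rstrip("/") + "/dirindex/"
    if folder:
        # Keep '/' unescaped so nested folders read as subfolder/subsubfolder.
        return base + "?folder=" + quote(folder, safe="/")
    return base


def wget_command(url):
    """Argument list for a recursive crawl that ignores robots.txt."""
    return ["wget", "-r", "-e", "robots=off", url]


top = dirindex_url("https://example.org/base")
nested = dirindex_url("https://example.org/base/", "subfolder/subsubfolder")
cmd = wget_command(top)
```

Passing the argument list to `subprocess.run` would start the crawl. Per the last bullet, a request for a folder that does not exist should come back as a 404 rather than as an empty index.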

@landreev landreev moved this from IQSS Team - In Progress 💻 to QA 🔎✅ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Feb 16, 2021
@landreev landreev removed their assignment Feb 16, 2021
@kcondon
Contributor

kcondon commented Feb 17, 2021

@landreev This works for subdirectories and for a nonexistent folder, but now there are no directory listings as part of the download, only blank directories named dirindex.

@kcondon kcondon removed their assignment Feb 17, 2021
@kcondon kcondon moved this from QA 🔎✅ to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Feb 17, 2021
@landreev
Contributor Author

landreev commented Feb 17, 2021

> @landreev This works for subdirectories and for a nonexistent folder, but now there are no directory listings as part of the download, only blank directories named dirindex.

I wish it were true. As in, it would be preferable not to save these directory listings, and only save the real files in the dataset; nobody really needs them.
But wget does want to save them, and there doesn't seem to be a way to tell it not to. So in the guide I'm essentially telling the user to ignore the content of dirindex. And the saved index files in that directory are called .index...html, so I'm purposefully making them less visible/more ignorable. :)
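
For anyone post-processing a crawl, ignoring the saved listings can be sketched like this. It is a minimal sketch, assuming the saved listings are exactly the files whose names start with `.index` and end with `.html`; the download layout built in the temporary directory and the helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def is_saved_index(path):
    """True for saved directory listings such as .index.html or .index-subdir1.html."""
    return path.name.startswith(".index") and path.name.endswith(".html")


def dataset_files(download_root):
    """List downloaded files relative to the root, skipping saved index pages."""
    root = Path(download_root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and not is_saved_index(p)
    )


# Hypothetical download tree, built only to exercise the filter.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dirindex").mkdir()
    (root / "dirindex" / ".index.html").write_text("<html></html>")
    (root / "dirindex" / ".index-subdir1.html").write_text("<html></html>")
    (root / "subdir1").mkdir()
    (root / "subdir1" / "data.csv").write_text("a,b\n")
    (root / "readme.txt").write_text("hello\n")
    found = dataset_files(root)
```

Because the saved listings are dotfiles, a plain `ls` already hides them; the filter only matters for tools that walk the tree including hidden files.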

@kcondon kcondon assigned kcondon and unassigned landreev Feb 17, 2021
@kcondon kcondon moved this from IQSS Team - In Progress 💻 to QA 🔎✅ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Feb 17, 2021
@kcondon
Contributor

kcondon commented Feb 17, 2021

Ah, you pointed me to your above comment, after I reported the issue.
Your suggestion of viewing them via ls -la worked:

ls -la
total 16
drwxr-xr-x. 2 root root 117 Feb 17 19:41 .
drwxr-xr-x. 5 root root 52 Feb 17 19:41 ..
-rw-r--r--. 1 root root 698 Feb 17 19:41 .index.html
-rw-r--r--. 1 root root 728 Feb 17 19:41 .index-subdir1.html
-rw-r--r--. 1 root root 562 Feb 17 19:41 .index-subdir1_subsubdir1.html
-rw-r--r--. 1 root root 705 Feb 17 19:41 .index-subdir2.html
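
The four names in the listing above suggest a simple pattern: `.index.html` for the top level, and `.index-` plus the folder path with `/` replaced by `_` for subfolders. The sketch below reproduces that pattern. It is inferred from this one listing only, not taken from the implementation, so the function name `index_file_name` and the handling of any other characters are assumptions.

```python
def index_file_name(folder=""):
    """Guess the saved index file name for a folder path (inferred pattern)."""
    folder = folder.strip("/")
    if not folder:
        # Top-level listing of the dataset.
        return ".index.html"
    # Nested folders appear to be flattened with underscores.
    return ".index-" + folder.replace("/", "_") + ".html"


names = [
    index_file_name(f)
    for f in ("", "subdir1", "subdir1/subsubdir1", "subdir2")
]
```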

Development

Successfully merging this pull request may close these issues.

Implement access to the files in the dataset as a virtual folder tree
5 participants