Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major overhaul to port manubot-rootstock functionality #2

Merged
merged 21 commits into from Aug 7, 2017

Conversation

dhimmel
Copy link
Member

@dhimmel dhimmel commented Jul 29, 2017

This PR aims to move much of the Python code from manubot-rootstock to this package.

@dhimmel
Copy link
Member Author

dhimmel commented Jul 30, 2017

@agitter big changes on the way. Packaging lot's of the manubot-rootstock python code here.

Another goal which I haven't mentioned elsewhere is to cache citation metadata requests at the earliest possible stage. Therefore if the processing code changes, we don't need to flush the entire cache every time. This will likely use requests-cache. Still looking at the best way to use it in requests-cache/requests-cache#96.

This PR will change the referencing format to be pandoc compatible as per manubot/rootstock#2 (comment). If there are major breaking changes that you've wanted but previously deemed too destructive, now's the time to suggest them.

@agitter
Copy link
Member

agitter commented Jul 30, 2017

Structuring this as a distinct package is a great idea. It will help separate the content from the build system and minimize our need to merge manubot-rootstock changes into a cloned repository.

After changing citations to be pandoc compatible, will we still use the previous syntax such as @doi:10.1371/journal.pone.0032235 and @url:https://github.com/greenelab/manubot or switch to the tag-like syntax in the linked comment, such as @smith04?

I haven't looked at the code changes closely. Did you reimplement the essential parts of arxiv2bib on your own?

@dhimmel
Copy link
Member Author

dhimmel commented Jul 30, 2017

Did you reimplement the essential parts of arxiv2bib on your own?

Yes. No more bibtex and the arXiv citeproc should now be more extensive.

After changing citations to be pandoc compatible, will we still use the previous syntax such as @doi:10.1371/journal.pone.0032235 and @url:https://github.com/greenelab/manubot

Yes, but you will need a semicolon to separate multiple references:

[@doi:10.1371/journal.pone.0032235; @url:https://github.com/greenelab/manubot]

But you will be able to do things like:

@doi:10.1371/journal.pone.0032235 says blah
[@doi:10.1371/journal.pone.0032235, p 33-35; @url:https://github.com/greenelab/manubot].

See this. Currently, I've deviated from pandoc style in 1 important way. References can end in slash. This is because so many URLs end in slash. We don't have to allow this (you can use tags for citations that aren't valid pandoc citation syntax).

@agitter
Copy link
Member

agitter commented Jul 31, 2017

Currently, I've deviated from pandoc style in 1 important way.

I agree with that choice. URLs ending with slash are too common a case to force the use of tags.

@dhimmel dhimmel changed the title WIP: Major overhaul to port manubot-rootstock functionality Major overhaul to port manubot-rootstock functionality Aug 1, 2017
@dhimmel
Copy link
Member Author

dhimmel commented Aug 1, 2017

Manubot is now a command line utility. It now requires a specific set of files to be in the content directory. Process the example-manuscript in the PR like:

manubot \
  --content-directory=tests/example-manuscript/content \
  --output-directory=tests/example-manuscript/output \
  --log-level=INFO

The process should now be more tolerant to manuscript citation errors. Builds will still fail if there are any errors, but the process should complete.

@agitter now's a good time to review.

prepare_manuscript(args)

if error_handler.fired:
print('Failure: exiting with code 1 due to logged errors')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably have this as a log myself ;-)

@agitter
Copy link
Member

agitter commented Aug 2, 2017

now's a good time to review.

@dhimmel how thorough of a review would you like? I'll have to wait until this weekend at the earliest to do a complete line-by-line review.

@dhimmel
Copy link
Member Author

dhimmel commented Aug 2, 2017

how thorough of a review would you like? I'll have to wait until this weekend at the earliest to do a complete line-by-line review.

Happy to wait. This pull request requires lot's of updates to manuscript repos, so we should review it carefully and make sure we get everything right.

@dhimmel
Copy link
Member Author

dhimmel commented Aug 3, 2017

8ca7c86 enables providing additional template variables for jinja2, like:

manubot \
  --content-directory=tests/example-manuscript/content \
  --output-directory=tests/example-manuscript/output \
  --template-variables-path=https://github.com/greenelab/scihub/raw/c59273a4b5a91fd905e88662191d2a213497a682/0.configuration.json \
  --log-level=INFO

Therefore, you could have your analysis repo generate a bunch of stats, which you refer to by variable in the manuscript.

@agitter
Copy link
Member

agitter commented Aug 4, 2017

@dhimmel Because there are a lot of changes here, would it be helpful to have one of the contributors from manubot-rootstock review this as well? Even this weekend I may not be able to do a sufficiently deep review on my own.

@dhimmel
Copy link
Member Author

dhimmel commented Aug 4, 2017

Because there are a lot of changes here, would it be helpful to have one of the contributors from manubot-rootstock review this as well?

It's very hard to review this many changes at once. But @slochower @vsmalladi, your comments are welcome.

We can also merge this PR without extensive review and take the codebase from there with follow up PRs to address any changes people have.

@slochower
Copy link
Collaborator

I'm less familiar with doing code review through GitHub's interface. Is there a way I could merge the PR on a local clone of the repository and run it through test of various citation styles to make sure all's well?

@dhimmel
Copy link
Member Author

dhimmel commented Aug 4, 2017

@slochower checking out the code from a PR is a great way to review it. Oftentimes, there is no substitute for actually testing code.

You can checkout this PR into a branch. Let's assume this repo is your upstream remote. Then you can do:

git fetch upstream pull/2/head:pr-2
git checkout pr-2

dhimmel added a commit to dhimmel/manubot-rootstock that referenced this pull request Aug 4, 2017
@vsmalladi
Copy link
Collaborator

Well start taking a look and see if I notice anything.

@slochower
Copy link
Collaborator

Me too, sorry didn't get a chance to do it today, but I can take a look early next week if this isn't merged by then.

Copy link
Member

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only focused on the tests so far. The test code looks reasonable. Everything passes for me locally except the test that fails on Travis in 21bf789. I don't understand the Invalid response from https://data.datacite.org/10.7287%2Fpeerj.preprints.3100v1 error because the URL works in my browser. Maybe switch to a different example for now?

I ran the local tests in the manubot conda environment from manubot-rootstock after installing this package by following the .travis.yml install commands.

@dhimmel
Copy link
Member Author

dhimmel commented Aug 5, 2017

I don't understand the Invalid response error because the URL works in my browser.

Reported upstream at datacite/datacite#187. I'll skip the test for time being.

Copy link
Member

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've browsed through all of the files now and only have these minor comments. The new structure looks logical and more extensible to me. I didn't test the code outside of the formal test cases, but I'm okay merging this and testing it with our real manuscripts.


def citeproc_passthrough(csl_item, set_id=None):
"""
Fix errors in a CSL item and optional change its id.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optionally change

Read and process command line arguments.
"""
parser = argparse.ArgumentParser()
parser.add_argument('--content-directory', type=pathlib.Path, required=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--content-directory and --output-directory may seem like obvious arguments, but I would find it helpful to add a help message anyway.

parser = argparse.ArgumentParser()
parser.add_argument('--content-directory', type=pathlib.Path, required=True)
parser.add_argument('--output-directory', type=pathlib.Path, required=True)
parser.add_argument('--template-variables-path',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to include an example manuscript that uses the template variables feature, perhaps in manubot-rootstock instead of here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will add an example manuscript under tests for testing --template-variables-path

.pipe(collections.Counter)
)
ref_counts['total'] = sum(ref_counts.values())
stats['reference_counts'] = ref_counts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the reference counts and word count get logged at some level? I may have missed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No ATM they only get saved to variables.json. Hence they're available to the jinja2 templator / whenever you have access to the output files (anytime besides PR CI builds). Do you think all of variables.tsv should be logged? In what cases do you like seeing the reference and word counts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to have the user set a word count limit that would trigger logging?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhimmel if the manuscript has a word count or reference limit imposed by a journal or conference, logging those values so they can be inspected during PR review is potentially useful. Those two variables have special significance to me. I don't think it is necessary to log all of variables.tsv.

@vsmalladi I think that setting a word count threshold to trigger logging would be more complicated than logging at an appropriate level.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agitter Ya i think logging the word count and reference limit would be what i would look at. I think the only other thing I would possibly log, would be the number of figures. Since each journal could also have a limit to number of figures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vsmalladi that's a good idea, it could log figure and table counts. That could come in a later pull request if @dhimmel wants to get this massive one merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agitter That sound reasonable.

@vsmalladi
Copy link
Collaborator

I have ran the installation and test code locally and everything seems to be working. I am starting to look line by line at the code.

@dhimmel
Copy link
Member Author

dhimmel commented Aug 7, 2017

Okay thanks everyone for the feedback. In the interest of time, I'm going to merge.

@dhimmel dhimmel merged commit ab4bb26 into manubot:master Aug 7, 2017
@dhimmel dhimmel deleted the develop branch August 7, 2017 16:54
dhimmel added a commit to manubot/rootstock that referenced this pull request Aug 9, 2017
* Update for expanded manubot package
* Create distinct caching directory
* Deployment: references branch now output
* Rewrite documentation
* Rename CONTRIBUTING.md to USAGE.md

Refs manubot/manubot#2
dhimmel added a commit to manubot/rootstock that referenced this pull request Aug 9, 2017
This build is based on
084e30d.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/manubot-rootstock/builds/262705136
https://travis-ci.org/greenelab/manubot-rootstock/jobs/262705137

[ci skip]

The full commit message that triggered this build is copied below:

Update for expanded manubot package (#48)

* Update for expanded manubot package
* Create distinct caching directory
* Deployment: references branch now output
* Rewrite documentation
* Rename CONTRIBUTING.md to USAGE.md

Refs manubot/manubot#2
dhimmel added a commit to manubot/rootstock that referenced this pull request Aug 9, 2017
This build is based on
084e30d.

This commit was created by the following Travis CI build and job:
https://travis-ci.org/greenelab/manubot-rootstock/builds/262705136
https://travis-ci.org/greenelab/manubot-rootstock/jobs/262705137

[ci skip]

The full commit message that triggered this build is copied below:

Update for expanded manubot package (#48)

* Update for expanded manubot package
* Create distinct caching directory
* Deployment: references branch now output
* Rewrite documentation
* Rename CONTRIBUTING.md to USAGE.md

Refs manubot/manubot#2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants