Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make jiant pip installable #887

Merged
merged 11 commits into from Aug 25, 2019
Merged

Conversation

davidbenton
Copy link
Contributor

This should be considered a work-in-progress, not a complete PR. It passes tests, but might break some untested things, and there are a few things I'll flag for review.

Because I'm still getting to know jiant, I chose running the tutorial experiment using a pip-installed package as my goal. This changeset allows that (without requiring conda or manual dependency installation). It doesn't directly address setting up an environment or downloading data using the pip package. To try it yourself, you can:

# create a new python virtual environment (conda not required)
# set up environment variables per tutorial, including pointing to an existing data directory, e.g.
export JIANT_PROJECT_PREFIX=~/jiant/tutorial
export JIANT_DATA_DIR=~/jiant/data

# copy default.conf from repo

# copy tutorial config file content from https://github.com/nyu-mll/jiant/blob/master/tutorials/setup_tutorial.md to tutorial.conf

# install package
pip install --upgrade --index-url https://test.pypi.org/simple/ jiant==1.1.0a0

# run experiment
python -m jiant --config_file tutorial.conf

Some items worth reviewing:

  • A maintainer should check the metadata in setup.py - I took a shot at it...
  • Config was moved to inside the package, per setuptools recommendation, but this could break things. I only tried the tutorial and the test suite.
  • Assuming a distributed package should include only what is required to run typical tasks, I only added package dependencies necessary to run the tutorial. Are there other tasks/tests that should be considered?
  • This doesn't expose any shell or python scripts as CLI commands, but we might want some for downloading data or generating a local copy of default.conf, for example.
  • I moved main.main() into jiant.__main__ for convenience of running python -m jiant, and to keep necessary code inside the jiant package. I kept main.py for backward compatibility, and there's some duplicated code, but not too much.
  • tests and other root-level packages were not included in the package. Does this satisfy the need for a pip package without the other packages? If no, it might make sense to reorganize some directories.

This works for the tutorial experiment, but is still a work-in-progress.

* Add pip dependencies - matched environment.yml as closely as possible
* Move config into jiant package
@pep8speaks
Copy link

pep8speaks commented Aug 13, 2019

Hello @davidbenton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2019-08-24 00:19:26 UTC

@davidbenton
Copy link
Contributor Author

see #877

@davidbenton
Copy link
Contributor Author

I see I need to get demo.py running to pass CI, which should be straightforward. I have just been running nose2.

The PEP 8 issues aren't picked up by black locally, and I'm not sure why, but I'll clean that up too.

@davidbenton
Copy link
Contributor Author

Okay, minor changes made. crossing my fingers on CI

Another item to note: if these changes are going in the right direction, documentation will need some updates, including change to config paths and pip installation.

@pruksmhc
Copy link
Contributor

Thanks for taking this on! On your questions, most definitely run the superglue baselines and make sure the changes don't break anything from those runs (scripts/superglue-baselines.sh).
I'll defer to @iftenney on the last pip question.

Copy link
Contributor

@sleepinyourhat sleepinyourhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together!

Moving config/ and main() seems reasonable. I'm fine with keeping a legacy main.py and deleting it at a future release.

@@ -0,0 +1,609 @@
"""Train a multi-task model using AllenNLP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub is treating this as a substantial diff rather than a move. Are there any substantial changes here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only changes are making a long string multi-line, and using pkg_resources to locate defaults.conf if no config is passed.


To debug this, run with -m ipdb:

python -m ipdb main.py --config_file ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a double import warning for ipdb when I run it that way. ipdb docs say that formulation is python 2.7 only. Are we supporting python 2.7? 3.X? I've been using 3.6 and 3.7 for the two environments I've been working in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm—I haven't been using ipdb lately. Does anyone else here know what the recommended practice is here?

We're only aiming to support 3.6/3.7—we should also clarify that somewhere...

@sleepinyourhat
Copy link
Contributor

Also: We haven't been enforcing the PEP8 issues that don't also pop up in Black, though some of them can still be worth looking at. We should probably turn off the PEP8 bot...

Thanks!

@sleepinyourhat
Copy link
Contributor

And to confirm, this generally all looks productive/reasonable. LMK when it's ready for a final look.

@davidbenton
Copy link
Contributor Author

Thanks for review and comments! I'm glad it's going in the right direction.

@pruksmhc I ran the boolq task locally using superglue-baseline.sh. No surprise, but it's a bit slow to run with no GPU, ha! After about four days, it completed without error. That's a successful smoke test, but let me know if all (or other specific) tasks need to be run.

@sleepinyourhat Functionally, I think we have pip-installability complete, but depending what they're doing, users might run into some issues using jiant via pip. Specifically, they won't have easy access to the config files or shell scripts. You can see examples of this in my pip tutorial recipe above, where data and defaults.conf need to be manually fetched/located by the user. There are some options available to us for addressing this, but I don't have good sense of real-world use cases.

Do we need to make it easier for pip users to get config files or use some shell scripts? If yes, is that in scope for #877, or open new issue for it?

@sleepinyourhat
Copy link
Contributor

That sounds reasonable. This PR seems to be net helpful as long as the standard workflow (clone, create/edit config files, modify code) isn't any harder. We don't have a great case study workflow to try that would involve pip yet, though. Once we start experimenting with flambé and similar tools, the pain points will become clearer.

setup.py Show resolved Hide resolved
@davidbenton
Copy link
Contributor Author

After a thorough review, I see that defaults.conf is the only config file ever needed at runtime. Symlinking that into the jiant package would allow us to leave other config where it is, which would greatly simplify the documentation and script changes to this PR.

Is a symlink reasonable to make defaults.conf accessible from old and new paths? I've tested in OS X, and should also work the same in *nix. I believe Windows has full symlink support as of Windows 10, but I don't have an easy way to test.

@sleepinyourhat
Copy link
Contributor

I suspect people will often want to name other config files explicitly, both at the command line, and as inheritances within their new config files, so we'll need to make sure that's relatively painless/documented.

I'm fine with symlinks if they're generally supported/safe for multi-platform git projects.

@davidbenton
Copy link
Contributor Author

Symlink will work for most, but I am not confident it works for all Windows users, so I will move all config to the package (as it is in the PR now), update paths, and document. I think I have what I need, and I'll update the PR tomorrow.

@davidbenton
Copy link
Contributor Author

Got hung up debugging the CircleCI changes, but I have a pip environment passing tests there now. Not finished with other work, but going to call it a day on this. Will pick this up by the end of the week.

@davidbenton
Copy link
Contributor Author

Okay, I think this is ready for review @sleepinyourhat.

Note the significant changes to CircleCI, and let me know if you have questions about anything there. I saw in my merge just now that #897 specified a version of conda, which this wipes out. Not sure if that's a red flag or not.

I updated config paths everywhere, but did not attempt to run every script, etc. I did, as noted above, run one task from superglue-baselines.sh.

@davidbenton
Copy link
Contributor Author

Oh, and I'll create the documentation PR if the README.md copy is approved.

Copy link
Contributor

@sleepinyourhat sleepinyourhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! I'll merge tomorrow unless anyone else raises concerns.

@@ -3,7 +3,7 @@
jobs:
test:
docker:
- image: continuumio/miniconda3:4.6.14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me, as long as we have a fixed version and tests are passing.

@davidbenton
Copy link
Contributor Author

Great! You'll need to create a PyPI account to upload package distributions, if you don't have one, and then the notes in setup.py should cover it.

@sleepinyourhat sleepinyourhat merged commit b11d710 into nyu-mll:master Aug 25, 2019
@HaokunLiu
Copy link
Member

Hey @davidbenton , I need to verify a new branch in jiant is pip installable, what do I need to do?

@sleepinyourhat
Copy link
Contributor

sleepinyourhat commented Sep 6, 2019

Installing git repos with pip should be pretty standard. Roughly, IIRC: Make a clone/copy of the repo, exit any environment/virtualenv you're using that already has jiant set up, and run something like pip install jiant from the directory. Then try running demo.conf and/or any additional code you need to run to verify that the dependencies that you're adding loaded correctly.

@davidbenton
Copy link
Contributor Author

@HaokunLiu Everything above seems right, but I'd recommend going a step further and creating a new virtual environment to make sure you're not pulling from existing global/user installed dependencies. The CircleCI config.yml steps create a virtual environment and pip install the package from local filesystem, so you can use that as a recipe, if it's helpful.

@HaokunLiu
Copy link
Member

Thanks, I finally did it.
BTW, does what you said mean, any PR that passes the CircleCI check is automatically pip installable?

@davidbenton
Copy link
Contributor Author

Yes, that's right. As pip use cases emerge, it might be necessary to add some more checks, but CircleCI currently does basic package installation via pip, including dependencies.

@HaokunLiu
Copy link
Member

HaokunLiu commented Sep 7, 2019

@davidbenton I've got another problem. Using RoBERTa tokenizer needs to run "python -m spacy download en". I can manually run this when running experiments. But when I'm using Robert tokenizer in a test script, there is no place for me to ask spacy to download en. Is there any way we can do this in setup, or somewhere else?

phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* Create initial package for PyPI (pip) distribution

* Working PyPI package - can run tutorial experiment

This works for the tutorial experiment, but is still a work-in-progress.

* Add pip dependencies - matched environment.yml as closely as possible
* Move config into jiant package

* PEP 8 long strings changed to multi-line

* Update CI demo.conf path

* Use locally installed pip package for CircleCI

* Update call to ipdb

* Update config paths in non-python files

* Document release and distribution

* Name fix.

* Add note to 'Contributing'
@jeswan jeswan added the jiant-v1-legacy Relevant to versions <= v1.3.2 label Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jiant-v1-legacy Relevant to versions <= v1.3.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants