Integrate batch prediction #184
Conversation
/cc @jlewi @ankushagarwal This PR is a relatively large refactor because virtually all the code requires the t2t modules. The Python 2/3 split was causing maintainability issues, so I merged everything into one Python package with multiple modules. For Batch Prediction, …
/cc @yixinshi
Reviewed 2 of 34 files at r1, 4 of 25 files at r3.
Reviewable status: 6 of 48 files reviewed, 7 unresolved discussions (waiting on @activatedgeek, @yixinshi, @jlewi, and @ankushagarwal)
code_search/README.md, line 153 at r3 (raw file):
We run another `Dataflow` pipeline to use the exported model above and get a high-dimensional embedding of each of our code. Specify the model version (which is a UNIX timestamp) from the output directory. This should be the name of
Grammar error: "for each of our code"
code_search/README.md, line 164 at r3 (raw file):
(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION}
(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \
What is code-search-predict? Is it a binary? I can't find it.
code_search/README.md, line 187 at r3 (raw file):
$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \
Let's change this to use Kubernetes please. Let's not introduce Docker.
If need be you can do kubectl run.
You can fix this later.
code_search/src/code_search/do_fns/embeddings.py, line 6 at r3 (raw file):
from cStringIO import StringIO
import apache_beam as beam
from ..transforms.process_github_files import ProcessGithubFiles
Use absolute imports here and everywhere else please. Make code_search the top level package.
code_search/src/code_search/transforms/code_embed.py, line 4 at r3 (raw file):
from kubeflow_batch_predict.dataflow.batch_prediction import PredictionDoFn
from ..do_fns.embeddings import EncodeExample
You should always use absolute import paths.
You should figure out what you want the top level package to be and then setup your Python path that way.
e.g.
The top level package could be code_search
you would then do something like
from code_search.do_fns.embeddings import EncodeExample
Can we fix that in this PR?
Also, you should import complete packages, e.g.
from code_search.do_fns import embeddings
rather than importing specific classes from the package.
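As a runnable illustration of the absolute-import style being requested, the sketch below builds a throwaway `code_search.do_fns` package in a temp directory (the `EncodeExample` stub is a placeholder, not the repo's actual code) and then imports the whole module rather than a class from it:

```python
import os
import sys
import tempfile

# Build a throwaway package tree: <tmp>/code_search/do_fns/embeddings.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "code_search", "do_fns")
os.makedirs(pkg)
open(os.path.join(root, "code_search", "__init__.py"), "w").close()
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "embeddings.py"), "w") as f:
    f.write("class EncodeExample:\n    pass\n")

# Making code_search the top-level package means its parent directory is
# on sys.path; every import can then be absolute.
sys.path.insert(0, root)

# Absolute import of the whole module, not a class from it.
from code_search.do_fns import embeddings

print(embeddings.EncodeExample.__name__)  # EncodeExample
```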
code_search/src/code_search/transforms/process_github_files.py, line 6 at r3 (raw file):
from apache_beam.io.gcp.internal.clients import bigquery
from ..do_fns import ExtractFuncInfo
Use absolute imports here and everywhere else please. Make code_search the top level package.
code_search/src/setup.py, line 14 at r3 (raw file):
['python', '-m', 'spacy', 'download', 'en'],
# TODO(sanyamkapoor): This isn't ideal but no other way for a seamless install right now.
['pip', 'install', 'https://github.com/kubeflow/batch-predict/tarball/master']
Should you at least pin a version?
Sure thing! Is there a reason we prefer absolute imports over relative? I spent some time reading about it but could not build a strong opinion one way or the other. One of the issues I often face with absolute imports is cyclical dependencies (e.g. when one module imports another that in turn imports the first).
Reviewable status: 1 of 46 files reviewed, 3 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)
code_search/src/code_search/do_fns/github_files.py, line 31 at r4 (raw file):
def process(self, element):  # pylint: disable=no-self-use
  try:
    from ..utils import get_function_docstring_pairs
Can we make this an absolute import?
Why is this import here? Is this to do with how DoFns get serialized?
Reviewable status: 1 of 46 files reviewed, 6 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)
code_search/src/code_search/do_fns/embeddings.py, line 21 at r4 (raw file):
result = dict(zip(keys, values))
yield result
Can you explain how this code works? Is next returning a single element?
How can you have a yield statement when you aren't iterating over anything? Does it just return a single item and then a StopIteration on the next call?
code_search/src/code_search/do_fns/github_files.py, line 45 at r4 (raw file):
class ExtractFuncInfo(beam.DoFn):  # pylint: disable=abstract-method
  """Convert pair tuples from `TokenizeCodeDocstring` into dict containing query-friendly keys."""
What is meant by query friendly? Do you mean the strings to return to the user?
code_search/src/code_search/do_fns/github_files.py, line 46 at r4 (raw file):
try:
  info_rows = [dict(zip(self.info_keys, pair)) for pair in element.pop('pairs')]
What is info_keys?
Can you add a docstring please?
Reviewable status: 1 of 46 files reviewed, 6 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)
code_search/README.md, line 164 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
What is code-search-predict? Is it a binary? I can't find it.
This is a Python wrapper auto-generated after a pip install and is defined in setup.py.
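For context, such a wrapper typically comes from a `console_scripts` entry point in `setup.py`. The fragment below shows the pattern only; the `code_search.cli:predict` target is an assumption, not the actual definition:

```python
# Hypothetical sketch of a console_scripts entry point. On `pip install`,
# setuptools generates a `code-search-predict` executable that calls the
# named function; the target module/function here is assumed.
entry_points = {
    "console_scripts": [
        "code-search-predict=code_search.cli:predict",
    ],
}

# This dict would be passed to setuptools.setup(entry_points=entry_points).
name, target = entry_points["console_scripts"][0].split("=")
print(name, target)  # code-search-predict code_search.cli:predict
```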
code_search/README.md, line 187 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Let's change this to use Kubernetes please. Let's not introduce Docker.
If need be you can do kubectl run.
You can fix this later.
All of the commands in README are currently for the user to run everything locally. I will certainly finalize the story and make it more accessible once I am finished.
code_search/src/code_search/do_fns/embeddings.py, line 21 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Can you explain how this code works? Is next returning a single element?
How can you have a yield statement when you aren't iterating over anything? Does it just return a single item and then a stop iteration on the next call
next takes in the CSV reader (which is an iterator) and gets the next value from the iterator. I am using the CSV reader instead of manual string parsing because there were a lot of corner cases in how CSV uses the delimiter and the actual data characters. To iterate over the complete data, next would need to be called repeatedly (until it throws a StopIteration). Here, I know that there's only ever a single data reference in the iterator and call it one time.
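The behaviour described above can be sketched in isolation (the key names and element format below are made up for the example, not the pipeline's actual schema):

```python
import csv
from io import StringIO

def process(element, keys=("function", "docstring")):
    # csv.reader handles quoting/delimiter corner cases that naive
    # str.split(',') would get wrong.
    values = next(csv.reader(StringIO(element)))
    # A function with `yield` is a generator: it produces this one dict,
    # then raises StopIteration on the next request.
    yield dict(zip(keys, values))

gen = process('foo,"a, quoted docstring"')
print(next(gen))  # {'function': 'foo', 'docstring': 'a, quoted docstring'}
```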
code_search/src/code_search/do_fns/github_files.py, line 31 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Can we make this an absolute import?
Why is this import here? Is this to do with how DoFns get serialized?
Ah sorry, this slipped through. Yes, this has to do with the serialization of DoFns.
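A minimal sketch of why the deferred import helps, using a plain class as a stand-in for a beam.DoFn and math.sqrt as a placeholder for the code_search.utils helper: Dataflow pickles the DoFn and ships it to workers, so an import inside process is resolved on the worker at call time rather than at pickling time.

```python
import pickle

class TokenizeCodeDocstring:  # stand-in for a beam.DoFn subclass
    def process(self, element):
        # Deferred import: not needed to pickle the instance, only to run it.
        from math import sqrt  # placeholder for ..utils helpers
        yield sqrt(element)

# Round-trip through pickle, as Dataflow does when shipping work to workers.
fn = pickle.loads(pickle.dumps(TokenizeCodeDocstring()))
print(list(fn.process(4)))  # [2.0]
```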
code_search/src/code_search/do_fns/github_files.py, line 45 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
What is meant by query friendly? Do you mean the strings to return to the user?
I think this needs a better doc. I'll update.
code_search/src/code_search/do_fns/github_files.py, line 46 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
try:
  info_rows = [dict(zip(self.info_keys, pair)) for pair in element.pop('pairs')]
What is info_keys?
Can you add a docstring please?
Done.
code_search/src/setup.py, line 14 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Should you at least pin a version?
Done. For now pinning to my fork; this still needs discussion on how to finalize the PTransform interface.
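The pinning pattern under discussion might look like the fragment below ("v0.1.0" is a hypothetical tag used for illustration, not a real release of the repo):

```python
# Custom install commands as in the original setup.py, but with the
# batch-predict tarball pinned to a fixed ref instead of master, so
# installs are reproducible. "v0.1.0" is a hypothetical tag.
CUSTOM_COMMANDS = [
    ["python", "-m", "spacy", "download", "en"],
    ["pip", "install",
     "https://github.com/kubeflow/batch-predict/tarball/v0.1.0"],
]

for cmd in CUSTOM_COMMANDS:
    print(" ".join(cmd))
```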
Reviewable status: 1 of 46 files reviewed, 1 unresolved discussion (waiting on @jlewi, @yixinshi, and @ankushagarwal)
code_search/src/setup.py, line 14 at r3 (raw file):
Previously, activatedgeek (Sanyam Kapoor) wrote…
Done. For now pinning to my fork; still needs discussion on how to finalize the PTransform interface.
Change the TODO to reference an issue.
Reviewable status: 1 of 46 files reviewed, 1 unresolved discussion (waiting on @jlewi, @yixinshi, and @ankushagarwal)
code_search/src/setup.py, line 14 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Change the TODO to reference an issue.
Done.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This PR integrates Batch Prediction over the TensorFlow SavedModel as a Dataflow job. This needed a massive refactor because Dataflow requires the T2T model to be importable as a module to make predictions, so unifying all the modules was necessary.
A bunch of changes:
A lot of the diffs are just because of the move, the main folders to be reviewed are
src/code_search/do_fns
src/code_search/transforms