
Integrate batch prediction #184

Merged: 21 commits from activatedgeek:integrate-batch-prediction into kubeflow:master on Jul 23, 2018

Conversation

@activatedgeek (Contributor) commented Jul 17, 2018

This PR integrates batch prediction over the TensorFlow SavedModel as a Dataflow job. This required a massive refactor because Dataflow needs the T2T model to be importable as a module to make predictions, so unifying all the modules into one package was necessary.

Most of the diff is due to the file moves; the main folders to review are:

  • src/code_search/do_fns
  • src/code_search/transforms


@activatedgeek (Contributor, Author) commented Jul 18, 2018

/cc @jlewi @ankushagarwal

This PR is a relatively large refactor because virtually all the code requires the t2t modules. The Python 2/3 split was causing maintainability issues, so I merged everything into one Python package with multiple modules.

For batch prediction, PredictionDoFn seems to return invalid pickles; this still needs to be investigated.

@activatedgeek (Contributor, Author) commented:

/cc @yixinshi

@activatedgeek activatedgeek changed the title [WIP] Integrate batch prediction Integrate batch prediction Jul 18, 2018
@activatedgeek activatedgeek changed the title Integrate batch prediction [WIP] Integrate batch prediction Jul 19, 2018
@activatedgeek activatedgeek changed the title [WIP] Integrate batch prediction Integrate batch prediction Jul 20, 2018
@jlewi (Contributor) left a comment

Reviewed 2 of 34 files at r1, 4 of 25 files at r3.
Reviewable status: 6 of 48 files reviewed, 7 unresolved discussions (waiting on @activatedgeek, @yixinshi, @jlewi, and @ankushagarwal)


code_search/README.md, line 153 at r3 (raw file):

We run another `Dataflow` pipeline to use the exported model above and get a high-dimensional embedding of each of
our code. Specify the model version (which is a UNIX timestamp) from the output directory. This should be the name of 

Grammar error: "for each of our code".


code_search/README.md, line 164 at r3 (raw file):

(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION}
(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \

What is code-search-predict? Is it a binary? I can't find it.


code_search/README.md, line 187 at r3 (raw file):

$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \

Let's change this to use Kubernetes, please; let's not introduce Docker.
If need be, you can use kubectl run.

You can fix this later.


code_search/src/code_search/do_fns/embeddings.py, line 6 at r3 (raw file):

from cStringIO import StringIO
import apache_beam as beam
from ..transforms.process_github_files import ProcessGithubFiles

Use absolute imports here and everywhere else, please. Make code_search the top-level package.


code_search/src/code_search/transforms/code_embed.py, line 4 at r3 (raw file):

from kubeflow_batch_predict.dataflow.batch_prediction import PredictionDoFn

from ..do_fns.embeddings import EncodeExample

You should always use absolute import paths.

You should figure out what you want the top-level package to be and then set up your Python path that way. E.g., the top-level package could be code_search; you would then do something like

from code_search.do_fns.embeddings import EncodeExample

Can we fix that in this PR?

Also, you should import complete packages, e.g.

from code_search.do_fns import embeddings

rather than importing specific classes from the package.
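For illustration, a minimal sketch of the suggested style, assuming code_search is the top-level package on the Python path:

# Sketch: absolute import of the complete module, with code_search as
# the top-level package; members are then accessed via the module name.
from code_search.do_fns import embeddings

encode_fn = embeddings.EncodeExample  # rather than importing the class directly

Importing the module rather than the class keeps the dependency explicit at each use site and avoids name collisions between packages.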


code_search/src/code_search/transforms/process_github_files.py, line 6 at r3 (raw file):

from apache_beam.io.gcp.internal.clients import bigquery

from ..do_fns import ExtractFuncInfo

Use absolute imports here and everywhere else, please. Make code_search the top-level package.


code_search/src/setup.py, line 14 at r3 (raw file):

  ['python', '-m', 'spacy', 'download', 'en'],
  # TODO(sanyamkapoor): This isn't ideal but no other way for a seamless install right now.
  ['pip', 'install', 'https://github.com/kubeflow/batch-predict/tarball/master']

Should you at least pin a version?
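For illustration, one way to pin it; the surrounding variable name and the v0.1.0 tag below are placeholders, not an actual batch-predict release:

CUSTOM_COMMANDS = [
  ['python', '-m', 'spacy', 'download', 'en'],
  # Pin to a fixed ref instead of the moving master tarball; v0.1.0 is
  # a placeholder tag, not an actual release.
  ['pip', 'install', 'https://github.com/kubeflow/batch-predict/tarball/v0.1.0'],
]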

@activatedgeek (Contributor, Author) commented Jul 20, 2018

@jlewi

You should always use absolute import paths.
Also you should import complete packages

Sure thing! Is there a reason we prefer absolute imports over relative ones? I spent some time reading about it but could not form a strong opinion either way. One issue I often face with absolute imports is cyclic dependencies (e.g. when an __init__.py imports something while its siblings try to import a parent module; the only remaining option is to remove all imports from __init__.py).

What is code-search-predict? Is it a binary? I can't find it.

This is a Python wrapper auto-generated by pip at install time; it is defined in setup.py.
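For illustration, this is roughly how such a wrapper is declared via setuptools entry_points; the module path below is an assumption, not the actual one:

from setuptools import setup, find_packages

setup(
  name='code-search',
  packages=find_packages(),
  entry_points={
    'console_scripts': [
      # pip generates a code-search-predict executable on install that
      # invokes this function (the module path is hypothetical).
      'code-search-predict=code_search.cli:main',
    ],
  },
)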

Let's change this to use Kubernetes, please; let's not introduce Docker.

All of the commands in the README are currently meant for the user to run everything locally. I will certainly finalize the story and make it more accessible once I am done.

@jlewi (Contributor) left a comment

Reviewable status: 1 of 46 files reviewed, 3 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)


code_search/src/code_search/do_fns/github_files.py, line 31 at r4 (raw file):

  def process(self, element): # pylint: disable=no-self-use
    try:
      from ..utils import get_function_docstring_pairs

Can we make this an absolute import?

Why is this import here? Is this to do with how DoFns get serialized?

@jlewi (Contributor) left a comment

Reviewable status: 1 of 46 files reviewed, 6 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)


code_search/src/code_search/do_fns/embeddings.py, line 21 at r4 (raw file):

    result = dict(zip(keys, values))
    yield result

Can you explain how this code works? Is next returning a single element?

How can you have a yield statement when you aren't iterating over anything? Does it just return a single item and then a StopIteration on the next call?


code_search/src/code_search/do_fns/github_files.py, line 45 at r4 (raw file):

class ExtractFuncInfo(beam.DoFn):
  # pylint: disable=abstract-method
  """Convert pair tuples from `TokenizeCodeDocstring` into dict containing query-friendly keys"""

What is meant by query-friendly? Do you mean the strings to return to the user?


code_search/src/code_search/do_fns/github_files.py, line 46 at r4 (raw file):

    try:
      info_rows = [dict(zip(self.info_keys, pair)) for pair in element.pop('pairs')]

What is info_keys?

Can you add a docstring, please?

@activatedgeek (Contributor, Author) commented:

Can we make this an absolute import?
Why is this import here? Is this to do with how DoFns get serialized?

Ah sorry, this slipped through. Yes, this has to do with the serialization of DoFns.
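A minimal sketch of the pattern, with the class and helper names taken from the diff context; the body is illustrative, not the actual implementation:

import apache_beam as beam

class TokenizeCodeDocstring(beam.DoFn):
  """Illustrative sketch of a DoFn with a deferred import."""

  def process(self, element):
    # Imported inside process() so the dependency is resolved on the
    # worker at execution time instead of being captured when the DoFn
    # is pickled and shipped to Dataflow.
    from code_search.utils import get_function_docstring_pairs
    yield get_function_docstring_pairs(element)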

Can you explain how this code works? Is next returning a single element?
How can you have a yield statement when you aren't iterating over anything? Does it just return a single item and then a StopIteration on the next call?

next takes the CSV reader (which is an iterator) and gets the next value from it. I am using the CSV reader instead of manual string parsing because there were a lot of corner cases in how CSV mixes the delimiter with actual data characters. To iterate over the complete data, next would need to be called repeatedly (until it throws a StopIteration). Here, I know there is only ever a single record in the iterator, so I call it once.
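A minimal sketch of the pattern being described; the names are illustrative:

import csv
from io import StringIO  # the original code uses cStringIO on Python 2

def decode_csv_line(line, keys):
  # csv.reader copes with quoting and delimiters embedded in the data,
  # which naive str.split() would mishandle. The input holds exactly
  # one record, so next() is called once; a second call would raise
  # StopIteration.
  values = next(csv.reader(StringIO(line)))
  yield dict(zip(keys, values))

For example, next(decode_csv_line('a,"b,c"', ['x', 'y'])) yields {'x': 'a', 'y': 'b,c'}.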

What is meant by query friendly? Do you mean the strings to return to the user?

I think this needs a better doc. I'll update.

@activatedgeek (Contributor, Author) left a comment

Reviewable status: 1 of 46 files reviewed, 6 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)




code_search/src/code_search/do_fns/github_files.py, line 46 at r4 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…
    try:
      info_rows = [dict(zip(self.info_keys, pair)) for pair in element.pop('pairs')]

What is info_keys?

Can you add a docstring, please?

Done.
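A sketch of what such a docstring might look like; the key names and the body here are assumptions, not the actual change:

import apache_beam as beam

class ExtractFuncInfo(beam.DoFn):
  """Converts (tokens, docstring) pairs into dicts keyed by info_keys.

  info_keys is the list of column names (hypothetically, e.g.
  ['function_tokens', 'docstring_tokens']) zipped against each pair so
  that downstream stages can address fields by name.
  """

  def __init__(self, info_keys):
    super(ExtractFuncInfo, self).__init__()
    self.info_keys = info_keys

  def process(self, element):
    for pair in element.pop('pairs'):
      yield dict(zip(self.info_keys, pair))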


code_search/src/setup.py, line 14 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Should you at least pin a version?

Done. For now I am pinning to my fork; how to finalize the PTransform interface still needs discussion.

@jlewi (Contributor) left a comment

Reviewable status: 1 of 46 files reviewed, 1 unresolved discussion (waiting on @jlewi, @yixinshi, and @ankushagarwal)


code_search/src/setup.py, line 14 at r3 (raw file):

Previously, activatedgeek (Sanyam Kapoor) wrote…

Done. For now I am pinning to my fork; how to finalize the PTransform interface still needs discussion.

Change the TODO to reference an issue.

@activatedgeek (Contributor, Author) left a comment

Reviewable status: 1 of 46 files reviewed, 1 unresolved discussion (waiting on @jlewi, @yixinshi, and @ankushagarwal)


code_search/src/setup.py, line 14 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Change the TODO to reference an issue.

Done.

@jlewi (Contributor) commented Jul 23, 2018

/lgtm
/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 636cf1c into kubeflow:master Jul 23, 2018
@activatedgeek activatedgeek deleted the integrate-batch-prediction branch July 23, 2018 23:28