
Integrate batch prediction #184

Merged: 21 commits from activatedgeek:integrate-batch-prediction into kubeflow:master on Jul 23, 2018

Conversation

@activatedgeek (Contributor) commented Jul 17, 2018

This PR integrates batch prediction over the TensorFlow SavedModel as a Dataflow job. This required a massive refactor because Dataflow needs the T2T model to be importable as a module to make predictions, so unifying all the modules into one package was necessary.

Most of the diff is due to the file moves; the main folders to review are:

  • src/code_search/do_fns
  • src/code_search/transforms


@activatedgeek (Contributor, Author) commented Jul 18, 2018

/cc @jlewi @ankushagarwal

This PR is a relatively large refactor because virtually all the code requires the t2t modules. The Python 2/3 split was causing maintainability issues, so I merged everything into one Python package with multiple modules.

For batch prediction, PredictionDoFn seems to return invalid pickles; this still needs to be investigated.

@activatedgeek (Contributor, Author) commented:

/cc @yixinshi

@activatedgeek activatedgeek changed the title [WIP] Integrate batch prediction Integrate batch prediction Jul 18, 2018
@activatedgeek activatedgeek changed the title Integrate batch prediction [WIP] Integrate batch prediction Jul 19, 2018
@activatedgeek activatedgeek changed the title [WIP] Integrate batch prediction Integrate batch prediction Jul 20, 2018
@jlewi (Contributor) left a comment

Reviewed 2 of 34 files at r1, 4 of 25 files at r3.
Reviewable status: 6 of 48 files reviewed, 7 unresolved discussions (waiting on @activatedgeek, @yixinshi, @jlewi, and @ankushagarwal)


code_search/README.md, line 153 at r3 (raw file):

We run another `Dataflow` pipeline to use the exported model above and get a high-dimensional embedding of each of
our code. Specify the model version (which is a UNIX timestamp) from the output directory. This should be the name of 

Grammar error: "for each of our code".


code_search/README.md, line 164 at r3 (raw file):

(env2.7) $ export SAVED_MODEL_DIR=${GCS_DIR}/output/export/Servo/${MODEL_VERSION}
(env2.7) $ code-search-predict -r DataflowRunner --problem=github_function_docstring -i "${GCS_DIR}/data/*.csv" \

What is code-search-predict? Is it a binary? I can't find it.


code_search/README.md, line 187 at r3 (raw file):

$ docker run --rm -p8501:8501 gcr.io/kubeflow-images-public/tensorflow-serving-1.8 tensorflow_model_server \

Let's change this to use Kubernetes, please; let's not introduce Docker.
If need be, you can use kubectl run.

You can fix this later.


code_search/src/code_search/do_fns/embeddings.py, line 6 at r3 (raw file):

from cStringIO import StringIO
import apache_beam as beam
from ..transforms.process_github_files import ProcessGithubFiles

Use absolute imports here and everywhere else, please. Make code_search the top-level package.


code_search/src/code_search/transforms/code_embed.py, line 4 at r3 (raw file):

from kubeflow_batch_predict.dataflow.batch_prediction import PredictionDoFn

from ..do_fns.embeddings import EncodeExample

You should always use absolute import paths.

You should figure out what you want the top-level package to be and then set up your Python path that way. E.g., the top-level package could be code_search; you would then do something like

from code_search.do_fns.embeddings import EncodeExample

Can we fix that in this PR?

Also, you should import complete packages, e.g.

from code_search.do_fns import embeddings

rather than importing specific classes from the package.
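For illustration, a minimal sketch of the suggested style, assuming code_search is the top-level package on the Python path:

# Sketch: absolute import of the complete module, with code_search as
# the top-level package; members are then accessed via the module name.
from code_search.do_fns import embeddings

encode_fn = embeddings.EncodeExample  # rather than importing the class directly

Importing the module rather than the class keeps the dependency explicit at each use site and avoids name collisions between packages.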


code_search/src/code_search/transforms/process_github_files.py, line 6 at r3 (raw file):

from apache_beam.io.gcp.internal.clients import bigquery

from ..do_fns import ExtractFuncInfo

Use absolute imports here and everywhere else, please. Make code_search the top-level package.


code_search/src/setup.py, line 14 at r3 (raw file):

  ['python', '-m', 'spacy', 'download', 'en'],
  # TODO(sanyamkapoor): This isn't ideal but no other way for a seamless install right now.
  ['pip', 'install', 'https://github.com/kubeflow/batch-predict/tarball/master']

Should you at least pin a version?
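For illustration, one way to pin it; the surrounding variable name and the v0.1.0 tag below are placeholders, not an actual batch-predict release:

CUSTOM_COMMANDS = [
  ['python', '-m', 'spacy', 'download', 'en'],
  # Pin to a fixed ref instead of the moving master tarball; v0.1.0 is
  # a placeholder tag, not an actual release.
  ['pip', 'install', 'https://github.com/kubeflow/batch-predict/tarball/v0.1.0'],
]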

@activatedgeek (Contributor, Author) commented Jul 20, 2018

@jlewi

You should always use absolute import paths.
Also you should import complete packages

Sure thing! Is there a reason we prefer absolute imports over relative ones? I spent some time reading about it but could not form a strong opinion either way. One issue I often face with absolute imports is cyclic dependencies (e.g. when an __init__.py imports something while its siblings try to import a parent module; the only remaining option is to remove all imports from __init__.py).

What is code-search-predict? Is it a binary? I can't find it.

This is a Python wrapper auto-generated by pip at install time; it is defined in setup.py.
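For illustration, this is roughly how such a wrapper is declared via setuptools entry_points; the module path below is an assumption, not the actual one:

from setuptools import setup, find_packages

setup(
  name='code-search',
  packages=find_packages(),
  entry_points={
    'console_scripts': [
      # pip generates a code-search-predict executable on install that
      # invokes this function (the module path is hypothetical).
      'code-search-predict=code_search.cli:main',
    ],
  },
)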

Let's change this to use Kubernetes, please; let's not introduce Docker.

All of the commands in the README are currently meant for the user to run everything locally. I will certainly finalize the story and make it more accessible once I am done.

@jlewi (Contributor) left a comment

Reviewable status: 1 of 46 files reviewed, 3 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)


code_search/src/code_search/do_fns/github_files.py, line 31 at r4 (raw file):

  def process(self, element): # pylint: disable=no-self-use
    try:
      from ..utils import get_function_docstring_pairs

Can we make this an absolute import?

Why is this import here? Is this to do with how DoFns get serialized?

@jlewi (Contributor) left a comment

Reviewable status: 1 of 46 files reviewed, 6 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)


code_search/src/code_search/do_fns/embeddings.py, line 21 at r4 (raw file):

    result = dict(zip(keys, values))
    yield result

Can you explain how this code works? Is next returning a single element?

How can you have a yield statement when you aren't iterating over anything? Does it just return a single item and then a StopIteration on the next call?


code_search/src/code_search/do_fns/github_files.py, line 45 at r4 (raw file):

class ExtractFuncInfo(beam.DoFn):
  # pylint: disable=abstract-method
  """Convert pair tuples from `TokenizeCodeDocstring` into dict containing query-friendly keys"""

What is meant by query-friendly? Do you mean the strings to return to the user?


code_search/src/code_search/do_fns/github_files.py, line 46 at r4 (raw file):

    try:
      info_rows = [dict(zip(self.info_keys, pair)) for pair in element.pop('pairs')]

What is info_keys?

Can you add a docstring, please?

@activatedgeek (Contributor, Author) commented:

Can we make this an absolute import?
Why is this import here? Is this to do with how DoFns get serialized?

Ah sorry, this slipped through. Yes, this has to do with the serialization of DoFns.
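A minimal sketch of the pattern, with the class and helper names taken from the diff context; the body is illustrative, not the actual implementation:

import apache_beam as beam

class TokenizeCodeDocstring(beam.DoFn):
  """Illustrative sketch of a DoFn with a deferred import."""

  def process(self, element):
    # Imported inside process() so the dependency is resolved on the
    # worker at execution time instead of being captured when the DoFn
    # is pickled and shipped to Dataflow.
    from code_search.utils import get_function_docstring_pairs
    yield get_function_docstring_pairs(element)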

Can you explain how this code works? Is next returning a single element?
How can you have a yield statement when you aren't iterating over anything? Does it just return a single item and then a StopIteration on the next call?

next takes the CSV reader (which is an iterator) and gets the next value from it. I am using the CSV reader instead of manual string parsing because there were a lot of corner cases in how CSV mixes the delimiter with actual data characters. To iterate over the complete data, next would need to be called repeatedly (until it throws a StopIteration). Here, I know there is only ever a single record in the iterator, so I call it once.
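A minimal sketch of the pattern being described; the names are illustrative:

import csv
from io import StringIO  # the original code uses cStringIO on Python 2

def decode_csv_line(line, keys):
  # csv.reader copes with quoting and delimiters embedded in the data,
  # which naive str.split() would mishandle. The input holds exactly
  # one record, so next() is called once; a second call would raise
  # StopIteration.
  values = next(csv.reader(StringIO(line)))
  yield dict(zip(keys, values))

For example, next(decode_csv_line('a,"b,c"', ['x', 'y'])) yields {'x': 'a', 'y': 'b,c'}.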

What is meant by query friendly? Do you mean the strings to return to the user?

I think this needs a better doc. I'll update.

@activatedgeek (Contributor, Author) left a comment

Reviewable status: 1 of 46 files reviewed, 6 unresolved discussions (waiting on @jlewi, @activatedgeek, @yixinshi, and @ankushagarwal)




code_search/src/code_search/do_fns/github_files.py, line 46 at r4 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…
    try:
      info_rows = [dict(zip(self.info_keys, pair)) for pair in element.pop('pairs')]

What is info_keys?

Can you add a docstring, please?

Done.
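A sketch of what such a docstring might look like; the key names and the body here are assumptions, not the actual change:

import apache_beam as beam

class ExtractFuncInfo(beam.DoFn):
  """Converts (tokens, docstring) pairs into dicts keyed by info_keys.

  info_keys is the list of column names (hypothetically, e.g.
  ['function_tokens', 'docstring_tokens']) zipped against each pair so
  that downstream stages can address fields by name.
  """

  def __init__(self, info_keys):
    super(ExtractFuncInfo, self).__init__()
    self.info_keys = info_keys

  def process(self, element):
    for pair in element.pop('pairs'):
      yield dict(zip(self.info_keys, pair))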


code_search/src/setup.py, line 14 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Should you at least pin a version?

Done. For now I am pinning to my fork; how to finalize the PTransform interface still needs discussion.

@jlewi (Contributor) left a comment

Reviewable status: 1 of 46 files reviewed, 1 unresolved discussion (waiting on @jlewi, @yixinshi, and @ankushagarwal)


code_search/src/setup.py, line 14 at r3 (raw file):

Previously, activatedgeek (Sanyam Kapoor) wrote…

Done. For now I am pinning to my fork; how to finalize the PTransform interface still needs discussion.

Change the TODO to reference an issue.

@activatedgeek (Contributor, Author) left a comment

Reviewable status: 1 of 46 files reviewed, 1 unresolved discussion (waiting on @jlewi, @yixinshi, and @ankushagarwal)


code_search/src/setup.py, line 14 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Change the TODO to reference an issue.

Done.

@jlewi (Contributor) commented Jul 23, 2018

/lgtm
/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 636cf1c into kubeflow:master Jul 23, 2018
@activatedgeek activatedgeek deleted the integrate-batch-prediction branch July 23, 2018 23:28