Kaggle blog post #180

Merged: 1 commit into kubeflow:master from pdmack/kaggle-blog, Aug 30, 2018

@pdmack (Member) commented Aug 26, 2018

/assign @jlewi
/assign @sarahmaddox

/cc @abhi-g
/cc @ewilderj
/cc @aronchick

Related to: #79
Please suggest specific and literal changes via in-line review comments. :-)



@sarahmaddox (Contributor) left a comment

Nice! Exciting too. I have a few suggestions to make the content flow more easily.

+++

## Kaggle
[Kaggle](http://kaggle.com/) is home to the world's largest community of data scientists and AI/ML researchers. It's a diverse community ranging from newcomers to accredited research scientists, where participants collaborate and compete on-line to refine algorithms and techniques that are judged to produce the "best" model. The competitions can be organized by anyone but many companies and institutions award significant cash prizes to the winners. Beyond the academic benefit of the competitions, it also provides a means to identify top candidates for data science careers with these corporations as they increase their investments in AI and ML.

Recommendation: Replace "on-line" with "online" here and in 2 other places below. Current best practice is to avoid hyphens where possible, and to use one word for "online".


The second sentence is quite long, and the meaning of the last bit is ambiguous. Recommendation: Replace this:
"where participants collaborate and compete on-line to refine algorithms and techniques that are judged to produce the “best” model."

with this:
"where participants collaborate online to refine algorithms and techniques. In organized competitions, judges decide on the entries that produce the best model."


For new data scientists, there are competitions that are interesting thought experiments. The most notable of these is predicting which persons would survive the Titanic disaster, which in fact we will use for our example below.

The Kaggle development platform itself is organized into competitions, scripts called "kernels", and datasets which are used to derive the models submitted to the competitions. There are also short form on-line classes for introductions to Python and machine learning.

Where it says 'scripts called "kernels"', I'd suggest a little more info. Rather than being just a script, a kernel is a combination of environment, input, code, and output.
http://blog.kaggle.com/2016/07/08/kaggle-kernel-a-new-name-for-scripts/


![Run Notebook](../nb-run.svg)

This [particular public notebook](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python) was chosen based on the high number of votes it has as a kernel for the Titanic competition, the richness of some of the visualizations, and also because it uses XGBoost, a library that Kubeflow does not currently include in its supported TensorFlow notebooks.

Recommendation: Change this:

"This particular public notebook was chosen based on..."

to this:

"I chose this particular public notebook based on..."

Reason: when I first read it, I assumed the "was chosen" meant that the notebook had been awarded a prize in some Kaggle competition, and I expected the next sentence to tell me more about that. Changing the sentence to indicate that you chose it makes things much clearer.

- the Kaggle image includes TensorFlow 1.9 or greater built with [AVX2 support](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2), so the image may not run on some older CPU
- unlike the Kubeflow curated notebooks, the default notebook user (`jovyan`) does not have the permissions to install new packages to global locations (but more on that below)

### What's this all about?

I'd move this section ("What's this all about?") up above the "Images" section, and change it to a level 2 heading. It's super important that people know the goal of this blog post, and this section tells it nicely. When I first read the "Images" section, I wondered why you were telling me all that. Then I read the "What's this all about?" section, and things became much clearer.

@pdmack (Member, Author) commented Aug 27, 2018

@sarahmaddox thanks for the feedback. PTAL latest

@jlewi (Contributor) commented Aug 27, 2018

Woo Hoo!

@abhi-g (Member) commented Aug 27, 2018

looking great!

@sarahmaddox (Contributor) commented

Nice!
/lgtm

@sarahmaddox (Contributor) commented

/lgtm

@jlewi (Contributor) commented Aug 27, 2018

This looks awesome but let me circulate it with the Kaggle folks.
/hold

rosbo added a commit to Kaggle/docker-python that referenced this pull request Aug 27, 2018
This enables use cases such as this one: kubeflow/website#180
@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 28, 2018

@pdmack (Member, Author) commented Aug 28, 2018

Kaggle/kaggle-api#84 was just closed. Backend issue apparently.

Thus, I need to rewrite a portion of this.

@jlewi (Contributor) commented Aug 28, 2018

Ack.

@pdmack (Member, Author) commented Aug 29, 2018

Ready for (final?) review.

@abhi-g (Member) commented Aug 29, 2018

Any outstanding reasons for the hold on this PR?

@sarahmaddox (Contributor) commented

/lgtm

@jlewi (Contributor) commented Aug 29, 2018

/lgtm
/approve

The hold was to give the Kaggle folks a chance to review it.

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi (Contributor) commented Aug 30, 2018

Got the signoff from Kaggle.

/hold cancel

@jlewi (Contributor) left a comment

:lgtm:

Reviewable status: 0 of 6 files reviewed, 5 unresolved discussions (waiting on @sarahmaddox, @pdmack, @abhi-g, @aronchick, and @ewilderj)


content/blog/kaggle_on_kubeflow.md, line 39 at r2 (raw file):

- it's a very large notebook, over 21 GB in size. Docker pulls and notebook launches can take a lengthy period of time
- the versions of TensorFlow, PyTorch, XGBoost, and the other libraries included may change at any time

nit: change over time

@jlewi (Contributor) commented Aug 30, 2018

Merging manually because reviewable is blocking merge; looks like reviewable wasn't configured to not add reviewable status to PR.

@jlewi jlewi merged commit bdb24ed into kubeflow:master Aug 30, 2018

@pdmack (Member, Author) commented Aug 30, 2018

I'll follow up on the nit soon

abhi-g pushed a commit to abhi-g/website that referenced this pull request Aug 30, 2018
abhi-g added a commit to abhi-g/website that referenced this pull request Aug 30, 2018
abhi-g added a commit that referenced this pull request Aug 30, 2018
Merge pull request #180 from pdmack/kaggle-blog
@pdmack pdmack deleted the kaggle-blog branch April 4, 2019 16:39
5 participants