<h1> Text Classification using TensorFlow/Keras on Cloud ML Engine </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Creating a text classification model using the Estimator API with a Keras model
<li> Training on Cloud ML Engine
<li> Deploying the model
<li> Predicting with the model
<li> Rerunning with a pre-trained embedding
</ol>

In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

In [5]:
import os
output = os.popen("gcloud config get-value project").readlines()
project_name = output[0][:-1]

# change these to try this notebook out
PROJECT = project_name
BUCKET = project_name
#BUCKET = BUCKET.replace("qwiklabs-gcp-", "inna-bckt-")
REGION = 'europe-west1'  ## note: Cloud ML Engine is not available in europe-west3!

# set environment variables:
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

print(PROJECT)
print(BUCKET)
print(REGION)



qwiklabs-gcp-40be833ab06c22a0
qwiklabs-gcp-40be833ab06c22a0
europe-west1


In [4]:
import tensorflow as tf
print(tf.__version__)



1.8.0


We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub. 

We will use [Hacker News](https://news.ycombinator.com/) as our data source. It is an aggregator that displays tech-related headlines from various sources.

### Creating Dataset from BigQuery 

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015.

Here is a sample of the dataset:

In [7]:
import google.datalab.bigquery as bq
import pandas as pd

## set pandas display options to display all content 
## in each field, adding line-breaks as needed:
pd.set_option('display.max_colwidth', -1)

query="""
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
LIMIT 10
"""
df = bq.Query(query).execute().result().to_dataframe()
df

Unnamed: 0,url,title,score
0,http://www.dumpert.nl/mediabase/6560049/3eb18e64/even_bellen_met_de_nsa.html,"Calling the NSA: ""I accidentally deleted an e-mail, can you help me recover it?""",258
1,,Show HN: Panda now with Product Hunt and more,11
2,,Skype is down,11
3,http://blog.liip.ch/archive/2013/10/28/hhvm-and-symfony2.html,Amazing performance with HHVM and PHP with a Symfony 2 application,11
4,http://www.gamedev.net/page/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115,A Journey Through the CPU Pipeline,11
5,,Ask HN: Has Sortfolio worked for you as a contractor?,11
6,,Offer HN: Free SEO For Your Startup/Company,11
7,,WARN HN: Gmail account fishing,11
8,http://jfarcand.wordpress.com/2011/02/25/atmosphere-0-7-released-websocket-gwt-wicket-redis-xmpp-async-io/,"Atmosphere Framework 0.7 released: GWT, Wicket, Redis, XMPP, Async I/O",11
9,,Ask HN: Which mobile advertising platforms do you use?,11


Let's do some regular expression parsing in BigQuery to extract the source of each article from its URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with <i>nytimes</i>.

In [8]:
query="""
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY
  source
ORDER BY num_articles DESC
LIMIT 10
"""
df = bq.Query(query).execute().result().to_dataframe()
df

Unnamed: 0,source,num_articles
0,blogspot,41386
1,github,36525
2,techcrunch,30891
3,youtube,30848
4,nytimes,28787
5,medium,18422
6,google,18235
7,wordpress,17667
8,arstechnica,13749
9,wired,12841
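
The extraction logic in the query can be mirrored in plain Python to check it against a few URLs. This is a minimal sketch and not part of the notebook: `source_from_url` is a hypothetical helper, and unlike the SQL pattern it escapes the dot in `\.com$`, so only hosts that end in a literal `.com` qualify.

```python
import re

# Same host-capturing pattern as the BigQuery query above.
HOST_PATTERN = re.compile(r'.*://(.[^/]+)/')

def source_from_url(url):
    """Return the label just left of '.com' in the URL's host, or None."""
    match = HOST_PATTERN.search(url)
    if match is None:
        return None  # no scheme, or no '/' after the host
    host = match.group(1)
    if not re.search(r'\.com$', host):
        return None  # the query keeps only .com hosts
    # Reverse the dot-separated labels and take index 1,
    # like ARRAY_REVERSE(SPLIT(host, '.'))[OFFSET(1)].
    return host.split('.')[::-1][1]

for url in ['http://mobile.nytimes.com/2018/07/some-article.html',
            'https://github.com/tensorflow/tensorflow',
            'http://blog.liip.ch/archive/2013/10/28/hhvm-and-symfony2.html']:
    print(url, '->', source_from_url(url))
```

Note that a URL with no `/` after the host (for example `https://github.com`) does not match the pattern, so it yields no source in this sketch.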


Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [9]:
query="""
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
  (SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
  )
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

df = bq.Query(query + " LIMIT 10").execute().result().to_dataframe()
df.head()

Unnamed: 0,source,title
0,github,django outbox
1,github,webscrapper using node.js deferred cheerio in less than 100 lines
2,techcrunch,flashnotes picks up another $3.6m
3,github,a git user s guide to svn because at least 10 of us have that problem
4,github,show hn cmake module to take care of git submodules


For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  

A simple, repeatable way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).

In [10]:
traindf = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) > 0").execute().result().to_dataframe()
evaldf  = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) = 0").execute().result().to_dataframe()
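
The idea behind the hash-based split can be sketched in plain Python. This is an illustration under assumptions: `FARM_FINGERPRINT` is not in the Python standard library, so MD5 stands in for it here, and `bucket` and `split` are hypothetical helpers. The rows that land in each split will therefore differ from BigQuery's, but the property that matters is the same: a given title always maps to the same bucket.

```python
import hashlib

def bucket(title, num_buckets=4):
    """Map a title to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(title.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

def split(titles):
    """Buckets 1-3 go to training (about 75%), bucket 0 to evaluation (about 25%)."""
    train = [t for t in titles if bucket(t) > 0]
    evaluation = [t for t in titles if bucket(t) == 0]
    return train, evaluation

# Synthetic titles, made up purely to exercise the split.
titles = ['synthetic headline {}'.format(i) for i in range(10000)]
train, evaluation = split(titles)
print(len(train), len(evaluation))
```

Because the bucket depends only on the title text, re-running the split gives the same partition, and identical titles can never end up on both sides of it.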

Below we can see that roughly 75% of the data is used for training, and 25% for evaluation. 

We can also see that within each dataset, the classes are roughly balanced.

In [11]:
print(traindf.shape)
print(evaldf.shape)

(72162, 2)
(24041, 2)


In [14]:
traindf['source'].value_counts(normalize = True)

github        0.380325
techcrunch    0.320543
nytimes       0.299133
Name: source, dtype: float64

In [15]:
evaldf['source'].value_counts(normalize = True)

github        0.377688
techcrunch    0.322782
nytimes       0.299530
Name: source, dtype: float64

Finally, we will save our data, which is currently in memory, to disk.

In [16]:
import os, shutil
DATADIR='data/txtcls'
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv( os.path.join(DATADIR,'train.tsv'), header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv( os.path.join(DATADIR,'eval.tsv'), header=False, index=False, encoding='utf-8', sep='\t')

In [17]:
!head -3 data/txtcls/train.tsv

github	this guy just found out how to bypass adblocker
github	show hn  dodo   command line task management for developers
github	show hn  webservicemock   mock out external calls for local development


In [18]:
!wc -l data/txtcls/*.tsv

  24041 data/txtcls/eval.tsv
  72162 data/txtcls/train.tsv
  96203 total


### TensorFlow/Keras Code

Please explore the code in this <a href="txtclsmodel/trainer">directory</a>: `model.py` contains the TensorFlow model, and `task.py` parses command line arguments and launches the training job.

There are some TODOs in `model.py`. **Make sure to complete the TODOs before proceeding!**

In [19]:
!grep -rnwi ./txtclsmodel/trainer/*.py -e 'todo'

./txtclsmodel/trainer/model.py:89:    x = # TODO (hint: use tokenizer)
./txtclsmodel/trainer/model.py:94:    x = # TODO (hint: there is a useful function in tf.keras.preprocessing...)
./txtclsmodel/trainer/model.py:182:    estimator = # TODO: convert keras model to tf.estimator.Estimator
./txtclsmodel/trainer/model.py:260:    estimator = # TODO: create estimator


### Run Locally
Let's make sure the code runs by training locally for a fraction of an epoch.

In [21]:
%%bash
## Make sure we have the latest version of Google Cloud Storage package
pip install --upgrade google-cloud-storage
rm -rf txtcls_trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/txtclsmodel/trainer \
   -- \
   --output_dir=${PWD}/txtcls_trained \
   --train_data_path=${PWD}/data/txtcls/train.tsv \
   --eval_data_path=${PWD}/data/txtcls/eval.tsv \
   --num_epochs=0.1

Requirement already up-to-date: google-cloud-storage in /usr/local/envs/py3env/lib/python3.5/site-packages (1.14.0)


INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {}, 'job': {'args': ['--output_dir=/content/datalab/gcp-ml-02-advanced-ml-with-tf-on-gcp/04-sequence-models-tensorflow-gcp/labs/txtcls_trained', '--train_data_path=/content/datalab/gcp-ml-02-advanced-ml-with-tf-on-gcp/04-sequence-models-tensorflow-gcp/labs/data/txtcls/train.tsv', '--eval_data_path=/content/datalab/gcp-ml-02-advanced-ml-with-tf-on-gcp/04-sequence-models-tensorflow-gcp/labs/data/txtcls/eval.tsv', '--num_epochs=0.1'], 'job_name': 'trainer.task'}, 'environment': 'cloud', 'task': {}}
INFO:tensorflow:Using the Keras model provided.
INFO:tensorflow:Using config: {'_num_worker_replicas': 1, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd4a3da5cc0>, '_train_distribute': None, '_task_id': 0, '_global_id_in_cluster': 0, '_save_checkpoints_steps': 500, '_save_checkpoints_secs': None, '_save_summary_steps

### Train on the Cloud

Let's first copy our training data to the cloud:

In [22]:
%%bash
gsutil cp data/txtcls/*.tsv gs://${BUCKET}/txtcls/

Copying file://data/txtcls/eval.tsv [Content-Type=text/tab-separated-values]...
Copying file://data/txtcls/train.tsv [Content-Type=text/tab-separated-values]...
/ [2 files][  5.4 MiB/  5.4 MiB]
Operation completed over 2 objects/5.4 MiB.


In [23]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained_fromscratch
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
 --region=$REGION \
 --module-name=trainer.task \
 --package-path=${PWD}/txtclsmodel/trainer \
 --job-dir=$OUTDIR \
 --scale-tier=BASIC_GPU \
 --runtime-version=$TFVERSION \
 -- \
 --output_dir=$OUTDIR \
 --train_data_path=gs://${BUCKET}/txtcls/train.tsv \
 --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
 --num_epochs=5

jobId: txtcls_190314_081641
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [txtcls_190314_081641] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe txtcls_190314_081641

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs txtcls_190314_081641


### Monitor training with TensorBoard
If TensorBoard appears blank, try refreshing after 10 minutes.

In [24]:
from google.datalab.ml import TensorBoard
TensorBoard().start('gs://{}/txtcls/trained_fromscratch'.format(BUCKET))

4663

In [None]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

### Results
What accuracy did you get?

Logs:

```
Saving dict for global step 2819: acc = 0.81215173, global_step = 2819, loss = 0.45492265
``` 

### Deploy trained model 

Once your training completes you will see your exported models in the output directory specified in Google Cloud Storage. 

You should see one exported model for each training checkpoint (the run configuration logged during the local run above shows `_save_checkpoints_steps` set to 500).

In [25]:
%%bash
gsutil ls gs://${BUCKET}/txtcls/trained_fromscratch/export/exporter/

gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551604/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551627/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551648/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551670/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551692/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551713/
gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/1552551731/


We will take the last export and deploy it as a REST API using Google Cloud Machine Learning Engine.

In [26]:
%%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v1_fromscratch"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls/trained_fromscratch/export/exporter/ | tail -1)
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME} --quiet
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

Created ml engine model [projects/qwiklabs-gcp-40be833ab06c22a0/models/txtcls].
Creating version (this might take a few minutes)......
..............................................................................................................................................................................................................................................................................................................................................done.
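
The deploy cell selects the export with `gsutil ls ... | tail -1`, which appears to rely on the listing being sorted and on the directory names being timestamps of equal length. A more explicit selection can be sketched in Python; `latest_export` is a hypothetical helper, and the paths are taken from the listing shown earlier (deliberately shuffled here).

```python
def latest_export(paths):
    """Return the export directory with the largest numeric timestamp."""
    candidates = []
    for path in paths:
        name = path.rstrip('/').rsplit('/', 1)[-1]
        if name.isdigit():  # skips the parent 'exporter/' entry
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError('no timestamped export directories found')
    return max(candidates)[1]

BASE = 'gs://qwiklabs-gcp-40be833ab06c22a0/txtcls/trained_fromscratch/export/exporter/'
listing = [BASE] + [BASE + ts + '/' for ts in
                    ['1552551604', '1552551627', '1552551648',
                     '1552551731', '1552551713']]
print(latest_export(listing))
```

Comparing the timestamps as integers keeps the choice correct even if the listing order changes.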


### Get Predictions

Here are some actual Hacker News headlines gathered in July 2018. These titles were not part of the training or evaluation datasets.

In [27]:
techcrunch=[
  'Uber shuts down self-driving trucks unit',
  'Grover raises €37M Series A to offer latest tech products as a subscription',
  'Tech companies can now bid on the Pentagon’s $10B cloud contract'
]
nytimes=[
  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',
  'A $3B Plan to Turn Hoover Dam into a Giant Battery',
  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'
]
github=[
  'Show HN: Moon – 3kb JavaScript UI compiler',
  'Show HN: Hello, a CLI tool for managing social media',
  'Firefox Nightly added support for time-travel debugging'
]

Our serving input function expects the already tokenized representations of the headlines, so we do that pre-processing in the code before calling the REST API.

Note: Ideally we would do these transformations in the TensorFlow graph directly instead of relying on separate client pre-processing code (see: [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml/#training_serving_skew)). However, the pre-processing functions we're using are Python functions, so they cannot be embedded in a TensorFlow graph.

See the <a href="../text_classification_native.ipynb">text_classification_native</a> notebook for a solution to this.
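
To make the client-side pre-processing concrete, here is a pure-Python sketch of its two operations: mapping words to integer ids and left-padding to a fixed length. It is a simplified stand-in, not the Keras implementation. `fit_vocab`, `texts_to_ids`, and `pad` are hypothetical helpers, the whitespace splitting is cruder than the Keras tokenizer's filtering, and in the notebook the real mapping comes from the pickled tokenizer.

```python
from collections import Counter

def fit_vocab(texts):
    """Assign ids from 1 upward, most frequent word first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_ids(texts, vocab):
    """Replace each known word by its id and drop unknown words."""
    return [[vocab[w] for w in text.lower().split() if w in vocab]
            for text in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros, keeping only the last maxlen ids of longer sequences."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

vocab = fit_vocab(['show hn moon', 'show hn hello', 'show me'])
ids = texts_to_ids(['Show HN unknown'], vocab)
print(pad(ids, 5))
```

The padding side mirrors the default `'pre'` padding and truncation of `pad_sequences`. The ids sent at prediction time are only meaningful if they come from the same vocabulary used in training, which is why the notebook loads the pickled tokenizer rather than fitting a new one.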

In [28]:
import pickle
from tensorflow.python.keras.preprocessing import sequence
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

requests = techcrunch+nytimes+github

# Tokenize and pad sentences using the same mapping used in the deployed model
with open("txtclsmodel/tokenizer.pickled", "rb") as f:
  tokenizer = pickle.load(f)

requests_tokenized = tokenizer.texts_to_sequences(requests)
requests_tokenized = sequence.pad_sequences(requests_tokenized,maxlen=50)

# JSON format the requests
request_data = {'instances':requests_tokenized.tolist()}

# Authenticate and call CMLE prediction API 
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

parent = 'projects/%s/models/%s' % (PROJECT, 'txtcls') #version is not specified so uses default
response = api.projects().predict(body=request_data, name=parent).execute()

# Format and print response
for i in range(len(requests)):
  print('\n{}'.format(requests[i]))
  print(' github    : {}'.format(response['predictions'][i]['dense_1'][0]))
  print(' nytimes   : {}'.format(response['predictions'][i]['dense_1'][1]))
  print(' techcrunch: {}'.format(response['predictions'][i]['dense_1'][2]))


Uber shuts down self-driving trucks unit
 github    : 0.0014016543282195926
 nytimes   : 0.15926823019981384
 techcrunch: 0.8393300771713257

Grover raises €37M Series A to offer latest tech products as a subscription
 github    : 5.670904556609457e-06
 nytimes   : 0.17489786446094513
 techcrunch: 0.8250964879989624

Tech companies can now bid on the Pentagon’s $10B cloud contract
 github    : 0.0009365324513055384
 nytimes   : 0.3038620948791504
 techcrunch: 0.6952013969421387

‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions
 github    : 0.005086249206215143
 nytimes   : 0.920703113079071
 techcrunch: 0.07421065121889114

A $3B Plan to Turn Hoover Dam into a Giant Battery
 github    : 0.008534395135939121
 nytimes   : 0.984660804271698
 techcrunch: 0.006804757751524448

A MeToo Reckoning in China’s Workplace Amid Wave of Accusations
 github    : 0.0025259980466216803
 nytimes   : 0.9899836778640747
 techcrunch: 0.00749035831540823

Show HN: Moon – 3kb J

How many of your predictions were correct?
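
One way to answer this is to turn each probability vector into a label and compare it with the known source. The sketch below assumes the class order used when printing the response above (github, nytimes, techcrunch). `predicted_label` and `accuracy` are hypothetical helpers, and the sample probabilities are rounded from the first and fifth headlines in the output.

```python
CLASSES = ['github', 'nytimes', 'techcrunch']  # order used when printing above

def predicted_label(probabilities):
    """Return the class name with the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASSES[best]

def accuracy(predictions, expected):
    """Fraction of probability vectors whose top class matches the expected label."""
    hits = sum(predicted_label(p) == e for p, e in zip(predictions, expected))
    return hits / float(len(expected))

# Rounded values for 'Uber shuts down self-driving trucks unit' and
# 'A $3B Plan to Turn Hoover Dam into a Giant Battery'.
sample = [[0.0014, 0.1593, 0.8393], [0.0085, 0.9847, 0.0068]]
print([predicted_label(p) for p in sample])
print(accuracy(sample, ['techcrunch', 'nytimes']))
```

With the real response, each probability vector would be `response['predictions'][i]['dense_1']`, and the expected labels follow the order in which the `techcrunch`, `nytimes`, and `github` lists were concatenated.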

### Rerun with Pre-trained Embedding

In the previous model we trained our word embedding from scratch. Often we get better performance and/or converge faster by leveraging a pre-trained embedding. This is a similar concept to transfer learning in image classification.

We will use the popular GloVe embedding, which is trained on Wikipedia as well as various news sources like the New York Times.

You can read more about GloVe at the project homepage: https://nlp.stanford.edu/projects/glove/

You can download the embedding files directly from the stanford.edu site, but we've rehosted them in a GCS bucket for faster download speed.

In [29]:
!gsutil cp gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt gs://$BUCKET/txtcls/




Copying gs://cloud-training-demos/courses/machine_learning/deepdive/09_sequence/text_classification/glove.6B.200d.txt [Content-Type=text/plain]...
\ [1 files][661.3 MiB/661.3 MiB]      0.0 B/s                                   
Operation completed over 1 objects/661.3 MiB.                                    


Once the embedding is downloaded, re-run your cloud training job with the added command line argument:

` --embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt`

Be sure to change your OUTDIR so it doesn't overwrite the previous model.

While the final accuracy may not change significantly, you should notice the model is able to converge to it much more quickly because it no longer has to learn an embedding from scratch.
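
For intuition, a GloVe text file stores one word per line followed by its vector components, separated by spaces. The sketch below shows one common way to turn such a file into an embedding matrix indexed by token id. How the trainer actually consumes `--embedding_path` is not shown in this notebook, so treat this as an assumption. `load_glove` and `embedding_matrix` are hypothetical helpers, and the vectors are toy values.

```python
import io

def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with id i; id 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

# Toy 3-dimensional vectors, made up for illustration
# (glove.6B.200d.txt has 200 components per word).
toy_file = io.StringIO(u'the 0.1 0.2 0.3\ncloud -0.5 0.0 0.5\n')
vectors = load_glove(toy_file)
matrix = embedding_matrix({'the': 1, 'cloud': 2, 'qwiklabs': 3}, vectors, dim=3)
print(matrix)
```

Words missing from the pre-trained file keep an all-zero row here; other initializations are possible.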

In [30]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained_withembeddings
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
 --region=$REGION \
 --module-name=trainer.task \
 --package-path=${PWD}/txtclsmodel/trainer \
 --job-dir=$OUTDIR \
 --scale-tier=BASIC_GPU \
 --runtime-version=$TFVERSION \
 -- \
 --output_dir=$OUTDIR \
 --train_data_path=gs://${BUCKET}/txtcls/train.tsv \
 --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
 --embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt \
 --num_epochs=5

jobId: txtcls_190314_083755
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [txtcls_190314_083755] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe txtcls_190314_083755

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs txtcls_190314_083755


Final step:

``` 
Saving dict for global step 2819: acc = 0.8063949, global_step = 2819, loss = 0.50821364
``` 
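
The final evaluation lines of the two jobs can be compared programmatically rather than by eye. This is a minimal sketch that assumes the log format shown in this notebook; `parse_eval_line` is a hypothetical helper.

```python
import re

LOG_PATTERN = re.compile(r'Saving dict for global step (\d+): (.*)')

def parse_eval_line(line):
    """Return (step, {metric: value}) for an evaluation log line, or None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    metrics = {}
    for pair in match.group(2).split(', '):
        name, value = pair.split(' = ')
        metrics[name] = float(value)
    return int(match.group(1)), metrics

# The two final log lines reported in this notebook.
scratch = parse_eval_line(
    'Saving dict for global step 2819: acc = 0.81215173, '
    'global_step = 2819, loss = 0.45492265')
glove = parse_eval_line(
    'Saving dict for global step 2819: acc = 0.8063949, '
    'global_step = 2819, loss = 0.50821364')
print('accuracy difference:', scratch[1]['acc'] - glove[1]['acc'])
```

In this particular pair of runs the from-scratch model ends with slightly higher accuracy and lower loss, which is consistent with the earlier remark that the final accuracy may not change significantly.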


#### References
- This implementation is based on code from: https://github.com/google/eng-edu/tree/master/ml/guides/text_classification.
- See the full text classification tutorial at: https://developers.google.com/machine-learning/guides/text-classification/

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License