<h1> Text Classification using TensorFlow/Keras on AI Platform </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for AI Platform using BigQuery
<li> Creating a text classification model using the Estimator API with a Keras model
<li> Training on Cloud AI Platform
<li> Rerunning with a pre-trained embedding
</ol>

In [1]:
# Ensure the right version of TensorFlow is installed.
!pip freeze | grep tensorflow==2.1

In [2]:
# change these to try this notebook out
ACCOUNT = 'sandcorp2014@gmail.com'
SAC = 'jupyter-notebook-sac-f'
SAC_KEY_DESTINATION = '/media/mujahid7292/Data/Gcloud_Tem_SAC'
BUCKET = 'ml-practice-260405'
PROJECT = 'ml-practice-260405'
REGION = 'us-central1'

In [3]:
import os
os.environ['ACCOUNT'] = ACCOUNT
os.environ['SAC'] = SAC
os.environ['SAC_KEY_DESTINATION'] = SAC_KEY_DESTINATION
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '2.1'

# Log In to Google Cloud

In [5]:
%%bash
gcloud auth login $ACCOUNT --force

Opening in existing browser session.


Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?code_challenge=_q3BTAY957bYnIfTsSQtvFWZVfNfPISdbJwMtk99qCU&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


[1011/094541.459934:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly

You are now logged in as [sandcorp2014@gmail.com].
Your current project is [ml-practice-260405].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [6]:
if 'COLAB_GPU' in os.environ:  # this is always set on Colab, the value is 0 or 1 depending on whether a GPU is attached
  from google.colab import auth
  auth.authenticate_user()
  # download "sidecar files" since on Colab, this notebook will be on Drive
  !rm -rf txtclsmodel
  !git clone --depth 1 https://github.com/GoogleCloudPlatform/training-data-analyst
  !mv  training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtclsmodel/ .
  !rm -rf training-data-analyst
  # downgrade TensorFlow to the version this notebook has been tested with
  !pip install --upgrade tensorflow==$TFVERSION

# Set Google Application Credentials

In [7]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='{}/{}.json'.format(SAC_KEY_DESTINATION,SAC)

Check Whether Google Application Credential Was Set Successfully Outside Virtual Environment

In [8]:
%%bash
set | grep GOOGLE_APPLICATION_CREDENTIALS 

GOOGLE_APPLICATION_CREDENTIALS=/media/mujahid7292/Data/Gcloud_Tem_SAC/jupyter-notebook-sac-f.json


In [2]:
import tensorflow as tf
print(tf.__version__)

2.2.0


We will look at the titles of articles and figure out whether the article came from the New York Times, TechCrunch or GitHub. 

We will use [Hacker News](https://news.ycombinator.com/) as our data source. It is an aggregator that displays tech-related headlines from various sources.

### Creating Dataset from BigQuery 

Hacker News headlines are available as a BigQuery public dataset. The [dataset](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.stories?tab=details) contains all headlines from the site's inception in October 2006 until October 2015. 

Here is a sample of the dataset:

In [10]:
%load_ext google.cloud.bigquery

In [11]:
%%bigquery --project $PROJECT
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
  AND LENGTH(url) > 0
LIMIT 10

Unnamed: 0,url,title,score
0,http://www.dumpert.nl/mediabase/6560049/3eb18e...,"Calling the NSA: ""I accidentally deleted an e-...",258
1,http://blog.liip.ch/archive/2013/10/28/hhvm-an...,Amazing performance with HHVM and PHP with a S...,11
2,http://www.gamedev.net/page/resources/_/techni...,A Journey Through the CPU Pipeline,11
3,http://jfarcand.wordpress.com/2011/02/25/atmos...,"Atmosphere Framework 0.7 released: GWT, Wicket...",11
4,http://tech.gilt.com/post/90578399884/immutabl...,Immutable Infrastructure with Docker and EC2 [...,11
5,http://thechangelog.com/post/501053444/episode...,Changelog 0.2.0 - node.js w/Felix Geisendorfer,11
6,http://openangelforum.com/2010/09/09/second-bo...,Second Open Angel Forum in Boston Oct 13th--fr...,11
7,http://bredele.github.io/async,A collection of JavaScript asynchronous patterns,11
8,http://www.smashingmagazine.com/2007/08/25/20-...,20 Free and Fresh Icon Sets,11
9,http://www.cio.com/article/147801/Study_Finds_...,"Study: Only 1 in 5 Workers is ""Engaged"" in The...",11


Let's do some regular expression parsing in BigQuery to get the source of the article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with <i>nytimes</i>.

In [12]:
%%bigquery --project $PROJECT
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY
  source
ORDER BY num_articles DESC
LIMIT 10

Unnamed: 0,source,num_articles
0,blogspot,41386
1,github,36525
2,techcrunch,30891
3,youtube,30848
4,nytimes,28787
5,medium,18422
6,google,18235
7,wordpress,17667
8,arstechnica,13749
9,wired,12841
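
The extraction logic in the query above can be approximated in plain Python to see what each step does. This is only a sketch: `extract_source` is a hypothetical helper name, and Python's `re` module stands in for BigQuery's regular expression engine, so edge cases may differ.

```python
import re

def extract_source(url):
    """Approximate the BigQuery expression: take the host, split on '.',
    reverse the parts, and keep the element at offset 1."""
    match = re.search(r'.*://(.[^/]+)/', url)
    if match is None:
        return None  # no scheme or no trailing path separator
    parts = match.group(1).split('.')
    if len(parts) < 2:
        return None  # nothing to the left of the top-level domain
    return parts[::-1][1]

print(extract_source('http://mobile.nytimes.com/2015/story'))
print(extract_source('https://github.com/some/repo'))
```

Reversing the host parts puts the top-level domain at offset 0, so offset 1 is the registered name regardless of how many subdomains precede it.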


Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for AI Platform.

In [13]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT)

query="""
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
  (SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
  )
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

df = bq.query(query + " LIMIT 5").to_dataframe()
df.head()

Unnamed: 0,source,title
0,github,erlang eunit and travis without rebar
1,github,node-pngdefry
2,github,let s study for google
3,github,mcabber-festival im speech notifications
4,github,an open source pok mon game on ios with locati...


For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  

A simple, repeatable way to do this is to use the hash of a well-distributed column in our data (see https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).

In [14]:
traindf = bq.query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) > 0").to_dataframe()
evaldf  = bq.query(query + " AND ABS(MOD(FARM_FINGERPRINT(title), 4)) = 0").to_dataframe()

Below we can see that roughly 75% of the data is used for training, and 25% for evaluation. 

We can also see that within each dataset, the classes are roughly balanced.

In [15]:
traindf['source'].value_counts()

github        27445
techcrunch    23131
nytimes       21586
Name: source, dtype: int64

In [16]:
evaldf['source'].value_counts()

github        9080
techcrunch    7760
nytimes       7201
Name: source, dtype: int64
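
The idea behind the hash-based split can be illustrated locally. The sketch below is not what BigQuery runs: `FARM_FINGERPRINT` is not available in the Python standard library, so MD5 is swapped in purely as a stand-in hash, and `bucket` and `assign_split` are hypothetical helper names.

```python
import hashlib

def bucket(title, num_buckets=4):
    """Map a title to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(title.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

def assign_split(title):
    """Bucket 0 goes to evaluation (about 1 in 4), the rest to training."""
    return 'eval' if bucket(title) == 0 else 'train'

titles = ['headline {}'.format(i) for i in range(2000)]
eval_fraction = sum(assign_split(t) == 'eval' for t in titles) / len(titles)
print(eval_fraction)
```

Because the split depends only on the title text, rerunning the query assigns every row to the same dataset, and identical titles can never land on both sides of the split.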

Finally we will save our data, which is currently in-memory, to disk.

In [17]:
import os, shutil
DATADIR='data/txtcls'
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv( os.path.join(DATADIR,'train.tsv'), header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv( os.path.join(DATADIR,'eval.tsv'), header=False, index=False, encoding='utf-8', sep='\t')

In [18]:
!head -3 data/txtcls/train.tsv

github	this guy just found out how to bypass adblocker
github	show hn  dodo   command line task management for developers
github	show hn  webservicemock   mock out external calls for local development


In [19]:
!wc -l data/txtcls/*.tsv

  24041 data/txtcls/eval.tsv
  72162 data/txtcls/train.tsv
  96203 total
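
The files are headerless and tab-separated, with the label first and the title second. As a rough sketch of how such rows turn into model inputs, the snippet below parses two in-memory rows with the standard `csv` module and maps labels to integers using the same `CLASSES` mapping that `model.py` defines. The second sample row is made up for illustration, and `model.py` itself uses pandas rather than `csv`.

```python
import csv
import io

CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}  # same mapping as model.py

# In-memory stand-in for a file such as data/txtcls/train.tsv.
sample = io.StringIO(
    "github\tnode-pngdefry\n"
    "nytimes\ta hypothetical headline\n"
)

rows = list(csv.reader(sample, delimiter='\t'))
texts = [text for _, text in rows]
labels = [CLASSES[label] for label, _ in rows]
print(texts, labels)
```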


### TensorFlow/Keras Code

Please explore the code in this <a href="txtclsmodel/trainer">directory</a>: `model.py` contains the TensorFlow model and `task.py` parses command line arguments and launches off the training job.

In particular look for the following:

1. [tf.keras.preprocessing.text.Tokenizer.fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) to generate a mapping from our word vocabulary to integers
2. [tf.keras.preprocessing.text.Tokenizer.texts_to_sequences()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences) to encode our sentences into a sequence of their respective word-integers
3. [tf.keras.preprocessing.sequence.pad_sequences()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) to pad all sequences to be the same length
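
To make these three steps concrete without requiring TensorFlow, here is a simplified pure-Python imitation. It is a sketch only: the function names mirror the Keras ones but these are not the Keras implementations (for example, no punctuation filtering is done), and the two sample titles are invented.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integers by descending frequency; 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def pad_sequences(seqs, maxlen):
    """Left-pad with zeros and truncate from the front to a fixed length."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

texts = ['show hn a tool', 'a tool for hn']
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
print(pad_sequences(sequences, maxlen=6))
```

Padding and truncating at the front matches the defaults described in the comments of `input_fn` in `model.py`.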

The embedding layer in the Keras model takes care of one-hot encoding these integers and learning a dense embedding representation from them. 
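
One way to read that statement: multiplying a one-hot vector by the embedding matrix selects a single row, so an embedding layer can be viewed as a row lookup. The toy sketch below checks this equivalence with plain lists; the matrix values and the helper names are made up for illustration.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1.0 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def vec_times_matrix(vec, matrix):
    """Multiply a length-V vector by a V x D matrix."""
    dim = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(dim)]

# Toy embedding matrix: 4 vocabulary entries, 2 dimensions each.
embedding = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
token_id = 2

via_one_hot = vec_times_matrix(one_hot(token_id, 4), embedding)
via_lookup = embedding[token_id]
print(via_one_hot, via_lookup)
```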

Finally, we pass the embedded text representation through the CNN model pictured below.

<img src=txtcls_model.png  width=25%>
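
As a rough companion to the picture, the sketch below traces per-example tensor shapes through the layers, using the default hyperparameters that `model.py` defines later in this notebook (sequence length 50, embedding dimension 200, 64 filters, pool size 3, 3 classes). The helper functions are hypothetical and only do shape arithmetic, assuming `'same'` padding for the convolutions and the Keras `MaxPool1D` defaults of `'valid'` padding with a stride equal to the pool size.

```python
def conv1d_same(length, filters):
    """'same' padding with stride 1 keeps the sequence length."""
    return (length, filters)

def max_pool1d(length, channels, pool_size):
    """'valid' padding, stride equal to pool_size."""
    return ((length - pool_size) // pool_size + 1, channels)

shapes = []
shape = (50, 200)                      # Embedding: (sequence length, embedding_dim)
shapes.append(('embedding', shape))
shape = conv1d_same(shape[0], 64)      # first Conv1D, filters=64
shapes.append(('conv1d_1', shape))
shape = max_pool1d(shape[0], shape[1], 3)
shapes.append(('max_pool', shape))
shape = conv1d_same(shape[0], 128)     # second Conv1D, filters=64 * 2
shapes.append(('conv1d_2', shape))
shape = (shape[1],)                    # GlobalAveragePooling1D averages over time
shapes.append(('global_avg_pool', shape))
shape = (3,)                           # Dense softmax over the three sources
shapes.append(('dense', shape))

for name, s in shapes:
    print(name, s)
```

The Dropout layers are omitted because they do not change shapes.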

In [4]:
import tensorflow as tf
import pandas as pd
import numpy as np
import pickle
print(tf.__version__)

2.2.0


In [5]:
%%writefile txtclsmodel/trainer/__init__.py
# Empty

Overwriting txtclsmodel/trainer/__init__.py


In [6]:
%%writefile txtclsmodel/trainer/task.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
from . import model

if __name__ == '__main__':
    # parse command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output_dir',
        help='GCS location to write checkpoints and export models',
        required=True
    )
    parser.add_argument(
        '--train_data_path',
        help='can be a local path or a GCS url (gs://...)',
        required=True
    )
    parser.add_argument(
        '--eval_data_path',
        help='can be a local path or a GCS url (gs://...)',
        required=True
    )
    parser.add_argument(
        '--embedding_path',
        help='OPTIONAL: can be a local path or a GCS url (gs://...). \
              Download from: https://nlp.stanford.edu/projects/glove/',
    )
    parser.add_argument(
        '--num_epochs',
        help='number of times to go through the data, default=10',
        default=10,
        type=float
    )
    parser.add_argument(
        '--batch_size',
        help='number of records to read during each training step, default=128',
        default=128,
        type=int
    )
    parser.add_argument(
        '--learning_rate',
        help='learning rate for gradient descent, default=.001',
        default=.001,
        type=float
    )
    parser.add_argument(
        '--native',
        action='store_true',
        help='use native in-graph pre-processing functions',
    )

    args, _ = parser.parse_known_args()
    hparams = args.__dict__
    output_dir = hparams.pop('output_dir')
    
    # initiate training
    model.train_and_evaluate(output_dir, hparams)

Overwriting txtclsmodel/trainer/task.py
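
To see what `hparams` ends up containing, the sketch below rebuilds a subset of the parser from `task.py` and feeds it an explicit argument list. The `--some_unknown_flag` argument is hypothetical and is included only to show that `parse_known_args` tolerates flags the parser does not declare.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--output_dir', required=True)
parser.add_argument('--num_epochs', default=10, type=float)
parser.add_argument('--batch_size', default=128, type=int)
parser.add_argument('--native', action='store_true')

argv = ['--output_dir=./txtcls_trained', '--num_epochs=5', '--some_unknown_flag=1']
args, unknown = parser.parse_known_args(argv)

hparams = args.__dict__
output_dir = hparams.pop('output_dir')  # removed so hparams holds only model settings
print(output_dir, hparams, unknown)
```

Note that `--num_epochs` is declared with `type=float`, so a value passed on the command line such as `5` arrives as `5.0`, and fractional epochs are accepted.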


In [7]:
%%writefile txtclsmodel/trainer/model.py
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


import tensorflow as tf
import pandas as pd
import numpy as np
import pickle
import re

from tensorflow.python.keras.preprocessing import sequence
from tensorflow.python.keras.preprocessing import text
from tensorflow.python.keras import models
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.layers import Dropout
from tensorflow.python.keras.layers import Embedding
from tensorflow.python.keras.layers import Conv1D
from tensorflow.python.keras.layers import MaxPool1D
from tensorflow.python.keras.layers import GlobalAveragePooling1D

from google.cloud import storage


tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}  # label-to-int mapping
TOP_K = 20000  # Limit on the vocabulary size used for tokenization
MAX_SEQUENCE_LENGTH = 50  # Sentences will be truncated/padded to this length


def download_from_gcs(source, destination):
    """
    Helper function to download data from Google Cloud Storage
      # Arguments:
          source: string, the GCS URL to download from (e.g. 'gs://bucket/file.csv')
          destination: string, the filename to save as on local disk. MUST be filename
          ONLY, doesn't support folders. (e.g. 'file.csv', NOT 'folder/file.csv')
      # Returns: nothing, downloads file to local disk
    """
    search = re.search('gs://(.*?)/(.*)', source)
    bucket_name = search.group(1)
    blob_name = search.group(2)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    bucket.blob(blob_name).download_to_filename(destination)
    

def load_hacker_news_data(train_data_path, eval_data_path):
    """
    Parses raw tsv containing hacker news headlines and 
    returns (sentence, integer label) pairs.
      # Arguments:
          train_data_path: string, path to tsv containing training data.
            can be a local path or a GCS url (gs://...)
          eval_data_path: string, path to tsv containing eval data.
            can be a local path or a GCS url (gs://...)
      # Returns:
          ((train_sentences, train_labels), (test_sentences, test_labels)):  sentences
            are lists of strings, labels are numpy integer arrays
    """
    if train_data_path.startswith('gs://'):
        download_from_gcs(train_data_path, destination='train.csv')
        train_data_path='train.csv'
    if eval_data_path.startswith('gs://'):
        download_from_gcs(eval_data_path, destination='eval.csv')
        eval_data_path='eval.csv'
        
    # Parse CSV using pandas
    column_names=('label','text')
    
    df_train = pd.read_csv(
        train_data_path,
        names=column_names,
        sep='\t'
    )
    df_eval = pd.read_csv(
        eval_data_path,
        names=column_names,
        sep='\t'
    )
    return (
        (list(df_train['text']), np.array(df_train['label'].map(CLASSES))),
        (list(df_eval['text']), np.array(df_eval['label'].map(CLASSES)))
    )


def input_fn(texts, labels, tokenizer, batch_size, mode):
    """
    Create tf.estimator compatible input function
      # Arguments:
          texts: [strings], list of sentences
          labels: numpy int vector, integer labels for sentences
          tokenizer: tf.python.keras.preprocessing.text.Tokenizer
            used to convert sentences to integers
          batch_size: int, number of records to use for each train batch
          mode: tf.estimator.ModeKeys.TRAIN or tf.estimator.ModeKeys.EVAL
      # Returns:
          tf.estimator.inputs.numpy_input_fn, produces feature and label
            tensors one batch at a time
    """
    
    # Transform text to sequence of integer.
    x = tokenizer.texts_to_sequences(texts)
    
    # Fix sequence length to max value. Sequences shorter 
    # than the length are padded in the beginning and sequences
    # longer are truncated at the beginning.
    x = sequence.pad_sequences(
        x,
        maxlen=MAX_SEQUENCE_LENGTH
    )
    
    # default settings for training
    num_epochs = None
    shuffle = True

    # override if this is eval
    if mode == tf.estimator.ModeKeys.EVAL:
        num_epochs = 1
        shuffle = False
        
    return tf.compat.v1.estimator.inputs.numpy_input_fn(
        x,
        y=labels,
        batch_size=batch_size,
        num_epochs=num_epochs,
        shuffle=shuffle,
        queue_capacity=50000
    )


def keras_estimator(model_dir, config, learning_rate, filters=64,
                   dropout_rate=0.2, embedding_dim=200,
                   kernel_size=3, pool_size=3, embedding_path=None,
                   word_index=None):
    """
    Builds a CNN model using keras and converts to tf.estimator.Estimator
      # Arguments
          model_dir: string, file path where training files will be written
          config: tf.estimator.RunConfig, specifies properties of tf Estimator
          filters: int, output dimension of the layers.
          kernel_size: int, length of the convolution window.
          embedding_dim: int, dimension of the embedding vectors.
          dropout_rate: float, percentage of input to drop at Dropout layers.
          pool_size: int, factor by which to downscale input at MaxPooling layer.
          embedding_path: string , file location of pre-trained embedding (if used)
            defaults to None which will cause the model to train embedding from scratch
          word_index: dictionary, mapping of vocabulary to integers. used only if
            pre-trained embedding is provided

        # Returns
            A tf.estimator.Estimator
    """
    
    # Create model instances.
    model = models.Sequential()
    num_features = min(len(word_index) + 1, TOP_K)
    
    # Add embedding layer. If pre-trained embedding is used add 
    # weights to the embeddings layer and set trainable to input
    # is_embedding_trainable flag.
    if embedding_path is not None:
        embedding_matrix = get_embedding_matrix(
            word_index, embedding_path, embedding_dim
        )
        
        is_embedding_trainable = True  # set to False to freeze embedding weights
        
        model.add(Embedding(
            input_dim=num_features,
            output_dim=embedding_dim,
            input_length=MAX_SEQUENCE_LENGTH,
            weights=[embedding_matrix],
            trainable=is_embedding_trainable
        ))
    else:
        model.add(Embedding(
            input_dim=num_features,
            output_dim=embedding_dim,
            input_length=MAX_SEQUENCE_LENGTH,
        ))
        
    model.add(Dropout(
        rate=dropout_rate
    ))
    
    model.add(Conv1D(
        filters=filters,
        kernel_size=kernel_size,
        activation='relu',
        bias_initializer='random_uniform',
        padding='same'
    ))
    model.add(MaxPool1D(
        pool_size=pool_size
    ))
    
    model.add(Conv1D(
        filters=filters * 2,
        kernel_size=kernel_size,
        activation='relu',
        bias_initializer='random_uniform',
        padding='same'
    ))
    model.add(GlobalAveragePooling1D())
    
    model.add(Dropout(
        rate=dropout_rate
    ))
    
    model.add(Dense(
        len(CLASSES),
        activation='softmax'
    ))
    
    # Compile model with learning parameters.
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=learning_rate
    )
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['acc']
    )
    
    estimator = tf.keras.estimator.model_to_estimator(
        keras_model=model,
        model_dir=model_dir,
        config=config
    )
    
    return estimator


def serving_input_fn():
    """
    Defines the features to be passed to the model during inference
      Expects already tokenized and padded representation of sentences
      # Arguments: none
      # Returns: tf.estimator.export.ServingInputReceiver
    """
    feature_placeholder = tf.compat.v1.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH])
    features = feature_placeholder  # pass as-is
    return tf.estimator.export.TensorServingInputReceiver(features, feature_placeholder)

def get_embedding_matrix(word_index, embedding_path, embedding_dim):
    """
    Takes embedding for generic vocabulary and extracts the embeddings
      matching the current vocabulary
      The pre-trained embedding file is obtained from https://nlp.stanford.edu/projects/glove/
      # Arguments:
          word_index: dict, {key =word in vocabulary: value= integer mapped to that word}
          embedding_path: string, location of the pre-trained embedding file on disk
          embedding_dim: int, dimension of the embedding space
      # Returns: numpy matrix of shape (vocabulary, embedding_dim) that contains the embedded
          representation of each word in the vocabulary.
    """
    # Read the pre-trained embedding file and get word to word 
    # vector mappings.
    embedding_matrix_all = {}
    
    # Download if embedding file is in GCS
    if embedding_path.startswith('gs://'):
        download_from_gcs(
            embedding_path,
            destination='embedding.csv'
        )
        embedding_path = 'embedding.csv'
        
    with open(embedding_path) as f:
        for line in f:
            # Every line contains word followed by the vector value.
            values = line.split()
            word = values[0] # First item is word.
            coefs = np.asarray(
                values[1:], # Rest of the items are vector.
                dtype='float32'
            )
            embedding_matrix_all[word] = coefs
    
    # Prepare embedding matrix with just the words in our 
    # word_index dictionary.
    num_words = min(len(word_index) + 1, TOP_K)
    embedding_matrix = np.zeros(
        (num_words, embedding_dim)
    )
    
    for word, i in word_index.items():
        if i >= TOP_K:
            continue
        embedding_vector = embedding_matrix_all.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    
    return embedding_matrix


def train_and_evaluate(output_dir, hparams):
    """
    Main orchestrator. Responsible for calling all other functions in model.py
      # Arguments:
          output_dir: string, file path where training files will be written
          hparams: dict, command line parameters passed from task.py
      # Returns: nothing, kicks off training and evaluation
    """
    # Ensure filewriter cache is clear for TensorBoard events file.
    tf.compat.v1.summary.FileWriterCache.clear()
    
    # Load Data
    ((train_texts, train_labels), (test_texts, test_labels)) = load_hacker_news_data(
        hparams['train_data_path'], hparams['eval_data_path']
    )
    
    # Create vocabulary from training corpus.
    tokenizer = text.Tokenizer(num_words=TOP_K)
    tokenizer.fit_on_texts(train_texts)
    
    # Save token dictionary to use during prediction time
    with open('tokenizer.pickled', 'wb') as f:
        pickle.dump(tokenizer, f)
    
    # Create estimator
    run_config = tf.estimator.RunConfig(
        save_checkpoints_steps=500
    )
    estimator = keras_estimator(
        model_dir=output_dir,
        config=run_config,
        learning_rate=hparams['learning_rate'],
        embedding_path=hparams['embedding_path'],
        word_index=tokenizer.word_index
    )
    
    # Create TrainSpec
    train_steps = int(hparams['num_epochs'] * len(train_texts) / hparams['batch_size'])
    train_spec = tf.estimator.TrainSpec(
        input_fn=input_fn(
            train_texts,
            train_labels,
            tokenizer,
            hparams['batch_size'],
            mode=tf.estimator.ModeKeys.TRAIN
        ),
        max_steps=train_steps
    )
    
    # Create EvalSpec
    exporter = tf.estimator.LatestExporter(
        name='exporter',
        serving_input_receiver_fn=serving_input_fn
    )
    eval_spec = tf.estimator.EvalSpec(
        input_fn=input_fn(
            test_texts,
            test_labels,
            tokenizer,
            hparams['batch_size'],
            mode=tf.estimator.ModeKeys.EVAL
        ),
        steps=None,
        exporters=exporter,
        start_delay_secs=10,
        throttle_secs=10
    )
    
    # Start training
    tf.estimator.train_and_evaluate(
        estimator=estimator,
        train_spec=train_spec,
        eval_spec=eval_spec
    )

Overwriting txtclsmodel/trainer/model.py
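
The core of `get_embedding_matrix` can be tried in isolation. The following is a simplified sketch that uses plain lists instead of NumPy and an in-memory stand-in for a GloVe-style file (one word followed by its vector per line). The two-dimensional vectors, the tiny vocabulary, and the helper names `parse_vectors` and `build_matrix` are all made up for illustration.

```python
import io

def parse_vectors(lines):
    """Read 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        values = line.split()
        if values:
            vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors

def build_matrix(word_index, vectors, dim, top_k):
    """Row i holds the vector of the word with index i, or zeros if unknown."""
    num_words = min(len(word_index) + 1, top_k)
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, i in word_index.items():
        if i >= top_k:
            continue  # outside the vocabulary limit
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

fake_file = io.StringIO("the 0.1 0.2\ncode 0.3 0.4\n")
word_index = {'the': 1, 'rust': 2, 'code': 3}
matrix = build_matrix(word_index, parse_vectors(fake_file), dim=2, top_k=20000)
print(matrix)
```

Row 0 stays all-zero because index 0 is reserved for padding, and row 2 stays all-zero because `rust` has no entry in the toy embedding file.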


### Run Locally (optional step)
Let's make sure the code runs by training locally first; the command below uses `--num_epochs=5`, and a smaller value gives a quicker check.
This may not work if you don't have all the packages that gcloud needs installed locally (as on Colab).
This step is optional; you can skip ahead to training on the cloud.

In [8]:
%%bash
pip install google-cloud-storage
rm -rf txtcls_trained
gcloud ai-platform local train \
   --module-name=trainer.task \
   --package-path=./txtclsmodel/trainer \
   -- \
   --output_dir=./txtcls_trained \
   --train_data_path=./train.csv \
   --eval_data_path=./eval.csv \
   --num_epochs=5



INFO:tensorflow:TF_CONFIG environment variable: {'job': {'job_name': 'trainer.task', 'args': ['--output_dir=./txtcls_trained', '--train_data_path=./train.csv', '--eval_data_path=./eval.csv', '--num_epochs=5']}, 'task': {}, 'cluster': {}, 'environment': 'cloud'}
2020-10-29 11:07:24.755336: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-10-29 11:07:24.777699: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2899885000 Hz
2020-10-29 11:07:24.778099: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557f9eae0850 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-29 11:07:24.778147: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-29 11:07:24.778358: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool w

CalledProcessError: Command 'b'pip install google-cloud-storage\nrm -rf txtcls_trained\ngcloud ai-platform local train \\\n   --module-name=trainer.task \\\n   --package-path=./txtclsmodel/trainer \\\n   -- \\\n   --output_dir=./txtcls_trained \\\n   --train_data_path=./train.csv \\\n   --eval_data_path=./eval.csv \\\n   --num_epochs=5\n'' returned non-zero exit status 1.
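
For reference, the number of training steps that `train_and_evaluate` in `model.py` would derive for the command above can be worked out by hand. This sketch assumes the training file holds the 72162 rows counted earlier and that the default batch size of 128 applies, since the command does not override it; truncating the result to an integer is a choice made in this sketch.

```python
num_epochs = 5.0            # --num_epochs=5, parsed as float by task.py
num_train_examples = 72162  # line count of train.tsv reported by wc -l
batch_size = 128            # task.py default

steps_exact = num_epochs * num_train_examples / batch_size
train_steps = int(steps_exact)
print(steps_exact, train_steps)
```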