In [0]:
!pip install tensorflow==1.12



## Import libraries

In [0]:
# import modules
import tensorflow as tf
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import warnings

warnings.simplefilter('ignore')

print("Tensorflow Version:", tf.__version__)

Tensorflow Version: 1.12.0


## Authenticate with the cloud

In [0]:
# authenticate this notebook with Google Cloud
from google.colab import auth
auth.authenticate_user()

## Get training data

This notebook trains a deep neural network (DNN) to predict the tag of a Stack Overflow question from its text.

Note that the model is trained here, on the notebook server, not in the AI Platform training service.

In [0]:
# Download the Stack Overflow data from Google Cloud Storage
!gsutil cp 'gs://cloudml-demo-lcm/SO_ml_tags_avocado_188k_v2.csv' ./


Copying gs://cloudml-demo-lcm/SO_ml_tags_avocado_188k_v2.csv...
- [1 files][276.7 MiB/276.7 MiB]                                                
Operation completed over 1 objects/276.7 MiB.                                    


In [0]:
# list the files in the current working directory
import os

print(os.listdir('.'))

['.config', 'preprocess.py', 'model_prediction.py', 'tutorial_pred.egg-info', 'setup.py', 'dist', 'adc.json', 'processor.pkl', 'SO_ml_tags_avocado_188k_v2.csv', '__pycache__', 'keras_model.h5', 'sample_data']


## Get data ready

In [0]:
# read the data with pandas, then drop the unused column and any empty rows
df = pd.read_csv("SO_ml_tags_avocado_188k_v2.csv", names=['tags', 'original_tags', 'text'], header=0)  # header=0: replace the file's header row with `names`
df = df.drop(columns=['original_tags'])
df = df.dropna()

df = shuffle(df, random_state=1234)
df.head()

Unnamed: 0,tags,text
63842,pandas,avocado does not read values from excel cells ...
63486,pandas,optimizing iteration over avocado dataframe i ...
105751,matplotlib,erase previously drawn content from a pyplot d...
70621,pandas,sort/alphabetzing columns in dataframes (avoca...
161865,tensorflow,error in importing niftynet with avocado 1.9 i...


## Explore data

In [0]:
# how many rows and how many unique labels do we have?
print("how many:", len(df))
print("unique label numbers:", len(np.unique(df['tags'])))
print("unique label:", np.unique(df['tags']))

how many: 188199
unique label numbers: 33
unique label: ['keras' 'keras,scikitlearn' 'keras,tensorflow' 'matplotlib'
 'matplotlib,keras' 'matplotlib,keras,scikitlearn' 'matplotlib,pandas'
 'matplotlib,pandas,scikitlearn' 'matplotlib,scikitlearn'
 'matplotlib,tensorflow' 'matplotlib,tensorflow,keras' 'pandas'
 'pandas,keras' 'pandas,matplotlib' 'pandas,matplotlib,keras'
 'pandas,matplotlib,scikitlearn' 'pandas,scikitlearn'
 'pandas,scikitlearn,keras' 'pandas,tensorflow' 'pandas,tensorflow,keras'
 'pandas,tensorflow,scikitlearn' 'scikitlearn' 'scikitlearn,keras'
 'scikitlearn,pandas' 'scikitlearn,tensorflow'
 'scikitlearn,tensorflow,keras' 'tensorflow' 'tensorflow,keras'
 'tensorflow,keras,scikitlearn' 'tensorflow,matplotlib'
 'tensorflow,matplotlib,keras' 'tensorflow,scikitlearn'
 'tensorflow,scikitlearn,keras']


In [0]:
# many labels are comma-separated combinations of tags; keep only the first tag of each so that every question gets a single class
tags_split = [tags.split(',')[0] for tags in df['tags'].values]
print(tags_split[:10])
print("unique labels number:", len(np.unique(tags_split)))
print("unique labels :", np.unique(tags_split))


['pandas', 'pandas', 'matplotlib', 'pandas', 'tensorflow', 'matplotlib', 'pandas', 'pandas', 'pandas', 'pandas']
unique labels number: 5
unique labels : ['keras' 'matplotlib' 'pandas' 'scikitlearn' 'tensorflow']
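
Before settling on these five classes, it can help to check how balanced they are. The following is a minimal sketch using only the standard library; the `sample_tags` list is made up for illustration and stands in for `df['tags'].values`.

```python
from collections import Counter

# Hypothetical sample standing in for df['tags'].values
sample_tags = ['pandas', 'pandas,matplotlib', 'keras,tensorflow', 'tensorflow', 'pandas']

# Keep only the first tag of each comma-separated label
first_tags = [tags.split(',')[0] for tags in sample_tags]

# Count how often each first tag occurs and print its share of the data
counts = Counter(first_tags)
for tag, n in counts.most_common():
    print("{}: {} ({:.0%})".format(tag, n, n / len(first_tags)))
```

On the real data, a heavily skewed distribution would be a reason to look at per-class metrics rather than overall accuracy alone.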


## Transform label data

So a 5-class model is enough. We can't feed the raw text to the model, though: the model accepts numbers, not strings, so the text has to be converted into numeric vectors.

Before doing that, we also need to convert the string labels into integers, which is easy with sklearn's `LabelEncoder`.

In [0]:
encoder = LabelEncoder()
tags_encoded = encoder.fit_transform(tags_split)

num_tags = len(encoder.classes_)

print(df['text'].values[0])
print(encoder.classes_)
print("Class name: `{}` with label: `{}`".format(df['tags'].values[0], tags_encoded[0]))

avocado does not read values from excel cells with such simple formula as =10+10 i've got a few excel sheets with columns with different data types. some of them consist of formulas as well. avocado does not read values from excel cells with such simple formula as =10+10 or =250+30+40  code like this   truck_work = avocado.read_excel(hauls_monthly_data, sheetname=truck)   returns dataframe where column filled from those excel cells consists data, which type is float and value is nan. but i'm waiting for float with value 10 and 320  the only way i've worked out yet to solve this issue is by manually saving each time an excel-file before processing data from it. which is not much pythonic way of dealing with problems.  if i'm using such code as   wb = load_workbook(filename = hauls_monthly_data) sheet_names = wb.get_sheet_names() name = sheet_names[1] sheet_ranges = wb[name] truck_work = avocado.dataframe(sheet_ranges.values)   then it returns   8          =16+109+108        =6+40+29    
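
To make the label mapping concrete, here is a pure-Python sketch of the same idea: `LabelEncoder` sorts the distinct labels and assigns each one its position in that order. The helper names `fit_labels`, `encode`, and `decode` are illustrative and are not part of sklearn.

```python
def fit_labels(labels):
    """Return the sorted distinct labels (the counterpart of encoder.classes_)."""
    return sorted(set(labels))

def encode(labels, classes):
    """Map each label to its index in the sorted class list."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]

def decode(codes, classes):
    """Map integer codes back to label names."""
    return [classes[code] for code in codes]

classes = fit_labels(['pandas', 'keras', 'tensorflow', 'pandas', 'matplotlib'])
codes = encode(['pandas', 'keras'], classes)
print(classes)
print(codes)
print(decode(codes, classes))
```

Because the mapping depends on the sorted class list, the same fitted encoder (or its `classes_` array) has to be used when turning predictions back into tag names.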

In [0]:
# the data is already shuffled, so we can split by position: the first 80% is the training set
train_size = int(len(df) * .8)
print('Train size: ', train_size)
print("test size:", len(df) - train_size)

Train size:  150559
test size: 37640


In [0]:
# get the train and test labels
train_tags = tags_encoded[:train_size]
test_tags = tags_encoded[train_size:]


## Processor class

We have to turn the text into numbers. One thing to keep in mind is that prediction must apply exactly the same preprocessing as training, so the best approach is to wrap that logic in a class. Also, training runs in this notebook, but prediction will run from plain Python code, so we write the class into a Python file.

In [0]:

%%writefile preprocess.py

from tensorflow.keras.preprocessing import text

class TextProcessor:
  def __init__(self, vocab_size):
    self._vocab_size = vocab_size
    self._tokenizer = None

  def create_tokens(self, text_list):
    # Convert text into vectors with a Keras Tokenizer. It lowercases the text and
    # keeps only the most frequent words, limited by `num_words`.
    # Docs: https://keras.io/api/preprocessing/text/
    tokenizer = text.Tokenizer(num_words=self._vocab_size)
    # Build the internal vocabulary: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts
    tokenizer.fit_on_texts(text_list)
    self._tokenizer = tokenizer

  def transform_text(self, text_list):
    # convert the texts into a NumPy matrix; the default mode is 'binary' (1.0 if the word occurs, else 0.0)
    text_matrix = self._tokenizer.texts_to_matrix(text_list)
    return text_matrix

Overwriting preprocess.py
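
To show what the processor produces, here is a simplified pure-Python sketch of a binary bag-of-words encoding in the spirit of `Tokenizer(num_words=N)` followed by `texts_to_matrix`. It is an approximation under stated assumptions: it splits on whitespace only (the Keras tokenizer also strips punctuation), and the function names `fit_vocab` and `to_binary_matrix` are hypothetical. As in Keras, index 0 is reserved and only words whose rank is below `num_words` are kept.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency (rank 1 = most frequent) and keep ranks < num_words."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_binary_matrix(texts, vocab, num_words):
    """One row per text; column j is 1.0 if the word with index j occurs in it."""
    matrix = []
    for t in texts:
        row = [0.0] * num_words
        for word in t.lower().split():
            idx = vocab.get(word)
            if idx is not None:
                row[idx] = 1.0
        matrix.append(row)
    return matrix

train_texts = ["pandas read excel", "pandas plot bar"]
vocab = fit_vocab(train_texts, num_words=4)
matrix = to_binary_matrix(train_texts, vocab, num_words=4)
print(vocab)
print(matrix)
```

The vocabulary is fitted once on the training texts and then reused unchanged for any later text, which is the reason the fitted processor has to travel with the model.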


In [0]:
# process the whole text data; this takes a while because every row is transformed
from preprocess import TextProcessor

VOCAB_SIZE = 500

train_data = df['text'].values[:train_size]
test_data = df['text'].values[train_size:]

processor = TextProcessor(VOCAB_SIZE)

processor.create_tokens(train_data)

train_data = processor.transform_text(train_data)
test_data = processor.transform_text(test_data)

In [0]:
# get some sample data
print("How many dimensions: ", len(train_data[0]))
print("****")
print(train_data[0])

How many dimensions:  500
****
[0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 0.
 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0.
 1. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.
 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1.
 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0

In [0]:
# save the fitted processor object to disk for later use
import pickle

with open('./processor.pkl', 'wb') as f:
  pickle.dump(processor, f)


## Define the model and training logic

In [0]:
# define the function that builds the model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

def create_model(vocab_size, num_tags):
  model = Sequential()
  model.add(Dense(128, input_shape=(vocab_size, ), activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(.5))
  model.add(Dense(64, activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(.5))
  model.add(Dense(num_tags, activation='softmax'))
  
  model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
  return model

In [0]:
model = create_model(VOCAB_SIZE, num_tags)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 128)               64128     
_________________________________________________________________
batch_normalization_4 (Batch (None, 128)               512       
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 64)                8256      
_________________________________________________________________
batch_normalization_5 (Batch (None, 64)                256       
_________________________________________________________________
dropout_5 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 5)                 325       
Total para
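
The per-layer parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, assuming the standard formulas: a dense layer has one weight per input/output pair plus one bias per output, batch normalization keeps four values per feature (scale, offset, moving mean, moving variance), and dropout has no parameters.

```python
def dense_params(n_in, n_out):
    # one weight per input/output pair plus one bias per output unit
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    # gamma, beta, moving mean, moving variance: four values per feature
    return 4 * n_features

vocab_size, num_tags = 500, 5
layers = [
    ("dense 500->128", dense_params(vocab_size, 128)),
    ("batch_norm 128", batch_norm_params(128)),
    ("dense 128->64", dense_params(128, 64)),
    ("batch_norm 64", batch_norm_params(64)),
    ("dense 64->5", dense_params(64, num_tags)),
]
for name, count in layers:
    print(name, count)

total = sum(count for _, count in layers)
print("sum over the listed layers:", total)
```

The first dense layer dominates the count, so `VOCAB_SIZE` is the main lever on model size in this architecture.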

In [0]:
# train the model, then evaluate it on the test set
his = model.fit(train_data, train_tags, epochs=3, batch_size=128)
eval_res = model.evaluate(test_data, test_tags)
print('Evaluation loss: {}, accuracy:{}'.format(eval_res[0], eval_res[1]))

Epoch 1/3
Epoch 2/3
Epoch 3/3
Evaluation loss: 0.26055393286294815, accuracy:0.8972635494155154


In [0]:
# save the trained model to disk
model.save("keras_model.h5")

In [0]:
if os.path.exists('model_prediction.py'):  # avoid FileNotFoundError on a fresh run
  os.remove('model_prediction.py')

## Write prediction logic class

In [0]:
%%writefile model_prediction.py
# load the pickled processor and the trained model, then serve predictions
import pickle
import os
import numpy as np

class ModelPredictor:
  def __init__(self, model, processor):
    self._model = model
    self._processor = processor

  def predict(self, instances, **kwargs):
    processed_data = self._processor.transform_text(instances)
    predictions = self._model.predict(processed_data)
    return predictions.tolist()  # NumPy arrays are not JSON serializable, so return a plain list

  @classmethod
  def from_path(cls, model_dir):
    import tensorflow as tf
    model = tf.keras.models.load_model(os.path.join(model_dir, 'keras_model.h5'))
    
    with open(os.path.join(model_dir, 'processor.pkl'), 'rb') as f:
      processor = pickle.load(f)
    
    return cls(model, processor)

Overwriting model_prediction.py


In [0]:
test_requests = [
  "How to preprocess strings in Keras models Lambda layer? I have the problem that the value passed on to the Lambda layer (at compile time) is a placeholder generated by keras (without values). When the model is compiled, the .eval () method throws the error: You must feed a value for placeholder tensor 'input_1' with dtype string and shape [?, 1] def text_preprocess(x): strings = tf.keras.backend.eval(x) vectors = [] for string in strings: vector = string_to_one_hot(string.decode('utf-8')) vectors.append(vector) vectorTensor = tf.constant(np.array(vectors),dtype=tf.float32) return vectorTensor input_text = Input(shape=(1,), dtype=tf.string) embedding = Lambda(text_preprocess)(input_text) dense = Dense(256, activation='relu')(embedding) outputs = Dense(2, activation='softmax')(dense) model = Model(inputs=[input_text], outputs=outputs) model.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy']) model.summary() model.save('test.h5') If I pass a string array into the input layer statically, I can compile the model, but I get the same error if I want to convert the model to tflite. #I replaced this line: input_text = Input(shape=(1,), dtype=tf.string) #by this lines: test = tf.constant(['Hello', 'World']) input_text = Input(shape=(1,), dtype=tf.string, tensor=test) #but calling this ... converter = TFLiteConverter.from_keras_model_file('string_test.h5') tfmodel = converter.convert() #... still leads to this error: InvalidArgumentError: You must feed a value for placeholder tensor 'input_3' with dtype string and shape [2] [[{{node input_3}}]] ",
  "Change the bar item name in Pandas I have a test excel file like: df = pd.DataFrame({'name':list('abcdefg'), 'age':[10,20,5,23,58,4,6]}) print (df) name  age 0    a   10 1    b   20 2    c    5 3    d   23 4    e   58 5    f    4 6    g    6 I use Pandas and matplotlib to read and plot it: import pandas as pd import numpy as np import matplotlib.pyplot as plt import os excel_file = 'test.xlsx' df = pd.read_excel(excel_file, sheet_name=0) df.plot(kind='bar') plt.show() the result shows: enter image description here it use index number as item name, how can I change it to the name, which stored in column name?"
]


## Make a sample prediction on the local server

In [0]:
from model_prediction import ModelPredictor

classifier = ModelPredictor.from_path('.')
pred = classifier.predict(test_requests)

for i in range(len(test_requests)):
  pred_labels = encoder.classes_[np.argmax(pred[i])]
  print("Sample: {} as {}... with prediction label: {}".format(i, test_requests[i][:20], pred_labels))

Sample: 0 as How to preprocess st... with prediction label: keras
Sample: 1 as Change the bar item ... with prediction label: pandas


## Deploy our trained model into AI platform

Now that the model is trained, we can serve it from the cloud. Google Cloud runs the model as a managed service, and the most common way to ship our own code to it is to wrap the source files into a package.

In [0]:
%%writefile setup.py

from setuptools import setup

setup(name='tutorial_pred', 
      version='0.1', 
      include_package_data=True, 
      scripts=['preprocess.py', 'model_prediction.py'])

Overwriting setup.py


In [0]:
! pip install google-cloud-storage



## Upload files into bucket

In [0]:
from google.cloud import storage

# upload the trained model and the processor object to our bucket for later use
bucket_name = "first_bucket_lugq"

os.environ['GCLOUD_PROJECT'] = 'cloudtutorial-278306'

client = storage.Client()

bucket = client.bucket(bucket_name)

def upload_file(source_file, des_file):
  blob = bucket.blob(des_file)
  try:
    blob.upload_from_filename(source_file)
    print("File :{} has been uploaded".format(source_file))
  except Exception as e:
    raise RuntimeError("Failed to upload {}: {}".format(source_file, e)) from e

source_file_list = ['keras_model.h5', 'processor.pkl']
for file in source_file_list:
  upload_file(file, os.path.join('model', file))

File :keras_model.h5 has been uploaded
File :processor.pkl has been uploaded


In [0]:
# just to ensure the file has been uploaded
list(bucket.list_blobs())

[<Blob: first_bucket_lugq, keras_model.h5, 1590554313328791>,
 <Blob: first_bucket_lugq, models/, 1590549398786260>,
 <Blob: first_bucket_lugq, models/keras_model.h5, 1590549490950859>,
 <Blob: first_bucket_lugq, package/tutorial_pred-0.1.tar.gz, 1590547407658021>,
 <Blob: first_bucket_lugq, processor.pkl, 1590554314224308>,
 <Blob: first_bucket_lugq, test.zip, 1590405307598367>,
 <Blob: first_bucket_lugq, tmp, 1590404269440879>]

In [0]:
# package the project as a source distribution (a .tar.gz archive under dist/)
!python setup.py sdist

running sdist
running egg_info
writing tutorial_pred.egg-info/PKG-INFO
writing dependency_links to tutorial_pred.egg-info/dependency_links.txt
writing top-level names to tutorial_pred.egg-info/top_level.txt
reading manifest file 'tutorial_pred.egg-info/SOURCES.txt'
writing manifest file 'tutorial_pred.egg-info/SOURCES.txt'

running check


creating tutorial_pred-0.1
creating tutorial_pred-0.1/tutorial_pred.egg-info
copying files to tutorial_pred-0.1...
copying model_prediction.py -> tutorial_pred-0.1
copying preprocess.py -> tutorial_pred-0.1
copying setup.py -> tutorial_pred-0.1
copying tutorial_pred.egg-info/PKG-INFO -> tutorial_pred-0.1/tutorial_pred.egg-info
copying tutorial_pred.egg-info/SOURCES.txt -> tutorial_pred-0.1/tutorial_pred.egg-info
copying tutorial_pred.egg-info/dependency_links.txt -> tutorial_pred-0.1/tutorial_pred.egg-info
copying tutorial_pred.egg-info/top_level.txt -> tutorial_pred-0.1/tutorial_pred.egg-info
Writing tutorial_pred-0.1/setup.cfg
Creating tar archive


In [0]:
# upload the packaged source archive to the bucket
package_file_name = os.listdir('dist')[0]
print(package_file_name)
upload_file(os.path.join('dist', package_file_name), os.path.join('package', package_file_name))

tutorial_pred-0.1.tar.gz
File :dist/tutorial_pred-0.1.tar.gz has been uploaded


In [0]:
# point gcloud at our project
!gcloud config set project cloudtutorial-278306

Updated property [core/project].


## Create model version

Why do we need a model version in AI Platform? A trained model is just a serialized object; to serve it we have to deploy it, and a `version` is one deployed instance of the model in the cloud. A version can also run our own prediction class, like the one we wrote earlier. For the other terms used when deploying a model, see [here](https://cloud.google.com/ai-platform/training/docs/projects-models-versions-jobs).

In [0]:
MODEL_VERSION = 'v12'
MODEL_NAME = "keras_model_tutorial"
BUCKET_NAME = "first_bucket_lugq"

In [0]:
# create the model resource on the platform
!gcloud ai-platform models create $MODEL_NAME


Learn more about regional endpoints and see a list of available regions: https://cloud.google.com/ai-platform/prediction/docs/regional-endpoints
ERROR: (gcloud.ai-platform.models.create) Resource in project [cloudtutorial-278306] is the subject of a conflict: Field: model.name Error: A model with the same name already exists.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: A model with the same name already exists.
    field: model.name


In [0]:
!gcloud beta ai-platform versions create $MODEL_VERSION \
--model $MODEL_NAME \
--runtime-version 1.12 \
--python-version 3.5 \
--origin gs://$BUCKET_NAME/model/ \
--package-uris gs://$BUCKET_NAME/package/tutorial_pred-0.1.tar.gz \
--prediction-class model_prediction.ModelPredictor



## Make predictions based on the trained model

Now that the model is deployed, we can send it some test requests to confirm that it is actually being served.

In [0]:
# the Google API Python client makes this easier
! pip install --upgrade google-api-python-client

Collecting google-api-python-client
  Downloading https://files.pythonhosted.org/packages/91/dc/1207147686a770a867c918fac580e73fd5c8dcac1ec918ebd868b3f62c8b/google_api_python_client-1.8.4-py3-none-any.whl (58kB)
Installing collected packages: google-api-python-client
  Found existing installation: google-api-python-client 1.7.12
    Uninstalling google-api-python-client-1.7.12:
      Successfully uninstalled google-api-python-client-1.7.12
Successfully installed google-api-python-client-1.8.4


## Get prediction online

After deploying the model, we can get predictions with an HTTP POST request. There are several ways to send it, such as `curl` or Python's `requests`.

The flow is as follows. The raw text is serialized as JSON and sent to the remote server. The server runs the preprocessing logic, loads the trained model from the bucket into memory, and produces predictions with our custom prediction class. The response is also a JSON object: if something goes wrong it contains only an `error` key, otherwise it contains a `predictions` key. What `predictions` holds depends on the custom prediction code, which here returns the class probabilities. For details, see [predict](https://cloud.google.com/ai-platform/prediction/docs/reference/rest/v1/projects/predict#tensorflow).
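
The request and response shapes can be sketched without any network call. This is an illustration only: the helper names `build_request_body` and `parse_response` are hypothetical, the class list is copied from the encoder output earlier in this notebook, and `fake_response` is a made-up payload, not output from the deployed model.

```python
import json

# Class order as printed by encoder.classes_ earlier in the notebook
CLASSES = ['keras', 'matplotlib', 'pandas', 'scikitlearn', 'tensorflow']

def build_request_body(texts):
    """Serialize raw texts into the JSON body: {"instances": [...]}."""
    return json.dumps({"instances": list(texts)})

def parse_response(raw_json, classes=CLASSES):
    """Raise on an `error` key, otherwise map each probability row to a class name."""
    response = json.loads(raw_json)
    if 'error' in response:
        raise RuntimeError(response['error'])
    labels = []
    for probs in response['predictions']:
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax without NumPy
        labels.append(classes[best])
    return labels

body = build_request_body(["How to preprocess strings in Keras?"])
# Made-up response for illustration; real probabilities come from the deployed model
fake_response = json.dumps({"predictions": [[0.90, 0.02, 0.03, 0.02, 0.03]]})
print(body)
print(parse_response(fake_response))
```

Keeping the class list next to the decoding step matters because the server returns only probabilities; the client has to know the label order used during training.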

In [0]:
# map each row of predicted probabilities to its class name
def get_pred_class(predictions):
  for pred in predictions:
    yield encoder.classes_[np.argmax(pred)]

In [0]:
# we could use google api to get the prediction result.
import googleapiclient.discovery

project_id = "cloudtutorial-278306"

# find the service and build the object
service = googleapiclient.discovery.build('ml', 'v1')
name = "projects/{}/models/{}/versions/{}".format(project_id, MODEL_NAME, MODEL_VERSION)

# get the prediction response for the sample texts defined earlier
request_data = test_requests
response = service.projects().predict(name=name, body={'instances': request_data}).execute()

if 'error' in response:
  raise RuntimeError(response['error'])
else:
  # by default will return the prediction probabilities
  pred = get_pred_class(response['predictions'])
  for i, p_class in enumerate(pred):
    print("Sample text: {}.. With prediction class: `{}`".format(request_data[i][:20], p_class))

Sample text: How to preprocess st.. With prediction class: `keras`
Sample text: Change the bar item .. With prediction class: `pandas`


## Final words

This notebook mainly covers the model prediction logic. There are many other features that it does not cover, so code more and learn more.