## Using a pre-trained embedding from the TensorFlow Hub to categorise books

In this final modelling scenario I use a pre-trained embedding from TensorFlow Hub together with code created by AIEngineering, available online at https://www.youtube.com/watch?v=dkpS2g4K08s. The code creates an end-to-end NLP pipeline: cleaning the text data, setting up the NLP pipeline, model selection and model evaluation, while handling an imbalanced dataset.

## 1. Import required packages

In [2]:
# AIEngineering
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import os
import datetime
import tensorflow_hub as hub
import numpy as np   
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import urllib.request, json

## 2. Get cleaned data from CW1

In [3]:
df=pd.read_csv(r"C:\Users\jmd05\DSM-020\4. CW1\Data\all1.csv", index_col=0)
df.shape

(656, 7)

In [4]:
df.head()

Unnamed: 0,Title,Synopsis,Subject,ISBN,Synopsis1,Synopsis2,Synopsis2_len
0,my life in red and white,for the very first time world renowned and rev...,sports leisure,9781474618267,"['first', 'time', 'world', 'renowned', 'revolu...","['first', 'time', 'world', 'renowned', 'revolu...",149
1,the accidental footballer,pat nevin never wanted to be a professional fo...,sports leisure,9781913183370,"['pat', 'nevin', 'never', 'wanted', 'professio...","['pat', 'nevin', 'never', 'want', 'professiona...",98
2,sooley,one man seventeen year old samuel sooleyman co...,sports leisure,9781529368000,"['one', 'man', 'seventeen', 'year', 'old', 'sa...","['one', 'man', 'seventeen', 'year', 'old', 'sa...",105
3,mortimer whitehouse gone fishing life death an...,two comedy greats talk life friendship and the...,sports leisure,9781788702942,"['two', 'comedy', 'greats', 'talk', 'life', 'f...","['two', 'comedy', 'greats', 'talk', 'life', 'f...",109
4,the accidental footballer signed edition,signed edition a standard edition is available...,sports leisure,9781800960114,"['signed', 'edition', 'standard', 'edition', '...","['sign', 'edition', 'standard', 'edition', 'av...",103


In [5]:
df.dtypes
# The target 'Subject' has dtype 'object' (strings), so it needs encoding before training

Title            object
Synopsis         object
Subject          object
ISBN              int64
Synopsis1        object
Synopsis2        object
Synopsis2_len     int64
dtype: object

In [6]:
df['Subject'].value_counts(dropna=False)
# Slight imbalance

romantic fiction               92
history                        91
sports leisure                 89
food drink                     88
entertainment                  79
spirituality beliefs           75
science technology medicine    72
business finance law           70
Name: Subject, dtype: int64

In [7]:
# Look at adding rebalancing weights
# Keyword arguments avoid the scikit-learn deprecation warning for positional arguments
class_weights=list(class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(df['Subject']), y=df['Subject']))
# Ascending sort puts the weight of the most frequent class first
class_weights.sort()
class_weights


[0.8913043478260869,
 0.9010989010989011,
 0.9213483146067416,
 0.9318181818181818,
 1.0379746835443038,
 1.0933333333333333,
 1.1388888888888888,
 1.1714285714285715]

In [8]:
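# Key the sorted weights by position. This matches the label ids of the lookup table
# defined below only because that table lists the classes from most to least frequent.
weights={}
for index, weight in enumerate(class_weights) :
  weights[index]=weight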
weights

{0: 0.8913043478260869,
 1: 0.9010989010989011,
 2: 0.9213483146067416,
 3: 0.9318181818181818,
 4: 1.0379746835443038,
 5: 1.0933333333333333,
 6: 1.1388888888888888,
 7: 1.1714285714285715}
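The weights above follow scikit-learn's 'balanced' heuristic, `n_samples / (n_classes * class_count)`. The sketch below recomputes them in plain Python from the counts printed by `value_counts()`, and keys them by label id through an explicit name list rather than by sort order. The names `counts`, `label_order`, `weights_by_name` and `weights_by_id` are illustrative and not part of the original notebook; `label_order` is copied from the lookup table defined later.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "romantic fiction": 92,
    "history": 91,
    "sports leisure": 89,
    "food drink": 88,
    "entertainment": 79,
    "spirituality beliefs": 75,
    "science technology medicine": 72,
    "business finance law": 70,
}

# Assumed label order: the key order of the lookup table used for target encoding.
label_order = [
    "romantic fiction", "history", "sports leisure", "food drink",
    "entertainment", "spirituality beliefs",
    "science technology medicine", "business finance law",
]

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' weight: rarer classes get proportionally larger weights.
weights_by_name = {name: n_samples / (n_classes * c) for name, c in counts.items()}

# Key by integer label id so the dict can be passed as class_weight to model.fit.
weights_by_id = {i: weights_by_name[name] for i, name in enumerate(label_order)}

for i, name in enumerate(label_order):
    print(i, name, round(weights_by_id[i], 4))
```

Keying by name first keeps each weight attached to its class even if the label order changes.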

In [9]:
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

In [10]:
#df.shape      # 656
#X_train.shape # 459
#X_test.shape  # 197

In [11]:
dataset_train = tf.data.Dataset.from_tensor_slices((X_train['Synopsis'].values, X_train['Subject'].values))
dataset_test = tf.data.Dataset.from_tensor_slices((X_test['Synopsis'].values, X_test['Subject'].values))

In [14]:
for text, target in dataset_train.take(1):
  print ('Synopsis: {}, Subject: {}'.format(text, target))

Synopsis: b'for curious readers young and old a rich and colorful history of religion from humanity s earliest days to our own contentious times in an era of hardening religious attitudes and explosive religious violence this book offers a welcome antidote richard holloway retells the entire history of religion from the dawn of religious belief to the twenty first century with deepest respect and a keen commitment to accuracy writing for those with faith and those without and especially for young readers he encourages curiosity and tolerance accentuates nuance and mystery and calmly restores a sense of the value of faith ranging far beyond the major world religions of judaism islam christianity buddhism and hinduism holloway also examines where religious belief comes from the search for meaning throughout history today s fascinations with scientology and creationism religiously motivated violence hostilities between religious people and secularists and more holloway proves an empathic 

In [15]:
table = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(['romantic fiction','history','sports leisure','food drink','entertainment','spirituality beliefs',
                          'science technology medicine','business finance law']), values=tf.constant([0,1,2,3,4,5,6,7]),
    ),
    default_value=tf.constant(-1),
    name="target_encoding"
)

@tf.function
def target(x):
  return table.lookup(x)

In [17]:
table


<tensorflow.python.ops.lookup_ops.StaticHashTable at 0x1252bf70>

In [19]:
def show_batch(dataset, size=1):
  for batch, label in dataset.take(size):
      print(batch.numpy())
      print(target(label).numpy())

In [20]:
show_batch(dataset_test)

b'the old world dying on its feet a new one struggling to be born dublin 1918 in a country doubly ravaged by war and disease nurse julia power works at an understaffed hospital in the city centre where expectant mothers who have come down with an unfamiliar flu are quarantined together into julia s regimented world step two outsiders doctor kathleen lynn on the run from the police and a young volunteer helper bridie sweeney in the darkness and intensity of this tiny ward over the course of three days these women change each other s lives in unexpected ways they lose patients to this baffling pandemic but they also shepherd new life into a fearful world with tireless tenderness and humanity carers and mothers alike somehow do their impossible work in the pull of the stars emma donoghue tells an unforgettable and deeply moving story of love and loss from the bestselling author of the wonder and room'
0


In [21]:
def fetch(text, labels):
  return text, tf.one_hot(target(labels),8)
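As a rough illustration of what `fetch` does to each label, the plain-Python sketch below mirrors the two steps: a string-to-id lookup with a default of -1, followed by one-hot encoding to a vector of length 8. `LABELS`, `LABEL_TO_ID` and `encode_label` are hypothetical names used only here. An unknown label yields an all-zero vector, which is how `tf.one_hot` treats an out-of-range index such as -1.

```python
# Label order copied from the StaticHashTable keys above.
LABELS = [
    "romantic fiction", "history", "sports leisure", "food drink",
    "entertainment", "spirituality beliefs",
    "science technology medicine", "business finance law",
]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def encode_label(name, depth=len(LABELS)):
    """Return a one-hot list for a label; unknown labels map to all zeros."""
    idx = LABEL_TO_ID.get(name, -1)  # same default as the hash table
    return [1.0 if i == idx else 0.0 for i in range(depth)]


print(encode_label("history"))
print(encode_label("not a known subject"))
```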

In [22]:
train_data_f=dataset_train.map(fetch)
test_data_f=dataset_test.map(fetch)

In [24]:
next(iter(test_data_f))

(<tf.Tensor: shape=(), dtype=string, numpy=b'the old world dying on its feet a new one struggling to be born dublin 1918 in a country doubly ravaged by war and disease nurse julia power works at an understaffed hospital in the city centre where expectant mothers who have come down with an unfamiliar flu are quarantined together into julia s regimented world step two outsiders doctor kathleen lynn on the run from the police and a young volunteer helper bridie sweeney in the darkness and intensity of this tiny ward over the course of three days these women change each other s lives in unexpected ways they lose patients to this baffling pandemic but they also shepherd new life into a fearful world with tireless tenderness and humanity carers and mothers alike somehow do their impossible work in the pull of the stars emma donoghue tells an unforgettable and deeply moving story of love and loss from the bestselling author of the wonder and room'>,
 <tf.Tensor: shape=(8,), dtype=float32, num

In [18]:
train_data, train_labels = next(iter(train_data_f.batch(5)))
train_data, train_labels

(<tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b'for curious readers young and old a rich and colorful history of religion from humanity s earliest days to our own contentious times in an era of hardening religious attitudes and explosive religious violence this book offers a welcome antidote richard holloway retells the entire history of religion from the dawn of religious belief to the twenty first century with deepest respect and a keen commitment to accuracy writing for those with faith and those without and especially for young readers he encourages curiosity and tolerance accentuates nuance and mystery and calmly restores a sense of the value of faith ranging far beyond the major world religions of judaism islam christianity buddhism and hinduism holloway also examines where religious belief comes from the search for meaning throughout history today s fascinations with scientology and creationism religiously motivated violence hostilities between religious people and secul

In [19]:
embedding = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1"
hub_layer = hub.KerasLayer(embedding, output_shape=[128], input_shape=[], 
                           dtype=tf.string, trainable=True)
#hub_layer(train_data[:1])

In [20]:
model = tf.keras.Sequential()
model.add(hub_layer)
#for units in [128,64,32]:
for units in [32]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))
  model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(8, activation='softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 128)               124642688 
_________________________________________________________________
dense (Dense)                (None, 32)                4128      
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 264       
Total params: 124,647,080
Trainable params: 124,647,080
Non-trainable params: 0
_________________________________________________________________
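The parameter counts in the summary can be checked by hand. A dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below reproduces the two dense counts and the total. It also divides the embedding count by the output width of 128; reading the result as a vocabulary size assumes the hub layer is a single embedding matrix, which the notebook does not confirm.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


embedding_params = 124_642_688   # copied from the model summary
embedding_dim = 128

hidden_params = dense_params(embedding_dim, 32)
output_params = dense_params(32, 8)
total_params = embedding_params + hidden_params + output_params

# Rows implied if the hub layer were one (rows x 128) matrix (assumption).
implied_rows, remainder = divmod(embedding_params, embedding_dim)

print(hidden_params, output_params, total_params)
print(implied_rows, remainder)
```

Because `trainable=True` is set on the hub layer, nearly all of the trainable parameters sit in the embedding rather than in the dense layers.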


In [21]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [22]:
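# The final Dense layer already applies softmax, so the loss receives probabilities rather than logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])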
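The `from_logits` flag tells the loss whether its input is raw scores or probabilities. The plain-Python sketch below shows what happens if softmax is applied a second time to values that are already probabilities: the distribution is flattened, so the loss cannot approach zero even for a perfect prediction. This is only an illustration of the arithmetic. Whether a given Keras version actually re-applies softmax, or instead detects the softmax layer and warns, is an assumption not confirmed by the notebook output.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(probs, true_idx):
    """Categorical cross-entropy for a single one-hot target."""
    return -math.log(probs[true_idx])


# A confident, correct prediction for class 0 out of 8 classes.
logits = [4.0] + [0.0] * 7
probs = softmax(logits)

loss_once = cross_entropy(probs, 0)             # softmax applied once
loss_twice = cross_entropy(softmax(probs), 0)   # softmax applied twice

# Lowest loss reachable when softmax is applied to a perfect one-hot output.
loss_floor = cross_entropy(softmax([1.0] + [0.0] * 7), 0)

print(loss_once, loss_twice, loss_floor)
```

With 8 classes the floor equals `log(1 + 7 / e)`, so under double softmax the training loss would stay above roughly 1.27 however well the model fits.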

In [23]:
train_data_f=train_data_f.shuffle(70000).batch(100)
test_data_f=test_data_f.batch(100)
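For orientation, the batch size of 100 determines how many steps each epoch takes. The short sketch below works this out from the split sizes noted earlier (459 training rows, 197 test rows). The variable names are illustrative. Since the shuffle buffer of 70000 exceeds the 459 training rows, the whole training set is shuffled each epoch.

```python
import math

n_train, n_test, batch_size = 459, 197, 100

# Number of batches per pass; the last batch holds the remainder.
train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)
last_train_batch = n_train - (train_steps - 1) * batch_size
last_test_batch = n_test - (test_steps - 1) * batch_size

print(train_steps, last_train_batch)
print(test_steps, last_test_batch)
```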

In [24]:
history = model.fit(train_data_f,
                    epochs=3,
                    validation_data=test_data_f,
                    verbose=1,
                    class_weight=weights,
                    callbacks=[tensorboard_callback])

Epoch 1/3
Epoch 2/3
Epoch 3/3

In [25]:
len(list(dataset_test))

197

In [26]:
results = model.evaluate(dataset_test.map(fetch).batch(197), verbose=2)

print(results)

1/1 - 0s - loss: 1.9130 - accuracy: 0.3553
[1.9130250215530396, 0.3553299605846405]


In [27]:
# Pull the whole test set (197 rows) as a single batch
test_data, test_labels = next(iter(dataset_test.map(fetch).batch(197)))

In [28]:
y_pred=model.predict(test_data)

In [29]:
from sklearn.metrics import classification_report

In [30]:
print(classification_report(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1)))

              precision    recall  f1-score   support

           0       0.46      0.88      0.60        24
           1       0.27      0.80      0.41        25
           2       0.93      0.35      0.51        37
           3       1.00      0.14      0.24        22
           4       0.00      0.00      0.00        32
           5       0.67      0.17      0.27        24
           6       0.14      0.50      0.22        14
           7       0.50      0.11      0.17        19

    accuracy                           0.36       197
   macro avg       0.50      0.37      0.30       197
weighted avg       0.52      0.36      0.31       197





In [31]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1))

array([[21,  3,  0,  0,  0,  0,  0,  0],
       [ 2, 20,  1,  0,  0,  0,  1,  1],
       [ 0, 18, 13,  0,  0,  1,  5,  0],
       [ 1,  0,  0,  3,  0,  0, 18,  0],
       [12, 10,  0,  0,  0,  1,  9,  0],
       [ 4,  7,  0,  0,  0,  4,  8,  1],
       [ 3,  4,  0,  0,  0,  0,  7,  0],
       [ 3, 11,  0,  0,  0,  0,  3,  2]], dtype=int64)

In [33]:
conf_matrix = tf.math.confusion_matrix(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1))
conf_matrix

<tf.Tensor: shape=(8, 8), dtype=int32, numpy=
array([[21,  3,  0,  0,  0,  0,  0,  0],
       [ 2, 20,  1,  0,  0,  0,  1,  1],
       [ 0, 18, 13,  0,  0,  1,  5,  0],
       [ 1,  0,  0,  3,  0,  0, 18,  0],
       [12, 10,  0,  0,  0,  1,  9,  0],
       [ 4,  7,  0,  0,  0,  4,  8,  1],
       [ 3,  4,  0,  0,  0,  0,  7,  0],
       [ 3, 11,  0,  0,  0,  0,  3,  2]])>
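The figures in the classification report can be derived from this confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes accuracy, per-class recall (diagonal over row sum) and per-class precision (diagonal over column sum) in plain Python. `safe_div` is a hypothetical helper that returns 0 when a class is never predicted, as happens for class 4.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [21,  3,  0,  0,  0,  0,  0,  0],
    [ 2, 20,  1,  0,  0,  0,  1,  1],
    [ 0, 18, 13,  0,  0,  1,  5,  0],
    [ 1,  0,  0,  3,  0,  0, 18,  0],
    [12, 10,  0,  0,  0,  1,  9,  0],
    [ 4,  7,  0,  0,  0,  4,  8,  1],
    [ 3,  4,  0,  0,  0,  0,  7,  0],
    [ 3, 11,  0,  0,  0,  0,  3,  2],
]
n = len(cm)


def safe_div(a, b):
    """Divide, returning 0.0 when the denominator is zero."""
    return a / b if b else 0.0


total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

recall = [safe_div(cm[i][i], sum(cm[i])) for i in range(n)]
precision = [
    safe_div(cm[i][i], sum(cm[r][i] for r in range(n))) for i in range(n)
]

print(round(accuracy, 4))
for i in range(n):
    print(i, round(precision[i], 2), round(recall[i], 2))
```

Column 1 and column 6 collect many off-diagonal counts, which is why classes 1 and 6 show high recall but low precision in the report.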

In [104]:
#!pip install tensorboard



In [106]:
#!pip show tensorboard

Name: tensorboard
Version: 2.6.0
Summary: TensorBoard lets you watch Tensors Flow
Home-page: https://github.com/tensorflow/tensorboard
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: c:\users\jmd05\anaconda3\lib\site-packages
Requires: absl-py, google-auth-oauthlib, numpy, markdown, setuptools, google-auth, protobuf, tensorboard-plugin-wit, tensorboard-data-server, grpcio, werkzeug, wheel, requests
Required-by: tensorflow


In [110]:
# A bare 'python ...' line is not valid inside a notebook cell; use the '!' shell escape instead
#!python -m tensorboard.main --logdir "C:\Users\jmd05\Anaconda3\Lib\site-packages\tensorboard\logs\fit"

In [39]:
# Windows has no 'kill' command; taskkill is the equivalent
#!taskkill /PID 7700 /F


In [40]:
%reload_ext tensorboard
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 9936), started 0:04:27 ago. (Use '!kill 9936' to kill it.)