## Using a pre-trained embedding from the TensorFlow Hub to categorise books

Final modelling scenario whereby I am utilzing a pre-trained embedding available from the TensorFlow Hub and code created by AIEngineering [online] available at https://www.youtube.com/watch?v=dkpS2g4K08s. This code csreate an end to end NLP pipeline starting from cleaning text data, setting NLP pipeline, model selection and model evaluation while handling handling imbalanced a dataset.

## 1. Import required packages

In [13]:
# Data manipulation
import pandas as pd
from copy import copy
import re
import string
from functools import reduce

# Numeric manipulation
import numpy as np
import math

# Charts
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import matplotlib.pylab as plt1
# Installed wordcloud in a terminal in Jupyter with this powershell command:
# PS C:\Users\jmd05\Documents>cd C:\Users\jmd05\anaconda3
# PS C:\Users\jmd05\anaconda3\> conda install -c conda-forge wordcloud=1.6.0 

# Web scraping & APIs
headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"}
from bs4 import BeautifulSoup
import requests
from isbnlib import meta, desc, info, is_isbn13, classify
from isbnlib.registry import bibformatters
import time

# NLP
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords, webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud 
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from scipy.spatial import distance_matrix
from nltk.stem import WordNetLemmatizer
# Map any punctuation characters to white space
translator=str.maketrans(string.punctuation, ' '*len(string.punctuation)) 

# Other
import warnings

# AIEngineering
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import os
import datetime
import tensorflow_hub as hub
import numpy as np   
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight

## 2. Get cleaned data from CW1

In [7]:
df=pd.read_csv(r"C:\Users\jmd05\DSM-020\4. CW1\Data\all1.csv", index_col=0)
df.shape

(656, 7)

In [8]:
df.head()

Unnamed: 0,Title,Synopsis,Subject,ISBN,Synopsis1,Synopsis2,Synopsis2_len
0,my life in red and white,for the very first time world renowned and rev...,sports leisure,9781474618267,"['first', 'time', 'world', 'renowned', 'revolu...","['first', 'time', 'world', 'renowned', 'revolu...",149
1,the accidental footballer,pat nevin never wanted to be a professional fo...,sports leisure,9781913183370,"['pat', 'nevin', 'never', 'wanted', 'professio...","['pat', 'nevin', 'never', 'want', 'professiona...",98
2,sooley,one man seventeen year old samuel sooleyman co...,sports leisure,9781529368000,"['one', 'man', 'seventeen', 'year', 'old', 'sa...","['one', 'man', 'seventeen', 'year', 'old', 'sa...",105
3,mortimer whitehouse gone fishing life death an...,two comedy greats talk life friendship and the...,sports leisure,9781788702942,"['two', 'comedy', 'greats', 'talk', 'life', 'f...","['two', 'comedy', 'greats', 'talk', 'life', 'f...",109
4,the accidental footballer signed edition,signed edition a standard edition is available...,sports leisure,9781800960114,"['signed', 'edition', 'standard', 'edition', '...","['sign', 'edition', 'standard', 'edition', 'av...",103


In [10]:
df.dtypes
# See that the target 'Subject' is of 'object' format

Title            object
Synopsis         object
Subject          object
ISBN              int64
Synopsis1        object
Synopsis2        object
Synopsis2_len     int64
dtype: object

In [20]:
df['Subject'].value_counts(dropna=False)
# Slight imbalance

romantic fiction               92
history                        91
sports leisure                 89
food drink                     88
entertainment                  79
spirituality beliefs           75
science technology medicine    72
business finance law           70
Name: Subject, dtype: int64

In [28]:
# Look at adding rebalancing weights
class_weights=list(class_weight.compute_class_weight('balanced', np.unique(df['Subject']), df['Subject']))
class_weights.sort()
class_weights

[0.8913043478260869,
 0.9010989010989011,
 0.9213483146067416,
 0.9318181818181818,
 1.0379746835443038,
 1.0933333333333333,
 1.1388888888888888,
 1.1714285714285715]

In [29]:
weights={}
for index, weight in enumerate(class_weights) :
  weights[index]=weight
weights

{0: 0.8913043478260869,
 1: 0.9010989010989011,
 2: 0.9213483146067416,
 3: 0.9318181818181818,
 4: 1.0379746835443038,
 5: 1.0933333333333333,
 6: 1.1388888888888888,
 7: 1.1714285714285715}

In [32]:
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

In [33]:
dataset_train = tf.data.Dataset.from_tensor_slices((X_train['Synopsis'].values, X_train['Subject'].values))
dataset_test = tf.data.Dataset.from_tensor_slices((X_test['Synopsis'].values, X_test['Subject'].values))

In [35]:
for text, target in dataset_train.take(5):
  print ('Synopsis: {}, Subject: {}'.format(text, target))

Synopsis: b'for curious readers young and old a rich and colorful history of religion from humanity s earliest days to our own contentious times in an era of hardening religious attitudes and explosive religious violence this book offers a welcome antidote richard holloway retells the entire history of religion from the dawn of religious belief to the twenty first century with deepest respect and a keen commitment to accuracy writing for those with faith and those without and especially for young readers he encourages curiosity and tolerance accentuates nuance and mystery and calmly restores a sense of the value of faith ranging far beyond the major world religions of judaism islam christianity buddhism and hinduism holloway also examines where religious belief comes from the search for meaning throughout history today s fascinations with scientology and creationism religiously motivated violence hostilities between religious people and secularists and more holloway proves an empathic 

In [36]:
table = tf.lookup.StaticHashTable(
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(['romantic fiction','history','sports leisure','food drink','entertainment','spirituality beliefs',
                          'science technology medicine','business finance law']), values=tf.constant([0,1,2,3,4,5,6,7]),
    ),
    default_value=tf.constant(-1),
    name="target_encoding"
)

@tf.function
def target(x):
  return table.lookup(x)

In [38]:
def show_batch(dataset, size=5):
  for batch, label in dataset.take(size):
      print(batch.numpy())
      print(target(label).numpy())

In [39]:
show_batch(dataset_test)

b'the old world dying on its feet a new one struggling to be born dublin 1918 in a country doubly ravaged by war and disease nurse julia power works at an understaffed hospital in the city centre where expectant mothers who have come down with an unfamiliar flu are quarantined together into julia s regimented world step two outsiders doctor kathleen lynn on the run from the police and a young volunteer helper bridie sweeney in the darkness and intensity of this tiny ward over the course of three days these women change each other s lives in unexpected ways they lose patients to this baffling pandemic but they also shepherd new life into a fearful world with tireless tenderness and humanity carers and mothers alike somehow do their impossible work in the pull of the stars emma donoghue tells an unforgettable and deeply moving story of love and loss from the bestselling author of the wonder and room'
0
b'the mesmerising new york times bestseller each year eight beautiful girls are chosen

In [40]:
def fetch(text, labels):
  return text, tf.one_hot(target(labels),8)

In [41]:
train_data_f=dataset_train.map(fetch)
test_data_f=dataset_test.map(fetch)

In [42]:
next(iter(train_data_f))

(<tf.Tensor: shape=(), dtype=string, numpy=b'for curious readers young and old a rich and colorful history of religion from humanity s earliest days to our own contentious times in an era of hardening religious attitudes and explosive religious violence this book offers a welcome antidote richard holloway retells the entire history of religion from the dawn of religious belief to the twenty first century with deepest respect and a keen commitment to accuracy writing for those with faith and those without and especially for young readers he encourages curiosity and tolerance accentuates nuance and mystery and calmly restores a sense of the value of faith ranging far beyond the major world religions of judaism islam christianity buddhism and hinduism holloway also examines where religious belief comes from the search for meaning throughout history today s fascinations with scientology and creationism religiously motivated violence hostilities between religious people and secularists and 

In [43]:
train_data, train_labels = next(iter(train_data_f.batch(5)))
train_data, train_labels

(<tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b'for curious readers young and old a rich and colorful history of religion from humanity s earliest days to our own contentious times in an era of hardening religious attitudes and explosive religious violence this book offers a welcome antidote richard holloway retells the entire history of religion from the dawn of religious belief to the twenty first century with deepest respect and a keen commitment to accuracy writing for those with faith and those without and especially for young readers he encourages curiosity and tolerance accentuates nuance and mystery and calmly restores a sense of the value of faith ranging far beyond the major world religions of judaism islam christianity buddhism and hinduism holloway also examines where religious belief comes from the search for meaning throughout history today s fascinations with scientology and creationism religiously motivated violence hostilities between religious people and secul

In [44]:
embedding = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1"
hub_layer = hub.KerasLayer(embedding, output_shape=[128], input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_data[:1])

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[ 1.64101779e+00,  1.09652907e-01,  1.00001007e-01,
        -2.85438001e-01,  6.83002546e-02,  5.10165952e-02,
         2.34609097e-01,  3.72527400e-03, -1.07460707e-01,
         3.07002336e-01,  3.48766237e-01, -2.98062950e-01,
        -5.96389949e-01, -1.90483078e-01, -3.67415577e-01,
        -6.65464848e-02, -5.73120952e-01,  1.07594058e-01,
        -1.51422203e-01,  9.72040236e-01,  1.83290526e-01,
        -9.65867490e-02, -2.24603061e-02, -2.23811105e-01,
         1.43597379e-01, -5.74837267e-01,  4.49375093e-01,
         1.04780287e-01, -5.34029715e-02, -1.64000601e-01,
        -1.28825128e-01,  2.62997672e-02,  1.78510562e-01,
        -4.91137467e-02,  2.73741871e-01, -2.00332642e-01,
        -1.90475762e-01, -2.87243515e-01, -5.72264344e-02,
         3.14799875e-01, -3.67510706e-01, -2.53344864e-01,
        -1.78415611e-01,  3.23827803e-01,  1.40781835e-01,
        -1.49370342e-01, -1.19922146e-01, -4.04885337e-02,
      

In [45]:
model = tf.keras.Sequential()
model.add(hub_layer)
for units in [128, 128, 64 , 32]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))
  model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(8, activation='softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 128)               124642688 
_________________________________________________________________
dense (Dense)                (None, 128)               16512     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0

In [46]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [47]:
train_data_f=train_data_f.shuffle(70000).batch(512)
test_data_f=test_data_f.batch(512)

In [48]:
history = model.fit(train_data_f,
                    epochs=4,
                    validation_data=test_data_f,
                    verbose=1,
                    class_weight=weights)

Epoch 1/4




ResourceExhaustedError:  OOM when allocating tensor with shape[973771,128] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[node Adam/Adam/update/mul_1 (defined at <ipython-input-48-6a6eda8a7550>:1) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_1408]

Function call stack:
train_function


In [None]:
len(list(dataset_test))

11491

In [None]:
results = model.evaluate(dataset_test.map(fetch).batch(11491), verbose=2)

print(results)

1/1 - 0s - loss: 1.1696 - accuracy: 0.8739
[1.1696072816848755, 0.8739013075828552]


In [None]:
test_data, test_labels = next(iter(dataset_test.map(fetch).batch(45963)))

In [None]:
y_pred=model.predict(test_data)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1)))

              precision    recall  f1-score   support

           0       0.94      0.89      0.91      4295
           1       0.85      0.86      0.86      2583
           2       0.93      0.92      0.92      2015
           3       0.87      0.82      0.85      1461
           4       0.81      0.84      0.82       611
           5       0.55      0.84      0.66       526

    accuracy                           0.87     11491
   macro avg       0.82      0.86      0.84     11491
weighted avg       0.88      0.87      0.88     11491



Classification Report with no class weights assigned
           


```
              precision    recall  f1-score   support

           0       0.92      0.91      0.92      4295
           1       0.84      0.88      0.86      2583
           2       0.90      0.94      0.92      2015
           3       0.86      0.84      0.85      1461
           4       0.86      0.78      0.81       611
           5       0.69      0.62      0.65       526

    accuracy                           0.88     11491
   macro avg       0.85      0.83      0.84     11491
weighted avg       0.88      0.88      0.88     11491
```




In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels.numpy().argmax(axis=1), y_pred.argmax(axis=1))

array([[3815,  167,   76,   32,   80,  125],
       [ 125, 2222,   13,  109,   10,  104],
       [  37,   21, 1848,   23,   21,   65],
       [  37,  162,   30, 1204,    5,   23],
       [  18,   18,   11,    4,  511,   49],
       [  29,   21,   17,   10,    7,  442]])

Confusion matrix without weights assigned

```
array([[3910,  165,   80,   34,   50,   56],
       [ 128, 2274,   20,  128,    3,   30],
       [  36,   27, 1893,   18,   16,   25],
       [  41,  149,   29, 1227,    5,   10],
       [  41,   35,   30,    6,  474,   25],
       [  72,   50,   60,   14,    5,  325]])
```

