In [32]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
import tensorflow_text as text
import pandas as pd

from sklearn.model_selection import train_test_split

#### *Reading the dataset*

In [33]:
df = pd.read_csv("data/train-set-cat1-processed.csv")
df.head()

Unnamed: 0,text,label
0,ionospheric correction space radar data mike h...,4
1,scissors mode trapped dipolar gas mitsuru tohy...,4
2,evidence existence many pure ground state pm j...,4
3,migration induced epidemic dynamic fluxbased m...,4
4,degree freedom cognitive radio channel natasha...,3


#### *Retrieving the encoder and preprocessor of the BERT model from TensorFlow Hub*

We retrieve the BERT preprocessor and encoder from TensorFlow Hub. The encoder we use has 12 transformer layers, a hidden size of 768 and 12 attention heads (L-12, H-768, A-12), and it was pre-trained on English Wikipedia and BooksCorpus. It is an uncased model: the text is lower-cased and stripped of accent markers before tokenization.

In [49]:
bert_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
bert_encoder = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

#### *Splitting into train set and test set*

Next, we divide the dataset into a train set and a test set. First we separate the _text_ column from the _label_ column and one-hot encode the labels; then we split both, keeping 80% of the rows for training and the remaining 20% for testing.

In [46]:
# Divide text and label columns
num_classes = len(df['label'].value_counts())
X = df['text']
y = tf.keras.utils.to_categorical(df["label"].values, num_classes=num_classes)
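One-hot labels such as those produced by `to_categorical` pair with categorical cross-entropy, while plain integer labels pair with the sparse variant. The NumPy sketch below illustrates that the two give the same value on the same predictions. The helper functions and the toy labels and probabilities are made up for illustration and are not part of the notebook.

```python
import numpy as np

def one_hot(labels, num_classes):
    """NumPy stand-in for tf.keras.utils.to_categorical (illustrative only)."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def categorical_crossentropy(y_one_hot, probs):
    """Mean cross-entropy when targets are one-hot rows."""
    return float(-np.mean(np.sum(y_one_hot * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(y_int, probs):
    """Mean cross-entropy when targets are integer class ids."""
    return float(-np.mean(np.log(probs[np.arange(len(y_int)), y_int])))

# Toy labels and predicted probabilities (hypothetical values).
toy_labels = np.array([4, 3, 0])
toy_probs = np.full((3, 8), 0.05)
toy_probs[np.arange(3), toy_labels] = 0.65  # each row now sums to 1.0

toy_one_hot = one_hot(toy_labels, num_classes=8)
loss_dense = categorical_crossentropy(toy_one_hot, toy_probs)
loss_sparse = sparse_categorical_crossentropy(toy_labels, toy_probs)
print(toy_one_hot.shape, loss_dense, loss_sparse)
```

The practical consequence is that the loss name passed to `model.compile` has to match the label format fed to `model.fit`.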

In [47]:
# Splitting into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(43236,) (10809,) (43236, 8) (10809, 8)
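The split above is random and unseeded, so each run yields different rows and the class proportions may drift between the two sets. A possible variant, sketched here on toy data, fixes `random_state` and passes `stratify` so both sets keep roughly the same label distribution. The toy arrays and the seed value are assumptions; in the notebook the stratification key would be the integer `label` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's texts and integer labels (hypothetical data).
rng = np.random.default_rng(0)
toy_labels = rng.choice(8, size=1000, p=[0.3, 0.2, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05])
toy_texts = np.array([f"document {i}" for i in range(1000)])

# random_state makes the split reproducible; stratify preserves class shares.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Compare the per-class share in each split.
train_share = np.bincount(labels_train, minlength=8) / len(labels_train)
test_share = np.bincount(labels_test, minlength=8) / len(labels_test)
print(np.round(train_share, 3))
print(np.round(test_share, 3))
```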


#### *Architecture definition*

Now we define the model. It is composed of:
- An **input layer**: Brings each line of text into the model as a string tensor.
- A **preprocessing layer**: Text input must be tokenized and packed into tensors before it reaches the encoder. TensorFlow Hub provides a matching preprocessor for each available model. The outputs of this layer are *input_word_ids*, *input_mask* and *input_type_ids*.
- An **encoding layer**: Captures contextual information from the input text. It consists of multiple stacked transformer encoder layers; in our case there are 12, but other BERT models have more or fewer.
- A **dropout layer**: Reduces overfitting and improves generalization.
- A **dense layer**: Maps the pooled encoder output to the classes. It uses a *softmax* activation function, so it returns one probability per class.
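To make the three preprocessing outputs concrete, here is a plain-Python sketch of how a single text segment is laid out. It only mimics the layout and performs no real tokenization. The function name and the toy token ids are hypothetical. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) and the default sequence length of 128 are those of the uncased English BERT vocabulary and preprocessor as I understand them.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the uncased BERT vocabulary

def pack_single_segment(token_ids, seq_length=128):
    """Mimic the layout of the preprocessor output for one text segment."""
    body = list(token_ids)[: seq_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    padding = seq_length - len(ids)
    return {
        "input_word_ids": ids + [PAD_ID] * padding,    # token ids, padded with 0
        "input_mask": [1] * len(ids) + [0] * padding,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,            # all 0 for a single segment
    }

toy_token_ids = [2000, 2001, 2002]  # hypothetical wordpiece ids
packed = pack_single_segment(toy_token_ids, seq_length=8)
for name, values in packed.items():
    print(name, values)
```

The encoder uses `input_mask` to ignore padded positions, which is why all three tensors share the same fixed length.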

In [50]:
text_input = tf.keras.layers.Input(shape=(), dtype = tf.string, name = "text")
preprocessing_layer = hub.KerasLayer(bert_preprocess, name = "preprocessing")
encoder_inputs = preprocessing_layer(text_input)
encoder = hub.KerasLayer(bert_encoder, name = "bert_encoder")
outputs = encoder(encoder_inputs)
net = outputs["pooled_output"]
net = tf.keras.layers.Dropout(0.1)(net)
net = tf.keras.layers.Dense(num_classes, activation = "softmax", name = "classifier")(net)
model = tf.keras.Model(text_input, net)

ValueError: Trying to load a model of incompatible/unknown type. 'C:\Users\acer\AppData\Local\Temp\tfhub_modules\602d30248ff7929470db09f7385fc895e9ceb4c0' contains neither 'saved_model.pb' nor 'saved_model.pbtxt'.
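This error usually means the cached copy of the module is incomplete, for example after an interrupted download, so the cached folder exists but holds no SavedModel file. A common remedy is to delete that folder and let `hub.KerasLayer` download it again. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes the cache lives in `TFHUB_CACHE_DIR` if that variable is set, otherwise in a `tfhub_modules` folder under the system temp directory, which matches the path in the traceback.

```python
import os
import shutil
import tempfile

def default_tfhub_cache_dir():
    """TFHUB_CACHE_DIR if set, else the 'tfhub_modules' folder in the temp directory."""
    return os.environ.get(
        "TFHUB_CACHE_DIR", os.path.join(tempfile.gettempdir(), "tfhub_modules")
    )

def find_incomplete_modules(cache_dir):
    """Return cached module folders that contain neither SavedModel file."""
    if not os.path.isdir(cache_dir):
        return []
    incomplete = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if not os.path.isdir(path):
            continue  # skip descriptor and lock files
        has_model = any(
            os.path.isfile(os.path.join(path, name))
            for name in ("saved_model.pb", "saved_model.pbtxt")
        )
        if not has_model:
            incomplete.append(path)
    return incomplete

def remove_incomplete_modules(cache_dir):
    """Delete incomplete module folders so the next load re-downloads them."""
    removed = find_incomplete_modules(cache_dir)
    for path in removed:
        shutil.rmtree(path)
    return removed

# Dry run: only list the folders that would be removed.
for path in find_incomplete_modules(default_tfhub_cache_dir()):
    print("incomplete:", path)
```

After checking the dry-run output, calling `remove_incomplete_modules(default_tfhub_cache_dir())` and re-running the model cell should trigger a fresh download.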

In [18]:
keras.utils.plot_model(model)

NameError: name 'model' is not defined

In [35]:
# Compile the model. The labels are one-hot encoded, so the loss is categorical cross-entropy
model.compile(loss = "categorical_crossentropy", optimizer = "rmsprop", metrics = ["acc"])

In [39]:
history = model.fit(X_train, y_train, epochs = 5)

Epoch 1/5
  61/1931 [..............................] - ETA: 3:44:56 - loss: 1.5139 - acc: 0.4395

KeyboardInterrupt: 
