# Text classification using Keras with Neptune tracking
Notebook inspired by https://keras.io/examples/nlp/text_classification_from_scratch/

## Setup

In [1]:
import os
import tensorflow as tf
import utils

(Neptune) Import Neptune and initialize a project

In [2]:
os.environ["NEPTUNE_PROJECT"] = "showcase/project-text-classification"

In [3]:
import neptune.new as neptune

project = neptune.init_project()

https://app.neptune.ai/showcase/project-text-classification/
Remember to stop your project once you’ve finished logging your metadata (https://docs.neptune.ai/api/project#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


## Data preparation
We use the IMDB sentiment analysis dataset available at https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. For this demo, the data has been uploaded to S3 at https://neptune-examples.s3.us-east-2.amazonaws.com/data/text-classification/aclImdb_v1.tar.gz and is downloaded from there.

### (Neptune) Track datasets using Neptune
Since this dataset is shared by all runs in the project, we track it at the project level.

In [4]:
project["keras/data/files"].track_files(
    "s3://neptune-examples/data/text-classification/aclImdb_v1.tar.gz"
)
project.sync()

### (Neptune) Download files from S3 using Neptune

In [5]:
print("Downloading data...")
project["keras/data/files"].download("..")

Downloading data...


### Prepare data

In [6]:
utils.extract_files(source="../aclImdb_v1.tar.gz", destination="..")
utils.prep_data(imdb_folder="../aclImdb", dest_path="../data")

Extracting data...
../aclImdb renamed to ../data
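
`utils.extract_files` is a helper local to this example, and its source is not shown in this notebook. As a rough sketch of what such a helper could look like, assuming it simply unpacks a gzip-compressed tarball with the standard library (the body below is an illustration, not the actual `utils` code):

```python
import tarfile
from pathlib import Path


def extract_files(source: str, destination: str) -> None:
    """Hypothetical stand-in for utils.extract_files: unpack a .tar.gz archive."""
    print("Extracting data...")
    # Create the destination folder if it does not exist yet.
    Path(destination).mkdir(parents=True, exist_ok=True)
    with tarfile.open(source, "r:gz") as archive:
        # Only extract archives you trust: tar members can carry unsafe paths.
        archive.extractall(path=destination)
```

Called as `extract_files(source="../aclImdb_v1.tar.gz", destination="..")`, this would leave the unpacked `aclImdb` folder next to the archive, which matches the paths used in the cell above.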


(Neptune) Upload dataset sample to Neptune project

In [7]:
import random

base_namespace = "keras/data/sample/"

project[base_namespace]["train/pos"].upload(
    f"../data/train/pos/{random.choice(os.listdir('../data/train/pos'))}"
)
project[base_namespace]["train/neg"].upload(
    f"../data/train/neg/{random.choice(os.listdir('../data/train/neg'))}"
)
project[base_namespace]["test/pos"].upload(
    f"../data/test/pos/{random.choice(os.listdir('../data/test/pos'))}"
)
project[base_namespace]["test/neg"].upload(
    f"../data/test/neg/{random.choice(os.listdir('../data/test/neg'))}"
)
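
The four upload calls above differ only in the split and the label, so the file selection can be written as a loop. A minimal sketch, assuming the `../data/<split>/<label>` layout used above; `pick_sample_files` is a hypothetical helper, and the Neptune upload is left as a comment so the snippet runs without a Neptune connection:

```python
import os
import random


def pick_sample_files(data_root, splits=("train", "test"), labels=("pos", "neg"), seed=None):
    """Return {"<split>/<label>": path} with one randomly chosen file per folder."""
    rng = random.Random(seed)
    samples = {}
    for split in splits:
        for label in labels:
            folder = os.path.join(data_root, split, label)
            # Sorting the listing makes the choice reproducible for a fixed seed.
            filename = rng.choice(sorted(os.listdir(folder)))
            samples[f"{split}/{label}"] = os.path.join(folder, filename)
    return samples


# Usage with the project object from this notebook (not executed here):
# for key, path in pick_sample_files("../data").items():
#     project["keras/data/sample"][key].upload(path)
```

Passing a `seed` is optional; without it, each execution uploads a different sample, as in the original cell.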

### Generate training, validation, and test datasets

In [8]:
data_params = {
    "batch_size": 32,
    "validation_split": 0.2,
    "max_features": 20000,
    "embedding_dim": 128,
    "sequence_length": 500,
    "seed": 42,
}

(Neptune) Log data metadata to Neptune

In [9]:
run = neptune.init_run(name="Keras text classification", tags=["keras"])

https://app.neptune.ai/showcase/project-text-classification/e/TXTCLF-358
Remember to stop your run once you’ve finished logging your metadata (https://docs.neptune.ai/api/run#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


In [10]:
run["data/params"] = data_params

In [11]:
from tensorflow.keras.layers import TextVectorization
import string
import re

In [12]:
raw_train_ds, raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "../data/train",
    batch_size=data_params["batch_size"],
    validation_split=data_params["validation_split"],
    subset="both",
    seed=data_params["seed"],
)

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "../data/test", batch_size=data_params["batch_size"]
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
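
The batch counts printed above follow from the file counts and the batch size: the last batch of a dataset may be partial, so the number of batches is the file count divided by the batch size, rounded up. A quick check of that arithmetic (the helper name `num_batches` is ours, not part of the notebook):

```python
import math


def num_batches(num_files: int, batch_size: int) -> int:
    """Batches needed to cover num_files when the last batch may be partial."""
    return math.ceil(num_files / batch_size)


batch_size = 32
for name, n_files in [("train", 20000), ("validation", 5000), ("test", 25000)]:
    print(f"{name}: {num_batches(n_files, batch_size)} batches")
```

This reproduces the 625, 157, and 782 batches reported by `cardinality()`.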


### Clean data

In [13]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"[{re.escape(string.punctuation)}]", "")


vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=data_params["max_features"],
    output_mode="int",
    output_sequence_length=data_params["sequence_length"],
)

text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
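
To see what `custom_standardization` does to a single review, the same three steps (lowercase, replace `<br />` with a space, strip punctuation) can be reproduced in plain Python. This is an illustrative equivalent for inspection only, applied to a made-up review string; the pipeline itself keeps the TensorFlow version so that the logic runs inside the `TextVectorization` layer:

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def standardize(text: str) -> str:
    """Plain-Python mirror of custom_standardization, for illustration."""
    lowercase = text.lower()
    # The <br /> tag must be replaced before punctuation is stripped,
    # because "<", "/" and ">" are themselves punctuation characters.
    stripped_html = lowercase.replace("<br />", " ")
    return _PUNCTUATION.sub("", stripped_html)


print(standardize("Great movie!<br />Loved it."))  # great movie loved it
```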


### Vectorize data

In [14]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)
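
As a rough picture of what an integer-mode vectorizer produces, the sketch below builds a frequency-ranked vocabulary and maps each text to a fixed-length list of token IDs. It is a simplified plain-Python illustration, not the Keras implementation: it assumes whitespace tokenization, index 0 for padding, and index 1 for out-of-vocabulary tokens, and the function names and example texts are made up:

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to IDs, reserving 0 (padding) and 1 (OOV)."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}


def vectorize(text, vocab, sequence_length):
    """Convert text to exactly sequence_length IDs by truncating or zero-padding."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


vocab = build_vocab(["good good good movie movie", "bad"], max_tokens=4)
print(vocab)                                  # {'good': 2, 'movie': 3}
print(vectorize("good bad movie", vocab, 5))  # [2, 1, 3, 0, 0]
```

In the notebook, `max_tokens` and `sequence_length` correspond to `data_params["max_features"]` and `data_params["sequence_length"]`.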

## Modelling

(Neptune) Create a new model and model version

In [15]:
from neptune.new.exceptions import NeptuneModelKeyAlreadyExistsError

project_key = project["sys/id"].fetch()

try:
    model = neptune.init_model(name="keras", key="KER")
    model.stop()
except NeptuneModelKeyAlreadyExistsError:
    # If it already exists, we don't have to do anything.
    pass

model_version = neptune.init_model_version(model=f"{project_key}-KER", name="keras")

https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER/v/TXTCLF-KER-7
Remember to stop your model_version once you’ve finished logging your metadata (https://docs.neptune.ai/api/model_version#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


### Build a model

In [16]:
model_params = {
    "dropout": 0.5,
    "strides": 3,
    "activation": "relu",
    "kernel_size": 7,
    "loss": "binary_crossentropy",
    "optimizer": "adam",
    "metrics": ["accuracy"],
}

In [17]:
model_version["params"] = model_params



In [18]:
keras_model = utils.build_model(model_params, data_params)

### Train the model

(Neptune) Initialize the Neptune callback

In [19]:
from neptune.new.integrations.tensorflow_keras import NeptuneCallback

neptune_callback = NeptuneCallback(run=run, log_model_diagram=True, log_on_batch=True)

In [20]:
training_params = {
    "epochs": 3,
}

In [21]:
# Fit the model on the training dataset, validating on the validation dataset.
keras_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=training_params["epochs"],
    callbacks=[neptune_callback],
)
# Training parameters are logged automatically to Neptune

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1f031afc130>

### Evaluate the model

In [22]:
_, curr_model_acc = keras_model.evaluate(test_ds, callbacks=[neptune_callback])



## (Neptune) Associate run with model and vice-versa

In [23]:
run_meta = {
    "id": run["sys/id"].fetch(),
    "name": run["sys/name"].fetch(),
    "url": run.get_url(),
}

print(run_meta)

{'id': 'TXTCLF-358', 'name': 'Keras text classification', 'url': 'https://app.neptune.ai/showcase/project-text-classification/e/TXTCLF-358'}


In [24]:
model_version["run"] = run_meta

In [25]:
model_version_meta = {
    "id": model_version["sys/id"].fetch(),
    "name": model_version["sys/name"].fetch(),
    "url": model_version.get_url(),
}

print(model_version_meta)

{'id': 'TXTCLF-KER-7', 'name': 'keras', 'url': 'https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER/v/TXTCLF-KER-7'}


In [26]:
run["training/model/meta"] = model_version_meta

## (Neptune) Upload serialized model and model weights to Neptune

In [27]:
model_version["serialized_model"] = keras_model.to_json()

In [28]:
keras_model.save_weights("model_weights.h5")
model_version["model_weights"].upload("model_weights.h5")

(Neptune) Wait for all operations to sync with Neptune servers

In [29]:
model_version.sync()

## (Neptune) Promote best model to production

### (Neptune) Fetch current production model

In [30]:
with neptune.init_model(with_id=f"{project_key}-KER") as model:
    model_versions_df = model.fetch_model_versions_table().to_pandas()
model_versions_df

https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER
Remember to stop your model once you’ve finished logging your metadata (https://docs.neptune.ai/api/model#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.
Shutting down background jobs, please wait a moment...
Done!
All 0 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER/metadata


| | sys/creation_time | sys/id | sys/model_id | sys/modification_time | sys/monitoring_time | sys/name | sys/owner | sys/ping_time | sys/running_time | sys/size | ... | params/dropout | params/kernel_size | params/loss | params/metrics | params/optimizer | params/strides | run/id | run/name | run/url | serialized_model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-01-06 12:18:54.909000+00:00 | TXTCLF-KER-7 | TXTCLF-KER | 2023-01-06 12:20:28.760000+00:00 | 89 | keras | siddhant.sadangi | 2023-01-06 12:20:28.760000+00:00 | 93.846 | 11255447.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | TXTCLF-358 | Keras text classification | https://app.neptune.ai/showcase/project-text-c... | {"class_name": "Functional", "config": {"name"... |
| 1 | 2023-01-05 18:25:49.330000+00:00 | TXTCLF-KER-6 | TXTCLF-KER | 2023-01-05 18:27:25.769000+00:00 | 82 | keras | siddhant.sadangi | 2023-01-05 18:27:25.769000+00:00 | 96.435 | 11255451.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | TXTCLF-356 | Keras text classification | https://app.neptune.ai/showcase/project-text-c... | {"class_name": "Functional", "config": {"name"... |
| 2 | 2023-01-05 18:09:11.474000+00:00 | TXTCLF-KER-5 | TXTCLF-KER | 2023-01-05 18:10:57.264000+00:00 | 93 | keras | siddhant.sadangi | 2023-01-05 18:10:57.264000+00:00 | 105.786 | 11255451.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | TXTCLF-354 | Keras text classification | https://app.neptune.ai/showcase/project-text-c... | {"class_name": "Functional", "config": {"name"... |
| 3 | 2023-01-05 18:00:54.375000+00:00 | TXTCLF-KER-4 | TXTCLF-KER | 2023-01-05 18:03:31.043000+00:00 | 140 | keras | siddhant.sadangi | 2023-01-05 18:03:31.043000+00:00 | 156.663 | 11255731.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | TXTCLF-352 | Keras text classification | https://app.neptune.ai/showcase/project-text-c... | {"class_name": "Functional", "config": {"name"... |
| 4 | 2023-01-05 17:57:35.457000+00:00 | TXTCLF-KER-3 | TXTCLF-KER | 2023-01-05 18:27:16.564000+00:00 | 137 | Untitled | siddhant.sadangi | 2023-01-05 18:27:16.564000+00:00 | 164.422 | 11255456.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | TXTCLF-350 | Keras text classification | https://app.neptune.ai/showcase/project-text-c... | {"class_name": "Functional", "config": {"name"... |
| 5 | 2023-01-05 17:35:27.859000+00:00 | TXTCLF-KER-2 | TXTCLF-KER | 2023-01-05 18:00:09.752000+00:00 | 123 | Untitled | siddhant.sadangi | 2023-01-05 18:16:47.766000+00:00 | 2479.781 | 11255454.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | TXTCLF-348 | Keras text classification | https://app.neptune.ai/showcase/project-text-c... | {"class_name": "Functional", "config": {"name"... |
| 6 | 2023-01-05 17:26:18.799000+00:00 | TXTCLF-KER-1 | TXTCLF-KER | 2023-01-05 17:39:34.330000+00:00 | 49 | Untitled | siddhant.sadangi | 2023-01-05 18:16:47.943000+00:00 | 2383.721 | 4529.0 | ... | 0.5 | 7 | binary_crossentropy | ['accuracy'] | adam | 3 | | | | {"class_name": "Functional", "config": {"name"... |


In [31]:
production_models = model_versions_df[model_versions_df["sys/stage"] == "production"]["sys/id"]
assert (
    len(production_models) == 1
), f"Expected exactly one model version in production, found {len(production_models)}: {production_models.values}"

In [32]:
prod_model_id = production_models.values[0]
print(f"Current model in production: {prod_model_id}")

Current model in production: TXTCLF-KER-3
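
The lookup above expects exactly one model version in the `production` stage, so both zero and several matches are error cases. A minimal sketch of that check as a reusable function, working on plain dictionaries so it runs without pandas or Neptune; `select_production_id` is a hypothetical name, and the stage values in the example rows are made up except for `TXTCLF-KER-3`, which the output above reports as the production version:

```python
def select_production_id(versions):
    """Return the ID of the single version whose stage is 'production'."""
    ids = [v["sys/id"] for v in versions if v.get("sys/stage") == "production"]
    if len(ids) != 1:
        # Covers both "no production model yet" and "more than one".
        raise ValueError(
            f"Expected exactly one model version in production, found {len(ids)}: {ids}"
        )
    return ids[0]


# Example rows; with the DataFrame above one could pass
# model_versions_df.to_dict("records") instead.
rows = [
    {"sys/id": "TXTCLF-KER-7", "sys/stage": "none"},
    {"sys/id": "TXTCLF-KER-3", "sys/stage": "production"},
]
print(select_production_id(rows))  # TXTCLF-KER-3
```

Raising an exception instead of using `assert` keeps the check active even when Python runs with assertions disabled.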


In [33]:
npt_prod_model = neptune.init_model_version(with_id=prod_model_id)
npt_prod_model_params = npt_prod_model["params"].fetch()
prod_model = tf.keras.models.model_from_json(
    npt_prod_model["serialized_model"].fetch(), custom_objects=None
)

npt_prod_model["model_weights"].download()
prod_model.load_weights("model_weights.h5")

https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER/v/TXTCLF-KER-3
Remember to stop your model_version once you’ve finished logging your metadata (https://docs.neptune.ai/api/model_version#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


### (Neptune) Evaluate current model on latest test data

In [34]:
# Use the production model's original loss and optimizer, but the current metrics.
prod_model.compile(
    loss=npt_prod_model_params["loss"],
    optimizer=npt_prod_model_params["optimizer"],
    metrics=model_params["metrics"],
)

_, prod_model_acc = prod_model.evaluate(test_ds)



### (Neptune) If challenger model outperforms production model, promote it to production

In [36]:
print(f"Production model accuracy: {prod_model_acc}\nChallenger model accuracy: {curr_model_acc}")

if curr_model_acc > prod_model_acc:
    print("Promoting challenger to production")
    npt_prod_model.change_stage("archived")
    model_version.change_stage("production")
else:
    print("Archiving challenger model")
    model_version.change_stage("archived")

npt_prod_model.stop()

Production model accuracy: 0.8633599877357483
Challenger model accuracy: 0.8563600182533264
Archiving challenger model
Shutting down background jobs, please wait a moment...
Done!
All 0 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER/v/TXTCLF-KER-3/metadata
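
The promotion rule used above can be isolated as a pure function, which makes it easy to test without touching Neptune. A minimal sketch, in which the function name and the returned dictionary keys are our own choices; the stage names mirror the ones passed to `change_stage` above:

```python
def decide_stages(prod_acc: float, challenger_acc: float) -> dict:
    """Return the target stage of each model under the champion/challenger rule."""
    if challenger_acc > prod_acc:
        # The challenger strictly outperforms the production model: swap them.
        return {"production_model": "archived", "challenger": "production"}
    # Ties and worse results keep the current production model in place.
    return {"production_model": "production", "challenger": "archived"}


print(decide_stages(0.8633599877357483, 0.8563600182533264))
# {'production_model': 'production', 'challenger': 'archived'}
```

With the accuracies from this run, the challenger is archived, in line with the output shown above.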


## (Neptune) Stop tracking

In [37]:
model_version.stop()
run.stop()
project.stop()

Shutting down background jobs, please wait a moment...
Done!
All 0 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/showcase/project-text-classification/m/TXTCLF-KER/v/TXTCLF-KER-7/metadata
Shutting down background jobs, please wait a moment...
Done!
Waiting for the remaining 6 operations to synchronize with Neptune. Do not kill this process.
All 6 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/showcase/project-text-classification/e/TXTCLF-358
Shutting down background jobs, please wait a moment...
Done!
All 0 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/showcase/project-text-classification/metadata
