# Keyword Detection on Websites

### Assignment
Your task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board.

What is a tumor board? A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. If you want to know more you can read this article.

As the final output of this task, you should provide a submission.csv file for the test data set with two columns (document ID and prediction), and a Jupyter notebook with code and documentation that answers the following questions:

- How did you decide to handle this amount of data?
- How did you decide to do feature engineering?
- How did you decide which models to try (if you decide to train any models)?
- How did you perform validation of your model?
- What metrics did you measure?
- How do you expect your model to perform on test data (in terms of your metrics)?
- How fast will your algorithm perform and how could you improve its performance if you would have more time?
- How do you think you would be able to improve your algorithm if you would have more data?
- What potential issues do you see with your algorithm?

In [1]:
import pandas as pd

In [3]:
train_csv = pd.read_csv(filepath_or_buffer="train.csv")
print("Training set shape", train_csv.shape)
train_csv.head()

Training set shape (100, 3)


Unnamed: 0,url,doc_id,label
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3


In [4]:
test_csv = pd.read_csv(filepath_or_buffer="test.csv")
print("Test set shape", test_csv.shape)
test_csv.head()

Test set shape (48, 2)


Unnamed: 0,url,doc_id
0,http://chirurgie-goettingen.de/medizinische-ve...,0
1,http://evkb.de/kliniken-zentren/chirurgie/allg...,2
2,http://krebszentrum.kreiskliniken-reutlingen.d...,7
3,http://marienhospital-buer.de/mhb-av-chirurgie...,15
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16


In [5]:
tumor_keywords = pd.read_csv(filepath_or_buffer="keyword2tumor_type.csv")
print("Tumor keywords set shape", tumor_keywords.shape)
tumor_keywords.head()

Tumor keywords set shape (126, 2)


Unnamed: 0,keyword,tumor_type
0,senologische,Brust
1,brustzentrum,Brust
2,breast,Brust
3,thorax,Brust
4,thorakale,Brust


We have 100 documents in the training set and 48 in the test set. In the training set, 32 documents do not mention a tumor board (label = 1), 59 mention a tumor board without it being certain that it is the main focus of the page (label = 2), and 9 are certainly dedicated to tumor boards (label = 3).

In [6]:
train_csv.groupby(by="label").size()

label
1    32
2    59
3     9
dtype: int64

### Load Data

In [7]:
def read_html(doc_id: int) -> str:
    with open(file=f"htmls/{doc_id}.html",
              mode="r",
              encoding="latin1") as f:
        html = f.read()
    return html


train_csv["html"] = train_csv["doc_id"].apply(read_html)
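
The `latin1` codec never raises an error, but a page that was saved as UTF-8 is then decoded into mojibake; the `fã¼r` fragments in some of the later outputs suggest that this happens for at least a few documents. A minimal sketch of a more forgiving reader follows. The function names and the list of candidate encodings are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def decode_html(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode raw HTML bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("latin1")


def read_html_with_fallback(doc_id: int, folder: str = "htmls") -> str:
    """Hypothetical alternative to read_html that reads bytes and decodes them."""
    return decode_html(Path(folder, f"{doc_id}.html").read_bytes())


# The same word survives both encodings, while a plain latin1 read garbles UTF-8.
from_utf8 = decode_html("für".encode("utf-8"))
from_cp1252 = decode_html("für".encode("cp1252"))
garbled = "für".encode("utf-8").decode("latin1")
```

Trying UTF-8 first is a reasonable default because it rejects most byte sequences that are not valid UTF-8, so a wrong guess fails loudly instead of producing garbled text.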

In [8]:
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_..."
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or..."
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<..."
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me..."
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T..."


In [9]:
import warnings

from bs4 import BeautifulSoup

warnings.filterwarnings(action="ignore")


def extract_html_text(html):
    bs = BeautifulSoup(markup=html, features="lxml")
    for script in bs(name=["script", "style"]):
        script.decompose()
    return bs.get_text(separator=" ")


train_csv["html_text"] = train_csv["html"].apply(extract_html_text)

In [10]:
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html,html_text
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_...",\n \n \n \n \n Prostata-Karzinom-Zentrum - Sch...
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me...",\n \n \n \n Darmzentrum Rheinpfalz » Zentren A...
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...


This is progress, but one issue is immediately visible: each document starts with a long run of newline characters (\n). Ideally, we want to give the model clean, human-readable text without special characters.

In [12]:
from gensim.parsing import preprocessing


def preprocess_html_text(html_text: str) -> str:
    preprocessed_text = preprocessing.strip_non_alphanum(s=html_text)
    preprocessed_text = preprocessing.strip_multiple_whitespaces(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_punctuation(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_numeric(s=preprocessed_text)

    preprocessed_text = preprocessing.stem_text(text=preprocessed_text)
    preprocessed_text = preprocessing.remove_stopwords(s=preprocessed_text)
    return preprocessed_text


train_csv["preprocessed_html_text"] = train_csv["html_text"].apply(preprocess_html_text)
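
Note that gensim's `stem_text` applies the Porter stemmer and `remove_stopwords` uses an English stop word list, while the pages are German; this is why words such as `und` and `die` survive in the output below. The following is a minimal, standard-library sketch of a German-aware cleaning step. The stop word list is a tiny illustrative sample and the function name is hypothetical; stemming is left out.

```python
import re

# Tiny illustrative stop word list; a real one would be far longer.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "für", "in", "im", "ein", "eine", "zu"}


def clean_german_text(text: str) -> str:
    """Lowercase, keep only letter runs (including umlauts and ß), drop stop words."""
    # [^\W\d_]+ matches runs of Unicode letters: word characters minus digits and "_".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(token for token in tokens if token not in GERMAN_STOPWORDS)


cleaned = clean_german_text("\n \n Klinik für Frauenheilkunde\n und 24 Geburtshilfe!")
```

Because the regular expression only keeps letters, it replaces the separate punctuation, digit, and whitespace steps in one pass.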

In [13]:
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html,html_text,preprocessed_html_text
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_...",\n \n \n \n \n Prostata-Karzinom-Zentrum - Sch...,prostata karzinom zentrum schwarzwald baar kli...
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,unser profil gefã ã und thoraxchirurgi kliniku...
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,maltes kliniken rhein ruhr darmzentrum duisbur...
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me...",\n \n \n \n Darmzentrum Rheinpfalz » Zentren A...,darmzentrum rheinpfalz zentren z kliniken und ...
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,mund kiefer und plastisch gesichtschirurgi en ...


### Exploratory Data Analysis

In [14]:
import plotly.express as px
import plotly.offline as pyo

# set notebook mode to work in offline
pyo.init_notebook_mode(connected=True)

In [15]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(len), title="Distribution of Text Length (Character Count)")

One document has 170-179K characters; all others have fewer than 50K.

In [16]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: text.split(" ")).apply(len),
             title="Distribution of Text Length (Word Count)")

One document has 27-28K words; all others have fewer than 6K.

In [17]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: set(text.split(" "))).apply(len),
             title="Unique Words Count")


One document has 6,500-7,000 unique words; all others have fewer than 2,000.

### Modeling

To solve our task, which falls under natural language processing, we will use a Siamese network. A Siamese network learns a similarity function between pairs of inputs instead of a direct mapping from input to class, which makes it a common choice for small and imbalanced data sets. Siamese networks are mostly used in few-shot learning tasks such as signature verification, face recognition, and object detection.

In [18]:
import random
import numpy as np
import tensorflow as tf

# set the random seeds (the pair generator below also uses Python's `random` module)
random.seed(42)
np.random.seed(42)
tf.random.set_seed(seed=42)

#### Data Generator

In [19]:
class Pair(tf.keras.utils.Sequence):
    def __init__(self, dataframe: pd.DataFrame, labels: pd.Series, n_batch: int, batch_size: int):
        self.dataframe = dataframe
        self.labels = labels
        self.n_batch = n_batch
        self.batch_size = batch_size
        self.all_classes = set(self.labels)
        self.anchor_groups = {}
        for target_class in self.all_classes:
            self.anchor_groups[target_class] = {
                "positive": self.dataframe[self.labels == target_class],
                "negative": self.dataframe[self.labels != target_class]
            }

    def __len__(self):
        return self.n_batch

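    def __getitem__(self, item):
        text_pairs = []
        targets = []

        for _ in range(self.batch_size // 2):
            # draw the anchor class uniformly, so rare classes are seen as often as common ones
            anchor_class = random.choice(sorted(self.all_classes))
            anchor_group = self.anchor_groups[anchor_class]["positive"]
            not_anchor_group = self.anchor_groups[anchor_class]["negative"]

            anchor = anchor_group.sample(n=1).iloc[0]
            positive = anchor_group.sample(n=1).iloc[0]
            negative = not_anchor_group.sample(n=1).iloc[0]

            # keep targets in their own numeric list instead of mixing them into a string array
            text_pairs.append([anchor, positive])
            targets.append(1.0)
            text_pairs.append([anchor, negative])
            targets.append(0.0)

        # shuffle texts and targets with the same permutation
        order = np.random.permutation(len(targets))
        data_pairs = np.array(text_pairs)[order]
        targets = np.array(targets, dtype=np.float32)[order]

        return data_pairs, tf.convert_to_tensor(targets, dtype=tf.float32)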

    def get_support_set(self, sample_size: int = 1):
        support_set = {}
        for target_class in self.all_classes:
            support_set[target_class] = self.anchor_groups[target_class]["positive"].sample(n=sample_size)
        return support_set
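
The sampling idea behind `Pair.__getitem__` can be shown in isolation. The sketch below is a simplified, standard-library version that works on plain lists; the function name and the toy data are assumptions for illustration.

```python
import random


def sample_pairs(texts, labels, n_anchors, rng):
    """Return shuffled (anchor, other, target) triples.

    target is 1 when both texts share a label and 0 otherwise.
    """
    classes = sorted(set(labels))
    pairs = []
    for _ in range(n_anchors):
        # Uniform over classes, not over documents, so rare classes are not drowned out.
        anchor_class = rng.choice(classes)
        same = [t for t, l in zip(texts, labels) if l == anchor_class]
        other = [t for t, l in zip(texts, labels) if l != anchor_class]
        anchor = rng.choice(same)
        pairs.append((anchor, rng.choice(same), 1))   # positive pair (may repeat the anchor)
        pairs.append((anchor, rng.choice(other), 0))  # negative pair
    rng.shuffle(pairs)
    return pairs


# Toy data: placeholder strings stand in for preprocessed page texts.
toy_texts = ["a1", "a2", "b1", "b2", "c1"]
toy_labels = [1, 1, 2, 2, 3]
toy_pairs = sample_pairs(toy_texts, toy_labels, n_anchors=4, rng=random.Random(42))
```

Every anchor contributes one positive and one negative pair, so the binary targets are balanced by construction even though the document labels are not. With very small classes the positive is often the anchor itself, which makes some pairs trivially easy.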

#### Model Definition
Here we define our model as a Siamese network. The encoder is a sequence of layers, starting with a TextVectorization layer, which accepts natural language (text) as input and maps it to a sequence of integer token IDs. The layer needs a vocabulary to perform this mapping, so we build one from the corpus with adapt() when the model is constructed.

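Following the text vectorization layer, we stack three Dense layers with two Dropout layers in between. The Dense layers operate directly on the integer token IDs; there is no embedding layer in between. Lastly, an L2 normalization layer scales each output vector to unit length, so that the dot product of two encoded documents equals their cosine similarity. This is a normalization of activations, not L2 regularization: it does not penalize large weights.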
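
To see why the final normalization matters for the `Dot` layer, here is a small standard-library sketch (the helper names are illustrative, not taken from TensorFlow): once two vectors have unit length, their dot product is their cosine similarity.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])   # [0.6, 0.8]
b = l2_normalize([6.0, 8.0])   # same direction as a
c = l2_normalize([4.0, 0.0])   # [1.0, 0.0]

same_direction = dot(a, b)     # cosine similarity of parallel vectors
partial_overlap = dot(a, c)    # cosine of the angle between a and c
```

Since the last Dense layer uses a ReLU activation, all embedding components are non-negative, so the similarity stays between 0 and 1 and can be compared against binary targets.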

In [20]:
class SiameseNetwork(tf.keras.Model):
    def __init__(self, corpora: pd.Series):
        super(SiameseNetwork, self).__init__()
        self.vectorizer_layer: tf.keras.layers.TextVectorization = tf.keras.layers.TextVectorization(
            max_tokens=2000,
            output_mode="int",
            output_sequence_length=512
        )
        self.vectorizer_layer.adapt(corpora.values)
        self.encoder = tf.keras.Sequential(layers=[
            self.vectorizer_layer,
            tf.keras.layers.Dense(units=256, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=128, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=64, activation=tf.keras.activations.relu),
            tf.keras.layers.Lambda(function=lambda x: tf.math.l2_normalize(x, axis=1))
        ])
        self.encoding_distance = tf.keras.layers.Dot(axes=1)

    def __call__(self, inputs, *args, **kwargs):
        anchors, supports = inputs[:, 0], inputs[:, 1]
        anchors_encoded = self.encoder(anchors)
        supports_encoded = self.encoder(supports)
        return self.encoding_distance((anchors_encoded, supports_encoded))

    def predict_with_support_set(self, entry, support_set: dict):
        scores = {}
        for instance_class, texts in support_set.items():
            class_scores = ([self(np.array([entry, text]).reshape((-1, 2))) for text in texts])
            scores[instance_class] = tf.math.reduce_mean(class_scores)
        return max(scores, key=scores.get)
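
The decision rule in `predict_with_support_set` is independent of the network itself: score the new document against a few labelled examples per class, average the scores per class, and pick the best class. The sketch below illustrates that rule with a toy word-overlap similarity standing in for the trained encoder; both function names and the example texts are assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity: word overlap between two texts, in [0, 1]."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def predict_class(similarity, entry, support_set):
    """Pick the class whose support examples are, on average, most similar to entry."""
    mean_scores = {
        label: sum(similarity(entry, example) for example in examples) / len(examples)
        for label, examples in support_set.items()
    }
    return max(mean_scores, key=mean_scores.get)


# Hypothetical support set: a few labelled example texts per class.
toy_support = {
    1: ["klinik chirurgie"],
    3: ["tumorboard konferenz", "tumorboard termine"],
}
predicted = predict_class(jaccard, "interdisziplinäre tumorboard konferenz", toy_support)
```

Averaging over several support examples makes the prediction less sensitive to a single unrepresentative document, at the cost of one encoder pass per support example.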

In [22]:
model = SiameseNetwork(corpora=train_csv["preprocessed_html_text"])

In [23]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])

At this point, we have our model, our data, and the data generator. We are ready to commence training.

But, before we do that, let's split the data in train_csv into training and validation sets.

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_csv["preprocessed_html_text"], train_csv["label"],
                                                      test_size=0.2,
                                                      random_state=42, stratify=train_csv["label"])
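
Stratification keeps the class proportions roughly equal in both parts, which matters here because label 3 has only 9 documents. The sketch below is a simplified stand-in for a stratified split, not scikit-learn's actual algorithm; it only shows how many documents of each class end up in a 20% validation set.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed):
    """Split indices so that each class contributes about test_size of its members."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx


# Class sizes taken from the label counts of the training set.
all_labels = [1] * 32 + [2] * 59 + [3] * 9
train_idx, valid_idx = stratified_split(all_labels, test_size=0.2, seed=42)
valid_counts = Counter(all_labels[i] for i in valid_idx)
```

The resulting validation counts (6, 12, and 2) match the support column of the classification report further down. With only 2 validation documents of label 3, any metric for that class is very noisy.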

In [25]:
# training params
BATCH_SIZE = 64
N_BATCH = 100
# we instantiate training and validation data / pair generators
TRAIN_PAIR_GENERATOR = Pair(dataframe=X_train, labels=y_train, n_batch=N_BATCH, batch_size=BATCH_SIZE)
VALID_PAIR_GENERATOR = Pair(dataframe=X_valid, labels=y_valid, n_batch=N_BATCH, batch_size=BATCH_SIZE)


Finally, we add an early stopping callback that stops training early if the validation loss does not improve for 3 consecutive epochs.

In [26]:
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
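
The patience rule can be written out in a few lines. This is a sketch of the logic only, not Keras code, and the loss values are hypothetical; the recorded training log does not show the actual losses.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best loss and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: the best value is reached in the first epoch.
stopped_after = epochs_run([0.70, 0.71, 0.72, 0.73, 0.60], patience=3)
# Hypothetical losses that keep improving often enough to use all epochs.
ran_all = epochs_run([0.90, 0.80, 0.85, 0.70, 0.60], patience=3)
```

A run that stops after epoch 4 of 10 is what this rule produces when the validation loss never improves on its first-epoch value.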

In [27]:
history = model.fit(
    x=TRAIN_PAIR_GENERATOR,
    validation_data=VALID_PAIR_GENERATOR,
    epochs=10,
    callbacks=[early_stopping_callback],
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


### Model Evaluation

Once we finish with the model training we can start evaluating the produced model. All training information is stored in the history object that is returned by the model.fit() method. In the plots below, we plot the model's training and validation accuracy and loss over the number of epochs.

In [28]:
import plotly.graph_objects as go

In [29]:
figure = go.Figure()

figure.add_scatter(y=history.history["binary_accuracy"], name="Training Accuracy")
figure.add_scatter(y=history.history["val_binary_accuracy"], name="Validation Accuracy")

figure.update_layout(dict1={
    "title": "Model Accuracy During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Accuracy"
}, overwrite=True)

figure.show()

In [30]:
figure = go.Figure()

figure.add_scatter(y=history.history["loss"], name="Training Loss")
figure.add_scatter(y=history.history["val_loss"], name="Validation Loss")

figure.update_layout(dict1={
    "title": "Model Loss During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Loss"
}, overwrite=True)

figure.show()

Let's make predictions on the validation set. The validation metrics are not a reliable indicator of the model's performance on unseen data: the validation set was used during training (for early stopping), so the metrics are somewhat optimistic. In general, we would expect them to be lower in a production setting (though hopefully not much lower).

In [31]:
y_pred = X_valid.apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(7)
))

In [32]:
# build a classification report
from sklearn.metrics import classification_report

report = classification_report(y_true=y_valid, y_pred=y_pred, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.40      0.33      0.36         6
           2       0.67      0.83      0.74        12
           3       0.00      0.00      0.00         2

    accuracy                           0.60        20
   macro avg       0.36      0.39      0.37        20
weighted avg       0.52      0.60      0.55        20
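
The numbers in the report can be reproduced by hand. The counts below are not notebook output: they are inferred from the rounded precision, recall, and support values above (for example, recall 0.33 on 6 documents means 2 correct predictions), so treat them as an assumption.

```python
def precision_recall_f1(true_positives, predicted, actual):
    """Per-class metrics; undefined ratios are reported as 0.0 (like zero_division=0)."""
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# label: (correct predictions, times predicted, true support), inferred from the report
counts = {1: (2, 5, 6), 2: (10, 15, 12), 3: (0, 0, 2)}
scores = {label: precision_recall_f1(*c) for label, c in counts.items()}

macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
accuracy = sum(tp for tp, _, _ in counts.values()) / sum(n for _, _, n in counts.values())
```

Under these counts label 3 is never predicted at all, which is why its precision is undefined and reported as 0.00. The macro average weights all three classes equally, so it is pulled down much more by label 3 than the accuracy is.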



### Prediction

In [33]:
test_csv["html"] = test_csv["doc_id"].apply(read_html)

In [34]:
test_csv["html_text"] = test_csv["html"].apply(extract_html_text)

In [35]:
test_csv["preprocessed_html_text"] = test_csv["html_text"].apply(preprocess_html_text)

In [36]:
test_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text
27,http://www.josephstift-dresden.de/pressemittei...,71,"<?xml version=""1.0"" encoding=""utf-8""?>\n<rss v...",\n \n Krankenhaus St. Joseph-Stift Dresden (PM...,krankenhau st joseph stift dresden pm http www...
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" pre...",\n \n \n \n \n \n \n \n \n \n \n Patienteninfo...,patienteninformationen klinik fã¼r frauenheilk...
26,http://www.interdisziplinaere-endoskopie.mri.t...,70,"<!DOCTYPE html>\n<html lang=""de"">\n\t<!--[if I...",\n \n \n \n \n \n \n Herzlich Willkommen â...,herzlich willkommen â interdisziplinã endoskop...
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134,"<!DOCTYPE html> \n<html lang=""de""> \n\t<head> ...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Z...,zweitmeinung â warum ein weiter einschã tzung ...
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68,"<!DOCTYPE html><html lang=""de"" class=""no-js""><...",Gelenkersatz Skip to main content hjk Die Ei...,gelenkersatz skip main content hjk die einrich...


In [37]:
# do inference
test_csv["predictions"] = test_csv["preprocessed_html_text"].apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(sample_size=7)
))

In [38]:
test_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text,predictions
27,http://www.josephstift-dresden.de/pressemittei...,71,"<?xml version=""1.0"" encoding=""utf-8""?>\n<rss v...",\n \n Krankenhaus St. Joseph-Stift Dresden (PM...,krankenhau st joseph stift dresden pm http www...,2
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" pre...",\n \n \n \n \n \n \n \n \n \n \n Patienteninfo...,patienteninformationen klinik fã¼r frauenheilk...,1
26,http://www.interdisziplinaere-endoskopie.mri.t...,70,"<!DOCTYPE html>\n<html lang=""de"">\n\t<!--[if I...",\n \n \n \n \n \n \n Herzlich Willkommen â...,herzlich willkommen â interdisziplinã endoskop...,2
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134,"<!DOCTYPE html> \n<html lang=""de""> \n\t<head> ...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Z...,zweitmeinung â warum ein weiter einschã tzung ...,2
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68,"<!DOCTYPE html><html lang=""de"" class=""no-js""><...",Gelenkersatz Skip to main content hjk Die Ei...,gelenkersatz skip main content hjk die einrich...,2


Below, we explore the predictions on the test set.

In [39]:
test_csv["predictions"].value_counts()

2    34
1    14
Name: predictions, dtype: int64

In [40]:
test_csv[["doc_id", "predictions"]]

Unnamed: 0,doc_id,predictions
0,0,2
1,2,2
2,7,2
3,15,2
4,16,2
5,24,2
6,31,1
7,32,1
8,36,1
9,38,2
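
The assignment asks for a submission.csv file with a document ID and a prediction column, which the cells above do not write. A minimal sketch using only the standard library follows; the function name and the column header `prediction` are assumptions, and the three sample rows are taken from the table above.

```python
import csv
import io


def write_submission(rows, file):
    """Write (doc_id, prediction) pairs as a two-column CSV to an open text file."""
    writer = csv.writer(file, lineterminator="\n")
    writer.writerow(["doc_id", "prediction"])
    writer.writerows(rows)


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_submission([(0, 2), (2, 2), (31, 1)], buffer)
submission_text = buffer.getvalue()
```

In the notebook, the rows could come from `test_csv[["doc_id", "predictions"]].itertuples(index=False)`, written to a file opened with `open("submission.csv", "w", newline="")`. With pandas, `test_csv[["doc_id", "predictions"]].to_csv("submission.csv", index=False)` would have the same effect.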


### Answers to Questions
- How did you decide to handle this amount of data?

The data set is small (100 training and 48 test documents), so we load all of it into memory with pandas. A data generator (the Pair class) then samples training pairs from the in-memory data on the fly, one batch at a time.

- How did you decide to do feature engineering?

We haven't used any feature engineering techniques per se, though we have spent some effort on data pre-processing: stripping HTML tags, scripts, and styles, removing punctuation, digits, repeated whitespace, and non-alphanumeric characters, and finally stemming and removing stop words.

- How did you decide which models to try (if you decide to train any models)?

We've decided to use a Siamese network because it is a popular choice for tasks with small, imbalanced data sets. The choice of layers is also common in the field: we interleave dropout layers with dense layers that have a decreasing number of units. Lastly, we L2-normalize the output embeddings, so that the dot product of two embeddings is their cosine similarity.

- How did you perform validation of your model?

Validation during training is handled by the TensorFlow library; we only pass in a validation set. The validation set was obtained with a stratified split of the provided data into training (80%) and validation (20%) sets.

- What metrics did you measure?

During training, we measure binary accuracy on the document pairs. In the evaluation phase, we measure per-class precision, recall, and F1 scores on the validation set.

- How do you expect your model to perform on test data (in terms of your metrics)?

We expect performance similar to, or somewhat below, what we measured on the validation set: an F1 score of about 0.36 on label = 1, about 0.74 on label = 2, and 0.00 on label = 3, where we still hope for F1 > 0 on the test data.

- How fast will your algorithm perform and how could you improve its performance if you would have more time?

Each epoch takes around 30s to execute. We could reduce that by running the model on GPUs.

- How do you think you would be able to improve your algorithm if you would have more data?

With more data, we could build a more complex model, try different loss functions, and use pre-trained models.

- What potential issues do you see with your algorithm?

It is very prone to overfitting, most likely because of the small data set.
Precision and recall on label = 3 are both zero on the validation set, and label 3 is never predicted on the test set, which is concerning and should be addressed.