# Data Project from PeakData - Guided Solution

## Table of Contents

* [Assignment](#a)
* [Introduction](#i)
* [Loading Data](#ld)
* [Exploratory Data Analysis](#da)
* [Modeling](#m)
    * [Data Generators](#dg)
    * [Model Definition](#md)
* [Model Evaluation](#me)
* [Prediction](#p)
* [Answers to Questions](#atq)

## Assignment <a class="anchor" id="a"></a>

Your task is to create an algorithm that takes an HTML page as input and infers whether the page contains information about a cancer tumor board.

_What is a tumor board?_ A tumor board is a consilium of doctors (usually from different disciplines) who discuss the cancer cases in their departments. If you want to know more, you can read this [article](https://www.cancer.net/blog/2017-07/what-tumor-board-expert-qa).


As a final output from this task you should provide a `submission.csv` file for the test data set with two columns: document ID and a prediction, and a Jupyter notebook with code and documentation giving answers to the following questions:

- How did you decide to handle this amount of data?
- How did you decide to do feature engineering?
- How did you decide which models to try (if you decide to train any models)?
- How did you perform validation of your model?
- What metrics did you measure?
- How do you expect your model to perform on test data (in terms of your metrics)?
- How fast will your algorithm perform and how could you improve its performance if you would have more time?
- How do you think you would be able to improve your algorithm if you would have more data?
- What potential issues do you see with your algorithm?

## Introduction <a class="anchor" id="i"></a>

As a first step in solving this problem, we will load the provided CSV files using the Pandas library. Pandas is a very useful library for manipulating small to medium-sized data sets, and it implements readers for many data formats, such as Excel, CSV / TSV, JSON, HTML, and Parquet. Intermediate knowledge of Pandas will often get you a long way in your day-to-day job as a data scientist, so exploring its possibilities is a worthwhile endeavour.

Pandas will load the CSV files into a structure called dataframe. In the cells below we read the provided files. The training CSV file contains 100 rows, with three columns: `URL`, `doc_id`, and a `label`. The test CSV file has 48 rows with two columns: `URL` and `doc_id`. The goal is to train a machine learning model that can predict a `label` for the documents provided in the test CSV based on the data that is available in the training CSV.

In [1]:
import pandas as pd

In [2]:
train_csv = pd.read_csv(filepath_or_buffer="train.csv")
# .shape prints a tuple of #rows, #columns
print("Training set shape", train_csv.shape)
# .head() prints the first n (=5) rows of the dataframe
train_csv.head()

Training set shape (100, 3)


Unnamed: 0,url,doc_id,label
0,http://elbe-elster-klinikum.de/fachbereiche/ch...,1,1
1,http://klinikum-bayreuth.de/einrichtungen/zent...,3,3
2,http://klinikum-braunschweig.de/info.php/?id_o...,4,1
3,http://klinikum-braunschweig.de/info.php/?id_o...,5,1
4,http://klinikum-braunschweig.de/zuweiser/tumor...,6,3


In [3]:
test_csv = pd.read_csv(filepath_or_buffer="test.csv")
print("Test set shape", test_csv.shape)
test_csv.head()

Test set shape (48, 2)


Unnamed: 0,url,doc_id
0,http://chirurgie-goettingen.de/medizinische-ve...,0
1,http://evkb.de/kliniken-zentren/chirurgie/allg...,2
2,http://krebszentrum.kreiskliniken-reutlingen.d...,7
3,http://marienhospital-buer.de/mhb-av-chirurgie...,15
4,http://marienhospital-buer.de/mhb-av-chirurgie...,16


In [4]:
tumor_keywords = pd.read_csv(filepath_or_buffer="keyword2tumor_type.csv")
print("Tumor keywords set shape", tumor_keywords.shape)
tumor_keywords.head()

Tumor keywords set shape (126, 2)


Unnamed: 0,keyword,tumor_type
0,senologische,Brust
1,brustzentrum,Brust
2,breast,Brust
3,thorax,Brust
4,thorakale,Brust


We have 100 documents in the training set and 48 in the test set. In the training set, 32 documents do not mention a tumor board (label = 1), 59 documents mention a tumor board, but we are not certain whether it is the main focus of the page (label = 2), and 9 documents are certainly dedicated to tumor boards (label = 3).

In [5]:
# the .size() method counts the number of rows for each group
train_csv.groupby(by="label").size()

label
1    32
2    59
3     9
dtype: int64

## Loading Data <a class="anchor" id="ld"></a>

Now that we are familiar with the provided data, we can start writing methods to read the actual HTML files. As a first step we implement a `read_html(doc_id)` method that loads the corresponding HTML file from the `htmls` directory. The code in this notebook assumes that the files are in the same working directory as the notebook itself. If that is not the case, please modify the paths where needed. We will use the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method in Pandas. For each row in the calling dataframe, `apply` passes the corresponding value as an argument to the given function, executes it and collects the result.

In this case, we are calling `apply` to the `doc_id` column (thus, for each row we pass in the `doc_id` as an argument) and are reading in the corresponding HTML file from the directory. We return the HTML contents and store it in an `html` column.
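
As a minimal illustration of how `apply` works on a single column, the sketch below uses a toy dataframe and a hypothetical `html_path` helper (not part of the notebook) that only builds the file path instead of reading the file.

```python
import pandas as pd


def html_path(doc_id: int) -> str:
    # hypothetical stand-in for read_html: builds the path, reads nothing
    return f"htmls/{doc_id}.html"


toy = pd.DataFrame({"doc_id": [1, 3, 4]})
# apply calls html_path once per value and returns a new Series
toy["path"] = toy["doc_id"].apply(html_path)
print(toy)
```

The result has the same index as the input column, which is why it can be assigned straight back as a new column.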

In [6]:
def read_html(doc_id: int) -> str:
    """
    Reads the HTML file that belongs to the given document ID.
    We decode with 'latin1' rather than the more common 'utf-8':
    'latin1' maps every byte to a character, so reading never
    fails, but pages that were saved as UTF-8 will show garbled
    umlauts. For more info about the encoding, see:
    https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    """
    with open(file=f"htmls/{doc_id}.html",
              mode="r",
              encoding="latin1") as f:
        html = f.read()
    return html


# this will store the actual HTML text in the 'html' column
train_csv["html"] = train_csv["doc_id"].apply(read_html)
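
The garbled umlauts in some of the later sample outputs (for example `fã¼r`) suggest that at least some pages are UTF-8 encoded. A possible alternative, sketched here under that assumption, is to try UTF-8 first and fall back to `latin1` only when decoding fails. The helper name `read_text_with_fallback` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def read_text_with_fallback(path: str) -> str:
    """Decode as UTF-8 first; fall back to latin1, which never fails."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every possible byte to a character
        return raw.decode("latin1")
```

Swapping this into `read_html` would change the extracted texts, so the outputs shown in this notebook would no longer match exactly.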

In [7]:
# print a sample to get familiar with the data at this point
# the random_state argument is needed to provide deterministic output
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_..."
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or..."
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<..."
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me..."
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T..."


As mentioned in the **Tips** section, we can use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package to parse the HTML content for the data that we need. To use the package as we intend, in addition to installing `BeautifulSoup` with the `pip install beautifulsoup4` command, you will need to install an HTML parser, [lxml](https://lxml.de/elementsoup.html), with the `pip install lxml` command. The `lxml` library is OS-dependent, so we recommend reading its Installation page to get the most accurate instructions.

In [9]:
import warnings

from bs4 import BeautifulSoup

warnings.filterwarnings(action="ignore")


def extract_html_text(html: str) -> str:
    """
    Extracts the text from the provided HTML.
    The 'lxml' parser is fast and lenient with malformed
    markup. Note that the input here is an already decoded
    string, so no encoding detection takes place.
    """
    bs = BeautifulSoup(markup=html, features="lxml")
    for script in bs(name=["script", "style"]):
        # remove all <script> and <style> tags from the HTML
        script.decompose()
    return bs.get_text(separator=" ")


# extract text elements from the HTML
train_csv["html_text"] = train_csv["html"].apply(extract_html_text)
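
If `lxml` cannot be installed on your system, the same idea (collect the text, skip `<script>` and `<style>`) can be sketched with Python's standard-library `html.parser`. This is a simplified alternative, not the BeautifulSoup approach used above, and the names `TextExtractor` and `extract_text_stdlib` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only while we are outside of skipped tags
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text_stdlib(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = ("<html><head><style>p {color: red}</style></head><body>"
          "<p>Tumorboard</p><script>var x = 1;</script><p>Termine</p>"
          "</body></html>")
print(extract_text_stdlib(sample))
```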

In [10]:
# printing a sample to observe the data
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html,html_text
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_...",\n \n \n \n \n Prostata-Karzinom-Zentrum - Sch...
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me...",\n \n \n \n Darmzentrum Rheinpfalz » Zentren A...
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...


We are making progress, but we immediately observe an issue: the large number of newline symbols `\n` at the beginning of each document. Ideally, we want clean text with no special characters, in a proper, human-readable format. To achieve that, we will use some of the methods in the [gensim](https://github.com/rare-technologies/gensim) library.

There are several pre-processing methods in the `gensim.parsing` module that suit our use case. We will use some of them. Feel free to explore other methods as well.

In [11]:
from gensim.parsing import preprocessing


def preprocess_html_text(html_text: str) -> str:
    """
    The preprocessing consists of the following six steps:

    1. Strips all non-alphanumerical characters.
    2. Strips all multiple whitespaces.
    3. Strips all punctuation.
    4. Strips all numerical characters.
    5. Converts to lowercase and then stems the text
       (gensim's stemmer is the English Porter stemmer).
    6. Removes all stop-words (gensim's default list is English).
    """
    preprocessed_text = preprocessing.strip_non_alphanum(s=html_text)
    preprocessed_text = preprocessing.strip_multiple_whitespaces(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_punctuation(s=preprocessed_text)
    preprocessed_text = preprocessing.strip_numeric(s=preprocessed_text)

    preprocessed_text = preprocessing.stem_text(text=preprocessed_text)
    preprocessed_text = preprocessing.remove_stopwords(s=preprocessed_text)
    return preprocessed_text


train_csv["preprocessed_html_text"] = train_csv["html_text"].apply(preprocess_html_text)
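
Since gensim's default stop-word list and stemmer target English, a German-aware variant of the cleaning steps may be worth trying. The sketch below uses only the standard library, skips stemming, and relies on a tiny illustrative stop-word list. Both `GERMAN_STOPWORDS` and `preprocess_german` are assumptions for illustration, not part of the notebook.

```python
import re

# tiny illustrative list; a real stop-word list would be much longer (assumption)
GERMAN_STOPWORDS = {"der", "die", "das", "und", "für", "in", "ein", "eine"}


def preprocess_german(text: str) -> str:
    text = re.sub(r"[\W_]+", " ", text)  # non-alphanumerics (incl. punctuation) -> space
    text = re.sub(r"\d+", " ", text)     # digits -> space
    tokens = text.lower().split()        # lowercase; split() also collapses whitespace
    return " ".join(t for t in tokens if t not in GERMAN_STOPWORDS)


print(preprocess_german("\n \n Die Tumorkonferenz für Brustkrebs: Mittwoch, 15 Uhr!"))
# prints: tumorkonferenz brustkrebs mittwoch uhr
```

Umlauts survive because `\W` is Unicode-aware for Python strings, provided the text was decoded correctly in the first place.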

In [12]:
# it is always recommended to print any intermediate format of a data set,
# to make sure that things are progressing as expected
train_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,label,html,html_text,preprocessed_html_text
83,http://www.sbk-vs.de/de/medizin/leistungen-und...,125,1,"\n\n<!DOCTYPE HTML>\n<html dir=""ltr"" lang=""de_...",\n \n \n \n \n Prostata-Karzinom-Zentrum - Sch...,prostata karzinom zentrum schwarzwald baar kli...
53,http://www.klinikum-esslingen.de/kliniken-und-...,85,2,"<!DOCTYPE html>\n<html xmlns=""http://www.w3.or...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,unser profil gefã ã und thoraxchirurgi kliniku...
70,http://www.malteser-kliniken-rhein-ruhr.de/med...,107,2,"<!DOCTYPE html>\n<html lang=""de"">\n<head>\n\n<...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,maltes kliniken rhein ruhr darmzentrum duisbur...
45,http://www.klilu.de/medizin__pflege/kliniken_u...,73,2,"<!DOCTYPE html>\n<html lang=""de""><head>\n\t<me...",\n \n \n \n Darmzentrum Rheinpfalz » Zentren A...,darmzentrum rheinpfalz zentren z kliniken und ...
44,http://www.kk-bochum.de/de/kliniken_zentren_be...,72,1,"<!DOCTYPE html PUBLIC ""-//W3C//DTD HTML 4.01 T...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \...,mund kiefer und plastisch gesichtschirurgi en ...


## Exploratory Data Analysis <a class="anchor" id="da"></a>

At this point we are ready to apply some machine learning algorithms that are able to build a model that can predict the label of the document based on its text. But before we do that, it would be useful to provide some EDA (Exploratory Data Analysis) plots on the texts. We will use the `plotly` library for that.

In [13]:
import plotly.express as px
import plotly.offline as pyo

# set notebook mode to work in offline
# should enable viewing of plotly plots in offline mode
pyo.init_notebook_mode(connected=True)

In [14]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(len), title="Distribution of Text Length (Character Count)")

One document has 170-179K characters. All others have fewer than 50K characters.

In [15]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: text.split(" ")).apply(len),
             title="Distribution of Text Length (Word Count)")

There is one document with 27-28K words. Other documents all have < 6K words in total.

In [16]:
px.histogram(x=train_csv["preprocessed_html_text"].apply(lambda text: set(text.split(" "))).apply(len),
             title="Unique Words Count")

There is one document with 6500-7000 unique words. All others consist of < 2000 unique words.
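
The histograms show a single document that is far longer than the rest. One simple way to flag such outliers programmatically is to compare each length to a multiple of the median. The function name, the factor of 5, and the toy counts below are all illustrative assumptions.

```python
from statistics import median


def flag_long_documents(word_counts, factor=5.0):
    """Return indices of documents longer than factor x the median length."""
    cutoff = factor * median(word_counts)
    return [i for i, n in enumerate(word_counts) if n > cutoff]


# toy word counts (made up); one document dwarfs the rest
toy_counts = [800, 1200, 950, 27500, 3100, 600]
print(flag_long_documents(toy_counts))
# prints: [3]
```

Whether to truncate, drop, or keep a flagged document is a modelling decision; the model below truncates every text to a fixed sequence length anyway.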

## Modeling <a class="anchor" id="m"></a>

We will use [TensorFlow](https://www.tensorflow.org/overview) to build a neural network that consumes the texts we have pre-processed and outputs a label for them. TensorFlow is widely used in the data science community for solving tasks that deal with non-tabular data, such as natural language processing, computer vision, audio processing, etc. It has great support and is highly optimised for creating production-ready, state-of-the-art neural network models.

To solve our task, which falls under the umbrella of natural language processing, we will use a model called a siamese network. Siamese networks can cope with class imbalance and small data set sizes. They are mostly used in **few-shot learning** tasks, like signature verification systems, face recognition, object detection, etc.

They fit our task well. We have a relatively small data set (100 samples), and we have class imbalance (only 9 training instances with `label = 3` compared to 59 instances with `label = 2` and 32 with `label = 1`).

Let's get started building our neural network. First, we import some needed libraries, like Tensorflow, numpy, and Python's random package.

In [17]:
import random
import numpy as np
import tensorflow as tf

# it is always useful to set the random seeds
# wherever possible, for reproducibility of results
random.seed(42)  # the pair generator below draws classes with `random`
np.random.seed(42)
tf.random.set_seed(seed=42)

### Data Generators <a class="anchor" id="dg"></a>

While it is not crucial in this task, we would like to show how to properly use TensorFlow (and Keras) by implementing a data generator class.

Most applications of TensorFlow are in the deep learning domain and require very large data sets, so the platform supports generators that load the data dynamically, batch by batch, instead of holding it all in memory. In the following cell we implement a data generator class called `Pair`. We implement an initialization method, and the `__len__` and `__getitem__` methods that get called during model training. We also add a custom `get_support_set` method which we will utilise later.

Data generators are very useful, and sometimes necessary, when working on deep learning applications. This [tutorial](https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly) provides a nice overview of the concept around them.

In [18]:
class Pair(tf.keras.utils.Sequence):
    def __init__(self, dataframe: pd.DataFrame, labels: pd.Series, n_batch: int, batch_size: int):
        """Initialization"""
        self.dataframe = dataframe
        self.labels = labels
        self.n_batch = n_batch
        self.batch_size = batch_size
        self.all_classes = set(self.labels)
        self.anchor_groups = {}
        for target_class in self.all_classes:
            self.anchor_groups[target_class] = {
                "positive": self.dataframe[self.labels == target_class],
                "negative": self.dataframe[self.labels != target_class]
            }

    def __len__(self):
        return self.n_batch

    def __getitem__(self, item):
        pairs = []

        for i in range(int(self.batch_size / 2)):
            anchor_class = random.randint(1, 3)
            anchor_group = self.anchor_groups[anchor_class]["positive"]
            not_anchor_group = self.anchor_groups[anchor_class]["negative"]

            anchor = anchor_group.sample(n=1).iloc[0]
            positive = anchor_group.sample(n=1).iloc[0]
            negative = not_anchor_group.sample(n=1).iloc[0]

            pairs.append([anchor, positive, 1])
            pairs.append([anchor, negative, 0])

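        random.shuffle(pairs)

        # keep texts and targets in separate arrays: a single mixed
        # array would silently turn the integer targets into strings
        data_pairs = np.array([[first, second] for first, second, _ in pairs])
        targets = np.array([target for _, _, target in pairs], dtype=np.float32)

        return data_pairs, tf.convert_to_tensor(targets, dtype=tf.float32)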

    def get_support_set(self, sample_size: int = 1):
        """Returns sample sets of certain size of each target class"""
        support_set = {}
        for target_class in self.all_classes:
            support_set[target_class] = self.anchor_groups[target_class]["positive"].sample(n=sample_size)
        return support_set
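
The sampling logic of `__getitem__` can be looked at in isolation. The following dependency-free sketch mirrors it on plain lists: for every anchor it emits one same-class pair with target 1 and one different-class pair with target 0. The function `sample_pairs` and the toy data are hypothetical.

```python
import random


def sample_pairs(texts, labels, n_anchors, seed=None):
    """Return (anchor, other, target) tuples: target 1 = same class, 0 = different."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    pairs = []
    for _ in range(n_anchors):
        anchor_class = rng.choice(classes)
        same = [t for t, l in zip(texts, labels) if l == anchor_class]
        different = [t for t, l in zip(texts, labels) if l != anchor_class]
        anchor = rng.choice(same)
        pairs.append((anchor, rng.choice(same), 1))       # positive pair
        pairs.append((anchor, rng.choice(different), 0))  # negative pair
    rng.shuffle(pairs)
    return pairs


texts = ["a1", "a2", "b1", "b2", "c1"]
labels = [1, 1, 2, 2, 3]
pairs = sample_pairs(texts, labels, n_anchors=4, seed=42)
print(len(pairs))
```

Because classes are drawn uniformly, the rare class appears as an anchor as often as the frequent ones, which is how pair sampling counteracts class imbalance. Note that, as in the `Pair` class, the positive sample may coincide with the anchor itself.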

### Model Definition <a class="anchor" id="md"></a>

Here, we define our model, a siamese network. The model is a sequence of layers, starting with a [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer. This layer accepts natural language (text) as input, and maps it to an integer sequence. At initialization time, we should provide a vocabulary of words for it to be able to map the words at prediction time.
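
To make the mapping from words to integers concrete, here is a simplified imitation in plain Python. It is not the Keras implementation; it only assumes the documented conventions of the `int` output mode: id 0 is reserved for padding, id 1 for unknown words, the remaining ids go to the most frequent words, and every sequence is truncated or padded to a fixed length. The names `build_vocab` and `vectorize` are hypothetical.

```python
from collections import Counter


def build_vocab(corpus, max_tokens):
    """Map the most frequent words to ids; 0 = padding, 1 = unknown word."""
    counts = Counter(word for doc in corpus for word in doc.split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(most_common)}


def vectorize(text, vocab, sequence_length):
    ids = [vocab.get(w, 1) for w in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))  # pad with zeros


corpus = ["tumorboard darmzentrum tumorboard", "darmzentrum klinik tumorboard"]
vocab = build_vocab(corpus, max_tokens=4)
print(vectorize("tumorboard klinik darmzentrum", vocab, sequence_length=5))
# prints: [2, 1, 3, 0, 0]
```

With `max_tokens=4` only the two most frequent words get their own id, so `klinik` is mapped to the unknown-word id 1.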

Following the text vectorization layer, we implement three [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layers, with two [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) layers in between. Lastly, we apply an [L2 normalization](https://www.tensorflow.org/api_docs/python/tf/math/l2_normalize) layer, which scales every encoding to unit length, so that the dot product of two encodings equals their cosine similarity.

In our implementation of a siamese network, we override the `__call__` method of the `tf.keras.Model` class. This is needed because of the nature of the model.

Siamese networks take triplets as input: an anchor (baseline) input, a sample from the same class as the anchor (positive), and a sample from a different class than the anchor (negative). The network passes the anchor through the encoder twice: once in combination with the positive sample, and a second time with the negative sample. Lastly, it compares the outputs of the two passes. In our implementation each triplet is fed to the model as two labelled pairs, (anchor, positive) with target 1 and (anchor, negative) with target 0. We expect the loss to be low when the "positive pass" yields a high similarity and the "negative pass" a low one, since we want samples from the same class to be as similar to each other as possible, and as different from other classes as possible.

This [tutorial](https://towardsdatascience.com/a-friendly-introduction-to-siamese-networks-85ab17522942) provides a brief overview of siamese networks.

In [19]:
class SiameseNetwork(tf.keras.Model):
    def __init__(self, corpora: pd.Series):
        super(SiameseNetwork, self).__init__()
        self.vectorizer_layer: tf.keras.layers.TextVectorization = tf.keras.layers.TextVectorization(
            max_tokens=2000,  # empirically chosen as best, higher number overfits (see the unique words count plot)
            output_mode="int",
            output_sequence_length=512
        )
        self.vectorizer_layer.adapt(corpora.values)
        self.encoder = tf.keras.Sequential(layers=[
            self.vectorizer_layer,
            tf.keras.layers.Dense(units=256, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=128, activation=tf.keras.activations.relu),
            tf.keras.layers.Dropout(rate=0.3),
            tf.keras.layers.Dense(units=64, activation=tf.keras.activations.relu),
            tf.keras.layers.Lambda(function=lambda x: tf.math.l2_normalize(x, axis=1))
        ])
        self.encoding_distance = tf.keras.layers.Dot(axes=1)

    def __call__(self, inputs, *args, **kwargs):
        anchors, supports = inputs[:, 0], inputs[:, 1]
        anchors_encoded = self.encoder(anchors)
        supports_encoded = self.encoder(supports)
        return self.encoding_distance((anchors_encoded, supports_encoded))

    def predict_with_support_set(self, entry, support_set: dict):
        """
        Custom method that wraps around the __call__ method.
        It passes the entry (input text) through the model once
        for every text in the support set, averages the similarity
        scores per class, and returns the class with the highest
        average, which gives a more stable estimate.
        """
        scores = {}
        for instance_class, texts in support_set.items():
            class_scores = ([self(np.array([entry, text]).reshape((-1, 2))) for text in texts])
            scores[instance_class] = tf.math.reduce_mean(class_scores)
        return max(scores, key=scores.get)
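
The combination of the L2 normalization at the end of the encoder and the `Dot` layer can be checked numerically. The NumPy sketch below uses made-up encodings and a hypothetical `l2_normalize` helper that follows the formula documented for `tf.math.l2_normalize`. It shows that the dot product of two unit-length vectors is their cosine similarity.

```python
import numpy as np


def l2_normalize(x, axis=1, eps=1e-12):
    # x / sqrt(max(sum(x**2), eps)), computed row by row
    norm = np.sqrt(np.maximum(np.sum(x ** 2, axis=axis, keepdims=True), eps))
    return x / norm


anchors = np.array([[3.0, 4.0], [1.0, 0.0]])
supports = np.array([[6.0, 8.0], [0.0, 2.0]])
# row-wise dot product of the normalized encodings
similarity = np.sum(l2_normalize(anchors) * l2_normalize(supports), axis=1)
print(similarity)
```

The first pair points in the same direction (similarity 1), the second pair is orthogonal (similarity 0). Because the last `Dense` layer uses a ReLU activation, the encodings are non-negative, so the similarity stays between 0 and 1 and can be fed to a binary cross-entropy loss.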

Below we instantiate a model object and compile it.

In [20]:
model = SiameseNetwork(corpora=train_csv["preprocessed_html_text"])

In [21]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

At this point we have our model, our data, and a data generator. We are ready to commence training.

But before we do that, let's split the data in `train_csv` into training and validation sets. We'll use 20% of the documents for validation and the remainder for training. Sklearn's `train_test_split` method is very convenient for doing that. Notice that we stratify the split by the label. This is important, because it prevents a class from being missing in one of the splits and keeps the class distribution of both splits representative.

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_csv["preprocessed_html_text"], train_csv["label"],
                                                      test_size=0.2,
                                                      random_state=42, stratify=train_csv["label"])
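
To see what stratification does to the class counts, here is a simplified, dependency-free sketch that splits every class separately. It is an illustration only: `stratified_split` is a hypothetical helper, and sklearn's internal allocation differs in its details.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=None):
    """Return (train_idx, valid_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)  # per-class share
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx


# same class counts as in the training CSV: 32 / 59 / 9
labels = [1] * 32 + [2] * 59 + [3] * 9
train_idx, valid_idx = stratified_split(labels, test_size=0.2, seed=42)
print(len(train_idx), len(valid_idx))
# prints: 80 20
```

With a 20% share, the validation set receives 6, 12 and 2 documents of the three classes, so even the rare class is represented.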

In [23]:
# training params
BATCH_SIZE = 64
N_BATCH = 100
# we instantiate training and validation data / pair generators
TRAIN_PAIR_GENERATOR = Pair(dataframe=X_train, labels=y_train, n_batch=N_BATCH, batch_size=BATCH_SIZE)
VALID_PAIR_GENERATOR = Pair(dataframe=X_valid, labels=y_valid, n_batch=N_BATCH, batch_size=BATCH_SIZE)

Finally, we put in an early stopping callback method that will stop the training prematurely if the validation loss does not decrease for 3 straight epochs.

In [24]:
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
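
The patience rule itself is simple enough to sketch in plain Python. The function below is a simplified imitation of the callback, under the assumption that any decrease in validation loss counts as an improvement. The name `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


print(stopping_epoch([0.70, 0.72, 0.71, 0.73], patience=3))
# prints: 4
```

In this toy run the best loss is reached in the first epoch, so the three following epochs without improvement end the training after epoch 4.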

In [25]:
history = model.fit(
    x=TRAIN_PAIR_GENERATOR,
    validation_data=VALID_PAIR_GENERATOR,
    epochs=10,
    callbacks=[early_stopping_callback],
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


## Model Evaluation <a class="anchor" id="me"></a>

Once we finish with the model training we can start evaluating the produced model. All training information is stored in the `history` object that is returned by the `model.fit()` method. In the plots below, we plot the model's training and validation accuracy and loss over the number of epochs.

As expected, the training accuracy increases with the number of epochs, while the validation accuracy starts decreasing, which is a sign of overfitting.
The same can be observed in the loss plot: the training loss decreases a bit and then stays flat, while the validation loss fluctuates in the beginning and then stabilizes.

In [26]:
import plotly.graph_objects as go

In [27]:
figure = go.Figure()

figure.add_scatter(y=history.history["binary_accuracy"], name="Training Accuracy")
figure.add_scatter(y=history.history["val_binary_accuracy"], name="Validation Accuracy")

figure.update_layout(dict1={
    "title": "Model Accuracy During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Accuracy"
}, overwrite=True)

figure.show()

In [28]:
figure = go.Figure()

figure.add_scatter(y=history.history["loss"], name="Training Loss")
figure.add_scatter(y=history.history["val_loss"], name="Validation Loss")

figure.update_layout(dict1={
    "title": "Model Loss During Training",
    "xaxis_title": "Epoch",
    "yaxis_title": "Loss"
}, overwrite=True)

figure.show()

Let's try making predictions on the validation set. The validation metrics are not fully indicative of the model's general performance on unseen data: the validation set was used during the training process (for early stopping), so they are a bit optimistic. In general, we would expect the metrics to be lower in a production setting (though, hopefully, not much lower).

Here we use the `predict_with_support_set` method of our siamese network class with a support set of size 7 (empirically chosen as best). The method passes the entry (input text) through the model together with 7 randomly drawn training samples of each class and averages the resulting scores per class, the goal being more stable predictions.

In [28]:
y_pred = X_valid.apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(7)
))
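
The decision rule behind `predict_with_support_set` can be sketched without the trained network. In the toy example below a word-overlap score stands in for the model's similarity output; `predict_with_support`, `word_overlap` and the texts are illustrative assumptions.

```python
from statistics import mean


def predict_with_support(entry, support_set, similarity):
    """Pick the class whose support texts are, on average, most similar to entry."""
    scores = {
        label: mean(similarity(entry, text) for text in texts)
        for label, texts in support_set.items()
    }
    return max(scores, key=scores.get)


def word_overlap(a, b):
    """Toy similarity (Jaccard overlap of word sets); stands in for the trained network."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


support = {
    1: ["klinik anfahrt parken", "klinik kontakt"],
    3: ["tumorboard termine onkologie", "tumorboard anmeldung"],
}
print(predict_with_support("tumorboard onkologie", support, word_overlap))
# prints: 3
```

Averaging over several support texts per class reduces the influence of any single unrepresentative sample.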

A useful method for evaluating a classification model is the `classification_report` method in the `sklearn.metrics` package. It prints a matrix of numbers, where each row is a target class and each column is a metric. The last column provides the corresponding target class count (for example, there are 6 documents with class label 1 in the validation set), and the bottom rows provide metric averages.

In [29]:
# build a classification report
from sklearn.metrics import classification_report

report = classification_report(y_true=y_valid, y_pred=y_pred, zero_division=0)
print(report)

              precision    recall  f1-score   support

           1       0.56      0.83      0.67         6
           2       0.82      0.75      0.78        12
           3       0.00      0.00      0.00         2

    accuracy                           0.70        20
   macro avg       0.46      0.53      0.48        20
weighted avg       0.66      0.70      0.67        20
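
The numbers in the report can be reproduced by hand. The per-class counts below are not printed by the notebook; they are inferred from the report itself (for example, a recall of 0.83 on 6 documents implies 5 correct predictions, and a precision of 0.56 then implies 9 predictions of that class). The helper `prf` is hypothetical.

```python
def prf(tp, predicted, actual):
    """Precision, recall and f1 from raw counts; 0.0 where undefined."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# label -> (true positives, times predicted, true count); inferred from the report
counts = {1: (5, 9, 6), 2: (9, 11, 12), 3: (0, 0, 2)}
rows = {label: prf(*c) for label, c in counts.items()}

total = sum(actual for _, _, actual in counts.values())
accuracy = sum(tp for tp, _, _ in counts.values()) / total
macro_f1 = sum(f1 for _, _, f1 in rows.values()) / len(rows)
weighted_f1 = sum(rows[label][2] * counts[label][2] for label in counts) / total

for label, (p, r, f1) in rows.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2))
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average weights all classes equally, which is why the zero scores on label 3 pull it well below the weighted average.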



## Prediction <a class="anchor" id="p"></a>

We apply the same set of pre-processing steps as we did for the training data.

In [30]:
# this stores the actual HTML text in the 'html' column
test_csv["html"] = test_csv["doc_id"].apply(read_html)

In [31]:
# extracts text elements from the HTML
test_csv["html_text"] = test_csv["html"].apply(extract_html_text)

In [32]:
# cleans the extracted text
test_csv["preprocessed_html_text"] = test_csv["html_text"].apply(preprocess_html_text)

In [33]:
test_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text
27,http://www.josephstift-dresden.de/pressemittei...,71,"<?xml version=""1.0"" encoding=""utf-8""?>\n<rss v...",\n \n Krankenhaus St. Joseph-Stift Dresden (PM...,krankenhau st joseph stift dresden pm http www...
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" pre...",\n \n \n \n \n \n \n \n \n \n \n Patienteninfo...,patienteninformationen klinik fã¼r frauenheilk...
26,http://www.interdisziplinaere-endoskopie.mri.t...,70,"<!DOCTYPE html>\n<html lang=""de"">\n\t<!--[if I...",\n \n \n \n \n \n \n Herzlich Willkommen â...,herzlich willkommen â interdisziplinã endoskop...
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134,"<!DOCTYPE html> \n<html lang=""de""> \n\t<head> ...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Z...,zweitmeinung â warum ein weiter einschã tzung ...
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68,"<!DOCTYPE html><html lang=""de"" class=""no-js""><...",Gelenkersatz Skip to main content hjk Die Ei...,gelenkersatz skip main content hjk die einrich...


To perform the prediction, we call the `apply` method on the `preprocessed_html_text` column of the `test_csv` dataframe and pass each pre-processed HTML text to the `predict_with_support_set` method of the `SiameseNetwork` class. The predictions are then stored in a `predictions` column.

In [34]:
test_csv["predictions"] = test_csv["preprocessed_html_text"].apply(lambda text: model.predict_with_support_set(
    entry=text,
    support_set=TRAIN_PAIR_GENERATOR.get_support_set(sample_size=7)
))

In [35]:
test_csv.sample(n=5, random_state=42)

Unnamed: 0,url,doc_id,html,html_text,preprocessed_html_text,predictions
27,http://www.josephstift-dresden.de/pressemittei...,71,"<?xml version=""1.0"" encoding=""utf-8""?>\n<rss v...",\n \n Krankenhaus St. Joseph-Stift Dresden (PM...,krankenhau st joseph stift dresden pm http www...,1
40,http://www.pius-hospital.de/kliniken/gynaekolo...,123,"<!DOCTYPE html>\n<html lang=""de"" dir=""ltr"" pre...",\n \n \n \n \n \n \n \n \n \n \n Patienteninfo...,patienteninformationen klinik fã¼r frauenheilk...,1
26,http://www.interdisziplinaere-endoskopie.mri.t...,70,"<!DOCTYPE html>\n<html lang=""de"">\n\t<!--[if I...",\n \n \n \n \n \n \n Herzlich Willkommen â...,herzlich willkommen â interdisziplinã endoskop...,1
43,http://www.uk-augsburg.de/krebsbehandlung/diag...,134,"<!DOCTYPE html> \n<html lang=""de""> \n\t<head> ...",\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Z...,zweitmeinung â warum ein weiter einschã tzung ...,2
24,http://www.hjk-muenster.de/unsere-kompetenzen/...,68,"<!DOCTYPE html><html lang=""de"" class=""no-js""><...",Gelenkersatz Skip to main content hjk Die Ei...,gelenkersatz skip main content hjk die einrich...,2


Below, we explore the predictions on the test set.

In [36]:
test_csv["predictions"].value_counts()

2    26
1    22
Name: predictions, dtype: int64

In [37]:
test_csv[["doc_id", "predictions"]]

Unnamed: 0,doc_id,predictions
0,0,2
1,2,2
2,7,2
3,15,2
4,16,2
5,24,2
6,31,2
7,32,1
8,36,1
9,38,2
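
The assignment asks for a `submission.csv` file with the document ID and the prediction. A minimal sketch of that last step follows, using a toy dataframe in place of `test_csv` and a temporary directory as the target. The helper `write_submission` is hypothetical, and the output column names are simply the ones used in this notebook.

```python
import os
import tempfile

import pandas as pd


def write_submission(frame: pd.DataFrame, path: str) -> None:
    """Write the two required columns, without the dataframe index."""
    frame[["doc_id", "predictions"]].to_csv(path, index=False)


# toy predictions standing in for test_csv (values are illustrative)
toy = pd.DataFrame({"url": ["u0", "u1"], "doc_id": [0, 2], "predictions": [2, 1]})
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(toy, out_path)

with open(out_path) as f:
    print(f.read())
```

In the notebook, the same call on `test_csv` with the path `"submission.csv"` would write the file next to the notebook.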


## Answers to Questions <a class="anchor" id="atq"></a>

In this section we provide answers to the questions that were posed in the beginning of the assignment.

**How did you decide to handle this amount of data?**

> We read the entire data set into memory, which is possible because it is relatively small. On top of that, we have used a data generator that assembles the training pairs dynamically, batch by batch; the same pattern scales to data sets that have to be loaded from disk on demand.

**How did you decide to do feature engineering?**

> We haven't used any feature engineering techniques per se, though we have spent some effort on data pre-processing, with steps like: removing punctuation, multiple whitespaces, non-alphanumerical characters, etc.

**How did you decide which models to try (if you decide to train any models)?**

> We've decided to use a siamese network because it is a popular choice for this kind of task (natural language processing, small data set, class imbalance). The choice of layers is also very common in the field: we interleave dropout layers with dense layers that have a decreasing number of units. Lastly, we apply L2 normalization so that every encoding has unit length.

**How did you perform validation of your model?**

> Validation is handled by the TensorFlow library during training; we just pass in a validation set. The validation set was obtained with a stratified split of the provided data into a training (80%) and a validation (20%) set.

**What metrics did you measure?**

> During training we measure binary accuracy. In the evaluation phase we measure per-class precision, recall and f1 scores on the validation set.

**How do you expect your model to perform on test data (in terms of your metrics)?**

> We expect somewhat similar performance to the validation set: around 0.5-0.6 f1 score on label = 1, around 0.8 f1 score on label = 2, and, we hope, f1 > 0 on label = 3.

**How fast will your algorithm perform and how could you improve its performance if you would have more time?**

> Each epoch takes around 30s to execute. We could improve that by running the model on GPUs.

**How do you think you would be able to improve your algorithm if you would have more data?**
> * Build a more complex model
> * Try different loss functions
> * Use pre-trained models

**What potential issues do you see with your algorithm?**
> * It is very prone to overfitting, though this is almost certainly because of the small data set.
> * We have zero precision and recall on label = 3, which is concerning and should be addressed.