Use labels as tokens to classify a sequence of text.
For each label, a new token is added to the tokenizer and to the word embeddings. Every input sequence has all of the label tokens prepended to the beginning. The output embedding of each label token is put through a linear layer that produces a single logit. The label logits are then put through a sigmoid (multi-label) or a softmax (single-label) to determine which label(s) apply to this example. See the image below for a visualization.
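A minimal sketch of this forward pass, assuming a BERT-style encoder from `transformers`; the label set, the bracketed token names, and the freshly initialized classifier head are illustrative, not the exact implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

labels = ["anger", "joy", "sadness"]           # hypothetical label set
label_tokens = [f"[{label}]" for label in labels]  # one new token per label

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Add the label tokens to the tokenizer and resize the word embeddings.
tokenizer.add_tokens(label_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Prepend every label token to the input text.
text = "I can't believe they cancelled the show."
inputs = tokenizer(" ".join(label_tokens) + " " + text, return_tensors="pt")
outputs = model(**inputs)

# The output embeddings of the label tokens sit right after [CLS],
# i.e. at positions 1 .. num_labels.
label_states = outputs.last_hidden_state[:, 1 : 1 + len(labels), :]

# A linear layer maps each label embedding to a single logit.
classifier = torch.nn.Linear(model.config.hidden_size, 1)
logits = classifier(label_states).squeeze(-1)   # shape: (batch, num_labels)

probs = torch.sigmoid(logits)                   # multi-label
# probs = torch.softmax(logits, dim=-1)         # single-label alternative
```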
- It seems to get better results: wandb report.
- If there are many labels, the sequences get long, which slows down training.
- If there are more than 512 labels, it won't work at all, because the input would consist entirely of label tokens with no room left for the text (512 being the model's maximum sequence length).
- It requires custom code.
Currently, the label token embeddings are randomly initialized. It would be interesting to see whether it would be beneficial to initialize them as:
- the same as the CLS token embedding
- the average of all token embeddings
- the average of the token embeddings of the label's own text
To explain the last point, let's use the image above as a reference. For the label anger, the text `"anger"` would be passed to the tokenizer:

```python
ids = tokenizer("anger", add_special_tokens=False).input_ids
# [4963]
```

In this case it results in one token, but it might be multiple depending on the label. The next step is to average those token embeddings:

```python
import torch

avg_value = model.embeddings.word_embeddings.weight.data[torch.tensor(ids)].mean(dim=0)
```

This average now becomes the embedding value for the label token `[anger]`:

```python
label_token_id = tokenizer("[anger]", add_special_tokens=False).input_ids[0]
model.embeddings.word_embeddings.weight.data[label_token_id] = avg_value
```

The thought is that the label's meaning might be a helpful starting place for turning it into a classification token.
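Putting the three options together, here is a sketch that initializes every label token in a loop. The label names, and the assumption that the `[label]` tokens have already been added to the tokenizer, are illustrative:

```python
import torch

labels = ["anger", "joy", "sadness"]  # hypothetical label set
emb = model.embeddings.word_embeddings.weight.data

for label in labels:
    label_token_id = tokenizer(f"[{label}]", add_special_tokens=False).input_ids[0]

    # Option 1: copy the [CLS] token embedding.
    # emb[label_token_id] = emb[tokenizer.cls_token_id].clone()

    # Option 2: the average of all token embeddings.
    # emb[label_token_id] = emb.mean(dim=0)

    # Option 3: the average of the embeddings of the label's own text.
    ids = tokenizer(label, add_special_tokens=False).input_ids
    emb[label_token_id] = emb[torch.tensor(ids)].mean(dim=0)
```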
