<a href="https://colab.research.google.com/github/onlyabhilash/Transformers_with_NLP/blob/main/ByT5/Multilexnorm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center"><b>ÚFAL at MultiLexNorm 2021:</b></h1>
<h2 align="center"><b>Improving Multilingual Lexical Normalization by Fine-tuning ByT5</b></h2>
<br>

<p align="center">
  <b>David Samuel & Milan Straka</b>
</p>

<p align="center">
  <i>
    Charles University<br>
    Faculty of Mathematics and Physics<br>
    Institute of Formal and Applied Linguistics
  </i>
</p>
<br>

<p align="center">
  <a href="https://aclanthology.org/2021.wnut-1.54/"><b>Paper</b></a><br>
  <a href="https://github.com/ufal/multilexnorm2021"><b>GitHub repository</b></a>
</p>


<p align="center">
  <img src="https://github.com/ufal/multilexnorm2021/raw/master/img/overall.png" alt="Overall architecture." width="720"/>  
</p>

# Introduction

This is the official demo notebook for the winning entry to the [*W-NUT 2021: Multilingual Lexical Normalization (MultiLexNorm)* shared task](https://noisy-text.github.io/2021/multi-lexnorm.html), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages.

Our system is based on [ByT5](https://arxiv.org/abs/2105.13626), which we first pre-train on synthetic data and then fine-tune on authentic normalization data. It achieves the best performance by a wide margin in the intrinsic evaluation, and also the best performance in the extrinsic evaluation through dependency parsing. In addition to this notebook, we release the fine-tuned models on Hugging Face (as `ufal/byt5-small-multilexnorm2021-{language}`, the identifier used in the demo below) and the source code on [GitHub](https://github.com/ufal/multilexnorm2021).


# Interactive demo

## 1. Initialize

First, we clone the repository, install the required modules, import them, and implement a lightweight `Data` class (a subclass of `AbstractData`) that handles new sentences.

In [None]:
!git clone https://github.com/ufal/multilexnorm2021
%cd multilexnorm2021

!pip3 install torchmetrics==0.4.1
!pip3 install transformers==4.8.2
!pip3 install pytorch_lightning==1.3.8

from google.colab import output
from utility.twokenize import tokenizeRawTweetText
import torch
import warnings
import math
warnings.filterwarnings('ignore')

import sys
sys.path.append("/content/multilexnorm2021")

from config.params import Params
from data.dataset.inference import InferenceDataset
from data.inference_data import CollateFunctor
from data.abstract_data import AbstractData
from torch.utils.data import DataLoader
from model.model import Model
from utility.output_assembler import OutputAssembler
from pytorch_lightning.utilities.apply_func import move_data_to_device


class Data(AbstractData):
    def __init__(self, inputs, args):
        super().__init__(args)
        self.dataset = InferenceDataset(inputs)
        collate_fn = CollateFunctor(self.tokenizer)

        self.dataloader = DataLoader(
            self.dataset, batch_size=self.batch_size, shuffle=False, num_workers=self.threads, collate_fn=collate_fn
        )


args = Params().load(["--config", "config/inference.yaml"])
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

output.clear()

## 2. Write a sentence to be normalized

Please select the language of the sentence you want to normalize and, most importantly, provide the sentence itself. Don't forget to hit the "run" button afterwards! :)

In [None]:
language = "en"  #@param ['da', 'de', 'en', 'es', 'hr', 'iden', 'it', 'nl', 'sl', 'sr', 'tr', 'trde']
input = "yo hv to let ppl decide wat dey wanna do" #@param {type:"string"}

args.dataset.language = language
path = f"ufal/byt5-small-multilexnorm2021-{language}"
args.model.pretrained_lm = args.dataset.tokenizer = path
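
The cell above builds the model identifier by appending the selected language code to a fixed prefix. A minimal sketch of that step with a guard against unsupported codes might look as follows; the helper `model_path` and the tuple `SUPPORTED_LANGUAGES` are hypothetical names introduced here, while the language codes and the prefix are copied from the cell above.

```python
# Language codes copied from the form field in the cell above.
SUPPORTED_LANGUAGES = (
    "da", "de", "en", "es", "hr", "iden",
    "it", "nl", "sl", "sr", "tr", "trde",
)


def model_path(language: str) -> str:
    """Return the model identifier this notebook uses for `language`.

    Hypothetical helper: it only mirrors the f-string from the cell above
    and rejects codes that are not in the form field.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    return f"ufal/byt5-small-multilexnorm2021-{language}"


print(model_path("en"))
```

Failing early on an unknown code is a design choice of this sketch; it avoids waiting for a download attempt to discover a typo in the language field.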

## 3. Run!

This cell automatically downloads the matching fine-tuned model from Hugging Face and runs inference. We release the fine-tuned models together with the source code.

In [None]:
tokens = [tokenizeRawTweetText(input)]

data = Data(tokens, args)
assembler = OutputAssembler(".", args, data.dataset)
model = Model(args, data).to(device)
model.eval()

for batch in data.dataloader:
    batch = move_data_to_device(batch, device)
    # Use a separate name so the `output` module imported from google.colab is not overwritten.
    predictions = model.generate(batch)
    assembler.step(predictions)

assembler.flush()

with open("outputs.txt") as f:
    words = [w.split('\t')[1].strip() for w in f.readlines() if w != '\n']

output.clear()
print(f"ORIGINAL:   {input}")
print(f"NORMALIZED: {' '.join(words)}")

ORIGINAL:   yo hv to let ppl decide wat dey wanna do
NORMALIZED: you have to let people decide what they want to do
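
Since `outputs.txt` holds one line per input token, the original and normalized forms can be compared token by token. The sketch below lists only the tokens that were changed. The function name `changed_tokens` is hypothetical, and the token alignment in `pairs` is inferred from the printed sentences above rather than read from the file, so treat it as an assumption.

```python
def changed_tokens(pairs):
    """Return the (original, normalized) pairs whose two sides differ."""
    return [(orig, norm) for orig, norm in pairs if orig != norm]


# Token pairs for the example sentence. The alignment is an assumption
# inferred from the ORIGINAL / NORMALIZED lines above; note that "wanna"
# maps to the two words "want to".
pairs = [
    ("yo", "you"), ("hv", "have"), ("to", "to"), ("let", "let"),
    ("ppl", "people"), ("decide", "decide"), ("wat", "what"),
    ("dey", "they"), ("wanna", "want to"), ("do", "do"),
]

for orig, norm in changed_tokens(pairs):
    print(f"{orig} -> {norm}")
```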


## Addendum: some more options

### A. Get more predictions for each word

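You can change the number of beams in the config file (`config/inference.yaml`) to get a different number of candidate normalizations for each word. The candidates and their log-probabilities are written to `raw_outputs.txt`.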

In [None]:
with open("raw_outputs.txt") as f:
    for line in f.readlines():
        original, *options = line.split('\t')
        options = zip(options[::2], options[1::2])

        print(original)
        for i, (word, logprob) in enumerate(options):
            print(f"  {i}: {word} -- with likelihood ~ {math.exp(float(logprob)):.2f}")
        print()

yo
  0: you -- with likelihood ~ 1.00
  1: you going to -- with likelihood ~ 0.10
  2: you will -- with likelihood ~ 0.05
  3: you have -- with likelihood ~ 0.05
  4: you gonna -- with likelihood ~ 0.05
  5: you're -- with likelihood ~ 0.05
  6: youlugually -- with likelihood ~ 0.04
  7: youluvely -- with likelihood ~ 0.04
  8: youluguay -- with likelihood ~ 0.04
  9: you would -- with likelihood ~ 0.03
  10: you will have -- with likelihood ~ 0.03
  11: you will you -- with likelihood ~ 0.03
  12: you would have -- with likelihood ~ 0.03
  13: you wasn't -- with likelihood ~ 0.03
  14: youull -- with likelihood ~ 0.02
  15: youluve -- with likelihood ~ 0.02

hv
  0: have -- with likelihood ~ 1.00
  1: government -- with likelihood ~ 0.05
  2: gevernment -- with likelihood ~ 0.04
  3: have announce -- with likelihood ~ 0.02
  4: have avenue -- with likelihood ~ 0.02
  5: have average -- with likelihood ~ 0.02
  6: have've -- with likelihood ~ 0.02
  7: have fuck -- with likelihood ~ 0.
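
The loop above can be wrapped into a small reusable parser, for example to keep only candidates above a score cutoff. This is a sketch under the same layout assumption as the cell above: each line holds the original token followed by alternating candidate and log-probability fields, separated by tabs. The names `parse_candidates` and `keep_confident`, the `threshold` default, and the sample line are hypothetical and are not real model output.

```python
import math


def parse_candidates(line):
    """Split one line of raw_outputs.txt into (original, [(word, score), ...]).

    Assumes the layout used by the loop above: the original token, then
    alternating candidate / log-probability fields, all tab-separated.
    """
    original, *fields = line.rstrip("\n").split("\t")
    candidates = [
        (word, math.exp(float(logprob)))
        for word, logprob in zip(fields[::2], fields[1::2])
    ]
    return original, candidates


def keep_confident(candidates, threshold=0.5):
    """Keep the candidates whose score reaches `threshold` (hypothetical cutoff)."""
    return [(word, score) for word, score in candidates if score >= threshold]


# Made-up line for illustration only; the numbers are not real model output.
line = "hv\thave\t-0.01\tgovernment\t-3.0\n"
original, candidates = parse_candidates(line)
print(original, keep_confident(candidates))
```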

### B. Process multiple sentences at once

In [None]:
sentences = [
    "fyrst sntnce",
    "scond one .",
    "and yet another one of them sentencesss"
]

tokens = [tokenizeRawTweetText(sentence) for sentence in sentences]

data = Data(tokens, args)
assembler = OutputAssembler(".", args, data.dataset)
model = Model(args, data).to(device)
model.eval()

for batch in data.dataloader:
    batch = move_data_to_device(batch, device)
    # Use a separate name so the `output` module imported from google.colab is not overwritten.
    predictions = model.generate(batch)
    assembler.step(predictions)

assembler.flush()

with open("outputs.txt") as f:
    for line in f.readlines():
        print(line.strip())

fyrst	first
sntnce	sntnce

scond	second
one	one
.	.

and	and
yet	yet
another	another
one	one
of	of
them	them
sentencesss	sentences
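
In the output above, each line holds an original token and its normalization separated by a tab, and a blank line separates sentences. A minimal sketch for turning such lines back into one normalized string per sentence could look like this; `group_sentences` is a hypothetical helper, and the sample lines are copied from the first two sentences of the output above.

```python
def group_sentences(lines):
    """Group "original<TAB>normalized" lines into a list of sentences.

    A blank line ends the current sentence, matching the output printed above.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank separator: close the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        original, normalized = line.split("\t", 1)
        current.append((original, normalized))
    if current:
        sentences.append(current)
    return sentences


# Lines copied from the output above.
lines = [
    "fyrst\tfirst\n", "sntnce\tsntnce\n", "\n",
    "scond\tsecond\n", "one\tone\n", ".\t.\n", "\n",
]
for sentence in group_sentences(lines):
    print(" ".join(normalized for _, normalized in sentence))
```

With a real run, the same function can be applied to the lines read from `outputs.txt`, for example `group_sentences(open("outputs.txt"))`.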

