
<h2 align=center>Automatic Document Generation with Transformers for Layout Analysis Enhancement</h2>
<h3 align=center>PyCon 2022</h3>
<h3 align=center>Florence, June 3rd 2022</h3>

<br>


```python
__AUTHOR__ = {'lp': ("Lorenzo Pisaneschi",
                     "lorenzo.pisaneschi1@gmail.com",
                     "https://github.com/pisalore/master_thesis")}

__TOPICS__ = ['Natural Language Processing', 'Text Mining', 'Layout Analysis']

__KEYWORDS__ = ['Machine Learning', 'AI', 'NLP', 'Transformers', 'GROBID']
```


## Who I am


<br/>
<div>
  <ul>
      <li>MSc Degree in Computer Engineering <strong>@University of Florence</strong></li>
      <li>Research interests:
        <ul>
          <li>Natural Language Processing</li>
          <li>Machine Learning</li>
          <li>Information Extraction</li>
        </ul>
      </li>
      <li>Backend Developer <strong>@Nephila</strong></li>
      <li>First time speaker</li>
   </ul>
</div>


## Overview

- Document Layout Analysis

- Dataset

- Semi-automatic annotation pipeline

- Automatic document generation

- Method Evaluation and Conclusions

- Q&A


## Document Layout Analysis

<div>
    <div style="float: right">
        <img src="images/Image0.png" width="380">
    </div>
    <br/>
    <br/>
    <br/>
    <br/>
     <div>
        <li>Identify the most significant parts</li>
        <br/>
        <li>Do text mining: extract syntactic and semantic information</li>
        <br/>
        <li>Exploit state-of-the-art NLP techniques for further inference</li>
    </div>
</div>


## Document Layout Analysis
### Problems

- Needs a **huge amount of data**

- Difficult task: we must deal with many **classes**

- High costs: data collection, annotation, pre-processing

- Work with **unstructured data**: text



## Document Layout Analysis

### Ideas

- **Use less data**

- **Automatically annotate objects**

- **Restructure what is unstructured**


## Proposed method

<div>
    <div style="float: right">
        <img src="images/general.png" width="380">
    </div>
    <br/>
    <br/>
    <br/>
     <div>
        <li style="height:150px;"><strong>Information retrieval, Named Entity Recognition, Text Mining</strong> from unstructured data (PDF)
        <li style="height:100px;">Speed up <strong>label processing</strong>
        <li style="height:50px;">Generate <strong> synthetic data</strong> exploiting <strong>Transformers</strong>
    </div>
</div>

## Dataset

<div>
    <div style="float: left">
        <img src="images/image1.png" width="380">
    </div>
    <br/>
    <br/>
    <br/>
    <br/>
     <div>
        <li>Scientific papers in PDF format from ICDAR 2019 (International Conference on Document Analysis and Recognition)</li>
        <br/>
        <li>Specific layout style (IEEE scientific articles)</li>
        <br/>
        <li>375 PDFs, 2088 pages in total</li>
    </div>
</div>



## Semi-automatic annotation pipeline

<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 480px;" src="images/pipeline.drawio.png">

## GROBID (GeneRation Of BIbliographic Data)


<div>
    <div style="float: left">
        <img src="images/input.png" width="450">
    </div>
    <br/>
    <br/>
    <br/>
    <br/>
     <div>
        <li>We need to convert <strong>unstructured</strong> data into parsable <strong>structured</strong> data</li>
        <br/>
        <li>GROBID exposes information that cannot be retrieved from the PDF directly</li>
    </div>
</div>



## Structured data

In [7]:
import pathlib
import requests

for pdf_path in pathlib.Path("data/pdfs").rglob("*.pdf"):
    xml_path = pathlib.Path("data/xml")
    xml_path = xml_path.joinpath(
        pdf_path.relative_to("data/pdfs").parents[0], pdf_path.stem + ".xml"
    )
    xml_path.parents[0].mkdir(parents=True, exist_ok=True)
    print("Processing {}...".format(pdf_path))
    # Open the PDF and request the GROBID TEI XML with coordinates
    with open(pdf_path, "rb") as pdf:
        xml_response_content = requests.post(
            url="http://localhost:8070/api/processFulltextDocument",
            files={"input": pdf.read()},
            data={
                "teiCoordinates": [
                    "persName", "figure", "ref", "biblStruct", "formula", "s", "head",
                ]
            },
        )
    # Write the XML file ("w" so that a rerun overwrites instead of appending)
    with open(xml_path, "w", encoding="utf8") as xml_file:
        xml_file.write(xml_response_content.text)


Processing data/pdfs/ICDAR19/1bEnBHAZurlCHlfcg65fed/622U0zHbbvLmKLHzTHo6gZ.pdf...


### GROBID examples

#### Formula

```xml
<formula xml:id="formula_0" coords="3,119.49,353.98,173.62,25.87">
	L(x, y) = log e xy n i = 0 e xi
	<label>1</label>
</formula>
```

#### Table

```xml
<figure type="table" xml:id="tab_2" coords="5,57.42,193.23,496.01,81.97">
	<head>TABLE III :</head>
	<label>III</label>
	<figDesc>Mean IU (%) on the test set of the different architectures trained on the three manuscripts of the DIVA-HisDB dataset. For pre-training, only the encoder...
	</figDesc>
	<table coords="5,65.74,219.88,479.38,55.32">...
	</table>
</figure>
```
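
The `coords` attributes in the examples above pack a page number and a bounding box into one string. A minimal sketch of how such a string could be unpacked, assuming GROBID's documented `page,x,y,width,height` layout with several boxes separated by `;` (the `Box` type and the `parse_coords` name are illustrative and are not part of the thesis code):

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Parse a GROBID `coords` attribute into a list of boxes."""
    boxes = []
    for chunk in coords.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing separator
        page, x, y, width, height = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(width), float(height)))
    return boxes


# Coordinates of the formula shown above
formula_boxes = parse_coords("3,119.49,353.98,173.62,25.87")
print(formula_boxes[0])
```

Keeping the page number next to the box makes it easy to group the annotations by page later on.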

## Parsing

<div>
    <div style="float: right">
        <img src="images/parsing.png" width="450">
    </div>
    <br/>
    <br/>
     <div>
     Two main tools:
     <br>
     <ul>
        <li><strong>PDFMiner</strong>
        <li style="height:50px;"><strong>BeautifulSoup4</strong>
     </ul>
      <li style="height:100px;">The former iterates over <strong>PDF</strong> text and images
      <li style="height:100px;">The latter works with the corresponding structured <strong>XML/TEI</strong> file generated by GROBID
      <li style="height:100px;">A <code>parser</code> does the work
    </div>
</div>



### What does the code do?

In [10]:
%run master_thesis/main --annotations-path data/png --pdfs-path data/pdfs --xml-path data/xml --pickle-filename docs_instances.pickle --debug True

Parsing data/pdfs/ICDAR19/1bEnBHAZurlCHlfcg65fed/622U0zHbbvLmKLHzTHo6gZ.pdf
Converting 1bEnBHAZurlCHlfcg65fed from pdf to PNG...
PDF file successfully converted.


- It aligns the information retrieved from the **PDF** and the **XML** using the `parse_doc` function

- It saves pages as **PNG** for method validation

- It saves annotations in `docs_instances.pickle`



## Parse doc

We iterate over the **PDF** and its related **XML** at the same time

In [None]:
from master_thesis.file_parser.tei import TEIFile
from master_thesis.utilities.parser_utils import (
    are_similar,
    do_overlap,
    element_contains_authors,
    check_keyword,
    calc_coords_from_pdfminer,
    check_subtitles,
    adjust_overlapping_coordinates,
)


def parse_doc(pdf_path, xml_path, annotations_path, debug):
    ...  # body omitted on the slide
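
The helper names imported above (`calc_coords_from_pdfminer`, `do_overlap`) suggest two geometric steps needed to align PDF and XML content. The sketch below shows one plausible form of each, assuming PDFMiner boxes are `(x0, y0, x1, y1)` with the origin at the bottom-left of the page while the annotations use a top-left origin. The function names `to_top_left` and `boxes_overlap` are hypothetical and do not reproduce the thesis helpers:

```python
def to_top_left(bbox, page_height):
    """Convert a bottom-left-origin (x0, y0, x1, y1) box to a top-left origin."""
    x0, y0, x1, y1 = bbox
    # The top edge (y1) becomes the smaller y value once the axis is flipped.
    return (x0, page_height - y1, x1, page_height - y0)


def boxes_overlap(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes share a region of positive area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


# A text line near the top of a Letter page (792 pt tall)
flipped = to_top_left((100, 700, 200, 720), page_height=792)
print(flipped)
print(boxes_overlap(flipped, (150, 80, 300, 120)))
```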


## TEI wrapper class

This class is used by `parse_doc` to extract specific content as **properties**


In [None]:
from bs4 import BeautifulSoup


class TEIFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.lxml_soup = self.read_tei(filename)
        self.xml_soup = self.read_tei(filename, markup="xml")

    def read_tei(self, tei_file, markup="lxml"):
        # Parse the TEI file with the requested BeautifulSoup parser
        with open(tei_file, "r", encoding="utf8") as tei:
            return BeautifulSoup(tei, features=markup)

    # Method bodies are omitted on the slide
    def title(self): ...

    def abstract(self): ...

    def tables(self): ...

    def authors(self): ...

    ...


## Annotations

<div>
    <div style="float: left">
        <img src="images/convert_annotations.png" width="450">
    </div>
    <br/>
    <br/>
     <div>
     <br>
      <li style="height:100px;">Get and <strong>visualize annotations</strong> for "debugging"
      <br/>
      <li style="height:100px;">Convert them to <strong>PASCAL VOC format</strong>
      <br/>
      <li style="height:100px;"><strong>Correct</strong> them for better results
    </div>
</div>



### What do our annotations look like?

In [None]:
from master_thesis.utilities.parser_utils import load_doc_instances

annotations = load_doc_instances("docs_instances.pickle")
paper_name = '622U0zHbbvLmKLHzTHo6gZ'
print(annotations[paper_name])

```python
{
    "title": {
        "content": "Selective Super-Resolution for Scene Text Images\n",
        "coords": [162.1726, 92.92924728000003, 448.70301648, 106.59828728000002],
    },
    "subtitles": {
        1: [
            {
                "title_content": "I. INTRODUCTION",
                "coords": (139.45, 445.41, 215.64999999999998, 454.61),
            },
            ...,
        ],
        2: [
            {
                "title_content": "III. SUPER-RESOLUTION CONVOLUTIONAL NEURAL NETWORKS",
                "coords": (67.29, 220.14, 287.63, 229.33999999999997),
            },
            ...,
        ],
    },
    # Other object categories
}
```

### Correction using labelImg

<div>
    <div style="float: left">
        <img src="images/labelImg.PNG" width="390">
    </div>
    <div>
        <br/>
    <br/>
        <li style="height:100px;">Automatic annotations are a little noisy
      <br/>
      <li style="height:100px;">Since we want to start with <strong>little data</strong>, we manually correct annotations
      <br/>
      <li style="height:100px;"><strong>labelImg</strong> is a tool that perfectly suits our needs
      <br/>
      <li style="height:100px;">Finally we convert PASCAL VOC annotations to <strong>COCO</strong> for LayoutTransformer
    </div>
</div>



### Annotation examples


<div>
    <div style="float: right">
        <img src="images/622U0zHbbvLmKLHzTHo6gZ_0.png" width="390">
    </div>
    <div>
        <img src="images/622U0zHbbvLmKLHzTHo6gZ_4.png" width="390">
    </div>
</div>

### PASCAL VOC annotation example

One XML file for **every single page** of each document in the initial dataset

```XML
<annotation>
    <folder>1bEnBHAZurlCHlfcg65fed</folder>
    <filename>622U0zHbbvLmKLHzTHo6gZ_0.png</filename>
    <path>
        /home/lorenzo/Documenti/Workshops/pycon2022/data/png/ICDAR19/1bEnBHAZurlCHlfcg65fed/622U0zHbbvLmKLHzTHo6gZ_0.png
    </path>
    <size>
        <width>612</width>
        <height>792</height>
    </size>
    <object>
        <name>title</name>
        <bndbox>
            <xmin>162.1726</xmin>
            <xmax>448.70301648</xmax>
            <ymin>92.92924728000003</ymin>
            <ymax>106.59828728000002</ymax>
        </bndbox>
    </object>
...
</annotation>
```
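
A file like the one above can be produced with the standard library alone. The following is a minimal sketch, assuming each page is described by a list of `(category, (xmin, ymin, xmax, ymax))` pairs. The `to_voc_xml` name and its signature are illustrative and are not the thesis code:

```python
import xml.etree.ElementTree as ET


def to_voc_xml(folder, filename, width, height, objects):
    """Build a PASCAL VOC style <annotation> tree.

    `objects` is a list of (category_name, (xmin, ymin, xmax, ymax)) tuples.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "folder").text = folder
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for name, (xmin, ymin, xmax, ymax) in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        # Same tag order as in the example above
        for tag, value in (("xmin", xmin), ("xmax", xmax), ("ymin", ymin), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(value)
    return root


page = to_voc_xml(
    "1bEnBHAZurlCHlfcg65fed",
    "622U0zHbbvLmKLHzTHo6gZ_0.png",
    612,
    792,
    [("title", (162.1726, 92.92924728000003, 448.70301648, 106.59828728000002))],
)
xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
```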

### COCO annotation example

A single file containing all train/validation data to be used with Transformers

```json
{
    "images": [{
            "file_name": "1qxE3a6GBT9ILRIQA6WzoL/1llA34vNzpDFcSV4Y1Omk2_2.png",
            "height": 792,
            "id": 1001842,
            "width": 612
        },...],
    "annotations": [{
            "segmentation": [[68.0, 74.0, 289.0, 74.0, 289.0, 180.0, 68.0, 180.0]],
            "area": 23426.0,
            "is_crowd": 0,
            "image_id": 1001842,
            "bbox": [68.0, 74.0, 221.0, 106.0],
            "category_id": 6,
            "id": 2027900
        },...],
    "categories": [{
            "supercategory": "",
            "id": 1,
            "name": "text"
        },...
            {
            "supercategory": "",
            "id": 6,
            "name": "figure"
        },
        ...]
}
```
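
The relationship between the two formats is simple arithmetic: PASCAL VOC stores two corners, while COCO stores the top-left corner plus width and height. A minimal sketch of the conversion for one box (the `voc_box_to_coco` name is illustrative), checked against the annotation shown above:

```python
def voc_box_to_coco(xmin, ymin, xmax, ymax):
    """Convert corner coordinates to the COCO bbox, area and polygon fields."""
    width = xmax - xmin
    height = ymax - ymin
    return {
        "bbox": [xmin, ymin, width, height],
        "area": width * height,
        # Rectangle polygon, clockwise from the top-left corner
        "segmentation": [[xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax]],
    }


converted = voc_box_to_coco(68.0, 74.0, 289.0, 180.0)
print(converted)
```

The result reproduces the `bbox`, `area` and `segmentation` values of the example annotation.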

## Transformers as generative models


<div>
    <div style="float: right">
        <img width="500" src="images/transformres2.png">
    </div>
    <div>
    <br>
        <li><strong>Transformers</strong> revolutionized the way we do NLP</li>
    <br>
     <li>Their <strong>encoder-decoder</strong> architecture <strong>weights</strong> the significance of each part of the input sequence <strong>differently</strong></li>
    <br>
        <li>Could we exploit Transformers to understand the <strong>relationships between instances in layouts</strong>?</li>
    </div>
</div>




## LayoutTransformer

<div>
     <li>Learns <strong>layout schemas</strong> and generates novel ones
     <li>A <strong>scene layout</strong> is an unordered set of graphical primitives
</div>
$$\mathcal{G} = (s_{\langle bos \rangle}; s_1, x_1, y_1, w_1, h_1; \ldots; s_n, x_n, y_n, w_n, h_n; s_{\langle eos \rangle})$$
<br>
<img style="display: block;
  margin-left: auto;
  margin-right: auto;
  width: 680px;"
  src="images/layout_transformer_arch.drawio.png">
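
To make the sequence representation concrete, the sketch below flattens a list of layout elements into one token sequence framed by begin and end tokens. The token numbering (categories first, then coordinate bins, then the two special tokens), the 256-bin grid and the `layout_to_sequence` name are assumptions of this sketch, not the LayoutTransformer implementation:

```python
def layout_to_sequence(elements, num_categories, grid=256, page_size=(612, 792)):
    """Flatten (category_id, x, y, w, h) elements into one token sequence.

    Assumed token ids: [0, num_categories) are categories, the next `grid`
    ids are coordinate bins, followed by <bos> and <eos>.
    """
    bos = num_categories + grid
    eos = bos + 1
    page_w, page_h = page_size

    def quantize(value, extent):
        # Map a coordinate in [0, extent] to an integer bin in [0, grid - 1].
        bin_index = int(value / extent * (grid - 1))
        return num_categories + min(max(bin_index, 0), grid - 1)

    tokens = [bos]
    for category_id, x, y, w, h in elements:
        tokens += [
            category_id,
            quantize(x, page_w),
            quantize(y, page_h),
            quantize(w, page_w),
            quantize(h, page_h),
        ]
    tokens.append(eos)
    return tokens


# One "figure" element (category 6) on a Letter page
sequence = layout_to_sequence([(6, 68.0, 74.0, 221.0, 106.0)], num_categories=8)
print(sequence)
```

With this encoding, `<bos>` plus one complete element takes six tokens, and each additional element adds five.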



## LayoutTransformer training

Training is straightforward

```shell
python main.py \
 --train_json /path/to/annotations/train.json \
 --val_json /path/to/annotations/val.json \
 --exp <exp_name>
 ```

<br>
We can also monitor training progress in real time with the **WANDB** (Weights & Biases) tool: https://wandb.ai/site




## Training details
<div>
    <li> <strong>Embeddings input dimension</strong> d = 512
    <li> <strong>Layers</strong> = 6
    <li> <strong>Heads</strong> = 8
    <li> <strong>Learning Rate</strong> = $4.5 \times 10^{-6}$
    <li> <strong>Adam optimizer</strong>
    <li> <strong>Epochs</strong> = 25
    <li> <strong>Batch size</strong> = 64
    <li> <strong>Teacher forcing</strong>
    <li> <strong>Label smoothing</strong>
</div>
$$ \textbf{Loss}: \mathbb{E}_{\theta \sim \mathrm{Disc.}}\left[D_{KL}\left(\mathrm{SoftMax}(\theta^L) \parallel p(\theta')\right)\right] + \lambda \, \mathbb{E}_{\theta \sim \mathrm{Cont.}}\left[\lVert \theta - \theta' \rVert_1\right]$$
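
Two items in the list above, teacher forcing and label smoothing, are easy to illustrate without a deep learning framework. This is a minimal sketch: the smoothing factor `epsilon=0.1` and both function names are assumptions, since the slides do not state the values used in training:

```python
def teacher_forcing_pairs(sequence):
    """Inputs are the ground-truth tokens; targets are the same tokens shifted by one."""
    return sequence[:-1], sequence[1:]


def smooth_labels(target_index, vocab_size, epsilon=0.1):
    """Move `epsilon` of the probability mass from the target to the other classes."""
    off_value = epsilon / (vocab_size - 1)
    distribution = [off_value] * vocab_size
    distribution[target_index] = 1.0 - epsilon
    return distribution


# The model always sees the true previous tokens, never its own predictions.
inputs, targets = teacher_forcing_pairs([264, 6, 36, 31, 100, 42, 265])
print(inputs, targets)

# Smoothed target distribution that the softmax output is compared against
smoothed = smooth_labels(target_index=2, vocab_size=5, epsilon=0.1)
print(smoothed)
```

The smoothed distribution plays the role of $p(\theta')$ in the KL term of the loss.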



## Inference: layouts generation

From **LayoutTransformer training** we obtain a **model** which can be used to sample an arbitrarily large number of **new synthetic layouts**.

## How do we sample synthetic layouts?

In [None]:
from pathlib import Path

import torch
from torch.utils.data import DataLoader

# JSONLayout, GPT, GPTConfig and sample come from the LayoutTransformer code base;
# exp_dir (the output directory, a pathlib.Path) is defined elsewhere in the module.


def inference(model_state_path, data_json_path, n_gen_layouts, debug):
    dataset = JSONLayout(data_json_path)
    model_conf = GPTConfig(
        dataset.vocab_size, dataset.max_length, n_layer=6, n_head=8, n_embd=512
    )
    generative_model = GPT(model_conf)
    generative_model.load_state_dict(torch.load(model_state_path))
    device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
    generative_model = torch.nn.DataParallel(generative_model).to(device)
    loader = DataLoader(
        dataset,
        shuffle=True,
        pin_memory=True,
        batch_size=len(dataset.data),
        num_workers=0,
    )
    for it in range(n_gen_layouts):
        for x, _ in loader:
            # Condition on the first 6 tokens of one randomly drawn real layout
            x_cond = x[:1].to(device)
            layouts = (
                sample(
                    generative_model,
                    x_cond[:, :6],
                    steps=dataset.max_length,
                    temperature=1.0,
                    sample=False,
                    top_k=None,
                )
                .detach()
                .cpu()
                .numpy()
            )
            for layout in layouts:
                layout_dir = Path(exp_dir.joinpath(f"layout_{it}"))
                layout_dir.mkdir(mode=0o777, parents=False, exist_ok=True)
                filename = f"{layout_dir.as_posix()}/{it}"
                dataset.save_annotations(layout, f"{filename}.json", it)


## How does the sample function work?

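In [None]:
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
    # Unwrap DataParallel, if present, to read the model's context length
    block_size = (
        model.module.get_block_size()
        if hasattr(model, "module")
        else model.get_block_size()
    )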
    model.eval()
    for k in range(steps):
        x_cond = (
            x if x.size(1) <= block_size else x[:, -block_size:]
        )  # crop context if needed
        logits, _ = model(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)

    return x
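
The loop above repeats one decision per step: scale the logits, optionally keep the top k, apply softmax, then pick a token. The dependency-free sketch below reproduces that single step on a plain list of logits so the effect of each option can be inspected. The `next_token` name and the `do_sample` and `rng` parameters are illustrative and are not the `top_k_logits` helper used by the real code:

```python
import math
import random


def next_token(logits, temperature=1.0, top_k=None, do_sample=False, rng=random):
    """Pick the next token id from raw logits, mirroring one step of the loop."""
    scaled = [value / temperature for value in logits]
    if top_k is not None:
        # Keep the k largest logits and mask everything else out.
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [value if value >= threshold else -math.inf for value in scaled]
    # Numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    if do_sample:
        # Draw a token proportionally to its probability
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy choice: the most likely token
    return max(range(len(probs)), key=probs.__getitem__)


logits = [2.0, 0.5, 1.0, -1.0]
greedy = next_token(logits)
sampled = next_token(logits, temperature=0.8, top_k=2, do_sample=True, rng=random.Random(0))
print(greedy, sampled)
```

With `sample=False`, as in the inference call, the choice is greedy, so the variety between generated layouts comes from the different conditioning prefixes.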

## Generated layouts

- We obtain a **JSON annotation file** (in COCO format) for **each generated layout**
- We synthesized **10k layouts** in **Letter** format (612x792)

<div>
    <div style="float: left">
        <img src="images/10.png" width="380">
    </div>
    <div>
        <img src="images/20.png" width="360">
    </div>
</div>


## Postprocessing

Since we used **little data**, we obtained **noisy annotations**.

We fix them by iterating over the generated objects to:

- **Merge** overlapping objects belonging to the same category
- **Separate** overlapping objects belonging to different categories
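
The merge step can be sketched as follows, assuming COCO-style `[x, y, w, h]` boxes paired with a category id. This is an illustration of the idea rather than the thesis implementation, the function names are hypothetical, and the separation of boxes from different categories is not covered:

```python
def overlaps(a, b):
    """True when two [x, y, w, h] boxes share a region of positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def union(a, b):
    """Smallest [x, y, w, h] box that contains both inputs."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return [x0, y0, x1 - x0, y1 - y0]


def merge_same_category(objects):
    """Merge overlapping boxes of the same category until none overlap.

    `objects` is a list of (category_id, [x, y, w, h]) pairs.
    """
    merged = [(category, list(box)) for category, box in objects]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                (cat_i, box_i), (cat_j, box_j) = merged[i], merged[j]
                if cat_i == cat_j and overlaps(box_i, box_j):
                    merged[i] = (cat_i, union(box_i, box_j))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan: the enlarged box may overlap others
    return merged


noisy = [(1, [0, 0, 10, 10]), (1, [5, 5, 10, 10]), (6, [8, 8, 4, 4]), (1, [100, 100, 5, 5])]
cleaned = merge_same_category(noisy)
print(cleaned)
```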



## Example post-processed layouts I

 <div>
    <div style="float: left">
        <img src="images/10.png" width="400">
    </div>
    <div>
        <img src="images/10_corrected3.png" width="380">
    </div>
</div>



## Example post-processed layouts II

 <div>
    <div style="float: left">
        <img src="images/20.png" width="400">
    </div>
    <div>
        <img src="images/20_corrected3.png" width="380">
    </div>
</div>

## Document generation

We have the layouts, but we still lack the content.

- Generate text from the original papers using **NLTK**
- Retrieve images, tables and equations from the web
- Integrate formulas in the synthetic PDF with **LaTeX**



## Text generation

- It is easy with the **Natural Language Toolkit** (NLTK) module.
- We used the simplest possible model: **3-grams**

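In [None]:
from random import randint, randrange

from nltk import word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline


# Excerpt wrapped in a function so that it runs; the function name is illustrative
def generate_text_instances(category, all_text, all_text_lengths, text_num):
    # Tokenize the text.
    corpus = [word_tokenize(s) for s in all_text]
    # Preprocess the tokenized text for 3-gram language modelling
    n = 3
    train_data, vocab = padded_everygram_pipeline(n, corpus)
    model = MLE(n)
    model.fit(train_data, vocab)

    print(f"Generate {category} stuff...")
    generated_instances = []
    for _ in range(text_num):
        num_words = all_text_lengths[randrange(0, len(all_text_lengths))]
        sentence = generate_sent(model, num_words, random_seed=randint(1, 150))
        # Skip empty sentences and ones containing extraction debris or padding symbols
        if sentence and not any(x in sentence for x in ["(cid:", "<s>", "</ s>"]):
            generated_instances.append(sentence)
    # Distinct elements
    return list(set(generated_instances))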
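
To show what a 3-gram model does under the hood, here is a dependency-free sketch: it counts which word follows each pair of words and samples the next word proportionally to those counts. It is a simplified illustration and not NLTK's `MLE` implementation; the function names and the toy corpus are made up for the example:

```python
import random
from collections import Counter, defaultdict


def fit_trigrams(sentences):
    """Count which word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for first, second, third in zip(padded, padded[1:], padded[2:]):
            counts[(first, second)][third] += 1
    return counts


def generate(counts, max_words, rng):
    """Sample words proportionally to their counts until </s> or max_words."""
    context = ("<s>", "<s>")
    words = []
    while len(words) < max_words:
        followers = counts[context]
        word = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        if word == "</s>":
            break
        words.append(word)
        context = (context[1], word)
    return " ".join(words)


toy_corpus = [
    ["layout", "analysis", "is", "hard"],
    ["layout", "analysis", "needs", "data"],
]
trigram_counts = fit_trigrams(toy_corpus)
generated = generate(trigram_counts, max_words=10, rng=random.Random(0))
print(generated)
```

Because the model only ever looks at the two previous words, the generated text is locally plausible but has no global meaning, which is enough for filling layout regions.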

## Generate sentence

- We generate text for all our **text categories** (title, subtitle, text, abstract, keywords, authors, references, caption)
- Each model is fitted with **specific text data** taken from original papers during parsing



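In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Assumption: detokenize is NLTK's Treebank detokenizer
detokenize = TreebankWordDetokenizer().detokenize


def generate_sent(model, num_words, random_seed=42):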
    content = []
    for token in model.generate(num_words=num_words, random_seed=random_seed):
        if token == "<s>":
            continue
        if token == "</s>":
            break
        content.append(token)
    return detokenize(content)

## Doclab: put everything together



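In [None]:
import os

from fpdf import FPDF

# lgt_dir, gen_pdfs, FONTS and generate_document are defined elsewhere in the module


def generate_synthetic_documents():
    for idx, json_path in enumerate(lgt_dir.rglob("*.json.json")):
        # Instantiate pdf object (fpdf)
        pdf = FPDF(unit="pt")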
        pdf.add_page()
        # Add all necessary fonts for each type of text (titles, subtitles, abstract, authors...) to let them
        # available during documents writing
        for _, font in FONTS.items():
            pdf.add_font(font["fontname"], "", font["tff"], uni=True)
        filename = json_path.stem.split(".json")[0]
        # output file path
        out_filepath = f"{gen_pdfs}/{filename}.pdf"
        if not os.path.exists(out_filepath):
            generate_document(pdf, json_path, filename, out_filepath)