<a href="https://colab.research.google.com/github/mavillot/FUNSD-Information-Extraction/blob/main/Text_Clasification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classifier


In this notebook we fine-tune a pretrained BERT model to classify the text blocks of the FUNSD forms into four categories: question, answer, header and other. We'll use the Hugging Face Transformers library.

The first thing we need to do is install the libraries.

## Libraries

In [1]:
%%capture
!pip install transformers

In [2]:
import cv2
import json
import os
import re
import pandas as pd
from pathlib import Path
import glob
import torch
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    AdamW,
    BertTokenizer,
    AutoModelForSequenceClassification,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
    DistilBertTokenizerFast,
)
from sklearn.model_selection import train_test_split

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Dataset

We download the dataset:

In [4]:
%%capture
!wget https://guillaumejaume.github.io/FUNSD/dataset.zip -O dataset.zip
!unzip dataset.zip

We create a Dataset Class in order to easily manage the dataset.

In [5]:
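class Dataset():
    """Iterates over the annotation files of one FUNSD split."""

    def __init__(self, path_anot):
        self.path_anot = path_anot

    def __iter__(self):
        # Yield the name of every file in the annotations directory
        with os.scandir(self.path_anot) as files:
            for file in files:
                yield file.name

    def __len__(self):
        # Number of annotation files
        return sum(1 for _ in self)

    def list_text_label(self, o):
        # Read one annotation file and return its block texts and label names
        with open(os.path.join(self.path_anot, o)) as f:
            anot = json.load(f)
        txt = []
        lbl = []
        # The first top-level value of the JSON holds the list of blocks
        for block in list(anot.values())[0]:
            txt.append(block['text'])
            lbl.append(block['label'])
        return (txt, lbl)

    def prep(self):
        # Collect the blocks of every file and map label names to integer ids
        dic = {'question': 0, 'answer': 1, 'header': 2, 'other': 3}
        text = []
        labels = []
        for file in self:
            txt, lbl = self.list_text_label(file)
            text.extend(txt)
            labels.extend(lbl)
        return (text, [dic[x] for x in labels])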
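
To make the expected input concrete, here is a minimal sketch of the structure that `list_text_label` reads. The file name `toy.json`, the key name `form` and the three blocks are invented for illustration. The sketch only assumes, as the class does, that the first top-level value of the annotation is a list of blocks with `text` and `label` fields.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical annotation mimicking the fields the Dataset class reads
toy_annotation = {
    "form": [
        {"text": "Name:", "label": "question"},
        {"text": "Jane Doe", "label": "answer"},
        {"text": "TRAVEL SHEET", "label": "header"},
    ]
}

# Same label-to-id mapping as in Dataset.prep
label_to_id = {'question': 0, 'answer': 1, 'header': 2, 'other': 3}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.json"
    path.write_text(json.dumps(toy_annotation))

    # Same extraction logic as list_text_label, applied to one file
    anot = json.loads(path.read_text())
    blocks = list(anot.values())[0]
    toy_texts = [block["text"] for block in blocks]
    toy_labels = [label_to_id[block["label"]] for block in blocks]

print(toy_texts)
print(toy_labels)
```

Each block contributes one text and one integer label, so the two lists returned by `prep` always have the same length.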

In [6]:
dataset_train=Dataset('dataset/training_data/annotations')
dataset_test=Dataset('dataset/testing_data/annotations')

With this class we can check the number of annotation files in a directory:

In [7]:
len(dataset_train)

149

## Train, validation and test set

From each annotation file we extract all the text blocks together with their labels. We can do this by calling the prep method.

In [8]:
text_train, labels_train=dataset_train.prep()
text_test, labels_test=dataset_test.prep()

In [25]:
text_train[:10]

['Attention:',
 'MARDEN- KANE. INC. 666 FIFTH AVE. NEW YORK, N.Y. 10103 (212) 582-6600',
 'TRAVEL INFORMATION SHEET',
 'This information will be used by Marden- Kane in making travel arrangements for your round- trip to Los Angeles, California.',
 'Finalist Name:',
 'Home Address:',
 'Home Telephone:',
 'Business Telephone:',
 'Name of Guests',
 'Relationship to Finalist:']

In [9]:
len(text_train)

7411

In [10]:
labels_train[:10]

[0, 2, 2, 3, 0, 0, 0, 0, 0, 0]
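
As a quick sanity check, the integer labels can be mapped back to their names. The sketch below inverts the mapping used in `prep` and applies it to the first five blocks shown above. The names `id_to_label`, `sample_texts` and `sample_labels` are introduced here for illustration only.

```python
# Inverse of the mapping used in Dataset.prep
id_to_label = {0: 'question', 1: 'answer', 2: 'header', 3: 'other'}

# First five blocks and label ids, copied from the cell outputs above
sample_texts = [
    'Attention:',
    'MARDEN- KANE. INC. 666 FIFTH AVE. NEW YORK, N.Y. 10103 (212) 582-6600',
    'TRAVEL INFORMATION SHEET',
    'This information will be used by Marden- Kane in making travel '
    'arrangements for your round- trip to Los Angeles, California.',
    'Finalist Name:',
]
sample_labels = [0, 2, 2, 3, 0]

# Pair every text with its readable label name
decoded = [(text, id_to_label[label]) for text, label in zip(sample_texts, sample_labels)]

for text, name in decoded:
    print(f"{name:>8}: {text[:40]}")
```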

We split the training set into a training set and a validation set:

In [11]:
train_texts, val_texts, train_labels, val_labels = train_test_split(text_train, labels_train, test_size=.2)

## Dataset for training

To train the model, we need a second dataset class that wraps the encodings of each text block together with its label.

In [12]:
class FUNDSDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings           # dict of lists returned by the tokenizer
        self.labels = torch.tensor(labels)   # integer class ids

    def __getitem__(self, idx):
        # One example: every encoding field as a tensor, plus its label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]    # already a tensor, so no re-wrapping is needed
        return item

    def __len__(self):
        return len(self.labels)

## Metrics
We define the metrics that we will use during training:

In [13]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
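
Macro averaging gives every class the same weight regardless of how many examples it has, so a weak minority class can pull the macro scores well below the accuracy. The following pure-Python sketch illustrates the idea on invented toy data. It is not scikit-learn's implementation: `macro_scores` is a hypothetical helper, and it averages over a fixed `num_classes`, whereas scikit-learn averages over the classes that actually appear.

```python
def macro_scores(labels, preds, num_classes):
    """Unweighted mean of per-class precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        pairs = list(zip(labels, preds))
        tp = sum(1 for y, p in pairs if y == c and p == c)  # true positives
        fp = sum(1 for y, p in pairs if y != c and p == c)  # false positives
        fn = sum(1 for y, p in pairs if y == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return (sum(precisions) / num_classes,
            sum(recalls) / num_classes,
            sum(f1s) / num_classes)


# Invented, imbalanced toy data: eight examples of class 0, two of class 1
toy_labels = [0] * 8 + [1] * 2
toy_preds = [0] * 8 + [0, 1]   # one class-1 example is misclassified as class 0

toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
toy_precision, toy_recall, toy_f1 = macro_scores(toy_labels, toy_preds, num_classes=2)

print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

In this toy case a single mistake on the small class costs only one tenth of the accuracy, but it halves that class's recall, and the macro recall drops accordingly.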

## Model

We won't train a model from scratch; instead we'll fine-tune a pretrained one from Hugging Face.

### bert-base-uncased

This is a pretrained model. We need to download the tokenizer and the model. 

The tokenizer converts each text into a sequence of integer token ids.
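
To illustrate what that means, here is a toy sketch of tokenization with padding and truncation. It is not BERT's WordPiece algorithm: `toy_tokenize` and `toy_vocab` are hypothetical and simply split on whitespace, while the real tokenizer also adds special tokens such as [CLS] and [SEP] and breaks rare words into subwords. The sketch only shows the general idea behind the `input_ids` and `attention_mask` fields.

```python
def toy_tokenize(texts, vocab, max_length=6):
    """Whitespace tokenizer with truncation and padding (illustration only)."""
    pad_id, unk_id = vocab["[PAD]"], vocab["[UNK]"]
    # Map every word to its id and cut sequences longer than max_length
    rows = [[vocab.get(word, unk_id) for word in text.lower().split()][:max_length]
            for text in texts]
    # Pad every sequence to the length of the longest one in the batch
    longest = max(len(row) for row in rows)
    input_ids = [row + [pad_id] * (longest - len(row)) for row in rows]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical vocabulary covering the two example texts
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "home": 2, "address:": 3,
             "telephone:": 4, "name": 5, "of": 6, "guests": 7}

toy_encodings = toy_tokenize(["Home Address:", "Name of Guests"], toy_vocab)
print(toy_encodings["input_ids"])
print(toy_encodings["attention_mask"])
```

The shorter text is padded with the id of [PAD], and its attention mask is 0 at that position, so all rows in a batch end up with the same length.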

In [14]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

We use the tokenizer to obtain the encodings:

In [15]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(text_test, truncation=True, padding=True)

With the encodings we can build the new dataset:

In [16]:
train_dataset = FUNDSDataset(train_encodings, train_labels)
val_dataset = FUNDSDataset(val_encodings, val_labels)
test_dataset = FUNDSDataset(test_encodings, labels_test)

## Training

In [17]:
args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy = "epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
)

In [18]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [19]:
trainer.train() 

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 1 | No log | 0.703418 | 0.761969 | 0.481762 | 0.776688 | 0.492763 | 15.5088 | 95.623 |
| 2 | No log | 0.634845 | 0.784221 | 0.553703 | 0.802811 | 0.541542 | 15.4981 | 95.689 |
| 3 | 0.714800 | 0.600079 | 0.795684 | 0.619926 | 0.703187 | 0.597101 | 15.5026 | 95.661 |
| 4 | 0.714800 | 0.592964 | 0.807148 | 0.644702 | 0.754005 | 0.611683 | 15.4917 | 95.728 |
| 5 | 0.714800 | 0.598956 | 0.80445 | 0.654857 | 0.731629 | 0.626024 | 15.4826 | 95.785 |
| 6 | 0.470400 | 0.593406 | 0.811868 | 0.675058 | 0.729844 | 0.649939 | 15.4477 | 96.001 |
| 7 | 0.470400 | 0.595397 | 0.810519 | 0.672203 | 0.724208 | 0.647302 | 15.4995 | 95.68 |


TrainOutput(global_step=1302, training_loss=0.5460872679445234, metrics={'train_runtime': 1872.9535, 'train_samples_per_second': 0.695, 'total_flos': 3761771813025408.0, 'epoch': 7.0, 'init_mem_cpu_alloc_delta': 335172, 'init_mem_gpu_alloc_delta': 439078400, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1533975, 'train_mem_gpu_alloc_delta': 1811347456, 'train_mem_cpu_peaked_delta': 94409736, 'train_mem_gpu_peaked_delta': 3483717632})
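
The step count and the "No log" entries in the training table can be reproduced with a little arithmetic. This is a sketch under three assumptions that the notebook does not state: training ran on a single device, scikit-learn rounds the size of the held-out split up, and the logging interval was left at 500 steps, which is the default of TrainingArguments.

```python
import math

n_blocks = 7411        # len(text_train) reported earlier
val_fraction = 0.2     # test_size passed to train_test_split
batch_size = 32        # per_device_train_batch_size
epochs = 7             # num_train_epochs
logging_steps = 500    # assumption: default value, not set in this notebook

# Assumption: the held-out size is rounded up, the rest goes to training
n_val = math.ceil(val_fraction * n_blocks)
n_train = n_blocks - n_val

# Assumption: one device, and the last partial batch of each epoch is kept
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Epochs during which the training loss gets logged (steps 500, 1000, ...)
logged_epochs = [math.ceil(step / steps_per_epoch)
                 for step in range(logging_steps, total_steps + 1, logging_steps)]

# Cross-check: evaluation runtime times samples per second from the first epoch
n_val_from_table = round(15.5088 * 95.623)

print(n_train, n_val, steps_per_epoch, total_steps, logged_epochs, n_val_from_table)
```

Under these assumptions the total matches global_step=1302, the validation size agrees with the runtime columns, and the first training loss is logged during epoch 3, which would explain why epochs 1 and 2 show "No log" and why the value only changes again at epoch 6.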

In [20]:
compute_metrics(trainer.predict(test_dataset))

{'accuracy': 0.8074614065180102,
 'f1': 0.6867658773301732,
 'precision': 0.7084665950719728,
 'recall': 0.6719656588563511}

In [21]:
trainer.save_model('/content/drive/MyDrive/Modelo/trainer_bert')