<a href="https://colab.research.google.com/github/mavillot/FUNSD-Information-Extraction/blob/main/Clasification/Text_Clasification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classifier


In this notebook we use the Hugging Face `transformers` library to build text classification models that assign each FUNSD text block to one of four classes: question, answer, header, or other.


## Libraries

In [None]:
%%capture
pip install transformers

In [None]:
import cv2
import json
import os
import re
import pandas as pd
from pathlib import Path
import glob
import torch
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    AdamW,
    BertTokenizer,
    AutoModelForSequenceClassification,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
    DistilBertTokenizerFast,
)
from sklearn.model_selection import train_test_split

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Dataset

We download the dataset:

In [None]:
%%capture
!wget https://guillaumejaume.github.io/FUNSD/dataset.zip -O dataset.zip
!unzip dataset.zip

We create a `Dataset` class to make the annotation files easy to manage.

In [None]:
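class Dataset():
    """Iterates over the FUNSD annotation files stored in a directory."""

    def __init__(self, path_anot):
        self.path_anot = path_anot

    def __iter__(self):
        # Yield the name of every entry in the directory.
        with os.scandir(self.path_anot) as files:
            for file in files:
                yield file.name

    def __len__(self):
        # Number of files in the directory.
        return sum(1 for _ in self)

    def list_text_label(self, o):
        # Return the texts and labels of all blocks in annotation file `o`.
        with open(os.path.join(self.path_anot, o)) as f:
            anot = json.load(f)
        txt = []
        lbl = []
        # The first top-level value holds the list of blocks.
        for block in list(anot.values())[0]:
            txt.append(block['text'])
            lbl.append(block['label'])
        return (txt, lbl)

    def prep(self):
        # Collect the blocks of every file and map label names to integer ids.
        dic = {'question': 0, 'answer': 1, 'header': 2, 'other': 3}
        text = []
        labels = []
        for file in self:
            txt, lbl = self.list_text_label(file)
            text.extend(txt)
            labels.extend(lbl)
        return (text, [dic[x] for x in labels])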
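
To make the structure that `list_text_label` expects concrete, here is a minimal sketch that writes a made-up annotation file and reads it back the same way. The top-level `form` key and the two sample blocks are assumptions for illustration; real FUNSD files carry further fields per block (such as bounding boxes) that this classifier ignores.

```python
import json
import os
import tempfile

# Hypothetical annotation with the shape list_text_label relies on:
# one top-level key whose value is a list of blocks with 'text' and 'label'.
sample = {
    "form": [
        {"text": "Finalist Name:", "label": "question"},
        {"text": "TRAVEL INFORMATION SHEET", "label": "header"},
    ]
}

label_to_id = {'question': 0, 'answer': 1, 'header': 2, 'other': 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w") as f:
        json.dump(sample, f)

    # Same access pattern as in the class: take the first top-level value.
    with open(path) as f:
        blocks = list(json.load(f).values())[0]

texts = [block["text"] for block in blocks]
label_ids = [label_to_id[block["label"]] for block in blocks]
print(texts)
print(label_ids)
```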

In [None]:
dataset_train=Dataset('dataset/training_data/annotations')
dataset_test=Dataset('dataset/testing_data/annotations')

With this class we can count the files in a directory of annotations or images:

In [None]:
len(dataset_train)

149

## Train, validation and test set

From each annotation file we extract all the text blocks together with their labels. The `prep` method does this for every file in the directory.

In [None]:
text_train, labels_train=dataset_train.prep()
text_test, labels_test=dataset_test.prep()

In [None]:
text_train[:10]

['Attention:',
 'MARDEN- KANE. INC. 666 FIFTH AVE. NEW YORK, N.Y. 10103 (212) 582-6600',
 'TRAVEL INFORMATION SHEET',
 'This information will be used by Marden- Kane in making travel arrangements for your round- trip to Los Angeles, California.',
 'Finalist Name:',
 'Home Address:',
 'Home Telephone:',
 'Business Telephone:',
 'Name of Guests',
 'Relationship to Finalist:']

In [None]:
len(text_train)

7411

In [None]:
labels_train[:10]

[0, 2, 2, 3, 0, 0, 0, 0, 0, 0]
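
The integer labels can be mapped back to their names by inverting the dictionary used in `prep`. The sketch below does this for the ten labels printed above and counts them; `id_to_label` is a helper name introduced here, not part of the notebook.

```python
from collections import Counter

# Same mapping as in Dataset.prep, inverted to go from id back to name.
label_to_id = {'question': 0, 'answer': 1, 'header': 2, 'other': 3}
id_to_label = {idx: name for name, idx in label_to_id.items()}

# The first ten training labels shown above.
first_labels = [0, 2, 2, 3, 0, 0, 0, 0, 0, 0]

counts = Counter(id_to_label[i] for i in first_labels)
print(counts.most_common())
```

Running the same count over the full `labels_train` list would show how balanced the four classes are, which matters when reading macro-averaged metrics.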

We split the training set into training and validation sets, holding out 20% of the blocks for validation:

In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(text_train, labels_train, test_size=.2)

## Dataset for training

To train the model, we need a second dataset class that pairs the tokenizer encodings of each text block with its label.

In [None]:
class FUNDSDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        # Select the idx-th row of every encoding field (input_ids, attention_mask, ...).
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # self.labels is already a tensor, so index it directly instead of re-wrapping it.
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)
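
The indexing in `__getitem__` is easier to follow on a tiny example. The sketch below reproduces the same logic with plain Python lists instead of tensors; the encodings, the labels, and the `get_item` helper are made up for illustration.

```python
# Hypothetical encodings for two short texts, in the layout the tokenizer
# returns: one list per field, one row per text.
encodings = {
    "input_ids": [[101, 7615, 102, 0], [101, 100, 13371, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
labels = [0, 3]


def get_item(idx):
    # Pick the idx-th row of every field, then attach the matching label.
    item = {key: rows[idx] for key, rows in encodings.items()}
    item["labels"] = labels[idx]
    return item


print(get_item(0))
```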

## Metrics
We define the metrics that are reported during training: accuracy and macro-averaged precision, recall, and F1.

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
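
With `average='macro'`, precision, recall, and F1 are computed per class and then averaged with equal weight, so rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that averaging on made-up labels; `macro_scores` is a hypothetical helper, and it treats undefined ratios as 0.

```python
def macro_scores(labels, preds, num_classes):
    """Return macro-averaged (precision, recall, f1) over num_classes classes."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        predicted = sum(1 for p in preds if p == c)
        actual = sum(1 for y in labels if y == c)
        # Undefined ratios (no predictions or no examples of class c) count as 0.
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up example: four classes, but the predictions only ever use class 0.
labels = [0, 0, 1, 2, 3, 0]
preds = [0, 0, 0, 0, 0, 0]
precision, recall, f1 = macro_scores(labels, preds, num_classes=4)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(accuracy, precision, recall, f1)
```

In this example the accuracy is 0.5 while the macro recall is only 0.25, which is the value a model produces on four classes when it always predicts the same one.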

## Model

We won't train a model from scratch; instead, we fine-tune a pretrained one from Hugging Face.

### bert-base-uncased

This is a pretrained model. We need to download both the tokenizer and the model.

The tokenizer converts each text into a sequence of integer token ids.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

We use the tokenizer to obtain the encodings:

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(text_test, truncation=True, padding=True)

In [None]:
train_encodings

{'input_ids': [[101, 20328, 1043, 1012, 6285, 5582, 2015, 1010, 4632, 2102, 1012, 27084, 2100, 1010, 1012, 1996, 3840, 2194, 1010, 13057, 1010, 3516, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 7615, 2006, 3269, 1024, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 100, 13371, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
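
In the output above, every sequence starts with 101, ends with 102, and is then filled with zeros; in the `bert-base-uncased` vocabulary these ids are the `[CLS]`, `[SEP]`, and `[PAD]` tokens. The sketch below imitates what `padding=True` does to a batch: it pads to the longest sequence and builds the matching attention mask. It is a simplified illustration with a hypothetical `pad_batch` helper, not the tokenizer's actual implementation.

```python
PAD_ID = 0  # id of [PAD] in bert-base-uncased


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already tokenized sequences taken from the start of the output above.
batch = pad_batch([[101, 7615, 2006, 3269, 1024, 102], [101, 100, 13371, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the training, validation, and test texts are tokenized in three separate calls, each set is padded to the length of its own longest sequence.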

With the encodings we can build the new dataset:

In [None]:
train_dataset = FUNDSDataset(train_encodings, train_labels)
val_dataset = FUNDSDataset(val_encodings, val_labels)
test_dataset = FUNDSDataset(test_encodings, labels_test)

## Training

In [None]:
args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    # load_best_model_at_end needs checkpoints saved on the same schedule as evaluation.
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
)

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train() 

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 1 | No log | 1.154968 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 21.0068 | 70.596 |
| 2 | No log | 1.154552 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 21.0067 | 70.596 |
| 3 | 1.144100 | 1.154501 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 20.9799 | 70.687 |
| 4 | 1.144100 | 1.154651 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 20.9861 | 70.666 |
| 5 | 1.144100 | 1.154555 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 20.9892 | 70.655 |
| 6 | 1.152700 | 1.154575 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 20.9534 | 70.776 |
| 7 | 1.152700 | 1.154556 | 0.443695 | 0.153667 | 0.110924 | 0.25 | 20.9651 | 70.737 |
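
The validation metrics above do not change between epochs, and the recall of 0.25 equals one divided by the number of classes. That pattern is what a model produces when it predicts the same class for every validation block. The sketch below checks the arithmetic; the counts of 1,483 validation blocks (20% of 7,411, rounded up) and 658 correct predictions are inferred from the reported figures, not taken from the notebook.

```python
num_classes = 4
n_val = 1483      # assumed: 20% of 7411 blocks, rounded up
n_correct = 658   # assumed: validation blocks that belong to the predicted class

accuracy = n_correct / n_val
# Only the predicted class has non-zero scores; the other three contribute 0.
precision_c = n_correct / n_val   # every block is predicted as this class
recall_c = 1.0                    # all true members of the class are found
f1_c = 2 * precision_c * recall_c / (precision_c + recall_c)

macro_precision = precision_c / num_classes
macro_recall = recall_c / num_classes
macro_f1 = f1_c / num_classes

print(round(accuracy, 6), round(macro_f1, 6),
      round(macro_precision, 6), round(macro_recall, 6))
```

Under these assumptions the four values agree with the Accuracy, F1, Precision, and Recall columns reported for every epoch.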


TrainOutput(global_step=1302, training_loss=1.151612336002004, metrics={'train_runtime': 2020.8367, 'train_samples_per_second': 0.644, 'total_flos': 3761771813025408.0, 'epoch': 7.0, 'init_mem_cpu_alloc_delta': 2024980, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 93595544, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1512553, 'train_mem_gpu_alloc_delta': 1354791424, 'train_mem_cpu_peaked_delta': 94405737, 'train_mem_gpu_peaked_delta': 3483230720})
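
The `global_step=1302` in the output above follows from the training arguments. Here is a quick check, assuming a single device, no gradient accumulation, and the 80/20 split of the 7,411 training blocks made earlier (with the validation share rounded up):

```python
import math

n_blocks = 7411                      # len(text_train)
n_val = math.ceil(0.2 * n_blocks)    # assumed rounding of the 20% validation share
n_train = n_blocks - n_val

batch_size = 32                      # per_device_train_batch_size
epochs = 7                           # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(n_train, steps_per_epoch, total_steps)
```

With 186 steps per epoch, step 500 is only reached during the third epoch, which matches the "No log" entries for the training loss in epochs 1 and 2 if the loss is logged every 500 steps.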

In [None]:
compute_metrics(trainer.predict(test_dataset))

{'accuracy': 0.8074614065180102,
 'f1': 0.6867658773301732,
 'precision': 0.7084665950719728,
 'recall': 0.6719656588563511}

In [None]:
trainer.save_model('/content/drive/MyDrive/Modelo/trainer_bert')