# Trabalho 2

### Sobre o trabalho
- O trabalho tem os seguintes objetivos:
    - Estudar a tarefa de **POS Tagging** (Part-Of-Speech (POS) Tagging) para a língua Portuguesa
    - Classificar a classe gramatical de palavras em Português
    - Implementar e avaliar a precisão de um modelo de POS tagging
### Ferramentas, dados e bibliotecas utilizadas
- O corpus de treino utilizado foi o corpus recomendado na descrição do trabalho
- O corpus de avaliação utilizado foi o corpus recomendado na descrição do trabalho
- A principal biblioteca utilizada para manipular os dados foi apenas a biblioteca recomendada: 

Trabalho Prático II

Nosso objetivo é estudar a tarefa de POS Tagging para a língua Portuguesa. Para isso utilizaremos o corpus Mac-Morpho, produzido pelo grupo NILC da ICMC USP. O Mac-Morpho oferece dados para treinamento, validação e teste de modelos preditivos, capazes de classificar a classe gramatical de palavras em Português. Para acessar o conjunto de classes, acesse http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf.


O corpus está disponível em http://nilc.icmc.usp.br/macmorpho/macmorpho-v3.tgz. Você deverá implementar e avaliar a precisão de um modelo de POS tagging, a sua escolha. Voce pode utilizar pacotes que facilitem a implementação, como gensim, e transformers. Você deverá entregar documentação embutida em um notebook, detalhando a tarefa, a metodologia e os resultados. Sua análise deverá discutir quais as classes gramaticais para as quais a precisão é maior/menor. Não utilizaremos LLMs para essa tarefa, mas a sugestão é utilizar Transformers disponiveis e pre-treinados, em especial o BERT.

Exemplo de dados de treino:
```
Jersei_N atinge_V média_N de_PREP Cr$_CUR 1,4_NUM milhão_N na_PREP+ART venda_N da_PREP+ART Pinhal_NPROP em_PREP São_NPROP Paulo_NPROP ._PU Programe_V sua_PROADJ viagem_N à_PREP+ART Exposição_NPROP Nacional_NPROP do_NPROP Zebu_NPROP ,_PU que_PRO-KS começa_V dia_N 25_N ._PU
```

In [None]:
from transformers import BertTokenizerFast
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm

# Initialize the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')

# Initialize the label encoder
label_encoder = LabelEncoder()

# Assume sentences is a list of sentences and pos_tags is a list of POS tag sequences
# Each sentence and POS tag sequence should be a list of words/POS tags

def preprocess_and_tokenize(data, max_length=512):
    sentences = []
    pos_tags = []

    # Split data into sentences and POS tags
    for line in tqdm(data, desc="Processing data"):
        words = []
        tags = []
        for word in line.split():
            split_word = word.split('_')
            words.append(split_word[0])
            tags.append(split_word[1])
        sentences.append(words)
        pos_tags.append(tags)

    # Tokenize the sentences
    tokenized_inputs = tokenizer(sentences, truncation=True, padding=True, is_split_into_words=True, max_length=max_length)

    # Handle the POS tags
    new_pos_tags = []
    for sent_tags in pos_tags:
        new_tags = []
        for tag in sent_tags:
            new_tags.append(tag)
        new_pos_tags.append(new_tags)

    # Encode the POS tags
    encoded_pos_tags = [label_encoder.fit_transform(tags) for tags in new_pos_tags]

    # Pad the POS tags
    padded_pos_tags = pad_sequences(encoded_pos_tags, maxlen=max_length, padding='post')

    # Create attention masks
    attention_masks = [[float(i != 0.0) for i in seq] for seq in tokenized_inputs["input_ids"]]

    return tokenized_inputs["input_ids"], padded_pos_tags, attention_masks

In [None]:
with open('data/macmorpho-train.txt', 'r') as f:
    data = f.readlines()
    input_ids, tags, masks = preprocess_and_tokenize(data)

with open('data/macmorpho-dev.txt', 'r') as f:
    data = f.readlines()
    val_input_ids, val_tags, val_masks = preprocess_and_tokenize(data)

with open('data/macmorpho-test.txt', 'r') as f:
    data = f.readlines()
    test_input_ids, test_tags, test_masks = preprocess_and_tokenize(data)

In [15]:
from transformers import TFBertForTokenClassification, TFTrainingArguments, TFTrainer
import tensorflow as tf

# Set max_length
max_length = 512

# Convert training data into TF Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": input_ids, "attention_mask": masks},
    tags
))

# Convert validation data into TF Dataset
eval_dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": val_input_ids, "attention_mask": val_masks},
    val_tags
))


# Initialize the BERT model
model = TFBertForTokenClassification.from_pretrained(
            'bert-base-multilingual-cased',
            num_labels=len(label_encoder.classes_),
            max_length=max_length
)

# Define the training arguments
training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train the model
trainer.train()

All PyTorch model weights were used when initializing TFBertForTokenClassification.

Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2023-12-09 17:59:16.154653: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.


ValueError: in user code:

    File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/trainer_tf.py", line 710, in distributed_training_steps  *
        self.args.strategy.run(self.apply_gradients, inputs)
    File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/trainer_tf.py", line 653, in apply_gradients  *
        gradients = self.training_step(features, labels, nb_instances_in_global_batch)
    File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/trainer_tf.py", line 636, in training_step  *
        per_example_loss, _ = self.run_model(features, labels, True)
    File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/trainer_tf.py", line 755, in run_model  *
        outputs = self.model(features, labels=labels, training=training)[:2]
    File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler  **
        raise e.with_traceback(filtered_tb) from None
    File "/tmp/__autograph_generated_file3omv_k83.py", line 40, in tf__run_call_with_unpacked_inputs
        raise
    File "/tmp/__autograph_generated_fileofuah1x2.py", line 19, in tf__call
        loss = ag__.if_exp(ag__.ld(labels) is None, lambda: None, lambda: ag__.converted_call(ag__.ld(self).hf_compute_loss, (), dict(labels=ag__.ld(labels), logits=ag__.ld(logits)), fscope), 'labels is None')
    File "/tmp/__autograph_generated_fileofuah1x2.py", line 19, in <lambda>
        loss = ag__.if_exp(ag__.ld(labels) is None, lambda: None, lambda: ag__.converted_call(ag__.ld(self).hf_compute_loss, (), dict(labels=ag__.ld(labels), logits=ag__.ld(logits)), fscope), 'labels is None')
    File "/tmp/__autograph_generated_fileb5p5yn6o.py", line 92, in tf__hf_compute_loss
        ag__.if_stmt(ag__.ld(self).config.tf_legacy_loss, if_body_3, else_body_3, get_state_3, set_state_3, ('do_return', 'retval_', 'labels'), 2)
    File "/tmp/__autograph_generated_fileb5p5yn6o.py", line 76, in else_body_3
        unmasked_loss = ag__.converted_call(ag__.ld(loss_fn), (ag__.converted_call(ag__.ld(tf).nn.relu, (ag__.ld(labels),), None, fscope), ag__.ld(logits)), None, fscope)

    ValueError: Exception encountered when calling layer 'tf_bert_for_token_classification_5' (type TFBertForTokenClassification).
    
    in user code:
    
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/modeling_tf_utils.py", line 1748, in run_call_with_unpacked_inputs  *
            return func(self, **unpacked_inputs)
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/models/bert/modeling_tf_bert.py", line 1773, in call  *
            loss = None if labels is None else self.hf_compute_loss(labels=labels, logits=logits)
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/transformers/modeling_tf_utils.py", line 271, in hf_compute_loss  *
            unmasked_loss = loss_fn(tf.nn.relu(labels), logits)
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/keras/src/losses.py", line 143, in __call__  **
            losses = call_fn(y_true, y_pred)
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/keras/src/losses.py", line 270, in call  **
            return ag_fn(y_true, y_pred, **self._fn_kwargs)
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/keras/src/losses.py", line 2454, in sparse_categorical_crossentropy
            return backend.sparse_categorical_crossentropy(
        File "/home/misaelrezende/miniconda3/envs/general/lib/python3.11/site-packages/keras/src/backend.py", line 5775, in sparse_categorical_crossentropy
            res = tf.nn.sparse_softmax_cross_entropy_with_logits(
    
        ValueError: `labels.shape` must equal `logits.shape` except for the last dimension. Received: labels.shape=(16, 512) and logits.shape=(16, 267, 10)
    
    
    Call arguments received by layer 'tf_bert_for_token_classification_5' (type TFBertForTokenClassification):
      • input_ids={'input_ids': 'tf.Tensor(shape=(16, 267), dtype=int32)', 'attention_mask': 'tf.Tensor(shape=(16, 267), dtype=float32)'}
      • attention_mask=None
      • token_type_ids=None
      • position_ids=None
      • head_mask=None
      • inputs_embeds=None
      • output_attentions=None
      • output_hidden_states=None
      • return_dict=None
      • labels=tf.Tensor(shape=(16, 512), dtype=int32)
      • training=True
