<a href="https://colab.research.google.com/github/joaoBernardinoo/formas-research/blob/main/atividade_01_formas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ✨ Rule-Based Part-of-Speech Tagging ✨

This project implements a **rule-based part-of-speech tagger** for Portuguese, inspired by the classic work of **Brill (1992)**. Using the **Bosque corpus**, it has two goals: to put into practice what I learned during my undergraduate research, and to test the hypothesis that, even though Portuguese is a more verbose language than English, correct part-of-speech tags can still be assigned from the last three characters of the tokens annotated in a gold-standard corpus (**Bosque**).

🔍 **Project Highlights**:
- Uses **Natural Language Processing (NLP)** techniques.
- Relies on linguistic rules for morphological analysis.
- Tests how effective suffixes are for **part-of-speech tagging** in Portuguese.

📊 **Corpus Used**:
- **Bosque** (a gold-standard annotated corpus of Portuguese, loaded here from a CoNLL-U file).
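
The suffix hypothesis described above can be illustrated with a minimal sketch. It assumes a toy list of `(form, tag)` pairs invented for this example (not drawn from Bosque) and a hypothetical helper, `build_suffix_lexicon`, that maps each three-character suffix to its most frequent tag:

```python
from collections import Counter, defaultdict

# Toy (form, tag) pairs; hypothetical data, not taken from Bosque.
toy_corpus = [
    ("cantando", "VERB"), ("falando", "VERB"), ("quando", "ADV"),
    ("rapidamente", "ADV"), ("felizmente", "ADV"), ("mente", "NOUN"),
]

def build_suffix_lexicon(pairs, n=3):
    """Map each n-character suffix to the tag it most often carries."""
    counts = defaultdict(Counter)
    for form, tag in pairs:
        counts[form.lower()[-n:]][tag] += 1
    return {suffix: tags.most_common(1)[0][0] for suffix, tags in counts.items()}

def suffix_tag(token, lexicon, n=3):
    """Tag a token by its suffix; unseen suffixes get the placeholder '_'."""
    return lexicon.get(token.lower()[-n:], "_")

lexicon = build_suffix_lexicon(toy_corpus)
print(lexicon)
print(suffix_tag("Andando", lexicon), suffix_tag("sol", lexicon))
```

On this toy data the suffix `ndo` resolves to `VERB` even though `quando` is an adverb, which is exactly the kind of error the contextual patches are meant to correct.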

---

### 📚 References

BRILL, E. *A Simple Rule-Based Part of Speech Tagger*. Proceedings of the Third Conference on Applied Natural Language Processing. **ANLC ’92**. USA: Association for Computational Linguistics, 1992. Available at: [https://doi.org/10.3115/974499.974526](https://doi.org/10.3115/974499.974526)



In [9]:
import pickle
import nltk
import pandas as pd
import numpy as np
import copy
import itertools as it
from sklearn.model_selection import train_test_split

In [10]:
!pip install conllu
!wget http://marlovss.work.gd:8080/tomorrow/aula2/bosque.conllu

Collecting conllu
  Downloading conllu-6.0.0-py3-none-any.whl.metadata (21 kB)
Downloading conllu-6.0.0-py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-6.0.0
--2024-11-01 14:37:56--  http://marlovss.work.gd:8080/tomorrow/aula2/bosque.conllu
Resolving marlovss.work.gd (marlovss.work.gd)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘marlovss.work.gd’


In [11]:
import conllu
import itertools as it

class AttributeDict(dict):
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

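class CoNLLU:
   """Loads CoNLL-U files and exposes their sentences, words, POS tags and features."""
   def __init__(self, files):
      self.words = []
      self.sentences = []
      for f in files:
         # read with an explicit encoding and close the file afterwards
         with open(f, encoding="utf-8") as fh:
            parsed = conllu.parse(fh.read())
         # skip tokens without a UPOS tag ('_'), such as multiword-token range lines
         sents = [[AttributeDict(form = token['form'], lemma=token['lemma'],pos=token['upos'],feats=token['feats']) for token in tokenlist if token['upos']!='_'] for tokenlist in parsed]
         self.sentences.extend(sents)
         self.words.extend([word for sent in sents for word in sent])
      self.pos_tags = set(word.pos for word in self.words)
      # for each POS tag, the set of morphological feature names observed with it
      self.feats_dict = {pos: set(it.chain.from_iterable(word.feats.keys() for word in self.words if word.pos==pos and word.feats is not None)) for pos in self.pos_tags}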


In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
bosque = CoNLLU(files=["bosque.conllu"])

In [None]:
# here train_data, patch_data and test_data are the "bosque" corpus partitioned by sentence, not by word
# TODO: check that every partition covers all the "universal pos tags"

train_data, temp_data = train_test_split(bosque.sentences, test_size=0.1, random_state=42) # 90% train, 10% temp
patch_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42) # Split the 10% into 5% patch and 5% test

print(f"Training data size: {len(train_data)}")
print(f"Patch data size: {len(patch_data)}")
print(f"Test data size: {len(test_data)}")
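
The comment in the split cell notes that each partition should cover every universal POS tag. One possible check is sketched below. It assumes sentences are lists of `(form, tag)` tuples (the notebook's own sentences hold `AttributeDict` objects with a `.pos` field, so the tag access would need adapting), and `missing_tags` is a hypothetical helper name:

```python
def missing_tags(partition, all_tags):
    """Return the tags from all_tags that never occur in the partition."""
    seen = {tag for sentence in partition for _, tag in sentence}
    return set(all_tags) - seen

# Toy partitions; hypothetical data for illustration only.
toy_train = [[("o", "DET"), ("gato", "NOUN"), ("dorme", "VERB")]]
toy_test = [[("ela", "PRON"), ("dorme", "VERB")]]
toy_tags = {"DET", "NOUN", "VERB", "PRON"}

for name, part in [("train", toy_train), ("test", toy_test)]:
    print(name, "is missing:", sorted(missing_tags(part, toy_tags)))
```

If a partition turns out to miss rare tags, a different `random_state` or a larger partition may help, although rare tags such as `INTJ` or `SYM` may simply be too infrequent to appear in a 5% slice.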

In [None]:
print(train_data[0])

In [None]:
train_words = sorted([word for sentence in train_data for word in sentence], key=lambda x: x.form)

In [None]:
from nltk.probability import FreqDist
suffixes = set([word.form.lower()[-3:] for word in train_words])

In [None]:
# The paper uses the last 3 characters of each token in an annotated English corpus.
# Hypothesis (still to be tested): since Portuguese is more verbose, are 3 characters enough
# to cover, for example, all verb conjugations?

try:
    with open('/content/drive/MyDrive/Colab Notebooks/suf_to_tag.pkl', 'rb') as f:
        suf_to_tag = pickle.load(f)
except FileNotFoundError:
    print("File not found, extracting the suffixes...")
    suf_to_tag = {suf: FreqDist([word.pos for word in train_words if word.form.lower()[-3:] == suf]).max() for suf in suffixes}

    with open('/content/drive/MyDrive/Colab Notebooks/suf_to_tag.pkl', 'wb') as f:
        pickle.dump(suf_to_tag, f)

In [None]:
rules = {
    'ADJ': [],
    'ADP': [],
    'ADV': [],
    'AUX': [],
    'CCONJ': [],
    'DET': [],
    'INTJ': [],
    'NOUN': [],
    'NUM': [],
    'PART': [],
    'PRON': [],
    'PROPN': [],
    'PUNCT': [],
    'SCONJ': [],
    'SYM': [],
    'VERB': [],
    'X': []
}

df = pd.DataFrame(list(rules.items()), columns=['pos_tag', 'token'])
df['token'] = df['token'].apply(set)


In [None]:
train_sents = [[word.form for word in sent] for sent in train_data]
patch_sents = [[word.form for word in sent] for sent in patch_data]
patch_gold = [[(word.form,word.pos) for word in sent] for sent in patch_data]
test_gold  = [[(word.form.lower(),word.pos) for word in sent] for sent in test_data]

In [None]:
# first we tag the patch corpus,
# counting how many times a word was tagged tag a when it should have been tagged tag b
# < tagA, tagB, number >

def lexic_tag(tokens):
  tagged = []
  for token in tokens:
    if token.lower()[-3:] in suffixes:
       tagged.append([token,suf_to_tag[token.lower()[-3:]]])
    else:
       tagged.append([token,"_"])
  return tagged

# has to add < tagA, tagB, number > to a list, when a word is mistagged with a tagA
# when it should be tagB

def lexic_tag_error(predicted,gold):
  mistagged = []
  for j in range(len(gold)):
    for i in range(len(gold[j])):
      tagA = predicted[j][i][1]
      tagB = gold[j][i][1]
      if tagA != tagB:
        # if the element < tagA, tagB, number > already exists, increment number
        # otherwise append < tagA, tagB, 1 > to the list
        found = False
        for k in range(len(mistagged)):
            if mistagged[k][0] == tagA and mistagged[k][1] == tagB:
                mistagged[k] = (tagA, tagB, mistagged[k][2] + 1)
                found = True
                break
        if not found:
            mistagged.append((tagA, tagB, 1))

  return mistagged

In [None]:
patch_pred = [lexic_tag(sent) for sent in patch_sents]

In [None]:
print(patch_pred[0][1])
print(patch_gold[0][1])
print(f"The two diverge here, so we must add \n< {patch_pred[0][1][1]},{patch_gold[0][1][1]}, +1> to the list of triples")

In [None]:
triples = lexic_tag_error(patch_pred,patch_gold)

In [None]:
sorted_triples = sorted(triples, key=lambda x: x[2], reverse=True)
sorted_triples[:5]

In [None]:
total_errors = sum([triple[2] for triple in triples])
print(total_errors)
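
The `< tagA, tagB, number >` triples can also be tallied with `collections.Counter`, which avoids the linear scan over the list for every error. The sketch below is an alternative under assumptions, not the notebook's implementation: `count_confusions` is a hypothetical name and the sentences are toy data.

```python
from collections import Counter

def count_confusions(predicted, gold):
    """Count (predicted_tag, gold_tag) pairs for every mistagged token."""
    counts = Counter()
    for pred_sent, gold_sent in zip(predicted, gold):
        for (_, tag_a), (_, tag_b) in zip(pred_sent, gold_sent):
            if tag_a != tag_b:
                counts[(tag_a, tag_b)] += 1
    return counts

# Toy predictions and gold tags; hypothetical data for illustration only.
toy_pred = [[["o", "DET"], ["canto", "VERB"], ["alto", "NOUN"]], [["canto", "VERB"]]]
toy_gold = [[("o", "DET"), ("canto", "NOUN"), ("alto", "ADJ")], [("canto", "NOUN")]]

confusions = count_confusions(toy_pred, toy_gold)
print(confusions.most_common(5))   # most frequent confusions first
print(sum(confusions.values()))    # total number of errors
```

`most_common` returns the pairs already sorted by frequency, which plays the role of `sorted_triples` above.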

### We will use the templates below to generate the patches:
(Brill, 1992)
Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
7. The current word is (is not) capitalized.
8. The previous word is (is not) capitalized.
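
To make the single-tag templates (1 to 4) concrete, here is a minimal sketch of how such a condition could be evaluated on a tagged sentence. The helper names `condition_holds` and `apply_patch` are hypothetical, the condition names follow the notebook's `conditions` list, and the two-tag and capitalization templates (5 to 8) are deliberately left out. As a simplifying assumption, conditions are tested against a snapshot of the tags taken before the patch starts rewriting them.

```python
def condition_holds(tags, i, cond, z):
    """Return True if `cond` with context tag `z` holds at position i."""
    prev = tags[max(0, i - 3):i]   # up to three preceding tags, nearest last
    nxt = tags[i + 1:i + 4]        # up to three following tags, nearest first
    checks = {
        "PREV-TAG": prev[-1:] == [z],
        "NEXT-TAG": nxt[:1] == [z],
        "PREV-2-TAG": len(prev) >= 2 and prev[-2] == z,
        "NEXT-2-TAG": len(nxt) >= 2 and nxt[1] == z,
        "PREV-1-OR-2-TAG": z in prev[-2:],
        "NEXT-1-OR-2-TAG": z in nxt[:2],
        "PREV-1-OR-2-OR-3-TAG": z in prev,
        "NEXT-1-OR-2-OR-3-TAG": z in nxt,
    }
    return checks[cond]            # KeyError for templates not sketched here

def apply_patch(tagged, a, b, cond, z):
    """Change tag a to tag b wherever the condition holds; returns a new list."""
    tags = [tag for _, tag in tagged]  # snapshot of the tags before patching
    return [
        (word, b if tag == a and condition_holds(tags, i, cond, z) else tag)
        for i, (word, tag) in enumerate(tagged)
    ]

# Toy sentence; hypothetical data for illustration only.
sentence = [("o", "DET"), ("canto", "VERB"), ("ecoa", "VERB")]
patched = apply_patch(sentence, "VERB", "NOUN", "PREV-TAG", "DET")
print(patched)
```

Only `canto` is retagged: it is preceded by a `DET`, whereas `ecoa` is preceded by a word that was tagged `VERB` in the snapshot.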

In [None]:
def preprocess_train_words(train_words):
     """ Dictionary for fast lookup of the tags seen for each word in the training corpus """
     word_to_tags = {}
     for word_data in train_words:
       form = word_data['form']
       pos_tag = word_data['pos']
       word_to_tags.setdefault(form, set()).add(pos_tag)
     return word_to_tags

word_to_tags_lookup = preprocess_train_words(train_words)

In [None]:
patches = []

In [None]:
conditions = [
    "NEXT-TAG",
    "PREV-TAG",
    "NEXT-2-TAG",
    "PREV-2-TAG",
    "NEXT-1-OR-2-TAG",
    "PREV-1-OR-2-TAG",
    "NEXT-1-OR-2-OR-3-TAG",
    "PREV-1-OR-2-OR-3-TAG",
    "PREV-TAG-NEXT-TAG",
    "PREV-TAG-NEXT-2-TAG",
    "PREV-2-TAG-NEXT-TAG",
    "IS-CAPITALIZED",
    "IS-NOT-CAPITALIZED",
    "PREV-IS-CAPITALIZED",
    "PREV-IS-NOT-CAPITALIZED",
]

In [None]:
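class PatchTemplate():
  """A candidate patch: change tag `current` to tag `patch` when `cond` holds for context tag `next`."""
  def __init__(self, tagA, tagB, cond, tagC = "_"):
    self.current = tagA   # tag assigned by the initial tagger
    self.patch = tagB     # tag the patch changes it to
    self.cond = cond      # name of the contextual condition
    self.next = tagC      # context tag (z) tested by the condition

  def __str__(self):
    # same order as Brill's notation, e.g. "VERB PREP NEXT-TAG DET"
    return f"{self.current} {self.patch} {self.cond} {self.next}"

  def __repr__(self):
    return str(self)

  def canTag(self, token):
    """Check, via the lookup dictionary, whether the patch may apply to this token.

    A patch which changes the tagging of a word from a to b only applies
    if the word has been tagged b somewhere in the training corpus.
    """
    return self.patch in word_to_tags_lookup.get(token, set())

  def apply(self, predicted, gold=None):
    """Apply the patch to one tagged sentence and keep it only if errors decrease.

    `gold` is the gold-tagged version of the same sentence; it defaults to
    patch_gold[0], the only sentence this notebook patches so far.
    NOTE: the contextual condition (self.cond / self.next) is not evaluated yet.
    """
    gold = patch_gold[0] if gold is None else gold
    base_errors = sum(err[2] for err in lexic_tag_error([predicted], [gold]))

    predicted_copy = copy.deepcopy(predicted)
    for token in predicted_copy:
      if token[1] == self.current and self.canTag(token[0]):
        token[1] = self.patch

    patched_errors = sum(err[2] for err in lexic_tag_error([predicted_copy], [gold]))
    if patched_errors < base_errors:
      predicted[:] = predicted_copy
      patches.append(self)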

In [None]:
def generate_templates(tagA, pos_tags, conditions):
    """Instantiate every patch template that changes tagA into another tag.

    Template families (Brill, 1992):
      1. The preceding (following) word is tagged z.
      2. The word two before (after) is tagged z.
      3. One of the two preceding (following) words is tagged z.
      4. One of the three preceding (following) words is tagged z.
      5. The preceding word is tagged z and the following word is tagged w.
      6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
      7. The current word is (is not) capitalized.
      8. The previous word is (is not) capitalized.
    """
    templates = []
    # `cond` avoids shadowing the `conditions` parameter
    for tagB, tagC, cond in it.product(pos_tags, pos_tags, conditions):
        # a patch that rewrites tagA as tagA would be a no-op
        if tagB != tagA:
            templates.append(PatchTemplate(tagA, tagB, cond, tagC))
    return templates

pos_tags = {'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X'}
# conditions = ["NEXT-TAG","PREV-TAG"]
tagA = "NOUN"
templates = generate_templates(tagA,pos_tags, conditions)

# each PatchTemplate object has 4 attributes, e.g.:
#  VERB PREP NEXT-TAG DET
# The first patch states that if a word is tagged VERB
# and the following word is tagged DET, then switch the
# tag from VERB to PREP.




In [None]:
templates[0]

In [None]:
len(templates)

In [None]:
print(patch_pred[0][1])
print(patch_gold[0][1])

In [None]:
tagA = "NOUN"
# generate_templates already iterates over every candidate tag, so a single call is enough
templates = generate_templates(tagA, pos_tags, conditions)
for template in templates:
  template.apply(patch_pred[0])

In [None]:
print(patch_pred[0][1])
print(patch_gold[0][1])

In [None]:
def tag(tokens):
  tagged = []
  for token in tokens:
    if token.lower()[-3:] in suffixes:
       tagged.append((token,suf_to_tag[token.lower()[-3:]]))
    else:
       tagged.append((token,"_"))
  return tagged

In [None]:
def accuracy(predicted,gold):

   acertos = len([predicted[i][j][1] for i in range(len(gold)) for j in range(len(gold[i])) if predicted[i][j][1]==gold[i][j][1]])
   totais = sum([len(sent) for sent in gold])
   return acertos/totais

def abrangencia(predicted,gold):
  tagged_tokens = 0

  for sent in predicted:
    for _, predicted_tag in sent:
      if predicted_tag != "_":
        tagged_tokens += 1
  total_tokens = 0

  for sent in gold:
    for _, gold_tag in sent:
      if gold_tag != "_":
        total_tokens += 1
  return tagged_tokens / total_tokens

def F(predicted,gold):
  # harmonic mean of coverage and accuracy, each computed only once
  cov = abrangencia(predicted,gold)
  acc = accuracy(predicted,gold)
  return 2 * cov * acc / (cov + acc) if (cov + acc) > 0 else 0.0
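
As a worked example of the three metrics, the sketch below recomputes them on a four-token toy sentence. It is a simplified, self-contained variant under assumptions: the function names are hypothetical, the data is invented, and coverage is taken over all tokens rather than over gold tokens whose tag is not `_`.

```python
def toy_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(p[1], g[1]) for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)]
    return sum(p == g for p, g in pairs) / len(pairs)

def toy_coverage(predicted):
    """Fraction of tokens that received any tag other than the placeholder '_'."""
    tags = [tag for sent in predicted for _, tag in sent]
    return sum(tag != "_" for tag in tags) / len(tags)

def toy_f(acc, cov):
    """Harmonic mean of accuracy and coverage (0.0 when both are zero)."""
    return 2 * acc * cov / (acc + cov) if acc + cov > 0 else 0.0

# Toy data: 2 of 4 tags are correct and 3 of 4 tokens received a tag.
toy_pred = [[("o", "DET"), ("gato", "NOUN"), ("xyz", "_"), ("dorme", "NOUN")]]
toy_gold = [[("o", "DET"), ("gato", "NOUN"), ("xyz", "X"), ("dorme", "VERB")]]

acc = toy_accuracy(toy_pred, toy_gold)
cov = toy_coverage(toy_pred)
print(acc, cov, toy_f(acc, cov))
```

Here accuracy is 2/4 = 0.5 and coverage is 3/4 = 0.75, so the harmonic mean is 2 · 0.5 · 0.75 / 1.25 = 0.6.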

In [None]:
!wget http://marlovss.work.gd:8080/tomorrow/aula2/test.conllu

In [None]:
test = CoNLLU(files=["test.conllu"])
test_sents = [[word.form for word in sent] for sent in test.sentences]
gold = [[(word.form.lower(),word.pos) for word in sent] for sent in test.sentences]
predicted = [tag(sent) for sent in test_sents]

In [None]:
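def validate(train,test):
  # NOTE: `train` is not used yet; tag() relies on the global suf_to_tag built earlier
  gold = [[(word.form.lower(),word.pos) for word in sent] for sent in test]
  # predictions must come from the same sentences as gold so the metrics align token by token
  predicted = [tag([word.form for word in sent]) for sent in test]
  return {
        'accuracy': accuracy(predicted,gold),
        'coverage': abrangencia(predicted,gold),
        "F" : F(predicted,gold)
  }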