# Metalinguistic analyses

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ivanvladimir/analisis_linguistico/blob/main/Analisis%20metalinguisticos.ipynb)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/ivanvladimir/analisis_linguistico/blob/main/Analisis%20metalinguisticos.ipynb)

This notebook contains example code for computational linguistic analysis, focusing on metalinguistic analyses.

### Instructions

Run the cells in the order in which they appear.

### Notebook license

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/80x15.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

### General information

> **Author(s)**: <a href="https://twitter.com/ivanvladimir">@ivanvladimir</a> <br />
> **Last updated**: 15/06/2025

# ❶ Prepare libraries

In [None]:
# Install libraries
!pip install transformers accelerate

In [None]:
# Load libraries
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification
import torch
import numpy as np

# ❷ Load the model

Models worth trying:

* [facebook/roberta-hate-speech-dynabench-r4-target](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
* [pysentimiento/robertuito-sentiment-analysis](https://huggingface.co/pysentimiento/robertuito-sentiment-analysis)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("pysentimiento/robertuito-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("pysentimiento/robertuito-sentiment-analysis")

# ❸ Use the model

In [None]:
# Spanish input sentence: "I think today will be very sunny and it will be an excellent day"
MSG = "Yo creo que el día de hoy habrá mucho sol y será un día excelente"

with torch.no_grad():
    tokens = tokenizer(MSG, return_tensors="pt")
    logits = model(**tokens).logits

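# Index of the highest-scoring class for the single input sentence
predicted_class_id = logits.argmax(dim=-1).item()
# Map the class index to its human-readable label
model.config.id2label[predicted_class_id]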

# ❹ Exploring the output

In [None]:
tokens

In [None]:
print(f"Token IDs: {tokens['input_ids'].tolist()[0]}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])}")
print(f"Decoded: {tokenizer.decode(tokens['input_ids'][0])}")

In [None]:
logits

In [None]:
import torch.nn.functional as F

# Turn the raw logits into probabilities over the classes (dim=1 is the class axis)
probs = F.softmax(logits, dim=1)
print(probs)

In [None]:
print({v: f"{float(probs[0][k])*100:3.2f}%" for k,v in model.config.id2label.items()})
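
The softmax and label-mapping steps above can also be traced by hand. The following is a minimal sketch in plain Python, not the library's implementation; the logits and the label map are invented for illustration and do not come from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values, invented for this example only.
example_logits = [-1.2, 0.3, 2.5]
example_id2label = {0: "NEG", 1: "NEU", 2: "POS"}

example_probs = softmax(example_logits)

# argmax: index of the largest probability
best = max(range(len(example_probs)), key=example_probs.__getitem__)

# Same percentage report as the cell above, built from the toy values
report = {example_id2label[i]: f"{p * 100:.2f}%" for i, p in enumerate(example_probs)}
print(report, "->", example_id2label[best])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits or of the probabilities selects the same class.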

# ❺ Models for document segmentation

In [None]:
tokenizer = AutoTokenizer.from_pretrained("SIRIS-Lab/citation-parser-ENTITY")
model = AutoModelForTokenClassification.from_pretrained("SIRIS-Lab/citation-parser-ENTITY")

# Get label names
id2label = model.config.id2label
label2id = model.config.label2id

text="Robles, C. , Carrillo, M. and Meza, I. : Detección de emociones en texto en español utilizando transformers. Abstraction & Application. Vol. 47. pp. 87-97. 2024"

with torch.no_grad():
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[pred.item()] for pred in predictions[0]]

# Extract entities
entities = []
current_entity = []
current_label = None

for token, label in zip(tokens, labels):
    if token in ['[CLS]', '[SEP]', '[PAD]']:
        continue
        
    if label.startswith('B-'):  # Beginning of entity
        if current_entity:  # Save previous entity
            entities.append({
                'text': tokenizer.convert_tokens_to_string(current_entity),
                'label': current_label
            })
        current_entity = [token]
        current_label = label[2:]  # Remove 'B-' prefix
        
    elif label.startswith('I-') and current_label == label[2:]:  # Inside entity
        current_entity.append(token)
        
    else:  # Outside entity or different entity
        if current_entity:
            entities.append({
                'text': tokenizer.convert_tokens_to_string(current_entity),
                'label': current_label
            })
        current_entity = []
        current_label = None

# Don't forget the last entity
if current_entity:
    entities.append({
        'text': tokenizer.convert_tokens_to_string(current_entity),
        'label': current_label
    })

for entity in entities:
    print(f"  {entity['text']} -> {entity['label']}")
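
The grouping logic in the cell above can be checked without downloading a model. This is a minimal sketch that assumes the labels follow the B-/I-/O tagging scheme; the function name `group_bio`, the toy tokens, and the label names are hypothetical and chosen only for illustration.

```python
def group_bio(tokens, labels):
    """Group (token, BIO label) pairs into (entity_tokens, entity_label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_tokens:
                spans.append((current_tokens, current_label))
            current_tokens, current_label = [token], label[2:]
        elif label.startswith("I-") and label[2:] == current_label:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O" or a mismatched I- tag closes the open span.
            if current_tokens:
                spans.append((current_tokens, current_label))
            current_tokens, current_label = [], None
    # Flush the span that is still open at the end of the sequence.
    if current_tokens:
        spans.append((current_tokens, current_label))
    return spans

# Hypothetical, hand-written tokens and labels (not model output).
toy_tokens = ["Meza", ",", "I", ".", ":", "A", "title", ".", "2024"]
toy_labels = ["B-AUTHOR", "I-AUTHOR", "I-AUTHOR", "I-AUTHOR", "O",
              "B-TITLE", "I-TITLE", "O", "B-YEAR"]

toy_spans = group_bio(toy_tokens, toy_labels)
for span_tokens, span_label in toy_spans:
    print(" ".join(span_tokens), "->", span_label)
```

Note that a token tagged `I-` whose type does not match the open span is discarded here, which mirrors the behavior of the loop above.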

In [None]:
from transformers import pipeline

# Load the model
citation_parser = pipeline("ner",
                           model="nicolauduran45/patstat-citation-parser",
                           tokenizer="nicolauduran45/patstat-citation-parser",
                           aggregation_strategy="simple")

citation_text="Alizadeh, P. , Garcia, J. , Meza, I. and Taleb, S. : Reinforcement Learning for Expert Finding from Web Search Results. Advances in Knowledge Discovery and Management. pp. 113-128. 2024."
# Parse the citation
entities = citation_parser(citation_text)

for entity in entities:
    print(f"  {entity['word']} -> {entity['entity_group']} (confidence: {entity['score']:.3f})")
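
With an aggregation strategy set, the pipeline returns a list of dictionaries that include the keys `entity_group`, `word`, and `score`. One possible way to turn that list into a record keyed by field is sketched below. The helper name `entities_to_record`, the 0.5 threshold, and the sample entities (including their label names and scores) are assumptions made up for this example, not actual output of the model.

```python
from collections import defaultdict

def entities_to_record(entities, min_score=0.5):
    """Collect aggregated NER entities into {label: [text, ...]}.

    Entities scoring below min_score are dropped (threshold is an assumption).
    """
    record = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            record[entity["entity_group"]].append(entity["word"].strip())
    return dict(record)

# Hypothetical entities shaped like the pipeline output; values are invented.
sample_entities = [
    {"entity_group": "AUTHOR", "word": "Alizadeh, P.", "score": 0.98},
    {"entity_group": "AUTHOR", "word": "Meza, I.", "score": 0.95},
    {"entity_group": "TITLE", "word": "Reinforcement", "score": 0.31},
    {"entity_group": "PUBLICATION_YEAR", "word": "2024", "score": 0.99},
]

sample_record = entities_to_record(sample_entities)
print(sample_record)
```

In the notebook, the same helper could be applied to the `entities` list returned by `citation_parser(citation_text)`.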