# Task
Mount Google Drive, load the dataset "task1" from the "english" directory, and filter it to include only English language data.

## Mount google drive

### Subtask:
Mount Google Drive to access files stored there.


**Reasoning**:
Call `drive.mount` so that files under `MyDrive` become readable at `/content/drive`.



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load data

### Subtask:
Load the relevant data file from the specified directory within Google Drive.


**Reasoning**:
Load the data from the specified path into a pandas DataFrame.



In [2]:
import pandas as pd

file_path = '/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/task1/data/english/train_en.tsv'
df_sub = pd.read_csv(file_path,sep="\t")

In [3]:
df_sub.head()

Unnamed: 0,sentence_id,sentence,label,solved_conflict
0,b9e1635a-72aa-467f-86d6-f56ef09f62c3,Gone are the days when they led the world in r...,SUBJ,True
1,f99b5143-70d2-494a-a2f5-c68f10d09d0a,The trend is expected to reverse as soon as ne...,OBJ,False
2,4076639c-aa56-4202-ae0f-9d9217f8da68,But there is the specious point again.,OBJ,False
3,b057c366-698e-419d-a284-9b16d835c64e,He added he wouldn’t be surprised to see a new...,OBJ,False
4,a5a9645e-7850-41ba-90a2-5def725cd5b8,"Not less government, you see; the same amount ...",SUBJ,False


In [4]:
df_sub.label.value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
OBJ,532
SUBJ,298


# Task
Implement a simple language model and a classifier on top of it to classify sentences as subjective or objective, based on the `sentence` and `label` columns of the DataFrame `df_sub` loaded from the file "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/task1/data/english/train_en.tsv".

## Prepare the data

### Subtask:
Split the `df_sub` DataFrame into training and testing sets, and preprocess the text data (e.g., tokenization, potentially removing stop words or punctuation).


**Reasoning**:
Split the dataframe into training and testing sets and then preprocess the text data using TF-IDF vectorization.



In [5]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X = df_sub['sentence']
y = df_sub['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
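The split above is random rather than stratified, so the class ratio in the test part can drift from the full dataset. A minimal sketch of a stratified split follows, assuming toy placeholder sentences with the same 532/298 class counts reported earlier; `stratify` is a standard `train_test_split` parameter, while the toy data and variable names are illustrative only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for df_sub: placeholder sentences with the 532/298 class counts.
sentences = [f"sentence {i}" for i in range(830)]
labels = ["OBJ"] * 532 + ["SUBJ"] * 298

# stratify=labels keeps the OBJ/SUBJ ratio (approximately) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    sentences, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With stratification the SUBJ share of the test part stays close to 298/830, which makes scores from different random seeds easier to compare.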

## Build the model

### Subtask:
Create a simple language model (e.g., using TF-IDF vectorization or a simple embedding layer) and a classifier (e.g., a logistic regression or a simple neural network) on top of it.


**Reasoning**:
Instantiate a Logistic Regression model and train it on the TF-IDF transformed training data.



In [6]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

## Evaluate the model

### Subtask:
Evaluate the performance of the trained model on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score).


**Reasoning**:
Use the trained model to predict on the test data and then generate and print the classification report.



In [7]:
y_pred = model.predict(X_test_tfidf)

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         OBJ       0.70      0.93      0.80       104
        SUBJ       0.75      0.34      0.47        62

    accuracy                           0.71       166
   macro avg       0.73      0.64      0.63       166
weighted avg       0.72      0.71      0.68       166



## Summary:

### Data Analysis Key Findings

*   The dataset was split into training and testing sets with a test size of 20%.
*   Text data was preprocessed using TF-IDF vectorization.
*   A Logistic Regression model was trained on the TF-IDF transformed training data.
*   The model achieved an overall accuracy of 0.71 on the test set.
*   For the "OBJ" class, the model showed good recall (0.93) but lower precision (0.70), resulting in an F1-score of 0.80.
*   For the "SUBJ" class, the model had higher precision (0.75) but lower recall (0.34), leading to an F1-score of 0.47.
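The F1-scores quoted in these findings follow from the precision and recall values via the harmonic mean, F1 = 2PR / (P + R). A short check, using only the rounded numbers from the classification report (so the last digit could in principle differ from scikit-learn's unrounded computation):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values taken from the report above.
f1_obj = f1(0.70, 0.93)
f1_subj = f1(0.75, 0.34)

print(f"OBJ  F1 = {f1_obj:.2f}")
print(f"SUBJ F1 = {f1_subj:.2f}")
```

The low SUBJ score is driven almost entirely by recall: the harmonic mean is pulled toward the smaller of the two values.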

### Insights or Next Steps

*   The current model performs better at identifying objective sentences than subjective ones. Further work could focus on improving the model's ability to detect subjective language.
*   Experimenting with different vectorization methods (e.g., word embeddings) or more complex classifiers could potentially improve the performance, especially for the subjective class.
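One inexpensive variation to try before moving to embeddings is reweighting the classes, since SUBJ is the minority class in the training data. The sketch below is an assumption about what such an experiment could look like, not something run in this notebook: it uses a handful of invented sentences, and the `ngram_range` and `max_iter` values are arbitrary choices. No claim is made that it improves the scores above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy sentences, only to make the sketch runnable.
toy_sentences = [
    "The report was published on Monday.",
    "The committee met for three hours.",
    "Unemployment fell by two percent last year.",
    "The bill passed with 60 votes.",
    "This is a disgraceful and cowardly decision.",
    "Frankly, the whole plan is absurd.",
]
toy_labels = ["OBJ", "OBJ", "OBJ", "OBJ", "SUBJ", "SUBJ"]

# class_weight="balanced" weights each class inversely to its frequency,
# so errors on the rarer SUBJ class cost more during training.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(toy_sentences, toy_labels)

preds = clf.predict(["I think this is a terrible idea."])
print(preds)
```

Wrapping the vectorizer and classifier in one pipeline also guarantees that the vocabulary is fitted on training data only.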


In [8]:
file_path = '/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/task1/data/english/dev_en.tsv'
dev_sub = pd.read_csv(file_path,sep="\t")

In [9]:
dev_sub

Unnamed: 0,sentence_id,sentence,label,solved_conflict
0,ab677701-ae20-42b4-89f2-ddf1eb71b8b7,(It’s also true that Alien Nation was “a maste...,SUBJ,False
1,56164d11-a7f5-4ac5-8dde-681e6f3436e1,It’s all justified in the name of racial “equi...,SUBJ,False
2,676d7dfd-f9d9-42fd-9b00-2165f4576cd7,These issues include punishing Sanctuary Citie...,OBJ,False
3,f3c8718e-d553-4730-89d5-5077c96de10a,"Only 20 percent of voters support it, while a ...",OBJ,False
4,10cb731a-0d62-4e85-b101-a51b48a20219,"In contrast, in 2017 fewer than half of Republ...",OBJ,False
...,...,...,...,...
457,62cff7b0-e459-4cbe-b534-6a814a50302e,"The poverty memoir is an old, popular, well-es...",OBJ,
458,dfd1bc2d-5419-46fd-84a1-ac4d2ee2138d,It doesn’t tinker with the underlying social s...,SUBJ,
459,e398e3aa-2af1-45d4-9672-d709d50b6aed,Asylum claimants must be kept in Mexico while ...,OBJ,
460,7db63c04-f58e-4b77-93d4-d1e22fb7a376,Another example was the Governor’s move last w...,OBJ,


In [10]:
X_dev = dev_sub['sentence']
y_dev = dev_sub['label']

X_dev_tfidf = tfidf_vectorizer.transform(X_dev)

y_dev_pred = model.predict(X_dev_tfidf)

print(classification_report(y_dev, y_dev_pred))

              precision    recall  f1-score   support

         OBJ       0.49      0.97      0.65       222
        SUBJ       0.70      0.06      0.11       240

    accuracy                           0.50       462
   macro avg       0.59      0.52      0.38       462
weighted avg       0.60      0.50      0.37       462
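One possible contributor to the drop from the held-out test split to the dev set is a shift in label distribution: the training file is mostly OBJ, while the dev file is slightly majority SUBJ. The sketch below only restates the counts shown in the outputs above and compares the dev accuracy with a trivial "always predict the training majority class" baseline; it does not prove that the shift is the cause.

```python
# Class counts copied from the outputs above.
train_counts = {"OBJ": 532, "SUBJ": 298}
dev_counts = {"OBJ": 222, "SUBJ": 240}

def proportions(counts):
    """Convert raw class counts to fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_props = proportions(train_counts)
dev_props = proportions(dev_counts)

# Accuracy on dev of a classifier that always predicts the training majority class.
majority_label = max(train_counts, key=train_counts.get)
baseline_dev_acc = dev_props[majority_label]

print(f"train OBJ share: {train_props['OBJ']:.2f}")
print(f"dev   OBJ share: {dev_props['OBJ']:.2f}")
print(f"majority-class baseline on dev: {baseline_dev_acc:.2f}")
```

The TF-IDF model's dev accuracy of 0.50 is only marginally above this baseline, consistent with it predicting OBJ for almost every dev sentence (OBJ recall 0.97, SUBJ recall 0.06).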



In [11]:
!pip install transformers peft datasets accelerate scikit-learn torch




In [12]:
import torch

# Returns True if a CUDA GPU is available
print(f"CUDA available? {torch.cuda.is_available()}")

# If one is available, show the GPU name
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

CUDA available? True
GPU name: Tesla T4


In [13]:
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model
from sklearn.metrics import classification_report

# ============================================
# 0️⃣ LOAD DATA (added so the cell is self-contained)
# ============================================
# Replace with your actual file paths
train_file_path = '/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/task1/data/english/train_en.tsv'
dev_file_path = '/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/task1/data/english/dev_en.tsv'

df_sub = pd.read_csv(train_file_path, sep="\t")
dev_sub = pd.read_csv(dev_file_path, sep="\t")

# ============================================
# 1️⃣ PREPROCESS DATA
# ============================================

# Example: labels are "OBJ" and "SUBJ"
label_map = {"OBJ": 0, "SUBJ": 1}

# Map to numeric labels
df_sub["label"] = df_sub["label"].map(label_map)
dev_sub["label"] = dev_sub["label"].map(label_map)

print("Label distribution (train):")
print(df_sub["label"].value_counts())
print("\nLabel distribution (dev):")
print(dev_sub["label"].value_counts())

# ============================================
# 2️⃣ DEFINE DATASET CLASS
# ============================================

class TextDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=128):
        self.texts = df["sentence"].tolist()
        self.labels = df["label"].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        enc = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
        item = {key: val.squeeze() for key, val in enc.items()}
        item["labels"] = torch.tensor(label, dtype=torch.long)
        return item

# ============================================
# 3️⃣ LOAD PRETRAINED MODEL + TOKENIZER
# ============================================

model_name = "bert-base-uncased"  # can replace with another model if desired

tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # "OBJ" and "SUBJ"
    id2label={0: "OBJ", 1: "SUBJ"},  # added for clarity
    label2id={"OBJ": 0, "SUBJ": 1}
)

# ============================================
# 4️⃣ ADD LORA ADAPTER
# ============================================

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query", "value"],  # correct module names for BERT
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# ============================================
# 5️⃣ CREATE DATASETS
# ============================================

train_dataset = TextDataset(df_sub, tokenizer)
dev_dataset = TextDataset(dev_sub, tokenizer)

print(f"Train dataset size: {len(train_dataset)}")
print(f"Dev dataset size: {len(dev_dataset)}")

# ============================================
# 6️⃣ DEFINE METRICS
# ============================================

def compute_metrics(pred):
    preds = np.argmax(pred.predictions, axis=1)
    labels = pred.label_ids
    accuracy = (preds == labels).mean()
    return {"accuracy": accuracy}

# ============================================
# 7️⃣ TRAINING ARGUMENTS
# ============================================

training_args = TrainingArguments(
    output_dir="./lora_results",
    # Note: "evaluation_strategy" was renamed to "eval_strategy" in recent transformers releases
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none",
)

# ============================================
# 8️⃣ TRAINER SETUP
# ============================================

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics,
)

# ============================================
# 9️⃣ TRAIN MODEL
# ============================================

trainer.train()

# ============================================
# 🔟 EVALUATE MODEL
# ============================================

preds_output = trainer.predict(dev_dataset)

# Using `preds_output.label_ids` is more robust than `dev_sub["label"].values`
# because it guarantees the labels are in the same order as the predictions.
y_true = preds_output.label_ids
y_pred = np.argmax(preds_output.predictions, axis=1)

inv_label_map = {0: "OBJ", 1: "SUBJ"}
y_true_labels = [inv_label_map[i] for i in y_true]
y_pred_labels = [inv_label_map[i] for i in y_pred]

print("\n--- Classification Report ---")
print(classification_report(y_true_labels, y_pred_labels))

Label distribution (train):
label
0    532
1    298
Name: count, dtype: int64

Label distribution (dev):
label
1    240
0    222
Name: count, dtype: int64


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,450 || all params: 109,780,228 || trainable%: 0.2700
Train dataset size: 830
Dev dataset size: 462
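The trainable-parameter count printed by `print_trainable_parameters()` can be reproduced by hand. The arithmetic below assumes the standard `bert-base-uncased` dimensions (12 layers, hidden size 768) and assumes that PEFT keeps the classification head trainable for `SEQ_CLS` tasks; the second assumption is not stated in the notebook but is consistent with the reported total.

```python
hidden_size = 768        # bert-base-uncased hidden dimension
num_layers = 12          # bert-base-uncased encoder layers
r = 8                    # LoRA rank from lora_config
adapted_per_layer = 2    # "query" and "value" projections
num_labels = 2           # OBJ / SUBJ

# Each adapted 768x768 linear layer gets A (r x in) and B (out x r).
lora_per_module = r * hidden_size + hidden_size * r
lora_total = lora_per_module * adapted_per_layer * num_layers

# Classification head: weight (hidden x labels) plus bias.
classifier_head = hidden_size * num_labels + num_labels

trainable = lora_total + classifier_head
all_params = 109_780_228  # total reported in the output above

print(f"LoRA parameters:      {lora_total:,}")
print(f"classifier head:      {classifier_head:,}")
print(f"trainable parameters: {trainable:,}")
print(f"trainable share:      {100 * trainable / all_params:.4f}%")
```

Nearly all of the trainable budget sits in the LoRA matrices; the head contributes only 1,538 parameters.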


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6277,0.760278,0.528139
2,0.5052,0.622973,0.699134
3,0.4607,0.591581,0.720779



--- Classification Report ---
              precision    recall  f1-score   support

         OBJ       0.66      0.86      0.75       222
        SUBJ       0.82      0.60      0.69       240

    accuracy                           0.72       462
   macro avg       0.74      0.73      0.72       462
weighted avg       0.74      0.72      0.72       462
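The `compute_metrics` function used for training reports accuracy only, and `metric_for_best_model` selects checkpoints on it. Because the per-class recalls differ noticeably, tracking macro-F1 alongside accuracy may be useful. The following is a hypothetical variant (`compute_metrics_with_f1` is not part of the notebook); it is exercised here on fake logits through a `SimpleNamespace` that mimics the `predictions` and `label_ids` attributes the Trainer passes in.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_with_f1(pred):
    """Accuracy plus macro-averaged F1 from logits and integer labels."""
    preds = np.argmax(pred.predictions, axis=1)
    labels = pred.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
    }

# Fake logits for four examples, used only to exercise the function.
fake = SimpleNamespace(
    predictions=np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.1, 0.9]]),
    label_ids=np.array([0, 1, 1, 1]),
)
metrics = compute_metrics_with_f1(fake)
print(metrics)
```

To select checkpoints on it, `metric_for_best_model` would be set to `"macro_f1"` in the training arguments.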



In [15]:
from google.colab import drive
import os

# --- 1. Mount Google Drive ---
print("Mounting Google Drive...")
drive.mount('/content/drive')

# --- 2. Define a PERMANENT path ---
# A new directory for the subjectivity model
output_dir = "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/subj_obj_lora_final"

# Create the directory if it does not exist
os.makedirs(output_dir, exist_ok=True)
print(f"Save directory: {output_dir}")

# ============================================
# 11. SAVE THE FINAL MODEL
# ============================================
# (Assumes 'model' and 'tokenizer' are the variables
# from the training script that was just run)

print(f"Saving LoRA adapters to {output_dir}...")

# 1. Save the PEFT (LoRA) adapter weights
model.save_pretrained(output_dir)

# 2. Save the tokenizer
tokenizer.save_pretrained(output_dir)

print("Model and tokenizer saved permanently to your Google Drive!")

# Check the files saved to Drive (the path contains spaces, so it must be quoted)
!ls -lh "{output_dir}"

Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Save directory: /content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/subj_obj_lora_final
Saving LoRA adapters to /content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/subj_obj_lora_final...
Model and tokenizer saved permanently to your Google Drive!
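Shell commands such as `!ls` split unquoted arguments on whitespace, and the Drive path used in this notebook contains several spaces. The sketch below uses the standard-library `shlex` module to show how a POSIX shell would tokenize the command with and without quoting; it does not touch the file system.

```python
import shlex

path = (
    "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/"
    "Tarefa Subjetividade Clef2025-1/subj_obj_lora_final"
)

# Unquoted: the shell sees one argument per whitespace-separated piece.
unquoted_args = shlex.split(f"ls -lh {path}")

# Quoted: shlex.quote wraps the path so it survives as a single argument.
quoted_args = shlex.split(f"ls -lh {shlex.quote(path)}")

print(len(unquoted_args) - 2, "path fragments when unquoted")
print(len(quoted_args) - 2, "path argument when quoted")
```

From Python, `os.listdir(path)` or `subprocess.run(["ls", "-lh", path])` avoids the problem entirely because no shell parsing takes place.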


In [16]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
from google.colab import drive

# --- 1. Mount Google Drive ---
print("Mounting Google Drive...")
drive.mount('/content/drive')

# ============================================
# 12. LOAD THE SAVED MODEL FROM DRIVE
# ============================================

# --- 2. Define the paths ---
base_model_name = "bert-base-uncased"
# The exact path where the adapters were saved
adapter_dir = "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/subj_obj_lora_final"

print(f"Loading base model: {base_model_name}...")

# --- 3. Load the base model ---
# NOTE: use the correct labels for THIS model
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2,
    id2label={0: "OBJ", 1: "SUBJ"},
    label2id={"OBJ": 0, "SUBJ": 1}
)

print(f"Loading and applying LoRA adapters from: {adapter_dir}...")

# --- 4. Apply the LoRA adapters ---
# This wraps the base model with the saved adapter weights;
# they stay separate from the base weights unless merge_and_unload() is called
model = PeftModel.from_pretrained(base_model, adapter_dir)

# --- 5. Load the tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

print("Model and tokenizer loaded from Google Drive!")

# --- 6. Prepare the model for inference ---
model.eval()
if torch.cuda.is_available():
    model.to("cuda")
    print("Model moved to the GPU.")

Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loading base model: bert-base-uncased...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading and applying LoRA adapters from: /content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/subj_obj_lora_final...
Model and tokenizer loaded from Google Drive!
Model moved to the GPU.
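`PeftModel.from_pretrained` keeps the LoRA matrices as separate modules next to the frozen base weights; folding them into the base weights is a separate, optional step. The NumPy sketch below illustrates why both forms give the same output for a single linear layer, using the `r=8` and `lora_alpha=32` values from the training configuration. The layer width of 16 and the random matrices are illustrative assumptions, not the real BERT weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 16, 8, 32          # toy width; r and alpha as in lora_config
scaling = alpha / r              # LoRA scales the update by lora_alpha / r

W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(r, d))      # LoRA down-projection
B = rng.normal(size=(d, r))      # LoRA up-projection
x = rng.normal(size=(d,))        # one input vector

# Adapter kept separate: base path plus low-rank path.
y_wrapped = W @ x + scaling * (B @ (A @ x))

# Adapter merged into the weight matrix once, then a single matmul.
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_wrapped, y_merged))
```

Keeping the adapter separate makes it cheap to store and swap; merging removes the extra matrix products at inference time.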


Using the subjectivity model on the detection DataFrame


In [21]:
!pip install nltk -q

import torch
import pandas as pd
import numpy as np
import nltk
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline
)
from peft import PeftModel
from google.colab import drive
from tqdm.auto import tqdm  # for progress bars

# ============================================
# 1. MOUNT DRIVE AND LOAD THE SUBJECTIVITY MODEL
# (This is the SUBJ/OBJ model trained above)
# ============================================

print("Mounting Google Drive...")
drive.mount('/content/drive', force_remount=True)

print("Loading the subjectivity model (SUBJ/OBJ)...")
base_model_name = "bert-base-uncased"
adapter_dir = "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/subj_obj_lora_final"

# Load the base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2,
    id2label={0: "OBJ", 1: "SUBJ"},
    label2id={"OBJ": 0, "SUBJ": 1}
)

# Apply the LoRA adapters
model_subj_obj = PeftModel.from_pretrained(base_model, adapter_dir)

# Load the tokenizer
tokenizer_subj_obj = AutoTokenizer.from_pretrained(adapter_dir)

# Prepare the model for inference and move it to the GPU
model_subj_obj.eval()
if torch.cuda.is_available():
    model_subj_obj.to("cuda")
    print("SUBJ/OBJ model moved to the GPU.")

# ============================================
# 2. LOAD THE DETECTION DATAFRAME (Human vs. AI)
# ============================================

# NOTE: replace this path with the correct path to your detection DataFrame.
# The original note refers to the training file (the one with 120k rows);
# the path below points to the dev file.
try:
    # Attempt to load the detection file
    train_file_path = "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Deteccao semeval 2024-8/SubtaskA-20251024T182540Z-1-001/SubtaskA/subtaskA_dev_monolingual.jsonl"
    df_detection = pd.read_json(train_file_path, lines=True)
    print(f"Loaded detection DataFrame with {len(df_detection)} rows.")

except FileNotFoundError:
    raise
    # print("Detection file not found. Using a SMALL EXAMPLE.")
    # # Example based on the DataFrame's 'head()'
    # data = {
    #     'text': [
    #         "Forza Motorsport is a popular racing game that... It's very realistic.",
    #         "Buying Virtual Console games for your Nintendo... This guide explains how.",
    #         "Windows NT 4.0 was a popular operating system... Many businesses used it.",
    #         "How to Make Perfume\n\nPerfume is a great way to smell good. It can be expensive. You can make it yourself."
    #     ],
    #     'model': ['chatGPT', 'chatGPT', 'chatGPT', 'human'],
    #     'source': ['wikihow', 'wikihow', 'wikihow', 'wikihow'],
    #     'id': [0, 1, 2, 3]
    # }
    # df_detection = pd.DataFrame(data)


# ============================================
# 3. SPLIT THE TEXTS INTO SENTENCES
# ============================================

print("Downloading the NLTK sentence tokenizer ('punkt')...")
nltk.download('punkt_tab', quiet=True)

print("Splitting texts into sentences... (this may take a moment)")

# 1. Apply NLTK 'sent_tokenize' to create a list of sentences per row
df_detection['sentences_list'] = df_detection['text'].apply(lambda x: nltk.sent_tokenize(str(x)))

# 2. "Explode" the DataFrame: each sentence becomes a new row
df_exploded = df_detection.explode('sentences_list')

# 3. Clean the final DataFrame
#    First, rename the new column
df_exploded = df_exploded.rename(columns={'sentences_list': 'sentence'})
#    Second, drop the original text column (no longer needed)
df_exploded = df_exploded.drop(columns=['text'])
#    Third, drop rows where the sentence may be null
df_exploded = df_exploded.dropna(subset=['sentence'])
#    Fourth, reset the index
df_exploded = df_exploded.reset_index(drop=True)

total_sentences = len(df_exploded)
print(f"Original dataset of {len(df_detection)} texts was split into {total_sentences} sentences.")

# ============================================
# 4. CLASSIFY THE SENTENCES (SUBJ vs OBJ)
# ============================================

print("Initializing the classification pipeline...")

# Create a pipeline for batched inference
subj_pipe = pipeline(
    "text-classification",
    model=model_subj_obj,
    tokenizer=tokenizer_subj_obj,
    device=0 if torch.cuda.is_available() else -1,
    batch_size=32
)

# Collect all sentences to classify
sentences_list = df_exploded['sentence'].tolist()

print(f"Classifying {total_sentences} sentences (SUBJ vs OBJ)...")

results = []
# 'truncation=True' cuts sentences longer than 512 tokens,
# which would otherwise exceed the model's maximum input length.
for result in tqdm(subj_pipe(sentences_list, truncation=True), total=total_sentences):
    results.append(result)

print("Classification complete.")

# ============================================
# 5. ADD THE RESULTS TO THE DATAFRAME
# ============================================

# The results come back as [{'label': 'SUBJ', 'score': 0.9}, ...]
# Split them into new columns
df_exploded['subj_label'] = [r['label'] for r in results]
df_exploded['subj_score'] = [r['score'] for r in results]

print("\n--- Analysis complete ---")
print("Preview of the final DataFrame with subjectivity classification:")
print(df_exploded.head(10))

print("\nClassification counts by source model (AI vs. human):")
print(df_exploded.groupby('model')['subj_label'].value_counts(normalize=True))

Mounting Google Drive...
Mounted at /content/drive
Loading the subjectivity model (SUBJ/OBJ)...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SUBJ/OBJ model moved to the GPU.
Loaded detection DataFrame with 5000 rows.
Downloading the NLTK sentence tokenizer ('punkt')...
Splitting texts into sentences... (this may take a moment)


[nltk_data] Error loading punkt-tab: Package 'punkt-tab' not found in
[nltk_data]     index
Device set to use cuda:0


Original dataset of 5000 texts was split into 84282 sentences.
Initializing the classification pipeline...
Classifying 84282 sentences (SUBJ vs OBJ)...


  0%|          | 0/84282 [00:00<?, ?it/s]

Classification complete.

--- Analysis complete ---
Preview of the final DataFrame with subjectivity classification:
   label   model   source  id  \
0      1  bloomz  wikihow   0   
1      1  bloomz  wikihow   0   
2      1  bloomz  wikihow   0   
3      1  bloomz  wikihow   0   
4      1  bloomz  wikihow   0   
5      1  bloomz  wikihow   0   
6      1  bloomz  wikihow   0   
7      1  bloomz  wikihow   0   
8      1  bloomz  wikihow   1   
9      1  bloomz  wikihow   1   

                                            sentence subj_label  subj_score  
0           Giving gifts should always be enjoyable.        OBJ    0.534707  
1  However, it may become stressful when trying t...        OBJ    0.735707  
2  This wikiHow will help you figure out exactly ...       SUBJ    0.504216  
3  If you're having trouble deciding between two ...       SUBJ    0.535345  
4  Make sure it's appropriate - some people don't...       SUBJ    0.607313  
5  Don't forget to include any special requests 
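The split-then-explode step and the per-model label shares can be tried on a tiny example before running them on the full file. The sketch below is a simplified stand-in: it uses a naive regular-expression splitter instead of `nltk.sent_tokenize` so that it needs no download, and the `subj_label` values are invented placeholders rather than model predictions.

```python
import re

import pandas as pd

toy = pd.DataFrame({
    "id": [0, 1],
    "model": ["human", "bloomz"],
    "text": ["First sentence. Second one!", "Only one sentence here."],
})

# Naive splitter (assumption): break after ., ! or ? followed by whitespace.
toy["sentence"] = toy["text"].apply(
    lambda t: re.split(r"(?<=[.!?])\s+", t.strip())
)

# One row per sentence; the id and model columns are repeated for each sentence.
exploded = (
    toy.explode("sentence")
    .drop(columns=["text"])
    .reset_index(drop=True)
)

# Placeholder labels, standing in for the classifier output.
exploded["subj_label"] = ["OBJ", "SUBJ", "OBJ"]

# Share of each label within each source model.
shares = exploded.groupby("model")["subj_label"].value_counts(normalize=True)
print(exploded)
print(shares)
```

Because every sentence keeps its document `id`, results can later be aggregated back to the document level.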

In [22]:
# ============================================
# 6. SAVE THE FINAL DATAFRAME TO CSV
# ============================================

# Define a permanent path on Google Drive for the CSV file
# (Assumes Drive is already mounted from 'Section 1')

output_csv_path = "/content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/detection_dataset_with_subj_scores.csv"

print(f"Saving exploded DataFrame to {output_csv_path}...")

# Save the DataFrame to a CSV file
# index=False avoids writing the pandas index as an extra column
df_exploded.to_csv(output_csv_path, index=False)

print("CSV file saved successfully to Google Drive!")

# (Optional) Inspect the saved file (the path contains spaces, so it must be quoted)
!ls -lh "{output_csv_path}"
!head "{output_csv_path}"

Saving exploded DataFrame to /content/drive/MyDrive/Mestrado/Modelos de Linguagem/Trabalho LLM/Tarefa Subjetividade Clef2025-1/detection_dataset_with_subj_scores.csv...
CSV file saved successfully to Google Drive!

In [24]:
from IPython.display import display

# ============================================
# 7. ANALYZE THE MOST SUBJECTIVE SENTENCES
# ============================================
# (Run this cell after Section 5)

print(f"Analyzing {len(df_exploded)} sentences...")

# 1. Filter the DataFrame to keep only the 'SUBJ' rows
df_only_subj = df_exploded[df_exploded['subj_label'] == 'SUBJ'].copy()

# 2. Sort those rows by 'subj_score' (confidence) in descending order
df_top_subj = df_only_subj.sort_values(by='subj_score', ascending=False)

print("\n--- Top 20 Sentences with Highest Subjectivity Confidence ---")

# 3. Show the first 20 rows
# Use display() instead of print() for Colab's table formatting
display(df_top_subj.head(20))



Analyzing 84282 sentences...

--- Top 20 Sentences with Highest Subjectivity Confidence ---


Unnamed: 0,label,model,source,id,sentence,subj_label,subj_score
39328,0,human,wikipedia,1651,In a speech at Harvard's Kennedy School of Gov...,SUBJ,0.923892
8489,0,human,wikihow,549,"""All governments naturally tend towards tyrann...",SUBJ,0.922756
60256,0,human,reddit,2730,Think of blood type as a sort of evolutionary ...,SUBJ,0.917686
21887,0,human,wikihow,822,"""Wives, place yourselves under your husbands' ...",SUBJ,0.917444
65286,0,human,reddit,2999,So the Roma end up marginalised and end up per...,SUBJ,0.917142
62759,0,human,reddit,2856,It just generates value in the hands of a few ...,SUBJ,0.916582
62783,0,human,reddit,2857,Eventually we're going to have a society where...,SUBJ,0.916321
60882,0,human,reddit,2762,The basic concept of a stock market is a fanta...,SUBJ,0.915961
65246,0,human,reddit,2997,Misinformed or outright untruthful political p...,SUBJ,0.915773
35592,0,human,wikipedia,1541,The racial makeup of Tupman was 149 (92.5%) Wh...,SUBJ,0.915247
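The `subj_score` column holds the confidence of whichever label was predicted, so an OBJ row with score 0.73 and a SUBJ row with score 0.73 mean opposite things. To rank all sentences on one scale, the score can be converted to a probability of SUBJ. The helper below is a hypothetical addition (`prob_subj` is not in the notebook) and assumes the pipeline score is the softmax probability of the predicted class in this two-class model.

```python
def prob_subj(label: str, score: float) -> float:
    """Probability of SUBJ given the predicted label and its confidence."""
    return score if label == "SUBJ" else 1.0 - score

# (label, score) pairs copied from the first rows of the preview output.
rows = [("OBJ", 0.534707), ("OBJ", 0.735707), ("SUBJ", 0.504216)]

p_subj = [round(prob_subj(label, score), 6) for label, score in rows]
print(p_subj)
```

Sorting by this value orders every sentence from most objective to most subjective, including those near the 0.5 decision boundary.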


In [34]:
df_top_subj.iloc[0].sentence

'In a speech at Harvard\'s Kennedy School of Government, Lars Løkke Rasmussen, the centre-right Danish prime minister from the conservative-liberal Venstre party, addressed the American misconception that the Nordic model is a form of socialism, which is conflated with any form of planned economy, stating: "I know that some people in the US associate the Nordic model with some sort of socialism.'