# Bag-of-words baseline

To reproduce the model provided by the competition organizers on Codalab, follow the steps below.

In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
import nltk

from tqdm import tqdm

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\javid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Download the latest data
Once the required packages are set up, the datasets are downloaded directly from the server, so the most recent versions are always used.

In [9]:
import wget
wget.download("https://pln.inf.um.es/corpora/politices/2023/politicES_phase_1_traindev_public.csv","../data/politicES_phase_1_traindev_public.csv")
wget.download("https://pln.inf.um.es/corpora/politices/2023/politicES_phase_1_testdev_codalab.csv","../data/politicES_phase_1_testdev_codalab.csv")

100% [............................................................................] 926389 / 926389

'../data/politicES_phase_1_testdev_codalab.csv'
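The download cell writes into `../data`, so that directory has to exist beforehand. A minimal sketch of a guard, assuming the folder may be missing on a fresh checkout; `ensure_parent_dir` is a hypothetical helper name, not part of the organizers' notebook:

```python
from pathlib import Path


def ensure_parent_dir(target: str) -> Path:
    """Create the parent directory of `target` if it is missing and return it."""
    parent = Path(target).parent
    # parents=True builds intermediate folders; exist_ok=True makes the call idempotent.
    parent.mkdir(parents=True, exist_ok=True)
    return parent


# Intended use, before each wget.download call:
# ensure_parent_dir("../data/politicES_phase_1_traindev_public.csv")
```

Calling it once per output path is enough; repeated calls do nothing if the folder is already there.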

## Load the data
Next, the data is loaded into memory so it can be processed.

In [14]:
df_train = pd.read_csv("../data/politicES_phase_1_traindev_public.csv")
df_test = pd.read_csv("../data/politicES_phase_1_testdev_codalab.csv")

columns = ['label', 'gender', 'profession', 'ideology_binary', 'ideology_multiclass']

for df in [df_train, df_test]:
    for column in columns:
        df[column] = df[column].astype('category')

## Train the baseline model
The data is first transformed by merging the documents that belong to each cluster; a separate logistic regression model is then trained for each trait.

In [20]:
dataframes = {
    'train': df_train,
    'test': df_test
}

for key, df in dataframes.items():
    # One row per distinct combination of cluster label and traits.
    group = df.groupby(by = columns, dropna=False, observed=True, sort=False)
    df_clusters = group[columns].agg(func=['count'], as_index=False, observed=True).index.to_frame(index=False)
    merged_fields = []
    pbar = tqdm(df_clusters.iterrows(), total = df_clusters.shape[0], desc="merging clusters")
    for index, row in pbar:
        # Collect every tweet of the cluster and join them into a single document.
        df_cluster = df[(df['label'] == row['label'])]
        merged_fields.append({**row, **{field:' [SEP] '.join(df_cluster[field].fillna('')) for field in ['tweet']}})
    # Replace the tweet-level frame with the cluster-level one.
    dataframes[key] = pd.DataFrame(merged_fields)

merging clusters: 100%|████████████████████████████████████████████████████████████| 360/360 [00:00<00:00, 1800.98it/s]
merging clusters: 100%|██████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 1658.13it/s]
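The merge step can be pictured without pandas. The following is a simplified sketch, not the organizers' code: it assumes each record is a plain dict with `label` and `tweet` keys, and `merge_by_cluster` is a hypothetical helper name.

```python
from collections import defaultdict

SEP = " [SEP] "  # same separator string as in the notebook cell


def merge_by_cluster(rows):
    """Concatenate the tweets of each cluster label into one document."""
    grouped = defaultdict(list)
    for row in rows:
        # Missing tweets become empty strings, mirroring fillna('').
        grouped[row["label"]].append(row.get("tweet") or "")
    return {label: SEP.join(tweets) for label, tweets in grouped.items()}


rows = [
    {"label": "c1", "tweet": "hola mundo"},
    {"label": "c2", "tweet": "buenos días"},
    {"label": "c1", "tweet": None},
    {"label": "c1", "tweet": "adiós"},
]
merged = merge_by_cluster(rows)
print(merged["c1"])
```

Each cluster ends up as a single long string, which is what the vectorizer later treats as one document.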


In [22]:
stop_words = nltk.corpus.stopwords.words('spanish')

vectorizer = TfidfVectorizer(
    analyzer='word',
    min_df = .1,
    max_features = 50_000,
    lowercase=True,
    stop_words=stop_words
)

X_train = vectorizer.fit_transform(dataframes['train']['tweet'])

X_test = vectorizer.transform(dataframes['test']['tweet'])

baselines = {}

for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:
    baselines[label] = LogisticRegression()
    
    baselines[label].fit(X_train, dataframes['train'][label])
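When `min_df` is a float, `TfidfVectorizer` reads it as a proportion of documents. With the 360 training clusters reported above, `min_df = .1` keeps only terms that occur in at least 36 cluster documents. The sketch below illustrates that filtering rule in plain Python; it is not scikit-learn's implementation, and `filter_by_min_df` is a hypothetical name.

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Return the terms whose document frequency is at least min_df * n_docs."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(tokens))
    threshold = min_df * n_docs
    return {term for term, freq in doc_freq.items() if freq >= threshold}


docs = [
    ["economía", "empleo"],
    ["economía", "sanidad"],
    ["economía", "empleo", "cultura"],
    ["deporte"],
]
# With 4 documents and min_df=0.5, a term must appear in at least 2 of them.
kept = filter_by_min_df(docs, min_df=0.5)
print(sorted(kept))
```

A high threshold such as 10% discards rare vocabulary, which keeps the feature matrix small but may drop words that are distinctive for a few clusters.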

In [23]:
for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:
    y_pred = baselines[label].predict(X_test)
    print(label)
    print(classification_report(dataframes['test'][label], y_pred, zero_division=0, digits=6))

gender
              precision    recall  f1-score   support

      female   0.800000  0.117647  0.205128        34
        male   0.647059  0.982143  0.780142        56

    accuracy                       0.655556        90
   macro avg   0.723529  0.549895  0.492635        90
weighted avg   0.704837  0.655556  0.562914        90

profession
              precision    recall  f1-score   support

   celebrity   0.000000  0.000000  0.000000         4
  journalist   0.757576  0.980392  0.854701        51
  politician   0.958333  0.657143  0.779661        35

    accuracy                       0.811111        90
   macro avg   0.571970  0.545845  0.544787        90
weighted avg   0.801978  0.811111  0.787532        90

ideology_binary
              precision    recall  f1-score   support

        left   0.754098  0.884615  0.814159        52
       right   0.793103  0.605263  0.686567        38

    accuracy                       0.766667        90
   macro avg   0.773601  0.744939  0.750

In [27]:
# Compute the overall score (this is the value used for the Codalab submission)
f1_scores = {}
for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:
    y_pred = baselines[label].predict(X_test)
    f1_scores[label] = f1_score(dataframes['test'][label], y_pred, average='macro')
f1_scores = list(f1_scores.values())
print("The final F1 score is {f1}".format(f1=np.mean(f1_scores)))

The final F1 score is 0.5473568187423464
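The overall score is the mean of the per-trait macro F1 values. As a rough illustration of what `average='macro'` computes, here is a dependency-free sketch; `macro_f1` and `overall_score` are hypothetical names, and the labels are toy data rather than results from the competition.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def overall_score(per_trait_scores):
    """Average the macro F1 values of all traits into a single number."""
    return sum(per_trait_scores.values()) / len(per_trait_scores)


toy_true = ["a", "a", "b", "b"]
toy_pred = ["a", "b", "b", "b"]
print(macro_f1(toy_true, toy_pred))
```

Because every class counts equally, a class that is never predicted correctly (such as `celebrity` in the report above, with an F1 of 0) pulls the macro average down regardless of how few samples it has.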
