# Weak Supervision Pipeline - News from Folha de SP

This notebook is an assignment for the Datacentric AI (IMD3011) course at Instituto Metrópole Digital (IMD), taught by Professor Elias Jacob. The goal is to develop a weak supervision pipeline as part of the first unit's assessment.
 
Here, we will perform news classification using articles extracted from Folha de S.Paulo, leveraging a dataset available on Kaggle. The workflow will include data exploration, preprocessing, and modeling steps, with a focus on weak supervision techniques for automatic labeling and classification of news articles.

In [None]:

#Download the dataset from Kaggle
!curl -L -o ./news-of-the-site-folhauol.zip\
  https://www.kaggle.com/api/v1/datasets/download/marlesson/news-of-the-site-folhauol

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  187M  100  187M    0     0   154M      0  0:00:01  0:00:01 --:--:--  212M


In [1]:
import snorkel
import pandas as pd
from snorkel.labeling import (
    LFAnalysis,
    PandasLFApplier,
    filter_unlabeled_dataframe,
    labeling_function,
)
from snorkel.labeling.model.label_model import LabelModel
from snorkel.utils import probs_to_preds

snorkel.__version__

'0.9.9'

## EDA

In [2]:
df = pd.read_csv("./data/articles.csv", low_memory=False)

df.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
1,"'Decidi ser escrava das mulheres que sofrem', ...","Para Oumou Sangaré, cantora e ativista malines...",2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
3,Filme 'Star Wars: Os Últimos Jedi' ganha trail...,A Disney divulgou na noite desta segunda-feira...,2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
4,CBSS inicia acordos com fintechs e quer 30% do...,"O CBSS, banco da holding Elopar dos sócios Bra...",2017-09-10,mercado,,http://www1.folha.uol.com.br/mercado/2017/10/1...


In [3]:
print(df.shape)

(167053, 6)


In [4]:
df.category.value_counts(normalize=True)

category
poder                           0.131826
colunas                         0.129432
mercado                         0.125529
esporte                         0.118106
mundo                           0.102542
cotidiano                       0.101567
ilustrada                       0.097843
opiniao                         0.027087
paineldoleitor                  0.024010
saopaulo                        0.023675
tec                             0.013529
tv                              0.012822
educacao                        0.012679
turismo                         0.011392
ilustrissima                    0.008446
ciencia                         0.007991
equilibrioesaude                0.007854
sobretudo                       0.006327
bbc                             0.005866
folhinha                        0.005244
empreendedorsocial              0.005034
comida                          0.004957
asmais                          0.003280
ambiente                        0.002939
seminar

In [5]:
filtered_categories = ["poder", "mercado", "esporte", "mundo"]
df = df[df['category'].isin(filtered_categories)]

print(df.shape)

(79852, 6)


In [6]:
print(df.columns)

# Generate index column to use as a unique identifier
df = df.drop(columns=["date", "category", "subcategory", "link"]).reset_index()

Index(['title', 'text', 'date', 'category', 'subcategory', 'link'], dtype='object')


In [7]:
for row in df.sample(50, random_state=271828).itertuples():
    print(f"{row.title} - {row.text}")
    print("\n")

Lojistas apostam em pontos não tradicionais buscando novos clientes - Pontos de venda incomuns, fora dos shoppings ou tradicionais comércios de rua, têm sido a aposta de empresários para renovar a clientela e cortar custos em ano de economia fraca.  Denise Schneider, 35, decidiu abrir um consultório odontológico dentro de uma escola de ensino fundamental e médio de Alvorada (RS).  Ela conta que já era proprietária de uma unidade da franquia da Ortoplan em um ponto na rua, então procurou uma alternativa.  "Não queria fazer mais do mesmo. Então encontrei um bairro com uma escola com 800 alunos", diz Schneider, que investirá em campanha de marketing e de conscientização sobre saúde bucal para conquistar não só os jovens mas também seus pais.  Para reduzir os riscos de baixa demanda, Schneider escolheu uma escola com espaço que permite à clínica ter uma saída para a rua e atender a clientela externa.  O sócio-fundador e presidente da Ortoplan, Faisal Ismail, diz que a rede aposta em pontos

In [8]:
# Convert all characters in the 'text' column to lowercase
# This ensures uniformity in text data, which is useful for text processing tasks
df["text"] = df["text"].str.lower()

df

Unnamed: 0,index,title,text
0,0,"Lula diz que está 'lascado', mas que ainda tem...",com a possibilidade de uma condenação impedir ...
1,2,Três reportagens da Folha ganham Prêmio Petrob...,três reportagens da folha foram vencedoras do ...
2,4,CBSS inicia acordos com fintechs e quer 30% do...,"o cbss, banco da holding elopar dos sócios bra..."
3,5,"Em encontro, Bono pergunta a Macri sobre argen...","o vocalista da banda irlandesa u2, bono, fez u..."
4,6,"Posso sair do Brasil quando e como quiser, diz...",o italiano cesare battisti disse nesta segunda...
...,...,...,...
79847,167046,"Apoiado pelos irmãos Gomes, petista toma posse...",engenheiro agrônomo e servidor licenciado do i...
79848,167048,"Em cenário de crise, tucano Beto Richa assume ...",o tucano beto richa tinha tudo para começar se...
79849,167049,Filho supera senador Renan Calheiros e assume ...,o economista renan filho (pmdb) assume nesta q...
79850,167050,"Hoje na TV: Tottenham x Chelsea, Campeonato In...",destaques da programação desta quinta-feira (1...


In [9]:
from utils.text import (
    remove_accented_characters,
    remove_excessive_spaces,
    remove_repeated_letters,
    remove_repeated_non_word_characters,
)
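
The helpers imported from `utils.text` are not shown in this notebook. The following is a minimal sketch of what such functions might look like, assuming each one takes and returns a single string. The bodies are illustrative guesses based on the function names only; the real implementations in `utils/text.py` may differ.

```python
import re
import unicodedata


def remove_accented_characters(text: str) -> str:
    # Decompose accented characters (e.g. "ç" becomes "c" + combining cedilla),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_excessive_spaces(text: str) -> str:
    # Collapse any run of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def remove_repeated_letters(text: str) -> str:
    # Assumption: runs of three or more identical letters are noise; keep two.
    return re.sub(r"([a-zA-Z])\1{2,}", r"\1\1", text)


def remove_repeated_non_word_characters(text: str) -> str:
    # Collapse repeated punctuation such as "!!!" into a single character.
    return re.sub(r"([^\w\s])\1+", r"\1", text)


sample = "a  condenação   foi   looonga!!!"
for clean in (
    remove_accented_characters,
    remove_excessive_spaces,
    remove_repeated_letters,
    remove_repeated_non_word_characters,
):
    sample = clean(sample)
print(sample)
```

Because every step is a plain `str -> str` function, the order matters only where steps interact, for example collapsing spaces before looking for repeated characters.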

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Create a text cleaning pipeline using scikit-learn's Pipeline
# Each step in the pipeline applies a specific text cleaning function

pipeline_clean_text = Pipeline(
    [
        # Step 1: Remove accented characters from the text
        ("remove_accented_characters", FunctionTransformer(remove_accented_characters)),
        # Step 2: Remove excessive spaces from the text
        ("remove_excessive_spaces", FunctionTransformer(remove_excessive_spaces)),
        # Step 3: Remove repeated letters from the text
        ("remove_repeated_letters", FunctionTransformer(remove_repeated_letters)),
        # Step 4: Remove repeated non-word characters (e.g., punctuation) from the text
        (
            "remove_repeated_non_word_characters",
            FunctionTransformer(remove_repeated_non_word_characters),
        ),
    ]
)

In [11]:
# Print the shape of the DataFrame before dropping rows with missing 'text' values
# This shows the number of rows and columns in the DataFrame initially
print(df.shape)

# Drop rows where the 'text' column has NaN (missing) values
# 'dropna' returns a new DataFrame, which we assign back to 'df'
df = df.dropna(subset=["text"])

print(df.shape)

(79852, 3)
(79852, 3)


In [12]:
# Apply the text cleaning pipeline to the 'text' column of the DataFrame
# 'pipeline_clean_text.transform' applies all the cleaning steps defined in the pipeline
# The 'apply' method applies the transformation to each element in the 'text' column
df["text"] = df["text"].apply(pipeline_clean_text.transform)

df

Unnamed: 0,index,title,text
0,0,"Lula diz que está 'lascado', mas que ainda tem...",com a possibilidade de uma condenacao impedir ...
1,2,Três reportagens da Folha ganham Prêmio Petrob...,tres reportagens da folha foram vencedoras do ...
2,4,CBSS inicia acordos com fintechs e quer 30% do...,"o cbss, banco da holding elopar dos socios bra..."
3,5,"Em encontro, Bono pergunta a Macri sobre argen...","o vocalista da banda irlandesa u2, bono, fez u..."
4,6,"Posso sair do Brasil quando e como quiser, diz...",o italiano cesare battisti disse nesta segunda...
...,...,...,...
79847,167046,"Apoiado pelos irmãos Gomes, petista toma posse...",engenheiro agronomo e servidor licenciado do i...
79848,167048,"Em cenário de crise, tucano Beto Richa assume ...",o tucano beto richa tinha tudo para comecar se...
79849,167049,Filho supera senador Renan Calheiros e assume ...,o economista renan filho (pmdb) assume nesta q...
79850,167050,"Hoje na TV: Tottenham x Chelsea, Campeonato In...",destaques da programacao desta quinta-feira (1...


In [13]:
# Drop duplicate rows based on the 'text' column
# This ensures that each text entry in the DataFrame is unique
# 'drop_duplicates' returns a new DataFrame, which we assign back to 'df'
df = df.drop_duplicates(subset=["text"])

print(df.shape)

(79791, 3)


In [14]:
# Calculate descriptive statistics for the length of text in the 'text' column
# The 'str.len()' method computes the length of each string in the 'text' column
# The 'describe()' method provides summary statistics for these lengths
# The list of percentiles [0.05, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.95, 0.98, 0.99, 0.999] specifies additional percentiles to include in the summary
df["text"].str.len().describe([0.05, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.95, 0.98, 0.99, 0.999])

count    79791.000000
mean      2673.606898
std       1820.254852
min         28.000000
5%         808.000000
10%       1058.000000
25%       1551.000000
50%       2341.000000
75%       3323.000000
80%       3608.000000
90%       4531.000000
95%       5625.000000
98%       7311.600000
99%       8784.000000
99.9%    17385.680000
max      60816.000000
Name: text, dtype: float64

In [15]:
# Calculate the 2nd and 98th percentiles for the length of text in the 'text' column
# The 'quantile' method returns the specified percentiles as a Series
# The 'values' attribute converts the Series to a NumPy array for easier indexing
quantiles = df["text"].str.len().quantile([0.02, 0.98]).values

# Filter the DataFrame to keep only rows where the text length is greater than the 2nd percentile
df = df[df["text"].str.len() > quantiles[0]]

# Further filter the DataFrame to keep only rows where the text length is less than or equal to the 98th percentile
df = df[df["text"].str.len() <= quantiles[1]]

print(quantiles)
print(df.shape)

[ 526.  7311.6]
(76597, 3)
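
The 98th percentile printed above (7311.6) is not an integer even though every text length is. This is because pandas' `quantile` interpolates linearly between neighbouring values by default. A minimal plain-Python sketch of that interpolation and of the same trimming rule follows; `linear_quantile`, `trim_by_length`, and the toy data are illustrative names, not part of the notebook.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    data = sorted(values)
    position = q * (len(data) - 1)
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + (data[upper] - data[lower]) * fraction


def trim_by_length(texts, low_q=0.02, high_q=0.98):
    lengths = [len(t) for t in texts]
    low = linear_quantile(lengths, low_q)
    high = linear_quantile(lengths, high_q)
    # Same comparisons as the cell above: strictly greater than the lower
    # cutoff, less than or equal to the upper cutoff.
    return [t for t in texts if low < len(t) <= high]


toy_texts = ["x" * n for n in range(1, 102)]  # toy lengths 1..101
kept = trim_by_length(toy_texts)
print(len(toy_texts), "->", len(kept))
print(linear_quantile([10, 20, 30, 40], 0.5))  # halfway between 20 and 30
```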


In [16]:
# Import the math module for mathematical functions
import math

# Calculate the total number of articles in the dataset
n = len(df)  # Total number of articles

# Define the Z-value for a 95% confidence level
# This value corresponds to the number of standard deviations from the mean in a normal distribution
z = 1.96  # Z-value for 95% confidence level

# Define the expected proportion of the attribute being estimated
# We assume 50% (0.5) as the worst-case scenario to maximize the sample size
p = 0.5  # Expected proportion (worst case)

# Define the margin of error we are willing to accept
# This value represents the maximum acceptable difference between the sample estimate and the true population value
e = 0.05  # Margin of error

# Calculate the required sample size using the formula for sample size estimation
# The formula is derived from the standard error of the proportion
# We use math.ceil to round up to the nearest whole number, ensuring the sample size is sufficient
sample_size = math.ceil((z**2 * p * (1 - p)) / e**2)

# Print the calculated sample size
print(f"Sample size: {sample_size}")

Sample size: 385
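
The cell above computes `n = len(df)` but the formula never uses it, because it assumes an effectively infinite population. One common refinement is the finite population correction. The sketch below applies it with the 76,597 articles left after the length filter. It is an optional extra under that assumption, not a step this notebook performs.

```python
import math

z, p, e = 1.96, 0.5, 0.05
population = 76_597  # len(df) after the length filter above

# Infinite-population estimate, identical to the formula in the cell above
n0 = (z**2 * p * (1 - p)) / e**2

# Finite population correction: shrinks n0 slightly when the population is finite
n_fpc = n0 / (1 + (n0 - 1) / population)

print(f"Without correction: {math.ceil(n0)}")
print(f"With correction:    {math.ceil(n_fpc)}")
```

With a population this large the correction changes the result by only a couple of samples, so rounding up to 400 remains a comfortable choice.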


In [17]:
# We'll round the 385 computed above up to a base sample size of 400.

# Shuffle the dataset so that the split below is random
# 'frac=1.0' means we are shuffling the entire DataFrame
# 'random_state=271828' ensures reproducibility of the shuffling
df = df.sample(frac=1.0, random_state=271828)

# Define the base sample size
# We set aside 4 * 400 = 1600 rows so that the development and test sets get 800 each
desired_sample_size = 400

# Split the DataFrame into development/test set and training set
# The first 1600 samples (4 * 400) are used for the development and test sets
df_dev_test = df[: desired_sample_size * 4]

# The remaining samples are used for the training set
df_train = df[desired_sample_size * 4 :]

# Print the size of the original dataset
print("Original dataset size: ", len(df))

# Print the size of the training set
print(f"Train set size: {len(df_train)}")

# Print the size of the combined development and test set
print(f"Dev/Test set size: {len(df_dev_test)}")

Original dataset size:  76597
Train set size: 74997
Dev/Test set size: 1600


In [20]:
df_train.to_parquet("./data/train.parquet", index=False)
df_dev_test.to_parquet("./data/dev_test.parquet", index=False)

## Labeled Data

In [18]:
df_dev_test_labeled = pd.read_csv("data/labeled/labeled.csv", low_memory=False)

In [19]:
from sklearn.model_selection import train_test_split

# Split the labeled development/test set into separate development and test sets
# 'test_size=0.5' means we split the data equally into two halves
# 'random_state=271828' ensures reproducibility of the split
# 'stratify=df_dev_test_labeled.label' ensures that the split maintains the same proportion of labels in both sets
df_test, df_dev = train_test_split(
    df_dev_test_labeled,
    test_size=0.5,
    random_state=271828,
    stratify=df_dev_test_labeled.label,
)

# Print the size of the test set
print(f"Test set size: {len(df_test)}")

# Print the size of the development set
print(f"Development set size: {len(df_dev)}")

Test set size: 800
Development set size: 800


In [20]:
df_test.label.value_counts(normalize=True)

label
poder       0.40375
esportes    0.22625
mundo       0.18625
mercado     0.18375
Name: proportion, dtype: float64

In [21]:
df_dev.label.value_counts(normalize=True)

label
poder       0.40250
esportes    0.22625
mundo       0.18625
mercado     0.18500
Name: proportion, dtype: float64

In [22]:
import pandas as pd

def remove_duplicates_and_overlap(df_train: pd.DataFrame, df_dev: pd.DataFrame, df_test: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Remove duplicate rows and ensure there is no index overlap between the
    training, development, and test sets.

    Args:
        df_train (pd.DataFrame): Training DataFrame.
        df_dev (pd.DataFrame): Development DataFrame.
        df_test (pd.DataFrame): Test DataFrame.

    Returns:
        tuple: Tuple containing the cleaned DataFrames (train, dev, test).
    """
    # Show the original size of each dataset
    print("Original shapes:", df_train.shape, df_dev.shape, df_test.shape)

    # Remove duplicates based on 'index' and 'text'
    # Note: these calls modify the caller's DataFrames in place; when an input is a
    # slice of another DataFrame, pandas emits a SettingWithCopyWarning
    df_train.drop_duplicates(subset=["index", "text"], inplace=True)
    df_dev.drop_duplicates(subset=["index", "text"], inplace=True)
    df_test.drop_duplicates(subset=["index", "text"], inplace=True)

    print("After dropping duplicates:", df_train.shape, df_dev.shape, df_test.shape)

    # Build sets of indices to check for overlap
    index_train = set(df_train["index"])
    index_dev = set(df_dev["index"])
    index_test = set(df_test["index"])

    # Drop the overlapping rows
    # Note: the index sets are built before filtering, so a row shared by two sets is removed from both
    df_train = df_train[~df_train["index"].isin(index_dev.union(index_test))]
    df_dev = df_dev[~df_dev["index"].isin(index_train.union(index_test))]
    df_test = df_test[~df_test["index"].isin(index_train.union(index_dev))]

    print("After removing overlaps:", df_train.shape, df_dev.shape, df_test.shape)

    # Reset the DataFrame indices
    df_train.reset_index(drop=True, inplace=True)
    df_dev.reset_index(drop=True, inplace=True)
    df_test.reset_index(drop=True, inplace=True)

    return df_train, df_dev, df_test

df_train, df_dev, df_test = remove_duplicates_and_overlap(df_train, df_dev, df_test)

Original shapes: (74997, 3) (800, 4) (800, 4)
After dropping duplicates: (74997, 3) (800, 4) (800, 4)
After removing overlaps: (74997, 3) (800, 4) (800, 4)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.drop_duplicates(subset=["index", "text"], inplace=True)


In [23]:
# df_dev.to_parquet("data/pipeline/dev.parquet", index=False)
# df_train.to_parquet("data/pipeline/train.parquet", index=False)

df_train.shape, df_test.shape, df_dev.shape

((74997, 3), (800, 4), (800, 4))

In [24]:
df_train.head()

Unnamed: 0,index,title,text
0,114285,Rússia propõe aos EUA conversa entre líderes m...,"o secretario de estado dos eua, john kerry, di..."
1,17806,Áudio de Joesley entregue à Procuradoria tem c...,uma pericia contratada pela folha concluiu que...
2,20652,Fim do foro privilegiado para crime comum avan...,o projeto que acaba com o foro privilegiado pa...
3,53274,Janot pede arquivamento de um dos inquéritos c...,o procurador-geral da republica rodrigo janot ...
4,70638,Governo Temer sonda ex-ministro Pedro Parente ...,o presidente interino michel temer busca um no...


In [25]:
df_dev.head()

Unnamed: 0,index,title,text,label
0,107107,Obama compara republicanos à mal-humorada Grum...,"o presidente dos estados unidos, barack obama,...",poder
1,70209,17 Estados e Distrito Federal têm desemprego s...,mais da metade das 27 unidades federativas bra...,mercado
2,139740,Alckmin defende ação penal antes de pedido de ...,"o governador de sao paulo, geraldo alckmin (ps...",poder
3,10210,Senador colombiano quer que Maduro seja réu em...,"para o senador colombiano ivan duque, a eleica...",mundo
4,116468,Pedido para vestir verde e amarelo no 7 de Set...,o site oficial do pt convocou na quinta-feira ...,poder


## Labeling Functions

In [26]:
# Iterate over a random sample of 30 rows from the training DataFrame 'df_train'
# The 'random_state' parameter ensures reproducibility of the random sampling
for row in df_train.sample(30, random_state=271828).itertuples():
    # Print the text of each sampled row
    print(row.text + "\n")

os precos do petroleo recuam mais de 2% nesta terca-feira (13), apos previsoes pessimistas sobre o crescimento da demanda. a aie (agencia internacional de energia) divulgou que uma forte desaceleracao na demanda mundial, juntamente com o acumulo de estoques e o aumento da oferta, significam que o mercado tera excedente de petroleo durante pelo menos os primeiros seis meses de 2017. a nova projecao contrasta com a ultima estimativa da agencia ha um mes sobre a oferta e a demanda. na ocasiao, a aie informou que o mercado estaria em amplo equilibrio dali em diante e que os estoques iriam cair rapidamente. os ultimos comentarios da aie vem apos uma perspectiva surpreendentemente baixista da opep (organizacao dos paises exportadores de petroleo) na segunda-feira (12), cuja previsao e de um maior excedente de petroleo em 2017. o petroleo brent, negociado em londres, recuava ha pouco 2,03%, a us$ 47,34 o barril. o petroleo wti, dos estados unidos, caia 2,68%, a us$ 45,25. juntamente com os te

In [27]:
import re
from snorkel.labeling import labeling_function

# Label constants
ABSTAIN = -1
PODER = 0
MUNDO = 1
MERCADO = 2
ESPORTES = 3


In [28]:
regex_poder_a = re.compile(r"\b(congresso|senado|camara|deputado|ministro|palacio|planalto|stf|lava jato|caixa dois|dilma|lula|fhc|temer)\b", re.IGNORECASE)

regex_mundo_a = re.compile(r"\b(terremoto|onu|embaixador|toquio|nova york|governo estrangeiro|internacional|paises|obama|trump|hillary|eua)\b", re.IGNORECASE)

regex_mercado_a = re.compile(r"\b(petroleo|mercado|acoes|bolsa|dolar|pib|inflacao|juros|investimento|exporta|economia)\b", re.IGNORECASE)

regex_esportes_a = re.compile(r"\b(futebol|jogo|gol|campeonato|olimpiada|esporte|partida|selecao)\b", re.IGNORECASE)
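
One detail worth checking in the patterns above: because each alternation is wrapped in `\b...\b`, a stem such as `exporta` matches only that exact word, not longer forms like `exportadores` or `exportacao`. The sketch below shows the difference; `pattern_prefix` is a hypothetical variant for illustration, not one of the notebook's patterns.

```python
import re

# Same construction as regex_mercado_a, reduced to two keywords
pattern_exact = re.compile(r"\b(exporta|economia)\b", re.IGNORECASE)

# Hypothetical variant: allow any word characters after the stem
pattern_prefix = re.compile(r"\b(exporta\w*|economia)\b", re.IGNORECASE)

text = "a opep (organizacao dos paises exportadores de petroleo)"

exact_hit = bool(pattern_exact.search(text))
prefix_hit = bool(pattern_prefix.search(text))

print(exact_hit)   # False: "exporta" is followed by a word character
print(prefix_hit)  # True: the stem plus suffix matches "exportadores"
```

Whether the broader match is desirable depends on the keyword; a wider pattern raises coverage but may also raise conflicts with the other labeling functions.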


In [29]:
@labeling_function()
def lf_regex_poder_a(x):
    return PODER if regex_poder_a.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_mundo_a(x):
    return MUNDO if regex_mundo_a.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_mercado_a(x):
    return MERCADO if regex_mercado_a.search(x.text) else ABSTAIN

@labeling_function()
def lf_regex_esportes_a(x):
    return ESPORTES if regex_esportes_a.search(x.text) else ABSTAIN


In [30]:
df_train.iloc[0].text, lf_regex_poder_a(df_train.iloc[0])

('o secretario de estado dos eua, john kerry, disse que o governo barack obama esta analisando uma oferta da russia para os dois paises terem encontros entre dirigentes militares para analisar a situacao na siria. os eua tem-se mostrado preocupados sobre uma possivel presenca militar kerry afirmou que a proposta do encontro foi feita em um telefonema pelo ministro das relacoes exteriores da russia, sergei lavrov, e que a casa branca e o pentagono estao avaliando-a. o secretario de estado disse apoiar a ideia do encontro, ressaltando que os eua querem ter uma imagem clara sobre as intencoes militares da russia no pais. da russia na siria. lavrov propos o encontro entre lideres militares para discutir o que precisamente sera feito para evitar conflitos e riscos potenciais na regiao, disse kerry em entrevista no departamento de defesa ao lado da ministra de relacoes exteriores da africa do sul, maite nkoana-mashabane. "e vital termos essas conversas para evitarmos mal-entendidos e erros d

In [42]:
df_train.iloc[0].text, lf_regex_mundo_a(df_train.iloc[0])

('o secretario de estado dos eua, john kerry, disse que o governo barack obama esta analisando uma oferta da russia para os dois paises terem encontros entre dirigentes militares para analisar a situacao na siria. os eua tem-se mostrado preocupados sobre uma possivel presenca militar kerry afirmou que a proposta do encontro foi feita em um telefonema pelo ministro das relacoes exteriores da russia, sergei lavrov, e que a casa branca e o pentagono estao avaliando-a. o secretario de estado disse apoiar a ideia do encontro, ressaltando que os eua querem ter uma imagem clara sobre as intencoes militares da russia no pais. da russia na siria. lavrov propos o encontro entre lideres militares para discutir o que precisamente sera feito para evitar conflitos e riscos potenciais na regiao, disse kerry em entrevista no departamento de defesa ao lado da ministra de relacoes exteriores da africa do sul, maite nkoana-mashabane. "e vital termos essas conversas para evitarmos mal-entendidos e erros d

In [32]:
lfs = [
    lf_regex_poder_a,
    lf_regex_mundo_a,
    lf_regex_mercado_a,
    lf_regex_esportes_a,
]

# Create an instance of PandasLFApplier with the list of labeling functions (lfs)
# PandasLFApplier applies the labeling functions to a pandas DataFrame
applier = PandasLFApplier(lfs=lfs)

# Apply the labeling functions to the training DataFrame (df_train)
# This generates a label matrix (L_train) where each row corresponds to a data point
# and each column corresponds to the output of a labeling function
L_train = applier.apply(df=df_train)

# It took around 36 seconds to run this cell on my machine.


100%|██████████| 74997/74997 [00:36<00:00, 2063.93it/s]


In [33]:
L_train

array([[ 0,  1, -1, -1],
       [ 0, -1,  2, -1],
       [ 0, -1, -1, -1],
       ...,
       [ 0, -1, -1, -1],
       [-1, -1,  2, -1],
       [-1,  1, -1,  3]], shape=(74997, 4))
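
To build intuition for how the rows of this matrix could be turned into a single label, here is a plain-Python majority vote over five of the rows displayed above. It is only a simple baseline sketch with an assumed tie policy (abstain); it is not Snorkel's `LabelModel`, which also weighs how reliable each labeling function appears to be.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    # Count only the non-abstaining votes
    votes = Counter(label for label in row if label != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    # Assumption: when the top two labels tie, abstain rather than guess
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


rows = [
    [0, 1, -1, -1],
    [0, -1, 2, -1],
    [0, -1, -1, -1],
    [-1, -1, 2, -1],
    [-1, 1, -1, 3],
]
preds = [majority_vote(row) for row in rows]
print(preds)
```

With only one labeling function per class, any row where two functions fire is a tie, which is why conflicts matter so much in the analysis that follows.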

In [34]:
L_train.shape  # The label matrix (L_train) has 74,997 data points (rows) and 4 labeling functions (columns).

(74997, 4)

In [35]:
# Calculate the coverage of each labeling function
# Coverage is the proportion of data points that a labeling function labels (i.e., does not abstain)
# (L_train != ABSTAIN) creates a boolean matrix where True indicates a non-abstain label
# .mean(axis=0) calculates the mean (coverage) for each labeling function across all data points
coverage = (L_train != ABSTAIN).mean(axis=0)

# Print the coverage of each labeling function
# This helps us understand how often each labeling function is providing a label

for i in range(L_train.shape[1]):
    print(f"LF {lfs[i].name} coverage: {coverage[i] * 100:.2f} %")

LF lf_regex_poder_a coverage: 45.66 %
LF lf_regex_mundo_a coverage: 30.62 %
LF lf_regex_mercado_a coverage: 33.56 %
LF lf_regex_esportes_a coverage: 24.53 %


In [36]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_regex_poder_a,0,[0],0.456645,0.252237,0.252237
lf_regex_mundo_a,1,[1],0.306239,0.229289,0.229289
lf_regex_mercado_a,2,[2],0.335627,0.24957,0.24957
lf_regex_esportes_a,3,[3],0.24533,0.103044,0.103044
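
In the summary above the Overlaps and Conflicts columns are identical. That is expected here: each labeling function emits a different class, so whenever two of them label the same row they necessarily disagree. The sketch below recomputes the three statistics on a toy matrix using the usual definitions (coverage: the function labels the row; overlap: at least one other function also labels it; conflict: at least one other function labels it differently). `lf_stats` is an illustrative helper, not a Snorkel function.

```python
ABSTAIN = -1


def lf_stats(label_matrix):
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        covered = overlapped = conflicted = 0
        for row in label_matrix:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Labels emitted by the other labeling functions on this row
            others = [row[k] for k in range(n_lfs) if k != j and row[k] != ABSTAIN]
            if others:
                overlapped += 1
            if any(label != row[j] for label in others):
                conflicted += 1
        stats.append((covered / n_rows, overlapped / n_rows, conflicted / n_rows))
    return stats


toy_matrix = [
    [0, 1, -1, -1],
    [0, -1, 2, -1],
    [0, -1, -1, -1],
    [-1, -1, 2, -1],
    [-1, 1, -1, 3],
]
stats = lf_stats(toy_matrix)
for j, (coverage, overlap, conflict) in enumerate(stats):
    print(f"LF {j}: coverage={coverage:.2f} overlap={overlap:.2f} conflict={conflict:.2f}")
```

Adding a second labeling function for the same class would let overlaps exceed conflicts, since agreeing votes count as overlap only.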


In [37]:
LFAnalysis(L=L_train, lfs=lfs).label_coverage()

np.float64(0.8825819699454645)

In [38]:
LFAnalysis(L=L_train, lfs=lfs).label_conflict()

np.float64(0.37288158192994386)

In [39]:
L_dev = applier.apply(df=df_dev)

100%|██████████| 800/800 [00:00<00:00, 1633.68it/s]


In [40]:
# Perform labeling function analysis using LFAnalysis
# LFAnalysis provides metrics such as coverage, accuracy, and conflict for each labeling function
# L_dev is the label matrix for the development set
# lfs is the list of labeling functions
# df_dev.label.values are the true labels for the development set

df_dev["label"] = df_dev.label.map(
    {
        "poder": PODER,
        "mundo": MUNDO,
        "mercado": MERCADO,
        "esportes": ESPORTES,
    }
)

lf_summary = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=df_dev.label.values)

# Display the summary of the labeling functions' performance
# The summary includes metrics like coverage, accuracy, and conflict
lf_summary

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_regex_poder_a,0,[0],0.455,0.24375,0.24375,276,88,0.758242
lf_regex_mundo_a,1,[1],0.34375,0.26125,0.26125,96,179,0.349091
lf_regex_mercado_a,2,[2],0.33,0.23,0.23,124,140,0.469697
lf_regex_esportes_a,3,[3],0.23875,0.11625,0.11625,148,43,0.774869
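
The `Emp. Acc.` column can be reproduced from the `Correct` and `Incorrect` counts: it is the share of correct labels among the rows where the function did not abstain. A short check using the numbers from the table above (the dictionary is just a transcription of that table):

```python
# (correct, incorrect) counts copied from the lf_summary table above
counts = {
    "lf_regex_poder_a": (276, 88),
    "lf_regex_mundo_a": (96, 179),
    "lf_regex_mercado_a": (124, 140),
    "lf_regex_esportes_a": (148, 43),
}
n_dev = 800  # size of the development set

emp_acc = {}
dev_coverage = {}
for name, (correct, incorrect) in counts.items():
    labeled = correct + incorrect          # rows where the LF did not abstain
    emp_acc[name] = correct / labeled      # accuracy on labeled rows only
    dev_coverage[name] = labeled / n_dev   # matches the Coverage column
    print(f"{name}: acc={emp_acc[name]:.6f} coverage={dev_coverage[name]:.5f}")
```

Read this way, `lf_regex_mundo_a` and `lf_regex_mercado_a` are wrong more often than right on the rows they label, which makes them the natural candidates for refinement.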


In [47]:
import utils.plot

In [49]:
# Compute the coverage and pairwise overlap of the labeling functions on the training label matrix
results = utils.plot.plot_coverage_overlap(
    labeling_functions=lfs,
    label_matrix=L_train,
    colorscale="Reds",
    show_values=True,
    sort_by_coverage=True,
)

In [50]:
results["coverage"]

Unnamed: 0,Labeling Function,Coverage
0,lf_regex_poder_a,0.456645
1,lf_regex_mercado_a,0.335627
2,lf_regex_mundo_a,0.306239
3,lf_regex_esportes_a,0.24533


In [51]:
results["overlap_matrix"]

Unnamed: 0,lf_regex_poder_a,lf_regex_mercado_a,lf_regex_mundo_a,lf_regex_esportes_a
lf_regex_poder_a,0.456645,0.169967,0.122498,0.039468
lf_regex_mercado_a,0.169967,0.335627,0.124045,0.034975
lf_regex_mundo_a,0.122498,0.124045,0.306239,0.063416
lf_regex_esportes_a,0.039468,0.034975,0.063416,0.24533


In [53]:
import utils.lf

# Assuming L_train is the label matrix for the training set and lfs is the list of labeling functions
final_conflict_matrix = utils.lf.compute_conflict_matrices(label_matrix=L_train, labeling_functions=lfs)

# Display the final conflict matrix DataFrame
final_conflict_matrix

Unnamed: 0,conflict,normalized_conflict
lf_regex_poder_a,0.252237,1.0
lf_regex_mercado_a,0.24957,1.0
lf_regex_mundo_a,0.229289,1.0
lf_regex_esportes_a,0.103044,1.0


In [55]:
# Compute the conflict matrix for the labeling functions and labeled data
# lfs is the list of labeling functions
# L_train is the label matrix for the training set
conflict_matrix_df = utils.lf.compute_pairwise_conflict_matrix(labeling_functions=lfs, label_matrix=L_train)

# Display the conflict matrix DataFrame
conflict_matrix_df

Unnamed: 0,lf_regex_poder_a,lf_regex_mundo_a,lf_regex_mercado_a,lf_regex_esportes_a
lf_regex_poder_a,0.0,0.122498,0.169967,0.039468
lf_regex_mundo_a,0.122498,0.0,0.124045,0.063416
lf_regex_mercado_a,0.169967,0.124045,0.0,0.034975
lf_regex_esportes_a,0.039468,0.063416,0.034975,0.0


In [59]:
# Visualize the conflict matrix
utils.plot.plot_conflict_matrix(conflict_matrix_df)[0]

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [61]:
# Visualize the conflict matrix
utils.plot.plot_conflict_matrix(conflict_matrix_df)[1]

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed


## Training Our Own Models Using the Development Set

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TF-IDF vectorizer on the training data. Remember: our training data has no labels, but we can still use it to learn the vocabulary.

tfidf_vec_ssl = TfidfVectorizer(
    ngram_range=(1, 2),
    strip_accents="unicode",
    lowercase=True,
    max_features=500,
    min_df=3,
)

tfidf_vec_ssl.fit(df_train.text)

# Transform the text data in the development set into TF-IDF features
# This ensures that the vectorizer uses the vocabulary learned from the training data.
X_dev = tfidf_vec_ssl.transform(df_dev.text)

# Convert the sparse matrix to a dense array
X_dev = X_dev.toarray()

# Extract the labels from the development set
y_dev = df_dev.label.values

# Print the shape of the TF-IDF feature matrix
X_dev.shape

(800, 500)
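
The key idea in the cell above is that the vocabulary and IDF weights are learned from the unlabeled training texts and then reused, unchanged, on the development set. The plain-Python sketch below illustrates that fit/transform split with unigrams only, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization. It deliberately omits bigrams, `max_features`, and `min_df`, and `fit_idf` and `transform` are illustrative names rather than scikit-learn functions.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words with two or more characters


def fit_idf(docs):
    """Learn smoothed IDF weights from a collection of (unlabeled) documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(TOKEN.findall(doc.lower())))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map a document to L2-normalized TF-IDF weights; unseen terms are ignored."""
    counts = Counter(t for t in TOKEN.findall(doc.lower()) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


train_docs = ["o senado votou o projeto", "o dolar subiu hoje", "o senado e o dolar"]
idf = fit_idf(train_docs)
dev_vec = transform("senado aprova projeto sobre futebol", idf)
print(sorted(dev_vec))
```

Terms that never appeared in the training texts (here `aprova`, `sobre`, `futebol`) simply drop out of the development vector, which is the behaviour the comment in the cell above relies on.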

In [65]:
import utils.classification

In [67]:
# Call the function to train and evaluate multiple classification models
# X_dev is the feature matrix for the development set
# y_dev is the label vector for the development set
# The function returns three objects: a DataFrame with the performance metrics of each model,
# a list of classification reports, and the calibration plots
df_results, classification_reports, calibration_plots = utils.classification.train_and_evaluate_classification_models(X_dev, y_dev)

# df_results is a DataFrame containing the performance metrics for each model
# It includes columns for the model name, F1 score, balanced accuracy, accuracy, Matthews correlation coefficient, elapsed time, confusion matrix, and classification report
# This DataFrame can be used to compare the performance of different models

Model: Calibrated-LSVC - F1: 0.8538 - Balanced Accuracy: 0.8526 - Accuracy: 0.8538 - Matthews Correlation Coefficient: 0.7962 - Elapsed time: 3.90s
              precision    recall  f1-score   support

           0       0.84      0.85      0.85       322
           1       0.78      0.81      0.80       149
           2       0.87      0.84      0.86       148
           3       0.93      0.91      0.92       181

    accuracy                           0.85       800
   macro avg       0.85      0.85      0.85       800
weighted avg       0.85      0.85      0.85       800

[[273  26  16   7]
 [ 22 121   3   3]
 [ 17   3 125   3]
 [ 12   5   0 164]]
******************** 

Model: Logistic Regression - F1: 0.8538 - Balanced Accuracy: 0.8726 - Accuracy: 0.8538 - Matthews Correlation Coefficient: 0.8036 - Elapsed time: 8.77s
              precision    recall  f1-score   support

           0       0.92      0.78      0.84       322
           1       0.74      0.91      0.82       149
  

In [71]:
# Display the results DataFrame
df_results

Unnamed: 0,Model,F1,Balanced Accuracy,Accuracy,Matthews Correlation Coefficient,Elapsed Time,Confusion Matrix,Classification Report
0,Calibrated-LSVC,0.85375,0.852645,0.85375,0.796242,3.901134,[[273 26 16 7]\n [ 22 121 3 3]\n [ 17 ...,precision recall f1-score ...
1,Logistic Regression,0.85375,0.872567,0.85375,0.803565,8.771666,[[252 35 31 4]\n [ 9 136 2 2]\n [ 6 ...,precision recall f1-score ...
2,Random Forest,0.85375,0.843243,0.85375,0.795661,4.071562,[[283 20 12 7]\n [ 21 118 4 6]\n [ 21 ...,precision recall f1-score ...
3,XGBoost,0.83125,0.822615,0.83125,0.764299,27.896212,[[272 20 20 10]\n [ 25 114 4 6]\n [ 24 ...,precision recall f1-score ...
4,SGD,0.86,0.853811,0.86,0.804773,0.547556,[[281 23 14 4]\n [ 20 119 7 3]\n [ 15 ...,precision recall f1-score ...
5,Naive Bayes,0.8375,0.811469,0.8375,0.773386,1.277412,[[295 15 12 0]\n [ 46 99 1 3]\n [ 27 ...,precision recall f1-score ...
6,K-Nearest Neighbors,0.7925,0.771069,0.7925,0.709113,0.40211,[[279 21 18 4]\n [ 42 103 3 1]\n [ 33 ...,precision recall f1-score ...
7,Decision Tree,0.6725,0.649079,0.6725,0.542214,1.396502,[[237 34 34 17]\n [ 33 77 28 11]\n [ 42 ...,precision recall f1-score ...
8,Extra Trees,0.87125,0.864439,0.87125,0.820245,3.888063,[[284 19 13 6]\n [ 24 118 4 3]\n [ 14 ...,precision recall f1-score ...


In [72]:
df_results.sort_values(by="Matthews Correlation Coefficient", ascending=False)

Unnamed: 0,Model,F1,Balanced Accuracy,Accuracy,Matthews Correlation Coefficient,Elapsed Time,Confusion Matrix,Classification Report
8,Extra Trees,0.87125,0.864439,0.87125,0.820245,3.888063,[[284 19 13 6]\n [ 24 118 4 3]\n [ 14 ...,precision recall f1-score ...
4,SGD,0.86,0.853811,0.86,0.804773,0.547556,[[281 23 14 4]\n [ 20 119 7 3]\n [ 15 ...,precision recall f1-score ...
1,Logistic Regression,0.85375,0.872567,0.85375,0.803565,8.771666,[[252 35 31 4]\n [ 9 136 2 2]\n [ 6 ...,precision recall f1-score ...
0,Calibrated-LSVC,0.85375,0.852645,0.85375,0.796242,3.901134,[[273 26 16 7]\n [ 22 121 3 3]\n [ 17 ...,precision recall f1-score ...
2,Random Forest,0.85375,0.843243,0.85375,0.795661,4.071562,[[283 20 12 7]\n [ 21 118 4 6]\n [ 21 ...,precision recall f1-score ...
5,Naive Bayes,0.8375,0.811469,0.8375,0.773386,1.277412,[[295 15 12 0]\n [ 46 99 1 3]\n [ 27 ...,precision recall f1-score ...
3,XGBoost,0.83125,0.822615,0.83125,0.764299,27.896212,[[272 20 20 10]\n [ 25 114 4 6]\n [ 24 ...,precision recall f1-score ...
6,K-Nearest Neighbors,0.7925,0.771069,0.7925,0.709113,0.40211,[[279 21 18 4]\n [ 42 103 3 1]\n [ 33 ...,precision recall f1-score ...
7,Decision Tree,0.6725,0.649079,0.6725,0.542214,1.396502,[[237 34 34 17]\n [ 33 77 28 11]\n [ 42 ...,precision recall f1-score ...
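
The ranking above is sorted by Matthews correlation coefficient (MCC). As a sanity check, the multiclass MCC and the balanced accuracy can be recomputed from a confusion matrix alone. The sketch below uses the Calibrated-LSVC matrix printed in the training output; the function names are illustrative and are not part of `utils.classification`.

```python
from math import sqrt

# Confusion matrix for Calibrated-LSVC, copied from the training output
# (rows = true class, columns = predicted class).
cm = [
    [273, 26, 16, 7],
    [22, 121, 3, 3],
    [17, 3, 125, 3],
    [12, 5, 0, 164],
]


def multiclass_mcc(cm):
    """Multiclass MCC computed from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]
    pred_counts = [sum(cm[i][k] for i in range(n)) for k in range(n)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


def balanced_accuracy(cm):
    """Mean of the per-class recall values."""
    recalls = [cm[k][k] / sum(cm[k]) for k in range(len(cm))]
    return sum(recalls) / len(recalls)


mcc = multiclass_mcc(cm)
bal_acc = balanced_accuracy(cm)
print(f"MCC: {mcc:.4f}  Balanced accuracy: {bal_acc:.4f}")
```

Both values agree with the Calibrated-LSVC row of `df_results` (0.796242 and 0.852645), which suggests the table's metrics follow these standard definitions.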


In [73]:
calibration_plots

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [74]:
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np


# Initialize a SelfTrainingClassifier with an ExtraTreesClassifier as the base estimator
model_ssl = SelfTrainingClassifier(ExtraTreesClassifier(random_state=271828, n_jobs=-1, class_weight="balanced"))

# The training examples have no gold labels, so they are marked with -1, which SelfTrainingClassifier treats as unlabeled
X_unlabeled = tfidf_vec_ssl.transform(df_train.text).toarray()
y_unlabeled = np.array([-1] * len(df_train))

X_labeled = tfidf_vec_ssl.transform(df_dev.text).toarray()
y_labeled = df_dev.label.values

# Fit the SelfTrainingClassifier on the labeled and unlabeled data
model_ssl.fit(X=np.vstack([X_labeled, X_unlabeled]), y=np.concatenate([y_labeled, y_unlabeled]))

In [75]:
from snorkel.preprocess import preprocessor


# Define a preprocessor function that adds a predicted label and score to a given example using the self-training classifier
# The @preprocessor decorator indicates that this function is a Snorkel preprocessor
# memoize=True caches the results to avoid redundant computations
@preprocessor(memoize=True)
def semi_superv_classifier(x):
    """
    Preprocess an example by adding the label and score predicted by a classifier trained with semi-supervised learning.

    Args:
        x: Object with a 'text' attribute containing the text.

    Returns:
        x: Object with the attributes 'label_pred_ssl' (predicted label) and 'score_ssl' (associated probability).
    """
    # Vectorize the text
    vectorized_text = tfidf_vec_ssl.transform([x.text])

    # Get the probability of each class
    pred_proba = model_ssl.predict_proba(vectorized_text)

    # Index of the class with the highest probability
    pred_idx = np.argmax(pred_proba, axis=1)[0]

    # Assign the class value directly; the column index equals the label because the classes are 0 to 3
    x.label_pred_ssl = pred_idx  # Will be 0, 1, 2 or 3

    # Score (probability) of the predicted class
    x.score_ssl = pred_proba[0][pred_idx]

    return x

In [76]:
@labeling_function(pre=[semi_superv_classifier])
def lf_ssl(x):
    """
    Labeling function that returns the label predicted by the semi-supervised classifier,
    provided the prediction score is high enough.

    Args:
        x: Object with 'score_ssl' and 'label_pred_ssl', set by the preprocessor.

    Returns:
        int: Predicted label (0 to 3), or ABSTAIN if the score is low.
    """
    if x.score_ssl >= 0.95:
        return x.label_pred_ssl  # Will return 0, 1, 2 or 3
    else:
        return ABSTAIN
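
The thresholding rule in `lf_ssl` can be looked at in isolation. The sketch below is a minimal stand-alone version that works on a plain list of class probabilities instead of a Snorkel data point; `label_if_confident` and the two probability vectors are hypothetical names and values chosen for illustration.

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"


def label_if_confident(probs, threshold=0.95):
    """Return the arg-max class when its probability clears the threshold, else ABSTAIN."""
    best_class = max(range(len(probs)), key=lambda k: probs[k])
    return best_class if probs[best_class] >= threshold else ABSTAIN


# Hypothetical probability vectors over the four classes
confident = [0.01, 0.97, 0.01, 0.01]
uncertain = [0.60, 0.30, 0.05, 0.05]

print(label_if_confident(confident))  # votes for class 1
print(label_if_confident(uncertain))  # abstains
```

Lowering the threshold trades precision for coverage: more examples receive a vote, but the votes come from less certain predictions.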

In [78]:
# Redeclare the list of labeling functions to include the new one
new_lfs = lfs + [lf_ssl]

new_lfs

[LabelingFunction lf_regex_poder_a, Preprocessors: [],
 LabelingFunction lf_regex_mundo_a, Preprocessors: [],
 LabelingFunction lf_regex_mercado_a, Preprocessors: [],
 LabelingFunction lf_regex_esportes_a, Preprocessors: [],
 LabelingFunction lf_ssl, Preprocessors: [LambdaMapper semi_superv_classifier, Pre: []]]

In [79]:
applier = PandasLFApplier(lfs=new_lfs)
L_train = applier.apply(df=df_train)
# This will take around 60 minutes to run on 28 cores 

100%|██████████| 74997/74997 [55:11<00:00, 22.65it/s]  


In [80]:
LFAnalysis(L_train, new_lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
lf_regex_poder_a,0,[0],0.456645,0.455752,0.25401
lf_regex_mundo_a,1,[1],0.306239,0.302279,0.301959
lf_regex_mercado_a,2,[2],0.335627,0.321506,0.301825
lf_regex_esportes_a,3,[3],0.24533,0.233596,0.111884
lf_ssl,4,"[0, 1, 2, 3]",0.92705,0.8221,0.478646


In [81]:
LFAnalysis(L=L_train, lfs=new_lfs).label_coverage()
# We've boosted our coverage to over 98%.

np.float64(0.9875328346467191)
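
To make the coverage figure concrete, here is a small sketch of how label coverage and the conflict rate can be derived from a label matrix. It uses plain Python lists and a hypothetical four-row matrix (`L_toy`); it is meant to illustrate the definitions, not to reproduce Snorkel's implementation.

```python
ABSTAIN = -1

# Hypothetical label matrix: one row per example, one column per labeling function
L_toy = [
    [0, -1, -1],
    [-1, -1, -1],  # every labeling function abstained
    [0, 1, -1],    # two labeling functions disagree
    [-1, -1, 3],
]


def label_coverage(L):
    """Fraction of rows that received at least one non-abstain vote."""
    covered = sum(1 for row in L if any(vote != ABSTAIN for vote in row))
    return covered / len(L)


def conflict_rate(L):
    """Fraction of rows whose non-abstain votes disagree with each other."""
    conflicts = sum(1 for row in L if len({v for v in row if v != ABSTAIN}) > 1)
    return conflicts / len(L)


print(label_coverage(L_toy))
print(conflict_rate(L_toy))
```

Rows where every labeling function abstains are the ones that carry no supervision signal at all; with a coverage of about 98.75%, that is roughly 1.25% of the training set.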

In [82]:
L_dev = applier.apply(df=df_dev)

100%|██████████| 800/800 [00:35<00:00, 22.38it/s]


In [None]:
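# Caveat: the model behind lf_ssl was fit on df_dev's labels, so its coverage and
# empirical accuracy on this same set (1.0 below) are optimistic, not an unbiased estimate
LFAnalysis(L=L_dev, lfs=new_lfs).lf_summary(Y=df_dev.label.values)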

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
lf_regex_poder_a,0,[0],0.455,0.455,0.25875,276,88,0.758242
lf_regex_mundo_a,1,[1],0.34375,0.34375,0.28375,96,179,0.349091
lf_regex_mercado_a,2,[2],0.33,0.33,0.25125,124,140,0.469697
lf_regex_esportes_a,3,[3],0.23875,0.23875,0.12375,148,43,0.774869
lf_ssl,4,"[0, 1, 2, 3]",1.0,0.895,0.445,800,0,1.0
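
The `Coverage` and `Emp. Acc.` columns follow directly from the `Correct` and `Incorrect` counts. The sketch below recomputes them for the four regex labeling functions, using the counts from the table and the development-set size of 800; the dictionary layout is just one convenient way to hold the numbers.

```python
n_dev = 800  # size of the development set

# (correct, incorrect) counts copied from the summary table
counts = {
    "lf_regex_poder_a": (276, 88),
    "lf_regex_mundo_a": (96, 179),
    "lf_regex_mercado_a": (124, 140),
    "lf_regex_esportes_a": (148, 43),
}

summary = {}
for name, (correct, incorrect) in counts.items():
    voted = correct + incorrect  # examples where the function did not abstain
    summary[name] = {
        "coverage": voted / n_dev,
        "emp_acc": correct / voted,
    }
    print(f"{name}: coverage={voted / n_dev:.5f}, emp_acc={correct / voted:.6f}")
```

Read this way, `lf_regex_mundo_a` and `lf_regex_mercado_a` are wrong more often than right when they vote, which makes them the first candidates for refinement.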


In [94]:
# Saving the L_train and L_dev matrices for future use
import numpy as np

df_L_train = df_train.copy()
df_L_dev = df_dev.copy()

df_L_train = pd.concat([df_L_train, pd.DataFrame(L_train, columns=[lf.name for lf in new_lfs])], axis=1)
df_L_dev = pd.concat([df_L_dev, pd.DataFrame(L_dev, columns=[lf.name for lf in new_lfs])], axis=1)

df_L_train.to_parquet("data/labeled/df_L_train.parquet", index=False)
df_L_dev.to_parquet("data/labeled/df_L_dev.parquet", index=False)

df_L_train.head()

Unnamed: 0,index,title,text,lf_regex_poder_a,lf_regex_mundo_a,lf_regex_mercado_a,lf_regex_esportes_a,lf_ssl
0,114285,Rússia propõe aos EUA conversa entre líderes m...,"o secretario de estado dos eua, john kerry, di...",0,1,-1,-1,0
1,17806,Áudio de Joesley entregue à Procuradoria tem c...,uma pericia contratada pela folha concluiu que...,0,-1,2,-1,0
2,20652,Fim do foro privilegiado para crime comum avan...,o projeto que acaba com o foro privilegiado pa...,0,-1,-1,-1,0
3,53274,Janot pede arquivamento de um dos inquéritos c...,o procurador-geral da republica rodrigo janot ...,0,-1,-1,-1,0
4,70638,Governo Temer sonda ex-ministro Pedro Parente ...,o presidente interino michel temer busca um no...,0,-1,2,-1,0


In [None]:
# Import the LabelModel class from Snorkel's labeling model module
from snorkel.labeling.model import LabelModel

# Calculate the class balance in the development set
# np.bincount(y_dev) counts the number of occurrences of each class in y_dev
# Dividing by len(y_dev) normalizes the counts to get the proportion of each class
class_balance = np.bincount(y_dev) / len(y_dev)

# Initialize a LabelModel with a cardinality of 4 (multi-class classification) and verbose output
# cardinality=4 indicates that there are four classes (e.g., poder, mundo, mercado, and esportes)
# verbose=True enables detailed logging during the training process
label_model = LabelModel(cardinality=4, verbose=True)

# Fit the LabelModel using the label matrix L_train
# L_train is a numpy array where each column corresponds to the output of a labeling function
# n_epochs=500 specifies the number of training epochs
# log_freq=50 specifies the frequency (in epochs) of logging the training progress
# seed=271828 ensures reproducibility of the results
# class_balance=class_balance provides the class balance information to the model. If not provided, it will assume uniform class distribution (i.e., equal class weights)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=50, seed=271828, class_balance=class_balance)

INFO:root:Computing O...


INFO:root:Estimating \mu...
  0%|          | 0/500 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.591]
  6%|▌         | 30/500 [00:02<00:23, 20.18epoch/s]INFO:root:[50 epochs]: TRAIN:[loss=0.012]
 15%|█▌        | 75/500 [00:02<00:06, 62.90epoch/s]INFO:root:[100 epochs]: TRAIN:[loss=0.005]
 22%|██▏       | 108/500 [00:02<00:03, 101.96epoch/s]INFO:root:[150 epochs]: TRAIN:[loss=0.003]
 32%|███▏      | 158/500 [00:02<00:02, 170.96epoch/s]INFO:root:[200 epochs]: TRAIN:[loss=0.002]
 40%|████      | 202/500 [00:03<00:01, 223.71epoch/s]INFO:root:[250 epochs]: TRAIN:[loss=0.001]
 51%|█████     | 253/500 [00:03<00:00, 286.27epoch/s]INFO:root:[300 epochs]: TRAIN:[loss=0.001]
 69%|██████▉   | 347/500 [00:03<00:00, 302.54epoch/s]INFO:root:[350 epochs]: TRAIN:[loss=0.001]
 80%|███████▉  | 399/500 [00:03<00:00, 352.32epoch/s]INFO:root:[400 epochs]: TRAIN:[loss=0.001]
INFO:root:[450 epochs]: TRAIN:[loss=0.000]
100%|██████████| 500/500 [00:03<00:00, 134.51epoch/s]
INFO:root:Finished Training
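
The `class_balance` argument is the prior over classes that the label model is given. As an illustration, the sketch below rebuilds it without NumPy from the per-class support reported for the development set (322, 149, 148, 181); `y_dev_toy` and `class_balance_toy` are stand-in names, not variables from the notebook.

```python
from collections import Counter

# Dev-set class counts taken from the "support" column of the classification reports
support = {0: 322, 1: 149, 2: 148, 3: 181}

# Rebuild a label list so the computation mirrors np.bincount(y_dev) / len(y_dev)
y_dev_toy = [label for label, n in support.items() for _ in range(n)]

counts = Counter(y_dev_toy)
class_balance_toy = [counts[k] / len(y_dev_toy) for k in sorted(counts)]
print(class_balance_toy)
```

Estimating the prior from 800 labeled examples assumes the development set has the same class mix as the training set, which is worth keeping in mind when reading the label model's output.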


In [98]:
from sklearn.metrics import classification_report
import numpy as np

# Use the label model trained by Snorkel to predict the labels of the development set
# L_dev is the label matrix (outputs of the labeling functions)
snorkel_label_model_pred = label_model.predict(L=L_dev)

# Print the classification report using the true and predicted labels
print(f"Classification report for label model:\n{classification_report(df_dev.label.values, snorkel_label_model_pred)}")

# Count how many examples were abstained on (-1)
print(f"Number of abstains: {np.count_nonzero(snorkel_label_model_pred == -1)}")

Classification report for label model:
              precision    recall  f1-score   support

           0       0.86      0.91      0.88       322
           1       0.79      0.69      0.74       149
           2       0.96      0.98      0.97       148
           3       0.92      0.91      0.91       181

    accuracy                           0.88       800
   macro avg       0.88      0.87      0.88       800
weighted avg       0.88      0.88      0.88       800

Number of abstains: 0


In [99]:
# Iterate over the labeling functions and their corresponding weights in the label model
# get_weights() returns one value per column of L_train: the label model's estimated accuracy for that labeling function
# This helps identify labeling functions that are particularly informative or noisy
# Note: zip() stops at the shorter sequence, so iterating over lfs (four functions) leaves out lf_ssl;
# iterate over new_lfs instead to list all five
for name, weight in zip([lf.name for lf in lfs], label_model.get_weights()):
    # Print the name of the labeling function and its weight as a percentage
    print(f"{name}: {weight * 100:.2f}%")

lf_regex_poder_a: 70.54%
lf_regex_mundo_a: 51.20%
lf_regex_mercado_a: 46.81%
lf_regex_esportes_a: 72.82%


In [101]:
# Use the label model to predict label probabilities for the training set
probs_train_snorkel = label_model.predict_proba(L=L_train)

labels_train_snorkel = label_model.predict(L=L_train)

In [105]:
pd.Series(labels_train_snorkel).value_counts(normalize=True)

0    0.657946
3    0.211075
2    0.067216
1    0.063763
Name: proportion, dtype: float64

In [106]:
df_dev["label"].value_counts(normalize=True)

label
0    0.40250
3    0.22625
1    0.18625
2    0.18500
Name: proportion, dtype: float64
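
The two distributions above differ noticeably. One simple way to quantify the gap, assuming the training and development sets share the same underlying class mix, is the per-class difference and the total variation distance. The sketch below uses the proportions printed by the two `value_counts` calls.

```python
# Proportions copied from the two value_counts outputs, indexed by class 0..3
label_model_train = [0.657946, 0.063763, 0.067216, 0.211075]
dev_gold = [0.40250, 0.18625, 0.18500, 0.22625]

# Per-class gap: positive means the label model predicts that class more often
# than it occurs among the gold dev labels
gaps = [round(p - q, 6) for p, q in zip(label_model_train, dev_gold)]

# Total variation distance: half the sum of absolute gaps (0 = identical, 1 = disjoint)
tv_distance = 0.5 * sum(abs(p - q) for p, q in zip(label_model_train, dev_gold))

print(gaps)
print(f"Total variation distance: {tv_distance:.4f}")
```

Class 0 (`poder`) is over-represented by about 25 percentage points, almost entirely at the expense of classes 1 and 2, which is a warning sign for any model trained on these labels.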

In [107]:
probs_train_snorkel

array([[8.11146311e-01, 1.27048184e-01, 5.75946286e-02, 4.21087651e-03],
       [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07],
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       ...,
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       [5.87411861e-01, 1.41600571e-01, 2.70985627e-01, 1.93984621e-06],
       [1.66672433e-05, 7.28881235e-01, 9.92862754e-02, 1.71815823e-01]],
      shape=(74997, 4))

In [108]:
from snorkel.labeling import filter_unlabeled_dataframe

# Filter out unlabeled examples from the training dataframe using Snorkel's filter_unlabeled_dataframe function
# X=df_train: The input dataframe containing the training examples
# y=probs_train_snorkel: The label probabilities for the training examples
# L=L_train: The label matrix where each element indicates the label assigned by a labeling function
df_train_weakly_labeled_snorkel, probs_train_weakly_labeled_snorkel = filter_unlabeled_dataframe(X=df_train, y=probs_train_snorkel, L=L_train)

# df_train_weakly_labeled_snorkel now contains only the examples with at least one non-abstain label
# probs_train_weakly_labeled_snorkel contains the label probabilities for the filtered examples

In [109]:
# Check if the number of weakly labeled data points is equal to the original number of data points
# This helps in understanding how many data points were filtered out due to all labeling functions abstaining
if len(df_train_weakly_labeled_snorkel) == len(df_train):
    # If the lengths are equal, it means all data points are weakly labeled
    print("All data points are weakly labeled!")
else:
    # If the lengths are not equal, print the number of data points after filtering
    # Also, print the number of data points that were removed
    num_removed = len(df_train) - len(df_train_weakly_labeled_snorkel)
    print(f"Number of data points before filtering: {len(df_train):,}")
    print(f"Number of data points after filtering: {len(df_train_weakly_labeled_snorkel):,}")
    print(f"Number of data points removed: {num_removed:,} ({num_removed / len(df_train) * 100:.2f}%)")

Number of data points before filtering: 74,997
Number of data points after filtering: 74,062
Number of data points removed: 935 (1.25%)


In [111]:
# Save for further use during next classes

df_train_weakly_labeled = df_train_weakly_labeled_snorkel.copy()
df_train_weakly_labeled["label_snorkel"] = probs_train_weakly_labeled_snorkel.argmax(axis=1)
df_train_weakly_labeled.to_parquet("data/labeled/df_train_weakly_labeled.parquet", index=False)

In [112]:
df_train_weakly_labeled["label_snorkel"].value_counts(normalize=True)

label_snorkel
0    0.653628
3    0.213740
2    0.068065
1    0.064568
Name: proportion, dtype: float64

In [113]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer with specific parameters
# ngram_range=(1, 2): Consider both unigrams and bigrams
# strip_accents='unicode': Remove accents from characters
# lowercase=True: Convert all characters to lowercase
# max_features=3000: Use only the top 3000 features based on term frequency
# max_df=0.85: Ignore terms that appear in more than 85% of the documents
# min_df=3: Ignore terms that appear in fewer than 3 documents
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    strip_accents="unicode",
    lowercase=True,
    max_features=3000,
    max_df=0.85,
    min_df=3,
)

# Fit the vectorizer on the training text data
# This learns the vocabulary and idf (inverse document frequency) from the training data
vectorizer.fit(df_train.text.values)

# Transform the training text data into TF-IDF feature vectors
# The result is a sparse matrix which is then converted to a dense array
X_train = vectorizer.transform(df_train.text.values).toarray()

# Transform the development text data into TF-IDF feature vectors
# The same vocabulary and idf learned from the training data are used
X_dev = vectorizer.transform(df_dev.text.values).toarray()

# Extract the true labels for the development set
# These labels will be used for evaluating the model's performance
y_dev_true = df_dev.label.values
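
To illustrate what "the same vocabulary and idf learned from the training data" means, here is a tiny pure-Python TF-IDF sketch. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, and it skips everything else the real vectorizer does (accent stripping, n-grams, `max_df`/`min_df` filtering). The three documents are hypothetical.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical "training" corpus, already lowercased and accent-free
docs = ["governo anuncia reforma", "time vence jogo", "governo vence votacao"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# "Fit": document frequency and smoothed IDF for every term seen in training
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}


def tfidf_vector(tokens):
    """'Transform': weight known terms by tf * idf, then L2-normalize."""
    tf = Counter(tokens)
    weights = {t: tf[t] * idf[t] for t in tf if t in idf}  # unseen terms are dropped
    norm = sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


vec = tfidf_vector(tokenized[0])
print(vec)
print(tfidf_vector(["inedito"]))  # a term never seen during fitting
```

Terms that appear in many training documents (`governo`) get a lower weight than rarer ones (`anuncia`), and a development-set document made only of unseen terms maps to an all-zero vector.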

In [114]:
# Remember: our labels are an array containing the probability of belonging to each class
probs_train_snorkel

array([[8.11146311e-01, 1.27048184e-01, 5.75946286e-02, 4.21087651e-03],
       [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07],
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       ...,
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       [5.87411861e-01, 1.41600571e-01, 2.70985627e-01, 1.93984621e-06],
       [1.66672433e-05, 7.28881235e-01, 9.92862754e-02, 1.71815823e-01]],
      shape=(74997, 4))

In [115]:
# Assign the predicted label probabilities to y_train_prob_labels
# probs_train_snorkel contains the label probabilities predicted by the label model for the training set
# The unfiltered array is used so that its rows line up with X_train, which was built from all of df_train
# y_train_prob_labels will be used as the target variable for training a downstream model
y_train_prob_labels = probs_train_snorkel

# Print the first 5 rows of y_train_prob_labels
# This helps in inspecting the predicted label probabilities and understanding their structure
y_train_prob_labels[:5]

array([[8.11146311e-01, 1.27048184e-01, 5.75946286e-02, 4.21087651e-03],
       [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07],
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07]])
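
Training on probabilistic labels usually means minimizing a cross-entropy against the full target distribution rather than against a single class. The notebook does not show how `utils.classification.train_neural_network` handles these targets, so the sketch below is only an assumption about the general idea, written in plain Python with a hypothetical network output.

```python
from math import log


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(q * log(max(p, eps)) for q, p in zip(target_probs, predicted_probs))


# First row of probs_train_snorkel, rounded for readability
soft_target = [0.811, 0.127, 0.058, 0.004]
# The same example collapsed to its arg-max class
hard_target = [1.0, 0.0, 0.0, 0.0]

# Hypothetical network output that is very confident in class 0
prediction = [0.97, 0.01, 0.01, 0.01]

soft_loss = cross_entropy(soft_target, prediction)
hard_loss = cross_entropy(hard_target, prediction)
print(f"soft-label loss: {soft_loss:.4f}, hard-label loss: {hard_loss:.4f}")
```

With the soft target, an overconfident prediction is penalized because the label model itself was only about 81% sure; with the hard target, the same prediction looks nearly perfect.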

In [116]:
import utils.classification

model_nn = utils.classification.train_neural_network(X_train, y_train_prob_labels, X_dev, y_dev_true)

Epoch 1/20, Training Loss: 0.2648 / Validation Loss: 0.7651 / Validation MCC: 0.5001
Epoch 2/20, Training Loss: 0.1442 / Validation Loss: 0.7656 / Validation MCC: 0.5038
Epoch 3/20, Training Loss: 0.1185 / Validation Loss: 0.8295 / Validation MCC: 0.4600
Epoch 4/20, Training Loss: 0.1010 / Validation Loss: 0.8196 / Validation MCC: 0.4909
Epoch 5/20, Training Loss: 0.0841 / Validation Loss: 0.8703 / Validation MCC: 0.4483
Epoch 6/20, Training Loss: 0.0667 / Validation Loss: 0.9201 / Validation MCC: 0.4537
Epoch 7/20, Training Loss: 0.0530 / Validation Loss: 1.0062 / Validation MCC: 0.4283
Epoch 8/20, Training Loss: 0.0439 / Validation Loss: 1.0252 / Validation MCC: 0.4573
Epoch 9/20, Training Loss: 0.0374 / Validation Loss: 1.0806 / Validation MCC: 0.4591
Epoch 10/20, Training Loss: 0.0328 / Validation Loss: 1.1309 / Validation MCC: 0.4650
Epoch 11/20, Training Loss: 0.0291 / Validation Loss: 1.1705 / Validation MCC: 0.4585
Epoch 12/20, Training Loss: 0.0263 / Validation Loss: 1.2009 / 

In [117]:
y_dev_pred = utils.classification.predict_pytorch(model_nn, X_dev)
y_dev_pred[:5]

array([[0.47789988, 0.3874515 , 0.10826869, 0.02637985],
       [0.15870334, 0.17249148, 0.6620362 , 0.00676901],
       [0.8297279 , 0.05063055, 0.0876228 , 0.0320188 ],
       [0.40100658, 0.4624657 , 0.12562963, 0.01089818],
       [0.70254225, 0.10571714, 0.1206079 , 0.0711327 ]], dtype=float32)

In [118]:
utils.classification.print_classification_metrics(y_dev_true, y_dev_pred)

Metric                                   Score
Accuracy Score:                        0.64375
Balanced Accuracy Score:               0.54851
F1 Score (weighted):                   0.60819
Cohen Kappa Score:                     0.46364
Matthews Correlation Coefficient:      0.50383

Classification Report:

              precision    recall  f1-score   support

           0       0.58      0.94      0.72       322
           1       0.24      0.12      0.16       149
           2       1.00      0.34      0.51       148
           3       0.95      0.78      0.86       181

    accuracy                           0.64       800
   macro avg       0.69      0.55      0.56       800
weighted avg       0.68      0.64      0.61       800


Confusion Matrix:

Class 0 has 18 false negatives and 222 false positives.
Class 1 has 131 false negatives and 56 false positives.
Class 2 has 97 false negatives and 0 false positives.
Class 3 has 39 false negatives and 7 false positives.
The total number o
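
The false-negative and false-positive counts above are enough to recover the precision, recall, and accuracy in the classification report. The sketch below does that arithmetic with the numbers from this output; the tuple layout is just a convenience.

```python
# Per-class (support, false negatives, false positives) copied from the output above
per_class = {
    0: (322, 18, 222),
    1: (149, 131, 56),
    2: (148, 97, 0),
    3: (181, 39, 7),
}

metrics = {}
total_correct = 0
for label, (support, fn, fp) in per_class.items():
    tp = support - fn  # true positives: examples of the class that were not missed
    total_correct += tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / support
    metrics[label] = (round(precision, 2), round(recall, 2))

accuracy = total_correct / sum(s for s, _, _ in per_class.values())

print(metrics)
print(f"Accuracy: {accuracy:.5f}")
```

The 222 false positives for class 0 show where most of the errors go: articles from the other classes are being pulled into `poder`, consistent with the skew in the label model's training labels.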

In [119]:
# Identify the indices where the predicted labels differ from the true labels
# np.argmax(y_dev_pred, axis=1) returns the predicted labels by taking the index of the maximum probability for each example
# df_dev.label.values contains the true labels for the development set
# The condition checks where the predicted labels are not equal to the true labels
condition = np.argmax(y_dev_pred, axis=1) != df_dev.label.values

# np.where(condition)[0] returns the indices where the condition is True
# These are the indices where the predictions are incorrect
idxs = np.where(condition)[0]

# Iterate over the indices where the predictions are incorrect
for idx in idxs:
    # Print the predicted label for the current index
    # np.argmax(y_dev_pred, axis=1)[idx] gives the predicted label for the current example
    print(f"Predicted: {np.argmax(y_dev_pred, axis=1)[idx]}, True: {df_dev.label.values[idx]}")

    # Print the text of the current example
    # df_dev.text.values[idx] gives the text for the current example
    print(df_dev.text.values[idx])
    print()

Predicted: 0, True: 2
"what can i do for you?" ("o que posso fazer por voce?"). a frase economica e cortante, poucas vezes antecedida por um "bom dia" ou "como vai?" era frequentemente utilizada por michael geoghean, presidente do hsbc no brasil, quando atendia assessores ao celular nas semanas imediatamente posteriores ao desembarque do banco ingles no pais, no dia 26 de marco de 1997, data em que oficialmente assumiu o falido bamerindus. eu era assessor contratado para atuar na linha de frente de uma comunicacao destinada a tornar rapidamente conhecido e admirado o nome da instituicao que acabara de chegar ao mercado. a atitude de geoghean ao telefone traduzia a pressa e o impeto do gigante global hsbc (na epoca, 3.400 agencias em 78 paises, 105 mil funcionarios e us$ 50 bilhoes de valor de mercado) para atuar no mercado nacional. ja nos primeiros dias de sua gestao geoghean anunciava a meta de avancar rumo a lideranca do setor bancario do brasil. o esforco para rapidamente incluir o

In [120]:
# Import the probs_to_preds function from Snorkel's utils module
# probs_to_preds converts probabilistic labels to categorical labels
from snorkel.utils import probs_to_preds

# Convert the probabilistic labels to categorical labels for the training set
# probs_train_snorkel contains the probabilistic labels predicted by the label model for the training set
# probs_to_preds(probs=probs_train_snorkel) converts these probabilities to categorical labels
# The resulting y_train_categorical will contain the most likely class for each example
y_train_categorical = probs_to_preds(probs=probs_train_snorkel)

In [121]:
y_train_categorical[:5]

array([0, 0, 0, 0, 0])
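
Converting probabilistic labels to categorical ones amounts to a row-wise arg-max. The sketch below shows the idea on the first three rows of `probs_train_snorkel`; `argmax_preds` is an illustrative helper that takes the first maximum on ties, whereas Snorkel's `probs_to_preds` has its own tie-breaking options.

```python
# First three rows of probs_train_snorkel as printed in the notebook
probs = [
    [8.11146311e-01, 1.27048184e-01, 5.75946286e-02, 4.21087651e-03],
    [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07],
    [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
]


def argmax_preds(probs):
    """Index of the largest probability in each row (first index wins on ties)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in probs]


preds = argmax_preds(probs)
print(preds)
```

The conversion discards the confidence information: a row with 0.81 on class 0 and a row with 0.99 on class 0 become the same hard label.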

In [122]:
probs_train_snorkel[:5]

array([[8.11146311e-01, 1.27048184e-01, 5.75946286e-02, 4.21087651e-03],
       [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07],
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       [8.70671616e-01, 4.14872548e-02, 6.83821503e-02, 1.94589790e-02],
       [8.00668700e-01, 3.67187817e-02, 1.62611973e-01, 5.44869579e-07]])

In [123]:
import utils.classification

model_nn = utils.classification.train_neural_network(X_train, y_train_categorical, X_dev, y_dev_true)

Epoch 1/20, Training Loss: 0.5731 / Validation Loss: 1.2566 / Validation MCC: 0.5556
Epoch 2/20, Training Loss: 0.3535 / Validation Loss: 1.2140 / Validation MCC: 0.5136
Epoch 3/20, Training Loss: 0.2871 / Validation Loss: 1.5047 / Validation MCC: 0.4916
Epoch 4/20, Training Loss: 0.2371 / Validation Loss: 1.5648 / Validation MCC: 0.4876
Epoch 5/20, Training Loss: 0.1978 / Validation Loss: 1.8128 / Validation MCC: 0.4865
Epoch 6/20, Training Loss: 0.1685 / Validation Loss: 1.8801 / Validation MCC: 0.4901
Epoch 7/20, Training Loss: 0.1459 / Validation Loss: 2.1048 / Validation MCC: 0.4873
Epoch 8/20, Training Loss: 0.1282 / Validation Loss: 2.1190 / Validation MCC: 0.4737
Epoch 9/20, Training Loss: 0.1132 / Validation Loss: 2.3734 / Validation MCC: 0.4870
Epoch 10/20, Training Loss: 0.1018 / Validation Loss: 2.3232 / Validation MCC: 0.4808
Epoch 11/20, Training Loss: 0.0910 / Validation Loss: 2.5513 / Validation MCC: 0.4747
Epoch 12/20, Training Loss: 0.0845 / Validation Loss: 2.7050 / 

In [124]:
y_dev_pred = utils.classification.predict_pytorch(model_nn, X_dev)

utils.classification.print_classification_metrics(y_dev_true, y_dev_pred)

Metric                                   Score
Accuracy Score:                        0.65875
Balanced Accuracy Score:               0.55061
F1 Score (weighted):                   0.57529
Cohen Kappa Score:                     0.47434
Matthews Correlation Coefficient:      0.55561

Classification Report:

              precision    recall  f1-score   support

           0       0.55      0.98      0.70       322
           1       0.00      0.00      0.00       149
           2       1.00      0.28      0.43       148
           3       0.93      0.95      0.94       181

    accuracy                           0.66       800
   macro avg       0.62      0.55      0.52       800
weighted avg       0.62      0.66      0.58       800


Confusion Matrix:

Class 0 has 8 false negatives and 261 false positives.
Class 1 has 149 false negatives and 0 false positives.
Class 2 has 107 false negatives and 0 false positives.
Class 3 has 9 false negatives and 12 false positives.
The total number of


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





## We achieved an MCC of 0.556 with the pipeline