# Snorkel Apply
--- 


### 01. Data Import 

In [1]:
import re
import nltk
from nltk.corpus import stopwords

import pandas as pd
import numpy as np

from unidecode import unidecode

from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel
from snorkel.augmentation import transformation_function
from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier
from snorkel.slicing import slicing_function, PandasSFApplier

import nlpaug.augmenter.word as naw

In [2]:
books_df = pd.read_csv('../data/livros.csv')
books_df.columns = books_df.columns.map(str.lower)
books_df.head(5)

|   | titulo                               | genero        |
|---|--------------------------------------|---------------|
| 0 | Iniciação ao Estudo da Administração | Administracao |
| 1 | Iniciação a Administração geral      | Administracao |
| 2 | Iniciação a Administração de pessoal | Administracao |
| 3 | Administração de Materiais           | Administracao |
| 4 | Gestão Ambiental na Empresa          | Administracao |


--- 
### 02. Data Analysis
One way to understand the content of a text is to analyze its most informative words: those that occur most frequently and carry the most meaning for the topic addressed.

In this dataset the titles are short and the number of examples is limited, so the most frequent words of each genre can be inspected by hand to judge which ones are relevant for classifying the main topic.

In [3]:
# Download Portuguese stopwords from the NLTK library
nltk.download('stopwords')

def remove_special_characters_and_stopwords(text):
    """
    Removes special characters (except Latin usual characters),
    tokenizes the text, and removes Portuguese stopwords.
    
    Args:
        text (str): The input text to process.
        
    Returns:
        list of str: A list of cleaned and tokenized words without stopwords.
    """
    # Remove special characters (except Latin usual characters)
    clean_text = re.sub(r'[^\w\s]', ' ', text)
    
    # Remove accents and diacritics from the text (e.g., converting é to e)
    clean_text = unidecode(clean_text)
    
    # Tokenize the cleaned text into a list of words
    clean_text = clean_text.strip().split()

    # Get the list of Portuguese stopwords
    stopwords_list = set(stopwords.words('portuguese'))
    
    # Remove stopwords from the tokenized words
    clean_text = [word.lower() for word in clean_text if word.lower() not in stopwords_list]

    return clean_text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kevin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
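
To see what the cleaning steps produce without downloading the NLTK data, they can be traced on one title from the dataset. This is only a minimal sketch under stated assumptions: the standard-library `unicodedata` module stands in for `unidecode`, and `TOY_STOPWORDS` is a small hand-picked set, not the NLTK Portuguese list. The names `strip_accents` and `clean_title` are hypothetical helpers introduced for illustration.

```python
import re
import unicodedata

# Assumption: a tiny stand-in for stopwords.words('portuguese')
TOY_STOPWORDS = {"a", "o", "ao", "da", "de", "na"}

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_title(text):
    """Follow the notebook's order: punctuation to spaces, de-accent, split, filter."""
    # Compose first so accented letters count as word characters in the regex
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[^\w\s]", " ", text)
    text = strip_accents(text)
    return [w.lower() for w in text.split() if w.lower() not in TOY_STOPWORDS]

tokens = clean_title("Iniciação ao Estudo da Administração")
print(tokens)
```

The result is a list of lowercase, unaccented tokens, which is the shape that later cells explode and count per genre.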


In [4]:
books_df['alt_title'] = books_df['titulo'].map(remove_special_characters_and_stopwords)

In [5]:
list(books_df['genero'].unique())

['Administracao',
 'Artes',
 'Biologia',
 'Geografia',
 'Historia',
 'Literatura',
 'Matematica']

In [6]:
# Select the 'genero' and 'alt_title' columns from the DataFrame
# and explode the 'alt_title' column to create multiple rows for each title
# Then we can see the most relevant words for each genero
(
 books_df[['genero', 'alt_title']]
 .explode('alt_title')
 
 # Group the data by 'genero', reset the index, and calculate the value counts
 .groupby(by='genero', as_index=False)
 .value_counts()
 
 # Sort the values in descending order first by 'genero' and then by 'count'
 .sort_values(by=['genero', 'count'], ascending=False)
 
 # Group the sorted data by 'genero' and keep the top 100 words of each group
 .groupby('genero')
 .head(100)

# Change the word inside the double quotes for selection
).query('genero=="Historia"')


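|     | genero   | alt_title  | count |
|-----|----------|------------|-------|
| 800 | Historia | historia   | 154   |
| 801 | Historia | brasil     | 79    |
| 802 | Historia | geral      | 30    |
| 803 | Historia | arte       | 22    |
| 804 | Historia | guerra     | 19    |
| ... | ...      | ...        | ...   |
| 895 | Historia | documentos | 3     |
| 896 | Historia | indios     | 3     |
| 897 | Historia | indigenas  | 3     |
| 898 | Historia | leste      | 3     |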
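
The per-genre word count above can also be sketched without pandas, which makes the logic easier to follow. The token lists below are hand-cleaned versions of the five sample titles from section 01, assuming that "a", "ao", "da", "de" and "na" are removed as stopwords. `top_words_by_genre` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

# Hand-cleaned tokens for the five sample rows of section 01 (assumption: see lead-in)
rows = [
    ("Administracao", ["iniciacao", "estudo", "administracao"]),
    ("Administracao", ["iniciacao", "administracao", "geral"]),
    ("Administracao", ["iniciacao", "administracao", "pessoal"]),
    ("Administracao", ["administracao", "materiais"]),
    ("Administracao", ["gestao", "ambiental", "empresa"]),
]

def top_words_by_genre(rows, k=3):
    """Count tokens per genre and return the k most frequent ones for each genre."""
    counts = defaultdict(Counter)
    for genre, tokens in rows:
        counts[genre].update(tokens)
    return {genre: counter.most_common(k) for genre, counter in counts.items()}

top = top_words_by_genre(rows)
print(top["Administracao"])
```

Words that dominate one genre and rarely appear in the others are the natural candidates for keyword rules.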


#### 02.01. Analysis Conclusions:

We will use the Snorkel package for weak labeling and data augmentation; it relies on simple heuristic rules to infer labels. After removing the stop words and normalizing the text, it is possible to select keywords that strongly indicate a genre.

These analyses and selections follow no exact criteria, so a reader of the code may judge other words more relevant than the ones selected.

| Administração | Artes   | Biologia   | Geografia  | História    | Literatura | Matemática  |
|---------------|---------|------------|------------|-------------|------------|-------------|
| administracao | museu   | biologia   | geografia  | historia    | literatura | matematica  |
| organizacoes  | pintura | seres      | geografico | guerra      | texto      | fundamentos |
| organizacao   | arte    | genetica   | sociedade  | revolucao   | portuguesa | geometria   |
| gestao        | teatro  | vida       | regiao     | anos        | leitura    | calculo     |
| empresa       | museum  | biologicas | territorio | civilizacao | gramatica  | analitica   |
|               | gallery | evolucao   |            | antiga      |            | financeira  |

--- 
### 03. Snorkel
The Snorkel Python package is a system that allows the fast generation of training data with weak supervision. It was created to automate the process of creating and managing training data, allowing users to label, build and manage training data programmatically.

In [7]:
# Keys match the unaccented values of the 'genero' column
LABELS = {'Historia': 0, 'Administracao': 1, 'Geografia': 2,
          'Biologia': 3, 'Literatura': 4, 'Artes': 5, 'Matematica': 6}

# books_df['label'] = books_df['genero'].map(LABELS)

#### 03.01. Train and Test split 

In [8]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
x_train, x_test = train_test_split(books_df, test_size=0.25, random_state=42, stratify=books_df['genero'])

# Reset the index of the training and testing sets
x_train.reset_index(drop=True, inplace=True) 
x_test.reset_index(drop=True, inplace=True)

In [9]:
# Join the words in the 'alt_title' column of the training set into a single string
x_train['alt_title'] = x_train['alt_title'].map(lambda x: ' '.join(x))

# Join the words in the 'alt_title' column of the testing set into a single string
x_test['alt_title'] = x_test['alt_title'].map(lambda x: ' '.join(x))

#### 03.02. Labeling Function

The labeling_function decorator in Snorkel turns a plain Python function into a labeling function: a rule that either assigns a label to a data instance or abstains by returning -1. These rules are usually written by human experts from domain knowledge, and their noisy, possibly conflicting votes are later combined into weakly supervised labels for the data.
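
The keyword rules used in this notebook test whether a keyword occurs anywhere in the cleaned title string. The sketch below, in plain Python without Snorkel, contrasts that substring test with a stricter token-prefix test. `keyword_label` and its `match` parameter are hypothetical names introduced for illustration, and the keyword list is a shortened sample.

```python
ABSTAIN = -1  # Snorkel's convention for "no label"

def keyword_label(title, keywords, label, match="substring"):
    """Return `label` if any keyword matches the cleaned title, else ABSTAIN.

    match="substring": the keyword may appear anywhere in the string.
    match="prefix":    the keyword must start one of the whitespace-separated tokens.
    """
    tokens = title.split()
    for kw in keywords:
        if match == "substring" and kw in title:
            return label
        if match == "prefix" and any(tok.startswith(kw) for tok in tokens):
            return label
    return ABSTAIN

ARTES = 5
artes_keywords = ["museu", "art", "teatr"]

hit = keyword_label("arte moderna brasil", artes_keywords, ARTES)
false_hit = keyword_label("parte historia geral", artes_keywords, ARTES)
strict = keyword_label("parte historia geral", artes_keywords, ARTES, match="prefix")
print(hit, false_hit, strict)
```

Short stems such as "art" or "doc" raise coverage under substring matching, but they also fire inside unrelated words such as "parte", which is one plausible source of the off-diagonal counts in the confusion tables later on.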


In [10]:
from snorkel.labeling import labeling_function

# Define a labeling function for the "Historia" label
@labeling_function()
def lf_historia(words):
    """
    Returns the "Historia" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Historia" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Historia" label
    historia_keywords = ["guerra", "revol", "anos", "civiliz", "antiga", 
                         "hist", "arqueo", "arquitetura", "cultura", "brasil", "polit",
                         "socie", "socioe", "amer", "geral", "indep", "imperi", "repub", 
                         "reforma", "mediev", "moderna", "contempo", "europeia","idade", 
                         "colonial", "nacion", "africa", "colon", "descobri", "escrav", 
                         "crise", 'ociden', "doc", "antig"]    
    # Check if any of the keywords are present in the input words
    for word in historia_keywords:
        if remove_special_characters_and_stopwords(word)[0] in  words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Historia']
    # Return -1 to represent "no label" if none of the keywords match
    return -1 

@labeling_function()
def lf_administracao(words):
    """
    Returns the "Administração" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Administração" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Administração" label
    administracao_keywords = ["adm", "org", "gest", "empr", "neg", "econ", "controle"]
    # Check if any of the keywords are present in the input words
    for word in administracao_keywords:
        if remove_special_characters_and_stopwords(word)[0] in words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Administracao']
    # Return -1 to represent "no label" if none of the keywords match
    return -1

# Define a labeling function for the "Geografia" label
@labeling_function()
def lf_geografia(words):
    """
    Returns the "Geografia" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Geografia" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Geografia" label
    geografia_keywords = ["geo", "socio", "antropo", "regiao", "terr", "rural", "urban", 
                          'mund', 'atlas', 'mapa', 'carto', 'clima', 'ambien', 'ecolo', 'hidro',]
    # Check if any of the keywords are present in the input words
    for word in geografia_keywords:
        if remove_special_characters_and_stopwords(word)[0] in  words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Geografia']
    # Return -1 to represent "no label" if none of the keywords match
    return -1

# Define a labeling function for the "Biologia" label
@labeling_function()
def lf_biologia(words):
    """
    Returns the "Biologia" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Biologia" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Biologia" label
    biologia_keywords = ["bio", "seres", "vida", "evol", "genet", "medic", "saude", 
                         "nutri", "fisio", "enferm", "farma", "veteri","odonto", "psico"]
    # Check if any of the keywords are present in the input words
    for word in biologia_keywords:
        if remove_special_characters_and_stopwords(word)[0] in  words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Biologia']
    # Return -1 to represent "no label" if none of the keywords match
    return -1

# Define a labeling function for the "Literatura" label
@labeling_function()
def lf_literatura(words):
    """
    Returns the "Literatura" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Literatura" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Literatura" label
    literatura_keywords = ["litera", "texto", "portuguesa", "leit", "grama", "poesia", "poema"]
    # Check if any of the keywords are present in the input words
    for word in literatura_keywords:
        if remove_special_characters_and_stopwords(word)[0] in  words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Literatura']
    # Return -1 to represent "no label" if none of the keywords match
    return -1

# Define a labeling function for the "Artes" label
@labeling_function()
def lf_artes(words):
    """
    Returns the "Artes" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Artes" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Artes" label
    artes_keywords = ["museu", "cinema", "filme", "museum", "gallery",
                      "art", "pintur", "escult", "music", "teatr", "danc"]
    # Check if any of the keywords are present in the input words
    for word in artes_keywords:
        if remove_special_characters_and_stopwords(word)[0] in words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Artes']
    # Return -1 to represent "no label" if none of the keywords match
    return -1

# Define a labeling function for the "Matemática" label
@labeling_function()
def lf_matematica(words):
    """
    Returns the "Matemática" label if any keyword for this label occurs as a substring
    of the row's cleaned title. If none of the keywords match, it returns -1 ("no label").
    
    Parameters:
    words (pd.Series): A DataFrame row whose 'alt_title' field holds the cleaned title as one string.
    
    Returns:
    int: An integer representing the "Matemática" label, or -1 if no match is found.
    """
    # Define a list of keywords for the "Matemática" label
    matematica_keywords = ["matema", "fundam", "calcu", "algebra", "geome", "estatis", 
                           "proba", "trigono", "logica", "fisica", 'analit', 'aplica', 
                           "medio", "grau", 'discre', 'numeri', 'vetor', 'equa', 'difer', 'integral']
    # Check if any of the keywords are present in the input words
    for word in matematica_keywords:
        if remove_special_characters_and_stopwords(word)[0] in words['alt_title']:
            # Return the corresponding label if a match is found
            return LABELS['Matematica']
    # Return -1 to represent "no label" if none of the keywords match
    return -1


In [11]:


# Label mapping used by the labeling functions; keys are the unaccented 'genero' values
LABELS = {'Historia': 0, 'Administracao': 1, 'Geografia': 2, 'Biologia': 3, 
          'Literatura': 4, 'Artes': 5, 'Matematica': 6}

# Apply the labeling functions to the training data
applier = PandasLFApplier(lfs=[lf_historia, lf_administracao, lf_geografia, lf_biologia, 
                               lf_literatura, lf_artes, lf_matematica])
L_train = applier.apply(df=x_train)


100%|██████████| 872/872 [00:18<00:00, 46.26it/s]
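
The label matrix produced by the applier has one row per title and one column per labeling function, with -1 wherever a function abstains. A simple way to read such a matrix is a majority vote per row. The sketch below is a plain-Python baseline with a hypothetical toy matrix, not the `LabelModel` used in the next cell; Snorkel also ships a `MajorityLabelVoter` class for the same purpose.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(votes):
    """Most common non-abstain vote in one row; ABSTAIN on no votes or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# Toy label matrix: 4 titles x 3 labeling functions (values are hypothetical)
L_toy = [
    [0, -1, -1],   # only one labeling function fires
    [0, 2, 0],     # two votes for class 0, one for class 2
    [-1, -1, -1],  # nothing fires
    [1, 3, -1],    # tie between classes 1 and 3
]
preds = [majority_vote(row) for row in L_toy]
print(preds)
```

A majority vote treats every labeling function as equally reliable, whereas the label model tries to estimate how much each function should be trusted from their agreements and disagreements.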


In [12]:

# Train the label model
label_model = LabelModel(cardinality=7, verbose=True)
label_model.fit(L_train=L_train, 
                n_epochs=100, 
                lr=0.000005,
                log_freq=20, 
                seed=42, optimizer='adam')

# Predict labels for the training data
x_train['label_inf'] = label_model.predict(L=L_train, tie_break_policy="abstain")

# Apply the labeling functions to the test data
L_test = applier.apply(df=x_test)

# Predict labels for the test data
x_test['label_inf'] = label_model.predict(L=L_test, tie_break_policy="abstain")

INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|          | 0/100 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.178]
INFO:root:[20 epochs]: TRAIN:[loss=0.178]
INFO:root:[40 epochs]: TRAIN:[loss=0.178]
INFO:root:[60 epochs]: TRAIN:[loss=0.178]
 67%|██████▋   | 67/100 [00:00<00:00, 668.09epoch/s]INFO:root:[80 epochs]: TRAIN:[loss=0.177]
100%|██████████| 100/100 [00:00<00:00, 652.00epoch/s]
INFO:root:Finished Training
100%|██████████| 291/291 [00:05<00:00, 55.78it/s]


In [13]:
x_train['label'] = x_train['genero'].map(LABELS).values
x_test['label'] = x_test['genero'].map(LABELS).values

In [14]:
pd.crosstab(x_test['label'], x_test['label_inf'])

| label \ label_inf | -1 | 0  | 1  | 2  | 3  | 4 | 5 | 6  |
|-------------------|----|----|----|----|----|---|---|----|
| 0                 | 28 | 78 | 1  | 5  | 1  | 0 | 3 | 0  |
| 1                 | 11 | 9  | 17 | 0  | 1  | 0 | 1 | 0  |
| 2                 | 2  | 18 | 0  | 12 | 0  | 1 | 0 | 0  |
| 3                 | 4  | 5  | 0  | 3  | 22 | 0 | 0 | 0  |
| 4                 | 2  | 6  | 0  | 0  | 0  | 5 | 0 | 0  |
| 5                 | 12 | 3  | 0  | 1  | 0  | 0 | 6 | 0  |
| 6                 | 1  | 2  | 1  | 3  | 0  | 0 | 0 | 27 |
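
Two numbers summarize a crosstab like the one above: coverage (the share of titles that received any label) and accuracy on the covered titles. The sketch below copies the counts from the table into a nested list; `coverage_and_accuracy` is a hypothetical helper written for this illustration.

```python
# Rows: true label 0..6. Columns: predicted -1 (abstain), then 0..6.
confusion = [
    [28, 78, 1, 5, 1, 0, 3, 0],
    [11, 9, 17, 0, 1, 0, 1, 0],
    [2, 18, 0, 12, 0, 1, 0, 0],
    [4, 5, 0, 3, 22, 0, 0, 0],
    [2, 6, 0, 0, 0, 5, 0, 0],
    [12, 3, 0, 1, 0, 0, 6, 0],
    [1, 2, 1, 3, 0, 0, 0, 27],
]

def coverage_and_accuracy(matrix):
    """Return (share of rows labeled, share of labeled rows that are correct)."""
    total = sum(sum(row) for row in matrix)
    abstained = sum(row[0] for row in matrix)
    covered = total - abstained
    # True label i sits in column i + 1 because column 0 holds the abstains
    correct = sum(row[i + 1] for i, row in enumerate(matrix))
    return covered / total, correct / covered

coverage, accuracy = coverage_and_accuracy(confusion)
print(f"coverage={coverage:.3f} accuracy_on_covered={accuracy:.3f}")
```

Reporting both values matters because abstaining more often can raise accuracy on the covered titles while leaving more of the dataset unlabeled.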


#### 03.03. Transformation Function

In [15]:
context_aug = naw.ContextualWordEmbsAug(model_path='neuralmind/bert-base-portuguese-cased', aug_min=1)

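def synonym_replacer(text, model=context_aug):
    """
    Augments a string of text with the given nlpaug augmenter. With the contextual
    word-embedding augmenter defined above, words are changed based on BERT predictions,
    so the replacements are context-driven and not guaranteed to be strict synonyms.
    
    Parameters:
    text (str): A string of text to be augmented.
    model: The nlpaug augmenter to apply (defaults to context_aug).
    
    Returns:
    str: The first augmented variant of the text.
    """
    # Keep the first variant returned by augment()
    return model.augment(text)[0]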

@transformation_function()
def tf_synonym(df_row): 
    # Assuming that 'alt_title' is the column containing text
    df_row['alt_title'] = synonym_replacer(df_row.alt_title)
    return df_row

In [16]:
tf_policy = ApplyOnePolicy(n_per_original=2, keep_original=True)
tf_applier = PandasTFApplier([tf_synonym], tf_policy)

x_train_aug = tf_applier.apply(x_train)

100%|██████████| 872/872 [05:54<00:00,  2.46it/s]
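
As a rough mental model of the policy configured above, each original row is kept once and the transformation function is applied `n_per_original` times to copies of it. The sketch below is plain Python and not Snorkel's implementation; `apply_one_policy` and `tf_reverse_words` are hypothetical names, and the toy transformation only reverses word order instead of calling a language model.

```python
def apply_one_policy(rows, tf, n_per_original=2, keep_original=True):
    """Keep each row (optionally) and append n_per_original transformed copies."""
    out = []
    for row in rows:
        if keep_original:
            out.append(row)
        for _ in range(n_per_original):
            # Transform a copy so the original row is never mutated
            out.append(tf(dict(row)))
    return out

def tf_reverse_words(row):
    """Toy transformation (hypothetical): reverse the word order of the title."""
    row["alt_title"] = " ".join(reversed(row["alt_title"].split()))
    return row

rows = [
    {"alt_title": "historia geral brasil"},
    {"alt_title": "gestao ambiental empresa"},
]
augmented = apply_one_policy(rows, tf_reverse_words)
print(len(augmented))
```

Under this model 872 training rows grow to 872 × 3 = 2616, which is consistent with the row count shown when the labeling functions are re-applied to the augmented set.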


In [17]:
L_train = applier.apply(df=x_train_aug.drop('label_inf', axis=1))

100%|██████████| 2616/2616 [00:49<00:00, 53.15it/s]


In [18]:

label_model = LabelModel(cardinality=7, verbose=True)
label_model.fit(L_train=L_train, 
                n_epochs=100, 
                lr=0.1,
                log_freq=20, 
                seed=42)


INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|          | 0/100 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.087]
INFO:root:[20 epochs]: TRAIN:[loss=0.010]
INFO:root:[40 epochs]: TRAIN:[loss=0.001]
INFO:root:[60 epochs]: TRAIN:[loss=0.000]
 68%|██████▊   | 68/100 [00:00<00:00, 676.15epoch/s]INFO:root:[80 epochs]: TRAIN:[loss=0.000]
100%|██████████| 100/100 [00:00<00:00, 620.45epoch/s]
INFO:root:Finished Training


In [19]:
x_train_aug['label_inf'] = label_model.predict(L=L_train, tie_break_policy="abstain")

# Apply the labeling functions to the test data
L_test = applier.apply(df=x_test)

# Predict labels for the test data
x_test['label_inf'] = label_model.predict(L=L_test, tie_break_policy="abstain")

100%|██████████| 291/291 [00:05<00:00, 52.48it/s]


In [20]:
# Only the last expression of a cell is rendered, so the output below is the test crosstab
pd.crosstab(x_train_aug['label'], x_train_aug['label_inf'], margins=True)
pd.crosstab(x_test['label'], x_test['label_inf'])

| label \ label_inf | -1 | 0  | 1  | 2  | 3  | 4 | 5 | 6  |
|-------------------|----|----|----|----|----|---|---|----|
| 0                 | 28 | 75 | 1  | 6  | 1  | 0 | 4 | 1  |
| 1                 | 11 | 4  | 21 | 0  | 2  | 0 | 1 | 0  |
| 2                 | 2  | 18 | 0  | 11 | 0  | 1 | 0 | 1  |
| 3                 | 4  | 3  | 0  | 3  | 24 | 0 | 0 | 0  |
| 4                 | 2  | 6  | 0  | 0  | 0  | 5 | 0 | 0  |
| 5                 | 12 | 4  | 0  | 0  | 0  | 0 | 6 | 0  |
| 6                 | 1  | 0  | 1  | 3  | 1  | 1 | 0 | 27 |
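
Per-class recall shows which genres the rules serve poorly. The sketch below copies the counts from the crosstab above and assumes the label order of the mapping used when the labeling functions were applied (Historia 0 through Matematica 6). Abstains count against recall here, which is a choice made for this illustration.

```python
LABEL_NAMES = ["Historia", "Administracao", "Geografia", "Biologia",
               "Literatura", "Artes", "Matematica"]

# Rows: true label 0..6. Columns: predicted -1 (abstain), then 0..6.
confusion_aug = [
    [28, 75, 1, 6, 1, 0, 4, 1],
    [11, 4, 21, 0, 2, 0, 1, 0],
    [2, 18, 0, 11, 0, 1, 0, 1],
    [4, 3, 0, 3, 24, 0, 0, 0],
    [2, 6, 0, 0, 0, 5, 0, 0],
    [12, 4, 0, 0, 0, 0, 6, 0],
    [1, 0, 1, 3, 1, 1, 0, 27],
]

# Recall per class: correct predictions divided by all titles of that class
recall = {
    name: row[i + 1] / sum(row)
    for i, (name, row) in enumerate(zip(LABEL_NAMES, confusion_aug))
}

for name, value in sorted(recall.items(), key=lambda kv: kv[1]):
    print(f"{name:<14}{value:.2f}")
```

Classes at the top of the printed list are the ones where extra or more specific keywords would likely help most.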


#### 03.04. Slicing Function

In [21]:
@slicing_function()
def sf_historia(x):
    return 1 if "histori" in x.alt_title.lower() else 0

@slicing_function()
def sf_administracao(x):
    return 1 if "administra" in x.alt_title.lower() else 0

@slicing_function()
def sf_geografia(x):
    return 1 if "geogra" in x.alt_title.lower() else 0

@slicing_function()
def sf_biologia(x):
    return 1 if "biolo" in x.alt_title.lower() else 0

@slicing_function()
def sf_literatura(x):
    return 1 if "litera" in x.alt_title.lower() else 0

@slicing_function()
def sf_artes(x):
    return 1 if "arte" in x.alt_title.lower() else 0

@slicing_function()
def sf_matematica(x):
    return 1 if "matem" in x.alt_title.lower() else 0


sf_applier = PandasSFApplier([sf_historia, sf_administracao, sf_geografia, 
                              sf_biologia, sf_literatura, sf_artes, sf_matematica])
slice_labels = sf_applier.apply(x_train_aug)

# slice_labels = pd.DataFrame(slice_labels)

100%|██████████| 2616/2616 [00:00<00:00, 25221.71it/s]
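
Each slicing function above is a yes/no predicate on a row, and applying all of them yields one 0/1 membership column per slice. The sketch below reproduces that idea in plain Python on three hypothetical titles; `slice_membership` is an illustrative helper, not the Snorkel applier.

```python
def slice_membership(rows, predicates):
    """Return one dict per row mapping slice name -> 0/1 membership."""
    return [
        {name: int(pred(row)) for name, pred in predicates.items()}
        for row in rows
    ]

# Two of the notebook's slice rules rewritten as plain predicates
predicates = {
    "sf_historia": lambda row: "histori" in row["alt_title"],
    "sf_artes": lambda row: "arte" in row["alt_title"],
}

# Hypothetical cleaned titles
rows = [
    {"alt_title": "historia arte brasil"},
    {"alt_title": "parte geral"},
    {"alt_title": "calculo integral"},
]

membership = slice_membership(rows, predicates)
sizes = {name: sum(m[name] for m in membership) for name in predicates}
print(sizes)
```

Slices may overlap (the first title belongs to both) and substring rules can over-match ("arte" inside "parte"), so slice sizes are worth checking before they are used to report per-slice metrics.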


In [22]:
x_train_final = x_train_aug.reset_index(drop=True).join(pd.DataFrame(slice_labels))

In [23]:
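# Compute slice labels on the test rows; `slice_labels` was built from x_train_aug
slice_labels_test = sf_applier.apply(x_test)
x_test_final = x_test.reset_index(drop=True).join(pd.DataFrame(slice_labels_test))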

### 04. Data Export

In [24]:
x_train['label'] = x_train['genero'].map(LABELS)
x_train_aug['label'] = x_train_aug['genero'].map(LABELS)
x_train_final['label'] = x_train_final['genero'].map(LABELS)
x_test_final['label'] = x_test['genero'].map(LABELS)

In [25]:
x_train.drop(['titulo', 'genero'], axis=1, inplace=True)
x_train_aug.drop(['titulo', 'genero'], axis=1, inplace=True)
x_train_final.drop(['titulo', 'genero'], axis=1, inplace=True)
x_test_final.drop(['titulo', 'genero'], axis=1, inplace=True)

In [26]:
x_train.to_csv('../data/processed/train_og.csv', index=False)
x_train_aug.to_csv('../data/processed/train_aug.csv', index=False)
x_train_final.to_csv('../data/processed/train_final.csv', index=False)
x_test_final.to_csv('../data/processed/test.csv', index=False)