# Fixing the themes of Brazilian Laws with Confident Learning techniques



While researching new ways to apply machine learning techniques, I came across the Federal Congress of Brazil's [Open Data Portal](https://dadosabertos.camara.leg.br/swagger/api.html#staticfile). This portal contains a lot of data about the Brazilian Congress, including all the laws that have been discussed in recent decades. Among the resources are each law's summary (*_ementa_*) and its theme (*_tema_*), a multi-valued categorical variable that describes the law's subjects, such as *_health_*, *_education_*, *_economy_*, etc.

However, when I tried to develop a simple model to identify whether a law is related to a specific theme or not (binary classification), the poor model performance suggested that the data could have some problems. After inspection, I noticed that several laws are wrongly classified.

**The problem is**: The dataset contains 60K+ laws, ranging from 1990 to 2022, and it's impossible to manually inspect all of them.

CleanLab to the rescue!

Confident Learning is an approach to automatically identify noise and label errors in datasets using supervised Machine Learning models. It’s used in this notebook through the [CleanLab Python package](https://github.com/cleanlab/cleanlab).
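To make the idea concrete, here is a minimal pure-Python sketch of the thresholding step behind Confident Learning. It is a simplified illustration under my own assumptions, not CleanLab's actual implementation: the helper names `class_thresholds` and `suspect_labels` and the toy probabilities are made up for this example. Each class gets a threshold equal to the average probability the model assigns to that class among the examples labeled with it. An example becomes a suspect when the model is confidently (above threshold) in favor of a class other than its given label.

```python
def class_thresholds(labels, pred_probs, n_classes=2):
    """Per-class threshold: mean p(class=k) over the examples labeled k."""
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    return thresholds


def suspect_labels(labels, pred_probs, n_classes=2):
    """Return (index, given_label, suggested_label) for likely label errors."""
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    suspects = []
    for idx, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != y:
            suspects.append((idx, y, best))
    return suspects


# Toy data (hypothetical): given labels and out-of-sample [p(0), p(1)] per example.
labels = [0, 0, 0, 0, 1, 1]
pred_probs = [
    [0.95, 0.05],
    [0.90, 0.10],
    [0.85, 0.15],
    [0.10, 0.90],  # labeled 0, but the model is confident it is 1
    [0.20, 0.80],
    [0.10, 0.90],
]

print(suspect_labels(labels, pred_probs))  # [(3, 0, 1)]
```

Only the fourth example is flagged: its probability for class 1 (0.90) exceeds the class 1 threshold (0.85), while its given label is 0.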

## Importing Libraries

In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

from cleanlab import Datalab
import json

RANDOM_SEED = 214
np.random.seed(RANDOM_SEED)



## Importing data

In [8]:
df_pls_theme = pd.read_parquet('../../data/proposicoes_temas_one_hot_encoding.parquet')
df_pls_theme.head(2)

tema,id,ementa,Administração Pública,"Agricultura, Pecuária, Pesca e Extrativismo","Arte, Cultura e Religião",Cidades e Desenvolvimento Urbano,Comunicações,Defesa e Segurança,Direito Civil e Processual Civil,Direito Penal e Processual Penal,...,Finanças Públicas e Orçamento,Homenagens e Datas Comemorativas,"Indústria, Comércio e Serviços",Meio Ambiente e Desenvolvimento Sustentável,"Política, Partidos e Eleições",Previdência e Assistência Social,Saúde,Trabalho e Emprego,"Viação, Transporte e Mobilidade",Outro
0,14919,"Dispõe sobre a Política Nacional de Salários, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,14920,"Modifica o art. 6º da Lei nº 9.424, de 24 de d...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Counting the number of laws per theme**

In [9]:
df_pls_theme\
    .drop_duplicates('ementa')\
    .drop(columns=['id','ementa'])\
    .sum(axis=0)\
    .sort_values(ascending=False)

tema
Direitos Humanos e Minorias                    8544
Trabalho e Emprego                             7775
Saúde                                          7648
Finanças Públicas e Orçamento                  7168
Administração Pública                          6915
Direito Penal e Processual Penal               5704
Educação                                       5267
Indústria, Comércio e Serviços                 4540
Viação, Transporte e Mobilidade                4459
Defesa e Segurança                             3529
Direito Civil e Processual Civil               3413
Meio Ambiente e Desenvolvimento Sustentável    2988
Previdência e Assistência Social               2818
Homenagens e Datas Comemorativas               2683
Economia                                       2513
Direito e Defesa do Consumidor                 2360
Comunicações                                   2254
Cidades e Desenvolvimento Urbano               1995
Outro                                          1973
Energia

Selecting the theme to be analyzed

In [10]:
BINARY_CLASS = "Homenagens e Datas Comemorativas"
IN_BINARY_CLASS = "in_" + BINARY_CLASS.lower().replace(" ", "_")

df_pls_theme = df_pls_theme.drop_duplicates(subset=["ementa"])
df_pls_theme = df_pls_theme[["ementa", BINARY_CLASS]]
df_pls_theme = df_pls_theme.rename(
    columns={BINARY_CLASS: IN_BINARY_CLASS}
)

df_pls_theme.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61934 entries, 0 to 65976
Data columns (total 2 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ementa                               61934 non-null  object
 1   in_homenagens_e_datas_comemorativas  61934 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ MB


Percentage of the selected theme

In [12]:
100*df_pls_theme[IN_BINARY_CLASS].value_counts(normalize=True)

in_homenagens_e_datas_comemorativas
0    95.667969
1     4.332031
Name: proportion, dtype: float64

**Showing errors in the data**

As the chosen class is *Tributes and commemorative dates* (a simple theme), it's possible to search for errors with simple queries.

The code below searches for false negatives in the dataset, i.e., laws that are not classified as *Tributes and commemorative dates* but should be.

In [42]:
df_possible_errors = (
    df_pls_theme
    .query(f"{IN_BINARY_CLASS} == 0")
    .loc[ df_pls_theme['ementa'].apply(lambda x: 'dia nacional' in x.lower()) ]
)

print(f"Found {len(df_possible_errors)} possible errors")
df_possible_errors

Found 178 possible errors


tema,ementa,in_homenagens_e_datas_comemorativas
1000,Dispõe sobre a instituição do Dia Nacional da ...,0
1216,Institui o dia 20 de julho como Dia Nacional d...,0
1526,Institui o 12 de agosto como Dia Nacional da J...,0
1876,Institui o dia 13 de julho como o Dia Nacional...,0
1951,Institui o Dia Nacional das APAES.,0
...,...,...
54911,Insere métodos de ensino sanitário para crianç...,0
63545,Institui o Dia 10 de outubro como Dia Nacional...,0
64085,Dispõe sobre a criação do Dia Nacional de defe...,0
64293,Cria o Programa Nacional de Prevenção da Depre...,0




It's important to note that I'm not a specialist in law, but I see no reason why *Institui o dia nacional do skate*, *Institui o Dia Nacional do Inventor ...* and *Institui o Dia Nacional do Idoso* should not be related to *Tributes and commemorative dates*.

## Training the Model with the original labels

The first step of this notebook is to train a model with the original labels and check its performance.

The cells below follow the usual steps of a Machine Learning project, from data preparation to model training and evaluation.

**Train test split**

In [5]:
X = df_pls_theme.ementa
y = df_pls_theme[IN_BINARY_CLASS]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=RANDOM_SEED
)

**Model Selection**

To make this code simpler, I've omitted the Model Selection part and chosen to use a Random Forest Classifier directly.

**Hyperparameter Tuning**

As this is a highly imbalanced dataset, accuracy is not a good metric to evaluate the model. The F1 score is a better metric for this case.

Also, it's imperative to use the `stratify` parameter in the train/test split to ensure that the train and test sets have the same class proportions.
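A quick worked example of why accuracy is misleading here. It uses only the counts already shown in this notebook (61,934 deduplicated laws, 2,683 of them tagged with the selected theme). The helper `binary_metrics` is a hypothetical name of mine, and returning 0.0 when a denominator is zero is a convention I chose for the sketch.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Convention for this sketch: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


N_TOTAL = 61934     # rows after dropping duplicate summaries
N_POSITIVE = 2683   # laws tagged "Homenagens e Datas Comemorativas"

# A "model" that never predicts the theme: every positive is a false negative.
baseline = binary_metrics(tp=0, fp=0, fn=N_POSITIVE, tn=N_TOTAL - N_POSITIVE)
print(round(baseline["accuracy"], 4), baseline["f1"])  # 0.9567 0.0
```

A model that finds nothing still scores about 95.7% accuracy, while its F1 is zero. That is the reason `scoring='f1'` is used in the grid search.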

In [64]:
clf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, strip_accents='ascii', max_features=5000, max_df=0.9)),
    ('clf', RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1))
])

params = {
    'clf__n_estimators': [200, 300],
    'clf__max_depth': [None, 10, 30],
}

In [34]:
grid_search = GridSearchCV(
    clf_pipeline,
    params,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_SEED),
    scoring='f1',
    n_jobs=2,
    verbose=1,
)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
CPU times: user 1min 6s, sys: 18.5 s, total: 1min 25s
Wall time: 2min 47s


The best model and its score

In [47]:
print(grid_search.best_score_)
grid_search.best_estimator_

0.7992734162850049


## Cleaning the dataset

The CleanLab package uses the probabilities predicted by a model trained with the original labels to identify label errors and noisy data.

The [CleanLab documentation](https://cleanlab.readthedocs.io/en/latest/) mentions that these probabilities must be out-of-sample for every record, i.e., each record's probabilities must come from a model that did not see that record during training. The easiest way to do this is to use K-Fold Cross Validation.
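To illustrate why K-Fold Cross Validation yields an out-of-sample prediction for every record, here is a small pure-Python sketch. Everything in it is a stand-in of my own: `kfold_indices` is a plain (unshuffled, unstratified) splitter, and the "model" just predicts the positive rate of its training fold. The notebook itself relies on scikit-learn's `cross_val_predict` with `StratifiedKFold`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx); each index is held out exactly once."""
    for fold in range(n_splits):
        test_idx = [i for i in range(n_samples) if i % n_splits == fold]
        train_idx = [i for i in range(n_samples) if i % n_splits != fold]
        yield train_idx, test_idx


def out_of_sample_probs(labels, n_splits):
    """Toy classifier: predicts the positive rate seen in the training fold."""
    probs = [None] * len(labels)
    for train_idx, test_idx in kfold_indices(len(labels), n_splits):
        positive_rate = sum(labels[i] for i in train_idx) / len(train_idx)
        for i in test_idx:
            # Record i is not in train_idx, so this prediction is out-of-sample.
            probs[i] = [1 - positive_rate, positive_rate]
    return probs


labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]  # hypothetical toy labels
probs = out_of_sample_probs(labels, n_splits=5)

# Every record received exactly one prediction, from a model that never saw it.
print(all(p is not None for p in probs))  # True
```

The resulting list has one `[p(0), p(1)]` row per record, which is the same shape that `method='predict_proba'` produces for a binary problem.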

In [67]:
clean_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, strip_accents='ascii', max_features=6000, max_df=0.95)),
    ('clf', RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1, n_estimators=300))
])

In [68]:
y_proba = cross_val_predict(
    clean_pipeline, 
    df_pls_theme['ementa'], 
    df_pls_theme[IN_BINARY_CLASS],
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED), 
    method='predict_proba', 
    verbose=2,
    n_jobs=-1
)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.1min finished


The CleanLab package has a `Datalab` object that encapsulates a DataFrame and its labels. It uses this object, together with the model probabilities, to identify label errors and noisy data.

In [72]:
lab = Datalab(
    data=df_pls_theme,
    label_name=IN_BINARY_CLASS,
)

With the probabilities in hand, all that is needed is to pass them to the `find_issues` method.

In [109]:
lab.find_issues(pred_probs=y_proba)

Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 1027 issues found in the dataset.




CleanLab is geared to identify label errors, outliers, near-duplicates, and noisy data, but we're only interested in label errors.

The `get_issue_summary` method returns a summary of the specified issue type.

In [103]:
lab.get_issue_summary("label")

Unnamed: 0,issue_type,score,num_issues
0,label,0.994962,312


It found 312 label errors in the dataset. Let's inspect them.

In [96]:
# Getting the predicted errors
y_clean_labels = lab.get_issues("label")[['predicted_label', 'is_label_issue']]

# adding them to the original dataset
df_ples_theme_clean = df_pls_theme.copy().reset_index(drop=True)
df_ples_theme_clean['predicted_label'] = y_clean_labels['predicted_label']
df_ples_theme_clean['is_label_issue'] = y_clean_labels['is_label_issue']

In [97]:
df_ples_theme_clean.query("is_label_issue")[['ementa', IN_BINARY_CLASS, 'predicted_label']]

tema,ementa,in_homenagens_e_datas_comemorativas,predicted_label
278,"Institui, na República Federativa do Brasil, a...",0,1
286,"Institui o Dia do Evangélico, determinando fer...",0,1
805,Institui o dia 2 de julho como Dia da Libertaç...,0,1
1203,"Denomina ""Aeroporto de Porto Velho / Governado...",0,1
1504,Institui o 12 de agosto como Dia Nacional da J...,0,1
...,...,...,...
60482,Institui o dia nacional do skate,0,1
60499,Cria a premiação “Aluno Nota Dez” e “Escola No...,1,0
60932,"Institui o selo “Quebra-Cabeça”, com a finalid...",1,0
60947,Concede isenção do Imposto sobre Produtos Indu...,1,0


### Automatically fixing the dataset

In [6]:
clean_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, strip_accents='ascii', max_features=6000, max_df=0.95)),
    ('clf', RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1, n_estimators=300))
])

In [12]:
df_pls_theme

tema,ementa,in_homenagens_e_datas_comemorativas
0,"Dispõe sobre a Política Nacional de Salários, ...",0
1,"Modifica o art. 6º da Lei nº 9.424, de 24 de d...",0
2,Dispõe sobre salário-família e dá outras provi...,0
3,"Modifica a Lei nº 4.117, de 1962, que ""institu...",0
4,Concede isenção do imposto sobre produtos indu...,0
...,...,...
65971,Dispõe sobre a permuta dos agentes de seguranç...,0
65973,"Confere ao município de Laranjal Paulista, loc...",1
65974,Dispõe sobre a obrigatoriedade de plataformas ...,0
65975,"Altera a Lei nº 13.277, de 29 de abril de 2016...",1


In [46]:
metrics = []
N_FIXES = 8
df = df_pls_theme.copy()
df[IN_BINARY_CLASS+'_0'] = df[IN_BINARY_CLASS]

for i in range(N_FIXES):
    in_binary_class_i = IN_BINARY_CLASS+f'_{i}'

    y_proba_i = cross_val_predict(
        clean_pipeline, 
        df['ementa'], 
        df[in_binary_class_i],
        cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED), 
        method='predict_proba', 
        verbose=3,
        n_jobs=-1
    )

    y_i_true = df[in_binary_class_i]
    y_i = np.argmax(y_proba_i, axis=1)

    metrics.append({
        'accuracy': accuracy_score(y_i_true, y_i),
        'precision': precision_score(y_i_true, y_i),
        'recall': recall_score(y_i_true, y_i),
        'f1': f1_score(y_i_true, y_i),
        'confusion_matrix': confusion_matrix(y_i_true, y_i),
        'classification_report': classification_report(y_i_true, y_i),
        'n_fixes': i,
    })


    # Find issues and add to dataframe
    datalab_i = Datalab(
        data=df,
        label_name=in_binary_class_i,
    )
    datalab_i.find_issues(pred_probs=y_proba_i)
    metrics[-1]['n_issues'] = datalab_i.get_issue_summary("label")['num_issues'][0]
    
    
    data_label_issues_i = datalab_i.get_issues("label")

    df[IN_BINARY_CLASS+f'_{i+1}'] = data_label_issues_i['predicted_label'].to_list()
    df['is_label_issue'] = data_label_issues_i['is_label_issue'].to_list()
    df[IN_BINARY_CLASS+f'_{i+1}'] = df[IN_BINARY_CLASS+f'_{i+1}'].mask(~df['is_label_issue'], df[IN_BINARY_CLASS+f'_{i}'])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  2.3min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 1022 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.8min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 718 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.8min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 727 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.8min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 709 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.6min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 682 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.5min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 684 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.5min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 657 issues found in the dataset.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:  1.5min finished


Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 657 issues found in the dataset.
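The label update at the end of each iteration of the loop above can be illustrated with plain lists. This is only a sketch of the rule, with a hypothetical helper name (`apply_fixes`) and toy values: a row takes the model's predicted label only when it was flagged as a label issue, otherwise it keeps the label from the previous round.

```python
def apply_fixes(current_labels, predicted_labels, is_label_issue):
    """Keep the current label unless the row was flagged; then use the prediction."""
    return [
        predicted if flagged else current
        for current, predicted, flagged in zip(
            current_labels, predicted_labels, is_label_issue
        )
    ]


# Toy values (hypothetical), one entry per law.
current_labels = [0, 0, 1, 1, 0]
predicted_labels = [1, 1, 0, 1, 0]
is_label_issue = [True, False, True, False, False]

print(apply_fixes(current_labels, predicted_labels, is_label_issue))
# [1, 0, 0, 1, 0]
```

Note the second row: the model disagrees with the label, but the row was not flagged, so the label is left untouched. This is what the `mask(~df['is_label_issue'], ...)` call does in the loop.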


In [59]:
df_clean_labels = (
    df
    [ ['ementa', IN_BINARY_CLASS, IN_BINARY_CLASS+f'_{N_FIXES}'] ]
    .rename(
        columns={
            IN_BINARY_CLASS+f'_{N_FIXES}': IN_BINARY_CLASS+'_fixed',
            IN_BINARY_CLASS: IN_BINARY_CLASS+'_original',
        }
    )
)

In [58]:
df_metrics_results = pd.DataFrame(metrics)
df_metrics_results

Unnamed: 0,accuracy,precision,recall,f1,confusion_matrix,classification_report,n_fixes,n_issues
0,0.984774,0.846338,0.792397,0.818479,"[[58865, 386], [557, 2126]]",precision recall f1-score ...,0,309
1,0.990119,0.922073,0.859875,0.889888,"[[58849, 209], [403, 2473]]",precision recall f1-score ...,1,44
2,0.990861,0.938623,0.861856,0.898603,"[[58860, 164], [402, 2508]]",precision recall f1-score ...,2,20
3,0.99141,0.94494,0.868673,0.905203,"[[58862, 148], [384, 2540]]",precision recall f1-score ...,3,12
4,0.991991,0.951075,0.875768,0.911869,"[[58872, 132], [364, 2566]]",precision recall f1-score ...,4,7
5,0.991491,0.948212,0.867712,0.906178,"[[58862, 139], [388, 2545]]",precision recall f1-score ...,5,1
6,0.991588,0.948346,0.869802,0.907378,"[[58861, 139], [382, 2552]]",precision recall f1-score ...,6,0
7,0.991588,0.948346,0.869802,0.907378,"[[58861, 139], [382, 2552]]",precision recall f1-score ...,7,0
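The `n_issues` column reaches zero at round 6, and round 7 repeats the same result. A possible refinement, sketched below under my own assumptions, is to stop as soon as a round flags no label issues. The function `run_until_clean` is hypothetical, and the expensive cross-validation plus `Datalab` round is replaced by a stand-in that replays the counts from the table.

```python
def run_until_clean(fix_round, max_rounds):
    """Call fix_round(i) until it reports zero label issues or max_rounds is hit."""
    history = []
    for i in range(max_rounds):
        n_issues = fix_round(i)
        history.append(n_issues)
        if n_issues == 0:
            break  # nothing left to fix, further rounds would change nothing
    return history


# Stand-in for one real round: replays the "n_issues" column of the table above.
recorded = [309, 44, 20, 12, 7, 1, 0, 0]
history = run_until_clean(lambda i: recorded[i], max_rounds=8)

print(history)  # [309, 44, 20, 12, 7, 1, 0]
```

With the recorded counts, the loop would have stopped one round earlier, saving one full cross-validation pass.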


**Inspecting the results**

In [73]:
confusion_matrix(
    df_clean_labels[IN_BINARY_CLASS+'_original'],
    df_clean_labels[IN_BINARY_CLASS+'_fixed']
)

array([[58929,   322],
       [   71,  2612]])
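A short worked reading of this matrix, using only the numbers printed above. In scikit-learn's convention, rows are the first argument (original labels) and columns the second (fixed labels), so the off-diagonal cells count the labels that changed.

```python
# Rows: original labels (0, 1). Columns: fixed labels (0, 1).
matrix = [[58929, 322],
          [71, 2612]]

added = matrix[0][1]     # 0 -> 1: laws newly tagged with the theme
removed = matrix[1][0]   # 1 -> 0: laws that lost the theme
total = sum(sum(row) for row in matrix)
changed = added + removed

positives_before = matrix[1][0] + matrix[1][1]
positives_after = matrix[0][1] + matrix[1][1]

print(changed, f"{100 * changed / total:.2f}%")  # 393 0.63%
print(positives_before, positives_after)         # 2683 2934
```

In total, 393 labels changed (about 0.63% of the dataset), which matches the sum of the `n_issues` column in the metrics table (309 + 44 + 20 + 12 + 7 + 1). Most of the changes add the theme to laws that were missing it.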

In [76]:
df_clean_labels.query(
    f"{IN_BINARY_CLASS+'_original'} != {IN_BINARY_CLASS+'_fixed'} " 
    +f"and {IN_BINARY_CLASS+'_original'}==0"
).sample(frac=0.1, random_state=RANDOM_SEED)['ementa'].to_list()[:15]

['Institui a "Semana Nacional do Combate à Corrupção".',
 'Institui o dia Nacional de Mutirão da Saúde.',
 'Institui a data de 5 de dezembro como o "Dia Nacional da Pastoral da Criança".',
 'Institui o Dia do Técnico em Segurança do Trabalho, a ser comemorado em 27 de novembro.',
 'Inscreve o nome de Julio Cezar Ribeiro de Souza no Livro dos Heróis da Pátria.',
 'Denomina "Rodovia Luiz Otacílio Correia" o trecho da rodovia BR-230, entre as cidades de Lavras da Mangabeira e Várzea Alegre, no Estado do Ceará.',
 'Inscreve o nome dos servidores do Centro Técnico Aeroespacial  mortos no acidente com VLS 1, na Base de Alcântara, Maranhão, no Livro dos Heróis da Pátria.',
 'Denomina "Viaduto Governador Henrique Santillo" o viaduto localizado no km 432 da BR-153, no Município de Anápolis - GO.',
 'Denomina "Porto Fluvial Paulo de Souza Coelho".',
 'Institui o Dia Nacional de Prevenção e Combate à violência no Trânsito.',
 'Institui o Dia Nacional do Teatro para a Infância e a Juventude.',
 'I

In [75]:
df_clean_labels.query(
    f"{IN_BINARY_CLASS+'_original'} != {IN_BINARY_CLASS+'_fixed'} " 
    +f"and {IN_BINARY_CLASS+'_original'}==1"
)['ementa'].to_list()[:15]

['Dispõe sobre a criação da Década Brasileira pela Cultura da Paz.',
 'Cria no âmbito do Ministério da Cultura o Prêmio de Artes Plásticas Marcantônio Vilaça e dá outras providências.',
 'Revigora a Lei nº 2.689, de 20 de Dezembro de 1955, e  dá  outras providências.                              .',
 'Altera a Lei nº 6.454, de 24 de outubro de 1977, que "dispõe sobre a denominação de logradouros, obras, serviços e monumentos públicos, e dá outras providências.',
 'RESTABELECE O ESTIMULO FISCAL PREVISTO PELO ARTIGO 50 DA LEI 4504, DE 30 DE NOVEMBRO DE 1964, COM A REDAÇÃO DADA PELA LEI 6746, DE 10 DE DEZEMBRO DE 1979.',
 'DETERMINA, NO AMBITO DAS EMPRESAS PRIVADAS, O FORNECIMENTO DE ALIMENTAÇÃO AOS SEUS EMPREGADOS E DA OUTRAS PROVIDENCIAS.',
 'Cria o Fundo Nacional de Desenvolvimento da Pecuária de Corte - FUNPEC e institui contribuição de intervenção no domínio econômico destinada a fomentar o desenvolvimento do setor pecuário.',
 'Altera a Lei nº 6.454, de 24 de outubro de 1977, que "d

**Saving results**

In [77]:
df_metrics_results.to_csv(f'./data/{IN_BINARY_CLASS}_conf_learning_metrics_results.csv', index=False)

In [78]:
df_clean_labels.to_parquet(
    f'./data/{IN_BINARY_CLASS}_conf_learning_clean_labels.parquet', 
    index=False
)