## Importing Libraries

In [None]:
!pip install "cleanlab[all]"

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

from cleanlab import Datalab


RANDOM_SEED = 214
np.random.seed(RANDOM_SEED)



## Importing data

In [2]:
df_pls_theme = pd.read_parquet('../../data/proposicoes_temas_one_hot_encoding.parquet')
df_pls_theme.head()

tema,id,ementa,Administração Pública,"Agricultura, Pecuária, Pesca e Extrativismo","Arte, Cultura e Religião",Cidades e Desenvolvimento Urbano,Comunicações,Defesa e Segurança,Direito Civil e Processual Civil,Direito Penal e Processual Penal,...,Finanças Públicas e Orçamento,Homenagens e Datas Comemorativas,"Indústria, Comércio e Serviços",Meio Ambiente e Desenvolvimento Sustentável,"Política, Partidos e Eleições",Previdência e Assistência Social,Saúde,Trabalho e Emprego,"Viação, Transporte e Mobilidade",Outro
0,14919,"Dispõe sobre a Política Nacional de Salários, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,14920,"Modifica o art. 6º da Lei nº 9.424, de 24 de d...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,14921,Dispõe sobre salário-família e dá outras provi...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,14922,"Modifica a Lei nº 4.117, de 1962, que ""institu...",0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,14923,Concede isenção do imposto sobre produtos indu...,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0


Theme counts

In [3]:
(
    df_pls_theme
    .drop_duplicates('ementa')
    .drop(columns=['id', 'ementa'])
    .sum(axis=0)
    .sort_values(ascending=False)
)

tema
Direitos Humanos e Minorias                    8544
Trabalho e Emprego                             7775
Saúde                                          7648
Finanças Públicas e Orçamento                  7168
Administração Pública                          6915
Direito Penal e Processual Penal               5704
Educação                                       5267
Indústria, Comércio e Serviços                 4540
Viação, Transporte e Mobilidade                4459
Defesa e Segurança                             3529
Direito Civil e Processual Civil               3413
Meio Ambiente e Desenvolvimento Sustentável    2988
Previdência e Assistência Social               2818
Homenagens e Datas Comemorativas               2683
Economia                                       2513
Direito e Defesa do Consumidor                 2360
Comunicações                                   2254
Cidades e Desenvolvimento Urbano               1995
Outro                                          1973
Energia

Selecting the theme to classify

In [4]:
BINARY_CLASS = "Homenagens e Datas Comemorativas"
IN_BINARY_CLASS = "in_" + BINARY_CLASS.lower().replace(" ", "_")

df_pls_theme = df_pls_theme.drop_duplicates(subset=["ementa"])
df_pls_theme = df_pls_theme[["ementa", BINARY_CLASS]]
df_pls_theme = df_pls_theme.rename(
    columns={BINARY_CLASS: IN_BINARY_CLASS}
)

df_pls_theme.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61934 entries, 0 to 65976
Data columns (total 2 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ementa                               61934 non-null  object
 1   in_homenagens_e_datas_comemorativas  61934 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.4+ MB
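The selected theme is a small minority class. A minimal sketch of the arithmetic, using the two counts shown in the outputs above (2683 deduplicated rows for the theme, 61934 rows in total); the variable names are illustrative only:

```python
# Counts taken from the outputs above.
n_rows = 61934        # deduplicated rows reported by df_pls_theme.info()
n_positive = 2683     # rows tagged "Homenagens e Datas Comemorativas"

positive_rate = n_positive / n_rows
# Accuracy of a trivial classifier that always predicts the negative class.
majority_baseline_accuracy = 1 - positive_rate

print(f"positive rate: {positive_rate:.2%}")
print(f"always-negative accuracy: {majority_baseline_accuracy:.2%}")
```

With roughly 4% positives, plain accuracy would look high even for a model that never predicts the theme, which is presumably why the split below is stratified and the grid search is scored with F1. On the real frame, `df_pls_theme[IN_BINARY_CLASS].mean()` should give the same rate.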


## Data preprocessing
[wip] laws ...

In [5]:
X = df_pls_theme.ementa
y = df_pls_theme[IN_BINARY_CLASS]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=RANDOM_SEED
)

In [30]:
clf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, strip_accents='ascii', max_features=5000, max_df=0.9)),
    ('clf', RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1))
])

params = {
    'clf__n_estimators': [200, 300],
    'clf__max_depth': [None, 10, 30],
}

In [34]:
grid_search = GridSearchCV(
    clf_pipeline, params, cv=3, scoring='f1', n_jobs=2, verbose=1
)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
CPU times: user 1min 6s, sys: 18.5 s, total: 1min 25s
Wall time: 2min 47s
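The "6 candidates, 18 fits" in the log follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A small standard-library sketch of that enumeration (it does not call scikit-learn; `cv_folds` mirrors `cv=3` above):

```python
from itertools import product

# Same grid as in the notebook.
params = {
    "clf__n_estimators": [200, 300],
    "clf__max_depth": [None, 10, 30],
}
cv_folds = 3

# One candidate per combination of parameter values.
candidates = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", total_fits, "fits")
```

Adding one more value to either list increases the number of fits by a full fold count per new combination, so the grid size is worth checking before launching a long search.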


In [47]:
print(grid_search.best_score_)
grid_search.best_estimator_

0.7992734162850049
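`best_score_` is the mean F1 over the three validation folds for the best candidate. As a reminder of what that number measures, here is a minimal sketch of F1 computed from confusion counts; the counts are made up for illustration and are not taken from this model:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 themes found, 20 false alarms, 20 themes missed.
example_f1 = f1_from_counts(tp=80, fp=20, fn=20)
print(round(example_f1, 4))
```

True negatives do not appear in the formula, so the large negative class does not inflate the score.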


**Cleaning the dataset**

In [6]:
clean_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, strip_accents='ascii', max_features=5000, max_df=0.9)),
    ('clf', RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1, n_estimators=300))
])

In [7]:
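# Out-of-fold class probabilities: each training row is scored by a model
# that did not see it during fitting.
y_pred_proba = cross_val_predict(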
    clean_pipeline, 
    X_train, 
    y_train, 
    cv=5, 
    method='predict_proba', 
    verbose=1,
    n_jobs=-1
)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.5min finished


In [21]:
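lab = Datalab(
    # Build a frame with named text and label columns; passing y_train as the
    # second positional argument would set it as the index, not as a column.
    data=pd.DataFrame({"ementa": X_train, IN_BINARY_CLASS: y_train}).reset_index(drop=True),
    label_name=IN_BINARY_CLASS,
)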

In [23]:
lab.find_issues(pred_probs=y_pred_proba)

Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided pred_probs ...
Audit complete. 886 issues found in the dataset.
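To build intuition for how out-of-fold probabilities can point at suspicious labels, the sketch below computes a "self-confidence" score: the probability the model assigns to the label each row was given. This is a simplified illustration with toy numbers, not cleanlab's implementation. To my understanding, cleanlab's label quality score defaults to a similar quantity, but its decision about which rows to flag uses confident learning with per-class thresholds rather than a simple ranking.

```python
import numpy as np


def self_confidence(labels, pred_probs):
    """Probability assigned to each row's given label (low values look suspicious)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    return pred_probs[np.arange(len(labels)), labels]


# Toy data: 4 rows, 2 classes. Values are illustrative only.
labels = np.array([0, 1, 0, 1])
pred_probs = np.array([
    [0.95, 0.05],
    [0.10, 0.90],
    [0.20, 0.80],   # labelled 0, but the model leans strongly towards 1
    [0.70, 0.30],   # labelled 1, but the model leans towards 0
])

scores = self_confidence(labels, pred_probs)
predicted = pred_probs.argmax(axis=1)
# Indices of the two rows with the lowest self-confidence.
most_suspect = np.argsort(scores)[:2]
print(scores, predicted, most_suspect)
```

In the notebook, `y_pred_proba` and `y_train.to_numpy()` would play the roles of `pred_probs` and `labels`.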


In [24]:
lab.report()

Here is a summary of the different kinds of issues found in the data:

issue_type  num_issues
   outlier         633
     label         253

Dataset Information: num_examples: 49547, num_classes: 2


---------------------- outlier issues ----------------------

About this issue:
	Examples that are very different from the rest of the dataset 
    (i.e. potentially out-of-distribution or rare/anomalous instances).
    

Number of examples with this issue: 633
Overall dataset quality in terms of this issue: 0.2110

Examples representing most severe instances of this issue:
       is_outlier_issue  outlier_score
27032              True   1.200724e-07
17772              True   1.200724e-07
42156              True   2.045688e-06
7105               True   3.316068e-06
33983              True   4.634476e-06


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are

In [59]:
lab.get_issue_summary("label")

,issue_type,score,num_issues
0,label,0.994894,253


In [56]:
y_train_clean_labels = lab.get_issues("label")[['predicted_label', 'is_label_issue']]
y_train_outlier = lab.get_issues("outlier")[['is_outlier_issue']]

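# Reset to a 0..n-1 index so rows line up with the positional index
# returned by lab.get_issues().
df_ples_theme_clean = pd.DataFrame(X_train).reset_index(drop=True)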
df_ples_theme_clean[IN_BINARY_CLASS] = y_train.reset_index(drop=True)
df_ples_theme_clean = df_ples_theme_clean.join(y_train_clean_labels)
df_ples_theme_clean = df_ples_theme_clean.join(y_train_outlier)

In [57]:
df_ples_theme_clean.query("is_label_issue")[['ementa', IN_BINARY_CLASS, 'predicted_label']]

,ementa,in_homenagens_e_datas_comemorativas,predicted_label
63,Denomina Viaduto Ovídio José dos Santos o viad...,0,1
111,"Denomina ""Campus Ceres - Domingos Mendes da Si...",0,1
269,"Institui o Dia do Evangélico, determinando fer...",0,1
335,Cria no âmbito do Ministério da Cultura o Prêm...,1,0
750,"Denomina ""Aeroporto Internacional de Macapá Ja...",0,1
...,...,...,...
48628,Inscreve o nome de Osvaldo Cruz no Livro dos H...,0,1
48685,Denomina Açude José Holanda Cunha a Barragem C...,0,1
48734,"Denomina ""Aeroporto do Cacau Escritor Jorge Am...",0,1
48737,Institui o Dia Nacional de Combate ao Alcoolismo.,0,1
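Most flagged rows above are naming bills or commemorative dates labelled 0, which suggests missing theme tags rather than model errors. Two common ways to act on such flags are dropping the flagged rows or replacing their labels with the suggested ones. The sketch below shows both on a toy frame; the column name `label` and the placeholder texts are assumptions for illustration, and neither option is presented as the notebook's chosen approach.

```python
import pandas as pd

# Toy stand-in for the joined frame built above (placeholder texts and labels).
df = pd.DataFrame({
    "ementa": ["text a", "text b", "text c", "text d"],
    "label": [0, 0, 0, 1],
    "predicted_label": [1, 1, 0, 0],
    "is_label_issue": [True, True, False, True],
})

# Option 1: drop every row flagged as a label issue.
df_dropped = df.loc[~df["is_label_issue"]].reset_index(drop=True)

# Option 2: keep all rows, but overwrite flagged labels with the suggestion.
df_relabelled = df.copy()
flagged = df_relabelled["is_label_issue"]
df_relabelled.loc[flagged, "label"] = df_relabelled.loc[flagged, "predicted_label"]

print(len(df_dropped))
print(df_relabelled["label"].tolist())
```

Relabelling trusts the model's suggestion, so a manual review of the flagged rows is advisable before applying it; whichever option is used, the test split should stay untouched so the evaluation remains comparable.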
