# Natural Language Processing

**Prof. Dr. Hilário Thomaz Alves de Oliveira**  
**Graduate Program in Intelligent Application Development**  
**Natural Language Processing: Project 01 - Classification of Court Decisions**  

**Name:** Otávio Lube dos Santos  
**Student ID:** 20231DEVAI0157

In [1]:
!pip install datasets



In [2]:
from datasets import load_dataset


In [3]:
dataset = load_dataset('joelniklaus/brazilian_court_decisions')

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['process_number', 'orgao_julgador', 'publish_date', 'judge_relator', 'ementa_text', 'decision_description', 'judgment_text', 'judgment_label', 'unanimity_text', 'unanimity_label'],
        num_rows: 3234
    })
    validation: Dataset({
        features: ['process_number', 'orgao_julgador', 'publish_date', 'judge_relator', 'ementa_text', 'decision_description', 'judgment_text', 'judgment_label', 'unanimity_text', 'unanimity_label'],
        num_rows: 404
    })
    test: Dataset({
        features: ['process_number', 'orgao_julgador', 'publish_date', 'judge_relator', 'ementa_text', 'decision_description', 'judgment_text', 'judgment_label', 'unanimity_text', 'unanimity_label'],
        num_rows: 405
    })
})

In [5]:
dataset['train'][0]

{'process_number': '0800304-08.2018.8.02.0000',
 'orgao_julgador': 'Tribunal Pleno',
 'publish_date': '12/03/2019',
 'judge_relator': 'Des. João Luiz Azevedo Lessa',
 'ementa_text': 'DIREITO PENAL E PROCESSUAL PENAL. REVISÃO CRIMINAL. ART. 621 DO CÓDIGO DE PROCESSO PENAL. REQUERENTE CONDENADO EM JÚRI POPULAR PELA PRÁTICA DOS CRIMES DE HOMICÍDIO DUPLAMENTE QUALIFICADO E HOMICÍDIO QUALIFICADO TENTADO. PLEITO DE REFAZIMENTO DA DOSIMETRIA DA PENA IMPOSTA AO REQUERENTE. ADMISSIBILIDADE NA VIA REVISIONAL. PRECEDENTES. ALEGAÇÃO DE ERRO NO PROCESSO DE DOSIMETRIA DA PENA. COMPORTAMENTO DA VÍTIMA. CIRCUNSTÂNCIA JUDICIAL NEUTRA QUE NÃO PODE SER CONSIDERADA DE FORMA DESFAVORÁVEL AO SENTENCIANDO SEGUNDO PRECEDENTES DO SUPERIOR TRIBUNAL DE JUSTIÇA E NOVO ENTENDIMENTO DA CÂMARA CRIMINAL DESTE TRIBUNAL DE JUSTIÇA. AFASTAMENTO. CULPABILIDADE. AUSÊNCIA DE EXPOSIÇÃO DE MOTIVOS PARA O INCREMENTO DA PENA-BASE. AFASTADO O DESVALOR. VALORAÇÃO ATRIBUÍDA ÀS CIRCUNSTÂNCIAS DO CRIME MANTIDA. FUNDAMENTAÇÃO IDÔNEA

In [6]:
train_texts = dataset['train']['decision_description']
train_labels = dataset['train']['judgment_label']

test_texts = dataset['test']['decision_description']
test_labels = dataset['test']['judgment_label']

print(f'\nTrain size: {len(train_texts)} -- {len(train_labels)}')
print(f'Test size: {len(test_texts)} -- {len(test_labels)}')


Train size: 3234 -- 3234
Test size: 405 -- 405


In [7]:
from collections import Counter

print(f'Train Labels Distribution: {Counter(train_labels)}')
print(f'Test Labels Distribution: {Counter(test_labels)}')

Train Labels Distribution: Counter({'no': 1960, 'partial': 677, 'yes': 597})
Test Labels Distribution: Counter({'no': 234, 'partial': 93, 'yes': 78})


In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

label_encoder.fit(train_labels)

train_labels = label_encoder.transform(train_labels)
test_labels = label_encoder.transform(test_labels)

print(f'Train Labels Distribution: {Counter(train_labels)}')
print(f'Test Labels Distribution: {Counter(test_labels)}')

Train Labels Distribution: Counter({np.int64(0): 1960, np.int64(1): 677, np.int64(2): 597})
Test Labels Distribution: Counter({np.int64(0): 234, np.int64(1): 93, np.int64(2): 78})
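The counters above show integer classes instead of the original `no`, `partial`, and `yes` strings. As a minimal sketch of how the mapping can be recovered, the snippet below uses a small illustrative label list (`toy_labels` is a stand-in, not the dataset) and relies on `LabelEncoder` storing its classes in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the dataset's judgment_label values (illustrative only).
toy_labels = ['yes', 'no', 'partial', 'no']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is sorted, so index i is the integer assigned to classes_[i].
label_mapping = {int(i): str(name) for i, name in enumerate(encoder.classes_)}
print(label_mapping)

# inverse_transform recovers the original strings from the integers.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Applied to the notebook's `label_encoder`, the same `classes_` attribute would tell which row of each classification report corresponds to which judgment label.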


In [9]:
!python -m spacy download pt_core_news_sm

Collecting pt-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.8.0/pt_core_news_sm-3.8.0-py3-none-any.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 3.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')


In [10]:
import spacy

from tqdm import tqdm

def preprocess_texts(list_texts):
    # Load the Portuguese pipeline; NER is disabled because only lemmas and POS tags are used.
    nlp = spacy.load('pt_core_news_sm', disable=['ner'])
    new_texts = []
    for text in tqdm(list_texts, desc='Preprocessing'):
        doc = nlp(text)
        # Keep lowercase lemmas, dropping punctuation and stop words.
        tokens = [t.lemma_.lower() for t in doc if t.pos_ != 'PUNCT' and not t.is_stop]
        new_texts.append(' '.join(tokens))
    return new_texts

In [11]:
train_texts = preprocess_texts(train_texts)

Preprocessing: 100%|██████████| 3234/3234 [00:18<00:00, 177.10it/s]


In [12]:
test_texts = preprocess_texts(test_texts)

Preprocessing: 100%|██████████| 405/405 [00:02<00:00, 169.37it/s]


In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer_option = 'binary'

vectorizer = None

if vectorizer_option == 'binary':
    vectorizer = CountVectorizer(binary=True, max_features=None, ngram_range=(1, 1))
elif vectorizer_option == 'count':
    vectorizer = CountVectorizer(binary=False, max_features=None, ngram_range=(1, 1))
elif vectorizer_option == 'tf_idf':
    vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1, 1))

print(f'Vectorizer Option: {vectorizer_option}')

Vectorizer Option: binary
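To make the difference between the three `vectorizer_option` values concrete, here is a minimal sketch on a two-document toy corpus (the corpus is illustrative and not taken from the dataset). It uses default settings for both vectorizers, which matches the `ngram_range=(1, 1)` and `max_features=None` used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only); 'recurso' appears twice in the first document.
toy_corpus = ['recurso recurso negado', 'recurso provido']

binary_matrix = CountVectorizer(binary=True).fit_transform(toy_corpus).toarray()
count_matrix = CountVectorizer(binary=False).fit_transform(toy_corpus).toarray()
tfidf_matrix = TfidfVectorizer().fit_transform(toy_corpus).toarray()

# Columns follow the alphabetically sorted vocabulary: negado, provido, recurso.
print(binary_matrix)   # presence/absence only
print(count_matrix)    # raw term counts
print(tfidf_matrix.round(3))  # counts reweighted by inverse document frequency
```

The binary option discards how often a term repeats, the count option keeps it, and TF-IDF additionally down-weights terms that occur in most documents.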


In [14]:
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

print(f'\nExample Raw Text: {train_texts[0]}')
print(f'\nExample Vectorized Text: {X_train[0]}')


Example Raw Text: direito penal processual penal revisão criminal artigo 621 código processo penal requerente condenado júri popular prática crimes homicídio duplamente qualificado homicídio qualificado tentado pleito refazimento dosimetria pena imposta requerente admissibilidade via revisional precedentes alegação erro processo dosimetria pena comportamento vítima circunstância judicial neutra considerada desfavorável sentenciando precedentes superior tribunal justiça entendimento câmara criminal tribunal justiça afastamento culpabilidade ausência exposição motivos incremento pena-base afastado desvalor valoração atribuída circunstâncias crime mantida fundamentação idônea pena-base reduzida compensação agravante motivação torpe atenuante confissão espontânea pena privativa liberdade redimensionada crime tentado aplicada fração redutora máxima ante distância atos praticados requerente consumação crime pena redimensionada

Example Vectorized Text: [0 0 0 ... 0 0 0]


In [15]:
print(f'Vocabulary: {len(vectorizer.vocabulary_)}')

Vocabulary: 12453


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import PassiveAggressiveClassifier
from xgboost import XGBClassifier

In [17]:
classifiers = {
    'Logistic Regression': LogisticRegression(class_weight='balanced', max_iter=500),
    'Multinomial NB': MultinomialNB(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Quadratic Discriminant': QuadraticDiscriminantAnalysis(),
    'Passive Aggressive': PassiveAggressiveClassifier(),
    'MLP Classifier': MLPClassifier()
}
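Only the Logistic Regression entry sets `class_weight='balanced'`. As a rough sketch of what that option does, the snippet below rebuilds a label vector from the training class counts printed earlier (1960, 677, 597) and computes the weights with scikit-learn's `compute_class_weight`. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the training class counts printed earlier.
class_counts = {0: 1960, 1: 677, 2: 597}
y = np.repeat(list(class_counts.keys()), list(class_counts.values()))

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1, 2]), y=y)

# 'balanced' follows n_samples / (n_classes * count_per_class).
manual = len(y) / (len(class_counts) * np.array(list(class_counts.values())))

print(dict(zip(class_counts, weights.round(3))))
```

The majority class receives a weight below 1 and the two minority classes receive weights above 1, so errors on the rarer classes cost more during training.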

In [19]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support

results_df = pd.DataFrame(columns=['Classificador', 'Acurácia', 'Precisão', 'Recall', 'F1-Score'])

for classifier_name, classifier in classifiers.items():
    print(f'\nClassifier: {classifier_name}')

    classifier.fit(X_train, train_labels)

    y_pred = classifier.predict(X_test)

    # Per-class precision, recall, and F1 on the test set.
    print(classification_report(test_labels, y_pred))

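    # Optional confusion matrix plot (needs ConfusionMatrixDisplay from sklearn.metrics and matplotlib.pyplot as plt):
    # ConfusionMatrixDisplay.from_estimator(classifier, X_test, test_labels, display_labels=label_encoder.classes_)
    # plt.show()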

    accuracy = accuracy_score(test_labels, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(test_labels, y_pred, average='weighted')

    results_df.loc[len(results_df)] = {
        'Classificador': classifier_name,
        'Acurácia': accuracy,
        'Precisão': precision,
        'Recall': recall,
        'F1-Score': f1
    }

best_accuracy = results_df.loc[results_df['Acurácia'].idxmax(), 'Classificador']
results_df['Acurácia'] = results_df.apply(lambda x: f'**{x["Acurácia"] * 100:.3f}**' if x['Classificador'] == best_accuracy else f'{x["Acurácia"] * 100:.3f}', axis=1)

best_precision = results_df.loc[results_df['Precisão'].idxmax(), 'Classificador']
results_df['Precisão'] = results_df.apply(lambda x: f'**{x["Precisão"] * 100:.3f}**' if x['Classificador'] == best_precision else f'{x["Precisão"] * 100:.3f}', axis=1)

best_recall = results_df.loc[results_df['Recall'].idxmax(), 'Classificador']
results_df['Recall'] = results_df.apply(lambda x: f'**{x["Recall"] * 100:.3f}**' if x['Classificador'] == best_recall else f'{x["Recall"] * 100:.3f}', axis=1)

best_f1 = results_df.loc[results_df['F1-Score'].idxmax(), 'Classificador']
results_df['F1-Score'] = results_df.apply(lambda x: f'**{x["F1-Score"] * 100:.3f}**' if x['Classificador'] == best_f1 else f'{x["F1-Score"] * 100:.3f}', axis=1)

results_df.to_excel(f'resultados_classificadores_{vectorizer_option}.xlsx', index=False)

print(results_df)



Classifier: Logistic Regression
              precision    recall  f1-score   support

           0       0.82      0.83      0.82       234
           1       0.65      0.67      0.66        93
           2       0.62      0.56      0.59        78

    accuracy                           0.74       405
   macro avg       0.69      0.69      0.69       405
weighted avg       0.74      0.74      0.74       405
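
The `macro avg` and `weighted avg` rows differ because the classes are imbalanced. As a small worked check, the sketch below recomputes both F1 averages from the per-class values and supports in the Logistic Regression report above. The inputs are the rounded figures from the report, so the results are approximate.

```python
import numpy as np

# Per-class F1 and support copied from the Logistic Regression report above.
f1_per_class = np.array([0.82, 0.66, 0.59])
support = np.array([234, 93, 78])

# Macro average: every class counts equally.
macro_f1 = f1_per_class.mean()

# Weighted average: each class counts in proportion to its support.
weighted_f1 = np.average(f1_per_class, weights=support)

print(f'macro F1: {macro_f1:.2f}')
print(f'weighted F1: {weighted_f1:.2f}')
```

Since the results table built in the notebook uses `average='weighted'`, its scores are dominated by class 0. The macro average is the more conservative summary of performance on the two minority classes.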


Classifier: Multinomial NB
              precision    recall  f1-score   support

           0       0.79      0.65      0.71       234
           1       0.47      0.61      0.53        93
           2       0.48      0.56      0.52        78

    accuracy                           0.62       405
   macro avg       0.58      0.61      0.59       405
weighted avg       0.65      0.62      0.63       405


Classifier: KNN
              precision    recall  f1-score   support

           0       0.69      0.89      0.78       234
           1       0.74      0.34      0.47        



              precision    recall  f1-score   support

           0       0.88      0.10      0.18       234
           1       0.26      0.40      0.32        93
           2       0.21      0.65      0.32        78

    accuracy                           0.27       405
   macro avg       0.45      0.38      0.27       405
weighted avg       0.61      0.27      0.24       405


Classifier: Passive Aggressive
              precision    recall  f1-score   support

           0       0.79      0.84      0.81       234
           1       0.66      0.66      0.66        93
           2       0.60      0.49      0.54        78

    accuracy                           0.73       405
   macro avg       0.68      0.66      0.67       405
weighted avg       0.72      0.73      0.72       405


Classifier: MLP Classifier
              precision    recall  f1-score   support

           0       0.79      0.87      0.83       234
           1       0.68      0.66      0.67        93
           2   