<a href="https://colab.research.google.com/github/crux82/ganbert-pytorch/blob/main/GANBERT_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GAN-BERT (in PyTorch and compatible with Hugging Face)

This is a PyTorch (+ **Hugging Face** transformers) implementation of the GAN-BERT model from https://github.com/crux82/ganbert. While the original GAN-BERT was an extension of BERT, this implementation can be adapted to several architectures, ranging from RoBERTa to ALBERT.

**NOTE**: since this implementation differs from the original TensorFlow one, some results can be slightly different.


Let's GO!

Required Imports.

In [46]:
import time

start_time = time.time()
!pip install transformers==4.3.2
import torch
import io
import torch.nn.functional as F
import random
import numpy as np
import math
import datetime
import torch.nn as nn
# Explicit imports instead of `from transformers import *`
from transformers import AutoConfig, AutoModel, AutoTokenizer, get_constant_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
#!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
#!pip install sentencepiece
import nltk
from nltk.tokenize import sent_tokenize

# Download the NLTK punkt resource (only needed the first time)
nltk.download("punkt")


##Set random values
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
if torch.cuda.is_available():
  torch.cuda.manual_seed_all(seed_val)


Collecting transformers==4.3.2
  Using cached transformers-4.3.2-py3-none-any.whl.metadata (36 kB)
Collecting sacremoses (from transformers==4.3.2)
  Using cached sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.3.2)
  Using cached tokenizers-0.10.3.tar.gz (212 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Using cached transformers-4.3.2-py3-none-any.whl (1.8 MB)
Using cached sacremoses-0.1.1-py3-none-any.whl (897 kB)
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [47]:
# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: NVIDIA L4


### Input Parameters


In [48]:
#--------------------------------
#  Transformer parameters
#--------------------------------
max_seq_length = 64  # maximum length of a token sequence
batch_size = 64  # number of training examples processed per batch

#--------------------------------
#  GAN-BERT specific parameters
#--------------------------------
# number of hidden layers in the generator,
# each of the size of the output space
num_hidden_layers_g = 1
# number of hidden layers in the discriminator,
# each of the size of the input space
num_hidden_layers_d = 1
# size of the generator's input noisy vectors
noise_size = 100
# dropout to be applied to discriminator's input vectors
out_dropout_rate = 0.1

# Replicate labeled data to balance poorly represented datasets,
# e.g., less than 1% of labeled material
apply_balance = True

#--------------------------------
#  Optimization parameters
#--------------------------------
learning_rate_discriminator = 5e-5
learning_rate_generator = 5e-5
epsilon = 1e-8  # avoids log(0) in the loss terms
num_train_epochs = 10  # number of epochs to train the model
multi_gpu = True
# Scheduler
apply_scheduler = True
warmup_proportion = 0.1
# Print
print_each_n_step = 10

#--------------------------------
#  Adopted Transformer model
#--------------------------------
# Since this version is compatible with Huggingface transformers, you can uncomment
# (or add) transformer models compatible with GAN

model_name = "bert-base-cased"
#model_name = "bert-base-uncased"
#model_name = "roberta-base"
#model_name = "albert-base-v2"
#model_name = "xlm-roberta-base"
#model_name = "amazon/bort"

#--------------------------------
#  Retrieve the dataset
#--------------------------------
! git clone https://github.com/mauriciokonrath/ganbert.git

#  NOTE: in this setting 5 classes are involved (see label_list below)
labeled_file = "./ganbert/data/standardized_labeled_monsanto_withoutSub.tsv"
unlabeled_file = "./ganbert/data/unlabeled_enron_5000.tsv"
test_filename = "./ganbert/data/standardized_test_monsanto_withoutSub.tsv"

# Label categories the model must learn to classify.
label_list = ["UNK_UNK","GHOST_ghost", "TOXIC_toxic",
              "CHEMI_chemi", "REGUL_regul"]

fatal: destination path 'ganbert' already exists and is not an empty directory.
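The optimization parameters above determine how long the learning-rate warm-up lasts. The following is a minimal sketch of that arithmetic, mirroring the formulas used later in the training cell. The helper names `warmup_steps` and `warmup_multiplier` are hypothetical, and the value of 5600 training examples is an assumption chosen for illustration, not the size of this dataset. The multiplier mimics a constant schedule with linear warm-up.

```python
def warmup_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Return (total scheduler steps, warm-up steps) as computed in the training cell."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def warmup_multiplier(step, num_warmup_steps):
    """Learning-rate multiplier: linear ramp during warm-up, then constant 1.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0


# 5600 examples is a made-up figure used only to make the arithmetic concrete.
total, warmup = warmup_steps(5600, batch_size=64, num_epochs=10, warmup_proportion=0.1)
print(total, warmup)  # 875 87
print([round(warmup_multiplier(s, warmup), 2) for s in (0, 43, 87, 200)])
```

With these assumed numbers the learning rate reaches its full value after 87 optimizer steps and stays there for the rest of training.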


Load the Transformer model

In [49]:
transformer = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.47.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e/model.safetenso

Function required to load the dataset

In [50]:
# Read a text file containing question data and create training or development examples
def get_qc_examples(input_file):
  """Creates examples for the training and dev sets."""
  examples = []

  with open(input_file, 'r') as f:
      contents = f.read()
      file_as_list = contents.splitlines()
      # Skip the header line
      for line in file_as_list[1:]:
          # The first space-separated token is the label, the rest is the text
          split = line.split(" ")
          question = ' '.join(split[1:])

          text_a = question
          # Labels have the form COARSE:fine and are stored as COARSE_fine
          inn_split = split[0].split(":")
          label = inn_split[0] + "_" + inn_split[1]
          examples.append((text_a, label))
      # No explicit f.close() is needed: the with-block closes the file

  return examples
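To make the expected line format concrete, here is a minimal sketch of the per-line parsing that the loader performs. The helper name `parse_qc_line` and the sample line are made up for illustration; they are not taken from the dataset.

```python
def parse_qc_line(line):
    """Split one 'COARSE:fine text ...' line into (text, 'COARSE_fine')."""
    tokens = line.split(" ")
    label_parts = tokens[0].split(":")
    label = label_parts[0] + "_" + label_parts[1]
    text = " ".join(tokens[1:])
    return text, label


# Hypothetical input line in the same shape the loader expects.
sample = "REGUL:regul Please review the new labeling guidelines"
print(parse_qc_line(sample))
```

A line whose first token contains no colon would raise an `IndexError` here, as it would in the loader.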

**Load** the input QC dataset (fine-grained)

In [51]:
#Load the examples
labeled_examples = get_qc_examples(labeled_file)
unlabeled_examples = get_qc_examples(unlabeled_file)
test_examples = get_qc_examples(test_filename)

Functions required to convert examples into Dataloader

In [52]:
def generate_data_loader(input_examples, label_masks, label_map, do_shuffle = False, balance_label_examples = False):
  '''
  Generate a DataLoader from the input examples; examples whose mask is False
  are treated as unlabeled.
  '''
  examples = []

  # Count the percentage of labeled examples
  num_labeled_examples = 0
  for label_mask in label_masks:
    if label_mask:
      num_labeled_examples += 1
  label_mask_rate = num_labeled_examples/len(input_examples)

  # if required it applies the balance
  for index, ex in enumerate(input_examples):
    if label_mask_rate == 1 or not balance_label_examples:
      examples.append((ex, label_masks[index]))
    else:
      # Replicate each labeled example (about log2(1/label_mask_rate) copies)
      if label_masks[index]:
        balance = int(1/label_mask_rate)
        balance = int(math.log(balance,2))
        if balance < 1:
          balance = 1
        for b in range(0, int(balance)):
          examples.append((ex, label_masks[index]))
      else:
        examples.append((ex, label_masks[index]))

  #-----------------------------------------------
  # Generate input examples to the Transformer
  #-----------------------------------------------
  input_ids = []
  input_mask_array = []
  label_mask_array = []
  label_id_array = []

  # Tokenization
  for (text, label_mask) in examples:
    encoded_sent = tokenizer.encode(text[0], add_special_tokens=True, max_length=max_seq_length, padding="max_length", truncation=True)
    input_ids.append(encoded_sent)
    label_id_array.append(label_map[text[1]])
    label_mask_array.append(label_mask)

  # Attention to token (to ignore padded input wordpieces)
  for sent in input_ids:
    att_mask = [int(token_id > 0) for token_id in sent]
    input_mask_array.append(att_mask)
  # Conversion to tensors
  input_ids = torch.tensor(input_ids)
  input_mask_array = torch.tensor(input_mask_array)
  label_id_array = torch.tensor(label_id_array, dtype=torch.long)
  label_mask_array = torch.tensor(label_mask_array)

  # Building the TensorDataset
  dataset = TensorDataset(input_ids, input_mask_array, label_id_array, label_mask_array)

  if do_shuffle:
    sampler = RandomSampler
  else:
    sampler = SequentialSampler

  # Building the DataLoader
  return DataLoader(
              dataset,  # The training samples.
              sampler = sampler(dataset),
              batch_size = batch_size) # Trains with this batch size.

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

Prepares input data

In [53]:
# MODIFIED: labels are cleaned and unknown labels fall back to 'UNK_UNK'
def generate_data_loader(input_examples, label_masks, label_map, do_shuffle = False, balance_label_examples = False):
  '''
  Generate a DataLoader from the input examples; examples whose mask is False
  are treated as unlabeled.
  '''
  examples = []

  # Count the percentage of labeled examples
  num_labeled_examples = 0
  for label_mask in label_masks:
    if label_mask:
      num_labeled_examples += 1
  label_mask_rate = num_labeled_examples/len(input_examples)


  # if required it applies the balance
  for index, ex in enumerate(input_examples):
    if label_mask_rate == 1 or not balance_label_examples:
      examples.append((ex, label_masks[index]))
    else:
      # Replicate each labeled example (about log2(1/label_mask_rate) copies)
      if label_masks[index]:
        balance = int(1/label_mask_rate)
        balance = int(math.log(balance,2))
        if balance < 1:
          balance = 1
        for b in range(0, int(balance)):
          examples.append((ex, label_masks[index]))
      else:
        examples.append((ex, label_masks[index]))

  #-----------------------------------------------
  # Generate input examples to the Transformer
  #-----------------------------------------------
  input_ids = []
  input_mask_array = []
  label_mask_array = []
  label_id_array = []

  # Tokenization
  for (text, label_mask) in examples:
    # Clean up the label by removing extraneous characters and text
    label = text[1].split('\t')[0]
    encoded_sent = tokenizer.encode(text[0], add_special_tokens=True, max_length=max_seq_length, padding="max_length", truncation=True)
    input_ids.append(encoded_sent)

    # Check if the label is in the label_map
    if label in label_map:
      label_id_array.append(label_map[label])
    else:
      # Handle the case where the label is not in label_map
      # Here we assign the label 'UNK_UNK' if not found
      label_id_array.append(label_map['UNK_UNK'])

    label_mask_array.append(label_mask)

  # Attention to token (to ignore padded input wordpieces)
  for sent in input_ids:
    att_mask = [int(token_id > 0) for token_id in sent]
    input_mask_array.append(att_mask)
  # Conversion to tensors
  input_ids = torch.tensor(input_ids)
  input_mask_array = torch.tensor(input_mask_array)
  label_id_array = torch.tensor(label_id_array, dtype=torch.long)
  label_mask_array = torch.tensor(label_mask_array)

  # Building the TensorDataset
  dataset = TensorDataset(input_ids, input_mask_array, label_id_array, label_mask_array)

  if do_shuffle:
    sampler = RandomSampler
  else:
    sampler = SequentialSampler

  # Building the DataLoader
  return DataLoader(
              dataset,  # The training samples.
              sampler = sampler(dataset),
              batch_size = batch_size) # Trains with this batch size.

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
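The balancing step in `generate_data_loader` replicates each labeled example a number of times that grows with the logarithm of how rare labeled data is. A minimal sketch of that computation follows; `replication_factor` is a hypothetical helper that isolates the arithmetic, and the counts passed to it are made-up values.

```python
import math


def replication_factor(num_labeled, num_total):
    """Number of copies kept for each labeled example when balancing is on."""
    if num_labeled == 0 or num_labeled == num_total:
        # Nothing to replicate, or everything is labeled: keep one copy.
        return 1
    label_mask_rate = num_labeled / num_total
    balance = int(1 / label_mask_rate)
    balance = int(math.log(balance, 2))
    return max(balance, 1)


# Hypothetical dataset sizes: (labeled, total).
for labeled, total in [(1, 100), (50, 100), (100, 5100), (100, 100)]:
    print(labeled, total, replication_factor(labeled, total))
```

Because of the logarithm, even a dataset with only 1% labeled examples replicates each of them just a handful of times, so the unlabeled material still dominates each epoch.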

Convert the input examples into DataLoader

In [54]:
label_map = {}
for (i, label) in enumerate(label_list):
  label_map[label] = i
#------------------------------
#   Load the train dataset
#------------------------------
train_examples = labeled_examples
#The labeled (train) dataset is assigned with a mask set to True
train_label_masks = np.ones(len(labeled_examples), dtype=bool)
#If unlabel examples are available
if unlabeled_examples:
  train_examples = train_examples + unlabeled_examples
  #The unlabeled (train) dataset is assigned with a mask set to False
  tmp_masks = np.zeros(len(unlabeled_examples), dtype=bool)
  train_label_masks = np.concatenate([train_label_masks,tmp_masks])

train_dataloader = generate_data_loader(train_examples, train_label_masks, label_map, do_shuffle = True, balance_label_examples = apply_balance)

#------------------------------
#   Load the test dataset
#------------------------------
#The labeled (test) dataset is assigned with a mask set to True
test_label_masks = np.ones(len(test_examples), dtype=bool)

test_dataloader = generate_data_loader(test_examples, test_label_masks, label_map, do_shuffle = False, balance_label_examples = False)
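The cell above pairs every example with a boolean mask: `True` for labeled data and `False` for unlabeled data. The sketch below shows the same idea with plain Python lists instead of NumPy arrays. The helper name `build_training_set` and the example texts are hypothetical.

```python
def build_training_set(labeled_examples, unlabeled_examples):
    """Concatenate labeled and unlabeled examples and build the matching masks."""
    examples = list(labeled_examples)
    masks = [True] * len(labeled_examples)
    if unlabeled_examples:
        examples += list(unlabeled_examples)
        masks += [False] * len(unlabeled_examples)
    return examples, masks


# Made-up examples in the (text, label) shape used by the notebook.
labeled = [("text a", "GHOST_ghost"), ("text b", "TOXIC_toxic")]
unlabeled = [("text c", "UNK_UNK")]

examples, masks = build_training_set(labeled, unlabeled)
for (text, label), is_labeled in zip(examples, masks):
    print(text, label, is_labeled)
```

During training, the mask decides which examples contribute to the supervised part of the discriminator loss; unlabeled examples only feed the real-versus-fake terms.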



We define the Generator and Discriminator as discussed in https://www.aclweb.org/anthology/2020.acl-main.191/

In [55]:
#------------------------------
#   The Generator as in
#   https://www.aclweb.org/anthology/2020.acl-main.191/
#   https://github.com/crux82/ganbert
#------------------------------
class Generator(nn.Module):
    def __init__(self, noise_size=100, output_size=512, hidden_sizes=[512], dropout_rate=0.1):
        super(Generator, self).__init__()
        layers = []
        hidden_sizes = [noise_size] + hidden_sizes
        for i in range(len(hidden_sizes)-1):
            layers.extend([nn.Linear(hidden_sizes[i], hidden_sizes[i+1]), nn.LeakyReLU(0.2, inplace=True), nn.Dropout(dropout_rate)])

        layers.append(nn.Linear(hidden_sizes[-1],output_size))
        self.layers = nn.Sequential(*layers)

    def forward(self, noise):
        output_rep = self.layers(noise)
        return output_rep

#------------------------------
#   The Discriminator
#   https://www.aclweb.org/anthology/2020.acl-main.191/
#   https://github.com/crux82/ganbert
#------------------------------
class Discriminator(nn.Module):
    def __init__(self, input_size=512, hidden_sizes=[512], num_labels=2, dropout_rate=0.1):
        super(Discriminator, self).__init__()
        self.input_dropout = nn.Dropout(p=dropout_rate)
        layers = []
        hidden_sizes = [input_size] + hidden_sizes
        for i in range(len(hidden_sizes)-1):
            layers.extend([nn.Linear(hidden_sizes[i], hidden_sizes[i+1]), nn.LeakyReLU(0.2, inplace=True), nn.Dropout(dropout_rate)])

        self.layers = nn.Sequential(*layers)  # for flattening
        self.logit = nn.Linear(hidden_sizes[-1],num_labels+1) # +1 for the probability of this sample being fake/real.
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, input_rep):
        input_rep = self.input_dropout(input_rep)
        last_rep = self.layers(input_rep)
        logits = self.logit(last_rep)
        probs = self.softmax(logits)
        return last_rep, logits, probs
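The discriminator emits `num_labels + 1` logits: one per real class, plus a final slot for "this representation is fake". The following pure-Python sketch, with made-up logits and a hypothetical helper name, shows how the two pieces of information are read from one output vector. It illustrates the indexing convention only and does not reproduce the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_discriminator_output(logits):
    """Return (index of the best real class, probability of the fake slot)."""
    probs = softmax(logits)
    real_logits = logits[:-1]  # drop the fake slot before picking a class
    best_real = max(range(len(real_logits)), key=lambda i: real_logits[i])
    return best_real, probs[-1]


# Made-up logits: 5 real classes followed by the fake slot.
example_logits = [0.2, 2.0, -1.0, 0.5, 0.1, 3.0]
best, p_fake = split_discriminator_output(example_logits)
print(best, round(p_fake, 3))
```

This matches how the notebook uses `probs[:, -1]` in the adversarial losses and `logits[:, :-1]` when predicting a class.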

We instantiate the Discriminator and Generator

In [56]:
# The config file is required to get the dimension of the vector produced by
# the underlying transformer
config = AutoConfig.from_pretrained(model_name)
hidden_size = int(config.hidden_size)
# Define the number and width of hidden layers
hidden_levels_g = [hidden_size for i in range(0, num_hidden_layers_g)]
hidden_levels_d = [hidden_size for i in range(0, num_hidden_layers_d)]

#-------------------------------------------------
#   Instantiate the Generator and Discriminator
#-------------------------------------------------
generator = Generator(noise_size=noise_size, output_size=hidden_size, hidden_sizes=hidden_levels_g, dropout_rate=out_dropout_rate)
discriminator = Discriminator(input_size=hidden_size, hidden_sizes=hidden_levels_d, num_labels=len(label_list), dropout_rate=out_dropout_rate)

# Put everything in the GPU if available
if torch.cuda.is_available():
  generator.cuda()
  discriminator.cuda()
  transformer.cuda()
  if multi_gpu:
    transformer = torch.nn.DataParallel(transformer)

# print(config)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.47.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



Let's go with the training procedure

In [57]:
import time
import torch
import torch.nn.functional as F
import numpy as np

# Manually compute macro-averaged Recall, Precision and F1-Score for multiple classes
def calculate_recall_precision_f1_multiclass(y_true, y_pred):
    # Unique classes
    classes = np.unique(y_true)

    recalls = []
    precisions = []
    f1_scores = []

    for cls in classes:
        # True Positives, False Positives, False Negatives for the current class
        tp = np.sum((y_true == cls) & (y_pred == cls))
        fp = np.sum((y_true != cls) & (y_pred == cls))
        fn = np.sum((y_true == cls) & (y_pred != cls))

        # Recall
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        recalls.append(recall)

        # Precision
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        precisions.append(precision)

        # F1 Score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        f1_scores.append(f1)

    # Average of the per-class metrics
    avg_recall = np.mean(recalls)
    avg_precision = np.mean(precisions)
    avg_f1 = np.mean(f1_scores)

    return avg_recall, avg_precision, avg_f1

# Helper to format elapsed time in seconds (overrides the hh:mm:ss version defined earlier)
def format_time(elapsed):
    return str(round(elapsed, 2)) + "s"

training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# Optimizers
transformer_vars = [i for i in transformer.parameters()]
d_vars = transformer_vars + [v for v in discriminator.parameters()]
g_vars = [v for v in generator.parameters()]

dis_optimizer = torch.optim.AdamW(d_vars, lr=learning_rate_discriminator)
gen_optimizer = torch.optim.AdamW(g_vars, lr=learning_rate_generator)

# Scheduler
if apply_scheduler:
    num_train_examples = len(train_examples)
    num_train_steps = int(num_train_examples / batch_size * num_train_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)

    scheduler_d = get_constant_schedule_with_warmup(dis_optimizer, num_warmup_steps=num_warmup_steps)
    scheduler_g = get_constant_schedule_with_warmup(gen_optimizer, num_warmup_steps=num_warmup_steps)

# Model training
for epoch_i in range(0, num_train_epochs):
    print("")
    print(f"======== Epoch {epoch_i + 1} / {num_train_epochs} ========")
    print("Training...")

    t0 = time.time()
    tr_g_loss = 0
    tr_d_loss = 0

    transformer.train()
    generator.train()
    discriminator.train()

    for step, batch in enumerate(train_dataloader):
        if step % print_each_n_step == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(f"  Batch {step:,}  of  {len(train_dataloader):,}.    Elapsed: {elapsed}.")

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        b_label_mask = batch[3].to(device)

        real_batch_size = b_input_ids.shape[0]

        model_outputs = transformer(b_input_ids, attention_mask=b_input_mask)
        hidden_states = model_outputs[-1]

        noise = torch.zeros(real_batch_size, noise_size, device=device).uniform_(0, 1)
        gen_rep = generator(noise)

        discriminator_input = torch.cat([hidden_states, gen_rep], dim=0)
        features, logits, probs = discriminator(discriminator_input)

        features_list = torch.split(features, real_batch_size)
        D_real_features, D_fake_features = features_list

        logits_list = torch.split(logits, real_batch_size)
        D_real_logits, D_fake_logits = logits_list

        probs_list = torch.split(probs, real_batch_size)
        D_real_probs, D_fake_probs = probs_list

        # Generator loss
        g_loss_d = -torch.mean(torch.log(1 - D_fake_probs[:, -1] + epsilon))
        g_feat_reg = torch.mean(torch.pow(torch.mean(D_real_features, dim=0) - torch.mean(D_fake_features, dim=0), 2))
        g_loss = g_loss_d + g_feat_reg

        # Discriminator loss
        logits = D_real_logits[:, :-1]
        log_probs = F.log_softmax(logits, dim=-1)
        label2one_hot = F.one_hot(b_labels, len(label_list))
        per_example_loss = -torch.sum(label2one_hot * log_probs, dim=-1)
        per_example_loss = torch.masked_select(per_example_loss, b_label_mask.to(device))
        labeled_example_count = per_example_loss.numel()

        if labeled_example_count == 0:
            D_L_Supervised = 0
        else:
            D_L_Supervised = torch.sum(per_example_loss) / labeled_example_count

        D_L_unsupervised1U = -torch.mean(torch.log(1 - D_real_probs[:, -1] + epsilon))
        D_L_unsupervised2U = -torch.mean(torch.log(D_fake_probs[:, -1] + epsilon))
        d_loss = D_L_Supervised + D_L_unsupervised1U + D_L_unsupervised2U

        gen_optimizer.zero_grad()
        dis_optimizer.zero_grad()

        g_loss.backward(retain_graph=True)
        d_loss.backward()

        gen_optimizer.step()
        dis_optimizer.step()

        tr_g_loss += g_loss.item()
        tr_d_loss += d_loss.item()

        if apply_scheduler:
            scheduler_d.step()
            scheduler_g.step()

    avg_train_loss_g = tr_g_loss / len(train_dataloader)
    avg_train_loss_d = tr_d_loss / len(train_dataloader)
    training_time = format_time(time.time() - t0)

    print(f"  Average training loss generator: {avg_train_loss_g:.3f}")
    print(f"  Average training loss discriminator: {avg_train_loss_d:.3f}")
    print(f"  Training epoch took: {training_time}")

    # Per-epoch evaluation to compute accuracy
    print("\nRunning Test...")

    all_preds = []
    all_labels_ids = []
    total_test_loss = 0

    transformer.eval()
    discriminator.eval()
    generator.eval()

    for batch in test_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():
            model_outputs = transformer(b_input_ids, attention_mask=b_input_mask)
            hidden_states = model_outputs[-1]
            _, logits, probs = discriminator(hidden_states)
            filtered_logits = logits[:, :-1]
            total_test_loss += F.cross_entropy(filtered_logits, b_labels, ignore_index=-1)

            _, preds = torch.max(filtered_logits, 1)
            all_preds += preds.detach().cpu()
            all_labels_ids += b_labels.detach().cpu()

    # Accuracy computation for each epoch
    all_preds = torch.stack(all_preds).numpy()
    all_labels_ids = torch.stack(all_labels_ids).numpy()
    test_accuracy = np.sum(all_preds == all_labels_ids) / len(all_preds)

    print(f"  Accuracy: {test_accuracy:.3f}")
    avg_test_loss = total_test_loss / len(test_dataloader)
    avg_test_loss = avg_test_loss.item()
    # NOTE: t0 marks the start of the epoch, so this value includes the training time
    test_time = format_time(time.time() - t0)

    print(f"  Test Loss: {avg_test_loss:.3f}")
    print(f"  Test took: {test_time}")

    training_stats.append({
        'epoch': epoch_i + 1,
        'Training Loss generator': avg_train_loss_g,
        'Training Loss discriminator': avg_train_loss_d,
        'Valid. Loss': avg_test_loss,
        'Valid. Accur.': test_accuracy,
        'Training Time': training_time,
        'Test Time': test_time
    })

# Model evaluation at the end of training (uses the predictions from the last epoch)
print("\nFinal Evaluation...")

recall, precision, f1 = calculate_recall_precision_f1_multiclass(all_labels_ids, all_preds)

print(f"Final Recall: {recall:.3f}")
print(f"Final Precision: {precision:.3f}")
print(f"Final F1 Score: {f1:.3f}")



Training...
  Batch 10  of  88.    Elapsed: 4.54s.
  Batch 20  of  88.    Elapsed: 9.13s.
  Batch 30  of  88.    Elapsed: 13.75s.
  Batch 40  of  88.    Elapsed: 18.38s.
  Batch 50  of  88.    Elapsed: 23.01s.
  Batch 60  of  88.    Elapsed: 27.64s.
  Batch 70  of  88.    Elapsed: 32.27s.
  Batch 80  of  88.    Elapsed: 36.89s.
  Average training loss generator: 0.481
  Average training loss discriminator: 2.654
  Training epoch took: 40.49s

Running Test...
  Accuracy: 0.461
  Test Loss: 1.489
  Test took: 40.61s

Training...
  Batch 10  of  88.    Elapsed: 4.57s.
  Batch 20  of  88.    Elapsed: 9.17s.
  Batch 30  of  88.    Elapsed: 13.77s.
  Batch 40  of  88.    Elapsed: 18.34s.
  Batch 50  of  88.    Elapsed: 22.9s.
  Batch 60  of  88.    Elapsed: 27.44s.
  Batch 70  of  88.    Elapsed: 31.96s.
  Batch 80  of  88.    Elapsed: 36.51s.
  Average training loss generator: 0.738
  Average training loss discriminator: 1.089
  Training epoch took: 40.04s

Running Test...
  Accuracy: 0.48
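The unsupervised terms of the losses in the training loop can be checked by hand on a few numbers. The sketch below recomputes them in plain Python from made-up "fake" probabilities. The helper name `unsupervised_losses` is hypothetical, and the sketch deliberately leaves out the supervised cross-entropy and the feature-matching term that the notebook also uses.

```python
import math

EPSILON = 1e-8  # same role as `epsilon` in the notebook: avoids log(0)


def mean(values):
    return sum(values) / len(values)


def unsupervised_losses(fake_prob_on_real, fake_prob_on_generated):
    """Return (D loss on real data, D loss on generated data, G adversarial loss).

    Each input is a list with the probability assigned to the fake slot.
    """
    d_real = -mean([math.log(1 - p + EPSILON) for p in fake_prob_on_real])
    d_fake = -mean([math.log(p + EPSILON) for p in fake_prob_on_generated])
    g_adv = -mean([math.log(1 - p + EPSILON) for p in fake_prob_on_generated])
    return d_real, d_fake, g_adv


# Made-up probabilities: a discriminator that mostly spots the generated samples.
d_real, d_fake, g_adv = unsupervised_losses([0.1, 0.2], [0.9, 0.8])
print(round(d_real, 3), round(d_fake, 3), round(g_adv, 3))
```

When the discriminator is confident that generated samples are fake, its own terms are small and the generator's adversarial term is large. If the generator fools it, the relationship flips.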

In [58]:
for stat in training_stats:
  print(stat)

print("\nTraining complete!")

print("Total training took {:}".format(format_time(time.time()-total_t0)))

resultado = sum(range(10**6))

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.4f} seconds")

{'epoch': 1, 'Training Loss generator': 0.48120969059792434, 'Training Loss discriminator': 2.6536005356095056, 'Valid. Loss': 1.488650918006897, 'Valid. Accur.': 0.4605263157894737, 'Training Time': '40.49s', 'Test Time': '40.61s'}
{'epoch': 2, 'Training Loss generator': 0.7377897981892932, 'Training Loss discriminator': 1.0889499336481094, 'Valid. Loss': 1.9006657600402832, 'Valid. Accur.': 0.4868421052631579, 'Training Time': '40.04s', 'Test Time': '40.16s'}
{'epoch': 3, 'Training Loss generator': 0.7213026149706407, 'Training Loss discriminator': 0.7510863217440519, 'Valid. Loss': 2.4213056564331055, 'Valid. Accur.': 0.5131578947368421, 'Training Time': '39.95s', 'Test Time': '40.07s'}
{'epoch': 4, 'Training Loss generator': 0.7120236415754665, 'Training Loss discriminator': 0.7291198433800177, 'Valid. Loss': 2.631476640701294, 'Valid. Accur.': 0.5263157894736842, 'Training Time': '40.38s', 'Test Time': '40.49s'}
{'epoch': 5, 'Training Loss generator': 0.7061840505762533, 'Training
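The final evaluation in the training cell reports macro-averaged recall, precision and F1, which are not part of the per-epoch statistics printed here. The sketch below reimplements that computation in plain Python on a tiny made-up example so the averaging can be verified by hand. As in the notebook, the average runs over the classes present in `y_true`. The function name is hypothetical.

```python
def macro_recall_precision_f1(y_true, y_pred):
    """Macro-averaged (recall, precision, F1) over the classes present in y_true."""
    recalls, precisions, f1_scores = [], [], []
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
        recalls.append(recall)
        precisions.append(precision)
        f1_scores.append(f1)
    n = len(recalls)
    return sum(recalls) / n, sum(precisions) / n, sum(f1_scores) / n


# Made-up labels: class 2 is never predicted, so it scores 0 on every metric.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
recall, precision, f1 = macro_recall_precision_f1(y_true, y_pred)
print(round(recall, 3), round(precision, 3), round(f1, 3))
```

Macro averaging gives every class the same weight, so a rare class that is never predicted pulls the scores down even when overall accuracy looks acceptable.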

In [59]:
# ADDED
# Save the models
torch.save(generator.state_dict(), 'generator.pt')
torch.save(discriminator.state_dict(), 'discriminator.pt')
torch.save(transformer.state_dict(), 'transformer.pt')


In [60]:
# ADDED
# Load the models
generator.load_state_dict(torch.load('generator.pt'))
discriminator.load_state_dict(torch.load('discriminator.pt'))
transformer.load_state_dict(torch.load('transformer.pt'))




<All keys matched successfully>
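One detail to keep in mind when reusing these checkpoints: with `multi_gpu = True` the transformer is wrapped in `torch.nn.DataParallel`, and the state dict of a wrapped model stores its keys with a `module.` prefix. Loading works here because the same wrapped object is reused, but an unwrapped model would expect keys without the prefix. The sketch below shows one way to strip it. `strip_module_prefix` is a hypothetical helper, demonstrated on a plain dictionary with made-up keys rather than on real weights.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint saved from a DataParallel-wrapped model (made-up keys).
wrapped = {
    "module.embeddings.word_embeddings.weight": "tensor-0",
    "module.pooler.dense.bias": "tensor-1",
}
print(list(strip_module_prefix(wrapped).keys()))
```

Under that assumption, the cleaned dictionary could be passed to `load_state_dict` on a transformer created with `AutoModel.from_pretrained(model_name)` that has not been wrapped.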

Prepare the Tokenizer and Model Configuration

In [61]:
# ADDED
from transformers import AutoTokenizer, AutoModel, AutoConfig

# Name of the model used during training
model_name = "bert-base-cased"

# Load tokenizer and configuration
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
hidden_size = int(config.hidden_size)


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.47.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/cd5ef92a9fb2f889e972770a36d4ed042daf221e/vocab.txt
loading file tokenize

Preprocessing Function

In [62]:
# ADDED
def preprocess(text, max_seq_length=64):
    encoded_sent = tokenizer.encode(text, add_special_tokens=True, max_length=max_seq_length, padding="max_length", truncation=True)
    input_ids = torch.tensor([encoded_sent])
    attention_mask = torch.tensor([[int(token_id > 0) for token_id in encoded_sent]])
    return input_ids, attention_mask
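The mask in `preprocess` marks a position as padding when its token id is 0. That holds for `bert-base-cased`, whose configuration lists `pad_token_id` as 0, but other tokenizers use a different padding id (for RoBERTa it is 1, while id 0 is the start token). A minimal sketch of a mask built from an explicit padding id follows. The helper name is hypothetical and the token ids are illustrative values, not real tokenizer output.

```python
def attention_mask_from_ids(token_ids, pad_token_id):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(token_id != pad_token_id) for token_id in token_ids]


# Illustrative ids only: a BERT-style sequence padded with 0 ...
bert_like = [101, 2627, 1110, 102, 0, 0]
# ... and a RoBERTa-style sequence that starts with id 0 and is padded with 1.
roberta_like = [0, 9226, 16, 2, 1, 1]

print(attention_mask_from_ids(bert_like, pad_token_id=0))
print(attention_mask_from_ids(roberta_like, pad_token_id=1))
```

In the notebook the padding id would come from `tokenizer.pad_token_id`, so the same function would keep working if `model_name` were switched to one of the other listed models.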


Classifying New Questions

In [63]:
# ADDED
def classify_question(text):
    transformer.eval()
    discriminator.eval()

    # Preprocess the new question
    input_ids, attention_mask = preprocess(text)

    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    with torch.no_grad():
        # Run the input through the transformer
        model_outputs = transformer(input_ids, attention_mask=attention_mask)
        hidden_states = model_outputs[-1]

        # Run the representation through the discriminator
        _, logits, probs = discriminator(hidden_states)

        # Drop the last logit (the "fake" slot)
        filtered_logits = logits[:, :-1]

        # Get the prediction
        pred = torch.argmax(filtered_logits, dim=1).item()

    return label_list[pred]

# Usage example
new_question = "Who is the current president of the United States?"
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")


Predicted Label: REGUL_regul
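`classify_question` always returns one of the five labels, even for a question unrelated to the training data, as in the example above. One way to make weak predictions visible is to look at the full ranking of class probabilities instead of only the argmax. The sketch below does this in plain Python. `rank_labels` is a hypothetical helper and the logits are made up, so the numbers say nothing about the trained model.

```python
import math


def rank_labels(real_logits, labels):
    """Return (label, probability) pairs sorted from most to least likely."""
    m = max(real_logits)
    exps = [math.exp(x - m) for x in real_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["UNK_UNK", "GHOST_ghost", "TOXIC_toxic", "CHEMI_chemi", "REGUL_regul"]
# Made-up logits for the five real classes (fake slot already removed).
nearly_flat_logits = [0.1, 0.3, 0.2, 0.0, 0.4]

for label, prob in rank_labels(nearly_flat_logits, labels):
    print(f"{label}: {prob:.3f}")
```

With nearly flat logits the top label wins with a probability close to chance, which is a signal that the prediction should not be trusted much.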


In [64]:
# Usage example
new_question = "An internal email discussing concerns over new regulatory guidelines for glyphosate, with Monsanto executives planning outreach to key figures in the regulatory process."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")


Predicted Label: REGUL_regul


In [65]:
# Usage example
new_question = "I am forwarding for review the article titled Effects of Automation in the Financial Sector, submitted for the next issue of the journal. Please provide your comments and suggestions for revision by the end of the week."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")


Predicted Label: GHOST_ghost


In [66]:
# Usage example
new_question = "What is the boiling point of water in Celsius?"
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: CHEMI_chemi


Testing


In [67]:
# Usage example
new_question = "Dear Larry, As we discussed earlier, we will be proceeding with the draft of the manuscript for the Expert Panel. We want to ensure that the final version has your expertise woven into it, particularly in the genotoxicity section. Please let us know your availability so we can finalize the consulting agreement, which should cover the necessary writing hours—roughly around 12K for 2015.Best regards, Donna Farmer Monsanto Toxicologist"
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: GHOST_ghost


In [68]:
# Usage example
new_question = "I’ve been asked to review a recent study on glyphosate’s cytotoxicity in rats. Given your expertise in the field, I believe it would be beneficial for both of you to act as peer reviewers. Once you’ve collated your comments, I’ll submit the final review on our behalf."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: GHOST_ghost


In [69]:
# Usage example
new_question = "I wanted to reach out regarding the recent publication by Professor Seralini. Given the numerous inaccuracies in the study, I would recommend retracting the paper and restarting the review process. Let’s ensure we maintain scientific credibility."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul


In [70]:
# Usage example
new_question = "The surfactant used in our Roundup formulation may play a significant role in the skin tumor promotion study we’ve been analyzing. I believe this needs to be addressed in the manuscript to ensure that the surfactant's effects are fully acknowledged in the safety data."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul


In [71]:
# Usage example
new_question = "Following our internal review of the NNG carcinogenicity study in male mice, the results show a significant increase in malignant lymphomas. I am compiling the data to assess the risk, but we must be careful in framing this study as it could lead to regulatory challenges."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul


In [72]:
# Usage example
new_question = "Our recent storage tests for glyphosate show increased levels of NNG due to long-term storage conditions. I suggest we draft a report clarifying that these results are not representative of normal product aging to avoid unnecessary regulatory concerns."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul


In [73]:
# Usage example
new_question = "Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin. This presents a challenge in terms of risk evaluations, and we may need to reevaluate our safety guidelines in light of this."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: CHEMI_chemi


In [74]:
# Usage example
new_question = "In response to the deposition questions, I can confirm that there is no scientific basis for claiming that glyphosate absorbed through the skin is excreted through feces. This should be communicated clearly in our defense, as the current evidence only supports excretion via urine."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: TOXIC_toxic


In [75]:
# Usage example
new_question = "Our internal review of glyphosate metabolism shows continued uncertainty regarding dermal absorption rates. Additional studies might clarify the issue, but this also introduces the risk of discovering new metabolites, which could complicate our current safety profile."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: CHEMI_chemi


In [76]:
# Usage example
new_question = "We need to gather further data on the excretion of glyphosate through dermal exposure. Thus far, the assumption has been that it’s primarily excreted through urine, but this might need more scrutiny, especially in light of regulatory concerns."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul


In [77]:
# Usage example
new_question = "As discussed, we need to escalate our efforts to gain favorable regulatory assessments in Europe. A mix of scientific review and political strategy will be essential to ensure glyphosate is not classified as a carcinogen. Let’s initiate a communication campaign to address any doubts regulators may have."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul


In [78]:
# Usage example
new_question = "I met with EPA officials earlier this afternoon, and we discussed delaying the risk assessment for glyphosate. It seems we have support from key figures, but we’ll need to strategize to ensure there are no surprises from other government agencies."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul
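
The cells above repeat the same three lines for each input. One possible way to condense them assumes only that the classifier is a callable from text to label. The `classify_many` helper and the keyword-based stub are hypothetical stand-ins so the sketch runs without the trained model:

```python
from collections import Counter


def classify_many(texts, classify_fn):
    """Apply classify_fn to every text and return (text, label) pairs."""
    return [(text, classify_fn(text)) for text in texts]


def stub_classifier(text):
    # Stand-in for classify_question; NOT the trained model.
    return "REGUL_regul" if "regulat" in text.lower() else "GHOST_ghost"


examples = [
    "We need favorable regulatory assessments in Europe.",
    "Please send your comments on the manuscript.",
]
results = classify_many(examples, stub_classifier)
for text, label in results:
    print(f"{label}\t{text}")

# Label distribution across all inputs.
print(Counter(label for _, label in results))
```

In the notebook, `classify_many(texts, classify_question)` would run the real model over a list of inputs in a single cell.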


In [79]:
# Usage example
new_question = "Roundup used to kill people"
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: TOXIC_toxic


In [80]:
# Usage example
new_question = "There have been significant updates regarding our ongoing work on biochemical processes, and it is crucial that we handle the next steps with extreme vigilance. The information we are about to relay to government authorities contains delicate aspects that could draw unwelcome interest if not meticulously safeguarded. We must ensure that every part of this operation remains in strict confidentiality, especially considering the nature of the sensitive data at hand. Any kind of exposure could resul"
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: REGUL_regul
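
`preprocess` truncates every input to `max_seq_length=64` tokens, so the tail of a long message like the one above never reaches the model. One way to cover the whole text is to classify overlapping windows. The sketch below splits on whitespace as a rough proxy for tokens. That is an assumption: BERT word pieces are usually more numerous than words, so the window size would need some margin. `split_into_windows` is a hypothetical helper.

```python
def split_into_windows(text, window=40, stride=30):
    """Split text into overlapping windows of whitespace-separated words."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    words = text.split()
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(" ".join(words[start:start + window]))
        # Stop once the current window reaches the end of the text.
        if start + window >= len(words):
            break
        start += stride
    return windows


sample = " ".join(f"w{i}" for i in range(100))
chunks = split_into_windows(sample)
print(len(chunks))           # 3
print(chunks[1].split()[0])  # w30
```

Each window could then be passed to `classify_question` separately.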


# Testing Text Split


In [81]:
import nltk
from nltk.tokenize import sent_tokenize

# Download the necessary data for tokenization
nltk.download('punkt_tab')

# Your existing code for text and sentence tokenization
text = "I’ve been asked to review a recent study on glyphosate’s cytotoxicity in rats. Given your expertise in the field, I believe it would be beneficial for both of you to act as peer reviewers. Once you’ve collated your comments, I’ll submit the final review on our behalf."
sentences = sent_tokenize(text)

# Continue with your code to classify sentences
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")

Sentence: I’ve been asked to review a recent study on glyphosate’s cytotoxicity in rats.
Predicted Label: GHOST_ghost

Sentence: Given your expertise in the field, I believe it would be beneficial for both of you to act as peer reviewers.
Predicted Label: REGUL_regul

Sentence: Once you’ve collated your comments, I’ll submit the final review on our behalf.
Predicted Label: GHOST_ghost



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
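
Sentence-level predictions can disagree, as in the output above (two `GHOST_ghost`, one `REGUL_regul`). If a single label per message is needed, one simple option is a majority vote. This is an illustrative choice rather than something the notebook does, and `majority_label` is a hypothetical helper:

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("no labels to aggregate")
    counts = Counter(labels)
    best = max(counts.values())
    # Iterate in original order so ties are resolved deterministically.
    for label in labels:
        if counts[label] == best:
            return label


sentence_labels = ["GHOST_ghost", "REGUL_regul", "GHOST_ghost"]
print(majority_label(sentence_labels))  # GHOST_ghost
```

A vote ignores how confident each sentence-level prediction was; weighting by probability would be a natural refinement.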


In [82]:
# Input text
text = "I met with EPA officials earlier this afternoon, and we discussed delaying the risk assessment for glyphosate. It seems we have support from key figures, but we’ll need to strategize to ensure there are no surprises from other government agencies."

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")




Sentence: I met with EPA officials earlier this afternoon, and we discussed delaying the risk assessment for glyphosate.
Predicted Label: REGUL_regul

Sentence: It seems we have support from key figures, but we’ll need to strategize to ensure there are no surprises from other government agencies.
Predicted Label: REGUL_regul



In [83]:
# Input text
text = "As discussed, we need to escalate our efforts to gain favorable regulatory assessments in Europe. A mix of scientific review and political strategy will be essential to ensure glyphosate is not classified as a carcinogen. Let’s initiate a communication campaign to address any doubts regulators may have."

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")




Sentence: As discussed, we need to escalate our efforts to gain favorable regulatory assessments in Europe.
Predicted Label: REGUL_regul

Sentence: A mix of scientific review and political strategy will be essential to ensure glyphosate is not classified as a carcinogen.
Predicted Label: REGUL_regul

Sentence: Let’s initiate a communication campaign to address any doubts regulators may have.
Predicted Label: REGUL_regul



In [84]:
# Input text
text = "I’ve been asked to review a recent study on glyphosate’s cytotoxicity in rats. Given your expertise in the field, I believe it would be beneficial for both of you to act as peer reviewers. Once you’ve collated your comments, I’ll submit the final review on our behalf."

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")



Sentence: I’ve been asked to review a recent study on glyphosate’s cytotoxicity in rats.
Predicted Label: GHOST_ghost

Sentence: Given your expertise in the field, I believe it would be beneficial for both of you to act as peer reviewers.
Predicted Label: REGUL_regul

Sentence: Once you’ve collated your comments, I’ll submit the final review on our behalf.
Predicted Label: GHOST_ghost



In [85]:
# Input text
text = "Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin. This presents a challenge in terms of risk evaluations, and we may need to reevaluate our safety guidelines in light of this."

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")



Sentence: Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin.
Predicted Label: CHEMI_chemi

Sentence: This presents a challenge in terms of risk evaluations, and we may need to reevaluate our safety guidelines in light of this.
Predicted Label: REGUL_regul



In [86]:
# Input text
text = "Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin? This presents a challenge in terms of risk evaluations, and we may need to reevaluate our safety guidelines in light of this!"

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")


Sentence: Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin?
Predicted Label: CHEMI_chemi

Sentence: This presents a challenge in terms of risk evaluations, and we may need to reevaluate our safety guidelines in light of this!
Predicted Label: REGUL_regul



In [87]:
# Input text
text = "good morning. Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin?"

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")


Sentence: good morning.
Predicted Label: GHOST_ghost

Sentence: Our recent study confirms that surfactants in the Roundup formulation significantly increase glyphosate absorption through the skin?
Predicted Label: CHEMI_chemi
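
Because the prediction is an argmax over the real classes, every input receives one of the known labels, including off-topic fragments such as `good morning.` above. A hedged sketch of an abstention rule on top of per-label probabilities follows. The 0.6 threshold, the `UNSURE` marker, and the probability values are arbitrary placeholders, not values from the notebook.

```python
def label_or_abstain(probabilities, threshold=0.6, fallback="UNSURE"):
    """Return the top label only if its probability reaches the threshold."""
    if not probabilities:
        return fallback
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else fallback


# Made-up probabilities for illustration.
confident = {"CHEMI_chemi": 0.85, "REGUL_regul": 0.10, "GHOST_ghost": 0.05}
uncertain = {"CHEMI_chemi": 0.30, "REGUL_regul": 0.28, "GHOST_ghost": 0.42}
print(label_or_abstain(confident))  # CHEMI_chemi
print(label_or_abstain(uncertain))  # UNSURE
```

Softmax scores are not guaranteed to be well calibrated on out-of-domain text, so any threshold would need tuning on held-out examples.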



In [88]:
# Input text
text = ''' How far is it from Denver to Aspen .
 What county is Modesto
 Who was Galileo .
 What is an atom .
 When did Hawaii become a state .
 How tall is the Sears Building .
 George Bush purchased a small interest in which baseball team .
 What is Australia s national flower .
 Why does the moon turn orange .
 What is autism .
 What city had a world fair in 1900 .
 What person s head is on a dime .
 What is the average weight of a Yellow Labrador .
 Who was the first man to fly across the Pacific Ocean .
 When did Idaho become a state .
 What is the life expectancy for crickets .
 What metal has the highest melting point .
 Who developed the vaccination against polio .
 What is epilepsy .
 What year did the Titanic sink .
 Who was the first American to walk in space .
 What is a biosphere .
 What river in the US is known as the Big Muddy .
 What is bipolar disorder .
 What is cholesterol .
 Who developed the Macintosh computer .
 What is caffeine .
 What imaginary line is halfway between the North and South Poles .
 Where is John Wayne airport .
 What hemisphere is the Philippines in .
 What is the average speed of the horses at the Kentucky Derby .
 Where are the Rocky Mountains .
 What are invertebrates .
 What is the temperature at the center of the earth .
 When did John F. Kennedy get elected as President .
 How old was Elvis Presley when he died .
 Where is the Orinoco River .
 How far is the service line from the net in tennis .
 How much fiber should you have per day .
 How many Great Lakes are there .
 Material called linen is made from what plant .
 What is Teflon .
 What is amitriptyline .
 What is a shaman .
 What is the proper name for a female walrus .
 What is a group of turkeys called .
 How long did Rip Van Winkle sleep .
 What are triglycerides .
 How many liters in a gallon .
 What is the name of the chocolate company in San Francisco .
 What are amphibians .
 Who discovered x-rays .
 Which comedian s signature line is `` Can we talk  .
 What is fibromyalgia .
 What is done with worn or outdated flags .
 What does cc in engines mean .
 When did Elvis Presley die .
 What is the capital of Yugoslavia .
 Where is Milan .
 What is the speed hummingbirds fly .
 What is the oldest city in the United States .
 What was W.C. Fields  real name .
 What river flows between Fargo
 What do bats eat .
 What state did the Battle of Bighorn take place in .
 Who was Abraham Lincoln .
 What do you call a newborn kangaroo .
 What are spider veins .
 What day and month did John Lennon die .
 What strait separates North America from Asia .
 What is the population of Seattle .
 How much was a ticket for the Titanic .
 What is the largest city in the world .
 What American composer wrote the music for `` West Side Story  .
 Where is the Mall of the America .
 What is the pH scale .
 What type of currency is used in Australia .
 How tall is the Gateway Arch in St. Louis
 How much does the human adult female brain weigh .
 Who was the first governor of Alaska .
 What is a prism .
 When was the first liver transplant .
 Who was elected president of South Africa in 1994 .
 What is the population of China .
 When was Rosa Parks born .
 Why is a ladybug helpful .
 What is amoxicillin .
 Who was the first female United States Representative .
 What are xerophytes .
 What country did Ponce de Leon come from .
 The U.S. Department of Treasury first issued paper currency for the U.S. during which war .
 What is desktop publishing .
 What is the temperature of the sun s surface .
 What year did Canada join the United Nations .
 What is the oldest university in the US .
 Where is Prince Edward Island .
 Mercury
 What is cryogenics .
 What are coral reefs .
 What is the longest major league baseball-winning streak .
 What is neurology .
 Who invented the calculator .
 How do you measure earthquakes .
 Who is Duke Ellington .
 What county is Phoenix
 What is a micron .
 The sun s core
 What is the Ohio state bird .
 When were William Shakespeare s twins born .
 What is the highest dam in the U.S. .
 What color is a poison arrow frog .
 What is acupuncture .
 What is the length of the coastline of the state of Alaska .
 What is the name of Neil Armstrong s wife .
 What is Hawaii s state flower .
 Who won Ms. American in 1989 .
 When did the Hindenberg crash .
 What mineral helps prevent osteoporosis .
 What was the last year that the Chicago Cubs won the World Series .
 Where is Perth .
 What year did WWII begin .
 What is the diameter of a golf ball .
 What is an eclipse .
 Who discovered America .
 What is the earth s diameter .
 Which president was unmarried .
 How wide is the Milky Way galaxy .
 During which season do most thunderstorms occur .
 What is Wimbledon .
 What is the gestation period for a cat .
 How far is a nautical mile .
 Who was the abolitionist who led the raid on Harper s Ferry in 1859 .
 What does target heart rate mean .
 What was the first satellite to go into space .
 What is foreclosure .
 What is the major fault line near Kentucky .
 Where is the Holland Tunnel .
 Who wrote the hymn `` Amazing Grace  .
 What position did Willie Davis play in baseball .
 What are platelets .
 What is severance pay .
 What is the name of Roy Roger s dog .
 Where are the National Archives .
 What is a baby turkey called .
 What is poliomyelitis .
 What is the longest bone in the human body .
 Who is a German philosopher .
 What were Christopher Columbus  three ships .
 What does Phi Beta Kappa mean .
 What is nicotine .
 What is another name for vitamin B1 .
 Who discovered radium .
 What are sunspots .
 When was Algeria colonized .
 What baseball team was the first to make numbers part of their uniform .
 What continent is Egypt on .
 What is the capital of Mongolia .
 What is nanotechnology .
 In the late 1700 s British convicts were used to populate which colony .
 What state is the geographic center of the lower 48 states .
 What is an obtuse angle .
 What are polymers .
 When is hurricane season in the Caribbean .
 Where is the volcano Mauna Loa .
 What is another astronomic term for the Northern Lights .
 What peninsula is Spain part of .
 When was Lyndon B. Johnson born .
 What is acetaminophen .
 What state has the least amount of rain per year .
 Who founded American Red Cross .
 What year did the Milwaukee Braves become the Atlanta Braves .
 How fast is alcohol absorbed .
 When is the summer solstice .
 What is supernova .
 Where is the Shawnee National Forest .
 What U.S. state s motto is `` Live free or Die  .
 Where is the Lourve .
 When was the first stamp issued .
 What primary colors do you mix to make orange .
 How far is Pluto from the sun .
 What body of water are the Canary Islands in .
 What is neuropathy .
 Where is the Euphrates River .
 What is cryptography .
 What is natural gas composed of .
 Who is the Prime Minister of Canada .
 What French ruler was defeated at the battle of Waterloo .
 What is leukemia .
 Where did Howard Hughes die .
 What is the birthstone for June .
 What is the sales tax in Minnesota .
 What is the distance in miles from the earth to the sun .
 What is the average life span for a chicken .
 When was the first Wal-Mart store opened .
 What is relative humidity .
 What city has the zip code of 35824 .
 What currency is used in Algeria .
 Who invented the hula hoop .
 What was the most popular toy in 1957 .
 What is pastrami made of .
 What is the name of the satellite that the Soviet Union sent into space in 1957 .
 What city s newspaper is called `` The Enquirer  .
 Who invented the slinky .
 What are the animals that don t have backbones called .
 What is the melting point of copper .
 Where is the volcano Olympus Mons located .
 Who was the 23rd president of the United States .
 What is the average body temperature .
 What does a defibrillator do .
 What is the effect of acid rain .
 What year did the United States abolish the draft .
 How fast is the speed of light .
 What province is Montreal in .
 What New York City structure is also known as the Twin Towers .
 What is fungus .
 What is the most frequently spoken language in the Netherlands .
 What is sodium chloride .
 What are the spots on dominoes called .
 How many pounds in a ton .
 What is influenza .
 What is ozone depletion .
 What year was the Mona Lisa painted .
 What does `` Sitting Shiva  mean .
 What is the electrical output in Madrid
 Which mountain range in North America stretches from Maine to Georgia .
 What is plastic made of .
 What is the population of Nigeria .
 What does your spleen do .
 Where is the Grand Canyon .
 Who invented the telephone .
 What year did the U.S. buy Alaska .
 What is the name of the leader of Ireland .
 What is phenylalanine .
 How many gallons of water are there in a cubic foot .
 What are the two houses of the Legislative branch .
 What is sonar .
 In Poland
 What is phosphorus .
 What is the location of the Sea of Tranquility .
 How fast is sound .
 What French province is cognac produced in .
 What is Valentine s Day .
 What causes gray hair .
 What is hypertension .
 What is bandwidth .
 What is the longest suspension bridge in the U.S. .
 What is a parasite .
 What is home equity .
 What do meteorologists do .
 What is the criterion for being legally blind .
 Who is the tallest man in the world .
 What are the twin cities .
 What did Edward Binney and Howard Smith invent in 1903 .
 What is the statue of liberty made of .
 What is pilates .
 What planet is known as the `` red  planet .
 What is the depth of the Nile river .
 What is the colorful Korean traditional dress called .
 What is Mardi Gras .
 Mexican pesos are worth what in U.S. dollars .
 Who was the first African American to play for the Brooklyn Dodgers .
 Who was the first Prime Minister of Canada .
 How many Admirals are there in the U.S. Navy .
 What instrument did Glenn Miller play .
 How old was Joan of Arc when she died .
 What does the word fortnight mean .
 What is dianetics .
 What is the capital of Ethiopia .
 For how long is an elephant pregnant .
 How did Janice Joplin die .
 What is the primary language in Iceland .
 What is the difference between AM radio stations and FM radio stations .
 What is osteoporosis .
 Who was the first woman governor in the U.S. .
 What is peyote .
 What is the esophagus used for .
 What is viscosity .
 What year did Oklahoma become a state .
 What is the abbreviation for Texas .
 What is a mirror made out of .
 Where on the body is a mortarboard worn .
 What was J.F.K. s wife s name .
 What does I.V. stand for .
 What is the chunnel .
 Where is Hitler buried .
 What are antacids .
 What is pulmonary fibrosis .
 What are Quaaludes .
 What is naproxen .
 What is strep throat .
 What is the largest city in the U.S. .
 What is foot and mouth disease .
 What is the life expectancy of a dollar bill .
 What do you call a professional map drawer .
 What are Aborigines .
 What is hybridization .
 What color is indigo .
 How old do you have to be in order to rent a car in Italy .
 What does a barometer measure .
 What color is a giraffe s tongue .
 What does USPS stand for .
 What year did the NFL go on strike .
 What is solar wind .
 What date did Neil Armstrong land on the moon .
 When was Hiroshima bombed .
 Where is the Savannah River .
 Who was the first woman killed in the Vietnam War .
 What planet has the strongest magnetic field of all the planets .
 Who is the governor of Alaska .
 What year did Mussolini seize power in Italy .
 What is the capital of Persia .
 Where is the Eiffel Tower .
 How many hearts does an octopus have .
 What is pneumonia .
 What is the deepest lake in the US .
 What is a fuel cell .
 Who was the first U.S. president to appear on TV .
 Where is the Little League Museum .
 What are the two types of twins .
 What is the brightest star .
 What is diabetes .
 When was President Kennedy shot .
 What is TMJ .
 What color is yak milk .
 What date was Dwight D. Eisenhower born .
 What does the technical term ISDN mean .
 Why is the sun yellow .
 What is the conversion rate between dollars and pounds .
 When was Abraham Lincoln born .
 What is the Milky Way .
 What is mold .
 What year was Mozart born .
 What is a group of frogs called .
 What is the name of William Penn s ship .
 What is the melting point of gold .
 What is the street address of the White House .
 What is semolina .
 What fruit is Melba sauce made from .
 What is Ursa Major .
 What is the percentage of water content in the human body .
 How much does water weigh .
 What was President Lyndon Johnson s reform program called .
 What is the murder rate in Windsor
 Who is the only president to serve 2 non-consecutive terms .
 What is the population of Australia .
 Who painted the ceiling of the Sistine Chapel .
 Name a stimulant .
 What is the effect of volcanoes on the climate .
 What year did the Andy Griffith show begin .
 What is acid rain .
 What is the date of Mexico s independence .
 What is the location of Lake Champlain .
 What is the Illinois state flower .
 What is Maryland s state bird .
 What is quicksilver .
 Who wrote `` The Divine Comedy  .
 What is the speed of light .
 What is the width of a football field .
 Why in tennis are zero points called love .
 What kind of dog was Toto in the Wizard of Oz .
 What is a thyroid .
 What does ciao mean .
 What is the only artery that carries blue blood from the heart to the lungs .
 How often does Old Faithful erupt at Yellowstone National Park .
 What is acetic acid .
 What is the elevation of St. Louis
 What color does litmus paper turn when it comes into contact with a strong acid .
 What are the colors of the German flag .
 What is the Moulin Rouge .
 What soviet seaport is on the Black Sea .
 What is the atomic weight of silver .
 What currency do they use in Brazil .
 What are pathogens .
 What is mad cow disease .
 Name a food high in zinc .
 When did North Carolina enter the union .
 Where do apple snails live .
 What are ethics .
 What does CPR stand for .
 What is an annuity .
 Who killed John F. Kennedy .
 Who was the first vice president of the U.S. .
 What birthstone is turquoise .
 Who was the first US President to ride in an automobile to his inauguration .
 How old was the youngest president of the United States .
 When was Ulysses S. Grant born .
 What is Muscular Dystrophy .
 Who lived in the Neuschwanstein castle .
 What is propylene glycol .
 What is a panic disorder .
 Who invented the instant Polaroid camera .
 What is a carcinogen .
 What is a baby lion called .
 What is the world s population .
 What is nepotism .
 What is die-casting .
 What is myopia .
 What is the sales tax rate in New York .
 Developing nations comprise what percentage of the world s population .
 What is the fourth highest mountain in the world .
 What is Shakespeare s nickname .
 What is the heaviest naturally occurring element .
 When is Father s Day .
 What does the acronym NASA stand for .
 How long is the Columbia River in miles .
 What city s newspaper is called `` The Star  .
 What is carbon dioxide .
 Where is the Mason/Dixon line .
 When was the Boston tea party .
 What is metabolism .
 Which U.S.A. president appeared on `` Laugh-In  .
 What are cigarettes made of .
 What is the capital of Zimbabwe .
 What does NASA stand for .
 What is the state flower of Michigan .
 What are semiconductors .
 What is nuclear power .
 What is a tsunami .
 Who is the congressman from state of Texas on the armed forces committee .
 Who was president in 1913 .
 When was the first kidney transplant .
 What are Canada s two territories .
 What was the name of the plane Lindbergh flew solo across the Atlantic .
 What is genocide .
 What continent is Argentina on .
 What monastery was raided by Vikings in the late eighth century .
 What is an earthquake .
 Where is the tallest roller coaster located .
 What are enzymes .
 Who discovered oxygen .
 What is bangers and mash .
 What is the name given to the Tiger at Louisiana State University .
 Where are the British crown jewels kept .
 Who was the first person to reach the North Pole .
 What is an ulcer .
 What is vertigo .
 What is the spirometer test .
 When is the official first day of summer .
 What does the abbreviation SOS mean .
 What is the smallest bird in Britain .
 Who invented Trivial Pursuit .
 What gasses are in the troposphere .
 Which country has the most water pollution .
 What is the scientific name for elephant .
 Who is the actress known for her role in the movie `` Gypsy  .
 What breed of hunting dog did the Beverly Hillbillies own .
 What is the rainiest place on Earth .
 Who was the first African American to win the Nobel Prize in literature .
 When is St. Patrick s Day .
 What was FDR s dog s name .
 What colors need to be mixed to get the color pink .
 What is the most popular sport in Japan .
 What is the active ingredient in baking soda .
 When was Thomas Jefferson born .
 How cold should a refrigerator be .
 When was the telephone invented .
 What is the most common eye color .
 Where was the first golf course in the United States .
 What is schizophrenia .
 What is angiotensin .
 What did Jesse Jackson organize .
 What is New York s state bird .
 What is the National Park in Utah .
 What is Susan B. Anthony s birthday .
 In which state would you find the Catskill Mountains .
 What do you call a word that is spelled the same backwards and forwards .
 What are pediatricians .
 What chain store is headquartered in Bentonville
 What are solar cells .
 What is compounded interest .
 What are capers .
 What is an antigen .
 What currency does Luxembourg use .
 What is the population of Venezuela .
 What type of polymer is used for bulletproof vests .
 What currency does Argentina use .
 What is a thermometer .
 What Canadian city has the largest population .
 What color are crickets .
 Which country gave New York the Statue of Liberty .
 What was the name of the first U.S. satellite sent into space .
 What precious stone is a form of pure carbon .
 What kind of gas is in a fluorescent bulb .
 What is rheumatoid arthritis .
 What river runs through Rowe
 What is cerebral palsy .
 What city is also known as `` The Gateway to the West  .
 How far away is the moon .
 What is the source of natural gas .
 In what spacecraft did U.S. astronaut Alan Shepard make his historic 1961 flight .
 What is pectin .
 What is bio-diversity .
 What s the easiest way to remove wallpaper .
 What year did the Titanic start on its journey .
 How much of an apple is water .
 Who was the 22nd President of the US .
 What is the money they use in Zambia .
 How many feet in a mile .
 What is the birthstone of October .
 What is e-coli .'''

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")


Sentence:  How far is it from Denver to Aspen .
Predicted Label: GHOST_ghost

Sentence: What county is Modesto
 Who was Galileo .
Predicted Label: REGUL_regul

Sentence: What is an atom .
Predicted Label: CHEMI_chemi

Sentence: When did Hawaii become a state .
Predicted Label: REGUL_regul

Sentence: How tall is the Sears Building .
Predicted Label: GHOST_ghost

Sentence: George Bush purchased a small interest in which baseball team .
Predicted Label: REGUL_regul

Sentence: What is Australia s national flower .
Predicted Label: REGUL_regul

Sentence: Why does the moon turn orange .
Predicted Label: TOXIC_toxic

Sentence: What is autism .
Predicted Label: CHEMI_chemi

Sentence: What city had a world fair in 1900 .
Predicted Label: REGUL_regul

Sentence: What person s head is on a dime .
Predicted Label: GHOST_ghost

Sentence: What is the average weight of a Yellow Labrador .
Predicted Label: CHEMI_chemi

Sentence: Who was the first man to fly across the Pacific Ocean .
Predicted Label: R
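
In the output above, `What county is Modesto` and `Who was Galileo .` are classified together as one sentence: the first line has no terminal punctuation, so `sent_tokenize` does not split there. Since the input text holds one question per line, a line-based split may be a simpler alternative. The following is a minimal sketch under that assumption; `split_questions` is a hypothetical helper, not part of the notebook or of NLTK.

```python
def split_questions(text):
    """Return one stripped entry per non-empty line of `text`.

    Hypothetical helper: assumes each line holds exactly one question.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


# Small sample in the same layout as the notebook's input text
sample = '''
 How far is it from Denver to Aspen .
 What county is Modesto
 Who was Galileo .
'''
questions = split_questions(sample)
print(questions)
```

Each entry could then be passed to `classify_question` in place of the `sent_tokenize` output. Fragments such as `What county is Modesto` would still be classified, only on their own rather than merged with the following line.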

In [89]:
# Usage example
new_question = "This document contains a study on the absorption of glyphosate through human skin  highlighting concerns about increased absorption when surfactants are used in the formulation."
predicted_label = classify_question(new_question)
print(f"Predicted Label: {predicted_label}")

Predicted Label: CHEMI_chemi


In [90]:
# Input text
text = '''
     How far is it from Denver to Aspen ?
 What county is Modesto
 Who was Galileo ?
 What is an atom ?
 When did Hawaii become a state ?
 How tall is the Sears Building ?
 George Bush purchased a small interest in which baseball team ?
 What is Australia s national flower ?
 Why does the moon turn orange ?
 What is autism ?
 What city had a world fair in 1900 ?
 What person s head is on a dime ?
 What is the average weight of a Yellow Labrador ?
 Who was the first man to fly across the Pacific Ocean ?
 When did Idaho become a state ?
 What is the life expectancy for crickets ?
 What metal has the highest melting point ?
 Who developed the vaccination against polio ?
 What is epilepsy ?
 What year did the Titanic sink ?
 Who was the first American to walk in space ?
 What is a biosphere ?
 What river in the US is known as the Big Muddy ?
 What is bipolar disorder ?
 What is cholesterol ?
 Who developed the Macintosh computer ?
 What is caffeine ?
 What imaginary line is halfway between the North and South Poles ?
 Where is John Wayne airport ?
 What hemisphere is the Philippines in ?
 What is the average speed of the horses at the Kentucky Derby ?
 Where are the Rocky Mountains ?
 What are invertebrates ?
 What is the temperature at the center of the earth ?
 When did John F. Kennedy get elected as President ?
 How old was Elvis Presley when he died ?
 Where is the Orinoco River ?
 How far is the service line from the net in tennis ?
 How much fiber should you have per day ?
 How many Great Lakes are there ?
 Material called linen is made from what plant ?
 What is Teflon ?
 What is amitriptyline ?
 What is a shaman ?
 What is the proper name for a female walrus ?
 What is a group of turkeys called ?
 How long did Rip Van Winkle sleep ?
 What are triglycerides ?
 How many liters in a gallon ?
 What is the name of the chocolate company in San Francisco ?
 What are amphibians ?
 Who discovered x-rays ?
 Which comedian s signature line is  Can we talk  ?
 What is fibromyalgia ?
 What is done with worn or outdated flags ?
 What does cc in engines mean ?
 When did Elvis Presley die ?
 What is the capital of Yugoslavia ?
 Where is Milan ?
 What is the speed hummingbirds fly ?
 What is the oldest city in the United States ?
 What was W.C. Fields  real name ?
 What river flows between Fargo
 What do bats eat ?
 What state did the Battle of Bighorn take place in ?
 Who was Abraham Lincoln ?
 What do you call a newborn kangaroo ?
 What are spider veins ?
 What day and month did John Lennon die ?
 What strait separates North America from Asia ?
 What is the population of Seattle ?
 How much was a ticket for the Titanic ?
 What is the largest city in the world ?
 What American composer wrote the music for  West Side Story  ?
 Where is the Mall of the America ?
 What is the pH scale ?
 What type of currency is used in Australia ?
 How tall is the Gateway Arch in St. Louis
 How much does the human adult female brain weigh ?
 Who was the first governor of Alaska ?
 What is a prism ?
 When was the first liver transplant ?
 Who was elected president of South Africa in 1994 ?
 What is the population of China ?
 When was Rosa Parks born ?
 Why is a ladybug helpful ?
 What is amoxicillin ?
 Who was the first female United States Representative ?
 What are xerophytes ?
 What country did Ponce de Leon come from ?
 The U.S. Department of Treasury first issued paper currency for the U.S. during which war ?
 What is desktop publishing ?
 What is the temperature of the sun s surface ?
 What year did Canada join the United Nations ?
 What is the oldest university in the US ?
 Where is Prince Edward Island ?
 Mercury
 What is cryogenics ?
 What are coral reefs ?
 What is the longest major league baseball-winning streak ?
 What is neurology ?
 Who invented the calculator ?
 How do you measure earthquakes ?
 Who is Duke Ellington ?
 What county is Phoenix
 What is a micron ?
 The sun s core
 What is the Ohio state bird ?
 When were William Shakespeare s twins born ?
 What is the highest dam in the U.S. ?
 What color is a poison arrow frog ?
 What is acupuncture ?
 What is the length of the coastline of the state of Alaska ?
 What is the name of Neil Armstrong s wife ?
 What is Hawaii s state flower ?
 Who won Ms. American in 1989 ?
 When did the Hindenberg crash ?
 What mineral helps prevent osteoporosis ?
 What was the last year that the Chicago Cubs won the World Series ?
 Where is Perth ?
 What year did WWII begin ?
 What is the diameter of a golf ball ?
 What is an eclipse ?
 Who discovered America ?
 What is the earth s diameter ?
 Which president was unmarried ?
 How wide is the Milky Way galaxy ?
 During which season do most thunderstorms occur ?
 What is Wimbledon ?
 What is the gestation period for a cat ?
 How far is a nautical mile ?
 Who was the abolitionist who led the raid on Harper s Ferry in 1859 ?
 What does target heart rate mean ?
 What was the first satellite to go into space ?
 What is foreclosure ?
 What is the major fault line near Kentucky ?
 Where is the Holland Tunnel ?
 Who wrote the hymn  Amazing Grace  ?
 What position did Willie Davis play in baseball ?
 What are platelets ?
 What is severance pay ?
 What is the name of Roy Roger s dog ?
 Where are the National Archives ?
 What is a baby turkey called ?
 What is poliomyelitis ?
 What is the longest bone in the human body ?
 Who is a German philosopher ?
 What were Christopher Columbus  three ships ?
 What does Phi Beta Kappa mean ?
 What is nicotine ?
 What is another name for vitamin B1 ?
 Who discovered radium ?
 What are sunspots ?
 When was Algeria colonized ?
 What baseball team was the first to make numbers part of their uniform ?
 What continent is Egypt on ?
 What is the capital of Mongolia ?
 What is nanotechnology ?
 In the late 1700 s British convicts were used to populate which colony ?
 What state is the geographic center of the lower 48 states ?
 What is an obtuse angle ?
 What are polymers ?
 When is hurricane season in the Caribbean ?
 Where is the volcano Mauna Loa ?
 What is another astronomic term for the Northern Lights ?
 What peninsula is Spain part of ?
 When was Lyndon B. Johnson born ?
 What is acetaminophen ?
 What state has the least amount of rain per year ?
 Who founded American Red Cross ?
 What year did the Milwaukee Braves become the Atlanta Braves ?
 How fast is alcohol absorbed ?
 When is the summer solstice ?
 What is supernova ?
 Where is the Shawnee National Forest ?
 What U.S. state s motto is  Live free or Die  ?
 Where is the Lourve ?
 When was the first stamp issued ?
 What primary colors do you mix to make orange ?
 How far is Pluto from the sun ?
 What body of water are the Canary Islands in ?
 What is neuropathy ?
 Where is the Euphrates River ?
 What is cryptography ?
 What is natural gas composed of ?
 Who is the Prime Minister of Canada ?
 What French ruler was defeated at the battle of Waterloo ?
 What is leukemia ?
 Where did Howard Hughes die ?
 What is the birthstone for June ?
 What is the sales tax in Minnesota ?
 What is the distance in miles from the earth to the sun ?
 What is the average life span for a chicken ?
 When was the first Wal-Mart store opened ?
 What is relative humidity ?
 What city has the zip code of 35824 ?
 What currency is used in Algeria ?
 Who invented the hula hoop ?
 What was the most popular toy in 1957 ?
 What is pastrami made of ?
 What is the name of the satellite that the Soviet Union sent into space in 1957 ?
 What city s newspaper is called  The Enquirer  ?
 Who invented the slinky ?
 What are the animals that don t have backbones called ?
 What is the melting point of copper ?
 Where is the volcano Olympus Mons located ?
 Who was the 23rd president of the United States ?
 What is the average body temperature ?
 What does a defibrillator do ?
 What is the effect of acid rain ?
 What year did the United States abolish the draft ?
 How fast is the speed of light ?
 What province is Montreal in ?
 What New York City structure is also known as the Twin Towers ?
 What is fungus ?
 What is the most frequently spoken language in the Netherlands ?
 What is sodium chloride ?
 What are the spots on dominoes called ?
 How many pounds in a ton ?
 What is influenza ?
 What is ozone depletion ?
 What year was the Mona Lisa painted ?
 What does  Sitting Shiva  mean ?
 What is the electrical output in Madrid
 Which mountain range in North America stretches from Maine to Georgia ?
 What is plastic made of ?
 What is the population of Nigeria ?
 What does your spleen do ?
 Where is the Grand Canyon ?
 Who invented the telephone ?
 What year did the U.S. buy Alaska ?
 What is the name of the leader of Ireland ?
 What is phenylalanine ?
 How many gallons of water are there in a cubic foot ?
 What are the two houses of the Legislative branch ?
 What is sonar ?
 In Poland
 What is phosphorus ?
 What is the location of the Sea of Tranquility ?
 How fast is sound ?
 What French province is cognac produced in ?
 What is Valentine s Day ?
 What causes gray hair ?
 What is hypertension ?
 What is bandwidth ?
 What is the longest suspension bridge in the U.S. ?
 What is a parasite ?
 What is home equity ?
 What do meteorologists do ?
 What is the criterion for being legally blind ?
 Who is the tallest man in the world ?
 What are the twin cities ?
 What did Edward Binney and Howard Smith invent in 1903 ?
 What is the statue of liberty made of ?
 What is pilates ?
 What planet is known as the  red  planet ?
 What is the depth of the Nile river ?
 What is the colorful Korean traditional dress called ?
 What is Mardi Gras ?
 Mexican pesos are worth what in U.S. dollars ?
 Who was the first African American to play for the Brooklyn Dodgers ?
 Who was the first Prime Minister of Canada ?
 How many Admirals are there in the U.S. Navy ?
 What instrument did Glenn Miller play ?
 How old was Joan of Arc when she died ?
 What does the word fortnight mean ?
 What is dianetics ?
 What is the capital of Ethiopia ?
 For how long is an elephant pregnant ?
 How did Janice Joplin die ?
 What is the primary language in Iceland ?
 What is the difference between AM radio stations and FM radio stations ?
 What is osteoporosis ?
 Who was the first woman governor in the U.S. ?
 What is peyote ?
 What is the esophagus used for ?
 What is viscosity ?
 What year did Oklahoma become a state ?
 What is the abbreviation for Texas ?
 What is a mirror made out of ?
 Where on the body is a mortarboard worn ?
 What was J.F.K. s wife s name ?
 What does I.V. stand for ?
 What is the chunnel ?
 Where is Hitler buried ?
 What are antacids ?
 What is pulmonary fibrosis ?
 What are Quaaludes ?
 What is naproxen ?
 What is strep throat ?
 What is the largest city in the U.S. ?
 What is foot and mouth disease ?
 What is the life expectancy of a dollar bill ?
 What do you call a professional map drawer ?
 What are Aborigines ?
 What is hybridization ?
 What color is indigo ?
 How old do you have to be in order to rent a car in Italy ?
 What does a barometer measure ?
 What color is a giraffe s tongue ?
 What does USPS stand for ?
 What year did the NFL go on strike ?
 What is solar wind ?
 What date did Neil Armstrong land on the moon ?
 When was Hiroshima bombed ?
 Where is the Savannah River ?
 Who was the first woman killed in the Vietnam War ?
 What planet has the strongest magnetic field of all the planets ?
 Who is the governor of Alaska ?
 What year did Mussolini seize power in Italy ?
 What is the capital of Persia ?
 Where is the Eiffel Tower ?
 How many hearts does an octopus have ?
 What is pneumonia ?
 What is the deepest lake in the US ?
 What is a fuel cell ?
 Who was the first U.S. president to appear on TV ?
 Where is the Little League Museum ?
 What are the two types of twins ?
 What is the brightest star ?
 What is diabetes ?
 When was President Kennedy shot ?
 What is TMJ ?
 What color is yak milk ?
 What date was Dwight D. Eisenhower born ?
 What does the technical term ISDN mean ?
 Why is the sun yellow ?
 What is the conversion rate between dollars and pounds ?
 When was Abraham Lincoln born ?
 What is the Milky Way ?
 What is mold ?
 What year was Mozart born ?
 What is a group of frogs called ?
 What is the name of William Penn s ship ?
 What is the melting point of gold ?
 What is the street address of the White House ?
 What is semolina ?
 What fruit is Melba sauce made from ?
 What is Ursa Major ?
 What is the percentage of water content in the human body ?
 How much does water weigh ?
 What was President Lyndon Johnson s reform program called ?
 What is the murder rate in Windsor
 Who is the only president to serve 2 non-consecutive terms ?
 What is the population of Australia ?
 Who painted the ceiling of the Sistine Chapel ?
 Name a stimulant .
 What is the effect of volcanoes on the climate ?
 What year did the Andy Griffith show begin ?
 What is acid rain ?
 What is the date of Mexico s independence ?
 What is the location of Lake Champlain ?
 What is the Illinois state flower ?
 What is Maryland s state bird ?
 What is quicksilver ?
 Who wrote  The Divine Comedy  ?
 What is the speed of light ?
 What is the width of a football field ?
 Why in tennis are zero points called love ?
 What kind of dog was Toto in the Wizard of Oz ?
 What is a thyroid ?
 What does ciao mean ?
 What is the only artery that carries blue blood from the heart to the lungs ?
 How often does Old Faithful erupt at Yellowstone National Park ?
 What is acetic acid ?
 What is the elevation of St. Louis
 What color does litmus paper turn when it comes into contact with a strong acid ?
 What are the colors of the German flag ?
 What is the Moulin Rouge ?
 What soviet seaport is on the Black Sea ?
 What is the atomic weight of silver ?
 What currency do they use in Brazil ?
 What are pathogens ?
 What is mad cow disease ?
 Name a food high in zinc .
 When did North Carolina enter the union ?
 Where do apple snails live ?
 What are ethics ?
 What does CPR stand for ?
 What is an annuity ?
 Who killed John F. Kennedy ?
 Who was the first vice president of the U.S. ?
 What birthstone is turquoise ?
 Who was the first US President to ride in an automobile to his inauguration ?
 How old was the youngest president of the United States ?
 When was Ulysses S. Grant born ?
 What is Muscular Dystrophy ?
 Who lived in the Neuschwanstein castle ?
 What is propylene glycol ?
 What is a panic disorder ?
 Who invented the instant Polaroid camera ?
 What is a carcinogen ?
 What is a baby lion called ?
 What is the world s population ?
 What is nepotism ?
 What is die-casting ?
 What is myopia ?
 What is the sales tax rate in New York ?
 Developing nations comprise what percentage of the world s population ?
 What is the fourth highest mountain in the world ?
 What is Shakespeare s nickname ?
 What is the heaviest naturally occurring element ?
 When is Father s Day ?
 What does the acronym NASA stand for ?
 How long is the Columbia River in miles ?
 What city s newspaper is called  The Star  ?
 What is carbon dioxide ?
 Where is the Mason/Dixon line ?
 When was the Boston tea party ?
 What is metabolism ?
 Which U.S.A. president appeared on  Laugh-In  ?
 What are cigarettes made of ?
 What is the capital of Zimbabwe ?
 What does NASA stand for ?
 What is the state flower of Michigan ?
 What are semiconductors ?
 What is nuclear power ?
 What is a tsunami ?
 Who is the congressman from state of Texas on the armed forces committee ?
 Who was president in 1913 ?
 When was the first kidney transplant ?
 What are Canada s two territories ?
 What was the name of the plane Lindbergh flew solo across the Atlantic ?
 What is genocide ?
 What continent is Argentina on ?
 What monastery was raided by Vikings in the late eighth century ?
 What is an earthquake ?
 Where is the tallest roller coaster located ?
 What are enzymes ?
 Who discovered oxygen ?
 What is bangers and mash ?
 What is the name given to the Tiger at Louisiana State University ?
 Where are the British crown jewels kept ?
 Who was the first person to reach the North Pole ?
 What is an ulcer ?
 What is vertigo ?
 What is the spirometer test ?
 When is the official first day of summer ?
 What does the abbreviation SOS mean ?
 What is the smallest bird in Britain ?
 Who invented Trivial Pursuit ?
 What gasses are in the troposphere ?
 Which country has the most water pollution ?
 What is the scientific name for elephant ?
 Who is the actress known for her role in the movie  Gypsy  ?
 What breed of hunting dog did the Beverly Hillbillies own ?
 What is the rainiest place on Earth ?
 Who was the first African American to win the Nobel Prize in literature ?
 When is St. Patrick s Day ?
 What was FDR s dog s name ?
 What colors need to be mixed to get the color pink ?
 What is the most popular sport in Japan ?
 What is the active ingredient in baking soda ?
 When was Thomas Jefferson born ?
 How cold should a refrigerator be ?
 When was the telephone invented ?
 What is the most common eye color ?
 Where was the first golf course in the United States ?
 What is schizophrenia ?
 What is angiotensin ?
 What did Jesse Jackson organize ?
 What is New York s state bird ?
 What is the National Park in Utah ?
 What is Susan B. Anthony s birthday ?
 In which state would you find the Catskill Mountains ?
 What do you call a word that is spelled the same backwards and forwards ?
 What are pediatricians ?
 What chain store is headquartered in Bentonville
 What are solar cells ?
 What is compounded interest ?
 What are capers ?
 What is an antigen ?
 What currency does Luxembourg use ?
 What is the population of Venezuela ?
 What type of polymer is used for bulletproof vests ?
 What currency does Argentina use ?
 What is a thermometer ?
 What Canadian city has the largest population ?
 What color are crickets ?
 Which country gave New York the Statue of Liberty ?
 What was the name of the first U.S. satellite sent into space ?
 What precious stone is a form of pure carbon ?
 What kind of gas is in a fluorescent bulb ?
 What is rheumatoid arthritis ?
 What river runs through Rowe
 What is cerebral palsy ?
 What city is also known as  The Gateway to the West  ?
 How far away is the moon ?
 What is the source of natural gas ?
 In what spacecraft did U.S. astronaut Alan Shepard make his historic 1961 flight ?
 What is pectin ?
 What is bio-diversity ?
 What s the easiest way to remove wallpaper ?
 What year did the Titanic start on its journey ?
 How much of an apple is water ?
 Who was the 22nd President of the US ?
 What is the money they use in Zambia ?
 How many feet in a mile ?
 What is the birthstone of October ?
 What is e-coli ?



'''

# Split the text into sentences
sentences = sent_tokenize(text)
# Classify each sentence individually
for sentence in sentences:
    predicted_label = classify_question(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted Label: {predicted_label}\n")


Sentence: 
     How far is it from Denver to Aspen ?
Predicted Label: GHOST_ghost

Sentence: What county is Modesto
 Who was Galileo ?
Predicted Label: REGUL_regul

Sentence: What is an atom ?
Predicted Label: CHEMI_chemi

Sentence: When did Hawaii become a state ?
Predicted Label: REGUL_regul

Sentence: How tall is the Sears Building ?
Predicted Label: GHOST_ghost

Sentence: George Bush purchased a small interest in which baseball team ?
Predicted Label: REGUL_regul

Sentence: What is Australia s national flower ?
Predicted Label: REGUL_regul

Sentence: Why does the moon turn orange ?
Predicted Label: TOXIC_toxic

Sentence: What is autism ?
Predicted Label: CHEMI_chemi

Sentence: What city had a world fair in 1900 ?
Predicted Label: GHOST_ghost

Sentence: What person s head is on a dime ?
Predicted Label: GHOST_ghost

Sentence: What is the average weight of a Yellow Labrador ?
Predicted Label: CHEMI_chemi

Sentence: Who was the first man to fly across the Pacific Ocean ?
Predicted Lab