Knowledge roadmaps to be evaluated: https://roadmap.sh/java

## Context generation

Imports:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

Select the model and tokenizer:

In [None]:
# The 2.7B model does not run on CPU
model_name = 'EleutherAI/gpt-neo-2.7B' #'EleutherAI/gpt-j-6B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Additional configuration:

In [None]:
# Set pad_token_id if it is not defined
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# GPU or CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device) # move the model to the CPU or GPU

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 2560)
    (wpe): Embedding(2048, 2560)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-31): 32 x GPTNeoBlock(
        (ln_1): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=2560, out_features=2560, bias=False)
            (v_proj): Linear(in_features=2560, out_features=2560, bias=False)
            (q_proj): Linear(in_features=2560, out_features=2560, bias=False)
            (out_proj): Linear(in_features=2560, out_features=2560, bias=True)
          )
        )
        (ln_2): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=2560, out_features=10240, bias=True)
          (c_proj)

Function to generate the contexts:

In [None]:
# Function to generate explanations
def gerar_contexto(topico):
    prompt = f"Explain the following Java topic in details: {topico}"
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=1024,
        do_sample=True,          # sample so that temperature takes effect
        temperature=0.7,         # controls randomness
        top_p=0.9,               # nucleus sampling (optional)
        top_k=50,                # limits the candidate token set (optional)
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        no_repeat_ngram_size=2   # prevents any 2-gram from repeating
    )
    resposta = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return resposta
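
The sampling parameters passed to `generate` can be illustrated on a hand-written list of logits. The sketch below is a simplified toy in plain Python, not the transformers implementation; the function name `filtrar_distribuicao` and the example logits are assumptions made for illustration only.

```python
import math

def filtrar_distribuicao(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Toy temperature scaling followed by top-k and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    maior = max(scaled)
    exps = [math.exp(s - maior) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable token indices and renormalize over them.
    ordem = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    zk = sum(probs[i] for i in ordem)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    mantidos, acumulado = [], 0.0
    for i in ordem:
        mantidos.append(i)
        acumulado += probs[i] / zk
        if acumulado >= top_p:
            break

    # Renormalize over the surviving tokens; sampling would draw from this mapping.
    z = sum(probs[i] for i in mantidos)
    return {i: probs[i] / z for i in mantidos}

print(filtrar_distribuicao([2.0, 1.0, 0.0, -1.0]))
```

With these example logits only the two most likely token indices survive the top-p cut, which is why lowering `top_p` or `temperature` makes the generated text more conservative.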

Test:

In [None]:
# Usage example
topico = "Exception Handling"
contexto = gerar_contexto(topico)
print(contexto)


Explain the following Java topic in details: Exception Handling

Exception handling is the process of handling exceptions in Java programs. It involves three different steps:
1. Declaring a catch block, which is a block of code that catches an exception.
2. Implementing the catch method, a method that handles the exception by doing something useful with the thrown exception object. The method can throw any exception that is not handled by the current code. For example, if the method throws a checked exception, then the code in the try block will not be executed. Instead, the check will be performed to see if an error occurred and if so, it will throw the checked Exception. If no exception is thrown, execution will continue as normal. Exception handling may be used to control the flow of the program. In particular, an application can use exception handling to decide if it should continue execution or to throw an Exception that should be handled in some other way. Some exception types ca
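
The decoded output above starts with the prompt itself, because a causal language model returns the input tokens followed by the continuation. A minimal sketch of a post-processing step that drops the echoed prompt, assuming a hypothetical helper named `remover_prompt` (not part of the original notebook):

```python
def remover_prompt(resposta: str, prompt: str) -> str:
    """Return the generated text without the echoed prompt (hypothetical helper)."""
    if resposta.startswith(prompt):
        # Slice off the prompt and any blank lines that follow it.
        return resposta[len(prompt):].lstrip()
    return resposta.strip()

prompt = "Explain the following Java topic in details: Exception Handling"
exemplo = prompt + "\n\nException handling is the process of handling exceptions in Java programs."
print(remover_prompt(exemplo, prompt))
```

Applied inside `gerar_contexto`, this would leave only the explanation as context for the question generator.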

## Question generation

Question generator (multitask QA-QG)

In [None]:
# punkt - tokenizer that splits text into sentences and words
!python -m nltk.downloader punkt

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Clone the question generation repository. This is actually my fork of the project: I made it to fix an error raised when the model could not find the answers in the context to generate the questions, because some answers had the \<pad> token in front of them:

In [None]:
!git clone https://github.com/joao326/question_generation/
#!git clone https://github.com/patil-suraj/question_generation.git

Cloning into 'question_generation'...
remote: Enumerating objects: 165, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 165 (delta 10), reused 16 (delta 7), pack-reused 146 (from 1)
Receiving objects: 100% (165/165), 274.49 KiB | 1.71 MiB/s, done.
Resolving deltas: 100% (82/82), done.


In [None]:
%cd question_generation

/content/question_generation


I made a branch for the fix, so I will use it to check that everything works:

In [None]:
!git branch -r

  origin/HEAD -> origin/master
  origin/Tentando-corrigir-o-erro-substring-not-found
  origin/master


In [None]:
!git checkout Tentando-corrigir-o-erro-substring-not-found

Branch 'Tentando-corrigir-o-erro-substring-not-found' set up to track remote branch 'Tentando-corrigir-o-erro-substring-not-found' from 'origin'.
Switched to a new branch 'Tentando-corrigir-o-erro-substring-not-found'
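
The "substring not found" in the branch name is the message of the `ValueError` that `str.index` raises when the searched string is absent, which is what happens if an answer still carries a leading `<pad>` token. A minimal sketch of the kind of cleanup involved, assuming a hypothetical helper `localizar_resposta`; it illustrates the idea and is not the fork's actual code:

```python
def localizar_resposta(contexto: str, resposta: str) -> int:
    """Return the position of the answer in the context, or -1 if absent (hypothetical helper)."""
    # Drop the <pad> token and surrounding whitespace before searching.
    resposta_limpa = resposta.replace("<pad>", "").strip()
    # str.find returns -1 instead of raising ValueError like str.index.
    return contexto.find(resposta_limpa)

contexto = "Python is an interpreted, high-level, general-purpose programming language."
print(localizar_resposta(contexto, "<pad> Python"))  # → 0
print(localizar_resposta(contexto, "<pad> Java"))    # → -1
```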


Now for the test. Choose a text, or copy one from the context generator's output:

In [None]:
text = """This tutorial is about exception handling, which is the process of dealing with errors that occur in your program. \
When an error occurs, you can use the try-catch statement to catch the exception and handle it. \
The try statement catches the exceptions that you throw, and the catch statement handles them. \
The try block, also known as the "try block," is where you write code that handles errors. \
You write the code in the block after you've written the body of the method that contains the error-handling code. \
After the return statement, the last statement in a try is a catch block. This is one of three parts to exception-throwing: \
1. An exception is thrown when you use a method or a constructor to make a mistake. \
For example, if you try to create a new object of type Person but forget to set the name field, then you'll get a java.lang.NullPointerException. \
A NullPoitionException is an exception that's thrown if the object is null. \
If you didn't mean to throw a Null Pointer Exception, your code should have been more careful. (See the section "What is Null?" for details.)\
2. Exceptions are thrown because a program cannot continue normally. In Java, this is done by using the throw statement. \
Here is how you might throw an Exception: \
throw new Exception("This is what you meant to do"); \
If you don't catch a specific exception, it's possible that your application may crash, because it cannot recover from the mistake that caused the crash. \
That's why you must catch every Exception you are given, even if it is just a very general "exception." \
3. Finally, when an application crashes, all of its resources are automatically reclaimed by the operating system. \
However, there are some exceptions thrown by your own program that cause the system to stop the application. \
These exceptions are known in Java as "system exceptions." These are the only exceptions you catch in most Java programs."""

In [None]:
text = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."

Import the pipeline function from the repository's pipelines.py:

In [None]:
from pipelines import pipeline

Select the pipeline to use, the question generation model, and the answer generation model, respectively:

In [None]:
nlp = pipeline("multitask-qa-qg", model="valhalla/t5-base-qa-qg-hl", ans_model="valhalla/t5-base-qa-qg-hl")

Test:

In [None]:
nlp(text)

Answer without <pad>: 'Python'
Answer without <pad>: 'Guido van Rossum'


[{'answer': 'Python',
  'question': 'What programming language was created by Guido van Rossum?'},
 {'answer': 'Guido van Rossum', 'question': 'Who created Python?'},
 {'answer': '1991', 'question': 'When was Python first released?'}]
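
In this output every answer is a span of the input text. A cheap sanity check before building anything on top of the pairs is to confirm that each answer occurs verbatim in the text. A minimal sketch, assuming a hypothetical helper `filtrar_pares` that is not part of the repository:

```python
def filtrar_pares(pares, texto):
    """Keep only QA pairs whose non-empty answer occurs verbatim in the text (hypothetical helper)."""
    return [p for p in pares if p["answer"].strip() and p["answer"] in texto]

text = ("Python is an interpreted, high-level, general-purpose programming language. "
        "Created by Guido van Rossum and first released in 1991, Python's design "
        "philosophy emphasizes code readability with its notable use of significant whitespace.")
pares = [
    {"answer": "Python", "question": "What programming language was created by Guido van Rossum?"},
    {"answer": "Guido van Rossum", "question": "Who created Python?"},
    {"answer": "1991", "question": "When was Python first released?"},
]
print(len(filtrar_pares(pares, text)))  # → 3
```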

## Distractor generation

Importing nltk and downloading the WordNet data:

In [None]:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

Attempt at generating distractors:

In [None]:
target_answer = "Python"

# Collect hypernyms
related_terms = []

# Find synsets that may be related to the answer
for syn in wordnet.synsets(target_answer):
    for hypernym in syn.hypernyms():
        for lemma in hypernym.lemmas():
            # Only add terms that differ from the original answer
            if lemma.name().lower() != target_answer.lower():
                related_terms.append(lemma.name())

# Remove duplicates and limit to four distractors
distractors = list(set(related_terms))[:4]

# Print the distractors
print("Distractors:", distractors)

Distractors: ['boa', 'disembodied_spirit', 'spirit']
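
Deduplicating through `set` does not preserve insertion order, and WordNet lemma names use underscores in place of spaces. A minimal sketch of an order-preserving cleanup, assuming a hypothetical helper `limpar_distratores` and a hand-written input list that stands in for `related_terms`:

```python
def limpar_distratores(termos, resposta, limite=4):
    """Deduplicate in insertion order, drop the answer, and replace underscores (hypothetical helper)."""
    vistos = dict.fromkeys(
        t.replace("_", " ") for t in termos if t.lower() != resposta.lower()
    )
    return list(vistos)[:limite]

termos = ["boa", "spirit", "disembodied_spirit", "spirit", "Python"]
print(limpar_distratores(termos, "Python"))
# → ['boa', 'spirit', 'disembodied spirit']
```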


Meanings of target_answer according to WordNet:

In [None]:
syns = wordnet.synsets(target_answer, 'n')

for syn in syns:
    print(syn, ": ", syn.definition(), "\n")

Synset('python.n.01') :  large Old World boas 

Synset('python.n.02') :  a soothsaying spirit or a person who is possessed by such a spirit 

Synset('python.n.03') :  (Greek mythology) dragon killed by Apollo at Delphi 



WordNet appears to be incomplete and unsuitable for generating hypernyms associated with programming concepts: none of the noun senses listed above covers "Python" as a programming language. Alternative approaches should therefore be explored.
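
One alternative that could be explored, sketched here purely as an illustration, is to draw distractors from a small hand-curated vocabulary grouped by category. The `VOCABULARIO` dictionary, its contents, and the helper `distratores_por_categoria` are all hypothetical and not part of the original notebook:

```python
import random

# Hypothetical hand-curated vocabulary: category -> terms in that category.
VOCABULARIO = {
    "programming language": ["Python", "Java", "C++", "Ruby", "Go"],
}

def distratores_por_categoria(resposta, categoria, k=3, seed=0):
    """Pick up to k terms from the same category, excluding the correct answer."""
    candidatos = [
        t for t in VOCABULARIO.get(categoria, []) if t.lower() != resposta.lower()
    ]
    rng = random.Random(seed)  # fixed seed keeps the choice reproducible
    return rng.sample(candidatos, min(k, len(candidatos)))

print(distratores_por_categoria("Python", "programming language"))
```

The quality of such distractors depends entirely on how the vocabulary is built, so this only shows the selection step.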