<a href="https://colab.research.google.com/github/jayasuryajsk/hugginface-Q-and-A/blob/main/QandA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install torch torchvision -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers
!pip install wikipedia

In [7]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import wikipedia as wiki
import pprint as pp
import torch
from collections import OrderedDict



class DocumentReader:
    def __init__(self, pretrained_model_name_or_path='deepset/roberta-base-squad2'):
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        self.max_len = self.model.config.max_position_embeddings
        self.chunked = False

    def tokenize(self, question, text):
        self.inputs = self.tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
        self.input_ids = self.inputs["input_ids"].tolist()[0]

        if len(self.input_ids) > self.max_len:
            self.inputs = self.chunkify()
            self.chunked = True

    def chunkify(self):
        """ 
        Break up a long article into chunks that fit within the max token
        requirement for that Transformer model. 

        Calls to BERT / RoBERTa / ALBERT require the following format:
        [CLS] question tokens [SEP] context tokens [SEP].
        """

        # create question mask based on token_type_ids
        # value is 0 for question tokens, 1 for context tokens
        qmask = self.inputs['token_type_ids'].lt(1)
        qt = torch.masked_select(self.inputs['input_ids'], qmask)
        chunk_size = self.max_len - qt.size()[0] - 1 # the "-1" accounts for
        # having to add an ending [SEP] token to the end

        # create a dict of dicts; each sub-dict mimics the structure of pre-chunked model input
        chunked_input = OrderedDict()
        for k,v in self.inputs.items():
            q = torch.masked_select(v, qmask)
            c = torch.masked_select(v, ~qmask)
            chunks = torch.split(c, chunk_size)
            
            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}

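                thing = torch.cat((q, chunk))
                # every chunk except the last needs a closing [SEP] token appended
                if i != len(chunks)-1:
                    if k == 'input_ids':
                        # use the tokenizer's own [SEP] id rather than hardcoding BERT's 102
                        thing = torch.cat((thing, torch.tensor([self.tokenizer.sep_token_id])))
                    else:
                        # token_type_ids and attention_mask are both 1 for the added token
                        thing = torch.cat((thing, torch.tensor([1])))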

                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input

    def get_answer(self):
        if self.chunked:
            answer = ''
            for k, chunk in self.inputs.items():
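                outputs = self.model(**chunk)
                # index the output so this works whether the model returns a tuple or a ModelOutput
                answer_start_scores, answer_end_scores = outputs[0], outputs[1]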

                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1

                ans = self.convert_ids_to_string(chunk['input_ids'][0][answer_start:answer_end])
                if ans != '[CLS]':
                    answer += ans + " / "
            return answer
        else:
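            outputs = self.model(**self.inputs)
            # index the output so this works whether the model returns a tuple or a ModelOutput
            answer_start_scores, answer_end_scores = outputs[0], outputs[1]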

            answer_start = torch.argmax(answer_start_scores)  # get the most likely beginning of answer with the argmax of the score
            answer_end = torch.argmax(answer_end_scores) + 1  # get the most likely end of answer with the argmax of the score
        
            return self.convert_ids_to_string(self.inputs['input_ids'][0][
                                              answer_start:answer_end])

    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids))
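
The chunking scheme described in the `chunkify` docstring can be illustrated without loading a model. The sketch below is a plain-Python approximation, not the class's actual method: the function name `chunk_token_ids`, the toy token ids, and the default `sep_id=102` (BERT's `[SEP]` id) are assumptions made for the example.

```python
def chunk_token_ids(question_ids, context_ids, max_len, sep_id=102):
    """Split context_ids so that question + piece (+ [SEP]) fits in max_len.

    question_ids: [CLS] + question tokens + [SEP]
    context_ids:  context tokens + the final [SEP]
    sep_id:       id of the [SEP] token (102 is BERT's; assumed for this toy example)
    """
    # reserve one slot for the [SEP] appended to every chunk except the last
    chunk_size = max_len - len(question_ids) - 1
    if chunk_size < 1:
        raise ValueError("question is too long to leave room for any context")

    pieces = [context_ids[i:i + chunk_size]
              for i in range(0, len(context_ids), chunk_size)]

    chunks = []
    for i, piece in enumerate(pieces):
        ids = question_ids + piece
        if i != len(pieces) - 1:
            ids = ids + [sep_id]  # the last piece already ends with [SEP]
        chunks.append(ids)
    return chunks


# toy ids: 101 = [CLS], 102 = [SEP]; all other ids are made up
question_ids = [101, 7, 8, 102]
context_ids = [1, 2, 3, 4, 5, 6, 9, 102]
chunks = chunk_token_ids(question_ids, context_ids, max_len=8)
for ids in chunks:
    print(len(ids), ids)
```

Every chunk repeats the full question prefix, so a long question leaves less room for context in each chunk and produces more chunks.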

In [13]:
text = """ Stars are born as dense clouds of interstellar material collapse under their own gravity, spinning into flat discs that eventually spool into baby stars. Now, for the first time, hints of planet formation have been detected around a protostar so young, the cloud of leftover dust and gas is still collapsing into it, and the disc still forming.

This is the earliest detection of such structures in a protostellar ring, and it suggests that planet formation starts earlier than we thought, before the nascent system is even 500,000 years old.

The young protostar is called IRS 63, and it's 470 light-years away in the Rho Ophiuchi star formation region - a stellar nursery where the dust is thick enough to form the spinning clumps that will eventually form stars.

IRS 63 is in class I of the star formation process, less than half a million years old. It's past the main accretion phase, and has most of its final mass; shining brightly in millimetre wavelengths, it's also one of the brightest protostars of its class.

Additionally, IRS 63 has a large disc, extending out to around 50 astronomical units. These properties, along with its proximity, make the object an excellent target for studying star and planet formation.

rho ophiuchi
The Rho Ophiuchi star-forming region. (ESO/Digitized Sky Survey 2)

Using the Atacama Large Millimeter/submillimeter Array in Chile - a radio telescope with an excellent track record of detecting early planet formation - a team led by astronomer Dominique Segura-Cox of the Max Planck Institute for Extraterrestrial Physics in Germany took a closer look at the star and the dusty cloud around it.

There, in the swirling disc, the team found a surprise: two dark concentric gaps centred around the protostar - what astronomers take to be a sign of planet formation.

Planet formation is a poorly understood process. The most popular model is core accretion - grains of dust in the disc gradually accumulating, first sticking together electrostatically, then gravitationally as the body grows larger and larger. As this occurs, the protoplanet hoovers up all the material along its orbital path, creating a gap in the circumstellar disc.

Such gaps have been detected in almost all discs we've imaged with sufficiently high resolution. But there is a big problem with the model - it takes a very long time for planets to form that way, and protostellar discs older than about 1 million years old don't seem to have enough material to form the known exoplanet population.

Astronomers have found over 35 class II protostellar systems around the age of 1 million years that have lost their large dust clouds, but still have protostellar discs and sport pronounced gaps therein. The fact they have such well-developed gaps at just 1 million years old, suggests planet formation process is well underway by the time stars are of this age.

If the structures detected by Segura-Cox and her team are created by planets, it would support this idea, and offer a solution to the problem of missing mass in the protostellar disc.

"""

In [15]:
questions = [
    'when will planet formation start?'
]

reader = DocumentReader("deepset/bert-large-uncased-whole-word-masking-squad2")

# if you trained your own model using the training cell earlier, you can access it with this:
#reader = DocumentReader("./models/bert/bbu_squad2")

for question in questions:
    print(f"Question: {question}")
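    # NOTE: the Wikipedia lookup is not used for the answer below; the reader
    # runs on the `text` defined in the previous cell.
    results = wiki.search(question)

    page = wiki.page(results[0])
    #print(f"Top wiki result: {page}")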

    #text = 

    reader.tokenize(question, text)
    print(f"Answer: {reader.get_answer()}")
    print()

Question: when will planet formation start?
Answer: before the nascent system is even 500 , 000 years old / by the time stars are of this age / 
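
The two answers separated by " / " come from different chunks: `get_answer` takes the argmax of the start scores and of the end scores in each chunk and drops any chunk whose span is just `[CLS]`. A minimal plain-Python sketch of that per-chunk selection follows; the helper name `pick_span`, the tokens, and the scores are all invented for illustration and are not model output.

```python
def pick_span(start_scores, end_scores):
    """Return (start, end) with end exclusive, like argmax(start) and argmax(end) + 1."""
    start = max(range(len(start_scores)), key=lambda i: start_scores[i])
    end = max(range(len(end_scores)), key=lambda i: end_scores[i]) + 1
    return start, end


tokens = ["[CLS]", "when", "?", "[SEP]", "before", "500", "years", "[SEP]"]

# a chunk that contains an answer: scores peak inside the context
span = pick_span([0.1, 0.0, 0.0, 0.0, 2.5, 0.3, 0.2, 0.0],
                 [0.1, 0.0, 0.0, 0.0, 0.2, 0.4, 3.0, 0.0])
answer = " ".join(tokens[span[0]:span[1]])
print(span, answer)

# a chunk without an answer: both scores peak at position 0, i.e. [CLS]
no_span = pick_span([4.0, 0.0, 0.0, 0.0, 0.5, 0.3, 0.2, 0.0],
                    [3.5, 0.0, 0.0, 0.0, 0.2, 0.4, 0.6, 0.0])
no_answer = tokens[no_span[0]:no_span[1]]
print(no_span, no_answer)
```

Because start and end are chosen independently, the end can land before the start, in which case the slice is empty; the class above has the same behavior.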

