# NLP Assessment
Name: Ong Zi Jian

The 2 sample CSV files were generated from the "World War 2" Wikipedia page. The link can be found in Step 2.

### Task:
1. Take a Wikipedia URL as input
2. Scrape the webpage for the information
3. Clean the data
4. Generate questions and create answers
5. Evaluate the questions and answers

### Running the notebook:
- Run all the steps from 1 to 9.
- Run the cells in sequence from top to bottom.
- Ensure that the dependencies are installed before moving on, since the following cells need them to work.
- Ensure that correct inputs are given when prompted, such as the prompt for the URL.

### Troubleshooting:
- Restart the runtime and run all the cells again.
- Ensure that the dependencies are properly installed before moving on.

In [None]:
# Step 1: Web URL
urlFinal = input("What's the Wikipedia URL? ")
print(urlFinal)

What's the Wikipedia URL? https://en.wikipedia.org/wiki/Quantum_computing
https://en.wikipedia.org/wiki/Quantum_computing
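
Since the rest of the notebook depends on a valid Wikipedia link, the input could be checked before it is used. The following is a minimal sketch using only the standard library; the function name `is_wikipedia_article_url` and the exact rules (an `http`/`https` scheme, a `wikipedia.org` host, and a `/wiki/` path) are assumptions for illustration and are not part of the original notebook.

```python
from urllib.parse import urlparse


def is_wikipedia_article_url(url: str) -> bool:
    """Return True if `url` looks like a Wikipedia article link (hypothetical helper)."""
    parsed = urlparse(url.strip())
    host = parsed.hostname or ""
    return (
        parsed.scheme in ("http", "https")
        and (host == "wikipedia.org" or host.endswith(".wikipedia.org"))
        and parsed.path.startswith("/wiki/")
        and len(parsed.path) > len("/wiki/")
    )


print(is_wikipedia_article_url("https://en.wikipedia.org/wiki/Quantum_computing"))
print(is_wikipedia_article_url("quantum computing"))
```

A check like this could wrap the `input()` call in Step 1 and ask again when it fails.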


## Using the Beautiful Soup library
We will be using Beautiful Soup and requests together. There are many options to consider when cleaning the data to get truly clean text. We have simplified them into 2 types for easy cleaning.
1. Normal text: ensure that the plain text does not contain any unnecessary symbols, words, letters, etc.
2. Math formulas: I used the Quantum Physics Wikipedia page to test and identified the class of the math elements as `mwe-math-element`. Note that I was not able to check all Wikipedia pages to confirm that every math element follows this class naming convention.


The requests library is used because it can fetch a webpage with just a few lines of code. Beautiful Soup is then used to strip the HTML tags and return the text.

In [None]:
# Step 2: Beautiful Soup and Request
from bs4 import BeautifulSoup
import requests
import re

# Example URLs used during testing (kept for reference only)
url = "https://en.wikipedia.org/wiki/Quantum_computing"
url2 = "https://en.wikipedia.org/wiki/World_War_II"
# Fetch the page entered in Step 1
response = requests.get(urlFinal)

# Check response and return error if url cannot be accessed
if response.status_code == 200:
    #If working, continue with the beautiful soup parser
    soup = BeautifulSoup(response.content, "html.parser")

    ####### Basic clean up of non-text inputs and references ###########
    ## Remove images
    for img in soup.find_all("img"):
        img.extract()
    ## Remove tables
    for table in soup.find_all("table"):
        table.extract()
    ## Remove lists
    for ul in soup.find_all("ul"):
        ul.extract()
    ## Remove reference numbers
    for sup in soup.find_all("sup", {"class": "reference"}):
        sup.extract()
    ## Remove reference links
    for a in soup.find_all("a", {"href": re.compile(r"^#cite_note-.*")}):
        a.extract()
    ## Remove HTML entities
    for entity in soup.find_all("span", {"class": "html_entity"}):
        entity.replace_with(entity.text)

    ####### Remove sections that are not body text #######
    ## Remove notes and citations
    for section in soup.select(".reflist"):
        section.extract()
    for section in soup.select(".reflist+*"):
        section.extract()
    ## Remove the headers
    header_tags = ['title', 'header', 'h2', 'h3', 'h4', 'h5', 'h6']
    for tag in header_tags:
        for header in soup.find_all(tag):
            header.extract()
    # Remove the nav
    for nav in soup.select("nav"):
        nav.extract()
    # Remove "See also" and "Sister projects" sections
    sister_projects_section = soup.find("div", {"class": "reflist-sister"})
    if sister_projects_section:
        sister_projects_section.extract()

    ######## find all div elements with role="note" ############
    note_divs = soup.find_all("div", {"role": "note"})
    # loop through the divs
    for div in note_divs:
        # Remove the "See also" section
        if "See also" in div.get_text():
            div.extract()
        # Remove the "Main Article" section
        if "Main article" in div.get_text():
            div.extract()

    # Find these elements by id and remove them only if they exist
    # (soup.find returns None when the id is missing from the page)
    for element_id in ("siteSub", "catlinks"):
        element = soup.find(id=element_id)
        if element is not None:
            element.decompose()

    # Remove all elements with class
    for tag in soup.find_all(class_='side-box-abovebelow'):
        tag.decompose()

    # Remove all elements with class for math element
    for tag in soup.find_all(class_='mwe-math-element'):
        tag.decompose()



    ########## find all `a` elements #####################
    note_a = soup.find_all("a")
    # loop through the links and remove the "Jump to content" link
    for a in note_a:
        if "Jump to content" in a.get_text():
            a.extract()


    text = soup.get_text()

    text = text.replace("\xa0II", "")
    text = text.replace("\xa0", "")

    # Remove empty lines
    lines = [line for line in text.splitlines() if line.strip()]
    text = '\n'.join(lines)

    # Turn the whole string to lower case
    text = text.lower()


    # More regex
    ## Remove mathematical formulas enclosed in curly braces
    text = re.sub(r'{\\displaystyle.*?}', '', text)
    ## Remove mathematical formulas enclosed in dollar signs
    text = re.sub(r'\$.*?\$', '', text)
    ## Remove any remaining token that contains a digit (e.g. years and leftover numbers)
    text = re.sub(r'\w*\d+\w*', '', text)

    # Remove the "retrieved from ..." footer line
    text = re.sub(r'retrieved from.*', '', text)

    # Remove ellipses
    text = re.sub(r'\.{3}', '', text)



    print(text)
    
else:
    print("Error retrieving content from website")

technology that uses quantum mechanics
 ibm q system one, a quantum computer with  superconducting qubits
a quantum computer is a computer that exploits quantum mechanical phenomena.
at small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior using specialized hardware.
classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster than any modern "classical" computer.
in particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the art is still largely experimental and impractical.
the basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. unlike a classical bit, a qubit can exist in a superposition of its two "basis" states, which loosely means that it 

## Small final cleaning of the text

We will not tokenise here, as that is done with the Flan-T5 tokenizer below. Stemming and stop-word removal are also not performed, as they are not required for Flan-T5.

For this scenario, we do not remove punctuation: the body of text is split into sentences, so the ending punctuation is needed to find the sentence boundaries.

In [None]:
# Step 3: Final clean up
import re

# Split the text into sentences based on punctuation marks
sentences = re.split(r'[.!?]+', text)

# Remove leading and trailing white space from each sentence
sentences = [sentence.strip() for sentence in sentences]

# Remove any empty sentences from the list
sentences = list(filter(None, sentences))

print(sentences)

['technology that uses quantum mechanics\n ibm q system one, a quantum computer with  superconducting qubits\na quantum computer is a computer that exploits quantum mechanical phenomena', 'at small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior using specialized hardware', 'classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster than any modern "classical" computer', 'in particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the art is still largely experimental and impractical', 'the basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics', 'unlike a classical bit, a qubit can exist in a superposition of its two "basis" states, which loosely 
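
The first entry in the output above still contains newlines and double spaces left over from the page layout. A minimal sketch of one way to collapse them is shown below; the helper name `normalise_whitespace` is an assumption, not part of the original notebook.

```python
def normalise_whitespace(sentence: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(sentence.split())


raw = (
    "technology that uses quantum mechanics\n ibm q system one, "
    "a quantum computer with  superconducting qubits"
)
print(normalise_whitespace(raw))
```

It could be applied to every sentence with `sentences = [normalise_whitespace(s) for s in sentences]`.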

## Q-A Model

I will be using the `transformers` library for the NLP work. I first generate the possible questions by iterating over the list of sentences. The generated questions are then answered using the pipeline QA model, and later using Flan-T5 in the following cells. Each sentence is also fed in as the context for its own question. From there, we extract the answer and score.

In [None]:
# Step 4: Importing libraries
# Install the dependencies using PIP
!pip install transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Generate the questions and predict the answer
Before I jump into using Flan-T5, I will use the `question-generation` model to generate a question from each sentence.

I then use BERT to answer the question. BERT lets me supply a context, for which I use the source sentence. I then put all the important information into a dictionary.

BERT is not necessarily the fastest option for this scenario. My sample data from the World War 2 wiki page takes anywhere from 15 to more than 20 minutes on the standard Google Colab resources. We will be trying out Flan-T5 in the later cells.

In [None]:
# Step 5: Run the model using BERT 
from transformers import pipeline
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Question Generator
generator = pipeline(model="mrm8488/t5-base-finetuned-question-generation-ap")

# Load the question answering model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

# Load the question answering pipeline with the specified model
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

qa_dict = {"Question": ["Answer", "Context", "Score"]}

for sentence in sentences:
  # Generate the question from the sentence
  qn = generator("answer:"+sentence)
  question = qn[0]["generated_text"].replace("question: ", "")

  # Answer the question, using the sentence itself as the context
  context = sentence
  result = qa(question=question, context=context)
  print(result['answer'])
  qa_dict[question] = [result["answer"], sentence, result["score"]]

print(qa_dict)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


a computer that exploits quantum mechanical phenomena
using specialized hardware
could perform some calculations exponentially faster than any modern "classical" computer
break widely used encryption schemes and aid physicists in performing physical simulations
the basic unit of information
can exist in a superposition of its two "basis" states
a probabilistic output
amplify
creating procedures that allow a quantum computer to perform calculations efficiently
physically engineering high-quality qubits
it suffers from quantum decoherence
error rates
superconductors
any computational problem
a classical computer
obey
provide no additional advantages over classical computers
quantum computers are believed to be able to solve certain problems quickly
quantum complexity theory
chronological guide
photons can exhibit wave-like interference
many years
modern quantum theory
essential
began to converge
quantum turing machine
when digital computers became faster
quantum key distribution could en

### Save to CSV
The Python dictionary is then converted into a pandas DataFrame and saved to a CSV file.

*Note: Each question is stored as a column, so the CSV reads left to right in Excel or Google Sheets. The viewer may use transpose in Excel to read it from top to bottom.*
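
As an alternative to transposing in Excel, the dictionary could be written with one row per question. The sketch below uses the standard `csv` module instead of pandas and writes to an in-memory buffer; the helper name `write_rows` and the sample question and score are hypothetical, chosen only to mirror the shape of `qa_dict`.

```python
import csv
import io


def write_rows(qa_dict: dict, file_obj) -> None:
    """Write each dictionary key and its list of values as one CSV row."""
    writer = csv.writer(file_obj)
    for key, values in qa_dict.items():
        writer.writerow([key, *values])


# Same shape as qa_dict: the first key holds the header labels
sample = {
    "Question": ["Answer", "Context", "Score"],
    "what is a qubit?": [  # hypothetical question and score
        "the basic unit of information",
        "the basic unit of information in quantum computing is the qubit",
        0.9,
    ],
}

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

To write a real file, an open file handle created with `open("qa_rows.csv", "w", newline="")` could be passed instead of the buffer.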

In [None]:
# Step 6: Save to CSV
import pandas as pd

# Convert dictionary to Pandas DataFrame
df = pd.DataFrame(qa_dict)

# Save DataFrame to CSV file
df.to_csv('qa.csv', index=False)

## Using Flan-T5
https://en.wikipedia.org/wiki/Quantum_computing

I am now using Flan-T5 for its QA abilities to answer the questions. I noticed that Flan-T5 takes about half the time compared to using the pipeline.

I will evaluate the questions and answers using `rouge_score`. I chose ROUGE because it offers more parameters than F1 or BLEU: it reports precision and recall separately alongside the F-measure.
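
To make the precision and recall figures easier to read, here is a simplified unigram-overlap calculation in plain Python. It is only a sketch of the idea behind ROUGE-1 and not the `rouge_score` implementation (it skips stemming and uses its own tokenisation); the function name `unigram_overlap_scores` is an assumption.

```python
import re
from collections import Counter


def unigram_overlap_scores(reference: str, candidate: str):
    """Return (precision, recall, fmeasure) based on shared word counts."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand_counts = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Counter & Counter keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


reference = "a quantum computer is a computer that exploits quantum mechanical phenomena"
candidate = "a computer that exploits quantum mechanical phenomena"
precision, recall, fmeasure = unigram_overlap_scores(reference, candidate)
print(precision, recall, fmeasure)  # precision is 1.0, recall is 7/11
```

Every word of the candidate appears in the reference, so precision is 1.0, while recall is lower because the candidate covers only 7 of the 11 reference words.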


In [None]:
# Step 7: Install Rouge
# We will be installing rouge_score for the evaluation later
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Step 8: Using Flan-T5 to answer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline
from rouge_score import rouge_scorer

# Question Generator
generator = pipeline(model="mrm8488/t5-base-finetuned-question-generation-ap")

# Flan-T5
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Rouge scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2'], use_stemmer=True)

# Declare the qa_dict_t5 dictionary and add the header
qa_dict_t5 = {"Question": ["Answer", "Sentence", "Rouge-1(Precision)", "Rouge-1(Recall)", "Rouge-1(fmeasure)", "Rouge-2(Precision)", "Rouge-2(Recall)", "Rouge-2(fmeasure)"]}

# Generate list of questions
for sentence in sentences:
  # Generate Question
  qn = generator("answer:"+sentence)
  question = qn[0]["generated_text"].replace("question: ", "")

  # answer
  t5query = f"""Question: "{question}". Context: "{sentence}" """
  inputs = tokenizer(t5query, return_tensors="pt")
  outputs = model.generate(**inputs)
  answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  # Evaluate the generated answer against the source sentence using rouge_score
  # (scorer.score takes the reference first, then the candidate)
  reference = sentence
  candidate = answer[0]
  scores = scorer.score(reference, candidate)
  print("Question: " + question + ", Answer: " + str(answer) + ", Rouge-1: " + str(scores['rouge1']) + ", Rouge-2: " + str(scores['rouge2']))
  qa_dict_t5[question] = [answer[0], sentence, scores['rouge1'].precision, scores['rouge1'].recall, scores['rouge1'].fmeasure, scores['rouge2'].precision, scores['rouge2'].recall, scores['rouge2'].fmeasure]


print(qa_dict_t5)


The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


Question: What is a quantum computer?, Answer: ['a computer that exploits quantum mechanical phenomena'], Rouge-1: Score(precision=1.0, recall=0.2692307692307692, fmeasure=0.42424242424242425), Rouge-2: Score(precision=1.0, recall=0.24, fmeasure=0.3870967741935484)
Question: Quantum computing leverages the behavior of particles and waves?, Answer: ['physical matter exhibits properties'], Rouge-1: Score(precision=1.0, recall=0.19047619047619047, fmeasure=0.32), Rouge-2: Score(precision=1.0, recall=0.15, fmeasure=0.2608695652173913)
Question: What is the problem with quantum computers?, Answer: ['a scalable quantum computer could perform some calculations exponentially faster than any modern "classical'], Rouge-1: Score(precision=1.0, recall=0.5384615384615384, fmeasure=0.7000000000000001), Rouge-2: Score(precision=1.0, recall=0.52, fmeasure=0.6842105263157895)
Question: What could a quantum computer do?, Answer: ['perform physical simulations'], Rouge-1: Score(precision=1.0, recall=0.09

In [None]:
# Step 9: Save to CSV
import pandas as pd

# Convert dictionary to Pandas DataFrame
df = pd.DataFrame(qa_dict_t5)

# Save DataFrame to CSV file
df.to_csv('qa_T5.csv', index=False)

## Conclusion
With limited time and knowledge in NLP, I was limited in what I could achieve, but this is my humble effort.

Here are some observations from the findings:
1. For simpler questions, the model generated answers that are factually correct.
2. For more complicated and open-ended questions, the model does seem to struggle. From what I can infer, it is probably due to the quality of the context. If the question is generated poorly from the context, the error carries over and affects the answer as well.
3. Some of the questions and answers are not of good quality, which may be due to the sentence given as context. More could have been done; alternatively, if I were using this for a real use case, I would just remove the pairs with poorer scores and keep the better ones.
4. The ROUGE score
  - Precision is generally high, while recall varies and is significantly lower than precision.
  - Because each answer is scored against its full source sentence, this indicates that the model is very conservative: the words in an answer almost always come from the sentence, but the answer covers only a small part of it.
  - The pros and cons depend on the use case. Pruning by recall score would leave a more accurate set of pairs, but at the cost of less information being available.
  - As a result, the fmeasure scores fall on both ends of the spectrum. However, a median of around `0.26` and a mean of `0.31` leave more to be desired.
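
The pruning idea in observation 3 can be sketched as a simple threshold filter. The three fmeasure values below are the Rouge-1 scores from the first three lines of the Step 8 output; the threshold of `0.4` and the helper name `keep_above` are assumptions for illustration, not tuned values.

```python
from statistics import mean, median


def keep_above(rows: list, threshold: float) -> list:
    """Keep only the question-answer pairs whose fmeasure meets the threshold."""
    return [row for row in rows if row["fmeasure"] >= threshold]


rows = [
    {"question": "What is a quantum computer?", "fmeasure": 0.42424242424242425},
    {"question": "Quantum computing leverages the behavior of particles and waves?", "fmeasure": 0.32},
    {"question": "What is the problem with quantum computers?", "fmeasure": 0.7000000000000001},
]

THRESHOLD = 0.4  # hypothetical cut-off
kept = keep_above(rows, THRESHOLD)
scores = [row["fmeasure"] for row in rows]

print(len(kept), "of", len(rows), "pairs kept")
print("median:", median(scores), "mean:", mean(scores))
```

A higher threshold keeps fewer but better-scoring pairs, which is the trade-off described in the list above.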



There are some things that I would like to improve on.
1. Better regex and data cleaning. This is to ensure that no oddly strung-together sentences end up in the data.
2. Qualitative checks. The sentences are split on punctuation, and the regex was only capable of removing some of the standard clutter like special characters and spaces. It would be nice to add a filter that removes sentences that do not bring any value or are not relevant to our scenario.
3. Better question generation models that give the questions at least a little context. For example, "What the Germans do in the war": without the context of World War 2 and the sentence, the machine may give answers about Germanic wars from the 17th to 19th century.
4. Fine-tune the model. I did not have the bandwidth to research and figure out how to do so. With more time and energy to spend on this project, I would be able to figure out the settings for fine-tuning.
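
One low-effort way to approach improvement 3 might be to pass the page topic along with each question. The sketch below derives the topic from the Wikipedia URL and adds it to a prompt shaped like the `t5query` string in Step 8. The helper names `topic_from_url` and `build_prompt`, and the `Topic:` prefix itself, are assumptions; whether this actually improves Flan-T5's answers has not been tested here.

```python
from urllib.parse import unquote, urlparse


def topic_from_url(url: str) -> str:
    """Turn the last part of a Wikipedia URL path into a readable title."""
    title = urlparse(url).path.rsplit("/", 1)[-1]
    return unquote(title).replace("_", " ")


def build_prompt(question: str, sentence: str, topic: str) -> str:
    """Build a Flan-T5 style prompt that also names the page topic (hypothetical format)."""
    return f'Topic: "{topic}". Question: "{question}". Context: "{sentence}"'


topic = topic_from_url("https://en.wikipedia.org/wiki/World_War_II")
print(topic)
print(build_prompt("What the Germans do in the war", "example sentence", topic))
```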

