# Assignment 3 Part 1 - Python for Text Processing

*Submission deadline: Friday 3 May 2024, 11:55pm.*

*Assessment marks: 10 marks (10% of the total unit assessment)*



Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. The submission time for all uploaded assessments is 11:55 pm. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for [Special Consideration](https://students.mq.edu.au/study/assessment-exams/special-consideration).

Note that the work submitted should be your own work. You are allowed to use AI-based code generators to help you understand the problem and possible solutions, but you are not allowed to use the code generated by these tools (see below).

You are allowed to base your code on the code presented in the unit lectures and lecture notebooks.

**A note on the use of AI generators**: In this assignment, we view AI code generators such as Copilot, CodeGPT, etc. as tools that can help you write code quickly. You are allowed to use these tools, but with some conditions. To understand what you can and cannot do, please visit the information page provided by Macquarie University.

Artificial Intelligence Tools and Academic Integrity in FSE - https://bit.ly/3uxgQP4

If you choose to use these tools, make the following explicit in your Jupyter notebook, under a section with the heading "Use of AI generators in this assignment":

* What part of your code is based on the output of such tools,
* What tools you used,
* What prompts you used to generate the code or text, and
* What modifications you made on the generated code or text.
  
This will help us assess your work fairly.

## Objectives of this assignment

In Assignment 3 you will work on "query-focused summarisation" of medical questions: given a medical question and a list of sentences extracted from relevant medical publications, the goal is to determine which of these sentences can be used as part of the answer to the question. Assignment 3 is divided into two parts. Part 1 will help you get familiar with the data, and Part 2 requires you to implement deep neural networks.

We will use data that has been derived from the **BioASQ challenge** (http://www.bioasq.org/), after some data manipulation to make it easier to process for this assignment. The BioASQ challenge organises several "shared tasks", including a task on biomedical semantic question answering, which we are using here. The data are in the file `bioasq10b_labelled.csv`, which is part of the zip file provided. Each row of the file has a question, a sentence text, and a label that indicates whether the sentence text is part of the answer to the question (1) or not (0).

The following code uses pandas to load the file `bioasq10b_labelled.csv` into a data frame and show the first rows of data. For this code to run, you first need to unzip the file `data.zip`:

In [1]:
!unzip data.zip

Archive:  data.zip
  inflating: bioasq10b_labelled.csv  
  inflating: dev_test.csv            
  inflating: test.csv                
  inflating: training.csv            


In [1]:
import pandas as pd
dataset = pd.read_csv("bioasq10b_labelled.csv")
dataset.head()

| | qid | sentid | question | sentence text | label |
|---|---|---|---|---|---|
| 0 | 0 | 0 | Is Hirschsprung disease a mendelian or a multi... | Hirschsprung disease (HSCR) is a multifactoria... | 0 |
| 1 | 0 | 1 | Is Hirschsprung disease a mendelian or a multi... | In this study, we review the identification of... | 1 |
| 2 | 0 | 2 | Is Hirschsprung disease a mendelian or a multi... | The majority of the identified genes are relat... | 1 |
| 3 | 0 | 3 | Is Hirschsprung disease a mendelian or a multi... | The non-Mendelian inheritance of sporadic non-... | 1 |
| 4 | 0 | 4 | Is Hirschsprung disease a mendelian or a multi... | Coding sequence mutations in e.g. | 0 |


The columns of the CSV file are:

* `qid`: an ID for a question. Several rows may have the same question ID, as we can see above.
* `sentid`: an ID for a sentence.
* `question`: The text of the question. In the above example, the first rows all have the same question: "Is Hirschsprung disease a mendelian or a multifactorial disorder?"
* `sentence text`: The text of the sentence.
* `label`: 1 if the sentence is a part of the answer, 0 if the sentence is not part of the answer.
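
Because a question is repeated on every row that carries one of its candidate sentences, it can help to separate the unique questions from the labelled sentences early on. The following is a minimal sketch using only the standard library; the three-row sample and the names `SAMPLE_CSV`, `unique_questions` and `answers_by_qid` are made up for illustration and are not part of the assignment template.

```python
import csv
import io

# Hypothetical three-row sample in the same column layout as the assignment data.
SAMPLE_CSV = """qid,sentid,question,sentence text,label
0,0,Is X a disorder?,X is a disorder.,1
0,1,Is X a disorder?,Y was also studied.,0
1,0,What causes Y?,Y is caused by Z.,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# dict.fromkeys keeps the first occurrence of each question, in file order.
unique_questions = list(dict.fromkeys(row["question"] for row in rows))

# Collect the sentences labelled 1 for each question ID.
answers_by_qid = {}
for row in rows:
    if row["label"] == "1":
        answers_by_qid.setdefault(row["qid"], []).append(row["sentence text"])

print(unique_questions)
print(answers_by_qid)
```

With the pandas data frame shown above, `dataset["question"].unique()` returns the unique questions in order of first appearance as well.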

You are provided with a template that contains the definitions of the functions that you need to implement in each of the tasks below. The template includes sample [Python doctests](https://docs.python.org/3/library/doctest.html) that you can use to check the correctness of the code. These tests are there to help you. But note that we will use a separate set of tests when we assess your submission. It is your responsibility to run your own tests, in addition to the doctests provided.
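
As a reminder of how doctests work, here is a small sketch. The function `positive_ratio` is hypothetical and is not one of the functions in the template; it only illustrates how the examples in a docstring are executed and compared with the expected output.

```python
import doctest


def positive_ratio(labels):
    """Return the share of labels equal to 1, rounded to 4 decimal places.

    >>> positive_ratio([0, 1, 1, 1, 0])
    0.6
    >>> positive_ratio([])
    0.0
    """
    if not labels:
        return 0.0
    return round(sum(1 for label in labels if label == 1) / len(labels), 4)


# Run the examples in the docstring; nothing is printed when they all pass.
doctest.run_docstring_examples(
    positive_ratio, {"positive_ratio": positive_ratio}, verbose=False
)
```

Adding your own examples to the docstrings in the same way is one simple way to extend the tests provided.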

## The Tasks

### 1. Statistics of part of speech (3 marks)

Implement a function `stats_pos` that returns the normalized frequency of every part-of-speech tag that appears in the questions and in the answers (namely the `sentence text` column), respectively. To find the parts of speech, use NLTK's "Universal" tag set. You may need to use NLTK's `sent_tokenize` and `word_tokenize` to get the words. Each of the two resulting lists (one for questions, one for answers) must be sorted alphabetically by tag, e.g. `[(ADV,0.1), (NOUN,0.21),...]`.

The input argument of the function is a CSV file path.

To produce the correct results, the function must do this:
* Keep only unique questions.
* Concatenate all questions together. Do the same for the answers.
* Use the NLTK libraries to find the tokens and the stems.
* Use NLTK's sentence tokeniser before NLTK's word tokeniser.
* Use NLTK's part of speech tagger, using the "Universal" tagset.
* Use NLTK's `pos_tag_sents` instead of `pos_tag`.
* In a few sentences, analyse whether the PoS distribution of the questions is similar to that of the answers.
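
The counting and sorting step can be sketched as follows. This is only an illustration, not the reference solution: the tagged sentences are written by hand in the shape that `nltk.pos_tag_sents(..., tagset="universal")` returns (one list of `(token, tag)` pairs per sentence), the helper name `tag_frequencies` is hypothetical, and rounding to four decimal places is an assumption based on the sample output.

```python
from collections import Counter

# Stand-in for the output of nltk.pos_tag_sents(tokenised_sentences, tagset="universal"):
# one list of (token, tag) pairs per sentence. The tags here are written by hand.
tagged_sentences = [
    [("Is", "VERB"), ("it", "PRON"), ("rare", "ADJ"), ("?", ".")],
    [("The", "DET"), ("disease", "NOUN"), ("is", "VERB"), ("rare", "ADJ"), (".", ".")],
    [("Genes", "NOUN")],
]


def tag_frequencies(tagged):
    """Return (tag, relative frequency) pairs sorted alphabetically by tag."""
    counts = Counter(tag for sentence in tagged for _, tag in sentence)
    total = sum(counts.values())
    # Rounding to 4 decimal places is an assumption based on the sample output.
    return sorted((tag, round(count / total, 4)) for tag, count in counts.items())


print(tag_frequencies(tagged_sentences))
```

Note that the punctuation tag `.` sorts before the alphabetic tags, which matches the ordering in the sample output.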


In [2]:
import a3_1
a3_1.stats_pos('dev_test.csv')

([('.', 0.1201),
  ('ADJ', 0.0892),
  ('ADP', 0.1119),
  ('ADV', 0.011),
  ('CONJ', 0.0085),
  ('DET', 0.085),
  ('NOUN', 0.3536),
  ('NUM', 0.0056),
  ('PRON', 0.0377),
  ('PRT', 0.0104),
  ('VERB', 0.1659),
  ('X', 0.0011)],
 [('.', 0.123),
  ('ADJ', 0.1203),
  ('ADP', 0.1173),
  ('ADV', 0.0244),
  ('CONJ', 0.0349),
  ('DET', 0.0771),
  ('NOUN', 0.3466),
  ('NUM', 0.0186),
  ('PRON', 0.0099),
  ('PRT', 0.0158),
  ('VERB', 0.1115),
  ('X', 0.0008)])

### 2. Statistics of the top stem n-grams (3 marks)

Implement a function `stats_top_stem_ngrams` that returns the N most frequent n-grams of stems together with their normalized frequencies for questions and answers, respectively. You must return two lists (one for questions and one for answers), each sorted in descending order of frequency, e.g. for 1-grams each returned list should look like `[(how,0.18),(study,0.06),...]`.

The input arguments are:

* csv_file_path
* *n*: The size of the n-grams, that is, the number of words in each n-gram. E.g. if n=2, the n-grams are 2-grams.
* *N*: The number of most frequent n-grams to return.

To produce the correct results, the function must do this:

* Keep only unique questions.
* Concatenate all questions together. Do the same for the answers.
* Use the NLTK libraries to find the tokens and the stems.
* Use NLTK's sentence tokeniser before NLTK's word tokeniser.
* Use NLTK's Porter stemmer to get the root words.
* When computing bigrams, do not consider words that are in different sentences. For example, if we have this text: "Sentence 1. And sentence 2." the bigrams are: `('Sentence','1'), ('1','.'), ('And','sentence'), ('sentence','2'), ('2','.')`. Note that the following would not be a valid bigram, since the punctuation mark and the word "And" are in different sentences: `('.','And')`.
* Set n=2 and N=5, then use a few sentences to describe the overlap between questions and answers.
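
The sentence-boundary rule can be sketched as follows, using the example text from the list above. This is an illustration under stated assumptions rather than the reference solution: the sentences are tokenised by hand and not stemmed, the helper names `sentence_ngrams` and `top_ngrams` are hypothetical, and dividing by the total number of n-grams and rounding to four decimal places are assumptions.

```python
from collections import Counter

# Already tokenised sentences for the text "Sentence 1. And sentence 2."
# In the assignment the tokens would come from NLTK and be stemmed first.
sentences = [["Sentence", "1", "."], ["And", "sentence", "2", "."]]


def sentence_ngrams(tokens, n):
    """Return the n-grams of one sentence as a list of tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(tokenised_sentences, n, top):
    """Return the `top` most frequent n-grams with their relative frequency."""
    counts = Counter()
    for tokens in tokenised_sentences:
        # Building n-grams one sentence at a time keeps them from spanning a boundary.
        counts.update(sentence_ngrams(tokens, n))
    total = sum(counts.values())
    # Dividing by the total n-gram count and rounding to 4 places are assumptions.
    return [(gram, round(count / total, 4)) for gram, count in counts.most_common(top)]


bigrams = [gram for tokens in sentences for gram in sentence_ngrams(tokens, 2)]
print(bigrams)
print(top_ngrams(sentences, 2, 3))
```

Because the n-grams are built per sentence, the invalid pair `('.','And')` never appears in `bigrams`.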

In [3]:
import a3_1
a3_1.stats_top_stem_ngrams('dev_test.csv',2,5)

([(('what', 'is'), 0.0294),
  (('is', 'the'), 0.0265),
  (('of', 'the'), 0.0104),
  (('in', 'the'), 0.006),
  (('are', 'the'), 0.0055)],
 [(('of', 'the'), 0.0065),
  (('in', 'the'), 0.0055),
  ((',', 'and'), 0.0053),
  ((')', ','), 0.004),
  (('is', 'a'), 0.003)])

### 3. Statistics of named entities (2 marks)

Implement a function `stats_ne` that returns the normalized frequency of all named entity types for questions and answers, respectively. Use spaCy's default entity types. The two resulting lists have the same format as in Task 1.

The results vary between named entity recognition models. To be consistent, you are required to use spaCy's `en_core_web_sm` model.

In [4]:
import a3_1
a3_1.stats_ne('dev_test.csv')

([('CARDINAL', 0.0966),
  ('DATE', 0.0207),
  ('EVENT', 0.0046),
  ('FAC', 0.0046),
  ('GPE', 0.1172),
  ('LAW', 0.0092),
  ('LOC', 0.0138),
  ('NORP', 0.0529),
  ('ORDINAL', 0.0115),
  ('ORG', 0.3977),
  ('PERCENT', 0.0023),
  ('PERSON', 0.2115),
  ('PRODUCT', 0.0483),
  ('QUANTITY', 0.0023),
  ('WORK_OF_ART', 0.0069)],
 [('CARDINAL', 0.2141),
  ('DATE', 0.0528),
  ('EVENT', 0.001),
  ('FAC', 0.0037),
  ('GPE', 0.0717),
  ('LANGUAGE', 0.0001),
  ('LAW', 0.0044),
  ('LOC', 0.0078),
  ('MONEY', 0.0021),
  ('NORP', 0.0348),
  ('ORDINAL', 0.0202),
  ('ORG', 0.3784),
  ('PERCENT', 0.0303),
  ('PERSON', 0.1364),
  ('PRODUCT', 0.0288),
  ('QUANTITY', 0.0065),
  ('TIME', 0.0025),
  ('WORK_OF_ART', 0.0044)])

### 4. Statistics of tf.idf-based similarity (2 marks)

Implement a function `stats_tfidf` that returns the proportion of questions whose most similar sentence is one of their answers. That means you need to calculate the cosine similarity between each question and all sentences in the `sentence text` column, and check whether the sentence with the highest similarity is among the answers to that question. To compute the tf.idf, use sklearn's `TfidfVectorizer` with the option to remove the English stop words (`stop_words='english'`).

To produce correct results, the function must do this:

* Use Scikit-learn's `TfidfVectorizer` with the option `stop_words='english'`.
* Fit the tfidf vectorizer using all unique questions and answers.
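
The final ratio can be sketched independently of the vectoriser. In the sketch below the similarity values are made up and stand in for the cosine similarities between tf.idf vectors; the helper name `hit_ratio` is hypothetical, and treating "falls in its answers" as "has the same `qid` and a label of 1" is an assumed reading of the task, as is rounding to four decimal places.

```python
# Hypothetical toy data: two questions and four candidate sentences.
# Each sentence records the question it belongs to and its label.
question_ids = [0, 1]
sentences = [
    {"qid": 0, "label": 1},
    {"qid": 0, "label": 0},
    {"qid": 1, "label": 0},
    {"qid": 1, "label": 1},
]

# similarities[i][j] stands in for the cosine similarity between question i
# and sentence j; the numbers are made up for illustration.
similarities = [
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.0, 0.7, 0.4],
]


def hit_ratio(question_ids, sentences, similarities):
    """Share of questions whose most similar sentence is one of their answers."""
    hits = 0
    for qid, row in zip(question_ids, similarities):
        # Index of the sentence with the highest similarity to this question.
        best = max(range(len(row)), key=row.__getitem__)
        match = sentences[best]
        # Assumed reading: a hit needs the same question ID and a label of 1.
        if match["qid"] == qid and match["label"] == 1:
            hits += 1
    return round(hits / len(question_ids), 4)


print(hit_ratio(question_ids, sentences, similarities))
```

In this toy example the first question's best match is one of its answers and the second question's is not, so the ratio is 0.5.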

In [5]:
import a3_1
a3_1.stats_tfidf('dev_test.csv')

0.4876

## Submission

The submission must be a single Python file. Do not submit several files or a zip file since the automarker would not know what to do with your submission. Do not submit a Jupyter notebook.