# Assignment 3 Part 2 - Find complex answers to medical questions

*Submission deadline: Friday 24 May 2024, 11:55pm.*

*Assessment marks: 15 marks (15% of the total unit assessment)*

Unless a Special Consideration request has been submitted and approved, a 5% penalty (of the total possible mark of the task) will be applied for each day a written report or presentation assessment is not submitted, up until the 7th day (including weekends). After the 7th day, a grade of ‘0’ will be awarded even if the assessment is submitted. The submission time for all uploaded assessments is 11:55 pm. A 1-hour grace period will be provided to students who experience a technical concern. For any late submission of time-sensitive tasks, such as scheduled tests/exams, performance assessments/presentations, and/or scheduled practical assessments/labs, please apply for [Special Consideration](https://students.mq.edu.au/study/assessment-exams/special-consideration).

Note that the work submitted should be your own work. You are allowed to use AI-based code generators to help you understand the problem and possible solutions, but you are not allowed to use the code generated by these tools (see below).

You are allowed to base your code on the code presented in the unit lectures and lecture notebooks.

**A note on the use of AI generators**: In this assignment, we view AI code generators such as Copilot, CodeGPT, etc. as tools that can help you write code quickly. You are allowed to use these tools, but with some conditions. To understand what you can and cannot do, please visit the information page provided by Macquarie University.

[Artificial Intelligence Tools and Academic Integrity in FSE](https://bit.ly/3uxgQP4)

If you choose to use these tools, make the following explicit in your Jupyter notebook, under a section with the heading "Use of AI generators in this assignment":

* What part of your code is based on the output of such tools,
* What tools you used,
* What prompts you used to generate the code or text, and
* What modifications you made on the generated code or text.
  
This will help us assess your work fairly.



# Overall Task Review

In Assignment 3 you will work on a "query-focused summarisation" task on medical questions. Given a medical question and a list of sentences extracted from relevant medical publications, the goal is to determine which of these sentences can be used as part of the answer to the question. Assignment 3 is divided into two parts. Part 1 will help you get familiar with the data, and Part 2 requires you to implement deep neural networks.

We will use data that has been derived from the **BioASQ challenge** (http://www.bioasq.org/), after some data manipulation to make it easier to process for this assignment. The BioASQ challenge organises several "shared tasks", including a task on biomedical semantic question answering which we are using here. The data are in the file `bioasq10b_labelled.csv`, which is part of the zip file provided. Each row of the file has a question, a sentence text, and a label that indicates whether the sentence text is part of the answer to the question (1) or not (0).

# Data Review

The following code uses pandas to load the file `bioasq10b_labelled.csv` into a data frame and show the first rows of data. For this code to run, you first need to unzip the file `data.zip`:

In [None]:
!unzip data.zip

In [2]:
import pandas as pd
dataset = pd.read_csv("bioasq10b_labelled.csv")
dataset.head()

| Unnamed: 0 | qid | sentid | question | sentence text | label |
|---|---|---|---|---|---|
| 0 | 0 | 0 | Is Hirschsprung disease a mendelian or a multi... | Hirschsprung disease (HSCR) is a multifactoria... | 0 |
| 1 | 0 | 1 | Is Hirschsprung disease a mendelian or a multi... | In this study, we review the identification of... | 1 |
| 2 | 0 | 2 | Is Hirschsprung disease a mendelian or a multi... | The majority of the identified genes are relat... | 1 |
| 3 | 0 | 3 | Is Hirschsprung disease a mendelian or a multi... | The non-Mendelian inheritance of sporadic non-... | 1 |
| 4 | 0 | 4 | Is Hirschsprung disease a mendelian or a multi... | Coding sequence mutations in e.g. | 0 |


The columns of the CSV file are:

* `qid`: an ID for a question. Several rows may have the same question ID, as we can see above.
* `sentid`: an ID for a sentence.
* `question`: The text of the question. In the above example, the first rows all have the same question: "Is Hirschsprung disease a mendelian or a multifactorial disorder?"
* `sentence text`: The text of the sentence.
* `label`: 1 if the sentence is a part of the answer, 0 if the sentence is not part of the answer.
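
Because several rows share a `qid`, it can be useful to check how many candidate sentences and how many positive labels each question has. The following is a minimal sketch that uses only the standard library and a tiny made-up sample in place of the real CSV file; the sample rows and the helper name `label_counts` are illustrative assumptions, not part of the provided data or code.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample with the same columns as the CSV described above.
SAMPLE_CSV = """qid,sentid,question,sentence text,label
0,0,Q zero?,First sentence.,0
0,1,Q zero?,Second sentence.,1
0,2,Q zero?,Third sentence.,1
1,0,Q one?,Another sentence.,0
"""

def label_counts(rows):
    """Map each qid to (number of sentences, number of positive sentences)."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        qid = int(row["qid"])
        totals[qid] += 1
        positives[qid] += int(row["label"])
    return {qid: (totals[qid], positives[qid]) for qid in totals}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = label_counts(rows)
print(counts)
```

The same counts could be obtained with a pandas `groupby` on `qid`; the plain-Python version is used here only to keep the sketch free of dependencies.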

# Now let's get started on the Part 2 tasks

Use the provided files `training.csv`, `dev_test.csv`, and `test.csv` in the `data.zip` file for all the tasks below.

# Task 1 (5 marks): Simple Siamese NN

Implement a simple TensorFlow-Keras neural model that has the following sequence of layers:

1. An input layer that accepts the tf.idf representation of the triplet data. The input of the Siamese network is a triplet consisting of an anchor (i.e., the question), a positive answer, and a negative answer.
2. 3 hidden layers with a relu activation function. You need to determine the size of the hidden layers.
3. Implement a class that serves as a distance layer. It returns the squared Euclidean distance between the anchor and the positive answer, as well as that between the anchor and the negative answer.
4. Implement a function that converts the raw data in the CSV files into triplets. Note that it is important to keep the numbers of positive and negative pairs similar. For example, if a question has 10 answers, then we can have at most 10 positive pairs, and it is good to associate this question with 10 to 20 negative sentences.
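
The triplet preparation described in the last item can be sketched in plain Python. This is only one possible approach under stated assumptions: the function name `make_triplets`, the cap `max_neg_per_pos`, and the random sampling strategy are all hypothetical choices, and the rows are assumed to be dictionaries with the column names of the CSV files.

```python
import random
from collections import defaultdict

def make_triplets(rows, max_neg_per_pos=2, seed=0):
    """Build (question, positive sentence, negative sentence) triplets.

    rows: iterable of dicts with keys "qid", "question", "sentence text", "label".
    max_neg_per_pos: cap on negatives paired with each positive (assumed value),
        which keeps the positive and negative pairs roughly balanced.
    """
    rng = random.Random(seed)
    by_question = defaultdict(lambda: {"question": None, "pos": [], "neg": []})
    for row in rows:
        entry = by_question[row["qid"]]
        entry["question"] = row["question"]
        key = "pos" if int(row["label"]) == 1 else "neg"
        entry[key].append(row["sentence text"])

    triplets = []
    for entry in by_question.values():
        if not entry["pos"] or not entry["neg"]:
            continue  # a triplet needs at least one positive and one negative
        k = min(max_neg_per_pos, len(entry["neg"]))
        for positive in entry["pos"]:
            for negative in rng.sample(entry["neg"], k):
                triplets.append((entry["question"], positive, negative))
    return triplets

# Made-up rows: question 0 has 2 positives and 3 negatives, question 1 has no positive.
example_rows = [
    {"qid": 0, "question": "Q0", "sentence text": "p1", "label": 1},
    {"qid": 0, "question": "Q0", "sentence text": "p2", "label": 1},
    {"qid": 0, "question": "Q0", "sentence text": "n1", "label": 0},
    {"qid": 0, "question": "Q0", "sentence text": "n2", "label": 0},
    {"qid": 0, "question": "Q0", "sentence text": "n3", "label": 0},
    {"qid": 1, "question": "Q1", "sentence text": "n4", "label": 0},
]
example_triplets = make_triplets(example_rows, max_neg_per_pos=2)
print(len(example_triplets))
```

Questions without any positive (or any negative) sentence are skipped in this sketch, since no complete triplet can be formed for them.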


Train the model with the training data and use the `dev_test` set to determine a good size of the hidden layers.

With the model that you have trained, implement a summariser that returns the $n$ sentences with the highest predicted scores. Use the following function signature:

```{python}
def nn_summariser(csvfile, questionids, n=1):
   """Return the IDs of the n sentences that have the highest predicted score. 
      The input questionids is a list of question ids. 
      The output is a list of lists of sentence ids
   """

```
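
The ranking step inside a summariser like `nn_summariser` can be kept separate from the model. The sketch below assumes that a score has already been predicted for every sentence of a question and that a higher score means a better candidate (with a distance-based Siamese model, the negated distance to the question could play this role). The helper name `top_n_sentids` and the example scores are hypothetical.

```python
def top_n_sentids(scored_sentences, n=1):
    """Return the ids of the n highest-scoring sentences for one question.

    scored_sentences: list of (sentid, score) pairs for a single question.
    Ties are broken by the lower sentid so that the result is deterministic.
    """
    ranked = sorted(scored_sentences, key=lambda pair: (-pair[1], pair[0]))
    return [sentid for sentid, _ in ranked[:n]]

# Made-up scores for two questions.
scores_by_question = {
    0: [(0, 0.10), (1, 0.85), (2, 0.40)],
    1: [(0, 0.30), (1, 0.30)],
}
result = [top_n_sentids(scores_by_question[qid], n=1) for qid in [0, 1]]
print(result)  # [[1], [0]]
```

The output is a list of lists of sentence ids, one inner list per question id, which matches the return value described in the docstring above.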

Report the final results using the test set. Remember: use the test set to report the final results of the best system only.

The breakdown of marks is as follows:

* **1 mark** if the NN model has the correct layers, the correct activation functions, and the correct loss function.
* **1 mark** if the code passes input to the model correctly.
* **1 mark** if the code returns the IDs of the $n$ sentences that have the highest prediction score for the given question.
* **1 mark** if the notebook reports the F1 scores of the test sets and comments on the results.
* **1 mark** for good coding and documentation in this task. In particular, the code and results must include evidence that shows your choice of best size of the hidden layer. The explanations must be clear and concise. To make this task less time-consuming, use $n=1$.
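
For the F1 score mentioned in the breakdown, one possible way to compute it is over (question id, sentence id) pairs, treating the sentences returned by the summariser as predicted positives and the sentences with label 1 as gold positives. This is a sketch under that assumption; the unit may expect a different averaging scheme, and the function name `f1_score_from_sets` is hypothetical.

```python
def f1_score_from_sets(predicted, gold):
    """F1 over (qid, sentid) pairs.

    predicted: set of (qid, sentid) pairs selected by the summariser.
    gold: set of (qid, sentid) pairs whose label is 1.
    """
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Made-up example: one of the two selected sentences is correct,
# and one of the four gold sentences is found.
predicted = {(0, 1), (1, 0)}
gold = {(0, 1), (0, 2), (0, 3), (1, 2)}
f1 = f1_score_from_sets(predicted, gold)
print(round(f1, 3))
```

With $n=1$ the summariser selects a single sentence per question, so recall is bounded by the number of questions divided by the number of gold sentences; this is worth keeping in mind when commenting on the results.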

In [5]:
# Write your code and answers here. Feel free to add more code and markdown cells.

# Task 2 (5 marks): Recurrent NN

Implement a more complex Siamese neural network that is composed of the following layers:

* An embedding layer that generates embedding vectors of the sentence text with 35 dimensions.
* An LSTM layer. You need to determine the size of this LSTM layer, and the text length limit (if needed).
* 3 hidden layers and a relu activation function. You need to determine the size of the hidden layers.

Train the model with the training data, use the `dev_test` set to determine a good size of the LSTM layer and an appropriate length limit (if needed), and report the final results using the test set. Again, remember to use the test set only after you have determined the optimal parameters of the LSTM layer.
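
If a length limit is needed, one common heuristic is to pick a length that covers most of the training sentences and then truncate or pad every sequence to that length. The sketch below illustrates this in plain Python; the helper names, the 90% coverage value, and the padding id 0 are assumptions for illustration, not requirements of the task.

```python
import math

def choose_length_limit(lengths, coverage_percent=90):
    """Smallest length that covers at least coverage_percent of the sequences."""
    ordered = sorted(lengths)
    index = max(0, math.ceil(coverage_percent * len(ordered) / 100) - 1)
    return ordered[index]

def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len, or right-pad it with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# Made-up token counts: one outlier sentence is much longer than the rest.
lengths = [3, 5, 4, 6, 30, 5, 4, 7, 5, 6]
limit = choose_length_limit(lengths, coverage_percent=90)
print(limit)  # 7
print(pad_or_truncate([11, 12, 13], 5))              # [11, 12, 13, 0, 0]
print(pad_or_truncate([11, 12, 13, 14, 15, 16], 4))  # [11, 12, 13, 14]
```

Choosing the limit from a percentile rather than the maximum avoids padding every sentence to the length of a few outliers, at the cost of truncating those outliers.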

Based on your experiments, comment on whether this system is better than the systems developed in the previous tasks.

The breakdown of marks is as follows:

* **1 mark** if the NN model has the correct layers, the correct activation functions, and the correct loss function.
* **1 mark** if the code passes the sentence text to the model correctly. The documentation needs to explain what decisions had to be made to process long sentences. In particular, did you need to truncate the input text, and how did you determine the length limit?
* **1 mark** if the code returns the IDs of the *n* sentences that have the highest prediction score for the given question.
* **1 mark** if the notebook reports the F1 scores of the test sets and comments on the results.
* **1 mark** for good coding and documentation in this task. In particular, the code and results must include evidence that shows your choice of best size of the LSTM layer (and length limit) and hidden layers. The explanations must be clear and concise. To make this task less time-consuming, use $n=1$.

In [3]:
# Write your code and answers here. Feel free to add more code and markdown cells.

# Task 3 (5 marks): Transformer

Implement a simple Transformer neural network that is composed of the following layers:

* Use BERT as feature extractor for each token.
* A few transformer encoder layers, with hidden dimension 768. You need to determine how many layers to use, between 1 and 3.
* A few transformer decoder layers, with hidden dimension 768. You need to determine how many layers to use, between 1 and 3.
* 1 hidden layer with size 512.
* The final output layer with one cell for binary classification to predict whether two inputs are related or not. 

Note that each input for this model should be the concatenation of a positive pair (i.e. question + one answer) or a negative pair (i.e. question + unrelated sentence). The format is usually `[CLS]` + question + `[SEP]` + a positive/negative sentence.
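
The pair format can be illustrated with a small sketch. Whitespace splitting stands in for a real BERT tokeniser here (an assumption made to keep the example self-contained), and the helper name `build_pair_input` is hypothetical. Standard BERT tokenisers usually also append a closing `[SEP]` after the second sentence; the sketch follows the simpler format given in the text.

```python
def build_pair_input(question, sentence):
    """Return (tokens, segment_ids) for one question/sentence pair.

    Whitespace splitting stands in for a real BERT tokeniser (assumption).
    Segment id 0 marks [CLS], the question and [SEP]; 1 marks the sentence.
    """
    question_tokens = question.split()
    sentence_tokens = sentence.split()
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + sentence_tokens
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * len(sentence_tokens)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    "Is HSCR multifactorial ?", "HSCR is a multifactorial disorder ."
)
print(tokens)
print(segment_ids)
```

The segment ids let the model tell the question apart from the candidate sentence, and the label of the pair (1 for a positive pair, 0 for a negative pair) is the target of the binary output layer.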

Train the model with the training data, use the `dev_test` set to determine a good number of transformer layers, and report the final results using the test set. Again, remember to use the test set only after you have determined the optimal parameters of the transformer layers.

Based on your experiments, comment on whether this system is better than the systems developed in the previous tasks.

The breakdown of marks is as follows:

* **1 mark** if the model has the correct layers, the correct activation functions, and the correct loss function.
* **1 mark** if the code passes the sentence text to the model correctly. The documentation needs to explain how differences in length within a batch of data are handled.
* **1 mark** if the code returns the IDs of the *n* sentences that have the highest prediction score for the given question.
* **1 mark** if the notebook reports the F1 scores of the test sets and comments on the results.
* **1 mark** for good coding and documentation in this task. In particular, the code and results must include evidence that shows your choice of best size of the transformer layers. The explanations must be clear and concise. To make this task less time-consuming, use $n=1$.
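
One common way to handle length differences within a batch is to pad every sequence to the longest one in that batch and to pass an attention mask that marks the real tokens. The following is a minimal sketch of that idea in plain Python; the helper name `pad_batch`, the padding id 0, and the example token ids are assumptions for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build a 0/1 mask.

    batch: list of token-id lists of varying length.
    Returns (padded, mask), where mask is 1 for real tokens and 0 for padding.
    """
    longest = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return padded, mask

# Made-up token ids for a batch of three inputs.
padded, mask = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 102]])
print(padded)
print(mask)
```

Padding per batch rather than to a single global length keeps batches of short inputs small, and the mask prevents the attention layers from attending to the padding positions.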

# Submission 

Your submission should consist of this Jupyter notebook with all your code and explanations inserted into the notebook as text cells. **The notebook should contain the output of the runs. All code should run. Code with syntax errors or code without output will not be assessed.**

**Do not submit multiple files.**

Examine the text cells of this notebook so that you can have an idea of how to format text for good visual impact. You can also read this useful [guide to the Markdown notation](https://daringfireball.net/projects/markdown/syntax), which explains the format of the text cells.