# Word Representation in Biomedical Domain

Before you start, please make sure you have read this notebook. You are encouraged to follow the recommendations but you are also free to develop your own solution from scratch. 

## Marking Scheme

- Biomedical imaging project: 40%
    - 20%: accuracy of the final model on the test set
    - 20%: rationale of model design and final report
- Natural language processing project: 40%
    - 30%: completeness of the project
    - 10%: final report
- Presentation skills and team work: 20%


This project forms 40\% of the total score for summer/winter school. The marking scheme of each part of this project is provided below with a cap of 100\%.

You are allowed to use open source libraries as long as the libraries are properly cited in the code and final report. The usage of third-party code without proper reference will be treated as plagiarism, which will not be tolerated.

You are encouraged to develop the algorithms by yourselves (without using third-party code as much as possible). We will factor such effort into the marking process.

## Setup and Prerequisites 

Recommended environment

- Python 3.7 or newer
- Free disk space: 100GB

Download the data

## Part 1 (20%): Parse the Data

The JSON files are located in two sub-folders in `document_parses`. You will need to scan all JSON files and extract text (i.e. `string`) from relevant fields (e.g. body text, abstract, titles).

You are encouraged to extract the full article text from the body text if possible. If hardware resources are limited, you can extract from abstracts or titles instead.

Note: The number of JSON files is around 425k so it may take more than 10 minutes to parse all documents.

For more information about the dataset: https://www.semanticscholar.org/cord19/download

Recommended output:

- A list of text (`string`) extracted from JSON files.
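
As a minimal sketch of the extraction step, the helper below builds one string from a single parsed document. It assumes that `metadata.title` is a string and that `abstract` and `body_text` are lists of objects with a `text` field; the name `extract_fields` and the `use_body` flag are hypothetical, and the toy `example` dictionary is made up for illustration.

```python
def extract_fields(doc, use_body=True):
    """Return one string per document built from title, abstract and body text.

    Assumes `doc` is a dict loaded from one JSON file; missing fields are
    treated as empty instead of raising KeyError.
    """
    parts = []
    title = doc.get("metadata", {}).get("title", "")
    if title:
        parts.append(title)
    # The abstract is not guaranteed to be present in every parse.
    parts.extend(p.get("text", "") for p in doc.get("abstract", []))
    if use_body:
        parts.extend(p.get("text", "") for p in doc.get("body_text", []))
    # Join with spaces so that words at paragraph boundaries are not fused.
    return " ".join(part.strip() for part in parts if part.strip())


# Toy document, made up for illustration only.
example = {
    "metadata": {"title": "A study of fever"},
    "abstract": [{"text": "Fever is common."}],
    "body_text": [{"text": "We observed cough."}, {"text": "And fatigue."}],
}
full_text = extract_fields(example)
short_text = extract_fields(example, use_body=False)
```

Setting `use_body=False` corresponds to the lighter option of using only titles and abstracts when hardware is limited.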

In [None]:
###################
import os
import json
import string

#the function for eliminating punctuations
def remove_punctuation(text, keep_characters="'-"):
    punctuation = ''.join([char for char in string.punctuation if char not in keep_characters])
    translator = str.maketrans('', '', punctuation)
    return text.translate(translator)

#the function for extracting text from json files where titles and main bodies are stored respectively
def extract_text_from_json(directory):
    text_data = []
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                full_text_dict = json.load(file)
            # collect the title and every body paragraph, then join with spaces
            # so that words at paragraph boundaries are not fused together
            parts = [full_text_dict['metadata']['title']]
            parts.extend(paragraph_dict['text'] for paragraph_dict in full_text_dict['body_text'])
            element = remove_punctuation(' '.join(parts), keep_characters="'-")
            text_data.append(element)
    return text_data

directory1 = 'document_parses/pdf_json'
directory2 = 'document_parses/pmc_json'
text_data = extract_text_from_json(directory1) +extract_text_from_json(directory2)



###################

## Part 2 (30%): Tokenization

Traverse the extracted text and segment the text into words (or tokens).

The following tracks can be developed independently. You are encouraged to divide the workload among team members.

Recommended output:

- Tokenizer(s) that are able to tokenize any input text.

Note: Because of the computational complexity of tokenizers, it may take hours or days to process all documents. Which tokenizer is more efficient? Any ideas for speeding it up?

### Track 2.1 (10%): Use split()

Use Python's built-in `str.split()`.
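
To illustrate what whitespace splitting does and does not handle, here is a small sketch on a made-up sentence. It shows that punctuation stays attached to neighbouring words, and one simple (assumed, not prescribed) remedy of stripping punctuation from token edges.

```python
import string

sentence = "The main symptoms of COVID-19 are fever, cough and fatigue."

# split() with no arguments splits on any run of whitespace.
tokens = sentence.split()

# Punctuation stays attached, so "fever," and "fatigue." would be counted as
# different words from "fever" and "fatigue". Stripping punctuation from the
# edges of each token keeps inner characters such as the hyphen in "COVID-19".
cleaned = [token.strip(string.punctuation) for token in tokens]
```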

### Track 2.2 (10%): Use NLTK or SciSpaCy

NLTK tokenizer: https://www.nltk.org/api/nltk.tokenize.html

SciSpaCy: https://github.com/allenai/scispacy

Note: You may need to install NLTK and SpaCy so please refer to their websites for installation instructions.

### Track 2.3 (10%): Use Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE): https://huggingface.co/transformers/tokenizer_summary.html

Note: You may need to install Huggingface's transformers so please refer to its website for installation instructions.

### Track 2.4 (Bonus +5%): Build new Byte-Pair Encoding (BPE)

This track may be dependent on track 2.3.

The above pre-built tokenization methods may not be suitable for the biomedical domain, as the words/tokens (e.g. diseases, symptoms, chemicals, medications, phenotypes, genotypes etc.) can be very different from the words/tokens commonly used in daily life. Can you build and train a new BPE model specifically for the biomedical domain?
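
As a starting point, the sketch below shows the core of BPE training in plain Python: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a toy illustration under stated assumptions, not an efficient implementation: the function names `learn_bpe` and `apply_bpe`, the `</w>` end-of-word marker, the tie-breaking rule and the five-word corpus are all choices made here for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from an iterable of words (a toy sketch)."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols


# Toy corpus, made up for illustration only.
corpus = ["virus", "virus", "viral", "virology", "coronavirus"]
merges = learn_bpe(corpus, num_merges=3)
pieces = apply_bpe("antiviral", merges)
```

On this toy corpus the shared stem `vir` becomes a single symbol after two merges, so an unseen word such as `antiviral` is segmented around it. A model trained on biomedical text would be expected to learn domain-specific subwords in the same way.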

### Open Question (Optional):

- What are the pros and cons of the above tokenizers?

In [None]:
###################
# 1.simple split
split_tokens = [text.split() for text in text_data]

In [None]:
# 2. NLTK Tokenization
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data looked up by newer NLTK releases
nltk.download('stopwords')
nltk_tokens = [word_tokenize(text) for text in text_data]

In [None]:
# 3.Scispacy for biomedical text
import spacy


nlp = spacy.load("en_core_sci_sm")
nlp.max_length = 2000000

def preprocess_biomedical_text(text):
    doc = nlp(text)
    # map the index of each entity's first token to the entity span
    entity_starts = {entity.start: entity for entity in doc.ents}
    tokens = []
    i = 0
    while i < len(doc):
        if i in entity_starts:
            # emit a multi-word entity as one token and skip past it
            tokens.append(entity_starts[i].text)
            i = entity_starts[i].end
        else:
            tokens.append(doc[i].text)
            i += 1
    return tokens  # tokens stay in their original order

scispacy_tokens = [preprocess_biomedical_text(text) for text in text_data]


In [None]:
# 4. Byte-Pair Encoding (BPE) Tokenization
import sentencepiece as spm
def train_bpe(text_data, model_prefix='bpe_model'):
    with open('corpus.txt', 'w', encoding='utf-8') as f:
        for text in text_data:
            f.write(text + '\n')
    # model_type=bpe is required: SentencePiece trains a unigram model by default.
    # vocab_size=10000 follows the 10k limit suggested in Part 3; lower it for a small corpus.
    spm.SentencePieceTrainer.Train(f'--input=corpus.txt --model_prefix={model_prefix} --vocab_size=10000 --model_type=bpe')

train_bpe(text_data)
sp = spm.SentencePieceProcessor(model_file='bpe_model.model')
bpe_tokens = [sp.encode_as_pieces(text) for text in text_data]
###################

## Part 3 (30%): Build Word Representations

Build word representations for each extracted word. If hardware resources are limited, you may limit the vocabulary size to 10k words/tokens (or even fewer) and the dimension of the representations to 256.

The following tracks can be developed independently. You are encouraged to divide the workload to each team member.

### Track 3.1 (15%): Use N-gram Language Modeling

N-gram Language Modeling predicts a target word from the `n-1` words that precede it. Specifically,

$P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-n+1})$

For example, given a sentence, `"the main symptoms of COVID-19 are fever and cough"`, if $n=7$, we use previous context `["the", "main", "symptoms", "of", "COVID-19", "are"]` to predict the next word `"fever"`.

More to read: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Recommended outputs:

- A fixed vector for each word/token.
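
The sketch below illustrates how training examples for an n-gram model can be built: each target word is paired with the `n-1` words before it. It also computes count-based estimates of $P(w_i | \text{context})$. The function names and the second toy sentence are assumptions made for illustration. Note that counts alone do not give word vectors; one option (not shown) is to train a neural model with an embedding layer on the same pairs and keep the learned embeddings.

```python
from collections import Counter, defaultdict


def ngram_pairs(tokens, n):
    """Yield (context, target) pairs; the context is the previous n-1 tokens."""
    for i in range(n - 1, len(tokens)):
        yield tuple(tokens[i - n + 1:i]), tokens[i]


def ngram_probabilities(sentences, n):
    """Maximum-likelihood estimate of P(target | context) from raw counts."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for context, target in ngram_pairs(tokens, n):
            counts[context][target] += 1
    return {
        context: {w: c / sum(targets.values()) for w, c in targets.items()}
        for context, targets in counts.items()
    }


sentence = "the main symptoms of COVID-19 are fever and cough".split()

# With n=7 the first pair uses six context words to predict "fever".
pairs = list(ngram_pairs(sentence, n=7))

# Bigram (n=2) estimates over two toy sentences.
probs = ngram_probabilities([sentence, "the main symptoms are fever".split()], n=2)
```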

### Track 3.2 (15%): Use Skip-gram with Negative Sampling

In skip-gram, we use a central word to predict its context. Specifically,

$P(w_{c-m}, ... w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)$

As the learning objective of skip-gram is computationally inefficient (it requires a summation over the entire vocabulary $|V|$), negative sampling is commonly applied to accelerate the training.

In negative sampling, we randomly select one word from the context as a positive sample, and randomly select $K$ words from the vocabulary as negative samples. As a result, the learning objective is updated to

$L = -\log\sigma(u^T_{t} v_c) - \sum_{k=1}^K\log\sigma(-u^T_k v_c)$, where $u_t$ is the vector embedding of positive sample from context, $u_k$ are the vector embeddings of negative samples, $v_c$ is the vector embedding of the central word, $\sigma$ refers to the sigmoid function.

More to read: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf (sections 4.3 and 4.4)

Recommended outputs:

- A fixed vector for each word/token.
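
To make the negative sampling objective concrete, the sketch below evaluates $L$ for one central word, one positive context word and $K$ negative samples, using plain Python lists as vectors. The function names and the toy vectors are made up for illustration; a real implementation would use a tensor library and update the vectors by gradient descent.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_c, u_t, negatives):
    """L = -log sigma(u_t . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_t, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


# Toy 2-d vectors, made up for illustration only.
v_c = [0.5, -0.2]                        # central word
u_t = [0.4, 0.1]                         # positive sample from the context
negatives = [[-0.3, 0.2], [0.1, -0.6]]   # K = 2 negative samples
loss = negative_sampling_loss(v_c, u_t, negatives)

# With all-zero vectors every sigmoid equals 0.5, so L = (K + 1) * log 2.
zero_loss = negative_sampling_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]] * 3)
```

The loss decreases when the positive pair has a large dot product and the negative pairs have small (or negative) dot products, and each update touches only $K+1$ output vectors instead of the whole vocabulary.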

### Track 3.3 (Bonus +5%): Use Contextualised Word Representation by Masked Language Model (MLM)

BERT introduces a new language model for pre-training named Masked Language Model (MLM). The advantage of MLM is that the word representations by MLM will be contextualised.

For example, "stick" may have different meanings in different contexts. With N-gram language modeling and word2vec (skip-gram, CBOW), the word representation of "stick" is fixed regardless of its context. However, MLM will learn the representation of "stick" dynamically based on context. In other words, "stick" will have different representations in different contexts under MLM.

More to read: http://jalammar.github.io/illustrated-bert/ and https://arxiv.org/pdf/1810.04805.pdf

Recommended outputs:

- An algorithm that is able to generate contextualised representation in real time.
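
The sketch below shows, in simplified form, how training inputs for a masked language model can be prepared: some tokens are replaced by a mask symbol and the model is asked to recover them. The function name `mask_tokens`, the 30% rate used in the example and the example sentence are assumptions for illustration. BERT's actual procedure is more involved: among the selected positions, some are replaced by a random token or left unchanged instead of being masked.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)  # seeded so that the example is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must recover the original token
        else:
            masked.append(token)
            labels.append(None)    # this position is ignored by the loss
    return masked, labels


tokens = "the patient was asked to stick to the treatment plan".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
```

Because the model sees the words on both sides of each masked position, the representation it builds for a word such as "stick" depends on the surrounding sentence.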

In [None]:
###################
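# 1. N-gram Model
# Note: gensim's Phrases detects frequent token pairs (collocations) and merges
# them into single tokens; by itself it does not train an n-gram language model
# or produce word vectors.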
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(bpe_tokens, min_count=1, threshold=2)
bigram = Phraser(phrases)
ngram_tokens = [bigram[tokens] for tokens in bpe_tokens]

###################

In [None]:
# 2.Skip-gram with Negative Sampling
skipgram_model = Word2Vec(sentences=bpe_tokens, vector_size=100, window=5, sg=1, negative=10, min_count=1)

In [None]:
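# 3. Masked Language Model (BERT or similar)
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # inference only: disable dropout

def get_bert_embeddings(text):
    # BERT accepts at most 512 tokens, so longer texts are truncated
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # no gradients are needed, which saves memory
        outputs = model(**inputs)
    # last_hidden_state holds one contextualised vector per token;
    # averaging over tokens gives a single vector for the whole text
    return outputs.last_hidden_state.mean(dim=1).numpy()

bert_embeddings = [get_bert_embeddings(text) for text in text_data]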

In [None]:
# Clear the cache and try loading again: force_download=True re-downloads the
# pretrained files and overwrites any cached copies
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', force_download=True)
model = BertModel.from_pretrained('bert-base-uncased', force_download=True)

## Part 4 (20%): Explore the Word Representations

The following tracks can be finished independently. You are encouraged to divide workload to each team member.

### Track 4.1 (5%): Visualise the word representations by t-SNE

t-SNE is a dimensionality reduction algorithm commonly used to visualise high-dimensional vectors. Use t-SNE to visualise the word representations. You may visualise up to 1000 words, as t-SNE is computationally expensive.

More about t-SNE: https://lvdmaaten.github.io/tsne/

Recommended output:

- A diagram by t-SNE based on representations of up to 1000 words.

### Track 4.2 (5%): Visualise the Word Representations of Biomedical Entities by t-SNE

Instead of visualising the word representations of the entire vocabulary (or 1000 words that are selected at random), visualise the word representations of words which are biomedical entities, for example fever, cough and diabetes. Based on the category of those biomedical entities, can you assign different colours to the entities and see whether entities from the same category are clustered together by t-SNE? For example, sinusitis and cough are both respiratory conditions, so they should be assigned the same colour and ideally their representations should be close to each other in the t-SNE plot. As another example, Alzheimer's disease and headache are neurological conditions, which should be assigned another colour.

Examples of biomedical ontologies: https://www.ebi.ac.uk/ols/ontologies/hp and https://en.wikipedia.org/wiki/International_Classification_of_Diseases

Recommended output:

- A diagram with colours by t-SNE based on representations of biomedical entities.
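
One possible way to colour entities by category is sketched below. The `entity_categories` mapping is a hypothetical, hand-written example; in practice it would be derived from an ontology such as those linked above. The function name `colours_for` and the palette are also assumptions.

```python
# Hypothetical hand-written mapping from entity to category.
entity_categories = {
    "sinusitis": "respiratory",
    "cough": "respiratory",
    "headache": "neurological",
    "diabetes": "endocrine",
}
palette = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]


def colours_for(entities, categories, palette, default="lightgrey"):
    """Map each entity to a colour determined by its category."""
    ordered = sorted(set(categories.values()))
    category_colour = {c: palette[i % len(palette)] for i, c in enumerate(ordered)}
    # Entities without a known category fall back to the default colour.
    return [category_colour.get(categories.get(e), default) for e in entities]


entities = ["cough", "headache", "sinusitis", "unknown-term"]
colours = colours_for(entities, entity_categories, palette)
```

The resulting list has one colour per entity, in the same order as the entities, so it can be passed as the `c` argument of `plt.scatter` together with the 2-d t-SNE coordinates.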

### Track 4.3 (5%): Co-occurrence

- What are the biomedical entities which frequently co-occur with COVID-19 (or coronavirus)?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.

### Track 4.4 (5%): Semantic Similarity

- Which biomedical entities are semantically closest to COVID-19 (or coronavirus) based on the word representations?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
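
A common way to measure semantic similarity between word vectors is cosine similarity. The sketch below ranks candidate entities by their cosine similarity to a target word. The function names are assumptions and the 2-d vectors are made up for illustration; real vectors would come from the models trained in Part 3.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(target, vectors, candidates):
    """Sort candidate entities by cosine similarity to the target's vector."""
    scores = [(c, cosine(vectors[target], vectors[c]))
              for c in candidates if c in vectors and c != target]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy 2-d vectors, made up for illustration only.
vectors = {
    "COVID-19": [1.0, 0.2],
    "coronavirus": [0.9, 0.3],
    "pneumonia": [0.5, 0.8],
    "fracture": [-0.7, 0.6],
}
ranking = rank_by_similarity("COVID-19", vectors, ["fracture", "pneumonia", "coronavirus"])
```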

### Open Question (Optional): What else can you discover?


In [None]:
###################
#1. Visualise the word representations by t-SNE
import numpy as np
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['text.usetex'] = False

vocab = list(skipgram_model.wv.index_to_key)
# sample at most 500 words; np.random.choice raises if asked for more than exist
n_words = min(500, len(vocab))
selected_indices = np.random.choice(len(vocab), n_words, replace=False)

vocab = [vocab[i] for i in selected_indices]
embeddings = np.array([skipgram_model.wv[word] for word in vocab])

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, random_state=0, perplexity=min(30, n_words - 1))
reduced_embeddings = tsne.fit_transform(embeddings)

plt.figure(figsize=(10, 10))
# draw all points in one call, then label each point with its word
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=10)
for (x, y), label in zip(reduced_embeddings, vocab):
    plt.text(x + 0.1, y + 0.1, label, fontsize=8)
plt.show()


###################

In [None]:
#2.Visualise the Word Representations of Biomedical Entities by t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import scispacy
import spacy
from collections import defaultdict
import numpy as np

nlp = spacy.load("en_core_sci_sm")

def extract_biomedical_entities(text_data):
    entities = defaultdict(list)
    for doc in nlp.pipe(text_data):
        for entity in doc.ents:
            entities[entity.text].append(entity.label_)
    return entities

biomedical_entities = extract_biomedical_entities(text_data)
vocab = set(skipgram_model.wv.index_to_key)
# keep a sorted list so that the order of entities and of their vectors always agrees
biomedical_entities_in_vocab = sorted(entity for entity in biomedical_entities if entity in vocab)

# Extract embeddings for the entities
biomedical_entity_vectors = np.array([skipgram_model.wv[entity] for entity in biomedical_entities_in_vocab])

# sample at most 500 entities; np.random.choice raises if asked for more than exist
max_entities = min(500, len(biomedical_entities_in_vocab))
selected_indices = np.random.choice(len(biomedical_entities_in_vocab), max_entities, replace=False)
biomedical_entity_vectors = biomedical_entity_vectors[selected_indices]
biomedical_entities_in_vocab = [biomedical_entities_in_vocab[i] for i in selected_indices]

def visualize_biomedical_entities(embeddings, labels):
    tsne = TSNE(n_components=2, random_state=0)
    reduced_embeddings = tsne.fit_transform(embeddings)

    plt.figure(figsize=(10, 10))
    for i, label in enumerate(labels):
        x, y = reduced_embeddings[i, :]
        plt.scatter(x, y)
        plt.text(x + 0.1, y + 0.1, label, fontsize=9)
    plt.show()

# Visualize the biomedical entities
visualize_biomedical_entities(biomedical_entity_vectors, list(biomedical_entities_in_vocab))

In [None]:
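#3.Co-occurrence with biomedical words
from collections import Counter

# Co-occurrence Analysis
def co_occurrence(entity_lists, target_word):
    # count, for every other entity, the number of documents it shares with the target
    co_occur_counts = Counter()
    target = target_word.lower()
    for entities in entity_lists:
        unique_entities = {entity.lower() for entity in entities}
        if target in unique_entities:
            unique_entities.discard(target)
            co_occur_counts.update(unique_entities)
    return co_occur_counts

# one list of entity strings per document (reuses `nlp` loaded in the previous cell)
doc_entities = [[entity.text for entity in doc.ents] for doc in nlp.pipe(text_data)]
co_occurrences = co_occurrence(doc_entities, 'COVID-19')
# most frequent co-occurring entities first
print(co_occurrences.most_common(20))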

In [None]:
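#4.Semantic Similarity with biomedical words
def get_most_similar_words(model, target_word, top_n=10):
    # most_similar raises KeyError for words outside the vocabulary
    if target_word not in model.wv.key_to_index:
        return []
    return model.wv.most_similar(target_word, topn=top_n)

# pass the trained Word2Vec model, not the array of entity vectors;
# the key must match a token exactly as it appears in the training data
similar_words = get_most_similar_words(skipgram_model, 'COVID-19', top_n=100)
# keep only neighbours that were recognised as biomedical entities
similar_words = [(word, score) for word, score in similar_words if word in biomedical_entities][:10]
print(similar_words)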