# Assignment 4: Keyphrase Extraction, Named Entity Recognition & Neural Models

Due: Monday, February 06, 2023, at 2pm via Moodle

**Team Members** `<Fill out>`

Please note that this assignment comes with a considerable number of artifacts, which require around 5 GB of disk space. If you run into issues or want to keep your environment "clean", we suggest using [Google Colab](https://colab.research.google.com/).

In [None]:
%%bash
# Install necessary packages for all questions
. ~/.bashrc
python3 -m pip install keybert
python3 -m pip install git+https://github.com/LIAAD/yake
python3 -m pip install transformers
python3 -m pip install datasets
python3 -m pip install nltk
python3 -m pip install spacy

## Task 1: Keyphrase Extraction (5 + 3 + 3 + 5) = 16 Points

In this task, we will implement our own unsupervised keyphrase extraction (KPE) module based on a simple system of grammatical rules, which we apply to a Sherlock Holmes novel.
To generate TF-IDF-weighted phrases, we will use the entire collection of Sir Arthur Conan Doyle's Sherlock Holmes works to calculate document frequencies.

Finally, we compare the results to general-purpose KPE libraries.

### Sub Task 1: Unsupervised Keyphrase Extraction System (5 Points)

#### 1. Candidate Generation
We first need to generate a set of suitable candidate phrases, which can then be ranked as keyphrases later on. To do this, we will again use spaCy, this time its rule-based [`Matcher` class](https://spacy.io/api/matcher).

The syntactic pattern of a keyphrase candidate should satisfy the following rules:

1. An optional adjective, noun, or proper noun
2. An optional adjective, noun, or proper noun
3. A mandatory noun or proper noun

Add a second pattern that recognizes the following sequence:

1. A noun or proper noun
2. An adposition
3. Another noun or proper noun

Note that the first pattern matches any phrase of 1 to 3 tokens, which is a suitable approximation for the task at hand, whereas the second pattern is slightly more specific and always matches exactly three tokens.
Examples of valid matches are "Sherlock Holmes" ([PROPN, PROPN]) for the first pattern and "Hounds of Baskervilles" ([NOUN, ADP, PROPN]) for the second pattern.

In [None]:
import spacy
from spacy.matcher import Matcher

In [None]:
# load language model
nlp = spacy.load("en_core_web_sm", disable=["ner"])
matcher = Matcher(nlp.vocab)

# Define the above patterns
pattern = ## YOUR CODE

matcher.add( ## YOUR CODE

To verify whether your pattern is correct, use the below example.
If you have done everything correctly, your matcher will identify **13 phrases**.

In [None]:
doc = nlp("This is a simple test. It should return 'simple', and 'test', among other phrases. Maybe we can also see if it can recognize the art of war. Would it recognize integer linear programming, too?")
matches = matcher(doc)

print(len(matches))

#### 2. Applying Your System

Once you have matched the correct number of keyphrase candidates on the above example, apply your rule-based matcher to an actual data sample. We are going to use the Sherlock Holmes novel "The Hound of the Baskervilles". You can find the raw text file at the following URL:

https://sherlock-holm.es/stories/plain-text/houn.txt

Download the text from this URL and apply your spaCy model and matcher on it.  
**Hint:** Make sure you properly decode your input, since some libraries return binary strings.
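
As a minimal illustration of the hint, the sketch below uses a hand-made byte string (not the actual download) to show the difference between raw bytes, as returned by `urlopen(...).read()`, and a decoded `str`. The UTF-8 encoding is an assumption; check the actual response if decoding fails.

```python
# Hand-made stand-in for a raw HTTP response body (assumption: UTF-8 encoded).
raw = "The na\u00efve reader".encode("utf-8")
print(type(raw).__name__)   # bytes

# Decoding turns the binary string into regular text that spaCy can process.
decoded = raw.decode("utf-8")
print(type(decoded).__name__)  # str
print(decoded)
```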

In [None]:
from urllib.request import urlopen
def load_txt_from_url(url: str = "https://sherlock-holm.es/stories/plain-text/houn.txt") -> str:
  ## YOUR CODE

text = load_txt_from_url()

# Apply the spacy model to the loaded text and extract the phrases with the Matcher
doc = ## YOUR CODE
matches = ## YOUR CODE

We will now investigate which phrase candidates appear most frequently in this novel, based simply on phrase frequency. To do so, convert your abstract match objects into actual strings, lowercase them, and return the 20 most frequently occurring phrase candidates and their respective frequencies.  
**Hint:** For counting occurrences, you may look at `collections.Counter`.
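
The following sketch shows how `collections.Counter` behaves on a small, made-up list of phrases (the list is purely illustrative and not taken from the novel). Lowercasing before counting merges differently cased variants of the same phrase.

```python
from collections import Counter

# Made-up phrases; in the assignment these would come from the Matcher.
toy_phrases = ["Holmes", "the moor", "holmes", "Watson", "The Moor", "HOLMES"]

# Lowercase each phrase, then count how often each one occurs.
toy_counts = Counter(phrase.lower() for phrase in toy_phrases)

# most_common(n) returns the n most frequent (phrase, count) pairs.
top_two = toy_counts.most_common(2)
print(top_two)  # [('holmes', 3), ('the moor', 2)]
```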

In [None]:
candidates = []
# Lowercase and add the extracted candidate matches to `candidates`
## YOUR CODE

# Count the number of occurrences of different candidate phrases
candidate_phrases = ## YOUR CODE
# Print the most frequently occurring phrases, together with the respective frequencies
print( ## YOUR CODE

#### 3. Briefly summarize the quality of your top 20 candidates:

YOUR ANSWER

### Sub Task 2: Generating Document Frequency Values (3 Points)

To compare the previously generated phrases with a more refined model, we are going to extract document frequencies from the collection of all Sherlock Holmes works. Since the books are relatively long documents, we will not treat each book as a single document. Instead, we split the input text with a simple heuristic that approximates the individual chapters of each work.

1. Start by loading the Sherlock Holmes canon from https://sherlock-holm.es/stories/plain-text/cnus.txt  
Afterwards, split the full document into individual chapters. For this, use three consecutive line breaks `\n\n\n` as a splitting condition to approximate the chapters.
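
To illustrate the splitting heuristic, here is a small sketch on a made-up string (not the actual canon). It also shows one caveat of this approximation: a run of more than three line breaks leaves a stray newline at the start of the following piece.

```python
# Made-up mini "corpus" with three chapters separated by three line breaks.
toy_corpus = (
    "CHAPTER I\n\nFirst paragraph."
    "\n\n\n"
    "CHAPTER II\n\nSecond paragraph."
    "\n\n\n"
    "CHAPTER III\n\nThird paragraph."
)
toy_docs = toy_corpus.split("\n\n\n")
print(len(toy_docs))  # 3

# Caveat: four consecutive line breaks leave a leading "\n" in the next piece.
print("a\n\n\n\nb".split("\n\n\n"))  # ['a', '\nb']
```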

In [None]:
df_texts = load_txt_from_url("https://sherlock-holm.es/stories/plain-text/cnus.txt")

split_df_texts = ## YOUR CODE

print(len(split_df_texts))

After splitting, you should have 353 individual "documents" to work with.

2. Now, create a dictionary containing each phrase encountered in the larger corpus, and its associated document frequency. Again, ensure that phrase strings are lowercased for consistency with the previous transformation.  
**Hint:** Since the processing of 353 documents might take a while, incorporate [`tqdm.tqdm`](https://tqdm.github.io/) to visualize progress on the task.
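
The difference between term frequency and document frequency can be sketched on toy data as follows. The phrase lists are invented for illustration; the key point is that each phrase is counted at most once per document, which is why a set is used.

```python
from collections import Counter

# Invented phrase lists, one per "document".
toy_doc_phrases = [
    ["holmes", "watson", "holmes"],  # "holmes" occurs twice in this document
    ["holmes", "moor"],
    ["watson", "moor", "hound"],
]

toy_doc_freq = Counter()
for phrases in toy_doc_phrases:
    # set(...) removes duplicates, so each document contributes at most 1.
    toy_doc_freq.update(set(phrases))

print(toy_doc_freq["holmes"])  # 2
print(toy_doc_freq["hound"])   # 1
```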

In [None]:
from tqdm import tqdm
from typing import List

def return_occurring_phrases(doc_text: str) -> List[str]:
  # process text with spaCy and apply the Matcher
  doc = ## YOUR CODE
  matches = ## YOUR CODE

  # Candidates can be a set, since we only care about the occurrence *once* for IDF values.
  # Again, extract the lower-cased text of a matched span.
  candidates = set()
  for match_id, start, end in matches:
    ## YOUR CODE

  return list(candidates)

all_document_phrases = []
# Iterate through the individual documents and extract phrases for them. Use `tqdm` to visualize progress
## YOUR CODE

# Once again, count the frequency of term occurrences across all documents
## YOUR CODE



3. Output the 20 most frequently appearing document phrases that your system detected:

In [None]:
print( ## YOUR CODE

### Sub Task 3: Generating Weighted Keyphrases (3 Points)

We can now combine the extracted phrase counts with the document frequencies to calculate `tf-idf` scores, and return a hopefully improved version of our keyphrases for the original "The Hound of the Baskervilles" document. 

1. Iterate over all phrases occurring in the novel "The Hound of the Baskervilles", and re-score phrases according to the definition of TF-IDF. Use the smoothed definition of idf:

$ idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} + 1 $
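
To get a feeling for the formula, the sketch below evaluates only the idf part for two extreme cases. It assumes the natural logarithm (the formula does not fix the base) and uses $|D| = 353$, the number of documents expected after splitting; the helper name `smoothed_idf` is made up for this illustration.

```python
import math

def smoothed_idf(df_count: int, n_docs: int) -> float:
    """idf = log(|D| / (df + 1)) + 1, with the natural log as an assumption."""
    return math.log(n_docs / (df_count + 1)) + 1

N_DOCS = 353  # expected number of "documents" after splitting the canon

# A phrase found in 352 of 353 documents: log(353 / 353) + 1 = 1.0
print(smoothed_idf(352, N_DOCS))  # 1.0

# A phrase never seen in the reference corpus receives the largest weight.
print(smoothed_idf(0, N_DOCS) > smoothed_idf(10, N_DOCS))  # True
```

Frequent phrases therefore keep a weight close to 1, while rare phrases are boosted by a factor that grows with the logarithm of their rarity.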

In [None]:
import math
from typing import Dict

def tf_idf(tf: int, df_count: int) -> float:
  """
  Computes the TF-IDF scores according to the above-mentioned formula.
  Note that you may use a constant for the number of documents (|D|).
  """
  return ## YOUR CODE

tf_idf_weighted_candidates = []

# Iterate through all candidate phrase/frequency pairs and compute the TF-IDF scores for each phrase
# Store the phrase together with its TF-IDF score in `tf_idf_weighted_candidates`
for candidate, tf in candidate_phrases.items():
  tf_idf_weighted_candidates.append( ## YOUR CODE

2. Now print the top 20 candidate phrases by TF-IDF weight, and compare the results to your previous output. 

In [None]:
print(## YOUR CODE

3. Write your insights on the comparison of the results below. Try to theorize why some of the phrases still appear, or why other phrases are no longer present:

YOUR ANSWER HERE

4. Give two examples of how you could further improve the list of keyphrase values.

YOUR ANSWER HERE



### Sub Task 4: Apply off-the-shelf Keyphrase Extraction Tools (5 Points)

To put the findings of your system into context, compare them with two popular open-source libraries, namely [YAKE!](https://github.com/LIAAD/yake) and [KeyBERT](https://github.com/MaartenGr/KeyBERT).

1. First, run YAKE! on the document; you may use the default parameters. Print the resulting keyphrases; by default, YAKE! returns 20 phrases.

In [None]:
from yake import KeywordExtractor

extractor = ## YOUR CODE
keywords = ## YOUR CODE

# Print the top 20 keywords
print(keywords)

2. Compare both runtime efficiency and the extracted phrases with your own system.

YOUR ANSWER HERE

3. Now use the KeyBERT library to extract keyphrases. Importantly, you will need to split the document into separate paragraphs, as the underlying neural model will be unable to handle the complete document as input.  
Use the pattern `\n\n` to separate the text into smaller paragraphs, and filter out any empty lines afterwards. Inputs that contain only newline (`\n`) or whitespace (` `) characters also count as "empty lines".
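
The filtering criterion can be illustrated on a made-up string. This is only a sketch of one possible approach: `str.strip()` returns an empty string for inputs that consist solely of whitespace or newlines, so such pieces are dropped.

```python
# Made-up text with a whitespace-only piece and an empty piece.
toy_text = "First paragraph.\n\n   \n\nSecond paragraph.\n\n\n\nThird."

pieces = toy_text.split("\n\n")
print(pieces)  # ['First paragraph.', '   ', 'Second paragraph.', '', 'Third.']

# Keep only pieces that still contain something after stripping whitespace.
toy_paragraphs = [piece for piece in pieces if piece.strip()]
print(toy_paragraphs)  # ['First paragraph.', 'Second paragraph.', 'Third.']
```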


In [None]:
# Split the input text according to the specified criteria and filter empty lines out.
split_text = ## YOUR CODE

4. To ensure consistency between the tools when extracting keyphrases, set the *n*-gram range to `(1,3)`.
Otherwise, leave all parameters at the default value, and extract the keyphrases from each paragraph.

In [None]:
from keybert import KeyBERT

# This might take a while, since the model has to be downloaded first
model = KeyBERT("all-MiniLM-L6-v2")

# Extract the keyphrases from each split, using the adjusted keyphrase ngram range
# Hint: You may pass a list to the extraction function and KeyBERT will automatically handle iteration.
extracted_phrases = ## YOUR CODE

5. Combine the predictions of all individual splits into a single list. For this, sum up the prediction scores across all splits.  
**Hint:** `collections.defaultdict` makes aggregations like this much easier.
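
The aggregation pattern from the hint could look roughly like this on invented scores (the phrases and values are not real KeyBERT output). A `defaultdict(float)` starts every unseen phrase at `0.0`, so scores can be added without checking for existing keys.

```python
from collections import defaultdict

# Invented (phrase, score) predictions for two paragraphs.
toy_predictions = [
    [("sherlock holmes", 0.5), ("moor", 0.25)],
    [("moor", 0.5), ("hound", 0.125)],
]

toy_scores = defaultdict(float)
for paragraph_predictions in toy_predictions:
    for phrase, score in paragraph_predictions:
        toy_scores[phrase] += score  # missing keys start at 0.0

# Sort by summed score, highest first.
toy_ranked = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)
print(toy_ranked[:2])  # [('moor', 0.75), ('sherlock holmes', 0.5)]
```

Note that summing favors phrases that appear in many paragraphs; this is a design choice of the task, and alternatives such as taking the maximum would rank differently.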

In [None]:
from typing import List, Tuple
from collections import defaultdict

def merge_predictions(list_of_predictions: List[List[Tuple]]) -> List[Tuple]:
    """
    Combines lists of predictions into a single list with added scores.
    """
    phrase_dict = ## YOUR CODE

    # Iterate through all the lists of predictions and add the scores to the correct dict entry
    ## YOUR CODE

    # Extract the 20 keyphrases with the highest weights from `phrase_dict`
    phrase_list = ## YOUR CODE

    return phrase_list

In [None]:
print(merge_predictions(extracted_phrases))

6. Again, evaluate the result and compare it to the other two approaches in terms of extraction quality and extraction speed.

YOUR ANSWER HERE

## Task 2: Named Entity Recognition (4 + 5 + 5) = 14 Points

Slightly different, but still operating on the sequence level, is the task of Named Entity Recognition (NER).
In this task, we will evaluate the NER capabilities of some more open-source libraries.
In particular, we will also evaluate the utility of NER as a stand-in for Keyphrase Extraction.

### Sub Task 1: Using spaCy NER (4 Points)

So far, when using spaCy models, we have primarily disabled the NER component, as it requires significant extra compute.
In this task, we will explicitly leave the component enabled, to see what results it can produce on the text from the previous question.

In [None]:
import spacy

# Load the en_core_web_sm model, but with NER enabled.
nlp = spacy.load("en_core_web_sm")

1. Re-load the text of the novel "The Hound of the Baskervilles" and process it with the spaCy model.

In [None]:
# Re-use the function from the previous exercise.
text = ## YOUR CODE

doc = ## YOUR CODE

2. Similar to the previous exercise, count the number of occurrences, this time for the extracted entities instead of phrases. Print the 20 most frequently occurring entities.  
Make sure to lowercase the text again during your aggregation.

In [None]:
# Count the number of occurrences of particular entities
## YOUR CODE

# Print the top 20 most frequently occurring entities.
print( ## YOUR CODE

You might have noticed some unwanted results in the list, such as "night". Upon closer inspection, it turns out that the NER module further differentiates between different entity *categories*, such as PERSON (referencing, as expected, a physical person) or ORG (organizations, such as companies, NGOs, etc.), but also TIME (under which "night" falls). For reference, you can find the full list of supported NER labels by this particular model [here](https://spacy.io/models/en#en_core_web_sm-labels).

3. Refine the list of most common entities by printing the three most frequent entities in each of the categories `PERSON`, `ORG`, and `GPE` (geopolitical entities, such as countries and cities) instead.
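
The idea of counting per category can be sketched on plain `(text, label)` tuples. The tuples below are invented stand-ins for spaCy entities, which expose the same information through their text and label attributes.

```python
from collections import Counter

# Invented (entity text, entity label) pairs standing in for spaCy entities.
toy_entities = [
    ("Holmes", "PERSON"),
    ("London", "GPE"),
    ("holmes", "PERSON"),
    ("Watson", "PERSON"),
    ("night", "TIME"),
]

# Keep only one category, lowercase the text, and count.
toy_person_counts = Counter(
    text.lower() for text, label in toy_entities if label == "PERSON"
)
print(toy_person_counts.most_common(1))  # [('holmes', 2)]
```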

In [None]:
def get_top_entities_by_class(doc: spacy.tokens.Doc, class_name: str, n: int = 3) -> List[Tuple[str, int]]:
    """
    Returns the `n` most frequent entities (and their frequencies)
    of entity type `class_name` from `doc`.
    """
    # Extract phrase and frequency of a particular entity class
    counter = ## YOUR CODE
    # Return the top `n` entities and frequencies
    return ## YOUR CODE

# Print the results for "PERSON", "ORG" and "GPE"
print( ## YOUR CODE
print( ## YOUR CODE
print( ## YOUR CODE

### Sub Task 2: Financial Bank Statements of Deutsche Bank (5 Points)

Instead of the Sherlock Holmes novels, we will now compare spaCy's and NLTK's NER modules on the financial statements of Deutsche Bank from 2021. For this, see the file available on Moodle.

1. Download it and convert the PDF document into text using the `pdftotext` command-line utility. In particular, run it with the `-layout` option enabled.

In [None]:
%%bash
. ~/.bashrc
## YOUR SHELL COMMAND HERE
# If you have to execute this command through your shell, still paste the command you ran in here.

2. Given that the document is extremely long, split the input into chunks of 500,000 characters and process them separately.
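
Fixed-size chunking can be illustrated with a tiny string and a chunk size of 4 instead of 500,000. The helper name `chunk_text` is made up for this sketch; it is one possible approach, not the required one.

```python
from typing import List

def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Cut `text` into consecutive pieces of at most `chunk_size` characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

toy_chunks = chunk_text("abcdefghij", 4)
print(toy_chunks)  # ['abcd', 'efgh', 'ij']
```

Keep in mind that cutting at fixed character offsets can split a word or an entity mention across two chunks, which may cost a few entities at the chunk boundaries.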

In [None]:
def load_long_text_in_chunks(fp: str, chunk_size: int = 500_000) -> List[str]:
    """
    Loads a text file (located at `fp`) and splits it into chunks of at most `chunk_size` characters.
    Note that the last chunk might be significantly shorter.
    """
    # Load the text file
    text = ## YOUR CODE

    # Split the text into segments of at most `chunk_size` characters
    chunks = ## YOUR CODE
    return chunks

In [None]:
db_chunks = load_long_text_in_chunks( ## YOUR CODE HERE)

3. Print the top 5 occurring `ORG` entities that are not referencing Deutsche Bank itself, both by using spaCy's NER module and the NER function of NLTK.  
To exclude "Deutsche Bank" entities, filter out all entities that contain both "deutsche" and "bank" in their name, irrespective of the actual upper-/lowercasing.  
**Hint:** For more information on how to run NER with NLTK, see [here](https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/#performing-ner-with-nltk-and-spacy).
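
Case-insensitive substring checks like the one described above are usually done by lowercasing first. The sketch below demonstrates the idea on an unrelated, made-up company name; the helper `contains_all` is hypothetical and only serves as an illustration.

```python
def contains_all(name: str, *needles: str) -> bool:
    """True if every needle occurs in `name`, ignoring upper-/lowercase."""
    lowered = name.lower()
    return all(needle in lowered for needle in needles)

print(contains_all("ACME Holdings' Board", "acme", "holdings"))  # True
print(contains_all("Acme Ltd.", "acme", "holdings"))             # False
```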

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

org_entities_spacy = []
org_entities_nltk = []

def is_deutsche_bank_entity(name: str) -> bool:
    """
    Returns True if the entity name contains "deutsche" and "bank" in some upper-/lowercased version.
    This means both "Deutsche Bank" and "deutsche bank's" should be recognized.
    """
    ## YOUR CODE

for chunk in db_chunks:
    # Process the chunk with spaCy
    doc = ## YOUR CODE

    # And also with NLTK
    ## YOUR CODE
    

    # Add all the extracted organization entities to `org_entities_spacy` and `org_entities_nltk`,
    # except those referencing Deutsche Bank (note: NLTK labels organizations as "ORGANIZATION")
    org_entities_spacy.extend( ## YOUR CODE
    org_entities_nltk.extend( ## YOUR CODE
    

In [None]:
# Return the top 5 entities by frequency

entity_counts_spacy = ## YOUR CODE
entity_counts_nltk = ## YOUR CODE

print( ## YOUR CODE
print( ## YOUR CODE

4. Compare and analyze the different results between the two methods.

YOUR ANSWER HERE

### Sub Task 3: Co-Occurrence Counts of Entities (5 Points)

As is becoming apparent, the *raw* occurrence counts of entities might not be meaningful on their own, especially if we are interested in less frequently occurring entities.

Instead, we will "investigate" the entities that are most frequently mentioned in association with "Deutsche Bank". For this purpose, we will look at the textual co-occurrences of two named entities. The basic idea is that entities that frequently appear together are likely related.

1. For each text chunk, extract all mentions of the entity `('Deutsche Bank', 'ORG')`, as well as all `PERSON` entity mentions in the text using spaCy. Store the respective entity name and the text position. Unlike the previous question, you do *not* need to check for different spellings of the "Deutsche Bank" entity.  
**Hint:** Entities are represented as a [`Span`](https://spacy.io/api/span) element in spaCy, which has access to text position.


In [None]:
entity_mentions_with_start_position = []

for chunk in db_chunks:
    chunk_mentions = []
    # Process the doc with spaCy
    doc = ## YOUR CODE
    
    # Extract only entity mentions of "Deutsche Bank" (ORG) or any PERSON mention.
    # Append each mention, including the text and its starting position, to `chunk_mentions`
    
    # Append the chunk's entities to the aggregate list
    entity_mentions_with_start_position.append(chunk_mentions)


2. Within each chunk, for each mention of `Deutsche Bank`, search for `PERSON` entities that have a starting position within 200 characters before/after the starting position of the `Deutsche Bank` mention. Count for each `PERSON` entity how many times it occurs near a mention of `Deutsche Bank`.  
Aggregate the co-occurrences across all chunks. 
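
The windowing logic can be sketched on invented mentions stored as `(name, label, start)` tuples; the person names and positions are made up. Treating a distance of exactly 200 characters as "nearby" is an assumption of this sketch, so adjust the comparison if you read the boundary differently.

```python
from collections import Counter

# Invented mentions of one chunk: (entity text, label, start character).
toy_mentions = [
    ("Deutsche Bank", "ORG", 100),
    ("Alice Example", "PERSON", 250),
    ("Bob Example", "PERSON", 900),
    ("Deutsche Bank", "ORG", 1000),
    ("Alice Example", "PERSON", 1150),
]
WINDOW = 200  # maximum distance between starting positions (assumed inclusive)

toy_nearby = Counter()
for name, label, start in toy_mentions:
    if label != "ORG":
        continue  # only ORG mentions serve as anchors
    for other, other_label, other_start in toy_mentions:
        if other_label == "PERSON" and abs(other_start - start) <= WINDOW:
            toy_nearby[other] += 1

print(toy_nearby.most_common())  # [('Alice Example', 2), ('Bob Example', 1)]
```

The nested loop is quadratic in the number of mentions per chunk, which is acceptable for a sketch; sorting mentions by position would allow a faster sliding-window variant.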

In [None]:
co_occurrences = []

for chunk_mentions in entity_mentions_with_start_position:
    # Iterate through the entities. If the entity is a "Deutsche Bank" mention, extract nearby
    # PERSON references (less than +/- 200 character difference in the starting position)

    ## YOUR CODE


3. Return the names of the 5 `PERSON` entities with the most co-occurrences, together with their co-occurrence counts.


In [None]:
co_occurrence_counts = ## YOUR CODE

print( ## YOUR CODE

4. Look back at the results of your previous task. Are the `PERSON` entities returned by your co-occurrence method the same ones that appear most frequently by raw counts?

YOUR ANSWER HERE

## Task 3: Neural Models with Huggingface (3 + 5 + 2) = 10 Points

For state-of-the-art performance, most text-related tasks nowadays use some variation of the Transformer architecture. A particular advantage is the ready availability of weights for models that have been pre-trained on large general-purpose datasets, which reduces the amount of domain-specific labeled training data needed.

In this task, we will explore the [Huggingface](https://hf.co/) ecosystem to see in which way Transformer models can be used.
One of the central aspects of the Huggingface platform is the so-called [Model Hub](https://huggingface.co/models), where you can find many different models uploaded by community members for a variety of tasks.

Because neural models are generally very expensive to run, this exercise will be limited to less data than the previous questions.

### Sub Task 1: Loading Transformer Models (3 Points)

1. Install the `transformers` library and load the model `cardiffnlp/twitter-roberta-base-sentiment-latest` to classify a sequence.
2. Report the result of the prediction on the test sequence.

In [None]:
from transformers import ## YOUR IMPORTS

model = ## YOUR CODE
tokenizer = ## YOUR CODE

input_text = "Das ist ein Test."

prediction = ## YOUR CODE

### Sub Task 2: Using Pipelines (5 Points)

The most succinct way of using a Transformer model is the [`transformers.pipeline`](https://huggingface.co/docs/transformers/pipeline_tutorial). You can check out the linked tutorial for more information on the topic, but essentially, `pipeline` provides a lightweight wrapper around a number of different popular NLP tasks.

1. Instead of manually loading a model and tokenizer, now load a model through a `"text-classification"` pipeline. Look up the neural model that is loaded by default, and post the link to its [model card](https://huggingface.co/docs/hub/model-cards) below.


In [None]:
## YOUR CODE

2. Now load a `"text-classification"` pipeline again, but this time with a custom model and tokenizer. Use the Model Hub to find the most popular model for the German language (by number of downloads), and explicitly pass this model and its tokenizer to the pipeline. Re-run the previous example and report the prediction result.


In [None]:

model = ## YOUR CODE
tokenizer = ## YOUR CODE

# Instantiate the pipeline with custom components
pipe = ## YOUR CODE

# Output the prediction by your pipe on the test sample.
print( ## YOUR CODE

3. Keeping in line with the previous exercises, let us now try and actually predict something with the model. Re-load a pipeline, this time for Named Entity Recognition, using the default model.

In [None]:
## YOUR CODE

4. Run the pipeline with the text from the Deutsche Bank report from Question 2 and output the results.

In [None]:
## YOUR CODE

print( ## YOUR CODE

5. Look at the results. Something looks strange here; why is it not working properly? Elaborate on your answer.

YOUR ANSWER HERE

### Sub Task 3: Using Datasets through Huggingface (2 Points)

The Huggingface ecosystem is not limited to the `transformers` library for model training and inference; it also offers libraries that do not involve neural models.
In particular, the `datasets` library provides a centralized and streamlined way of accessing a variety of different datasets.

1. Using the `datasets` library, load the `imdb` dataset.

In [None]:
## YOUR CODE

2. Report the mean length of the `text` column for each split of the dataset. Note that `imdb` provides `train`, `test`, and `unsupervised` splits rather than a dedicated validation split.
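
What "mean length" means can be sketched on a hand-made stand-in for a dataset with a `text` column. Measuring length in characters is an assumption here (counting tokens would be an equally valid reading), and the toy records are invented rather than taken from `imdb`.

```python
from statistics import fmean

# Hand-made stand-in for dataset splits; each record has a "text" field.
toy_splits = {
    "train": [{"text": "good"}, {"text": "bad film"}],
    "test": [{"text": "fine"}],
}

# Mean number of characters in the "text" field, per split.
toy_mean_lengths = {
    split_name: fmean(len(record["text"]) for record in records)
    for split_name, records in toy_splits.items()
}
print(toy_mean_lengths)  # {'train': 6.0, 'test': 4.0}
```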


In [None]:
## YOUR CODE