# Assignment 4: Keyphrase Extraction, Named Entity Recognition & Neural Models

Due: Monday, February 06, 2023, at 2pm via Moodle

**Team Members** Daniel Abanto, Levi Szamek, Clemens Langer

Please note that this assignment comes with quite a number of artifacts, totaling somewhere around 5 GB of necessary disk space. In case you are running into issues or do want to keep your environment "clean", we suggest the use of [Google Colab](https://colab.research.google.com/).

In [1]:
%%bash
. ~/.bashrc
python3 -m pip install keybert
python3 -m pip install git+https://github.com/LIAAD/yake
python3 -m pip install transformers
python3 -m pip install datasets
python3 -m pip install nltk
python3 -m pip install spacy
# Install necessary packages for all questions

Collecting keybert
  Using cached keybert-0.7.0.tar.gz (21 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting sentence-transformers>=0.3.8
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting scikit-learn>=0.22.2
  Downloading scikit_learn-1.0.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.8/24.8 MB 27.8 MB/s eta 0:00:00
Collecting numpy>=1.18.5
  Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.7/15.7 MB 38.0 MB/s eta 0:00:00
Collecting rich>=10.4.0
  Using cached rich-13.3.1-py3-none-any.whl (239 kB)
Collecting pygments<3.0.0,>=2.14.0
  Using cached Pygments-2.14.0-py3-none-any.whl (1.1 MB)
Collecting markdown-it-py<3.0.0,>=2.1.0
  Using cac

  Running command git clone --filter=blob:none --quiet https://github.com/LIAAD/yake /tmp/pip-req-build-y7q4mhdu


In [2]:
%%bash
python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 11.4 MB/s eta 0:00:00
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


## Task 1: Keyphrase Extraction (5 + 3 + 3 + 5) = 16 Points

In this task, we will implement our own unsupervised keyphrase extraction (KPE) module utilizing a simple grammatical ruling system, which we apply to a Sherlock Holmes novel.
To generate TF-IDF-weighted phrases, we will be using the entire collection of Sir Arthur Donan Coyle novels to calculate document frequencies.

Finally, we compare the results to general-purpose KPE libraries.

### Sub Task 1: Unsupervised Keyphrase Extraction System (5 Points)

#### 1. Candidate Generation
We will need to generate a set of suitable candidate phrases first, which can then be ranked as keyphrases later on. To do this, we will again be using spaCy's, this time its rule-based [`Matcher` class](https://spacy.io/api/matcher).

The syntactic pattern of a keyphrase candidate should satisfy the following rules:

1. An optional adjective, noun, proper noun
2. An optional adjective, noun, proper noun
3. A mandatory noun or proper noun.

Add a second pattern, which recognizes the pattern

1. A noun or proper noun
2. An adposition
3. Another noun or proper noun

Note that the first condition will match any phrase of length between 1-3 tokens, which is a suitable approximation for our task at hand, whereas the second pattern is slightly more specific, always matching exactly three tokens.
An example of a valid matched phrases for the first pattern would be "Sherlock Holmes" ([PROPN, PROPN]), and "Hounds of Baskervilles" ([NOUN, ADP, PROPN]) for the second pattern.

In [3]:
import spacy
from spacy.matcher import Matcher

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
# load language model
nlp = spacy.load("en_core_web_sm", disable=["ner"])
matcher = Matcher(nlp.vocab)


pattern1 = [{"POS": {"IN": ["ADJ", "NOUN", "PROPN"]},
            "OP": "?"},
            {"POS": {"IN": ["ADJ", "NOUN", "PROPN"]},
            "OP": "?"},
           {"POS": {"IN": ["NOUN", "PROPN"]},
            "OP": "+"}]

pattern2 = [{"POS": {"IN": ["NOUN", "PROPN"]},
            "OP": "+"},
            {"POS": "ADP", "OP": "+"},
            {"POS": {"IN": ["NOUN", "PROPN"]},
             "OP": "+"}]


matcher.add("pattern1", [pattern1])
matcher.add("pattern2", [pattern2])


To verify whether your pattern is correct, use the below example.
If you have done everything correctly, your matcher will identify **13 phrases**.

In [38]:
doc = nlp("This is a simple test. It should return 'simple', and 'test', among other phrases. Maybe we can also see if it can recognize the art of war. Would it recognize integer linear programming, too?")
matches = matcher(doc)

print(len(matches))

13


#### 2. Applying Your System

Once you have matched the correct number of keyphrase candidates on the above example, apply your rule-based matcher to an actual data sample. We are going to use the Sherlock Holmes novel "Hounds of Baskervilles". You can find the raw text file at the following URL:

https://sherlock-holm.es/stories/plain-text/houn.txt

Download the text from this URL and apply your spaCy model and matcher on it.  
**Hint:** Make sure you properly decode your input, since some libraries return binary strings.

In [5]:
from urllib.request import urlopen
def load_txt_from_url(url: str = "https://sherlock-holm.es/stories/plain-text/houn.txt") -> str:
    with urlopen(url) as webpage:
        content = webpage.read().decode()
        return content
    
        #@Daniel geht auch mit urlopen(url).read()


text = load_txt_from_url()


# Apply the spacy model to the loaded text and extract the phrases with the Matcher
doc = nlp(text)
matches = matcher(doc)

We will now investigate which phrase candidates are the most frequently appearing in this novel, simply based on the phrase frequency. Therefore, convert your abstract match objects into actual strings, lowercase them, and return the 20 most frequently occurring phrase candidates and their respective frequencies.  
**Hint:** For counting occurrences, you may look at `collections.Counter`.

In [43]:
from collections import Counter
candidates = []
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  
    span = doc[start : end] 
    string = span.text.lower()
    candidates.append((rule_id,string))

candidate_phrases = []
for i,j in sorted(Counter(candidates).items(), key=lambda x:x[1], reverse=True):
    c_f = (f'{i[1]} ({j})')
    candidate_phrases.append(c_f)
    
print(candidate_phrases[:20])

['sir (350)', 'man (214)', 'holmes (192)', 'moor (159)', 'henry (156)', 'sir henry (135)', 'watson (117)', 'baskerville (116)', 'dr. (109)', 'charles (94)', 'stapleton (93)', 'mortimer (89)', 'night (88)', 'time (86)', 'sir charles (86)', 'house (75)', 'face (75)', 'hound (72)', 'barrymore (72)', 'eyes (71)']


#### 3. Briefly summarize the quality of your top 20 candidates:

YOUR ANSWER

### Sub Task 2: Generating Document Frequency Values (3 Points)

To compare the previously generated terms with a more refined model, we are going to extract document frequencies from the collection of all Sherlock Holmes works. Since the books are relatively long documents, we are instead going to split based on a simple heuristic in the input document, which should allow a decent approximation by taking into account individual chapters of each novel.

1. Start by loading the Sherlock Holmes canon from https://sherlock-holm.es/stories/plain-text/cnus.txt  
Afterwards, split the full document into individual chapters. For this, use three consecutive line breaks `\n\n\n` as a splitting condition to approximate the chapters.

In [44]:
df_texts = load_txt_from_url("https://sherlock-holm.es/stories/plain-text/cnus.txt")
split_df_texts = df_texts.split('\n\n\n')
print(len(split_df_texts))

353


After splitting, you should have 353 individual "documents" to work with.

2. Now, create a dictionary containing each phrase encountered in the larger corpus, and its associated document frequency. Again, ensure that phrase strings are lowercased for consistency with the previous transformation.  
**Hint:** Since the processing of 353 documents might take a while, incorporate [`tqdm.tqdm`](https://tqdm.github.io/) to visualize progress on the task.

In [64]:

from tqdm import tqdm
from typing import List
# process text with spaCy and apply the Matcher
# Candidates can be a set, since we only care about the occurrence *once* for IDF values.
 # Again, extract the lower-cased text of a matched span.

def return_occurring_phrases(doc_text: str) -> List[str]:
    matches_cand = []
    for document in doc_text:
        doc = nlp(document)
        matches = matcher(doc)
        for match_id, start, end in matches:
            rule_id = nlp.vocab.strings[match_id]  
            span = doc[start : end] 
            string = span.text.lower()
            matches_cand.append((rule_id,string))
    candidates = set(matches_cand)
    return list(candidates)

candidates = return_occurring_phrases(split_df_texts)

# Iterate through the individual documents and extract phrases for them. Use `tqdm` to visualize progress
candidate_phrases = []

for document in tqdm(split_df_texts):
    doc = nlp(document)
    matches = matcher(doc)
    for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]  
        span = doc[start : end] 
        string = span.text.lower()
        candidate_phrases.append((rule_id,string))

candidate_phrases_document = []
canditate_phrases_dict = {}
for i,j in sorted(Counter(candidate_phrases).items(), key=lambda x:x[1], reverse=True):
    c_f = (f'{i[1]} ({j})')
    canditate_phrases_dict[i[1]]=j
    candidate_phrases_document.append(c_f)
    
candiate_phrases_document = sorted(Counter(candidate_phrases).items(), key=lambda x:x[1], reverse=True)


100%|██████████| 353/353 [00:50<00:00,  7.00it/s]


3. Output the 20 most frequently appearing document phrases that your system detected:

In [65]:
print(candidate_phrases_document[:20])

['holmes (2501)', 'man (1989)', 'mr. (1394)', 'room (900)', 'time (880)', 'sir (842)', 'watson (810)', 'house (773)', 'face (754)', 'night (738)', 'way (730)', 'door (684)', 'hand (632)', 'case (608)', 'eyes (552)', 'day (530)', 'matter (474)', 'morning (470)', 'friend (470)', 'mr. holmes (450)']


### Sub Task 3: Generating Weighted Keyphrases (3 Points)

We can now incorporate the extracted keyphrases to calculate `tf-idf` scores, and return a hopefully improved version of our keyphrases for the original "Hounds of Baskervilles" document. 

1. Iterate over all phrases occurring in the novel "Hounds of Baskervilles", and re-score phrases according to the definition of TF-IDF. Use the smoothed definition of idf:

$ idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} + 1 $

In [78]:
import math
from typing import Dict

def tf_idf(tf: int, df_count: int) -> float:
    return math.log10(tf/float(df_count) + 1) + 1 

tf_idf_weighted_candidates = []

# Iterate through all candidate phrase/frequency pairs and compute the TF-IDF scores for each phrase
# Store the phrase together with its TF-IDF score in `tf_idf_weighted_candidates`
for candidate, tf in canditate_phrases_dict.items():
    tf_idf_weighted_candidates.append((candidate,tf_idf(tf,len(split_df_texts))))

2. Now print the top 20 candidate phrases by TF-IDF weight, and compare the results to your previous output. 

In [83]:
tf_idf_weighted_candidates.sort(key = lambda x: x[1])

In [86]:
print(tf_idf_weighted_candidates[0:20])

[('complete', 1.0012285566379653), ('complete sherlock', 1.0012285566379653), ('complete sherlock holmes', 1.0012285566379653), ('arthur conan', 1.0012285566379653), ('conan', 1.0012285566379653), ('arthur conan doyle', 1.0012285566379653), ('conan doyle', 1.0012285566379653), ('doyle', 1.0012285566379653), ('a scandal', 1.0012285566379653), ('a scandal in bohemia', 1.0012285566379653), ('a case', 1.0012285566379653), ('a case of identity', 1.0012285566379653), ('the boscombe', 1.0012285566379653), ('the boscombe valley', 1.0012285566379653), ('the boscombe valley mystery', 1.0012285566379653), ('the five', 1.0012285566379653), ('five', 1.0012285566379653), ('the five orange', 1.0012285566379653), ('five orange', 1.0012285566379653), ('the five orange pips', 1.0012285566379653)]


3. Write your insights on the comparison of the results below. Try to theorize why some of the phrases still appear, or why other phrases are no longer present:

YOUR ANSWER HERE

certain, phrases might occur in a less, documents in comparison to the absolute count. A common phrase e.g. ones including sherlock, are logical part o most of the documents, while others might be related to specific chapters of the book.

4. Give two examples of how you could further improve the list of keyphrase values.

YOUR ANSWER HERE

there are various phrases, that are very simmilar and obviouly belong together. Merging these might improve the results


### Sub Task 4: Apply off-the-shelf Keyphrase Extraction Tools (5 Points)

To put the findings of your system into context, compare them with two popular open-source libraries, namely [YAKE!](https://github.com/LIAAD/yake) and [KeyBERT](https://github.com/MaartenGr/KeyBERT).

1. First, start by running the document with YAKE!; you may use the default parameters. Print the resulting keyphrases, which by default returns 20 phrases.

In [6]:
from yake import KeywordExtractor

extractor = KeywordExtractor()
keywords = extractor.extract_keywords(text)
# Print the top 20 keywords
print(keywords)

[('Sir Henry Baskerville', 9.703823247628816e-06), ('Sir Henry', 1.648100191408312e-05), ('Sir Charles Baskerville', 2.0351834688440424e-05), ('Sir Charles', 3.724150731612377e-05), ('Sir', 0.00010788043646785337), ('Sir Charles death', 0.0001335320154208488), ('Henry Baskerville', 0.00020407970787866275), ('Holmes', 0.00027267641332312374), ('Hall Sir Henry', 0.0002827705937659273), ('Baskerville Hall', 0.00030297998752354974), ('Sherlock Holmes', 0.00030366832269808347), ('friend Sir Henry', 0.0003232815192427155), ('Baskerville Hall Sir', 0.00033497855089080193), ('Henry', 0.0003734244162998363), ('Charles Baskerville', 0.00045232284621121374), ('BASKERVILLES Arthur Conan', 0.00046235145087207454), ('asked Sir Henry', 0.0005001869412629496), ('Sir Henry put', 0.0005065910358925609), ('Arthur Conan Doyle', 0.0005408670117356265), ('Baskerville', 0.0005977236792600148)]


2. Compare both runtime efficiency and the extracted phrases with your own system.

YOUR ANSWER HERE

The run time of yake is way faster. No timing needed. The results from yake, include more persons.

3. Now use the KeyBERT library to extract keyphrases. Importantly, you will need to split the document into separate paragraphs, as the underlying neural model will be unable to handle the complete document as input.  
Use the pattern of `\n\n` to separate the text into smaller paragraphs, and filter out any empty lines after. An "empty line" also constitutes all inputs that only contain newline (`\n`) or whitespace ` ` characters.


In [7]:
# Split the input text according to the specified criteria and filter empty lines out.
split_text = [i for i in text.split("\n\n") if i not in [""," ","\n"]]

4. To ensure consistency between the tools when extracting keyphrases, set the *n*-gram range to `(1,3)`.
Otherwise, leave all parameters at the default value, and extract the keyphrases from each paragraph.

In [9]:
from keybert import KeyBERT
# This might take a while to install
kw_model = KeyBERT("all-MiniLM-L6-v2")

# Extract the keyphrases from each split, using the adjusted keyphrase ngram range
# Hint: You may pass a list to the extraction function and KeyBERT will automatically handle iteration.
extracted_phrases = kw_model.extract_keywords(docs=split_text, keyphrase_ngram_range=(1,3))

5. Combine the predictions of all individual splits into a single list. For this, sum up the prediction scores across all splits.  
**Hint:** `collections.defaultdict` makes aggregations like this much easier.

In [15]:
extracted_phrases

[[('hound baskervilles', 0.9136), ('hound', 0.7077), ('baskervilles', 0.6484)],
 [('arthur conan doyle', 1.0),
  ('conan doyle', 0.916),
  ('doyle', 0.7735),
  ('arthur conan', 0.771),
  ('arthur', 0.5805)],
 [('holmes curse baskervilles', 0.6754),
  ('hound baskervilles retrospection', 0.6599),
  ('hound baskervilles', 0.6145),
  ('sherlock holmes curse', 0.5656),
  ('mr sherlock holmes', 0.552)],
 [('chapter mr sherlock', 0.9278),
  ('mr sherlock holmes', 0.7358),
  ('sherlock holmes', 0.7091),
  ('mr sherlock', 0.6864),
  ('sherlock', 0.6769)],
 [('mortimer friends engraved', 0.5589),
  ('mr sherlock holmes', 0.5371),
  ('engraved date 1884', 0.5074),
  ('sherlock holmes', 0.5068),
  ('mr sherlock', 0.4895)],
 [('watson make', 0.6447), ('watson', 0.6362), ('make', 0.1632)],
 [('holmes sitting', 0.586),
  ('holmes sitting given', 0.5758),
  ('holmes', 0.5626),
  ('occupation', 0.3712),
  ('sign occupation', 0.3568)],
 [('doing believe eyes', 0.4782),
  ('believe eyes head', 0.4775),


In [22]:
from typing import List, Tuple
from collections import defaultdict

def merge_predictions(list_of_predictions: List[List[Tuple]]) -> List[Tuple]:
    """
    Combines lists of predictions into a single list with added scores.
    """
    phrase_dict = defaultdict(int)
    
    for preds in list_of_predictions:
        for pred in preds:
            phrase_dict[pred[0]]+=pred[1]
    phrase_list = [(i,j) for i,j in phrase_dict.items()]
    phrase_list.sort(key = lambda x: x[1])
    return phrase_list

In [23]:
print(merge_predictions(extracted_phrases))




6. Again, evaluate the result and compare it to the other two approaches in terms of extraction quality and extraction speed.

YOUR ANSWER HERE

extraction speed is way more faster than the other approaches, although the workload was heavy for my cpu. 

## 2. Named Entity Recognition (4 + 5 + 5 = 14 Points)

Slightly different, but still operating on the sequence level, is the task of Named Entity Recognition (NER).
In this task, we will evaluate the NER capabilities of some more open-source libraries.
Particularly, we will also evaluate the utility of NER as a stand-in for Keyphrase Extraction.

### Sub Task 1: Using spaCy NER (4 Points)

So far, when using spaCy models, we have primarily disabled the NER component, as it requires significant extra compute.
In this task, we will explicitly leave the component enabled, to see what results it can produce on the text from the previous question.

In [11]:
import spacy
import pandas as pd
# Load the en_core_web_sm model, but with NER enabled.
nlp = spacy.load("en_core_web_sm")



1. Re-load the text for the "Hounds of Baskervilles" novel, and run it with the spacy model.

In [15]:
# Re-use the function from the previous exercise.
text = load_txt_from_url()

doc = nlp(str(text))

2. Similar to the previous exercise, count the number of occurrences, however, this time for the extracted entities instead of phrases. Print the top 20 most frequently occurring entities.  
Make sure to lowercase the text again during your aggregation.

In [37]:
from collections import defaultdict
ents_freq = defaultdict(int)
for ent in doc.ents:
    ents_freq[ent.text.lower()]+=1
    
ents_freq_sorted = sorted(ents_freq.items(), key=lambda x:x[1])
#print(ents_freq)

print(ents_freq_sorted[::-1][0:20:])



[('holmes', 116), ('henry', 95), ('one', 86), ('watson', 86), ('and\\n     ', 76), ('his\\n', 72), ('mortimer', 70), ('stapleton', 62), ('two', 59), ('was\\n', 45), ('charles', 44), ('london', 44), ('first', 41), ('had\\n     ', 32), ('barrymore', 31), ('baskerville hall', 22), ('half', 20), ('baskerville', 20), ('three', 16), ('henry baskerville', 14)]


You might have noticed some unwanted results in the list, such as "night". Upon closer inspection, it turns out that the NER module further differentiates between different entity *categories*, such as PERSON (referencing, as expected, a physical person) or ORG (organizations, such as companies, NGOs, etc.), but also TIME (under which "night" falls). For reference, you can find the full list of supported NER labels by this particular model [here](https://spacy.io/models/en#en_core_web_sm-labels).

3. Refine the list of most common entities by printing out the top three occurring entities in the category `PERSON`, `ORG` and `GPE` (physical locations) instead.

In [39]:
def get_top_entities_by_class(doc: spacy.tokens.Doc, class_name: str, n: int = 3):
    """
    Returns the three most frequent entities (and their frequencies)
    of entity type `class_name` from `doc`.
    """
    ents_freq = defaultdict(int)
    for ent in doc.ents:
        if ent.label_ == class_name:
            ents_freq[ent.text.lower()]+=1
            

    ents_freq_sorted = sorted(ents_freq.items(), key=lambda x:x[1])
    res = ents_freq_sorted[::-1][0:n:]
    print(ents_freq_sorted[::-1][0:n:])
    return res

# Print the results for "PERSON", "ORG" and "GPE"
print(get_top_entities_by_class(doc,"PERSON"))
print( get_top_entities_by_class(doc,"ORG"))
print( get_top_entities_by_class(doc,"GPE"))

[('holmes', 115), ('henry', 94), ('watson', 86)]
[('holmes', 115), ('henry', 94), ('watson', 86)]
[('it\\', 9), ('stapleton\\', 8), ('times', 7)]
[('it\\', 9), ('stapleton\\', 8), ('times', 7)]
[('london', 44), ('stapleton', 18), ('devonshire', 14)]
[('london', 44), ('stapleton', 18), ('devonshire', 14)]


### Sub Task 2: Financial Bank Statements of Deutsche Bank (5 Points)

Instead of using the Sherlock Holmes Novels, we will now compare the functionality of spaCy and NLTK's NER modules on the financial statements of Deutsche Bank from 2021. For this, see the file available on Moodle.

1. Download it and convert the PDF document into text, by using the `pdftotext` command-line utility. In particular, run with the `-layout` option enabled.

In [None]:
%%bash
. ~/.bashrc
## pdftotext -layout DB_anual_report.pdf
# If you have to execute this command through your shell, still paste the command you ran in here.

2. Given that the document is extremely long, split the inputs into chunks of 500.000 characters and process them separately.

In [40]:
def load_long_text_in_chunks(fp: str, chunk_size: int = 500_000):
    """Loads a text file (located at `fp`) and chunks it into chunks fo at most `chunk_size` characters.
    Note that the last chunk might be significantly shorter.
    """
    with open(fp) as inf:
        text = inf.read()
        

    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks

In [41]:
db_chunks = load_long_text_in_chunks("DB_annual_report.txt")

3. Print the top 5 occurring `ORG` entities that are not referencing Deutsche Bank itself, both by using spaCy's NER module and the NER function of NLTK.  
To exclude "Deutsche Bank" entities, filter out all entities that contain both "deutsche" and "bank" in their name, irrespective of the actual upper-/lowercasing.
**Hint:** For more information on how to run NER with NLTK, see [here](https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/#performing-ner-with-nltk-and-spacy)

In [76]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

org_entities_spacy = []
org_entities_nltk = []

def is_deutsche_bank_entity(name: str) -> bool:
    
    if "deutsche" in name.lower():
        return True
    if "bank" in name.lower():
        return True
    return False

for chunk in db_chunks:
    # Process the chunk with spaCy
    doc = nlp(chunk)
    
    
    ents_nltk = []

    # And also with NLTK
    ## YOUR CODE
    for sent in nltk.sent_tokenize(chunk):
        for ch in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(ch, 'label'):
                if ch.label() == "ORGANIZATION":
                    text = ' '.join(c[0] for c in ch)
                    if not is_deutsche_bank_entity(text):
                        ents_nltk.append(text)
    

    # Add all the extracted "ORG" entities to `org_entities`, except those referencing Deutsche Bank
    org_entities_spacy.extend([i.text for i in doc.ents if not is_deutsche_bank_entity(i.text) and i.label_ == "ORG"])
    org_entities_nltk.extend(ents_nltk)
    

[nltk_data] Downloading package punkt to /home/clemens/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/clemens/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/clemens/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/clemens/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [77]:
# Return the top 5 entities by frequency
import collections

entity_counts_spacy = collections.Counter(org_entities_spacy)
entity_counts_nltk = collections.Counter(org_entities_nltk)

print(entity_counts_spacy.most_common(5))
print(entity_counts_nltk.most_common(5))


[('Group', 956), ('the Management Board', 273), ('the Supervisory Board', 183), ('COVID-19', 124), ('Management Board', 102)]
[('Group', 744), ('Management Board', 438), ('Supervisory Board', 342), ('IFRS', 175), ('Total', 127)]


4. Compare and analyze the different results between the two methods.

YOUR ANSWER HERE

The reuslts for show certain similariteies, while spacy eg. differed between "the management board" and "management board", nltk recognized it as one entitiy an there for has a higher count. Overall the counts differ quite a bit e.g. Supervisory Board (183 and 342) with no method showing structurly higher counts.

### Sub Task 3: Co-Occurrence Counts of Entities (5 Points)

As is becoming apparent, the *raw* occurrence counts of entities might not be meaningful on its own, especially if we are interested in less frequently occurring entities.

Instead, we will "investigate" the entities that are most frequently mentioned in association with "Deutsche Bank". For this purpose, we will look at the textual co-occurrences of two named entities. The basic idea is that entities that frequently appear together are likely related.

1. For each text chunk, extract all mentions of the entity `('Deutsche Bank', 'ORG')`, as well as all `PERSON` entity mentions in the text using spaCy. Store the respective entity name and the text position. Unlike the previous question, you do *not* need to check for different spelllings of the "Deutsche Bank" entity.  
**Hint:** Entities are represented as a [`Span`](https://spacy.io/api/span) element in spaCy, which has access to text position.


In [70]:
entity_mentions_with_start_position = []

for chunk in db_chunks:
    chunk_mentions = []
    # Process the doc with spaCy
    doc = nlp(chunk)
    
    for ent in doc.ents:
        if ent.label_ == "PERSON" or (ent.text == "Deutsche Bank" and ent.label_ == "ORG"):
            chunk_mentions.append((ent.text,ent.start_char))
    
    # Extract only entity mentions of "Deutsche Bank" (ORG) or any PERSON mention.
    # Append each mention, including the text and its starting position, to `chunk_mentions`
    
    # Append the chunk's entities to the aggregate list
    entity_mentions_with_start_position.append(chunk_mentions)


2. Within each chunk, for each mention of `Deutsche Bank`, search for `PERSON` entities that have a starting position within 200 characters before/after the starting position of the `Deutsche Bank` mention. Count for each `PERSON` entity how many times it occurs nearby a mention of `Deutsche Bank`.  
Aggregate the co-occurrences across all chunks. 

In [71]:
co_occurrences = []

for chunk_mentions in entity_mentions_with_start_position:
    for mention in chunk_mentions:
        if mention[0]== "Deutsche Bank":
            co_occurrences.extend([i[0] for i in chunk_mentions if (i[1]>=(mention[1]-200) and i[1]<=(mention[1]+200) and i[0] != "Deutsche Bank")])


3. Return the number of co-occurrences and the name of the top 5 frequently occurring `PERSON` entities.


In [72]:
co_occurrence_counts = collections.Counter(co_occurrences)

print(co_occurrence_counts.most_common(5))

[('Jeffrey Epstein', 7), ('Warburg Invest', 6), ('Steuerbescheinigungen', 5), ('Sewing', 5), ('KGaA', 4)]


4. Look back at the results of your previous task. Are the `PERSON` entities returned by your co-occurrence method the same ones that appear most frequently by raw counts?

YOUR ANSWER HERE

The person entities differ from the previous raw counts, quite a bit. They are significantly lower, which is expected due to the 200 char limit. 

## 3. Neural Models with Huggingface (3 + 5 + 2 = 10 Points)

For state-of-the-art performance, most text-related tasks nowadays use some variation of the Transformer architecture. The particular advantage is especiall the readily available weights for models that have been pre-trained on large general-purpose datasets, which reduces the amount of domain-specific labeled training data.

In this task, we will explore the [Huggingface](https://hf.co/) ecosystem to see in which way Transformer models can be used.
One of the central aspects of the Huggingface platform is the so-called [Model Hub](https://huggingface.co/models), where you can find many different models uploaded by community members for a variety of tasks.

Because the neural models are generally very expensive to run, this exercise will be limited to  less data than in previous questions.

### Sub Task 1: Loading Transformer Models (3 Points)

1. Install the `transformers` library and load the model `cardiffnlp/twitter-roberta-base-sentiment-latest` to classify a sequence.
2. Report the result of the prediction on the test sequence.

In [6]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification

MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
input_text = "Das ist ein Test."

prediction = model(**tokenizer(input_text,return_tensors="pt"))

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
prediction 

SequenceClassifierOutput(loss=None, logits=tensor([[-0.9775,  1.1736, -0.6320]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### Sub Task 2: Using Pipelines (5 Points)

The most succinct way of using a Transformer model is the [`transformers.pipeline`](https://huggingface.co/docs/transformers/pipeline_tutorial). You can check out the linked tutorial for more information on the topic, but essentially, `pipeline` provides a light-weight wrapper around a number of different popular NLP tasks

1. Instead of manually defining a pipeline, now load a model through a `"text-classification"` pipeline. Look up the neural model that is loaded by default, and post the link to its [model card](https://huggingface.co/docs/hub/model-cards) below.


In [10]:
## YOUR 
from transformers import pipeline
pipe = pipeline("text-classification")

link = "https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english"

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


2. Now, instead, load a pipeline for `"text-classification"`, but with a custom model and tokenizer. Use the Model Hub platform to find the most popular model for the German language (by number of downloads) and manually specify the usage of another model (and tokenizer) to the pipeline. Re-run the previous example, and report the prediction result.


In [14]:

MODEL = f"oliverguhr/german-sentiment-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Instantiate the pipeline with custom components
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Output the prediction by your pipe on the test sample.
print(pipe(input_text))

[{'label': 'positive', 'score': 0.8108309507369995}]


3. Keeping in line with the previous exercises, let us now try and actually predict something with the model. Re-load a pipeline, this time for Named Entity Recognition, using the default model.

In [15]:
pipe = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading:   0%|          | 0.00/998 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/213k [00:00<?, ?B/s]

4. Run the pipeline with the text from the Deutsche Bank report from Question 2 and output the results.

In [20]:
## YOUR CODE
with open("DB_annual_report.txt") as inf:
    text = inf.read()
print(pipe(text))

[{'entity': 'I-ORG', 'score': 0.9962645, 'index': 1, 'word': 'Deutsche', 'start': 0, 'end': 8}, {'entity': 'I-ORG', 'score': 0.9980811, 'index': 2, 'word': 'Bank', 'start': 9, 'end': 13}, {'entity': 'I-ORG', 'score': 0.9950453, 'index': 7, 'word': 'Deutsche', 'start': 38, 'end': 46}, {'entity': 'I-ORG', 'score': 0.997999, 'index': 8, 'word': 'Bank', 'start': 47, 'end': 51}, {'entity': 'I-ORG', 'score': 0.95706505, 'index': 9, 'word': 'Deutsche', 'start': 107, 'end': 115}, {'entity': 'I-ORG', 'score': 0.9976624, 'index': 10, 'word': 'Bank', 'start': 116, 'end': 120}, {'entity': 'I-ORG', 'score': 0.9963329, 'index': 15, 'word': 'Deutsche', 'start': 144, 'end': 152}, {'entity': 'I-ORG', 'score': 0.9973074, 'index': 16, 'word': 'Bank', 'start': 153, 'end': 157}, {'entity': 'I-ORG', 'score': 0.9989078, 'index': 213, 'word': 'Deutsche', 'start': 2172, 'end': 2180}, {'entity': 'I-ORG', 'score': 0.9987111, 'index': 214, 'word': 'Bank', 'start': 2181, 'end': 2185}]


5. Look at the results. Something looks strange here; why is it not working properly? Elaborate your answer.

The number of named entities is pretty low in comparison to the length of the original document. 

### Sub Task 3: Using Datasets through Huggingface (2 Points)

Instead of using the `transformers` library for model training and inference, it is also possible to use other libraries by Huggingface without neural models.
In particular, the `datasets` library provides a centralized and streamlined way of accessing a variety of different datasets.

1. Using the `datasets` library, load the `imdb` dataset.

In [65]:
from datasets import load_dataset_builder
dataset_builder = load_dataset_builder('imdb')

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text to /home/clemens/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /home/clemens/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

2. Report the mean length of `text` column for the training, validation and test split, respectively.


In [86]:
import numpy as np
print(np.mean([len(i) for i in dataset_builder["train"]["text"]]))
print(np.mean([len(i) for i in dataset_builder["test"]["text"]]))
print(np.mean([len(i) for i in dataset_builder["unsupervised"]["text"]]))

1325.06964
1293.7924
1329.9025
