<a href="https://colab.research.google.com/github/larajakl/Computational-Linguistics/blob/main/homeexercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate scraping the news article from a webpage.

In [15]:
!pip install newspaper3k



In [16]:
!pip install lxml_html_clean newspaper3k

import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/2024/10/25/style/banana-artwork-maurizio-cattelan-comedian-auction/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Oscar Holland'] 

Title:  Maurizio Cattelan’s viral banana artwork ‘Comedian’ could now be worth $1.5 million 

Text of article: 
 CNN —

When a banana duct-taped to a wall sold for $120,000 in 2019, social media uproar and an age-old debate about the meaning of art ensued.

But artist Maurizio Cattelan’s viral creation, titled “Comedian,” may yet prove a sound investment: On Friday, auction house Sotheby’s announced that one of the artwork’s three “editions” is going back on sale — this time with an estimate of $1 million to $1.5 million.

For their money, the winning bidder will receive a roll of duct tape and one banana, as well as a certificate of authenticity and official instructions for installing the work. Sotheby’s confirmed to CNN that neither the tape nor, thankfully, the banana are the originals.

“‘Comedian’ is a conceptual artwork, and the actual physical materials are replaced with every installation,” an auction spokesperson said via email.

Cattelan and Fre

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [17]:
# Calculate and print the number of unique words in the text

import re  # import re to be able to use regular expressions

text_as_list = re.split(r'[;:,\s!.]+', article.text)  # Split the article text on any run of whitespace, semicolons, colons, commas, exclamation marks, or periods. This creates a new variable "text_as_list" and saves the article text in it as a list
text_as_set = set(text_as_list)  # transforming the list into a set to eliminate duplicates

print(text_as_set)  # prints all unique words
print(len(text_as_set))  # prints the number of unique words

# for comparison I also print these:
print()
print(len(text_as_list))  # prints the number of words (including duplicates)
print(len(article.text))  # prints the number of characters in the article (because "article.text" is a string)

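# note to self: the split pattern does not cover quotation marks, parentheses, or dashes, so these stay attached to tokens (e.g. '(but') or are counted as words (e.g. '-'); the empty string '' is counted as a word too!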

{'', 'artistic', 'right', 'While', 'Museum', '“‘Comedian’', 'prove', '(but', 'put', 'their', 'Milan', 'up', 'debate', 'stunned', 'three', 'originals', 'Source', 'higher', 'it', 'first', 'owner', 'would', 'rooted', 'famous', 'nor', 'so', 'had', 'and', 'vandalism', 'fruit', 'were', 'for', 'wall', 'asking', 'sum', 'critics', 'receive', 'on', 'respond', 'Perrotin', 'Tokyo', 'Kong', 'going', 'too', 'roll', 'student', 'creation', 'age-old', '-', 'New', 'could', '“Balancing', 'say', 'our', 'culture', 'contemporary', 'its', 'with', 'comment', 'winning', 'be', '20', 'store', 'because', 'place', 'request', 'using', 'eating', 'questions', 'certificate', 'Speaking', 'generation', 'Miami', 'displayed', 'act', 'tour', 'announced', 'seller', 'money', 'courtesy', 'core', 'very', 'million', 'created', 'satirical', 'incident', 'actual', 'in', 'headquarters', 'Feedback', 'marks', 'Americas', 'genius', 'Video', 'physical', 'National', 'grabbed', '“Comedian', '2019', '“not', 'Crowds', 'estimate', 'headline
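
The note in the cell above points out that punctuation is still counted. Here is a minimal sketch of one alternative, assuming it is acceptable to define a word as a run of letters with optional internal hyphens or apostrophes. `re.findall` extracts only those runs, so empty strings and stray quotation marks never enter the list. The function name `count_unique_words` and the sample sentence are illustrative and not part of the exercise.

```python
import re

# A word is a run of letters, optionally joined by internal hyphens or apostrophes
# (e.g. "age-old", "Sotheby's"). This pattern is an assumption, not the only valid choice.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def count_unique_words(text):
    """Return (number of tokens, number of unique tokens) for the given text."""
    tokens = WORD_PATTERN.findall(text)
    return len(tokens), len(set(tokens))


sample = "“Comedian” is a conceptual artwork, and the banana (not the tape) is replaced."
print(count_unique_words(sample))
```

In the notebook, the same function could be called as `count_unique_words(article.text)`. The count is still case-sensitive, which the preprocessing section below addresses.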

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.

In [18]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again

# In the following, I used the NLTK lemmatizer. This was before I read in the forum that we are supposed to use the spaCy lemmatizer. See the next
# code block for the implementation using spaCy.

import nltk

nltk.download('averaged_perceptron_tagger')

nltk.download('punkt')

nltk.download('wordnet')

nltk.download('gutenberg')

# Lemmatizer

from nltk.stem import WordNetLemmatizer

# Initialize lemmatizer

lemmatizer = WordNetLemmatizer()

# Here I make all tokens in the article text lowercase and save them in a list called "text_as_list":
text_as_list = re.split(r'[;:,\s!.]+', article.text.lower())

# Here I initiate a new list:
new_text_nltk = []

# Here I fill the new list with (only alphabetic!) lemmatised words (I remove punctuation and numbers):
for word in text_as_list:  # this loops through the tokens in the list
  if word.isalpha():  # this line makes sure only alphabetic tokens (no punctuation, no numbers) are added to the new list
    lemmatized_word = lemmatizer.lemmatize(word)  # this lemmatises the current token
    new_text_nltk.append(lemmatized_word)  # this adds the current lemmatised token to the new list
  else:
    continue  # if a word is not alphabetic, it is basically skipped

# To get unique words, I transfer this new list into a set again:
unique_words_after_preprocessing_nltk = set(new_text_nltk)
print(unique_words_after_preprocessing_nltk) # here I print the unique words
print(len(unique_words_after_preprocessing_nltk)) # here I print the number of unique words


{'artistic', 'right', 'prove', 'crowd', 'source', 'put', 'their', 'speaking', 'up', 'debate', 'stunned', 'three', 'higher', 'piece', 'event', 'it', 'first', 'owner', 'would', 'rooted', 'famous', 'nor', 'so', 'had', 'and', 'mark', 'vandalism', 'fruit', 'were', 'for', 'wall', 'asking', 'sum', 'cattelan', 'receive', 'on', 'respond', 'going', 'newspaper', 'too', 'roll', 'student', 'creation', 'university', 'could', 'say', 'our', 'culture', 'contemporary', 'comment', 'with', 'critic', 'winning', 'be', 'store', 'because', 'place', 'request', 'using', 'eating', 'certificate', 'generation', 'displayed', 'act', 'tour', 'announced', 'seller', 'money', 'basel', 'courtesy', 'core', 'very', 'million', 'medium', 'london', 'david', 'created', 'satirical', 'incident', 'actual', 'in', 'headquarters', 'wa', 'feedback', 'seoul', 'genius', 'physical', 'grabbed', 'estimate', 'reveal', 'embarks', 'two', 'purchased', 'lining', 'grocery', 'safety', 'concern', 'some', 'notion', 'tokyo', 'what', 'tape', 'to', '

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [19]:
# Here I use spaCy for the task:

# first I need to import spaCy
import spacy

# then I load the language model
nlp = spacy.load("en_core_web_sm")

# then I have the text processed; specifically the text in all lowercase. This line processes the text so that tokens, lemmas, and POS tags become accessible.
doc = nlp(article.text.lower())

# Here I initiate a new list:
new_text_spaCy = []

# Here I fill the new list with (only alphabetic!) lemmatised words (I remove punctuation and numbers):
for token in doc:  # this loops through the tokens in the article
  if token.is_alpha:  # this line makes sure only alphabetic tokens (no punctuation, no numbers) are added to the new list
    new_text_spaCy.append(token.lemma_)  # this adds the current lemmatised token to the new list
  else:
    continue  # if a word is not alphabetic, it is basically skipped

# To get unique words, I transfer this new list into a set again:
unique_words_after_preprocessing_spaCy = set(new_text_spaCy)
print(unique_words_after_preprocessing_spaCy) # here I print the unique words
print(len(unique_words_after_preprocessing_spaCy)) # here I print the number of unique words

{'figure', 'artistic', 'right', 'prove', 'crowd', 'source', 'put', 'their', 'up', 'debate', 'stunned', 'three', 'piece', 'event', 'it', 'first', 'owner', 'would', 'intend', 'famous', 'nor', 'so', 'and', 'mark', 'vandalism', 'fruit', 'for', 'wall', 'grab', 'sum', 'cattelan', 'describe', 'receive', 'on', 'respond', 'newspaper', 'too', 'roll', 'student', 'creation', 'university', 'could', 'say', 'our', 'culture', 'contemporary', 'comment', 'with', 'its', 'critic', 'be', 'store', 'because', 'place', 'request', 'remove', 'certificate', 'generation', 'decide', 'act', 'undisclosed', 'tour', 'seller', 'money', 'use', 'basel', 'courtesy', 'replace', 'core', 'very', 'acquire', 'million', 'medium', 'london', 'david', 'title', 'satirical', 'incident', 'actual', 'in', 'headquarters', 'define', 'feedback', 'seoul', 'genius', 'physical', 'estimate', 'reveal', 'two', 'concern', 'grocery', 'safety', 'some', 'notion', 'add', 'tokyo', 'what', 'defend', 'tape', 'to', 'exhibit', 'sound', 'instal', 'attende
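
The two printed sets differ: the NLTK set contains forms such as 'wa' and 'grabbed', while the spaCy set contains 'grab'. A likely reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed, whereas spaCy uses its own POS tags. The sketch below shows how the two vocabularies could be compared with plain set operations. `compare_vocabularies` is a hypothetical helper, and the two excerpts are only a few items copied from the outputs above, so the result says nothing about the full sets.

```python
def compare_vocabularies(vocab_a, vocab_b):
    """Split two vocabularies into shared items and items unique to each side."""
    vocab_a, vocab_b = set(vocab_a), set(vocab_b)
    return {
        "shared": sorted(vocab_a & vocab_b),
        "only_a": sorted(vocab_a - vocab_b),
        "only_b": sorted(vocab_b - vocab_a),
    }


# Small excerpts copied from the two printed sets above (not the full vocabularies).
nltk_excerpt = {"prove", "crowd", "grabbed", "created", "wa", "displayed"}
spacy_excerpt = {"prove", "crowd", "grab", "intend", "use", "replace"}

result = compare_vocabularies(nltk_excerpt, spacy_excerpt)
for key, words in result.items():
    print(key, words)
```

In the notebook, the full comparison would be `compare_vocabularies(unique_words_after_preprocessing_nltk, unique_words_after_preprocessing_spaCy)`.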

## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [20]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 30.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 3.3 MB/s eta 0:00:00
✔ Download and installation succe

👋 ⚒  Use each of the three models that were downloaded above and perform named entity recognition with each of them on the original, non-preprocessed article, one after another. You can use different code cells for the different models or write everything into one cell, as you prefer. For each of the model outputs, automatically calculate the number of named entities for each NER type that the model identifies.

In [21]:
#Here I use nlp.sm:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
#nlp_lg = spacy.load("en_core_web_lg")
#nlp_trf = spacy.load("en_core_web_trf")

# I use the original not preprocessed article:
doc = nlp(article.text)

# this will print the named entities, their starting and ending character index of the entity within the document, and their NER type:
for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Calculate the number of NERs for each NER type that the model identifies:
entity_counts = Counter([ent.label_ for ent in doc.ents])
print()
print(entity_counts)

CNN 0 3 ORG
120,000 52 59 MONEY
2019 63 67 DATE
Maurizio Cattelan 156 173 PERSON
Comedian 200 208 NORP
Friday 248 254 DATE
Sotheby’s 270 279 ORG
three 316 321 CARDINAL
$1 million to $1.5 million 387 413 MONEY
one 489 492 CARDINAL
Sotheby’s 593 602 ORG
CNN 616 619 ORG
Cattelan 841 849 NORP
French 854 860 NORP
Perrotin 873 881 PERSON
five years ago 914 928 DATE
Comedian 950 958 NORP
six 967 970 CARDINAL
the Art Basel Miami Beach 994 1019 FAC
Miami 1078 1083 GPE
CNN 1558 1561 ORG
David Datuna 1618 1630 PERSON
hundreds 1706 1714 CARDINAL
Miami 1844 1849 GPE
three 1923 1928 CARDINAL
Two 1961 1964 CARDINAL
120,000 2004 2011 MONEY
third 2023 2028 ORDINAL
Guggenheim 2108 2118 FAC
New York 2129 2137 GPE
Sotheby’s 2140 2149 ORG
November 2197 2205 DATE
one 2290 2293 CARDINAL
Miami 2349 2354 GPE
Cattelan 2369 2377 ORG
Comedian 2393 2401 NORP
the Art Newspaper 2440 2457 ORG
2021 2461 2465 DATE
Italian 2564 2571 NORP
CNN 2677 2680 ORG
November 2709 2717 DATE
Comedian 2752 2760 NORP
Sotheby's 2774 27

In [22]:
#Here I use nlp.lg:

import spacy
from collections import Counter

#nlp = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")
#nlp_trf = spacy.load("en_core_web_trf")

# I use the original not preprocessed article:
doc = nlp_lg(article.text)

for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Calculate the number of NERs for each NER type that the model identifies:
entity_counts = Counter([ent.label_ for ent in doc.ents])
print()
print(entity_counts)

CNN 0 3 ORG
120,000 52 59 MONEY
2019 63 67 DATE
Maurizio Cattelan 156 173 PERSON
Comedian 200 208 WORK_OF_ART
Friday 248 254 DATE
Sotheby’s 270 279 ORG
three 316 321 CARDINAL
$1 million to $1.5 million 387 413 MONEY
one 489 492 CARDINAL
Sotheby’s 593 602 ORG
CNN 616 619 ORG
Comedian’ 692 701 WORK_OF_ART
Cattelan 841 849 PERSON
French 854 860 NORP
Perrotin 873 881 ORG
five years ago 914 928 DATE
Comedian 950 958 WORK_OF_ART
six 967 970 CARDINAL
the Art Basel Miami Beach 994 1019 FAC
Miami 1078 1083 GPE
Marcel Duchamp 1322 1336 PERSON
CNN 1558 1561 ORG
David Datuna 1618 1630 PERSON
hundreds 1706 1714 CARDINAL
Miami 1844 1849 GPE
three 1923 1928 CARDINAL
Two 1961 1964 CARDINAL
120,000 2004 2011 MONEY
third 2023 2028 ORDINAL
Guggenheim 2108 2118 ORG
New York 2129 2137 GPE
Sotheby’s 2140 2149 ORG
November 2197 2205 DATE
one 2290 2293 CARDINAL
Miami 2349 2354 GPE
Cattelan 2369 2377 PERSON
Comedian 2393 2401 WORK_OF_ART
the Art Newspaper 2440 2457 ORG
2021 2461 2465 DATE
Italian 2564 2571 NOR
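
Because both outputs print character offsets, spans that the two models found at the same position but labelled differently can be listed automatically. The sketch below assumes each entity is stored as a `(text, start_char, end_char, label)` tuple. `label_disagreements` is a hypothetical helper, and the rows are a few lines copied from the sm and lg outputs above.

```python
def label_disagreements(ents_a, ents_b):
    """Return spans found by both models at the same offsets but with different labels.

    Each entity is a (text, start_char, end_char, label) tuple.
    """
    labels_b = {(start, end): label for _, start, end, label in ents_b}
    disagreements = []
    for text, start, end, label_a in ents_a:
        label_b = labels_b.get((start, end))
        if label_b is not None and label_b != label_a:
            disagreements.append((text, start, end, label_a, label_b))
    return disagreements


# A few rows copied from the sm and lg outputs printed above.
sm_rows = [
    ("Comedian", 200, 208, "NORP"),
    ("Cattelan", 841, 849, "NORP"),
    ("Perrotin", 873, 881, "PERSON"),
    ("Miami", 1078, 1083, "GPE"),
    ("Guggenheim", 2108, 2118, "FAC"),
]
lg_rows = [
    ("Comedian", 200, 208, "WORK_OF_ART"),
    ("Cattelan", 841, 849, "PERSON"),
    ("Perrotin", 873, 881, "ORG"),
    ("Miami", 1078, 1083, "GPE"),
    ("Guggenheim", 2108, 2118, "ORG"),
]

for row in label_disagreements(sm_rows, lg_rows):
    print(row)
```

In the notebook, the rows could be built with `[(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. Since every cell overwrites `doc`, each model's result would need its own variable for this comparison.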

In [23]:
#Here I use nlp.trf:
import spacy
from collections import Counter

#nlp = spacy.load("en_core_web_sm")
#nlp_lg = spacy.load("en_core_web_lg")
nlp_trf = spacy.load("en_core_web_trf")

# I use the original not preprocessed article:
doc = nlp_trf(article.text)

for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Calculate the number of NERs for each NER type that the model identifies:
entity_counts = Counter([ent.label_ for ent in doc.ents])
print()
print(entity_counts)

ValueError: [E002] Can't find factory for 'curated_transformer' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, en.lemmatizer
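
The error above says that spaCy cannot find the `curated_transformer` factory. A likely cause, stated here as an assumption, is that the `spacy-curated-transformers` package used by `en_core_web_trf` is not importable in the running session, for example because the runtime was not restarted after the download (the download log above asks for a restart). The sketch below only checks whether modules can be found; `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Module names assumed to be needed for en_core_web_trf (an assumption, see above).
required = ["spacy", "spacy_curated_transformers"]
missing = missing_modules(required)
if missing:
    print("Not importable:", missing)
else:
    print("All required modules were found.")
```

If a module is reported as not importable, installing it with pip (for example `!pip install spacy-curated-transformers`) and then restarting the runtime may resolve the error.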

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

The Counter gives similar output for the 3 models. However, the models differ in the NER types they assign to the named entities. All of the models use the following types: ORG, GPE, DATE, PERSON, MONEY, CARDINAL, LOC, ORDINAL, NORP.
The differences are:
- Only the sm and lg models use FAC, which stands for facility. For example, the sm model labels "Guggenheim" as FAC, whereas the lg model labels it as ORG (organisation), and the trf model recognises the full span "The Guggenheim Museum" and also labels it as ORG. This suggests that the lg and especially the trf model are more accurate, at least in this specific case.
- Only the lg and trf models use WORK_OF_ART, which applies to titles or names of creative works, such as the title of the artwork in the article ("Comedian"). This is an interesting case: the sm model labels "Comedian" as NORP, whereas the lg and trf models label it as WORK_OF_ART, which is more accurate!
- The trf model is the only one of the 3 models that uses the additional types EVENT and TIME alongside DATE. All 3 models extract the same 9 named entities of type DATE. The trf model additionally extracts 1 named entity of type TIME ("02:42"), which might be useful if we are looking for specific time information in the text. The usefulness of the type EVENT shows in the named entity "Art Basel Miami Beach", which the sm and lg models label as FAC and the trf model labels as EVENT. EVENT (and hence the trf model) is more accurate here because Art Basel Miami Beach is in fact an art fair, i.e. an event.

All in all, the trf model seems to be the most accurate in the NER runs above.
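
The label inventories described above can also be compared automatically. The sketch below inverts a mapping from model to label set into a mapping from label to the models that use it. `models_per_label` is a hypothetical helper, and the label sets are typed in from the analysis above rather than computed from the models.

```python
def models_per_label(labels_by_model):
    """Invert {model: set of labels} into {label: sorted list of models using it}."""
    usage = {}
    for model, labels in labels_by_model.items():
        for label in labels:
            usage.setdefault(label, []).append(model)
    return {label: sorted(models) for label, models in sorted(usage.items())}


# Label inventories as described in the analysis above.
shared = {"ORG", "GPE", "DATE", "PERSON", "MONEY", "CARDINAL", "LOC", "ORDINAL", "NORP"}
labels_by_model = {
    "sm": shared | {"FAC"},
    "lg": shared | {"FAC", "WORK_OF_ART"},
    "trf": shared | {"WORK_OF_ART", "EVENT", "TIME"},
}

all_models = sorted(labels_by_model)
for label, models in models_per_label(labels_by_model).items():
    if models != all_models:
        print(label, "is used only by", models)
```

In the notebook, each label set could be taken from the counters, for example `set(entity_counts)` right after running a model.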

👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [None]:
#Here I use nlp.trf on the preprocessed version:

nlp_trf = spacy.load("en_core_web_trf")

text_preprocessed = ' '.join(unique_words_after_preprocessing_spaCy)  # Join the items with spaces because the spaCy nlp() function expects a single string as input. Note: this variable is a set, so the original word order and repeated words are lost.

doc = nlp_trf(text_preprocessed)

for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Calculate the number of NERs for each NER type that the model identifies:
entity_counts = Counter([ent.label_ for ent in doc.ents])
print()
print(entity_counts)

When I compare the NER output of the trf model (in my opinion the best performing spaCy model) on the preprocessed article with its output on the non-preprocessed article, the model recognises fewer named entities in the preprocessed article: 29 versus 76. The named entities recognised in the preprocessed text also seem a lot less accurate; they include quite random words with non-fitting NER types, such as "take" labelled as PERSON.

It makes sense that NER performs worse on the preprocessed text. We made all words lowercase, which probably makes it harder for the model to recognise named entities because capitalisation is often an important signal, e.g. "Apple" the company versus "apple" the fruit. Moreover, we lemmatised all words, which might change entity names so that they can no longer be recognised. The preprocessed input was also built from the set of unique lemmas, so the original word order and sentence context are lost. Furthermore, numbers (types MONEY, DATE, TIME) cannot appear in the NER output for the preprocessed text because we had previously removed all numbers and punctuation.
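
To make the loss of information concrete, the sketch below applies a simplified version of the preprocessing to one sentence from the article and reports which tokens are dropped. It assumes whitespace splitting instead of spaCy tokenisation and skips lemmatisation, so it only illustrates the lowercasing and the `isalpha` filter. `preprocess_tokens` is a hypothetical helper.

```python
def preprocess_tokens(text):
    """Lowercase the text, split on whitespace, and separate alphabetic tokens from the rest."""
    kept, dropped = [], []
    for token in text.lower().split():
        if token.isalpha():
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped


sentence = "When a banana duct-taped to a wall sold for $120,000 in 2019"
kept, dropped = preprocess_tokens(sentence)
print("kept:   ", " ".join(kept))
print("dropped:", dropped)
```

The amount and the year, which the models label as MONEY and DATE in the original text, end up in the dropped list.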

## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models and identify the supported languages, listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model

import newspaper
from newspaper import Article

url = 'https://www.derstandard.at/story/3000000242971/arnold-schwarzenegger-waehlt-kamala-harris'
article = Article(url)
article.download()
article.parse()

!python -m spacy download de_core_news_sm

# Loading the model:
nlp = spacy.load("de_core_news_sm")


👋 ⚒ Perform NER on the selected article.

In [None]:
doc = nlp(article.text)

for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

entity_counts = Counter([ent.label_ for ent in doc.ents])
print()
print(entity_counts)

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

The NER model for this German text did not perform as well as the NER models for English in the previous tasks; perhaps the model I chose is not as good yet.

Overall, the performance is mixed because some parts of the extraction are correct and some parts are incorrect.

The LOC type worked well: all locations were extracted correctly. However, the other types did not always work well. The PER type worked moderately well: some persons were extracted correctly, but some were not extracted at all, such as "Kamala Harris", and some named entities of type PER are incorrect, such as "Hass".

Moreover, the model extracted some named entities that are not actually named entities (e.g., "republikanische").

The model certainly had trouble with more complex phrases (e.g., "republikanische Ex-Gouverneur", "Z.B. Browser-AddOns wie Adblocker").

The MISC type is used a few times but mostly for non-relevant or incorrect entities such as "Z.B. Browser-AddOns wie Adblocker" or for entities that should belong to other types, such as "STANDARD" which should be in type ORG. This might indicate that the model did not know how to group them but that it still considered them important enough to extract them.
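
The impressions above could be made measurable by comparing the model output with a hand-made list of correct entities and computing precision and recall. The sketch below shows the calculation. `precision_recall` is a hypothetical helper, and the two sets are toy data: the names come from the analysis above, but the sets are neither the model's actual output nor a real annotation of the article.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for sets of (entity text, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data for illustration only (hypothetical, see the paragraph above).
predicted = {("Arnold Schwarzenegger", "PER"), ("Hass", "PER"), ("STANDARD", "MISC")}
gold = {("Arnold Schwarzenegger", "PER"), ("Kamala Harris", "PER"), ("STANDARD", "ORG")}

precision, recall = precision_recall(predicted, gold)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

An entity only counts as correct here if both its text and its type match, so "STANDARD" labelled as MISC instead of ORG counts as an error.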