# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate scraping the news article from a webpage.

In [6]:
!pip install newspaper3k
!pip install lxml_html_clean
!pip install lxml[html_clean]



In [7]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/2024/10/25/style/banana-artwork-maurizio-cattelan-comedian-auction/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Oscar Holland'] 

Title:  Maurizio Cattelan’s viral banana artwork ‘Comedian’ could now be worth $1.5 million 

Text of article: 
 CNN —

When a banana duct-taped to a wall sold for $120,000 in 2019, social media uproar and an age-old debate about the meaning of art ensued.

But artist Maurizio Cattelan’s viral creation, titled “Comedian,” may yet prove a sound investment: On Friday, auction house Sotheby’s announced that one of the artwork’s three “editions” is going back on sale — this time with an estimate of $1 million to $1.5 million.

For their money, the winning bidder will receive a roll of duct tape and one banana, as well as a certificate of authenticity and official instructions for installing the work. Sotheby’s confirmed to CNN that neither the tape nor, thankfully, the banana are the originals.

“‘Comedian’ is a conceptual artwork, and the actual physical materials are replaced with every installation,” an auction spokesperson said via email.

Cattelan and Fre

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [8]:
# Calculate and print the number of unique words in the text

### Solution with spaCy:
import spacy

# Loading spaCy model
nlp = spacy.load('en_core_web_sm')

# Extracting the text of the article -> this extracts the main text content of the article and stores it in the variable "text"
text = article.text

# Processing the text
doc = nlp(text)

# Extracting words and counting unique words
words = [token.text for token in doc]
unique_words = set(words)

# Printing the number of all unique words:
print(f"Number of unique words in the article: {len(unique_words)}")

# Printing all unique words to the console
for word in unique_words:
  print(f"Unique words in the article: {word}")

Number of unique words in the article: 370
Unique words in the article: a
Unique words in the article: asking
Unique words in the article: urinal
Unique words in the article: while
Unique words in the article: merits
Unique words in the article: acquired
Unique words in the article: uproar
Unique words in the article: was
Unique words in the article: South
Unique words in the article: New
Unique words in the article: incident
Unique words in the article: Italian
Unique words in the article: put
Unique words in the article: after
Unique words in the article: replaced
Unique words in the article: installing
Unique words in the article: too
Unique words in the article: receive
Unique words in the article: Perrotin
Unique words in the article: up
Unique words in the article: before
Unique words in the article: realization
Unique words in the article: "
Unique words in the article: prove
Unique words in the article: it
Unique words in the article: donated
Unique words in the article: questi

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.

In [9]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again


### Solution with spaCy:
import spacy

# Loading spaCy model
nlp = spacy.load('en_core_web_sm')

# Extracting the text of the article -> this extracts the main text content of the article and stores it in the variable "text"
text = article.text

## 1. LOWERING ALL WORDS IN THE TEXT
# Processing the text - the method ".lower()" can't be used on the "Article" object itself, only on the text content of the article
# -> This processes the lowercase text using the spaCy pipeline, resulting in a Doc object that contains the processed text.
doc = nlp(text.lower())

## 2. REMOVING PUNCTUATION MARKERS AND NUMBERS + 3. LEMMATIZING ALL WORDS
# Extracting words, removing punctuation and numbers, and lemmatizing
# -> the loop iterates over each token in the Doc object and keeps the lemma of the token if the token consists only of alphabetic characters
words = [token.lemma_ for token in doc if token.is_alpha]

# This converts the list of words to a set -> automatically removes duplicates, resulting in a collection of unique words
unique_words = set(words)

# Printing the number of all unique words:
print(f"Number of unique words after preprocessing: {len(unique_words)}")

# Printing all unique words to the console
for word in unique_words:
  print(f"Unique words in the article: {word}")

Number of unique words after preprocessing: 311
Unique words in the article: a
Unique words in the article: urinal
Unique words in the article: while
Unique words in the article: basel
Unique words in the article: newspaper
Unique words in the article: uproar
Unique words in the article: why
Unique words in the article: incident
Unique words in the article: put
Unique words in the article: after
Unique words in the article: too
Unique words in the article: receive
Unique words in the article: up
Unique words in the article: before
Unique words in the article: korea
Unique words in the article: realization
Unique words in the article: prove
Unique words in the article: crowd
Unique words in the article: it
Unique words in the article: high
Unique words in the article: question
Unique words in the article: source
Unique words in the article: speak
Unique words in the article: request
Unique words in the article: certificate
Unique words in the article: immediately
Unique words in the art
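
The effect of the first two preprocessing steps can be seen in isolation on a couple of sentences. The following is a minimal sketch, not the spaCy solution above: it assumes a simple regex tokenizer (`re.findall`) as a stand-in for spaCy's tokenizer, applies `str.isalpha()` from the hint, and leaves out lemmatization, which needs a trained pipeline. The sample text is a made-up example loosely based on the article.

```python
import re

# Hypothetical sample text, loosely based on the article
sample = "The banana sold for $120,000 in 2019. The tape and the banana are replaced."

# Stand-in tokenizer (assumption): runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", sample)

raw = set(tokens)                                    # no preprocessing
lowered = {t.lower() for t in tokens}                # step 1: lowercase
alpha = {t.lower() for t in tokens if t.isalpha()}   # steps 1 + 2: drop punctuation and numbers

print(f"Unique tokens, raw:             {len(raw)}")
print(f"Unique tokens, lowercased:      {len(lowered)}")
print(f"Unique tokens, alphabetic only: {len(alpha)}")
```

Lowercasing merges "The" and "the", and the `isalpha()` filter removes "$", the digit groups, and the punctuation marks, so each step can only shrink the set of unique tokens.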

## **NER**

In the tutorial we used only one of the models available in spaCy. In this exercise, you will compare the performance of models of different sizes and implementations. The available models are described in the [spaCy documentation](https://spacy.io/models/en). First, the models need to be installed. We will use the following three models.

In [10]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 81.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB ? eta 0:00:00
✔ Download and installation successful

👋 ⚒ Use each of the three models downloaded above to perform named entity recognition on the original, non-preprocessed article, one after another. You can use different code cells for the different models or write everything in one cell, as you prefer. For each model's output, automatically count the number of named entities of each entity type that the model identifies.

In [11]:
import spacy
from collections import Counter

# Load the three spaCy models
nlp_sm = spacy.load("en_core_web_sm")
nlp_lg = spacy.load("en_core_web_lg")
nlp_trf = spacy.load("en_core_web_trf")

# Function to perform NER and count entities -> wrapping this in a function avoids repeating the same code for each model
def perform_ner_and_count(nlp, text):
  doc = nlp(text)
  ner_counts = Counter([ent.label_ for ent in doc.ents])
  return ner_counts

# Perform NER with each model
ner_counts_sm = perform_ner_and_count(nlp_sm, text)
ner_counts_lg = perform_ner_and_count(nlp_lg, text)
ner_counts_trf = perform_ner_and_count(nlp_trf, text)

# Print results for each model
print("NER counts with en_core_web_sm:")
for label, count in ner_counts_sm.items():
    print(f"{label}: {count}")

print("\nNER counts with en_core_web_lg:")
for label, count in ner_counts_lg.items():
    print(f"{label}: {count}")

print("\nNER counts with en_core_web_trf:")
for label, count in ner_counts_trf.items():
    print(f"{label}: {count}")



NER counts with en_core_web_sm:
ORG: 17
MONEY: 3
DATE: 9
PERSON: 4
NORP: 11
CARDINAL: 7
FAC: 2
GPE: 15
ORDINAL: 2
LOC: 1

NER counts with en_core_web_lg:
ORG: 17
MONEY: 3
DATE: 9
PERSON: 7
WORK_OF_ART: 7
CARDINAL: 7
NORP: 2
FAC: 1
GPE: 15
ORDINAL: 2
LOC: 1

NER counts with en_core_web_trf:
ORG: 19
MONEY: 3
DATE: 9
PERSON: 6
WORK_OF_ART: 9
CARDINAL: 8
NORP: 2
EVENT: 1
GPE: 15
TIME: 1
ORDINAL: 2
LOC: 1
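
To make the three outputs easier to compare, the counts can be placed side by side. The sketch below copies the counts printed above into `Counter` objects so that it runs on its own; `compare_counts` is a hypothetical helper, not part of spaCy. In the notebook, `ner_counts_sm`, `ner_counts_lg`, and `ner_counts_trf` could be passed in directly.

```python
from collections import Counter

# Counts copied from the notebook output above
counts_by_model = {
    "sm": Counter({"ORG": 17, "MONEY": 3, "DATE": 9, "PERSON": 4, "NORP": 11,
                   "CARDINAL": 7, "FAC": 2, "GPE": 15, "ORDINAL": 2, "LOC": 1}),
    "lg": Counter({"ORG": 17, "MONEY": 3, "DATE": 9, "PERSON": 7, "WORK_OF_ART": 7,
                   "CARDINAL": 7, "NORP": 2, "FAC": 1, "GPE": 15, "ORDINAL": 2, "LOC": 1}),
    "trf": Counter({"ORG": 19, "MONEY": 3, "DATE": 9, "PERSON": 6, "WORK_OF_ART": 9,
                    "CARDINAL": 8, "NORP": 2, "EVENT": 1, "GPE": 15, "TIME": 1,
                    "ORDINAL": 2, "LOC": 1}),
}

def compare_counts(counts_by_model):
    """Map each label to its count per model (0 if a model never used the label)."""
    labels = sorted(set().union(*counts_by_model.values()))
    return {label: [counts[label] for counts in counts_by_model.values()]
            for label in labels}

table = compare_counts(counts_by_model)

# Labels on which the models do not all report the same count
disagreements = [label for label, row in table.items() if len(set(row)) > 1]

print(f"{'label':<12}" + "".join(f"{name:>5}" for name in counts_by_model))
for label, row in table.items():
    marker = " *" if label in disagreements else ""
    print(f"{label:<12}" + "".join(f"{n:>5}" for n in row) + marker)
```

Rows marked with `*` are the labels where the models disagree. Equal counts do not guarantee that the models tagged the same spans, so the visualization is still needed to judge correctness.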


You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [12]:
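# You can also visualize the detected named entities.
# `doc` is re-created from the original text here, because the earlier `doc`
# was built from the lowercased text in the preprocessing cell.
from spacy import displacy
doc = nlp_trf(text)
displacy.render(doc, style="ent", jupyter=True)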

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

**en_core_web_sm:**
- the smallest of the three models, intended for applications where fast processing matters more than accuracy
- fast, but less accurate, because it has no word vectors
- COMPONENTS: tokenization, part-of-speech tagging, dependency parsing, named entity recognition, and lemmatization
- on this article: found no WORK_OF_ART entities and assigned NORP 11 times (the two larger models assigned it only twice)
-> Best for lightweight applications, quick experiments, and scenarios where computational resources are limited.

**en_core_web_lg:**
- larger than the small model
- more accurate, because it includes word vectors
- requires more memory and processing time
- a good balance between speed and accuracy
- COMPONENTS: all the components of the small model, plus word vectors
- on this article: found 7 WORK_OF_ART and 7 PERSON entities (the small model found 0 and 4)
-> Suitable for more demanding applications that require better accuracy and can afford the additional computational cost.

**en_core_web_trf:**
- the largest and slowest of the three models, because it is transformer-based
- generally the most accurate of the three on complex NLP tasks
- best for applications where precision is critical
- COMPONENTS: transformer-based components that represent each word in its context, which helps with ambiguous entities
- on this article: the only model that found EVENT and TIME entities, and the one that found the most ORG (19) and WORK_OF_ART (9) entities
-> Ideal for high-stakes applications where the highest accuracy is required, such as advanced research projects or production systems with ample resources.
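
The speed side of this trade-off can be measured rather than assumed. The following is a minimal sketch of a timing helper; `time_pipeline` and `dummy_pipeline` are hypothetical names, and the dummy callable only stands in for a loaded spaCy model so that the example runs without any model installed.

```python
import time

def time_pipeline(pipeline, text, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of pipeline(text)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pipeline(text)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in for a spaCy model (assumption), so the sketch is self-contained
def dummy_pipeline(text):
    return text.split()

elapsed = time_pipeline(dummy_pipeline, "A banana duct-taped to a wall.")
print(f"dummy_pipeline: {elapsed:.6f} s")

# In the notebook, the loaded models could be timed the same way, e.g.:
# for name, model in [("sm", nlp_sm), ("lg", nlp_lg), ("trf", nlp_trf)]:
#     print(name, time_pipeline(model, text))
```

Taking the minimum over a few repeats reduces the influence of one-off delays such as the first call warming up the model.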


👋 ⚒ Compare the NER results of the best-performing spaCy model on the preprocessed article with its results on the non-preprocessed article.

In [21]:
# Your code here

def preprocess_text(text):
  # Lowercase, drop non-alphabetic tokens, and lemmatize the text that is passed in
  # (the argument is used directly instead of being overwritten with article.text)
  doc = nlp(text.lower())
  words = [token.lemma_ for token in doc if token.is_alpha]
  preprocessed_text = " ".join(words)  # Join lemmatized words back into a single string
  return preprocessed_text

preprocessed_text = preprocess_text(text)

# Function to perform NER and count entities
def perform_ner_and_count(nlp, text):
    doc = nlp(text)
    ner_counts = Counter([ent.label_ for ent in doc.ents])
    return ner_counts

# Perform NER on original and preprocessed text
ner_counts_original = perform_ner_and_count(nlp_trf, text)
ner_counts_preprocessed = perform_ner_and_count(nlp_trf, preprocessed_text)

# Print results
print("NER counts with en_core_web_trf on original text:")
for label, count in ner_counts_original.items():
    print(f"{label}: {count}")

print("\nNER counts with en_core_web_trf on preprocessed text:")
for label, count in ner_counts_preprocessed.items():
    print(f"{label}: {count}")



######### I also ran the comparison with the weakest model to see the difference in the counts, because an error occurred every time I ran the
######### code. The error occurred because the function 'preprocess_text' returned the original text instead of the preprocessed text.
ner_counts_sm_original = perform_ner_and_count(nlp_sm, text)
ner_counts_sm_preprocessed = perform_ner_and_count(nlp_sm, preprocessed_text)
# Print results
print("\nNER counts with en_core_web_sm on original text:")
for label, count in ner_counts_sm_original.items():
    print(f"{label}: {count}")

print("\nNER counts with en_core_web_sm on preprocessed text:")
for label, count in ner_counts_sm_preprocessed.items():
    print(f"{label}: {count}")



NER counts with en_core_web_trf on original text:
ORG: 19
MONEY: 3
DATE: 9
PERSON: 6
WORK_OF_ART: 9
CARDINAL: 8
NORP: 2
EVENT: 1
GPE: 15
TIME: 1
ORDINAL: 2
LOC: 1

NER counts with en_core_web_trf on preprocessed text:
ORG: 10
GPE: 4
DATE: 2
PERSON: 8
NORP: 1
PRODUCT: 1

NER counts with en_core_web_sm on original text:
ORG: 17
MONEY: 3
DATE: 9
PERSON: 4
NORP: 11
CARDINAL: 7
FAC: 2
GPE: 15
ORDINAL: 2
LOC: 1

NER counts with en_core_web_sm on preprocessed text:
PERSON: 130
GPE: 12
LOC: 1
ORG: 49
PRODUCT: 8
NORP: 4
FAC: 1
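
The change caused by preprocessing can be summarized per label. The sketch below copies the `en_core_web_trf` counts printed above into `Counter` objects; `count_changes` is a hypothetical helper. In the notebook, `ner_counts_original` and `ner_counts_preprocessed` could be used instead.

```python
from collections import Counter

# en_core_web_trf counts copied from the notebook output above
original = Counter({"ORG": 19, "MONEY": 3, "DATE": 9, "PERSON": 6, "WORK_OF_ART": 9,
                    "CARDINAL": 8, "NORP": 2, "EVENT": 1, "GPE": 15, "TIME": 1,
                    "ORDINAL": 2, "LOC": 1})
preprocessed = Counter({"ORG": 10, "GPE": 4, "DATE": 2, "PERSON": 8, "NORP": 1, "PRODUCT": 1})

def count_changes(before, after):
    """Return after - before for every label that occurs in either Counter."""
    labels = sorted(set(before) | set(after))
    return {label: after[label] - before[label] for label in labels}

changes = count_changes(original, preprocessed)

# Labels that were found in the original text but not at all after preprocessing
lost = sorted(label for label in original if preprocessed[label] == 0)

print(f"Entities in original text:     {sum(original.values())}")
print(f"Entities in preprocessed text: {sum(preprocessed.values())}")
print(f"Labels lost entirely: {', '.join(lost)}")
for label, delta in changes.items():
    print(f"{label}: {delta:+d}")
```

That MONEY, CARDINAL, and ORDINAL disappear is at least consistent with the `is_alpha` filter removing numeric tokens, and lowercasing removes the capitalization that often marks names. Which of these effects explains each individual change is an interpretation that the counts alone do not confirm.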


## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models to identify the supported languages, listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [4]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model
!python -m spacy download de_core_news_lg



Collecting de-core-news-lg==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.7.0/de_core_news_lg-3.7.0-py3-none-any.whl (567.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 567.8/567.8 MB 2.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


👋 ⚒ Perform NER on the selected article.

In [17]:
# !pip install newspaper3k
# import newspaper
# from newspaper import Article

url = 'https://www.moment.at/story/kuenstliche-intelligenz-klima-gut-schlecht/'
article = Article(url)
article.download()
article.parse()

# Assign the article's text to "text2" for clear identification
text2 = article.text

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Jannik Hiddeßen', 'Von Jannik Hiddeßen'] 

Title:  Künstliche Intelligenz und Klima: Gut oder schlecht? 

Text of article: 
 ChatGPT und Co stehen in der Kritik. Ihre Server-Farmen fressen Energie und verbrauchen große Mengen Wasser. Die Emissionen der großen Tech-Konzerne wachsen durch die milliardenschweren Entwicklungen im KI-Bereich weiter an. Gleichzeitig entwickeln Forscher:innen KI-Anwendungen, die uns beim Klimaschutz unterstützen könnten - auch in Österreich. Was überwiegt beim KI-Hype? Chancen oder Risiken?

Kaum eine andere Technologie hat in den vergangenen Jahren einen solchen Hype erlebt wie die “Künstliche Intelligenz”. Studierende nutzen KI-Modelle für Seminararbeiten, Ärzt:innen als Unterstützung für Diagnosen und Unternehmen immer öfter als Verkaufsargument. Es gebe kaum ein Produkt, dass sich nicht durch Künstliche Intelligenz verbessern ließe. Das möchte uns die Tech-Industrie jedenfalls glauben lassen.

Ganz vorne dabei sind Tech-Giganten wie Microsoft 

In [20]:
# import spacy
# from collections import Counter

# Load the German spaCy model
# (this reuses the name `nlp_lg`, which previously held the English model)
nlp_lg = spacy.load("de_core_news_lg")

# Function to perform NER and count entities -> wrapping this in a function avoids repeating the same code
def perform_ner_and_count(nlp, text):
  doc = nlp(text)
  ner_counts = Counter([ent.label_ for ent in doc.ents])
  return ner_counts

# Perform NER with lg model
ner_counts_lg = perform_ner_and_count(nlp_lg, text2)

print("\nNER counts with de_core_news_lg:")
for label, count in ner_counts_lg.items():
  print(f"{label}: {count}")


# Print all named entities found in the text
print("\nNamed Entities found in the text:")
doc = nlp_lg(text2)  # Analyze text2 to extract entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


NER counts with de_core_news_lg:
MISC: 31
LOC: 18
ORG: 21
PER: 13

Named Entities found in the text:
ChatGPT (MISC)
Co (MISC)
Österreich (LOC)
KI-Hype (LOC)
KI-Modelle (MISC)
Microsoft (ORG)
Google (ORG)
Gemini (MISC)
Microsoft-Suchmaschine Bing (MISC)
KI (ORG)
ChatGPT (MISC)
KI-Hype

 (LOC)
KI-Hype (LOC)
KI-Modelle (MISC)
KI (ORG)
ChatGPT (MISC)
Internets (MISC)
KI-generiertem Trash-Content (MISC)
KI

 (MISC)
KI (ORG)
ChatGPT (MISC)
Gemini (MISC)
OpenAI (ORG)
Google (MISC)
Microsoft (ORG)
Goldman Sachs (ORG)
KI-Modell (ORG)
Google (MISC)
Googles (MISC)
PDF (MISC)
Kalifornien (LOC)
Arizona (LOC)
den USA (LOC)
amerikanischen Wüste (LOC)
KI-Rechenzentren (ORG)
Vereinigte Königreich von Großbritannien (LOC)
Ivan Dukic (PER)
KI-Startups (ORG)
KI-Modelle (MISC)
Alpen (LOC)
Innsbruck (LOC)
Innsbrucker (LOC)
Ivan Dukic (PER)
Rechenzentrums (LOC)
Ralph Hintemann (PER)
Borderstep Institut zu Digitalisierung und Nachhaltigkeit (ORG)
KI-Modell (ORG)
Meta-Modellen (PER)
Gemini (MISC)
ChatGPT (MISC

In [19]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

For this task I used spaCy's German pipeline "de_core_news_lg". My colleagues had started a discussion about the performance of German NER in spaCy before I finished the task, and I can now confirm their concerns about its rather poor results. The pipeline assigns four entity types: MISC ("Miscellaneous"), LOC ("Location"), PER ("Person"), and ORG ("Organization"). I also used the visualization code to inspect the detected named entities.

The model identified PER, the names of individual people, largely correctly, although the output also shows "Meta-Modellen" tagged as PER. Locations (LOC) were identified almost correctly as well, but some entities were wrongly labeled as locations (e.g. "KI-Hype" and "Rechenzentrums"). I see the major problem with ORG and MISC, where more errors were made. Quite a few entities in the text are products or services that were labeled MISC. I think a separate entity type for products and services would reduce the overly broad use of "MISC" and lead to better named entity recognition. The labeling is also inconsistent (e.g. "KI" is labeled ORG in some places and MISC in others).
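
Inconsistent labels of this kind can be found automatically by grouping the entities by their text. The sketch below uses a subset of the (entity text, label) pairs from the output above; `find_inconsistent_labels` is a hypothetical helper. In the notebook, the pairs could be built with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# A subset of the (entity text, label) pairs from the notebook output above
entities = [
    ("ChatGPT", "MISC"), ("Österreich", "LOC"), ("KI-Hype", "LOC"),
    ("Microsoft", "ORG"), ("Google", "ORG"), ("KI", "ORG"),
    ("KI\n\n", "MISC"), ("Google", "MISC"), ("Gemini", "MISC"),
    ("KI-Modell", "ORG"), ("KI-Modelle", "MISC"), ("Microsoft", "ORG"),
]

def find_inconsistent_labels(entities):
    """Return the entity texts that received more than one label, with those labels."""
    labels_by_text = defaultdict(set)
    for text, label in entities:
        # Strip surrounding whitespace so "KI\n\n" and "KI" count as the same entity
        labels_by_text[text.strip()].add(label)
    return {text: sorted(labels)
            for text, labels in labels_by_text.items() if len(labels) > 1}

inconsistent = find_inconsistent_labels(entities)
for text, labels in inconsistent.items():
    print(f"{text}: {', '.join(labels)}")
```

The grouping uses exact string matches, so inflected forms such as "KI-Modell" and "KI-Modelle" are treated as different entities; comparing lemmas instead would be one possible refinement.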