**Introduction to Natural Language Processing**<br/>
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>
Date: **SS 2025**

# Overview of NLP Tasks

**Learning Goals**

* Obtain an overview of a selection of common NLP tasks.
* Run some state-of-the-art models on different types of texts.
* Observe what type of information the models return.

**Don't Panic**

* You do not need to understand the code below in detail yet. The comments and explanations should enable you to run it and to make small modifications to the input examples. (If this is not the case, let me know!)
* Many of the links below lead to websites with more advanced information on the topics. You do not need to understand them in detail, but if you read a bit on these websites and take notes, you will probably learn a lot.

**Instructions for In-Class Group Activity**

1. Distribute the NLP tasks listed below among your team members. You have 15 minutes to work on one NLP task individually or in teams of two (if you are done more quickly, work on more NLP tasks). Be prepared to present your findings in a two-minute oral presentation (without slides). You may of course show your notebook to your fellow group members during this presentation.
2. Briefly present all the NLP tasks within your group (max. 2 minutes per task). Take notes.

**Technical Prerequisites**

Use a GPU to process your data.
This step is particularly helpful to speed up processing when working with large neural models. In this notebook, we use them for the following tasks:

* Text Classification and Sentiment Analysis
* Question Answering
* Grammar Correction

On Google Colaboratory, do the following steps:<br/>
``Runtime --> Change runtime type --> Hardware accelerator --> T4 GPU or TPU``

And now, let's get started.
First, let's import some modules that we will need.


In [None]:
# Imports and general settings
import pprint
pp = pprint.PrettyPrinter(indent=4)
from collections import defaultdict

In [None]:
# Imports
import spacy
from spacy import displacy

# Load the model that we will use for predictions
nlp = spacy.load("en_core_web_sm")

In [None]:
import torch
print(torch.__version__)

❗Now, you can jump to the section of the NLP Tasks that you were assigned within your group.

---

## Text Classification and Sentiment Analysis
We will use the [HuggingFace](https://huggingface.co/) library to load a model trained for binary sentiment analysis and apply it to a movie reviews corpus.

In [None]:
!pip install -q transformers
!pip install datasets
from datasets import load_dataset

In [None]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

### Working with a sentiment analysis dataset

We will next load an existing dataset of movie reviews (`rotten_tomatoes`) with gold labels, i.e., labels assigned by a human, that indicate whether a review expresses a positive or a negative sentiment towards the movie.
You can find more information on the dataset [here](https://huggingface.co/datasets/rotten_tomatoes).

In [None]:
# Load a dataset of movie reviews that are annotated with regard to whether they are positive or negative
dataset = load_dataset("rotten_tomatoes", split="validation") # we choose the validation split of the dataset

# Collect 10 positive and 10 negative instances
positive_reviews = defaultdict(list)
negative_reviews = defaultdict(list)
for i, (gold_label, review_text) in enumerate(zip(dataset["label"], dataset["text"])):
  if gold_label == 1 and len(positive_reviews["text"]) < 10:
    positive_reviews["text"].append(review_text)
    positive_reviews["label"].append(gold_label)
  if gold_label == 0 and len(negative_reviews["text"]) < 10:
    # skipping a few cases (honestly, they are just odd cases in the dataset, this happens :))
    if i not in (539, 541, 542):
      negative_reviews["text"].append(review_text)
      negative_reviews["label"].append(gold_label)

print("POSITIVE REVIEWS:")
for i, text in enumerate(positive_reviews["text"]):
  print(positive_reviews["label"][i], "\t", text)

print("\nNEGATIVE REVIEWS:")
for i, text in enumerate(negative_reviews["text"]):
  print(negative_reviews["label"][i], "\t", text)

In [None]:
# Let's process this dataset with our model
sentiment_pipeline(positive_reviews["text"])

In [None]:
# Wow, that worked well, what about the negative cases?
sentiment_pipeline(negative_reviews["text"])

❓[1.1] Why do you think the model got one case wrong here?

❓[1.2] Read about **aspect-based sentiment analysis** on [PapersWithCode](https://paperswithcode.com/task/aspect-based-sentiment-analysis). What would be possible aspects for hotel reviews / restaurant reviews?

Sentiment analysis is one type of **text classification** task: we provide the model with a text and obtain a label. In this case, we had a **binary** classification task that only decides between two labels. In principle, **multi-class** classification tasks decide between more than two labels, and **multi-label** classification tasks even allow a model (or a human annotator) to assign more than one label from the label set.
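
The difference between the three settings is easiest to see in how the labels are stored. The following is a minimal sketch with invented texts and label names (none of them come from a real dataset): binary and multi-class instances carry exactly one label, while a multi-label instance carries a set of labels, often encoded as a 0/1 vector over the label set.

```python
# Toy instances for the three classification settings (texts and labels are invented).
binary_example = {"text": "A wonderful film.", "label": "positive"}            # one of 2 labels
multiclass_example = {"text": "The striker scored twice.", "label": "sports"}  # one of >2 labels
multilabel_example = {
    "text": "Great food, but slow service.",
    "labels": ["food_positive", "service_negative"],                           # any subset
}

def to_multi_hot(assigned, label_set):
    """Encode the assigned labels as a 0/1 vector over the full label set."""
    return [1 if label in assigned else 0 for label in label_set]

# Hypothetical label set for restaurant reviews.
label_set = ["food_positive", "food_negative", "service_positive", "service_negative"]
multi_hot = to_multi_hot(multilabel_example["labels"], label_set)
print(multi_hot)
```

Each position of the vector answers a separate yes/no question, which is why multi-label classification is often treated as several binary decisions made in parallel.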

❓[1.3] Can you find further examples for binary, multi-class, and multi-label text classification tasks?

---


## Named Entity Recognition and Typing

We will use the [spaCy](https://spacy.io/) library to find out which **named entities** occur in a given text.

❓ [1.4] Briefly skim [the Wikipedia page on Named Entities](https://en.wikipedia.org/wiki/Named_entity) and note down a definition that will help you to remember what this NLP task is about.

**Definition:** In Information Extraction, the NLP task of Named Entity Recognition and Typing means ... *(add your answer here)*

In [None]:
# Source: https://en.wikipedia.org/wiki/Augsburg
input_text = """Augsburg (UK: /ˈaʊɡzbɜːrɡ/ OWGZ-burg,[3] US: /ˈɔːɡz-/ AWGZ-,[4] German: [ˈaʊksbʊʁk] (listen);
Swabian German: Ougschburg) is a city in Swabia, Bavaria, Germany,
around 50 kilometres (31 mi) west of Bavarian capital Munich.
It is a university town and regional seat of the Regierungsbezirk Swabia
with an impressive Altstadt (historical city centre).
Augsburg is an urban district and home to the institutions of the Landkreis Augsburg.
It is the third-largest city in Bavaria (after Munich and Nuremberg), with a population of 300,000
and 885,000 in its metropolitan area.[5]
"""

# Process the text and annotate the named entities.
doc = nlp(input_text)

# Visualize results.
displacy.render(doc, style='ent', jupyter=True, options={'distance': 90})

Among others, spaCy recognizes the following built-in _entity types_ (see [Kaggle](https://www.kaggle.com/code/curiousprogrammer/entity-extraction-and-classification-using-spacy)):

**PERSON** - People, including fictional.

**NORP** - Nationalities or religious or political groups.

**ORG** - Companies, agencies, institutions, etc.

**GPE** - Countries, cities, states.

**LOC** - Non-GPE locations, mountain ranges, bodies of water.

**LANGUAGE** - Any named language.

**DATE** - Absolute or relative dates or periods.

**TIME** - Times smaller than a day.

**QUANTITY** - Measurements, as of weight or distance.

**ORDINAL** - "first", "second", etc.

**CARDINAL** - Numerals that do not fall under another type.

❓[1.5] Which built-in _entity types_ does our spaCy model recognize in the input text? How are the types we have encountered in our text defined? Do all assigned types make sense to you in this case?

In [None]:
# Access text and spans using code
s = "{:<30} {:<4} {:<4} {:<8}"
for ent in doc.ents: # Iterate over the list of NEs that the model has found.
  print(s.format(ent.text, ent.start_char, ent.end_char, ent.label_))

You have just encountered the concept of **stand-off annotation**: Each NE annotation is defined by two character offsets referring to the number of characters counting from the beginning of the document. For example, *Augsburg* in the document above has the `start_char` (starting character) 0 and the `end_char` 8. (As almost always in computer science, we start counting at 0.)

You can most easily remember `start_char` as the number of characters that come before a word (or some annotation) in a document, and `end_char` as the number of characters that occur before the word is completed.


```
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |  ...
  A   u   g   s   b   u   r   g       (    U    K  ...
```

The Pythonic way to remember this is to just access the original text at these slices:

In [None]:
print(input_text[0:8])

❓[1.6] The model that we have loaded here was trained on English data. Feed in some input text of a different language of your choice. What do you observe?

---
## Question Answering

In this section, we will introduce different types of __question answering__ (QA) systems and run an extractive QA model based on the [HuggingFace](https://huggingface.co/) library on parts of the [SQuAD](https://huggingface.co/datasets/squad) dataset.

You have probably played with models like ChatGPT that are based on huge pre-trained language models (_large language models_ (LLMs)). These models are **generative** models, i.e., they _generate_ answers token by token without a direct reference to a source text. Their answers are thus based on the latent knowledge they have learned from large volumes of text, and on the question that is used as a _prompt_ to the model. Essentially, such **Generative QA** models output what their internal statistics consider to be the most likely continuation after the question has been asked. This is also why it is extremely hard to tell whether they are telling the truth or whether they are "hallucinating." (If you're interested, you can read more about this problem [here](https://arxiv.org/abs/2202.03629).)

Another QA task is that of **Extractive QA**, where the task is to retrieve an answer given a source document (_context_). The task is:
* Determine whether the source document provides an answer.
* If so, return the **span** within the source document that corresponds to the answer. The span is returned as two character offsets counting the number of characters from the beginning of the document until the answer starts (`start`), and the number of characters from the beginning of the document until the answer ends (`end`).
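
The span representation can be sketched without any model. The snippet below uses a shortened, hand-written context and builds the answer dictionary by hand (the variable names are illustrative assumptions, not the output of a QA system). It only shows how `start` and `end` relate to the context string.

```python
# Shortened, hand-written context for illustration.
context = "CBS charged an average of $5 million for a 30-second commercial."
answer_text = "$5 million"

# str.find returns the number of characters before the match (or -1 if absent).
start = context.find(answer_text)
end = start + len(answer_text)

# A hand-built stand-in for what an extractive QA system would return.
prediction = {"answer": answer_text, "start": start, "end": end}
print(prediction)

# Slicing the context with the two offsets recovers the answer span.
print(context[prediction["start"]:prediction["end"]])
```

Because the answer is always a slice of the context, an extractive system cannot return text that does not occur in the source document.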

Let's take a look at one example from the [SQuAD](https://huggingface.co/datasets/squad) dataset.

In [None]:
from textwrap import wrap

# The document text is provided as the context for answering the question.
doc_text = "CBS broadcast Super Bowl 50 in the U.S., and charged an average of $5 million for a 30-second commercial during the game. The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively. It was the third-most watched U.S. broadcast ever."
# Let's print this in a more human-friendly way.
print("\n".join(wrap(doc_text, width=100)))

question1 = "Who were special guests for the Super Bowl halftime show?" # Answer: "Beyoncé and Bruno Mars"
question2 = "What was the cost for a half minute ad?" # Answer: "$5 million"
question3 = "Who watched the game on TV?" # Is there an answer in the text?
question4 = "Which teams played against each other in the Super Bowl 50?" # Not answerable based on the text above.

questions = [question1, question2, question3, question4]

In [None]:
# Load a QA model trained on the training part of the SQuAD dataset
from transformers import pipeline
qa_model = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

In [None]:
for question in questions:
  print("\n"+question)
  result = qa_model(question=question, context=doc_text)
  print(result)

❓[1.7] Which answers did the model get right/wrong?

❓[1.8] Does the _confidence score_ (`score`) correspond to the correctness of the answer?

❓[1.9] What do you think went wrong in the cases that the model answered incorrectly?

---

## Part-of-Speech Tagging and Syntactic Parsing

We will use the [spaCy](https://spacy.io/) library to find out which **part-of-speech** tag a word should carry in a sentence, and which **grammatical relations** hold between words in a sentence.

### Part-of-speech Tagging

**Part-of-speech** (POS) tagging refers to the task of assigning each word in a sentence or text its part-of-speech (German: _Wortart_) according to a fixed inventory. An influential inventory of POS tags is the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
The spaCy model below uses the [UPOS tagset](https://universaldependencies.org/u/pos/) developed by the [Universal Dependencies project](https://universaldependencies.org/).

Before neural models became the prevalent way of solving NLP tasks, POS tags were an important source of information for models based on "traditional" features. Today, analysing them can be an _explainable_ way of analysing a dataset or the results produced by a model. For example, we can answer questions such as "Are there many long noun phrases in the data? How does the model perform on them?"
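
As a rough illustration of such an analysis, the sketch below counts tags in a small hand-tagged sentence. The (word, tag) pairs are written by hand for this example; in practice they would come from a tagger.

```python
from collections import Counter

# Hand-written (word, UPOS tag) pairs, not produced by a model.
tagged_tokens = [
    ("The", "DET"), ("small", "ADJ"), ("dog", "NOUN"), ("chased", "VERB"),
    ("the", "DET"), ("red", "ADJ"), ("ball", "NOUN"), (".", "PUNCT"),
]

# Count how often each POS tag occurs.
pos_counts = Counter(tag for _, tag in tagged_tokens)

# Share of nouns among all tokens, a simple corpus statistic.
noun_share = pos_counts["NOUN"] / len(tagged_tokens)

print(pos_counts.most_common())
print(f"Share of nouns: {noun_share:.2f}")
```

Statistics like these can be compared across datasets, or between the instances a model classifies correctly and those it gets wrong.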

In [None]:
# Imports
import spacy
from spacy import displacy

# Load the model that we will use for predictions
nlp = spacy.load("en_core_web_sm", disable=["parser"])

input_sentences = ["Susan can run 1km in under 3 minutes.", "This could open a can of worms."]

# Process the input sentences and print the POS tag of each word
for input_sentence in input_sentences:
  print("\n" + input_sentence)
  doc = nlp(input_sentence)
  for word in doc:
    print("{:<12} {:<4}".format(word.text, word.pos_))

❓[1.10] Which of the above POS tags can you define easily? Look up the UPOS tags that you cannot (yet) make sense of on the [Universal Dependencies page](https://universaldependencies.org/u/pos/).

❓[1.11] The word "can" occurs in both sentences. Which POS tag does it carry in each case? Hopefully, this case convinces you that POS tagging is not completely straightforward, as there is no one-to-one correspondence between words and POS tags. Check out the currently reported best scores on POS tagging benchmarks on [this website](http://nlpprogress.com/english/part-of-speech_tagging.html) - is POS tagging a solved task? In which cases are improvements still possible?

### Syntactic Parsing

**Syntactic parsing** aims to identify grammatical relations between the words of a sentence. This is often very useful to describe grammatical rules, e.g., when teaching a language or when linguists analyse or compare languages.
The [Universal Dependencies project](https://universaldependencies.org/) collects **treebanks**, i.e., text corpora that are annotated with syntactic structures, for many languages of the world. Before neural models became prevalent in state-of-the-art NLP, syntactic features were a main source of information for automatic classifiers in many NLP tasks.

In [None]:
# Load the model including the parser
nlp = spacy.load("en_core_web_sm")

input_sentence = "Joe gave Mary the book."
# Try these as well
#input_sentence = "The cake has been baked by Mary."
#input_sentence = "He saw the man with the telescope."

# Process sentence
doc = nlp(input_sentence)
# Visualize results.
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

Above, you should now be able to see the **dependency tree** representing the syntactic structure of the sentence.
* "gave" is the _main verb_ of the sentence, and the dependency tree's _root node_.
* "Joe" is the _nominal subject_ (nsubj) of "gave": a _subject_ answers the question "**who** did something?"; _nominal_ means that the subject is a noun.*
* "Mary" is the _dative object_, usually called _indirect object_ (iobj).
* "book" is the _direct object_ (dobj) of "gave", and "the" is its _determiner_ (det).

*If you are wondering what else a subject could be:
* "_That she won_ upset him." - The italic part is a _clausal subject_ (csubj).
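
A dependency tree can also be stored as a plain list in which every word points to its head. The sketch below encodes the analysis of "Joe gave Mary the book." described above by hand (punctuation omitted). The tuple layout and the convention that the root points to itself are assumptions made for this example.

```python
# Each entry: (index, word, index of head, relation label).
# Hand-written analysis; the root points to itself by convention.
tokens = [
    (0, "Joe", 1, "nsubj"),
    (1, "gave", 1, "ROOT"),
    (2, "Mary", 1, "iobj"),
    (3, "the", 4, "det"),
    (4, "book", 1, "dobj"),
]

def children(head_index):
    """Return the words that directly depend on the word at head_index."""
    return [word for idx, word, head, _ in tokens
            if head == head_index and idx != head_index]

# The root is the only word labelled ROOT.
root = next(word for _, word, _, rel in tokens if rel == "ROOT")

print(root, "->", children(1))
print("book ->", children(4))
```

Every word except the root has exactly one head, which is what makes the structure a tree.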

❓[1.12] Try to parse another sentence using the code above. Try to understand the syntactic structure output by the parser.

## Grammar Correction

We will use a [grammar correction model](https://huggingface.co/vennify/t5-base-grammar-correction) based on a language model called T5, which has been trained on a grammar correction dataset that consists of mistakes and their corrections.

In [None]:
# Install the dependencies
!pip install happytransformer
from happytransformer import HappyTextToText, TTSettings


In [None]:
happy_tt = HappyTextToText("T5", "vennify/t5-base-grammar-correction")

args = TTSettings(num_beams=5, min_length=1)

# Add the prefix "grammar: " before each input
result = happy_tt.generate_text("grammar: This sentences has has bads grammar.", args=args)

print(result.text)

In [None]:
inputs = [
    "The apple was in table.",
    "The apple was in tabel.",
    "Dear Sir or Madam, I would to request a dedline extinsion for submit thesis.",
    "I couldnt finish it yesterday because dog ate homework."
]
# Use the name `sentence` to avoid shadowing Python's built-in input() function
for sentence in inputs:
  result = happy_tt.generate_text("grammar: " + sentence, args=args)
  print(result.text)

❓[1.13] Try out a few sentences of your choice. Does the model work for languages other than English?

❓[1.14] Which types of errors were made in the sentences above? Which of them did the model fix?

The model we used above has been trained on the `validation` split of the [JFLEG dataset](https://huggingface.co/datasets/jfleg) as described in [this blogpost](https://www.vennify.ai/fine-tune-grammar-correction/).
Let's load the `test` split and run the model on it.

In [None]:
# Let's load the test split of the dataset on which the model has been trained
!pip install datasets
from transformers import pipeline
from datasets import load_dataset
jfleg_dataset = load_dataset("jfleg", split="test")

In [None]:
# Let's print some instances of the dataset, which consist of an input
# sentence with potentially wrong grammar, and 4 possible corrections.
for i in [0,4,8]:
  sentence = jfleg_dataset["sentence"][i]
  print("\n", sentence)
  print("\t" + "-"*90)
  result = happy_tt.generate_text("grammar: " + sentence, args=args)
  for correction in jfleg_dataset["corrections"][i]:
    print("\tGold correction:     ", correction)
  print("\t" + "-"*90)
  print("\tAutomatic correction:", result.text)

❓[1.15] Is a valid correction produced for each input sentence?

To see commercial grammar correction software in action, check out [Grammarly](https://www.grammarly.com).

---

## Other NLP Tasks

This section serves as a reference for several other NLP tasks (not part of the in-class activity).

### Automatic Summarization

**Single-document summarization** refers to the process of condensing a longer piece of text into a shorter version while retaining its main ideas and key points. In **multi-document summarization**, the task is to summarize a collection of documents.

There are two main approaches to summarization: extractive and abstractive.

**Extractive Summarization** involves selecting and combining the most important sentences or phrases from the original text to form a summary. It relies on identifying relevant sentences that already exist in the source text. Some problems associated with extractive summarization include:

1. Redundancy: Extractive methods may include multiple similar or redundant sentences in the summary, leading to unnecessary repetition.
2. Incoherence: Extracted sentences may not always fit together coherently, resulting in a summary that lacks overall coherence and cohesion.
3. Lack of Paraphrasing: Extractive techniques do not involve paraphrasing or rewording the sentences. This can limit the expressiveness and novelty of the summary, making it a verbatim reproduction of the source text.
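
To make the idea concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is a toy heuristic assumed for illustration (sentences are scored by the average document frequency of their words), not a state-of-the-art method, and the example text is invented.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence splitting at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word frequencies over the whole document.
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]

text = ("Augsburg is a city in Bavaria. "
        "The city is a university town. "
        "It rains sometimes.")
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Note that every sentence in the output is copied verbatim from the input, which is exactly the "lack of paraphrasing" limitation listed above.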

**Abstractive summarization** involves generating new sentences that capture the essence of the original text, rather than directly selecting sentences. It aims to create summaries that may not be present in the source text but convey the same meaning. Some challenges with abstractive summarization are:

1. Unfaithful Output: Abstractive summarization models may occasionally generate summaries that include incorrect or misleading information, deviating from the facts or intent of the original text.
2. Coherence and Consistency: Abstractive methods often struggle with maintaining coherent and consistent summaries, as they need to generate novel sentences while ensuring they align with the overall context and tone of the original text.
3. Overgeneralization or Undergeneralization: Abstractive methods may sometimes produce summaries that overgeneralize or undergeneralize the information, leading to misleading or incomplete representations of the source text.

Both extractive and abstractive summarization methods have their own strengths and limitations, and ongoing research aims to address these challenges to improve the quality and reliability of automatic summarization systems, e.g., by combining the approaches above.

Research also addresses **user-focused summarization**, where the aim is to generate summaries that cater specifically to the needs and preferences of individual users. The process involves analyzing the user's interests, preferences, and context to create personalized summaries that are highly relevant and useful to them.

### Intent Classification

**Intent classification** is a text classification task within _conversational AI_ that involves determining the underlying intent or purpose behind a given text or user query. It plays a crucial role in various applications, such as chatbots, virtual assistants, and customer support systems, by enabling accurate understanding and appropriate response generation.

Intent classification typically involves training a machine learning or deep learning model on labeled data, where each input is associated with a specific intent category. For example, a question like "How can I return a product I purchased?" is mapped to the _intent_ label `Return_Request`, which then triggers a specific response such as "You can find information on returning a product on this website: [URL]".
The model learns to recognize patterns and features within the text to predict the intent of unseen inputs. The predicted intent can then be used to trigger relevant actions or responses.
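
The pipeline from labeled data to a triggered response can be sketched with a very simple word-count classifier. This is a deliberately naive stand-in for a trained machine learning model. The training queries, the second intent label `Delivery_Status`, and the response texts are invented for this example.

```python
from collections import Counter, defaultdict

# Invented training data: (user query, intent label).
training_data = [
    ("How can I return a product I purchased?", "Return_Request"),
    ("I want to send my order back", "Return_Request"),
    ("Where is my package?", "Delivery_Status"),
    ("When will my order arrive?", "Delivery_Status"),
]

def tokenize(text):
    """Lowercase and strip simple punctuation."""
    return [w.strip("?.,!").lower() for w in text.split()]

# "Training": count how often each word occurs with each intent.
word_counts = defaultdict(Counter)
for query, intent in training_data:
    word_counts[intent].update(tokenize(query))

def predict_intent(query):
    """Pick the intent whose training words overlap most with the query."""
    scores = {intent: sum(counts[w] for w in tokenize(query))
              for intent, counts in word_counts.items()}
    return max(scores, key=scores.get)

# Hypothetical responses triggered by each intent.
responses = {
    "Return_Request": "You can find information on returning a product on this website: [URL]",
    "Delivery_Status": "You can track your order here: [URL]",
}

predicted = predict_intent("Can I return this product?")
print(predicted, "->", responses[predicted])
```

A real system would replace the word counts with a learned model, but the overall flow (text in, intent label out, label mapped to an action) stays the same.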



### Natural Language Inference (NLI)

**Natural Language Inference** (NLI), also known as **Recognizing Textual Entailment** (RTE), is a task in NLP that involves determining the relationship between two given sentences: a premise and a hypothesis.

For example, consider the following premise and hypothesis:

Premise: "The cat is sitting on the mat."<br/>
Hypothesis: "The mat is empty."

In this example, the task of NLI is to determine the relationship between the premise and hypothesis, which is that the hypothesis contradicts the premise. The NLI system would classify this relationship as `contradiction`.

The possible relationship categories in NLI are typically `entailment` (the hypothesis can be inferred from the premise), `contradiction` (the hypothesis contradicts the premise), or `neutral` (there is no clear relationship between the premise and hypothesis).

Some people question whether this is a "real" NLP task, but there has been a lot of research on this topic in recent years, as it provides an interesting testbed for the reasoning capabilities of large language models. (Though some datasets have also been criticized for containing biases that the models simply pick up, but this is just a side note.)

### Information Extraction

Above, we have already learned about one **information extraction** task: named entity recognition. Once we have identified the entities that occur in a sentence and assigned them a coarse-grained entity type, we can next perform **entity linking** (also called **named entity disambiguation**), which finds, for each entity mention (i.e., an entity mentioned in a sentence), the corresponding real-world entity in a given inventory. For example, such an inventory can be a _knowledge graph_ such as [WikiData](https://www.wikidata.org/).

**Relation extraction** is another information extraction task that involves identifying and extracting semantic relationships between entities mentioned in a text. It aims to determine the nature of the relationship between two entities, such as "works-for," "married-to," or "located-in." Consider the following sentence: "Barack Obama was born in Honolulu, Hawaii."
In this example, the entities mentioned are "Barack Obama" and "Honolulu, Hawaii." The relationship between these entities is that Barack Obama was born in Honolulu, Hawaii. Relation extraction aims to automatically identify and extract this relationship from the sentence.
The extracted relation could be represented as follows: `(Barack Obama, born-in, Honolulu, Hawaii)`.

### Semantic Parsing

**Semantic Parsing** involves mapping natural language expressions to formal representations of their meaning. It aims to extract the underlying semantics or structured information from text and represent it in a machine-readable format.

For example, consider the following natural language sentence and its corresponding semantic representation:

Sentence: "John likes to play the guitar."

Semantic Representation: `likes_to_play(John, guitar)`

In this example, the semantic parsing system analyzes the sentence and maps it to a structured representation using subject-predicate-object format. The semantic representation captures the subject (John), the predicate (likes to play), and the object (guitar) of the sentence.

Semantic parsing enables machines to understand the relationships between entities in a sentence and extract the intended meaning. It has applications in various domains such as question answering, information retrieval, and knowledge graph construction.
For example, we could use this method to fill a big knowledge graph recording who plays which instrument. We can then find all people who play a particular instrument using a query like `likes_to_play(*, guitar)`.
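
Such a query can be sketched with a plain list of facts. The facts below are invented, and the `query` helper with its `"*"` wildcard is a hypothetical illustration, not the interface of any particular knowledge graph system.

```python
# Invented facts in (predicate, subject, object) form.
facts = [
    ("likes_to_play", "John", "guitar"),
    ("likes_to_play", "Mary", "piano"),
    ("likes_to_play", "Ayse", "guitar"),
]

def query(predicate, subject="*", obj="*"):
    """Return all facts matching the pattern; "*" matches any value."""
    return [
        (p, s, o) for p, s, o in facts
        if p == predicate and subject in ("*", s) and obj in ("*", o)
    ]

# Corresponds to the query likes_to_play(*, guitar).
guitar_players = [s for _, s, _ in query("likes_to_play", "*", "guitar")]
print(guitar_players)
```

The semantic parser's job is to produce the structured facts. Once they exist, answering the question is a simple lookup.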

A prominent implementation of an open information extraction system that performs semantic parsing is [OpenIE](https://nlp.stanford.edu/software/openie.html). Other semantic parsing approaches can, for example, parse directly to semantic representations such as SQL, [AMR](https://amr.isi.edu/), or [FrameNet](https://framenet.icsi.berkeley.edu/fndrupal/).



## Further Notes, Links & References

Congratulations, you made it through a very long list of NLP tasks! Hopefully, you're now very motivated to learn about the methods, datasets, and tools the NLP community has come up with to solve them. You will also learn how to apply the methods and ideas to new problems and datasets, and you will learn about the challenges and limitations we face.

The [NLPProgress](http://nlpprogress.com/) website maintains a list of NLP tasks together with datasets and the state-of-the-art scores achieved on datasets for each task.

## Homework

The questions in this section are intended to get you started for some further self-study on the topic of this session. Search the web (but do not trust ChatGPT, please), read as much as you want, take notes, discuss with your fellow students!

❗[1.16] Work through the code of the NLP tasks that you have not yet studied in detail and answer the related questions in Digicampus/Vips (graded self-test).

❓ [1.17] Which definitions of **Natural Language Processing** can you find on the web / in the literature? Collect at least 3 definitions. Which sounds best to you?

❓[1.18] What is the difference between **Natural Language Processing** vs. **Computational Linguistics** vs. **Speech Processing**?