# Exercises 3 (solution)

This notebook is inspired by Chapter 1 of the book **"Natural Language Processing with Transformers: Building Language Applications with Hugging Face"** by Tunstall, von Werra, and Wolf.

You will gain first experience with pretrained models for specific tasks, using the *pipeline* API from the Hugging Face Transformers library, which lets you run inference at a very high level of abstraction.

In each exercise, you will first define a *pipeline* for a specific task, using a pretrained model from the Hugging Face models library, and then apply the pipeline to text snippets from the course webpage.

## Resources:

- [pipeline documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)
- [Hugging Face models library](https://huggingface.co/models)

## The input text

In [None]:
from transformers import pipeline
import pandas as pd
pd.set_option('display.max_rows', 100)

text = """Last year has shown multiple breakthroughs in deep learning, bringing large language
models to the mainstream. OpenAI's ChatGPT, Microsoft's new Bing Search and GitHub
Copilot, and Deep Mind's AlphaCode are the most prominent. While they still have many
flaws, they show a potential to transform many sectors of the economy, replace some
workers and make other vastly more productive.

NLP also has an immense potential to change research in economics. Most economists use
small and expensive structured datasets. NLP offers a way to work with novel data sources
that often can be scraped for free from the web. Examples are classifying speeches along
the political spectrum, classifying tweets to measure opinions, extracting concepts
mentioned in free-form survey replies, or translating questionnaires or datasets into
different languages.

This class is an introduction to deep learning and NLP for economists. Starting from
zero, the first half of the course focuses on learning the practical skills needed to
incorporate NLP into empirical workflows. We will use Huggingface's transformers library
and only work with pre-trained models for this. The second half of the class zooms in
and focuses on understanding what language models are, how they differ, and how they are
trained. We will write some purely didactical code in numpy and implement a few simple
models in PyTorch. The main focus of the second half is to build enough understanding to
work effectively with pre-trained models. It is beyond our scope and computational
resources to actually train large models."""

paragraphs = text.split("\n\n")

## Task 1 - Text classification

1. Define a classifier pipeline using the `"distilbert-base-uncased-finetuned-sst-2-english"` model.

2. Apply the pipeline to the entire text.

3. Apply the pipeline separately to each paragraph of the input text to extract sentiments.

4. Convert the output of the previous step to a pandas DataFrame.

In [None]:
classifier = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)

In [None]:
classifier(text)

In [None]:
sentiments = classifier(paragraphs)
sentiments

In [None]:
pd.DataFrame(sentiments)
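
The classifier returns one dictionary per input, each with a `label` and a `score`, so the DataFrame on its own does not show which paragraph a row belongs to. Below is a minimal sketch of pairing each input with its prediction, assuming that output shape. `attach_inputs` is a hypothetical helper, and the inputs and predictions are made-up placeholders, not real model output.

```python
def attach_inputs(inputs, predictions, preview_chars=40):
    """Pair each input text with its prediction dict (hypothetical helper)."""
    if len(inputs) != len(predictions):
        raise ValueError("inputs and predictions must have the same length")
    rows = []
    for text_in, pred in zip(inputs, predictions):
        rows.append(
            {
                # Keep only the beginning of the text so the table stays readable.
                "preview": text_in[:preview_chars],
                "label": pred["label"],
                "score": round(pred["score"], 3),
            }
        )
    return rows


# Placeholder values for illustration only, not real model output.
example_inputs = ["First paragraph ...", "Second paragraph ..."]
example_predictions = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.6},
]
rows = attach_inputs(example_inputs, example_predictions)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would then give one table row per paragraph, with a short preview next to the predicted label.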

## Task 2 - Named entity recognition

1. Define a named entity recognition (NER) pipeline using the `"dslim/bert-base-NER-uncased"` model.

2. Apply the pipeline to `text` and convert the result to a DataFrame.

In [None]:
ner_tagger = pipeline(
    "ner", 
    model="dslim/bert-base-NER-uncased",
)
pd.DataFrame(ner_tagger(text))

## Task 3 - Aggregation strategies

Use the same model as before, but try out different aggregation strategies. Which one gives you the best results?

**Note**: Below we show a solution where you apply all strategies in a loop and combine the result in one DataFrame. Other solutions are completely ok. 

In [None]:
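ner_results = []
# Build one tagger per aggregation strategy and collect the results.
for strategy in ["none", "simple", "first", "average", "max"]:
    tagger = pipeline(
        "ner",
        model="dslim/bert-base-NER-uncased",
        aggregation_strategy=strategy,
    )
    df = pd.DataFrame(tagger(text))
    # Remember which strategy produced these rows.
    df["strategy"] = strategy
    ner_results.append(df)

# Stack all results; columns that a strategy does not return are filled with NaN.
pd.concat(ner_results).set_index("strategy")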
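
With `aggregation_strategy="none"`, the tagger returns one prediction per token, labeled in a B-/I- scheme (for example `B-ORG` for the first token of an organization and `I-ORG` for its continuation). The other strategies merge tokens into entity spans. The following pure-Python sketch illustrates the merging idea only: it assumes that an `I-` token directly following a token of the same entity type belongs to the same span, and it averages the token scores. It is a simplification, not the library's implementation. `group_entities` is a hypothetical helper and the token predictions are made up.

```python
def group_entities(token_predictions):
    """Merge B-/I- tagged tokens into entity spans (simplified sketch)."""
    groups = []
    for tok in token_predictions:
        # "B-ORG" -> prefix "B", entity_type "ORG"
        prefix, _, entity_type = tok["entity"].partition("-")
        last = groups[-1] if groups else None
        continues = (
            last is not None
            and prefix == "I"
            and entity_type == last["entity_group"]
            and tok["index"] == last["last_index"] + 1  # directly adjacent token
        )
        if continues:
            last["words"].append(tok["word"])
            last["scores"].append(tok["score"])
            last["last_index"] = tok["index"]
        else:
            groups.append(
                {
                    "entity_group": entity_type,
                    "words": [tok["word"]],
                    "scores": [tok["score"]],
                    "last_index": tok["index"],
                }
            )

    merged = []
    for group in groups:
        # Glue word pieces such as "open" + "##ai" back together.
        word = " ".join(group["words"]).replace(" ##", "")
        mean_score = sum(group["scores"]) / len(group["scores"])
        merged.append(
            {
                "entity_group": group["entity_group"],
                "word": word,
                "score": round(mean_score, 3),
            }
        )
    return merged


# Made-up token predictions for illustration, not real model output.
example_tokens = [
    {"entity": "B-ORG", "score": 0.98, "index": 14, "word": "open"},
    {"entity": "I-ORG", "score": 0.90, "index": 15, "word": "##ai"},
    {"entity": "B-ORG", "score": 0.99, "index": 20, "word": "microsoft"},
    {"entity": "B-ORG", "score": 0.95, "index": 27, "word": "deep"},
    {"entity": "I-ORG", "score": 0.85, "index": 28, "word": "mind"},
]
entities = group_entities(example_tokens)
for entity in entities:
    print(entity)
```

Comparing this with the rows for `"none"` in the table above helps to see why the aggregated strategies return fewer and more readable entities.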

## Task 4 - Clearing the cache

Use `huggingface-cli` to clear the model cache.
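
Depending on the installed version of `huggingface_hub`, `huggingface-cli scan-cache` lists the cached repositories and `huggingface-cli delete-cache` lets you select revisions to delete. To check from within Python what is taking up space, here is a minimal sketch that sums file sizes per cached model. It assumes the default cache location `~/.cache/huggingface/hub` and the `models--<org>--<name>` folder layout; neither may hold if an environment variable such as `HF_HOME` is set. `cache_usage` is a hypothetical helper.

```python
from pathlib import Path


def cache_usage(cache_dir):
    """Return {repo_folder_name: size_in_bytes} for a Hugging Face cache dir."""
    cache_dir = Path(cache_dir)
    usage = {}
    if not cache_dir.is_dir():
        return usage
    for repo_dir in sorted(cache_dir.iterdir()):
        if not repo_dir.is_dir() or not repo_dir.name.startswith("models--"):
            continue
        # Skip symlinks so that files linked from several places are counted once.
        usage[repo_dir.name] = sum(
            f.stat().st_size
            for f in repo_dir.rglob("*")
            if f.is_file() and not f.is_symlink()
        )
    return usage


# Assumed default location; adjust if your cache lives elsewhere.
default_cache = Path.home() / ".cache" / "huggingface" / "hub"
for name, size in cache_usage(default_cache).items():
    print(f"{name}: {size / 1e6:.1f} MB")
```

Running this before and after clearing the cache shows whether the models downloaded in the previous tasks are gone.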

## Task 5 - Question answering

1. Define a question answering pipeline using the `"deepset/roberta-base-squad2"` model.

2. Come up with a few questions one might ask about the course logistics. 

3. Apply the pipeline to get answers to your questions.

**Do not trust any answer without double-checking it!**

In [None]:
logistics_text = """Logistics

The following page contains everything you need to know to take the class for credit. Please read it carefully before approaching us with questions.

Communication

All communication related to the course should be via zulip.

If you have not signed up for the `bonn-econ-teaching` zulip workspace yet, use this [link](https://bonn-econ-teaching.zulipchat.com/join/wuyru5foek3s3vb6tdkjubwc/) to do so. You will then automatically be subscribed to the `dl-intro` stream.

If you are already signed up from a previous course, please subscribe to the `dl-intro` stream.

Questions that do not contain any private information or sensitive data should be asked in a public stream such that everyone benefits from the answer.

We will announce relevant information about the lectures and course logistics on zulip. If you do not sign up or check your messages, you might miss something important!

Office hours

If you have a question or problem that is better discussed in person, we can do so
right after the lecture.


Distribution of materials

All materials can be downloaded from the [course webpage](https://dl-intro.readthedocs.io).

Grading

The grade is determined entirely by a final project. The final project consists of a practical application of deep learning methods (you can choose the topic) and  written answers to questions that we will distribute in due time.

In the written answers you can achieve up to 25 points. In the application you can achieve up to 75 points. The points are added and translate into grades as follows:

| Grade | Minimum number of points |
| ----- | ------------------------ |
| 1.0   | 98                       |
| 1.3   | 93                       |
| 1.7   | 87                       |
| 2.0   | 82                       |
| 2.3   | 77                       |
| 2.7   | 71                       |
| 3.0   | 66                       |
| 3.3   | 61                       |
| 3.7   | 55                       |
| 4.0   | 50                       |

How do we grade the application

The grade will be a holistic assessment of your project. We look at the following criteria:

- Is there a `README.md` that tells me how the repository is structured and how I can run the project?
- Does the code run?
- Is the code readable and well documented?
- How challenging is the task?
- How complete and correct is the solution?

A final project will be about as much work as two or three of the exercise notebooks you will see in class.

How do we grade the written answers

We are looking for precise answers to the questions within the specified word limits. Answers that are too long will not get full points. Answers that contain the correct answer but also wrong or redundant information will not achieve full points.

How will the project be submitted

Final projects are submitted in a GitHub repository created by GitHub classroom. Follow this [invitation](https://classroom.github.com/a/R1vgPUT1) to create the repository which you can then clone to your computer.

If you do not know yet what git is and how it works, don't worry. You will learn this during the lecture.


How should the code be structured

You can choose whether you want to work in jupyter notebooks, `.py` files or a mix thereof. You are also free to use a workflow system such as [pytask](https://github.com/pytask-dev/pytask) or [kedro](https://github.com/kedro-org/kedro). We think that in practice you should always use such a workflow system, but we do not have the time to teach them in this class.

The only strict requirement is that there is a simple way to run your entire project and produce all results you need, either by running one notebook (that potentially imports functions from many files) or by running one command (such as pytask) from the command line.

If there is no simple way of producing all results or it is not documented in the README, we will deduct points.

Where to put the written answers

The written answers to the questions have to be in the `README.md` file of your repository.

Which libraries can you use

You can use any python library you want. If you need packages that are not installed in the course environment, your project needs to contain an `environment.yml` file with all packages you use.

If you use libraries that are not part of the course environment and do not provide an environment file, we will deduct points.


Deadlines

- Between April 3 and April 10 you must register for the class via basis. Otherwise, you cannot take the class for credit! It happens to multiple students every year and the examination office does not make exceptions. Please also look at the [official dates](https://www.vwlpamt.uni-bonn.de/pruefungsamt-en/pdf/summer-semester-2023/time-table-summer-semester-2023) from the examination office for details.
- The topic of of your final project needs to be submitted until Tuesday, July 11 by writing it down at the top of the README.md file of your repository. This is to make sure that you have a topic and to make sure that you have created your repository and know how to push materials to it.
- The deadline for the final project is Sunday, September 10, 23:00. The projects we download via GitHub classroom will not contain any changes you push after the deadline.

Tentative Lecture Plan (this will change!)

- Lecture 1: Overview, logistics and installation
- Lecture 2: Python, Jupyter, Git and Markdown basics
- Lecture 3: Intro to huggingface ecosystem and different NLP tasks
- Lecture 4: Classification with sklearn
- Lecture 5: Tokenization
- Lecture 6: Text classification via feature extraction
- Lecture 7: Text classification via fine-tuning
- Lecture 8: Feedforward neural networks from scratch
- Lecture 9: (Pre-)training neural networks
- Lecture 10: RNNs, attention and transformers
- Lecture 11: Model architectures and loss functions
- Lecture 12: Final Projects tips / Bonus lecture
"""

In [None]:
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

questions = [
    "What is the deadline for the final project?",
    "Which python library can I use for the final project?",
    "How is the grade determined?",
    "What is the topic of lecture 3?",
    "How should we contact you?",
]
answers = pd.DataFrame(reader(question=questions, context=logistics_text))
answers["question"] = questions

answers
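
Each answer comes with `start` and `end` character offsets into the context, which makes double-checking easier: you can print the surrounding text and read the answer in its context. Below is a minimal sketch, assuming the result dictionaries have the keys `answer`, `start`, and `end`. `show_in_context` is a hypothetical helper, and the result is constructed by hand rather than produced by the model.

```python
def show_in_context(context, result, window=60):
    """Return the answer with `window` characters of context on each side."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    # Square brackets mark the span the model selected as the answer.
    return f"...{left}[{context[start:end]}]{right}..."


example_context = (
    "The deadline for the final project is Sunday, September 10, 23:00. "
    "The grade is determined entirely by a final project."
)
# Hand-made result for illustration; offsets are computed, not predicted.
answer_text = "Sunday, September 10, 23:00"
answer_start = example_context.find(answer_text)
hand_made_result = {
    "answer": answer_text,
    "start": answer_start,
    "end": answer_start + len(answer_text),
}
snippet = show_in_context(example_context, hand_made_result, window=20)
print(snippet)
```

Applied to the rows of `answers` with `logistics_text` as context, this lets you read each answer next to the sentence it was taken from.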

## Task 6 - Summarization

1. Define a text summarization pipeline using the `"sshleifer/distilbart-cnn-6-6"` model.
2. Apply the pipeline to `text`.
3. Play around with the keyword arguments `min_length` and `max_length` until you get something you like.

In [None]:
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6")

summarizer(
    text, 
    min_length=100,
    max_length=200,
    clean_up_tokenization_spaces=True,
)

## Task 7 - Summarization by paragraph

1. Apply the pipeline from the previous task to each paragraph of `text`.
2. Play around with `min_length` and `max_length` until you are satisfied with the result.
3. Combine the results back into one string.

In [None]:
summaries = summarizer(
    paragraphs, 
    min_length=40,
    max_length=60, 
    clean_up_tokenization_spaces=True
)
texts = [entry["summary_text"] for entry in summaries]
print("\n\n".join(texts))

## Task 8 - Translation

1. Go to the [Hugging Face models library](https://huggingface.co/models) and search for a model that can translate text from English to your favorite language.
2. Define a pipeline to do the translation.
3. Apply the pipeline to the example input text. The solution below translates it to German.

In [None]:
# The model must match the language pair of the task: opus-mt-en-de translates English to German.
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

outputs = translator(text, clean_up_tokenization_spaces=True)
outputs[0]["translation_text"]