Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and email below:

In [None]:
# Full name
NAME = ""
# Institutional email (hm.edu or hmtm.de)
EMAIL = ""

---

# Day 4 - Visualizing painter biographies

## 4.0 - Getting Started

### Introduction

The fourth day of this class will show you:

- [Hugging Face](https://huggingface.co/), a platform for finding and working with different machine learning models.
- How to visualize the similarity between painters.

Please download the code and data from the [GitHub repository](https://github.com/aica-wavelab/aica-assignments) and follow the instructions in the `A4_painter_semantic_distance` notebook.

[GitHub repository of the course](https://github.com/aica-wavelab/aica-assignments)

### Content of the repository

- `data`: A folder containing the summary information for artists gathered from Wikipedia.
- `A4_painter_semantic_distance.ipynb`: This notebook, where we will do the analysis and visualization work.

### Assignment

Today's task is to find a way to cluster and visualize painters based on the summaries of their Wikipedia pages.

@@REVIEW@@

### Installation required

Make sure you have the following packages installed for today.

In [None]:
# Run this cell to install the required packages
!pip install pandas
!pip install numpy
!pip install matplotlib 
!pip install seaborn
!pip install sentence-transformers

---

## 4.1 - The dataset

The dataset is something I put together using the `wikipedia-api` package (linked [here](https://pypi.org/project/Wikipedia-API/)). It's a collection of the summaries from painter pages on Wikipedia. The painter pages come from Wikipedia's [List of painters by name](https://en.wikipedia.org/wiki/List_of_painters_by_name). While it has a lot of painters, it's important to note that it does not cover _all_ the painters who have pages on Wikipedia.

The dataset is divided into two sections:

- The main file is `painter_summaries_all.csv`; it has data on all 3900+ painters listed in the Wikipedia article. One painter who appears in the partial files has been removed from this dataset, and the IDs have not been changed.
- There are also 6 files in the `partial` directory with the format `painter_summaries_part##.csv`. These files have the data split into smaller chunks based on how the data was gathered.
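If you ever want to rebuild one table from the chunks, one possible approach is to read every partial file and concatenate the results. This is only a sketch: the helper name `load_partial_summaries` and the folder path `data/partial` are assumptions, not part of the course code, and the combined table may differ from `painter_summaries_all.csv` by the one removed painter mentioned above.

```python
from pathlib import Path

import pandas as pd


def load_partial_summaries(folder):
    """Read every painter_summaries_part##.csv in `folder` into one dataframe.

    Hypothetical helper for illustration; it is not part of the course code.
    """
    # Sort the paths so the chunks are concatenated in a stable order.
    paths = sorted(Path(folder).glob("painter_summaries_part*.csv"))
    if not paths:
        raise FileNotFoundError(f"No partial files found in {folder}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined table a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Example call, assuming the partial files live in data/partial:
# partial_df = load_partial_summaries("data/partial")
```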

### Inspecting the data

Open the `painter_summaries_all.csv` file in a spreadsheet program (Excel, Numbers, Sheets, etc.) and take a look at the data.

<div class="alert alert-info">
<b>Instruction:</b> What are the columns in this dataset? What do they each contain?
</div>

**@@ YOUR ANSWER HERE @@**

### Loading the data

Let's load the complete dataset and inspect it using pandas.

In [None]:
import pandas as pd

painter_summaries_df = pd.read_csv("data/painter_summaries_all.csv")

painter_summaries_df.head(5)

<div class="alert alert-info">
<b>Instruction:</b> How many painters are there in the dataset? Are there any duplicates?
</div>

In [None]:
# YOUR CODE HERE
painter_summaries_df["painter_name"].value_counts()

### Cleaning the dataset 
<div class="alert alert-info">
<b>Instruction:</b> Create a new dataframe <strong>painter_summaries_clean</strong> that does not have duplicates based on the <em>painter_name</em> column.
</div>

In [None]:
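# YOUR CODE HERE
# raise NotImplementedError()
# .copy() makes the result an independent dataframe, so adding columns to it
# later does not trigger a SettingWithCopyWarning in some pandas versions.
painter_summaries_clean = painter_summaries_df.drop_duplicates(subset="painter_name", keep="first").copy()
painter_summaries_clean["painter_name"].value_counts()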

Now that the dataset is duplicate-free, we can start working with it for our analysis.
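As a quick sanity check, the effect of `drop_duplicates` can be illustrated on a toy table. The names and summaries below are made up for illustration; the real table has more rows and columns.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy_df = pd.DataFrame(
    {
        "painter_name": ["Raphael", "Titian", "Raphael"],
        "summary": ["first entry", "only entry", "second entry"],
    }
)

# duplicated() marks every repeat of a name after its first appearance.
n_duplicates = toy_df["painter_name"].duplicated().sum()

# keep="first" retains the first row for each name and drops later repeats.
toy_clean = toy_df.drop_duplicates(subset="painter_name", keep="first")

print(n_duplicates)                         # number of repeated rows
print(toy_clean["painter_name"].is_unique)  # True when no name repeats
```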

If you look at the data file in a spreadsheet program, you will notice that the summaries are of various lengths. Let's keep track of that somehow because we may want to filter later on.

In [None]:
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())

<div class="alert alert-info">
<b>Instruction:</b> Create a new column <em>summary_length</em> using the count_words() function.
</div>


In [None]:
painter_summaries_clean["summary_length"] = painter_summaries_clean["summary"].apply(count_words)

painter_summaries_clean.head(10)
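To see what this kind of word count actually measures, here is a small self-contained illustration on made-up strings. `str.split()` with no arguments splits on any run of whitespace, so punctuation stays attached to words and repeated spaces do not create empty words. The vectorized `.str.split().str.len()` form is shown as an optional alternative to `apply`, not as the course's method.

```python
import pandas as pd


def count_words(text):
    # Same idea as in the notebook: count whitespace-separated tokens.
    return len(text.split())


# Made-up summaries; note the double space in the first one.
toy = pd.DataFrame({"summary": ["Dutch  painter, born 1632.", "Painter."]})

# Row-wise version, as used in the notebook.
toy["summary_length"] = toy["summary"].apply(count_words)

# Vectorized alternative that yields the same counts.
toy["summary_length_vectorized"] = toy["summary"].str.split().str.len()

print(toy)
```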

We'll save the data as it is now and then we can work with these summaries.

In [None]:
painter_summaries_clean.to_csv("data/painter_summaries_clean.csv", index=False)


---

## 4.2 - Sentence Similarity

Let's take a step back and think about where we want to end up and where we are currently. Right now we have a dataset of biographies of different painters (with some differences in length). We want to end up with a visual of the painters clustered based on their biographies.

We could manually take each biography, interpret the text, and try to group the painters ourselves. In some cases we might group painters by their nationality (e.g., Dutch painters), their style (e.g., Surrealist painters), their subject matter (e.g., still life painters), or the time period they lived in (e.g., Renaissance painters).

<div class="alert alert-info">
<b>Instruction:</b> How many painter biographies would you go through before getting bored?
</div>

**@@YOUR ANSWER HERE@@**

We can use machine learning to assist us in clustering these biographies by comparing how similar or different the summaries are. This task is also known as Sentence Similarity and you can read more about it here: [https://huggingface.co/tasks/sentence-similarity](https://huggingface.co/tasks/sentence-similarity). 

For now we'll play a bit with the widget on the page. First let's get a series of painter summaries to work with. I picked names that might have some obvious groupings so we can do sanity checks as we work.

In [None]:
select_painter_names = [
    "Albrecht Dürer",
    "Leonardo da Vinci",
    "Michelangelo",
    "Raphael",
    "Titian",
    "Joaquín Sorolla",
    "Pablo Picasso",
    "Salvador Dalí",
    "Andy Warhol",
    "Vincent van Gogh",
    "Johannes Vermeer",
    "Sandro Botticelli",
    "Hokusai"
]

select_painter_bios = painter_summaries_clean[
    painter_summaries_clean["painter_name"].isin(select_painter_names)
]

# For this short dataset, we don't care about the other columns.
select_painter_bios = select_painter_bios[["painter_name", "summary"]]
select_painter_bios
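Because `isin` matches names exactly, a name that is spelled differently in the dataset (for example without diacritics) is silently left out of the result. A quick check along these lines lists the requested names that did not match. The helper `find_missing_names` is hypothetical and the data is made up for illustration.

```python
import pandas as pd


def find_missing_names(requested_names, df, column="painter_name"):
    """Return the requested names that do not appear in df[column].

    Hypothetical helper for illustration only.
    """
    present = set(df[column])
    # Keep the original order of the request list for readability.
    return [name for name in requested_names if name not in present]


# Toy stand-in for the cleaned dataframe; "Durer" deliberately lacks the umlaut.
toy_df = pd.DataFrame({"painter_name": ["Titian", "Albrecht Durer", "Hokusai"]})

missing = find_missing_names(["Titian", "Albrecht Dürer", "Hokusai"], toy_df)
print(missing)
```

With the real data you could call `find_missing_names(select_painter_names, painter_summaries_clean)`; an empty list means all 13 names were found.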

<div class="alert alert-info">
<b>Instruction:</b> Cluster the 13 painters based on what you already know or can quickly read about them.
</div>

**@@YOUR ANSWER HERE@@**

Now let's play with the sentence similarity widget on Hugging Face. For that we'll need the full summary for each painter. I will save the previous table to a CSV for faster copy and paste, but you can also use the Python code below it to get the bio for a particular artist.

In [None]:
select_painter_bios.to_csv("data/select_painter_bios.csv", index=False)

In [None]:
# Look up the full summary text for one painter by exact name.
painter_name = "Vincent van Gogh"
select_painter_bios[select_painter_bios["painter_name"] == painter_name]["summary"].values[0]

<div class="alert alert-info">
<b>Instruction:</b> Pick 5 painters from our test set. Put their bios in the <a href="https://huggingface.co/tasks/sentence-similarity">Sentence Similarity demo</a> and write down the values. Then add your interpretation of the values. Are they high or low? Why might that be? Fill in the table below:
</div>

| painter_name  | similarity_score | Interpretation                                          |
|---------------|-----------------:|---------------------------------------------------------|
| SOURCE NAME |               -- | The first painter is the source and does not get a score |
| PAINTER NAME   |         Score @@ | Interpretation here @@                                  |
| PAINTER NAME   |         Score @@ | Interpretation here @@                                  |
| PAINTER NAME   |         Score @@ | Interpretation here @@                                  |
| PAINTER NAME   |         Score @@ | Interpretation here @@                                  |

This Sentence Similarity demo is quite cool. It takes each summary and converts it into an **embedding**, a numerical vector representation of the text that does a good job of capturing the semantics of the text. This is the part connected to machine learning. In the demo, the pre-trained model `all-MiniLM-L6-v2` is used to compute the embeddings. We'll work with this same model below.

Once all the embeddings are computed, the rest is math. The demo takes the source embedding (whichever artist you entered first) and compares it with each of the other embeddings, one pair at a time. For each pair, say *source_painter* and *painter_1*, it produces a similarity score. In practice the scores for texts like ours mostly fall between 0 and 1, where 0 means there is no similarity and 1 means the texts are identical (cosine similarity itself can range from -1 to 1). There are many ways to compute similarity, and a popular one is cosine similarity. There is some info on the demo page linked above, reproduced here:
> The similarity of the embeddings is evaluated mainly on cosine similarity. It is calculated as the cosine of the angle between two vectors. It is particularly useful when your texts are not the same length.
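To make the quoted definition concrete, here is a minimal sketch of cosine similarity in NumPy, applied the way the text describes the demo: one source vector is compared with each of the others. The three-dimensional vectors and their labels are made up for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up "embeddings"; the labels are placeholders, not real model output.
source = [1.0, 0.0, 0.0]
others = {
    "same_direction": [2.0, 0.0, 0.0],  # longer vector, same direction
    "orthogonal": [0.0, 3.0, 0.0],      # shares nothing with the source
    "in_between": [1.0, 1.0, 0.0],      # 45 degrees from the source
}

scores = {name: cosine_similarity(source, vec) for name, vec in others.items()}

# Rank from most to least similar to the source.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Note that `same_direction` gets the top score even though its vector is twice as long as the source: only the direction matters, not the magnitude.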

---

## 4.3 - Creating embeddings

The first step to being able to cluster and visualize the painters is to compute the embeddings. 

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
