# Exercise 5

In [None]:
import torch

## Task 1: Download and inspect a DatasetDict

1. Download the `dair-ai/emotion` dataset
2. Find out how many rows and columns it has
3. Find the cache directory and convince yourself that you would know how to delete the dataset to free up space
4. Create a pandas DataFrame containing the training split of the data

## Task 2: DatasetDict.map

Use `DatasetDict.map` to add a new column called `label_name` to the dataset. The mapping from name to integer label is as follows: sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5).

**Test the function on one row before you call map**
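
One way to follow this advice is to write the function against a plain dictionary first, since `map` hands each row to the function as a dictionary. The following is a minimal sketch, not the only possible solution. It assumes the label column is called `label`; the names `LABEL_NAMES` and `add_label_name` and the example text are made up for illustration.

```python
# Position in the list corresponds to the integer label
# (order taken from the task description).
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def add_label_name(row):
    """Return the new column for a single row as a dictionary."""
    return {"label_name": LABEL_NAMES[row["label"]]}


# Stand-in for one dataset row (made-up text, assumed column names).
example_row = {"text": "i feel great today", "label": 1}
result = add_label_name(example_row)
print(result)
```

If the output looks right, the same function can be passed to `emotions.map(add_label_name)`, where `emotions` is the `DatasetDict` from Task 1.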

## Task 3: Batched map

Rewrite the function from the previous task so that it works if you set `batched=True` in map, i.e. if you call `emotions.map(my_func, batched=True)`.

Use the strategies you have just learned to find out how this works. Don't try out random things!
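
For comparison once you have investigated this yourself: in batched mode the function receives a dictionary whose values are lists (one entry per row) and should return lists of the same length. The sketch below tests that idea on a hand-built batch. The name `add_label_names_batched`, the example texts, and the `label` column name are assumptions made for illustration.

```python
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def add_label_names_batched(batch):
    """Translate a whole list of integer labels into label names."""
    return {"label_name": [LABEL_NAMES[label] for label in batch["label"]]}


# Stand-in for a batch of three rows: every column is a list.
example_batch = {
    "text": ["made-up text one", "made-up text two", "made-up text three"],
    "label": [0, 3, 5],
}
result = add_label_names_batched(example_batch)
print(result)
```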

## Task 4: Write Tokenizers

Write a function called `character_tokenizer` that takes a string and returns a list of tokens. Use all characters of the Latin alphabet and distinguish between lowercase and uppercase characters. Don't forget punctuation. Encode the following text:

In [None]:
text = "Programming isn't about what you know; it's about what you can figure out."

Even this simple example shows you that a lot can go wrong when coding your own tokenizer. Always use pre-trained or pre-implemented tokenizers in practice!
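
A minimal sketch of such a tokenizer, built on the standard library's `string` module. Treating the space as a token and mapping unseen characters to an extra "unknown" id are design assumptions, and the helper names `encode` and `decode` are hypothetical additions to the `character_tokenizer` function the task asks for.

```python
import string

# Vocabulary: lowercase letters, uppercase letters, ASCII punctuation, space.
VOCAB = list(string.ascii_lowercase + string.ascii_uppercase + string.punctuation + " ")
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}
UNK_ID = len(VOCAB)  # id reserved for characters outside the vocabulary


def character_tokenizer(text):
    """Split a string into a list of single-character tokens."""
    return list(text)


def encode(text):
    """Map every character token to its integer id."""
    return [TOKEN_TO_ID.get(token, UNK_ID) for token in character_tokenizer(text)]


def decode(ids):
    """Map ids back to characters; unknown ids become '<unk>'."""
    return "".join(VOCAB[i] if i < len(VOCAB) else "<unk>" for i in ids)


text = "Programming isn't about what you know; it's about what you can figure out."
ids = encode(text)
print(ids[:11])
print(decode(ids))
```

Characters such as `é` or a newline are not in this vocabulary and all collapse to the same unknown id, which is one of the ways a hand-written tokenizer can go wrong.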

## Task 5: Use a pretrained tokenizer

Use a pre-trained tokenizer for the `"distilbert-base-uncased"` model to encode the `text` from above. Decode each token so you can see how words were split into tokens. 

Now wrap the tokenizer in a function called `tokenize` and tokenize the entire dataset using `DatasetDict.map`.

For the tokenizer, the settings should be:

- `padding=True`
- `truncation=True`

For `map`, the settings should be:

- `batched=True`
- `batch_size=None`

Hint: if you write the function correctly, the following should work:

```python
tokenize(emotions["train"][:3])
```

## Important

Setting `batched=True` and `batch_size=None` means that all tweets are processed in one batch. This is very important: if the dataset were processed in multiple batches, each batch might be padded to a different length (the number of tokens in the longest tweet of that batch), so rows from different batches would no longer have the same number of tokens.
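
The effect can be illustrated without any library. The sketch below uses made-up token ids and a hypothetical helper `pad_batch` that mimics padding to the longest sequence in a batch; it is not the tokenizer's actual implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Made-up token ids for four texts of different lengths.
sequences = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102]]

# Processing in batches of two: each batch gets its own padded length.
batch_size = 2
padded_in_batches = []
for start in range(0, len(sequences), batch_size):
    padded_in_batches.extend(pad_batch(sequences[start:start + batch_size]))
widths_batched = [len(seq) for seq in padded_in_batches]

# Processing everything at once: one common padded length.
widths_single = [len(seq) for seq in pad_batch(sequences)]

print(widths_batched)  # [5, 5, 4, 4]
print(widths_single)   # [5, 5, 5, 5]
```

Rows of unequal length cannot be stacked into a single rectangular tensor, which is why the single-batch setting matters here.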

## Task 6: Redo numpy exercises in torch

The following is a subset of the exercises you did in the second lecture using numpy. Repeat them using `torch.tensor` instead of `np.array`. This is mainly to show how similar numpy and pytorch are.

Create the following tensors:

1. A three-dimensional tensor of shape `(3, 3, 4)` containing zeros
2. A two-dimensional tensor with 4 rows and 3 columns that is equivalent to the list `[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]`. Do not just type in the numbers.
3. Select the bottom-left 2 x 2 block from the tensor you just created
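
One possible approach, sketched with `torch.zeros`, `torch.arange`, `reshape` and slicing, all of which mirror their numpy counterparts. The variable names are illustrative, and other constructions (for example `torch.linspace`) would work as well.

```python
import torch

# 1. Three-dimensional tensor of zeros with shape (3, 3, 4).
zeros = torch.zeros(3, 3, 4)

# 2. Integers 1..12 arranged as 4 rows x 3 columns, then scaled to 0.1..1.2.
grid = torch.arange(1, 13).reshape(4, 3) / 10

# 3. Last two rows, first two columns.
bottom_left = grid[-2:, :2]

print(zeros.shape)
print(grid)
print(bottom_left)
```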


Now do the following calculations with the tensors `x`, `y` and `z` defined below:

1. Do a matrix multiplication of the two tensors x and y
2. Do an elementwise multiplication of the tensors x and y
3. Do an elementwise addition of x and z
4. Do an elementwise addition of x and `z.reshape(-1, 1)`
5. Sum the two rows in x

In [None]:
x = torch.tensor([[0.5, 1.5], [2.5, 3.5]])
y = torch.diag(torch.tensor([2.0, 3.0]))
z = torch.tensor([2.0, 3.0])
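
The operations involved behave as in numpy, including broadcasting. The sketch below shows them on separate toy tensors `a`, `b` and `d` (hypothetical names and values, chosen so the results are easy to check by hand) rather than on `x`, `y` and `z`.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([10.0, 20.0])
d = torch.diag(b)  # [[10, 0], [0, 20]]

matmul = a @ d                         # matrix product
elementwise = a * d                    # entry-by-entry product
row_broadcast = a + b                  # b is added to every row of a
col_broadcast = a + b.reshape(-1, 1)   # b is added to every column of a
summed_rows = a.sum(dim=0)             # adds the rows together

print(matmul)
print(elementwise)
print(row_broadcast)
print(col_broadcast)
print(summed_rows)
```

Reshaping `b` to shape `(2, 1)` changes which axis it is broadcast along, which is why the two additions give different results.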

## Task 7: Differences between torch and numpy

The following exercises show a few differences between torch and numpy. 

1. Do a matrix multiplication of the tensors u and v
2. Check the device of the tensor u
3. Explicitly set the device to 'cpu'

In [None]:
u = torch.ones(2, 2)
v = torch.tensor([[1, 2], [3, 4]])
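
A hedged sketch of what these steps are likely to surface: unlike numpy, torch does not silently upcast integer tensors in a matrix multiplication, and every tensor carries a device. Converting `v` with `.to(u.dtype)` is one of several ways to resolve the mismatch; the variable names other than `u` and `v` are illustrative.

```python
import torch

u = torch.ones(2, 2)                 # float32 by default
v = torch.tensor([[1, 2], [3, 4]])   # int64, inferred from the Python ints

# 1. Mixing dtypes in a matrix multiplication raises an error.
try:
    u @ v
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# Casting v to u's dtype makes the multiplication work.
product = u @ v.to(u.dtype)
print(product)

# 2. Every tensor lives on a device.
print(u.device)

# 3. Move (or keep) the tensor on the CPU explicitly.
u_cpu = u.to("cpu")
print(u_cpu.device)
```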