# Probing Workshop - 33. TaCoS
Katja Konermann & Mikhail Sonkin

Let's explore probing together! Make sure you connect to a GPU runtime in this notebook.

First, we will have to install the relevant packages.

In [9]:
# install
!pip3 install datasets scikit-learn transformers



## Importing the code
Next, we will have to import our code from the GitHub repository.


In [10]:
!git clone https://github.com/katjakon/probing_workshop



Let's import all the relevant packages and our code!

In [11]:
# import statements
from datasets import load_dataset
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

from probing_workshop.probes import ClassifierProbe, ControlTaskProbe, MajorityBaseline, RandomProbe

## Datasets

In [None]:
wikiann_data = load_dataset("wikiann", "en")

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/20000 [00:00<?, ? examples/s]

In [None]:
wikiann_data["train"]["ner_tags"]

[[3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 2, 0, 0],
 [1, 2, 2, 0, 0, 0, 0],
 [5, 6, 6, 6, 6],
 [0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  2,
  0,
  1,
  2,
  0,
  0,
  0,
  0],
 [3, 4, 4, 4, 4, 4, 4, 4, 4, 4],
 [3, 4, 4, 4, 4, 0],
 [1, 2, 0, 0, 0],
 [0, 0, 0, 3, 4, 4, 4, 4, 4, 0, 0],
 [1, 2, 1, 2, 1, 2, 1, 2],
 [0, 0, 3, 4, 0, 3],
 [0, 3, 4, 4, 4],
 [0, 5, 0, 0, 0, 0],
 [0, 0, 5, 6, 0, 0],
 [1, 2, 0, 0, 0],
 [1, 2, 2, 2, 2, 2, 2, 2, 2],
 [3, 4, 4, 4, 4, 4, 0],
 [5, 6, 6],
 [3, 4, 4, 4],
 [0, 0, 0, 0, 5, 6, 6, 6, 0, 0, 0, 0, 0, 0],
 [3, 4, 4],
 [0, 0, 5, 6, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 2],
 [0, 0, 0, 0, 0, 3, 4, 4, 0, 0],
 [0, 0, 0, 0, 1, 2, 0],
 [0, 0, 0, 3, 0, 0, 0],
 [0, 0, 1, 2, 2],
 [0, 0, 0, 3, 4, 0, 0, 0, 3, 4, 0, 0],
 [0, 0, 1, 2, 0],
 [0, 0, 0, 5, 0],
 [1, 2, 2],
 [0, 3, 4, 0, 0, 1, 2, 0, 0, 0],
 [0, 0, 0, 1, 2],
 [1, 2, 0, 0, 0],
 [3, 4, 0, 0, 0, 0, 0, 3, 4, 0],
 [1, 2

In [None]:
wikiann_data["train"].features

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None),
 'langs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'spans': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
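
The integer tags index into the `names` list of the `ClassLabel` feature printed above. Here is a minimal sketch of that mapping, using a plain Python list copied from the output so that it runs without downloading anything. The helper name `decode_tags` is made up for this example; on the real data set, the `ClassLabel` feature of `datasets` offers `int2str` for the same lookup.

```python
# Label names copied from the `ner_tags` feature printed above.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def decode_tags(tag_ids):
    """Map a sequence of integer tag ids to their string labels (hypothetical helper)."""
    return [NER_LABELS[i] for i in tag_ids]


# First training example from the output of wikiann_data["train"]["ner_tags"].
first_example = [3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0]
decoded = decode_tags(first_example)
print(decoded)
```

The `B-` prefix marks the first token of an entity and `I-` a continuation, so this sentence starts with a two-token organization.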

# Probing

## Classifier Probe
Different types of probes and baselines take different arguments. Let's start with the most relevant one: the classifier probe!

In [7]:
help(ClassifierProbe)

Help on class ClassifierProbe in module probing_workshop.probes:

class ClassifierProbe(builtins.object)
 |  ClassifierProbe(data_set, clf, clf_kwargs: dict) -> None
 |  
 |  Methods defined here:
 |  
 |  __init__(self, data_set, clf, clf_kwargs: dict) -> None
 |      Initialize a probing classifier.
 |      
 |      Args:
 |          data_set (Custom Dataset type): Should have attributes for embeddings, labels & strings.
 |          clf (scikit-learn classifier): For instance, SGDClassifier or MLPClassifier
 |          clf_kwargs (dict): Keyword arguments to be given to clf
 |  
 |  fit(self)
 |      Fit the given probe to the given classifier.
 |  
 |  predict(self, embeddings)
 |      Predict given instances.
 |      
 |      Args:
 |          embeddings (matrix-like): Predict labels based on given embeddings.
 |      
 |      Returns:
 |          1-d array: Predicted labels.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors define

The probe is initialized with a data set and a `scikit-learn` classifier like `MLPClassifier`. Optionally, you can specify keyword arguments for the classifier.

You can look at the documentation of `scikit-learn` to find out which hyperparameters you can adjust:

- [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)
- [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
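
To make the role of `clf` and `clf_kwargs` more concrete, here is a minimal sketch that builds a `scikit-learn` classifier from a class and a keyword dictionary. The random arrays only stand in for real embeddings and labels, and constructing the classifier as `clf(**clf_kwargs)` is an assumption about what the probe might do internally; the help text does not confirm it.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Stand-in data: 200 "token embeddings" of size 32 with 3 possible labels.
embeddings = rng.normal(size=(200, 32))
labels = rng.integers(0, 3, size=200)

clf = SGDClassifier  # the classifier class, not an instance
clf_kwargs = {"max_iter": 1000, "random_state": 0}

model = clf(**clf_kwargs)  # assumed construction step
model.fit(embeddings, labels)
predicted = model.predict(embeddings[:5])
print(predicted)
```

Swapping in `MLPClassifier` only requires changing `clf` and the keys in `clf_kwargs`, for example `hidden_layer_sizes`.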

In [None]:
# Initialize Probe


With the method `fit`, you can train the probing classifier on the given data set.

In [None]:
# Fit

After we have fitted the probing classifier, we can use it to predict and evaluate!

For **predicting**, we have to give the probing classifier the embeddings of the instances we want to predict.

In [None]:
# Predict

Let's evaluate how well our probe performs by giving it the embeddings of the test set.
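
As a small illustration of the two metrics imported at the top, the sketch below scores a made-up prediction that labels every token `O`. The toy labels are invented for this example. Plain accuracy counts 4 of 6 tokens as correct because `O` dominates, while balanced accuracy (the mean of the per-class recall) shows that the two entity classes are never found.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented toy labels: four "O" tokens and one token each of two entity classes.
y_true = ["O", "O", "O", "O", "B-PER", "I-PER"]
y_pred = ["O"] * len(y_true)  # a predictor that always answers "O"

acc = accuracy_score(y_true=y_true, y_pred=y_pred)
bal_acc = balanced_accuracy_score(y_true=y_true, y_pred=y_pred)
print(f"accuracy: {acc:.3f}, balanced accuracy: {bal_acc:.3f}")
```

With label distributions as skewed as in NER, it can be worth reporting both numbers.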

In [None]:
# Evaluate with gold labels
accuracy_score(y_true=, y_pred=)

But what does that mean? Let's also compare our probe to some baselines:

## Random Initialization Baseline

Initializing the random baseline works the same as before. Internally, the random initialization baseline generates a new random embedding for each token. Can a probe still extract information from this?
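
One way to picture this baseline is a lookup table of random vectors indexed by token id. The sketch below illustrates the idea only and is not the code in `probing_workshop`; the vocabulary size, the dimension and the name `random_table` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, dim = 1000, 16  # hypothetical sizes for illustration

# One fixed random vector per token id, drawn once and then reused.
random_table = rng.normal(size=(vocab_size, dim))

token_ids = np.array([5, 17, 5, 999])
random_embeddings = random_table[token_ids]  # one row per requested token id
print(random_embeddings.shape)
```

Because the same token id always maps to the same vector, a classifier can still pick up on token identity from such embeddings, but they carry no information about context.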

In [None]:
# Initialize and fit
rand_probe = RandomProbe(data_set="", clf="")

When predicting, we have to give the probe the token ids instead of the embeddings of the instances we want to classify.

In [None]:
# Predict with token ids
predictions = rand_probe.predict(token_ids=)

Now let's evaluate: What do we expect?

In [None]:
# Evaluate
accuracy_score(y_true=, y_pred=)

## Majority Baseline
This is a very simple concept: each token is assigned the label it is most frequently associated with. Tokens that were not seen during training are assigned the most common label overall.

We initialize it by giving it a data set. No need to specify a classifier type. Then let's fit it. In this case, this just means counting up the label frequencies.
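
The counting described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the `MajorityBaseline` implementation: it works on token strings rather than token ids, and the function names and toy data are made up.

```python
from collections import Counter, defaultdict


def fit_majority(tokens, labels):
    """Count label frequencies per token and overall."""
    per_token = defaultdict(Counter)
    overall = Counter()
    for token, label in zip(tokens, labels):
        per_token[token][label] += 1
        overall[label] += 1
    # Most frequent label for each seen token, plus a global fallback.
    token_to_label = {t: c.most_common(1)[0][0] for t, c in per_token.items()}
    fallback = overall.most_common(1)[0][0]
    return token_to_label, fallback


def predict_majority(tokens, token_to_label, fallback):
    """Look up each token; unseen tokens get the overall most common label."""
    return [token_to_label.get(t, fallback) for t in tokens]


# Invented toy training data.
train_tokens = ["Paris", "is", "very", "nice", "Paris", "Paris"]
train_labels = ["B-LOC", "O", "O", "O", "B-LOC", "B-PER"]
mapping, fallback = fit_majority(train_tokens, train_labels)
majority_predictions = predict_majority(["Paris", "Berlin"], mapping, fallback)
print(majority_predictions)
```

Here "Paris" receives its most frequent training label, while the unseen "Berlin" falls back to the overall most common label.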

In [None]:
majority_baseline = MajorityBaseline(data_set=)
# Fit
majority_baseline.fit()

Again, predict and evaluate. For prediction, this baseline needs the token ids.

In [None]:
# Predict
predictions = majority_baseline.predict()

# Evaluate
accuracy_score(y_true=, y_pred=predictions)

## Control Task Probing

For a control task, each word type is assigned a randomly sampled label from a set with the same cardinality as the original label set. Ideally, a probe should perform poorly on a control task and well on the actual probing task. If a probe performs very well on a control task, it is able to simply memorize the word types.

We initialize and fit it exactly like the classifier probe.
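
The label assignment behind a control task can be sketched as follows. This is an illustration, not the `ControlTaskProbe` code: the function name is hypothetical, it works on token strings, and sampling labels uniformly is an assumption.

```python
import random


def make_control_labels(tokens, num_labels, seed=0):
    """Assign each word type one random label and reuse it for every occurrence."""
    rng = random.Random(seed)
    type_to_label = {}
    control = []
    for token in tokens:
        if token not in type_to_label:
            # Uniform sampling over the label ids is an assumption of this sketch.
            type_to_label[token] = rng.randrange(num_labels)
        control.append(type_to_label[token])
    return control, type_to_label


tokens = ["the", "cat", "saw", "the", "dog"]
# Seven labels, matching the size of the NER tag set.
control_labels, type_to_label = make_control_labels(tokens, num_labels=7)
print(control_labels)
```

Both occurrences of "the" share one label, so the only way to solve the task is to memorize which word type carries which label.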

In [None]:
# Initialize
control_task_probe = ControlTaskProbe(data_set=, clf=)
# Fit
control_task_probe.fit()


Again, let's evaluate:

In [None]:
# Predict
predictions = control_task_probe.predict()
# Evaluation
accuracy_score(y_true, y_pred=predictions)

## Probing Experiments
Now it's your turn! Choose one of the tasks below and run some probing experiments and baselines.
We have already specified the data set. Copy and adjust code from above to run your own probing experiments.
Some things to try out and think about:
- Choose between `MLPClassifier` and `SGDClassifier`
- Adjust different hyperparameters
- Evaluate the classifiers and compare them to the baselines
- What do you conclude from this? Has the BERT model learned knowledge of these tasks?
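
As a starting point for such a comparison, here is a minimal sketch that fits both classifier types and a trivial baseline on the same split. The data is synthetic (`make_classification`), not real embeddings, the hyperparameter values are arbitrary, and `DummyClassifier` is a data-set-wide most-frequent-label baseline from `scikit-learn`, not the per-token `MajorityBaseline` from above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for (embedding, label) pairs.
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidates = {
    "most-frequent baseline": DummyClassifier(strategy="most_frequent"),
    "SGDClassifier": SGDClassifier(max_iter=1000, random_state=0),
    "MLPClassifier": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                   random_state=0),
}

# Fit every candidate on the same split and collect test accuracies.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_true=y_test, y_pred=model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Keeping the split and the random seeds fixed makes differences between runs attributable to the classifier and its hyperparameters.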

### Part-of-Speech Tagging

In [None]:
# Data Set

In [None]:
# Create Probing Classifier

# Predict the test instances

# Evaluate

In [None]:
# Try out some baselines!

### Named Entity Recognition

In [None]:
# Data Set

In [None]:
# Create Probing Classifier

# Predict the test instances

# Evaluate

In [None]:
# Try out some baselines!

### Semantic Roles

In [None]:
# Data Set

In [None]:
# Create Probing Classifier

# Predict the test instances

# Evaluate

In [None]:
# Try out some baselines!