# Feature Extraction Tutorial

This tutorial shows how to extract features, such as word embeddings, from our stimulus transcript. Features are numeric vectors that capture properties of the words in our transcript, such as their syntactic role or their meaning. Here, we will extract two types of features: syntactic features from spaCy ([Honnibal et al., 2020](https://github.com/explosion/spaCy)) and contextual word embeddings from GPT-2 ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)).


Acknowledgments: This tutorial draws heavily on the [encling tutorial](https://github.com/snastase/encling-tutorial/blob/main/encling_tutorial.ipynb) by Samuel A. Nastase.
______________

First, we'll import some general-purpose Python packages. We use `spacy` for syntactic features and [`transformers`](https://huggingface.co/docs/transformers/index) from huggingface for contextual word embeddings.

In [36]:
from os import makedirs, path
import h5py
import numpy as np
import pandas as pd
import spacy
import torch
from sklearn.preprocessing import LabelBinarizer
from accelerate import Accelerator, find_executable_batch_size
from transformers import AutoModelForCausalLM, AutoTokenizer

## Extracting Syntactic Features

First, we will set up the output folder where we will save our syntactic features.

In [None]:
output_dir_syntactic = "/home/kw1166/scratch/247/monkey-data/stimuli/syntactic"
if not path.exists(output_dir_syntactic):
    makedirs(output_dir_syntactic)
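
As a side note, the check-then-create pattern above can be collapsed into a single call with the `exist_ok` flag of `os.makedirs`. The sketch below is only an illustration: the temporary directory and the `demo_dir` name are placeholders, not paths used in this tutorial.

```python
import tempfile
from os import makedirs, path

# Hypothetical location: a throwaway temp directory stands in for the real output path
base = tempfile.mkdtemp()
demo_dir = path.join(base, "stimuli", "syntactic")

# exist_ok=True creates missing parent folders and does not raise
# if the folder already exists
makedirs(demo_dir, exist_ok=True)
makedirs(demo_dir, exist_ok=True)  # calling it a second time is harmless

created = path.isdir(demo_dir)
```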

Then we will load our stimulus transcript as a dataframe. It contains columns of words and their start and end information.

In [35]:
transcript_path = "/home/kw1166/scratch/247/monkey-data/stimuli/monkey_transcript.csv"
df = pd.read_csv(transcript_path)
df.head()

Unnamed: 0,word,start,end
0,Act,3.71,3.79
1,"one,",3.99,4.19
2,monkey,4.651,4.931
3,in,4.951,5.011
4,the,5.051,5.111


To extract syntactic features, we will use spaCy ([Honnibal et al., 2020](https://github.com/explosion/spaCy)). First, we will download [`en_core_web_lg`](https://spacy.io/models/en#en_core_web_lg), one of spaCy's English pipelines, which includes components such as a tagger and a parser.

In [7]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 152.2 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


We now load the NLP pipeline from spaCy by calling `spacy.load()`. We will create a `word_idx` column to keep track of each word's index. We will then tokenize the words using the tokenizer from the pipeline, which transforms each word into a list of tokens. Then, we will [explode](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) the dataframe so that each row of the dataframe is a token. Note that we append a whitespace to the end of each word before tokenization so we can track the boundary of each word.

In [18]:
modelname = "en_core_web_lg"
nlp = spacy.load(modelname)
df.insert(0, "word_idx", df.index.values)
df["word_with_ws"] = df.word.astype(str) + " "
# astype(str) above guarantees that every entry passed to the tokenizer is a string
df["hftoken"] = df.word_with_ws.apply(nlp.tokenizer)
df = df.explode("hftoken", ignore_index=True)
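
To see what `explode` does to the `word_idx` column, here is a small sketch on a toy dataframe. The token lists are hand-written stand-ins for tokenizer output (an assumption for illustration), so the example does not require spaCy.

```python
import pandas as pd

# Toy transcript; the token lists are hand-written stand-ins for tokenizer output
toy = pd.DataFrame({"word": ["Act", "one,", "monkey"]})
toy.insert(0, "word_idx", toy.index.values)
toy["token"] = [["Act"], ["one", ","], ["monkey"]]

# One row per token; word_idx still points back to the original word
toy = toy.explode("token", ignore_index=True)
word_idx_per_token = toy.word_idx.tolist()
tokens = toy.token.tolist()
```

The word "one," is split into two tokens, and both rows keep `word_idx` 1, which is how token-level features can later be mapped back to words.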

Now we will create a [doc object](https://spacy.io/api/doc) (which is a sequence of token objects) from our tokenized text and then pass it through the nlp pipeline to parse it for features.


In [19]:
words = [token.text for token in df.hftoken.tolist()]
spaces = [token.whitespace_ == " " for token in df.hftoken.tolist()]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp(doc)
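
The role of the `spaces` list can be sketched in plain Python: each entry records whether a token is followed by a space, which is enough to rebuild the original text and to tell where words end. The lists below are hand-written stand-ins for the `words` and `spaces` built above, not actual spaCy output.

```python
# Hand-written stand-ins for the `words` and `spaces` lists built above
words = ["Act", "one", ",", "monkey"]
spaces = [True, False, True, True]  # True if the token is followed by a space

# Joining each token with its trailing space (if any) recovers the text
text = "".join(w + (" " if s else "") for w, s in zip(words, spaces))

# We appended a space to every word, so a trailing space marks the end of a word
n_words = sum(spaces)
```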

We will loop through the doc, and get the features for each token. The [features](https://spacy.io/usage/linguistic-features#pos-tagging) include `text`, `tag` (detailed part-of-speech tag), `dep` (syntactic dependency, i.e. the relation between tokens), and `is_stop` (is the token part of a stop list, i.e. the most common words of the language?). We will organize the features into a second dataframe and add those columns back to `df`. We will drop the two columns we don't need anymore, and then save `df` for future encoding.

In [20]:
features = []
for token in doc:
    features.append([token.text, token.tag_, token.dep_, token.is_stop])

df2 = pd.DataFrame(
    features, columns=["token", "pos", "dep", "stop"], index=df.index
)
df = pd.concat([df, df2], axis=1)
df.drop(["hftoken", "word_with_ws"], axis=1, inplace=True)
df.to_csv(path.join(output_dir_syntactic, "transcript.tsv"), sep="\t")

Since the features we get are all categorical, we need to turn them into vectors for encoding. We will use [`LabelBinarizer`](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelBinarizer.html) from sklearn, which is fit on all the possible category labels for a feature and then transforms our labels into one-hot vectors. There are 50 possible labels for `tag` and 45 for `dep`, so those two features become 50-dimensional and 45-dimensional vectors, respectively. Our `is_stop` feature is binary, so it needs only one dimension. We concatenate all three to form a 96-dimensional (50 + 45 + 1) syntactic feature vector for each token and save it for future encoding.

In [30]:
taggerEncoder = LabelBinarizer().fit(nlp.get_pipe("tagger").labels)
dependencyEncoder = LabelBinarizer().fit(nlp.get_pipe("parser").labels)

a = taggerEncoder.transform(df.pos.tolist())
b = dependencyEncoder.transform(df.dep.tolist())
c = LabelBinarizer().fit_transform(df.stop.tolist())
embeddings = np.hstack((a, b, c))

with h5py.File(path.join(output_dir_syntactic, "states.hdf5"), "w") as f:
    f.create_dataset(name="vectors", data=embeddings)
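
The one-hot step can be illustrated without sklearn. In the sketch below, the `one_hot` helper is a hypothetical stand-in for `LabelBinarizer`, and the label sets are toy examples much smaller than spaCy's real tag and dependency sets.

```python
import numpy as np

def one_hot(values, labels):
    """One-hot encode `values` against a fixed, sorted label set."""
    labels = sorted(labels)
    index = {label: i for i, label in enumerate(labels)}
    out = np.zeros((len(values), len(labels)), dtype=int)
    for row, value in enumerate(values):
        out[row, index[value]] = 1
    return out

# Toy label sets (assumptions for illustration only)
tag_labels = ["DT", "IN", "NN"]
dep_labels = ["det", "pobj", "prep", "ROOT"]

tags = ["NN", "IN", "DT"]
deps = ["ROOT", "prep", "det"]
stops = [False, True, True]

a = one_hot(tags, tag_labels)                  # shape (3, 3)
b = one_hot(deps, dep_labels)                  # shape (3, 4)
c = np.array(stops, dtype=int).reshape(-1, 1)  # binary feature, one column
toy_embeddings = np.hstack((a, b, c))          # shape (3, 3 + 4 + 1)
```

The final width is the sum of the individual widths, which is the same arithmetic that gives 50 + 45 + 1 = 96 for the real label sets.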

## Extracting GPT-2 Features

Now we will extract contextual word embeddings from an autoregressive (or "causal") large language model (LLM) called GPT-2 ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)). GPT-2 relies on the Transformer architecture to sculpt the embedding of a given word based on the preceding context. The model is composed of a repeated circuit motif, called the "attention head", by which the model can "attend" to previous words in the context window when determining the meaning of the current word.

This GPT-2 implementation is composed of 12 layers, each of which contains 12 attention heads that influence the embedding as it proceeds to the subsequent layer. The embeddings at each layer of the model comprise 768 features, and the context window spans up to 1024 tokens. Note that certain words will be broken up into multiple tokens; we'll need to use GPT-2's "tokenizer" to convert words into the appropriate tokens.

GPT-2 has been (pre)trained on large corpora of text according to a simple self-supervised objective function: predict the next word based on the prior context. If you want to learn more about LLMs and GPT-2, here are some great blog posts explaining the [Transformer](https://jalammar.github.io/illustrated-transformer/) and [GPT-2](https://jalammar.github.io/illustrated-gpt2/) architectures.
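
To make the idea of "attending" only to previous words more concrete, here is a minimal NumPy sketch of a single causal attention head. It is a simplification under stated assumptions: the sizes are toy values, random matrices stand in for learned weights, and none of this is GPT-2's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8  # toy sizes; GPT-2 uses 768-dimensional embeddings

x = rng.normal(size=(n_tokens, d_model))  # one embedding per token
# Random projection matrices stand in for learned weights
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

scores = q @ k.T / np.sqrt(d_model)
# Causal mask: token i may only attend to tokens 0..i
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf

# Softmax over each row gives attention weights that sum to 1
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Each row mixes information from the current token and earlier tokens only
attended = weights @ v
```

Because of the mask, the first token can attend only to itself, while the last token can draw on the whole preceding context.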

First, we will set up the output folder where we will save our contextual word embeddings from GPT-2.

In [37]:
output_dir_gpt2 = "/home/kw1166/scratch/247/monkey-data/stimuli/gpt2"
if not path.exists(output_dir_gpt2):
    makedirs(output_dir_gpt2)

Then we will reload our stimulus transcript as a fresh dataframe, since the syntactic feature extraction above modified `df`. As before, it contains columns of words and their start and end information.

In [38]:
transcript_path = "/home/kw1166/scratch/247/monkey-data/stimuli/monkey_transcript.csv"
df = pd.read_csv(transcript_path)
df.head()

Unnamed: 0,word,start,end
0,Act,3.71,3.79
1,"one,",3.99,4.19
2,monkey,4.651,4.931
3,in,4.951,5.011
4,the,5.051,5.111


We will define some general arguments: the model name as it appears on Hugging Face; the context length, which is how many preceding tokens we give the model when extracting each embedding (the maximum is 1024 for GPT-2; here we will use 32); and the device on which the model will run. If a GPU is available, we set the device to `cuda`; otherwise we fall back to the CPU.

In [39]:
modelname = "gpt2"
context_len = 32
device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)

We will now load the GPT-2 tokenizer to convert each word into a list of tokens. We prepend a space to each word before tokenizing because GPT-2's tokenizer encodes a word differently depending on whether it follows a space. Then, we will [explode](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) the dataframe so that each row of the dataframe is a token. Finally, we will convert our tokens to token IDs (integer IDs corresponding to entries in the GPT-2 vocabulary, which contains 50,257 tokens) and use them as the input to GPT-2.

In [40]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(modelname)

df.insert(0, "word_idx", df.index.values)
df["hftoken"] = df.word.apply(lambda x: tokenizer.tokenize(" " + x))
df = df.explode("hftoken", ignore_index=True)
df["token_id"] = df.hftoken.apply(tokenizer.convert_tokens_to_ids)

Then we will load the GPT-2 model. We can inspect its configuration in `model.config`.

In [42]:
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(modelname)

print(
    f"Model : {modelname}"
    f"\nLayers: {model.config.num_hidden_layers}"
    f"\nEmbDim: {model.config.hidden_size}"
    f"\nConfig: {model.config}"
)
model = model.eval()
model = model.to(device)

Loading model...
Model : gpt2
Layers: 12
EmbDim: 768
Config: GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.45.2",
  "use_cache": true,
  "vocab_size": 50257
}



Since our transcript contains far more tokens than the context length (32), we will reformat the token IDs into `data`, a torch tensor of shape (number of tokens, 33). Each row is the input for one token: to extract the features for a token with a context length of 32, we need to give GPT-2 33 tokens, namely the token itself and the 32 tokens that precede it. The first 32 tokens in the transcript have fewer than 32 preceding tokens, so we pad their rows on the left to length 33, using the tokenizer's `pad_token_id` if one is defined and 0 otherwise.

In [48]:
token_ids = df.token_id.tolist()
fill_value = 0
if tokenizer.pad_token_id is not None:
    fill_value = tokenizer.pad_token_id

data = torch.full((len(token_ids), context_len + 1), fill_value, dtype=torch.long)
for i in range(len(token_ids)):
    example_tokens = token_ids[max(0, i - context_len) : i + 1]
    data[i, -len(example_tokens) :] = torch.tensor(example_tokens)
data.shape

torch.Size([5491, 33])
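
The windowing logic is easier to follow on a tiny example. The sketch below mirrors the loop above in NumPy with a context length of 2 instead of 32; `build_windows` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

def build_windows(token_ids, context_len, fill_value=0):
    """Return an array of shape (n_tokens, context_len + 1), padded on the left."""
    data = np.full((len(token_ids), context_len + 1), fill_value, dtype=np.int64)
    for i in range(len(token_ids)):
        # The current token plus up to `context_len` preceding tokens
        window = token_ids[max(0, i - context_len) : i + 1]
        data[i, -len(window):] = window
    return data

# Toy token IDs and a context length of 2 (the tutorial uses 32)
windows = build_windows([11, 12, 13, 14], context_len=2)
```

Each row ends with the token whose features we want, and only the earliest rows contain padding.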

We will use [`Accelerator`](https://github.com/huggingface/accelerate) to make feature extraction more efficient. The library includes a [`find_executable_batch_size`](https://huggingface.co/docs/accelerate/v0.11.0/en/memory) decorator, which finds a batch size that fits in memory by halving the batch size after each failed run of the decorated function (in this case, our `inference_loop` function).
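
The halving strategy can be sketched in a few lines of plain Python. This is not accelerate's implementation: `run_with_halving` and `fake_inference` are hypothetical names, and a `MemoryError` stands in for a real out-of-memory failure.

```python
def run_with_halving(function, starting_batch_size):
    """Call `function(batch_size)`, halving the batch size after each MemoryError."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            return function(batch_size)
        except MemoryError:
            batch_size //= 2  # retry with half the batch size
    raise RuntimeError("No executable batch size found.")

tried = []

def fake_inference(batch_size):
    # Hypothetical workload that only "fits in memory" for batch sizes up to 8
    tried.append(batch_size)
    if batch_size > 8:
        raise MemoryError("pretend out-of-memory")
    return batch_size

used = run_with_halving(fake_inference, starting_batch_size=32)
```

Starting from 32, the sketch tries 32, then 16, then 8, and stops at the first batch size that runs without error.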

Inside the `inference_loop` function, we will use a PyTorch `DataLoader` to supply token IDs to the model in batches and extract the features. In addition to the embeddings, we'll also extract several other features of potential interest from the model. As GPT-2 proceeds through the text, it generates a score for every token in the vocabulary (the `logits` extracted below), which a softmax converts into a probability distribution over the next token. Because the output at each position is a prediction of the token that follows, we read the prediction for the current token from the second-to-last position (`logits[:, -2, :]`) and compare it with the current token itself (`batch[:, -1]`). We can use this probability distribution to derive other features of the model's internal computations. We'll extract the following features from GPT-2:

* **embeddings**: the 768-dimensional contextual embedding capturing the meaning of the current word
* **top_guesses**: the token to which GPT-2 assigned the highest probability at the current position (stored as a token ID)
* **ranks**: the rank of the correct token when the vocabulary is sorted by predicted probability (0 means the correct token was the top guess)
* **true_probs**: the probability GPT-2 assigned to the correct (current) token
* **entropies**: how uncertain GPT-2 was about the current word
    * low entropy indicates that the probability distribution was "focused" on certain words
    * high entropy indicates the probability distribution was more uniform/dispersed across words
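
The prediction-based features in the list above can be computed by hand on a toy example. The sketch below uses NumPy and made-up logits over a four-token vocabulary; the numbers are assumptions chosen so the results are easy to verify, not GPT-2 output.

```python
import numpy as np

# Toy logits over a 4-token vocabulary for two positions (made-up numbers)
logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.0, 0.0, 0.0, 0.0]])
true_ids = np.array([1, 3])  # the tokens that actually occurred

# Softmax turns logits into probabilities
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

order = np.argsort(-logits, axis=1, kind="stable")  # most to least likely
top_guesses = order[:, 0]
# Position of the true token in the sorted order; 0 means the top guess was correct
ranks = np.argmax(order == true_ids[:, None], axis=1)
true_probs = probs[np.arange(len(true_ids)), true_ids]
entropies = -(probs * np.log(probs)).sum(axis=1)
```

The second row has identical logits for every token, so its distribution is uniform and its entropy takes the maximum possible value for four tokens, log(4). The first row is more peaked and therefore has lower entropy.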

In [59]:
accelerator = Accelerator()

@find_executable_batch_size(starting_batch_size=32)
def inference_loop(batch_size=32):
    # nonlocal accelerator  # Ensure they can be used in our context
    accelerator.free_memory()  # Free all lingering references

    data_dl = torch.utils.data.DataLoader(
        data, batch_size=batch_size, shuffle=False
    )

    top_guesses = []
    ranks = []
    true_probs = []
    entropies = []
    embeddings = []

    with torch.no_grad():
        for batch in data_dl:
            # Get output from model
            output = model(batch.to(device), output_hidden_states=True)
            logits = output.logits
            states = output.hidden_states

            true_ids = batch[:, -1]
            brange = list(range(len(true_ids)))
            logits_order = logits[:, -2, :].argsort(descending=True)
            batch_top_guesses = logits_order[:, 0]
            batch_ranks = torch.eq(
                logits_order, true_ids.reshape(-1, 1).to(device)
            ).nonzero()[:, 1]
            batch_probs = torch.softmax(logits[:, -2, :], dim=-1)
            batch_true_probs = batch_probs[brange, true_ids]
            batch_entropy = torch.distributions.Categorical(
                probs=batch_probs
            ).entropy()
            batch_embeddings = [
                state[:, -1, :].numpy(force=True) for state in states
            ]

            top_guesses.append(batch_top_guesses.numpy(force=True))
            ranks.append(batch_ranks.numpy(force=True))
            true_probs.append(batch_true_probs.numpy(force=True))
            entropies.append(batch_entropy.numpy(force=True))
            embeddings.append(batch_embeddings)

        return top_guesses, ranks, true_probs, entropies, embeddings

top_guesses, ranks, true_probs, entropies, embeddings = inference_loop()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Now we will add the additional information from GPT-2 as columns to `df`. We will then save it for future encoding.

In [62]:
df["rank"] = np.concatenate(ranks)
df["true_prob"] = np.concatenate(true_probs)
df["top_pred"] = np.concatenate(top_guesses)
df["entropy"] = np.concatenate(entropies)

df.to_csv(path.join(output_dir_gpt2, "transcript.tsv"), sep="\t")

Similarly, we will save the contextual word embeddings, which are 768-dimensional features, for future encoding. Note that since GPT-2 has 12 transformer layers, we can extract features from before and after each of the transformer layers, which gives us 13 different versions of GPT-2 features that we could use for encoding. Here, we save them to one file as separate datasets named by layer index, from `layer-0` (the input to the first transformer layer) to `layer-12` (after the last transformer layer).

In [63]:
with h5py.File(path.join(output_dir_gpt2, "states.hdf5"), "w") as f:
    for i in range(len(embeddings[0])):
        layer_embeddings = np.vstack([e[i] for e in embeddings])
        f.create_dataset(name=f"layer-{i}", data=layer_embeddings)
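
The nested structure of `embeddings` (one entry per batch, each holding one array per layer) is what makes the `np.vstack` call necessary. Here is a small sketch with toy sizes; the values are filled with the layer index so the result is easy to check, and a dictionary stands in for the HDF5 file.

```python
import numpy as np

n_layers, dim = 3, 4     # toy sizes; the tutorial has 13 sets of 768-dimensional states
batch_sizes = [2, 2, 1]  # the last batch may be smaller than the others

# Mimic `embeddings`: one entry per batch, each a list with one array per layer
toy_embeddings = [
    [np.full((b, dim), layer, dtype=float) for layer in range(n_layers)]
    for b in batch_sizes
]

# Stack the batches back together, separately for each layer
per_layer = {
    f"layer-{i}": np.vstack([batch[i] for batch in toy_embeddings])
    for i in range(n_layers)
}
```

Each stacked array has one row per token, so every layer's dataset lines up row by row with the token-level dataframe.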

Now that we have the features, we are ready for encoding!