# PyPremise Example: Word Embeddings

PyPremise enables easy identification of interpretable patterns in classifier performance. Beyond basic token matching, we can use word embeddings to capture semantic similarity between words. For example, this allows the model to group "photo", "picture", and "image" together.

In this notebook, we explore how to incorporate FastText or other custom word embeddings into PyPremise for richer pattern discovery.

**Why Use Word Embeddings?**

Without embeddings, pattern discovery only works with exact word matches. With embeddings, semantically similar words can be grouped together:
```
("photo" or "picture" or "image") → ✔ With embeddings
("photo" only)                   → ✖ Without embeddings
```
Word embeddings like FastText can represent each word as a vector in a high-dimensional space, where similar words are close together.
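
As a toy illustration of what "close together" means, the sketch below compares small hand-made vectors using cosine similarity. The vectors are invented for this example and are not real FastText embeddings; cosine similarity is one common way to measure closeness and is an assumption here, not necessarily the measure Premise uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "photo":   [0.9, 0.1, 0.0],
    "picture": [0.8, 0.2, 0.1],
    "banana":  [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["photo"], toy_vectors["picture"])
sim_unrelated = cosine_similarity(toy_vectors["photo"], toy_vectors["banana"])
print(f"photo vs picture: {sim_related:.2f}")
print(f"photo vs banana:  {sim_unrelated:.2f}")
```

In this toy setup, the related pair scores much higher than the unrelated pair, which is the property that makes grouping by embedding possible.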

To use FastText .bin embeddings with PyPremise, make sure the fasttext Python package is installed.

In [1]:
pip install fasttext

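
If the installation step fails, for example because pip is not available in the kernel's environment, later cells will raise an import error. A minimal sketch of a check you could run first is shown below; `is_installed` is a hypothetical helper written for this example, not part of PyPremise.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("fasttext"):
    print("fasttext is available.")
else:
    print("fasttext is missing; install it before loading .bin embeddings.")
```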


We begin by importing all the packages we will need.

In [4]:
from pypremise import Premise, data_loaders

You’ll also need to have a vocabulary mapping prepared from your tokenized data:

In [5]:
# These should be created when you construct your PremiseInstance objects
# voc_index_to_token: mapping from token index to token string
premise_instances, _, voc_index_to_token = data_loaders.get_dummy_data()

PyPremise provides built-in support for FastText .bin files (e.g. from fasttext.cc). Use this if your vocabulary is in English and you want good out-of-the-box semantic clustering.

**Downloading FastText Word Embeddings**

To use pretrained FastText word embeddings, first download the `.bin` file. The download is gzip-compressed (`.bin.gz`) and needs to be decompressed before use.

Download link (English, 300 dimensions):
https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

Official FastText vectors page:
https://fasttext.cc/docs/en/crawl-vectors.html
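
The file behind the download link ends in `.gz`, while the loading cell expects a plain `.bin` file. One way to decompress it from Python is sketched below, using only the standard library; `decompress_gzip` is a hypothetical helper for this example, and the path in the usage comment is a placeholder.

```python
import gzip
import shutil
from pathlib import Path

def decompress_gzip(gz_path, out_path=None):
    """Decompress `gz_path` to `out_path` (default: same name without '.gz')."""
    gz_path = Path(gz_path)
    out_path = Path(out_path) if out_path is not None else gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in chunks so the whole file is never held in memory at once.
        shutil.copyfileobj(src, dst)
    return out_path

# Example usage (placeholder path):
# decompress_gzip("/path/to/cc.en.300.bin.gz")
```

Streaming the copy matters here because the decompressed English model is several gigabytes in size.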

In [None]:
# Point this to your downloaded FastText binary file
fasttext_path = "/path/to/cc.en.300.bin"  # e.g. 300-dimensional English embeddings

embedding_index_to_vector, embedding_dimensionality = data_loaders.create_fasttext_mapping(
    fasttext_path, voc_index_to_token
)


Loading FastText model, this might take a bit.
FastText loaded. Mapping the tokens to their embeddings.


Then initialize Premise with these embeddings:

In [7]:
premise = Premise(
    voc_index_to_token=voc_index_to_token,
    embedding_index_to_vector=embedding_index_to_vector,
    embedding_dimensionality=embedding_dimensionality,
    max_neighbor_distance=2
)

This tells Premise to use semantic proximity (via the embedding space) when searching for patterns.
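
To build intuition for what a neighbor limit means, the sketch below finds the `k` closest entries to a token in a small index-to-vector mapping. It is an illustration only: the toy vectors are invented, Euclidean distance is an assumption, and Premise's internal neighbor search may work differently.

```python
import math

def nearest_neighbors(index, index_to_vector, k):
    """Return the k indices whose vectors are closest to that of `index`."""
    target = index_to_vector[index]
    distances = [
        (math.dist(target, vector), other)
        for other, vector in index_to_vector.items()
        if other != index
    ]
    distances.sort()
    return [other for _, other in distances[:k]]

# Hypothetical toy vocabulary and vectors, for illustration only.
toy_index_to_token = {0: "photo", 1: "photograph", 2: "picture", 3: "many"}
toy_index_to_vector = {
    0: [0.9, 0.1],
    1: [0.85, 0.15],
    2: [0.7, 0.3],
    3: [0.0, 1.0],
}

neighbors = nearest_neighbors(0, toy_index_to_vector, k=2)
print([toy_index_to_token[i] for i in neighbors])
```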

Now that Premise is embedding-aware, you can extract patterns as usual:

In [8]:
patterns = premise.find_patterns(premise_instances)
for p in patterns:
    print(p)

(How) and (many) towards group 0 (Instances: 9 in group 0, 0 in group 1)
(was) and (taken) and (When) and (photo-or-photograph) towards group 1 (Instances: 0 in group 0, 7 in group 1)


You can also use any other word embeddings of your choice. You just need to provide Premise with the following:

| **Parameter**                 | **Description**                                                                                              |
|------------------------------|--------------------------------------------------------------------------------------------------------------|
| `embedding_dimensionality`   | The dimensionality of the embedding vectors. Must match the number of dimensions (e.g., 300 for `cc.en.300.bin`). |
| `max_neighbor_distance`      | How many neighbors to look at. Should be a number > 0.                                                       |
| `embedding_index_to_vector`  | A mapping from an index to its corresponding embedding. Use `voc_token_to_index` to look up token indices.   |

**Note:** Make sure most tokens in your `voc_index_to_token` have a corresponding embedding to ensure good coverage.
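
As a sketch of how custom embeddings could be prepared, the code below builds an index-to-vector mapping from a token-to-vector dictionary and reports how much of the vocabulary is covered. It assumes that `voc_index_to_token` is a dictionary and that Premise accepts a dictionary from vocabulary index to vector; the helper names and toy data are hypothetical and not part of PyPremise.

```python
def build_embedding_mapping(voc_index_to_token, token_to_vector):
    """Map each vocabulary index to its vector, skipping tokens without one."""
    return {
        index: token_to_vector[token]
        for index, token in voc_index_to_token.items()
        if token in token_to_vector
    }

def embedding_coverage(voc_index_to_token, embedding_index_to_vector):
    """Fraction of vocabulary indices that have an embedding."""
    if not voc_index_to_token:
        return 0.0
    covered = sum(1 for index in voc_index_to_token if index in embedding_index_to_vector)
    return covered / len(voc_index_to_token)

# Hypothetical toy data, for illustration only.
toy_voc_index_to_token = {0: "photo", 1: "picture", 2: "xyzzy"}
toy_token_to_vector = {"photo": [0.9, 0.1], "picture": [0.8, 0.2]}

toy_embedding_index_to_vector = build_embedding_mapping(
    toy_voc_index_to_token, toy_token_to_vector
)
toy_dimensionality = len(next(iter(toy_embedding_index_to_vector.values())))
coverage = embedding_coverage(toy_voc_index_to_token, toy_embedding_index_to_vector)
print(f"dimensionality={toy_dimensionality}, coverage={coverage:.0%}")
```

A low coverage value is a hint that the embedding source does not match the tokenization of your data.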
