This post is adapted from a Jupyter [Notebook found on GitHub](https://github.com/hodgesmr/llm_ai_eo).


On October 30th, 2023, President Biden signed the [Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence](https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/). The order itself is quite sweeping and touches many government departments and agencies, with a focus on harnessing AI's potential and defending against harms and risks.

In this post, we'll deploy language models to rapidly discover information from the Order. For the easiest setup, I recommend trying this out in a Google Colab notebook.

<a target="_blank" href="https://colab.research.google.com/github/hodgesmr/llm_ai_eo/blob/main/llm_ai_eo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a> <a target="_blank" href="https://github.com/hodgesmr/llm_ai_eo/blob/main/llm_ai_eo.ipynb">
  <img src="https://img.shields.io/badge/-Open_in_Github-blue?logo=github&labelColor=gray" alt="Open In Github"/>
</a>


Many of the strategies presented here extend Simon Willison's work in his blog post, [Embedding paragraphs from my blog with E5-large-v2](https://til.simonwillison.net/llms/embed-paragraphs). Simon also maintains a handy command-line utility for working with various LLMs, aptly named [LLM](https://llm.datasette.io/en/stable/). While Simon's writing largely focuses on the CLI capabilities of the tool (and the usefully opinionated integrations with SQLite), I prefer working with Pandas Dataframes. Here I show how to use the LLM library in that fashion.

Embeddings are kind of a magic black box to end users, but the basic idea is that language models can create vectors of numerical values that represent not only words or sentences, but also the semantic _meaning_ of those words. Early research on this subject comes from [word2vec](https://code.google.com/archive/p/word2vec/). To illustrate: `vector('king') - vector('man') + vector('woman')` is mathematically close to `vector('queen')`. I find that _fascinating_! We'll use this concept to extract and match information against the Executive Order text.
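To make the analogy concrete, here is a toy sketch. The three-dimensional vectors below are hand-made for illustration and are an assumption of this example, not real word2vec embeddings (which have hundreds of dimensions); `toy_vectors` and `toy_cosine` are hypothetical names.

```python
import math

# Hand-made toy vectors (NOT real word2vec output).
# Dimensions are loosely: [royalty, maleness, femaleness].
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def toy_cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# vector('king') - vector('man') + vector('woman')
analogy = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Compare the result against the words that were not part of the arithmetic
candidates = {
    word: vec
    for word, vec in toy_vectors.items()
    if word not in ("king", "man", "woman")
}
closest = max(candidates, key=lambda word: toy_cosine(analogy, candidates[word]))
print(closest)  # queen
```

With these toy values, the result of the arithmetic points almost exactly in the direction of `queen` and much less in the direction of `apple`.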

We'll deploy a technique known as [Retrieval-Augmented Generation](https://research.ibm.com/blog/retrieval-augmented-generation-RAG). At a high level, this allows us to inject context into an LLM without training or tuning it. We use another system to locate language that likely contains the answer to our query, and then ask the model to pull it out for us.

Our high-level strategy:

1.   Calculate embeddings on the Executive Order text
2.   Calculate embeddings on a query
3.   Calculate the cosine similarity between every text embedding and the query
4.   Select the top three passages that are semantically similar to the query
5.   Pass the passages and the query to the LLM for rapid summarization
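
The five steps above can be sketched end to end without downloading any models. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the sample passages are invented rather than quoted from the Order, and the final prompt is only assembled, not sent to an LLM.

```python
import math
from collections import Counter


def toy_embed(text):
    # Stand-in for a real embedding model: a sparse word-count vector.
    return Counter(text.lower().split())


def toy_similarity(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(passages, query, k=3):
    query_vec = toy_embed(query)  # step 2
    scored = [
        (toy_similarity(toy_embed(p), query_vec), p)  # steps 1 and 3
        for p in passages
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]  # step 4


# Invented sample passages (not text from the Executive Order)
passages = [
    "agencies shall report on ai safety testing",
    "the task force shall plan for ai in healthcare delivery",
    "visa policy for ai talent shall be streamlined",
    "healthcare providers must consider ai safety",
]
query = "ai in healthcare"

top = retrieve(passages, query, k=2)
prompt = "\n".join(top) + "\n" + query  # step 5: this is what the LLM would receive
print(prompt)
```

The rest of this post follows the same shape, swapping the toy pieces for E5-large-v2 embeddings and LLaMA 2.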

### Environment

First install the dependencies, which include the [MLC LLaMA 2 model](https://mlc.ai) for summarization, the [LLM](https://llm.datasette.io/en/stable/) library, and the [E5-large-v2](https://huggingface.co/intfloat/e5-large-v2) language model for text embedding.

Note that these models change constantly, and getting them up and running on your system might take some independent investigation. If running in Google Colab, check [this tutorial for MLC](https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb). If running LLaMA with the LLM library on macOS, check the [repository's instructions](https://github.com/simonw/llm-mlc).

In [1]:
pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu118 mlc-ai-nightly-cu118
git lfs install
pip install llm
llm install llm-sentence-transformers
llm sentence-transformers register intfloat/e5-large-v2 -a lv2
llm install llm-mlc
llm mlc setup
llm mlc download-model Llama-2-7b-chat --alias llama2

### Load Data

Before getting started, we need the Executive Order text to work against. This is probably the least interesting part of this post. I simply opened the Order in Firefox reader view, copied and pasted it into VSCode, did some manual find/replace to clean up the white space, and then concatenated paragraphs to get chunks as close to 400 words as I could. I picked 400 because the embedding model truncates at 512 _tokens_, and a token is either a word or a semantically important subset of a word, so I allowed for some buffer. _This took less than half an hour._ Rather than share code to do this work, I simply provide the cleaned text here.
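
The chunking described above was done by hand. For anyone who wants to automate it, a minimal sketch might look like the following. It is not the code used to produce the provided text: `chunk_paragraphs` and its `max_words` parameter are hypothetical, and it counts whitespace-separated words rather than model tokens.

```python
def chunk_paragraphs(paragraphs, max_words=400):
    """Greedily concatenate paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    chunks = []
    current = []
    current_words = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Close the current chunk if this paragraph would push it over budget
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Synthetic paragraphs of 150, 200, 100, and 450 words
paragraphs = [" ".join(["word"] * n) for n in (150, 200, 100, 450)]
chunks = chunk_paragraphs(paragraphs)
print([len(chunk.split()) for chunk in chunks])  # [350, 100, 450]
```

Note that the oversized 450-word paragraph is left intact, so a chunk like that would still risk truncation by the embedding model and would need to be split by hand.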

Load it into a Pandas Dataframe with a single column:



In [2]:
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/hodgesmr/llm_ai_eo/main/eo.txt",
    sep="_",  # trick: '_' never appears in the text, so each line is read as one single-column row
    header=None,
)
df.columns = ["passage"]

df.head()

|   | passage |
|---|---|
| 0 | By the authority vested in me as President by ... |
| 1 | (a) Artificial Intelligence must be safe and s... |
| 2 | (c) The responsible development and use of AI ... |
| 3 | (e) The interests of Americans who increasingl... |
| 4 | (g) It is important to manage the risks from t... |


### Calculate Embeddings

Now that we have a Dataframe of chunks of the Executive Order, we can calculate embeddings of each chunk. To do this we'll use the [E5-large-v2](https://huggingface.co/intfloat/e5-large-v2) language model, which was trained to handle text prefixed with either `passage: ` or `query: `. Every chunk is considered a passage. We'll add this as another column on our Dataframe.

In [3]:
import llm

embedding_model = llm.get_embedding_model("lv2")
text_to_embed = df.passage.to_list()

# Our embedding model expects `passage: ` prefixes
text_to_embed = [f'passage: {t}' for t in text_to_embed]

df['embedding'] = list(embedding_model.embed_multi(text_to_embed))

df.head()

|   | passage | embedding |
|---|---|---|
| 0 | By the authority vested in me as President by ... | [0.032344698905944824, -0.04333016648888588, 0... |
| 1 | (a) Artificial Intelligence must be safe and s... | [0.01886950619518757, -0.057347141206264496, 0... |
| 2 | (c) The responsible development and use of AI ... | [0.0486459881067276, -0.0712570995092392, 0.02... |
| 3 | (e) The interests of Americans who increasingl... | [0.03564070537686348, -0.04887280985713005, 0.... |
| 4 | (g) It is important to manage the risks from t... | [0.04095401614904404, -0.042341429740190506, 0... |


For our semantic search, we'll also need an embedding of our query, which the model expects to be prefixed with `query: `. Let's ask what the Order says regarding AI and healthcare:

In [4]:
query = "what does it say about healthcare?"

# Our embedding model expects a `query: ` prefix for retrieval
query_to_embed = f"query: {query}"
query_vector = embedding_model.embed(query_to_embed)

print(query_vector)

[0.011035123839974403, -0.06264020502567291, 0.036343760788440704, -0.022550513967871666, -0.004930663388222456, 0.027655886486172676, -0.04244294762611389, -0.026744479313492775, -0.022813718765974045, 0.013104002922773361, 0.027848346158862114, -0.041959188878536224, 0.02923852950334549, 0.03592933714389801, 0.02084210328757763, 0.028341282159090042, -0.02188134379684925, 0.009380431845784187, 0.010694948956370354, -0.046167585998773575, 0.04979575797915459, -0.04051537066698074, -0.04705166816711426, 0.054594166576862335, -0.021378282457590103, -0.006090054754167795, -0.027005767449736595, -0.0056915683671832085, -0.02485739439725876, 0.025049963966012, 0.0013038198230788112, 0.020098360255360603, 0.03132014721632004, -0.10214236378669739, 0.03457639366388321, -0.005869136657565832, -0.041733402758836746, -0.0533079169690609, 0.043018240481615067, 0.02142527513206005, -0.013251637108623981, 0.021434243768453598, -0.01846863329410553, 0.06185981631278992, -0.006901243235915899, -0.00

### Semantic Search

If we were using the LLM module's preferred structures for Collection and storing data in SQLite, we could simply use [llm similar](https://llm.datasette.io/en/stable/embeddings/cli.html#llm-similar) or its [corresponding Python API](https://llm.datasette.io/en/stable/embeddings/python-api.html#retrieving-similar-items). As far as I can tell, the API doesn't yet support other data structures of embeddings (like our Dataframe), so we'll have to calculate [cosine similarities](https://en.wikipedia.org/wiki/Cosine_similarity) ourselves. Lucky for us, we can [borrow from Simon's open source library](https://github.com/simonw/llm/blob/abcb457b20367ee56e27602e3553bb4bd6a17312/llm/__init__.py#L252):

In [5]:
def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)
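
For larger collections, the same calculation can be vectorized. The following is a sketch assuming NumPy is available (it is installed alongside Pandas); `cosine_similarity_matrix` is a hypothetical helper, not part of the LLM library, and the vectors are toy values.

```python
import numpy as np


def cosine_similarity_matrix(query_vector, embeddings):
    """Cosine similarity between one query vector and every row of embeddings."""
    query = np.asarray(query_vector, dtype=float)
    matrix = np.asarray(embeddings, dtype=float)  # shape: (n_passages, n_dims)
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms


# Toy 2-dimensional vectors: same direction, orthogonal, and 45 degrees apart
toy_query = [1.0, 0.0]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = cosine_similarity_matrix(toy_query, toy_embeddings)
print(scores)
```

Applied to this post's data, that would be something like `cosine_similarity_matrix(query_vector, df["embedding"].to_list())`, which returns one score per passage.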

Now, iterate over every embedding in our Dataframe and calculate the similarity score against our query embedding vector:

In [6]:
comp_df = df.copy()
comp_df['similarity'] = comp_df.apply(
    lambda row : cosine_similarity(
        query_vector,
        row.embedding,
    ),
    axis=1,
)

comp_df.head()

|   | passage | embedding | similarity |
|---|---|---|---|
| 0 | By the authority vested in me as President by ... | [0.032344698905944824, -0.04333016648888588, 0... | 0.781552 |
| 1 | (a) Artificial Intelligence must be safe and s... | [0.01886950619518757, -0.057347141206264496, 0... | 0.778486 |
| 2 | (c) The responsible development and use of AI ... | [0.0486459881067276, -0.0712570995092392, 0.02... | 0.779455 |
| 3 | (e) The interests of Americans who increasingl... | [0.03564070537686348, -0.04887280985713005, 0.... | 0.794971 |
| 4 | (g) It is important to manage the risks from t... | [0.04095401614904404, -0.042341429740190506, 0... | 0.785406 |


And select the 3 passages with the best similarity scores. We'll feed these as context to the LLaMA model.

In [7]:
best_3_matches = comp_df.sort_values("similarity", ascending = False).head(3)
context = "\n".join(best_3_matches.passage.values)

### Ask the LLM

Now that we've selected the top 3 passages, let's feed them into LLaMA 2.

In [8]:
model = llm.get_model("llama2")

Even though we're providing prefixed context to the model, it's helpful to give it a system prompt to guide how it responds. This can help it stay "focused" on the context and respond in the voice that we expect. The system prompt is open to creativity and experimentation.

In [9]:
system = "You are an assistant. You answer questions in a single \
paragraph about the policy from President Biden. The provided context \
comes directly from the policy. You MUST use the provided information \
as context. Not all provided information will be helpful, ONLY reference \
information if it is related to my query. You may quote the context \
information if helpful."

Now, feed the context and the query into the model.

In [10]:
print(f"Query: {query}\n")
response = model.prompt(
    f'{context}\n{query}',
    system=system,
)

print("Response:\n")
print(response.text())

Query: what does it say about healthcare?

Response:

The policy from President Biden related to healthcare is outlined in section 8(b)(i) of the policy, which states that:
"Within 90 days of the date of this order, the Secretary of HHS shall, in consultation with the Secretary of Defense and the Secretary of Veterans Affairs, establish an HHS AI Task Force that shall, within 365 days of its creation, develop a strategic plan that includes policies and frameworks — possibly including regulatory action, as appropriate — on responsible deployment and use of AI and AI-enabled technologies in the health and human services sector (including research and discovery, drug and device safety, healthcare delivery and financing, and public health), and identify appropriate guidance and resources to promote that deployment, including in the following areas:
(A) development, maintenance, and use of predictive and generative AI-enabled technologies in healthcare delivery and financing — including qua

Overall, this looks like it does a good job!

Of course, it's extremely important to keep a human in the loop when referencing government documents. The model may still hallucinate, or it could entirely miss important context. Some of these shortcomings are baked into the model itself; others are implementation details of this post.

If nothing else, this shows a fascinating interface for interacting with long, wordy documents!

{{< include ../_code-license.qmd >}}