In [1]:
!pip install -qU \
  datasets==2.14.6 \
  voyageai==0.1.3

## Dataset Download

We're going to test a more realistic use case, with messy, imperfect data. We will use the [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) dataset.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

First we define our embedding function.

In [3]:
import os
import voyageai

voyageai.api_key = os.getenv("VOYAGE_API_KEY") or "YOUR_VOYAGE_API_KEY"
model_id = "voyage-01"

def embed(docs: list[str]) -> list[list[float]]:
    if len(docs) > 8:
        raise ValueError("List of documents cannot be longer than 8")
    doc_embeds = voyageai.get_embeddings(docs, model=model_id)
    return doc_embeds
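
Because `embed` rejects more than 8 documents per call, longer lists have to be split into slices first. The following is a minimal sketch of that slicing, not part of the original notebook: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that mimics the 8-document limit so the example runs without an API key.

```python
from typing import Callable


def embed_in_batches(
    docs: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 8,
) -> list[list[float]]:
    """Call embed_fn on slices of at most batch_size documents and join the results."""
    embeddings: list[list[float]] = []
    for start in range(0, len(docs), batch_size):
        embeddings.extend(embed_fn(docs[start:start + batch_size]))
    return embeddings


def fake_embed(docs: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for embed(): same batch limit, toy 2-d vectors."""
    if len(docs) > 8:
        raise ValueError("List of documents cannot be longer than 8")
    return [[float(len(d)), 1.0] for d in docs]


vectors = embed_in_batches([f"doc {n}" for n in range(20)], fake_embed)
print(len(vectors))  # one vector per input document
```

In the notebook itself, the real `embed` function would be passed in place of `fake_embed`.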

Use this to build a NumPy array of Voyage embedding vectors.

In [4]:
from tqdm.auto import tqdm
import numpy as np

chunks = data["chunk"]
batch_size = 8

for i in tqdm(range(0, len(chunks), batch_size)):
    i_end = min(len(chunks), i+batch_size)
    chunk_batch = chunks[i:i_end]
    # embed current batch
    embed_batch = embed(chunk_batch)
    # start the array with the first batch, then append each later batch
    if i == 0:
        arr = np.array(embed_batch)
    else:
        arr = np.concatenate([arr, np.array(embed_batch)])

  0%|          | 0/5198 [00:00<?, ?it/s]
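
Calling `np.concatenate` inside the loop copies the whole array on every iteration, which gets slower as the array grows over thousands of batches. One alternative, shown here only as a sketch with toy data standing in for the `embed` results, is to collect the vectors in a plain list and convert to an array once at the end. The names `toy_batches`, `collected`, and `toy_arr` are illustrative and do not appear in the notebook.

```python
import numpy as np

# toy stand-ins for three embed() results: batches of 8, 8 and 4 vectors of dimension 5
rng = np.random.default_rng(0)
toy_batches = [rng.random((n, 5)).tolist() for n in (8, 8, 4)]

# accumulate vectors in a Python list, which is cheap to extend
collected = []
for embed_batch in toy_batches:
    collected.extend(embed_batch)

# convert to a NumPy array a single time
toy_arr = np.array(collected)
print(toy_arr.shape)  # (20, 5)
```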

Now we need to create the query mechanism. This is simply a cosine similarity calculation between a query vector and each of the vectors in `arr`.

In [13]:
from numpy.linalg import norm

# convert chunks list to array for easy indexing
chunk_arr = np.array(chunks)

def query(text: str, top_k: int=3) -> None:
    # create query embedding
    xq = np.array(embed([text])[0])
    # calculate cosine similarities
    sim = np.dot(arr, xq) / \
        (norm(arr, axis=1)*norm(xq))
    # get indices of the top_k records (argpartition does not sort them)
    idx = np.argpartition(sim, -top_k)[-top_k:]
    docs = chunk_arr[idx]
    for d in docs.tolist():
        print(d)
        print("----------")
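
`np.argpartition` only guarantees that the last `top_k` positions hold the `top_k` largest similarities; it does not order them, so `query` may print the best match last. The sketch below shows one way to rank them, using small toy vectors instead of real embeddings. `top_k_sorted`, `toy_vectors`, and `toy_query` are hypothetical names introduced for illustration.

```python
import numpy as np
from numpy.linalg import norm


def top_k_sorted(
    vectors: np.ndarray, xq: np.ndarray, top_k: int = 3
) -> tuple[np.ndarray, np.ndarray]:
    """Return indices and cosine similarities of the top_k rows, best match first."""
    # cosine similarity between the query and every row
    sim = vectors @ xq / (norm(vectors, axis=1) * norm(xq))
    # unordered indices of the top_k largest similarities
    idx = np.argpartition(sim, -top_k)[-top_k:]
    # order just those top_k indices by descending similarity
    idx = idx[np.argsort(sim[idx])[::-1]]
    return idx, sim[idx]


toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.0])

idx, scores = top_k_sorted(toy_vectors, toy_query, top_k=2)
print(idx.tolist())  # [0, 2]
```

Row 0 points in the same direction as the query (similarity 1.0), and row 2 sits at 45 degrees to it (similarity about 0.707), so they rank first and second.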

In [14]:
len(embed(["text"])[0])

1024

In [15]:
query("why should I use llama 2?")

models will be released as we improve model safety with community feedback.
License A custom commercial license is available at: ai.meta.com/resources/
models-and-libraries/llama-downloads/
Where to send commentsInstructions on how to provide feedback or comments on the model can be
found in the model README, or by opening an issue in the GitHub repository
(https://github.com/facebookresearch/llama/ ).
Intended Use
Intended Use Cases L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle is intended for commercial and research use in English. Tuned models
are intended for assistant-like chat, whereas pretrained models can be adapted
for a variety of natural language generation tasks.
Out-of-Scope Uses Use in any manner that violates applicable laws or regulations (including trade
compliancelaws). UseinlanguagesotherthanEnglish. Useinanyotherway
that is prohibited by the Acceptable Use Policy and Licensing Agreement for
L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle.
Hardware and Software (Section 2.2)
Trainin

In [16]:
query("can you tell me about red teaming for llama 2?")

Asian 3 2.6%
Black or African American 10 8.7%
Hispanic, Latino, or Spanish 1 0.9%
Middle Eastern or North African 1 0.9%
Native Hawaiian or Paciﬁc Islander 1 0.9%
White or Caucasian 94 81.7%
Prefer not to say 1 0.9%
Other 2 1.7%
Education
High school or some college 40 34.8%
College degree 62 53.9%
Graduate or professional degree 12 10.4%
Prefer not to say 0 0%
Other 1 0.9%
Disability
Hearing difﬁculty 0 0%
Vision difﬁculty 1 0.9%
Cognitive difﬁculty 1 0.9%
Ambulatory (mobility) difﬁculty 4 3%
Self-care difﬁculty 1 0.9%
Other 2 1.5%
None 106 92%
Figure 4 Results of a demographic survey completed by 115of324red team members.
model that evaluates the inherent efﬁcacy of a red team member, which we plot in Figure 5 (Right). We
ﬁnd that some workers are particularly effective at red teaming, whereas others are not. In Appendix A.3 we
----------
cyber); ﬁndingsonthesetopicsweremarginal andweremitigated. Nonetheless, wewill continueourred
teaming eﬀorts in this front.
Todate,allofourredteam

In [17]:
query("what is the performance of llama 2?")

the provided license and our Acceptable Use Policy , which prohibit any uses that would violate applicable
policies, laws, rules, and regulations.
Wealsoprovidecodeexamplestohelpdevelopersreplicateoursafegenerationswith L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc and
applybasicsafetytechniquesattheuserinputandmodeloutputlayers. Thesecodesamplesareavailable
here: https://github.com/facebookresearch/llama . Finally,wearesharinga ResponsibleUseGuide ,which
provides guidelines regarding safe development and deployment.
ResponsibleRelease. WhilemanycompanieshaveoptedtobuildAIbehindcloseddoors,wearereleasing
L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle openly to encourage responsible AI innovation. Based on our experience, an open approach draws
uponthecollectivewisdom,diversity,andingenuityoftheAI-practitionercommunitytorealizethebeneﬁtsof
thistechnology. Collaborationwillmakethesemodelsbetterandsafer. TheentireAIcommunity—academic
researchers, civil society, policymakers, and industr

In [18]:
query("what is the best llm?")

EL- LEX 7 67.5 70.0 69.5 69.0 42.9 47.7 63.4 55.7 63.6 41.4 53.6 55.4 25.1 49.9 46.5 58.6 61.8 44.8 52.3 52.8
MF1-LEX 3 63.3 63.4 61.7 62.8 41.0 58.9 53.6 50.1 64.7 40.3 54.8 53.9 31.3 49.8 33.5 38.9 67.8 42.4 47.3 46.0
MF1-LEX 7 64.7 64.3 63.4 64.1 46.3 58.1 61.3 54.6 65.7 46.4 53.1 55.5 36.5 53.1 36.1 44.4 70.2 43.1 52.2 49.2
MF10
KM EANS-LEX 3 64.8 61.9 64.2 63.7 35.9 52.0 56.0 51.7 57.5 41.5 50.3 51.1 30.3 47.4 46.2 52.6 69.2 44.1 49.2 52.3
MF10
----------
worse on multi-turn conversations, which could be due to its lack of multi-turn supervised ﬁne-tuning data.
InFigure19,weshowtheper-categorysafetyviolationpercentageofdiﬀerentLLMs. Whilemodelperformanceissimilaracrosscategories, L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc hasrelativelymoreviolationsunderthe unqualiﬁedadvice
category (although still low in an absolute sense), for various reasons, including lack of an appropriate
disclaimer (e.g., “I am not a professional” ) at times. For the other two categories, L/l.s

In [19]:
query("what is the difference between gpt-4 and llama 2?")

(ii)For GPT-4 results alone, the translated responses show superior performance over the generated
response in Chinese, probably because GPT-4 is trained in richer English corpus than Chinese, which
leads to stronger English instruction-following ability. In Figure 5 (c), we show results for all models
who are asked to answer in Chinese.
We compare LLaMA-GPT4 with GPT-4 and Alpaca unnatural instructions in Figure 6. In terms of the
average ROUGE-L scores, Alpaca outperforms the other two models. We note that LLaMA-GPT4 and
GPT4 is gradually performing better when the ground truth response length is increasing, eventually
showing higher performance when the length is longer than 4. This means that they can better follow
instructions when the scenarios are more creative. Across different subsets, LLaMA-GPT4 can
7
0-2 3-5 6-10 10>
Groundtruth Response Length0.30.40.5RougeL
-0.043
-0.009+0.0132-0.004 +0.0562
+0.0387-0.012
----------
to GPT-3 corresponds to the Stanford Alpaca model. From F

---