In [1]:
!pip install -qU \
  datasets==2.14.6 \
  cohere==4.34

## Dataset Download

We're going to test with a more realistic use case built on messy, imperfect data: the [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) dataset.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

First we define our embedding function.

In [46]:
import os
from getpass import getpass
import cohere

cohere_key = os.getenv("COHERE_API_KEY") or getpass("Cohere API key: ")
co = cohere.Client(cohere_key)

def embed(docs: list[str]) -> list[list[float]]:
    doc_embeds = co.embed(
        docs,
        input_type="search_document",
        model="embed-english-v3.0"
    )
    return doc_embeds.embeddings

We use this to build a NumPy array of Cohere embedding vectors.

In [47]:
from tqdm.auto import tqdm
import numpy as np

chunks = data["chunk"]
batch_size = 128

for i in tqdm(range(0, len(chunks), batch_size)):
    i_end = min(len(chunks), i+batch_size)
    chunk_batch = chunks[i:i_end]
    # embed current batch
    embed_batch = embed(chunk_batch)
    # add to existing np array if exists (otherwise create)
    if i == 0:
        arr = np.array(embed_batch)
    else:
        arr = np.concatenate([arr, np.array(embed_batch)])

  0%|          | 0/325 [00:00<?, ?it/s]
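The loop above calls `np.concatenate` on every iteration, which copies the whole array each time. A minimal sketch of one alternative is to collect the batches in a list and concatenate once at the end. Here `fake_embed` is a hypothetical stand-in for the `embed` function so the snippet runs without an API key, and the 8-dimensional vectors are arbitrary.

```python
import numpy as np

def fake_embed(docs: list[str], dim: int = 8) -> list[list[float]]:
    # hypothetical stand-in for embed(): one random vector per document
    rng = np.random.default_rng(0)
    return rng.random((len(docs), dim)).tolist()

def embed_all(docs: list[str], batch_size: int = 128) -> np.ndarray:
    batches = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        batches.append(np.array(fake_embed(batch)))
    # a single concatenation at the end avoids re-copying the array each step
    return np.concatenate(batches)

demo_arr = embed_all([f"doc {n}" for n in range(300)], batch_size=128)
print(demo_arr.shape)  # (300, 8)
```

With roughly 41k chunks either approach finishes, so this is a matter of tidiness more than necessity.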

In [48]:
xq = co.embed(
    ["why should I use llama 2?"],
    input_type="search_query",
    model="embed-english-v3.0"
).embeddings
xq = np.array(xq[0])

In [52]:
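# dot-product similarity between every document vector and the query
sim = np.dot(arr, xq.T)
top_k = 3
# indices of the top_k highest scores (not sorted by score)
idx = np.argpartition(sim, -top_k)[-top_k:]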
idx

array([18290, 39437, 39445])
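One detail of the cell above: `np.argpartition` returns the indices of the `top_k` largest scores but does not guarantee any order among them. A minimal sketch of ranking those indices by score, using a small made-up `sim_demo` array in place of the real similarities:

```python
import numpy as np

sim_demo = np.array([0.10, 0.90, 0.30, 0.70, 0.50])
k = 3

# argpartition places the k largest values in the last k positions,
# in no particular order
part = np.argpartition(sim_demo, -k)[-k:]

# order those k indices by score, highest first
ranked = part[np.argsort(sim_demo[part])[::-1]]
print(ranked)            # [1 3 4]
print(sim_demo[ranked])  # [0.9 0.7 0.5]
```

The extra `argsort` only touches `k` elements, so it adds little cost over the partition itself.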

Now we need to create the query mechanism. This is a dot-product similarity calculation between a query vector and our `arr` vectors, which is equivalent to cosine similarity when the embeddings have unit length.

In [54]:
from numpy.linalg import norm

# convert chunks list to array for easy indexing
chunk_arr = np.array(chunks)

def query(text: str, top_k: int = 3) -> None:
    # create query embedding
    xq = co.embed(
        [text],
        input_type="search_query",
        model="embed-english-v3.0"
    ).embeddings
    xq = np.array(xq[0])
    # calculate dot-product similarities (cosine similarity for unit-length vectors)
    sim = np.dot(arr, xq.T)
    print(sim.shape)
    # get indices of top_k records (argpartition does not sort them by score)
    idx = np.argpartition(sim, -top_k)[-top_k:]
    print(sim[idx])
    # get docs and print
    docs = chunk_arr[idx]
    print(docs.shape)
    for d in docs.tolist():
        print(d)
        print("----------")
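The `query` function scores documents with a raw dot product. If the embeddings cannot be assumed to have unit length, cosine similarity needs an explicit normalization step. A minimal sketch using `norm`, with small made-up vectors (`docs_demo`, `q_demo`) and a hypothetical helper `cosine_sim` that is not part of the notebook:

```python
import numpy as np
from numpy.linalg import norm

def cosine_sim(matrix: np.ndarray, vec: np.ndarray) -> np.ndarray:
    # dot product divided by the product of the vector lengths
    return (matrix @ vec) / (norm(matrix, axis=1) * norm(vec))

docs_demo = np.array([[1.0, 0.0], [3.0, 3.0], [0.0, 2.0]])
q_demo = np.array([2.0, 0.0])

scores = cosine_sim(docs_demo, q_demo)
print(scores)  # approximately [1.0, 0.7071, 0.0]
```

Swapping this in for `np.dot(arr, xq.T)` would leave the ranking unchanged whenever all vectors already have the same length.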

In [55]:
query("why should I use llama 2?")

(41584,)
[0.47466855 0.53013851 0.53044737]
(3,)
Equal contribution. Correspondence: {htouvron,
thibautlav,gizacard,egrave,glample}@meta.com
1https://github.com/facebookresearch/llamaperformance, a smaller one trained longer will
ultimately be cheaper at inference. For instance,
although Hoffmann et al. (2022) recommends
training a 10B model on 200B tokens, we ﬁnd
that the performance of a 7B model continues to
improve even after 1T tokens.
The focus of this work is to train a series of
language models that achieve the best possible performance at various inference budgets, by training
on more tokens than what is typically used. The
resulting models, called LLaMA , ranges from 7B
to 65B parameters with competitive performance
compared to the best existing LLMs. For instance,
LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10 smaller. We believe that
this model will help democratize the access and
study of LLMs, since it can be run on a single GPU.
At the higher-end of t

In [56]:
query("can you tell me about red teaming for llama 2?")

(41584,)
[0.47529661 0.47952211 0.48869599]
(3,)
the training data [13], aiding in disinformation campaigns [12], generating extremist texts [37], spreading
falsehoods [35], and more [9, 10, 18, 57, 22, 51]. As AI systems improve, the scope of possible harms seems
likely to grow [22]. Many strategies have been developed to address some of these harms (e.g., [58, 4, 48,
36, 34, 19, 60]). One potentially useful tool for addressing harm is red teaming—using manual or automated
methods to adversarially probe a language model for harmful outputs, and then updating the model to avoid
such outputs [42, 20, 3, 11]. In this paper, we describe our early efforts to implement manual red teaming to
both make models safer and measure the safety of our models. The models trained with red team data were
described in [4], so here we focus on describing our red team results and techniques in detail in the hope that
others may beneﬁt from and improve on them.
Correspondence to: {deep, liane, jackson, ja

In [57]:
query("what is the best llm?")

(41584,)
[0.49388744 0.5080906  0.51699355]
(3,)
for ﬁtting LLMs is an enormous training dataset, e.g., the Pile [15], which contains documents from
Arxiv, PubMed, Stack Exchange, Wikipedia, as well as a subset of Common Crawl2, and GitHub,
among others. For these kinds of LLMs, [16] introduced the terminology of foundation models ,
which deﬁnes training on a very large data basis and the ability to adapt to a variety of downstream
tasks.
2.2 ChatGPT
ChatGPT is an LLM developed by OpenAI that was ﬁrst released on November 30th, 2022. The
user can directly prompt the model via an API in a conversational way, e.g., allowing for follow-up
questions or admission of mistakes [1]. The backbone of ChatGPT is based on the generative pretrained transformer series (GPT; [17, 18, 19]). Despite the success and capacity of the third GPT
iteration (GPT-3) [19] with 175B parameters, the challenge of engineering text prompts for achieving
the desired generative output remained. This is due to the auto

In [58]:
query("what is the difference between gpt-4 and llama 2?")

(41584,)
[0.63758657 0.63869209 0.64677286]
(3,)
to GPT-3 corresponds to the Stanford Alpaca model. From Figure 3(a), we observe that ( i) For the
“Helpfulness” criterion, GPT-4 is the clear winner with 54.12% of the votes. GPT-3 only wins 19.74%
of the time. ( ii) For the “Honesty” and “Harmlessness” criteria, the largest portion of votes goes
to the tie category, which is substantially higher than the winning categories but GPT-3 (Alpaca) is
slightly superior.
Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4 in
Figure 3(b). The observations are quite consistent over the three criteria: GPT-4-instruction-tuned
LLaMA performs similarly to the original GPT-4. We conclude that learning from GPT-4 generated
5
60% 70% 80% 90% 100%12345BRanking Group 94% 624 : 66792% 614 : 67091% 623 : 68289% 597 : 66989% 605 : 67891% 609 : 666
----------
of the reward model.
We compare all the chatbots in Figure 4(c,d). Instruction tuning of LLaMA with GPT-4 often
ach

---