In [1]:
!pip install -qU \
  datasets==2.14.6 \
  transformers==4.35.0


## Dataset Download

Next we test a more realistic use case with messy, imperfect data: the [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) dataset, which contains text chunks extracted from AI-related arXiv papers.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

First we define our embedding function.

In [5]:
import numpy as np
import torch
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device}")

model_id = "jinaai/jina-embeddings-v2-base-en"

# initialize the model; trust_remote_code=True is required because
# encode() is defined by custom code in the model repository
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True
).to(device)
model.eval()

def embed(docs: list[str]) -> np.ndarray:
    with torch.no_grad():
        # encode() returns one embedding per document (not per token)
        doc_embeds = model.encode(docs)
    return doc_embeds

Using cuda


We use this function to build a NumPy array of embedding vectors, embedding the chunks in batches of 256.

In [6]:
from tqdm.auto import tqdm
import numpy as np

chunks = data["chunk"]
batch_size = 256

for i in tqdm(range(0, len(chunks), batch_size)):
    i_end = min(len(chunks), i+batch_size)
    chunk_batch = chunks[i:i_end]
    # embed current batch
    embed_batch = embed(chunk_batch)
    # add to existing np array if exists (otherwise create)
    if i == 0:
        arr = embed_batch.copy()
    else:
        arr = np.concatenate([arr, embed_batch.copy()])

  0%|          | 0/163 [00:00<?, ?it/s]
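The loop above calls `np.concatenate` once per batch, which copies the growing array each time. A minimal sketch of an alternative, assuming the same batching scheme, collects the batches in a list and concatenates once at the end. `fake_embed` and `embed_in_batches` are hypothetical names introduced for illustration; `fake_embed` is only a stand-in for `embed` so the snippet runs without downloading the model.

```python
import numpy as np

def fake_embed(docs: list[str], dim: int = 8) -> np.ndarray:
    # Hypothetical stand-in for `embed`: one deterministic row per document.
    return np.array([[float(len(d))] * dim for d in docs], dtype=np.float32)

def embed_in_batches(docs: list[str], embed_fn, batch_size: int = 256) -> np.ndarray:
    # Assumes `docs` is non-empty; np.concatenate raises on an empty list.
    batches = []
    for i in range(0, len(docs), batch_size):
        # slicing past the end is safe, so no explicit min() is needed
        batches.append(np.asarray(embed_fn(docs[i:i + batch_size])))
    # a single concatenation instead of one per batch
    return np.concatenate(batches, axis=0)

toy_docs = [f"doc {n}" for n in range(10)]
toy_arr = embed_in_batches(toy_docs, fake_embed, batch_size=4)
print(toy_arr.shape)
```

With the real model, passing `embed` in place of `fake_embed` should produce the same array as the loop above, one row per chunk.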

Now we need a query mechanism. This is simply a cosine similarity calculation between the query vector and every vector in `arr`, after which we print the `top_k` most similar chunks.
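To make the calculation concrete, here is a small worked sketch on toy vectors rather than real embeddings. It also checks that normalizing the rows to unit length up front turns cosine similarity into a plain dot product. The names `cosine_similarities`, `toy_matrix`, and `toy_query` are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_similarities(matrix: np.ndarray, vector: np.ndarray) -> np.ndarray:
    # dot product of each row with the vector, divided by the product of their norms
    return matrix @ vector / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))

toy_matrix = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = np.array([1.0, 0.0])

sims = cosine_similarities(toy_matrix, toy_query)

# Equivalent route: scale everything to unit length once, then use a dot product.
unit_matrix = toy_matrix / np.linalg.norm(toy_matrix, axis=1, keepdims=True)
unit_query = toy_query / np.linalg.norm(toy_query)
sims_unit = unit_matrix @ unit_query

print(sims)
print(np.allclose(sims, sims_unit))
```

The first row points in the same direction as the query, the second is orthogonal to it, and the third sits at 45 degrees, so the scores come out at roughly 1, 0, and 0.707.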

In [7]:
from numpy.linalg import norm

# convert chunks list to array for easy indexing
chunk_arr = np.array(chunks)

def query(text: str, top_k: int = 3) -> None:
    # create query embedding
    xq = embed([text])[0]
    # calculate cosine similarity between the query and every chunk embedding
    sim = np.dot(arr, xq) / (norm(arr, axis=1) * norm(xq))
    # get indices of the top_k most similar records
    # (argpartition does not return them in sorted order)
    idx = np.argpartition(sim, -top_k)[-top_k:]
    docs = chunk_arr[idx]
    # print each retrieved chunk followed by a separator
    for d in docs.tolist():
        print(d)
        print("----------")
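Note that `np.argpartition` only guarantees that the `top_k` highest scores end up in the last `top_k` positions; it does not order them. If ranked results are wanted, one option is to sort just those candidates afterwards. The sketch below uses toy scores, and `top_k_ranked` is a hypothetical helper, not part of the notebook. It assumes `top_k` is at least 1.

```python
import numpy as np

def top_k_ranked(sim: np.ndarray, top_k: int = 3) -> np.ndarray:
    # never ask for more results than there are scores
    top_k = min(top_k, sim.shape[0])
    # the last top_k slots hold the indices of the largest scores, in arbitrary order
    candidates = np.argpartition(sim, -top_k)[-top_k:]
    # sort only those candidates, highest score first
    return candidates[np.argsort(sim[candidates])[::-1]]

toy_sim = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
ranked = top_k_ranked(toy_sim, top_k=3)
print(ranked)
```

Sorting only the `top_k` candidates keeps the cost low compared with a full `argsort` over all 41,584 similarity scores.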

In [8]:
query("why should I use llama 2?")

Equal contribution. Correspondence: {htouvron,
thibautlav,gizacard,egrave,glample}@meta.com
1https://github.com/facebookresearch/llamaperformance, a smaller one trained longer will
ultimately be cheaper at inference. For instance,
although Hoffmann et al. (2022) recommends
training a 10B model on 200B tokens, we ﬁnd
that the performance of a 7B model continues to
improve even after 1T tokens.
The focus of this work is to train a series of
language models that achieve the best possible performance at various inference budgets, by training
on more tokens than what is typically used. The
resulting models, called LLaMA , ranges from 7B
to 65B parameters with competitive performance
compared to the best existing LLMs. For instance,
LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10 smaller. We believe that
this model will help democratize the access and
study of LLMs, since it can be run on a single GPU.
At the higher-end of the scale, our 65B-parameter
model is also competi

In [9]:
query("can you tell me about red teaming for llama 2?")

by red teams allow organizations to improve security and sys tem integrity before and during deployment.
Knowledge that a lab has a red team can potentially improve th e trustworthiness of an organization with
respect to their safety and security claims, at least to the e xtent that effective red teaming practices exist
and are demonstrably employed.
As indicated by the number of cases in which AI systems cause o r threaten to cause harm, developers of an
AI system often fail to anticipate the potential risks assoc iated with technical systems they develop. These
risks include both inadvertent failures and deliberate mis use. Those not involved in the development
of a particular system may be able to more easily adopt and pra ctice an attacker’s skillset. A growing
number of industry labs have dedicated red teams, although b est practices for such efforts are generally
in their early stages.24There is a need for experimentation both within and across or ganizations in order
to move red

In [10]:
query("what is the best llm?")

et al. (2022), in which a LLM is trained and refined
on its own output iteratively. Specifically, with CoT
prompting, the model first generates initial rationales. And then, the model is finetuned on rationales that lead to correct answers. This process can
be repeated, with each iteration resulting in an improved model that can generate better training data,
which in turn leads to further improvements. As a
follow-up to this work, Huang et al. (2022a) show
that LLMs are able to self-improve their reasoning
abilities without the need for supervised data by
leveraging the self-consistency of reasoning (Wang
et al., 2022c).
4 Measuring Reasoning in Large
Language Models
We summarize methods and benchmarks for evaluating reasoning abilities of LLMs in this section.
4.1 End Task Performance
One way to measure reasoning abilities of LLMs is
to report their performance, e.g., accuracy, on end
tasks that require reasoning. We list some common
benchmarks as follows.
Arithmetic Reasoning. Arith

In [None]:
query("what is the difference between gpt-4 and llama 2?")

-0.043
-0.009+0.0132-0.004 +0.0562
+0.0387-0.012
-0.076Alpaca: 0.39 LLaMA-GPT4: 0.34 GPT4: 0.37Figure 6: ROUGE-L on unnatural instructions evaluated with 9K samples. The instructions are
grouped into four subsets based on the ground-truth response length. The mean values are reported in
the legend. The difference with GPT-4 is reported on the bar per group. LLaMA-GPT4 is a closer
proxy to GPT-4 than Alpaca.
closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and
GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to
make the response more chat-like, which probably leads to lower ROUGE-L scores.
5 R ELATED WORK
Instruction Tuning. Instruction tuning of LLMs is an increasingly popular research direction in
NLP (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2021). Existing works aim to improve
the quality and scale of three factors in the development pipeline, including instruction-following
----------

---