# Data Prep for Canopy

In [None]:
!pip install -qU \
    canopy-sdk \
    datasets==2.14.6 \
    python-multipart

## Create JSON File

Canopy reads local JSONL or Parquet files whose records contain the fields `["id", "text", "source", "metadata"]`. We will use the [`jamescalam/ai-arxiv`](https://huggingface.co/datasets/jamescalam/ai-arxiv) dataset, which we first download from the Hugging Face Hub:

In [1]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv", split="train")
data

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 423
})

In [2]:
data[0]

{'id': '2210.03945',
 'title': 'Understanding HTML with Large Language Models',
 'summary': 'Large language models (LLMs) have shown exceptional performance on a variety\nof natural language tasks. Yet, their capabilities for HTML understanding --\ni.e., parsing the raw HTML of a webpage, with applications to automation of\nweb-based tasks, crawling, and browser-assisted retrieval -- have not been\nfully explored. We contribute HTML understanding models (fine-tuned LLMs) and\nan in-depth analysis of their capabilities under three tasks: (i) Semantic\nClassification of HTML elements, (ii) Description Generation for HTML inputs,\nand (iii) Autonomous Web Navigation of HTML pages. While previous work has\ndeveloped dedicated architectures and training procedures for HTML\nunderstanding, we show that LLMs pretrained on standard natural language\ncorpora transfer remarkably well to HTML understanding tasks. For instance,\nfine-tuned LLMs are 12% more accurate at semantic classification comp

In [4]:
len(data[0]["content"])

66352

Next we reshape each record into the schema Canopy expects (`["id", "text", "source", "metadata"]`):

In [5]:
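# build the fields Canopy expects; the full paper text becomes "text"
# and a few descriptive fields are nested under "metadata"
data = data.map(lambda x: {
    "id": x["id"],
    "text": x["content"],
    "source": x["source"],
    "metadata": {
        "title": x["title"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
    }
})
# drop unneeded columns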
data = data.remove_columns([
    "title", "summary", "content",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references"
])
data

Dataset({
    features: ['id', 'source', 'text', 'metadata'],
    num_rows: 423
})
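The same reshaping can be written as a named function, which makes it easy to try on a single record before mapping the whole dataset. This is a minimal sketch: `to_canopy_record` is a hypothetical helper name, and every value in `sample` other than the `id` and `title` shown earlier is a labeled placeholder rather than real dataset content.

```python
def to_canopy_record(paper):
    """Map one ai-arxiv row to the four fields Canopy expects."""
    return {
        "id": paper["id"],
        "text": paper["content"],
        "source": paper["source"],
        "metadata": {
            "title": paper["title"],
            "primary_category": paper["primary_category"],
            "published": paper["published"],
            "updated": paper["updated"],
        },
    }


# Placeholder row for illustration only; the real rows come from the dataset.
sample = {
    "id": "2210.03945",
    "title": "Understanding HTML with Large Language Models",
    "content": "<placeholder paper text>",
    "source": "<placeholder source url>",
    "primary_category": "<placeholder category>",
    "published": "<placeholder date>",
    "updated": "<placeholder date>",
}

record = to_canopy_record(sample)
print(sorted(record))  # ['id', 'metadata', 'source', 'text']
```

A function like this could be passed to `data.map(...)` in place of the lambda; the unused columns still need to be dropped afterwards, as in the cell above.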

In [6]:
data.to_json("ai_arxiv.jsonl", orient="records", lines=True)

38070885
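Before handing the file to Canopy, it can be worth confirming that every line parses and carries the four required fields. The following is a rough sketch using only the standard library; `check_jsonl` and `REQUIRED_FIELDS` are hypothetical names introduced here, not part of Canopy or `datasets`.

```python
import json
from pathlib import Path

# Fields named in this notebook as required by Canopy.
REQUIRED_FIELDS = ("id", "text", "source", "metadata")


def check_jsonl(path, required=REQUIRED_FIELDS):
    """Return the number of records; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in required if key not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            count += 1
    return count


# Only runs if the file written above is present in the working directory.
if Path("ai_arxiv.jsonl").exists():
    print(check_jsonl("ai_arxiv.jsonl"))
```

If the export worked as intended, the printed count should match the `num_rows` reported for the dataset.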

## Jump into Canopy!

From here we can switch to the Canopy CLI (or another interface) and run:

```
canopy
canopy upsert ./ai_arxiv.jsonl
```

Before chatting, we start the Canopy server:

```
canopy start
```

With the server running, we can begin chatting with:

```
canopy chat
```

_(we can also add the `--no-rag` flag to compare RAG and non-RAG responses!)_