# Contributing a LlamaDataset To LlamaHub

A `LlamaDataset` is stored in a git repository. To contribute a dataset, make a pull request to the `llama_index/llama_datasets` GitHub (LFS) repository.

To contribute a `LabelledRagDataset` (a subclass of `BaseLlamaDataset`), two sets of files are required:

1. The `LabelledRagDataset` saved as JSON in a file named `rag_dataset.json`
2. Source document files used to create the `LabelledRagDataset`

This brief notebook provides a quick example using the Paul Graham Essay text file.

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Load Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2023-11-28 14:57:09--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2023-11-28 14:57:09 (3.23 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [None]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham/paul_graham_essay.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)

In [None]:
# generate questions against chunks
from llama_index.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms import OpenAI
from llama_index import ServiceContext

# set context for llm provider
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4", temperature=0.3)
)

# instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=gpt_4_context,
    num_questions_per_chunk=2,  # set the number of questions per node
    show_progress=True,
)

rag_dataset = dataset_generator.generate_dataset_from_nodes()


Now that we have generated our `LabelledRagDataset` (btw, it's totally fine to create one manually with human-generated queries and reference answers!), we save it to the required JSON file.

In [None]:
rag_dataset.save_json("rag_dataset.json")
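
Before moving on, it can help to confirm what was actually written. The sketch below uses only the standard library and assumes the saved file has a top-level `examples` list whose entries may carry a `query_by` object with a `type` field. That layout is an assumption about the serialized format, so adjust the keys if your file differs. The helper name `summarize_rag_dataset` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_rag_dataset(path):
    """Return (number of examples, Counter of query author types).

    Assumes a top-level "examples" list; each example may carry a
    "query_by" object with a "type" field (assumed layout).
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    examples = data.get("examples", [])
    author_types = Counter(
        (example.get("query_by") or {}).get("type", "unknown")
        for example in examples
    )
    return len(examples), author_types


if __name__ == "__main__":
    dataset_path = Path("rag_dataset.json")
    if dataset_path.exists():
        count, author_types = summarize_rag_dataset(dataset_path)
        print(f"{count} examples; query authors: {dict(author_types)}")
```

The example count is a natural candidate for the `numberObservations` field of the `card.json` described in Step 1.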

With the `rag_dataset.json` and source file `paul_graham_essay.txt` (note in this case, there is only one source document, but there can be several), we can perform the two steps for contributing a `LlamaDataset` into `LlamaHub`:

1. Similar to how contributions are made for `loader`s, `agent`s and `pack`s, create a pull request to the `llama_hub` repository that adds a new folder for the new `LlamaDataset`. This step uploads the information about the new `LlamaDataset` so that it can be presented in the `LlamaHub` UI.

2. Create a pull request to the `llama_datasets` repository to actually upload the data files.

### Step 0 (Pre-requisites)

Fork and then clone (onto your local machine) both the `llama_hub` GitHub repository and the `llama_datasets` one. You'll be submitting pull requests to both of these repos from a new branch off of your forked versions.

### Step 1

Create a new folder in `llama_datasets/` of the `llama_hub` Github repository. For example, in this case we would create a new folder `llama_datasets/paul_graham_essay_truncated`.

In that folder, two files are required:

- `card.json`
- `README.md`

We suggest looking at previously submitted `LlamaDataset`s and modifying their files as needed for your new one.

In our current example, we need the `card.json` to look as follows:

```json
{
    "name": "Paul Graham Essay Truncated",
    "description": "This is a truncated version of the original Paul Graham Essay text file.",
    "numberObservations": 4,
    "containsExamplesByHumans": false,
    "containsExamplesByAI": true,
    "sourceUrls": [
        "http://www.paulgraham.com/articles.html"
    ],
    "evaluation": {
        "context_similarity": 0.95,
        "correctness": 4.5,
        "faithfulness": 1.0,
        "relevancy": 1.0
    }
}
```
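
A missing comma or key is easy to overlook in hand-edited JSON, so a quick check before opening the pull request can save a review round. The following is a minimal sketch using only the standard library. The helper name `check_card` is hypothetical, and treating every key from the example above as required is an assumption rather than a documented schema.

```python
import json

# Keys taken from the card.json example above; treating them all as
# required is an assumption, not a documented schema.
EXPECTED_KEYS = {
    "name",
    "description",
    "numberObservations",
    "containsExamplesByHumans",
    "containsExamplesByAI",
    "sourceUrls",
    "evaluation",
}


def check_card(text):
    """Parse card.json text and return a list of problems (empty if none)."""
    try:
        card = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(card, dict):
        return ["top-level value must be a JSON object"]
    return [f"missing key: {key}" for key in sorted(EXPECTED_KEYS - card.keys())]


if __name__ == "__main__":
    from pathlib import Path

    card_path = Path("llama_datasets/paul_graham_essay_truncated/card.json")
    if card_path.exists():
        for problem in check_card(card_path.read_text(encoding="utf-8")):
            print(problem)
```

An empty result means the file parsed and contains every expected key. It does not verify that the evaluation numbers themselves are correct.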

The `README.md` files are fairly standard: you mostly only need to change the dataset name passed to the `download_llama_dataset()` function call.

```python
from llama_index import VectorStoreIndex
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack

# download the benchmark dataset and its source documents
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayTruncatedDataset", "./data"
)

# build a basic RAG system over the source documents
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# download and install dependencies for rag evaluator pack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset, query_engine=query_engine
)

# evaluate
benchmark_df = rag_evaluator_pack.run()
```


Finally, the last item for Step 1 is to add an entry to the `llama_datasets/library.json` file. In this case:

```json
    ...,
    "PaulGrahamEssayTruncatedDataset": {
        "id": "llama_datasets/paul_graham_essay_truncated",
        "author": "andrei-fajardo",
        "keywords": ["rag"],
        "extra_files": ["paul_graham_essay_truncated.txt"]
    }
```

Note: the `extra_files` field is reserved for the source files.
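
If you prefer not to edit `library.json` by hand, the entry can be added with a few lines of standard-library Python. This is only a sketch: the helper name `add_library_entry` is hypothetical, and it assumes `library.json` is a single JSON object keyed by dataset class name, as the fragment above suggests.

```python
import json
from pathlib import Path


def add_library_entry(library_path, class_name, entry):
    """Insert `entry` under `class_name` in library.json and rewrite the file.

    Refuses to overwrite an existing entry. Assumes library.json is a single
    JSON object keyed by dataset class name (an assumption).
    """
    library_path = Path(library_path)
    library = json.loads(library_path.read_text(encoding="utf-8"))
    if class_name in library:
        raise KeyError(f"{class_name} is already registered")
    library[class_name] = entry
    library_path.write_text(json.dumps(library, indent=2) + "\n", encoding="utf-8")
    return library


if __name__ == "__main__":
    path = Path("llama_datasets/library.json")
    if path.exists():
        add_library_entry(
            path,
            "PaulGrahamEssayTruncatedDataset",
            {
                "id": "llama_datasets/paul_graham_essay_truncated",
                "author": "andrei-fajardo",
                "keywords": ["rag"],
                "extra_files": ["paul_graham_essay_truncated.txt"],
            },
        )
```

Rewriting the whole file may change its indentation and produce a noisy diff, so check `git diff` afterwards and fall back to a manual edit if the formatting differs from the repository's.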

### Step 2 Uploading The Actual Datasets

Since we use GitHub LFS on our `llama_datasets` repo, contributing works exactly the same way as with any of our other open GitHub repos: submit a pull request.

In your fork of the `llama_datasets` repo, create a new folder in the `llama_datasets/` directory that matches the `id` field of the entry made in the `library.json` file. So, for this example, we'll create a new folder `llama_datasets/paul_graham_essay_truncated/`. This is where we add the documents before making the pull request.

To this folder, add `rag_dataset.json` (it must be called this), as well as the rest of the source documents, which in our case is the `paul_graham_essay_truncated.txt` file.

```sh
llama_datasets/paul_graham_essay_truncated/
├── paul_graham_essay_truncated.txt
└── rag_dataset.json
```
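
Before committing, a quick check that the folder matches this layout can catch a misnamed or forgotten file. The sketch below is standard-library only, and the helper name `missing_dataset_files` is hypothetical. It takes the list of source files as an argument, which should mirror the `extra_files` list from the `library.json` entry.

```python
from pathlib import Path


def missing_dataset_files(dataset_dir, extra_files):
    """Return the names of required files that are absent from `dataset_dir`.

    `rag_dataset.json` is always required; `extra_files` should mirror the
    `extra_files` list from the library.json entry.
    """
    dataset_dir = Path(dataset_dir)
    required = ["rag_dataset.json", *extra_files]
    return [name for name in required if not (dataset_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files(
        "llama_datasets/paul_graham_essay_truncated",
        ["paul_graham_essay_truncated.txt"],
    )
    print("missing files:", missing or "none")
```
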

Now, simply `git add`, `git commit` and `git push` your branch, and make your PR.