# Data Preparation for Training a RAG Agent

Prepare PDFs, Docs, CSVs, and similar files for RAG with LlamaIndex and [LlamaParse](https://github.com/run-llama/llama_cloud_services/blob/main/parse.md).

## Install LlamaParse

In [None]:
%pip install llama-parse

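## Add the PDF to convert to markdown

Copy the documents to be converted into the `content` folder. The cells below read `../content/apple_10k.pdf`, so adjust the path if your folder lives elsewhere.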
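
Before parsing, it can help to confirm that the files actually landed in the folder. The following is a minimal sketch, assuming the documents sit in `../content` relative to the notebook. The helper name `list_input_documents` and its extension list are illustrative and not part of LlamaParse.

```python
from pathlib import Path

# Hypothetical helper: list files in a folder whose extension is one we plan to parse.
def list_input_documents(folder, extensions=(".pdf", ".docx", ".csv")):
    root = Path(folder)
    if not root.is_dir():
        # A missing folder yields an empty list rather than an exception
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )

# Assumed location of the input documents; an empty list means nothing was found
print(list_input_documents("../content"))
```

An empty list usually means the folder path is wrong or the copy step was skipped.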

## Patch asyncio to allow nested event loops

In [None]:
import nest_asyncio

nest_asyncio.apply()
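
To see why the patch matters, the sketch below reproduces the underlying problem with the standard library only. A notebook kernel already runs an event loop, so a library that starts its own loop hits the same situation as the nested `asyncio.run` call here. The coroutine names `inner` and `outer` are illustrative stand-ins, not LlamaParse code.

```python
import asyncio

async def inner():
    return "parsed"

async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        return "nested run succeeded: " + asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"nested run failed: {exc}"

result = asyncio.run(outer())
print(result)
```

Run as a plain script without `nest_asyncio`, the nested call should fail with a `RuntimeError` about a running event loop. With `nest_asyncio.apply()` in effect, the nested call is expected to succeed instead.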

## Set LlamaCloud API key

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

llama_api_key = os.getenv('LLAMA_CLOUD_API_KEY')

if llama_api_key:
    print(f"Llama API Key exists and begins {llama_api_key[:4]}")
else:
    print("Llama API Key not set")
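
The cell above only prints a message when the key is missing, so a later parsing cell would still run and fail with a less obvious error. One option is to stop early instead. This is a minimal sketch; the helper `require_env` is a hypothetical name, not something `dotenv` or LlamaParse provides.

```python
import os

# Hypothetical helper: return the variable's value or raise a clear error.
def require_env(name):
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or environment"
        )
    return value

# Example use, after load_dotenv() has run:
# llama_api_key = require_env("LLAMA_CLOUD_API_KEY")
```

Raising here keeps the failure next to its cause rather than inside the first LlamaParse call.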


## Convert PDF document to markdown

In [None]:
from llama_parse import LlamaParse

# Parse the PDF; load_data returns a list of Document objects with markdown text
parser = LlamaParse(api_key=llama_api_key, result_type="markdown")
document = parser.load_data("../content/apple_10k.pdf")

In [None]:
document

In [None]:
# Preview the first 1,000 characters of one chunk (index 50 assumes at least 51 chunks)
print(document[50].text[:1000])
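
A hard-coded index such as `50` raises an `IndexError` on shorter documents. The sketch below clamps the index to the last available chunk. The helper `preview_chunk` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for the parsed result, assumed only to expose a `.text` attribute.

```python
from types import SimpleNamespace

# Hypothetical helper: preview one parsed chunk without risking an IndexError.
def preview_chunk(docs, index, limit=1000):
    if not docs:
        return ""
    index = min(index, len(docs) - 1)  # clamp to the last chunk
    return docs[index].text[:limit]

# Stand-in objects with a .text attribute, mimicking the parsed result
sample = [SimpleNamespace(text=f"page {i}") for i in range(3)]
print(preview_chunk(sample, 50))  # prints "page 2"
```

With the real result, the call would be `preview_chunk(document, 50)`.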

## Save the text as a markdown file

In [None]:
file_name = "../content/apple_10k.md"
with open(file_name, 'w', encoding="utf-8") as file:
    for doc in document:
        # Separate chunks with a blank line so text from adjacent chunks does not run together
        file.write(doc.text + "\n\n")
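
After saving, a quick structural check can show whether the conversion kept headings and tables. This is a rough sketch: the helper `markdown_stats` is a hypothetical name, and counting lines that start with `#` or `|` is only an approximation of markdown structure (it would also count such lines inside code blocks).

```python
# Hypothetical helper: rough structural counts for a markdown string.
def markdown_stats(text):
    lines = text.splitlines()
    return {
        "chars": len(text),
        "headings": sum(1 for line in lines if line.lstrip().startswith("#")),
        "table_rows": sum(1 for line in lines if line.lstrip().startswith("|")),
    }

# Small inline sample; table_rows includes the header and separator rows
sample = "# Title\n\nSome text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
print(markdown_stats(sample))
```

To check the saved file, read it with `open(file_name, encoding="utf-8").read()` and pass the text to `markdown_stats`.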

## Summarize the document in markdown to strip filler before it reaches the LLM

In [None]:
# Parse the same PDF again, this time asking the parser for a summary
documents_with_instruction = LlamaParse(
    api_key=llama_api_key,
    result_type="markdown",
    parsing_instruction="""
    This is the Apple annual report. Make a summary.
    """,
).load_data("../content/apple_10k.pdf")

In [None]:
file_name = "../content/apple_10k_instructions.md"
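with open(file_name, 'w', encoding="utf-8") as file:
    for doc in documents_with_instruction:
        # Separate chunks with a blank line so text from adjacent chunks does not run together
        file.write(doc.text + "\n\n")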
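
To judge how much the instructed parse condensed the report, the two saved files can be compared by character count. This is a minimal sketch: `size_ratio` is a hypothetical helper, and character count is only a crude proxy for how many tokens the LLM will see.

```python
from pathlib import Path

# Hypothetical helper: size of the summary relative to the full conversion.
def size_ratio(full_path, summary_path):
    full = len(Path(full_path).read_text(encoding="utf-8"))
    summary = len(Path(summary_path).read_text(encoding="utf-8"))
    # Guard against an empty full file to avoid dividing by zero
    return summary / full if full else 0.0

# Example use with the files written above:
# size_ratio("../content/apple_10k.md", "../content/apple_10k_instructions.md")
```

A ratio close to 1.0 would suggest the instruction had little effect on the output length.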