# Hands-on 2: Metadata & Indexing

## Problem

Create a function that meets these requirements:

- Input: the path to paul_graham_essay.txt

- Output: an index that can correctly answer these questions:

    - Who is the author of the book?

    - What inspired the author to switch from studying philosophy to studying AI in college?

    - What would the author say about art vs. engineering?

    - Why did the author have to learn Italian?

    - Why was the author in Florence?

## Code

In [1]:
# Portable replacement for `mkdir -p` and `wget`, which are not available in the Windows shell
import os
import urllib.request
os.makedirs("data", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt", "data/paul_graham_essay.txt")


In [None]:
# Quote the version specifier so the shell does not treat ">" as a redirect
%pip install "llama-index>=0.11.20"
%pip install python-dotenv rich



In [15]:
import nest_asyncio
nest_asyncio.apply()

In [16]:
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from rich import print as rprint
import os

In [None]:
# Set OPENAI_API_KEY (replace the placeholder with your own key)
os.environ["OPENAI_API_KEY"] = "your OpenAI API key here"
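A placeholder left in the cell above only surfaces later as an authentication error from the API. The following is a minimal sketch of an early check, not part of the original notebook: `require_api_key` is a hypothetical helper, and the rule that a value containing spaces is invalid is an assumption based on the placeholder sentence used above.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    # Assumption: an empty value, or one containing spaces (such as a
    # leftover placeholder sentence), is not a usable key.
    if not value or " " in value:
        raise RuntimeError(f"{name} is not set to a usable value")
    return value
```

Calling `require_api_key()` right after setting the variable fails fast, before any ingestion work starts.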

In [18]:
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

In [19]:
import re
from llama_index.core.schema import TransformComponent


class TextCleaner(TransformComponent):
    """Remove every character except ASCII letters, digits, and spaces."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes
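It may help to see what this pattern does to a short string before relying on it. The sketch below uses only the standard library. `GENTLER` is a hypothetical alternative pattern shown for comparison and is not part of the notebook. Note that the notebook's pattern also removes newlines and sentence punctuation, so words on adjacent lines are joined together.

```python
import re

AGGRESSIVE = re.compile(r"[^0-9A-Za-z ]")            # pattern used by TextCleaner
GENTLER = re.compile(r"[^0-9A-Za-z \n.,;:!?'()-]")   # hypothetical: keeps newlines and punctuation

sample = "I didn't plan it.\nThen I moved to Florence\u2122!"

aggressive = AGGRESSIVE.sub("", sample)
gentler = GENTLER.sub("", sample)

print(repr(aggressive))  # 'I didnt plan itThen I moved to Florence'
print(repr(gentler))     # "I didn't plan it.\nThen I moved to Florence!"
```

Because the cleaner runs before `SentenceSplitter` in this pipeline, the splitter receives text without sentence boundaries. Whether that matters for retrieval quality here is worth checking against the five questions.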

In [20]:
from llama_index.core import VectorStoreIndex

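def create_index(text_path: str) -> VectorStoreIndex:
    """Build a vector index from the .txt files in the `text_path` directory."""
    pipeline = IngestionPipeline(
        transformations=[
            TextCleaner(),  # keep only ASCII letters, digits, and spaces
            SentenceSplitter(chunk_size=512),  # split into chunks of up to 512 tokens
            QuestionsAnsweredExtractor(questions=3),  # attach 3 LLM-generated questions per chunk as metadata
        ],
    )
    # Load every .txt file in the directory, run the pipeline, then embed and index the nodes
    documents = SimpleDirectoryReader(text_path, required_exts=[".txt"]).load_data()
    nodes = pipeline.run(documents=documents)
    index = VectorStoreIndex(nodes=nodes)
    return index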
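The problem statement asks for the path of the essay file, while `create_index` above expects a directory. One possible way to accept either is sketched below. `reader_kwargs` is a hypothetical helper, not part of the notebook. It only builds the keyword arguments (`input_files`, `input_dir`, `required_exts`) that `SimpleDirectoryReader` accepts.

```python
from pathlib import Path


def reader_kwargs(path: str) -> dict:
    """Build SimpleDirectoryReader keyword arguments from a file or directory path."""
    p = Path(path)
    if p.is_file():
        # A single file: load exactly this document.
        return {"input_files": [str(p)]}
    # A directory: load every .txt file in it, as create_index does.
    return {"input_dir": str(p), "required_exts": [".txt"]}
```

Inside `create_index`, the reader call would then become `SimpleDirectoryReader(**reader_kwargs(text_path)).load_data()`, so both `"./data/"` and `"./data/paul_graham_essay.txt"` would work.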

In [21]:
text_path = "./data/"  # directory that contains paul_graham_essay.txt

index = create_index(text_path)

engine = index.as_query_engine()

100%|██████████| 51/51 [00:20<00:00,  2.55it/s]


In [22]:
rprint(engine.query("Who is the author of the book?").response)

In [23]:
rprint(engine.query("What inspired the author to switch from studying philosophy to studying AI in college?").response)

In [24]:
rprint(engine.query("What would the author say about art vs. engineering?").response)

In [25]:
rprint(engine.query("Why did the author have to learn Italian?").response)

In [26]:
rprint(engine.query("Why was the author in Florence?").response)
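The five cells above repeat the same call with a different question each time. As a convenience, they could be collapsed into one loop. The sketch below is one way to do that. `QUESTIONS` and `ask_all` are hypothetical names, and the helper assumes only that the engine has a `query()` method whose result exposes `.response`, as used in the cells above.

```python
from typing import Any, Dict, Iterable

QUESTIONS = [
    "Who is the author of the book?",
    "What inspired the author to switch from studying philosophy to studying AI in college?",
    "What would the author say about art vs. engineering?",
    "Why did the author have to learn Italian?",
    "Why was the author in Florence?",
]


def ask_all(engine: Any, questions: Iterable[str]) -> Dict[str, str]:
    """Query the engine once per question and collect the answers as text."""
    answers = {}
    for question in questions:
        # The answer text is read from the `.response` attribute of the result.
        answers[question] = str(engine.query(question).response)
    return answers
```

With the notebook's objects this would be used as `for q, a in ask_all(engine, QUESTIONS).items(): rprint(q, a)`, which also makes it easier to compare answers after changing the pipeline.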