# RAG - Retrieval-Augmented Generation

## DocumentLoaders

In [1]:
# Loading PDFs
from langchain_community.document_loaders.pdf import PyPDFLoader

caminho = './data/attention-is-all-your-need.pdf'

loader = PyPDFLoader(caminho)
documentos = loader.load()

In [2]:
len(documentos)

15

In [3]:
print(documentos[1].page_content)

1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [ 35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through facto

In [4]:
print(documentos[1].metadata)

{'source': './data/attention-is-all-your-need.pdf', 'page': 1, 'page_label': '2'}
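The metadata above carries both `page` (a zero-based index) and `page_label` (the label printed in the PDF). As a minimal sketch, assuming only the dictionary shape shown in that output, a small helper can turn this metadata into a readable citation. The name `format_citation` is hypothetical, and the dictionary is copied from the output rather than produced by the loader.

```python
def format_citation(metadata: dict) -> str:
    """Build a short citation string from loader metadata (hypothetical helper)."""
    # Prefer the printed page label; otherwise fall back to the zero-based index + 1.
    label = metadata.get("page_label") or str(metadata.get("page", 0) + 1)
    return f"{metadata.get('source', 'unknown')} (p. {label})"


# Metadata copied from the output above (assumed shape, not loaded from disk here).
meta = {"source": "./data/attention-is-all-your-need.pdf", "page": 1, "page_label": "2"}
print(format_citation(meta))
```

A string like this could be attached to an answer so the reader can check the cited page in the original PDF.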


## Asking questions about the file

In [5]:
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import ChatOpenAI

chat = ChatOpenAI(model='gpt-4o-mini')

chain = load_qa_chain(
    llm=chat,
    chain_type='stuff',
    verbose=True
)

stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag


In [6]:
pergunta = 'What is the main subject of the document?'

chain.run(
    input_documents=documentos,
    question=pergunta
)





> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
con

'The main subject of the document is the presentation of the Transformer model, a new neural network architecture based exclusively on attention mechanisms, which eliminates the need for recurrence and convolutions. The document discusses its effectiveness in sequence transduction tasks, such as machine translation, demonstrating significant improvements in quality and training efficiency compared with previous models.'
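The verbose log above suggests what `chain_type='stuff'` does: every document's text is placed into a single system prompt, followed by the question. The sketch below imitates that behaviour in plain Python. It is an illustration, not LangChain's implementation: `SYSTEM_TEMPLATE` is transcribed from the log, while `build_stuff_prompt` and the `"\n\n"` separator are assumptions made for the example.

```python
# System prompt transcribed from the verbose output above.
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question. \n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)


def build_stuff_prompt(page_contents: list, separator: str = "\n\n") -> str:
    """Concatenate every document into one prompt (hypothetical helper)."""
    # "Stuffing" means no selection or summarisation: all text goes in at once.
    context = separator.join(page_contents)
    return SYSTEM_TEMPLATE.format(context=context)


# Two stand-in page texts; a real run would pass doc.page_content for each page.
prompt = build_stuff_prompt(["Text of page 1", "Text of page 2"])
print(prompt)
```

Because the whole document is sent in one request, this strategy only works while the combined text fits in the model's context window.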

## Loading CSV

In [7]:
from langchain_community.document_loaders.csv_loader import CSVLoader

caminho = './data/IMDB top 1000.csv'
loader = CSVLoader(caminho, encoding='utf-8')
documentos = loader.load()

In [8]:
len(documentos)

1000

In [9]:
print(documentos[0].page_content)

: 0
Title: 1. The Shawshank Redemption (1994)
Certificate: R
Duration: 142 min
Genre: Drama
Rate: 9.3
Metascore: 80
Description: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
Cast: Director: Frank Darabont | Stars: Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler
Info: Votes: 2,295,987 | Gross: $28.34M


In [10]:
print(documentos[0].metadata)

{'source': './data/IMDB top 1000.csv', 'row': 0}
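The outputs above show one document per CSV row, with the row rendered as `column: value` lines and the row index kept in the metadata. A rough standard-library sketch of that transformation follows. It is not the `CSVLoader` source: `rows_to_documents` is a hypothetical helper, plain dictionaries stand in for document objects, and the sample is a reduced, hand-written version of the first row.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list:
    """Turn each CSV row into a dict with page_content and metadata (hypothetical)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


# Reduced sample; the unnamed first column explains the ": 0" line seen in the output.
sample = ",Title,Rate\n0,1. The Shawshank Redemption (1994),9.3\n"
docs = rows_to_documents(sample, "./data/IMDB top 1000.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

The empty header of the first column is why each document in the real output starts with `: 0`, `: 1`, and so on.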


In [11]:
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import ChatOpenAI

chat = ChatOpenAI(model='gpt-4o-mini')

chain = load_qa_chain(
    llm=chat,
    chain_type='stuff',
    verbose=True
)

In [13]:
chain.run(
    input_documents=documentos,
    question="What is the best film of all time according to IMDb?"
)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
: 0
Title: 1. The Shawshank Redemption (1994)
Certificate: R
Duration: 142 min
Genre: Drama
Rate: 9.3
Metascore: 80
Description: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
Cast: Director: Frank Darabont | Stars: Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler
Info: Votes: 2,295,987 | Gross: $28.34M

: 1
Title: 2. The Godfather (1972)
Certificate: R
Duration: 175 min
Genre: Crime, Drama
Rate: 9.2
Metascore: 100
Description: The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.
Cast: Director: Francis Ford Coppola | Stars: Marlon 

'The best film of all time according to IMDb is "The Shawshank Redemption" (1994), with a rating of 9.3.'
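In the call above, all 1000 rows are passed to a `stuff` chain, so the prompt grows with the size of the file. A rough size check before sending can help decide whether stuffing is viable. The sketch below is only an approximation: the 4-characters-per-token ratio is a common rule of thumb assumed here, not a tokenizer, and `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
def estimate_tokens(page_contents, chars_per_token=4.0, separator="\n\n"):
    """Very rough token estimate for a stuffed prompt (assumed ratio, not a tokenizer)."""
    total_chars = sum(len(text) for text in page_contents)
    # Count the separators that would be inserted between documents.
    total_chars += len(separator) * max(len(page_contents) - 1, 0)
    return int(total_chars / chars_per_token)


def fits_in_context(page_contents, max_tokens):
    """Return True if the estimated prompt size stays within max_tokens."""
    return estimate_tokens(page_contents) <= max_tokens


# Stand-in texts; a real check would use [doc.page_content for doc in documentos].
texts = ["Title: 1. The Shawshank Redemption (1994)", "Title: 2. The Godfather (1972)"]
print(estimate_tokens(texts))
```

When the estimate exceeds the model's limit, selecting only the relevant documents before building the prompt is the usual alternative to stuffing everything.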

## Loading from the Internet

### YouTube

In [16]:
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser

In [17]:
url = 'https://www.youtube.com/watch?v=rOjusRRO1EI'
save_dir = 'data/youtube/'

loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=rOjusRRO1EI
[youtube] rOjusRRO1EI: Downloading webpage
[youtube] rOjusRRO1EI: Downloading tv client config
[youtube] rOjusRRO1EI: Downloading player e7567ecf
[youtube] rOjusRRO1EI: Downloading tv player API JSON
[youtube] rOjusRRO1EI: Downloading ios player API JSON
[youtube] rOjusRRO1EI: Downloading m3u8 information
[info] rOjusRRO1EI: Downloading 1 format(s): 140
[download] Destination: data\youtube\Como usar o GPT com seus próprios dados？.m4a
[download] 100% of   25.64MiB in 00:00:04 at 5.76MiB/s     


ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location


DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location
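The error above says that the post-processing step could not find `ffmpeg` and `ffprobe`. One way to catch this before downloading is to check whether both executables are on the `PATH`. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper, and installing ffmpeg itself depends on the operating system.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Install or add to PATH before running the loader:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe found; the audio post-processing step can run.")
```

Running this check in a cell before `loader.load()` avoids downloading the audio only to fail at the conversion step.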