# Query Multiple Files with `GPT`
📓 Notebook from [gkamradt](https://github.com/gkamradt/langchain-tutorials)

In [6]:
#!pip install langchain
#!pip install python-magic-bin
#!pip install chromadb
#!pip install unstructured

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
import magic
import os
import nltk

# os.environ['OPENAI_API_KEY'] = '...'

# nltk.download('averaged_perceptron_tagger')

# Requires unstructured, python-magic-bin, and chromadb (see the commented install commands in the first cell).
# Other dependencies to install: https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/unstructured_file.html

In [9]:
import os
import openai
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY

In [2]:
loader = DirectoryLoader('data/PaulGrahamEssaySmall/', glob='**/*.txt')
documents = loader.load()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Luke\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Luke\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [5]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
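
For intuition, here is a minimal sketch of what splitting with `chunk_size=1000` and `chunk_overlap=0` roughly amounts to. It is a simplified stand-in, not LangChain's implementation: the function name `split_into_chunks` is hypothetical, it assumes a blank-line separator, it ignores overlap, and it keeps an oversized paragraph whole instead of cutting it.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplified illustration only: no overlap handling, and a single piece
    longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the piece still fits (or is the first one)
        else:
            chunks.append(current)  # close the full chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "first paragraph\n\nsecond paragraph\n\nthird paragraph"
chunks = split_into_chunks(sample, chunk_size=35)
for chunk in chunks:
    print(repr(chunk))
```

With a small `chunk_size` the first two paragraphs fit together and the third starts a new chunk, which is the same packing behaviour that yields roughly 1000-character passages in the cell above.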

In [10]:
embeddings = OpenAIEmbeddings()

In [11]:
docsearch = Chroma.from_documents(texts, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


In [12]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=docsearch)
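
The `"stuff"` chain type places all retrieved passages into a single prompt together with the question. The sketch below shows the idea only: `build_stuff_prompt` and the prompt wording are hypothetical, not LangChain's actual template, and the passages are abbreviated sample strings.

```python
def build_stuff_prompt(question, passages):
    """Join every retrieved passage into one context block and append the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Abbreviated sample passages standing in for retrieved chunks
passages = [
    "In 1960, John McCarthy published a remarkable paper.",
    "He called this language Lisp.",
]
prompt = build_stuff_prompt("What did McCarthy discover?", passages)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved passages fit within the model's context window.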



In [18]:
query = "What did McCarthy discover?"
ans = qa.run(query)

# Print the answer ten words per line so it stays readable in the notebook
words = ans.split()
for i in range(0, len(words), 10):
    print(' '.join(words[i:i+10]))

McCarthy discovered that, given a handful of simple operators and
a notation for functions, you can build a whole programming
language which he called Lisp. He also used a simple
data structure called a list for both code and data.
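
The ten-words-per-line printing can be wrapped in a small reusable helper. This is a sketch: `wrap_by_words` is a hypothetical name, and the sample string is the answer text that appears later in this notebook. The standard library's `textwrap.fill` is shown as an alternative that wraps by character width instead of word count.

```python
import textwrap


def wrap_by_words(text, words_per_line=10):
    """Split text on whitespace and regroup it into lines of a fixed word count."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]


answer = (
    " McCarthy discovered a way to create a whole programming language out of "
    "a handful of simple operators and a notation for functions. "
    "He called this language Lisp."
)

lines = wrap_by_words(answer)
for line in lines:
    print(line)

# Alternative: wrap by character width rather than by word count
wrapped = textwrap.fill(answer.strip(), width=70)
print(wrapped)
```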


### Returning Sources with `LangChain`

In [19]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "What did McCarthy discover?"
result = qa({"query": query})



In [20]:
result['result']

' McCarthy discovered a way to create a whole programming language out of a handful of simple operators and a notation for functions. He called this language Lisp.'

In [26]:
result['source_documents']

[Document(page_content='May 2001\n\n(I wrote this article to help myself understand exactly\n\nwhat McCarthy discovered.  You don\'t need to know this stuff\n\nto program in Lisp, but it should be helpful to\n\nanyone who wants to\n\nunderstand the essence of Lisp \x97 both in the sense of its\n\norigins and its semantic core.  The fact that it has such a core\n\nis one of Lisp\'s distinguishing features, and the reason why,\n\nunlike other languages, Lisp has dialects.)In 1960, John\n\nMcCarthy published a remarkable paper in\n\nwhich he did for programming something like what Euclid did for\n\ngeometry. He showed how, given a handful of simple\n\noperators and a notation for functions, you can\n\nbuild a whole programming language.\n\nHe called this language Lisp, for "List Processing,"\n\nbecause one of his key ideas was to use a simple\n\ndata structure called a list for both\n\ncode and data.It\'s worth understanding what McCarthy discovered, not\n\njust as a landmark in the histo

In [29]:
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "What is the biggest town between Christchurch and Dunedin?"
result = qa({"query": query})

In [30]:
result

{'query': 'What is the biggest town between Christchurch and Dunedin?',
 'result': ' Oamaru is the biggest town between Christchurch and Dunedin.',
 'source_documents': [Document(page_content='made clear, is to talk confidently about things they don\'t\n\nunderstand.An event like this is thus a uniquely powerful way of taking people\'s\n\nmeasure. As Warren Buffett said, "It\'s only when the tide goes out\n\nthat you learn who\'s been swimming naked." And the tide has just\n\ngone out like never before.Now that we\'ve seen the results, let\'s remember what we saw, because\n\nthis is the most accurate test of credibility we\'re ever likely to have. I hope.', lookup_str='', metadata={'source': 'data\\PaulGrahamEssaySmall\\cred.txt'}, lookup_index=0),
  Document(page_content="April 2012A palliative care nurse called Bronnie Ware made a list of the\n\nbiggest regrets\n\nof the dying.  Her list seems plausible.  I could see\n\nmyself — can see myself — making at least 4 of these\n\n5 mistak
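
The two sources visible in this output come from essays about credibility and about the regrets of the dying. Neither excerpt mentions Christchurch or Dunedin, which suggests the answer may have come from the model's general knowledge rather than from the indexed files. One rough way to notice this is to check whether key terms from the query appear in any returned passage. The sketch below assumes plain dicts standing in for `Document` objects (with real results you would read `doc.page_content` and `doc.metadata`); `sources_mentioning` is a hypothetical helper, and the excerpts are whitespace-normalized snippets of the output above.

```python
def sources_mentioning(terms, source_docs):
    """Return the source paths of passages that contain at least one of the terms."""
    lowered = [t.lower() for t in terms]
    hits = []
    for doc in source_docs:
        text = doc["page_content"].lower()
        if any(t in text for t in lowered):
            hits.append(doc["metadata"]["source"])
    return hits


# Stand-ins for the returned documents; the second source path is not
# visible in the output above, so it is left as a placeholder.
source_docs = [
    {
        "page_content": "this is the most accurate test of credibility "
                        "we're ever likely to have. I hope.",
        "metadata": {"source": "data\\PaulGrahamEssaySmall\\cred.txt"},
    },
    {
        "page_content": "A palliative care nurse called Bronnie Ware made "
                        "a list of the biggest regrets of the dying.",
        "metadata": {"source": "(source path not shown)"},
    },
]

hits = sources_mentioning(["Christchurch", "Dunedin"], source_docs)
print(hits)
```

An empty list here is only a hint, since a passage can support an answer without repeating the query's exact words, but it is a cheap first check before trusting a sourced answer.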