## Vectorstore retrieval


In [3]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [4]:
# !pip install lark

### Similarity Search

In [5]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [6]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)


In [7]:
print(vectordb._collection.count())

228


In [8]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [9]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [10]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [11]:
smalldb.similarity_search(question, k=2)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [12]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance


In [13]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

In [14]:
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [15]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

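The first two similarity-search results above begin with the same text, so one of them adds nothing new. Compare this with the results from `MMR`, where the second result is a different chunk.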

In [16]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [17]:
docs_mmr[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [18]:
docs_mmr[1].page_content[:100]

'into his office and he said, "Oh, professor, professor, thank you so much for your \nmachine learning'
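
To make the behaviour above more concrete, here is a minimal pure-Python sketch of MMR selection. It assumes the common formulation, where each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. The helper names `cosine` and `mmr_select` and the toy 2-D vectors are invented for this illustration; this is not the code LangChain or Chroma runs internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy embeddings (assumed values): documents 0 and 1 are near-duplicates.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity search, which returns both near-duplicates. With `lambda_mult=0.5` the second slot goes to the less similar but more diverse document.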

### Addressing Specificity: working with metadata


In [19]:
question = "what did they say about regression in the third lecture?"

In [20]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [21]:
for d in docs:
    print(d.metadata)
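
Conceptually, the `filter` argument restricts the candidate chunks to those whose metadata matches before any ranking takes place. The sketch below illustrates that idea on plain dictionaries. The function `filter_by_metadata` and the toy records are hypothetical and only mimic exact-match filtering; the real vector store may support richer filter expressions.

```python
def filter_by_metadata(records, metadata_filter):
    """Keep records whose metadata matches every key/value pair in the filter."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]

# Toy records (assumed data) shaped like chunks with `source` and `page` metadata.
toy_records = [
    {"text": "chunk A", "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf", "page": 3}},
    {"text": "chunk B", "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 0}},
    {"text": "chunk C", "metadata": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 14}},
]

lecture3 = filter_by_metadata(
    toy_records,
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
)
print([r["text"] for r in lecture3])  # ['chunk B', 'chunk C']
```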

### Addressing Specificity: working with metadata using self-query retriever


In [22]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [23]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [24]:
document_content_description = "Lecture notes"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)


In [25]:
question = "what did they say about regression in the third lecture?"

In [26]:
docs = retriever.get_relevant_documents(question)


In [27]:
for d in docs:
    print(d.metadata)
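
The self-query retriever asks the LLM to split the question into a semantic query and a metadata filter. As a rough stand-in for that step, the sketch below does the same split with a regular expression instead of an LLM. Everything here (`split_question`, the `ORDINALS` table, the phrase patterns) is a hypothetical simplification for illustration; only the `source` path pattern comes from the metadata description above.

```python
import re

# Hypothetical lookup from ordinal words to lecture numbers.
ORDINALS = {"first": 1, "second": 2, "third": 3}

def split_question(question):
    """Split a question into (semantic query, metadata filter or None)."""
    match = re.search(r"\b(first|second|third)\s+lecture\b", question, re.IGNORECASE)
    if not match:
        return question, None
    number = ORDINALS[match.group(1).lower()]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Drop the metadata phrase so that only the semantic part is embedded.
    query = re.sub(
        r"\s*\b(in|from)\s+the\s+(first|second|third)\s+lecture\b",
        "",
        question,
        flags=re.IGNORECASE,
    )
    return query.strip(" ?"), {"source": source}

print(split_question("what did they say about regression in the third lecture?"))
# ('what did they say about regression', {'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'})
```

The resulting filter has the same shape as the one passed by hand to `similarity_search` earlier; the difference is that the retriever infers it from the question.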

### Additional tricks: compression


In [28]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [29]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [30]:
# Build an LLM-based compressor that extracts only the query-relevant parts of each document
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)

In [31]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [32]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data
- it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms
- there's also a software package called Octave that you can download for free off the Internet
- it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course
----------------------------------------------------------------------------------------------------
Document 2:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is
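
The compressor above uses an LLM to keep only the parts of each retrieved chunk that bear on the question. A very rough, LLM-free approximation of the same idea is to keep only the sentences that mention a query term. The function `extract_relevant` is a hypothetical helper written for this sketch, and keyword matching is a much cruder stand-in for what `LLMChainExtractor` does.

```python
import re

def extract_relevant(text, query_terms):
    """Keep only the sentences that mention at least one query term (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    terms = [t.lower() for t in query_terms]
    return [s for s in sentences if any(t in s.lower() for t in terms)]

# A short excerpt in the style of the lecture transcript.
chunk = (
    "Okay. So one more organizational question. "
    "I'm curious, how many of you know MATLAB? "
    "Wow, cool, quite a lot."
)

print(extract_relevant(chunk, ["matlab"]))
# ["I'm curious, how many of you know MATLAB?"]
```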

## Combining various techniques

In [33]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [34]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

- those homeworks will be done in either MATLAB or in Octave
- I know some people call it a free version of MATLAB
- MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data
- it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms
- there's also a software package called Octave that you can download for free off the Internet
- it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course
----------------------------------------------------------------------------------------------------
Document 2:

"Oh, it was the MATLAB."
----------------------------------------------------------------------------------------------------


## Other types of retrieval


In [35]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [38]:
# Load PDF
loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


In [39]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
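
Unlike the vector store, a TF-IDF retriever needs no embedding model: it ranks chunks by the overlap of weighted terms with the query. The following dependency-free sketch shows one way such a ranking could work, assuming a smoothed IDF and cosine similarity. The names `tokenize` and `tfidf_rank` and the toy texts are invented here, and the library's own implementation may weight and normalize terms differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, texts):
    """Return document indices sorted by TF-IDF cosine similarity to the query."""
    docs_tokens = [tokenize(t) for t in texts]
    n = len(texts)
    doc_freq = Counter(tok for toks in docs_tokens for tok in set(toks))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {tok: math.log((1 + n) / (1 + df)) + 1 for tok, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        return {tok: count * idf[tok] for tok, count in counts.items() if tok in idf}

    def cosine(a, b):
        dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(toks)) for toks in docs_tokens]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Toy chunks (assumed data) in the style of the lecture transcript.
toy_texts = [
    "Most of those homeworks will be done in either MATLAB or in Octave",
    "The class has a home page with homework solutions",
    "Detailed lecture notes cover the math and equations",
]

ranking = tfidf_rank("what did they say about matlab?", toy_texts)
print(ranking[0])  # 0
```

Because the match is purely lexical, a query that shares no terms with any chunk scores zero everywhere, which is the main trade-off compared with embedding-based search.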

In [42]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
print(docs_svm[0])

page_content='Testing, testing. Okay, cool. Thanks. So all right, online resources. The class has a home page, so it's in on the handouts. I 
won't write on the chalkboard — http:// cs229.stanford.edu. And so when there are 
homework assignments or things like that, we usually won't sort of — in the mission of 
saving trees, we will usually not give out many handouts in class. So homework 
assignments, homework solutions will be posted online at the course home page.  
As far as this class, I've also written, and I guess I've also revised every year a set of 
fairly detailed lecture notes that cover the technical content of this class. And so if you 
visit the course homepage, you'll also find the detailed lecture notes that go over in detail 
all the math and equations and so on that I'll be doing in class.  
There's also a newsgroup, su.class.cs229, also written on the handout. This is a 
newsgroup that's sort of a forum for people in the class to get to know each other and 
have wha

In [43]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
print(docs_tfidf[0])

page_content='yourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas 
with us.  
Okay. So one more organizational question. I'm curious, how many of you know 
MATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you 
know Octave or have used Octave? Oh, okay, much smaller number.  
So as part of this class, especially in the homeworks, we'll ask you to implement a few 
programs, a few machine learning algorithms as part of the homeworks. And most of those homeworks will be done in either MATLAB or in Octave, which is sort of — I 
know some people call it a free version of MATLAB, which it sort of is, sort of isn't.  
So I guess for those of you that haven't seen MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to 
write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to learn