# Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

![overview.jpg](overview.jpg)

Now we will look at the third phase: `Storage - Vectorstore`

<div style="text-align:center"><img src="vectorstore.png" /></div>

In [6]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

We just discussed `Document Loading` and `Splitting`.

In [1]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [2]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [3]:
splits = text_splitter.split_documents(docs)

In [4]:
len(splits)

209
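To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as paragraphs and lines first). The function name `sliding_window_chunks` and the toy sizes are assumptions made for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input (hypothetical), small enough to inspect by eye.
toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

The last three characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.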

## Embeddings

Let's take our splits and embed them.

In [None]:
# ! pip install --upgrade --quiet  langchain sentence_transformers

In [10]:
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding = HuggingFaceEmbeddings(model_name=embeddings_model_name)



In [11]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [12]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [13]:
import numpy as np

In [14]:
np.dot(embedding1, embedding2)

0.9151646510452669

In [15]:
np.dot(embedding1, embedding3)

0.08337093342478041

In [16]:
np.dot(embedding2, embedding3)

0.04040366916087286

`Note`: `sentence1` and `sentence2` are far more similar to each other (dot product ≈ 0.92) than either is to `sentence3` (≈ 0.08 and ≈ 0.04).
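A raw dot product is only a fair similarity score when the vectors have unit length. Many sentence-embedding models return normalized vectors, but that is an assumption worth checking for the model you use. The sketch below shows cosine similarity, which divides out the vector lengths; it uses plain Python and hypothetical 2-D toy vectors so it runs without any dependencies.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by both vector lengths, so only direction matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy vectors, not real embeddings.
toy_a = [3.0, 4.0]    # length 5
toy_b = [6.0, 8.0]    # same direction as toy_a, length 10
toy_c = [-4.0, 3.0]   # perpendicular to toy_a

raw_dot = dot(toy_a, toy_b)                # grows with vector length
cos_ab = cosine_similarity(toy_a, toy_b)   # same direction
cos_ac = cosine_similarity(toy_a, toy_c)   # unrelated direction
print(raw_dot, cos_ab, cos_ac)
```

For unit-length vectors the two scores coincide, which is why `np.dot` is a reasonable shortcut when the embeddings are already normalized.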

## Vectorstores

In [None]:
# ! pip install chromadb

In [21]:
from langchain.vectorstores import Chroma

In [22]:
persist_directory = '../docs/chroma/'

In [25]:
import shutil
shutil.rmtree(persist_directory, ignore_errors=True)  # remove old database files, if any


In [26]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [27]:
print(vectordb._collection.count())

209


In [28]:
vectordb

<langchain_community.vectorstores.chroma.Chroma at 0x1ded07d1dc0>

### Similarity Search

In [29]:
question = "is there an email i can ask for help"

In [37]:
docs = vectordb.similarity_search(question,k=3)

In [38]:
len(docs)

3
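Conceptually, a similarity search embeds the question and returns the `k` stored vectors closest to it. The sketch below does this by brute force over a small list of hypothetical 2-D vectors. Real vector stores typically use an index rather than a full scan, so treat this as an illustration of the idea, not of Chroma's implementation; `top_k` is a name made up for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, score) for the k stored vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return scored[:k]


# Hypothetical stored vectors and query, standing in for chunk and question embeddings.
toy_doc_vecs = [
    [1.0, 0.0],    # doc 0
    [0.0, 1.0],    # doc 1
    [1.0, 1.0],    # doc 2
    [-1.0, 0.0],   # doc 3
]
toy_query_vec = [2.0, 1.0]

toy_hits = top_k(toy_query_vec, toy_doc_vecs, k=3)
print([index for index, _ in toy_hits])
```

Note that nothing in this ranking penalizes two stored vectors for being identical to each other; each one is scored against the query independently.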

In [36]:
print(docs[0].page_content)

cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework probl ems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me  appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thi ng that I think will help you to succeed and 
do well in this class and even help you to enjoy this cla ss more is if you form a study 
group.  
So start looking around where you' re sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study gro

In [39]:
print(docs[1].page_content)

cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework probl ems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me  appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thi ng that I think will help you to succeed and 
do well in this class and even help you to enjoy this cla ss more is if you form a study 
group.  
So start looking around where you' re sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study gro

In [40]:
print(docs[2].page_content)

more fun for you, and you'd probably have a be tter learning experience if you form a 
study group of people to work with. So I definitely encourage you to do that.  
And just to say a word on the honor code, whic h is I definitely en courage you to form a 
study group and work together, discuss homew ork problems together. But if you discuss


Let's save this so we can use it later!

In [41]:
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily. 

But there are some failure modes that can crop up.

Here are some edge cases that can arise; we'll fix them in the next lesson.

In [44]:
question = "what did they say about matlab?"

In [45]:
docs = vectordb.similarity_search(question,k=5)

Notice that we're getting duplicate chunks (because of the duplicate `MachineLearning-Lecture01.pdf` in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [46]:
docs[0]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,

In [47]:
docs[1]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,
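One simple way to hide exact duplicates is to filter the retrieved list by chunk text. This is a minimal sketch, not the approach covered in the next lesson: `ToyDoc` is a hypothetical stand-in for LangChain's `Document`, the sample strings are made up, and only exact matches are removed (near-duplicates would pass through).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def drop_duplicate_chunks(docs: List[ToyDoc]) -> List[ToyDoc]:
    """Keep the first occurrence of each distinct chunk text, preserving rank order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up results that mimic the situation above: the first two chunks are identical.
toy_results = [
    ToyDoc("homeworks will be done in MATLAB or Octave", {"page": 8}),
    ToyDoc("homeworks will be done in MATLAB or Octave", {"page": 8}),
    ToyDoc("a short MATLAB tutorial in the discussion sections", {"page": 9}),
]
unique_results = drop_duplicate_chunks(toy_results)
print(len(unique_results))
```

Filtering after retrieval means you may end up with fewer than `k` results, so in practice you would fetch more candidates than you need or remove duplicates before indexing.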

We can see a new failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [48]:
question = "what did they say about regression in the third lecture?"

In [49]:
docs = vectordb.similarity_search(question,k=5)

In [50]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': '../docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 11, 'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 13, 'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


In [51]:
print(docs[4].page_content)

answer. You predict that if X is to the right of, sort of, the mid-point here then Y is one 
and then next to the left of that mid-point then Y is zero.  
So some people actually do this. Apply linear  regression to classi fication problems and 
sometimes it’ll work okay, but in general it’s actually a pretty bad idea to apply linear 
regression to classification problems like thes e and here’s why. Let’s say I change my 
training set by giving you just one more tr aining example all the way up there, right? 
Imagine if given this training set is actually  still entirely obvious  what the relationship 
between X and Y is, right? It’s ju st – take this value as greate r than Y is one and it’s less 
then Y is zero. By giving you this additiona l training example it really shouldn’t change 
anything. I mean, I didn’t really convey much  new information. There’s no surprise that 
this corresponds to Y equals one. But if you now  fit linear regression to this data set you 
end up with a lin

Approaches discussed in the next lecture can be used to address both!
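For the "third lecture" case, the structured part of the question can be turned into a constraint on the `source` metadata. The sketch below applies that constraint after retrieval, using the metadata printed above; `filter_by_source` is a hypothetical helper. LangChain's Chroma wrapper also accepts a `filter` argument on `similarity_search` that applies such a constraint inside the store, but check the documentation for your installed version.

```python
from typing import Dict, List


def filter_by_source(metadatas: List[Dict[str, object]],
                     wanted_source: str) -> List[Dict[str, object]]:
    """Keep only results whose 'source' metadata matches the wanted file."""
    return [m for m in metadatas if m.get("source") == wanted_source]


# Metadata of the five results returned for the "third lecture" question.
retrieved = [
    {"page": 0, "source": "../docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 2, "source": "../docs/cs229_lectures/MachineLearning-Lecture02.pdf"},
    {"page": 11, "source": "../docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 13, "source": "../docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 10, "source": "../docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
]
lecture3 = "../docs/cs229_lectures/MachineLearning-Lecture03.pdf"

kept = filter_by_source(retrieved, lecture3)
print([m["page"] for m in kept])
```

Filtering inside the store is usually preferable, because a post-filter can only discard results and cannot replace them with better ones.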

Some more tests that I ran:

In [53]:
question = "what did they say about cross-section in the first lecture?"

In [54]:
docs = vectordb.similarity_search(question,k=5)

In [55]:
for doc in docs:
    print(doc.metadata)

{'page': 5, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 5, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 4, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}


In [58]:
print(docs[3].page_content)

into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this 
machine learning stuff was actually useful. So what was it that you learned? Was it 
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you 
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."  
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutori al in one of the discussion sections for 
those of you that don't know it.  
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion 
sections will be taught by the TAs, and atte ndance at discussion secti

`Note`: The question is about "cross-section". All results come from the correct lecture, but none of the retrieved chunks actually mentions that term.
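A quick way to verify this kind of observation is to check which retrieved chunks literally contain the term. The sketch below is a plain substring check over made-up sample strings; `chunks_mentioning` is a hypothetical helper. Keep in mind that text extracted from PDFs can contain stray spaces inside words, so a literal match may miss some occurrences.

```python
from typing import List


def chunks_mentioning(term: str, chunks: List[str]) -> List[int]:
    """Return the positions of the chunks whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return [i for i, text in enumerate(chunks) if needle in text.lower()]


# Made-up chunk texts for illustration, not actual retrieval output.
toy_chunks = [
    "And the student said, it was the MATLAB.",
    "Discussion sections will be taught by the TAs.",
]
cross_section_hits = chunks_mentioning("cross-section", toy_chunks)
matlab_hits = chunks_mentioning("matlab", toy_chunks)
print(cross_section_hits, matlab_hits)
```

An empty list for the query term is a strong hint that the retriever matched on something else in the question, such as the lecture reference.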

Let's change only "first lecture" to "second lecture" and see whether the results come from the right lecture this time.

In [59]:
question = "what did they say about cross-section in the second lecture?"

In [60]:
docs = vectordb.similarity_search(question,k=5)

In [61]:
for doc in docs:
    print(doc.metadata)

{'page': 8, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 5, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 5, 'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 11, 'source': '../docs/cs229_lectures/MachineLearning-Lecture02.pdf'}


In [62]:
print(docs[4].page_content)

and then step, and then update with the sec ond training example, a nd update all the theta 
Is, and then step? And is that why you get sort of this really – ?  
Instructor (Andrew Ng) :Let's see, right. So I'm going to look at my first training 
example and then I'm going to take a ste p, and then I'm going to perform the second


```markdown
# Only one returned chunk comes from the second lecture, and most likely only because the chunk text itself contains the word _second_

and then step, and then update with the sec ond training example, a nd update all the theta 
Is, and then step? And is that why you get sort of this really – ?  
Instructor (Andrew Ng) :Let's see, right. So I'm going to look at my first training 
example and then I'm going to take a ste p, and then I'm going to perform the _second_
```