# Vectorstores and Embeddings

In [4]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
# from langchain_core.vectorstores import InMemoryVectorStore
from langchain_cohere import CohereEmbeddings
import numpy as np

os.environ["GOOGLE_API_KEY"] = os.environ["GEMINI_API_KEY"]
llm_model = "gemini-2.0-flash-lite"  # alternative: "gemma-3-27b-it"

llm = ChatGoogleGenerativeAI(
    model=llm_model,
    temperature=0.9,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

gembeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

# vector_store = InMemoryVectorStore(embeddings)

cembeddings = CohereEmbeddings(model="embed-english-v3.0")

In [6]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

embedding1 = gembeddings.embed_query(sentence1)
embedding2 = gembeddings.embed_query(sentence2)
embedding3 = gembeddings.embed_query(sentence3)

print(np.dot(embedding1, embedding2))
print(np.dot(embedding1, embedding3))
print(np.dot(embedding2, embedding3))

0.917833118609301
0.348090868656266
0.33660261297654337
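
The scores above are raw dot products. A dot product equals cosine similarity only when both vectors have unit length, which is worth checking rather than assuming for any given embedding model. A minimal sketch with made-up toy vectors (not real embeddings) that normalizes explicitly:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
v_dogs = [3.0, 4.0, 0.0]
v_canines = [6.0, 8.0, 0.0]  # same direction as v_dogs, twice the length
v_weather = [0.0, 0.0, 5.0]  # orthogonal to both

print(cosine_similarity(v_dogs, v_canines))
print(cosine_similarity(v_dogs, v_weather))
```

The raw dot product of `v_dogs` and `v_canines` is 50, while their cosine similarity is 1: the normalized score depends only on direction, not on vector length.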


In [7]:
embedding1 = cembeddings.embed_query(sentence1)
embedding2 = cembeddings.embed_query(sentence2)
embedding3 = cembeddings.embed_query(sentence3)

print(np.dot(embedding1, embedding2))
print(np.dot(embedding1, embedding3))
print(np.dot(embedding2, embedding3))

0.89638375773275
0.18742213262546126
0.13767806888275178
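
Raw scores from different embedding models sit on different scales, so comparing 0.92 with 0.90 directly says little. One rough way to compare the two runs, offered as an illustration rather than a standard metric, is the gap between the related pair and the best-scoring unrelated pair. The numbers are the outputs printed above, rounded to four places; `separation` is a hypothetical helper name.

```python
def separation(related, unrelated_scores):
    """Gap between the related-pair score and the highest unrelated-pair score."""
    return related - max(unrelated_scores)

# Scores copied from the two cells above (rounded to four decimal places).
google = separation(0.9178, [0.3481, 0.3366])
cohere = separation(0.8964, [0.1874, 0.1377])

print(f"Google text-embedding-004 gap: {google:.4f}")
print(f"Cohere embed-english-v3.0 gap: {cohere:.4f}")
```

On these three sentences the Cohere model separates the related pair from the unrelated ones more widely, but three sentences are far too few to rank the models.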


## Vectorstores

In [14]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split into overlapping chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum characters per chunk
    chunk_overlap=150   # characters shared between consecutive chunks
)

splits = text_splitter.split_documents(docs)
len(splits)

208
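
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and newlines); it only illustrates how consecutive chunks share `chunk_overlap` characters. The function name `sliding_window_split` is made up for this sketch.

```python
def sliding_window_split(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_window_split("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.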

In [17]:
from langchain_community.vectorstores import Chroma

persist_directory = 'docs/chroma/'
# !rm -rf ./docs/chroma  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=gembeddings,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

208


### Similarity Search

In [18]:
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)

print(len(docs))

print(docs[0].page_content)

3
So all right, online resources. The class has a home page, so it's in on the handouts. I 
won't write on the chalkboard — http:// cs229.stanford.edu. And so when there are 
homework assignments or things like that, we usually won't sort of — in the mission of 
saving trees, we will usually not give out many handouts in class. So homework 
assignments, homework solutions will be posted online at the course home page.  
As far as this class, I've also written, and I guess I've also revised every year a set of 
fairly detailed lecture notes that cover the technical content of this class. And so if you 
visit the course homepage, you'll also find the detailed lecture notes that go over in detail 
all the math and equations and so on that I'll be doing in class.  
There's also a newsgroup, su.class.cs229, also written on the handout. This is a 
newsgroup that's sort of a forum for people in the class to get to know each other and 
have whatever discussions you want to have amongst yoursel
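
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors score highest against it. A real vector store typically uses an index instead of scanning every vector; the sketch below is a brute-force stand-in with hand-written 2-D vectors in place of real embeddings, and `top_k` is a hypothetical helper, not a Chroma or LangChain function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, store, k):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hand-written vectors standing in for embeddings of three chunks.
store = [
    ("course email and contact details", [0.9, 0.1]),
    ("homework is done in MATLAB", [0.2, 0.8]),
    ("linear regression and gradient descent", [0.1, 0.9]),
]

results = top_k([1.0, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The ranking depends entirely on how the embedding model places the question relative to the chunks; the store always returns the `k` highest-scoring chunks, whether or not any of them actually answers the question.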

In [19]:
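# Since Chroma 0.4.x, data is persisted automatically when persist_directory is set;
# this explicit call is deprecated and only emits a warning.
vectordb.persist()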


### Failure modes

In [20]:
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)

In [21]:
docs[0]

Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'page': 8, 'total_pages': 22, 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'page_label': '9', 'creator': 'PScript5.dll Version 5.2.2', 'source': 'docs/MachineLearning-Lecture01.pdf', 'moddate': '2008-07-11T11:25:23-07:00', 'title': ''}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB li

In [22]:
docs[1]

Document(metadata={'page_label': '9', 'author': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'source': 'docs/MachineLearning-Lecture01.pdf', 'creationdate': '2008-07-11T11:25:23-07:00', 'moddate': '2008-07-11T11:25:23-07:00', 'total_pages': 22, 'creator': 'PScript5.dll Version 5.2.2', 'title': '', 'page': 8}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB li

Notice that `docs[0]` and `docs[1]` are identical chunks: `MachineLearning-Lecture01.pdf` was loaded twice, so every chunk from it is in the index twice.

Semantic search returns the most similar chunks but does not enforce diversity, so a duplicate takes up a result slot that a different, informative chunk could have filled.
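
One way to avoid this particular failure is to drop exact duplicate chunks before indexing. The sketch below is a minimal illustration that uses plain strings in place of LangChain `Document` objects; `dedupe_chunks` is a hypothetical helper. It only catches byte-identical text, so near-duplicates would still need a diversity-aware retrieval method.

```python
import hashlib

def dedupe_chunks(chunks):
    """Keep the first occurrence of each chunk text, preserving order."""
    seen = set()
    unique = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

chunks = [
    "homeworks will be done in either MATLAB or in Octave",
    "homeworks will be done in either MATLAB or in Octave",  # duplicate from loading the PDF twice
    "linear regression, gradient descent, and the normal equations",
]
unique_chunks = dedupe_chunks(chunks)
print(len(chunks), "->", len(unique_chunks))
```

With `Document` objects the same idea would apply to `doc.page_content`.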

In [23]:
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)

for doc in docs:
    print(doc.metadata)

{'page': 0, 'total_pages': 16, 'moddate': '2008-07-11T11:25:03-07:00', 'creator': 'PScript5.dll Version 5.2.2', 'source': 'docs/MachineLearning-Lecture03.pdf', 'creationdate': '2008-07-11T11:25:03-07:00', 'author': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'title': '', 'page_label': '1'}
{'creator': 'PScript5.dll Version 5.2.2', 'author': '', 'title': '', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 14, 'source': 'docs/MachineLearning-Lecture03.pdf', 'total_pages': 16, 'page_label': '15', 'creationdate': '2008-07-11T11:25:03-07:00'}
{'author': '', 'moddate': '2008-07-11T11:25:03-07:00', 'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 13, 'creator': 'PScript5.dll Version 5.2.2', 'total_pages': 16, 'title': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'page_label': '14', 'producer': 'Acrobat Distiller 8.1.0 (Windows)'}
{'creator': 'PScript5.dll Version 5.2.2', 'author': '', 'moddate': '2008-07-11T11:25:05-07:00', 

In [24]:
print(docs[3])

page_content='really makes a difference between a good solution and amazing solution. And to give 
everyone to just how we do points assignments, or what is it that causes a solution to get 
full marks, or just how to write amazing solutions. Becoming a grader is usually a good 
way to do that.  
Graders are paid positions and you also get free food, and it's usually fun for us to sort of 
hang out for an evening and grade all the assignments. Okay, so I will send email. So 
don't email me yet if you want to be a grader. I'll send email to the entire class later with 
the administrative details and to solicit applications. So you can email us back then, to 
apply, if you'd be interested in being a grader.  
Okay, any questions about that? All right, okay, so let's get started with today's material. 
So welcome back to the second lecture. What I want to do today is talk about linear 
regression, gradient descent, and the normal equations. And I should also say, lecture 
notes have been 
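
This query shows a second failure mode: "third lecture" is a structural constraint, but the embedding treats it as just more text, and the fourth result printed above opens with "welcome back to the second lecture". A common remedy is to filter on metadata. The sketch below post-filters results on the `source` field using plain dictionaries; the scores and the result list are made up for illustration, and `filter_by_source` is a hypothetical helper. LangChain's Chroma wrapper also accepts a `filter` argument to `similarity_search`, which applies the restriction inside the store instead.

```python
def filter_by_source(results, source):
    """Keep only results whose metadata 'source' matches exactly."""
    return [r for r in results if r["metadata"].get("source") == source]

# Made-up results shaped like the metadata printed above (scores are illustrative).
results = [
    {"score": 0.71, "metadata": {"source": "docs/MachineLearning-Lecture03.pdf", "page": 0}},
    {"score": 0.69, "metadata": {"source": "docs/MachineLearning-Lecture03.pdf", "page": 14}},
    {"score": 0.66, "metadata": {"source": "docs/MachineLearning-Lecture02.pdf", "page": 0}},
]

lecture3_only = filter_by_source(results, "docs/MachineLearning-Lecture03.pdf")
print([r["metadata"]["page"] for r in lecture3_only])
```

Post-filtering can leave fewer than `k` results, which is one reason to prefer filtering inside the store when the store supports it.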

## With Cohere Embeddings

In [25]:
persist_directory = 'docs/chroma_cohere/'
# !rm -rf ./docs/chroma_cohere  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=cembeddings,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)

print(len(docs))

print(docs[0].page_content)

208
3
cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework problems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thing that I think will help you to succeed and 
do well in this class and even help you to enjoy this class more is if you form a study 
group.  
So start looking around where you're sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study gro

In [26]:
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)

In [27]:
docs[0]

Document(metadata={'title': '', 'source': 'docs/MachineLearning-Lecture01.pdf', 'creationdate': '2008-07-11T11:25:23-07:00', 'total_pages': 22, 'page_label': '9', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2008-07-11T11:25:23-07:00', 'author': '', 'page': 8}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB li

In [28]:
docs[1]

Document(metadata={'total_pages': 22, 'page_label': '9', 'source': 'docs/MachineLearning-Lecture01.pdf', 'creationdate': '2008-07-11T11:25:23-07:00', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'author': '', 'title': '', 'creator': 'PScript5.dll Version 5.2.2', 'page': 8, 'moddate': '2008-07-11T11:25:23-07:00'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB li

In [29]:
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)

for doc in docs:
    print(doc.metadata)

{'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'total_pages': 16, 'page_label': '1', 'author': '', 'creationdate': '2008-07-11T11:25:03-07:00', 'title': '', 'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 0, 'moddate': '2008-07-11T11:25:03-07:00'}
{'author': '', 'total_pages': 16, 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'title': '', 'page_label': '7', 'source': 'docs/MachineLearning-Lecture03.pdf', 'moddate': '2008-07-11T11:25:03-07:00', 'page': 6, 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:03-07:00'}
{'total_pages': 16, 'creator': 'PScript5.dll Version 5.2.2', 'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'moddate': '2008-07-11T11:25:03-07:00', 'title': '', 'page_label': '15', 'author': '', 'page': 14, 'creationdate': '2008-07-11T11:25:03-07:00', 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'moddate': '2008-07-11T11:25:05-07:00', 'creationdate': '2008-07-11T11:25:05-07:00', 'producer': 

In [30]:
print(docs[3])

page_content='Instructor (Andrew Ng):All right, so who thought driving could be that dramatic, right? 
Switch back to the chalkboard, please. I should say, this work was done about 15 years 
ago and autonomous driving has come a long way. So many of you will have heard of the 
DARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, the winning 
team's drive a car across a desert by itself.  
So Alvin was, I think, absolutely amazing work for its time, but autonomous driving has 
obviously come a long way since then. So what you just saw was an example, again, of 
supervised learning, and in particular it was an example of what they call the regression 
problem, because the vehicle is trying to predict a continuous value variables of a 
continuous value steering directions, we call the regression problem.  
And what I want to do today is talk about our first supervised learning algorithm, and it 
will also be to a regression task. So for the running example that I'm going to

In [32]:
print(docs[4])

page_content='into his office and he said, "Oh, professor, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's helped me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "Wow. That's great. I'm glad to hear this 
machine learning stuff was actually useful. So what was it that you learned? Was it 
logistic regression? Was it the PCA? Was it the data networks? What was it that you 
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."  
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutorial in one of the discussion sections for 
those of you that don't know it.  
Okay. The very last piece of logistical thing is the discussion sections. So discussion 
sections will be taught by the TAs, and attendance at discussion