# Vector Stores and Embeddings

In [5]:
%pip install openai

Note: you may need to restart the kernel to use updated packages.


In [7]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

## Load PDF

In [8]:
## Load PDF
from langchain.document_loaders import PyPDFLoader

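loaders = [
    # Lecture01 is listed twice, so its chunks are indexed twice.
    # This is what produces the duplicate results under "Failure Modes".
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),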
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]

docs = []

for loader in loaders:
    docs.extend(loader.load())

## Split

In [10]:
## Split
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

splits = text_splitter.split_documents(docs)

len(splits)

208
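
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally (that class also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# Hypothetical sample input, small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.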

## Embeddings

In [12]:
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

sentence1 = "i like dogs"
sentence2 = "i like cannies"
sentence3 = "the weather is urgly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)


In [13]:
import numpy as np

In [14]:
np.dot(embedding1, embedding2)

0.8414146376677061

In [15]:
np.dot(embedding1, embedding3)

0.77658456780201
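
The dot products above act as similarity scores: a higher value means the two sentences sit closer together in embedding space. A raw dot product is only a fair comparison when the vectors have the same length, so a common alternative is cosine similarity, which divides by both norms. The sketch below uses only the standard library and toy 3-dimensional vectors that are made up for illustration; they are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values): vec_a and vec_b point in similar
# directions, vec_c points somewhere else.
vec_a = [0.9, 0.1, 0.0]
vec_b = [0.8, 0.2, 0.1]
vec_c = [0.1, 0.2, 0.9]

sim_related = cosine_similarity(vec_a, vec_b)
sim_unrelated = cosine_similarity(vec_a, vec_c)
print(sim_related > sim_unrelated)
```

For vectors that are already unit length, the dot product and the cosine similarity are the same number.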

## Vector Stores

In [16]:
%pip install chromadb


In [24]:
from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/'

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [25]:
print(vectordb._collection.count())

208


## Similarity Search

In [26]:
question = "is there an email i can ask for help"

In [28]:
docs = vectordb.similarity_search(question, k=3)

In [29]:
len(docs)

3
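
Conceptually, a similarity search embeds the question, scores it against every stored vector, and returns the `k` best matches. The sketch below does this by brute force over a tiny in-memory list. It is an assumption-laden illustration, not Chroma's implementation: real vector stores typically use an approximate nearest-neighbour index instead of scanning everything, and the texts and 2-dimensional vectors here are invented.

```python
import heapq

def top_k(query_vec, stored, k):
    """Return the k (score, text) pairs with the highest dot product."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in stored
    ]
    return heapq.nlargest(k, scored)

# Hypothetical store of (text, vector) pairs.
store = [
    ("email for questions", [0.9, 0.1]),
    ("study groups", [0.5, 0.5]),
    ("MATLAB homework", [0.1, 0.9]),
    ("grading policy", [0.3, 0.4]),
]
query_vec = [1.0, 0.0]

hits = top_k(query_vec, store, k=3)
for score, text in hits:
    print(round(score, 2), text)
```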

In [30]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework problems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thing that I think will help you to succeed and \ndo well in this class and even help you to enjoy this class more is if you form a study \ngroup.  \nSo start looking around where you're sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to form st

In [31]:
vectordb.persist()



## Failure Modes

In [32]:
question = "what did they say about matlab?"

In [33]:
docs = vectordb.similarity_search(question)

In [34]:
len(docs)

4

In [35]:
docs[0]

Document(metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat 

In [36]:
docs[1]

Document(metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat 
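
`docs[0]` and `docs[1]` are the same chunk, because the same PDF was loaded twice and plain similarity search has no notion of redundancy. One way to cope, sketched below, is to drop exact duplicates after retrieval. The `Doc` class is a hypothetical stand-in for a LangChain `Document` (it only mirrors the `page_content` and `metadata` attributes), and the sample texts are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_duplicate_chunks(results):
    """Keep the first occurrence of each distinct page_content, in order."""
    seen = set()
    unique = []
    for doc in results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

# Invented results that mimic the duplicate seen above.
results = [
    Doc("chunk about MATLAB", {"page": 8}),
    Doc("chunk about MATLAB", {"page": 8}),
    Doc("chunk about Octave", {"page": 9}),
]
unique_results = drop_duplicate_chunks(results)
print(len(results), "->", len(unique_results))
```

This only catches exact copies. Near-duplicates need a diversity-aware retrieval strategy or deduplication before indexing.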

## Failures


In [37]:
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)

In [38]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 17, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
{'page': 13, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
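
The question asked specifically about the third lecture, yet two of the five results come from Lecture01 and Lecture02. The embedding captures the topic "regression" but not the structured constraint "third lecture". A minimal sketch of one remedy is to filter on the `source` metadata. The helper name `filter_by_source` is made up, and the list below simply copies the metadata printed above.

```python
def filter_by_source(metadatas, source):
    """Keep only the metadata entries whose 'source' matches exactly."""
    return [m for m in metadatas if m["source"] == source]

# Metadata of the five results shown above.
retrieved = [
    {"page": 0, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 17, "source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf"},
    {"page": 8, "source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf"},
    {"page": 13, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 6, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
]

lecture3_only = filter_by_source(
    retrieved, "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
)
print([m["page"] for m in lecture3_only])
```

Filtering after retrieval can leave fewer than `k` results. Many vector stores can apply a metadata filter at query time instead; check the documentation of the installed version for the exact argument.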


In [39]:
print(docs[4].page_content)

data sets as well. So don’t want to talk about that. If you’re interested, look up the work 
of Andrew Moore on KD-trees. He, sort of, figured out ways to fit these models much 
more efficiently. That’s not something I want to go into today. Okay? Let me move one. 
Let’s take more questions later.  
So, okay. So that’s locally weighted regression. Remember the outline I had, I guess, at 
the beginning of this lecture. What I want to do now is talk about a probabilistic 
interpretation of linear regression, all right? And in particular of the – it’ll be this 
probabilistic interpretation that let’s us move on to talk about logistic regression, which 
will be our first classification algorithm. So let’s put aside locally weighted regression for 
now. We’ll just talk about ordinary unweighted linear regression. Let’s ask the question 
of why least squares, right? Of all the things we could optimize how do we come up with 
this criteria for minimizing the square of the area between the pre