## Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

<img src="fig3.jpg" width="600">

### Vectorstores
<img src="fig4.png" width="400">

### Embeddings

<img src="fig5.png" width="500">

<img src="fig6.png" width="600">

### Vector Store

<img src="fig7.png" width="550">

### Vector Store/Database

<img src="fig9.png" width="600">

In [115]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

We just discussed Document Loading and Splitting.

In [119]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [121]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [123]:
splits = text_splitter.split_documents(docs)

In [125]:
len(splits)

208
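To see what `chunk_overlap` does, here is a minimal sketch of overlapping fixed-size windows. It is a simplification, not `RecursiveCharacterTextSplitter` itself, which also tries to break on separators such as paragraphs and sentences. The helper `split_with_overlap` and the toy text are hypothetical names introduced only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: fixed-size character windows sharing `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. With `chunk_size = 1500` and `chunk_overlap = 150`, a sentence that straddles a chunk boundary has a better chance of appearing whole in at least one chunk.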

### Embeddings
Let's take our splits and embed them.

In [127]:
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [60]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [62]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [131]:
len(embedding1)

1536

In [64]:
import numpy as np

In [133]:
np.dot(embedding1, embedding2)

0.9630397143104905

In [135]:
np.dot(embedding1, embedding3)

0.7702742223497945

In [137]:
np.dot(embedding2, embedding3)

0.7590147808716893
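The raw dot products above can be compared directly only if the vectors have the same length. OpenAI documents its embeddings as normalized to length 1, in which case the dot product equals cosine similarity. The sketch below uses small made-up vectors (not real embeddings) to show the difference; `cosine_similarity` is a helper defined here, not a LangChain or OpenAI function.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors chosen for easy arithmetic; they are not embeddings.
toy_a = np.array([3.0, 4.0])
toy_b = np.array([6.0, 8.0])    # same direction as toy_a, twice the length
toy_c = np.array([4.0, -3.0])   # perpendicular to toy_a

print(np.dot(toy_a, toy_b))             # 50.0 (grows with vector length)
print(cosine_similarity(toy_a, toy_b))  # 1.0  (same direction)
print(cosine_similarity(toy_a, toy_c))  # 0.0  (unrelated direction)
```

If you swap in an embedding model whose vectors are not unit length, normalizing first (or using cosine similarity as above) keeps scores comparable.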

### Vectorstores

In [151]:
#!pip install chromadb
from langchain_community.vectorstores import Chroma

In [153]:
persist_directory = './docs/chroma'

In [155]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [156]:
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

In [157]:
print(vectordb._collection.count())

624


### Similarity Search

In [161]:
question = "is there an email i can ask for help"

In [163]:
docs = vectordb.similarity_search(question, k=3)  # k is the number of documents to return

In [165]:
len(docs)

3
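Conceptually, a similarity search embeds the question and returns the `k` stored chunks whose vectors score highest against it. The sketch below does this by brute force over a tiny made-up store. It is an illustration of the idea only: real vector stores typically use an approximate index rather than scanning every vector, and `top_k_by_dot_product` is a hypothetical helper, not part of Chroma or LangChain.

```python
import numpy as np


def top_k_by_dot_product(query_vec, stored_vecs, k):
    """Hypothetical helper: indices and scores of the k best-matching stored vectors."""
    scores = np.asarray(stored_vecs, dtype=float) @ np.asarray(query_vec, dtype=float)
    order = np.argsort(-scores)[:k]  # highest score first
    return order.tolist(), scores[order].tolist()


# Toy unit-length vectors standing in for chunk embeddings.
toy_store = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])
toy_query = np.array([1.0, 0.0])  # stands in for the embedded question

toy_indices, toy_scores = top_k_by_dot_product(toy_query, toy_store, k=3)
print(toy_indices)  # [0, 1, 2]
print(toy_scores)
```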

In [167]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework problems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thing that I think will help you to succeed and \ndo well in this class and even help you to enjoy this class more is if you form a study \ngroup.  \nSo start looking around where you're sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to form st

In [92]:
vectordb

<langchain_community.vectorstores.chroma.Chroma at 0x1ce06120830>

### Failure modes

This looks great, and basic similarity search gets you about 80% of the way there with very little effort.

However, a few failure modes can creep in.

Below are some edge cases you may run into. We will fix them in the next lesson.

In [172]:
question = "what did they say about matlab?"

In [174]:
docs = vectordb.similarity_search(question,k=5)

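Notice that we get duplicate chunks (because MachineLearning-Lecture01.pdf was loaded into the index twice).

Semantic search fetches all similar documents, but it does not enforce diversity.

`docs[0]` and `docs[1]` are identical.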

In [177]:
docs[0]

Document(metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat 

In [179]:
docs[1]

Document(metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat 
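One simple way to hide exact duplicates like the pair above is to filter results by their text after retrieval. This is a minimal sketch of that idea and not necessarily the approach the next lesson takes. `ToyDocument` is a stand-in for LangChain's `Document` so the example runs on its own, and `drop_exact_duplicates` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_exact_duplicates(results):
    """Hypothetical helper: keep the first result for each distinct page_content."""
    seen = set()
    unique = []
    for doc in results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


toy_results = [
    ToyDocument("chunk about MATLAB", {"page": 8}),
    ToyDocument("chunk about MATLAB", {"page": 8}),  # duplicate of the first
    ToyDocument("chunk about Octave", {"page": 9}),
]
toy_unique = drop_exact_duplicates(toy_results)
print([d.page_content for d in toy_unique])
# ['chunk about MATLAB', 'chunk about Octave']
```

Post-filtering only removes byte-for-byte repeats, and it can leave you with fewer than `k` results. Near-duplicates and the lack of diversity in the ranking itself need a different retrieval strategy.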

Here is another failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [181]:
question = "what did they say about regression in the third lecture?"

In [183]:
docs = vectordb.similarity_search(question,k=5)

In [185]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 17, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 17, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
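The embedding of the question captures "regression" but not the structured constraint "third lecture", so nothing restricts results to Lecture03. One way to express that constraint is to filter on the `source` metadata. The sketch below applies the filter after retrieval to the metadata printed above; `keep_only_source` is a hypothetical helper. LangChain's Chroma wrapper also appears to accept a `filter` argument on `similarity_search` that applies such a constraint inside the store; check the version you have installed before relying on it.

```python
def keep_only_source(metadatas, source):
    """Hypothetical helper: keep only entries whose 'source' matches exactly."""
    return [m for m in metadatas if m.get("source") == source]


# Metadata copied from the search results printed above.
toy_metadatas = [
    {"page": 0, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
    {"page": 17, "source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf"},
    {"page": 17, "source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf"},
]

toy_lecture3 = keep_only_source(
    toy_metadatas, "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
)
print(len(toy_lecture3))  # 3
```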


In [189]:
print(docs[3].page_content)

algebra. Okay, some of you look a little bit dazed, but this is our first learning hour. 
Aren't you excited? Any quick questions about this before we close for today?  
Student:[Inaudible].  
Instructor (Andrew Ng):Say that again.  
Student:What you derived, wasn't that just [inaudible] of X?  
Instructor (Andrew Ng):What inverse? 
Student:Pseudo inverse.  
Instructor (Andrew Ng):Pseudo inverse?  
Student:Pseudo inverse.  
Instructor (Andrew Ng):Yeah, I turns out that in cases, if X transpose X is not 
invertible, than you use the pseudo inverse minimized to solve this. But it turns out X 
transpose X is not invertible. That usually means your features were dependent. It usually 
means you did something like repeat the same feature twice in your training set. So if this 
is not invertible, it turns out the minimum is obtained by the pseudo inverses of the 
inverse.  
If you don't know what I just said, don't worry about it. It usually won't be a problem. 
Anything else?  
Student:On t

The approaches discussed in the next lesson can be used to address both of these problems!