# Retrieval: Similarity Search

In [None]:
# Run the line of code below to check which version of langchain is installed in the current environment.
# Substitute "langchain" with any other package name to check its version.

In [1]:
%pip show langchain

Name: langchain
Version: 0.3.26
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: C:\Users\Marcus\anaconda3\envs\langchain_env_py312\Lib\site-packages
Requires: langchain-core, langchain-text-splitters, langsmith, pydantic, PyYAML, requests, SQLAlchemy
Required-by: langchain-community
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Load environment variables (such as OPENAI_API_KEY) from the .env file
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [7]:
# Import the OpenAIEmbeddings, Chroma, and Document classes
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

In [8]:
# Define embedding object
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

In [9]:
# Define vectorstore object
vectorstore = Chroma(persist_directory="./intro-to-ds-lectures",
                     embedding_function=embedding)

In [11]:
# Create a chunk (Document) whose content and metadata duplicate a chunk already stored in the database
added_document = Document(page_content='Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})

In [12]:
# Add chunk to the vectorstore to create a duplicate in the database
vectorstore.add_documents([added_document])

['83dca477-c444-4421-82ed-37298736797a']

In [13]:
# Define a question related to data science
question = "What programming languages do data scientists use?"

In [14]:
# Retrieve the 5 chunks whose embeddings are most similar to the question
retrieved_docs = vectorstore.similarity_search(query=question,
                                               k=5)
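
To make the ranking step less of a black box, here is a minimal sketch of a top-k similarity search in plain Python. It is not Chroma's implementation: the three-dimensional vectors, the chunk texts, and the helper names `cosine_similarity` and `top_k` are all made up for illustration, and cosine similarity is assumed as the metric (the metric Chroma actually applies depends on how the collection is configured).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k):
    """Return the texts of the k stored chunks most similar to query_vec.

    `stored` is a list of (text, vector) pairs (hypothetical toy data).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "vector store": hand-written vectors, not real embeddings.
toy_store = [
    ("chunk about Python and R", [0.9, 0.1, 0.0]),
    ("chunk about Python and R", [0.9, 0.1, 0.0]),  # deliberate duplicate
    ("chunk about Excel", [0.2, 0.8, 0.1]),
    ("chunk about history", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_store, k=3)
print(results)
```

Because the ranking looks only at similarity to the query, identical chunks receive identical scores and can occupy several of the k slots.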

In [15]:
retrieved_docs

[Document(id='2c0d8284-119a-416e-a499-aa4f6e44cc08', metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'}, page_content='What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data'),
 Document(id='02e2e8aa-649b-426f-a407-075ab6953e7d', metadata={'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need', 'Course Title': 'Introduction to Data and Data Science'}, page_content='What about big data? Apart from R and Python, people working in this area are often proficient in other languag
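
The first two results above have different ids but the same page content, because the chunk was added to the database a second time. One way to collapse such repeats after retrieval is sketched below. This is an illustrative post-processing step, not something the notebook or LangChain does for you: `FakeDocument` is a hypothetical stand-in for `Document` so the example runs without an API key, and `drop_duplicate_chunks` is an assumed helper name.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_duplicate_chunks(docs):
    """Keep the first document for each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

sample_docs = [
    FakeDocument("What about big data?"),
    FakeDocument("What about big data?"),  # same text, stored twice
    FakeDocument("Knowing a programming language..."),
]
unique_docs = drop_duplicate_chunks(sample_docs)
print(len(sample_docs), "retrieved,", len(unique_docs), "unique")
```

Since the function only reads `page_content`, it should work the same way on the real `retrieved_docs` list. Note that deduplicating after retrieval leaves fewer than k results.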

In [17]:
# Print only the page content and lecture title of each retrieved document
for doc in retrieved_docs:
    print(f"Page Content: {doc.page_content}\n----------\nLecture Title:{doc.metadata['Lecture Title']}\n")

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
----------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page Content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data
----------
Lecture Title:Programming Langua