# Week 2

Welcome to the class!

We first install the required packages: LangChain and `pypdf`, which LangChain's PDF loader relies on.

In [1]:
!pip install --upgrade pip
!pip install --upgrade langchain pypdf



We then import LangChain's `pypdf`-based loader, `PyPDFLoader`.

In [2]:
from langchain.document_loaders import PyPDFLoader

Let's first load our PDF. `load_and_split()` reads the file and splits it into chunks, returning a list of documents.

In [3]:
loader = PyPDFLoader("Data/2021-census-population-occupied-private-dwellings-community-2001-2021.pdf")

In [4]:
pdf = loader.load_and_split()
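
To get a feel for what "splitting" means, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not LangChain's actual splitter; the function name `split_text` and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
def split_text(text, chunk_size=20, overlap=5):
    """Split text into fixed-size character chunks that overlap (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz0123456789"
chunks = split_text(sample, chunk_size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means the end of one chunk reappears at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.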

Having loaded the PDF, let's now load the CSV. `CSVLoader` produces one document per row.

In [5]:
from langchain.document_loaders.csv_loader import CSVLoader

In [6]:
loader = CSVLoader("Data/Urban_Design_and_Architecture_Awards_Recipients.csv")

In [7]:
csv = loader.load()
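
The retrieval output at the end of this notebook shows each CSV row rendered as `COLUMN: value` lines, with the first column appearing as `\ufeffX`. That prefix suggests the file starts with a byte-order mark (BOM). The sketch below uses only the standard library to mimic the row-to-text step and to show how the `utf-8-sig` codec drops the BOM. The two-column sample and the helper `rows_to_texts` are hypothetical. `CSVLoader` also accepts an `encoding` argument, so passing `encoding="utf-8-sig"` is likely the equivalent fix, though that is not tested here.

```python
import csv
import io

# Hypothetical two-column sample; the leading "\ufeff" mimics a byte-order mark.
raw_bytes = "\ufeffX,AWARD_YEAR\n591976.7816,2007\n".encode("utf-8")

def rows_to_texts(data, encoding):
    """Turn each CSV row into 'column: value' lines, one string per row."""
    reader = csv.DictReader(io.StringIO(data.decode(encoding)))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]

with_bom = rows_to_texts(raw_bytes, "utf-8")         # BOM stays attached to the first column name
without_bom = rows_to_texts(raw_bytes, "utf-8-sig")  # "utf-8-sig" strips the BOM while decoding

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```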

Having loaded our data, we'll now download and load the embedding model.

In [8]:
!pip install sentence_transformers > /dev/null

In [9]:
from langchain.embeddings import HuggingFaceEmbeddings

In [10]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Let's try embedding some text. The result is a vector of floats, 384 of them for `all-MiniLM-L6-v2`, so the output shown here is truncated. Once you've tried it, scroll down to continue.

In [11]:
text = "This is a test document."

In [12]:
embeddings.embed_query(text)

[-0.03833852708339691,
 0.1234646886587143,
 -0.028642937541007996,
 0.05365273728966713,
 0.00884535163640976,
 -0.03983931988477707,
 -0.07300585508346558,
 0.04777132347226143,
 -0.030462520197033882,
 0.05497976765036583,
 0.08505293726921082,
 0.0366566926240921,
 -0.005319987423717976,
 -0.0022331285290420055,
 -0.06071098893880844,
 -0.027237899601459503,
 -0.011351611465215683,
 -0.042437728494405746,
 0.009129906073212624,
 0.100815549492836,
 0.07578731328248978,
 0.06911718100309372,
 0.009857481345534325,
 -0.0018377420492470264,
 0.026249045506119728,
 0.032902419567108154,
 -0.07177435606718063,
 0.028384260833263397,
 0.061709530651569366,
 -0.052529558539390564,
 0.03366165980696678,
 0.07446815818548203,
 0.07536036521196365,
 0.03538402169942856,
 0.06713403761386871,
 0.010798039846122265,
 0.08167023211717606,
 0.01656291075050831,
 0.03283059597015381,
 0.03632567450404167,
 0.002172845648601651,
 -0.09895741194486618,
 0.005046740174293518,
 0.05089650675654411,
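
A vector like this is mainly useful for comparing texts: similar texts should get vectors pointing in similar directions. A common way to measure that is cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of real embeddings; the variable names and values are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors standing in for real embeddings.
doc_about_art = [0.9, 0.1, 0.0]
doc_about_census = [0.0, 0.2, 0.9]
query_vec = [0.8, 0.2, 0.1]

art_score = cosine_similarity(query_vec, doc_about_art)
census_score = cosine_similarity(query_vec, doc_about_census)
print(art_score > census_score)
```

With real embeddings you would pass the lists returned by `embeddings.embed_query(...)` instead of the hand-written vectors.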
 

We now have a working embedding function. Let's install Chroma, the vector database we'll use to store the embeddings.

In [13]:
!pip install -U chromadb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting chromadb
  Downloading chromadb-0.4.19-py3-none-any.whl.metadata (7.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp311-cp311-macosx_10_9_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.105.0-py3-none-any.whl.metadata (24 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.24.0.post1-py3-none-any.whl.metadata (6.4 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.1.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting pulsar-client>=3.1.0 (from chromadb)
  Downloading pulsar_client-3.3.0-cp311-cp311-macosx_10_15_universal2.whl.metadata (1.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.16.3-cp311-cp311-macosx_10_15_x86_64.whl.metadata (4.3 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.21.0-py3-none-any.whl.metadata (1.4 kB)
Collecting opentele
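
The `huggingface/tokenizers` warning at the top of this output is harmless here, and it names its own remedy: the `TOKENIZERS_PARALLELISM` environment variable. A minimal sketch, assuming you run it in a cell near the top of the notebook, before the embedding model is first used:

```python
import os

# Set this before tokenizers is first used in the process;
# "false" turns off its internal parallelism, which should silence the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```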

In [14]:
from langchain.vectorstores import Chroma

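Let's make a vector store for our loaded documents! We build it from the PDF chunks, then add the CSV rows; `add_documents` returns the IDs it assigned.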

In [15]:
db = Chroma.from_documents(pdf, embeddings)
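
Conceptually, a vector store keeps each text next to its embedding and answers queries by finding the closest stored vectors. The toy version below does this with a brute-force scan. It is a sketch under stated assumptions, not how Chroma works internally (the install log above lists `chroma-hnswlib`, which points to an index rather than a linear scan). `ToyVectorStore` and `toy_embed` are hypothetical names, and `toy_embed` is a word counter, not a real model.

```python
import math

def _cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """In-memory store: keeps (vector, text) pairs and searches by brute force."""

    def __init__(self, embed):
        self.embed = embed  # function mapping text -> list of floats
        self.items = []     # list of (vector, text) pairs

    def add_texts(self, texts):
        for text in texts:
            self.items.append((self.embed(text), text))

    def similarity_search(self, query, k=2):
        q = self.embed(query)
        ranked = sorted(self.items, key=lambda item: _cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def toy_embed(text):
    """Hypothetical embedding: counts of a few hand-picked words."""
    vocab = ["art", "award", "census", "population"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

store = ToyVectorStore(toy_embed)
store.add_texts(["census population tables", "award for public art", "population growth"])
results = store.similarity_search("an award concerning art", k=1)
print(results)
```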

In [16]:
db.add_documents(csv)

['354fb36a-9a28-11ee-8b53-acde48001122',
 '354fb450-9a28-11ee-8b53-acde48001122',
 '354fb4aa-9a28-11ee-8b53-acde48001122',
 '354fb4e6-9a28-11ee-8b53-acde48001122',
 '354fb536-9a28-11ee-8b53-acde48001122',
 '354fb57c-9a28-11ee-8b53-acde48001122',
 '354fb5b8-9a28-11ee-8b53-acde48001122',
 '354fb5fe-9a28-11ee-8b53-acde48001122',
 '354fb630-9a28-11ee-8b53-acde48001122',
 '354fb66c-9a28-11ee-8b53-acde48001122',
 '354fb6a8-9a28-11ee-8b53-acde48001122',
 '354fb6da-9a28-11ee-8b53-acde48001122',
 '354fb70c-9a28-11ee-8b53-acde48001122',
 '354fb73e-9a28-11ee-8b53-acde48001122',
 '354fb766-9a28-11ee-8b53-acde48001122',
 '354fb7a2-9a28-11ee-8b53-acde48001122',
 '354fb7d4-9a28-11ee-8b53-acde48001122',
 '354fb806-9a28-11ee-8b53-acde48001122',
 '354fb838-9a28-11ee-8b53-acde48001122',
 '354fb86a-9a28-11ee-8b53-acde48001122',
 '354fb89c-9a28-11ee-8b53-acde48001122',
 '354fb8ce-9a28-11ee-8b53-acde48001122',
 '354fb900-9a28-11ee-8b53-acde48001122',
 '354fb932-9a28-11ee-8b53-acde48001122',
 '354fb964-9a28-

Let's try retrieving a relevant document. `similarity_search` embeds the query and returns the stored documents closest to it.

In [17]:
query = "An award concerning art."
db.similarity_search(query)

[Document(page_content='\ufeffX: 591976.7816\nY: 4790547.4424\nOBJECTID: 86\nAWARD_WINNER: The James North Art Crawl\nPROJECT_DESCRIPTION: On the second Friday evening of every month this event programs the historic James Street North streetscape from Murray to King Street with an eclectic array of gallery openings, performances and outdoor art reflective of the emerging arts community in t\nRECIPIENT: The Gallery and Studies of the James North Community\nAWARD_YEAR: 2007\nCATEGORY: Award of Merit for Visionary Project\nLOCATION: James Street North between Murray and King Streets\nCOMMUNITY: Hamilton\nLATITUDE: 43.2621251\nLONGITUDE: -79.8667415', metadata={'row': 85, 'source': 'Data/Urban_Design_and_Architecture_Awards_Recipients.csv'}),
 Document(page_content='\ufeffX: 591942.8531\nY: 4790007.4241\nOBJECTID: 45\nAWARD_WINNER: Empire Times\nPROJECT_DESCRIPTION: The project is an adaptive re-use of an historic building into a performing arts centre and affordable housing for artists. T
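
Each CSV result carries its row as `KEY: value` lines in `page_content`, so individual fields can be pulled back out with a little string handling. The helper below is a hypothetical sketch; the sample string is an excerpt of the first result above, shortened to a few fields. It assumes no value contains a newline.

```python
def parse_page_content(page_content):
    """Parse 'KEY: value' lines into a dict, splitting on the first ': ' only."""
    fields = {}
    # Drop a leading byte-order mark, if present, before splitting into lines.
    for line in page_content.lstrip("\ufeff").split("\n"):
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that do not look like 'KEY: value'
            fields[key] = value
    return fields

# Excerpt of the first result above, shortened to a few fields.
page_content = (
    "\ufeffX: 591976.7816\nY: 4790547.4424\nOBJECTID: 86\n"
    "AWARD_WINNER: The James North Art Crawl\nAWARD_YEAR: 2007\n"
    "CATEGORY: Award of Merit for Visionary Project\nCOMMUNITY: Hamilton"
)
fields = parse_page_content(page_content)
print(fields["AWARD_WINNER"], fields["AWARD_YEAR"])
```

In the notebook you would call it on `doc.page_content` for each `doc` returned by `db.similarity_search(query)`.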