# Week 2

Welcome to the class!

We first install the required software: LangChain and `pypdf`, the package that LangChain's PDF loader relies on.

In [1]:
!pip install --upgrade pip
!pip install --upgrade langchain pypdf

Collecting langchain
  Downloading langchain-0.1.15-py3-none-any.whl.metadata (13 kB)
Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl.metadata (7.4 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading SQLAlchemy-2.0.29-cp311-cp311-macosx_10_9_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.3-cp311-cp311-macosx_10_9_x86_64.whl.metadata (7.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Using cached jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.32 (from langchain)
  Downloading langchain_community-0.0.32-py3-none-any.whl.metadata (8.5 kB)
Collecting langchain-core<0.2.0,>=0.1.41 (from langchain)
  Downloading langchain_core-0.1.41-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langcha
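Because the install above is not pinned, the versions you get may differ from the ones in this log, and some imports later in the notebook depend on the version. A minimal sketch for checking what is installed, using only the standard library; `installed_version` is a hypothetical helper name, not part of LangChain:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this notebook relies on.
for name in ("langchain", "langchain-community", "pypdf"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If the versions differ a lot from the log above, expect deprecation warnings or moved imports.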

We then import LangChain's `pypdf`-based loader.

In [2]:
# Note: newer LangChain releases expose this loader from `langchain_community.document_loaders`.
from langchain.document_loaders import PyPDFLoader

Let's first load our PDF.

In [4]:
loader = PyPDFLoader("Data/2021-census-population-occupied-private-dwellings-community-2001-2021.pdf")

In [5]:
pdf = loader.load_and_split()

Take a look at the first document in the list, which holds the text of page one of the PDF.

In [6]:
pdf[0]

Document(page_content='2001 2006 2011 2016 2021\nCommunity Population Occupied Population Occupied Population Occupied Population Occupied Population Occupied\nPrivate Private Private Private Private\nDwellings Dwellings Dwellings Dwellings Dwellings\nAncaster 27,490 9,075 33,230 10,780 36,910 12,235 40,560 13,610 43,510 14,805\nDundas 24,385 9,080 24,700 9,365 24,910 9,910 24,285 9,920 24,150 9,990\nFlamborough 37,795 12,645 39,220 13,070 40,090 13,925 42,655 14,995 46,860 16,405\nGlanbrook 12,145 4,360 15,290 5,680 22,440 8,215 29,860 10,560 35,075 11,865\nHamilton 331,120 133,350 329,820 133,780 330,480 136,150 330,090 137,490 343,280 142,175\n Lower Hamilton 187,730 81,340 182,365 79,935 180,245 80,460 176,815 80,325 185,744 83,743\n Upper Hamilton 143,390 52,010 147,455 53,845 150,235 55,690 153,275 57,165 157,536 58,432\nStoney Creek 57,330 19,710 62,290 21,780 65,120 23,370 69,470 25,030 76,480 27,560\nTotal 490,265 188,165 504,550 194,455 519,950 203,805 536,920 211,605 569,355

Having loaded the PDF, let's now load the CSV. 

In [7]:
from langchain.document_loaders.csv_loader import CSVLoader

In [8]:
loader = CSVLoader("Data/Urban_Design_and_Architecture_Awards_Recipients.csv")

In [11]:
csv = loader.load()
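To get a feel for what the loader produces, here is a minimal standard-library sketch that turns each CSV row into one "document" made of `column: value` lines plus `source` and `row` metadata. It is only an illustration of the output format, not LangChain's implementation; `rows_to_documents` and the three-column sample are hypothetical. The module is imported as `csv_module` because the notebook reuses the name `csv` for the loaded documents.

```python
import csv as csv_module
import io


def rows_to_documents(raw_bytes, source, encoding="utf-8-sig"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    text = raw_bytes.decode(encoding)
    reader = csv_module.DictReader(io.StringIO(text))
    docs = []
    for index, row in enumerate(reader):
        # One "column: value" line per field, joined by single newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": index}})
    return docs


# A tiny sample that starts with a UTF-8 byte order mark (BOM).
sample = "\ufeffOBJECTID,AWARD_YEAR,COMMUNITY\n1,2021,Hamilton\n".encode("utf-8")

clean = rows_to_documents(sample, source="sample.csv")
with_bom = rows_to_documents(sample, source="sample.csv", encoding="utf-8")

print(repr(clean[0]["page_content"]))
print(repr(with_bom[0]["page_content"]))
```

Decoding with plain `utf-8` leaves a `\ufeff` character glued to the first column name, while `utf-8-sig` strips it. `CSVLoader` appears to accept an `encoding` argument as well, so passing `encoding="utf-8-sig"` there may remove the same character from the real documents; treat that as an assumption to verify against your installed version.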

Having loaded our data, we'll now download and load the embedding model.

In [12]:
!pip install sentence_transformers > /dev/null

In [13]:
# Only HuggingFaceEmbeddings is used below.
from langchain.embeddings import HuggingFaceEmbeddings

In [15]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Let's try embedding some text and observe the output: a list of floating-point numbers that represents the text. Once you've tried it, scroll down to continue.

In [16]:
text = "This is a test document."

In [17]:
embeddings.embed_query(text)

[-0.038338545709848404,
 0.12346471846103668,
 -0.028642959892749786,
 0.05365273728966713,
 0.008845352567732334,
 -0.039839353412389755,
 -0.07300589233636856,
 0.047771234065294266,
 -0.03046250157058239,
 0.05497973784804344,
 0.08505292981863022,
 0.03665667027235031,
 -0.005320012103766203,
 -0.0022332090884447098,
 -0.06071096286177635,
 -0.02723788283765316,
 -0.011351660825312138,
 -0.04243776947259903,
 0.00912997592240572,
 0.10081552714109421,
 0.0757872462272644,
 0.06911724805831909,
 0.009857500903308392,
 -0.0018377398373559117,
 0.026249051094055176,
 0.032902367413043976,
 -0.07177437096834183,
 0.02838427573442459,
 0.061709512025117874,
 -0.05252954363822937,
 0.03366173058748245,
 0.07446814328432083,
 0.07536028325557709,
 0.035384006798267365,
 0.06713408976793289,
 0.010798053815960884,
 0.08167026191949844,
 0.01656290702521801,
 0.032830651849508286,
 0.036325663328170776,
 0.0021728859283030033,
 -0.09895738214254379,
 0.005046775098890066,
 0.050896495580673
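A single vector like this is not very informative on its own; embeddings become useful when we compare them. A minimal sketch of cosine similarity, one common way to compare two embedding vectors, using only the standard library. The short vectors below are made up for illustration; real vectors from `all-MiniLM-L6-v2` have 384 values each.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # Same direction as v1.
v3 = [-3.0, 0.0, 1.0]  # Perpendicular to v1.

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

In the notebook you could pass the outputs of two `embeddings.embed_query(...)` calls to the same function: texts with similar meaning should score closer to 1 than unrelated texts.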

We now have a working embedding function. Let's install Chroma.

In [18]:
!pip install -U chromadb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl.metadata (7.3 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Using cached chroma_hnswlib-0.7.3-cp311-cp311-macosx_10_9_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.1-py3-none-any.whl.metadata (24 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.29.0-py3-none-any.whl.metadata (6.3 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting pulsar-client>=3.1.0 (from chromadb)
  Using cached pulsar_client-3.4.0-cp311-cp311-macosx_10_15_universal2.whl.metadata (1.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.17.1-cp311-cp311-macosx_11_0_universal2.whl.metadata (4.2 kB)
Collecting opentelemetry-api>=1.2.0 (from chr

In [19]:
# 👇 This first import is not in the Week 2 download notebook, but it is in the video.
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [20]:
# This cell is not in the Week 2 download notebook, but it is in the video.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

In [21]:
# This cell is not in the Week 2 download notebook, but it is in the video.
text_splitter.split_documents(csv)

[Document(page_content='\ufeffX: 592386.8305\nY: 4789657.2917\nOBJECTID: 1\nAWARD_WINNER: Procession\nPROJECT_DESCRIPTION: Ferguson Station is a heritage landmark in the Downtown core of the City Hamilton and is a reflection of the area’s rich history, culture and economic importance. The trains that helped bring prosperity to the area, no longer regularly pass through the\nRECIPIENT: Lester Coloma, Salvation Army, International Village BIA, Core Urban Inc\nAWARD_YEAR: 2021\nCATEGORY: Award of Merit for Urban Elements\nLOCATION: 250 King Street East\nCOMMUNITY: Hamilton\nLATITUDE: 43.254061\nLONGITUDE: -79.8618395', metadata={'source': 'Data/Urban_Design_and_Architecture_Awards_Recipients.csv', 'row': 0}),
 Document(page_content='\ufeffX: 596394.4785\nY: 4781830.8287\nOBJECTID: 2\nAWARD_WINNER: CONNECT Communities\nPROJECT_DESCRIPTION: A transitional residence for those recovering from acquired brain injuries or stroke – our client, Connect Communities Hamilton, is implementing a new t
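Notice that the rows come back looking unchanged. A simplified sketch of separator-based splitting helps explain why: the text is cut at a separator and the pieces are packed greedily into chunks of at most `chunk_size` characters. This is an illustration under the assumption that the separator is a blank line (`"\n\n"`), which is my understanding of the `CharacterTextSplitter` default; `split_text` below is a hypothetical stand-in, not LangChain's code.

```python
def split_text(text, separator="\n\n", chunk_size=1000):
    """Cut `text` at `separator`, then pack pieces into chunks of <= chunk_size chars."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Keep growing the current chunk (an oversized single piece is kept whole).
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# A CSV row only contains single newlines, so there is nothing to cut at.
row = "OBJECTID: 1\nAWARD_YEAR: 2021\nCOMMUNITY: Hamilton"
print(split_text(row))

# Two 600-character paragraphs do not fit in one 1000-character chunk.
long_text = "a" * 600 + "\n\n" + "b" * 600
print([len(chunk) for chunk in split_text(long_text)])
```

Under that assumption, each CSV row is a single piece and stays a single chunk, which matches the output above.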

Let's make a vector store for our loaded documents!

In [22]:
db = Chroma.from_documents(pdf, embeddings)

In [23]:
db.add_documents(csv)

['9320cfa3-e3f7-4144-b2d2-3cdc8073087e',
 'a0d940e5-eda3-48ba-9fb0-aebb69c9cf3d',
 'f5bda848-c90d-442b-a3ce-c92b4c7b72dd',
 '72e5b5b3-c3b8-4a76-9c09-b72def1f2342',
 '69b6727a-4560-4358-b5ee-3f5566062a06',
 '739761c5-239c-4673-bb1d-b736196dd089',
 '4bf3ed39-3d9a-47bb-84f1-8f0147cfa0dc',
 '5e259d6c-d499-4d04-9916-aa8dfd9a4b70',
 '5352535d-0506-4df9-8136-75ef32ed96d3',
 '0ea6281d-d607-4e84-aafd-92078999ecb5',
 'c9d51a83-531b-4273-8dd5-f3193862c1ef',
 '31f5a16d-92b8-40c4-8476-02946c876e66',
 'bd9ebf04-42a9-48b0-b744-65ffbdb9eedf',
 '691d1843-5623-4369-9ed1-95fc2ec377ce',
 'c11a74d8-107a-4e3a-80bd-36dfb25402a1',
 '0d157542-390d-429d-bffc-1e0da55cc030',
 '7923e722-0ee9-4e91-9bba-d7c3f077164e',
 'ea4e476b-4193-4112-9a14-3f2fe18243cd',
 '9c6ad8f6-dc60-4a4f-b311-ca2d1a616798',
 'c6300c0f-6968-4dfe-84ae-fd26517751e0',
 '282305d0-01b5-4a1c-b8ff-c2a09b8e40c9',
 'c64b5993-3b38-4ed6-a1b8-e0fe0fac7061',
 '45da1614-85af-4fc8-a37f-edc1cbfaa0f7',
 '34171ae0-fc5e-4cdd-8c88-4fd815203714',
 '6020e200-401e-

Let's try retrieving a relevant document.

In [24]:
query = "An award concerning art."
db.similarity_search(query)

[Document(page_content='\ufeffX: 591976.7816\nY: 4790547.4424\nOBJECTID: 86\nAWARD_WINNER: The James North Art Crawl\nPROJECT_DESCRIPTION: On the second Friday evening of every month this event programs the historic James Street North streetscape from Murray to King Street with an eclectic array of gallery openings, performances and outdoor art reflective of the emerging arts community in t\nRECIPIENT: The Gallery and Studies of the James North Community\nAWARD_YEAR: 2007\nCATEGORY: Award of Merit for Visionary Project\nLOCATION: James Street North between Murray and King Streets\nCOMMUNITY: Hamilton\nLATITUDE: 43.2621251\nLONGITUDE: -79.8667415', metadata={'row': 85, 'source': 'Data/Urban_Design_and_Architecture_Awards_Recipients.csv'}),
 Document(page_content='\ufeffX: 591942.8531\nY: 4790007.4241\nOBJECTID: 45\nAWARD_WINNER: Empire Times\nPROJECT_DESCRIPTION: The project is an adaptive re-use of an historic building into a performing arts centre and affordable housing for artists. T
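What happens behind `add_documents` and `similarity_search`? Conceptually, the store keeps one vector per document under a generated ID, embeds the query the same way, and returns the `k` closest documents. The toy store below sketches that idea with the standard library only. Everything in it is an assumption made for illustration: `toy_embed` is a word-count stand-in for the real embedding model, and `ToyVectorStore` is not how Chroma is implemented.

```python
import math
import uuid
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self._rows = {}  # id -> (text, vector)

    def add_texts(self, texts):
        """Store each text with its vector and return the generated IDs."""
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())
            self._rows[doc_id] = (text, toy_embed(text))
            ids.append(doc_id)
        return ids

    def similarity_search(self, query, k=4):
        """Return the k stored texts most similar to the query."""
        query_vector = toy_embed(query)
        ranked = sorted(self._rows.values(),
                        key=lambda row: cosine(query_vector, row[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
ids = store.add_texts([
    "award of merit for public art",
    "population and occupied private dwellings by community",
    "adaptive re-use of a historic building",
])
top = store.similarity_search("an award concerning art", k=1)
print(top)
```

The word-count vectors only match on shared words, whereas a real embedding model can also match on meaning, which is why the notebook's query finds the Art Crawl entry even though its wording differs from the query.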

In [25]:
query = "Where do most people live?"
db.similarity_search(query)

[Document(page_content='2001 2006 2011 2016 2021\nCommunity Population Occupied Population Occupied Population Occupied Population Occupied Population Occupied\nPrivate Private Private Private Private\nDwellings Dwellings Dwellings Dwellings Dwellings\nAncaster 27,490 9,075 33,230 10,780 36,910 12,235 40,560 13,610 43,510 14,805\nDundas 24,385 9,080 24,700 9,365 24,910 9,910 24,285 9,920 24,150 9,990\nFlamborough 37,795 12,645 39,220 13,070 40,090 13,925 42,655 14,995 46,860 16,405\nGlanbrook 12,145 4,360 15,290 5,680 22,440 8,215 29,860 10,560 35,075 11,865\nHamilton 331,120 133,350 329,820 133,780 330,480 136,150 330,090 137,490 343,280 142,175\n Lower Hamilton 187,730 81,340 182,365 79,935 180,245 80,460 176,815 80,325 185,744 83,743\n Upper Hamilton 143,390 52,010 147,455 53,845 150,235 55,690 153,275 57,165 157,536 58,432\nStoney Creek 57,330 19,710 62,290 21,780 65,120 23,370 69,470 25,030 76,480 27,560\nTotal 490,265 188,165 504,550 194,455 519,950 203,805 536,920 211,605 569,35
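Note that the search returns the census table as raw text; it finds the relevant document but does not answer the question itself. A minimal sketch of one possible follow-up step, assuming each data row has a community name followed by ten numbers (population and dwellings for 2001 through 2021), as in the output above. `population_2021` is a hypothetical helper, and `page` copies only a few rows from that output.

```python
page = (
    "2001 2006 2011 2016 2021\n"
    "Ancaster 27,490 9,075 33,230 10,780 36,910 12,235 40,560 13,610 43,510 14,805\n"
    "Dundas 24,385 9,080 24,700 9,365 24,910 9,910 24,285 9,920 24,150 9,990\n"
    "Hamilton 331,120 133,350 329,820 133,780 330,480 136,150 330,090 137,490 343,280 142,175\n"
    "Stoney Creek 57,330 19,710 62,290 21,780 65,120 23,370 69,470 25,030 76,480 27,560\n"
)


def population_2021(text):
    """Map each community name to its 2021 population (the ninth of ten numbers)."""
    result = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 11:
            continue  # Too short to be a name plus ten numbers (e.g. the year header).
        numbers = tokens[-10:]
        if not all(token.replace(",", "").isdigit() for token in numbers):
            continue  # Column-label lines are skipped.
        name = " ".join(tokens[:-10])
        result[name] = int(numbers[8].replace(",", ""))
    return result


populations = population_2021(page)
largest = max(populations, key=populations.get)
print(largest, populations[largest])
```

In later steps of a retrieval pipeline, this kind of interpretation is usually handed to a language model rather than hand-written parsing.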