## Running your own LLM

We first install the required packages: LangChain, `langchain-community`, and `pypdf`, which LangChain's PDF loaders depend on.

In [1]:
#!pip install --upgrade pip 
#!pip install --upgrade langchain pypdf
#!pip install -U langchain-community

We then import LangChain's `pypdf`-based loaders: one for a single file and one for a whole directory.

In [16]:
# The loaders live in the langchain-community package installed above.
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

Let's first load a single PDF and split it into chunks, then do the same for a directory of PDFs.

In [17]:
loader = PyPDFLoader("../data/2021-census-population-occupied-private-dwellings-community-2001-2021.pdf")

In [18]:
single_pdf = loader.load_and_split()

In [19]:
loader = PyPDFDirectoryLoader("../data/Supreme Court opinions 2014/")

In [20]:
many_pdfs = loader.load_and_split()
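
To make the "split" half of `load_and_split()` concrete, here is a minimal sketch of overlapping character-window chunking. It is a simplified illustration, not LangChain's implementation: the default splitter there (`RecursiveCharacterTextSplitter`) also tries to break on paragraph and sentence boundaries. The function name `split_text`, its parameters, and the sample text are all assumptions made for this example.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical page text (500 characters), standing in for one PDF page.
page = "word " * 100
chunks = split_text(page, chunk_size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk reappears at the start of the next, so a sentence cut at a chunk boundary is still fully present in at least one chunk.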

Having loaded both the single PDF and a directory of PDFs, let's now load the CSV. This step is optional: the cells below are commented out, so uncomment them if you want to run it.

In [7]:
#from langchain_community.document_loaders.csv_loader import CSVLoader

In [10]:
#loader = CSVLoader("../data/Urban_Design_and_Architecture_Awards_Recipients.csv")

In [11]:
#csv = loader.load()
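
As a rough picture of what a CSV loader produces, the sketch below turns each row into one small document of `column: value` lines, with the source file and row number kept as metadata. This mirrors how `CSVLoader` behaves as far as I know, but it is a standard-library illustration and not the library's code. The helper `rows_to_documents` and the sample columns are hypothetical, since the real file's columns are not shown in this notebook.

```python
import csv
import io


def rows_to_documents(csv_file, source: str) -> list[dict]:
    """Turn each CSV row into a dict with `page_content` and `metadata` keys."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(csv_file)):
        # One "column: value" line per field, joined into a single text block.
        content = "\n".join(
            f"{key.strip()}: {(value or '').strip()}" for key, value in row.items()
        )
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Hypothetical sample data with made-up columns.
sample = io.StringIO("Year,Project\n2019,Example Library\n2021,Example Park\n")
docs = rows_to_documents(sample, source="sample.csv")
print(docs[0]["page_content"])
```

One document per row keeps each record retrievable on its own, which suits tabular data better than splitting the file into arbitrary character windows.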

Having loaded our data, we'll now download and load the embedding model.

In [5]:
!pip install sentence_transformers > /dev/null

In [21]:
# The embedding wrappers also live in the langchain-community package.
from langchain_community.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

In [22]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Let's try embedding some text. Observe the output. Once you've tried it, scroll down to continue.

In [23]:
text = "This is a test document."

In [24]:
embeddings.embed_query(text)

[-0.03833850845694542,
 0.1234646588563919,
 -0.02864295430481434,
 0.05365273728966713,
 0.008845346979796886,
 -0.03983934596180916,
 -0.07300586998462677,
 0.04777126759290695,
 -0.03046250157058239,
 0.05497973784804344,
 0.08505290001630783,
 0.03665672987699509,
 -0.005319980438798666,
 -0.0022331627551466227,
 -0.06071098521351814,
 -0.027237940579652786,
 -0.011351647786796093,
 -0.04243769869208336,
 0.009129912592470646,
 0.10081558674573898,
 0.07578732818365097,
 0.06911719590425491,
 0.009857515804469585,
 -0.0018377389060333371,
 0.026249056681990623,
 0.03290240094065666,
 -0.07177440077066422,
 0.02838428132236004,
 0.061709512025117874,
 -0.052529554814100266,
 0.03366169333457947,
 0.07446811348199844,
 0.07536035776138306,
 0.03538399934768677,
 0.06713403761386871,
 0.010798030532896519,
 0.08167024701833725,
 0.01656290702521801,
 0.032830629497766495,
 0.03632565960288048,
 0.002172861248254776,
 -0.09895740449428558,
 0.005046738777309656,
 0.05089650675654411,
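
The output above is a plain Python list of floats (for `all-MiniLM-L6-v2` it should have 384 entries). What makes such vectors useful is that texts with similar meaning end up pointing in similar directions, which is commonly measured with cosine similarity. The sketch below implements that measure in pure Python. The three short vectors are made-up stand-ins for real embeddings, chosen only to illustrate the comparison.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical values, not real model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

With real embeddings you would pass the lists returned by two `embeddings.embed_query(...)` calls to the same function.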
 

We now have a working embedding function. Let's install Chroma.

In [25]:
#!pip install -U chromadb

In [26]:
# The Chroma wrapper is provided by the langchain-community package.
from langchain_community.vectorstores import Chroma

Let's build a vector store from the chunks of the single PDF!

In [27]:
db = Chroma.from_documents(single_pdf, embeddings)
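
Conceptually, a vector store keeps one vector per chunk and answers a query by ranking the stored vectors against the query's vector. The toy version below sketches that idea in pure Python. It is not how Chroma works internally, and the bag-of-words `embed` function is a deliberately crude, hypothetical stand-in for a real embedding model. The class `ToyVectorStore` and the sample texts are likewise invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and ranks them against a query."""

    def __init__(self):
        self._entries = []

    def add_texts(self, texts):
        for text in texts:
            self._entries.append((embed(text), text))

    def similarity_search(self, query, k=4):
        query_vec = embed(query)
        ranked = sorted(self._entries, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
        # Slicing returns at most k results, and fewer if the store is smaller than k.
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "population of Ancaster in 2021",
    "occupied private dwellings in Dundas",
    "jury testimony under the rules of evidence",
])
results = store.similarity_search("private dwellings in Dundas", k=4)
print(results[0])
```

Note that asking for four results from a store holding three texts simply returns three. A search always returns its nearest neighbours, however weak the match, so a small index can produce results unrelated to the query.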

In [33]:
%%time
#db.add_documents(many_pdfs)

CPU times: user 2 µs, sys: 2 µs, total: 4 µs
Wall time: 14.1 µs


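Let's try retrieving a relevant document. Because the `db.add_documents(many_pdfs)` call above is commented out, the index holds only the three chunks of the census PDF, so both queries below return census tables rather than Supreme Court opinions. Uncomment that call to index the opinions as well.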

In [34]:
query = "What exceptions does Rule 606(b)(1) contain?"
db.similarity_search(query)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


[Document(metadata={'page': 0, 'source': '../data/2021-census-population-occupied-private-dwellings-community-2001-2021.pdf'}, page_content='2001 2006 2011 2016 2021\nCommunity Population Occupied Population Occupied Population Occupied Population Occupied Population Occupied\nPrivate Private Private Private Private\nDwellings Dwellings Dwellings Dwellings Dwellings\nAncaster 27,490 9,075 33,230 10,780 36,910 12,235 40,560 13,610 43,510 14,805\nDundas 24,385 9,080 24,700 9,365 24,910 9,910 24,285 9,920 24,150 9,990\nFlamborough 37,795 12,645 39,220 13,070 40,090 13,925 42,655 14,995 46,860 16,405\nGlanbrook 12,145 4,360 15,290 5,680 22,440 8,215 29,860 10,560 35,075 11,865\nHamilton 331,120 133,350 329,820 133,780 330,480 136,150 330,090 137,490 343,280 142,175\n Lower Hamilton 187,730 81,340 182,365 79,935 180,245 80,460 176,815 80,325 185,744 83,743\n Upper Hamilton 143,390 52,010 147,455 53,845 150,235 55,690 153,275 57,165 157,536 58,432\nStoney Creek 57,330 19,710 62,290 21,780 65

In [32]:
query = "What documents include Sotomayor?"
db.similarity_search(query)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


[Document(metadata={'page': 0, 'source': '../data/2021-census-population-occupied-private-dwellings-community-2001-2021.pdf'}, page_content='2001 2006 2011 2016 2021\nCommunity Population Occupied Population Occupied Population Occupied Population Occupied Population Occupied\nPrivate Private Private Private Private\nDwellings Dwellings Dwellings Dwellings Dwellings\nAncaster 27,490 9,075 33,230 10,780 36,910 12,235 40,560 13,610 43,510 14,805\nDundas 24,385 9,080 24,700 9,365 24,910 9,910 24,285 9,920 24,150 9,990\nFlamborough 37,795 12,645 39,220 13,070 40,090 13,925 42,655 14,995 46,860 16,405\nGlanbrook 12,145 4,360 15,290 5,680 22,440 8,215 29,860 10,560 35,075 11,865\nHamilton 331,120 133,350 329,820 133,780 330,480 136,150 330,090 137,490 343,280 142,175\n Lower Hamilton 187,730 81,340 182,365 79,935 180,245 80,460 176,815 80,325 185,744 83,743\n Upper Hamilton 143,390 52,010 147,455 53,845 150,235 55,690 153,275 57,165 157,536 58,432\nStoney Creek 57,330 19,710 62,290 21,780 65
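
Both queries returned the same census page because a nearest-neighbour search always returns something, even when nothing in the index is relevant. One way to catch this is to look at the distance of each hit and discard hits that are too far away. The sketch below shows that filtering step on plain `(document, distance)` pairs. It assumes the pairs come from a call such as `db.similarity_search_with_score(query)`, where a lower distance means a closer match. The helper `filter_by_distance`, the cutoff of `1.0`, and the sample distances are all hypothetical and would need tuning on real data.

```python
def filter_by_distance(scored_results, max_distance):
    """Keep only (document, distance) pairs whose distance is at or below max_distance."""
    return [(doc, dist) for doc, dist in scored_results if dist <= max_distance]


# Hypothetical (document, distance) pairs standing in for real search output.
# With the notebook's store, these would come from something like:
#   scored = db.similarity_search_with_score(query)   # assumed usage
scored = [("chunk A", 1.62), ("chunk B", 1.71)]

relevant = filter_by_distance(scored, max_distance=1.0)
if not relevant:
    print("No sufficiently close match; the index may not contain relevant documents.")
```

An empty result here is more informative than an unrelated table presented as an answer.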