## Running your own LLM

This notebook was adapted from Sil Hamilton's class [Generative AI for journalists](https://www.kccourses.org/enrol/index.php?id=116). 

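We first install the required software: LangChain, its dependency `pypdf`, and the `langchain-community` and `langchain-huggingface` packages. Uncomment the line for any package you have not installed yet.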

In [11]:
#!pip install --upgrade pip 
#!pip install --upgrade langchain pypdf
#!pip install -U langchain-community
!pip install -U langchain-huggingface

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.0.3-py3-none-any.whl.metadata (1.2 kB)
Collecting huggingface-hub>=0.23.0 (from langchain-huggingface)
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting sentence-transformers>=2.6.0 (from langchain-huggingface)
  Downloading sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Collecting tokenizers>=0.19.1 (from langchain-huggingface)
  Downloading tokenizers-0.20.3-cp38-cp38-macosx_10_12_x86_64.whl.metadata (6.7 kB)
Collecting transformers>=4.39.0 (from langchain-huggingface)
  Downloading transformers-4.46.2-py3-none-any.whl.metadata (44 kB)
Downloading langchain_huggingface-0.0.3-py3-none-any.whl (17 kB)
Downloading huggingface_hub-0.26.2-py3-none-any.whl (447 kB)
Downloading sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
Downloading tokenizers-0.20.3-cp38-cp38-macosx_10_12_x86_64.whl (2.7 MB)

We then import LangChain's PDF loaders, which are built on `pypdf`.

In [12]:
# Recent LangChain releases ship the document loaders in langchain_community
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

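Now let's load the PDFs. `PyPDFLoader` reads a single file (commented out below), while `PyPDFDirectoryLoader` reads every PDF in a folder.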

In [13]:
# loader = PyPDFLoader("../data/2021-census-population-occupied-private-dwellings-community-2001-2021.pdf")

In [14]:
# single_pdf = loader.load_and_split()

In [15]:
loader = PyPDFDirectoryLoader("../data/Supreme Court opinions 2014/")

In [16]:
many_pdfs = loader.load_and_split()
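
`load_and_split()` reads the PDFs and breaks their text into smaller chunks, so that each chunk can later be embedded and retrieved on its own. The sketch below illustrates the general idea with a fixed-size window and some overlap between neighbouring chunks. It is a simplified stand-in, not LangChain's splitter, which is more careful about where it breaks the text. The function name `split_text` and the sizes used here are assumptions made for illustration.

```python
def split_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping fixed-size character chunks (illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up sample text standing in for the contents of one PDF page
sample = "The Court reaches two critical conclusions. " * 10
chunks = split_text(sample, chunk_size=100, overlap=20)
print(len(chunks), "chunks")
print(repr(chunks[0][:40]))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.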

Having loaded our data, we'll now download and load the embedding model.

In [17]:
#!pip install sentence_transformers > /dev/null

In [19]:
# HuggingFaceEmbeddings is the only embedding class used below
from langchain_huggingface import HuggingFaceEmbeddings

In [21]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Let's try embedding some text. Observe the output. Once you've tried it, scroll down to continue.

In [22]:
text = "This is a test document."

In [23]:
embeddings.embed_query(text)

[-0.03833850845694542,
 0.1234646588563919,
 -0.02864295430481434,
 0.05365273728966713,
 0.008845346979796886,
 -0.03983934596180916,
 -0.07300586998462677,
 0.04777126759290695,
 -0.03046250157058239,
 0.05497973784804344,
 0.08505290001630783,
 0.03665672987699509,
 -0.005319980438798666,
 -0.0022331627551466227,
 -0.06071098521351814,
 -0.027237940579652786,
 -0.011351647786796093,
 -0.04243769869208336,
 0.009129912592470646,
 0.10081558674573898,
 0.07578732818365097,
 0.06911719590425491,
 0.009857515804469585,
 -0.0018377389060333371,
 0.026249056681990623,
 0.03290240094065666,
 -0.07177440077066422,
 0.02838428132236004,
 0.061709512025117874,
 -0.052529554814100266,
 0.03366169333457947,
 0.07446811348199844,
 0.07536035776138306,
 0.03538399934768677,
 0.06713403761386871,
 0.010798030532896519,
 0.08167024701833725,
 0.01656290702521801,
 0.032830629497766495,
 0.03632565960288048,
 0.002172861248254776,
 -0.09895740449428558,
 0.005046738777309656,
 0.05089650675654411,
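
The list above is the embedding: a vector of floats (384 of them for `all-MiniLM-L6-v2`) that encodes the meaning of the text. Texts with similar meanings get vectors that point in similar directions, which is commonly measured with cosine similarity. Here is a minimal sketch of that measure. The three short vectors are made up for illustration; with the real model you would pass in the outputs of `embeddings.embed_query(...)` instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (real ones are much longer)
court_ruling = [0.9, 0.1, 0.0]
judicial_opinion = [0.8, 0.2, 0.1]
banana_bread = [0.0, 0.2, 0.9]

print(cosine_similarity(court_ruling, judicial_opinion))  # close to 1: similar
print(cosine_similarity(court_ruling, banana_bread))      # close to 0: unrelated
```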
 

We now have a working embedding function. Let's install Chroma, the vector store we'll use to index the embeddings.

In [24]:
#!pip install -U chromadb

In [25]:
# Recent LangChain releases ship the vector store wrappers in langchain_community
from langchain_community.vectorstores import Chroma

Let's make a vector store for our loaded documents!

In [27]:
%%time
db = Chroma.from_documents(many_pdfs, embeddings)

CPU times: user 60 s, sys: 1.87 s, total: 1min 1s
Wall time: 50 s
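
Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose vectors are closest to the query's vector. The sketch below shows that idea with a brute-force, in-memory store. `ToyVectorStore`, `toy_embed`, and `VOCAB` are hypothetical names invented for this illustration; this is not how Chroma is implemented, and real stores add persistence and indexing on top.

```python
import heapq
import math


class ToyVectorStore:
    """Brute-force in-memory vector store (illustrative, not Chroma's implementation)."""

    def __init__(self, embed):
        self.embed = embed   # function mapping str -> list of floats
        self.entries = []    # (unit-length vector, text) pairs

    @staticmethod
    def _normalize(vec):
        # Scale to unit length so a dot product equals cosine similarity
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((self._normalize(self.embed(text)), text))

    def similarity_search(self, query, k=2):
        q = self._normalize(self.embed(query))
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self.entries
        )
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return [text for _, text in best]


# Made-up embedding: counts of a few vocabulary words (a real model is far richer)
VOCAB = ["court", "justice", "opinion", "bread", "flour"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "the court issued its opinion",
    "justice sotomayor wrote a concurring opinion",
    "mix flour and water to make bread",
])
print(store.similarity_search("opinion of the court", k=2))
```

Comparing the query against every stored vector is fine for a handful of chunks, but it gets slower as the collection grows, which is one reason to use a dedicated store.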


Let's try retrieving a relevant document.

In [28]:
query = "What documents include Sotomayor?"
db.similarity_search(query)

[Document(metadata={'page': 11, 'source': '../data/Supreme Court opinions 2014/13-433_5h26.pdf'}, page_content='_________________ \n \n_________________ \n \n \n \n \n \n \n  \n \n  \n  \n \n \n \n \n \n \n \n1 Cite as: 574 U. S. ____ (2014) \nSOTOMAYOR, J., concurring \nSUPREME COURT OF THE UNITED STATES \nNo. 13–433 \nINTEGRITY STAFFING SOLUTIONS, INC., \nPETITIONER v. JESSE BUSK ET AL. \nON WRIT OF CERTIORARI TO THE UNITED STATES COURT OF \nAPPEALS FOR THE NINTH CIRCUIT\n \n[December 9, 2014]\n JUSTICE SOTOMAYOR, with whom J USTICE KAGAN joins,\nconcurring. \nI concur in the Court’s opinion, and write separately\nonly to explain my understanding of the standards the\nCourt applies. \nThe Court reaches two critical conclusions.  First, the \nCourt confirms that compensable “ ‘principal’” activities \n“‘includ[e] . . . those closely related activities which are \nindispensable to [a principal activity’s] performance,’ ” \nante, at 6 (quoting 29 CFR §790.8(c)(2013)), and holds that \nt