## Running your own LLM

This notebook was adapted from Sil Hamilton's class [Generative AI for journalists](https://www.kccourses.org/enrol/index.php?id=116). 

We first install the required software: LangChain and its dependency `pypdf`.

In [1]:
#!pip install --upgrade pip
#!pip install --upgrade langchain pypdf
#!pip install -U langchain-community
#!pip install -U langchain-huggingface

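We then import LangChain's PDF directory loader, which relies on `pypdf` to read each file.

Now let's load the PDFs.

In [5]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("../data/Supreme Court opinions 2014/")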

In [6]:
many_pdfs = loader.load_and_split()
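`load_and_split()` reads every PDF in the folder and breaks the text into smaller pieces, returning a list of `Document` objects. The sketch below is not LangChain's implementation. It is a simplified, standard-library illustration of the general idea of cutting long text into overlapping chunks; the function name `split_into_chunks` and the chunk sizes are assumptions chosen for this example.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text, repeated to make it long enough to split.
sample = "The Court reaches two critical conclusions. " * 3
chunks = split_into_chunks(sample)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval later on.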

Having loaded our data, we'll now download and load the embedding model.

In [7]:
#!pip install sentence_transformers > /dev/null

In [8]:
from langchain_huggingface import HuggingFaceEmbeddings

In [9]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Let's try embedding some text. Observe the output. Once you've tried it, scroll down to continue.

In [10]:
text = "This is a test document."

In [11]:
embeddings.embed_query(text)

[-0.0383385568857193,
 0.12346471101045609,
 -0.028642993420362473,
 0.053652726113796234,
 0.008845396339893341,
 -0.039839357137680054,
 -0.07300589233636856,
 0.04777125269174576,
 -0.03046250343322754,
 0.054979756474494934,
 0.08505292236804962,
 0.0366566926240921,
 -0.005320002790540457,
 -0.0022331965155899525,
 -0.06071098893880844,
 -0.027237897738814354,
 -0.011351661756634712,
 -0.04243772476911545,
 0.009129962883889675,
 0.1008155420422554,
 0.07578727602958679,
 0.0691172331571579,
 0.0098574822768569,
 -0.0018377574160695076,
 0.026249036192893982,
 0.03290237858891487,
 -0.07177439332008362,
 0.028384266421198845,
 0.06170950084924698,
 -0.05252956226468086,
 0.03366170451045036,
 0.07446813583374023,
 0.07536029815673828,
 0.03538401052355766,
 0.06713409721851349,
 0.010798059403896332,
 0.08167024701833725,
 0.01656287908554077,
 0.032830629497766495,
 0.03632567077875137,
 0.0021728689316660166,
 -0.09895738214254379,
 0.005046762991696596,
 0.05089649185538292,
 ...]
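The list above is the embedding: a fixed-length vector of floats (384 of them for `all-MiniLM-L6-v2`). Texts with similar meanings should produce vectors that point in similar directions, which is commonly measured with cosine similarity. Here is a minimal sketch using made-up three-dimensional vectors rather than real model output; `cosine_similarity` is a helper defined for illustration, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumed values, not model output).
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]  # points roughly the same way as doc_a
doc_c = [0.0, 0.1, 0.9]  # points in a very different direction

print("a vs b:", cosine_similarity(doc_a, doc_b))
print("a vs c:", cosine_similarity(doc_a, doc_c))
```

With the real model, you could pass the results of two `embeddings.embed_query(...)` calls to the same helper to compare two pieces of text.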

We now have a working embedding function. Let's install Chroma.

In [12]:
#!pip install -U chromadb

In [13]:
from langchain_community.vectorstores import Chroma

Let's make a vector store for our loaded documents!

In [14]:
%%time
db = Chroma.from_documents(many_pdfs, embeddings)

CPU times: user 1.74 s, sys: 233 ms, total: 1.97 s
Wall time: 3.02 s
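Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The sketch below illustrates that idea with the standard library only. It is not how Chroma is implemented: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and `ToyVectorStore` is a class invented for this example.

```python
import math
import re
from collections import Counter

# A tiny fixed vocabulary; each word is one dimension of the toy embedding.
VOCAB = ["court", "sotomayor", "tax", "concurring", "petitioner"]


def toy_embed(text: str) -> list[float]:
    """Count how often each vocabulary word appears in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(words[w]) for w in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # all-zero vectors match nothing


class ToyVectorStore:
    def __init__(self, texts: list[str]):
        # Store each text alongside its precomputed embedding.
        self.entries = [(t, toy_embed(t)) for t in texts]

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "Sotomayor, J., concurring.",
    "The petitioner challenged the tax.",
    "The Court reversed.",
])
results = store.similarity_search("Which opinions mention Sotomayor?", k=1)
print(results)
```

A real store such as Chroma follows the same store-then-rank pattern, but uses the learned embeddings from the model and indexing structures suited to large collections.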


Let's try retrieving a relevant document.

In [15]:
query = "What documents include Sotomayor?"
db.similarity_search(query)

[Document(metadata={'page': 11, 'source': '../data/Supreme Court opinions 2014/13-433_5h26.pdf'}, page_content='_________________ \n \n_________________ \n \n \n \n \n \n \n  \n \n  \n  \n \n \n \n \n \n \n \n1 Cite as: 574 U. S. ____ (2014) \nSOTOMAYOR, J., concurring \nSUPREME COURT OF THE UNITED STATES \nNo. 13–433 \nINTEGRITY STAFFING SOLUTIONS, INC., \nPETITIONER v. JESSE BUSK ET AL. \nON WRIT OF CERTIORARI TO THE UNITED STATES COURT OF \nAPPEALS FOR THE NINTH CIRCUIT\n \n[December 9, 2014]\n JUSTICE SOTOMAYOR, with whom J USTICE KAGAN joins,\nconcurring. \nI concur in the Court’s opinion, and write separately\nonly to explain my understanding of the standards the\nCourt applies. \nThe Court reaches two critical conclusions.  First, the \nCourt confirms that compensable “ ‘principal’” activities \n“‘includ[e] . . . those closely related activities which are \nindispensable to [a principal activity’s] performance,’ ” \nante, at 6 (quoting 29 CFR §790.8(c)(2013)), and holds that \nt