In [1]:
from ragatouille import RAGPretrainedModel
import pickle

In [2]:
persist_directory = "../embeddings"
colbert_path = "./../colbertv2.0/"
index_root = "./../colbert_index/"

In [3]:
# Load in previously processed documents - syllabi and advising
with open(f"{persist_directory}/documents.pickle", "rb") as handle:
    documents = pickle.load(handle)

with open(f"{persist_directory}/transcripts.pickle", "rb") as handle:
    transcripts = pickle.load(handle)

In [4]:
# Remove one document from transcripts
transcripts = [
    t
    for t in transcripts
    if t.metadata["source"]
    != "01_client-projects-and-data-webinar-from-the-engaged-learning-office.en.txt"
]

# Split out documents to separate lists of document text and metadata
doc_list = [doc.page_content for doc in documents]
metadata_list = [doc.metadata for doc in documents]

trans_list = [doc.page_content for doc in transcripts]
trans_metadata_list = [doc.metadata for doc in transcripts]
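
The filter-and-split step above can be sketched as one reusable helper. This is a minimal sketch, not the notebook's own code: the `Document` dataclass is a hypothetical stand-in for the pickled objects (assumed to expose `page_content` and `metadata`), and `split_text_and_metadata` is an illustrative name. Building both lists from the same filtered sequence keeps texts and metadata aligned by position, which matters because the index pairs them up by position.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the pickled documents (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text_and_metadata(docs, exclude_sources=()):
    """Drop documents whose 'source' is excluded, then return parallel lists."""
    kept = [d for d in docs if d.metadata.get("source") not in exclude_sources]
    texts = [d.page_content for d in kept]
    metadatas = [d.metadata for d in kept]
    return texts, metadatas


# Illustrative data only; the file names are made up.
sample = [
    Document("Lecture on pandas.", {"source": "a.txt"}),
    Document("Webinar recording.", {"source": "b.txt"}),
]
texts, metadatas = split_text_and_metadata(sample, exclude_sources={"b.txt"})

# The two lists must stay the same length so text i matches metadata i.
assert len(texts) == len(metadatas)
```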

In [5]:
# Create new model from downloaded base model available on Hugging Face (https://huggingface.co/colbert-ir/colbertv2.0)
# This does _not_ recognize the Apple Silicon GPU at this time
RAG = RAGPretrainedModel.from_pretrained(colbert_path, index_root=index_root)

[Apr 09, 18:09:26] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [6]:
# Create a new index. Documents as they stand are too long, even though they have been chunked.
# According to the documentation, 512 is about the maximum useful length, so the documents are split again.
RAG.index(
    collection=doc_list,
    document_metadatas=metadata_list,
    index_name="documents",
    max_document_length=512,
    split_documents=True,
    use_faiss=False,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Apr 09, 18:09:32] #> Note: Output directory /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/documents already exists


[Apr 09, 18:09:32] #> Will delete 1 files already at /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/documents in 20 seconds...
[Apr 09, 18:09:52] [0] 		 #> Encoding 835 passages..


100%|███████████████████████████████████████████| 27/27 [01:56<00:00,  4.32s/it]

[Apr 09, 18:11:49] [0] 		 avg_doclen_est = 103.75569152832031 	 len(local_sample) = 835
[Apr 09, 18:11:49] [0] 		 Creating 4,096 partitions.
[Apr 09, 18:11:49] [0] 		 *Estimated* 86,636 embeddings.
[Apr 09, 18:11:49] [0] 		 #> Saving the indexing plan to /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/documents/plan.json ..





used 20 iterations (11.6584s) to cluster 82305 items into 4096 clusters
[0.031, 0.03, 0.029, 0.026, 0.027, 0.029, 0.029, 0.027, 0.028, 0.027, 0.028, 0.029, 0.03, 0.028, 0.029, 0.03, 0.026, 0.028, 0.026, 0.029, 0.028, 0.03, 0.029, 0.029, 0.028, 0.028, 0.032, 0.029, 0.029, 0.031, 0.032, 0.032, 0.032, 0.028, 0.027, 0.026, 0.03, 0.029, 0.028, 0.034, 0.03, 0.03, 0.028, 0.029, 0.03, 0.028, 0.028, 0.032, 0.031, 0.026, 0.026, 0.028, 0.031, 0.029, 0.028, 0.03, 0.031, 0.03, 0.034, 0.028, 0.029, 0.03, 0.03, 0.029, 0.033, 0.031, 0.03, 0.029, 0.029, 0.029, 0.03, 0.027, 0.03, 0.03, 0.029, 0.029, 0.03, 0.029, 0.03, 0.033, 0.032, 0.03, 0.029, 0.031, 0.029, 0.029, 0.028, 0.029, 0.028, 0.033, 0.029, 0.03, 0.029, 0.032, 0.029, 0.028, 0.033, 0.027, 0.03, 0.029, 0.03, 0.03, 0.028, 0.029, 0.029, 0.026, 0.028, 0.028, 0.027, 0.027, 0.03, 0.03, 0.03, 0.027, 0.031, 0.027, 0.032, 0.03, 0.03, 0.031, 0.029, 0.03, 0.028, 0.031, 0.028, 0.031, 0.029, 0.027]


0it [00:00, ?it/s]

[Apr 09, 18:12:01] [0] 		 #> Encoding 835 passages..



  0%|                                                    | 0/27 [00:00<?, ?it/s]
  4%|█▋                                          | 1/27 [00:04<02:03,  4.76s/it]
  7%|███▎                                        | 2/27 [00:09<01:56,  4.67s/it]
 11%|████▉                                       | 3/27 [00:14<01:52,  4.67s/it]
 15%|██████▌                                     | 4/27 [00:18<01:47,  4.66s/it]
 19%|████████▏                                   | 5/27 [00:23<01:42,  4.65s/it]
 22%|█████████▊                                  | 6/27 [00:27<01:37,  4.64s/it]
 26%|███████████▍                                | 7/27 [00:32<01:32,  4.63s/it]
 30%|█████████████                               | 8/27 [00:37<01:28,  4.63s/it]
 33%|██████████████▋                             | 9/27 [00:41<01:23,  4.64s/it]
 37%|███████████████▉                           | 10/27 [00:46<01:18,  4.64s/it]
 41%|█████████████████▌                         | 11/27 [00:51<01:14,  4.63

[Apr 09, 18:14:02] #> Optimizing IVF to store map from centroids to list of pids..
[Apr 09, 18:14:02] #> Building the emb2pid mapping..
[Apr 09, 18:14:02] len(emb2pid) = 86636



100%|███████████████████████████████████| 4096/4096 [00:00<00:00, 142629.53it/s]

[Apr 09, 18:14:02] #> Saved optimized IVF to /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/documents/ivf.pid.pt
Done indexing!





'/Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/documents'
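
To illustrate what "splitting again" to a maximum length means, here is a rough sketch of a fixed-window splitter. It is not RAGatouille's splitter: it counts whitespace-separated words rather than model tokens, and the function name and `overlap` parameter are assumptions made for the example.

```python
def split_by_words(text, max_words=512, overlap=0):
    """Split text into chunks of at most max_words whitespace-separated words.

    Consecutive chunks share `overlap` words. Words are a crude proxy
    for tokens; a real splitter would count model tokens instead.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = split_by_words(demo_text, max_words=4, overlap=1)
```

With ten words, a window of four and an overlap of one, the sketch yields three chunks, each starting on the last word of the previous one.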

In [7]:
# Index for transcripts
RAG.index(
    collection=trans_list,
    document_metadatas=trans_metadata_list,
    index_name="transcripts",
    max_document_length=512,
    split_documents=True,
    use_faiss=False,
)

New index_name received! Updating current index_name (documents) to transcripts
This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Apr 09, 18:15:23] #> Note: Output directory /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/transcripts already exists


[Apr 09, 18:15:23] #> Will delete 11 files already at /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/transcripts in 20 seconds...
[Apr 09, 18:15:44] [0] 		 #> Encoding 7253 passages..


100%|███████████████████████████████████████████| 50/50 [03:51<00:00,  4.63s/it]
100%|███████████████████████████████████████████| 50/50 [03:39<00:00,  4.39s/it]
100%|███████████████████████████████████████████| 50/50 [03:53<00:00,  4.67s/it]
100%|███████████████████████████████████████████| 50/50 [03:53<00:00,  4.68s/it]
100%|███████████████████████████████████████████| 27/27 [02:04<00:00,  4.59s/it]


[Apr 09, 18:33:09] [0] 		 avg_doclen_est = 238.19992065429688 	 len(local_sample) = 7,253
[Apr 09, 18:33:10] [0] 		 Creating 16,384 partitions.
[Apr 09, 18:33:10] [0] 		 *Estimated* 1,727,664 embeddings.
[Apr 09, 18:33:10] [0] 		 #> Saving the indexing plan to /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/transcripts/plan.json ..
used 20 iterations (9.75s) to cluster 1677664 items into 16384 clusters
[0.037, 0.038, 0.037, 0.034, 0.034, 0.039, 0.037, 0.034, 0.036, 0.036, 0.037, 0.036, 0.037, 0.039, 0.037, 0.041, 0.034, 0.037, 0.036, 0.035, 0.037, 0.039, 0.035, 0.037, 0.035, 0.036, 0.038, 0.038, 0.038, 0.038, 0.037, 0.041, 0.039, 0.035, 0.037, 0.033, 0.04, 0.036, 0.037, 0.042, 0.037, 0.038, 0.037, 0.038, 0.038, 0.035, 0.036, 0.042, 0.04, 0.038, 0.036, 0.036, 0.042, 0.038, 0.036, 0.036, 0.041, 0.04, 0.047, 0.036, 0.037, 0.041, 0.038, 0.039, 0.04, 0.039, 0.04, 0.038, 0.035, 0.036, 0.04, 0.034, 0.037, 0.04, 0.038, 0.039, 0.039, 0.039, 0.04, 0.043, 0.042, 0.037, 0.037, 0.039, 0.035, 0.03

0it [00:00, ?it/s]

[Apr 09, 18:33:22] [0] 		 #> Encoding 7253 passages..



  0%|                                                    | 0/50 [00:00<?, ?it/s]
  2%|▉                                           | 1/50 [00:04<03:42,  4.54s/it]
  4%|█▊                                          | 2/50 [00:09<03:36,  4.51s/it]
  6%|██▋                                         | 3/50 [00:13<03:33,  4.54s/it]
  8%|███▌                                        | 4/50 [00:18<03:29,  4.56s/it]
 10%|████▍                                       | 5/50 [00:22<03:25,  4.58s/it]
 12%|█████▎                                      | 6/50 [00:27<03:22,  4.60s/it]
 14%|██████▏                                     | 7/50 [00:32<03:18,  4.61s/it]
 16%|███████                                     | 8/50 [00:36<03:13,  4.62s/it]
 18%|███████▉                                    | 9/50 [00:41<03:09,  4.63s/it]
 20%|████████▌                                  | 10/50 [00:46<03:05,  4.64s/it]
 22%|█████████▍                                 | 11/50 [00:50<03:00,  4.63

[Apr 09, 18:51:53] #> Optimizing IVF to store map from centroids to list of pids..
[Apr 09, 18:51:53] #> Building the emb2pid mapping..
[Apr 09, 18:51:53] len(emb2pid) = 1727664



100%|█████████████████████████████████| 16384/16384 [00:00<00:00, 141447.29it/s]

[Apr 09, 18:51:53] #> Saved optimized IVF to /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/transcripts/ivf.pid.pt





Done indexing!


'/Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/transcripts'
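
The partition counts and embedding estimates in the two indexing logs can be reproduced with a little arithmetic. The sketch below assumes the indexer estimates the embedding count as passages times average passage length, and picks the partition count as the largest power of two not exceeding 16 times the square root of that estimate. That heuristic is an assumption inferred from the logged numbers, not something the logs state, and the function names are illustrative.

```python
import math


def estimated_embeddings(num_passages, avg_doclen_est):
    """Estimated total token embeddings: passages x average passage length."""
    return int(num_passages * avg_doclen_est)


def num_partitions(num_embeddings_est):
    """Largest power of two <= 16 * sqrt(estimated embeddings) (assumed heuristic)."""
    return int(2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est))))


# Values copied from the indexing logs above.
docs_est = estimated_embeddings(835, 103.75569152832031)
trans_est = estimated_embeddings(7253, 238.19992065429688)

docs_partitions = num_partitions(docs_est)
trans_partitions = num_partitions(trans_est)
```

Under these assumptions the results agree with the logged values: 86,636 embeddings and 4,096 partitions for the documents index, and 1,727,664 embeddings and 16,384 partitions for the transcripts index.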

In [8]:
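# This takes 30+ seconds to start up the first time, but runs faster after that.
# Note: the search runs against the most recently built index (transcripts, per the log below), not documents.
RAG.search(query="Which class involves time series analysis?")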

Loading searcher for index transcripts for the first time... This may take a few seconds
[Apr 09, 18:52:13] #> Loading codec...
[Apr 09, 18:52:13] #> Loading IVF...
[Apr 09, 18:52:13] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Apr 09, 18:52:14] #> Loading doclens...


100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 2192.53it/s]

[Apr 09, 18:52:14] #> Loading codes and residuals...



100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 30.88it/s]

[Apr 09, 18:52:14] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Apr 09, 18:52:14] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Which class involves time series analysis?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2029, 2465, 7336, 2051, 2186, 4106, 1029,  102,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





[{'content': "In today's lecture, we're going to\nbe looking at time series and date functionality in pandas. Manipulating dates and\ntimes is quite flexible in pandas and thus allows us to conduct more\nanalysis such as time series analysis, which we're going to talk about soon. Actually, pandas was originally created\nby Wes McKinney to handle date and time data when he worked as\na consultant for hedge funds. So it's quite robust in this matter. Let's bring in pandas and numpy as usual. All right,\npandas has four main time related classes. Timestamp, DatetimeIndex,\nPeriod, and PeriodIndex.",
  'score': 23.03000831604004,
  'rank': 1,
  'document_id': 'afdfefe6-105f-4a59-8341-c55d5dd4c9b3',
  'passage_id': 2142,
  'document_metadata': {'source': '08_date-time-functionality.en.txt',
   'course_number': 'SIADS 505',
   'course_title': 'Data Manipulation',
   'start_index': 0}},
 {'content': 'In fact, this is usually\nwhat we collect in reality. We take the measurements. We cannot tak
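
The `score` field comes from ColBERT's late-interaction scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The sketch below shows that "MaxSim" idea in plain Python with toy two-dimensional vectors. It is a conceptual illustration only, not the library's implementation, which runs batched on compressed embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query token vectors, of the best dot product with any document token vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy unit-length vectors standing in for token embeddings.
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_doc = [[1.0, 0.0], [0.6, 0.8]]

toy_score = maxsim_score(toy_query, toy_doc)
```

If the embeddings are unit-normalized, each query token contributes at most 1 to the sum. With the query padded to 32 tokens, as in the tensor printed above, a score near 23 would then mean most query tokens found a close match in the passage.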

In [9]:
RAG.search(query="How does PCA work?")  # transcripts

[{'content': "Let's start by looking at a very important and widely used linear dimensionality\nreduction technique called principal component\nanalysis or PCA. There are a couple of ways\nto describe how PCA works. An intuitive, more geometric way and then there's\na linear algebra way. What we're going\nto do is to start, we're going to look\nat the geometric way, the visually intuitive\nway and then later, we'll look at the\nlinear algebra behind PCA as part of understanding a powerful general\ndimensionality reduction method called singular value\ndecomposition or SVD, which is very closely\nconnected to PCA. Intuitively what PCA does, it takes your\noriginal data points. Here I have a very simple\ndataset with two features. It's a two-dimensional\ndataset and imagine each instance is denoted by a point here in the\nscatterplot and intuitively, what PCA does geometrically to these original data points\nis it finds a rotation of the points so that the\ndimensions are statistically u
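
The raw result lists are hard to scan, so a small formatter helps when comparing queries. This is a sketch that assumes each hit is a dict with the keys visible in the output above (`rank`, `score`, `document_metadata`). The helper name is hypothetical, and the sample hit copies the metadata of the first result with its content shortened.

```python
def summarize_hits(hits):
    """Return one line per hit: rank, course, source file, and rounded score."""
    lines = []
    for hit in hits:
        meta = hit.get("document_metadata") or {}
        lines.append(
            f"{hit['rank']}. {meta.get('course_number', '?')} "
            f"{meta.get('course_title', '?')} "
            f"({meta.get('source', '?')}) score={hit['score']:.2f}"
        )
    return lines


# Sample shaped like the first search result above (content shortened).
sample_hits = [
    {
        "content": "In today's lecture, we're going to be looking at time series...",
        "score": 23.03000831604004,
        "rank": 1,
        "document_metadata": {
            "source": "08_date-time-functionality.en.txt",
            "course_number": "SIADS 505",
            "course_title": "Data Manipulation",
            "start_index": 0,
        },
    }
]

for line in summarize_hits(sample_hits):
    print(line)
```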

In [10]:
# RAGatouille lets you create a LangChain retriever from the indexed model
retriever = RAG.as_langchain_retriever(k=5)

In [11]:
retriever.invoke("What is a backpack?")



[Document(page_content="This approach is\npopularly known as a bag-of-words approach in natural language\nprocessing literature. The reason it is called\nbag-of-words is because it just gives about if and how\nmany times a word occurs. It doesn't care\nabout the position or order of the word\nin the sentence. Bag-of-words based language\nmodeling approaches have been the mainstay of language modeling\nfor a long time, and give comparative\nperformance on several natural language\nprocessing tasks were ordering information\nis not very important. However, they can\nperform poorly on tasks where ordering\ninformation is important.", metadata={'source': '01_sequence-modeling.en.txt', 'course_number': 'SIADS 642', 'course_title': 'Deep Learning I', 'start_index': 2446}),
 Document(page_content='', metadata={'source': '05_university-of-michigans-primary-data-center.en.txt', 'course_number': 'SIADS 673', 'course_title': 'Cloud Computing', 'start_index': 0}),
 Document(page_content="So I'll g
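
Note that the second document returned above has an empty `page_content`. Before passing retrieved passages to a downstream pipeline, it may be worth dropping such entries. The sketch below uses a hypothetical `Document` dataclass as a stand-in for the retriever's return type, and `drop_empty` is an illustrative helper, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the retriever's documents (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_empty(docs):
    """Keep only documents whose text is non-empty after stripping whitespace."""
    return [d for d in docs if d.page_content and d.page_content.strip()]


# Illustrative results, loosely modelled on the output above.
retrieved = [
    Document("This approach is popularly known as a bag-of-words approach.",
             {"source": "01_sequence-modeling.en.txt"}),
    Document("", {"source": "05_university-of-michigans-primary-data-center.en.txt"}),
    Document("   \n", {"source": "whitespace-only.txt"}),
]

usable = drop_empty(retrieved)
```

Filtering after retrieval means fewer than `k` passages may remain; requesting a slightly larger `k` is one way to compensate.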

The next step is to add this retriever to the RAG pipeline and check its performance...