In [5]:
from ragatouille import RAGPretrainedModel
import pickle

In [6]:
persist_directory = "../embeddings"
colbert_path = "./../colbertv2.0/"
index_root = "./../colbert_index/"

In [7]:
# Load in previously processed documents - syllabi and advising
with open(f"{persist_directory}/documents.pickle", "rb") as handle:
    documents = pickle.load(handle)

with open(f"{persist_directory}/transcripts.pickle", "rb") as handle:
    transcripts = pickle.load(handle)

In [8]:
# Remove one document from transcripts
transcripts = [
    t
    for t in transcripts
    if t.metadata["source"]
    != "01_client-projects-and-data-webinar-from-the-engaged-learning-office.en.txt"
]

# Split out documents to separate lists of document text and metadata
doc_list = [doc.page_content for doc in documents]
metadata_list = [doc.metadata for doc in documents]

trans_list = [doc.page_content for doc in transcripts]
trans_metadata_list = [doc.metadata for doc in transcripts]

combined_doc_list = doc_list + trans_list
combined_metadata_list = metadata_list + trans_metadata_list
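
The filter-and-split step above can be tried without the pickled data. Here is a minimal sketch, assuming a stand-in `Doc` dataclass (hypothetical, mimicking only the `page_content` and `metadata` attributes used above) and made-up sample records. It checks that the text list and the metadata list stay aligned, since they are passed to the indexer as parallel lists.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, exclude_sources=()):
    """Drop docs whose source is excluded, then split into texts and metadata."""
    kept = [d for d in docs if d.metadata.get("source") not in exclude_sources]
    texts = [d.page_content for d in kept]
    metadatas = [d.metadata for d in kept]
    return texts, metadatas


# Made-up sample records, not taken from the real pickles
sample = [
    Doc("Syllabus text", {"source": "syllabus.md"}),
    Doc("Webinar text", {"source": "webinar.en.txt"}),
    Doc("Lecture text", {"source": "lecture.en.txt"}),
]

texts, metadatas = split_docs(sample, exclude_sources={"webinar.en.txt"})
assert len(texts) == len(metadatas)  # one metadata dict per text
print(texts)
```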

In [9]:
# Create new model from downloaded base model available on Hugging Face (https://huggingface.co/colbert-ir/colbertv2.0)
# This does _not_ recognize the Apple Silicon GPU at this time
RAG = RAGPretrainedModel.from_pretrained(colbert_path, index_root=index_root)

[Apr 09, 18:53:28] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [10]:
# Create a new index. Documents as they stand are too long, even though they have been chunked.
# According to the documentation, 512 is about the maximum useful length, so the documents are split again.
RAG.index(
    collection=combined_doc_list,
    document_metadatas=combined_metadata_list,
    index_name="combined",
    max_document_length=512,
    split_documents=True,
    use_faiss=False,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Apr 09, 18:53:36] #> Note: Output directory /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/combined already exists


[Apr 09, 18:53:36] #> Will delete 1 files already at /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/combined in 20 seconds...
[Apr 09, 18:53:56] [0] 		 #> Encoding 8088 passages..


100%|███████████████████████████████████████████| 50/50 [03:55<00:00,  4.71s/it]
100%|███████████████████████████████████████████| 50/50 [03:54<00:00,  4.68s/it]
100%|███████████████████████████████████████████| 50/50 [03:53<00:00,  4.68s/it]
100%|███████████████████████████████████████████| 50/50 [03:53<00:00,  4.66s/it]
100%|███████████████████████████████████████████| 50/50 [03:52<00:00,  4.65s/it]
100%|█████████████████████████████████████████████| 3/3 [00:12<00:00,  4.13s/it]


[Apr 09, 19:13:41] [0] 		 avg_doclen_est = 224.31997680664062 	 len(local_sample) = 8,088
[Apr 09, 19:13:42] [0] 		 Creating 16,384 partitions.
[Apr 09, 19:13:42] [0] 		 *Estimated* 1,814,299 embeddings.
[Apr 09, 19:13:42] [0] 		 #> Saving the indexing plan to /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/combined/plan.json ..
used 20 iterations (10.1077s) to cluster 1764300 items into 16384 clusters

[Apr 09, 19:13:55] [0] 		 #> Encoding 8088 passages..




[Apr 09, 19:35:04] #> Optimizing IVF to store map from centroids to list of pids..
[Apr 09, 19:35:04] #> Building the emb2pid mapping..
[Apr 09, 19:35:04] len(emb2pid) = 1814300



100%|█████████████████████████████████| 16384/16384 [00:00<00:00, 117942.78it/s]

[Apr 09, 19:35:04] #> Saved optimized IVF to /Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/combined/ivf.pid.pt





Done indexing!


'/Volumes/ARN_T7/RAG/colbert_index/colbert/indexes/combined'
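
To get a feel for what `max_document_length=512` with `split_documents=True` implies, here is a minimal sketch of fixed-size chunking. It is not RAGatouille's splitter: it counts whitespace-separated words rather than model tokens, and `chunk_words` is a hypothetical helper written only for illustration.

```python
def chunk_words(text, max_len=512):
    """Split text into consecutive chunks of at most max_len whitespace-separated words."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_len]) for i in range(0, len(words), max_len)]


# A made-up 1,200-word document yields three chunks: 512 + 512 + 176 words
long_text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(long_text, max_len=512)
print([len(c.split()) for c in chunks])
```

A real splitter would typically respect sentence boundaries and count tokens with the model's tokenizer, so chunk counts from this sketch are only a rough guide.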

In [11]:
# This takes 30+ seconds to start up the first time, but runs faster after that
RAG.search(query="Which class involves time series analysis?")  # documents

Loading searcher for index combined for the first time... This may take a few seconds
[Apr 09, 19:36:42] #> Loading codec...
[Apr 09, 19:36:42] #> Loading IVF...
[Apr 09, 19:36:42] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Apr 09, 19:36:42] #> Loading doclens...


100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1786.33it/s]

[Apr 09, 19:36:42] #> Loading codes and residuals...



100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 28.28it/s]

[Apr 09, 19:36:42] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Apr 09, 19:36:42] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Which class involves time series analysis?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2029, 2465, 7336, 2051, 2186, 4106, 1029,  102,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





[{'content': "In today's lecture, we're going to\nbe looking at time series and date functionality in pandas. Manipulating dates and\ntimes is quite flexible in pandas and thus allows us to conduct more\nanalysis such as time series analysis, which we're going to talk about soon. Actually, pandas was originally created\nby Wes McKinney to handle date and time data when he worked as\na consultant for hedge funds. So it's quite robust in this matter. Let's bring in pandas and numpy as usual. All right,\npandas has four main time related classes. Timestamp, DatetimeIndex,\nPeriod, and PeriodIndex.",
  'score': 23.05040740966797,
  'rank': 1,
  'document_id': '0cc544ac-8a28-4aad-9f6a-34123f90166c',
  'passage_id': 2977,
  'document_metadata': {'source': '08_date-time-functionality.en.txt',
   'course_number': 'SIADS 505',
   'course_title': 'Data Manipulation',
   'start_index': 0}},
 {'content': 'In fact, this is usually\nwhat we collect in reality. We take the measurements. We cannot tak
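
Each search hit is a dictionary with `content`, `score`, `rank`, `document_id`, `passage_id`, and `document_metadata` keys, as the output above shows. Here is a minimal sketch for condensing hits into one line each, assuming that shape. `summarize_hits` is a hypothetical helper, and the sample hit reuses the first result above with its content abbreviated.

```python
def summarize_hits(hits, width=60):
    """Return one short line per hit: rank, score, course number, content preview."""
    lines = []
    for hit in hits:
        meta = hit.get("document_metadata") or {}
        # Collapse newlines and repeated spaces before truncating the preview
        preview = " ".join(hit["content"].split())[:width]
        lines.append(
            f"{hit['rank']:>2}  {hit['score']:.2f}  "
            f"{meta.get('course_number', 'n/a')}  {preview}"
        )
    return lines


# First hit from the search above, content abbreviated for the example
sample_hits = [
    {
        "content": "In today's lecture, we're going to\nbe looking at time series "
                   "and date functionality in pandas.",
        "score": 23.05040740966797,
        "rank": 1,
        "document_id": "0cc544ac-8a28-4aad-9f6a-34123f90166c",
        "passage_id": 2977,
        "document_metadata": {
            "source": "08_date-time-functionality.en.txt",
            "course_number": "SIADS 505",
            "course_title": "Data Manipulation",
            "start_index": 0,
        },
    }
]

for line in summarize_hits(sample_hits):
    print(line)
```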

In [12]:
RAG.search(query="How does PCA work?")  # transcripts

[{'content': "Let's start by looking at a very important and widely used linear dimensionality\nreduction technique called principal component\nanalysis or PCA. There are a couple of ways\nto describe how PCA works. An intuitive, more geometric way and then there's\na linear algebra way. What we're going\nto do is to start, we're going to look\nat the geometric way, the visually intuitive\nway and then later, we'll look at the\nlinear algebra behind PCA as part of understanding a powerful general\ndimensionality reduction method called singular value\ndecomposition or SVD, which is very closely\nconnected to PCA. Intuitively what PCA does, it takes your\noriginal data points. Here I have a very simple\ndataset with two features. It's a two-dimensional\ndataset and imagine each instance is denoted by a point here in the\nscatterplot and intuitively, what PCA does geometrically to these original data points\nis it finds a rotation of the points so that the\ndimensions are statistically u

In [13]:
# RAGatouille lets you create a LangChain retriever from the indexed model
retriever = RAG.as_langchain_retriever(k=5)

In [14]:
retriever.invoke("What is a backpack?")



[Document(page_content='Class Registration > Q: What is a Backpack?: A: The Backpack is a feature available on [Wolverine Access](https://wolverineaccess.umich.edu/) that works much like the "shopping carts" you have seen on many retail websites. With the Backpack you can prepare for your upcoming registration appointment by filling it with classes you want to take. When it is time to register, you will select one or more classes from your Backpack to register for it. NOTE: Placing a class in your Backpack does not enroll you in that class. You must register for a class to become enrolled in it. It is important to note that receiving an override does not enroll you in the course, you still must register through [Wolverine Access](https://wolverineaccess.umich.edu/) to claim the seat that has been opened for you.', metadata={'source': 'advising_guide.md', 'heading': 'Class Registration > Q: What is a Backpack?', 'section': '21', 'course_number': 'n/a', 'course_title': 'n/a', 'course_dat

The next step is to add this retriever to the RAG pipeline and check its performance.
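
As a starting point for that step, here is a minimal sketch of turning retrieved documents into a prompt context. It assumes only that each retrieved item exposes `page_content` and `metadata`, as in the output above. `Doc`, `format_context`, and the prompt wording are hypothetical names and choices for illustration, and the sample text is abbreviated from the result above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs, max_chars=2000):
    """Join retrieved passages into one context string, tagging each with its source."""
    parts = []
    used = 0
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        block = f"[{source}]\n{doc.page_content.strip()}"
        if parts and used + len(block) > max_chars:
            break  # stop once the rough character budget would be exceeded
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Abbreviated from the retriever output above
retrieved = [
    Doc(
        "Class Registration > Q: What is a Backpack?: A: The Backpack is a "
        "feature available on Wolverine Access ...",
        {"source": "advising_guide.md"},
    ),
]
context = format_context(retrieved)
prompt = (
    "Answer using only the context below.\n\n"
    f"{context}\n\nQuestion: What is a backpack?"
)
print(prompt)
```

With the real retriever, `format_context(retriever.invoke(question))` should behave the same way, assuming the returned objects carry those two attributes.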