# Data Loader
This script makes use of open source datasets published to Hugging Face. In particular this will download various articles or writing samples with some associated questions and answers. This is great for experimenting with an FAQ RAG application, comparing new questions from users with existing questions in the database.

First, specify the dataset to download from Hugging Face (stanfordnlp/coqa).

In [1]:
from datasets import load_dataset

REPO_ID = "stanfordnlp/coqa"

dataset = load_dataset(REPO_ID)

Create tables in the Vector Database to store the raw and embedding data

In [8]:
import iris

conn = iris.connect(hostname='localhost', 
                    port=51972, 
                    namespace='USER',
                    username='SuperUser', 
                    password='SYS')

cursor = conn.cursor()
cursor.execute("""
                CREATE TABLE RAG_COQA.Story (
                    StoryId INTEGER,
                    Source VARCHAR(50),
                    Story VARCHAR(10000),
                    StoryEmbedding VECTOR(DOUBLE, 768)
                )
            """)

cursor.execute("""
                CREATE TABLE RAG_COQA.QandA (
                    StoryId INTEGER,
                    Question VARCHAR(500),
                    QuestionEmbedding VECTOR(DOUBLE, 768),
                    Answer VARCHAR(1000)         
                )
            """)
cursor.close()

Cache the embedding model from Hugging Face locally. You can view a leaderboard of text embedding models based on the MTE Benchmark here: https://huggingface.co/spaces/mteb/leaderboard

In [3]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", cache_folder='.\\huggingface_cache')

We will extract the 'train' data partition for the embeddings. To check the naming of the dataset splits, check the Hugging Face Dataset repo directly: https://huggingface.co/datasets/stanfordnlp/coqa

In [4]:
data_split = dataset['train']

from itertools import islice
iterator = islice(data_split, 10) # Slice top 10 stories
for story in iterator:
    print(story)

{'source': 'wikipedia', 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from 

Create the embeddings from the partition, and persist them over SQL using our vector database.

In [5]:
len(data_split)
print(data_split)

Dataset({
    features: ['source', 'story', 'questions', 'answers'],
    num_rows: 7199
})


We will only embed the first 300 articles, with all of their respective Q and As (4474 total). The embedding and persistence script may take a few minutes to complete. Check the progress bar to follow.

In [12]:
chunk = []

story_insert = "INSERT INTO RAG_COQA.Story (StoryId, Source, Story, StoryEmbedding) VALUES (?,?,?,?)"
qanda_insert = "INSERT INTO RAG_COQA.QandA (StoryId, Question, QuestionEmbedding, Answer) VALUES (?,?,TO_VECTOR(?,DOUBLE,768),?)"

story_cursor = conn.cursor()
qanda_cursor = conn.cursor()

from alive_progress import alive_bar

limit = 300
assert limit < len(data_split), "Limit cannot surpass size of data set."

with alive_bar(limit, force_tty=True) as bar:
    qanda_count = 0
    for idx, story in enumerate(data_split, start=1):
        #;
        story_embedding = model.encode(story['story'])
        story_embedding_to_list = str(story_embedding.tolist())[1:-1]
        story_cursor.execute(story_insert, [idx, story['source'][0:9990], story['story'], story_embedding_to_list])
        #;
        # Create embeddings for the questions with the pre-trained model.
        chunk = []
        questions = list(story['questions']) # This is a list[str]
        answers = list(story['answers']['input_text']) # This is also a list[str]
        qanda_count += len(questions)
        question_embeddings = model.encode(questions) # This is a list[tensor]
        #;
        #
        question_embeddings_to_list = [str(embedding.tolist())[1:-1] for embedding in question_embeddings]
        #;
        #
        for qdx, question in enumerate(questions):
            chunk.append([idx, question, question_embeddings_to_list[qdx], answers[qdx][0:999]])
        qanda_cursor.executemany(qanda_insert, chunk)
        #
        if idx == limit + 1: 
            break
        bar()

print("Questions embedded: " + str(qanda_count), end='')


|████████████████████████████████████████| 300/300 [100%] in 2:41.5 (1.86/s)    
Questions embedded: 4474

To remove the tables, run a DROP command. Dropping these tables will lose all embeddings data from this notebook.

In [13]:
## Drop tables
drop = input("Warning! Are you sure you want to drop tables? You will need to rebuild the tables and embeddings to run any further exercises. (Y/N)")
if drop == "Y":
    cursor = conn.cursor()
    cursor.execute("DROP TABLE RAG_COQA.Story")
    cursor.execute("DROP TABLE RAG_COQA.QandA")
    cursor.close()
    print("Tables dropped.")
else:
    print("Tables not dropped.")

Tables not dropped.
