To start,
we'll just accumulate info in flat `pandas` dataframes.

In [1]:
import pandas as pd


document_df = pd.DataFrame(columns=["text", "source", "sha256"])

document_df

Unnamed: 0,text,source,sha256


The Q&A will be more useful the more precisely we slice and link the documents,
so we want to split a semantic "document", like a lecture or a video,
up into sub-documents first.

**Note**: we leave it to the `langchain.TextSplitter` to split each sub-document (a "source") into smaller chunks at the time of upsert into the vector database.
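
To make that later chunking step concrete,
here is a minimal sketch of fixed-size splitting with overlap.
It is not how `langchain` implements its splitters;
`chunk_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration,
and the input is toy text.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# toy sub-document standing in for one heading's worth of notes
sub_document = "## Toy Heading\n\n" + "A toy sentence for chunking. " * 10
chunks = chunk_text(sub_document, chunk_size=120, overlap=20)
```

Every chunk still traces back to a single sub-document,
which is why we slice by heading first and chunk afterwards.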

## Markdown Files

Most pages on the FSDL website
are originally written in Markdown,
which makes it easy to pull out relevant sub-documents.

### Lectures

We first define a `DataFrame` with basic metadata about where the lectures can be found -- on the website and as raw Markdown.

In [8]:
notes_md_url_base = "https://raw.githubusercontent.com/ramnathv/corise-r-for-ds/main/notes/"

In [9]:
notes_slugs = {
    1: "week-00/01-welcome-to-r-for-ds/01-welcome-to-r-for-ds.md",
    2: "week-00/02-how-to-prepare-for-this-course/02-how-to-prepare-for-this-course.md",
    3: "week-00/03-logistic-faqs/03-logistic-faqs.md",
    4: "week-01/01-doing-data-science/01-doing-data-science.md",
    5: "week-01/02-data-science-in-action/02-data-science-in-action.md",
    6: "week-01/03-importing-data/03-importing-data.md",
    7: "week-01/04-visualizing-data/04-visualizing-data.md",
    8: "week-01/05-transforming-data/05-transforming-data.md",
    9: "week-01/06-manipulating-data/06-manipulating-data.md",
    10: "week-02/01-aggregating-data/01-aggregating-data.md",
    11: "week-02/02-reshaping-data/02-reshaping-data.md",
    12: "week-02/03-combining-data/03-combining-data.md",
    13: "week-02/04-grammar-of-graphics/04-grammar-of-graphics.md",
    14: "week-02/05-data-science-in-action-again/05-data-science-in-action-again.md"
}

notes_df = pd.DataFrame.from_dict(notes_slugs, orient="index", columns=["url-slug"])
notes_df

Unnamed: 0,url-slug
1,week-00/01-welcome-to-r-for-ds/01-welcome-to-r...
2,week-00/02-how-to-prepare-for-this-course/02-h...
3,week-00/03-logistic-faqs/03-logistic-faqs.md
4,week-01/01-doing-data-science/01-doing-data-sc...
5,week-01/02-data-science-in-action/02-data-scie...
6,week-01/03-importing-data/03-importing-data.md
7,week-01/04-visualizing-data/04-visualizing-dat...
8,week-01/05-transforming-data/05-transforming-d...
9,week-01/06-manipulating-data/06-manipulating-d...
10,week-02/01-aggregating-data/01-aggregating-dat...


In [10]:
notes_df["raw-md-url"] = notes_df["url-slug"].apply(lambda s: f"{notes_md_url_base}{s}")

We then bring in the markdown files from GitHub,
parse them to split out headings as our "sources",
and use `slugify` to create URLs for those heading sources.

In [11]:
from smart_open import open


def get_text_from(url):
    # smart_open's open() reads remote URLs as well as local paths
    with open(url) as f:
        contents = f.read()
    return contents

notes_df["raw-text"] = notes_df["raw-md-url"].apply(get_text_from)

In [18]:
import mistune
from slugify import slugify


def get_target_headings_and_slugs(text):
    # parse the Markdown into a list of AST nodes instead of rendered HTML
    markdown_parser = mistune.create_markdown(renderer="ast")
    parsed_text = markdown_parser(text)

    # keep only the level-2 headings
    heading_objects = [obj for obj in parsed_text if obj["type"] == "heading"]
    h2_objects = [obj for obj in heading_objects if obj["level"] == 2]

    # skip headings whose text starts with "description: "
    targets = [obj for obj in h2_objects if not obj["children"][0]["text"].startswith("description: ")]
    target_headings = [tgt["children"][0]["text"] for tgt in targets]

    # slugs become the "#fragment" part of each section's URL
    heading_slugs = [slugify(target_heading) for target_heading in target_headings]

    return target_headings, heading_slugs

In [19]:
def split_notes(row):
    text = row["raw-text"]

    headings, slugs = get_target_headings_and_slugs(text)

    texts = split_by_headings(text, headings)
    # the text before the first heading has no anchor, so it gets an empty slug
    slugs = [""] + slugs

    text_rows = []
    for section_text, slug in zip(texts, slugs):
        text_rows.append({
            "url-slug": row["url-slug"] + "#" + slug,
            "raw-md-url": row["raw-md-url"],
            "text": section_text,
        })

    return pd.DataFrame.from_records(text_rows)

In [20]:
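def split_by_headings(text, headings):
    # work backwards from the last heading, peeling one section off the end of
    # the text at a time; what remains is the text before the first heading
    texts = []
    for heading in reversed(headings):
        # note: splitting on "# " + heading leaves the other "#" of the "## "
        # marker on the preceding text, and the unpacking raises ValueError
        # if the heading string occurs more than once in the remaining text
        text, section = text.split("# " + heading)
        texts.append(f"## {heading}{section}")
    texts.append(text)
    texts = list(reversed(texts))
    return texts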
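
Substring splitting, as in `split_by_headings`,
matches the heading text wherever it occurs,
including inside a `### ` heading with the same title.
As a hedged alternative, not the approach this notebook runs,
here is a sketch that splits only at lines that are exactly `## <heading>`.
The name `split_on_heading_lines` and the toy input are made up for illustration.

```python
def split_on_heading_lines(text, headings):
    """Split markdown text into a preamble plus one section per listed heading.

    A new section starts only at a line that is exactly "## <heading>" for the
    next expected heading, so "### ..." lines and in-body mentions are ignored.
    """
    sections = [[]]  # sections[0] collects the preamble before the first heading
    remaining = list(headings)
    for line in text.splitlines(keepends=True):
        if remaining and line.rstrip() == "## " + remaining[0]:
            remaining.pop(0)
            sections.append([])
        sections[-1].append(line)
    return ["".join(lines) for lines in sections]


toy = "intro\n## Setup\ninstall things\n### Setup\ndetails\n## Usage\nrun things\n"
parts = split_on_heading_lines(toy, ["Setup", "Usage"])
# parts[0] is the preamble; parts[1] and parts[2] each start with their "## " line
```

Joining the parts back together reproduces the input exactly,
so no characters are lost or duplicated at the section boundaries.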

In [22]:
note_dfs = []
for idx, row in notes_df.iterrows():
    single_note_df = split_notes(row)
    single_note_df["notes-idx"] = idx
    note_dfs.append(single_note_df)
    
split_notes_df = pd.concat(note_dfs, ignore_index=True)

In [23]:
split_notes_df

Unnamed: 0,url-slug,raw-md-url,text,notes-idx
0,week-00/01-welcome-to-r-for-ds/01-welcome-to-r...,https://raw.githubusercontent.com/ramnathv/cor...,"\n### 👋 Hi!!\n\nHello, I’m Ramnath, and I’m ex...",1
1,week-00/02-how-to-prepare-for-this-course/02-h...,https://raw.githubusercontent.com/ramnathv/cor...,\n### How to Prepare for This Course?\n\nWe cr...,2
2,week-00/03-logistic-faqs/03-logistic-faqs.md#,https://raw.githubusercontent.com/ramnathv/cor...,\n### Are sessions recorded?\n\nWe encourage y...,3
3,week-01/01-doing-data-science/01-doing-data-sc...,https://raw.githubusercontent.com/ramnathv/cor...,\n#,4
4,week-01/01-doing-data-science/01-doing-data-sc...,https://raw.githubusercontent.com/ramnathv/cor...,## Doing Data Science\n\n### What is Data Scie...,4
5,week-01/02-data-science-in-action/02-data-scie...,https://raw.githubusercontent.com/ramnathv/cor...,\n#,5
6,week-01/02-data-science-in-action/02-data-scie...,https://raw.githubusercontent.com/ramnathv/cor...,## Data Science in Action\n\nThe best way to g...,5
7,week-01/03-importing-data/03-importing-data.md#,https://raw.githubusercontent.com/ramnathv/cor...,\n#,6
8,week-01/03-importing-data/03-importing-data.md...,https://raw.githubusercontent.com/ramnathv/cor...,## Importing Data\n\nImporting data refers to ...,6
9,week-01/04-visualizing-data/04-visualizing-dat...,https://raw.githubusercontent.com/ramnathv/cor...,\n#,7


In [32]:
import hashlib

doc_ids = []
for _, row in split_notes_df.iterrows():
    m = hashlib.sha256()
    m.update(row["text"].encode("utf-8"))
    doc_ids.append(m.hexdigest())
    
split_notes_df.index = doc_ids
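
Hashing the text gives each row an ID that depends only on its content.
A small standard-library sketch of what that implies
(`text_id` is a hypothetical helper and the strings are toy data):
the same text always maps to the same ID,
so two rows with identical text end up sharing an index label.

```python
import hashlib


def text_id(text):
    """Return the hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


id_a = text_id("## Toy Heading\n\nsome body text")
id_b = text_id("## Toy Heading\n\nsome body text")        # same text as id_a
id_c = text_id("## Toy Heading\n\nsome other body text")  # different text

same_text_same_id = id_a == id_b
different_text_different_id = id_a != id_c
```

This is handy for deduplication,
but it also means the index is only unique if the texts are.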

## Persist to Disk

As a first step to persisting our corpus,
let's save it to disk and reload it.

The data involved is relatively simple --
basically all strings --
so we don't need to `pickle` the `DataFrame`,
which comes with its own woes.

Instead, we just format it as `JSON` --
the web's favorite serialization format.

In [35]:
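# identical texts hash to the same sha256, and orient="index" needs unique keys,
# so keep only the first row for each hash
deduplicated_df = split_notes_df[~split_notes_df.index.duplicated(keep="first")]
documents_json = deduplicated_df.to_json(orient="index", index=True)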

In [27]:
with open("documents.json", "w") as f:
    f.write(documents_json)

Before moving on,
let's check that we can in fact reload the data.

In [28]:
import json

with open("documents.json") as f:
    s = f.read()
    
key, document = list(json.loads(s).items())[0]
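
With `orient="index"`, the JSON is a mapping
from each index label to a mapping of column names to values.
The sketch below mimics that layout with the standard library only,
so the round trip can be checked without `pandas`;
the hashes, URLs, and texts are placeholders, not real data.

```python
import json
import os
import tempfile

# toy stand-in for the corpus: {index label: {column name: value}}
records = {
    "hash-1": {"url-slug": "notes/a.md#", "raw-md-url": "https://example.com/a.md",
               "text": "intro", "notes-idx": 1},
    "hash-2": {"url-slug": "notes/a.md#setup", "raw-md-url": "https://example.com/a.md",
               "text": "## Setup", "notes-idx": 1},
}

# write to a temporary directory so the real documents.json is left alone
path = os.path.join(tempfile.mkdtemp(), "documents.json")
with open(path, "w") as f:
    json.dump(records, f)

with open(path) as f:
    reloaded = json.load(f)

first_key, first_document = next(iter(reloaded.items()))
```

Because everything here is strings and small integers,
the reloaded dictionary compares equal to the one we wrote.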

## Put into MongoDB

But a local filesystem isn't a good method for persistence.

We want these documents to be available via an API,
with the ability to scale reads and writes if needed.

So let's put them in a database.

We choose MongoDB simply for convenience --
we don't want to define a schema just yet,
since these tools are evolving rapidly,
and there are nice free hosting options.

> MongoDB is, in NoSQL terms, a "document database",
but the term document means something different
than it does in "Document Q&A".
In Mongoland, a "document" is just a blob of JSON.
We format our Q&A documents as JSON
and store them in Mongo,
so the distinction is not obvious here.

If you're running this yourself,
you'll need to create a hosted MongoDB instance
with a database and a collection for the documents.
The code below uses `Cluster0` as the name of both.

You can find instructions
[here](https://www.mongodb.com/basics/mongodb-atlas-tutorial).

You'll need the URL and password info
from that setup process to connect.

Add them to the `.env` file
as `MONGODB_URI` and `MONGODB_PASSWORD`.

In [1]:
import json
import os

from dotenv import load_dotenv
import pymongo
from pymongo import InsertOne

load_dotenv()

mongodb_url = os.environ["MONGODB_URI"]
mongodb_password = os.environ["MONGODB_PASSWORD"]

# the URI read above doubles as the connection string
CONNECTION_STRING = mongodb_url

# connect to the database server
client = pymongo.MongoClient(CONNECTION_STRING)
# connect to the database
db = client.get_database("Cluster0")
# get a representation of the collection
collection = db.get_collection("Cluster0")

collection

Collection(Database(MongoClient(host=['ac-mmh8jah-shard-00-00.6fpuqjc.mongodb.net:27017', 'ac-mmh8jah-shard-00-02.6fpuqjc.mongodb.net:27017', 'ac-mmh8jah-shard-00-01.6fpuqjc.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-q5qzwz-shard-0', tls=True), 'Cluster0'), 'Cluster0')

Now that we're connected,
we're ready to write the documents to the collection.

We loop over the documents -- loaded from disk --
and format them into a Python dictionary
that fits our `Document` pseudoschema.

With `pymongo`,
we can just insert that dictionary directly,
using `InsertOne`,
and use `bulk_write` to get batching.

In [None]:
CHUNK_SIZE = 250
requesting = []

with open("documents.json") as f:
    documents = json.load(f)


for (sha_hash, content) in documents.items():
    metadata = {key: value for key, value in content.items() if key != "text"}
    metadata["sha256"] = sha_hash
    document = {"text": content["text"], "metadata": metadata}
    requesting.append(InsertOne(document))
    
    if len(requesting) >= CHUNK_SIZE:
        collection.bulk_write(requesting)
        requesting = []
        
if requesting:
    collection.bulk_write(requesting)
    requesting = []
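
The loop above mixes two concerns:
reshaping each record and flushing in batches.
Here is a sketch that separates them, runnable without a database;
`in_batches` and `to_document` are hypothetical helper names,
and the toy records stand in for the contents of `documents.json`.

```python
def in_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def to_document(sha_hash, content):
    """Separate the text from its metadata, mirroring the loop above."""
    metadata = {key: value for key, value in content.items() if key != "text"}
    metadata["sha256"] = sha_hash
    return {"text": content["text"], "metadata": metadata}


toy_documents = {f"hash-{i}": {"text": f"text {i}", "notes-idx": i} for i in range(5)}
shaped = (to_document(sha_hash, content) for sha_hash, content in toy_documents.items())
batches = list(in_batches(shaped, batch_size=2))
```

Each batch could then be wrapped in `InsertOne` operations
and handed to `collection.bulk_write`, as in the cell above.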