To start,
we'll just accumulate info in flat `pandas` dataframes.

In [14]:
import pandas as pd


document_df = pd.DataFrame(columns=["text", "source", "sha256"])

document_df

Unnamed: 0,text,source,sha256


The Q&A will be more useful the more precisely we slice and link the documents,
so we want to split a semantic "document", like a lecture or a video,
up into sub-documents first.

**Note**: we leave it to the `langchain.TextSplitter` to split sub-documents into smaller chunks when they are upserted into the vector database.
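
To make the chunking step concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python. It is not the `langchain` implementation; the function name `split_into_chunks` and the `chunk_size` and `overlap` parameters are assumptions made for illustration.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into character chunks of at most chunk_size, overlapping by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_into_chunks(sample, chunk_size=12, overlap=2)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a chunk boundary still appears intact in at least one chunk when it is shorter than the overlap.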

## Markdown Files

Most pages on the FSDL website
are originally written in Markdown,
which makes it easy to pull out relevant sub-documents.

### Lectures

We first define a `DataFrame` with basic metadata about where the lectures can be found -- on the website and as raw Markdown.

In [15]:
notes_md_url_base = "https://raw.githubusercontent.com/ramnathv/corise-r-for-ds/main/notes/"

In [16]:
notes_slugs = {
    1: "week-00/01-welcome-to-r-for-ds/01-welcome-to-r-for-ds.md",
    2: "week-00/02-how-to-prepare-for-this-course/02-how-to-prepare-for-this-course.md",
    3: "week-00/03-logistic-faqs/03-logistic-faqs.md",
    4: "week-01/01-doing-data-science/01-doing-data-science.md",
    5: "week-01/02-data-science-in-action/02-data-science-in-action.md",
    6: "week-01/03-importing-data/03-importing-data.md",
    7: "week-01/04-visualizing-data/04-visualizing-data.md",
    8: "week-01/05-transforming-data/05-transforming-data.md",
    9: "week-01/06-manipulating-data/06-manipulating-data.md",
    10: "week-02/01-aggregating-data/01-aggregating-data.md",
    11: "week-02/02-reshaping-data/02-reshaping-data.md",
    12: "week-02/03-combining-data/03-combining-data.md",
    13: "week-02/04-grammar-of-graphics/04-grammar-of-graphics.md",
    14: "week-02/05-data-science-in-action-again/05-data-science-in-action-again.md"
}

notes_df = pd.DataFrame.from_dict(notes_slugs, orient="index", columns=["url-slug"])
notes_df

Unnamed: 0,url-slug
1,week-00/01-welcome-to-r-for-ds/01-welcome-to-r...
2,week-00/02-how-to-prepare-for-this-course/02-h...
3,week-00/03-logistic-faqs/03-logistic-faqs.md
4,week-01/01-doing-data-science/01-doing-data-sc...
5,week-01/02-data-science-in-action/02-data-scie...
6,week-01/03-importing-data/03-importing-data.md
7,week-01/04-visualizing-data/04-visualizing-dat...
8,week-01/05-transforming-data/05-transforming-d...
9,week-01/06-manipulating-data/06-manipulating-d...
10,week-02/01-aggregating-data/01-aggregating-dat...


In [17]:
# join base and slug with exactly one slash (the base already ends in "/")
notes_df["raw-md-url"] = notes_df["url-slug"].apply(lambda s: f"{notes_md_url_base.rstrip('/')}/{s}")
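
One thing to watch when building URLs from a base and a slug is how slashes are handled. As a hedged alternative to string formatting, the standard library's `urllib.parse.urljoin` can do the join; the variable names below are illustrative only.

```python
from urllib.parse import urljoin

base = "https://raw.githubusercontent.com/ramnathv/corise-r-for-ds/main/notes/"
slug = "week-00/03-logistic-faqs/03-logistic-faqs.md"

# the base ends with "/", so the slug is appended to it
joined = urljoin(base, slug)

# without the trailing slash, urljoin replaces the last path segment ("notes")
joined_no_slash = urljoin(base.rstrip("/"), slug)
```

The second call shows why the trailing slash on the base matters when using `urljoin`: the `notes` segment is dropped from the result.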

We then bring in the Markdown files from GitHub.
Each file is kept whole as a single source here;
this version does not split on headings
or use `slugify` to create per-heading URLs.
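
Splitting a Markdown file on its headings and deriving a URL for each heading could be sketched as follows. This is an illustration under stated assumptions, not the notebook's own code: `simple_slugify` is a simplified stand-in for a real `slugify` library, `split_on_headings` and the `example.com` base URL are hypothetical, and the regex does not skip `#` lines inside code blocks.

```python
import re

# a Markdown ATX heading: 1-6 "#" characters, spaces or tabs, then the title
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)


def simple_slugify(title):
    """Lowercase, drop punctuation, and join words with hyphens (simplified stand-in)."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9\s-]", "", title)
    return re.sub(r"[\s-]+", "-", title).strip("-")


def split_on_headings(markdown, base_url):
    """Return one dict per heading with its text and an anchor URL.

    Text that appears before the first heading is not returned.
    """
    matches = list(HEADING_RE.finditer(markdown))
    sections = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        heading = match.group(2).strip()
        sections.append({
            "heading": heading,
            "text": markdown[match.start():end].strip(),
            "source": f"{base_url}#{simple_slugify(heading)}",
        })
    return sections


example_md = "## Importing Data\n\nSome text.\n\n### Reading CSV files\n\nMore text.\n"
sections = split_on_headings(example_md, "https://example.com/notes")
```

A split like this has to run on the raw Markdown, before any step that collapses newlines, because the heading pattern relies on line starts.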

In [26]:
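from smart_open import open  # note: shadows the built-in open; accepts URLs and local paths


def get_text_from(url):
    """Fetch the contents of a URL (or local path) as a string."""
    with open(url) as f:
        contents = f.read()
    return contents

notes_df["raw-text"] = notes_df["raw-md-url"].apply(get_text_from)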

In [27]:
notes_df["raw-text"]

1     \n### 👋 Hi!!\n\nHello, I’m Ramnath, and I’m ex...
2     \n### How to Prepare for This Course?\n\nWe cr...
3     \n### Are sessions recorded?\n\nWe encourage y...
4     \n## Doing Data Science\n\n### What is Data Sc...
5     \n## Data Science in Action\n\nThe best way to...
6     \n## Importing Data\n\nImporting data refers t...
7     \n## Visualizing Data\n\nData visualization is...
8     \n## Transforming Data\n\nVisualizing data is ...
9     \n## Manipulating Data\n\nRecall how data mani...
10    \n## Aggregating Data\n\n**Aggregating** data ...
11    \n## Reshaping Data\n\n**Reshaping** data is a...
12    \n## Combining Data\n\n**Combining** data spre...
13    \n## Grammar of Graphics\n\nThe **Grammar of G...
14    \n## Unisex Names\n\n<img src="https://fivethi...
Name: raw-text, dtype: object

In [30]:
# collapse every run of whitespace (including newlines) into a single space
notes_df["raw-text"] = notes_df["raw-text"].apply(lambda x: " ".join(str(x).split()))

In [31]:
notes_df["raw-text"]

1     ### 👋 Hi!! Hello, I’m Ramnath, and I’m excited...
2     ### How to Prepare for This Course? We created...
3     ### Are sessions recorded? We encourage you to...
4     ## Doing Data Science ### What is Data Science...
5     ## Data Science in Action The best way to get ...
6     ## Importing Data Importing data refers to the...
7     ## Visualizing Data Data visualization is a po...
8     ## Transforming Data Visualizing data is a lot...
9     ## Manipulating Data Recall how data manipulat...
10    ## Aggregating Data **Aggregating** data invol...
11    ## Reshaping Data **Reshaping** data is a fund...
12    ## Combining Data **Combining** data spread ac...
13    ## Grammar of Graphics The **Grammar of Graphi...
14    ## Unisex Names <img src="https://fivethirtyei...
Name: raw-text, dtype: object

In [32]:
notes_df

Unnamed: 0,url-slug,raw-md-url,raw-text
1,week-00/01-welcome-to-r-for-ds/01-welcome-to-r...,https://raw.githubusercontent.com/ramnathv/cor...,"### 👋 Hi!! Hello, I’m Ramnath, and I’m excited..."
2,week-00/02-how-to-prepare-for-this-course/02-h...,https://raw.githubusercontent.com/ramnathv/cor...,### How to Prepare for This Course? We created...
3,week-00/03-logistic-faqs/03-logistic-faqs.md,https://raw.githubusercontent.com/ramnathv/cor...,### Are sessions recorded? We encourage you to...
4,week-01/01-doing-data-science/01-doing-data-sc...,https://raw.githubusercontent.com/ramnathv/cor...,## Doing Data Science ### What is Data Science...
5,week-01/02-data-science-in-action/02-data-scie...,https://raw.githubusercontent.com/ramnathv/cor...,## Data Science in Action The best way to get ...
6,week-01/03-importing-data/03-importing-data.md,https://raw.githubusercontent.com/ramnathv/cor...,## Importing Data Importing data refers to the...
7,week-01/04-visualizing-data/04-visualizing-dat...,https://raw.githubusercontent.com/ramnathv/cor...,## Visualizing Data Data visualization is a po...
8,week-01/05-transforming-data/05-transforming-d...,https://raw.githubusercontent.com/ramnathv/cor...,## Transforming Data Visualizing data is a lot...
9,week-01/06-manipulating-data/06-manipulating-d...,https://raw.githubusercontent.com/ramnathv/cor...,## Manipulating Data Recall how data manipulat...
10,week-02/01-aggregating-data/01-aggregating-dat...,https://raw.githubusercontent.com/ramnathv/cor...,## Aggregating Data **Aggregating** data invol...


In [34]:
import hashlib

# use the SHA-256 hash of each document's text as its ID
doc_ids = [
    hashlib.sha256(text.encode("utf-8")).hexdigest()
    for text in notes_df["raw-text"]
]

notes_df.index = doc_ids

In [35]:
notes_df

Unnamed: 0,url-slug,raw-md-url,raw-text
a434686a0ac1866c5af1f6f0ea43d6ffc625fc644b666808b38c327f6541438a,week-00/01-welcome-to-r-for-ds/01-welcome-to-r...,https://raw.githubusercontent.com/ramnathv/cor...,"### 👋 Hi!! Hello, I’m Ramnath, and I’m excited..."
385725cc2f629f0141cdd07e9f9429cc295e50127a8d3a48cf00299861a27632,week-00/02-how-to-prepare-for-this-course/02-h...,https://raw.githubusercontent.com/ramnathv/cor...,### How to Prepare for This Course? We created...
baceb34a1eef801d5cbc0142fe236a9963ef410aa68d66386f0b7c27840d0bbe,week-00/03-logistic-faqs/03-logistic-faqs.md,https://raw.githubusercontent.com/ramnathv/cor...,### Are sessions recorded? We encourage you to...
6c0055b418bc94ceffb2e41affd914d5c2f7b7ccd731bd793544f57feaf59656,week-01/01-doing-data-science/01-doing-data-sc...,https://raw.githubusercontent.com/ramnathv/cor...,## Doing Data Science ### What is Data Science...
1c24a88fbeee8bb4689caa9e846202e8fe45a92209b7c3a51a22bf2d5ccaa246,week-01/02-data-science-in-action/02-data-scie...,https://raw.githubusercontent.com/ramnathv/cor...,## Data Science in Action The best way to get ...
8a535455f2483f17f73c396e3272763f2c8e90d266d781b4aa4d938e5198b596,week-01/03-importing-data/03-importing-data.md,https://raw.githubusercontent.com/ramnathv/cor...,## Importing Data Importing data refers to the...
cad7e117b485864a2acbacd64fd1286f486e3fc1f87c520b4873c058f95dd9c5,week-01/04-visualizing-data/04-visualizing-dat...,https://raw.githubusercontent.com/ramnathv/cor...,## Visualizing Data Data visualization is a po...
208e0f93266461c6ea1e8fb10b264b16ba6b9b2b5bfc88d1c28c070f67aaeece,week-01/05-transforming-data/05-transforming-d...,https://raw.githubusercontent.com/ramnathv/cor...,## Transforming Data Visualizing data is a lot...
fe75fd8edf3c0a4e57f398f1a029a0b9cedca9ddd6cdb00ce44b14d7775bf475,week-01/06-manipulating-data/06-manipulating-d...,https://raw.githubusercontent.com/ramnathv/cor...,## Manipulating Data Recall how data manipulat...
5704a131445ffc871667a085fb8de8deb6fa9015babd6139071da88643bdf413,week-02/01-aggregating-data/01-aggregating-dat...,https://raw.githubusercontent.com/ramnathv/cor...,## Aggregating Data **Aggregating** data invol...
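
Using a content hash as the index means the ID depends only on the text, so any change to the text, including whitespace normalization, changes the ID. A small sketch of that property, using only the standard library (the helper name `content_id` and the sample string are assumptions for illustration):

```python
import hashlib


def content_id(text):
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


raw = "\n## A Heading\n\nSome   text\n"
normalized = " ".join(raw.split())

# hashing the same text twice gives the same ID
same_text_same_id = content_id(normalized) == content_id(normalized)

# normalizing whitespace changes the text, and therefore the ID
whitespace_changes_id = content_id(raw) != content_id(normalized)
```

In practice this means IDs are only comparable between runs that apply the same preprocessing before hashing.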


## Persist to Disk

As a first step to persisting our corpus,
let's save it to disk and reload it.

The data involved is relatively simple --
basically all strings --
so we don't need to `pickle` the `DataFrame`,
which comes with its own woes.

Instead, we just format it as `JSON` --
the web's favorite serialization format.

In [36]:
documents_json = notes_df.to_json(orient="index", index=True)

In [37]:
with open("documents.json", "w") as f:
    f.write(documents_json)

Before moving on,
let's check that we can in fact reload the data.

In [38]:
import json

with open("documents.json") as f:
    s = f.read()

# peek at the first (sha256, record) pair to confirm the file parses
key, document = list(json.loads(s).items())[0]
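
A stricter check is to compare the reloaded data with what was written. The sketch below does this with the standard library only, on a hypothetical two-record corpus written to a temporary directory; the record contents and IDs are made up for illustration.

```python
import json
import os
import tempfile

# hypothetical miniature corpus, keyed by a stand-in ID
corpus = {
    "id-1": {"url-slug": "week-00/example.md", "raw-text": "## Example text"},
    "id-2": {"url-slug": "week-01/example.md", "raw-text": "## More text with an accent: é"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "documents.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(corpus, f)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

# dictionaries of strings survive a JSON round trip unchanged
round_trip_ok = reloaded == corpus
```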

## Put into MongoDB

But a local filesystem isn't a good method for persistence.

We want these documents to be available via an API,
with the ability to scale reads and writes if needed.

So let's put them in a database.

We choose MongoDB simply for convenience --
we don't want to define a schema just yet,
since these tools are evolving rapidly,
and there are nice free hosting options.

> MongoDB is, in NoSQL terms, a "document database",
but the term document means something different
than it does in "Document Q&A".
In Mongoland, a "document" is just a blob of JSON.
We format our Q&A documents as JSON
and store them in Mongo,
so the distinction is not obvious here.

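If you're running this yourself,
you'll need to create a hosted MongoDB instance
and add a database with a collection for the documents,
for example a database called `fsdl`
with a collection called `ask-fsdl`.
Whatever names you choose must match the ones passed to
`get_database` and `get_collection` in the code below,
which uses `Cluster0` for both.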

You can find instructions
[here](https://www.mongodb.com/basics/mongodb-atlas-tutorial).

You'll need the connection URL and password
from that setup process to connect.

Add them to the `.env` file
as `MONGODB_URI` and `MONGODB_PASSWORD`,
the variable names the code below reads.
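
A missing variable otherwise surfaces as a bare `KeyError`. One way to fail with a clearer message is a small helper like the one below; `require_env` is a hypothetical name, and the URI shown is a placeholder, not a real connection string.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or fail with a readable message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# demonstration with a stand-in mapping instead of the real environment
fake_env = {"MONGODB_URI": "mongodb+srv://user:password@example.invalid/"}
uri = require_env("MONGODB_URI", fake_env)
```

Calling `require_env("MONGODB_PASSWORD", fake_env)` on the same mapping raises `RuntimeError`, since that key is absent.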

In [39]:
import json
import os

from dotenv import load_dotenv
import pymongo
from pymongo import InsertOne

# load MONGODB_URI and MONGODB_PASSWORD from the .env file
load_dotenv()

mongodb_url = os.environ["MONGODB_URI"]
# read here so a missing value fails early; not used directly below
mongodb_password = os.environ["MONGODB_PASSWORD"]

CONNECTION_STRING = mongodb_url

# connect to the database server
client = pymongo.MongoClient(CONNECTION_STRING)
# connect to the database
db = client.get_database("Cluster0")
# get a representation of the collection
collection = db.get_collection("Cluster0")

collection

Collection(Database(MongoClient(host=['ac-mmh8jah-shard-00-02.6fpuqjc.mongodb.net:27017', 'ac-mmh8jah-shard-00-01.6fpuqjc.mongodb.net:27017', 'ac-mmh8jah-shard-00-00.6fpuqjc.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-q5qzwz-shard-0', tls=True), 'Cluster0'), 'Cluster0')

Now that we're connected,
we're ready to upsert.

We loop over the documents -- loaded from disk --
and format them into a Python dictionary
that fits our `Document` pseudoschema.

With `pymongo`,
we can just insert that dictionary directly,
using `InsertOne`,
and use `bulk_write` to get batching.
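
The formatting and batching pattern can be sketched without a database connection. In the sketch below, `to_document` and `in_batches` are hypothetical helper names and the records are made up; with `pymongo`, each document would be wrapped in `InsertOne` and each batch passed to `bulk_write`.

```python
def to_document(sha_hash, content, text_key="raw-text"):
    """Separate the text from its metadata and record the hash in the metadata."""
    metadata = {k: v for k, v in content.items() if k != text_key}
    metadata["sha256"] = sha_hash
    return {"text": content[text_key], "metadata": metadata}


def in_batches(items, batch_size):
    """Yield lists of at most batch_size items, flushing any remainder at the end."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# hypothetical records standing in for the contents of documents.json
records = {
    f"hash-{i}": {"url-slug": f"slug-{i}", "raw-text": f"text {i}"}
    for i in range(5)
}
docs = [to_document(h, c) for h, c in records.items()]
batch_sizes = [len(b) for b in in_batches(docs, batch_size=2)]
```

Five documents with a batch size of two give two full batches and one final partial batch, which is the case the trailing `if batch:` handles.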

In [41]:
CHUNK_SIZE = 250
requesting = []

with open("documents.json") as f:
    documents = json.load(f)


for sha_hash, content in documents.items():
    # everything except the text itself is metadata; the text column is "raw-text"
    metadata = {key: value for key, value in content.items() if key != "raw-text"}
    metadata["sha256"] = sha_hash
    document = {"text": content["raw-text"], "metadata": metadata}
    requesting.append(InsertOne(document))
    
    if len(requesting) >= CHUNK_SIZE:
        collection.bulk_write(requesting)
        requesting = []
        
if requesting:
    collection.bulk_write(requesting)
    requesting = []

ServerSelectionTimeoutError: ac-mmh8jah-shard-00-02.6fpuqjc.mongodb.net:27017: connection closed,ac-mmh8jah-shard-00-01.6fpuqjc.mongodb.net:27017: connection closed,ac-mmh8jah-shard-00-00.6fpuqjc.mongodb.net:27017: connection closed, Timeout: 30s, Topology Description: <TopologyDescription id: 64665d1d8e52252417b710e7, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('ac-mmh8jah-shard-00-00.6fpuqjc.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-mmh8jah-shard-00-00.6fpuqjc.mongodb.net:27017: connection closed')>, <ServerDescription ('ac-mmh8jah-shard-00-01.6fpuqjc.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-mmh8jah-shard-00-01.6fpuqjc.mongodb.net:27017: connection closed')>, <ServerDescription ('ac-mmh8jah-shard-00-02.6fpuqjc.mongodb.net', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('ac-mmh8jah-shard-00-02.6fpuqjc.mongodb.net:27017: connection closed')>]>