---

# Demo: Summarize and search through long reddit posts using dlt, Notion, and LanceDB

---


If you have the attention span to read those extra-long Reddit posts, you deserve respect. If you don't, then you deserve this demo.

By the end of this 100% free demo, you'll have something like this, without needing to be a Python pro (well, not the happiest example... 👀):

![image](https://storage.cloud.google.com/dlt-blog-images/demo_notebook_tuba.jpg)

### **So what exactly is this Colab for?**


**TL;DR:** You'll learn how to automatically load AI-summarized content from a specific subreddit into Notion, making content management and review more efficient for creators.

![Overview](https://storage.googleapis.com/dlt-blog-images/notebook_tuba_demo_overview.png)


**The full scoop:**

- This notebook is your testament to the fact that YES, you can indeed automate the summary of those never-ending Reddit posts and park them neatly into Notion, all without spending a dime.
- Consider this a `one-stop-shop template to breeze through content from any subreddit` — because, let’s face it, nobody has the time to read that much anymore.
- If you fancy a bit of coding, customize your data source and tweak this setup to do anything else AI might handle — like:
    - Bulk loading comments for sentiment analysis.
    - Automating translations across any language.

### **The coding corner**

**1. Install and import necessary libraries**:

In [None]:
!pip install praw notion_client nltk dlt transformers

# Standard library imports
import datetime

# Related third party imports
import praw
from notion_client import Client
from transformers import pipeline
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer data used by `sent_tokenize`
nltk.download('punkt')

# Data Load Tool
import dlt



**2. Initialize the PRAW (Python Reddit API Wrapper) client and the summarizer using Facebook's BART model:**

In [None]:
from google.colab import userdata

reddit = praw.Reddit(
    client_id=userdata.get('REDDIT_CLIENT_ID'),
    client_secret=userdata.get('REDDIT_SECRET'),
    password=userdata.get('REDDIT_PASSWORD'),
    user_agent=userdata.get('REDDIT_USER_AGENT'),
    username=userdata.get('REDDIT_USERNAME')
)

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')


**3. Define helper functions:**

In [None]:
# Helper function to convert a UNIX timestamp to an ISO 8601 formatted string with a UTC timezone indicator
def unix_to_iso8601(unix_timestamp):
    # Pass an explicit UTC timezone; without it, `fromtimestamp` returns local time
    utc_datetime = datetime.datetime.fromtimestamp(unix_timestamp, tz=datetime.timezone.utc)
    return utc_datetime.isoformat().replace('+00:00', 'Z')

# This is just a helper function to determine the maximum and minimum length requirements for the summaries
def dynamic_summary_length(text, max_ratio=0.4, min_ratio=0.2, min_length=30):
    text_length = len(text.split())
    max_length = max(min_length, int(text_length * max_ratio))
    min_length = max(min_length, int(text_length * min_ratio))
    return max_length, min_length

# This is just a helper function to summarize text using the BART model
def summarize_text(text):
    try:
        # Try summarizing the entire text first
        max_length, min_length = dynamic_summary_length(text)
        summary_object = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
        return summary_object[0]['summary_text']
    except Exception as e:
        print(f"Summarization failed: {e}. Splitting the text.")

        # Split the text into sentences
        sentences = sent_tokenize(text)

        # Find the midpoint in terms of the number of sentences
        mid_point = len(sentences) // 2

        # Split the text into two halves at the midpoint
        first_half = " ".join(sentences[:mid_point])
        second_half = " ".join(sentences[mid_point:])

        # Summarize each half separately
        try:
            first_max_length, first_min_length = dynamic_summary_length(first_half)
            first_half_summary = summarizer(first_half, max_length=first_max_length, min_length=first_min_length, do_sample=False)[0]['summary_text']
        except Exception as sub_e:
            print(f"First half summarization failed: {sub_e}")
            first_half_summary = first_half

        try:
            second_max_length, second_min_length = dynamic_summary_length(second_half)
            second_half_summary = summarizer(second_half, max_length=second_max_length, min_length=second_min_length, do_sample=False)[0]['summary_text']
        except Exception as sub_e:
            print(f"Second half summarization failed: {sub_e}")
            second_half_summary = second_half

        # Combine the summaries of both halves
        combined_summary = first_half_summary + " " + second_half_summary
        return combined_summary.strip()
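
To see how the length bounds behave, here is a small self-contained sketch that repeats `dynamic_summary_length` from the cell above and applies it to two made-up inputs (the sample texts are placeholders, not Reddit data). Keep in mind that the bounds are computed from whitespace-separated words, while the summarizer counts tokens, so they are only approximate.

```python
def dynamic_summary_length(text, max_ratio=0.4, min_ratio=0.2, min_length=30):
    # Same logic as the helper above: bounds scale with the word count,
    # but never drop below `min_length`
    text_length = len(text.split())
    max_length = max(min_length, int(text_length * max_ratio))
    min_length = max(min_length, int(text_length * min_ratio))
    return max_length, min_length

# Placeholder inputs: 200 words and 20 words
long_text = "word " * 200
short_text = "word " * 20

long_bounds = dynamic_summary_length(long_text)
short_bounds = dynamic_summary_length(short_text)

print(long_bounds)   # (80, 40)
print(short_bounds)  # (30, 30)
```

For short posts both bounds collapse to the 30-word floor, so the summary can end up about as long as the post itself.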




**4. Define your custom `dlt` resource:**

In the function below, we are using `dlt.sources.incremental` to perform incremental loading. It is used to track a specific field in the data source, in this case, the `Created_utc` field, which represents the time when a post was created.

The `initial_value` parameter is set to `"1970-01-01T00:00:00Z"`, the start of the Unix epoch. This means that on the first run of the pipeline, it will load all posts created since this time.

On subsequent runs, `dlt.sources.incremental` keeps track of the maximum `Created_utc` value it has seen and only loads posts with a greater `Created_utc` value. This is how it achieves incremental loading: only data created since the last run is loaded.

Without using this functionality of `dlt`, you would have to manually keep track of the last `Created_utc` value that you have seen, and manually filter the posts to only include those that are newer. This would involve more complex code and potentially error-prone manual tracking.
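
For comparison, here is a rough sketch of what that manual tracking could look like in plain Python. It is a simplification and not how `dlt` implements it: the sample rows, the `load_new_posts` helper, and the in-memory cursor are all hypothetical, and a real version would also have to persist the cursor between runs. It relies on the fact that ISO 8601 strings in the same format sort chronologically when compared as text.

```python
# Hypothetical sample rows shaped like the records a Reddit resource might yield
sample_posts = [
    {"ID": "a1", "Created_utc": "2024-07-14T09:15:00Z"},
    {"ID": "b2", "Created_utc": "2024-07-15T18:30:00Z"},
    {"ID": "c3", "Created_utc": "2024-07-16T12:56:37Z"},
]

def load_new_posts(posts, last_seen):
    """Return the posts newer than `last_seen` plus the updated cursor value."""
    new_posts = [p for p in posts if p["Created_utc"] > last_seen]
    cursor = max([last_seen] + [p["Created_utc"] for p in new_posts])
    return new_posts, cursor

# First run: start from the Unix epoch, so everything is new
first_batch, cursor = load_new_posts(sample_posts, "1970-01-01T00:00:00Z")

# Second run over the same data: nothing is newer than the stored cursor
second_batch, cursor = load_new_posts(sample_posts, cursor)

print(len(first_batch), len(second_batch), cursor)
# 3 0 2024-07-16T12:56:37Z
```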

In [None]:
# Define the `primary_key` and set the `write_disposition` to `merge` for incremental loading
@dlt.resource(primary_key='ID', write_disposition='merge')
def subreddit_posts(subreddit_name, updated_at=dlt.sources.incremental("Created_utc", initial_value="1970-01-01T00:00:00Z")):
    # Access the specified subreddit
    subreddit = reddit.subreddit(subreddit_name)

    # Retrieve the top 10 posts from the subreddit; adjust `limit` as necessary
    top_posts = subreddit.top(limit=10)

    for post in top_posts:
        # Check if the post has text, summarize it if it does
        if post.selftext:
            summary = summarize_text(post.selftext)
            print("Summarization successful!")
        else:
            summary = 'No Text'  # Handle posts without text

        # Convert the post's creation time from UNIX timestamp to ISO 8601 format
        created_time = unix_to_iso8601(post.created_utc)

        # Yield the post data in a structured format
        yield {
            'Title': post.title,
            'ID': post.id,
            'URL': post.url,
            'Summary': summary,
            'Created_utc': created_time
        }

**5. Define Notion as a custom `dlt` destination:**

`dlt` supports a variety of regularly tested integrations, but Notion is typically used as a data source and has no built-in support as a destination. For guidance on using Notion as a source, refer to the [official documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/notion). Because `dlt` lets you define custom destinations, configuring Notion as one is a good opportunity to learn how they work.

It's important to note that if you have configured a `dlt` resource with incremental loading, you must also define your destination as a `dlt` destination to ensure the incremental loading functions correctly.

In [None]:
# Define the destination function for creating entries in the Notion database
@dlt.destination(name='Notion')
def notion_create_post(items, table) -> None:
    # Initialize the Notion client with the authentication secret from secrets
    notion_client = Client(auth=userdata.get('NOTION_AUTHENTICATION'))
    # Retrieve the database ID from secrets
    notion_db_id = userdata.get('NOTION_DATABASE_ID')

    # Iterate over each item to create an entry in the Notion database
    for item in items:
        notion_client.pages.create(
            parent={'database_id': notion_db_id},
            properties={
                "Title": {"title": [{"text": {"content": item["Title"]}}]},
                "ID": {"rich_text": [{"text": {"content": item["ID"]}}]},
                "URL": {"url": item["URL"]},
                "Summary": {"rich_text": [{"text": {"content": item["Summary"]}}]},
                "Created_utc": {"rich_text": [{"text": {"content": str(item["Created_utc"])}}]},
            }
        )
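
One thing to watch for with long summaries: to my knowledge, the Notion API limits the `content` of a single rich text object to 2000 characters, so a very long summary may be rejected. Assuming that limit, a minimal sketch of a workaround is to split the text across several rich text objects. The `to_rich_text` helper and its `chunk_size` parameter are hypothetical names, not part of `notion_client` or `dlt`.

```python
def to_rich_text(text, chunk_size=2000):
    """Split `text` into a list of Notion-style rich text objects.

    `chunk_size=2000` is an assumption based on Notion's documented
    request limits; adjust it if the limit differs for your workspace.
    """
    # Slice the text into consecutive pieces of at most `chunk_size` characters;
    # fall back to a single empty piece so the property is never an empty list
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)] or [""]
    return [{"text": {"content": chunk}} for chunk in chunks]

# A made-up 4500-character summary ends up in three pieces
rich_text = to_rich_text("a" * 4500)
print([len(part["text"]["content"]) for part in rich_text])
# [2000, 2000, 500]
```

In `notion_create_post`, the summary property could then be built as `"Summary": {"rich_text": to_rich_text(item["Summary"])}`.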

**6. Create and run your `dlt` pipeline:**

Upon executing the code snippet below, your Notion database will be populated with basic information and summaries of subreddit posts. Utilizing incremental loading ensures that subsequent executions do not create duplicate entries.

To explore different content, simply change the `subreddit_name` argument passed to the `subreddit_posts` function in your `dlt` pipeline.

In [None]:
# Create your dlt pipeline
notion_pipeline = dlt.pipeline(pipeline_name="reddit_notion_pipeline", destination=notion_create_post)

# Run your dlt pipeline
load_info = notion_pipeline.run(subreddit_posts(subreddit_name = "offmychest"))
print(load_info)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Summarization successful!
Summarization successful!
Summarization successful!
Summarization successful!
Summarization successful!
Summarization successful!
Summarization successful!
Summarization failed: index out of range in self. Splitting the text.
Summarization successful!
Summarization failed: index out of range in self. Splitting the text.
Summarization successful!
Summarization successful!
Pipeline reddit_notion_pipeline load step completed in 2.26 seconds
1 load package(s) were loaded to destination Notion and into dataset None
The Notion destination used <dlt.common.configuration.specs.base_configuration.CredentialsConfiguration object at 0x7a56f1d35270> location to store data
Load package 1721134597.1803734 is LOADED and contains no failed jobs


---
# **Good Things Come to Those who Finish Code Demos...**
---

Congrats on having a reasonably long attention span! 😆

In this part, you'll do some additional cool stuff with the same Reddit data using [LanceDB](https://lancedb.github.io/lancedb/).

### **What cool stuff?**

**TL;DR:** You'll basically have your own mini search engine for querying subreddit post summaries.

![Overview](https://storage.googleapis.com/dlt-blog-images/notebook_tuba_demo_overview_lancedb_corrected.png)

**The full scoop:**

- If you've never had the chance to work with vector databases, this is your calling.
- Otherwise, this is a template to streamline your vector data pipelines with `dlt` and `LanceDB` - both open-source!
- If you're up for more advanced Machine Learning tasks, this is a great starting point where you don’t need to worry about the data loading part.

### **The coding corner**

**1. Install and import necessary libraries**:

In [None]:
!pip install "dlt[lancedb]" lancedb

import lancedb
from dlt.destinations.impl.lancedb.lancedb_adapter import lancedb_adapter


**2. Initialize `Notion` verified source**:

This command downloads the Notion verified source into your working directory and scaffolds a pipeline project with LanceDB as the destination. You can check what it has created in `Files`.

In [None]:
!yes | dlt init notion lancedb

Looking up the init scripts in https://github.com/dlt-hub/verified-sources.git...
No files to update, exiting


**3. Import the `dlt.source` that fetches databases from Notion**:

Note that we're also defining a `dlt.transformer` function, which lets you manipulate data coming from a `dlt.resource`. We use it to pass clean table data to the LanceDB adapter later, without any of the metadata that `notion_databases` yields.

In [None]:
from notion import notion_databases

# Retrieve all data from specified Notion databases
notion_data = notion_databases(database_ids = [{"id":  userdata.get('NOTION_DATABASE_ID')}], api_key = userdata.get('NOTION_AUTHENTICATION'))
# Since `notion_databases` is a dlt.source, we extract the table we have in Notion as a dlt.resource
# The `resources` attribute of a dlt.source object contains all the tables in the source.
# In this case, we're interested in the 'Reddit-summaries' table.
table_resource = notion_data.resources['Reddit-summaries']

# We only need the table data without the metadata, so we use a dlt.transformer which can process yield results from a dlt.resource
# The `dlt.transformer` decorator is used to define a function that transforms data from a dlt.resource.
# The `data_from` parameter specifies the dlt.resource that the transformer function will process.
@dlt.transformer(data_from=table_resource)
def get_only_properties(entries):
    # This function iterates over the entries in the 'Reddit-summaries' table.
    # For each entry, it extracts the 'ID', 'URL', 'Summary', 'Title', and 'Created_utc' properties.
    # It then yields a dictionary containing these properties.
    for entry in entries:
        # Named `post_id` to avoid shadowing Python's built-in `id`
        post_id = entry['properties']['ID']['rich_text'][0]['plain_text']
        url = entry['properties']['URL']['url']
        summary = entry['properties']['Summary']['rich_text'][0]['plain_text']
        title = entry['properties']['Title']['title'][0]['plain_text']
        created_utc = entry['properties']['Created_utc']['rich_text'][0]['plain_text']
        yield {"Title": title, "ID": post_id, "URL": url, "Summary": summary, "Created_utc": created_utc}
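
To make the nesting easier to picture, here is a self-contained sketch of the same extraction applied to one made-up entry. The sample entry is hypothetical and trimmed to the fields that `get_only_properties` reads; real Notion entries carry much more metadata. The `first_plain_text` and `extract_properties` helpers are illustrative names, and the empty-list fallback is an addition that the transformer above does not have.

```python
# Hypothetical Notion page entry, trimmed to the fields the transformer reads
sample_entry = {
    "properties": {
        "ID": {"rich_text": [{"plain_text": "abc123"}]},
        "URL": {"url": "https://www.reddit.com/r/offmychest/comments/abc123/"},
        "Summary": {"rich_text": [{"plain_text": "A short summary."}]},
        "Title": {"title": [{"plain_text": "An example post"}]},
        "Created_utc": {"rich_text": [{"plain_text": "2024-07-16T12:56:37Z"}]},
    }
}

def first_plain_text(fragments, default=""):
    """Return the plain text of the first fragment, or `default` if the list is empty."""
    return fragments[0]["plain_text"] if fragments else default

def extract_properties(entry):
    """Flatten one Notion entry into the row shape passed on to LanceDB."""
    props = entry["properties"]
    return {
        "Title": first_plain_text(props["Title"]["title"]),
        "ID": first_plain_text(props["ID"]["rich_text"]),
        "URL": props["URL"]["url"],
        "Summary": first_plain_text(props["Summary"]["rich_text"]),
        "Created_utc": first_plain_text(props["Created_utc"]["rich_text"]),
    }

row = extract_properties(sample_entry)
print(row["Title"], "|", row["Summary"])
# An example post | A short summary.
```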


**4. Create and run your `dlt` pipeline with `LanceDB` as destination**:

`LanceDB` has an integration with `dlt`. All you need to do is pass your data to the adapter, name the column you want to embed, and run the pipeline.

In [None]:
# Note that we're using open-source tools, so we don't provide any keys here
credentials = {
    "uri": "reddit_summaries.lancedb",
    "api_key": "",
    "embedding_model_provider_api_key": "",
}

# Create your dlt pipeline with LanceDB as destination
lancedb_pipeline = dlt.pipeline(
    pipeline_name='reddit_lancedb_pipeline',
    destination=dlt.destinations.lancedb(
        credentials=credentials,
        embedding_model_provider = "huggingface",
        embedding_model = "BAAI/bge-small-en-v1.5",
    ),
    dataset_name='reddit_top_posts',
)

load_info = lancedb_pipeline.run(lancedb_adapter(get_only_properties, embed=["Summary"]), write_disposition="replace")
print(load_info)


Pipeline reddit_lancedb_pipeline load step completed in 5.95 seconds
1 load package(s) were loaded to destination LanceDB and into dataset reddit_top_posts
The LanceDB destination used <dlt.destinations.impl.lancedb.configuration.LanceDBCredentials object at 0x78e4b4584670> location to store data
Load package 1721294703.4760716 is LOADED and contains no failed jobs


**5. Query your data:**

This script connects to a LanceDB database, retrieves data from a specific table, searches for a query within the table, and converts the search results to a pandas DataFrame.

In [None]:
# Connect to the database and retrieve the data
db = lancedb.connect("reddit_summaries.lancedb")
table = db.open_table("reddit_top_posts___get_only_properties")

query = "marriage"
result = table.search(query).limit(2)
df = result.to_pandas()

# Now you can print or manipulate the DataFrame
texts = df["summary"]

for text in texts:
    print(text, "\n")

I was 21 when my fiance asked me to marry him. We were only engaged for 6 months before the inncident. My middle oldest sister, lets call her Nicky, was a very cold person. She only ever opened up to my fiance as she said she saw him as a brother. She and I never saw eye to eye, I loved her dearly because she was my sister but didn't like her as a person. The night was going smoothly until Nicky spotted a guy across the room whom she claimed she wanted to "climb like a tree" She walked over to him and within a few minutes she was back and she had a sour expression on her face. She then told me the guy didn't want her number but he wanted mine instead. I don't remember what happened next as I blacked out and the next morning I woke up on a hard sofa, my head pounding. When I came to, I realised I was in Nicky's friends house and my phone was sitting on the glass table in front of me, but it was flat. I tried to explain that my phone went flat but he then went on screaming about how coul