RSS Feed loader #942

Merged: 5 commits, Dec 7, 2023

@sidmohanty11 sidmohanty11 commented Nov 13, 2023

Description

This PR adds support for loading RSS feeds directly from Embedchain.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • Test Script (please provide)
from embedchain import Pipeline as App

app = App.from_config("config.yaml")

# app.add("https://news.ycombinator.com/rss", data_type="rss_feed")
# app.add("https://keepingupwith.ai/feed", data_type="substack")
app.add("https://www.lennysnewsletter.com/feed", data_type="substack")

response = app.query("Who is Brian Chesky?")
print(response)

Output:

Inserting batches from 0 to 100 in chromadb
Inserting batches from 100 to 130 in chromadb
Successfully saved https://www.lennysnewsletter.com/feed (DataType.SUBSTACK). New chunks count: 130
Brian Chesky is the co-founder and CEO of Airbnb.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

loader = LangchainRSSFeedLoader(urls=[url])
data = loader.load()

for entry in data:
Collaborator

This is going to be really slow if the RSS feed is quite long. You might want to use multithreading to speed up the loader.
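
One way to act on this suggestion is a thread pool, since fetching feed entries is I/O-bound. The sketch below is only an illustration under assumptions: `fetch_entry` and `load_entries` are hypothetical names that do not come from this PR, and `fetch_entry` stands in for whatever per-entry download and parsing the loader actually does.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_entry(link: str) -> dict:
    """Hypothetical stand-in for the per-entry work (download + parse)."""
    # A real loader would perform network I/O here.
    return {"url": link, "content": f"content of {link}"}


def load_entries(links: list[str], max_workers: int = 8) -> list[dict]:
    """Fetch all entries concurrently, preserving the input order."""
    if not links:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `links`.
        return list(pool.map(fetch_entry, links))


links = [f"https://example.com/post/{i}" for i in range(5)]
entries = load_entries(links)
```

Threads suit this case because the work is dominated by waiting on the network; `max_workers` would need tuning against the feed host's rate limits.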

Contributor Author

Got it, will update this.

embedchain/utils.py (outdated, resolved)

@deven298 deven298 left a comment

@sidmohanty11 Seems like substack loader is calling rss loader in its load function. Any particular reason why we need both? Am I missing something?

embedchain/loaders/rss_feed.py (resolved)
embedchain/loaders/rss_feed.py (outdated, resolved)
@sidmohanty11
Contributor Author

> @sidmohanty11 Seems like substack loader is calling rss loader in its load function. Any particular reason why we need both? Am I missing something?

You're 100% correct; as of now this is incomplete. We need to figure out a way to get all the RSS feed posts for a Substack URL, and then we can map over them from there. Otherwise, we scrape the remaining posts and add them here.

codecov bot commented Nov 13, 2023

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (4a5ed1d) 64.33% compared to head (f90a1c2) 63.75%.
Report is 2 commits behind head on main.

❗ Current head f90a1c2 differs from pull request most recent head 25c023f. Consider uploading reports for the commit 25c023f to get more accurate results

Files                              Patch %   Lines
embedchain/loaders/rss_feed.py     0.00%     29 Missing ⚠️
embedchain/chunkers/rss_feed.py    0.00%     12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #942      +/-   ##
==========================================
- Coverage   64.33%   63.75%   -0.58%     
==========================================
  Files         118      120       +2     
  Lines        4405     4448      +43     
==========================================
+ Hits         2834     2836       +2     
- Misses       1571     1612      +41     
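
The figures in this diff can be reproduced from the hit and line counts alone. The snippet below is a small sanity check, not Codecov's own code: the helper names are made up for illustration, and truncating to two decimals is an assumption that happens to match the displayed percentages.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


def truncate2(value: float) -> float:
    """Truncate (not round) to two decimals; an assumption about the display."""
    return math.floor(value * 100) / 100


base = coverage_pct(2834, 4405)  # main
head = coverage_pct(2836, 4448)  # this PR
delta = head - base

# Misses are lines minus hits; the increase should equal the
# "41 lines missing coverage" reported above (29 + 12).
new_misses = (4448 - 2836) - (4405 - 2834)

print(truncate2(base), truncate2(head), round(delta, 2), new_misses)
```

The drop comes almost entirely from the two new files: 43 lines were added but only 2 of them are hit.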


@sidmohanty11 sidmohanty11 changed the title from "RSS Feed loader with custom substack loader extension" to "RSS Feed loader" Dec 7, 2023

@deven298 deven298 left a comment

LGTM



@register_deserializable
class RSSFeedChunker(BaseChunker):
Contributor

Let's use the common chunker instead (I created one earlier). Keep the default chunk_size=2000.
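
For readers unfamiliar with what `chunk_size=2000` controls, here is a rough illustration of fixed-size chunking. It is not the common chunker's implementation: `chunk_text` and its `overlap` parameter are hypothetical, and a real chunker would typically split on separators rather than at raw character offsets.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 0) -> list[str]:
    """Split text into pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Each chunk starts `step` characters after the previous one.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 4500
chunks = chunk_text(sample)
sizes = [len(chunk) for chunk in chunks]
```

With the default size, a 4500-character post yields two full chunks and one shorter remainder.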

@deshraj deshraj merged commit d8897ce into mem0ai:main Dec 7, 2023
3 checks passed
deshraj pushed a commit that referenced this pull request Dec 14, 2023
3 participants