API Crawler


The goal of this repo is to simplify the process of scraping the web and creating your own data lakes.

Fetch and store data from the web and use it to feed your own AI models, vectorstores, and databases.

Features

  • Data Retrieval: Retrieve data from different sources, such as comments, posts, videos, and channels, using the provided API methods.
  • Data Storage: Store retrieved raw data in JSON format and reuse it later to build vectorstores, fine-tune models, or feed other pipelines (see the sketch below).
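
As a minimal sketch of the storage side, retrieved data can be written straight to a JSON file. The payload and file name here are purely illustrative; in practice you would store whatever one of the API classes returns:

import json

# Hypothetical payload already retrieved through one of the API classes
posts = [{'title': 'Example post', 'score': 42}]

# Store the raw data as JSON so it can later feed vectorstores, fine-tuning jobs, or databases
with open('posts.json', 'w') as f:
    json.dump(posts, f, indent=2)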

Getting Started

  1. Clone the repository:
git clone https://github.com/luc-pimentel/api_crawler.git
  2. Install the required dependencies:
pip install -r requirements.txt
  3. Configure the API credentials by creating a .env file with the credentials for the APIs you want to use. Alternatively, set the credentials as environment variables through os.environ (see the sketch below the example).

  4. Import the desired API module and start retrieving data:

from api_crawler import RedditAPI

# Retrieve posts from the r/python subreddit
reddit_api = RedditAPI()
posts = reddit_api.get_posts(subreddit='python', limit=10)
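
If you prefer not to keep a .env file, the credentials from step 3 can be set via os.environ before constructing the API class. The variable name below is a placeholder; use whatever names the API you are targeting expects:

import os
from api_crawler import RedditAPI

# Hypothetical credential name; replace it with the variables your target API expects
os.environ['REDDIT_API_KEY'] = 'your-api-key'

reddit_api = RedditAPI()
posts = reddit_api.get_posts(subreddit='python', limit=10)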

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
