API Crawler


The goal of this repo is to simplify the process of scraping the web and creating your own data lakes.

Fetch and store data from the web and use it to feed your own AI models, vectorstores, and databases.

Features

  • Data Retrieval: Retrieve data from different sources, such as comments, posts, videos, and channels, using the provided API methods.
  • Data Storage: Store retrieved raw data in JSON format and reuse it later to build vectorstores, fine-tune models, or feed other pipelines (see the sketch below).
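
As a minimal sketch of the storage side, retrieved data can be written straight to a JSON file. The payload and file name here are purely illustrative; in practice you would store whatever one of the API classes returns:

import json

# Hypothetical payload already retrieved through one of the API classes
posts = [{'title': 'Example post', 'score': 42}]

# Store the raw data as JSON so it can later feed vectorstores, fine-tuning jobs, or databases
with open('posts.json', 'w') as f:
    json.dump(posts, f, indent=2)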

Getting Started

  1. Clone the repository:
git clone https://github.com/luc-pimentel/api_crawler.git
  2. Install the required dependencies:
pip install -r requirements.txt
  3. Configure the API credentials by creating a .env file with the credentials for the APIs you want to use. Alternatively, set the credentials as environment variables through os.environ (see the sketch below the example).

  4. Import the desired API module and start retrieving data:

from api_crawler import RedditAPI

# Retrieve posts from the r/python subreddit
reddit_api = RedditAPI()
posts = reddit_api.get_posts(subreddit='python', limit=10)
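
If you prefer not to keep a .env file, the credentials from step 3 can be set via os.environ before constructing the API class. The variable name below is a placeholder; use whatever names the API you are targeting expects:

import os
from api_crawler import RedditAPI

# Hypothetical credential name; replace it with the variables your target API expects
os.environ['REDDIT_API_KEY'] = 'your-api-key'

reddit_api = RedditAPI()
posts = reddit_api.get_posts(subreddit='python', limit=10)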

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
