The goal of this repo is to simplify the process of scraping the web and creating your own data lakes.
Fetch and store data from the web and use it to feed your own AI models, vectorstores, and databases.
- Data Retrieval: Retrieve data from different sources, such as comments, posts, videos, and channels, using the provided API methods.
- Data Storage: Store retrieved raw data in JSON format. Use it later to build vectorstores, fine-tune your models, and more.
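As a minimal sketch of the storage step, retrieved records can be persisted as raw JSON with the standard library. The sample posts below are hypothetical placeholders standing in for real API results:

```python
import json

# Hypothetical records standing in for data returned by an API method
posts = [
    {"title": "Show and tell", "score": 42},
    {"title": "Weekly thread", "score": 7},
]

# Persist the raw data as JSON for later use (vectorstores, fine-tuning, etc.)
with open("posts.json", "w") as f:
    json.dump(posts, f, indent=2)
```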
- Clone the repository:

```bash
git clone https://github.com/luc-pimentel/api_crawler.git
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```
- Configure the API credentials by creating a `.env` file and adding the credentials for the APIs you want to use. Alternatively, you can set them as environment variables via `os.environ`.
- Import the desired API module and start retrieving data:
```python
from api_crawler import RedditAPI

reddit_api = RedditAPI()
posts = reddit_api.get_posts(subreddit='python', limit=10)
```
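If you prefer not to keep a `.env` file, credentials can be injected through `os.environ` before constructing an API wrapper. The variable names below are assumptions for illustration; check the wrapper you use for the keys it actually reads:

```python
import os

# Hypothetical variable names -- consult each API wrapper for the real ones
os.environ["REDDIT_CLIENT_ID"] = "your-client-id"
os.environ["REDDIT_CLIENT_SECRET"] = "your-client-secret"
```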
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.