🥦, a web content crawling and sorting platform
- I want to
- Crawl content, such as images and text, from "feeds" on the Internet, such as RSS, Twitter, or an arbitrary webpage
- Archive that content into a centralized repository
- Process the content and attach extra attributes, such as extracting the hash, width, and height of an image, or translating a piece of text
- Manage the content repository using a dashboard, for example to view images and duplicates, or to view texts and change their translations
- Expose the content repository to the world, filtered by certain attributes, such as "moderation is true"
- While I do not want to
- Re-implement crawling resiliency and failure observability for different use cases
- Specify a different programming-language object model for content in each use case
- Re-implement common elements in a management dashboard for different use cases
This is a monolithic web application that generalizes the crawling, processing, sorting, and publishing of Internet content, while offering pluggability so that you can customize it to fulfill individual use cases
TBD
- Python 3.7
- pipenv
- Node.js
- MongoDB
- Have an unauthenticated MongoDB running at `localhost:27017`
  - macOS:

    ```
    brew install mongodb
    brew services start mongodb
    ```
- Debian and Ubuntu: Follow this guide
- To verify, run `mongo` in your terminal and you should be dropped into a MongoDB interactive shell
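For reference, the server will later need a MongoDB connection string for this instance (see `MONGODB_CONNECTION_STRING` below). The helper below is purely illustrative and not part of the codebase; it only shows the shape of such a string for a local instance:

```python
# Illustrative helper (not part of the codebase): build a MongoDB
# connection string for a local instance such as the one above.
def connection_string(host="localhost", port=27017, user=None, password=None, db=""):
    # An unauthenticated instance has no "user:password@" part.
    auth = f"{user}:{password}@" if user is not None else ""
    return f"mongodb://{auth}{host}:{port}/{db}"

print(connection_string())  # mongodb://localhost:27017/
```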
- Come up with a name for the instance. From now on we assume that name is `my_first_broccoli`

  ```
  ./scripts/init_mongo.sh my_first_broccoli
  ```
This script will create a database named `my_first_broccoli`, with a user named `my_first_broccoli` whose password is `my_first_broccoli` and who has the `readWrite` role on that database
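The script's exact contents are not reproduced here, but the behavior described above corresponds to a MongoDB `createUser` command like the following sketch (the helper function is illustrative only):

```python
# Illustrative sketch (not the actual script): the MongoDB createUser
# command document equivalent to what init_mongo.sh is described as doing.
def build_create_user_command(instance_name):
    return {
        "createUser": instance_name,  # user named after the instance
        "pwd": instance_name,         # password equals the instance name
        "roles": [{"role": "readWrite", "db": instance_name}],
    }

cmd = build_create_user_command("my_first_broccoli")
```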
```
ADMIN_USERNAME                 # admin username used to authenticate API calls
ADMIN_PASSWORD                 # admin password used to authenticate API calls
JWT_SECRET_KEY                 # JWT secret key
MONGODB_CONNECTION_STRING      # MongoDB connection string
MONGODB_DB                     # MongoDB database name
DEFAULT_API_HANDLER_MODULE     # module of the default API handler
DEFAULT_API_HANDLER_CLASSNAME  # class name of the default API handler
```
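A minimal sketch of loading and validating these variables (this is not the server's actual startup code, just an illustration of fail-fast configuration):

```python
import os

# Minimal sketch (not the app's actual startup code): read the required
# server configuration from the environment and fail fast if any is missing.
REQUIRED_VARS = [
    "ADMIN_USERNAME",
    "ADMIN_PASSWORD",
    "JWT_SECRET_KEY",
    "MONGODB_CONNECTION_STRING",
    "MONGODB_DB",
    "DEFAULT_API_HANDLER_MODULE",
    "DEFAULT_API_HANDLER_CLASSNAME",
]

def load_config():
    missing = [v for v in REQUIRED_VARS if v not in os.environ]
    if missing:
        raise RuntimeError("Missing environment variables: {}".format(missing))
    return {v: os.environ[v] for v in REQUIRED_VARS}
```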
If you are running locally, you can copy `.env.sample` as `.env` and then edit `.env` in `server`
You should also set additional environment variables for the workers if they require them. If you are running locally, you can copy `.workers.env.sample` as `.workers.env` and then edit `.workers.env` in `server`
```
cd server
pipenv install
```
- Configure your shell environment to include `BPI_DEP_LINK`

  This is an environment variable that is needed in `setup.py` to find a local version of `broccoli-plugin-interface`, a Python package needed to develop broccoli plugins. In your `.zshrc` or `.bashrc`, add this line:

  ```
  export BPI_DEP_LINK=git+file:///ABS_PATH_TO_BPI#egg=broccoli_plugin_interface-0.1
  ```

  Replace `ABS_PATH_TO_BPI` with the absolute path to the `broccoli-plugin-interface` directory in the `broccoli-platform` codebase. For example, on my development environment with `zsh`, the line in `.zshrc` looks like this:

  ```
  export BPI_DEP_LINK=git+file:///Users/username/Projects/broccoli-platform/broccoli-plugin-interface#egg=broccoli_plugin_interface-0.1
  ```
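The plugin's actual `setup.py` is not reproduced here; the sketch below only illustrates how such a file could read `BPI_DEP_LINK` and hand it to setuptools' legacy `dependency_links` mechanism (the function name is an assumption):

```python
import os

# Illustrative sketch: how a plugin's setup.py could pick up BPI_DEP_LINK.
# (The actual setup.py of broccoli plugins is not reproduced here.)
def plugin_dependency_links():
    link = os.environ.get("BPI_DEP_LINK")
    if not link:
        raise RuntimeError("BPI_DEP_LINK is not set; add the export line to your shell rc")
    return [link]
```

With setuptools, this list would be passed as `dependency_links` alongside `install_requires`; pip 18.1 is pinned in the plugin install step below because later pip releases dropped support for `--process-dependency-links`.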
- Install the plugin

  Assume the PyPI module name or Python module URL is `$PLUGIN`:

  ```
  cd server
  pipenv run pip install pip==18.1
  pipenv run pip install -e $PLUGIN --process-dependency-links
  ```

  `$PLUGIN` might look something like `/Users/username/Projects/some-project/some-plugin`
- Uninstall the plugin

  ```
  pipenv run pip freeze | grep -e
  # figure out the Python module name of the plugin
  pipenv run pip uninstall $THE_MODULE_NAME
  ```
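The grep step above can also be mimicked in Python; this sketch (package names are placeholders) extracts distribution names from `pip freeze` output so one can be passed to `pip uninstall`:

```python
# Illustrative sketch: find an installed plugin's distribution name in
# `pip freeze` output so it can be passed to `pip uninstall`.
def find_plugin(freeze_output, needle):
    for line in freeze_output.splitlines():
        # freeze lines look like "name==1.0" or "name @ file:///path"
        name = line.split("==")[0].split(" @ ")[0].strip()
        if needle in name:
            return name
    return None

sample = "flask==1.1.2\nsome-plugin @ file:///Users/username/Projects/some-project/some-plugin\n"
print(find_plugin(sample, "plugin"))  # some-plugin
```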
```
FLASK_ENV=development pipenv run python app.py
```
```
pipenv run python -m unittest discover tests -v
```
If you are running locally, you can create and edit `.env.development.local` in `web`
```
cd web
npm install
npm start
```
```
./scripts/reset_mongo.sh my_first_broccoli
```