End-to-end data pipeline for scraping moltbook.com, processing with Polars, and predicting user karma with H2O AutoML.
moltbook-karma/
config/ # Configuration (settings, selectors)
src/
scraper/ # Playwright-based web scraping
database/ # SQLite models and operations
processing/ # Polars silver/gold layers
models/ # H2O AutoML training
tests/ # Pytest unit tests
notebooks/ # Jupyter notebook deliverable
data/ # Pipeline outputs (created automatically)
app/ # CLI entry point
- Python 3.10+
- Dependencies listed in pyproject.toml
# Clone or navigate to project directory
cd moltbook-karma
# Create virtual environment
python -m venv .venv
# Activate (Windows)
.venv\Scripts\activate
# Activate (Linux/Mac)
source .venv/bin/activate
# Install dependencies
pip install -e ".[dev]"
# Or install without the dev extras
pip install .
# Install Playwright browsers
playwright install chromium

# Scrape up to 100 users (default)
python -m app scrape --max-users 100
# Scrape with custom limits
python -m app scrape --max-users 50 --max-posts 200
# Force refresh cached pages
python -m app scrape --force
# Run with visible browser (debugging)
python -m app scrape --no-headless

# Build silver (cleaned) and gold (features) layers
python -m app build

This outputs Parquet files to:
- data/silver/ - Cleaned data
- data/gold/ - Feature-engineered data
# Train karma prediction model
python -m app train
# With custom settings
python -m app train --max-models 20 --max-time 600

Model artifacts saved to data/models/.
# View pipeline status
python -m app status

python -m app --help
python -m app scrape --help
python -m app build --help
python -m app train --help

| Command | Description |
|---|---|
| scrape | Scrape moltbook.com for users, posts, comments |
| build | Build silver and gold data layers from database |
| train | Train H2O AutoML model for karma prediction |
| status | Show current pipeline status and counts |

Options for the scrape command:

| Option | Default | Description |
|---|---|---|
| --max-users | 100 | Maximum users to scrape |
| --max-posts | 500 | Maximum posts to scrape |
| --max-comments | 1000 | Maximum comments to scrape |
| --force | False | Force refresh cached pages |
| --headless/--no-headless | True | Run browser in headless mode |

Options for the train command:

| Option | Default | Description |
|---|---|---|
| --max-models | 10 | Maximum models for AutoML |
| --max-time | 300 | Maximum training time (seconds) |
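
The commands and options documented above could be wired up roughly as follows. This is a minimal sketch using the standard library's argparse, not the project's actual entry point: the real CLI may be built with a different library, and the `build_parser` name is hypothetical. Only the flags and defaults come from the tables above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented commands, options, and defaults."""
    parser = argparse.ArgumentParser(prog="app", description="moltbook karma pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="Scrape moltbook.com for users, posts, comments")
    scrape.add_argument("--max-users", type=int, default=100)
    scrape.add_argument("--max-posts", type=int, default=500)
    scrape.add_argument("--max-comments", type=int, default=1000)
    scrape.add_argument("--force", action="store_true")
    # BooleanOptionalAction (Python 3.9+) generates both --headless and --no-headless.
    scrape.add_argument("--headless", action=argparse.BooleanOptionalAction, default=True)

    sub.add_parser("build", help="Build silver and gold data layers from database")

    train = sub.add_parser("train", help="Train H2O AutoML model for karma prediction")
    train.add_argument("--max-models", type=int, default=10)
    train.add_argument("--max-time", type=int, default=300)

    sub.add_parser("status", help="Show current pipeline status and counts")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
example = build_parser().parse_args(["scrape", "--max-users", "50", "--no-headless"])
print(example.command, example.max_users, example.headless)
```

Options that are not passed on the command line keep the defaults from the tables, so `python -m app train` with no flags would use 10 models and 300 seconds under this sketch.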
# Run all tests
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ -v --cov=src

Settings can be customized via environment variables:
# Rate limiting
export MOLTBOOK_RATE_LIMIT_SECONDS=2.0
# Database path
export MOLTBOOK_DB_PATH=data/custom.db
# H2O settings
export MOLTBOOK_MAX_MODELS=20
export MOLTBOOK_MAX_RUNTIME_SECS=600

Or create a .env file in the project root.
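
To illustrate how such overrides might be read, here is a minimal standard-library sketch. The `Settings` class and `load_settings` function are hypothetical names, and the defaults are assumptions inferred from this README (1 request per second, the CLI defaults for training, and the data/moltbook.db path). The project's real settings module in config/ may work differently, and .env loading is not covered here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Defaults are assumptions taken from values mentioned elsewhere in the README.
    rate_limit_seconds: float = 1.0
    db_path: str = "data/moltbook.db"
    max_models: int = 10
    max_runtime_secs: int = 300


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read MOLTBOOK_* variables from env, falling back to the defaults above."""
    d = Settings()
    return Settings(
        rate_limit_seconds=float(env.get("MOLTBOOK_RATE_LIMIT_SECONDS", d.rate_limit_seconds)),
        db_path=env.get("MOLTBOOK_DB_PATH", d.db_path),
        max_models=int(env.get("MOLTBOOK_MAX_MODELS", d.max_models)),
        max_runtime_secs=int(env.get("MOLTBOOK_MAX_RUNTIME_SECS", d.max_runtime_secs)),
    )


# Passing a plain dict makes the behaviour easy to check without touching the shell.
print(load_settings({"MOLTBOOK_RATE_LIMIT_SECONDS": "2.0", "MOLTBOOK_MAX_MODELS": "20"}))
```

Taking the environment as a parameter keeps the function easy to unit test; in normal use it would simply be called as `load_settings()`.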
SQLite database with the following tables:
- users - User profiles (id_user, name, karma, description, ...)
- posts - Posts (id_post, id_user, title, rating, ...)
- comments - Comments (id_comment, id_user, id_post, ...)
- sub_molt - Communities (id_submolt, name, description)
- user_submolt - User-community relationships
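
For orientation, the tables listed above can be sketched with Python's built-in sqlite3 module. This is not the project's schema.sql: it includes only the columns named in this README, the column types and keys are assumptions, the columns of user_submolt are guessed from its description, and the `table_counts` helper is hypothetical.

```python
import sqlite3

# Partial, assumed schema: the real schema.sql defines additional columns ("...").
SCHEMA = """
CREATE TABLE users (
    id_user TEXT PRIMARY KEY,
    name TEXT,
    karma INTEGER,
    description TEXT
);
CREATE TABLE posts (
    id_post TEXT PRIMARY KEY,
    id_user TEXT REFERENCES users(id_user),
    title TEXT,
    rating INTEGER
);
CREATE TABLE comments (
    id_comment TEXT PRIMARY KEY,
    id_user TEXT REFERENCES users(id_user),
    id_post TEXT REFERENCES posts(id_post)
);
CREATE TABLE sub_molt (
    id_submolt TEXT PRIMARY KEY,
    name TEXT,
    description TEXT
);
-- Assumed link table for user-community relationships.
CREATE TABLE user_submolt (
    id_user TEXT REFERENCES users(id_user),
    id_submolt TEXT REFERENCES sub_molt(id_submolt),
    PRIMARY KEY (id_user, id_submolt)
);
"""

TABLES = ("users", "posts", "comments", "sub_molt", "user_submolt")


def table_counts(conn: sqlite3.Connection) -> dict:
    """Return the row count of each table, similar in spirit to a status report."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES}


# An in-memory database keeps the sketch from touching data/moltbook.db.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO users (id_user, name, karma) VALUES (?, ?, ?)", ("u1", "example", 42))
print(table_counts(conn))
```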
Run schema.sql to initialize:
sqlite3 data/moltbook.db < schema.sql

| Path | Description |
|---|---|
| data/moltbook.db | SQLite database |
| data/raw/ | Cached HTML files |
| data/silver/*.parquet | Cleaned data |
| data/gold/user_features.parquet | Modeling features |
| data/models/ | H2O model artifacts |
| data/models/predictions.parquet | Predictions |
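
A quick way to see which of the outputs above exist is to glob for them. The sketch below is only an illustration: the `EXPECTED_OUTPUTS` mapping and `output_status` function are hypothetical names and are not part of the project's status command.

```python
from pathlib import Path

# Glob patterns relative to the data directory, taken from the table above.
EXPECTED_OUTPUTS = {
    "database": "moltbook.db",
    "raw_html": "raw/*",
    "silver": "silver/*.parquet",
    "gold": "gold/user_features.parquet",
    "predictions": "models/predictions.parquet",
}


def output_status(data_dir: Path = Path("data")) -> dict:
    """Count the files matching each expected output pattern under data_dir."""
    return {
        label: sum(1 for _ in data_dir.glob(pattern))
        for label, pattern in EXPECTED_OUTPUTS.items()
    }


print(output_status())
```

A count of zero for a label suggests that the corresponding pipeline stage has not been run yet.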
This scraper respects the target website:
- Rate limiting: 1 request/second (configurable)
- User-Agent identification
- HTML caching for incremental updates
- No concurrent requests by default
Please review moltbook.com's Terms of Service before scraping.
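
The rate limiting and HTML caching described above might look roughly like the following. This is a sketch under stated assumptions, not the scraper's actual code: `RateLimiter` and `fetch_cached` are hypothetical names, the cache file naming (a SHA-256 hash of the URL) is an assumption, and `fetch` stands in for whatever function loads a page, such as a Playwright call.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch_cached(
    url: str,
    fetch: Callable[[str], str],
    limiter: RateLimiter,
    cache_dir: Path = Path("data/raw"),
    force: bool = False,
) -> str:
    """Return cached HTML for url; call fetch(url) only on a miss or when force is set."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: one file per URL, keyed by its SHA-256 hash.
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists() and not force:
        return path.read_text(encoding="utf-8")
    limiter.wait()  # only real requests count against the rate limit
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Because cache hits return before `limiter.wait()` is reached, re-running a scrape over already cached pages sends no requests and incurs no delay. The `force` argument corresponds to the idea behind the `--force` flag.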
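If Playwright cannot find a browser, install Chromium:
playwright install chromium

If H2O fails to start, ensure Java is installed:
java -version

If the database is locked, stop any other processes using the database or delete data/moltbook.db to reset.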
Academic use only. See assignment requirements.
