This project uses Browserbase and Stagehand to scrape Pitchfork album reviews, generate embeddings using OpenAI's text-embedding-3-small model, and store them in Qdrant for semantic search capabilities.
- Automated scraping of Pitchfork album reviews using Browserbase and Stagehand
- Generation of embeddings for full review text using OpenAI's text-embedding-3-small model
- Storage of reviews and embeddings in Qdrant for vector similarity search
- Unique UUID-based identification for each review
- Browser-like behavior to avoid detection
- Error handling and graceful failure recovery
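
To make "vector similarity search" concrete, here is a minimal sketch of how two embeddings can be compared. Qdrant ranks stored vectors by a distance metric chosen when the collection is created. Cosine similarity is assumed here because the source does not say which metric this project uses. `cosineSimilarity` and `rankBySimilarity` are hypothetical helper names for illustration, not part of the project or of the Qdrant client.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored items by similarity to a query vector, highest score first.
// This is a brute-force illustration; a vector database uses an index instead.
function rankBySimilarity<T extends { vector: number[] }>(
  query: number[],
  items: T[],
  limit: number
): Array<{ item: T; score: number }> {
  return items
    .map((item) => ({ item, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

In the actual application the ranking is delegated to Qdrant, so this code only illustrates what a similarity score means.
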
- Node.js (v18 or higher)
- npm or yarn
- OpenAI API key
- Browserbase API key and Project ID
- Qdrant API key and URL
Create a .env file in the root directory with the following variables:
BROWSERBASE_API_KEY=your_browserbase_api_key
BROWSERBASE_PROJECT_ID=your_browserbase_project_id
OPENAI_API_KEY=your_openai_api_key
QDRANT_URL=your_qdrant_url
QDRANT_API_KEY=your_qdrant_api_key

- Clone the repository:
git clone <repository-url>
cd pitchfork-reviews-vector-search
- Install dependencies:
npm install
- Start the development server:
npm run dev
- Open your browser and navigate to http://localhost:3000
- Click the "Run Stagehand" button to start the review collection process
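
A missing key usually surfaces only when the first API call fails, so it can help to check the five variables from the .env step at startup. The sketch below is one way to do that, not the project's actual code. `loadConfig` and `AppConfig` are hypothetical names, and the function takes the environment as a plain object so it can be tested without touching the real process environment.

```typescript
// Variable names taken from the .env listing in this README.
const REQUIRED_ENV_VARS = [
  "BROWSERBASE_API_KEY",
  "BROWSERBASE_PROJECT_ID",
  "OPENAI_API_KEY",
  "QDRANT_URL",
  "QDRANT_API_KEY",
] as const;

type EnvVarName = (typeof REQUIRED_ENV_VARS)[number];
type AppConfig = Record<EnvVarName, string>;

// Hypothetical helper: returns the validated config, or throws a single
// error that names every missing or blank variable.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_ENV_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as AppConfig;
  for (const name of REQUIRED_ENV_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

A typical call would be `loadConfig(process.env)` in server-side code. Next.js loads `.env` files into `process.env` on the server, so no extra loader should be needed there.
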
The application will:
- Navigate to predefined Pitchfork album review URLs
- Extract the full review content
- Generate embeddings for the review text
- Store the reviews and embeddings in Qdrant
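
The four steps above, together with the per-review UUID and the error handling listed among the features, can be sketched as a single loop. This is an illustrative outline under stated assumptions, not the contents of app/api/stagehand/main.ts. `PipelineDeps`, `ReviewPoint`, and `collectReviews` are hypothetical names, and the scraping, embedding, and storage calls are passed in as plain functions so that no Stagehand, OpenAI, or Qdrant signatures are assumed.

```typescript
// A point as it might be stored in the vector database (shape is an assumption).
interface ReviewPoint {
  id: string;
  vector: number[];
  payload: { url: string; text: string };
}

// Hypothetical dependencies. In the real project these would wrap
// Stagehand, the OpenAI embeddings API, and the Qdrant client.
interface PipelineDeps {
  extractReview(url: string): Promise<string>;
  embed(text: string): Promise<number[]>;
  store(point: ReviewPoint): Promise<void>;
  newId(): string; // for example, a UUID generator
}

interface PipelineResult {
  stored: string[];
  failed: Array<{ url: string; reason: string }>;
}

// Process each URL in turn: extract, embed, store.
async function collectReviews(
  urls: string[],
  deps: PipelineDeps
): Promise<PipelineResult> {
  const result: PipelineResult = { stored: [], failed: [] };
  for (const url of urls) {
    try {
      const text = await deps.extractReview(url);
      if (text.trim().length === 0) {
        throw new Error("empty review text");
      }
      const vector = await deps.embed(text);
      await deps.store({ id: deps.newId(), vector, payload: { url, text } });
      result.stored.push(url);
    } catch (err) {
      // One failing review is recorded and skipped; the run continues.
      const reason = err instanceof Error ? err.message : String(err);
      result.failed.push({ url, reason });
    }
  }
  return result;
}
```

Injecting the dependencies keeps the loop testable with stubs, and the returned `failed` list gives the caller something concrete to log or retry.
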
- app/api/stagehand/main.ts - Main scraping and processing logic
- app/lib/qdrant.ts - Qdrant client and collection management
- app/page.tsx - Frontend interface
- stagehand.config.ts - Browserbase and Stagehand configuration
To add more reviews, update the ALBUM_REVIEW_URLS array in app/api/stagehand/main.ts:
const ALBUM_REVIEW_URLS = [
  "https://pitchfork.com/reviews/albums/your-review-url-1/",
  "https://pitchfork.com/reviews/albums/your-review-url-2/",
  // Add more URLs as needed
];

- Next.js - React framework
- Browserbase - Browser automation
- Stagehand - AI-powered browser automation
- OpenAI - Text embeddings
- Qdrant - Vector database
- TypeScript - Type safety
- Zod - Schema validation
License: MIT