JustTheFAQs is a Node.js-based application designed to generate structured FAQ pages from Wikipedia content. By leveraging the Wikipedia API and OpenAI's GPT-4 API, the program extracts key information, organizes it into concise and engaging question-and-answer pairs, and stores them in a database for dynamic access through a Next.js frontend. Additionally, it stores vector embeddings for semantic search in Pinecone, enabling powerful question-answer lookups.
- Wikipedia Integration: Fetches top-viewed Wikipedia pages and their metadata
- Content Processing: Extracts HTML content and images from Wikipedia articles
- FAQ Generation: Uses OpenAI's GPT-4 to create structured FAQs with questions, answers, and related links
- Dynamic Page Rendering: Serves FAQ content directly from a PostgreSQL database (accessed via Supabase) through Next.js
- Semantic Search (via Pinecone): Uses Pinecone as the vector database to store embeddings for advanced similarity-based searching
- Dynamic Pagination: Fetches additional Wikipedia articles when encountering previously processed pages
- Database Management: Tracks processed pages to avoid duplication, maintain cross-links, and streamline workflows
```
├── lib/
│   ├── LogDisplay.js            # Used in the frontend display of logs.
│   └── db.js                    # Database helper functions; uses Supabase for Postgres access.
├── pages/
│   ├── api/
│   │   ├── scripts/
│   │   │   └── fetchAndGenerate.js  # Core script for fetching Wikipedia pages & generating FAQs.
│   │   ├── faqs.js              # API route for returning FAQ data.
│   │   └── search.js            # API route for performing semantic searches (queries Pinecone).
│   ├── [slug].js                # Dynamic route for individual FAQ pages.
│   └── index.js                 # Entry point for the frontend (search page).
├── stream.js                    # Likely handles streaming of data/logs to the frontend.
...
```
1. **Fetch Top Wikipedia Pages**
   - Uses the Wikimedia API to retrieve top-viewed Wikipedia articles.
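The fetch step can be sketched as follows. This is a minimal illustration against the Wikimedia Pageviews REST API; `topPagesUrl`, `filterArticles`, and `fetchTopPages` are hypothetical names, not the repo's actual functions, and yesterday's date is requested because the current day's counts are incomplete:

```javascript
// Build the Wikimedia Pageviews REST URL for a given UTC date.
function topPagesUrl(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  return `https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/${y}/${m}/${d}`;
}

// The raw top list includes Main_Page and Special: pages, which are not
// articles and are skipped before FAQ generation.
function filterArticles(pages) {
  return pages.filter((p) => p.article !== "Main_Page" && !p.article.includes(":"));
}

async function fetchTopPages(date = new Date(Date.now() - 86400000)) {
  const res = await fetch(topPagesUrl(date)); // built-in fetch (Node 18+)
  const json = await res.json();
  return filterArticles(json.items[0].articles);
}
```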
2. **Check Existing Data**
   - Each Wikipedia article is checked against the PostgreSQL database (via Supabase) to avoid duplicate processing.
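The duplicate check can be sketched with the Supabase client. The `faq_files` table and `slug` column come from the schema section below; `toSlug` is a hypothetical normalization, written here to produce slugs like `george-hamilton-actor-` seen in the examples:

```javascript
// Illustrative slug normalization: lowercase, collapse non-alphanumerics.
function toSlug(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, "-");
}

// Returns true when a matching faq_files row already exists.
async function alreadyProcessed(title) {
  // Lazy require so toSlug stays usable without Supabase installed.
  const { createClient } = require("@supabase/supabase-js");
  const supabase = createClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY
  );
  const { data, error } = await supabase
    .from("faq_files")
    .select("id")
    .eq("slug", toSlug(title))
    .maybeSingle();
  if (error) throw error;
  return data !== null; // existing row => skip reprocessing
}
```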
3. **Fetch and Process Content**
   - Extracts HTML content and images from Wikipedia articles using the Cheerio library.
   - Truncates content to fit within OpenAI's token limit if necessary.
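A condensed sketch of the extraction and truncation. The CSS selectors and the roughly-4-characters-per-token heuristic are illustrative assumptions, not the repo's exact logic:

```javascript
// Rough token budget: OpenAI tokens average ~4 characters of English text.
function truncateToTokens(text, maxTokens = 8000) {
  const maxChars = maxTokens * 4;
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

// Pull article text and image URLs out of Wikipedia HTML with Cheerio.
function extractContent(html) {
  const cheerio = require("cheerio"); // lazy require: truncateToTokens stays standalone
  const $ = cheerio.load(html);
  const text = $("#mw-content-text p").text();
  const images = $("#mw-content-text img")
    .map((_, el) => $(el).attr("src"))
    .get();
  return { text: truncateToTokens(text), images };
}
```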
4. **Generate FAQs**
   - Sends article content and images to OpenAI's GPT-4 API to create FAQs.
   - Produces structured data containing:
     - Questions/answers
     - Relevant subheaders
     - Cross-references to other Wikipedia pages
     - Media (image) links
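The generation call might look like the following. The prompt, the response shape, and the model name are assumptions rather than the repo's exact values; `parseFaqResponse` is a hypothetical validator:

```javascript
// Illustrative system prompt asking for the structured fields listed above.
const SYSTEM_PROMPT =
  "You generate FAQs from Wikipedia articles. Respond with JSON of the form " +
  '{"faqs":[{"subheader":"...","question":"...","answer":"...",' +
  '"cross_links":["..."],"media_link":"..."}]}';

// Validate the model's JSON before it is written to the database.
function parseFaqResponse(content) {
  const parsed = JSON.parse(content);
  if (!Array.isArray(parsed.faqs)) throw new Error("malformed FAQ payload");
  return parsed.faqs;
}

async function generateFaqs(title, articleText, imageUrls) {
  const OpenAI = require("openai"); // lazy require keeps parseFaqResponse standalone
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const completion = await client.chat.completions.create({
    model: "gpt-4o", // assumption: any GPT-4-family model with JSON mode
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      {
        role: "user",
        content: `Title: ${title}\nImages: ${imageUrls.join(", ")}\n\n${articleText}`,
      },
    ],
  });
  return parseFaqResponse(completion.choices[0].message.content);
}
```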
5. **Store in Database + Pinecone**
   - Saves all FAQ data (questions, answers, metadata) in a PostgreSQL database.
   - Generates embeddings for each FAQ and saves them to Pinecone, enabling semantic similarity search.
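The storage step can be sketched with the current `openai` and `@pinecone-database/pinecone` SDKs. The record shape and the `text-embedding-3-small` model choice are assumptions:

```javascript
// Build a Pinecone record keyed by the raw_faqs row id, so search results
// can be joined back to PostgreSQL.
function toPineconeRecord(faqRow, embedding) {
  return {
    id: String(faqRow.id), // Pinecone record IDs are strings
    values: embedding,
    metadata: { title: faqRow.title, question: faqRow.question },
  };
}

async function upsertFaqEmbeddings(faqRows) {
  const OpenAI = require("openai");
  const { Pinecone } = require("@pinecone-database/pinecone");
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const index = new Pinecone({ apiKey: process.env.PINECONE_API_KEY }).index("faq-embeddings");
  // Embed all Q&A pairs in one batched call.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // 1536 dimensions, matching the index
    input: faqRows.map((r) => `${r.question}\n${r.answer}`),
  });
  await index.upsert(faqRows.map((r, i) => toPineconeRecord(r, data[i].embedding)));
}
```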
6. **Dynamic Page Serving**
   - The Next.js frontend dynamically renders FAQ pages from the PostgreSQL database.
   - Real-time search and filtering are achieved by querying Pinecone for embeddings.
   - Cross-links between pages are maintained to allow deeper exploration.
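The search path can be sketched as: embed the query, ask Pinecone for the nearest vectors, then load the matching rows from `raw_faqs`. All names other than the table are illustrative:

```javascript
// Preserve Pinecone's similarity ordering when returning DB rows.
function rankByMatches(matches, rows) {
  const order = new Map(matches.map((m, i) => [m.id, i]));
  return [...rows].sort((a, b) => order.get(String(a.id)) - order.get(String(b.id)));
}

async function semanticSearch(query, topK = 5) {
  const OpenAI = require("openai");
  const { Pinecone } = require("@pinecone-database/pinecone");
  const { createClient } = require("@supabase/supabase-js");

  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const index = new Pinecone({ apiKey: process.env.PINECONE_API_KEY }).index("faq-embeddings");
  const supabase = createClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY
  );

  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const { matches } = await index.query({ vector: data[0].embedding, topK });
  const { data: rows } = await supabase
    .from("raw_faqs")
    .select("*")
    .in("id", matches.map((m) => Number(m.id)));
  return rankByMatches(matches, rows);
}
```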
- Node.js (v16 or higher)
- PostgreSQL database
- Supabase (to access the PostgreSQL database)
- Pinecone account and API key (for vector embeddings)
- OpenAI API key
1. Clone the repository:

   ```bash
   git clone https://github.com/your-repo/JustTheFAQs.git
   cd JustTheFAQs
   ```
2. Install dependencies:

   ```bash
   npm install
   ```
3. Configure environment variables:

   Create a `.env` file with the following keys:

   ```bash
   NEXT_PUBLIC_SUPABASE_URL=your_supabase_url
   NEXT_PUBLIC_SUPABASE_ANON_KEY=your_supabase_anon_key
   OPENAI_API_KEY=your_openai_api_key
   PINECONE_API_KEY=your_pinecone_api_key
   PINECONE_ENVIRONMENT=your_pinecone_env
   ```

   The OpenAI and Pinecone keys are required for generating embeddings and storing/retrieving vectors. The Supabase URL and key are used to connect to your PostgreSQL database.
Run the main script to fetch Wikipedia pages and generate FAQ content:

```bash
node pages/api/scripts/fetchAndGenerate.js
```
This fetches new Wikipedia pages, processes them in batches, and stores the result in both PostgreSQL (via Supabase) and Pinecone for embeddings.
Start the development server:

```bash
npm run dev
```

Open http://localhost:3000 to view the application in your browser.
Build and start the production server:

```bash
npm run build
npm start
```
Although embeddings are now stored in Pinecone, the application still uses PostgreSQL (accessed via Supabase) to store FAQ text data, cross-links, timestamps, etc. Below is the main schema for storing FAQ content.
| Column | Data Type | Nullable | Default | Constraint |
|---|---|---|---|---|
| id | integer | NO | | PRIMARY KEY |
| slug | text | NO | | UNIQUE |
| created_at | timestamp without time zone | YES | now() | |
| human_readable_name | text | YES | | |
This table represents distinct FAQ groups. Each `slug` corresponds to the Wikipedia title, and `human_readable_name` is the more readable page name.
Below is the updated `raw_faqs` schema reflecting all current columns:
| Column | Data Type | Nullable | Default | Description / Notes |
|---|---|---|---|---|
| id | integer | NO | | Primary key for each FAQ |
| url | text | NO | | Full URL of the Wikipedia page |
| title | text | NO | | Wikipedia page title |
| timestamp | timestamp without time zone | NO | CURRENT_TIMESTAMP | Time of row creation |
| question | text | NO | | FAQ question |
| answer | text | NO | | FAQ answer |
| media_link | text | YES | | Single image/media link |
| human_readable_name | text | YES | | Plain name for display |
| last_updated | timestamp without time zone | YES | | Tracks updates |
| subheader | text | YES | | FAQ section header |
| cross_link | text | YES | | Comma-separated list of cross-links (related pages) |
| image_urls | text | YES | | Deprecated; formerly a CSV of image URLs |
| faq_file_id | integer | YES | | Foreign key referencing `faq_files(id)` |
| pinecone_upsert_success | boolean | NO | false | Whether the row was successfully upserted to Pinecone |
Note: Previously, the application stored embeddings in a `faq_embeddings` table. That table is no longer used because Pinecone now manages all vector data for semantic search.
Below are minimal JSON examples—one row per table—to illustrate the data format stored in each of the primary tables.
```json
{
  "id": 38406,
  "url": "https://en.wikipedia.org/wiki/Cavalier",
  "title": "Cavalier",
  "timestamp": "2025-01-26 18:59:16.255024",
  "question": "How has the Cavalier aesthetic influenced the arts, particularly in portraiture?",
  "answer": "The Cavalier aesthetic significantly influenced art from the period, particularly through the works of artists like Sir Anthony van Dyck ... This artistic portrayal contributed to the legacy of the Cavalier image, cementing the connection between their military and social identity and the art of the time.",
  "media_link": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Sir_Anthony_Van_Dyck_-_Charles_I_%281600-49%29_-_Google_Art_Project.jpg/220px-Sir_Anthony_Van_Dyck_-_Charles_I_%281600-49%29_-_Google_Art_Project.jpg",
  "human_readable_name": "Cavalier",
  "last_updated": "2024-12-13 01:57:15",
  "subheader": "Cavalier Style in Arts",
  "cross_link": "Anthony_van_Dyck, Charles_I_in_Three_Positions",
  "image_urls": null,
  "faq_file_id": 3796,
  "pinecone_upsert_success": true
}
```
Purpose: Stores individual FAQ items, including the question, answer, relevant links, and references to a parent FAQ file.
Key Fields:

- `question` / `answer`: Main Q&A pair.
- `cross_link`: Cross-referenced Wikipedia pages or related topics.
- `faq_file_id`: Foreign key to the `faq_files` table.
```json
{
  "id": 3796,
  "slug": "cavalier",
  "created_at": "2025-01-26 18:59:12.655",
  "human_readable_name": "Cavalier"
}
```
Purpose: Stores high-level FAQ group data (e.g., each entry typically corresponds to a Wikipedia article).
Key Fields:

- `slug`: Typically matches the Wikipedia article title (in slug form).
- `human_readable_name`: A user-friendly display name (e.g., "Cavalier").
(If you have a separate table that tracks pages to be processed or cross-linked, you might call it something like `wiki_jobs` or `pending_articles`.)
```json
{
  "id": 22331,
  "title": "George Hamilton (actor)",
  "human_readable_name": "George Hamilton (actor)",
  "last_updated": null,
  "status": "pending",
  "priority": 1,
  "created_at": "2025-01-23 01:11:42.341451+00",
  "processed_at": null,
  "error_message": null,
  "attempts": 0,
  "slug": "george-hamilton-actor-",
  "source": "cross_link",
  "url": "https://en.wikipedia.org/wiki/George_Hamilton_(actor)"
}
```
Purpose: Tracks tasks or articles queued for processing, with fields for status, priority, etc.
Key Fields:

- `status`: Can be `pending`, `completed`, or `error`.
- `source`: Indicates why or how the entry was added (e.g., `cross_link`).
- `attempts`: Number of retry attempts after an error.
All embeddings are generated via OpenAI and stored in Pinecone. Queries to find the most relevant FAQs go through Pinecone, which retrieves the best-matching entries by cosine similarity. The results are then cross-referenced with `raw_faqs` data in PostgreSQL to serve the correct answers.
- Create a Pinecone index (e.g., `faq-embeddings`) with a dimension matching your chosen embedding model (e.g., 1536).
- Update `.env` with:

  ```bash
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_ENVIRONMENT=your_pinecone_environment
  ```
- Ensure your Node.js code references the correct Pinecone index name and environment.
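Index creation can also be done from Node with the Pinecone SDK. This is a sketch; the serverless cloud/region values are assumptions you should adjust for your account:

```javascript
// Config for a cosine-similarity index sized for 1536-dim embeddings.
function indexConfig(name = "faq-embeddings", dimension = 1536) {
  return {
    name,
    dimension, // must match the embedding model's output size
    metric: "cosine", // the app ranks FAQs by cosine similarity
    spec: { serverless: { cloud: "aws", region: "us-east-1" } },
  };
}

// Create the index only if it does not already exist.
async function ensureIndex() {
  const { Pinecone } = require("@pinecone-database/pinecone");
  const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
  const { indexes = [] } = await pc.listIndexes();
  if (!indexes.some((i) => i.name === "faq-embeddings")) {
    await pc.createIndex(indexConfig());
  }
}
```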
Contributions are welcome! Please submit a pull request or open an issue on the GitHub repository.
This project is licensed under the MIT License.
- Wikimedia API
- OpenAI
- Cheerio for HTML parsing
- Supabase for managing PostgreSQL data
- Pinecone for vector embedding search
- Next.js for the frontend framework
JustTheFAQs transforms dense Wikipedia articles into user-friendly FAQs enriched with semantic search capabilities. Explore your favorite topics with clarity and speed!