Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Knowledge Search

a graph-based knowledge search engine powered by Wikipedia

Connecting Articles in a Graph

The first link in the main body text identifies a hierarchical relationship between articles: banana links to fruit, piano to musical instruments, and so on.

The search engine constructs a directed graph connecting the 11 million English articles (6 million redirects) using the first link. For those curious about the network's topology, peek at the research inspiring this project.

Example: Piano

Parent Comparable Children
musical instruments Music box, Violin family, Glass harmonica Piano Music, Piano music, Grand Piano, Lily Maisky, William Merrigan Daly

Graph Implementation

  1. Download entire XML dump available here:
  2. Extract the first link in the main body text (
  3. Store graph in neo4j
    • index articles by title and add page views as a property of each article
      • uses bulk import, which also includes page views as an attribute for each node
      • query can filter resutls by page views to return the most relevant articles

note matching titles between the available hourly page view data and displayed title is imperfect (see

Fuzzy Title Matching

In addition to the graph, the first 2000 characters of the main body text are indexed for fuzzy title searching.

  • powered by Elasticsearch
  • indexing is distributed using Scala
    • note PySpark Databricks XML package and EsSpark (connector from Spark to Elasticsearch) are not compatible
    • build jar using Maven and run Scala (see pom.xml and index_wiki.scala)
  • Elasticsearch query weighs the title 2x more heavily realtive to introductory body text

Example: "paper" --> "Pulp (paper)"

lower-case "paper" is matched to the correct Wikipedia article title


To install dependencies:

pip install -r requirements.txt

For distributed computations, the program also requires Spark, Java > 7, and Scala.

Program expects configurations in which sets environment variables for database and spark nodes:

import os

os.environ["master_node_dns"] = ""
os.environ["elasticsearch_node_dns"] = ""
os.environ["neo4j_pass"] = "xxxx"
os.environ["neo4j_ip"] = ""


Flow based on search term:

  • match search term to the closest title (using Elasticsearch query above)
  • fetch a subset of the network (parent, comparable, and child articles)
  • filter the subgraph by page views to return only the most relevant subset

The front-end serves this result in a directed D3 graph.


a graph-based knowledge search engine powered by Wikipedia






No releases published


No packages published