kevin-bacon

The goal is to take any page on Wikipedia and compute the number of links of separation between it and the "Kevin Bacon" Wikipedia entry (https://en.wikipedia.org/wiki/Kevin_Bacon).

Test Cases:

For example, there is a one-link separation between Footloose (https://en.wikipedia.org/wiki/Footloose_(1984_film)) and Kevin Bacon, and a two-link separation between Tom Cruise and Kevin Bacon (via https://en.wikipedia.org/wiki/A_Few_Good_Men). Also consider the links of separation between https://en.wikipedia.org/wiki/Philosophy or https://en.wikipedia.org/wiki/Taj_Mahal and Kevin Bacon.

Inputs/Outputs

The input to the program is a Wikipedia URL; the output is an integer (the number of links of separation).

Table of Contents

  • Run API
  • Run Test Cases (E2E)
  • Run Scraper
  • Web Scraper
  • Algorithm to Calculate Degrees of Separation
  • Optimization
  • Considerations
  • Challenges
  • Adding new data sources
  • Helpful Links

Run API

  1. From a terminal, run npm run start; you should see Server running at: http://0.0.0.0:3000 in stdout
  2. Use the Swagger docs to get started: http://localhost:3000/documentation
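As a quick smoke test, something like the snippet below (Node 18+, run as an ES module) should print the number of links of separation. The /separation route and its url query parameter are assumptions based on this README; check the Swagger docs for the actual request shape.

// query.mjs — hypothetical client call (run: node query.mjs).
// The query parameter name `url` is an assumption; confirm it in the Swagger
// docs at http://localhost:3000/documentation.
const page = encodeURIComponent('https://en.wikipedia.org/wiki/Footloose_(1984_film)');
const res = await fetch(`http://localhost:3000/separation?url=${page}`);
console.log(await res.json()); // expected: 1 for Footloose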

Run Test Cases (E2E)

  1. Start the API by running the following command in a terminal of choice: npm run start
  2. Run the following command in a new terminal to execute the tests: npm run e2e
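A minimal sketch of what one of these checks could look like using Node's built-in test runner (node --test), assuming the hypothetical /separation?url=... request shape shown above; the repo's actual e2e suite may be structured differently.

// e2e.test.mjs — illustrative end-to-end check, not the repo's actual suite.
import test from 'node:test';
import assert from 'node:assert/strict';

test('Footloose is one link from Kevin Bacon', async () => {
  const page = encodeURIComponent('https://en.wikipedia.org/wiki/Footloose_(1984_film)');
  const res = await fetch(`http://localhost:3000/separation?url=${page}`);
  assert.equal(await res.json(), 1); // one-link separation per the test cases above
});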

Run Scraper

  1. Edit batch.js with the starting pages and the depth to which you want to scrape wiki pages (see the sketch below)
  2. From a terminal, run: node batch
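The exact contents of batch.js are not reproduced here; as a rough illustration only, the values you edit might look something like this (names are hypothetical):

// batch.js (illustrative sketch only — the real file may differ).
// Edit the starting pages and the maximum scrape depth, then run: node batch
const startingPages = [
  'https://en.wikipedia.org/wiki/Kevin_Bacon',
  'https://en.wikipedia.org/wiki/Footloose_(1984_film)',
];
const depth = 2; // how many link hops away from each starting page to scrape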

Web Scraper

  • Uses depth-first search (DFS) to scrape page links to a specified depth (see the sketch below)
  • The POST /graph route can be used to populate the in-memory graph using the web scraper
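The sketch below illustrates the general idea — a depth-limited DFS over article links fetched from the MediaWiki parse API (the endpoint listed under Helpful Links). It is an illustration of the approach, not the repo's webscraper.js.

// Illustrative depth-limited DFS over Wikipedia links (not webscraper.js).
const API = 'https://en.wikipedia.org/w/api.php';

async function fetchLinks(page) {
  const url = `${API}?format=json&prop=links&action=parse&page=${encodeURIComponent(page)}`;
  const body = await (await fetch(url)).json();
  // Keep main-namespace (ns 0) article links; the "*" field holds the title.
  return (body.parse?.links ?? []).filter(l => l.ns === 0).map(l => l['*']);
}

async function scrape(page, depth, graph = new Map(), seen = new Set()) {
  if (depth < 0 || seen.has(page)) return graph;
  seen.add(page);
  const links = await fetchLinks(page);
  graph.set(page, links);                       // adjacency list: page -> links
  for (const link of links) {
    await scrape(link, depth - 1, graph, seen); // recurse first = depth-first
  }
  return graph;
}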

Algorithm to Calculate Degrees of Separation

  • Uses a breadth-first search (BFS) approach (see the sketch after this list)
  • The /separation route can be edited to use either the file directory of pre-scraped data or an in-memory graph as the graph source
  • I also tried to scrape data "live" during the BFS, but the performance was so bad that I didn't invest more time in it. Same approach, just a different data source.
  • Future: use a graph database to store the data for better performance
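Not the repo's implementation, just a sketch of the BFS described above, over an adjacency list (a Map of page -> array of linked pages, such as the one built by the scraper sketch):

// Breadth-first search: number of link hops from `start` to `target`,
// or -1 if no path exists in the scraped data.
function computeDegreesOfSeparation(graph, start, target) {
  if (start === target) return 0;
  const visited = new Set([start]);
  let frontier = [start];
  let depth = 0;
  while (frontier.length > 0) {
    depth += 1;
    const next = [];
    for (const page of frontier) {
      for (const link of graph.get(page) ?? []) {
        if (link === target) return depth;
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return -1;
}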

Optimization

  • As nodes are added to the graph via scraping, store the depth at which they were scraped. This only works because Kevin_Bacon is always the target: start scraping from Kevin_Bacon, and answering a query then takes O(1) time (see the sketch below).
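A sketch of that idea, assuming pages are discovered level by level from Kevin_Bacon so the first depth recorded for a page is its minimum; answering a query is then a single Map lookup:

// Depth recorded at scrape time; lookup per query is O(1).
const depthFromBacon = new Map([['Kevin_Bacon', 0]]);

function recordPage(page, discoveredFrom) {
  if (!depthFromBacon.has(page)) {
    depthFromBacon.set(page, depthFromBacon.get(discoveredFrom) + 1);
  }
}

function separation(page) {
  return depthFromBacon.get(page) ?? -1; // -1: page not scraped yet
}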

Considerations

Populating Graph

  • Option 1: call the MediaWiki parse API (https://www.mediawiki.org/wiki/API:Parsing_wikitext), traverse the pages yourself (webscraper.js), and store the results in a database
    • Not very cost-effective if the application is hosted in the cloud
    • The time needed to populate the data could be significant since the network latency is unknown
  • Option 2: find a database download with initial data to seed the database
  • Must also consider how to keep the database up to date - Wikipedia pages are updated and new ones are added often
    • Is there a way to know which pages are updated or created? Pub/Sub? Polling?

Challenges

  • Collecting the data
    • Had to clean links to generic wiki pages (e.g. Help, Portal) and remove special characters in order to store them in the file directory

Adding new data sources (aka graph sources) for the web scraper to save to and the algorithm to compute from

  • New data sources can be added by implementing the following interface:
interface DataSource {
    // Parameter names and types are illustrative; return types are as documented.
    computeDegreesOfSeparation(start: string, target: string): number;
    insertNode(page: string): void;
    addEdge(from: string, to: string): void;
}
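For example, a minimal in-memory data source matching that shape might look like the sketch below (method parameters are illustrative; the repo's actual signatures may differ):

// Illustrative in-memory data source; parameter names are assumptions.
class InMemoryDataSource {
  constructor() {
    this.adjacency = new Map(); // page -> Set of linked pages
  }

  insertNode(page) {
    if (!this.adjacency.has(page)) this.adjacency.set(page, new Set());
  }

  addEdge(from, to) {
    this.insertNode(from);
    this.insertNode(to);
    this.adjacency.get(from).add(to);
  }

  computeDegreesOfSeparation(start, target) {
    // BFS over the adjacency sets, as in the sketch above.
    if (start === target) return 0;
    const visited = new Set([start]);
    let frontier = [start], depth = 0;
    while (frontier.length > 0) {
      depth += 1;
      const next = [];
      for (const page of frontier) {
        for (const link of this.adjacency.get(page) ?? []) {
          if (link === target) return depth;
          if (!visited.has(link)) { visited.add(link); next.push(link); }
        }
      }
      frontier = next;
    }
    return -1;
  }
}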

Helpful Links

https://en.wikipedia.org/w/api.php?format=json&prop=links&action=parse&page=Cleveland
