Upgraded to Python 3
jwngr committed Jun 14, 2024
1 parent a28ad00 commit 6c874a8
Showing 16 changed files with 12,859 additions and 7,439 deletions.
89 changes: 51 additions & 38 deletions .github/CONTRIBUTING.md
@@ -2,20 +2,20 @@

Thank you for contributing to Six Degrees of Wikipedia!

## Local Setup
## Local setup

There are three main pieces you'll need to set up to run locally:

1. Mock SQLite database of Wikipedia links.
2. Backend Python Flask web server.
3. [Create React App](https://github.com/facebook/create-react-app)-based frontend website.
1. Mock SQLite database of Wikipedia links
2. Backend Python Flask web server
3. [Create React App](https://github.com/facebook/create-react-app)-based frontend website

There is some one-time setup you'll need to perform initially, as well as some recurring setup
every time you want to run the service.

Note: The following instructions have only been tested on macOS.

### Initial Setup
### Initial setup

The first step is to clone the repo and move into the created directory:

@@ -24,40 +24,53 @@ $ git clone git@github.com:jwngr/sdow.git
$ cd sdow/
```

Several global dependencies are required to run the service. Since installation instructions vary
and are decently documented for each project, please refer to the links below on how to install them.
Several dependencies are required to run the service:
1. [`sqlite3`](https://docs.python.org/3/library/sqlite3.html) - Data storage
1. [`nvm`](https://github.com/nvm-sh/nvm) - Manage Node and `npm` versions
1. [`pyenv`](https://github.com/pyenv/pyenv) - Manage Python and `pip` versions
1. [`virtualenv`](https://virtualenv.pypa.io/) - Avoid polluting global environment

1. [Python](https://www.python.org/downloads/) - macOS comes with an older `2.x` version of Python,
but I recommend using [`pyenv`](https://github.com/pyenv/pyenv) to install the latest `2.x`
release.
1. [`pip`](https://pip.pypa.io/en/stable/installing/) - Most recent versions of Python ship with
`pip`
1. [`sqlite3`](https://docs.python.org/3/library/sqlite3.html) - Can be installed via `brew install sqlite3`.
1. [`virtualenv`](https://virtualenv.pypa.io/en/stable/installation/) - Helps avoid polluting your
global environment.
The simplest way to install these is via [Homebrew](https://brew.sh):

Once all the required global dependencies above are installed, run the following commands to get
everything set up:
```bash
## Install SQLite.
$ brew install sqlite

## Install nvm (Node + npm).
$ brew install nvm
$ nvm install node

## Install + configure pyenv (Python + pip).
$ brew install xz
$ brew install pyenv
# Also configure the pyenv path using the instructions in the link above.
$ pyenv install 3

## Install + configure virtualenv.
$ python -m pip install --user virtualenv
# Also configure the virtualenv path using the instructions in the link above.
```
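Since this commit upgrades the project to Python 3, it can be worth sanity-checking that the `pyenv`-installed interpreter is the one that is active before continuing (a minimal sketch, not part of the project's scripts):

```python
import sys

# The project now targets Python 3; fail fast if an old interpreter is active.
assert sys.version_info.major == 3, f"Expected Python 3, got {sys.version}"
print(f"Using Python {sys.version_info.major}.{sys.version_info.minor}")
```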

Once the required global dependencies are installed, install the project dependencies and generate
a mock local database:

```bash
# Run from root of repo.
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
$ python scripts/create_mock_databases.py
$ cp sdow.sqlite sdow/
$ cd website/
$ npm install
$ cd ..
```

### Recurring Setup
### Recurring setup

Every time you want to run the service, you need to source your environment, start the backend Flask
app, and start the frontend website. Run the backend and frontend apps in separate tabs.

To run the backend, open a new tab and run the following commands from the repo root:

```bash
# Run from root of repo.
$ source env/bin/activate
$ cd sdow/
$ export FLASK_APP=server.py FLASK_DEBUG=1
@@ -71,24 +84,24 @@ $ cd website/
$ npm start
```

The service should be running at http://localhost:3000.
The service can be found at http://localhost:3000.

## Repo Organization
## Repo organization

Here are some highlights of the directory structure and notable source files:

- `.github/` - Contribution instructions as well as issue and pull request templates.
- `config/` - Configuration files for services like NGINX, Gunicorn, and Supervisord.
- `docs/` - Documentation.
- `.github/` - Contribution instructions as well as issue and pull request templates
- `config/` - Configuration files for services like NGINX, Gunicorn, and Supervisord
- `docs/` - Documentation
- `scripts/` - Scripts to do things like create a new version of the SDOW database, create a mock
- `sdow/` - The Python Flask web server.
- `server.py` - Main entry point which initializes the Flask web server.
- `database.py` - Defines a `Database` class which simplifies querying the SDOW SQLite database.
- `breadth_first_search.py` - The main search algorithm which finds the shortest path between pages.
- `helpers.py` - Miscellaneous helper functions and classes.
- `sketch/` - Sketch logo files.
- `sql/` - SQLite table schemas.
- `website/` - The frontend website, based on [Create React App](https://github.com/facebook/create-react-app).
- `.pylintrc` - Default configuration for `pylint`.
- `requirements.txt` - Requirements specification for installing project dependencies via `pip`.
- `setup.cfg` - Python PEP 8 autoformatting rules.
- `sdow/` - The Python Flask web server
- `server.py` - Main entry point which initializes the Flask web server
- `database.py` - Defines a `Database` class which simplifies querying the SDOW SQLite database
- `breadth_first_search.py` - The main search algorithm which finds the shortest path between pages
- `helpers.py` - Miscellaneous helper functions and classes
- `sketch/` - Sketch logo files
- `sql/` - SQLite table schemas
- `website/` - The frontend website, based on [Create React App](https://github.com/facebook/create-react-app)
- `.pylintrc` - Default configuration for `pylint`
- `requirements.txt` - Requirements specification for installing project dependencies via `pip`
- `setup.cfg` - Python PEP 8 autoformatting rules
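The shortest-path search lives in `breadth_first_search.py`, which this diff does not show. As a rough illustration of the technique its name describes (a sketch over a toy link graph, not the project's actual implementation), a minimal breadth-first search might look like:

```python
from collections import deque

def shortest_path(links, source, target):
    """Return one shortest path from source to target, or None if unreachable.

    `links` maps a page ID to the list of page IDs it links to.
    """
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for neighbor in links.get(page, ()):
            if neighbor not in parents:
                parents[neighbor] = page
                if neighbor == target:
                    # Walk the parent pointers back to the source.
                    path = [neighbor]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None

links = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
print(shortest_path(links, 1, 5))  # → [1, 2, 4, 5]
```

The real implementation searches from both endpoints of the query at once, but the parent-pointer bookkeeping shown here is the core of any BFS-based path reconstruction.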
14 changes: 7 additions & 7 deletions docs/data-source.md
@@ -1,6 +1,6 @@
# Data Source | Six Degrees of Wikipedia

## Table of Contents
## Table of contents

- [Data Source](#data-source)
- [Get the Data Yourself](#get-the-data-yourself)
@@ -9,7 +9,7 @@
- [Historical Search Results](#historical-search-results)
- [Database Creation Process](#database-creation-process)

## Data Source
## Data source

Data for this project comes from Wikimedia, which creates [gzipped SQL dumps of the English language
Wikipedia database](https://dumps.wikimedia.your.org/enwiki) twice monthly. The Six Degrees of
@@ -29,7 +29,7 @@ For performance reasons, files are downloaded from the
Six Degrees of Wikipedia only deals with actual Wikipedia pages, which in Wikipedia parlance means
pages which belong to [namespace](https://en.wikipedia.org/wiki/Wikipedia:Namespace) `0`.

## Get the Data Yourself
## Get the data yourself

Compressed versions of the Six Degrees of Wikipedia SQLite database (`sdow.sqlite.gz`) are available
for download from ["requester pays"](https://cloud.google.com/storage/docs/requester-pays) Google
@@ -95,7 +95,7 @@ $ pigz -d sdow.sqlite.gz
- `gs://sdow-prod/dumps/20231220/sdow.sqlite.gz` (4.3 GB)
</details>

## Database Schema
## Database schema

The Six Degrees of Wikipedia database is a single SQLite file containing the following three tables:

@@ -114,7 +114,7 @@ The Six Degrees of Wikipedia database is a single SQLite file containing the fol
1. `source_id` - The page ID of the source page, the page that redirects to another page.
2. `target_id` - The page ID of the target page, to which the redirect page redirects.
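The two redirect columns above can be exercised with a small in-memory sketch (the `CREATE TABLE` statement here is hypothetical and only mirrors the two columns described, not the project's actual schema in `sql/`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical mirror of the redirects table described above.
conn.execute("CREATE TABLE redirects (source_id INTEGER PRIMARY KEY, target_id INTEGER)")
conn.execute("INSERT INTO redirects VALUES (?, ?)", (123, 456))

# Resolve a redirect: follow source_id to target_id.
row = conn.execute("SELECT target_id FROM redirects WHERE source_id = ?", (123,)).fetchone()
print(row[0])  # → 456
```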

## Historical Search Results
## Historical search results

Historical search results are stored in a separate SQLite database (`searches.sqlite`) which
contains a single `searches` table with the following schema:
@@ -134,7 +134,7 @@ as well as to make it easy to update the `sdow.sqlite` database to a more recent
Historical search results are not available for public download, but they are not required to run
this project yourself.

## Database Creation Script
## Database creation script

A new build of the Six Degrees of Wikipedia database is created using the [database creation shell
script](../scripts/buildDatabase.sh):
@@ -150,7 +150,7 @@ by passing the date of the dump in the format `YYYYMMDD` as a command line argum
$ ./buildDatabase.sh <YYYYMMDD>
```

## Database Creation Process
## Database creation process

Generating the Six Degrees of Wikipedia database from a dump of Wikipedia takes approximately two
hours given the following instructions:
6 changes: 3 additions & 3 deletions docs/miscellaneous.md
@@ -1,11 +1,11 @@
# Miscellaneous | Six Degrees of Wikipedia

## Table of Contents
## Table of contents

* [Noteworthy Searches](#noteworthy-searches)
* [Edge Case Page Titles](#edge-case-page-titles)

## Noteworthy Searches
## Noteworthy searches

The following is a list of noteworthy searches:

@@ -19,7 +19,7 @@ The following is a list of noteworthy searches:
| [Lion Express → Phinney](https://www.sixdegreesofwikipedia.com/?source=Lion%20Express&target=Phinney) | 9 degrees of separation! |
| [2016 French Open → Brachmia melicephala](https://www.sixdegreesofwikipedia.com/?source=2016%20French%20Open&target=Brachmia%20melicephala) | Sparse graph of 6 degrees |

## Edge Case Page Titles
## Edge case page titles

The following is a collection of edge-case page titles, mainly used to ensure the project works given a
wide variety of inputs:
12 changes: 6 additions & 6 deletions docs/web-server-setup.md
@@ -1,13 +1,13 @@
# Web Server Setup | Six Degrees of Wikipedia
# Web server setup | Six Degrees of Wikipedia

## Table of Contents
## Table of contents

- [Initial Setup](#initial-setup)
- [Recurring Setup](#recurring-setup)
- [Updating Data Source](#updating-data-source)
- [Updating Server Code](#updating-server-code)

## Initial Setup
## Initial setup

1. Create a new [Google Compute Engine instance](https://console.cloud.google.com/compute/instances?project=sdow-prod)
from the `sdow-web-server` instance template, which is configured with the following specs:
@@ -208,7 +208,7 @@
$ sudo service stackdriver-agent start
```

## Recurring Setup
## Recurring setup

1. Activate the `virtualenv` environment:

@@ -243,7 +243,7 @@
`gunicorn` is written to `/tmp/gunicorn-stdout---supervisor-<HASH>.log`. Logs are also written to
Stackdriver Logging.

## Updating Data Source
## Updating data source

To update the web server to a more recent `sdow.sqlite` file with minimal downtime, run the
following commands after SSHing into the web server:
@@ -258,7 +258,7 @@ $ cd config/
$ supervisorctl restart gunicorn
```

## Updating Server Code
## Updating server code

To update the Python server code which powers the SDOW backend, run the following commands after
SSHing into the web server:
18 changes: 9 additions & 9 deletions requirements.txt
@@ -1,10 +1,10 @@
flask == 2.3.2
flask-compress == 1.4.0
flask-cors == 3.0.9
litecli == 1.2.0
google-cloud-logging == 1.14.0
flask == 3.0.3
flask-compress == 1.15.0
flask-cors == 4.0.1
litecli == 1.11.0
google-cloud-logging == 3.10.0
google-compute-engine == 2.8.13
gunicorn == 19.9.0
protobuf == 3.18.3
requests == 2.31.0
supervisor == 4.1.0
gunicorn == 22.0.0
protobuf == 4.25.3
requests == 2.32.3
supervisor == 4.2.5
2 changes: 0 additions & 2 deletions scripts/combine_grouped_links_files.py
@@ -4,8 +4,6 @@
Output is written to stdout.
"""

from __future__ import print_function

import io
import sys
import gzip
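The `from __future__ import print_function` lines removed across these scripts were only needed on Python 2, where `print` was a statement; on Python 3, `print` is an ordinary built-in function, so the import is dead weight. A small sketch of the function-style usage these scripts rely on (the `format_row` helper is hypothetical, not from the repo):

```python
def format_row(page_id, page_title):
    # Tab-separated, matching the stdout format these scripts emit.
    return f"{page_id}\t{page_title}"

# print() is a plain function on Python 3: keyword arguments like
# sep, end, and file work without any __future__ import.
print(format_row("123", "Some_Title"))
print("a", "b", sep="\t")
```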
2 changes: 0 additions & 2 deletions scripts/create_mock_databases.py
@@ -1,5 +1,3 @@
from __future__ import print_function

import os
import sqlite3
import subprocess
2 changes: 0 additions & 2 deletions scripts/generate_updated_wikipedia_facts.py
@@ -5,8 +5,6 @@
Generates an updated Wikipedia facts JSON file.
"""

from __future__ import print_function

import os
import json
import sqlite3
2 changes: 0 additions & 2 deletions scripts/lookup_wikipedia_page_info.py
@@ -5,8 +5,6 @@
Looks up Wikipedia page information via the official Wikipedia API given a list of page IDs.
"""

from __future__ import print_function

import requests

WIKIPEDIA_API_URL = 'https://en.wikipedia.org/w/api.php'
3 changes: 0 additions & 3 deletions scripts/prune_pages_file.py
@@ -5,12 +5,9 @@
Output is written to stdout.
"""

from __future__ import print_function

import io
import sys
import gzip
from sets import Set

# Validate input arguments.
if len(sys.argv) < 3:
5 changes: 1 addition & 4 deletions scripts/replace_titles_and_redirects_in_links_file.py
@@ -5,12 +5,9 @@
Output is written to stdout.
"""

from __future__ import print_function

import io
import sys
import gzip
from sets import Set

# Validate inputs
if len(sys.argv) < 4:
@@ -35,7 +32,7 @@
sys.exit()

# Create a set of all page IDs and a dictionary of page titles to their corresponding IDs.
ALL_PAGE_IDS = Set()
ALL_PAGE_IDS = set()
PAGE_TITLES_TO_IDS = {}
for line in io.BufferedReader(gzip.open(PAGES_FILE, 'r')):
[page_id, page_title, _] = line.rstrip('\n').split('\t')
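The `from sets import Set` removals reflect another Python 2-ism: the `sets` module was deprecated in Python 2.6 and removed in Python 3, where the built-in `set` type replaces it. A minimal sketch of the replacement pattern applied above:

```python
# Python 2 (the sets module was removed in Python 3):
#   from sets import Set
#   ALL_PAGE_IDS = Set()
# Python 3: the built-in set type needs no import.
ALL_PAGE_IDS = set()
ALL_PAGE_IDS.add("123")

print("123" in ALL_PAGE_IDS)  # → True
```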
5 changes: 1 addition & 4 deletions scripts/replace_titles_in_redirects_file.py
@@ -4,12 +4,9 @@
Output is written to stdout.
"""

from __future__ import print_function

import io
import sys
import gzip
from sets import Set

# Validate input arguments.
if len(sys.argv) < 3:
@@ -29,7 +26,7 @@
sys.exit()

# Create a set of all page IDs and a dictionary of page titles to their corresponding IDs.
ALL_PAGE_IDS = Set()
ALL_PAGE_IDS = set()
PAGE_TITLES_TO_IDS = {}
for line in io.BufferedReader(gzip.open(PAGES_FILE, 'r')):
[page_id, page_title, _] = line.rstrip('\n').split('\t')
4 changes: 2 additions & 2 deletions sdow/database.py
@@ -4,8 +4,8 @@

import os.path
import sqlite3
import helpers as helpers
from breadth_first_search import breadth_first_search
import sdow.helpers as helpers
from sdow.breadth_first_search import breadth_first_search


class Database(object):
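The import change in `sdow/database.py` is needed because Python 3 removed implicit relative imports: inside a package, `import helpers` no longer finds a sibling `helpers.py`; the module must be addressed by its package-qualified name, as in `import sdow.helpers`. A self-contained sketch of the same rule using a throwaway package (the `demo_pkg` name and `fetch_page_title` helper are hypothetical):

```python
import os
import sys
import tempfile

# Build a throwaway package with a sibling helpers module.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "demo_pkg")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "helpers.py"), "w") as f:
    f.write("def fetch_page_title():\n    return 'Six_Degrees_of_Wikipedia'\n")

sys.path.insert(0, pkg_root)

# Python 3 style: package-qualified import, as in `import sdow.helpers`.
import demo_pkg.helpers

print(demo_pkg.helpers.fetch_page_title())  # → Six_Degrees_of_Wikipedia
```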
