# Crawler

A Wikipedia crawler that extracts information from actor and movie pages and stores it in a database.

## Install packages

It is recommended that you install all requirements in a virtual environment.

To create a virtual environment, cd into the project root directory and run

```
virtualenv venv
```

(The project is written in Python 3 and may not be compatible with Python 2, so make sure you create the virtualenv with a Python 3 interpreter.)
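If `python` on your system still points at Python 2, one way to guarantee a Python 3 environment (assuming a `python3` interpreter is on your `PATH`) is the standard library `venv` module, which works as an alternative to the `virtualenv` package:

```shell
# Create the environment with an explicit Python 3 interpreter;
# the built-in venv module ships with Python 3, no extra install needed.
python3 -m venv venv
```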

Then you can activate your virtual environment with

```
source venv/bin/activate
```

Install the packages via pip:

```
pip install -r requirements.txt
```

Then you can begin exploring the project. When you are done, deactivate the environment with

```
deactivate
```

## Initialize database

You need to initialize the database the first time you run the crawler. To do so, start a Python console and execute the following code:

```python
from database import init_db
init_db()
```

Then you can start running the spider with

```
python spider.py
```
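The internals of `spider.py` are not shown here, but at its core a Wikipedia crawler extracts fields from the fetched HTML of each page. As a self-contained sketch of that idea (using only the standard library, with an inline sample page instead of a live request; the real spider's parsing code may differ), this pulls the page title out of a Wikipedia-style `<h1>` heading:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <h1> element on the page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Keep only the first piece of text seen inside the <h1>.
        if self.in_h1 and self.title is None:
            self.title = data.strip()

# Sample HTML standing in for a fetched Wikipedia actor page.
page = '<html><body><h1 id="firstHeading">Morgan Freeman</h1></body></html>'
parser = TitleParser()
parser.feed(page)
print(parser.title)  # → Morgan Freeman
```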