marinakiseleva/wiki-scraper: This web crawler uses Scrapy (Python) to crawl Wikipedia. It writes the page title, total word count, and page category to an Excel workbook (using openpyxl), so that the verbosity of articles can be analyzed by category.


Run the following commands to get set up:

OPTIONAL ----------------------
You can install the required libraries in a virtual environment, so that they are not installed globally:
$ virtualenv [whatever-name-you-choose]
$ cd [whatever-name-you-choose]
$ source bin/activate



RUNNING ----------------------
Install the required libraries:
$ pip install -r requirements.txt

Run the program:
$ scrapy crawl wikicrawler -a title="Black_hole" -a workbook="empty.xlsx"

If the workbook does not exist, the program will create it; otherwise it will add a new sheet to the existing workbook.
Make sure that title is a real Wikipedia page title, written exactly as it appears in the page URL, for example:
https://en.wikipedia.org/wiki/Black_hole
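
To illustrate how a title relates to its page URL, here is a minimal sketch using only the Python standard library. The helper name wiki_url is hypothetical and is not part of this repository; it assumes English Wikipedia and that spaces in a title are written as underscores, as in the URL above.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"


def wiki_url(title):
    """Build the English Wikipedia URL for a page title (hypothetical helper)."""
    # Wikipedia URLs use underscores in place of spaces.
    normalized = title.strip().replace(" ", "_")
    # Percent-encode any characters that are not safe in a URL path.
    return BASE_URL + quote(normalized, safe="_()',:")


print(wiki_url("Black hole"))
```

Opening the resulting URL in a browser is a quick way to confirm that the page exists before passing the title to the crawler.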




DEBUGGING ---------------------
If you get a WebDriverException from ChromeDriver, download the driver that matches the Chrome version you already have installed from http://chromedriver.storage.googleapis.com/index.html, and place it in this directory.
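
Before re-running the crawler, it may help to confirm that a driver binary can actually be found. The following is a rough sketch only: the function name find_chromedriver is hypothetical, and it assumes the binary is named chromedriver (or chromedriver.exe on Windows) and sits either in the given directory or somewhere on the PATH. It does not check that the driver version matches your Chrome version.

```python
import shutil
from pathlib import Path


def find_chromedriver(directory="."):
    """Return the path to a chromedriver binary, or None if none is found."""
    # Look in the given directory first, since that is where the driver
    # is expected to be placed.
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(directory) / name
        if candidate.is_file():
            return str(candidate.resolve())
    # Otherwise fall back to searching the PATH.
    return shutil.which("chromedriver")


print(find_chromedriver() or "chromedriver not found")
```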