Named-entity recognition project for Slovenian political data.
# Python 2.7.6
mkvirtualenv --no-site-packages politiki
workon politiki
pip install --upgrade -r requirements.txt
Manually scrape each portal, or run the './bin/small_crawl.sh' script.
scrapy crawl delo -o data/urls/delo.csv -t csv -O --nolog
Combine URL lists into one huge list.
cat data/urls/*.csv | cut -d ',' -f1 | grep -v -e "url" | sort -u > data/lists/big.txt
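To see what the combining step does, here is a minimal sketch on made-up data. It assumes each CSV has a `url` header row and the URL in the first column. The temporary directory, file names and example.com URLs are hypothetical and not part of this project. This variant sorts before deduplicating, so a URL that appears in two different CSV files is kept exactly once.

```shell
# Hypothetical sample data in a temporary directory (not the project's real files).
tmp=$(mktemp -d)
mkdir -p "$tmp/urls"
printf 'url,title\nhttp://example.com/a,A\nhttp://example.com/b,B\n' > "$tmp/urls/one.csv"
printf 'url,title\nhttp://example.com/b,B\nhttp://example.com/c,C\n' > "$tmp/urls/two.csv"

# Keep the first column, drop lines that are exactly the header "url",
# then sort and keep one copy of each URL.
cut -d ',' -f1 "$tmp"/urls/*.csv | grep -v -x 'url' | sort -u > "$tmp/big.txt"

cat "$tmp/big.txt"
```

The exact-match `grep -v -x 'url'` only works if the header line is literally `url`. If the exported CSVs use CRLF line endings or quoted fields, the filter would need adjusting.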
Use aria2 to download everything for offline processing.
aria2c --conf-path aria_config -i data/lists/big.txt