Named-entity recognition system for Slovenian political news.
bin
bot
.env
.gitignore
Experimenting.ipynb
README.md
aria_config
requirements.txt
scrapy.cfg
setup.py

Politiki NER

Named-entity recognition project for Slovenian political data.

Installation & development

# Python 2.7.6
mkvirtualenv --no-site-packages politiki
workon politiki
pip install --upgrade -r requirements.txt

Libraries and tools used

Preparing and scraping data

Scrape each portal manually, as shown below, or run the './bin/small_crawl.sh' script.

scrapy crawl delo -o data/urls/delo.csv -t csv -O --nolog

Combine the per-portal URL lists into one large list.

cat data/urls/*.csv | cut -d ',' -f1 | grep -v -e "url" | sort -u > data/lists/big.txt
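
The same combining step can also be sketched in Python, which may be easier to adapt than the shell pipeline. This is only an illustration under stated assumptions: the function name `combine_url_lists` is hypothetical, the URL is assumed to be the first CSV column, and the header cell is assumed to be exactly `url` (as the `grep` above suggests). It is written for Python 3, so it would need adjusting for the Python 2.7.6 environment described earlier.

```python
import csv
import glob
import os


def combine_url_lists(csv_pattern, out_path, header="url"):
    """Collect the first column of every matching CSV into one de-duplicated list.

    Hypothetical helper: the name, the arguments and the "url" header value
    are assumptions for illustration, not part of the project.
    """
    seen = set()
    urls = []
    for csv_path in sorted(glob.glob(csv_pattern)):
        with open(csv_path, newline="") as handle:
            for row in csv.reader(handle):
                if not row:
                    continue  # skip blank lines
                url = row[0].strip()
                # Skip empty cells and rows whose first cell is exactly the header.
                if not url or url == header:
                    continue
                # Keep one copy of each URL, in first-seen order.
                if url not in seen:
                    seen.add(url)
                    urls.append(url)

    # Create the output directory if the path has one and it is missing.
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    with open(out_path, "w") as handle:
        for url in urls:
            handle.write(url + "\n")
    return urls
```

It could be called as `combine_url_lists("data/urls/*.csv", "data/lists/big.txt")`. Unlike `grep -v -e "url"`, which drops every line that contains the substring "url", this sketch only skips rows whose first cell is exactly the header, so a URL that happens to contain "url" is kept.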

Use aria2 to download everything for offline processing.

aria2c --conf-path aria_config -i data/lists/big.txt

Author and credit