- Clone the source code into a local repository.
git clone http://www.github.com/jiulee-v/crawling
- Install requirements
pip install requests beautifulsoup4 tqdm
- In each directory, run crawl.py and filter.py to build the corpus (for the Wikipedia dump, run only filter.py).
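The run step above can be automated with a small driver. This is only a sketch under assumptions not confirmed by the repository: that crawl.py and filter.py sit directly inside each corpus directory, take no command-line arguments, and that crawl.py should run before filter.py. The helper names `corpus_dirs`, `plan_steps`, and `run_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Assumed order: crawl first, then filter the crawled output.
SCRIPTS = ("crawl.py", "filter.py")


def corpus_dirs(root: Path) -> list:
    """Return every directory under `root` that holds crawl.py or filter.py."""
    return sorted({p.parent for p in root.rglob("*.py") if p.name in SCRIPTS})


def plan_steps(directory: Path) -> list:
    """Return the scripts present in `directory`, in the assumed run order."""
    return [name for name in SCRIPTS if (directory / name).is_file()]


def run_all(root: Path) -> None:
    """Run the available scripts in each corpus directory, stopping on failure."""
    for directory in corpus_dirs(root):
        for script in plan_steps(directory):
            # Use the script's own directory as the working directory,
            # in case it reads or writes files by relative path.
            subprocess.run([sys.executable, script], cwd=directory, check=True)
```

Calling `run_all(Path("."))` from the repository root would process every directory in turn. A directory that has only filter.py, as described for the Wikipedia dump, simply skips the crawl step.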
The Military folder contains scripts for collecting Korean military data.
- Kook-bang Daily (News): 4B tokens
- Korea Ministry of Defence Official Blog (News & Magazine): 1B tokens (Tistory; through 2016.7)
- Naver Encyclopedia:
  - Arms and weapons: In progress
  - World of arms: In progress
  - Aircraft: In progress
  - Ships: In progress
  - Guns: In progress
  - Automobiles: In progress
  - KODEF Aircraft: In progress
- Wikipedia dump: In progress