jinulee-v/crawling

Crawling: Repository for various crawling scripts

How-to:

  1. Clone the source code into a local directory.
git clone http://www.github.com/jinulee-v/crawling
  2. Install the requirements.
pip install requests beautifulsoup4 tqdm
  3. In each directory, run crawl.py and then filter.py to build the corpus. For the Wikipedia dump, run only filter.py.
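
The last step can be sketched as a small helper that runs whichever of the two scripts a directory contains, crawl first and filter second. This is only an illustration, not part of the repository: the function names `scripts_to_run` and `run_pipeline` are hypothetical, and the sketch assumes each script runs without arguments from inside its own directory.

```python
import subprocess
import sys
from pathlib import Path

# Order matters: crawl first, then filter. A directory that has only
# filter.py (as described for the Wikipedia dump) simply skips crawl.py.
SCRIPT_ORDER = ("crawl.py", "filter.py")


def scripts_to_run(directory):
    """Return the scripts present in `directory`, in crawl-then-filter order."""
    directory = Path(directory)
    return [directory / name for name in SCRIPT_ORDER if (directory / name).is_file()]


def run_pipeline(directory):
    """Run each available script with the current interpreter, inside its directory.

    Assumption: the scripts take no command-line arguments.
    """
    for script in scripts_to_run(directory):
        # check=True raises CalledProcessError and stops if a script fails.
        subprocess.run([sys.executable, script.name], cwd=script.parent, check=True)
```

Calling `run_pipeline("path/to/directory")` on a directory that holds both scripts runs them in order; on a directory with only filter.py it runs just that one.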

Military

The Military folder contains scripts for collecting Korean military data.

  • Kook-bang Daily (News): 4B tokens
  • Korea Ministry of Defence Official Blog (News & Magazine): 1B tokens (Tistory; ~2016.7)
  • Naver Encyclopedia:
    • Arms and weapons: In progress
    • World of arms: In progress
    • Aircraft: In progress
    • Ships: In progress
    • Guns: In progress
    • Automobiles: In progress
    • KODEF Aircraft: In progress
  • Wikipedia dump: In progress

About

Various corpus crawling/filtering scripts.
