Open-source crawler for Persian websites. Websites crawled so far:
- `asriran/run_asriran.sh`

  You can change some parameters in this crawler; see `run_asriran.sh`. Due to some problems in crawling, I split this job into two stages: first crawl all the index pages, then use those pages for the actual crawling.
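The two-stage approach above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the base URL, the `?page=N` pagination scheme, and the `/news/<id>` link pattern are all assumptions.

```python
"""Sketch of the two-stage crawl: collect index pages first,
then crawl the article links found on them."""
import re
import urllib.request

BASE_URL = "https://www.asriran.com"  # hypothetical base URL


def build_index_urls(n_pages):
    """Stage 1: enumerate paginated index URLs (assumed ?page=N scheme)."""
    return [f"{BASE_URL}/archive?page={i}" for i in range(1, n_pages + 1)]


def extract_article_links(html):
    """Pull article hrefs out of an index page (assumed /news/<id> pattern)."""
    return sorted(set(re.findall(r'href="(/news/\d+)"', html)))


def crawl(n_index_pages):
    """Stage 2: fetch each index page, then fetch every article it lists."""
    articles = []
    for url in build_index_urls(n_index_pages):
        html = urllib.request.urlopen(url).read().decode("utf-8")
        for link in extract_article_links(html):
            articles.append(urllib.request.urlopen(BASE_URL + link).read())
    return articles


if __name__ == "__main__":
    crawl(2)
```

Splitting the job this way means a failure while fetching articles does not force re-crawling the index pages.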
- `wikipedia/run_wikipedia.sh`
- `tasnim/run_tasnim.sh`

  This crawler saves Tasnim news pages by category. It is appropriate for a text classification task, as the data is relatively balanced across all categories; I selected an equal number of pages per category. A parameter called `Number_of_pages` in `tasnim.py` controls how many pages to crawl in each category.
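The balanced per-category collection can be sketched like this. Only the parameter name `Number_of_pages` comes from the repository; the category list and the fetch helper are illustrative placeholders.

```python
"""Sketch of category-balanced crawling: take the same number of
pages from every category so the dataset stays balanced."""

Number_of_pages = 3  # pages to crawl per category, as in tasnim.py

CATEGORIES = ["politics", "sports", "culture"]  # hypothetical category slugs


def fetch_category_page(category, page):
    # Placeholder for the real HTTP fetch; returns a labeled record here.
    return {"category": category, "page": page}


def crawl_balanced():
    """Collect Number_of_pages pages from each category in turn."""
    dataset = []
    for category in CATEGORIES:
        for page in range(1, Number_of_pages + 1):
            dataset.append(fetch_category_page(category, page))
    return dataset
```

Because every category contributes exactly `Number_of_pages` pages, no class dominates the resulting training data.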
Datasets are all available for download on Kaggle.
CSS selectors are mostly extracted via Copy CSS Selector.