Newspaper Crawler is a project for the Web Data Mining course at HCMUS.
- Crawler with Scrapy
- Natural Language Processing (NLP)
No. | Student ID | Full name |
---|---|---|
1 | 1760096 | Nguyễn Vũ Linh |
2 | 1760361 | Vũ Văn Lương |
3 | 1760438 | Nguyễn Hoàng Thức |
This project uses:
- Scrapy - A fast and powerful scraping and web crawling framework.
- Newspaper3k - Article scraping & curation.
- Validators - Python Data Validation for Humans™.
You can get this project from the command line with:
$ pip install git+https://github.com/nvlinh99/newspaper-crawler.git
After that, you will see the following files and folders:
crawler
nlp
List-newspaper.txt
Next, install the required packages:
1. pip install newspaper3k
2. pip install validators
- Crawler:
The command prompts you for a URL to crawl; you can use any URL listed in List-newspaper.txt. The data will be saved in a folder named N06.
$ cd crawler
$ cd KTW06
$ scrapy crawl threads
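As a rough illustration of the step above, the sketch below checks that an entered URL looks valid and derives a per-site output folder under N06. It is not the project's code: the helper names `is_probable_url`, `site_name`, and `output_dir` are hypothetical, the standard-library `urllib.parse` stands in for the Validators package, and the `N06/<site>/<topic>` layout is an assumption inferred from the example input path in the NLP section.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_probable_url(url: str) -> bool:
    """Return True if the text has an http(s) scheme and a host part."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def site_name(url: str) -> str:
    """Derive a short site name from the host (hypothetical naming rule)."""
    host = urlparse(url.strip()).hostname or ""
    labels = [p for p in host.split(".") if p and p != "www"]
    return labels[0] if labels else "unknown"


def output_dir(url: str, topic: str, root: str = "N06") -> PurePosixPath:
    """Build the assumed folder layout: <root>/<site>/<topic>."""
    return PurePosixPath(root) / site_name(url) / topic


if __name__ == "__main__":
    url = "https://www.nytimes.com/section/business"
    if is_probable_url(url):
        print(output_dir(url, "Business"))
```

Rejecting malformed input before starting the crawl avoids creating empty output folders for URLs that cannot be fetched.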
- NLP
Create a folder to contain results
$ cd ..
$ cd ..
The script prompts for an input path and an output path. For the input path, enter the folder of a topic that was crawled earlier. Example:
$ python N06_NLP.py
$ pathIn = D:\...\crawler\KTW06\N06\nytimes\Business
$ pathOut = D:\...\nlp\output
All results will be saved in the output folder.
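The input-folder to output-folder flow described above can be sketched as follows. This is only an illustration under stated assumptions: the processing done by N06_NLP.py is not described here, so simple word counting is used as a placeholder, the crawled articles are assumed to be `.txt` files, and the function names `top_words` and `process_folder` are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def top_words(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Placeholder analysis: the n most frequent lowercase words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


def process_folder(path_in, path_out, n: int = 5) -> list[Path]:
    """Read every .txt file in path_in and write one result file to path_out."""
    src, dst = Path(path_in), Path(path_out)
    dst.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    written = []
    for article in sorted(src.glob("*.txt")):
        counts = top_words(article.read_text(encoding="utf-8"), n)
        target = dst / article.name
        lines = (f"{word}\t{count}" for word, count in counts)
        target.write_text("\n".join(lines), encoding="utf-8")
        written.append(target)
    return written
```

Calling `process_folder(path_in, path_out)` with the two paths entered at the prompts would produce one result file per article in the output folder.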