Newspaper Crawler

Newspaper Crawler is a project for the Web Data Mining course at HCMUS. It has two parts:

Crawler with Scrapy
Natural Language Processing (NLP)

Team information

No.	Student ID	Full name
1	1760096	Nguyễn Vũ Linh
2	1760361	Vũ Văn Lương
3	1760438	Nguyễn Hoàng Thức

Packages

The project uses:

Scrapy - A fast and powerful scraping and web crawling framework.
Newspaper3k - Article scraping & curation.
Validators - Python Data Validation for Humans™.

Installation

Clone this project from the command line with:

$ git clone https://github.com/nvlinh99/newspaper-crawler.git

After cloning, the project folder contains:

crawler
nlp
List-newspaper.txt

Then install the required packages (Scrapy itself is also needed if it is not already installed):

1. pip install newspaper3k
2. pip install validators
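
To confirm the installation before running anything, a small check like the following may help. It is only a sketch and not part of the project: the function name `missing_modules` is hypothetical, and the list assumes the three packages named under Packages. Note that newspaper3k is imported as `newspaper`.

```python
from importlib.util import find_spec

# Import names, not pip names: newspaper3k is imported as "newspaper".
REQUIRED_MODULES = ["scrapy", "newspaper", "validators"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If the script reports a missing module, install it with pip and run the check again.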

How to use

Crawler:
```
$ cd crawler
$ cd KTW06
$ scrapy crawl threads
```
The command prompts for a URL to crawl; you can use one from List-newspaper.txt. The crawled data is saved in a folder named N06.
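
Since the crawler expects a well-formed URL, it can be useful to filter the candidates first. The sketch below is an illustration only: it assumes List-newspaper.txt holds one URL per line (the file format is not documented here), and it uses the standard library's `urlparse` as a stand-in for the Validators package. The names `is_http_url` and `load_urls` are hypothetical.

```python
from urllib.parse import urlparse


def is_http_url(candidate):
    """Return True if the string looks like an absolute http(s) URL."""
    try:
        parts = urlparse(candidate.strip())
    except ValueError:
        # urlparse rejects a few malformed inputs outright.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def load_urls(text):
    """Keep only the lines of `text` that look like http(s) URLs."""
    return [line.strip() for line in text.splitlines() if is_http_url(line)]


if __name__ == "__main__":
    # Assumption: the file sits in the current folder, one URL per line.
    with open("List-newspaper.txt", encoding="utf-8") as handle:
        for url in load_urls(handle.read()):
            print(url)
```

Any URL it prints can be pasted at the crawler's prompt.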
NLP:
```
$ cd ..
$ cd ..
```
Create a folder to hold the results, then run the NLP script:
```
$ python N06_NLP.py
```
The script prompts for an input path and an output path. For the input, give the path to a topic folder that was crawled earlier. For example:
```
$ pathIn = D:\...\crawler\KTW06\N06\nytimes\Business
$ pathOut = D:\...\nlp\output
```
All results are saved in the output folder.
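
The two paths can be checked ahead of time with a few lines of Python. This is a minimal sketch, not code from N06_NLP.py: the helper name `prepare_paths` is hypothetical, and it assumes only that the input must be an existing folder and that the output folder may need to be created.

```python
from pathlib import Path


def prepare_paths(path_in, path_out):
    """Check that the input folder exists and create the output folder."""
    src = Path(path_in)
    if not src.is_dir():
        raise NotADirectoryError(f"Input folder not found: {src}")
    dst = Path(path_out)
    # Create the results folder (and any missing parents) if it is absent.
    dst.mkdir(parents=True, exist_ok=True)
    return src, dst
```

Calling `prepare_paths` with the values you plan to enter for pathIn and pathOut raises an error early if the crawled topic folder is missing, and otherwise leaves an output folder ready for the results.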
