nvlinh99/newspaper-crawler

Newspaper Crawler

Newspaper Crawler is a course project for the Web Data Mining subject at HCMUS. It covers:

  • Crawler with Scrapy
  • Natural Language Processing (NLP)

Team information

No.  Student ID  Full name
1    1760096     Nguyễn Vũ Linh
2    1760361     Vũ Văn Lương
3    1760438     Nguyễn Hoàng Thức

Packages

This project uses:

  • Scrapy - A fast and powerful scraping and web crawling framework.
  • Newspaper3k - Article scraping & curation.
  • Validators - Python Data Validation for Humans™.
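
A minimal sketch of how Validators and Newspaper3k might be combined, assuming a URL is checked before the article is fetched. The project's own code is not shown in this README, so the helper names `is_valid_url` and `extract_article` are hypothetical, and the `urllib` fallback exists only so the sketch still runs when `validators` is not installed.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True if the string looks like a crawlable http(s) URL."""
    try:
        import validators  # third-party: pip install validators
    except ImportError:
        # Fallback (an assumption of this sketch, not part of the project):
        # accept only http/https URLs that have a host part.
        parts = urlparse(url)
        return parts.scheme in ("http", "https") and bool(parts.netloc)
    # validators.url returns True on success and a falsy failure object otherwise.
    return bool(validators.url(url))


def extract_article(url: str) -> dict:
    """Download one article and return its title and body text."""
    if not is_valid_url(url):
        raise ValueError(f"Not a valid URL: {url!r}")
    # Imported lazily so URL validation works without newspaper3k installed.
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download()  # fetch the HTML over the network
    article.parse()     # extract title, text, authors, and so on
    return {"title": article.title, "text": article.text}
```

Calling `extract_article` with one of the URLs from List-newspaper.txt requires network access; an invalid URL is rejected before any request is made.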

Installation

Clone this project from the command line with:

$ git clone https://github.com/nvlinh99/newspaper-crawler.git

After cloning, the project folder contains:

crawler
nlp
List-newspaper.txt

Then install the required packages:

1. pip install newspaper3k
2. pip install validators

How to use

  • Crawler:
    $ cd crawler
    $ cd KTW06
    $ scrapy crawl threads
    The command prompts for a URL to crawl; you can use any URL from List-newspaper.txt. The crawled data is saved in a folder named N06.
  • NLP:
    $ cd ..
    $ cd ..
    Create a folder to hold the results, then run:
    $ python N06_NLP.py
    The script prompts for an input path and an output path. For the input, give the path to a topic folder that was crawled earlier. Example:
    pathIn = D:\...\crawler\KTW06\N06\nytimes\Business
    pathOut = D:\...\nlp\output
    All results are saved in the output folder.
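
The pathIn / pathOut flow can be pictured with the sketch below. This README does not show what N06_NLP.py actually computes, so the word-frequency step is a hypothetical stand-in, as are the function name `process_folder`, the assumption that crawled articles are stored as `.txt` files, and the `_top_words.txt` output naming.

```python
import re
from collections import Counter
from pathlib import Path


def process_folder(path_in: str, path_out: str) -> int:
    """Summarise every .txt file in path_in and write one result file per input.

    Returns the number of files processed.
    """
    src, dst = Path(path_in), Path(path_out)
    if not src.is_dir():
        raise NotADirectoryError(f"Input folder not found: {src}")
    # Create the results folder if it does not exist yet.
    dst.mkdir(parents=True, exist_ok=True)

    processed = 0
    for txt_file in sorted(src.glob("*.txt")):
        text = txt_file.read_text(encoding="utf-8")
        # Stand-in NLP step: lowercase word tokens and their frequencies.
        words = re.findall(r"\w+", text.lower())
        top_words = Counter(words).most_common(10)
        lines = [f"{word}\t{count}" for word, count in top_words]
        out_file = dst / f"{txt_file.stem}_top_words.txt"
        out_file.write_text("\n".join(lines), encoding="utf-8")
        processed += 1
    return processed
```

With this sketch, the two values typed at the prompts would be passed straight through, for example `process_folder(path_in, path_out)`, and each article in the topic folder would produce one result file in the output folder.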
