macunha1/newspaper3kli

Newspaper3kli

Newspaper3kli stands for the "kommand-line" interface over Newspaper3k.

A tiny layer on top of Newspaper3k with support for Unix-like executions and parallelism (using asyncio) to download articles in bulk faster.

Requirements

In addition to the listed requirements, make sure NLTK's punkt package is installed (via nltk.download() in interactive Python, or nltk.download('punkt') directly) so that Newspaper3k's article.nlp() works properly.

Installation

# assuming pip points to Python 3 on your OS (otherwise use pip3)
pip install newspaper3kli==0.1.0

Usage

Overview of available parameters

usage: newspaper3kli [-h] [-o OUTPUT] [-u] [--keep-html] [urls [urls ...]]

positional arguments:
  urls                  URL to download content from (single download)

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output path to store the results
  -u, --disable-verify-ssl
                        Flag to disable SSL certificate verification.
  --keep-html           Flag to save content with HTML..

Executing

Passing URLs from the terminal

newspaper3kli https://hello.world/article/2020 \
    https://hello.world/article/2019

Reading from a txt file

TXT is the simplest file format for reading with Newspaper3kli.

Assuming the txt file has the following content (line-delimited URLs):

https://hello.world/article/2020
https://hello.world/article/2019

pipe it into newspaper3kli:

cat /path/to/this/file.txt | newspaper3kli
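
A hand-maintained URL list may contain blank lines, comments, Windows line endings, or duplicates. The sketch below is one possible way to tidy such a file before piping it in. clean_urls is a hypothetical helper, not part of Newspaper3kli, and treating lines that start with # as comments is an assumption of this example.

```shell
# Hypothetical helper (not part of Newspaper3kli): read URLs from stdin,
# strip carriage returns, drop blank lines and "#" comment lines, and
# remove duplicates while keeping the original order.
clean_urls() {
    tr -d '\r' | awk 'NF && $0 !~ /^#/ && !seen[$0]++'
}

# Usage:
#   clean_urls < /path/to/this/file.txt | newspaper3kli
```

The filtering happens entirely before the pipe, so Newspaper3kli still receives one URL per line on standard input.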

Reading from a CSV file

CSV parsing depends on a tool like awk or cut to split the columns.

Content sample

url,tags,date
https://hello.world/article/2020,some|thing,2020-01-01T00:00:00
https://hello.world/article/2019,some|thing,2019-01-01T00:00:00

Processing

# $1 is the URL column; change it to match your file
# NR > 1 skips the header row so that "url" is not passed along as a URL
cat /path/to/this/file.csv | awk -F, 'NR > 1 { print $1 }' | newspaper3kli

For any other character-delimited content, simply change -F, (comma) to the desired delimiter, e.g. -F'\t' for TSV.
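
The column extraction described above can be wrapped in a small function that takes the delimiter and the column number as parameters. This is only a sketch: extract_urls is a hypothetical name, it assumes the first line of the file is a header, and it splits naively on the delimiter, so quoted fields that contain the delimiter are not handled.

```shell
# Hypothetical helper (not part of Newspaper3kli): print one column of a
# delimited file read from stdin, skipping the header row and empty cells.
#   $1 = field separator (e.g. ',' or '\t')
#   $2 = 1-based number of the column holding the URLs
extract_urls() {
    awk -F "$1" -v col="$2" 'NR > 1 && $col != "" { print $col }'
}

# CSV with URLs in the first column:
#   extract_urls ',' 1 < /path/to/this/file.csv | newspaper3kli
# TSV with URLs in the second column:
#   extract_urls '\t' 2 < /path/to/this/file.tsv | newspaper3kli
```

For CSV files with quoted fields, a dedicated CSV parser is a safer choice than plain field splitting.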

Output path

When no path is given through the --output parameter, results are stored in the output directory inside Newspaper3kli's installation directory.

Files are named after the article and stored in pairs:

  • JSON for metadata;
  • HTML for content.

Credits

Thanks to dsynkov for the work on newspaper-bulk, the source of inspiration and of some code for this project.

