A very simple news crawler in Python. Developed at Humboldt University of Berlin.
Fundus is:

- A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code! Be it from live websites or the CC-NEWS dataset.
- An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!
To install from pip, simply do:

```bash
pip install fundus
```
Fundus requires Python 3.8+.
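To make sure the installation worked, a minimal smoke test is to import the package:

```python
# if this import succeeds, Fundus is installed correctly
import fundus
```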
Let's use Fundus to crawl 2 articles from publishers based in the US.
```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
That's already it!
If you run this code, it should print out something like this:
```console
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:   https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:  FreeBeacon (2023-05-11 18:41)
Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:   https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:  FoxNews (2023-05-09 14:37)
```
This printout tells you that you successfully crawled two articles!
For each article, the printout details:
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
- the news source it is "From"
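You can also access these fields programmatically instead of relying on the printout. Here is a minimal sketch; the attribute names (`title`, `plaintext`, `publishing_date`, `html.requested_url`) are assumptions based on the printout above, so consult Tutorial 3 ("The Article Class") for the exact interface:

```python
from fundus import PublisherCollection, Crawler

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles=1):
    # NOTE: attribute names are assumptions; see Tutorial 3 for the exact interface
    print(article.title)               # the headline ("Title")
    print(article.plaintext[:80])      # the main body text ("Text"), truncated here
    print(article.html.requested_url)  # the source URL ("URL")
    print(article.publishing_date)     # the publishing date shown in "From"
```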
Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
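You can also target several specific outlets at once by passing more than one publisher to the `Crawler`. A quick sketch; treat `WashingtonTimes` as an assumed identifier and see Tutorial 5 for how to look up the exact names:

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for two specific publishers
# NOTE: `WashingtonTimes` is an assumed identifier; see Tutorial 5 for how to
# search the publisher collection for exact names
crawler = Crawler(
    PublisherCollection.us.TheNewYorker,
    PublisherCollection.us.WashingtonTimes,
)

for article in crawler.crawl(max_articles=2):
    print(article)
```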
Fundus can also crawl articles from the CC-NEWS dataset instead of live websites. If you're not familiar with CC-NEWS, check out the paper.
```python
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
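Because CC-NEWS spans many years of data, restricting the crawl to a time window can save a lot of bandwidth. A hedged sketch; the `start`/`end` keyword arguments are an assumption based on Tutorial 2, so check there for the exact signature:

```python
from datetime import datetime

from fundus import PublisherCollection, CCNewsCrawler

# NOTE: `start`/`end` are assumed keyword arguments; see Tutorial 2
crawler = CCNewsCrawler(
    *PublisherCollection.us,
    start=datetime(2023, 1, 1),  # only consider CC-NEWS data from January 2023 ...
    end=datetime(2023, 3, 1),    # ... up to the beginning of March 2023
)

for article in crawler.crawl(max_articles=2):
    print(article)
```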
We provide quick tutorials to get you started with the library:
- Tutorial 1: How to crawl news with Fundus
- Tutorial 2: How to crawl articles from CC-NEWS
- Tutorial 3: The Article Class
- Tutorial 4: How to filter articles (see the quick sketch after this list)
- Tutorial 5: How to search for publishers
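Tutorial 4 covers Fundus's own filtering options; as a plain-Python sketch (not the library's filter API), you can always post-filter crawled articles yourself, e.g. by keyword, assuming the `title` attribute from above:

```python
from fundus import PublisherCollection, Crawler

crawler = Crawler(PublisherCollection.us)

# plain-Python post-filtering, NOT the library's filter API (see Tutorial 4):
# keep only articles whose headline mentions a keyword
for article in crawler.crawl(max_articles=10):
    if article.title and "election" in article.title.lower():
        print(article)
```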
If you wish to contribute, check out our contributor tutorials.
You can find the publishers currently supported here.
Also: adding a new publisher is easy, so consider contributing to the project!
Check out our evaluation benchmark. Scores are given as mean ± standard deviation:
| Scraper | Precision | Recall | F1-Score |
|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 |
| Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 |
| news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 |
| BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 |
Please cite the following paper when using Fundus or building upon our work:
```bibtex
@misc{dallabetta2024fundus,
    title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions},
    author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
    year={2024},
    eprint={2403.15279},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Please email your questions or comments to Max Dallabetta.
Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.