Media Cloud Metadata Extractor

🚧 under construction 🚧

This package extracts the domain, title, publication date, text, and language from the URL or text of an online news story. The methods for each are taken from the larger Media Cloud project and build on numerous third-party libraries. The metadata extracted includes:

the original URL of publication
a normalized URL useful for de-duplication
the canonical domain published on
the date of publication
the primary language used in the article text
the title of the article
a normalized title useful for de-duplication
the text content of the news article
the name of the library used to extract the article content
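As a rough illustration of why the normalized URL is handy, here is a minimal de-duplication sketch. It assumes each story's metadata is a dictionary and that the normalized URL is stored under a key named "normalized_url"; that key name, the "article_title" key, and the sample records are assumptions made for illustration, not output from the library.

```python
from typing import Dict, Iterable, List


def deduplicate_stories(stories: Iterable[Dict], key: str = "normalized_url") -> List[Dict]:
    """Keep the first story seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for story in stories:
        value = story.get(key)
        # Stories without the key cannot be compared, so keep them all.
        if value is None:
            unique.append(story)
            continue
        if value not in seen:
            seen.add(value)
            unique.append(story)
    return unique


# Hand-written records standing in for extracted metadata (not real output).
stories = [
    {"normalized_url": "my.awesome.news/story-path", "article_title": "A Story"},
    {"normalized_url": "my.awesome.news/story-path", "article_title": "A Story"},
    {"normalized_url": "my.awesome.news/other-path", "article_title": "Another Story"},
]
unique_stories = deduplicate_stories(stories)
print(len(unique_stories))
```

The same helper could be pointed at a normalized title instead by passing a different `key`.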

Installation

pip install mediacloud-metadata

Usage

If you pass in a URL, it will follow redirects and fetch the HTML for you.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
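Either call gives you the extracted metadata back. The sketch below shows one way to print a few fields from it, assuming the result behaves like a dictionary. The field names in ASSUMED_FIELDS and the summarize helper are hypothetical; check the keys of the object you actually get back before relying on them.

```python
from typing import Any, Dict, Iterable

# Field names are assumptions for illustration; inspect metadata.keys()
# to see what the installed version really returns.
ASSUMED_FIELDS = ("canonical_domain", "publication_date", "language", "article_title")


def summarize(metadata: Dict[str, Any], fields: Iterable[str] = ASSUMED_FIELDS) -> str:
    """Return one 'name: value' line per field, marking absent fields."""
    lines = []
    for name in fields:
        value = metadata.get(name, "<missing>")
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Hand-written stand-in for a result, so the sketch runs without network access.
sample = {
    "canonical_domain": "my.awesome.news",
    "language": "en",
    "article_title": "my webpage",
}
print(summarize(sample))
```

With the package installed, the dictionary returned by extract(url=...) could be passed to summarize in place of the sample.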

Development

If you are interested in adding code to this module, first clone the GitHub repository.

Installing

flit install
pre-commit install

Testing

pytest

Distributing a New Version

Run pytest to make sure all the tests pass
Update the version number in pyproject.toml
Make a brief note in the CHANGELOG.md about what changed
Commit the changes
Tag the commit with a semantic version number - v*.*.*
Push the repo to GitHub
Run flit build to create an install package
Run flit publish to upload it to PyPI
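One easy slip in the steps above is tagging a commit with a version that differs from the one written in pyproject.toml. The check below is a small sketch of how that could be caught before publishing. The function name is hypothetical, and the pattern is deliberately stricter than full semantic versioning (it accepts only three dot-separated numbers, with no pre-release suffix).

```python
import re

# Three dot-separated non-negative integers, e.g. "1.4.0".
SEMVER = re.compile(r"\d+\.\d+\.\d+")


def tag_matches_version(tag: str, version: str) -> bool:
    """True when `version` looks like X.Y.Z and `tag` is exactly 'v' + version."""
    return SEMVER.fullmatch(version) is not None and tag == f"v{version}"


print(tag_matches_version("v1.2.3", "1.2.3"))
print(tag_matches_version("1.2.3", "1.2.3"))
```

Here `version` is the string you put in pyproject.toml and `tag` is the git tag you are about to push.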

Test Cache

Tests are run against fixtures by default. To change this, pass '--use-cache=False' when running the tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.
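For readers unfamiliar with custom pytest flags, this is a generic sketch of how an option like '--use-cache' can be registered in a conftest.py. It is not a copy of this repository's conftest.py: the parse_flag and use_cache helpers, the default value, and the help text are all assumptions.

```python
def parse_flag(raw: str) -> bool:
    """Interpret a command-line string such as 'False' or 'true' as a boolean."""
    return raw.strip().lower() not in ("false", "0", "no")


def pytest_addoption(parser):
    # Hook that pytest calls while building its command line; registers the flag.
    parser.addoption(
        "--use-cache",
        action="store",
        default="True",
        help="set to False to stop using the stored fixtures",
    )


def use_cache(config) -> bool:
    # `config` is the pytest config object, e.g. request.config inside a fixture.
    return parse_flag(config.getoption("--use-cache"))
```

A fixture or test can then call use_cache(request.config) to decide where its web content comes from.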

Contributors

Created as part of the Media Cloud Project. Contributors include:

Rahul Bhargava (Media Cloud, Northeastern University)
Paige Gulley (Media Cloud)
Phil Budne (Media Cloud)
Vangelis Banos (Internet Archive)
