GitHub - MohammadRaziei/liburlparser: Fastest domain extractor library written in C++ with python binding.

Fastest domain extractor library written in C++ with python binding.

The first complete library for parsing URLs in C++, in Python, and on the command line.

About The Project

liburlparser is a powerful domain extractor library written in C++ with Python bindings. It provides efficient URL parsing capabilities for both C++ and Python, making it a valuable tool for projects that involve working with web addresses.

Features

Here are some key features of liburlparser:

Multiple Language Support:
- liburlparser can be used in multiple programming languages, including Python, C++, and Shell.
- It offers an intuitive interface that remains consistent across both C++ and Python.
Clean Code Design:
- The library provides two separate classes: Url and Host.
- This separation allows for cleaner and more organized code when dealing with URLs.
Public Suffix List Support:
- liburlparser supports known combinatorial suffixes (e.g., "ac.ir") using the public_suffix_list.
- It can also handle unknown suffixes (e.g., "comm" in "google.comm").
Automatic Public Suffix List Updates:
- Before each build and deployment, liburlparser updates the public_suffix_list automatically.
Host Properties:
- The Host class includes properties such as subdomain, domain, domain name, and suffix.
URL Properties:
- The Url class provides properties like protocol, userinfo, host (and all host properties), port, path, query parameters, and fragment.
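
To make the suffix handling above concrete, here is a minimal pure-Python sketch of longest-suffix matching. It does not use liburlparser and is not its implementation: the `KNOWN_SUFFIXES` set is a tiny hand-written stand-in for the real public suffix list, and `split_host` with its dictionary keys is a hypothetical helper whose names may not match the library's properties.

```python
# Illustrative stand-in for the public suffix list (assumption, not the real list).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, name, and suffix (hypothetical helper)."""
    labels = host.lower().strip(".").split(".")
    # Unknown suffixes such as "comm" fall back to the last label.
    suffix_len = 1
    # Try the longest candidate first so that "ac.ir" wins over "ir".
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[-n:]) in KNOWN_SUFFIXES:
            suffix_len = n
            break
    rest = labels[:-suffix_len]
    return {
        "subdomain": ".".join(rest[:-1]),
        "name": rest[-1] if rest else "",
        "suffix": ".".join(labels[-suffix_len:]),
    }


print(split_host("ee.aut.ac.ir"))  # known combinatorial suffix "ac.ir"
print(split_host("google.comm"))   # unknown suffix "comm"
```

Checking the longest candidate first is what keeps "aut" from being mistaken for part of the suffix in "ee.aut.ac.ir".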

Usage

Command Line

python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as json
python -m liburlparser --host "mail.google.com" | jq # return as json

Python

liburlparser is designed to be intuitive to use.

Every class has a help section:

import liburlparser
help(liburlparser)
print(liburlparser.__version__)

from liburlparser import Url, Host
help(Url)
help(Host)

Parse a URL and a host:

from liburlparser import Url, Host
## parse url:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the url
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or 
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way for parsing host from url
# each of these returns a Host object with the host part of the url already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())

There are also some helper APIs that give better performance for small tasks:

# if you need to extract the host of url as a string without any parsing 
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
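
For readers who want to see what "extracting the host as a string" amounts to, the following standard-library sketch does the same kind of extraction with `urllib.parse`. It is only a point of comparison: `extract_host` here is a local function written for this example, not liburlparser's `Url.extract_host`, and it makes no claim to match the library's speed or edge-case behaviour.

```python
from urllib.parse import urlsplit


def extract_host(url: str) -> str:
    """Return the host part of a URL as a plain string, or "" if there is none."""
    # urlsplit(...).hostname drops any userinfo and port and lowercases the host.
    return urlsplit(url).hostname or ""


print(extract_host("https://ee.aut.ac.ir/about"))
```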

If you are a fan of pydomainextractor, liburlparser offers a similar interface:

import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url

# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the api is the same

C++

There are some examples in the examples folder.

#include "urlparser.h"
...
/// for parsing url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing host (three alternatives; pick one)
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host host = url.host();
// or
TLD::Host host = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");

As you can see, the methods available in Python can be used just as easily in C++.

Installation

C++:

build steps:

git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] run tests:
make test
# [Optional] make documents:
make docs
# [Optional] Run examples:
./example
# Make install
sudo make install

Python and Command Line:

Note that Python >= 3.8 is required.

Installation

pip by pypi

pip install liburlparser

If you want to use psl.update to update the public suffix list, you must install the online version:

pip install "liburlparser[online]"

Or

pip by git

pip install git+https://github.com/mohammadraziei/liburlparser

Or

manually

git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser

Performance

Extract From Host

Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).

Library	Function	Time
liburlparser	liburlparser.Host	1.12s
PyDomainExtractor	pydomainextractor.extract	1.50s
publicsuffix2	publicsuffix2.get_sld	9.92s
tldextract	__call__	29.23s
tld	tld.parse_tld	34.48s

Extract From URL

Tests were run on a file containing 1 million random URLs (Mar. 13th, 2022).

Library	Function	Time
liburlparser	liburlparser.Host.from_url	2.10s
PyDomainExtractor	pydomainextractor.extract_from_url	2.24s
publicsuffix2	publicsuffix2.get_sld	10.84s
tldextract	__call__	36.04s
tld	tld.parse_tld	57.87s
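
The README does not describe the benchmark harness itself. As a rough idea of how timings like these could be collected, here is a minimal sketch, assuming a simple wall-clock loop over an in-memory list. `time_extractor` and `naive_last_two_labels` are hypothetical names written for this example; the stand-in extractor is deliberately naive and is not one of the libraries in the tables.

```python
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every input."""
    items = list(inputs)  # materialise first so reading the data is not timed
    start = time.perf_counter()
    for item in items:
        extract(item)
    return time.perf_counter() - start


def naive_last_two_labels(host: str) -> str:
    """Stand-in extractor: keeps the last two labels and ignores suffix rules."""
    return ".".join(host.split(".")[-2:])


hosts = ["ee.aut.ac.ir", "mail.google.com"] * 50_000
elapsed = time_extractor(naive_last_two_labels, hosts)
print(f"{len(hosts)} hosts in {elapsed:.3f}s")
```

To compare real libraries, each extractor function would be passed to `time_extractor` in turn with the same input list.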
