Merge pull request #2 from investigativedata/develop
v0.0.4
simonwoerpel committed Jan 4, 2024
2 parents 00cff3f + 9993611 commit 5872bee
Showing 40 changed files with 3,784 additions and 3,329 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.0.3
+current_version = 0.0.4
 commit = True
 tag = True
 message = 🔖 Bump version: {current_version} → {new_version}
18 changes: 18 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,18 @@
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    open-pull-requests-limit: 99
+    schedule:
+      interval: "daily"
+    target-branch: "develop"
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "daily"
+    target-branch: "develop"
+  - package-ecosystem: "docker"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+    target-branch: "develop"
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
redis_data*
.pypirc
# Byte-compiled / optimized / DLL files
__pycache__/
8 changes: 4 additions & 4 deletions .pre-commit-config.yaml
@@ -5,7 +5,7 @@
 # * Run "pre-commit install".
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.4.0
+    rev: v4.5.0
     hooks:
       - id: check-added-large-files
       - id: check-case-conflict
@@ -31,13 +31,13 @@ repos:
       - id: absolufy-imports

   - repo: https://github.com/pycqa/isort
-    rev: 5.12.0
+    rev: 5.13.2
     hooks:
       - id: isort
         args: ["--profile", "black"]

   - repo: https://github.com/psf/black
-    rev: 23.9.1
+    rev: 23.12.1
     hooks:
       - id: black

@@ -68,7 +68,7 @@ repos:
       - id: rst-inline-touching-normal

   - repo: https://github.com/python-poetry/poetry
-    rev: 1.6.0
+    rev: 1.7.0
     hooks:
       - id: poetry-check
       - id: poetry-lock
7 changes: 7 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,7 @@
+{
+    "python.testing.pytestArgs": [
+        "tests"
+    ],
+    "python.testing.unittestEnabled": false,
+    "python.testing.pytestEnabled": true
+}
62 changes: 34 additions & 28 deletions README.md
@@ -2,33 +2,44 @@

 # juditha

-A super-fast lookup service for canonical names based on redis and configurable upstream sources (currently [Aleph](https://docs.aleph.occrp.org/) and [Wikipedia](https://www.wikipedia.org/)).
+A super-fast lookup service for canonical names based on redis and configurable fallback upstream sources (currently [Aleph](https://docs.aleph.occrp.org/) and [Wikipedia](https://www.wikipedia.org/)).

-`juditha` wants to solve the noise/garbage problem occurring when working with [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition). Given the availability of huge lists of *known names*, once could canonize `ner`-results against this service to check if they are known.
+`juditha` wants to solve the noise/garbage problem occurring when working with [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition). Given the availability of huge lists of *known names*, such as company registries or lists of persons of interest, one could canonize `ner`-results against this service to check if they are known.

 The implementation uses a pre-populated redis cache which can fallback to other sources.

 ## quickstart

     pip install juditha

-## populate
+### start local redis

+    docker run -p 6379:6379 redis

+### populate

+    echo "Jane Doe\nAlice" | juditha load

+### lookup

-    echo "Jane Doe\nAlice" | juditha import
+    juditha lookup "jane doe"
+    "Jane Doe"

-## lookup
+To match more fuzzy, reduce the threshold (default 0.97):

-    juditha lookup Jane
-    "jane"
+    juditha lookup "doe, jane" --threshold 0.5
+    "Jane Doe"

 ## data import

 ### from ftm entities

-    cat entities.ftm.json | juditha import --from-entities
+    cat entities.ftm.json | juditha load --from-entities

 ### from anywhere

-    juditha import -i s3://my_bucket/names.txt
-    juditha import -i https://data.ftm.store/eu_authorities/entities.ftm.json --from-entities
+    juditha load -i s3://my_bucket/names.txt
+    juditha load -i https://data.ftm.store/eu_authorities/entities.ftm.json --from-entities

 ### a complete dataset or catalog

@@ -42,8 +53,9 @@ Following the [`nomenklatura`](https://github.com/opensanctions/nomenklatura) sp
 ```python
 from juditha import lookup

-assert lookup("jane") == "Jane"
-assert lookup("foo") is None
+assert lookup("jane doe") == "Jane Doe"
+assert lookup("doe, jane") is None
+assert lookup("doe, jane", threshold=0.5) == "Jane Doe"
 ```

 ## run as api
@@ -52,12 +64,20 @@ assert lookup("foo") is None

 ### api calls

-    curl -I "http://localhost:8000/Berlin"
+Just do head requests to check if a name is known:

+    curl -I "http://localhost:8000/jane%20doe"
     HTTP/1.1 200 OK

-    curl -I "http://localhost:8000/Bayern"
+    curl -I "http://localhost:8000/John"
     HTTP/1.1 404 Not Found

+An actual request returns the canonized name:

+    curl "http://localhost:8000/doe,%20jane?threshold=0.5"
+    Jane Doe


 ## settings

 set redis endpoint via environment variable:
@@ -102,20 +122,6 @@ j = Juditha("https://juditha.ftm.store")
 assert j.lookup("HIMATIC EXPLOTACIONES SL") is not None
 ```

-## fuzzy matching

-Optionally, search fuzzy. Fuzzyness is controlled by `FUZZY_SCORE` as a threshold (~0.9x) and activated by `FUZZY=true`.

-During import, this creates an additional reverted redis index based on value tokens and for lookups compares name candidates via `Levensthein`.

-Fuzzy matching is controlled via the api get parameter `fuzzy=true`

-    curl -I "http://localhost:8000/Brlin"
-    HTTP/1.1 404 Not Found

-    curl -I "http://localhost:8000/Brlin?fuzzy=true"
-    HTTP/1.1 200 OK

 ## the name

 **Juditha Dommer** was the daughter of a coppersmith and raised seven children, while her husband Johann Pachelbel wrote a *canon*.
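
The README change replaces the boolean `fuzzy` switch with a numeric similarity threshold (default 0.97). As a rough illustration of what such a threshold does, and not juditha's actual scoring, the sketch below ranks candidates with the standard library's `difflib.SequenceMatcher`. The names `normalize_name`, `sketch_lookup` and the in-memory `known` list are hypothetical stand-ins for the redis-backed store.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def normalize_name(name: str) -> str:
    """Casefold, replace punctuation with spaces, collapse whitespace (illustrative only)."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.casefold().split())


def sketch_lookup(name: str, known: List[str], threshold: float = 0.97) -> Optional[str]:
    """Return the best-scoring known name if its similarity reaches the threshold."""
    query = normalize_name(name)
    best, best_score = None, 0.0
    for candidate in known:
        # ratio() is a similarity in [0, 1]; this is an assumed metric,
        # not necessarily the one juditha uses internally.
        score = SequenceMatcher(None, query, normalize_name(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


known = ["Jane Doe", "Alice"]
```

With the strict default, only near-identical spellings pass; lowering `threshold` lets reordered or partially matching names through, at the cost of more false positives.
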
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-0.0.3
+0.0.4
17 changes: 12 additions & 5 deletions juditha/__init__.py
@@ -1,13 +1,20 @@
-from juditha.store import get_store, lookup
+from juditha.clean import normalize
+from juditha.settings import FUZZY_THRESHOLD
+from juditha.store import classify, get_store, lookup


 class Juditha:
     def __init__(self, url: str) -> "Juditha":
         self.store = get_store(juditha_url=url)

-    def lookup(self, value: str) -> str | None:
-        return self.store.lookup(value)
+    def lookup(
+        self, name: str, threshold: float | None = FUZZY_THRESHOLD
+    ) -> str | None:
+        return self.store.lookup(name, threshold=threshold)

+    def classify(self, name: str) -> str | None:
+        return self.store.classify(name)

-__version__ = "0.0.3"
-__all__ = ["lookup"]

+__version__ = "0.0.4"
+__all__ = ["lookup", "classify", "normalize"]
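
The wrapper class now forwards `threshold` to the store and delegates `classify` the same way. A minimal sketch of that delegation pattern follows. It assumes a hypothetical in-memory `FakeStore` (exact, case-insensitive matches only, ignoring the threshold) in place of whatever `get_store` returns, a hypothetical `Client` class mirroring the wrapper, and a placeholder `DEFAULT_THRESHOLD` set to the 0.97 default mentioned in the README.

```python
from typing import Dict, Optional

DEFAULT_THRESHOLD = 0.97  # assumption: mirrors the default stated in the README


class FakeStore:
    """Hypothetical stand-in for the real store; exact matches only."""

    def __init__(self, names: Dict[str, str]) -> None:
        # Maps casefolded name -> (canonical name, schema label).
        self._names = {
            name.casefold(): (name, schema) for name, schema in names.items()
        }

    def lookup(self, name: str, threshold: Optional[float] = DEFAULT_THRESHOLD) -> Optional[str]:
        # The threshold is accepted but unused in this simplified store.
        hit = self._names.get(name.casefold())
        return hit[0] if hit else None

    def classify(self, name: str) -> Optional[str]:
        hit = self._names.get(name.casefold())
        return hit[1] if hit else None


class Client:
    """Thin wrapper that only delegates, like the class in this diff."""

    def __init__(self, store: FakeStore) -> None:
        self.store = store

    def lookup(self, name: str, threshold: Optional[float] = DEFAULT_THRESHOLD) -> Optional[str]:
        return self.store.lookup(name, threshold=threshold)

    def classify(self, name: str) -> Optional[str]:
        return self.store.classify(name)


client = Client(FakeStore({"Jane Doe": "Person"}))
```

Keeping the wrapper free of matching logic means the threshold semantics live in one place, the store.
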
28 changes: 19 additions & 9 deletions juditha/api.py
@@ -1,7 +1,7 @@
 from fastapi import FastAPI, Response
 from fastapi.exceptions import HTTPException

-from juditha import __version__, lookup, settings
+from juditha import __version__, classify, lookup, settings

 app = FastAPI(
     debug=settings.DEBUG,
@@ -13,17 +13,27 @@
 )


+@app.get("/_classify/{q}")
+async def api_classify(q: str) -> Response:
+    schema = classify(q)
+    if schema is None:
+        return Response("404", status_code=404)
+    return Response(schema)


 @app.get("/{q}")
-async def api_lookup(q: str, fuzzy: bool | None = False) -> str:
-    value = lookup(q, fuzzy=fuzzy)
-    if value is None:
-        raise HTTPException(404)
-    return Response(value)
+async def api_lookup(
+    q: str, threshold: float | None = settings.FUZZY_THRESHOLD
+) -> Response:
+    name = lookup(q, threshold=threshold)
+    if name is None:
+        return Response("404", status_code=404)
+    return Response(name)


 @app.head("/{q}")
-async def api_head(q: str, fuzzy: bool | None = False) -> None:
-    value = lookup(q, fuzzy=fuzzy)
-    if value is None:
+async def api_head(q: str, threshold: float | None = settings.FUZZY_THRESHOLD) -> int:
+    name = lookup(q, threshold=threshold)
+    if name is None:
         raise HTTPException(404)
     return 200
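
For callers of these endpoints, the name goes into the URL path and `threshold` into the query string. The following client-side sketch uses only the standard library. The helper names `lookup_url` and `is_known` are hypothetical, and the base URL is the `localhost:8000` address from the README examples, assumed rather than guaranteed.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def lookup_url(base_url: str, name: str, threshold: Optional[float] = None) -> str:
    """Build the lookup URL; the name is percent-encoded as one path segment."""
    # safe="," keeps commas readable, as in the README's curl example;
    # "/" is encoded so a name can never span several path segments.
    url = f"{base_url.rstrip('/')}/{quote(name, safe=',')}"
    if threshold is not None:
        url += "?" + urlencode({"threshold": threshold})
    return url


def is_known(base_url: str, name: str, threshold: Optional[float] = None) -> bool:
    """Send a HEAD request: 200 means the name is known, 404 means it is not."""
    request = Request(lookup_url(base_url, name, threshold), method="HEAD")
    try:
        with urlopen(request) as response:
            return response.status == 200
    except HTTPError as error:
        if error.code == 404:
            return False
        raise  # any other status is treated as a real error
```

`is_known` needs a running service, so only `lookup_url` can be checked offline.
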