python-pandas-matching

Python code that calls Interzoid's Generative AI-enriched matching APIs to generate similarity keys for organization names (including international names), which are then appended to a pandas data frame to create match reports.

This example shows how AI-enhanced similarity keys generated by Interzoid's APIs can identify inconsistent yet matching company or organization names, including international ones. Because AI models are used, the results go well beyond traditional string-matching techniques and extend to names written in other languages and character sets.

To see some examples of similarity keys with inconsistent data, see this blog entry.

To achieve this kind of matching in the code example, we will use Interzoid's Company & Organization Matching API. This is a scalar API, meaning we will call it once for each row we analyze. Since it is a JSON API, it can be used almost anywhere, making it easy to implement in this example.

Functionally, the name of an entity, such as an organization or company, is sent to the API from each row in a data frame. The API analyzes and processes the name using specialized algorithms, knowledge bases, machine learning techniques, and an AI language model. It responds with a generated similarity key, which is essentially a hashed canonical key that encapsulates the many variations the organization or company name could have within a dataset. Names can then be matched on the key despite differences in how they are actually written. Refer to the blog entry mentioned above to learn more about similarity keys.
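
The per-row flow described above can be sketched as follows. This is only an illustration: `add_simkeys` and `toy_key` are hypothetical names, and `toy_key` is a trivial stand-in for the API call (it just lowercases and strips non-alphanumeric characters), not Interzoid's algorithm.

```python
import pandas as pd


def add_simkeys(df, column, key_fn):
    """Return a copy of df with a 'simkey' column computed row by row."""
    out = df.copy()
    # key_fn is called once per value, mirroring a scalar (one call per row) API
    out["simkey"] = out[column].apply(key_fn)
    return out


def toy_key(name):
    # Placeholder for the real API call; NOT how similarity keys are generated
    return "".join(ch for ch in name.lower() if ch.isalnum())


df = pd.DataFrame({"org": ["IBM", "ibm", "Amazon"]})
keyed = add_simkeys(df, "org", toy_key)
print(keyed)
```

In real use, `key_fn` would be a function that sends the name to the API and returns the similarity key from the response.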

Here is the API endpoint we will use to process row values for matching purposes in this example:

    url = 'https://api.interzoid.com/getcompanymatchadvanced'
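
A single call to this endpoint might look like the sketch below. The query parameter names (`license`, `company`, `algorithm`) and the `SimKey` response field are assumptions that should be checked against Interzoid's API documentation, as should the value to pass for `algorithm`. The function name `fetch_simkey` and the injectable `http_get` argument are hypothetical, added so the call can be exercised without network access.

```python
API_URL = "https://api.interzoid.com/getcompanymatchadvanced"


def fetch_simkey(name, api_key, algorithm, http_get=None):
    """Request a similarity key for one organization name.

    Assumed (verify against the API docs): query parameters 'license',
    'company', 'algorithm', and a JSON response containing 'SimKey'.
    """
    if http_get is None:
        import requests  # imported lazily so a fake can be injected in tests
        http_get = requests.get
    params = {"license": api_key, "company": name, "algorithm": algorithm}
    response = http_get(API_URL, params=params, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of hiding them
    # Return an empty string if the assumed field is missing
    return response.json().get("SimKey", "")
```

Calling `fetch_simkey("ibm inc", api_key="YOUR_API_KEY", algorithm=...)` with an algorithm value taken from the API documentation would then return the key for that name, provided the assumptions above hold.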

This code example uses a few Python libraries. If they are not already installed in your environment, install them as follows:

    $ pip install pandas
    $ pip install requests
    $ pip install tabulate

We could simply sort the data frame by the generated similarity key so that matching organization names line up next to each other. However, to make the results more readable and more like a report, we add a blank line between each set of records that share a similarity key. We also omit any organization or company name whose similarity key is not shared by another row. This ensures that only rows with matches are displayed, so the redundancy in the dataset is easy to see.
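
One way to build such a report is sketched below, assuming the data frame already has a column of similarity keys. The function name `match_report`, the column name `simkey`, and the short keys `k1` to `k3` are made up for illustration; real keys come from the API, and the repository's script may format its output differently.

```python
import pandas as pd


def match_report(df, name_col="org", key_col="simkey"):
    """Return a text report listing only names that share a similarity key."""
    # keep=False marks every member of a duplicated key, not just the repeats
    dupes = df[df.duplicated(subset=key_col, keep=False)]
    if dupes.empty:
        return ""
    width = int(dupes[name_col].str.len().max())
    lines = []
    # groupby sorts by key and preserves the original row order within a group
    for _, group in dupes.groupby(key_col):
        for name, key in zip(group[name_col], group[key_col]):
            lines.append(f"{name:<{width}}  {key}")
        lines.append("")  # blank line between match sets
    return "\n".join(lines).rstrip("\n")


df = pd.DataFrame({
    "org": ["ibm inc", "Amazon", "IBM", "Google", "go0gle llc"],
    "simkey": ["k2", "k3", "k2", "k1", "k1"],  # made-up keys
})
report = match_report(df)
print(report)
```

Here `Amazon` is dropped because no other row shares its key, and the two remaining match sets are separated by a blank line.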

Test data frame (created within the code):

    data = {
        'org': ['ibm inc', 'Microsoft Corp.', 'go0gle llc', 'IBM', 'Google', 'Microsot', 'Amazon', 'microsfttt']
    }

Example output:

    ibm inc          edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
    IBM              edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs

    go0gle llc       pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
    Google           pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk

    Microsoft Corp.  xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
    Microsot         xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
    microsfttt       xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
