This document explains the data preparation process for training our
matching model. The example data comes from a research project that
digitized historic records of German joint-stock companies [(Gram et
al. 2022)](https://dl.acm.org/doi/10.1145/3531533). The data contains
inconsistencies in spelling, primarily due to variations in abbreviation
conventions and OCR errors, across most variables. These challenges make
it a compelling real-world use case for entity matching.

The data consists of three files:

- *left.csv*
- *right.csv*
- *matches.csv*

## Loading the Data

Training the pipelines requires three datasets:

- `left` (observations from one source or period)
- `right` (observations from another source or period)
- `matches` (a dataframe where each row contains the unique IDs of matching entities from `left` and `right`)


In [None]:
import random
import pandas as pd

matches = pd.read_csv('matches.csv')
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
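
As a quick sanity check after loading, it can help to confirm that every ID listed in `matches` actually occurs in `left` and `right`. The sketch below uses small made-up DataFrames in place of the CSV files. The column names `company_id`, `company_id_left`, and `company_id_right` follow those used later in this document, while the helper `unmatched_ids` and the sample rows are hypothetical.

```python
import pandas as pd

# Toy stand-ins for left.csv, right.csv and matches.csv (made-up rows).
left_demo = pd.DataFrame({
    "company_id": ["L1", "L2", "L3"],
    "company_name": ["Siemens & Halske", "Deutsche Bank", "Norddeutscher Lloyd"],
})
right_demo = pd.DataFrame({
    "company_id": ["R1", "R2"],
    "company_name": ["Siemens u. Halske", "Dtsch. Bank"],
})
matches_demo = pd.DataFrame({
    "company_id_left": ["L1", "L2"],
    "company_id_right": ["R1", "R2"],
})


def unmatched_ids(matches, column, source, id_column="company_id"):
    """Return the IDs in matches[column] that do not occur in source[id_column]."""
    missing = ~matches[column].isin(source[id_column])
    return matches.loc[missing, column].tolist()


# Both lists should be empty if the matches file is consistent with the sources.
print(unmatched_ids(matches_demo, "company_id_left", left_demo))
print(unmatched_ids(matches_demo, "company_id_right", right_demo))
```

The same helper could be applied to the real `matches`, `left`, and `right` DataFrames before continuing.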

Preview of the matches data:


In [None]:
matches.head()

Preview of the left dataset:


In [None]:
left.head()

Preview of the right dataset:


In [None]:
right.head()

## Defining Features and Similarity Concepts

The `similarity_map` defines which similarity concepts (values) to apply to each feature pair (keys). Note that this example uses a minimal similarity map for simplicity rather than optimal performance.


In [None]:
from neer_match.similarity_map import SimilarityMap
from neer_match_utilities.custom_similarities import CustomSimilarities

CustomSimilarities()  # Ensures similarity concepts are always scaled between 0 and 1.

# Define similarity_map

similarity_map = {
    "company_name" : [
        "levenshtein",
        "jaro_winkler",
        "partial_token_sort_ratio",
    ],
    "city" : [
        "levenshtein",
    ],
    "industry" : [
        "levenshtein",
        "jaro_winkler",
        "notmissing",
    ],
}

smap = SimilarityMap(similarity_map)
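
For intuition about what a similarity concept such as `levenshtein` measures, the following pure-Python sketch computes an edit distance and rescales it to the interval from 0 to 1. It is an illustration only and is not the implementation used by `neer_match`; the function names are hypothetical, and dividing by the longer string's length is one common normalization, assumed here.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


# Two spellings of the same (made-up) company name score close to 1.
print(normalized_similarity("SIEMENS & HALSKE", "SIEMENS U. HALSKE"))
```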

## Harmonizing the Data

### Left and Right

Next, data formatting can be harmonized using the `Prepare` class. This class offers flexible arguments for operations such as capitalizing strings, converting values to numeric types, and filling missing values. Additionally, a spaCy pipeline and custom stop words can be specified to remove noise from string variables (see [additional functionalities](additional_functionalities.md)). All operations are applied consistently to both the *left* and *right* DataFrames.


In [None]:
from neer_match_utilities.prepare import Prepare

# Initialize the Prepare object

prepare = Prepare(
    similarity_map=similarity_map, 
    df_left=left, 
    df_right=right, 
    id_left='company_id', 
    id_right='company_id',
)

# Get formatted and harmonized datasets

left, right = prepare.format(
    fill_numeric_na=False,
    to_numeric=['found_year'],
    fill_string_na=True, 
    capitalize=True,
    lower_case=False,
)

In [None]:
left.head()
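
To illustrate the kind of harmonization described above, here is a minimal pandas sketch that converts a column to numeric, fills missing strings, and upper-cases text. It is not the implementation of `Prepare.format`; the helper `harmonize`, the sample rows, and the reading of `capitalize` as upper-casing are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows with a missing name, mixed casing, and an unparseable year.
raw = pd.DataFrame({
    "company_name": ["Siemens & Halske", None],
    "city": ["Berlin", "münchen"],
    "found_year": ["1847", "n/a"],
})


def harmonize(df, to_numeric=(), fill_string_na=True, capitalize=True):
    """Apply a few simple formatting rules and return a new DataFrame."""
    out = df.copy()
    for col in to_numeric:
        # Values that cannot be parsed become NaN instead of raising an error.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # In this sketch, every column not converted to numeric is treated as text.
    string_cols = [c for c in out.columns if c not in to_numeric]
    for col in string_cols:
        if fill_string_na:
            out[col] = out[col].fillna("")
        if capitalize:
            out[col] = out[col].str.upper()
    return out


cleaned = harmonize(raw, to_numeric=["found_year"])
print(cleaned)
```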

## Restructuring the `matches` DataFrame

`neer-match` requires that the *matches* DataFrame be structured with
the indices from the left and right datasets instead of their unique
IDs. To convert your *matches* DataFrame into the required format, you
can run:


In [None]:
from neer_match_utilities.training import Training

training = Training(
    similarity_map=similarity_map, 
    df_left=left, 
    df_right=right, 
    id_left='company_id', 
    id_right='company_id',
)

matches = training.matches_reorder(
    matches, 
    matches_id_left='company_id_left', 
    matches_id_right='company_id_right'
)

matches.head()
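
Conceptually, this step replaces each ID with the index of the corresponding row in its DataFrame. The following is a minimal pandas sketch of that idea, not the implementation of `matches_reorder`; the helper `ids_to_positions`, the toy data, and the output column names `left` and `right` are assumptions made for illustration.

```python
import pandas as pd

# Made-up data: IDs on both sides and two matched pairs.
left_demo = pd.DataFrame({"company_id": ["A", "B", "C"]})
right_demo = pd.DataFrame({"company_id": ["X", "Y", "Z"]})
matches_demo = pd.DataFrame({
    "company_id_left": ["C", "A"],
    "company_id_right": ["X", "Z"],
})


def ids_to_positions(matches, left, right, id_col="company_id"):
    """Map matched IDs to the index labels of the corresponding rows."""
    # Lookup tables: ID -> index label in the respective DataFrame.
    left_lookup = pd.Series(left.index.to_numpy(), index=left[id_col].to_numpy())
    right_lookup = pd.Series(right.index.to_numpy(), index=right[id_col].to_numpy())
    return pd.DataFrame({
        "left": matches["company_id_left"].map(left_lookup),
        "right": matches["company_id_right"].map(right_lookup),
    })


reordered = ids_to_positions(matches_demo, left_demo, right_demo)
print(reordered)
```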

## Splitting Data

Next, we split the data into training and test sets, each consisting of
three DataFrames (`left`, `right`, and `matches`). The training ratio is given by
$\text{training_ratio} = 1 - (\text{test_ratio} + \text{validation_ratio})$.
Since validation is not implemented yet, you can set
$\text{validation_ratio} = 0$.


In [None]:
from neer_match_utilities.split import split_test_train

(
    left_train, right_train, matches_train,
    left_validation, right_validation, matches_validation,
    left_test, right_test, matches_test,
) = split_test_train(
    left = left,
    right = right,
    matches = matches,
    test_ratio = .5,
    validation_ratio = .0
)
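
As a small worked example of the ratio formula, the sketch below computes the training share implied by a given test and validation share. The helper `training_ratio` is hypothetical and only mirrors the arithmetic; it is not part of `neer_match_utilities`.

```python
def training_ratio(test_ratio: float, validation_ratio: float = 0.0) -> float:
    """Share of the data left for training after reserving test and validation shares."""
    reserved = test_ratio + validation_ratio
    if not 0.0 <= reserved <= 1.0:
        raise ValueError("test_ratio + validation_ratio must lie between 0 and 1")
    return 1.0 - reserved


# With the values used in this tutorial, half of the data remains for training.
print(training_ratio(test_ratio=0.5, validation_ratio=0.0))
```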

## Training and Exporting the Model

For this tutorial, we use a simple Logit model. Other models (ANN, Probit, or GradientBoost) follow a similar syntax and are covered in [alternative models](alternative_models.md).


In [None]:
from neer_match_utilities.baseline_training import BaselineTrainingPipe

training_pipeline = BaselineTrainingPipe(
    model_name='demonstration_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    validation_data=(left_validation, right_validation, matches_validation),  # only needed if tune_threshold is enabled for "gb"
    testing_data=(left_test, right_test, matches_test),
    id_left_col="company_id",
    id_right_col="company_id",
    # matches_id_left="left",
    # matches_id_right="right",
    model_kind="logit", # "logit" | "probit" | "gb"
    mismatch_share_fit=1.0,
    # tune_threshold=False, # recommended for "gb"
    # tune_metric="mcc",
)

training_pipeline.execute()
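
For intuition about what a logit model does with the similarity scores: it combines them linearly and passes the result through the logistic function to obtain a match probability. The sketch below uses made-up weights and an intercept purely for illustration; it does not reflect the fitted model or the internals of `BaselineTrainingPipe`, and the function name `match_probability` is hypothetical.

```python
import math


def match_probability(similarities, weights, intercept):
    """Logistic function applied to a weighted sum of similarity scores."""
    score = intercept + sum(w * s for w, s in zip(weights, similarities))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights for three similarity scores (not estimated from data).
weights = [4.0, 2.0, 1.0]
intercept = -5.0

high = match_probability([0.95, 1.0, 0.9], weights, intercept)  # similar pair
low = match_probability([0.30, 0.0, 0.2], weights, intercept)   # dissimilar pair
print(high, low)
```

A pair would then be classified as a match when its probability exceeds a chosen threshold, for example 0.5.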