## Workflow:

1. Read in data
2. Preprocess data (lowercase, special chars, extra spaces)
3. Inspect data:
   - missing values
   - duplicate values
   - tag distribution
   - sameness of entities 1 and 2 (size, token length)
4. Preprocessing data
5. Setup Baseline model
   - feature extraction
   - modelling
6. Training and evaluation
   - Split data into training, validation, and test sets.
   - Cross validation
   - ROC-AUC, precision, recall, and F1-score. (Focus on recall so that as few true matches as possible are missed.)


In [1]:
import pandas as pd
df = pd.read_csv('data/ds_challenge_alpas.csv')
df.columns = ['id', 'entity_1', 'entity_2', 'tag']
print(df.shape)
df.head()

(7042846, 4)


| | id | entity_1 | entity_2 | tag |
|---|---|---|---|---|
| 0 | 3137667 | preciform A.B | Preciform AB | 1 |
| 1 | 5515816 | degener staplertechnik vertriebs-gmbh | Irshim | 0 |
| 2 | 215797 | Alltel South CaroliNA Inc | alltel south carolina INC. | 1 |
| 3 | 1004621 | cse Corporation | Cse Corp | 1 |
| 4 | 1698689 | Gruppo D Motors Srl | gruppo d motors Sociedad de Resposabilidad Lim... | 1 |


✅ We are working on a binary classification task with already labeled data.

✅ 7,042,846 examples

At roughly 300 MB, the full dataset may cause RAM and fit-time problems on a local setup, so we work with a 1% random sample of the data.

In [2]:
df = df.sample(frac=.01, random_state=123).reset_index(drop=True)
df.shape

(70428, 4)
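If loading the full file is itself a memory problem, one option is to sample while reading. The sketch below is an illustration, not part of the original notebook: `sample_csv_in_chunks` is a hypothetical helper, and the toy CSV stands in for the real file. It reads the CSV in chunks with `pd.read_csv(..., chunksize=...)` and keeps a random fraction of each chunk, so the full table is never held in memory at once.

```python
import io

import pandas as pd


def sample_csv_in_chunks(source, frac=0.01, chunksize=100_000, random_state=123):
    """Read a CSV chunk by chunk and keep a random fraction of each chunk."""
    parts = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Use a different seed per chunk so every chunk is not sampled
        # at the same relative row positions.
        parts.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(parts, ignore_index=True)


# Toy stand-in for the real file (hypothetical data, 1,000 rows).
toy_csv = "id,entity_1,entity_2,tag\n" + "".join(
    f"{i},name {i},name {i},{i % 2}\n" for i in range(1000)
)
sample = sample_csv_in_chunks(io.StringIO(toy_csv), frac=0.1, chunksize=250)
print(sample.shape)
```

Per-chunk sampling only approximates a single global sample, which should be acceptable for exploratory work like this.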

In [3]:
import re

def preprocess_text(text):
    # Preprocessing function to clean the text - lower text, remove extra spaces and special chars
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['entity_1'] = df['entity_1'].apply(preprocess_text)
df['entity_2'] = df['entity_2'].apply(preprocess_text)

# Check for missing values
missing_values = df.isnull().sum()
print(f"Missing values in each column:\n{missing_values}\n")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}\n")

# Analyze the distribution of the binary tag
tag_distribution = df['tag'].value_counts(normalize=True)
print("Distribution of binary tag:\n", tag_distribution)

# Look at basic statistics
df['entity_1_length'] = df['entity_1'].apply(len)
df['entity_2_length'] = df['entity_2'].apply(len)
df['entity_1_tokens'] = df['entity_1'].apply(lambda x: len(x.split()))
df['entity_2_tokens'] = df['entity_2'].apply(lambda x: len(x.split()))

# Display basic statistics
print("Basic statistics for entity lengths and token counts:")
print(df[['entity_1_length', 'entity_2_length', 'entity_1_tokens', 'entity_2_tokens']].describe())


Missing values in each column:
id          0
entity_1    0
entity_2    0
tag         0
dtype: int64

Number of duplicate rows: 0

Distribution of binary tag:
 tag
0    0.591654
1    0.408346
Name: proportion, dtype: float64
Basic statistics for entity lengths and token counts:
       entity_1_length  entity_2_length  entity_1_tokens  entity_2_tokens
count     70428.000000     70428.000000     70428.000000     70428.000000
mean         21.925669        21.868433         3.279378         3.275544
std           9.557876         9.542120         1.203934         1.206749
min           2.000000         2.000000         1.000000         1.000000
25%          16.000000        15.000000         3.000000         3.000000
50%          21.000000        21.000000         3.000000         3.000000
75%          27.000000        27.000000         4.000000         4.000000
max         131.000000       138.000000        17.000000        17.000000


✅ No missing data

✅ No duplicate data

⚠️ Mild class imbalance: about 59% non-matched entities, 41% matched

✅ Entities 1 and 2 have similar descriptive statistics: length (roughly 10-30 characters) and token count (roughly 2-5 tokens)
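One caveat on the duplicate check: `df.duplicated()` compares all columns, including `id`. If `id` is unique per row (an assumption, not verified here), repeated name pairs would not be counted. A minimal sketch on hypothetical toy rows shows the difference when the check is restricted to the name columns:

```python
import pandas as pd

# Hypothetical rows: the first two share the same name pair but differ in id.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "entity_1": ["cse corporation", "cse corporation", "hudikhus ab", "knifecenter"],
    "entity_2": ["cse corp", "cse corp", "hudikhus ab", "dong hyung tem"],
    "tag": [1, 1, 1, 0],
})

# Full-row comparison: the differing ids hide the repeated pair.
full_row_dupes = int(toy.duplicated().sum())

# Comparison on the name pair only.
pair_dupes = int(toy.duplicated(subset=["entity_1", "entity_2"]).sum())

print(f"full-row duplicates: {full_row_dupes}, duplicate name pairs: {pair_dupes}")
```

Repeated pairs matter mostly because the same pair could end up in both the training and the evaluation split.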


## Model Selection and Evaluation

The pipeline works like this:

1. Preprocessing: First, we clean the text by lowercasing and removing extra spaces and special characters, so that small formatting inconsistencies do not affect the comparison.

2. TF-IDF + Cosine Similarity: We turn the company names into character-level n-gram vectors using a basic TF-IDF. Because they are built from short character sequences, these vectors are robust to typos, abbreviations, and word-order differences; they capture surface similarity between the strings rather than semantic meaning. Then, we calculate cosine similarity between the two name vectors, a common way to measure how close two vectors are based on the angle between them.

3. Logistic Regression: Finally, the similarity score goes into a logistic regression model, which learns how to map that score to a binary prediction. The output is a probability score between 0 and 1, which we can use to make predictions.
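
To make step 2 concrete, here is a small sketch of the similarity feature on two example pairs taken from the data shown in this notebook. It uses the same vectorizer settings as the pipeline (`char_wb`, n-grams of length 2 to 4); fitting on only four names is a simplification for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

pairs = [
    ("cse corporation", "cse corp"),    # near-identical names
    ("knifecenter", "dong hyung tem"),  # unrelated names
]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vectorizer.fit([name for pair in pairs for name in pair])

left = vectorizer.transform([a for a, _ in pairs])
right = vectorizer.transform([b for _, b in pairs])

# TfidfVectorizer L2-normalises rows by default, so the row-wise dot product
# of the two matrices equals the cosine similarity of each pair.
scores = np.asarray(left.multiply(right).sum(axis=1)).ravel()

for (a, b), score in zip(pairs, scores):
    print(f"{a!r} vs {b!r}: {score:.3f}")
```

The near-identical pair shares most of its character n-grams and scores much higher than the unrelated pair, which is the signal the logistic regression in step 3 learns to threshold.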
 
This setup is a good starting point because it's best to start simple when building something from scratch. It is fast, produces probability scores that are easy to interpret because they depend on a single feature, and should perform well enough. If not, we can always upgrade to a more complex method later.
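
Because the classifier sees only one feature, its decision rule reduces to a single cut-off on the similarity score: the predicted probability is 0.5 exactly where `w * s + b = 0`, i.e. at `s = -b / w`. The sketch below illustrates this on synthetic similarity scores (hypothetical data, not the notebook's) with a default `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical similarity scores: non-matches cluster low, matches cluster high.
neg = rng.uniform(0.0, 0.4, size=200)
pos = rng.uniform(0.6, 1.0, size=200)
X = np.concatenate([neg, pos]).reshape(-1, 1)
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0, 0], clf.intercept_[0]

# Similarity score at which the predicted match probability equals 0.5.
threshold = -b / w
proba_at_threshold = clf.predict_proba([[threshold]])[0, 1]

print(f"weight={w:.3f}, intercept={b:.3f}, similarity cut-off={threshold:.3f}")
```

In the notebook, the same quantities would be read from `pipeline.named_steps['clf']` after fitting.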


In [4]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import FunctionTransformer

class CosineSimilarityTransformer(BaseEstimator, TransformerMixin):
    # Custom transformer to compute cosine similarity between entity_1 and entity_2
    # inherit from BaseEstimator, TransformerMixin to be later used with sklearn Pipeline structure
    def __init__(self, vectorizer=None):
        self.vectorizer = vectorizer or TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))

    def fit(self, X, y=None):
        all_text = pd.concat([X['entity_1'], X['entity_2']])
        self.vectorizer.fit(all_text)
        return self

    def transform(self, X):
        tfidf_1 = self.vectorizer.transform(X['entity_1'])
        tfidf_2 = self.vectorizer.transform(X['entity_2'])
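        # Row-wise cosine similarity. TF-IDF rows are L2-normalised by default (norm='l2'),
        # so the element-wise product summed per row equals the cosine similarity.
        # This avoids building the full n x n matrix only to read its diagonal.
        cos_sim = np.asarray(tfidf_1.multiply(tfidf_2).sum(axis=1)).ravel()
        return cos_sim.reshape(-1, 1)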

# Reserve 20% for test; from the remaining 80%, use 25% for validation (i.e. 60/20/20 split)
# use stratify to preserve class balance
train_val, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['tag'])
train, val = train_test_split(train_val, test_size=0.25, random_state=42, stratify=train_val['tag'])

# Define the pipeline: compute cosine similarity and then use simple logistic regression
pipeline = Pipeline([
    ('preprocessing', FunctionTransformer(lambda X: X.map(preprocess_text))),
    ('cosine_sim', CosineSimilarityTransformer()),
    ('clf', LogisticRegression(solver='saga', n_jobs=-1))
])

# fit model
X_train = train[['entity_1', 'entity_2']]
y_train = train['tag']
pipeline.fit(X_train, y_train)

# evaluate on the validation set
X_val = val[['entity_1', 'entity_2']]
y_val = val['tag']
y_val_pred_proba = pipeline.predict_proba(X_val)[:, 1]
y_val_pred = pipeline.predict(X_val)

# Calculate metrics
auc_val = roc_auc_score(y_val, y_val_pred_proba)
precision_val = precision_score(y_val, y_val_pred)
recall_val = recall_score(y_val, y_val_pred)
f1_val = f1_score(y_val, y_val_pred)

# Print evaluation metrics
print("Explanation: ROC-AUC measures how well the model distinguishes between classes. A score of 1 indicates perfect classification.")
print("Validation ROC-AUC: {:.6f}\n".format(auc_val))

print("Explanation: Precision measures the proportion of positive predictions that are correct. Of all the predicted positives, how many were actually positive?")
print("Validation Precision: {:.6f}\n".format(precision_val))

print("Explanation: Recall measures the proportion of actual positives that are correctly identified. Of all the actual positives, how many did we correctly identify?")
print("Validation Recall: {:.6f}\n".format(recall_val))

print("Explanation: The F1-Score is the harmonic mean of precision and recall. It is a good balance metric, especially when dealing with imbalanced classes.")
print("Validation F1-Score: {:.6f}\n".format(f1_val))

# Cross-validation on training set using StratifiedKFold (5 folds)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
print("Cross-validation ROC-AUC scores:", cv_scores)
print("Mean CV ROC-AUC: {:.6f}".format(np.mean(cv_scores)))

# Peek at the predictions
predictions = X_val.copy()
predictions["tags"] = y_val_pred_proba
predictions


Explanation: ROC-AUC measures how well the model distinguishes between classes. A score of 1 indicates perfect classification.
Validation ROC-AUC: 0.999991

Explanation: Precision measures the proportion of positive predictions that are correct. Of all the predicted positives, how many were actually positive?
Validation Precision: 0.999302

Explanation: Recall measures the proportion of actual positives that are correctly identified. Of all the actual positives, how many did we correctly identify?
Validation Recall: 0.995828

Explanation: The F1-Score is the harmonic mean of precision and recall. It is a good balance metric, especially when dealing with imbalanced classes.
Validation F1-Score: 0.997562

Cross-validation ROC-AUC scores: [0.99999316 0.99998737 0.99999838 0.99998551 0.99999751]
Mean CV ROC-AUC: 0.999992


| | entity_1 | entity_2 | tags |
|---|---|---|---|
| 12701 | moai electronics | moai electronics corp | 0.999874 |
| 69312 | unimatec chemicals europe | shanghai twinstars luggage | 0.001052 |
| 28313 | knifecenter | dong hyung tem | 0.001057 |
| 7831 | baader schaltanlagen u leiterplatten gmbh co kg | baader schaltanlagen u leiterplatten gmbh co kg | 0.999964 |
| 67660 | rotor doo zemun | akapolco international | 0.001049 |
| ... | ... | ... | ... |
| 57669 | energiopts spol s | energiopts spol s ro | 0.999925 |
| 10650 | hudikhus ab | hudikhus ab | 0.999964 |
| 40337 | derwent shipping logistics ltd | derwent shipping logistics ltd | 0.999964 |
| 15528 | motex modetextilservice logistik und management | kuhlman electric corp | 0.001317 |


## Model Evaluation Metrics Interpretation

### ROC-AUC
- **Validation ROC-AUC: `0.999991`**
  - This score indicates that the model ranks matching pairs above non-matching pairs almost perfectly. ROC-AUC is threshold-independent: it shows that the two classes are well separated by the similarity score, but it does not by itself give the number of false positives or false negatives at a particular decision threshold.

### Precision
- **Validation Precision: `0.999302`**
  - The precision score indicates that 99.93% of the pairs the model predicts as matches are actual matches. This matters in applications where false positives can lead to significant issues.

### Recall
- **Validation Recall: `0.995828`**
  - The recall score shows that the model correctly identifies 99.58% of the actual positive cases. In scenarios where identifying all true matches is critical, this score is quite good.
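
Since recall is the priority metric in the workflow, the default 0.5 cut-off does not have to be final. The following is a minimal sketch, assuming a hypothetical helper `threshold_for_recall` and toy labels and scores, of how `precision_recall_curve` can be used to pick the highest threshold that still reaches a target recall.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, y_score, target_recall=0.99):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is still at least target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; the final
    # entry (recall = 0) has no threshold attached, so it is dropped.
    ok = recall[:-1] >= target_recall
    # Thresholds are increasing and recall is non-increasing, so the last
    # index that satisfies the target is the highest usable threshold.
    idx = np.where(ok)[0][-1]
    return thresholds[idx], precision[idx], recall[idx]


# Toy example (hypothetical scores): one negative scores above the weakest positive.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.05, 0.1, 0.2, 0.6, 0.3, 0.7, 0.8, 0.9, 0.95])

thr, prec, rec = threshold_for_recall(y_true, y_score, target_recall=0.99)
print(f"threshold={thr:.2f}, precision={prec:.3f}, recall={rec:.3f}")
```

In the notebook, `y_val` and `y_val_pred_proba` would take the place of the toy arrays, and the chosen threshold would be applied to `predict_proba` output instead of calling `predict`.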

### F1-Score
- **Validation F1-Score: `0.997562`**
  - The F1-Score, which balances precision and recall, is 99.76%. This indicates that the model maintains a good balance between precision and recall, making it a reliable choice for this task. The high F1-Score suggests that the model performs well even in the presence of class imbalance.
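
As a quick arithmetic check, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), and the reported value can be reproduced from the two validation figures above. Inputs are the rounded values printed by the notebook, so the last digit may differ slightly from the unrounded computation.

```python
# Validation precision and recall as reported above (rounded to six decimals).
precision = 0.999302
recall = 0.995828

# Harmonic mean, which is what f1_score computes for the positive class.
f1 = 2 * precision * recall / (precision + recall)

# Plain arithmetic mean, shown only for comparison.
arithmetic_mean = (precision + recall) / 2

print(f"harmonic mean (F1): {f1:.6f}")
print(f"arithmetic mean:    {arithmetic_mean:.6f}")
```

The harmonic mean is never larger than the arithmetic mean and is pulled towards the smaller of the two values, which is why F1 penalises a large gap between precision and recall.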

### Cross-Validation
- **Cross-validation ROC-AUC scores: `[0.99999316, 0.99998737, 0.99999838, 0.99998551, 0.99999751]`**
  - The cross-validation scores are consistently high, indicating that the model's performance is stable across different subsets of the data.
 
It's important to note, however, that we trained on a 1% sample of the data due to resource limitations (running out of RAM locally). The validation and cross-validation scores agree closely, so there is no sign of overfitting so far, but training on a more powerful machine with more data would be the logical next step before building for production.
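
The held-out test split created earlier is not scored in the cells above, so a final check on it would be a natural addition. The sketch below shows one way to bundle the four metrics into a reusable function; `evaluate` is a hypothetical helper, and the toy model and data exist only to make the example runnable on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def evaluate(model, X, y, threshold=0.5):
    """Score a fitted binary classifier that exposes predict_proba."""
    proba = model.predict_proba(X)[:, 1]
    pred = (proba >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y, proba),
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "f1": f1_score(y, pred),
    }


# Toy stand-in for the fitted pipeline and the test split (hypothetical data).
rng = np.random.default_rng(42)
X_toy = np.concatenate(
    [rng.uniform(0.0, 0.4, 100), rng.uniform(0.6, 1.0, 100)]
).reshape(-1, 1)
y_toy = np.array([0] * 100 + [1] * 100)
toy_model = LogisticRegression().fit(X_toy, y_toy)

metrics = evaluate(toy_model, X_toy, y_toy)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

In the notebook this would be called as `evaluate(pipeline, test[['entity_1', 'entity_2']], test['tag'])`, ideally once, after all modelling decisions have been made.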