### **Cell 1: Environment Setup and Dependency Fixes**

When I first ran the notebook, I hit a binary incompatibility error because NumPy 2.0+ conflicted with the Transformers build in this environment. To fix this, I uninstalled the affected packages and pinned NumPy to a legacy version (<2.0). I also installed `dadmatools` for Persian text normalization, `sentence-transformers` for the vector embedding tasks, and `python-Levenshtein` for the edit-distance checks.

In [None]:
!pip uninstall -y transformers sentence-transformers numpy
!pip install -U "sentence-transformers>=3.0.0" transformers "numpy<2.0" dadmatools python-Levenshtein pandas
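
A quick check after the install can confirm that the pin took effect before anything imports Transformers. This is a minimal sketch and not part of the original notebook; the helper name `is_legacy_numpy` is hypothetical. If NumPy was already imported in the session, the kernel usually needs a restart before the downgraded version is picked up.

```python
from importlib import metadata


def is_legacy_numpy(version: str) -> bool:
    """Return True when the version string has a major version below 2."""
    major = int(version.split(".")[0])
    return major < 2


try:
    installed = metadata.version("numpy")
    print(f"numpy {installed} | legacy (<2.0): {is_legacy_numpy(installed)}")
except metadata.PackageNotFoundError:
    # numpy is not installed in this environment
    print("numpy is not installed")
```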

### **Cell 2: Importing Libraries and GPU Check**

Here, I am importing the necessary libraries. I'm using `pandas` for data handling and `torch` to manage the computation device. I included a check to ensure the code detects the GPU (CUDA); otherwise, generating embeddings for 10,000 rows would take too long on the CPU.

In [16]:
import pandas as pd
import numpy as np
import torch
import Levenshtein
from dadmatools.normalizer import Normalizer
from sentence_transformers import SentenceTransformer

# checking if cuda is available to speed up the embedding process
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Running on: {device}")

Running on: cuda


### **Cell 3: Building the Root-Word Dictionary**

To handle Rule 5 (singular/plural and root variations), I cannot rely solely on the model. I used the provided `dataset_sample.csv` to build a lookup dictionary. I iterate through the file, and wherever the `valid` column confirms a relationship (the markers are "بله", meaning "yes", and "تکراری", meaning "duplicate"), I map the derivative word to its root (lemma). This allows me to normalize words like "تاجر" (merchant) to "تجارت" (trade) before processing.

In [17]:
# loading the dataset that contains root and derivative pairs
try:
    lemma_df = pd.read_csv('/kaggle/input/mydatasets/dataset_sample.csv')
    
    lemma_dict = {}
    # these markers in the csv indicate that the words are related
    valid_markers = ['بله', 'تکراری', 'بله.', 'بله ']

    for index, row in lemma_df.iterrows():
        status = str(row['valid']).strip()
        
        # if the relationship is valid, add it to the dictionary
        if status in valid_markers:
            root = str(row['lem']).strip()
            word = str(row['derivitive']).strip()
            lemma_dict[word] = root
            
    print(f"Dictionary built. Contains {len(lemma_dict)} root mappings.")
    
except FileNotFoundError:
    print("Error: dataset_sample.csv not found. please check the path.")
    lemma_dict = {}

Dictionary built. Contains 820 root mappings.
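
The mapping logic can be tried on a tiny in-memory sample without the Kaggle file. The sketch below is only an illustration: the three rows are made up, only the column names (`lem`, `derivitive`, `valid`) come from the cell above, and the standard-library `csv` module stands in for pandas to keep the example self-contained.

```python
import csv
import io

# Hypothetical sample rows; only the column names come from the notebook.
SAMPLE_CSV = """lem,derivitive,valid
تجارت,تاجر,بله
تجارت,تجاری,تکراری
صنعت,صنایع,خیر
"""

# Markers that confirm a root/derivative relationship.
VALID_MARKERS = {"بله", "تکراری", "بله."}


def build_lemma_dict(csv_text: str) -> dict:
    """Map each confirmed derivative to its root word."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # strip() makes the comparison tolerant of stray spaces
        if row["valid"].strip() in VALID_MARKERS:
            mapping[row["derivitive"].strip()] = row["lem"].strip()
    return mapping


lemma_demo = build_lemma_dict(SAMPLE_CSV)
print(len(lemma_demo))  # → 2
```

Because the status is stripped before the comparison, a marker with a trailing space never needs its own entry. If the same derivative appears in several confirmed rows, the last row wins, since each assignment overwrites the previous one.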


### **Cell 4: The Preprocessing Pipeline**

I defined a `preprocess` function that acts as the backbone of this system. It performs four steps:
1. Normalizes Persian characters (e.g., unifying the Arabic and Persian forms of 'ک' and 'ی') using `dadmatools`.
2. Tokenizes the string on whitespace.
3. Removes stop words (generic legal terms like "شرکت" (company), "موسسه" (institute), "مهندسی" (engineering)) because they create noise in similarity searches.
4. Lemmatizes the tokens using the dictionary created in the previous step.

The goal is that variants such as "Mihan Food Company" and "Mihan Foods" reduce to the same processed string, provided the root dictionary covers the words involved.

In [19]:
# initializing the dadmatools normalizer with standard settings
normalizer = Normalizer(
    full_cleaning=False,
    unify_chars=True,
    refine_punc_spacing=True,
    remove_extra_space=True
)

# list of generic words that don't add semantic value to the name
# removing these helps avoid false positives
STOPWORDS = {
    "شرکت", "موسسه", "سهامی", "خاص", "عام", "مسئولیت", "محدود",
    "تعاونی", "گروه", "صنایع", "مجتمع", "کارخانجات", "تولیدی", "بازرگانی",
    "خدماتی", "مهندسی", "پخش", "توزیع", "و", "در", "به", "های", "ای"
}

def preprocess(text):
    if not isinstance(text, str):
        return None
        
    # step 1: normalize characters
    text = normalizer.normalize(text)
    
    # steps 2-3: tokenize on whitespace and remove stopwords
    tokens = text.split()
    clean_tokens = [t for t in tokens if t not in STOPWORDS]
    
    if not clean_tokens:
        return None

    # step 4: convert words to their roots using the dictionary
    lemma_tokens = [lemma_dict.get(t, t) for t in clean_tokens]
    
    # returning a dict with different formats for different checks
    return {
        "original": text,
        "clean_tokens": set(clean_tokens),        # set is faster for subset checks
        "clean_string": " ".join(clean_tokens),   # string for Levenshtein distance
        "lemma_string": " ".join(lemma_tokens)    # lemmatized string for embedding
    }

print("Preprocessing pipeline ready.")

Preprocessing pipeline ready.
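
To see the pipeline's effect without installing `dadmatools`, the steps can be approximated with the standard library. This is a simplified stand-in under stated assumptions: `CHAR_MAP` only covers the Arabic kaf and yeh, `DEMO_STOPWORDS` is a small subset of the real list, and `DEMO_LEMMAS` is a hypothetical one-entry dictionary.

```python
# Stand-in for the dadmatools normalizer: maps Arabic kaf/yeh to their Persian forms.
CHAR_MAP = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc"})

# Small subset of the stop-word list, for illustration only.
DEMO_STOPWORDS = {"پخش", "گروه", "خاص"}

# Hypothetical one-entry lemma dictionary.
DEMO_LEMMAS = {"تاجر": "تجارت"}


def simple_preprocess(text):
    """Normalize, tokenize, drop stop words, then map tokens to their roots."""
    text = text.translate(CHAR_MAP)
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS]
    if not tokens:
        return None
    lemmas = [DEMO_LEMMAS.get(t, t) for t in tokens]
    return {"clean_string": " ".join(tokens), "lemma_string": " ".join(lemmas)}


# One of the test names used later in the notebook, typed with Arabic kaf and yeh.
result = simple_preprocess("كالا پخش ايلام")
print(result["clean_string"])
```

The real normalizer handles far more than two characters, so this sketch is only useful for reasoning about the order of the steps: normalization has to run before stop-word removal, otherwise a stop word typed with Arabic characters would not match the set.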


### **Cell 5: Indexing the Database**

In this step, I loaded the 10,000 registered company names and applied the preprocessing pipeline to all of them. Then, I used the `intfloat/multilingual-e5-large` model to convert the lemmatized names into vectors. I chose E5 over ParsBERT because E5 is optimized for retrieval tasks. I added the "passage: " prefix to each name because E5 models expect it on indexed documents, with "query: " reserved for the search query.


In [20]:
# loading the main data
df = pd.read_csv('/kaggle/input/mydatasets/data_sample.csv')

# preprocessing all 10k rows
print("Preprocessing 10,000 rows...")
processed_data = df['name'].apply(preprocess).tolist()

# filtering out any rows that became empty after stopword removal
valid_indices = [i for i, x in enumerate(processed_data) if x is not None]
df = df.iloc[valid_indices].reset_index(drop=True)
processed_list = [processed_data[i] for i in valid_indices]

# saving the processed forms back to the dataframe
df['clean_tokens'] = [x['clean_tokens'] for x in processed_list]
df['clean_string'] = [x['clean_string'] for x in processed_list]
df['lemma_string'] = [x['lemma_string'] for x in processed_list]

# generating embeddings
print("Generating embeddings using GPU...")
model = SentenceTransformer('intfloat/multilingual-e5-large', device=device)

# e5 models expect 'passage: ' prefix for the documents in the index
docs_to_encode = ["passage: " + x for x in df['lemma_string']]

# encoding in batches to manage memory
db_vectors = model.encode(
    docs_to_encode,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True 
)

print(f"Database indexed. Matrix shape: {db_vectors.shape}")

Preprocessing 10,000 rows...
Generating embeddings using GPU...


Database indexed. Matrix shape: (10000, 1024)
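
The index is built with `normalize_embeddings=True`, which matters for the search step: on unit-length vectors, a plain dot product equals cosine similarity. The sketch below illustrates this in pure Python with made-up 3-dimensional vectors in place of the real 1024-dimensional embeddings; the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, database, k):
    """Indices of the k database vectors with the highest dot product."""
    scores = [dot(query, row) for row in database]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy vectors standing in for embeddings (made-up values).
raw_db = [[3.0, 4.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
raw_query = [3.0, 3.0, 0.0]

db = [l2_normalize(v) for v in raw_db]
query = l2_normalize(raw_query)

# On unit vectors the dot product matches the cosine of the raw vectors.
for raw, unit in zip(raw_db, db):
    assert math.isclose(dot(query, unit), cosine(raw_query, raw))

print(top_k(query, db, k=2))  # → [0, 2]
```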


### **Cell 6: The Validation Logic (Rules 1-6)**

This is the core logic function. It takes a new name, preprocesses it, and encodes it as a query vector (prefixed with "query: "). I compute the cosine similarity against the database and select the top 15 candidates.
I then iterate through these candidates, applying the rules in the order specified in the task:
1.  **Python Checks (Rules 1-3):** I use strict string comparison and set operations to catch exact matches, subsets, and permutations. These checks are deterministic, but they only see the 15 retrieved candidates.
2.  **Levenshtein Distance (Rule 4):** I check for typos only if the vector similarity is reasonably high (> 0.7) to save computation. The allowed edit distance is 1 for processed names shorter than 5 characters and 2 otherwise.
3.  **Semantic Similarity (Rules 5 & 6):** I rely on the embedding score. If the score is at least 0.88, I flag it as a semantic conflict.
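
One detail worth keeping in mind with the set-based checks: `issubset` also returns `True` for equal sets, so a permutation has to be tested before the subset relation if it is to be reported under its own rule. The sketch below illustrates this with a hypothetical helper, `classify_overlap`, and English placeholder tokens. Token sets alone cannot tell an exact match from a reordering, which is why Rule 1 compares the joined strings instead.

```python
def classify_overlap(input_tokens: set, cand_tokens: set):
    """Return which token-level rule, if any, a candidate triggers."""
    if input_tokens == cand_tokens:
        # Same words in any order. Checked first: equal sets are also subsets.
        return "permutation"
    if input_tokens <= cand_tokens:
        return "input is a subset"
    if cand_tokens <= input_tokens:
        return "existing name is a subset"
    return None


a = {"alpha", "beta"}
print(classify_overlap(a, {"beta", "alpha"}))           # permutation
print(classify_overlap(a, {"alpha", "beta", "gamma"}))  # input is a subset
print(classify_overlap(a, {"alpha"}))                   # existing name is a subset
print(classify_overlap(a, {"delta"}))                   # None
```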

In [22]:
def check_name_rules(new_name):
    """
    Checks the 6 rules defined in the project requirements.
    """
    # preprocessing the input
    input_data = preprocess(new_name)
    if not input_data:
        return "Invalid input (empty or only stopwords)"
        
    input_clean_str = input_data['clean_string']
    input_tokens = input_data['clean_tokens']
    input_lemma_str = input_data['lemma_string']
    
    print(f"Checking: {new_name}")
    print(f"Processed: {input_clean_str}")
    print(f"Roots: {input_lemma_str}")
    
    # vector search
    # e5 models expect 'query: ' prefix for the search query
    query_vec = model.encode(["query: " + input_lemma_str], normalize_embeddings=True)
    
    # calculating cosine similarity
    scores = np.dot(db_vectors, query_vec.T).flatten()
    
    # getting top 15 candidates
    top_indices = np.argsort(scores)[::-1][:15]
    
    for idx in top_indices:
        score = scores[idx]
        candidate_row = df.iloc[idx]
        cand_name = candidate_row['name']
        cand_clean_str = candidate_row['clean_string']
        cand_tokens = candidate_row['clean_tokens']
        
        # rule 1: exact match check
        if input_clean_str == cand_clean_str:
            return f"REJECTED | Conflict: {cand_name} | Reason: Rule 1 (Exact Match)"

        # rule 3: permutation check (same words, different order)
        # runs before the subset check, because equal sets are also subsets of each other
        if input_tokens == cand_tokens:
            return f"REJECTED | Conflict: {cand_name} | Reason: Rule 3 (Word Permutation)"

        # rule 2: subset check (input inside existing or existing inside input)
        if input_tokens.issubset(cand_tokens):
            return f"REJECTED | Conflict: {cand_name} | Reason: Rule 2 (Input is a subset)"
        if cand_tokens.issubset(input_tokens):
            return f"REJECTED | Conflict: {cand_name} | Reason: Rule 2 (Existing name is a subset)"

        # rule 4: typos / levenshtein distance
        # only checking this if the vector score is high enough to suggest similarity
        if score > 0.7:
            edit_dist = Levenshtein.distance(input_clean_str, cand_clean_str)
            # stricter threshold for short names
            threshold = 1 if len(input_clean_str) < 5 else 2
            if edit_dist <= threshold:
                return f"REJECTED | Conflict: {cand_name} | Reason: Rule 4 (Typo/Spelling too similar)"

        # rule 5 & 6: semantic / roots
        # using the embedding score to catch semantic issues
        if score >= 0.88:
            return f"REJECTED | Conflict: {cand_name} | Reason: Rule 5/6 (High Semantic Similarity: {score:.3f})"

    return "ACCEPTED | Name appears valid."

print("Validation logic ready.")

Validation logic ready.
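
For readers without the `python-Levenshtein` package, the Rule 4 check can be reproduced with a textbook dynamic-programming edit distance. This is a sketch, not the library's implementation; `edit_distance`, `typo_threshold`, and `looks_like_typo` are hypothetical names, and only the cutoff (1 edit below 5 characters, 2 otherwise) is taken from the cell above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def typo_threshold(name: str) -> int:
    """Same cutoff as the notebook: 1 edit for names under 5 characters, else 2."""
    return 1 if len(name) < 5 else 2


def looks_like_typo(new: str, existing: str) -> bool:
    return edit_distance(new, existing) <= typo_threshold(new)


print(edit_distance("kitten", "sitting"))  # → 3
# Two processed names that differ in a single letter (ذ vs ز):
print(looks_like_typo("غزایی میهن", "غذایی میهن"))
```

In the notebook this check only runs for candidates whose similarity score is above 0.7, so a typo is caught only if the misspelled name still lands among the retrieved candidates.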


### **Cell 7: Testing**

Finally, I ran the system against some test cases to verify accuracy.

In [27]:
test_cases = [
    "عمران خطوط سورن",
    "كالا پخش ايلام",
    "غذایی صنایع میهن",
    "صنایع غزایی میهن",
    "صنعت غذا میهن",
    "شرکت فولاد مبارکه اصفهان",
    "شعبه اصفهان شرکت مهر افزون خرم آباد",
]

print("--- RUNNING TESTS ---")
for name in test_cases:
    result = check_name_rules(name)
    print(result)
    print("-" * 50)

--- RUNNING TESTS ---
Checking: عمران خطوط سورن
Processed: عمران خطوط سورن
Roots: عمران خطوط سورن
REJECTED | Conflict: عمران خطوط سورن | Reason: Rule 1 (Exact Match)
--------------------------------------------------
Checking: كالا پخش ايلام
Processed: کالا ایلام
Roots: کالا ایلام
REJECTED | Conflict: كالا پخش عصر ايلام | Reason: Rule 2 (Input is a subset)
--------------------------------------------------
Checking: غذایی صنایع میهن
Processed: غذایی میهن
Roots: غذایی میهن
ACCEPTED | Name appears valid.
--------------------------------------------------
Checking: صنایع غزایی میهن
Processed: غزایی میهن
Roots: غزایی میهن
ACCEPTED | Name appears valid.
--------------------------------------------------
Checking: صنعت غذا میهن
Processed: صنعت غذا میهن
Roots: صنعت غذا میهن
ACCEPTED | Name appears valid.
--------------------------------------------------
Checking: شرکت فولاد مبارکه اصفهان
Processed: فولاد مبارکه اصفهان
Roots: فولاد مبارکه اصفهان
ACCEPTED | Name appears valid.
----------------

## Idea to improve accuracy:
After retrieving the closest names through embeddings, instead of relying only on the cosine similarity score, we can pass the candidate names and the user's requested name to an LLM with a carefully engineered prompt, and ask it to reject the request if it sees a similar name.
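
As a rough illustration of this idea, the sketch below only assembles the prompt and parses a reply; it does not call any model. The function names, the prompt wording, the ACCEPT/REJECT reply format, and the candidate names are all assumptions made for the example.

```python
def build_review_prompt(new_name: str, candidates: list[str]) -> str:
    """Assemble a plain-text prompt asking an LLM to judge name similarity."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(candidates, start=1))
    return (
        "You review company name registrations.\n"
        "Registered names that are close to the request:\n"
        f"{numbered}\n"
        f"Requested name: {new_name}\n"
        "Answer REJECT if the requested name is confusingly similar to any "
        "registered name, otherwise answer ACCEPT. Give a one-line reason."
    )


def parse_verdict(reply: str) -> bool:
    """Return True when the reply starts with ACCEPT (hypothetical reply format)."""
    return reply.strip().upper().startswith("ACCEPT")


# Hypothetical candidates; in practice these would be the top retrieved names.
prompt = build_review_prompt("صنعت غذا میهن", ["صنایع غذایی میهن", "میهن"])
print(prompt)
```

The embedding search would still do the retrieval; the LLM would only replace the fixed 0.88 cutoff as the final judge over a short candidate list.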