# Lead Matching 📑

🚀 Tasks:
1. Preprocessing (15%): Outline your plan, clean and prepare the data, addressing name variations (transliteration, Unicode, etc.) and missing values. Provide code examples.
2. Name Matching (40%): Implement a name matching strategy (fuzzy matching, phonetic matching, etc.) considering language differences. Explain your approach and provide code. Discuss precision/recall trade-offs.
3. Location Matching (20%): Use latitude/longitude to improve matching (distance calculation, radius-based). Provide code.
4. Combined Matching & Evaluation (25%): Combine your name matching and location matching strategies into a single matching system. 
    - a. How will you weight the importance of name and location similarity?
    - b. Describe evaluation metrics (precision, recall, F1-score).
    - c. Discuss limitations.

📑 Contents:
- [Import Dataset](#import_dataset)
- [Preprocessing](#preprocessing)
- [Name Matching](#name_matching)
- [Location Matching](#location_matching)
- [Combined Matching & Evaluation](#combined_matching_and_evaluation)

<a id="import_dataset"></a>
# Import Dataset

In [1]:
import pandas as pd
import re
from sklearn.metrics import classification_report

In [2]:
filepath = 'data/Dataset for data scientist assessment - Lead.csv'
df = pd.read_csv(filepath)
# rename
df.rename(columns={'Restaurant name': 'restaurant_name'}, inplace=True)
df.head()

Unnamed: 0,restaurant_name,lat,long,group
0,Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1,13.730788,100.581716,1
1,Sushi Hiro พรอมานาด,13.826534,100.676388,2
2,ซูชิฮิโระ หองหล่อ,13.7309,100.58472,1
3,หองหล่อ Sushi Hiro ในตึก Eight,13.730888,100.582816,1
4,Hiro Sushi ทองหล่อ,13.730751,100.582316,1


<a id="preprocessing"></a>
# Preprocessing

- [Missing Values](#missing_values)
- [Data Types](#data_types)
- [Data Cleaning and Preparation](#data_cleaning_and_preparation)

<a id="missing_values"></a>
## Missing Values

Note: I added a `group` column that stores each restaurant's true group; it serves as the ground-truth label for evaluation.

In [3]:
print("> Missing values: ")
print(df.isnull().sum())

> Missing values: 
restaurant_name    0
lat                0
long               0
group              0
dtype: int64


<a id="data_types"></a>
## Data Types

Ensure that each column has the appropriate type:
- restaurant_name: string
- lat: float
- long: float
- group: int

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   restaurant_name  30 non-null     object 
 1   lat              30 non-null     float64
 2   long             30 non-null     float64
 3   group            30 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 1.1+ KB


<a id="data_cleaning_and_preparation"></a>
## Data Cleaning and Preparation

In [5]:
from pythainlp.tokenize import word_tokenize
from pythainlp.transliterate import romanize

Restaurant Name

In [6]:
def normalize_text(text: str):
    """
    Normalize text by removing special characters and lowercasing
    """
    if isinstance(text, str):
        text = text.encode('utf-8', errors='ignore').decode('utf-8')
        text = re.sub(r'[^\w\s\u0E00-\u0E7F]', ' ', text)  # Keep Thai Unicode range
        text = re.sub(r'\s+', ' ', text)
        return text.strip().lower()
    return ''

def split_thai_english(text: str):
    """
    Split Thai and English text components
    """
    thai_pattern = r'[\u0E00-\u0E7F]+'
    eng_pattern = r'[a-zA-Z]+'
    
    thai_parts = re.findall(thai_pattern, text)
    eng_parts = re.findall(eng_pattern, text)
    
    thai_text = ' '.join(thai_parts).strip()
    eng_text = ' '.join(eng_parts).strip()
    
    return thai_text, eng_text

def segment_thai(text: str):
    """
    Segment Thai text into words
    """
    return word_tokenize(text, engine="newmm")

def romanize_thai_text(text: str):
    """
    Romanize Thai text
    """
    if not text:
        return ''
    text_parts = text.split()
    # transliterate each word and join with space
    return ' '.join([romanize(word, engine='royin') for word in text_parts])


def expand_variations(text):
    variations = [text]
    
    # Split into Thai and English parts
    thai_part, eng_part = split_thai_english(text)
    
    # Create mapping for common transliterated words
    transliteration_map = {
        'sushi': ['ซูชิ'],
        'hiro': ['ฮิโระ'],
        'honmono': ['ฮอนโมโน', 'ฮนโมโน'],
        'khao': ['ข้าว'],
        'coffee': ['กาแฟ', 'คอฟฟี่'],
        # 'nana': ['นานา', 'นานะ'],
        'nana': ['นานา'],
        'toku': ['โทกุ', 'โทคุ'],
        'roaster': ['โรสเตอร์'],
        'ari': ['อารี', 'อารีย์'],
        'so': ['โซ'],
        'i': ['ไอ', 'อิ'],
    }
    
    # Check for English parts that might be Thai transliterations
    for eng_word in eng_part.split():
        if eng_word in transliteration_map:
            for thai_var in transliteration_map[eng_word]:
                new_var = text.replace(eng_word, thai_var)
                variations.append(new_var)
    
    return variations

# combine all cleaning and preparation steps
def preprocess_restaurant_name(df: pd.DataFrame, name_column: str):
    """
    Preprocess restaurant names by normalizing, splitting Thai/English components,
    segmenting Thai text, romanizing Thai text, and generating variations for matching
    """
    # Basic cleaning
    df['clean_name'] = df[name_column].apply(normalize_text)
    
    # Split Thai/English components
    df['thai_component'], df['english_component'] = zip(*df['clean_name'].apply(split_thai_english))
    
    # Word segmentation for Thai
    df['segmented_thai'] = df['thai_component'].apply(segment_thai)
    
    # Romanize Thai for comparison
    df['thai_romanized'] = df['thai_component'].apply(romanize_thai_text)
    
    # Generate variations for matching
    df['name_variations'] = df['clean_name'].apply(expand_variations)
    
    return df
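
One caveat in `expand_variations`: `str.replace` substitutes every occurrence of a substring, so a short key such as `'i'` or `'so'` also rewrites letters inside longer words. The sketch below shows a whole-word replacement as one possible fix. `replace_whole_word` is a hypothetical helper (not part of the notebook) and assumes the text has already been lowercased by `normalize_text`.

```python
import re

def replace_whole_word(text: str, word: str, replacement: str) -> str:
    """Replace `word` only where it is not glued to other Latin letters.

    Lookarounds on [a-z] are used instead of \\b because Thai letters also
    count as word characters, so \\b would not fire between 'i' and 'ซ'.
    """
    pattern = r'(?<![a-z])' + re.escape(word) + r'(?![a-z])'
    # A function replacement avoids any backslash handling in `replacement`.
    return re.sub(pattern, lambda m: replacement, text)

# Plain substring replacement touches letters inside other words:
naive = 'sushi hiro'.replace('i', 'ไอ')                  # 'sushไอ hไอro'
# Whole-word replacement leaves them alone:
safe = replace_whole_word('sushi hiro', 'i', 'ไอ')       # 'sushi hiro'
standalone = replace_whole_word('khao so i', 'i', 'ไอ')  # 'khao so ไอ'
```

Swapping this into the loop would change the generated variations, so the results further down would need to be re-run.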

In [7]:
df_tmp = df.copy()

In [8]:
# 1. Normalize restaurant name
df_tmp['restaurant_name'] = df_tmp['restaurant_name'].apply(normalize_text)
df_tmp['restaurant_name'].head()

0    sushi hiro ซูชิฮิโระ eight thonglor ชั้น1
1                          sushi hiro พรอมานาด
2                            ซูชิฮิโระ หองหล่อ
3               หองหล่อ sushi hiro ในตึก eight
4                           hiro sushi ทองหล่อ
Name: restaurant_name, dtype: object
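
Names such as `Khao-Sō-i` contain a Latin letter with a diacritic. `normalize_text` keeps `ō`, but the `[a-zA-Z]+` pattern in `split_thai_english` does not match it, so `sō` is cut down to `s`. A minimal sketch of folding Latin diacritics with the standard `unicodedata` module follows. `fold_latin_diacritics` is a hypothetical helper; it assumes Thai characters must pass through untouched, because Thai vowel and tone marks are combining characters too.

```python
import unicodedata

def fold_latin_diacritics(text: str) -> str:
    """Strip combining marks from non-Thai characters (e.g. 'ō' -> 'o')."""
    out = []
    for ch in text:
        # Keep the Thai block as-is so vowel and tone marks survive.
        if '\u0E00' <= ch <= '\u0E7F':
            out.append(ch)
            continue
        # Decompose the character, then drop any combining marks.
        decomposed = unicodedata.normalize('NFD', ch)
        out.append(''.join(c for c in decomposed if not unicodedata.combining(c)))
    return ''.join(out)

folded = fold_latin_diacritics('Khao-Sō-i ซูชิ')  # 'Khao-So-i ซูชิ'
```

Applied before `normalize_text`, this would let `so` reach the English component intact.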

In [9]:
# 2. Split Thai and English text
df_tmp['thai_component'], df_tmp['eng_component'] = zip(*df_tmp['restaurant_name'].map(split_thai_english))
df_tmp[['restaurant_name', 'thai_component', 'eng_component']].head()

Unnamed: 0,restaurant_name,thai_component,eng_component
0,sushi hiro ซูชิฮิโระ eight thonglor ชั้น1,ซูชิฮิโระ ชั้น,sushi hiro eight thonglor
1,sushi hiro พรอมานาด,พรอมานาด,sushi hiro
2,ซูชิฮิโระ หองหล่อ,ซูชิฮิโระ หองหล่อ,
3,หองหล่อ sushi hiro ในตึก eight,หองหล่อ ในตึก,sushi hiro eight
4,hiro sushi ทองหล่อ,ทองหล่อ,hiro sushi


In [10]:
# 3. Segment Thai text
df_tmp['thai_segmented'] = df_tmp['thai_component'].apply(segment_thai)
df_tmp['thai_segmented'].head()

0         [ซูชิ, ฮิ, โระ,  , ชั้น]
1                 [พร, อ, มา, นาด]
2    [ซูชิ, ฮิ, โระ,  , หอง, หล่อ]
3          [หอง, หล่อ,  , ใน, ตึก]
4                      [ทอง, หล่อ]
Name: thai_segmented, dtype: object

In [11]:
# Romanize Thai for comparison
df_tmp['thai_romanized'] = df_tmp['thai_component'].apply(romanize_thai_text)
df_tmp['thai_romanized'].head()

0     sutihiro chan
1         phonmanat
2    sutihiro onglo
3     onglo naituek
4           thonglo
Name: thai_romanized, dtype: object

In [12]:
# use the combined preprocessing function
df_preprocessed = preprocess_restaurant_name(df, 'restaurant_name')
df_preprocessed.head()

Unnamed: 0,restaurant_name,lat,long,group,clean_name,thai_component,english_component,segmented_thai,thai_romanized,name_variations
0,Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1,13.730788,100.581716,1,sushi hiro ซูชิฮิโระ eight thonglor ชั้น1,ซูชิฮิโระ ชั้น,sushi hiro eight thonglor,"[ซูชิ, ฮิ, โระ, , ชั้น]",sutihiro chan,"[sushi hiro ซูชิฮิโระ eight thonglor ชั้น1, ซู..."
1,Sushi Hiro พรอมานาด,13.826534,100.676388,2,sushi hiro พรอมานาด,พรอมานาด,sushi hiro,"[พร, อ, มา, นาด]",phonmanat,"[sushi hiro พรอมานาด, ซูชิ hiro พรอมานาด, sush..."
2,ซูชิฮิโระ หองหล่อ,13.7309,100.58472,1,ซูชิฮิโระ หองหล่อ,ซูชิฮิโระ หองหล่อ,,"[ซูชิ, ฮิ, โระ, , หอง, หล่อ]",sutihiro onglo,[ซูชิฮิโระ หองหล่อ]
3,หองหล่อ Sushi Hiro ในตึก Eight,13.730888,100.582816,1,หองหล่อ sushi hiro ในตึก eight,หองหล่อ ในตึก,sushi hiro eight,"[หอง, หล่อ, , ใน, ตึก]",onglo naituek,"[หองหล่อ sushi hiro ในตึก eight, หองหล่อ ซูชิ ..."
4,Hiro Sushi ทองหล่อ,13.730751,100.582316,1,hiro sushi ทองหล่อ,ทองหล่อ,hiro sushi,"[ทอง, หล่อ]",thonglo,"[hiro sushi ทองหล่อ, ฮิโระ sushi ทองหล่อ, hiro..."


Lat, Long

In [13]:
lat_min, lat_max = -90, 90
long_min, long_max = -180, 180

# Check for valid latitude and longitude values
def validate_lat_long(lat, long):
    return (lat_min <= lat <= lat_max) and (long_min <= long <= long_max)

df_preprocessed['valid_location'] = df_preprocessed.apply(lambda x: validate_lat_long(x['lat'], x['long']), axis=1)
df_preprocessed['valid_location'].value_counts()

valid_location
True    30
Name: count, dtype: int64

<a id="name_matching"></a>
# Name Matching

In [14]:
from jellyfish import jaro_winkler_similarity

In [None]:
def group_similar_restaurant(df, name_column, threshold=0.7):
    """
    Group similar restaurant names in a DataFrame
    
    Parameters:
    df: DataFrame containing restaurant names
    name_column: Column name containing the restaurant names
    threshold: Similarity threshold (0-1) for grouping
    
    Returns:
    DataFrame with original data plus name_cluster column
    """
    # Clean and normalize text
    def clean_text(text):
        if not isinstance(text, str):
            return ""
        text = text.lower().strip()
        text = re.sub(r'\s+', ' ', text)
        return text
    
    # Split Thai and English components
    def split_thai_english(text):
        thai_chars = re.findall(r'[\u0E00-\u0E7F]+', text)
        eng_chars = re.findall(r'[a-z0-9]+', text)
        
        thai_text = ''.join(thai_chars)
        eng_text = ''.join(eng_chars)
        
        return thai_text, eng_text
    
    # Calculate similarity between two restaurant names
    def calculate_similarity(name1, name2):
        """
        There are 4 main cases to consider:
        1. Direct comparison of the full name
        2. Comparison of Thai and English components separately
        3. Comparison of romanized Thai components
        4. Comparison of variations of the names
        """
        # Clean both names
        name1 = clean_text(name1)
        name2 = clean_text(name2)
        
        # Direct comparison
        direct_sim = jaro_winkler_similarity(name1, name2)
        
        # Split and compare components
        thai1, eng1 = split_thai_english(name1)
        thai2, eng2 = split_thai_english(name2)

        # variations of the names
        variations1 = expand_variations(name1)
        variations2 = expand_variations(name2)
        
        component_sims = []
        
        # Compare Thai components if both exist
        if thai1 and thai2:
            thai_sim = jaro_winkler_similarity(thai1, thai2)
            component_sims.append(thai_sim)
            
            # Also compare romanized versions
            rom1 = romanize(thai1)
            rom2 = romanize(thai2)
            if rom1 and rom2:
                rom_sim = jaro_winkler_similarity(rom1, rom2)
                component_sims.append(rom_sim)
        
        # Compare variations
        var_sims = []
        for var1 in variations1:
            for var2 in variations2:
                var_sim = jaro_winkler_similarity(var1, var2)
                var_sims.append(var_sim)
        if var_sims:
            component_sims.append(max(var_sims))
        
        # Compare English components if both exist
        if eng1 and eng2:
            eng_sim = jaro_winkler_similarity(eng1, eng2)
            component_sims.append(eng_sim)

        
        # If we have component similarities, use the average
        if component_sims:
            component_sim = sum(component_sims) / len(component_sims)
            # Take the better of direct or component similarity
            return max(direct_sim, component_sim)
        else:
            return direct_sim
    
    # Create a copy of the input dataframe
    result_df = df.copy()
    
    # Get restaurant names
    restaurant_names = df[name_column].tolist()
    n = len(restaurant_names)
    
    # Start with each restaurant in its own group
    groups = [i for i in range(n)]
    
    # Compare each pair of restaurants
    for i in range(n):
        for j in range(i+1, n):
            # If they're already in the same group, skip
            if groups[i] == groups[j]:
                continue
            
            # Calculate similarity
            sim = calculate_similarity(restaurant_names[i], restaurant_names[j])
            
            # If similar enough, merge their groups
            if sim >= threshold:
                old_group = groups[j]
                new_group = groups[i]
                
                # Update all members of the old group
                for k in range(n):
                    if groups[k] == old_group:
                        groups[k] = new_group
    
    # Add group assignments to the result dataframe
    result_df['name_cluster'] = groups

    # renumber the name_cluster to start from 1
    name_cluster_map = {name_cluster: i+1 for i, name_cluster in enumerate(result_df['name_cluster'].unique())}
    result_df['name_cluster'] = result_df['name_cluster'].map(name_cluster_map)
    return result_df
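
The merge loop above relabels every member of a group on each merge. A disjoint-set (union-find) structure is a common alternative that produces the same transitive grouping. The sketch below is self-contained; `group_by_union_find` and its `similar_pairs` argument (index pairs whose similarity met the threshold) are hypothetical names used only for illustration.

```python
def group_by_union_find(n, similar_pairs):
    """Return 1-based group labels for n items given pairs judged similar."""
    parent = list(range(n))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in similar_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root index under the smaller one.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Number the groups 1..k in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(n)]

groups = group_by_union_find(5, [(0, 2), (3, 4), (2, 4)])  # [1, 2, 1, 1, 1]
```

Either way the grouping is transitive: if A matches B and B matches C, all three share a group even when A and C are dissimilar. A lower threshold therefore raises recall but lets such chains merge unrelated names, which lowers precision.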

In [16]:
name_matching_df = group_similar_restaurant(df, 'restaurant_name', threshold=0.7)
name_matching_df[['restaurant_name', 'name_cluster']].head()

Unnamed: 0,restaurant_name,name_cluster
0,Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1,1
1,Sushi Hiro พรอมานาด,2
2,ซูชิฮิโระ หองหล่อ,1
3,หองหล่อ Sushi Hiro ในตึก Eight,3
4,Hiro Sushi ทองหล่อ,3


In [17]:
# merge the grouped names back to the original dataframe
df_grouped = df.merge(name_matching_df[['restaurant_name', 'name_cluster']], on='restaurant_name', how='left')
df_grouped[['restaurant_name', 'group', 'name_cluster']]

# # renumber the name_cluster to start from 1
# name_cluster_map = {name_cluster: i+1 for i, name_cluster in enumerate(df_grouped['name_cluster'].unique())}
# df_grouped['name_cluster'] = df_grouped['name_cluster'].map(name_cluster_map)

Unnamed: 0,restaurant_name,group,name_cluster
0,Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1,1,1
1,Sushi Hiro พรอมานาด,2,2
2,ซูชิฮิโระ หองหล่อ,1,1
3,หองหล่อ Sushi Hiro ในตึก Eight,1,3
4,Hiro Sushi ทองหล่อ,1,3
5,Honmono Sushi (ฮอนโมโน ซูชิ) ทองหล่อ,3,1
6,ร้าน ซูชิ ฮอนโมโน แถว ทองหล่อ,3,1
7,Honmono Sushi สาขา เซ็นทรัล บางนา,4,1
8,ร้าน Sushi ฮอนโมโน บางนา,4,1
9,Honmono ซูชิ ซอย ทองหล่อ 23,3,1


In [19]:
print(classification_report(df_grouped['group'], df_grouped['name_cluster'], zero_division=0))

              precision    recall  f1-score   support

           1       0.29      0.50      0.36         4
           2       1.00      1.00      1.00         1
           3       0.00      0.00      0.00         3
           4       0.00      0.00      0.00         2
           5       0.00      0.00      0.00         3
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         1
           8       0.00      0.00      0.00         2
           9       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         5
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         5

    accuracy                           0.10        30
   macro avg       0.10      0.12      0.10        30
weighted avg       0.07      0.10      0.08        30



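📒 Name matching alone performs poorly. It groups by name only, so branches that share a name but sit at different locations end up in the same cluster. Note also that `classification_report` compares label values literally, so a correct cluster whose id differs from its `group` id is counted as an error.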
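
Because cluster ids are arbitrary, a pair-based metric is one way to evaluate grouping without depending on the label values. The sketch below counts, over all pairs of records, whether the pair is placed together in the true groups and in the predicted clusters. `pairwise_scores` is a hypothetical helper, not part of the notebook.

```python
from itertools import combinations

def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision, recall and F1 for two label sequences."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1          # correctly grouped together
        elif same_pred:
            fp += 1          # grouped together but should be apart
        elif same_true:
            fn += 1          # should be together but were split
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Identical groupings score 1.0 even when the ids differ.
perfect = pairwise_scores([1, 1, 2], [7, 7, 9])
# One true pair found, two wrong pairs added, one true pair missed.
partial = pairwise_scores([1, 1, 2, 2], [5, 5, 5, 7])
```

In the notebook this could be called as `pairwise_scores(df_grouped['group'].tolist(), df_grouped['name_cluster'].tolist())`.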

<a id="location_matching"></a>
# Location Matching

In [20]:
from plotly import express as px
from sklearn.cluster import DBSCAN
import numpy as np

In [21]:

fig = px.scatter_geo(df, 
                     lat='lat', 
                     lon='long', 
                     scope = 'asia',
                     color='restaurant_name', 
                     hover_name='restaurant_name', 
                     projection='mercator',  # Changed projection
                     center={"lat": 13.7563, "lon": 100.5018},
                     )
fig.update_geos(
        fitbounds="locations",  # This is crucial - fits the view to your data points
        visible=False,
        showcoastlines=True,
        coastlinecolor="RebeccaPurple",
        showland=True,
)
fig.update_layout(
                title='Restaurants in Bangkok',
                showlegend = True,
                mapbox = dict(
                    zoom=10
                )
        )
fig.show()

In [22]:
def cluster_restaurants_dbscan(df, max_distance_km=0.2):
    """
    Group restaurants using DBSCAN clustering based on geographic distance
    
    Parameters:
    df: DataFrame with latitude and longitude columns
    max_distance_km: Maximum distance in kilometers for restaurants to be considered in the same group
    
    Returns:
    DataFrame with an additional 'cluster' column indicating the group
    """
    # Extract lat/long in degrees (converted to radians when fitting below)
    coords = df[['lat', 'long']].values
    
    # Calculate epsilon parameter in radians (approximate conversion from km)
    kms_per_radian = 6371.0  # Earth's radius in km
    epsilon = max_distance_km / kms_per_radian
    
    # Run DBSCAN clustering
    db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine')
    cluster_labels = db.fit_predict(np.radians(coords))
    
    # Add cluster labels to the dataframe
    df_result = df.copy()
    df_result['cluster'] = cluster_labels
    
    # Count number of clusters
    n_clusters = len(set(cluster_labels))
    print(f'Number of clusters: {n_clusters}')

    # Shift the labels by 1 so cluster ids start from 1
    df_result['cluster'] = df_result['cluster'] + 1
    
    return df_result

In [23]:
# Run the clustering methods
print("DBSCAN Clustering:")
df_dbscan = cluster_restaurants_dbscan(df, max_distance_km=0.3)
print(df_dbscan[['restaurant_name', 'cluster']])

DBSCAN Clustering:
Number of clusters: 11
                                     restaurant_name  cluster
0        Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1        1
1                                Sushi Hiro พรอมานาด        2
2                                  ซูชิฮิโระ หองหล่อ        1
3                     หองหล่อ Sushi Hiro ในตึก Eight        1
4                                 Hiro Sushi ทองหล่อ        1
5               Honmono Sushi (ฮอนโมโน ซูชิ) ทองหล่อ        3
6                      ร้าน ซูชิ ฮอนโมโน แถว ทองหล่อ        3
7                 Honmono Sushi สาขา เซ็นทรัล บางนา         4
8                           ร้าน Sushi ฮอนโมโน บางนา        4
9                        Honmono ซูชิ ซอย ทองหล่อ 23        3
10                             Khao-Sō-i ซอยคอนแวนต์        5
11                    ร้าน Khao-Sō-i แถว ตึก CP สีลม        5
12                            Khao Sō-i Siam Paragon        6
13                        ร้าน ข้าว โซอิ ซอยคอนแวนต์        5
14             ร้าน ข้าว โซอ

In [24]:
# group with original group
df_grouped = df_grouped.merge(df_dbscan[['restaurant_name', 'cluster']], on='restaurant_name', how='left')
df_grouped[['restaurant_name', 'group', 'name_cluster', 'cluster']]

Unnamed: 0,restaurant_name,group,name_cluster,cluster
0,Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1,1,1,1
1,Sushi Hiro พรอมานาด,2,2,2
2,ซูชิฮิโระ หองหล่อ,1,1,1
3,หองหล่อ Sushi Hiro ในตึก Eight,1,3,1
4,Hiro Sushi ทองหล่อ,1,3,1
5,Honmono Sushi (ฮอนโมโน ซูชิ) ทองหล่อ,3,1,3
6,ร้าน ซูชิ ฮอนโมโน แถว ทองหล่อ,3,1,3
7,Honmono Sushi สาขา เซ็นทรัล บางนา,4,1,4
8,ร้าน Sushi ฮอนโมโน บางนา,4,1,4
9,Honmono ซูชิ ซอย ทองหล่อ 23,3,1,3


In [25]:
print(classification_report(df_grouped['group'], df_grouped['cluster'], zero_division=0))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00         4
           2       1.00      1.00      1.00         1
           3       1.00      1.00      1.00         3
           4       1.00      1.00      1.00         2
           5       1.00      1.00      1.00         3
           6       0.17      1.00      0.29         1
           7       1.00      1.00      1.00         1
           8       0.67      1.00      0.80         2
           9       1.00      1.00      1.00         1
          10       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         5
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         5

    accuracy                           0.60        30
   macro avg       0.60      0.69      0.62        30
weighted avg       0.55      0.60      0.56        30



Location matching alone reaches 60% accuracy on all 30 records with a radius of 300 meters (`max_distance_km=0.3`).
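
The radius can be sanity-checked by computing the great-circle distance between two records directly. The sketch below implements the haversine formula with the same Earth radius (6371 km) used for `epsilon` above; `haversine_km` is a hypothetical helper, and the coordinates are taken from the first rows of the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Row 0 vs. row 4: both Sushi Hiro Thonglor, well inside a 300 m radius.
near = haversine_km(13.730788, 100.581716, 13.730751, 100.582316)
# Row 0 vs. row 1: a different branch, several kilometers away.
far = haversine_km(13.730788, 100.581716, 13.826534, 100.676388)
```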

<a id="combined_matching_and_evaluation"></a>
# Combined Matching & Evaluation

In [26]:
# Integrate name matching and location matching in one function with a configurable weight
def integrate_matching(df, name_threshold=0.7, distance_threshold_km=0.2, name_weight=0.5):
    """
    Integrate name and location matching to group restaurants
    
    Parameters:
    df: DataFrame with restaurant data
    name_threshold: Similarity threshold for name matching
    distance_threshold_km: Maximum distance in kilometers for location matching
    name_weight: Weight for name matching in combined score
    
    Returns:
    DataFrame with 'name_cluster', 'cluster' and 'combined_score' columns added
    """
    # Group similar restaurant names
    name_matching_df = group_similar_restaurant(df, 'restaurant_name', threshold=name_threshold)
    
    # Cluster restaurants based on location
    location_matching_df = cluster_restaurants_dbscan(df, max_distance_km=distance_threshold_km)
    
    # Merge the two matching results
    df_grouped = df.merge(name_matching_df[['restaurant_name', 'name_cluster']], on='restaurant_name', how='left')
    df_grouped = df_grouped.merge(location_matching_df[['restaurant_name', 'cluster']], on='restaurant_name', how='left')
    
    # Combine the two results as a weighted average of the cluster ids
    df_grouped['combined_score'] = name_weight * df_grouped['name_cluster'] + (1 - name_weight) * df_grouped['cluster']
    
    return df_grouped

In [31]:
# Run the integrated matching
df_integrated = integrate_matching(df, name_threshold=0.7, distance_threshold_km=0.2, name_weight=0.2)
# Note: astype(int) truncates toward zero (2.6 -> 2); it does not round to the nearest integer
df_integrated['combined_score_rounded'] = df_integrated['combined_score'].astype(int)
df_integrated[['restaurant_name', 'group', 'name_cluster', 'cluster', 'combined_score', 'combined_score_rounded']]

Number of clusters: 13


Unnamed: 0,restaurant_name,group,name_cluster,cluster,combined_score,combined_score_rounded
0,Sushi Hiro (ซูชิฮิโระ) Eight Thonglor ชั้น1,1,1,1,1.0,1
1,Sushi Hiro พรอมานาด,2,2,2,2.0,2
2,ซูชิฮิโระ หองหล่อ,1,1,3,2.6,2
3,หองหล่อ Sushi Hiro ในตึก Eight,1,3,1,1.4,1
4,Hiro Sushi ทองหล่อ,1,3,1,1.4,1
5,Honmono Sushi (ฮอนโมโน ซูชิ) ทองหล่อ,3,1,4,3.4,3
6,ร้าน ซูชิ ฮอนโมโน แถว ทองหล่อ,3,1,4,3.4,3
7,Honmono Sushi สาขา เซ็นทรัล บางนา,4,1,5,4.2,4
8,ร้าน Sushi ฮอนโมโน บางนา,4,1,5,4.2,4
9,Honmono ซูชิ ซอย ทองหล่อ 23,3,1,4,3.4,3


In [32]:
# evaluate the integrated matching
print(classification_report(df_integrated['group'], df_integrated['combined_score_rounded'], zero_division=0))

              precision    recall  f1-score   support

           1       1.00      0.75      0.86         4
           2       0.50      1.00      0.67         1
           3       1.00      1.00      1.00         3
           4       1.00      1.00      1.00         2
           5       1.00      1.00      1.00         3
           6       1.00      1.00      1.00         1
           7       0.17      1.00      0.29         1
           8       1.00      0.50      0.67         2
           9       0.33      1.00      0.50         1
          10       0.00      0.00      0.00         1
          11       0.83      1.00      0.91         5
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         5

    accuracy                           0.70        30
   macro avg       0.60      0.71      0.61        30
weighted avg       0.67      0.70      0.66        30



- a. I combined the name and location results as a weighted average of the two cluster labels: the name cluster is multiplied by a weight of 0.2 and the location cluster by (1 - 0.2) = 0.8. The combined score is then truncated to an integer (`astype(int)`, so 2.6 becomes 2) to assign the final group. I weighted location more heavily than name because location matching performed better on its own (60% vs. 10% accuracy).
- b. Evaluation results
    - Accuracy improves to 70%.
    - Classes 3, 4, 5 and 6 are matched perfectly (precision and recall of 1.00).
    - Classes 10, 12 and 13 fail completely (all scores are 0.00). I think the failure of class 13 comes from `transliteration_map`, where the same English word maps to different Thai spellings.
- c. Limitations
    - Name matching only handles Thai and English.
    - Name matching relies on Thai romanization (pythainlp).
    - The predefined variations may not cover every possible spelling.
    - The function does not understand meaning; it cannot recognize that "Orange" and "ส้ม" are translations of the same name.
    - The pairwise comparison is not optimized for large datasets.
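
A further limitation is that cluster ids are arbitrary labels, so their weighted average has no inherent meaning. One alternative, sketched here under assumptions and not the method used above, is to weight pairwise similarities instead: a name similarity in [0, 1] and a location similarity that decays linearly to 0 at the chosen radius. `combined_score`, `radius_km` and the 0.5 decision threshold are hypothetical choices for illustration.

```python
def combined_score(name_sim, distance_km, name_weight=0.2, radius_km=0.3):
    """Weighted pairwise score from name similarity and distance.

    name_sim    : similarity in [0, 1], e.g. a Jaro-Winkler score
    distance_km : distance between the two records in kilometers
    """
    # 1.0 at the same spot, falling linearly to 0.0 at radius_km and beyond.
    location_sim = max(0.0, 1.0 - distance_km / radius_km)
    return name_weight * name_sim + (1.0 - name_weight) * location_sim

def is_match(name_sim, distance_km, threshold=0.5, **kwargs):
    """Decide whether two records belong to the same group."""
    return combined_score(name_sim, distance_km, **kwargs) >= threshold

# Same name, same building: strong match.
same_branch = is_match(name_sim=0.9, distance_km=0.05)
# Same name, about 15 km apart: treated as a different branch.
other_branch = is_match(name_sim=1.0, distance_km=15.0)
```

With a location weight of 0.8, two records with identical names but far apart score only 0.2, which keeps separate branches of the same chain in separate groups.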