# Beneficiary Matching Solution

### Objective
- Develop a system to match potential beneficiaries using limited details of the deceased, such as name and address.
- The output should be a ranked list of probable beneficiaries with associated probability scores.

### Initial Steps
1. **Input Data**:
   - Details provided: Name, Full Address, City, State, and ZIP Code of the deceased.
2. **Output Data**:
   - A ranked list of probable beneficiaries with probability scores.

### 1. Preprocessing
- **Standardize Data**: Convert names and addresses to a uniform format (e.g., lowercase, remove special characters).
- **Tokenize and Encode**: Use text embeddings (like BERT, Sentence-BERT) for names and addresses to capture their meanings and nuances.
- **Choose Similarity Metrics**: Cosine similarity or Jaccard index to compare names and addresses between deceased and potential beneficiaries.
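
The standardization and Jaccard steps above can be sketched as follows. This is a minimal illustration that assumes ASCII text and whitespace tokenization; the helper names `standardize` and `jaccard` are hypothetical and do not appear elsewhere in this notebook.

```python
import re


def standardize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def jaccard(a: str, b: str) -> float:
    """Jaccard index over the token sets of two standardized strings."""
    tokens_a = set(standardize(a).split())
    tokens_b = set(standardize(b).split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(standardize("123 Elm St., Springfield, IL 62701"))
print(jaccard("123 Elm St, Springfield, IL", "123 elm street springfield il"))
```

In the second call, four of the six distinct tokens are shared, so the abbreviation `st` versus `street` is what lowers the score. Token-set metrics like this do not resolve abbreviations on their own.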

### 2. Feature Engineering
- **Name Similarity**: Use Levenshtein Distance, Jaro-Winkler, or BERT-based similarity measures.
- **Address Similarity**: Handle address matching using text embeddings or by comparing individual components (street, city, state).
- **Location Proximity**: Consider using geolocation (lat/long) if available, to refine matches based on physical proximity.
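
Two of the features above can be illustrated with the standard library alone. The sketch below assumes a plain Levenshtein edit distance normalized by the longer string's length, and the haversine formula with a mean Earth radius of 6371 km for proximity. The function names are illustrative, and in practice a library such as `fuzzywuzzy` (used later in this notebook) would replace the hand-written distance.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0  # mean Earth radius (assumption)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


print(name_similarity("john doe", "jon doe"))
print(haversine_km(0.0, 0.0, 1.0, 0.0))
```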

### 3. Matching Process
- **Model Type**: Start with an unsupervised approach focused on scoring and ranking based on similarities.
- **Composite Scoring**: Combine individual similarity scores into a composite score to rank potential matches. Adjust weighting to balance name and address importance.

### Implementation Strategy
- Use Sentence-BERT or a similar model to handle semantic search and similarity scoring.
- Clustering could be useful for grouping similar entries and reducing search space, especially for larger datasets.
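
One simple way to reduce the search space is blocking, shown here as a lighter-weight stand-in for clustering: records are grouped by a cheap key and only the target's group is scored. The sketch assumes each record carries a hypothetical `zip` field and uses the first three characters as the key; both choices are illustrative.

```python
from collections import defaultdict


def build_blocks(records, prefix_len=3):
    """Group records by the first `prefix_len` characters of their ZIP code."""
    blocks = defaultdict(list)
    for record in records:
        blocks[str(record["zip"])[:prefix_len]].append(record)
    return blocks


def candidates_for(blocks, target_zip, prefix_len=3):
    """Return only the records that share a ZIP prefix with the target."""
    return blocks.get(str(target_zip)[:prefix_len], [])


records = [
    {"name": "Jane Doe", "zip": "62701"},
    {"name": "Sam Roe", "zip": "62704"},
    {"name": "Ann Poe", "zip": "90210"},
]
blocks = build_blocks(records)
shortlist = candidates_for(blocks, "62701")
print([r["name"] for r in shortlist])
```

The trade-off is recall: a true beneficiary who lives in a different ZIP region is never scored, so the blocking key should be chosen loosely enough for the use case.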


In [None]:
# Install necessary libraries
# !pip install pandas numpy scikit-learn sentence-transformers

import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load the dataset
df = pd.read_csv('expandedbenificiary_data.csv')

# Model for semantic similarity
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Deceased details (example input)
deceased_details = {
    'Name': 'John Doe',
    'Full Address': '123 Elm St, Downtown, Springfield, IL, 62701'
}

# Encode the deceased's name and address once
deceased_name_embedding = model.encode(deceased_details['Name'], convert_to_tensor=True)
deceased_address_embedding = model.encode(deceased_details['Full Address'], convert_to_tensor=True)

# Encode all beneficiary names and addresses in batches rather than row by row
beneficiary_name_embeddings = model.encode(df['Name'].tolist(), convert_to_tensor=True)
beneficiary_address_embeddings = model.encode(df['Full Address'].tolist(), convert_to_tensor=True)

# Cosine similarity between the deceased and every beneficiary (one value per row)
name_similarity = util.pytorch_cos_sim(deceased_name_embedding, beneficiary_name_embeddings)[0].cpu().numpy()
address_similarity = util.pytorch_cos_sim(deceased_address_embedding, beneficiary_address_embeddings)[0].cpu().numpy()

# Composite score: equal weights for name and address
df['Probability Score'] = 0.5 * name_similarity + 0.5 * address_similarity

# Sort and select top matches
sorted_df = df.sort_values(by='Probability Score', ascending=False)
top_matches = sorted_df.head(10)

# Output top probable beneficiaries
print(top_matches[['Name', 'Full Address', 'Probability Score']])


### Considerations
- **Adjust Weights**: Experiment with different weights for the composite score to balance name and address relevance.
- **Expand Features**: Consider adding other non-sensitive features (e.g., age, partial identifiers) if available.
- **Feedback Loop**: Use feedback from actual matches to refine and adjust the model over time for better accuracy.


In [1]:
import os 
from dotenv import load_dotenv

os.chdir("..")
load_dotenv()

True

In [6]:
import pandas as pd
file_name = "research/benificiary_data.csv"

benificiary_data = pd.read_csv(file_name)

In [7]:
benificiary_data

Unnamed: 0,Name,Address,Pin Code,Birth Date,Age
0,Peter Hill,"3848 107th St, Antioch, VT 79559",79559,1968-07-20,56
1,Mark Hill,"9306 37th St, Garden Grove, UT 77711",77711,1988-11-08,36
2,Jennifer Williams,"2293 197th St, Temecula, RI 72774",72774,1969-07-25,55
3,Jennifer Wilson,"4635 144th St, Pembroke Pines, CA 47939",47939,1941-02-03,83
4,James Allen,"2091 38th St, Palm Bay, FL 27049",27049,1955-10-27,69
...,...,...,...,...,...
9995,Jason Martin,"7678 180th St, Pomona, WI 93819",93819,1961-01-26,63
9996,Thomas Rodriguez,"349 158th St, West Palm Beach, NH 13869",13869,1975-10-11,49
9997,Eric Reed,"7709 151st St, St. Louis, HI 21130",21130,1961-09-14,63
9998,Charles Murphy,"1088 186th St, Jurupa Valley, AR 34692",34692,2004-03-22,20


In [8]:
import os
import pandas as pd
from pandasai import Agent

from pandasai.llm import OpenAI
from pandasai import SmartDataframe


# os.environ["PANDASAI_API_KEY"] = "YOUR_API_KEY"
llm = OpenAI()

data = SmartDataframe(file_name,
                       config={"llm": llm})


In [9]:

query = "Return the top five data"
data.chat(query)

{'type': 'dataframe', 'value':                 Name                                  Address  Pin Code  \
0         Peter Hill         3848 107th St, Antioch, VT 79559     79559   
1          Mark Hill     9306 37th St, Garden Grove, UT 77711     77711   
2  Jennifer Williams        2293 197th St, Temecula, RI 72774     72774   
3    Jennifer Wilson  4635 144th St, Pembroke Pines, CA 47939     47939   
4        James Allen         2091 38th St, Palm Bay, FL 27049     27049   

   Birth Date  Age  
0  1968-07-20   56  
1  1988-11-08   36  
2  1969-07-25   55  
3  1941-02-03   83  
4  1955-10-27   69  }


Unnamed: 0,Name,Address,Pin Code,Birth Date,Age
0,Peter Hill,"3848 107th St, Antioch, VT 79559",79559,1968-07-20,56
1,Mark Hill,"9306 37th St, Garden Grove, UT 77711",77711,1988-11-08,36
2,Jennifer Williams,"2293 197th St, Temecula, RI 72774",72774,1969-07-25,55
3,Jennifer Wilson,"4635 144th St, Pembroke Pines, CA 47939",47939,1941-02-03,83
4,James Allen,"2091 38th St, Palm Bay, FL 27049",27049,1955-10-27,69


In [10]:
data.chat("Extract the city name from the first record")

{'type': 'string', 'value': 'The city name extracted from the first record is Antioch.'}


'The city name extracted from the first record is Antioch.'

In [11]:
query = "How many unique cities are there"
data.chat(query)

{'type': 'number', 'value': 3730}


3730

In [12]:
query = "How many unique surnames are there"
data.chat(query)

76

In [13]:
query = "Does address has duplicate values?"
data.chat(query)

{'type': 'string', 'value': 'There are duplicate addresses.'}


'There are duplicate addresses.'

In [19]:
import pandas as pd
from fuzzywuzzy import fuzz
from datetime import datetime


class BeneficiaryMatcher:
    def __init__(self, data):
        """
        Initialize the BeneficiaryMatcher with a dataset.
        
        Parameters:
        data (pd.DataFrame): DataFrame containing the dataset with columns 'Name', 'Address', 'Pin Code', 'Birth Date', 'Age'
        """
        self.data = data.copy()  # work on a copy so the caller's DataFrame is not modified
        self.clean_data()
    
    def clean_data(self):
        """Clean the dataset by standardizing the text fields and handling missing data."""
        self.data['Name'] = self.data['Name'].str.lower().str.strip()
        self.data['Address'] = self.data['Address'].str.lower().str.strip()
        self.data['Pin Code'] = self.data['Pin Code'].astype(str).str.strip()
        self.data['Birth Date'] = pd.to_datetime(self.data['Birth Date'], errors='coerce')
        self.data['Age'] = self.data['Age'].fillna(self.data['Birth Date'].apply(lambda x: self.calculate_age(x) if pd.notnull(x) else None))
        self.data.dropna(subset=['Name', 'Address', 'Pin Code'], inplace=True)

    def calculate_age(self, birth_date):
        """Calculate age from birth date."""
        today = datetime.today()
        return today.year - birth_date.year - ((today.month, today.day) < (birth_date.month, birth_date.day))

    def pin_code_similarity(self, pin1, pin2):
        """
        Calculate similarity between two pin codes using a combination of numeric distance and string similarity.
        
        Parameters:
        pin1, pin2 (str): Pin codes to compare
        
        Returns:
        float: A similarity score between 0 and 100.
        """
        try:
            # Numeric similarity based on distance (if pin codes are numerically close)
            distance = abs(int(pin1) - int(pin2))
            # Normalize distance to a score between 0 and 100
            numeric_similarity = max(0, 100 - (distance / 100) * 10)  # Adjust scaling factor as needed
            
            # String similarity for cases where pin codes might have similar patterns
            string_similarity = fuzz.ratio(pin1, pin2)
            
            # Combine numeric and string similarities (weighting can be adjusted)
            combined_similarity = (0.6 * numeric_similarity) + (0.4 * string_similarity)
            return combined_similarity
        except ValueError:
            # If pin codes can't be compared numerically, fallback to string similarity
            return fuzz.ratio(pin1, pin2)

    def find_potential_beneficiaries(self, target, threshold=70):
        """
        Find potential beneficiaries matching the target person based on name, address, pin code, and age.
        
        Parameters:
        target (dict): A dictionary containing the target's details: 'Name', 'Address', 'Pin Code', 'Age'
        threshold (int): The minimum score threshold for fuzzy matching (default: 70)
        
        Returns:
        pd.DataFrame: A DataFrame with potential matches sorted by probability score.
        """
        target_name = target['Name'].lower().strip()
        target_address = target['Address'].lower().strip()
        target_pin = str(target['Pin Code']).strip()
        target_age = target.get('Age', None)
        
        self.data['Name Score'] = self.data['Name'].apply(lambda x: fuzz.ratio(x, target_name))
        self.data['Address Score'] = self.data['Address'].apply(lambda x: fuzz.partial_ratio(x, target_address))
        self.data['Pin Code Score'] = self.data['Pin Code'].apply(lambda x: self.pin_code_similarity(x, target_pin))
        
        # Calculate Age Difference Score if age is available
        if target_age is not None:
            self.data['Age Diff'] = self.data['Age'].apply(lambda x: abs(x - target_age) if pd.notnull(x) else None)
            self.data['Age Score'] = self.data['Age Diff'].apply(lambda x: max(0, 100 - x) if x is not None else 0)
        else:
            self.data['Age Score'] = 0
        
        # Combine scores into a single probability score
        self.data['Probability Score'] = (
            0.4 * self.data['Name Score'] +
            0.3 * self.data['Address Score'] +
            0.2 * self.data['Pin Code Score'] +
            0.1 * self.data['Age Score']
        )
        
        # Filter potential matches based on a probability threshold
        potential_matches = self.data[self.data['Probability Score'] >= threshold]
        potential_matches = potential_matches.sort_values(by='Probability Score', ascending=False)
        
        return potential_matches[['Name', 'Address', 'Pin Code', 'Birth Date', 'Age', 'Probability Score']]
    



The age score in the `BeneficiaryMatcher` class is designed to compare the target person's age with the ages in the dataset and assign a similarity score based on the difference. Here's how the age score calculation works:

### Age Score Calculation

1. **Age Difference Calculation:**
   - First, the difference between the target person's age and the age of each individual in the dataset is calculated.
   - This is done using the expression: `abs(x - target_age)`, where `x` is the age of the individual in the dataset, and `target_age` is the age of the target person.

2. **Converting Age Difference to a Score:**
   - The age difference is converted into a similarity score using the formula `max(0, 100 - x)`, where `x` now stands for the age difference computed in step 1.
   - Each year of difference lowers the score by one point, so a smaller difference gives a higher score.
   - A difference of `0` years gives a score of `100`, indicating a perfect match.
   - A difference of `10` years gives a score of `90`.
   - A difference of `100` years or more gives a score of `0`.

3. **Handling Missing Ages:**
   - If an individual's age is missing (or cannot be calculated from their birth date), the score is set to `0`.
   - This handling assumes that missing age data implies no confidence in the match based on age.

### Example

If the target age is `44`, and we compare it with an individual's age of:

- **Age 44:** `abs(44 - 44) = 0`, Age Score = `100`
- **Age 40:** `abs(44 - 40) = 4`, Age Score = `96`
- **Age 50:** `abs(44 - 50) = 6`, Age Score = `94`
- **Age 60:** `abs(44 - 60) = 16`, Age Score = `84`
- **Age 70:** `abs(44 - 70) = 26`, Age Score = `74`
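
The figures above can be reproduced with a short standalone function. This is a sketch of the same formula outside the class; the name `age_score` is illustrative.

```python
def age_score(candidate_age, target_age):
    """Return 100 minus the absolute age difference, floored at 0.

    A missing candidate age (None) scores 0, mirroring the rule above.
    """
    if candidate_age is None:
        return 0
    return max(0, 100 - abs(candidate_age - target_age))


target_age = 44
for candidate_age in (44, 40, 50, 60, 70):
    print(candidate_age, age_score(candidate_age, target_age))
```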



### Summary

- **High Age Score:** Indicates the ages are very close, suggesting a higher likelihood of a match.
- **Low Age Score:** Indicates a significant difference in ages, suggesting a lower likelihood of a match.
- Age carries a weight of `0.1` in the overall probability score, so an age mismatch can lower a candidate's score by at most 10 points; it refines the ranking rather than deciding it.

In [20]:
matcher = BeneficiaryMatcher(benificiary_data)
target_person = {
    'Name': 'John Thompson',
    'Address': '9745 178nd St, Wichita, IN 37955',
    'Pin Code': 37959,
    'Age': 44
}

# Find potential beneficiaries
potential_beneficiaries = matcher.find_potential_beneficiaries(target_person, threshold=70)
print(potential_beneficiaries)

                Name                           Address Pin Code Birth Date  \
28  rebecca thompson  9605 172nd st, wichita, in 37955    37955 2000-02-11   

    Age  Probability Score  
28   24             78.452  


The threshold in the `find_potential_beneficiaries` method represents the minimum probability score required for a record to be considered a potential match. In the provided implementation, the default threshold is set to **70**.

### How the Threshold Works

- **Threshold Value:** A threshold of 70 means that only records with a probability score of 70 or higher are included in the final results.
- **Score Calculation:** The probability score is a weighted combination of individual matching scores (name, address, pin code, and age). Each component contributes to the overall score, and the combined score reflects the overall similarity to the target person.
- **Adjusting the Threshold:**
  - **Lower Threshold:** Setting the threshold lower (e.g., 60) will include more potential matches, including those that may be less similar.
  - **Higher Threshold:** Setting the threshold higher (e.g., 80) will result in fewer matches but with higher confidence in their similarity.

### Adjusting the Threshold Based on Your Needs

The threshold parameter can be adjusted when calling `find_potential_beneficiaries`, depending on whether it is more important to catch every plausible match (lower threshold) or to avoid false matches (higher threshold):

```python
# Example with a custom threshold
potential_beneficiaries = matcher.find_potential_beneficiaries(target_person, threshold=75)  # Custom threshold of 75
print(potential_beneficiaries)
```

Choosing a threshold is a trade-off between the number of matches returned and the confidence in each one. The default of 70 is a starting point; adjust it based on the results observed during testing.

------

# Beneficiary Matcher Scoring 

Let's discuss how the scoring is calculated for Name, Address, Pin Code, and Probability in the `BeneficiaryMatcher` class.

## 1. Name Score
- **Method Used:** Fuzzy String Matching using `fuzz.ratio` from the `fuzzywuzzy` library.
- **Calculation:**
  - Compares the target person's name with each name in the dataset.
  - Returns a similarity score between 0 and 100.
  - Higher scores indicate closer matches.
  
  ```python
  self.data['Name Score'] = self.data['Name'].apply(lambda x: fuzz.ratio(x, target_name))
  ```
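
To get a feel for a 0-100 ratio without installing `fuzzywuzzy`, the sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. The helper name `ratio_0_100` is hypothetical, and `fuzz.ratio` may return slightly different values depending on the backend it uses.

```python
from difflib import SequenceMatcher


def ratio_0_100(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


target_name = "john thompson"
for candidate in ("john thompson", "rebecca thompson", "peter hill"):
    print(candidate, ratio_0_100(candidate, target_name))
```

A shared surname lifts the score well above that of an unrelated name, which is why relatives tend to rank highly on this component.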

## 2. Address Score
- **Method Used:** Partial Fuzzy String Matching using `fuzz.partial_ratio`.
- **Calculation:**
  - Compares the target person's address with each address in the dataset.
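  - Uses partial matching, which scores the shorter string against the best-matching substring of the longer one, so an address that is a fragment of another still scores highly.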
  - Returns a score between 0 and 100.
  
  ```python
  self.data['Address Score'] = self.data['Address'].apply(lambda x: fuzz.partial_ratio(x, target_address))
  ```
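
The idea behind partial matching can be sketched by sliding the shorter string across the longer one and keeping the best window. This is a simplified stand-in built on `difflib`, not the algorithm `fuzz.partial_ratio` actually uses, and the name `partial_ratio_sketch` is hypothetical.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 similarity between the shorter string and any
    equal-length window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


full_address = "9605 172nd st, wichita, in 37955"
fragment = "wichita, in 37955"
print(partial_ratio_sketch(fragment, full_address))
print(round(100 * SequenceMatcher(None, fragment, full_address).ratio()))
```

The first value is a perfect score because the fragment occurs verbatim inside the full address, while the plain ratio on the second line is penalized for the missing street portion.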

## 3. Pin Code Score
- **Method Used:** Combination of Numeric Distance and Fuzzy String Matching.
- **Calculation:**
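  - Converts the numeric distance between pin codes into a similarity score, `max(0, 100 - distance / 10)`, so pin codes 1,000 or more apart score 0 on this component.
  - Uses `fuzz.ratio` for string-based similarity.
  - Combines the two as `0.6 * numeric + 0.4 * string`; if a pin code is not numeric, only the string similarity is used.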

  ```python
  self.data['Pin Code Score'] = self.data['Pin Code'].apply(lambda x: self.pin_code_similarity(x, target_pin))
  ```
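
A standalone version of the pin code calculation makes the weighting easier to inspect. The sketch below mirrors the formula in `pin_code_similarity` but substitutes `difflib.SequenceMatcher` for `fuzz.ratio`, so values may differ slightly from the class; the helper `string_similarity` is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio: 0-100 similarity from the standard library."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pin_code_similarity(pin1: str, pin2: str) -> float:
    """0.6 * numeric closeness + 0.4 * string similarity, as in the class."""
    try:
        distance = abs(int(pin1) - int(pin2))
        numeric = max(0, 100 - distance / 10)  # 0 once codes are 1,000+ apart
        return 0.6 * numeric + 0.4 * string_similarity(pin1, pin2)
    except ValueError:
        # Non-numeric pin codes fall back to string similarity only
        return string_similarity(pin1, pin2)


near = pin_code_similarity("37955", "37959")
far = pin_code_similarity("37955", "79559")
print(near, far)
```

Under these assumptions the near pair scores about 91.76. The far pair still scores 32 because the two strings share the digits `7955`, which shows that the string component can reward pin codes that are numerically distant.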

## 4. Age Score
- **Method Used:** Age difference converted to a similarity score.
- **Calculation:**
  - Computes the difference between the target age and each age in the dataset.
  - Converts the difference into a score (0-100), with closer ages receiving higher scores.
  
  ```python
  self.data['Age Score'] = self.data['Age Diff'].apply(lambda x: max(0, 100 - x) if x is not None else 0)
  ```

## 5. Probability Score
- **Method Used:** Weighted Combination of Name, Address, Pin Code, and Age Scores.
- **Calculation:**
  - Combines the individual scores into a single Probability Score using the formula:
  
  ```python
  self.data['Probability Score'] = (
      0.4 * self.data['Name Score'] +
      0.3 * self.data['Address Score'] +
      0.2 * self.data['Pin Code Score'] +
      0.1 * self.data['Age Score']
  )
  ```

- **Threshold:** Potential matches are filtered based on a probability score threshold (default: 70).
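
A worked example with made-up component scores shows how the weights combine. The input values below are hypothetical and chosen only to make the arithmetic easy to follow; the weights are the ones used in the class.

```python
# Weights taken from the Probability Score formula above
WEIGHTS = {"name": 0.4, "address": 0.3, "pin": 0.2, "age": 0.1}


def probability_score(scores: dict) -> float:
    """Weighted sum of the four component scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)


# Hypothetical component scores for one candidate
example = {"name": 80, "address": 90, "pin": 95, "age": 70}
score = probability_score(example)  # 32 + 27 + 19 + 7
print(score, score >= 70)
```

Because the age component is set to 0 when no target age is supplied, the highest reachable score in that case is 90, which is worth keeping in mind when choosing a threshold.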

This scoring system is designed to provide a comprehensive measure of similarity between the target person and potential beneficiaries, incorporating multiple relevant data points.