# Beneficiary Matching Solution

### Objective
- Develop a system to match potential beneficiaries using limited details of the deceased, such as name and address.
- The output should be a ranked list of probable beneficiaries with associated probability scores.

### Initial Steps
1. **Input Data**:
   - Details provided: Name, Full Address, City, State, and ZIP Code of the deceased.
2. **Output Data**:
   - A ranked list of probable beneficiaries with probability scores.

### 1. Preprocessing
- **Standardize Data**: Convert names and addresses to a uniform format (e.g., lowercase, remove special characters).
- **Tokenize and Encode**: Encode names and addresses as text embeddings (e.g., BERT, Sentence-BERT) so that differently worded but equivalent strings map to nearby vectors.
- **Choose Similarity Metrics**: Use cosine similarity to compare embedding vectors, or the Jaccard index to compare token sets, between the deceased and each potential beneficiary.
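
The standardization and token-overlap steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual preprocessing: the helper names `normalize_text` and `jaccard` are hypothetical, and the exact cleaning rules (accent stripping, replacing punctuation with spaces) are assumptions.

```python
import re
import unicodedata


def normalize_text(value: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or whitespace with a space
    cleaned = re.sub(r"[^a-z0-9\s]", " ", unaccented.lower())
    # Collapse runs of whitespace into single spaces
    return " ".join(cleaned.split())


def jaccard(a: str, b: str) -> float:
    """Jaccard index of the normalized token sets of two strings."""
    tokens_a = set(normalize_text(a).split())
    tokens_b = set(normalize_text(b).split())
    if not tokens_a and not tokens_b:
        return 0.0  # assumption: two empty strings count as no evidence of a match
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(normalize_text("123 Elm St., Springfield, IL 62701"))
print(jaccard("John Doe", "Jane Doe"))
```

Normalizing before comparison keeps casing and punctuation differences from lowering the overlap score.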

### 2. Feature Engineering
- **Name Similarity**: Use Levenshtein Distance, Jaro-Winkler, or BERT-based similarity measures.
- **Address Similarity**: Handle address matching using text embeddings or by comparing individual components (street, city, state).
- **Location Proximity**: Consider using geolocation (lat/long) if available, to refine matches based on physical proximity.
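
As a rough sketch of two of the features listed above, the block below computes a normalized Levenshtein similarity for names and a haversine distance for location proximity. The function names are hypothetical, the coordinates are assumed to come from a geocoding step that this notebook does not include, and Jaro-Winkler is left out for brevity.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1], where 1.0 means identical."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


print(name_similarity("Jon Doe", "John Doe"))
```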

### 3. Matching Process
- **Model Type**: Start with an unsupervised approach focused on scoring and ranking based on similarities.
- **Composite Scoring**: Combine individual similarity scores into a composite score to rank potential matches. Adjust weighting to balance name and address importance.
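
One way to express the composite scoring described above is a single weight that trades name similarity against address similarity. This is a sketch under assumptions: `composite_score`, `rank_candidates`, and the `(label, name_sim, address_sim)` tuple layout are hypothetical, and the similarities are taken as already computed.

```python
def composite_score(name_sim: float, address_sim: float, name_weight: float = 0.5) -> float:
    """Weighted average of the two similarities; the address gets 1 - name_weight."""
    if not 0.0 <= name_weight <= 1.0:
        raise ValueError("name_weight must be between 0 and 1")
    return name_weight * name_sim + (1.0 - name_weight) * address_sim


def rank_candidates(candidates, name_weight: float = 0.5):
    """Rank (label, name_sim, address_sim) tuples by composite score, best first."""
    scored = [(label, composite_score(n, a, name_weight)) for label, n, a in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: A has the closer name, B the closer address
candidates = [("A", 0.9, 0.2), ("B", 0.6, 0.8)]
print(rank_candidates(candidates, name_weight=0.5))
print(rank_candidates(candidates, name_weight=0.9))
```

Changing `name_weight` can reorder the ranking, which is why the weighting is worth tuning instead of being fixed at 0.5.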

### Implementation Strategy
- Use Sentence-BERT or a similar model to handle semantic search and similarity scoring.
- Clustering could be useful for grouping similar entries and reducing search space, especially for larger datasets.
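
The search-space reduction mentioned above can be illustrated with blocking, a simpler stand-in for clustering: records are grouped by a cheap key, and only the deceased's group is scored. The sketch assumes each record is a dict with a hypothetical `zip` field and uses the first three ZIP digits as the key. Neither the field name nor the key choice comes from the dataset.

```python
from collections import defaultdict


def build_blocks(records, key_length: int = 3):
    """Group record indices by the leading digits of their ZIP code."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[record["zip"][:key_length]].append(index)
    return dict(blocks)


def candidates_for(deceased_zip: str, blocks, key_length: int = 3):
    """Return indices of records that share the deceased's ZIP prefix."""
    return blocks.get(deceased_zip[:key_length], [])


# Hypothetical records; only the rows in the matching block need to be embedded and scored
records = [{"zip": "62701"}, {"zip": "62704"}, {"zip": "75990"}]
blocks = build_blocks(records)
print(candidates_for("62701", blocks))
```

Blocking trades recall for speed: a beneficiary who lives in a different ZIP prefix is never scored, so the key should be chosen loosely.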


In [None]:
# Install necessary libraries
# !pip install pandas numpy scikit-learn sentence-transformers

import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load the dataset
df = pd.read_csv('expandedbenificiary_data.csv')

# Model for semantic similarity
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Deceased details (example input)
deceased_details = {
    'Name': 'John Doe',
    'Full Address': '123 Elm St, Downtown, Springfield, IL, 62701'
}

# Encode the deceased's name and address
deceased_name_embedding = model.encode(deceased_details['Name'])
deceased_address_embedding = model.encode(deceased_details['Full Address'])

# Encode every beneficiary name and address in one batch each
# (far faster than calling model.encode once per row)
beneficiary_name_embeddings = model.encode(df['Name'].astype(str).tolist())
beneficiary_address_embeddings = model.encode(df['Full Address'].astype(str).tolist())

# Cosine similarity of the deceased against all beneficiaries (each result has shape 1 x N)
name_similarity = util.pytorch_cos_sim(deceased_name_embedding, beneficiary_name_embeddings)[0].numpy()
address_similarity = util.pytorch_cos_sim(deceased_address_embedding, beneficiary_address_embeddings)[0].numpy()

# Composite score: equal weighting of name and address similarity.
# Note: this is a cosine-based similarity in [-1, 1] used for ranking,
# not a calibrated probability.
df['Probability Score'] = 0.5 * name_similarity + 0.5 * address_similarity

# Sort and select top matches
sorted_df = df.sort_values(by='Probability Score', ascending=False)
top_matches = sorted_df.head(10)

# Output top probable beneficiaries
print(top_matches[['Name', 'Full Address', 'Probability Score']])


### Considerations
- **Adjust Weights**: Experiment with different weights for the composite score to balance name and address relevance.
- **Expand Features**: Consider adding other non-sensitive features (e.g., age, partial identifiers) if available.
- **Feedback Loop**: Use feedback from actual matches to refine and adjust the model over time for better accuracy.
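
The weight adjustment and feedback loop above could be combined in a small grid search, sketched below. It assumes confirmed matches are available as labeled cases; the `(candidates, true_index)` layout and both function names are hypothetical, and top-1 accuracy is only one possible objective.

```python
def top1_accuracy(cases, name_weight: float) -> float:
    """Share of cases where the confirmed beneficiary gets the highest composite score.

    Each case is (candidates, true_index); candidates is a list of
    (name_similarity, address_similarity) pairs.
    """
    if not cases:
        raise ValueError("at least one labeled case is required")
    hits = 0
    for candidates, true_index in cases:
        scores = [name_weight * n + (1.0 - name_weight) * a for n, a in candidates]
        best_index = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best_index == true_index)
    return hits / len(cases)


def best_name_weight(cases, steps: int = 10) -> float:
    """Try name weights 0.0, 0.1, ..., 1.0 and keep the most accurate (first on ties)."""
    weights = [i / steps for i in range(steps + 1)]
    return max(weights, key=lambda w: top1_accuracy(cases, w))


# Hypothetical feedback: in both cases the confirmed match is candidate 1
feedback = [
    ([(0.9, 0.1), (0.5, 0.9)], 1),
    ([(0.8, 0.2), (0.6, 0.95)], 1),
]
print(best_name_weight(feedback))
```

With only a handful of labeled cases the chosen weight will be noisy, so it is safer to re-run the search as feedback accumulates.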


In [1]:
import os 
from dotenv import load_dotenv

os.chdir("..")
load_dotenv()

True

In [3]:
import pandas as pd

benificiary_data = pd.read_csv("benificiary_data.csv")

In [5]:
import os
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI


# os.environ["PANDASAI_API_KEY"] = "YOUR_API_KEY"
# OpenAI() picks up OPENAI_API_KEY from the environment (loaded earlier by load_dotenv)
llm = OpenAI()

data = SmartDataframe("benificiary_data.csv",
                       config={"llm": llm})


In [6]:

query = "Return the first five rows"
data.chat(query)

{'type': 'dataframe', 'value':                Name                                            Address  \
0   Joshua Campbell  2632 Henry Roads Apt. 790, South Lauren, LA 75990   
1   Sandra Thompson          058 Richardson Plaza, Larrystad, GU 88136   
2  Melanie Gonzalez      497 Jones Landing, Port Sheilamouth, MP 39117   
3  Samantha Edwards  59865 Benjamin Spur Suite 119, Singhtown, RI 7...   
4    Nathan Johnson            206 Molina Ferry, West Angela, IL 23011   

   Pin Code  Birth Date  Age  
0     75990  1941-10-04   83  
1     88136  1976-05-01   48  
2     39117  1963-10-07   61  
3     72042  1999-04-16   25  
4     23011  1988-06-09   36  }


Unnamed: 0,Name,Address,Pin Code,Birth Date,Age
0,Joshua Campbell,"2632 Henry Roads Apt. 790, South Lauren, LA 75990",75990,1941-10-04,83
1,Sandra Thompson,"058 Richardson Plaza, Larrystad, GU 88136",88136,1976-05-01,48
2,Melanie Gonzalez,"497 Jones Landing, Port Sheilamouth, MP 39117",39117,1963-10-07,61
3,Samantha Edwards,"59865 Benjamin Spur Suite 119, Singhtown, RI 7...",72042,1999-04-16,25
4,Nathan Johnson,"206 Molina Ferry, West Angela, IL 23011",23011,1988-06-09,36


In [7]:
data.chat("Extract the city name from the first record")

{'type': 'string', 'value': 'The city name extracted from the first record is South Lauren.'}


'The city name extracted from the first record is South Lauren.'

In [8]:
query = "How many unique cities are there"
data.chat(query)

{'type': 'number', 'value': 199}


199

In [9]:
query = "How many unique surnames are there"
data.chat(query)

{'type': 'number', 'value': 76}


76

In [10]:
query = "Does address have duplicate values?"
data.chat(query)

'Does address have duplicate values? Yes'
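
Because the answers above are generated by an LLM, it can help to cross-check them deterministically. The sketch below uses only the standard library and the rows shown in the earlier output. The helper names are hypothetical, and the city extraction assumes every address follows the `street, city, ST ZIP` pattern seen in those rows.

```python
from collections import Counter


def duplicate_addresses(addresses):
    """Map each address that occurs more than once (case-insensitive) to its count."""
    counts = Counter(address.strip().lower() for address in addresses)
    return {address: n for address, n in counts.items() if n > 1}


def extract_city(address: str):
    """Return the second-to-last comma-separated part, or None if the format differs."""
    parts = [part.strip() for part in address.split(",")]
    return parts[-2] if len(parts) >= 3 else None


def surname(name: str) -> str:
    """Treat the last whitespace-separated token as the surname (an assumption)."""
    return name.split()[-1]


# Rows copied from the output above (the truncated fourth row is omitted)
rows = [
    ("Joshua Campbell", "2632 Henry Roads Apt. 790, South Lauren, LA 75990"),
    ("Sandra Thompson", "058 Richardson Plaza, Larrystad, GU 88136"),
    ("Melanie Gonzalez", "497 Jones Landing, Port Sheilamouth, MP 39117"),
    ("Nathan Johnson", "206 Molina Ferry, West Angela, IL 23011"),
]
print(extract_city(rows[0][1]))
print(len({extract_city(address) for _, address in rows}))
print(len({surname(name) for name, _ in rows}))
print(duplicate_addresses(address for _, address in rows))
```

On the full DataFrame, `benificiary_data["Address"].duplicated().any()` and `benificiary_data["Address"].nunique()` give the same kind of check directly in pandas.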