FRAUD DETECTION EDA - FRISS

In this notebook I create the vector database that will serve as the retrieval store of the RAG system for the LLM in the fraud detection pipeline.

I use sentence transformers to embed the training instances and build a FAISS index. The index is saved to disk, and the matching metadata is pickled, so both can be loaded later in the pipeline.

In [1]:
!pip install sentence-transformers faiss-cpu



Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp310-cp310-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [5]:
import pandas as pd
import numpy as np
import faiss
import pickle
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt

In [6]:
claims = pd.read_csv("training_set.csv")

claims.head()


Unnamed: 0.1,Unnamed: 0,sys_dataspecification_version,sys_claimid,claim_amount_claimed_total,claim_causetype,claim_date_occurred,claim_date_reported,claim_location_urban_area,object_make,object_year_construction,policy_fleet_flag,policy_profitability,report_delay_days,car_age_at_claim,label
0,0,4.5,MTR-338957796-02,2433.0,Collision,2012-10-22,2012-11-27,1,VOLKSWAGEN,2008.0,0,Low,36,4.0,0
1,1,4.5,MTR-434911509-02,3791.0,Collision,2014-06-12,2014-06-18,1,CITROEN,2003.0,0,Very low,6,11.0,0
2,2,4.5,MTR-615568027-02,452.0,Collision,2013-05-06,2013-09-23,1,RENAULT,2001.0,0,Low,140,12.0,0
3,3,4.5,MTR-917387010-02,555.0,Collision,2017-11-12,2017-12-06,1,RENAULT,2017.0,0,High,24,0.0,0
4,4,4.5,MTR-281513737-02,382.0,Collision,2015-10-21,2015-12-02,1,BMW,2011.0,0,Very high,42,4.0,0


Convert each instance into a natural-language sentence. Sentence-embedding models are trained on text, so this tends to work better for embedding creation than encoding the raw tabular values.

In [7]:
def claim_to_text(row):
    """
    Convert a claim row to a natural language sentence.
    Excludes the label to prevent bias in retrieval.
    """
    # Customize the template so the text is in the best format to be embedded
    return (
        f"A {row['object_make']} car manufactured in {row['object_year_construction']} had a "
        f"{row['claim_causetype']} claim of {row['claim_amount_claimed_total']} EUR, "
        f"which occurred on {row['claim_date_occurred']} and was reported on {row['claim_date_reported']}. "
        f"Policy profitability is {row['policy_profitability']}, and fleet flag is {row['policy_fleet_flag']}. "
        f"The car age at claim was {row['car_age_at_claim']} years, with a report delay of {row['report_delay_days']} days."
    )

# Apply the function to create a new text column for embedding.
claims["text"] = claims.apply(claim_to_text, axis=1)
texts = claims["text"].tolist()

# Display an example to verify the output
print("Sample generated text:")
print(texts[0])


Sample generated text:
A VOLKSWAGEN car manufactured in 2008.0 had a Collision claim of 2433.0 EUR, which occurred on 2012-10-22 and was reported on 2012-11-27. Policy profitability is Low, and fleet flag is 0. The car age at claim was 4.0 years, with a report delay of 36 days.
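
The sample above renders the construction year, claim amount and car age with a trailing ".0" (2008.0, 2433.0, 4.0), since those columns are stored as floats. Below is a minimal sketch of a variant template that prints whole numbers without it. The names `fmt_number` and `claim_to_text_clean` are hypothetical and not part of the original notebook, spelling out missing values as "unknown" is an assumption, and the row is a plain dict mirroring the first record. If the template changes, the embeddings would need to be regenerated.

```python
import math


def fmt_number(value):
    """Render whole-number floats without '.0'; return other values as text."""
    if isinstance(value, float):
        if math.isnan(value):
            return "unknown"  # assumption: missing values are spelled out
        if value.is_integer():
            return str(int(value))
    return str(value)


def claim_to_text_clean(row):
    """Hypothetical variant of claim_to_text; the label is still excluded."""
    return (
        f"A {row['object_make']} car manufactured in "
        f"{fmt_number(row['object_year_construction'])} had a "
        f"{row['claim_causetype']} claim of "
        f"{fmt_number(row['claim_amount_claimed_total'])} EUR, "
        f"which occurred on {row['claim_date_occurred']} and was reported on "
        f"{row['claim_date_reported']}. "
        f"Policy profitability is {row['policy_profitability']}, and fleet flag is "
        f"{row['policy_fleet_flag']}. "
        f"The car age at claim was {fmt_number(row['car_age_at_claim'])} years, "
        f"with a report delay of {fmt_number(row['report_delay_days'])} days."
    )


# Plain dict standing in for the first DataFrame row shown earlier.
row = {
    "object_make": "VOLKSWAGEN",
    "object_year_construction": 2008.0,
    "claim_causetype": "Collision",
    "claim_amount_claimed_total": 2433.0,
    "claim_date_occurred": "2012-10-22",
    "claim_date_reported": "2012-11-27",
    "policy_profitability": "Low",
    "policy_fleet_flag": 0,
    "car_age_at_claim": 4.0,
    "report_delay_days": 36,
}
clean_text = claim_to_text_clean(row)
print(clean_text)
```

Because the function only indexes by column name, it should also work with `claims.apply(claim_to_text_clean, axis=1)`.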


Create the embeddings


In [8]:
# Initialize the embedding model
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate embeddings for each claim
embeddings = embed_model.encode(texts, show_progress_bar=True)

print("Embeddings shape:", embeddings.shape)


Embeddings shape: (80000, 384)


Create the FAISS index and database


In [9]:
# Determine the dimensionality for the index
dimension = embeddings.shape[1]

# Create a FAISS index (using L2)
index = faiss.IndexFlatL2(dimension)

# Add embeddings to the index (FAISS expects float32 arrays)
index.add(np.asarray(embeddings, dtype="float32"))

print(f"Number of embedded vectors in index: {index.ntotal}")


Number of embedded vectors in index: 80000
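
`IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances, so a vector queried against itself comes back at distance 0. The following is a plain-Python sketch of that behaviour on toy 3-dimensional vectors, written without FAISS so it runs anywhere. `flat_l2_search` is a hypothetical helper for illustration only, and the vectors are made up.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Brute-force k-nearest-neighbour search, smallest distance first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "index" of three vectors; the query is identical to vector 0.
vectors = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.6, 0.8],
]
distances, indices = flat_l2_search(vectors, [0.0, 0.0, 1.0], k=2)
print("indices:", indices)
print("distances:", distances)
```

The exhaustive scan is what makes a flat index exact; its cost grows linearly with the number of stored vectors.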


Save the index and pickle the metadata (a list of records aligned with the index rows) for use in the pipeline

In [10]:
# Save the FAISS index to a file
faiss.write_index(index, "claim_index.faiss")
print("FAISS index saved as claim_index.faiss")

# Create the metadata: a list of records in the same order as the index rows
metadata = claims[["sys_claimid", "text", "label"]].to_dict(orient="records")

# Pickle the metadata list
with open("claim_metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)
print("Metadata saved as claim_metadata.pkl")


FAISS index saved as claim_index.faiss
Metadata saved as claim_metadata.pkl
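
The metadata is a list of records whose position matches the row order of the FAISS index, so the integer returned by a search can be used directly as a list index. Here is a small sketch of that contract, assuming toy records and a temporary file in place of claim_metadata.pkl. The claim IDs, texts and the `hit_positions` list are invented for illustration.

```python
import os
import pickle
import tempfile

# Toy records with the same keys as the real metadata.
records = [
    {"sys_claimid": "CLAIM-A", "text": "toy claim A", "label": 0},
    {"sys_claimid": "CLAIM-B", "text": "toy claim B", "label": 1},
]

# Round-trip through a pickle file, as the pipeline would do on load.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Stand-in for indices[0] returned by index.search: row positions, best first.
hit_positions = [1, 0]
hits = [loaded[i] for i in hit_positions]
hit_ids = [h["sys_claimid"] for h in hits]
print(hit_ids)
```

The alignment holds only as long as the index and the metadata are built from the same DataFrame in the same order. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.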


Test the functionality

In [11]:
def retrieve_similar_claims(query_text, k=5):
    """
    Given a query claim in natural language, retrieve the top-k similar claims.
    """
    query_embedding = embed_model.encode([query_text])
    # FAISS expects a 2D float32 array of shape (n_queries, dimension)
    distances, indices = index.search(np.asarray(query_embedding, dtype="float32"), k)

    # Look up the metadata by position: metadata[i] corresponds to row i of the FAISS index
    similar_claims = [metadata[idx] for idx in indices[0]]
    return similar_claims, distances[0]

# Test retrieval with a sample query:
sample_query = texts[0]  # using an example claim from our dataset
results, distances = retrieve_similar_claims(sample_query)
print("Retrieved similar claims:")
for res, d in zip(results, distances):
    print(f"Distance: {d:.4f}, Claim Text: {res['text']}")


Retrieved similar claims:
Distance: 0.0000, Claim Text: A VOLKSWAGEN car manufactured in 2008.0 had a Collision claim of 2433.0 EUR, which occurred on 2012-10-22 and was reported on 2012-11-27. Policy profitability is Low, and fleet flag is 0. The car age at claim was 4.0 years, with a report delay of 36 days.
Distance: 0.0077, Claim Text: A VOLKSWAGEN car manufactured in 2008.0 had a Collision claim of 143.0 EUR, which occurred on 2012-11-05 and was reported on 2012-11-29. Policy profitability is Very high, and fleet flag is 0. The car age at claim was 4.0 years, with a report delay of 24 days.
Distance: 0.0080, Claim Text: A VOLKSWAGEN car manufactured in 2008.0 had a Collision claim of 147.0 EUR, which occurred on 2012-07-01 and was reported on 2012-07-19. Policy profitability is High, and fleet flag is 0. The car age at claim was 4.0 years, with a report delay of 18 days.
Distance: 0.0081, Claim Text: A VOLKSWAGEN car manufactured in 2008.0 had a Collision claim of 2496.0 EUR, whic