# Adding Anomalies to Synthetic Identity Islands

## Similarity Function

The `similar` function returns a similarity ratio between 0 and 1 for two strings, computed with `difflib.SequenceMatcher`.
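
As a quick illustration of how the ratio behaves, the sketch below applies the same function to names that appear in the output later in this notebook. The variable names are chosen only for this example.

```python
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return SequenceMatcher(None, a, b).ratio()

identical = similar("Katherine Glass", "Katherine Glass")
variant = similar("Katherine Glass", "Katherine Hughes Glass")
unrelated = similar("Katherine Glass", "Barbara Baker")

print(f"identical: {identical:.2f}")
print(f"variant:   {variant:.2f}")
print(f"unrelated: {unrelated:.2f}")
```

A name variant of the same person scores well above the default threshold of 0.6 used below, while an unrelated name scores lower.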

## Finding Similar Identities

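The `find_similar_identity` function searches an island for an identity similar to a target identity, based on a specified attribute (`name`, `date_of_birth`, or `nationality`). For `name`, a candidate qualifies when its name similarity meets the threshold (default 0.6) and it shares the target's date of birth or nationality; the highest-scoring candidate wins. For `date_of_birth` and `nationality`, the threshold is not used, and the first candidate that matches the target on both fields is returned. The function returns the `id` of the matching identity, or `None` if nothing qualifies.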
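
The `name` rule can be sketched for a single candidate as follows. `name_rule_matches` is a hypothetical helper written for this illustration, not part of the notebook, and the records are toy data modeled on the notebook's output.

```python
from difflib import SequenceMatcher

def name_rule_matches(target, candidate, threshold=0.6):
    """Hypothetical helper mirroring the `name` branch for one candidate."""
    ratio = SequenceMatcher(None, target["name"], candidate["name"]).ratio()
    shares_context = (
        target["date_of_birth"] == candidate["date_of_birth"]
        or target["nationality"] == candidate["nationality"]
    )
    # A similar name alone is not enough: one more field must agree.
    return ratio >= threshold and shares_context

target = {"name": "Katherine Glass", "date_of_birth": "11/04/1989", "nationality": "KGZ"}
same_dob = {"name": "Katherine Hughes Glass", "date_of_birth": "11/04/1989", "nationality": "UZB"}
no_overlap = {"name": "Katherine Hughes Glass", "date_of_birth": "11/08/2011", "nationality": "UZB"}

print(name_rule_matches(target, same_dob))    # similar name, shared date of birth
print(name_rule_matches(target, no_overlap))  # similar name, nothing else shared
```

Requiring a second matching field keeps the mislinked anomaly plausible: the wrongly linked person looks like a near match rather than a random stranger.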

## Explanation of Adding Anomalies

Anomalies simulate the discrepancies and errors that occur in real-world data. They are used to test the robustness of systems designed to detect and resolve identity-related issues.

**Types of Anomalies and Their Logic:**

1. **Duplicate Identity:**
   - A copy of an identity is created with a new `id` and a changed first name, and is linked back to the original identity using an `IDENTITY_EQUIVALENCE` edge.
   - This simulates cases where a person might have multiple identities with small differences.

2. **Inconsistent Reference:**
   - A reference document (e.g., passport, national ID) that does not logically match any identity in the island is added.
   - This can occur due to data entry errors or fraudulent documents, and it is linked to a random identity in the island using a `CITED_BY` edge.

3. **Mislinked Identity:**
   - An identity from a different island that has a similar attribute (e.g., name) is linked using an `IDENTITY_EQUIVALENCE` edge. If no similar identity is found, a random identity from another island is linked instead.
   - This represents cases where two different individuals are mistakenly linked due to similarities in their attributes.

4. **Incorrect Event:**
   - An event (e.g., biometric verification) that does not logically connect to the identities within the island is introduced and attached to a random identity in the island.
   - Such events might be incorrectly recorded or linked to the wrong person, simulating data integrity issues.
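
To make the first anomaly type concrete, here is a minimal sketch of building a duplicate identity. It uses plain dictionaries and a list of edge tuples as a stand-in for the networkx graph, and a fixed `FIRST_NAMES` list as a stand-in for Faker. Both are assumptions made so the example runs with the standard library only, and treating the last word of the name as the surname is also an assumption.

```python
import random
import uuid

FIRST_NAMES = ["Victoria", "Kimberly", "Laura"]  # assumption: stand-in for Faker

def make_duplicate(identity, rng):
    """Copy an identity, give it a new id and first name, keep the surname."""
    duplicate = dict(identity)
    duplicate["id"] = str(uuid.uuid4())
    surname = identity["name"].split()[-1]  # assumption: last word is the surname
    duplicate["name"] = f"{rng.choice(FIRST_NAMES)} {surname}"
    return duplicate

rng = random.Random(0)
original = {
    "id": str(uuid.uuid4()),
    "type": "Identity",
    "name": "Katherine Glass",
    "date_of_birth": "11/04/1989",
    "nationality": "KGZ",
}
duplicate = make_duplicate(original, rng)

# Edges as (source, target, attributes) tuples, standing in for G.add_edge(...)
edges = [(original["id"], duplicate["id"], {"type": "IDENTITY_EQUIVALENCE"})]
print(duplicate["name"], edges[0][2]["type"])
```

The duplicate keeps the date of birth and nationality of the original, so only the `id` and first name distinguish the two records.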

In [1]:
import random
import uuid
import pickle
import networkx as nx
from difflib import SequenceMatcher
from faker import Faker

In [2]:
fake = Faker()

In [3]:
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

In [4]:
def find_similar_identity(island, target_identity, attribute, threshold=0.6):
    similar_identity = None
    highest_similarity = 0

    for identity in island:
        if attribute == 'name':
            name_similarity = similar(target_identity['name'], identity['name'])
            if name_similarity >= threshold:
                if target_identity['date_of_birth'] == identity['date_of_birth'] or target_identity['nationality'] == identity['nationality']:
                    if name_similarity > highest_similarity:
                        highest_similarity = name_similarity
                        similar_identity = identity['id']
        elif attribute == 'date_of_birth':
            if target_identity['date_of_birth'] == identity['date_of_birth']:
                if target_identity['nationality'] == identity['nationality']:
                    similar_identity = identity['id']
                    break  # Perfect match on date_of_birth and nationality
        elif attribute == 'nationality':
            if target_identity['nationality'] == identity['nationality']:
                if target_identity['date_of_birth'] == identity['date_of_birth']:
                    similar_identity = identity['id']
                    break  # Perfect match on nationality and date_of_birth

    return similar_identity

In [5]:
def add_anomalies(G, identity_islands, anomaly_percentage):
    total_islands = len(identity_islands)
    num_anomalous_islands = int(total_islands * (anomaly_percentage / 100))

    for _ in range(num_anomalous_islands):
        # Randomly select an island to introduce anomalies
        island = random.choice(identity_islands)
        
        # Choose a random identity from the selected island
        source_identity_id = random.choice(island)
        source_identity = G.nodes[source_identity_id]
        
        # Choose a type of anomaly to introduce
        anomaly_type = random.choice(['duplicate_identity', 'inconsistent_reference', 'mislinked_identity', 'incorrect_event'])
        
        if anomaly_type == 'duplicate_identity':
            # Create a duplicate identity with minor changes
            base_identity = G.nodes[source_identity_id]
            duplicate_identity = base_identity.copy()
            duplicate_identity['id'] = str(uuid.uuid4())
            # Keep the surname (last word of the name) and swap in a new first name
            duplicate_identity['name'] = fake.first_name() + " " + base_identity['name'].split()[-1]
            G.add_node(duplicate_identity['id'], **duplicate_identity)
            G.add_edge(source_identity_id, duplicate_identity['id'], type='IDENTITY_EQUIVALENCE')
            island.append(duplicate_identity['id'])

        elif anomaly_type == 'inconsistent_reference':
            # Add a reference that doesn't match any identity correctly
            inconsistent_reference = {
                'id': str(uuid.uuid4()),
                'type': 'Reference',
                'doc_type': random.choice(['PASSPORT', 'NATURALISATION', 'VISA_1', 'VISA_2', 'NATIONAL_IDENTITY_CARD']),
                'doc_number': fake.ssn()
            }
            G.add_node(inconsistent_reference['id'], **inconsistent_reference)
            target_identity_id = random.choice(island)
            G.add_edge(target_identity_id, inconsistent_reference['id'], type='CITED_BY')

        elif anomaly_type == 'mislinked_identity':
            # Choose 'name' attribute as the base for similarity
            attribute = 'name'
            
            # Find a similar identity in a different island
            unrelated_identity_island = random.choice([i for i in identity_islands if i != island])
            similar_identity_id = find_similar_identity([G.nodes[node_id] for node_id in unrelated_identity_island], source_identity, attribute)
            
            if similar_identity_id:
                G.add_edge(source_identity_id, similar_identity_id, type='IDENTITY_EQUIVALENCE')
            else:
                unrelated_identity = random.choice([node_id for other in identity_islands if other != island for node_id in other])
                G.add_edge(source_identity_id, unrelated_identity, type='IDENTITY_EQUIVALENCE')

        elif anomaly_type == 'incorrect_event':
            # Add event nodes that don't logically connect to the identities
            incorrect_event = {
                'id': str(uuid.uuid4()),
                'type': 'Event',
                'event_type': 'BIOMETRIC_VERIFICATION',
                'event_date': fake.date()
            }
            G.add_node(incorrect_event['id'], **incorrect_event)
            target_identity_id = random.choice(island)
            G.add_edge(target_identity_id, incorrect_event['id'], type='IMMIGRATION_STATUS_LINKED')
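
Because `add_anomalies` calls `random.choice` once per iteration, islands are drawn with replacement: the same island can receive several anomalies while another receives none. The sketch below contrasts that with `random.sample`, which returns distinct islands. The helper names and toy islands are hypothetical and serve only this comparison.

```python
import random

def pick_islands_with_replacement(islands, count, rng):
    # Mirrors the loop in add_anomalies: each draw is independent.
    return [rng.choice(islands) for _ in range(count)]

def pick_distinct_islands(islands, count, rng):
    # Alternative: every selected island is different.
    return rng.sample(islands, count)

islands = [[f"island-{i}-identity"] for i in range(10)]
rng = random.Random(0)

repeated = pick_islands_with_replacement(islands, 10, rng)
distinct = pick_distinct_islands(islands, 10, rng)

print(len({island[0] for island in repeated}), "distinct islands when drawing with replacement")
print(len({island[0] for island in distinct}), "distinct islands when sampling without replacement")
```

Which behavior is preferable depends on the test scenario: sampling without replacement guarantees the stated share of islands is affected, while drawing with replacement allows anomalies to cluster.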

## Load Graph

In [6]:
with open('../data/synthetic_data/synthetic_identity_islands.gpickle', 'rb') as f:
    G = pickle.load(f)

print("Synthetic data read complete.")

# Check the number of nodes and edges
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
print(f"Number of nodes in the graph: {num_nodes}")
print(f"Number of edges in the graph: {num_edges}")

Synthetic data read complete.
Number of nodes in the graph: 113
Number of edges in the graph: 153


In [7]:
with open('../data/synthetic_data/identity_islands.pkl', 'rb') as f:
    identity_islands = pickle.load(f)

print("Identity Island data read complete.")
print(len(identity_islands))

Identity Island data read complete.
10


## Add Anomalies

In [8]:
anomaly_percentage = 100  # Number of anomalies to add, as a percentage of the island count
add_anomalies(G, identity_islands, anomaly_percentage)

## Save Graph with Anomalies

In [9]:
# Check the number of nodes and edges
print(f"Number of nodes in the graph: {G.number_of_nodes()}")
print(f"Number of edges in the graph: {G.number_of_edges()}")

# Save the graph to a file using pickle
with open('../data/anomalies_data/synthetic_identity_islands.gpickle', 'wb') as f:
    pickle.dump(G, f, pickle.HIGHEST_PROTOCOL)

print("Graph has been saved.")

Number of nodes in the graph: 121
Number of edges in the graph: 163
Graph has been saved.
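
The counts before and after can be cross-checked against the anomaly logic. Every anomaly type adds one edge, and every type except the mislinked identity also adds one node. The arithmetic below is a sketch that assumes no anomaly re-created an edge that already existed in the graph.

```python
nodes_before, edges_before = 113, 153
nodes_after, edges_after = 121, 163

# 10 islands with anomaly_percentage = 100 gives 10 loop iterations
anomalies_applied = int(10 * (100 / 100))

new_nodes = nodes_after - nodes_before
new_edges = edges_after - edges_before

# Assumption: each anomaly added exactly one new edge, so the anomalies
# that added no node must have been mislinked identities.
mislinked = anomalies_applied - new_nodes
print(new_nodes, new_edges, mislinked)
```

Under that assumption, the 10 new edges match the 10 anomalies, and the 8 new nodes suggest that 2 of the anomalies in this run were mislinked identities.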


## Print out nodes and their attributes

In [10]:
# Debug: Print out nodes and their attributes

node_id = ""
for node, attrs in G.nodes(data=True):
    node_id = node
    print(f"Node ID: {node}")
    print("Attributes:")
    for attr_key, attr_value in attrs.items():
        print(f"{attr_key}: {attr_value}")
    print()
    break

Node ID: 3f9fa0ea-d962-40f5-89cd-8253c94db91c
Attributes:
id: 3f9fa0ea-d962-40f5-89cd-8253c94db91c
type: Identity
name: Katherine Glass
age: 35
date_of_birth: 11/04/1989
nationality: KGZ



## Print out edges and their attributes

In [11]:
for edge in G.edges(data=True):
    print(f"Edge from {edge[0]} to {edge[1]}")
    print("Attributes:")
    for attr_key, attr_value in edge[2].items():
        print(f"{attr_key}: {attr_value}")
    break

Edge from 3f9fa0ea-d962-40f5-89cd-8253c94db91c to 9dcbecaa-f0b1-415f-9843-9c218bbc4466
Attributes:
type: IDENTITY_EQUIVALENCE


In [12]:
from pprint import pprint

neighbors = list(G.neighbors(node_id))
pprint(f"Node {node_id} has {len(neighbors)} neighbors: {neighbors}")

('Node 3f9fa0ea-d962-40f5-89cd-8253c94db91c has 9 neighbors: '
 "[UUID('9dcbecaa-f0b1-415f-9843-9c218bbc4466'), "
 "UUID('8da377c7-4b90-46a1-88d2-5912a565765d'), "
 "UUID('c950fc8a-09a5-4fe5-89f5-b3ace8fe2727'), "
 "UUID('ca798061-f5ea-4040-90d8-5a0d465db622'), "
 "UUID('ac433407-a8ee-489e-afb4-e4aa9876c240'), "
 "UUID('3f9fa0ea-d962-40f5-89cd-8253c94db91c'), "
 "UUID('e725982e-1efa-4873-a773-fa9693cdd6f9'), "
 "'0e641f32-3b8c-4fe6-bd3c-409df4caac56', "
 "'57cf1688-28c5-47d3-ac03-70a80c5c69e7']")

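The neighbor list above mixes `UUID(...)` objects with plain strings. A plausible explanation, not confirmed by the notebook, is that the loaded graph keys its nodes by `uuid.UUID` objects while `add_anomalies` creates ids with `str(uuid.uuid4())`. The sketch below shows why that matters: a `UUID` and its string form are different keys. `normalize_id` is a hypothetical helper.

```python
import uuid

original_id = uuid.UUID("3f9fa0ea-d962-40f5-89cd-8253c94db91c")
as_text = str(original_id)

# A UUID object never compares equal to its string form, so a
# dictionary (or a graph) treats them as two separate keys.
print(original_id == as_text)
nodes = {original_id: "loaded", as_text: "anomaly"}
print(len(nodes))

def normalize_id(node_id):
    """Hypothetical helper: map any id to its canonical string form."""
    return str(node_id)

normalized = {normalize_id(key) for key in nodes}
print(len(normalized))
```

Normalizing ids to one type before adding nodes or edges avoids accidentally creating a second node for the same identity.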

## Display Identity Island with Anomalies

In [13]:
import networkx as nx
import pandas as pd

def extract_identity_islands(G):
    identity_islands = []
    for subgraph in nx.connected_components(G.to_undirected()):
        identities = [n for n in subgraph if G.nodes[n]['type'] == 'Identity']
        if len(identities) > 1:
            identity_islands.append(identities)
    return identity_islands

def print_identity_islands(G):
    identity_islands = extract_identity_islands(G)
    for i, island in enumerate(identity_islands, start=1):
        print(f"Identity Island {i}:")
        for identity in island:
            node_data = G.nodes[identity]
            print(f"  ID: {identity}, Name: {node_data['name']}, DOB: {node_data['date_of_birth']}, Nationality: {node_data['nationality']}")
        print()

# Extract and print identity islands
print_identity_islands(G)

Identity Island 1:
  ID: 57cf1688-28c5-47d3-ac03-70a80c5c69e7, Name: Victoria Glass, DOB: 11/04/1989, Nationality: KGZ
  ID: c950fc8a-09a5-4fe5-89f5-b3ace8fe2727, Name: Katherine Hughes Glass, DOB: 11/04/1989, Nationality: KGZ
  ID: 9dcbecaa-f0b1-415f-9843-9c218bbc4466, Name: Katherine Larry Hughes Glass, DOB: 11/04/1989, Nationality: KGZ
  ID: 8da377c7-4b90-46a1-88d2-5912a565765d, Name: Katherine Larry Glass, DOB: 11/04/1989, Nationality: KGZ
  ID: 3f9fa0ea-d962-40f5-89cd-8253c94db91c, Name: Katherine Glass, DOB: 11/04/1989, Nationality: KGZ
  ID: ca798061-f5ea-4040-90d8-5a0d465db622, Name: Katherine Hughes Glass, DOB: 11/04/1989, Nationality: KGZ
  ID: 0e641f32-3b8c-4fe6-bd3c-409df4caac56, Name: Kimberly Glass, DOB: 11/04/1989, Nationality: KGZ

Identity Island 2:
  ID: 7a267725-130f-42c3-b438-530b7ef726c5, Name: Barbara Baker, DOB: 11/08/2011, Nationality: UZB
  ID: 39e58dd6-cb83-43bb-8864-1a487dfca258, Name: Barbara April Evans Baker, DOB: 11/08/2011, Nationality: UZB
  ID: 56c4758