# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: DBSCAN for Data Deduplication

**Steps**:
1. Data Set: Download a dataset containing duplicate entries for event registrations.
2. DBSCAN Clustering: Apply the DBSCAN algorithm to cluster similar registrations.
3. Identify Duplicates: Flag registrations that fall in the same dense cluster as likely duplicates.
4. Refinement: Validate the clusters and discard any false duplicate matches before removing records.

In [None]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Step 1: Simulate event registration dataset with duplicates
data = {
    'name': [
        'Alice Johnson', 'Alicia Johnson', 'Bob Smith', 'Robert Smith',
        'Charlie Lee', 'Charlie L.', 'David Yu', 'David Y.'
    ],
    'email': [
        'alicej@example.com', 'alicej@example.com', 'bobsmith@example.com',
        'roberts@example.com', 'clee@example.com', 'clee@example.com',
        'davidyu@example.com', 'davidyu@example.com'
    ],
    'event': ['TechCon'] * 8
}
df = pd.DataFrame(data)

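# Step 2: Vectorize the combined text fields (name + email) with TF-IDF.
# The default tokenizer lowercases, splits on punctuation, and drops
# single-character tokens such as the "L" in "Charlie L.".
combined_text = df['name'] + ' ' + df['email']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(combined_text)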

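# Step 3: Apply DBSCAN to the precomputed cosine distance matrix
distance_matrix = cosine_distances(X)
# eps=0.4: records within a cosine distance of 0.4 are linked as neighbors.
# min_samples=1: every record is a core point, so none is labelled noise (-1).
db = DBSCAN(eps=0.4, min_samples=1, metric='precomputed')
labels = db.fit_predict(distance_matrix)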

df['cluster'] = labels

# Step 4: Treat rows that share a cluster label as duplicates and keep the first row of each cluster
dedup_df = df.drop_duplicates(subset='cluster', keep='first').reset_index(drop=True)

print("Original Data with Cluster Labels:")
print(df[['name', 'email', 'cluster']])
print("\nDeduplicated Event Registrations:")
print(dedup_df[['name', 'email']])
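
The `eps=0.4` threshold above is a hand-picked value. One way to sanity-check it is to look at each record's distance to its nearest neighbor: a visible gap between small and large values suggests a plausible range for `eps`. The following is a minimal sketch under that assumption; the `records` list simply repeats the toy data with name and email already joined, and `nearest` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Same eight toy registrations as above (name + email joined into one string).
records = [
    'Alice Johnson alicej@example.com',
    'Alicia Johnson alicej@example.com',
    'Bob Smith bobsmith@example.com',
    'Robert Smith roberts@example.com',
    'Charlie Lee clee@example.com',
    'Charlie L. clee@example.com',
    'David Yu davidyu@example.com',
    'David Y. davidyu@example.com',
]

X = TfidfVectorizer().fit_transform(records)
dist = cosine_distances(X)

# Ignore each record's zero distance to itself.
np.fill_diagonal(dist, np.inf)

# Distance from every record to its closest other record.
nearest = dist.min(axis=1)

for text, d in zip(records, nearest):
    print(f'{d:.3f}  {text}')
```

Records whose nearest-neighbor distance is below `eps` get linked to at least one other record; those above it stay in a cluster of their own.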

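The refinement step from the task list (validating clusters) is not covered by the code above. A minimal sketch of one possible rule follows: split any cluster whose members do not share the same normalized email. Both the rule and the `refine_clusters` helper are assumptions made for illustration, and the cluster labels below are hypothetical, chosen as if a looser `eps` had merged "Bob Smith" and "Robert Smith".

```python
import pandas as pd

# Hypothetical labels: rows 2 and 3 were (wrongly) placed in the same cluster.
clustered = pd.DataFrame({
    'name': ['Alice Johnson', 'Alicia Johnson', 'Bob Smith', 'Robert Smith'],
    'email': ['alicej@example.com', 'alicej@example.com',
              'bobsmith@example.com', 'roberts@example.com'],
    'cluster': [0, 0, 1, 1],
})


def refine_clusters(frame):
    """Split clusters whose members do not share one normalized email."""
    out = frame.copy()
    # Normalize emails so case and stray whitespace do not cause false splits.
    out['email_key'] = out['email'].str.strip().str.lower()
    # Number each distinct (cluster, email_key) pair; rows stay together
    # only if both the cluster label and the email agree.
    out['refined_cluster'] = out.groupby(['cluster', 'email_key']).ngroup()
    return out.drop(columns='email_key')


refined = refine_clusters(clustered)
deduped = refined.drop_duplicates(subset='refined_cluster', keep='first')

print(refined[['name', 'email', 'cluster', 'refined_cluster']])
print(deduped[['name', 'email']])
```

Under this rule the two Smith registrations are separated again because their emails differ, while the two Johnson rows remain a single duplicate group. A stricter or looser rule (for example, comparing phone numbers or fuzzy name similarity) could be substituted depending on which fields the real dataset provides.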