# Deduplication

In this notebook, I will cluster images using their extracted features and then find the images that are duplicated. By duplicates, I mean images that have very similar embeddings; ideally, they will be assigned to the same cluster. Finally, I can keep only one representative of the similar items in each cluster and remove the others.

Outline of this notebook:
- [Step 1](#step1): Clustering images based on their features
- [Step 2](#step2): Finding (and removing) duplicate images

<a id='step1'></a>
## Step 1: Clustering images based on their features

In this step, I will use the DBSCAN algorithm to cluster the images. Choosing appropriate DBSCAN hyperparameters is crucial for this task. I set `min_samples=1` (scikit-learn's name for the MinPts parameter) because some images have no duplicates; with this setting, such an image forms its own single-image cluster instead of being labeled as noise. Varying the `epsilon` parameter changes the similarity threshold for assigning images to the same cluster. You can find more about determining DBSCAN parameters in my [blog post](http://www.sefidian.com/2020/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/).
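To make the effect of `epsilon` concrete, here is a minimal sketch on made-up 2-D points (not real image embeddings). The `eps` values 0.1 and 2.0 are arbitrary and chosen only for illustration; the value used in this notebook comes from `configs`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D "embeddings" (made up for illustration, not real image features).
toy = np.array([
    [0.00, 0.00],  # image A
    [0.00, 0.00],  # exact copy of A
    [0.05, 0.00],  # near-duplicate of A
    [1.00, 1.00],  # image B
    [5.00, 5.00],  # image C
])

# Tight threshold: only identical or nearly identical points share a cluster.
labels_tight = DBSCAN(min_samples=1, eps=0.1).fit_predict(toy)

# Loose threshold: B is now within reach of A's group and is merged into it.
labels_loose = DBSCAN(min_samples=1, eps=2.0).fit_predict(toy)

print("eps=0.1:", labels_tight)
print("eps=2.0:", labels_loose)
```

Because `min_samples=1` makes every point a core point, no point receives the noise label `-1`; unique images simply end up alone in their cluster. A value of `eps` that is too large merges distinct images, which would later cause non-duplicates to be removed.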

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


%load_ext autoreload
%autoreload 2

Let's load the extracted features from the previous notebook.

In [2]:
from configs import epsilon

features = pd.read_pickle("features.pkl")

In [3]:
features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
images/photo_2022-06-16_17-29-00.jpg,0.199145,0.246505,0.0028,0.323195,0.234513,0.09212,-0.434587,0.268099,-0.224246,-0.208385,...,-0.248163,0.146117,0.035693,-0.022854,-0.256254,0.216303,-0.33438,0.647641,-0.04866,0.621903
images/photo_2022-06-16_17-29-01 (2) (3rd copy).jpg,-0.297781,0.430951,0.131061,0.092694,0.336686,-0.097079,-0.509302,-0.057603,-0.46872,-0.120154,...,0.184711,0.129619,-0.344921,0.355836,-0.158022,0.177382,-0.185563,0.337501,-0.328419,0.472292
images/photo_2022-06-16_17-29-01 (2) (another copy).jpg,-0.297781,0.430951,0.131061,0.092694,0.336686,-0.097079,-0.509302,-0.057603,-0.46872,-0.120154,...,0.184711,0.129619,-0.344921,0.355836,-0.158022,0.177382,-0.185563,0.337501,-0.328419,0.472292
images/photo_2022-06-16_17-29-01 (2) (copy).jpg,-0.297781,0.430951,0.131061,0.092694,0.336686,-0.097079,-0.509302,-0.057603,-0.46872,-0.120154,...,0.184711,0.129619,-0.344921,0.355836,-0.158022,0.177382,-0.185563,0.337501,-0.328419,0.472292
images/photo_2022-06-16_17-29-01 (2).jpg,-0.297781,0.430951,0.131061,0.092694,0.336686,-0.097079,-0.509302,-0.057603,-0.46872,-0.120154,...,0.184711,0.129619,-0.344921,0.355836,-0.158022,0.177382,-0.185563,0.337501,-0.328419,0.472292
images/photo_2022-06-16_17-29-02 (another copy).jpg,0.162328,0.067346,0.174673,0.103443,0.247469,0.037177,-0.44427,-0.152575,-0.413282,-0.03342,...,0.146916,0.006428,0.054943,0.115833,-0.103238,-0.022021,-0.004661,0.364359,-0.11729,0.090883
images/photo_2022-06-16_17-29-02 (copy).jpg,0.162328,0.067346,0.174673,0.103443,0.247469,0.037177,-0.44427,-0.152575,-0.413282,-0.03342,...,0.146916,0.006428,0.054943,0.115833,-0.103238,-0.022021,-0.004661,0.364359,-0.11729,0.090883
images/photo_2022-06-16_17-29-02.jpg,0.162328,0.067346,0.174673,0.103443,0.247469,0.037177,-0.44427,-0.152575,-0.413282,-0.03342,...,0.146916,0.006428,0.054943,0.115833,-0.103238,-0.022021,-0.004661,0.364359,-0.11729,0.090883


In [4]:
features.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
count,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,...,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0
mean,-0.063125,0.271543,0.131383,0.125538,0.290458,-0.023083,-0.475576,-0.052505,-0.417371,-0.098658,...,0.116429,0.085485,-0.147395,0.218499,-0.149757,0.107471,-0.136327,0.38634,-0.214276,0.347965
std,0.251148,0.180222,0.05626,0.080043,0.049602,0.081123,0.036194,0.137806,0.082717,0.061709,...,0.1485,0.065702,0.211258,0.153678,0.050859,0.108034,0.120077,0.106415,0.124075,0.21881
min,-0.297781,0.067346,0.0028,0.092694,0.234513,-0.097079,-0.509302,-0.152575,-0.46872,-0.208385,...,-0.248163,0.006428,-0.344921,-0.022854,-0.256254,-0.022021,-0.33438,0.337501,-0.328419,0.090883
25%,-0.297781,0.067346,0.131061,0.092694,0.247469,-0.097079,-0.509302,-0.152575,-0.46872,-0.120154,...,0.146916,0.006428,-0.344921,0.115833,-0.158022,-0.022021,-0.185563,0.337501,-0.328419,0.090883
50%,-0.067727,0.338728,0.131061,0.098069,0.292078,-0.029951,-0.476786,-0.057603,-0.441001,-0.120154,...,0.165814,0.129619,-0.154614,0.235835,-0.158022,0.177382,-0.185563,0.35093,-0.222855,0.472292
75%,0.162328,0.430951,0.174673,0.103443,0.336686,0.037177,-0.44427,-0.057603,-0.413282,-0.03342,...,0.184711,0.129619,0.054943,0.355836,-0.103238,0.177382,-0.004661,0.364359,-0.11729,0.472292
max,0.199145,0.430951,0.174673,0.323195,0.336686,0.09212,-0.434587,0.268099,-0.224246,-0.03342,...,0.184711,0.146117,0.054943,0.355836,-0.103238,0.216303,-0.004661,0.647641,-0.04866,0.621903


In [5]:
model = DBSCAN(min_samples=1, eps=epsilon)

clusters = pd.DataFrame(
    {"path": features.index, "label": model.fit_predict(features)}
).sort_values(["label", "path"], ascending=False)

In [6]:
clusters.head(30)

Unnamed: 0,path,label
7,images/photo_2022-06-16_17-29-02.jpg,2
6,images/photo_2022-06-16_17-29-02 (copy).jpg,2
5,images/photo_2022-06-16_17-29-02 (another copy...,2
4,images/photo_2022-06-16_17-29-01 (2).jpg,1
3,images/photo_2022-06-16_17-29-01 (2) (copy).jpg,1
2,images/photo_2022-06-16_17-29-01 (2) (another ...,1
1,images/photo_2022-06-16_17-29-01 (2) (3rd copy...,1
0,images/photo_2022-06-16_17-29-00.jpg,0


<a id='step2'></a>
## Step 2: Finding (and removing) duplicate images

In this step, I will group the images in the same cluster and keep only one image of each cluster as its representative.

In [22]:
# Keep the first path in each cluster as its representative
list_to_keep = set(clusters.groupby("label")["path"].first())
all_files = set(clusters["path"])

In [23]:
files_to_remove = all_files - list_to_keep
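Which image survives in each cluster depends on the sort order applied earlier. The sketch below replays the same `sort_values` and `groupby(...).first()` steps on a small hypothetical table (the file names `a.jpg`, `a (copy).jpg`, and `b.jpg` are made up) to show how the representative is picked.

```python
import pandas as pd

# Hypothetical cluster assignments; names and labels are illustrative only.
toy_clusters = pd.DataFrame(
    {
        "path": ["a (copy).jpg", "a.jpg", "b.jpg"],
        "label": [0, 0, 1],
    }
).sort_values(["label", "path"], ascending=False)

# After the descending sort, the first row in each label group is the
# lexicographically largest path in that cluster.
to_keep = set(toy_clusters.groupby("label")["path"].first())
to_remove = set(toy_clusters["path"]) - to_keep

print("keep:  ", sorted(to_keep))
print("remove:", sorted(to_remove))
```

Here `a.jpg` sorts after `a (copy).jpg` because `.` compares greater than a space, so the file without a "(copy)" suffix is the one kept. With other naming schemes the kept file may differ, so it is worth checking the result before deleting anything.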

### Print duplicates

In [35]:
for group, paths in clusters[clusters["label"].duplicated(keep=False)].groupby("label")["path"]:
    print(f'Duplicate items for "{paths.iloc[0]}" are:\n {paths.iloc[1:].values}\n')

Duplicate items for "images/photo_2022-06-16_17-29-01 (2).jpg" are:
 ['images/photo_2022-06-16_17-29-01 (2) (copy).jpg'
 'images/photo_2022-06-16_17-29-01 (2) (another copy).jpg'
 'images/photo_2022-06-16_17-29-01 (2) (3rd copy).jpg']

Duplicate items for "images/photo_2022-06-16_17-29-02.jpg" are:
 ['images/photo_2022-06-16_17-29-02 (copy).jpg'
 'images/photo_2022-06-16_17-29-02 (another copy).jpg']



In [36]:
files_to_remove

{'images/photo_2022-06-16_17-29-01 (2) (3rd copy).jpg',
 'images/photo_2022-06-16_17-29-01 (2) (another copy).jpg',
 'images/photo_2022-06-16_17-29-01 (2) (copy).jpg',
 'images/photo_2022-06-16_17-29-02 (another copy).jpg',
 'images/photo_2022-06-16_17-29-02 (copy).jpg'}

## Remove duplicate files

### Warning 

***Running the cell below will remove the detected duplicate images from disk. If you are not sure you want to do this, review the detected images first and delete them manually one by one.***

In [37]:
for file_path in files_to_remove:
    try:
        os.remove(file_path)
    except OSError as e:
        print("Cannot delete file '%s': %s" % (file_path, e.strerror))
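As a more cautious alternative to deleting files outright, the duplicates can be moved to a separate folder and reviewed before final removal. The following is only a sketch: the helper `quarantine_duplicates`, its `dry_run` flag, and the folder name `duplicates_quarantine` are hypothetical and not part of this project.

```python
import shutil
from pathlib import Path


def quarantine_duplicates(paths, quarantine_dir="duplicates_quarantine", dry_run=True):
    """Move files into quarantine_dir instead of deleting them.

    Returns a list of (source, destination) pairs. With dry_run=True
    nothing on disk is changed; the pairs only describe the plan.
    """
    quarantine = Path(quarantine_dir)
    moved = []
    for src in sorted(Path(p) for p in paths):
        if not src.is_file():
            print(f"Skipping missing file: {src}")
            continue
        dst = quarantine / src.name
        if dst.exists():
            # Never overwrite a file that is already in the quarantine folder.
            print(f"Skipping {src}: {dst} already exists")
            continue
        if not dry_run:
            quarantine.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
        moved.append((src, dst))
    return moved
```

A typical use would be to call `quarantine_duplicates(files_to_remove)` first to inspect the plan, and then call it again with `dry_run=False` once the list looks right.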