# Deduplication

In this notebook, I cluster images using their extracted features and then find the duplicated ones. By duplicates, I mean images that have very similar embeddings; ideally, such images receive the same cluster label. Finally, I keep only one representative of each cluster and remove the others.

Outline of this notebook:
- [Step 1](#step1): Clustering images based on their features
- [Step 2](#step2): Finding (and removing) duplicate images

<a id='step1'></a>
## Step 1: Clustering images based on their features

In this step, I use the DBSCAN algorithm to cluster the images. Choosing appropriate DBSCAN hyperparameters is crucial for this task. I set `min_samples=1` (the parameter often called MinPts) because some images have no duplicates; with this setting, each such image forms its own single-image cluster instead of being labeled as noise. Varying the `epsilon` parameter changes the similarity threshold, that is, the maximum distance at which two images are treated as neighbors and assigned to the same cluster. You can read more about determining DBSCAN parameters in my [blog post](http://www.sefidian.com/2020/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/).
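
To build intuition for these two parameters, here is a minimal pure-Python sketch. It relies on the fact that when the minimum number of points is 1, every point is a core point, so the clusters coincide with the connected components of the graph that links points lying within `epsilon` of each other. This is not scikit-learn's implementation; the helper `eps_clusters` and the toy points are hypothetical, and Euclidean distance is assumed.

```python
import math


def eps_clusters(points, eps):
    """Label points by connected components of the eps-neighborhood graph."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the components of every pair of points that lie within eps of each other.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Hypothetical 2-D "embeddings": two near-duplicate pairs and one isolated point.
toy = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.04), (5.0, 5.0)]

tight = eps_clusters(toy, eps=0.1)  # only the near-duplicates share a label
loose = eps_clusters(toy, eps=2.0)  # a larger eps merges the two pairs
```

With the small threshold, the isolated point keeps its own label, which is the behavior wanted for images without duplicates. With the larger threshold, the two pairs merge into one cluster, so an `epsilon` that is too large would cause distinct images to be treated as duplicates.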

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


%load_ext autoreload
%autoreload 2

Let's load the extracted features from the previous notebook.

In [2]:
from configs import epsilon

features = pd.read_pickle("features.pkl")

In [3]:
features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
images/photo_2022-06-16_16-17-08.jpg,-0.393243,0.198433,0.177144,-0.004066,0.344836,-0.483615,0.244499,0.148384,0.137952,0.082348,...,0.496214,-0.222817,0.378033,0.097635,-0.513407,0.190662,-0.115193,-0.195566,0.063622,-0.480281
images/photo_2022-06-16_16-17-14.jpg,0.004710,0.269924,0.010112,-0.131621,0.334753,-0.148701,0.161734,-0.274407,-0.220948,-0.086322,...,0.386467,-0.165707,-0.004429,0.173446,-0.224162,-0.013267,-0.213428,0.148017,0.085448,-0.193394
images/photo_2022-06-16_16-17-15.jpg,-0.313095,0.130897,0.195526,0.018630,0.031401,-0.157848,0.301113,0.065235,-0.169378,0.378660,...,0.178280,-0.449108,0.275259,0.301783,-0.384012,0.059103,0.200918,0.000642,-0.188533,-0.258484
images/photo_2022-06-16_16-17-17.jpg,-0.555533,-0.086299,-0.092255,-0.408534,0.373758,-0.169927,0.297859,-0.068833,0.411412,-0.093120,...,0.124647,-0.326179,-0.025985,0.412165,-0.297770,0.088128,0.183624,0.081721,-0.097750,-0.137062
images/photo_2022-06-16_16-17-18.jpg,-0.441659,0.007555,0.137791,-0.170745,0.065476,-0.287096,-0.110629,0.049418,-0.349220,0.046992,...,0.589542,-0.313997,-0.182642,0.479272,-0.553294,-0.096794,0.210342,0.171631,0.068233,-0.208699
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
images/subdir_1/sample_326.png,0.217887,0.240366,-0.339801,-0.008033,0.438812,-0.318372,0.228650,0.006852,-0.085336,0.165379,...,0.171965,0.229646,0.048396,0.090013,-0.553021,0.019513,-0.116264,0.023887,0.068613,-0.373949
images/subdir_2/sample_366.png,0.085278,0.017423,0.421910,-0.042469,0.297975,-0.112743,0.042038,0.004188,0.351063,-0.043017,...,0.468410,-0.336908,-0.131539,-0.069323,-0.074641,-0.168053,-0.408853,-0.249817,0.241861,-0.090310
images/subdir_2/sample_440.png,-0.283753,-0.029625,-0.198545,-0.230542,0.425968,-0.335625,0.165714,-0.080252,0.342686,0.200655,...,0.395328,0.109614,-0.239445,0.186434,-0.272617,0.249436,-0.272713,0.066301,0.101124,-0.370444
images/subdir_2/sample_457.png,-0.011154,-0.069685,-0.317279,0.128421,-0.100237,-0.492984,0.097012,-0.125322,0.361557,0.186905,...,0.383356,0.106593,-0.136422,0.141173,-0.350344,0.107056,-0.339924,0.353845,0.490037,-0.362301


In [4]:
features.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
count,246.0,246.0,246.0,246.0,246.0,246.0,246.0,246.0,246.0,246.0,...,246.0,246.0,246.0,246.0,246.0,246.0,246.0,246.0,246.0,246.0
mean,-0.337594,0.160944,0.084181,-0.12906,0.313357,-0.349555,0.142716,0.040532,0.066051,0.043025,...,0.33013,-0.219597,-0.027843,0.126664,-0.414825,0.02579,-0.02258,0.024512,0.007696,-0.270302
std,0.200427,0.146173,0.145802,0.149254,0.187077,0.160723,0.15042,0.18382,0.168424,0.171474,...,0.164959,0.208178,0.161159,0.16718,0.214853,0.18234,0.209003,0.137387,0.179505,0.156312
min,-0.8611,-0.205654,-0.339801,-0.530347,-0.146947,-0.718545,-0.262396,-0.630216,-0.486979,-0.371726,...,-0.042836,-0.737152,-0.497204,-0.325051,-0.969378,-0.37432,-0.611308,-0.412451,-0.593626,-0.707144
25%,-0.462013,0.055869,-0.012706,-0.229918,0.189653,-0.467356,0.037677,-0.068294,-0.064873,-0.085232,...,0.205283,-0.370783,-0.130301,0.028054,-0.550519,-0.090504,-0.171306,-0.06513,-0.102354,-0.367187
50%,-0.333473,0.158679,0.088926,-0.129525,0.304624,-0.361874,0.141983,0.040236,0.061341,0.039455,...,0.334965,-0.228564,-0.046157,0.137309,-0.411866,0.020596,-0.012926,0.027217,-0.001193,-0.275493
75%,-0.221529,0.254387,0.188082,-0.034011,0.439723,-0.228578,0.257321,0.15233,0.190924,0.176052,...,0.422656,-0.090937,0.084064,0.242236,-0.277365,0.159299,0.126544,0.108168,0.127191,-0.160624
max,0.217887,0.592896,0.440081,0.295554,0.848179,0.080913,0.574959,0.521046,0.575232,0.514609,...,0.873931,0.28577,0.378033,0.580993,0.207809,0.675945,0.496674,0.353845,0.490037,0.179364


In [5]:
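# min_samples=1 makes every image a core point, so no image is labeled as noise (-1)
model = DBSCAN(min_samples=1, eps=epsilon)

# One row per image (path and cluster label), sorted by label and path in descending order
clusters = pd.DataFrame(
    {"path": features.index, "label": model.fit_predict(features)}
).sort_values(["label", "path"], ascending=False)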

In [6]:
clusters.head(30)

Unnamed: 0,path,label
245,images/subdir_2/sample_498.png,186
244,images/subdir_2/sample_457.png,185
243,images/subdir_2/sample_440.png,184
242,images/subdir_2/sample_366.png,183
241,images/subdir_1/sample_326.png,182
240,images/subdir_1/sample_296.png,181
239,images/subdir_1/sample_202.png,180
238,images/photo_2022-06-16_17-29-02.jpg,179
237,images/photo_2022-06-16_17-29-01.jpg,178
236,images/photo_2022-06-16_17-29-01 (2).jpg,177
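
Before moving on, it can help to check how many clusters actually contain more than one image, since only those contribute duplicates. The following is a small sketch using only the standard library; the `labels` list is hypothetical, and in this notebook it could be obtained with `clusters["label"].tolist()`.

```python
from collections import Counter

# Hypothetical cluster labels, one per image.
labels = [0, 0, 1, 2, 2, 2, 3]

# Map each label to the number of images assigned to it.
cluster_sizes = Counter(labels)

# Keep only the clusters that contain at least one duplicate.
duplicated = {label: n for label, n in cluster_sizes.items() if n > 1}

# Every cluster keeps one representative, so the rest are candidates for removal.
n_removable = sum(n - 1 for n in duplicated.values())
```

If `n_removable` looks surprisingly large or small for the dataset, that is a hint to revisit the value of `epsilon`.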


<a id='step2'></a>
## Step 2: Finding (and removing) duplicate images

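In this step, I group the images by cluster label and keep only one image from each cluster as its representative. Because `clusters` is sorted by path in descending order, `first()` selects the lexicographically last path in each cluster.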

In [7]:
# select one element from each cluster to keep
list_to_keep = set(clusters.groupby("label")["path"].first())
all_files = set(clusters["path"])

In [8]:
files_to_remove = all_files - list_to_keep
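
To review the result before deleting anything, it may be useful to see which kept image each removal candidate is a duplicate of. Below is a minimal standard-library sketch; the `rows` list is hypothetical and mimics the sorted `(path, label)` pairs that `clusters.itertuples(index=False)` would yield in this notebook.

```python
from collections import defaultdict

# Hypothetical (path, label) pairs, sorted by label and path in descending order.
rows = [
    ("images/c.jpg", 1),
    ("images/a3.jpg", 0),
    ("images/a2.jpg", 0),
    ("images/a1.jpg", 0),
]

# Collect the paths of each cluster, preserving the sorted order.
groups = defaultdict(list)
for path, label in rows:
    groups[label].append(path)

# Map each kept representative (the first path) to the duplicates that would be removed.
review = {paths[0]: paths[1:] for paths in groups.values() if len(paths) > 1}

for kept, duplicates in review.items():
    print(f"keep {kept}; remove {duplicates}")
```

Clusters with a single image do not appear in `review`, because nothing would be removed from them.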

In [9]:
files_to_remove

{'images/photo_2022-06-16_16-17-50.jpg',
 'images/photo_2022-06-16_16-17-51.jpg',
 'images/photo_2022-06-16_16-17-52.jpg',
 'images/photo_2022-06-16_16-17-53 (2).jpg',
 'images/photo_2022-06-16_16-17-53.jpg',
 'images/photo_2022-06-16_16-17-54.jpg',
 'images/photo_2022-06-16_16-17-55.jpg',
 'images/photo_2022-06-16_16-17-56.jpg',
 'images/photo_2022-06-16_16-17-57 (2).jpg',
 'images/photo_2022-06-16_16-17-57.jpg',
 'images/photo_2022-06-16_16-17-59.jpg',
 'images/photo_2022-06-16_16-18-00.jpg',
 'images/photo_2022-06-16_16-18-01 (2).jpg',
 'images/photo_2022-06-16_16-18-01.jpg',
 'images/photo_2022-06-16_16-18-02.jpg',
 'images/photo_2022-06-16_16-18-05.jpg',
 'images/photo_2022-06-16_16-18-11.jpg',
 'images/photo_2022-06-16_16-18-12.jpg',
 'images/photo_2022-06-16_16-18-14.jpg',
 'images/photo_2022-06-16_16-18-16.jpg',
 'images/photo_2022-06-16_16-18-18.jpg',
 'images/photo_2022-06-16_16-18-19.jpg',
 'images/photo_2022-06-16_16-18-20.jpg',
 'images/photo_2022-06-16_16-18-22.jpg',
 'im

## Remove duplicate files

### Warning

***Running the cell below will remove the detected duplicate images from disk. If you are not sure you want to do this, you can review the detected images and delete them manually one by one.***

In [10]:
for file_path in files_to_remove:
    try:
        os.remove(file_path)
    except OSError as e:
        print("Cannot delete file '%s': %s" % (file_path, e.strerror))
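
A less destructive alternative to the cell above is to move the detected duplicates into a separate folder, so they can be inspected and restored if needed. The following is only a sketch under stated assumptions: the helper `quarantine` and the target folder name are hypothetical, and the paths are assumed to be relative, as they are in this notebook.

```python
import shutil
from pathlib import Path


def quarantine(paths, target_dir):
    """Move files into target_dir, keeping their relative paths; return the new paths."""
    target = Path(target_dir)
    moved = []
    for path in map(Path, paths):
        # Assumes `path` is relative, so it can be re-rooted under the target folder.
        destination = target / path
        try:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(destination))
            moved.append(destination)
        except OSError as e:
            print(f"Cannot move file '{path}': {e}")
    return moved
```

In this notebook, the call could look like `quarantine(files_to_remove, "removed_duplicates")`. Keeping the relative path under the target folder avoids name collisions between files that share a name in different subdirectories.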