# Grouping Similar Images

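QQ caches far too many group-chat stickers... and the image filenames are basically gibberish, so sorting them by hand is a chore every time, even if collecting stickers along the way is fun xd

I've been relearning Python recently and wanted something to practice on. In an earlier course (digital content security, or something like that) I wrote a perceptual hash (pHash) in Python, implementing it step by step. Back then I only verified that pHash tolerates slight modifications to an image, which is what makes it a robust hash.

So I figured it should also work for detecting duplicate images. I found a module that supports image hashing, [ImageHash](https://pypi.org/project/ImageHash/), which also provides other perceptual hashes (dHash, aHash, wHash), and tried all of them.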
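
To make the idea behind dHash concrete, here is a minimal pure-Python sketch. It is not ImageHash's implementation: it assumes the image has already been shrunk to a small grayscale grid, and `dhash_bits`, `bits_to_hex`, and the sample grid are hypothetical names and values used only for illustration. Each bit records whether a pixel is brighter than its left-hand neighbour, so a uniform brightness change leaves the hash untouched.

```python
def dhash_bits(gray_rows):
    """Return difference-hash bits for a grid of grayscale values.

    Each row yields one bit per adjacent pixel pair: 1 when the right
    pixel is brighter than the left one, otherwise 0.
    """
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if right > left else 0)
    return bits


def bits_to_hex(bits):
    """Pack a list of 0/1 bits into a zero-padded hex string."""
    value = int("".join(str(b) for b in bits), 2)
    width = (len(bits) + 3) // 4
    return format(value, "0{}x".format(width))


# Hypothetical 2x5 grayscale grid, standing in for a resized image.
grid = [
    [10, 20, 15, 15, 30],
    [200, 100, 120, 90, 95],
]
# The same grid, uniformly brightened by 5.
brighter = [[v + 5 for v in row] for row in grid]

print(bits_to_hex(dhash_bits(grid)))              # 95
print(dhash_bits(grid) == dhash_bits(brighter))   # True
```

A real dHash does the same comparison on a larger grid (typically 9x8 pixels for a 64-bit hash) after resizing and converting to grayscale.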

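Grouping images by the difference between their hash values is not hard to implement, but it has two weak points:
- the choice of hash algorithm
- the choice of difference threshold

Both are basically picked "by feel". It does work, in that it groups some similar images together, but completely unrelated images sometimes end up in the same group too...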
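
For reference, the "difference" between two hashes is their Hamming distance, i.e. the number of bits that differ. The sketch below computes it directly on hex strings with the standard library. The helper names and the three sample hashes are made up for illustration; the strict `<` comparison mirrors the one used in `divide_groups` further down.

```python
def hamming_hex(h1, h2):
    """Count the differing bits between two equal-length hex strings."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")


def is_similar(h1, h2, threshold):
    """Treat two hashes as similar when fewer than `threshold` bits differ."""
    return hamming_hex(h1, h2) < threshold


# Hypothetical 64-bit hashes written as 16 hex digits.
a = "ffd8e0c0c0e0f0f8"
b = "ffd8e0c0c0e0f0f9"   # last bit flipped
c = "00271f3f3f1f0f07"   # every bit of `a` inverted

print(hamming_hex(a, b))   # 1
print(hamming_hex(a, c))   # 64
print(is_similar(a, b, 9), is_similar(a, c, 18))   # True False
```

The threshold is the knob that trades missed duplicates against false matches, which is why picking it by feel is uncomfortable.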

Later I plan to give [Image Deduplicator](https://idealo.github.io/imagededup/) a try. The notebook runs in a Python 3.8.5 virtual environment where imagededup isn't supported, so for now I'm implementing a simple method first.

In [1]:
from PIL import Image
import os
import shutil
import imagehash
import itertools

In [2]:
# read image paths into a list
def read_files(path):
    exts = (".jpg", ".jpeg", ".png", ".gif")
    images = []
    for file in os.listdir(path):
        # match on the lowercased name, but keep the original name in the path
        if file.lower().endswith(exts):
            images.append(os.path.join(path, file))
    return images

In [3]:
# open each image and calculate its hash
# (keep convert_str=True when passing the result to divide_groups)
def cal_hash(files, convert_str=True, sort=True, hashfunc=imagehash.dhash):
    unhandled = []
    hashes = []
    for f in files:
        try:
            with Image.open(f) as image:    # close the file handle afterwards
                h = hashfunc(image)
            if convert_str:
                h = str(h)    # convert to hex string
            hashes.append((h, f))
        except Exception as e:
            print('Problem:', e, 'with', f)
            unhandled.append(f)

    if sort:    # sort by hex value
        hashes.sort(key=lambda x: str(x[0]))

    return hashes, unhandled

In [4]:
# divide images into groups of neighbours with a small hash difference
# (expects `hashes` sorted by hex value, as returned by cal_hash)
def divide_groups(hashes, threshold=1):
    tmp = []
    groups = []
    for i, v in enumerate(hashes[:-1]):
        h1, f1 = v
        h2, f2 = hashes[i+1]
        # subtracting two ImageHash objects gives their Hamming distance
        diff = imagehash.hex_to_hash(str(h1)) - imagehash.hex_to_hash(str(h2))
        if diff < threshold:
            tmp.append(v)
        elif tmp:
            tmp.append(v)    # v closes the current run
            groups.append(tmp)
            tmp = []

    # the last image closes a run that is still open
    if tmp:
        tmp.append(hashes[-1])
        groups.append(tmp)

    # not in any group
    flatten = list(itertools.chain(*groups))
    grouped = set(flatten)
    res_hashes = [(h, f) for h, f in hashes if (h, f) not in grouped]

    print("Origin:", len(hashes), "Total:", len(flatten)+len(res_hashes),
          "Groups:", len(groups), "Grouped:", len(flatten), "Remain:", len(res_hashes))

    return groups, res_hashes
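
One caveat with `divide_groups`: it only compares images that sit next to each other after sorting by hex string, and two hashes that differ in a single leading bit can sort far apart. A possible alternative, sketched below, compares every pair and merges matches with a small union-find. This is not what the notebook does; `group_pairwise` and the sample data are hypothetical, and the all-pairs loop is quadratic (roughly 11 million comparisons for the 4726 images here).

```python
def hamming_hex(h1, h2):
    """Count the differing bits between two equal-length hex strings."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")


def group_pairwise(hashes, threshold=1):
    """Group (hex_hash, path) pairs whose hashes differ in fewer than
    `threshold` bits, comparing every pair rather than sorted neighbours.

    Returns (groups, remaining) in the same shape as divide_groups.
    """
    parent = list(range(len(hashes)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            if hamming_hex(hashes[i][0], hashes[j][0]) < threshold:
                parent[find(i)] = find(j)    # merge the two clusters

    clusters = {}
    for i, item in enumerate(hashes):
        clusters.setdefault(find(i), []).append(item)

    groups = [c for c in clusters.values() if len(c) > 1]
    remaining = [c[0] for c in clusters.values() if len(c) == 1]
    return groups, remaining


# Hypothetical hashes: a.png and c.png differ only in the top bit, so
# b.png sorts between them even though they are one bit apart.
sample = [
    ("0f0f0f0f0f0f0f0f", "a.png"),
    ("3c3c3c3c3c3c3c3c", "b.png"),
    ("8f0f0f0f0f0f0f0f", "c.png"),
]
groups, remaining = group_pairwise(sample, threshold=2)
print(groups)      # [[('0f0f0f0f0f0f0f0f', 'a.png'), ('8f0f0f0f0f0f0f0f', 'c.png')]]
print(remaining)   # [('3c3c3c3c3c3c3c3c', 'b.png')]
```

Like the neighbour-based version, this chains matches: if A is close to B and B is close to C, all three land in one group even when A and C are far apart, so high thresholds can still pull in unrelated images.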

In [5]:
# move and rename by groups
def move_files(groups, store_path, filenum=0):
    os.makedirs(store_path, exist_ok=True)    # make sure the target folder exists
    for lst in groups:
        # keep the largest file in the group as the primary copy
        saved = max((f for h, f in lst), key=os.path.getsize)
        prefix = os.path.join(store_path, str(filenum).zfill(5))

        # largest file, without "_" in its filename
        shutil.move(saved, prefix + os.path.splitext(saved)[1])

        # other duplicate files, with "_N" in their filenames
        i = 1
        for h, f in lst:
            if f == saved:
                continue
            shutil.move(f, prefix + "_" + str(i) + os.path.splitext(f)[1])
            i += 1

        # the next group gets the next number
        filenum += 1

    # return for the next function call
    return filenum
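
Since `move_files` changes files on disk, it can help to preview the naming scheme first. The sketch below is a hypothetical dry-run helper, not part of the notebook: `plan_moves` and its `sizes` argument (a dict standing in for `os.path.getsize`) are assumptions made so the plan can be checked without touching any file.

```python
import os


def plan_moves(groups, store_path, sizes, filenum=0):
    """Return ((source, destination) pairs, next filenum) without moving files.

    `sizes` maps each path to its size in bytes; it stands in for
    os.path.getsize so the plan can be inspected safely.
    """
    plan = []
    for group in groups:
        paths = [f for _, f in group]
        keep = max(paths, key=lambda p: sizes[p])    # largest file in the group
        base = str(filenum).zfill(5)
        # largest file gets the plain number, e.g. 00000.png
        plan.append((keep, os.path.join(store_path, base + os.path.splitext(keep)[1])))
        # the rest get a "_N" suffix, e.g. 00000_1.jpg
        others = [p for p in paths if p != keep]
        for i, p in enumerate(others, start=1):
            name = "{}_{}{}".format(base, i, os.path.splitext(p)[1])
            plan.append((p, os.path.join(store_path, name)))
        filenum += 1
    return plan, filenum


# Hypothetical group of two near-duplicates with made-up sizes.
demo_groups = [[("aa", "x/1.jpg"), ("ab", "x/2.png")]]
demo_sizes = {"x/1.jpg": 100, "x/2.png": 300}
plan, next_num = plan_moves(demo_groups, "sorted", demo_sizes)
for src, dst in plan:
    print(src, "->", dst)
```

Here the larger `x/2.png` becomes `00000.png` and the smaller `x/1.jpg` becomes `00000_1.jpg`, both under the `sorted` folder.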

In [6]:
# my_qq --> your QQ number
path = "D:/Program Data/QQ/my_qq/Image/Group/"
store = "D:/Program Data/QQ/my_qq/Image/Group/sorted/"

In [None]:
# read image path
files = read_files(path)

# calculate hash
hashes, unhandled = cal_hash(files,hashfunc=imagehash.dhash)

In [None]:
# error opening image
unhandled

In [9]:
# total images
len(hashes)

4726

In [10]:
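newhashes = hashes.copy()  # group on a copy so `hashes` stays intact
threshold = 20     # exclusive upper bound for the per-pass thresholds
filenum = 0        # next number to use in output filenames
# four passes with thresholds 9, 12, 15, 18, loosening the match each time
for i in range(9, threshold, 3):
    groups, newhashes = divide_groups(newhashes, i)
    filenum = move_files(groups, store, filenum)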

Origin: 4726 Total: 4726 Groups: 216 Grouped: 507 Remain: 4219
Origin: 4219 Total: 4219 Groups: 26 Grouped: 60 Remain: 4159
Origin: 4159 Total: 4159 Groups: 47 Grouped: 97 Remain: 4062
Origin: 4062 Total: 4062 Groups: 129 Grouped: 263 Remain: 3799


The results after grouping look pretty good:

![](img/dhash.png)

But as the threshold goes up, unrelated images get put into the same group as well... so this is only somewhat helpful code (´･ᴗ･`)

![](img/errgroup.png)
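
A quick bit of arithmetic shows how loose the later passes are. The sketch below assumes a 64-bit hash (which is what `imagehash.dhash` produces with its default `hash_size` of 8) and the strict `<` comparison in `divide_groups`; `pass_limits` is a hypothetical helper for illustration.

```python
HASH_BITS = 64   # assumption: dhash with the default hash_size of 8


def pass_limits(start, stop, step, bits=HASH_BITS):
    """Return (threshold, max differing bits, share of the hash) per pass."""
    limits = []
    for t in range(start, stop, step):
        max_diff = t - 1    # a strict "<" comparison allows at most t - 1 bits
        limits.append((t, max_diff, max_diff / bits))
    return limits


for t, max_diff, share in pass_limits(9, 20, 3):
    print("threshold {:2d}: up to {:2d} differing bits ({:.1%} of the hash)".format(
        t, max_diff, share))
```

By the last pass (threshold 18), two images can differ in 17 of 64 bits, more than a quarter of the hash, and still be grouped, which is consistent with unrelated images starting to slip in.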

In [11]:
# rename the remaining (ungrouped) files
def rename_files(hashes, path, filenum):
    for h, f in hashes:
        ext = os.path.splitext(f)[1]
        shutil.move(f, os.path.join(path, str(filenum).zfill(5) + ext))
        filenum += 1

In [12]:
rename_files(newhashes, store, filenum)