### Jaccard similarity
The function `jaccard_similarity(list1, list2)` computes the Jaccard similarity of two tag sets A and B (the Jaccard distance is one minus this value):

$$
sim_J=\frac{|A{\cap}B|}{|A{\cup}B|}
$$
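
As a quick check of the formula, here is a minimal sketch on two made-up tag lists. The helper name `jaccard` and the tags themselves are assumptions chosen for illustration; they are not taken from the photo data.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables of hashable items."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

# hypothetical tag lists, made up for illustration
tags_a = ["black", "dress", "shoes"]
tags_b = ["black", "boots", "dress", "jacket"]

sim = jaccard(tags_a, tags_b)  # 2 shared tags out of 5 distinct tags
dist = 1 - sim                 # the Jaccard distance is the complement
print(sim, dist)
```

The two lists share `black` and `dress` and contain five distinct tags in total, so the similarity is 2/5.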

### Weighted Jaccard similarity
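The function `weighted_similarity(list1, list2)` computes the weighted Jaccard similarity, where $w_i$ is the weight of tag $i$ and $a_i$, $b_i$ indicate whether tag $i$ is present in each image: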

$$
sim_w=\frac{\sum_{i=1}^{N} w_i (a_i \land b_i)}{\sum_{i=1}^{N} w_i (a_i \lor b_i)}
$$
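
The weighted formula can be sketched the same way, reading $a_i$ and $b_i$ as 0/1 indicators so that the sums run over the shared tags and over all tags of either image. The helper name `weighted_jaccard`, the tags, and the weights in `tag_weights` are all hypothetical values chosen for illustration.

```python
def weighted_jaccard(a, b, w):
    """Sum of weights over shared tags divided by sum of weights over all tags."""
    set_a, set_b = set(a), set(b)
    inter = sum(w[t] for t in set_a & set_b)
    union = sum(w[t] for t in set_a | set_b)
    return inter / union

# hypothetical weights, e.g. how often each tag occurs
tag_weights = {"black": 50, "dress": 30, "shoes": 10, "boots": 5, "jacket": 5}

tags_a = ["black", "dress", "shoes"]
tags_b = ["black", "boots", "dress", "jacket"]

# shared tags weigh 50 + 30 = 80, all tags together weigh 100
weighted = weighted_jaccard(tags_a, tags_b, tag_weights)
print(weighted)
```

With these made-up weights the two shared tags are also the heaviest ones, so the weighted score (80/100) is higher than the plain Jaccard similarity of the same lists (2/5).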

The code below computes the similarity of two images (image1 and image2) based on their tags (`tags1` and `tags2`).

In [9]:
# coding: utf-8
"""
This code draws pairs of images and computes the similarity of each pair.
Each row of photos.txt describes one photo, for example:

photos/1946439170-pink-rebecca-minkoff-bag-black_400.jpg

The first field is the photo ID and the remaining fields are its tags.
The "photos/" prefix and the "_400.jpg" suffix carry no information.
"""
import csv

# number of pairs
PAIRS = 10000

# read the file whose rows contain photoID and photoTags
with open("D:\\Downloads\\photos.txt", 'r', encoding="utf-8") as f_info:
    photoList = f_info.readlines()

# index of the last row; random.randint(0, SIZE) is inclusive on both ends
SIZE = len(photoList)-1
print("Rows of the file: "+str(SIZE))

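# read the possible tags and their weights (one "tag,weight" pair per row)
with open('C:\\Users\\Hondoh\\Desktop\\tags_count.txt', 'r', encoding="utf-8") as f_tags:
    tagList = f_tags.readlines()

# map each tag to its integer weight
weights = {}

for line in tagList:
    fields = line.strip("\n").split(",")
    weights[fields[0]] = int(fields[1])

# make sure the empty string is never treated as a known tag
weights.pop("", None)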

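def photoInfo(idNum):
    # read the idNum-th row
    row = photoList[idNum].strip("\n")

    # remove the fixed prefix and suffix; str.strip() treats its argument as
    # a set of characters and can eat into the ID or the last tag
    if row.startswith("photos/"):
        row = row[len("photos/"):]
    if row.endswith("_400.jpg"):
        row = row[:-len("_400.jpg")]
    info = row.split("-")

    # info[0] is the photo ID, the remaining fields are tags;
    # a row with a single tag is returned with an empty tag list
    if len(info)==2:
        return info[0], []
    else:
        return info[0], info[1:]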
            

def jaccard_similarity(list1, list2):
    similarity = len(set(list1) & set(list2))/len(set(list1) | set(list2))
    return similarity

def weighted_similarity(list1, list2):
    # tags shared by both images
    inter_set = set(list1) & set(list2)
    # tags present in either image
    union_set = set(list1) | set(list2)

    # sum the tag weights over each set; int() accepts weights stored
    # either as strings or as integers
    inter_score = sum(int(weights[i]) for i in inter_set)
    union_score = sum(int(weights[i]) for i in union_set)

    return inter_score / union_score

Rows of the file: 89501
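
One pitfall when cleaning the rows is worth spelling out: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal prefix or suffix. The sketch below compares it with explicit prefix and suffix removal on a made-up row; the row and the helper name `strip_affixes` are assumptions for illustration. This may explain truncated tags such as `'ba'` in results produced with `strip`.

```python
def strip_affixes(row, prefix="photos/", suffix="_400.jpg"):
    """Remove one literal prefix and one literal suffix, if present."""
    if row.startswith(prefix):
        row = row[len(prefix):]
    if row.endswith(suffix):
        row = row[:len(row) - len(suffix)]
    return row

# hypothetical row whose ID starts with "40" and whose last tag ends in "g"
row = "photos/4012345678-white-shirt-bag_400.jpg"

# character-set stripping removes too much on both ends
naive = row.strip("photos/").strip("_400.jpg")

# literal prefix/suffix removal keeps the ID and the last tag intact
clean = strip_affixes(row)

print(naive)
print(clean)
```

Here the character-based version loses the leading `40` of the ID and the trailing `g` of `bag`, because those characters also occur in `"_400.jpg"`. On Python 3.9 or later, `str.removeprefix` and `str.removesuffix` do the same job as the helper.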


In [10]:
import random

# compute both kinds of similarity for PAIRS random image pairs
for i in range(PAIRS):
    tags1 = []
    tags2 = []
    
    id1 = 0
    id2 = 0
   
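    # draw the first image; reject it if any of its tags has no weight
    while len(tags1)==0:
        id1, tags1 = photoInfo(random.randint(0, SIZE))
        if len(set(tags1) & set(weights)) != len(set(tags1)):
            tags1 = []
        
    # draw the second image the same way and make sure it differs from the
    # first; "&" binds tighter than "==", so the original id check had no effect
    while len(tags2)==0 or id1==id2:
        id2, tags2 = photoInfo(random.randint(0, SIZE))
        if len(set(tags2) & set(weights)) != len(set(tags2)):
            tags2 = []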
        
    if "" in tags1:
        tags1.remove("")
    if "" in tags2:
        tags2.remove("")

    # compute the plain and the weighted Jaccard similarity
    similarity = jaccard_similarity(tags1, tags2)
    similarity2 = weighted_similarity(tags1, tags2) 
    
    if (similarity>0.3):
        print(str(id1)+","+str(id2)+":\t"+str(round(similarity,2))+","+str(round(similarity2,2))+"\t tags:"+str((set(tags1) & set(tags2))))


5315682246,6674145735:	0.36,0.81	 tags:{'dress', 'brown', 'shoes', 'black', 'blue'}
5557891890,9887602133:	0.31,0.59	 tags:{'dark', 'brown', 'vintage', 'dress'}
9231628961,3698352693:	0.31,0.37	 tags:{'dress', 'tights', 'm', 'h', 'gray'}
3422979567,9304681999:	0.33,0.65	 tags:{'navy', 'leather', 'zara', 'black'}
3798499911,8231892321:	0.31,0.85	 tags:{'h', 'm', 'boots', 'black'}
10807544871,3192771628:	0.31,0.21	 tags:{'zara', 'gum', 'bubble', 'clutch', 'asos'}
2875438209,6410910128:	0.36,0.74	 tags:{'boots', 'black', 'm', 'jacket', 'h'}
2739958360,9525039067:	1.0,1.0	 tags:{'brim', 'creepers', 'hat', 'oasapcom', 'wide', 'shoes', 'black', 'choiescom'}
10265496524,7332148337:	0.31,0.75	 tags:{'vintage', 'shorts', 'black', 'shoes', 'blue'}
1749547802,9218132:	0.33,0.72	 tags:{'shoes', 'dress', 'black'}
11024395964,9919098527:	0.31,0.28	 tags:{'ba', 'white', 'shirt', 'shoes'}
2522199180,1999862523:	0.31,0.64	 tags:{'black', 'sneakers', 'bag', 'heather', 'gray'}
6428640187,9370291366:	0.33