# Section 1: Concept-based image indexing
- I use each product's `imageUrlStr` as the basis for deciding whether the product is a duplicate.
- Algorithm: hashing
- Time complexity for finding duplicates among N images: O(N) on average, assuming constant-time hash-table lookups (hashing overhead disregarded)

In [1]:
import pandas as pd


In [2]:
df = pd.read_csv('data/datafile.csv', dtype=str) 
df.shape[0]

346902

In [3]:
df.head(3)

Unnamed: 0,productId,title,description,imageUrlStr,mrp,sellingPrice,specialPrice,productUrl,categories,productBrand,...,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52
0,TOPE9ABBZU3HZRHN,Citrine Casual Short Sleeve Printed Women's Pi...,This beautiful printed modal top from Citrine ...,http://img.fkcdn.com/image/top/r/h/n/1-1-wwtpw...,1099,329,329,http://dl.flipkart.com/dl/citrine-casual-short...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",Citrine,...,,,,,,,,,,
1,TOPE9ABBBTJYDSQE,Citrine Casual Short Sleeve Printed Women's Pi...,This beautiful printed modal top from Citrine ...,http://img.fkcdn.com/image/top/r/h/n/1-1-wwtpw...,1099,329,329,http://dl.flipkart.com/dl/citrine-casual-short...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",Citrine,...,,,,,,,,,,
2,TOPE9AZZSMSZFYAM,Leelan Casual Short Sleeve Solid Women's Black...,,http://img.fkcdn.com/image/top/y/a/m/1-1-10009...,524,262,262,http://dl.flipkart.com/dl/leelan-casual-short-...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",Leelan,...,,,,,,,,,,


### Remove entries with a duplicate productId, since productId is the primary key

In [4]:
df = df.drop_duplicates(subset='productId', keep="last")
df = df.reset_index(drop=True)
df.shape[0]

346813
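
As a small illustration of what `keep="last"` does, here is a sketch on a toy frame. The rows and values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): productId "A" appears twice
toy = pd.DataFrame({
    "productId": ["A", "B", "A"],
    "title": ["old A", "only B", "new A"],
})

# keep="last" retains the final occurrence of each productId;
# reset_index(drop=True) renumbers the surviving rows from 0
deduped = toy.drop_duplicates(subset="productId", keep="last").reset_index(drop=True)
print(deduped)
```

The first "A" row is dropped, so the surviving rows are "B" followed by the later "A".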

### Iterate over the dataset. If a URL is already in the hash table, append the row index as a duplicate; otherwise add a new entry.

In [5]:
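# Map each image URL to the list of row indices that use it
distinct = {}
for i in range(df.shape[0]):
    # imageUrlStr is split on ';' and the second entry is used as the key
    url = df.imageUrlStr[i].split(';')[1]
    if url in distinct:
        distinct[url].append(i)
    else:
        distinct[url] = [i]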

In [6]:
len(distinct)

87816

In [7]:
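# For every URL shared by more than one row, map the productId of the
# first row to the productIds of the remaining rows
dupes = {}
for key, value in distinct.items():
    if len(value) > 1:
        dupes[df.productId[value[0]]] = [df.productId[x] for x in value[1:]]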

In [8]:
len(dupes)

75447
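
The two passes above can be traced on a handful of rows. This is a minimal sketch with hypothetical product IDs and `example.com` URLs standing in for the DataFrame; it uses `collections.defaultdict` in place of the explicit `if`/`else`, which is a stylistic substitution rather than the notebook's own code.

```python
from collections import defaultdict

# Hypothetical (productId, image URL) pairs standing in for the DataFrame rows
rows = [
    ("P1", "http://example.com/a.jpeg"),
    ("P2", "http://example.com/b.jpeg"),
    ("P3", "http://example.com/a.jpeg"),
    ("P4", "http://example.com/a.jpeg"),
]

# Pass 1: bucket row indices by URL (one hash lookup per row)
by_url = defaultdict(list)
for i, (_, url) in enumerate(rows):
    by_url[url].append(i)

# Pass 2: keep only URLs shared by more than one row, keyed by the first product
toy_dupes = {
    rows[idx[0]][0]: [rows[j][0] for j in idx[1:]]
    for idx in by_url.values()
    if len(idx) > 1
}
print(toy_dupes)  # {'P1': ['P3', 'P4']}
```

Each row is visited once in the first pass and each bucket once in the second, which is where the linear running time comes from.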

### Dump the output to a JSON file

In [9]:
import json
# Use a context manager so the file is flushed and closed after writing
with open("out.json", "w") as fp:
    json.dump(dupes, fp, indent=4)

# Section 2: Content-based image similarity
- I have created a module called `image_util`, which has functions for checking content similarity.
- Algorithm: Scale-Invariant Feature Transform (SIFT)
- Other methods, such as CNN-driven content-based image retrieval (CBIR), can also be used.

In [10]:
# import requests
# import os

# name_map = {}

# def download_images():
#     '''A script to download images (If required) to check for duplication.
#     '''
#     # Create a dictionary that maps url with image name
#     i = 0
#     for key, values in distinct.items():
#         i += 1
#         name_map[str(i) + '.jpeg'] = key

#     # Download
#     saved = os.listdir('data/images/')
#     i = 0
#     for key, value in name_map.items():
#         i += 1
#         print(i)

#         url = value
#         if key not in saved:
#             page = requests.get(url)

#             f_name = 'data/images/' + key
#             with open(f_name, 'wb') as f:
#                 f.write(page.content)

### Driver function: takes two products as input and checks whether they are duplicates
- The function first checks whether the productId is the same.
- It then checks whether the imageUrlStr is the same.
- This is followed by a content similarity check using the `are_similar` function of `image_util`.
    - `are_similar` takes two images as input, along with a threshold value (default = 60).
    - It uses the Scale-Invariant Feature Transform to find similar points.
    - If the number of similar points exceeds the threshold, the images are considered similar.
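
The internals of `image_util.are_similar` are not shown in this notebook. As a rough sketch of how "similar points" are commonly counted from SIFT descriptors, the snippet below applies Lowe's ratio test to two lists of descriptor vectors. The names `count_good_matches` and `are_similar_sketch`, the `ratio` value of 0.75, and the toy 2-D descriptors are all assumptions for illustration. Real SIFT descriptors are 128-dimensional and would come from an image library.

```python
import math

def count_good_matches(desc_a, desc_b, ratio=0.75):
    """Count descriptors in desc_a whose best match in desc_b passes the ratio test."""
    good = 0
    for d in desc_a:
        # Distances from this descriptor to every descriptor of the other image
        dists = sorted(math.dist(d, e) for e in desc_b)
        # Accept only if the best match is clearly better than the second best
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            good += 1
    return good

def are_similar_sketch(desc_a, desc_b, threshold=60):
    """Hypothetical stand-in: similar if the good-match count exceeds the threshold."""
    return count_good_matches(desc_a, desc_b) > threshold

# Toy descriptors (hypothetical): two points have a clear counterpart, one is ambiguous
a = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
b = [(0.1, 0.0), (10.0, 10.2), (30.0, 30.0)]

print(count_good_matches(a, b))  # 2
```

With only 2 good matches, the default threshold of 60 would classify these toy inputs as different.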

In [11]:
import image_util

def areDuplicates(df1, df2):
    '''Check whether two products (single-row DataFrames) are duplicates.'''
    # Same product ID
    if df1.productId.iloc[0] == df2.productId.iloc[0]:
        print('Same Product Id')
        return True

    url1 = df1.imageUrlStr.iloc[0].split(';')[1]
    url2 = df2.imageUrlStr.iloc[0].split(';')[1]
    # Same image URL
    if url1 == url2:
        print('Same Image Url')
        return True

    # Otherwise, download both images and compare their content
    image1 = image_util.url_to_image(url1)
    image2 = image_util.url_to_image(url2)
    print('Images Downloaded. Checking Similarity')
    return image_util.are_similar(image1, image2, threshold=60)
        

In [12]:
df1 = df.iloc[[22208]]
df2 = df.iloc[[22210]]
areDuplicates(df1, df2)

Images Downloaded. Checking Similarity
Number of good points:  2
The images are different


False

In [13]:
df1 = df.iloc[[22208]]
df2 = df.iloc[[22216]]
areDuplicates(df1, df2)

Same Image Url


True