# <center><font color='#333333'>Hack the Rack 2018</font></center>
## <center><font color='#808080'>Challenge 2: Image Tag Processing</font></center> 
### <center><font color='#3b5998'>Created by cyda - Yeung Wong & Carrie Lo</font></center>

--------------------------------------------------------------------------------------
![logo](https://4.bp.blogspot.com/-LAXjdvVCYCU/WxeQFKQ-1wI/AAAAAAAAACs/o8IJ1eLLAEwQYv2Az7EqQi9jODTqRx7wACK4BGAYYCw/s1000/tight%2Bbanner_with_description.png)

--------------------------------------------------------------------------------------
Please acknowledge <b>team cyda - Yeung Wong and Carrie Lo</b> when using the code

<b><font color='#3b5998'>If you find this script helpful, please feel free to endorse us on LinkedIn!</font></b>

<b>LinkedIn:</b>

Yeung Wong - https://www.linkedin.com/in/yeungwong/

Carrie Lo - https://www.linkedin.com/in/carrielsc/

--------------------------------------------------------------------------------------

#### Challenge Description

<u>Humanizing image search for inspiration</u>

This project processes the tag text of over 25,000 images provided by Li and Fung so that their products can be searched more easily and effectively. To support their daily work, we split the challenge into two parts:

- Part 1: Clean the input datasets from two different APIs and create an Image_Tag master dataset
- Part 2: Leverage a pretrained neural network to enhance the customer search experience

Example of the Image_Tag master dataset:

| pic_id | tags     |
|--------|----------|
| pic001 | "dress"  |
| pic001 | "pink"   |
| pic001 | "summer" |
| pic002 | "hat"    |
| ...    | ...      |
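
The one-row-per-tag layout above can be sketched in pandas as follows. This is only an illustration with made-up data: `tags_by_image` is a hypothetical input, and the real Part 1 cleaning was done in R.

```python
import pandas as pd

# Hypothetical per-image tag lists, mirroring the example rows above
tags_by_image = {
    "pic001": ["dress", "pink", "summer"],
    "pic002": ["hat"],
}

# Build one row per (pic_id, tag) pair
image_tag = pd.DataFrame(
    [(pic_id, tag) for pic_id, tags in tags_by_image.items() for tag in tags],
    columns=["pic_id", "tags"],
)
print(image_tag)
```

Keeping one tag per row makes filtering by a single tag a plain equality test, which is what the search functions in Section 4 rely on.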

--------------------------------------------------------------------------------------

#### Reminder

Please check the points below before running the script.

1. This script is Part 2 of the challenge. Part 1 was done in the R environment; for details, see our GitHub - cydalytics.
2. Make sure the file paths are correct and the files follow the expected hierarchy. (Please refer to <u>1.2 define the path</u> for details)

--------------------------------------------------------------------------------------

#### Data we use

Input:
- Pretrained data - GoogleNews-vectors-negative300-SLIM.bin
- Processed data - json_atr_tag_dataset.csv
- Raw data - images file (Optional: just for interactive data exploration)

Output:
- Processed data - synonyms_list (for pbi).csv (Optional: for further data visualization analysis)
> If you are interested in knowing how these synonyms can be further analysed and presented in the data visualization tools
  such as PowerBI, please visit https://cydalytics.blogspot.com/

--------------------------------------------------------------------------------------

# 1 Preliminary

## 1.1 import libraries and set global parameter

In [1]:
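# Compatibility imports (these must come before any other statement)
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# General
import os
import re
import numpy as np
import pandas as pd
# Note: images are read with imageio (imported below), so scipy.misc.imread,
# which was removed from recent SciPy releases, is not needed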

# Data Visualization
import imageio
import matplotlib.pyplot as plt
from ipywidgets import interact
from IPython.display import YouTubeVideo

# Text Mining
import nltk
import nltk.data
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Neural Networking Model
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
from sklearn.metrics import accuracy_score
import gensim
from gensim.models import KeyedVectors

# Parameter - To stop potential randomness
seed = 494 # Li & Fung Limited Stock Index
rng = np.random.RandomState(seed)



## 1.2 define the path

In [2]:
root_dir = os.path.abspath('')
# Join the path components separately so the code does not depend on Windows backslashes
pretrained_dir = os.path.join(root_dir, 'Data', 'Pretrained & External Data')
processed_dir = os.path.join(root_dir, 'Data', 'Result & Processed Data')
# image_dir is optional; we do not include the images in order to reduce the running time
# image_dir = os.path.join(root_dir, 'Data', 'Raw Data', 'Images')
# Local path used by the authors; change it to your own image folder
image_dir = 'G:/My Drive/WHY/Data Science/Projects/20180622 Hack The Rack/Dataset/Challenge 2 - Image Tagging/Images'
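
Before loading anything, it can help to confirm that the folders exist. The helper below is a minimal sketch; `missing_paths` is a hypothetical name, not part of the original script, and the demo folder name is made up.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# With the variables defined above one would call, for example:
# missing_paths([pretrained_dir, processed_dir])
# Here the current directory and a made-up folder stand in for them.
example = missing_paths([os.getcwd(), os.path.join(os.getcwd(), "no_such_folder_for_demo")])
print(example)
```

An empty list means every path was found.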

# 2 Import Data

## 2.1 import processed dataset

In [3]:
photo_keyword_df = pd.read_csv(os.path.join(processed_dir, 'json_atr_tag_dataset.csv'))
photo_keyword_df.columns = ['pid', 'pkeyword']
photo_keyword_df.head(10)

| | pid | pkeyword |
|---|---|---|
| 0 | 941234bleach1_1_20160414065445545.jpg | solid |
| 1 | 941234bleach1_1_20160414065445545.jpg | blue |
| 2 | 941234bleach1_1_20160414065445545.jpg | female |
| 3 | 941234bleach1_1_20160414065445545.jpg | long sleeve |
| 4 | 941234bleach1_1_20160414065445545.jpg | no |
| 5 | 941234bleach1_1_20160414065445545.jpg | adult |
| 6 | 140172472_3_20161214075635169.jpg | solid |
| 7 | 140172472_3_20161214075635169.jpg | gray |
| 8 | 140172472_3_20161214075635169.jpg | sleeveless |
| 9 | 538037checkpattern_1_20160525091219305.jpg | geometric print |


## 2.2 import images (Optional)

Importing the images is not recommended because it takes a long time.

In practice, the app would connect directly to a cloud service that maps each image to its image ID.

To show that the approach works, we use 1000 pictures as a demonstration.

In [4]:
# for demonstration
image_id = list(set(photo_keyword_df['pid']))[0:1000]
len(image_id)

1000
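
Slicing a `set` of strings can return a different subset from one Python session to the next, because the iteration order of a set of strings is not fixed across sessions. If a repeatable demonstration subset is wanted, one option is to sort the ids and sample with a seeded generator. The sketch below uses a toy table and a hypothetical helper name, `sample_image_ids`; the seed mirrors the `seed` defined in 1.1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for photo_keyword_df (hypothetical file names)
toy_df = pd.DataFrame({
    "pid": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "c.jpg", "d.jpg"],
    "pkeyword": ["solid", "blue", "gray", "solid", "female", "adult"],
})

def sample_image_ids(df, n, seed=494):
    """Pick up to n distinct image ids, reproducibly for a given seed."""
    unique_ids = sorted(df["pid"].unique())      # fixed order before sampling
    rng = np.random.RandomState(seed)
    n = min(n, len(unique_ids))
    picked = rng.choice(unique_ids, size=n, replace=False)
    return [str(p) for p in picked]

sample = sample_image_ids(toy_df, 3)
print(sample)
```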

In [5]:
# for full set of images
'''
image_id = np.asarray(os.listdir(image_dir))
image_id

for j in reversed(range(len(image_id))):
    if image_id[j][0] == ".":
        hidden_index = image_id.tolist().index(image_id[j])
        image_id = np.delete(image_id, hidden_index)

if 'desktop.ini' in image_id:
    desktop_ini_index = image_id.tolist().index('desktop.ini')
    image_id = np.delete(image_id, desktop_ini_index)
'''


In [6]:
image_pic = []

# Read each image from disk in the same order as image_id
for img_id_temp in image_id:
    filepath = os.path.join(image_dir, img_id_temp)
    image_pic.append(imageio.imread(filepath))

In [7]:
image_key_pointer_df = pd.DataFrame(image_id)
image_key_pointer_df.columns = ['pid']

image_key_pointer_df = image_key_pointer_df.reset_index()
del image_key_pointer_df['index']

image_key_pointer_df['image_pointer'] = range(len(image_pic))

image_key_pointer_df.head()

| | pid | image_pointer |
|---|---|---|
| 0 | m_1249592_1_20170919181229000.jpg | 0 |
| 1 | 140170842_1_20161216081959358.jpg | 1 |
| 2 | p_1440664_1_20170925095647000.jpg | 2 |
| 3 | 542792sailorwhitestp_1_20161209022043798.jpg | 3 |
| 4 | h_831855_1_20170827164527000.jpg | 4 |


## 2.3 create the master_dataset

In [8]:
# For those who run the image part
master_dataset = photo_keyword_df.merge(image_key_pointer_df, on='pid')

# For those who run without the image part
'''
master_dataset = photo_keyword_df
'''

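
Note that `merge` performs an inner join by default, so tags belonging to images that were not loaded drop out of the master dataset. The toy example below (made-up file names) illustrates the difference from a left join.

```python
import pandas as pd

# Toy tag table and pointer table; only a.jpg has a loaded image
tags = pd.DataFrame({
    "pid": ["a.jpg", "a.jpg", "b.jpg"],
    "pkeyword": ["solid", "blue", "gray"],
})
pointers = pd.DataFrame({"pid": ["a.jpg"], "image_pointer": [0]})

inner = tags.merge(pointers, on="pid")              # default how='inner': b.jpg is dropped
left = tags.merge(pointers, on="pid", how="left")   # b.jpg is kept with a missing pointer

print(len(inner), len(left))
```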

# 3 Image Exploration (Optional) (for those who import the images)

## 3.1 explore the images

In [9]:
def browse_images(image_id):
    n = len(image_id)
    def view_image(i):
        plt.imshow(image_pic[i], cmap=plt.cm.gray_r, interpolation='nearest')
        plt.xlabel(image_id[i], fontsize = 12)
        plt.show()
    interact(view_image, i=(0,n-1))
    
browse_images(image_id)


## 3.2 explore the tags

In [10]:
def browse_master_dataset(image_id):
    n = len(image_id)
    def view_master_dataset(i):
        if len(master_dataset[(master_dataset.pid == image_id[i])]) == 0:
            print ('No Related Keywords')
        if len(master_dataset[(master_dataset.pid == image_id[i])]) != 0:
            print(master_dataset[(master_dataset.pid == image_id[i])][['pkeyword']].to_string(index=False, header = False))
    interact(view_master_dataset, i=(0,n-1))
    
browse_master_dataset(image_id)


## 3.3 explore the images and tags

In [11]:
def browse_images2(image_id):
    n = len(image_id)
    def view_image2(i):
        plt.imshow(image_pic[i], cmap=plt.cm.gray_r, interpolation='nearest')
        plt.xlabel(image_id[i], fontsize = 12)
        plt.show()
        
        if len(master_dataset[(master_dataset.pid == image_id[i])]) == 0:
            print ('No Related Keywords')
        if len(master_dataset[(master_dataset.pid == image_id[i])]) != 0:
            print(master_dataset[(master_dataset.pid == image_id[i])][['pkeyword']].to_string(index=False, header = False)) 
    interact(view_image2, i=(0,n-1))

In [12]:
browse_images2(image_id)


# 4 Search Function

## 4.1 text preprocessing function

In [13]:
def text_preprocessing(_text, method='lemm'):
    
    # Tokenize and keep only english chars
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z]', ' ', _text))
    # Change to lower case
    words = [x.lower() for x in words]
    
    # keep words length > 1
    words = [x for x in words if len(x)>1]
    
    # Lemmatizing or stemming
    if method == 'lemm':
        wnl = WordNetLemmatizer()
        words = [wnl.lemmatize(w) for w in words]
    elif method == 'stem':
        port = PorterStemmer()
        words = [port.stem(w) for w in words]

    return ' '.join(words)
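
To see what the cleaning steps do to a tag, here is a reduced sketch that keeps only the regex, lower-casing and length filter. It deliberately leaves out the lemmatizing/stemming step so that it runs without the NLTK data files; `normalize_tag` is a hypothetical name used for illustration only.

```python
import re

def normalize_tag(text):
    """Keep letters only, lower-case, and drop one-character tokens."""
    tokens = re.sub('[^a-zA-Z]', ' ', text).split()
    tokens = [t.lower() for t in tokens]
    return ' '.join(t for t in tokens if len(t) > 1)

print(normalize_tag('V-Neck T shirt!'))   # prints: neck shirt
print(normalize_tag('Long  Sleeve'))      # prints: long sleeve
```

One consequence worth knowing: single-letter tokens are removed, so a query such as "v neck" is reduced to "neck" by this kind of preprocessing.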

## 4.2 keyword search

In [14]:
def keyword_search(keyword_input):
    keyword_input = text_preprocessing(keyword_input, 'lemm')
    if len(master_dataset[(master_dataset.pkeyword == keyword_input)]) == 0:
        print ('No Related Keywords')
        
    if len(master_dataset[(master_dataset.pkeyword == keyword_input)]) != 0:
        temp_dataset = master_dataset[(master_dataset.pkeyword == keyword_input)][['pid','image_pointer']]
        n = len(temp_dataset)
        def display_result(i):
            plt.imshow(image_pic[i], cmap=plt.cm.gray_r, interpolation='nearest')
            plt.xlabel(image_id[i], fontsize = 12)
            plt.show()        
        interact(display_result, i=np.asarray(temp_dataset['image_pointer']))

In [15]:
keyword_search('female')

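
Stripped of the plotting and widget code, the lookup is an equality filter on the tag column. The sketch below shows that core step on a toy table; `toy_master` and `find_image_pointers` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for master_dataset
toy_master = pd.DataFrame({
    "pid": ["a.jpg", "a.jpg", "b.jpg", "c.jpg"],
    "pkeyword": ["female", "blue", "female", "male"],
    "image_pointer": [0, 0, 1, 2],
})

def find_image_pointers(master, keyword):
    """Return the distinct image pointers whose tags include `keyword`."""
    hits = master.loc[master["pkeyword"] == keyword, "image_pointer"]
    return hits.drop_duplicates().tolist()

print(find_image_pointers(toy_master, "female"))
print(find_image_pointers(toy_master, "hat"))
```

Returning a plain list keeps the search logic testable without a notebook front end.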

## 4.3 combined keyword search

In [16]:
def combine_keyword_search(keyword_input1, keyword_input2):
    keyword_input1 = text_preprocessing(keyword_input1, 'lemm')
    keyword_input2 = text_preprocessing(keyword_input2, 'lemm')
    
    if (len(master_dataset[(master_dataset.pkeyword == keyword_input1)]) == 0)or(len(master_dataset[(master_dataset.pkeyword == keyword_input2)]) == 0):
        print ('No Related Keywords')
        
    if (len(master_dataset[(master_dataset.pkeyword == keyword_input1)]) != 0)and(len(master_dataset[(master_dataset.pkeyword == keyword_input2)]) != 0):
        temp_dataset = master_dataset[(master_dataset.pkeyword == keyword_input1)]
        temp_pid = list(temp_dataset['pid'])
        temp_dataset2 = master_dataset[master_dataset.pid.isin(temp_pid)]
        temp_dataset3 = temp_dataset2[(temp_dataset2.pkeyword == keyword_input2)][['pid','image_pointer']]
        n = len(temp_dataset3)
        def display_result(i):
            plt.imshow(image_pic[i], cmap=plt.cm.gray_r, interpolation='nearest')
            plt.xlabel(image_id[i], fontsize = 12)
            plt.show()        
        # Offer only the images that carry both keywords
        interact(display_result, i=np.asarray(temp_dataset3['image_pointer']))

In [17]:
combine_keyword_search('male', 'round')

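
The same idea extends to any number of keywords by collecting the tags of each image into a set and testing for a subset. This is a sketch of one possible generalization, not the notebook's own method; `find_images_with_all` and the toy data are hypothetical.

```python
import pandas as pd

# Toy stand-in for master_dataset
toy_master = pd.DataFrame({
    "pid": ["a.jpg", "a.jpg", "b.jpg", "b.jpg", "c.jpg", "c.jpg"],
    "pkeyword": ["male", "round", "male", "v neck", "female", "round"],
})

def find_images_with_all(master, keywords):
    """Return the ids of images whose tag set contains every keyword."""
    wanted = set(keywords)
    tags_per_image = master.groupby("pid")["pkeyword"].agg(set)
    return sorted(pid for pid, tags in tags_per_image.items() if wanted <= tags)

print(find_images_with_all(toy_master, ["male", "round"]))
```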

## 4.4 relevant keyword search

In [18]:
def relevant_search_keyword(keyword_input):
    # Match the raw keyword: text_preprocessing drops one-letter tokens,
    # which would turn 'v neck' into 'neck' and miss the tag
    matched = master_dataset[(master_dataset.pkeyword == keyword_input)]
    if len(matched) == 0:
        print ('No Related Keywords')
        return

    # Collect every tag of the images that carry the keyword
    temp_pid = list(matched['pid'])
    temp_dataset2 = master_dataset[master_dataset.pid.isin(temp_pid)]

    # Count each co-occurring tag and scale by the most frequent one
    my_tab = pd.crosstab(index=temp_dataset2['pkeyword'], columns="count")
    my_tab = my_tab.sort_values('count', ascending=False)
    top_count = my_tab['count'].iloc[0]
    my_tab['count2'] = my_tab['count'] / top_count

    # Drop tags at least 80% as frequent as the top one
    # (this normally includes the searched keyword itself)
    my_tab = my_tab[my_tab.count2 < 0.8]

    # Skip the first remaining tag and keep the next nine
    my_tab = my_tab[1:10]

    print(list(my_tab.index))

In [19]:
relevant_search_keyword('v neck')

['no', 'solid', 'adult', 'black', 'long sleeve', 'short half', 'sleeveless', 'floral print', 'casual']
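
The core of this step is counting which other tags appear on the same images as the keyword. The sketch below shows that counting on a toy table with `value_counts`; it omits the 0.8 frequency cut-off used above, and `co_occurring_tags` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for master_dataset
toy_master = pd.DataFrame({
    "pid": ["a.jpg", "a.jpg", "a.jpg", "b.jpg", "b.jpg", "c.jpg", "c.jpg"],
    "pkeyword": ["v neck", "solid", "black", "v neck", "solid", "round", "solid"],
})

def co_occurring_tags(master, keyword, top_n=3):
    """Count the other tags found on images that carry `keyword`."""
    pids = master.loc[master["pkeyword"] == keyword, "pid"]
    related = master[master["pid"].isin(pids) & (master["pkeyword"] != keyword)]
    return related["pkeyword"].value_counts().head(top_n)

result = co_occurring_tags(toy_master, "v neck")
print(result)
```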


# 5 Synonyms Tagging

This part is optional since it is not related to the challenge we are facing

But it is useful for doing further analysis on showing how data can be leveraged and presented in the data visualization tool

If you are interested in this field, feel free to visit https://cydalytics.blogspot.com/

## 5.1 import pretrained neural network model

In [20]:
pretrained_word_embedding = os.path.join(pretrained_dir, 'GoogleNews-vectors-negative300-SLIM.bin')
model = KeyedVectors.load_word2vec_format(pretrained_word_embedding, binary=True)

## 5.2 synonyms analysis

In [21]:
# model is already a KeyedVectors object, so most_similar is called on it directly
model.most_similar(['hoodie'], topn=10)


[('hoody', 0.7738105058670044),
 ('sweatshirt', 0.7398683428764343),
 ('hoodies', 0.7226213216781616),
 ('beanie', 0.6617897748947144),
 ('Hoodie', 0.6560907959938049),
 ('jacket', 0.6535500288009644),
 ('balaclava', 0.6209901571273804),
 ('bandana', 0.612551748752594),
 ('bandanna', 0.5987221002578735),
 ('shirt', 0.5979430675506592)]

In [22]:
model.most_similar(['tshirt'], topn=10)


[('Tshirt', 0.6293981075286865),
 ('shirt', 0.5718910694122314),
 ('shirts', 0.5639886856079102),
 ('onesie', 0.5497466325759888),
 ('snuggie', 0.5451359748840332),
 ('hoodie', 0.543143630027771),
 ('gbr', 0.5220522880554199),
 ('woot', 0.515326738357544),
 ('sweatshirt', 0.5147799849510193),
 ('underoos', 0.5078185796737671)]

In [23]:
model.most_similar(['handbag'], topn=10)


[('handbags', 0.7093594670295715),
 ('Handbag', 0.6531553268432617),
 ('satchel', 0.6487715244293213),
 ('wristlet', 0.6127949357032776),
 ('wallet', 0.5805193781852722),
 ('purse', 0.5770999193191528),
 ('holdall', 0.5732836723327637),
 ('briefcase', 0.5614685416221619),
 ('bag', 0.5601797103881836),
 ('necklace', 0.559670627117157)]
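
The scores in these lists are cosine similarities between word vectors. As a rough illustration of the ranking, the sketch below applies the same measure to three made-up 3-dimensional vectors; the real model uses 300-dimensional vectors, and `toy_vectors`, `cosine` and `most_similar_toy` are hypothetical names.

```python
import numpy as np

# Made-up vectors for illustration only
toy_vectors = {
    "hoodie": np.array([0.9, 0.1, 0.0]),
    "sweatshirt": np.array([0.8, 0.2, 0.1]),
    "handbag": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar_toy("hoodie", toy_vectors)
print(ranked)
```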

## 5.3 outlier detection

In [24]:
model.doesnt_match("Dress Sunglasses Hat shoelace".split())

'shoelace'

In [25]:
model.doesnt_match("Microsoft Waterbottle Water Agua".split())

'Microsoft'
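
The idea behind this kind of outlier detection can be sketched as follows, assuming the odd word is the one whose unit vector is least similar to the normalized mean of all the words' unit vectors. The vectors are made up and `odd_one_out` is a hypothetical name, so treat this as an approximation of the approach rather than the library's exact code.

```python
import numpy as np

# Made-up 2-dimensional vectors for illustration only
toy_vectors = {
    "dress": np.array([0.9, 0.1]),
    "hat": np.array([0.8, 0.3]),
    "sunglasses": np.array([0.7, 0.2]),
    "shoelace": np.array([0.1, 0.9]),
}

def odd_one_out(words, vectors):
    """Return the word least similar to the mean direction of the group."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = unit @ mean          # cosine of each word with the mean
    return words[int(np.argmin(similarities))]

outlier = odd_one_out(["dress", "sunglasses", "hat", "shoelace"], toy_vectors)
print(outlier)
```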

## 5.4 for further analysis on data visualization (Optional)

In [26]:
'''
labels = ['synonyms', 'percentage']
frames = []

for i in photo_keyword_df['pkeyword']:
    try:
        temp = model.most_similar([i], topn=10)
    except KeyError:
        # Skip tags that are not in the pretrained vocabulary
        continue
    temp_df = pd.DataFrame.from_records(temp, columns=labels)
    temp_df['Keyword'] = i
    frames.append(temp_df)

# Concatenate once at the end instead of appending inside the loop
result = pd.concat(frames, ignore_index=True)

result.head()

result.to_csv(os.path.join(processed_dir, 'synonyms_list (for pbi).csv'))
'''

-----------------------------------------
# <center><font color='#FF0000'>~ This</font> <font color='#FF7F00'>is</font> <font color='#FFFF00'>the</font> <font color='#00FF00'>end</font> <font color='#00FFFF'>of</font> <font color='#0000FF'>the</font> <font color='#8B00FF'>script ~</font></center>

-----------------------------------------
<b><font color='#3b5998'>If you appreciate our hard work, please endorse us on LinkedIn!</font></b>

<b>LinkedIn:</b>

Yeung Wong - https://www.linkedin.com/in/yeungwong/

Carrie Lo - https://www.linkedin.com/in/carrielsc/