# Sample Notebook

**This is a sample showing how to use the following five classes**\
\
Data_Preprocessing\
URL_Matching\
Face_Name_Matching\
Model_Ensembling\
Image_Caption.py\
\
\
Requirements:\
Python 3.7\
\
Libraries:\
tensorflow\
pytorch\
py-googletrans\
NLTK\
gensim\
cv2\
icrawler\
DeepFace\
scipy\
matplotlib\
\
What you can do with these classes:

#### Data Preprocessing
- Combine the first two batches of files for training
- Use the third batch for validation
- Crawl training, validation, and test images from the given URLs
- Extract features: image id, image url, article_title
- Translate article titles into English using the Google Translate API (https://github.com/ssut/py-googletrans)

#### URL Matching based Method
- Feature extraction: extract the article URL and the image URL from the provided file
- Remove manually defined stop words from the URLs
- URL tokenization
- URL comparison: an image URL and an article URL are considered a match if they share more than one token
- Sort the list of candidate images by the number of shared tokens
- Evaluate the performance of this URL-matching method using MR100 on both the training and the validation dataset
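
The steps above can be sketched as follows. This is a minimal illustration rather than the notebook's `URL_Matching` class: the stop-word list, the function names `tokenize_url` and `rank_images`, and the example URLs are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical stop-word list; the real list is defined manually in the project.
STOP_WORDS = {"www", "com", "de", "html", "jpg", "http", "https", "image", "images"}

def tokenize_url(url):
    """Split host and path on non-alphanumeric characters and drop stop words."""
    parsed = urlparse(url)
    raw_tokens = re.split(r"[^a-z0-9]+", (parsed.netloc + parsed.path).lower())
    return {tok for tok in raw_tokens if tok and tok not in STOP_WORDS}

def rank_images(article_url, image_urls, min_common=2):
    """Return (image_id, n_common) pairs sharing at least min_common tokens, best first."""
    article_tokens = tokenize_url(article_url)
    scored = []
    for image_id, image_url in image_urls.items():
        n_common = len(article_tokens & tokenize_url(image_url))
        if n_common >= min_common:  # "more than one" shared token
            scored.append((image_id, n_common))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

images = {
    "img_a": "https://example.com/images/berlin-marathon-2019-winner.jpg",
    "img_b": "https://example.com/images/weather-forecast.jpg",
}
ranking = rank_images("https://example.com/sport/berlin-marathon-winner.html", images)
print(ranking)
```

In this toy example the shared host name also counts as a token; a realistic stop-word list would probably include it.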

#### Image Captioning based Model
- Acquire image captions from the pre-trained image captioning model (https://github.com/ruotianluo/ImageCaptioning.pytorch)
- Calculate the Word Mover's Distance (WMD) between each pair of image caption and article title
- Sort the list of candidate images by WMD
- Evaluate the performance of this image-captioning method using MR100 on both the training and the validation dataset
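
The MR100 evaluation used by both methods can be read as "the fraction of articles whose true image appears among the top 100 candidates". The sketch below illustrates that reading; the function name `match_rate_at_k` and the explicit `truth` mapping are assumptions for the example (the notebook's own `eval_cap_sim` compares the article key against the candidate keys directly).

```python
def match_rate_at_k(ranked, truth, k=100):
    """Fraction of articles whose true image is among their top-k candidates.

    ranked: article id -> list of image ids, best candidate first
    truth:  article id -> id of the image that really belongs to the article
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for article_id, candidates in ranked.items()
        if truth.get(article_id) in candidates[:k]
    )
    return hits / len(ranked)

ranked = {"a1": ["i3", "i1", "i2"], "a2": ["i2", "i3", "i1"]}
truth = {"a1": "i1", "a2": "i1"}
print(match_rate_at_k(ranked, truth, k=2))
```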

#### Face Matching
##### Step 1:  Create a specific training dataset for face-name matching
- An article-image pair is used for training this model if it satisfies the following two conditions: 1. the article title includes a person's name, 2. the image shows a human face
- Extract the person's name from the article title
- Remove an image from this specific training dataset if no face can be detected in it by any of several face detection frameworks
- Build connections between the extracted name and the corresponding human face image
- If the training dataset contains no image connected to the extracted name, crawl five face images from the web using the extracted name as the keyword

##### Step 2: Face Name Matching
- Extract the persons' names from the test article titles
- Find the corresponding face images in the training dataset created in Step 1
- Encode the face images into feature vectors
- Compare the corresponding face images with each test face image by calculating the cosine distance between the two feature vectors
- Two face images are regarded as matched if the cosine distance between their feature vectors is less than or equal to 0.4
- Sort the list of candidate images by both cosine distance and total number of matches
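
The matching rule in Step 2 can be illustrated with plain Python. This is a small sketch under the assumption that both feature vectors are non-zero; the helper names `cosine_distance` and `is_match` are made up for the example, and the notebook itself delegates this comparison to DeepFace.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def is_match(u, v, threshold=0.4):
    """Two faces match if their cosine distance is at most the threshold."""
    return cosine_distance(u, v) <= threshold

close = is_match([1.0, 0.0], [0.8, 0.6])   # distance is about 0.2
far = is_match([1.0, 0.0], [0.0, 1.0])     # distance is 1.0
print(close, far)
```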

##### Step 3: Evaluation

##### Step 4: Prediction


In [61]:
import os

def write_wmd_sim(wmd_sim_file, sim_result):
    # Append one "article_id<TAB>image_file<TAB>wmd" row per candidate image.
    with open(wmd_sim_file, "a") as f:
        for key, v in sim_result.items():
            for item in v:
                result=key+"\t"+os.path.basename(item[0])+"\t"+str(item[1])+"\n"
                f.write(result)

In [74]:
from datetime import datetime
def img_cap_similarity(caption_file, title_file, wmd_sim_file, title_eng_idx):
    caption_dict=get_caption (caption_file)
    ar_id_title=get_ar_id_title (title_file, title_eng_idx)
    sim_result=cal_sim(ar_id_title, caption_dict)
    write_wmd_sim(wmd_sim_file, sim_result)
    lines_caption_sim = []
    with open(wmd_sim_file) as f:
        lines_caption_sim = f.readlines()
    ar_cap_sim_dic={}
    for line in lines_caption_sim:
        segs=line.strip().split("\t")
        if segs[0] not in ar_cap_sim_dic:
            ar_cap_sim_dic[segs[0]]={}
        ar_cap_sim_dic[segs[0]][segs[1]]= -float(segs[2])
    sort_final_result=sort_dict(ar_cap_sim_dic)
    return sort_final_result

In [75]:
def sort_dict(input_dict):
    # Sort every inner dictionary by value, largest first.
    result={}
    for k, v in input_dict.items():
        sort_v=dict(sorted(v.items(), key=lambda item: item[1], reverse=True))
        result[k]=sort_v
    return result

In [76]:
def eval_cap_sim (sort_final_result):
    count=0
    for key, value in sort_final_result.items():
        first_tuple_elements=[]
        for a_tuple in value:
            first_tuple_elements.append(a_tuple)
        if key in first_tuple_elements[0:100]:
            count+=1
    return count/len(sort_final_result)

In [None]:
train_result = img_cap_similarity(r"processed_data\data\train_image_caption_result.txt", \
                                  r"processed_data\data\train_title_eng.tsv",\
                                  r"result/tr_caption_sim_wmd.tsv", 3)

In [None]:
sort_final_result = img_cap_similarity(r"processed_data\data\eval_image_caption_result.txt",
                                       r"processed_data\data\eval_title_eng.tsv",
                                       r"result/eval_caption_sim_wmd.tsv", 3)

In [87]:
print(eval_cap_sim (sort_final_result))

0.042312526183493925


In [None]:
# Raw strings are required here: in a normal string "\t" in "\test_title_eng.tsv" is a tab.
test_result = img_cap_similarity(r"processed_data\data\test_image_caption_result.txt",
                                 r"processed_data\data\test_title_eng.tsv",
                                 r"result/test_caption_sim_wmd.tsv", 2)

## Face Matching

### Step 1:  Create a specific training dataset for face-name matching

#### Name Extraction

In [None]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os
import shutil
from shutil import copyfile
from deepface import DeepFace
import cv2
import time
import matplotlib.pyplot as plt

In [None]:
import os
import shutil
import time
import warnings
from datetime import datetime
from shutil import copyfile

import cv2
import icrawler
from deepface import DeepFace
from icrawler.builtin import GoogleImageCrawler

In [None]:
java_path = r"C:\Users\yuxia\Documents\java-se-8u41-ri\bin\java.exe"
os.environ['JAVAHOME'] = java_path
st = StanfordNERTagger(r'C:\Users\yuxia\Downloads\stanford-ner-4.2.0\stanford-ner-2020-11-17\classifiers\english.all.3class.distsim.crf.ser.gz',
                           r'C:\Users\yuxia\Downloads\stanford-ner-4.2.0\stanford-ner-2020-11-17\stanford-ner.jar',
                           encoding='utf-8')

In [None]:
def concat_name(classified_text):
    i=0
    name_list=[]
    while i < len(classified_text)-1:
        if classified_text[i][1] == 'PERSON':
            name = classified_text[i][0]
            if classified_text[i+1][1]=='PERSON':
                name+=" "+classified_text[i+1][0]
                i+=1
            name_list.append(name)
        i+=1
    if i == len(classified_text)-1 and classified_text[i][1] == 'PERSON':
        name_list.append(classified_text[i][0])
    return name_list


def add_title_name(tr_file, output_file):
    # Append a comma-separated list of person names as a new last column.
    # Note: the first input line is skipped and no header row is written.
    with open(tr_file, encoding="utf8") as a_file, \
         open(output_file, 'a', encoding="utf-8") as the_file:
        next(a_file)
        for line in a_file:
            title_eng=line.split("\t")[2]
            tokenized_text = word_tokenize(title_eng)
            classified_text = st.tag(tokenized_text)
            names=concat_name(classified_text)
            if len(names)>0:
                names_str = ','.join(names)
                print(names_str)
                new_line=line.strip("\n")+"\t"+names_str+"\n"
            else:
                # Keep the column present (a single space) when no name is found.
                new_line=line.strip("\n")+"\t "+"\n"
            the_file.write(new_line)
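
`concat_name` joins at most two adjacent `PERSON` tokens, so a three-part name is split into two entries. A possible variant that merges runs of any length is sketched below; `merge_person_runs` is a hypothetical helper, not part of the notebook, and it assumes the same `(token, tag)` pairs that `st.tag` returns.

```python
from itertools import groupby

def merge_person_runs(tagged_tokens):
    """Join every run of consecutive PERSON tokens into one name string."""
    names = []
    for is_person, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "PERSON"):
        if is_person:
            names.append(" ".join(token for token, _tag in group))
    return names

tagged = [
    ("Angela", "PERSON"), ("Merkel", "PERSON"), ("meets", "O"),
    ("Jean", "PERSON"), ("Claude", "PERSON"), ("Juncker", "PERSON"),
    ("in", "O"), ("Berlin", "LOCATION"),
]
names = merge_person_runs(tagged)
print(names)
```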

In [None]:
tr_title_eng_file = r'processed_data\data\train_title_eng.tsv'
eval_title_eng_file = r'processed_data\data\eval_title_eng.tsv'
test_title_eng_file = r'processed_data\data\test_title_eng.tsv'
tr_title_eng_name_file = r'processed_data\data\train_title_eng_name.tsv'
eval_title_eng_name_file = r'processed_data\data\eval_title_eng_name.tsv'
test_title_eng_name_file = r'processed_data\data\test_title_eng_name.tsv'
add_title_name(tr_title_eng_file, tr_title_eng_name_file)
add_title_name(eval_title_eng_file, eval_title_eng_name_file)
add_title_name(test_title_eng_file, test_title_eng_name_file)

In [None]:
filenames=[r"processed_data\data\train_title_eng_name.tsv", r"processed_data\data\eval_title_eng_name.tsv"]
output_file=r"processed_data\data\train_eval_title_eng_name.tsv"
combine_files(filenames, output_file, False)

#### Processing the face images in the training/validation dataset
- Group images by the names extracted from the corresponding article
- Create name-to-index folder mappings so that every image path handed to OpenCV is ASCII-only

Note: OpenCV only accepts ASCII characters for image paths when reading and writing images
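
The idea behind the name-to-index mapping is sketched below without touching the file system. `build_name_index` is a hypothetical helper; unlike the notebook's `create_mapped_folder`, which follows `os.listdir` order, it sorts the names so that the mapping is reproducible between runs.

```python
def build_name_index(names):
    """Map each distinct name to an ASCII-only folder label and back."""
    name_to_folder = {}
    folder_to_name = {}
    for idx, name in enumerate(sorted(set(names)), start=1):
        folder = f"face_{idx}"
        name_to_folder[name] = folder
        folder_to_name[folder] = name
    return name_to_folder, folder_to_name

name_to_folder, folder_to_name = build_name_index(
    ["Jürgen Klopp", "Angela Merkel", "Jürgen Klopp"]
)
print(name_to_folder)
```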

In [None]:
def create_name_folder(name_file, train_face_folder, train_img_folder):
    # Copy each image into a folder named after the first person in its article title.
    with open(name_file, encoding="utf8") as a_file:
        next(a_file)
        for line in a_file:
            line = line.strip("\n")
            img_name = line.split("\t")[0]
            names = line.split("\t")[4].rstrip()
            if len(names) > 0:
                path = os.path.join(train_face_folder, names.split(",")[0])
                if not os.path.exists(path):
                    os.makedirs(path)
                if os.path.exists(os.path.join(train_img_folder, img_name)):
                    copyfile(os.path.join(train_img_folder, img_name), os.path.join(path, img_name))

def create_mapped_folder(d, mapped_folder):
    if not os.path.isdir(mapped_folder):
        os.mkdir(mapped_folder)
    sub_directories = [o for o in os.listdir(d) if os.path.isdir(os.path.join(d, o))]
    idx_name = {}
    name_idx = {}
    idx = 1
    for sub_dir in sub_directories:
        idx_name[sub_dir] = 'face_' + str(idx)
        name_idx['face_' + str(idx)] = sub_dir
        idx += 1
    sub_full_paths = [os.path.join(d, o) for o in os.listdir(d) if os.path.isdir(os.path.join(d, o))]
    for sub_dir in sub_full_paths:
         mapped_file_folder(sub_dir, mapped_folder, idx_name)
    return idx_name, name_idx

def mapped_file_folder(src, dest, idx_name):
    src_files = os.listdir(src)
    for file_name in src_files:
        full_file_name = os.path.join(src, file_name)
        dest_folder = os.path.join(dest, idx_name[os.path.basename(src)])
        if not os.path.isdir(dest_folder):
            os.mkdir(dest_folder)
        if os.path.isfile(full_file_name):
            shutil.copy(full_file_name, dest_folder)

In [None]:
create_name_folder(r'processed_data\data\train_eval_title_eng_name.tsv', r"processed_data\img\train_eval_faces", r"img\train")
create_name_folder(r'processed_data\data\train_eval_title_eng_name.tsv', r"processed_data\img\train_eval_faces", r"img\eval")

In [None]:
create_name_folder(r'processed_data\data\train_title_eng_name.tsv', r"processed_data\img\train_faces", r"img\train")

In [None]:
tr_eval_idx_name, tr_eval_name_idx = create_mapped_folder(r"processed_data\img\train_eval_faces", r"processed_data\img\train_eval_mapped_face")

In [None]:
tr_idx_name, tr_name_idx = create_mapped_folder(r"processed_data\img\train_faces", r"processed_data\img\train_mapped_face")

- Remove non-face images from this specific training set

In [None]:
def detect_face_cv(file):
    # Load the cascade (the XML file is expected in the working directory)
    face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    # Read the input image; cv2.imread returns None if the file cannot be read
    img = cv2.imread(file)
    if img is None:
        return False
    # Convert into grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Image arrays are shaped (rows, columns), i.e. (height, width)
    height, width = gray.shape
    # Detect faces
    faces = face_cascade.detectMultiScale(gray, 1.1, 4)
    # Accept any detection that does not span the entire image
    for (x, y, w, h) in faces:
        if w != width or h != height:
            return True
    return False

def deep_detect_backend(file):
    # True if at least one detector backend finds a face in the image.
    backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface']
    for backend in backends:
        try:
            DeepFace.detectFace(file, detector_backend=backend)
            return True
        except Exception:
            continue
    return False

def deep_detect(file):
    # True if detection succeeds for at least one of the listed models.
    # The model_name argument is kept from the original call; whether detectFace
    # accepts it depends on the installed DeepFace version. A rejected argument
    # raises an exception and therefore counts as "no face".
    models = ["VGG-Face", "Facenet", "Facenet512", "OpenFace", "DeepFace", "DeepID", "ArcFace", "Dlib"]
    for model in models:
        try:
            DeepFace.detectFace(file, model_name=model)
            return True
        except Exception:
            continue
    return False

In [None]:
def remove_no_face_img(path):
    face_img=[]
    sub_directories = [os.path.join(path, d) for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    cnt=0
    for sub_dir in sub_directories:
        print(os.path.basename(sub_dir))
        files = [os.path.join(sub_dir, f) for f in os.listdir(sub_dir) if f.endswith('.jpg')]
        for file in files:
            if deep_detect(file) or deep_detect_backend(file) or detect_face_cv(file):
                face_img.append(os.path.basename(file))
            else:
                os.remove(file)
    return face_img

In [None]:
train_eval_face_img_list=remove_no_face_img(r"processed_data\img\train_eval_mapped_face")
train_face_img_list=remove_no_face_img(r"processed_data\img\train_mapped_face")

In [None]:
def remove_empty_folder(path):
    # Delete every sub-folder that no longer contains a .jpg file.
    sub_directories = [os.path.join(path, d) for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    for sub_dir in sub_directories:
        files = [os.path.join(sub_dir, f) for f in os.listdir(sub_dir) if f.endswith('.jpg')]
        if len(files)==0:
            shutil.rmtree(sub_dir)

In [None]:
remove_empty_folder(r"processed_data\img\train_eval_mapped_face")
remove_empty_folder(r"processed_data\img\train_mapped_face")

In [None]:
def find_mapping(d):
    sub_directories = [o for o in os.listdir(d) if os.path.isdir(os.path.join(d, o))]
    idx_name = {}
    name_idx = {}
    idx = 1
    for sub_dir in sub_directories:
        idx_name[sub_dir] = 'face_' + str(idx)
        name_idx['face_' + str(idx)] = sub_dir
        idx += 1
    return idx_name, name_idx

In [None]:
tr_eval_idx_name, tr_eval_name_idx = find_mapping(r"processed_data\img\train_eval_faces")
tr_idx_name, tr_name_idx = find_mapping(r"processed_data\img\train_faces")

In [None]:
print(len(tr_eval_idx_name))
print(len(tr_idx_name))

- Find the list of news articles with title including person's name

In [None]:
def get_ar_name_list(article_file, name_idx):
    # Return (article_id, first_name) pairs for rows whose last column holds a name.
    # name_idx is the column index of the comma-separated name list.
    result=[]
    with open(article_file, 'r', encoding="utf-8") as articles_names:
        next(articles_names)
        for line in articles_names:
            segs = line.strip().split("\t")
            if len(segs) > name_idx and len(segs[-1].strip())>0 and segs[-1].strip()!='NA':
                result.append((segs[0], segs[name_idx].split(",")[0]))
    return result

In [None]:
train_ar_name_list=get_ar_name_list(r"processed_data\data\train_title_eng_name.tsv", 4)
eval_ar_name_list=get_ar_name_list(r"processed_data\data\eval_title_eng_name.tsv", 4)
test_ar_name_list=get_ar_name_list(r"processed_data\data\test_title_eng_name.tsv", 3)

In [None]:
print(len(train_ar_name_list))
print(len(eval_ar_name_list))
print(len(test_ar_name_list))

#### Face Image Crawling
If the training dataset contains no image connected to the extracted name, we crawl five face images from the web (Google Images) using the extracted name as the keyword.

In [None]:
def craw_missing_images(ar_name_list, idx_name, train_mapped_face_path, crawl_face_path):
    if not os.path.isdir(crawl_face_path):
        os.mkdir(crawl_face_path)
    for ar_name in ar_name_list:
        name = ar_name[1]
        if name in idx_name and os.path.exists(os.path.join(train_mapped_face_path, idx_name[name])):
            print("found")
        else:
            # No training image for this name: crawl up to five images for it.
            name_dir = os.path.join(crawl_face_path, name)
            if not os.path.isdir(name_dir):
                os.mkdir(name_dir)
            google_crawler = GoogleImageCrawler(feeder_threads=1, parser_threads=2, downloader_threads=4,
                                                storage={'root_dir': name_dir})
            filters = dict(date=((2019, 1, 1), (2021, 7, 30)))
            google_crawler.crawl(keyword=name, filters=filters, max_num=5, file_idx_offset=0)

In [None]:
craw_missing_images(train_ar_name_list, {}, "",  r"processed_data\img\crawl_face")

In [None]:
craw_missing_images(eval_ar_name_list, tr_idx_name, r"processed_data\img\train_mapped_face", r"processed_data\img\crawl_train_face")
craw_missing_images(test_ar_name_list, tr_eval_idx_name, r"processed_data\img\train_eval_mapped_face", r"processed_data\img\crawl_train_eval_face")

In [None]:
crawl_tr_eval_idx_name, crawl_tr_eval_name_idx=create_mapped_folder(r"processed_data\img\crawl_train_eval_face", \
                                                                    r"processed_data\img\crawl_train_eval_face_mapped")

In [None]:
crawl_tr_idx_name, crawl_tr_name_idx=create_mapped_folder(r"processed_data\img\crawl_train_face", \
                                                          r"processed_data\img\crawl_train_face_mapped")

In [None]:
crawl_idx_name, crawl_name_idx=create_mapped_folder(r"processed_data\img\crawl_face", \
                                                          r"processed_data\img\crawl_face_mapped")

- Remove non-face images from crawled images

In [None]:
def remove_no_face_img_crawl(path):
    face_img=[]
    sub_directories = [os.path.join(path, d) for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
    for sub_dir in sub_directories:
        print(os.path.basename(sub_dir))
        files = [os.path.join(sub_dir, f) for f in os.listdir(sub_dir) if f.endswith('.jpg')]
        for file in files:
            if os.path.isfile(file) and (deep_detect(file) or deep_detect_backend(file) or detect_face_cv(file)):
                face_img.append(os.path.basename(file))
            elif os.path.isdir (sub_dir):
                # Note: the first image without a detectable face removes the whole
                # folder, including images of that folder already added to face_img.
                shutil.rmtree(sub_dir)
    return face_img

In [None]:
remove_no_face_img_crawl(r"processed_data\img\crawl_train_eval_face_mapped")

In [None]:
remove_no_face_img_crawl(r"processed_data\img\crawl_train_face_mapped")

In [None]:
remove_no_face_img_crawl(r"processed_data\img\crawl_face_mapped")

In [None]:
crawl_tr_eval_idx_name, crawl_tr_eval_name_idx = find_mapping(r"processed_data\img\crawl_train_eval_face")
crawl_tr_idx_name, crawl_tr_name_idx = find_mapping(r"processed_data\img\crawl_train_face")

### Face Name Matching

#### Image Candidate Selection
Only images that contain a face are selected for matching against the images in the training set created in Step 1

In [None]:
def select_face_image(src_path, dst_path):
    if not os.path.isdir(dst_path):
        os.mkdir(dst_path)
    files = [os.path.join(src_path, f) for f in os.listdir(src_path) if f.endswith('.jpg')]
    for file in files:
        if deep_detect(file) or deep_detect_backend(file) or detect_face_cv(file):
            shutil.copy(file, dst_path)

In [None]:
select_face_image(r'img\train', r'processed_data\img\train_face_candidate')
select_face_image(r'img\eval', r'processed_data\img\eval_face_candidate')
select_face_image(r'img\test', r'processed_data\img\test_face_candidate')

### Calculate image similarity

In [None]:
import time
from datetime import datetime
def get_face_similarity(face_img_candidate_dir, train_mapped_img_dir, ar_name_list, idx_name):
    cnt = 0
    record = 0
    img_files = [f for f in os.listdir(face_img_candidate_dir) if f.endswith('.jpg')]
    print(len(img_files))
    ar_img_files = {}
    for ar_name in ar_name_list:
        if ar_name[1].strip() != 'NA' and ar_name[1] in idx_name:
            img_db_path=""
            if os.path.exists(os.path.join(train_mapped_img_dir, idx_name[ar_name[1]])):
                img_db_path=os.path.join(train_mapped_img_dir, idx_name[ar_name[1]])
            if len(img_db_path)>0:
                df_results = []
                t = time.process_time()
                count = 0
                for img_file in img_files:
                    img_path = os.path.join(face_img_candidate_dir, img_file)
                    df = DeepFace.find(img_path=img_path, db_path=img_db_path,
                                       model_name='Facenet', enforce_detection=False)
                    if len(df) > 0:
                        df_results.append((img_path, df['Facenet_cosine'].mean(), len(df)))
                    else:
                        df_results.append((img_path, "NA", 0))
                    count += 1
                ar_img_files[ar_name[0]] = df_results
                cnt += 1
                elapsed_time = time.process_time() - t

                print(str(datetime.now()))
                print("in ", elapsed_time, "seconds complete", cnt, " name completed", " compared with ", count,
                      "images")
        record += 1
        print("processing ", record, " files")
    return ar_img_files

### Model Evaluation

In [None]:
def write_face_matching_similarity(output_file, ar_img_files):
    # Append one "article_id<TAB>image_file<TAB>mean_cosine_distance<TAB>n_matches" row per candidate.
    with open(output_file, "a") as f:
        for key, v in ar_img_files.items():
            for item in v:
                result = key + "\t" + os.path.basename(item[0]) + "\t" + str(item[1]) + "\t" + str(item[2]) + "\n"
                f.write(result)

In [1]:
def sort_dictionary(input_dict):
    result={}
    for k, v in input_dict.items():
        sort_v=dict(sorted(v.items(), key=lambda item: item[1], reverse=True))
        result[k]=sort_v
    return result

def cal_face_matching_similarity(input_file):
    image_train_sim = []
    with open(input_file) as f:
        image_train_sim = f.readlines()
    ar_train_sim_dic={}
    for line in image_train_sim:
        segs=line.strip().split("\t")
        if segs[0] not in ar_train_sim_dic:
            ar_train_sim_dic[segs[0]]=[]
        ar_train_sim_dic[segs[0]].append((segs[1], segs[2], segs[3]))
    ar_train_sim_dic_cal={}
    for k, v in ar_train_sim_dic.items():
        if k not in ar_train_sim_dic_cal:
            ar_train_sim_dic_cal[k]={}
        for item in v:
            if int(item[2])==0:
                sim=0
            else:
                sim=(1-float(item[1]))*int(item[2])
            ar_train_sim_dic_cal[k][item[0]]= sim
    return sort_dictionary(ar_train_sim_dic_cal)
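
The score computed in `cal_face_matching_similarity` is `(1 - mean cosine distance) * number of matches`, and 0 when there is no match. A small worked example with made-up numbers follows; `face_score` and the candidate values are illustrative only.

```python
def face_score(mean_cosine_distance, n_matches):
    """Similarity score of one candidate image; 0 if no reference face matched."""
    if n_matches == 0:
        return 0.0
    return (1.0 - mean_cosine_distance) * n_matches

# candidate image -> (mean cosine distance to the matched reference faces, number of matches)
candidates = {"img_a": (0.25, 4), "img_b": (0.10, 1), "img_c": (None, 0)}
scores = {img: face_score(dist, n) for img, (dist, n) in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

An image that matches many reference faces at a moderate distance (`img_a`) can therefore rank above an image with a single, very close match (`img_b`).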

In [None]:
eval_ar_img_files = get_face_similarity(r'processed_data\img\eval_face_candidate',\
                                   r'processed_data\img\train_mapped_face', eval_ar_name_list, tr_idx_name)

In [None]:
write_face_matching_similarity(r"result/eval_face_similarity.tsv",eval_ar_img_files)

In [None]:
eval_ar_img_files_crawl = get_face_similarity(r'processed_data\img\eval_face_candidate',\
                                   r'processed_data\img\crawl_train_face_mapped', eval_ar_name_list, crawl_tr_idx_name)

In [None]:
write_face_matching_similarity(r"result/eval_face_similarity_crawl.tsv",eval_ar_img_files_crawl)

In [2]:
eval_face_matching = cal_face_matching_similarity(r"result/eval_face_similarity.tsv")
eval_face_matching_crawl = cal_face_matching_similarity(r"result/eval_face_similarity_crawl.tsv")

In [4]:
def cal_MR(eval_face_matching):
    count=0
    for key, value in eval_face_matching.items():
        first_tuple_elements=[]
        for a_tuple in value:
            first_tuple_elements.append(a_tuple)
        if key in first_tuple_elements[0:100]:
            count+=1
    return count

In [5]:
print(cal_MR(eval_face_matching)/len(eval_face_matching))
print(cal_MR(eval_face_matching)/2385)

0.21052631578947367
0.0067085953878406705


In [6]:
print(cal_MR(eval_face_matching_crawl)/len(eval_face_matching_crawl))
print(cal_MR(eval_face_matching_crawl)/2385)

0.10404624277456648
0.007547169811320755


### Prediction

In [None]:
test_ar_img_files = get_face_similarity(r'processed_data\img\test_face_candidate',
                                   r'processed_data\img\train_eval_mapped_face', 
                                        test_ar_name_list, 
                                        tr_eval_idx_name)

In [None]:
write_face_matching_similarity(r"result/test_train_img_similarity.txt",test_ar_img_files)

In [None]:
test_ar_img_files_crawl = get_face_similarity(r'processed_data\img\test_face_candidate',
                                   r'processed_data\img\crawl_train_eval_face_mapped', 
                                              test_ar_name_list, 
                                              crawl_tr_eval_idx_name)

In [None]:
write_face_matching_similarity(r"result/test_crawl_img_similarity.txt",test_ar_img_files_crawl)

### Model Ensembling

convert a file into a dictionary representing the results of the image-captioning-based model

In [141]:
def cal_caption_result (caption_sim_wmd_file):
    lines_caption_sim = []
    with open(caption_sim_wmd_file) as f:
        lines_caption_sim = f.readlines()
    ar_cap_sim_dic={}
    for line in lines_caption_sim:
        segs=line.strip().split("\t")
        if segs[0] not in ar_cap_sim_dic:
            ar_cap_sim_dic[segs[0]]={}
        ar_cap_sim_dic[segs[0]][os.path.splitext(segs[1])[0]]= 1-float(segs[2])
    return ar_cap_sim_dic

In [142]:
ar_cap_sim_dic=cal_caption_result(r"result\test_caption_sim_wmd.txt")

In [143]:
print(len(ar_cap_sim_dic))

1915


convert a file into a dictionary representing the results of the face-name-matching-based model

In [144]:
def face_matching_result (face_matching_file):
    image_train_sim = []
    with open(face_matching_file) as f:
        image_train_sim = f.readlines()
    ar_train_sim_dic={}
    for line in image_train_sim:
        segs=line.strip().split("\t")
        if segs[0] not in ar_train_sim_dic:
            ar_train_sim_dic[segs[0]]=[]
        ar_train_sim_dic[segs[0]].append((os.path.splitext(segs[1])[0], segs[2], segs[3]))
    ar_train_sim_dic_cal={}
    
    for k, v in ar_train_sim_dic.items():
        if k not in ar_train_sim_dic_cal:
            ar_train_sim_dic_cal[k]={}
        for item in v:
            if int(item[2])==0:
                sim=0
            else:
                sim=(1-float(item[1]))*int(item[2])
            ar_train_sim_dic_cal[k][item[0]]= sim
    return ar_train_sim_dic_cal

In [145]:
ar_train_sim_dic_cal=face_matching_result(r"result\test_train_img_similarity.txt")

In [146]:
ar_crawl_sim_dic_cal=face_matching_result(r"result\test_crawl_img_similarity.txt")

In [147]:
print(len(ar_crawl_sim_dic_cal))

191


normalize value of a given dictionary (Min-max normalization)

In [148]:
def norm_dict(a_dict):
    result={}
    amin, amax = min(a_dict.values()), max(a_dict.values())
    for k, v in a_dict.items():
        if amax-amin==0:
            result[k]=0
        else:
            result[k] = (v-amin) / (amax-amin)
    return result

normalize every inner dictionary of a nested dictionary (one inner dictionary of image scores per article),
for example the results of the face-name-matching-based model

In [149]:
def norm_sim(a_dict):
    result={}
    print(len(a_dict))
    for k, v in a_dict.items():
        result[k]=norm_dict(v)
    return result

normalize the values of each inner dictionary, then sort them in descending order

In [150]:
def sort_dict(a_dict):
    normalized_dict=norm_sim(a_dict)
    result = {}
    for k, v in normalized_dict.items():
        sort_v = dict(sorted(v.items(), key=lambda item: item[1], reverse=True))
        result[k] = sort_v
    return result

sort the dictionaries representing the results of the face-name-matching-based model (training and crawled images)
and the dictionary representing the results of the image-captioning-based model

In [151]:
cap_dict=sort_dict(ar_cap_sim_dic)
train_dict=sort_dict(ar_train_sim_dic_cal)
crawl_dict=sort_dict(ar_crawl_sim_dic_cal)

1915
128
191


merge results from image captioning based model and results from face matching based model

In [152]:
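def merge_cap_face(cap_dict, train_dict, crawl_dict, weight_cap, weight_img):
    # Start from the caption-based scores of each article, then overwrite the score of
    # every image that also has a face-matching score with weight_img * face score.
    # Note: weight_cap is accepted but not applied, and the inner dictionaries of
    # cap_dict are modified in place.
    result={}
    for k, v in cap_dict.items():
        img_id=os.path.splitext(k)[0]
        result[img_id]=v
        if k in train_dict:
            for k_tr, v_tr in train_dict[k].items():
                result[img_id][k_tr]=v_tr*weight_img
        if k in crawl_dict:
            for k_cr, v_cr in crawl_dict[k].items():
                result[img_id][k_cr]=v_cr*weight_img
    return result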
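
`merge_cap_face` replaces a caption score with the weighted face score whenever both exist. For comparison, a weighted-sum merge is sketched below. It is an alternative shown only for illustration, not the method used to produce the results in this notebook, and `weighted_merge` is a hypothetical name.

```python
def weighted_merge(cap_scores, face_scores, weight_cap=0.5, weight_face=0.5):
    """Combine two normalized score dictionaries of one article by a weighted sum."""
    merged = {}
    for image_id in set(cap_scores) | set(face_scores):
        merged[image_id] = (weight_cap * cap_scores.get(image_id, 0.0)
                            + weight_face * face_scores.get(image_id, 0.0))
    # Best candidate first
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))

cap_scores = {"i1": 1.0, "i2": 0.5, "i3": 0.0}
face_scores = {"i2": 1.0, "i4": 0.5}
merged = weighted_merge(cap_scores, face_scores)
print(list(merged))
```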

In [153]:
cap_face_result=merge_cap_face(cap_dict, train_dict, crawl_dict, 0.5, 0.5)

In [155]:
sorted_cap_face_result=sort_dict(cap_face_result)

1915


truncate image candidates into top 100 list for each article

In [156]:
def truncate_result(a_dict):
    for key, value in a_dict.items():
        l = [*value]
        l_100 = l[0:100]
        a_dict[key]=l_100

In [157]:
truncate_result(sorted_cap_face_result)

In [158]:
len(sorted_cap_face_result)

1915

acquire result from url matching based method

In [159]:
img_id_name_dict = extract_img_url_token("../data/MediaEvalNewsImagesBatch04images.tsv",
                                              TEST_I_ID_IDX,
                                              TEST_IMG_URL_IDX)
article_id_name_dict = extract_article_token("../data/MediaEvalNewsImagesBatch04articles.tsv", 
                                             TEST_A_ID_IDX,
                                             TEST_TITLE_IDX)
url_result = match_url(article_id_name_dict, img_id_name_dict)

matching url
1772
1772


merge result from url matching based model into result

In [160]:
def merge_url(final_cap_face_result, url_result):
    final_result={}
    for s_id, s_value in final_cap_face_result.items():
        if s_id in url_result:
            diff_elements = [x for x in url_result[s_id] if x not in s_value ]
            common_elements= [x for x in s_value if x in url_result[s_id] ]
            tail_elements=s_value[len(s_value)-len(diff_elements):]
            common_ele_in_tail=[x for x in common_elements if x in tail_elements]
            if len(common_ele_in_tail)>0:
                new_value=diff_elements+common_ele_in_tail+s_value[:len(s_value)-len(diff_elements)-len(common_ele_in_tail)]
            else:
                new_value=diff_elements+s_value[:len(s_value)-len(diff_elements)]
            final_result[s_id]= new_value

        else:
            final_result[s_id]=s_value
    return final_result
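
The effect of `merge_url` is that URL matches are moved to the front of the candidate list while the list length stays the same. A simplified version of that idea is sketched below. `prepend_url_matches` is a hypothetical helper and does not reproduce the tail handling of `merge_url` exactly: it moves every URL match to the front, including those already in the list.

```python
def prepend_url_matches(ranked, url_matches, limit=100):
    """Put URL matches first, keep the remaining order, and cut the list at `limit`."""
    front = list(dict.fromkeys(url_matches))  # drop duplicates, keep order
    front_set = set(front)
    rest = [image_id for image_id in ranked if image_id not in front_set]
    return (front + rest)[:limit]

merged_list = prepend_url_matches(["i1", "i2", "i3", "i4"], ["i9", "i3"], limit=4)
print(merged_list)
```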

In [161]:
final_result=merge_url(sorted_cap_face_result, url_result)

save final result into a file

In [162]:
def save_final_result(output_file, final_result):
    with open(output_file, "w") as the_file:
        header="particleID"
        for i in range(100):
            header+="\t"+"iid"+str(i+1)
        the_file.write(header+"\n")
        for key, value in final_result.items():
            the_file.write(key+'\t'+ "\t".join(value)+"\n")

In [163]:
save_final_result(r"result\final_result.tsv", final_result)

Result Exploring

In [165]:
import pandas as pd
df=pd.read_csv(r"result\final_result.tsv", delimiter="\t")

In [166]:
df

Unnamed: 0,particleID,iid1,iid2,iid3,iid4,iid5,iid6,iid7,iid8,iid9,...,iid91,iid92,iid93,iid94,iid95,iid96,iid97,iid98,iid99,iid100
0,1000265260,134746,134710,135977,134622,134853,136039,136139,136193,134762,...,134315,134322,134510,135693,134997,135109,135698,134633,136319,134726
1,1001935289,135908,135628,134775,136331,134390,136306,134381,134409,134909,...,136045,136458,136137,134266,134868,135007,135435,136062,136172,136179
2,1002375244,136277,134791,136639,134624,134416,135315,135770,136169,136530,...,135374,135406,135453,135626,135924,135952,136016,136106,136168,136189
3,1002735288,136819,136242,134962,134606,136361,134390,134123,134231,134332,...,136561,134322,136431,134502,134709,136243,134149,134162,134733,134788
4,1002835245,135405,136751,134806,135863,136083,135390,136320,136390,134853,...,135534,135709,134206,136624,136931,136639,136288,134634,134627,135606
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1910,1999075246,135716,134131,135146,135009,135268,135742,135805,136327,134737,...,137028,135043,134556,134699,134729,134770,134923,135112,135368,135506
1911,1999165241,134482,135136,135987,136178,136355,136646,136889,136763,134782,...,136791,135627,136258,134220,134356,136583,136668,134291,136412,136326
1912,1999345240,136293,135876,135329,136288,136356,135714,134205,134395,135003,...,136679,136669,134933,136116,135566,136096,135390,135844,136285,134557
1913,1999355239,134193,136200,135183,135391,135705,135853,135432,135722,135813,...,136327,134336,134523,136558,136988,134946,135912,136476,134380,134397
