# TEM501 course project

In this document, I will give instrustions to perform following tasks related to the course project.

- Load data from data files
- Split development data into dev-train/dev-test data
- Preprocess and extract features from meta-data of images
- Select top features from the training data
- Build a simple baseline for the course project

## Loading data

We will load meta-data and labels of images in the development set. In this simple version, I only use `description`, `title` and `user_tags` in meta-data of images.

In [1]:
# You may need to change the path to the data
devset_meta_data = "./data/course_project/devset/devset_images_metadata.json"
devset_gt_data = "./data/course_project/devset/devset_images_gt.csv"
path_to_output_csv = "./data/course_project/devset/devset_images_bow.csv"

We define a class `ImageMetaData` to store info of each image.

In [2]:
class ImageMetaData():
    
    def __init__(self, img_id, description, title, user_tags):
        self.img_id = img_id
        self.description = description if description is not None else ""
        self.title = title if title is not None else ""
        self.user_tags = user_tags if user_tags is not None else []
    
    def __repr__(self):
        return str(self.__dict__)

The following functions will load meta-data and labels of images.

In [3]:
import json
import pandas as pd

def load_meta_data(file_path):
    with open(file_path) as f:
        json_object = json.load(f)
        images = json_object["images"]
    data = []
    for img in images:
        img_meta = ImageMetaData(img["image_id"], img["description"], img["title"], img["user_tags"])
        data.append(img_meta)
    return data

def load_label(file_path):
    """Load gold-standard labels
    
    Return
    -------
    img2label: Map from image id to its label
    """
    df = pd.read_csv(file_path, header=None)
    img2label = { str(img_id):int(lb) for img_id,lb in zip(df[0], df[1]) }
    return img2label

Let's try to load meta data and labels in the developement set.

In [4]:
devset_data = load_meta_data(devset_meta_data)
print("Number of images: {}".format(len(devset_data)))

# See some first images
print(devset_data[:3])

Number of images: 4224
[{'img_id': '3595468464', 'description': '', 'title': 'Sukhumvit Soi 4, Nana Tai, Bangkok', 'user_tags': ['7d', 'asia', 'bangkok', 'capital', 'city', 'dynax', 'earth asia', 'food', 'foodstall', 'konica', 'maxxum', 'minolta', 'nana', 'people', 'road', 'soi', 'street', 'sukhumvit', 'thailand', 'urban']}, {'img_id': '5090153632', 'description': 'The Arno crosses Florence, where it passes below the Ponte Vecchio and the Santa Trìnita bridge (built by Bartolomeo Ammanati, but inspired by Michelangelo). The river flooded this city regularly in historical times, the last occasion being the famous flood of 1966, with 4,500 m³/s after a rain of 437.2 mm in Badia Agnano and 190 millimetres in Florence, in only 24 hours.', 'title': 'Florence, Italy', 'user_tags': ['arno', 'duomo', 'florence', 'italy', 'ponte vecchio', 'tuscany']}, {'img_id': '5893636276', 'description': 'Straight maile road', 'title': '4 BATTI 4 RASTA', 'user_tags': ['jamshedpur']}]


In [5]:
img2label = load_label(devset_gt_data)
print(img2label["3595468464"], img2label["5090153632"])

0 0


## Data spliting

For system development, we always want to split the devset into dev-train and dev-test. We will do that by using `scikit-learn`.

In [6]:
from sklearn.model_selection import train_test_split

img_labels = [img2label[img.img_id] for img in devset_data]

dev_train_data, dev_test_data, dev_train_labels, dev_test_labels = train_test_split(devset_data, img_labels,
                                                                                  test_size=0.25, random_state=42
                                                                                  )
print(dev_train_data[0])

{'img_id': '2268582059', 'description': '', 'title': 'P1070591', 'user_tags': ['burningmax', 'national', 'park', 'road', 'roadtrip', 'states', 'trash', 'trip', 'united', 'usa', 'utah', 'west', 'zion']}


## Preprocessing and feature extraction

In this section, we will perform data preprocessing and feature extraction. We just perform basic text cleaning and use BoW features from description, title, and user tags of images. **More advanced preprocessing and feature extraction is left for students**.

In [7]:
import re
from nltk.tokenize import word_tokenize

def clean_text(text):
    """Clean text

    Parameters
    -----------
    text: A String

    Return
    -----------
    text_: A String
        The text after being cleaned
    """
    ANY_URL_REGEX = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
    WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
    
    text_ = text.strip()
    
    regx_list  = [
        # Remove text between <a href=>...</a>
        re.compile(r"<a href=\".+?\".*?>.*?</a>"),
    ]
    for rgx in regx_list:
        text_ = re.sub(rgx, " ", text_)
    # Remove HTML links
    text_ = re.sub(WEB_URL_REGEX, " ", text_)
    text_ = re.sub(ANY_URL_REGEX, " ", text_)
    return text_

    
def extract_bow_features(img):
    """
    Parameters
    -----------
    img: An ImageMetaData object
    
    Return
    -----------
    tokenized string that includes BoW in title, description, and user_tags
    """
    tokens = []
    for text in [ img.title, img.description ]:
        if text == "":
            continue
        text = clean_text(text)
        tokens += word_tokenize(text)
        
    for tag in img.user_tags:
        tag = clean_text(tag)
        if tag == "":
            continue
        tag = "_".join(tag.split())
        tokens.append(tag)

    return " ".join(tokens)

We test the `extract_bow_features` on some images.

In [8]:
print(extract_bow_features(dev_train_data[0]))
print(extract_bow_features(dev_train_data[1]))


P1070591 burningmax national park road roadtrip states trash trip united usa utah west zion
P9290020 canoe flooding hue katsana mekong minska motorbike river travel typhoon veitnam water wet


## Top feature selection

Now, we will use scikit-learn (See [http://scikit-learn.org/stable/modules/feature_selection.html](http://scikit-learn.org/stable/modules/feature_selection.html) and [http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html)) to select some good features.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import numpy as np

vectorizer = CountVectorizer(
                             binary=True,   # Use binary features
                             stop_words="english"
                            ) 
dev_train_bow = [extract_bow_features(img) for img in dev_train_data]
X_train = vectorizer.fit_transform(dev_train_bow)
fs = SelectKBest(chi2, k=100)
X_new = fs.fit_transform(X_train, dev_train_labels)
feature_names = vectorizer.get_feature_names()
sorted_features = [feature_names[i] for i in np.argsort(fs.scores_)[::-1]]
print(sorted_features[:20])

['flood', 'flooded', 'building', 'flooding', 'water', 'rain', 'road', 'floods', 'storm', 'city', 'construction', 'architecture', 'site', 'bridge', 'old', 'people', 'new', 'irene', 'weather', 'built']


## Simple baseline system

Now we will build a simple baseline system. The system uses weigted sum of top features as the score of an image.

In [18]:
def baseline_score(img, weights):
    score = 0.0
    bow = extract_bow_features(img)
    for w in bow:
        if w in weights:
            score += weightes[w]
    return score

Now we use the `baseline_score` to sort the images in the `dev_test`.

In [19]:
top_features = sorted_features[:15]
weights = {}
wei = 1
step = 0.05
for w in top_features:
    weights[w] = wei
    wei -= step
print(weights)
    
sorted_images = sorted(dev_test_data, key=lambda x: baseline_score(x, weights), reverse=True)
sorted_image_ids = [img.img_id for img in sorted_images]
sorted_image_labels = [img2label[img_id] for img_id in sorted_image_ids]

{'flood': 1, 'flooded': 0.95, 'building': 0.8999999999999999, 'flooding': 0.8499999999999999, 'water': 0.7999999999999998, 'rain': 0.7499999999999998, 'road': 0.6999999999999997, 'floods': 0.6499999999999997, 'storm': 0.5999999999999996, 'city': 0.5499999999999996, 'construction': 0.4999999999999996, 'architecture': 0.4499999999999996, 'site': 0.39999999999999963, 'bridge': 0.34999999999999964, 'old': 0.29999999999999966}


Now we will use the following function to calculate the average precision at `k`.

In [20]:
import numpy

def average_precision_at_k(k, doc_labels):
    """Average Precision at k
    """
    k = min(k, len(doc_labels))
    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(doc_labels[:k]):
        if p == 1:
            num_hits += 1
            score += num_hits / (i+1.0)
    if num_hits == 0:
        return 0.0
    return score/num_hits

def report_result(ranked_gt_labels):
    cutoff = 480
    cutoff_vals = [50, 100, 250, 480]

    avg_prec_at_k = 100*average_precision_at_k(cutoff, ranked_gt_labels)
    print("AP@{} = {} %".format(cutoff, avg_prec_at_k))

    scores = []
    for k in cutoff_vals:
        avg_prec = 100.0 * average_precision_at_k(k, ranked_gt_labels)
        print('AP@%d = %f' % (k, avg_prec))
        scores.append(avg_prec)
    avg = numpy.mean(scores)
    print("Mean AP@ [{}] = {}".format(", ".join([str(x) for x in cutoff_vals]), avg))

In [21]:
report_result(sorted_image_labels)

AP@480 = 34.00006388442876 %
AP@50 = 36.727798
AP@100 = 33.951794
AP@250 = 33.986380
AP@480 = 34.000064
Mean AP@ [50, 100, 250, 480] = 34.66650922088554


## Applications of the baseline for the test set

Now we will use the baseline system for test set data.

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import numpy as np

vectorizer = CountVectorizer(
                             binary=True,   # Use binary features
                             stop_words="english"
                            ) 
train_bow = [extract_bow_features(img) for img in devset_data]
X_train = vectorizer.fit_transform(train_bow)
fs = SelectKBest(chi2, k=2)
fs.fit(X_train, img_labels)
feature_names = vectorizer.get_feature_names()
sorted_features = [feature_names[i] for i in np.argsort(fs.scores_)[::-1]]
top_features = sorted_features[:15]

weights = {}
wei = 1
step = 0.05
for w in top_features:
    weights[w] = wei
    wei -= step
print(weights)

path_to_test_metadata = "./data/course_project/testset/testset_images_metadata.json"
path_to_test_label = "./data/course_project/testset/testset_images_gt.csv"
testset_data = load_meta_data(path_to_test_metadata)
test_img2label = load_label(path_to_test_label)

sorted_images = sorted(testset_data, key=lambda x: baseline_score(x, weights), reverse=True)
sorted_image_ids = [img.img_id for img in sorted_images]
sorted_image_labels = [test_img2label[img_id] for img_id in sorted_image_ids]

{'flood': 1, 'building': 0.95, 'flooded': 0.8999999999999999, 'flooding': 0.8499999999999999, 'water': 0.7999999999999998, 'rain': 0.7499999999999998, 'road': 0.6999999999999997, 'floods': 0.6499999999999997, 'architecture': 0.5999999999999996, 'city': 0.5499999999999996, 'storm': 0.4999999999999996, 'construction': 0.4499999999999996, 'site': 0.39999999999999963, 'bridge': 0.34999999999999964, 'old': 0.29999999999999966}


Now we report the result of the baseline system on the test set data.

In [23]:
report_result(sorted_image_labels)

AP@480 = 35.792911181924495 %
AP@50 = 51.742616
AP@100 = 42.246260
AP@250 = 37.323844
AP@480 = 35.792911
Mean AP@ [50, 100, 250, 480] = 41.77640771764484


### Random baseline

Now let's try a random baseline which sort the image ids in the testset randomly.

In [24]:
import random
random.seed(1337)

image_ids = list( test_img2label.keys() )
random.shuffle(image_ids)
image_labels = [test_img2label[img_id] for img_id in image_ids]
report_result(image_labels)

AP@480 = 36.29854208586498 %
AP@50 = 41.706019
AP@100 = 39.961639
AP@250 = 36.702047
AP@480 = 36.298542
Mean AP@ [50, 100, 250, 480] = 38.6670617885098
