Small interactive script to preprocess the inputs:
1) Load the image information output by a Faster R-CNN
2) Retrieve the image features
3) Encode the texts using a pretrained BERT tokenizer
4) Save everything in pickle files

The image information is taken from the fc6 layer output of a pretrained Faster R-CNN with a ResNeXt-152 backbone. The features can be downloaded from https://dl.fbaipublicfiles.com/mmf/data/datasets/hateful_memes/defaults/features/features.tar.gz and come in LMDB format. In the corresponding directory, the image information is stored as key-value pairs, where the keys are the id numbers of the images and the values are dictionaries containing the following information: "feature_path", "features" (N=100, 2048), "image_height", "image_width", "num_boxes" (N,), "objects" (N,), "cls_prob" (N, 1601), "bbox" (100, 4).
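
The key-value layout described above can be read roughly as follows. This is a minimal sketch, not the actual `img_features_loader` from `processors.img_processor` (which is not shown here): the helper names `make_key`, `decode_features` and `load_img_features` are hypothetical, and the 5-digit zero-padding of the keys is an assumption.

```python
import pickle


def make_key(img_id):
    # Assumption: keys are image ids zero-padded to 5 digits, e.g. 1234 -> b"01234".
    return str(img_id).zfill(5).encode()


def decode_features(value):
    # Each value is assumed to be a pickled dict; only the "features" entry is kept.
    return pickle.loads(value)["features"]


def load_img_features(lmdb_dir, img_id):
    import lmdb  # imported here so the helpers above work without lmdb installed

    env = lmdb.open(lmdb_dir, subdir=True, readonly=True, readahead=False)
    try:
        with env.begin() as txn:
            value = txn.get(make_key(img_id))
        if value is None:
            raise KeyError(f"no LMDB entry for image id {img_id}")
        return decode_features(value)
    finally:
        env.close()
```

Raising on a missing key makes an absent image id fail loudly instead of surfacing later as an unpickling error.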

We only keep the image features. Then, we append the features to the Pandas DataFrames obtained from the .json data files (train, dev and test data).

Then, we use a pretrained BERT tokenizer to encode the texts. The text encodings are also appended to the Pandas DataFrames. They are dictionaries with the following information: "input_tokens" (max_seq_length=128,), "input_ids" (max_seq_length=128,), "segment_ids" (max_seq_length=128,), "input_mask" (max_seq_length=128,).
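
To make the structure of such an encoding concrete, here is a minimal sketch of how the dictionary could be assembled from already tokenized text. It is not the actual `bert_encoder`: the function name `build_encoding` is hypothetical, the special ids 101 ([CLS]), 102 ([SEP]) and 0 ([PAD]) are those of `bert-base-uncased`, and the all-zero `segment_ids`, the 0/1 `input_mask` and the unpadded `input_tokens` are assumptions.

```python
import numpy as np


def build_encoding(tokens, token_ids, max_seq_length=128,
                   cls_id=101, sep_id=102, pad_id=0):
    # Leave room for [CLS] and [SEP]; longer texts are truncated.
    tokens = list(tokens)[: max_seq_length - 2]
    token_ids = list(token_ids)[: max_seq_length - 2]

    ids = [cls_id] + token_ids + [sep_id]
    n_real = len(ids)
    n_pad = max_seq_length - n_real

    return {
        "input_tokens": tokens,  # assumption: kept without padding
        "input_ids": np.array(ids + [pad_id] * n_pad),
        # Assumption: single-segment input, so every segment id is 0.
        "segment_ids": np.zeros(max_seq_length, dtype=np.int64),
        # Assumption: 1 marks real tokens (including [CLS]/[SEP]), 0 marks padding.
        "input_mask": np.array([1] * n_real + [0] * n_pad),
    }
```

With the `transformers` tokenizer, `tokenizer.tokenize(text)` and `tokenizer.convert_tokens_to_ids(tokens)` would supply the two inputs.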

Finally, we save the resulting DataFrames in pickle format.

In [1]:
import os
import pickle
import pandas as pd 
import numpy as np  

In [2]:
from processors.bert_processor import bert_encoder
from processors.img_processor import img_features_loader

In [3]:
lmdb_dir = '/Users/guillaumevalette/Desktop/HMDataset/detectron.lmdb/'

train_path = '/Users/guillaumevalette/Desktop/HM_challenge/data/train.jsonl'
dev_path = '/Users/guillaumevalette/Desktop/HM_challenge/data/dev.jsonl'
test_path = '/Users/guillaumevalette/Desktop/HM_challenge/data/test.jsonl'

train_data = '/Users/guillaumevalette/Desktop/HM_challenge/data/VisualBert/train_data'
dev_data = '/Users/guillaumevalette/Desktop/HM_challenge/data/VisualBert/dev_data'
test_data = '/Users/guillaumevalette/Desktop/HM_challenge/data/VisualBert/test_data'

In [4]:
file_paths = [train_path, dev_path, test_path]
file_datas = [train_data, dev_data, test_data]

for file_path, file_data in zip(file_paths, file_datas):
    # create dataframe from .json data file
    df = pd.read_json(file_path, lines=True)

    df.rename(columns={'img': 'img_name'}, inplace=True)

    # append "img_features" column 
    df['img_features'] = df['id'].map(lambda img_id: img_features_loader(lmdb_dir, img_id))
    
    # append "text_encoding" column
    df['text_encoding'] = df['text'].map(lambda seq: bert_encoder(seq))
    
    # save to pickle format
    df.to_pickle(file_data)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/guillaumevalette/Desktop/HM_challenge/data/VisualBert/train_data'
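
The error above most likely means that the output directory did not exist yet when `to_pickle` tried to open the file. One way to avoid it is to create the parent directory before saving. The sketch below assumes that this is the cause, and `save_pickle` is a hypothetical helper name.

```python
import os

import pandas as pd


def save_pickle(df, path):
    # Create the parent directory (and any missing ancestors) if needed.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    df.to_pickle(path)
```

Calling `save_pickle(df, file_data)` in place of `df.to_pickle(file_data)` would then let the loop run through all three splits.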

In [5]:
df.to_pickle(file_data)

### Check if the above works

In [6]:
train_df = pd.read_pickle(train_data)

In [7]:
train_df.head()

Unnamed: 0,id,img_name,label,text,img_features,text_encoding
0,42953,img/42953.png,0,its their character not their color that matters,"[[0.0, 0.0, 0.0, 0.0, 9.599549, 2.1708376, 13....","{'input_tokens': ['its', 'their', 'character',..."
1,23058,img/23058.png,0,don't be afraid to love again everyone is not ...,"[[0.0, 0.0, 0.0, 0.0, 12.636864, 0.0, 7.682766...","{'input_tokens': ['don', ''', 't', 'be', 'afra..."
2,13894,img/13894.png,0,putting bows on your pet,"[[0.0, 0.0, 0.0, 0.83242446, 5.372245, 2.75794...","{'input_tokens': ['putting', 'bows', 'on', 'yo..."
3,37408,img/37408.png,0,i love everything and everybody! except for sq...,"[[0.0, 0.0, 0.0, 0.0, 11.302303, 0.0, 0.0, 0.0...","{'input_tokens': ['i', 'love', 'everything', '..."
4,82403,img/82403.png,0,"everybody loves chocolate chip cookies, even h...","[[1.1141618, 0.0, 0.7985689, 0.0, 17.840979, 0...","{'input_tokens': ['everybody', 'loves', 'choco..."


In [9]:
train_df["img_features"][42]

array([[ 0.        ,  1.9127843 ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  1.5968409 ,  0.        , ...,  0.        ,
         0.        ,  0.42826056],
       ...,
       [10.182184  ,  0.        ,  0.        , ...,  0.        ,
         7.8023977 ,  0.        ],
       [ 0.09641957,  0.        ,  4.907559  , ...,  0.        ,
         6.407262  ,  0.        ],
       [ 0.        ,  0.        , 13.499453  , ...,  0.        ,
         2.0245812 ,  0.        ]], dtype=float32)

In [10]:
train_df["img_features"][42].shape

(100, 2048)

In [12]:
train_df["text"][42]

'that chiken is so black no one is going to eat it'

In [13]:
train_df["text_encoding"][42]

{'input_tokens': ['that',
  'chi',
  '##ken',
  'is',
  'so',
  'black',
  'no',
  'one',
  'is',
  'going',
  'to',
  'eat',
  'it'],
 'input_ids': array([ 101, 2008, 9610, 7520, 2003, 2061, 2304, 2053, 2028, 2003, 2183,
        2000, 4521, 2009,  102,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

In [14]:
import torch

In [17]:
for item in train_df["img_features"]:
    if item.shape != (100, 2048):
        print('problem')
        break
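
The loop above stops at the first mismatch. When debugging, it can help to know which rows are affected, so a variant that collects every offending index is sketched here. The helper name `find_bad_shapes` is hypothetical; it accepts anything with an `.items()` method, such as a pandas Series or a dict.

```python
import numpy as np


def find_bad_shapes(features, expected=(100, 2048)):
    # Return the index labels of all entries whose shape differs from `expected`.
    return [idx for idx, item in features.items() if np.shape(item) != expected]
```

For example, `find_bad_shapes(train_df["img_features"])` returns an empty list when every feature array has shape (100, 2048).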

In [19]:
ids = np.random.randint(low=0, high=8500, size=3000)

In [22]:
import lmdb

In [24]:
# open the LMDB environment once instead of once per sample
env_db = lmdb.open(path=lmdb_dir, subdir=True, readonly=True, readahead=False)
with env_db.begin() as txn:
    for idx in ids:
        img_id = train_df["id"][idx]
        # assumption: keys are image ids zero-padded to 5 digits
        id_str = str(img_id).zfill(5)
        value = txn.get(id_str.encode())
        img_info = pickle.loads(value)
        img_feat = img_info["features"]

        # compare the raw LMDB features with the ones stored in the DataFrame
        if not np.array_equal(img_feat, train_df.loc[train_df["id"] == img_id].iloc[0]["img_features"]):
            print('problem')
            break
env_db.close()


In [25]:
from transformers import BertTokenizer

In [26]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [28]:
tokenizer.decode(train_df["text_encoding"][42]["input_ids"])

'[CLS] that chiken is so black no one is going to eat it [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'