<h3>Next steps</h3>
<ul>
    <div style="list-style: none; padding-inline-start: 0px !important; margin-block-start: 0em !important;">
        <strike>✓ Take the sets of frames and apply the cosine similarity with the sentence to build the co-attention matrix</strike><br/>
        <strike>✓ Generate the co-attention representations</strike><br/>
        <strike>✓ Compute the attention-based similarity between the video segment and the sentence</strike><br/>
        <strike>✓ Build a function that processes the data on the fly</strike><br/>
        <strike>✓ Build a batch with more than one example (required)</strike><br/>
        <strike>✓ Generate the top-K hard negative examples</strike><br/>
        <strike>✓ Build the margin-based ranking loss function</strike><br/>
    </div>
    <li>Compute the moving average of the video-sentence pairs</li>
    <li>Build the metric functions</li>
    <li>Feed the dataset into the pipeline automatically</li>
    <li>Run a first version of the training</li>
</ul>

In [57]:
!python utils/teste.py

python: can't open file 'utils/teste.py': [Errno 2] No such file or directory


In [1]:
from tensorflow.keras.layers import Conv3D, MaxPool3D, Flatten, Dense, Layer, Bidirectional
from tensorflow.keras.layers import Dropout, Input, BatchNormalization, Embedding, GRU
from tensorflow.keras import models
from tensorflow.keras import Model
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.data import Dataset
from tensorflow.keras import initializers
from tensorflow import convert_to_tensor, split, expand_dims
import tensorflow as tf
# from utils.frames import load_transform_video, mapped_load_transform_video

import os
import json
import numpy as np
import pandas as pd
import dask.dataframe as dd

import random
import re
import tempfile
import ssl
import cv2
from utils import embed

import imageio
from IPython import display

In [2]:
with open('data/train_data.json', 'r') as f:
    train_data = pd.Series(json.load(f)).apply(pd.Series)[:200]
    train_data = dd.from_pandas(train_data, npartitions=200)

In [3]:
def to_gif(images):
    converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
    imageio.mimsave('./animation.gif', converted_images, fps=30)
    return embed.embed_file('./animation.gif')

In [4]:
def load_video(path, max_frames=900, min_frames=900, resize=(112, 112)):
    """Read a video, center-crop each frame to a square, resize it, and zero-pad the clip."""
    cap = cv2.VideoCapture(path)
    # A list avoids the uninitialized first frame that np.empty used to introduce.
    frames = []
    try:
        while len(frames) < max_frames:
            ret, frame = cap.read()
            if not ret:
                break

            # Crop the largest centered square.
            y, x = frame.shape[0:2]
            min_dim = min(y, x)
            start_x = (x // 2) - (min_dim // 2)
            start_y = (y // 2) - (min_dim // 2)
            frame = frame[start_y:start_y+min_dim, start_x:start_x+min_dim]
            frame = cv2.resize(frame, resize)
            frames.append(frame[:, :, [2, 1, 0]])  # BGR -> RGB

    finally:
        cap.release()

    # Zero-pad short (or unreadable) videos, using the same stopping rule as before.
    while len(frames) < min_frames or (min_frames < len(frames) < max_frames):
        frames.append(np.zeros(resize[::-1] + (3,), dtype=np.uint8))

    return (np.stack(frames) / 255.0).astype(np.float32)
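
The cropping and padding steps of `load_video` can be checked without a video file. The sketch below is only an illustration on a synthetic clip: `crop_center_square` and `pad_frames` are hypothetical helper names, and the 6x10 frame size is an arbitrary assumption.

```python
import numpy as np

def crop_center_square(frame):
    """Return the largest centered square of an (H, W, C) frame."""
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y:start_y + min_dim, start_x:start_x + min_dim]

def pad_frames(frames, n_frames):
    """Zero-pad a (T, H, W, C) clip along the time axis up to n_frames."""
    missing = max(0, n_frames - len(frames))
    padding = np.zeros((missing,) + frames.shape[1:], dtype=frames.dtype)
    return np.concatenate([frames, padding], axis=0)

# Synthetic 4-frame clip of 6x10 RGB "images" (assumed sizes, no file needed).
clip = np.random.rand(4, 6, 10, 3).astype(np.float32)
cropped = np.stack([crop_center_square(f) for f in clip])
padded = pad_frames(cropped, n_frames=6)
print(cropped.shape, padded.shape)
```

Padding with zeros keeps every clip at a fixed length, which is what the fixed `output_shapes` passed to `from_generator` later in the notebook relies on.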

In [5]:
def get_embedding_matrix(descriptions, embedding_dim, n_tokens):
    
    vectorizer = TextVectorization(max_tokens=n_tokens, output_sequence_length=25)
    text_ds = Dataset.from_tensor_slices(descriptions).batch(16)
    vectorizer.adapt(text_ds)

    path_to_glove_file = "utils/vocabulary/glove.6B.{}d.txt".format(embedding_dim)

    embeddings_index = {}
    with open(path_to_glove_file, 'rb') as f:
        data = [line.split(maxsplit=1) for line in f]
    pretrained_embeddings = pd.DataFrame.from_records(data, columns=['word', 'coefs'])
    pretrained_embeddings['word'] = pretrained_embeddings['word'].apply(lambda x: x.decode())
    pretrained_embeddings['coefs'] = pretrained_embeddings['coefs'].apply(lambda x: np.fromstring(x, "f", sep=" "))
    pretrained_embeddings = pretrained_embeddings.set_index('word', drop=True).iloc[:, 0]

    voc = vectorizer.get_vocabulary()
    num_tokens = len(voc) + 2
    word_index = dict(zip(voc, range(len(voc))))
    hits = 0
    misses = 0

    embedding_matrix = np.zeros((num_tokens, embedding_dim))
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = pretrained_embeddings[word]
            hits += 1
        except KeyError:
            misses += 1
            pass
    return vectorizer, embedding_matrix
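
To make the lookup logic of `get_embedding_matrix` easier to follow, here is a minimal NumPy-only sketch on made-up data. The two "GloVe" lines, the toy vocabulary, and the helper names `parse_glove` and `build_embedding_matrix` are assumptions for illustration, not part of the notebook.

```python
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its coefficients.
toy_glove_lines = [
    "the 0.1 0.2 0.3",
    "man 0.4 0.5 0.6",
]

def parse_glove(lines):
    """Map each token to a float32 vector."""
    index = {}
    for line in lines:
        word, coefs = line.split(maxsplit=1)
        index[word] = np.array(coefs.split(), dtype=np.float32)
    return index

def build_embedding_matrix(vocabulary, index, embedding_dim):
    """Row i holds the vector of vocabulary[i]; unknown tokens stay all-zero."""
    matrix = np.zeros((len(vocabulary), embedding_dim), dtype=np.float32)
    hits, misses = 0, 0
    for i, word in enumerate(vocabulary):
        vector = index.get(word)
        if vector is None:
            misses += 1
        else:
            matrix[i] = vector
            hits += 1
    return matrix, hits, misses

# Toy vocabulary: padding token, unknown token, then three words.
vocabulary = ["", "[UNK]", "the", "man", "kitten"]
matrix, hits, misses = build_embedding_matrix(vocabulary, parse_glove(toy_glove_lines), 3)
print(hits, misses)
```

Tokens without a pretrained vector keep an all-zero row, which matches what the `KeyError` branch in the cell above does.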

In [6]:
train_data['video'] = 'datasets/DiDeMo/' + train_data['video'] + '.mp4'
paths = tf.data.Dataset.from_tensor_slices(train_data['video'])

In [7]:
class VideoDataset(tf.data.Dataset):
    def _generator(paths, n_frames, size):
        for path in paths:
            yield load_video(path.decode(), n_frames, n_frames, resize=tuple(size))

    def __new__(cls, paths, n_frames, size=(112,112)):
        return tf.data.Dataset.from_generator(
            cls._generator,
            output_types=tf.dtypes.float32,
            output_shapes=[n_frames] + list(size) + [3],
            args=(paths, n_frames, size)
        )

In [8]:
videos = VideoDataset(train_data['video'].compute().values, 900)

In [9]:
class VideoLayer(Layer):
    def __init__(self):
        super(VideoLayer, self).__init__()
        self.conv3d_1 = Conv3D(32, kernel_size=7, activation='relu')
        self.conv3d_2 = Conv3D(32, kernel_size=5, activation='relu')
        self.conv3d_3 = Conv3D(32, kernel_size=3, activation='relu')
        
        self.maxpool3d_1 = MaxPool3D(pool_size=3)
        self.maxpool3d_2 = MaxPool3D(pool_size=3)
        self.maxpool3d_3 = MaxPool3D(pool_size=3)
        
        self.bn_1 = BatchNormalization()
        self.flatten_1 = Flatten()
        
        self.dense_1 = Dense(units=128, activation='relu')
        
    def call(self, inputs):
        x = self.conv3d_1(inputs)
        x = self.maxpool3d_1(x)
        x = self.conv3d_2(x)
        x = self.maxpool3d_2(x)
        x = self.conv3d_3(x)
        x = self.maxpool3d_3(x)
        x = self.bn_1(x)
        x = self.flatten_1(x)
        return self.dense_1(x)
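
As a sanity check on `VideoLayer`, the output size of each stage can be computed by hand. The sketch below assumes the Keras defaults (`padding='valid'` and stride 1 for `Conv3D`, stride equal to the pool size for `MaxPool3D`) and a 75-frame segment of 112x112 frames, as used later in the notebook. The helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Output length of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output length of a 'valid' max-pool whose stride equals the pool size."""
    return (size - pool) // pool + 1

def video_layer_shape(frames, height, width, filters=32):
    """Shape after the three Conv3D + MaxPool3D stages (batch axis omitted)."""
    shape = [frames, height, width]
    for kernel in (7, 5, 3):  # kernel sizes of conv3d_1, conv3d_2, conv3d_3
        shape = [pool_out(conv_out(s, kernel), 3) for s in shape]
    return tuple(shape) + (filters,)

shape = video_layer_shape(75, 112, 112)
flat = shape[0] * shape[1] * shape[2] * shape[3]
print(shape, flat)  # (1, 2, 2, 32) 128
```

Under these assumptions the flattened feature has 128 values before `dense_1`, and the temporal axis has shrunk to 1, so much shorter segments would not survive the three stages.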

In [10]:
class SentenceLayer(Layer):
    def __init__(self, n_tokens, embedding_dim, embedding_matrix):
        super(SentenceLayer, self).__init__()
        self.embedding_1 = Embedding(
            n_tokens,
            embedding_dim,
            embeddings_initializer=initializers.Constant(embedding_matrix),
            trainable=False
        )
        
        self.bigru_1 = Bidirectional(GRU(64, return_sequences=True))
        
    def call(self, inputs):
        x = self.embedding_1(inputs)
        return self.bigru_1(x)

In [11]:
class MomentVideo(Model):
    def __init__(self, frame_rate, n_tokens, embedding_dim, embedding_matrix=None):
        super(MomentVideo, self).__init__()
        self.video_1 = VideoLayer()
        self.sentence_1 = SentenceLayer(n_tokens, embedding_dim, embedding_matrix)
        self.frame_rate = frame_rate
    
    
    def cosine_similarity(self, tensor1, tensor2):
        num = tf.linalg.matmul(expand_dims(tensor1, 1), expand_dims(tensor2, 1), transpose_a=True)
        den = tf.norm(tensor1)*tf.norm(tensor2)
        return num/(den+1e-15)
    
    
    def matrix_cosine_similarity(self, tensor1, tensor2):
        matrix = []
        for i in range(tensor1.shape[0]):
            row = []
            for j in range(tensor2.shape[0]):
                row.append(self.cosine_similarity(tensor1[i, :], tensor2[j, :]))
            matrix.append(row)
        return tf.stack(matrix)[:,:,0,0]
    
    
    def similarity_between_repr_and_attend(self, tensor1, tensor2):
        score = 0
        for k in range(tensor1.shape[0]):
            score += self.cosine_similarity(tensor1[k, :], tensor2[k, :])
        return score / tensor1.shape[0]
    
    
    def get_scores(self, video_repr_tensor, sentence_repr_tensor):
        coattention_matrix = self.matrix_cosine_similarity(video_repr_tensor, sentence_repr_tensor)
        
        normalized_sentence = tf.nn.softmax(coattention_matrix, axis=1)
        normalized_video = tf.nn.softmax(coattention_matrix, axis=0)
            
        matrix = []
        for i in range(video_repr_tensor.shape[0]):
            row_sum = np.zeros(sentence_repr_tensor.shape[-1])  # feature size instead of a hard-coded 128
            for j in range(sentence_repr_tensor.shape[0]):
                row_sum += normalized_sentence[i][j] * sentence_repr_tensor[j, :]
            matrix.append(row_sum)
            
        sentence_attention = tf.stack(matrix)
    
        matrix = []
        for j in range(sentence_repr_tensor.shape[0]):
            row_sum = np.zeros(video_repr_tensor.shape[-1])  # feature size instead of a hard-coded 128
            for i in range(video_repr_tensor.shape[0]):
                row_sum += normalized_video[i][j] * video_repr_tensor[i, :]
            matrix.append(row_sum)
            
        video_attention = tf.stack(matrix)
        
        score_video = self.similarity_between_repr_and_attend(video_repr_tensor, sentence_attention)
        score_sentence = self.similarity_between_repr_and_attend(sentence_repr_tensor, video_attention)
        
        return score_video, score_sentence
    
    
    def call(self, videos, sentences):
        '''
            Parameters:
                videos (batch) - the raw videos of the dataset
                sentences (batch) - the vectorized sentences of the dataset
            
            Return:
                scores_video, scores_sentence - matrices with the score of each
                video with respect to each sentence, and of each sentence with
                respect to each video
        '''

        
        #extract the features
        video_repr = []
        for last_frame in range(videos.shape[1] // self.frame_rate):
            video_repr.append(self.video_1(videos[:, last_frame*self.frame_rate:(last_frame+1)*self.frame_rate]))
        videos_repr = tf.stack(video_repr, axis=1)
        sentences_repr = self.sentence_1(sentences)
        
        n_batch = videos.shape[0]
        
        scores_video = np.empty([n_batch, n_batch])
        scores_sentence = np.empty([n_batch, n_batch])
        
        for i in range(n_batch):
            for j in range(n_batch):
                scores_video[i][j], scores_sentence[j][i] = self.get_scores(videos_repr[i], sentences_repr[j])
                
        return scores_video, scores_sentence
    
    
#     def train_step(self, data):
#         # Unpack the data. Its structure depends on your model and
#         # on what you pass to `fit()`.
#         videos, sentences, y = data
        
#         video_repr = []
#         with tf.GradientTape() as tape:
#             y_pred = self(x, training=True)  # Forward pass
#             # Compute the loss value
#             # (the loss function is configured in `compile()`)
#             loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

#         # Compute gradients
#         trainable_vars = self.trainable_variables
#         gradients = tape.gradient(loss, trainable_vars)
#         # Update weights
#         self.optimizer.apply_gradients(zip(gradients, trainable_vars))
#         # Update metrics (includes the metric that tracks the loss)
#         self.compiled_metrics.update_state(y, y_pred)
#         # Return a dict mapping metric names to current value
#         return {m.name: m.result() for m in self.metrics}
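
The nested loops in `get_scores` can also be written with matrix products. The following is a NumPy sketch of the same computation for one video-sentence pair, not the notebook's implementation. The function names are hypothetical, the epsilon is applied per norm rather than to the product of norms, and the shapes (12 segments, 25 tokens, 128 features) are assumed from the values used elsewhere in the notebook.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, eps=1e-15):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    return np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)

def coattention_scores(video, sentence):
    """video: (Tv, d) segment features; sentence: (Ts, d) token features."""
    # Cosine similarity between every segment and every token: (Tv, Ts).
    coattention = l2_normalize(video) @ l2_normalize(sentence).T
    # Each segment attends over the tokens: (Tv, d).
    sentence_attention = softmax(coattention, axis=1) @ sentence
    # Each token attends over the segments: (Ts, d).
    video_attention = softmax(coattention, axis=0).T @ video
    score_video = rowwise_cosine(video, sentence_attention).mean()
    score_sentence = rowwise_cosine(sentence, video_attention).mean()
    return score_video, score_sentence

rng = np.random.default_rng(0)
video = rng.normal(size=(12, 128))     # assumed: 900 frames / 75 per segment
sentence = rng.normal(size=(25, 128))  # assumed: 25 tokens, 2 x 64 GRU units
score_video, score_sentence = coattention_scores(video, sentence)
print(score_video, score_sentence)
```

Expressing the same steps with TensorFlow ops instead of NumPy arrays would keep the scores differentiable, which the planned training step will need.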

In [12]:
def margin_based_ranking_loss(scores_video, scores_sentence, margin=1, top_k=2):
    """Hinge ranking loss over the top-k hardest negatives of each row."""

    def one_direction(scores):
        n = scores.shape[0]
        # Matching pairs sit on the diagonal.
        positives = scores.diagonal().reshape(-1, 1)
        # Select the off-diagonal entries with an explicit mask; comparing the
        # matrix with its transpose would also drop symmetric off-diagonal pairs.
        negatives = scores[~np.eye(n, dtype=bool)].reshape(n, n - 1)
        # Keep the top_k highest-scoring (hardest) negatives of each row.
        hardest = np.sort(negatives, axis=1)[:, ::-1][:, :top_k]
        # The margin argument is now used instead of a hard-coded 1.
        return np.sum(np.maximum(0, margin - positives + hardest), axis=1)

    return one_direction(scores_video) + one_direction(scores_sentence)
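
A small worked example may help to check the loss by hand. The sketch below is self-contained. The 3x3 score matrix, the margin of 0.5, and the helper name `hinge_top_k` are made up for illustration. It also shows why selecting the negatives by comparing the matrix with its transpose is fragile: on a symmetric matrix that comparison selects nothing.

```python
import numpy as np

def hinge_top_k(scores, margin, top_k):
    """Per-row hinge loss against the top_k hardest off-diagonal scores."""
    n = scores.shape[0]
    positives = scores.diagonal().reshape(-1, 1)
    negatives = scores[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    hardest = np.sort(negatives, axis=1)[:, ::-1][:, :top_k]
    return np.sum(np.maximum(0, margin - positives + hardest), axis=1)

scores = np.array([
    [0.9, 0.2, 0.5],
    [0.1, 0.3, 0.4],
    [0.6, 0.7, 0.8],
])
loss = hinge_top_k(scores, margin=0.5, top_k=1)
# Row 0: 0.5 - 0.9 + 0.5 = 0.1
# Row 1: 0.5 - 0.3 + 0.4 = 0.6
# Row 2: 0.5 - 0.8 + 0.7 = 0.4
print(loss)  # approximately [0.1, 0.6, 0.4]

# A symmetric score matrix has no entry that differs from its transpose.
symmetric = np.array([[1.0, 0.5], [0.5, 1.0]])
print(symmetric[symmetric != symmetric.T].size)  # 0
```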

In [12]:
embedding_dim = 50
vectorizer, embedding_matrix = get_embedding_matrix(train_data['description'], embedding_dim, 20000)
iter_videos = iter(videos.batch(4))
descriptions = vectorizer(train_data['description'].compute())

In [13]:
batch_video = iter_videos.get_next()

In [26]:
with tf.device('cpu:0'):
    moment = MomentVideo(75, len(embedding_matrix), embedding_dim, embedding_matrix)
    scores_video, scores_sentence = moment(batch_video, descriptions[:4])

In [31]:
scores_video

array([[0.08091357, 0.11735822, 0.12237162, 0.13858634],
       [0.08166637, 0.12108622, 0.12046162, 0.14008217],
       [0.07299844, 0.10532001, 0.10973985, 0.11165836],
       [0.08955354, 0.12799676, 0.12595092, 0.13003561]])

In [84]:
scores_sentence

array([[0.04544296, 0.05315233, 0.05486396, 0.06009117],
       [0.07006237, 0.07351372, 0.08248698, 0.08085805],
       [0.06270567, 0.06977195, 0.07429998, 0.07614164],
       [0.05711501, 0.06609467, 0.05781157, 0.06316324]])


# Exploratory Analysis of the Video Descriptions

In [97]:
df = pd.Series(train_data).apply(pd.Series)

Unnamed: 0,num_segments,description,dl_link,times,video,annotation_id
0,6,a brown rat goes into someone's hand then onto...,https://www.flickr.com/video_download.gne?id=2...,"[[2, 2], [2, 2], [2, 2], [2, 2]]",54322086@N00_2408598493_274c77d26a.avi,2
1,6,an orange kitten is sitting then gets up and w...,https://www.flickr.com/video_download.gne?id=2...,"[[5, 5], [5, 5], [5, 5], [5, 5]]",99051133@N00_2502628368_d14bd317de.mov,3
2,6,a person walks outside and then back in,https://www.flickr.com/video_download.gne?id=5...,"[[3, 3], [3, 3], [3, 3], [3, 4]]",67801451@N00_5358663022_243bd90fbc.mov,4
3,6,the guards spin around 180 degrees,https://www.flickr.com/video_download.gne?id=4...,"[[4, 4], [4, 5], [3, 4], [3, 4]]",64379474@N00_4479342537_7b5a3d3f1d.avi,5
4,6,the plane flies off the roof,https://www.flickr.com/video_download.gne?id=9...,"[[5, 5], [4, 5], [4, 5], [4, 5]]",63122283@N06_9978694646_e72011157f.mov,7
...,...,...,...,...,...,...
33000,6,man holds his hand up to feed the birds for th...,https://www.flickr.com/video_download.gne?id=3...,"[[1, 1], [1, 1], [1, 1], [1, 1]]",65977087@N00_3751407080_6bda50e6df.mov,64632
33001,6,someone walks across bottom screen,https://www.flickr.com/video_download.gne?id=7...,"[[0, 1], [0, 0], [2, 2], [0, 0]]",89333651@N00_7564648598_776b587a55.mp4,64637
33002,6,first time we can see the musicians,https://www.flickr.com/video_download.gne?id=4...,"[[2, 2], [2, 2], [2, 2], [2, 2]]",77118917@N00_4606529468_6ba24fe0c9.avi,64638
33003,6,the pitcher's mound comes into view,https://www.flickr.com/video_download.gne?id=4...,"[[3, 3], [3, 3], [3, 3], [3, 3]]",72426516@N00_4868053620_46b99ec39f.mov,64641


In [128]:
pd.set_option('display.max_rows', 150)
df['description'].str.len().quantile(0.999)

127.99600000000646

This means that 99.9% of the descriptions are at most 128 characters long.

In [127]:
df['description'].str.split().str.len().quantile(0.999)

25.0

This means that 99.9% of the descriptions contain at most 25 words, which matches the `output_sequence_length=25` used in `get_embedding_matrix`.

In [115]:
words = df['description'].str.split()
# explode() yields one row per token without building a wide intermediate DataFrame
word_counts = words.explode().value_counts()

In [116]:
word_counts

the          24341
a            10974
in            8262
to            5668
man           4637
             ...  
Green,           1
snatched         1
filmer.          1
harnesses        1
master.          1
Length: 9260, dtype: int64

Clearly, the most frequent words are function words, that is, articles and prepositions such as 'the', 'a', 'in', and 'to'.
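
The counts above come from whitespace splitting, so tokens such as 'Green,' and 'filmer.' keep their capitalization and punctuation and are counted separately from 'green' and 'filmer'. A minimal standard-library sketch of a normalized count follows. The toy descriptions and the regular expression are assumptions, and the notebook's `TextVectorization` layer applies its own standardization.

```python
import re
from collections import Counter

def normalized_tokens(description):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", description.lower())

# Made-up descriptions, only to illustrate the effect of normalization.
toy_descriptions = [
    "The man waves at the filmer.",
    "A green bird lands; the Green, bright bird flies.",
]

counts = Counter(
    token
    for description in toy_descriptions
    for token in normalized_tokens(description)
)
print(counts["the"], counts["green"], counts["filmer"])  # 3 2 1
```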

In [30]:
tf.norm([3.0, 4.0])

<tf.Tensor: shape=(), dtype=float32, numpy=5.0>