## Background:

Following up on this GitHub issue (https://github.com/livepeer/research/issues/17), particularly on the Video Quality Checking topic:
> There are various easily checkable properties that can be extracted from the video itself such as the codec, the resolution, timestamps, and perhaps certain other bitstream features. However those properties alone do not ensure an important aspect of verification: that the transcoded content itself is a reasonable match for the original source given a good-faith effort at transcoding.

> What is a "reasonable match" and what is a "good-faith effort at transcoding"? Some problems with the video may include:
   -  Watermarking or other manipulation of the source content
   -  Uncalled for resolution changes mid-stream
   -  Excessive frame dropping
   -  Low quality encoder or inappropriate encoding settings

> What criteria should we be checking in addition to video quality?
   -  Codec and container itself
   -  Timestamps
   -  Any metadata?


The aim of this notebook is to answer the following questions, raised in the context of finding an objective metric to assess the quality of streamed video:

-  How to use these per-frame scores for video?
-  How to incorporate these scores into a pass/fail classifier?
-  What does each contribute towards the classifier?
-  How are these affected by variations in input and output?
-  How is verification affected if either metric is removed from the equation?
-  Can we extrapolate the behavior of these metrics across unknown inputs? What are the boundaries?
-  What is the performance/computational impact of incorporating these metrics into the classifier? Can this be run online within our sub-2s latency budget?



## Methodology:

We have created a dataset of 140 videos collected from the YT8M dataset using the tool from our previous research (https://github.com/epiclabs-io/YT8M).

These 140 videos are those that have at least four renditions encoded in H264 (out of 240, 360, 480, 720 and 1080).
We start from the null hypothesis that these renditions are transcoded correctly. This seems a very good assumption, as it is in YouTube's own interest that they are.

In order to extend the list of possible metrics, we have researched a few candidates based on perceptual hashing of the original and the encoded videos: cosine distance, Euclidean distance and Hamming distance.


# CODE

### Import necessary libraries and initialize

In [None]:
import random
import pandas as pd
import plotly.offline as offline
import plotly.graph_objs as go
from plotly import tools
import numpy as np
import scipy.stats as stats
from IPython.display import display

from sklearn.cluster import KMeans

pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

offline.init_notebook_mode()

### Retrieve data, group it and normalize it

In [None]:
# Retrieve data from repo
metrics_df = pd.read_csv('../output/metrics.csv')
metrics_df = metrics_df.drop(['Unnamed: 0'], axis=1)

display(metrics_df.head(5))
#metrics_df = metrics_df.drop(metrics_df[metrics_df['temporal_difference-cosine']>0.10].index, axis=0)
metrics_df['title'] = metrics_df['level_0']

# The attack (rendition type) is the second-to-last component of the rendition path
metrics_df['attack'] = metrics_df['level_1'].str.split('/').str[-2]

# Cross-correlation values are stored as bracketed strings: strip the brackets and convert to numbers.
# regex=False treats '[' and ']' as literal characters rather than as a regular expression.
cross_correlation_columns = ['temporal_canny-cross-correlation',
                             'temporal_dct-cross-correlation',
                             'temporal_histogram_distance-cross-correlation',
                             'temporal_cross_correlation-cross-correlation']
for column in cross_correlation_columns:
    stripped = metrics_df[column].str.replace('[', '', regex=False).str.replace(']', '', regex=False)
    metrics_df[column] = pd.to_numeric(stripped)

display(metrics_df.head(25))
display(metrics_df.describe())
# Filter out the rows we are not interested in by dropping them from the dataframe
# metrics_df = metrics_df.drop(metrics_df[metrics_df['temporal_canny-cross-correlation'] > 100].index, axis=0)
# metrics_df = metrics_df.drop(metrics_df[metrics_df['temporal_canny-euclidean'] > 20].index, axis=0)
excluded_attacks = ['rotate', 'flip', 'black', 'vignette']
metrics_df = metrics_df[~metrics_df['attack'].str.contains('|'.join(excluded_attacks))]

# Group values for each title
metrics = ['temporal_canny-mean',
           'temporal_canny-cross-correlation',
           'temporal_canny-euclidean',
           'temporal_canny-std',
           'temporal_dct-mean', 
           'temporal_dct-cross-correlation', 
           'temporal_dct-euclidean', 
           'temporal_dct-std', 
           'temporal_histogram_distance-mean',
           'temporal_histogram_distance-cross-correlation',
           'temporal_histogram_distance-euclidean',
           'temporal_histogram_distance-std',
           'temporal_cross_correlation-mean',
           'temporal_cross_correlation-cross-correlation',
           'temporal_cross_correlation-euclidean',
           'temporal_cross_correlation-std',
           'vmaf']

grouped_df = metrics_df.groupby(['level_0'] + ['level_1'] + metrics + ['attack'], as_index=False).count()
grouped_df = grouped_df.sort_values(by=['attack'])

In [None]:
grouped_df.head(500)

In [None]:
def colors(n):
    ret = []
    r = int(random.random() * 256)
    g = int(random.random() * 256)
    b = int(random.random() * 256)
    step = 256 / n
    for i in range(n):
        r += step
        g += step
        b += step
        r = int(r) % 256
        g = int(g) % 256
        b = int(b) % 256
        ret.append((r,g,b)) 
    return ret

In [None]:
# Function to create a plot from columns of a pandas DataFrame
def plot_metric(metric, dataframe, column='attack', contours=True, type='linear'):
    x_data = dataframe[column]
    y_data = dataframe[metric]
    data = []
    assets_list = dataframe['level_0'].unique()
    # Generate one color per asset once, instead of regenerating the palette on every iteration
    color_list = colors(len(assets_list))

    for asset, (r, g, b) in zip(assets_list, color_list):
        asset_mask = dataframe['level_0'] == asset
        if contours:
            # Optional density contours of the metric for this asset
            data.append(go.Histogram2dContour(x=x_data[asset_mask],
                                              y=y_data[asset_mask],
                                              name=metric))
        # One marker per rendition of this asset, all in the asset's color
        data.append(go.Scatter(x=x_data[asset_mask],
                               y=y_data[asset_mask],
                               mode='markers',
                               marker=dict(color='rgba({}, {}, {}, .9)'.format(r, g, b),
                                           size=5,
                                           opacity=1),
                               text=asset,
                               name=asset,
                               hoverinfo='text'))

    layout = {
        "height": 700,
        "width": 800,
        "title": "Video metrics disparity: {}".format(metric),
        "xaxis": {"title": "Rendition", "type": "category", "automargin": True, "tickangle": 60},
        "yaxis": {"title": "Metric value", "type": type},
        "hovermode": "closest",
        "margin": {"b": 100},
        "showlegend": False
    }

    fig = go.Figure(data=data, layout=layout)
    offline.iplot(fig)

In [None]:
# Iterate through each metric and obtain the respective charts
for metric in metrics:
    plot_metric(metric, grouped_df, contours=False, type='linear')

In [None]:
# Plot charts with logarithmic scale

plot_metric('temporal_dct-mean', grouped_df, contours=False, type='log')

# DATA EXPLORATION

## Spearman's correlation coefficient
We look for potential correlations between metrics. The strength of these correlations is given by Spearman’s correlation coefficient, displayed in the table below. Without going into detail, Spearman’s coefficient conveys the same information as Pearson’s, but is calculated on ranks instead of actual data values, so it captures any monotonic relationship rather than only linear ones. It identifies both positive (blue) and negative (red) correlations, where +1 means total positive correlation (when one feature grows, so does the other) and -1 means total negative correlation (when one feature grows, the other decreases).
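
As a small illustration of the rank-based definition, the sketch below checks that Spearman's coefficient matches Pearson's coefficient computed on ranks. The toy columns `x` and `y` are made up for this example and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): y grows monotonically with x, but not linearly
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})
toy['y'] = toy['x'] ** 3

# Pearson measures linear association on the raw values
pearson = toy.corr('pearson').loc['x', 'y']

# Spearman measures monotonic association
spearman = toy.corr('spearman').loc['x', 'y']

# Pearson applied to the ranks reproduces Spearman
pearson_on_ranks = toy.rank().corr('pearson').loc['x', 'y']

print(pearson, spearman, pearson_on_ranks)
```

Because `y` is a monotonic but non-linear function of `x`, Spearman's coefficient reaches 1 while Pearson's stays below it.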

In [None]:
import matplotlib.pyplot as plt

data = grouped_df[metrics]
corr = data.corr('spearman')
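# Render the correlation table as a heatmap with two decimals (Styler.format(precision=...) needs pandas >= 1.3)
corr.style.background_gradient().format(precision=2)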

In [None]:
metrics_df[metrics].head()

## Scatter matrix between metrics
The matrix below depicts a pairs plot of our newly generated dataset. It builds on two elementary plots: scatter plots of each metric against every other, and kernel density estimates of each metric on the diagonal. We can see that all distances (Euclidean, cosine and Hamming) are linearly related, meaning that they could be used almost interchangeably. SSIM and PSNR are also correlated with each other, in a logarithmic / exponential manner, but inversely with respect to the hash distance metrics. Finally, the more sophisticated MS-SSIM and VMAF show some degree of connection between them, and display a pattern similar to that of SSIM in their lower bound with respect to the hash distance metrics.

In [None]:
from pandas.plotting import scatter_matrix

scatter = scatter_matrix(metrics_df[metrics], alpha=0.2, figsize=(16,16), diagonal='kde')

# DISCUSSION

## Cosine distance

This distance is measured over the hash created by reducing a grayscale version (luminance space) of each frame (original and encoded) to a 16x16 pixel image, then obtaining a 15-bit hash and computing their cosine distance.

## Euclidean distance

This distance is measured over the hash created by reducing a grayscale version (luminance space) of each frame (original and encoded) to a 16x16 pixel image, then obtaining a 15-bit hash and computing their Euclidean distance.

## Hamming distance
This distance is measured over the hash created by reducing a grayscale version (luminance space) of each frame (original and encoded) to a 16x16 pixel image, then obtaining a 15-bit hash and computing their Hamming distance.
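
The three hash distances can be sketched as follows. The notebook does not state which hash function is used, so this example assumes a simple average hash (one bit per pixel of the 16x16 thumbnail, set when the pixel is brighter than the thumbnail mean), which yields 256 bits rather than the 15 bits mentioned above. The function names `average_hash` and `hash_distances` are hypothetical and only NumPy is used.

```python
import numpy as np

def average_hash(gray_frame, hash_size=16):
    """Reduce a 2D grayscale frame to hash_size x hash_size and binarize it (assumed hash)."""
    gray_frame = np.asarray(gray_frame, dtype=float)
    height, width = gray_frame.shape
    # Crop so that both dimensions are multiples of hash_size
    h_crop, w_crop = height - height % hash_size, width - width % hash_size
    cropped = gray_frame[:h_crop, :w_crop]
    # Downscale by averaging equally sized blocks of pixels
    blocks = cropped.reshape(hash_size, h_crop // hash_size, hash_size, w_crop // hash_size)
    thumbnail = blocks.mean(axis=(1, 3))
    # One bit per thumbnail pixel: brighter than the thumbnail mean or not
    return (thumbnail > thumbnail.mean()).astype(float).ravel()

def hash_distances(hash_a, hash_b):
    """Cosine, Euclidean and Hamming distances between two binary hashes."""
    norm = np.linalg.norm(hash_a) * np.linalg.norm(hash_b)
    cosine = 1.0 - float(hash_a @ hash_b) / norm if norm > 0 else 0.0
    return {'cosine': cosine,
            'euclidean': float(np.linalg.norm(hash_a - hash_b)),
            'hamming': int(np.count_nonzero(hash_a != hash_b))}

# Synthetic frames standing in for an original and a noisy rendition
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64))
rendition = np.clip(original + rng.normal(0, 20, size=original.shape), 0, 255)
print(hash_distances(average_hash(original), average_hash(rendition)))
```

For binary hashes the squared Euclidean distance equals the Hamming distance, so under this assumed hash the two always rank renditions in the same order.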

## PSNR
PSNR is computed over the grayscale version of both the reference and the encoded video using the psnr filter of FFmpeg. See the Tools notebook and its associated scripts:
* https://github.com/livepeer/verification-classifier/blob/master/notebooks/Tools.ipynb
* https://github.com/livepeer/verification-classifier/blob/master/scripts/shell/evaluate-psnr-ssim.sh
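
For reference, the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, can be sketched in NumPy as below. This is not the FFmpeg implementation used by the scripts above; the function name `psnr` and the 8-bit peak value of 255 are assumptions made for the example.

```python
import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    reference = np.asarray(reference, dtype=float)
    distorted = np.asarray(distorted, dtype=float)
    # Mean squared error over all pixels
    mse = np.mean((reference - distorted) ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded
        return float('inf')
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy frames (hypothetical): every pixel differs by exactly one gray level, so MSE = 1
reference = np.zeros((4, 4))
distorted = reference + 1
print(psnr(reference, distorted))
```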

## SSIM
SSIM is obtained using the ssim filter of FFmpeg. See the Tools notebook and its associated scripts:
* https://github.com/livepeer/verification-classifier/blob/master/notebooks/Tools.ipynb
* https://github.com/livepeer/verification-classifier/blob/master/scripts/shell/evaluate-psnr-ssim.sh

## MS-SSIM
This metric is obtained using ms-ssim function of the libav framework. See the Tools notebook and its associated scripts: 
* https://github.com/livepeer/verification-classifier/blob/master/notebooks/Tools.ipynb
* https://github.com/livepeer/verification-classifier/blob/master/scripts/shell/evaluate-ms-ssim.sh


## VMAF
This metric is obtained using the libvmaf command from Netflix. See the Tools notebook and its associated scripts: 
* https://github.com/livepeer/verification-classifier/blob/master/notebooks/Tools.ipynb
* https://github.com/livepeer/verification-classifier/blob/master/scripts/shell/evaluate-vmaf.sh


## Temporal difference Euclidean
This metric is the Euclidean distance between two vectors, one obtained from the original video and one from its rendition, each computed in a No Reference manner. First, the time series of the original is computed by subtracting the pixels of each frame from those of the next. The same is applied to each rendition. The distances between the resulting pairs of vectors are then obtained. See the implementation in the video_asset_processor.py module:
* https://github.com/livepeer/verification-classifier/blob/master/scripts/video_asset_processor.py
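
One possible reading of the description above is sketched below. It is not the code from video_asset_processor.py: the function names are hypothetical, and it assumes the rendition has already been rescaled to the resolution of the original and has the same number of frames.

```python
import numpy as np

def temporal_difference(frames):
    """Difference between each frame and the next, for frames of shape (T, H, W)."""
    frames = np.asarray(frames, dtype=float)
    return frames[1:] - frames[:-1]

def temporal_difference_euclidean(original, rendition):
    """Per-time-step Euclidean distance between the two temporal-difference series."""
    diff_original = temporal_difference(original)
    diff_rendition = temporal_difference(rendition)
    # Flatten each difference frame into a vector and take the norm of the gap
    gap = (diff_original - diff_rendition).reshape(len(diff_original), -1)
    return np.linalg.norm(gap, axis=1)

# Synthetic clips standing in for an original and a noisy rendition
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(5, 8, 8)).astype(float)
rendition = original + rng.normal(0, 5, size=original.shape)

series = temporal_difference_euclidean(original, rendition)
# The series can then be summarized, for example by its mean and standard deviation
print(series.mean(), series.std())
```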


# CONCLUSIONS

As explained here: https://link.springer.com/article/10.1007/s11042-017-4831-6,
the metrics above can be categorized as objective Full Reference metrics for Video Quality Assessment.
We would recommend using other types of metrics, as described in the above reference, together with an unsupervised approach, given the complexity of the problem at hand.

#### How to use PSNR / SSIM (per-frame scores) for video (sequence of frames)?

A good characterization of a time series is achieved by using statistical techniques (average, histograms, wavelets, etc.) that summarize the properties of a sequence. In this case we have extracted the mean value for all metrics.
However, the main issue is not just the extrapolation of a per-frame metric (PSNR, MS-SSIM, MSE, entropy, etc.) to a time series of frames.
The most complicated part comes when we need to define what a "good" configuration of those time series is.  
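
A minimal sketch of such a summary is shown below. The function name `summarize_scores`, the choice of statistics and the two toy score sequences are illustrative assumptions, not the notebook's pipeline.

```python
import numpy as np

def summarize_scores(per_frame_scores):
    """Collapse a sequence of per-frame scores into a few summary statistics."""
    scores = np.asarray(per_frame_scores, dtype=float)
    return {'mean': float(scores.mean()),
            'std': float(scores.std()),
            'min': float(scores.min()),
            'p05': float(np.percentile(scores, 5))}

# Two hypothetical SSIM-like sequences with the same mean but different behavior
steady = summarize_scores([0.95] * 10)
glitchy = summarize_scores([1.0] * 9 + [0.5])
print(steady)
print(glitchy)
```

Both sequences have a mean of 0.95, yet only the second contains a badly degraded frame. This is why deciding what a "good" time series looks like is harder than computing a single aggregate.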

#### How to incorporate these scores into a pass/fail classifier? What does each contribute towards the classifier?

As pointed out above, there is no way to do this by simply measuring different instantaneous metrics. We could potentially define acceptable thresholds for these metrics for given renditions and bitrates / encoding parameters, but they would remain arbitrary.
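
To make the point concrete, a naive threshold rule could look like the sketch below. The metric names, the threshold values and the function `passes_thresholds` are all hypothetical placeholders chosen for illustration; nothing in the data justifies these particular numbers.

```python
# Hypothetical minimum acceptable mean scores per metric (arbitrary placeholder values)
MIN_MEAN_SCORES = {'psnr': 30.0, 'ssim': 0.90}

def passes_thresholds(mean_scores, min_scores=MIN_MEAN_SCORES):
    """Pass only if every metric reaches its minimum acceptable mean score."""
    return all(mean_scores[name] >= minimum for name, minimum in min_scores.items())

rendition_scores = {'psnr': 31.0, 'ssim': 0.92}
print(passes_thresholds(rendition_scores))                                 # default thresholds
print(passes_thresholds(rendition_scores, {'psnr': 32.0, 'ssim': 0.90}))   # slightly stricter PSNR
```

The same rendition passes under one set of thresholds and fails under a slightly stricter one, which illustrates why such thresholds remain arbitrary without a principled way of choosing them.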

#### How are these affected by variations in input and output?

In the figures above, we can see that different sequences of frames (140 assets extracted from YouTube) with different configurations give slightly different results, although SSIM seems to be the one with the least dispersion.

#### How is verification affected if either metric is removed from the equation? Can we extrapolate the behavior of these metrics across unknown inputs? What are the boundaries?

As shown in the charts, the mean values for each metric are sensitive to different kinds of encoding and video characteristics. This suggests that (as explained in the literature) different metrics are sensitive to different kinds of inputs. This renders them unsuitable for training a supervised machine learning algorithm, given the breadth of the space of possible ground truths.


#### What is the performance/computational impact of incorporating these metrics into the classifier? Can this be run online within our sub-2s latency budget?


The figure below shows the (more or less) constant elapsed time required to compute both PSNR and SSIM between the original frame and the frames generated at 500Kbps and 250Kbps.

This gives an overhead (in our implementation, using skimage's compare_ssim) of about 0.172s per frame (≈4.1s for 24 frames) for Big Buck Bunny, which exceeds the sub-2s budget if every frame is checked.

Obviously, there is no particular need to check every single frame, and more efficient implementations could be found.
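
The arithmetic behind frame sampling can be sketched as follows. It assumes that the cost scales linearly with the number of frames checked and that the budget applies to a 24-frame segment; the constant names are hypothetical.

```python
import math

SECONDS_PER_FRAME = 0.172   # measured overhead per frame quoted above
LATENCY_BUDGET = 2.0        # sub-2s budget, in seconds
FRAMES_IN_SEGMENT = 24      # assumed segment length, in frames

# Largest number of frames whose total cost stays within the budget
max_frames = int(LATENCY_BUDGET // SECONDS_PER_FRAME)

# Check every stride-th frame so that at most max_frames frames are processed
stride = math.ceil(FRAMES_IN_SEGMENT / max_frames)
sampled_frames = list(range(0, FRAMES_IN_SEGMENT, stride))

full_cost = FRAMES_IN_SEGMENT * SECONDS_PER_FRAME
sampled_cost = len(sampled_frames) * SECONDS_PER_FRAME
print(max_frames, stride, sampled_frames, full_cost, sampled_cost)
```

Under these assumptions at most 11 frames fit in the budget, and checking every third frame (8 frames) costs about 1.4s instead of about 4.1s.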
