<a href="https://colab.research.google.com/github/mexca/mexca-workshop/blob/main/notebooks/20240214_mexca_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example: Emotion Feature Extraction With MEXCA

In this notebook, we will use the US presidential debate between Clinton and Trump in 2016 as an example to capture and compare emotion expressions with MEXCA. The video can be found on [YouTube](https://www.youtube.com/watch?v=DBhrSdjePkk), but we will use a file that is hosted by a third party.

The video contains three people (Hillary Clinton, Donald Trump, and a moderator), but in this example we will use a part of the video in which only Clinton and Trump are present. Most frames of the video contain at least one face and speech. Because it contains only a limited number of faces and speakers and the faces are mostly shown in close-ups, the video is a good example to demonstrate MEXCA.

## Preparation

It is recommended to run MEXCA on a GPU, so we need to make sure that a GPU is available. In Google Colab, we can do this by changing the runtime type (under `Runtime`, select `Change runtime type` and then `T4 GPU`).

Before we begin, we must install the full version of the mexca Python package. This can be done with `pip install mexca[all]`. The `[all]` suffix indicates that all components of the MEXCA pipeline should be installed. The installation can take a few minutes to finish. (Note that the `!` prefix is IPython syntax for running a line as a shell command.)

In [None]:
!pip install mexca[all]

# Fixes a bug with Colab and triton package
!pip install --no-deps "triton==2.0.0"

To check if the installation was successful, we can try to access the version of the installed mexca package.

In [None]:
import mexca

mexca.__version__ # Should return `1.0.1`
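
If importing the package is not desirable (for example, in a setup script), the installed version can also be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of mexca.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical usage: prints the version string, or None if mexca is not installed
print(installed_version("mexca"))
```

Returning `None` instead of raising makes the check easy to use in a conditional before running the rest of the notebook.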

We can check if a GPU is available using `torch.cuda.is_available()`.

In [None]:
import torch

torch.cuda.is_available() # Should return `True` if GPU is available

Let us now import the required packages for the remainder of the notebook.

In [None]:
import logging
import os

import matplotlib.pyplot as plt
import numpy as np
import polars as pl
import yaml

from base64 import b64encode
from google.colab import userdata
from urllib.request import urlopen

from mexca.audio import SpeakerIdentifier, VoiceExtractor
from mexca.pipeline import Pipeline
from mexca.text import AudioTranscriber, SentimentExtractor
from mexca.video import FaceExtractor

We also need to download the example video file from the third-party URL. First, we define a function to download a file from a URL.

In [None]:
def download_example(url, filename):
    # Skip the download if the file already exists
    if not os.path.exists(filename):
        # Close the connection once the content has been read
        with urlopen(url) as video:
            content = video.read()

        with open(filename, 'wb') as file:
            file.write(content)

Then, we specify the URL, a name for the video file, and use the download function.

In [None]:
video_url = 'https://books.psychstat.org/rdata/data/debate.mp4'
filename = 'debate.mp4'

download_example(video_url, filename)

We can run `os.path.exists()` to check if the video was successfully downloaded.

In [None]:
os.path.exists(filename) # Should return `True`
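
Note that an existence check alone cannot tell a complete file from one left behind by an interrupted download. As a hedged alternative to the helper above, the sketch below streams the response to disk in chunks and renames the file only once the download has finished. The function name `download_streaming`, the `.part` suffix, and the chunk size are illustrative assumptions.

```python
import os
import shutil
from urllib.request import urlopen


def download_streaming(url, filename, chunk_size=1024 * 1024):
    """Download `url` to `filename` in chunks unless the file already exists.

    Returns True if a download took place and False otherwise.
    """
    if os.path.exists(filename):
        return False

    tmp_name = filename + ".part"
    with urlopen(url) as response, open(tmp_name, "wb") as file:
        # Copy chunk by chunk instead of reading the whole video into memory
        shutil.copyfileobj(response, file, chunk_size)

    # Rename only after the download finished, so that an interrupted
    # download does not leave a truncated file under the final name
    os.replace(tmp_name, filename)
    return True
```

It can be called in the same way as the original helper, for example `download_streaming(video_url, filename)`.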

## Building the Pipeline

Next, we build the pipeline by combining different components for video, audio, and text processing. We first define the number of persons shown in the part of the video that we will analyze (Clinton and Trump). Setting the number of faces and speakers correctly is important as emotion expression features might be attributed to the wrong person otherwise.

In [None]:
num_clusters = 2

Then, we set the device on which the pipeline components should be run. If a GPU is available, we will use it. Otherwise, we will run the components on the CPU.

In [None]:
device = (
    torch.device(type="cuda")
    if torch.cuda.is_available()
    else torch.device(type="cpu")
)

To detect and extract features from faces shown in the video, we create the `FaceExtractor` component. We set `num_faces=num_clusters` so that detected faces will be assigned to two clusters based on their encoded representations (embeddings). We also specify that the component should run on the device defined above.

In [None]:
face_extractor = FaceExtractor(
    num_faces=num_clusters,
    device=device
)

For the audio processing, we create two components. The `SpeakerIdentifier` detects speech segments in the audio signal and assigns them to speaker clusters. As with the faces, we assume that the video has two speakers, so we set `num_speakers=num_clusters`.

*Note*: mexca builds on pretrained models from the pyannote.audio package. Since release 2.1.1, downloading the pretrained models requires the user to accept two user agreements on the Hugging Face Hub and to generate an authentication token. Therefore, to run the mexca pipeline, please accept the user agreements [here](https://huggingface.co/pyannote/speaker-diarization-3.1) and [here](https://huggingface.co/pyannote/segmentation-3.0). Then, generate an authentication token [here](https://huggingface.co/settings/tokens). Use this token as the value for `use_auth_token` (instead of `HF_TOKEN`). The next cell reads it from a Colab secret named `HF_TOKEN`.

In [None]:
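try:
    HF_TOKEN = userdata.get('HF_TOKEN')
except Exception as err:
    # Chain the original error so the cause (e.g., a missing secret) stays visible
    raise RuntimeError(
        "Please generate your own access token for Hugging Face Hub"
    ) from err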
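
Outside of Colab, `google.colab.userdata` is not available. One possible fallback, sketched below, reads the token from an environment variable instead. The helper name `get_hf_token` and the use of an `HF_TOKEN` environment variable are assumptions for illustration and not part of mexca.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the Hugging Face token from Colab secrets or, failing that,
    from an environment variable with the same name."""
    try:
        # Only importable on Google Colab
        from google.colab import userdata

        token = userdata.get(name)
        if token:
            return token
    except Exception:
        # Not on Colab, or the secret is missing or not accessible
        pass

    token = os.environ.get(name)
    if not token:
        raise RuntimeError(
            f"No Hugging Face token found; set a Colab secret or the "
            f"environment variable {name!r}"
        )
    return token
```

With this helper, the token could be obtained with `HF_TOKEN = get_hf_token()` both on Colab and on a local machine.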

In [None]:
speaker_identifier = SpeakerIdentifier(
    num_speakers=num_clusters,
    device=device,
    use_auth_token=HF_TOKEN
)

The `VoiceExtractor` computes vocal emotion expression features from the audio stream of the video. The configuration of the extracted voice feature set can be changed by setting `config=mexca.data.VoiceFeaturesConfig()`, but we will keep the default configuration for this example.

In [None]:
voice_extractor = VoiceExtractor()

To extract the sentiment from the spoken text, we create two text processing components. First, we transcribe the audio signal to text using the `AudioTranscriber` class. The component automatically detects the spoken language of each speech segment and splits the transcribed text into single sentences. The transcription is done using a Whisper model, which comes in different sizes. Larger models are in most cases more accurate but take longer to run. We set `whisper_model="medium"` to use a medium-sized model.

In [None]:
audio_transcriber = AudioTranscriber(
    whisper_model="medium",
    device=device
)

Second, the `SentimentExtractor` predicts a positive, negative, and neutral sentiment score for each sentence.

In [None]:
sentiment_extractor = SentimentExtractor(device=device)

Now, we combine the five components into a `Pipeline` instance, which will run them one after another and integrate the results.

In [None]:
pipeline = Pipeline(
    face_extractor=face_extractor,
    speaker_identifier=speaker_identifier,
    voice_extractor=voice_extractor,
    audio_transcriber=audio_transcriber,
    sentiment_extractor=sentiment_extractor
)

To track the progress of the pipeline, we create a logger to print messages at the `INFO` level.

In [None]:
logging.basicConfig(level="INFO")

To run the pipeline, we call the `apply()` method. The video has a frame rate of 25 frames per second. To speed up the processing, we process 10 video frames at a time (`frame_batch_size=10`) and only every 5th frame (`skip_frames=5`), assuming that emotion expressions do not change substantially within 200 ms (5 frames at 25 frames per second). For this example, we also process only the first 30 seconds using `process_subclip=(0, 30)`.

**Note**: The first time you run the pipeline, pre-trained models will be downloaded automatically, which can take a few minutes.

In [None]:
output = pipeline.apply(
    filename,
    frame_batch_size=10,
    skip_frames=5,
    process_subclip=(0, 30)
)
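
The arithmetic behind these settings can be checked with a short calculation. This is only a back-of-the-envelope sketch: the exact number of frames that mexca processes may differ slightly depending on how the start and end of the subclip are handled.

```python
import math

fps = 25               # frame rate of the video (frames per second)
skip_frames = 5        # process every 5th frame
frame_batch_size = 10  # number of frames processed at a time
start, end = 0, 30     # subclip boundaries in seconds

# Time between two processed frames
interval_ms = 1000 * skip_frames / fps

# Approximate number of processed frames and batches in the subclip
n_processed = int((end - start) * fps / skip_frames)
n_batches = math.ceil(n_processed / frame_batch_size)

print(f"Sampling interval: {interval_ms:.0f} ms")
print(f"Processed frames (approx.): {n_processed}")
print(f"Batches (approx.): {n_batches}")
```

Increasing `skip_frames` shortens the run time roughly in proportion, at the cost of a coarser time resolution.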

To simplify further processing, we store the `features` attribute of the pipeline output, which contains the integrated features as a `polars.lazyframe.frame.LazyFrame`, in a separate variable.

In [None]:
output_df = output.features

We can get a quick glimpse of the output by collecting the lazy frame and calling polars' `describe` method. Note that some columns contain lists as row elements (e.g., `face_box`).

In [None]:
output_df.collect().describe()

## Analyzing Facial Expressions

We start analyzing the output by comparing facial action unit activations between Clinton and Trump.

In [None]:
def stderr(x):
    """Calculate the standard error of the mean."""
    # Use the sample standard deviation (ddof=1), matching polars' default `std`
    return np.std(x, ddof=1)/np.sqrt(len(x))

In [None]:
clinton_id = 1
trump_id = 0

In [None]:
n_au = 27 # Number of action units

# Expand action unit lists into separate rows
au_df = output_df.select(
    [
        pl.lit([list(range(n_au))]).alias("face_au"),
        pl.col("face_aus").alias("value").list.take(list(range(n_au))),
        pl.col("face_label")
    ]
).filter(pl.col("value").is_not_null()).explode(["face_au", "value"])

# Compute mean and standard error for each action unit
au_stats = (
    au_df
    .groupby('face_label', "face_au")
    .agg(
        pl.mean("value").alias("avg"),
        (pl.std("value")/pl.count().sqrt()).alias("ste")
    )
    .sort("face_au", "face_label")
)
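
To make the aggregation above concrete, the following sketch computes the same quantities (mean, standard error, and a normal-approximation 95% interval as used for the error bars in the bar plot) for a single group. The activation values are made up purely for illustration.

```python
import math
import statistics

# Made-up activations of one action unit for one face (illustration only)
values = [0.12, 0.30, 0.25, 0.18, 0.22]

avg = statistics.mean(values)

# Standard error of the mean, based on the sample standard deviation
# (n - 1 in the denominator), as in polars' default `std`
ste = statistics.stdev(values) / math.sqrt(len(values))

# Normal-approximation 95% confidence interval
ci_low, ci_high = avg - 1.96 * ste, avg + 1.96 * ste

print(f"mean = {avg:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only a few observations per group, the factor 1.96 understates the uncertainty somewhat; it is a reasonable approximation when each group contains many frames.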

In [None]:
# Reference ids of the action units
au_ref = [1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22,
          23, 24, 25, 26, 27, 32, 38, 39]

aus = np.arange(n_au)

# Create bar plot with error bars
width = 0.35

fig, ax = plt.subplots()

clinton_au_df = au_stats.filter(
    pl.col("face_label") == str(clinton_id)
).collect()
trump_au_df = au_stats.filter(
    pl.col("face_label") == str(trump_id)
).collect()

ax.bar(aus-width/2, clinton_au_df.select(pl.col("avg")).to_series(), width,
       yerr=1.96*clinton_au_df.select(pl.col("ste")).to_series(),
       capsize=4, ecolor='darkgray', label='Clinton', color='seagreen')

ax.bar(aus+width/2, trump_au_df.select(pl.col("avg")).to_series(), width,
       yerr=1.96*trump_au_df.select(pl.col("ste")).to_series(),
       capsize=4, ecolor='darkgray', label='Trump', color='indianred')

ax.set_xlabel('Action unit')
ax.set_xticks(aus, au_ref)
ax.set_ylabel('Mean activation (95% CI)')
ax.legend()

fig.tight_layout()

plt.show()


The `FaceExtractor` component extracts the activations of 41 action units. Here, we select only the first 27, which are bilateral units (the remaining 14 units correspond to left and right unilateral activations). The bar plot shows substantial differences between Clinton and Trump in the mean activations of units associated with joy (6 and 12; Trump higher than Clinton). There are also differences in units related to sadness (1 and 4; Clinton higher than Trump). Moreover, Clinton shows higher mean activations for units related to fear (1, 4, 5, and 20) and anger (4, 5, and 23). Note that these results must be interpreted with care, as we are not comparing the activations against a reference database or baseline.
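
To look up the activations of the units mentioned above, it helps to translate action unit ids into positions within the list of 27 activations. The sketch below assumes that the positions follow the order of `au_ref` used for the tick labels of the bar plot; the emotion groupings are only those named in the text and are not an exhaustive coding scheme.

```python
# Reference ids of the 27 bilateral action units, in output order
au_ref = [1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22,
          23, 24, 25, 26, 27, 32, 38, 39]

# Emotion-to-unit associations as named in the text
emotion_aus = {
    "joy": [6, 12],
    "sadness": [1, 4],
    "fear": [1, 4, 5, 20],
    "anger": [4, 5, 23],
}

# Zero-based position of each action unit within the list of activations
emotion_au_idx = {
    emotion: [au_ref.index(au) for au in aus]
    for emotion, aus in emotion_aus.items()
}

for emotion, idx in emotion_au_idx.items():
    print(emotion, idx)
```

Note that several units (for example, 4 and 5) appear under more than one emotion, which is one reason why single-unit differences should not be read as evidence for a specific emotion.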

## Analyzing the Voice

Besides facial emotion expressions, mexca also allows us to analyze vocal expressions. By default, it extracts the voice pitch of the speakers in the video, measured as the fundamental frequency F0, which indicates emphasis and is related to emotional arousal. As with the action units, we can compare voice pitch between Clinton and Trump. For an overview of all voice features, see the [documentation](https://mexca.readthedocs.io/en/latest/output.html).

In [None]:
# Get speaker IDs, pitch and time columns
segment_speaker, pitch, time = output_df.select(
    pl.col("segment_speaker_label", "pitch_f0_hz", "time")
).collect().to_numpy().T

# Set non-speaker frames to NaN to avoid lines connecting separate speech segments
clinton_time = time.copy()
clinton_time[segment_speaker != str(clinton_id)] = np.nan
clinton_pitch = pitch.copy()
clinton_pitch[segment_speaker != str(clinton_id)] = np.nan

trump_time = time.copy()
trump_time[segment_speaker != str(trump_id)] = np.nan
trump_pitch = pitch.copy()
trump_pitch[segment_speaker != str(trump_id)] = np.nan
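
The masking idea can be illustrated without the pipeline output. The following plain-Python sketch uses made-up frames and two hypothetical helpers (`mask_other_speakers` and `nan_mean`) to show that masked frames become NaN and are then ignored when averaging, analogous to `np.nanmean`.

```python
import math

# Made-up frames: time stamps, speaker labels, and pitch values (illustration only)
time = [0.0, 0.2, 0.4, 0.6, 0.8]
speaker = ["1", "1", "0", "0", "1"]
pitch = [200.0, 210.0, 120.0, 130.0, 190.0]


def mask_other_speakers(values, labels, keep):
    """Replace values from frames that are not labeled `keep` with NaN."""
    return [v if lab == keep else math.nan for v, lab in zip(values, labels)]


def nan_mean(values):
    """Mean over the non-NaN values, analogous to `np.nanmean`."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)


pitch_1 = mask_other_speakers(pitch, speaker, "1")

print(pitch_1)
print(nan_mean(pitch_1))
```

Because matplotlib does not draw line segments to or from NaN values, the masked frames appear as gaps between separate speech segments in a line plot.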

In [None]:
# Create line plot
fig, ax = plt.subplots()

# Pitch contour, a flat line marking when the speaker is talking, and the mean pitch
ax.plot(clinton_time, clinton_pitch, label='Clinton', color='seagreen')
ax.plot(clinton_time, [0] * clinton_time.shape[0], color='seagreen')
ax.axhline(np.nanmean(clinton_pitch), ls='--', color='seagreen')

ax.plot(trump_time, trump_pitch, label='Trump', color='indianred')
ax.plot(trump_time, [-5] * trump_time.shape[0], color='indianred')
ax.axhline(np.nanmean(trump_pitch), ls='--', color='indianred')

ax.set_xlabel('Time (in s)')
ax.set_xticks(np.arange(35, step=5.0))
ax.set_ylabel('Pitch (F0 in Hz)')
ax.legend()

fig.tight_layout()

plt.show()


The figure shows the voice pitch of Clinton and Trump over time, with the mean pitch of each speaker as a dashed line. It shows that the pitch of Trump's voice is higher on average than Clinton's.

## Analyzing the Text

In addition to facial expressions and voice features, mexca can also extract the sentiment from the spoken text. Again, we can compare the positive, negative, and neutral sentiment in the speech content between Clinton and Trump.

In [None]:
# Extract text sentiment
sent_pos, sent_neg, sent_neu = output_df.select(
    pl.col("span_sent_pos", "span_sent_neg", "span_sent_neu")
).collect().to_numpy().T

In [None]:
# Create line plot
fig, (ax1, ax2, ax3) = plt.subplots(3, 1)

ax1.plot(clinton_time, sent_pos, label='Clinton', color='seagreen')
ax1.plot(trump_time, sent_pos, label='Trump', color='indianred')
ax2.plot(clinton_time, sent_neg, label='Clinton', color='seagreen')
ax2.plot(trump_time, sent_neg, label='Trump', color='indianred')
ax3.plot(clinton_time, sent_neu, label='Clinton', color='seagreen')
ax3.plot(trump_time, sent_neu, label='Trump', color='indianred')

ax1.set_title('Positive')
ax2.set_title('Negative')
ax3.set_title('Neutral')
ax3.set_xlabel('Time (in s)')
for ax in (ax1, ax2, ax3):
    ax.set_xticks(np.arange(35, step=5.0))
    ax.set_yticks(np.arange(1.2, step=0.2))
ax2.set_ylabel('Sentiment score')
ax2.legend()

fig.tight_layout()

plt.show()

We can see that Clinton uses relatively neutral sentiment. Trump, in contrast, has a strongly positive peak in his turn when he talks about "the finest deal you've ever seen" and a negative peak at "all of a sudden you were against it".


In [None]:
# Print transcription at peak time window
with pl.Config(fmt_str_lengths=100):
    print((output_df
        .filter(
            pl.col("time").is_between(22, 27) &
            pl.col("span_text").is_not_null()
        )
        .select(pl.col("time", "span_text"))
        .unique(subset="span_text")
        .sort("time")
    ).collect())

## Summary

In this example, we built a custom pipeline using the mexca package to extract emotion expressions from a video. We ran the pipeline on an excerpt from the 2016 US presidential debate between Clinton and Trump and analyzed differences in facial action unit activations, voice pitch, and speech text sentiment between the two candidates.

## References

Lüken, M., Moodley, K., Viviani, E., Pipal, C., & Schumacher, G. (2024, January 18). MEXCA - A simple and robust pipeline for capturing emotion expressions in faces, vocalization, and speech. https://doi.org/10.31234/osf.io/56svb