# Mexca Demo

**Requirements**: mexca base package, Docker

This is a short demo that illustrates how mexca can be used to extract emotion expression features from a video file. As an example, we will use a video of the 2016 presidential debate between Clinton and Trump. The video can be found on [YouTube](https://www.youtube.com/watch?v=DBhrSdjePkk), but we will use a file that is hosted by a third party.

In [1]:
import logging
import logging.config  # needed for logging.config.dictConfig below
import os
import yaml
import pandas as pd
from huggingface_hub import notebook_login
from IPython.display import Video
from urllib.request import urlopen
from mexca.container import (
    AudioTranscriberContainer,
    FaceExtractorContainer,
    SentimentExtractorContainer,
    SpeakerIdentifierContainer,
    VoiceExtractorContainer,
)
from mexca.pipeline import Pipeline


Before we can apply mexca, we need to download the example video file.

In [2]:
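def download_example(url, filename):
    """Download the file at `url` to `filename` unless it already exists."""
    if not os.path.exists(filename):
        # Close the connection and the file once the download has finished
        with urlopen(url) as video, open(filename, 'wb') as file:
            file.write(video.read())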

In [3]:
example_url = 'https://books.psychstat.org/rdata/data/debate.mp4'
filename = 'debate.mp4'

download_example(example_url, filename)

Video(filename)

*Note*: mexca builds on pretrained models from the pyannote.audio package. Since release 2.1.1, downloading the pretrained models requires accepting two user agreements on the Hugging Face Hub and generating an authentication token. Therefore, before running the mexca pipeline, you must accept the user agreements [here](https://huggingface.co/pyannote/speaker-diarization) and [here](https://huggingface.co/pyannote/segmentation). Then, generate an authentication token [here](https://huggingface.co/settings/tokens). Use this token to log in when running `notebook_login()`. You only need to log in the first time you run mexca.

In [4]:
# Only required the first time to store the token
notebook_login()

To track the progress of the pipeline, we create a logger from the `logging.yml` file in this directory.

In [None]:
with open('logging.yml', 'r', encoding='utf-8') as file:
    config = yaml.safe_load(file)
    logging.config.dictConfig(config)
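
The contents of `logging.yml` are not shown in this demo. As a rough illustration only, the sketch below builds an equivalent configuration as a Python dictionary. The formatter string is an assumption chosen to resemble the log lines printed by the pipeline (timestamp, level, message), and the handler and formatter names (`console`, `standard`) are hypothetical.

```python
import logging
import logging.config

# Assumed format, chosen to resemble lines such as
# "2023-01-26 09:40:12,116 - INFO - Starting MEXCA pipeline"
LOG_FORMAT = '%(asctime)s - %(levelname)s - %(message)s'

# Hypothetical stand-in for the contents of logging.yml
config = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {'format': LOG_FORMAT},
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'standard',
            'level': 'INFO',
        },
    },
    # Send INFO messages and above from all loggers to the console
    'root': {'handlers': ['console'], 'level': 'INFO'},
}

logging.config.dictConfig(config)
logging.getLogger(__name__).info('Logging configured')
```

Loading the same structure from a YAML file with `yaml.safe_load` yields a dictionary like `config`, which is why it can be passed directly to `dictConfig`.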

Now that we have the example video, we can create a mexca pipeline object from containerized components. We specify that mexca should detect two faces and two speakers (Clinton and Trump).

**Note**: The first time you run the pipeline with containerized components, the containers will be downloaded automatically, which can take some time.

In [5]:
pipeline = Pipeline(
    face_extractor=FaceExtractorContainer(num_faces=2),
    speaker_identifier=SpeakerIdentifierContainer(
        num_speakers=2
    ),
    voice_extractor=VoiceExtractorContainer(),
    audio_transcriber=AudioTranscriberContainer(),
    sentiment_extractor=SentimentExtractorContainer()
)

Next, we can apply the mexca pipeline object to the example video file. It can take a long time to process video files. Thus, we will only process 10 seconds of the video by setting the `process_subclip` argument (seconds 19 to 29). We also specify that 5 video frames should be processed at the same time (in a batch), and only every 5th frame should be processed.
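
To make the effect of frame skipping concrete, the sketch below lists which frame indices of the 10-second subclip would be sampled and the timestamps they map to. It is not mexca's internal implementation. It assumes a frame rate of 25 frames per second, inferred from the feature table in this demo, where frame 5 corresponds to 0.2 s, and it assumes that timestamps are counted from the start of the subclip.

```python
# Assumption: 25 fps, inferred from the demo output (frame 5 <-> 0.2 s)
FPS = 25
SKIP_FRAMES = 5        # process only every 5th frame
SUBCLIP = (19, 29)     # seconds of the original video

# Number of frames in the subclip
n_frames = (SUBCLIP[1] - SUBCLIP[0]) * FPS

# Frame indices that are kept, relative to the start of the subclip
frames = list(range(0, n_frames, SKIP_FRAMES))

# Timestamps (in seconds) of the kept frames, relative to the subclip start
times = [frame / FPS for frame in frames]

print(frames[:3], times[:3])
print(frames[-1], times[-1])
```

Under these assumptions, skipping reduces 250 frames to 50, which is the main reason the example runs in a reasonable time.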

In [6]:
output = pipeline.apply(filename, frame_batch_size=5, skip_frames=5, process_subclip=(19, 29))

2023-01-26 09:40:12,116 - INFO - Starting MEXCA pipeline
2023-01-26 09:40:12,607 - INFO - Wrote audio file to debate.wav
2023-01-26 09:40:12,616 - INFO - Processing video frames
  0%|          | 0/11 [00:00<?, ?it/s]
  9%|▉         | 1/11 [00:13<02:16, 13.62s/it]
 18%|█▊        | 2/11 [00:16<01:05,  7.28s/it]
 27%|██▋       | 3/11 [00:18<00:40,  5.07s/it]
 36%|███▋      | 4/11 [00:21<00:28,  4.00s/it]
 45%|████▌     | 5/11 [00:23<00:20,  3.38s/it]
 55%|█████▍    | 6/11 [00:25<00:14,  2.92s/it]
 64%|██████▎   | 7/11 [00:28<00:11,  2.77s/it]
 73%|███████▎  | 8/11 [00:30<00:07,  2.62s/it]
 82%|████████▏ | 9/11 [00:32<00:04,  2.39s/it]
 91%|█████████ | 10/11 [00:34<00:02,  2.27s/it]
100%|██████████| 11/11 [00:34<00:00,  1.77s/it]
100%|██████████| 11/11 [00:34<00:00,  3.17s/it]

2023-01-26 09:40:59,297 - INFO - Identifying speakers
torchvision is not available - cannot save figures

2023-01-26 09:41:40,336 - INFO - Transcribing speech segments to text
  0%|          | 0/3 [00:00<?, ?it/s]
 

The pipeline returns a `Multimodal` object that contains the extracted emotion expression features in the `features` attribute.

In [10]:
output.features

Unnamed: 0,frame,time,face_box,face_prob,face_landmarks,face_aus,face_label,face_confidence,segment_start,segment_end,segment_speaker_label,span_start,span_end,span_text,span_sent_pos,span_sent_neg,span_sent_neu,pitch_f0
0,0,0.0,"[153.57325744628906, 52.99969482421875, 188.68...",0.999950,"[[157.13745515746177, 78.60379230039689], [158...","[0.42119452357292175, 0.45601001381874084, 0.4...",0.0,0.815524,,,,,,,,,,
1,0,0.0,"[343.40350341796875, 241.61273193359375, 364.0...",0.719327,"[[344.8119846885391, 251.87311820705798], [345...","[0.4797929525375366, 0.5523781776428223, 0.353...",0.0,0.417723,,,,,,,,,,
2,5,0.2,"[152.47720336914062, 51.66067886352539, 188.33...",0.999502,"[[157.78725514356384, 76.64474904225453], [159...","[0.3446630835533142, 0.42248767614364624, 0.75...",0.0,0.808896,,,,,,,,,,257.139719
3,5,0.2,"[343.40350341796875, 241.61273193359375, 364.0...",0.719327,"[[344.8119846885391, 251.87311820705798], [345...","[0.4797929525375366, 0.5523781776428223, 0.353...",0.0,0.417723,,,,,,,,,,257.139719
4,10,0.4,"[154.5249481201172, 52.934913635253906, 189.00...",0.999963,"[[158.84604518055403, 76.65568378986245], [160...","[0.3668370842933655, 0.4343304932117462, 0.619...",0.0,0.856980,,,,,,,,,,248.686417
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,230,9.2,"[300.0128173828125, 78.5645751953125, 387.7807...",0.999516,"[[300.24064894851637, 121.72136614803566], [30...","[0.26662319898605347, 0.4757242798805237, 0.34...",1.0,0.924672,0.5,9.46,0,,,,,,,214.665366
59,235,9.4,"[300.8193664550781, 63.438255310058594, 390.40...",0.998520,"[[302.5129394743491, 113.25603919910617], [303...","[0.2896226644515991, 0.4626683294773102, 0.300...",1.0,0.939227,0.5,9.46,0,,,,,,,220.147673
60,240,9.6,"[296.2780456542969, 62.41368103027344, 389.360...",0.999893,"[[301.5640075994887, 115.95888258115154], [302...","[0.3490285575389862, 0.4520280361175537, 0.347...",1.0,0.924410,,,,,,,,,,194.231820
61,245,9.8,"[294.7552490234375, 62.13750457763672, 389.665...",0.999851,"[[300.4995698369823, 116.58932477986455], [300...","[0.27883127331733704, 0.4050939679145813, 0.39...",1.0,0.878531,,,,,,,,,,176.077137


The column names of the data frame tell us which features our pipeline extracted. Columns with the `face_` prefix contain facial expression features and information about the detected faces. Columns with the `segment_` prefix contain information about the speech segments (note that this can be unreliable for video segments this short). Currently, mexca only extracts the voice pitch `pitch_f0` from the audio signal. The prefix `span_` indicates columns with information about sentences of the transcribed spoken text. For further information about the output and features, see the [documentation](https://mexca.readthedocs.io/en/latest/index.html).
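
Because the features are stored in a data frame, the column prefixes can be used to select groups of features with ordinary pandas operations. The sketch below uses a small toy data frame as a stand-in for `output.features`: the column names are taken from the output above, but the values are illustrative and only a subset of the columns is included.

```python
import pandas as pd

nan = float('nan')

# Toy stand-in for `output.features` (illustrative values, subset of columns)
features = pd.DataFrame({
    'frame': [0, 5, 10, 230],
    'time': [0.0, 0.2, 0.4, 9.2],
    'face_label': [0.0, 0.0, 0.0, 1.0],
    'face_prob': [0.99, 0.99, 0.99, 0.99],
    'segment_speaker_label': [nan, nan, nan, 0.0],
    'pitch_f0': [nan, 257.1, 248.7, 214.7],
})

# Select all columns that describe detected faces
face_cols = [col for col in features.columns if col.startswith('face_')]
print(features[face_cols])

# Average voice pitch per detected face label (missing values are ignored)
mean_pitch = features.groupby('face_label')['pitch_f0'].mean()
print(mean_pitch)
```

The same selection by prefix works for the `segment_` and `span_` columns.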