# Mexca Demo

This short demo illustrates how mexca can be used to extract emotion expression features from a video file. As an example, we will use a video of the 2016 presidential debate between Clinton and Trump. The video can be found on [YouTube](https://www.youtube.com/watch?v=DBhrSdjePkk), but we will use a file that is hosted by a third party.

In [1]:
from huggingface_hub import notebook_login
from mexca.core.pipeline import Pipeline
from urllib.request import urlopen
import os
import pandas as pd
from IPython.display import Video

Before we can apply mexca, we need to download the example video file.

In [2]:
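def download_example(url, filename):
    # Download the file only if it does not exist locally yet
    if not os.path.exists(filename):
        # The context managers close the HTTP response and the file handle
        with urlopen(url) as video, open(filename, 'wb') as file:
            file.write(video.read())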

In [3]:
example_url = 'https://books.psychstat.org/rdata/data/debate.mp4'
filename = 'debate.mp4'

download_example(example_url, filename)

Video(filename)

*Note*: mexca builds on pretrained models from the pyannote.audio package. Since release 2.1.1, downloading the pretrained models requires the user to accept two user agreements on the Hugging Face Hub and generate an authentication token. Therefore, to run the mexca pipeline, please accept the user agreements [here](https://huggingface.co/pyannote/speaker-diarization) and [here](https://huggingface.co/pyannote/segmentation). Then, generate an authentication token [here](https://huggingface.co/settings/tokens). Use this token to log in when running `notebook_login()`. You only need to log in when running mexca for the first time.

In [4]:
# Only required the first time to store the token
notebook_login()

08/30/2022 13:24:05 - INFO - huggingsound.speech_recognition.model - Loading model...


Now that we have the example video, we can create a mexca pipeline object using the default constructor method. This method creates a complete, standard pipeline including all modalities (video, audio, text) with default settings. For the audio transcription, we need to specify the language the pipeline will be transcribing (here English).

In [None]:
pipeline = Pipeline().from_default(language='english')

Next, we can apply the mexca pipeline object to the example video file. It can take a long time to process video files. Thus, we will only process 10 seconds of the video by setting the `process_subclip` argument (seconds 19 to 29).

In [5]:
output = pipeline.apply(filename, process_subclip=(19, 29))

Analyzing video ...


100%|██████████| 250/250 [01:20<00:00,  3.12it/s]


Video done
Analyzing audio ...


100%|██████████| 1/1 [00:00<?, ?it/s]


Audio done
Analyzing text ...


100%|██████████| 1/1 [00:04<00:00,  4.29s/it]

Text done
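
The video progress bar above counts 250 iterations, which is consistent with a 10-second subclip sampled at 25 frames per second. The frame rate is not stated anywhere in this demo; it is inferred here from the 0.04 s step in the `time` column of the output table further down, so treat it as an assumption. A minimal sketch of the arithmetic, where the helper `subclip_frames` is hypothetical and not part of mexca:

```python
def subclip_frames(subclip, fps=25):
    """Return the number of frames in a (start, end) subclip given in seconds.

    fps=25 is an assumption inferred from the 0.04 s time step in the
    output table; it is not a documented mexca default.
    """
    start, end = subclip
    return round((end - start) * fps)


# Same subclip as passed to `process_subclip` above
n_frames = subclip_frames((19, 29))
print(n_frames)
```

Under that assumption, doubling the subclip length should roughly double the number of frames to process, which is a quick way to estimate run time before applying the pipeline to a longer segment.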





The pipeline returns a `Multimodal` object that contains the extracted emotion expression features in the `features` attribute. We can convert the features into a `pandas.DataFrame` for further inspection and processing.

In [6]:
output_df = pd.DataFrame(output.features)
output_df

| | frame | time | face_box | face_prob | face_landmarks | face_aus | face_id | pitchF0 | segment_id | segment_start | segment_end | track | speaker_id | text_token_id | text_token | text_token_start | text_token_end | match_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.00 | [152.7034, 54.79718, 188.72562, 101.63676] | 0.995797 | [157.67497766017914, 77.6062490940094] | [0.19813386, 0.9820602, 0.01565872, 0.7402648,... | 33.0 | | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 0.0 |
| 1 | 1 | 0.04 | [150.80806, 54.61269, 187.75795, 98.58902] | 0.951096 | [158.0879489183426, 77.3046144247055] | [0.19556102, 0.9880665, 0.0186732, 0.6875217, ... | 33.0 | 235.303811 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 0.0 |
| 2 | 2 | 0.08 | [153.00014, 53.994972, 188.90388, 100.47665] | 0.948597 | [157.4544359445572, 76.8644654750824] | [0.17714082, 0.9813759, 0.019113649, 0.6613569... | 33.0 | 228.937996 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 0.0 |
| 3 | 3 | 0.12 | [152.07495, 55.0316, 187.75964, 100.87048] | 0.970770 | [158.34282779693604, 76.28121328353882] | [0.18197653, 0.9805567, 0.020266142, 0.6350254... | 33.0 | | 0.0 | 0.0 | 0.0 | | | 0.0 | called | 0.1 | 0.42 | 0.0 |
| 4 | 4 | 0.16 | | | | | | 241.922074 | 0.0 | 0.0 | 0.0 | | | 0.0 | called | 0.1 | 0.42 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | 245 | 9.80 | [295.1616, 68.45729, 384.9053, 185.97256] | 0.986520 | [301.49172043800354, 117.4190034866333] | [0.14955567, 0.937534, 0.0047857133, 0.8661386... | 11.0 | 176.077137 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 0.0 |
| 248 | 246 | 9.84 | [292.8204, 69.97427, 383.19717, 185.33023] | 0.985304 | [299.9294970780611, 118.42326319217682] | [0.14492835, 0.9356139, 0.004929804, 0.8497235... | 26.0 | 174.976738 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 1.0 |
| 249 | 247 | 9.88 | [291.48737, 69.62635, 382.71396, 185.08218] | 0.989932 | [299.5945331156254, 117.64745843410492] | [0.14012621, 0.9424231, 0.006228745, 0.8394418... | 26.0 | 182.762635 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 1.0 |
| 250 | 248 | 9.92 | [289.82156, 67.945526, 380.91974, 182.10425] | 0.992164 | [296.6519346833229, 115.27929627895355] | [0.16058972, 0.94528455, 0.006031608, 0.788579... | 26.0 | 198.402287 | 0.0 | 0.0 | 0.0 | | | 0.0 | | 0.0 | 0.00 | 1.0 |


The column names of the data frame tell us about the features that our pipeline extracted. We can see multiple columns with the `face_` prefix that contain facial expression features and information about the detected faces. Currently, mexca only extracts the voice pitch `pitchF0` from the audio signal. Columns with the `segment_` prefix contain information about the speech segments (note that this is unreliable for video segments this short). The prefix `text_` indicates columns with information about the transcribed spoken text. The last column `match_id` matches the ids of the detected faces to the detected speakers by overlapping time in the video. For further information about the output and features, see the [documentation](https://mexca.readthedocs.io/en/latest/index.html).
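
As a rough illustration of working with these columns, the sketch below rebuilds the first three rows of the output table by hand (only `frame`, `time`, `face_id`, `pitchF0`, and a shortened `face_aus` column) and summarizes them with plain pandas. It assumes that each `face_aus` entry is a list-like of numbers; the lists are cut to the four values visible in the printed table. The names `demo_df` and `au_df` and the `AU_` column labels are made up for this example: they are positional labels, not actual action unit numbers, and mexca does not provide them.

```python
import pandas as pd

# First three rows of the demo output, copied by hand; face_aus is
# shortened to the four values visible in the printed table.
demo_df = pd.DataFrame({
    'frame': [0, 1, 2],
    'time': [0.00, 0.04, 0.08],
    'face_id': [33.0, 33.0, 33.0],
    'pitchF0': [float('nan'), 235.303811, 228.937996],
    'face_aus': [
        [0.19813386, 0.9820602, 0.01565872, 0.7402648],
        [0.19556102, 0.9880665, 0.0186732, 0.6875217],
        [0.17714082, 0.9813759, 0.019113649, 0.6613569],
    ],
})

# Mean voice pitch per detected face; missing values are skipped by default
mean_pitch = demo_df.groupby('face_id')['pitchF0'].mean()

# Spread the list-valued column into one numeric column per element
au_df = pd.DataFrame(demo_df['face_aus'].tolist(), index=demo_df.index)
au_df.columns = [f'AU_{i}' for i in range(au_df.shape[1])]

print(mean_pitch)
print(au_df.round(3))
```

The same two steps should carry over to the full `output_df`, provided its `face_aus` column really holds list-like values. Rows without a detected face (such as frame 4 in the table) would need to be dropped or filled before the column is expanded.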

# Coming Soon

- Feature visualization tools
- Tutorial on advanced usage of mexca
- Tutorial on extending mexca with custom features and components