The mexca package contains five components that can be used to build the MEXCA pipeline.
The face extraction component takes a video file as input and applies four steps (a sketch of these steps follows the note below):
- Detection: Faces displayed in the video frames are detected using a pretrained MTCNN model from facenet-pytorch [7].
- Encoding: Faces are extracted from the frames and encoded into an embedding space using InceptionResnetV1 from facenet-pytorch.
- Identification: IDs are assigned to faces by clustering the embeddings using spectral clustering (with k-means label assignment).
- Extraction: Facial features (landmarks, action units) are extracted from the faces using Py-Feat [3]. Available models are PFLD, MobileFaceNet, and MobileNet for landmark extraction, and svm and xgb for action unit extraction.
Note: The two available AU extraction models give different outputs: svm returns binary action unit activations, whereas xgb returns continuous activations (from a tree ensemble).
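Below is a minimal sketch of the detection, encoding, and identification steps, assuming facenet-pytorch and scikit-learn; the frame file name and cluster count are illustrative, and the actual pipeline clusters embeddings across all frames of the video rather than a single one.

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image
from sklearn.cluster import SpectralClustering

mtcnn = MTCNN(keep_all=True)  # detect all faces in a frame
resnet = InceptionResnetV1(pretrained="vggface2").eval()  # face encoder

# Hypothetical frame extracted from the input video.
frame = Image.open("frame_0001.png")

faces = mtcnn(frame)  # cropped face tensors, or None if no face is found
if faces is not None:
    with torch.no_grad():
        embeddings = resnet(faces)  # one 512-dimensional embedding per face
    # Assign face IDs by clustering the embeddings (the pipeline does this
    # over the embeddings from all frames, not a single one).
    ids = SpectralClustering(n_clusters=2).fit_predict(embeddings.numpy())
```

For the extraction step, a corresponding sketch with the Py-Feat Detector, assuming the xgb action unit model named above:

```python
from feat import Detector

# Landmark and AU models as listed above; other models are Py-Feat defaults.
detector = Detector(landmark_model="mobilefacenet", au_model="xgb")
result = detector.detect_image("frame_0001.png")  # Fex data frame with
# landmark and action unit columns for each detected face
```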
The speaker identification component takes an audio file as input and applies three steps using the speaker diarization pipeline from pyannote.audio [2] (a sketch follows the list below):
- Segmentation: Speech segments are detected using pyannote/segmentation (this step includes voice activity detection).
- Encoding: Speaker embeddings are computed for each speech segment using ECAPA-TDNN from speechbrain [6].
- Identification: IDs are assigned to speech segments based on clustering with a Gaussian hidden Markov model.
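A minimal sketch of running this diarization pipeline directly; the pipeline name and audio file are illustrative, and recent pyannote.audio versions additionally require a Hugging Face access token.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav")

# Each track is a detected speech segment labeled with a speaker ID.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f}s-{segment.end:.2f}s: {speaker}")
```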
The voice feature extraction component takes the audio file as input and extracts voice features using praat-parselmouth [4]. Currently, only the fundamental frequency (F0) can be extracted.
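A minimal sketch of extracting F0 with praat-parselmouth; the file name and time step are illustrative.

```python
import parselmouth

sound = parselmouth.Sound("audio.wav")
pitch = sound.to_pitch(time_step=0.01)  # Praat's autocorrelation method
f0 = pitch.selected_array["frequency"]  # F0 in Hz; 0.0 for unvoiced frames
times = pitch.xs()                      # frame timestamps in seconds
```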
The audio transcription component takes the audio file and the speech segment information as input. It transcribes the speech segments to text using a pretrained Whisper model [5]. The resulting transcriptions are aligned with the speaker segments and split into sentences using a regular expression.
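A minimal sketch of the transcription and sentence-splitting steps, assuming the openai-whisper package; the model size, file name, and splitting pattern are illustrative, and the alignment with speaker segments is omitted.

```python
import re
import whisper

model = whisper.load_model("small")     # pretrained Whisper checkpoint
result = model.transcribe("audio.wav")  # returns a dict with the transcript

# Split the transcript into sentences at sentence-final punctuation.
sentences = re.split(r"(?<=[.!?])\s+", result["text"].strip())
```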
The sentiment extraction component takes the transcribed sentences as input and predicts sentiment scores (positive, negative, neutral) for each sentence using a pretrained multilingual RoBERTa model [1].
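A minimal sketch of the sentiment prediction step with transformers, assuming the cardiffnlp/twitter-xlm-roberta-base-sentiment checkpoint (a multilingual RoBERTa model in the spirit of [1]); the checkpoint name and example sentences are illustrative.

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
    top_k=None,  # return scores for all three sentiment classes
)

# Each entry lists positive, negative, and neutral scores for one sentence.
scores = classifier(["I love this movie.", "I hate waiting."])
```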
[1] Barbieri, F., Camacho-Collados, J., Neves, L., & Espinosa-Anke, L. (2020). TweetEval: Unified benchmark and comparative evaluation for tweet classification. arXiv. https://doi.org/10.48550/arXiv.2010.12421
[2] Bredin, H., & Laurent, A. (2021). End-to-end speaker segmentation for overlap-aware resegmentation. arXiv. https://doi.org/10.48550/arXiv.2104.04045
[3] Cheong, J. H., Xie, T., Byrne, S., & Chang, L. J. (2021). Py-Feat: Python facial expression analysis toolbox. arXiv. https://doi.org/10.48550/arXiv.2104.03509
[4] Jadoul, Y., Thompson, B., & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71, 1-15. https://doi.org/10.1016/j.wocn.2018.07.001
[5] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://cdn.openai.com/papers/whisper.pdf
[6] Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., … Bengio, Y. (2021). SpeechBrain: A general-purpose speech toolkit. arXiv. https://doi.org/10.48550/arXiv.2106.04624
[7] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. arXiv. https://doi.org/10.48550/arXiv.1503.03832