Analyzing human reaction time for talker change detection

The ability to detect a change in the input is an essential aspect of perception. In speech communication, we use this ability to identify "talker changes" when listening to conversational speech (such as audio podcasts). In this paper, we design a novel experimental paradigm to improve our understanding of how fast listeners detect a change in talker, and of the acoustic features tracked to identify a voice. A listening experiment is designed in which listeners indicate the moment of perceived talker change in multi-talker speech utterances. We examine talker change detection (TCD) performance by probing the human reaction time (RT). A random forest regression is used to model the relationship between RTs and acoustic features. The findings suggest that: (i) RT is less than a second; (ii) RT can be predicted from the difference in acoustic features of the segments before and after the change; and (iii) there exists a significant dependence of RT on MFCC-D1 (delta MFCC) features between the segments of speech before and after the change instant. Further, a machine system designed for the same TCD task using speaker diarization principles performed poorly relative to the humans.
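The regression setup described above can be sketched as follows. This is an illustrative example, not the authors' code: the data is synthetic, and the feature dimensions and coefficients are hypothetical placeholders. It shows the general shape of the analysis, where the per-trial predictor is the difference in (delta-MFCC) features between the segments before and after the change instant, and the target is the trial's RT.

```python
# Hedged sketch: random forest regression of reaction time (RT) on
# before/after-change acoustic feature differences. Synthetic data only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

n_trials = 200  # number of talker-change trials (synthetic)
n_feats = 13    # e.g. 13 delta-MFCC coefficients per segment (assumed)

# Mean delta-MFCC features of the segments before and after the change
feats_before = rng.normal(size=(n_trials, n_feats))
feats_after = rng.normal(size=(n_trials, n_feats))

# Predictor: absolute feature difference across the change instant
X = np.abs(feats_after - feats_before)

# Synthetic RTs (seconds): a larger acoustic change yields faster
# detection, plus noise, mimicking the reported sub-second RTs
y = 1.0 - 0.05 * X.mean(axis=1) + rng.normal(scale=0.05, size=n_trials)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
pred = model.predict(X)  # one predicted RT per trial
```

A benefit of the random forest here is that `model.feature_importances_` indicates which feature differences (e.g. which delta-MFCC coefficients) carry the most predictive weight for RT.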

This repository contains the data and code used in the study.

Publication link:

To be presented at ICASSP 2019 in Brighton, UK.

See you there!

Prior work:

The Journal of the Acoustical Society of America 145, 131 (2019)


Neeraj Kumar Sharma, Shobhana Ganesh, Sriram Ganapathy, Lori L. Holt

Contributors are associated with Carnegie Mellon University, Pittsburgh, and the Indian Institute of Science, Bangalore.

The manuscript is shared here for personal use only. Any other use requires prior permission of the authors.







