
nazmulkazi/dataset_automated_medical_transcription


DOI

Motivation

We generated this dataset to train a machine learning model that automatically generates psychiatric case notes from doctor-patient conversations. Since we didn't have access to real doctor-patient conversations, we used transcripts from two different sources to generate audio recordings of enacted conversations between a doctor and a patient. We employed eight students, who worked in pairs to produce these recordings. Six of the transcripts were hand-written by Cheryl Bristow; the rest were adapted from Alexander Street transcripts, which were generated from real doctor-patient conversations. Our study requires the doctor and the patient(s) to be recorded on separate channels, which is the primary reason we generated our own audio recordings of the conversations.

We used the Google Cloud Speech-To-Text API to transcribe the enacted recordings. These new transcripts are generated entirely by AI-powered automatic speech recognition, whereas the source transcripts are either hand-written or fine-tuned by human transcribers (the transcripts from Alexander Street).

We gave the generated transcripts back to the students and asked them to write case notes. The students worked independently using software that we had developed earlier for this purpose. The students had prior experience writing case notes, and we let them write the notes as they normally practiced, without any training or instructions from us.

Authors

  • Kazi, Nazmul
  • Kuntz, Matt
  • Kanewala, Upulee
  • Kahanda, Indika

Contributors

  • Bristow, Cheryl
  • Arzubi, Eric

Description of the dataset

Case note categories:

| Index | Category | Abbr. |
|---|---|---|
| 0 | Client Details | CD |
| 1 | Chief Complaint | CC |
| 2 | History of Present Illness | HPI |
| 3 | Past Psychiatric History | PPH |
| 4 | History of Substance Use | HSU |
| 5 | Social History | SH |
| 6 | Family History | FH |
| 7 | Review of Systems | RS |

Directories:

| Directory | Description |
|---|---|
| transcripts/source | Source transcripts that were used to generate the audio recordings. |
| recordings | Audio recordings of the enacted doctor-patient conversations. |
| transcripts/transcribed | Transcripts generated from the audio recordings using the Google Cloud Speech-To-Text API. |
| casenotes | Case notes written by the students, i.e. the annotators. |

Transcript file structure:

[
	{
		"speaker"  : 1,
		"dialogue" : ["sentence 1", "sentence 2", ...]
	},
	...
]

Definitions

| Term | Definition |
|---|---|
| speaker | Speaker of the current dialogue turn. |
| dialogue | Sentence(s) spoken by the speaker in the current dialogue turn. |
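The structure above can be walked with a few lines of Python. This is a minimal sketch, assuming the transcript files are stored as JSON; the inlined sample and the helper name `iter_sentences` are hypothetical and only illustrate the documented fields.

```python
import json

# Hypothetical sample that follows the documented transcript structure.
sample = """
[
    {"speaker": 1, "dialogue": ["sentence 1", "sentence 2"]},
    {"speaker": 2, "dialogue": ["sentence 3"]}
]
"""


def iter_sentences(transcript):
    """Yield (speaker, sentence) pairs in transcript order."""
    for turn in transcript:
        for sentence in turn["dialogue"]:
            yield turn["speaker"], sentence


transcript = json.loads(sample)
pairs = list(iter_sentences(transcript))
for speaker, sentence in pairs:
    print(f"speaker {speaker}: {sentence}")
```

To read a real file instead of the inlined sample, `json.load` on an opened file object would replace `json.loads(sample)`.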

Case note file structure:

[
	{
		"categoryId" : "0",
		"sourceId"   : "0",
		"formalText" : "formal text"
	},
	...
]

Definitions

| Term | Definition |
|---|---|
| categoryId | Index of the case note category in which this sentence is used, e.g. 5 = Social History. This property is zero-indexed. |
| sourceId | Index of the source sentence in the transcript. This property is zero-indexed. |
| formalText | Modified version of the sentence as it is used in the case note. |
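As an illustration of how `categoryId` maps to the case note categories, the sketch below groups case note records by category name. It assumes the case note files are JSON and that the IDs are stored as strings, as in the structure shown above; the sample records and the helper name `group_by_category` are hypothetical.

```python
import json
from collections import defaultdict

# Category names in index order, copied from the category table in this README.
CATEGORIES = [
    "Client Details",
    "Chief Complaint",
    "History of Present Illness",
    "Past Psychiatric History",
    "History of Substance Use",
    "Social History",
    "Family History",
    "Review of Systems",
]

# Hypothetical records that follow the documented case note structure.
sample = """
[
    {"categoryId": "5", "sourceId": "3", "formalText": "formal text A"},
    {"categoryId": "1", "sourceId": "0", "formalText": "formal text B"},
    {"categoryId": "5", "sourceId": "4", "formalText": "formal text C"}
]
"""


def group_by_category(records):
    """Map each category name to the formal texts filed under it."""
    grouped = defaultdict(list)
    for record in records:
        # categoryId is a zero-indexed string, so convert before indexing.
        name = CATEGORIES[int(record["categoryId"])]
        grouped[name].append(record["formalText"])
    return dict(grouped)


grouped = group_by_category(json.loads(sample))
print(grouped)
```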

NOTE: If a sentence is used in multiple case note categories, a record appears for each use. "sourceId":"n" refers to the sentence whose index is n in the whole transcript, counting sentences consecutively across dialogue turns, since a single dialogue turn can contain multiple sentences. In the following transcript, "sourceId":"3" refers to sentence_d:

[
	{
		"speaker"  : 1,
		"dialogue" : ["sentence_a", "sentence_b"]
	},
	{
		"speaker"  : 2,
		"dialogue" : ["sentence_c", "sentence_d",  "sentence_e"]
	}
]
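The indexing rule in the note can be sketched in Python by flattening all dialogue turns into one sentence list and indexing into it. The helper names `flatten_sentences` and `resolve_source` are hypothetical, and the conversion with `int` assumes `sourceId` is stored as a string, as in the case note structure above.

```python
def flatten_sentences(transcript):
    """Return every sentence of the transcript in order, ignoring turn boundaries."""
    return [sentence for turn in transcript for sentence in turn["dialogue"]]


def resolve_source(transcript, source_id):
    """Look up the sentence that a zero-indexed sourceId refers to."""
    return flatten_sentences(transcript)[int(source_id)]


# The example transcript from the note above.
transcript = [
    {"speaker": 1, "dialogue": ["sentence_a", "sentence_b"]},
    {"speaker": 2, "dialogue": ["sentence_c", "sentence_d", "sentence_e"]},
]

resolved = resolve_source(transcript, "3")
print(resolved)  # sentence_d
```

When resolving many records against the same transcript, it would be cheaper to call `flatten_sentences` once and reuse the resulting list.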

dataset.pickle

This is a pickle file (protocol version 4) containing all the transcribed transcripts and the case notes, for quick and easy access to the data from Python.
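A minimal sketch of loading the file follows. The internal layout of the pickled object is not documented here, so the sketch only summarizes the top-level container rather than assuming any keys; the helper names `load_dataset` and `describe` are hypothetical. Pickle protocol 4 requires Python 3.4 or later, and pickle files should only be loaded from sources you trust.

```python
import os
import pickle


def load_dataset(path):
    """Unpickle the dataset file. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return a short summary of the top-level container."""
    if isinstance(obj, dict):
        return f"dict with {len(obj)} keys: {list(obj)[:5]}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} with {len(obj)} items"
    return type(obj).__name__


# Assumes dataset.pickle sits in the current working directory.
if os.path.exists("dataset.pickle"):
    data = load_dataset("dataset.pickle")
    print(describe(data))
```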

Alternative repository for audio recordings

The audio recordings are also available on the Oxiago Int. website.

Funding

This project is funded by the CATalyst Gap fund, Fall 2019.

Transcripts adapted from Alexander Street

D0420-S2-T01 D0420-S3-T02 D0420-S3-T03 D0420-S4-T01 D0420-S4-T02 D0421-S1-T01 D0421-S1-T02 D0421-S1-T03 D0421-S1-T04 D0421-S1-T05 D0421-S2-T01 D0421-S2-T02 D0421-S3-T01 D0421-S3-T02 D0421-S3-T03 D0421-S3-T04 D0421-S3-T05 D0422-S1-T01 D0422-S1-T02 D0422-S1-T03 D0422-S1-T04 D0422-S2-T01 D0422-S2-T02 D0422-S3-T01 D0422-S3-T02 D0422-S3-T03 D0422-S3-T04 D0422-S3-T05 D0422-S3-T06 D0422-S4-T01 D0422-S4-T02 D0422-S4-T03 D0422-S4-T04 D0422-S4-T05 D0423-S1-T01 D0423-S1-T02 D0423-S1-T03 D0423-S2-T01 D0423-S2-T02 D0423-S2-T03 D0424-S1-T01 D0424-S1-T02 D0424-S1-T03 D0424-S2-T01 D0424-S2-T02 D0424-S2-T03 D0424-S2-T04 D0424-S2-T05 D0424-S2-T06 D0424-S3-T01 D0424-S3-T02 D0424-S3-T03 D0424-S3-T04 D0425-S1-T01 D0425-S1-T02 D0425-S1-T03 D0425-S2-T01 D0425-S2-T02 D0425-S2-T03 D0425-S2-T04 D0425-S3-T01 D0425-S3-T02 D0425-S3-T03 D0425-S3-T04 D0425-S3-T05

Disclaimer

This dataset is provided "As Is" without warranty of any kind. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability. The recordings may not follow the source transcripts word for word. Similarly, because we used automatic speech recognition to transcribe the recordings, the transcribed transcripts may not exactly match the words spoken in the audio recordings.


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

About

Dataset for training machine learning model for automatically generating psychiatric case notes from doctor-patient conversations.
