Motivation

We generated this dataset to train a machine learning model that automatically generates psychiatric case notes from doctor-patient conversations. Since we did not have access to real doctor-patient conversations, we used transcripts from two different sources to generate audio recordings of enacted conversations between a doctor and a patient. We employed eight students who worked in pairs to generate these recordings. Six of the transcripts that we used to produce these recordings were hand-written by Cheryl Bristow, and the rest were adapted from Alexander Street transcripts, which were generated from real doctor-patient conversations. Our study requires the doctor and the patient(s) to be recorded in separate channels, which is the primary reason we generated our own audio recordings of the conversations.

We used the Google Cloud Speech-To-Text API to transcribe the enacted recordings. These newly generated transcripts are produced entirely by AI-powered automatic speech recognition, whereas the source transcripts are either hand-written or fine-tuned by human transcribers (the transcripts from Alexander Street).

We gave the generated transcripts back to the students and asked them to write case notes. The students worked independently using software that we had developed earlier for this purpose. The students had prior experience writing case notes, and we let them write the case notes as they normally would, without any training or instructions from us.

Authors

Kazi, Nazmul
Kuntz, Matt
Kanewala, Upulee
Kahanda, Indika

Contributors

Bristow, Cheryl
Arzubi, Eric

Description of the dataset

Case note categories:

| Index | Category | Abbr. |
|---|---|---|
| 0 | Client Details | CD |
| 1 | Chief Complaint | CC |
| 2 | History of Present Illness | HPI |
| 3 | Past Psychiatric History | PPH |
| 4 | History of Substance Use | HSU |
| 5 | Social History | SH |
| 6 | Family History | FH |
| 7 | Review of Systems | RS |

Directories:

| Directory | Description |
|---|---|
| `transcripts/source` | Source transcripts that were used to generate the audio recordings. |
| `recordings` | Audio recordings of the enacted doctor-patient conversations. |
| `transcripts/transcribed` | Transcripts generated from the audio recordings using the Google Cloud Speech-To-Text API. |
| `casenotes` | Case notes written by the students, i.e. the annotators. |

Transcript file structure:

[
	{
		"speaker"  : 1,
		"dialogue" : ["sentence 1", "sentence 2", ...]
	},
	...
]

Definitions

| Term | Definition |
|---|---|
| `speaker` | Speaker of the current dialogue turn. |
| `dialogue` | Sentence(s) spoken by the speaker in the current dialogue turn. |
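As a minimal sketch of reading this structure, the following Python groups the sentences of a transcript by speaker. The inline sample and the helper name `sentences_by_speaker` are illustrative assumptions, not part of the dataset; in practice the list would presumably come from `json.load` on one of the transcript files, assuming they are stored as JSON in the layout shown above.

```python
import json
from collections import defaultdict

# Illustrative sample in the documented transcript structure (not real data).
SAMPLE_TRANSCRIPT = json.loads("""
[
    {"speaker": 1, "dialogue": ["How are you feeling today?"]},
    {"speaker": 2, "dialogue": ["Not great.", "I have trouble sleeping."]},
    {"speaker": 1, "dialogue": ["How long has that been going on?"]}
]
""")


def sentences_by_speaker(transcript):
    """Map each speaker id to the sentences they spoke, in transcript order."""
    grouped = defaultdict(list)
    for turn in transcript:
        # Each turn holds one speaker id and a list of one or more sentences.
        grouped[turn["speaker"]].extend(turn["dialogue"])
    return dict(grouped)


for speaker, sentences in sentences_by_speaker(SAMPLE_TRANSCRIPT).items():
    print(f"speaker {speaker}: {len(sentences)} sentence(s)")
```

Which speaker number corresponds to the doctor and which to the patient is not stated in this section, so the sketch treats the ids as opaque labels.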

Case note file structure:

[
	{
		"categoryId" : "0",
		"sourceId"   : "0",
		"formalText" : "formal text"
	},
	...
]

Definitions

| Term | Definition |
|---|---|
| `categoryId` | Index of the case note category (e.g. 5 = Social History) in which this sentence is used. This property is zero-indexed. |
| `sourceId` | Index of the source sentence in the transcript. This property is zero-indexed. |
| `formalText` | Modified version of the sentence as it is used in the case note. |
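To illustrate how these fields fit together, here is a minimal sketch that groups case note records by category name. The category list is copied from the "Case note categories" table; the sample records and the helper name `group_by_category` are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Category names by zero-based index, copied from the "Case note categories" table.
CATEGORIES = [
    "Client Details",
    "Chief Complaint",
    "History of Present Illness",
    "Past Psychiatric History",
    "History of Substance Use",
    "Social History",
    "Family History",
    "Review of Systems",
]


def group_by_category(records):
    """Group the formalText of each record under its category name."""
    grouped = defaultdict(list)
    for record in records:
        # The example above stores ids as strings, so convert before indexing.
        name = CATEGORIES[int(record["categoryId"])]
        grouped[name].append(record["formalText"])
    return dict(grouped)


# Illustrative records only (hypothetical values).
sample_records = [
    {"categoryId": "1", "sourceId": "4", "formalText": "Patient reports poor sleep."},
    {"categoryId": "5", "sourceId": "9", "formalText": "Patient lives alone."},
    {"categoryId": "1", "sourceId": "6", "formalText": "Patient reports low mood."},
]

print(group_by_category(sample_records))
```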

NOTE: If a sentence is used in multiple case note categories, a record appears for each use. "sourceId":"n" refers to the sentence whose index is n across the whole transcript, counting sentences consecutively through all dialogue turns; a single dialogue turn can contain multiple sentences. In the following transcript, "sourceId":"3" refers to sentence_d:

[
	{
		"speaker"  : 1,
		"dialogue" : ["sentence_a", "sentence_b"]
	},
	{
		"speaker"  : 2,
		"dialogue" : ["sentence_c", "sentence_d",  "sentence_e"]
	}
]
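The indexing rule in the note can be sketched as follows: flatten every dialogue turn into one running list of sentences, then index into that list. The function names `flatten_sentences` and `resolve_source` are hypothetical helpers, not part of the dataset's tooling.

```python
def flatten_sentences(transcript):
    """Return (speaker, sentence) pairs in transcript-wide sentence order."""
    return [
        (turn["speaker"], sentence)
        for turn in transcript
        for sentence in turn["dialogue"]
    ]


def resolve_source(transcript, source_id):
    """Look up the sentence that a zero-indexed sourceId points to."""
    # sourceId is a string in the case note files, so convert it first.
    return flatten_sentences(transcript)[int(source_id)]


# The example transcript from the note above.
transcript = [
    {"speaker": 1, "dialogue": ["sentence_a", "sentence_b"]},
    {"speaker": 2, "dialogue": ["sentence_c", "sentence_d", "sentence_e"]},
]

print(resolve_source(transcript, "3"))  # (2, 'sentence_d')
```

Returning the speaker alongside the sentence is a design choice of this sketch; it makes it easy to see which side of the conversation a case note sentence came from.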

dataset.pickle

This is a pickle file (protocol version 4) containing all the transcribed transcripts and the case notes, for quick and easy access to the data from Python.
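A minimal sketch of loading the file with the standard library is shown below. The helper name `load_dataset` and the default path are assumptions, and the internal layout of the unpickled object is not documented in this section, so inspect the result before relying on any particular keys. Pickle protocol 4 requires Python 3.4 or later, and pickle files should only be loaded from sources you trust.

```python
import pickle


def load_dataset(path="dataset.pickle"):
    """Unpickle the dataset file and return whatever object it contains."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Example usage (assumes dataset.pickle is in the current directory):
#     dataset = load_dataset()
#     print(type(dataset))  # inspect the structure before using it
```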

Alternative repository for audio recordings

The audio recordings are also available on the Oxiago Int. website.

Funding

This project is funded by the CATalyst Gap fund, Fall 2019.

Transcripts adapted from Alexander Street

D0420-S2-T01 D0420-S3-T02 D0420-S3-T03 D0420-S4-T01 D0420-S4-T02 D0421-S1-T01 D0421-S1-T02 D0421-S1-T03 D0421-S1-T04 D0421-S1-T05 D0421-S2-T01 D0421-S2-T02 D0421-S3-T01 D0421-S3-T02 D0421-S3-T03 D0421-S3-T04 D0421-S3-T05 D0422-S1-T01 D0422-S1-T02 D0422-S1-T03 D0422-S1-T04 D0422-S2-T01 D0422-S2-T02 D0422-S3-T01 D0422-S3-T02 D0422-S3-T03 D0422-S3-T04 D0422-S3-T05 D0422-S3-T06 D0422-S4-T01 D0422-S4-T02 D0422-S4-T03 D0422-S4-T04 D0422-S4-T05 D0423-S1-T01 D0423-S1-T02 D0423-S1-T03 D0423-S2-T01 D0423-S2-T02 D0423-S2-T03 D0424-S1-T01 D0424-S1-T02 D0424-S1-T03 D0424-S2-T01 D0424-S2-T02 D0424-S2-T03 D0424-S2-T04 D0424-S2-T05 D0424-S2-T06 D0424-S3-T01 D0424-S3-T02 D0424-S3-T03 D0424-S3-T04 D0425-S1-T01 D0425-S1-T02 D0425-S1-T03 D0425-S2-T01 D0425-S2-T02 D0425-S2-T03 D0425-S2-T04 D0425-S3-T01 D0425-S3-T02 D0425-S3-T03 D0425-S3-T04 D0425-S3-T05

Disclaimer

This dataset is provided "As Is" without warranty of any kind. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability. The source transcripts may not have been enacted word for word as they appear in the transcript. Similarly, we used automatic speech recognition to transcribe the recordings, so the transcribed transcripts may not exactly match the words spoken in the audio recordings.

This work is licensed under a Creative Commons Attribution 4.0 International License.


nazmulkazi/dataset_automated_medical_transcription
