
Fisher Parse Trees and Disfluency Labels

This repo contains the code for annotating the English Fisher Speech Part I and II transcripts. Since the Fisher Corpus is not open-source, we cannot release the annotated transcripts; instead, we provide the recipe by which you can pre-process and annotate the transcripts and obtain their fluent version. The annotations include silver constituency parse trees and silver disfluency labels, assigned by a state-of-the-art joint parsing and disfluency detection model (93.9% parsing accuracy and 92.4% disfluency detection f-score on the Switchboard dev set), as described in Improving Disfluency Detection by Self-Training a Self-Attentive Model (ACL 2020). Since every word receives a disfluency tag, you can obtain the fluent version of the Fisher transcripts simply by removing the words tagged as disfluent.

Contents

  1. Basic Overview
  2. Software Requirements
  3. Using the model to annotate Fisher
  4. Using the model to annotate your own dataset
  5. To know more about the model
  6. Contact
  7. Credits

Basic Overview

[Figure: basic overview of the annotation pipeline]

Software Requirements

  • Python 3.6 or higher.
  • Cython 0.25.2 or any compatible version.
  • PyTorch 0.4.1, 1.0/1.1, or any compatible version.
  • pytorch-pretrained-bert 0.4.0 or any compatible version.
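
If any of these are missing, they can typically be installed with pip. The version pins below simply mirror the list above (torch 1.1.0 is one of the compatible versions named); any compatible versions should work:

$ pip install cython==0.25.2 torch==1.1.0 pytorch-pretrained-bert==0.4.0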

Using the model to annotate Fisher

To download the parser and BERT model:

$ git clone https://github.com/pariajm/english-fisher-annotations
$ cd english-fisher-annotations
$ mkdir model && cd model
$ wget https://github.com/pariajm/joint-disfluency-detector-and-parser/releases/download/naacl2019/swbd_fisher_bert_Edev.0.9078.pt
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
$ tar -xf bert-base-uncased.tar.gz && cd ..

To parse and disfluency-tag the transcripts:

$ python main.py --input-path /path/to/extracted/LDC2004T19/and/LDC2005T19 --output-path /path/to/outputs/for/saving/results --model-path ./model/swbd_fisher_bert_Edev.0.9078.pt

Running the above command, you will end up with two types of output:

  • fe_**_****_parse.txt, which contains the Fisher constituency parse trees
  • fe_**_****_dys.txt, which contains the disfluency-labelled Fisher transcripts, where "_" and "E" indicate that the preceding word is fluent or disfluent, respectively. Remove the words tagged "E" to obtain the fluent version of the transcripts (see the sketch below).
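
As a minimal sketch of that last step, assuming each line of a _dys.txt file alternates word and tag tokens (e.g. "i _ uh E mean E ..."); this alternating format and the file name used below are assumptions based on the description above, not a guaranteed layout:

# strip_disfluencies.py -- minimal sketch; adjust to the actual output format
def fluent_words(line):
    tokens = line.split()
    # pair each word with the tag that follows it
    pairs = zip(tokens[0::2], tokens[1::2])
    return [word for word, tag in pairs if tag != "E"]

with open("fe_03_00001_dys.txt") as f:  # hypothetical file name
    for line in f:
        print(" ".join(fluent_words(line)))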

The available arguments are:

Argument       Description                                Default                                  Required
--input-path   path to extracted LDC2004T19 & LDC2005T19  N/A                                      Yes
--output-path  output path for saving results             N/A                                      Yes
--model-path   path to the saved .pt model                ./model/swbd_fisher_bert_Edev.0.9078.pt  No
--disfluency   whether to disfluency-tag the files        True                                     No
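
For example, to produce parse trees only, you could override --disfluency (assuming the flag accepts a boolean value, which is an inference from its default rather than documented usage):

$ python main.py --input-path /path/to/extracted/LDC2004T19/and/LDC2005T19 --output-path /path/to/outputs --disfluency False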

Using the model to annotate your own dataset

You can also use this repo to obtain silver parse trees and disfluency labels for your own sentences, but you will probably need to modify the pre-processing step to match your data format. A sketch of the kind of normalization involved follows.
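
As an illustration only: the normalization below is modelled on typical speech-transcript cleaning (lowercasing, stripping punctuation), not on this repo's exact pre-processing code, so treat it as a starting point:

import re

def normalize(sentence):
    # hypothetical pre-processing: lowercase, keep only letters,
    # apostrophes and spaces, collapse whitespace -- adapt to what
    # main.py actually expects
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z' ]+", " ", sentence)
    return " ".join(sentence.split())

print(normalize("So, I mean -- it's like..."))  # -> "so i mean it's like"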

To know more about the model

If you want to know more about the model, read our paper cited below and check https://github.com/pariajm/joint-disfluency-detector-and-parser.

Citation

If you use this code, please cite the following paper:

@inproceedings{jamshid-lou-2020-improving,
    title = "Improving Disfluency Detection by Self-Training a Self-Attentive Model",
    author = "Jamshid Lou, Paria and Johnson, Mark",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = "jul",
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.346",
    pages = "3754--3763"
}

Contact

Paria Jamshid Lou paria.jamshid-lou@hdr.mq.edu.au

Credits

The code for the self-attentive parser is based on https://github.com/nikitakit/self-attentive-parser, and the Fisher pre-processing code is based on https://github.com/mozilla/DeepSpeech/blob/master/bin/import_fisher.py.