# SemCHILDES construction

As a first step toward constructing SemCHILDES, I use an automatic word sense disambiguation (WSD) tool to annotate the CHILDES corpus with word senses. As future work, I intend to manually annotate part of the corpus and then evaluate the performance of different algorithms, approaches, and combinations of them on the annotated data. The best approach will then be used to annotate the entire corpus.
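
The planned evaluation could be as simple as comparing each system's predicted sense with the manually assigned one. The following is a minimal sketch, assuming gold and predicted annotations are stored as parallel lists of sense identifiers; the identifiers, system names, and function names are made up for illustration and do not come from any WSD tool.

```python
def sense_accuracy(gold, predicted):
    """Fraction of tokens whose predicted sense matches the gold sense."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def best_system(gold, system_outputs):
    """Return the (name, accuracy) pair of the highest-scoring system."""
    scores = {name: sense_accuracy(gold, pred)
              for name, pred in system_outputs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical sense identifiers for four tokens
gold = ["dog%1", "run%2", "ball%1", "play%1"]
outputs = {
    "system_a": ["dog%1", "run%1", "ball%1", "play%1"],
    "system_b": ["dog%1", "run%1", "ball%2", "play%1"],
}
print(best_system(gold, outputs))
```

Plain accuracy treats every token equally; a real evaluation would probably also report coverage and per-part-of-speech scores.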

Tool used: PySupWSDPocket - https://github.com/rodriguesfas/PySupWSDPocket




## PySupWSDPocket

PySupWSDPocket is a Python library for [SupWSD Pocket](https://supwsd.net/supwsd/pocket.jsp). SupWSD is a supervised model for Word Sense Disambiguation.

We install it from GitHub to get the latest version.

https://drive.google.com/file/d/1hEMlbToLL4xN7HJhPtebMbKYeethWmha/view?usp=sharing

In [7]:
!pip install git+https://github.com/rodriguesfas/PySupWSDPocket.git

Collecting git+https://github.com/rodriguesfas/PySupWSDPocket.git
  Cloning https://github.com/rodriguesfas/PySupWSDPocket.git to /tmp/pip-req-build-yi9rlw9t
  Running command git clone -q https://github.com/rodriguesfas/PySupWSDPocket.git /tmp/pip-req-build-yi9rlw9t
Building wheels for collected packages: pysupwsdpocket
  Building wheel for pysupwsdpocket (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-q1ltryik/wheels/60/71/8d/80f8c9ddf9fd2b65d10328afb6d580cfd83e4fbbc690cfb4dc
Successfully built pysupwsdpocket
Installing collected packages: pysupwsdpocket
Successfully installed pysupwsdpocket-0.0.9


PySupWSDPocket requires its ~2 GB model, which can be downloaded from https://supwsd.net/supwsd/downloads.jsp#supwsd_pocket.

In [3]:
!pip install gdown

Collecting gdown
  Downloading https://files.pythonhosted.org/packages/5b/82/4d682a893626cd3436444130970443be0b101c39a92bce783dc920a767c8/gdown-4.4.0.tar.gz
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting beautifulsoup4 (from gdown)
  Downloading https://files.pythonhosted.org/packages/69/bf/f0f194d3379d3f3347478bd267f754fc68c11cbf2fe302a6ab69447b1417/beautifulsoup4-4.10.0-py3-none-any.whl (97kB)
Collecting filelock (from gdown)
  Downloading https://files.pythonhosted.org/packages/cd/f1/ba7dee3de0e9d3b8634d6fbaa5d0d407a7da64620305d147298b683e5c36/filelock-3.6.0-py3-none-any.whl
Collecting soupsieve>1.2 (from beautifulsoup4->gdown)
  Downloading https://files.pythonhosted.org/packages/72/a6/fd01694427f1c3fcadfdc5f1de901b813b9ac756f0806ef470cfed1de281/soupsieve-2.3.1-py3-none-any.whl
Buildi

In [4]:
!mkdir -p pysupwsdpocket_models
!gdown https://drive.google.com/uc?id=1hEMlbToLL4xN7HJhPtebMbKYeethWmha -O="pysupwsdpocket_models/en.zip"

Downloading...
From: https://drive.google.com/uc?id=1hEMlbToLL4xN7HJhPtebMbKYeethWmha
To: /root/capsule/pysupwsdpocket_models/en.zip
100%|███████████████████████████████████████| 1.80G/1.80G [00:13<00:00, 131MB/s]
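
Because the archive is large, it may be worth checking that the download is a complete, readable zip file before loading the model. The following is a minimal sketch using only the standard library; the helper name `is_valid_zip` is hypothetical and is not part of PySupWSDPocket or gdown.

```python
import zipfile
from pathlib import Path


def is_valid_zip(path):
    """Return True if `path` exists, is a zip archive, and passes a CRC check."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None
        return archive.testzip() is None


print(is_valid_zip("pysupwsdpocket_models/en.zip"))
```

Note that `testzip()` reads every member of the archive, so on a file of this size the check takes noticeable time.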


In [9]:
from pysupwsdpocket import PySupWSDPocket
nlp = PySupWSDPocket(lang='en', model='semcor_omsti', model_path="./pysupwsdpocket_models/")    

## CHILDES

The Child Language Data Exchange System (CHILDES) is a corpus established in 1984 by Brian MacWhinney and Catherine Snow to serve as a central repository for first language acquisition data[¹](https://en.wikipedia.org/wiki/CHILDES). It contains corpora in many languages, which can be downloaded in XML or CHA format.
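
To give a sense of the XML format, the sketch below extracts speakers and word tokens from a tiny hand-written fragment. It assumes the TalkBank namespace `http://www.talkbank.org/ns/talkbank` and the `<u who="...">` / `<w>` element layout; the fragment is a simplified illustration, not an excerpt from a real transcript, and real files carry many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for TalkBank XML transcripts
NS = {"tb": "http://www.talkbank.org/ns/talkbank"}

# Hand-written, simplified fragment for illustration only
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="CHI" uID="u0"><w>more</w><w>cookie</w><t type="p"/></u>
  <u who="MOT" uID="u1"><w>you</w><w>want</w><w>more</w><t type="q"/></u>
</CHAT>"""


def utterances(xml_text):
    """Yield (speaker, [words]) for every <u> element in a transcript."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind(".//tb:u", NS):
        # Take only the leading text of each <w>, ignoring nested annotations
        words = [w.text for w in u.iterfind(".//tb:w", NS) if w.text]
        yield u.get("who"), words


for speaker, words in utterances(SAMPLE):
    print(speaker, " ".join(words))
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`.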

SemCHILDES is built from the entire North American English portion of CHILDES (Eng-NA), so this notebook downloads the corpora of that collection listed in the cell below.

In [10]:
!mkdir -p corpora
corpora_files = ["Bates.zip", "Bernstein.zip", "Bliss.zip", "Bloom.zip", "Bohannon.zip", "Braunwald.zip", "Brent.zip", "Brown.zip", "Clark.zip", "Demetras1.zip", "Demetras2.zip", "Evans.zip", "Feldman.zip", "Garvey.zip", "Gathercole.zip", "Gelman.zip", "Gleason.zip", "Gopnik.zip", "HSLLD.zip", "Haggerty.zip", "Hall.zip", "Hicks.zip", "Higginson.zip", "Kuczaj.zip", "MacWhinney.zip", "McCune.zip", "McMillan.zip", "Morisset.zip", "Nelson.zip", "NewEngland.zip", "NewmanRatner.zip", "Peters.zip", "PetersonMcCabe.zip", "Post.zip", "Rollins.zip", "Sachs.zip", "Sawyer.zip", "Snow.zip", "Soderstrom.zip", "Sprott.zip", "Suppes.zip", "Tardif.zip", "Valian.zip", "VanHouten.zip", "VanKleeck.zip", "Warren.zip", "Weist.zip"]
for corpus_file in corpora_files:
    !wget https://childes.talkbank.org/data-xml/Eng-NA/$corpus_file -O corpora/$corpus_file
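
The same downloads can be done without shell escapes, which makes the step reusable outside a notebook. The following is a sketch using `urllib.request` from the standard library; the function names and the skip-if-present behaviour are assumptions for illustration, not part of the original pipeline.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://childes.talkbank.org/data-xml/Eng-NA/"


def corpus_url(corpus_file):
    """Build the download URL for one Eng-NA archive, e.g. 'Bates.zip'."""
    return BASE_URL + corpus_file


def download_corpora(corpus_files, target_dir="corpora"):
    """Download each archive into target_dir, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in corpus_files:
        destination = target / name
        if not destination.exists():
            urllib.request.urlretrieve(corpus_url(name), str(destination))
        paths.append(destination)
    return paths


# Example (performs network requests):
# download_corpora(["Bates.zip", "Brown.zip"])
```

Skipping existing files avoids repeated downloads on re-runs, but a partially downloaded archive would also be skipped, so an interrupted run should be cleaned up by hand.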

In [8]:
for corpus_file in corpora_files:
    !unzip corpora/$corpus_file -d corpora

Archive:  corpora/Bates.zip
  inflating: corpora/Bates/Free28/mandy28.xml  
  inflating: corpora/Bates/Free28/doug28.xml  
  inflating: corpora/Bates/Free28/frank28.xml  
  inflating: corpora/Bates/Free28/ivy28.xml  
  inflating: corpora/Bates/Free28/paula28.xml  
  inflating: corpora/Bates/Free28/ed28.xml  
  inflating: corpora/Bates/Free28/hank28.xml  
  inflating: corpora/Bates/Free28/pete28.xml  
  inflating: corpora/Bates/Free28/will28.xml  
  inflating: corpora/Bates/Free28/sue28.xml  
  inflating: corpora/Bates/Free28/rick28.xml  
  inflating: corpora/Bates/Free28/amy28.xml  
  inflating: corpora/Bates/Free28/betty28.xml  
  inflating: corpora/Bates/Free28/chuck28.xml  
  inflating: corpora/Bates/Free28/steve28.xml  
  inflating: corpora/Bates/Free28/olivia28.xml  
  inflating: corpora/Bates/Free28/keith28.xml  
  inflating: corpora/Bates/Free28/ruth28.xml  
  inflating: corpora/Bates/Free28/wanda28.xml  
  inflating: corpora/Bates/Free28/george28.xml  
  inflating: corpora/Bate

  inflating: corpora/Bloom/Peter/020415.xml  
  inflating: corpora/Bloom/Peter/020303.xml  
  inflating: corpora/Bloom/Peter/020915.xml  
  inflating: corpora/Bloom/Peter/010908.xml  
  inflating: corpora/Bloom/Peter/011011.xml  
  inflating: corpora/Bloom/Peter/020213.xml  
  inflating: corpora/Bloom/Peter/020713.xml  
  inflating: corpora/Bloom/Peter/020503.xml  
  inflating: corpora/Bloom/Peter/011117.xml  
  inflating: corpora/Bloom/Peter/011105.xml  
  inflating: corpora/Bloom/Peter/020118.xml  
  inflating: corpora/Bloom/Peter/021019.xml  
  inflating: corpora/Bloom/Peter/020324.xml  
  inflating: corpora/Bloom/Peter/020616.xml  
  inflating: corpora/Bloom/Peter/030120.xml  
  inflating: corpora/Bloom/Peter/020812.xml  
  inflating: corpora/Bloom/Peter/020522.xml  
Archive:  corpora/Bohannon.zip
  inflating: corpora/Bohannon/Nat/angela.xml  
  inflating: corpora/Bohannon/Nat/doug.xml  
  inflating: corpora/Bohannon/Nat/tom.xml  
  inflating: corpora/Bohannon/Nat/jim.xml  
  infla

  inflating: corpora/Braunwald/020616.xml  
  inflating: corpora/Braunwald/030316a.xml  
  inflating: corpora/Braunwald/020603.xml  
  inflating: corpora/Braunwald/010614b.xml  
  inflating: corpora/Braunwald/010720.xml  
  inflating: corpora/Braunwald/010600b.xml  
  inflating: corpora/Braunwald/010607.xml  
  inflating: corpora/Braunwald/010902.xml  
  inflating: corpora/Braunwald/010614a.xml  
  inflating: corpora/Braunwald/030616.xml  
  inflating: corpora/Braunwald/030000.xml  
  inflating: corpora/Braunwald/010604b.xml  
  inflating: corpora/Braunwald/020424.xml  
  inflating: corpora/Braunwald/030009.xml  
  inflating: corpora/Braunwald/030627a.xml  
  inflating: corpora/Braunwald/020300b.xml  
  inflating: corpora/Braunwald/010511.xml  
  inflating: corpora/Braunwald/041000.xml  
  inflating: corpora/Braunwald/040214.xml  
  inflating: corpora/Braunwald/020710.xml  
  inflating: corpora/Braunwald/030622.xml  
  inflating: corpora/Braunwald/030618b.xml  
  inflating: corpora/Bra

  inflating: corpora/Braunwald/0diary/010707.xml  
  inflating: corpora/Braunwald/0diary/020818.xml  
  inflating: corpora/Braunwald/0diary/010324.xml  
  inflating: corpora/Braunwald/0diary/020830.xml  
  inflating: corpora/Braunwald/0diary/020911.xml  
  inflating: corpora/Braunwald/0diary/010529.xml  
  inflating: corpora/Braunwald/0diary/020115.xml  
  inflating: corpora/Braunwald/0diary/010717.xml  
  inflating: corpora/Braunwald/0diary/020117.xml  
  inflating: corpora/Braunwald/0diary/010406.xml  
  inflating: corpora/Braunwald/0diary/010416.xml  
  inflating: corpora/Braunwald/0diary/030122.xml  
  inflating: corpora/Braunwald/0diary/020209.xml  
  inflating: corpora/Braunwald/0diary/020023.xml  
  inflating: corpora/Braunwald/0diary/020722.xml  
  inflating: corpora/Braunwald/0diary/020914.xml  
  inflating: corpora/Braunwald/0diary/020124.xml  
  inflating: corpora/Braunwald/0diary/010916.xml  
  inflating: corpora/Braunwald/0diary/010425.xml  
  inflating:

  inflating: corpora/Braunwald/0diary/020021.xml  
  inflating: corpora/Braunwald/0diary/010805.xml  
  inflating: corpora/Braunwald/0diary/011126.xml  
  inflating: corpora/Braunwald/0diary/010725.xml  
  inflating: corpora/Braunwald/0diary/021126.xml  
  inflating: corpora/Braunwald/0diary/011006.xml  
  inflating: corpora/Braunwald/0diary/010304.xml  
  inflating: corpora/Braunwald/0diary/011016.xml  
  inflating: corpora/Braunwald/0diary/020522.xml  
  inflating: corpora/Braunwald/0diary/030429.xml  
  inflating: corpora/Braunwald/0diary/020716.xml  
  inflating: corpora/Braunwald/0diary/020728.xml  
  inflating: corpora/Braunwald/0diary/020110.xml  
Archive:  corpora/Brent.zip
  inflating: corpora/Brent/q1/010217b.xml  
  inflating: corpora/Brent/q1/010205.xml  
  inflating: corpora/Brent/q1/010120.xml  
  inflating: corpora/Brent/q1/000928.xml  
  inflating: corpora/Brent/q1/010009.xml  
  inflating: corpora/Brent/q1/010113.xml  
  inflating: corpora/Brent/q1/001128.xml  
  infla

  inflating: corpora/Brent/w1/001025.xml  
  inflating: corpora/Brent/w1/001123.xml  
  inflating: corpora/Brent/w1/001011.xml  
  inflating: corpora/Brent/t1/001108.xml  
  inflating: corpora/Brent/t1/010110.xml  
  inflating: corpora/Brent/t1/010204.xml  
  inflating: corpora/Brent/t1/010119.xml  
  inflating: corpora/Brent/t1/000927.xml  
  inflating: corpora/Brent/t1/010216.xml  
  inflating: corpora/Brent/t1/010005.xml  
  inflating: corpora/Brent/t1/000920.xml  
  inflating: corpora/Brent/t1/000830.xml  
  inflating: corpora/Brent/t1/001025.xml  
  inflating: corpora/Brent/t1/010225.xml  
  inflating: corpora/Brent/t1/001016.xml  
  inflating: corpora/Brent/t1/010019.xml  
  inflating: corpora/Brent/t1/001126.xml  
  inflating: corpora/Brent/s3/010009.xml  
  inflating: corpora/Brent/s3/001128.xml  
  inflating: corpora/Brent/s3/010109.xml  
  inflating: corpora/Brent/s3/001112.xml  
  inflating: corpora/Brent/s3/001028.xml  
  inflating: corpora/Brent/s3/000913.xml  
  inflating

  inflating: corpora/Brown/Eve/011000b.xml  
  inflating: corpora/Brown/Eve/020100b.xml  
  inflating: corpora/Brown/Eve/010900b.xml  
  inflating: corpora/Brown/Eve/010600a.xml  
  inflating: corpora/Brown/Eve/011000a.xml  
  inflating: corpora/Brown/Eve/020300a.xml  
  inflating: corpora/Brown/Eve/010900c.xml  
  inflating: corpora/Brown/Eve/020000b.xml  
  inflating: corpora/Brown/Eve/020100a.xml  
  inflating: corpora/Brown/Eve/010900a.xml  
  inflating: corpora/Brown/Eve/011100b.xml  
  inflating: corpora/Brown/Eve/010600b.xml  
  inflating: corpora/Brown/Eve/011100a.xml  
  inflating: corpora/Brown/Eve/020000a.xml  
  inflating: corpora/Brown/Eve/020300b.xml  
  inflating: corpora/Brown/Eve/010700b.xml  
  inflating: corpora/Brown/Adam/040624.xml  
  inflating: corpora/Brown/Adam/021113.xml  
  inflating: corpora/Brown/Adam/021002.xml  
  inflating: corpora/Brown/Adam/040729.xml  
  inflating: corpora/Brown/Adam/020512.xml  
  inflating: corpora/Brown/Adam/020918.xml  
  inflatin

Archive:  corpora/Evans.zip
  inflating: corpora/Evans/dyad11.xml  
  inflating: corpora/Evans/dyad07.xml  
  inflating: corpora/Evans/dyad06.xml  
  inflating: corpora/Evans/dyad05.xml  
  inflating: corpora/Evans/dyad03.xml  
  inflating: corpora/Evans/dyad12.xml  
  inflating: corpora/Evans/dyad08.xml  
  inflating: corpora/Evans/dyad19.xml  
  inflating: corpora/Evans/dyad04.xml  
  inflating: corpora/Evans/dyad10.xml  
  inflating: corpora/Evans/dyad02.xml  
  inflating: corpora/Evans/dyad22.xml  
  inflating: corpora/Evans/dyad16.xml  
  inflating: corpora/Evans/dyad15.xml  
  inflating: corpora/Evans/dyad13.xml  
  inflating: corpora/Evans/dyad18.xml  
  inflating: corpora/Evans/dyad21.xml  
  inflating: corpora/Evans/dyad20.xml  
  inflating: corpora/Evans/dyad01.xml  
  inflating: corpora/Evans/dyad09.xml  
  inflating: corpora/Evans/dyad17.xml  
  inflating: corpora/Evans/dyad14.xml  
Archive:  corpora/Feldman.zip
  inflating: corpora/Feldman/020916.xml  
  inflating: corpora

  inflating: corpora/Gelman/2014-IndDiff/71P-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/42C-P1.xml  
  inflating: corpora/Gelman/2014-IndDiff/62C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/17C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/03C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/46C-P1.xml  
  inflating: corpora/Gelman/2014-IndDiff/01P-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/25C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/01P-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/39C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/29C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/25C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/49C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/55C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/56P-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/25P-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/62C-P1.xml  
  inflating: corpora/Gelman/2014-IndDiff/61C-R1.xml  
  inflatin

  inflating: corpora/Gelman/2014-IndDiff/18C-P2.xml  
  inflating: corpora/Gelman/2014-IndDiff/08C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/04P-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/62P-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/72C-P2.xml  
  inflating: corpora/Gelman/2014-IndDiff/06C-P1.xml  
  inflating: corpora/Gelman/2014-IndDiff/26C-P1.xml  
  inflating: corpora/Gelman/2014-IndDiff/20P-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/42C-P2.xml  
  inflating: corpora/Gelman/2014-IndDiff/65C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/54C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/45C-P1.xml  
  inflating: corpora/Gelman/2014-IndDiff/31C-R2.xml  
  inflating: corpora/Gelman/2014-IndDiff/61P-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/65C-P2.xml  
  inflating: corpora/Gelman/2014-IndDiff/29P-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/15C-R1.xml  
  inflating: corpora/Gelman/2014-IndDiff/38P-R2.xml  
  inflating: corpora/Gelman/

  inflating: corpora/Gelman/1998-Books/picturebook35/11.xml  
  inflating: corpora/Gelman/1998-Books/picturebook35/18.xml  
  inflating: corpora/Gelman/1998-Books/picturebook35/36.xml  
  inflating: corpora/Gelman/1998-Books/picturebook35/48.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/41.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/45.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/46.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/47.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/43.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/37.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/42.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/44.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/50.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/32.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/26.xml  
  inflating: corpora/Gelman/1998-Books/picturebook20/49.xml  
  inflat

Archive:  corpora/Gopnik.zip
  inflating: corpora/Gopnik/prompted/p01522.xml  
  inflating: corpora/Gopnik/prompted/p01831.xml  
  inflating: corpora/Gopnik/prompted/p04822.xml  
  inflating: corpora/Gopnik/prompted/p04622.xml  
  inflating: corpora/Gopnik/prompted/p05122.xml  
  inflating: corpora/Gopnik/prompted/p04621.xml  
  inflating: corpora/Gopnik/prompted/p05422.xml  
  inflating: corpora/Gopnik/prompted/p01311.xml  
  inflating: corpora/Gopnik/prompted/p01542.xml  
  inflating: corpora/Gopnik/prompted/p00911.xml  
  inflating: corpora/Gopnik/prompted/p01011.xml  
  inflating: corpora/Gopnik/prompted/p05911.xml  
  inflating: corpora/Gopnik/prompted/p06022.xml  
  inflating: corpora/Gopnik/prompted/p01432.xml  
  inflating: corpora/Gopnik/prompted/p01111.xml  
  inflating: corpora/Gopnik/prompted/p05032.xml  
  inflating: corpora/Gopnik/prompted/p00211.xml  
  inflating: corpora/Gopnik/prompted/p05143.xml  
  inflating: corpora/Gopnik/prompted/p04652.xml  
  inflating: corpora/

  inflating: corpora/HSLLD/HV1/TP/pautp1.xml  
  inflating: corpora/HSLLD/HV1/TP/justtp1.xml  
  inflating: corpora/HSLLD/HV1/TP/megtp1.xml  
  inflating: corpora/HSLLD/HV1/TP/anatp1.xml  
  inflating: corpora/HSLLD/HV1/TP/cantp1.xml  
  inflating: corpora/HSLLD/HV1/TP/brntp1.xml  
  inflating: corpora/HSLLD/HV1/TP/asttp1.xml  
  inflating: corpora/HSLLD/HV1/TP/geotp1.xml  
  inflating: corpora/HSLLD/HV1/TP/mortp1.xml  
  inflating: corpora/HSLLD/HV1/TP/gretp1.xml  
  inflating: corpora/HSLLD/HV1/TP/aimtp1.xml  
  inflating: corpora/HSLLD/HV1/TP/jeatp1.xml  
  inflating: corpora/HSLLD/HV1/TP/stntp1.xml  
  inflating: corpora/HSLLD/HV1/TP/kurtp1.xml  
  inflating: corpora/HSLLD/HV1/TP/bobtp1.xml  
  inflating: corpora/HSLLD/HV1/TP/timtp1.xml  
  inflating: corpora/HSLLD/HV1/TP/rastp1.xml  
  inflating: corpora/HSLLD/HV1/TP/chatp1.xml  
  inflating: corpora/HSLLD/HV1/TP/castp1.xml  
  inflating: corpora/HSLLD/HV1/TP/diatp1.xml  
  inflating: corpora/HSLLD/HV1/TP/maytp1.xml  
  inflating:

  inflating: corpora/HSLLD/HV1/MT/astmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/jenmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/alimt1.xml  
  inflating: corpora/HSLLD/HV1/MT/chamt1.xml  
  inflating: corpora/HSLLD/HV1/MT/aprmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/jacmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/sarmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/tamtp1.xml  
  inflating: corpora/HSLLD/HV1/MT/jusmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/seamt1.xml  
  inflating: corpora/HSLLD/HV1/MT/catmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/stnmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/aimmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/vicmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/todmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/mormt1.xml  
  inflating: corpora/HSLLD/HV1/MT/joemt1.xml  
  inflating: corpora/HSLLD/HV1/MT/jesmt1.xml  
  inflating: corpora/HSLLD/HV1/MT/trumt1.xml  
  inflating: corpora/HSLLD/HV1/MT/brimt1.xml  
  inflating: corpora/HSLLD/HV1/MT/raumt1.xml  
  inflating: 

  inflating: corpora/HSLLD/HV5/MT/davmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/annmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/bramt5.xml  
  inflating: corpora/HSLLD/HV5/MT/carmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/jermt5.xml  
  inflating: corpora/HSLLD/HV5/MT/gilmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/mrkmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/guymt5.xml  
  inflating: corpora/HSLLD/HV5/MT/raumt5.xml  
  inflating: corpora/HSLLD/HV5/MT/kurmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/sarmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/megmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/trimt5.xml  
  inflating: corpora/HSLLD/HV5/MT/anamt5.xml  
  inflating: corpora/HSLLD/HV5/MT/tommt5.xml  
  inflating: corpora/HSLLD/HV5/MT/aprmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/joymt5.xml  
  inflating: corpora/HSLLD/HV5/MT/vicmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/rilmt5.xml  
  inflating: corpora/HSLLD/HV5/MT/geomt5.xml  
  inflating: corpora/HSLLD/HV5/MT/ctrmt5.xml  
  inflating: 

  inflating: corpora/HSLLD/HV7/MT/melmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/raumt7.xml  
  inflating: corpora/HSLLD/HV7/MT/admmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/diamt7.xml  
  inflating: corpora/HSLLD/HV7/MT/stnmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/trimt7.xml  
  inflating: corpora/HSLLD/HV7/MT/maymt7.xml  
  inflating: corpora/HSLLD/HV7/MT/kurmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/kevmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/petmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/ethmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/rosmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/sarmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/aprmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/conmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/casmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/brimt7.xml  
  inflating: corpora/HSLLD/HV7/MT/zanmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/rasmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/todmt7.xml  
  inflating: corpora/HSLLD/HV7/MT/rilmt7.xml  
  inflating: 

  inflating: corpora/HSLLD/HV3/TP/zantp3.xml  
  inflating: corpora/HSLLD/HV3/TP/tamtp3.xml  
  inflating: corpora/HSLLD/HV3/TP/rautp3.xml  
  inflating: corpora/HSLLD/HV3/TP/megtp3.xml  
  inflating: corpora/HSLLD/HV3/TP/geotp3.xml  
  inflating: corpora/HSLLD/HV3/TP/cattp3.xml  
  inflating: corpora/HSLLD/HV3/TP/dontp3.xml  
  inflating: corpora/HSLLD/HV3/TP/jentp3.xml  
  inflating: corpora/HSLLD/HV3/TP/brntp3.xml  
  inflating: corpora/HSLLD/HV3/TP/asttp3.xml  
  inflating: corpora/HSLLD/HV3/TP/admtp3.xml  
  inflating: corpora/HSLLD/HV3/TP/kevtp3.xml  
  inflating: corpora/HSLLD/HV3/TP/shotp3.xml  
  inflating: corpora/HSLLD/HV3/BR/morbr3.xml  
  inflating: corpora/HSLLD/HV3/BR/joybr3.xml  
  inflating: corpora/HSLLD/HV3/BR/devbr3.xml  
  inflating: corpora/HSLLD/HV3/BR/terbr3.xml  
  inflating: corpora/HSLLD/HV3/BR/geobr3.xml  
  inflating: corpora/HSLLD/HV3/BR/admbr3.xml  
  inflating: corpora/HSLLD/HV3/BR/trebr3.xml  
  inflating: corpora/HSLLD/HV3/BR/karbr2.xml  
  inflating: 

  inflating: corpora/HSLLD/HV2/TP/clatp2.xml  
  inflating: corpora/HSLLD/HV2/TP/martp2.xml  
  inflating: corpora/HSLLD/HV2/TP/mattp2.xml  
  inflating: corpora/HSLLD/HV2/TP/geotp2.xml  
  inflating: corpora/HSLLD/HV2/TP/joetp2.xml  
  inflating: corpora/HSLLD/HV2/TP/fratp2.xml  
  inflating: corpora/HSLLD/HV2/TP/kartp2.xml  
  inflating: corpora/HSLLD/HV2/TP/anntp2.xml  
  inflating: corpora/HSLLD/HV2/TP/tomtp2.xml  
  inflating: corpora/HSLLD/HV2/TP/jamtp2.xml  
  inflating: corpora/HSLLD/HV2/TP/aimtp2.xml  
  inflating: corpora/HSLLD/HV2/TP/meltp2.xml  
  inflating: corpora/HSLLD/HV2/TP/acetp2.xml  
  inflating: corpora/HSLLD/HV2/TP/remtp2.xml  
  inflating: corpora/HSLLD/HV2/TP/bratp2.xml  
  inflating: corpora/HSLLD/HV2/TP/nictp2.xml  
  inflating: corpora/HSLLD/HV2/TP/jestp2.xml  
  inflating: corpora/HSLLD/HV2/TP/brimt2.xml  
  inflating: corpora/HSLLD/HV2/TP/asttp2.xml  
  inflating: corpora/HSLLD/HV2/TP/anatp2.xml  
  inflating: corpora/HSLLD/HV2/TP/giltp2.xml  
  inflating: 

  inflating: corpora/HSLLD/HV2/ER/trier2.xml  
  inflating: corpora/HSLLD/HV2/ER/jeaer2.xml  
  inflating: corpora/HSLLD/HV2/ER/jamer2.xml  
  inflating: corpora/HSLLD/HV2/ER/emier2.xml  
  inflating: corpora/HSLLD/HV2/ER/meler2.xml  
  inflating: corpora/HSLLD/HV2/ER/giler2.xml  
  inflating: corpora/HSLLD/HV2/ER/aceer2.xml  
  inflating: corpora/HSLLD/HV2/ER/anner2.xml  
  inflating: corpora/HSLLD/HV2/ER/diaer2.xml  
  inflating: corpora/HSLLD/HV2/ER/daver2.xml  
  inflating: corpora/HSLLD/HV2/ER/brner2.xml  
  inflating: corpora/HSLLD/HV2/ER/suser2.xml  
  inflating: corpora/HSLLD/HV2/ER/kever2.xml  
  inflating: corpora/HSLLD/HV2/ER/raser2.xml  
  inflating: corpora/HSLLD/HV2/ER/geoer2.xml  
  inflating: corpora/HSLLD/HV2/ER/peter2.xml  
  inflating: corpora/HSLLD/HV2/ER/kurer2.xml  
  inflating: corpora/HSLLD/HV2/ER/toder2.xml  
  inflating: corpora/HSLLD/HV2/ER/tamer2.xml  
  inflating: corpora/HSLLD/HV2/ER/alier2.xml  
  inflating: corpora/HSLLD/HV2/ER/remer2.xml  
  inflating: 

  inflating: corpora/Hicks/1st/event/evt004.xml  
  inflating: corpora/Hicks/Kinder/report/rep046.xml  
  inflating: corpora/Hicks/Kinder/report/rep032.xml  
  inflating: corpora/Hicks/Kinder/report/rep051.xml  
  inflating: corpora/Hicks/Kinder/report/rep045.xml  
  inflating: corpora/Hicks/Kinder/report/rep056.xml  
  inflating: corpora/Hicks/Kinder/report/rep044.xml  
  inflating: corpora/Hicks/Kinder/report/rep052.xml  
  inflating: corpora/Hicks/Kinder/report/rep015.xml  
  inflating: corpora/Hicks/Kinder/report/rep059.xml  
  inflating: corpora/Hicks/Kinder/report/rep033.xml  
  inflating: corpora/Hicks/Kinder/report/rep053.xml  
  inflating: corpora/Hicks/Kinder/report/rep034.xml  
  inflating: corpora/Hicks/Kinder/report/rep058.xml  
  inflating: corpora/Hicks/Kinder/report/rep054.xml  
  inflating: corpora/Hicks/Kinder/report/rep031.xml  
  inflating: corpora/Hicks/Kinder/report/rep057.xml  
  inflating: corpora/Hicks/Kinder/report/rep047.xml  
  inflating: corpora/Hicks/Kinde

  inflating: corpora/Kuczaj/031100.xml  
  inflating: corpora/Kuczaj/030016.xml  
  inflating: corpora/Kuczaj/020801.xml  
  inflating: corpora/Kuczaj/030524.xml  
  inflating: corpora/Kuczaj/030430.xml  
  inflating: corpora/Kuczaj/040814.xml  
  inflating: corpora/Kuczaj/031102.xml  
  inflating: corpora/Kuczaj/030029.xml  
  inflating: corpora/Kuczaj/030007.xml  
  inflating: corpora/Kuczaj/030914.xml  
  inflating: corpora/Kuczaj/030821.xml  
  inflating: corpora/Kuczaj/030722.xml  
  inflating: corpora/Kuczaj/030101.xml  
  inflating: corpora/Kuczaj/040528.xml  
  inflating: corpora/Kuczaj/041127.xml  
  inflating: corpora/Kuczaj/040124.xml  
  inflating: corpora/Kuczaj/041022.xml  
  inflating: corpora/Kuczaj/040321.xml  
  inflating: corpora/Kuczaj/040612.xml  
  inflating: corpora/Kuczaj/040627.xml  
  inflating: corpora/Kuczaj/030221.xml  
  inflating: corpora/Kuczaj/040605.xml  
  inflating: corpora/Kuczaj/040802.xml  
  inflating: corpora/Kuczaj/021022.xml  
  inflating: cor

  inflating: corpora/MacWhinney/021022a.xml  
  inflating: corpora/MacWhinney/010405c.xml  
  inflating: corpora/MacWhinney/040316c.xml  
  inflating: corpora/MacWhinney/030001a.xml  
  inflating: corpora/MacWhinney/060922c.xml  
  inflating: corpora/MacWhinney/060906c.xml  
  inflating: corpora/MacWhinney/061017c.xml  
  inflating: corpora/MacWhinney/000917a.xml  
  inflating: corpora/MacWhinney/040404d.xml  
  inflating: corpora/MacWhinney/070018c.xml  
  inflating: corpora/MacWhinney/070318c.xml  
  inflating: corpora/MacWhinney/040316b.xml  
  inflating: corpora/MacWhinney/010425a.xml  
  inflating: corpora/MacWhinney/070318.xml  
  inflating: corpora/MacWhinney/010405a.xml  
  inflating: corpora/MacWhinney/000623d.xml  
  inflating: corpora/MacWhinney/020718c.xml  
  inflating: corpora/MacWhinney/030017.xml  
  inflating: corpora/MacWhinney/021001d.xml  
  inflating: corpora/MacWhinney/021017b.xml  
  inflating: corpora/MacWhinney/030805.xml  
  inflating: corpora/MacWhinney/02110

  inflating: corpora/MacWhinney/060002c.xml  
  inflating: corpora/MacWhinney/070309b.xml  
  inflating: corpora/MacWhinney/020718b.xml  
  inflating: corpora/MacWhinney/040601d.xml  
  inflating: corpora/MacWhinney/060302a.xml  
  inflating: corpora/MacWhinney/010009a.xml  
  inflating: corpora/MacWhinney/020617c.xml  
  inflating: corpora/MacWhinney/051001d.xml  
  inflating: corpora/MacWhinney/031109.xml  
  inflating: corpora/MacWhinney/040601c.xml  
  inflating: corpora/MacWhinney/030401.xml  
  inflating: corpora/MacWhinney/020817b.xml  
  inflating: corpora/MacWhinney/070503c.xml  
  inflating: corpora/MacWhinney/021001b.xml  
  inflating: corpora/MacWhinney/010306a.xml  
  inflating: corpora/MacWhinney/000710c.xml  
  inflating: corpora/MacWhinney/070309c.xml  
  inflating: corpora/MacWhinney/060922a.xml  
  inflating: corpora/MacWhinney/030616.xml  
  inflating: corpora/MacWhinney/060406a2.xml  
  inflating: corpora/MacWhinney/050309c.xml  
  inflating: corpora/MacWhinney/0602

  inflating: corpora/McCune/Rick/010900.xml  
  inflating: corpora/McCune/Rick/001100.xml  
  inflating: corpora/McCune/Rick/000800.xml  
  inflating: corpora/McCune/Rick/010200.xml  
  inflating: corpora/McCune/Rick/010400.xml  
  inflating: corpora/McCune/Rick/010000.xml  
  inflating: corpora/McCune/Rick/000900.xml  
  inflating: corpora/McCune/Rick/020000.xml  
  inflating: corpora/McCune/Rick/010700.xml  
  inflating: corpora/McCune/Rala/010300.xml  
  inflating: corpora/McCune/Rala/010600.xml  
  inflating: corpora/McCune/Rala/010800.xml  
  inflating: corpora/McCune/Rala/001000.xml  
  inflating: corpora/McCune/Rala/020700.xml  
  inflating: corpora/McCune/Rala/010900.xml  
  inflating: corpora/McCune/Rala/000700.xml  
  inflating: corpora/McCune/Rala/010400.xml  
  inflating: corpora/McCune/Rala/010000.xml  
  inflating: corpora/McCune/Rala/030000.xml  
  inflating: corpora/McCune/Rala/000900.xml  
  inflating: corpora/McCune/Rala/020000.xml  
  inflating: corpora/McCune/Vito/0

  inflating: corpora/Morisset/Topeka/223m30t.xml  
  inflating: corpora/Morisset/Topeka/177m30t.xml  
  inflating: corpora/Morisset/Topeka/140m30t.xml  
  inflating: corpora/Morisset/Topeka/127m30t.xml  
  inflating: corpora/Morisset/Topeka/173m30t.xml  
  inflating: corpora/Morisset/Topeka/242m30t.xml  
  inflating: corpora/Morisset/Topeka/119m30t.xml  
  inflating: corpora/Morisset/Topeka/187m30t.xml  
  inflating: corpora/Morisset/Topeka/323m30t.xml  
  inflating: corpora/Morisset/Topeka/210m30t.xml  
  inflating: corpora/Morisset/Topeka/108m30t.xml  
  inflating: corpora/Morisset/Topeka/181m30t.xml  
  inflating: corpora/Morisset/Topeka/158m30t.xml  
  inflating: corpora/Morisset/Topeka/252m30t.xml  
  inflating: corpora/Morisset/Topeka/244m30t.xml  
  inflating: corpora/Morisset/Topeka/219m30t.xml  
  inflating: corpora/Morisset/Topeka/138m30t.xml  
  inflating: corpora/Morisset/Topeka/107m30t.xml  
  inflating: corpora/Morisset/Topeka/157m30t.xml  
  inflating: corpora/Morisset/T

Archive:  corpora/NewEngland.zip
  inflating: corpora/NewEngland/60/38.xml  
  inflating: corpora/NewEngland/60/25.xml  
  inflating: corpora/NewEngland/60/97.xml  
  inflating: corpora/NewEngland/60/55.xml  
  inflating: corpora/NewEngland/60/01.xml  
  inflating: corpora/NewEngland/60/47.xml  
  inflating: corpora/NewEngland/60/43.xml  
  inflating: corpora/NewEngland/60/99.xml  
  inflating: corpora/NewEngland/60/13.xml  
  inflating: corpora/NewEngland/60/92.xml  
  inflating: corpora/NewEngland/60/90.xml  
  inflating: corpora/NewEngland/60/89.xml  
  inflating: corpora/NewEngland/60/20.xml  
  inflating: corpora/NewEngland/60/32.xml  
  inflating: corpora/NewEngland/60/56.xml  
  inflating: corpora/NewEngland/60/60.xml  
  inflating: corpora/NewEngland/60/65.xml  
  inflating: corpora/NewEngland/60/26.xml  
  inflating: corpora/NewEngland/60/06.xml  
  inflating: corpora/NewEngland/60/10.xml  
  inflating: corpora/NewEngland/60/98.xml  
  inflating: corpora/NewEngland/60/75.xml  

  inflating: corpora/NewmanRatner/Interviews/24/4619WZ.xml  
  inflating: corpora/NewmanRatner/Interviews/24/6825MT.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5733LBE.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5936SR.xml  
  inflating: corpora/NewmanRatner/Interviews/24/4743NA.xml  
  inflating: corpora/NewmanRatner/Interviews/24/6630TM.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5859ME.xml  
  inflating: corpora/NewmanRatner/Interviews/24/4697JK.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5073AC.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5903AE.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5540LD.xml  
  inflating: corpora/NewmanRatner/Interviews/24/7660HK.xml  
  inflating: corpora/NewmanRatner/Interviews/24/5585ME.xml  
  inflating: corpora/NewmanRatner/Interviews/24/7444IJ.xml  
  inflating: corpora/NewmanRatner/Interviews/24/6206MP.xml  
  inflating: corpora/NewmanRatner/Interviews/24/4650KS.xml  
  inflating: corpora/Ne

  inflating: corpora/NewmanRatner/Interviews/10/6314AK.xml  
  inflating: corpora/NewmanRatner/Interviews/10/5878SC.xml  
  inflating: corpora/NewmanRatner/Interviews/10/7061AS.xml  
  inflating: corpora/NewmanRatner/Interviews/10/5346GG.xml  
  inflating: corpora/NewmanRatner/Interviews/10/7534EM.xml  
  inflating: corpora/NewmanRatner/Interviews/10/7222MD.xml  
  inflating: corpora/NewmanRatner/Interviews/10/7658LT.xml  
  inflating: corpora/NewmanRatner/Interviews/10/4767JC.xml  
  inflating: corpora/NewmanRatner/Interviews/10/4802JP.xml  
  inflating: corpora/NewmanRatner/Interviews/10/6047JC.xml  
  inflating: corpora/NewmanRatner/Interviews/10/5039MB.xml  
  inflating: corpora/NewmanRatner/Interviews/10/7120CB.xml  
  inflating: corpora/NewmanRatner/Interviews/10/5923MW.xml  
  inflating: corpora/NewmanRatner/Interviews/10/4801RB.xml  
  inflating: corpora/NewmanRatner/Interviews/10/5694MC.xml  
  inflating: corpora/NewmanRatner/Interviews/10/7099EH.xml  
  inflating: corpora/New

  inflating: corpora/NewmanRatner/Interviews/11/5794ES.xml  
  inflating: corpora/NewmanRatner/Interviews/11/6691MW.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5954ML.xml  
  inflating: corpora/NewmanRatner/Interviews/11/6493TM.xml  
  inflating: corpora/NewmanRatner/Interviews/11/7183TB.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5196AVI.xml  
  inflating: corpora/NewmanRatner/Interviews/11/6785KS.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5837AK.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5244SE.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5609DW.xml  
  inflating: corpora/NewmanRatner/Interviews/11/4929MM.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5440JJ.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5928RL.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5563DB.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5118PM.xml  
  inflating: corpora/NewmanRatner/Interviews/11/5977QJ.xml  
  inflating: corpora/Ne

  inflating: corpora/NewmanRatner/18/5013LA.xml  
  inflating: corpora/NewmanRatner/18/6825MT.xml  
  inflating: corpora/NewmanRatner/18/5936SR.xml  
  inflating: corpora/NewmanRatner/18/6630TM.xml  
  inflating: corpora/NewmanRatner/18/5859ME.xml  
  inflating: corpora/NewmanRatner/18/5244RE.xml  
  inflating: corpora/NewmanRatner/18/6206MP.xml  
  inflating: corpora/NewmanRatner/18/5630WS.xml  
  inflating: corpora/NewmanRatner/18/6314AK.xml  
  inflating: corpora/NewmanRatner/18/5878SC.xml  
  inflating: corpora/NewmanRatner/18/7061AS.xml  
  inflating: corpora/NewmanRatner/18/7814NB.xml  
  inflating: corpora/NewmanRatner/18/7534EM.xml  
  inflating: corpora/NewmanRatner/18/6047JC.xml  
  inflating: corpora/NewmanRatner/18/7120CB.xml  
  inflating: corpora/NewmanRatner/18/5923MW.xml  
  inflating: corpora/NewmanRatner/18/5694MC.xml  
  inflating: corpora/NewmanRatner/18/7099EH.xml  
  inflating: corpora/NewmanRatner/18/6757JC.xml  
  inflating: corpora/NewmanRatner/18/5057MS.xml  


  inflating: corpora/NewmanRatner/10/4767JC.xml  
  inflating: corpora/NewmanRatner/10/4802JP.xml  
  inflating: corpora/NewmanRatner/10/6047JC.xml  
  inflating: corpora/NewmanRatner/10/5039MB.xml  
  inflating: corpora/NewmanRatner/10/7120CB.xml  
  inflating: corpora/NewmanRatner/10/5923MW.xml  
  inflating: corpora/NewmanRatner/10/4801RB.xml  
  inflating: corpora/NewmanRatner/10/5694MC.xml  
  inflating: corpora/NewmanRatner/10/7099EH.xml  
  inflating: corpora/NewmanRatner/10/4903LS.xml  
  inflating: corpora/NewmanRatner/10/6757JC.xml  
  inflating: corpora/NewmanRatner/10/4452CM.xml  
  inflating: corpora/NewmanRatner/10/5571FW.xml  
  inflating: corpora/NewmanRatner/10/7419EB.xml  
  inflating: corpora/NewmanRatner/10/4629AB.xml  
  inflating: corpora/NewmanRatner/10/5057MS.xml  
  inflating: corpora/NewmanRatner/10/4664AM.xml  
  inflating: corpora/NewmanRatner/10/5224EZS.xml  
  inflating: corpora/NewmanRatner/10/4708IB.xml  
  inflating: corpora/NewmanRatner/10/4731SA.xml  

  inflating: corpora/NewmanRatner/07/5543EF.xml  
  inflating: corpora/NewmanRatner/07/5837JK.xml  
  inflating: corpora/NewmanRatner/07/4946RC.xml  
  inflating: corpora/NewmanRatner/07/4724LM.xml  
  inflating: corpora/NewmanRatner/07/5623AT.xml  
  inflating: corpora/NewmanRatner/07/7252PD.xml  
  inflating: corpora/NewmanRatner/07/4592HVG.xml  
  inflating: corpora/NewmanRatner/07/5482DF.xml  
  inflating: corpora/NewmanRatner/07/5733LE.xml  
  inflating: corpora/NewmanRatner/07/5013LA.xml  
  inflating: corpora/NewmanRatner/07/4310AM.xml  
  inflating: corpora/NewmanRatner/07/4619WZ.xml  
  inflating: corpora/NewmanRatner/07/6825MT.xml  
  inflating: corpora/NewmanRatner/07/5936SR.xml  
  inflating: corpora/NewmanRatner/07/4743NA.xml  
  inflating: corpora/NewmanRatner/07/6630TM.xml  
  inflating: corpora/NewmanRatner/07/5859ME.xml  
  inflating: corpora/NewmanRatner/07/4697JK.xml  
  inflating: corpora/NewmanRatner/07/5244RE.xml  
  inflating: corpora/NewmanRatner/07/5073AC.xml  

  inflating: corpora/Peters/020000a.xml  
  inflating: corpora/Peters/010614a.xml  
  inflating: corpora/Peters/030207b.xml  
  inflating: corpora/Peters/011114b.xml  
  inflating: corpora/Peters/020107b.xml  
  inflating: corpora/Peters/011120.xml  
  inflating: corpora/Peters/010327a.xml  
  inflating: corpora/Peters/010412.xml  
  inflating: corpora/Peters/010817.xml  
  inflating: corpora/Peters/011125a.xml  
  inflating: corpora/Peters/010402a.xml  
  inflating: corpora/Peters/010410a.xml  
  inflating: corpora/Peters/011011c.xml  
  inflating: corpora/Peters/010901b.xml  
  inflating: corpora/Peters/011006.xml  
  inflating: corpora/Peters/011016.xml  
Archive:  corpora/PetersonMcCabe.zip
  inflating: corpora/PetersonMcCabe/38.xml  
  inflating: corpora/PetersonMcCabe/41.xml  
  inflating: corpora/PetersonMcCabe/25.xml  
  inflating: corpora/PetersonMcCabe/24.xml  
  inflating: corpora/PetersonMcCabe/45.xml  
  inflating: corpora/PetersonMcCabe/59.xml  
  inflating: corpora/Peter

  inflating: corpora/Rollins/ds09.xml  
  inflating: corpora/Rollins/nb12.xml  
  inflating: corpora/Rollins/nb09.xml  
  inflating: corpora/Rollins/cy12.xml  
  inflating: corpora/Rollins/zx09.xml  
  inflating: corpora/Rollins/ds06.xml  
  inflating: corpora/Rollins/ax06.xml  
  inflating: corpora/Rollins/st12.xml  
  inflating: corpora/Rollins/nj09.xml  
  inflating: corpora/Rollins/mm09.xml  
  inflating: corpora/Rollins/te12.xml  
  inflating: corpora/Rollins/ps09.xml  
  inflating: corpora/Rollins/gb06.xml  
  inflating: corpora/Rollins/et06.xml  
  inflating: corpora/Rollins/ma12.xml  
  inflating: corpora/Rollins/di06.xml  
  inflating: corpora/Rollins/sa06.xml  
  inflating: corpora/Rollins/gp06.xml  
  inflating: corpora/Rollins/sb06.xml  
  inflating: corpora/Rollins/pa06.xml  
  inflating: corpora/Rollins/tx09.xml  
  inflating: corpora/Rollins/cb06.xml  
  inflating: corpora/Rollins/tl06.xml  
  inflating: corpora/Rollins/jp18.xml  
  inflating: corpora/Rollins/cy30.xml  


  inflating: corpora/Sawyer/2-28-92.xml  
Archive:  corpora/Snow.zip
  inflating: corpora/Snow/020804b.xml  
  inflating: corpora/Snow/030410a.xml  
  inflating: corpora/Snow/030106a.xml  
  inflating: corpora/Snow/030106b.xml  
  inflating: corpora/Snow/030904.xml  
  inflating: corpora/Snow/020622a.xml  
  inflating: corpora/Snow/030418c.xml  
  inflating: corpora/Snow/020804a.xml  
  inflating: corpora/Snow/020600c.xml  
  inflating: corpora/Snow/020622b.xml  
  inflating: corpora/Snow/030408a.xml  
  inflating: corpora/Snow/030409b.xml  
  inflating: corpora/Snow/030019a.xml  
  inflating: corpora/Snow/030408c.xml  
  inflating: corpora/Snow/030019b.xml  
  inflating: corpora/Snow/020518b.xml  
  inflating: corpora/Snow/030408d.xml  
  inflating: corpora/Snow/030408b.xml  
  inflating: corpora/Snow/021100c.xml  
  inflating: corpora/Snow/020819b.xml  
  inflating: corpora/Snow/020600a.xml  
  inflating: corpora/Snow/030021b.xml  
  inflating: corpora/Snow/030409d.xml  
  inflating:

  inflating: corpora/Suppes/020913.xml  
Archive:  corpora/Tardif.zip
  inflating: corpora/Tardif/e21.xml  
  inflating: corpora/Tardif/e08.xml  
  inflating: corpora/Tardif/e19.xml  
  inflating: corpora/Tardif/e16.xml  
  inflating: corpora/Tardif/e11.xml  
  inflating: corpora/Tardif/e06.xml  
  inflating: corpora/Tardif/e10.xml  
  inflating: corpora/Tardif/e26book.xml  
  inflating: corpora/Tardif/e03.xml  
  inflating: corpora/Tardif/e12.xml  
  inflating: corpora/Tardif/e24.xml  
  inflating: corpora/Tardif/e25.xml  
  inflating: corpora/Tardif/e13.xml  
  inflating: corpora/Tardif/e17.xml  
  inflating: corpora/Tardif/e04.xml  
  inflating: corpora/Tardif/e15.xml  
  inflating: corpora/Tardif/e18.xml  
  inflating: corpora/Tardif/e20.xml  
  inflating: corpora/Tardif/e23.xml  
  inflating: corpora/Tardif/e09.xml  
  inflating: corpora/Tardif/e02.xml  
  inflating: corpora/Tardif/e22.xml  
  inflating: corpora/Tardif/e01.xml  
  inflating: corpora/Tardif/e14.xml  
  inflating: c

Archive:  corpora/VanKleeck.zip
  inflating: corpora/VanKleeck/megan1.xml  
  inflating: corpora/VanKleeck/rachel2.xml  
  inflating: corpora/VanKleeck/lara2.xml  
  inflating: corpora/VanKleeck/walter2a.xml  
  inflating: corpora/VanKleeck/matjoy2.xml  
  inflating: corpora/VanKleeck/nikki1b.xml  
  inflating: corpora/VanKleeck/graham2.xml  
  inflating: corpora/VanKleeck/ben2.xml  
  inflating: corpora/VanKleeck/nikki2b.xml  
  inflating: corpora/VanKleeck/jessica1.xml  
  inflating: corpora/VanKleeck/justin1a.xml  
  inflating: corpora/VanKleeck/justin2.xml  
  inflating: corpora/VanKleeck/lara1.xml  
  inflating: corpora/VanKleeck/graham1.xml  
  inflating: corpora/VanKleeck/megan2.xml  
  inflating: corpora/VanKleeck/bree2.xml  
  inflating: corpora/VanKleeck/susan1.xml  
  inflating: corpora/VanKleeck/nikki1a.xml  
  inflating: corpora/VanKleeck/ben1.xml  
  inflating: corpora/VanKleeck/mattm2.xml  
  inflating: corpora/VanKleeck/matjoy1.xml  
  inflating: corpora/VanKleeck/jessi

  inflating: corpora/Weist/Matt/030002.xml  
  inflating: corpora/Weist/Matt/020407.xml  
  inflating: corpora/Weist/Matt/030226.xml  
  inflating: corpora/Weist/Matt/040809.xml  
  inflating: corpora/Weist/Matt/020317.xml  
  inflating: corpora/Weist/Matt/031106.xml  
  inflating: corpora/Weist/Matt/040610.xml  
  inflating: corpora/Weist/Matt/030514.xml  
  inflating: corpora/Weist/Matt/040704.xml  
  inflating: corpora/Weist/Matt/040405.xml  
  inflating: corpora/Weist/Matt/030704.xml  
  inflating: corpora/Weist/Jillian/020916.xml  
  inflating: corpora/Weist/Jillian/020122.xml  
  inflating: corpora/Weist/Jillian/020421.xml  
  inflating: corpora/Weist/Jillian/020507.xml  
  inflating: corpora/Weist/Jillian/020707.xml  
  inflating: corpora/Weist/Jillian/020325.xml  
  inflating: corpora/Weist/Jillian/020129.xml  
  inflating: corpora/Weist/Jillian/020101.xml  
  inflating: corpora/Weist/Jillian/020521.xml  
  inflating: corpora/Weist/Jillian/020526.xml  
  inflating: corpora/Weis

### Extract data from CHILDES

The data is extracted by parsing the CHILDES XML files.

In [1]:
import os
from glob import glob
PATH = "./corpora"
all_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.xml'))]
print(len(all_files))
print(all_files[0])

7719
./corpora/Suppes/030107.xml
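
Every element in these files belongs to the TalkBank XML namespace, so ElementTree lookups need the namespace-qualified tag name. A minimal sketch of that lookup on a hand-written fragment follows; the fragment is a heavily simplified assumption about the file layout, not a real CHILDES transcript, and `NS` and `SAMPLE` are names chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.talkbank.org/ns/talkbank}"

# Hypothetical, simplified stand-in for a CHILDES XML file.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank" Corpus="Demo">
  <Participants>
    <participant id="CHI" role="Target_Child" age="P2Y6M"/>
    <participant id="MOT" role="Mother"/>
  </Participants>
  <u who="CHI" uID="u0"><w>more</w><w>juice</w><t type="p"/></u>
</CHAT>"""

root = ET.fromstring(SAMPLE)

# Namespace-qualified lookups succeed ...
participant_ids = [p.attrib["id"] for p in root.find(NS + "Participants")]
words = [w.text for w in root.iter(NS + "w")]
# ... while the bare tag name finds nothing.
unqualified = root.find("Participants")

print(participant_ids, words, unqualified)
```

Forgetting the namespace prefix does not raise an error; the lookup simply returns `None`, which is why every tag in the functions of this section is built as `ns + "..."`.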


#### Find participants

Participant metadata matters for later queries, for example retrieving utterances by the child's age.

In [2]:
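def find_participants(root):
    # Collect the attribute dictionary of every participant in the transcript.
    # `ns` is the TalkBank namespace prefix; it must be defined before this is called.
    participants_node = root.find(ns + "Participants")
    if participants_node is None:  # file without a <Participants> block
        return []
    return [dict(participant.attrib) for participant in participants_node]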
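
As a sketch of the kind of age query mentioned above, the helper below converts an age string into months and filters participant dictionaries with it. It assumes the `age` attribute is an ISO 8601 style duration such as `P2Y6M15D`; that format is an assumption here and should be checked against the actual files. The names `age_in_months` and `participants_younger_than` are hypothetical and not part of any library.

```python
import re

# Assumed age format: ISO 8601 style duration, e.g. "P2Y6M15D" or "P1Y".
AGE_RE = re.compile(r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")


def age_in_months(age):
    """Convert a duration such as 'P2Y6M15D' to whole months (days are ignored)."""
    match = AGE_RE.match(age or "")
    if match is None or not any(match.groups()):
        return None  # missing or unrecognised age
    years, months, _days = (int(g) if g else 0 for g in match.groups())
    return years * 12 + months


def participants_younger_than(participants, max_months):
    """Keep the participant dicts whose age is known and below max_months."""
    selected = []
    for participant in participants:
        months = age_in_months(participant.get("age"))
        if months is not None and months < max_months:
            selected.append(participant)
    return selected


demo = [
    {"id": "CHI", "age": "P2Y6M15D"},
    {"id": "SIB", "age": "P5Y"},
    {"id": "MOT"},  # adults often have no age recorded
]
print(participants_younger_than(demo, 36))
```

Participants without a recognisable age are excluded rather than treated as age zero, so a query never silently includes adults.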

#### Parse utterances

In [3]:
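def parse_utterance(u):
    # Run word sense disambiguation on one utterance and return one dict per token.
    # `nlp` is the PySupWSDPocket object, expected to be defined before this cell runs.
    wsd_doc = []
    if 'text' in u:  # some utterances hold only researcher comments or actions, e.g. "(he screamed)"
        doc = nlp.wsd(u['text'])
        for token in doc.tokens():
            wsd_doc.append(token.__dict__)
    return wsd_doc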

#### Process utterances

In [4]:
from tqdm import tqdm


def word_text(word):
    # Return the surface form of a <w> element, or None if it cannot be recovered.
    if word.find(ns + "shortening") is not None:
        # Shortened words split their text across child nodes, so fall back
        # to the stem given in the morphological annotation.
        stem = word.find("{0}mor/{0}mw/{0}stem".format(ns))
        return stem.text if stem is not None else None
    return word.text


def process_utterances(root, process_faster=False):
    utterances = []
    for u in root.findall(ns + 'u'):
        utterance_dict = dict(u.attrib)  # copy, so the parsed tree is not modified
        utterance_dict['original_tokens'] = []
        tokens = []
        for child in u:  # Element.getchildren() was removed in Python 3.9
            text = None
            if child.tag == ns + "w":
                text = word_text(child)
            elif child.tag == ns + "g":  # group of words: only its first <w> is used
                word = child.find(ns + 'w')
                if word is not None:
                    text = word_text(word)
            elif child.tag == ns + "t":  # punctuation: 'p' is a period, 'q' a question mark
                text = {'p': ".", 'q': "?"}.get(child.attrib.get('type'))
            elif child.tag == ns + "tagMarker":  # comma
                text = ','
            if text is not None:
                tokens.append(text)
        if len(tokens) > 1:
            utterance_dict['text'] = " ".join(tokens)
        if not process_faster:
            utterance_dict['wsd_doc'] = parse_utterance(utterance_dict)
        utterances.append(utterance_dict)

    return utterances
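
The special case for `shortening` follows from how ElementTree stores mixed content: text that comes after a child element is kept in that child's `tail`, not in the parent's `text`. The sketch below shows this on a hand-written `<w>` element; its structure is a simplified assumption and the namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified <w> element for a shortened word written "(be)cause".
w = ET.fromstring(
    "<w><shortening>be</shortening>cause"
    "<mor><mw><stem>because</stem></mw></mor></w>"
)

shortening = w.find("shortening")
parent_text = w.text              # nothing precedes the first child
omitted_part = shortening.text    # the part the speaker left out
spoken_part = shortening.tail     # the part actually spoken
stem = w.findtext("mor/mw/stem")  # full word from the morphological tier

print(parent_text, omitted_part, spoken_part, stem)
```

Because the parent's `text` is empty for such words, reading it alone would lose the token, which is why the stem is used instead.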

In [22]:
!mkdir -p dicts

In [5]:
import os
import xml.etree.ElementTree as ET
from tqdm import tqdm
import warnings
import json
from os.path import exists

warnings.filterwarnings('ignore')
all_dicts = []
ns = "{http://www.talkbank.org/ns/talkbank}"  # TalkBank XML namespace prefix for every tag

process_n_files = len(all_files)  # lower this value to process only a subset of the files

# To process faster, you can run the pysupwsd parse_corpus method on plain-text files.
# However, with this method the sentence metadata (e.g., the child's age) is not available.
faster_processing = True
if faster_processing:
    os.makedirs("only_texts", exist_ok=True)

for xml_file in tqdm(all_files[:process_n_files]):
    # Note: names built from the last two path components are not unique across the
    # corpus tree (e.g. NewmanRatner/10/ and NewmanRatner/Interviews/10/).
    base_name = "_".join(xml_file.split('/')[-2:])
    txt_file = "only_texts/{0}.txt".format(base_name)
    if faster_processing and exists(txt_file):
        continue  # already extracted in a previous run
    root = ET.parse(xml_file).getroot()
    sem_dict = dict(root.attrib)
    sem_dict['file'] = xml_file  # record the source file of this transcript
    sem_dict['participants'] = find_participants(root)
    sem_dict['utterances'] = process_utterances(root, faster_processing)

    if faster_processing:
        with open(txt_file, 'w') as ft:
            ft.writelines(u['text'] + "\n" for u in sem_dict['utterances'] if 'text' in u)

    all_dicts.append(sem_dict)
    with open("dicts/{0}.json".format(base_name), 'w') as fj:
        json.dump(sem_dict, fj)


100%|██████████| 7719/7719 [00:49<00:00, 154.78it/s]
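
One detail of the cell above is worth checking: the output name is built from the last two path components only, so files in different subcorpora can map to the same name. The unzip log lists both `corpora/NewmanRatner/10/4767JC.xml` and `corpora/NewmanRatner/Interviews/10/4767JC.xml`, which collide. The sketch below demonstrates this and shows one possible alternative; `unique_name` is a hypothetical helper, not something the notebook uses, and whether such collisions fully explain the smaller number of text files processed later has not been verified.

```python
import os


def flat_name(xml_file):
    """Naming scheme used in the cell above: last two path components."""
    return "_".join(xml_file.split("/")[-2:])


def unique_name(xml_file, root="./corpora"):
    """Hypothetical alternative: keep the whole path relative to the corpus root."""
    return os.path.relpath(xml_file, root).replace(os.sep, "_")


a = "./corpora/NewmanRatner/10/4767JC.xml"
b = "./corpora/NewmanRatner/Interviews/10/4767JC.xml"

print(flat_name(a), flat_name(b))      # identical names
print(unique_name(a), unique_name(b))  # distinct names
```

With the flat scheme, the second file is skipped in faster mode because its text file already exists, and its JSON file overwrites the first one otherwise.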


#### Create corpus for BERT input

In [11]:
!pip install nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/43/0b/8298798bc5a9a007b7cae3f846a3d9a325953e0f9c238affa478b4d59324/nltk-3.7-py3-none-any.whl (1.5MB)
Collecting click (from nltk)
  Downloading https://files.pythonhosted.org/packages/4a/a8/0b2ced25639fb20cc1c9784de90a8c25f9504a7f18cd8b5397bd61696d7d/click-8.0.4-py3-none-any.whl (97kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading https://files.pythonhosted.org/packages/59/ec/091ea11974453cff690837ae97a8fa5e433e9e47ed596ee9cf4c889a9079/regex-2022.3.15-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (670kB)
Collecting joblib (from nltk)
  Downloading https://files.pythonhosted.org/packages/3e/d5/0163eb0cfa0b673aa4fe1cd3ea9

In [12]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


In [23]:
import os
from glob import glob

os.makedirs("data", exist_ok=True)  # the output directory must exist before writing
if faster_processing:
    PATH = "only_texts/"
    only_texts_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

    with open("data/semCHILDES.txt", 'w') as f:
        for text_file in tqdm(only_texts_files):
            corpus = nlp.parse_corpus(text_file)
            for doc in corpus:
                new_sentence = []
                for t in doc.tokens():
                    token = t.__dict__
                    senses = token['senses']
                    if token['word'] in ['me', 'and', 'or', ',']:
                        new_sentence.append(token['word'])
                    elif token['lemma'] in ["can", "a", "to", "how", "what", "this", "that"]:
                        new_sentence.append(token['lemma'])
                    elif senses and senses[0]['id'] != 'U':  # a sense was assigned: use its id
                        new_sentence.append(senses[0]['id'])
                    elif token['pos'] in ['IN', 'PRP', '.', 'WRB', 'CC', 'PRP$', 'DT']:
                        new_sentence.append(token['lemma'])
                    elif token['pos'] in ['NNP']:
                        new_sentence.append('proper_noun')
                    elif token['pos'] in ['NN', 'NNS']:
                        # Fall back to the first WordNet noun synset and take the sense key
                        # of the lemma that matches exactly, if there is one.
                        n_token = None
                        synsets = wn.synsets(token['lemma'], 'n')
                        if synsets:
                            for l in synsets[0].lemmas():
                                if l.name() == token['lemma']:
                                    n_token = l.key()
                        # Otherwise keep the lemma; it may be a word common in children's vocabulary.
                        new_sentence.append(n_token if n_token is not None else token['lemma'])
                    # Tokens that match none of the rules above are dropped.
                if len(new_sentence) > 1:
                    f.write(" ".join(new_sentence) + "\n")




  0%|          | 0/7142 [00:00<?, ?it/s]
  0%|          | 1/7142 [00:12<24:09:30, 12.18s/it]
  0%|          | 2/7142 [00:14<18:29:38,  9.32s/it]
  0%|          | 3/7142 [00:25<19:29:45,  9.83s/it]
  0%|          | 4/7142 [00:46<26:00:23, 13.12s/it]
  0%|          | 5/7142 [01:16<35:55:39, 18.12s/it]
  0%|          | 6/7142 [01:18<26:15:51, 13.25s/it]
  0%|          | 7/7142 [01:23<21:23:37, 10.79s/it]
  0%|          | 8/7142 [01:29<18:24:53,  9.29s/it]
  0%|          | 9/7142 [01:32<14:36:02,  7.37s/it]
  0%|          | 10/7142 [01:34<11:39:06,  5.88s/it]
  0%|          | 11/7142 [01:40<11:40:37,  5.90s/it]
  0%|          | 12/7142 [01:46<11:41:00,  5.90s/it]
  0%|          | 13/7142 [01:57<15:04:35,  7.61s/it]
  0%|          | 14/7142 [01:58<11:09:49,  5.64s/it]
  0%|          | 15/7142 [02:01<9:03:01,  4.57s/it]
  0%|          | 16/7142 [02:09<11:04:28,  5.59s/it]

  2%|▏         | 136/7142 [22:34<26:49:28, 13.78s/it]
  2%|▏         | 137/7142 [22:35<19:22:25,  9.96s/it]
  2%|▏         | 138/7142 [22:39<16:03:00,  8.25s/it]
  2%|▏         | 139/7142 [23:08<28:23:27, 14.59s/it]
  2%|▏         | 140/7142 [23:17<24:51:13, 12.78s/it]
  2%|▏         | 141/7142 [23:31<25:36:58, 13.17s/it]
  2%|▏         | 142/7142 [23:33<19:02:46,  9.80s/it]
  2%|▏         | 143/7142 [23:37<15:38:11,  8.04s/it]
  2%|▏         | 144/7142 [23:39<12:11:56,  6.28s/it]
  2%|▏         | 145/7142 [23:40<9:08:48,  4.71s/it]
  2%|▏         | 146/7142 [23:47<10:30:34,  5.41s/it]
  2%|▏         | 147/7142 [23:48<7:57:19,  4.09s/it]
  2%|▏         | 148/7142 [24:00<12:36:18,  6.49s/it]
  2%|▏         | 149/7142 [24:05<11:30:40,  5.93s/it]
  2%|▏         | 150/7142 [24:12<12:12:55,  6.29s/it]
  2%|▏         | 151/7142 [24:16<10:47:06,  5.55s/it]
  2%|▏         | 152/714

  4%|▍         | 270/7142 [41:39<14:28:28,  7.58s/it]
  4%|▍         | 271/7142 [42:15<30:45:23, 16.11s/it]
  4%|▍         | 272/7142 [42:28<29:12:39, 15.31s/it]
  4%|▍         | 273/7142 [42:30<21:22:14, 11.20s/it]
  4%|▍         | 274/7142 [42:31<15:32:49,  8.15s/it]
  4%|▍         | 275/7142 [42:32<11:31:05,  6.04s/it]
  4%|▍         | 276/7142 [42:43<14:33:23,  7.63s/it]
  4%|▍         | 277/7142 [42:51<14:29:00,  7.60s/it]
  4%|▍         | 278/7142 [42:57<13:21:32,  7.01s/it]
  4%|▍         | 279/7142 [43:05<14:13:49,  7.46s/it]
  4%|▍         | 280/7142 [43:06<10:32:37,  5.53s/it]
  4%|▍         | 281/7142 [43:09<8:57:51,  4.70s/it]
  4%|▍         | 282/7142 [43:11<7:15:08,  3.81s/it]
  4%|▍         | 283/7142 [43:18<9:11:29,  4.82s/it]
  4%|▍         | 284/7142 [43:19<7:05:37,  3.72s/it]
  4%|▍         | 285/7142 [43:23<7:12:33,  3.79s/it]
  4%|▍         | 286/7142 [4

  6%|▌         | 404/7142 [1:00:47<13:52:40,  7.41s/it]
  6%|▌         | 405/7142 [1:00:51<12:25:20,  6.64s/it]
  6%|▌         | 406/7142 [1:01:09<18:45:00, 10.02s/it]
  6%|▌         | 407/7142 [1:01:33<26:30:10, 14.17s/it]
  6%|▌         | 408/7142 [1:01:38<21:13:31, 11.35s/it]
  6%|▌         | 409/7142 [1:01:48<20:21:55, 10.89s/it]
  6%|▌         | 410/7142 [1:02:03<22:37:03, 12.09s/it]
  6%|▌         | 411/7142 [1:02:22<26:37:15, 14.24s/it]
  6%|▌         | 412/7142 [1:02:28<21:47:56, 11.66s/it]
  6%|▌         | 413/7142 [1:02:29<15:51:08,  8.48s/it]
  6%|▌         | 414/7142 [1:02:33<13:43:39,  7.35s/it]
  6%|▌         | 415/7142 [1:02:45<16:00:23,  8.57s/it]
  6%|▌         | 416/7142 [1:02:47<12:18:32,  6.59s/it]
  6%|▌         | 417/7142 [1:02:48<9:22:35,  5.02s/it]
  6%|▌         | 418/7142 [1:03:11<19:34:20, 10.48s/it]
  6%|▌         | 419/7142 [1:03:14<14:56:25,  8.00s/it]

KeyboardInterrupt: 
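
The run above was interrupted after about an hour at 6% of the files, and restarting the cell reopens `data/semCHILDES.txt` in write mode, which discards the lines already produced. One way to make such a long job resumable is sketched below: results are appended, and finished inputs are recorded in a side file. This is only an illustration under stated assumptions, not the notebook's method; `process_resumably`, `annotate`, and the `done_path` log are hypothetical names, and `annotate` stands in for the per-file disambiguation step.

```python
import os


def process_resumably(files, annotate, out_path, done_path):
    """Append annotate(path) lines to out_path, skipping paths listed in done_path."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path) as fh:
            done = {line.strip() for line in fh if line.strip()}

    with open(out_path, "a") as out, open(done_path, "a") as done_log:
        for path in files:
            if path in done:
                continue  # finished in an earlier run
            # Materialise all lines first, so an interruption during the slow
            # annotation step leaves no partial output for this file.
            lines = list(annotate(path))
            out.writelines(line + "\n" for line in lines)
            out.flush()
            done_log.write(path + "\n")
            done_log.flush()
```

Calling the function a second time with the same arguments processes only the files that are not yet listed in the log, so an interrupted run can continue where it stopped. An interruption between writing the output and updating the log can still duplicate one file's lines; the sketch does not guard against that.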

In [None]:
if not faster_processing:
    # Open the output only in this branch, so that running this cell in faster mode
    # does not truncate the file written by the previous cell.
    with open("data/semCHILDES.txt", 'w') as f:
        for sem_dict in tqdm(all_dicts):
            for u in sem_dict['utterances']:
                if 'wsd_doc' not in u:
                    continue
                new_sentence = []
                for token in u['wsd_doc']:
                    senses = token['senses']
                    if token['word'] in ['me', 'and', 'or', ',']:
                        new_sentence.append(token['word'])
                    elif token['lemma'] in ["can", "a", "to", "how", "what", "this", "that"]:
                        new_sentence.append(token['lemma'])
                    elif senses and senses[0]['id'] != 'U':  # a sense was assigned: use its id
                        new_sentence.append(senses[0]['id'])
                    elif token['pos'] in ['IN', 'PRP', '.', 'WRB', 'CC', 'PRP$', 'DT']:
                        new_sentence.append(token['lemma'])
                    elif token['pos'] in ['NNP']:
                        new_sentence.append('proper_noun')
                    elif token['pos'] in ['NN', 'NNS']:
                        # Fall back to the first WordNet noun synset and take the sense key
                        # of the lemma that matches exactly, if there is one.
                        n_token = None
                        synsets = wn.synsets(token['lemma'], 'n')
                        if synsets:
                            for l in synsets[0].lemmas():
                                if l.name() == token['lemma']:
                                    n_token = l.key()
                        # Otherwise keep the lemma; it may be a word common in children's vocabulary.
                        new_sentence.append(n_token if n_token is not None else token['lemma'])
                    # Tokens that match none of the rules above are dropped.
                if len(new_sentence) > 1:
                    f.write(" ".join(new_sentence) + "\n")
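
Both corpus-building cells apply the same chain of rules to decide which symbol represents a token. The sketch below restates that chain as a single function so it can be tried without the WSD model. It assumes token dictionaries with the keys `word`, `lemma`, `pos`, and `senses` used in the cells above; `token_symbol` and its `noun_sense_key` parameter are hypothetical names, the WordNet lookup is replaced by a pluggable callable, and the sense id in the demo data is an illustrative placeholder.

```python
KEEP_WORDS = {"me", "and", "or", ","}
KEEP_LEMMAS = {"can", "a", "to", "how", "what", "this", "that"}
FUNCTION_POS = {"IN", "PRP", ".", "WRB", "CC", "PRP$", "DT"}


def token_symbol(token, noun_sense_key=lambda lemma: None):
    """Return the output symbol for one token dict, or None to drop the token."""
    senses = token.get("senses") or []
    if token["word"] in KEEP_WORDS:
        return token["word"]
    if token["lemma"] in KEEP_LEMMAS:
        return token["lemma"]
    if senses and senses[0]["id"] != "U":
        return senses[0]["id"]  # the sense id assigned by the WSD model
    if token["pos"] in FUNCTION_POS:
        return token["lemma"]
    if token["pos"] == "NNP":
        return "proper_noun"
    if token["pos"] in ("NN", "NNS"):
        # noun_sense_key stands in for the WordNet first-synset lookup.
        return noun_sense_key(token["lemma"]) or token["lemma"]
    return None


# Hand-made tokens; the sense id is a placeholder, not real model output.
sentence = [
    {"word": "Adam", "lemma": "adam", "pos": "NNP", "senses": [{"id": "U"}]},
    {"word": "wants", "lemma": "want", "pos": "VBZ", "senses": [{"id": "want%2:37:00::"}]},
    {"word": "the", "lemma": "the", "pos": "DT", "senses": [{"id": "U"}]},
    {"word": "cookie", "lemma": "cookie", "pos": "NN", "senses": [{"id": "U"}]},
    {"word": "gonna", "lemma": "gonna", "pos": "VB", "senses": [{"id": "U"}]},
]
symbols = [s for s in (token_symbol(t) for t in sentence) if s is not None]
print(" ".join(symbols))
```

The last demo token shows the silent case: a token with no assigned sense whose part of speech is not covered by any rule produces no symbol at all, so the output line can be shorter than the utterance.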
