# SemCHILDES construction

As an initial effort to construct SemCHILDES, I use an automatic word sense disambiguation (WSD) tool to annotate the CHILDES corpus with word senses. As future work, I intend to manually annotate part of the corpus and then evaluate different algorithms, approaches, and combinations of them on the annotated data. The best approach will then be used to annotate the entire corpus.

Tool used: PySupWSDPocket - https://github.com/rodriguesfas/PySupWSDPocket




## PySupWSDPocket

PySupWSDPocket is a Python library for [SupWSD Pocket](https://supwsd.net/supwsd/pocket.jsp). SupWSD is a supervised model for Word Sense Disambiguation.

We install it from GitHub to get the latest version.

https://drive.google.com/file/d/1hEMlbToLL4xN7HJhPtebMbKYeethWmha/view?usp=sharing

In [1]:
!pip install git+https://github.com/rodriguesfas/PySupWSDPocket.git

Collecting git+https://github.com/rodriguesfas/PySupWSDPocket.git
  Cloning https://github.com/rodriguesfas/PySupWSDPocket.git to /tmp/pip-req-build-su9dhykg
  Running command git clone -q https://github.com/rodriguesfas/PySupWSDPocket.git /tmp/pip-req-build-su9dhykg
Building wheels for collected packages: pysupwsdpocket
  Building wheel for pysupwsdpocket (setup.py) ... done
  Created wheel for pysupwsdpocket: filename=pysupwsdpocket-0.0.9-py3-none-any.whl size=1443874 sha256=9eee8a32a44115e880f3d092615036066c101b618fb6ce27c15553f95bc60b92
  Stored in directory: /tmp/pip-ephem-wheel-cache-2levx73k/wheels/0d/ae/0a/10bc8b4cbcf4aeef156cf95eb8d9e02b33921220e8a8010e08
Successfully built pysupwsdpocket
Installing collected packages: pysupwsdpocket
Successfully installed pysupwsdpocket-0.0.9


PySupWSDPocket requires a ~2 GB model, available at https://supwsd.net/supwsd/downloads.jsp#supwsd_pocket. The cell below downloads a copy of the English model archive (`en.zip`) into `pysupwsdpocket_models`.

In [2]:
!mkdir pysupwsdpocket_models
!wget -P pysupwsdpocket_models http://jayr.clubedosgeeks.com.br/pictobert/en.zip

--2022-03-25 12:34:34--  http://jayr.clubedosgeeks.com.br/pictobert/en.zip
Resolving jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)... 192.185.214.132
Connecting to jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)|192.185.214.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1797035229 (1.7G) [application/zip]
Saving to: ‘pysupwsdpocket_models/en.zip’


2022-03-25 12:35:08 (49.8 MB/s) - ‘pysupwsdpocket_models/en.zip’ saved [1797035229/1797035229]



In [3]:
from pysupwsdpocket import PySupWSDPocket
nlp = PySupWSDPocket(lang='en', model='semcor_omsti', model_path="./pysupwsdpocket_models/")    

## CHILDES

The Child Language Data Exchange System (CHILDES) is a corpus established in 1984 by Brian MacWhinney and Catherine Snow to serve as a central repository for first language acquisition data[¹](https://en.wikipedia.org/wiki/CHILDES). It comprises corpora in many languages, which can be downloaded in XML or CHA format.

SemCHILDES is built from the entire North American English portion of CHILDES (the `Eng-NA` collection), so the cell below downloads every corpus archive in the list.

In [4]:
!mkdir corpora
corpora_files = ["Bates.zip", "Bernstein.zip", "Bliss.zip", "Bloom.zip", "Bohannon.zip", "Braunwald.zip", "Brent.zip", "Brown.zip", "Clark.zip", "Demetras1.zip", "Demetras2.zip", "Evans.zip", "Feldman.zip", "Garvey.zip", "Gathercole.zip", "Gelman.zip", "Gleason.zip", "Gopnik.zip", "HSLLD.zip", "Haggerty.zip", "Hall.zip", "Hicks.zip", "Higginson.zip", "Kuczaj.zip", "MacWhinney.zip", "McCune.zip", "McMillan.zip", "Morisset.zip", "Nelson.zip", "NewEngland.zip", "NewmanRatner.zip", "Peters.zip", "PetersonMcCabe.zip", "Post.zip", "Rollins.zip", "Sachs.zip", "Sawyer.zip", "Snow.zip", "Soderstrom.zip", "Sprott.zip", "Suppes.zip", "Tardif.zip", "Valian.zip", "VanHouten.zip", "VanKleeck.zip", "Warren.zip", "Weist.zip"]
for corpus_file in corpora_files:
    !wget https://childes.talkbank.org/data-xml/Eng-NA/$corpus_file -O corpora/$corpus_file

--2022-03-25 12:35:08--  https://childes.talkbank.org/data-xml/Eng-NA/Bates.zip
Resolving childes.talkbank.org (childes.talkbank.org)... 128.2.24.68
Connecting to childes.talkbank.org (childes.talkbank.org)|128.2.24.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 720606 (704K) [application/zip]
Saving to: ‘corpora/Bates.zip’


2022-03-25 12:35:09 (3.94 MB/s) - ‘corpora/Bates.zip’ saved [720606/720606]

--2022-03-25 12:35:09--  https://childes.talkbank.org/data-xml/Eng-NA/Bernstein.zip
Resolving childes.talkbank.org (childes.talkbank.org)... 128.2.24.68
Connecting to childes.talkbank.org (childes.talkbank.org)|128.2.24.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 939284 (917K) [application/zip]
Saving to: ‘corpora/Bernstein.zip’


2022-03-25 12:35:09 (4.29 MB/s) - ‘corpora/Bernstein.zip’ saved [939284/939284]

--2022-03-25 12:35:09--  https://childes.talkbank.org/data-xml/Eng-NA/Bliss.zip
Resolving childes.talkbank.org (childe

In [5]:
for corpus_file in corpora_files:
    !unzip corpora/$corpus_file -d corpora

Streaming output truncated to the last 5000 lines.
  inflating: corpora/HSLLD/HV1/BR/seaact1.xml  
  inflating: corpora/HSLLD/HV1/BR/nicact1.xml  
  inflating: corpora/HSLLD/HV1/BR/brabr1.xml  
  inflating: corpora/HSLLD/HV1/BR/alibr1.xml  
  inflating: corpora/HSLLD/HV1/BR/davbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/diabr1.xml  
  inflating: corpora/HSLLD/HV1/BR/sopbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/zenact1.xml  
  inflating: corpora/HSLLD/HV1/BR/jasbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/gilbr1a.xml  
  inflating: corpora/HSLLD/HV1/BR/rasbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/jacbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/allbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/shabr1.xml  
  inflating: corpora/HSLLD/HV1/BR/vicbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/cltbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/emibr1.xml  
  inflating: corpora/HSLLD/HV1/BR/mhwbr1.xml  
  inflating: corpora/HSLLD/HV1/BR/jusbr1.xml  
  inflating: corpora/HSLLD/HV1

### Extract data from CHILDES

Data is extracted by parsing the CHILDES XML files.

In [None]:
import os
from glob import glob
PATH = "./corpora"
all_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.xml'))]
print(len(all_files))
print(all_files[0])

7719
./corpora/Suppes/030107.xml


#### Find participants

This information is useful for later queries, for example retrieving utterances by the child's age.

In [None]:
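# XML namespace that prefixes every tag in the TalkBank files.
ns = "{http://www.talkbank.org/ns/talkbank}"

def find_participants(root):
    """Return the attribute dicts of all participant elements in the file."""
    participants = []
    for participant in root.find(ns + "Participants"):
        participants.append(dict(participant.attrib))
    return participants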
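
As a sketch of the kind of query this metadata enables, the snippet below filters participant records by age. It assumes each participant dict carries a `role` attribute and an `age` attribute written as an ISO 8601 duration (for example `P2Y3M10D`); both are assumptions about the XML export rather than something shown in this notebook. The helpers `age_in_months` and `children_younger_than` and the sample records are hypothetical.

```python
import re

# Assumed age format: ISO 8601 duration such as "P2Y3M10D" (years, months, days).
AGE_PATTERN = re.compile(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")

def age_in_months(age):
    """Convert an age string like 'P2Y3M10D' to whole months; None if unparseable."""
    match = AGE_PATTERN.match(age or "")
    if match is None or not any(match.groups()):
        return None
    years, months, _days = (int(g) if g else 0 for g in match.groups())
    return years * 12 + months

def children_younger_than(participants, max_months, role="Target_Child"):
    """Return participants with the given role whose age is below max_months."""
    selected = []
    for p in participants:
        months = age_in_months(p.get("age"))
        if p.get("role") == role and months is not None and months < max_months:
            selected.append(p)
    return selected

# Hypothetical records shaped like the output of find_participants().
sample = [
    {"id": "CHI", "role": "Target_Child", "age": "P2Y3M10D"},
    {"id": "MOT", "role": "Mother"},
    {"id": "CHI", "role": "Target_Child", "age": "P4Y0M"},
]
young = children_younger_than(sample, max_months=36)
```

Days are ignored in the conversion, which is usually precise enough for grouping utterances into age bands.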

#### Parse utterances

In [None]:
def parse_utterance(u):
    """Run WSD on the utterance text and return one dict per token."""
    wsd_doc = []
    if 'text' in u:  # some utterances contain only researchers' comments or actions, e.g. (he screamed)
        doc = nlp.wsd(u['text'])
        for token in doc.tokens():
            wsd_doc.append(token.__dict__)
    return wsd_doc

#### Process utterances

In [None]:
def word_text(w):
    """Return the text of a word element, or None if it has none."""
    if w.find(ns + "shortening") is not None:
        # A shortened word splits its text across child nodes, so use the
        # stem from the morphological annotation instead.
        stem = w.find(ns + "mor/" + ns + "mw/" + ns + "stem")
        return stem.text if stem is not None else None
    return w.text

def process_utterances(root, process_faster=False):
    """Collect every utterance as a dict of its attributes plus the rebuilt text."""
    utterances = []
    for u in root.findall(ns + 'u'):
        utterance_dict = dict(u.attrib)  # copy, so the XML tree is left untouched
        utterance_dict['original_tokens'] = []
        tokens = []
        for child in u:  # iterate directly: Element.getchildren() was removed in Python 3.9
            text = None
            if child.tag == ns + "w":
                text = word_text(child)
            elif child.tag == ns + "g":  # group of words: only its first word is used
                w = child.find(ns + 'w')
                if w is not None:
                    text = word_text(w)
            elif child.tag == ns + "t":  # terminator punctuation
                text = {'p': ".", 'q': "?"}.get(child.attrib.get('type'))
            elif child.tag == ns + "tagMarker":  # comma
                text = ','
            if text is not None:
                tokens.append(text)
        if len(tokens) > 1:
            utterance_dict['text'] = " ".join(tokens)
        if not process_faster:
            utterance_dict['wsd_doc'] = parse_utterance(utterance_dict)
        utterances.append(utterance_dict)

    return utterances
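
To make the token rules concrete, here is a self-contained sketch that applies the same tag-to-token mapping to a hand-written utterance. The XML fragment is a simplified, hypothetical stand-in for a real CHILDES file, and `utterance_text` is an illustrative helper that leaves out the shortening and word-group handling.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.talkbank.org/ns/talkbank}"

# Hypothetical, heavily simplified utterance; real CHILDES files carry more markup.
SAMPLE = """
<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="MOT" uID="u0">
    <w>do</w><w>you</w><w>want</w><w>more</w><t type="q"/>
  </u>
</CHAT>
""".strip()

def utterance_text(u):
    """Rebuild the plain text of one utterance element from its direct children."""
    tokens = []
    for child in u:
        if child.tag == NS + "w" and child.text is not None:
            tokens.append(child.text)
        elif child.tag == NS + "t":  # terminator: 'p' is a period, 'q' a question mark
            mark = {"p": ".", "q": "?"}.get(child.get("type"))
            if mark:
                tokens.append(mark)
        elif child.tag == NS + "tagMarker":  # comma
            tokens.append(",")
    return " ".join(tokens)

root = ET.fromstring(SAMPLE)
texts = [utterance_text(u) for u in root.findall(NS + "u")]
print(texts)  # ['do you want more ?']
```

Punctuation is kept as a separate space-delimited token, which matches how the text is later written out one utterance per line.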

In [None]:
!mkdir dicts

In [None]:
import os
import json
import warnings
import xml.etree.ElementTree as ET
from tqdm import tqdm

warnings.filterwarnings('ignore')
all_dicts = []

ns = "{http://www.talkbank.org/ns/talkbank}"
process_n_files = len(all_files)  # set to a smaller number to process only a subset

# For faster processing, only the plain text is written here and PySupWSDPocket's
# parse_corpus method is run on it later. The drawback is that the sentence
# metadata (e.g., child age) is not available with that method.
faster_processing = True
if faster_processing:
    os.makedirs("only_texts", exist_ok=True)

for xml_file in tqdm(all_files[:process_n_files]):
    base_name = "_".join(xml_file.split('/')[-2:])
    txt_file = "only_texts/{0}.txt".format(base_name)
    if faster_processing and os.path.exists(txt_file):
        continue  # already processed in a previous run
    root = ET.parse(xml_file).getroot()
    sem_dict = dict(root.attrib)
    sem_dict['file'] = xml_file  # record the source file path
    sem_dict['participants'] = find_participants(root)
    sem_dict['utterances'] = process_utterances(root, faster_processing)

    if faster_processing:
        with open(txt_file, 'w') as ft:
            ft.writelines(u['text'] + "\n" for u in sem_dict['utterances'] if 'text' in u)

    all_dicts.append(sem_dict)
    with open("dicts/{0}.json".format(base_name), 'w') as fj:
        json.dump(sem_dict, fj)


100%|██████████| 7719/7719 [00:49<00:00, 154.78it/s]
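
The output name in the loop above is built from only the last two path components, so two files that share a parent-directory name and a file name would map to the same `.txt` and `.json` file. Whether that happens in this dataset is not verified here. A minimal sketch of a check, plus one possible collision-free naming scheme, follows; `short_name`, `unique_name`, `find_collisions`, and the sample paths are illustrative assumptions, not part of the notebook.

```python
import os
from collections import Counter

def short_name(xml_file):
    """Name used in the loop above: last directory plus file name."""
    return "_".join(xml_file.split('/')[-2:])

def unique_name(xml_file, root="./corpora"):
    """Alternative: join every path component below `root` with underscores."""
    relative = os.path.relpath(xml_file, root)
    return "_".join(relative.split(os.sep))

def find_collisions(paths):
    """Return the short names that more than one path maps to."""
    counts = Counter(short_name(p) for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical paths chosen to collide under the short naming scheme.
paths = [
    "./corpora/CorpusA/session1/child01.xml",
    "./corpora/CorpusB/session1/child01.xml",
    "./corpora/CorpusA/session2/child01.xml",
]
collisions = find_collisions(paths)
names = [unique_name(p) for p in paths]
```

Running `find_collisions(all_files)` before the main loop would show whether any transcripts are at risk of being skipped or overwritten.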


#### Create corpus for BERT input

In [None]:
!pip install nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/43/0b/8298798bc5a9a007b7cae3f846a3d9a325953e0f9c238affa478b4d59324/nltk-3.7-py3-none-any.whl (1.5MB)
Collecting click (from nltk)
  Downloading https://files.pythonhosted.org/packages/4a/a8/0b2ced25639fb20cc1c9784de90a8c25f9504a7f18cd8b5397bd61696d7d/click-8.0.4-py3-none-any.whl (97kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading https://files.pythonhosted.org/packages/59/ec/091ea11974453cff690837ae97a8fa5e433e9e47ed596ee9cf4c889a9079/regex-2022.3.15-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (670kB)
Collecting joblib (from nltk)
  Downloading https://files.pythonhosted.org/packages/3e/d5/0163eb0cfa0b673aa4fe1cd3ea9

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


In [None]:
import os
from glob import glob

os.makedirs("data", exist_ok=True)  # the output directory may not exist yet
if faster_processing:
    PATH = "only_texts/"
    only_texts_files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

    with open("data/semCHILDES.txt", 'w') as f:
        for text_file in tqdm(only_texts_files):
            corpus = nlp.parse_corpus(text_file)
            for doc in corpus:
                new_sentence = []
                for t in doc.tokens():
                    token = t.__dict__
                    if token['word'] in ['me', 'and', 'or', ',']:
                        new_sentence.append(token['word'])  # keep these words as written
                    elif token['lemma'] in ["can", "a", "to", "how", "what", 'this', "that"]:
                        new_sentence.append(token['lemma'])
                    elif token['senses'][0]['id'] != 'U':
                        new_sentence.append(token['senses'][0]['id'])  # disambiguated sense
                    elif token['pos'] in ['IN', 'PRP', '.', 'WRB', 'CC', "PRP$", "DT"]:
                        new_sentence.append(token['lemma'])  # function words and punctuation
                    elif token['pos'] in ['NNP']:
                        new_sentence.append('proper_noun')
                    elif token['pos'] in ['NN', "NNS"]:
                        # Fall back to the sense key of the first WordNet noun synset.
                        n_token = None
                        synsets = wn.synsets(token['lemma'], 'n')
                        if len(synsets) > 0:
                            for lemma in synsets[0].lemmas():
                                if lemma.name() == token['lemma']:
                                    n_token = lemma.key()
                        if n_token is not None:
                            new_sentence.append(n_token)
                        else:
                            # Not found in WordNet; it may be a word common in children's vocabulary.
                            new_sentence.append(token['lemma'])
                if len(new_sentence) > 1:
                    f.write(" ".join(new_sentence) + "\n")




  0%|          | 0/7142 [00:00<?, ?it/s]
  0%|          | 1/7142 [00:12<24:09:30, 12.18s/it]
  ...
  2%|▏         | 136/7142 [22:34<26:49:28, 13.78s/it]
  ...
  4%|▍         | 270/7142 [41:39<14:28:28,  7.58s/it]
  ...
  6%|▌         | 419/7142 [1:03:14<14:56:25,  8.00s/it]

KeyboardInterrupt: 

In [None]:
import os

# Open the output file only in this mode, so that running the cell with
# faster_processing=True does not overwrite the file written by the previous cell.
if not faster_processing:
    os.makedirs("data", exist_ok=True)
    with open("data/semCHILDES.txt", 'w') as f:
        for sem_dict in all_dicts:
            for u in tqdm(sem_dict['utterances']):
                if 'wsd_doc' not in u:
                    continue
                new_sentence = []
                for token in u['wsd_doc']:
                    if token['word'] in ['me', 'and', 'or', ',']:
                        new_sentence.append(token['word'])  # keep these words as written
                    elif token['lemma'] in ["can", "a", "to", "how", "what", 'this', "that"]:
                        new_sentence.append(token['lemma'])
                    elif token['senses'][0]['id'] != 'U':
                        new_sentence.append(token['senses'][0]['id'])  # disambiguated sense
                    elif token['pos'] in ['IN', 'PRP', '.', 'WRB', 'CC', "PRP$", "DT"]:
                        new_sentence.append(token['lemma'])  # function words and punctuation
                    elif token['pos'] in ['NNP']:
                        new_sentence.append('proper_noun')
                    elif token['pos'] in ['NN', "NNS"]:
                        # Fall back to the sense key of the first WordNet noun synset.
                        n_token = None
                        synsets = wn.synsets(token['lemma'], 'n')
                        if len(synsets) > 0:
                            for lemma in synsets[0].lemmas():
                                if lemma.name() == token['lemma']:
                                    n_token = lemma.key()
                        if n_token is not None:
                            new_sentence.append(n_token)
                        else:
                            # Not found in WordNet; it may be a word common in children's vocabulary.
                            new_sentence.append(token['lemma'])
                if len(new_sentence) > 1:
                    f.write(" ".join(new_sentence) + "\n")
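
The per-token rules used in both cells above can be read as a single decision cascade. The sketch below restates them as a standalone function over token dicts with the `word`, `lemma`, `pos`, and `senses` keys used above. The WordNet lookup is passed in as a callable so the example runs without NLTK. The names `token_symbol` and `noun_key` and the sense identifiers in the sample tokens are hypothetical placeholders, and reading `'U'` as "no sense assigned" is an assumption based on how the code treats it.

```python
KEEP_WORDS = {'me', 'and', 'or', ','}
KEEP_LEMMAS = {'can', 'a', 'to', 'how', 'what', 'this', 'that'}
FUNCTION_POS = {'IN', 'PRP', '.', 'WRB', 'CC', 'PRP$', 'DT'}

def token_symbol(token, noun_key=lambda lemma: None):
    """Map one token dict to its output symbol, or None if the token is dropped."""
    if token['word'] in KEEP_WORDS:
        return token['word']
    if token['lemma'] in KEEP_LEMMAS:
        return token['lemma']
    sense_id = token['senses'][0]['id']
    if sense_id != 'U':  # assumed: 'U' means the tagger assigned no sense
        return sense_id
    if token['pos'] in FUNCTION_POS:
        return token['lemma']
    if token['pos'] == 'NNP':
        return 'proper_noun'
    if token['pos'] in ('NN', 'NNS'):
        # noun_key stands in for the WordNet first-synset lookup.
        return noun_key(token['lemma']) or token['lemma']
    return None  # no rule matched, so the token is left out of the sentence

# Hypothetical tokens; the sense id is a placeholder, not real tagger output.
tokens = [
    {'word': 'the', 'lemma': 'the', 'pos': 'DT', 'senses': [{'id': 'U'}]},
    {'word': 'dog', 'lemma': 'dog', 'pos': 'NN', 'senses': [{'id': 'sense-id-for-dog'}]},
    {'word': 'Anna', 'lemma': 'Anna', 'pos': 'NNP', 'senses': [{'id': 'U'}]},
    {'word': 'ran', 'lemma': 'run', 'pos': 'VBD', 'senses': [{'id': 'U'}]},
]
symbols = [s for s in (token_symbol(t) for t in tokens) if s is not None]
```

One consequence worth noting is that a token matching none of the rules, such as a verb without an assigned sense, is silently omitted from the output sentence.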
