# SemCHILDES construction

As a first step toward constructing SemCHILDES, I use an automatic word sense disambiguation (WSD) tool to annotate the CHILDES corpus with word senses. As future work, I intend to manually annotate part of the corpus and then evaluate different algorithms, approaches, and combinations of them on the annotated data. The best approach will be used to annotate the entire corpus.

Tool used: PySupWSDPocket - https://github.com/rodriguesfas/PySupWSDPocket




## PySupWSDPocket

PySupWSDPocket is a Python library for [SupWSD Pocket](https://supwsd.net/supwsd/pocket.jsp). SupWSD is a supervised model for word sense disambiguation.

We install it from GitHub to get the latest version.


In [None]:
!pip install git+https://github.com/rodriguesfas/PySupWSDPocket.git

Collecting git+https://github.com/rodriguesfas/PySupWSDPocket.git
  Cloning https://github.com/rodriguesfas/PySupWSDPocket.git to /tmp/pip-req-build-wt33a1rh
  Running command git clone -q https://github.com/rodriguesfas/PySupWSDPocket.git /tmp/pip-req-build-wt33a1rh
Building wheels for collected packages: pysupwsdpocket
  Building wheel for pysupwsdpocket (setup.py) ... done
  Created wheel for pysupwsdpocket: filename=pysupwsdpocket-0.0.9-cp37-none-any.whl size=1443874 sha256=5de6034d64343a6dd7638f5df8a2bcd77cd9c7eec29e1325666e183e632ef102
  Stored in directory: /tmp/pip-ephem-wheel-cache-of40il45/wheels/60/71/8d/80f8c9ddf9fd2b65d10328afb6d580cfd83e4fbbc690cfb4dc
Successfully built pysupwsdpocket
Installing collected packages: pysupwsdpocket
Successfully installed pysupwsdpocket-0.0.9


PySupWSDPocket requires downloading its ~2 GB model, available at https://supwsd.net/supwsd/downloads.jsp#supwsd_pocket.

In [None]:
!mkdir -p pysupwsdpocket_models
!gdown  https://drive.google.com/uc?id=1hEMlbToLL4xN7HJhPtebMbKYeethWmha  -O="/content/pysupwsdpocket_models/en.zip"

Downloading...
From: https://drive.google.com/uc?id=1hEMlbToLL4xN7HJhPtebMbKYeethWmha
To: /content/pysupwsdpocket_models/en.zip
1.80GB [00:20, 89.2MB/s]


In [None]:
from pysupwsdpocket import PySupWSDPocket
nlp = PySupWSDPocket(lang='en', model='semcor_omsti', model_path="./pysupwsdpocket_models/")    

## CHILDES

The Child Language Data Exchange System (CHILDES) is a corpus established in 1984 by Brian MacWhinney and Catherine Snow to serve as a central repository for first-language acquisition data[¹](https://en.wikipedia.org/wiki/CHILDES). It comprises many corpora in a range of languages, each of which can be downloaded in XML or CHA format.

In this notebook we download only one corpus, but SemCHILDES is composed of the entire American English CHILDES.

In [None]:
!mkdir corpora
!wget https://childes.talkbank.org/data-xml/Eng-NA/MacWhinney.zip -O /content/corpora/MacWhinney.zip

--2021-04-09 16:28:46--  https://childes.talkbank.org/data-xml/Eng-NA/MacWhinney.zip
Resolving childes.talkbank.org (childes.talkbank.org)... 128.2.24.68
Connecting to childes.talkbank.org (childes.talkbank.org)|128.2.24.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8844400 (8.4M) [application/zip]
Saving to: ‘/content/corpora/MacWhinney.zip’


2021-04-09 16:28:47 (18.6 MB/s) - ‘/content/corpora/MacWhinney.zip’ saved [8844400/8844400]



In [None]:
!unzip /content/corpora/MacWhinney.zip -d corpora

Archive:  /content/corpora/MacWhinney.zip
  inflating: corpora/MacWhinney/010123d.xml  
  inflating: corpora/MacWhinney/050820a.xml  
  inflating: corpora/MacWhinney/010204b.xml  
  inflating: corpora/MacWhinney/000818a.xml  
  inflating: corpora/MacWhinney/021109.xml  
  inflating: corpora/MacWhinney/030001b.xml  
  inflating: corpora/MacWhinney/041125b.xml  
  inflating: corpora/MacWhinney/030105c.xml  
  inflating: corpora/MacWhinney/040315.xml  
  inflating: corpora/MacWhinney/060027c.xml  
  inflating: corpora/MacWhinney/051016d.xml  
  inflating: corpora/MacWhinney/020627.xml  
  inflating: corpora/MacWhinney/010114d.xml  
  inflating: corpora/MacWhinney/040920c.xml  
  inflating: corpora/MacWhinney/030818.xml  
  inflating: corpora/MacWhinney/040920b.xml  
  inflating: corpora/MacWhinney/050420a.xml  
  inflating: corpora/MacWhinney/060530c.xml  
  inflating: corpora/MacWhinney/030105d.xml  
  inflating: corpora/MacWhinney/040508d.xml  
  inflating: corpora/MacWhinney/060530b.xm

### Extract data from CHILDES

The data is extracted by parsing the CHILDES XML files.

In [None]:
import xml.etree.ElementTree as ET

xml_path = "/content/corpora/MacWhinney/030018a.xml"
tree = ET.parse(xml_path)
root = tree.getroot()
ns = "{http://www.talkbank.org/ns/talkbank}"  # namespace prefix of every TalkBank tag
sem_dict = dict(root.attrib)  # copy, so the parsed tree itself is not modified
sem_dict['file'] = xml_path
sem_dict['participants'] = []
sem_dict['utterances'] = []

root.tag

'{http://www.talkbank.org/ns/talkbank}CHAT'
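
ElementTree expands namespaced tags to the `{uri}tag` form shown in `root.tag`, which is why every lookup in this notebook is prefixed with `ns`. A minimal sketch of the same kind of lookup on a made-up document, next to an equivalent lookup that passes a prefix map instead (the inline XML and the `tb` prefix are illustrative assumptions, not taken from the corpus):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a CHILDES file; the content is invented for illustration.
SAMPLE = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank" Corpus="Demo">
  <Participants>
    <participant id="CHI" role="Target_Child" age="P3Y00M18D"/>
    <participant id="MOT" role="Mother"/>
  </Participants>
  <u who="CHI" uID="u0"><w>hello</w><t type="p"/></u>
</CHAT>"""

demo_root = ET.fromstring(SAMPLE)

# Style used in this notebook: prepend the "{uri}" prefix to every tag name.
ns = "{http://www.talkbank.org/ns/talkbank}"
ids_prefixed = [p.attrib["id"] for p in demo_root.find(ns + "Participants")]

# Equivalent lookup with a prefix map passed to findall.
nsmap = {"tb": "http://www.talkbank.org/ns/talkbank"}
ids_mapped = [p.get("id")
              for p in demo_root.findall("tb:Participants/tb:participant", nsmap)]

print(demo_root.tag)
print(ids_prefixed, ids_mapped)
```

Both styles return the same elements; the prefix map is mostly a readability choice when paths get longer.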

#### Find participants

Participant metadata is needed to filter the data later, for example to select utterances by the child's age.

In [None]:
for participant in root.find(ns+"Participants"):
  sem_dict['participants'].append(participant.attrib)
sem_dict

{'ActivityType': 'toyplay',
 'Corpus': 'MacWhinney',
 'Date': '1981-01-12',
 'DesignType': 'long',
 'GroupType': 'TD',
 'Lang': 'eng',
 'Media': '030018a',
 'Mediatypes': 'audio',
 'PID': '11312/c-00016502-1',
 'Version': '2.16.0',
 'file': '/content/corpora/MacWhinney/030018a.xml',
 'participants': [{'age': 'P3Y00M18D',
   'group': 'TD',
   'id': 'CHI',
   'language': 'eng',
   'name': 'Ross',
   'role': 'Target_Child',
   'sex': 'male'},
  {'age': 'P1Y01M23D',
   'id': 'MAR',
   'language': 'eng',
   'name': 'Mark',
   'role': 'Target_Child'},
  {'id': 'MOT',
   'language': 'eng',
   'name': 'Mary',
   'role': 'Mother',
   'sex': 'female'},
  {'id': 'FAT',
   'language': 'eng',
   'name': 'Brian',
   'role': 'Father',
   'sex': 'male'}],
 'utterances': [],
 '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd'}
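
The `age` values above, such as `P3Y00M18D`, are written as year/month/day durations. As a rough sketch of how age-based filtering could work, the hypothetical helper `age_in_months` below converts such a string to months; counting a day as 1/30 of a month is an approximation chosen here for illustration, and the participant list is a trimmed copy of the output above.

```python
import re

# Matches durations such as "P3Y00M18D"; each component is optional.
AGE_PATTERN = re.compile(r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")

def age_in_months(age):
  """Convert an age such as 'P3Y00M18D' to months (a day counts as 1/30 month)."""
  match = AGE_PATTERN.match(age)
  if match is None:
    raise ValueError(f"unrecognised age: {age!r}")
  years, months, days = (int(g) if g else 0 for g in match.groups())
  return years * 12 + months + days / 30

participants = [
  {"id": "CHI", "age": "P3Y00M18D"},
  {"id": "MAR", "age": "P1Y01M23D"},
  {"id": "MOT"},  # the parents in this file carry no age attribute
]

# Keep only participants with a known age of at least 24 months.
older = [p["id"] for p in participants
         if "age" in p and age_in_months(p["age"]) >= 24]
print(older)
```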

#### Process utterances

In [None]:
def word_text(w):
  """Return the text of a <w> element, or None if it has none."""
  if w.find(ns+"shortening") is not None:
    # Words with a <shortening> child: use the stem from the morphology tier.
    return w.find(ns+'mor').find(ns+"mw").find(ns+"stem").text
  return w.text

for u in root.findall(ns+'u'):
  utterance_dict = dict(u.attrib)  # copy, so the parsed tree is not modified
  utterance_dict['original_tokens'] = []
  tokens = []
  for token in u:  # iterate directly; Element.getchildren() was removed in Python 3.9
    if token.tag == ns+"g":  # group of words: use its first <w> child
      token = token.find(ns+'w')
      if token is None:
        continue  # group without a word
    if token.tag == ns+"w":
      text = word_text(token)
      if text is not None:
        tokens.append(text)
      else:
        print(token)  # word with no recoverable text
    elif token.tag == ns+"t":  # utterance terminator (punctuation)
      if token.attrib['type'] == 'p':
        tokens.append(".")
      elif token.attrib['type'] == 'q':
        tokens.append("?")
    elif token.tag == ns+"tagMarker":  # comma
      tokens.append(',')
  # Utterances with at most one token get no 'text' and are skipped by the WSD step.
  if len(tokens) > 1:
    utterance_dict['text'] = " ".join(tokens)
  sem_dict['utterances'].append(utterance_dict)



<Element '{http://www.talkbank.org/ns/talkbank}w' at 0x7f48684a6dd0>
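
The element printed above is a `<w>` whose `text` is `None`. One way this can happen in ElementTree is when a word begins with a child element: the characters after that child are stored in the child's `tail`, not in the parent's `text`. A minimal sketch on an invented, namespace-free `<w>` element (the sample markup is an assumption about how a shortening is laid out, not copied from the corpus):

```python
import xml.etree.ElementTree as ET

# Invented example: the speaker said "cause", transcribed as "(be)cause".
w = ET.fromstring("<w><shortening>be</shortening>cause</w>")
shortening = w.find("shortening")

print(w.text)           # nothing precedes the first child, so this is None
print(shortening.text)  # the omitted part
print(shortening.tail)  # the characters that follow the child element

# Full form: every piece of text in document order.
full_form = "".join(w.itertext())

# Spoken form: the word's own text plus the tails of its children.
spoken_form = (w.text or "") + "".join(child.tail or "" for child in w)

print(full_form, spoken_form)
```

In real files a `<w>` may also carry a `<mor>` child whose nested text `itertext()` would pick up, so joining all text is not a drop-in replacement for reading the stem as the cell above does.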


#### Parse utterances

In [None]:
for u in sem_dict['utterances']:
  if 'text' in u:  # some CHILDES utterances contain only researcher comments or actions, e.g. "(he screamed)"
    doc = nlp.wsd(u['text'])
    u['wsd_doc'] = []
    for token in doc.tokens():
      u['wsd_doc'].append(token.__dict__)

I got an owie from Titus like Marky got an owie from Titus .
why does that cat do that ?
because she because her do .
because her do ?
why does she do it Ross ?
because her do .
is she mean ?
yes .
is she mean ?
I don't like that cat .
why does she do it ?
cause she do .
let's give her a spanking .
okay .
no no no .
no let's not .
let's try to talk to her okay .
why ?
I'll try to talk to her .
owies hurt .
I don't like owies .
he must have learned that at preschool because we don't use owies we don't talk about owies .
this is excellent Ross .
what else do you wanna eat ?
I wanna eat some more
he wants some coq_au_vin juice .
no .
can I have some chicken off of that ?
uh hm .
I'll serve you .
can I have some more chicken ?
more chicken ?
yeah .
Ross gets a nice piece here mom .
uhuh .
he does .
and momma gets that nice piece .
and this big piece back here .
thank_you .
very good chicken .
mhm .
thank_you .
you're welcome Mother .
Mary said it was .
no the bird says
coo coo .
what is th

KeyboardInterrupt: ignored

In [None]:
import json
with open("sem_childes.json", 'w') as f:  # the with block closes (and flushes) the file
  n_chars = f.write(json.dumps(sem_dict))
n_chars

198064
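
A short sketch of how the saved file might be read back and filtered by speaker. The miniature dictionary is invented, it is written to a temporary file rather than `sem_childes.json`, and the `who` key is an assumption about which attributes each `<u>` element contributes.

```python
import json
import os
import tempfile

# Invented miniature of the structure built above; 'who' is assumed to be
# one of the attributes copied from each <u> element.
sample = {
  "Corpus": "Demo",
  "utterances": [
    {"who": "CHI", "text": "I don't like owies ."},
    {"who": "MOT", "text": "why ?"},
    {"who": "CHI"},  # no 'text' key: nothing usable was extracted
  ],
}

path = os.path.join(tempfile.mkdtemp(), "sem_childes_demo.json")
with open(path, "w") as f:
  json.dump(sample, f)

with open(path) as f:
  loaded = json.load(f)

# Select the child's utterances that have text.
child_texts = [u["text"] for u in loaded["utterances"]
               if u.get("who") == "CHI" and "text" in u]
print(child_texts)
```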

#### Create corpus for BERT input

In [None]:
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
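
The training corpus built in this section replaces words with WordNet sense keys such as `get%2:40:00::`. A sense key has the form `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, where `ss_type` encodes the part of speech. The sketch below splits a key into those fields; `parse_sense_key` and `SenseKey` are hypothetical helpers written for illustration and are not part of NLTK or PySupWSDPocket.

```python
from typing import NamedTuple

# ss_type codes used in WordNet sense keys.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}

class SenseKey(NamedTuple):
  lemma: str
  pos: str
  lex_filenum: int
  lex_id: int
  head_word: str  # only filled for adjective satellites
  head_id: str

def parse_sense_key(key):
  """Split a WordNet sense key into its named fields."""
  lemma, lex_sense = key.split("%", 1)
  ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
  return SenseKey(lemma, SS_TYPES[int(ss_type)], int(lex_filenum),
                  int(lex_id), head_word, head_id)

for key in ["get%2:40:00::", "cat%1:05:00::"]:
  print(parse_sense_key(key))
```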


In [None]:
# Write one line per utterance, replacing words with WordNet sense keys where available.
with open("for_training.txt", 'w') as f:
  for u in sem_dict['utterances']:
    if 'wsd_doc' in u:
      print(u['text'])
      new_sentence = []
      for token in u['wsd_doc']:
        if token['word'] in ['me', 'and', 'or', ',']:
          new_sentence.append(token['word'])  # keep these words verbatim
        elif token['lemma'] in ["can", "a", "to", "how", "what", "this", "that"]:
          new_sentence.append(token['lemma'])  # keep these words as lemmas
        elif token['senses'][0]['id'] != 'U':
          new_sentence.append(token['senses'][0]['id'])  # use the WSD sense when its id is not 'U'
        elif token['pos'] in ['IN', 'PRP', '.', 'WRB', 'CC', 'PRP$', 'DT']:
          new_sentence.append(token['lemma'])
        elif token['pos'] in ['NNP']:
          new_sentence.append('proper_noun')
        elif token['pos'] in ['NN', 'NNS']:
          # No sense from the model: fall back to the first WordNet noun synset.
          n_token = None
          synsets = wn.synsets(token['lemma'], 'n')
          if len(synsets) > 0:
            for l in synsets[0].lemmas():
              if l.name() == token['lemma']:
                n_token = l.key()
          if n_token is not None:
            new_sentence.append(n_token)
          else:
            new_sentence.append(token['lemma'])  # may be a word common in children's vocabulary
      print(new_sentence)
      f.write(" ".join(new_sentence) + "\n")
      print()

I got an owie from Titus like Marky got an owie from Titus .
['i', 'get%2:40:00::', 'a', '[NOSENSE]', 'from', 'proper_noun', 'like', 'proper_noun', 'get%2:40:00::', 'a', '[NOSENSE]', 'from', 'proper_noun', '.']

why does that cat do that ?
['why', 'do%2:41:01::', 'that', 'cat%1:05:00::', 'do%2:41:01::', 'that', '?']

because she because her do .
['because', 'she', 'because', 'she', 'do%2:41:01::', '.']

because her do ?
['because', 'she', 'do%2:41:01::', '?']

why does she do it Ross ?
['why', 'do%2:41:01::', 'she', 'do%2:41:01::', 'it', 'proper_noun', '?']

because her do .
['because', 'she', 'do%2:41:01::', '.']

is she mean ?
['be%2:42:03::', 'she', 'mean%2:32:01::', '?']

yes .
['yes%1:10:00::', '.']

is she mean ?
['be%2:42:03::', 'she', 'mean%2:32:01::', '?']

I don't like that cat .
['i', 'do%2:41:01::', '[NOSENSE]', '[NOSENSE]', 'like', 'that', 'cat%1:05:00::', '.']

why does she do it ?
['why', 'do%2:41:01::', 'she', 'do%2:41:01::', 'it', '?']

cause she do .
['cause%1:10:00::