
In [None]:
""" Main Topics

1. How do we design a new language resource and ensure that its coverage, balance, and documentation support a wide range of uses?
2. When existing data is in the wrong format for some analysis tool, how can we convert it to a suitable format?
3. What is a good way to document the existence of a resource we have created so that others can easily find it?

"""

## 1. Corpus Structure: a Case Study

### The Structure of TIMIT

In [1]:
import nltk

In [2]:
help(nltk.corpus.timit)

Help on LazyCorpusLoader in module nltk.corpus.util object:

timit = class LazyCorpusLoader(builtins.object)
 |  timit(name, reader_cls, *args, **kwargs)
 |  
 |  To see the API documentation for this lazily loaded corpus, first
 |  run corpus.ensure_loaded(), and then run help(this_corpus).
 |  
 |  LazyCorpusLoader is a proxy object which is used to stand in for a
 |  corpus object before the corpus is loaded.  This allows NLTK to
 |  create an object for each corpus, but defer the costs associated
 |  with loading those corpora until the first time that they're
 |  actually accessed.
 |  
 |  The first time this object is accessed in any way, it will load
 |  the corresponding corpus, and transform itself into that corpus
 |  (by modifying its own ``__class__`` and ``__dict__`` attributes).
 |  
 |  If the corpus can not be found, then accessing this object will
 |  raise an exception, displaying installation instructions for the
 |  NLTK data package.  Once they've properly install

In [4]:
nltk.download('timit')

[nltk_data] Downloading package timit to /root/nltk_data...
[nltk_data]   Unzipping corpora/timit.zip.


True

In [5]:
nltk.corpus.timit.fileids()

['dr1-fvmh0/sa1.phn',
 'dr1-fvmh0/sa1.txt',
 'dr1-fvmh0/sa1.wav',
 'dr1-fvmh0/sa1.wrd',
 'dr1-fvmh0/sa2.phn',
 'dr1-fvmh0/sa2.txt',
 'dr1-fvmh0/sa2.wav',
 'dr1-fvmh0/sa2.wrd',
 'dr1-fvmh0/si1466.phn',
 'dr1-fvmh0/si1466.txt',
 'dr1-fvmh0/si1466.wav',
 'dr1-fvmh0/si1466.wrd',
 'dr1-fvmh0/si2096.phn',
 'dr1-fvmh0/si2096.txt',
 'dr1-fvmh0/si2096.wav',
 'dr1-fvmh0/si2096.wrd',
 'dr1-fvmh0/si836.phn',
 'dr1-fvmh0/si836.txt',
 'dr1-fvmh0/si836.wav',
 'dr1-fvmh0/si836.wrd',
 'dr1-fvmh0/sx116.phn',
 'dr1-fvmh0/sx116.txt',
 'dr1-fvmh0/sx116.wav',
 'dr1-fvmh0/sx116.wrd',
 'dr1-fvmh0/sx206.phn',
 'dr1-fvmh0/sx206.txt',
 'dr1-fvmh0/sx206.wav',
 'dr1-fvmh0/sx206.wrd',
 'dr1-fvmh0/sx26.phn',
 'dr1-fvmh0/sx26.txt',
 'dr1-fvmh0/sx26.wav',
 'dr1-fvmh0/sx26.wrd',
 'dr1-fvmh0/sx296.phn',
 'dr1-fvmh0/sx296.txt',
 'dr1-fvmh0/sx296.wav',
 'dr1-fvmh0/sx296.wrd',
 'dr1-fvmh0/sx386.phn',
 'dr1-fvmh0/sx386.txt',
 'dr1-fvmh0/sx386.wav',
 'dr1-fvmh0/sx386.wrd',
 'dr1-mcpm0/sa1.phn',
 'dr1-mcpm0/sa1.txt',
 'dr1-mc

In [7]:
phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')
phonetic

['h#',
 'sh',
 'iy',
 'hv',
 'ae',
 'dcl',
 'y',
 'ix',
 'dcl',
 'd',
 'aa',
 'kcl',
 's',
 'ux',
 'tcl',
 'en',
 'gcl',
 'g',
 'r',
 'iy',
 's',
 'iy',
 'w',
 'aa',
 'sh',
 'epi',
 'w',
 'aa',
 'dx',
 'ax',
 'q',
 'ao',
 'l',
 'y',
 'ih',
 'ax',
 'h#']

In [8]:
nltk.corpus.timit.word_times('dr1-fvmh0/sa1')

[('she', 7812, 10610),
 ('had', 10610, 14496),
 ('your', 14496, 15791),
 ('dark', 15791, 20720),
 ('suit', 20720, 25647),
 ('in', 25647, 26906),
 ('greasy', 26906, 32668),
 ('wash', 32668, 37890),
 ('water', 38531, 42417),
 ('all', 43091, 46052),
 ('year', 46052, 50522)]
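
The start and end values returned by `word_times` are sample offsets, not seconds. The following is a minimal sketch of converting them, assuming the 16 kHz sampling rate of the TIMIT recordings. `SAMPLE_RATE`, `to_seconds` and `pauses` are illustrative names and are not part of NLTK.

```python
# Assumption: TIMIT audio is sampled at 16 kHz, so offsets are sample indices.
SAMPLE_RATE = 16000

# Word alignments copied from the word_times() output above.
word_times = [
    ('she', 7812, 10610), ('had', 10610, 14496), ('your', 14496, 15791),
    ('dark', 15791, 20720), ('suit', 20720, 25647), ('in', 25647, 26906),
    ('greasy', 26906, 32668), ('wash', 32668, 37890), ('water', 38531, 42417),
    ('all', 43091, 46052), ('year', 46052, 50522),
]

def to_seconds(times, rate=SAMPLE_RATE):
    """Convert (word, start, end) sample offsets to seconds."""
    return [(word, start / rate, end / rate) for word, start, end in times]

def pauses(times):
    """Return (previous_word, next_word, gap_in_samples) where words are not contiguous."""
    return [(a[0], b[0], b[1] - a[2])
            for a, b in zip(times, times[1:]) if b[1] > a[2]]

for word, start, end in to_seconds(word_times):
    print(f"{word:8s}{start:7.3f}{end:7.3f}")

# Gaps between the end of one word and the start of the next
print(pauses(word_times))
```

In this utterance most words are contiguous; the only gaps fall on either side of "water".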

In [9]:
timitdict = nltk.corpus.timit.transcription_dict()
timitdict['greasy'] + timitdict['wash'] + timitdict['water']

['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']

In [10]:
phonetic[17:30]

['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']

In [11]:
nltk.corpus.timit.spkrinfo('dr1-fvmh0')

SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86', birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS', comments='BEST NEW ENGLAND ACCENT SO FAR')

### Notable Design Features

### Fundamental Data Types

## 2. The Life-Cycle of a Corpus

### Three Corpus Creation Scenarios

### Quality Control

In [12]:
s1 = "00000010000000001000000"
s2 = "00000001000000010000000"
s3 = "00010000000000000001000"

In [13]:
nltk.windowdiff(s1, s1, 3)

0.0

In [14]:
nltk.windowdiff(s1, s2, 3)

0.19047619047619047

In [15]:
nltk.windowdiff(s2, s3, 3)

0.5714285714285714
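
To see where these scores come from, here is a plain-Python sketch of the unweighted WindowDiff computation: slide a window of size `k` over both segmentations and count the windows in which the number of boundaries differs. `window_diff` is a hypothetical helper written for illustration; `nltk.windowdiff` may differ in details such as its weighting option.

```python
def window_diff(seg1, seg2, k, boundary="1"):
    """Fraction of length-k windows whose boundary counts differ
    between the two segmentations (unweighted variant)."""
    if len(seg1) != len(seg2):
        raise ValueError("segmentations must have the same length")
    if k > len(seg1):
        raise ValueError("window size exceeds segmentation length")
    n_windows = len(seg1) - k + 1
    mismatches = 0
    for i in range(n_windows):
        count1 = seg1[i:i + k].count(boundary)
        count2 = seg2[i:i + k].count(boundary)
        if count1 != count2:
            mismatches += 1
    return mismatches / n_windows

s1 = "00000010000000001000000"
s2 = "00000001000000010000000"
s3 = "00010000000000000001000"

print(window_diff(s1, s1, 3))
print(window_diff(s1, s2, 3))
print(window_diff(s2, s3, 3))
```

With 23 symbols and a window of 3 there are 21 windows. `s1` and `s2` disagree in 4 of them and `s2` and `s3` in 12, which matches the scores above (4/21 and 12/21).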

### Curation vs Evolution

## 3. Acquiring Data

### Obtaining Data from the Web

In [None]:
"""
Web Crawling Tools

1. GNU Wget : http://www.gnu.org/software/wget/
2. Heritrix : http://crawler.archive.org/

"""

### Obtaining Data from Word Processor Files

In [None]:
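import re

# Parts of speech that the lexicon is allowed to use
legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
# In the saved HTML, part-of-speech labels are the lowercase strings set in 11pt font
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
with open("/content/dict.htm", encoding='windows-1252') as f:
  document = f.read()
used_pos = set(re.findall(pattern, document))
# Anything used in the document but not in legal_pos is a candidate error
illegal_pos = used_pos.difference(legal_pos)
print(list(illegal_pos))

# ['v.i', 'intrans']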

In [18]:
import re
from bs4 import BeautifulSoup

def lexical_data(html_file, encoding='utf-8'):
  SEP = '_ENTRY'
  html = open(html_file, encoding=encoding).read()
  html = re.sub(r'<p', SEP + '<p', html)
  text = BeautifulSoup(html, 'html.parser').get_text()
  text = ' '.join(text.split())
  for entry in text.split(SEP):
    if entry.count(' ') > 2:
      yield entry.split(' ', 3)

In [20]:
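import csv

# newline='' lets the csv module control line endings; the with-block closes the file
with open('dict1.csv', 'w', encoding='utf-8', newline='') as f:
  writer = csv.writer(f)
  writer.writerows(lexical_data('dict.htm', encoding='windows-1252'))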

### Obtaining Data from Spreadsheets and Databases

In [None]:
import csv

lexicon = csv.reader(open('dict.csv'))
pairs = [(lexeme, defn) for (lexeme, _, _, defn) in lexicon]
lexemes, defns = zip(*pairs)
defn_words = set(w for defn in defns for w in defn.split())
sorted(defn_words.difference(lexemes))
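
The cell above needs a `dict.csv` file on disk. As a self-contained sketch of the same check, the example below reads a tiny in-memory lexicon instead. The three rows are invented for illustration and are not taken from any real dictionary.

```python
import csv
import io

# Hypothetical four-column lexicon: lexeme, pronunciation, part of speech, definition.
sample = io.StringIO(
    "walk,wok,v.i.,move on foot\n"
    "foot,fut,n,part of leg\n"
    "leg,leg,n,limb for walk\n"
)

pairs = [(lexeme, defn) for (lexeme, _, _, defn) in csv.reader(sample)]
lexemes, defns = zip(*pairs)

# Words used in definitions that have no entry of their own
defn_words = set(w for defn in defns for w in defn.split())
missing = sorted(defn_words.difference(lexemes))
print(missing)
```

Words such as "limb" and "move" appear in definitions but are not themselves headwords, so they are reported as gaps in the lexicon.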

### Converting Data Formats

In [None]:
idx = nltk.Index((defn_word, lexeme)
                 for (lexeme, defn) in pairs
                 for defn_word in nltk.word_tokenize(defn)
                 if len(defn_word) > 3)

with open('dict.idx', 'w') as idx_file:
  for word in sorted(idx):
    idx_words = ', '.join(idx[word])
    idx_line = "{}: {}".format(word, idx_words)
    print(idx_line, file=idx_file)

### Deciding Which Layers of Annotation to Include

### Standards and Tools

### Special Considerations when Working with Endangered Languages

In [23]:
mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'), ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

In [30]:
def signature(word):
  for patt, repl in mappings:
    word = re.sub(patt, repl, word)
  pieces = re.findall('[^aeiou]+', word)
  return ''.join(char for piece in pieces for char in sorted(piece))[:8]

In [31]:
signature('illefent')

'lfnt'

In [32]:
signature('ebsekwieous')

'bskws'

In [33]:
signature('nuculerr')

'nclr'
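
To make the rewrite rules easier to follow, this sketch records which mappings actually change a word before the consonant clusters are sorted. `trace_signature` is a hypothetical helper that repeats the logic of `signature` above with one extra list for the intermediate forms.

```python
import re

# Same rewrite rules as the mappings list above.
mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'),
            ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

def trace_signature(word):
    """Return (steps, signature), where steps lists each rule that changed the word."""
    steps = []
    for patt, repl in mappings:
        new_word = re.sub(patt, repl, word)
        if new_word != word:
            steps.append((patt, new_word))
        word = new_word
    # Keep only consonant clusters and sort the letters inside each cluster
    pieces = re.findall('[^aeiou]+', word)
    sig = ''.join(char for piece in pieces for char in sorted(piece))[:8]
    return steps, sig

steps, sig = trace_signature('illefent')
for patt, form in steps:
    print(patt, '->', form)
print(sig)
```

For `'illefent'` only two rules fire: vowel runs collapse to `a` (giving `allafant`), then the doubled `l` is reduced (giving `alafant`), which leaves the signature `lfnt`.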

In [35]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [36]:
signatures = nltk.Index((signature(w), w) for w in nltk.corpus.words.words())

In [38]:
signatures[signature('nuculerr')]

['anicular',
 'inocular',
 'nucellar',
 'nuclear',
 'unicolor',
 'uniocular',
 'unocular']

In [39]:
def rank(word, wordlist):
  ranked = sorted((nltk.edit_distance(word, w), w) for w in wordlist)
  return [word for (_, word) in ranked]

In [40]:
def fuzzy_spell(word):
  sig = signature(word)
  if sig in signatures:
    return rank(word, signatures[sig])
  else:
    return []
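
The `rank` helper orders candidates by `nltk.edit_distance`. As a rough sketch of what that ordering relies on, here is a plain dynamic-programming Levenshtein distance with unit costs. `levenshtein` and `rank_by_distance` are illustrative names, and the NLTK function offers options this sketch omits.

```python
def levenshtein(a, b):
    """Edit distance where insertion, deletion and substitution each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def rank_by_distance(word, wordlist, distance=levenshtein):
    """Sort candidates by distance to word; ties fall back to alphabetical order."""
    return [w for _, w in sorted((distance(word, w), w) for w in wordlist)]

print(levenshtein('kitten', 'sitting'))
print(rank_by_distance('nuclear', ['unicolor', 'nuclear', 'nucellar']))
```

Because the sort key is a `(distance, word)` tuple, candidates at the same distance come out alphabetically, which explains the order of tied results such as those in the `fuzzy_spell` output.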

In [41]:
fuzzy_spell('illefent')

['olefiant', 'elephant', 'oliphant', 'elephanta']

In [42]:
fuzzy_spell('ebsekwieous')

['obsequious']

In [43]:
fuzzy_spell('nucular')

['anicular',
 'inocular',
 'nucellar',
 'nuclear',
 'unocular',
 'uniocular',
 'unicolor']

## 4. Working with XML

### Using XML for Linguistic Structures

### The Role of XML

### The ElementTree Interface

In [45]:
nltk.download('shakespeare')

[nltk_data] Downloading package shakespeare to /root/nltk_data...
[nltk_data]   Unzipping corpora/shakespeare.zip.


True

In [46]:
merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
raw = open(merchant_file).read()
print(raw[:163])

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="shakes.css"?>
<!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->

<PLAY>
<TITLE>The Merchant of Venice</TITLE>


In [47]:
print(raw[1789:2006])

<TITLE>ACT I</TITLE>

<SCENE><TITLE>SCENE I.  Venice. A street.</TITLE>
<STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>

<SPEECH>
<SPEAKER>ANTONIO</SPEAKER>
<LINE>In sooth, I know not why I am so sad:</LINE>


In [48]:
from xml.etree.ElementTree import ElementTree

merchant = ElementTree().parse(merchant_file)
merchant

<Element 'PLAY' at 0x7f23df8682f0>

In [49]:
merchant[0]

<Element 'TITLE' at 0x7f23df868890>

In [50]:
merchant[0].text

'The Merchant of Venice'

In [51]:
list(merchant)

[<Element 'TITLE' at 0x7f23df868890>,
 <Element 'PERSONAE' at 0x7f23df868c50>,
 <Element 'SCNDESCR' at 0x7f23df87a770>,
 <Element 'PLAYSUBT' at 0x7f23df87ab30>,
 <Element 'ACT' at 0x7f23df87add0>,
 <Element 'ACT' at 0x7f23df8fe650>,
 <Element 'ACT' at 0x7f23df53ea10>,
 <Element 'ACT' at 0x7f23dfb70a10>,
 <Element 'ACT' at 0x7f23dfa8d710>]

In [52]:
merchant[-2][0].text

'ACT IV'

In [53]:
merchant[-2][1]

<Element 'SCENE' at 0x7f23dfb70ad0>

In [54]:
merchant[-2][1][0].text

'SCENE I.  Venice. A court of justice.'

In [55]:
merchant[-2][1][54]

<Element 'SPEECH' at 0x7f23dfb40590>

In [56]:
merchant[-2][1][54][0]

<Element 'SPEAKER' at 0x7f23dfb405f0>

In [57]:
merchant[-2][1][54][0].text

'PORTIA'

In [58]:
merchant[-2][1][54][1]

<Element 'LINE' at 0x7f23dfb40650>

In [59]:
merchant[-2][1][54][1].text

"The quality of mercy is not strain'd,"

In [60]:
for i, act in enumerate(merchant.findall('ACT')):
  for j, scene in enumerate(act.findall('SCENE')):
    for k, speech in enumerate(scene.findall('SPEECH')):
      for line in speech.findall('LINE'):
        if 'music' in str(line.text):
          print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))

Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.
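
The nested loops above are needed because the act, scene and speech numbers are printed. When only the matching lines matter, a single path expression is enough. The sketch below uses a made-up miniature document with the same element names as `merchant.xml`, so it runs without the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play using the element names seen in merchant.xml.
sample = """<PLAY>
<TITLE>Sample Play</TITLE>
<ACT><TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I.</TITLE>
<SPEECH><SPEAKER>A</SPEAKER><LINE>No song here.</LINE></SPEECH>
<SPEECH><SPEAKER>B</SPEAKER><LINE>Let music sound.</LINE><LINE>And again music.</LINE></SPEECH>
</SCENE>
</ACT>
</PLAY>"""

play = ET.fromstring(sample)

# One path reaches every LINE; `line.text or ''` guards against elements
# whose text is None (for example a LINE containing only child elements).
hits = [line.text for line in play.findall('ACT/SCENE/SPEECH/LINE')
        if 'music' in (line.text or '')]
print(hits)
```

The substring test also matches words that merely contain "music", which is why "musician" shows up in the output above.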


In [61]:
from collections import Counter

speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
speaker_freq = Counter(speaker_seq)

top5 = speaker_freq.most_common(5)
top5

[('PORTIA', 117),
 ('SHYLOCK', 79),
 ('BASSANIO', 73),
 ('GRATIANO', 48),
 ('ANTONIO', 47)]

In [62]:
from collections import defaultdict

abbreviate = defaultdict(lambda: 'OTH')
for speaker, _ in top5:
  abbreviate[speaker] = speaker[:4]

speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
cfd.tabulate()

     ANTO BASS GRAT  OTH PORT SHYL 
ANTO    0   11    4   11    9   12 
BASS   10    0   11   10   26   16 
GRAT    6    8    0   19    9    5 
 OTH    8   16   18  153   52   25 
PORT    7   23   13   53    0   21 
SHYL   15   15    2   26   21    0 
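
In this table each row is the current speaker and each column is the speaker who talks next. The same counts can be sketched with the standard library alone by pairing every item with its successor. The speaker sequence below is invented for illustration.

```python
from collections import Counter

# Hypothetical sequence of abbreviated speaker names.
speaker_seq2 = ['PORT', 'BASS', 'PORT', 'OTH', 'PORT', 'BASS', 'SHYL']

# zip(seq, seq[1:]) yields the same adjacent pairs as a bigram iterator.
transitions = Counter(zip(speaker_seq2, speaker_seq2[1:]))

for (current, following), count in sorted(transitions.items()):
    print(f"{current:5s}-> {following:5s}{count:3d}")
```

In the real table the diagonal is zero for the five named speakers, since none of them is recorded as following themselves, while `OTH` to `OTH` is large, presumably because that label pools many minor characters.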


### Using ElementTree for Accessing Toolbox Data

In [64]:
nltk.download('toolbox')

[nltk_data] Downloading package toolbox to /root/nltk_data...
[nltk_data]   Unzipping corpora/toolbox.zip.


True

In [65]:
from nltk.corpus import toolbox

lexicon = toolbox.xml('rotokas.dic')

In [66]:
lexicon[3][0]

<Element 'lx' at 0x7f23e0231dd0>

In [67]:
lexicon[3][0].tag

'lx'

In [68]:
lexicon[3][0].text

'kaa'

In [69]:
[lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]

['kaa',
 'kaa',
 'kaa',
 'kaakaaro',
 'kaakaaviko',
 'kaakaavo',
 'kaakaoko',
 'kaakasi',
 'kaakau',
 'kaakauko',
 'kaakito',
 'kaakuupato',
 'kaaova',
 'kaapa',
 'kaapea',
 'kaapie',
 'kaapie',
 'kaapiepato',
 'kaapisi',
 'kaapisivira',
 'kaapo',
 'kaapopato',
 'kaara',
 'kaare',
 'kaareko',
 'kaarekopie',
 'kaareto',
 'kaareva',
 'kaava',
 'kaavaaua',
 'kaaveaka',
 'kaaveakapie',
 'kaaveakapievira',
 'kaaveakavira',
 'kae',
 'kae',
 'kaekae',
 'kaekae',
 'kaekaearo',
 'kaekaeo',
 'kaekaesoto',
 'kaekaevira',
 'kaekeru',
 'kaepaa',
 'kaepie',
 'kaepie',
 'kaepievira',
 'kaereasi',
 'kaereasivira',
 'kaetu',
 'kaetupie',
 'kaetuvira',
 'kaeviro',
 'kagave',
 'kaie',
 'kaiea',
 'kaikaio',
 'kaio',
 'kaipori',
 'kaiporipie',
 'kaiporivira',
 'kairi',
 'kairiro',
 'kairo',
 'kaita',
 'kaitutu',
 'kaitutupie',
 'kaitutuvira',
 'kakae',
 'kakae',
 'kakae',
 'kakaevira',
 'kakapikoa',
 'kakapikoto',
 'kakapu',
 'kakapua',
 'kakara',
 'kakarapaia',
 'kakarau',
 'kakarera',
 'kakata',
 'kakate

In [70]:
import sys
from nltk.util import elementtree_indent
from xml.etree.ElementTree import ElementTree

elementtree_indent(lexicon)
tree = ElementTree(lexicon[3])
tree.write(sys.stdout, encoding='unicode')

<record>
    <lx>kaa</lx>
    <ps>N</ps>
    <pt>MASC</pt>
    <cl>isi</cl>
    <ge>cooking banana</ge>
    <tkp>banana bilong kukim</tkp>
    <pt>itoo</pt>
    <sf>FLORA</sf>
    <dt>12/Aug/2005</dt>
    <ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
    <xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
    <xe>Taeavi planted banana in order to cook it.</xe>
  </record>

### Formatting Entries

In [71]:
html = "<table>\n"
for entry in lexicon[70:80]:
  lx = entry.findtext('lx')
  ps = entry.findtext('ps')
  ge = entry.findtext('ge')
  html += "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)

html += "</table>"
print(html)

<table>
  <tr><td>kakae</td><td>???</td><td>small</td></tr>
  <tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
  <tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
  <tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
  <tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
  <tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
  <tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
  <tr><td>kakara</td><td>N</td><td>arm band</td></tr>
  <tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
  <tr><td>kakarau</td><td>N</td><td>frog</td></tr>
</table>


## 5. Working with Toolbox Data

In [72]:
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')
sum(len(entry) for entry in lexicon) / len(lexicon)

13.635955056179775

### Adding a Field to Each Entry

In [73]:
from xml.etree.ElementTree import SubElement

def cv(s):
  s = s.lower()
  s = re.sub(r'[^a-z]',     r'_', s)
  s = re.sub(r'[aeiou]',    r'V', s)
  s = re.sub(r'[^V_]',      r'C', s)
  return s

def add_cv_field(entry):
  for field in entry:
    if field.tag == 'lx':
      cv_field = SubElement(entry, 'cv')
      cv_field.text = cv(field.text)

In [76]:
lexicon = toolbox.xml('rotokas.dic')

add_cv_field(lexicon[53])
print(nltk.toolbox.to_sfm_string(lexicon[53]))

\lx kaeviro
\ps V
\pt A
\ge lift off
\ge take off
\tkp go antap
\sc MOTION
\vx 1
\nt used to describe action of plane
\dt 03/Jun/2005
\ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
\xp Pita i go antap na lukim haus win i bagarapim.
\xe Peter went to look at the house that the wind destroyed.
\cv CVVCVCV
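
The same idea can be tried without the Rotokas data. This sketch builds a small record by hand, adds a `cv` field, and serializes it as XML. The record contents are a made-up subset of the entry above, and the list comprehension is a cautious variation: it collects the `lx` fields before appending, so the element is not modified while it is being iterated.

```python
import re
from xml.etree.ElementTree import Element, SubElement, tostring

def cv(s):
    """Map a word to its consonant/vowel template, e.g. 'kaa' to 'CVV'."""
    s = s.lower()
    s = re.sub(r'[^a-z]', '_', s)    # anything that is not a letter
    s = re.sub(r'[aeiou]', 'V', s)   # vowels
    s = re.sub(r'[^V_]', 'C', s)     # everything left is a consonant
    return s

# Hypothetical record built by hand, so the toolbox corpus is not required.
record = Element('record')
SubElement(record, 'lx').text = 'kaeviro'
SubElement(record, 'ge').text = 'lift off'

# Collect the lx fields first, then append one cv field per lexeme.
for field in [f for f in record if f.tag == 'lx']:
    SubElement(record, 'cv').text = cv(field.text)

print(tostring(record, encoding='unicode'))
```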



### Validating a Toolbox Lexicon

In [77]:
from collections import Counter

field_sequences = Counter(':'.join(field.tag for field in entry) for entry in lexicon)
field_sequences.most_common()

[('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41),
 ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
 ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27),
 ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
 ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe', 17),
 ('lx:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 16),
 ('lx:rt:ps:pt:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 12),
 ('lx:ps:pt:ge:tkp:nt:sf:dt:ex:xp:xe', 9),
 ('lx:ps:pt:ge:ge:tkp:dt:ex:xp:xe', 9),
 ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe', 9),
 ('lx:ps:ge:tkp:dt:ex:xp:xe', 8),
 ('lx:ps:pt:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 8),
 ('lx:rt:ps:pt:ge:ge:tkp:dt:ex:xp:xe', 8),
 ('lx:alt:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 7),
 ('lx:alt:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 7),
 ('lx:ps:pt:ge:ge:tkp:arg:vx:dt:ex:xp:xe:ex:xp:xe', 6),
 ('lx:ps:pt:ge:tkp:cmt:dt:ex:xp:xe', 5),
 ('lx:ps:pt:ge:tkp:nt:sf:dt:ex:xp:xe:ex:xp:xe', 5),
 ('lx:rt:ps:pt:ge:tkp:cmt:dt:ex:xp:xe', 5),
 ('lx:rt:ps:pt:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe', 4),
 ('lx:ps:pt:ge:tkp:dt:cmt:ex:xp:xe:ex:xp:xe', 4),
 ('lx:rt:ps:pt:ge:tkp

In [78]:
grammar = nltk.CFG.fromstring('''
S -> Head PS Glosses Comment Date Sem_Field Examples
Head -> Lexeme Root
Lexeme -> "lx"
Root -> "rt" |
PS -> "ps"
Glosses -> Gloss Glosses |
Gloss -> "ge" | "tkp" | "eng"
Date -> "dt"
Sem_Field -> "sf"
Examples -> Example Ex_Pidgin Ex_English Examples |
Example -> "ex"
Ex_Pidgin -> "xp"
Ex_English -> "xe"
Comment -> "cmt" | "nt" |
''')

In [81]:
def validate_lexicon(grammar, lexicon, ignored_tags):
  rd_parser = nltk.RecursiveDescentParser(grammar)
  for entry in lexicon:
    marker_list = [field.tag for field in entry if field.tag not in ignored_tags]
    if list(rd_parser.parse(marker_list)):
      print("+", ':'.join(marker_list))
    else:
      print("-", ':'.join(marker_list))

In [82]:
lexicon = toolbox.xml('rotokas.dic')[10:20]
ignored_tags = ['arg', 'dcsv', 'pt', 'vx']
validate_lexicon(grammar, lexicon, ignored_tags)

- lx:ps:ge:tkp:sf:nt:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe
- lx:rt:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:tkp:nt:sf:dt
- lx:ps:ge:tkp:dt:cmt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:ge:ge:tkp:cmt:dt:ex:xp:xe
- lx:rt:ps:ge:ge:tkp:dt
- lx:rt:ps:ge:eng:eng:eng:ge:tkp:tkp:dt:cmt:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe
- lx:rt:ps:ge:tkp:dt:ex:xp:xe
- lx:ps:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe
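
Every entry in this sample is rejected. One likely reason, reading the grammar: `Sem_Field -> "sf"` has no empty alternative and `S` places it after `Date`, so only entries with `sf` directly after `dt` can parse. The sketch below restates the grammar as a regular expression over the marker sequence, which is a different technique from the parser-based check but makes the constraint easy to test. `ENTRY` and `is_valid` are illustrative names.

```python
import re

# Regular-expression rendering of the CFG above: optional rt, any number of
# glosses, an optional comment, mandatory dt then sf, then ex/xp/xe triples.
ENTRY = re.compile(
    r'lx(?: rt)? ps(?: (?:ge|tkp|eng))*(?: (?:cmt|nt))? dt sf(?: ex xp xe)*'
)

def is_valid(marker_list):
    """True if the marker sequence is in the language of the grammar."""
    return ENTRY.fullmatch(' '.join(marker_list)) is not None

# A constructed sequence with sf after dt is accepted ...
print(is_valid('lx:ps:ge:tkp:dt:sf:ex:xp:xe'.split(':')))
# ... while sequences from the output above are not.
print(is_valid('lx:rt:ps:ge:tkp:dt:ex:xp:xe'.split(':')))
print(is_valid('lx:ps:ge:tkp:nt:sf:dt'.split(':')))
```

If `sf` is meant to be optional, adding an empty alternative to `Sem_Field` (as `Root` and `Comment` already have) would be one way to relax the grammar.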


In [83]:
grammar = r"""
lexfunc: {<lf>(<lv><ln|le>*)*}
example: {<rf|xv><xn|xe>*}
sense:   {<sn><ps><pn|gv|dv|gn|gp|dn|rn|ge|de|re>*<example>*<lexfunc>*}
record:   {<lx><hm><sense>+<dt>}
"""

In [85]:
from xml.etree.ElementTree import ElementTree
from nltk.toolbox import ToolboxData

db = ToolboxData()
db.open(nltk.data.find('corpora/toolbox/iu_mien_samp.db'))
lexicon = db.parse(grammar, encoding='utf8')
tree = ElementTree(lexicon)
with open('iu_mien_samp.xml', 'wb') as output:
  tree.write(output)

TypeError: ignored

## 6. Describing Language Resources using OLAC Metadata

### What is Metadata?

### OLAC : Open Language Archives Community

### Disseminating Language Resources