# COLX 521 Lecture 3: Lexicons

* Word "Lists" (Sets)
* Simple dictionary lexicons
* Complex lexicons
* Lexicons in NLTK
* The CMU Pronouncing Dictionary
* File IO for lexicons

## Word "Lists" (Sets)

The simplest lexicon is just a list of words that have something in common. For example:

* Pronouns ("he","she", "I", ...)
* Negative words ("terrible","jerk","foolishly",...)
* A list of all family relations ("father", "sister",...)
* The vocabulary of the Brown corpus

Typically, one builds a lexicon in order to identify instances of that word class in some corpus.

Rule #1: Don't use lists for lexicons. Use [sets](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)!

In [4]:
some_words = ["the","quick","brown", "fox","jumped", "over","the","lazy","dog"]

#my code here
word_types = set(some_words)
print(word_types)
#my code here

{'fox', 'brown', 'jumped', 'the', 'quick', 'dog', 'over', 'lazy'}


Two big reasons to use sets for lexicons:

* Elements are unique
* Checking for membership (*in*) is much faster, especially for large lexicons

In [5]:
#provided code
list_of_nums = list(range(1000000))
set_of_nums = set(list_of_nums)
test = -1

%timeit test in list_of_nums
%timeit test in set_of_nums


10.6 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
55.2 ns ± 5.49 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


There are a few ways to create sets. If you are starting from scratch, you have to use the *set()* function, because empty curly brackets (`{}`) create a dictionary rather than a set. If you are starting with a fixed set of existing items, you can declare the set directly with curly brackets instead of converting a list.

In [6]:
#provided code
family = {"mother","father","brother"}
months = set()
seasons = {}

In [7]:
family.add("sister")
family

{'brother', 'father', 'mother', 'sister'}

In [8]:
months.add("january")
months

{'january'}

In [9]:
seasons.add("spring")

AttributeError: 'dict' object has no attribute 'add'
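The error above arises because empty curly brackets create a dictionary, not a set. A minimal sketch that checks the types directly (the variable names are illustrative only):

```python
# Empty curly brackets create a dict; only set() creates an empty set.
empty_braces = {}
empty_set = set()
print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# Curly brackets with items (and no key: value pairs) do create a set.
seasons = {"spring"}
seasons.add("summer")
print(sorted(seasons))  # ['spring', 'summer']
```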

It is worth memorizing the basic set methods for adding and removing items, including *add*, *update*, and *discard*.

In [None]:
#provided code
planets = {"Mars","Venus", "Mercury"}
more_planets = ["Jupiter", "Neptune", "Uranus", "Pluto"]

In [None]:
planets.add("Earth")
planets

In [None]:
planets.update(more_planets)
planets

In [None]:
planets.discard("Pluto")
planets
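One detail worth knowing about removal: `discard` does nothing when the item is missing, whereas the related `remove` method raises a `KeyError`. A small sketch, using a fresh copy of the planet set so that it runs on its own:

```python
planets = {"Mars", "Venus", "Mercury"}

# discard is silent when the item is not in the set
planets.discard("Pluto")

# remove raises KeyError when the item is not in the set
try:
    planets.remove("Pluto")
    removed = True
except KeyError:
    removed = False

print(removed)          # False
print(sorted(planets))  # ['Mars', 'Mercury', 'Venus']
```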

Sets are great for quickly finding intersections using `&` and differences using `-`. Let's find the words that appear in both the Brown and Treebank corpora, and the words that appear in only one or the other.

In [None]:
#provided code
from nltk.corpus import brown, treebank

brown_vocab = set(brown.words())
print('brown vocab: ', len(brown_vocab))
treebank_vocab = set(treebank.words())
print('treebank vocab: ', len(treebank_vocab))

In [None]:
brown_only = brown_vocab - treebank_vocab
print('brown only: ', len(brown_only))
treebank_only = treebank_vocab - brown_vocab
print('treebank only: ', len(treebank_only))

In [None]:
both_vocab = brown_vocab & treebank_vocab
print('in both: ', len(both_vocab))
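Two related operators are `|` (union) and `^` (symmetric difference, i.e. items in exactly one of the two sets). The sketch below uses two tiny made-up vocabularies rather than the real corpora, so it runs without NLTK:

```python
# Toy vocabularies standing in for the corpus vocabularies
vocab_a = {"the", "fox", "jumped"}
vocab_b = {"the", "dog", "slept"}

either = vocab_a | vocab_b       # union: in at least one set
exactly_one = vocab_a ^ vocab_b  # symmetric difference: in exactly one set

print(len(either))          # 5
print(sorted(exactly_one))  # ['dog', 'fox', 'jumped', 'slept']

# The symmetric difference is the union of the two one-sided differences
print(exactly_one == (vocab_a - vocab_b) | (vocab_b - vocab_a))  # True
```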

You might sometimes want to intersect a set with something that isn't a set (like a list, or even a string). If so, and it doesn't make sense to convert the other object to a set, you can use set methods such as `intersection` instead of operators (which only work when both operands are sets).

In [None]:
vowels = {'a','e','i','o','u','y'}
word = "rstln"

if len(vowels.intersection(word)) == 0:
    print("no vowels")



In [None]:
vowels & word
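The operator form fails with a `TypeError` when the right-hand side is a string, while the method form accepts any iterable. A minimal sketch that shows both behaviours side by side (the example word is arbitrary):

```python
vowels = {'a', 'e', 'i', 'o', 'u', 'y'}
word = "lexicon"

# The & operator requires a set on both sides
try:
    vowels & word
    operator_worked = True
except TypeError:
    operator_worked = False

# The intersection method accepts any iterable, including a string
found = vowels.intersection(word)

print(operator_worked)  # False
print(sorted(found))    # ['e', 'i', 'o']
```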

Some set drawbacks:

* No order, and no guarantee that order will be preserved when the set changes
* Can't add mutable objects like lists and dicts (use tuples!)

In [None]:
months = ["Jan", "Feb", "Mar"]

#my code here
list(set(months))
#my code here

In [None]:
date = ["Jan", 3]
#my code here
dates = set()
#dates.add(date)
dates.add(tuple(date))
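When order matters, there are a couple of common workarounds. The sketch below is one possible approach rather than the only one: `dict.fromkeys` removes duplicates while keeping first-seen order (dicts preserve insertion order in Python 3.7+), and `sorted` gives a deterministic order from a set.

```python
months = ["Jan", "Feb", "Mar", "Jan"]

# Deduplicate while keeping the order of first appearance
unique_in_order = list(dict.fromkeys(months))
print(unique_in_order)  # ['Jan', 'Feb', 'Mar']

# A set has no order of its own, but sorted() gives a deterministic one
print(sorted(set(months)))  # ['Feb', 'Jan', 'Mar']

# Tuples are immutable, so they can be set members
dates = {("Jan", 3), ("Feb", 14)}
print(("Jan", 3) in dates)  # True
```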


## Simple dictionary lexicons

Most common lexical need in computational linguistics: word counts

Easy way to build a dictionary of counts when you've got a list of words: [Counters](https://docs.python.org/3/library/collections.html#collections.Counter)

In [10]:
from collections import Counter

#my code here
counts = Counter(brown.words())
#my code here


In [11]:
counts["niki"]

0

In [None]:
counts.update(treebank.words())
counts["the"]
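Two Counter behaviours are handy for lexicon work: looking up a missing word returns 0 without adding it, and `most_common` returns the highest counts first. A small sketch on a toy word list (not the Brown corpus, so the numbers are only illustrative):

```python
from collections import Counter

toy_words = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
toy_counts = Counter(toy_words)

# most_common(n) returns the n highest-count (word, count) pairs
print(toy_counts.most_common(1))  # [('the', 2)]

# A missing key returns 0 and is not inserted into the Counter
print(toy_counts["cat"])    # 0
print("cat" in toy_counts)  # False
```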

But a Counter is sometimes not practical, because what you are counting isn't an iterable or you want to do some operation before counting. A normal Python dict is fine, but each value needs to be initialized to zero. One solution: the *get* method.

In [11]:
counts = {}
for word in brown.words():
    word = word.lower()
    # my code here
    #if word not in counts:
    #    counts[word] = 0
    counts[word] = counts.get(word, 0) + 1

print(counts["the"])
    # my code here

69971
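Another option, not used in the cell above, is `collections.defaultdict`, which supplies the initial zero automatically. A minimal sketch on a toy token list:

```python
from collections import defaultdict

tokens = ["The", "the", "Fox", "fox", "the"]

# defaultdict(int) creates a missing entry with value 0 on first access
lower_counts = defaultdict(int)
for token in tokens:
    lower_counts[token.lower()] += 1

print(dict(lower_counts))  # {'the': 3, 'fox': 2}
```

Note that, unlike `get`, merely reading a missing key from a defaultdict inserts it.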


Exercise: Count the first words of sentences in the Brown corpus

In [38]:
first_word_count = {}
for sent in brown.sents():
    # your code here
    first_word_count[sent[0]] = first_word_count.get(sent[0], 0) + 1
    # your code here
    
print(first_word_count["And"])

789


Another very common need: assign an index (an integer) to each word, for looking up in data structures like matrices

In [13]:
index_dict = {}
# my code here
for word in counts:
    index_dict[word] = len(index_dict)
# my code here
    
index_dict["the"]

30
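To make the purpose of the index concrete, here is a small sketch that uses a word index to fill a count vector, with a plain list standing in for a matrix row. The token list and variable names are made up for illustration.

```python
tokens = ["the", "fox", "saw", "the", "dog"]

# Assign each new word the next free integer
toy_index = {}
for token in tokens:
    if token not in toy_index:
        toy_index[token] = len(toy_index)

# Use the index to find each word's position in a count vector
vector = [0] * len(toy_index)
for token in tokens:
    vector[toy_index[token]] += 1

print(toy_index)  # {'the': 0, 'fox': 1, 'saw': 2, 'dog': 3}
print(vector)     # [2, 1, 1, 1]
```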

In these cases, you will often need a reversed index as well. The `items` method for dictionaries is useful if you are iterating over an existing dictionary and want to access both key and value.

In [14]:
rev_index_dict = {}

#my code here
for word, index in index_dict.items():
    rev_index_dict[index] = word
    
rev_index_dict[9573]
#my code here

'V-1'

Alternatively, you can do this with the `keys()` and `values()` methods, which provide views over the keys and values of the dictionary in matching order. The two need to be paired up with `zip` before being passed to `dict`.

In [15]:
rev_index_dict = dict(zip(index_dict.values(), index_dict.keys()))
rev_index_dict[9573]

'V-1'

## Complex lexicons


Sometimes lexicons are more complex and might be represented as multiple nested Python datatypes. For example, instead of a single count, you might have a list of word senses, each of which is a dictionary of properties (including part-of-speech and a count of that word sense in a corpus).

In [17]:
# provided code
mini_sense_lexicon = {"bear":[{"POS":"noun","animate":True,"count":634,"gloss":"A big furry animal"},
                              {"POS":"verb","transitive":True,"count":294, "past tense":"bore", "past participle":"borne", "gloss":"to endure"}],
                      "slug":[{"POS":"noun","animate":True,"count":34, "gloss":"A slimy animal"},
                              {"POS":"verb","transitive":True,"count":3, "gloss": "to hit"}],
                      "back":[{"POS":"noun","animate":False,"count":12,"gloss":"a body part"},
                              {"POS":"noun","animate":False,"count":43, "gloss":"the rear of a place"},
                              {"POS":"verb","transitive":True,"count":5, "gloss":"to support"},
                              {"POS":"adverb","count":47,"gloss":"in a returning fashion"}],
                      "good":[{"POS":"noun","animate":False,"count":19,"gloss":"a thing of value"},
                              {"POS":"adjective", "count":1293,"gloss":"positive"}]}

These can be tricky to navigate. Let's answer the following questions by accessing the information in the data structure:

How many senses does "back" have in this lexicon?

In [24]:
len(mini_sense_lexicon["back"])

4

What is the most common sense of back?

In [26]:
highest_count = 0
gloss = ""

for feature_dict in mini_sense_lexicon["back"]:
    if feature_dict["count"] > highest_count:
        highest_count = feature_dict["count"]
        gloss = feature_dict["gloss"]
gloss

'in a returning fashion'
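The same question can be answered more compactly with the built-in `max` and a `key` function. This is an alternative sketch, not a replacement for the loop above; the sense list is a trimmed copy of the "back" entry so that the block runs on its own.

```python
# Trimmed copy of the senses of "back" from the mini lexicon
back_senses = [
    {"POS": "noun", "count": 12, "gloss": "a body part"},
    {"POS": "noun", "count": 43, "gloss": "the rear of a place"},
    {"POS": "verb", "count": 5, "gloss": "to support"},
    {"POS": "adverb", "count": 47, "gloss": "in a returning fashion"},
]

# max compares the senses by their "count" value
most_common_sense = max(back_senses, key=lambda sense: sense["count"])
print(most_common_sense["gloss"])  # in a returning fashion
```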

Does "slug" have an adjectival sense?

In [28]:
has_adjective = False
for feature_dict in mini_sense_lexicon["slug"]:
    if feature_dict["POS"] == "adjective":
        has_adjective = True
has_adjective

False
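Yes/no questions like this one map naturally onto the built-in `any`, which stops at the first match. A short sketch, using a trimmed copy of the "slug" entry so that it is self-contained:

```python
# Trimmed copy of the senses of "slug" from the mini lexicon
slug_senses = [
    {"POS": "noun", "count": 34, "gloss": "A slimy animal"},
    {"POS": "verb", "count": 3, "gloss": "to hit"},
]

# any() is True if at least one sense satisfies the condition
has_adjective = any(sense["POS"] == "adjective" for sense in slug_senses)
has_verb = any(sense["POS"] == "verb" for sense in slug_senses)

print(has_adjective)  # False
print(has_verb)       # True
```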

Which words have a verb sense with an irregular past tense?

In [30]:
for word, pos_list in mini_sense_lexicon.items():
    for feature_dict in pos_list:
        if "past tense" in feature_dict:
            print(word)

bear


Exercise: Which words mention the word animal in one of their glosses?

In [60]:
for word, pos_list in mini_sense_lexicon.items():
    for feature_dict in pos_list:
        if "animal" in feature_dict["gloss"]:
            print(word)

bear
slug


## NLTK lexicons


NLTK has a lot of useful lexicons, most of which are word lists. They are listed under corpora and share the corpus interface, so `words()` returns a list; convert it to a set if you want to use it for lookup.
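To illustrate the conversion, here is a minimal sketch in which a short hand-written list stands in for an NLTK word list (so the block runs without NLTK data). The list contents and variable names are made up.

```python
# Hypothetical stand-in for a word list such as the ones NLTK provides
stopword_list = ["the", "a", "over", "and"]

# Convert once, then reuse the set for every membership test
stopword_set = set(stopword_list)

tokens = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
content_words = [t for t in tokens if t.lower() not in stopword_set]

print(content_words)  # ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```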

#### Stopwords

Lists of closed-class/function words in various languages



In [32]:
#provided code 
from nltk.corpus import stopwords

In [33]:
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [34]:
stopwords.words("french")

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'je',
 'la',
 'le',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',
 'ayants',
 'eu'

#### Swadesh

Words for about 200 common concepts in a large set of languages, used in historical linguistics

In [38]:
#provided code
from nltk.corpus import swadesh

In [39]:
swadesh.fileids()

['be',
 'bg',
 'bs',
 'ca',
 'cs',
 'cu',
 'de',
 'en',
 'es',
 'fr',
 'hr',
 'it',
 'la',
 'mk',
 'nl',
 'pl',
 'pt',
 'ro',
 'ru',
 'sk',
 'sl',
 'sr',
 'sw',
 'uk']

In [40]:
# load each word list once and walk through the two in parallel
for en_word, de_word in zip(swadesh.words("en"), swadesh.words("de")):
    print(en_word, de_word)

I ich
you (singular), thou du, Sie
he er
we wir
you (plural) ihr, Sie
they sie
this dieses
that jenes
here hier
there dort
who wer
what was
where wo
when wann
how wie
not nicht
all alle
many viele
some einige
few wenige
other andere
one eins
two zwei
three drei
four vier
five fünf
big groß
long lang
wide breit, weit
thick dick
heavy schwer
small klein
short kurz
narrow eng
thin dünn
woman Frau
man (adult male) Mann
man (human being) Mensch
child Kind
wife Frau, Ehefrau
husband Mann, Ehemann
mother Mutter
father Vater
animal Tier
fish Fisch
bird Vogel
dog Hund
louse Laus
snake Schlange
worm Wurm
tree Baum
forest Wald
stick Stock
fruit Frucht
seed Samen
leaf Blatt
root Wurzel
bark (from tree) Rinde
flower Blume
grass Gras
rope Seil
skin Haut
meat Fleisch
blood Blut
bone Knochen
fat (noun) Fett
egg Ei
horn Horn
tail Schwanz
feather Feder
hair Haar
head Kopf, Haupt
ear Ohr
eye Auge
nose Nase
mouth Mund
tooth Zahn
tongue Zunge
fingernail Fingernagel
foot Fuß
leg Bein
knee Knie
hand Hand
win

#### Names

Lists of mostly English names, divided by gender

In [37]:
#provided code
from nltk.corpus import names

In [46]:
names.fileids()

['female.txt', 'male.txt']

In [45]:
names.words("female.txt")[-10:]

['Zonnya',
 'Zora',
 'Zorah',
 'Zorana',
 'Zorina',
 'Zorine',
 'Zsa Zsa',
 'Zsazsa',
 'Zulema',
 'Zuzana']

#### Opinion Lexicon

Positive and negative word lists

In [41]:
#provided code
from nltk.corpus import opinion_lexicon

In [42]:
opinion_lexicon.fileids()

['negative-words.txt', 'positive-words.txt']

In [44]:
opinion_lexicon.words('negative-words.txt')[:10]

['2-faced',
 '2-faces',
 'abnormal',
 'abolish',
 'abominable',
 'abominably',
 'abominate',
 'abomination',
 'abort',
 'aborted']

## CMU Pronouncing Dictionary

* A list of pronunciations for each English word string
* Pronunciations are a list of ARPAbet phones
* ARPAbet phones are strings which are alphabetic except for numbers at the end of the vowels
* The numbers indicate stress

Let's look at a few entries

In [19]:
#provided code
from nltk.corpus import cmudict
p_dict = cmudict.dict()

In [20]:
print(p_dict["read"])

[['R', 'EH1', 'D'], ['R', 'IY1', 'D']]


In [22]:
print(p_dict["index"])

[['IH1', 'N', 'D', 'EH0', 'K', 'S']]
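As a small illustration of the stress digits, the sketch below takes the pronunciation of "index" shown above (copied in by hand, so no NLTK data is needed) and separates the stress pattern from the bare phones. The variable names are illustrative.

```python
# Pronunciation of "index", copied from the dictionary entry above
pron = ['IH1', 'N', 'D', 'EH0', 'K', 'S']

# Vowels end in a stress digit; consonants have no digit
stress_pattern = [int(phone[-1]) for phone in pron if phone[-1].isdigit()]

# Strip a trailing stress digit (0, 1 or 2) to get the bare phones
bare_phones = [phone.rstrip("012") for phone in pron]

print(stress_pattern)  # [1, 0]
print(bare_phones)     # ['IH', 'N', 'D', 'EH', 'K', 'S']
```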


Let's get some basic stats for this lexicon: total entries, average number of pronunciations per word, average number of phones per pronunciation

In [23]:
phones = 0
pronounciation_count = 0
for pronounciations in p_dict.values():
    # my code here
    pronounciation_count += len(pronounciations)
    for pronounciation in pronounciations:
        phones += len(pronounciation)
    # my code here

In [24]:
len(p_dict)

123455

In [25]:
pronounciation_count/len(p_dict)

1.0832854076384109

In [26]:
phones/pronounciation_count

6.3850542482633825

Exercise: Count how often each English phone appears in this lexicon. We need to make sure to strip off the stress markers.

In [27]:
phone_counts = {}
for pronounciations in p_dict.values():
    for pronounciation in pronounciations:
        # your code here
        for phone in pronounciation:
            if phone[-1].isdigit():
                phone = phone[:-1]
            phone_counts[phone] = phone_counts.get(phone, 0) + 1
        # your code here
            
print(phone_counts)

{'AH': 71410, 'EY': 13521, 'F': 13748, 'AO': 11290, 'R': 46046, 'T': 48549, 'UW': 9736, 'W': 8864, 'N': 60564, 'IH': 50093, 'P': 19715, 'L': 49479, 'AA': 24546, 'B': 21057, 'ER': 29027, 'G': 13553, 'K': 42502, 'S': 50427, 'EH': 27398, 'TH': 2902, 'M': 29347, 'D': 32389, 'V': 10742, 'Z': 27842, 'IY': 34504, 'AE': 21804, 'OW': 19047, 'NG': 9865, 'SH': 8700, 'HH': 9319, 'AW': 3408, 'AY': 11313, 'JH': 6404, 'Y': 5171, 'CH': 4960, 'ZH': 560, 'UH': 2273, 'DH': 576, 'OY': 1267}


## File IO for lexicons

Lists of words are typically stored as lines of a file. Let's write out the list of English possessive pronouns, and read them back in by iterating one line at a time.

In [31]:
pps = ["his", "her", "my", "your", "their", "our", "its"]

#my code here
# "with" closes each file automatically, even if an error occurs
with open("possessive_pronouns.txt", "w") as fout:
    for pp in pps:
        fout.write(pp + "\n")

with open("possessive_pronouns.txt") as f:
    for line in f:
        print(line.strip())

#my code here

his
her
my
your
their
our
its


For lexicons which are simple dictionaries, the most common format is a tab-delimited file. Let's create one for the small count dict below, and again read it back in.

In [32]:
counts = {"the":62713, "quick":66, "fox":9}

# my code here
with open("counts.txt", "w") as fout:
    for word, count in counts.items():
        fout.write(word + "\t" + str(count) + "\n")

new_counts = {}

with open("counts.txt") as f:
    for line in f:
        word, count = line.strip().split("\t")
        # convert the count back to an integer when reading
        new_counts[word] = int(count)

print(new_counts)
# my code here

{'the': 62713, 'quick': 66, 'fox': 9}
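The standard library's `csv` module can also read and write tab-delimited files, which helps once fields may contain unusual characters. This is an alternative sketch, not the approach used in the cell above; it writes to a temporary directory so that it does not touch the files created earlier.

```python
import csv
import os
import tempfile

counts = {"the": 62713, "quick": 66, "fox": 9}
path = os.path.join(tempfile.mkdtemp(), "counts.tsv")

# Write one (word, count) row per line, separated by tabs
with open(path, "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout, delimiter="\t")
    writer.writerows(counts.items())

# Read the rows back, converting each count from str to int
with open(path, newline="", encoding="utf-8") as f:
    loaded = {word: int(count) for word, count in csv.reader(f, delimiter="\t")}

print(loaded == counts)  # True
```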


For complex lexicons, we should use a Python package which allows us to save the entire data structure in one go. There are two good choices for this: the [json](https://docs.python.org/3/library/json.html) package saves a Python data structure in a human-readable form, whereas the [pickle](https://docs.python.org/3/library/pickle.html) package saves it in a compact binary form. The two packages have essentially the same interface, so we'll just look at json (where we can inspect the file).

In [34]:
import json

#my code here
with open("mini_sense_lexicon.txt", "w") as fout:
    json.dump(mini_sense_lexicon, fout)

with open("mini_sense_lexicon.txt") as f:
    new_mini_sense_lexicon = json.load(f)

print(new_mini_sense_lexicon)

#my code here

{'bear': [{'POS': 'noun', 'animate': True, 'count': 634, 'gloss': 'A big furry animal'}, {'POS': 'verb', 'transitive': True, 'count': 294, 'past tense': 'bore', 'past participle': 'borne', 'gloss': 'to endure'}], 'slug': [{'POS': 'noun', 'animate': True, 'count': 34, 'gloss': 'A slimy animal'}, {'POS': 'verb', 'transitive': True, 'count': 3, 'gloss': 'to hit'}], 'back': [{'POS': 'noun', 'animate': False, 'count': 12, 'gloss': 'a body part'}, {'POS': 'noun', 'animate': False, 'count': 43, 'gloss': 'the rear of a place'}, {'POS': 'verb', 'transitive': True, 'count': 5, 'gloss': 'to support'}, {'POS': 'adverb', 'count': 47, 'gloss': 'in a returning fashion'}], 'good': [{'POS': 'noun', 'animate': False, 'count': 19, 'gloss': 'a thing of value'}, {'POS': 'adjective', 'count': 1293, 'gloss': 'positive'}]}
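One practical difference between the two packages: JSON object keys are always strings, so a dictionary with integer keys (such as a reversed index) does not round-trip exactly through json, whereas pickle preserves it. The sketch below uses a tiny made-up reversed index and temporary files. Note that pickle files must be opened in binary mode, and that unpickling data from untrusted sources is unsafe.

```python
import json
import os
import pickle
import tempfile

toy_rev_index = {0: "the", 1: "fox"}
tmp_dir = tempfile.mkdtemp()

# json: text mode; integer keys come back as strings
json_path = os.path.join(tmp_dir, "rev_index.json")
with open(json_path, "w") as fout:
    json.dump(toy_rev_index, fout)
with open(json_path) as f:
    from_json = json.load(f)

# pickle: binary mode ("wb"/"rb"); keys keep their original type
pickle_path = os.path.join(tmp_dir, "rev_index.pkl")
with open(pickle_path, "wb") as fout:
    pickle.dump(toy_rev_index, fout)
with open(pickle_path, "rb") as f:
    from_pickle = pickle.load(f)

print(from_json)    # {'0': 'the', '1': 'fox'}
print(from_pickle)  # {0: 'the', 1: 'fox'}
```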


Exercise: pick some words which form a natural class. Create a text file with those words, read them in, and then count how often words of that type appear in an NLTK corpus of your choice.

In [35]:
pps = set()
# "with" makes sure the file is closed after reading
with open("possessive_pronouns.txt") as f:
    for line in f:
        pps.add(line.strip())
    
pps_count = 0

for word in brown.words():
    if word.lower() in pps:
        pps_count += 1
print(pps_count)

18052
