# LING 242 Python Lecture 3: Lexicons

* Word "Lists" (Sets)
* Simple dictionary lexicons
* Complex lexicons
* Lexicons in NLTK
* Exercises

## Word "Lists" (Sets)

The simplest lexicon is just a list of words that have something in common. For example:

* Pronouns ("he","she", "I", ...)
* Negative words ("terrible","jerk","foolishly",...)
* A list of all family relations ("father", "sister",...)
* The vocabulary of the Brown corpus

Typically, one builds a lexicon in order to identify instances of that word class in some corpus.

Rule #1: Don't use lists for lexicons. Use [sets](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)!

In [1]:
some_words = ["the","quick","brown", "fox","jumped", "over","the","lazy","dog"]

word_types = set(some_words)
word_types

{'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the'}

Two reasons to use sets for lexicons:

* Elements are unique, which is exactly what you need for lexicons
* Checking for membership (*in*) is much faster: $O(1)$ on average for a set vs $O(n)$ for a list. This especially matters when the lexicon is big
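
The speed difference is easy to check directly. The following is a minimal sketch using the standard-library `timeit` module; the vocabulary is synthetic (made-up strings `w0`, `w1`, ...), and the exact timings will vary from machine to machine.

```python
import timeit

# Synthetic vocabulary (an assumption for illustration): 100,000 made-up word strings.
vocab_list = [f"w{i}" for i in range(100_000)]
vocab_set = set(vocab_list)

# Worst case for a list: the query is absent, so every element gets compared.
query = "not-a-word"

# Run each membership test 100 times and record the total time in seconds.
list_time = timeit.timeit(lambda: query in vocab_list, number=100)
set_time = timeit.timeit(lambda: query in vocab_set, number=100)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Both tests give the same answer (`False`); only the time taken to reach it differs.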

There are a few ways to create sets.
- If you are starting with an empty set, you have to use the built-in *set()* function
    - empty curly brackets (`{}`) create a dict, not a set!
- If you are starting with a (small) fixed collection of items, you can write them directly inside curly brackets instead of converting a list.
- There are also set comprehensions, which use curly brackets as well!


In [2]:
seasons = {"spring","summer","fall"}
months = {}  # careful: empty curly brackets create a dict, not a set
days = {n for n in range(31)}

In [3]:
days

{0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30}

In [4]:
seasons.add("winter")
seasons

{'fall', 'spring', 'summer', 'winter'}

In [5]:
days.add(31)
print(days)

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}


In [6]:
months.add("January")

AttributeError: 'dict' object has no attribute 'add'

In [7]:
months = set()
months.add("January")
months

{'January'}

It is worth memorizing the basic set methods for adding and removing items, including `add` (one item), `update` (every item of an iterable), and `discard` (remove an item if present).

In [1]:
planets = {"Mars","Venus", "Mercury"}
more_planets = ["Jupiter","Neptune,","Uranus", "Pluto"]

In [3]:
more_planets_set = set(more_planets)
more_planets_set

{'Jupiter', 'Neptune,', 'Pluto', 'Uranus'}

In [6]:
more_planets_list = list(more_planets_set)
more_planets_list

['Pluto', 'Neptune,', 'Jupiter', 'Uranus']

In [7]:
planets.add("Earth")
planets

{'Earth', 'Mars', 'Mercury', 'Venus'}

In [11]:
planets.add(more_planets_set)

TypeError: unhashable type: 'set'

In [10]:
planets.update(more_planets)
planets

{'Earth', 'Jupiter', 'Mars', 'Mercury', 'Neptune,', 'Pluto', 'Uranus', 'Venus'}

In [11]:
planets.discard("Pluto")
planets

{'Earth', 'Jupiter', 'Mars', 'Mercury', 'Neptune,', 'Uranus', 'Venus'}

Sets are great for quickly finding intersections using `&` and differences using `-`. These are WAY faster than alternatives involving looping. Let's find the words that appear in both the Brown corpus and the Treebank, and the words that appear in only one of them.

In [12]:
from nltk.corpus import brown, treebank

brown_vocab = set(brown.words())
print('brown vocab: ', len(brown_vocab))
treebank_vocab = set(treebank.words())
print('treebank vocab: ', len(treebank_vocab))

brown vocab:  56057
treebank vocab:  12408


In [14]:
print(len(brown.words()))
print(len(treebank.words()))

1161192
100676


In [13]:
len(brown_vocab - treebank_vocab)

47510

In [14]:
len(treebank_vocab - brown_vocab)

3861

In [15]:
len(brown_vocab & treebank_vocab)

8547

In [16]:
len(set(treebank.words())) == len(treebank_vocab - brown_vocab) + len(brown_vocab & treebank_vocab)

True
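
For a version that can be checked by eye, here is a small sketch of the same operations on two made-up vocabularies. It also shows `^` (symmetric difference), which collects the items found in exactly one of the two sets in a single step.

```python
# Two tiny vocabularies, made up for illustration.
vocab_a = {"the", "cat", "sat", "mat"}
vocab_b = {"the", "dog", "sat", "log"}

both = vocab_a & vocab_b               # in both vocabularies
only_a = vocab_a - vocab_b             # in A but not in B
only_b = vocab_b - vocab_a             # in B but not in A
either_not_both = vocab_a ^ vocab_b    # in exactly one of the two

# sorted() is used only to get a reproducible print order.
print(sorted(both))              # ['sat', 'the']
print(sorted(only_a))            # ['cat', 'mat']
print(sorted(either_not_both))   # ['cat', 'dog', 'log', 'mat']

# The same sanity check as with the corpora: A splits into (A - B) and (A & B).
print(len(vocab_a) == len(only_a) + len(both))   # True
```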

You might sometimes want to intersect a set with something that isn't a set (like a list, or even a string). If so, and it doesn't make sense to convert the other object to a set, you can use set methods such as `intersection` instead of operators (which only work when both operands are sets).

In [17]:
vowels = {'a','e','i','o','u','y'}
word1 = "rstln"
word2 = "rstlne"

In [18]:
vowels.intersection(word1)

set()

In [19]:
vowels.intersection(word2)

{'e'}

In [20]:
vowels & word1

TypeError: unsupported operand type(s) for &: 'set' and 'str'

In [22]:
word1_set = set(word1)
vowels & word1_set

set()

Some set drawbacks:

* No order, and no guarantee that order will be preserved when the set changes
* Can't add mutable objects like lists and dicts (use tuples!)
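
Both drawbacks have simple workarounds. The sketch below (toy data, made up for illustration) stores word pairs as tuples, uses `frozenset` as the hashable counterpart of `set`, and calls `sorted()` whenever a reproducible order is needed.

```python
# Lists are unhashable, so store each word pair as a tuple instead.
tokens = ["the", "quick", "brown", "fox"]
bigrams = set()
for first, second in zip(tokens, tokens[1:]):
    bigrams.add((first, second))   # adding [first, second] would raise TypeError

# frozenset is immutable and hashable, so it can be an element of another set.
# The two frozensets below are equal, so only one is kept.
set_of_sets = {frozenset({"a", "b"}), frozenset({"b", "a"})}

# sorted() returns a list in a fixed order, whatever order the set iterates in.
print(sorted(bigrams))    # [('brown', 'fox'), ('quick', 'brown'), ('the', 'quick')]
print(len(set_of_sets))   # 1
```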

In [23]:
months = ["Feb", "Jan", "Mar"]
list(set(months))

['Jan', 'Feb', 'Mar']

## Simple dictionary lexicons

Most common lexical need in computational linguistics: word counts

Easy way to build a dictionary of counts when you've got a list (or other iterable) of words: [Counters](https://docs.python.org/3/library/collections.html#collections.Counter)

In [18]:
from collections import Counter

counts_brown = Counter(brown.words())

In [19]:
counts_brown["the"]

62713

In [20]:
counts_brown_lower = Counter(word.lower() for word in brown.words())
counts_brown_lower["the"]

69971

You can `update` counters to add more items

In [27]:
counts_treebank =  Counter(treebank.words())
counts_treebank["the"]

4045

In [28]:
counts = Counter(brown.words())
counts.update(treebank.words())
counts["the"]

66758

In [29]:
counts_brown["the"] + counts_treebank["the"] 

66758
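
Counters offer a few other conveniences beyond `update`. The following sketch uses two made-up sentences rather than the corpora, so the numbers can be checked by hand: `+` builds a new combined Counter, `most_common` returns the highest counts, and a missing key simply counts as zero.

```python
from collections import Counter

# Toy "corpora", made up for illustration.
counts_a = Counter("the cat saw the dog".split())
counts_b = Counter("the dog barked".split())

# + returns a new Counter and leaves both originals unchanged
# (update, by contrast, modifies the Counter in place).
combined = counts_a + counts_b

print(combined["the"])           # 3
print(combined.most_common(2))   # [('the', 3), ('dog', 2)]
print(combined["unicorn"])       # 0 (no KeyError for unseen words)
```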

But sometimes a Counter is not practical, because what you are counting isn't an iterable or you want to do some operation before counting. A normal Python dict is fine, but each value needs to be initialized to zero before it can be incremented. One solution: the [get](https://docs.python.org/3/library/stdtypes.html#dict.get) method, which returns a default value when the key is missing.

In [30]:
counts = {}
for word in brown.words():
    word = word.lower()
    #if word not in counts:
    #    counts[word] = 0
    counts[word] = counts.get(word, 0) + 1
    
counts["the"]

69971

**A quick exercise**: let's count the first words of sentences in the Brown corpus using `get`

In [31]:
first_word_count = {}
for sent in brown.sents():
    # counts[word] = counts.get(word, 0) + 1
    first_word_count[sent[0]] = first_word_count.get(sent[0], 0) + 1
    
print(first_word_count["And"])

789


Yet another solution to the initialization problem for counting (and other situations) is the [defaultdict](https://docs.python.org/3/library/collections.html#collections.defaultdict). When you initialize a defaultdict with `int`, the default value is zero, so there is no need to do anything but add! Defaultdicts also work with `list`, `set`, and `dict` (creating an empty instance as soon as you try to access a missing key).

In [22]:
from collections import defaultdict
from nltk.corpus import brown

counts = defaultdict(int)

for word in brown.words():
    word = word.lower()    
    # counts[word] = counts.get(word, 0) + 1
    counts[word] += 1
    
print(counts["the"])

69971
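
The `list` and `set` variants work the same way. Here is a small sketch using made-up (word, tag) pairs; the variable names are hypothetical and chosen so they do not clash with `counts` above.

```python
from collections import defaultdict

# Made-up (word, tag) pairs for illustration.
tagged = [("back", "NN"), ("back", "RB"), ("bear", "NN"), ("back", "NN"), ("bear", "VB")]

tags_seen = defaultdict(set)    # word -> set of tags seen with that word
positions = defaultdict(list)   # word -> list of positions in the sequence

for i, (word, tag) in enumerate(tagged):
    tags_seen[word].add(tag)      # an empty set is created on first access
    positions[word].append(i)     # an empty list is created on first access

print(sorted(tags_seen["back"]))   # ['NN', 'RB']
print(positions["bear"])           # [2, 4]
```

One caveat: merely looking up a missing key in a defaultdict creates an entry for it, so use `in` when you only want to test whether a word is present.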


In [23]:
counts

defaultdict(int,
            {'the': 69971,
             'fulton': 17,
             'county': 155,
             'grand': 48,
             'jury': 67,
             'said': 1961,
             'friday': 60,
             'an': 3740,
             'investigation': 51,
             'of': 36412,
             "atlanta's": 4,
             'recent': 179,
             'primary': 96,
             'election': 77,
             'produced': 90,
             '``': 8837,
             'no': 2139,
             'evidence': 204,
             "''": 8789,
             'that': 10594,
             'any': 1344,
             'irregularities': 8,
             'took': 426,
             'place': 570,
             '.': 49346,
             'further': 218,
             'in': 21337,
             'term-end': 1,
             'presentments': 1,
             'city': 393,
             'executive': 55,
             'committee': 168,
             ',': 58334,
             'which': 3561,
             'had': 5133,
             'ov

Another very common need: assign an index (an integer) to each word, for looking it up in data structures like matrices.

In [33]:
index_dict = {}
for word in counts:
    index_dict[word] = len(index_dict) # the current len(index_dict) **can** be an index
    
index_dict["index"]

9573

In these cases, you will often need a reversed index as well. The `items` method for dictionaries is useful if you are iterating over an existing dictionary and want to access both key and value at the same time. Let's build a reverse index dict in one line using a dictionary comprehension!

In [35]:
rev_index_dict = {value:key for key, value in index_dict.items()}

rev_index_dict[9573]

'index'
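
To see why both directions are useful, here is a minimal sketch with a made-up five-word vocabulary. The forward index maps each token to a slot in a count vector, and the reverse index turns a slot number back into a word. It uses `enumerate`, which is an alternative to the `len(index_dict)` trick above.

```python
# Made-up vocabulary for illustration.
vocab = ["the", "cat", "sat", "on", "mat"]

word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}

# A count vector with one slot per vocabulary word.
vector = [0] * len(word_to_index)
for token in "the cat sat on the mat".split():
    vector[word_to_index[token]] += 1

print(vector)             # [2, 1, 1, 1, 1]
print(index_to_word[2])   # sat
```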

## Complex lexicons


Sometimes lexicons are more complex and are represented with several nested Python datatypes. For example, instead of a single count per word, you might have a list of word senses, each of which is a dictionary of properties (including part of speech and the count of that word sense in a corpus).

In [24]:
mini_sense_lexicon = {"bear":[{"POS":"noun","animate":True,"count":634,"gloss":"A big furry animal"},
                              {"POS":"verb","transitive":True,"count":294, "past tense":"bore", "past participle":"borne", "gloss":"to endure"}],
                      "slug":[{"POS":"noun","animate":True,"count":34, "gloss":"A slimy animal"},
                              {"POS":"verb","transitive":True,"count":3, "gloss": "to hit"}],
                      "back":[{"POS":"noun","animate":False,"count":12,"gloss":"a body part"},
                              {"POS":"noun","animate":False,"count":43, "gloss":"the rear of a place"},
                              {"POS":"verb","transitive":True,"count":5, "gloss":"to support"},
                              {"POS":"adverb","count":47,"gloss":"in a returning fashion"}],
                      "good":[{"POS":"noun","animate":False,"count":19,"gloss":"a thing of value"},
                              {"POS":"adjective", "count":1293,"gloss":"positive"}]}

These can be tricky to navigate. Let's answer the following questions by accessing the information in the data structure:

How many senses does "back" have in this lexicon?

In [25]:
len(mini_sense_lexicon["back"])

4

Does "slug" have an adjectival sense?

In [26]:
flag = False
for feature in mini_sense_lexicon["slug"]:
    # each element is one sense: a dict of features, including its part of speech
    if feature["POS"] == "adjective":
        flag = True

print(flag)

False


In [27]:
any(feature["POS"] == "adjective" for feature in mini_sense_lexicon["slug"])

False

Which words have a verb sense with a *past tense*?

In [40]:
for word, pos_list in mini_sense_lexicon.items():
    for feature_dict in pos_list:
        if "past tense" in feature_dict:
            print(word, feature_dict["past tense"])

bear bore


What is the gloss of the most common sense of "back"?

In [41]:
mini_sense_lexicon["back"]

[{'POS': 'noun', 'animate': False, 'count': 12, 'gloss': 'a body part'},
 {'POS': 'noun',
  'animate': False,
  'count': 43,
  'gloss': 'the rear of a place'},
 {'POS': 'verb', 'transitive': True, 'count': 5, 'gloss': 'to support'},
 {'POS': 'adverb', 'count': 47, 'gloss': 'in a returning fashion'}]

In [42]:
highest_count = 0
gloss = ""

for feature_dict in mini_sense_lexicon["back"]:
    if feature_dict["count"] > highest_count:
        highest_count = feature_dict["count"]
        gloss = feature_dict["gloss"]
gloss

'in a returning fashion'
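
The same question can also be answered with the built-in `max` and a `key` function, in the same spirit as the `any` one-liner earlier. To keep this sketch self-contained, the senses of "back" are copied from the lexicon above into a hypothetical `back_senses` list.

```python
# Senses of "back", copied from mini_sense_lexicon so the example runs on its own.
back_senses = [
    {"POS": "noun", "animate": False, "count": 12, "gloss": "a body part"},
    {"POS": "noun", "animate": False, "count": 43, "gloss": "the rear of a place"},
    {"POS": "verb", "transitive": True, "count": 5, "gloss": "to support"},
    {"POS": "adverb", "count": 47, "gloss": "in a returning fashion"},
]

# max compares the senses by their "count" value and returns the whole dict.
most_common_sense = max(back_senses, key=lambda sense: sense["count"])
print(most_common_sense["gloss"])   # in a returning fashion
```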

## Lexicons in NLTK


NLTK has a lot of useful lexicons. Most are word lists. They are listed under corpora and share the corpus reader interface, so `words()` returns a list; convert it to a set if you want to use it for lookup.
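
The usual pattern is to convert the word list once and reuse the set for every lookup. In the sketch below, a short hand-picked list stands in for what an NLTK reader's `words()` call would return, so the example runs without any downloaded data; the variable names are hypothetical.

```python
# Stand-in for a word list returned by an NLTK corpus reader (hand-picked for illustration).
stopword_list = ["i", "me", "my", "the", "a", "an", "and", "of", "to", "in", "over"]

# Convert once, outside any loop, then reuse the set for every lookup.
stopword_set = set(stopword_list)

tokens = "The quick brown fox jumped over the lazy dog".split()

# Lowercase each token before the lookup, since the word list is all lowercase.
content_words = [tok for tok in tokens if tok.lower() not in stopword_set]
print(content_words)   # ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```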

#### Stopwords

Lists of closed-class/function words in various languages



In [43]:
from nltk.corpus import stopwords

In [44]:
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [45]:
stopwords.words("french")

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'ils',
 'je',
 'la',
 'le',
 'les',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',


#### Swadesh

Words for about 200 common concepts in a range of languages, used in historical linguistics

In [46]:
import nltk
nltk.download('swadesh')

from nltk.corpus import swadesh

[nltk_data] Downloading package swadesh to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package swadesh is already up-to-date!


In [47]:
swadesh.fileids()

['be',
 'bg',
 'bs',
 'ca',
 'cs',
 'cu',
 'de',
 'en',
 'es',
 'fr',
 'hr',
 'it',
 'la',
 'mk',
 'nl',
 'pl',
 'pt',
 'ro',
 'ru',
 'sk',
 'sl',
 'sr',
 'sw',
 'uk']

In [48]:
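# load each word list once, then walk through the first ten entries in parallel
en, fr, de = swadesh.words("en"), swadesh.words("fr"), swadesh.words("de")
for en_word, fr_word, de_word in zip(en[:10], fr[:10], de[:10]):
    print(en_word, "|", fr_word, "|", de_word)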

I | je | ich
you (singular), thou | tu, vous | du, Sie
he | il | er
we | nous | wir
you (plural) | vous | ihr, Sie
they | ils, elles | sie
this | ceci | dieses
that | cela | jenes
here | ici | hier
there | là | dort


#### Names

Lists of mostly English names, divided by gender

In [49]:
nltk.download('names')
from nltk.corpus import names

[nltk_data] Downloading package names to /Users/jungyeul/nltk_data...
[nltk_data]   Package names is already up-to-date!


In [50]:
names.fileids()

['female.txt', 'male.txt']

In [51]:
names.words("female.txt")[-10:]

['Zonnya',
 'Zora',
 'Zorah',
 'Zorana',
 'Zorina',
 'Zorine',
 'Zsa Zsa',
 'Zsazsa',
 'Zulema',
 'Zuzana']

#### Opinion Lexicon

Positive and negative word lists

In [52]:
nltk.download('opinion_lexicon')
from nltk.corpus import opinion_lexicon

[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


In [53]:
opinion_lexicon.fileids()

['negative-words.txt', 'positive-words.txt']

In [54]:
opinion_lexicon.words('negative-words.txt')[:10]

['2-faced',
 '2-faces',
 'abnormal',
 'abolish',
 'abominable',
 'abominably',
 'abominate',
 'abomination',
 'abort',
 'aborted']

#### CMU Pronouncing Dictionary

* A list of pronunciations for each English word string
* Pronunciations are a list of ARPAbet phones
* ARPAbet phones are strings which are alphabetic except for numbers at the end of the vowels
* The numbers indicate stress (0 = unstressed, 1 = primary stress, 2 = secondary stress)

Let's look at a few entries:

In [55]:
nltk.download("cmudict")
from nltk.corpus import cmudict
p_dict = cmudict.dict()

[nltk_data] Downloading package cmudict to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


In [56]:
p_dict["read"]

[['R', 'EH1', 'D'], ['R', 'IY1', 'D']]

In [57]:
p_dict["index"]

[['IH1', 'N', 'D', 'EH0', 'K', 'S']]
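
Because only vowels carry a stress digit, a pronunciation's stress pattern (and its number of syllables) can be read off directly. This is a small sketch with a hypothetical helper, `stress_pattern`, applied to pronunciations copied from the entries above so that it runs without loading the dictionary.

```python
def stress_pattern(pronunciation):
    """Return the stress digits of a pronunciation, one per vowel phone."""
    return [phone[-1] for phone in pronunciation if phone[-1].isdigit()]

# Pronunciations copied from the "index" and "read" entries shown above.
index_pron = ["IH1", "N", "D", "EH0", "K", "S"]
read_pron = ["R", "IY1", "D"]

print(stress_pattern(index_pron))       # ['1', '0']
print(len(stress_pattern(read_pron)))   # 1 (one vowel, so one syllable)
```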

Let's get some basic stats for this lexicon: total entries, average number of pronunciations per word, average number of phones per pronunciation

In [58]:
pro_count = 0
pho_count = 0
for pronuns in p_dict.values():
    pro_count += len(pronuns)
    for pronun in pronuns:
        pho_count += len(pronun)

In [59]:
len(p_dict)

123455

In [60]:
pro_count/len(p_dict)

1.0832854076384109

In [61]:
pho_count/pro_count

6.3850542482633825

## Exercises

1. Create a set of female names that appear in the Brown corpus.

In [62]:
fem_names_in_brown = set(names.words("female.txt")).intersection(brown.words())
fem_names_in_brown

{'Abbe',
 'Abbey',
 'Abigail',
 'Abra',
 'Ada',
 'Adele',
 'Adrian',
 'Adrien',
 'Agatha',
 'Aggie',
 'Agnes',
 'Agnese',
 'Aida',
 'Ailey',
 'Ainsley',
 'Alex',
 'Alexis',
 'Alice',
 'Alicia',
 'Alison',
 'Alix',
 'Alla',
 'Allison',
 'Alma',
 'Althea',
 'Amy',
 'Ana',
 'Anabel',
 'Andrea',
 'Andrei',
 'Andromache',
 'Andy',
 'Angel',
 'Angelina',
 'Angie',
 'Anita',
 'Ann',
 'Anna',
 'Anne',
 'Annie',
 'Ansley',
 'Anthea',
 'Antoinette',
 'Aphrodite',
 'April',
 'Arden',
 'Ariadne',
 'Arlen',
 'Arlene',
 'Ashley',
 'Asia',
 'Astra',
 'Athena',
 'Atlanta',
 'Audrey',
 'Augusta',
 'Augustine',
 'Aurora',
 'Austin',
 'Avis',
 'Bambi',
 'Barbara',
 'Bari',
 'Barry',
 'Bea',
 'Beatrice',
 'Beau',
 'Bee',
 'Bel',
 'Bell',
 'Bella',
 'Belle',
 'Benita',
 'Benny',
 'Bernardine',
 'Bernie',
 'Berry',
 'Bert',
 'Bertha',
 'Beryl',
 'Bess',
 'Bessie',
 'Beth',
 'Betsey',
 'Betsy',
 'Betty',
 'Beverly',
 'Bill',
 'Billie',
 'Billy',
 'Bird',
 'Birdie',
 'Birgit',
 'Birgitta',
 'Blair',
 'Blake',

2. (Programmatically) create a set of the words from `mini_sense_lexicon` which have an animate noun sense.

In [63]:
animates = set()
for word, pos_list in mini_sense_lexicon.items():
    for feature_dict in pos_list:
        # require a noun sense that is marked animate
        if feature_dict["POS"] == "noun" and feature_dict.get("animate"):
            animates.add(word)
animates

{'bear', 'slug'}

3. Count how often each English phone appears in the CMU lexicon, stripping off the stress markers at the end of vowels.

In [64]:
phone_counts = defaultdict(int)
for pronunciations in p_dict.values():
    for pronunciation in pronunciations:
        for phone in pronunciation:
            if phone[-1].isdigit():
                phone = phone[:-1]
            phone_counts[phone] += 1
            
print(phone_counts)

defaultdict(<class 'int'>, {'AH': 71410, 'EY': 13521, 'F': 13748, 'AO': 11290, 'R': 46046, 'T': 48549, 'UW': 9736, 'W': 8864, 'N': 60564, 'IH': 50093, 'P': 19715, 'L': 49479, 'AA': 24546, 'B': 21057, 'ER': 29027, 'G': 13553, 'K': 42502, 'S': 50427, 'EH': 27398, 'TH': 2902, 'M': 29347, 'D': 32389, 'V': 10742, 'Z': 27842, 'IY': 34504, 'AE': 21804, 'OW': 19047, 'NG': 9865, 'SH': 8700, 'HH': 9319, 'AW': 3408, 'AY': 11313, 'JH': 6404, 'Y': 5171, 'CH': 4960, 'ZH': 560, 'UH': 2273, 'DH': 576, 'OY': 1267})
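
The same count can also be written with a `Counter` over a generator expression. This is one possible alternative, not a replacement for the loop above. To stay self-contained it runs on a hypothetical `toy_dict` holding two entries copied from the earlier examples; the full dictionary would come from `cmudict.dict()`.

```python
from collections import Counter

# Two entries copied from the examples above, standing in for the full dictionary.
toy_dict = {
    "read": [["R", "EH1", "D"], ["R", "IY1", "D"]],
    "index": [["IH1", "N", "D", "EH0", "K", "S"]],
}

# rstrip("012") removes a trailing stress digit and leaves consonants untouched.
toy_phone_counts = Counter(
    phone.rstrip("012")
    for pronunciations in toy_dict.values()
    for pronunciation in pronunciations
    for phone in pronunciation
)

print(toy_phone_counts.most_common(1))   # [('D', 3)]
print(toy_phone_counts["EH"])            # 2 (EH1 and EH0 are counted together)
```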
