# LING 242 Python Lecture 4: Corpus Statistics

* Advanced word counting
* Advanced sorting
* Simple statistics
* Comparing corpora
* T/F questions

## Advanced counting

Let's get counts of words from the Brown corpus again, using a Counter:

In [9]:
from collections import Counter
from nltk.corpus import brown

counts_wo_lower = Counter(brown.words())
counts = Counter(word.lower() for word in brown.words())

In [10]:
print(len(counts_wo_lower))
print(len(counts))

56057
49815


Often you need to normalize the counts in a dictionary to create a word probability distribution. You can keep a running total or, more conveniently, just [sum](https://docs.python.org/3/library/functions.html#sum) the values of your count dict (you can also use `np.sum`). If you've done this right, your new values should sum to 1 (or close enough, allowing for floating-point rounding).

In [11]:
len(brown.words())

1161192

In [12]:
total_tokens = sum(counts.values())
total_tokens

1161192

In [13]:
total_tokens = sum(counts.values())

probs = {}

for word in counts:
    probs[word] = counts[word]/total_tokens
    
print(counts["the"]) # 62713 by counts_wo_lower
print(probs["the"])

69971
0.06025790739171472


In [14]:
sum(probs.values())

0.9999999999991869
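
As a compact variant of the loop above, the normalization can be sketched as a single function. This uses a toy count dictionary rather than the Brown corpus, and `normalize_counts` is a name introduced here purely for illustration. `math.isclose` is one way to check the "sums to 1 (or close enough)" condition without being tripped up by rounding.

```python
import math
from collections import Counter

def normalize_counts(counts):
    '''turn a dictionary of counts into a dictionary of probabilities'''
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# toy data standing in for a real corpus
toy_counts = Counter("the cat sat on the mat".split())
toy_probs = normalize_counts(toy_counts)

print(toy_probs["the"])
# exact equality with 1 can fail because of rounding, so compare with a tolerance
print(math.isclose(sum(toy_probs.values()), 1.0))
```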

Question: Why would word probability be more useful than raw counts when we are trying to use statistics to characterize corpora?

Answer: <span style="background-color:black;">Corpora can be of vastly different sizes, and those sizes will have a direct effect on word count. For example, one corpus might have 1 mention of the word _dog_ and another might have 100. This might indicate the second corpus talks a lot more about dogs, or it might just indicate that the second one is 100X as large</span>

For easy interpretability and to avoid very small numbers, one often multiplies these normalized word probabilities by some large number like 1000, at which point the resulting number can be understood as X occurrences per 1000 tokens. Another common transformation, shown first below, is to take the logarithm of each probability.

In [15]:
import math

log_probs = {}

for word in probs:
    log_probs[word] = math.log(probs[word])


print(probs["university"])
print(log_probs["university"])

0.00018429338128405983
-8.598981606662296


In [16]:
probs_1k = {}

for word in probs:
    probs_1k[word] = probs[word] * 1000
    
probs_1k["university"]

0.18429338128405984
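
To connect the per-1000 scaling back to the _dog_ example in the answer above, here is a minimal sketch with made-up corpus sizes (the numbers and the `per_thousand` helper are assumptions for illustration, not taken from any corpus). Two very different raw counts turn out to be the same rate once corpus size is factored in.

```python
def per_thousand(count, total_tokens):
    '''relative frequency expressed as occurrences per 1000 tokens'''
    return 1000 * count / total_tokens

# hypothetical numbers: "dog" occurs once in a small corpus
# and 100 times in a corpus that is 100X as large
small = per_thousand(1, 10_000)
large = per_thousand(100, 1_000_000)

print(small, large)
```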

Another common use case involving counts is removing words with high or low counts, which are often uninteresting or statistically unreliable. This is tricky, because in Python you can't delete from a dictionary while you're iterating over it (doing so raises a `RuntimeError`)! Unless you're very worried about running out of memory, it is usually easier to just create a new dictionary.

In [17]:
new_counts = {word:count for word,count in counts.items() if 5 < count}
print(len(counts))
print(len(new_counts))

49815
12416


In [18]:
new_counts = {word:count for word,count in counts.items() if 5 < count < 10000}
print(len(new_counts))

12406
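
For reference, here is a small sketch of what goes wrong when you delete during iteration, using a toy dictionary. It also shows one possible in-place workaround (iterating over a snapshot of the keys), for cases where building a new dictionary is not an option. The variable names are illustrative only.

```python
toy_counts = {"the": 12, "fox": 1, "dog": 7, "zyzzyva": 1}

# deleting while iterating over the dictionary itself fails
broken = dict(toy_counts)
error_message = None
try:
    for word in broken:
        if broken[word] <= 5:
            del broken[word]
except RuntimeError as error:
    error_message = str(error)
    print("RuntimeError:", error_message)

# iterating over a snapshot of the keys makes in-place deletion safe
in_place = dict(toy_counts)
for word in list(in_place):
    if in_place[word] <= 5:
        del in_place[word]

print(in_place)
```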


If you're just interested in the highest (or lowest) count item, it is easy enough just to iterate over the dictionary once and remember the top scoring item.

In [9]:
highest_count_word = ""# None
highest_count = 0

for word, count in counts.items():
    if count > highest_count:
        highest_count_word = word
        highest_count = count

print(highest_count_word)
print(highest_count)

the
69971


In [20]:
counts['the']

69971

In [10]:
highest_count_word = None
print(highest_count_word)

x = None

if x:  # if x is True:
    print("True?")
elif x is False:
    print ("False?")
else:
    print("None is just None")

None
None is just None


But if you're using a Counter object, the [most_common](https://docs.python.org/3/library/collections.html#collections.Counter.most_common) method is often handy. Counters have a few other neat options, for instance they can be added and subtracted (though, unlike `update`, this creates a new Counter).

In [22]:
counts.most_common(10)

[('the', 69971),
 (',', 58334),
 ('.', 49346),
 ('of', 36412),
 ('and', 28853),
 ('to', 26158),
 ('a', 23195),
 ('in', 21337),
 ('that', 10594),
 ('is', 10109)]

In [12]:
from nltk.corpus import treebank
treebank_counts = Counter(word.lower() for word in treebank.words()) # Counter(treebank.words())
both_counts = counts + treebank_counts
print(both_counts.most_common(10))

[('the', 74735), (',', 63219), ('.', 53174), ('of', 38737), ('and', 30409), ('to', 28340), ('a', 25183), ('in', 23106), ('that', 11442), ('is', 10781)]
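
The difference between Counter arithmetic and `update` is easier to see on a toy example. The sketch below uses two tiny hand-made Counters. Note that subtraction with `-` silently drops any word whose resulting count is zero or negative.

```python
from collections import Counter

a = Counter({"the": 3, "fox": 1})
b = Counter({"the": 1, "fox": 2, "dog": 1})

added = a + b          # new Counter; a and b are unchanged
subtracted = a - b     # new Counter; zero and negative results are dropped

print(added)
print(subtracted)

a.update(b)            # modifies a in place
print(a == added)
```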


Beyond that, you'll want to do some sorting.

## Advanced sorting

As you've already seen, simple sorting of a list of objects in Python is fairly straightforward. Use the [sort](https://docs.python.org/3/library/stdtypes.html#list.sort) method to sort in place, or [sorted](https://docs.python.org/3/library/functions.html#sorted) to create a new sorted list. The result is ordered from smallest to largest; use the `reverse` keyword argument to reverse the order.

In [13]:
nums = [3, 6, -4, 23, 0.5, 202, -24592, 3482]

In [14]:
print(sorted(nums))
print(sorted(nums,reverse=True))

[-24592, -4, 0.5, 3, 6, 23, 202, 3482]
[3482, 202, 23, 6, 3, 0.5, -4, -24592]


In [15]:
nums

[3, 6, -4, 23, 0.5, 202, -24592, 3482]

In [16]:
nums.sort(reverse=True)

In [17]:
nums

[3482, 202, 23, 6, 3, 0.5, -4, -24592]

In [18]:
nums.sort()
nums

[-24592, -4, 0.5, 3, 6, 23, 202, 3482]

In [19]:
nums2 = sorted(nums,reverse=True)
nums2

[3482, 202, 23, 6, 3, 0.5, -4, -24592]

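Note that strings are sorted by the Unicode code points of their characters, which matches alphabetical order only within a single case: digits sort before uppercase letters, which sort before lowercase letters, and numeric strings are compared character by character rather than by value.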

In [20]:
strings = ["aardvark", "Aardvark", "Zebra", "zebra", "12", "110", "2"]

sorted(strings)

['110', '12', '2', 'Aardvark', 'Zebra', 'aardvark', 'zebra']
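
If that default order is not what you want, one option is the `key` keyword argument (discussed in more detail later in this section). The sketch below is only an illustration on toy lists: it sorts words case-insensitively with `str.lower` and sorts numeric strings by their integer value with `int`.

```python
words = ["aardvark", "Aardvark", "Zebra", "zebra"]
numbers_as_strings = ["12", "110", "2"]

# compare lowercased versions so that case does not affect the order;
# items with equal keys keep their original relative order
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)

# compare the integer values rather than the characters
by_value = sorted(numbers_as_strings, key=int)
print(by_value)
```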

Often, though, you have a statistic associated with a group of objects (words, documents, corpora, etc.), and want to sort the objects based on the statistic. One strategy is to create a list of tuples where the statistic is the first element of the tuple, since sort operates on the first element of each tuple first.

In [25]:
counts = {"the":87925, "quick":327, "brown":539, "fox":69}
sorted((count,word) for word, count in counts.items())

[(69, 'fox'), (327, 'quick'), (539, 'brown'), (87925, 'the')]

In [26]:
sorted((word,count) for word, count in counts.items())

[('brown', 539), ('fox', 69), ('quick', 327), ('the', 87925)]

In [27]:
counts.keys()

dict_keys(['the', 'quick', 'brown', 'fox'])

A similarly compact, elegant way is to use the `key` keyword argument of `sort`/`sorted`, which lets you specify a function that computes, for each item in the iterable, the value to sort by. The typical way to specify this function is to use a [lambda expression](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions). One clear advantage of the `key` approach is that you just get the sorted list without having to deal with extracting what you need from tuples:

In [24]:
print(sorted(counts.keys()))
print(sorted(counts.values()))
print(sorted(counts))

['brown', 'fox', 'quick', 'the']
[69, 327, 539, 87925]
['brown', 'fox', 'quick', 'the']


In [26]:
print(counts)

sorted_words = sorted(counts.keys(), key=lambda x: x)
print(sorted_words)
sorted_words = sorted(counts.keys(), key=lambda x: len(x))
print(sorted_words)
sorted_words = sorted(counts.keys(), key=lambda x: counts[x])
print(sorted_words)

{'the': 87925, 'quick': 327, 'brown': 539, 'fox': 69}
['brown', 'fox', 'quick', 'the']
['the', 'fox', 'quick', 'brown']
['fox', 'quick', 'brown', 'the']


Once you have a sorted list, you can use slicing to get what you want.

In [27]:
sorted_words[-2:]

['brown', 'the']
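
A common variation is ranking from most to least frequent, with ties broken alphabetically. One way to sketch this (on a toy dictionary, with illustrative names) is to return a tuple from the `key` function: negating the count gives descending order on the first element, and the word itself settles ties.

```python
toy_counts = {"the": 5, "fox": 2, "dog": 2, "quick": 1}

# negate the count for descending order; the word itself breaks ties alphabetically
by_count_desc = sorted(toy_counts, key=lambda word: (-toy_counts[word], word))
print(by_count_desc)

# slicing then gives the top n
top_two = by_count_desc[:2]
print(top_two)
```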

Let's use sorting with lambdas to get the 5 longest and shortest word types in the Penn Treebank corpus.

In [29]:
from nltk.corpus import treebank
sorted_by_length = sorted(set(treebank.words()), key=lambda x: len(x))

print(sorted_by_length[:5])
print(sorted_by_length[-5:])

['b', '9', "'", '@', '*']
['collective-bargaining', 'Bridgestone\\/Firestone', 'computer-system-design', 'Macmillan\\/McGraw-Hill', 'marketing-communications']


Remember that there are two other Python built-in functions, [min](https://docs.python.org/3/library/functions.html#min) and [max](https://docs.python.org/3/library/functions.html#max), which get the minimum and maximum values. Like sort/sorted, they have a *key* keyword argument.

In [32]:
max(set(treebank.words()), key=lambda x: len(x))

'marketing-communications'
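
The same `key` trick works directly on a dictionary of counts, which gives a one-line alternative to the "remember the top scoring item" loop from earlier. This is a sketch on toy data. Passing the dictionary's own `get` method as the key means each word is compared by its count.

```python
toy_counts = {"the": 5, "fox": 2, "dog": 2, "quick": 1}

most_frequent = max(toy_counts, key=toy_counts.get)
least_frequent = min(toy_counts, key=toy_counts.get)
longest = max(toy_counts, key=len)

print(most_frequent, least_frequent, longest)
```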

As you've seen elsewhere, there are more ways to sort when you are using numpy arrays or pandas dataframes, but that is beyond our scope here! When you're dealing with relatively simple situations (like word counts) and there is little benefit to be gained from vectorization, you probably don't want the overhead of converting to these formats: pure Python is easier.

## Simple statistics

The easiest sorts of corpus statistics to calculate are averages: e.g. average word length, average sentence length, average words per text.

In [30]:
def get_simple_stats(corpus):
    num_chars = sum([len(word) for word in corpus.words()])
    num_words = len(corpus.words())
    num_sents = len(corpus.sents())
    num_texts = len(corpus.fileids())
    print("average word length\t =", num_chars/num_words)
    print("average sentence length\t =", num_words/num_sents)
    print("average text length\t =", num_words/num_texts) 

In [31]:
get_simple_stats(brown)

average word length	 = 4.276538246904905
average sentence length	 = 20.250994070456922
average text length	 = 2322.384


One popular statistic for individual texts is the type-token ratio (TTR), which reflects how varied the vocabulary used in the text is, i.e.

\begin{equation*}
\frac {\text{No. of word types}}{\text{No. of word tokens}}
\end{equation*}

Note that when you are using it for comparison, you generally need to fix the number of tokens for the texts you're comparing, since TTR can be quite different for different numbers of tokens. As we've already seen, sets are an easy way to get the number of types, though normally you'll want to lowercase first.

Let's generalize this into a function for use later

In [32]:
def type_token_ratio(words, num_words=None):
    '''
    calculate type-token ratio from the corpus of word tokens (list of strings) using the first
    num_words tokens (all tokens if num_words is not given)
    '''
    if num_words is None:
        num_words = len(words)
    types = set(word.lower() for word in words[:num_words])

    return len(types)/num_words

In [33]:
type_token_ratio(brown.words())

0.04289988219002542

In [34]:
type_token_ratio(brown.words(), 1000)

0.417

In [35]:
type_token_ratio(brown.words(), 100000)

0.13082
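
The drop in TTR as the sample grows can be reproduced on a sentence small enough to check by hand. The sketch below uses a made-up token list and a separate helper, `toy_ttr`, introduced here for illustration. Unlike the function above, it divides by the actual sample length, which is a design choice so that asking for more tokens than exist does not deflate the ratio.

```python
def toy_ttr(tokens, num_tokens=None):
    '''type-token ratio over the first num_tokens tokens (all of them if None)'''
    if num_tokens is None:
        num_tokens = len(tokens)
    sample = tokens[:num_tokens]
    return len(set(token.lower() for token in sample)) / len(sample)

tokens = "the cat sat on the mat and the dog sat on the log".split()

# the ratio falls as more tokens are included, because words start to repeat
for n in (4, 8, len(tokens)):
    print(n, toy_ttr(tokens, n))
```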

The relative quantities of the main _open-class_ parts of speech can reflect the nature of a particular corpus. For example, narrative texts (like stories) tend to have more verbs, and informational texts (like technical reports) tend to have more nouns. If we have POS tags, this is easy to calculate:

In [36]:
noun_count = 0
verb_count = 0
adj_count = 0

for word, pos in brown.tagged_words():
    if pos[0] == "N":
        noun_count += 1
    if pos[0] == "V":
        verb_count += 1
    if pos[0] == "J":
        adj_count += 1

print(noun_count/len(brown.words()))
print(verb_count/len(brown.words()))
print(adj_count/len(brown.words()))

0.23521002555994186
0.09976127978835542
0.06200008267366637


One popular POS summary statistic is lexical density, which can also be calculated using a POS-tagged corpus. It is the ratio of open-class words (nouns, verbs, adjectives, adverbs) to all words.

In [37]:
open_class_prefix = {"N", "V", "J", "R"}
open_class_total = 0
for word, pos in brown.tagged_words():
    if pos[0] in open_class_prefix:
        open_class_total += 1
print(open_class_total/len(brown.words()))

0.43480664696277616
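
Both the per-category proportions and lexical density can come from a single pass if you count tag prefixes with a Counter. The sketch below uses a hand-tagged toy sentence with Brown-style tags (an illustration, not corpus data), so the result can be verified by eye: 5 of the 9 tokens are open-class.

```python
from collections import Counter

# hand-tagged toy sentence using Brown-style tags (an illustration, not corpus data)
tagged = [("The", "AT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "AT"), ("dog", "NN"), ("quickly", "RB"), (".", ".")]

open_class_prefixes = {"N", "V", "J", "R"}

# one pass: count tokens by the first letter of their tag
prefix_counts = Counter(pos[0] for word, pos in tagged)
open_class_total = sum(prefix_counts[prefix] for prefix in open_class_prefixes)

lexical_density = open_class_total / len(tagged)
print(prefix_counts["N"], prefix_counts["V"], prefix_counts["J"], prefix_counts["R"])
print(lexical_density)
```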


Of course, any word or POS sequence or otherwise easily identified linguistic property may be considered a potential statistic. For example, the code below finds English split infinitives (i.e. TO + RB + V) in the Brown corpus, collects them in a list, and (in the commented-out lines) reports how often they appear per 100,000 words. This particular construction was considered ungrammatical English for hundreds of years, but it is actually quite common (e.g. in the opening to the _Star Trek_ TV show: _to boldly go where no man/one has gone before_).

<img src="https://i.imgur.com/P8t4dDL.png" style="width:300px;height:200px;">
(From OneirosTheWriter at sufficientvelocity.com) 


**TBBT S10E24: _The Long Distance Dissonance_**

- Sheldon: I'm sorry you had to go through that.
- Amy: In fact, that's when I started to really miss you.
- Sheldon: You know you just split an infinitive.
- Amy: Did I? Are you gonna teach me a lesson?
- Sheldon: I am. It is naughty to put an adverb between the word "to" and the verb stem.
- Amy: What are you gonna do about it?
- Sheldon: I'm going to admonish you.
- Amy: Vigorously?
- Sheldon: That's the only kind of admonishing I do.



In [38]:
split_infinitives = 0
split_infinitives_list = []
for sent in brown.tagged_sents():
    for i in range(len(sent) - 2):
        if sent[i][1] == "TO" and sent[i+1][1][:2] == "RB" and sent[i+2][1][0] == "V":
            split_infinitives_list.append(sent[i][0] + " " + sent[i+1][0] + " " + sent[i+2][0])
            split_infinitives += 1
            
# print("Frequency per 100000 words")     
# print(100000*split_infinitives/len(brown.words())) # 2.3251968666680445
split_infinitives_list[:5]

['to formally request',
 'to completely bypass',
 'to merely go',
 'to properly express',
 'to properly display']
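
The same pattern search can be sketched as a reusable function that walks over tag trigrams with `zip` instead of index arithmetic. The function name and the toy tagged sentences are assumptions made for illustration, and the sketch assumes each sentence is a list of `(word, tag)` pairs that supports slicing.

```python
def find_split_infinitives(tagged_sents):
    '''return "to ADVERB VERB" strings for every TO + RB* + V* tag sequence'''
    found = []
    for sent in tagged_sents:
        # zip over three offset copies of the sentence to walk its trigrams
        for (w1, t1), (w2, t2), (w3, t3) in zip(sent, sent[1:], sent[2:]):
            if t1 == "TO" and t2.startswith("RB") and t3.startswith("V"):
                found.append(" ".join((w1, w2, w3)))
    return found

# hand-tagged toy sentences: only the first one splits the infinitive
toy_sents = [
    [("She", "PPS"), ("decided", "VBD"), ("to", "TO"), ("boldly", "RB"), ("go", "VB"), (".", ".")],
    [("He", "PPS"), ("wanted", "VBD"), ("to", "TO"), ("go", "VB"), ("boldly", "RB"), (".", ".")],
]
matches = find_split_infinitives(toy_sents)
print(matches)
```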

Corpus linguists use these methods to investigate how linguistic patterns of interest are being used!

## Comparing corpora

We count and sort words and derive corpus statistics primarily so we can identify differences between texts or corpora. Let's calculate and compare some basic stats for some corpora in NLTK.

In [39]:
import numpy as np
import matplotlib.pyplot as plt

def ordered_bar_from_dict(py_dict, title):
    '''create a bar chart from values in py_dict, ordered from smallest to largest and labeled with keys'''
    labels = sorted(py_dict.keys(),key=lambda x: py_dict[x])
    y_pos = np.arange(len(labels))
    values = sorted(py_dict.values())

    plt.bar(y_pos, values, align='center', alpha=0.5,color=list('rgbkym'))
    plt.xticks(y_pos, labels,rotation=45)
    plt.title(title)
    plt.show()
    

In [40]:
def average_word_length(words):
    '''calculate the average length of the provided words'''
    total_words = 0
    total_chars = 0
    for word in words:
        total_words += 1
        total_chars += len(word)
    return total_chars/total_words

In [41]:
import sys
corpora = ["treebank", "gutenberg", "reuters", "switchboard","webtext", "movie_reviews"]

exec("from nltk.corpus import " + ", ".join(corpora))


In [55]:
corpus = "treebank"
print(len(treebank.words()))
print(len(corpus.words()))

100676


AttributeError: 'str' object has no attribute 'words'
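
The error arises because `corpus` is just a string that happens to spell the name of an object, not the object itself. The cells that follow get around this by building code strings for `exec`. Another option is the built-in `getattr`, which looks up an attribute by its name. The sketch below demonstrates the idea on the standard library's `math` module, since that is always available. Assuming the corpus readers are exposed as attributes of `nltk.corpus` (as the import statement built above suggests), `getattr(nltk.corpus, "treebank")` would be the analogous call.

```python
import math

# names stored as strings, like the corpus names in the list above
function_names = ["sqrt", "floor"]

results = {}
for name in function_names:
    # getattr(module, "name") is equivalent to writing module.name
    function = getattr(math, name)
    results[name] = function(16.5)

print(results)
```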

In [43]:
avg_word_lengths = {}
for corpus in corpora:
    exec("words = " +corpus + ".words()")
    avg_word_lengths[corpus] = average_word_length(words)

avg_word_lengths
# ordered_bar_from_dict(avg_word_lengths,"Average word lengths")

{'treebank': 4.406154396281139,
 'gutenberg': 3.618868231123358,
 'reuters': 4.0016898124877605,
 'switchboard': 3.240325152188617,
 'webtext': 3.552701691061747,
 'movie_reviews': 3.9314442297735854}

In [44]:
ttrs = {}
for corpus in corpora:
    exec("words = " +corpus + ".words()")
    ttrs[corpus] = type_token_ratio(words)

ttrs
# ordered_bar_from_dict(ttrs,"Type token ratios")

{'treebank': 0.11310540744566729,
 'gutenberg': 0.016149980946844555,
 'reuters': 0.01805914459925353,
 'switchboard': 0.052855348342835055,
 'webtext': 0.043893500162577856,
 'movie_reviews': 0.02510891389173012}

Let's talk a little bit about what we see here, and how it reflects the differences between corpora...

What if we want to identify individual words (or other linguistic features) that are particularly important in one corpus relative to another? We are going to do this for just two very different corpora, gutenberg and webtext. Let's go back to our word (unigram) probabilities. First, we build them for each corpus:

In [45]:
def get_unigram_probs(words):
    '''get unigram probabilities for the words in a corpus'''
    counts = Counter(word.lower() for word in words)
    total = sum(counts.values())
    return {word:count/total for word,count in counts.items()}
    

In [46]:
web_probs = get_unigram_probs(webtext.words())
guten_probs = get_unigram_probs(gutenberg.words())

There are two natural, relatively direct ways to use these probabilities to identify words that are characteristic of the two corpora. One is to rank the words by the _difference_ between probabilities, and the other is to rank the words by the _ratio_ of the probabilities. Both will work, but they lead to different results.  

In [47]:
def subtract_probs(prob1, prob2):
    '''given two probability dictionaries, create a dictionary that has the difference of probabilities (prob1 - prob2)
    for words that appear in either dictionary'''
    all_words = set(prob1.keys())
    all_words.update(prob2.keys())
    return {word:prob1.get(word,0) - prob2.get(word,0) for word in all_words}


In [48]:
sub_dict = subtract_probs(guten_probs,web_probs)
sub_sorted_words = sorted(sub_dict.keys(),key=lambda x: sub_dict[x])

In [49]:
print(sub_sorted_words[:50])

["'", ':', '.', '#', 'i', 'you', '?', 't', '!', 'girl', 'guy', 'on', '-', '1', 's', '...', '2', ']', '[', 'like', 'don', 'm', 'a', 'yeah', 'page', 'firefox', 'when', 'can', 'woman', 'just', 're', 'get', 'chick', 'does', 'new', 'no', '(', ')', 'window', 'bookmarks', 'open', 'doesn', 'teen', 'firebird', 'cell', 'know', 'is', 'menu', 'tab', 'bar']


In [50]:
print(sub_sorted_words[-50:])

['men', 'from', 'house', 'hath', 'israel', 'before', 'king', 'will', 'at', '--', 'came', 'she', 'god', 'upon', 'but', ',"', 'ye', '."', 'with', 'thee', 'were', 'in', 'thy', 'by', 'to', 'thou', 'their', 'for', 'all', 'they', 'be', 'her', 'which', 'had', 'said', 'them', 'lord', 'unto', 'as', 'him', 'that', 'was', 'shall', 'he', 'his', ';', 'of', 'and', 'the', ',']


When we subtract probabilities, we tend to get a lot of very common words (since these words have high probabilities to begin with).
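
A toy example makes this bias easy to see. The probabilities below are made up for illustration. Even though _thou_ is entirely absent from the second dictionary, the very common _the_ still ends up ranked as more characteristic of the first, simply because its absolute difference is larger.

```python
# made-up probabilities, for illustration only
prob1 = {"the": 0.06, "and": 0.03, "thou": 0.002}
prob2 = {"the": 0.04, "and": 0.03, "firefox": 0.001}

all_words = set(prob1) | set(prob2)
diffs = {word: prob1.get(word, 0) - prob2.get(word, 0) for word in all_words}

# most characteristic of prob2 first, most characteristic of prob1 last
ranked = sorted(diffs, key=lambda word: diffs[word])
print(ranked)
```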

Next, the ratio method. One problem with the ratio is the potential for divide-by-zero errors, hence we will only look at the shared vocabulary:

In [51]:
def divide_probs(prob1, prob2):
    '''given two probability dictionaries, create a dictionary that has the ratios of probabilities (prob1/prob2)
    for each word included in both'''
    all_words = set(prob1.keys()).intersection(prob2.keys())
    return {word:prob1[word]/prob2[word] for word in all_words}

In [52]:
div_dict = divide_probs(guten_probs,web_probs)
div_sorted_words = sorted(div_dict.keys(),key=lambda x: div_dict[x])

In [53]:
print(div_sorted_words[:50])

['guy', '0', 'clicking', 'tourist', 'folder', '+', 'download', 'password', 'dad', 'option', 'turner', 'install', 'phoenix', 'default', 'bitch', 'user', 'html', '?...', 'location', 'os', 'status', 'auto', 'context', 'bug', 'focus', 'settings', 'click', 'anymore', 'extension', 'disable', 'installed', 'pussy', 'delete', 'site', '***', 'jewish', 'cashier', 'cop', 'cute', 'loading', 'cancel', 'program', '<', '>', 'fails', 'font', 'yo', 'data', 'blocking', 'switch']


In [54]:
print(div_sorted_words[-50:])

['21', 'wrath', 'sight', 'ark', '19', 'offering', 'sin', 'abraham', ',"', 'offerings', 'cried', 'pharaoh', 'grace', 'among', '22', 'princes', 'elliot', 'receive', 'stood', 'angel', 'wilderness', '27', 'solomon', 'spirits', 'shalt', 'mercy', 'aaron', 'shall', 'whom', 'israel', '14', 'praise', 'wherefore', '29', 'spoken', 'replied', 'commanded', 'behold', 'thou', 'sons', 'spake', 'thus', 'therefore', 'mrs', 'thine', 'thy', 'thee', 'hast', 'whale', 'ye']


The words found using ratios are more distinctive, but there is also noise, because they include some very low probability types.

Question: What's a possible quick solution to this problem of noise? (Something we've already talked about in this lecture)

Answer: <span style="background-color:black;"> We could remove words with low counts in one or both corpora </span>
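
One way that count-based filter could be sketched is shown below, on two toy token lists standing in for the corpora. The function name `ratios_with_min_count` and the `min_count` threshold are assumptions introduced for illustration. A word is kept only if it occurs at least `min_count` times in both lists.

```python
from collections import Counter

def ratios_with_min_count(words1, words2, min_count):
    '''probability ratios (corpus 1 / corpus 2) for words seen at least
    min_count times in both corpora'''
    counts1 = Counter(word.lower() for word in words1)
    counts2 = Counter(word.lower() for word in words2)
    total1 = sum(counts1.values())
    total2 = sum(counts2.values())
    return {
        word: (counts1[word] / total1) / (counts2[word] / total2)
        for word in counts1.keys() & counts2.keys()
        if counts1[word] >= min_count and counts2[word] >= min_count
    }

# toy token lists standing in for two corpora
words1 = "the whale the whale the sea a ship".split()
words2 = "the page the page the menu a whale".split()

unfiltered = ratios_with_min_count(words1, words2, 1)
filtered = ratios_with_min_count(words1, words2, 2)
print(unfiltered)
print(filtered)
```

With a threshold of 1, _whale_ looks twice as likely in the first list on the strength of a single occurrence in the second. Raising the threshold to 2 removes it, along with every other word that is rare in either list.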

As it happens, there are more sophisticated statistical methods that find a better balance between these two extremes (being biased neither towards extremely common nor extremely rare words), but we'll stop here for now... 