# (3E-2) Word Counts

In this notebook, we'll learn:

* How to loop over lists
* How to count words (list of words -> word counts as dictionary)

We'll apply what we learned in the previous two notebooks about lists and dictionaries. Specifically, we're going to learn how to transform lists of words into dictionaries of word counts.

## Looping over lists

How do we count the words in a list? We need to be able to **"loop"** over each one individually. To do that, we use a `for` loop. Here's the syntax:

```python
for thing in list_of_things:
    print(thing)   # or do something else with thing
```

You can read this in human terms as: **For each** thing in this list of things, print the thing.

Let's try it out.

In [1]:
# First, here's a list of words to get us started
marx = ['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 
        'all', 'that', 'is', 'holy', 'is', 'profaned', ',', 
        'and', 'man', 'is', 'at', 'last', 'compelled', 'to', 'face', 'with', 'sober', 'senses',
        'his', 'real', 'conditions', 'of', 'life', ',', 
        'and', 'his', 'relations', 'with', 'his', 'kind', '.']

print(marx)

['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 'all', 'that', 'is', 'holy', 'is', 'profaned', ',', 'and', 'man', 'is', 'at', 'last', 'compelled', 'to', 'face', 'with', 'sober', 'senses', 'his', 'real', 'conditions', 'of', 'life', ',', 'and', 'his', 'relations', 'with', 'his', 'kind', '.']


In [2]:
# Ok, now let's loop!
for word in marx:
    print(word)

all
that
is
solid
melts
into
air
,
all
that
is
holy
is
profaned
,
and
man
is
at
last
compelled
to
face
with
sober
senses
his
real
conditions
of
life
,
and
his
relations
with
his
kind
.


In [7]:
# @TODO: Loop over each word in the Marx passage, but print the word only if it is longer than two letters
#
for word in marx:
    if len(word) > 2:   # strictly longer than two letters, so 'is', 'at', 'to' and 'of' are skipped
        print(word)


all
that
solid
melts
into
air
all
that
holy
profaned
and
man
last
compelled
face
with
sober
senses
his
real
conditions
life
and
his
relations
with
his
kind


In [8]:
# @TODO: Loop over each word in the Marx passage, but print the word only if the first letter is alphabetic
# hint: use the .isalpha() method of strings, which returns True if every character in the string is alphabetic (otherwise False)
#

for word in marx:
    if word[0].isalpha():
        print(word)


all
that
is
solid
melts
into
air
all
that
is
holy
is
profaned
and
man
is
at
last
compelled
to
face
with
sober
senses
his
real
conditions
of
life
and
his
relations
with
his
kind


### Advanced loops

#### How do we remember how far into the list we've looped?

Use the `enumerate()` wrapper around any list, and then iterate like this:

In [9]:
for index,word in enumerate(marx):   # 'index' is the index in the list marx at which 'word' is located
    print(index, word, marx[index])    

0 all all
1 that that
2 is is
3 solid solid
4 melts melts
5 into into
6 air air
7 , ,
8 all all
9 that that
10 is is
11 holy holy
12 is is
13 profaned profaned
14 , ,
15 and and
16 man man
17 is is
18 at at
19 last last
20 compelled compelled
21 to to
22 face face
23 with with
24 sober sober
25 senses senses
26 his his
27 real real
28 conditions conditions
29 of of
30 life life
31 , ,
32 and and
33 his his
34 relations relations
35 with with
36 his his
37 kind kind
38 . .
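
One place the index earns its keep is in recording where a word occurs. The sketch below uses a shortened, illustrative `words` list rather than the full `marx` list, and collects every position at which a chosen word appears.

```python
words = ['all', 'that', 'is', 'solid', ',', 'all', 'that', 'is', 'holy']

# Collect every index at which the target word occurs
target = 'is'
positions = []
for index, word in enumerate(words):
    if word == target:
        positions.append(index)

print(positions)   # [2, 7]
```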


#### Going backwards

Just wrap the list in `reversed()`.

In [None]:
for word in reversed(marx):
    print(word,end=' ')     # print without adding a newline afterward; instead add a ' '
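
Note that `reversed()` does not change the list itself; it returns an iterator that walks the list from the end. A minimal sketch, using a short illustrative `words` list:

```python
words = ['all', 'that', 'is', 'solid']

backwards = list(reversed(words))   # reversed() gives an iterator; list() collects it
print(backwards)   # ['solid', 'is', 'that', 'all']
print(words)       # ['all', 'that', 'is', 'solid'] (the original list is unchanged)
```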

#### How to stop a loop short

Use `break`.

In [None]:
for index,word in enumerate(marx):
    if index>=10:
        break      # stop the loop!
    
    print(index,word)
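
`break` is also handy when the stopping point depends on the data rather than on a fixed index. As a sketch (the `words` list and the `first_clause` name are illustrative), this collects words until the first comma:

```python
words = ['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 'all', 'that']

first_clause = []
for word in words:
    if word == ',':
        break                 # stop at the first comma
    first_clause.append(word)

print(first_clause)   # ['all', 'that', 'is', 'solid', 'melts', 'into', 'air']
```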

#### How to skip to the next iteration of the loop

Use `continue`.

In [None]:
for word in marx:
    if len(word)<3:
        continue          # skip right to the next iteration in the loop, don't even keep reading below
    
    print(word,end=' ')   # this won't run if we've already hit continue
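
The same pattern can filter out punctuation before counting. A small sketch, assuming a shortened `words` list: tokens whose first character is not alphabetic are skipped, and the rest are collected.

```python
words = ['all', 'that', 'is', 'holy', 'is', 'profaned', ',', 'and', 'man', '.']

only_words = []
for word in words:
    if not word[0].isalpha():
        continue              # skip punctuation tokens such as ',' and '.'
    only_words.append(word)

print(only_words)
# ['all', 'that', 'is', 'holy', 'is', 'profaned', 'and', 'man']
```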

## Counting words in a list with a loop

How do we count the words in a list? Let's:

1. Create an empty dictionary of word counts
2. Loop over each word in the text
3. For each word, add 1 to its entry in the dictionary of word counts


In [None]:
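# 1. Create an empty dictionary of word counts
wordcounts={}

# 2. Loop over each word in the text
for word in marx:
    
    # 3. For each word, add 1 to its entry in the dictionary of word counts
    #    (a word seen for the first time has no entry yet, so start it at 1;
    #    a bare wordcounts[word]+=1 would raise a KeyError)
    if word not in wordcounts:
        wordcounts[word]=1
    else:
        wordcounts[word]+=1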


In [None]:
wordcounts
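
As a cross-check, Python's standard library offers `collections.Counter`, which builds the same kind of word-to-count mapping in one step. This is an alternative to the hand-written loop, not what the notebook asks you to write; the short `words` list is illustrative.

```python
from collections import Counter

words = ['all', 'that', 'is', 'solid', ',', 'all', 'that', 'is', 'holy', 'is']

counter = Counter(words)        # counts every distinct item in the list
print(counter['is'])            # 3
print(counter['missing'])       # 0: a Counter returns 0 for unseen keys instead of raising KeyError
print(counter.most_common(2))   # [('is', 3), ('all', 2)]
```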

In [40]:
# @TODO: Write a function to produce a dictionary of counts from any list
def count(tokens):
    counts={}    # not named 'dict', which would shadow Python's built-in dict type
    for word in tokens:
        if word not in counts:
            counts[word]=1
        else:
            counts[word]+=1
    return counts
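
To see what such a function returns, and which words are most frequent, one option is to sort the dictionary's items by count. The sketch below repeats a version of the counting function (named `count_tokens` here so that it stands alone) and uses a short illustrative list.

```python
def count_tokens(tokens):
    """Return a dictionary mapping each token to how often it occurs."""
    counts = {}
    for token in tokens:
        if token not in counts:
            counts[token] = 1
        else:
            counts[token] += 1
    return counts

words = ['all', 'that', 'is', 'solid', 'all', 'that', 'is', 'holy', 'is']
counts = count_tokens(words)

# Sort (word, count) pairs from most to least frequent
by_frequency = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
for word, n in by_frequency[:3]:
    print(word, n)
```

Because Python's sort is stable, words with equal counts keep the order in which they first appeared in the list.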

In [41]:
# @TODO: Write a function to produce a dictionary of relative counts (*term frequencies*)
# hint: divide by number of words

def tf(tokens):
    wordcounts = {}
    num_words = len(tokens)   # total number of tokens in the list
    for word in tokens:
        if word in wordcounts:
            wordcounts[word]+=1
        else:
            wordcounts[word] = 1
    for word in wordcounts.keys():
        wordcounts[word] = wordcounts[word] / num_words   # raw count -> relative count
    return wordcounts
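
A quick worked example makes the arithmetic concrete: in a list of 5 tokens where 'is' appears twice, its term frequency is 2 / 5 = 0.4. The sketch below is one possible helper (`term_frequencies` is a hypothetical name chosen so the example stands alone), checked against that arithmetic.

```python
def term_frequencies(tokens):
    """Return each token's count divided by the total number of tokens."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    total = len(tokens)
    return {token: n / total for token, n in counts.items()}

freqs = term_frequencies(['is', 'holy', 'is', 'profaned', ','])
print(freqs['is'])            # 0.4
print(sum(freqs.values()))    # approximately 1.0, since the shares cover all tokens
```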
    

## Text to Count Pipeline

First, let's recapitulate some steps we already know:

### 1. Open a text

In [42]:
# @TODO: Write this function
#

def file2string(filename):
    """
    This function takes a filename,
    opens the file,
    and returns a string corresponding to the file's contents.
    """
    with open(filename) as file:
        return file.read()
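
One detail worth knowing: `open()` falls back on a platform-dependent default encoding, so passing `encoding='utf-8'` explicitly can avoid surprises with non-ASCII text. The sketch below assumes the file is UTF-8 and writes a temporary file so that the example is self-contained; `read_text_file` is an illustrative name.

```python
import os
import tempfile

def read_text_file(filename, encoding='utf-8'):
    """Open the file, read all of it, and return the contents as one string."""
    with open(filename, encoding=encoding) as file:
        return file.read()

# Write a small temporary file so there is something to read
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('All that is solid melts into air.')
    path = tmp.name

text = read_text_file(path)
print(text)        # All that is solid melts into air.
os.remove(path)    # clean up the temporary file
```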

In [23]:
# @TODO: Use your function on either one of your texts or one of Yamashita's
#

food = file2string('../corpora/gertrude/food.txt')

### 2. Tokenize the text

In [43]:
# @TODO: Write this function
#

def tokenize(string):
    """
    This function takes in a string,
    lower-cases it,
    and returns a list of words using NLTK's tokenizer.
    """
    import nltk
    s = string.lower()
    return nltk.word_tokenize(s)
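
If NLTK or its tokenizer data is not installed, a rough stand-in can be built with the standard library's `re` module. This is a swapped-in technique, not NLTK's algorithm, and its output will differ from `nltk.word_tokenize` on contractions and other edge cases; `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(string):
    """Lower-case the string and split it into word and punctuation tokens."""
    # \w+(?:-\w+)* matches words, keeping internal hyphens (e.g. 'oat-meal');
    # [^\w\s] matches any single punctuation character
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", string.lower())

tokens = simple_tokenize('Salad dressing and an artichoke; a centre in a table.')
print(tokens)
```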

In [46]:
# @TODO: Use your function to tokenize the text you opened above
#
food_tokens = tokenize(food)
print(food_tokens)

['food', 'roastbeef', ';', 'mutton', ';', 'breakfast', ';', 'sugar', ';', 'cranberries', ';', 'milk', ';', 'eggs', ';', 'apple', ';', 'tails', ';', 'lunch', ';', 'cups', ';', 'rhubarb', ';', 'single', ';', 'fish', ';', 'cake', ';', 'custard', ';', 'potatoes', ';', 'asparagus', ';', 'butter', ';', 'end', 'of', 'summer', ';', 'sausages', ';', 'celery', ';', 'veal', ';', 'vegetable', ';', 'cooking', ';', 'chicken', ';', 'pastry', ';', 'cream', ';', 'cucumber', ';', 'dinner', ';', 'dining', ';', 'eating', ';', 'salad', ';', 'sauce', ';', 'salmon', ';', 'orange', ';', 'cocoa', ';', 'and', 'clear', 'soup', 'and', 'oranges', 'and', 'oat-meal', ';', 'salad', 'dressing', 'and', 'an', 'artichoke', ';', 'a', 'centre', 'in', 'a', 'table', '.', 'roastbeef', '.', 'in', 'the', 'inside', 'there', 'is', 'sleeping', ',', 'in', 'the', 'outside', 'there', 'is', 'reddening', ',', 'in', 'the', 'morning', 'there', 'is', 'meaning', ',', 'in', 'the', 'evening', 'there', 'is', 'feeling', '.', 'in', 'the', 'even

### 3. [New] Count the tokens!

In [50]:
# @TODO: Use the function developed above to count the tokens in your text
#
count(food_tokens)

{'food': 1,
 'roastbeef': 2,
 ';': 39,
 'mutton': 5,
 'breakfast': 4,
 'sugar': 5,
 'cranberries': 2,
 'milk': 3,
 'eggs': 3,
 'apple': 4,
 'tails': 2,
 'lunch': 4,
 'cups': 3,
 'rhubarb': 4,
 'single': 15,
 'fish': 7,
 'cake': 11,
 'custard': 3,
 'potatoes': 8,
 'asparagus': 3,
 'butter': 10,
 'end': 3,
 'of': 63,
 'summer': 3,
 'sausages': 3,
 'celery': 3,
 'veal': 2,
 'vegetable': 4,
 'cooking': 4,
 'chicken': 10,
 'pastry': 2,
 'cream': 5,
 'cucumber': 2,
 'dinner': 4,
 'dining': 3,
 'eating': 10,
 'salad': 6,
 'sauce': 4,
 'salmon': 7,
 'orange': 6,
 'cocoa': 3,
 'and': 321,
 'clear': 6,
 'soup': 10,
 'oranges': 4,
 'oat-meal': 3,
 'dressing': 3,
 'an': 41,
 'artichoke': 3,
 'a': 472,
 'centre': 6,
 'in': 169,
 'table': 2,
 '.': 438,
 'the': 190,
 'inside': 4,
 'there': 138,
 'is': 408,
 'sleeping': 1,
 ',': 482,
 'outside': 2,
 'reddening': 1,
 'morning': 3,
 'meaning': 4,
 'evening': 4,
 'feeling': 7,
 'anything': 8,
 'resting': 2,
 'mounting': 3,
 'resignation': 2,
 'recognitio

### 4. Repeat steps 1-3 for another text

In [52]:
# @TODO: Repeat steps 1-3 for another text
#

rafaela_str = file2string('../corpora/tropic_of_orange/texts/ch01.txt')
rafaela_tokens = tokenize(rafaela_str)
rafaela_counts = count(rafaela_tokens)

rafaela_counts['he']

41
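
Since steps 1-3 are the same for every text, they can be bundled into one helper. A sketch under assumptions: `text2counts` is a hypothetical name, it starts from a string rather than a filename, and `str.split()` stands in for the NLTK tokenizer so that the example runs without extra data (it will not separate punctuation from words).

```python
def text2counts(text):
    """Lower-case a string, split it on whitespace, and count the tokens."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

counts = text2counts('A rose is a rose is a rose')
print(counts)   # {'a': 3, 'rose': 3, 'is': 2}
```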

### 5. Compare word counts

In [58]:
# @TODO: Loop over a list of words that are interesting to you,
# and print their relative counts in both your texts
#

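interesting_words = ['orange','tree','sun']
for word in interesting_words:
    # raw counts in the second text only; .get() returns 0 instead of raising a KeyError for unseen words
    print(word+': '+str(rafaela_counts.get(word, 0)))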

orange: 18
tree: 19
sun: 4
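
To compare relative counts across both texts, as the prompt suggests, one approach is to divide each raw count by the text's total number of tokens and to use `.get(word, 0)` so that a word missing from one text counts as zero. The token lists below are tiny made-up stand-ins, not the actual corpora, and `relative_counts` is a hypothetical helper name.

```python
def relative_counts(tokens):
    """Return each token's share of all tokens in the list."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return {token: n / len(tokens) for token, n in counts.items()}

# Made-up token lists standing in for two tokenized texts
text_a = ['the', 'orange', 'tree', 'and', 'the', 'sun']
text_b = ['an', 'orange', 'is', 'an', 'orange']

tf_a = relative_counts(text_a)
tf_b = relative_counts(text_b)

for word in ['orange', 'tree', 'sun']:
    # .get(word, 0) avoids a KeyError when a word is absent from a text
    print(word, round(tf_a.get(word, 0), 3), round(tf_b.get(word, 0), 3))
```

Relative counts make texts of different lengths comparable, which raw counts do not.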
