In this article, we will take a journey and learn how to count word frequencies using a plain dictionary, `defaultdict`, and `Counter`. At the end of this journey, I hope that you will come to appreciate the power of the various kinds of dictionaries and their applications.

# The Problem

One of the problems we often deal with when processing text is calculating word frequencies from a block of text. For example, given the following block:

> Watch your thoughts; they become words.  
> Watch your words; they become actions.  
> Watch your actions; they become habits.  
> Watch your habits; they become character.  
> Watch your character; it becomes your destiny.  

How can we come up with word frequencies?

# How to Split the Words

Before we can construct a dictionary of word frequencies, we need a way to parse the block of text and get a list of words. There are many ways to do this, but the easiest is to use a regular expression to split the block of text at runs of non-word characters:

In [1]:
import re

text = """Watch your thoughts; they become words
Watch your words; they become actions
Watch your actions; they become habits
Watch your habits; they become character
Watch your character; it becomes your destiny"""

words = re.split(r'\W+', text.lower())
print(words)

['watch', 'your', 'thoughts', 'they', 'become', 'words', 'watch', 'your', 'words', 'they', 'become', 'actions', 'watch', 'your', 'actions', 'they', 'become', 'habits', 'watch', 'your', 'habits', 'they', 'become', 'character', 'watch', 'your', 'character', 'it', 'becomes', 'your', 'destiny']


Notes:

* In the example above, we split the block of text using the regular expression `r'\W+'`. The `r''` notation tells the Python interpreter to treat the string as a *raw* string, in which the backslash is not interpreted as an escape character. The expression `\W` matches any non-word character, and the plus sign means one or more of them.
* In order to normalize the text, we turn the block into lower case with `text.lower()`.
* The `re.split()` function then returns a list of words.
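
One caveat worth knowing: if the text starts or ends with a non-word character (for example, a trailing period as in the quoted block), `re.split()` returns an empty string at that end. Here is a minimal sketch of the effect, plus one possible workaround using `re.findall()`. The `sample` string is made up for illustration:

```python
import re

# Hypothetical sample that ends with punctuation
sample = "Watch your words; they become actions."

# Splitting at non-word runs leaves an empty string after the final period
split_words = re.split(r'\W+', sample.lower())
print(split_words)
# ['watch', 'your', 'words', 'they', 'become', 'actions', '']

# Matching runs of word characters instead avoids the empty string
found_words = re.findall(r'\w+', sample.lower())
print(found_words)
# ['watch', 'your', 'words', 'they', 'become', 'actions']
```

The text used in this article has no trailing punctuation, so `re.split()` is fine here.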

Now that we have a list of words in hand, we can go ahead and count them using a couple of different approaches, all of which employ some form of dictionary.

# First Solution: Using Dictionary

A dictionary is a natural solution in this case. We can have a dictionary whose keys represent the words and whose values represent the count:

In [2]:
frequency = {}
for word in words:
    if word not in frequency:
        frequency[word] = 0
    frequency[word] += 1

frequency

{'actions': 2,
 'become': 4,
 'becomes': 1,
 'character': 2,
 'destiny': 1,
 'habits': 2,
 'it': 1,
 'they': 4,
 'thoughts': 1,
 'watch': 5,
 'words': 2,
 'your': 6}

Notes:

* In the loop, if the word is not yet found in the `frequency` dictionary, then we initialize it with zero
* Next, we increase the count for that word
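
To make the two steps in the notes concrete, here is a small trace of the same loop that prints the dictionary after each word. The short `sample_words` list is a stand-in chosen to keep the output readable:

```python
# Short stand-in list so the trace stays readable
sample_words = ['watch', 'your', 'watch']

frequency = {}
for word in sample_words:
    if word not in frequency:
        frequency[word] = 0      # first sighting: start the count at zero
    frequency[word] += 1         # every sighting: add one
    print(word, frequency)

# watch {'watch': 1}
# your {'watch': 1, 'your': 1}
# watch {'watch': 2, 'your': 1}
```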

This solution is simple and easy for a newbie to understand, but we can still improve it. Let's take a detour and learn about a dictionary feature which will make our code simpler.

# Detour: A Useful Dictionary Feature: setdefault

A Python dictionary object has many useful methods, one of which is `.setdefault()`. Let's see how it works:

In [3]:
d = {'a': 1, 'b': 2}
d.setdefault('c', 9)

9

In [4]:
d

{'a': 1, 'b': 2, 'c': 9}

The English translation for the above would be: if the key **c** is not yet in the dictionary, add it with value **9**. Note that `setdefault()` also returns this value, **9**. Let's take another example:

In [5]:
d = {'a': 1, 'b': 2}
d.setdefault('a', 100)

1

In [6]:
d

{'a': 1, 'b': 2}

In this case, the key **a** is already in the dictionary, so `setdefault()` does not change anything. Instead, it simply returns the current value of `d['a']`.
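
Because `setdefault()` returns the stored value, it can be chained with a method call. As an illustration that goes slightly beyond counting, here is a sketch that groups words by their first letter. The `by_letter` name and the sample list are invented for this example:

```python
sample_words = ['watch', 'your', 'words', 'they', 'thoughts']

by_letter = {}
for word in sample_words:
    # setdefault() returns the list stored under the key, creating an
    # empty one first if the key is missing, so we can append right away
    by_letter.setdefault(word[0], []).append(word)

print(by_letter)
# {'w': ['watch', 'words'], 'y': ['your'], 't': ['they', 'thoughts']}
```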

# Second Solution: Using Dictionary and setdefault

With this feature, we can improve the solution above:

In [7]:
frequency = {}
for word in words:
    frequency.setdefault(word, 0)
    frequency[word] += 1

frequency

{'actions': 2,
 'become': 4,
 'becomes': 1,
 'character': 2,
 'destiny': 1,
 'habits': 2,
 'it': 1,
 'they': 4,
 'thoughts': 1,
 'watch': 5,
 'words': 2,
 'your': 6}

This time around, we call `setdefault()` to initialize the count of the *word* to 0. If the word is already in the dictionary, the call is a no-op. Knowing that `setdefault()` also returns the current value, we can shorten our code further:

In [8]:
frequency = {}
for word in words:
    frequency[word] = frequency.setdefault(word, 0) + 1

frequency

{'actions': 2,
 'become': 4,
 'becomes': 1,
 'character': 2,
 'destiny': 1,
 'habits': 2,
 'it': 1,
 'they': 4,
 'thoughts': 1,
 'watch': 5,
 'words': 2,
 'your': 6}

While this solution is even shorter than before, I still like the former one for being easier to understand.

# Detour: defaultdict

Let us take another detour and look at a specialized dictionary: `defaultdict`. This dictionary behaves almost identically to a normal dictionary, except when we look up a key that does not exist:

In [9]:
normal_dict = dict(a=1, b=2)

# Access the non-existing key
normal_dict['c'] 

KeyError: 'c'

In [10]:
from collections import defaultdict

int_dict = defaultdict(int, a=1, b=2)

# Access the non-existing key
int_dict['c']

0

In [11]:
int_dict

defaultdict(int, {'a': 1, 'b': 2, 'c': 0})

The first argument to `defaultdict` is called a factory, which is any callable (function, method, or class) that can be called without arguments. In this case we passed `int` as the factory, and `int()` returns `0`.

As you can see, when we access the missing key **c**, the normal dictionary raises a `KeyError` exception, whereas the `defaultdict` simply creates a new key with the value `int()`, or 0.
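
The factory does not have to be `int`. The sketch below tries two other factories and also shows that only a square-bracket lookup triggers the factory, while `in` and `.get()` leave the dictionary untouched. The names `list_dict` and `seven_dict` are made up for this illustration:

```python
from collections import defaultdict

# Any zero-argument callable works as a factory
list_dict = defaultdict(list)          # list() -> []
seven_dict = defaultdict(lambda: 7)    # hypothetical factory returning 7

list_dict['x'].append(1)
print(list_dict['x'])     # [1]
print(seven_dict['y'])    # 7

# Only square-bracket lookup triggers the factory
int_dict = defaultdict(int)
print('c' in int_dict)        # False
print(int_dict.get('c'))      # None
print(len(int_dict))          # 0
int_dict['c']                 # this lookup inserts 'c' with value 0
print(len(int_dict))          # 1
```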

# Third Solution: Using defaultdict

Now that we have learned about `defaultdict`, we can see that it fits our problem naturally. Instead of using a normal dictionary and calling `setdefault()`, we can use `defaultdict` to simplify the code:

In [12]:
from collections import defaultdict

frequency = defaultdict(int)

for word in words:
    frequency[word] += 1

frequency

defaultdict(int,
            {'actions': 2,
             'become': 4,
             'becomes': 1,
             'character': 2,
             'destiny': 1,
             'habits': 2,
             'it': 1,
             'they': 4,
             'thoughts': 1,
             'watch': 5,
             'words': 2,
             'your': 6})

This solution is the shortest and easiest to understand so far (provided that we know how `defaultdict` works):

* The first time we refer to `frequency[word]`, it will be initialized with value of `int()` or zero
* Then, the expression `frequency[word] += 1` increments it

This solution seems as short as it gets. Can we do better? It turns out that we can. Let's take another detour and learn about the `Counter` class.

# Detour: The Counter Class

When it comes to counting things, the `collections` module in Python's standard library has a wonderful gift for you: the `Counter` class. Let's take a look at some examples:

In [13]:
from collections import Counter

counter = Counter()
counter.update(['a', 'b', 'c', 'a', 'a'])
counter

Counter({'a': 3, 'b': 1, 'c': 1})

In [14]:
counter['a']

3

In the example above, we create a new counter object and update it with a list of strings; the counter automatically tallies how often each string occurs.

Also note that the counter behaves just like a dictionary: `counter['a']` returns the frequency for `'a'`.

It turns out that we can create a new counter object and initialize it at the same time.

In [15]:
counter = Counter(['a', 'b', 'c', 'a', 'a', 'b'])
counter

Counter({'a': 3, 'b': 2, 'c': 1})

A `Counter` object has other methods that you might find useful. For example:

In [16]:
counter.most_common(2)

[('a', 3), ('b', 2)]
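
A few other `Counter` behaviors are handy to know. The sketch below reuses the counter from the previous cell; the key `'z'` and the extra items are arbitrary values picked for illustration:

```python
from collections import Counter

counter = Counter(['a', 'b', 'c', 'a', 'a', 'b'])

# A missing key reports a count of 0 and is not added to the counter
print(counter['z'])        # 0
print('z' in counter)      # False

# update() adds to the existing counts rather than replacing them
counter.update(['c', 'c'])
print(counter['c'])        # 3

# Counters can be added together
total = Counter(a=1) + Counter(a=2, b=5)
print(total['a'], total['b'])   # 3 5
```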

# Fourth Solution: Using Counter

Once we have learned about the `Counter` class, the solution becomes a one-liner:

In [17]:
from collections import Counter

frequency = Counter(words)
frequency

Counter({'actions': 2,
         'become': 4,
         'becomes': 1,
         'character': 2,
         'destiny': 1,
         'habits': 2,
         'it': 1,
         'they': 4,
         'thoughts': 1,
         'watch': 5,
         'words': 2,
         'your': 6})
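
As a sanity check, the four approaches can be run side by side to confirm that they agree. This is a self-contained sketch; the variable names (`plain`, `with_setdefault`, and so on) are invented here to keep the results apart:

```python
import re
from collections import Counter, defaultdict

text = """Watch your thoughts; they become words
Watch your words; they become actions
Watch your actions; they become habits
Watch your habits; they become character
Watch your character; it becomes your destiny"""
words = re.split(r'\W+', text.lower())

# First solution: plain dictionary
plain = {}
for word in words:
    if word not in plain:
        plain[word] = 0
    plain[word] += 1

# Second solution: dictionary with setdefault()
with_setdefault = {}
for word in words:
    with_setdefault[word] = with_setdefault.setdefault(word, 0) + 1

# Third solution: defaultdict
with_defaultdict = defaultdict(int)
for word in words:
    with_defaultdict[word] += 1

# Fourth solution: Counter
with_counter = Counter(words)

# Convert the specialized dictionaries to plain ones before comparing
print(plain == with_setdefault == dict(with_defaultdict) == dict(with_counter))
# True

print(with_counter.most_common(2))
# [('your', 6), ('watch', 5)]
```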

# Conclusion

Counting word frequency is a great way to learn about various kinds of dictionaries and their capabilities. After reading through this article, many newbies will undoubtedly ask:

> Why didn't you just show me the `Counter` solution and be done with it? Why take the long road with so many detours, only to end up with `Counter`?

There are several reasons why I decided to take the long road:

* This is an opportunity to learn about different kinds of dictionaries
* In order to learn how to run (`Counter`), we first need to learn how to walk (`dict`)
* By learning how to do it with a normal dictionary, we learn the mechanics of how things work. The knowledge you gain here will stay with you for a long time
