# Counters and plotting

by Koenraad De Smedt at UiB

---
We have already seen that a frequency distribution (FreqDist) in NLTK is a kind of counter. We can also make our own counters of items in a sequence.

This notebook shows:
1.   How to make a counter with the help of `Counter` (a subclass of *dict*) from the `collections` module.
2.   How to plot the counts in a bar or line plot.

---



The genetic code in RNA is also a language. As an example, consider the sequence of nucleotides in the 5' region of [the mRNA in the BNT162b2 vaccine](https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/). We want to count how many times each nucleotide occurs in the sequence.

In [None]:
from collections import Counter
fiveprimeutr = 'GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC'
cntr = Counter(fiveprimeutr)
cntr
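
Because `Counter` is a subclass of *dict*, individual counts can be looked up by key. The following is a minimal sketch that repeats the cell above so that it runs on its own; the lookup of `'U'` is only an illustration of a symbol that does not occur in this sequence.

```python
from collections import Counter

# Same sequence as in the cell above.
fiveprimeutr = 'GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC'
cntr = Counter(fiveprimeutr)

print(cntr['A'])   # the count for one nucleotide
print(cntr['U'])   # an item that was never seen gives 0, not a KeyError
# The individual counts add up to the length of the sequence.
print(sum(cntr.values()) == len(fiveprimeutr))
```

Unlike a plain *dict*, a counter returns 0 for an item it has not seen, and looking up such an item does not add it to the counter.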

Make a bar plot of the counter with the `pyplot` module. A counter is a kind of *dict*, so we can retrieve its keys and values.

In [None]:
import matplotlib.pyplot as plt
plt.bar(cntr.keys(), cntr.values(), align='center')
plt.title('Nucleotide frequencies')
plt.show()

---

Let's find the character distribution in a small text.

In [None]:
poem = '''On the Pulse of Morning
by Maya Angelou.

A Rock, A River, A Tree
Hosts to species long since departed,
Marked the mastodon,
The dinosaur, who left dried tokens
Of their sojourn here
On our planet floor,
Any broad alarm of their hastening doom
Is lost in the gloom of dust and ages.'''

charcount = Counter(poem.casefold())
charcount

A counter is not sorted by count: it keeps its items in the order in which they were first seen. To get the counted items by decreasing count, we use the `.most_common()` method. This produces a list of (item, count) pairs.

In [None]:
charcount.most_common()
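
To see the difference between the order in which a counter stores its items and the order produced by `.most_common()`, here is a small sketch. The word `'banana'` is a made-up example, not taken from the poem.

```python
from collections import Counter

c = Counter('banana')

# Items are stored in the order in which they were first seen (Python 3.7+).
print(list(c.items()))    # [('b', 1), ('a', 3), ('n', 2)]

# most_common() sorts the pairs by decreasing count.
print(c.most_common())    # [('a', 3), ('n', 2), ('b', 1)]

# With an argument n, only the n most frequent pairs are returned.
print(c.most_common(1))   # [('a', 3)]
```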

Let's get the 10 most common items and print each character next to its count. The space character is, of course, invisible when printed.

In [None]:
for ch, val in charcount.most_common(10):
    print(ch, val)
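
One possible way to make whitespace visible is to print `repr(ch)` instead of `ch`. The sketch below assumes a short stand-in string (`sample`, a hypothetical name) rather than the whole poem, so that it runs on its own.

```python
from collections import Counter

# Short stand-in for the poem, containing spaces and one newline.
sample = 'A Rock,\nA River'

top = Counter(sample.casefold()).most_common(3)
for ch, val in top:
    # repr() shows the character in quotes, so a space appears as ' '
    # and a newline would appear as '\n'.
    print(repr(ch), val)
```

The same change can be applied to the loop over `charcount` above.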

Put only the frequencies of all items, in decreasing order, into a list.

In [None]:
count_values = [val for ch, val in charcount.most_common()]
print(count_values)

Make a line plot of all frequencies with the `pyplot` module.

In [None]:
plt.plot(count_values)

Make the plot a bit fancier. Uncomment the commented lines to specify the plot resolution and save the figure to file.

In [None]:
# plt.figure(dpi=120)
plt.plot(count_values)
plt.title('Character frequencies')
plt.ylabel('Frequency')
plt.xlabel('Index')
# plt.savefig('charfreq')
plt.show()
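
The x-axis of the line plot only shows positions in the sorted list. As a possible variation, the sketch below draws a bar plot in which each bar is labelled with the character it stands for. It assumes a short stand-in string (`text`, one line from the poem) instead of the whole poem so that it runs on its own, and it uses `repr()` so that the space gets a visible label.

```python
from collections import Counter
import matplotlib.pyplot as plt

text = 'A Rock, A River, A Tree'   # short stand-in for the poem above
pairs = Counter(text.casefold()).most_common()

chars = [repr(ch) for ch, val in pairs]   # labels such as 'a' or ' '
values = [val for ch, val in pairs]       # counts in decreasing order

plt.bar(chars, values)
plt.title('Character frequencies')
plt.ylabel('Frequency')
plt.xlabel('Character')
plt.show()
```

With a longer text, the many labels may overlap, so this variation is mostly useful for short texts or for the top few items.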

### Exercises

1.  If you look at the distribution of characters above, or in another text, what do you observe? Why would it not be a straight line?
2.  For which purposes can knowledge about character distributions be useful?
3.  Can you make a counter of a set? Does it make sense?
4.  Compute the relative frequencies of the characters in the text, that is, for each character, the number of its occurrences divided by the length of the text.