# Analysis of Sequences in a Text File

We are going to use two large texts to analyse, for each sequence length n, the number of distinct sequences observed, the fraction of all possible sequences of length n that this number represents, and the entropy per character of the sequences.

The texts are *Ulysses* by James Joyce and *Clarissa, or, the History of a Young Lady* by Samuel Richardson. Both are available at Project Gutenberg.

To compute the entropy we use the Python script `entropy.py`, available at [*clscripts*](https://github.com/leolca/clscripts).

In [4]:
wget -q https://www.gutenberg.org/files/4300/4300-0.txt -O /tmp/ulysses.txt
wget -q https://github.com/leolca/clscripts/raw/master/entropy.py -O entropy.py
chmod +x entropy.py
wget -q https://github.com/leolca/clscripts/raw/master/ngram -O ngram
chmod +x ngram
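
For readers who want to see what such an entropy computation involves, here is a minimal sketch in awk. It assumes the input is one count per line, and the helper name `entropy_bits` is hypothetical. It is not the code of `entropy.py`, whose exact behaviour should be checked in *clscripts*.

```shell
# Hypothetical helper (not part of clscripts): Shannon entropy, in bits,
# of the distribution given by one count per input line.
entropy_bits() {
  awk '
    { count[NR] = $1; total += $1 }
    END {
      h = 0
      for (i = 1; i <= NR; i++) {
        if (count[i] > 0) {
          p = count[i] / total
          h -= p * log(p) / log(2)   # log2(p) = ln(p) / ln(2)
        }
      }
      printf "%.4f\n", h
    }'
}

# Four equally frequent symbols carry 2 bits each.
printf '5\n5\n5\n5\n' | entropy_bits
```

Dividing the entropy of the n-gram counts by n would then give a per-character figure.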

The following script does the hard work. It receives two parameters: the text file name and the maximum sequence length we are going to analyse.

We take a case-insensitive approach, so uppercase letters are converted to lowercase. We are only interested in sequences made of the characters 'a-z'.
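
This normalization can be sketched with two `tr` calls, the same pair that appears in the analysis script further down. The function name `normalize` is an assumption made for illustration.

```shell
# Hypothetical helper: lowercase the input and keep only the letters a-z.
# Everything else (spaces, digits, punctuation, newlines, accented letters)
# is deleted, so the result is a single unbroken string.
normalize() {
  LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C tr -dc 'a-z'
}

printf 'Time Table, 2nd ed.\n' | normalize; echo
# prints: timetablended
```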

To create sequences of length n, we could simply use *fold*. In this case the sequences do not overlap. See the example:

In [31]:
echo "abcdef" | fold -w3

abc
def


To get overlapping sequences, we need to use the *ngram* script from *clscripts*. See the example:

In [32]:
echo "abcdef" | ./ngram -n 3

abc
bcd
cde
def
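
The sliding window itself is easy to sketch in awk. The function `ngrams` below is a hypothetical stand-in, not the *ngram* script from *clscripts*, and it treats each input line as one unbroken string.

```shell
# Hypothetical helper: print every overlapping substring of length n
# (first argument) for each input line.
ngrams() {
  awk -v n="$1" '{
    for (i = 1; i + n - 1 <= length($0); i++)
      print substr($0, i, n)
  }'
}

echo "abcdef" | ngrams 3
```

A line of length L yields L - n + 1 sequences, compared with roughly L / n for the non-overlapping `fold` approach.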


By default, this script also avoids extracting n-grams across word boundaries. Observe in the example below that the sequences 'met' and 'eta' (which straddle the two words) are not generated.

In [33]:
echo "time table" | ./ngram -n 3

tim
ime
tab
abl
ble


For short sequences (small n), that might be desirable. If we keep this approach for large values of n, we get fewer sequences, since a language does not have many long words. Below is a simple example that considers only sequences of length n=6 and n=7.

In [40]:
echo "the train time table changed" | tee >(./ngram -n 6) >(./ngram -n 7) > /dev/null

changed
change
hanged
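
The effect can be quantified with a small awk sketch that slides the window inside each whitespace-separated word only. The function `word_ngrams` is a hypothetical stand-in for the default behaviour of *ngram* shown above, not its actual implementation.

```shell
# Hypothetical helper: overlapping n-grams that never cross a word boundary.
word_ngrams() {
  awk -v n="$1" '{
    for (f = 1; f <= NF; f++)
      for (i = 1; i + n - 1 <= length($f); i++)
        print substr($f, i, n)
  }'
}

# Count how many n-grams survive as n grows.
for n in 3 5 6 7; do
  count=$(echo "the train time table changed" | word_ngrams "$n" | awk 'END { print NR }')
  echo "n=$n: $count"
done
```

For this sentence the count drops from 14 at n=3 to 1 at n=7, because only 'changed' is long enough to hold a 7-letter window.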


We can also disregard word boundaries by removing them. Using the simple 'time table' example, we observe that the sequences 'met' and 'eta' are now generated.

In [43]:
echo "time table" | tr -dc 'a-z' | ./ngram -n 3

tim
ime
met
eta
tab
abl
ble


We now need to list the unique sequences and count how many times each one appears, using the GNU tools *sort* and *uniq*. The result is used to 1) compute the entropy, 2) count the number of sequences observed, and 3) plot a graph using *gnuplot*. To do so, we use *tee* to branch the main stream into several pipes. This creates a problem: each subshell might take a different amount of time to finish its job. To synchronize everything we use locks. The results are all saved in temporary files, which are read at the end to retrieve the computed values. Some versions of bash have a bug in the implementation of locks, so we also add a while loop that waits until the files are non-empty. Finally, we must make sure that the temporary files are deleted, so we set a trap that runs regardless of how the script ends.
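
The counting step, and the idea of keeping intermediate results in a temporary file, can be sketched as follows. This is a simplified variant, not the script discussed here: it writes the ranked table to a file once and then reads it sequentially, which avoids the need for locks at the cost of giving up the parallel branches. The helper names `rank_counts` and `summarise` are hypothetical.

```shell
# Hypothetical helper: turn a stream of sequences (one per line) into a
# ranked table with tab-separated columns: rank, count, sequence.
rank_counts() {
  sort | uniq -c | sort -nr | awk '{ print NR "\t" $1 "\t" $2 }'
}

# Hypothetical helper: store the ranked table once, then read it twice to
# report "<distinct sequences> <total sequences>".
summarise() {
  table=$(mktemp) || return 1
  rank_counts > "$table"
  distinct=$(awk 'END { print NR }' "$table")
  total=$(awk '{ sum += $2 } END { print sum + 0 }' "$table")
  rm -f "$table"          # explicit cleanup of the temporary file
  echo "$distinct $total"
}

printf 'ab\nbc\nab\nab\nbc\ncd\n' | summarise
```

Here the temporary file is removed explicitly at the end of the function. In a longer script that can fail midway, an `EXIT` trap gives the same guarantee on every exit path.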

In [49]:
cat sequenceanalysis.sh

#!/bin/bash
FILENAME=$1
LIMIT=$2
mkdir -p imgs
printf "%-20s%-20s%-20s%-20s%-20s\n" "n" "entropy/char" "num_of_seq." "%_of_total" "typ_set_size"
for i in `seq $LIMIT`
do
  tmpa=$(mktemp) tmpb=$(mktemp)
  trap 'rm "$tmpa" "$tmpb"' EXIT
  imgfilename=$(echo "$FILENAME" | cut -f 1 -d '.')`printf %03d $i`.png
  cat $FILENAME |
  (
    flock 3
    flock 4

    #tr 'A-Z' 'a-z' | tr -dc 'a-z' | fold -w$i | sort |
    #tr 'A-Z\n' 'a-z ' | tr -dc 'a-z ' | ./ngram -n $i | sort |
    tr 'A-Z' 'a-z' | tr -dc 'a-z' | ./ngram -n $i | sort |
    uniq -c | sort -nr | awk '{print NR "\t" $0}' |
    tee >(
        awk '{print $2}' | ./entropy.py | { sleep 0.2; cat; } > "$tmpa"
        flock -u 3
        ) >(
        awk 'END{ print NR }' |  { sleep 0.3; cat; } > "$tmpb"
        flock -u 4
        ) >(
        gnuplot -e "set terminal png; set output '$imgfilename'; set xlabel 'rank'; set ylabel 'counts'; set title 'sequence length: $i'; set key left top; set logscale x 10; set logscale y 10; plot '-' us

As a result, this script gives us: 1) the entropy per character; 2) the number of sequences observed in the data for a given length n; 3) the fraction of all possible sequences of length n that were in fact observed in the data (the column is labelled `%_of_total`, but its values are fractions, so 1 means 100%); 4) an approximation to the typical set size, $2^{nH}$. Observe that, as n gets large, the number of observed sequences approaches the size of the typical set.
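
These quantities can be written down explicitly. The following is a sketch under one assumption, inferred from the reported values rather than from the script source: the entropy of the distribution of sequences is divided by n to obtain the per-character figure. The symbols $c_n(s)$ (number of occurrences of sequence $s$), $N_n$ (number of distinct sequences of length n) and $H_n$ are introduced here for illustration only.

```latex
\begin{align*}
p_n(s) &= \frac{c_n(s)}{\sum_{s'} c_n(s')} \\
H_n &= -\frac{1}{n} \sum_{s} p_n(s) \log_2 p_n(s)
      && \text{(entropy per character, in bits)} \\
\text{fraction of total} &= \frac{N_n}{26^{n}} \\
\text{typical set size} &\approx 2^{\,n H_n}
\end{align*}
```

As a check against the n = 2 row reported for *Ulysses*: $634 / 26^2 = 634 / 676 \approx 0.93787$, and $2^{2 \times 3.94375} = 2^{7.8875} \approx 236.8$, which agree with the `%_of_total` and `typ_set_size` columns.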

In [50]:
./sequenceanalysis.sh /tmp/ulysses.txt 12

n                   entropy/char        num_of_seq.         %_of_total          typ_set_size        
1                   4.2057              26                  1                   18.4519             
2                   3.94375             634                 0.93787             236.796             
3                   3.7291              8783                0.499716            2331.92             
4                   3.5092              66114               0.144677            16807.3             
5                   3.258               243472              0.0204919           80127               
6                   2.97502             493099              0.00159622          236277              
7                   2.69013             720894              8.97549e-05         466304              
8                   2.42649             892221              4.27254e-06         697542              
9                   2.19342             1006904             1.8545e-07          876127     

In [46]:
montage /tmp/ulysses*.png -tile 6x2 -geometry +0+0 /tmp/out.png 2>/dev/null
convert /tmp/out.png -resize 800x ulysses_results.png
rm /tmp/out.png

![ulysses](ulysses_results.png)

In [51]:
wget -nc -q https://www.gutenberg.org/ebooks/9296.txt.utf-8 -O /tmp/harlowe001.txt
wget -nc -q https://www.gutenberg.org/ebooks/9798.txt.utf-8 -O /tmp/harlowe002.txt
wget -nc -q https://www.gutenberg.org/ebooks/9881.txt.utf-8 -O /tmp/harlowe003.txt
wget -nc -q https://www.gutenberg.org/ebooks/10462.txt.utf-8 -O /tmp/harlowe004.txt
wget -nc -q https://www.gutenberg.org/ebooks/10799.txt.utf-8 -O /tmp/harlowe005.txt
wget -nc -q https://www.gutenberg.org/ebooks/11364.txt.utf-8 -O /tmp/harlowe006.txt
wget -nc -q https://www.gutenberg.org/ebooks/11889.txt.utf-8 -O /tmp/harlowe007.txt
wget -nc -q https://www.gutenberg.org/ebooks/12180.txt.utf-8 -O /tmp/harlowe008.txt
wget -nc -q https://www.gutenberg.org/ebooks/12398.txt.utf-8 -O /tmp/harlowe009.txt
cat /tmp/harlowe0*.txt > /tmp/harlowe_history_of_a_young_lady.txt
./sequenceanalysis.sh /tmp/harlowe_history_of_a_young_lady.txt 12

n                   entropy/char        num_of_seq.         %_of_total          typ_set_size        
1                   4.1784              26                  1                   18.1061             
2                   3.8923              617                 0.912722            220.495             
3                   3.62935             7900                0.449477            1895.09             
4                   3.36259             59541               0.130293            11193.4             
5                   3.10821             250821              0.0211104           47678.5             
6                   2.87263             641015              0.00207505          154343              
7                   2.6533              1180888             0.000147026         389996              
8                   2.44795             1770226             8.477e-06           785690              
9                   2.25703             2317249             4.26788e-07         1.30288e+06

In [54]:
montage /tmp/harlowe_*.png -tile 6x2 -geometry +0+0 /tmp/out.png 2>/dev/null
convert /tmp/out.png -resize 800x harlowe_results.png
rm /tmp/out.png

![harlowe](harlowe_results.png)

The type can be understood as the empirical histogram. Strings that have the same empirical histogram are strings of the same type. For example, 'opts', 'post', 'pots', 'spot', 'stop' and 'tops' share the same type: they all have 1 o, 1 p, 1 s and 1 t. Words that are anagrams of each other have the same type! Just for fun, we can list them. We are going to use the awk script available at *clscripts* (this script was written by Arnold Robbins, based on the algorithm from Jon Bentley).
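
The idea behind anagram detection can be sketched in a few lines of awk: give every word a signature made of its letters in alphabetical order, and group the words that share a signature. This is an illustration of the signature idea only, not the Robbins script; the function name `anagram_groups` is hypothetical, and entries containing anything other than the letters a-z are skipped as a simplification.

```shell
# Hypothetical helper: read one word per line and print each group of
# two or more words that share the same sorted-letter signature.
anagram_groups() {
  awk '
    {
      word = tolower($1)
      if (word !~ /^[a-z]+$/) next      # skip entries with other characters
      sig = ""
      for (c = 1; c <= 26; c++) {
        letter = substr("abcdefghijklmnopqrstuvwxyz", c, 1)
        rest = word
        # append the letter once for every occurrence in the word
        while ((p = index(rest, letter)) > 0) {
          sig = sig letter
          rest = substr(rest, p + 1)
        }
      }
      if (sig in group) {
        group[sig] = group[sig] " " word
      } else {
        group[sig] = word
      }
      size[sig]++
    }
    END {
      for (s in group)
        if (size[s] > 1) print group[s]
    }' | sort
}

printf 'stop\npots\ncat\ntops\nact\ndog\n' | anagram_groups
```

The signature carries the same information as the type: 'opst' says one o, one p, one s and one t.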

In [55]:
wget -q https://github.com/leolca/clscripts/raw/master/anagram.awk -O anagram.awk

Here is the list of anagram groups that start with q.

In [56]:
awk -f anagram.awk /usr/share/dict/words | grep '^q'

quads squad
quakes squeak
quartets squatter
quids squid
quieter requite
quiet quite
quires squire
quotes toques
quote toque


And here is a list of the anagram groups that contain more than 5 words.

In [57]:
awk -f anagram.awk /usr/share/dict/words | awk 'NF > 5'

abets baste bates beast beats betas 
aster rates stare tares taser tears 
caret cater crate react recta trace 
carets caster caters crates reacts recast traces 
drapes padres parsed rasped spared spread 
lapse leaps pales peals pleas sepal 
least slate stale steal tales teals 
opts post pots spot stop tops 
palest pastel petals plates pleats staple 
pares parse pears rapes reaps spare spear 


To compute the type of a string we are going to use the *type* script from *clscripts*.

In [66]:
wget -q https://github.com/leolca/clscripts/raw/master/type -O type
chmod +x type
echo "test" | ./type

e1s1t2
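
The same output format can be reproduced with standard tools, which makes the "empirical histogram" reading concrete. This is a sketch for a single string on one line. The function name `string_type` is hypothetical and is not the *clscripts* implementation.

```shell
# Hypothetical helper: split the string into characters, count each one,
# and print letter/count pairs in alphabetical order.
string_type() {
  fold -w1 | sort | uniq -c | awk '{ printf "%s%s", $2, $1 } END { print "" }'
}

echo "test" | string_type
# prints: e1s1t2
```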


Let's compute the type of the anagrams we listed above.

In [67]:
echo -e "word\ttype"
mkfifo /tmp/myfifo
awk -f anagram.awk /usr/share/dict/words | awk 'BEGIN{OFS="\n"} NF > 5 {$1=$1; print}' | 
   tee >(./type > /tmp/myfifo) | paste - /tmp/myfifo 
rm /tmp/myfifo

word	type
abets	a1b1e1s1t1
baste	a1b1e1s1t1
bates	a1b1e1s1t1
beast	a1b1e1s1t1
beats	a1b1e1s1t1
betas	a1b1e1s1t1
aster	a1e1r1s1t1
rates	a1e1r1s1t1
stare	a1e1r1s1t1
tares	a1e1r1s1t1
taser	a1e1r1s1t1
tears	a1e1r1s1t1
caret	a1c1e1r1t1
cater	a1c1e1r1t1
crate	a1c1e1r1t1
react	a1c1e1r1t1
recta	a1c1e1r1t1
trace	a1c1e1r1t1
carets	a1c1e1r1s1t1
caster	a1c1e1r1s1t1
caters	a1c1e1r1s1t1
crates	a1c1e1r1s1t1
reacts	a1c1e1r1s1t1
recast	a1c1e1r1s1t1
traces	a1c1e1r1s1t1
drapes	a1d1e1p1r1s1
padres	a1d1e1p1r1s1
parsed	a1d1e1p1r1s1
rasped	a1d1e1p1r1s1
spared	a1d1e1p1r1s1
spread	a1d1e1p1r1s1
lapse	a1e1l1p1s1
leaps	a1e1l1p1s1
pales	a1e1l1p1s1
peals	a1e1l1p1s1
pleas	a1e1l1p1s1
sepal	a1e1l1p1s1
least	a1e1l1s1t1
slate	a1e1l1s1t1
stale	a1e1l1s1t1
steal	a1e1l1s1t1
tales	a1e1l1s1t1
teals	a1e1l1s1t1
opts	o1p1s1t1
post	o1p1s1t1
pots	o1p1s1t1
spot	o1p1s1t1
stop	o1p1s1t1
tops	o1p1s1t1
palest	a1e1l1p1s1t1
pastel	a1e1l1p1s1t1
petals	a1e1l1p1s1t1
plates	a1e1l1p1s1t1
pleats	a1e1l1p1s1t1
staple	a1e1l1p1s1t1
pares	a1e1p1r1s1

Let's modify the previous script so that it also computes the number of types observed in a text file. We just need to add a new `tee`: one stream computes the types, and the other performs the same computations as before.
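
The new branch boils down to mapping every sequence to its type and counting the distinct results. A minimal sketch of that step, assuming one sequence per line and using hypothetical helpers `seq_type` and `count_types` in place of the *clscripts* tool:

```shell
# Hypothetical helper: print the type of each input line (letters a-z only).
seq_type() {
  awk '{
    out = ""
    for (c = 1; c <= 26; c++) {
      letter = substr("abcdefghijklmnopqrstuvwxyz", c, 1)
      k = gsub(letter, letter)     # number of occurrences in this line
      if (k > 0) out = out letter k
    }
    print out
  }'
}

# Hypothetical helper: number of distinct types among the input sequences.
count_types() {
  seq_type | sort -u | awk 'END { print NR }'
}

# abc, bca and cab share one type; abd and aab add two more.
printf 'abc\nbca\ncab\nabd\naab\n' | count_types
```

Since many sequences collapse onto the same type, the number of types can never exceed the number of distinct sequences.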

In [68]:
cat typeanalysis.sh

#!/bin/bash
FILENAME=$1
LIMIT=$2
mkdir -p imgs
printf "%-20s%-20s%-20s%-20s%-20s%-20s\n" "n" "entropy/char" "num_of_seq." "%_of_total" "num_types" "typ_set_size"
for i in `seq $LIMIT`
do
  tmpa=$(mktemp) tmpb=$(mktemp) tmpc=$(mktemp)
  trap 'rm "$tmpa" "$tmpb" "$tmpc"' EXIT
  imgfilename=$(echo "$FILENAME" | cut -f 1 -d '.')`printf %03d $i`.png
  cat $FILENAME |
  (
    flock 3
    flock 4
    flock 5

    #tr 'A-Z' 'a-z' | tr -dc 'a-z' | fold -w$i | tee >(
    #tr 'A-Z\n' 'a-z ' | tr -dc 'a-z ' | ./ngram -n $i | tee >(
    tr 'A-Z' 'a-z' | tr -dc 'a-z' | ./ngram -n $i | tee >(
       ./type | sort | uniq | awk 'END{ print NR }' > "$tmpc"
       flock -u 5
       ) | sort | uniq -c | sort -nr | awk '{print NR "\t" $0}' |
    tee >(
        awk '{print $2}' | ./entropy.py > "$tmpa"
        flock -u 3
        ) >(
        awk 'END{ print NR }' > "$tmpb"
        flock -u 4
        ) >(
        gnuplot -e "set terminal png; set output '$imgfilename'; set xlabel 'rank'; set ylabel 'counts';

In [65]:
./typeanalysis.sh /tmp/ulysses.txt 12

n                   entropy/char        num_of_seq.         %_of_total          num_types           typ_set_size        
1                   4.2057              26                  1                   26                  18.4519             
2                   3.94375             634                 0.93787             344                 236.796             
3                   3.7291              8783                0.499716            2505                2331.92             
4                   3.5092              66114               0.144677            11821               16807.3             
5                   3.258               243472              0.0204919           41253               80127               
6                   2.97502             493099              0.00159622          108987              236277              
7                   2.69013             720894              8.97549e-05         227045              466304              
8                   2.42649     