# Letter Frequency Analysis - Determine Language of Document based on Signal Similarity

Letter frequency is language specific. Wikipedia lists the relative letter frequencies for a number of languages: https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages.

The letter occurrence in a language can be regarded as a signal or data series that can easily be plotted - as a histogram or bar chart. It is like the DNA or fingerprint of the language.

And that signal can be compared to the signal extracted from any document or piece of text. The language DNA closest to the DNA taken from that document is likely to be the language that document is written in. That is the theory that I would like to validate in this article.
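To make the idea of a fingerprint concrete, here is a minimal sketch of how the signal can be extracted from a piece of text. The function name `letter_fingerprint` and the plain a-z alphabet are my own assumptions for illustration; the notebook itself uses the (larger) set of letters from the reference data file.

```python
from collections import Counter

def letter_fingerprint(text, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return the relative frequency of each letter of `alphabet` in `text`."""
    # count only the characters that belong to the alphabet (case-insensitive)
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        # no known letters at all: return an all-zero fingerprint
        return {letter: 0.0 for letter in alphabet}
    return {letter: counts[letter] / total for letter in alphabet}

fingerprint = letter_fingerprint("Nel mezzo del cammin di nostra vita")
# the three most frequent letters in this short sample
print(sorted(fingerprint, key=fingerprint.get, reverse=True)[:3])
```

The resulting dictionary has one value per letter and the values add up to 1, which is the shape all comparison methods below work with.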

A similar investigation was done by Adrianus Kleemans in this blog article: https://www.kleemans.ch/letter-frequency. He used a slightly different approach than I will be using here - but I am certainly inspired by his article. I am gratefully using the data file composed by Adrianus with the letter frequencies per language.

In this notebook, I will investigate the incidence of letters in several "unknown" documents. I do know in fact what language these documents are written in - their name reveals the language. They were taken from the Gutenberg project (https://www.gutenberg.org/ - with over 50K free ebooks).

I will use various methods for comparing the letter frequency in the documents under scrutiny with the known relative letter frequency in the 14 reference languages:
* Compare Euclidean distance between letter frequencies for all letters in document under scrutiny and all reference frequencies for all languages
* Naive binning of letter frequencies, symbolic representation and comparing using Fuzzy Search with Levenshtein distance
* SAX (Symbolic Aggregate approXimation) and MINDIST distance determination between strings representing the relative occurrence for each letter

All three methods give us decent results. Whether any of them is a genuinely useful method for language detection on documents I cannot tell. Note: an even simpler approach to that particular challenge was discussed in my article: [Determine the Language of a Document from the Letter Frequency – using Levenshtein Distance between sequences](https://technology.amis.nl/2019/08/21/determine-the-language-of-a-document-from-the-letter-frequency-using-levenshtein-distance-between-sequences/)

I consider this notebook primarily an introduction to various methods of describing and comparing signals. More on that in other notebooks as well.

# Read Letter Frequency Data into Pandas Data Frame
Let's load the CSV data from file letter-frequencies.csv into a Pandas Data Frame and prepare it for visualization and further processing. This file contains the relative frequencies per letter for 14 languages (English is not currently included).

In [232]:
import pandas as pd
import json
import plotly.graph_objs as go
import plotly.express as px
from chart_studio.plotly import plot, iplot
from plotly.subplots import make_subplots

In [233]:
lf_df = pd.read_csv("letter-frequencies.csv", skiprows = 0, sep=';')
# show the content of the dataframe with our reference data
lf_df.head(50)

Unnamed: 0,Letter,French,German,Spanish,Portuguese,Esperanto,Italian,Turkish,Swedish,Polish,Dutch,Danish,Icelandic,Finnish,Czech
0,a,7.636%,6.516%,11.525%,14.634%,12.117%,11.745%,12.920%,9.383%,10.503%,7.486%,6.025%,10.110%,12.217%,8.421%
1,b,0.901%,1.886%,2.215%,1.043%,0.980%,0.927%,2.844%,1.535%,1.740%,1.584%,2.000%,1.043%,0.281%,0.822%
2,c,3.260%,2.732%,4.019%,3.882%,0.776%,4.501%,1.463%,1.486%,3.895%,1.242%,0.565%,0,0.281%,0.740%
3,d,3.669%,5.076%,5.010%,4.992%,3.044%,3.736%,5.206%,4.702%,3.725%,5.933%,5.858%,1.575%,1.043%,3.475%
4,e,14.715%,16.396%,12.181%,12.570%,8.995%,11.792%,9.912%,10.149%,7.352%,17.324%,15.453%,6.418%,7.968%,7.562%
5,f,1.066%,1.656%,0.692%,1.023%,1.037%,1.153%,0.461%,2.027%,0.143%,0.805%,2.406%,3.013%,0.194%,0.084%
6,g,0.866%,3.009%,1.768%,1.303%,1.171%,1.644%,1.253%,2.862%,1.731%,3.403%,4.077%,4.241%,0.392%,0.092%
7,h,0.737%,4.577%,0.703%,0.781%,0.384%,0.636%,1.212%,2.090%,1.015%,2.380%,1.621%,1.871%,1.851%,1.356%
8,i,7.529%,6.550%,6.247%,6.186%,10.012%,10.143%,9.600%*,5.817%,8.328%,6.499%,6.000%,7.578%,10.817%,6.073%
9,j,0.613%,0.268%,0.493%,0.397%,3.501%,0.011%,0.034%,0.614%,1.836%,1.461%,0.730%,1.144%,2.042%,1.433%


In [234]:
# create a list of all letters (to iterate through later on)
all_letters = lf_df['Letter'].to_list()

In [235]:
# I want this shape for the data frame: three columns - Letter,  Language,  Frequency
# using melt (see https://hackernoon.com/reshaping-data-in-python-fa27dda2ff77) to do an unpivot and create a long format

df = pd.melt(lf_df, id_vars='Letter', value_vars=['French','German','Spanish','Portuguese','Esperanto','Italian','Turkish','Swedish','Polish','Dutch','Danish','Icelandic','Finnish','Czech'])
df.rename(columns={"variable": "language","Letter": "letter"}, inplace=True)
df.head(10)

Unnamed: 0,letter,language,value
0,a,French,7.636%
1,b,French,0.901%
2,c,French,3.260%
3,d,French,3.669%
4,e,French,14.715%
5,f,French,1.066%
6,g,French,0.866%
7,h,French,0.737%
8,i,French,7.529%
9,j,French,0.613%


In [236]:
# the value column should contain numeric values, not strings only representing a percentage
df['frequency']= df['value'].apply(lambda percentage_string: float(percentage_string.replace('%','').replace('*','')))
df.head(5)

Unnamed: 0,letter,language,value,frequency
0,a,French,7.636%,7.636
1,b,French,0.901%,0.901
2,c,French,3.260%,3.26
3,d,French,3.669%,3.669
4,e,French,14.715%,14.715
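
The load, unpivot and parse steps can be tried out on a tiny in-memory table. This is only a sketch: the language columns `LangA` and `LangB` and all values are made up, and reading everything with `dtype=str` is my own choice so that a bare `0` (as in the Icelandic column) is treated as a string as well.

```python
import io
import pandas as pd

# toy data in the same shape as letter-frequencies.csv (values are made up)
csv_text = """Letter;LangA;LangB
a;10.0%;12.5%*
b;2.0%;0
"""

# read every column as string so that the percent signs survive the load
wide = pd.read_csv(io.StringIO(csv_text), sep=";", dtype=str)

# unpivot: one row per (letter, language) combination
long_df = wide.melt(id_vars="Letter", var_name="language", value_name="value")

# strip the '%' and '*' decorations and convert to a number
long_df["frequency"] = (
    long_df["value"]
    .str.replace("%", "", regex=False)
    .str.replace("*", "", regex=False)
    .astype(float)
)
print(long_df)
```

Using `var_name` and `value_name` in `melt` names the new columns directly, so a separate rename step is not needed in this variant.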


Show a bar chart with the fingerprint for French

In [237]:
fig = px.bar(df[(df['language']=='French') & (df['frequency']>0.05)] , x="letter", y="frequency"
             , range_y=[0,15]
             , color="language"
             , barmode="group"
            #,facet_col="language"
            )

fig.update_layout(
    title=go.layout.Title(
        text="Bar Chart with Relative Frequencies per Letter in French"
    ))
fig.show()

In [238]:
# show a partial fingerprint for several languages - to see that immediately a meaningful difference emerges
fig = px.bar(df[(df['language'].isin(  ['French','German','Dutch','Czech','Italian'])) & (df['letter'].isin(['a','e','i','o','u','y','c','x']))] 
             , x="letter", y="frequency"
             , range_y=[0,18]
             , color="language"
             , barmode="group"
             , facet_col="language"
            )

fig.update_layout(
    title=go.layout.Title(
        text="Bar Chart with Relative Frequencies per Letter"
    ))
fig.show()

In [239]:
# Process Text File and Determine Letter Frequency
# from https://github.com/akleemans/letter-frequency/blob/master/language_identifier.py

def process_file(textfile):
    with open(textfile) as myfile:
        content = myfile.readlines()
    # initialize the dict with ordered entries for all letters, each with a count of 0
    dic = {letter: 0 for letter in all_letters}
    total = 0
    for line in content:
        for letter in line:
            letter = letter.lower()
            # only count characters that occur in the reference alphabet
            if letter in dic:
                total += 1
                dic[letter] += 1

    # normalize the counts to relative frequencies
    # (skip this for a file without any known letters to avoid dividing by zero)
    if total > 0:
        for letter in dic:
            dic[letter] = dic[letter] / total
    return dic

textfile='text-file-italian.txt'

text_lf_dict = process_file(textfile)


In [240]:
# create a dataframe text_lf that contains all letters and for each letter the relative frequency in the processed document
text_lf = pd.DataFrame.from_dict(text_lf_dict, orient='index', columns=['frequency'])
text_lf['letter'] = text_lf.index
text_lf.head(10)       

Unnamed: 0,frequency,letter
a,0.103346,a
b,0.007594,b
c,0.043363,c
d,0.038081,d
e,0.127559,e
f,0.011886,f
g,0.01926,g
h,0.011116,h
i,0.107858,i
j,0.00011,j


In [241]:
fig = px.bar(text_lf, x="letter", y="frequency"
             , range_y=[0,0.14]
             , barmode="group"
            )

fig.update_layout(
    title=go.layout.Title(
        text="Bar Chart with Relative Frequencies for Letters for text in an as yet unknown language"
    ))
fig.show()

We can perhaps determine the language of the document by comparing this bar chart, this fingerprint, with the bar charts we have for all languages. The one that is most similar is probably for the language of our document.

Spoiler alert. It is Italian.

Is the next bar chart for Italian similar to the previous one?

In [247]:
fig = px.bar(df[(df['language'].isin(['Italian'])) ] , x="letter", y="frequency"
             , range_y=[0,15]
             , color="language"
             , barmode="group"
            #,facet_col="language"
            )

fig.update_layout(
    title=go.layout.Title(
        text="Bar Chart with reference values for Relative Frequencies per Letter in Italian"
    ))
fig.show()

## Euclidean Distance
Calculate the Euclidean distance between the frequency values of all letters in the document and the reference frequency values for a specific language.

Here we take the bar chart for the scanned document and compare it one by one to the bar chart for each language. We calculate the distance between the two charts using the Euclidean distance method - looking directly at the numerical differences in frequency value for each of the letters.  
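
The Euclidean distance between two frequency vectors p and q is the square root of the sum of the squared differences per letter. As a small worked example, here is a sketch with made-up frequencies for just four letters and two hypothetical languages `LangA` and `LangB`; it uses plain NumPy rather than the scipy function used in the notebook.

```python
import numpy as np

# made-up relative frequencies for the letters a, e, i, o (fractions, not percentages)
document = np.array([0.30, 0.35, 0.25, 0.10])
references = {
    "LangA": np.array([0.28, 0.36, 0.24, 0.12]),
    "LangB": np.array([0.20, 0.50, 0.15, 0.15]),
}

# Euclidean distance = length of the difference vector
distances = {
    name: float(np.linalg.norm(document - ref))
    for name, ref in references.items()
}

# the reference with the smallest distance is the best candidate
best = min(distances, key=distances.get)
print(distances, best)
```

Note that both vectors have to be on the same scale and in the same letter order; that is why the reference percentages are multiplied by 0.01 before they are compared to the document frequencies.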

In [48]:
# we want to know the distance between
# text_lf['frequency']
# and each of the language frequency sequences
# df[df['language'] == 'French']['frequency']
from scipy.spatial import distance
def compare_letter_frequencies_to_language(unknown_text_frequencies, language):
    # create a Series with all letter frequencies for the given language,
    # converted from percentages to fractions to match the 0..1 scale of the document frequencies
    letter_frequencies_in_language = df[df['language'] == language]['frequency'] * 0.01
    dst = distance.euclidean(unknown_text_frequencies, letter_frequencies_in_language)
    return dst


In [54]:
df['language'].unique()

array(['French', 'German', 'Spanish', 'Portuguese', 'Esperanto',
       'Italian', 'Turkish', 'Swedish', 'Polish', 'Dutch', 'Danish',
       'Icelandic', 'Finnish', 'Czech'], dtype=object)

In [69]:
unknown_text_frequencies = text_lf['frequency']
# loop over all unique language values and calculate the distance; collect the results as rows
rows = []
for language in df['language'].unique():
    dst = compare_letter_frequencies_to_language(unknown_text_frequencies, language)
    rows.append({'Language': language, 'Distance': dst})
# DataFrame.append was removed in pandas 2.0, so build the Data Frame language_scores from the collected rows
language_scores = pd.DataFrame(rows, columns=['Language', 'Distance'])

# get top 5 of languages with shortest distance compare to letter frequencies in document
language_scores.sort_values('Distance').head(5)


Unnamed: 0,Language,Distance
5,Italian,0.025948
2,Spanish,0.06369
0,French,0.074196
4,Esperanto,0.08172
3,Portuguese,0.084151


In [73]:
# create a function to perform this analysis on any text file
def process_text_file(text_file):
    text_lf_dict = process_file(text_file)
    text_lf = pd.DataFrame.from_dict(text_lf_dict, orient='index', columns=['frequency'])
    text_lf['letter'] = text_lf.index
    unknown_text_frequencies = text_lf['frequency']
    # collect one row per language; DataFrame.append was removed in pandas 2.0
    rows = []
    for language in df['language'].unique():
        dst = compare_letter_frequencies_to_language(unknown_text_frequencies, language)
        rows.append({'Language': language, 'Distance': dst})
    language_scores = pd.DataFrame(rows, columns=['Language', 'Distance'])

    # get top 5
    print("Top 5 scores based on Euclidean distance for ",text_file)
    print(language_scores.sort_values('Distance').head(5))
    
# invoke this function for a number of local text files
# the results are promising - although we keep struggling with Danish
process_text_file('text-file-italian.txt')    
process_text_file('text-file-danish.txt')    
process_text_file('text-file-german.txt')    
process_text_file('text-file-dutch.txt')    

Top 5 scores for  text-file-italian.txt
     Language  Distance
5     Italian  0.025948
2     Spanish  0.063690
0      French  0.074196
4   Esperanto  0.081720
3  Portuguese  0.084151
Top 5 scores for  text-file-danish.txt
   Language  Distance
9     Dutch  0.055544
10   Danish  0.062055
1    German  0.075647
7   Swedish  0.093439
0    French  0.107550
Top 5 scores for  text-file-german.txt
   Language  Distance
1    German  0.031016
9     Dutch  0.061782
10   Danish  0.078350
0    French  0.094755
7   Swedish  0.104399
Top 5 scores for  text-file-dutch.txt
   Language  Distance
9     Dutch  0.029877
1    German  0.070388
10   Danish  0.078216
0    French  0.087943
7   Swedish  0.099022


## Find distance from Symbolic Representations
Convert the letter frequency values into quantile based representations and compare the resulting letter sequences.

The frequency for a letter is a value between 0 and 20%. Letters with a frequency of 10% or higher are top dogs in letter land, letters with a frequency below 0.5% are not. We will categorize our letters based on the frequency: very frequent letters go in one category, less frequent letters in another and very infrequent letters in yet another. Each category corresponds to a quantile - and we can pick the number of quantiles as we see fit (more quantiles give more differentiation, but the data and its error may not warrant such a level of differentiation; you can always start with a small number of quantiles).

We can represent the quantile for each of our letters with a symbol that represents the category. To really confuse you, I have selected capital letters as symbols. Here, capital A represents a letter classified as very infrequent. B is more frequent, and the further we go in the alphabet, the more frequent the letters assigned to that category become.

The letter frequencies for a language can now be expressed as a string or a word that consists of symbols. Each symbol in the word tells us something about the element from the Series at that index. The first symbol tells us about the frequency of the letter 'a'. The second symbol about the letter 'b' and so on.
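
A small sketch of this binning, using made-up frequencies for only the first eight letters and four quantiles (both are assumptions for illustration; the cells below use all letters and more quantiles):

```python
import pandas as pd

# made-up frequencies (in %) for the letters a..h
freq = pd.Series([11.7, 0.9, 4.5, 3.7, 11.8, 1.2, 1.6, 0.6], index=list("abcdefgh"))

n_bins = 4
# qcut assigns each value the number of the quantile it falls in (0 = least frequent)
bins = pd.qcut(freq, n_bins, labels=False, duplicates="drop")

# translate quantile numbers into capital letters: 0 -> A, 1 -> B, ...
word = "".join(chr(65 + int(q)) for q in bins)
print(word)
```

The first and fifth symbols of the word describe the letters 'a' and 'e'; in this toy series they are the two most frequent letters and therefore both end up in the highest category.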

Here we assign the quantiles for all letters for each of the languages, based on the known frequency values.

In [248]:
# represent each language's letter frequencies as a single string of characters
# each letter's quantile is represented by a capital letter (A, B, C, ...)
# 65 is the ASCII code for capital A (97 is the code for lowercase a)
number_of_quantiles = 10
df['quantile'] = pd.qcut(df['frequency'], number_of_quantiles, labels=False, duplicates='drop')
df['quantile_label']=df['quantile'].apply(lambda q: chr(65+q))
lq = df.groupby(['language'])['quantile_label'].apply(lambda q: ''.join(q)).reset_index()
lq

Unnamed: 0,language,quantile_label
0,Czech,DBBCDAABDBCCCDDCACDDCDAABBAABAAAAAAABAAABAAAAA...
1,Danish,DCBDDCCBDBCDCDCBADDDCCAABAAAAAAABBAAAAAAAAAAAA...
2,Dutch,DBBDDBCCDBCCCDDBADDDCCCAABAAAAAAAAAAAAAAAAAAAA...
3,Esperanto,DBBCDBBADCCDCDDCADDDCCAAAAAAAAAAAAAAAAAAAAAAAA...
4,Finnish,DAABDAACDCCDCDDCACDDDCAABAAAAAACAAAAAAAAAAAAAA...
5,French,DBCCDBBBDBADCDDCBDDDDCAAAAAAAAAAAAAABAAAAAAAAA...
6,German,DCCDDBCCDABCCDCBADDDCBCAABAAAAABAAAAAAAAAAAAAA...
7,Icelandic,DBABDCCCDBCCCDCBADDCCCAABAAABAAAABAABAAABAACAA...
8,Italian,DBCCDBBBDAADCDDCBDCDCCAAABABAAAAAAAAAAAAAAAAAA...
9,Polish,DBCCDABBDCCCCDDCADDCCADACCAAAAAAAAAAAAAAAAAAAA...


In almost all cases the highest category is assigned to (at least) the fifth symbol in each word, which represents the letter 'e'.

Let's create the symbolic representation for the letter frequencies for the processed text document, using the same number of quantiles for binning or classifying the values.

In [249]:
# represent the letter frequencies from this unknown text (Italian text) as a single string of characters
# each letter's quantile is represented by a capital letter (A, B, C, ...)
# 65 is the ASCII code for capital A (97 is the code for lowercase a)

text_lf['quantile'] = pd.qcut(text_lf['frequency'], number_of_quantiles, labels=False, duplicates='drop')
text_lf['quantile_label']=text_lf['quantile'].apply(lambda q: chr(65+q))
text_lf
symbolic_representation = ''.join(text_lf['quantile_label'].astype(str))
print(symbolic_representation)

DBCCDCCBDAADCDDCBDDDCCAAABAABAAAAAAABAAABAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


This word should look at least a little similar to the word for Italian.

Instead of manually trying to assess the similarity of this word with all the language reference words, we will use the [fuzzysearch module](https://pypi.org/project/fuzzysearch) to find similarities for us. It searches for matches between strings, accepting slight differences, based on the Levenshtein distance calculation.

This does *not* consider A and B more similar than A and X when comparing the strings. We know that strings AAA and BBB are far closer than AAA and XXX. This method therefore is quite crude. Later on we will be using the MINDIST method that is kinder to close symbols than to distant ones. 
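
To illustrate this limitation, here is a minimal, self-written Levenshtein implementation (a sketch for illustration; it is not the code that fuzzysearch uses internally). Every substitution costs 1, no matter which two symbols are involved.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

# both pairs need three substitutions, so both distances are 3
print(levenshtein("AAA", "BBB"), levenshtein("AAA", "XXX"))
```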

Note: we could have assigned numbers to the categories instead of letters and looked at the delta between the numbers. Or we could look at ascii value for the letters and compare those. The MINDIST comparison discussed later on takes care of this.
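
A minimal sketch of that alternative, comparing two words of equal length by the difference in character codes per position. The function name `ordinal_distance` is hypothetical and this is not the MINDIST calculation; it only shows that such a measure does tell close symbols from distant ones.

```python
def ordinal_distance(word_a, word_b):
    """Sum of the absolute differences between the character codes per position."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same length")
    return sum(abs(ord(x) - ord(y)) for x, y in zip(word_a, word_b))

print(ordinal_distance("AAA", "BBB"))  # 3: every symbol is one category apart
print(ordinal_distance("AAA", "XXX"))  # 69: every symbol is 23 categories apart
```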

In [75]:
# we are looking for one on one comparison & distance calculation
# https://pypi.org/project/fuzzysearch/ for Approximate sub-string searches
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install  fuzzysearch

Collecting fuzzysearch
  Downloading https://files.pythonhosted.org/packages/84/a4/ae12fef8f50332419291f40c0faeb2af1e24804faabcc6e386a9c854a4db/fuzzysearch-0.6.2.tar.gz (99kB)
Building wheels for collected packages: fuzzysearch
  Building wheel for fuzzysearch (setup.py) ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/c8/88/03/be9b7fb7326e8d5880023ffd0aa719941515240a5096938a06
Successfully built fuzzysearch
Installing collected packages: fuzzysearch
Successfully installed fuzzysearch-0.6.2


In [250]:
language_sentence = lq[lq['language']=='Italian']['quantile_label'].squeeze() # squeeze to turn the single value from the Pandas series into a single string
print("Symbolic representation for Italian ",language_sentence)
print("Symbolic representation for the unknown document (it should be similar)",symbolic_representation)

Symbolic representation for Italian  DBCCDBBBDAADCDDCBDCDCCAAABABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Symbolic representation for the unknown document (it should be similar) DBCCDCCBDAADCDDCBDDDCCAAABAABAAAAAAABAAABAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


Compare the words for Italian and the unknown document; see if there is a match and if the distance is not high.

In [251]:
from fuzzysearch import find_near_matches
# not a great match, even for the correct language
find_near_matches(symbolic_representation,language_sentence , max_l_dist=16)

[Match(start=0, end=81, dist=7)]

So we have a match. Italian is certainly a candidate as language for the document.

Let's create a function that allows us to compare symbolic representations for any document against any language: 

In [252]:
def inspect_against_language(symbolic_representation, language):
    distance = find_near_matches(symbolic_representation,lq[lq['language']==language]['quantile_label'].squeeze() , max_l_dist=25)
    print('fuzzy search in ',language,' - result:',distance)
    return distance

Then we try the fuzzy search to match the document's symbolic representation against the representations of several languages, to see which one scores best.

In [253]:
# fuzzy search against some language; lower distance means higher similarity
inspect_against_language(symbolic_representation, "French")
inspect_against_language(symbolic_representation, "German")
inspect_against_language(symbolic_representation, "Spanish")
inspect_against_language(symbolic_representation, "Dutch")
inspect_against_language(symbolic_representation, "Finnish")
inspect_against_language(symbolic_representation, "Danish")
inspect_against_language(symbolic_representation, "Italian")

fuzzy search in  French  - result: [Match(start=0, end=82, dist=8)]
fuzzy search in  German  - result: [Match(start=8, end=82, dist=15)]
fuzzy search in  Spanish  - result: [Match(start=0, end=82, dist=11)]
fuzzy search in  Dutch  - result: [Match(start=4, end=82, dist=13)]
fuzzy search in  Finnish  - result: [Match(start=4, end=82, dist=15)]
fuzzy search in  Danish  - result: [Match(start=0, end=80, dist=14)]
fuzzy search in  Italian  - result: [Match(start=0, end=81, dist=7)]


[Match(start=0, end=81, dist=7)]

Italian scored best, only just beating French.

Let's try the same thing again for a different text document. See if the correct language is suggested again:

In [227]:
# load german document and try again
text_lf_dict = process_file('text-file-german.txt')
text_lf = pd.DataFrame.from_dict(text_lf_dict, orient='index', columns=['frequency'])
text_lf['letter'] = text_lf.index
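# represent the letter frequencies from this text as a single string of characters
# each letter's quantile is represented by a capital letter (A, B, C, ...)
# 65 is the ASCII code for capital A (97 is the code for lowercase a)
# note: the reference words in lq must be built with the same number of quantiles for the comparison to be meaningful
number_of_quantiles = 20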
text_lf['quantile'] = pd.qcut(text_lf['frequency'], number_of_quantiles, labels=False, duplicates='drop')
text_lf['quantile_label']=text_lf['quantile'].apply(lambda q: chr(65+q))
symbolic_representation = ''.join(text_lf['quantile_label'].astype(str))
print(symbolic_representation)
# fuzzy search against some language; lower distance means higher similarity
inspect_against_language(symbolic_representation, "French")
inspect_against_language(symbolic_representation, "German")
inspect_against_language(symbolic_representation, "Spanish")
inspect_against_language(symbolic_representation, "Dutch")
inspect_against_language(symbolic_representation, "Finnish")
inspect_against_language(symbolic_representation, "Danish")
inspect_against_language(symbolic_representation, "Italian")
print("Do we have a winner? And is it German?")

GEFGHDFGHBDFEHEBBHGHFDDABECAAAACAAAAAAAAAAAAAAAACAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
fuzzy search in  French  - result: [Match(start=11, end=82, dist=24)]
fuzzy search in  German  - result: [Match(start=0, end=82, dist=13)]
fuzzy search in  Spanish  - result: [Match(start=0, end=81, dist=19)]
fuzzy search in  Dutch  - result: [Match(start=0, end=82, dist=21)]
fuzzy search in  Finnish  - result: [Match(start=8, end=82, dist=19)]
fuzzy search in  Danish  - result: [Match(start=0, end=79, dist=20)]
fuzzy search in  Italian  - result: [Match(start=11, end=82, dist=22)]
Do we have a winner? And is it German?


So this crude comparison renders pretty okay results. Nice to know. Nice to have such a straightforward, naive approach at our disposal.

# SAX: Symbolic Aggregate approXimation
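SAX is a more refined and more scalable way to represent data series as strings of symbols. Combined with the MINDIST distance, which takes into account how far apart two symbols are, it should produce better results than the naive binning above.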
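
To show what happens under the hood, here is a simplified sketch of the SAX conversion without the PAA step: z-normalize the series and map every value to a symbol using breakpoints that divide the standard normal distribution into equally probable regions. The function name `to_sax` is hypothetical and the breakpoints for an alphabet of size 5 are the rounded values tabulated in the SAX paper by Lin et al.; the saxpy functions used below may differ in details.

```python
import numpy as np

# breakpoints that split a standard normal distribution into 5 equiprobable regions
CUTS_5 = np.array([-0.84, -0.25, 0.25, 0.84])

def to_sax(values, cuts=CUTS_5):
    """Convert a numeric series into a SAX word (one symbol per value, no PAA)."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    # z-normalize: zero mean and unit standard deviation (a flat series stays at 0)
    z = (values - values.mean()) / std if std > 0 else np.zeros_like(values)
    # the number of breakpoints below a value determines its symbol: 0 -> a, 1 -> b, ...
    symbols = np.searchsorted(cuts, z)
    return "".join(chr(ord("a") + int(s)) for s in symbols)

print(to_sax([12.0, 1.0, 4.0, 4.0, 12.0, 1.0]))  # eabbea
```

Because of the z-normalization, the word describes the shape of the series rather than its absolute values; multiplying all frequencies by 100 gives the same word.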



In [131]:
# Install a conda package in the current Jupyter kernel
# based on https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/
import sys
!conda install --yes --prefix {sys.prefix} saxpy

# saxpy is not available from the configured conda channels (see the error below),
# so install it as a pip package in the current Jupyter kernel instead
!{sys.executable} -m pip install saxpy

Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - saxpy

Current channels:

  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/linux-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.




In [203]:
from saxpy.alphabet import cuts_for_asize
import numpy as np
from saxpy.znorm import znorm
from saxpy.sax import ts_to_string

alphabet_size = 5
word_size =  len(all_letters) # feel free to use a smaller number of letters; just the first few letters of the alphabet can be telling

In [204]:
sax_unknown = ts_to_string(znorm(text_lf[:word_size]['frequency']), cuts_for_asize(alphabet_size))
print('SAX symbolic representation for unknown document is ',sax_unknown)

#df[(df['language'].isin(  ['French','German','Dutch','Czech','Italian']))

SAX symbolic representation for unknown document is  eceeecdcebbedeedceeeecbbbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb


In [205]:
len(sax_unknown)

82

In [206]:
italian_lf = list(df[df['language']=='Italian']['frequency'])
sax_italian =ts_to_string(znorm(  italian_lf[:word_size]), cuts_for_asize(alphabet_size))


In [207]:
german_lf = list(df[df['language']=='German']['frequency'])
sax_german = ts_to_string(znorm(  german_lf[:word_size]), cuts_for_asize(alphabet_size))

In [208]:
import saxpy_nphoff as sx
import importlib
#importlib.reload(my_module)
s = sx.SAX(alphabetSize = alphabet_size, wordSize = word_size)
s.compare_strings(sax_unknown, sax_italian)


0.0

In [209]:
s.compare_strings(sax_unknown, sax_german)

0.7733692520394123
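
A distance of exactly 0.0 does not mean that the two words are identical. The sketch below shows how MINDIST is defined in the SAX paper; I assume here that `compare_strings` follows that definition, and the function names `symbol_distance` and `mindist` are my own. Symbols that are equal or adjacent contribute nothing; only symbols that are at least two categories apart add the gap between their breakpoints.

```python
import math

# breakpoints for an alphabet of size 5 (rounded values from the SAX paper)
CUTS_5 = [-0.84, -0.25, 0.25, 0.84]

def symbol_distance(x, y, cuts=CUTS_5):
    """Distance between two SAX symbols; equal or adjacent symbols count as 0."""
    i, j = ord(x) - ord("a"), ord(y) - ord("a")
    if abs(i - j) <= 1:
        return 0.0
    return cuts[max(i, j) - 1] - cuts[min(i, j)]

def mindist(word_a, word_b, original_length=None, cuts=CUTS_5):
    """MINDIST between two SAX words of equal length."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same length")
    w = len(word_a)
    # without PAA the word is as long as the original series, so the scaling factor is 1
    n = original_length or w
    total = sum(symbol_distance(x, y, cuts) ** 2 for x, y in zip(word_a, word_b))
    return math.sqrt(n / w) * math.sqrt(total)

print(mindist("ecb", "edb"))  # 0.0 - c and d are adjacent symbols
print(mindist("eb", "cd"))    # e/c and b/d are both two categories apart
```

With these breakpoints, one e/c mismatch (0.59) plus one b/d mismatch (0.5) gives the square root of 0.5981, which is about 0.7734. That is one combination that yields the value reported for German above.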

In [210]:
def score_sax_string_against_language( sax_unknown, language):
    language_lf = list(df[df['language']== language]['frequency'])
    sax_language = ts_to_string(znorm(  language_lf[:word_size]), cuts_for_asize(alphabet_size))
    print(language,': ',sax_language)
    return s.compare_strings(sax_unknown, sax_language)

In [211]:
print ('Scoring Document with unknown language against some known languages')
print('Against French',score_sax_string_against_language( sax_unknown, 'French'))
print('Against German',score_sax_string_against_language( sax_unknown, 'German'))
print('Against Dutch',score_sax_string_against_language( sax_unknown, 'Dutch'))
print('Against Italian',score_sax_string_against_language( sax_unknown, 'Italian'))
print('Against Swedish',score_sax_string_against_language( sax_unknown, 'Swedish'))
print('Against Finnish',score_sax_string_against_language( sax_unknown, 'Finnish'))

Scoring Document with unknown language against some known languages
French :  ecdeecccecbedeedceeeecbbbbbbbbbbbbbbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Against French 0.0
German :  ecdeecdeebcddedcbeeeecdbbcbbbbbcbbbbbbbbbbbbbbbbbbbbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Against German 0.7733692520394123
Dutch :  ecceecddecdedeecbeeedccbbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Against Dutch 0.7733692520394123
Italian :  eceeecccebbedeedbeeeddbbbcbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Against Italian 0.0
Swedish :  ecceedddecdeeeedbeeeddbbcbbbbbbccbbbbbbbbbbbbbbbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Against Swedish 0.7733692520394123
Finnish :  ebbcebbcedeedeecbdeeedbbcbbbbbbebbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
Against Finnish 2.1005713508471926


In [212]:
def process_text_file_sax(text_file):
    text_lf_dict = process_file(text_file)
    text_lf = pd.DataFrame.from_dict(text_lf_dict, orient='index', columns=['frequency'])
    text_lf['letter'] = text_lf.index
    sax_unknown_text_frequencies = ts_to_string(znorm(text_lf[:word_size]['frequency']), cuts_for_asize(alphabet_size))
    # collect one row per language; DataFrame.append was removed in pandas 2.0
    rows = []
    for language in df['language'].unique():
        dst = score_sax_string_against_language(sax_unknown_text_frequencies, language)
        rows.append({'Language': language, 'Distance': dst})
    language_scores = pd.DataFrame(rows, columns=['Language', 'Distance'])

    # get top 5
    print("Top 5 scores based on SAX distance for ",text_file)
    print(language_scores.sort_values('Distance').head(5))

# invoke the SAX based function (not the Euclidean process_text_file) for the local text files
process_text_file_sax('text-file-italian.txt')
process_text_file_sax('text-file-danish.txt')
process_text_file_sax('text-file-german.txt')
process_text_file_sax('text-file-dutch.txt')


In [213]:
# NOTE: THIS DOES NOT APPLY TO THE CASE OF LETTER frequencies or other discrete series

# Time series to SAX conversion with PAA aggregation (i.e., by "chunking")
# In order to reduce dimensionality further, the PAA (Piecewise Aggregate Approximation) is usually applied prior to SAX:

import numpy as np
from saxpy.znorm import znorm
from saxpy.paa import paa
from saxpy.sax import ts_to_string

#dat = text_lf['frequency']
#dat_znorm = znorm(dat)
#dat_paa_20 = paa(dat_znorm, 20)

# ts_to_string(dat_paa_20, cuts_for_asize(10))

# Resources
https://datascience.stackexchange.com/questions/15812/check-similarity-between-time-series

http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/time_series_from_keogh_tutorial.pdf

Letter Frequencies: http://letterfrequency.org/

Symbolic Aggregate approXimation, HOT-SAX, and SAX-VSM implementation in Python: https://github.com/seninp/saxpy

Shape matching with time series data - by Devini Senaratna and Chris Potts: https://roamanalytics.com/2016/11/28/shape-matching-with-time-series-data/

Experiencing SAX: a Novel Symbolic Representation of Time Series - Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi: https://cs.gmu.edu/~jessica/SAX_DAMI_preprint.pdf

SAX and Matrix Profile Techniques for Root Cause Analysis - Supreet Oberoi, October 29, 2018: https://www.datascience.com/blog/sax-and-matrix-profile-time-series

jMotif - Java Implementation of SAX, EMMA: https://github.com/jMotif/SAX

https://github.com/nphoff/saxpy, with example usage:

    s = SAX(12, 3)
    (a_sax, a_indexes) = s.to_letter_rep(motif)
    print("a_sax: %s" % a_sax)
    (sequence_strings, sequence_indexes) = s.sliding_window(sequence, len(sequence) // len(motif))
    x3x2ComparisonScores = s.batch_compare(sequence_strings, a_sax)

SAX PY Fast implementation: https://github.com/zangsir/SPM/blob/master/ts_mining/saxpyFast.py

Levenshtein Distance - to compare series and their difference: https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/

# Technical Environment
For this notebook, I made use of Jupyter Notebook 5.7 with the Jupyter Lab extension 1.0.4 installed (https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html) in combination with plotly 4.1.

conda install -c conda-forge jupyterlab

Installing plotly (4.1): 

conda install -c plotly plotly=4.1.0 

conda install -c plotly chart-studio=1.0.0

conda install jupyterlab=1.0 "ipywidgets>=7.5"

(see: https://plot.ly/python/getting-started/)

