# Letter Frequency Analysis - Determine Language of Document based on Frequency Sequence Similarity

Letter frequency is language specific. Wikipedia lists the relative letter frequencies for a number of languages: https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages.

The letter occurrences in a language can be regarded as a signal or data series that can easily be plotted, for example as a histogram or bar chart. It is like the DNA or fingerprint of the language. This page gives, per language, the sequence of letters in order of decreasing frequency: http://letterfrequency.org/letter-frequency-by-language/ .

For example for English: e t a o i n s r h l d c u m f p g w y b v k x j q z
And for French: e s a i t n r u l o d c m p é v q f b g h j à x è y ê z ç ô ù â û î œ w k ï ë ü æ ñ
While for Italian: e a i o n l r t s c d u p m v g h f b q z ò à ù ì é è ó y k w x j ô
Or Dutch: e n a t i r o d s l g h v k m u b p w j c z f x y (ë é ó) q
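
As a minimal sketch of how such a sequence can be derived from a piece of text: count the letters and list them from most to least frequent. The helper name `ordered_letter_sequence` and the sample sentence are illustrative assumptions, not part of the notebook code further down. Ties between equally frequent letters are broken alphabetically here, purely to keep the result deterministic.

```python
from collections import Counter

def ordered_letter_sequence(text):
    """Return the distinct letters of text, most frequent first."""
    # keep alphabetic characters only, lower-cased
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    # sort by descending count, then alphabetically to break ties
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ''.join(letter for letter, _ in ranked)

# hypothetical sample input
print(ordered_letter_sequence("Letter frequency is language specific"))
```

On a short sentence like this the sequence is dominated by chance; the published sequences are based on much larger bodies of text.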


It would seem that if we determine the letter frequencies for a document in an unknown language, then by comparing the document's ordered letter sequence with the known sequences for all languages and picking the best match, we can determine the language of the document under scrutiny.

Let's see if that really works.

# Read Letter Frequency Data into Pandas Data Frame
Let's load the CSV data from files into a Pandas Data Frame and prepare it for visualization and further processing.

In [62]:
import pandas as pd

In [63]:
lf_df = pd.read_csv("ordered-letter-sequences.csv", skiprows = 0, sep=',')
lf_df.head(10)

Unnamed: 0,language,ordered_letters
0,english,etaoinsrhldcumfpgwybvkxjqz
1,spanish,eaosrnidlctumpbgyívqóhfzjéáñxúüwk
2,german,enisratdhulcgmobwfkzvüpäßjöyqx
3,french,esaitnrulodcmpévqfbghjàxèyêzçôùâûî
4,italian,eaionlrtscdupmvghfbqzòàùì
5,dutch,enatirodslghvkmubpwjczfxy
6,turkish,aeinrlıdkmuytsboüşzgçhğvcöpfjwxq
7,polish,iaeoznscrwyłdkmtpujlgębąhżśóćńfźvqx
8,esperanto,aieonlsrtkjudmpvgfbcĝĉŭzŝhĵĥwyxq
9,swedish,eantrslidomgkvähfupåöbcjyxwzéq


In [64]:
# Process a text file and determine its letter frequencies
# based on https://github.com/akleemans/letter-frequency/blob/master/language_identifier.py
def process_file(textfile):
    # assumption: the text files are UTF-8 encoded; adjust the encoding if they are not
    with open(textfile, encoding='utf-8') as myfile:
        content = myfile.readlines()

    # only letters in this string are counted; any other character is ignored
    all_letters ='esaitnrulodcmpévqfbghjàxèyêzçôùâûîøöœwkäßïëüæñ'
    # initialize the dict with ordered entries for all letters, each with a count of 0
    dic ={letter: 0 for letter in all_letters}
    total = 0
    for line in content:
        for letter in line:
            letter = letter.lower()
            if letter in dic:
                total += 1
                dic[letter] += 1

    # normalize counts to relative frequencies (skipped when no letters were counted)
    if total > 0:
        for letter in dic:
            dic[letter] = dic[letter] / total
    return dic

textfile='text-file-italian.txt'

text_lf_dict = process_file(textfile)
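
`process_file` only counts characters that occur in `all_letters`, so a letter missing from that string can never show up in a document's sequence. A quick check, sketched below, lists which letters of a reference sequence fall outside that alphabet. The helper name `uncovered_letters` is hypothetical; the reference sequences are copied from the table above.

```python
# the alphabet used by process_file
ALL_LETTERS = 'esaitnrulodcmpévqfbghjàxèyêzçôùâûîøöœwkäßïëüæñ'

def uncovered_letters(reference_sequence, alphabet=ALL_LETTERS):
    """Return the letters of reference_sequence that are not in alphabet."""
    return ''.join(ch for ch in reference_sequence if ch not in alphabet)

# reference sequences taken from the lf_df output above
samples = {
    'german': 'enisratdhulcgmobwfkzvüpäßjöyqx',
    'italian': 'eaionlrtscdupmvghfbqzòàùì',
    'swedish': 'eantrslidomgkvähfupåöbcjyxwzéq',
}
for language, sequence in samples.items():
    print(language, ':', repr(uncovered_letters(sequence)))
```

For these three sequences the German letters are all covered, while Italian `ò` and `ì` and Swedish `å` are not, so those letters are skipped when a document is processed.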


In [65]:
text_lf = pd.DataFrame.from_dict(text_lf_dict, orient='index', columns=['frequency'])
text_lf['letter'] = text_lf.index
text_lf.head(10)       

Unnamed: 0,frequency,letter
e,0.128478,e
s,0.054318,s
a,0.10409,a
i,0.108635,i
t,0.058863,t
n,0.069837,n
r,0.060525,r
u,0.036692,u
l,0.062521,l
o,0.092118,o


Show the letter sequence ordered by frequency. 

In [66]:
''.join(text_lf[text_lf['frequency']>0].sort_values(by=['frequency'], ascending=False)['letter'])

'eiaonlrtscdupmgvfhqbzéèjxâû'

In [67]:
# a function to calculate the Levenshtein distance between two sequences
# taken from https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/ 
import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y]
    matrix = np.zeros((size_x, size_y))
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # a substitution costs nothing when the two characters are equal
            substitution_cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,                    # deletion
                matrix[x-1, y-1] + substitution_cost,  # match or substitution
                matrix[x, y-1] + 1                     # insertion
            )
    # np.zeros creates a float matrix, so the distance is returned as a float
    return matrix[size_x - 1, size_y - 1]

In [80]:
print(levenshtein('bright','bright'))
print(levenshtein('bright','freight'))
print(levenshtein('bright','sleight'))
print(levenshtein('bright','bride'))
print(levenshtein('bright','plight'))
print(levenshtein('bright','pride'))
print(levenshtein('bright','donald duck'))

0.0
2.0
3.0
3.0
2.0
4.0
11.0
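
The function above fills a full matrix, yet each cell only depends on the current and the previous row. As a sketch of an alternative, not the version used in the rest of this notebook, the same distance can be computed in plain Python with two rows, which also returns an integer instead of a float. The name `levenshtein_rows` is an assumption for illustration.

```python
def levenshtein_rows(seq1, seq2):
    """Levenshtein distance that keeps only two rows of the matrix."""
    # distances between the empty prefix of seq1 and every prefix of seq2
    previous = list(range(len(seq2) + 1))
    for x, a in enumerate(seq1, start=1):
        # distance between seq1[:x] and the empty prefix of seq2
        current = [x]
        for y, b in enumerate(seq2, start=1):
            cost = 0 if a == b else 1
            current.append(min(
                previous[y] + 1,         # deletion
                current[y - 1] + 1,      # insertion
                previous[y - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]

# same word pairs as in the cell above
for word in ['bright', 'freight', 'sleight', 'bride', 'plight', 'pride', 'donald duck']:
    print(levenshtein_rows('bright', word))
```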


In [69]:
document_letter_sequence = ''.join(text_lf[text_lf['frequency']>0].sort_values(by=['frequency'], ascending=False)['letter'])
# loop over the letter sequences in lf_df - for each language, determine levenshtein distance with document_letter_sequence
best_score = 999
best_matching_language = None
for index, row in lf_df.iterrows():
    ld = levenshtein(document_letter_sequence,row['ordered_letters'])
    print(row['language'],': ',ld)
    if ld < best_score:
        best_score= ld
        best_matching_language = row['language']
print("We have a winner: ",best_matching_language)        

english :  19.0
spanish :  23.0
german :  26.0
french :  25.0
italian :  12.0
dutch :  22.0
turkish :  26.0
polish :  27.0
esperanto :  22.0
swedish :  22.0
portuguese :  30.0
norwegian :  25.0
icelandic :  31.0
hungarian :  31.0
slovak :  22.0
finnish :  23.0
danish :  33.0
czech :  32.0
hawaiian :  23.0
maori :  23.0
latin :  20.0
irish :  22.0
welsh :  27.0
gaelic :  21.0
We have a winner:  italian
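
The loop above prints the scores in file order. As a sketch, the same comparison can return a list sorted by distance, which makes the runner-up easy to spot. `rank_languages`, `edit_distance` and the small `reference_sequences` dictionary are illustrative assumptions: the sequences are copied from the table near the top of this notebook, and `edit_distance` is a plain-Python stand-in for `levenshtein` so that the block runs on its own.

```python
def edit_distance(seq1, seq2):
    """Plain-Python Levenshtein distance (stand-in for levenshtein above)."""
    previous = list(range(len(seq2) + 1))
    for x, a in enumerate(seq1, start=1):
        current = [x]
        for y, b in enumerate(seq2, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[y] + 1, current[y - 1] + 1, previous[y - 1] + cost))
        previous = current
    return previous[-1]

def rank_languages(document_sequence, reference_sequences):
    """Return (language, distance) pairs, best match first."""
    scores = [(language, edit_distance(document_sequence, sequence))
              for language, sequence in reference_sequences.items()]
    return sorted(scores, key=lambda item: item[1])

# a subset of the reference sequences shown in lf_df
reference_sequences = {
    'english': 'etaoinsrhldcumfpgwybvkxjqz',
    'german': 'enisratdhulcgmobwfkzvüpäßjöyqx',
    'italian': 'eaionlrtscdupmvghfbqzòàùì',
    'dutch': 'enatirodslghvkmubpwjczfxy',
}
# the sequence found for text-file-italian.txt
ranking = rank_languages('eiaonlrtscdupmgvfhqbzéèjxâû', reference_sequences)
for language, distance in ranking:
    print(language, ':', distance)
```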


In [81]:
def inspect_file(textfilename):
    text_lf_dict = process_file(textfilename)
    text_lf = pd.DataFrame.from_dict(text_lf_dict, orient='index', columns=['frequency'])
    text_lf['letter'] = text_lf.index
    document_letter_sequence = ''.join(text_lf[text_lf['frequency']>0].sort_values(by=['frequency'], ascending=False)['letter'])
    print(document_letter_sequence)
    # loop over the letter sequences in lf_df - for each language, determine levenshtein distance with document_letter_sequence
    best_score = float('inf')
    best_matching_language = None
    for index, row in lf_df.iterrows():
        ld = levenshtein(document_letter_sequence, row['ordered_letters'])
        print(row['language'],': ',ld)
        if ld < best_score:
            # new best match
            best_score = ld
            best_matching_language = row['language']
        elif ld == best_score:
            # tie: report every language that shares the best score
            best_matching_language = best_matching_language + ', ' + row['language']
    print("We have a winner: ",best_matching_language)

inspect_file('text-file-german.txt')

enirtsadhlgucmbozwfvkäüßöjpqy
english :  21.0
spanish :  28.0
german :  19.0
french :  29.0
italian :  23.0
dutch :  22.0
turkish :  26.0
polish :  29.0
esperanto :  27.0
swedish :  25.0
portuguese :  30.0
norwegian :  26.0
icelandic :  30.0
hungarian :  31.0
slovak :  25.0
finnish :  22.0
danish :  33.0
czech :  33.0
hawaiian :  25.0
maori :  24.0
latin :  23.0
irish :  23.0
welsh :  26.0
gaelic :  24.0
We have a winner:  german


In [73]:
inspect_file('text-file-danish.txt') 

eandrstgloihkvmufpcøbjæxyzéä
english :  21.0
spanish :  26.0
german :  25.0
french :  27.0
italian :  22.0
dutch :  19.0
turkish :  27.0
polish :  30.0
esperanto :  27.0
swedish :  15.0
portuguese :  28.0
norwegian :  22.0
icelandic :  30.0
hungarian :  29.0
slovak :  22.0
finnish :  21.0
danish :  26.0
czech :  35.0
hawaiian :  24.0
maori :  23.0
latin :  23.0
irish :  24.0
welsh :  25.0
gaelic :  22.0
We have a winner:  swedish
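
This run picks Swedish for the Danish text, with the Danish reference sequence well behind. A small sketch of how the gap between the best and the second-best score could be reported alongside the winner, as a rough indication of how clear-cut a match is. The helper name `best_with_margin` is hypothetical, and the scores are a subset of the output above.

```python
def best_with_margin(scores):
    """Return (best language, best score, gap to the runner-up).

    Expects a dict with at least two language -> distance entries.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1])
    (best_language, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_language, best_score, second_score - best_score

# subset of the scores printed for text-file-danish.txt
danish_run = {
    'english': 21.0,
    'german': 25.0,
    'dutch': 19.0,
    'swedish': 15.0,
    'norwegian': 22.0,
    'danish': 26.0,
}
print(best_with_margin(danish_run))
```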


In [72]:
inspect_file('text-file-dutch.txt') 

enatirodslguvwkmphbcjzfyëxïé
english :  23.0
spanish :  27.0
german :  23.0
french :  25.0
italian :  23.0
dutch :  11.0
turkish :  27.0
polish :  30.0
esperanto :  26.0
swedish :  21.0
portuguese :  30.0
norwegian :  23.0
icelandic :  30.0
hungarian :  29.0
slovak :  23.0
finnish :  18.0
danish :  31.0
czech :  33.0
hawaiian :  23.0
maori :  24.0
latin :  22.0
irish :  26.0
welsh :  28.0
gaelic :  25.0
We have a winner:  dutch


# Resources

Letter Frequencies: http://letterfrequency.org/ 

Levenshtein Distance - to compare series and their difference - https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/ 

xrange in Python 3.0 (replaced with range) - https://www.geeksforgeeks.org/range-vs-xrange-python/ 

# Technical Environment
For this notebook, I made use of Jupyter Notebook 5.7 with the Jupyter Lab extension 1.0.4 installed (https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html) in combination with plotly 4.1.

conda install -c conda-forge jupyterlab

Installing plotly (4.1): 

conda install -c plotly plotly=4.1.0 

conda install -c plotly chart-studio=1.0.0

conda install jupyterlab=1.0 "ipywidgets>=7.5"

(see: https://plot.ly/python/getting-started/)

