# Greek Unicode tricks

In this notebook, we explore how Python's unicodedata module works and how we can use it to convert Greek text to a number of formats. Another important task is splitting punctuation from words, even when the punctuation occurs in the middle of a word.

(Much of the code in this notebook was originally written by Dirk Roorda, for which I am very thankful. I have since adapted the original code here and there for my own purposes.)


In [2]:
import pickle
from unicodedata import category, normalize
from pprint import pprint

First we show the general categories of Unicode characters that the unicodedata module can report.

In [4]:
# Show dictionary with all unicode categories
# (the with-block closes the file automatically, even if loading fails)
with open("data/unicode_cat_dict.pickle", "rb") as open_dict:
    unicode_cat_dict = pickle.load(open_dict)
print('These are the available categories in the unicodedata library:')
pprint(unicode_cat_dict)

These are the available categories in the unicodedata library:
{'Cc': 'Other, Control',
 'Cf': 'Other, Format',
 'Cn': 'Other, Not Assigned (no characters in the file have this property)',
 'Co': 'Other, Private Use',
 'Cs': 'Other, Surrogate',
 'LC': 'Letter, Cased',
 'Ll': 'Letter, Lowercase',
 'Lm': 'Letter, Modifier',
 'Lo': 'Letter, Other',
 'Lt': 'Letter, Titlecase',
 'Lu': 'Letter, Uppercase',
 'Mc': 'Mark, Spacing Combining',
 'Me': 'Mark, Enclosing',
 'Mn': 'Mark, Nonspacing',
 'Nd': 'Number, Decimal Digit',
 'Nl': 'Number, Letter',
 'No': 'Number, Other',
 'Pc': 'Punctuation, Connector',
 'Pd': 'Punctuation, Dash',
 'Pe': 'Punctuation, Close',
 'Pf': 'Punctuation, Final quote (may behave like Ps or Pe depending on usage)',
 'Pi': 'Punctuation, Initial quote (may behave like Ps or Pe depending on '
       'usage)',
 'Po': 'Punctuation, Other',
 'Ps': 'Punctuation, Open',
 'Sc': 'Symbol, Currency',
 'Sk': 'Symbol, Modifier',
 'Sm': 'Symbol, Math',
 'So': 'Symbol, Other',
 'Zl':
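
The pickle file is only a lookup table for the descriptions; the two-letter codes themselves come straight from the category function. A minimal sketch, assuming we just want to tally which codes occur in a string (the helper name category_counts is hypothetical and not part of the original notebook):

```python
from collections import Counter
from unicodedata import category


def category_counts(text):
    """Count how often each two-letter general category occurs in text."""
    return Counter(category(ch) for ch in text)


# 'Ab1 .' holds an uppercase letter, a lowercase letter, a digit,
# a space and a full stop.
counts = category_counts('Ab1 .')
print(dict(counts))
```

Each code can then be looked up in a table like the one above to get its description.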

We now process some (polluted) Greek words and categorize the characters they contain. We also show the difference between precomposed characters (letter and accent stored as one code point) and decomposed characters (a base letter followed by separate combining marks).


In [5]:
word1 = 'Αὐτούς·Αὐτούς'
word2 = 'Σιλωάμ?̔ὃ'
word3 = 'Διδάσκαλε?̓'
sen1 = 'ἄλογοι«, δι᾿ «ὧν« δῆλον ὅτι αἱ ἁμαρτίαι «μηνύονται« αἱ μὴ γεγονυῖαι κατὰ λόγον.'
sen2 = '*polla/ me ta\ parakalou=nta/ e)sti cumbouleu=sai u(mi=n'

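for word in (word1, word2, word3):
    # Categories of the characters as they are stored in the string
    print([category(x) for x in word])
    # The individual characters after canonical decomposition (NFD)
    print([x for x in normalize('NFD', word)])
    # Categories after decomposition: accents and breathings appear as 'Mn'
    print([category(x) for x in normalize('NFD', word)])
    print('\n')
# sen2 is an ASCII transliteration, so it contains no combining marks
print([category(x) for x in sen2])
print([category(x) for x in 'cumbouleu=sai'])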


['Lu', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Po', 'Lu', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll']
['Α', 'υ', '̓', 'τ', 'ο', 'υ', '́', 'ς', '·', 'Α', 'υ', '̓', 'τ', 'ο', 'υ', '́', 'ς']
['Lu', 'Ll', 'Mn', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Po', 'Lu', 'Ll', 'Mn', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll']


['Lu', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Po', 'Mn', 'Ll']
['Σ', 'ι', 'λ', 'ω', 'α', '́', 'μ', '?', '̔', 'ο', '̔', '̀']
['Lu', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Po', 'Mn', 'Ll', 'Mn', 'Mn']


['Lu', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Po', 'Mn']
['Δ', 'ι', 'δ', 'α', '́', 'σ', 'κ', 'α', 'λ', 'ε', '?', '̓']
['Lu', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Po', 'Mn']


['Po', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Po', 'Zs', 'Ll', 'Ll', 'Zs', 'Ll', 'Ll', 'Po', 'Zs', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Sm', 'Ll', 'Ll', 'Ll', 'Po', 'Zs', 'Ll', 'Pe', 'Ll', 'Ll', 'Ll', 'Zs', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Sm', 'Ll', 'Ll', 'Ll', 'Zs', 'Ll', 'Ps', 'Ll', 'Ll', 'Sm', 'Ll']
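
The difference between the precomposed and decomposed forms seen in the output above can be checked on a single character. This is a small sketch; the variable names are chosen for illustration, and the character is written as an escape so that the example does not depend on how the text was typed or pasted:

```python
from unicodedata import category, normalize

# GREEK SMALL LETTER ALPHA WITH TONOS as a single precomposed code point
composed = '\u03ac'

# NFD splits it into the base letter plus a combining acute accent
decomposed = normalize('NFD', composed)

print(len(composed), [category(ch) for ch in composed])
print(len(decomposed), [category(ch) for ch in decomposed])

# NFC puts the two code points back together
print(normalize('NFC', decomposed) == composed)
```

The precomposed form is one 'Ll' character, while the decomposed form is an 'Ll' letter followed by an 'Mn' mark, which is why the category lists for the same word differ in length before and after normalization.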

We now define a number of functions to process Greek text.

In [6]:
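# Major Unicode classes: the first letter of the two-letter category code
letter = {'L'}
letter_space = {'L', 'Z'}
dia = {'M'}
punc = {'P'}
letter_dia = {'L', 'M'}

# Normalization forms: NFD decomposes characters, NFC recomposes them
NFD = 'NFD'
NFC = 'NFC'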


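def splitPuncSimple(w):
    # Split w into (word, trailing non-letters + ' ').
    # Scans backwards from the end until the first letter (class 'L').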
    afterWord = len(w)
    for i in range(len(w) - 1, -1, -1):
        if category(w[i])[0] not in letter:
            afterWord = i
        else:
            break
    return (w[0:afterWord], w[afterWord:]+' ')

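def splitPunc(w):
    # Split w into a tuple of (preWord, word, afterWord) triples: leading
    # non-letters, a run of letters, and the non-letters that follow.
    # Whatever remains is split recursively, so punctuation in the middle
    # of a word yields several triples. Only class 'L' counts as a letter
    # here, so in decomposed text a combining mark ends the letter run.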
    pP = 0
    for i in range(len(w)):
        if category(w[i])[0] not in letter:
            pP += 1
        else:
            break
    preWord = w[0:pP] if pP else ''
    pW = pP
    for i in range(pP, len(w)):
        if category(w[i])[0] in letter:
            pW += 1
        else:
            break
    word = w[pP:pW]
    pA = pW
    for i in range(pW, len(w)):
        if category(w[i])[0] not in letter:
            pA += 1
        else:
            break
    afterWord = w[pW:pA]
    if pA == len(w):
        afterWord += ' '
    
    rest = splitPunc(w[pA:]) if pA < len(w) else ()
    return ((preWord, word, afterWord),) + rest

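def Tokenizer(sentence, udnorm='NFD'):
    # Normalize the sentence, split it on spaces, and return the words with
    # surrounding punctuation removed. Letters and combining marks
    # (classes 'L' and 'M') both count as word characters.
    # If a word has punctuation in the middle, the part after it is added
    # as one extra token and is not split any further.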
    sen = normalize(udnorm, sentence)
    words = sen.split(' ')
    tokens = []
    tokenized_sentence = []
    for w in words:
        pP = 0
        for i in range(len(w)):
            if category(w[i])[0] not in letter_dia:
                pP += 1
            else:
                break
        preWord = w[0:pP] if pP else ''
        pW = pP
        for i in range(pP, len(w)):
            if category(w[i])[0] in letter_dia:
                pW += 1
            else:
                break
        word = w[pP:pW]
        pA = pW
        for i in range(pW, len(w)):
            if category(w[i])[0] not in letter_dia:
                pA += 1
            else:
                break
        afterWord = w[pW:pA]
        tokens = [word] + ([w[pA:]] if pA < len(w) else [])
        tokenized_sentence += tokens        
    return tokenized_sentence  
    

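# These two take a list of tokens (e.g. the output of Tokenizer), join them
# with spaces, and drop all combining marks.
def plainMajuscule(sentence):
    return ''.join(x.upper() for x in normalize('NFD', ' '.join(sentence)) if category(x)[0] not in dia)

def plainMinuscule(sentence):
    return ''.join(x.lower() for x in normalize('NFD', ' '.join(sentence)) if category(x)[0] not in dia)

# These two take a single string and keep only its letters, so spaces and
# punctuation are dropped along with the combining marks.
def plainCaps(w):
    return ''.join(x.upper() for x in normalize(NFD, w) if category(x)[0] in letter)

def plainLow(w):
    return ''.join(x.lower() for x in normalize(NFD, w) if category(x)[0] in letter)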
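
The splitting functions above all follow the same idea: walk through the characters and cut wherever the text switches between word characters and everything else. As an alternative sketch of that idea (not the approach used in this notebook), the same runs can be collected with itertools.groupby. The names is_word_char and split_runs are hypothetical:

```python
from itertools import groupby
from unicodedata import category, normalize


def is_word_char(ch):
    """True for letters and combining marks (major classes L and M)."""
    return category(ch)[0] in {'L', 'M'}


def split_runs(text, form='NFD'):
    """Return (is_word, run) pairs for consecutive runs in the normalized text."""
    normalized = normalize(form, text)
    return [(is_word, ''.join(chars))
            for is_word, chars in groupby(normalized, key=is_word_char)]


# Two Greek letters, a middle dot, then one more letter
runs = split_runs('\u03b1\u03b2\u00b7\u03b3')
print(runs)
```

Because every run is reported, punctuation in the middle of a word is separated no matter how many times it occurs. The price is that spaces also come back as non-word runs and have to be filtered out by the caller.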

Finally, we run some tests to show how it works...

In [10]:
print(splitPunc('.'))

wordBare = Tokenizer(sen1, udnorm='NFD')

wordPlainMajuscule = plainMajuscule(wordBare)
print(wordPlainMajuscule)

wordPlainMinuscule = plainMinuscule(wordBare)
print(wordPlainMinuscule)

(('.', '', ' '),)
ΑΛΟΓΟΙ ΔΙ ΩΝ ΔΗΛΟΝ ΟΤΙ ΑΙ ΑΜΑΡΤΙΑΙ ΜΗΝΥΟΝΤΑΙ ΑΙ ΜΗ ΓΕΓΟΝΥΙΑΙ ΚΑΤΑ ΛΟΓΟΝ
αλογοι δι ων δηλον οτι αι αμαρτιαι μηνυονται αι μη γεγονυιαι κατα λογον
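
The accent stripping behind the two lines of output above comes down to decomposing the text and discarding every combining mark. A minimal standalone sketch of that step, using a hypothetical helper strip_marks and a word written with escapes so that its exact code points are known:

```python
from unicodedata import category, normalize


def strip_marks(text):
    """Decompose text (NFD) and drop every combining mark (major class M)."""
    return ''.join(ch for ch in normalize('NFD', text)
                   if category(ch)[0] != 'M')


# The word 'logon' in Greek letters; its first omicron carries a tonos (U+03CC)
accented = '\u03bb\u03cc\u03b3\u03bf\u03bd'
plain = strip_marks(accented)
print(plain, plain.upper())
```

Without the NFD step the precomposed omicron with tonos would be a single 'Ll' character and the filter would have nothing to remove.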


In [3]:
print(category('ʼ'))

Lm
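
The result Lm means that the modifier letter apostrophe counts as a letter, so the functions above keep it inside a word instead of splitting it off. The sketch below compares it with a few similar-looking characters. The selection of characters is an assumption made for illustration and does not come from the original notebook:

```python
from unicodedata import category, name

# Characters that can all look like an apostrophe in Greek text:
# modifier letter apostrophe, ASCII apostrophe, right single quotation
# mark, and the Greek psili.
lookalikes = ['\u02bc', '\u0027', '\u2019', '\u1fbf']

lookalike_cats = [category(ch) for ch in lookalikes]
for ch, cat in zip(lookalikes, lookalike_cats):
    print(cat, name(ch))
```

Only the first of these is in a letter class. The others fall under punctuation or symbols, so the same-looking mark can end up inside or outside a token depending on which code point the source text uses.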
