# NLTK with Non-Latin Scripts (Greek)

In this notebook, we will learn how to work with non-Latin scripts, taking Greek as an example.

You probably know that all characters, such as letters, are represented as numerical values by a computer. This is called **character encoding**. When computers were first created, there was no standard for the numerical values for different characters. To solve this problem, the American Standards Association created the **ASCII** encoding in the 1960s, which standardized the representation of English characters.

In the following decades, as computing spread around the world, the need for a standard encoding beyond English characters became clear. In the early 1990s, the Unicode Consortium published the **Unicode** standard to provide universal support for all languages. At the abstract level, Unicode assigns a number called a **code point** to every character in every writing system it covers.

Since Greek is written in a non-Latin script, we will be working with Unicode throughout.
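As a small illustration of code points, the sketch below uses Python's built-in `ord()` and `chr()` to convert between a character and its number. The variable names are only illustrative. Note that a code point is an abstract number; a byte encoding such as UTF-8 determines how that number is actually stored.

```python
# A character and its code point are two views of the same thing.
alpha = "α"
code_point = ord(alpha)                # integer code point of the character
hex_label = f"U+{code_point:04X}"      # conventional U+XXXX notation
utf8_bytes = alpha.encode("utf-8")     # how UTF-8 stores this code point

print(code_point)   # 945
print(hex_label)    # U+03B1
print(utf8_bytes)   # b'\xce\xb1'
print(chr(0x03C9))  # ω (GREEK SMALL LETTER OMEGA)
```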

## 1. Getting started

In [1]:
sentence = "ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής· ό,τι περίεργο."
sentence = sentence.lower()
sentence

'αυτος είναι ο χορός της βροχής της φυλής· ό,τι περίεργο.'

A package called [`unidecode`](https://pypi.org/project/Unidecode) can be used to transliterate any Unicode string into the “closest possible representation” in ASCII text. This can be useful in some applications that work with non-Latin scripts.

In [2]:
from unidecode import unidecode

sentence_latin = unidecode(sentence)
sentence_latin

'autos einai o khoros tes brokhes tes phules* o,ti periergo.'

## 2. Cleaning the text

We will start by removing the accents from the letters. First, we need to understand how accented letters are represented in Unicode. Accents are **combining diacritical marks** that can be added after the base character. Unicode also contains precomposed versions of most letter/diacritic combinations in normal use. For example, `é` can be represented as `U+0065 e LATIN SMALL LETTER E` followed by `U+0301 ◌́ COMBINING ACUTE ACCENT`, or as the precomposed character `U+00E9 é LATIN SMALL LETTER E WITH ACUTE`. The mechanism of **canonical equivalence** within *The Unicode Standard* ensures that both representations are treated as the same text.

The standard also defines a normalization procedure, called **Unicode normalization**, where equivalent sequences of characters are replaced so that any two texts that are equivalent will be reduced to the same sequence of code points. There are four **normal forms** that we can choose from.
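The following sketch, which uses only the standard-library `unicodedata` module, shows the two representations of `é` described above. Compared directly, the strings differ, but they become identical once both are normalized to the same form. The four normal forms are named NFC, NFD, NFKC, and NFKD; the variable names here are illustrative.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Without normalization, Python compares code points, so these differ.
print(precomposed == decomposed)  # False

# After normalization to any one form, the two strings are identical.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, precomposed)
    b = unicodedata.normalize(form, decomposed)
    print(form, a == b, len(a))
# NFC True 1
# NFD True 2
# NFKC True 1
# NFKD True 2
```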

To remove the accents from our text, we can first convert it to **Normalization Form Canonical Decomposition (NFD)**, one of the four Unicode normal forms. In this form, all combining characters are separated from the base letters. Then, we can take advantage of the fact that each Unicode code point has a **General Category** property which identifies what kind of character it is. Accents have the `"Mn"` category, which stands for “Mark, nonspacing”.

In [3]:
import unicodedata


def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')


sentence_no_accents = strip_accents(sentence)
sentence_no_accents

'αυτος ειναι ο χορος της βροχης της φυλης· ο,τι περιεργο.'
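To see why filtering on the `"Mn"` category works, it can help to inspect a decomposed word one code point at a time. This is a minimal sketch using the word `χορός` from the example sentence; the variable names are illustrative.

```python
import unicodedata

word = "χορ\u03ccς"  # χορός, with ό as the precomposed U+03CC
decomposed = unicodedata.normalize("NFD", word)

# Print each code point with its General Category and official name.
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# The same filter also removes other nonspacing marks, such as the dialytika.
iota_dialytika = "\u03ca"  # ϊ
stripped = "".join(
    c for c in unicodedata.normalize("NFD", iota_dialytika)
    if unicodedata.category(c) != "Mn"
)
print(stripped)  # ι
```

The accent shows up as a separate `U+0301` code point with category `Mn`, while the letters have category `Ll`. Because the filter drops every nonspacing mark, it removes the dialytika as well as the accent, which may or may not be what a given application wants.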

In [4]:
from nltk.tokenize import WhitespaceTokenizer

tokens = WhitespaceTokenizer().tokenize(sentence_no_accents)
tokens

['αυτος',
 'ειναι',
 'ο',
 'χορος',
 'της',
 'βροχης',
 'της',
 'φυλης·',
 'ο,τι',
 'περιεργο.']

Now that we have tokenized the sentence, let’s remove the punctuation. Besides the full stop, we need to remove `·` (the *ano teleia*), which is the Greek equivalent of a semicolon.

We should not remove the `,` in the `ο,τι` token, because it is part of the word. It is a separation mark called a “hypodiastole” that was used in Ancient Greek to distinguish words or phrases from others that are spelled similarly. In Modern Greek it is written as a comma and survives only in this word, where it distinguishes `ό,τι` (“whatever”) from `ότι` (“that”).

In [5]:
from string import punctuation

punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Python’s `string.punctuation` only includes ASCII punctuation characters. [This Stack Overflow question](https://stackoverflow.com/questions/60983836/complete-set-of-punctuation-marks-for-python-not-just-ascii) provides suggestions for obtaining a complete set of Unicode punctuation marks. However, for the purposes of this notebook, we will just append the list of Greek punctuation marks that do not appear in `string.punctuation`.

In [6]:
punctuation_extended = punctuation + '–—…“”‘’«»·'
punctuation_extended

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~–—…“”‘’«»·'
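As an alternative to a hand-maintained string, the General Category property can also identify punctuation: every punctuation category starts with `P` (for example `Po`, `Pd`, `Pi`, `Pf`). The helper `is_punctuation` below is a hypothetical name introduced for this sketch, not part of NLTK or the standard library.

```python
import unicodedata


def is_punctuation(ch):
    """Return True if the character's General Category is a punctuation class."""
    return unicodedata.category(ch).startswith("P")


# Middle dot, guillemets, ellipsis, en dash, and a currency symbol for contrast.
for ch in "\u00b7«»…–$":
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```

One caveat: some characters in `string.punctuation`, such as `$` and `+`, are classified by Unicode as symbols (categories starting with `S`), so this test alone does not cover everything in the ASCII list.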

In [7]:
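# Build the translation table once; it maps every punctuation mark to None (deletion).
remove_punctuation = str.maketrans(dict.fromkeys(punctuation_extended))

clean_tokens = []

for token in tokens:
    if token == 'ο,τι':
        # Keep the hypodiastole: the comma is part of the word.
        clean_tokens.append(token)
    else:
        clean_tokens.append(token.translate(remove_punctuation))

clean_tokens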

['αυτος',
 'ειναι',
 'ο',
 'χορος',
 'της',
 'βροχης',
 'της',
 'φυλης',
 'ο,τι',
 'περιεργο']

## 3. Removing stopwords

Finally, let’s filter out stopwords. We will use a list of Greek stopwords adapted from [6/stopwords-json](https://github.com/6/stopwords-json), which contains stopword lists for 50 languages. More extensive lists are also available, such as [Dr. Holger Bagola’s Greek stopword list](https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0). Note that the entries below are written without accents so that they match our accent-stripped tokens.

In [8]:
greek_stopwords = [
    'αλλα', 'αν', 'αντι', 'απο', 'αυτα', 'αυτες', 'αυτη', 'αυτο', 'αυτοι', 'αυτος',
    'αυτους', 'αυτων', 'για', 'δε', 'δεν', 'εαν', 'ειμαι', 'ειμαστε', 'ειναι',
    'εισαι', 'ειστε', 'εκεινα', 'εκεινες', 'εκεινη', 'εκεινο', 'εκεινοι',
    'εκεινος', 'εκεινους', 'εκεινων', 'ενω', 'επι', 'η', 'θα', 'ισως', 'κ', 'και',
    'κατα', 'κι', 'μα', 'με', 'μετα', 'μη', 'μην', 'να', 'ο', 'οι', 'ομως', 'οπως',
    'οσο', 'οτι', 'ο,τι', 'παρα', 'ποια', 'ποιες', 'ποιο', 'ποιοι', 'ποιος',
    'ποιους', 'ποιων', 'που', 'προς', 'πως', 'σε', 'στη', 'στην', 'στο', 'στον',
    'στης', 'στου', 'στους', 'στις', 'στα', 'τα', 'την', 'της', 'το', 'τον',
    'τοτε', 'του', 'των', 'τις', 'τους', 'ως',
]
len(greek_stopwords)

83

In [9]:
# A set gives fast membership tests.
greek_stopwords_set = set(greek_stopwords)

# Keep only the tokens that are not stopwords, preserving their order.
clean_tokens = [token for token in clean_tokens if token not in greek_stopwords_set]
clean_tokens

['χορος', 'βροχης', 'φυλης', 'περιεργο']
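The steps of this notebook can be gathered into a single function. The sketch below is one possible arrangement, not a definitive implementation: `preprocess`, `PUNCT`, and `demo_stopwords` are hypothetical names, the stopword set is a small subset chosen for the demonstration, and plain `str.split()` stands in for NLTK's `WhitespaceTokenizer` so that the block has no third-party dependencies.

```python
import string
import unicodedata

# ASCII punctuation plus the extra marks used earlier (U+00B7 is the middle dot).
PUNCT = string.punctuation + "–—…“”‘’«»\u00b7"
REMOVE_PUNCT = str.maketrans("", "", PUNCT)


def strip_accents(s):
    """Decompose to NFD and drop nonspacing marks (category Mn)."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )


def preprocess(text, stopwords, keep=("ο,τι",)):
    """Lowercase, strip accents, split on whitespace, drop punctuation and stopwords."""
    result = []
    for token in strip_accents(text.lower()).split():
        if token not in keep:  # tokens in `keep` retain their punctuation
            token = token.translate(REMOVE_PUNCT)
        if token and token not in stopwords:
            result.append(token)
    return result


demo_stopwords = {"αυτος", "ειναι", "ο", "της", "ο,τι"}  # illustrative subset only
result = preprocess(
    "ΑΥΤΟΣ είναι ο χορός της βροχής της φυλής· ό,τι περίεργο.", demo_stopwords
)
print(result)  # ['χορος', 'βροχης', 'φυλης', 'περιεργο']
```

Because the tokens are accent-stripped before the stopword check, the stopword list must be stored without accents too, as it is in this notebook.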

## 4. Other packages

Other packages worth exploring include [`polyglot`](https://pypi.org/project/polyglot/) and [`greek-stemmer`](https://pypi.org/project/greek-stemmer/). These require [`PyICU`](https://pypi.org/project/PyICU/) to be installed first. `PyICU` is a Python extension wrapping the International Components for Unicode (ICU) C++ libraries, which implement much of the Unicode Standard.