# Writing systems (scripts)

by Koenraad De Smedt at UiB

---
> Each string in Python3 is an immutable sequence of Unicode code points.

What does that mean? Various writing systems (scripts) are in use for the world's languages. Many languages, such as English and Norwegian, have an *alphabet* with letters which have some phonemic value. Chinese has morpho-syllabic characters. The Hangul (Korean) script combines the features of an alphabetic and syllabic writing system. In addition, most writing systems use additional characters such as punctuation, spaces, quotation marks, numerals, etc. Whatever the system, all scripts have units that we call *characters*.

There is no one-to-one correspondence between a language and a script.
Turkmen has been written in variants of the Arabic, Cyrillic and Latin scripts. Serbo-Croatian medieval texts were written in five scripts: Latin, Glagolitic, Early Cyrillic, Bosnian Cyrillic and a variant of Arabic. Chinese has two main scripts, traditional and simplified.

Conversely, a script may be used to write several different languages, often with some variation.
Variants of the Latin script are used to write
English, German, French, Norwegian, Vietnamese and many other languages.
The Devanagari script is used for several languages of India and Nepal.

*Encodings* are digital representations of characters: every character is assigned a number (its code point), which is in turn stored as one or more bytes (often written in hexadecimal). Python 3 strings are sequences of Unicode code points, and the default encoding for source files and for `str.encode()` is UTF-8, which can represent the whole [Unicode](https://home.unicode.org/) character set. Unicode defines roughly 150,000 characters and accommodates almost all scripts in use today, as well as some historical ones. In addition, Unicode has characters for emojis, phonetic symbols, arrows, mathematical symbols, music notation, etc.
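As a rough illustration of the difference between a code point and its encoded bytes, the following sketch uses the built-ins `ord()` and `str.encode()`. The sample characters are chosen here for illustration only.

```python
# One character has exactly one code point,
# but its UTF-8 encoding takes between 1 and 4 bytes.
samples = ['a', 'ø', '伞', '😀']

for ch in samples:
    code_point = ord(ch)              # the integer code point
    utf8_bytes = ch.encode('utf-8')   # the UTF-8 encoding as a bytes object
    print(ch, f'U+{code_point:04X}', utf8_bytes.hex(' '), len(utf8_bytes), 'byte(s)')

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
```

Note that `len()` on a string counts code points, whereas `len()` on the encoded bytes counts bytes.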

*Fonts* contain typographically defined shapes to make characters appear on the screen in a particular style. Whereas the Unicode inventory is very large, the set of defined shapes in most fonts is quite limited. There are specialized fonts for certain character sets. *Whether or not a particular character appears correctly on the screen is dependent on the font that is used!*

So, Python strings are sequences of Unicode code points for characters. Strings cannot be changed, but can be copied and transformed into new strings.
This notebook shows some examples of how strings in some scripts are handled in Python. Most string operations work fine for most writing systems, but there are some special cases to be aware of.
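A minimal sketch of what immutability means in practice: assigning to a position in a string raises a `TypeError`, so a changed version has to be built as a new string. The variable names are illustrative.

```python
word = 'skrift'

try:
    word[0] = 'S'                  # strings do not support item assignment
except TypeError as err:
    print('TypeError:', err)

capitalized = 'S' + word[1:]       # build a new string instead
print(word, capitalized)           # the original string is unchanged
```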

---

## Case methods and normalization

Some but not all scripts make a distinction between uppercase (capitals) and lowercase. The `.capitalize()` method returns a version with the first letter capitalized and the rest lowercase. The `.title()` method returns a version with the initial letter of every word capitalized and the rest lowercase. This works for every script that has a case distinction, such as Cyrillic in the example below.

In [None]:
print('a poem written by '.capitalize() + 'александр пушкин'.title())

Some writing systems use different glyphs for the same letter, depending on the position in the word or combining letters. The Greek final lowercase *sigma* is different from a non-final *sigma*. Lowercasing takes this into account.

In [None]:
sophos = 'ΣΟΦΌΣ'.lower() # Greek
sophos

As a result, the first letter is not the same as the last letter, even if they are both sigmas.

In [None]:
sophos[0] == sophos[-1]

In tasks where all sigmas need to be considered the same, the `.casefold()` method can be used to *normalize* the spelling.

In [None]:
print('ΣΟΦΌΣ'.casefold())

The German *Eszett* (ß, not to be confused with the Greek letter *beta*, β) is equivalent to a double *s*. However, the two strings are not the same.

In [None]:
'Straße' == 'Strasse'

Especially in uppercase, Eszett is usually replaced by double S, and this is what Python does.

In [None]:
print('Straße'.upper()) # German

The `.casefold()` method can be used to normalize the spellings.

In [None]:
print('Straße'.casefold())

Using casefolding, the two spellings become equal. Casefolding and other normalizations are often performed as the first step in text processing, in order to get rid of accidental differences between words.

In [None]:
'Strasse'.casefold() == 'Straße'.casefold()
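As a small sketch of casefolding as a first processing step, the following counts word forms with `collections.Counter` after folding them. The token list is made up for illustration.

```python
from collections import Counter

# Hypothetical tokens with accidental spelling and case differences
tokens = ['Straße', 'STRASSE', 'strasse', 'Weg', 'weg']

# Fold every token before counting, so that variants fall together
counts = Counter(token.casefold() for token in tokens)
print(counts)
```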

Some ligatures (one character representing several letters written together) can be converted to their composing characters using casefolding.

In [None]:
print('oﬃce has', len('oﬃce'), 'letters')
print('oﬃce'.casefold(), 'has', len('oﬃce'.casefold()), 'letters')

Normalization by means of `.casefold()` has its limitations, however. The Dutch digraph *ĳ* (a single character) is not converted to *ij* (two characters), and *œ* is not decomposed either.

In [None]:
print('Ĳstĳd'.casefold(), 'has', len('Ĳstĳd'.casefold()), 'letters')
print('cœur'.casefold(), 'has', len('cœur'.casefold()), 'letters')

A possible workaround for additional normalization is the `.replace()` method, which replaces each occurrence of a substring by another string.

In [None]:
'Ĳstĳd'.replace('ĳ', 'ij').replace('Ĳ', 'IJ')
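Another option, not used elsewhere in this notebook, is Unicode compatibility normalization with `unicodedata.normalize()` from the standard library. The sketch below applies the `NFKC` form, which splits characters that have a compatibility decomposition (such as *ĳ* and the *ﬃ* ligature) but leaves *œ* alone, since it has no such decomposition. The helper name `compat_normalize` is hypothetical. Be aware that `NFKC` also changes other characters, such as superscripts, so it may do more than intended.

```python
import unicodedata

def compat_normalize(text):
    """Return text in Unicode normalization form NFKC (hypothetical helper)."""
    return unicodedata.normalize('NFKC', text)

for word in ['Ĳstĳd', 'oﬃce', 'cœur']:
    normalized = compat_normalize(word)
    print(word, len(word), '->', normalized, len(normalized))
```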

## Case tests, string equality, inclusion and other tests

Case tests work for scripts that distinguish between uppercase and lowercase. Non-cased characters are ignored.

In [None]:
print('ΑΘΗΝΑΙ'.isupper())
print('ΑΘΗΝΑΙ. '.isupper())

Chinese, Ge'ez (Ethiopic), Hebrew, and many other scripts are single-case, which means they make no case distinction. Because strings in such scripts contain no cased characters, both `.isupper()` and `.islower()` return `False`.

In [None]:
print('雨伞'.isupper()) # simplified Chinese
print('雨伞'.islower())

Case conversions produce the same string for scripts without case distinctions.

In [None]:
print('雨伞'.casefold())
print('雨伞'.lower() == '雨伞'.upper())
print('𐤀𐤍𐤊'.lower() == '𐤀𐤍𐤊'.upper()) # Phoenician anoki, 'nk, 1st person
print('ግዕዝ'.lower() == 'ግዕዝ'.upper()) # Ge'ez

The `in` operator also works for non-Latin scripts.

In [None]:
'伞' in '雨伞'

Equality, however, is rather strict. Traditional Chinese characters are different from their Simplified Chinese counterparts.

In [None]:
'傘' == '伞'

The `.isalpha()` method checks whether all characters in a string are alphabetic. Chinese characters count as alphabetic (even though Chinese does not have an alphabet in the strict sense), whereas emojis, punctuation, spaces, etc. do not.

In [None]:
print('雨伞'.isalpha())
print('Γιάννης'.isalpha())
print('Ιωάννης ο Βαπτιστής'.isalpha())
print('LOL😀'.isalpha())
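The results above follow from the Unicode general category of each character: `.isalpha()` is true for the letter categories (those starting with `L`). As a sketch, `unicodedata.category()` from the standard library shows the category of a few sample characters chosen here for illustration.

```python
import unicodedata

samples = ['a', 'Я', '伞', '7', ' ', '!', '😀']

for ch in samples:
    category = unicodedata.category(ch)   # two-letter general category, e.g. 'Lu'
    print(repr(ch), category, ch.isalpha())

# Characters whose category starts with 'L' are letters
letters = [ch for ch in samples if unicodedata.category(ch).startswith('L')]
```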

## Glyph variation and writing direction

Arabic and other languages using the Arabic script are written from right to left. Python stores the characters in logical (reading) order, so index 0 is the first character read, which is displayed as the rightmost one. The script is also written in a cursive style: most letters have slightly different shapes depending on whether they stand alone or are joined to a preceding or following letter.

In [None]:
marhaban = 'مرحباً' # hello
print(len(marhaban))
print(marhaban[0])
print(marhaban[1])
print(marhaban[2])
print(marhaban[3])
print(marhaban[4])
print(marhaban[5])
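To see the logical order independently of how the screen renders it, the sketch below spells a similar word with escape sequences and prints each character's Unicode name and bidirectional class using `unicodedata`. The string is an assumed sample (the same greeting without the final diacritic), not the variable defined above.

```python
import unicodedata

# Assumed sample: meem, reh, hah, beh, alef, written as escapes in reading order
word = '\u0645\u0631\u062d\u0628\u0627'

for index, ch in enumerate(word):
    name = unicodedata.name(ch)                 # official character name
    direction = unicodedata.bidirectional(ch)   # 'AL' marks a right-to-left Arabic letter
    print(index, f'U+{ord(ch):04X}', name, direction)
```

The contextual shapes (isolated, initial, medial, final) are chosen by the font and rendering software; the string itself stores only one code point per letter.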

Some scripts, like Mongolian in the example below, are traditionally written vertically from top to bottom. However, web browsers and other applications normally render the lines horizontally. Mongolian glyphs may also be rendered differently depending on the neighbouring letters.

In [None]:
Sorghaghtani_Beki = 'ᠰᠤᠷᠬᠠᠭᠲᠠᠨᠢ ᠪᠡᠬᠢ'
print(Sorghaghtani_Beki)
print(Sorghaghtani_Beki[1:3])
print(Sorghaghtani_Beki[2:5])

The following is a string with some standard Runic characters, see the [Runic Unicode block](https://en.wikipedia.org/wiki/Runic_(Unicode_block)).

In [None]:
futhark = 'ᚠᚢᚦᚨᚱᚲ'
futhark.isalpha()

Note: Runic ligatures (bind-runes) are currently not standard Unicode, but [Menota](https://www.menota.org/HB4_ch18.xml#sec18.6) has defined them in the Unicode private use area. Those characters can be included in a Python string, but will not be displayed unless you have a special font for them.
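As a sketch of how Python treats private use characters, the example below inspects U+E000, the first code point of the private use area. This code point is picked arbitrarily for illustration and is not claimed to be one of the Menota assignments.

```python
import unicodedata

pua_char = '\ue000'   # arbitrary private use code point, for illustration only

category = unicodedata.category(pua_char)           # 'Co' means private use
name = unicodedata.name(pua_char, '(no name)')      # no standard name exists, so the default is returned
print(f'U+{ord(pua_char):04X}', category, name, pua_char.isalpha())
```

Since such characters have no standard properties, tests like `.isalpha()` return `False` for them.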

## Type coercion

The standard positional decimal numeral system (based on the Hindu-Arabic numeral system) is independent of the script used to represent the digits.

*Strings of digits* in most (but not all) scripts can be converted to ordinary numbers with `int()` or `float()` if the string can be interpreted in the standard positional decimal numeral system. However, digits from other scripts typed directly into code, outside a string, are not accepted as numeric literals.

In [None]:
int('٠١٢٣٤٥٦٧٨٩') # Arabic-Indic digits

In [None]:
float('೩೧.೨೭') # Kannada digits

In [None]:
float('๓.๑๔') # Thai digits
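The "not all" case can be sketched as follows: `int()` accepts only decimal digits, so characters with a numeric value that are not decimal digits (such as a superscript two or a Roman numeral character) are rejected. The helper `try_int` is hypothetical, and the samples are written as escapes to keep them unambiguous.

```python
def try_int(text):
    """Return int(text), or None if the conversion fails (hypothetical helper)."""
    try:
        return int(text)
    except ValueError:
        return None

samples = {
    'Arabic-Indic three': '\u0663',
    'Thai three': '\u0e53',
    'superscript two': '\u00b2',
    'Roman numeral eight': '\u2167',
}

for label, text in samples.items():
    # isdecimal() is the test that matches what int() accepts
    print(label, text, text.isdecimal(), text.isnumeric(), try_int(text))
```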

### A note on other encodings

While the use of other encodings is discouraged, conversion from legacy encodings (such as Windows-1252, ISO-8859-1 or Mac OS Roman) to UTF-8 is possible. See the note in the notebook on *Reading and writing files* for handling files with text in other encodings.
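A minimal sketch of such a conversion for data already in memory: decode the bytes with the name of the legacy encoding, then encode the resulting string as UTF-8. The sample bytes are constructed here for illustration.

```python
# 'Straße' as it would be stored in Windows-1252 (one byte for the Eszett)
legacy_bytes = b'Stra\xdfe'

text = legacy_bytes.decode('cp1252')   # bytes -> str, naming the legacy encoding
utf8_bytes = text.encode('utf-8')      # str -> bytes in UTF-8 (two bytes for the Eszett)

print(text, legacy_bytes, utf8_bytes)
```

Decoding with the wrong encoding name either raises an error or silently produces wrong characters, so the source encoding has to be known or determined first.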

## Exercises

In addition to testing the above examples, check if your font supports the International Phonetic Alphabet, such as `/ɪˈpɒnɪməs/`. If any characters are not properly rendered, consider choosing a different font in the Colab editor or wherever you are using this notebook. Do not, however, expect to find a font that renders all characters in all possible writing systems.