## Characters and Strings

### How Computers Store Characters as Numbers

Every character used by a computer corresponds to a unique number and vice versa. 

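Not every character is a visible symbol. Two groups deserve a mention:

* Whitespace characters (such as the space, the tab, and the newline)
* Control characters, whose purpose is to control input/output devices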
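
The character-to-number mapping and the two groups above can be explored with Python's built-in `ord()` and `chr()`. This is only a small sketch: the `samples` dictionary and the `is_ascii_control` helper are names made up for illustration, and the helper assumes the usual ASCII convention that codes 0 to 31 and 127 are control characters.

```python
# A few well-known characters, picked for illustration.
samples = {"space": " ", "tab": "\t", "newline": "\n", "letter A": "A"}

# ord() maps a character to its number.
codes = {name: ord(ch) for name, ch in samples.items()}
for name, code in codes.items():
    print(f"{name}: {code}")

# chr() is the inverse of ord(): it maps a number back to its character.
round_trip = all(chr(ord(ch)) == ch for ch in samples.values())


def is_ascii_control(ch):
    """Return True for ASCII control characters (codes 0-31 and 127)."""
    return ord(ch) < 32 or ord(ch) == 127
```

Note that the tab and the newline are both whitespace and control characters, while the space is whitespace only.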

Of the standards that define such a mapping, *ASCII* (short for American Standard Code for Information Interchange) is the most widely used, and you can assume that nearly all modern devices (computers, printers, mobile phones, tablets, etc.) support it.

Stored one character per 8-bit byte, the code leaves room for 256 different characters, but standard ASCII defines only the first 128, and those are the ones we are interested in.

<div>
<img src="attachment:Screenshot_8.png" width="350" >
</div>

ASCII alone is not enough, though. Different languages use completely different alphabets, and some of them are not as simple as the Latin one, so something more flexible and capacious than ASCII was needed to make all the software in the world amenable to internationalization.
<br><br>
The word internationalization is commonly shortened to **I18N**:
<br><br>
**an I at the front of the word, then 18 more letters, and an N at the end.**
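
The counting rule behind the abbreviation can be checked with a few lines of Python. The `numeronym` function below is a hypothetical helper written for this note, not a standard library function.

```python
def numeronym(word):
    """Shorten a word to first letter + number of inner letters + last letter."""
    if len(word) < 3:
        return word  # too short to have any inner letters
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization").upper())  # I18N
```

The word has 20 letters, so 18 of them sit between the first and the last one.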

<div>
<img src="attachment:Screenshot_10.png" width="350" >
</div>

### Code Point

A code point is a number that identifies a character.
<br><br>
For example:
<br>
32 is the code point of the space character in ASCII.
<br><br>
We can say that standard ASCII consists of 128 code points.
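
As a quick illustration, a string can be turned into its list of code points with `ord()`, and the 128 ASCII code points can be enumerated with `chr()`. The variable names are arbitrary; `str.isascii()` requires Python 3.7 or later.

```python
# Code points of each character in a short string.
message = "Hi !"
code_points = [ord(ch) for ch in message]
print(code_points)  # [72, 105, 32, 33]

# Standard ASCII covers the code points 0 through 127.
ascii_chars = [chr(cp) for cp in range(128)]
all_ascii = all(ch.isascii() for ch in ascii_chars)

# The next code point, 128, is already outside standard ASCII.
beyond_ascii = chr(128).isascii()
```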

### Code Page

A code page is a standard that uses the upper 128 code points (128 to 255) to store the characters of a specific language or group of languages.
<br><br>
For example:
<br>
There are different code pages for Western Europe and Eastern Europe, Cyrillic and Greek alphabets, Arabic and Hebrew languages, and so on.

<br>

In other words, a code point taken from the upper half is ambiguous: the same number stands for different characters depending on which code page is in use.
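
The ambiguity is easy to reproduce. The sketch below decodes one and the same byte with three code pages that ship with Python; the byte value 0xE4 and the choice of ISO/IEC 8859-1, 8859-5, and 8859-7 are arbitrary picks for illustration.

```python
# A single byte with value 228 (0xE4), i.e. from the upper 128 code points.
raw = bytes([0xE4])

# The same byte decoded with Western European, Cyrillic, and Greek code pages.
decoded = {codec: raw.decode(codec)
           for codec in ("iso8859_1", "iso8859_5", "iso8859_7")}

for codec, char in decoded.items():
    print(f"{codec}: {char}")
```

Without knowing which code page a file was written with, there is no way to tell which of the three characters was meant.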



### Unicode

**Unicode assigns unique (unambiguous) code points to characters (letters, hyphens, ideograms, etc.), and its code space has room for more than a million of them.**
<br>
* The first 128 Unicode code points are identical to ASCII.
* The first 256 Unicode code points are identical to the ISO/IEC 8859-1 code page (a code page designed for western European languages).
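
Both compatibility claims can be verified from Python, whose strings are sequences of Unicode code points. This is a minimal check, assuming Python's `ascii` and `iso8859_1` codecs as the reference implementations of the two older standards.

```python
import sys

# The highest code point is U+10FFFF, so the code space holds 0x110000 values.
total_code_points = sys.maxunicode + 1
print(total_code_points)  # 1114112

# Decoding byte n as ASCII gives the character with Unicode code point n.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))

# The same holds for all 256 byte values under ISO/IEC 8859-1.
latin1_matches = all(bytes([n]).decode("iso8859_1") == chr(n) for n in range(256))
```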



<div>
<img src="attachment:Screenshot_11.png" width="150" >
</div>

### UCS-4 (Universal Character Set)

**The Unicode standard says nothing about how to encode and store the characters in memory and files. It only names all available characters and assigns them to planes (groups of characters of similar origin, application, or nature).**
<br>
* UCS-4 uses 32 bits (four bytes) to store each character, and the stored value is simply the character's Unicode code point.
* A file containing UCS-4 encoded text may start with a BOM (byte order mark), an unprintable combination of bits announcing the nature of the file's contents. Some utilities may require it.

<br>
As you can see, UCS-4 is a rather wasteful standard - it increases a text's size by four times compared to standard ASCII.
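
The size difference can be measured directly. Python has no codec named UCS-4, so this sketch uses `utf-32-be` (big-endian, no BOM) as a stand-in, on the assumption that the two are equivalent for this purpose.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")      # one byte per character
ucs4_bytes = text.encode("utf-32-be")   # four bytes per character, no BOM

print(len(ascii_bytes), len(ucs4_bytes))  # 5 20
print(ucs4_bytes[:4])                     # b'\x00\x00\x00H'

# Each group of four bytes is just the character's code point.
code_points = [int.from_bytes(ucs4_bytes[i:i + 4], "big")
               for i in range(0, len(ucs4_bytes), 4)]
```

For plain ASCII text, three of every four bytes are zero.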


<div>
<img src="attachment:Screenshot_12.png" width="450" >
</div>

#### BOM (Byte Order Mark)

**It is a special combination of bits announcing the encoding used by a file's content (e.g. UCS-4 or UTF-8).**
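
Python's `codecs` module exposes the BOM byte sequences as constants, and the `utf-8-sig` codec writes and strips the UTF-8 one. The `sniff_bom` function below is a hypothetical helper that only recognizes the encodings mentioned here, again treating UTF-32 as a stand-in for UCS-4.

```python
import codecs

# Encoding with "utf-8-sig" prepends the UTF-8 BOM to the data.
with_bom = "hi".encode("utf-8-sig")
print(with_bom)  # b'\xef\xbb\xbfhi'


def sniff_bom(data):
    """Guess an encoding from a leading BOM; return None if there is none."""
    known = [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
    ]
    for bom, name in known:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(with_bom))  # utf-8
```

Decoding with `utf-8-sig` removes the mark again, so `with_bom.decode("utf-8-sig")` gives back `"hi"`.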

### UTF-8 (Unicode Transformation Format)

**UTF-8 uses as many bits for each of the code points as it really needs to represent them.**
<br>
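* All standard ASCII characters (including the basic Latin letters) occupy `8 bits`
* Many other alphabetic characters (e.g. accented Latin, Greek, Cyrillic, Hebrew, Arabic) occupy `16 bits`
* Most CJK (China-Japan-Korea) ideographs occupy `24 bits`
* Code points beyond the first 65,536 (e.g. most emoji) occupy `32 bits`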
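
The variable width is visible when single characters are encoded. The sample characters below are arbitrary picks, written as escape sequences so that the exact code points are unambiguous.

```python
samples = [
    "a",            # U+0061, basic Latin letter
    "\u00e9",       # U+00E9, Latin e with acute accent
    "\u044f",       # U+044F, Cyrillic letter ya
    "\u4e2d",       # U+4E2D, a CJK ideograph
    "\U0001f600",   # U+1F600, an emoji outside the first 65,536 code points
]

# Number of bytes each character occupies once encoded as UTF-8.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in sizes.items():
    print(f"U+{ord(ch):04X}: {n} byte(s), {n * 8} bits")
```

The accented Latin letter already needs two bytes: only the ASCII range fits into one.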

