# Image Files

There are many ways to embed text in image files. Steganography is the practice of **hiding** messages in objects such as image files.  You can also embed text that is not hidden, but you have to know how to find it.

## Bytes

Every file is a sequence of **bytes**. A byte is a sequence of eight zeros and ones called **bits**.

A byte can be associated with a number from 0 to 255, and that number can be written in base 10, our standard number system. People who work with bytes often express that number in other bases, namely base 2 or base 16.

### Base 2

**Examples**

>$13 = 8+4+1 = 2^3+2^2+2^0 = 1\cdot 2^3+1\cdot 2^2+0\cdot 2^1+1\cdot 2^0 = 1101\ base\ 2$

>$18=16+2=2^4+2^1=1\cdot 2^4+0\cdot 2^3+0\cdot 2^2+1\cdot 2^1+0\cdot 2^0=10010\ base\ 2$

The Python function `bin()` takes an integer as its argument and returns a string representation of that number in base 2.

**Examples**

In [2]:
bin(18)

'0b10010'

The `0b` at the start of the string indicates binary or base 2.
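The hand calculations above can also be done by repeatedly dividing by 2 and collecting the remainders. The following is a minimal sketch of that idea; the helper name `to_base2` is made up for illustration and is not a built-in Python function.

```python
def to_base2(n):
    """Return the base 2 digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next digit, right to left
        n //= 2                    # integer division drops that digit
    return "".join(reversed(digits))

print(to_base2(13))  # 1101
print(to_base2(18))  # 10010
```

Apart from the `0b` prefix, the result should match what `bin()` returns.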

**Base 2 to Base 10**

>$1011\ base\ 2=1\cdot 2^3+0\cdot 2^2+1\cdot 2^1+1\cdot 2^0=8+2+1=11$

If you put `0b1011` in a code cell and evaluate it, Python shows you the base 10 form of the number:

In [4]:
0b1011

11
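If the binary digits are stored as a string rather than typed as a literal, the same conversion can be sketched in two ways: by walking through the digits, or with the built-in `int()` and its base argument. The helper name `from_base2` is hypothetical and used only for this illustration.

```python
def from_base2(bits):
    """Convert a string of 0s and 1s to an integer."""
    total = 0
    for bit in bits:
        total = total * 2 + int(bit)  # shift what we have left, then add the new digit
    return total

print(from_base2("1011"))  # 11
print(int("1011", 2))      # 11, using the built-in with base 2
```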

### Exercises

1. Find the base 2 representation of the following numbers and use `bin()` to check your answers:

* 9
* 17
* 15

2. Find the base 10 form of the following base 2 numbers and use Python to check your answers:

* 0b10110
* 0b100011
* 0b1000

### Base 16

In base 16 you need single characters to represent the base 10 numbers 0,1,2,...,14,15.

The characters 0,...,9 represent the numbers 0,...,9, and a,b,c,d,e,f represent the numbers 10,11,12,13,14,15.

**Examples:**

>$27=16+11=1\cdot 16^1+11\cdot 16^0=1\cdot 16^1+b\cdot 16^0=1b\ base\ 16$

>$63=3\cdot 16^1+15\cdot 16^0=3\cdot 16^1+f\cdot 16^0 = 3f\ base\ 16$

The Python function `hex()` takes an integer as its argument and returns a string representation of that number in base 16.

In [13]:
hex(63)

'0x3f'

The `0x` at the beginning of the string indicates *hexadecimal* or base 16.

If you put `0x3f` in a code cell and evaluate it, Python shows you the base 10 form of the number.

In [14]:
0x3f

63
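The digit table described above (0 through 9, then a through f) can be written out directly. This is a rough sketch rather than how `hex()` works internally; the names `HEX_DIGITS` and `to_base16` are invented for the example.

```python
HEX_DIGITS = "0123456789abcdef"  # position in the string = value of the digit

def to_base16(n):
    """Return the base 16 digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        out.append(HEX_DIGITS[n % 16])  # look up the character for this digit
        n //= 16
    return "".join(reversed(out))

print(to_base16(27))  # 1b
print(to_base16(63))  # 3f
print(int("3f", 16))  # 63, going back to base 10 with the built-in
```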

### Exercises

1. Find the base 16 form of the following numbers and use `hex()` to check your answers:

* 21
* 30
* 283

2. Find the base 10 form of the following base 16 numbers and use Python to check your answers:

* 0xff
* 0x3a7
* 0x11c

In [16]:
hex(283),hex(21),hex(30)

('0x11b', '0x15', '0x1e')

In [17]:
0xff,0x3a7,0x11c

(255, 935, 284)

It is cumbersome to use base 2 to represent numbers because they have so many digits. Programmers like to take one byte, which is an eight-digit base 2 number, and split it into two groups of four bits, then represent each four-bit group as a single base 16 digit.

**Examples**

> $11111111= 1111\ \ 1111= 15\ \ 15=ff\  base\ 16= 15\cdot 16^1+15\cdot 16^0=255$

> $10011110= 1001\ \ 1110= 9\ \ 14=9e\ base\ 16=9\cdot 16^1+14\cdot 16^0=158$

In [18]:
hex(0b11111111)

'0xff'

In [20]:
hex(0b10011110)

'0x9e'

In [21]:
0xff

255

In [22]:
0x9e

158
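The split into two four-bit groups can also be done in code. The sketch below uses Python's bit shift (`>>`) and bitwise AND (`&`) operators, which this notebook has not introduced, so treat it as an optional aside.

```python
byte = 0b10011110

high = byte >> 4       # drop the bottom four bits, leaving 0b1001
low = byte & 0b1111    # keep only the bottom four bits, 0b1110

print(high, low)                             # 9 14
print(format(high, "x") + format(low, "x"))  # 9e
print(format(byte, "08b"))                   # 10011110
```

Each group of four bits maps to exactly one base 16 digit, which is why a byte is always two hexadecimal digits.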

## ASCII

Writing in English requires a small number of characters compared to other languages. For example, the western European languages use most of our letter characters, but they also need characters for the accent marks that are not used in English.

The ASCII character set consists of 128 different characters, some printable and some not. An example of a non-printable character is the bell character.

In decimal form (base 10) the characters are numbered 0 through 127.

In binary form (base 2) the characters are numbered 0 through 1111111.

In hexadecimal form (base 16) the characters are numbered 0 through 7f.

### 0-31

These are the control characters. For example, the **backspace character** is 8:

|Decimal | Hexadecimal | Binary|
|---|---|---|
|8|0x08|0b00001000|

The **escape character** is 27:

|Decimal | Hexadecimal | Binary|
|---|---|---|
|27|0x1b|0b00011011|
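Rows like the ones in these tables can be generated with Python's format specifications, where `02x` means two hexadecimal digits and `08b` means eight binary digits, both padded with leading zeros. The function name `table_row` is made up for this sketch.

```python
def table_row(code):
    """Format one character code as decimal, hexadecimal, and binary."""
    return f"|{code}|0x{code:02x}|0b{code:08b}|"

for code in (8, 27):
    print(table_row(code))
# |8|0x08|0b00001000|
# |27|0x1b|0b00011011|
```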

### 32-127

These are the printable characters, except for 127, which is the delete control character. For example, A is 65 and a is 97.

The Python function `chr()` returns the character for a given number:

In [28]:
chr(65),chr(0x41),chr(0b01000001)

('A', 'A', 'A')

In [29]:
chr(97),chr(0x61),chr(0b01100001)

('a', 'a', 'a')
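The built-in function `ord()` goes the other way, from a character to its number. A short sketch:

```python
print(ord("A"), ord("a"))        # 65 97
print(chr(ord("A") + 1))         # B
print([ord(c) for c in "Hi!"])   # [72, 105, 33]
```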

Here is a link to the ASCII character set: https://www.ascii-code.com/

Because there are only 128 ASCII characters, and the largest code is 127 = 0b1111111, which needs only seven bits, each ASCII character can be represented by a single byte that starts with 0.
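A quick way to check the single-byte claim is to write every ASCII code as an eight-digit binary string and look at the first digit. This is only a sanity check, not something you would need in practice.

```python
largest = 127  # the highest ASCII code

print(bin(largest))            # 0b1111111
print(largest.bit_length())    # 7
print(format(largest, "08b"))  # 01111111

# Every code from 0 to 127, padded to eight bits, starts with 0.
print(all(format(code, "08b").startswith("0") for code in range(128)))  # True

# The next number, 128, is the first one that needs the leading bit.
print(format(128, "08b"))      # 10000000
```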

## Latin-1

As the European countries started representing their languages in digital files, they started to use the entire byte, not just the bytes that start with 0. This gave them enough room to represent the ASCII characters as well as their accents and other non-standard characters.

That worked (sort of), but then different countries started using different codes for their accents, and it was a mess.  If you were given a text file you needed to know which numbers corresponded to which characters.
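The mess can be seen by decoding the same byte with two different single-byte encodings. The sketch below uses Python's `latin-1` codec and, as an arbitrary second choice for comparison, the `iso8859_5` (Cyrillic) codec.

```python
raw = bytes([0xE9])  # one byte whose first bit is 1, so it is outside ASCII

print(raw.decode("latin-1"))    # é
print(raw.decode("iso8859_5"))  # щ

# Same byte, different character: you have to know which encoding was used.
print(raw.decode("latin-1") == raw.decode("iso8859_5"))  # False
```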

The problem is much more complicated when you look at Asian languages that use thousands of characters.

## UTF-8

UTF-8 is an encoding system that uses more than one byte for some characters. It gets a little complicated!
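To see the variable length in action, you can encode a few characters with Python's `str.encode()` and count the bytes. The three characters below are just illustrative picks.

```python
for ch in ("A", "é", "€"):
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), [hex(b) for b in encoded])
# A 1 ['0x41']
# é 2 ['0xc3', '0xa9']
# € 3 ['0xe2', '0x82', '0xac']
```

The ASCII character still takes a single byte that starts with 0, so plain ASCII text is also valid UTF-8.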

## strings command

The Linux command `strings` searches for runs of printable characters in a binary file.

If picture.png is the name of an image file, then the command `strings picture.png` produces a lot of output, most of it junk.

The Linux command `head` shows the first ten lines of a file by default. `head -n 100` shows the first 100 lines.

The command `strings picture.png | head -n 100` pipes the output of `strings`, which is many lines, to `head`, which with the flag `-n 100` shows just the first 100 lines.
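To get a feel for what `strings` is doing, here is a rough Python approximation: it looks for runs of at least four bytes in the printable ASCII range (32 through 126). This is a sketch under simplifying assumptions, not a reimplementation of the real tool, and the function name `printable_runs` and the sample bytes are made up for the example.

```python
import re

def printable_runs(data, min_length=4):
    """Return runs of printable ASCII bytes that are at least min_length long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [run.decode("ascii") for run in re.findall(pattern, data)]

# Made-up bytes that mix readable text with non-printable junk.
sample = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x01tEXtparameters\x00old dog\xff\x00ab\x00"

print(printable_runs(sample))  # ['IHDR', 'tEXtparameters', 'old dog']
```

To try it on a real file, read the file's bytes with `open(path, "rb").read()` and pass them to the function. Short runs such as `PNG` and `ab` are skipped because they are under four characters.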

The file `../data/image.png` is an image produced by webui.

In [1]:
!strings ../data/image.png | head

IHDR
tEXtparameters
((old dog)), mixed breed dachsund, and boykin spaniel, long hair, dark brown
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 716484137, Size: 512x512, Model hash: aadddd3d75, Model: deliberate_v3, Version: v1.6.0
IDATx
*+3c
F#h$h
O@$A@"!$
TJJe*33
9""2"
