# Writeup

## Problem (unicode1): Understanding Unicode (1 point)


### (a) 
#### Question
What Unicode character does chr(0) return?

Deliverable: A one-sentence response.


```
>>> chr(0)
'\x00'
```

#### Answer
`chr(0)` returns the Unicode character with code point 0, which is the null character (`'\x00'`).

### (b) 
#### Question
How does this character’s string representation (`__repr__()`) differ from its printed
representation?

Deliverable: A one-sentence response.


```
>>> repr(chr(0))
"'\\x00'"
>>> print(chr(0))

```


#### Answer
The string representation of `chr(0)` is the visible escape sequence `'\x00'` (quotes included), whereas printing `chr(0)` writes the actual null character, which produces no visible output.
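The difference can be checked directly. The following is a minimal sketch that captures what `print` writes by redirecting standard output into an in-memory buffer; the variable names are illustrative and not part of the assignment.

```python
import contextlib
import io

null_char = chr(0)

# repr() yields a printable escape sequence wrapped in quotes.
repr_text = repr(null_char)

# Capture exactly what print() writes to standard output.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(null_char)
printed_text = buffer.getvalue()

print(len(repr_text))            # 6: quote, backslash, x, 0, 0, quote
print(printed_text == "\x00\n")  # True: the raw null character plus a newline
```

The escaped form is six visible characters, while the printed form is a single non-printing character followed by the newline that `print` appends.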

### (c) 
#### Question
What happens when this character occurs in text? It may be helpful to play around with the
following in your Python interpreter and see if it matches your expectations:
```
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
```

Deliverable: A one-sentence response.

```
>>> "this is a test" + chr(0) + "string"
'this is a test\x00string'
>>> print("this is a test" + chr(0) + "string")
this is a test string
```


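#### Answer

When the null character occurs in text, it remains part of the string (it shows up as `\x00` in the `repr` and counts toward the length), but it is a non-printing character, so nothing visible is rendered for it when the string is printed.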
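A short sketch, using the same test string as the question, suggests that the null character is stored like any other character even though it is not displayed. The variable names here are illustrative.

```python
text = "this is a test" + chr(0) + "string"

# The null character counts toward the length: 14 + 1 + 6 characters.
length = len(text)

# It can be located and used as a separator like any other character.
null_index = text.index("\x00")
parts = text.split("\x00")

# It also survives encoding as a single zero byte.
encoded = text.encode("utf-8")

print(length)      # 21
print(null_index)  # 14
print(parts)       # ['this is a test', 'string']
```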

## Problem (unicode2): Unicode Encodings (3 points)


### (a)
#### Question
What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than
UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various
input strings.

Deliverable: A one-to-two sentence response.



UTF-8 is a variable-length encoding that stores every ASCII character in a single byte, so typical text (English prose, code, markup) produces much shorter byte sequences than UTF-16 or UTF-32, which widen those same characters to 2 or 4 bytes by padding them with zero bytes. Shorter sequences with fewer uninformative zero bytes leave the tokenizer less redundant data to process, and UTF-8 is also backward compatible with ASCII and has no byte-order (endianness) variants.
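The size difference can be observed by encoding a few strings under each scheme. This is a minimal sketch: the helper names `encoded_lengths` and `zero_byte_count` and the sample strings are assumptions chosen for illustration, and the little-endian codecs are used so that no byte-order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(text: str) -> dict[str, int]:
    """Return the number of bytes `text` occupies under each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


def zero_byte_count(text: str, encoding: str) -> int:
    """Count the zero bytes produced when encoding `text`."""
    return text.encode(encoding).count(0)


for sample in ["hello", "héllo", "こんにちは"]:
    print(sample, encoded_lengths(sample))

# For ASCII text, UTF-16 and UTF-32 consist mostly of padding.
print(zero_byte_count("hello", "utf-8"))      # 0
print(zero_byte_count("hello", "utf-16-le"))  # 5
print(zero_byte_count("hello", "utf-32-le"))  # 15
```

UTF-8 is not the shortest encoding for every script: the Japanese sample takes 15 bytes in UTF-8 but 10 in UTF-16, so the advantage is strongest for ASCII-heavy text.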

### (b)
#### Question
Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into
a Unicode string. Why is this function incorrect? Provide an example of an input byte string
that yields incorrect results.
```
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
```
Deliverable: An example input byte string for which decode_utf8_bytes_to_str_wrong
produces incorrect output, with a one-sentence explanation of why the function is incorrect.



```
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
>>> decode_utf8_bytes_to_str_wrong("é".encode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
```

An example input is `"é".encode("utf-8")`, which evaluates to `b'\xc3\xa9'`; the function raises a `UnicodeDecodeError` on it instead of returning `'é'`.

This function is incorrect because it decodes each byte in isolation, whereas any non-ASCII character, such as `'é'`, is encoded in UTF-8 as a multi-byte sequence that is only valid when decoded as a whole.
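For contrast, one possible fix is to decode the whole byte string in a single call. The sketch below places a hypothetical `decode_utf8_bytes_to_str` (a name not taken from the assignment) next to the original function to show the difference on the failing input.

```python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Original (incorrect) version: decodes every byte separately.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])


def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Hypothetical corrected version: decode the full sequence at once so
    # multi-byte characters are seen as a unit.
    return bytestring.decode("utf-8")


data = "é".encode("utf-8")  # b'\xc3\xa9'

try:
    decode_utf8_bytes_to_str_wrong(data)
    wrong_version_failed = False
except UnicodeDecodeError:
    wrong_version_failed = True

decoded = decode_utf8_bytes_to_str(data)

print(wrong_version_failed)  # True
print(decoded)               # é
```

Both functions agree on pure ASCII input such as `b"hello"`, which is why the bug is easy to miss.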

### (c)
#### Question
Give a two byte sequence that does not decode to any Unicode character(s).

Deliverable: An example, with a one-sentence explanation.

The byte sequence `b'\xc2\xc2'` does not decode to any Unicode character.

This sequence is invalid because the first byte (`0xC2`, bit pattern `110xxxxx`) announces a two-byte character, but the second byte is another lead byte rather than the required continuation byte of the form `10xxxxxx` (`0x80` to `0xBF`).
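The claim can be verified with Python's built-in decoder. In this minimal sketch, `is_valid_utf8` is a hypothetical helper introduced for illustration; it relies only on `bytes.decode` raising `UnicodeDecodeError` for malformed input.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lead byte followed by another lead byte is rejected.
print(is_valid_utf8(b"\xc2\xc2"))  # False

# The same lead byte followed by a continuation byte decodes to U+00A9.
print(is_valid_utf8(b"\xc2\xa9"))  # True
print(b"\xc2\xa9".decode("utf-8"))  # ©
```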