# ISO9660 File System Reader Project

## Python `ctypes` Module

For the data structures in this project we will use Python's `ctypes` module.
It is part of the standard library, so you don't need to install anything extra.
The `struct` module would also work.

In [8]:
from ctypes import c_int8, c_uint8, c_char, c_uint16, c_uint32, sizeof

Most *multibyte* numeric fields in the ISO9660 file system are stored twice: once as *little endian* (also called LSB) and once as *big endian* (also called MSB).
For this project we choose `LittleEndianStructure` and ignore the big endian copies.

In [9]:
from ctypes import LittleEndianStructure
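
To make the both-endian layout concrete, here is a minimal sketch. The `BothEndian32` structure and the `encode_both_endian` helper are hypothetical names used only for illustration; they are not part of the project code.

```python
import struct
from ctypes import LittleEndianStructure, c_uint32, sizeof


def encode_both_endian(value: int) -> bytes:
    """Encode a 32-bit value as a little endian copy followed by a big endian copy."""
    return struct.pack("<I", value) + struct.pack(">I", value)


class BothEndian32(LittleEndianStructure):
    # Hypothetical structure: one both-endian 32-bit field.
    _pack_ = 1
    _fields_ = [
        ("value", c_uint32),     # little endian copy, the one we read
        ("reserved", c_uint32),  # big endian copy, ignored
    ]


raw = encode_both_endian(27)
field = BothEndian32.from_buffer_copy(raw)
print(raw.hex(), field.value, sizeof(field))
```

The same pattern, a real field followed by a `reserved` field of the same size, is what the structures in the next section rely on.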

## Data Structures

### Date and Time Format in `DirectoryEntry`s
Each field in the `DirectoryEntryDateTime` structure is a `c_uint8`, except the time zone (`tz`), which is a `c_int8` giving the offset from GMT in 15-minute intervals. The `year` field counts years since 1900.

In [10]:
class DirectoryEntryDateTime(LittleEndianStructure):
    _pack_ = 1
    _fields_ = [
        ("year", c_uint8),
        ("month", c_uint8),
        ("day", c_uint8),
        ("hour", c_uint8),
        ("minute", c_uint8),
        ("second", c_uint8),
        ("tz", c_int8),
    ]

    def __repr__(self) -> str:
        return (
            f"{self.year+1900}-{self.month}-{self.day} "
            f"{self.hour}:{self.minute}:{self.second} "
            f"{self.tz//4:+02}:{self.tz*15%60:02}"
        )
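
The `tz` arithmetic deserves a closer look, because floor division behaves differently for negative offsets: with `tz = -5` (1 hour 15 minutes behind GMT) the expression `tz//4` gives `-2` and `tz*15%60` gives `45`. The sketch below shows one possible way to handle the sign once for the whole value. The helper names `format_offset` and `tz_from_quarter_hours` are assumptions made for this example.

```python
from datetime import timedelta, timezone


def tz_from_quarter_hours(tz: int) -> timezone:
    """Convert a signed count of 15-minute steps into a tzinfo object."""
    return timezone(timedelta(minutes=15 * tz))


def format_offset(tz: int) -> str:
    """Format the offset as +HH:MM, applying the sign to the whole value."""
    sign = "-" if tz < 0 else "+"
    hours, minutes = divmod(abs(tz) * 15, 60)
    return f"{sign}{hours:02}:{minutes:02}"


print(format_offset(12))  # +03:00
print(format_offset(-5))  # -01:15
```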

### `DirectoryEntry`s
A directory is an array of `DirectoryEntry`s, each of which points to a file or subdirectory in that directory.

In [11]:
from pathlib import Path


def get_directory(directory_entry): ...
def read_file(directory_entry): ...


class DirectoryEntry(LittleEndianStructure):
    _pack_ = 1
    _fields_ = [
        ("length", c_uint8),  # length of this structure + the file name
        ("eattr_length", c_uint8),

        ("location_of_extent", c_uint32),  # block number of the actual file
        ("reserved", c_uint32),  # used for big endian
        ("size_of_extent", c_uint32),  # size in bytes of the actual file
        ("reserved", c_uint32),  # used for big endian

        ("datetime", DirectoryEntryDateTime),  # 7
        ("flags", c_uint8),
        ("unit_size", c_uint8),
        ("gap_size", c_uint8),

        ("volume_sequence_number", c_uint16),
        ("reserved", c_uint16),  # used for big endian
        ("name_length", c_uint8),
    ]

    _children: dict = ...
    parent = None
    path = Path("/")

    @property
    def is_directory(self):
        return bool(self.flags & 0x2)

    @property
    def is_hidden(self):
        return bool(self.flags & 0x1)

    @property
    def children(self):
        if self.is_directory:
            if self._children is ...:
                get_directory(self)
            return self._children

    def __iter__(self):
        return iter(self.children)

    def __getitem__(self, k):
        return self.children[k]

    def __repr__(self) -> str:
        if self.is_directory:
            return f"<Directory {repr(self.children)}>"
        return f"<File {self.path} at block {self.location_of_extent} ({self.size_of_extent} bytes)>"
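
The `flags` byte holds more bits than the two tested above. As a sketch, the bits can be named with an `IntFlag`. The `FileFlags` class is hypothetical, and the bit meanings in the comments follow my reading of the ISO9660 directory record layout, so treat them as assumptions to check against the specification.

```python
from enum import IntFlag


class FileFlags(IntFlag):
    HIDDEN = 0x01        # entry need not be shown to the user
    DIRECTORY = 0x02     # entry describes a directory
    ASSOCIATED = 0x04    # entry is an associated file
    RECORD = 0x08        # record format is described in the extended attributes
    PROTECTION = 0x10    # permissions are described in the extended attributes
    MULTI_EXTENT = 0x80  # more directory records follow for the same file


flags = FileFlags(0x03)
print(FileFlags.DIRECTORY in flags)     # True
print(FileFlags.MULTI_EXTENT in flags)  # False
```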

### Date and Time Format of `VolumeDescriptor`s
Unlike `DirectoryEntryDateTime`, the date and time fields in a `VolumeDescriptor` are stored as strings of ASCII digits, except the time zone (`tz`), which is stored as a `c_int8` giving the offset from GMT in 15-minute intervals.

In [12]:
class VolumeDescriptorDateTime(LittleEndianStructure):
    _pack_ = 1
    _fields_ = [
        ("year", c_char * 4),
        ("month", c_char * 2),
        ("day", c_char * 2),
        ("hour", c_char * 2),
        ("minute", c_char * 2),
        ("second", c_char * 2),
        ("milis", c_char * 2),  # two digits: hundredths of a second
        ("tz", c_int8),
    ]

    def __repr__(self) -> str:
        return (
            f"{self.year.decode()}-{self.month.decode()}-{self.day.decode()} "
            f"{self.hour.decode()}:{self.minute.decode()}:{self.second.decode()}:{self.milis.decode()} "
            f"{self.tz//4:+02}:{self.tz*15%60:02}"
        )

### `VolumeDescriptor`s

First, a set of `VolumeDescriptor`s is read from the ISO9660 image in order to locate the `root` directory.
There are two types of `VolumeDescriptor` we can use:
the primary `VolumeDescriptor` and the supplementary `VolumeDescriptor`.

- The primary `VolumeDescriptor` is the standard `VolumeDescriptor` and imposes some file name limitations.
File names are usually encoded in ASCII, but to avoid decoding errors on some nonstandard images it is recommended to decode them as Latin-1 (`iso-8859-1`).

- The supplementary `VolumeDescriptor` is used for the Joliet extension.
File names are encoded in Unicode, as big endian UTF-16 (`utf-16_be`).


The layouts of the primary and the Joliet volume descriptors are very similar.
In fact, the additional fields that the supplementary `VolumeDescriptor` uses for the Joliet extension are reserved fields in the primary `VolumeDescriptor`.
So we use the supplementary `VolumeDescriptor` structure as our single `VolumeDescriptor` class.

Every `VolumeDescriptor` must be exactly `2048` bytes long.
This is why `653` bytes are reserved at the end of the structure.

In [13]:
class VolumeDescriptor(LittleEndianStructure):
    _pack_ = 1
    _fields_ = [
        ("type", c_uint8),  # 0x01 for Primary and 0x02 for Joliet Extension
        ("signature", c_char * 5),  # Must be CD001
        ("version", c_uint8),
        ("volume_flags", c_uint8),  # Used for Joliet Extension
        ("system", c_char * 32),
        ("volume", c_char * 32),
        ("reserved", c_char * 8),
        ("volume_space_size", c_uint32),
        ("reserved", c_uint32),  # used for big endian
        ("escape_sequences", c_char * 32),
        ("volume_set_size", c_uint16),
        ("reserved", c_uint16),  # used for big endian
        ("volume_sequence_number", c_uint16),
        ("reserved", c_uint16),  # used for big endian
        ("block_size", c_uint16),
        ("reserved", c_uint16),  # used for big endian
        ("path_table", c_char * 24),  # we don't use this
        ("root_directory", DirectoryEntry),
        ("reserved", c_uint8),
        ("volume_set", c_char * 128),
        ("publisher", c_char * 128),
        ("data_preparer", c_char * 128),
        ("application", c_char * 128),
        ("copyright", c_char * 37),
        ("abstract_file", c_char * 37),
        ("bibliographic_file", c_char * 37),
        ("creation_datetime", VolumeDescriptorDateTime),
        ("modification_datetime", VolumeDescriptorDateTime),
        ("expiration_datetime", VolumeDescriptorDateTime),
        ("effective_datetime", VolumeDescriptorDateTime),
        ("file_structure_version", c_uint8),
        ("reserved", c_char),
        ("application_used", c_char * 512),
        ("reserved", c_char * 653),
    ]
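
The `653` bytes of trailing padding can be checked with simple arithmetic. The sketch below lists the field sizes from the structure above (with each both-endian pair counted as one entry) and subtracts their sum from the block size. The grouping and the dictionary keys are only labels chosen for this example.

```python
BLOCK_SIZE = 2048

# Sizes in bytes, taken from the _fields_ list of VolumeDescriptor.
field_sizes = {
    "type + signature + version + volume_flags": 1 + 5 + 1 + 1,
    "system + volume": 32 + 32,
    "reserved": 8,
    "volume_space_size (both endian)": 4 + 4,
    "escape_sequences": 32,
    "volume_set_size (both endian)": 2 + 2,
    "volume_sequence_number (both endian)": 2 + 2,
    "block_size (both endian)": 2 + 2,
    "path_table": 24,
    "root_directory + 1-byte name": 33 + 1,
    "volume_set, publisher, preparer, application": 4 * 128,
    "copyright, abstract, bibliographic": 3 * 37,
    "four date and time fields": 4 * 17,
    "file_structure_version + reserved": 1 + 1,
    "application_used": 512,
}

used = sum(field_sizes.values())
padding = BLOCK_SIZE - used
print(used, padding)  # 1395 653
```

With the real classes in scope, `sizeof(VolumeDescriptor) == 2048` would be the equivalent one-line check.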

## Opening The File

Now we are ready to open the test file. We must open the file in *binary* mode. Since we will only read from it, we open it with mode `"rb"`.

In [14]:
fd = open(iso_image_path := "test/test2.iso", "rb")

## System Area

An ISO9660 file system consists of blocks of `2048` bytes.
Unlike many other file systems, the useful data does not start at the very beginning of the image.
Instead, the first `16` blocks of the image (blocks `0` to `15`) are not used by the `ISO9660` file system.
This area is called the **system area** and makes **hybrid** file systems possible.

In [15]:
fd.seek(16 * 2048, 0)

32768
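
As an illustration of what a hybrid image might keep in the area before block `16`, the sketch below checks for the two-byte boot signature `0x55 0xAA` at offset `510`, which is how an MBR-style boot sector is usually recognised. The function name `has_mbr_signature` and the in-memory test images are assumptions for this example; real hybrid images can use other layouts.

```python
import io

BLOCK_SIZE = 2048
FIRST_DESCRIPTOR_BLOCK = 16  # everything before this block is the system area


def has_mbr_signature(stream) -> bool:
    """Return True if the system area starts with an MBR-style boot sector."""
    stream.seek(0)
    system_area = stream.read(FIRST_DESCRIPTOR_BLOCK * BLOCK_SIZE)
    return system_area[510:512] == b"\x55\xaa"


size = FIRST_DESCRIPTOR_BLOCK * BLOCK_SIZE
blank = io.BytesIO(bytes(size))
hybrid = io.BytesIO(bytes(510) + b"\x55\xaa" + bytes(size - 512))
print(has_mbr_signature(blank), has_mbr_signature(hybrid))  # False True
```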

## Reading the `VolumeDescriptor`

First we should read the primary or the supplementary `VolumeDescriptor`. The supplementary `VolumeDescriptor` is optional and may not be present.


Usually the primary `VolumeDescriptor` is located at block `16`,
and the supplementary `VolumeDescriptor` used for the Joliet extension is found at block `17` if it exists.
In general, however, you should search blocks `16` to `22`
until you find a descriptor whose `type` is `0xff`, which terminates the set.

In [16]:
def volume_descriptors(fd):
    fd.seek(16 * 2048, 0)
    while True:
        vd = VolumeDescriptor()
        fd.readinto(vd)
        assert vd.signature == b"CD001", (hex(vd.type), vd.signature)
        if vd.type == 0xff:
            break
        yield vd
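
The scan can be tried without a real image. The following sketch reads only the 7-byte header of each descriptor from an in-memory stream and stops at the terminator. `scan_descriptor_types`, `fake_descriptor` and the `max_descriptors` limit are hypothetical additions for this example; the limit simply guards against images that lack a terminator.

```python
import io

BLOCK_SIZE = 2048
TERMINATOR = 0xFF


def scan_descriptor_types(stream, max_descriptors: int = 16) -> list[int]:
    """Return the type codes of the volume descriptors before the terminator."""
    types = []
    stream.seek(16 * BLOCK_SIZE)
    for _ in range(max_descriptors):
        block = stream.read(BLOCK_SIZE)
        if len(block) < BLOCK_SIZE or block[1:6] != b"CD001":
            raise ValueError("truncated image or missing CD001 signature")
        if block[0] == TERMINATOR:
            return types
        types.append(block[0])
    raise ValueError("no terminator found within the scan limit")


def fake_descriptor(type_code: int) -> bytes:
    """Build a mostly empty 2048-byte descriptor for the demonstration."""
    header = bytes([type_code]) + b"CD001" + b"\x01"
    return header + bytes(BLOCK_SIZE - len(header))


image = io.BytesIO(
    bytes(16 * BLOCK_SIZE)
    + fake_descriptor(0x01)
    + fake_descriptor(0x02)
    + fake_descriptor(TERMINATOR)
)
print(scan_descriptor_types(image))  # [1, 2]
```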

In [17]:
print("Test File:", iso_image_path)
for _vd in volume_descriptors(fd):
    print()
    match _vd.type:
        case 0x00:
            print("Boot Record")
        case 0x01:
            print("Primary Volume Descriptor")
            print("  volume: \t", _vd.volume.decode("iso-8859-1"))
            print("  system: \t", _vd.system.decode("iso-8859-1"))
            print("  application:\t", _vd.application.decode("iso-8859-1"))
            print("  Creation:\t", _vd.creation_datetime)
            vd = _vd
        case 0x02:
            print("Joliet Extension")
            print("  volume: \t", _vd.volume.decode("utf-16_be"))
            print("  system: \t", _vd.system.decode("utf-16_be"))
            print("  application:\t", _vd.application.decode("utf-16_be"))
            print("  Creation:\t", _vd.creation_datetime)
            vd = _vd
        case _:
            print("Unknown Volume Descriptor", hex(_vd.type))

Test File: test/test2.iso

Primary Volume Descriptor
  volume: 	 tes2                            
  system: 	 Win32                           
  application:	 PowerISO                                                                                                                        
  Creation:	 2023-12-20 14:43:45:00 +3:00


## Listing The Directories

For listing the directories we have two choices:
- Read the `Pathtable` to find all the directories in the image, and then read those directories.
- Read the `root` directory of the image directly, and read the remaining directories as needed.

We choose the second option because we don't want to implement an additional `Pathtable` structure now.

Note that each `DirectoryEntry` must fit in one block.
If there is not enough space left for a new `DirectoryEntry`, the remaining bytes of the block are filled with zeros and the new `DirectoryEntry` is written to the next block.

In [18]:
def get_directory(directory: DirectoryEntry):
    fd.seek(directory.location_of_extent * vd.block_size, 0)

    offset = 0
    directory._children = dict()

    while offset < directory.size_of_extent:
        if fd.peek(1)[:1] in (b"", b"\x00"):
            # Zero padding (or end of file): skip to the start of the next block.
            next_block = vd.block_size - offset % vd.block_size
            fd.seek(next_block, 1)
            offset += next_block
            continue

        file = DirectoryEntry()
        fd.readinto(file)

        filename = fd.read(file.name_length)
        # Skip the padding byte and anything else stored after the name.
        fd.seek(file.length - file.name_length - sizeof(file), 1)
        # b"\x00" and b"\x01" name the directory itself and its parent.
        if filename not in (b"\x00", b"\x01"):
            file.path = directory.path / \
                filename.decode(("iso-8859-1", "utf-16_be")[vd.type - 1])
            directory._children[file.path.name] = file
            file.parent = directory
        offset += file.length
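
The block-padding rule is easier to follow on a small in-memory extent. The sketch below builds a few directory records by hand and walks them with `struct`, skipping to the next block whenever it meets a zero length byte. `build_record` and `iter_records` are hypothetical helpers; the byte offsets mirror the `DirectoryEntry` fields above, and the date and time bytes are simply left as zeros.

```python
import struct

BLOCK_SIZE = 2048


def build_record(name: bytes, block: int, size: int, flags: int = 0) -> bytes:
    """Build one directory record (illustrative helper, date and time zeroed)."""
    body = struct.pack("<I", block) + struct.pack(">I", block)
    body += struct.pack("<I", size) + struct.pack(">I", size)
    body += bytes(7) + bytes([flags, 0, 0])
    body += struct.pack("<H", 1) + struct.pack(">H", 1)
    body += bytes([len(name)]) + name
    if len(body) % 2 == 1:  # keep the total record length even
        body += b"\x00"
    return bytes([len(body) + 2, 0]) + body


def iter_records(extent: bytes):
    """Yield (name, block, size, flags) for each record in a directory extent."""
    offset = 0
    while offset < len(extent):
        length = extent[offset]
        if length == 0:  # zero padding: jump to the start of the next block
            offset += BLOCK_SIZE - offset % BLOCK_SIZE
            continue
        block, size = struct.unpack_from("<I4xI", extent, offset + 2)
        flags = extent[offset + 25]
        name_length = extent[offset + 32]
        name = extent[offset + 33 : offset + 33 + name_length]
        yield name, block, size, flags
        offset += length


first = build_record(b"\x00", 20, 4096, 0x02) + build_record(b"\x01", 20, 4096, 0x02)
second = build_record(b"TEST01.TXT", 27, 4)
# Two blocks: the second record starts at the beginning of the second block.
extent = first + bytes(BLOCK_SIZE - len(first)) + second + bytes(BLOCK_SIZE - len(second))
for name, block, size, flags in iter_records(extent):
    print(name, block, size, bool(flags & 0x02))
```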

In [19]:
vd.root_directory

<Directory {'TEST1': <Directory {'TEST01.TXT': <File /TEST1/TEST01.TXT at block 27 (4 bytes)>, 'TEST02.TXT': <File /TEST1/TEST02.TXT at block 27 (4 bytes)>, 'TEST03.TXT': <File /TEST1/TEST03.TXT at block 27 (4 bytes)>, 'TEST04.TXT': <File /TEST1/TEST04.TXT at block 27 (4 bytes)>, 'TEST2': <Directory {'TEST05.TXT': <File /TEST1/TEST2/TEST05.TXT at block 27 (4 bytes)>, 'TEST06.TXT': <File /TEST1/TEST2/TEST06.TXT at block 27 (4 bytes)>, 'TEST07.TXT': <File /TEST1/TEST2/TEST07.TXT at block 27 (4 bytes)>, 'TEST08.TXT': <File /TEST1/TEST2/TEST08.TXT at block 27 (4 bytes)>}>, 'TEST3': <Directory {'TEST09.TXT': <File /TEST1/TEST3/TEST09.TXT at block 27 (4 bytes)>}>}>}>

In [20]:
vd.root_directory["TEST1"]['TEST01.TXT']

<File /TEST1/TEST01.TXT at block 27 (4 bytes)>

To get a file or directory by its path, we define the following recursive function.

In [21]:
def get_file(path: Path):
    if path == path.parent:  # is root directory
        return vd.root_directory
    return get_file(path.parent)[path.name]
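
The recursion stops because the root is the only path that is its own parent. The sketch below shows the same idea on a plain nested dictionary, next to an equivalent loop over the path components. The `tree` data and the names `lookup` and `lookup_iterative` are made up for this example, and absolute paths are assumed.

```python
from pathlib import PurePosixPath

# Stand-in for the directory tree: nested dicts, with strings as file contents.
tree = {"TEST1": {"TEST01.TXT": "file 1", "TEST2": {"TEST05.TXT": "file 5"}}}


def lookup(path: PurePosixPath, root: dict):
    """Resolve a path recursively: find the parent first, then index by name."""
    if path == path.parent:  # only the root is its own parent
        return root
    return lookup(path.parent, root)[path.name]


def lookup_iterative(path: PurePosixPath, root: dict):
    """Same lookup as a loop over the components after the leading '/'."""
    node = root
    for part in path.parts[1:]:
        node = node[part]
    return node


target = PurePosixPath("/TEST1/TEST2/TEST05.TXT")
print(lookup(target, tree), lookup_iterative(target, tree))
```

The loop avoids deep recursion on very long paths; for the short paths used here either form works.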

In [22]:
get_file(Path("/TEST1/TEST02.TXT"))

<File /TEST1/TEST02.TXT at block 27 (4 bytes)>

## Reading The Files

The following function, `read_file`, prints the contents of the **text** file at the given path if it is a regular file.
If the path is a directory, `read_file` prints the file names in that directory.

In [23]:
def read_file(file: DirectoryEntry | Path | str):
    if isinstance(file, DirectoryEntry):
        if file.is_directory:
            for subfile in file:
                print(subfile)
        else:
            fd.seek(file.location_of_extent * vd.block_size, 0)
            size = file.size_of_extent
            while size:
                data = fd.read(size)
                if not data:  # unexpected end of file, stop instead of looping forever
                    break
                size -= len(data)
                print(data.decode(), end="")
            return
    if isinstance(file, Path):
        return read_file(get_file(file))
    if isinstance(file, str):
        return read_file(get_file(Path(file)))

In [24]:
read_file("/TEST1/TEST01.TXT")

test

In [25]:
read_file("/TEST1/TEST2")

TEST05.TXT
TEST06.TXT
TEST07.TXT
TEST08.TXT
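
For binary files it may be more useful to get the raw bytes back instead of printing decoded text. A minimal sketch, assuming a hypothetical helper `read_extent` that takes the block number and size found in a `DirectoryEntry` and raises if the image ends early:

```python
import io

BLOCK_SIZE = 2048


def read_extent(stream, block: int, size: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Read `size` bytes starting at logical block `block`."""
    stream.seek(block * block_size)
    chunks = []
    remaining = size
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:  # end of file reached before the extent was complete
            raise EOFError(f"extent truncated, {remaining} bytes missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# In-memory stand-in for an image whose block 27 starts with b"test".
image = io.BytesIO(bytes(27 * BLOCK_SIZE) + b"test")
print(read_extent(image, 27, 4).decode("iso-8859-1"))  # test
```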
