# Programming with Python

## Lecture 14: File I/O

### Khachatur Khechoyan

#### Yerevan State University
#### Portmind

# File

Fundamentally, a **file** comprises a continuous sequence of bytes utilized for data storage. The data within the file is structured according to a particular format and can range from basic text files to program executables.

Generally, a file consists of three essential parts:

1. **Header**: metadata about a file (file name, type, size, etc.).
2. **Data**: contents of a file.
3. **EOF**: a special marker that denotes the end of a file.

# Character encodings

**Character encoding** is a system that assigns numeric values (code points) to characters in order to represent and store them in digital form. It provides a standardized way to map characters from different writing systems and languages to binary data that can be understood and processed by computers.

In a character encoding scheme, each character is assigned a unique code point, which is typically represented as a numeric value. The most commonly used character encoding schemes include:

1. **ASCII (American Standard Code for Information Interchange):** ASCII is a widely used encoding scheme that represents basic Latin characters (English letters, digits, punctuation marks, and control characters) using 7 bits, providing a total of 128 possible characters.

2. **Unicode**: Unicode is a universal character encoding standard that aims to cover all characters and scripts used in written languages worldwide. It assigns a unique code point to each character, allowing representation of a vast range of characters from different writing systems. Unicode can be implemented with different encoding schemes, such as UTF-8, UTF-16, and UTF-32, which vary in the number of bytes used to represent each code point.

    - UTF-8 (8-bit Unicode Transformation Format): UTF-8 is a variable-length encoding scheme built on 8-bit code units, so each code point takes 1 to 4 bytes. It is backward compatible with ASCII, meaning that ASCII characters are represented using a single byte in UTF-8. It can encode all 1,114,112 Unicode code points.

    - UTF-16 (16-bit Unicode Transformation Format): UTF-16 is a variable-length encoding scheme built on 16-bit code units, so each code point takes either 2 or 4 bytes. It can represent the entire Unicode character set.

    - UTF-32 (32-bit Unicode Transformation Format): UTF-32 is a fixed-length encoding scheme that uses 32 bits (4 bytes) to represent each code point. It provides a straightforward mapping between code points and their binary representation but consumes more space compared to UTF-8 and UTF-16.

Character encodings are crucial for correctly interpreting and displaying text in different languages and writing systems. When exchanging or storing text data, it's important to ensure that the encoding used for reading and writing the data matches to prevent data corruption or incorrect interpretation of characters. Many programming languages and text editors support specifying or detecting the character encoding of files to handle them correctly.
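
As a small illustration of the points above, the sketch below encodes a few sample characters and compares how many bytes each encoding needs. The sample characters are arbitrary choices for this example. The `-le` codec variants are used only so that no byte order mark is added to the output.

```python
# Sample characters written as escapes: A, e-acute, euro sign, an emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

sizes = {}
for ch in samples:
    # Number of bytes needed in UTF-8, UTF-16 and UTF-32 respectively
    sizes[ch] = (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    print(f"U+{ord(ch):04X}", sizes[ch])

# Decoding bytes with a different encoding than the one used for writing
# does not always fail; it may silently produce the wrong characters.
raw = "\u00e9".encode("utf-8")
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")
print(repr(wrong), repr(right))
```

The ASCII character takes one byte in UTF-8, while the emoji takes four bytes in all three encodings.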

# File path

A **file path** is a string that represents the location of a file or directory within a file system. It is used to uniquely identify the location of a file or directory in a hierarchical structure.

A file path typically consists of a series of directory names separated by a delimiter, such as a forward slash (/) or a backslash (\), depending on the operating system. For example:

On Unix-like systems (e.g., Linux, macOS):
```
/home/user/Documents/file.txt
```

On Windows systems:
```
C:\Users\user\Documents\file.txt
```
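
The two example paths above can be taken apart with the standard `pathlib` module. This is a minimal sketch: the "pure" path classes only parse the string and never touch the file system, so both flavors work on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths parse the string only; the files do not need to exist
posix_path = PurePosixPath("/home/user/Documents/file.txt")
windows_path = PureWindowsPath(r"C:\Users\user\Documents\file.txt")

print(posix_path.parts)    # the components between the separators
print(posix_path.parent)   # the containing directory
print(posix_path.name)     # the final component
print(windows_path.drive)  # the drive prefix, Windows only
print(windows_path.name)
```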

# Open a file

Python's built-in `open()` function opens a file and returns a file object. If the file cannot be opened, an `OSError` is raised.

The first two parameters are the most commonly used: the file path and the mode in which you want to open the file.

The available mode characters, which can be combined (for example, `'rb'` or `'w+'`), are:

- `'r'`: open for reading (default)
- `'w'`: open for writing, truncating the file first
- `'x'`: open for exclusive creation, failing if the file already exists
- `'a'`: open for writing, appending to the end of file if it exists
- `'b'`: binary mode
- `'t'`: text mode (default)
- `'+'`: open for updating (reading and writing)
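
The difference between `'x'`, `'a'` and `'w'` is easiest to see on a scratch file. The sketch below works in a temporary directory; the file name `modes_demo.txt` is a hypothetical name chosen for this example.

```python
import os
import shutil
import tempfile

# Work in a temporary directory so no real file is touched
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

# 'x' creates the file; it must not exist yet
fp = open(path, "x")
fp.write("first\n")
fp.close()

# A second exclusive creation fails because the file now exists
exclusive_failed = False
try:
    open(path, "x")
except FileExistsError:
    exclusive_failed = True

# 'a' keeps the existing content and writes at the end
fp = open(path, "a")
fp.write("second\n")
fp.close()

fp = open(path, "r")
after_append = fp.read()
fp.close()

# 'w' discards the existing content before writing
fp = open(path, "w")
fp.write("third\n")
fp.close()

fp = open(path, "r")
after_truncate = fp.read()
fp.close()

print(exclusive_failed, repr(after_append), repr(after_truncate))
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```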

# Close a file

An opened file should always be closed, and it is your duty as a programmer to ensure that it is. Otherwise, buffered data may never be written to disk, and the operating system resources held by the file object are not released promptly.

In Python, the `.close()` method of the file object closes the file.

In [None]:
fp = open("a.txt", "w")

print(fp.closed)

fp.close()

print(fp.closed)

In [None]:
fp = open("a.txt", "r")

print(fp.closed)

fp.close()

print(fp.closed)

In [None]:
fp = open("a.txt", "rt")

print(fp.closed)

fp.close()

print(fp.closed)
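
One way to make sure `.close()` is always called is a `try`/`finally` block. The following is a minimal sketch that also shows what happens when a closed file is used; the file name `close_demo.txt` is a hypothetical name chosen for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "close_demo.txt")  # hypothetical file name

fp = open(path, "w")
try:
    fp.write("hello\n")
finally:
    # Runs whether or not the write above raised an exception
    fp.close()

# Any further I/O on a closed file object raises ValueError
message = None
try:
    fp.write("more")
except ValueError as error:
    message = str(error)

print(fp.closed, message)
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```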

# `with` statement

The `with` statement in Python provides a convenient way to manage resources, such as files or network connections, that need to be cleaned up or released after use. It ensures that certain operations are performed both before and after the block of code within the `with` statement. The general syntax of a `with` statement is as follows:

```python
with expression [as target]:
    # code block
```

Here's how the `with` statement works:

1. The expression following the `with` keyword typically evaluates to an object that represents the resource being managed (often the result of a function call such as `open()`). That object must define two special methods: `__enter__()` and `__exit__()`.
2. The `__enter__()` method is called when the block of code within the `with` statement is entered. It sets up the resource and returns an object that will be assigned to the optional `target` variable.
3. The `target` is an optional variable that receives the result of the `__enter__()` method. It allows you to work with the resource within the block of code.
4. The indented code block following the `with` statement represents the actions to be performed using the resource.
5. After the block of code is executed or if an exception occurs, the `__exit__()` method of the resource object is called. It is responsible for cleaning up the resource or handling any exceptions that occurred within the `with` block.

Reference: [PEP 343 – The “with” Statement](https://peps.python.org/pep-0343/)
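
The steps above can be traced with a small class that implements both special methods. `ManagedResource` is a hypothetical class written only for this illustration; it records each step instead of managing a real resource.

```python
class ManagedResource:
    """Hypothetical resource that records each step of the protocol."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this value is bound to the target after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False  # a false value lets any exception propagate


# Normal execution: enter, body, exit
resource = ManagedResource()
with resource as target:
    target.events.append("body")
print(resource.events)

# With an exception: __exit__() still runs before the error propagates
failing = ManagedResource()
try:
    with failing:
        raise RuntimeError("boom")
except RuntimeError:
    failing.events.append("propagated")
print(failing.events)
```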

# Ensure a file is closed via `with` statement

It is good practice to use the `with` statement when working with files. It guarantees that the file is closed when the block ends, even if an exception is raised inside it.

In [None]:
file_path = "a.txt"
file_mode = "r"

# Open the file
with open(file_path, file_mode) as fp:
    # Perform operations on the file
    print(fp.closed)
    
print(fp.closed)
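
To check the guarantee for the exceptional case, the sketch below raises an error inside the `with` block and then inspects the file object. The file name `with_demo.txt` and the simulated error are assumptions made for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical file name

closed_after_error = None
try:
    with open(path, "w") as fp:
        fp.write("partial\n")
        raise RuntimeError("simulated failure")
except RuntimeError:
    # The with statement has already closed the file at this point
    closed_after_error = fp.closed

print(closed_after_error)
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```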

# Read a file

There are several methods you can use to read a file in Python. Here are some common approaches:

- `.read(size=-1)` method
- `.readline(size=-1)` method
- `.readlines(hint=-1)` method
- Iteration

# `.read(size=-1)` method

This method reads at most `size` characters from the file in text mode (or `size` bytes in binary mode) and returns them. If no argument is provided, or if the argument is `None` or negative, the rest of the file is read. Each call continues from where the previous one stopped.

In [None]:
# Reading the entire file
with open("a.txt", "r") as fp:
    content = fp.read()
    print(content)

In [None]:
# Reading the first 10 characters, then the rest of the file
with open("a.txt", "r") as fp:
    content = fp.read(10)
    print(content)
    content = fp.read()
    print(content)
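
The following sketch shows how successive `.read(size)` calls move through a file, and how `size` counts characters in text mode but bytes in binary mode. The file name `read_demo.txt` and its content are assumptions made for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "read_demo.txt")  # hypothetical file name

# The second character (e-acute) takes two bytes in UTF-8
with open(path, "w", encoding="utf-8") as fp:
    fp.write("h\u00e9llo, world!\n")

with open(path, "r", encoding="utf-8") as fp:
    first = fp.read(5)   # the first five characters
    second = fp.read(5)  # continues where the previous call stopped
    rest = fp.read()     # everything that is left
    at_end = fp.read()   # nothing left: the empty string

with open(path, "rb") as fp:
    two_bytes = fp.read(2)  # binary mode counts bytes, not characters

print(repr(first), repr(second), repr(rest), repr(at_end), two_bytes)
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```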

# `.readline(size=-1)` method

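This method reads a single line from the file, up to and including the newline character. If `size` is given, at most `size` characters are read, and the next call continues from where the previous one stopped, on the same line if it was not finished. If no argument is provided, or if the argument is `None` or `-1`, it reads the entire line or the remaining part of the current line. At the end of the file, it returns an empty string.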

In [None]:
# Reading the entire line
with open("a.txt", "r") as fp:
    first_line = fp.readline()
    print(first_line)
    first_line = fp.readline()
    print(first_line)

In [None]:
# Reading at most 10 characters of a line, then at most 20 more
with open("a.txt", "r") as fp:
    part = fp.readline(10)
    print(part)
    part = fp.readline(20)
    print(part)

In [None]:
# Reading multiple lines
with open("a.txt", "r") as fp:
    print(fp.readline())
    print(fp.readline())
    print(fp.readline(5))
    print(fp.readline(12))
    print(fp.readline())
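
With known file content, the behavior of `.readline()` can be followed call by call. This is a minimal sketch; the file name `readline_demo.txt` and its content are assumptions made for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "readline_demo.txt")  # hypothetical file name

# Note that the last line has no trailing newline
with open(path, "w") as fp:
    fp.write("first line\nsecond line\nlast")

with open(path, "r") as fp:
    whole = fp.readline()       # a full line, newline included
    start = fp.readline(3)      # only three characters of the second line
    remainder = fp.readline()   # the rest of that same line
    last = fp.readline()        # the last line, without a newline
    after_end = fp.readline()   # end of file: the empty string

print(repr(whole), repr(start), repr(remainder), repr(last), repr(after_end))
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```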

# `.readlines(hint=-1)` method

This method reads the stream and returns a list of lines. You can provide a `hint` value to limit how much is read: once the total size of the lines read so far (in bytes or characters) exceeds `hint`, no more lines are read. Lines are never split, so the result may be larger than `hint`.

In [None]:
with open("a.txt", "r") as fp:
    lines = fp.readlines()
    print(lines)

In [None]:
# Stop reading once the lines read so far exceed about 10 characters
with open("a.txt", "r") as fp:
    lines = fp.readlines(10)
    print(lines)
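
The effect of `hint` is easier to see with known content. In the sketch below, the file name `readlines_demo.txt`, its three lines and the hint value `8` are assumptions made for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "readlines_demo.txt")  # hypothetical file name

with open(path, "w") as fp:
    fp.write("alpha\nbeta\ngamma\n")

# Without a hint, every line is returned
with open(path, "r") as fp:
    all_lines = fp.readlines()

with open(path, "r") as fp:
    # "alpha\n" has 6 characters, which is below the hint, so reading
    # continues; after "beta\n" the total is 11, which exceeds 8
    some_lines = fp.readlines(8)
    # The next call picks up where the previous one stopped
    remaining = fp.readlines()

print(all_lines, some_lines, remaining)
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```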

# Iterate over a file via `.readline()` method

We can iterate over the lines of a file by calling `.readline()` repeatedly. At the end of the file (EOF), `.readline()` returns the empty string `""`, which ends the loop. A blank line in the file is returned as `"\n"`, so it does not stop the loop.

In [None]:
with open("a.txt", "r") as fp:
    line = fp.readline()
    while line != "":
        print(line)
        line = fp.readline()

# Iterate over a file via `.readlines()` method

In [None]:
with open("a.txt", "r") as fp:
    lines = fp.readlines()

    for line in lines:
        print(line)

# Iterate over a file object

A file object is iterable, so we can loop over it directly. Lines are read one at a time, so the whole file does not need to be loaded into memory.

In [None]:
with open("a.txt", "r") as fp:
    for line in fp:
        print(line)
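
Each line produced by the iteration still ends with its newline character, so `print(line)` shows an extra blank line after every line. A common pattern, sketched below, strips the newline and numbers the lines with `enumerate()`. The file name `iterate_demo.txt` and its content are assumptions made for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "iterate_demo.txt")  # hypothetical file name

with open(path, "w") as fp:
    fp.write("red\ngreen\nblue\n")

numbered = []
with open(path, "r") as fp:
    for number, line in enumerate(fp, start=1):
        text = line.rstrip("\n")  # drop the trailing newline
        numbered.append(f"{number}: {text}")

for entry in numbered:
    print(entry)
shutil.rmtree(tmp_dir)  # clean up the temporary directory
```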