### Accessing Files from Python Code

One common task for developers is processing data stored in files, which are typically kept on various storage devices such as hard drives, optical discs, network storage, or solid-state drives.

It's easy to envision a program that sorts 20 numbers, with the user entering these numbers directly from the keyboard. However, the task becomes significantly more complex when dealing with 20,000 numbers, as it's impractical for a user to enter such a large amount of data without errors.

Instead, it's much easier to imagine these numbers being stored in a disk file that the program reads. The program sorts the numbers and then saves the sorted sequence to a new file, rather than displaying it on the screen.

For instance, if we want to implement a simple database, the only way to retain information between program runs is to save it into a file (or multiple files for a more complex database).

In essence, virtually every non-trivial programming problem involves files: processing images, multiplying matrices, and calculating wages and taxes all rely on reading and writing data stored in files.

#### The Concept of File Storage

You might wonder why we haven't addressed these issues earlier. The reason is straightforward—Python's method for accessing and processing files is built around a consistent set of objects, making this the ideal time to discuss it.

### File Names

Different operating systems handle files in different ways. For example, Windows uses a different naming convention compared to Unix/Linux systems.

Using the concept of a canonical file name (a name that uniquely defines the location of the file regardless of its position in the directory tree), we can see that these names appear differently in Windows and Unix/Linux systems.

#### The Concept of File Paths

Unix/Linux systems do not use disk drive letters (like C:). All directories branch out from a single root directory called `/`, while Windows systems recognize the root directory as `\`.

Additionally, file names in Unix/Linux systems are case-sensitive. Windows systems store the case of the letters in the file name but do not distinguish between them. 

This means that the strings "ThisIsTheNameOfTheFile" and "thisisthenameofthefile" refer to two different files in Unix/Linux systems but refer to the same file in Windows systems.

The most notable difference is the use of different separators for directory names: `\` in Windows and `/` in Unix/Linux. 

While this difference might not be significant for the average user, it is crucial when writing programs in Python. To understand why, consider the specific role of the `\` character in Python strings.

### File Names: Continued

Suppose you're interested in a particular file located in the directory `dir`, named `file`.

Suppose also that you want to assign a string containing the name of the file.

In Unix/Linux systems, it may look as follows:

```python
name = "/dir/file"
```

But if you try to code it for the Windows system:

```python
name = "\dir\file"
```

you'll get an unpleasant surprise: either Python will generate an error, or the execution of the program will behave strangely, as if the file name has been distorted in some way.

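In fact, it's not strange at all, but quite obvious and natural. Python uses the backslash (`\`) as an escape character (like `\n`), so the `\f` in the string above is read as a single form-feed character rather than as a backslash followed by the letter `f`.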

This means that Windows file names must be written as follows:

```python
name = "\\dir\\file"
```

Fortunately, there is also another solution. Forward slashes are accepted as directory separators on Windows as well, so a path written with `/` works there without any escaping.

This means that the following assignments:

```python
name = "/dir/file"
name = "c:/dir/file"
```

will work with Windows, too.
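
The options above can be compared in a short sketch, assuming Python 3. The raw-string prefix (`r"..."`) and `pathlib.PureWindowsPath` are not mentioned in the text; they appear here only as common alternatives and as a way to inspect Windows path rules on any operating system.

```python
from pathlib import PureWindowsPath

# Doubling each backslash produces one literal backslash in the string.
escaped = "\\dir\\file"

# A raw string switches escape processing off and yields the same text.
raw = r"\dir\file"

# An unescaped "\f" is a single form-feed character, not a backslash plus "f".
form_feed = "\f"

# PureWindowsPath applies Windows path rules without touching the disk.
win_path = PureWindowsPath("c:/dir/file")

print(escaped == raw)   # True
print(len(escaped))     # 9
print(len(form_feed))   # 1
print(str(win_path))    # c:\dir\file
```

The last line shows that a path written with forward slashes is understood as the same Windows location as its backslash form.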

Any program written in Python (and not only in Python, as this convention applies to virtually all programming languages) does not communicate with the files directly, but through abstract entities named differently in various languages or environments—the most common terms are handles or streams (we'll use them as synonyms here).

The programmer, having a set of functions/methods, can perform certain operations on the stream, which affect the real files using mechanisms contained in the operating system kernel.

This way, you can implement the process of accessing any file, even when the file name is unknown at the time of writing the program.

#### Accessing Files - A Tree Structure Concept

To connect (bind) the stream with the file, it's necessary to perform an explicit operation. The operation of connecting the stream with a file is called opening the file, while disconnecting this link is called closing the file.

Hence, the conclusion is that the very first operation performed on the stream is always open, and the last one is close. The program, in effect, is free to manipulate the stream between these two events and to handle the associated file.

This freedom is limited, of course, by the physical characteristics of the file and the way in which the file has been opened.

Opening the stream can fail for several reasons: the most common is the absence of a file with the specified name. 

It can also happen that the physical file exists, but the program is not allowed to open it. There's also the risk that the program has opened too many streams, and the specific operating system may not allow the simultaneous opening of more than a certain number of files (e.g., 200).

A well-written program should detect these failed openings and react accordingly.

### File Streams

Opening a stream is not only associated with the file but also involves declaring how the stream will be processed. This declaration is known as the open mode.

If the opening is successful, the program can only perform operations that are consistent with the declared open mode.

There are two basic operations performed on the stream:

- **Read from the stream:** Portions of data are retrieved from the file and placed in a memory area managed by the program (e.g., a variable).
- **Write to the stream:** Portions of data from the memory (e.g., a variable) are transferred to the file.

There are three basic modes used to open the stream:

- **Read mode:** A stream opened in this mode allows only read operations. Trying to write to the stream raises an `UnsupportedOperation` exception, which is defined in the `io` module and inherits from both `OSError` and `ValueError`.
- **Write mode:** A stream opened in this mode allows only write operations. Attempting to read from the stream will cause the aforementioned exception.
- **Update mode:** A stream opened in this mode allows both read and write operations.
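
The restriction imposed by the open mode can be observed directly. The sketch below assumes Python 3 and uses a temporary file (created with `tempfile.mkstemp()`, an illustrative choice) so that it does not depend on any existing path.

```python
import io
import os
import tempfile

# Create a throwaway empty file for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

stream = open(path, "r")   # read mode: write operations are not allowed
try:
    stream.write("hello")
    outcome = "write succeeded"
except io.UnsupportedOperation as exc:
    outcome = "write refused: " + str(exc)
finally:
    stream.close()
    os.remove(path)

print(outcome)
```

Because `io.UnsupportedOperation` inherits from both `OSError` and `ValueError`, an `except OSError` or `except ValueError` clause would catch it too.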

Before we discuss how to manipulate the streams, some explanation is needed. The stream behaves almost like a tape recorder.

When you read something from a stream, a virtual head moves over the stream according to the number of bytes transferred from the stream.

When you write something to the stream, the same head moves along the stream, recording the data from the memory.

Whenever we talk about reading from and writing to the stream, try to imagine this analogy. Programming books refer to this mechanism as the current file position, and we will use this term as well.
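
The movement of the virtual head can be watched with the `tell()` and `seek()` methods of a stream object. They are not introduced in this text, so treat the following as a minimal sketch. It uses a binary stream in `w+b` mode and a temporary file.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

stream = open(path, "w+b")        # binary update mode: reading and writing allowed
positions = [stream.tell()]       # the head starts before the first byte
stream.write(b"0123456789")
positions.append(stream.tell())   # the head moved past the ten written bytes
stream.seek(0)                    # rewind the head to the beginning
chunk = stream.read(4)
positions.append(stream.tell())   # reading four bytes moved the head by four
stream.close()
os.remove(path)

print(positions)   # [0, 10, 4]
print(chunk)       # b'0123'
```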

#### The Read/Write Concept

It's now necessary to introduce the object responsible for representing streams in programs.

### File Handles

Python assumes that every file is managed through an object of an appropriate class.

Naturally, you might wonder what is meant by "appropriate."

Files can be processed in various ways—some methods depend on the file's contents, while others depend on the programmer's intentions.

In any case, different files may require different sets of operations and may behave differently.

An object of an appropriate class is created when you open the file and is destroyed when you close it.

Between these two events, you can use the object to perform specific operations on a particular stream. The operations you can perform depend on how you opened the file.

In general, the object comes from one of the classes shown here:

- IOBase
- RawIOBase
- BufferedIOBase
- TextIOBase

Note: You never use constructors to create these objects. The only way to obtain them is by invoking the `open()` function.

The function analyzes the arguments you've provided and automatically creates the required object.

If you want to get rid of the object, you call the `close()` method.

This call severs the connection between the object and the file. The object itself may still exist afterwards, but it can no longer be used for reading or writing.

For our purposes, we'll focus only on streams represented by `BufferedIOBase` and `TextIOBase` objects. You'll understand why soon.
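
To see which class `open()` picks, you can inspect the returned objects. This is a small sketch assuming CPython 3; the concrete class names it prints are implementation details, while the `isinstance()` checks reflect the base classes listed above.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

text_stream = open(path, "rt")     # text mode
binary_stream = open(path, "rb")   # binary mode

is_text = isinstance(text_stream, io.TextIOBase)
is_buffered = isinstance(binary_stream, io.BufferedIOBase)
print(type(text_stream).__name__, is_text)
print(type(binary_stream).__name__, is_buffered)

text_stream.close()
binary_stream.close()
closed_flag = text_stream.closed   # True once close() has been called
os.remove(path)

print(closed_flag)
```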

### Text vs. Binary Streams

Due to the type of the stream's contents, all streams are divided into text and binary streams.

**Text streams** are structured in lines; they contain typographical characters (letters, digits, punctuation, etc.) arranged in rows (lines), as seen when you view the file contents in an editor. These files are typically written to or read from character by character or line by line.

**Binary streams** do not contain text but rather a sequence of bytes of any value. This sequence can represent an executable program, an image, an audio or video clip, a database file, etc. Because these files do not contain lines, reading and writing relate to portions of data of any size, usually done byte by byte or block by block, with the block size ranging from one to an arbitrarily chosen value.

A subtle problem arises with text streams. In Unix/Linux systems, line endings are marked by a single character named LF (Line Feed, ASCII code 10), represented in Python as `\n`. In contrast, other operating systems, particularly those derived from the prehistoric CP/M system (including Windows), use a different convention: the end of a line is marked by a pair of characters, CR (Carriage Return, ASCII code 13) and LF (Line Feed, ASCII code 10), which can be encoded as `\r\n`.

#### Text vs. Binary Streams Concept

This difference can lead to various unpleasant consequences.

For example, if you create a program for processing a text file on Windows, you can recognize line endings by finding `\r\n` characters. However, the same program running in a Unix/Linux environment will be completely useless, and vice versa. A program written for Unix/Linux systems might not work correctly on Windows.

Such undesirable features that prevent or hinder the use of the program in different environments are called non-portability.

Conversely, a program that can execute in different environments is said to have portability. A program with this trait is called a portable program.

### Addressing Portability Issues

Given that portability issues were (and still are) quite significant, a decision was made to resolve them in a way that doesn't require the developer's attention.

This solution was implemented at the level of classes responsible for reading and writing characters to and from the stream. Here’s how it works:

- When a stream is opened and it’s indicated that the data in the associated file will be processed as text (or if no such indication is given), the stream is switched to text mode.
- During the reading/writing of lines in a Unix environment, nothing special happens. However, in a Windows environment, newline character translation occurs: when reading a line, every pair of `\r\n` characters is replaced with a single `\n` character, and vice versa during writing—every `\n` character is replaced with a pair of `\r\n` characters.
- This mechanism is completely transparent to the program, allowing it to be written as if it were intended solely for Unix/Linux text files. The same source code will also work correctly in a Windows environment.
- When a stream is opened in binary mode, its contents are taken as-is, without any conversion—no bytes are added or omitted.
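
The difference between the two modes can be checked with a file that contains Windows-style line endings. The sketch below assumes Python 3 with the default `newline` setting of `open()`. Under that setting, `\r\n` is translated to `\n` on reading regardless of the platform, while the translation on writing depends on the platform's line separator.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Binary mode: the bytes, including both CR LF pairs, are stored as given.
stream = open(path, "wb")
stream.write(b"first\r\nsecond\r\n")
stream.close()

# Binary mode again: the content comes back unchanged.
stream = open(path, "rb")
raw_bytes = stream.read()
stream.close()

# Text mode: each CR LF pair is delivered to the program as a single "\n".
stream = open(path, "rt", encoding="ascii")
text = stream.read()
stream.close()
os.remove(path)

print(raw_bytes)    # b'first\r\nsecond\r\n'
print(repr(text))   # 'first\nsecond\n'
```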

### Opening Streams

A stream is opened using a function that can be invoked as follows:

```python
stream = open(file, mode='r', encoding=None)
```

Let’s break this down:

- The function name `open` is self-explanatory; if successful, it returns a stream object; otherwise, an exception is raised (e.g., `FileNotFoundError` if the file doesn’t exist).
- The first parameter (`file`) specifies the name of the file to be associated with the stream.
- The second parameter (`mode`) specifies the open mode for the stream; it’s a string consisting of a sequence of characters, each with a special meaning (details will follow).
- The third parameter (`encoding`) specifies the encoding type (e.g., UTF-8 for text files).

The opening must be the first operation performed on the stream.

**Note:** The `mode` and `encoding` arguments are optional. If omitted, their default values are assumed. The default opening mode is reading in text mode, and the default encoding depends on the platform used.
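
Since the default encoding depends on the platform, passing `encoding` explicitly makes a program behave the same everywhere. A minimal sketch, assuming UTF-8 is the desired encoding and using a temporary file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Write text with an explicit encoding.
stream = open(path, "w", encoding="utf-8")
stream.write("naïve café")
stream.close()

# The mode is omitted here, so the default "r" (read, text) is used.
stream = open(path, encoding="utf-8")
line = stream.read()
stream.close()

# Read the same file as raw bytes to see its size on disk.
stream = open(path, "rb")
size_in_bytes = len(stream.read())
stream.close()
os.remove(path)

print(len(line))       # 10 characters
print(size_in_bytes)   # 12 bytes: "ï" and "é" take two bytes each in UTF-8
```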

Now, let’s present the most important and useful open modes. Ready?

### Opening Streams: Modes

#### `r` Open Mode: Read
- The stream will be opened in read mode.
- The file associated with the stream must exist and be readable; otherwise, the `open()` function raises an exception.

#### `w` Open Mode: Write
- The stream will be opened in write mode.
- The file associated with the stream does not need to exist. If it doesn't exist, it will be created. If it does exist, it will be truncated to zero length (erased). If creation is not possible (e.g., due to system permissions), the `open()` function raises an exception.

#### `a` Open Mode: Append
- The stream will be opened in append mode.
- The file associated with the stream does not need to exist. If it doesn't exist, it will be created. If it does exist, the virtual recording head will be set at the end of the file (the previous content of the file remains untouched).

#### `r+` Open Mode: Read and Update
- The stream will be opened in read and update mode.
- The file associated with the stream must exist and be writable; otherwise, the `open()` function raises an exception.
- Both read and write operations are allowed for the stream.

#### `w+` Open Mode: Write and Update
- The stream will be opened in write and update mode.
- The file associated with the stream does not need to exist. If it doesn't exist, it will be created. If it does exist, the previous content will be erased.
- Both read and write operations are allowed for the stream.

### Selecting Text and Binary Modes
- If there is a letter `b` at the end of the mode string, the stream is opened in binary mode.
- If the mode string ends with a letter `t`, the stream is opened in text mode.
- Text mode is the default behavior when no binary/text mode specifier is used.

The successful opening of the file sets the current file position (the virtual reading/writing head) before the first byte of the file if the mode is not `a`, and after the last byte of the file if the mode is set to `a`.

| Text Mode | Binary Mode | Description       |
|-----------|-------------|-------------------|
| rt        | rb          | read              |
| wt        | wb          | write             |
| at        | ab          | append            |
| r+t       | r+b         | read and update   |
| w+t       | w+b         | write and update  |

### Extra Mode: Exclusive Creation
You can also open a file for exclusive creation using the `x` open mode. If the file already exists, the `open()` function raises a `FileExistsError` exception instead of overwriting it.
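
A short sketch of exclusive creation, assuming Python 3. The directory and the file name `fresh.txt` are made up for the example.

```python
import errno
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "fresh.txt")

stream = open(path, "x")   # the file does not exist yet, so it is created
stream.write("created once")
stream.close()

try:
    open(path, "x")        # the file exists now, so this attempt must fail
    second_attempt = "opened"
except FileExistsError as exc:
    second_attempt = "refused"
    code = exc.errno

os.remove(path)
os.rmdir(folder)

print(second_attempt)         # refused
print(code == errno.EEXIST)   # True
```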

### Opening the Stream for the First Time

Imagine we want to develop a program that reads the content of a text file located at: `C:\Users\User\Desktop\file.txt`.

How do we open that file for reading? Here's the relevant snippet of code:

```python
try:
    stream = open("C:\\Users\\User\\Desktop\\file.txt", "rt")
    # Processing goes here.
    stream.close()
except Exception as exc:
    print("Cannot open the file:", exc)
```

Output:

```
Cannot open the file: [Errno 2] No such file or directory: 'C:\\Users\\User\\Desktop\\file.txt'
```


#### What's Happening Here?

- We use a `try-except` block to handle runtime errors gracefully.
- We use the `open()` function to try to open the specified file (note the way the file path is specified).
- The open mode is set to read as text (`"rt"`). Since reading as text is the default mode, we could omit the `t`.
- If successful, `open()` returns an object, which we assign to the `stream` variable.
- If `open()` fails, we handle the exception by printing the full error information, which helps us understand what went wrong.

### Pre-Opened Streams

We mentioned earlier that any stream operation must be preceded by an `open()` function invocation. However, there are three well-defined exceptions to this rule.

When our program starts, three streams are already opened and don't require any extra preparation. You can use these streams explicitly by importing the `sys` module:

```python
import sys
```

These streams are named: `sys.stdin`, `sys.stdout`, and `sys.stderr`.

#### Let's Analyze Them:

- **sys.stdin**: Standard Input
  - Typically associated with the keyboard, pre-opened for reading, and considered the primary data source for running programs.
  - The well-known `input()` function reads data from `stdin` by default.

- **sys.stdout**: Standard Output
  - Typically associated with the screen, pre-opened for writing, and considered the primary target for outputting data by the running program.
  - The well-known `print()` function outputs data to the `stdout` stream.

- **sys.stderr**: Standard Error Output
  - Typically associated with the screen, pre-opened for writing, and considered the primary place where the running program should send error information.
  - We haven't yet presented a method to send data to this stream, but we will soon.
  - Separating `stdout` (useful program results) from `stderr` (error messages, useful but not results) allows redirecting these two types of information to different targets. A more extensive discussion of this is beyond the scope of this course, but the operating system handbook will provide more details.
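
One way to send data to each of the two output streams is sketched below, assuming Python 3. The `file=` argument of `print()` and the `write()` method of the stream object are both standard. The `report()` function and its messages are hypothetical.

```python
import sys

def report(result, problem):
    """Send a result to standard output and a diagnostic to standard error."""
    print(result)                           # goes to sys.stdout by default
    print(problem, file=sys.stderr)         # file= selects another stream
    sys.stderr.write("written directly\n")  # a stream can also be written to directly

report("42", "warning: the input was empty")
```

When the program's standard output is redirected to a file, only `42` ends up there. The two diagnostic lines still go to wherever standard error points.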

### Closing Streams

The final operation performed on a stream (excluding `stdin`, `stdout`, and `stderr` streams, which do not require it) should be closing the stream. This action is carried out by invoking the `close()` method on the open stream object:

```python
stream.close()
```

- The method name is self-explanatory: `close()`.
- The method expects no arguments and returns nothing, but it can raise an `OSError` (`IOError` is an alias of `OSError` in Python 3) if something goes wrong.
- Most developers assume that `close()` always succeeds and therefore do not check whether it completed properly.

This assumption is only partly justified. If the stream was opened for writing and a series of write operations were performed, it is possible that the data sent to the stream has not yet been transferred to the physical device due to mechanisms like caching or buffering. Since closing the stream forces the buffers to flush, it may happen that the flush fails, causing the `close()` function to fail as well.
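
A cautious program can guard the call to `close()` as well. The following sketch assumes Python 3 and a temporary file. The size measured before closing is shown only for illustration, because whether data is still buffered at that point depends on the implementation and the amount written.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

stream = open(path, "w")
stream.write("buffered text")
size_before_close = os.path.getsize(path)   # may be 0 while the data sits in a buffer

try:
    stream.close()                          # flushes the buffer; a failed flush surfaces here
    closed_ok = True
except OSError as exc:
    closed_ok = False
    print("close() failed:", exc)

size_after_close = os.path.getsize(path)    # all 13 characters are on disk now
os.remove(path)

print(size_before_close, size_after_close, closed_ok)
```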

We have previously mentioned failures caused by functions operating with streams but have not discussed how to identify the cause of the failure.

### Diagnosing Stream Problems

The exception object (an `OSError`, for which `IOError` is the older name) has a property named `errno` (short for error number), which can be accessed as follows:

```python
try:
    stream = open("C:\\Users\\User\\Desktop\\file.txt", "rt")
    # Processing goes here.
    stream.close()
except OSError as exc:
    print("Cannot open the file:", exc)
    print("Error number:", exc.errno)
```

Output:

```
Cannot open the file: [Errno 2] No such file or directory: 'C:\\Users\\User\\Desktop\\file.txt'
Error number: 2
```


The value of the `errno` attribute can be compared with predefined symbolic constants defined in the `errno` module.

#### Selected Constants for Detecting Stream Errors:

- **errno.EACCES → Permission denied**
  - Occurs when trying to open a file with read-only attributes for writing.

- **errno.EBADF → Bad file number**
  - Occurs when attempting to operate on an unopened stream.

- **errno.EEXIST → File exists**
  - Occurs when trying to create a file or directory that already exists (for example, when opening an existing file in `x` mode).

- **errno.EFBIG → File too large**
  - Occurs when attempting to create a file larger than the maximum size allowed by the operating system.

- **errno.EISDIR → Is a directory**
  - Occurs when attempting to treat a directory name as a regular file.

- **errno.EMFILE → Too many open files**
  - Occurs when trying to open more streams simultaneously than the operating system allows.

- **errno.ENOENT → No such file or directory**
  - Occurs when trying to access a non-existent file or directory.

- **errno.ENOSPC → No space left on device**
  - Occurs when there is no free space on the storage media.

The complete list of error codes is much longer and includes some error codes not related to stream processing.

### Diagnosing Stream Problems: Continued

If you are a very careful programmer, you might feel the need to compare `exc.errno` with each relevant constant in a chain of `if`/`elif` statements and print a separate message for every case.
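
Such a chain of checks could look like the sketch below. The function name `explain_open_failure()`, the messages, and the file path are hypothetical. Only the `errno` constants come from the list above.

```python
import errno

def explain_open_failure(path):
    """Try to open a file for reading and describe the outcome."""
    try:
        stream = open(path, "rt")
        stream.close()
        return "The file opened fine."
    except OSError as exc:
        # Compare the numeric code with the symbolic constants.
        if exc.errno == errno.ENOENT:
            return "The file doesn't exist."
        elif exc.errno == errno.EACCES:
            return "Permission denied."
        elif exc.errno == errno.EMFILE:
            return "Too many open files."
        else:
            return "The error number is: " + str(exc.errno)

print(explain_open_failure("no_such_dir_for_this_example/missing.txt"))
```

Every additional error code needs another branch, which is why this style quickly becomes tedious.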

Fortunately, there is a function that can significantly simplify error handling code.

The function is `strerror()`, which comes from the `os` module and expects just one argument—an error number.

Its role is simple: you provide an error number, and it returns a string describing the meaning of the error.

**Note:** If you pass a non-existent error code (a number not bound to any actual error), the result depends on the platform: the function may raise a `ValueError` exception, or it may return a generic "unknown error" message.

Now we can simplify our code as follows:

```python
from os import strerror

try:
    s = open("c:/users/user/Desktop/file.txt", "rt")
    # Actual processing goes here.
    s.close()
except OSError as exc:
    # Only OSError and its subclasses are guaranteed to carry an errno attribute.
    print("The file could not be opened:", strerror(exc.errno))
```
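
To get a feel for what `strerror()` returns, you can print the descriptions of a few of the constants listed earlier. This sketch assumes Python 3. The exact wording of each message comes from the operating system, so it may differ between platforms. `errno.errorcode` is the standard dictionary that maps error numbers back to their symbolic names.

```python
import errno
from os import strerror

codes = [errno.ENOENT, errno.EACCES, errno.EEXIST, errno.EISDIR]

# Map each symbolic name to the description supplied by the platform.
messages = {}
for code in codes:
    messages[errno.errorcode[code]] = strerror(code)

for name, text in messages.items():
    print(name, "->", text)
```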

Alright. Now it's time to deal with text files and get familiar with some basic techniques you can use to process them.

### Summary

1. A file needs to be opened before it can be processed by a program, and it should be closed when processing is finished.

   Opening the file associates it with a stream, an abstract representation of the physical data stored on the media. The way in which the stream is processed is called the open mode. There are three open modes:
   
   - **Read mode** – only read operations are allowed.
   - **Write mode** – only write operations are allowed.
   - **Update mode** – both write and read operations are allowed.

2. Depending on the physical file content, different Python classes can be used to process files. Generally, `BufferedIOBase` can process any file, while `TextIOBase` is specialized for processing text files (i.e., files containing human-readable text divided into lines using newline markers). Thus, streams can be divided into binary and text streams.

3. The following `open()` function syntax is used to open a file:

   ```python
   open(file_name, mode=open_mode, encoding=text_encoding)
   ```

   This invocation creates a stream object and associates it with the file named `file_name`, using the specified `open_mode` and setting the specified `text_encoding`. If there is an error, it raises an exception.

4. Three predefined streams are already open when the program starts:

   - `sys.stdin` – standard input
   - `sys.stdout` – standard output
   - `sys.stderr` – standard error output

5. The `OSError` exception object (`IOError` is an alias of it in Python 3), created when a file operation fails (including open operations), contains a property named `errno`, which holds the error code of the failed action. Use this value to diagnose the problem.