### Processing Text Files

In this lesson, we're going to create a simple text file with short, straightforward content.

We'll demonstrate some basic techniques for reading the file contents to process them.

The processing will be simple—you'll copy the file's contents to the console and count all the characters read by the program.

Remember, our definition of a text file is very strict. It's a plain text file, meaning it contains only text, without any formatting, different fonts, or other decorations.

Therefore, you should avoid creating the file using advanced text processors like MS Word or LibreOffice Writer. Instead, use basic tools your OS offers, such as Notepad, vim, or gedit.

If your text files contain some national characters not covered by the standard ASCII charset, you may need an additional step. Your `open()` function invocation may require an argument specifying the text encoding.

For example, if you're using a Unix/Linux OS configured to use UTF-8 as a system-wide setting, the `open()` function may look like this:

In [2]:
# Opening tzop.txt in read mode, returning it as a file object:
stream = open("tzop.txt", "rt", encoding="utf-8")

print(stream.read()) # printing the content of the file

1
2
3
4
5


Here, the `encoding` argument is set to a string representing the appropriate text encoding (UTF-8 in this case).

Consult your OS documentation to find an encoding name suitable for your environment.
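As a minimal sketch of why the `encoding` argument matters, the snippet below writes a short text containing non-ASCII characters and reads it back with the same encoding. The file name `encoding_demo.txt` and the sample text are made up for this illustration, and the file lives in a temporary directory so nothing on your disk is overwritten.

```python
import os
import tempfile

# Hypothetical sample text with characters outside the ASCII range.
sample = "Zażółć gęślą jaźń\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoding_demo.txt")  # assumed file name

    # Write the text using an explicitly named encoding.
    with open(path, "wt", encoding="utf-8") as stream:
        stream.write(sample)

    # Read it back with the same encoding; the text round-trips unchanged.
    with open(path, "rt", encoding="utf-8") as stream:
        restored = stream.read()

    # Some characters need more than one byte in UTF-8,
    # so the file is larger than the number of characters.
    size_in_bytes = os.path.getsize(path)

print("Round trip OK:", restored == sample)
print(len(sample), "characters,", size_in_bytes, "bytes")
```

Reading the same file with a different encoding may raise `UnicodeDecodeError` or silently produce the wrong characters, which is why the encoding used for reading should match the one used for writing.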

**Note**

For the purposes of our experiments with file processing in this section, we'll use a pre-uploaded set of files (e.g., tzop.txt or text.txt) which you'll be able to work with. If you'd like to work with your own files locally on your machine, we strongly encourage you to do so, and to use IDLE (or any other IDE you prefer) for your tests.

### Reading a Text File's Contents

Reading a text file's contents can be done using several different methods, none inherently better or worse than the others. The choice of method depends on your preference and the specific situation.

Some methods might be more convenient in certain scenarios and less so in others. Be flexible and willing to adjust your approach as needed.

One of the most basic is the `read()` method, which we demonstrated in the previous lesson.

When applied to a text file, the method can:

- Read a specified number of characters (including just one) from the file and return them as a string.
- Read all the file contents and return them as a string.
- Return an empty string if there is nothing more to read (the virtual reading head has reached the end of the file).
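The three behaviors listed above can be sketched in a few lines. This is only an illustration: the file name `read_demo.txt` is an assumption, and the file is created in a temporary directory first so the snippet runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "read_demo.txt")  # assumed file name
    with open(path, "wt") as s:
        s.write("Simple is better than complex.\n")

    s = open(path, "rt")
    first = s.read(1)      # exactly one character
    next_five = s.read(5)  # the next five characters
    rest = s.read()        # everything up to the end of the file
    at_end = s.read(1)     # nothing left to read: an empty string
    s.close()

print(repr(first), repr(next_five), repr(rest), repr(at_end))
```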

We'll start with the simplest variant using a file named `text.txt` with the following contents:

```
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
```

Now look at the code below and let's analyze it:

In [3]:
from os import strerror

try:
    cnt = 0
    s = open('text.txt', "rt")
    ch = s.read(1)
    while ch != '':
        print(ch, end='')
        cnt += 1
        ch = s.read(1)
    s.close()
    print("\n\nCharacters in file:", cnt)
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131


The process is straightforward:

1. Use the `try-except` mechanism to open the file with the predetermined name (`text.txt` in this case).
2. Try to read the first character from the file (`ch = s.read(1)`).
3. If successful (indicated by a positive result of the `while` condition check), output the character (note the `end=''` argument—it prevents skipping to a new line after every character).
4. Update the counter (`cnt`).
5. Attempt to read the next character, and repeat the process.

### Reading a File in One Go

If you're absolutely sure that the file's length is manageable and you can read the whole file into memory at once, you can do so. The `read()` method, when called without arguments, or with `None` or a negative number as its argument, reads the entire file.

Remember, attempting to read a very large file this way, such as one that is terabytes in size, may exhaust the available memory and make your program fail or the whole system unresponsive. Computer memory has its limits.

Consider the code below. What do you think?

In [4]:
from os import strerror

try:
    cnt = 0
    s = open('text.txt', "rt")
    content = s.read()
    for ch in content:
        print(ch, end='')
        cnt += 1
    s.close()
    print("\n\nCharacters in file:", cnt)
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131


Let's break it down:

1. Open the file as previously.
2. Read its entire contents with a single invocation of the `read()` function.
3. Process the text by iterating through it with a regular `for` loop, updating the counter value at each iteration.

The result will be exactly the same as before.
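When the file might be too large to read in one go, a middle ground between one character at a time and the whole file is to read fixed-size chunks. The sketch below assumes an arbitrary chunk size of 16 characters and a made-up file name; both are illustration values, not recommendations.

```python
import os
import tempfile

CHUNK_SIZE = 16  # characters per read() call; arbitrary value for this sketch

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk_demo.txt")  # assumed file name
    with open(path, "wt") as s:
        s.write("Beautiful is better than ugly.\n")
        s.write("Explicit is better than implicit.\n")

    cnt = 0     # characters read so far
    chunks = 0  # number of non-empty chunks
    s = open(path, "rt")
    chunk = s.read(CHUNK_SIZE)
    while chunk != '':          # an empty string means end of file
        cnt += len(chunk)
        chunks += 1
        chunk = s.read(CHUNK_SIZE)
    s.close()

print("Characters in file:", cnt)
print("Chunks read:", chunks)
```

Only one chunk is held in memory at a time, so memory use stays bounded regardless of the file's size.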

### Processing Text Files: `readline()`

If you want to handle the file's contents as a set of lines rather than a collection of characters, the `readline()` method is perfect for the task.

This method reads one complete line of text from the file and returns it as a string, including the trailing newline character if the line has one. When there is nothing left to read, it returns an empty string.

Using this method opens up new possibilities—you can now easily count lines as well as characters.

Here's an example of how to use it:

In [5]:
from os import strerror

try:
    ccnt = lcnt = 0
    s = open('text.txt', 'rt')
    line = s.readline()
    while line != '':
        lcnt += 1
        for ch in line:
            print(ch, end='')
            ccnt += 1
        line = s.readline()
    s.close()
    print("\n\nCharacters in file:", ccnt)
    print("Lines in file:     ", lcnt)
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131
Lines in file:      4


As you can see, the general concept is the same as in the previous examples:

1. Open the file.
2. Use `readline()` to read the file line by line.
3. Count characters and lines by iterating through each line and character.
4. Print the results after processing the file.
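One detail worth seeing in isolation is what `readline()` actually returns. The sketch below, which uses a made-up two-line file whose last line has no trailing newline, shows that the newline character is part of the returned string and that an empty string signals the end of the file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "readline_demo.txt")  # assumed file name
    with open(path, "wt") as s:
        s.write("first line\nsecond line")  # no newline after the last line

    s = open(path, "rt")
    line1 = s.readline()  # includes the trailing '\n'
    line2 = s.readline()  # last line, no '\n' to include
    line3 = s.readline()  # end of file: empty string
    s.close()

# Strip the newline if only the visible characters should be counted.
visible = len(line1.rstrip("\n"))

print(repr(line1), repr(line2), repr(line3), visible)
```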

### Processing Text Files: `readlines()`

Another method that treats a text file as a set of lines, rather than characters, is `readlines()`.

The `readlines()` method, when called without arguments, reads the entire file and returns a list of strings, with each element representing a line in the file.

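If you're unsure whether the file is small enough and don't want to risk testing the OS limits, you can pass `readlines()` a size hint. The method then stops reading further lines once the total size of the lines read so far exceeds the hint. A call always returns whole lines, so it may return somewhat more data than the hint suggests. The return value remains the same: a list of strings.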

Experiment with the following example code to understand how the `readlines()` method works:

In [6]:
s = open("text.txt")
print(s.readlines(20))
print(s.readlines(20))
print(s.readlines(20))
print(s.readlines(20))
s.close()

['Beautiful is better than ugly.\n']
['Explicit is better than implicit.\n']
['Simple is better than complex.\n']
['Complex is better than complicated.']


The size hint is passed to the method as its argument. Every line of `text.txt` is longer than 20 characters, so each call above returns exactly one line.

You might expect that `readlines()` can process a file's contents more efficiently than `readline()`, as it may need to be invoked fewer times.

**Note:** When there is nothing left to read from the file, the method returns an empty list. Use this to detect the end of the file.

Increasing the buffer size might improve input performance to a certain extent, but there is no universal rule for the optimal buffer size—find the best values through experimentation.

Look at the code below. We've modified it to show you how to use `readlines()`:

In [7]:
from os import strerror

try:
    ccnt = lcnt = 0
    s = open('text.txt', 'rt')
    lines = s.readlines(20)
    while len(lines) != 0:
        for line in lines:
            lcnt += 1
            for ch in line:
                print(ch, end='')
                ccnt += 1
        lines = s.readlines(20)
    s.close()
    print("\n\nCharacters in file:", ccnt)
    print("Lines in file:     ", lcnt)
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131
Lines in file:      4


We've chosen a size hint of 20, but this isn't a strict recommendation. We used this value to avoid the situation where the first `readlines()` invocation consumes the entire file. We want the method to demonstrate its capabilities by working harder.

Inside the `while` loop, which keeps calling `readlines()` until it returns an empty list, there are two nested `for` loops: the outer one iterates through the list of lines returned by `readlines()`, while the inner one prints each line character by character.
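For comparison, here is a small sketch of `readlines()` called without an argument, called again at the end of the file, and called with a very small size hint. The three-line file and its name are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "readlines_demo.txt")  # assumed file name
    with open(path, "wt") as s:
        s.write("one\ntwo\nthree\n")

    s = open(path, "rt")
    all_lines = s.readlines()  # no argument: every line in the file
    nothing = s.readlines()    # already at the end: an empty list
    s.close()

    s = open(path, "rt")
    partial = s.readlines(1)   # tiny hint: stops after the first whole line
    s.close()

print(all_lines)
print(nothing)
print(partial)
```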

### Processing Text Files: Iterable Objects

The last example we want to present showcases a very interesting feature of the object returned by the `open()` function in text mode.

It may surprise you—the object is an instance of an iterable class.

Strange? Not at all. Useful? Yes, absolutely.

The iteration protocol defined for the file object is straightforward—its `__next__` method returns the next line read from the file.

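Note, however, that reaching the end of the file does not close the stream. In the code below the file object is never bound to a name, so CPython closes it once the loop finishes and the object is garbage-collected, but other Python implementations may delay this. When you need a guaranteed `close()`, open the file in a `with` statement or call `close()` explicitly.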

Look at the code below to see how simple and clear it has become:

In [8]:
from os import strerror

try:
    ccnt = lcnt = 0
    for line in open('text.txt', 'rt'):
        lcnt += 1
        for ch in line:
            print(ch, end='')
            ccnt += 1
    print("\n\nCharacters in file:", ccnt)
    print("Lines in file:     ", lcnt)
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.

Characters in file: 131
Lines in file:      4


This example demonstrates how iterating over a file object simplifies the process of reading lines and counting characters.
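A minimal sketch of the same iteration inside a `with` statement, which closes the stream as soon as the block ends. For comparison, it also iterates over a stream opened without `with` and checks its `closed` attribute afterwards. The file name and contents are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iter_demo.txt")  # assumed file name
    with open(path, "wt") as s:
        s.write("Simple is better than complex.\n")
        s.write("Complex is better than complicated.\n")

    ccnt = lcnt = 0
    with open(path, "rt") as stream:
        for line in stream:      # the file object yields one line at a time
            lcnt += 1
            ccnt += len(line)
    closed_by_with = stream.closed        # the with block closed the stream

    plain = open(path, "rt")
    for line in plain:
        pass
    closed_after_loop = plain.closed      # reaching the end does not close it
    plain.close()

print("Characters in file:", ccnt)
print("Lines in file:     ", lcnt)
print(closed_by_with, closed_after_loop)
```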

### Writing to Text Files: `write()`

Writing to text files is straightforward, as there is primarily one method used for this task: `write()`. 

The `write()` method expects a single argument—a string that will be written to an open file. Remember, the open mode should match the intended operation—writing to a file opened in read mode will fail.

No newline character is automatically added to the `write()` method's argument, so you need to include it yourself if you want the file to contain multiple lines.

The following example is a short piece of code that creates a file named `newtext.txt` (note: the `w` mode ensures that the file is created from scratch, even if it already exists and contains data) and writes ten lines to it.

The string to be written consists of the word "line" followed by the line number. We've chosen to write the string's contents character by character (using the inner `for` loop), but you don't have to do it this way.

We wanted to show that `write()` can operate on single characters.

In [9]:
from os import strerror

try:
    fo = open('newtext.txt', 'wt') # A new file (newtext.txt) is created.
    for i in range(10):
        s = "line #" + str(i+1) + "\n"
        for ch in s:
            fo.write(ch)
    fo.close()
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))

The code creates a file filled with the following text:

```
line #1
line #2
line #3
line #4
line #5
line #6
line #7
line #8
line #9
line #10
```

Can you print the file's contents to the console?

We encourage you to test the behavior of the `write()` method locally on your machine.
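One possible answer to the question above is sketched here: write the ten lines, then reopen the file in read mode and print what it contains. To keep the snippet self-contained, it creates `newtext.txt` in a temporary directory rather than in the current one; that location is an assumption of this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "newtext.txt")

    # Write ten lines, adding the newline character explicitly.
    fo = open(path, "wt")
    for i in range(10):
        fo.write("line #" + str(i + 1) + "\n")
    fo.close()

    # Reopen the file for reading and fetch its contents.
    fi = open(path, "rt")
    content = fi.read()
    fi.close()

print(content, end='')
```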

### Dealing with Text Files: Continued

Look at the example below. We've modified the previous code to write whole lines to the text file.

In [10]:
from os import strerror

try:
    fo = open('newtext.txt', 'wt')
    for i in range(10):
        fo.write("line #" + str(i+1) + "\n")
    fo.close()
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))

The contents of the newly created file are the same.

**Note:** You can use the same method to write to the `stderr` stream, but you don't need to open it, as it's always open implicitly.

For example, if you want to send a message string to `stderr` to distinguish it from normal program output, it may look like this:

```python
import sys
sys.stderr.write("Error message\n")
```

This method ensures that error messages are sent to `stderr` and separated from the standard program output.
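An alternative to calling `write()` directly is the `file` argument of `print()`, which adds the newline for you. The helper below is a hypothetical convenience function, not part of the lesson's code. It accepts any text stream, so it can be pointed at `sys.stderr` or, for checking the result, at an in-memory `io.StringIO` buffer.

```python
import io
import sys


def report_error(message, stream=None):
    """Print an error message to the given text stream (stderr by default)."""
    if stream is None:
        stream = sys.stderr
    print("Error:", message, file=stream)


report_error("disk is full")  # goes to stderr, not to normal output

# The same call directed at an in-memory stream, to inspect what was written.
captured = io.StringIO()
report_error("disk is full", stream=captured)
captured_text = captured.getvalue()
```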

### What is a bytearray?

Before discussing binary files, we need to introduce you to one of Python's specialized classes used for storing amorphous data.

Amorphous data refers to data that lacks a specific shape or form; it is simply a series of bytes.

This doesn't mean these bytes lack meaning or can't represent useful objects, like bitmap graphics. The key point is that when dealing with this data, we either cannot or do not want to know its specific nature.

Amorphous data cannot be stored using the methods we've discussed previously; they are neither strings nor lists. Therefore, a special container is needed to handle such data.

Python provides more than one such container, and one of them is a specialized class named `bytearray`. As the name suggests, it's an array containing (amorphous) bytes.

To create such a container, for example, to read a bitmap image and process it, you need to explicitly create it using one of the available constructors.

Consider this example:


In [11]:
data = bytearray(10)


This invocation creates a `bytearray` object capable of storing ten bytes.

Note: This constructor initializes the entire array with zeros.
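The size-based constructor is only one of several. As a brief sketch, the lines below build byte arrays from a size, from a list of integers, from a `bytes` literal, and from a string plus an encoding name; the variable names are chosen for this illustration.

```python
zeros = bytearray(10)                  # ten bytes, all set to zero
from_list = bytearray([65, 66, 67])    # from an iterable of integers (0..255)
from_bytes = bytearray(b"ABC")         # from an existing bytes object
from_text = bytearray("ABC", "ascii")  # from a string and an encoding name

print(zeros)
print(from_list, from_bytes, from_text)
```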

### Bytearrays in Python

Bytearrays resemble lists in many respects. For example, they are mutable, can be used with the `len()` function, and you can access any of their elements using conventional indexing.

However, there are important limitations:
- You cannot set any byte array elements with a value that is not an integer (doing so will cause a `TypeError` exception).
- You cannot assign a value outside the range of 0 to 255 inclusive (doing so will provoke a `ValueError` exception).
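Both limitations can be observed directly. The sketch below tries one invalid assignment of each kind and records which exception was raised; the `errors` list exists only for this demonstration.

```python
data = bytearray(3)
errors = []

try:
    data[0] = "a"        # not an integer
except TypeError:
    errors.append("TypeError")

try:
    data[1] = 256        # an integer, but outside 0..255
except ValueError:
    errors.append("ValueError")

data[2] = 255            # the largest value a byte can hold

print(errors)
print(list(data))
```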

You can treat any byte array elements as integer values, as shown in the example below:

In [12]:
data = bytearray(10)

for i in range(len(data)):
    data[i] = 10 - i

for b in data:
    print(hex(b))

0xa
0x9
0x8
0x7
0x6
0x5
0x4
0x3
0x2
0x1


Note: We've used two methods to iterate over the byte arrays and employed the `hex()` function to print the elements as hexadecimal values.

Now, we'll show you how to write a byte array to a binary file. We want to create a one-to-one copy of the physical memory content, byte by byte, rather than saving a readable representation.

### Writing a Byte Array to a Binary File

Let's look at how to write a byte array to a binary file with the following code:

In [14]:
from os import strerror

try:
    # Initialize bytearray with values starting from 10
    data = bytearray(10)
    for i in range(len(data)):
        data[i] = 10 - i

    # Create the binary file and write the byte array to it
    with open('binaryfile.bin', 'wb') as bf:
        bf.write(data)
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

### Analysis:
- **Initialization:** We fill a `bytearray` with descending values starting from 10 (10, 9, ..., 1). If you want the file's contents to be more readable, replace `10 - i` with something like `ord('a') + i`. This will produce bytes whose values correspond to lowercase letters of the ASCII code. However, the file remains binary since it's created with the `wb` flag.
- **File Creation:** We create the file using the `open()` function with the `wb` (write binary) mode.
- **Writing Data:** The `write()` method takes the `bytearray` and writes it to the file as a whole.
- **Closing the Stream:** The `with` statement closes the stream automatically when its block ends, so no explicit `close()` call is needed.
- **Return Value:** The `write()` method returns the number of successfully written bytes. If this differs from the length of the `bytearray`, it indicates a write error.

Run the code and analyze the contents of the newly created file. You'll use this file in the next step.
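The return value mentioned in the analysis can be checked explicitly. The following sketch stores the result of `write()` and compares it with the length of the byte array and with the file's size on disk; the temporary directory is an assumption made to keep the example self-contained.

```python
import os
import tempfile

data = bytearray(range(10, 0, -1))  # the values 10, 9, ..., 1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "binaryfile.bin")
    with open(path, "wb") as bf:
        written = bf.write(data)    # number of bytes written
    size_on_disk = os.path.getsize(path)

if written != len(data):
    print("Write error: only", written, "byte(s) written")
else:
    print(written, "byte(s) written")
```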

### Reading Bytes from a Stream

To read from a binary file, use the `readinto()` method. This method fills a pre-existing byte array with values from the binary file, rather than creating a new byte array.

**Note:**
- The method returns the number of successfully read bytes.
- It tries to fill the entire available space in its argument. If there is more data in the file than space in the argument, the read operation stops before the file's end. If the file has less data than the space, the result indicates that the byte array is only partially filled. The unused part of the array remains unchanged.

Here's the complete code:

In [15]:
from os import strerror

data = bytearray(10)

try:
    # Open the file in read binary mode
    bf = open('binaryfile.bin', 'rb')
    bf.readinto(data)
    bf.close()

    # Print the contents of the byte array
    for b in data:
        print(hex(b), end=' ')
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

0xa 0x9 0x8 0x7 0x6 0x5 0x4 0x3 0x2 0x1 

### Analysis:
- **Opening the File:** We open the file created in the previous step using `rb` mode.
- **Reading Data:** We read the file's contents into the `bytearray` named `data`, which is 10 bytes in size.
- **Printing Data:** Finally, we print the contents of the `bytearray`. Check if the contents match your expectations.

Run the code to verify if it works as intended.
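The partial-fill behavior described in the note above can be checked with a buffer that is larger than the file. In this sketch the four-byte file and the six-byte buffer pre-filled with `0xff` are illustration values.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "short.bin")  # assumed file name
    with open(path, "wb") as bf:
        bf.write(bytes([1, 2, 3, 4]))      # the file holds only four bytes

    data = bytearray(b"\xff" * 6)          # six bytes, all set to 0xff
    with open(path, "rb") as bf:
        count = bf.readinto(data)          # number of bytes actually read

print(count)
for b in data:
    print(hex(b), end=' ')
print()
```

Only the first `count` elements hold data from the file; the remaining ones keep their previous values, so use the returned count to decide how much of the buffer to process.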

### How to Read Bytes from a Stream

Another way to read the contents of a binary file is by using the `read()` method.

When invoked without arguments, it attempts to read the entire contents of the file into memory, storing them in a newly created object of the `bytes` class.

This class is similar to `bytearray`, with one significant difference: it is immutable.

Fortunately, you can easily create a `bytearray` from a `bytes` object by using its initial value directly, as shown below:

In [16]:
from os import strerror

try:
    bf = open('file.bin', 'rb')
    data = bytearray(bf.read())
    bf.close()

    for b in data:
        print(hex(b), end=' ')

except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

I/O error occurred: No such file or directory


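The error message above appears because `file.bin` did not exist when this cell was run; a cell further below creates it, and it has to be run first.

Be careful: don't use this method if you're not sure the file's contents will fit into the available memory.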

### How to Read Bytes from a Stream: Continued

If the `read()` method is invoked with an argument, it specifies the maximum number of bytes to be read.

The method attempts to read the specified number of bytes from the file, and the length of the returned object indicates the number of bytes actually read.

You can use the method as shown here:

In [17]:
from os import strerror

try:
    bf = open('file.bin', 'rb')
    data = bytearray(bf.read(5))
    bf.close()

    for b in data:
        print(hex(b), end=' ')

except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

I/O error occurred: No such file or directory


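**Note:** Once `file.bin` exists (the code below creates it with ten bytes), this code reads the first five bytes of the file; the next five are still waiting to be processed. The error message above was produced because the file was missing when the cell was run.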
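To see the reading position advance, the sketch below calls `read(5)` repeatedly on a ten-byte file. It creates its own file in a temporary directory, so it does not depend on `file.bin` being present; the byte values are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ten_bytes.bin")  # assumed file name
    with open(path, "wb") as bf:
        bf.write(bytearray(range(10, 20)))     # ten bytes: 10, 11, ..., 19

    bf = open(path, "rb")
    first = bytearray(bf.read(5))   # the first five bytes
    second = bytearray(bf.read(5))  # the next five bytes
    leftover = bf.read(5)           # nothing left: an empty bytes object
    bf.close()

print(list(first), list(second), leftover)
```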

In [18]:
from os import strerror

data = bytearray(10)

for i in range(len(data)):
    data[i] = 10 + i

try:
    bf = open('file.bin', 'wb')
    bf.write(data)
    bf.close()
except IOError as e:
    print("I/O error occurred:", strerror(e.errno))

# Your code that reads bytes from the stream should go here.


### Copying Files - A Simple and Functional Tool

Now you'll combine all this new knowledge, add some new elements, and use it to write real code that can actually copy a file's contents.

The goal isn't to replace commands like `copy` (MS Windows) or `cp` (Unix/Linux), but to see one way to create a working tool, even if it's not widely used.

Look at the code below and let's analyze it:

In [19]:
from os import strerror  # 1

srcname = input("Enter the source file name: ")  # 3
try:  # 4
    src = open(srcname, 'rb')  # 5
except IOError as e:  # 6
    print("Cannot open the source file: ", strerror(e.errno))  # 7
    exit(e.errno)  # 8

dstname = input("Enter the destination file name: ")  # 10
try:  # 11
    dst = open(dstname, 'wb')  # 12
except IOError as e:  # 13
    print("Cannot create the destination file: ", strerror(e.errno))  # 14
    src.close()  # 15
    exit(e.errno)  # 16

buffer = bytearray(65536)  # 18
total = 0  # 19
try:  # 20
    readin = src.readinto(buffer)  # 21
    while readin > 0:  # 22
        written = dst.write(buffer[:readin])  # 23
        total += written  # 24
        readin = src.readinto(buffer)  # 25
except IOError as e:  # 26
    print("Error while copying the file: ", strerror(e.errno))  # 27
    exit(e.errno)  # 28

print(total, 'byte(s) successfully written')  # 30
src.close()  # 31
dst.close()  # 32


Enter the source file name:  text.txt
Enter the destination file name:  aaa.txt


134 byte(s) successfully written


### Code Analysis:

- **Lines 3-8:** Ask the user for the name of the file to copy and try to open it for reading. If the open fails, terminate the program execution using `exit()`, passing the completion code to the OS. Any code other than 0 indicates an issue. Use the `errno` value to specify the nature of the problem.

- **Lines 10-16:** Repeat a similar process for the output file.

- **Line 18:** Allocate memory for transferring data from the source file to the target file, referred to as a buffer. Here, the buffer size is 64 kilobytes. A larger buffer typically means faster copying due to fewer I/O operations, but there's a limit where further increases yield no improvements.

- **Line 19:** Initialize a counter to track the number of bytes copied.

- **Line 21:** Fill the buffer for the first time.

- **Line 22:** As long as you read a non-zero number of bytes, continue the loop.

- **Line 23:** Write the buffer's contents to the output file. A slice limits the write to the bytes just read, because `write()` would otherwise write the entire buffer, including stale bytes left over from an earlier read.

- **Line 24:** Update the counter with the number of bytes written.

- **Line 25:** Read the next chunk of the file.

- **Lines 30-32:** Perform final cleanup by closing both the source and destination files after the copying is done.

This code demonstrates a practical way to copy files using Python, combining knowledge of file handling, exception handling, and memory management.
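The same copying loop can also be wrapped in a function that uses `with` to close both files. This is a sketch of one possible variant, not a replacement for the code above: the function name `copy_file`, its `bufsize` parameter, and the test file names are assumptions made for this example.

```python
import os
import tempfile


def copy_file(srcname, dstname, bufsize=65536):
    """Copy srcname to dstname in chunks; return the number of bytes written."""
    buffer = bytearray(bufsize)
    total = 0
    # Both streams are closed when the with block ends, even after an error.
    with open(srcname, 'rb') as src, open(dstname, 'wb') as dst:
        readin = src.readinto(buffer)
        while readin > 0:
            total += dst.write(buffer[:readin])  # write only the bytes just read
            readin = src.readinto(buffer)
    return total


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "source.bin")
    dst_path = os.path.join(tmp, "copy.bin")
    payload = bytes(range(256)) * 40  # 10240 bytes of sample data
    with open(src_path, 'wb') as f:
        f.write(payload)

    # A small buffer forces several passes through the loop.
    copied = copy_file(src_path, dst_path, bufsize=4096)

    with open(dst_path, 'rb') as f:
        identical = f.read() == payload

print(copied, "byte(s) copied, identical:", identical)
```

Errors are left to propagate to the caller here, which can catch `IOError` and report it in the same way as the script above.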

### Summary

1. **Reading a File's Contents:**
   - `read(number)`: Reads up to the specified number of characters (text mode) or bytes (binary mode) from the file and returns them as a string or a `bytes` object, respectively. It reads the entire file at once if no number is specified.
   - `readline()`: Reads a single line from the text file.
   - `readlines(number)`: Reads lines from the text file and returns them as a list, stopping once the total size of the lines read exceeds the specified number. It reads all lines at once if no number is specified.
   - `readinto(bytearray)`: Reads bytes from the file and fills the `bytearray` with them.

2. **Writing New Content to a File:**
   - `write(string)`: Writes a string to a text file.
   - `write(bytearray)`: Writes all the bytes from the `bytearray` to a file.

3. **Using the `open()` Function:**
   - The `open()` function returns an iterable object that can be used to iterate through all the lines of a file inside a `for` loop. For example:


In [None]:
for line in open("file", "rt"):
    print(line, end='')

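This code copies the file's contents to the console, line by line. Note: reaching the end of the file does not close the stream. Here it is closed only when the unreferenced file object is garbage-collected, so prefer a `with` statement or an explicit `close()` when that matters.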