# Working with Files

## Basics of Reading and Writing Text Files

Before we can go into how to work with files in Python, it’s important to understand what exactly a file is and how modern operating systems handle some of their aspects.

At its core, a file is a contiguous set of bytes used to store data. This data is organized in a specific format and can be anything as simple as a text file or as complicated as a program executable. In the end, these bytes are stored and processed by the computer as binary 1s and 0s.

Conceptually, a file can be thought of as having three main parts:
- Header: metadata about the contents of the file (file name, size, type, and so on)
- Data: contents of the file as written by the creator or editor
- End of file (EOF): the point that marks the end of the data; on most modern systems this is a condition reported by the operating system once the data is exhausted, not a special character stored in the file

What this data represents depends on the format specification used, which is typically represented by an extension. For example, a file that has an extension of .gif most likely conforms to the Graphics Interchange Format specification. There are hundreds, if not thousands, of file extensions out there. For this tutorial, you’ll only deal with .txt or .csv file extensions.

#### Line Endings

One problem often encountered when working with file data is the representation of a new line or line ending. The line ending has its roots in the Morse code era, when a specific pro-sign was used to communicate the end of a transmission or the end of a line.

Later, this was standardized for teleprinters by both the International Organization for Standardization (ISO) and the American Standards Association (ASA). The ASA standard states that line endings should use the sequence of the Carriage Return (CR or \r) and the Line Feed (LF or \n) characters (CR+LF or \r\n). The ISO standard, however, allowed for either the CR+LF characters or just the LF character.

Windows uses the CR+LF characters to indicate a new line, while Unix and the newer Mac versions use just the LF character. This can cause some complications when you’re processing files on an operating system that is different than the file’s source. Here’s a quick example. Let’s say that we examine the file dog_breeds.txt that was created on a Windows system:

    Pug\r\n
    Jack Russell Terrier\r\n
    English Springer Spaniel\r\n
    German Shepherd\r\n
    Staffordshire Bull Terrier\r\n
    Cavalier King Charles Spaniel\r\n
    Golden Retriever\r\n
    West Highland White Terrier\r\n
    Boxer\r\n
    Border Terrier\r\n

This same output will be interpreted on a Unix device differently:

    Pug\r
    \n
    Jack Russell Terrier\r
    \n
    English Springer Spaniel\r
    \n
    German Shepherd\r
    \n
    Staffordshire Bull Terrier\r
    \n
    Cavalier King Charles Spaniel\r
    \n
    Golden Retriever\r
    \n
    West Highland White Terrier\r
    \n
    Boxer\r
    \n
    Border Terrier\r
    \n

This can make iterating over each line problematic, and you may need to account for situations like this.
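One way to account for mixed line endings once the text is already in memory is sketched below. The `raw_text` sample is a hypothetical stand-in for the contents of dog_breeds.txt, not the real file.

```python
# Hypothetical sample: a few breeds with Windows-style (CR+LF) line endings
raw_text = 'Pug\r\nJack Russell Terrier\r\nBoxer\r\n'

# str.splitlines() treats \n, \r and \r\n (among others) as line boundaries
# and drops them from the result
breeds = raw_text.splitlines()
print(breeds)

# Alternatively, split on \n only and strip any leftover CR/LF from each line
stripped = [line.rstrip('\r\n') for line in raw_text.split('\n') if line]
print(stripped)
```

Both approaches give the same list here; `splitlines()` is usually the shorter option when the line endings are not known in advance.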

#### Character Encodings

Another common problem that you may face is the encoding of the byte data. An encoding is a mapping between byte data and human-readable characters, typically done by assigning a numerical value to each character. The two most common character sets are ASCII and Unicode. ASCII defines only 128 characters, while Unicode has room for up to 1,114,112 code points.

ASCII is a subset of Unicode: the first 128 Unicode code points match ASCII, and UTF-8 encodes them as the same single bytes, so ASCII text is also valid UTF-8. It’s important to note that parsing a file with the incorrect character encoding can lead to failures or misrepresentation of the characters. For example, if a file was created using the UTF-8 encoding and you try to parse it using the ASCII encoding, any character outside of those 128 values will cause an error to be raised.
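The relationship between ASCII and UTF-8 can be checked directly on strings and bytes, without touching a file. The sample strings below are made up for illustration.

```python
# Plain ASCII text produces identical bytes in both encodings
text = 'Pug'
ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
print(ascii_bytes == utf8_bytes)

# A character outside the ASCII range needs more than one byte in UTF-8
accented = 'caf\u00e9'  # 'café'
accented_bytes = accented.encode('utf-8')
print(len(accented), len(accented_bytes))

# Decoding those bytes as ASCII fails on the first byte above 127
error_message = None
try:
    accented_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)
```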

### Opening and Closing a File in Python

When you want to work with a file, the first thing to do is to open it. This is done by invoking the open() built-in function. open() has a single required argument, the path to the file, and returns a single value, the file object:

In [1]:
file = open('data/test.txt')

After you open a file, the next thing to learn is how to close it.

> Warning: You should always make sure that an open file is properly closed.

In [2]:
file.close()

It’s important to remember that it’s your responsibility to close the file. In most cases, upon termination of an application or script, a file will be closed eventually. However, there is no guarantee when exactly that will happen. This can lead to unwanted behavior including resource leaks. It’s also a best practice within Python (Pythonic) to make sure that your code behaves in a way that is well defined and reduces any unwanted behavior.

When you’re manipulating a file, there are two ways that you can use to ensure that a file is closed properly, even when encountering an error. The first way to close a file is to use the try-finally block:

In [3]:
reader = open('data/test.txt')
try:
    # Further file processing goes here
    pass
finally:
    reader.close()

The second way to close a file is to use the with statement:

In [4]:
with open('data/test.txt') as reader:
    # Further file processing goes here
    pass

The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error. I highly recommend that you use the with statement as much as possible, as it allows for cleaner code and makes handling any unexpected errors easier for you.

Most likely, you’ll also want to use the second positional argument, mode. This argument is a string of one or more characters that specifies how you want to open the file. The default and most common is 'r', which represents opening the file in read-only mode as a text file:

In [5]:
with open('data/test.txt', 'r') as reader:
    # Further file processing goes here
    pass

Other options for modes are fully documented online, but the most commonly used ones are the following:

<table class="table table-hover">
<thead>
<tr>
<th>Character</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>'r'</code></td>
<td>Open for reading (default)</td>
</tr>
<tr>
<td><code>'w'</code></td>
<td>Open for writing, truncating (overwriting) the file first</td>
</tr>
<tr>
<td><code>'rb'</code> or <code>'wb'</code></td>
<td>Open in binary mode (read/write using byte data)</td>
</tr>
</tbody>
</table>

[open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)](https://docs.python.org/3/library/functions.html#open)

> mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. Other common values are 'w' for writing (truncating the file if it already exists), 'x' for exclusive creation and 'a' for appending (which on some Unix systems, means that all writes append to the end of the file regardless of the current seek position). In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The available modes are:

<table class="docutils align-default" id="index-5">
<colgroup>
<col style="width: 13%">
<col style="width: 88%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Character</p></th>
<th class="head"><p>Meaning</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">'r'</span></code></p></td>
<td><p>open for reading (default)</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">'w'</span></code></p></td>
<td><p>open for writing, truncating the file first</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">'x'</span></code></p></td>
<td><p>open for exclusive creation, failing if the file already exists</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">'a'</span></code></p></td>
<td><p>open for writing, appending to the end of the file if it exists</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">'b'</span></code></p></td>
<td><p>binary mode</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">'t'</span></code></p></td>
<td><p>text mode (default)</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">'+'</span></code></p></td>
<td><p>open for updating (reading and writing)</p></td>
</tr>
</tbody>
</table>
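As a rough illustration of how 'w', 'a', and 'x' differ, the sketch below exercises each mode on a scratch file. The file name `modes_demo.txt` is hypothetical, and a temporary directory is used so that no real data is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    with open(path, 'w') as f:   # creates the file, or truncates it if present
        f.write('first\n')
    with open(path, 'a') as f:   # keeps existing content and appends
        f.write('second\n')
    with open(path, 'r') as f:
        after_append = f.read()

    # 'x' refuses to open a file that already exists
    try:
        open(path, 'x')
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    with open(path, 'w') as f:   # opening with 'w' again discards old content
        f.write('replaced\n')
    with open(path, 'r') as f:
        after_truncate = f.read()

print(repr(after_append))
print(exclusive_failed)
print(repr(after_truncate))
```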

Let’s go back and talk a little about file objects. A file object is:

> “an object exposing a file-oriented API (with methods such as read() or write()) to an underlying resource.” (Python glossary, “file object”)

There are three different categories of file objects:
- Text files
- Buffered binary files
- Raw binary files

Each of these file types is defined in the io module. Here’s a quick rundown of how everything lines up.


#### Text File Types

A text file is the most common file that you’ll encounter. Here are some examples of how these files are opened:

    open('abc.txt')

    open('abc.txt', 'r')

    open('abc.txt', 'w')

With these types of files, open() will return a TextIOWrapper file object:

In [6]:
with open('data/test.txt') as file:
    print(type(file))

<class '_io.TextIOWrapper'>


This is the default file object returned by open().

> Text I/O expects and produces str objects. This means that whenever the backing store is natively made of bytes (such as in the case of a file), encoding and decoding of data is made transparently as well as optional translation of platform-specific newline characters.
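For comparison with the other two categories listed earlier, the following sketch opens the same scratch file in text mode, in buffered binary mode, and in unbuffered binary mode, and records the class of the object that open() returns in each case. The file name is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'types_demo.bin')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(b'abc')

    with open(path, 'r') as text_file:                 # text mode
        text_type = type(text_file).__name__
    with open(path, 'rb') as buffered_file:            # buffered binary mode
        buffered_type = type(buffered_file).__name__
    with open(path, 'rb', buffering=0) as raw_file:    # raw (unbuffered) binary
        raw_type = type(raw_file).__name__

print(text_type, buffered_type, raw_type)
```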

#### More about encodings

Sometimes you need to read or write text data in a specific text encoding, such as ASCII, UTF-8, or UTF-16.

By default, files are read and written using a platform-dependent text encoding, the one returned by locale.getpreferredencoding(False). On many machines this is utf-8, but that is not guaranteed. If you know that the text you are reading or writing is in a different encoding, supply the optional encoding parameter to open(). Note that sys.getdefaultencoding() reports something different: the encoding Python uses for implicit conversions between str and bytes, which is always utf-8 in Python 3:

In [7]:
import sys
sys.getdefaultencoding()

'utf-8'

Python understands several hundred possible text encodings. However, some of the
more common encodings are ascii, latin-1, utf-8, and utf-16. UTF-8 is usually a
safe bet if working with web applications. ascii corresponds to the 7-bit characters in
the range U+0000 to U+007F. latin-1 is a direct mapping of bytes 0-255 to Unicode
characters U+0000 to U+00FF. latin-1 encoding is notable in that it will never produce
a decoding error when reading text of a possibly unknown encoding. Reading a file as
latin-1 might not produce a completely correct text decoding, but it still might be
enough to extract useful data out of it. Also, if you later write the data back out, the
original input data will be preserved.

> encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
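A small sketch of passing the encoding argument explicitly, and of the latin-1 behavior described above, might look like this. The file name and the sample text are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'encoding_demo.txt')  # hypothetical file name
    original = 'Gr\u00fc\u00dfe'  # 'Grüße'

    with open(path, 'w', encoding='utf-8') as f:
        f.write(original)

    # Reading with the matching encoding returns the original text
    with open(path, 'r', encoding='utf-8') as f:
        as_utf8 = f.read()

    # Reading as latin-1 never fails, but the text is decoded incorrectly
    with open(path, 'r', encoding='latin-1') as f:
        as_latin1 = f.read()

# The bytes are still intact: re-encoding as latin-1 recovers them
recovered = as_latin1.encode('latin-1').decode('utf-8')

print(as_utf8 == original)
print(len(original), len(as_latin1))
print(recovered == original)
```

The wrong decoding yields more characters than the original because each two-byte UTF-8 sequence is read as two separate latin-1 characters.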

#### More open() function arguments

Another minor complication concerns the recognition of newlines, which are different
on Unix and Windows (i.e., \n versus \r\n). By default, Python operates in what’s known
as “universal newline” mode. In this mode, all common newline conventions are recognized,
and newline characters are converted to a single \n character while reading.
Similarly, the newline character \n is converted to the system default newline character on output. If you don’t want this translation, supply the newline='' argument to
open(), like this:

In [2]:
# Read with disabled newline translation
with open('data/test.txt', 'rt', newline='') as f:
    pass

> newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:
- When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
- When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
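The effect of the newline argument can be observed by writing known bytes to a scratch file and reading them back in different ways. This is a minimal sketch; the file name is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'newline_demo.txt')  # hypothetical file name

    # Write Windows-style line endings as raw bytes so nothing is translated
    with open(path, 'wb') as f:
        f.write(b'Pug\r\nBoxer\r\n')

    # Default (newline=None): \r\n is translated to \n while reading
    with open(path, 'r') as f:
        translated = f.read()

    # newline='': line endings are returned exactly as stored
    with open(path, 'r', newline='') as f:
        untranslated = f.read()

    # On output, newline='\r\n' turns every \n written into \r\n
    with open(path, 'w', newline='\r\n') as f:
        f.write('Pug\n')
    with open(path, 'rb') as f:
        written_bytes = f.read()

print(repr(translated))
print(repr(untranslated))
print(written_bytes)
```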

A final issue concerns possible encoding errors in text files. When reading or writing a
text file, you might encounter an encoding or decoding error. For instance:

In [14]:
with open('data/ascii_read.txt', 'wt', encoding='utf-16') as f:
    print(f.write('Test sfsfr refef'))

16


In [15]:
with open('data/ascii_read.txt', 'rt', encoding='ascii') as f:
    print(f.read())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

If you get this error, it usually means that you’re not reading the file in the correct
encoding. You should carefully read the specification of whatever it is that you’re reading
and check that you’re doing it right (e.g., reading data as UTF-8 instead of Latin-1 or
whatever it needs to be). If encoding errors are still a possibility, you can supply an
optional errors argument to open() to deal with the errors. Here are a few samples of
common error handling schemes:

In [16]:
# Replace bad chars with Unicode U+fffd replacement char
with open('data/ascii_read.txt', 'rt', encoding='ascii', errors='replace') as f:
    print(f.read())

��T e s t   s f s f r   r e f e f 


In [17]:
# Ignore bad chars entirely
with open('data/ascii_read.txt', 'rt', encoding='ascii', errors='ignore') as f:
    print(f.read())

T e s t   s f s f r   r e f e f 


> errors is an optional string that specifies how encoding and decoding errors are to be handled—this cannot be used in binary mode. A variety of standard error handlers are available (listed under Error Handlers), though any error handling name that has been registered with codecs.register_error() is also valid. The standard names include:
- 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
- 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
- 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.

If you’re constantly fiddling with the encoding and errors arguments to open() and
doing lots of hacks, you’re probably making life more difficult than it needs to be. The
number one rule with text is that you simply need to make sure you’re always using the
proper text encoding. When in doubt, use the default setting (typically UTF-8).

**Buffering is an optional integer** used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:
- Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
- “Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.

There are typically two levels of buffering involved:
- Internal buffers
- Operating system buffers

The internal buffers are created by the runtime, library, or language that you're programming against and are meant to speed things up by avoiding a system call for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.

However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.

If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.

To help with that, there are two tools: the file object's flush() method and the os.fsync() function.

The first, `flush`, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.

Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been "permanently" stored on disk.

To do that, you need to call the `os.fsync` function, which asks the operating system to synchronize its buffers for that file with the storage device. In other words, it copies data from the operating system buffers to the disk.

Typically you don't need to bother with either call, but if you're in a scenario where paranoia about what actually ends up on disk is a good thing, you should call `flush` first and then `os.fsync`.

- `os.fsync(fd)`: Force write of file with filedescriptor fd to disk. On Unix, this calls the native fsync() function; on Windows, the MS _commit() function. If you’re starting with a buffered Python file object f, first do f.flush(), and then do os.fsync(f.fileno()), to ensure that all internal buffers associated with f are written to disk.
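The two-step sequence described above can be sketched as follows. The file name `durable.log` and its content are assumptions for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'durable.log')  # hypothetical file name

    with open(path, 'w') as f:
        f.write('important record\n')
        # Step 1: move data from Python's internal buffer to the OS
        f.flush()
        # Step 2: ask the OS to push its buffers for this file to the device
        os.fsync(f.fileno())
        # The data is now visible through the file system, even before close()
        size_after_sync = os.path.getsize(path)

print(size_after_sync > 0)
```

Both calls add latency, so this pattern is usually reserved for data that must survive a crash, such as journals or checkpoints.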

### Reading and Writing Opened Files

Once you’ve opened up a file, you’ll want to read or write to the file. First off, let’s cover reading a file. There are multiple methods that can be called on a file object to help you out:

<div class="table-responsive">
<table class="table table-hover">
<thead>
<tr>
<th>Method</th>
<th>What It Does</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://docs.python.org/3.7/library/io.html#io.RawIOBase.read"><code>.read(size=-1)</code></a></td>
<td>This reads at most <code>size</code> characters (in text mode) or bytes (in binary mode) from the file. If no argument is passed or <code>None</code> or <code>-1</code> is passed, then the rest of the file is read.</td>
</tr>
<tr>
<td><a href="https://docs.python.org/3.7/library/io.html#io.IOBase.readline"><code>.readline(size=-1)</code></a></td>
<td>This reads at most <code>size</code> characters from the current line. If the line is longer than <code>size</code>, the next call continues from where the previous one stopped, and once the end of the line is reached, reading moves on to the next line. If no argument is passed or <code>None</code> or <code>-1</code> is passed, then the entire line (or rest of the line) is read.</td>
</tr>
<tr>
<td><a href="https://docs.python.org/3.7/library/io.html#io.IOBase.readlines"><code>.readlines()</code></a></td>
<td>This reads the remaining lines from the file object and returns them as a list.</td>
</tr>
</tbody>
</table>
</div>

Using the data/test.txt log file opened in the earlier examples, let’s go through some examples of how to use these methods. Here’s an example of how to open and read the entire file using .read():

In [18]:
with open('data/test.txt', 'r') as reader:
    # Read & print the entire file
    print(reader.read())

03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp
03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp
03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized
03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route
03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward
03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket
03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection
03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route
03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module
03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for tra

Here’s an example of how to read 5 characters of a line each time using the Python .readline() method:

In [19]:
with open('data/test.txt', 'r') as reader:
    # Read & print the first 5 characters of the line 5 times
    print(reader.readline(5))
    # Notice that line is greater than the 5 chars and continues
    # down the line, reading 5 chars each time until the end of the
    # line and then "wraps" around
    print(reader.readline(5))
    print(reader.readline(5))
    print(reader.readline(5))
    print(reader.readline(5))

03/22
 08:5
1:06 
INFO 
  :..


Here’s an example of how to read the entire file as a list using the Python .readlines() method:

In [20]:
with open('data/test.txt', 'r') as f:
    data = f.readlines()  # Returns a list object
    print(data)

['03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP\n', '03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp\n', '03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP\n', '03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp\n', '03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized\n', '03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route\n', '03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward\n', '03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket\n', '03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection\n', '03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route\n', '03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module\n', '03/22 08:51:11 I

The above example can also be done by using list() to create a list out of the file object:

In [21]:
with open('data/test.txt', 'r') as f:
    data = list(f)
    print(data)

['03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP\n', '03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp\n', '03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP\n', '03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp\n', '03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized\n', '03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route\n', '03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward\n', '03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket\n', '03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection\n', '03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route\n', '03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module\n', '03/22 08:51:11 I

#### Iterating Over Each Line in the File

> Note: Looping over lines in a text file preserves their newline characters, which combined with the print() function’s default behavior results in a redundant newline: there are two newlines after each line of text. You can strip one of them before printing the line:

    print(line.rstrip())

Alternatively, you can keep the newline in the content but suppress the one appended by print() automatically. You’d use the end keyword argument to do that:



    print(line, end='')

<hr>

A common thing to do while reading a file is to iterate over each line. Here’s an example of how to use the Python .readline() method to perform that iteration:

In [22]:
with open('data/test.txt', 'r') as reader:
    # Read and print the entire file line by line
    line = reader.readline()
    while line != '':  # readline() returns '' at end of file
        print(line, end='')
        line = reader.readline()

03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp
03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp
03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized
03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route
03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward
03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket
03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection
03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route
03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module
03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for tra

Another way you could iterate over each line in the file is to use the Python .readlines() method of the file object. Remember, .readlines() returns a list where each element in the list represents a line in the file:

In [23]:
with open('data/test.txt', 'r') as reader:
    for line in reader.readlines():
        print(line, end='')

03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp
03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp
03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized
03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route
03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward
03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket
03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection
03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route
03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module
03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for tra

However, the above examples can be further simplified by iterating over the file object itself:

In [24]:
with open('data/test.txt', 'r') as reader:
    for line in reader:
        print(line, end='')

03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp
03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp
03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized
03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route
03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward
03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket
03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection
03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route
03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module
03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for tra

This final approach is more Pythonic, and it can be quicker and more memory efficient because the file object yields one line at a time instead of building a list of every line first. It is therefore the approach to prefer.
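Iterating over the file object also combines naturally with other tools such as enumerate(). The sketch below scans a small scratch log for TRACE entries one line at a time; the three log lines are made-up stand-ins modeled on the sample output, not the real data/test.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.log')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('03/22 08:51:06 INFO   :first\n'
                '03/22 08:51:06 TRACE  :second\n'
                '03/22 08:51:11 INFO   :third\n')

    trace_line_numbers = []
    with open(path, 'r') as reader:
        # Only one line is held in memory at a time
        for line_number, line in enumerate(reader, start=1):
            if ' TRACE ' in line:
                trace_line_numbers.append(line_number)

print(trace_line_numbers)  # [2]
```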

#### Writing lines

Now let’s dive into writing files. As with reading files, file objects have multiple methods that are useful for writing to a file:

<div class="table-responsive">
<table class="table table-hover">
<thead>
<tr>
<th>Method</th>
<th>What It Does</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>.write(string)</code></td>
<td>This writes the string to the file.</td>
</tr>
<tr>
<td><code>.writelines(seq)</code></td>
<td>This writes the sequence to the file. No line endings are appended to each sequence item. It’s up to you to add the appropriate line ending(s).</td>
</tr>
</tbody>
</table>
</div>

Here’s a quick example of using .write() and .writelines():

In [25]:
with open('data/test.txt', 'r') as f:
    # Note: readlines doesn't trim the line endings
    lines = f.readlines()

with open('data/test_reversed.txt', 'w') as f:
    # Alternatively you could use
    # f.writelines(reversed(lines))

    # Write the lines to the file in reversed order
    for line in reversed(lines):
        f.write(line)

In [26]:
with open('data/test_reversed.txt', 'r') as f:
    # Note: readlines doesn't trim the line endings
    lines = f.readlines()

print(lines)

['03/22 08:51:11 INFO   :..mailbox_register: mailbox allocated for pipe03/22 08:51:11 INFO   :...mailslot_create: creating mailslot for (broken) pipe\n', '03/22 08:51:11 INFO   :..mailbox_register: mailbox allocated for dump\n', '03/22 08:51:11 INFO   :...mailslot_create: creating mailslot for dump\n', '03/22 08:51:11 INFO   :..mailbox_register: mailbox allocated for terminate\n', '03/22 08:51:11 INFO   :...mailslot_create: creating mailslot for terminate\n', '03/22 08:51:11 INFO   :...mailbox_register: mailbox allocated for rsvp-api\n', '03/22 08:51:11 INFO   :....mailslot_create: creating mailslot for RSVP client API\n', '03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for traffic-control\n', '03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module\n', '03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route\n', '03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection\n', '03/22 
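The first element of the list above appears to hold two log entries run together, which is what happens when the last line of the source has no trailing newline and is then written first. A possible fix, sketched here with an in-memory io.StringIO buffer and made-up lines instead of the real file, is to normalize the line endings before writing:

```python
import io

# Hypothetical input: the last line has no trailing newline
lines = ['first\n', 'second\n', 'third']

# Writing the lines in reverse as-is glues 'third' onto 'second'
naive = io.StringIO()
naive.writelines(reversed(lines))
reversed_naive = naive.getvalue()

# Make sure every line ends with a newline before reversing
normalized = [line if line.endswith('\n') else line + '\n' for line in lines]
fixed = io.StringIO()
fixed.writelines(reversed(normalized))
reversed_fixed = fixed.getvalue()

print(repr(reversed_naive))
print(repr(reversed_fixed))
```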

#### Change file cursor position

In [27]:
f = open('data/test.txt', 'r')

In [28]:
f.read(8)

'03/22 08'

In [29]:
f.read(30)

':51:06 INFO   :.....mailslot_c'

In [30]:
f.read(150)

'reate: creating mailslot for RSVP\n03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp\n03/22 08:51:06 INFO   :.....mailslot_create:'

In [31]:
f.read()

' creating mailslot for RSVP via UDP\n03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp\n03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized\n03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route\n03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward\n03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket\n03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection\n03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route\n03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module\n03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for traffic-control\n03/22 08:51:11 INFO   :....mailslot_create: creating mailslot for RSVP client API\n03/22 08:51:11 INFO   :...mailbox_register: mailbox allocated for rsvp-api\n03/22

Once the end of the file is reached, further reads return an empty string.

We can change our current file cursor (position) using the seek() method. Similarly, the tell() method returns our current position (in number of bytes).

In [32]:
f.tell()    # get the current file position

1608

In [33]:
f.seek(0)   # bring file cursor to initial position

0

In [34]:
f.read(10)

'03/22 08:5'

In [35]:
f.tell() 

10

In [36]:
print(f.read())

1:06 INFO   :.....mailslot_create: creating mailslot for RSVP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp
03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp
03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized
03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route
03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward
03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket
03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection
03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route
03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module
03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for traffic-contr

## Overview of File Operations

### Converting File Contents

Let's convert the contents of a file to uppercase.

In [37]:
def lower2upper(input_str: str) -> str:
    r_str = input_str.upper()
    return r_str

def converter(source_file: str, dest_file: str):
    with open(source_file, 'r') as reader:
        low_content = reader.read()

    up_content = lower2upper(low_content)

    with open(dest_file, 'w') as writer:
        writer.write(up_content)
        
if __name__ == "__main__":
    s_file = 'data/test.txt'
    d_file = 'data/test_upper.txt'
    converter(s_file, d_file)

### Find the longest words

Write a Python program to find the longest words in a file.

In [39]:
def longest_word(filename):
    with open(filename, 'r') as infile:
        words = infile.read().split()
    max_len = len(max(words, key=len))
    return list(set([word for word in words if len(word) == max_len]))

print(longest_word('data/test.txt'))

[':.....mailbox_register:', ':......mailslot_create:']


### Writing to a File That Doesn’t Already Exist

You want to write data to a file, but only if it doesn’t already exist on the filesystem.

This problem is easily solved by using the little-known x mode to open() instead of the
usual w mode. For example:

In [40]:
with open('data/exists.txt', 'wt') as f:
    f.write('Hello\n')

In [41]:
with open('data/exists.txt', 'xt') as f:
    f.write('Hello\n')

FileExistsError: [Errno 17] File exists: 'data/exists.txt'

If the file is in binary mode, use mode xb instead of xt.

This recipe illustrates an extremely elegant solution to a problem that sometimes arises
when writing files (i.e., accidentally overwriting an existing file). An alternative solution
is to first test for the file like this:

In [42]:
import os

if not os.path.exists('data/exists.txt'):
    with open('data/exists.txt', 'wt') as f:
        f.write('Hello\n')
else:
    print('File already exists!')

File already exists!


Clearly, using the x file mode is a lot more straightforward. It is important to note that
the x mode is a Python 3 specific extension to the open() function. In particular, no
such mode exists in earlier Python versions or the underlying C libraries used in Python’s
implementation.
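
In practice the x mode is usually paired with a try/except block. The following is a minimal sketch of that pattern; the helper name `write_if_missing` and the temporary directory are assumptions made for illustration and are not part of the recipe above.

```python
import os
import tempfile

def write_if_missing(path, text):
    """Create path with text; return False and leave the file alone if it exists."""
    try:
        # 'x' raises FileExistsError instead of truncating an existing file
        with open(path, 'xt') as f:
            f.write(text)
    except FileExistsError:
        return False
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'exists.txt')
    first = write_if_missing(target, 'Hello\n')         # file is created
    second = write_if_missing(target, 'Overwritten?\n')  # file is left untouched
    with open(target) as f:
        content = f.read()

print(first, second, repr(content))
```

Because the existence check and the creation happen inside a single open() call, there is no gap between testing for the file and writing it, which is the weak spot of the os.path.exists() variant.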

### Appending to a File

Sometimes, you may want to append to a file or start writing at the end of an already populated file. This is easily done by using the 'a' character for the mode argument:

In [43]:
with open('data/exists.txt', 'a') as a_writer:
    a_writer.write('\nBeagle\n')

When you examine data/exists.txt again, you’ll see that the beginning of the file is unchanged and Beagle is now added to the end of the file:

In [44]:
with open('data/exists.txt', 'r') as reader:
    print(reader.read())

Hello

Beagle



### Working With Two Files at the Same Time

There are times when you may want to read a file and write to another file at the same time. Both files can be opened in a single with statement. The following example reads all lines of one file and writes them to a second file in reverse order:

In [45]:
d_path = 'data/test.txt'
d_r_path = 'data/test_reversed.txt'

with open(d_path, 'r') as reader, open(d_r_path, 'w') as writer:
    data = reader.readlines()
    writer.writelines(reversed(data))

### Search for a string in text files

If your file is not too large, you can read it into a string, and just use that (easier and often faster than reading and checking line per line):

In [46]:
with open('data/test.txt') as f:
    if 'rsvp' in f.read():
        print("true")
        

true
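
For files that are too large to read into one string, a line-by-line check is a possible alternative. This is only a sketch: the helper name `file_contains` and the sample log lines are assumptions made for illustration, and a match that spans a line break would not be found this way.

```python
import os
import tempfile

def file_contains(path, needle):
    """Return True as soon as any single line of the file contains needle."""
    with open(path) as f:
        # any() stops reading at the first matching line
        return any(needle in line for line in f)

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.log')
    with open(sample, 'w') as out:
        out.write('mailbox allocated for rsvp\nmailbox allocated for route\n')
    found = file_contains(sample, 'rsvp')
    missing = file_contains(sample, 'udp')

print(found, missing)
```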


The in operator only tells you whether the text occurs somewhere in the file. To find every occurrence and its position, you can use the more powerful regular expressions:

In [49]:
# we want to find every occurrence of the search word
import re

with open('data/search_file.txt') as f:
    content = f.read()
    find_results_indexes = [m.start() for m in re.finditer('Python', content)]
    find_results = [content[index:index+20] for index in find_results_indexes]
    print(find_results)

['Python is an easy to', 'Python’s elegant syn', 'Python interpreter a', 'Python Web site, htt', 'Python modules, prog', 'Python interpreter i', 'Python is also suita', 'Python language and ', 'Python interpreter h']


### Counting words in a text file

In [50]:
from collections import Counter

with open('data/search_file.txt', 'r') as f:
    text = f.read()
    
# clean up the text
text_splited = text.split()
text_splited = [word.lower() for word in text_splited] # all lowercase
text_splited = [word.strip() for word in text_splited] # strip surrounding whitespace
text_splited = [word.replace('.', '').replace('(', '').replace(')', '').replace(',', '') for word in text_splited if not word.startswith('http') ] # remove the characters . , ( ) and filter out URLs

# better: keep only the alphanumeric characters of each word
text_splited = [''.join(c for c in word if c.isalnum()) for word in text_splited] 

wordcount = Counter(text_splited)
print(wordcount.most_common(10))

[('and', 11), ('the', 9), ('python', 8), ('to', 5), ('language', 4), ('for', 4), ('is', 3), ('an', 3), ('it', 3), ('in', 3)]


### Replace String in File

To replace a string in a file using Python, follow these steps:
- Open the input file in read mode, as text.
- Open the output file in write mode, as text.
- For each line read from the input file, replace the string and write the result to the output file.
- Close both files (a with statement does this automatically).

In [51]:
d_path = 'data/test.txt'
d_r_path = 'data/test_edited.txt'
with open(d_path, 'r') as reader, open(d_r_path, 'w') as writer:
    for line in reader:
        # replacement logic and any conditions go here
        writer.write(line.replace('INFO', 'ERROR-MESSAGE'))

### Counting lines in a file

In [52]:
def file_len(fname):
    count = 0  # stays 0 for an empty file
    with open(fname) as f:
        for count, _ in enumerate(f, start=1):
            pass
    return count

In [53]:
print(file_len('data/test.txt'))

20


### Keeping the Last N Items

You want to keep a limited history of the last few items seen during iteration or during
some other kind of processing.

Keeping a limited history is a perfect use for a collections.deque. For example, the
following code performs a simple text match on a sequence of lines and yields the
matching line along with the previous N lines of context when found:

In [55]:
from collections import deque

def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            yield line, previous_lines
        previous_lines.append(line)

In [56]:
# Example use on a file
if __name__ == '__main__':
    with open('data/weblog.csv', 'r') as f:
        for line, prevlines in search(f, '/bootstrap-3.3.7/js/bootstrap.min.js HTTP/1.1,304', 5):
            for pline in prevlines:
                print(pline, end='')
            print(line, end='')
            print('-'*20)

10.131.0.1,[29/Nov/2017:14:33:22,GET /css/normalize.css HTTP/1.1,304
10.131.2.1,[29/Nov/2017:14:33:22,GET /css/main.css HTTP/1.1,304
10.131.0.1,[29/Nov/2017:14:33:22,GET /css/style.css HTTP/1.1,304
10.128.2.1,[29/Nov/2017:14:33:23,GET /js/vendor/modernizr-2.8.3.min.js HTTP/1.1,304
10.131.2.1,[29/Nov/2017:14:33:23,GET /js/vendor/jquery-1.12.0.min.js HTTP/1.1,304
10.131.2.1,[29/Nov/2017:14:33:23,GET /bootstrap-3.3.7/js/bootstrap.min.js HTTP/1.1,304
--------------------
10.130.2.1,[30/Nov/2017:12:18:40,GET /css/font-awesome.min.css HTTP/1.1,304
10.128.2.1,[30/Nov/2017:12:18:40,GET /css/normalize.css HTTP/1.1,304
10.128.2.1,[30/Nov/2017:12:18:40,GET /js/vendor/modernizr-2.8.3.min.js HTTP/1.1,304
10.130.2.1,[30/Nov/2017:12:18:40,GET /js/vendor/jquery-1.12.0.min.js HTTP/1.1,304
10.129.2.1,[30/Nov/2017:12:18:40,GET /css/style.css HTTP/1.1,304
10.128.2.1,[30/Nov/2017:12:18:40,GET /bootstrap-3.3.7/js/bootstrap.min.js HTTP/1.1,304
--------------------


When writing code to search for items, it is common to use a generator function involving
yield, as shown in this recipe’s solution. This decouples the process of searching
from the code that uses the results.
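
Because search() accepts any iterable of lines, it can be tried out without a file at all. The sketch below repeats the generator and feeds it a small in-memory list. The sample lines are made up for illustration, and the history is copied with list() so that each result keeps its own snapshot; that copy is an addition to the recipe, since the deque itself keeps changing as iteration continues.

```python
from collections import deque

def search(lines, pattern, history=5):
    previous_lines = deque(maxlen=history)
    for line in lines:
        if pattern in line:
            # copy the history so later appends do not alter this result
            yield line, list(previous_lines)
        previous_lines.append(line)

sample_lines = ['alpha', 'beta', 'gamma python', 'delta', 'python again']
matches = list(search(sample_lines, 'python', history=2))

for line, context in matches:
    print(context, '->', line)
```

The consumer decides what to do with each match (print it, collect it, stop early), while the generator only knows how to find matches.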

Using deque(maxlen=N) creates a fixed-sized queue. When new items are added and
the queue is full, the oldest item is automatically removed. For example:

In [61]:
q = deque(maxlen=3)

In [62]:
q.append(1)

In [63]:
q.append(2)

In [64]:
q.append(3)

In [65]:
q

deque([1, 2, 3])

In [66]:
q.append(4)

In [67]:
q

deque([2, 3, 4])

In [68]:
q.append(5)

In [69]:
q

deque([3, 4, 5])

Although you could manually perform such operations on a list (e.g., appending, deleting,
etc.), the queue solution is far more elegant and runs a lot faster.

More generally, a deque can be used whenever you need a simple queue structure. If
you don’t give it a maximum size, you get an unbounded queue that lets you append
and pop items on either end. For example:

In [70]:
q = deque()
q.append(1)
q.append(2)
q.append(3)
q

deque([1, 2, 3])

In [71]:
q.appendleft(4)
q

deque([4, 1, 2, 3])

In [72]:
q.pop()
q

deque([4, 1, 2])

In [73]:
q.popleft()
q

deque([1, 2])

Adding or popping items from either end of a queue has O(1) complexity. This is unlike
a list where inserting or removing items from the front of the list is O(N).
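
As one more illustration of a bounded deque applied to files, the sketch below keeps only the last few lines of a file, loosely in the spirit of the UNIX tail utility. The helper name `tail` and the generated sample file are assumptions made for illustration.

```python
import os
import tempfile
from collections import deque

def tail(path, n=3):
    """Return the last n lines of a text file as a list."""
    with open(path) as f:
        # the deque consumes the file line by line and keeps only the newest n lines
        return list(deque(f, maxlen=n))

with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, 'numbers.txt')
    with open(log, 'w') as out:
        out.writelines(f'line {i}\n' for i in range(1, 11))
    last_three = tail(log, 3)

print(last_three)
```

Only n lines are held in memory at any time, no matter how long the file is.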

### Skipping the First Part of a file

You want to iterate over items in an iterable, but the first few items aren’t of interest and
you just want to discard them.

The itertools module has a few functions that can be used to address this task. The
first is the itertools.dropwhile() function. To use it, you supply a function and an
iterable. The returned iterator discards the first items in the sequence as long as the
supplied function returns True. Afterward, the entirety of the sequence is produced.

To illustrate, suppose you are reading a file that starts with a series of comment lines.
For example:

In [57]:
with open('data/userdb.txt') as f:
    for line in f:
        print(line, end='')

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode. At other times, this information is provided by
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
# POMEMBNO NOČEMO ZBRISATI
sys:x:3:3:sys:/dev:/usr/sbin/nologin


If you want to skip all of the initial comment lines, here’s one way to do it:

In [58]:
# possible exercise: try this yourself first (a longer version is shown below)
from itertools import dropwhile

with open('data/userdb.txt') as f:
    for line in dropwhile(lambda line: line.startswith('#'), f):
        print(line, end='')

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
# POMEMBNO NOČEMO ZBRISATI
sys:x:3:3:sys:/dev:/usr/sbin/nologin


This example is based on skipping the first items according to a test function. If you
happen to know the exact number of items you want to skip, then you can use
itertools.islice() instead. For example:

In [59]:
from itertools import islice

with open('data/userdb.txt') as f:
    for line in islice(f, 7, None):
        print(line, end='')

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
# POMEMBNO NOČEMO ZBRISATI
sys:x:3:3:sys:/dev:/usr/sbin/nologin


In this example, the last None argument to islice() is required to indicate that you
want everything beyond the first seven items as opposed to only the first seven items
(e.g., a slice of [7:] as opposed to a slice of [:7]).

The dropwhile() and islice() functions are mainly convenience functions that you
can use to avoid writing rather messy code such as this:

In [60]:
with open('data/userdb.txt') as f:
    # Skip over initial comments
    while True:
        line = next(f, '')
        if not line.startswith('#'):
            break
    
    # Process remaining lines
    while line:
        # Replace with useful processing
        print(line, end='')
        line = next(f, None)

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
# POMEMBNO NOČEMO ZBRISATI
sys:x:3:3:sys:/dev:/usr/sbin/nologin


Discarding the first part of an iterable is also slightly different than simply filtering all
of it. For example, the first part of this recipe might be rewritten as follows:

In [61]:
# in case we want to drop every line that starts with #
with open('data/userdb.txt') as f:
    lines = (line for line in f if not line.startswith('#'))
    for line in lines:
        print(line, end='')

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin


This will obviously discard the comment lines at the start, but will also discard all such
lines throughout the entire file. On the other hand, the dropwhile() solution only discards items
until an item no longer satisfies the supplied test. After that, all subsequent items are
returned with no filtering.

Last, but not least, it should be emphasized that this recipe works with all iterables,
including those whose size can’t be determined in advance. This includes generators,
files, and similar kinds of objects.
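
To make the point about arbitrary iterables concrete, here is a small sketch that applies dropwhile() and islice() to a generator instead of a file. The generator `config_lines` and its contents are hypothetical stand-ins for a source whose length is not known in advance.

```python
from itertools import dropwhile, islice

def config_lines():
    """A generator standing in for a file of unknown length."""
    yield '# header\n'
    yield '# more header\n'
    yield 'root:x:0:0\n'
    yield '# comment in the middle\n'
    yield 'bin:x:2:2\n'

# skip leading comment lines only; the later comment survives
after_header = list(dropwhile(lambda line: line.startswith('#'), config_lines()))

# skip a known number of leading items
after_two = list(islice(config_lines(), 2, None))

print(after_header)
print(after_two)
```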

### Reading Multiple Files

https://docs.python.org/3/library/fileinput.html

Python supports reading data from multiple input streams or from a list of files through the fileinput module. This module allows you to loop over the contents of one or more text files quickly and easily. Here’s the typical way fileinput is used:

    import fileinput

    for line in fileinput.input():
        process(line)

fileinput gets its input from command line arguments passed to sys.argv by default.

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty. If a filename is '-', it is also replaced by sys.stdin and the optional arguments mode and openhook are ignored. To specify an alternative list of filenames, pass it as the first argument to input(). A single file name is also allowed.

All files are opened in text mode by default, but you can override this by specifying the mode parameter in the call to input() or FileInput. If an I/O error occurs during opening or reading a file, OSError is raised.

In [63]:
import fileinput
#with fileinput.input(files=('data/multiple_files/file1.txt', 'data/multiple_files/file2.txt')) as f:
with fileinput.input(['data/test.txt', 'data/test_upper.txt']) as f:
    for line in f:
        if f.isfirstline():
            print()
        print(line, end='')


03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp
03/22 08:51:06 INFO   :.....mailslot_create: creating mailslot for RSVP via UDP
03/22 08:51:06 INFO   :....mailbox_register: mailbox allocated for rsvp-udp
03/22 08:51:06 TRACE  :..entity_initialize: interface 127.0.0.1, entity for rsvp allocated and initialized
03/22 08:51:06 INFO   :......mailslot_create: creating socket for querying route
03/22 08:51:06 INFO   :.....mailbox_register: no mailbox necessary for forward
03/22 08:51:06 INFO   :......mailslot_create: creating mailslot for route engine - informational socket
03/22 08:51:06 TRACE  :......mailslot_create: ready to accept informational socket connection
03/22 08:51:11 INFO   :.....mailbox_register: mailbox allocated for route
03/22 08:51:11 INFO   :.....mailslot_create: creating socket for traffic control module
03/22 08:51:11 INFO   :....mailbox_register: no mailbox necessary for tr

> `fileinput.input(files=None, inplace=False, backup='', *, mode='r', openhook=None)`: Create an instance of the FileInput class. The instance will be used as global state for the functions of this module, and is also returned to use during iteration. The parameters to this function will be passed along to the constructor of the FileInput class. The FileInput instance can be used as a context manager in the with statement. In the example above, input is closed after the with statement is exited, even if an exception occurs.

Let’s use fileinput to build a crude version of the common UNIX utility cat. The cat utility reads files sequentially, writing them to standard output. When given more than one file in its command line arguments, cat will concatenate the text files and display the result in the terminal:

    cat exists.txt test_edited.txt test.txt

In [None]:
# cat.py
import fileinput

with fileinput.input() as files:
    for line in files:
        if fileinput.isfirstline():
            print(f'\n--- Reading {fileinput.filename()} ---')
        print(' -> ' + line, end='')
    print()

    python cat.py  data/test*

fileinput allows you to retrieve more information about each line such as whether or not it is the first line (.isfirstline()), the line number (.lineno()), and the filename (.filename()).
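
The sketch below collects this per-line information for two small files. The file names and contents are assumptions created in a temporary directory for illustration; .filelineno(), the line number within the current file, is shown next to the cumulative .lineno().

```python
import fileinput
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
        path = os.path.join(tmp, name)
        with open(path, 'w') as out:
            out.write(text)
        paths.append(path)

    records = []
    with fileinput.input(files=paths) as stream:
        for line in stream:
            records.append((
                os.path.basename(stream.filename()),  # file the line came from
                stream.lineno(),                      # cumulative line number
                stream.filelineno(),                  # line number within the file
                stream.isfirstline(),                 # True on each file's first line
                line.rstrip('\n'),
            ))

for record in records:
    print(record)
```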

[fileinput — Iterate over lines from multiple input streams](https://docs.python.org/3/library/fileinput.html?highlight=fileinput#module-fileinput)

## Reading and Writing Binary Files

### Bytes Objects

The bytes object is one of the core built-in types for manipulating binary data. A bytes object is an immutable sequence of single byte values. Each element in a bytes object is a small integer in the range 0 to 255.

Although binary sequences are really sequences of integers, their literal notation reflects
the fact that ASCII text is often embedded in them. Therefore, three different displays
are used, depending on each byte value:
- For bytes in the printable ASCII range—from space to ~—the ASCII character itself
is used.
- For bytes corresponding to tab, newline, carriage return, and \, the escape sequences
\t, \n, \r, and \\ are used.
- For every other byte value, a hexadecimal escape sequence is used (e.g., \x00 is the
null byte).

That is why the UTF-8 encoding of 'café' is displayed as b'caf\xc3\xa9': the first three bytes b'caf' are in
the printable ASCII range, the last two are not.

Both bytes and bytearray support every str method except those that do formatting
(format, format_map) and a few others that depend on Unicode data, including
casefold, isdecimal, isidentifier, isnumeric, isprintable, and encode. This means that
you can use familiar string methods like endswith, replace, strip, translate, upper,
and dozens of others with binary sequences, only using bytes and not str arguments.
In addition, the regular expression functions in the re module also work on binary
sequences, if the regex is compiled from a binary sequence instead of a str.
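
A short sketch of what this looks like in practice; the sample values are made up for illustration.

```python
import re

raw = b'  Hello, World  '

cleaned = raw.strip().upper()                       # str-like methods return bytes
swapped = raw.strip().replace(b'World', b'bytes')   # arguments must be bytes too
words = re.findall(rb'\w+', b'foo bar baz')         # bytes pattern, bytes results

try:
    raw.replace('World', 'bytes')                   # str arguments are rejected
except TypeError as exc:
    error_name = type(exc).__name__

print(cleaned, swapped, words, error_name)
```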

#### Defining a Literal bytes Object

A bytes literal is defined in the same way as a string literal with the addition of a 'b' prefix:

In [3]:
b = b'foo bar baz'

In [4]:
print(b)

b'foo bar baz'


In [5]:
type(b)

bytes

Only ASCII characters are allowed in a bytes literal. Any character value greater than 127 must be specified using an appropriate escape sequence:

In [6]:
b = b'foo\xddbar'

In [7]:
b

b'foo\xddbar'

In [8]:
b[3]

221

In [9]:
int(0xdd)

221

#### Defining a bytes Object With the Built-in bytes() Function

The bytes() function also creates a bytes object. What sort of bytes object gets returned depends on the argument(s) passed to the function. The possible forms are shown below.

`bytes(<s>, <encoding>)` converts string `<s>` to a bytes object, using str.encode() according to the specified `<encoding>`:

In [10]:
b = bytes('foo.bar', 'utf8')

In [11]:
b

b'foo.bar'

> Technical Note: In this form of the bytes() function, the <encoding> argument is required. “Encoding” refers to the manner in which characters are translated to integer values. A value of "utf8" indicates Unicode Transformation Format UTF-8, which is an encoding that can handle every possible Unicode character. UTF-8 can also be indicated by specifying "UTF8", "utf-8", or "UTF-8" for <encoding>.

Although a bytes object definition and representation is based on ASCII text, it actually behaves like an immutable sequence of small integers in the range 0 to 255, inclusive. That is why a single element from a bytes object is displayed as an integer:

In [12]:
b[3]

46

You can convert a bytes object into a list of integers with the built-in list() function:

In [13]:
list(b)

[102, 111, 111, 46, 98, 97, 114]

Hexadecimal numbers are often used to specify binary data because two hexadecimal digits correspond directly to a single byte. The bytes class supports two additional methods that facilitate conversion to and from a string of hexadecimal digits.

- `bytes.fromhex(<s>)`: Returns a bytes object constructed from a string of hexadecimal values.

bytes.fromhex(s) returns the bytes object that results from converting each pair of hexadecimal digits in s to the corresponding byte value. The hexadecimal digit pairs in s may optionally be separated by whitespace, which is ignored:

In [14]:
b = bytes.fromhex(' aa 68 4682cc ')

In [15]:
b

b'\xaahF\x82\xcc'

In [16]:
list(b)

[170, 104, 70, 130, 204]

- `b.hex()`: Returns a string of hexadecimal values from a bytes object.

b.hex() returns the result of converting bytes object b into a string of hexadecimal digit pairs. That is, it does the reverse of .fromhex():



In [17]:
b = bytes.fromhex(' aa 68 4682cc ')

In [18]:
b

b'\xaahF\x82\xcc'

In [19]:
b.hex()

'aa684682cc'

In [20]:
type(b.hex())

str

The other ways of building bytes or bytearray instances are calling their constructors
with:
- A str and an encoding keyword argument.
- An iterable providing items with values from 0 to 255.
- A single integer, to create a binary sequence of that size initialized with null bytes.
(An early draft of PEP 467, Minor API improvements for binary sequences, proposed
deprecating this signature, but it still works in current Python 3 releases.)
- An object that implements the buffer protocol (e.g., bytes, bytearray, memoryview,
array.array); this copies the bytes from the source object to the newly created
binary sequence.


Building a binary sequence from a buffer-like object is a low-level operation that may
involve type casting.
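
The constructor forms listed above can be sketched as follows. The values are arbitrary examples; the array type code 'h' (signed short) is chosen only for illustration, and its size in bytes depends on the platform, which is why the sketch checks the length through itemsize.

```python
import array

from_text = bytes('café', encoding='utf-8')    # str plus encoding
from_iterable = bytes([102, 111, 111])          # iterable of ints in 0..255
zero_filled = bytes(4)                          # four null bytes

numbers = array.array('h', [-2, -1, 0, 1, 2])  # supports the buffer protocol
from_buffer = bytes(numbers)                    # raw copy of the array's memory

print(from_text, from_iterable, zero_filled)
print(len(from_buffer), 'bytes copied from', len(numbers), 'array items')
```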

#### String encode()/decode()

In Python 3, there are two types that represent sequences of characters: bytes and str.
Instances of bytes contain raw 8-bit values. Instances of str contain Unicode
characters.

There are many ways to represent Unicode characters as binary data (raw 8-bit values).
The most common encoding is UTF-8. Importantly, str instances in Python 3 and
unicode instances in Python 2 do not have an associated binary encoding. To convert
Unicode characters to binary data, you must use the encode method. To convert binary
data to Unicode characters, you must use the decode method.

When you’re writing Python programs, it’s important to do encoding and decoding of
Unicode at the furthest boundary of your interfaces. The core of your program should use
Unicode character types (str in Python 3, unicode in Python 2) and should not assume
anything about character encodings. This approach allows you to be very accepting of
alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about
your output text encoding (ideally, UTF-8).

[codecs — Codec registry and base classes](https://docs.python.org/3/library/codecs.html)

The concept of “string” is simple enough: a string is a sequence of characters. The problem
lies in the definition of “character.”

In 2015, the best definition of “character” we have is a Unicode character. Accordingly,
the items you get out of a Python 3 str are Unicode characters, just like the items of a
unicode object in Python 2—and not the raw bytes you get from a Python 2 str.

The Unicode standard explicitly separates the identity of characters from specific byte
representations:
- The identity of a character—its code point—is a number from 0 to 1,114,111 (base
10), shown in the Unicode standard as 4 to 6 hexadecimal digits with a “U+” prefix.
For example, the code point for the letter A is U+0041, the Euro sign is U+20AC,
and the musical symbol G clef is assigned to code point U+1D11E. About 10% of
the valid code points have characters assigned to them in Unicode 6.3, the standard
used in Python 3.4.
- The actual bytes that represent a character depend on the encoding in use. An encoding
is an algorithm that converts code points to byte sequences and vice versa.
The code point for A (U+0041) is encoded as the single byte \x41 in the UTF-8
encoding, or as the bytes \x41\x00 in UTF-16LE encoding. As another example,
the Euro sign (U+20AC) becomes three bytes in UTF-8—\xe2\x82\xac—but in
UTF-16LE it is encoded as two bytes: \xac\x20.

Converting from code points to bytes is encoding; converting from bytes to code points
is decoding.
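
The code points and byte sequences mentioned above can be checked directly in Python. This is a small sketch; the dictionary keys are just labels chosen for illustration.

```python
samples = {'A': 'A', 'euro': '€', 'g_clef': '\U0001D11E'}

encodings = {
    name: (
        f'U+{ord(char):04X}',       # code point in the Unicode notation
        char.encode('utf-8'),       # bytes in UTF-8
        char.encode('utf-16le'),    # bytes in UTF-16, little-endian
    )
    for name, char in samples.items()
}

for name, (code_point, utf8, utf16le) in encodings.items():
    print(name, code_point, utf8, utf16le)
```

Note that Python displays the byte \x41 as the ASCII character A and \x20 as a space, following the display rules for bytes described earlier.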

In [1]:
# Example: Encoding and decoding

In [21]:
s = 'café'

In [22]:
# The str 'café' has four Unicode characters.
len(s)

4

The str.encode() method encodes a string using the given encoding and returns a bytes object. If no encoding is given, UTF-8 is used by default.

In [23]:
# Encode str to bytes using UTF-8 encoding.
b = s.encode('utf8')

In [24]:
b # bytes literals start with a b prefix.

b'caf\xc3\xa9'

In [25]:
# Same result using the bytes() constructor
bytes(s, 'utf-8')

b'caf\xc3\xa9'

In [26]:
# bytes b has five bytes (the code point for “é” is encoded as two bytes in UTF-8).
len(b)

5

The bytes.decode() method converts a bytes object to a str. Both encode() and decode() accept an errors argument that selects the error-handling scheme. The default is ‘strict’, meaning that a failure raises a UnicodeEncodeError when encoding or a UnicodeDecodeError when decoding. Other possible values include ‘ignore’ and ‘replace’, plus ‘xmlcharrefreplace’, which is available only when encoding.

In [27]:
# Decode bytes to str using UTF-8 encoding.
b.decode('utf8')

'café'
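
The effect of the errors argument can be seen in a short sketch. The target encoding 'ascii' and the Latin-1 bytes are chosen here only because they make the failures easy to trigger.

```python
text = 'café'

ascii_ignore = text.encode('ascii', errors='ignore')           # drops é
ascii_replace = text.encode('ascii', errors='replace')         # é becomes ?
ascii_xml = text.encode('ascii', errors='xmlcharrefreplace')   # é becomes &#233;

latin1_bytes = text.encode('latin_1')                          # b'caf\xe9'
utf8_replace = latin1_bytes.decode('utf-8', errors='replace')  # bad byte becomes U+FFFD

try:
    latin1_bytes.decode('utf-8')                               # default errors='strict'
except UnicodeDecodeError as exc:
    strict_error = type(exc).__name__

print(ascii_ignore, ascii_replace, ascii_xml, repr(utf8_replace), strict_error)
```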

> If you need a memory aid to help distinguish .decode()
from .encode(), convince yourself that byte sequences can be
cryptic machine core dumps while Unicode str objects are “human”
text. Therefore, it makes sense that we decode bytes to str
to get human-readable text, and we encode str to bytes for storage
or transmission.

The Python distribution bundles more than 100 codecs (encoder/decoder) for text to
byte conversion and vice versa. Each codec has a name, like 'utf_8', and often aliases,
such as 'utf8', 'utf-8', and 'U8', which you can use as the encoding argument in
functions like open(), str.encode(), bytes.decode(), and so on.

In [28]:
# The string “El Niño” encoded with three codecs producing very different
# byte sequences

In [29]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


You’ll often need two helper functions to convert between these two cases and to ensure
that the type of input values matches your code’s expectations.

In Python 3, you’ll need one method that takes a str or bytes and always returns a
str.

In [30]:
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value # Instance of str

In [31]:
to_str(b'caf\xc3\xa9')

'café'

You’ll need another method that takes a str or bytes and always returns a bytes.

In [32]:
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value # Instance of bytes

In [33]:
to_bytes('danes je deževen dan')

b'danes je de\xc5\xbeeven dan'

- In Python 3, bytes contains sequences of 8-bit values, str contains sequences of
Unicode characters. bytes and str instances can’t be used together with operators
(like > or +).
- Use helper functions to ensure that the inputs you operate on are the type of
character sequence you expect (8-bit values, UTF-8 encoded characters, Unicode
characters, etc.).
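
A brief sketch of what happens when the two types are mixed, and how converting one side first resolves it. The sample values are arbitrary.

```python
failures = []

try:
    b'one' + 'two'            # concatenating bytes and str
except TypeError:
    failures.append('+')

try:
    b'one' > 'two'            # ordering comparison between bytes and str
except TypeError:
    failures.append('>')

# Convert one side so that both operands have the same type
as_text = b'one'.decode('utf-8') + 'two'
as_bytes = b'one' + 'two'.encode('utf-8')

print(failures, as_text, as_bytes)
```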

#### Example: base64

[Encoding and Decoding Base64 Strings in Python](https://stackabuse.com/encoding-and-decoding-base64-strings-in-python/)

Base64 encoding allows us to convert bytes containing binary or text data to ASCII characters. By encoding our data, we improve the chances of it being processed correctly by various systems.

**What is Base64 Encoding?**
Base64 encoding is a type of conversion of bytes into ASCII characters. In mathematics, the base of a number system refers to how many different characters represent numbers. The name of this encoding comes directly from the mathematical definition of bases - we have 64 characters that represent numbers.

The Base64 character set contains:
- 26 uppercase letters
- 26 lowercase letters
- 10 numbers
- the two symbols + and / (some variants, such as the URL-safe alphabet, use different characters)

When the computer converts Base64 characters to binary, each Base64 character represents 6 bits of information.
Note: This is not an encryption algorithm, and should not be used for security purposes.
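
To see where the 6 bits per character come from, the sketch below encodes exactly three bytes (24 bits) by hand into four Base64 characters and compares the result with the base64 module. The helper name `encode_three_bytes` is hypothetical, and padding for inputs that are not a multiple of three bytes is deliberately left out.

```python
import base64
import string

# The standard alphabet: A-Z, a-z, 0-9, then + and /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def encode_three_bytes(chunk):
    """Encode exactly 3 bytes (24 bits) as 4 Base64 characters (6 bits each)."""
    if len(chunk) != 3:
        raise ValueError('this sketch handles exactly three bytes')
    number = int.from_bytes(chunk, 'big')
    # take the 24 bits in four groups of 6, most significant group first
    return ''.join(ALPHABET[(number >> shift) & 0x3F] for shift in (18, 12, 6, 0))

manual = encode_three_bytes(b'Man')
reference = base64.b64encode(b'Man').decode('ascii')
print(manual, reference)
```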

**Why use Base64 Encoding?**

In computers, all data of different types are transmitted as 1s and 0s. However, some communication channels and applications are not able to understand all the bits they receive. This is because the meaning of a sequence of 1s and 0s is dependent on the type of data it represents. For example, 10110001 must be processed differently if it represents a letter or an image.

To work around this limitation, you can encode your data to text, improving the chances of it being transmitted and processed correctly. Base64 is a popular method to get binary data into ASCII characters, which is widely understood by the majority of networks and applications.

A common real-world scenario where Base64 encoding is heavily used is mail servers. They were originally built to handle text data, but we also expect them to send images and other media with a message. In those cases, your media data would be Base64 encoded when it is being sent. It will then be Base64 decoded when it is received so an application can use it.

**Encoding Strings with Python**

Python 3 provides a base64 module that allows us to easily encode and decode information. We first convert the string into a bytes-like object. Once converted, we can use the base64 module to encode it.

In [34]:
import base64

message = "Python is fun"
message_bytes = message.encode('ascii')
print('message_bytes:', message_bytes)
base64_bytes = base64.b64encode(message_bytes)
print('base64_bytes:', base64_bytes)
base64_message = base64_bytes.decode('ascii')
print(base64_message)

message_bytes: b'Python is fun'
base64_bytes: b'UHl0aG9uIGlzIGZ1bg=='
UHl0aG9uIGlzIGZ1bg==


In the code above, we first imported the base64 module. The message variable stores our input string to be encoded. We convert that to a bytes-like object using the string's encode method and store it in message_bytes. We then Base64 encode message_bytes and store the result in base64_bytes using the base64.b64encode method. We finally get the string representation of the Base64 conversion by decoding the base64_bytes as ASCII.

**Decoding Strings with Python**

Decoding a Base64 string is essentially a reverse of the encoding process. We decode the Base64 string into bytes of unencoded data. We then convert the bytes-like object into a string.

In [4]:
import base64

base64_message = 'UHl0aG9uIGlzIGZ1bg=='
base64_bytes = base64_message.encode('ascii')
message_bytes = base64.b64decode(base64_bytes)
message = message_bytes.decode('ascii')

print(message)

Python is fun


Once again, we need the base64 module imported. We then convert the Base64 string into a bytes-like object with encode('ascii'). We continue by calling the base64.b64decode function to decode base64_bytes into our message_bytes variable. Finally, we decode message_bytes into the string object message, so it becomes human readable.
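
The base64 module also offers a URL-safe variant of the alphabet and a stricter decoding mode. The following sketch, with example data picked so that the last two alphabet symbols appear in the output, shows both. The `rejected` flag is just an illustrative variable.

```python
import base64
import binascii

data = b"\xfb\xff"

standard = base64.b64encode(data)          # uses '+' and '/'
url_safe = base64.urlsafe_b64encode(data)  # uses '-' and '_' instead
print(standard, url_safe)  # b'+/8=' b'-_8='

# Each variant decodes back to the same bytes with its matching function
print(base64.b64decode(standard) == data)          # True
print(base64.urlsafe_b64decode(url_safe) == data)  # True

# With validate=True, characters outside the standard alphabet are rejected
rejected = False
try:
    base64.b64decode(b"UHl0aG9u*", validate=True)
except binascii.Error:
    rejected = True
print(rejected)  # True
```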



#### Example: Hashing Strings

[Hashing Strings with Python](https://www.pythoncentral.io/hashing-strings-with-python/)

A hash function takes a variable-length sequence of bytes as input and converts it to a fixed-length sequence. It is a one-way function: if f is the hashing function, calculating f(x) is fast and simple, but recovering x from f(x) is computationally infeasible for a well-designed function. The value returned by a hash function is often called a hash, message digest, hash value, or checksum. Because the output has a fixed length, two different inputs can in principle produce the same hash (a collision), but for a good hash function this is extremely unlikely to happen by chance.

Hashing functions only accept a sequence of bytes as a parameter, so a string must be converted first: either call its encode() method, as in the first cell below, or write a bytes literal with a "b" prefix, as in the SHA512 cell at the end of this section.
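
The sketch below illustrates these two points with SHA-256 (chosen here only because it is always available in hashlib): the digest length does not depend on the input length, and passing a str instead of bytes is rejected. The variable names are chosen for this example.

```python
import hashlib

# The digest has the same length no matter how long the input is
short_digest = hashlib.sha256(b"a").hexdigest()
long_digest = hashlib.sha256(b"a" * 1_000_000).hexdigest()
print(len(short_digest), len(long_digest))  # 64 64

# Encoding a string and writing a bytes literal give the same bytes,
# and therefore the same digest
from_encode = hashlib.sha256("Hello World".encode("utf-8")).hexdigest()
from_literal = hashlib.sha256(b"Hello World").hexdigest()
print(from_encode == from_literal)  # True

# A str is not accepted directly
str_rejected = False
try:
    hashlib.sha256("Hello World")
except TypeError:
    str_rejected = True
print(str_rejected)  # True
```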

In [35]:
message = 'Danes je lep dan.'
message_bytes = message.encode()

import hashlib
hash_object = hashlib.md5(message_bytes)

hash.digest(): Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255.

In [36]:
hash_object.digest()

b'\x80\xf6\xda\xf3\x00|C\xdcH\xdeB]\x7fy\t\xa3'

In [37]:
hash_object.digest().hex()

'80f6daf3007c43dc48de425d7f7909a3'

hash.hexdigest(): Like digest() except the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.

In [38]:
hash_object.hexdigest()

'80f6daf3007c43dc48de425d7f7909a3'

In [39]:
type(hash_object.hexdigest())

str

In [40]:
# SHA512
import hashlib
hash_object = hashlib.sha512(b'Hello World')
hex_dig = hash_object.hexdigest()
print(hex_dig)

2c74fd17edafd80e8447b0d46741ee243b7eb74dd2149a0ab1b9246fb30382f27e853d8585719e0e67cbda0daa8f51671064615d645ae27acb15bfb1447f459b
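
Since digest() reports on everything passed to update() so far, a file can be hashed piece by piece without loading it into memory at once. The following is a minimal sketch of that idea; the helper `file_sha256`, the chunk size, and the temporary file are assumptions made for this example.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write some sample data to a temporary file
payload = b"Hello World" * 10_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    chunked = file_sha256(tmp_path)
    # Hashing in chunks gives the same result as hashing everything at once
    print(chunked == hashlib.sha256(payload).hexdigest())  # True
finally:
    os.remove(tmp_path)
```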


### Binary Files

You need to read or write binary data, such as that found in images, sound files, and so
on.

Use the open() function with mode rb or wb to read or write binary data. For example:

In [41]:
# Write binary data to a file
with open('data/somefile.bin', 'wb') as f:
    f.write(b'Hello World')

In [42]:
# Read the entire file as a single byte string
with open('data/somefile.bin', 'rb') as f:
    data = f.read()
    print(data)

b'Hello World'


Sometimes, you may need to work with files using byte strings. This is done by adding the 'b' character to the mode argument. All of the same methods for the file object apply. However, each of the methods expects and returns a bytes object instead:

In [43]:
with open('data/example.txt', 'rb') as reader:
    line = reader.readline()
    print(line)
    print(type(line))

b'The problem with list comprehensions is that they may create a whole new list containing one \r\n'
<class 'bytes'>


When reading binary, it is important to stress that all data returned will be in the form
of byte strings, not text strings. Similarly, when writing, you must supply data in the
form of objects that expose data as bytes (e.g., byte strings, bytearray objects, etc.).
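
A short sketch of what this means in practice: bytes and bytearray objects can be written to a binary-mode file, while a text string is rejected. The file name and variable names are placeholders for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.bin")

    str_rejected = False
    with open(path, "wb") as f:
        f.write(b"abc")                 # a bytes object
        f.write(bytearray([100, 101]))  # a bytearray holding b"de"
        try:
            f.write("text")             # a str is not accepted in binary mode
        except TypeError:
            str_rejected = True

    with open(path, "rb") as f:
        content = f.read()

print(str_rejected)  # True
print(content)       # b'abcde'
```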

If you ever need to read or write text from a binary-mode file, make sure you remember
to decode or encode it. For example:

In [44]:
with open('data/somefile.bin', 'wb') as f:
    text = 'Hello World'
    f.write(text.encode('utf-8'))

In [45]:
with open('data/somefile.bin', 'rb') as f:
    data = f.read(16)
    text = data.decode('utf-8')
    print(text)

Hello World


### Buffered Binary File Types

A buffered binary file type is used for reading and writing binary files. Here are some examples of how these files are opened:

    open('abc.txt', 'rb')

    open('abc.txt', 'wb')

With these types of files, open() will return either a BufferedReader or BufferedWriter file object:

In [48]:
with open('data/example.txt', 'rb') as file:
    print(type(file))

<class '_io.BufferedReader'>


In [49]:
# Caution: mode 'wb' truncates the file, so a scratch file is used here
with open('data/somefile.bin', 'wb') as file:
    print(type(file))

<class '_io.BufferedWriter'>


> Binary I/O (also called buffered I/O) expects bytes-like objects and produces bytes objects. No encoding, decoding, or newline translation is performed. This category of streams can be used for all kinds of non-text data, and also when manual control over the handling of text data is desired.

### Raw File Types

Raw I/O (also called unbuffered I/O) is generally used as a low-level building block for binary and text streams; it is rarely useful to manipulate a raw stream directly from user code, so it is not typically used. Nevertheless, you can create a raw stream by opening a file in binary mode with buffering disabled.

Here’s an example of how these files are opened:

    open('abc.txt', 'rb', buffering=0)

With these types of files, open() will return a FileIO file object:

In [50]:
with open('data/example.txt', 'rb', buffering=0) as file:
    print(type(file))

<class '_io.FileIO'>
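
The text, buffered, and raw file types are layered on top of each other. As a small sketch, a file opened in text mode exposes the buffered layer through its buffer attribute, and the buffered layer exposes the raw layer through raw. The file name is a placeholder for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "layers.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("layers\n")

    with open(path, "r", encoding="utf-8") as text_file:
        layer_names = [
            type(text_file).__name__,             # text layer
            type(text_file.buffer).__name__,      # buffered binary layer
            type(text_file.buffer.raw).__name__,  # raw layer
        ]

print(layer_names)  # ['TextIOWrapper', 'BufferedReader', 'FileIO']
```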


### Memory Mapping Binary Files

[mmap (memory-mapped file support)](https://docs.python.org/3.8/library/mmap.html). This module provides an interface to the operating system’s memory mapping functions. The mapped region behaves pretty much like a byte string, but data is read directly from the file.

Memory-mapped file objects behave like both bytearray and like file objects. You can use mmap objects in most places where bytearray objects are expected; for example, you can use the re module to search through a memory-mapped file. You can also change a single byte by doing obj[index] = 97, or change a subsequence by assigning to a slice: obj[i1:i2] = b'...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.

A memory-mapped file is created by the mmap constructor, which is different on Unix and on Windows. In either case you must provide a file descriptor for a file opened for update. If you wish to map an existing Python file object, use its fileno() method to obtain the correct value for the fileno parameter. Otherwise, you can open the file using the os.open() function, which returns a file descriptor directly (the file still needs to be closed when done).

In [51]:
import mmap

# write a simple example file
with open("data/mapping.txt", "wb") as f:
    f.write(b"Hello Python!\n")

In [52]:
with open("data/mapping.txt", "r+b") as f:
    #memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" wXXld!\n"
    mm[6:8] = b"LL"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello LLXXld!\n"
    # close the map
    mm.close()

b'Hello Python!\n'
b'Hello'
b'Hello LLXXld!\n'


The code above opens a file, then memory maps it. It exercises the readline() method of the mapped file, demonstrating that it works just as with a standard file. It then reads and writes slices of the mapped file (an equally valid way to access the mapped file's content, which does not alter the file pointer). Finally the file pointer is repositioned at the start and the (updated) contents are read in.

> Writing: To set up the memory mapped file to receive updates, start by opening it for updating with mode 'r+' (not 'w', which would truncate the file) before mapping it. Then use any of the API methods that change the data (write(), assignment to a slice, etc.).

Files can be so large that it is impractical to load all of their content into memory at once. The mmap.mmap() function creates a virtual file object. Not only can you perform all the regular file operations on a memory-mapped file, you can also treat it as a vast object (far larger than any real object could be) that you can address just like any other sequence.

This technique deals with files by mapping them into your process's address space. The mmap module allows you to treat files as similar to bytearray objects—you can index them, slice them, search them with regular expressions and the like. Many of these operations can make it much easier to handle the data in a file: without memory mapping, you have to read the file in chunks and process the chunks (assuming the files are too large to read into memory as a single chunk). This makes it very difficult to process strings that overlap the inter-chunk boundaries. Memory mapping allows you to pretend that all the data is in memory at the same time even when that is not actually the case. The necessary manipulations to allow this are performed automatically.


Memory-mapping a file uses the operating system virtual memory system to access the data on the filesystem directly, instead of using normal I/O functions. Memory-mapping typically improves I/O performance because it does not involve a separate system call for each access and it does not require copying data between buffers – the memory is accessed directly.

Memory-mapped files can be treated as mutable strings or file-like objects, depending on your need. A mapped file supports the expected file API methods, such as close(), flush(), read(), readline(), seek(), tell(), and write(). It also supports the string API, with features such as slicing and methods like find().
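
The following sketch mixes the two styles on one small mapped file: readline(), tell(), seek(), and write() from the file API, and find() and slicing from the sequence side. The file name and contents are made up for this example.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "api_demo.txt")
    with open(path, "wb") as f:
        f.write(b"first line\nsecond line\n")

    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        # File-style access: read a line and check the current position
        first = mm.readline()
        position = mm.tell()

        # Sequence-style access: search for a byte string
        index = mm.find(b"second")

        # Back to file-style access: jump to the match and overwrite it
        mm.seek(index)
        mm.write(b"SECOND")
        mm.flush()

        # Slicing returns the (updated) contents as bytes
        result = mm[:]

print(first)     # b'first line\n'
print(position)  # 11
print(index)     # 11
print(result)    # b'first line\nSECOND line\n'
```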

> Note: There are differences in the arguments and behaviors for mmap() between Unix and Windows, which are not discussed below. For more details, refer to the standard library documentation.

#### The mmap Interface

For calls to mmap.mmap() to be cross-platform compatible, they should stick to the following signature:

    mmap(fileno, length, access=ACCESS_WRITE, offset=0)

The file number (fileno) is used simply because this mirrors the interface of the underlying C library (not always the best design decision, but fortunately the file number is easily obtained from an open file's fileno() method). Using a file number of -1 creates an anonymous mapping (one that is not backed by any file).

The call above maps length bytes from the beginning of the file, and returns an mmap object that gives both file- and index-based access to that portion of the file's contents. If length exceeds the current length of the file, the behavior depends on the platform: Windows extends the file to the new length, while Unix raises a ValueError. If length is zero, the mmap object will map the current length of the file, which in turn sets the maximum valid index that can be used.

The optional access argument can take one of three values, all defined in the mmap module:

| Access value | Meaning |
| --- | --- |
| ACCESS_READ | Any attempt to assign to the memory map raises a TypeError exception. |
| ACCESS_WRITE | Assignments to the map affect both the map's content and the underlying file. |
| ACCESS_COPY | Assignments to the memory map change the map's contents but do not update the file on which the map was based (a copy-on-write mapping). |

The offset argument, when present, establishes an offset within the file for the starting position of the memory map. The offset must be a multiple of the constant mmap.ALLOCATIONGRANULARITY (which is typically the size of a virtual memory block, 4096 bytes on many systems).
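
The three access values can be compared side by side on a small scratch file. This is only a sketch; the file name and variable names are placeholders chosen for the example.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "access_demo.bin")
    with open(path, "wb") as f:
        f.write(b"Hello Python!\n")

    with open(path, "r+b") as f:
        # ACCESS_READ: assignment is refused with a TypeError
        read_only_refused = False
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as ro:
            try:
                ro[0:5] = b"HELLO"
            except TypeError:
                read_only_refused = True

        # ACCESS_COPY: the map changes, the file does not
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as cow:
            cow[0:5] = b"HELLO"
            copy_view = cow[:5]

    with open(path, "rb") as f:
        on_disk_after_copy = f.read()

    # ACCESS_WRITE: the change reaches the underlying file
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as rw:
            rw[0:5] = b"HELLO"
            rw.flush()

    with open(path, "rb") as f:
        on_disk_after_write = f.read()

print(read_only_refused)    # True
print(copy_view)            # b'HELLO'
print(on_disk_after_copy)   # b'Hello Python!\n'
print(on_disk_after_write)  # b'HELLO Python!\n'
```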

> Be careful with large files. Remember that if you memory map a file it gets mapped into your process's virtual address space. If you are using 32-bit Python (either because you are running on a 32-bit system or because your system administrators chose to install a 32-bit Python interpreter on a system built using 64-bit technology), each process has a 4GB upper limit on the size of its address space. Since there are many other claims on a process's memory, it is unlikely you will be able to map all of a file much above 1GB in size in a 32-bit Python environment.

> First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.
>
> The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program.
>
> So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.
>
> The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. The way you do this is by specifying the length and offset parameters to the mmap.mmap() constructor. For example: `m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024)`
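
The window-by-window approach described in the quote could be sketched as follows. The helper `find_in_windows` is hypothetical and written for this example; it assumes the window size is a multiple of mmap.ALLOCATIONGRANULARITY, and it extends each window by a few bytes so that a match straddling a window boundary is not missed.

```python
import mmap
import os
import tempfile

def find_in_windows(path, needle, window_size):
    """Return the file offset of the first occurrence of needle, or -1.

    The file is mapped one window at a time instead of all at once.
    """
    if window_size % mmap.ALLOCATIONGRANULARITY:
        raise ValueError("window_size must be a multiple of ALLOCATIONGRANULARITY")
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        for offset in range(0, file_size, window_size):
            # Overlap into the next window by len(needle) - 1 bytes,
            # but never map past the end of the file
            length = min(window_size + len(needle) - 1, file_size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                index = m.find(needle)
                if index != -1:
                    return offset + index
    return -1

# Demo: place the needle so that it straddles the first window boundary
granularity = mmap.ALLOCATIONGRANULARITY
data = b"." * (granularity - 3) + b"needle" + b"." * granularity

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "big.bin")
    with open(path, "wb") as f:
        f.write(data)
    position = find_in_windows(path, b"needle", window_size=granularity)
    missing = find_in_windows(path, b"absent", window_size=granularity)

print(position == granularity - 3)  # True
print(missing)                      # -1
```

In a real 32-bit scenario the window would be much larger (hundreds of megabytes); a single allocation unit is used here only to keep the demo file small.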

### Usage

You want to memory map a binary file into a mutable byte array, possibly for random
access to its contents or to make in-place modifications.

Use the mmap module to memory map files. Here is a utility function that shows how to
open a file and memory map it in a portable manner:

In [53]:
import os
import mmap

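def memory_map(filename, access=mmap.ACCESS_WRITE):
    size = os.path.getsize(filename)
    fd = os.open(filename, os.O_RDWR)
    try:
        return mmap.mmap(fd, size, access=access)
    finally:
        # mmap keeps its own duplicate of the descriptor, so close ours to avoid a leak
        os.close(fd)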

Now here is an example of memory mapping the contents using the memory_map()
function:

In [54]:
m = memory_map('data/mapping.txt')

In [55]:
len(m)

14

In [56]:
m[0:10]

b'Hello LLXX'

In [57]:
m.close()

The mmap object returned by mmap() can also be used as a context manager, in which
case the mapping is closed automatically when the block exits. For example:

In [58]:
with memory_map('data/mapping.txt') as m:
    print(len(m))
    print(m[0:10])

14
b'Hello LLXX'


By default, the memory_map() function shown opens a file for both reading and writing.
Any modifications made to the data are copied back to the original file. If read-only
access is needed instead, supply mmap.ACCESS_READ for the access argument. For
example:
- `m = memory_map(filename, mmap.ACCESS_READ)`

If you intend to modify the data locally, but don’t want those changes written back to
the original file, use mmap.ACCESS_COPY:
- `m = memory_map(filename, mmap.ACCESS_COPY)`

Using mmap to map files into memory can be an efficient and elegant means for randomly
accessing the contents of a file. For example, instead of opening a file and performing
various combinations of seek(), read(), and write() calls, you can simply map the
file and access the data using slicing operations.

It should be emphasized that memory mapping a file does not cause the entire file to be
read into memory. That is, it’s not copied into some kind of memory buffer or array.
Instead, the operating system merely reserves a section of virtual memory for the file
contents. As you access different regions, those portions of the file will be read and
mapped into the memory region as needed. However, parts of the file that are never
accessed simply stay on disk. This all happens transparently, behind the scenes.

If more than one Python interpreter memory maps the same file, the resulting mmap
object can be used to exchange data between interpreters. That is, all interpreters can
read/write data simultaneously, and changes made to the data in one interpreter will
automatically appear in the others. Obviously, some extra care is required to synchronize
things, but this kind of approach is sometimes used as an alternative to transmitting
data in messages over pipes or sockets.
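
The sharing behavior can be sketched within a single process by mapping the same file twice; the two mmap objects below stand in for two separate interpreters. The file name and variable names are placeholders, and no synchronization is shown.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "shared.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * 16)

    # Two independent mappings of the same file
    with open(path, "r+b") as f1, open(path, "r+b") as f2:
        with mmap.mmap(f1.fileno(), 0) as writer, \
             mmap.mmap(f2.fileno(), 0) as reader:
            # A change made through one mapping ...
            writer[0:5] = b"hello"
            # ... is visible through the other one
            seen_by_reader = reader[0:5]

print(seen_by_reader)  # b'hello'
```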

As shown, this recipe has been written to be as general purpose as possible, working on
both Unix and Windows. Be aware that there are some platform differences concerning
the use of the mmap() call hidden behind the scenes.

Another trick: you can alleviate the possible memory problems by using mmap.mmap() to create a "string-like" object that uses the underlying file (instead of reading the whole file in memory):

In [59]:
import mmap

with memory_map('data/mapping.txt') as m:
    if m.find(b'wXX') != -1:
        print('true')

#### Regular Expressions

Since a memory mapped file can act like a string, it can be used with other modules that operate on strings, such as regular expressions. This example finds all of the sentences with “nulla” in them.

In [60]:
import mmap
import re

pattern = re.compile(rb'(\.\W+)?([^.]?nulla[^.]*?\.)', re.DOTALL | re.IGNORECASE | re.MULTILINE)

with open('data/lorem.txt', 'r') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for match in pattern.findall(m):
            print(match[1].replace(b'\n', b' '))
            #print(match)

b'Nulla facilisi.'
b'Nulla feugiat augue eleifend nulla.'


Because the pattern includes two groups, the return value from findall() is a sequence of tuples. The print statement pulls out the matching sentence and replaces newlines with spaces so each result prints on a single line.
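
When the position of each match matters more than its text, re.finditer() can be applied to the mapped file in the same way. The sketch below uses a small file created on the spot, so the file name and its contents are assumptions for this example only.

```python
import mmap
import os
import re
import tempfile

text = b"alpha beta.\nGamma delta.\nbeta again.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words.txt")
    with open(path, "wb") as f:
        f.write(text)

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The pattern must be a bytes pattern because the map holds bytes
            offsets = [match.start() for match in re.finditer(rb"beta", m)]

print(offsets)  # [6, 25]
```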