<img src="images/MIK.png" style="width:375px;height:200px;">

## <center> MIK - Python for beginners: Files</center>
### <center>by Ivaldo Tributino and Marcos Machado</center>

## Introduction

Programs need input to process and somewhere to store their output, and for that we use `files`. You can picture files as storage compartments on your computer that are managed by your operating system (OS). Variables also give us a way to store data, but only while our program runs.

We will primarily focus on opening, reading and writing text files such as those we create in a text editor. Later we will see how to work with database files, which are binary files specifically designed to be read and written through database software.

## Opening files

When we want to `read` or `write` a file (say on your hard drive), we first must `open` the file. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file, you are asking the operating system to find the file by name and make sure the file exists. In this example, we open the file `mbox.txt`, which should be stored in the same folder as this notebook.

In [None]:
fhand = open('mbox.txt')
print(fhand) 

If the open is successful, the operating system returns us a `file handle`. The `file handle` is not the actual data contained in the file, but instead it is a “handle” that we can use to read the data. You are given a handle if the requested file exists and you have the proper permissions to read the file.
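
If the file is missing, `open` raises a `FileNotFoundError` rather than returning a handle. A minimal sketch of catching that error, assuming a hypothetical file name `no-such-file.txt` that does not exist in the notebook folder:

```python
# 'no-such-file.txt' is a hypothetical name used only for illustration;
# it is assumed NOT to exist in the current folder.
fname = 'no-such-file.txt'
try:
    fhand = open(fname)
    opened = True
except FileNotFoundError:
    # open() could not find the file, so no handle was created
    opened = False
    print('File cannot be opened:', fname)
```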

The handle printed above is a **TextIOWrapper** object. It provides the methods and attributes we use to read data from and write data to files.
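
A few of those attributes can be inspected directly. The sketch below is only an illustration: `demo_handle.txt` is a hypothetical file that the example creates itself so that it can run on its own.

```python
# 'demo_handle.txt' is a hypothetical file, created here only so the
# example is self-contained.
with open('demo_handle.txt', 'w') as f:
    f.write('hello\n')

fhand = open('demo_handle.txt')
print(type(fhand).__name__)  # TextIOWrapper
print(fhand.name)            # demo_handle.txt
print(fhand.mode)            # r  (read mode is the default)
print(fhand.closed)          # False
fhand.close()
print(fhand.closed)          # True
```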

## Text files and lines

In this notebook, we will work with two files: the complete file of mail interactions, named `mbox.txt`, and a shortened version of it, named `mbox-short.txt`.

These files are in a standard format for a file containing multiple mail messages. The lines which start with “From ” (followed by a space) separate the messages, and the lines which start with “From:” are part of the messages. For more information about the mbox format, see https://en.wikipedia.org/wiki/Mbox.

To break the file into lines, there is a special character that represents the “end of the line” called the `newline` character. In Python, we represent the `newline` character as `\n`. Even though this looks like two characters, it is actually a single character. 

<img src="images/newline.png" style="width:100px;height:90px;">

In [None]:
learn = 'MIK Tutors'
newline_learn = 'MIK \nTutors'
print(learn)
print(newline_learn)

In [None]:
len(newline_learn)-len(learn) # newline character is a single character.

So when we look at the `lines` in a file, we need to imagine that there is a special invisible character called the `newline` at the end of each line that marks the end of the line.
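
One way to make that invisible character visible is `repr`, which shows the string with its escape sequences. A short sketch using a hard-coded string (not one of the mbox files):

```python
text = 'MIK\nTutors\n'

# repr() displays the newline characters instead of acting on them
print(repr(text))  # 'MIK\nTutors\n'

# splitlines(keepends=True) splits into lines and keeps each trailing newline
lines = text.splitlines(keepends=True)
print(lines)  # ['MIK\n', 'Tutors\n']
```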

## Reading files

While the `file handle` does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:

In [None]:
# Count the lines in the file and add up their lengths to measure the file's size
count = 0
length = 0
fhand = open('mbox-short.txt') 
for line in fhand:
    count = count + 1
    length += len(line)
print('Line Count:', count)
print('Length:', length)

Because the for loop reads the data one line `at a time`, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.

If you know the file is relatively small compared to the size of your main memory, you can `read` the whole file into one string using the `read method` on the file handle.

In [None]:
# In this case, we are reading the entire file as a single string!
with open('mbox-short.txt') as fhand:
    fread = fhand.read()
print(len(fread))    

In [None]:
fread[:20]

In the example above, the entire contents (all 94,626 characters) of the file mbox-short.txt are read directly into the variable `fread`. We use string slicing to print out the first 20 characters of the string data stored in fread.

**Observation:** The `read` method should only be used this way if the file data will fit comfortably in the main memory of your computer. If the file is too large to fit in main memory, you should write your program to read the file in `chunks` using a for or while loop.
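
Reading in chunks can be sketched with the optional size argument of `read`, which returns at most that many characters and an empty string once the end of the file is reached. The file name `chunk_demo.txt` and the chunk size of 1000 are assumptions made for this example:

```python
# 'chunk_demo.txt' is a hypothetical file, created here only so the
# example can run on its own.
with open('chunk_demo.txt', 'w') as f:
    f.write('x' * 2500)

chunk_size = 1000  # assumed size; choose one that suits your memory
total = 0
chunks = 0
with open('chunk_demo.txt') as fhand:
    while True:
        chunk = fhand.read(chunk_size)  # at most chunk_size characters
        if chunk == '':                 # empty string means end of file
            break
        total += len(chunk)
        chunks += 1

print('Chunks:', chunks)
print('Characters:', total)
```

Only one chunk is held in memory at a time, so the same loop works for files much larger than main memory.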

## Searching through a file

We can combine the pattern for reading a file with `string methods` to build simple search mechanisms.

For example, if we wanted to read a file and only print out lines which start with the prefix “From:”, we could use the string method `startswith` to select only those lines with the desired prefix:

In [None]:
# Search for lines that start with "From:" and print the first five of them
count = 0
fhand = open('mbox-short.txt')
for line in fhand:
    if count < 5:
        if line.startswith('From:'): 
            print(line)
            count+=1

The output looks great since the only lines we are seeing are those which start with “From:”, **but why are we seeing the extra blank lines?** 

This is due to that invisible `newline` character. Each line read from the file already ends with a `newline`, and the `print` function adds another one, resulting in the double spacing effect we see.

```python
line = 'From: stephen.marquard@uct.ac.za\n'  # the line as read from the file
print(line)  # print adds a second newline, so a blank line follows
```

We could use string slicing to print all but the last character, but a simpler approach is to use the `rstrip()` method, which strips whitespace (including the `newline`) from the right side of a string, as follows:

In [None]:
# rstrip() strips whitespace from the right side of the string
'MIK     '.rstrip()

In [None]:
count = 0
fhand = open('mbox-short.txt')
for line in fhand:
    if count < 5:
        line = line.rstrip()
        if line.startswith('From:'): 
            print(line)
            count+=1        

As your file processing programs get more complicated, you may want to structure your search loops using `continue`. The basic idea of the search loop is that you are looking for “interesting” lines and effectively skipping “uninteresting” lines. Example: 

In [None]:
# Here we are searching for emails from this domain: @uct.ac.za !
count = 0
fhand = open('mbox-short.txt') 
for line in fhand:
    if count < 5:
        line = line.rstrip()
        if line.find('@uct.ac.za') == -1: #  returns -1 if the string was not found
            continue 
        count+=1    
        print(line)

## Writing files

To write a file, you have to open it with mode `'w'` as the second argument:

```python
fileobject = open(filename,  mode)
```
Mode `'w'` opens the file for `writing`. If the specified file doesn't exist, it is created. If the file already exists, its data is destroyed, so be careful!
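
The effect of reopening an existing file in this mode can be sketched as follows. The file name `mode_demo.txt` is hypothetical, and the `write` method used here is described in the paragraphs that follow.

```python
# 'mode_demo.txt' is a hypothetical file name used only for illustration.
with open('mode_demo.txt', 'w') as f:   # the file is created here
    f.write('first version\n')

with open('mode_demo.txt', 'w') as f:   # reopening with 'w' erases the old data
    f.write('second version\n')

with open('mode_demo.txt') as f:
    content = f.read()

print(repr(content))  # 'second version\n'
```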

In [None]:
# Here we are simply opening the file in write mode
ftest = open('text.txt', 'w')
print(ftest)

The `write method` of the file handle object allows us to insert data into the file, and it returns the number of characters written.

In [None]:
# Write this line to the same file; write() returns the number of characters written
line1 = "Course: Python with MIK,\n"
ftest.write(line1)

The `print` function automatically appends a newline, but the `write method` does not. So be sure to manage the ends of the lines by inserting the `newline` character when you want to end a line.
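
A small sketch of what happens when the `newline` is left out. The file name `newline_demo.txt` is an assumption made for this example:

```python
# 'newline_demo.txt' is a hypothetical file name used only for illustration.
with open('newline_demo.txt', 'w') as f:
    f.write('one')       # no newline, so the next write continues this line
    f.write('two\n')
    f.write('three\n')

with open('newline_demo.txt') as f:
    content = f.read()

print(repr(content))  # 'onetwo\nthree\n'
```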

In [None]:
# Write a second line, again ending it with a newline so the next entry starts on a new line
line2 = 'Class Meetings:   Tuesday 13:30PM - 17:30PM via Microsoft teams\n'
ftest.write(line2)

When you are `done` writing, you have to `close the file` to make sure that the last bit of data is physically written to the disk so it will not be lost if the power goes off.
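
An alternative is the `with` statement, used earlier in this notebook for reading: it closes the file automatically when the indented block ends, even if an error occurs inside it. A minimal sketch, assuming a hypothetical file name `with_demo.txt`:

```python
# 'with_demo.txt' is a hypothetical file name used only for illustration.
with open('with_demo.txt', 'w') as fout:
    fout.write('Course: Python with MIK\n')

# The block has ended, so the file is already closed
print(fout.closed)  # True

with open('with_demo.txt') as fin:
    saved = fin.read()
print(repr(saved))  # 'Course: Python with MIK\n'
```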

In [None]:
# This method closes the file!
ftest.close()

In [None]:
# Read the file back line by line; print adds its own newline, so the output is double-spaced
count = 0
ftest = open('text.txt') 
for line in ftest:
    print(line)
    count = count + 1
print('Line Count:', count)

In [None]:
with open('text.txt') as f:
    data = f.read()
    print(data)

#### References:

- Sweigart, A. Automate the Boring Stuff with Python: Practical Programming for Total Beginners.
- Severance, C. R. Python for Everybody: Exploring Data Using Python 3.

<img align="right" src="images/logo.png" style="width:50px;height:50px;">