# What are files and why should we care about them?

## Warmup: where are your files?

Between the cloud and smartphones, it can be especially confusing these days to know where your files actually live. This video (_run the cell below_) tries to demystify the topic a bit.

<div class="alert alert-info">Run the cell below to load the video</div>

In [None]:
%%html

<iframe width="560" height="315" src="https://www.youtube.com/embed/gDXmTJakpT8?si=qS5UYKU-vWUqQ-2p" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

<br/><br/><br/><br/><br/><br/><br/><br/><br/>

---

## Motivating use case: index data from a file for later search

I have a file called `mbox-email-receipts.txt` with a list of email receipts:
```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008
From zqian@umich.edu Fri Jan  4 16:10:39 2008
From rjlowe@iupui.edu Fri Jan  4 15:46:24 2008
From zqian@umich.edu Fri Jan  4 15:03:18 2008
From rjlowe@iupui.edu Fri Jan  4 14:50:18 2008
From cwen@iupui.edu Fri Jan  4 11:37:30 2008
[...]
```

I want to know how many emails I received from each email address.

I can use the indexing pattern to do this, plus a way to access the file system.

### Problem formulation

First, let's do the problem formulation together:

https://miro.com/app/board/uXjVPGeBLaY=/?share_link_id=250586896286

<!-- <img src="https://terpconnect.umd.edu/~gciampag/INST126/images/email-indexing-problem-formulation.png" height=800 width=1200></img> -->

Here is the same formulation, written as code comments:

In [None]:
# READ data from the file into list

# loop over the list
    # PARSE the record
    # UPDATE the index info for the record

### Writing the code

First, let's write a function that finds an email address in each record.

In [None]:
def filter_string(s, cue):
    # split the string into space-separated elements
    elements = s.split(" ")
    # go through the elements and find the first one that contains the cue
    for e in elements:
        # if the cue is in the element
        if cue in e:
            # return that element
            return e
    # no element contains the cue
    return None

In [None]:
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
target = filter_string(line, "@")
target

In [None]:
# READ data from the file into list
# where is the file?
fname = 'mbox-email-receipts.txt'

# open the file
f = open(fname, mode='r')

# read the file's contents into a list
records = f.readlines()

# close the file now that its contents are in memory
f.close()

# make the index
index = {}

# loop over the list
for r in records:
    # PARSE the record
    email_address = filter_string(r, '@')
    
    # UPDATE the index info for the record

    # GET the current value of the key (email)
    # default to zero if not found
    count = index.get(email_address, 0)
    # UPDATE the value
    count += 1
    # UPDATE the index with the key and updated value
    index.update({email_address: count})
    
index
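
To check the counting logic without touching the file system, the same get-then-update pattern can be tried on a short list of records. This is only a sketch: `sample_records` and `sample_index` are made-up names, and it assumes the address is always the second whitespace-separated element of a record.

```python
# hypothetical sample records, standing in for the lines of the file
sample_records = [
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008\n",
    "From rjlowe@iupui.edu Fri Jan  4 15:46:24 2008\n",
    "From zqian@umich.edu Fri Jan  4 15:03:18 2008\n",
]

sample_index = {}
for record in sample_records:
    # assumption: the address is the second element of each record
    address = record.split()[1]
    # GET the current count (0 if the address is new), add one, and store it
    sample_index[address] = sample_index.get(address, 0) + 1

print(sample_index)
```

With only three records, the counts are easy to verify by eye before running the loop on the whole file.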

<br/><br/><br/><br/><br/><br/><br/><br/><br/>

---

## Coding Challenge

### Task 1

Wrap the above code in a function called `build_index`. The function takes a single parameter -- a string with the name of the file -- and returns a dictionary with the index.

In [None]:
# Your code here

...

<br/><br/><br/><br/><br/><br/><br/><br/><br/>

---

# Fundamental concept: Files

## What are files?

Files provide a container for data that is **outside** of your program's main memory (variables and functions).

The PY4E textbook calls it _"secondary memory"_.

Secondary memory is essential, because main memory, which holds all the data you create while your Python program is running, goes away once the program stops.

Secondary memory is a place to keep data that is persistent, sort of like long-term memory in humans.

Files provide an interface (a __handle__) between your program and the operating system, which manages them.

<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/handle.png">

## Key operations on files

### Open a file

In [None]:
fname = '18-mbox-email-receipts.txt'
f = open(fname, mode='r')
f

The `open()` function, as you might suspect, opens a connection to the file.

Its parameters are:
1. The name of the file you want to connect to
2. A specification of **how** you want to connect (to read, to write, etc.). 

Its return value is a file connection object (in text mode, an `io.TextIOWrapper`).

| Character | Meaning |
|:-|:-|
| `'r'` | open for reading (default) |
| `'w'` | open for writing, truncating the file first (_DANGER!_) |
| `'x'` | create a new file and open it for writing (fails if the file already exists) |
| `'a'` | open for writing, appending to the end of the file if it exists |
| `'b'` | binary mode |
| `'t'` | text mode (default) |
| `'+'` | open a disk file for updating (reading and writing) |
| `'U'` | universal newline mode (deprecated; removed in Python 3.11) |
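
As a rough illustration of how some of these modes differ, the sketch below writes to a scratch file in a temporary folder. The file name `modes-demo.txt` and the variable names are made up for this example; nothing here touches the course files.

```python
import os
import tempfile

# hypothetical scratch file inside a fresh temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "modes-demo.txt")

# 'w' creates the file (or truncates it if it already exists)
f = open(demo_path, mode='w')
f.write("first line\n")
f.close()

# 'a' keeps the existing content and adds to the end
f = open(demo_path, mode='a')
f.write("second line\n")
f.close()

# 'x' refuses to overwrite: it fails because the file already exists
try:
    f = open(demo_path, mode='x')
    f.close()
    x_failed = False
except FileExistsError:
    x_failed = True

# 'r' (the default) reads the content back
f = open(demo_path, mode='r')
demo_contents = f.read()
f.close()

print(demo_contents)
print("'x' failed on the existing file:", x_failed)
```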

#### Specifying a connection type/permission

The second parameter gives the file connection some basic safety: for example, you can only write through a connection that was opened with write access.

Let's look at an example.

In [None]:
# basic read

# file name
fname = '18-mbox-email-receipts.txt'

# put 'r' as the second argument
f = open(fname, mode='r')

contents = f.read()

print(contents)

### Opening a file for writing

<div class="alert alert-warning"><strong>ATTENTION!</strong> Opening a file for writing will destroy any existing content!</div>

You can use **append** mode (`mode="a"`) to write to an existing file without removing its content.

In [None]:
# basic write

# file name
fname = '18-mbox-email-receipts_NEW.txt'

# put 'w' as the 2nd argument
f = open(fname, mode='w')

# file.write() is a method that writes a string to the file
f.write("Hello world!")

# always close a file when done working with it
f.close()

In [None]:
# common error: forgetting to specify write mode

# file name
fname = 'test.txt'

# ERROR! Forgot to specify 'w' as the second argument (by default 'r' is used)
f = open(fname) 

# file.write() is a method that writes a string to the file
f.write("Hello world from INST126 SP21 Week 11 at 9:30am!")

# always close a file when done working with it
f.close()

In [None]:
# append to a file

fname = 'test.txt'

# put 'a' as the 2nd argument
f = open(fname, mode='a')

# file.write() is a method that writes a string to the file
f.write("More stuff from INST126 SP21 Week 11 at 9:33am!")

# always close a file when done working with it
f.close()

### Reading the contents of a file

Very often you want to connect to a file because you want to *read* it. There are three ways to do this:
1. `.read()` reads in the whole contents of the file as a `string`;
2. `.readlines()` reads in the whole contents of the file as a `list` of strings;
3. __Iteration!__ Reads one line at a time.

In all cases, you end up with strings, which you can then parse to do what you want with them.

In [None]:
# 1) .read() 

# the path
fpath = '18-mbox-email-receipts.txt'

# open the file connection and store in the variable fhand
fhand = open(fpath, mode='r') 

# read the contents of the file, and dump into a string called content_s
content_s = fhand.read() 

content_s

In [None]:
# 2) .readlines()

# the path
fpath = '18-mbox-email-receipts.txt'

# open the file connection and store in the variable fhand
fhand = open(fpath, mode='r') 

# read the contents of the file, and dump into a list of strings called content_list
content_list = fhand.readlines() 

content_list

In [None]:
# 3) Iteration

# the path
fpath = '18-mbox-email-receipts.txt'

# open the file connection and store in the variable fhand
fhand = open(fpath, mode='r') 

# loop over the lines of the file, one line at a time, and print each one

for line in fhand:
    print(line)
    
line

What's happening here is that `print()` automatically adds a newline character (`\n`) to anything you pass to it. Combined with the original `\n` at the end of each line, this creates the empty space between the lines of text.
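
One way to avoid the doubled newlines is sketched below. It uses an in-memory `io.StringIO` object as a stand-in for a real file (an assumption made so the example runs anywhere), and the names `fake_file` and `cleaned` are made up for illustration.

```python
import io

# in-memory stand-in for a text file with two lines
fake_file = io.StringIO("first line\nsecond line\n")

cleaned = []
for line in fake_file:
    # option 1: tell print() not to add its own newline
    print(line, end='')
    # option 2: strip the newline that came from the file
    cleaned.append(line.rstrip('\n'))

print(cleaned)
```

Either option works; stripping the newline is usually the more convenient choice when the lines are going to be parsed further.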

In the next module we will learn how the `pandas` library connects to files to cover common parsing situations (e.g., I have a spreadsheet, I want to go straight into a `dataframe` for analysis).

More on that later! The concepts of accessing files will still apply.

### Writing to a file

Another common use case for connecting to files is to *write* to secondary memory. 

The main thing to know here is the `.write()` method.

Think of it as similar to the `print()` function, except it writes to the file instead of the screen.

In [None]:
fname = 'test2.txt'
f = open(fname, mode='w')
f.write("Hello INST126!") 

In [None]:
f.close()

`.write()` returns the number of __characters__ written (for files opened in text mode) or of __bytes__ written (for files opened in binary mode).
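
A quick way to see the return value is to store it in a variable. The sketch below writes to a scratch file in a temporary folder; the file name `count-demo.txt` and the variable names are made up for this example.

```python
import os
import tempfile

# hypothetical scratch file inside a fresh temporary directory
count_path = os.path.join(tempfile.mkdtemp(), "count-demo.txt")

f = open(count_path, mode='w')
# keep the value returned by .write()
n_chars = f.write("Hello INST126!")
f.close()

# the count matches the length of the string we wrote
print(n_chars, len("Hello INST126!"))
```

Note that, unlike `print()`, `.write()` does not add a newline at the end, so the count covers exactly the characters in the string.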

### Closing a file

Once I am done working with a file, it is always a good idea to close it.

Forgetting to close a file may result in lost or corrupt data.&ast;


--- 

&ast; You may have noticed that forgetting to close a file does not seem to have any consequence, but this is only because the implementation of the Python interpreter that we use (also called [CPython](https://en.wikipedia.org/wiki/CPython)) does a really good job of closing files on your behalf. This [blog post](https://realpython.com/why-close-file-python/) digs deeper into the consequences of forgetting to close files.
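
Python also offers the `with` statement, which closes the file automatically when the indented block ends, so there is no `.close()` call to forget. The sketch below is one possible way to use it; the scratch file name `with-demo.txt` and the variable names are made up for this example.

```python
import os
import tempfile

# hypothetical scratch file inside a fresh temporary directory
with_path = os.path.join(tempfile.mkdtemp(), "with-demo.txt")

# the file stays open only inside the indented block
with open(with_path, mode='w') as f:
    f.write("Hello world!")

# once the block ends, the file has been closed for us
print(f.closed)

# reading it back works the same way
with open(with_path, mode='r') as f:
    with_contents = f.read()

print(with_contents)
```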

## Common errors with files

### Can't find the file: FileNotFoundError

In [None]:
f = open("18-mbox-email-receipts.tx", mode='r')  # <----- TYPO in the filename!
print(f.read())
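
If a program should keep running when a file is missing, one option is to catch the error with `try`/`except`. This is only a sketch: the path below points to a made-up file name inside a fresh temporary folder, so it is guaranteed not to exist.

```python
import os
import tempfile

# hypothetical path to a file that does not exist
missing_path = os.path.join(tempfile.mkdtemp(), "no-such-file.txt")

try:
    f = open(missing_path, mode='r')
    found = True
    f.close()
except FileNotFoundError as err:
    # the error message includes the path Python looked for
    found = False
    print("Could not open the file:", err)

print(found)
```

When this error shows up, checking the spelling of the file name and the folder the notebook is running in (`os.getcwd()`) is usually a good first step.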

### Wrong connection type/permission: UnsupportedOperation

In [None]:
# I said I would write to it
# but I tried to read it
f = open("18-mbox-email-receipts_new.txt", mode='w')
print(f.read())

In [None]:
# I said I would read it
# but I tried to write to it
f = open("18-mbox-email-receipts.txt", mode='r')
print(f.write("Hello world"))
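
A file connection can report what it allows through its `.readable()` and `.writable()` methods, which can help when debugging this error. The sketch below uses a scratch file in a temporary folder; the file name `permission-demo.txt` and the variable names are made up for this example.

```python
import io
import os
import tempfile

# hypothetical scratch file inside a fresh temporary directory
perm_path = os.path.join(tempfile.mkdtemp(), "permission-demo.txt")

# open for writing only
f = open(perm_path, mode='w')

# ask the connection what it allows
can_read = f.readable()
can_write = f.writable()
print("readable:", can_read, "writable:", can_write)

# reading through a write-only connection raises UnsupportedOperation
try:
    f.read()
    read_failed = False
except io.UnsupportedOperation as err:
    read_failed = True
    print("Error:", err)

f.close()
```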

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>


---

# Solutions

## Task 1

In [None]:
### BEGIN SOLUTION
def build_index(fname):
    # READ data from the file into list

    # open the file
    f = open(fname, 'r')

    # read the file's contents into a list
    records = f.readlines()

    # close the file now that its contents are in memory
    f.close()

    # make the index
    index = {}

    # loop over the list
    for r in records:
        # PARSE the record
        email_address = filter_string(r, '@')

        # UPDATE the index info for the record

        # GET the current value of the key (email)
        # default to zero if not found
        count = index.get(email_address, 0)
        # UPDATE the value
        count += 1
        # UPDATE the index with the key and updated value
        index.update({email_address: count})

    return index
### END SOLUTION