<img src="https://github.com/christopherhuntley/BUAN5405-docs/blob/master/Slides/img/Dolan.png?raw=true" width="180px" align="right">

# Lesson 7: Files
_A sequential, persistent data structure_

# Learning Objectives

## Theory / Be able to explain ...
- aaa

## Skills / Know how to  ...
- aaa

**What follows is adapted from Chapter 7 of the _Python For Everybody_ book. If you have not read it, then please do so before continuing on.**

## Transience, Persistence, and The Cloud
Data within our programs is inherently **transient**, remembered only long enough to be useful. If not referenced by a variable, data is "garbage collected" -- yes, that is the real term -- and forgotten forever. Then, once the program is over, even the variables are forgotten, making way for whatever data the next program needs to do its work. Sometimes that end may be intentional, with the program completing its work, but it can also happen suddenly and without warning when the computer loses power or the system crashes. 

Data is said to be **persistent** if it is recallable after the program that created it ends. Such data exists in **files** located on a secondary storage device like a hard disk or SSD. Files are both useful and somewhat frustrating to deal with. Formats change over time, making some data no longer accessible by modern software. Files get lost as they are transferred from device to device or perhaps we just forget where they are or even that they exist at all. Finally, even if we know where they are and how to find them, getting our software to find and open them is sometimes pretty tricky. 

**The Cloud** is a much-hyped but nonetheless very useful solution to our storage needs. It makes data appear ubiquitous, available from any device at any time. Since ultimately all persistent data resides in files, most of the advantages of cloud storage are in the way data is accessed instead of how it is stored. Instead of thinking in terms of files (e.g., MS Word files), we now think in terms of _documents_ (Google Docs, MS Office 365, etc.), page _URLs_ (uniform resource locators), or _APIs_ (application programming interfaces) for data in the Cloud.

We will get into some of the complexities of cloud-hosted data in Lesson 11, but for now we will learn about **files** and **filesystems**.

## Filesystems and File Paths
In the wild west times of the 1950s and 1960s, every program had its own **filesystem**, a method of storing and retrieving data in files. Standard file formats and operating systems were scarce and software engineers who valued them were scarcer. Every brand of computer and every piece of software had its own. So, the programmers made up their own standards so that their programs and data could be portable from one computer to another.

In principle, every filesystem has the following functions:
- File creation and deletion
- File retrieval and access control
- Reading and writing of data 

File creation, deletion, retrieval, and access control are generally handled by the computer's operating system. In each case the concept of a file as an entity, with a unique identity or location, comes into play. The name and location of a file is called the **file path**. Operating systems logically organize files into hierarchies (trees) of directories (or folders).
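As a rough illustration, here is how those functions map onto calls in Python's standard library. The file name is invented and lives in a temporary directory; the operating system does the real work behind each call.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')  # made-up file name

# File creation: opening in 'w' mode asks the OS to create the file
with open(path, 'w') as f:
    f.write('some data\n')

created = os.path.exists(path)       # retrieval: can the OS find the file?
readable = os.access(path, os.R_OK)  # access control: are we allowed to read it?

# File deletion
os.remove(path)
deleted = not os.path.exists(path)

print(created, readable, deleted)
```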

![File Tree](https://github.com/christopherhuntley/BUAN5405-lessons/raw/master/img/L7_File_Tree_cropped.png)

A file path is then a set of instructions for navigating between where the program _resides_ and where the file _resides_ in the hierarchy. Note that we need to know two locations: that of the program (e.g., `myPythonProgram.py`) and that of the file it wants to access (e.g., `data1.txt`, `data2.txt`, `data3.txt`, or `data4.txt`).

There are two ways to encode a file path:
- **Absolute paths** are used when the program is located _outside_ of the file system. In that case it has to navigate from the **root** of the file hierarchy every time. The path to the `data1.txt` file is then given by `/myFiles/allProjects/myProject/data1.txt`. Directories in the path are separated by slashes `/`, and the first slash represents the root of the file hierarchy. On Unix-like systems (including macOS and Linux), an absolute path always starts with a slash.

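- **Relative paths** start at the current location in the file system, which here is the directory that holds the program. (Strictly speaking, Python starts from the _current working directory_, which is usually, but not always, the directory holding the program.) The navigation in this case can be much simpler, especially if the file is located in the same directory as the program. For the program `myPythonProgram.py` we can use the following paths: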
  - `data1.txt` 
  - `../MyData/data2.txt`
  - `../MyData/data3.txt`
  - `../../otherFiles/ExtraData/data4.txt` 
  
  If the file is located in the same directory as the program (the first example), then all we need to supply is the file name. If it is not (the rest), then we use a combination of `..` (go up one directory) and `/directory` (go down into `directory`) instructions followed by the file name. 
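To check how relative paths like these resolve, one option is to join each one onto the program's directory and normalize the result. The sketch below assumes the program lives in `/myFiles/allProjects/myProject` (the directory that holds `data1.txt` in the absolute path example) and uses `posixpath`, the slash-based flavor of `os.path`, so it behaves the same on any operating system. The helper name `resolve` is made up for illustration.

```python
import posixpath  # the Unix-style (forward slash) version of os.path

# Assumed location of myPythonProgram.py, taken from the data1.txt example
program_dir = '/myFiles/allProjects/myProject'

def resolve(relative_path, start=program_dir):
    """Turn a relative path into an absolute one, starting from `start`."""
    # join glues the pieces together; normpath collapses the `..` steps
    return posixpath.normpath(posixpath.join(start, relative_path))

for rel in ['data1.txt',
            '../MyData/data2.txt',
            '../../otherFiles/ExtraData/data4.txt']:
    print(rel, '->', resolve(rel))
```

Each `..` removes one directory from the end of the starting location before the remaining steps go back down the tree.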
  
### Wait, I thought that `/` was pronounced "backslash"
In the early days of PCs, Microsoft and IBM were very wary of copying Unix conventions in their (sort of) new, definitely horrific operating system called PC-DOS. AT&T claimed ownership of Unix at the time, so this worry was not without merit. So, instead of using the slash `/` they used the backslash `\` as the directory indicator in file paths. As a result, anybody who was raised using PCs in the 1980s or perhaps early 1990s reflexively uses the word "backslash" when saying things like web URLs. For the record, web URLs (or file paths in Python) have never used backslashes in this way. Unfortunately, MS Windows (all versions) still uses the backslash for compatibility with PC-DOS.

### The Python `os` and `os.path` Modules
To work with filepaths in an operating system-agnostic (_canonical_) way, Python provides the [os](https://docs.python.org/3/library/os.html) module. More specifically, we can use the [os.path](https://docs.python.org/3/library/os.path.html) submodule to create canonical paths and then (if needed) use the `os` module to render the path in an operating-system specific way. 

```python
import os, os.path  # Load modules from standard lib

# the absolute path from a relative path
print(os.path.abspath('.'))  # the current folder is indicated by .

# the relative path from the current folder to another location
print(os.path.relpath('/Users/chuntley/GitRepos/BUAN5405/BUAN5405-docs'))
```

Output (on the author's machine):

```
/Users/chuntley/GitRepos/BUAN5405/BUAN5405-lessons
../BUAN5405-docs
```


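If on MS Windows, we can then call `os.path.normpath()` to convert all those nasty slashes into 1980s-style backslashes. (The similarly named `os.fspath()` only returns the string form of a path object; it does not change the slashes.)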
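The slash conversion can be tried on any machine, because the standard library ships both flavors of `os.path`: `ntpath` (Windows rules) and `posixpath` (Unix, macOS, and Linux rules). A minimal sketch, using a made-up relative path:

```python
import ntpath      # os.path as it behaves on MS Windows
import posixpath   # os.path as it behaves on Unix, macOS, and Linux

path = 'myFiles/allProjects/myProject/data1.txt'  # illustrative path only

windows_style = ntpath.normpath(path)    # forward slashes become backslashes
unix_style = posixpath.normpath(path)    # forward slashes are left alone

print(windows_style)
print(unix_style)
```

In everyday code we would just call `os.path.normpath()`, which picks the right flavor for whatever system the program is running on.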

### Filepaths on Google Colab
In Google Colab, notebooks [operate in a separate space from the file system](https://colab.research.google.com/notebooks/io.ipynb). **All paths then become absolute**, even when the notebook appears to be in the same directory as its data in Google Drive. There are workarounds, of course, each of which relies on the `os` and `os.path` modules.

## Reading and writing files the easy way with `with`
The Py4E textbook goes into some detail about _file handles_ and other terminology. We recommend referring to it when you have to open files. However, in data science we almost always use a `with` statement like this when reading data from a file:
```python
with open(filepath) as f:  # filepath is a string such as 'data1.txt'
    for line in f:
        # do something with the line of text, e.g., print it
        print(line.rstrip())
```

- The `open()` function does exactly what you think it does. It takes in a `filepath` and prepares the file for reading. More specifically, it returns a file object that works like a _generator_ (see Lesson 6), yielding data one line at a time.
- The file itself is _aliased_ (nicknamed) as `f` for use in the `with` body.
- The `for` loop iterates over the lines in the file.
- When the end of the file is reached (or an error occurs) the `with` statement closes the file for safekeeping.  
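To see those steps in action without relying on any particular file being present, the sketch below first creates a small throwaway file in a temporary directory (the file name and its contents are invented for the example), then reads it back line by line.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
filepath = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name
with open(filepath, 'w') as f:
    f.write('first line\nsecond line\nthird line\n')

# Read it back one line at a time
lines = []
with open(filepath) as f:
    for line in f:
        lines.append(line.rstrip('\n'))  # drop the trailing newline

print(lines)
print(f.closed)  # the with statement has already closed the file
```

Note that each line arrives with its newline character still attached, which is why the loop strips it off.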

A similar process exists for writing to a file. We can also read and write binary data (as `bytes`) if we like.   
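As a rough sketch of what that looks like, `open()` takes a second argument, the mode: `'w'` to write text, `'wb'` to write bytes, and `'rb'` to read bytes. The file names below are invented and live in a temporary directory so nothing important gets overwritten.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()

# Writing text: mode 'w' creates the file (or overwrites an existing one)
text_path = os.path.join(tmp_dir, 'notes.txt')
with open(text_path, 'w') as f:
    f.write('hello\n')
    f.write('world\n')

with open(text_path) as f:
    text = f.read()       # read the whole file into one string

# Writing and reading binary data: add 'b' to the mode
bin_path = os.path.join(tmp_dir, 'data.bin')
with open(bin_path, 'wb') as f:
    f.write(bytes([0, 1, 2, 255]))

with open(bin_path, 'rb') as f:
    raw = f.read()        # a bytes object, not a str

print(repr(text))
print(raw)
```

Be careful with `'w'`: it silently replaces whatever the file held before. Mode `'a'` appends to the end instead.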

### Note: Files are Sequential
Since we access file data using an iterator, files are inherently sequential, just like strings, lists, and tuples. 
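One consequence worth seeing once: unlike a list, an open file remembers how far it has been read, so a second loop over the same file finds nothing left. The sketch below uses an invented temporary file; `seek(0)` is the file method that rewinds to the beginning.

```python
import os
import tempfile

filepath = os.path.join(tempfile.mkdtemp(), 'letters.txt')  # made-up file
with open(filepath, 'w') as f:
    f.write('a\nb\nc\n')

with open(filepath) as f:
    first = next(f)                 # reads the first line only
    rest = [line for line in f]     # continues where next() left off
    again = [line for line in f]    # nothing left: the file is used up
    f.seek(0)                       # rewind to the start of the file
    rewound = [line for line in f]  # all three lines are available again

print(first, rest, again, rewound)
```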

## Pro Tips
### Looking Ahead to pandas
At the end of this course we will introduce the (third-party) pandas library for working with datasets in a structured way. pandas includes a very handy I/O module for reading and writing data in various formats and filesystems. When possible, we highly recommend using pandas instead of writing your own file system code. Leave the old ways to the old-timers and systems programmers. pandas is much better for our needs.

## Exercises