# Files

* Up to now we haven't really been doing much with data, only what we type into the notebooks (short strings and numbers)
* In the real world we don't type our data into notebooks; we store it in files!
* Opening files is where Python becomes useful for processing larger amounts of data
* Let's start with a small text file that has only a few lines
    * Look at `poem.txt` in the JupyterLab file browser

## Opening files

* Use the `open(<filepath>, <mode>)` function to establish a *connection* to a file on disk
* A file can be opened in different modes. Indicate the mode with a short string
    * `r` - Read only
    * `w` - Write (overwriting existing contents)
    * `a` - Append to a file
    * `x` - Write a new file (fails if the file already exists)
    * `b` - Binary mode, for opening non-text files (combined with another mode, e.g. `rb`)
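* Python reads files as text by default. You can also specify the encoding with the `encoding` argument.
    * The default encoding depends on your platform and Python version (it is often `utf-8`), so passing `encoding="utf-8"` explicitly is the safer choice.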
* Once a file has been opened we can do operations on it like reading it into memory
* Python has a special syntax for safely opening and working with files

The `with open` syntax for safely opening files:

```python
with open(<filepath>, '<mode>', <optional encoding>) as <variable>:
    # do something
    # the file is open inside this block

# the file is closed outside this block
```
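
To see how the modes differ, here is a minimal sketch that writes to, appends to, and re-reads a small scratch file. The file name `modes_demo.txt` and the temporary directory are assumptions made for illustration; they are not part of the lesson files.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched
# (modes_demo.txt is a made-up name for this example)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "modes_demo.txt")

# 'w' creates the file, or overwrites it if it already exists
with open(demo_path, "w", encoding="utf-8") as file_handler:
    file_handler.write("first line\n")

# 'a' adds to the end of the existing contents
with open(demo_path, "a", encoding="utf-8") as file_handler:
    file_handler.write("second line\n")

# 'r' reads the contents back
with open(demo_path, "r", encoding="utf-8") as file_handler:
    demo_contents = file_handler.read()

# 'x' refuses to overwrite a file that already exists
try:
    with open(demo_path, "x", encoding="utf-8") as file_handler:
        file_handler.write("this is never written\n")
    x_mode_failed = False
except FileExistsError:
    x_mode_failed = True

print(demo_contents)
print("x mode failed:", x_mode_failed)
```

Because `w` silently replaces whatever was in the file, `x` is the more cautious option when you want to be sure you are not overwriting existing data.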

* The variable named after `as` (called `file_handler` in the example below) is a connection to the file, but it isn't the file contents itself
* We use the `read()` method to read the entire file into memory at once
    * Don't do this with large files! We will use other techniques to read their contents

In [None]:
with open("poem.txt", 'r') as file_handler: # the 'r' tells Python you are Reading the file
    # read the file content into a variable 
    file_contents = file_handler.read()

file_contents

* Note that `print()` displays `\n` as a line break, whereas the raw output from Jupyter shows it as the literal characters `\n`
* When working with files it is really important to understand the *newline* character
* A newline is represented in a string by `\n`
* This is useful for processing a text file line-by-line

In [None]:
# A string with a newline in it
print("Hello\nWorld!")

In [None]:
# display the contents of file_contents using print() instead of Jupyter Output
print(file_contents)

* It is useful to know that there are some minor differences in the display of output when you use the `print()` function vs. displaying something in the last line of a cell in Jupyter
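
As a rough sketch of that difference: for a string, what Jupyter shows for the last line of a cell matches what the built-in `repr()` returns, while `print()` writes the string itself. The variable names here are only for illustration.

```python
text = "Hello\nWorld!"

# For strings, Jupyter's cell output matches repr(),
# which keeps escape sequences such as \n visible
jupyter_style = repr(text)
print(jupyter_style)

# print() writes the string itself, so \n becomes a real line break
print(text)
```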

## Working with files line by line

* When you have files with data on each line, especially large files, you can loop over them 
* Just like iterating over lists, you can iterate over files
* Python reads the file up to and including the next `\n` and puts that text in the loop variable
* Useful for working with *extremely large* files because you only store one line in memory at a time

In [None]:
# open the file
with open("poem.txt", 'r') as file_handler:
    for line in file_handler:
        print(line)
        

* Can anyone guess why we get extra blank lines?
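
One likely answer: each `line` already ends with `\n`, and `print()` adds a newline of its own. Below is a minimal sketch of two common fixes. It uses a small stand-in file so that it runs anywhere; `sample_poem.txt` and its contents are hypothetical, not the lesson's `poem.txt`.

```python
import os
import tempfile

# Create a tiny stand-in file (hypothetical name and contents)
sample_path = os.path.join(tempfile.mkdtemp(), "sample_poem.txt")
with open(sample_path, "w", encoding="utf-8") as file_handler:
    file_handler.write("Roses are red\nViolets are blue\n")

stripped_lines = []
with open(sample_path, "r", encoding="utf-8") as file_handler:
    for line in file_handler:
        # Fix 1: tell print() not to add its own newline
        print(line, end="")
        # Fix 2: remove the trailing newline from the string itself
        stripped_lines.append(line.rstrip("\n"))

print(stripped_lines)
```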

### Reading Data Files

* A file handler is not the file; it is a pointer to the file
* This is how Python can work with HUGE files
* We can process large files line by line (assuming there are multiple lines)
* Each line gets treated as a separate string

* Let's count the lines of the file

In [None]:
with open('leaves-of-grass.txt', 'r') as file_handler:
    count = 0
    for line in file_handler:
        # add one for each line (shorthand for count = count + 1)
        count += 1

print(count)

## Reading in all the data

* Why don't we read every line of the file into memory as a list?

In [None]:
# create an empty list to store each line
data = [] 

# read the text file one line at a time
with open('leaves-of-grass.txt', 'r') as file_handler:    
    for line in file_handler:
        # use the list's append() method to add each line
        data.append(line)

print("Length:", len(data))
print("First 10 lines:", data[0:10])
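
The loop above can also be written more compactly. As a sketch, the `readlines()` method and the built-in `list()` both build the same list of lines, each still ending in `\n`. The stand-in file `sample_lines.txt` is hypothetical and is created only so the example runs on its own.

```python
import os
import tempfile

# Create a tiny stand-in file (hypothetical name and contents)
sample_path = os.path.join(tempfile.mkdtemp(), "sample_lines.txt")
with open(sample_path, "w", encoding="utf-8") as file_handler:
    file_handler.write("line one\nline two\nline three\n")

# readlines() returns every line of the file as a list of strings
with open(sample_path, "r", encoding="utf-8") as file_handler:
    lines_from_readlines = file_handler.readlines()

# list() iterates over the file handler and collects each line
with open(sample_path, "r", encoding="utf-8") as file_handler:
    lines_from_list = list(file_handler)

print("Length:", len(lines_from_readlines))
print("Same result:", lines_from_readlines == lines_from_list)
```

Like `read()`, both approaches load the whole file into memory, so the line-by-line loop remains the better fit for very large files.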
