# More About Reading Files

We've seen rudimentary, yet very effective, ways to read data from files. Here we want to explore how to read **multiple** files from a directory. There is a directory called `data` that contains several text files. We want to be able to read all of those in and then do some basic tasks.

We will need to load the `os` module to allow us to traverse the directory structure.

In [None]:
import os

In [None]:
# Look at files in the current working directory
os.listdir()

So, we see `os.listdir()` returns a list of the names of **all** the entries (files and subfolders) in the directory passed in. In this case, we did not pass in a specific directory, so it looked in the same directory / folder where our Jupyter Notebook resides (i.e., the current working directory). Now, let's look in the `data` subdirectory. To do so, we pass in the string `./data` to tell it that we want to look in the subdirectory `data` of the current directory (the `.`).

In [None]:
# See what files are in the data subfolder
os.listdir("./data")

What happens if we try to access a directory that does not exist?

In [None]:
# Subfolder answers does not exist
os.listdir("./answers")
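
If you would rather handle a missing directory than let the cell stop with an error, one option is to catch the exception. The sketch below is only an illustration: the helper name `list_dir_safely` and the folder name `no_such_folder_for_demo` are made up for this example and are not part of the notebook's data.

```python
import os

def list_dir_safely(path):
    """Return the entries in `path`, or an empty list if `path` does not exist."""
    try:
        return os.listdir(path)
    except FileNotFoundError:
        # os.listdir() raises FileNotFoundError when the directory is missing
        print(f"Directory not found: {path}")
        return []

# Hypothetical folder name, assumed not to exist
entries = list_dir_safely("./no_such_folder_for_demo")
print(entries)
```

Returning an empty list keeps a later `for` loop over the result from failing, at the cost of hiding the problem, so printing a message (as here) is a reasonable compromise for small scripts.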

And what happens if we don't precede the subdirectory with `./`?

In [None]:
# Try just "data"
os.listdir("data")

What happened? Why? What if we try `/data`, without the period?

In [None]:
os.listdir("/data")

In general, it is better to explicitly specify that the subdirectory is relative to the current directory with the syntax `./data`, with the period. The `.` means the current working directory. You can use double periods, `..`, if you want to go up one folder. Let's try it.

In [None]:
os.listdir("..")

If you want to check which directory is your current working directory, you can call `os.getcwd()`.

In [None]:
os.getcwd()
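
To see how the different path spellings above are interpreted, it can help to convert each one to an absolute path. The sketch below uses `os.path.abspath()`, which only manipulates the path string and does not require the directory to exist, so it runs even if you have no `data` folder.

```python
import os

cwd = os.getcwd()
print("Current working directory:", cwd)

# Relative paths are resolved against the current working directory;
# a path that starts with "/" is resolved from the root of the file system.
for path in [".", "./data", "data", "..", "/data"]:
    print(f"{path!r:10} -> {os.path.abspath(path)}")
```

Notice that `./data` and `data` resolve to the same location, while `/data` points somewhere else entirely.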

There is also a function called `os.scandir()`, which returns an iterator of [`os.DirEntry`](https://docs.python.org/3/library/os.html#os.DirEntry) objects. Each of those objects contains other information about the entry that might be useful if you really get into a lot of I/O coding. Here, I just show it to let you know that it exists.

In [None]:
# Loop over the results of os.scandir()
for i in os.scandir():
    print(f"i={i}       i.name={i.name}")
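
As a small illustration of the extra information a `DirEntry` carries, the sketch below separates files from subfolders and records each file's size. It builds its own temporary directory so that it does not depend on what is in your working directory; the names `notes.txt` and `subfolder` are made up for the example.

```python
import os
import tempfile

# Build a throwaway directory so the example does not depend on ./data
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "notes.txt"), "w") as f:
        f.write("hello")
    os.mkdir(os.path.join(demo_dir, "subfolder"))

    file_sizes = {}  # file name -> size in bytes
    folders = []     # names of subdirectories

    # Using scandir() in a with statement closes the iterator when done
    with os.scandir(demo_dir) as entries:
        for entry in entries:
            if entry.is_file():
                file_sizes[entry.name] = entry.stat().st_size
            elif entry.is_dir():
                folders.append(entry.name)

print("Files and sizes:", file_sizes)
print("Subfolders:", folders)
```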

----

## Looping Over Files in Directory

Because `os.listdir()` gives us the names of all the files (and **subfolders**) as a list, we can easily iterate over that list and read each file. Let's try it.

In [None]:
# capture list in variable named dataFiles
dataFiles = os.listdir("./data")

# Use a for loop to iterate over the files, one at a time
for file in dataFiles:
    # print out file we are reading
    print("File we are trying to open and read:", file)
    with open("./data/"+file) as f:
        # Just print out the contents of each file after file name
        print(f.read())
        print("\n\n")
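
The loop above assumes every entry in `./data` is a file. If the directory also contains a subfolder, `open()` will fail on it. A minimal sketch of one way to guard against that is shown below, using `os.path.join()` to build each path and `os.path.isfile()` to skip anything that is not a regular file. The function name `read_text_files` and the demo file names are made up for this example, and the demo uses a temporary directory rather than the real `./data`.

```python
import os
import tempfile

def read_text_files(directory):
    """Return {file name: contents} for every regular file in `directory`."""
    contents = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subfolders and anything else that is not a file
        with open(path) as f:
            contents[name] = f.read()
    return contents

# Demo on a throwaway directory with two files and one subfolder
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in [("a.txt", "first"), ("b.txt", "second")]:
        with open(os.path.join(demo_dir, name), "w") as f:
            f.write(text)
    os.mkdir(os.path.join(demo_dir, "subfolder"))

    all_contents = read_text_files(demo_dir)

print(all_contents)
```

Collecting the contents in a dictionary, instead of printing them right away, keeps the data in memory so you can work with it after the files are closed.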

----

<font color='red' size = '5'> Student Exercise </font>

## Doing Something With Stuff You Read In

There is a file called `states.csv` in the subdirectory `./data`. There are 5 fields: State, Population, Electoral Votes, Highway Miles, and Square Miles. Each line represents one of the states in the United States.

### Task

Read in the file and create a list called `sums` which should contain the sum of the data elements in this order:

1. Sum of the states' population
2. Sum of the states' electoral votes
3. Sum of the states' highway miles
4. Sum of the states' square miles

In [None]:
# YOUR CODE HERE


-----

## Memory is Faster than Disk

I have mentioned in class before that, in general, we want to open a file, get the contents out, put them in memory (RAM), and then close the file as quickly as possible. We also want to **avoid** opening and reading the **same** file multiple times. Doing so is inefficient, especially in terms of time, because reading from RAM is faster than "hitting" the disk.

Let's look at an example using the file `states.csv` above and see if we can discern this inefficiency.

The magic command `%%timeit` **must** be the very first line of the code cell. This command runs the contents of the code cell many times: it performs several runs, and each run executes the cell in a loop a number of times that is chosen automatically. It then prints the mean running time per loop along with the standard deviation across the runs. We want both of those numbers to be **small**.

In [None]:
%%timeit
# 1. Open and read in file to get the population

    
# 2. Open and read in file to get the electoral votes

    
# 3. Open and read in file to get the highway miles

    
# 4. Open and read in file to get the square miles


### Original Code Timing

Copy and paste your code from the Student Exercise above in the cell below. 

*Note: The **very first line** in the code cell must be the magic command `%%timeit`.*

In [None]:
%%timeit
# Paste original code from exercise here
