# More About Reading Files

We've seen rudimentary, yet very effective, ways to read data from files. Here we want to explore how to read **multiple** files from a directory. There is a directory called `data` that contains several text files. We want to be able to read all of them in and then do some basic tasks.

There are multiple ways to accomplish these tasks. Two options include:
1. Use the `os` module, which allows us to traverse the directory structure.
2. Use the `pathlib` module, which is "higher level" and perhaps simpler to use.

We will use `pathlib`.
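To make the comparison concrete, here is a minimal sketch that lists the same directory both ways. It builds a throwaway directory with `tempfile` so it does not depend on the course files; the names `notes.txt` and `data` inside it are made up for illustration.

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway directory so the example does not depend on the course files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'notes.txt').write_text('hello')
    Path(tmp, 'data').mkdir()

    # os approach: names come back as plain strings, so paths are joined by hand
    os_names = sorted(os.listdir(tmp))
    os_dirs = sorted(n for n in os_names if os.path.isdir(os.path.join(tmp, n)))

    # pathlib approach: entries come back as Path objects with their own methods
    pl_names = sorted(p.name for p in Path(tmp).iterdir())
    pl_dirs = sorted(p.name for p in Path(tmp).iterdir() if p.is_dir())

print(os_names)
print(pl_names)
print(os_dirs, pl_dirs)
```

Both approaches find the same entries. The difference is mostly ergonomic: with `pathlib`, each entry already knows how to answer questions about itself.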

In [None]:
from pathlib import Path

In [None]:
# Create a Path object using the current working directory, `'.'`
path = Path('.')
path

## Seeing Files and Folders
We can see all the different files and folders (i.e., directories) by using `path.iterdir()`. It returns a generator object, which we can use in a `for` loop. Let's try it.

In [None]:
path.iterdir()

In [None]:
for f in path.iterdir():
    print(f'f is {f} and has type {type(f)}')

### File or Directory (Folder)?
From the above output, can you determine which of the items returned were files versus directories (i.e., folders)? One hint is that files have an extension on their names, such as `.txt`, `.html`, or `.ipynb`. Directories generally do not have filename extensions. We could check whether the name returned ends with a filename extension (i.e., something after a period), but that is only a heuristic: a file can lack an extension, and a directory name can contain a period.

However, there are two methods that we can use on a `Path` object:
1. `.is_dir()` returns `True` if the object is a directory, `False` otherwise.
2. `.is_file()` returns `True` if the object is a file, `False` otherwise.

Let's try using those two methods.

In [None]:
for f in path.iterdir():
    print(f'f is {f}:')
    print(f'\tf.is_dir() returns {f.is_dir()}')
    print(f'\tf.is_file() returns {f.is_file()}')
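As a small illustration of why the extension check is only a rough guide, the sketch below compares `.suffix` with `.is_file()` and `.is_dir()`. The entries `README`, `archive.v2`, and `report.txt` are hypothetical and are created in a temporary directory just for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'README').write_text('a file with no extension')
    (root / 'archive.v2').mkdir()  # a directory whose name contains a period
    (root / 'report.txt').write_text('an ordinary text file')

    # Record the extension and the real file/directory status of each entry
    rows = {
        p.name: (p.suffix, p.is_file(), p.is_dir())
        for p in root.iterdir()
    }

for name in sorted(rows):
    suffix, is_file, is_dir = rows[name]
    print(f'{name!r}: suffix={suffix!r}, is_file={is_file}, is_dir={is_dir}')
```

Here `README` is a file with an empty suffix and `archive.v2` is a directory with the suffix `.v2`, so guessing from the name alone would get both wrong.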

### Thought Exercise
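Notice that `.ipynb_checkpoints` is reported as a directory. The leading period may make it look like a hidden file, but it really is a hidden directory that Jupyter creates to store checkpoints. Can you think of a way to filter it out so that the listing shows only what you expect?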

----

## Looping Over Specific Files in Directory

Recall we can also use `path.glob()` to find all the files of a certain type in a folder, such as Excel files (matching `*.xlsx`). Let's look at all the files in the subfolder `data` and then try to read the ones that are text files ending with `.txt`.

When you start the `glob()` pattern with `**`, it searches the current directory and all of its subfolders (subdirectories) recursively, e.g., `path.glob('**/data/*.*')`. Be careful with this approach and make sure you know what the directory structure contains; otherwise, the recursion may take a long time. In our case, we know that the directory `data` only contains files, so we should be fine using it. We will do it both ways to see if there are any differences.

In [None]:
# loop over the results of glob to see all files in the subfolder data
for f in path.glob('**/data/*.*'):
    print(f)

In [None]:
# Get rid of the ** at the beginning to see if it matters
for f in path.glob('data/*.*'):
    print(f)
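Whether the two patterns give different results depends on the directory layout. The following sketch assumes a hypothetical layout with two folders named `data`, one at the top level and one nested under `project`, built in a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'data').mkdir()
    (root / 'data' / 'a.txt').write_text('top-level data file')
    (root / 'project' / 'data').mkdir(parents=True)
    (root / 'project' / 'data' / 'b.txt').write_text('nested data file')

    # '**' also descends into subfolders, so it finds every folder named data
    recursive = sorted(p.relative_to(root).as_posix()
                       for p in root.glob('**/data/*.*'))
    # Without '**', only the data folder directly under root is searched
    direct = sorted(p.relative_to(root).as_posix()
                    for p in root.glob('data/*.*'))

print(recursive)
print(direct)
```

With this layout the recursive pattern returns both files, while the plain pattern returns only `data/a.txt`. If there is just one `data` folder, directly under the starting directory, the two patterns return the same files.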

### Open and Read `.txt` Files
We want to look only at the `.txt` files in the folder `data`. Let's open each one, read its contents, and print them out.

In [None]:
# Use a for loop to iterate over the .txt files, one at a time
for f in path.glob('data/*.txt'):
    # print out file we are reading
    print(f'File we are trying to open and read: {f}')
    with open(f) as the_file:
        # Just print out the contents of each file after file name
        print(the_file.read())
        print("\n\n")

----

<font color='red' size = '5'> Student Exercise </font>

## Doing Something With Stuff You Read In

There is a file called `states.csv` in the subdirectory `./data`. There are 5 fields: State, Population, Electoral Votes, Highway Miles, and Square Miles. Each line represents one of the states in the United States.

### Task

Read in the file and create a `list` called `sums` which should contain the sum of the data elements in this order:

1. Sum of the states' population
2. Sum of the states' electoral votes
3. Sum of the states' highway miles
4. Sum of the states' square miles

Do this task two ways: one using "base" reading of the file and one using `pandas`.

In [None]:
# YOUR CODE HERE
# Step 1: open the file and read its contents into memory and close file

# Print out the contents to see what it looks like


In [None]:
# Step 2: HINT
# One approach could be to:
# Create a list that contains sublists
# where each sublist holds the 5 elements for a single state


In [None]:
# Step 3: HINT
# initialize an empty list
# loop over your 2-dimensional list created above adding up numerical "columns"


In [None]:
# Step 4: print out the list containing the total sums


### Alternative Approach - `pandas`

Now try using pandas to accomplish the same task.

In [None]:
# Use pandas because it's easy


-----

## Memory is Faster than Disk

I have mentioned in class before that, in general, we want to open a file, get the contents out, put them in memory (RAM), and then close the file as quickly as possible. We also want to **avoid** opening and reading the **same** file multiple times. Doing so is inefficient, especially in terms of time, because RAM is faster than "hitting" the disk.

Let's look at an example using the file `states.csv` above and see if we can discern this inefficiency.

The magic command `%%timeit` **must** be the very first line of the code cell. This command runs the contents of the code cell many times: it performs several runs, each consisting of a number of loops. It then prints the mean running time per loop along with the standard deviation across the runs. In both cases, we want those numbers to be **small**.
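Outside a notebook, the standard-library `timeit` module offers a comparable measurement. The sketch below is only an illustration under stated assumptions: it uses a hypothetical stand-in file `numbers.csv` with three numeric columns (not the course data), and the helper names `read_once` and `read_per_column` are made up for this example.

```python
import tempfile
import timeit
from pathlib import Path

# Hypothetical stand-in file: three numeric columns, 200 rows (not the course data).
tmp_dir = tempfile.TemporaryDirectory()
csv_path = Path(tmp_dir.name) / 'numbers.csv'
csv_path.write_text('\n'.join(f'{i},{i * 2},{i * 3}' for i in range(200)))

def read_once():
    """Read the file one time, then compute every column sum from memory."""
    with open(csv_path) as fh:
        rows = [line.split(',') for line in fh.read().splitlines()]
    return [sum(int(row[col]) for row in rows) for col in range(3)]

def read_per_column():
    """Reopen and reread the file for each column sum."""
    sums = []
    for col in range(3):
        with open(csv_path) as fh:
            rows = [line.split(',') for line in fh.read().splitlines()]
        sums.append(sum(int(row[col]) for row in rows))
    return sums

# Both strategies must produce the same answer
result_once = read_once()
result_per_column = read_per_column()

# repeat=5 separate runs, each calling the function number=200 times
once_times = timeit.repeat(read_once, repeat=5, number=200)
per_column_times = timeit.repeat(read_per_column, repeat=5, number=200)

print(f'read once:       best of 5 = {min(once_times):.4f} s per 200 calls')
print(f'read per column: best of 5 = {min(per_column_times):.4f} s per 200 calls')

tmp_dir.cleanup()
```

The exact timings vary by machine, and the operating system may cache a small file in memory, which narrows the gap. Note also that `read_per_column` repeats the parsing as well as the reading, so the measured difference reflects both costs.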

In [None]:
%%timeit
# 1. Open and read in file to get the population

    
# 2. Open and read in file to get the electoral votes

    
# 3. Open and read in file to get the highway miles

    
# 4. Open and read in file to get the square miles


### Original Code Timing

Copy and paste your code from the Student Exercise above in the cell below. 

*Note: The **very first line** in the code cell must be the magic command %%timeit* 

In [None]:
%%timeit
# Paste original code from exercise here


**&copy; 2021 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**