# Session 3

In this session, we will learn how to load multiple files into a `DataFrame`.

## Import packages
For this session, we will only need `os` and `pandas`.

In [1]:
import os
import pandas as pd

## Input Folder
Set the path to the input folder, which is wherever you saved `\Py-R\data\session3\exercise1\`. Remember that `\` is an escape character in Python strings, so either prefix the string with `r` (a raw string) or double each backslash.

In [2]:
inpath = r"C:\Users\161289\Py-R\data\session3\exercise1"
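
As a small illustration of the two ways to handle backslashes, the sketch below writes the same path as a raw string and as a normal string with doubled backslashes. The path is a shortened, made-up stand-in, not the course folder.

```python
# Two equivalent ways to write a Windows path in Python.
# The path itself is a made-up stand-in for illustration.
raw_path = r"C:\data\session3\exercise1"        # raw string: backslashes are literal
escaped_path = "C:\\data\\session3\\exercise1"  # normal string: each backslash doubled

print(raw_path == escaped_path)  # True
```

Note that a raw string cannot end with a single backslash, which is one reason `inpath` is written without a trailing `\`.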

Next, we can check what is inside our folder using the `os` function `listdir()`.

In [3]:
os.listdir(inpath)

['ET101 1.csv', 'ET101 2.csv', 'ET101 3.csv', 'ET101 4.csv', 'ET101 5.csv']

If you look into our data folder, you'll see that this matches its contents.  
We'll want to reuse this list of files, so let's save it to a variable.

In [4]:
inpathfiles = os.listdir(inpath)
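
Two details of `os.listdir` are worth keeping in mind: it returns bare names rather than full paths, and the order of the names is not guaranteed. The sketch below uses a throwaway temporary folder with made-up file names (not the course data) and wraps the call in `sorted()` to get a predictable order.

```python
import os
import tempfile

# Build a throwaway folder with a few empty files (names are made up).
with tempfile.TemporaryDirectory() as folder:
    for name in ["b.csv", "a.csv", "c.csv"]:
        open(os.path.join(folder, name), "w").close()

    # os.listdir returns bare file names in arbitrary order; sorted() fixes the order.
    names = sorted(os.listdir(folder))

print(names)  # ['a.csv', 'b.csv', 'c.csv']
```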

We will iterate through this list using a `for` loop.

Next, let's create some metadata to insert. We'll have a "PM6" module and have the site be "Lehi".

In [None]:
module = "PM6"
site = "Lehi"

## Process Logic

The process we are going to go through here is as follows:
- Read each file into a `DataFrame`
- Combine all of those `DataFrames` together
- Add metadata
- Make a `.csv` file of the final desired data structure.

First, let's make an empty list. This is going to become a list of `DataFrame` structures. Let's call it `dfs` (short for `DataFrame`s):

In [None]:
dfs = []

Next, we want to iterate through the list of files that we have in our input directory.

In [None]:
for filename in inpathfiles:

You'll see a `SyntaxError` if you run this. That is expected: Python requires at least one indented statement after the loop header, so this loop won't run until we add a body.
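
If you want the bare loop to run while you build it up, `pass` can serve as a placeholder body. This is only a sketch; the list below is a short stand-in for the result of `os.listdir`.

```python
# Stand-in for the list returned by os.listdir
inpathfiles = ["ET101 1.csv", "ET101 2.csv"]

for filename in inpathfiles:
    pass  # placeholder body: does nothing, but makes the loop valid

print(filename)  # the loop variable keeps its last value: ET101 2.csv
```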

Inside the `for` loop, we work with the files one at a time. `os.listdir` gave us a list of file names, and we want to read each file into a `DataFrame` and add it to our list `dfs`.  
First, we need the full file path rather than just the file name in order to read the right file:

In [None]:
for filename in inpathfiles:
    filename = inpath + "\\" + filename
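
As an alternative to concatenating with `"\\"`, the standard library's `os.path.join` builds the path with the separator of whichever operating system the script runs on. A minimal sketch, using a hypothetical relative folder rather than the course path:

```python
import os

folder = os.path.join("data", "session3")       # hypothetical relative folder
filepath = os.path.join(folder, "ET101 1.csv")  # folder + separator + file name

print(filepath)
print(os.path.basename(filepath))  # ET101 1.csv
```

On Windows the separator is `\`, so the result matches what the string concatenation produces.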

Next, we want to read each file into a `DataFrame`:

In [None]:
for filename in inpathfiles:
    filename = inpath + "\\" + filename
    df = pd.read_csv(filename)

And finally, let's add this `DataFrame` to our list:

In [None]:
for filename in inpathfiles:
    filename = inpath + "\\" + filename
    df = pd.read_csv(filename)
    dfs.append(df)

Next, let's combine our list into one `DataFrame`. `pandas` has a function called `concat` that will combine `pandas` objects together.

In [None]:
result = pd.concat(dfs)
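
By default, `pd.concat` stacks the rows and keeps each `DataFrame`'s original index, so index values repeat in the result. The sketch below shows this with two tiny made-up frames, along with the `ignore_index=True` option that renumbers the rows.

```python
import pandas as pd

# Two tiny made-up DataFrames standing in for two files
df_a = pd.DataFrame({"value": [1, 2]})
df_b = pd.DataFrame({"value": [3, 4]})

stacked = pd.concat([df_a, df_b])                        # keeps each frame's own index
renumbered = pd.concat([df_a, df_b], ignore_index=True)  # fresh 0..n-1 index

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Because this session writes the file with `index=False`, the repeated index never reaches the output CSV.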

Next, insert the metadata:

In [None]:
result.insert(0,"Module",module)
result.insert(0,"Site",site)
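
The order of the two `insert` calls matters: each one places its column at position 0, pushing earlier columns to the right. A small sketch with made-up data shows the resulting column order.

```python
import pandas as pd

result = pd.DataFrame({"value": [1, 2]})  # made-up data

result.insert(0, "Module", "PM6")  # columns: Module, value
result.insert(0, "Site", "Lehi")   # columns: Site, Module, value

print(list(result.columns))  # ['Site', 'Module', 'value']
```

The single string passed as the value is repeated on every row.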

And finally, let's configure our output path and write the file out:

In [None]:
outpath = inpath + r"\output"

if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session3output1.csv"

result.to_csv(outfile,index=False)
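
The check-then-create pattern can also be written in one call, since `os.makedirs` accepts `exist_ok=True`. The sketch below runs in a throwaway temporary folder rather than the course directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    outpath = os.path.join(base, "output")
    os.makedirs(outpath, exist_ok=True)  # creates the folder
    os.makedirs(outpath, exist_ok=True)  # second call is a no-op, no error
    created = os.path.isdir(outpath)

print(created)  # True
```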

And that's all for Exercise 1! The final script should look like this:

In [None]:
# imports -- os and pandas
import os
import pandas as pd

# input folder
inpath = r"C:\Users\161289\Py-R\data\session3\exercise1"
inpathfiles = os.listdir(inpath)

# metadata -- PM6 module, Lehi site
module = "PM6"
site = "Lehi"

# empty list to collect DataFrames
dfs = []

# iterate through file list, read data, and add to list
for filename in inpathfiles:
    filename = inpath + "\\" + filename
    df = pd.read_csv(filename)
    dfs.append(df)

# combine list of DataFrames into one DataFrame
result = pd.concat(dfs)

# insert metadata
result.insert(0,"Module",module)
result.insert(0,"Site",site)

# output folder and file path
outpath = inpath + r"\output"
if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session3output1.csv"

# write DataFrame to file
result.to_csv(outfile,index=False)
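
One thing to watch for: the script creates the `output` folder inside `inpath`, so on a second run `os.listdir(inpath)` also returns `output`, and `pd.read_csv` raises an error when asked to read a folder. The sketch below shows one possible way to keep only regular files. It uses a temporary folder with made-up contents, and the filtering step is a suggestion rather than part of the course script.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as inpath:
    # Made-up contents: two data files plus an output folder from an earlier run
    for name in ["ET101 1.csv", "ET101 2.csv"]:
        open(os.path.join(inpath, name), "w").close()
    os.makedirs(os.path.join(inpath, "output"))

    # Keep only entries that are regular files; the folder is skipped
    inpathfiles = sorted(
        name for name in os.listdir(inpath)
        if os.path.isfile(os.path.join(inpath, name))
    )

print(inpathfiles)  # ['ET101 1.csv', 'ET101 2.csv']
```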

# Exercise 2 -- Practice

With that, let's try doing this from scratch again, using the data in the `exercise2` folder and creating the output file `session3output2.csv`.

In [None]:
# imports -- os and pandas
import os
import pandas as pd

# input folder and list of files
inpath = r"C:\Users\161289\Py-R\data\session3\exercise2"
inpathfiles = os.listdir(inpath)

# metadata -- PM6 module and Lehi site
module = "PM6"
site = "Lehi"

# empty list to collect DataFrames
dfs = []

# iterate through file list, read data, and add to list
for filename in inpathfiles:
    filename = inpath + "\\" + filename
    df = pd.read_csv(filename)
    dfs.append(df)
    
# combine list of DataFrames into one DataFrame
result = pd.concat(dfs)

# insert metadata
result.insert(0,"Module",module)
result.insert(0,"Site",site)

# output folder and file path
outpath = inpath + r"\output"
if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session3output2.csv"

# write DataFrame to file
result.to_csv(outfile,index=False)

# Exercise 3 -- Filtering
Before moving on, let's cover one more useful tool -- being able to filter which files you are interested in collecting.  
If you open your data folder for Session 3 Exercise 3, you'll find that it contains the logs from both of the previous two exercises. What if we need to do different things to these two kinds of files? How could we go about doing this?

We can use `if` statements to do so. We're going to copy Exercise 2 and use it as our base for Exercise 3. Let's say that you wanted to process only the TRD logs in this exercise. Where do you think we could place an `if` statement to select just those files?

In [None]:
# imports -- os and pandas
import os
import pandas as pd

# input folder and list of files
inpath = r"C:\Users\161289\Py-R\data\session3\exercise3"
inpathfiles = os.listdir(inpath)

# metadata -- PM6 module and Lehi site
module = "PM6"
site = "Lehi"

# empty list to collect DataFrames
dfs = []

# iterate through file list, read data, and add to list
for filename in inpathfiles:
    filename = inpath + "\\" + filename
    df = pd.read_csv(filename)
    dfs.append(df)
    
# combine list of DataFrames into one DataFrame
result = pd.concat(dfs)

# insert metadata
result.insert(0,"Module",module)
result.insert(0,"Site",site)

# output folder and file path
outpath = inpath + r"\output"
if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session3output3.csv"

# write DataFrame to file
result.to_csv(outfile,index=False)

We actually process the files in the `for` loop, so this is where I would place the `if` statement. The condition could be set up to look for `TMP01_TRD.CSV` to keep just the Exercise 2 files.

In [None]:
for filename in inpathfiles:
    if "TMP01_TRD.CSV" in filename:
    filename = inpath + "\\" + filename
    df = pd.read_csv(filename)
    dfs.append(df)

But we can't simply add that line. Python would raise an `IndentationError`, because an `if` needs an indented block beneath it. We have to indent the statements that follow so that they are only executed when the condition is met.

In [None]:
for filename in inpathfiles:
    if "TMP01_TRD.CSV" in filename:
        filename = inpath + "\\" + filename
        df = pd.read_csv(filename)
        dfs.append(df)

While we used this particular string for filtering, the condition can be adapted to whatever you need. Sometimes a shorter string works better: if some files contain `TMP02_TRD.CSV` and you want those as well, you could filter by `_TRD.CSV` instead.
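
Keep in mind that the `in` test is case-sensitive. The sketch below uses made-up file names to show a shorter filter string missing a lowercase extension, and one possible way around that by upper-casing the name before testing.

```python
# Made-up file names for illustration
names = ["A_TMP01_TRD.CSV", "B_TMP02_TRD.csv", "ET101 1.csv"]

# The `in` test is case-sensitive, so "_TRD.CSV" misses the lowercase ".csv" name
exact = [n for n in names if "_TRD.CSV" in n]

# Upper-casing the name first makes the test case-insensitive
relaxed = [n for n in names if "_TRD.CSV" in n.upper()]

print(exact)    # ['A_TMP01_TRD.CSV']
print(relaxed)  # ['A_TMP01_TRD.CSV', 'B_TMP02_TRD.csv']
```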

Strings can also be checked with `startswith()` or `endswith()`, so these can come in handy as well:

In [None]:
filename = "ET101.csv"
filename.endswith(".csv")

In [None]:
filename = "ET101.csv"
filename.startswith("ET101")
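
For reference, here is what these checks return, together with two details that are easy to miss: both methods are case-sensitive, and both accept a tuple of candidates. The file name is taken from the Exercise 1 listing.

```python
filename = "ET101 1.csv"

print(filename.startswith("ET101"))         # True
print(filename.endswith(".csv"))            # True
print(filename.endswith(".CSV"))            # False: the check is case-sensitive
print(filename.endswith((".csv", ".CSV")))  # True: a tuple matches any of its items
```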

So our final script looks like this:

In [None]:
# imports -- os and pandas
import os
import pandas as pd

# input folder and list of files
inpath = r"C:\Users\161289\Py-R\data\session3\exercise3"
inpathfiles = os.listdir(inpath)

# metadata -- PM6 module and Lehi site
module = "PM6"
site = "Lehi"

# empty list to collect DataFrames
dfs = []

# iterate through file list, read data, and add to list
for filename in inpathfiles:
    # check file name
    if "TMP01_TRD.CSV" in filename:
        filename = inpath + "\\" + filename
        df = pd.read_csv(filename)
        dfs.append(df)
    
# combine list of DataFrames into one DataFrame
result = pd.concat(dfs)

# insert metadata
result.insert(0,"Module",module)
result.insert(0,"Site",site)

# output folder and file path
outpath = inpath + r"\output"
if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session3output3.csv"

# write DataFrame to file
result.to_csv(outfile,index=False)

# Thought Exercise
What if we wanted to process both kinds of files in the same run, writing each kind to its own output file? What adjustments would we have to make?

In [None]:
# imports -- os and pandas
import os
import pandas as pd

# input folder and list of files
inpath = r"C:\Users\161289\Py-R\data\session3\exercise3"
inpathfiles = os.listdir(inpath)

# metadata -- PM6 module and Lehi site
module = "PM6"
site = "Lehi"

# empty list to collect DataFrames
dfs = []

# iterate through file list, read data, and add to list
for filename in inpathfiles:
    if "TMP01_TRD.CSV" in filename:
        filename = inpath + "\\" + filename
        df = pd.read_csv(filename)
        dfs.append(df)
    
# combine list of DataFrames into one DataFrame
result = pd.concat(dfs)

# insert metadata
result.insert(0,"Module",module)
result.insert(0,"Site",site)

# output folder and file path
outpath = inpath + r"\output"
if not os.path.exists(outpath):
    os.makedirs(outpath)
outfile = outpath + r"\session3output3.csv"

# write DataFrame to file
result.to_csv(outfile,index=False)

An outline and a solution are provided for this exercise in the course files, if you want to try it on your own and check your work.
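
If you would like a nudge before opening the solution, one possible starting point is to sort the file names into two lists inside the loop, one per output file. This is only a hint-level sketch, not the course solution. The file names and the `.csv` test for the second group are assumptions made for illustration.

```python
# Made-up file names standing in for os.listdir(inpath)
inpathfiles = ["ET101 1.csv", "X_TMP01_TRD.CSV", "ET101 2.csv", "output"]

trd_files = []    # names that would feed one output file
other_files = []  # names that would feed the other output file

for filename in inpathfiles:
    if "TMP01_TRD.CSV" in filename:
        trd_files.append(filename)
    elif filename.endswith(".csv"):
        other_files.append(filename)
    # anything else (such as the output folder) is ignored

print(trd_files)    # ['X_TMP01_TRD.CSV']
print(other_files)  # ['ET101 1.csv', 'ET101 2.csv']
```

From there, each list can go through the same read, combine, and write steps used earlier in the session.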

# Wrap-up
That's the end of this session! We'll take a break here, and feel free to try to work out that final thought exercise for extra credit!