In [1]:
import os, sys
import pandas as pd
import subprocess
from phonlab.utils import dir2df

# Post-processing and data collection

This notebook contains an article on two common data analysis tasks you might need for your work in the Phonlab: 1) post-processing a set of input files; 2) collecting measurements into a format suitable for data analysis.

1. [A sample post-processing workflow](#sample_workflow)
    1. [Mirror a directory structure](#mirror_directory)
    1. [Find rows in a source DataFrame that do not have a match in a second DataFrame](#find_non_matched_rows)
    1. [Perform a task for every row in a DataFrame](#task_per_df_row)
    1. [Check your work!](#postproc_check_work)
    1. [Summary of the post-processing workflow](#sample_workflow_summary)
1. [Collecting results for analysis](#collecting_results)
    1. [Summary of collecting results](#collecting_results_summary)


## <a name="sample_workflow"></a>A sample post-processing workflow

[Skip ahead to the summary](#sample_workflow_summary) for the steps in a compact refresher.

A typical step in your data analysis pipeline is to do post-processing on a dataset. This dataset may contain a number of audio recordings, for instance, and you wish to perform formant analysis on each of them, then collect the results into a DataFrame for statistical analysis. This section covers the post-processing step, that is, the formant analysis itself.

Sometimes you have a complete dataset already in hand and need to do post-processing only once. Other times you may have a dataset that grows incrementally, perhaps as new subjects are included. The steps you will see are modular and can be repeated (or skipped) so that incremental additions can be updated with reasonable efficiency. For very large datasets a more sophisticated workflow might be necessary.

In our example we will perform formant analysis on a set of .wav files found in the dataset. The directory that contains the .wav files is the source directory `srcdir`. We will write the formant measurements to a separate directory with the same internal structure as `srcdir`, and we will refer to this directory as the `cachedir` because it holds cached results obtained from the source data files.

The steps in the workflow are:

1. [Load filenames in `srcdir` and add analysis parameters](#load_filenames_and_parameters)
1. [Find filenames in `srcdir` that require post-processing](#find_non_matched_rows)
1. [Mirror the directory structure](#mirror_directory) (copy the internal structure of `srcdir` to `cachedir` in preparation for writing formant measurements)
1. [Perform a task for every row in a DataFrame](#task_per_df_row) (do formant analysis)

These steps are summarized in the tl;dr section [Summary of the post-processing workflow](#sample_workflow_summary).

### Load filenames in `srcdir` and add analysis parameters

In this section we load the filenames of the .wav files in `srcdir` and add analysis parameters to be used by `ifcformant`. First we define the locations of the source and cache directories.

In [2]:
srcdir = '../resource/postproc/orig_data'
cachedir = '../resource/postproc/cache'

***WARNING:*** `cachedir` must not be contained anywhere under `srcdir`! The mirroring technique described later in this article is very simple to implement but may produce unexpected results if `cachedir` is part of `srcdir`. It is okay if `cachedir` is a sibling of `srcdir`, as they are defined above, but avoid this:

```python
srcdir = '../resource/postproc/orig_data'

# This is not okay--cachedir is inside srcdir!
cachedir = '../resource/postproc/orig_data/cache'
```
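If you want to guard against this mistake programmatically, a small check can be run before any mirroring takes place. The following is only a sketch: the helper name `is_inside` is hypothetical and not part of `phonlab`, and it compares absolute paths without resolving symbolic links.

```python
import os

def is_inside(child, parent):
    '''Return True if `child` is the same as, or nested under, `parent`.'''
    child = os.path.abspath(child)
    parent = os.path.abspath(parent)
    # commonpath() compares whole path components, so 'orig_data2'
    # is not mistaken for something inside 'orig_data'.
    return os.path.commonpath([child, parent]) == parent

srcdir = '../resource/postproc/orig_data'
good_cachedir = '../resource/postproc/cache'            # sibling of srcdir
bad_cachedir = '../resource/postproc/orig_data/cache'   # nested in srcdir

print(is_inside(good_cachedir, srcdir))
print(is_inside(bad_cachedir, srcdir))
```

A line such as `assert not is_inside(cachedir, srcdir)` near the top of the notebook would stop execution before any directories are created in the wrong place.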

#### Load filenames from `srcdir`

The [`dir2df()` function](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb) makes it easy to load the set of filenames of the .wav files in `srcdir`. We use a [named capture](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb#adding_variables_named_capture) to also extract the subject identifier and [`addcols`](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb#adding_variables_addcols) to add the file's barename. The barename will make it easy to match .ifc files, as you will see shortly.

In [3]:
srcdf = dir2df(
    srcdir,
    dirpat=r'^(?P<subject>subj\d+)',
    fnpat=r'\.wav$',
    addcols=['barename']
)
srcdf

Unnamed: 0,relpath,fname,barename,subject
0,subj1/trial1,acq1.wav,acq1,subj1
1,subj1/trial1,acq2.wav,acq2,subj1
2,subj1/trial2,acq1.wav,acq1,subj1
3,subj1/trial2,acq2.wav,acq2,subj1
4,subj2/trial1,acq1.wav,acq1,subj2
5,subj2/trial1,acq2.wav,acq2,subj2
6,subj2/trial2,acq1.wav,acq1,subj2
7,subj2/trial2,acq2.wav,acq2,subj2


#### Add analysis parameters

The `ifcformant` command requires two parameters, 1) the name of the input .wav file; and 2) the speaker type ('female', 'male', or 'child'). The first parameter is already available in `srcdf`, and we need to add the second.

If you code speaker type into your filenames or directory structure, you can [extract speaker type](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb#adding_variables) when you call `dir2df()` and skip the rest of this section.

If you do not encode speaker type in your filename, you may load speaker type from an external file using one of [Pandas' Input/Output functions](https://pandas.pydata.org/pandas-docs/stable/api.html#input-output), most likely either [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) or [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html#pandas.read_excel). We'll use `read_csv()` to load metadata from a metadata file in `srcdir`. The 'speaker_metadata.csv' file contains two comma-separated columns labelled 'subject' and 'sex'.

In [4]:
md = pd.read_csv(os.path.join(srcdir, 'speaker_metadata.csv'))
md

Unnamed: 0,subject,sex
0,subj1,female
1,subj2,male


A left merge adds speaker type to `srcdf`. In order for this merge to work properly, the values of one of the columns in `srcdf` must match the values of one of the columns in `md`. Here the columns named 'subject' match and we merge `on` those columns. (If the column names don't match you can use `left_on` and `right_on` instead of `on`.)

In [5]:
srcdf = srcdf.merge(md, on='subject', how='left')
srcdf

Unnamed: 0,relpath,fname,barename,subject,sex
0,subj1/trial1,acq1.wav,acq1,subj1,female
1,subj1/trial1,acq2.wav,acq2,subj1,female
2,subj1/trial2,acq1.wav,acq1,subj1,female
3,subj1/trial2,acq2.wav,acq2,subj1,female
4,subj2/trial1,acq1.wav,acq1,subj2,male
5,subj2/trial1,acq2.wav,acq2,subj2,male
6,subj2/trial2,acq1.wav,acq1,subj2,male
7,subj2/trial2,acq2.wav,acq2,subj2,male


The left merge ensures that all of the rows in `srcdf` are in the merge output. If you find a NaN value in the output, it means that `md` is missing a subject and you should update your metadata document.
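The NaN check can be automated rather than done by eye. The sketch below uses small made-up DataFrames (`files` and `meta` are illustrative stand-ins for `srcdf` and `md`, and 'subj3' is an invented subject with no metadata) to show one way to list the subjects that are missing from the metadata.

```python
import pandas as pd

# Toy stand-ins for srcdf and md; subj3 has no metadata entry.
files = pd.DataFrame({
    'subject': ['subj1', 'subj2', 'subj3'],
    'fname': ['acq1.wav', 'acq1.wav', 'acq1.wav'],
})
meta = pd.DataFrame({
    'subject': ['subj1', 'subj2'],
    'sex': ['female', 'male'],
})

merged = files.merge(meta, on='subject', how='left')

# Rows with no match on the right have NaN in the 'sex' column.
missing = merged.loc[merged.sex.isnull(), 'subject'].unique()
if len(missing) > 0:
    print('Subjects missing from metadata:', list(missing))
```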

### <a name="find_non_matched_rows"></a>Find filenames in `srcdir` that require post-processing

The next step in the workflow is to find existing `ifcformant` output files in `cachedir` and use that result to find the rows in `srcdf` that do ***not*** have a corresponding output file. These are the files that require post-processing.

In general, if you have a source DataFrame and want to check whether a second DataFrame has a corresponding row, you can use a left merge with the source on the lefthand side. A left merge returns all of the merge keys from the lefthand DataFrame regardless of whether a matching key is found on the right. When there is no match, the columns from the right DataFrame are filled with NaN.

For this example the left DataFrame will be `srcdf`. The right DataFrame will be created from `cachedir`. We load the .ifc files in `cachedir` into `cachedf`. As you can see, the formant measurements for the first two acquisitions for subj1 have already been created and cached in .ifc files that correspond to the .wav files in the relative path 'subj1/trial1'.

In [6]:
cachedf = dir2df(cachedir, fnpat=r'\.ifc$', addcols=['barename'])
cachedf

Unnamed: 0,relpath,fname,barename
0,subj1/trial1,acq1.ifc,acq1
1,subj1/trial1,acq2.ifc,acq2


Our goal is to find each '.wav' file in `srcdf` that does not have a corresponding '.ifc' file in `cachedf`, and we do this in part by matching the barename values: each 'acq1.wav' should match an 'acq1.ifc'. Matching on barename alone is not enough, however, because the barenames are not unique across the dataset:

In [7]:
srcdf[srcdf.barename == 'acq1']   # barename 'acq1' appears four times

Unnamed: 0,relpath,fname,barename,subject,sex
0,subj1/trial1,acq1.wav,acq1,subj1,female
2,subj1/trial2,acq1.wav,acq1,subj1,female
4,subj2/trial1,acq1.wav,acq1,subj2,male
6,subj2/trial2,acq1.wav,acq1,subj2,male


To fully identify each .wav file in `srcdf` it is necessary to include 'relpath' as well as 'barename'. The combination of these two columns uniquely identifies each row in `srcdf`.

Another left merge, with `srcdf` as the left DataFrame and `cachedf` as the right, matches .wav input files with cached .ifc files, using `relpath` and `barename` together as a composite merge key. Since the `fname` column is in both `srcdf` and `cachedf` we provide suffixes to append to those column names in the merged DataFrame.

In [8]:
mrgdf = srcdf.merge(
    cachedf,
    on=['relpath', 'barename'],
    how='left',
    suffixes=['_wav', '_ifc']
)
mrgdf

Unnamed: 0,relpath,fname_wav,barename,subject,sex,fname_ifc
0,subj1/trial1,acq1.wav,acq1,subj1,female,acq1.ifc
1,subj1/trial1,acq2.wav,acq2,subj1,female,acq2.ifc
2,subj1/trial2,acq1.wav,acq1,subj1,female,
3,subj1/trial2,acq2.wav,acq2,subj1,female,
4,subj2/trial1,acq1.wav,acq1,subj2,male,
5,subj2/trial1,acq2.wav,acq2,subj2,male,
6,subj2/trial2,acq1.wav,acq1,subj2,male,
7,subj2/trial2,acq2.wav,acq2,subj2,male,


Select the rows of `mrgdf` that have a value of NaN in the `fname_ifc` column. These represent the .wav files that do not have cached .ifc measurements.

In [9]:
nocachedf = mrgdf[mrgdf.fname_ifc.isnull()]
nocachedf

Unnamed: 0,relpath,fname_wav,barename,subject,sex,fname_ifc
2,subj1/trial2,acq1.wav,acq1,subj1,female,
3,subj1/trial2,acq2.wav,acq2,subj1,female,
4,subj2/trial1,acq1.wav,acq1,subj2,male,
5,subj2/trial1,acq2.wav,acq2,subj2,male,
6,subj2/trial2,acq1.wav,acq1,subj2,male,
7,subj2/trial2,acq2.wav,acq2,subj2,male,
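An alternative to testing a right-hand column for NaN is the `indicator` parameter of `merge()`, which adds a `_merge` column that records where each row's key was found. The sketch below uses toy DataFrames (`src` and `cache` are illustrative stand-ins for `srcdf` and `cachedf`). This approach may be preferable when the right-hand DataFrame could legitimately contain NaN values of its own.

```python
import pandas as pd

# Toy stand-ins for srcdf and cachedf.
src = pd.DataFrame({
    'relpath': ['subj1/trial1', 'subj1/trial1', 'subj1/trial2'],
    'barename': ['acq1', 'acq2', 'acq1'],
})
cache = pd.DataFrame({
    'relpath': ['subj1/trial1'],
    'barename': ['acq1'],
})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'.
both = src.merge(cache, on=['relpath', 'barename'], how='left', indicator=True)

# Keep rows found only on the left, i.e. those with no cached result.
todo = both[both['_merge'] == 'left_only'].drop(columns='_merge')
print(todo)
```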


### <a name="mirror_directory"></a>Mirror the directory structure

We have already seen that the directory structure of `srcdir` organizes files into trial subdirectories nested inside subject directories. We must create the same directory structure in `cachedir` before running `ifcformant`. The `ifcformant` command will fail if a writeable output directory does not exist for the .ifc file.

The directory structure is provided by the `relpath` column, and its unique values in `nocachedf` provide all of the directory names that must exist prior to running `ifcformant`.

In [10]:
unique_relpath = nocachedf.relpath.unique()
unique_relpath

array(['subj1/trial2', 'subj2/trial1', 'subj2/trial2'], dtype=object)

The simple way to copy the structure is to loop over the unique set of relative paths and create them in `cachedir` using [os.makedirs()](https://docs.python.org/3/library/os.html#os.makedirs).

The following cell will copy the required directory structure to `cachedir` and print a success message if it succeeds. If there are any problems in creating a directory an error will be raised instead.

In [11]:
for destdir in unique_relpath:
    os.makedirs(
        os.path.join(cachedir, destdir),  # e.g. ../resource/postproc/cache/subj1/trial1
        exist_ok=True
    )
sys.stderr.write('Directory mirroring succeeded.')

Directory mirroring succeeded.

The `os.makedirs()` function automatically creates parent directories where necessary. For instance, the first relative path in the example above is `subj1/trial2`, and the first call to `os.makedirs()` is a request to create the directory `../resource/postproc/cache/subj1/trial2`. If `../resource/postproc/cache/subj1` does not exist already, then that directory will be created first.

The `exist_ok=True` argument means that `os.makedirs()` will not raise an error if the target directory already exists. This behavior is convenient for mirroring a `srcdir` incrementally. If you add `subj3/trial1` and `subj3/trial2` directories after running the above cell, then you can simply append them to `unique_relpath` and rerun the loop without raising an error for the existing directories under `subj1` and `subj2`.
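Both behaviors of `os.makedirs()` can be tried safely in a throwaway location. The sketch below works in a temporary directory created with `tempfile` (an assumption made for illustration, so that nothing is written under the real `cachedir`).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'cache', 'subj1', 'trial2')

    # Parent directories 'cache' and 'cache/subj1' are created as needed.
    os.makedirs(target, exist_ok=True)
    parent_created = os.path.isdir(os.path.join(tmp, 'cache', 'subj1'))

    # A second call is harmless because exist_ok=True.
    os.makedirs(target, exist_ok=True)

    # Without exist_ok, an existing directory raises FileExistsError.
    try:
        os.makedirs(target)
        raised = False
    except FileExistsError:
        raised = True

print(parent_created, raised)
```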

### <a name="task_per_df_row"></a>Perform a task for every row in a DataFrame

The `nocachedf` DataFrame contains the names of .wav files that require formant analysis and the speaker analysis parameters. In this section we'll construct a function for doing formant analysis that uses the rows of `nocachedf` as its inputs. As a reminder, `nocachedf` contains:

In [12]:
nocachedf

Unnamed: 0,relpath,fname_wav,barename,subject,sex,fname_ifc
2,subj1/trial2,acq1.wav,acq1,subj1,female,
3,subj1/trial2,acq2.wav,acq2,subj1,female,
4,subj2/trial1,acq1.wav,acq1,subj2,male,
5,subj2/trial1,acq2.wav,acq2,subj2,male,
6,subj2/trial2,acq1.wav,acq1,subj2,male,
7,subj2/trial2,acq2.wav,acq2,subj2,male,


#### Iterate over rows with `itertuples()`

The [`itertuples()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples) iterates over the rows of a DataFrame and returns each row as a [namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple).

***Aside:*** You might come across a similarly named method `iterrows()`, and it is recommended that you avoid it. The `iterrows()` method is less convenient than `itertuples()` because 1) it returns each row as a Series, which does not preserve the dtypes of the individual columns; and 2) it is generally slower to execute than `itertuples()`.

Here is a simple example that prints the values that `itertuples()` returns:

In [13]:
for row in nocachedf.itertuples():
    print(row)

Pandas(Index=2, relpath='subj1/trial2', fname_wav='acq1.wav', barename='acq1', subject='subj1', sex='female', fname_ifc=nan)
Pandas(Index=3, relpath='subj1/trial2', fname_wav='acq2.wav', barename='acq2', subject='subj1', sex='female', fname_ifc=nan)
Pandas(Index=4, relpath='subj2/trial1', fname_wav='acq1.wav', barename='acq1', subject='subj2', sex='male', fname_ifc=nan)
Pandas(Index=5, relpath='subj2/trial1', fname_wav='acq2.wav', barename='acq2', subject='subj2', sex='male', fname_ifc=nan)
Pandas(Index=6, relpath='subj2/trial2', fname_wav='acq1.wav', barename='acq1', subject='subj2', sex='male', fname_ifc=nan)
Pandas(Index=7, relpath='subj2/trial2', fname_wav='acq2.wav', barename='acq2', subject='subj2', sex='male', fname_ifc=nan)


Notice that the value of the row index is added as the first value and is named 'Index'. The DataFrame column labels are the other attributes and are easily accessed by name with attribute '.' notation.

In [14]:
for row in nocachedf.itertuples():
    print(row.fname_wav)

acq1.wav
acq2.wav
acq1.wav
acq2.wav
acq1.wav
acq2.wav
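Returning briefly to the comparison with `iterrows()`, one concrete difference is how the two methods treat rows that mix dtypes. The sketch below uses a made-up two-column DataFrame to show that `iterrows()` converts a row to a single-dtype Series, so an integer value comes back as a float, while `itertuples()` leaves the integer alone.

```python
import pandas as pd

# A toy DataFrame with one integer column and one float column.
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# iterrows() yields (index, Series) pairs; the Series has one dtype,
# so the integer in column 'n' is upcast to float.
_, series_row = next(df.iterrows())

# itertuples() yields namedtuples whose fields keep their own types.
tuple_row = next(df.itertuples())

print(type(series_row['n']))
print(type(tuple_row.n))
```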


It makes sense to put your task in a named function when the task you wish to perform is more complicated than a simple print statement. Doing so helps make your code easier to debug, and you can re-use the function in multiple places.

The `my_print` function prints the 'subject' attribute of a row, then uses `os.path.join` to construct the .ifc and .wav filepaths from the 'relpath', 'barename', and 'fname_wav' attributes and prints those as well. The `for` loop calls `my_print` on each row in turn.

In [15]:
# Function definition.
def my_print(rowtuple):
    '''Print values from a DataFrame row provided as a namedtuple.'''
    print(rowtuple.subject)
    print(os.path.join(rowtuple.relpath, rowtuple.barename + '.ifc'))
    print(os.path.join(rowtuple.relpath, rowtuple.fname_wav))
    print('******************')

# Loop over rows and call the `my_print` function.
for row in nocachedf.itertuples():
    my_print(row)

subj1
subj1/trial2/acq1.ifc
subj1/trial2/acq1.wav
******************
subj1
subj1/trial2/acq2.ifc
subj1/trial2/acq2.wav
******************
subj2
subj2/trial1/acq1.ifc
subj2/trial1/acq1.wav
******************
subj2
subj2/trial1/acq2.ifc
subj2/trial1/acq2.wav
******************
subj2
subj2/trial2/acq1.ifc
subj2/trial2/acq1.wav
******************
subj2
subj2/trial2/acq2.ifc
subj2/trial2/acq2.wav
******************


#### The `ifcformant` function

Now that we know the mechanics of calling a named function for every row in the DataFrame, let's construct a function that runs the `ifcformant` command using the parameters provided by a namedtuple.

If we were working at the command line, a representative example of calling `ifcformant` is:

```
ifcformant --speaker female --print-header --output myfile.ifc myfile.wav
```

The arguments to `ifcformant` include the speaker type, the name of the output file that will contain the formant measurements, and the input '.wav' file. The `--print-header` argument is used to print column labels as the first row of the output file.

The `do_ifcformant` function shown below constructs an array of arguments from an input namedtuple and input and output directory names, then uses the [`subprocess` module](https://docs.python.org/3/library/subprocess.html) to execute `ifcformant`.
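Before running anything, it can help to build the argument list on its own and inspect it. The sketch below is a dry run only: `build_ifcargs` is a hypothetical helper, the `Row` namedtuple is a hand-made stand-in for a row from `itertuples()`, and the directory names are placeholders. The flags are the ones from the command-line example above.

```python
import os
from collections import namedtuple

# Hand-made stand-in for one row returned by nocachedf.itertuples().
Row = namedtuple('Row', ['relpath', 'fname_wav', 'barename', 'sex'])
row = Row('subj1/trial2', 'acq1.wav', 'acq1', 'female')

def build_ifcargs(row, indir, outdir):
    '''Return the ifcformant argument list for one row (nothing is run).'''
    return [
        'ifcformant',
        '--speaker', row.sex,
        '--print-header',
        '--output', os.path.join(outdir, row.relpath, row.barename + '.ifc'),
        os.path.join(indir, row.relpath, row.fname_wav),
    ]

args = build_ifcargs(row, 'orig_data', 'cache')
print(' '.join(args))   # the equivalent command line
```

Passing the arguments as a list, one element per argument, means that filenames containing spaces do not need any extra quoting.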


In [17]:
def do_ifcformant(row, indir, outdir, errors='raise'):
    '''Perform formant analysis with the ifcformant command.
    
    Parameters
    ----------
    
    row : namedtuple that contains formant analysis parameters
          in fields:
        'relpath' (relative path to audio file),
        'fname_wav' (name of .wav file),
        'barename' (name of .wav file without extension),
        'sex' (ifcformant speaker type, one of 'female',
            'male', 'child')
             
    indir : str
        Base pathname to input .wav file. The path to the input file
        will be: indir/row.relpath/row.fname_wav.
             
    outdir : str
        Base pathname to ifcformant output. The output file will
        be written to: outdir/row.relpath/row.barename + '.ifc'.
             
    errors : str (default 'raise')
        How to handle errors if `check_call()` fails. If
        'ignore', print debug statement to STDERR and return the
        ifcformant return code; if 'raise' immediately reraise
        the CalledProcessError.
        
    Returns
    -------
    
    The `ifcformant` return code is returned by this function,
    0 for success or non-zero for errors.
    '''
    ifcargs = [
        'ifcformant',
        '--speaker', row.sex,
        '--print-header',
        '--output', os.path.join(
            outdir, row.relpath, row.barename + '.ifc'
        ),
        os.path.join(indir, row.relpath, row.fname_wav)
    ]
    try:
        subprocess.check_call(ifcargs)
    except subprocess.CalledProcessError as e:
        if errors == 'ignore':
            msg = 'Caught error while invoking ifcformant:\n{:}'.format(e)
            sys.stderr.write(msg)
            return e.returncode
        else:
            raise e
    return 0

It's always best to include a docstring at the top of your named functions. A docstring documents your workflow and makes it easier to re-use the function in another project. Execute the following cell to see the documentation for the new function.

In [None]:
do_ifcformant?

Once your function is created and debugged, use `itertuples()` to run the function on every row of your DataFrame.

In [18]:
for row in nocachedf.itertuples():
    do_ifcformant(row, indir=srcdir, outdir=cachedir)    

### <a name="postproc_check_work"></a>Check your work!

Use `dir2df()` to list the .ifc files that exist in `cachedir` now. All but two should show recent mtime values.

In [19]:
dir2df(cachedir, fnpat=r'\.ifc$', addcols=['mtime'])

Unnamed: 0,relpath,fname,mtime
0,subj1/trial1,acq1.ifc,2018-07-30 18:06:25
1,subj1/trial1,acq2.ifc,2018-07-30 18:06:25
2,subj1/trial2,acq1.ifc,2018-07-30 18:29:09
3,subj1/trial2,acq2.ifc,2018-07-30 18:29:10
4,subj2/trial1,acq1.ifc,2018-07-30 18:29:10
5,subj2/trial1,acq2.ifc,2018-07-30 18:29:10
6,subj2/trial2,acq1.ifc,2018-07-30 18:29:10
7,subj2/trial2,acq2.ifc,2018-07-30 18:29:10


Visual inspection of the results works okay for a small number of files but is cumbersome when more than a few exist.

A more reliable technique is to reload `cachedir` and merge again with `srcdf`. If any NaN values remain in the `fname_ifc` column, then something went wrong in creating the .ifc files.

The `if` block raises an error if any NaN values are found in `fname_ifc` and prints the missing rows. If no NaN values are found a success message appears instead.

In [30]:
mrgdf = srcdf.merge(
    dir2df(cachedir, fnpat=r'\.ifc$', addcols=['barename']), # reload cachedir
    how='left',
    on=['relpath', 'barename'],
    suffixes=['_wav', '_ifc']
)
if mrgdf.fname_ifc.isnull().any():
    sys.stderr.write('Some .wav files do not have .ifc files.\n')
    print(mrgdf[mrgdf.fname_ifc.isnull()])
else:
    sys.stderr.write('No missing .ifc files were found.')

No missing .ifc files were found.
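The same check can also be made directly against the filesystem, which may be handy outside of a Pandas workflow. The sketch below is an alternative to the merge shown above, not a replacement for it: `missing_outputs` is a hypothetical helper, and the files are created in a temporary directory for illustration.

```python
import os
import tempfile

def missing_outputs(keys, outdir, ext='.ifc'):
    '''Return the (relpath, barename) keys with no output file in outdir.'''
    return [
        (relpath, barename) for relpath, barename in keys
        if not os.path.isfile(os.path.join(outdir, relpath, barename + ext))
    ]

with tempfile.TemporaryDirectory() as tmp:
    # Create one of the two expected output files.
    os.makedirs(os.path.join(tmp, 'subj1', 'trial1'))
    open(os.path.join(tmp, 'subj1', 'trial1', 'acq1.ifc'), 'w').close()

    keys = [('subj1/trial1', 'acq1'), ('subj1/trial1', 'acq2')]
    missing = missing_outputs(keys, tmp)

print(missing)
```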

### <a name="sample_workflow_summary"></a>Summary of the post-processing workflow

This section contains a summary of the post-processing workflow with minimal explanation. Each step is in a separate cell to make it easy to execute each separately, in modular fashion.

In [None]:
# Define source and cache directories.
srcdir = '../resource/postproc/orig_data'
cachedir = '../resource/postproc/cache'

In [None]:
# Load .wav filenames from srcdir.
srcdf = dir2df(
    srcdir,
    dirpat=r'^(?P<subject>subj\d+)',
    fnpat=r'\.wav$',
    addcols=['barename']
)

In [None]:
# Load speaker metadata and merge with srcdf.
md = pd.read_csv(os.path.join(srcdir, 'speaker_metadata.csv'))
srcdf = srcdf.merge(md, on='subject', how='left')

In [None]:
# Load cached .ifc filenames and merge with srcdf.
cachedf = dir2df(cachedir, fnpat=r'\.ifc$', addcols=['barename'])
mrgdf = srcdf.merge(
    cachedf,
    on=['relpath', 'barename'],
    how='left',
    suffixes=['_wav', '_ifc']
)

In [None]:
# Select .wav files that do not have a corresponding cached .ifc file.
nocachedf = mrgdf[mrgdf.fname_ifc.isnull()]

In [None]:
# Mirror srcdir directory structure in cachedir.
unique_relpath = nocachedf.relpath.unique()
for destdir in unique_relpath:
    os.makedirs(
        os.path.join(cachedir, destdir),  # e.g. ../resource/postproc/cache/subj1/trial1
        exist_ok=True
    )
sys.stderr.write('Directory mirroring succeeded.')

In [None]:
# Run ifcformant on .wav files and output to cachedir.
# NOTE: do_ifcformant() function must already be defined.
for row in nocachedf.itertuples():
    do_ifcformant(row, indir=srcdir, outdir=cachedir)

In [None]:
# Check your work.
mrgdf = srcdf.merge(
    dir2df(cachedir, fnpat=r'\.ifc$', addcols=['barename']), # reload cachedir
    how='left',
    on=['relpath', 'barename'],
    suffixes=['_wav', '_ifc']
)
if mrgdf.fname_ifc.isnull().any():
    sys.stderr.write('Some .wav files do not have .ifc files.\n')
    print(mrgdf[mrgdf.fname_ifc.isnull()])
else:
    sys.stderr.write('No missing .ifc files were found.')

## <a name="collecting_results"></a>Collecting results for analysis

[Skip ahead to the summary](#collecting_results_summary) for these steps in a short format.

After post-processing, you can read the cached files for data analysis. First, load the cached .ifc filenames into a DataFrame.

In [31]:
cachedf = dir2df(cachedir, fnpat=r'\.ifc$', addcols=['dirname', 'barename'])
cachedf

Unnamed: 0,dirname,relpath,fname,barename
0,../resource/postproc/cache,subj1/trial1,acq1.ifc,acq1
1,../resource/postproc/cache,subj1/trial1,acq2.ifc,acq2
2,../resource/postproc/cache,subj1/trial2,acq1.ifc,acq1
3,../resource/postproc/cache,subj1/trial2,acq2.ifc,acq2
4,../resource/postproc/cache,subj2/trial1,acq1.ifc,acq1
5,../resource/postproc/cache,subj2/trial1,acq2.ifc,acq2
6,../resource/postproc/cache,subj2/trial2,acq1.ifc,acq1
7,../resource/postproc/cache,subj2/trial2,acq2.ifc,acq2


Let's use information from one of the rows to figure out how to read .ifc files into a DataFrame. The filepath is constructed from values found in the `dirname`, `relpath`, and `fname` fields. The `sep` parameter is used to declare tab as the separator character, as described in the `ifcformant` documentation.

In [32]:
df = pd.read_csv(
    os.path.join('../resource/postproc/cache', 'subj1/trial1', 'acq1.ifc'),
    sep='\t'
)
df.head()

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0
0,0.005,13.8,848.2,2135.2,2917.6,4359.2,0.0
1,0.015,24.2,792.7,2167.6,2795.7,4375.9,0.0
2,0.025,37.9,837.5,2191.3,2754.8,4347.7,0.0
3,0.035,51.7,776.9,2154.3,2846.5,4260.3,0.0
4,0.045,76.2,549.5,2210.4,2932.8,4107.5,0.0


The resulting DataFrame looks good, but if it is to be combined with similar DataFrames from other .ifc files, then it is incomplete; there is no identifying information in the rows to associate the measurements with a particular .wav file.

In the next cell the `assign()` method is chained to the `read_csv()` output to add two new columns, `relpath` and `barename`, that supply the missing identifiers.

In [33]:
df = pd.read_csv(
    os.path.join(cachedir, 'subj1/trial1', 'acq1.ifc'),
    sep='\t'
).assign(relpath='subj1/trial1', barename='acq1')
df.head()

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0,barename,relpath
0,0.005,13.8,848.2,2135.2,2917.6,4359.2,0.0,acq1,subj1/trial1
1,0.015,24.2,792.7,2167.6,2795.7,4375.9,0.0,acq1,subj1/trial1
2,0.025,37.9,837.5,2191.3,2754.8,4347.7,0.0,acq1,subj1/trial1
3,0.035,51.7,776.9,2154.3,2846.5,4260.3,0.0,acq1,subj1/trial1
4,0.045,76.2,549.5,2210.4,2932.8,4107.5,0.0,acq1,subj1/trial1
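The same pattern can be tried without an .ifc file on disk. In the sketch below a short tab-separated string, read through `io.StringIO`, stands in for a file (an assumption made for illustration; only three of the columns are included).

```python
import io
import pandas as pd

# A tiny tab-separated sample standing in for an .ifc file.
text = 'sec\trms\tf1\n0.005\t13.8\t848.2\n0.015\t24.2\t792.7\n'

df = pd.read_csv(io.StringIO(text), sep='\t').assign(
    relpath='subj1/trial1',   # scalar values are repeated on every row
    barename='acq1'
)
print(df)
```

Because `assign()` returns a new DataFrame, it can be chained directly onto `read_csv()` without an intermediate variable.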


The next step is to package the `read_csv()` call in the preceding cell into a convenient function that uses the information from a DataFrame row as its input and returns a DataFrame filled with .ifc data.

In [34]:
def ifc2df(row, errors='raise'):
    '''Read ifcformant measurements from a file into a DataFrame.
    
    Parameters
    ----------
    
    row : namedtuple that contains .ifc filepath info in fields:
        'relpath' (relative path to .ifc file),
        'fname' (name of .ifc file),
        'barename' (name of .ifc file without extension),
        'dirname' (base pathname of .ifc file. The path will
          be row.dirname/row.relpath/row.fname)

    errors : str (default 'raise')
        How to handle errors if `read_csv()` fails. If
        'ignore', print debug statement to STDERR and return
        None; if 'raise' immediately reraise.

    Returns
    -------
    
    A DataFrame containing data from the .ifc file, plus columns
    for dirname, relpath, barename.
    '''
    fpath = os.path.join(row.dirname, row.relpath, row.fname)
    try:
        return pd.read_csv(
            fpath,
            sep='\t'
        ).assign(
            dirname=row.dirname,
            relpath=row.relpath,
            barename=row.barename
        )
    except Exception as e:
        if errors == 'ignore':
            sys.stderr.write('Error reading {:}:\n    '.format(fpath))
            sys.stderr.write(str(e) + '\n')
        else:
            raise

#### Read one file at a time

Test your function! A good way to do that is to choose a single file as input and check the result.

To work with a single file, select a single row from `cachedf`. The combination of 'relpath' and 'barename' identifies a unique row. Selecting rows with a boolean mask always returns a DataFrame, even when only one row matches. The `ifc2df()` function expects a namedtuple-like object whose fields can be read as attributes, and the `squeeze()` method reduces a single-row DataFrame to a Pandas Series, which serves that purpose. (If your selection from `cachedf` returns more than one row, then `squeeze()` has no effect.)

In [36]:
row = cachedf.loc[
    (cachedf.relpath == 'subj1/trial1') & (cachedf.barename == 'acq1') # select single row
].squeeze()                                                            # reduce to Series
print(row)
ifcdf = ifc2df(row)
ifcdf.head()

dirname     ../resource/postproc/cache
relpath                   subj1/trial1
fname                         acq1.ifc
barename                          acq1
Name: 0, dtype: object


Unnamed: 0,sec,rms,f1,f2,f3,f4,f0,barename,dirname,relpath
0,0.005,13.8,848.2,2135.2,2917.6,4359.2,0.0,acq1,../resource/postproc/cache,subj1/trial1
1,0.015,24.2,792.7,2167.6,2795.7,4375.9,0.0,acq1,../resource/postproc/cache,subj1/trial1
2,0.025,37.9,837.5,2191.3,2754.8,4347.7,0.0,acq1,../resource/postproc/cache,subj1/trial1
3,0.035,51.7,776.9,2154.3,2846.5,4260.3,0.0,acq1,../resource/postproc/cache,subj1/trial1
4,0.045,76.2,549.5,2210.4,2932.8,4107.5,0.0,acq1,../resource/postproc/cache,subj1/trial1


The function works!
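The behavior of `squeeze()` is easy to confirm on a small example. The sketch below uses a made-up two-row DataFrame to show that a one-row selection becomes a Series with attribute access, while a multi-row, multi-column DataFrame comes back unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'relpath': ['subj1/trial1', 'subj1/trial1'],
    'barename': ['acq1', 'acq2'],
})

# One matching row: squeeze() reduces the DataFrame to a Series.
one = df[df.barename == 'acq1'].squeeze()

# Two rows and two columns: squeeze() has nothing to reduce.
two = df.squeeze()

print(type(one).__name__, one.barename)
print(type(two).__name__)
```

If a selection might match more than one row, checking its length before calling `squeeze()` avoids passing a whole DataFrame to a function that expects a single row.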

#### Collect all files into a single DataFrame

If you wish, you can include `ifc2df()` in a loop to operate on each of the .ifc files, one at a time. The general pattern looks like this:

```python
for row in cachedf.itertuples():
    ifcdf = ifc2df(row)
    # ... do something with ifcdf
```

One way to use the loop is to collect all of the formant measurements from the .ifc files into a single DataFrame. To do this, create a list of DataFrames by appending each DataFrame in the loop, then concatenate them into a single DataFrame with `pd.concat()`. (It is faster to use `pd.concat()` once on a list of DataFrames than it is to use `pd.concat()` within the loop.)

In [37]:
ifcs = []                           # Initialize the list of DataFrames
for row in cachedf.itertuples():    # Loop over the filenames
    ifcs.append(ifc2df(row))        #   Append a new DataFrame to the list
ifcdf = pd.concat(ifcs)             # Combine the list of DataFrames into a single DataFrame
ifcdf

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0,barename,dirname,relpath
0,0.005,13.8,848.2,2135.2,2917.6,4359.2,0.0,acq1,../resource/postproc/cache,subj1/trial1
1,0.015,24.2,792.7,2167.6,2795.7,4375.9,0.0,acq1,../resource/postproc/cache,subj1/trial1
2,0.025,37.9,837.5,2191.3,2754.8,4347.7,0.0,acq1,../resource/postproc/cache,subj1/trial1
3,0.035,51.7,776.9,2154.3,2846.5,4260.3,0.0,acq1,../resource/postproc/cache,subj1/trial1
4,0.045,76.2,549.5,2210.4,2932.8,4107.5,0.0,acq1,../resource/postproc/cache,subj1/trial1
5,0.055,100.6,546.5,2374.2,3044.2,4020.7,129.0,acq1,../resource/postproc/cache,subj1/trial1
6,0.065,114.4,707.3,2318.7,2988.7,4164.7,0.0,acq1,../resource/postproc/cache,subj1/trial1
7,0.075,148.6,747.7,2270.6,2901.0,4241.2,0.0,acq1,../resource/postproc/cache,subj1/trial1
8,0.085,375.0,319.6,2283.4,2842.5,4180.8,0.0,acq1,../resource/postproc/cache,subj1/trial1
9,0.095,648.4,331.2,2276.6,2800.3,4295.3,172.5,acq1,../resource/postproc/cache,subj1/trial1


For a truly Pythonic implementation, condense the loop into a list comprehension and create your DataFrame in a single line of code.

In [38]:
# Collect all ifcformant measurements into a single DataFrame.
ifcdf = pd.concat([ifc2df(row) for row in cachedf.itertuples()])
ifcdf

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0,barename,dirname,relpath
0,0.005,13.8,848.2,2135.2,2917.6,4359.2,0.0,acq1,../resource/postproc/cache,subj1/trial1
1,0.015,24.2,792.7,2167.6,2795.7,4375.9,0.0,acq1,../resource/postproc/cache,subj1/trial1
2,0.025,37.9,837.5,2191.3,2754.8,4347.7,0.0,acq1,../resource/postproc/cache,subj1/trial1
3,0.035,51.7,776.9,2154.3,2846.5,4260.3,0.0,acq1,../resource/postproc/cache,subj1/trial1
4,0.045,76.2,549.5,2210.4,2932.8,4107.5,0.0,acq1,../resource/postproc/cache,subj1/trial1
5,0.055,100.6,546.5,2374.2,3044.2,4020.7,129.0,acq1,../resource/postproc/cache,subj1/trial1
6,0.065,114.4,707.3,2318.7,2988.7,4164.7,0.0,acq1,../resource/postproc/cache,subj1/trial1
7,0.075,148.6,747.7,2270.6,2901.0,4241.2,0.0,acq1,../resource/postproc/cache,subj1/trial1
8,0.085,375.0,319.6,2283.4,2842.5,4180.8,0.0,acq1,../resource/postproc/cache,subj1/trial1
9,0.095,648.4,331.2,2276.6,2800.3,4295.3,172.5,acq1,../resource/postproc/cache,subj1/trial1
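Two details of `pd.concat()` are worth knowing when collecting many files. First, each input keeps its own row index, so index values repeat in the combined DataFrame unless `ignore_index=True` is passed. Second, `None` entries in the list are dropped silently (unless every entry is `None`), which matters if `ifc2df()` is called with `errors='ignore'`. The sketch below uses made-up DataFrames to illustrate both.

```python
import pandas as pd

a = pd.DataFrame({'sec': [0.005, 0.015], 'barename': 'acq1'})
b = pd.DataFrame({'sec': [0.005, 0.015], 'barename': 'acq2'})

# None stands in for a file that could not be read.
frames = [a, None, b]

stacked = pd.concat(frames)                        # index repeats: 0, 1, 0, 1
renumbered = pd.concat(frames, ignore_index=True)  # index runs 0 through 3

print(list(stacked.index))
print(list(renumbered.index))
```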


### <a name="collecting_results_summary"></a>Summary of collecting results

This section contains a summary of collecting results with minimal explanation. Each step is in a separate cell to make it easy to execute each separately, in modular fashion.

In [None]:
# Load names of .ifc files into a DataFrame
cachedf = dir2df(cachedir, fnpat=r'\.ifc$', addcols=['dirname', 'barename'])

In [None]:
# Collect all the formant measurements into a single DataFrame. `ifc2df()` must be defined.
ifcdf = pd.concat([ifc2df(row) for row in cachedf.itertuples()])