# Step 1.4: Splitting Utterance Content

This step does four things. First, it removes lines in the gold standard CSV (created in Step 1.3) that contain no instances of the AAL morphosyntactic feature in question. Next, it splits the content of each utterance into columns, one word per column. It then duplicates any line with more than one instance of the feature, so that each instance is analyzed in its own line rather than all instances in an utterance sharing one line. Finally, it exports the results to a CSV file.

## Required Packages

The following packages are necessary to run this code:
os, re, [pandas](https://pypi.org/project/pandas/), [numpy](https://pypi.org/project/numpy/)

## Define the Dataframe Creating Function

This function takes the following arguments:

<ol>
<li>The filepath to the gold standard CSV file, created in Step 1.3 and manually annotated</li>
<li>The word being searched for</li>
</ol>
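
The search word is matched as a whole word only, with its first letter in either case. The following is a minimal sketch of that idea; the helper name `build_whole_word_regex` is hypothetical, and escaping the rest of the word with `re.escape` is an assumption added here for safety.

```python
import re

def build_whole_word_regex(search_word):
    """Build a pattern that matches search_word as a whole word only.

    Hypothetical helper for illustration; the first letter may be
    upper or lower case, and the rest of the word is matched literally.
    """
    first, rest = search_word[0], search_word[1:]
    return rf"\b[{first.upper()}{first.lower()}]{re.escape(rest)}\b"

# "because" contains "be" but is not a whole-word match, so it is skipped
print(re.findall(build_whole_word_regex("be"), "Be careful, he be working because"))
# ['Be', 'be']
```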

It does the following:

<ol>
<li>Removes lines which do not contain the morphosyntactic feature in question (i.e., lines whose FeatureCountPerLine == 0)</li>
<li>Splits the content of the utterance into separate columns per word (punctuation marks remain attached to the words they co-occur with)</li>
<li>Duplicates lines which have more than one instance of the morphosyntactic feature in question, so that each instance is examined in its own line rather than all instances in the same line</li>
<li>Returns the results as a dataframe, which is exported to a CSV file at the end of this notebook</li>
</ol>
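
Steps 2 and 3 can be sketched on a single utterance. This is a simplified illustration, not the function defined below: the utterance is made up, and the feature text is taken directly from the match rather than sliced out of the content.

```python
import re

# Made-up utterance with two instances of the search word "be"
utterance = "she be working and he be sleeping"

rows = []
for iteration, match in enumerate(re.finditer(r"\b[Bb]e\b", utterance), start=1):
    rows.append({
        "IterationNumber": iteration,
        # words to the left of this instance, one list item per word
        "Before": utterance[:match.start()].split(),
        "Feature": match.group(),
        # words to the right of this instance
        "After": utterance[match.end():].split(),
    })

# One row per instance: the same utterance is centered on each "be" in turn
for row in rows:
    print(row)
```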

## References:

The *justify* function is the work of StackOverflow user [Divakar](https://stackoverflow.com/users/3293881/divakar) and is taken from:
<ul>
<li>https://stackoverflow.com/questions/44558215/python-justifying-numpy-array/44559180#44559180</li>
<li>https://stackoverflow.com/questions/51304610/pandas-shifting-columns-depending-on-if-nan-or-not</li>
</ul>
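
To show why right-justification is needed, here is a small stand-in that right-aligns the words before the feature so that the word nearest the feature always lands in the last column. It is not Divakar's *justify* function; `right_align_row` is a hypothetical, slower helper written only for illustration.

```python
import numpy as np
import pandas as pd

def right_align_row(values):
    """Move the non-null entries of one row to the right, padding the left with NaN."""
    kept = [value for value in values if pd.notna(value)]
    return [np.nan] * (len(values) - len(kept)) + kept

# Words before the feature, as str.split(expand=True) would leave them (left-aligned)
before = pd.DataFrame([["I", "said", "he"], ["he", None, None]])

aligned = before.apply(right_align_row, axis=1, result_type="expand")

# Count down toward the feature: L3, L2, L1
aligned.columns = [f"L{n}" for n in range(aligned.shape[1], 0, -1)]
print(aligned)
```

After alignment, the single word in the second row sits in `L1`, directly next to where the feature column will be placed.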

In [None]:
def create_split_content_dataframe(csv_input_path, search_word_string):

    """
    Takes a gold standard csv produced by previous steps in the pipeline.
    Performs the following tasks:
    (1) Removes lines which do not contain the morphosyntactic feature in
    question (i.e., lines whose FeatureCountPerLine == 0).
    (2) Splits the Content of the utterance into separate columns per word.
    Punctuation marks will remain with words they co-occur with.
    (3) Duplicates lines which have more than one instance of the
    morphosyntactic feature in question in order for each instance to be
    examined in its own separate line rather than all instances being
    examined in the same line.
    (4) Returns the results as a dataframe, which is written to a .csv
    file later in the notebook.

    One issue remains with this function. If there are multiple instances
    of a search word in an utterance, and there is a difference between
    the InstancesCountPerLine and the FeatureCountPerLine, then the code
    which replicates lines will replicate as many lines as there are
    instances rather than actual feature counts. This shouldn't be a
    large issue since occurrences like this are rare. However, extra
    care should be taken in the manual annotation of the .csv files
    that result from this code. If the contents of ContentFeature
    are not a feature, then the row can be deleted.
    """

################################################################################
################## SECTION 1: PRELIMINARY ACTIONS ##############################
################################################################################

    import re
    import os
    import pandas as pd
    import numpy as np

    #the following function will align the dataframes of Content before the
    # feature to the right so that it will be correctly aligned in the eventual
    # csv file to simulate concordance lines correctly. This function was not my
    # own development and proper citations are given below.

    def justify(a, invalid_val=0, axis=1, side='left'): 
        
        """
        Taken from Divakar (https://stackoverflow.com/users/3293881/divakar)
        ---> https://stackoverflow.com/questions/44558215/python-justifying-numpy-array/44559180#44559180
        ---> https://stackoverflow.com/questions/51304610/pandas-shifting-columns-depending-on-if-nan-or-not

        Justifies a 2D array

        Parameters
        ----------
        a : ndarray
          Input array to be justified
        invalid_val : scalar
          Value treated as an empty cell; valid entries are shifted past it
        axis : int
          Axis along which justification is to be made
        side : str
          Direction of justification. It could be 'left', 'right', 'up', 'down'
          It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

        """

        if invalid_val is np.nan:
            
          mask = pd.notnull(a)
        
        else:
            
          mask = a!=invalid_val
        
        justified_mask = np.sort(mask,axis=axis)
        
        if (side=='up') | (side=='left'):
            
          justified_mask = np.flip(justified_mask,axis=axis)
        
        out = np.full(a.shape, invalid_val, dtype=object) 
        
        if axis==1:
            
          out[justified_mask] = a[mask]
        
        else:
            
          out.T[justified_mask.T] = a.T[mask.T]
        
        return out

################################################################################
###################### SECTION 2: EXECUTION OF CODE ############################
################################################################################

    # reads in the gold standard csv and creates a pandas dataframe
    gs_df = pd.read_csv(csv_input_path)
    
    # if the gold standard csv has no rows, returns an empty dataframe
    if len(gs_df) == 0:
        
        #creates an empty dataframe since there are no instances
        # of the morphosyntactic feature
        gs_empty_df = pd.DataFrame(columns=[
            'File', 'Line', 'Speaker', 'UttStartTime',
            'UttEndTime', 'UttLength', 'InstancesCountPerLine', 
            'FeatureCountPerLine', 'Content', 'SubjectWordToken',
            'PredicateWordToken', 'WordPattern', 'SubjectPOS',
            'PredicatePOS', 'POSPattern'])
        
        # returns the final dataframe
        return gs_empty_df
        
    
    # checks if there are no instances of the morphosyntactic feature,
    #  if so, returns an empty dataframe
    elif max(gs_df['FeatureCountPerLine']) == 0:
        
        #creates an empty dataframe since there are no instances
        # of the morphosyntactic feature
        gs_empty_df = pd.DataFrame(columns=[
            'File', 'Line', 'Speaker', 'UttStartTime',
            'UttEndTime', 'UttLength', 'InstancesCountPerLine', 
            'FeatureCountPerLine', 'Content', 'SubjectWordToken',
            'PredicateWordToken', 'WordPattern', 'SubjectPOS',
            'PredicatePOS', 'POSPattern'])
        
        # returns the final dataframe
        return gs_empty_df
    
    else:
        
        # adds empty columns for subject word token, subject part of speech, predicate
        # word token, predicate part of speech, subject/predicate word pattern
        # surrounding the feature, and POS pattern surrounding the feature to be 
        # filled in manually
        gs_df["SubjectWordToken"] = np.nan
        gs_df["PredicateWordToken"] = np.nan
        gs_df["WordPattern"] = np.nan
        gs_df["SubjectPOS"] = np.nan
        gs_df["PredicatePOS"] = np.nan
        gs_df["POSPattern"] = np.nan

        # takes the search word input and transforms it into a regular expression
        # that matches whole words only. If the search word occurs inside a larger
        # word, that instance is filtered out, leaving only whole-word matches.
        # The first letter may be upper or lower case. re.escape keeps any regex
        # metacharacters in the rest of the word literal.
        search_word_regex = f"\\b[{search_word_string[0].upper()}{search_word_string[0].lower()}]{re.escape(search_word_string[1:])}\\b"

        # the following will create columns to be changed later
        # IterationNumber will be used to help duplicate lines that contain more
        #  than one instance of the feature in question
        gs_df['IterationNumber'] = 1

        # creates an empty column to store the Content that occurs before the feature.
        #  None (rather than np.nan) gives the column an object dtype, so text can be
        #  written into it without a dtype mismatch
        gs_df['ContentBeforeFeature'] = None

        # creates an empty column to store the feature text
        gs_df['ContentFeature'] = None

        # creates an empty column to store the Content that occurs after the feature
        gs_df['ContentAfterFeature'] = None

        # creates a list of columns from the gold standard dataframe
        gs_df_column_names = list(gs_df)

        # creates an empty list for replicated lines to be appended to. A line will 
        # be replicated if it contains more than one instance of a feature. The
        # result will be as many total occurrences of the line as there are
        # features in the line. This will allow each instance of the feature in 
        # the line to be examined separately, rather than needing to examine each
        # instance within the same line.
        replicated_lines = []

        #iterates through the rows of the gold standard dataframe
        for row in gs_df.itertuples():

            #if there is only one instance of the feature in the line, this will 
            # separate the Content into columns based on what comes before and after
            # the feature in the text and a separate column for the feature. Doing
            # this will replicate concordance lines which can center the feature
            # in question.
            if row.FeatureCountPerLine == 1:

                #here, re.finditer is used because finditer will iterate through the
                # matches of the search word and return a match object. match objects
                # provide more information than just the match. for example, .start()
                # returns the start position of the match within the text.
                # Check here for more info: 
                #  https://docs.python.org/3/library/re.html#match-objects
                #Loops through the matches of the search word in the row's Content.
                # If the search word matches more than once, each match overwrites
                # the previous one, so the last match is the one that is kept.
                for match in re.finditer(search_word_regex, str(row.Content)):

                    # replaces the "ContentBeforeFeature" cell with all text that occurs
                    #  before the search word (i.e., feature in question)
                    gs_df.loc[row.Index, "ContentBeforeFeature"] = str(row.Content)[:match.start()]

                    # replaces the "ContentFeature" with the text of the search word
                    #  (i.e., feature in question)
                    gs_df.loc[row.Index, "ContentFeature"] = str(row.Content)[match.start():match.end()+1]

                    # replaces the "ContentAfterFeature" cell with all text that occurs
                    #  after the search word (i.e., feature in question)
                    gs_df.loc[row.Index, "ContentAfterFeature"] = str(row.Content)[match.end()+1:]

            elif row.FeatureCountPerLine > 1:

                # sets the iteration number to 0. This number will be changed within
                # the loops below. It will be used to determine which instance of the
                # feature should be centered on that line. It will also be used later
                # to sort lines along with File and Line.
                iteration_number = 0

                #here, re.finditer is used because finditer will iterate through the
                # matches of the search word and return a match object. match objects
                # provide more information than just the match. for example, .start()
                # returns the start position of the match within the text.
                # Check here for more info: 
                #  https://docs.python.org/3/library/re.html#match-objects
                #Loops through the matches of the search word in the row's Content
                for match in re.finditer(search_word_regex, str(row.Content)):

                    # creates a variable with all text that occurs
                    #  before the search word (i.e., feature in question)
                    current_before = str(row.Content)[:match.start()]

                    # creates a variable with the text of the search word
                    #  (i.e., feature in question)
                    current_feature = str(row.Content)[match.start():match.end()+1]

                    # creates a variable with all text that occurs
                    #  after the search word (i.e., feature in question)
                    current_after = str(row.Content)[match.end()+1:]
                    
                    # copies every cell of the current row into a python dictionary
                    #  keyed by column name. row._asdict() is used so the copy does
                    #  not depend on the column order of the csv or on whether
                    #  optional columns (e.g., Speaker, UttStartTime) are present.
                    line_dict = row._asdict()

                    # removes the dataframe index, which is not a column
                    line_dict.pop("Index")

                    # overwrites the four placeholder cells with the values created
                    #  in the immediately previous steps. 1 is added to the iteration
                    #  number to reflect which match of the feature this is.
                    line_dict.update({
                        "IterationNumber": iteration_number + 1,
                        "ContentBeforeFeature": current_before,
                        "ContentFeature": current_feature,
                        "ContentAfterFeature": current_after})

                    # appends the dictionary to the replicated lines list above
                    #  to be used for appending to the dataframe later
                    replicated_lines.append(line_dict)

                    # increases the iteration number by one so the next iteration number
                    #  will be correct
                    iteration_number += 1

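        # appends replicated lines to the gold standard dataframe. pd.concat is
        #  used because DataFrame.append was removed in pandas 2.0, and
        #  ignore_index=True gives every row a unique index label.
        if replicated_lines:
            gs_replicated_df = pd.concat(
                [gs_df, pd.DataFrame(replicated_lines)], ignore_index=True)
        else:
            gs_replicated_df = gs_df.copy()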

        #############################################################
        # This next section will take the Content before and after
        # the feature and split it into columns, one word per column.
        #############################################################

        ##########################
        # Content before
        ##########################

        # creates a separate dataframe with columns of words within the Content 
        #  before the feature, one word per column
        content_before_df = gs_replicated_df["ContentBeforeFeature"].str.split(expand=True)

        # aligns the words in the before_df dataframe to the right. This will 
        #  correctly align row Content in the eventual csv to appear as traditional
        #  concordance lines. This utilizes the justify function defined above.
        #  Please see citations there for authorship and for more information.
        content_before_alignedR_df = pd.DataFrame(
            justify(content_before_df.values, invalid_val=np.nan, side="right"),
            index=content_before_df.index, columns=content_before_df.columns) 

        # creates a list of the column names in the aligned dataframe just created.
        #  this list will just be numbers. These column names are automatically
        #  produced by str.split(expand=True) two steps ago.
        content_before_column_names = list(content_before_alignedR_df)

        # creates a reversed list of column names in order to correctly count down 
        #  from left to right (i.e., "L3, L2, L1, Feature")
        content_before_column_names_rev = content_before_column_names[::-1]

        # creates a list of renamed column names to reflect traditional corpus
        #  linguistics annotation (i.e., "L3, L2, L1")
        content_before_column_names_renamed = [f"L{column_name+1}" for column_name in content_before_column_names_rev]

        # resets the column names of the aligned dataframe to the new column names
        content_before_alignedR_df.columns = content_before_column_names_renamed

        ##########################
        # Content feature
        ##########################

        # creates a dataframe with only the feature text
        content_feature_df = gs_replicated_df[['ContentFeature']]

        ##########################
        # Content after
        ##########################

        # creates a separate dataframe with columns of words within the Content 
        #  after the feature, one word per column
        content_after_df = gs_replicated_df["ContentAfterFeature"].str.split(expand = True)

        #NOTE: There is no need for aligning the columns here as the after content
        #         columns automatically align left as they should.

        # creates a list of the column names in the dataframe just created.
        #  this list will just be numbers. These column names are automatically
        #  produced by str.split(expand=True) in the previous step.
        content_after_column_names = list(content_after_df)

        #NOTE: There is no need for reversing the columns here as the after content
        #         columns automatically increase left to right as they should

        # creates a list of renamed column names to reflect traditional corpus
        #  linguistics annotation (i.e., "R1, R2, R3")
        content_after_column_names_renamed = [f"R{column_name+1}" for column_name in content_after_column_names]

        # resets the column names of the aligned dataframe to the new column names
        content_after_df.columns = content_after_column_names_renamed

        #############################################################
        # This next section stitches everything together
        #############################################################

        # drops the columns ContentBeforeFeature, ContentFeature, ContentAfterFeature
        #  because they are no longer needed
        gs_dropped_df = gs_replicated_df.drop(columns = ['ContentBeforeFeature',
                                                          'ContentFeature',
                                                          'ContentAfterFeature'])

        # concatenates the dataframes created in previous steps
        gs_split_content_df = pd.concat([gs_dropped_df, content_before_alignedR_df,
                                         content_feature_df, content_after_df], axis=1)

        # keeps only rows that have feature text and at least one instance of
        #  the AAL morphosyntactic feature in question. This removes (a) original
        #  rows with more than one instance of the feature, whose ContentFeature
        #  cell is empty because they were replaced by their replicated copies,
        #  and (b) rows whose FeatureCountPerLine is 0. .copy() makes the result
        #  an independent dataframe rather than a view of gs_split_content_df.
        gs_split_content_droppedNa_df = gs_split_content_df[
            gs_split_content_df['ContentFeature'].notna()
            & (gs_split_content_df['FeatureCountPerLine'] != 0)].copy()

        #if all rows are dropped because no instances of the morphosyntactic feature
        # occur in any lines, then this will return an empty dataframe. for example,
        # TIMIT had no occurrences of habitual 'be'. if there are no rows and the code
        # progresses beyond this point, an error will be raised and the code will 
        # stop. 
        if len(gs_split_content_droppedNa_df) == 0:

            # removes the IterationNumber column because it is no longer needed
            #  and resets the index
            gs_final_split_df = gs_split_content_droppedNa_df.drop(
                columns = ['IterationNumber']).reset_index(drop=True) 

            #returns the final dataframe
            return gs_final_split_df

        else:
            # resets the index for the dataframe to count lines from the beginning and
            #  sorts the dataframe according to File, Line, then Iteration Number for
            #  those lines which have more than one instance
            gs_split_content_droppedNa_df = gs_split_content_droppedNa_df.sort_values(
                ['File', 'Line', 'IterationNumber']).reset_index(drop=True)

            # removes the IterationNumber column because it is no longer needed
            #  and resets the index
            gs_final_split_df = gs_split_content_droppedNa_df.drop(
                columns = ['IterationNumber']).reset_index(drop=True) 

            #returns dataframe
            return gs_final_split_df

## Creating Split Utterance Content Dataframes

The cells below run the function on each gold standard CSV and store the resulting dataframes. The dataframes are then sorted and exported as CSV files in the sections that follow.
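
Each dataframe is named after the first two underscore-separated parts of its source filename. As a hedged sketch of that naming step, the snippet below derives the names and collects them in a plain dictionary instead of `globals()`; the filenames and the helper `dataframe_name_from_filename` are made up for illustration.

```python
def dataframe_name_from_filename(filename):
    """Keep the first two underscore-separated parts and add the dataframe suffix."""
    stripped_filename = "_".join(filename.split("_")[:2])
    return f"{stripped_filename}_splitContent_df"

# Hypothetical filenames; real ones come from os.listdir(csv_input_path)
example_filenames = ["be_coraal_goldStandard.csv", "aint_timit_goldStandard.csv"]

# A dictionary keyed by name is one alternative to writing into globals()
dataframe_names = {name: dataframe_name_from_filename(name) for name in example_filenames}
print(dataframe_names["be_coraal_goldStandard.csv"])
# be_coraal_splitContent_df
```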

In [None]:
# Designate the input path where the gold standard CSVs are stored
csv_input_path = "path"

In [None]:
import os

### Feature: Ain't

In [None]:
# Designate the search word
search_word_string = "ain't"

# Get filenames for gold standard CSVs
#  Ensures:
#  (1) Only CSV files will be collected (no folders, no .DS_Store)
#  (2) Only filepaths from CSV files connected to the search word will be collected
#  (3) The gold standard info CSV will be skipped
#
#  Note:
#    Make sure the file.startswith(___) portion of the code matches the beginning
#    of the filenames. Usually, the search_word_string variable is used because
#    it is both the search term and should be the beginning of the filename. 
#    However, for ain't, the search term and beginning of file will be different
#    because of the apostrophe.

csv_filenames = [file for file in os.listdir(csv_input_path) 
                 if file.endswith(".csv") and
                 file.startswith("aint") and "info" not in file]

# loops through the filenames and creates a split content dataframe for each
for filename in csv_filenames:
    
    #gets just the first two items in the filename sequence (e.g., "aint_coraal")
    stripped_filename = "_".join(filename.split("_")[:2])
    
    #creates the name for the dataframe by adding "_splitContent_df" to the stripped filename
    dataframe_name = f"{stripped_filename}_splitContent_df"
    
    globals()[dataframe_name] = create_split_content_dataframe(f"{csv_input_path}/{filename}", search_word_string)

### Feature: Be

In [None]:
# Designate the search word
search_word_string = "be"

# Get filenames for gold standard CSVs
#  Ensures:
#  (1) Only CSV files will be collected (no folders, no .DS_Store)
#  (2) Only filepaths from CSV files connected to the search word will be collected
#  (3) The gold standard info CSV will be skipped
#
#  Note:
#    Make sure the file.startswith(___) portion of the code matches the beginning
#    of the filenames. Usually, the search_word_string variable is used because
#    it is both the search term and should be the beginning of the filename. 
#    However, for ain't, the search term and beginning of file will be different
#    because of the apostrophe.

csv_filenames = [file for file in os.listdir(csv_input_path) 
                 if file.endswith(".csv") and
                 file.startswith(search_word_string) and "info" not in file]

# loops through the filenames and creates a split content dataframe for each
for filename in csv_filenames:
    
    #gets just the first two items in the filename sequence (e.g., "be_coraal")
    stripped_filename = "_".join(filename.split("_")[:2])
    
    #creates the name for the dataframe by adding "_splitContent_df" to the stripped filename
    dataframe_name = f"{stripped_filename}_splitContent_df"
    
    globals()[dataframe_name] = create_split_content_dataframe(f"{csv_input_path}/{filename}", search_word_string)

### Feature: Done

In [None]:
# Designate the search word
search_word_string = "done"

# Get filenames for gold standard CSVs
#  Ensures:
#  (1) Only CSV files will be collected (no folders, no .DS_Store)
#  (2) Only filepaths from CSV files connected to the search word will be collected
#  (3) The gold standard info CSV will be skipped
#
#  Note:
#    Make sure the file.startswith(___) portion of the code matches the beginning
#    of the filenames. Usually, the search_word_string variable is used because
#    it is both the search term and should be the beginning of the filename. 
#    However, for ain't, the search term and beginning of file will be different
#    because of the apostrophe.

csv_filenames = [file for file in os.listdir(csv_input_path) 
                 if file.endswith(".csv") and
                 file.startswith(search_word_string) and "info" not in file]

# loops through the filenames and creates a split content dataframe for each
for filename in csv_filenames:
    
    #gets just the first two items in the filename sequence (e.g., "done_coraal")
    stripped_filename = "_".join(filename.split("_")[:2])
    
    #creates the name for the dataframe by adding "_splitContent_df" to the stripped filename
    dataframe_name = f"{stripped_filename}_splitContent_df"
    
    globals()[dataframe_name] = create_split_content_dataframe(f"{csv_input_path}/{filename}", search_word_string)

## Sorting the Dataframes by File and Line

This sorts each dataframe first by filename and then by line number. Sorting at every step keeps the row order consistent across the files produced by each step.
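
A minimal sketch of this two-key sort on a made-up dataframe, assuming `Line` holds numbers (so line 10 sorts after line 2 rather than before it):

```python
import pandas as pd

# Made-up rows, deliberately out of order
example_df = pd.DataFrame({
    "File": ["b.txt", "a.txt", "a.txt"],
    "Line": [3, 10, 2],
    "ContentFeature": ["be", "be", "be"],
})

# Sort by File first, then by Line within each file
sorted_df = example_df.sort_values(by=["File", "Line"]).reset_index(drop=True)
print(sorted_df)
```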

### Feature: Ain't

In [None]:
aint_coraal_splitContent_df = aint_coraal_splitContent_df.sort_values(by=['File', 'Line'])

aint_fisher_splitContent_df = aint_fisher_splitContent_df.sort_values(by=['File', 'Line'])

aint_librispeech_splitContent_df = aint_librispeech_splitContent_df.sort_values(by=['File', 'Line'])

aint_switchboardHub5_splitContent_df = aint_switchboardHub5_splitContent_df.sort_values(by=['File', 'Line'])

aint_timit_splitContent_df = aint_timit_splitContent_df.sort_values(by=['File', 'Line'])

### Feature: Be

In [None]:
be_coraal_splitContent_df = be_coraal_splitContent_df.sort_values(by=['File', 'Line'])

be_fisher_splitContent_df = be_fisher_splitContent_df.sort_values(by=['File', 'Line'])

be_librispeech_splitContent_df = be_librispeech_splitContent_df.sort_values(by=['File', 'Line'])

be_switchboardHub5_splitContent_df = be_switchboardHub5_splitContent_df.sort_values(by=['File', 'Line'])

be_timit_splitContent_df = be_timit_splitContent_df.sort_values(by=['File', 'Line'])

### Feature: Done

In [None]:
done_coraal_splitContent_df = done_coraal_splitContent_df.sort_values(by=['File', 'Line'])

done_fisher_splitContent_df = done_fisher_splitContent_df.sort_values(by=['File', 'Line'])

done_librispeech_splitContent_df = done_librispeech_splitContent_df.sort_values(by=['File', 'Line'])

done_switchboardHub5_splitContent_df = done_switchboardHub5_splitContent_df.sort_values(by=['File', 'Line'])

done_timit_splitContent_df = done_timit_splitContent_df.sort_values(by=['File', 'Line'])

## Exporting Dataframes to CSV Files

This will export the dataframes to CSV files.
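
The export cells below append each filename directly to `csv_output_path`. As an alternative sketch, `os.path.join` builds the path without depending on a trailing separator; the dataframe, filename, and temporary directory here are made up for illustration.

```python
import os
import tempfile

import pandas as pd

example_df = pd.DataFrame({"File": ["a.txt"], "Line": [1], "ContentFeature": ["be"]})

# A temporary directory stands in for csv_output_path
with tempfile.TemporaryDirectory() as output_dir:
    # os.path.join inserts the separator between directory and filename
    output_file = os.path.join(output_dir, "be_example_splitContent.csv")
    example_df.to_csv(output_file, index=False)

    # Read the file back to confirm what was written
    round_trip_df = pd.read_csv(output_file)

print(list(round_trip_df.columns))
```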

In [None]:
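# Designate the output path where the split content CSVs will be saved
#  Note: the path must end with a path separator (e.g., "/") because the
#  filenames below are appended directly to it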
csv_output_path = "path"

### Feature: Ain't

In [None]:
aint_coraal_splitContent_df.to_csv(f"{csv_output_path}aint_coraal_splitContent.csv", index=False)

aint_fisher_splitContent_df.to_csv(f"{csv_output_path}aint_fisher_splitContent.csv", index=False)

aint_librispeech_splitContent_df.to_csv(f"{csv_output_path}aint_librispeech_splitContent.csv", index=False)

aint_switchboardHub5_splitContent_df.to_csv(f"{csv_output_path}aint_switchboardHub5_splitContent.csv", index=False)

aint_timit_splitContent_df.to_csv(f"{csv_output_path}aint_timit_splitContent.csv", index=False)

### Feature: Be

In [None]:
be_coraal_splitContent_df.to_csv(f"{csv_output_path}be_coraal_splitContent.csv", index=False)

be_fisher_splitContent_df.to_csv(f"{csv_output_path}be_fisher_splitContent.csv", index=False)

be_librispeech_splitContent_df.to_csv(f"{csv_output_path}be_librispeech_splitContent.csv", index=False)

be_switchboardHub5_splitContent_df.to_csv(f"{csv_output_path}be_switchboardHub5_splitContent.csv", index=False)

be_timit_splitContent_df.to_csv(f"{csv_output_path}be_timit_splitContent.csv", index=False)

### Feature: Done

In [None]:
done_coraal_splitContent_df.to_csv(f"{csv_output_path}done_coraal_splitContent.csv", index=False)

done_fisher_splitContent_df.to_csv(f"{csv_output_path}done_fisher_splitContent.csv", index=False)

done_librispeech_splitContent_df.to_csv(f"{csv_output_path}done_librispeech_splitContent.csv", index=False)

done_switchboardHub5_splitContent_df.to_csv(f"{csv_output_path}done_switchboardHub5_splitContent.csv", index=False)

done_timit_splitContent_df.to_csv(f"{csv_output_path}done_timit_splitContent.csv", index=False)