# Step 2.6: Getting Word Counts

This code performs several tasks. First, it duplicates lines that contain more than one instance of the AAL morphosyntactic feature in question, so that each instance is analyzed in its own line rather than all instances within an utterance being analyzed in the same line.

It will also return the following:
1. Word count for the full CORAAL utterance content
2. Word count for CORAAL pre-feature utterance content
3. Word count for CORAAL post-feature utterance content
4. Columns containing the pre-feature and post-feature text for CORAAL
5. Word count for full ASR output content

It will then write the results to a CSV file.
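
As a rough illustration of the outputs listed above, the sketch below splits a single utterance around one feature and counts the words in each part. The utterance is a made-up example rather than CORAAL data, and the variable names are chosen only for this sketch.

```python
import re

# Hypothetical utterance, used only for illustration
utterance = "she ain't never been there"

# Locate the feature as a whole word
match = re.search(r"\bain't\b", utterance)

# Text before the feature, and text after it (the +1 skips the space
# that follows the feature)
pre_feature = utterance[:match.start()]
post_feature = utterance[match.end() + 1:]

# Whitespace-based word counts for the full, pre-feature, and post-feature text
full_count = len(utterance.split())
pre_count = len(pre_feature.split())
post_count = len(post_feature.split())
```

The functions defined later in this step apply the same idea row by row to the CORAAL content and the ASR output.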

## Required Packages

The following packages are necessary to run this code: `re` and `os` (both part of the Python standard library), [pandas](https://pypi.org/project/pandas/), and [numpy](https://pypi.org/project/numpy/)

## Initial Setup

In [None]:
# Import required packages
import re
import os
import numpy as np
import pandas as pd

In [None]:
# file paths for the CSVs produced in Step 2.5
aint_file_path = "path"

be_file_path = "path"

done_file_path = "path"

# reads in the gold standard dataframes
aint_gs_df = pd.read_csv(aint_file_path)

be_gs_df = pd.read_csv(be_file_path)

done_gs_df = pd.read_csv(done_file_path)

## Defining the Row Duplicating Function

This function takes two arguments:
1. The dataframe created in step 2.5
2. The word being searched for
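
The duplication idea can be sketched on a toy dataframe. This is a simplified illustration, not the function defined in this step: the dataframe holds one made-up row with only three columns, and the regular expression is written out by hand for the feature *be*.

```python
import re
import pandas as pd

# Hypothetical one-row dataframe; the real input has many more columns
toy_df = pd.DataFrame({"File": ["f1"], "Line": [7],
                       "Content": ["he be working and she be singing"]})

rows = []
for row in toy_df.itertuples():
    text = str(row.Content)
    # one output row per whole-word match of the feature
    for i, match in enumerate(re.finditer(r"\b[Bb]e\b", text), start=1):
        rows.append({
            "File": row.File,
            "Line": row.Line,
            "Content": text,
            "IterationNumber": i,
            "Content_PreFeature": text[:match.start()],
            "Content_PostFeature": text[match.end() + 1:],
        })

expanded = pd.DataFrame(rows)
```

Here the single input row becomes two rows, each centered on a different instance of the feature and numbered by `IterationNumber`.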

In [None]:
def duplicate_rows(gs_df, search_word_string):

    """
    Takes a gold standard csv produced by previous steps in the pipeline.
    Performs the following task:
    (1) Duplicates lines which have more than one instance of the
    morphosyntactic feature in question in order for each instance to be
    examined in its own separate line rather than all instances being
    examined in the same line.

    One issue remains with this function. If there are multiple instances
    of a search word in an utterance, and there is a difference between
    the InstancesCountPerLine and the FeatureCountPerLine, then the code
    which replicates lines may replicate as many lines as there are
    instances rather than actual feature counts. This shouldn't be a
    large issue since occurrences like this are rare. However, extra
    care should be taken in the manual annotation of the .csv files
    that result from this code. If the contents of ContentFeature
    are not a feature, then the row can be deleted.
    """

################################################################################
################## SECTION 1: PRELIMINARY ACTIONS ############################
################################################################################

    import re
    import os
    import pandas as pd
    import numpy as np

################################################################################
###################### SECTION 2: EXECUTION OF CODE ##########################
################################################################################

    # transforms the search word into a regular expression that matches only
    # whole words. If the search word occurs inside a larger word, that
    # occurrence is not matched. The first letter may be upper or lower case.
    search_word_regex = f"\\b[{search_word_string[0].upper()}{search_word_string[0].lower()}]{search_word_string[1:]}\\b"

    # the following will create columns to be changed later
    # IterationNumber will be used to help duplicate lines that contain more
    #  than one instance of the feature in question
    gs_df['IterationNumber'] = 1

    # creates an empty column to store the Content that occurs before the feature
    gs_df['Content_PreFeature'] = np.nan

    # creates an empty column to store the Content that occurs after the feature
    gs_df['Content_PostFeature'] = np.nan

    
    
    
    # creates a list of columns from the gold standard dataframe
    gs_df_column_names = list(gs_df)
                
    # creates an empty list for replicated lines to be appended to. A line will 
    # be replicated if it contains more than one instance of a feature. The
    # result will be as many total occurrences of the line as there are
    # features in the line. This will allow each instance of the feature in 
    # the line to be examined separately, rather than needing to examine each
    # instance within the same line.
    replicated_lines = []

    #iterates through the rows of the gold standard dataframe
    for row in gs_df.itertuples():
        
        #if there is only one instance of the feature in the line, this will 
        # separate the Content into columns based on what comes before and after
        # the feature in the text. Doing this replicates concordance lines,
        # which center the feature in question.
        if row.InstancesCountPerLine == 1:

          #here, re.finditer is used because finditer will iterate through the
          # matches of the search word and return a match object. match objects
          # provide more information than just the match. for example, .start()
          # returns the start position of the match within the text.
          # Check here for more info: 
          #  https://docs.python.org/3/library/re.html#match-objects
          #Loops through the matches of the search word in the row's Content
            for match in re.finditer(search_word_regex, str(row.Content)):

                # fills the "Content_PreFeature" cell with all text that occurs
                #  before the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_PreFeature"] = str(row.Content)[:match.start()]

                # fills the "Content_PostFeature" cell with all text that occurs
                #  after the search word; the +1 skips the space that follows it
                gs_df.loc[row.Index, "Content_PostFeature"] = str(row.Content)[match.end()+1:]

        elif row.InstancesCountPerLine > 1:

            # sets the iteration number to 0. This number will be changed within
            # the loops below. It will be used to determine which instance of the
            # feature should be centered on that line. It will also be used later
            # to sort lines along with File and Line.
            iteration_number = 0
            
            if search_word_string in ["ain't", "isn't", "aren't", "I'm not", "didn't", "hasn't", "haven't"]:

                #here, re.finditer is used because finditer will iterate through the
                # matches of the search word and return a match object. match objects
                # provide more information than just the match. for example, .start()
                # returns the start position of the match within the text.
                # Check here for more info: 
                #  https://docs.python.org/3/library/re.html#match-objects
                #Loops through the matches of the search word in the row's Content
                for match in re.finditer(search_word_regex, str(row.Content)):
                                        
                    # creates a variable with all text that occurs
                    #  before the search word (i.e., feature in question)
                    current_before = str(row.Content)[:match.start()]

                    # creates a variable with all text that occurs
                    #  after the search word (i.e., feature in question)
                    current_after = str(row.Content)[match.end()+1:]

                    # creates a list of the values in cells of the row. Values
                    # that begin with "row." are copied from the existing row;
                    # the last three values come from the variables created in
                    # the immediately previous steps. The order must match the
                    # column order of the dataframe, since the values are paired
                    # with the column names by position below. 1 is added to the
                    # iteration number to reflect which match of the feature
                    # this is.
                    
                    cell_values = [row.File, row.Line, row.Speaker, row.UttStartTime,
                                   row.UttEndTime, row.UttLength,
                                   row.Content, row.Content_cleaned, row.AintVariation,
                                   row.InstancesCountPerLine, row.FeatureCountPerLine,
                                   row.WadaSNRMeade, row.WadaSNRRigal, row.NistSNRRigal,
                                   row.SyllableCount, row.SpeechRate,
                                   row.amazon_transcription,
                                   row.amazon_transcription_cleaned,
                                   row.deepspeech_transcription,
                                   row.deepspeech_transcription_cleaned, row.deepspeech_ConfidenceLevel,
                                   row.google_transcription, row.google_transcription_cleaned,
                                   row.google_ConfidenceLevel, row.IBMWatson_transcription,
                                   row.IBMWatson_transcription_cleaned, row.IBMWatson_ConfidenceLevel,
                                   row.microsoft_transcription, row.microsoft_transcription_cleaned,
                                   iteration_number+1, current_before, current_after]
                                        
                    # zips together the column names with the cell values from the
                    # current row. Zipping creates tuples
                    zipped = zip(gs_df_column_names, cell_values)
                    
                    # creates a python dictionary based on the zipped variable just
                    #  created. This is necessary to be able to append the row
                    #  back into the larger dataframe later.
                    line_dict = dict(zipped)
                    
                    # appends the dictionary to the replicated lines empty list above
                    #  to be used for appending to the dataframe later
                    replicated_lines.append(line_dict)

                    # increases the iteration number by one so the next iteration number
                    #  will be correct
                    iteration_number += 1
                    
            else:
                
                #here, re.finditer is used because finditer will iterate through the
                # matches of the search word and return a match object. match objects
                # provide more information than just the match. for example, .start()
                # returns the start position of the match within the text.
                # Check here for more info: 
                #  https://docs.python.org/3/library/re.html#match-objects
                #Loops through the matches of the search word in the row's Content
                for match in re.finditer(search_word_regex, str(row.Content)):

                    # creates a variable with all text that occurs
                    #  before the search word (i.e., feature in question)
                    current_before = str(row.Content)[:match.start()]

                    # creates a variable with all text that occurs
                    #  after the search word (i.e., feature in question)
                    current_after = str(row.Content)[match.end()+1:]

                    # creates a list of the values in cells of the row. Values
                    # that begin with "row." are copied from the existing row;
                    # the last three values come from the variables created in
                    # the immediately previous steps. The order must match the
                    # column order of the dataframe, since the values are paired
                    # with the column names by position below. 1 is added to the
                    # iteration number to reflect which match of the feature
                    # this is.
                    cell_values = [row.File, row.Line, row.Speaker, row.UttStartTime,
                                   row.UttEndTime, row.UttLength,
                                   row.Content, row.Content_cleaned,
                                   row.InstancesCountPerLine, row.FeatureCountPerLine,
                                   row.WadaSNRMeade, row.WadaSNRRigal, row.NistSNRRigal,
                                   row.SyllableCount, row.SpeechRate,
                                   row.amazon_transcription,
                                   row.amazon_transcription_cleaned,
                                   row.deepspeech_transcription,
                                   row.deepspeech_transcription_cleaned, row.deepspeech_ConfidenceLevel,
                                   row.google_transcription, row.google_transcription_cleaned,
                                   row.google_ConfidenceLevel, row.IBMWatson_transcription,
                                   row.IBMWatson_transcription_cleaned, row.IBMWatson_ConfidenceLevel,
                                   row.microsoft_transcription, row.microsoft_transcription_cleaned,
                                   iteration_number+1, current_before, current_after]

                    # zips together the column names with the cell values from the
                    # current row. Zipping creates tuples
                    zipped = zip(gs_df_column_names, cell_values)

                    # creates a python dictionary based on the zipped variable just
                    #  created. This is necessary to be able to append the row
                    #  back into the larger dataframe later.
                    line_dict = dict(zipped)

                    # appends the dictionary to the replicated lines empty list above
                    #  to be used for appending to the dataframe later
                    replicated_lines.append(line_dict)

                    # increases the iteration number by one so the next iteration number
                    #  will be correct
                    iteration_number += 1

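    # appends replicated lines to the gold standard dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    if replicated_lines:
        gs_df = pd.concat([gs_df, pd.DataFrame(replicated_lines)])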
    
    gs_df_1instance = gs_df[gs_df['InstancesCountPerLine'] == 1]
    
    gs_df_replicated = gs_df[(gs_df['InstancesCountPerLine'].astype(int)>1) & (gs_df['Content_PreFeature'].notna())]
    
    gs_df = pd.concat([gs_df_1instance, gs_df_replicated], ignore_index=True).sort_values(by=['File', 'Line'])
    
    return gs_df
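
The whole-word regular expression used in `duplicate_rows` can be examined in isolation. The helper below is a hypothetical variant written for this sketch, not part of the pipeline: it lists the two cases of the first letter in a character class and, as an added assumption, escapes the rest of the search word with `re.escape` so that any regex metacharacters would be treated literally.

```python
import re

def build_whole_word_regex(search_word):
    """Hypothetical helper: whole-word pattern, first letter in either case."""
    first, rest = search_word[0], search_word[1:]
    return rf"\b[{first.upper()}{first.lower()}]{re.escape(rest)}\b"

pattern = build_whole_word_regex("be")

# "Be" and "be" are whole words; the "be" inside "before" is not matched
matches = re.findall(pattern, "Be careful, he be working before noon")
```

The `\b` anchors are what keep the pattern from matching the search word inside a longer word.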

## Defining the Word Counting Function

This function takes two arguments:
1. The dataframe created in step 2.5
2. The word being searched for
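
The per-cell counting can be summarized with a small helper. This is a minimal sketch under the assumption that a word is any whitespace-separated token and that non-string cells (such as missing values) should yield NaN; `count_words` is a name introduced here for illustration.

```python
import math

def count_words(text):
    """Hypothetical helper: whitespace word count, NaN for non-string cells."""
    if not isinstance(text, str):
        return math.nan
    return len(text.split())

# A cleaned utterance, an empty string, and a missing value
counts = [count_words("i ai n't going"),
          count_words(""),
          count_words(float("nan"))]
```

A helper like this could be applied to a whole column with `Series.apply`, which would avoid repeating the same check for every transcription column.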

In [None]:
def get_word_counts(gs_df, search_word_string):

    """
    This function will get you:

    (1) word count for the full CORAAL utterance content
    (2) word count for CORAAL pre-feature utterance content
    (3) word count for CORAAL post-feature utterance content
    (4) columns containing the text of pre- and post- feature for CORAAL
    (5) word count for full ASR output content

    It will also rearrange the columns to a better reading order
    """

    import re

    gs_df["Content_cleaned_PreFeature"] = np.nan

    gs_df["Content_cleaned_PostFeature"] = np.nan


    # In Content_cleaned, contractions are tokenized with a space before the
    # clitic (e.g., "ain't" becomes "ai n't"), so the search word is converted
    # to its tokenized form before the regular expression is built
    cleaned_forms = {"ain't": "ai n't", "isn't": "is n't", "aren't": "are n't",
                     "I'm not": "I 'm not", "didn't": "did n't",
                     "haven't": "have n't", "hasn't": "has n't"}
    cleaned_search_word = cleaned_forms.get(search_word_string, search_word_string)

    # builds a regular expression that matches the search word only as a whole
    # word, with the first letter in either upper or lower case
    search_word_regex = f"\\b[{cleaned_search_word[0].upper()}{cleaned_search_word[0].lower()}]{cleaned_search_word[1:]}\\b"


    for row in gs_df.itertuples():

        
        if row.InstancesCountPerLine == 1:

            for match in re.finditer(search_word_regex, str(row.Content_cleaned)):

                # creates a variable with all text that occurs
                #  before the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_cleaned_PreFeature"] = str(row.Content_cleaned)[:match.start()]

                # creates a variable with all text that occurs
                #  after the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_cleaned_PostFeature"] = str(row.Content_cleaned)[match.end()+1:]
                
        else:
            
            #if there is more than one instance, the original code would only append the 
            #  pre- and post- of the first iteration. This should fix that
                        
            matches = []
            
            for match in re.finditer(search_word_regex, str(row.Content_cleaned)):
                
                matches.append(match)
                            
            if len(matches) > 0:
            
                #gets the iteration number and subtracts one to get the
                #  correct list index to draw the correlated match from
                iteration_number = row.IterationNumber - 1

                # creates a variable with all text that occurs
                #  before the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_cleaned_PreFeature"] = str(row.Content_cleaned)[:matches[iteration_number].start()]

                # creates a variable with all text that occurs
                #  after the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_cleaned_PostFeature"] = str(row.Content_cleaned)[matches[iteration_number].end()+1:]
                
            else:
                
                # creates a variable with all text that occurs
                #  before the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_cleaned_PreFeature"] = np.nan

                # creates a variable with all text that occurs
                #  after the search word (i.e., feature in question)
                gs_df.loc[row.Index, "Content_cleaned_PostFeature"] = np.nan
                
            
            
    #a list of new column names to be created
    new_column_names = ["Content_WordCount", "Content_PreFeature_WordCount", "Content_PostFeature_WordCount",
                    "Content_cleaned_WordCount", "Content_cleaned_PreFeature_WordCount",
                    "Content_cleaned_PostFeature_WordCount"]


    #loops through column names
    for column in new_column_names:
        
        #creates an empty column with that name
        gs_df[column] = np.nan
        
    #loops through the rows in the dataframe    
    for row in gs_df.itertuples():
        
        #the rest of this code gets the various counts or content
        
        #original content
        if type(row.Content) != str:
            
            gs_df.loc[row.Index, "Content_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "Content_WordCount"] = len(row.Content.split())
        
        
        #cleaned content
        if type(row.Content_cleaned) != str:
            
            gs_df.loc[row.Index, "Content_cleaned_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "Content_cleaned_WordCount"] = len(row.Content_cleaned.split())
        
        
        #original content pre-feature
        if type(row.Content_PreFeature) != str:
            
            gs_df.loc[row.Index, "Content_PreFeature_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "Content_PreFeature_WordCount"] = len(row.Content_PreFeature.split())
        
        
        #cleaned content pre-feature
        if type(row.Content_cleaned_PreFeature) != str:
            
            gs_df.loc[row.Index, "Content_cleaned_PreFeature_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "Content_cleaned_PreFeature_WordCount"] = len(row.Content_cleaned_PreFeature.split())
        
        
        
        #original content post-feature
        if type(row.Content_PostFeature) != str:
            
            gs_df.loc[row.Index, "Content_PostFeature_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "Content_PostFeature_WordCount"] = len(row.Content_PostFeature.split())
        
        
        #cleaned content post-feature
        if type(row.Content_cleaned_PostFeature) != str:
            
            gs_df.loc[row.Index, "Content_cleaned_PostFeature_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "Content_cleaned_PostFeature_WordCount"] = len(row.Content_cleaned_PostFeature.split())
        
        
        
        #### Amazon
        
        if type(row.amazon_transcription) != str:
            
            gs_df.loc[row.Index, "amazon_transcription_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "amazon_transcription_WordCount"] = len(row.amazon_transcription.split())
        
        
        if type(row.amazon_transcription_cleaned) != str:
            
            gs_df.loc[row.Index, "amazon_transcription_cleaned_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "amazon_transcription_cleaned_WordCount"] = len(row.amazon_transcription_cleaned.split())
        
        
        #### Deepspeech
        
        if type(row.deepspeech_transcription) != str:
            
            gs_df.loc[row.Index, "deepspeech_transcription_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "deepspeech_transcription_WordCount"] = len(row.deepspeech_transcription.split())
        
        
        if type(row.deepspeech_transcription_cleaned) != str:
            
            gs_df.loc[row.Index, "deepspeech_transcription_cleaned_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "deepspeech_transcription_cleaned_WordCount"] = len(row.deepspeech_transcription_cleaned.split())
        
        
        
        
        #### Google
        
        if type(row.google_transcription) != str:
            
            gs_df.loc[row.Index, "google_transcription_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "google_transcription_WordCount"] = len(row.google_transcription.split())
        
        
        if type(row.google_transcription_cleaned) != str:
            
            gs_df.loc[row.Index, "google_transcription_cleaned_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "google_transcription_cleaned_WordCount"] = len(row.google_transcription_cleaned.split())
        
        
        
        
        #### IBMWatson
        
        if type(row.IBMWatson_transcription) != str:
            
            gs_df.loc[row.Index, "IBMWatson_transcription_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "IBMWatson_transcription_WordCount"] = len(row.IBMWatson_transcription.split())
        
        
        if type(row.IBMWatson_transcription_cleaned) != str:
            
            gs_df.loc[row.Index, "IBMWatson_transcription_cleaned_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "IBMWatson_transcription_cleaned_WordCount"] = len(row.IBMWatson_transcription_cleaned.split())
        
        
        
        
        #### Microsoft
        
        if type(row.microsoft_transcription) != str:
            
            gs_df.loc[row.Index, "microsoft_transcription_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "microsoft_transcription_WordCount"] = len(row.microsoft_transcription.split())
        
        
        if type(row.microsoft_transcription_cleaned) != str:
            
            gs_df.loc[row.Index, "microsoft_transcription_cleaned_WordCount"] = np.nan
        
        else:
            
            gs_df.loc[row.Index, "microsoft_transcription_cleaned_WordCount"] = len(row.microsoft_transcription_cleaned.split())
        
        
    if search_word_string in ["ain't", "isn't", "aren't", "I'm not", "didn't", "hasn't", "haven't"]:
        
        gs_df = gs_df[['File', 'Line', 'Speaker', 
                           'UttStartTime', 'UttEndTime', 
                           'UttLength', 'AintVariation', 'InstancesCountPerLine', 
                           'FeatureCountPerLine','IterationNumber', 'Content', 
                           'SyllableCount', 'SpeechRate', 'WadaSNRMeade', 
                           'WadaSNRRigal', 'NistSNRRigal',
                           'Content_WordCount', 'Content_PreFeature', 'Content_PreFeature_WordCount',
                           'Content_PostFeature', 'Content_PostFeature_WordCount', 
                           'Content_cleaned', 'Content_cleaned_WordCount',
                           'Content_cleaned_PreFeature', 'Content_cleaned_PreFeature_WordCount',
                           'Content_cleaned_PostFeature', 'Content_cleaned_PostFeature_WordCount',
                           'amazon_transcription', 'amazon_transcription_WordCount',
                           'amazon_transcription_cleaned', 'amazon_transcription_cleaned_WordCount',
                           'deepspeech_transcription', 'deepspeech_transcription_WordCount', 
                           'deepspeech_transcription_cleaned', 'deepspeech_transcription_cleaned_WordCount',
                           'deepspeech_ConfidenceLevel', 
                           'google_transcription', 'google_transcription_WordCount',
                           'google_transcription_cleaned', 'google_transcription_cleaned_WordCount', 
                           'google_ConfidenceLevel', 'IBMWatson_transcription', 
                           'IBMWatson_transcription_WordCount', 
                           'IBMWatson_transcription_cleaned', 'IBMWatson_transcription_cleaned_WordCount', 
                           'IBMWatson_ConfidenceLevel', 
                           'microsoft_transcription', 'microsoft_transcription_WordCount', 
                           'microsoft_transcription_cleaned', 'microsoft_transcription_cleaned_WordCount']]
    
    else:        
        
        gs_df = gs_df[['File', 'Line', 'Speaker', 
                           'UttStartTime', 'UttEndTime', 
                           'UttLength', 'InstancesCountPerLine', 
                           'FeatureCountPerLine','IterationNumber', 'Content', 
                           'SyllableCount', 'SpeechRate', 'WadaSNRMeade', 
                           'WadaSNRRigal', 'NistSNRRigal',
                           'Content_WordCount', 'Content_PreFeature', 'Content_PreFeature_WordCount',
                           'Content_PostFeature', 'Content_PostFeature_WordCount', 
                           'Content_cleaned', 'Content_cleaned_WordCount',
                           'Content_cleaned_PreFeature', 'Content_cleaned_PreFeature_WordCount',
                           'Content_cleaned_PostFeature', 'Content_cleaned_PostFeature_WordCount',
                           'amazon_transcription', 'amazon_transcription_WordCount',
                           'amazon_transcription_cleaned', 'amazon_transcription_cleaned_WordCount',
                           'deepspeech_transcription', 'deepspeech_transcription_WordCount', 
                           'deepspeech_transcription_cleaned', 'deepspeech_transcription_cleaned_WordCount',
                           'deepspeech_ConfidenceLevel', 
                           'google_transcription', 'google_transcription_WordCount',
                           'google_transcription_cleaned', 'google_transcription_cleaned_WordCount', 
                           'google_ConfidenceLevel', 'IBMWatson_transcription', 
                           'IBMWatson_transcription_WordCount', 
                           'IBMWatson_transcription_cleaned', 'IBMWatson_transcription_cleaned_WordCount', 
                           'IBMWatson_ConfidenceLevel', 
                           'microsoft_transcription', 'microsoft_transcription_WordCount', 
                           'microsoft_transcription_cleaned', 'microsoft_transcription_cleaned_WordCount']]
    
    gs_df = gs_df.reset_index(drop=True)
    
    return gs_df

## Executing the Code

### Feature: Ain't

Before running the code for the *ain't* variations, the variations will be split into separate dataframes to be processed. These will be concatenated again in the end.
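
The split by variation could also be written as a loop. The sketch below uses a made-up miniature dataframe (`toy_df`) in place of the real one, and collects the subsets in a dictionary keyed by variation; both names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the ain't gold standard dataframe
toy_df = pd.DataFrame({
    "AintVariation": ["ain't", "isn't", "ain't"],
    "Content": ["she ain't here", "he isn't home", "it ain't so"],
})

variations = ["ain't", "isn't", "aren't", "I'm not",
              "didn't", "haven't", "hasn't"]

# .copy() gives each subset its own data, so adding columns to a subset
# later does not affect (or warn about) the original dataframe
variation_dfs = {
    variation: toy_df[toy_df["AintVariation"] == variation].copy()
    for variation in variations
}
```

Each entry of `variation_dfs` could then be passed through the two functions in turn, with the dictionary key supplying the search word.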

In [None]:
# .copy() makes each subset an independent dataframe, so the columns added
# by the functions below do not trigger pandas' SettingWithCopyWarning
aint_df = aint_gs_df[aint_gs_df["AintVariation"]=="ain't"].copy()
isnt_df = aint_gs_df[aint_gs_df["AintVariation"]=="isn't"].copy()
arent_df = aint_gs_df[aint_gs_df["AintVariation"]=="aren't"].copy()
imnot_df = aint_gs_df[aint_gs_df["AintVariation"]=="I'm not"].copy()
didnt_df = aint_gs_df[aint_gs_df["AintVariation"]=="didn't"].copy()
havent_df = aint_gs_df[aint_gs_df["AintVariation"]=="haven't"].copy()
hasnt_df = aint_gs_df[aint_gs_df["AintVariation"]=="hasn't"].copy()

In [None]:
# create the duplicated rows dataframe
aint_df = duplicate_rows(aint_df, "ain't")

# create the word count dataframe
aint_df = get_word_counts(aint_df, "ain't")

In [None]:
# create the duplicated rows dataframe
isnt_df = duplicate_rows(isnt_df, "isn't")

# create the word count dataframe
isnt_df = get_word_counts(isnt_df, "isn't")

In [None]:
# create the duplicated rows dataframe
arent_df = duplicate_rows(arent_df, "aren't")

# create the word count dataframe
arent_df = get_word_counts(arent_df, "aren't")

In [None]:
# create the duplicated rows dataframe
imnot_df = duplicate_rows(imnot_df, "I'm not")

# create the word count dataframe
imnot_df = get_word_counts(imnot_df, "I'm not")

In [None]:
# create the duplicated rows dataframe
didnt_df = duplicate_rows(didnt_df, "didn't")

# create the word count dataframe
didnt_df = get_word_counts(didnt_df, "didn't")

In [None]:
# create the duplicated rows dataframe
havent_df = duplicate_rows(havent_df, "haven't")

# create the word count dataframe
havent_df = get_word_counts(havent_df, "haven't")

In [None]:
# create the duplicated rows dataframe
hasnt_df = duplicate_rows(hasnt_df, "hasn't")

# create the word count dataframe
hasnt_df = get_word_counts(hasnt_df, "hasn't")

In [None]:
aint_gs_df = pd.concat([aint_df, isnt_df, arent_df, imnot_df, didnt_df, havent_df, hasnt_df])

### Feature: Be

In [None]:
# create the duplicated rows dataframe
be_gs_df = duplicate_rows(be_gs_df, "be")

# create the word count dataframe
be_gs_df = get_word_counts(be_gs_df, "be")

### Feature: Done

In [None]:
# create the duplicated rows dataframe
done_gs_df = duplicate_rows(done_gs_df, "done")

# create the word count dataframe
done_gs_df = get_word_counts(done_gs_df, "done")

## Sorting the Dataframes by File and Line

This will sort each dataframe first by filename and then by line number. Sorting at every step of the pipeline keeps the row order consistent from one step's output to the next.
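
A multi-key sort of this kind can be tried on a toy dataframe. Note that adding `IterationNumber` as a third key is an assumption of this sketch, not something the cells in this section do; it makes the order of duplicated rows within the same file and line explicit.

```python
import pandas as pd

# Hypothetical rows, deliberately out of order
toy_df = pd.DataFrame({
    "File": ["b", "a", "a", "a"],
    "Line": [1, 2, 2, 1],
    "IterationNumber": [1, 2, 1, 1],
})

# Sort by file, then line, then iteration, and renumber the index
sorted_df = (toy_df
             .sort_values(by=["File", "Line", "IterationNumber"])
             .reset_index(drop=True))
```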

### Feature: Ain't

In [None]:
aint_gs_df = aint_gs_df.sort_values(by=['File', 'Line'])

### Feature: Be

In [None]:
be_gs_df = be_gs_df.sort_values(by=['File', 'Line'])

### Feature: Done

In [None]:
done_gs_df = done_gs_df.sort_values(by=['File', 'Line'])

## Exporting Dataframes to CSV Files

This will export the dataframes to CSV files.
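
As a small sketch of the export step, the example below writes a toy dataframe to a CSV and reads it back. The temporary directory stands in for the real output path, the file name is made up, and `os.path.join` is used here so that the result does not depend on whether the path ends with a separator.

```python
import os
import tempfile
import pandas as pd

# Hypothetical dataframe standing in for one of the word count dataframes
toy_df = pd.DataFrame({"File": ["a"], "Line": [1]})

# A temporary directory stands in for the output path
with tempfile.TemporaryDirectory() as output_dir:
    out_file = os.path.join(output_dir, "toy_wordCounts.csv")

    # index=False keeps the dataframe index out of the CSV
    toy_df.to_csv(out_file, index=False)

    # Read the file back to confirm what was written
    round_trip = pd.read_csv(out_file)
```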

In [None]:
# Designate the output path where the CSVs will be stored
csv_output_path = "path"

### Feature: Ain't

In [None]:
aint_gs_df.to_csv(os.path.join(csv_output_path, "aint_variations_wordCounts.csv"), index=False)

### Feature: Be

In [None]:
be_gs_df.to_csv(os.path.join(csv_output_path, "be_wordCounts.csv"), index=False)

### Feature: Done

In [None]:
done_gs_df.to_csv(os.path.join(csv_output_path, "done_wordCounts.csv"), index=False)