# Step 1.6: Getting Quantitative Information about Feature Usage

This code produces several CSV files that summarize how the feature is used in each corpus. These CSVs include:

<ol>
<li>Information on subject tokens which precede the feature</li>
<li>Information on predicate tokens which follow the feature</li>
<li>Information on subject parts of speech which precede the feature</li>
<li>Information on predicate parts of speech which follow the feature</li>
<li>Information on word patterns which surround/co-occur with the feature</li>
<li>Information on part of speech patterns which surround/co-occur with the feature</li>
<li>A complete all corpora info CSV with the following new information added:
    <ul>
        <li>Total feature count for the corpus</li>
        <li>Total feature count normalized for the corpus</li>
        <li>List of unique filenames (filename types) where the feature occurs</li>
        <li>Total file count where the feature occurs</li>
        <li>Percent of total corpus files in which the feature occurs</li>
        <li>Total possible part of speech patterns count</li>
        <li>Total occurring part of speech patterns count</li>
        <li>Percent of total part of speech patterns which occur</li>
    </ul>
    </li>

</ol>
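The normalized count and percentage figures listed above follow simple arithmetic: a raw count divided by the corpus word total and scaled to 100,000 words, and a part divided by a whole and scaled to 100. The sketch below illustrates that arithmetic in isolation. The helper names `normalize_per_100k` and `percent_of_total` are hypothetical and do not appear in the function defined later, which formats the values with f-strings instead of `round`.

```python
def normalize_per_100k(feature_count, corpus_word_count):
    """Return occurrences per 100,000 words, rounded to three decimals.

    Hypothetical helper for illustration only.
    """
    return round(feature_count / corpus_word_count * 100_000, 3)


def percent_of_total(part, whole):
    """Return part as a percentage of whole, rounded to two decimals.

    Hypothetical helper for illustration only.
    """
    return round(part / whole * 100, 2)


# Made-up numbers: 42 occurrences in a 1,250,000-word corpus,
# found in 12 of the corpus' 150 files.
rate = normalize_per_100k(42, 1_250_000)
file_share = percent_of_total(12, 150)
print(rate, file_share)
```

Normalizing per 100,000 words makes counts comparable across corpora of very different sizes, which raw counts alone would not allow.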

## Required Packages

The following packages are necessary to run this code:
string, os, and re (all part of the Python standard library), plus [pandas](https://pypi.org/project/pandas/) and [numpy](https://pypi.org/project/numpy/)

## Define the Dataframe Creating Function

This function takes the following arguments:

<ol>
<li>The filepath to the split content CSV produced in Step 1.5</li>
<li>The filepath to the folder where the newly created CSVs will be stored</li>
<li>The word being searched for</li>
</ol>
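The function finds its input files by filename: it keeps `.csv` files that start with the search word (or `aint` for "ain't") and do not contain `info`, then reads the corpus name from the text between the first two underscores. The sketch below reproduces that selection logic on a plain list of names. The helper names and the example filenames are hypothetical, chosen only to match the pattern the function appears to expect.

```python
import re


def select_split_content_files(filenames, prefix):
    """Keep CSVs that start with the prefix and are not 'info' files.

    Hypothetical helper mirroring the filename filter used in the function.
    """
    return [name for name in filenames
            if name.endswith(".csv")
            and name.startswith(prefix)
            and "info" not in name]


def corpus_name_from_filename(filename):
    """Return the lowercased text between the first two underscores, or None."""
    match = re.search(r"_(.*?)_", filename)
    return match.group(1).lower() if match else None


# Hypothetical filenames for illustration.
names = ["aint_CORAAL_splitContent.csv",
         "aint_all_corpora_info.csv",
         "be_TIMIT_splitContent.csv",
         "aint_notes.txt"]

selected = select_split_content_files(names, "aint")
corpora = [corpus_name_from_filename(name) for name in selected]
print(selected, corpora)
```

Unlike this sketch, the function below calls `.group(1)` directly on the match, so a filename with fewer than two underscores would raise an `AttributeError` there.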

In [None]:
def get_corpora_stats_csvs(csv_input_path, csv_output_path, search_word_string):
    
    
    """
    Reads in split content csvs produced in step 1-5 and creates the following csvs:
    (1) A complete all corpora info csv with the following new information added:
        (a) Total feature count for the corpus
        (b) Total feature count normalized for the corpus
        (c) List of unique filenames (filename types) where the feature occurs
        (d) Total file count where the feature occurs
        (e) Percent of total corpus files in which the feature occurs
        (f) Total possible part of speech patterns count
        (g) Total occurring part of speech patterns count
        (h) Percent of total part of speech patterns which occur
    (2) Information on subject tokens which precede the feature
    (3) Information on predicate tokens which follow the feature
    (4) Information on subject parts of speech which precede the feature
    (5) Information on predicate parts of speech which follow the feature
    (6) Information on word patterns which surround/co-occur with the feature
    (7) Information on part of speech patterns which surround/co-occur with the feature
    """
    
    
    ####################
    
    
    import os
    import re
    import pandas as pd
    import numpy as np
    from string import punctuation
    
    
    ####################
    
    
    if search_word_string == "ain\'t":
        
        #creates a list of the split content .csv's which should be stored together in 
        # the same folder (csv_input_path)
        csv_filenames = [file for file in os.listdir(csv_input_path) 
                         if file.endswith(".csv") and
                         file.startswith("aint") and "info" not in file]
        
        # filename for the all_corpora_info csv from Step1-3
        #  creates a list of the one filename and then uses [0] to get the filename
        #  out of the list format
        all_corpora_info_csv_path = [os.path.join(csv_input_path, filename)
                                     for filename in os.listdir(csv_input_path) 
                                     if "all_corpora_info" in filename 
                                     and filename.startswith("aint")][0]
    
    else:
                                     
        #creates a list of the split content .csv's which should be stored together in 
        # the same folder (csv_input_path)
        csv_filenames = [file for file in os.listdir(csv_input_path) 
                         if file.endswith(".csv") and
                         file.startswith(search_word_string) and "info" not in file]

        # filename for the all_corpora_info csv from Step1-3
        #  creates a list of the one filename and then uses [0] to get the filename
        #  out of the list format
        all_corpora_info_csv_path = [os.path.join(csv_input_path, filename)
                                     for filename in os.listdir(csv_input_path) 
                                     if "all_corpora_info" in filename 
                                     and filename.startswith(search_word_string)][0]

    #creates a list of tuples with the csv input full paths and the corpus name
    filePath_corpusName = [(os.path.join(csv_input_path, filename),
                            re.search(r"_(.*?)_", filename).group(1).lower())
                           for filename in csv_filenames if "info" not in filename]
    
    
    ####################
    
    
    # creates a dataframe from the all_corpora_info_df csv
    all_corpora_info_df = pd.read_csv(f"{all_corpora_info_csv_path}", index_col=0)
    
    # lowercases the column names of this dataframe, which are corpus names
    #  this ensures the corpus-name lookups below will function correctly
    all_corpora_info_df.columns = map(str.lower, all_corpora_info_df.columns)
    
    
    ####################
    
    
    # creates an empty dataframe for feature count, normalized feature count,
    # file types, file types count, file types percent total, possible POS
    # patterns count, actually occurring POS patterns count, percent of total
    # possible POS patterns that actually occur
    # this dataframe will be filled in and then appended to the all_corpora_info
    # dataframe later
    
    all_corpora_to_append_df = pd.DataFrame(columns=['coraal',
                                                     'fisher',
                                                     'librispeech',
                                                     'switchboardhub5', 
                                                     'timit'],
                                            index=['FeatureCount', 
                                                   'FeatureCountNormalized',
                                                   'FilesWithFeatureCount', 
                                                   'PercentTotalFilesWithFeature',
                                                   'FilesWithFeatureList', 
                                                   'SubjectWordTypeCount',
                                                   'PredicateWordTypeCount',
                                                   'PossiblePOSPatternCount',
                                                   'OccuringPOSPatternCount',
                                                   'PercentTotalPOSPatternOccuringCount',
                                                   'PossiblePOSPatternList',
                                                   'OccurringPOSPatternList'])
    
    # creates an empty list for possible part of speech patterns
    #  this will be used later to calculate percent of total possible patterns
    #  that actually occur in each corpus
    total_possible_POS_patterns = []
    
    ###################
    
    
    # loops through the filepath and corpus name tuples list
    for file_path, corpus_name in filePath_corpusName:
    
        # creates a variable of the corpus' total word count
        corpus_word_count_total = all_corpora_info_df.loc['TotalCorpusWordCount', corpus_name]
        
        # creates a variable of the corpus' total file count
        corpus_file_count_total = all_corpora_info_df.loc['TotalFileCount', corpus_name]
        
        # creates a string of punctuation markers to be used to filter out punctuation
        #  in cleaning stages later
        punctuation_modified = punctuation.replace("'", "").replace("_", "")

        
        ###################
        
        # creates a dataframe from the split content csv for the corpus
        patterns_df = pd.read_csv(file_path)

        # if the feature count in a corpus is zero, skips the corpus
        if len(patterns_df) == 0:
            
            continue
            
        else:
        
            ####################


            # calculates the number of features in the corpora by counting
            #  the number of lines in the split content dataframe
            #  adds the total to the dataframe
            all_corpora_to_append_df.loc['FeatureCount', corpus_name] = len(patterns_df)

            # normalizes the number of features in the corpora by taking the feature count
            #  dividing by the total number of words in the corpus and multiplying by 100,000
            #  this gives a measure of feature occurrence per 100,000 words
            #  adds the normalized count to the dataframe
            all_corpora_to_append_df.loc['FeatureCountNormalized', corpus_name] = float(f"{(len(patterns_df)/corpus_word_count_total)*100000:.3f}")

            # calculates the total number of unique files in which the feature occurs at least once
            #  adds the total to the dataframe
            all_corpora_to_append_df.loc['FilesWithFeatureCount', corpus_name] = len(sorted(set(patterns_df['File'])))

            # calculates the percent of total files in which the feature occurs
            #  adds the percentage to the dataframe
            all_corpora_to_append_df.loc['PercentTotalFilesWithFeature', corpus_name] = float(f"{(len(sorted(set(patterns_df['File'])))/corpus_file_count_total)*100:.2f}")

            # creates a list of files in which the feature occurs at least once
            #  adds the list to the dataframe
            all_corpora_to_append_df.loc['FilesWithFeatureList', corpus_name] = sorted(set(patterns_df['File']))


            ####################


            # creates a list of all subject tokens, lowercased, stripped of white space on right and left, and punctuation
            #  other than ' and _ removed. 
            subject_tokens = sorted([token.lower().strip().translate(str.maketrans("", "", punctuation_modified)) 
                                     for token in list(patterns_df['SubjectWordToken'])])

            # creates a list of unique subject word types
            subject_types = sorted(set(subject_tokens))

            # calculates the total number of subject tokens
            subject_tokens_count = len(subject_tokens)

            # calculates the total number of subject word types                                                                         
            subject_types_count = len(subject_types)

            # adds the total number of subject word types to the all_corpora dataframe                                                                       
            all_corpora_to_append_df.loc['SubjectWordTypeCount', corpus_name] = subject_types_count



            # creates a dataframe for subject word tokens
            #  uses subject word types as row index names
            subject_tokens_df = pd.DataFrame(index=subject_types)

            # counts the number of token occurrences of each subject word type
            #  adds the total as a new column to the subject_tokens dataframe
            subject_tokens_df['SubjectRawCount'] = [subject_tokens.count(subject_type) for subject_type in subject_types]

            # calculates the normalized count of subject word occurrences in the corpus per 100,000 words
            #  adds the normalized count to the subject_tokens dataframe
            subject_tokens_df['SubjectNormalizedCount'] = [float(f"{(subject_tokens.count(subject_type)/corpus_word_count_total)*100000:.3f}") 
                                                           for subject_type in subject_types]

            # calculates each subject type's share of the total subject token count, as a percent
            #  adds the percentage to the subject_tokens dataframe
            subject_tokens_df['SubjectPercentTotal'] = [float(f"{(subject_tokens.count(subject_type) / subject_tokens_count)*100:.2f}") 
                                                        for subject_type in subject_types]



            # creates an empty list to be appended to
            subject_token_files = []

            # loops through the rows in the patterns_df
            for row in patterns_df.itertuples():

                #creates a tuple of the cleaned subject word token and the file it occurs within
                # appends the tuple to the empty subject_token_files list
                subject_token_files.append((row.SubjectWordToken.lower().strip().translate(str.maketrans("", "", punctuation_modified)), row.File))

            # creates a list of unique subject token + file occurrences
            # in other words, if a subject occurs more than once in a file, that subject-file combo only counts once
            subject_type_files = sorted(set(subject_token_files))

            # creates an empty dictionary to be appended to
            subject_type_files_dict = {}

            # loops through the list of subject_type_files tuples
            for type_file in subject_type_files:

                # if the subject word is in the dictionary as a key already,
                #  the filename is added to the list of filenames stored in the value for that key
                if type_file[0] in subject_type_files_dict:
                    subject_type_files_dict[type_file[0]].append(type_file[1])

                # if the subject word is not in the dictionary yet,
                #  creates a key of that subject word and creates a list with the filename to be the value
                else:
                    subject_type_files_dict[type_file[0]] = [type_file[1]]


            # loops through the key, value pairs in the subject_type_files_dict
            #  and calculates the length of the list to get the number of unique files in which the subject word occurs
            #  adds the resulting list as a column to subject_tokens_df
            subject_tokens_df['SubjectFileCount'] = [len(filename_list) for subject_type, filename_list in subject_type_files_dict.items()]

            # loops through the key, value pairs in the subject_type_files_dict
            #  and calculates the percent of the total number of files in which each subject word occurs
            #  adds the resulting list as a column to subject_tokens_df
            subject_tokens_df['SubjectFilePercentTotal'] = [float(f"{(len(filename_list)/corpus_file_count_total)*100:.2f}") 
                                                           for subject_type, filename_list in subject_type_files_dict.items()]

            # loops through the key, value pairs in the subject_type_files_dict
            #  and stores the list of filenames for each subject word
            #  adds the resulting list as a column to subject_tokens_df
            subject_tokens_df['SubjectFileList'] = [filename_list for subject_type, filename_list in subject_type_files_dict.items()]

            # sorts the subject_tokens_df by normalized count of the subject word, from highest to lowest normalized count
            subject_tokens_df = subject_tokens_df.sort_values(by='SubjectNormalizedCount', ascending=False)
            
            if search_word_string == "ain\'t":
                
                # exports the dataframe to a csv file
                subject_tokens_df.to_csv(f"{csv_output_path}/aint_{corpus_name}_subjectTokens.csv")
                
            else:
                
                # exports the dataframe to a csv file
                subject_tokens_df.to_csv(f"{csv_output_path}/{search_word_string}_{corpus_name}_subjectTokens.csv")

            
            
            ####################   



            # the following code follows the exact same process as for word tokens and types
            #  and applies it to parts of speech. please see documentation for corresponding
            #  lines above

            subject_POS_tokens = sorted([subject_POS for subject_POS in list(patterns_df['SubjectPOS'])])

            subject_POS_types = sorted(set(subject_POS_tokens))

            subject_POS_tokens_count = len(subject_POS_tokens)

            subject_POS_types_count = len(subject_POS_types)



            subject_POS_df = pd.DataFrame(index=subject_POS_types)

            subject_POS_df['SubjectPOSRawCount'] = [subject_POS_tokens.count(subject_POS_type) for subject_POS_type in subject_POS_types]

            subject_POS_df['SubjectPOSNormalizedCount'] = [float(f"{(subject_POS_tokens.count(subject_POS_type)/corpus_word_count_total)*100000:.3f}") 
                                                           for subject_POS_type in subject_POS_types]

            subject_POS_df['SubjectPOSPercentTotal'] = [float(f"{(subject_POS_tokens.count(subject_POS_type) / subject_POS_tokens_count)*100:.2f}") 
                                                        for subject_POS_type in subject_POS_types]



            subject_POS_token_files = []

            for row in patterns_df.itertuples():
                
                subject_POS_token_files.append((row.SubjectPOS, row.File))

            subject_POS_type_files = sorted(set(subject_POS_token_files))

            subject_POS_type_files_dict = {}

            for POS_type_file in subject_POS_type_files: 

                if POS_type_file[0] in subject_POS_type_files_dict:
                    subject_POS_type_files_dict[POS_type_file[0]].append(POS_type_file[1])

                else:
                    subject_POS_type_files_dict[POS_type_file[0]] = [POS_type_file[1]]



            subject_POS_df['SubjectPOSFileCount'] = [len(filename_list) for subject_POS_type, filename_list in subject_POS_type_files_dict.items()]

            subject_POS_df['SubjectPOSFilePercentTotal'] = [float(f"{(len(filename_list)/corpus_file_count_total)*100:.2f}") 
                                                                    for subject_POS_type, filename_list in subject_POS_type_files_dict.items()]

            subject_POS_df['SubjectPOSFileList'] = [filename_list for subject_POS_type, filename_list in subject_POS_type_files_dict.items()]

            subject_POS_df = subject_POS_df.sort_values(by='SubjectPOSNormalizedCount', ascending=False)
            
            if search_word_string == "ain\'t":
                
                # exports the dataframe to a csv file
                subject_POS_df.to_csv(f"{csv_output_path}/aint_{corpus_name}_subjectPOS.csv")
                
            else:
                
                # exports the dataframe to a csv file
                subject_POS_df.to_csv(f"{csv_output_path}/{search_word_string}_{corpus_name}_subjectPOS.csv")

            
            
            ####################                                                                       



            # creates a list of all predicate tokens, lowercased, stripped of white space on right and left, and punctuation
            #  other than ' and _ removed.                                                                          
            predicate_tokens = sorted([token.lower().strip().translate(str.maketrans("", "", punctuation_modified)) 
                                       for token in list(patterns_df['PredicateWordToken'])])

            # creates a list of unique predicate word types                                                                         
            predicate_types = sorted(set(predicate_tokens))

            # calculates the total number of predicate tokens                                                                         
            predicate_tokens_count = len(predicate_tokens)

            # calculates the total number of predicate word types                                                                         
            predicate_types_count = len(predicate_types)

            # adds the total number of predicate word types to the all_corpora dataframe                                                                       
            all_corpora_to_append_df.loc['PredicateWordTypeCount', corpus_name] = predicate_types_count



            # creates a dataframe for predicate word tokens
            #  uses predicate word types as row index names
            predicate_tokens_df = pd.DataFrame(index=predicate_types) 

            # counts the number of token occurrences of each predicate word type
            #  adds the total as a new column to the predicate_tokens dataframe
            predicate_tokens_df['PredicateRawCount'] = [predicate_tokens.count(predicate_type) for predicate_type in predicate_types] 

            # calculates the normalized count of predicate word occurrences in the corpus per 100,000 words
            #  adds the normalized count to the predicate_tokens dataframe
            predicate_tokens_df['PredicateNormalizedCount'] = [float(f"{(predicate_tokens.count(predicate_type)/corpus_word_count_total)*100000:.3f}") 
                                                               for predicate_type in predicate_types]

            # calculates each predicate type's share of the total predicate token count, as a percent
            #  adds the percentage to the predicate_tokens dataframe
            predicate_tokens_df['PredicatePercentTotal'] = [float(f"{(predicate_tokens.count(predicate_type) / predicate_tokens_count)*100:.2f}") 
                                                            for predicate_type in predicate_types]



            # creates an empty list to be appended to
            predicate_token_files = []

            # loops through the rows in the patterns_df
            for row in patterns_df.itertuples():   

                #creates a tuple of the cleaned predicate word token and the file it occurs within
                # appends the tuple to the empty predicate_token_files list
                predicate_token_files.append((row.PredicateWordToken.lower().strip().translate(str.maketrans("", "", punctuation_modified)), row.File))

            # creates a list of unique predicate token + file occurrences
            # in other words, if a predicate occurs more than once in a file, that predicate-file combo only counts once
            predicate_type_files = sorted(set(predicate_token_files))

            # creates an empty dictionary to be appended to
            predicate_type_files_dict = {}

            # loops through the list of predicate_type_files tuples
            for type_file in predicate_type_files:  

                # if the predicate word is in the dictionary as a key already,
                #  the filename is added to the list of filenames stored in the value for that key
                if type_file[0] in predicate_type_files_dict:
                    
                    predicate_type_files_dict[type_file[0]].append(type_file[1])  

                # if the predicate word is not in the dictionary yet,
                #  creates a key of that predicate word and creates a list with the filename to be the value
                else:
                    predicate_type_files_dict[type_file[0]] = [type_file[1]]


            # loops through the key, value pairs in the predicate_type_files_dict
            #  and calculates the length of the list to get the number of unique files in which the predicate word occurs
            #  adds the resulting list as a column to predicate_tokens_df
            predicate_tokens_df['PredicateFileCount'] = [len(filename_list) for predicate_type, filename_list in predicate_type_files_dict.items()] 

            # loops through the key, value pairs in the predicate_type_files_dict
            #  and calculates the percent of the total number of files in which each predicate word occurs
            #  adds the resulting list as a column to predicate_tokens_df
            predicate_tokens_df['PredicateFilePercentTotal'] = [float(f"{(len(filename_list)/corpus_file_count_total)*100:.2f}") 
                                                           for predicate_type, filename_list in predicate_type_files_dict.items()]

            # loops through the key, value pairs in the predicate_type_files_dict
            #  and stores the list of filenames for each predicate word
            #  adds the resulting list as a column to predicate_tokens_df
            predicate_tokens_df['PredicateFileList'] = [filename_list for predicate_type, filename_list in predicate_type_files_dict.items()]

            # sorts the predicate_tokens_df by normalized count of the predicate word, from highest to lowest normalized count
            predicate_tokens_df = predicate_tokens_df.sort_values(by='PredicateNormalizedCount', ascending=False)
            
            
            if search_word_string == "ain\'t":
                
                # exports the dataframe to a csv file
                predicate_tokens_df.to_csv(f"{csv_output_path}/aint_{corpus_name}_predicateTokens.csv")
                
            else:
                
                # exports the dataframe to a csv file
                predicate_tokens_df.to_csv(f"{csv_output_path}/{search_word_string}_{corpus_name}_predicateTokens.csv")
            
            
            
            ####################



            # the following code follows the exact same process as for word tokens and types
            #  and applies it to parts of speech. please see documentation for corresponding
            #  lines above

            predicate_POS_tokens = sorted([predicate_POS for predicate_POS in list(patterns_df['PredicatePOS'])])

            predicate_POS_types = sorted(set(predicate_POS_tokens))

            predicate_POS_tokens_count = len(predicate_POS_tokens)

            predicate_POS_types_count = len(predicate_POS_types)



            predicate_POS_df = pd.DataFrame(index=predicate_POS_types)

            predicate_POS_df['PredicatePOSRawCount'] = [predicate_POS_tokens.count(predicate_POS_type) 
                                                        for predicate_POS_type in predicate_POS_types]

            predicate_POS_df['PredicatePOSNormalizedCount'] = [float(f"{(predicate_POS_tokens.count(predicate_POS_type)/corpus_word_count_total)*100000:.3f}") 
                                                               for predicate_POS_type in predicate_POS_types]

            predicate_POS_df['PredicatePOSPercentTotal'] = [float(f"{(predicate_POS_tokens.count(predicate_POS_type) / predicate_POS_tokens_count)*100:.2f}") 
                                                            for predicate_POS_type in predicate_POS_types]



            predicate_POS_token_files = []

            for row in patterns_df.itertuples():
                
                predicate_POS_token_files.append((row.PredicatePOS, row.File))

            predicate_POS_type_files = sorted(set(predicate_POS_token_files))

            predicate_POS_type_files_dict = {}

            for POS_type_file in predicate_POS_type_files: 

                if POS_type_file[0] in predicate_POS_type_files_dict:
                    
                    predicate_POS_type_files_dict[POS_type_file[0]].append(POS_type_file[1])

                else:
                    predicate_POS_type_files_dict[POS_type_file[0]] = [POS_type_file[1]]



            predicate_POS_df['PredicatePOSFileRawCount'] = [len(filename_list) for predicate_POS_type, filename_list in predicate_POS_type_files_dict.items()]

            predicate_POS_df['PredicatePOSFilePercentTotal'] = [float(f"{(len(filename_list)/corpus_file_count_total)*100:.2f}") 
                                                                       for predicate_POS_type, filename_list in predicate_POS_type_files_dict.items()]

            predicate_POS_df['PredicatePOSFileList'] = [filename_list for predicate_POS_type, filename_list in predicate_POS_type_files_dict.items()]

            predicate_POS_df = predicate_POS_df.sort_values(by='PredicatePOSNormalizedCount', ascending=False)
            
            
            if search_word_string == "ain\'t":
                
                # exports the dataframe to a csv file
                predicate_POS_df.to_csv(f"{csv_output_path}/aint_{corpus_name}_predicatePOS.csv")
                
            else:
                
                # exports the dataframe to a csv file
                predicate_POS_df.to_csv(f"{csv_output_path}/{search_word_string}_{corpus_name}_predicatePOS.csv")
            

            
            
            ####################


            # the following code follows the exact same process as for word tokens and types
            #  and applies it to word patterns. these are the collocations of subject-predicate
            #  word token/types that co-occur with the feature
            #  please see documentation for corresponding lines above

            word_pattern_tokens = sorted([word_pattern.lower().strip().translate(str.maketrans("", "", punctuation_modified)) 
                                          for word_pattern in list(patterns_df['WordPattern'])])

            word_pattern_types = sorted(set(word_pattern_tokens))

            word_pattern_tokens_count = len(word_pattern_tokens)

            word_pattern_types_count = len(word_pattern_types)



            word_pattern_df = pd.DataFrame(index=word_pattern_types)

            word_pattern_df['WordPatternRawCount'] = [word_pattern_tokens.count(word_pattern_type) for word_pattern_type in word_pattern_types]

            word_pattern_df['WordPatternNormalizedCount'] = [float(f"{(word_pattern_tokens.count(word_pattern_type)/corpus_word_count_total)*100000:.3f}") 
                                                             for word_pattern_type in word_pattern_types]

            word_pattern_df['WordPatternPercentTotal'] = [float(f"{(word_pattern_tokens.count(word_pattern_type) / word_pattern_tokens_count)*100:.2f}") 
                                                          for word_pattern_type in word_pattern_types]                                                                         



            word_pattern_token_files = []

            for row in patterns_df.itertuples():
                
                word_pattern_token_files.append((row.WordPattern.lower().strip().translate(str.maketrans("", "", punctuation_modified)), row.File))

            word_pattern_type_files = sorted(set(word_pattern_token_files))

            word_pattern_type_files_dict = {}

            for word_pattern_type_file in word_pattern_type_files: 

                if word_pattern_type_file[0] in word_pattern_type_files_dict:
                    
                    word_pattern_type_files_dict[word_pattern_type_file[0]].append(word_pattern_type_file[1])

                else:
                    word_pattern_type_files_dict[word_pattern_type_file[0]] = [word_pattern_type_file[1]]


            word_pattern_df['WordPatternFileRawCount'] = [len(filename_list) for word_pattern_type, filename_list in word_pattern_type_files_dict.items()]

            word_pattern_df['WordPatternFilePercentTotal'] = [float(f"{(len(filename_list)/corpus_file_count_total)*100:.2f}") 
                                                                    for word_pattern_type, filename_list in word_pattern_type_files_dict.items()]

            word_pattern_df['WordPatternFileList'] = [filename_list for word_pattern_type, filename_list in word_pattern_type_files_dict.items()]

            word_pattern_df = word_pattern_df.sort_values(by='WordPatternNormalizedCount', ascending=False)                                                                         
            
            
            if search_word_string == "ain\'t":
                
                # exports the dataframe to a csv file
                word_pattern_df.to_csv(f"{csv_output_path}/aint_{corpus_name}_wordPatterns.csv")
                
            else:
                
                # exports the dataframe to a csv file
                word_pattern_df.to_csv(f"{csv_output_path}/{search_word_string}_{corpus_name}_wordPatterns.csv")
            

            
            
            ####################


            # the following code follows the exact same process as for word tokens and types
            #  and applies it to part of speech patterns. these are the collocations of subject-predicate
            #  part of speech token/types that co-occur with the feature
            #  please see documentation for corresponding lines above


            POS_pattern_tokens = sorted([POS_pattern for POS_pattern in list(patterns_df['POSPattern'])])

            POS_pattern_types = sorted(set(POS_pattern_tokens))

            POS_pattern_tokens_count = len(POS_pattern_tokens)

            POS_pattern_types_count = len(POS_pattern_types)



            POS_pattern_df = pd.DataFrame(index=POS_pattern_types)

            POS_pattern_df['POSPatternRawCount'] = [POS_pattern_tokens.count(POS_pattern_type) for POS_pattern_type in POS_pattern_types]

            POS_pattern_df['POSPatternNormalizedCount'] = [float(f"{(POS_pattern_tokens.count(POS_pattern_type)/corpus_word_count_total)*100000:.3f}") 
                                                           for POS_pattern_type in POS_pattern_types]

            POS_pattern_df['POSPatternPercentTotal'] = [float(f"{(POS_pattern_tokens.count(POS_pattern_type) / POS_pattern_tokens_count)*100:.2f}") 
                                                        for POS_pattern_type in POS_pattern_types]



            POS_pattern_token_files = []

            for row in patterns_df.itertuples():
                
                POS_pattern_token_files.append((row.POSPattern, row.File))

            POS_pattern_type_files = sorted(set(POS_pattern_token_files))

            POS_pattern_type_files_dict = {}

            for POS_type_file in POS_pattern_type_files: 

                if POS_type_file[0] in POS_pattern_type_files_dict:
                    
                    POS_pattern_type_files_dict[POS_type_file[0]].append(POS_type_file[1])

                else:
                    POS_pattern_type_files_dict[POS_type_file[0]] = [POS_type_file[1]]


            POS_pattern_df['POSPatternFileRawCount'] = [len(filename_list) for POS_pattern_type, filename_list in POS_pattern_type_files_dict.items()]

            POS_pattern_df['POSPatternFilePercentTotal'] = [float(f"{(len(filename_list)/corpus_file_count_total)*100:.2f}") 
                                                                       for POS_pattern_type, filename_list in POS_pattern_type_files_dict.items()]

            POS_pattern_df['POSPatternFileList'] = [filename_list for POS_pattern_type, filename_list in POS_pattern_type_files_dict.items()]

            POS_pattern_df = POS_pattern_df.sort_values(by='POSPatternNormalizedCount', ascending=False)

            
            if search_word_string == "ain\'t":
                
                # exports the dataframe to a csv file
                POS_pattern_df.to_csv(f"{csv_output_path}/aint_{corpus_name}_POSPatterns.csv")
                
            else:
                
                # exports the dataframe to a csv file
                POS_pattern_df.to_csv(f"{csv_output_path}/{search_word_string}_{corpus_name}_POSPatterns.csv")
        

            
            
            ####################



            # creates a list of all possible part of speech patterns based on
            #  the part of speech patterns that occur within the corpus
            #  utilizes the subject_POS_types and predicate_POS_types lists from above                                                                        
            possible_POS_pattern_types = [f"{subject_POS_type}_{search_word_string}_{predicate_POS_type}" 
                                          for subject_POS_type in subject_POS_types 
                                          for predicate_POS_type in predicate_POS_types]

            # removes part of speech patterns that include disfluencies
            remove_possible_disfluencies = [POS_pattern for POS_pattern in possible_POS_pattern_types 
                                            if 'Disfluency' not in POS_pattern]

            # removes part of speech patterns that include constituent boundaries
            remove_possible_constituent_boundaries = [POS_pattern for POS_pattern in remove_possible_disfluencies 
                                                     if 'Constituent' not in POS_pattern]

            # reassigns to a clearer variable name for readability
            possible_POS_pattern_types = remove_possible_constituent_boundaries

            # appends each possible part of speech pattern to the total list created at the start of the function
            for possible_POS_pattern_type in possible_POS_pattern_types:
                
                total_possible_POS_patterns.append(possible_POS_pattern_type)
                                                                                 
           
                                                                                     
        ####################   
                                                                                     
                  
                                                                                     
        # creates a type list (unique values) of possible part of speech patterns from the total list
        total_possible_POS_pattern_types = sorted(set(total_possible_POS_patterns))
                                        
        # calculates the total number of possible part of speech patterns based on
        #  all part of speech patterns that occur across all corpora
        possible_POS_patterns_count = len(total_possible_POS_pattern_types)
        
        # loops through the filepath and corpus name tuples list
        for file_path, corpus_name in filePath_corpusName:
            
            # creates a dataframe from the split content csv for the corpus
            patterns_df = pd.read_csv(file_path)

            # if the feature count in a corpus is zero, skips the corpus
            if len(patterns_df) == 0:
                
                continue

            else:
        
                #  adds the list of possible part of speech patterns to the dataframe
                all_corpora_to_append_df.at['PossiblePOSPatternList', corpus_name] = total_possible_POS_pattern_types
                
                #  adds the total count of possible part of speech patterns to the dataframe
                all_corpora_to_append_df.at['PossiblePOSPatternCount', corpus_name] = possible_POS_patterns_count
                                                                                     
                   
                                                                                     
        ####################   
                                                                                     
                  
                                                                                     
        # loops through the filepath and corpus name tuples list
        for file_path, corpus_name in filePath_corpusName:

            # creates a dataframe from the split content csv for the corpus
            patterns_df = pd.read_csv(file_path)

            # if the feature count in a corpus is zero, skips the corpus
            if len(patterns_df) == 0:
                continue

            else:
            
                # creates a list of part of speech pattern tokens                                                                      
                POS_pattern_tokens = sorted([POS_pattern for POS_pattern in list(patterns_df['POSPattern'])])
                
                # creates a list of part of speech pattern types 
                POS_pattern_types = sorted(set(POS_pattern_tokens))

                # creates a list of actually occurring part of speech patterns in the corpus
                # utilizes the POS_pattern_types list
                # removes part of speech patterns that include disfluencies
                remove_occuring_disfluencies = [POS_pattern for POS_pattern in POS_pattern_types 
                                                if 'Disfluency' not in POS_pattern]

                # removes part of speech patterns that include constituent boundaries
                remove_occuring_constituent_boundaries = [POS_pattern for POS_pattern in remove_occuring_disfluencies 
                                                         if 'Constituent' not in POS_pattern]

                # reassigns to a clearer variable name for readability
                occuring_POS_patterns = remove_occuring_constituent_boundaries

                # calculates the total number of actually occurring part of speech patterns
                occuring_POS_patterns_count = len(occuring_POS_patterns)

                # adds the total number of actually occurring part of speech patterns in the corpus to the dataframe
                all_corpora_to_append_df.loc['OccuringPOSPatternCount', corpus_name] = occuring_POS_patterns_count

                                                                                     
                # calculates the percentage of all possible part of speech patterns that actually occur
                #  and adds the percentage to the dataframe
                all_corpora_to_append_df.loc['PercentTotalPOSPatternOccuringCount', corpus_name] = float(
                    f"{occuring_POS_patterns_count/possible_POS_patterns_count*100:.2f}")
               
                # adds the list of actually occurring part of speech patterns in the corpus to the dataframe
                all_corpora_to_append_df.loc['OccurringPOSPatternList', corpus_name] = occuring_POS_patterns                                                                     
        
                                                                                     
    ####################
                                                                                     
    # combines the original all_corpora_info and the new all_corpora_to_append_df
    #  uses pd.concat because DataFrame.append was removed in pandas 2.0
    complete_all_corpora_df = pd.concat([all_corpora_info_df, all_corpora_to_append_df])
    
    if search_word_string == "ain\'t":
        
        # exports the dataframe to a csv file
        complete_all_corpora_df.to_csv(f"{csv_output_path}/aint_complete_all_corpora_info.csv")
        
    else:
    
        # exports the dataframe to a csv file
        complete_all_corpora_df.to_csv(f"{csv_output_path}/{search_word_string}_complete_all_corpora_info.csv")
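
The possible versus occurring part of speech pattern comparison at the end of the function can be illustrated in isolation. The following is a minimal sketch, not the function's own code: `pattern_coverage` is a hypothetical helper, the tag labels are invented for the example, and it works on a single pool of types, whereas the function above pools possible patterns across all corpora before comparing each corpus against that pool.

```python
def pattern_coverage(subject_pos_types, predicate_pos_types, occurring_patterns, search_word):
    """Return (possible_count, occurring_count, percent_occurring) for POS patterns.

    Hypothetical helper for illustration only; assumes at least one possible pattern remains
    after filtering.
    """
    excluded = ("Disfluency", "Constituent")

    # every subject/predicate combination around the search word
    possible = {
        f"{subject}_{search_word}_{predicate}"
        for subject in subject_pos_types
        for predicate in predicate_pos_types
    }

    # drop patterns that involve disfluencies or constituent boundaries
    possible = {p for p in possible if not any(tag in p for tag in excluded)}
    occurring = {p for p in occurring_patterns if not any(tag in p for tag in excluded)}

    # same rounding approach as the function above: format to 2 decimals, then back to float
    percent = float(f"{len(occurring) / len(possible) * 100:.2f}")
    return len(possible), len(occurring), percent


# invented tag labels; the real labels come from the split content CSVs
possible_count, occurring_count, percent = pattern_coverage(
    ["Pronoun", "Noun", "Disfluency"],
    ["Verb", "Adjective"],
    ["Pronoun_be_Verb", "Noun_be_Verb", "Disfluency_be_Verb", "Pronoun_be_Verb"],
    "be",
)
print(possible_count, occurring_count, percent)  # 4 2 50.0
```

Here three subject types and two predicate types give six combinations, four of which remain once the disfluency subject is excluded; two of those four occur, so the coverage is 50 percent.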

## Creating Quantitative Dataframes and Exporting Dataframes to CSV Files

The cells below run the function for each feature, creating the dataframes and exporting them as CSV files.
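
As a quick way to check a run, the file names that should appear in the output folder can be listed ahead of time. The sketch below is an assumption-laden illustration: `output_csv_names` is a hypothetical helper and `"CorpusA"` is a placeholder corpus name. It covers only the word pattern, part of speech pattern, and complete corpora info files, following the naming used in the export steps of the function; corpora with a feature count of zero are skipped by the function and will not have pattern files.

```python
def output_csv_names(search_word, corpus_names):
    """List the pattern and summary CSV names expected for one search word.

    Hypothetical helper; mirrors the f-string naming used in the export steps.
    """
    # the apostrophe is dropped from "ain't" in the file names
    prefix = "aint" if search_word == "ain't" else search_word

    names = []
    for corpus_name in corpus_names:
        names.append(f"{prefix}_{corpus_name}_wordPatterns.csv")
        names.append(f"{prefix}_{corpus_name}_POSPatterns.csv")

    # one summary file per search word, across all corpora
    names.append(f"{prefix}_complete_all_corpora_info.csv")
    return names


example_names = output_csv_names("ain't", ["CorpusA"])
for name in example_names:
    print(name)
```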

In [None]:
# Designate the input path where the split content CSVs from Step 1.5 are stored
csv_input_path = "path"

# Designate the output path where the newly created CSVs will be stored
csv_output_path = "path"

### Feature: Ain't

In [None]:
# Designate the search word
search_word_string = "ain\'t"

# execute code
get_corpora_stats_csvs(csv_input_path, csv_output_path, search_word_string)

### Feature: Be

In [None]:
# Designate the search word
search_word_string = "be"

# execute code
get_corpora_stats_csvs(csv_input_path, csv_output_path, search_word_string)

### Feature: Done

In [None]:
# Designate the search word
search_word_string = "done"

# execute code
get_corpora_stats_csvs(csv_input_path, csv_output_path, search_word_string)
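
When reading the pattern CSVs produced by the calls above, it may help to see how the three count columns relate. The following sketch recomputes them for a toy list of pattern tokens: `pattern_stats` is a hypothetical helper, the pattern strings and the corpus word count of 200,000 are invented, and `collections.Counter` is used here for brevity in place of the repeated `list.count` calls in the function.

```python
from collections import Counter


def pattern_stats(pattern_tokens, corpus_word_count_total):
    """Return raw count, count per 100,000 words, and percent of all pattern tokens.

    Hypothetical helper for illustration only.
    """
    counts = Counter(pattern_tokens)
    pattern_tokens_count = len(pattern_tokens)

    stats = {}
    for pattern_type, raw_count in sorted(counts.items()):
        stats[pattern_type] = {
            "raw": raw_count,
            # normalized to occurrences per 100,000 corpus words
            "normalized": float(f"{raw_count / corpus_word_count_total * 100000:.3f}"),
            # share of all pattern tokens in the corpus
            "percent": float(f"{raw_count / pattern_tokens_count * 100:.2f}"),
        }
    return stats


# invented pattern strings and corpus size
stats = pattern_stats(
    ["i_be_going", "i_be_going", "they_be_here", "we_be_there"],
    200000,
)
print(stats["i_be_going"])  # {'raw': 2, 'normalized': 1.0, 'percent': 50.0}
```

The normalized count depends on the size of the whole corpus, while the percent depends only on how many pattern tokens were found, so the two columns can rank corpora differently.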