To install:

1. Open Terminal.
2. Type: python3 -m pip install flashqda

FlashQDA currently offers four main functions:

screen_abstracts: Provide a comma-separated values (.csv) file of abstracts and a list of criteria for labelling them. Place the file in the Data folder, with the abstracts listed down the first column (one abstract per cell) and "Abstract" as the header in the first row. The criteria are passed to FlashQDA with your function call (see example below). For each abstract, FlashQDA reports whether each criterion applies. Results are written to a .csv file in the Results folder.
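
As a minimal sketch of the input layout described above, the snippet below writes an abstracts file with Python's standard csv module. The helper name write_abstracts_csv is hypothetical and not part of FlashQDA; only the "Abstract" header and the one-abstract-per-cell layout come from the description.

```python
import csv
from pathlib import Path


def write_abstracts_csv(path, abstracts):
    """Write abstracts down the first column under the header "Abstract".

    Hypothetical helper for illustration; not part of FlashQDA.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Abstract"])    # first row is the header
        for abstract in abstracts:
            writer.writerow([abstract])  # one abstract per cell
    return path
```

Using csv.writer rather than joining strings by hand means that abstracts containing commas or quotation marks are quoted correctly. A call such as write_abstracts_csv("Data/abstracts.csv", my_abstracts) would place the file where screen_abstracts expects it, assuming the project directory is the working directory.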

preprocess_documents: Provide a set of plain text files (.txt), placed in the Data folder. FlashQDA segments each document into a list of sentences. Results are written to a .csv file in the Results folder.

analyze_sentences: Provide a comma-separated list (.csv file) of sentences and choose among four analysis types: relationships_classify, tenses, list_of_terms, and relationships_extract:

- relationships_classify classifies the relationships expressed within a sentence as causal, correlational, or none.
- tenses lists the tenses used in the sentence.
- list_of_terms checks whether the sentence contains any items from a user-specified list of terms (provided with the function call).
- relationships_extract retrieves cause/effect pairs from sentences (a sentence can yield multiple pairs).

compare_concepts (not documented here; coming soon): Provide a .csv file of cause/effect concept pairs and compare the concepts for semantic similarity. Results are written to a .csv file in the Results folder as a triangular matrix.

In lieu of a user guide (in preparation), the notebook below provides examples of how to use the functions. Additional details on the functions are provided with the examples.

FlashQDA is currently designed to work with OpenAI's API (GPT-4o). If you have not done so already, you will need to set up an account with OpenAI and obtain an API key. There is a cost for using the API.

The examples use two abstracts related to agroforestry in Peru, available in the Docs folder on GitHub (https://github.com/nmkearney/flashqda). Download the files and, once you have initialized your project (see below), add them to the Data folder. The files you need for the examples are:

- abstracts.csv
- Lojka et al. 2016
- Ocampo-Ariza et al. 2023

In [None]:
import flashqda

In [None]:
# Store your OpenAI API key
import os
#os.environ['OPENAI_API_KEY'] = '<YOUR_API_KEY>' # Uncomment and replace with your OpenAI API key
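
Before running any analysis, it can help to confirm that the key is actually visible to Python. The sketch below checks the OPENAI_API_KEY environment variable used in the cell above; the helper name api_key_is_set is hypothetical and not part of FlashQDA.

```python
import os


def api_key_is_set(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty.

    Hypothetical helper for illustration; not part of FlashQDA.
    `env` defaults to the real process environment.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; API calls are likely to fail.")
```

Setting the variable in your shell profile, rather than in the notebook, keeps the key out of files you might share.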

In [None]:
# Initialize a project
directory = '/Users/<user_name>/Documents/MyProject' # Replace with the directory you want to use
flashqda.initialize_project(directory) 

# initialize_project sets up the project folder and required subfolders (Data, Results).
# It also changes the working directory to <directory>.
# If you change the working directory (e.g., for another project), FlashQDA will stop working correctly.
# You can set the working directory back to <directory> using: os.chdir(directory).
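
The working-directory note above can be turned into a small guard. This is a sketch using only the standard library; the helper name ensure_working_directory is hypothetical and not part of FlashQDA.

```python
import os


def ensure_working_directory(directory):
    """Switch back to the project directory if the working directory has changed.

    Hypothetical helper for illustration; not part of FlashQDA.
    Returns True if a change was needed, False otherwise.
    """
    target = os.path.realpath(directory)
    if os.path.realpath(os.getcwd()) == target:
        return False
    os.chdir(target)
    return True
```

Calling ensure_working_directory(directory) at the top of a cell is a cheap way to rule out a changed working directory when results or data files appear to be missing.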

In [None]:
# Screen abstracts example
file_name = "abstracts.csv" # The name of the file that contains the abstracts (place it within the Data folder)
criteria = ["agroforestry is a main topic",
            "slash-and-burn farming is a main topic",
            "cacao yield is a main topic",
            "experimental methods were used"]
flashqda.screen_abstracts(file_name, criteria)

# screen_abstracts iterates over the list of abstracts and checks whether each criterion applies to each abstract.
# By default, screen_abstracts checks each abstract/criterion pair 3 times and reports the % of positive test results.
# You can modify the query count using the query_count argument. For example: screen_abstracts(file_name, criteria, query_count = 5).
# Items in the criteria list must be enclosed by quotation marks and separated by commas.
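
To make the reported percentage concrete, the sketch below shows the arithmetic for one abstract/criterion pair. How FlashQDA aggregates the repeated queries internally is not shown here; the function name percent_positive and the list of yes/no decisions are assumptions for illustration.

```python
def percent_positive(decisions):
    """Return the percentage of positive results among repeated yes/no decisions.

    Hypothetical helper for illustration; not part of FlashQDA.
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    return 100 * sum(bool(d) for d in decisions) / len(decisions)


# Five queries (query_count = 5), three of which were positive
print(percent_positive([True, False, True, True, False]))  # 60.0
```

A higher query_count gives a finer-grained percentage at the cost of more API calls.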

In [None]:
# Preprocess documents example
custom_items = [] # Empty by default
save_name = 'analysis_1' # Name of the file where you want the sentences to be stored (will be placed within the Results subfolder)
flashqda.preprocess_documents(save_name, custom_items)

# preprocess_documents takes a set of plain text files and splits each into a list of sentences.
# Each document is assigned a document ID, and each sentence is assigned a sentence ID.
# Sentence splitting can be tricky. Sometimes sentences are split at unwanted places.
# A set of rules is used to identify common sentence starts and ends.
# For example, periods often (but not always!) mark sentence boundaries.
# FlashQDA anticipates common sentence-splitting issues (e.g., abbreviations, such as "Ms." and "Dr.").
# After running preprocess_documents, you may notice cases of unwanted sentence splitting.
# Identify the causes (e.g., the abbreviation "spp." for "species") and add the special cases to the custom_items list.
# Items in the list must be enclosed by quotation marks and separated by commas.
# For example: custom_items = ['spp.', 'B.']
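
The effect of custom_items can be illustrated with a deliberately simplified splitter. This is not FlashQDA's implementation, whose rules are not shown here; the function split_sentences and its protected argument are hypothetical and only demonstrate why listing an abbreviation prevents an unwanted split.

```python
import re


def split_sentences(text, protected=("Dr.", "Ms.")):
    """Split at '.', '!' or '?' followed by whitespace, except after protected items.

    Simplified illustration only; not FlashQDA's actual rule set.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]  # include the punctuation mark
        last_token = candidate.split()[-1]
        if last_token in protected:
            continue  # abbreviation, not a sentence boundary; keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Inga spp. were planted. Yields rose."
print(split_sentences(text))
print(split_sentences(text, protected=("Dr.", "Ms.", "spp.")))
```

The first call splits after "spp."; the second keeps "Inga spp. were planted." as one sentence because "spp." has been added to the protected items.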

In [None]:
# Analyze sentences example #1

save_name = 'analysis_1' # Name of the document containing the list of (preprocessed) sentences
analysis_type = 'relationships_classify' # The type of the current analysis
context_length = 1 # The number of sentences prior to the focal sentence to be used as context for the focal sentence
subscore_basis = ['causal'] # The category or categories used later for filtering
filter = None # Not needed on the first analysis; see example #2 below
filter_cutoff = 0 # Not needed on the first analysis; see example #2 below

sentences = flashqda.read_csv_file(save_name) # Sentences used in the analysis.
flashqda.analyze_sentences(sentences=sentences,
                              save_name=save_name,
                              analysis_type=analysis_type,
                              context_length=context_length, 
                              subscore_basis=subscore_basis,
                              filter=filter,
                              filter_cutoff=filter_cutoff)

# analyze_sentences iterates over the list of sentences and applies the specified analysis type.
# relationships_classify checks whether a sentence expresses any causal or correlational relationships.
# If none are found, 'None' is reported.
# By default, relationships_classify, tenses, and relationships_extract have a query_count of 3.
# You can modify the query count using the query_count argument. For example: analyze_sentences(sentences, ..., query_count = 5).

# context_length is helpful when causal relationships are expressed over more than one sentence (which is common).
# By default, context_length = 0, but in many cases a context_length of 1 is helpful for detecting (and extracting) relationships.
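
As a rough picture of what context_length selects, the sketch below returns the sentences immediately preceding a focal sentence. How FlashQDA passes that context to the model is not shown here; the function name with_context is hypothetical.

```python
def with_context(sentences, index, context_length=0):
    """Return (context, focal): the preceding context_length sentences and the focal one.

    Hypothetical helper for illustration; not part of FlashQDA.
    """
    if context_length < 0:
        raise ValueError("context_length must be non-negative")
    start = max(0, index - context_length)  # fewer sentences near the start of a document
    return sentences[start:index], sentences[index]


sentences = ["Shade trees were added.", "As a result, yields rose."]
context, focal = with_context(sentences, index=1, context_length=1)
```

In this example the cause sits in the context sentence and the effect in the focal sentence, which is the situation a context_length of 1 is meant to cover.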

# subscore_basis is used to create a subscore, which is compared against filter_cutoff when a filter is used.
# For example, subscore_basis = ['causal'] will report the proportion of times GPT decided the sentence expressed a causal relationship.
# The proportion is equal to (frequency of category / number of decisions for sentence).
# Items in the subscore_basis list must be enclosed by quotation marks and separated by commas.

# filter is used to skip sentences that do not meet a user-specified cutoff.
# For example, you may want to run further analyses (e.g., extracting causal relationships) only on sentences that GPT decided expressed
# at least one causal relationship. In that case, use the 'relationships_classify' filter (as in example #2) with a filter_cutoff of, for example, 0.01.
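
The subscore arithmetic and the cutoff comparison described above can be sketched as follows. The function names are hypothetical, and the use of >= for the comparison is an assumption; the text above does not say whether FlashQDA treats the cutoff as inclusive.

```python
def subscore(decisions, subscore_basis):
    """Proportion of decisions that fall in subscore_basis.

    Implements (frequency of category / number of decisions for sentence).
    Hypothetical helper for illustration; not part of FlashQDA.
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    hits = sum(1 for decision in decisions if decision in subscore_basis)
    return hits / len(decisions)


def passes_filter(score, filter_cutoff):
    """Assumed rule: keep a sentence when its subscore reaches the cutoff."""
    return score >= filter_cutoff


# Three queries for one sentence, two of which were 'causal'
score = subscore(["causal", "none", "causal"], ["causal"])
keep = passes_filter(score, filter_cutoff=0.01)
```

With a query_count of 3, the smallest non-zero subscore is 1/3, so a cutoff of 0.01 keeps any sentence that was classified as causal at least once and skips the rest.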

In [None]:
# Analyze sentences example #2

save_name = 'analysis_1' # Same file as in the above example
analysis_type = 'tenses'
context_length = 0
subscore_basis = ['simple present']
filter = 'relationships_classify'
filter_cutoff = 0.01

sentences = flashqda.read_csv_file(save_name) # Sentences used in the analysis.
flashqda.analyze_sentences(sentences=sentences, 
                              save_name=save_name,
                              analysis_type=analysis_type,
                              context_length=context_length, 
                              subscore_basis=subscore_basis,
                              filter=filter,
                              filter_cutoff=filter_cutoff)

# You can run this analysis on top of the previous one. The analysis_1.csv file will be safely overwritten with the new information.
# Only sentences that GPT decided expressed at least one causal relationship will be analyzed for the tenses used.

In [None]:
# Analyze sentences example #3

save_name = 'analysis_2' # Create a new list of sentences
flashqda.preprocess_documents(save_name)

analysis_type = ['relationships_classify', 
             'tenses', 
             'list_of_terms',
             'relationships_extract']
context_length = [1,
                  0,
                  0,
                  1]
terms_to_check = [
    [],
    [],
    ['suggest'],
    []
]
subscore_basis = [
    ['causal'],
    ['simple present'],
    ['simple_present'],
    ['simple_present']
]
filter = [None, 
          'relationships_classify', 
          'tenses',
          'tenses']
filter_cutoff = [0,
                 0.1,
                 0.1,
                 0.1]

for i in range(len(analysis_type)):
    sentences = flashqda.read_csv_file(save_name)
    flashqda.analyze_sentences(sentences=sentences,
                                  save_name=save_name,
                                  analysis_type=analysis_type[i], 
                                  context_length=context_length[i], 
                                  terms_to_check=terms_to_check[i],
                                  subscore_basis=subscore_basis[i],
                                  filter=filter[i],
                                  filter_cutoff=filter_cutoff[i], 
                                  )

# This loop will run four analyses in sequence.
# If it is interrupted (you can hit the stop button in Jupyter Notebook to test), 
# it will continue from the last sentence analyzed when resumed.
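
One common way to get this kind of resume behaviour is to skip rows that already hold a result. The sketch below illustrates the idea only; FlashQDA's actual mechanism is not shown here, and the function name, the "sentence" key, and the result key are all hypothetical.

```python
def analyze_with_resume(rows, result_key, analyze):
    """Fill in row[result_key] for rows that lack a result; skip completed rows.

    Hypothetical sketch; not FlashQDA's implementation.
    Returns the number of rows processed in this run.
    """
    processed = 0
    for row in rows:
        if row.get(result_key) not in (None, ""):
            continue  # already analyzed in an earlier run
        row[result_key] = analyze(row["sentence"])
        processed += 1
    return processed
```

Because completed rows are skipped, rerunning after an interruption repeats no work, provided that results are saved as they are produced.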

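# The loop index i picks the i-th entry from each list (analysis_type, context_length, and so on) and passes it to analyze_sentences,
# so all of the lists must have the same length and the same order.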
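
An alternative to indexing several lists by position is to bundle each step's arguments into one dictionary. The sketch below is an illustration under that assumption; build_steps is a hypothetical helper, not part of FlashQDA, and it raises an error if the lists differ in length.

```python
def build_steps(**parallel_lists):
    """Turn equal-length argument lists into one dict of keyword arguments per step.

    Hypothetical helper for illustration; not part of FlashQDA.
    """
    lengths = {len(values) for values in parallel_lists.values()}
    if len(lengths) > 1:
        raise ValueError("all argument lists must have the same length")
    keys = list(parallel_lists)
    return [dict(zip(keys, values)) for values in zip(*parallel_lists.values())]


steps = build_steps(
    analysis_type=['relationships_classify', 'tenses'],
    context_length=[1, 0],
    filter_cutoff=[0, 0.1],
)
for step in steps:
    # In a real run, each dict could be unpacked into the call, for example:
    # flashqda.analyze_sentences(sentences=sentences, save_name=save_name, **step)
    print(step)
```

The length check catches a mismatch before any API calls are made, whereas indexing by position would fail partway through the run.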