To install:

1. Open Terminal.
2. Type: python3 -m pip install flashqda

FlashQDA currently offers six main functions:

- preprocess_documents: Segment a corpus into paragraphs or sentences. Each document is given a unique document ID, and each paragraph/sentence is given a unique item ID (restarting at 1 for each document).
- label_items: Label an item (an abstract, a paragraph, a sentence, or an extracted pair) according to a set of user-specified labels. Can be run before and/or after classify_items and extract_from_classified.
- classify_items: Classify an item (a paragraph or a sentence) as causal or not_causal, or according to a user-specified classification scheme (e.g., tradeoff/not_tradeoff).
- extract_from_classified: Extract causes and effects from an item (a paragraph or a sentence), or extract a user-specified extraction scheme (e.g., tradeoff: gain/cost).
- embed_items: Embed extracted concepts to compute semantic similarity.
- link_items: Link cause/effect relationships by semantic similarity (compare from_effect to to_cause). Can be used to construct a causal graph.
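
The ID scheme described for preprocess_documents can be pictured with a minimal sketch. This is not FlashQDA's implementation: the function segment_corpus, the column names document_id and item_id, and the naive regex split are assumptions made for illustration only.

```python
import re


def segment_corpus(documents):
    """Split each document into sentences and assign IDs.

    documents: dict mapping a file name to its text.
    Returns a list of dicts with document_id, item_id, and text.
    """
    rows = []
    # Document IDs are unique across the corpus (here: 1, 2, 3, ... by file name).
    for document_id, (name, text) in enumerate(sorted(documents.items()), start=1):
        # Naive split after ., !, or ? followed by whitespace (an assumption;
        # real sentence segmentation is more involved).
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        # Item IDs restart at 1 for every document.
        for item_id, sentence in enumerate(sentences, start=1):
            rows.append({"document_id": document_id, "item_id": item_id, "text": sentence})
    return rows


# Toy corpus, made up for this illustration.
corpus = {
    "a.txt": "Shade trees lower yields. They also raise prices.",
    "b.txt": "Certification increases income.",
}
for row in segment_corpus(corpus):
    print(row)
```

An item is then identified by the pair (document_id, item_id) rather than by item_id alone.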

In lieu of a user guide (in preparation), the notebook below provides examples of how to use the functions. 
Additional details on the functions are available in the docstrings; for example, run help(flashqda.classify_items).
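
To skim all six docstrings at once, something like the following may help. It is a sketch: first_doc_lines is a hypothetical helper, it only uses the function names listed above, and it returns nothing if flashqda is not installed.

```python
import importlib
import importlib.util
import inspect

# The six function names listed above.
FUNCTION_NAMES = [
    "preprocess_documents",
    "label_items",
    "classify_items",
    "extract_from_classified",
    "embed_items",
    "link_items",
]


def first_doc_lines(module_name="flashqda", names=FUNCTION_NAMES):
    """Return {name: first docstring line} for the given module, or {} if it is missing."""
    if importlib.util.find_spec(module_name) is None:
        return {}
    module = importlib.import_module(module_name)
    summary = {}
    for name in names:
        func = getattr(module, name, None)
        if func is None:
            summary[name] = "(not found)"
            continue
        doc = inspect.getdoc(func)
        summary[name] = doc.splitlines()[0] if doc else "(no docstring found)"
    return summary


for name, line in first_doc_lines().items():
    print(f"{name}: {line}")
```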

FlashQDA is currently designed to work with OpenAI's API (GPT-4o).* 
If you have not done so already, you will need to set up an account with OpenAI and obtain an API key.
There is a cost for using the API.

The example uses two articles on Ecuadorian cocoa, available in the Docs folder on GitHub (https://github.com/nmkearney/flashqda). 
Download the files and, once you have initialized your project (see below), add them to the Data folder. 
The files you need for the examples are:

- Charry et al. 2025.txt
- Tennhardt et al. 2023.txt

* Support for other LLMs is under development.
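
Before running the examples, it can be worth confirming that both example files are in the data folder. The helper below is a hypothetical convenience, not part of FlashQDA; it takes the folder path as an argument and reports which of the two files listed above are missing.

```python
from pathlib import Path

# File names as given in the list above.
EXAMPLE_FILES = ["Charry et al. 2025.txt", "Tennhardt et al. 2023.txt"]


def missing_example_files(data_dir, expected=EXAMPLE_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Usage, once the project has been initialized (project_root as defined in step 2):
# print(missing_example_files(Path(project_root) / "data"))
```

An empty list means both files were found.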

In [None]:
# 1. Import flashqda
import flashqda

In [None]:
# 2. Initialize a new project folder (adjust path as needed)
project_root = "/Users/<user_name>/Documents/flashQDA projects/my_project"
flashqda.initialize_project(project_root)

# project_root is the directory where you want to store the project files
# initialize_project sets up the following file structure in project_root:
# /<project_name>
# - /data (place data for analysis here)
# - /prompts (place custom prompts here)
# - /results

# initialize_project also changes the working directory to <project_root>.
# If you change the working directory (e.g., for another project), FlashQDA will stop working correctly.
# You can set the working directory back to <directory> using: os.chdir(directory).
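
If you need to work in another directory for a moment, one option is a small context manager that always restores the previous working directory. This is a sketch using only the standard library; working_directory is a hypothetical helper, not a FlashQDA function.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, then restore the previous one."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the project root is always restored.
        os.chdir(previous)


# Usage (other_directory is a placeholder for any existing folder):
# with working_directory(other_directory):
#     ...  # do unrelated work here
# Back in the original working directory at this point.
```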

In [None]:
# 3. Create ProjectContext for convenient path management
project = flashqda.ProjectContext(project_root)

In [None]:
# 4. Retrieve OpenAI API key
flashqda.get_openai_api_key(
    project_root=project_root
    )

# Place a .txt file named "openapi_api_key.txt" in the root folder (add only the API key to the file)
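
Since the file should contain only the API key, a quick check of its contents can catch stray text before any API call is made. The helper below is hypothetical and independent of FlashQDA (it does not show how get_openai_api_key reads the file); pass it the path to your key file.

```python
from pathlib import Path


def read_key_file(path):
    """Return the key stored in a text file, with surrounding whitespace removed."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        raise ValueError(f"{path} is empty")
    # A key is a single token; anything else suggests extra text in the file.
    if len(text.split()) != 1:
        raise ValueError(f"{path} should contain only the API key")
    return text
```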

In [None]:
# 5. Label abstracts
config = flashqda.PipelineConfig.from_type(
    "causal",
    topic = "agroforestry in Peru"
)
granularity = "abstract"
label_list = [
    "agroforestry is a main topic",
    "Peru is a main topic",
    "space travel is a main topic"
]
input_file = project.data / "abstracts.csv"
expand = True
output_directory = project.results
save_name = "abstracts_labelled.csv"

flashqda.label_items(
    project=project,
    config=config,
    granularity=granularity,
    label_list=label_list,
    expand=expand,
    input_file=input_file,
    output_directory=output_directory,
    save_name=save_name
)

In [None]:
# 6. Segment documents in 'data/' folder and save CSV
granularity = "sentence" # Options: abstract, paragraph, sentence
custom_items = []
save_name = "sentences.csv"

flashqda.preprocess_documents(
    project = project,
    granularity = granularity, # Optional; default is "sentence"
    custom_items = custom_items, # Optional; default is None
    save_name = save_name # Optional; default is {granularity}.csv
    )

In [None]:
# 7a. Classify items as causal or not (default pipeline)
pipeline_config = flashqda.PipelineConfig.from_type(
    "causal",
    topic = "agroforestry in Peru"
)
granularity = "sentence"
context_length = 1
input_file = project.data / "sentences.csv"
output_directory = project.results
save_name = "sentences_classified.csv"

classified_df = flashqda.classify_items(
    project = project,
    config=pipeline_config,
    granularity = granularity,
    context_length = context_length,
    input_file = input_file,
    output_directory = output_directory,
    save_name = save_name
)

In [None]:
# Optional 7bi. Instantiate a custom pipeline (adapt as needed)
tradeoff_pipeline = flashqda.PipelineConfig(
    pipeline_type = "tradeoff",
    labels = ["tradeoff", "not tradeoff"],
    extract_labels= ["gain", "cost"],
    prompt_files = {
        "classify": "tradeoff_classify.txt",
        "tradeoff_label_extracted": "tradeoff_label_extracted.txt",
        "extract": "tradeoff_extract.txt",
    },
    system_prompt = "You are helping identify tradeoffs in text. The topic is organic farming and just, sustainable food systems transitions."
)

In [None]:
# Optional 7bii. Execute a custom pipeline
pipeline_config = tradeoff_pipeline
granularity = "sentence"
context_length = 1
input_file = project.data / "sentences.csv"
output_directory = project.results
save_name = "sentences_tradeoffs_classified.csv"

flashqda.classify_items(
    project = project, 
    config=pipeline_config,
    granularity = granularity,
    context_length = context_length,
    input_file = input_file, 
    output_directory = output_directory,
    save_name = save_name
)

In [None]:
# 8. Label causal items (e.g., for inclusion/exclusion)
pipeline_config = flashqda.PipelineConfig.from_type(
    "causal",
    topic = "agroforestry in Peru"
)
granularity = "sentence"
context_length = 1
include_class = "causal"
label_list = ["Label: substantive_not_methodological. Description: The sentence discusses the topic being studied, not how the study was conducted, framed, or limited.",
              "Label: descriptive_not_prescriptive. Description: The sentence describes how or why something happens, without suggesting what should be done.",
              "Label: definitive_not_ambiguous. Description: The sentence states a causal relationship without hedging (e.g., 'may cause', 'could contribute to')."
              ]
expand = True
input_file = project.results / "sentences_classified.csv"
output_directory = project.results
save_name = "sentences_classified_labelled.csv"

labelled_df = flashqda.label_items(
    project = project,
    config = pipeline_config,
    granularity = granularity,
    context_length = context_length,
    include_class = include_class,
    label_list = label_list,
    expand = expand,
    input_file = input_file,
    output_directory = output_directory,
    save_name = save_name
)

In [None]:
# 9. Extract causes and effects from causal items
pipeline_config = flashqda.PipelineConfig.from_type(
    "causal",
    topic = "agroforestry in Peru"
)
granularity = "sentence"
context_length = 1
include_class = "causal"
filter_keys = "FALSE"
filter_column = "substantive_not_methodological"
input_file = project.results / "sentences_classified_labelled.csv"
output_directory = project.results # Default
save_name = "sentences_classified_labelled_extracted.csv"

extracted_df = flashqda.extract_from_classified(
    project = project, 
    config=pipeline_config,
    granularity = granularity,
    context_length = context_length,
    include_class=include_class,
    filter_keys = filter_keys,
    filter_column = filter_column,
    input_file = input_file,
    output_directory = output_directory,
    save_name = save_name
    )

In [None]:
# 10. Label extracted items
pipeline_config = flashqda.PipelineConfig.from_type(
    "causal",
    topic = "agroforestry in Peru"
)
granularity = "sentence"
context_length = 1
include_class = "causal"
label_list = ["Label: social_system. Description: The cause/effect pair relates to social systems (e.g., demography, social organization, culture, politics, economics, social actors).",
               "Label: ecological_system. Description: The cause/effect pair relates to ecological systems (e.g., resources, organisms, ecosystems, habitats, ecosystem services).",
               "Label: barrier. Description: The cause/effect pair describes how or why something does not or cannot happen.",
               "Label: driver. Description: The cause/effect pair describes how or why something does happen."
              ]
on_extracted = True
expand = True
input_file = project.results / "sentences_classified_labelled_extracted.csv"
output_directory = project.results
save_name = "sentences_classified_labelled_extracted_labelled.csv"

labelled_df = flashqda.label_items(
    project = project,
    config = pipeline_config,
    granularity = granularity,
    context_length = context_length,
    include_class = include_class,
    label_list = label_list,
    on_extracted = on_extracted,
    expand = expand,
    input_file = input_file,
    output_directory = output_directory,
    save_name = save_name
)

In [None]:
# 11. Generate embeddings
pipeline_config = flashqda.PipelineConfig.from_type("causal")
column_names = ["cause", "effect"]
input_file = project.results / "sentences_classified_labelled_extracted.csv"
output_directory = project.results
save_name = "embeddings.json"

embeddings = flashqda.embed_items(
    project = project,
    config = pipeline_config,
    column_names = column_names,
    input_file = input_file,
    output_directory = output_directory,
    save_name = save_name
)

In [None]:
# 12. Link similar causal relationships
pipeline_config = flashqda.PipelineConfig.from_type("causal")
threshold = 0.85
input_file = project.results / "sentences_classified_labelled_extracted.csv"
embedding_file = project.results / "embeddings.json"
output_directory = project.results
save_name = "suggested_links.csv"

links = flashqda.link_items(
    project = project,
    config = pipeline_config,
    threshold = threshold,
    input_file = input_file,
    embedding_file = embedding_file,
    output_directory = output_directory,
    save_name = save_name
)
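
The idea behind linking (compare the effect of one pair with the cause of another, and keep matches above a threshold) can be sketched as follows. This is an illustration, not FlashQDA's implementation: the use of cosine similarity, the functions cosine_similarity and suggest_links, and the toy three-dimensional vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def suggest_links(pairs, embeddings, threshold=0.85):
    """Link pair i to pair j when the effect of i resembles the cause of j.

    pairs: list of (cause, effect) strings.
    embeddings: dict mapping each string to its vector.
    Returns a list of (from_index, to_index, score) tuples.
    """
    links = []
    for i, (_, from_effect) in enumerate(pairs):
        for j, (to_cause, _) in enumerate(pairs):
            if i == j:
                continue  # do not link a pair to itself
            score = cosine_similarity(embeddings[from_effect], embeddings[to_cause])
            if score >= threshold:
                links.append((i, j, round(score, 3)))
    return links


# Toy data, made up for this illustration.
pairs = [
    ("shade trees", "lower cocoa yield"),
    ("reduced cocoa yield", "lower farm income"),
]
embeddings = {
    "shade trees": [1.0, 0.0, 0.0],
    "lower cocoa yield": [0.0, 1.0, 0.1],
    "reduced cocoa yield": [0.0, 1.0, 0.0],
    "lower farm income": [0.0, 0.0, 1.0],
}
print(suggest_links(pairs, embeddings))
```

Each returned tuple can be read as a directed edge from one cause/effect pair to the next, which is the sense in which the links can be assembled into a causal graph. Raising the threshold yields fewer, stricter links.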