<a href="https://colab.research.google.com/github/jdcox02/Cohort-2025/blob/main/ISEA_topic_code_using_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Qualitative coding using OpenAI

- Prompt 1 (Zero shot): Input the entire codebook, including a table with parent code, parent code description, child code, and child code description. Use GPT to label both the child and parent code for each document; if no child code fits, label only the parent code.

- Prompt 2 (Chain of thought): Step 1: label all parent codes with reasoning. Step 2: label the child code within each parent code; if none of the child codes fit, keep only the parent code.

In [1]:
# Install pinned versions of the libraries used below
!pip install langchain==0.0.316
!pip install openai==0.28.1


In [6]:
import pandas as pd
import openai
import os
import ast
import time
import numpy as np
import json
from IPython.display import clear_output
from tqdm import tqdm
import re

In [7]:
tqdm.pandas()

In [4]:
# Set work directory
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
from google.colab import files
uploaded = files.upload()

Saving sample_interview_data.csv to sample_interview_data (1).csv
Saving codebook.csv to codebook (1).csv


## Load data

In [9]:
# Import text data
# Use interview_data.csv here

df = pd.read_csv('sample_interview_data.csv', index_col = 0)

In [10]:
# Example text for GPT to label
df['documents'].head()

roles,documents
Educator,"So speaking at a local district level, we have..."
Educator,"I know that my field, the area of my expertise..."
Educator,"In terms of local policies, something that can..."
Educator,I think you were talking before about this dat...
Educator,As a educator who works with students directly...


In [11]:
# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    #text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace (after URL removal)
    text = text.strip()  # Trim leading/trailing spaces
    return text

# Apply preprocessing to the documents column
df['documents'] = df['documents'].astype(str).apply(preprocess_text)

## LLM API integration

In [12]:
# You can also store the API key in a .env file instead of typing it each time
import getpass
openai.api_key = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········
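The comment in the cell above mentions keeping the key in a `.env` file. A minimal sketch of one way to support that, assuming the key is exposed as an environment variable named `OPENAI_API_KEY`; the variable name and the helper `load_api_key` are illustrative and not part of the original notebook. It falls back to the interactive prompt when the variable is unset.

```python
import getpass
import os


def load_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass.getpass):
    """Return the API key from the environment, or prompt for it if unset.

    `env_var` and this helper are assumptions made for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to a hidden interactive prompt, as in the cell above.
        key = prompt_fn("Enter your OpenAI API key: ").strip()
    return key
```

With this helper, `openai.api_key = load_api_key()` could stand in for the direct `getpass` call.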


## Open coding

### Function to extract themes from each row

In [10]:
def get_open_codes(text):
    prompt = f"""
    You are an expert qualitative researcher conducting open coding.
    You are provided with text from a transcript of an interview with an education policy stakeholder.

    Text: "{text}"

    Provide your response in JSON format with two keys: "codes" (a list of short codes) and "descriptions" (a corresponding description for each code).

    Example Response:
    {{
        "codes": ["Student Engagement", "Technology Use"],
        "descriptions": ["The text describes how students engage with the content", "The role of technology in enhancing learning"]
    }}
    """

    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5
        )

        # Parse LLM output
        content = response['choices'][0]['message']['content'].strip()
        return json.loads(content)

    except Exception as e:
        print(f"Error processing text: {text}\nError: {str(e)}")
        return {"codes": [], "descriptions": []}
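`json.loads` fails whenever the model wraps its JSON in a Markdown code fence or adds a sentence before or after it, and the function above then returns empty lists. A hedged sketch of a more forgiving parser follows; `parse_json_response` is a hypothetical helper, not part of the original notebook, and it only handles the two cases named here.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_response(content):
    """Best-effort extraction of a JSON object from a model reply.

    Handles replies wrapped in a Markdown code fence or surrounded by extra
    prose. Raises ValueError if no JSON object can be parsed.
    """
    text = content.strip()
    # Remove a leading fence (optionally tagged "json") and a trailing fence.
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the reply.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model reply")
        return json.loads(text[start:end + 1])
```

Inside `get_open_codes`, `return parse_json_response(content)` could replace `return json.loads(content)`; the existing `except` branch would still catch the `ValueError`.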


In [13]:
# Drop rows with missing roles
df = df.reset_index()
df = df[df['roles'].notna()]

# Fraction of rows to sample from each role
sample_fraction = 0.2

# Stratified random sample: 20% of the rows within each role
df_sampled = df.groupby('roles', group_keys=False).apply(lambda x: x.sample(frac=sample_fraction, random_state=42))



In [22]:
# Provide one row at a time
# Apply open coding
df_sampled['Open_Coding'] = df_sampled['documents'].apply(get_open_codes)

# Extract codes and descriptions into separate columns
df_sampled['Codes'] = df_sampled['Open_Coding'].apply(lambda x: x.get("codes", []))
df_sampled['Descriptions'] = df_sampled['Open_Coding'].apply(lambda x: x.get("descriptions", []))

In [23]:
# Refine and label emerging themes
# Pattern Recognition
# Validation Techniques
df_sampled.head()

Unnamed: 0,index,roles,#doc,documents,Open_Coding,Codes,Descriptions
4,4,Educator,14.0,as a educator who works with students directly...,"{'codes': ['Student Needs vs Compliance', 'Eng...","[Student Needs vs Compliance, English Language...",[Describes the frustration of balancing studen...
116,116,Educator,520.0,and there is almost never any conversations ab...,{'codes': ['Diversity in Education Leadership'...,"[Diversity in Education Leadership, Policy Mak...",[The text discusses the lack of diversity in e...
10,10,Educator,20.0,"in another word, the quality here actually hav...","{'codes': ['Program Quality', 'Student Outcome...","[Program Quality, Student Outcome]",[The text discusses the quality of the instruc...
0,0,Educator,3.0,"so speaking at a local district level, we have...","{'codes': ['Equity Policy Implementation', 'Le...","[Equity Policy Implementation, Leadership Tran...",[The text discusses the implementation of an e...
126,126,Educator,530.0,the one consistent thing that comes up is so c...,"{'codes': ['Native American Representation', '...","[Native American Representation, Federal Polic...",[The text discusses the importance of accurate...


In [24]:
# Flatten and aggregate into a codebook
pre_codebook1 = []
for _, row in df_sampled.iterrows():
    for code, desc in zip(row['Codes'], row['Descriptions']):
        pre_codebook1.append({"Code": code, "Description": desc})
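Because each row is coded independently, the same code label can appear more than once in `pre_codebook1`, sometimes with different descriptions. A minimal sketch of one way to collapse such repeats, assuming a case-insensitive match on the label is good enough; `merge_duplicate_codes` and the matching rule are illustrative choices, and near-synonyms would still need manual review.

```python
def merge_duplicate_codes(entries):
    """Collapse entries whose 'Code' matches case-insensitively.

    `entries` is a list of {"Code": ..., "Description": ...} dicts, the shape
    built in the cell above. Returns one dict per distinct code, in first-seen
    order, with its distinct descriptions and an occurrence count.
    """
    merged = {}
    for entry in entries:
        key = entry["Code"].strip().lower()
        slot = merged.setdefault(
            key, {"Code": entry["Code"].strip(), "Descriptions": [], "Count": 0}
        )
        slot["Count"] += 1
        if entry["Description"] not in slot["Descriptions"]:
            slot["Descriptions"].append(entry["Description"])
    return list(merged.values())
```

The result can be passed to `pd.DataFrame(...)` and sorted by `Count` to see which codes recur most often in the sample.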


In [25]:
pre_codebook1

[{'Code': 'Student Needs vs Compliance',
  'Description': 'Describes the frustration of balancing student needs with compliance requirements'},
 {'Code': 'English Language Development Services',
  'Description': 'Specific mention of challenges related to English language development services'},
 {'Code': 'Diversity in Education Leadership',
  'Description': 'The text discusses the lack of diversity in educational leadership positions and the impact on policy making'},
 {'Code': 'Policy Making',
  'Description': 'The text highlights the importance of diverse stakeholders in policy discussions'},
 {'Code': 'Program Quality',
  'Description': 'The text discusses the quality of the instructional program or services provided to students'},
 {'Code': 'Student Outcome',
  'Description': "The text mentions the importance of assessing students' language skills and other outcomes to fully participate in school programs"},
 {'Code': 'Equity Policy Implementation',
  'Description': 'The text discu

#### Function to extract collective themes from the entire dataset

In [26]:
def extract_themes(text_list):
    combined_text = " ".join(text_list)  # Merge all texts into a single document

    prompt = f"""
    You are an expert qualitative researcher performing open coding.
    You are provided with interview transcripts with education policy stakeholders.

    Text: "{combined_text}"

    Provide your response in JSON format:
    {{
        "Codes": ["Theme 1", "Theme 2", "Theme 3"],
        "Descriptions": ["Brief description of Theme 1", "Description of Theme 2", "Description of Theme 3"]
    }}
    """

    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5
        )

        # Parse the JSON output
        content = response['choices'][0]['message']['content'].strip()
        return json.loads(content)

    except Exception as e:
        print(f"Error: {str(e)}")
        return {"Codes": [], "Descriptions": []}  # Return empty structure on failure
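Joining every document into one string can exceed the model's context window as the sample grows. A rough sketch of one workaround, assuming a character budget is an acceptable stand-in for a token count: group the documents into batches and call `extract_themes` once per batch. The helper `batch_documents` and the default `max_chars` value are illustrative, not tuned.

```python
def batch_documents(documents, max_chars=8000):
    """Group documents into batches whose joined length does not exceed max_chars.

    A single document longer than max_chars is placed in a batch of its own
    rather than being split.
    """
    batches, current, current_len = [], [], 0
    for doc in documents:
        # +1 accounts for the space added when the batch is joined.
        added = len(doc) + (1 if current else 0)
        if current and current_len + added > max_chars:
            batches.append(current)
            current, current_len = [], 0
            added = len(doc)
        current.append(doc)
        current_len += added
    if current:
        batches.append(current)
    return batches
```

Each batch yields its own theme list, so the per-batch themes would still need to be merged (manually or with another model call) into one codebook.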



In [77]:
# Load text data
documents_list = df_sampled['documents'].tolist()

# Step 1: Extract collective themes
pre_codebook = extract_themes(documents_list)
themes, descriptions = pre_codebook["Codes"], pre_codebook["Descriptions"]

NameError: name 'extract_themes' is not defined

In [28]:
pre_codebook

{'Codes': ['Equity in Education Policy',
  'Data Representation and Decolonization',
  'Challenges in Implementation and Governance'],
 'Descriptions': ['Theme 1: The interviewee discusses the importance of equity in education policy, highlighting the need for diversity among stakeholders and the impact of policies on students of color. They also mention the challenges of aligning policies with student outcomes and the need for state support and funding for early learning programs.',
  'Theme 2: The interviewee addresses issues related to data representation and decolonization in education, emphasizing the impact of federal policies on Native American populations and the importance of accurate data collection for funding, programming, and support. They also mention efforts to decolonize data and improve representation of diverse communities.',
  'Theme 3: The interviewee reflects on challenges in policy implementation and governance, discussing disparities in funding formulas, disparat

In [76]:
for doc in documents_list:
  print(doc)

NameError: name 'documents_list' is not defined

In [38]:
# Function to assign themes to each text entry
def assign_themes(text_list, themes, descriptions):
    prompt = f"""
    Given the following overarching themes:
    {json.dumps({"themes": themes, "descriptions": descriptions}, indent=2)}

    Now classify each text into one or more of these themes.

    Texts: {json.dumps(text_list, indent=2)}

    Provide a response in JSON format as a list where each element contains:
    {{
        "assigned_themes": ["Theme 1", "Theme 2"]
    }}
    """

    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5
        )

        # Parse the JSON output
        content = response['choices'][0]['message']['content'].strip()


        return json.loads(content)


    except json.JSONDecodeError as e:
        print(f"JSON Parse Error: {str(e)}")
        return [{"assigned_themes": []} for _ in text_list]  # Return empty themes for each text

    except Exception as e:
        print(f"Error: {str(e)}")
        return [{"assigned_themes": []} for _ in text_list]  # Return empty themes on any other error

# Step 2: Assign themes to individual texts
assigned_theme_results = assign_themes(documents_list, themes, descriptions)

# Convert results to DataFrame

print("length of assigned theme results:", len(assigned_theme_results))
df_sampled['Assigned_Themes'] = [result["assigned_themes"] for result in assigned_theme_results]


length of assigned theme results: 113


ValueError: Length of values (113) does not match length of index (52)
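The mismatch above (113 results for 52 rows) is possible because nothing in the prompt forces the model to return exactly one element per input text. One way to guarantee alignment, sketched here under the assumption that one API call per document is affordable, is to classify each text separately so the result list matches the input length by construction. `classify_one` is a hypothetical callable standing in for a single-text version of `assign_themes`.

```python
def assign_themes_per_document(texts, classify_one):
    """Classify texts one at a time so results align with the input rows.

    `classify_one(text)` is expected to return a list of theme labels; any
    exception or non-list result yields an empty list for that text.
    """
    results = []
    for text in texts:
        try:
            themes = classify_one(text)
        except Exception as exc:
            print(f"Error: {exc}")
            themes = []
        results.append(themes if isinstance(themes, list) else [])
    return results
```

Since the output always has `len(texts)` elements, it can be assigned directly to `df_sampled['Assigned_Themes']` without a length error.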

In [None]:
t['Assigned_Themes']

roles,Assigned_Themes
Educator,"[Teacher Quality, Language Development]"
Educator,"[Dual Language Initiatives, Language Development]"


## Codebook-based coding

### Load codebook

In [14]:
# Import codebook if applicable
# If you are open coding using LLM, skip this
# Use codebook.csv here
codebook = pd.read_csv('codebook.csv')

In [15]:
# The codebook contains parent and child codes from which GPT selects.
codebook.head()

Unnamed: 0,Topic,Parent,Child,Child_description,Agreed rating,Min's Original Coherence rating,Key words
0,23.0,"Culture, climate and environment",Trauma at home,Struggling home and family experiences of chil...,4.0,4.0,"bad, kid, children, home, school, famili, grad..."
1,14.0,"Culture, climate and environment",Anti-racism,"Talking about whiteness, success, and anti-racism",3.0,3.0,"pull, white, child, success, stay, built, job,..."
2,,Curriculum and instruction,Instructional programs,"AP/IB courses, college classes/credits in high...",,,"Equity, Students, School, District, Policy, Da..."
3,13.0,Curriculum and instruction,Curriculum development and instructional delivery,School curriculum development and instructiona...,3.0,3.0,"onlin, teach, taught, teacher, high, learn, cl..."
4,11.0,"Data, evidence, and accountability","Data access, analysis, reporting, use, quality...","Data collection, access, analysis, and use to ...",4.0,4.0,"dashboard, data, inform, assumess, collect, di..."


### Load functions

In [16]:
def label_topic(doc_in):
    """
    Labels the topic of a given document using OpenAI's GPT model.

    Args:
    doc_in (str): The document or text snippet for which the topic needs to be determined.

    Returns:
    current_response: The labeled topic as determined by the GPT model or an error message.
    """

    # Create a prompt using the document input.
    prompt = create_prompt(doc_in)

    # Send the generated prompt to the OpenAI API and get a response.
    response = send_to_api(prompt)

    # Check if the response from the API is not an error.
    if response != "InvalidRequestError":
        current_response = response.choices[0].message.content
    else:
        current_response = response

    return current_response

In [17]:
def send_to_api(prompt):
    """
    This function sends a given prompt to OpenAI's API and retrieves the response.

    Args:
    prompt (str): A string containing the user's input that needs to be processed by the GPT model.

    Returns:
    result: The response from OpenAI's API or an error message if the request fails.
    """

    # Initialize result to None, this will store the API response
    result = None

    # Retry the API call until it returns a response or is rejected as invalid
    while result is None:
        try:
            # Making a request to OpenAI's ChatCompletion API
            result = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",  # Specifies the GPT model to use
                messages=[
                    {"role": "user", "content": prompt}  # The prompt or message for the GPT model
                ],
                temperature=0.5  # Sets the creativity of the response. Lower is more deterministic.
            )

        except openai.InvalidRequestError:
            # If there is an invalid request, store an error message
            result = "InvalidRequestError"
        except Exception:
            # For other errors, wait 1.5 seconds before retrying.
            # (A bare `except:` would also swallow KeyboardInterrupt.)
            time.sleep(1.5)

    # Return the API response or error message
    return result

`send_to_api` retries the request every 1.5 seconds until it gets a response, and returns the string "InvalidRequestError" if the API rejects the request as invalid. The `temperature` parameter controls how variable the responses are; lower values give more predictable, conservative outputs.
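Because the loop in `send_to_api` has no retry limit, a failure that never clears (for example an invalid API key) would keep it running. A minimal sketch of a bounded alternative with exponential backoff; `call_with_retries`, `max_attempts`, and `base_delay` are illustrative names, and `request_fn` stands for any zero-argument callable that performs the API request.

```python
import time


def call_with_retries(request_fn, max_attempts=5, base_delay=1.5, sleep=time.sleep):
    """Call request_fn until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between attempts
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A caller could wrap the request in a lambda, e.g. `call_with_retries(lambda: openai.ChatCompletion.create(...))`, and handle the re-raised exception where the labels are collected.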

### Prompt 1 (Zero shot): Input the entire codebook, including a table with parent code, parent code description, child code, and child code description. Use GPT to label both the child and parent code for each document; if no child code fits, label only the parent code.

#### Prompt archive

In [None]:
# prompt 0-0
prompt_base = f"""
You are provided with a paragraph from qualitative interviews and a codebook that contains parent and child codes for qualitative analysis. Each child code is associated with a description in the codebook.
Please assist in performing qualitative coding for the paragraph from qualitative interviews based on the provided codebook. The codebook is in CSV format and contains four useful columns for this task: 'Parent', 'Child', 'Child_description' and 'Key words'.

Here's how you should proceed:

1. Review the codebook:
   - The 'Parent' is a categorical label for the broader theme or category.
   - The 'Child' is a categorical label for the detailed theme.
   - The 'Child_description' provides a detailed description of what each child code in the corresponding 'Child' column represents.
   - The 'Key words' column lists high-frequency words associated with each child code in 'Child'.

2. For the paragraph from the qualitative interviews, assign a code from the codebook. Use the following steps:
   a. Read the paragraph.
   b. Refer to the codebook to find the most appropriate 'Parent', 'Child', and 'Child_description' for this paragraph.
   c. Provide the 'Parent', 'Child', and 'Child_description' as labels for this paragraph.

Here is a snippet from the codebook:
{codebook}

--- Paragraph to Label ---
[TEXTGOHERE]

Your task is to label the paragraph from the qualitative interviews based on the codes defined in the codebook. For each paragraph, identify the relevant parent code and child code from the codebook and provide a brief reasoning of why you selected that child code.
Please ensure that the labels you assign accurately represent the content of the paragraph. If you are uncertain or if a paragraph does not fit any code from the codebook, please return 'None'.

Format your response as a JSON object with 3 keys where
“Parent code”, “Child code”, and “Reasoning” as the keys.
"""

In [None]:
# prompt 1
prompt_base = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to perform qualitative coding for the Paragraph using the Codebook provided below. The Codebook contains four useful columns for this task: 'Parent', 'Child', 'Child_description' and 'Key words'.

First, review the Codebook:
   - 'Parent' column includes categorical labels for broader themes or categories.
   - 'Child' column includes categorical labels for detailed themes, nested within corresponding 'Parent' categories.
   - 'Child_description' column provides detailed descriptions of what corresponding 'Child' represents.
   - 'Key words' column provides highly frequent words that are relevant to corresponding 'Child' categories.

Second, using the Codebook, identify the top three most prevalent themes in each Paragraph by following these steps:
   a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system;
   b. Refer to the Codebook to identify the three most salient themes that emerge from the paragraph using 'Child' based on corresponding 'Child_description'; only when you cannot identify any appropriate child code, use the Parent. Show your reasoning behind your classifications;
   c. If a Paragraph does not fit any Child and Parent category from the Codebook, please return 'None'.

Here is the Codebook:
{codebook}

--- Paragraph ---
[TEXTGOHERE]

Format your response as a JSON object with 4 keys where
“Theme 1”, “Theme 2”, “Theme 3” and “Reasoning” as the keys.
"""

In [None]:
# prompt 1-0 remove keywords, up to 3, simplify steps
prompt_base = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to use the Codebook in CSV format to code Paragraph from the interview transcript.

First, review and understand the Codebook:
   - 'Parent' column includes categorical labels for broader themes.
   - 'Child' column includes categorical labels for subthemes, nested within corresponding 'Parent' categories.
   - 'Child_description' column provides detailed descriptions of what each subcategory in 'Child' represents.

Second, using the Codebook, identify up to three of the most salient themes in each Paragraph, which cover the most noticeable, central, important ideas conveyed in the Paragraph, by following these steps:
   a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system.
   b. Identify up to three of the most salient themes from the 'Child' categories, based on 'Child_description'. ONLY if you cannot identify an appropriate 'Child' category may you use the 'Parent' categories.
   c. If a Paragraph does not fit any Child or Parent category from the Codebook, return 'None'.
   d. Review the identified themes; if they do not reflect the most salient themes of the Paragraph, repeat steps a-c up to 3 times.
   e. Provide the identified themes as labels for this Paragraph and show your reasoning behind your classifications.

Here is the Codebook:
{codebook}

--- Paragraph ---
[TEXTGOHERE]

Format your response as a JSON object with 4 keys where
“Theme 1”, “Theme 2”, “Theme 3” and “Reasoning” as the keys.
"""

In [None]:
# prompt 1-0 remove keywords, up to 3, simplify steps
prompt_base1 = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to use the Codebook in CSV format to code Paragraph from the interview transcript.

First, review the Codebook:
   - 'Parent' column includes categorical labels for broader themes.
   - 'Child' column includes categorical labels for detailed themes, nested within corresponding 'Parent' categories.
   - 'Child_description' column provides detailed descriptions of what each child category in 'Child' represents.

Second, using the Codebook, identify three most salient themes in each Paragraph, which cover the most noticeable, central, important idea conveyed in the Paragraph, by following the following steps:
   a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system.
   b. Identify the three most salient themes from the 'Child' categories, based on 'Child_description'. ONLY if you cannot identify an appropriate 'Child' category may you use the 'Parent' categories.
   c. Review the three identified themes; if they do not reflect the most salient themes of the Paragraph, repeat steps a-b up to 3 times. If a Paragraph does not fit any Child or Parent category from the Codebook, return 'None'.
   d. Provide the identified themes as labels for this Paragraph and show your reasoning behind your classifications.

Here is the Codebook:
{codebook}

--- Paragraph ---
[TEXTGOHERE]

Format your response as a JSON object with 4 keys where
“Theme 1”, “Theme 2”, “Theme 3” and “Reasoning” as the keys.
"""

In [None]:
# v02
prompt_base = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to use the Codebook in CSV format to code Paragraph from the interview transcript.

First, review the Codebook:
- 'Parent' column includes categorical labels for broader themes.
- 'Child' column includes categorical labels for detailed themes, nested within corresponding 'Parent' categories.
- 'Child_description' column provides detailed descriptions of what each child category in 'Child' represents.
- 'Key words' column provides high-frequency words that are relevant to corresponding 'Child' categories.

Second, using the Codebook, identify three most salient themes in each Paragraph, which cover the most noticeable, central, important idea conveyed in the Paragraph, by following the following steps:
a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system;
b. Identify three of the most salient themes from the 'Child' categories, based on 'Child_description'; you can also refer to 'Key words'; ONLY if you could not identify an appropriate 'Child', you can use the 'Parent' categories. Provide the three identified themes as labels for this Paragraph and show your reasoning behind your classifications;
c. Please ensure that the themes you identified accurately represent the content of the Paragraph; if a Paragraph does not fit any Child or Parent categories from the Codebook, return 'None'.

Here is the Codebook:
{codebook}

--- Paragraph ---
[TEXTGOHERE]

Format your response as a JSON object with 4 keys where
“Theme 1”, “Theme 2”, “Theme 3” and “Reasoning” as the keys.
"""


#### Prompt in use

In [18]:
# v03
prompt_base = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to use the Codebook in CSV format to code Paragraph from the interview transcript.


First, review and learn the Codebook:
- 'Parent' column includes short labels for broader themes or categories of ideas.
- 'Child' column includes labels for subcategories of corresponding 'Parent' categories.
- 'Child_description' column provides narrative descriptions of 'Child' categories.
- 'Key words' column provides high-frequency words that are relevant to corresponding 'Child' categories.


Second, using the Codebook, identify the top three most salient themes in each Paragraph by following the following steps:
a. Read the Paragraph and understand the meaning in the context of the Washington State K-12 public school system.
b. Identify the three most salient themes using the 'Child' categories based on 'Child_description'. You can also refer to 'Key words'. Only when you cannot identify an appropriate Child category, you can label the Paragraph by using the Parent categories.
c. Provide the three identified themes as labels for this Paragraph and show your reasoning behind your classifications.
d. Review and ensure that the themes you identified accurately represent the content of the Paragraph. If a Paragraph does not fit any Child or Parent category from the Codebook, return 'None'.

Here is a snippet from the Codebook:
{codebook}

--- Paragraph ---
[TEXTGOHERE]


Format your response as a JSON object with 4 keys where
“Theme 1”, “Theme 2”, “Theme 3” and “Reasoning” as the keys.
"""


In [None]:
# Select a subset of text to test prompt
#random_rows = df.sample(n=50, random_state=2023)
t = df.iloc[8:10, ]

In [19]:
def create_prompt(doc_in):
    return prompt_base.replace('TEXTGOHERE', doc_in)
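Because `prompt_base` is an f-string, `{codebook}` is filled in once, when the cell defining it runs, using the DataFrame's printed form (which pandas may abbreviate for wide or long frames). A sketch of an alternative that fills both the codebook and the paragraph at call time, assuming the codebook is passed in as CSV text; the `[CODEBOOKGOHERE]` placeholder, the shortened template, and `build_prompt` are all hypothetical.

```python
PROMPT_TEMPLATE = """Here is the Codebook:
[CODEBOOKGOHERE]

--- Paragraph ---
[TEXTGOHERE]
"""


def build_prompt(doc_in, codebook_csv, template=PROMPT_TEMPLATE):
    """Fill the codebook and paragraph placeholders in a plain-string template."""
    # Insert the codebook first so placeholder-like text inside the paragraph
    # is never substituted.
    prompt = template.replace("[CODEBOOKGOHERE]", codebook_csv.strip())
    return prompt.replace("[TEXTGOHERE]", doc_in)
```

The CSV text could come from `codebook[['Parent', 'Child', 'Child_description', 'Key words']].to_csv(index=False)`, which returns the full table as a string without abbreviation.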

In [20]:
# Keep only the codebook columns the prompt refers to. prompt_base was already
# formatted when it was defined, so this does not change the prompt text.
codebook = codebook[['Parent', 'Child', 'Child_description', 'Key words']]

output = []
for index, row in tqdm(df_sampled.iterrows(), total=df_sampled.shape[0]):
    doc_in = row.documents  # text column
    output.append(label_topic(doc_in))

100%|██████████| 52/52 [01:07<00:00,  1.29s/it]


In [22]:
print(output[0])

{
    "Theme 1": "Student supports and interventions",
    "Theme 2": "Curriculum and instruction",
    "Theme 3": "Data, evidence, and accountability",
    "Reasoning": "The paragraph discusses the educator's role in providing the best situation for students, highlighting the disconnect between what students need and compliance requirements. This aligns with the theme of student supports and interventions. Additionally, the mention of English language development services relates to curriculum and instruction. The frustration around compliance also points to the theme of data, evidence, and accountability."
}
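Each element of `output` is a JSON string, so it has to be parsed before the themes can be tabulated or compared with human coding. A minimal sketch, assuming the reply follows the four-key format requested in the prompt; it also separates out labels that do not appear in the codebook. `parse_themes` and `allowed_labels` are illustrative names.

```python
import json


def parse_themes(raw_reply, allowed_labels):
    """Parse one model reply and separate valid themes from unknown labels.

    Returns a dict with 'themes' (labels found in allowed_labels), 'unknown'
    (labels that are not), and 'reasoning'. A reply that is not a JSON object
    yields empty lists and an empty reasoning string.
    """
    try:
        reply = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        reply = {}
    if not isinstance(reply, dict):
        reply = {}
    labels = [reply.get(f"Theme {i}") for i in (1, 2, 3)]
    # Drop missing entries and the literal 'None' that the prompt allows.
    labels = [label for label in labels if isinstance(label, str) and label != "None"]
    return {
        "themes": [label for label in labels if label in allowed_labels],
        "unknown": [label for label in labels if label not in allowed_labels],
        "reasoning": reply.get("Reasoning", ""),
    }
```

Here `allowed_labels` could be built as `set(codebook['Parent']) | set(codebook['Child'])`, and a non-empty `unknown` list signals a label the model invented or misspelled.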


### Prompt 2 (Chain of thought): Step 1: label all parent codes with reasoning. Step 2: label the child code within each parent code; if none of the child codes fit, keep only the parent code.

#### Prompt archive

In [23]:
prompt_base2 = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to use the Codebook in CSV format to code Paragraph from the interview transcript. The Codebook contains four columns: 'Parent', 'Child', 'Child_description' and 'Key words'.

Here's how you should proceed:

1. Review parent code in the Codebook:
   - 'Parent' column includes categorical labels for broader themes or categories.

2. Using the Codebook, identify up to three of the most salient themes in each Paragraph, which should cover the most noticeable, central, important ideas conveyed in the Paragraph, by following these steps:
   a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system;
   b. Identify up to three of the most salient themes from the Parent categories.
   c. Provide the identified 'Parent' categories as labels for this Paragraph.

3. Review child code in the Codebook:
   - 'Child' column includes categorical labels for detailed themes, nested within corresponding 'Parent' categories.
   - 'Child_description' column provides detailed descriptions of what each child category in 'Child' represents.
   - 'Key words' column provides highly frequent words that are relevant to corresponding 'Child' categories.

4. Using the Codebook, within each identified 'Parent' category, identify the most salient 'Child' categories. Use the following steps:
   a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system;
   b. Refer to the Codebook, for each identified 'Parent' category, find the most appropriate subcategories from 'Child' for this Paragraph based on 'Child_description'. You can also refer to 'Key words'.
   c. Please ensure that the 'Child' you assign accurately represents the content of the Paragraph. If you are uncertain or if a Paragraph does not fit any 'Child' categories from the Codebook, please return 'None'.
   d. Review all identified 'Parent' and 'Child' pairs, select three pairs that represent the most noticeable, central, important idea conveyed in the Paragraph.
   e. Provide identified three pairs of 'Parent' and their corresponding 'Child' as labels for this paragraph.

Here is the Codebook:
{codebook}

--- Paragraph to Label ---
[[[TEXTGOHERE]]]

Format your response as a JSON object with 7 keys where
'Parent 1', 'Child 1', 'Parent 2', 'Child 2', 'Parent 3', 'Child 3', and 'Reasoning' as the keys.
"""

In [24]:
prompt_base2_1 = f"""
You are a policy researcher. You are provided with a Paragraph from a transcript of an interview with an education policy stakeholder.
Your task is to use the Codebook in CSV format to code Paragraph from the interview transcript. The Codebook contains four columns: 'Parent', 'Child', 'Child_description' and 'Key words'.

Here's how you should proceed:

1. Using the Codebook, identify up to three of the most salient themes in each Paragraph, which should cover the most noticeable, central, important ideas conveyed in the Paragraph, by following these steps:
a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system;
b. Review Parent Code in the Codebook: the ‘Parent’ column includes categorical labels for broader themes.
c. Identify up to three of the most salient themes from the Parent categories.
d. Provide the identified 'Parent' categories as labels for this Paragraph.

2. Review child code in the Codebook:
- 'Child' column includes categorical labels for detailed themes, nested within corresponding 'Parent' categories.
- 'Child_description' column provides detailed descriptions of what each child category in 'Child' represents.
- 'Key words' column provides highly frequent words that are relevant to corresponding 'Child' categories.

3. Using the Codebook, within each identified 'Parent' category, identify the most salient 'Child' categories. Use the following steps:
a. Read the Paragraph and understand the meaning in the context of Washington State K-12 public school system;
b. Refer to the Codebook, for each identified 'Parent' category, find the most appropriate subcategories from 'Child' for this Paragraph based on 'Child_description'. You can also refer to 'Key words'.
c. Please ensure that the 'Child' you assign accurately represents the content of the Paragraph. If you are uncertain or if a Paragraph does not fit any 'Child' categories from the Codebook, please return 'None'.
d. Review all identified 'Parent' and 'Child' pairs, select three pairs that represent the most noticeable, central, important idea conveyed in the Paragraph.
e. Provide the three identified 'Parent' categories and their corresponding 'Child' categories as labels for this Paragraph.

Here is the Codebook:
{codebook}

--- Paragraph to Label ---
[[[TEXTGOHERE]]]

Format your response as a JSON object with 7 keys:
'Parent 1', 'Child 1', 'Parent 2', 'Child 2', 'Parent 3', 'Child 3', and 'Reasoning'.
"""


#### Prompt in use

In [25]:
prompt_base2 = f"""
Task: As a policy researcher, you’ve been provided with a paragraph extracted from an interview with an education policy stakeholder. Utilize the provided Codebook (in CSV format) to code the paragraph. The Codebook comprises four columns: ‘Parent’, ‘Child’, ‘Child_description’, and ‘Key words’.

Steps:
1. Identify Salient Themes:
Understand the paragraph’s content within the context of the Washington State K-12 public school system.
Refer to the ‘Parent’ column in the Codebook for broader thematic categories.
Pinpoint up to three salient themes from these ‘Parent’ categories.
These themes should highlight the most significant ideas in the paragraph.
Label the paragraph with the chosen ‘Parent’ themes.

2. Dive into Child Themes:
The ‘Child’ column in the Codebook lists detailed thematic subcategories, which fall under the broader ‘Parent’ categories.
The ‘Child_description’ elaborates on the ‘Child’ categories, and the ‘Key words’ column lists pertinent terms for each ‘Child’ category.

3. Associate with Child Categories:
Revisit the paragraph, keeping the Washington State K-12 public school system context in mind.
For each previously identified ‘Parent’ theme, pinpoint the apt ‘Child’ subcategories from the Codebook. The ‘Child_description’ and ‘Key words’ columns can aid your decision.
Ensure the ‘Child’ categories align with the paragraph’s content. If there’s no fit or you’re uncertain, label it as ‘None’.
From your identified ‘Parent’ and ‘Child’ pairs, pick the top three pairs that encapsulate the paragraph’s central ideas.
Label the paragraph with these three ‘Parent’ and corresponding ‘Child’ pairs.

Codebook:
{codebook}

Paragraph for Analysis:
[[[TEXTGOHERE]]]

Response Format:
Frame your answer as a JSON object containing the keys: ‘Parent 1’, ‘Child 1’, ‘Parent 2’, ‘Child 2’, ‘Parent 3’, ‘Child 3’, and ‘Reasoning’.
"""

In [26]:
def create_prompt(doc_in):
    return prompt_base2.replace('TEXTGOHERE', doc_in)
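
A minimal sketch of what this substitution produces, using a shortened stand-in template (`demo_template` and `create_prompt_demo` are hypothetical names, not part of the notebook). It shows that only the `TEXTGOHERE` token is replaced, so the `[[[` and `]]]` markers stay in the prompt as delimiters around the paragraph.

```python
# Hypothetical, shortened stand-in for prompt_base2 (illustration only).
demo_template = """Paragraph for Analysis:
[[[TEXTGOHERE]]]

Response Format:
Frame your answer as a JSON object."""


def create_prompt_demo(doc_in, template=demo_template):
    # Same substitution as create_prompt: only the token is replaced,
    # so the triple brackets remain as delimiters around the paragraph.
    return template.replace("TEXTGOHERE", doc_in)


demo_prompt = create_prompt_demo("we need more bilingual staff")
print(demo_prompt)
```

If the brackets are not wanted in the final prompt, replacing the full `[[[TEXTGOHERE]]]` string instead would drop them.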

In [None]:
t = df.iloc[8:10, ]

In [27]:
output2 = []
for index, row in tqdm(df_sampled.iterrows(), total=df_sampled.shape[0]):
    doc_in = row.documents
    output_current = label_topic(doc_in)
    output2.append(output_current)

100%|██████████| 52/52 [01:36<00:00,  1.85s/it]


In [29]:
# cot prompt results
print(output2[0])

{
  "Parent 1": "Curriculum and instruction",
  "Child 1": "Instructional programs",
  "Parent 2": "Student supports and interventions",
  "Child 2": "Students' SEL and health",
  "Parent 3": "Data, evidence, and accountability",
  "Child 3": "Accountability system",
  "Reasoning": "The paragraph highlights the educator's focus on providing the best situation for students, indicating a concern for instructional programs. The mention of English language development services for students points towards student support and interventions, specifically in the area of social-emotional learning and health. The frustration around compliance suggests a need for accountability systems to ensure student needs are met effectively."
}
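
The prompt asks for seven keys, so it can help to confirm that a reply contains all of them before tabulating it. The following is a minimal sketch under that assumption; `EXPECTED_KEYS` and `missing_keys` are hypothetical helpers, and the sample reply is made up.

```python
import json

# Keys requested in the "Response Format" part of the prompt.
EXPECTED_KEYS = [
    "Parent 1", "Child 1",
    "Parent 2", "Child 2",
    "Parent 3", "Child 3",
    "Reasoning",
]


def missing_keys(json_string):
    """Return the expected keys that are absent from a model reply."""
    parsed = json.loads(json_string)
    return [key for key in EXPECTED_KEYS if key not in parsed]


# A made-up reply that uses 'Parent3' (no space) instead of 'Parent 3'.
reply = json.dumps({
    "Parent 1": "School finance", "Child 1": "Funding formula",
    "Parent 2": "Staffing resources", "Child 2": "None",
    "Parent3": "None", "Child 3": "None",
    "Reasoning": "Illustrative text only.",
})
print(missing_keys(reply))
```

A non-empty result signals a reply whose columns would otherwise be silently misaligned once the rows are combined into a DataFrame.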


In [33]:
!pip install ace_tools

Collecting ace_tools
  Downloading ace_tools-0.0-py3-none-any.whl.metadata (300 bytes)
Downloading ace_tools-0.0-py3-none-any.whl (1.1 kB)
Installing collected packages: ace_tools
Successfully installed ace_tools-0.0


In [47]:
# Initialize an empty list for rows
rows = []

# Iterate over output2, df_sampled['roles'], and df_sampled['documents'] simultaneously
for json_string, role, doc in zip(output2, df_sampled['roles'], df_sampled['documents']):
    try:
        parsed_item = json.loads(json_string)  # Convert JSON string to dictionary
        parsed_item["Role"] = role  # Add corresponding role
        parsed_item["Document"] = doc  # Add corresponding document
        rows.append(parsed_item)  # Append updated dictionary
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        rows.append({"Role": role, "Document": doc})  # Append only roles and documents if JSON parsing fails

# Convert list of dictionaries into a DataFrame
output2_df = pd.DataFrame(rows)

# Reorder columns to make "Role" and "Document" the first columns
column_order = ["Role", "Document"] + [col for col in output2_df.columns if col not in ["Role", "Document"]]
output2_df = output2_df[column_order]


output2_df.to_excel("output2.xlsx")
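
`json.loads` raises `JSONDecodeError` whenever a reply carries text around the JSON object, for example a Markdown code fence. Whether the model used here ever does that is an assumption; the sketch below shows one tolerant fallback, with `extract_json` as a hypothetical helper that parses the span from the first `{` to the last `}`.

```python
import json


def extract_json(reply):
    """Parse the first '{' ... last '}' span of a reply, or return None."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None


fence = "`" * 3  # build the fence marker without nesting fences in this example
wrapped = f'{fence}json\n{{"Parent 1": "School finance", "Child 1": "None"}}\n{fence}'
print(extract_json(wrapped))
print(extract_json("no JSON here"))
```

Returning `None` instead of raising keeps the role and document for that row, in the same spirit as the `except` branch in the cell above.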

In [48]:
from google.colab import files
files.download("output2.xlsx")


In [52]:
output2_df

,Role,Document,Parent 1,Child 1,Parent 2,Child 2,Parent 3,Child 3,Reasoning
0,Educator,as a educator who works with students directly...,Curriculum and instruction,Instructional programs,Student supports and interventions,Students' SEL and health,"Data, evidence, and accountability",Accountability system,The paragraph highlights the educator's focus ...
1,Educator,and there is almost never any conversations ab...,"Governance, leadership, and community partnership",Leadership in diversity,Staffing resources,Diversify teacher workforce (teacher labor mar...,"Culture, climate and environment",Anti-racism,The paragraph discusses the lack of diversity ...
2,Educator,"in another word, the quality here actually hav...",Curriculum and instruction,Instructional programs,"Data, evidence, and accountability","Goals, outcomes, and measures",Student supports and interventions,Students' SEL and health,The paragraph discusses the quality of instruc...
3,Educator,"so speaking at a local district level, we have...","Governance, leadership, and community partnership",Leadership in diversity,Staffing resources,Diversify teacher workforce (teacher labor mar...,"Culture, climate and environment",Anti-racism,The paragraph discusses the approval of an equ...
4,Educator,the one consistent thing that comes up is so c...,"Data, evidence, and accountability","Data access, analysis, reporting, use, quality...","Governance, leadership, and community partnership",Leadership in diversity,School finance,Funding formula,The paragraph discusses the impact of data rep...
5,Educator,there is a lot of great work. if you have not ...,"Data, evidence, and accountability","Data access, analysis, reporting, use, quality...",Staffing resources,Diversify teacher workforce (teacher labor mar...,Curriculum and instruction,Curriculum development and instructional delivery,"The paragraph discusses decolonizing data, rep..."
6,Educator,but now with the sat not being used as much in...,Curriculum and instruction,Curriculum development and instructional delivery,"Data, evidence, and accountability",Tests and inconsistent standards for college r...,"Governance, leadership, and community partnership",Leadership in diversity,The paragraph discusses the shift from using t...
7,Educator,i am not an anti-union person. i think the wea...,Staffing resources,"Teacher union, salary, workforce",Curriculum and instruction,Curriculum development and instructional delivery,,,The paragraph discusses the impact of teacher ...
8,Educator,if you reach out to office and native educatio...,"Data, evidence, and accountability","Data access, analysis, reporting, use, quality...","Governance, leadership, and community partnership",Leadership in diversity,,,The paragraph discusses the availability and u...
9,Educator,but we have 70% of our five-year-olds coming i...,Curriculum and instruction,Instructional programs,School finance,Progressive funding,Staffing resources,"Teacher union, salary, workforce",The paragraph highlights the need for state po...


In [57]:
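for role in output2_df["Role"].unique():
    print(role)
    # Caution: this counts 'Parent 1' across the whole DataFrame, so every role
    # prints the same totals. For per-role counts, filter first:
    #   output2_df.loc[output2_df["Role"] == role, "Parent 1"].value_counts()
    print(output2_df['Parent 1'].value_counts())
    print("\n")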

Educator
Parent 1
Data, evidence, and accountability                   12
Curriculum and instruction                           11
Governance, leadership, and community partnership     9
Staffing resources                                    6
Culture, climate and environment                      5
School finance                                        4
System supports and interventions                     3
Student supports and interventions                    2
Name: count, dtype: int64


Funder
Parent 1
Data, evidence, and accountability                   12
Curriculum and instruction                           11
Governance, leadership, and community partnership     9
Staffing resources                                    6
Culture, climate and environment                      5
School finance                                        4
System supports and interventions                     3
Student supports and interventions                    2
Name: count, dtype: int64


Local administ
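
The Educator and Funder blocks above are identical because `value_counts()` runs on the full `output2_df` in every iteration. A minimal sketch of a per-role count using `groupby` follows; `toy_df` is made-up data that only borrows the column names of `output2_df`.

```python
import pandas as pd

# Made-up rows with the same column names as output2_df (illustration only).
toy_df = pd.DataFrame({
    "Role": ["Educator", "Educator", "Educator", "Funder"],
    "Parent 1": [
        "Curriculum and instruction",
        "Curriculum and instruction",
        "School finance",
        "School finance",
    ],
})

# Count 'Parent 1' labels separately within each role.
per_role = toy_df.groupby("Role")["Parent 1"].value_counts()

for role in toy_df["Role"].unique():
    print(role)
    print(per_role.loc[role].to_string())
    print()
```

The result is a Series indexed by (Role, Parent 1), so a single role's counts can be pulled out with `.loc[role]`.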

In [58]:
theme_summary = output2_df.melt(
    id_vars=["Role"],
    value_vars=["Parent 1", "Parent 2", "Parent 3"],
    var_name="Theme_Level",
    value_name="Theme"
).groupby(["Role", "Theme"]).size().reset_index(name="Count")

print(theme_summary)

                                      Role  \
0                                 Educator   
1                                 Educator   
2                                 Educator   
3                                 Educator   
4                                 Educator   
5                                 Educator   
6                                 Educator   
7                                 Educator   
8                                   Funder   
9                                   Funder   
10                                  Funder   
11                                  Funder   
12      Local administrators/board members   
13      Local administrators/board members   
14      Local administrators/board members   
15      Local administrators/board members   
16      Local administrators/board members   
17      Local administrators/board members   
18      Local administrators/board members   
19      Local administrators/board members   
20      Local administrators/board

In [59]:
for role in output2_df["Role"].unique():
    print(f"\n📌 **Role:** {role}")
    role_df = output2_df[output2_df["Role"] == role]
    theme_counts = pd.concat([role_df["Parent 1"], role_df["Parent 2"], role_df["Parent 3"]]).value_counts()
    print(theme_counts.to_string())  # Display theme counts for this role
    print("="*50)  # Separator


📌 **Role:** Educator
Curriculum and instruction                           11
Data, evidence, and accountability                    8
Governance, leadership, and community partnership     7
Staffing resources                                    7
Student supports and interventions                    5
Culture, climate and environment                      3
School finance                                        2
None                                                  2

📌 **Role:** Funder
Culture, climate and environment      2
Student supports and interventions    2
Curriculum and instruction            1
System supports and interventions     1

📌 **Role:** Local administrators/board members
Governance, leadership, and community partnership    4
Data, evidence, and accountability                   2
Curriculum and instruction                           2
Staffing resources                                   2
System supports and interventions                    1
School finance             

In [63]:
# Identify Parent and Child columns
parent_cols = [col for col in output2_df.columns if "Parent" in col]
child_cols = [col for col in output2_df.columns if "Child" in col]

# Iterate over each role
for role in output2_df["Role"].unique():
    print(f"\n📌 **Role:** {role}")
    role_df = output2_df[output2_df["Role"] == role]

    # Dictionary to store unique Parent themes with counts and corresponding Child themes
    parent_child_dict = {}

    # Iterate through Parent-Child columns
    for p_col, c_col in zip(parent_cols, child_cols):
        for _, row in role_df.iterrows():
            parent_theme = row[p_col]
            child_theme = row[c_col]

            if pd.notna(parent_theme):  # Ensure it's not empty
                if parent_theme not in parent_child_dict:
                    parent_child_dict[parent_theme] = {"count": 0, "children": {}}
                parent_child_dict[parent_theme]["count"] += 1  # Increment Parent count

                if pd.notna(child_theme):  # Ensure it's not empty
                    if child_theme not in parent_child_dict[parent_theme]["children"]:
                        parent_child_dict[parent_theme]["children"][child_theme] = 0
                    parent_child_dict[parent_theme]["children"][child_theme] += 1  # Increment Child count

    # Print the unique Parent and Child themes with counts
    for parent, data in parent_child_dict.items():
        print(f"  🔹 {parent} ({data['count']} occurrences)")
        for child, count in sorted(data["children"].items()):  # Sorting for readability
            print(f"     ↳ {child} ({count} occurrences)")

    print("=" * 50)  # Separator for readability


📌 **Role:** Educator
  🔹 Curriculum and instruction (11 occurrences)
     ↳ Curriculum development and instructional delivery (4 occurrences)
     ↳ Instructional programs (7 occurrences)
  🔹 Governance, leadership, and community partnership (7 occurrences)
     ↳ Leadership in diversity (6 occurrences)
     ↳ Local control and district policies and politics (1 occurrences)
  🔹 Data, evidence, and accountability (8 occurrences)
     ↳ Accountability system (1 occurrences)
     ↳ Data access, analysis, reporting, use, quality control (1 occurrences)
     ↳ Data access, analysis, reporting, use, quality, and equity (1 occurrences)
     ↳ Data access, analysis, reporting, use, quality, and governance (1 occurrences)
     ↳ Data capacity (1 occurrences)
     ↳ Goals, outcomes, and measures (2 occurrences)
     ↳ Tests and inconsistent standards for college readiness (1 occurrences)
  🔹 Staffing resources (7 occurrences)
     ↳ Diversify teacher workforce (teacher labor market) (3 occurren
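
The Educator listing above contains several near-identical 'Data access, ...' child labels, and the output alone does not show which of them are exact Codebook entries. One way to check is to compare each returned pair against the Codebook. This is a sketch only: `codebook_pairs` is a hypothetical stand-in for the real (Parent, Child) pairs, `unknown_pairs` is a hypothetical helper, and `rows_demo` is made up.

```python
# Hypothetical stand-in for the (Parent, Child) pairs in the real codebook.
codebook_pairs = {
    ("School finance", "Funding formula"),
    ("School finance", "Progressive funding"),
    ("Staffing resources", "Teacher union, salary, workforce"),
}


def unknown_pairs(rows, valid_pairs, n_slots=3):
    """Collect (Parent, Child) pairs from parsed replies that are not in valid_pairs."""
    flagged = []
    for row in rows:
        for i in range(1, n_slots + 1):
            parent = row.get(f"Parent {i}")
            child = row.get(f"Child {i}")
            # Skip empty slots and the 'None' placeholder the prompt allows.
            if parent in (None, "None") or child in (None, "None"):
                continue
            if (parent, child) not in valid_pairs:
                flagged.append((parent, child))
    return flagged


# Made-up parsed replies; the second child label is a paraphrase, not an exact match.
rows_demo = [
    {"Parent 1": "School finance", "Child 1": "Funding formula"},
    {"Parent 1": "School finance", "Child 1": "Funding formulas and levies"},
    {"Parent 1": "Staffing resources", "Child 1": "None"},
]
print(unknown_pairs(rows_demo, codebook_pairs))
```

The helper works on a list of parsed dictionaries such as `rows`, before they are converted to a DataFrame, so missing slots appear as absent keys rather than NaN.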

In [75]:
for role in df['roles'].unique():
    role_count = df[df['roles'] == role].shape[0]  # Count occurrences of the role
    sample_size = int(role_count * 0.20)  # 20% sample size, truncated toward zero

    print(f" Role: {role}")
    print(f"Total Count: {role_count}")
    print(f"20% Sample Size: {sample_size}")
    print("=" * 50)  # Separator for readability

 Role: Educator
Total Count: 77
20% Sample Size: 15
 Role: Local administrators/board members
Total Count: 26
20% Sample Size: 5
 Role: State nonprofits/advocacy orgs
Total Count: 29
20% Sample Size: 5
 Role: State administrators/legislators/board
Total Count: 91
20% Sample Size: 18
 Role: Local nonprofits/advocacy orgs
Total Count: 30
20% Sample Size: 6
 Role: Funder
Total Count: 11
20% Sample Size: 2


In [74]:
df_sampled['roles'].value_counts()

roles,count
State administrators/legislators/board,18
Educator,15
Local nonprofits/advocacy orgs,6
State nonprofits/advocacy orgs,6
Local administrators/board members,5
Funder,2
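
These counts show 6 paragraphs for State nonprofits/advocacy orgs, whereas the `int(role_count * 0.20)` calculation printed 5. The gap is consistent with rounding 29 × 0.20 = 5.8 rather than truncating it. How `df_sampled` was drawn is not shown in this section, so rounding is an assumption here. The sketch below compares the two rules on the role totals printed earlier.

```python
# Role totals as printed by the 20% sample-size cell.
role_counts = {
    "Educator": 77,
    "Local administrators/board members": 26,
    "State nonprofits/advocacy orgs": 29,
    "State administrators/legislators/board": 91,
    "Local nonprofits/advocacy orgs": 30,
    "Funder": 11,
}

frac = 0.20
truncated = {role: int(n * frac) for role, n in role_counts.items()}
rounded = {role: round(n * frac) for role, n in role_counts.items()}

for role in role_counts:
    print(f"{role}: int -> {truncated[role]}, round -> {rounded[role]}")

print("Total with int:  ", sum(truncated.values()))
print("Total with round:", sum(rounded.values()))
```

Only the rounded sizes add up to the 52 paragraphs processed in the labeling loop.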


In [28]:
# zero-shot prompt results
print(output[0])

{
    "Theme 1": "Student supports and interventions",
    "Theme 2": "Curriculum and instruction",
    "Theme 3": "Data, evidence, and accountability",
    "Reasoning": "The paragraph discusses the educator's role in providing the best situation for students, highlighting the disconnect between what students need and compliance requirements. This aligns with the theme of student supports and interventions. Additionally, the mention of English language development services relates to curriculum and instruction. The frustration around compliance also points to the theme of data, evidence, and accountability."
}


## Sentiment Analysis