
 # Project Title: [Your Project Title]
 ### Analyst: Matt Anthes-Washburn, aw@catapult-x.com
 #### Start Date: 12/19/2023
 **Objective:** Analyze customer interview data to identify key themes and trends.

 ## 1. Data Import


In [250]:

# Import the necessary libraries
import os
import pandas as pd
from dotenv import load_dotenv

# Load environment variables (e.g. API credentials) from a local .env file
load_dotenv()

# Constants
DATA_PATH = "data"
TRANSCRIPT_PATH = "data/transcripts"
ANNOTATIONS_PATH = "data/annotations"
AZURE_DEPLOYMENT_GPT = "eddo-gpt4"


# Load transcript data
transcripts_df = pd.read_csv(os.path.join(DATA_PATH, "merged_transcripts.csv"))
transcripts_df.head()    

# Load annotations data from the individual CSV files
def load_all_annotations_from_csv():
    """Load the per-respondent annotation CSVs, combine them, and save the result."""
    frames = [
        pd.read_csv(os.path.join(ANNOTATIONS_PATH, file), index_col=0)
        for file in os.listdir(ANNOTATIONS_PATH)
        if file.endswith(".csv")
    ]
    # Concatenate once; fall back to an empty frame if no files exist yet.
    annotations_df = pd.concat(frames) if frames else pd.DataFrame()
    # Save the combined annotations to a CSV file.
    path = os.path.join(DATA_PATH, "annotations.csv")
    annotations_df.to_csv(path)
    return annotations_df

annotations_df = load_all_annotations_from_csv()
annotations_df

Unnamed: 0,speaker,theme,context,sentiment_score,brand,identified_purchases,start_time,end_time,email,last_name,first_name
0,Robert.Lehman,Purchasing Experience,Robert mentions the difficulties of purchasing...,-0.4,,[],06:04,07:03,robert.lehman@pgcps.org,Lehman,Robert
1,Robert.Lehman,Educational Policies,Robert discusses how educational policies infl...,-0.3,,[],06:04,07:03,robert.lehman@pgcps.org,Lehman,Robert
2,Robert.Lehman,Digital Resources,Robert talks about the shift towards digital t...,0.1,,[],04:49,05:25,robert.lehman@pgcps.org,Lehman,Robert
3,Robert.Lehman,Budget and Timing,Robert explains the budgeting process within t...,0.0,,[],05:44,06:09,robert.lehman@pgcps.org,Lehman,Robert
4,Robert.Lehman,Buying Habits,Robert explains his decision to spend out of p...,0.2,,[],03:21,03:44,robert.lehman@pgcps.org,Lehman,Robert
...,...,...,...,...,...,...,...,...,...,...,...
42,Mr. Ruber-Strohm,Digital Resources,Wants a video included with lab kits to help a...,0.7,,[],49:07,49:07,ruberg@eths202.org,Ruber,Gregory
43,Mr. Ruber-Strohm,Customer Service,Would utilize a safety video included in a kit...,0.6,,[],50:13,50:13,ruberg@eths202.org,Ruber,Gregory
44,Mr. Ruber-Strohm,Product Quality,"Prefers diversity of results in experiments, a...",0.5,,[],51:20,51:20,ruberg@eths202.org,Ruber,Gregory
45,Mr. Ruber-Strohm,Educational Policies,Teacher values real-life science experiences o...,0.8,,[],52:06,53:42,ruberg@eths202.org,Ruber,Gregory



 ## 2. Annotations
 ### Review transcripts and identify key themes and trends.



In [241]:
# Set up the LLM Chain

# Import necessary libraries
from langchain_core.runnables import Runnable
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.openai_functions import create_openai_fn_runnable
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List
import asyncio


class ProductPurchaseDetail(BaseModel):
    """Details of a specific product purchase."""
    product_name: str = Field(..., description="Name of the product purchased.")
    vendor: str = Field(..., description="Vendor from which the product was purchased.")
    purchase_reason: str = Field(default="", description="Reason for purchasing the product from this vendor.")

class Annotation(BaseModel):
    """Identify and annotate the presence of themes in the transcript."""
    speaker: str = Field(..., description="The speaker of the text segment. Do not annotate themes where the speaker is Daylene or Kimberly.")
    theme: str = Field(..., description="""The theme that applies to the text segment. Can include "Brand Perception", "Product Quality", "Customer Service",
    "Purchasing Experience", "Digital Resources", "Environmental Sustainability",
    "Educational Policies", "Customer Experience", "Buying Habits",
    "Purchasing Patterns", "Vendor Comparison", "Budget and Timing",
    "Generational Insights", "Carolina Purchases", or "Flinn Purchases".""")
    context: str = Field(..., description="A one-sentence summary of the customer's statement.")
    sentiment_score: float = Field(..., description="The sentiment score where -1 is strongly negative and 1 is strongly positive.")
    brand: str = Field(default=None, description="Brand mentioned in the segment, if applicable.")
    identified_purchases: List[ProductPurchaseDetail] = Field(default=[], description="Details of specific products purchased and their vendors.")
    start_time: str = Field(..., description="The time stamp marking the start of the relevant text.")
    end_time: str = Field(..., description="The time stamp marking the end of the relevant text.")
    
class AnnotationCollection(BaseModel):
    """A list of annotations of transcript themes."""
    annotations: List[Annotation] = Field(..., description="A list of annotations.")

INSTRUCTIONS = """As Transcript Pro, analyze educator interview transcripts for market research. Key areas include:

1. **Brand Perception:** Views on brands like Carolina, Flinn Scientific, Amazon, VWR, Ward's, etc.
2. **Product Quality:** Discussions about product durability, effectiveness, quality.
3. **Customer Service:** Experiences with customer service.
4. **Purchasing Experience:** Ease or difficulty in purchasing.
5. **Digital Resources:** Use of digital/virtual teaching tools.
6. **Environmental Sustainability:** Eco-friendly practices in education.
7. **Educational Policies:** Policy influence on purchases.
8. **Customer Experience:** Brand experiences, positive or negative.
9. **Buying Habits:** Timing and methods of buying.
10. **Purchasing Patterns:** What is bought from various vendors.
11. **Vendor Comparison:** Comparisons between Carolina Biological and others.
12. **Budget and Timing:** Budget and purchase timing considerations.
13. **Generational Insights:** Generational differences in buying.
14. **Carolina Purchases:** Specific products purchased from Carolina Biological Supply.
15. **Flinn Purchases:** Specific products purchased from Flinn Scientific.

Focus on processing and summarizing interviewee statements accurately and objectively, maintaining consistency in coding.
"""


# Chat Prompt Template from instructions
prompt = ChatPromptTemplate.from_messages(
    [
    ("system", INSTRUCTIONS),
    ("human", "Process transcript:\n {text}"),
    ]
)

llm = AzureChatOpenAI(azure_deployment=AZURE_DEPLOYMENT_GPT)
runnable = create_openai_fn_runnable([AnnotationCollection], llm, prompt)

def process_text(text):
    """Process a transcript segment and return a list of annotations."""
    try:
        response = runnable.invoke({"text": text})
        return response.annotations
    except Exception as e:
        print(f"Error processing text: {e}")
        return []
    
async def aprocess_text(text):
    """Asynchronously process a transcript segment, retrying once on failure."""
    try:
        response = await runnable.ainvoke({"text": text})
        return response.annotations
    except Exception as e:
        print(f"Error processing text: {e}")
        print("Retrying once...")
        try:
            response = await runnable.ainvoke({"text": text})
            return response.annotations
        except Exception as retry_error:
            print(f"Error processing text: {retry_error}")
        # Both attempts failed, so return no annotations for this segment.
        return []

async def compile_annotations(transcript):
    """Split a transcript into chunks, annotate them concurrently, and flatten the results."""
    texts = RecursiveCharacterTextSplitter(chunk_size=8000).split_text(transcript)
    tasks = [aprocess_text(text) for text in texts]
    task_results = await asyncio.gather(*tasks)
    # flatten the list of lists
    annotations = [annotation for task_result in task_results for annotation in task_result]
    return annotations

def unprocessed_emails():
    """Return the emails whose transcripts have not been annotated yet."""
    # All emails that have a transcript
    emails = transcripts_df["Email"].unique().tolist()
    # Emails that already have saved annotations
    processed_emails = annotations_df["email"].unique().tolist()
    # Emails still to process (named so it does not shadow this function)
    remaining_emails = list(set(emails) - set(processed_emails))

    print(f"{len(remaining_emails)} transcripts remaining to process.")
    return remaining_emails

# Load annotations from list into dataframe rows
def annotations_to_df(annotations):
    """Convert a list of annotations to a dataframe."""
    # Convert each Annotation to a dictionary. Unlike __dict__, .dict() also
    # converts the nested ProductPurchaseDetail objects to plain dictionaries.
    annotations_dicts = [obj.dict() for obj in annotations]

    # Create the DataFrame
    annotations_df = pd.DataFrame(annotations_dicts)

    return annotations_df

def save_to_csv(annotations_df, email):
    """Save one respondent's annotations to a CSV file named after their email."""
    path = os.path.join(ANNOTATIONS_PATH, f"{email}.csv")
    annotations_df.to_csv(path)

# Run the LLM Chain for the next unannotated transcript
async def process_next_transcript():
    """Process the next unannotated transcript."""

    email = unprocessed_emails()[0]
    transcript = transcripts_df[transcripts_df["Email"] == email].iloc[0]
    print(f"Processing transcript for {email}.")
    transcript_text = transcript["transcript"]
    new_annotations = await compile_annotations(transcript_text)
    
    new_annotations_df = annotations_to_df(new_annotations)
    new_annotations_df["email"] = email
    new_annotations_df["last_name"] = transcript["LastName"]
    new_annotations_df["first_name"] = transcript["FirstName"]
    save_to_csv(new_annotations_df, email)
    print(f"Saved annotations for {email}.")
    return new_annotations
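
`compile_annotations` sends every chunk to the model at once through `asyncio.gather`, and `aprocess_text` retries a failed chunk one time. For long transcripts it may help to cap the number of requests in flight and to pause briefly before a retry. The sketch below shows that pattern with the standard library only. The names `annotate_chunk`, `annotate_with_retry`, and `annotate_all`, and the parameters `max_concurrent`, `retries`, and `delay`, are hypothetical; `annotate_chunk` is a stand-in for `runnable.ainvoke`, not the real call.

```python
import asyncio

async def annotate_chunk(text):
    """Hypothetical stand-in for the LLM call; returns one fake annotation per chunk."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"annotation for: {text}"]

async def annotate_with_retry(text, worker, semaphore, retries=1, delay=0.0):
    """Run worker(text) under a semaphore, retrying up to `retries` times."""
    async with semaphore:
        for attempt in range(retries + 1):
            try:
                return await worker(text)
            except Exception as error:
                print(f"Attempt {attempt + 1} failed: {error}")
                if attempt < retries:
                    await asyncio.sleep(delay)
    return []  # every attempt failed

async def annotate_all(texts, worker=annotate_chunk, max_concurrent=4):
    """Annotate all chunks with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [annotate_with_retry(text, worker, semaphore) for text in texts]
    results = await asyncio.gather(*tasks)  # results keep the order of `texts`
    # Flatten the list of lists, as compile_annotations does.
    return [annotation for result in results for annotation in result]
```

In the notebook, `await annotate_all(texts)` could take the place of the `asyncio.gather` call once `annotate_chunk` is swapped for the real chain call.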

In [248]:
# Loop through all unannotated transcripts

while len(unprocessed_emails()) > 0:
    new_annotations = await process_next_transcript()
    print(f"Processed {len(new_annotations)} annotations.")
    annotations_df = load_all_annotations_from_csv()
    


26 emails remaining to process.
26 emails remaining to process.
Processing transcript for libby.frost@decaturschools.org.
Processed 28 annotations.
26 emails remaining to process.
26 emails remaining to process.
25 emails remaining to process.
25 emails remaining to process.
Processing transcript for amanda.fuller@pikeroadschools.org.
Processed 31 annotations.
25 emails remaining to process.
25 emails remaining to process.
24 emails remaining to process.
24 emails remaining to process.
Processing transcript for bbesspashak@philasd.org.
Processed 27 annotations.
24 emails remaining to process.
24 emails remaining to process.
23 emails remaining to process.
23 emails remaining to process.
Processing transcript for kirsten.mahovlich@clevelandmetroschools.org.
Processed 35 annotations.
23 emails remaining to process.
23 emails remaining to process.
22 emails remaining to process.
22 emails remaining to process.
Processing transcript for jhofeld@harrahschools.com.
Processed 21 annotations.
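
The loop stops only once every email appears in `annotations_df`. If a transcript yields no annotations (for example, because both attempts in `aprocess_text` fail), its CSV has no rows, the email never counts as processed, and the loop could pick it again. One possible guard, sketched here with plain Python sets, is to track which emails were already attempted in the current run. The function name `pending_emails` and the `attempted` set are hypothetical, and the addresses are made-up placeholders.

```python
def pending_emails(all_emails, processed, attempted):
    """Return emails that are neither annotated nor already attempted, in sorted order."""
    return sorted(set(all_emails) - set(processed) - set(attempted))

# Placeholder data: "b@example.org" stands for a transcript that yields no annotations.
all_emails = ["a@example.org", "b@example.org", "c@example.org"]
processed = set()   # emails with at least one saved annotation
attempted = set()   # emails tried during this run, successful or not

while True:
    remaining = pending_emails(all_emails, processed, attempted)
    if not remaining:
        break
    email = remaining[0]
    attempted.add(email)
    annotation_count = 0 if email == "b@example.org" else 3  # simulated result
    if annotation_count > 0:
        processed.add(email)

# Emails that were tried but produced nothing can be reviewed or rerun later.
failed = sorted(attempted - processed)
print(f"Finished; emails with no annotations: {failed}")
```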



 ## 3. Data Compilation
 #### Save the DataFrame to CSV and JSON for future use.



In [249]:

annotations_df = load_all_annotations_from_csv()

# Filter out annotations whose speaker is Daylene Long or Kimberly Herder
excluded_speakers = ["Daylene Long", "Kimberly Herder"]
annotations_df = annotations_df[~annotations_df["speaker"].isin(excluded_speakers)]


# Save the combined annotations to a csv file.
path = os.path.join(DATA_PATH, "annotations.csv")
annotations_df.to_csv(path)

# Save the combined annotations to a JSON file
path = os.path.join(DATA_PATH, "annotations.json")
annotations_df.to_json(path, orient="records", indent=4)

annotations_df

Unnamed: 0,speaker,theme,context,sentiment_score,brand,identified_purchases,start_time,end_time,email,last_name,first_name
0,Robert.Lehman,Purchasing Experience,Robert mentions the difficulties of purchasing...,-0.4,,[],06:04,07:03,robert.lehman@pgcps.org,Lehman,Robert
1,Robert.Lehman,Educational Policies,Robert discusses how educational policies infl...,-0.3,,[],06:04,07:03,robert.lehman@pgcps.org,Lehman,Robert
2,Robert.Lehman,Digital Resources,Robert talks about the shift towards digital t...,0.1,,[],04:49,05:25,robert.lehman@pgcps.org,Lehman,Robert
3,Robert.Lehman,Budget and Timing,Robert explains the budgeting process within t...,0.0,,[],05:44,06:09,robert.lehman@pgcps.org,Lehman,Robert
4,Robert.Lehman,Buying Habits,Robert explains his decision to spend out of p...,0.2,,[],03:21,03:44,robert.lehman@pgcps.org,Lehman,Robert
...,...,...,...,...,...,...,...,...,...,...,...
42,Mr. Ruber-Strohm,Digital Resources,Wants a video included with lab kits to help a...,0.7,,[],49:07,49:07,ruberg@eths202.org,Ruber,Gregory
43,Mr. Ruber-Strohm,Customer Service,Would utilize a safety video included in a kit...,0.6,,[],50:13,50:13,ruberg@eths202.org,Ruber,Gregory
44,Mr. Ruber-Strohm,Product Quality,"Prefers diversity of results in experiments, a...",0.5,,[],51:20,51:20,ruberg@eths202.org,Ruber,Gregory
45,Mr. Ruber-Strohm,Educational Policies,Teacher values real-life science experiences o...,0.8,,[],52:06,53:42,ruberg@eths202.org,Ruber,Gregory



 ## 5. Data Integrity and Validation
 #### Add your code and methodology for data validation and integrity checks


In [None]:

# Code for data validation and integrity checks
# [Placeholder for your data validation code]
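
As a possible starting point for this section, the sketch below checks each annotation row against constraints already stated in the `Annotation` schema: a theme from the listed set, a sentiment score between -1 and 1, and a start time no later than the end time. It assumes timestamps in the `MM:SS` form seen in the output above (an optional leading hours field is also accepted). `ALLOWED_THEMES`, `to_seconds`, and `validate_row` are hypothetical names, and rows are plain dictionaries such as those returned by `annotations_df.to_dict(orient="records")`.

```python
ALLOWED_THEMES = {
    "Brand Perception", "Product Quality", "Customer Service",
    "Purchasing Experience", "Digital Resources", "Environmental Sustainability",
    "Educational Policies", "Customer Experience", "Buying Habits",
    "Purchasing Patterns", "Vendor Comparison", "Budget and Timing",
    "Generational Insights", "Carolina Purchases", "Flinn Purchases",
}

def to_seconds(timestamp):
    """Convert "MM:SS" or "HH:MM:SS" to seconds; raise ValueError if malformed."""
    parts = [int(part) for part in str(timestamp).split(":")]
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected timestamp: {timestamp!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds

def validate_row(row):
    """Return a list of problems found in one annotation row (empty if none)."""
    problems = []
    if row.get("theme") not in ALLOWED_THEMES:
        problems.append(f"unknown theme: {row.get('theme')!r}")
    try:
        score = float(row.get("sentiment_score"))
        if not -1.0 <= score <= 1.0:
            problems.append(f"sentiment_score out of range: {score}")
    except (TypeError, ValueError):
        problems.append("sentiment_score is not a number")
    try:
        if to_seconds(row.get("start_time")) > to_seconds(row.get("end_time")):
            problems.append("start_time is after end_time")
    except ValueError:
        problems.append("malformed timestamp")
    return problems
```

Because `theme` is typed as a free `str` in the schema, the model may return a label outside the list, so it is probably safer to flag unknown themes for review than to drop those rows.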
 



 ## 6. Additional Analysis (if any)
 #### Include any further analysis or manipulation you need to perform


In [None]:

# Code for additional analysis or data manipulation
# [Placeholder for your additional analysis code]
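
One simple analysis toward the stated objective is the average sentiment per theme. The sketch below uses only the standard library and the nine rows visible in the truncated output above, so the numbers are illustrative rather than results for the full dataset. `mean_sentiment_by_theme` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

def mean_sentiment_by_theme(rows):
    """Group (theme, sentiment_score) pairs by theme and average each group."""
    scores = defaultdict(list)
    for theme, score in rows:
        scores[theme].append(score)
    return {theme: round(mean(values), 2) for theme, values in scores.items()}

# (theme, sentiment_score) pairs copied from the rows shown in the output above
sample_rows = [
    ("Purchasing Experience", -0.4),
    ("Educational Policies", -0.3),
    ("Digital Resources", 0.1),
    ("Budget and Timing", 0.0),
    ("Buying Habits", 0.2),
    ("Digital Resources", 0.7),
    ("Customer Service", 0.6),
    ("Product Quality", 0.5),
    ("Educational Policies", 0.8),
]

summary = mean_sentiment_by_theme(sample_rows)
for theme, average in sorted(summary.items()):
    print(f"{theme}: {average:+.2f}")
```

On the full frame, `annotations_df.groupby("theme")["sentiment_score"].mean()` should produce the same grouping directly in pandas.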
    



 ## Concluding Remarks
 **Final thoughts, observations, or next steps in the project:**
 [Your observations, results of the analysis, or next steps in the project]
