# Proof of Concept

This notebook demonstrates the full pipeline implemented in the package, illustrating a potential use case in the absence of real data.

1. A synthetic framework (i.e., an annotated dataset) is first generated.
2. A BERT classifier is then trained on the synthesized data. 
3. Next, an interaction is simulated between two agents representing a 'student' and a 'tutor'.
4. The trained classifier is then applied to the simulated conversation. In this case, both agents' outputs are analyzed.
5. Finally, the classified data is presented in a descriptive format, showcasing how results might be reported in a real-world scenario.

In [9]:
import os
os.getcwd()

import pandas as pd
import yaml

## 1. Generate Synthetic Framework 

In [29]:
from mypythonpackage import FrameworkGenerator

### 1.1. Generate the Framework by Stored Prompts

The framework is generated using a prompt stored in `src/framework_generation/outline_prompts/prompt_default_4types.py`. Alternatively, a dictionary of prompts can be provided directly; see the example in section 1.1.1.

The synthetic data is generated using a language model hosted locally via LM Studio. To use this setup, download the model to your local machine and run it through the LM Studio server. Instructions are available here: [LM Studio API Docs](https://lmstudio.ai/docs/app/api). 

In this use case the output is saved as a CSV file, but it can also be returned as a pandas DataFrame or stored in JSON format, depending on your needs.
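For orientation, a request to a locally hosted model can be sketched roughly as follows. This is a minimal sketch, not the package's implementation: it assumes the LM Studio server exposes an OpenAI-compatible `/v1/completions` endpoint that returns the generated text under `choices[0].text`. The helper names `build_payload`, `extract_text`, and `complete` are hypothetical, and the parameter values are placeholders.

```python
import json
import urllib.request

API_URL = "http://localhost:1234/v1/completions"  # same URL as used in this notebook


def build_payload(prompt, model="llama-3.2-3b-instruct", max_tokens=64, temperature=0.7):
    """Build an OpenAI-style completion request body (values are placeholders)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_text(response):
    """Return the generated text, or None if the response has no usable 'choices'."""
    choices = response.get("choices")
    if not choices:
        return None  # e.g. an error payload without a 'choices' key
    return choices[0].get("text", "").strip()


def complete(prompt, api_url=API_URL):
    """Send one request to the local server (requires LM Studio to be running)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        api_url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return extract_text(json.load(reply))
```

Only `complete` contacts the server; the other two helpers can be tried without a running model. Treating a response without `choices` as invalid, rather than raising an error, lets a generation loop skip that sample and continue.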

In [30]:
# Load the generator with the specified model and API URL
generator = FrameworkGenerator(model_name="llama-3.2-3b-instruct", api_url="http://localhost:1234/v1/completions")

In [5]:
# Generate a framework with 4 types of prompts
df_4 = generator.generate_framework(
    prompt_path="src/framework_generation/outline_prompts/prompt_default_4types.py", 
    num_samples=500, 
    csv_out="data/generated_tuning_data/synth_default_fourtypes.csv"
)

Generating for category: Clarification
Skipping invalid output for category 'Clarification': 'choices'
Generating for category: Small Talk
Generating for category: Question
Generating for category: Statement
Number of duplicates removed: 617
→ Saved CSV: data/generated_tuning_data/synth_default_fourtypes.csv


In [30]:
# Examples and length of each category
for category in df_4['category'].unique():
    print(f"\nCategory: {category}")
    print(f"Count: {len(df_4[df_4['category'] == category])}")
    print(df_4[df_4['category'] == category].sample(3))


Category: Clarification
Count: 478
                                                  text       category
392   Can you please rephrase or break it down for me?  Clarification
135  I'm not sure I follow, could you provide more ...  Clarification
359  I think I need more context, can you provide a...  Clarification

Category: Small Talk
Count: 208
                                                  text    category
603    Is your commute getting more challenging today?  Small Talk
891  Are you from around here? I love exploring new...  Small Talk
723      I'm glad I ran into you. What's new with you?  Small Talk

Category: Question
Count: 265
                                                   text  category
1165     What is the meaning of the word "sentimental"?  Question
1199  Can humans survive on a deserted island with l...  Question
1034  What are the main differences between Buddhism...  Question

Category: Statement
Count: 431
                                                   text 

The smaller numbers for some categories are a result of duplicates being removed.
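To illustrate how de-duplication reduces the number of rows, here is a minimal sketch. The package's exact matching rule is not shown in this notebook; the sketch assumes that two texts count as duplicates when they are identical after lowercasing and collapsing whitespace. `deduplicate` is a hypothetical helper and the rows are illustrative.

```python
def deduplicate(rows):
    """Keep the first occurrence of each text; return (unique_rows, n_removed)."""
    seen = set()
    unique = []
    for row in rows:
        # Normalise case and whitespace so trivial variants count as duplicates.
        key = " ".join(row["text"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique, len(rows) - len(unique)


rows = [
    {"text": "Could you rephrase that?", "category": "Clarification"},
    {"text": "could you  rephrase that?", "category": "Clarification"},
    {"text": "How was your weekend?", "category": "Small Talk"},
]
unique_rows, n_removed = deduplicate(rows)
print(f"Number of duplicates removed: {n_removed}")
```

Categories whose prompts tend to produce formulaic sentences lose more rows under such a rule than categories with open-ended content.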

#### 1.1.1 Example Using a Dictionary
Although this data is not used in the analysis, it shows how users can prompt the generator directly with a very simple system prompt that contains no examples.
Better prompts typically yield better data, so a more nuanced prompt may be required in practice. The prompt file referenced in section 1.1 (`src/framework_generation/outline_prompts/prompt_default_4types.py`) can serve as inspiration.

In [26]:
custom_prompt_dict = {
    "Weather Example": """<|im_start|>system
You are a helpful assistant that provides information about the weather. Please provide fictional weather-related information.
<|im_end|>"""

}

# run with direct dict and save as df
df_direct_dict = generator.generate_framework(
    prompt_dict_input=custom_prompt_dict,
    num_samples=5
)

Generating for category: Weather Example
Number of duplicates removed: 0


In [None]:
# Print the first sentence (up to the first period) of 3 randomly sampled generated texts
print("\nRandom examples from the generated text:")
for i, text in enumerate(df_direct_dict['text'].sample(3), start=1):
    print(f"\nExample {i}:")
    print(text.split('.')[0])


Random examples from the generated text:

Example 1:
I'm happy to help! Here's the latest weather forecast:

**Current Weather:** A beautiful day is unfolding across the country, with plenty of sunshine and mild temperatures

Example 2:
The forecast for tomorrow is looking good! A warm front will bring clear skies and mild temperatures, with highs in the mid-60s to low 70s

Example 3:
The temperature and humidity level are still quite high in the city today


### 1.2. Quality Check of Data
The generated data is quality-checked by training a small model from Hugging Face using a few-shot learning approach. The user must create a labeled dataset containing approximately 10 examples per category in a `csv` or `json` format. This dataset is used to train the model.
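Before training, it can be useful to confirm that the labeled file really contains enough examples per category. The following is a minimal sketch under stated assumptions: the labeled CSV is assumed to have `text` and `category` columns, and `count_examples` and `underrepresented` are hypothetical helpers that are not part of the package.

```python
import csv
import io
from collections import Counter


def count_examples(csv_file, label_column="category"):
    """Count labeled examples per category in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


def underrepresented(counts, minimum=10):
    """Return the categories that have fewer than `minimum` examples."""
    return sorted(label for label, n in counts.items() if n < minimum)


# Tiny in-memory stand-in for a labeled CSV file (illustrative data only).
sample = io.StringIO(
    "text,category\n"
    "Could you explain that again?,Clarification\n"
    "What is photosynthesis?,Question\n"
    "What is a noun?,Question\n"
)
counts = count_examples(sample)
print(underrepresented(counts, minimum=2))
```

For a real file, `open(path, newline="", encoding="utf-8")` would replace the in-memory sample.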

The model, tokenizer, and training data can all be stored for future use, if desired. The final stored dataset consists only of the examples for which the label assigned by the generative model (from section 1.1) and the label predicted by the trained classifier agree.
If the user is interested in hyperparameter tuning, `tuning` should be set to `True`, and a grid of training parameters must be provided as a dictionary to `tuning_params`. If no parameters are specified, the model is trained using the following default settings: 

```
default_tuning_params = {
    "dropout_rate": 0.01,
    "loss_function": "cross_entropy",
    "learning_rate": 5e-5,
    "batch_size_train": 8,
    "batch_size_eval": 8,
    "num_epochs": 4,
    "weight_decay": 0.01
}
```
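The format expected by `tuning_params` is not shown in this notebook. As a rough sketch, assuming each key maps to a list of candidate values, a grid can be expanded into individual training configurations as follows. `expand_grid` and `tuning_grid` are hypothetical names used only for illustration.

```python
from itertools import product

default_tuning_params = {
    "dropout_rate": 0.01,
    "loss_function": "cross_entropy",
    "learning_rate": 5e-5,
    "batch_size_train": 8,
    "batch_size_eval": 8,
    "num_epochs": 4,
    "weight_decay": 0.01,
}


def expand_grid(grid):
    """Yield one parameter dict per combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Assumed grid format: each key maps to a list of values to try.
tuning_grid = {
    "learning_rate": [5e-5, 3e-5],
    "num_epochs": [2, 4],
}

# Each candidate overrides the defaults only for the keys present in the grid.
candidates = [{**default_tuning_params, **combo} for combo in expand_grid(tuning_grid)]
print(len(candidates))
```

The number of candidates is the product of the list lengths, so grids grow quickly; with a small few-shot training set, a handful of values per parameter is usually enough to explore.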


In [31]:
filtered_df = generator.filter_with_classifier(
    train_data="data/generated_tuning_data/tiny_labeled_default.csv", # the labeled data used for training the classifier provided by user
    synth_data="data/generated_tuning_data/synth_default_fourtypes.csv", # the synthesized data generated above (the pandas df can also be used here)
    classifier_model_name = "distilbert-base-uncased", # huggingface model name
    filtered_save_path="data/generated_tuning_data/final_qual_default.csv" # output path for storage   
)

Max token: 34 Average token: 16.181818181818183


Map: 100%|██████████| 55/55 [00:00<00:00, 17640.65 examples/s]
Map: 100%|██████████| 14/14 [00:00<00:00, 7121.06 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,1.254566,0.428571,0.1875,0.428571,0.25974
2,No log,1.05169,0.857143,0.914286,0.857143,0.857143
3,No log,0.912623,0.857143,0.914286,0.857143,0.863889
4,No log,0.858294,0.857143,0.914286,0.857143,0.863889




Map: 100%|██████████| 1382/1382 [00:00<00:00, 14771.91 examples/s]


Filtered data saved to data/generated_tuning_data/final_qual_default.csv


In [37]:
# Examples and length:
filtered_df_dp = filtered_df.drop(columns=['label_id', 'predicted'])
for category in filtered_df_dp['category'].unique():
    print(f"\nCategory: {category}")
    print(f"Count: {len(filtered_df_dp[filtered_df_dp['category'] == category])}")
    print(filtered_df_dp[filtered_df_dp['category'] == category].sample(3))

    


Category: Clarification
Count: 433
                                                  text       category
470  Can you rephrase or provide more context to he...  Clarification
83       Could you rephrase the question or statement?  Clarification
200  I see what you're saying, but I'm not sure how...  Clarification

Category: Small Talk
Count: 155
                                                  text    category
549     How was your weekend, did you do anything fun?  Small Talk
616  So, what's new with you? Any exciting plans or...  Small Talk
659  I love the new shoes you're wearing. Where did...  Small Talk

Category: Question
Count: 245
                                                  text  category
855  What are the three main components of the proc...  Question
913  What are the benefits of regular exercise on m...  Question
909  How do trees adapt to changing environmental c...  Question

Category: Statement
Count: 428
                                                   text   ca

A quality-checked training dataset has now been synthesized.
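The agreement rule described in section 1.2 can be sketched in a few lines. This is an illustration rather than the package's code: it assumes each row carries the generator's label in `category` and the classifier's label in `predicted`, both as label names, and `keep_agreeing` is a hypothetical helper.

```python
def keep_agreeing(rows):
    """Keep rows where the generator's category equals the classifier's prediction."""
    return [row for row in rows if row["category"] == row["predicted"]]


# Illustrative rows only; real data would come from the synthesized framework.
rows = [
    {"text": "Could you rephrase the question?", "category": "Clarification", "predicted": "Clarification"},
    {"text": "How was your weekend?", "category": "Small Talk", "predicted": "Question"},
    {"text": "What is the capital of France?", "category": "Question", "predicted": "Question"},
]
filtered = keep_agreeing(rows)
print(f"Kept {len(filtered)} of {len(rows)} rows")
```

Keeping only rows on which both models agree trades dataset size for label consistency, which matches the drop in counts per category seen above.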

## 2. Generate Synthetic Tutor/Student Interaction 
In this step, a dialogue is synthesized between two agents. A single Hugging Face model (either MLX or the standard Hugging Face format) is used as the generator behind both the 'student' and 'tutor' agents.

The interaction setup is determined by system prompts assigned to each agent. These prompts are defined in a user-created `.yaml` file tailored to the specific use case. By default, a general problem-solving mode is used: the student is prompted to act as a learner tackling an educational task of their own choosing, while the tutor is prompted to assist in solving it. Examples of prompts can be found here: `src/dialogue_generation/txt_llm_inputs/system_prompts.yaml`. The `mode` argument of the simulation function selects which system prompt in that file is used.

While this step could be omitted in a real-world pipeline—since the goal is to analyze *human* interactions with chatbots rather than chatbot-to-chatbot exchanges—it is included here to (1) generate sufficient data for completing the pipeline and carrying out the full analysis and (2) demonstrate how synthetic interaction data can be created in cases where real conversations are unavailable or impractical to collect.
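Conceptually, a two-agent simulation alternates between a tutor reply and a student reply, both produced by the same generator under different system prompts. The sketch below illustrates that loop under stated assumptions: `run_dialogue` and `echo_generator` are hypothetical names, the stub generator replaces the language model, and the package's own implementation may differ.

```python
def run_dialogue(generate, student_prompt, tutor_prompt, seed_message, turns=3):
    """Alternate between two prompted agents that share one generator function."""
    history = []  # (speaker, message) tuples visible to both agents
    rows = []
    student_msg = seed_message
    for turn in range(1, turns + 1):
        history.append(("student", student_msg))
        tutor_msg = generate(tutor_prompt, history)
        history.append(("tutor", tutor_msg))
        rows.append({"turn": turn, "student_msg": student_msg, "tutor_msg": tutor_msg})
        # The student's next message is generated from the same shared history.
        student_msg = generate(student_prompt, history)
    return rows


def echo_generator(system_prompt, history):
    """Stand-in for a language model: returns a canned reply (no model is loaded)."""
    last_speaker = history[-1][0]
    return f"[{system_prompt}] reply to {last_speaker} message {len(history)}"


rows = run_dialogue(echo_generator, "student", "tutor", "Hi, I need help.", turns=2)
for row in rows:
    print(row)
```

Swapping `echo_generator` for a function that calls a real model is the only change needed to produce actual conversations; the role of each agent is determined entirely by the system prompt passed in.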

In [1]:
from mypythonpackage import DialogueSimulator
from pathlib import Path



In [2]:
#simulator = DialogueSimulator(backend="hf", model_id="meta-llama/Llama-3.2-1B-Instruct")
simulator = DialogueSimulator(backend="mlx", model_id="mlx-community/Qwen2.5-7B-Instruct-1M-4bit")

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 91180.52it/s]


#### 2.1 Generating Interactions

The student agent has a default seed message: "*Hi, I'm a student seeking assistance with my studies*." However, a custom seed message can be provided to simulate a specific type of interaction or contextual scenario—for example, by specifying a particular task or subject.

In this simulation example, three different datasets are created—one for an English course, one for a Biology course, and one for learning Spanish. Each dataset consists of 30 unique student interactions, with each student receiving 1 of 5 different seed messages. This approach mimics how users might have varying use cases for the application or want to test hypotheses about interaction patterns across different educational contexts.

In [None]:
# Example of how to simulate one dialogue
seed_message = "Hi, I'm a student seeking assistance with my studies in English." # example input: the default seed message with a subject added

df_single = simulator.simulate_dialogue(
    mode="general_task_solving",
    turns=10,
    seed_message_input=seed_message
    #save_csv_path=Path("output/dialogue.csv")
)


--- Starting Dialogue Simulation ---

[Student]: Hi, I'm a student seeking assistance with my studies in English.


mx.metal.set_wired_limt is deprecated and will be removed in a future version. Use mx.set_wired_limit instead.
mx.metal.get_peak_memory is deprecated and will be removed in a future version. Use mx.get_peak_memory instead.
mx.metal.clear_cache is deprecated and will be removed in a future version. Use mx.clear_cache instead.


[Tutor]: Great! What specific help do you need with English?

Turn 2:
[Student]: Could you please give me an example of what you'd like assistance with?
[Tutor]: Sure! How about helping with grammar, writing an essay, or improving vocabulary? Let me know what you're working on!

Turn 3:
[Student]: I'm working on improving my essay writing. Could you give me an example of a well-structured essay?
[Tutor]: Certainly! Here's a simple 5-paragraph essay structure:

1. **Introduction**: Start with a hook, provide background info, and end with a clear thesis statement.
2. **Body Paragraph 1**: Topic sentence, supporting details, evidence, and examples.
3. **Body Paragraph 2**: Topic sentence, supporting details, evidence, and examples.
4. **Body Paragraph 3**: Topic sentence, supporting details, evidence, and examples.
5. **Conclusion**: Summarize main points and restate thesis.

Need help with a specific part?

Turn 4:
[Student]: Sure, I need help with the thesis statement. Can you give me a

Next, the simulation is repeated for 90 'students': 30 in each course, using course-specific seed messages.

In [None]:
n_students_per_course = 30 # kind of like in a classroom 

# Load YAML file with seed messages
with open("src/dialogue_generation/txt_llm_inputs/student_seed_messages.yaml", "r") as f:
    seed_messages = yaml.safe_load(f)

# Displaying examples of seed messages
for course_name, course_content in seed_messages.items():
    seeds = course_content.get("seeds", [])
    print(f"\nCourse: {course_name}")
    print(f"Number of seeds: {len(seeds)}")
    print("Seed messages:")
    for seed in seeds:
        print(seed)



Course: english_course
Number of seeds: 5
Seed messages:
Hi, I'm struggling with writing essays. Can you help me improve?
Hi, Can you explain how to use the passive voice in English?
Hi, I'm preparing for an English exam—what are common grammar mistakes to watch out for?
Hi, How can I expand my vocabulary effectively?
Hi, Can you help me analyze this short story for class?

Course: biology_course
Number of seeds: 5
Seed messages:
Hi, I'm having trouble understanding cell division. Can you explain mitosis vs meiosis?
Hi, Can you help me review the parts of a plant cell?
Hi, How does natural selection work in evolution?
Hi, I need help with a biology lab report on photosynthesis.
Hi, What’s the difference between DNA and RNA?

Course: spanish_course
Number of seeds: 5
Seed messages:
Hi, I’m learning Spanish and need help practising.
Hi, how do I use past tense verbs in Spanish?
Hi, can you help me practice a short dialogue in Spanish?
Hi, what are some tips for improving Spanish pronunc

In [15]:
course_dfs = {}

# Loop through each course and simulate dialogues
for course_name, course_content in seed_messages.items():
    seeds = course_content.get("seeds", [])
    n_seeds = len(seeds)
    all_course_dialogues = []

    for student_index in range(n_students_per_course):
        # Select seed message in round-robin fashion
        seed_message = seeds[student_index % n_seeds]

        df_single = simulator.simulate_dialogue(
            mode="general_task_solving",
            turns=8, # Should be adjusted if there is any assumption about the number of turns
            seed_message_input=seed_message
        )

        # Add student ID and course info
        df_single["student_id"] = f"{course_name}_student_{student_index+1}"
        df_single["course"] = course_name

        all_course_dialogues.append(df_single)

    # Combine into one dataframe per course
    course_df = pd.concat(all_course_dialogues, ignore_index=True)
    course_dfs[course_name] = course_df

# Access the dataframes
biology_df = course_dfs["biology_course"]
english_df = course_dfs["english_course"]
spanish_df = course_dfs["spanish_course"]



--- Starting Dialogue Simulation ---

[Student]: Hi, I'm struggling with writing essays. Can you help me improve?
[Tutor]: Absolutely! Let's start by focusing on structuring your essays: introduction, body paragraphs, and conclusion. Want to practice the introduction first?

Turn 2:
[Student]: Sure, let's start with the introduction. Could you give me an example of an essay topic so I can practice writing a good introduction?
[Tutor]: Sure! How about this topic: "The Impact of Social Media on Mental Health."

Good intro starts strong. Try this outline:
1. Grab attention with a surprising fact or question.
2. Briefly mention the topic.
3. State your thesis clearly.

Start with something like: "Did you know social media usage has doubled in the last decade? This growth raises serious concerns about its impact on mental health."

Turn 3:
[Student]: Got it! So for my introduction, I should start with a surprising fact or question, then mention the topic, and end with a clear thesis statem

In [21]:
# Show student messages that occur more than once within a course, ignoring the seed turn (turn 1)
duplicates = []
for course_name, course_df in course_dfs.items():
    duplicates_course = course_df[course_df.duplicated(subset=['student_msg'], keep=False) & (course_df['turn'] != 1)]
    duplicates.append(duplicates_course)

# make into one df
duplicates_df = pd.concat(duplicates, ignore_index=True)
duplicates_df


Unnamed: 0,turn,student_msg,tutor_msg,student_id,course
0,7,Great! Thanks for the clarification.,You're welcome! Glad to help.,english_course_student_3,english_course
1,6,Great! Thanks for the clarification.,You're welcome! Happy learning!,english_course_student_19,english_course
2,3,Could you give me another example to practice ...,"Sure! Try: ""Me llamo María.""",spanish_course_student_1,spanish_course
3,5,Could you give me another example to practice ...,"Of course! Try: ""Él es Juan.""",spanish_course_student_1,spanish_course


This indicates that each interaction between agents is almost entirely unique, despite the use of similar seed messages, as evidenced by the very few generic repetitions observed above.

In [16]:
# Saving the dataframes to CSV files
biology_df.to_csv("data/generated_dialogue_data/synth_biology_interaction.csv", index=False)
english_df.to_csv("data/generated_dialogue_data/synth_english_interaction.csv", index=False)
spanish_df.to_csv("data/generated_dialogue_data/synth_spanish_interaction.csv", index=False)

## 3. Wrap Interactions Directly
As an alternative to fully simulated interaction data or already collected data, this package also supports capturing real-time interactions directly through a terminal-based chat interface. As demonstrated below, only a few lines of code are required to launch an interactive session with a prompted agent. This results in a dataframe of interactions with the same structure as the synthetic ones above. Like the framework generator in Step 1.1, this functionality relies on an LM Studio-hosted model and server.

The code snippet shown below can be executed either from the terminal (bash) or within a notebook. However, for stability and to avoid potential conflicts with local settings, it is recommended to run this feature from the terminal.
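The core of such a terminal session is a loop that reads a message, requests a reply, and records both until the user types 'quit'. The following is a minimal sketch of that idea, not the package's `DialogueLogger`: `run_chat` and `save_rows` are hypothetical helpers, the column names are assumptions, and the reply function is a stub rather than a call to a model.

```python
import csv


def run_chat(read_input, get_reply, system_prompt):
    """Collect (turn, user_msg, tutor_msg) rows until the user types 'quit'."""
    messages = [{"role": "system", "content": system_prompt}]
    rows = []
    turn = 0
    while True:
        user_msg = read_input("You: ")
        if user_msg.strip().lower() == "quit":
            break
        turn += 1
        messages.append({"role": "user", "content": user_msg})
        tutor_msg = get_reply(messages)
        messages.append({"role": "assistant", "content": tutor_msg})
        rows.append({"turn": turn, "user_msg": user_msg, "tutor_msg": tutor_msg})
    return rows


def save_rows(rows, path):
    """Write the logged conversation to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["turn", "user_msg", "tutor_msg"])
        writer.writeheader()
        writer.writerows(rows)


# Scripted demo: canned inputs and a stub reply replace the terminal and the model.
scripted = iter(["hello i need help with english", "quit"])
demo_rows = run_chat(
    lambda _prompt: next(scripted),
    lambda _messages: "What do you need help with?",
    "You are an AI tutor.",
)
print(demo_rows)
```

In an actual terminal session, the built-in `input` would be passed as `read_input`, and `get_reply` would send the accumulated `messages` to the chat endpoint.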

In [22]:
from mypythonpackage import DialogueLogger

In [23]:
prompt = "You are an AI tutor for a student trying to solve a task. Your instructions are: " \
"Answer clearly and concisely, Check for understanding, Always be short and concise."

#### 3.1 Run from Notebook

In [25]:
logger = DialogueLogger(
    api_url="http://127.0.0.1:1234/v1/chat/completions",
    model_name="llama-3.2-3b-instruct",
    temperature=0.7,
    context=prompt
)

df = logger.start_dialogue(participant_id="test_user", output_path="data/logged_dialogue_data/")

Type 'quit' to exit and save the conversation.
Conversation saved to data/logged_dialogue_data/conversation_test_user.csv


#### 3.2 Run from Terminal

In [None]:
# this should be copied to the terminal and cannot be run in the notebook

## 4. Classification Time 

The next step is to train a relevant Hugging Face model (`distilbert-base-uncased` is used by default) on the synthetic framework generated in Steps 1.1 and 1.2.

The trained model is then applied to classify the previously unlabeled simulated interactions. For this proof of concept, we proceed with the three datasets created for different learning contexts: English, Biology, and Spanish. In this case, both student and tutor outputs are classified, allowing for downstream analyses across both roles. However, this can be explicitly configured if classification is only needed for one side of the conversation.
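One way to classify both sides of a conversation with a single text column is to reshape the dialogue so that each message becomes its own row. The sketch below shows this under stated assumptions: it uses the `student_msg` and `tutor_msg` columns from the simulated data, and `to_long_format` is a hypothetical helper that is not part of the package.

```python
def to_long_format(rows, text_columns=("student_msg", "tutor_msg")):
    """Turn one row per turn into one row per message, tagged with the speaker role."""
    long_rows = []
    for row in rows:
        for column in text_columns:
            long_rows.append(
                {
                    "student_id": row["student_id"],
                    "turn": row["turn"],
                    "role": column.replace("_msg", ""),
                    "text": row[column],
                }
            )
    return long_rows


# Illustrative row only.
dialogue = [
    {
        "student_id": "english_course_student_1",
        "turn": 1,
        "student_msg": "Hi, how can I expand my vocabulary effectively?",
        "tutor_msg": "Reading widely is a good start. What do you read now?",
    },
]
long_rows = to_long_format(dialogue)
for row in long_rows:
    print(row["role"], "->", row["text"])
```

Passing only `("student_msg",)` as `text_columns` restricts the reshaped data to one side of the conversation.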

If the user is interested in hyperparameter tuning, `tuning` should be set to `True`, and a grid of training parameters must be provided as a dictionary to `tuning_params`. If no parameters are specified, the model is trained using the following default settings: 

```
default_tuning_params = {
    "dropout_rate": 0.01,
    "loss_function": "cross_entropy",
    "learning_rate": 5e-5,
    "batch_size_train": 8,
    "batch_size_eval": 8,
    "num_epochs": 4,
    "weight_decay": 0.01
}
```

In [32]:
from mypythonpackage import PredictLabels

In [None]:
# loading the model and data for prediction
predictor = PredictLabels(model_name="distilbert-base-uncased")

annotaded_df = predictor.run_pipeline(
    train_data="data/generated_tuning_data/PACKAKGE_QUALDATA.csv", 
    new_data="data/logged_dialogue_data/conversation_test_user2.csv", 
    # column names in training data:
    text_column="text",
    label_column="category", 
    # column names in data to be predicted:
    new_text_column="user_msg", # the text to be predicted (TODO: support more than one text column)
    prediction_save_path="data/final_outputs/first_annotated.csv"
)

Max token: 44 Average token: 15.587025316455696


Map:   0%|          | 0/632 [00:00<?, ? examples/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map: 100%|██████████| 632/632 [00:00<00:00, 37655.02 examples/s]
Map: 100%|██████████| 158/158 [00:00<00:00, 31536.12 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.004372,1.0,1.0,1.0,1.0
2,No log,0.026333,0.993671,0.993914,0.993671,0.993702
3,No log,0.026607,0.993671,0.993914,0.993671,0.993702
4,No log,0.026102,0.993671,0.993914,0.993671,0.993702


Predicted data saved to data/final_outputs/first_annotated.csv


In [4]:
print(annotaded_df.head())

   turn                                           user_msg  \
0     1                     hello i need help with english   
1     2                i need to understand some sentences   
2     3  The complex houses married and single soldiers...   
3     4                                  do you understand   
4     5                        do you know other examples?   

                                           tutor_msg predicted_labels  
0  What specifically do you need help with in Eng...       Small Talk  
1  What are the sentences you're having trouble u...    Clarification  
2  This sentence is saying that a complex (likely...       Small Talk  
3  Yes, I understand. The sentence is describing ...    Clarification  
4  Here are a few more examples of complex housin...    Clarification  


Interim conclusion: the pipeline runs end to end, but the quality of the predicted labels is currently poor. Both the data and the framework need further work before the results are useful.

## 5. Descriptive Analytics
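As a starting point for this step, the share of each predicted label can be computed with a simple count. This is a minimal sketch rather than the package's reporting code: `label_shares` is a hypothetical helper, and the input list reuses the five predicted labels printed in section 4.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of all predictions, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(n / total, 2) for label, n in counts.most_common()}


# The five predicted labels shown in the output of section 4.
predicted = ["Small Talk", "Clarification", "Small Talk", "Clarification", "Clarification"]
shares = label_shares(predicted)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

Computing the same shares per course or per role would allow comparisons across learning contexts.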