# Phase 1: Dialogue Preprocessing

This notebook extracts text from the weekly dialogue documents (Word and PDF), normalizes speaker labels, and saves each dialogue as a structured list of turns in JSON.

## Steps:
1. Extract text from Word/PDF documents
2. Normalize speakers (learner/bot)
3. Generate turn lists
4. Save as JSON files (W1_T1.json, etc.)
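
As a rough illustration of the target format in step 4, each saved file holds a list of turn records. The keys `turn`, `speaker`, and `text` are the ones the later cells read back; the exact layout written by `save_dialogue_json` is an assumption here, and the second sample text is shortened.

```python
import json

# Two turns in the shape the later cells read back: turn number, speaker, text.
turns = [
    {"turn": 1, "speaker": "learner", "text": "Hi, I would like to order a coffee today!"},
    {"turn": 2, "speaker": "bot", "text": "Sure! What kind of coffee would you like today?"},
]

# ensure_ascii=False keeps curly quotes and other non-ASCII characters readable.
serialized = json.dumps(turns, ensure_ascii=False, indent=2)
print(serialized)
```

Keeping `speaker` restricted to `learner` and `bot` makes downstream filtering by speaker a simple equality test.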


## Setup and Imports


In [1]:
import sys
from pathlib import Path
import json

# Add scripts directory to path
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(project_root / 'scripts'))

from document_extractor import extract_text, save_extracted_text
from dialogue_parser import DialogueParser

print(f"Project root: {project_root}")


Project root: C:\Users\jasmi\OneDrive\Documents\GitHub\Designing-Data-Responsive-Language-Learning-Environments-through-Conversation-Practice-with-AI


## Step 1: Extract Text from Documents


In [2]:
# Define document paths
raw_data_dir = project_root / 'data' / 'raw'
extracted_text_dir = project_root / 'data' / 'extracted_text'
processed_dir = project_root / 'data' / 'processed'

# Document mapping
documents = {
    'week1': {
        'file': raw_data_dir / '#18. Week1.docx',
        'format': 'week1_week2'
    },
    'week2': {
        'file': raw_data_dir / '#12. Week2.docx',
        'format': 'week1_week2'
    },
    'week3': {
        'file': raw_data_dir / '#16. Week3.docx',
        'format': 'week3'
    },
    'week4': {
        'file': raw_data_dir / '#14. Week4.pdf',
        'format': 'week4'
    }
}

print("Documents to process:")
for week, info in documents.items():
    exists = info['file'].exists()
    print(f"  {week}: {info['file'].name} - {'✓' if exists else '✗'}")


Documents to process:
  week1: #18. Week1.docx - ✓
  week2: #12. Week2.docx - ✓
  week3: #16. Week3.docx - ✗
  week4: #14. Week4.pdf - ✓


In [3]:
# Extract text from all documents
from document_extractor import extract_text_with_colors_from_pdf

extracted_texts = {}
week4_color_data = None  # Color data for Week4; stays None if color extraction fails

for week, info in documents.items():
    if not info['file'].exists():
        print(f"Warning: {info['file']} not found, skipping...")
        continue

    print(f"\nExtracting text from {week}...")
    try:
        # Week4 is a PDF: also try to extract color information for the parser
        if week == 'week4':
            try:
                week4_color_data = extract_text_with_colors_from_pdf(str(info['file']))
                print("  Extracted text with color information")
            except Exception as e:
                print(f"  Warning: Could not extract colors, using plain text: {e}")

        # Plain text is extracted for every document, including Week4
        text = extract_text(str(info['file']))
        extracted_texts[week] = text

        # Save extracted text
        output_file = extracted_text_dir / f"{week}_extracted.txt"
        save_extracted_text(text, str(output_file))

        print(f"  Extracted {len(text)} characters")
        print(f"  Preview (first 200 chars): {text[:200]}...")
    except Exception as e:
        print(f"  Error extracting {week}: {e}")



Extracting text from week1...
Saved extracted text to: C:\Users\jasmi\OneDrive\Documents\GitHub\Designing-Data-Responsive-Language-Learning-Environments-through-Conversation-Practice-with-AI\data\extracted_text\week1_extracted.txt
  Extracted 21865 characters
  Preview (first 200 chars): Task one

You said:
“Hi, I would like to order a coffee today!”
English Conversational Partner said:
Sure! What kind of coffee would you like today — something like an espresso, latte, or cappuccino?
...

Extracting text from week2...
Saved extracted text to: C:\Users\jasmi\OneDrive\Documents\GitHub\Designing-Data-Responsive-Language-Learning-Environments-through-Conversation-Practice-with-AI\data\extracted_text\week2_extracted.txt
  Extracted 16568 characters
  Preview (first 200 chars): Week 2 – Task 1
You said：
“Some more, I'm visiting a medical clinic in person, and you will help me to book the service.”
00:14
English Conversational Partner said：
Absolutely, I'd be happy to help wi...

Extracting text from week4...
  Extracted text with color information
Saved extracted text to: C:\Users\jasmi\OneDrive\Documents\GitHub\Designing-Data-Responsive-Language-Learning-Environments-through-Conversation-Practice-with-AI\data\extracted_text\week4_extracted.txt
  Extracted 13806 characters
  Preview (first 200 chars): Week4
Task 1: Discussing Study or Career Plans with an Advisor
Hi, I want to study overseas in the future. Could you tell me how to prepare now?
Absolutely! Studying overseas is exciting. To get start...
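
The Week 2 preview above uses a full-width colon (`：`, U+FF1A) after the speaker labels and contains timestamp lines such as `00:14`, while Week 1 uses an ASCII colon. This matters for any parser that matches only the ASCII form. Below is a minimal sketch of a label-tolerant parser. `parse_said_format`, its helper, and the regular expressions are hypothetical and are not the project's `DialogueParser`; the sample text is an abbreviated stand-in modeled on the Week 2 preview.

```python
import re

# Speaker labels as they appear in the previews, mapped to normalized names.
# Both the ASCII colon and the full-width colon (U+FF1A) are accepted.
LABEL_RE = re.compile(r"^(You|English Conversational Partner) said\s*[:：]\s*$")
TIMESTAMP_RE = re.compile(r"^\d{1,2}:\d{2}$")  # e.g. "00:14"
SPEAKERS = {"You": "learner", "English Conversational Partner": "bot"}


def _add_turn(turns, speaker, buffer):
    """Append the buffered lines as one turn if there is anything to add."""
    if speaker and buffer:
        turns.append({"turn": len(turns) + 1, "speaker": speaker,
                      "text": " ".join(buffer)})


def parse_said_format(text):
    """Hypothetical parser: group lines under the most recent speaker label."""
    turns, speaker, buffer = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or TIMESTAMP_RE.match(line):
            continue  # skip blank lines and audio timestamps
        label = LABEL_RE.match(line)
        if label:
            _add_turn(turns, speaker, buffer)
            speaker, buffer = SPEAKERS[label.group(1)], []
        elif speaker:
            buffer.append(line)  # lines before the first label are ignored
    _add_turn(turns, speaker, buffer)
    return turns


sample = (
    "Week 2 – Task 1\n"
    "You said：\n"
    "“I'm visiting a medical clinic in person.”\n"
    "00:14\n"
    "English Conversational Partner said：\n"
    "Absolutely, I'd be happy to help.\n"
)
turns = parse_said_format(sample)
for turn in turns:
    print(turn)
```

Matching the label on its own line, rather than anywhere in the text, avoids splitting a turn when a speaker happens to say "you said" mid-sentence.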


## Step 2 & 3: Parse Dialogues and Generate Turn Lists


In [4]:
# Initialize parser
parser = DialogueParser()

# Process each week
all_dialogues = {}

for week, info in documents.items():
    if week not in extracted_texts:
        continue
    
    print(f"\n{'='*60}")
    print(f"Processing {week.upper()}")
    print(f"{'='*60}")
    
    text = extracted_texts[week]
    
    # Parse based on format
    if info['format'] == 'week1_week2':
        turns = parser.parse_week1_week2(text)
    elif info['format'] == 'week3':
        turns = parser.parse_week3(text)
    elif info['format'] == 'week4':
        # Use color data if available
        turns = parser.parse_week4_pdf(text, color_data=week4_color_data)
    else:
        print(f"Unknown format for {week}")
        continue
    
    print(f"\nParsed {len(turns)} turns")
    
    # Preview first few turns
    print("\nFirst 3 turns:")
    for turn in turns[:3]:
        print(f"  Turn {turn['turn']} ({turn['speaker']}): {turn['text'][:80]}...")
    
    all_dialogues[week] = turns



Processing WEEK1

Parsed 127 turns

First 3 turns:
  Turn 1 (learner): “Hi, I would like to order a coffee today!”...
  Turn 2 (bot): Sure! What kind of coffee would you like today — something like an espresso, lat...
  Turn 3 (learner): “Hi, could I please get a matcha latte?”...

Processing WEEK2

Parsed 0 turns

First 3 turns:

Processing WEEK4

Parsed 8 turns

First 3 turns:
  Turn 1 (bot): Week4Task 1: Discussing Study or Career Plans with an Advisor Hi, I want to stud...
  Turn 2 (bot): Do you have any recommendation for studying education in the UK?Sure! There are ...
  Turn 3 (bot): goals and celebrate those little achievements along the way. Also, exploring dif...
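
The output above shows two results worth flagging before anything is saved: Week 2 yields 0 turns, and the first three Week 4 turns are all labeled `bot`, with learner and bot text merged into single turns. A small sanity check along the following lines could surface both cases automatically. `check_turns` is a hypothetical helper, and the sample turn lists are stand-ins rather than the parser's real output.

```python
from collections import Counter

EXPECTED_SPEAKERS = {"learner", "bot"}


def check_turns(turns):
    """Return a list of human-readable problems found in a turn list."""
    if not turns:
        return ["no turns parsed"]
    problems = []
    counts = Counter(turn["speaker"] for turn in turns)
    unknown = set(counts) - EXPECTED_SPEAKERS
    if unknown:
        problems.append(f"unexpected speakers: {sorted(unknown)}")
    # A dialogue should contain at least one turn from each side.
    for speaker in sorted(EXPECTED_SPEAKERS - set(counts)):
        problems.append(f"no turns for speaker '{speaker}'")
    return problems


# Stand-in data that mimics the three situations seen in the output.
samples = {
    "week1": [{"turn": 1, "speaker": "learner", "text": "Hi"},
              {"turn": 2, "speaker": "bot", "text": "Sure!"}],
    "week2": [],
    "week4": [{"turn": 1, "speaker": "bot", "text": "..."}],
}
report = {week: check_turns(turns) for week, turns in samples.items()}
for week, problems in report.items():
    print(week, "OK" if not problems else problems)
```

Running such a check on `all_dialogues` before the save step would make a silent parsing failure visible instead of producing an incomplete set of JSON files.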


## Step 4: Split Dialogues into Tasks and Save as JSON

In [5]:
# Split dialogues into tasks and save
saved_files = []

for week, turns in all_dialogues.items():
    if not turns:
        print(f"No turns found for {week}, skipping...")
        continue
    
    week_num = week.replace('week', '')
    text = extracted_texts[week]
    info = documents[week]  # Get document format info
    
    # Try to split into tasks
    tasks = parser.split_into_tasks(text, int(week_num))
    
    if len(tasks) > 1:
        print(f"\n{week.upper()}: Found {len(tasks)} tasks")
        # Parse each task separately
        for task_idx, (task_name, task_text) in enumerate(tasks, 1):
            # Parse the task text
            if info['format'] == 'week1_week2':
                task_turns = parser.parse_week1_week2(task_text)
            elif info['format'] == 'week3':
                task_turns = parser.parse_week3(task_text)
            elif info['format'] == 'week4':
                task_turns = parser.parse_week4_pdf(task_text, color_data=week4_color_data)
            else:
                task_turns = []
            
            if task_turns:
                # Renumber turns starting from 1
                for i, turn in enumerate(task_turns, 1):
                    turn['turn'] = i
                
                output_file = processed_dir / f"W{week_num}_T{task_idx}.json"
                parser.save_dialogue_json(task_turns, str(output_file))
                saved_files.append(output_file)
                print(f"  Saved {output_file.name} with {len(task_turns)} turns")
    else:
        # Single task - save all turns
        output_file = processed_dir / f"W{week_num}_T1.json"
        parser.save_dialogue_json(turns, str(output_file))
        saved_files.append(output_file)
        print(f"\n{week.upper()}: Saved as single task {output_file.name}")

print(f"\n{'='*60}")
print(f"Summary: Saved {len(saved_files)} dialogue files")
print(f"{'='*60}")
for f in saved_files:
    print(f"  {f.name}")


Saved dialogue to: C:\Users\jasmi\OneDrive\Documents\GitHub\Designing-Data-Responsive-Language-Learning-Environments-through-Conversation-Practice-with-AI\data\processed\W1_T1.json (127 turns)

WEEK1: Saved as single task W1_T1.json
No turns found for week2, skipping...
Saved dialogue to: C:\Users\jasmi\OneDrive\Documents\GitHub\Designing-Data-Responsive-Language-Learning-Environments-through-Conversation-Practice-with-AI\data\processed\W4_T1.json (8 turns)

WEEK4: Saved as single task W4_T1.json

Summary: Saved 2 dialogue files
  W1_T1.json
  W4_T1.json
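
Both weeks were saved as a single task, even though the extracted previews contain task headings in several styles ("Task one", "Week 2 – Task 1", "Task 1: Discussing Study or Career Plans with an Advisor"). The sketch below shows one way to split on such headings. `split_on_task_headings` and its pattern are hypothetical and do not reflect how `parser.split_into_tasks` works; the "Task two" heading in the sample is assumed for illustration.

```python
import re

# Matches heading lines such as "Task one", "Week 2 – Task 1", or
# "Task 1: Discussing Study or Career Plans with an Advisor".
TASK_HEADING_RE = re.compile(
    r"^(?:Week\s*\d+\s*[–-]\s*)?Task\s+(?:\d+|one|two|three|four|five)\b.*$",
    re.IGNORECASE,
)


def split_on_task_headings(text):
    """Hypothetical splitter: return (heading, body) pairs, one per task."""
    tasks = []
    heading, body = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if TASK_HEADING_RE.match(stripped):
            if heading is not None:
                tasks.append((heading, "\n".join(body).strip()))
            heading, body = stripped, []
        elif heading is not None:
            body.append(line)  # text before the first heading is dropped
    if heading is not None:
        tasks.append((heading, "\n".join(body).strip()))
    return tasks


sample = "Task one\nYou said:\nHi\nTask two\nYou said:\nBye\n"
tasks = split_on_task_headings(sample)
for name, task_text in tasks:
    print(repr(name), repr(task_text))
```

The `(heading, body)` pairs have the same shape as the `(task_name, task_text)` tuples unpacked in the cell above, so each body could be passed to the week's parser unchanged.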


## Verification: Preview Generated JSON Files

In [6]:
# Load and preview one of the generated files
if saved_files:
    sample_file = saved_files[0]
    print(f"Preview of {sample_file.name}:")
    print("-" * 60)
    
    with open(sample_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    print(f"Total turns: {len(data)}")
    print(f"\nFirst 5 turns:")
    for turn in data[:5]:
        print(f"\nTurn {turn['turn']} - {turn['speaker']}:")
        print(f"  {turn['text']}")
    
    if len(data) > 5:
        print(f"\n... and {len(data) - 5} more turns")


Preview of W1_T1.json:
------------------------------------------------------------
Total turns: 127

First 5 turns:

Turn 1 - learner:
  “Hi, I would like to order a coffee today!”

Turn 2 - bot:
  Sure! What kind of coffee would you like today — something like an espresso, latte, or cappuccino?

Turn 3 - learner:
  “Hi, could I please get a matcha latte?”

Turn 4 - bot:
  Absolutely, a matcha latte sounds great! Would you like it hot or iced?

Turn 5 - learner:
  “HOTAS.”

... and 122 more turns
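
The visual preview can be complemented by a structural check on each saved file: turn numbers run consecutively from 1, every speaker is `learner` or `bot`, and no text is empty. The sketch below assumes the file is a JSON list of records with `turn`, `speaker`, and `text` keys, as read in the cell above. `validate_dialogue_file` is a hypothetical helper, demonstrated on a temporary file rather than on the real `data/processed` folder.

```python
import json
import tempfile
from pathlib import Path


def validate_dialogue_file(path):
    """Hypothetical check: return a list of problems found in one JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    problems = []
    for expected, turn in enumerate(data, 1):
        if turn.get("turn") != expected:
            problems.append(f"turn {expected}: numbered {turn.get('turn')}")
        if turn.get("speaker") not in {"learner", "bot"}:
            problems.append(f"turn {expected}: speaker {turn.get('speaker')!r}")
        if not str(turn.get("text", "")).strip():
            problems.append(f"turn {expected}: empty text")
    return problems


# Write a deliberately flawed sample file and validate it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "W1_T1.json"
    sample_path.write_text(json.dumps([
        {"turn": 1, "speaker": "learner", "text": "Hi"},
        {"turn": 3, "speaker": "bot", "text": ""},
    ]), encoding="utf-8")
    problems = validate_dialogue_file(sample_path)

print(problems)
```

Applied in a loop over `saved_files`, an empty result for every file would indicate that the saved dialogues are at least structurally consistent.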
