# Arnold Schönberg Correspondence Chatbot



Katharina Bleier, GenAI for Humanists 2025

Arnold Schönberg's correspondence with the music publishers Universal Edition and Verlag Dreililien comprises 1,400 letters. Dating from 1902 to 1951, the letters cover topics such as the production of music scores, public relations activities, performances, and general historical issues. Digital editions typically offer more than transcriptions and text-critical analyses of the edited sources: they also provide context through commentary and finding aids such as indexes and full-text search. The chatbot combines these tools to provide more flexible, user-friendly access to the letters.
The project consists of three components: data cleaning and chunking, retrieval-augmented generation (RAG), and a user interface (Streamlit).

digital edition: https://www.schoenberg-ue.at
chatbot: https://asletterbot-vqmzkltobryzn6ixr8xong.streamlit.app/

**Importing Packages**

In [None]:
import pandas as pd
import re
from pathlib import Path
import streamlit as st
import os
from llama_index.core import VectorStoreIndex, Document, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

**Preprocessing input data**

The data set consists of a CSV file containing letter IDs and letter texts. It was extracted from the XML/TEI files of the letters as part of a previous project and is processed further for this one.
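The extraction itself is not part of this notebook. Purely as an illustration of how a letter ID and plain text could be pulled out of a TEI file, here is a minimal sketch using the standard library. The sample document, the use of `xml:id` on the root element as letter ID, and the helper `letter_row` are assumptions made for the example, not the method of the previous project.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily simplified TEI letter (not taken from the edition).
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="L0001">
  <teiHeader><fileDesc><titleStmt><title>Brief</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Sehr geehrter Herr Direktor,</p><p>ich sende die Korrekturen.</p></body></text>
</TEI>"""

def letter_row(tei_xml: str) -> tuple[str, str]:
    """Return (letter_id, plain_text) for one TEI document."""
    root = ET.fromstring(tei_xml)
    letter_id = root.get(XML_ID)              # assumed location of the ID
    body = root.find(".//tei:body", TEI_NS)   # only the letter body, no header
    # Join all text nodes and normalise whitespace to single spaces.
    text = " ".join(" ".join(body.itertext()).split())
    return letter_id, text

row = letter_row(sample_tei)
print(row)
```

Each such pair would correspond to one row of the input CSV (letter ID, letter text).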

The cleaning process includes removing excess whitespace, fixing fragmented OCR words, applying German-specific corrections for common OCR errors, and normalizing line breaks and spacing.
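As a minimal illustration of these steps, the sketch below applies the same kinds of regular expressions to a made-up OCR fragment. The sample string, the one-entry `OCR_FIXES` table and the helper `clean` are invented for the example; the rules actually used are in `clean_text` in the code cell that follows.

```python
import re

# Hypothetical sample with typical OCR damage: runs of spaces,
# a word broken by a stray space, and hard line breaks.
sample = "Die   Sendung wurde ge meldet\nund\n\nverrechnet."

# Illustrative subset of word-level fixes (not the full list).
OCR_FIXES = {r"\bge meldet\b": "gemeldet"}

def clean(text: str) -> str:
    for pattern, replacement in OCR_FIXES.items():
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\n+", " ", text)  # line breaks become a single space
    text = re.sub(r" +", " ", text)   # collapse repeated spaces
    return text.strip()

cleaned = clean(sample)
print(cleaned)
```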

Chunking adapts the length of the text to the context window of the LLM. Letters longer than 800 characters (roughly 180 tokens) are split into chunks. The text is always split at the end of a sentence so that each chunk keeps its context. Metadata (e.g. "chunk_id") documents the relationship between chunks and original letters.
Output: CSV file
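To make the role of the chunk metadata concrete, here is a small sketch that splits a toy text at sentence ends, attaches IDs in the same `letterid_001` pattern, and then reassembles the letter from its chunks. It is a simplified variant of the splitting logic in `process_correspondence_csv` (it also counts the joining space), and the letter ID, the sample text and the limit of 40 characters are chosen only for the demonstration.

```python
import re

def split_at_sentences(text: str, max_chunk_size: int) -> list[str]:
    """Pack whole sentences into chunks. A chunk only exceeds
    max_chunk_size if a single sentence is longer than the limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

letter_id = "L0001"  # hypothetical ID
text = "Erster Satz. Zweiter Satz ist etwas länger! Dritter Satz? Vierter Satz."
pieces = split_at_sentences(text, max_chunk_size=40)

# One record per chunk; chunk_index preserves the original order.
records = [
    {"letter_id": letter_id, "chunk_id": f"{letter_id}_{i:03d}",
     "chunk_index": i, "text": piece}
    for i, piece in enumerate(pieces, 1)
]

# The metadata is enough to rebuild the letter from its chunks.
restored = " ".join(
    r["text"] for r in sorted(records, key=lambda r: r["chunk_index"])
)
print([r["chunk_id"] for r in records])
print(restored == text)
```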

(Code generated with claude.ai)

In [None]:
def process_correspondence_csv(input_file, output_file='schoenberg_letters_chunks.csv', max_chunk_size=800):
    """
    Process correspondence CSV file for RAG implementation
    
    Args:
        input_file: Path to input CSV file
        output_file: Path for output CSV file
        max_chunk_size: Maximum characters per chunk
    
    Returns:
        DataFrame with processed chunks
    """
    
    # Load and clean data
    print(f"Loading {input_file}...")
    df = pd.read_csv(input_file, header=None, names=['letter_id', 'text'])
    
    # Remove header row if exists
    if df.iloc[0]['letter_id'] == 'Letter ID':
        df = df.iloc[1:].reset_index(drop=True)
    
    # Clean data
    df = df.dropna(subset=['letter_id', 'text'])
    df = df.drop_duplicates(subset=['letter_id'], keep='first')
    print(f"Loaded {len(df)} letters")
    
    # Clean text
    def clean_text(text):
        if pd.isna(text):
            return text
        
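        # Collapse runs of three or more spaces
        text = re.sub(r' {3,}', ' ', text)
        
        # Fix fragmented words: join a single character to a following
        # lowercase fragment (note: this also joins genuine one-letter words)
        text = re.sub(r'\b(\w) ([a-z]{2,})\b', r'\1\2', text)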
        
        # Common German OCR fixes
        ocr_fixes = {
            r'\bge meldet\b': 'gemeldet',
            r'\bver rechnet\b': 'verrechnet', 
            r'\bDurch füh rung\b': 'Durchführung',
            r'\bEr ledigung\b': 'Erledigung'
        }
        
        for pattern, replacement in ocr_fixes.items():
            text = re.sub(pattern, replacement, text)
        
        # Clean spacing and line breaks
        text = re.sub(r'\n+', ' ', text)
        text = re.sub(r' +', ' ', text)
        return text.strip()
    
    df['text_cleaned'] = df['text'].apply(clean_text)
    
    # Create chunks
    print("Creating chunks...")
    chunks = []
    
    for _, row in df.iterrows():
        letter_id = row['letter_id']
        text = row['text_cleaned']
        
        if pd.isna(text) or len(text.strip()) == 0:
            continue
        
        if len(text) <= max_chunk_size:
            # Keep short letters as single chunk
            chunks.append({
                'letter_id': letter_id,
                'chunk_id': f"{letter_id}_001",
                'text': text,
                'chunk_type': 'full_letter',
                'chunk_index': 1,
                'total_chunks': 1
            })
        else:
            # Split long letters by sentences
            sentences = re.split(r'(?<=[.!?])\s+', text)
            current_chunk = ""
            chunk_index = 1
            letter_chunks = []
            
            for sentence in sentences:
                if len(current_chunk) + len(sentence) > max_chunk_size and current_chunk:
                    letter_chunks.append(current_chunk.strip())
                    current_chunk = sentence
                    chunk_index += 1
                else:
                    current_chunk += " " + sentence if current_chunk else sentence
            
            if current_chunk.strip():
                letter_chunks.append(current_chunk.strip())
            
            # Create chunk records
            for i, chunk_text in enumerate(letter_chunks, 1):
                chunks.append({
                    'letter_id': letter_id,
                    'chunk_id': f"{letter_id}_{i:03d}",
                    'text': chunk_text,
                    'chunk_type': 'partial_letter',
                    'chunk_index': i,
                    'total_chunks': len(letter_chunks)
                })
    
    # Create final DataFrame
    chunks_df = pd.DataFrame(chunks)
    chunks_df['char_count'] = chunks_df['text'].str.len()
    chunks_df['word_count'] = chunks_df['text'].str.split().str.len()
    
    # Save results
    chunks_df.to_csv(output_file, index=False)
    
    print("Processing complete!")
    print(f"  Total chunks: {len(chunks_df)}")
    print(f"  Average chunk length: {chunks_df['char_count'].mean():.0f} characters")
    print(f"  Saved to: {output_file}")
    
    return chunks_df

# Simple usage
if __name__ == "__main__":
    # Process the file
    result = process_correspondence_csv('letters_extract.csv')
    
    # Show sample
    print("\n=== SAMPLE CHUNKS ===")
    for i in range(min(3, len(result))):
        chunk = result.iloc[i]
        print(f"\nChunk {i+1}: {chunk['chunk_id']}")
        print(f"  Type: {chunk['chunk_type']}")
        print(f"  Length: {chunk['char_count']} chars")
        print(f"  Preview: {chunk['text'][:150]}...")

**Streamlit Application** 

The RAG system uses the preprocessed CSV file as its source of information, and users can choose from three OpenAI models for their queries. The chatbot provides answers in two formats: AI-generated continuous text, and IDs and quotes from the three most relevant letters. The former facilitates a general understanding of the facts by providing a summarised interpretation in natural language. The latter allows verification based on quotes and references to the specific source.
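Because the index stores chunks rather than whole letters, the three top-ranked hits can in principle come from the same letter. The sketch below shows one possible way to reduce scored chunks to the best hit per letter. `ScoredChunk` and `top_distinct_letters` are hypothetical stand-ins written for this example, and the IDs, texts and scores are invented; the app as written simply takes `response.source_nodes[:3]`.

```python
from dataclasses import dataclass

@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved chunk with its similarity score."""
    letter_id: str
    text: str
    score: float

def top_distinct_letters(chunks, limit=3):
    """Return the best-scoring chunk per letter, for at most `limit` letters."""
    best = {}
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        # Keep only the first (highest-scoring) chunk seen for each letter.
        best.setdefault(chunk.letter_id, chunk)
        if len(best) == limit:
            break
    return list(best.values())

# Invented retrieval result: two letters appear twice.
retrieved = [
    ScoredChunk("L0012", "chunk a", 0.83),
    ScoredChunk("L0012", "chunk b", 0.81),
    ScoredChunk("L0340", "chunk c", 0.79),
    ScoredChunk("L0007", "chunk d", 0.75),
    ScoredChunk("L0340", "chunk e", 0.70),
]

sources = top_distinct_letters(retrieved)
print([c.letter_id for c in sources])
```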

LlamaIndex is a framework that lets the AI model search the documents semantically. The program loads the CSV file, stores the rows in a list of documents, creates a searchable vector index, and returns both the index and the data.
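Conceptually, a vector index ranks the stored chunks by how similar their embedding vectors are to the embedding of the query, commonly using cosine similarity. The toy sketch below illustrates that idea in plain Python. The three-dimensional vectors are invented stand-ins, not OpenAI embeddings, and `retrieve` is a hypothetical helper rather than a LlamaIndex call.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" keyed by chunk ID; real embedding vectors
# have far more dimensions.
toy_index = {
    "L0001_001": [0.9, 0.1, 0.0],
    "L0002_001": [0.1, 0.8, 0.1],
    "L0003_001": [0.7, 0.3, 0.0],
}

def retrieve(query_vec, index, top_k=2):
    """Return the IDs of the top_k chunks most similar to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]),
                    reverse=True)
    return ranked[:top_k]

query = [1.0, 0.0, 0.0]  # stand-in for the embedded user question
hits = retrieve(query, toy_index)
print(hits)
```

In the app, `similarity_top_k=5` plays the role of `top_k` here.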

The bot was built with Streamlit. The code handles user input and output and the graphical features of the web interface, and it includes error handling. Three different models let the user choose between different performance and cost options.

(Code based on Renato Rocha Souza, Notebooks/LlamaIndex (https://github.com/rsouza/GenAI4Humanists/tree/main/Notebooks/LlamaIndex); Streamlit Documentation (https://docs.streamlit.io/get-started))

In [None]:
# Configure page
st.set_page_config(page_title="Arnold Schönberg Letter Chatbot", page_icon="📜", layout="wide")

# Title
st.title("📜 Arnold Schönberg Letter Chatbot")
st.write("This chatbot allows you to ask questions about the correspondence between Arnold Schönberg and his publishers Universal-Edition and Verlag Dreililien. A digital edition of these letters is available at www.schoenberg-ue.at. The chatbot is based on a file that contains letter IDs and letter text; metadata is not included. In addition to natural language interaction, the bot provides IDs and quotes from up to three relevant letters. Most of the letters are written in German.")
st.write("Ask questions about the letters")

# Sidebar for configuration
with st.sidebar:
    st.header("Setup")
    openai_api_key = st.text_input("OpenAI API Key", type="password")
    
    # Model selection
    st.subheader("Model Selection")
    model_options = {
        "GPT-4o Mini": "gpt-4o-mini", 
        "GPT-4o": "gpt-4o",
        "GPT-3.5 Turbo": "gpt-3.5-turbo"
    }
    selected_model = st.selectbox("Choose OpenAI Model:", list(model_options.keys()))
    
    if openai_api_key:
        os.environ["OPENAI_API_KEY"] = openai_api_key
        st.success("API Key set!")
    else:
        st.warning("Please enter your OpenAI API Key")

# Initialize session state
if "messages" not in st.session_state:
    st.session_state.messages = []

if "index" not in st.session_state:
    st.session_state.index = None

if "csv_data" not in st.session_state:
    st.session_state.csv_data = None

# Load and index CSV
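@st.cache_resource
def load_csv_and_create_index(_openai_api_key, selected_model):
    # Streamlit leaves underscore-prefixed arguments out of the cache key.
    # The API key stays unhashed; the model name is hashed so that a model
    # change triggers a rebuild instead of returning the cached result.
    if not _openai_api_key:
        return None, None
    
    try:
        # Configure LlamaIndex settings
        Settings.llm = OpenAI(model=model_options[selected_model], temperature=0.1)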
        Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
        
        # Load the CSV file
        csv_file = "schoenberg_letters_chunks.csv"
        if not os.path.exists(csv_file):
            st.error(f"CSV file '{csv_file}' not found. Please make sure it's in the same directory as this script.")
            return None, None
            
        df = pd.read_csv(csv_file)
        
        # Create documents from CSV rows
        documents = []
        for idx, row in df.iterrows():
            text_content = str(row['text'])
            letter_id = str(row['letter_id'])
            
                       
            # Create document with metadata
            doc = Document(
                text=text_content,
                metadata={
                    "letter_id": letter_id,
                    "chunk_id": str(row['chunk_id']),
                    "chunk_type": str(row['chunk_type']),
                    "chunk_index": int(row['chunk_index']),
                    "total_chunks": int(row['total_chunks']),
                    "char_count": int(row['char_count']),
                    "word_count": int(row['word_count']),
                    "row_index": idx
                }
            )
            documents.append(doc)
        
        # Create index
        index = VectorStoreIndex.from_documents(documents)
        
        return index, df
        
    except Exception as e:
        st.error(f"Error loading CSV: {str(e)}")
        return None, None

# Load index if API key is provided
if openai_api_key and (st.session_state.index is None or st.button("Reload with Selected Model")):
    with st.spinner(f"Loading and indexing your CSV with {selected_model}... This may take a moment."):
        st.session_state.index, st.session_state.csv_data = load_csv_and_create_index(openai_api_key, selected_model)
        if st.session_state.index:
            st.success(f"CSV loaded successfully with {selected_model}! You can now ask questions.")
            
  
# Create two columns for the main interface
col1, col2 = st.columns([2, 1])

with col1:
    st.subheader("Chat")
    
    # Display chat messages
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    # Chat input
    if prompt := st.chat_input("Stellen Sie eine Frage zu den Briefen... / Ask a question about the letters..."):
        if not openai_api_key:
            st.error("Please enter your OpenAI API Key in the sidebar.")
        elif st.session_state.index is None:
            st.error("Please wait for the CSV to finish loading.")
        else:
            # Add user message to chat
            st.session_state.messages.append({"role": "user", "content": prompt})
            with st.chat_message("user"):
                st.markdown(prompt)

            # Generate response
            with st.chat_message("assistant"):
                with st.spinner("Searching through the letters..."):
                    try:
                        # Create query engine with more sources for the second window
                        query_engine = st.session_state.index.as_query_engine(
                            similarity_top_k=5,
                            response_mode="compact"
                        )
                        
                        # Get response
                        response = query_engine.query(prompt)
                        
                        # Display response
                        st.markdown(str(response))
                        
                        # Add assistant response to chat
                        st.session_state.messages.append({"role": "assistant", "content": str(response)})
                        
                        # Store source information for the second window
                        if hasattr(response, 'source_nodes'):
                            st.session_state.last_sources = response.source_nodes[:3]  # Top 3 sources
                        
                    except Exception as e:
                        error_msg = f"Error: {str(e)}"
                        st.error(error_msg)
                        st.session_state.messages.append({"role": "assistant", "content": error_msg})

if st.button("Clear Chat History"):
    st.session_state.messages = []
    if hasattr(st.session_state, 'last_sources'):
        delattr(st.session_state, 'last_sources')
    st.rerun()
        
with col2:
    st.subheader("Source Letters")
    
    if hasattr(st.session_state, 'last_sources') and st.session_state.last_sources:
        st.write("**Top 3 relevant letters:**")
        
        for i, source_node in enumerate(st.session_state.last_sources, 1):
            with st.expander(f"Letter {i}: {source_node.metadata.get('letter_id', 'Unknown ID')}"):
                # Show letter ID
                st.write(f"**Letter ID:** {source_node.metadata.get('letter_id', 'Unknown')}")
                
                # Show similarity score if available
                if hasattr(source_node, 'score'):
                    st.write(f"**Relevance Score:** {source_node.score:.3f}")
                
                # Show excerpt from the letter
                st.write("**Excerpt:**")
                # Limit the text to avoid overwhelming the interface
                text_preview = source_node.text[:500] + "..." if len(source_node.text) > 500 else source_node.text
                st.write(text_preview)
    else:
        st.write("Ask a question to see relevant letter sources here.")

# Instructions in sidebar
with st.sidebar:
    st.header("Instructions")
    st.write("""
    1. Enter your OpenAI API Key above
    2. Choose your preferred OpenAI model
    3. Make sure 'schoenberg_letters_chunks.csv' is in the same folder as this script
    4. Wait for the CSV to load
    5. Ask questions in German or English!
    
    **Model Information:**
    - **GPT-4o**: Most capable, best for complex analysis
    - **GPT-4o Mini**: Good balance of capability and speed
    - **GPT-3.5 Turbo**: Fastest and most economical
    
    **Example questions (German):**
    - Was schreibt Schönberg über Notation?
    - Gibt es Polemik oder Humor in den Briefen?
    - Was steht in den Briefen über Pelleas?
    - Fasse die wichtigsten Passagen über Verträge zusammen
    
    **Example questions (English):**
    - Do the letters discuss performances of Pelleas?
    - Are there sarcastic passages in the letters?
    - Summarize the main legal issues
    """)
    
    
    # Show CSV info if loaded
    if st.session_state.csv_data is not None:
        st.subheader("CSV Information")
        st.write(f"**Rows:** {len(st.session_state.csv_data)}")
        st.write(f"**Columns:** {list(st.session_state.csv_data.columns)}")