# Getting started with CoAuthor

**Goal: Download and read the CoAuthor dataset**

Steps
1. Download CoAuthor
2. Read writing sessions
3. Examine events

The sections below adapt code from the CoAuthor Colab notebook, [linked here](https://colab.research.google.com/drive/1nUGXP9l_jelbB4X65J0ivUvLgQz1RK1C?usp=sharing).




## 1. Download CoAuthor


In [1]:
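# -nc skips the download if the zip is already present; -o overwrites extracted files without prompting on re-runs
!wget -nc https://cs.stanford.edu/~minalee/zip/chi2022-coauthor-v1.0.zip
!unzip -q -o chi2022-coauthor-v1.0.zip
!rm chi2022-coauthor-v1.0.zip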

--2025-03-02 22:21:42--  https://cs.stanford.edu/~minalee/zip/chi2022-coauthor-v1.0.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49956179 (48M) [application/zip]
Saving to: ‘chi2022-coauthor-v1.0.zip.1’


2025-03-02 22:21:54 (4.10 MB/s) - ‘chi2022-coauthor-v1.0.zip.1’ saved [49956179/49956179]

replace coauthor-v1.0/e0435f4cf6fc435c872ffc5b66b66b0c.jsonl? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
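
The `replace ...? [y]es, [n]o` prompt above appears when the archive is extracted a second time over existing files. As a rough alternative, extraction can be done from Python and skipped when the target folder already exists. This is a minimal sketch: `extract_once` is a hypothetical helper, not part of the CoAuthor release, and it assumes the zip has already been downloaded.

```python
import os
import zipfile


def extract_once(zip_path, dest_dir, expected_subdir):
    """Extract zip_path into dest_dir unless expected_subdir already exists there.

    Returns True if extraction ran and False if it was skipped.
    """
    target = os.path.join(dest_dir, expected_subdir)
    if os.path.isdir(target):
        return False  # already extracted on an earlier run
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)  # never prompts for confirmation
    return True
```

For this notebook the call would look like `extract_once('chi2022-coauthor-v1.0.zip', '.', 'coauthor-v1.0')`.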


In [1]:
import os

dataset_dir = './coauthor-v1.0'
paths = [
    os.path.join(dataset_dir, path)
    for path in os.listdir(dataset_dir)
    if path.endswith('.jsonl')
]

print(f'Successfully downloaded {len(paths)} writing sessions in CoAuthor!')

Successfully downloaded 1447 writing sessions in CoAuthor!


In [2]:
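!pip install transformers
!pip install sacremoses
!pip install numpy
!pip install nltk
!pip install unidecode
# Also imported later in this notebook
!pip install pandas scipy scikit-learn sentence-transformers requests tqdm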



## 2. Read writing sessions and workerID info


In [3]:
import json
import pandas as pd

def read_writing_session(path):
    events = []
    with open(path, 'r') as f:
        for event in f:
            events.append(json.loads(event))
    print(f'Successfully read {len(events)} events in a writing session from {path}')
    return events
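
Once a session is loaded, a quick way to examine its events is to count how often each `eventName` occurs. The sketch below uses a hypothetical three-event session written as JSON Lines; `summarize_events` and the sample data are illustrative assumptions, and real sessions carry more fields per event.

```python
import json
from collections import Counter


def summarize_events(events):
    """Count how often each eventName occurs in one writing session."""
    return Counter(event.get("eventName", "<missing>") for event in events)


# Hypothetical session, one JSON object per line (same layout as a .jsonl file).
sample_jsonl = "\n".join([
    json.dumps({"eventNum": 0, "eventName": "text-insert"}),
    json.dumps({"eventNum": 1, "eventName": "suggestion-get"}),
    json.dumps({"eventNum": 2, "eventName": "text-insert"}),
])
sample_events = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

event_counts = summarize_events(sample_events)
print(event_counts.most_common())
```

On real data the same call would be `summarize_events(read_writing_session(paths[0]))`.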



In [4]:
workerID = pd.read_csv('WorkerID and SessionID.csv')
workerID

| Unnamed: 0 | worker_id | session_id |
|---|---|---|
| 0 | A2QX3YJXAAHHVV | 36bc101319da4b3590b96bce76f7c02c |
| 1 | A394JO4NEPCY3M | 9aca14b9d4bc4e4b9b240782bd72c6db |
| 2 | A2QKAA5YS0P4CI | 499f9577962c4a2c98aee3f5b6098a71 |
| 3 | AZLZA0Q87TJZO | 7267ada18a784f5089b55427212251b3 |
| 4 | A2YTQDLACTLIBA | 1a14bae2ca9f422ea8779752233a203a |
| ... | ... | ... |
| 1440 | A2W121DQXNQK1 | 4c9727b380f34b5baa3d4407f8d91656 |
| 1441 | A1PTH9KTRO06EG | c25b2c082a184e678a841035032b5468 |
| 1442 | A1TW2BZRRS874Z | 761aaf53b31f4cfda836ca0802f22278 |
| 1443 | A2W121DQXNQK1 | c2395534dfb74c8ab70eec8f139ed2bc |


## 3. Getting User text and AI suggestions

In [5]:
from collections import defaultdict

def reconstruct_user_text(events):
    ''' Creates text from events by combining text-insert and text-delete events.'''
    user_texts = []
    current_text = ""

    # Sort events by eventNum
    sorted_events = sorted(events, key=lambda e: e["eventNum"])

    for event in sorted_events:
        event_type = event["eventName"]
        text_delta = event["textDelta"]

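        if event_type == "text-insert" and text_delta:
            extracted_text = ""
            for op in text_delta['ops']:
                # Skip non-string inserts (e.g. embedded objects); concatenating
                # one raises: can only concatenate str (not "dict") to str
                if isinstance(op.get('insert'), str):
                    extracted_text += op['insert']
            current_text += extracted_text

        elif event_type == "text-delete" and text_delta:
            # `retain` ops only move the cursor, so only `delete` lengths are summed
            delete_len = sum(op['delete'] for op in text_delta['ops'] if 'delete' in op)
            # Guard against delete_len == 0: current_text[:-0] would empty the buffer
            if delete_len:
                current_text = current_text[:-delete_len]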


        # Store finalized text if user completes a thought
        if event_type == "suggestion-get" or event_type == "suggestion-select":
            if current_text.strip():
                user_texts.append(current_text.strip())
                current_text = ""

    return user_texts
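
`reconstruct_user_text` appends every insert to the end of the buffer and trims every delete from the end, which matches a user typing at the end of the text. If `textDelta` follows the Quill Delta convention (an assumption here, under which `retain` means "skip this many characters"), edits in the middle of the text could be replayed positionally. A minimal sketch, with `apply_delta` as a hypothetical helper:

```python
def apply_delta(text, ops):
    """Apply Quill-style ops (retain / insert / delete) to text, left to right."""
    pieces = []
    cursor = 0  # position in the original text
    for op in ops:
        if "retain" in op:
            # Keep the next `retain` characters unchanged
            pieces.append(text[cursor:cursor + op["retain"]])
            cursor += op["retain"]
        elif "delete" in op:
            # Drop the next `delete` characters
            cursor += op["delete"]
        elif isinstance(op.get("insert"), str):
            # Insert new text at the current position (non-string inserts are skipped)
            pieces.append(op["insert"])
    pieces.append(text[cursor:])  # whatever follows the last op is kept
    return "".join(pieces)


with_comma = apply_delta("Hello world", [{"retain": 5}, {"insert": ","}])
without_comma = apply_delta(with_comma, [{"retain": 5}, {"delete": 1}])
```

Here `with_comma` is `"Hello, world"` and `without_comma` is `"Hello world"` again.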

In [6]:
def extract_selected_ai_suggestions(events):
    ai_suggestions = []  # Stores accepted AI-generated suggestions
    last_suggestion_open = None  # Store the most recent `suggestion-open` event

    for event in events:
        if event["eventName"] == "suggestion-open" and event.get("currentSuggestions"):
            last_suggestion_open = event  # Save latest `suggestion-open` event

        elif event["eventName"] == "suggestion-select":
            if last_suggestion_open and last_suggestion_open.get("currentSuggestions"):
                selected_index = event.get("currentSuggestionIndex", -1)

                # Ensure selected index is valid
                if 0 <= selected_index < len(last_suggestion_open["currentSuggestions"]):
                    selected_suggestion = last_suggestion_open["currentSuggestions"][selected_index]["trimmed"]
                    ai_suggestions.append(selected_suggestion)

    return ai_suggestions

## 4. Creating helper functions

The helper function below records whether each suggestion the user opened was accepted or rejected.

In [7]:
def extract_acceptance_status(events):
    """Track if a suggestion was accepted or rejected."""
    acceptance_status = []  # Stores 'accepted' or 'rejected'
    last_suggestion_open = None  # Store the most recent `suggestion-open`

    for event in events:
        if event["eventName"] == "suggestion-open" and event.get("currentSuggestions"):
            last_suggestion_open = event  # Save the latest `suggestion-open`

        elif event["eventName"] == "suggestion-select" and last_suggestion_open:
            # Suggestion selected
            acceptance_status.append('accepted')
            last_suggestion_open = None  # Reset, since we handled the acceptance

        elif event["eventName"] == "suggestion-close" and last_suggestion_open:
            # Suggestion closed without selection (rejected)
            acceptance_status.append('rejected')
            last_suggestion_open = None  # Reset, since we handled the rejection

    return acceptance_status
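
The accepted suggestions and the acceptance statuses are collected by two separate functions, so their lists can differ in length. One way to keep them aligned is to emit a single record per suggestion window in one pass. The sketch below is hypothetical (`pair_suggestions_with_status` and the sample events are assumptions); it only reuses the field names already seen above.

```python
def pair_suggestions_with_status(events):
    """Return one {'status', 'text'} record per opened suggestion window."""
    records = []
    open_event = None  # most recent unresolved `suggestion-open`
    for event in sorted(events, key=lambda e: e.get("eventNum", 0)):
        name = event.get("eventName")
        if name == "suggestion-open" and event.get("currentSuggestions"):
            open_event = event
        elif name == "suggestion-select" and open_event:
            options = open_event["currentSuggestions"]
            index = event.get("currentSuggestionIndex", -1)
            text = options[index]["trimmed"] if 0 <= index < len(options) else None
            records.append({"status": "accepted", "text": text})
            open_event = None
        elif name == "suggestion-close" and open_event:
            # Closed without a selection: no suggestion text was inserted
            records.append({"status": "rejected", "text": None})
            open_event = None
    return records


# Hypothetical events for illustration only
demo_events = [
    {"eventNum": 0, "eventName": "suggestion-open",
     "currentSuggestions": [{"trimmed": "a red door"}, {"trimmed": "a blue door"}]},
    {"eventNum": 1, "eventName": "suggestion-select", "currentSuggestionIndex": 1},
    {"eventNum": 2, "eventName": "suggestion-open",
     "currentSuggestions": [{"trimmed": "then it rained"}]},
    {"eventNum": 3, "eventName": "suggestion-close"},
]
records = pair_suggestions_with_status(demo_events)
```

Each record then carries its own status, so no positional `zip` across separate lists is needed.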

## 5. Tone Detection Functions

Here we use `j-hartmann/emotion-english-distilroberta-base`, a DistilRoBERTa-based emotion classifier, as our tone detector. The cell below downloads the model on first use.

In [8]:
from transformers import pipeline

# Load emotion classification model
emotion_classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True)

Device set to use mps:0


In [9]:
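import numpy as np

def get_emotion_vector(text):
    '''Returns the label-to-score dict and a score vector for a given text.'''
    result = emotion_classifier(text)[0]
    scores = {emotion['label']: emotion['score'] for emotion in result}
    # Order scores by label name so vectors from different texts line up
    # position by position, whatever order the pipeline returns them in.
    labels = sorted(scores)
    return scores, np.array([scores[label] for label in labels])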

In [10]:
from scipy.spatial.distance import jensenshannon
import numpy as np

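def similarity_jsd(vector1, vector2):
    '''Return 1 minus the Jensen-Shannon distance between two score vectors.

    `jensenshannon` returns the distance (the square root of the divergence)
    and uses the natural logarithm by default, so the result lies in
    [1 - sqrt(ln 2), 1], roughly [0.17, 1].
    '''
    jsd = jensenshannon(vector1, vector2)
    similarity = 1 - jsd
    return similarity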
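
For readers who want to see what the distance computes, here is a dependency-free sketch of the Jensen-Shannon distance with base-2 logarithms. With base 2 the distance is bounded by 1, so `1 - distance` spans the full range from 0 to 1. `js_distance` and `similarity_jsd_bounded` are hypothetical names, and switching the notebook to base 2 would change the reported similarity values.

```python
import math


def js_distance(p, q, base=2.0):
    """Jensen-Shannon distance: sqrt of the mean KL divergence to the midpoint."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]  # normalise to probability vectors
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution

    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))


def similarity_jsd_bounded(p, q):
    """1 - JS distance (base 2): 1 for identical vectors, 0 for disjoint ones."""
    return 1.0 - js_distance(p, q, base=2.0)


same = similarity_jsd_bounded([0.2, 0.8], [0.2, 0.8])
disjoint = similarity_jsd_bounded([1.0, 0.0], [0.0, 1.0])
```

Identical distributions score 1 and distributions with no shared mass score 0.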

In [13]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## 6. POS similarity functions

In [14]:
import spacy
from typing import List, Dict
from collections import Counter
import numpy as np

# Load the English language model
nlp = spacy.load("en_core_web_sm")

def get_pos_sequence(text):
    """Extracts the POS tag sequence from a given text."""
    doc = nlp(text)
    return [token.pos_ for token in doc]

def compare_pos_sequences(user_text, ai_text):
    """Compares POS tag sequences from user and AI text."""
    user_pos = get_pos_sequence(user_text)
    ai_pos = get_pos_sequence(ai_text)
    return {"user_pos": user_pos, "ai_pos": ai_pos}

def get_pos_frequencies(text):
    """Calculates the frequency of each POS tag in the given text."""
    pos_tags = get_pos_sequence(text)
    return dict(Counter(pos_tags))

def compare_pos_frequencies(user_text, ai_text):
    """Compares POS tag frequencies between user and AI text."""
    user_freq = get_pos_frequencies(user_text)
    ai_freq = get_pos_frequencies(ai_text)
    return {"user_freq": user_freq, "ai_freq": ai_freq}

In [14]:
pip install spacy


The function below computes the cosine similarity between the POS-tag frequency vectors of two texts.

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
def cosine_similarity_pos(pos1, pos2):
    """Computes cosine similarity between POS tag frequency vectors."""
    pos1_freq = Counter(pos1)
    pos2_freq = Counter(pos2)

    # Create a sorted list of all POS tags in both texts
    all_tags = sorted(set(pos1_freq.keys()) | set(pos2_freq.keys()))

    # Convert frequency dictionaries into vectors
    vec1 = np.array([pos1_freq.get(tag, 0) for tag in all_tags]).reshape(1, -1)
    vec2 = np.array([pos2_freq.get(tag, 0) for tag in all_tags]).reshape(1, -1)

    # Compute cosine similarity
    return cosine_similarity(vec1, vec2)[0][0]
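
To make the computation concrete, the same idea can be written without scikit-learn. The sketch below is an illustrative equivalent, not the notebook's implementation; `cosine_from_tags` is a hypothetical name and the tag lists are made up.

```python
import math
from collections import Counter


def cosine_from_tags(tags_a, tags_b):
    """Cosine similarity between the tag-frequency vectors of two tag lists."""
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    all_tags = sorted(set(freq_a) | set(freq_b))  # shared, ordered vocabulary
    vec_a = [freq_a.get(tag, 0) for tag in all_tags]
    vec_b = [freq_b.get(tag, 0) for tag in all_tags]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty tag list has no direction to compare
    return dot / (norm_a * norm_b)


# Vectors over [DET, NOUN, VERB] are [1, 1, 1] and [1, 2, 0]: 3 / sqrt(15)
pos_example = cosine_from_tags(["DET", "NOUN", "VERB"], ["DET", "NOUN", "NOUN"])
```

Because only tag counts enter the vectors, two texts with the same tags in a different order score 1.0.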

## 7. Compute Coherence, Creativity

In [16]:
# Function to compute coherence
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def compute_coherence(suggestion, context):
    embeddings = model.encode([suggestion, context])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # Convert tensor to float
    return similarity


In [17]:
# Adapted from Lu et al. 2024
import os
import nltk
import json
import time
import requests
import argparse
import numpy as np
from tqdm import tqdm
from typing import List, Callable
from dataclasses import dataclass
from unidecode import unidecode
from sacremoses import MosesDetokenizer
from transformers import AutoTokenizer

md = MosesDetokenizer(lang='en')
API_URL = 'https://api.infini-gram.io/'
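# Read the Hugging Face token from the environment instead of hard-coding a secret.
# Assumes HF_TOKEN is exported before the notebook starts; the Llama 2 repository is gated.
HF_TOKEN = os.environ.get("HF_TOKEN")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=HF_TOKEN,
                                          add_bos_token=False, add_eos_token=False)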
@dataclass
class Document:
    doc_id: str
    tokens: List[str]

@dataclass
class Span:
    start_index: int
    end_index: int
    span_text: str
    occurrence: int

class Hypothesis:
    def __init__(self, target_doc: Document, min_ngram: int) -> None:
        self.target_doc = target_doc
        self.min_ngram = min_ngram
        self.spans = []
        self.finished = False

    def add_span(self, new_span: Span) -> None:
        self.spans.append(new_span)
        if new_span.end_index >= len(self.target_doc.tokens):
            self.finished = True

    def replace_span(self, new_span: Span) -> None:
        self.spans = self.spans[:-1] + [new_span]
        if new_span.end_index >= len(self.target_doc.tokens):
            self.finished = True

    def get_score(self) -> float:
        if not self.spans:
            return 0.0
        progress_len = self.spans[-1].end_index if not self.finished else len(self.target_doc.tokens)
        flags = [False] * progress_len
        for span in self.spans:
            span_length = span.end_index - span.start_index
            flags[span.start_index:span.end_index] = [True] * span_length
        coverage = sum(flags) / len(flags)
        return coverage

    def format_span(self) -> str:
        return ' | '.join([s.span_text for s in self.spans])

    def __hash__(self) -> int:
        return hash(self.format_span())

    def __eq__(self, other) -> bool:
        if isinstance(other, Hypothesis):
            return self.format_span() == other.format_span()
        return NotImplemented

    def get_avg_span_len(self) -> float:
        if not self.spans:
            return 0.0
        span_lengths = [s.end_index - s.start_index for s in self.spans]
        return sum(span_lengths) / len(span_lengths)

    def export_json(self) -> dict:
        matched_spans = [{
            'start_index': s.start_index,
            'end_index': s.end_index,
            'span_text': s.span_text,
            'occurrence': s.occurrence
        } for s in self.spans]
        return {
            'matched_spans': matched_spans,
            'coverage': self.get_score(),
            'avg_span_len': self.get_avg_span_len(),
        }

def find_exact_match(detokenize: Callable, doc: Document, min_ngram: int) -> dict:
    hypothesis = Hypothesis(doc, min_ngram)
    first_pointer, second_pointer = 0, min_ngram
    while second_pointer <= len(doc.tokens):
        span_text = detokenize(doc.tokens[first_pointer:second_pointer])
        request_data = {
            'corpus': 'v4_rpj_llama_s4',
            'engine': 'c++',
            'query_type': 'count',
            'query': span_text,
        }
        # A timeout keeps one stalled request from hanging the whole run
        search_result = requests.post(API_URL, json=request_data, timeout=30).json()
        occurrence = search_result.get('count', 0)

        if occurrence:
            matched_span = Span(
                start_index=first_pointer,
                end_index=second_pointer,
                span_text=span_text,
                occurrence=occurrence
            )
            if not hypothesis.spans:
                hypothesis.add_span(matched_span)
            else:
                last_span = hypothesis.spans[-1]
                if matched_span.start_index <= last_span.start_index and last_span.end_index <= matched_span.end_index:
                    hypothesis.replace_span(matched_span)
                else:
                    hypothesis.add_span(matched_span)
            second_pointer += 1

            # print("***************************************************************************************************")
            # print(hypothesis.format_span())
            # print(f'score: {hypothesis.get_score():.4f}  avg_span_length: {hypothesis.get_avg_span_len()}')
            # print("***************************************************************************************************")
        else:
            if second_pointer - first_pointer > min_ngram:
                first_pointer += 1
            elif second_pointer - first_pointer == min_ngram:
                first_pointer += 1
                second_pointer += 1
            else:
                raise ValueError("Invalid state in span detection.")

    hypothesis.finished = True
    return hypothesis.export_json()

def process_text(text: str, min_ngram: int = 6, lm_tokenizer: bool = False) -> dict:
    # Choose the appropriate tokenizer/detokenizer.
    if not lm_tokenizer:
        tokenize_func = lambda x: nltk.tokenize.casual.casual_tokenize(x)
        detokenize = lambda tokens: md.detokenize(tokens)
    else:
        tokenize_func = lambda x: tokenizer.tokenize(x)
        detokenize = lambda tokens: tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens))

    # Preprocess and tokenize the plain text.
    processed_text = unidecode(text)
    tokens = tokenize_func(processed_text)

    if len(tokens) <= min_ngram:
        raise ValueError("Input text is too short for the specified min_ngram.")

    # Build a Document and perform the exact match search.
    doc = Document(doc_id="input_text", tokens=tokens)
    result = find_exact_match(detokenize, doc, min_ngram)
    return result

print(process_text("I'm the best in the world, laughing out loud.")['coverage'])


0.6363636363636364
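
The coverage value is the fraction of tokens that fall inside at least one matched span, as computed in `Hypothesis.get_score`. The arithmetic can be sketched on its own. The spans below are hypothetical and chosen only to illustrate the calculation; they are not the spans the API actually returned for the sentence above.

```python
def coverage(num_tokens, spans):
    """Fraction of token positions covered by at least one (start, end) span."""
    if num_tokens == 0:
        return 0.0
    flags = [False] * num_tokens
    for start, end in spans:
        for position in range(start, min(end, num_tokens)):
            flags[position] = True  # overlapping spans are counted once
    return sum(flags) / num_tokens


# Two overlapping 6-token spans over an 11-token text cover positions 0..6
example_coverage = coverage(11, [(0, 6), (1, 7)])  # 7 / 11
```

Overlaps do not inflate the score because each token position is flagged at most once.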


In [18]:
import spacy
from collections import deque

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def get_parse_tree_depth(sent):
    """
    Calculate parse tree depth for a sentence using a BFS
    starting from the root token.
    """
    roots = [token for token in sent if token.head == token]
    if not roots:
        return 0
    root = roots[0]
    max_depth = 0
    depths = {root: 0}
    queue = deque([root])

    while queue:
        node = queue.popleft()
        for child in node.children:
            depths[child] = depths[node] + 1
            max_depth = max(max_depth, depths[child])
            queue.append(child)
    return max_depth

def count_subordinate_clauses(doc):
    """
    Count subordinate clauses by looking for tokens with dependency labels
    typically marking clause boundaries:
      - 'mark' (subordinating conjunctions),
      - 'advcl' (adverbial clauses), and
      - 'ccomp' (clausal complements).
    """
    count = 0
    for token in doc:
        if token.dep_ in {"mark", "advcl", "ccomp"}:
            count += 1
    return count

def count_passive_sentences(doc):
    """
    Identify and count passive sentences. A sentence is considered passive
    if it contains any token with the dependency 'nsubjpass'.
    """
    passive_count = 0
    for sent in doc.sents:
        if any(token.dep_ == "nsubjpass" for token in sent):
            passive_count += 1
    return passive_count

def syntactic_complexity(text, d_max=10, max_subordinate_per_sentence=2):
    """
    Calculate a normalized syntactic complexity score in the range [0,1].
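    The score is heuristic and rule-based: the normalization constants
    (d_max, max_subordinate_per_sentence) are hand-chosen rather than learned.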
    This function computes three components:
      1. Depth Component: Average parse tree depth normalized by d_max.
         The average is capped at d_max to avoid outlier effects.
      2. Subordinate Clause Component: The average number of subordinate clauses
         per sentence normalized by an assumed maximum (default=2).
      3. Passive Voice Component: The fraction of sentences exhibiting passive voice.

    The final score is the average of these three components.
    """
    doc = nlp(text)
    sentences = list(doc.sents)
    num_sentences = len(sentences)

    if num_sentences == 0:
        return 0.0

    # Depth Component
    total_depth = sum(get_parse_tree_depth(sent) for sent in sentences)
    avg_depth = total_depth / num_sentences
    depth_component = min(avg_depth, d_max) / d_max  # normalize to [0,1]

    # Subordinate Clause Component
    num_subordinate = count_subordinate_clauses(doc)
    subordinate_component = (num_subordinate / num_sentences) / max_subordinate_per_sentence
    subordinate_component = min(subordinate_component, 1.0)

    # Passive Voice Component
    passive_sentences = count_passive_sentences(doc)
    passive_component = passive_sentences / num_sentences  # already in [0,1]

    # Combine components equally to yield a final score between 0 and 1
    normalized_score = (depth_component + subordinate_component + passive_component) / 3.0
    return normalized_score

text = (
    "The old mansion, which had been abandoned for years, stood silent. Its walls, marked by time and neglect, told stories as if they were whispering. The garden was overgrown, and nature had reclaimed it, making it look like a scene from a forgotten fairy tale."
)

score = syntactic_complexity(text)
print("Normalized Syntactic Complexity Score:", score)


Normalized Syntactic Complexity Score: 0.5777777777777777
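
The final score is a plain average of three capped components, so the arithmetic can be checked by hand without spaCy. The sketch below uses hypothetical component values, not the ones spaCy produced for the example text; `combine_components` is an illustrative helper.

```python
def combine_components(avg_depth, subordinate_per_sentence, passive_fraction,
                       d_max=10, max_subordinate_per_sentence=2):
    """Average the depth, subordinate-clause and passive components."""
    depth_component = min(avg_depth, d_max) / d_max
    subordinate_component = min(
        subordinate_per_sentence / max_subordinate_per_sentence, 1.0
    )
    return (depth_component + subordinate_component + passive_fraction) / 3.0


# Hypothetical inputs: depth 5 -> 0.5, 0.5 clauses per sentence -> 0.25, half passive -> 0.5
hand_score = combine_components(avg_depth=5.0, subordinate_per_sentence=0.5,
                                passive_fraction=0.5)
```

With these inputs the score is (0.5 + 0.25 + 0.5) / 3, a little under 0.42.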


## 8. Create Dataframe
This step builds a dataframe with the user text, the AI suggestion, the worker ID, the metrics computed above, and the acceptance status for each pair.

In [26]:
jsd_results = []

for path in paths:
    events = read_writing_session(path)

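    # Skip very long sessions (3000 or more events)
    if len(events) < 3000:
        user_texts = reconstruct_user_text(events)
        ai_suggestions = extract_selected_ai_suggestions(events)
        acceptance_status = extract_acceptance_status(events)

        # The session ID is the file name without its extension
        path_key = os.path.splitext(os.path.basename(path))[0]
        matches = workerID.loc[workerID["session_id"] == path_key, "worker_id"].values
        if len(matches) == 0:
            # Some sessions may have no row in the CSV; skip them instead of raising IndexError
            continue
        worker_id = matches[0]

        # zip pairs the three lists by position and stops at the shortest one.
        # They are built independently, so position i in one list is not
        # guaranteed to refer to the same suggestion in the others.
        for user_text, ai_text, status in zip(user_texts, ai_suggestions, acceptance_status):
            # Skip pairs where either text has six or fewer words
            if len(ai_text.split()) <= 6 or len(user_text.split()) <= 6:
                continue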
            user_score, user_vector = get_emotion_vector(user_text)
            ai_score, ai_vector = get_emotion_vector(ai_text)

            jsd_similarity = similarity_jsd(user_vector, ai_vector)
            user_pos = get_pos_sequence(user_text)
            ai_pos = get_pos_sequence(ai_text)
            pos_similarity = cosine_similarity_pos(user_pos, ai_pos)

            # Compute coherence
            coherence_score = compute_coherence(ai_text, user_text)


            # Compute Syntactical Complexity and Coverage

            user_syn = syntactic_complexity(user_text)
            ai_syn = syntactic_complexity(ai_text)
            ai_coverage = process_text(ai_text)['coverage']
            # Store results
            jsd_results.append({
                "path": path,
                "workerID": worker_id,
                "user_text": user_text,
                "ai_suggestion": ai_text,
                "tone_similarity": jsd_similarity,
                "pos_similarity": pos_similarity,
                "coherence_score": coherence_score,
                "user_score": user_syn,
                "ai_score": ai_syn,
                "ai_coverage": ai_coverage,
                "acceptance_status": status
            })


Successfully read 3045 events in a writing session from ./coauthor-v1.0/e0435f4cf6fc435c872ffc5b66b66b0c.jsonl
Successfully read 1989 events in a writing session from ./coauthor-v1.0/74517c6eb89c46fab708de3e3d7c53db.jsonl
Successfully read 2008 events in a writing session from ./coauthor-v1.0/91e0856327a04acb8f366faea281f072.jsonl
Successfully read 1576 events in a writing session from ./coauthor-v1.0/ba931f5050e7409ebba26e00d532cc7c.jsonl
Successfully read 2115 events in a writing session from ./coauthor-v1.0/128bac319cb14dadaaf5c30c3a8ac5cb.jsonl
Successfully read 1942 events in a writing session from ./coauthor-v1.0/fa68c16d925c4ec08cbb9e393982aca9.jsonl
Successfully read 2682 events in a writing session from ./coauthor-v1.0/02a0a1349a8045bd969dfc3948ec4796.jsonl
Successfully read 1445 events in a writing session from ./coauthor-v1.0/3b6e7a4c9d65411f9342254120abcde7.jsonl
Successfully read 1968 events in a writing session from ./coauthor-v1.0/3969517100834dbc813b628a27fed790.jsonl
S

TypeError: can only concatenate str (not "dict") to str

In [27]:
# Create dataframe
df_results = pd.DataFrame(jsd_results)
df_results.to_csv("all_metrics.csv", index=False)

In [24]:
import pandas as pd

# Load the existing "main" file
df_main = pd.read_csv("new.csv")

# Load the file that needs to be merged into the main file
df_other = pd.read_csv("merged.csv")

# Filter out rows in df_other where 'path' is already present in df_main
df_other_filtered = df_other[~df_other['path'].isin(df_main['path'])]

# Concatenate the new rows to the main dataframe
df_merged = pd.concat([df_main, df_other_filtered], ignore_index=True)

# Save the merged result to a new CSV (or overwrite the original if desired)
df_merged.to_csv("merged2.csv", index=False)
