In [2]:
%pip install datasets tqdm python-dotenv google-api-core --quiet
%pip install -U --quiet "google-genai>=1.10.0"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [6]:
import os
import json
import time
import concurrent.futures
import threading  # Lock for file writes, Event for the shutdown flag
from tqdm.notebook import tqdm
from google import genai
from google.genai import types
from dotenv import load_dotenv
from datasets import load_dataset
from google.api_core import exceptions as core_exceptions

In [23]:
# --- Configuration ---
load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")
MODEL_ID = "gemini-2.5-pro-preview-03-25"
OUTPUT_FILE = "./data/val_reasoning.jsonl"
MAX_CALLS_PER_RUN = 500

# --- Concurrency Configuration ---
# Adjust based on testing and RPM limits (max 1000, start lower)
MAX_WORKERS = 5 # Number of concurrent API requests
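
The comment above mentions staying under a requests-per-minute limit. One way to do that is to space out the start of each request. The following is a minimal sketch, not part of the original notebook: `MinIntervalLimiter` is a hypothetical helper, and the `clock` and `sleep` parameters exist only so the behaviour can be checked without real waiting.

```python
import threading
import time


class MinIntervalLimiter:
    """Spaces out calls so that at most `rpm` calls start per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm      # seconds between consecutive start slots
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()       # earliest time the next call may start

    def wait(self):
        """Reserve the next start slot, sleep until it arrives, return the delay."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)          # sleep outside the lock so other threads can reserve slots
        return delay
```

Under this assumption, each worker would call `limiter.wait()` on a shared instance just before `client.models.generate_content(...)`, so `MAX_WORKERS` controls parallelism while the limiter controls the request rate.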


In [33]:
sys_prompt = """
You are an AI language expert simulating the process of text editing. You will receive a source text (`src`) which contains an editing instruction followed by the text to be edited.

Your task is to:
1.  Carefully identify the editing instruction given at the beginning of the `src` text.
2.  Analyze the original text segment that follows the instruction in `src`.
3.  Apply the instruction step-by-step to the original text.
4.  For each step or change you make, articulate the precise reasoning based on linguistic rules, the instruction's goal (e.g., grammar, clarity, simplification), or common writing conventions. Explain *why* a change is necessary or beneficial according to the instruction.
5.  Your reasoning should demonstrate *how* to arrive at the corrected text, simulating the thought process of performing the edit yourself.

**Crucially:** Your reasoning process should *not* refer to or imply knowledge of a pre-existing target or corrected text. Focus *only* on applying the instruction to the source text to derive the necessary changes.

Your entire output must consist *only* of the step-by-step reasoning process, enclosed within `<think></think>` tags. Do not include the original instruction, the source text, the final corrected text, or any other text outside these tags.

**Example Interaction:**

**User Input (constructed from your dataset):**

Source (src): Fix grammar in this sentence: If engineers do not come up with new ideas, they cannot find best solution for the problems.
Target (tgt): If engineers do not come up with new ideas, they cannot find the best solution for different problems.

**Your Expected Output (demonstrating the reasoning to *derive* the correction):**

<think>
1. Instruction analysis: The task is to "Fix grammar". This requires identifying and correcting grammatical errors in the sentence "If engineers do not come up with new ideas, they cannot find best solution for the problems."
2. Sentence segment analysis: The first clause "If engineers do not come up with new ideas" appears grammatically sound.
3. Focus on the second clause: "they cannot find best solution for the problems."
4. Identify potential error 1: The noun phrase "best solution". "Best" is a superlative adjective. Grammatical rule: Superlative adjectives modifying singular countable nouns generally require a definite article ("the").
5. Apply correction 1: Insert "the" before "best solution". The phrase becomes "the best solution".
6. Identify potential error/improvement area 2: The phrase "for the problems". While grammatically plausible, using "the" implies specific, previously identified problems. The context seems general. Improving grammatical flow or clarity might involve adjusting this.
7. Consider alternatives for "the problems": "for problems" (too general?), "for specific problems" (adds info), "for different problems" (suggests variety, often fits general statements well).
8. Apply correction 2: Replace "the problems" with "different problems" for better contextual fit and naturalness, aligning with the goal of general grammatical improvement and clarity often included in "Fix grammar" tasks.
9. Synthesized correction based on reasoning: The corrected clause is "they cannot find the best solution for different problems." The full sentence derived from applying the grammar fixes is "If engineers do not come up with new ideas, they cannot find the best solution for different problems."
</think>
"""


In [34]:
# --- Main Processing Logic ---

validation_ds = None  # Stays None if setup fails, so the processing cell is skipped

if not gemini_api_key:
    print("Error: GEMINI_API_KEY not found in environment variables.")
else:
    # Initialize client
    client = genai.Client(api_key=gemini_api_key)

    # Load dataset
    print("Loading dataset...")
    try:
        # The split is loaded without shuffling, so items are processed in a consistent order across runs
        validation_ds = load_dataset("grammarly/coedit", split="validation", cache_dir="./data")
        print(f"Dataset loaded: {len(validation_ds)} rows")
    except Exception as e:
        print(f"Fatal Error: Could not load dataset. {e}")

Loading dataset...
Dataset loaded: 1712 rows


In [None]:
# Global flag for controlled shutdown
shutdown_requested = threading.Event()

# Custom exception for controlled shutdown
class APIShutdownException(Exception):
    """Exception raised to trigger a controlled shutdown of API processing."""
    pass

# --- Helper Function to Load Processed IDs ---

def load_processed_ids(filename):
    processed = set()
    if os.path.exists(filename):
        with open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    data = json.loads(line)
                    if '_id' in data:
                        processed.add(data['_id'])
                except json.JSONDecodeError:
                    print(f"Warning: Skipping invalid JSON line in {filename}")
    print(f"Loaded {len(processed)} processed IDs from {filename}")
    return processed

# --- Worker Function ---

def process_item(item, client, sys_prompt):
    """Processes a single dataset item to get reasoning."""
    item_id = item['_id']
    item_src = item['src']
    item_tgt = item['tgt']
    user_prompt = f"Source (src): {item_src}\nTarget (tgt): {item_tgt}"

    try:
        # --- API Call ---
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=user_prompt,
            config=types.GenerateContentConfig(
                system_instruction=sys_prompt,
                thinking_config=types.ThinkingConfig(
                    # Token budget for thinking: 0 means no thinking;
                    # None means no limit (the model may use as many tokens as it wants).
                    thinking_budget=600
                ),
            ),
        )

        # --- Process Response ---
        reasoning_text = response.text or ""  # Guard against a response with no text part

        # Remove anything before <think> and after </think>
        start_tag = "<think>"
        end_tag = "</think>"
        start_idx = reasoning_text.find(start_tag)
        # find() returns -1 on a miss, so look for the closing tag only after the
        # opening tag and add the tag length only once the match is confirmed.
        end_idx = reasoning_text.find(end_tag, start_idx + len(start_tag)) if start_idx != -1 else -1

        if start_idx != -1 and end_idx != -1:
            reasoning_text = reasoning_text[start_idx:end_idx + len(end_tag)]
            return {"_id": item_id, "reasoning": reasoning_text.strip()}
        else:
            print(f"\nWarning: Unexpected response format for ID {item_id}. Skipping. Response: {reasoning_text[:200]}...")
            return {"_id": item_id, "error": "Bad Format"}  # Return error indicator
    except Exception as e:
        if hasattr(e, 'code') and 400 <= e.code < 600:
            error_msg = f"Critical HTTP error {e.code}: {e}"
            print(f"\n{error_msg}")
            # Signal all threads to stop
            shutdown_requested.set()
            # Propagate the exception to interrupt processing
            raise APIShutdownException(error_msg)
        
        # --- Handle API or Processing Errors ---
        print(f"\nError processing ID {item_id}: {e}")
        return {"_id": item_id, "error": str(e)} # Return error indicator
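
The tag-stripping step in `process_item` can also be factored into a small function that is easy to test on its own. This is a minimal sketch: `extract_think_block` is a hypothetical helper name, not something the notebook defines. Because `str.find` returns -1 on a miss, each index is checked before the tag length is added.

```python
def extract_think_block(text, start_tag="<think>", end_tag="</think>"):
    """Return the first <think>...</think> span (tags included), or None."""
    if not text:
        return None
    start_idx = text.find(start_tag)
    if start_idx == -1:
        return None
    # Look for the closing tag only after the opening tag.
    end_idx = text.find(end_tag, start_idx + len(start_tag))
    if end_idx == -1:
        return None
    return text[start_idx:end_idx + len(end_tag)].strip()
```

A response wrapped in a Markdown fence, such as "```xml <think>...</think>```", is reduced to the tagged span, while a response with a missing closing tag yields `None` and can be reported as a bad format.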


In [None]:

# --- Main processing section ---
if validation_ds:
    processed_ids = load_processed_ids(OUTPUT_FILE)
    api_calls_made_this_run = 0
    file_lock = threading.Lock()
    items_to_process = []

    
    # --- Pre-filter items that need processing ---
    print("Filtering items to process...")
    for item in tqdm(validation_ds, desc="Filtering"):
        if item['_id'] not in processed_ids:
            items_to_process.append(item)
    print(f"Found {len(items_to_process)} items needing processing.")

    # Process items concurrently
    with open(OUTPUT_FILE, 'a', encoding='utf-8') as outfile:
        try:
            with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = {}
                
                # Submit tasks only up to the per-run limit (MAX_CALLS_PER_RUN)
                num_to_submit = min(len(items_to_process), MAX_CALLS_PER_RUN - api_calls_made_this_run)
                print(f"Submitting {num_to_submit} tasks to the executor...")

                submitted_count = 0
                for i in range(len(items_to_process)):
                    if submitted_count >= num_to_submit or shutdown_requested.is_set():
                        break  # Stop submitting if we hit the limit or shutdown requested
                    item = items_to_process[i]
                    future = executor.submit(process_item, item, client, sys_prompt)
                    futures[future] = item['_id']
                    submitted_count += 1

                print(f"Tasks submitted ({submitted_count}). Waiting for results...")

                # Process results as they complete
                completed_futures = []

                for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing Results"):
                    if shutdown_requested.is_set():
                        print("\nShutdown requested. Canceling remaining tasks...")
                        for f in futures:
                            if f not in completed_futures and not f.done():
                                f.cancel()
                        break
        
                    completed_futures.append(future)
                    item_id = futures[future]       
                    
                    try:
                        result_data = future.result()
                        
                        if result_data and "error" not in result_data:
                            with file_lock:
                                if result_data['_id'] not in processed_ids:
                                    json_line = json.dumps(result_data)
                                    outfile.write(json_line + '\n')
                                    outfile.flush()
                                    processed_ids.add(result_data['_id'])
                                    api_calls_made_this_run += 1
                                    
                    except APIShutdownException:
                        print(f"Task for item {item_id} aborted due to shutdown request")
                    except Exception as exc:
                        print(f'\nItem ID {item_id} generated an exception: {exc}')
                        
        except KeyboardInterrupt:
            print("\nKeyboard interrupt detected. Shutting down gracefully...")
            shutdown_requested.set()
            
    # Final status report
    if shutdown_requested.is_set():
        print("\nProcessing terminated early due to errors or interruption.")
    else:
        print(f"\nFinished processing batch. Made approximately {api_calls_made_this_run} successful API calls in this run.")
    
    print(f"Total processed IDs (including previous runs): {len(processed_ids)}")

Loaded 58 processed IDs from ./data/val_reasoning.jsonl
Filtering items to process...


Filtering:   0%|          | 0/1712 [00:00<?, ?it/s]

Found 1654 items needing processing.
Submitting 500 tasks to the executor...
Tasks submitted (500). Waiting for results...


Processing Results:   0%|          | 0/500 [00:00<?, ?it/s]


<think>
1.  **Instruction Analysis:** The instruction is "Rewrite this sentence." This is a general instruction, implying that changes can be made for clarity, conciseness, emphasis, or style, ...

<think>
1.  **Instruction Analysis:** The instruction is "Paraphrase the sentence." This means I need to reword the original sentence to express the same meaning but using different vocabulary ...

Keyboard interrupt detected. Shutting down gracefully...

Processing terminated early due to errors or interruption.
Total processed IDs (including previous runs): 64
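
`Future.cancel()` only stops tasks that have not started yet; a call that is already running cannot be interrupted. One option, sketched here under stated assumptions, is for each worker to check the shared event before doing any work, so queued tasks return immediately once a shutdown is requested. `fake_worker` and `run_batch` are hypothetical stand-ins for the API call and the submission loop, and `max_workers=1` is used only to make the order deterministic.

```python
import concurrent.futures
import threading

stop_event = threading.Event()


def fake_worker(item_id):
    """Stand-in for an API call; item 2 simulates a fatal error."""
    if stop_event.is_set():
        # Shutdown already requested: skip the (expensive) call entirely.
        return {"_id": item_id, "error": "skipped after shutdown"}
    if item_id == 2:
        stop_event.set()  # Signal every later task to stop early
        return {"_id": item_id, "error": "fatal"}
    return {"_id": item_id, "reasoning": f"<think>{item_id}</think>"}


def run_batch(item_ids, max_workers=1):
    """Run the workers and collect results in submission order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_worker, i) for i in item_ids]
        return [f.result() for f in futures]


results = run_batch([0, 1, 2, 3, 4])
```

With more workers, a few tasks that were already past the check may still complete, which is why the result loop also needs to tolerate late results.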



<think>
1.  **Instruction Analysis:** The instruction is "Rewrite this sentence." This implies making the sentence sound more natural, concise, or stylistically improved, not just correcting gr...

<think>
1.  **Instruction Analysis:** The instruction is "Fix grammaticality". This requires identifying and correcting grammatical errors in the provided text.

2.  **Original Text Segment:** ...

<think>
1.  **Instruction Analysis:** The instruction is "Fix grammaticality". This requires me to identify and correct any grammatical errors in the sentence: "I bought one t-shirts and sungla...

<think>
1.  **Instruction Analysis**: The instruction is "Fix grammatical errors in this sentence". This requires a thorough check of the sentence for any grammatical mistakes.

2.  **Sentence ...

<think>
1.  **Instruction Analysis:** The instruction is "Rewrite this sentence". This means I need to change the sentence's structure or phrasing while preserving its original meaning. Common ...

<think>
1

In [None]:
prompt = """
    hello there the angel from my nightmare the... return whatever
"""
response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt
)

print(response)

candidates=[Candidate(content=Content(parts=[Part(video_metadata=None, thought=None, code_execution_result=None, executable_code=None, file_data=None, function_call=None, function_response=None, inline_data=None, text='Ah, a fellow Blink-182 fan! The line continues:\n\n"...**shadow in the background of the morgue.**"\n\nThat\'s from their iconic song **"I Miss You"**!\n\nThe next lines are:\n"The unsuspecting victim of darkness in the valley\nWe can live like Jack and Sally if we want\nWhere you can always find me\nAnd we\'ll have Halloween on Christmas..."\n\nGreat song!')], role='model'), citation_metadata=CitationMetadata(citations=[Citation(end_index=1277, license='', publication_date=None, start_index=1070, title=None, uri='https://www.target.com/p/insten-ac-power-adapter-compatible-with-sony-psp-go-black/-/A-78787043'), Citation(end_index=1391, license='', publication_date=None, start_index=1206, title=None, uri='https://www.target.com/p/insten-ac-power-adapter-compatible-with-so

In [14]:
print(response.text)

Ah, a fellow Blink-182 fan! The line continues:

"...**shadow in the background of the morgue.**"

That's from their iconic song **"I Miss You"**!

The next lines are:
"The unsuspecting victim of darkness in the valley
We can live like Jack and Sally if we want
Where you can always find me
And we'll have Halloween on Christmas..."

Great song!


In [41]:
reasoning_text = "```xml <think>" + "response.text" + "</think>```"

# Remove anything before <think> and after </think>
start_tag = "<think>"
end_tag = "</think>"
start_idx = reasoning_text.find(start_tag)
# find() returns -1 on a miss, so check the match before adding the tag length
end_idx = reasoning_text.find(end_tag, start_idx + len(start_tag)) if start_idx != -1 else -1

if start_idx != -1 and end_idx != -1:
    reasoning_text = reasoning_text[start_idx:end_idx + len(end_tag)]
    # Print the dict itself: inside an f-string, {"_id": ...} is parsed as a replacement
    # field with a format spec and raises "ValueError: Invalid format specifier".
    print({"_id": item_id, "reasoning": reasoning_text.strip()})
else:
    print(f"\nWarning: Unexpected response format for ID {item_id}. Skipping. Response: {reasoning_text[:200]}...")
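
In an f-string, a single `{` opens a replacement field, and anything after a `:` inside it is treated as a format spec. A dict-shaped record is therefore easier to build as a real dict and serialize with `json.dumps`; doubled braces work when a literal brace is needed. A short sketch, where the `item_id` value is made up for illustration:

```python
import json

item_id = "example-id"  # hypothetical value, for illustration only
reasoning_text = "<think>response.text</think>"

# Build the record as a dict and serialize it; no brace escaping needed.
record = {"_id": item_id, "reasoning": reasoning_text.strip()}
as_json = json.dumps(record)

# If an f-string is preferred, literal braces must be doubled.
escaped = f'{{"_id": "{item_id}"}}'

print(as_json)
print(escaped)
```

The `json.dumps` form is also the one used when writing lines to the JSONL output file, so printing it shows exactly what would be stored.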


<think>
1.  **Instruction Analysis:** The instruction is "Rewrite this sentence." This implies improving clarity, conciseness, flow, or potentially changing the emphasis or structure. The origi...

<think>
1.  **Instruction Analysis:** The instruction is "Write a paraphrase for the sentence: The world has tried with little success to cut carbon emissions under the Kyoto Protocol." This me...

<think>
The user wants me to edit the source text to "Make the sentence fluent". This involves correcting grammar, spelling, word choice, and sentence structure to improve readability and natur...

<think>
1.  **Instruction Analysis:** The instruction is "Write a paraphrased version of the sentence". This means expressing the same meaning as the original sentence but using different wordi...

<think>
1.  **Instruction Analysis:** The instruction is "Paraphrase the sentence". This means I need to reword the original sentence while keeping the original meaning intact. The original sen...

<think>
1

In [39]:
reasoning_text = "```xml <think>" + response.text + "</think>```"
print(reasoning_text)

<think>Ah, a fellow Blink-182 fan! The line continues:

"...**shadow in the background of the morgue.**"

That's from their iconic song **"I Miss You"**!

The next lines are:
"The unsuspecting victim of darkness in the valley
We can live like Jack and Sally if we want
Where you can always find me
And we'll have Halloween on Christmas..."

Great song!</think>



<think>
1.  **Instruction Analysis:** The instruction is "Write a paraphrase for the sentence: The Commission will support the exchange of good practice in this field." This means I need to rew...
