# In-Context Learning


In-context learning is a generalisation of few-shot learning: the prompt supplies a context, and the LLM is asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning, where:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
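
The two steps above can be sketched with a toy example. This is only an illustration under stated assumptions: the word-overlap `retrieve` function stands in for a real search engine, and the function names, prompt wording, and sample documents are hypothetical rather than part of any library.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top-k documents that share at least one word with the query
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, context_docs):
    """Place the retrieved documents in the prompt as context for the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


documents = [
    "Gemini is a family of models from Google.",
    "arXiv hosts research preprints.",
    "Paris is the capital of France.",
]
query = "Where are research preprints hosted?"
prompt = build_prompt(query, retrieve(query, documents))
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the model itself is unchanged, and only the context it sees differs from query to query.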


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to identify the main strengths and weaknesses of each paper. Finally, we write the results to a local HTML file and display them in the notebook as Markdown.

In [None]:
!pip install requests bs4 google-generativeai pypdf tqdm markdown


In [15]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm
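import ssl
# Workaround for certificate errors: this disables HTTPS certificate verification
# for urllib globally. It is unsafe outside a throwaway environment.
ssl._create_default_https_context = ssl._create_unverified_context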
# from google.colab import userdata

In [26]:
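# Read the key from an environment variable instead of hard-coding it in the notebook.
# The variable name GOOGLE_API_KEY is an assumption; use whatever name you export.
API_KEY = os.environ["GOOGLE_API_KEY"]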
genai.configure(api_key=API_KEY)

We select the papers currently featured on the Hugging Face papers page.

In [27]:
BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL)
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

for h3 in h3s:
    a = h3.find("a")
    if a is None or not a.get("href"):
        continue  # skip headings that do not link to a paper
    title = a.text
    link = a["href"].replace('/papers', '')

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})
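
The link rewriting in the loop above can also be written as a small helper that is easy to test without network access. This is a sketch that assumes the hrefs on the page have the form `/papers/<arxiv-id>`, which matches the URLs produced in this notebook; the function name `to_arxiv_pdf_url` is hypothetical.

```python
def to_arxiv_pdf_url(href):
    """Map a path such as '/papers/2511.06221' to the matching arXiv PDF URL.

    Returns None when the href does not look like a paper link.
    """
    prefix = "/papers/"
    if not href or not href.startswith(prefix):
        return None
    # Drop any fragment, query string, or trailing slash after the identifier
    paper_id = href[len(prefix):].split("#")[0].split("?")[0].strip("/")
    if not paper_id:
        return None
    return f"https://arxiv.org/pdf/{paper_id}"


print(to_arxiv_pdf_url("/papers/2511.06221"))
```

Returning `None` for unexpected links lets the caller skip them instead of building a URL that points nowhere.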

In [28]:
%pip install PyPDF2


Helper functions to extract text from an HTML page or from a PDF, and to display Markdown in the notebook.

In [29]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


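import io

# requests and PdfReader (from pypdf) are already imported above.

def extract_pdf(url):
    # Download the PDF with requests, explicitly disabling SSL certificate verification
    response = requests.get(url, verify=False)

    # Raise an exception if the request failed
    response.raise_for_status()

    # Load the downloaded PDF into memory instead of saving it to a file
    pdf_file = io.BytesIO(response.content)

    # Read the in-memory PDF with PdfReader
    reader = PdfReader(pdf_file)
    text = ""
    for page in reader.pages:
        # extract_text() can yield nothing for image-only pages
        text += (page.extract_text() or "") + "\n"

    return text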


def printmd(string):
    display(Markdown(string))
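
The whitespace clean-up inside `extract_paper` can be tried on its own, without downloading anything. The sketch below isolates those three steps in a hypothetical helper named `normalize_text`; the sample string is made up for illustration.

```python
def normalize_text(text):
    """Apply the same clean-up steps as extract_paper to a plain string."""
    # Strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so that each phrase gets its own line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Title  Subtitle \n\n body "
print(normalize_text(sample))
```

Splitting on two spaces is a heuristic: it separates phrases that were laid out side by side on a page, while single spaces inside a sentence are left alone.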

In [30]:
LLM = "gemini-2.5-flash"
model = genai.GenerativeModel(LLM)

We ask Gemini to list the main strengths and weaknesses of each paper as a Markdown table.

In [31]:
# 1. Before the loop, define the new prompt template that asks for table output.
#    {text} is a placeholder that is filled with the full paper text later.
prompt_template = """
Analyze the following research paper text.
Your task is to identify the main strengths and weaknesses of the paper.

Please format your output *only* as a Markdown table with two columns: "Pros" and "Cons".
Do not include any other text, titles, or explanations before or after the table.

Text:
{text}

Markdown Table:
"""

# 2. Run the loop with the new prompt template.
for paper in tqdm(papers):
    try:
        # First, extract the paper text
        paper_text = extract_pdf(paper["url"])
        
        # Then fill the template with the paper text to build the final prompt
        final_prompt = prompt_template.format(text=paper_text)
        
        # Finally, call the model with this new, structured prompt
        paper["summary"] = model.generate_content(final_prompt).text

    except Exception as e:
        # (Optional improvement) Print a more detailed error message to ease debugging
        print(f"Generation failed for paper {paper.get('url', 'N/A')}: {e}")
        paper["summary"] = "Paper not available or processing failed"

100%|██████████| 12/12 [05:15<00:00, 26.31s/it]
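
The prompt asks for a Markdown table only, but a model is not guaranteed to comply. A lightweight check such as the one below could flag responses that need a retry or manual review. It is a sketch, not part of the Gemini API: the helper name `looks_like_pros_cons_table` and its acceptance rules are assumptions.

```python
def looks_like_pros_cons_table(text):
    """Return True if text starts with a two-column 'Pros | Cons' Markdown table."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) < 3:
        return False  # need a header, a separator, and at least one data row

    header = [cell.strip().lower() for cell in lines[0].strip("|").split("|")]
    separator = [cell.strip() for cell in lines[1].strip("|").split("|")]

    # A separator cell contains only dashes and optional alignment colons
    separator_ok = len(separator) == 2 and all(
        cell and set(cell) <= set(":-") and "-" in cell for cell in separator
    )
    return header == ["pros", "cons"] and separator_ok


good = "| Pros | Cons |\n|:---|:---|\n| Clear method | Small test set |"
bad = "Paper not available or processing failed"
print(looks_like_pros_cons_table(good), looks_like_pros_cons_table(bad))
```

The check is deliberately strict about the header and separator rows and lenient about the data rows, since empty cells are common when a paper has more strengths than weaknesses.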


We write the results to an HTML file.

In [34]:
import markdown  # 1. Make sure the markdown library is imported
from datetime import date  # (add this line too if date has not been imported yet)

# --- Start of HTML Generation ---

# 2. Accumulate all HTML content in a single string variable first
full_html_content = f"""
<html>
<head>
    <title>Daily Dose of AI Research</title>
    <style>
        body {{ font-family: sans-serif; line-height: 1.6; }}
        table {{ border-collapse: collapse; width: 100%; }}
        th, td {{ border: 1px solid #dddddd; text-align: left; padding: 8px; vertical-align: top; }}
        th {{ background-color: #f2f2f2; }}
    </style>
</head>
<body>
    <h1>Daily Dose of AI Research</h1>
    <h4>{date.today()}</h4>
    <p><i>Summaries generated with: {LLM}</i></p>
"""

# 3. Process each paper in turn
for paper in papers:
    # Get the Markdown-formatted summary from the dictionary
    markdown_summary = paper["summary"]
    
    # Convert the Markdown to HTML, enabling the 'tables' extension
    html_summary_table = markdown.markdown(markdown_summary, extensions=['tables'])
    
    # Append this paper's title and the converted HTML table to the main content string
    full_html_content += f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2>{html_summary_table}'

# 4. Add the closing part of the HTML
full_html_content += """
</body>
</html>
"""

# 5. Finally, write the complete HTML content to the file in one go
with open("papers.html", "w", encoding="utf-8") as f:
    f.write(full_html_content)

print("Successfully generated papers.html with formatted tables.")

Successfully generated papers.html with formatted tables.
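
The cell above places paper titles and URLs into the HTML as they are. Titles scraped from a web page can contain characters such as `&` or `<`, so escaping them first is a sensible precaution. A minimal sketch using the standard library's `html.escape`; the helper name `paper_heading` is hypothetical.

```python
from html import escape


def paper_heading(title, url):
    """Build the <h2> link for one paper with the title and URL escaped."""
    # Collapse line breaks and repeated spaces inside the scraped title
    safe_title = escape(" ".join(title.split()))
    # quote=True also escapes double quotes, which matters inside href="..."
    safe_url = escape(url, quote=True)
    return f'<h2><a href="{safe_url}">{safe_title}</a></h2>'


print(paper_heading("Tiny Model,\n  Big Logic", "https://arxiv.org/pdf/2511.06221"))
```

The summary itself is already converted by `markdown.markdown`, so only the title and URL need this treatment.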


We can also print the results to this notebook as markdown.

In [35]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model
  Reasoning Ability in VibeThinker-1.5B](https://arxiv.org/pdf/2511.06221)**<br>| Pros | Cons |
|---|---|
| Challenges the prevailing consensus that small models inherently lack robust reasoning capabilities. | Exhibits a substantial performance gap (20-40 points) compared to leading models on general knowledge benchmarks (GPQA). |
| Introduces an innovative post-training methodology: "Spectrum-to-Signal Principle (SSP)" with "Two-Stage Diversity-Exploring Distillation" and "MaxEnt-Guided Policy Optimization (MGPO)". | Coding performance, while competitive, is slightly less dominant than mathematical performance, attributed to base model's pre-training. |
| VibeThinker-1.5B (1.5B parameters) achieves superior reasoning capabilities on mathematical and coding benchmarks, outperforming much larger models (e.g., DeepSeek R1, Claude Opus 4, Magistral Medium). | Explicitly states the model is open-sourced "not as a deployable solution," implying potential limitations for real-world application beyond proof-of-concept. |
| Demonstrates exceptional cost-effectiveness with a total training cost of $7,800, drastically lower than large SOTA models. | The paper's date (Nov. 7, 2025) and references to benchmarks released in 2025 create an unusual temporal context, presenting findings as if from the future. |
| Significantly improves performance over its base model across all evaluated reasoning domains. | |
| Enhances research accessibility and democratizes AI by reducing computational resource requirements for advanced reasoning models. | |
| Employs rigorous data decontamination procedures and demonstrates strong performance on benchmarks (AIME25, HMMT25) released after its base model, bolstering claims of generalization. | |
| The post-trained model checkpoint is open-sourced to support future research. | |<br><br>

**[Grounding Computer Use Agents on Human Demonstrations](https://arxiv.org/pdf/2511.07332)**<br>| Pros | Cons |
|---|---|
| **Novel & High-Quality Dataset (GROUNDCUA):** Addresses a critical gap for desktop environments with the largest expert human-annotated dataset (3.56M annotations across 56K screenshots from 87 diverse open-source applications). | **Limited Dataset Diversity (Self-Acknowledged):** While extensive, the dataset may not fully represent the entire breadth of desktop software and is biased towards commonly used open-source applications. |
| **Dense & Fine-Grained Annotations:** Captures high-resolution displays, small elements (icons, toolbars), and provides rich context (average 64 elements/screenshot, fine-grained categories for 50% elements). | **Static UI Capture:** Keyframe-based annotations capture static UI states but miss dynamic UI elements, animations, and real-time updates. |
| **Realistic Data Collection:** Built from expert human demonstrations of real-world tasks, leading to a more realistic distribution of UI states compared to automated or synthetic data generation. | **Scalability Challenges of Human Annotation:** Human labeling is time-consuming and costly, potentially limiting future scalability and introducing inherent inconsistencies in large-scale manual efforts. |
| **Diverse Instruction Generation:** Leverages dense annotations to create diverse (Direct, Functional, Spatial) and context-aware instructions via MLLM prompting, tightly linked to visual and textual content. | **Limited End-to-End Agentic Evaluation:** The primary evaluation focuses on grounding accuracy, not full end-to-end task completion for complex, multi-step agentic scenarios (though OSWorld-Verified is a good step). |
| **State-of-the-Art Model Performance (GROUNDNEXT):** Achieves SOTA results on five challenging benchmarks (desktop, mobile, web) at 3B and 7B scales using supervised fine-tuning and reinforcement learning. | **Real-World Robustness Not Fully Explored:** Evaluation does not explicitly address robustness to distribution shifts, new application versions, or evolving UI updates, which are critical for real-world agents. |
| **Exceptional Data Efficiency:** Outperforms prior models with significantly less SFT data (700K instructions vs. 9M in prior work), demonstrating the value of high-quality data over sheer volume. | **RL Reward Function Simplicity:** While effective, the paper notes that more sophisticated reward functions could lead to more substantial RL gains, suggesting room for improvement. |
| **Strong Agentic Capability:** GROUNDNEXT-3B shows performance comparable to or superior to much larger models and proprietary APIs in agentic, multi-step tasks, highlighting practical utility for resource-constrained systems. | **Cross-Domain Generalization Room for Improvement:** While competitive, performance on web interfaces (e.g., on SSv2) falls behind some specialized models, indicating potential transfer bottlenecks if not explicitly augmented with web/mobile data. |
| **Cross-Platform Generalization:** Models trained primarily on desktop data show competitive performance on mobile and web benchmarks, indicating good generalization ability. | |
| **Targeted Improvements:** Significant gains in desktop benchmarks, particularly for fine-grained elements like icon recognition, attributed to the dataset's detailed diversity. | |
| **Open-Source Contribution:** Releases both the GROUNDCUA dataset and GROUNDNEXT models, fostering open research and reproducibility in the field. | |
| **Efficient RL Training:** Uses a simple yet effective discrete reward function and RLOO, avoiding complex reward strategies or the need for a separate critic model. | |<br><br>

**[Adaptive Multi-Agent Response Refinement in Conversational Systems](https://arxiv.org/pdf/2511.08319)**<br>| Pros                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Cons                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Proposes a novel multi-agent framework (MARA) for conversational response refinement, addressing limitations of single-agent approaches.                                                                                                                                                                                                                                                                                                                                                                    | The current planner agent, based on an unsupervised LLM, has room for improvement, as indicated by the performance gap with an "ideal planner."                                                                                                                                                                                                                                                                                                                                 |
| Employs specialized refining agents for distinct aspects: factuality, personalization (persona), and coherence, enabling a more holistic improvement of responses.                                                                                                                                                                                                                                                                                                                                                | Acknowledges concerns regarding scalability and resource efficiency inherent to multi-agent frameworks, even if partially mitigated by the planner and smaller LLMs.                                                                                                                                                                                                                                                                                                         |
| Introduces a dynamic communication strategy with a planner agent that adaptively selects and sequences relevant agents based on specific query requirements, which is more flexible than fixed sequences.                                                                                                                                                                                                                                                                                                   | The G-Eval metric for "Naturalness" shows a relatively low correlation (0.35) with human judgments, suggesting it may not fully capture the human-like quality of responses as effectively as other metrics.                                                                                                                                                                                                                                                                       |
| Demonstrates significant performance improvements over various single-agent and multi-agent baselines on challenging conversational datasets across diverse metrics (Coherence, Groundedness, Naturalness, Engagingness).                                                                                                                                                                                                                                                                                       | The paper identifies the lack of a fine-tuned planner as a limitation and suggests it as future work, implying the current planner's performance is not optimal.                                                                                                                                                                                                                                                                                                                     |
| Conducts thorough ablation studies to validate the contribution of individual refining agents and the dynamic communication strategy, supporting the design choices.                                                                                                                                                                                                                                                                                                                                        | Mentions the potential for integrating additional safeguards like an agent for harmful content detection as future research, indicating this is not currently implemented in MARA.                                                                                                                                                                                                                                                                                                   |
| Includes human evaluation to corroborate G-Eval results and assess alignment, strengthening the validity of the evaluation methodology, especially for coherence, groundedness, and engagingness.                                                                                                                                                                                                                                                                                                               | The framework's overall effectiveness is dependent on the capabilities of the underlying LLMs chosen for each agent, and while it shows flexibility, it means performance could be constrained by model choice.                                                                                                                                                                                                                                                                   |
| Shows robustness and flexibility by demonstrating effectiveness across different base LLMs (Claude, GPT-4o-mini, LLaMA 3.1) and by allowing different LLMs for specific agent roles (e.g., a more powerful LLM for fact-refining).                                                                                                                                                                                                                                                                              | While the proposed system is powerful, its multi-agent architecture introduces complexity in terms of orchestration and potential computational overhead (multiple LLM calls), which might be a barrier for some applications despite efforts to mitigate it.                                                                                                                                                                                                                          |
| Directly addresses critical shortcomings of LLMs in conversational systems, such as failures in personalization, factual accuracy (hallucinations), and maintaining coherence in multi-turn dialogues.                                                                                                                                                                                                                                                                                                            | The "Ideal Planner" results, achieved by brute-forcing combinations (costly), highlight that the current dynamic planner, while good, doesn't always find the globally optimal sequence, suggesting there's still a gap in fully maximizing the multi-agent potential in a computationally efficient way.                                                                                                                                                                             |
| Provides detailed insights into the distribution of agent usage across different datasets, illustrating how the planner agent effectively adapts to varying query requirements.                                                                                                                                                                                                                                                                                                                                 | Human evaluation was conducted on a relatively small subset (288 queries from FoCus dataset with 8 participants) compared to the scale of the G-Eval, which might limit the generalizability of human-alignment claims, particularly for the weaker correlations.                                                                                                                                                                                                                  |
| The step-by-step reasoning process (verify then refine with justifications) within each agent, and the passing of planner's justifications to agents, enhances transparency and potentially the quality of refinement. | The experimental results are based on 3 different runs for the main results (Table 1), which might be considered a relatively small number of runs for statistical robustness, though significance tests are provided in the appendix for the main datasets and comparisons. |<br><br>

**[Wasm: A Pipeline for Constructing Structured Arabic Interleaved
  Multimodal Corpora](https://arxiv.org/pdf/2511.07080)**<br>| Pros | Cons |
|:---|:---|
| **Addresses a Critical Gap:** Introduces the first large-scale Arabic interleaved multimodal dataset pipeline, filling a significant void for Arabic LLM/LMM development. | **Nuance in Perplexity Model Performance:** The paper acknowledges that their custom Arabic KenLM model sometimes yields higher perplexity than a simpler Wikipedia-trained model for certain samples (Table 5 caption), indicating potential areas for refinement or further inspection regarding quality assessment for all types of Arabic text. |
| **Structural Preservation:** Uniquely preserves document-level hierarchical structure and text-image interleaving in Markdown format, which is crucial for training advanced multimodal models. | **Lack of Specific Dataset Scale:** While the paper repeatedly claims to create a "large-scale" dataset, it does not explicitly quantify the size (e.g., number of documents, tokens, images) of the final Wasm dataset within the provided text, making it difficult to fully assess its scale relative to other corpora discussed. |
| **Arabic-Specific Adaptations:** Implements comprehensive linguistic and cultural adaptations to filtering strategies (e.g., relaxed repetition/punctuation ratios, custom Arabic KenLM for perplexity) to suit Arabic text characteristics and dialectal diversity. | **Limited Downstream Evaluation:** The text mentions the dataset was used to train a vision model (Baseer), but it does not present specific quantitative results on how models trained with Wasm data perform on downstream tasks compared to those trained with existing Arabic resources. |
| **Granular Deduplication:** Utilizes a novel node-level deduplication strategy (Needleman-Wunsch algorithm) to retain unique content within documents, improving corpus diversity and processing efficiency compared to conventional document-level methods. | |
| **Computational Efficiency:** Incorporates early language-based filtering of Common Crawl WARC files to significantly reduce computational resources. | |
| **Flexibility:** The pipeline is designed to support both text-only and multimodal pre-training scenarios. | |
| **Open Science:** Publicly releases the adapted pipeline code and a representative dataset dump, fostering reproducibility and facilitating further research in Arabic NLP. | |
| **Comparative Analysis:** Provides a detailed comparative analysis of its data processing pipeline and filtering choices against those used for major existing corpora, justifying its specific design decisions. | |<br><br>

**[KLASS: KL-Guided Fast Inference in Masked Diffusion Models](https://arxiv.org/pdf/2511.05664)**<br>| Pros | Cons |
|:---|:---|
| **Novelty and Innovation:** Introduces KLASS, a novel adaptive sampling method leveraging token-level KL divergence and confidence to identify stable, high-confidence predictions. | **Scalability for Largest Models:** Cannot be evaluated on the most challenging benchmarks (e.g., agentic LLM systems) due to the current absence of discrete diffusion models as large as state-of-the-art autoregressive models. |
| **Significant Speedup:** Achieves substantial wall-clock speedups (up to 2.78×) and reduces sampling steps by 40-70% through parallel unmasking. | **Hyperparameter Search Cost:** Introduces additional hyperparameters (KL and confidence thresholds, history length) that require a minimal search cost, although the paper states performance is robust around optimal points. |
| **Improved Performance:** Consistently improves accuracy on challenging reasoning benchmarks (math, code) and maintains/improves quality in text, image, and molecular generation. | |
| **Training-Free and Efficient:** Requires no additional model training, external planners, or extra memory burden, making it a lightweight post-processing step with negligible computational and memory overhead. | |
| **Broad Applicability:** Demonstrated effectiveness across diverse modalities including text, images, and molecules, showcasing robust generalization. | |
| **Theoretical Rationale:** Provides theoretical justification for why incorrect tokens cannot remain dynamically stable, supporting the use of KL divergence for improved sample quality. | |
| **State-of-the-Art Results:** Attains state-of-the-art performance among diffusion-based samplers on reasoning benchmarks. | |
| **Adaptive and Robust:** Dynamically adapts the unmasking process based on evolving model confidence and prediction stability, showing robustness to minor hyperparameter adjustments. | |
| **Reproducibility:** Code is made publicly available. | |<br><br>

**[VideoSSR: Video Self-Supervised Reinforcement Learning](https://arxiv.org/pdf/2511.06281)**<br>| Pros                                                                    | Cons                                                                                                                              |
| :---------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------- |
| Addresses critical problem of expensive/biased video data annotation.   | Experiments primarily use a low number of input frames, limiting scalability to long videos.                                        |
| Introduces novel self-supervised reinforcement learning (VideoSSR) framework. | Relies on a limited set of three pretext tasks; broader exploration of self-supervised objectives is noted as future work.        |
| Generates high-quality, verifiable training data from intrinsic video signals. | Challenges in improving temporal perception using certain perturbations (e.g., "Slow" and "Fast") were observed.                  |
| Proposes three novel pretext tasks (Anomaly Grounding, Object Counting, Temporal Jigsaw). | A performance gap remains compared to top-tier closed-source models, particularly for long video tasks.                             |
| Pretext tasks feature parametrically scalable difficulty.               | Computational cost of training (16 hours on 8 H200 GPUs) might still be substantial for broader accessibility.                      |
| Develops new benchmark (VIUBench) and dataset (VideoSSR-30K).          | Generalizability to different MLLM base architectures beyond Qwen3-VL-8B-Instruct is not explicitly demonstrated.                   |
| Employs tailored smooth reward functions to overcome sparse reward issue in RLVR. |                                                                                                                                   |
| Extensive evaluation across 17 benchmarks in four major video domains.  |                                                                                                                                   |
| Demonstrates consistent and significant performance improvements (average > 5%). |                                                                                                                                   |
| Ablation studies confirm effectiveness of diverse tasks and smooth rewards. |                                                                                                                                   |
| Code is made publicly available.                                        |                                                                                                                                   |
| Highlights limitations of training on existing MLLM-annotated datasets. |                                                                                                                                   |<br><br>

**[Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs](https://arxiv.org/pdf/2511.07003)**<br>| Pros                                                                                                                                                                                                                                                               | Cons                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Addresses English-centric bias and promotes inclusivity by centering translation on both Chinese and English (bi-centric design).                                                                                                                                                                                                                                                                                                                                 | Evaluation primarily relies on academic benchmarks (COMET), with real-world scenarios suggested for future work, potentially limiting generalizability.                                                                                                                                                                                                                                                                                                        |
| Achieves broad language coverage, supporting 60 languages and 234 translation directions.                                                                                                                                                                                                                                                                                                                                                                          | Acknowledges that 60 languages still represent a small fraction of the world's linguistic diversity (over 7,000 languages globally).                                                                                                                                                                                                                                                                                                                         |
| Identifies and thoroughly analyzes "directional degeneration," a novel and critical issue arising from symmetric multi-way data in large-scale multilingual Supervised Fine-tuning (SFT).                                                                                                                                                                                                                                                                       | Potential for bias in quality estimation (QE) models (e.g., COMETKiwi) for underrepresented or non-English language pairs, which might affect the reliability of quality assessment for low-resource languages.                                                                                                                                                                                                                                            |
| Proposes "Strategic Downsampling," a simple yet highly effective method to mitigate directional degeneration, demonstrating its practicality and significant performance gains.                                                                                                                                                                                                                                                                                     | The bi-centric design, while an improvement over English-centric models, is acknowledged by the authors as a limitation that could be extended to tri- or multi-centric configurations.                                                                                                                                                                                                                                                                |
| Introduces "Parallel Multilingual Prompting (PMP)," an intuitive and effective technique that enhances cross-lingual transfer by leveraging typologically related or English auxiliary languages for medium- and low-resource directions.                                                                                                                                                                                                                        | PMP's effectiveness at inference can depend on the availability of a high-quality external MT system or self-generated auxiliary translation, which might introduce dependencies or potential error propagation.                                                                                                                                                                                                                                               |
| Achieves State-of-the-Art (SOTA) performance among models with comparable language coverage, with a 4B model substantially outperforming much larger models (e.g., Aya-101-13B, NLLB-54B), showcasing exceptional parameter efficiency.                                                                                                                                                                                                                          | For some low-resource languages (e.g., Tibetan, Marathi, Burmese), especially in Chinese-centric directions, the reported SacreBLEU scores (in the appendix) are very low, which could indicate limitations in fluency or adequacy despite potentially higher COMET scores.                                                                                                                                                                                        |
| Features a rigorous, multi-stage data curation pipeline encompassing large-scale collection, pseudo-parallel synthesis, and multi-dimensional filtering.                                                                                                                                                                                                                                                                                                         | The consistent use of English as the auxiliary language for all Chinese-centric PMP directions might still reinforce an English pivot, even if it's an in-context hint rather than a full two-step pivot.                                                                                                                                                                                                                                                 |
| Includes comprehensive ablation studies that quantitatively demonstrate the individual contributions and synergistic effects of Strategic Downsampling, Continued Pre-training (CPT), and Parallel Multilingual Prompting (PMP).                                                                                                                                                                                                                                  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Demonstrates the generality of the "directional degeneration" phenomenon across various foundation models (Qwen3, Llama3.1, Gemma2), highlighting its systemic nature.                                                                                                                                                                                                                                                                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Open-sources the LMT model suite in four sizes (0.6B/1.7B/4B/8B), providing strong baselines and catalyzing future research in inclusive and scalable MMT.                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Contributes a new Chinese-Mongolian test set, addressing a gap in existing MT benchmarks and planning its release to the community. | |

**[The Path Not Taken: RLVR Provably Learns Off the Principals](https://arxiv.org/pdf/2511.08567)**

| Pros | Cons |
|:---|:---|
| Resolves the RLVR sparsity paradox by introducing a novel "Three-Gate Theory" (KL Anchor, Model Geometry, Precision). | Relies on "principal weights" as a proxy for high-curvature directions due to the computational infeasibility of direct quantification. |
| Provides the first parameter-level characterization of RLVR's learning dynamics, moving beyond policy or distributional effects. | Theoretical proofs and bounds often depend on "small-step" or "sufficiently small" conditions, which might limit their strict applicability in all practical training scenarios. |
| Empirically validates the theory with a comprehensive suite of experiments, including spectral analysis, weight overlap, and geometry-disrupting interventions. | Current sparse fine-tuning experiments use one-shot fixed masks, indicating that dynamic or adaptive masking strategies are left for future work. |
| Clearly identifies that RLVR operates in a distinct optimization regime from SFT, learning "off-principal directions" and preserving spectral geometry. | The core focus is on RLVR (Reinforcement Learning with Verifiable Rewards), implying that direct applicability to other, broader RL settings might require further investigation. |
| Offers practical guidance for designing RLVR-native, geometry-aware learning algorithms, demonstrating the misalignment of SFT-era PEFT methods. | The nuanced distinction between "apparent sparsity" (due to bfloat16 precision) and absolute zero updates might be considered a less definitive form of sparsity by some. |
| Uses a robust, bfloat16-aware methodology for analyzing update sparsity, addressing limitations of prior research. | |
| Demonstrates the generality of its findings across various RL algorithms, datasets, model families (Qwen, Llama, Mistral), and tasks (math, code, agentic, RLHF). | |
| Provides causal evidence for the role of pretrained model geometry in steering RL updates through targeted interventions. | |
| Explicitly challenges existing beliefs about the benefits of principal-targeted LoRA variants (PiSSA) for RLVR. | |

**[BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives](https://arxiv.org/pdf/2511.08029)**

| Pros | Cons |
|---|---|
| Novel hard negative mining strategy utilizing multi-hop citation chains from PubMed, addressing challenges in distinguishing negatives in biomedical domains. | The data generation process for multi-hop citation chains is explicitly stated to be time-consuming ("can take week(s)"), constrained by API rate limits, network latency, and parsing large text data. |
| Achieves state-of-the-art performance on BEIR and LoTTE benchmarks, outperforming significantly larger baseline models. | The citation-aware hard negative mining is currently restricted to PubMed, limiting its direct applicability for general-domain retrieval without adaptation to other sources like Wikipedia. |
| Demonstrates high parameter efficiency, with BiCA small (33M parameters) often rivalling or surpassing models up to 145 times larger. | The authors acknowledge that the latency evaluation setup may not fully reflect the efficiency advantages of all baseline models (e.g., ColBERTv2), potentially leading to a non-ideal comparison for certain baselines. |
| Exhibits strong computational efficiency and low inference latency, particularly BiCA small, making it suitable for real-world, resource-constrained applications. | |
| Shows effective data efficiency, achieving strong performance with minimal fine-tuning (only 20 steps) due to the highly informative nature of the citation-aware negatives. | |
| Proven generalizability, performing well across both biomedical and general-domain retrieval tasks and showing consistent improvements when fine-tuning different model architectures. | |
| Provides open-source code and datasets, contributing to reproducibility and facilitating further research. | |
| The hard negative mining strategy incorporates robust mechanisms such as multiple start points, stochastic path selection, a global visited set, and random negative augmentation to ensure diversity and prevent overfitting. | |
| The selected 20,000-document training corpus is shown to be representative of the larger PubMed database. | |

**[DynaAct: Large Language Model Reasoning with Dynamic Action Spaces](https://arxiv.org/pdf/2511.08043)**

| Pros | Cons |
| :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Novel and Principled Approach:** Introduces dynamic action space construction for LLM reasoning using a submodular function to balance utility and diversity.                                                                                                        | **MCTS Computational Intensity:** While the action space construction is efficient, the overall reliance on Monte Carlo Tree Search (MCTS) for action evaluation can still be computationally intensive for very large or complex tasks, prompting suggestions for alternative search algorithms. |
| **Significant Performance Improvements:** Achieves state-of-the-art results across six diverse benchmarks, including substantial gains on complex math reasoning tasks (e.g., 6.8% on MATH-500).                                                                        | **Reliance on Base LLM Capabilities:** Although the method is efficient, its scalability and performance might still be inherently limited by the capabilities of the underlying pre-trained Llama models, especially when dealing with extremely large or highly specialized action spaces. |
| **Scalable Action Space Construction:** Automatically learns generalizable action patterns from diverse corpora, overcoming limitations of manually defined or unstructured action spaces, and demonstrates stable performance with large proxy action spaces.         | **Lack of Statistical Significance Reporting:** The paper does not report error bars or other statistical significance information for its experimental results, making it harder to fully assess the robustness and generalizability of the performance gains.                                                                                                                                                                                                                         |
| **Efficient Inference:** Maintains efficient inference without introducing substantial latency compared to other MCTS-based baselines, attributed to precomputed embeddings and the linear-time greedy algorithm for submodular optimization.                          | **Hyperparameter Sensitivity:** The performance is highly sensitive to the balancing parameters (α, β) between utility and diversity in the submodular function, requiring careful tuning to achieve optimal results.                                                                                                                                                                                                                                                             |
| **Compact and Diverse Action Spaces:** The submodular function effectively selects compact, information-dense action sets that enhance reasoning efficacy by preventing redundancy.                                                                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| **Strong Ablation Studies:** Comprehensive ablation studies demonstrate the critical importance of each proposed component (utility, diversity, Q-learning, submodular strategy) for the method's overall performance.                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| **Low Training Overhead:** Only requires training a lightweight embedding model, keeping the larger base LLM frozen, which contributes to efficiency.                                                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| **Enhanced Action Utility:** The method leads to the selection of actions that more effectively trigger critical reasoning steps, particularly beneficial for solving complex problems.                                                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| **Open-Source Implementation:** The implementation is made available, promoting reproducibility and further research. | |

**[Intelligence per Watt: Measuring Intelligence Efficiency of Local AI](https://arxiv.org/pdf/2511.07885)**

| Pros | Cons |
|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Introduces novel and highly relevant metrics, "intelligence per watt" (IPW) and "intelligence per joule" (IPJ), for comprehensively evaluating local AI efficiency. | The paper is presented as if published in November 2025, describing technologies (e.g., GPT-5, Apple M4 Max) as current, making its empirical findings predictive/hypothetical rather than verifiable observations from the present. |
| Conducts a large-scale and comprehensive empirical study across 20+ state-of-the-art local LMs, 8 diverse hardware accelerators (local and cloud), and over 1 million real-world queries. | The study focuses exclusively on single-turn chat and reasoning queries, which limits the generalizability of findings to more complex and prevalent multi-turn conversational AI scenarios. |
| Utilizes a robust and representative dataset, combining real-world user queries (WILDCHAT, NATURALREASONING) with challenging standardized benchmarks (MMLU PRO, SUPERGPQA). | While "local" includes laptops, the definition of "local accelerators" includes powerful systems like Apple M4 Max, potentially overlooking the needs and constraints of smaller, truly edge/power-constrained devices. |
| Provides valuable longitudinal analysis (2023-2025) demonstrating significant improvements in IPW (5.3×) and local inference coverage, highlighting positive trends and the impact of both algorithmic and hardware advances. | The paper demonstrates *potential* gains from intelligent query routing but does not delve into the practical complexities, computational overhead, or ongoing research challenges involved in developing and deploying highly accurate routing systems in real-world, dynamic environments. |
| Quantifies substantial potential resource savings (60-80% energy, compute, and cost) achievable through effective local-cloud query routing, even with realistic routing accuracy. | Local accelerators are shown to achieve "at least 1.4× lower IPW" than cloud accelerators for identical models, indicating a current efficiency disadvantage for local hardware, even if framed as "headroom for optimization." |
| Releases a hardware-agnostic IPW profiling harness, promoting reproducibility and enabling future benchmarking as the local inference ecosystem evolves. | The reliance on LLM-as-a-judge (GPT-4o) for evaluating response accuracy, particularly for naturalistic chat, introduces potential for bias, inconsistency, or misalignment with human preferences, which are known limitations of this method. |
| Offers detailed methodological descriptions for query curation, LLM-as-a-judge prompts, and multi-platform telemetry collection, enhancing transparency and credibility. | The entire study is conducted with a batch size of 1, which, while useful for isolating intrinsic efficiency, may not accurately reflect the efficiency of cloud accelerators that often leverage larger batch sizes for higher throughput and better resource utilization. |
| Includes a nuanced, domain-specific analysis of local LM performance, revealing varying coverage across different economic sectors (e.g., high for creative tasks, lower for technical fields). | Despite overall high coverage, local LM performance drops significantly for specialized technical domains (e.g., Architecture & Engineering to 68%), indicating persistent limitations for local AI in these high-value, complex reasoning tasks. |
| Investigates the practical impact of model precision (FP16 to FP4 quantization) on accuracy and energy consumption, providing insights into efficiency tradeoffs for deployment. | |
| Connects local AI performance improvements to U.S. GDP, highlighting the economic relevance and potential impact of advancements in local inference capabilities. | |

**[Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective](https://arxiv.org/pdf/2511.06428)**

| Pros | Cons |
|---|---|
| Relevant and timely topic addressing LLMs in software development from a practitioner's perspective. | Inconsistent publication date (August 2021) with reported research dates (October 2024 - September 2025), creating potential confusion about the study's timeline. |
| Employs a robust qualitative methodology (Socio-Technical Grounded Theory) with iterative data collection (22 interviews across 3 rounds) and rigorous analysis techniques. | Reliance on convenience and snowball sampling after the initial round, which may introduce some selection bias. |
| Provides a comprehensive, multi-level analysis of LLM impact, covering individual, team, organization, and society levels. | Acknowledges limited participant scope (e.g., excluding IT managers, project managers), which could offer additional perspectives. |
| Features a diverse participant pool across various roles, countries, and experience levels, enhancing the richness of data. | Adaptation of the ATAI scale from 9-point to 5-point, though for ease of use, is noted as a potential threat to validity. |
| Offers practical implications and actionable recommendations for achieving a balanced use of LLMs in software development. | The in-depth qualitative nature, while providing rich insights, inherently limits the direct generalizability of findings to all contexts without further validation. |
| Explicitly identifies novel contributions by comparing findings with existing literature, highlighting new insights. | Limited exploration of certain aspects like the broader environmental costs (e.g., water consumption) of LLMs. |
| Demonstrates strong methodological rigor by explicitly discussing credibility, analyzability, transparency, and usefulness of the STGT application and outcomes. | |
| Openly acknowledges and discusses threats to validity, showcasing academic transparency and self-awareness. | |
| Proposes clear and interesting directions for future research, contributing to the ongoing academic discourse on LLMs in SE. | |
| Findings are richly supported by numerous direct quotes from participants, enhancing the credibility and clarity of the results. | |