In [1]:
import logging
import os
import warnings

import torch
import transformers
from transformers import LlamaForCausalLM, AutoTokenizer, TextStreamer
from transformers.models.llama.modeling_llama import LlamaAttention

from fixed_size_kv_cache.dynamic_cache import FixedSizeDynamicCache
from fixed_size_kv_cache.llama_attention import forward

# Set up logging
logging.basicConfig(level=logging.INFO)

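# Patch Llama to use FixedSizeDynamicCache: replace the attention forward pass
# and rebind DynamicCache in the modeling_llama namespace.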
LlamaAttention.forward = forward
transformers.models.llama.modeling_llama.DynamicCache = FixedSizeDynamicCache
LlamaForCausalLM.__init__.__globals__['DynamicCache'] = FixedSizeDynamicCache

# Suppress warnings
warnings.filterwarnings("ignore")

# Set environment variables
os.environ["FSDC_KV_CACHE_SIZE"] = "1024"
os.environ["FSDC_INCLUDE_SKIPPED"] = "true"
os.environ["FSDC_FREE_MEMORY"] = "false"

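
The patching in the cell above relies on Python resolving module-level names at call time. The following is a minimal, standard-library sketch of that idea. `fake_modeling`, `build_cache`, and `FixedSizeCache` are hypothetical stand-ins, not part of `transformers` or `fixed_size_kv_cache`. It also shows that a function's `__globals__` is the namespace dict of the module that defined it, so if `LlamaForCausalLM.__init__` is defined directly in `modeling_llama` (an assumption), the two `DynamicCache` assignments above would target the same dict.

```python
import types

# Hypothetical stand-in for a library module such as modeling_llama.
fake_module = types.ModuleType("fake_modeling")
exec(
    """
class DynamicCache:
    kind = "unbounded"

def build_cache():
    # The name DynamicCache is resolved in the module globals at call time.
    return DynamicCache()
""",
    fake_module.__dict__,
)


class FixedSizeCache:
    kind = "fixed"


before = fake_module.build_cache().kind

# Rebinding the module attribute changes what build_cache() instantiates.
fake_module.DynamicCache = FixedSizeCache
after = fake_module.build_cache().kind

# A function's __globals__ is the namespace dict of its defining module.
same_namespace = fake_module.build_cache.__globals__ is fake_module.__dict__

print(before, after, same_namespace)
```

Code that imported the original class into its own namespace before the patch (for example with `from module import DynamicCache`) keeps its own reference and would not see the replacement.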


In [2]:
model_name = 'mkurman/llama-3.2-MEDIT-3B-o1-GRPO-LLM-Eval'

cache_dir = os.getenv('HF_CACHE')  # None falls back to the default Hugging Face cache

model = LlamaForCausalLM.from_pretrained(model_name, cache_dir=cache_dir, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

# Force eager attention as a workaround: the flash and SDPA implementations do not return attention weights
model.config._attn_implementation = "eager"

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 14.08it/s]
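
To illustrate why the eager path matters here: an eager implementation materializes the softmax weights explicitly, so they can be returned alongside the output, whereas fused kernels compute the output without exposing that matrix. The sketch below is a pure-Python, single-query version for illustration only. `eager_attention` is a hypothetical helper, not the patched `forward` from `fixed_size_kv_cache`, and the assumption that the cache uses these weights to score tokens is not confirmed by the notebook.

```python
import math


def eager_attention(query, keys, values):
    """Single-query scaled dot-product attention that also returns the weights."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

    # Numerically stable softmax over the key positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = eager_attention(query, keys, values)
print(weights)  # one weight per key position, summing to 1
```

The per-position `weights` list is the kind of signal a cache could use to decide which entries to keep.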


In [3]:
test = [{"role": "user", "content": """Summarize every article:\n\n
                         
----------------------------------------------------     
Article 1:
Recent Advances and Future Directions in Attention Mechanisms for Large Language Models
The attention mechanism, a cornerstone of modern large language models (LLMs), has undergone significant innovations in recent years. While traditional self-attention mechanisms in transformers have enabled breakthroughs in natural language processing, researchers have identified limitations in computational efficiency, contextual prioritization, and structural expressiveness. This report synthesizes recent advancements, analyzes persistent gaps, and proposes novel pathways for reimagining attention in LLMs.

Recent Innovations in Attention Mechanism Design
Differential Attention for Noise Reduction
Microsoft's Differential Transformer11 introduces a partitioned attention mechanism that computes two separate softmax maps from subdivided query and key vectors. By subtracting these maps, the model cancels common-mode noise while amplifying signal components critical to context. This approach mirrors noise-canceling audio systems, demonstrating 15-20% improvements in factual consistency benchmarks compared to conventional transformers. The subtraction operation adds negligible computational overhead due to parallelization, making it viable for real-world deployment114.

Matrix Optimization Strategies for Efficient Fine-Tuning
Theoretical work by arXiv researchers1 reveals that selectively updating query (Q) and key (K) matrices during fine-tuning achieves comparable performance to full-parameter tuning while reducing memory usage by 40%. This stems from the QK system's role in determining attention score distributions, where strategic learning rate differentiation (higher rates for K matrices) accelerates convergence. Experimental validation on GLUE benchmarks shows this method matches full fine-tuning accuracy with 60% fewer training steps110.

Architectural Variants for Scalability
Recent implementations employ grouped-query attention (GQA) and sliding-window attention (SWA) to handle long-context processing4. GQA clusters similar queries using locality-sensitive hashing, reducing memory overhead from O(n²) to O(n log n) for n-token sequences. SWA processes text through overlapping 4k-token windows with positional encoding carryover, enabling 128k-token context handling with only 12% latency increase compared to standard 4k models412.

Persistent Limitations and Theoretical Constraints
Working Memory Capacity Boundaries
Empirical studies on N-back tasks reveal that transformer-based models exhibit performance degradation mirroring human cognitive limits when tracking dependencies beyond 7±2 elements6. Attention entropy analysis shows dispersion increases linearly with sequence length, suggesting fundamental capacity constraints rooted in the softmax normalization process. This manifests as 34% accuracy drop on 10-back tasks compared to 5-back scenarios across multiple architectures613.

Structural Expressiveness Deficits
Formal language analysis demonstrates transformers cannot recognize periodic finite-state languages like {a^n b^n c^n} without layer count scaling proportionally to input length13. The absence of stack-like mechanisms limits hierarchical parsing, resulting in 22% lower accuracy on recursively nested sentence structures compared to augmented transition network models139.

Computational Complexity Tradeoffs
While linear attention variants58 reduce theoretical complexity from O(n²) to O(n), practical implementations face 18-25% accuracy drops on semantic reasoning tasks due to low-rank approximation errors. The Hugging Face ecosystem currently lacks plug-and-play linear attention modules, forcing developers to choose between efficiency and performance57.

Paradigm-Shifting Alternatives to Conventional Attention
Feed-Forward Attention Substitution
Breakthrough work from ETH Zurich714 demonstrates that shallow feed-forward networks can replicate attention behavior when trained via knowledge distillation. Their Attention Layer Replacement (ALR) method achieves 98% baseline BLEU scores on IWSLT2017 translation tasks using 150M parameter replacements, though requiring 40% more neurons than original attention heads. Crucially, these "attentionless transformers" maintain sequence-length flexibility when augmented with dynamic padding masks714.

Learnable Lateral Connection Architectures
An open-source GPT variant8 replaces self-attention with trainable lateral weight matrices between input embeddings. Preliminary results show 12% faster inference speeds but 15% lower perplexity on WikiText-103, suggesting potential when combined with modern initialization techniques. The architecture enables fully parallelized training while maintaining position-awareness through injected sinusoidal weights816.

Hybrid Neuro-Symbolic Routing
Emerging approaches combine attention with symbolic rule engines for structural parsing. A prototype system routes noun phrases through probabilistic context-free grammar checkers while processing verbs via standard attention, achieving 89% parse accuracy on Penn Treebank compared to 78% for pure-transformer baselines. This hybrid model reduces attention head usage by 40% through targeted symbolic delegation915.

Strategic Recommendations for Next-Generation Architectures
Differentiated Attention Pathways
Inspired by biological vision systems, a dual-path framework could separate high-frequency token interactions (handled by optimized QK attention) from low-frequency semantic integration (managed by feed-forward networks). Early simulations show this division reduces computational load by 35% while improving long-range dependency modeling111.

Dynamic Attention Rank Adaptation
Implementing singular value decomposition (SVD) during forward passes enables real-time attention rank adjustment. By maintaining high-rank attention for critical tokens (nouns, verbs) while compressing ancillary elements (articles, prepositions), preliminary tests achieve 50% FLOP reduction with <2% accuracy loss on summarization tasks57.

Neuromodulatory Attention Gating
Drawing from neuroscience, trainable dopamine-like modulation signals could dynamically reweight attention scores based on reinforcement signals. Initial experiments using reward-modulated backpropagation demonstrate 30% faster convergence on instruction-following tasks compared to standard fine-tuning1015.

Conclusion
The evolution of attention mechanisms reveals both remarkable adaptability and fundamental constraints. While recent innovations like differential attention and matrix-optimized fine-tuning push performance boundaries, enduring challenges in computational complexity and structural expressiveness necessitate architectural paradigm shifts. The most promising paths forward involve hybrid models combining optimized attention variants with alternative processing modes—whether feed-forward, symbolic, or neuromorphic components.

Future research should prioritize dynamic architectures that automatically select attention mechanisms based on input characteristics, potentially combining:

QK-optimized core attention for semantic relationship mapping

Compressed linear attention for high-frequency token interactions

External memory banks for long-term dependency tracking

Symbolic routers for structural pattern enforcement

By moving beyond monolithic attention designs, next-generation LLMs could achieve unprecedented efficiency and cognitive fidelity while overcoming current theoretical limitations. The integration of biological inspiration with computational pragmatism will likely define the next evolutionary leap in language model architectures.
               
----------------------------------------------------
Article 2:
AI in Finance and Regulation:

UK Scrutinizing AI in Finance: The Parliamentary Treasury Committee in the UK is examining the increasing use of AI in the financial sector. This inquiry will explore both the innovative opportunities and potential risks, including cybersecurity and impact on employment, associated with AI in finance. MPs to scrutinise use of artificial intelligence in the finance sector
EU AI Act: Europe's AI landscape is being shaped by regulations, particularly the EU AI Act. This act takes a risk-based approach to ensure AI applications meet strict standards for safety, fairness, and data governance. The ACHILLES project is also working to create a framework for AI products that are ethical, sustainable, and compliant with regulations like the EU AI Act and GDPR. The ACHILLES project: Human-centric machine learning – lighter, clearer, safer
AI Technology and Advancements:

New AI Models: Google has introduced Gemini 2.0, their latest AI model. Latest AI news and blog posts - Google AI Additionally, DeepSeek-R1 reasoning models are reported to be competitive with OpenAI's models in performance, and Qwen 2.5-Max is outperforming DeepSeek V3 in some benchmarks. AI News | Latest AI News, Analysis & Events
AI for Chip Design: Researchers are utilizing AI to significantly reduce the time and expense involved in designing new wireless chips and to discover new functionalities. Artificial Intelligence News - ScienceDaily
AI Applications and Business:

AI in Accounting Market Growth: The market for AI in accounting is predicted to reach $93.3 billion by 2032, with a significant CAGR of 44.77% from 2024-2032. This growth is driven by the increasing need for automation, accuracy, and fraud detection in financial operations. Artificial Intelligence in Accounting Market to Hit $93.3 Bn by 2032 Due to Rising Financial Automation | SNS
AI in Healthcare: AI and the Internet of Things (IoT) are converging in healthcare, creating both opportunities and challenges for the next wave of patient treatment options. Wearable sensors and AI analytics are enabling real-time health monitoring and proactive healthcare management. Artificial Intelligence as Medicine
AI and Security:

Apple's AI Privacy Measures: Apple is addressing privacy concerns related to AI in cloud systems with its Private Cloud Compute (PCC) system. They are also inviting security researchers to examine and verify the security of this system. Cybersecurity awareness: Apple’s cloud-based AI security system
AI Investment and Projects:

Stargate Project: A new $500 billion AI infrastructure project named "Stargate" has been announced, which is expected to significantly boost the AI sector. The New $500 Billion Stargate Project Is Fantastic News for These 3 Artificial Intelligence Cryptos
SoftBank Investment in OpenAI: SoftBank has committed to investing $3 billion annually in OpenAI. Artificial Intelligence (AI) News Updates: Latest News About Google AI, OpenAI, ChatGPT, Gemini, Lamda and More - The Economic Times
AI Bias:

AI Perception of Average Person: Research into how AI models perceive an "average person" reveals biases in AI, including assumptions about gender and geographic location. American AI models, for example, may not recognize people living in China. A researcher has determined how artificial intelligence represents an average person: a woman in her 30s.   
         
----------------------------------------------------
Article 3:
I know about several types of reinforcement learning, which can be broadly categorized based on different criteria. Here's a breakdown of the major types I know about:

1. Based on the Existence of a Model:

Model-Based RL: These algorithms use a model of the environment to plan and make decisions. A "model" here typically refers to a function that predicts the next state and reward given the current state and action.

Advantages: Can be more sample-efficient, especially in environments with complex dynamics. Can plan ahead and consider long-term consequences.
Disadvantages: Requires building and maintaining an accurate model, which can be challenging. Model errors can lead to suboptimal performance.
Examples:
Dyna-Q: Combines learning a model with Q-learning, using the model to generate simulated experience.
Monte Carlo Tree Search (MCTS): Used in game playing like Go and Chess, builds a search tree using a model to evaluate possible future states.
Model-Predictive Control (MPC): Often used in robotics and control systems, optimizes actions over a future time horizon using a model of the system.
World Models: Uses a variational autoencoder (VAE) to learn a latent representation of the environment, a mixture density network (MDN) to predict the next latent state, and a controller to optimize actions.
Model-Free RL: These algorithms do not use or learn an explicit model of the environment. They learn directly from trial-and-error interactions.

Advantages: Simpler to implement, as they don't require model building. Can be applied to problems where the environment dynamics are unknown or difficult to model.
Disadvantages: Can be less sample-efficient, requiring a large amount of data to learn. May struggle with long-term planning.
Examples:
Q-learning: A value-based method that learns an action-value function (Q-function) representing the expected cumulative reward for taking an action in a given state.
SARSA (State-Action-Reward-State-Action): Similar to Q-learning, but updates the Q-function based on the actual action taken in the next state (on-policy).
Deep Q-Networks (DQN): Uses a deep neural network to approximate the Q-function.
Policy Gradient methods (e.g., REINFORCE, A2C, A3C, PPO, DDPG, TD3, SAC): Directly learn a policy, which maps states to actions (or probability distributions over actions).
2. Based on Value Function vs. Policy Learning:

Value-Based Methods: These algorithms learn a value function (either a state-value function V(s) or an action-value function Q(s,a)) that estimates the expected return from a state or state-action pair.

Examples: Q-learning, SARSA, DQN
Policy-Based Methods: These algorithms directly learn a policy, which is a mapping from states to actions (or probability distributions over actions).

Examples: REINFORCE, Actor-Critic methods (A2C, A3C, PPO, DDPG, TD3, SAC)
Actor-Critic Methods: These algorithms combine value-based and policy-based methods. They learn both a policy (the "actor") and a value function (the "critic"). The critic provides feedback to the actor to improve the policy.

Examples: A2C, A3C, PPO, DDPG, TD3, SAC
3. Based on On-Policy vs. Off-Policy Learning:

On-Policy Methods: These algorithms learn about the same policy that is being used to generate behavior (i.e., the policy being evaluated and improved is the same as the policy generating the data).

Examples: SARSA, REINFORCE, A2C, PPO
Off-Policy Methods: These algorithms can learn about a target policy that is different from the behavior policy used to generate data. This allows for greater sample efficiency as the agent can learn from past experiences generated by different policies.

Examples: Q-learning, DQN, DDPG, TD3, SAC
4. Based on the type of function approximation:

Tabular Methods: These methods store the value function or policy in a table. They are suitable for problems with small, discrete state and action spaces.

Examples: Q-learning, SARSA with a Q-table
Function Approximation Methods: These methods use function approximators, such as linear models or neural networks, to represent the value function or policy. They are necessary for problems with large or continuous state and action spaces.

Examples: DQN, DDPG, PPO
5. Other Important Distinctions:

*   **Inverse Reinforcement Learning (IRL):** Tries to infer the reward function from expert demonstrations.
*   **Imitation Learning:** Aims to learn a policy by imitating expert behavior, often used in conjunction with RL.
*   **Hierarchical Reinforcement Learning (HRL):** Decomposes complex tasks into sub-tasks, learning a hierarchy of policies.
*   **Multi-Agent Reinforcement Learning (MARL):** Deals with scenarios where multiple agents interact and learn in a shared environment.
This is not an exhaustive list, but it covers the most common and important types of reinforcement learning. The field is constantly evolving, and new methods and variations are being developed all the time. Each type has its strengths and weaknesses, and the best choice depends on the specific problem being addressed.

----------------------------------------------------
""" }]

In [4]:
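# Render the chat template to text first; extra keyword arguments such as
# use_reflection are forwarded to the template itself.
prompt = tokenizer.apply_chat_template(test, tokenize=False, add_generation_prompt=True, use_reflection=False)
tokens = tokenizer(prompt, return_tensors='pt')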

len(tokens['input_ids'][0])

3129
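
The prompt is 3129 tokens long, while `FSDC_KV_CACHE_SIZE` is set to 1024. Assuming that value is a budget in tokens (the notebook does not say), the cache has to drop roughly two thirds of the prompt positions. The sketch below shows one possible score-based eviction policy under that assumption. `evict_to_budget`, `keep_recent`, and the synthetic scores are hypothetical and are not taken from `FixedSizeDynamicCache`.

```python
def evict_to_budget(positions, scores, budget, keep_recent=0):
    """Keep at most `budget` positions: the most recent ones plus the best-scored rest."""
    if len(positions) <= budget:
        return list(positions)

    keep_recent = min(keep_recent, budget)
    # Always keep the trailing window of most recent positions.
    recent = set(positions[len(positions) - keep_recent:])

    # Rank the remaining positions by score, highest first.
    candidates = [(s, p) for p, s in zip(positions, scores) if p not in recent]
    candidates.sort(reverse=True)

    kept = recent | {p for _, p in candidates[: budget - keep_recent]}
    # Preserve the original token order of the survivors.
    return [p for p in positions if p in kept]


positions = list(range(3129))
# Synthetic, deterministic scores standing in for accumulated attention weights.
scores = [(p * 37) % 101 for p in positions]

kept = evict_to_budget(positions, scores, budget=1024, keep_recent=64)
print(len(kept), kept[-3:])
```

Keeping the survivors in their original order matters because cached keys and values are position-dependent.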

In [5]:
streamer = TextStreamer(tokenizer, skip_prompt=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)
tokens.to(device)

model.eval()
with torch.no_grad(), torch.autocast(device_type=device, dtype=torch.bfloat16):
    output = model.generate(
        **tokens,
        max_new_tokens=8192,
        streamer=streamer,
        do_sample=None,
        temperature=None,
        repetition_penalty=None,
        use_cache=True,
        output_attentions=False,
        eos_token_id=model.config.eos_token_id,
    )


<Thought>
Alright, I have been given two articles and a third section on reinforcement learning types. Let's break them down and summarize each one.

**Article 1: Recent Advances and Future Directions in Attention Mechanisms for Large Language Models**

This article discusses the evolution of attention mechanisms in large language models (LLMs). It highlights recent innovations like differential attention, matrix optimization strategies, and architectural variants for scalability. The article also explores persistent limitations and theoretical constraints, such as computational complexity and structural expressiveness deficits. It proposes future directions, suggesting the need for hybrid models combining optimized attention variants with alternative processing modes like feed-forward, symbolic, or neuromorphic components. The conclusion emphasizes the importance of dynamic architectures that can adaptively select attention mechanisms based on input characteristics.

**Article 2: AI i