# Посылаем запросы в LLM

In [7]:
import requests
from typing import List, Dict

def ask_local_llm(messages: List[Dict[str, str]]):
    url = "http://localhost:8000/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "messages": messages
    }
    response = requests.post(url, json=data, headers=headers)
    return response.json()["choices"][0]["message"]["content"]


In [8]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you"}
]

print(ask_local_llm(messages))

Hello! I'm an AI assistant, so I don't have feelings or a physical presence, but thank you for asking! I'm here and ready to help you with any information or tasks you need assistance with. How can I assist you today?


In [9]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who is the current F1 winner"}
]

print(ask_local_llm(messages))

As of the most recent Formula 1 race, the current champion and winner is Lewis Hamilton. He won the 2023 Turkish Grand Prix, which took place on October 15, 2023. However, it's important to note that the championship standings can change with each race, so the current champion might be different by the time you read this update.


![title](piastri.png)

In [12]:
messages = [
    {"role": "system", "content": "You are a helpful assistant. Today is 23 april 2025, F1 2025 winner is Oscar Piastri."},
    {"role": "user", "content": "Who is the current F1 winner in 2025"}
]

print(ask_local_llm(messages))

As of today, April 23, 2025, the current F1 (Formula One) champion would be Oscar Piastri if he has won the championship by that date. However, it's important to note that as of my last update in October 2023, Oscar Piastri had not yet won the Formula One World Championship. The most recent F1 champion was Max Verstappen, who won the title in 2023.


![title](top.png)

# Парсинг статьи

![title](deepseek.png)

In [13]:
import fitz
def pdf_to_text(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text

# Example usage
text = pdf_to_text("/root/2501.12948v1.pdf")
print(text[:300])

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without su


In [14]:
print(text)

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the
research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models
(1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based o

In [15]:
chunks = []
length = 800
for start in range(0, len(text), 500):
    chunks.append(text[start: start + length])
print(len(chunks))

114


In [17]:
print(chunks[7])
print("\n-------\n")
print(chunks[8])

 General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline.
It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt
to user preferences, all while requiring relatively minimal computational resources against
pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models
were the first to introduce inference-time scaling by increasing the length of the Chain-of-
Thought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of effective test-time scaling remains an open question for the research community. Several prior
works have explore

-------

hought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of ef

# Search

In [19]:
questions = ["What is group relative policy optimization", "What rewards are used in DeepSeek-R1-Zero"]

In [20]:
def tokenize(text: str) -> List[str]:
    return text.lower().split()

print(tokenize(questions[0]))

['what', 'is', 'group', 'relative', 'policy', 'optimization']


In [21]:
from collections import Counter
def score_chunk(question_tokens: List[str], chunk_tokens: List[str]):
    question_count = Counter(question_tokens)
    chunk_count = Counter(chunk_tokens)
    overlap = sum(min(question_count[word], chunk_count[word]) for word in question_count)
    return overlap

print(score_chunk(tokenize("How far is the sun"), tokenize("The sun is very far")))
print(score_chunk(tokenize("How far is the sun"), tokenize("The moon is close")))

4
2


In [22]:
def find_best_chunk(question: str, chunks: List[str]):
    question_tokens = tokenize(question)
    best_score = 0
    best_chunk = None
    for chunk in chunks:
        chunk_tokens = tokenize(chunk)
        score = score_chunk(question_tokens, chunk_tokens)
        if score > best_score:
            best_score = score
            best_chunk = chunk
    return best_chunk

In [25]:
print(questions[0])
print(find_best_chunk(questions[0], chunks))

What is group relative policy optimization
iting results, and
hope this provides the community with valuable insights.
2.2.1. Reinforcement Learning Algorithm
Group Relative Policy Optimization
In order to save the training costs of RL, we adopt Group
Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is
typically the same size as the policy model, and estimates the baseline from group scores instead.
Specifically, for each question 𝑞, GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the old
policy 𝜋𝜃𝑜𝑙𝑑and then optimizes the policy model 𝜋𝜃by maximizing the following objective:
J𝐺𝑅𝑃𝑂(𝜃) = E[𝑞∼𝑃(𝑄), {𝑜𝑖}𝐺
𝑖=1 ∼𝜋𝜃𝑜𝑙𝑑(𝑂|𝑞)]
1
𝐺
𝐺
∑︁
𝑖=1

min
 𝜋𝜃(𝑜𝑖|𝑞)
𝜋𝜃𝑜𝑙𝑑(𝑜𝑖|𝑞) 𝐴𝑖, clip
 𝜋𝜃(𝑜𝑖|𝑞)
𝜋𝜃𝑜𝑙𝑑(𝑜𝑖|𝑞) , 1 −𝜀, 1 + 𝜀

𝐴𝑖

−𝛽D𝐾𝐿
 𝜋𝜃||𝜋𝑟𝑒𝑓

,
(1)
D𝐾𝐿
 𝜋𝜃||𝜋𝑟𝑒𝑓
 =
𝜋𝑟𝑒𝑓(𝑜𝑖|𝑞)
𝜋𝜃(𝑜𝑖|𝑞) −lo


In [26]:
print(questions[1])
print(find_best_chunk(questions[1], chunks))

What rewards are used in DeepSeek-R1-Zero
rs.
In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as
the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data
9
include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable
for reading. Responses may mix multiple languages or lack markdown formatting to
highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1,
we design a readable pattern that includes a summary at the end of each response and
filters out responses that are not reader-friendly. Here, we define the output format as
|special_token|<reasoning_process>|special_token|<summary>, where the reasoning
process is the CoT for the query, and the summary is used to summarize the reasoning



# Embeddings

In [27]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5').cuda()
model = model.eval()





In [28]:
@torch.no_grad()
def get_text_embeddings(model, tokenizer, texts: List[str]):
    device = next(model.parameters()).device
    encoded_input = tokenizer(texts, padding="longest", truncation=True, return_tensors='pt')
    for k, v in encoded_input.items():
        encoded_input[k] = v.to(device)
    model_output = model(**encoded_input)
    sentence_embeddings = model_output[0][:, 0]
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings

$$
\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| |\mathbf{b}| \cos\theta = a_1b_1 + a_2b_2 + \dots + a_n b_n = \sum_{i=1}^n a_i b_i
$$

In [29]:
question_embeddings = get_text_embeddings(model, tokenizer, questions)
print(question_embeddings)
print(question_embeddings.shape)

tensor([[ 0.0252,  0.0195, -0.0036,  ...,  0.0125,  0.0009,  0.0136],
        [ 0.0424,  0.0380, -0.0261,  ...,  0.0123,  0.0155,  0.0342]],
       device='cuda:0')
torch.Size([2, 1024])


![title](rag.png)

In [26]:
question_embeddings @ question_embeddings.T

tensor([[1.0000, 0.5336],
        [0.5336, 1.0000]], device='cuda:0')

In [27]:
batch_size = 4
batch = []
chunk_embeddings = []
from tqdm import tqdm
for chunk in tqdm(chunks):
    batch.append(chunk)
    if len(batch) % batch_size == 0:
        chunk_embeddings.append(get_text_embeddings(model, tokenizer, batch))
        batch = []

if len(batch):
    chunk_embeddings.append(get_text_embeddings(model, tokenizer, batch))

chunk_embeddings = torch.cat(chunk_embeddings, dim=0)
print(chunk_embeddings.shape)

100%|██████████| 114/114 [00:00<00:00, 192.84it/s]

torch.Size([114, 1024])





In [28]:
print(len(chunks))

114


# Векторный поиск

In [29]:
for question in questions:
    print(question)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question},
    ]
    print(ask_phi(messages))
    print("\n\n")

    

What is group relative policy optimization
 Group relative policy optimization refers to the process of developing policies or decision-making strategies for groups of agents or entities, taking into account the relative needs, preferences, or behaviors within the group. The objective is to find the best approach that maximizes overall group utility, considers the diversity of preferences among group members, and takes into account any interdependencies or relationships between group members.

This type of policy optimization is particularly relevant in scenarios where agents are interdependent and interact with each other, such as in multi-agent systems, resource allocation, transportation planning, and cooperative tasks. By optimizing policies at the group level, decision-makers can improve the overall performance of the system, enhance fairness and equity for individual group members, and reduce potential conflicts or disagreements among group members.

Key components of group relat

In [30]:
q = [questions[1]]
q_emb = get_text_embeddings(model, tokenizer, q)
score = q_emb @ chunk_embeddings.T
print(score)
print()
print(score.topk(10))


tensor([[0.7398, 0.7156, 0.6572, 0.7738, 0.7478, 0.6827, 0.5816, 0.5162, 0.6219,
         0.7068, 0.7227, 0.7463, 0.7479, 0.7089, 0.7175, 0.7692, 0.6546, 0.7096,
         0.7270, 0.6583, 0.6993, 0.6795, 0.7223, 0.7504, 0.5697, 0.5908, 0.5930,
         0.7554, 0.8046, 0.7135, 0.6966, 0.6823, 0.7075, 0.7034, 0.6882, 0.7386,
         0.7285, 0.7340, 0.6470, 0.6853, 0.6542, 0.6000, 0.5664, 0.7149, 0.7342,
         0.6951, 0.7371, 0.7466, 0.6890, 0.6906, 0.5979, 0.5920, 0.6481, 0.6766,
         0.6860, 0.7588, 0.7855, 0.6566, 0.6535, 0.6023, 0.5545, 0.6184, 0.6795,
         0.6005, 0.6673, 0.6563, 0.6483, 0.6854, 0.5122, 0.6249, 0.6491, 0.6149,
         0.6643, 0.6554, 0.6643, 0.6850, 0.6880, 0.7513, 0.6889, 0.6842, 0.7004,
         0.7198, 0.5949, 0.6409, 0.5872, 0.5980, 0.5712, 0.7292, 0.7643, 0.7097,
         0.6914, 0.6865, 0.6387, 0.4562, 0.5737, 0.6438, 0.6169, 0.5382, 0.5454,
         0.5876, 0.5444, 0.5494, 0.5430, 0.6046, 0.5877, 0.5307, 0.5501, 0.5844,
         0.6676, 0.4982, 0.4

In [32]:
print(questions[1])
print(chunks[28])

What rewards are used in DeepSeek-R1-Zero
te for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning
question during training.
2.2.2. Reward Modeling
The reward is the source of the training signal, which decides the optimization direction of RL.
To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two
types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct.
For example, in the case of math problems with deterministic results, the model is required
to provide the final answer in a specified format (e.g., within a box), enabling reliable
rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy r


In [34]:
messages

[{'role': 'system', 'content': 'You are a helpful assistant.'},
 {'role': 'user',
  'content': 'te for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning\nquestion during training.\n2.2.2. Reward Modeling\nThe reward is the source of the training signal, which decides the optimization direction of RL.\nTo train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two\ntypes of rewards:\n• Accuracy rewards: The accuracy reward model evaluates whether the response is correct.\nFor example, in the case of math problems with deterministic results, the model is required\nto provide the final answer in a specified format (e.g., within a box), enabling reliable\nrule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be\nused to generate feedback based on predefined test cases.\n• Format rewards: In addition to the accuracy r'},
 {'role': 'user', 'content': 'What rewards are used in DeepSeek-R1-Zero'}]

In [35]:
for question in questions:
    question_embedding = get_text_embeddings(model, tokenizer, [question])
    scores = question_embeddings @ chunk_embeddings.T
    
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": chunks[score.topk(1).indices[0][0]]},
        {"role": "user", "content": question},
    ]
    print(ask_phi(messages))
    print("\n\n")
    break

    

 Group Relative Policy Optimization (GRPO) is an advanced machine learning technique that focuses on optimizing a policy in a cooperative manner within a group. It's often associated with the field of multi-agent reinforcement learning (MARL), where several agents work together to achieve their common goal. In a typical reinforcement learning setting, agents typically learn independently and compete with each other. However, in GRPO, we aim for healthy inter-agent collaboration, which could improve the overall system performance.

Firstly, let's break it down:

Policy: In Machine Learning and especially Reinforcement Learning, a policy refers to the agent's way or approach of behaving at a given time. It defines an action that an agent should take, given a certain state of the world.

Optimization: In the world of AI, optimization refers to improving the performance of a system or model. It's about making a system work more efficiently, accurately, or fairly.

Relative Policy Optimizat

In [9]:

url = "http://localhost:8000/v1/chat/completions"
headers = {
    "Content-Type": "application/json"
}
data = {
    "model": "microsoft/Phi-3-mini-128k-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "system", "content": chunks[41]},
        {"role": "user", "content": "what is LLaDA?"}
    ]
}

response = requests.post(url, headers=headers, json=data)

print(response.json()["choices"][0]["message"]["content"])

 LLaDA (Large Language Diffusion Model) is an innovative framework presented for large language modeling. Its unique approach stems from utilizing diffusion models, which are part of a class of generative models that iteratively transform a distribution of noise into a coherent output or data sample. LLaDA stands out because it is the first to apply these diffusion models to natural language processing effectively. 


The main contributions and features of LLaDA include:


1. **Scalability:** LLaDA can be scaled up to handle larger datasets and generate more complex language patterns due to its diffusion-based architecture.


2. **In-context Learning:** This denotes the model's ability to understand and generate responses based on the context provided by the preceding text. LLaDA demonstrates this capability quite effectively.


3. **Instruction Following:** LLaDA can perform tasks that require understanding complex instructions and generating appropriate responses, challenging the per

In [18]:
sentence_embeddings.shape

torch.Size([1, 1024])

In [13]:
len(chunks)

110

In [14]:
embeds.shape

torch.Size([1512, 1024])

In [None]:
par