# Посылаем запросы в LLM

In [1]:
! nvidia-smi

Wed Apr 23 17:55:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0 Off |                  Off |
|  0%   47C    P2             68W /  450W |   12688MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
! cat qwen.sh

vllm serve \
    Qwen/Qwen2.5-3B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.5


In [4]:
import requests
from typing import List, Dict

def ask_local_llm(messages: List[Dict[str, str]]):
    url = "http://localhost:8000/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "messages": messages
    }
    response = requests.post(url, json=data, headers=headers)
    return response.json()["choices"][0]["message"]["content"]


In [5]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you"}
]

print(ask_local_llm(messages))

Hello! I'm an artificial intelligence and don't have feelings or a physical presence, but thank you for asking! I'm here and ready to help you with any information or tasks you need assistance with. How can I assist you today?


In [6]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the distance between the sun and earth"}
]

print(ask_local_llm(messages))

The average distance between the Sun and Earth is approximately 93 million miles (150 million kilometers). This distance is commonly referred to as one astronomical unit (AU).

For more precise measurements, the average distance is about:

- 92,955,807 miles (149,597,870 kilometers)

However, it's worth noting that this distance can vary slightly due to the elliptical shape of Earth's orbit around the Sun. The closest point in the orbit (perihelion) occurs in early January and is about 91.4 million miles (147.1 million kilometers) from the Sun, while the farthest point (aphelion) occurs in early July and is about 94.5 million miles (152.1 million kilometers) from the Sun.


In [7]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "will AI take over the world?"}
]

print(ask_local_llm(messages))

It's a common concern, but there's currently no evidence that suggests AI will take over the world. Here are some key points to consider:

1. **Current Capabilities**: Current AI systems are highly specialized and do not possess general intelligence like humans. They are designed for specific tasks and lack the ability to think, make decisions, or act independently in complex ways.

2. **Human Oversight**: AI systems are typically developed and controlled by human beings. There are strict guidelines and regulations around the use of AI, especially in areas like autonomous weapons or critical infrastructure.

3. **Ethical Considerations**: Many experts emphasize the importance of ethical development and deployment of AI. Ethical frameworks are being established to ensure that AI benefits humanity and does not cause harm.

4. **Technical Barriers**: The technology required for AI to achieve superintelligence (the point at which an AI system surpasses human capabilities across the board) 

In [8]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who is the current F1 winner"}
]

print(ask_local_llm(messages))

As of my last update in October 2023, the current Formula 1 World Champion is Lewis Hamilton, who won his seventh Drivers' Championship title. However, the race and championship results can change quickly, so for the most up-to-date information, you should check the official Formula 1 website or a reliable sports news source.


![title](piastri.png)

In [9]:
messages = [
    {"role": "system", "content": "You are a helpful assistant. Today is 23 april 2025, F1 2025 winner is Oscar Piastri."},
    {"role": "user", "content": "Who is the current F1 winner in 2025"}
]

print(ask_local_llm(messages))

As of today, April 23, 2025, the F1 (Formula One) champion would be Oscar Piastri, based on the information you provided. However, please note that the actual champion as of this date has not been determined yet. The F1 World Championship typically ends at the season finale, which is usually held in November or December of the same year.

So while Oscar Piastri is mentioned as the winner for April 23, 2025, it's important to verify the most recent updates from official F1 sources to confirm the current champion status.


### 

![title](top.png)

# Парсинг статьи

![title](deepseek.png)

In [10]:
questions = ["What is group relative policy optimization", "What rewards are used in DeepSeek-R1-Zero"]

for question in questions:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question}
    ]
    print(question)
    print(ask_local_llm(messages))
    print("-----" * 10)

What is group relative policy optimization
Group Relative Policy Optimization (GRPPO) is not a widely recognized term in the field of reinforcement learning or machine learning. It's possible that there might be a mix-up with other related concepts, or it could be a custom or specialized term used in a specific context.

However, based on common reinforcement learning techniques and terminology, we can infer some possibilities:

1. **Relative Policy Optimization**: This concept typically refers to methods where agents optimize their policies relative to others. For example, in multi-agent settings, an agent might learn its policy by comparing itself to other agents.

2. **Group Policy Optimization**: This could refer to optimizing a policy for a group of agents, rather than an individual agent. It might involve coordinating actions across multiple agents to achieve a collective goal.

3. **Policy Gradient Methods**: These are a class of reinforcement learning algorithms that optimize p

In [12]:
import fitz
def pdf_to_text(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text

# Example usage
text = pdf_to_text("/root/2501.12948v1.pdf")
print(len(text))
print(text[:300])

56851
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without su


In [13]:
chunks = []
length = 800
for start in range(0, len(text), 500):
    chunks.append(text[start: start + length])
print(len(chunks))

114


**перекрытие по предложениям**

Chunk 1:
* Marie Curie was a pioneering physicist and chemist who conducted groundbreaking research on radioactivity. She was the first woman to win a Nobel Prize.

Chunk 2:
* She was the first woman to win a Nobel Prize. She is the only person to win Nobel Prizes in two different scientific fields.

Chunk 3:
* She is the only person to win Nobel Prizes in two different scientific fields. Her discoveries included the elements polonium and radium, which she isolated from pitchblende.

In [15]:
print(chunks[7])
print("\n-------\n")
print(chunks[8])

 General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline.
It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt
to user preferences, all while requiring relatively minimal computational resources against
pre-training. In the context of reasoning capabilities, OpenAI’s o1 (OpenAI, 2024b) series models
were the first to introduce inference-time scaling by increasing the length of the Chain-of-
Thought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of effective test-time scaling remains an open question for the research community. Several prior
works have explore

-------

hought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of ef

# Search

In [16]:
def tokenize(text: str) -> List[str]:
    return text.lower().split()

print(tokenize(questions[0]))

['what', 'is', 'group', 'relative', 'policy', 'optimization']


In [18]:
from collections import Counter
def score_chunk(question_tokens: List[str], chunk_tokens: List[str]):
    question_count = Counter(question_tokens)
    chunk_count = Counter(chunk_tokens)
    overlap = sum(min(question_count[word], chunk_count[word]) for word in question_count)
    return overlap

print(score_chunk(tokenize("How far is the sun"), tokenize("The sun is very far")))
print(score_chunk(tokenize("How far is the sun"), tokenize("The moon is close")))

4
2


In [19]:
def find_best_chunk(question: str, chunks: List[str]):
    question_tokens = tokenize(question)
    best_score = 0
    best_chunk = None
    for chunk in chunks:
        chunk_tokens = tokenize(chunk)
        score = score_chunk(question_tokens, chunk_tokens)
        if score > best_score:
            best_score = score
            best_chunk = chunk
    return best_chunk

In [20]:
print(questions[0])
print(find_best_chunk(questions[0], chunks).replace("\n", ""))

What is group relative policy optimization
iting results, andhope this provides the community with valuable insights.2.2.1. Reinforcement Learning AlgorithmGroup Relative Policy OptimizationIn order to save the training costs of RL, we adopt GroupRelative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that istypically the same size as the policy model, and estimates the baseline from group scores instead.Specifically, for each question 𝑞, GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the oldpolicy 𝜋𝜃𝑜𝑙𝑑and then optimizes the policy model 𝜋𝜃by maximizing the following objective:J𝐺𝑅𝑃𝑂(𝜃) = E[𝑞∼𝑃(𝑄), {𝑜𝑖}𝐺𝑖=1 ∼𝜋𝜃𝑜𝑙𝑑(𝑂|𝑞)]1𝐺𝐺∑︁𝑖=1min 𝜋𝜃(𝑜𝑖|𝑞)𝜋𝜃𝑜𝑙𝑑(𝑜𝑖|𝑞) 𝐴𝑖, clip 𝜋𝜃(𝑜𝑖|𝑞)𝜋𝜃𝑜𝑙𝑑(𝑜𝑖|𝑞) , 1 −𝜀, 1 + 𝜀𝐴𝑖−𝛽D𝐾𝐿 𝜋𝜃||𝜋𝑟𝑒𝑓,(1)D𝐾𝐿 𝜋𝜃||𝜋𝑟𝑒𝑓 =𝜋𝑟𝑒𝑓(𝑜𝑖|𝑞)𝜋𝜃(𝑜𝑖|𝑞) −lo


In [21]:
print(questions[1])
print(find_best_chunk(questions[1], chunks))

What rewards are used in DeepSeek-R1-Zero
rs.
In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as
the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data
9
include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable
for reading. Responses may mix multiple languages or lack markdown formatting to
highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1,
we design a readable pattern that includes a summary at the end of each response and
filters out responses that are not reader-friendly. Here, we define the output format as
|special_token|<reasoning_process>|special_token|<summary>, where the reasoning
process is the CoT for the query, and the summary is used to summarize the reasoning



# Embeddings

![title](rag.png)

In [22]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5').cuda()
model = model.eval()





In [23]:
@torch.no_grad()
def get_text_embeddings(model, tokenizer, texts: List[str]):
    device = next(model.parameters()).device
    encoded_input = tokenizer(texts, padding="longest", truncation=True, return_tensors='pt')
    for k, v in encoded_input.items():
        encoded_input[k] = v.to(device)
    model_output = model(**encoded_input)
    sentence_embeddings = model_output[0][:, 0]
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings

$$
\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| |\mathbf{b}| \cos\theta = a_1b_1 + a_2b_2 + \dots + a_n b_n = \sum_{i=1}^n a_i b_i
$$

In [24]:
question_embeddings = get_text_embeddings(model, tokenizer, questions)
print(question_embeddings)
print(question_embeddings.shape)

tensor([[ 0.0252,  0.0195, -0.0036,  ...,  0.0125,  0.0009,  0.0136],
        [ 0.0424,  0.0380, -0.0261,  ...,  0.0123,  0.0155,  0.0342]],
       device='cuda:0')
torch.Size([2, 1024])


In [27]:
question_embeddings @ question_embeddings.T

tensor([[1.0000, 0.5336],
        [0.5336, 1.0000]], device='cuda:0')

In [28]:
batch_size = 4
batch = []
chunk_embeddings = []
from tqdm import tqdm
for chunk in tqdm(chunks):
    batch.append(chunk)
    if len(batch) % batch_size == 0:
        chunk_embeddings.append(get_text_embeddings(model, tokenizer, batch))
        batch = []

if len(batch):
    chunk_embeddings.append(get_text_embeddings(model, tokenizer, batch))

chunk_embeddings = torch.cat(chunk_embeddings, dim=0)
print(chunk_embeddings.shape)
print(len(chunks))

100%|██████████| 114/114 [00:00<00:00, 188.59it/s]

torch.Size([114, 1024])
114





# Векторный поиск

In [29]:
print(questions[1])

What rewards are used in DeepSeek-R1-Zero


In [30]:
q = [questions[1]]
q_emb = get_text_embeddings(model, tokenizer, q)
score = q_emb @ chunk_embeddings.T
print(score)
print()
print(score.topk(10))


tensor([[0.7398, 0.7156, 0.6572, 0.7738, 0.7478, 0.6827, 0.5816, 0.5162, 0.6219,
         0.7068, 0.7227, 0.7463, 0.7479, 0.7089, 0.7175, 0.7692, 0.6546, 0.7096,
         0.7270, 0.6583, 0.6993, 0.6795, 0.7223, 0.7504, 0.5697, 0.5908, 0.5930,
         0.7554, 0.8046, 0.7135, 0.6966, 0.6823, 0.7075, 0.7034, 0.6882, 0.7386,
         0.7285, 0.7340, 0.6470, 0.6853, 0.6542, 0.6000, 0.5664, 0.7149, 0.7342,
         0.6951, 0.7371, 0.7466, 0.6890, 0.6906, 0.5979, 0.5920, 0.6481, 0.6766,
         0.6860, 0.7588, 0.7855, 0.6566, 0.6535, 0.6023, 0.5545, 0.6184, 0.6795,
         0.6005, 0.6673, 0.6563, 0.6483, 0.6854, 0.5122, 0.6249, 0.6491, 0.6149,
         0.6643, 0.6554, 0.6643, 0.6850, 0.6880, 0.7513, 0.6889, 0.6842, 0.7004,
         0.7198, 0.5949, 0.6409, 0.5872, 0.5980, 0.5712, 0.7292, 0.7643, 0.7097,
         0.6914, 0.6865, 0.6387, 0.4562, 0.5737, 0.6438, 0.6169, 0.5382, 0.5454,
         0.5876, 0.5444, 0.5494, 0.5430, 0.6046, 0.5877, 0.5307, 0.5501, 0.5844,
         0.6676, 0.4982, 0.4

In [34]:
print(questions[1])
print(chunks[28])

What rewards are used in DeepSeek-R1-Zero
te for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning
question during training.
2.2.2. Reward Modeling
The reward is the source of the training signal, which decides the optimization direction of RL.
To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two
types of rewards:
• Accuracy rewards: The accuracy reward model evaluates whether the response is correct.
For example, in the case of math problems with deterministic results, the model is required
to provide the final answer in a specified format (e.g., within a box), enabling reliable
rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases.
• Format rewards: In addition to the accuracy r


In [35]:
question = questions[1]
question_embedding = get_text_embeddings(model, tokenizer, [question])
scores = question_embeddings @ chunk_embeddings.T

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": chunks[score.topk(1).indices[0][0]]},
    {"role": "user", "content": question},
]
print(ask_local_llm(messages))
print("\n\n")
print("--------" * 5)



In DeepSeek-R1-Zero, the reward system primarily consists of two types of rewards:

1. **Accuracy Rewards**: These rewards evaluate whether the generated response is correct. For deterministic problems like math problems, the model needs to provide the final answer in a specified format (e.g., within a box) to enable reliable verification of correctness. For problems like those found on LeetCode, a compiler can generate feedback based on predefined test cases.

2. **Format Rewards**: These rewards ensure that the response adheres to the required format or structure. This could include checking if the response is in the correct format, such as a specific text box for answers, or ensuring that the output is structured correctly according to the problem's requirements.

These rewards guide the model towards generating responses that are both accurate and formatted correctly, thereby optimizing its performance on the given tasks.



----------------------------------------


# Reranker

In [36]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer_reranker = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
reranker = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base').cuda().eval()

In [37]:
@torch.no_grad()
def get_pair_scores(reranker, tokenizer, pairs):
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    for k, v in inputs.items():
        inputs[k] = v.cuda()
    scores = reranker(**inputs, return_dict=True).logits.view(-1, ).float()
    return scores


pairs = [['what is panda?', 'hi'], 
         ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
get_pair_scores(reranker, tokenizer_reranker, pairs)

tensor([-8.1544,  6.1821], device='cuda:0')

In [38]:
q = [questions[1]]
q_emb = get_text_embeddings(model, tokenizer, q)
score = q_emb @ chunk_embeddings.T
top_k_indices = score.topk(10).indices[0].tolist()
print(top_k_indices)

[28, 56, 3, 15, 88, 55, 27, 77, 23, 12]


In [39]:
pairs = [
    [question, chunk[idx]]
    for idx in top_k_indices
]
scores = get_pair_scores(reranker, tokenizer_reranker, pairs)
print(scores)

tensor([-10.1965, -10.1962,  -8.6665, -10.1968, -10.1949, -10.1968, -10.1962,
        -10.1968,  -8.6665,  -8.6665], device='cuda:0')


In [40]:
argsort = scores.argsort(descending=True)
print(argsort)
sorted_indices = [top_k_indices[idx] for idx in argsort]
print(sorted_indices)

tensor([8, 2, 9, 4, 1, 6, 0, 7, 3, 5], device='cuda:0')
[23, 3, 12, 88, 56, 27, 28, 77, 15, 55]


![title](top.png)