In [1]:
import os
os.environ["https_proxy"] = "http://192.168.1.12:7891"

In [3]:
import json
import re
from datasets import load_dataset
import sys
sys.path.append("../..")
from utils.sys_prompts import SYS_PROMPT_formatter_deepseek_concise_2 as SYSTEM_PROMPT
import random

INSTRUCTION_RESPONSE_FORMAT = """\
<instruction>
{instruction}
{input}
</instruction>
<response>
{response}
</response>
"""

def get_natural_thinking_dataset(url, split="train", output_file="dataset.json", required_num_data=3500):
    dataset = load_dataset(url, split=split)
    dataset = dataset.shuffle(seed=42).select(range(min(50000, len(dataset))))  # Keep at most 50,000 shuffled samples
    
    def check_structure_markdown(text):
        structure_markdown_regex = r"^#{1,6} .*$"
        res = re.findall(structure_markdown_regex, text["responses"][0]['response'], re.MULTILINE)
        return len(res) > 0
    
    def check_length(text):
        text_length = len(text["responses"][0]['response'])
        return (text_length > 50) and (text_length < 2000)
    
    dataset = dataset.filter(check_structure_markdown)
    dataset = dataset.filter(check_length)
    
    def formatting_prompts_func(examples):
        instruction = examples["question"]
        output = examples["responses"][0]["response"]
        
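        # Randomly rewrite the " Step N: " prefix in headings. The replacement is
        # spliced in after the existing "#" run, so it also perturbs heading levels
        # (e.g. "## Step 1: Title" becomes "#### Title", "## Title", or "### Title").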
        pattern = r" Step \d+: "
        num = random.random()
        if (num < 0.3):
            output = re.sub(pattern, "## ", output, flags=re.MULTILINE)
        elif (0.3 <= num < 0.6):
            output = re.sub(pattern, " ", output, flags=re.MULTILINE)
        else:
            output = re.sub(pattern, "# ", output, flags=re.MULTILINE)
        return {
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': INSTRUCTION_RESPONSE_FORMAT.format(instruction=instruction, input="", response=output)}
            ]
        }
    
    dataset = dataset.map(formatting_prompts_func, batched=False)
    dataset = dataset.shuffle(seed=42).select(range(min(required_num_data, len(dataset))))  # Keep at most required_num_data samples
    dataset = dataset.remove_columns([col for col in dataset.column_names if col != "prompt"])
    print("\033[92m Save number of data: \033[0m", len(dataset))
    
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(dataset.to_list(), f, indent=4, ensure_ascii=False)
    
    return dataset
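
The two filters in get_natural_thinking_dataset can be checked without downloading the dataset. The following is a minimal sketch, assuming records shaped like the responses[0]['response'] field used above; the toy records and the helper names has_markdown_heading and has_valid_length are hypothetical and not part of the original notebook.

```python
import re

# Same heading regex as check_structure_markdown: a line starting with 1-6 "#" and a space.
HEADING_RE = re.compile(r"^#{1,6} .*$", re.MULTILINE)

def has_markdown_heading(record):
    """True if the first response contains at least one markdown heading line."""
    return HEADING_RE.search(record["responses"][0]["response"]) is not None

def has_valid_length(record, min_len=50, max_len=2000):
    """True if the first response is strictly between min_len and max_len characters."""
    n = len(record["responses"][0]["response"])
    return min_len < n < max_len

# Hypothetical toy records, not taken from facebook/natural_reasoning.
records = [
    {"responses": [{"response": "## Step 1: Setup\n" + "x" * 60}]},  # heading, valid length
    {"responses": [{"response": "No heading here. " + "x" * 60}]},   # no heading
    {"responses": [{"response": "# Short"}]},                        # heading, too short
]

kept = [r for r in records if has_markdown_heading(r) and has_valid_length(r)]
print(len(kept))
```

Only the first toy record passes both checks, which mirrors what the two dataset.filter calls are meant to keep.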


In [4]:

dataset = get_natural_thinking_dataset("facebook/natural_reasoning", "train", "facebook_natural_reasoning-markhdown.json")

Save number of data: 3500


In [4]:
dataset[0]

{'prompt': [{'content': 'You are a meticulous organizational assistant specialized in structuring instruction-response pairs into a standardized markdown format. Please carefully process the input according to the following specifications:\n\n---\ntags:\n  - {general_tag} {general_tag}/{sub_tag}\n---\n# Instruction\n[The original instruction text]\n\n# Summary\n[A brief yet comprehensive summary of the response]\n\n## Details\n[The original response content]\n\nHere are Processing Guidelines:\n- The `tags` section consists of pairs of general tags and sub tags in the following format:  `{general_tag} {general_tag}/{sub_tag}`. For example: `environment environment/renewable_energy`.\n- Keep the heaidng levels in the original response and adjust heading levels as needed to maintain proper hierarchy and avoid jumping heading levels.\n\nThe instructions and responses are enclosed within `<instruction>` and `<response>` XML tags, respectively. Please process the following instruction-respon

In [4]:
for i in range(5):
    print('-'*100)
    print(dataset[i]['prompt'][1]['content'])

----------------------------------------------------------------------------------------------------
<instruction>
A box contains 8 blue balls and 2 red balls. Three balls are selected from the box at random without replacement. Find the probability that two balls are blue and one ball is red.

</instruction>
<response>
### Calculate the total number of ways to select 3 balls from 10 without replacement.
The total number of ways to select 3 balls from 10 can be calculated using the combination formula, which is given by: $C(n, k) = \frac{n!}{k!(n-k)!}$. Here, $n = 10$ (total balls) and $k = 3$ (balls to be selected). So, $C(10, 3) = \frac{10!}{3!(10-3)!} = \frac{10!}{3!7!} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120$.

### Calculate the number of ways to select 2 blue balls from 8 without replacement.
Using the same combination formula, we calculate the number of ways to select 2 blue balls from 8: $C(8, 2) = \frac{8!}{2!(8-2)!} = \frac{8!}{2!6!} = \frac{8 \times 7}{2 \tim

In [19]:

example = dataset[100]['prompt'][1]['content']


In [None]:
print(example)

<instruction>
Consider a Gauss rifle that accelerates a slug of mass $m_1$ to an exit speed of $v_1$ using a series of electromagnets. The rifle itself has a mass of $m_2$. Assuming the acceleration of the slug is uniform over the length of the barrel and neglecting any external forces, derive an expression for the recoil velocity of the rifle. How does the recoil of the Gauss rifle compare to that of a conventional firearm firing a projectile of the same mass and velocity? Be sure to discuss the implications of the force distribution over time on the recoil and the potential advantages of the Gauss rifle in terms of maintaining aim.

</instruction>
<response>
## Step 1: Understand the Problem and Identify Key Concepts
The problem involves a Gauss rifle, which uses electromagnets to accelerate a slug of mass $m_1$ to an exit speed of $v_1$. The rifle itself has a mass of $m_2$. We need to derive an expression for the recoil velocity of the rifle, considering uniform acceleration of the

In [None]:
pattern = r"^#{1,6} (Step \d+: ).*$"  # \d+ also matches step numbers of 10 and above
re.findall(pattern, example, re.MULTILINE)


['Step 1: ', 'Step 2: ', 'Step 3: ', 'Step 4: ', 'Step 5: ', 'Step 6: ']

In [None]:
pattern = r" Step \d+: "  # \d+ also matches step numbers of 10 and above
result = re.sub(pattern, "## ", example, flags=re.MULTILINE)
print(result)

<instruction>
Consider a Gauss rifle that accelerates a slug of mass $m_1$ to an exit speed of $v_1$ using a series of electromagnets. The rifle itself has a mass of $m_2$. Assuming the acceleration of the slug is uniform over the length of the barrel and neglecting any external forces, derive an expression for the recoil velocity of the rifle. How does the recoil of the Gauss rifle compare to that of a conventional firearm firing a projectile of the same mass and velocity? Be sure to discuss the implications of the force distribution over time on the recoil and the potential advantages of the Gauss rifle in terms of maintaining aim.

</instruction>
<response>
#### Understand the Problem and Identify Key Concepts
The problem involves a Gauss rifle, which uses electromagnets to accelerate a slug of mass $m_1$ to an exit speed of $v_1$. The rifle itself has a mass of $m_2$. We need to derive an expression for the recoil velocity of the rifle, considering uniform acceleration of the slug 

In [None]:
pattern = r" Step \d+: "  # \d+ also matches step numbers of 10 and above
result = re.sub(pattern, "### ", example, flags=re.MULTILINE)
print(result)

<instruction>
Consider a Gauss rifle that accelerates a slug of mass $m_1$ to an exit speed of $v_1$ using a series of electromagnets. The rifle itself has a mass of $m_2$. Assuming the acceleration of the slug is uniform over the length of the barrel and neglecting any external forces, derive an expression for the recoil velocity of the rifle. How does the recoil of the Gauss rifle compare to that of a conventional firearm firing a projectile of the same mass and velocity? Be sure to discuss the implications of the force distribution over time on the recoil and the potential advantages of the Gauss rifle in terms of maintaining aim.

</instruction>
<response>
##### Understand the Problem and Identify Key Concepts
The problem involves a Gauss rifle, which uses electromagnets to accelerate a slug of mass $m_1$ to an exit speed of $v_1$. The rifle itself has a mass of $m_2$. We need to derive an expression for the recoil velocity of the rifle, considering uniform acceleration of the slug
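
The outputs above show the heading level growing, because the pattern starts at the space after the existing "##" and the replacement is appended to that run of hashes. A minimal sketch of the effect on a single heading, assuming the three replacement strings used in formatting_prompts_func; the sample heading and the name shifted are hypothetical.

```python
import re

# Pattern from the notebook: it does not consume the leading "#" characters.
STEP_RE = re.compile(r" Step \d+: ")

# Hypothetical heading with a two-digit step number.
heading = "## Step 12: Apply conservation of momentum"

# Map each replacement string to the heading it produces.
shifted = {rep: STEP_RE.sub(rep, heading) for rep in ("## ", " ", "# ")}

for rep, text in shifted.items():
    print(repr(rep), "->", repr(text))
```

With a level-2 source heading, the three replacements yield level-4, level-2, and level-3 headings respectively. A source heading that is already at level 5 or 6 could end up beyond the six levels markdown supports.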

In [None]:
# generate a random number between 0 and 1
import random
random.random()

0.15466110926326615
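
Because random.random() is unseeded, the replacement chosen for each example changes between runs. One possible way to make the choice reproducible is to derive a per-example generator from the question text. This is only a sketch under that assumption; pick_replacement and its seed parameter are hypothetical names, and the thresholds copy those in formatting_prompts_func.

```python
import random

def pick_replacement(question, seed=42):
    """Choose a heading replacement deterministically from the question text."""
    # A string seed is hashed by random.Random in a way that is stable across runs.
    rng = random.Random(f"{seed}:{question}")
    num = rng.random()
    if num < 0.3:
        return "## "
    if num < 0.6:
        return " "
    return "# "

# Hypothetical question text: the same input always gives the same choice.
first = pick_replacement("A box contains 8 blue balls and 2 red balls.")
second = pick_replacement("A box contains 8 blue balls and 2 red balls.")
print(first == second)
```

Calling random.seed once before dataset.map would be a simpler alternative, though the result would then depend on the order in which examples are processed.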