## Install Packages

In [1]:
!pip install datasets transformers evaluate scikit-learn pandas google-generativeai
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
!pip install lm-eval
!pip install bitsandbytes accelerate

Looking in indexes: https://download.pytorch.org/whl/cu126


## Import Packages

In [1]:
from lm_eval import evaluator, tasks, models
from huggingface_hub import login



## Standardised Benchmark

### DeepSeek: DeepSeek-LLM-7B

#### High School Mathematics

In [2]:
model_id = "deepseek-ai/deepseek-llm-7b-base"
device = "cuda:0"
batch_size = 4
few_shots = 5

mmlu_tasks = ["mmlu_high_school_mathematics"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

# Print the metrics for each task
print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:20<00:00, 10.17s/it]
Overwriting default num_fewshot of mmlu_high_school_mathematics from None to 5
100%|██████████| 270/270 [00:04<00:00, 58.64it/s]
Running loglikelihood requests: 100%|██████████| 1080/1080 [55:12<00:00,  3.07s/it]


MMLU Output
{'alias': 'high_school_mathematics', 'acc,none': 0.2814814814814815, 'acc_stderr,none': 0.02742001935094531}
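
The reported `acc_stderr,none` appears consistent with the sample standard error of a mean of 0/1 outcomes, sqrt(p(1 - p) / (n - 1)), where n is the 270 test questions shown in the progress bar. The sketch below reproduces that value and derives an approximate 95% interval. The helper names are illustrative and are not part of `lm_eval`, and the normal approximation (z = 1.96) is an assumption.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Sample standard error of a mean of n 0/1 outcomes (n - 1 in the denominator)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def confidence_interval(acc: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval around the accuracy."""
    return acc - z * stderr, acc + z * stderr


# Values taken from the cell output above
acc = 0.2814814814814815
n_questions = 270

stderr = binomial_stderr(acc, n_questions)
low, high = confidence_interval(acc, stderr)
print(f"stderr={stderr:.5f}, 95% CI=({low:.3f}, {high:.3f})")
```

MMLU questions have four answer options, so random guessing scores about 0.25. That value falls inside this interval, so the result on this subject may not be distinguishable from chance.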


#### High School Statistics

In [2]:
model_id = "deepseek-ai/deepseek-llm-7b-base"
device = "cuda:0"
batch_size = 8  # the value actually passed to simple_evaluate in this run
few_shots = 5

mmlu_tasks = ["mmlu_high_school_statistics"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

# Print the metrics for each task
print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:26<00:00, 13.03s/it]
Overwriting default num_fewshot of mmlu_high_school_statistics from None to 5
100%|██████████| 216/216 [00:01<00:00, 195.09it/s]
Running loglikelihood requests: 100%|██████████| 864/864 [45:53<00:00,  3.19s/it]  


MMLU Output
{'alias': 'high_school_statistics', 'acc,none': 0.4212962962962963, 'acc_stderr,none': 0.03367462138896084}


#### High School Chemistry

In [3]:
model_id = "deepseek-ai/deepseek-llm-7b-base"
device = "cuda:0"
batch_size = 8  # the value actually passed to simple_evaluate in this run
few_shots = 5

mmlu_tasks = ["mmlu_high_school_chemistry"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00,  5.66s/it]
Generating test split: 203 examples [00:00, 854.95 examples/s]
Generating validation split: 22 examples [00:00, 4116.65 examples/s]
Generating dev split: 5 examples [00:00, 38.81 examples/s]
Overwriting default num_fewshot of mmlu_high_school_chemistry from None to 5
100%|██████████| 203/203 [00:00<00:00, 244.00it/s]
Running loglikelihood requests: 100%|██████████| 812/812 [29:57<00:00,  2.21s/it] 


MMLU Output
{'alias': 'high_school_chemistry', 'acc,none': 0.35960591133004927, 'acc_stderr,none': 0.0337645824650957}


#### High School Physics

In [4]:
model_id = "deepseek-ai/deepseek-llm-7b-base"
device = "cuda:0"
batch_size = 8  # the value actually passed to simple_evaluate in this run
few_shots = 5

mmlu_tasks = ["mmlu_high_school_physics"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

# Print the results
print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:21<00:00, 10.81s/it]
Generating test split: 151 examples [00:00, 296.69 examples/s]
Generating validation split: 17 examples [00:00, 1702.88 examples/s]
Generating dev split: 5 examples [00:00,  8.22 examples/s]
Overwriting default num_fewshot of mmlu_high_school_physics from None to 5
100%|██████████| 151/151 [00:03<00:00, 49.76it/s]
Running loglikelihood requests: 100%|██████████| 604/604 [28:05<00:00,  2.79s/it]  


MMLU Output
{'alias': 'high_school_physics', 'acc,none': 0.31788079470198677, 'acc_stderr,none': 0.038020397601078997}


#### High School Biology

In [2]:
model_id = "deepseek-ai/deepseek-llm-7b-base"
device = "cuda:0"
batch_size = 8  # the value actually passed to simple_evaluate in this run
few_shots = 5

mmlu_tasks = ["mmlu_high_school_biology"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

# Print the results
print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.24s/it]
Overwriting default num_fewshot of mmlu_high_school_biology from None to 5
100%|██████████| 310/310 [00:01<00:00, 270.72it/s]
Running loglikelihood requests: 100%|██████████| 1240/1240 [35:28<00:00,  1.72s/it]


MMLU Output
{'alias': 'high_school_biology', 'acc,none': 0.5290322580645161, 'acc_stderr,none': 0.028396016402761053}
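
The five cells above differ only in the task name and batch size, so the model is reloaded for every subject. A minimal sketch of running all subjects in one call follows. `build_eval_kwargs` and `run_benchmark` are hypothetical helpers, not part of `lm_eval`; the only library call used is `evaluator.simple_evaluate` with the same keyword arguments as above, and a single batch size for all subjects is an assumption.

```python
def build_eval_kwargs(model_id, subjects, few_shots=5, batch_size=4, device="cuda:0"):
    """Collect the keyword arguments passed to evaluator.simple_evaluate."""
    return {
        "model": "hf",
        "model_args": f"pretrained={model_id}",
        "tasks": [f"mmlu_high_school_{subject}" for subject in subjects],
        "num_fewshot": few_shots,
        "batch_size": batch_size,
        "device": device,
    }


def run_benchmark(model_id, subjects, **overrides):
    """Evaluate every subject in one call and return the metrics per task."""
    # Imported lazily so the helper above can be used without lm_eval installed
    from lm_eval import evaluator

    kwargs = build_eval_kwargs(model_id, subjects, **overrides)
    results = evaluator.simple_evaluate(**kwargs)
    return {task: results["results"][task] for task in kwargs["tasks"]}


SUBJECTS = ["mathematics", "statistics", "chemistry", "physics", "biology"]
kwargs = build_eval_kwargs("deepseek-ai/deepseek-llm-7b-base", SUBJECTS)
print(kwargs["tasks"])
```

Calling `run_benchmark("deepseek-ai/deepseek-llm-7b-base", SUBJECTS)` would start the evaluation itself. Judging by the timings above, that takes several hours on this hardware.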


### Google Gemma: Gemma-7b

In [2]:
import os
# Read the Hugging Face token from the environment instead of hard-coding it in the notebook
hf_token = os.environ["HF_TOKEN"]
login(token=hf_token)


#### High School Mathematics

In [4]:
model_id = "google/gemma-7b"
device = "cuda:0"
batch_size = 4
few_shots = 5

mmlu_tasks = ["mmlu_high_school_mathematics"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:19<00:00,  4.79s/it]
Overwriting default num_fewshot of mmlu_high_school_mathematics from None to 5
100%|██████████| 270/270 [00:01<00:00, 213.58it/s]
Running loglikelihood requests: 100%|██████████| 1080/1080 [58:16<00:00,  3.24s/it]


MMLU Output
{'alias': 'high_school_mathematics', 'acc,none': 0.37777777777777777, 'acc_stderr,none': 0.029560707392465774}


#### High School Statistics

In [2]:
model_id = "google/gemma-7b"
device = "cuda:0"
batch_size = 2  # the value actually passed to simple_evaluate in this run
few_shots = 5

mmlu_tasks = ["mmlu_high_school_statistics"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:26<00:00,  6.60s/it]
Overwriting default num_fewshot of mmlu_high_school_statistics from None to 5
100%|██████████| 216/216 [00:03<00:00, 58.82it/s]
Running loglikelihood requests: 100%|██████████| 864/864 [1:25:32<00:00,  5.94s/it]


MMLU Output
{'alias': 'high_school_statistics', 'acc,none': 0.5138888888888888, 'acc_stderr,none': 0.03408655867977753}


#### High School Physics

In [2]:
model_id = "google/gemma-7b"
device = "cuda:0"
batch_size = 4
few_shots = 5

mmlu_tasks = ["mmlu_high_school_physics"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:40<00:00, 10.20s/it]
Overwriting default num_fewshot of mmlu_high_school_physics from None to 5
100%|██████████| 151/151 [00:02<00:00, 54.87it/s]
Running loglikelihood requests: 100%|██████████| 604/604 [42:45<00:00,  4.25s/it] 


MMLU Output
{'alias': 'high_school_physics', 'acc,none': 0.37748344370860926, 'acc_stderr,none': 0.03958027231121572}


#### High School Chemistry

In [2]:
model_id = "google/gemma-7b"
device = "cuda:0"
batch_size = 4
few_shots = 5

mmlu_tasks = ["mmlu_high_school_chemistry"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:45<00:00, 11.37s/it]
Using the latest cached version of the module from C:\Users\ivanj\.cache\huggingface\modules\datasets_modules\datasets\hails--mmlu_no_train\b7d5f7f21003c21be079f11495ee011332b980bd1cd7e70cc740e8c079e5bda2 (last modified on Sat May 17 19:19:12 2025) since it couldn't be found locally at hails/mmlu_no_train, or remotely on the Hugging Face Hub.
Overwriting default num_fewshot of mmlu_high_school_chemistry from None to 5
100%|██████████| 203/203 [00:03<00:00, 51.26it/s]
Running loglikelihood requests: 100%|██████████| 812/812 [41:22<00:00,  3.06s/it] 


MMLU Output
{'alias': 'high_school_chemistry', 'acc,none': 0.5320197044334976, 'acc_stderr,none': 0.03510766597959214}


#### High School Biology

In [3]:
model_id = "google/gemma-7b"
device = "cuda:0"
batch_size = 4
few_shots = 5

mmlu_tasks = ["mmlu_high_school_biology"]

# Build the model arguments
model_args = f"pretrained={model_id}"
results = evaluator.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=mmlu_tasks,
    num_fewshot=few_shots,
    batch_size=batch_size,
    device=device,
)

print("MMLU Output")
for task in mmlu_tasks:
    acc = results['results'][task]
    print(acc)

Loading checkpoint shards: 100%|██████████| 4/4 [00:13<00:00,  3.34s/it]
Overwriting default num_fewshot of mmlu_high_school_biology from None to 5
100%|██████████| 310/310 [00:01<00:00, 218.37it/s]
Running loglikelihood requests: 100%|██████████| 1240/1240 [1:00:27<00:00,  2.93s/it]


MMLU Output
{'alias': 'high_school_biology', 'acc,none': 0.7870967741935484, 'acc_stderr,none': 0.023287665127268594}
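
To compare the two models side by side, the accuracies printed in the cells above can be collected into one table. This is a small sketch in plain Python. The numbers are copied from the `acc,none` fields in the outputs, and the unweighted average over subjects is an illustrative choice rather than an official MMLU aggregate.

```python
# 5-shot accuracies copied from the cell outputs above
deepseek = {
    "mathematics": 0.2814814814814815,
    "statistics": 0.4212962962962963,
    "chemistry": 0.35960591133004927,
    "physics": 0.31788079470198677,
    "biology": 0.5290322580645161,
}
gemma = {
    "mathematics": 0.37777777777777777,
    "statistics": 0.5138888888888888,
    "chemistry": 0.5320197044334976,
    "physics": 0.37748344370860926,
    "biology": 0.7870967741935484,
}

# Per-subject difference (Gemma minus DeepSeek)
diffs = {subject: gemma[subject] - deepseek[subject] for subject in deepseek}

print(f"{'subject':<12} {'deepseek':>9} {'gemma':>9} {'diff':>9}")
for subject in deepseek:
    print(f"{subject:<12} {deepseek[subject]:>9.3f} {gemma[subject]:>9.3f} {diffs[subject]:>+9.3f}")

# Unweighted mean over the five subjects
mean_deepseek = sum(deepseek.values()) / len(deepseek)
mean_gemma = sum(gemma.values()) / len(gemma)
print(f"{'mean':<12} {mean_deepseek:>9.3f} {mean_gemma:>9.3f} {mean_gemma - mean_deepseek:>+9.3f}")
```

In these runs Gemma-7b scores higher than DeepSeek-LLM-7B on all five subjects. The per-subject standard errors are around 0.02 to 0.04, so the smaller gaps (physics, for instance) should be read with caution.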


## LLM as Judge

In [3]:
import ollama

In [16]:
model_deepseek = "deepseek-llm:7b"
model_gemma = "gemma:7b"

### Topic: Environment

In [25]:
prompt_1 = '''
This is an Academic IELTS Writing Task 2 question. Please read the materials and instructions, and write a Task 2 essay (at least 250 words).
Materials:
Human activity has had a negative impact on plants and animals around the world.
Some people think this cannot be changed, while others believe actions can be taken to bring about a change.
Instructions:
Discuss both views and give your opinion.
'''

In [26]:
essay_1_deepseek = ollama.generate(model=model_deepseek, prompt=prompt_1)
print(essay_1_deepseek['response'])

In today's highly interconnected global environment, it is undeniable that human activity has had significant negative impacts on plants and animals worldwide. While some people hold the view that this situation cannot be remedied, others believe that action can be taken to bring about positive change. This essay will discuss both views before presenting my personal opinion.

First, let us consider those who argue that there is no hope for reversing the harm caused by human activity on plants and animals. These individuals are likely concerned with the enormity of the problem at hand, as well as their belief in the power of vested interests to maintain a status quo benefiting themselves rather than nature itself. The rapid pace of industrialization and urbanization has resulted in vast areas being deforested for agricultural expansion or infrastructure development; these same people also argue that it is too late for some species already on the verge of extinction, such as polar bears 

In [27]:
essay_1_gemma = ollama.generate(model=model_gemma, prompt=prompt_1)
print(essay_1_gemma['response'])

## The Human Footprint and the Environment

Humanity's burgeoning population and unchecked industrialization have inflicted unprecedented damage upon the natural world. From deforestation to pollution, our actions have plunged countless plant and animal species towards extinction. While some argue that such damage is irreversible, others contend that the situation is not entirely dire, and action can still be taken to restore ecological balance.

Those who believe change is an insurmountable obstacle often point towards the sheer magnitude of human impact. They argue that the damage has already exceeded the planet's regenerative capacity, and that the continued proliferation of our species will only exacerbate the problem. Pollution, for instance, has permeated oceans and ecosystems, leaving lasting scars on the environment. Similarly, deforestation has triggered a cascade of ecological repercussions, jeopardizing entire ecosystems and contributing to climate change.

However, this pes
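
The prompt asks for at least 250 words, so it can be useful to check each generated essay against that minimum before judging it. The sketch below is a hedged illustration: `word_count` and `meets_minimum` are hypothetical helpers, and counting whitespace-separated tokens that contain at least one letter or digit is an assumed approximation of how words would be counted.

```python
import re


def word_count(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter or digit."""
    return sum(1 for token in text.split() if re.search(r"\w", token))


def meets_minimum(text: str, minimum: int = 250) -> bool:
    """Return True when the essay reaches the minimum length set in the prompt."""
    return word_count(text) >= minimum


# Markdown markers such as "##" are not counted as words
sample = "## The Human Footprint and the Environment"
print(word_count(sample), meets_minimum(sample))
```

In the notebook, the same check can be applied to the responses directly, for example `meets_minimum(essay_1_gemma['response'])`.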

### Topic: Education

In [19]:
prompt_2 = '''
This is an Academic IELTS Writing Task 2 question. Please read the materials and instructions, and write a Task 2 essay (at least 250 words).
Materials:
Some educationalists think that international exchange visits will benefit teenagers at school.
Instructions:
To what extent do you think the advantages outweigh the disadvantages?
'''

In [20]:
essay_2_deepseek = ollama.generate(model=model_deepseek, prompt=prompt_2)
print(essay_2_deepseek['response'])

International exchange programs can provide numerous benefits for high school students, and I believe that these advantages far outweigh any potential drawbacks. In this essay, I will elaborate on several key advantages of international exchanges for young people and discuss why the benefits generally outweigh any perceived downsides.

First and foremost, participating in an international exchange program offers a unique opportunity to immerse oneself in another culture while experiencing life from someone else's perspective. By being surrounded by foreign customs, traditions, and languages, students are forced to adapt and become more flexible in their thinking. This exposure to diverse perspectives can lead to greater cultural awareness and tolerance, fostering better cross-cultural communication skills that will serve them well throughout their lives.

Additionally, living abroad for a period of time is an excellent way to learn about the world from firsthand experience rather than 

In [21]:
essay_2_gemma = ollama.generate(model=model_gemma, prompt=prompt_2)
print(essay_2_gemma['response'])

## International Exchange Visits for Teenagers

International exchange visits for teenagers offer a unique and valuable opportunity to broaden their horizons and develop crucial skills in a globalized world. While some educators advocate for their immense benefits, others contend that the associated disadvantages must be carefully considered. Ultimately, the advantages of such visits significantly outweigh the drawbacks, making them an invaluable experience for teenagers.

The primary advantage of international exchange visits lies in the cultural exposure they provide. By interacting with peers from different cultures, teenagers gain a deeper understanding and appreciation of diverse perspectives, languages, and traditions. This fosters open-mindedness, tolerance, and fosters global citizenship. Additionally, these visits break down cultural barriers, enabling teenagers to form lasting friendships and establish a global network of peers.

Furthermore, exchange visits offer invaluable 

### Topic: Health

In [22]:
prompt_3 = '''
This is an Academic IELTS Writing Task 2 question. Please read the materials and instructions, and write a Task 2 essay (at least 250 words).
Materials:
The prevention of health problems and illness is more important than treatment and medicine. Government funding should reflect this.
Instructions:
To what extent do you agree or disagree with the statement?
'''

In [23]:
essay_3_deepseek = ollama.generate(model=model_deepseek, prompt=prompt_3)
print(essay_3_deepseek['response'])

Health issues have always been a significant concern for governments worldwide, as they can drastically affect their citizens' quality of life and productivity. The debate on whether prevention is more important than treatment and medicine has gained momentum recently. While it may be true that investing in preventive measures could save valuable resources in the long run, we must not overlook the importance of medical care. This essay will discuss both sides of the argument before reaching a conclusion.

On one hand, there are several arguments for prioritizing prevention over treatment and medicine. Firstly, preventing illness can help to avoid unnecessary hospitalization and surgeries, which reduces healthcare costs significantly (World Health Organization). Moreover, investing in education on healthy lifestyle habits such as regular exercise, balanced diets, and stress management can help prevent a wide range of chronic diseases that are currently plaguing society, like obesity, di

In [24]:
essay_3_gemma = ollama.generate(model=model_gemma, prompt=prompt_3)
print(essay_3_gemma['response'])

## The Importance of Prevention in Healthcare

The statement that the prevention of health problems and illness should receive greater emphasis in government funding than treatment and medicine holds significant weight. While healthcare systems traditionally prioritize treatment and medication, investing in preventative measures yields long-term cost savings and fosters a healthier population.

Firstly, proactive prevention strategies are significantly more cost-effective than reactive treatments. Measures such as vaccinations, health education campaigns, and infrastructure development for physical activity can prevent numerous illnesses before they arise. This eliminates the need for expensive medical interventions and hospitalizations down the line. For example, investing in malaria prevention programs in vulnerable regions has saved countless lives and billions of dollars in treatment costs.

Furthermore, a focus on prevention fosters a healthier population overall. By equipping ind
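
The notebook does not show the judging step itself. One possible way to set it up, sketched here under stated assumptions, is to build a scoring prompt from the original question and the generated essay and send it to a separate judge model. `build_judge_prompt` is a hypothetical helper and the wording of the prompt is illustrative. The four criteria are the public IELTS Writing Task 2 assessment criteria, each scored on the 0 to 9 band scale. Which judge model to call is left open.

```python
# Public IELTS Writing Task 2 assessment criteria
CRITERIA = [
    "Task Response",
    "Coherence and Cohesion",
    "Lexical Resource",
    "Grammatical Range and Accuracy",
]


def build_judge_prompt(question: str, essay: str) -> str:
    """Assemble a scoring prompt for a judge model (illustrative wording)."""
    criteria = "\n".join(f"- {name}" for name in CRITERIA)
    return (
        "You are an IELTS examiner. Score the essay below for Academic Writing Task 2.\n"
        "Give a band from 0 to 9 for each criterion, then an overall band.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Question:\n{question.strip()}\n\n"
        f"Essay:\n{essay.strip()}\n"
    )


# Short placeholder inputs; in the notebook these would be prompt_3 and essay_3_gemma['response']
judge_prompt = build_judge_prompt(
    "To what extent do you agree or disagree with the statement?",
    "## The Importance of Prevention in Healthcare",
)
print(judge_prompt)
```

Using the same prompt for both models' essays keeps the comparison consistent. Because the printed essays above are cut off, the judge should be given the full response text rather than the truncated output.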