# Dependency

The following packages are required for inference (the listed versions have been tested):
```
accelerate==0.33.0
transformers==4.44.2
```
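Before installing, it can help to see whether the current environment already matches these pins. The snippet below is a minimal sketch using the standard-library `importlib.metadata`; the helper name `check_versions` and its `get_version` parameter are illustrative and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
TESTED_VERSIONS = {"accelerate": "0.33.0", "transformers": "4.44.2"}


def check_versions(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report any package whose installed version differs from the tested one.
for name, (expected, installed) in check_versions(TESTED_VERSIONS).items():
    print(f"{name}: tested with {expected}, found {installed}")
```

An empty result means both packages are installed at the tested versions. Other versions may still work, but they have not been tested here.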

In [1]:
pip install accelerate==0.33.0 transformers==4.44.2

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting accelerate==0.33.0
  Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub>=0.21.0 (from accelerate==0.33.0)
  Downloading huggingface_hub-0.24.6-py3-none-any.whl.metadata (13 kB)
Collecting safetensors>=0.3.1 (from accelerate==0.33.0)
  Downloading safetensors-0.4.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.44.2)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading accelerate-0.33.0-py3-none-any.whl (315 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 




# Load Model

In [2]:
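# Load the model weights and let accelerate place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/GRIN-MoE",
    device_map="auto",
    torch_dtype="auto",       # use the dtype stored in the checkpoint
    trust_remote_code=True,   # allow the custom modeling code from the model repository
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/GRIN-MoE")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
# Greedy decoding; return only the newly generated text, not the prompt.
generation_args = {
    "max_new_tokens": 2048,
    "return_full_text": False,
    "do_sample": False,
}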

Loading checkpoint shards: 100%|██████████| 17/17 [00:20<00:00,  1.23s/it]


# Inference

In [3]:
torch.random.manual_seed(0) 

<torch._C.Generator at 0x7f2f95b502b0>

## Example 1


### Question
Please complete the following single-choice question. The question has four options, and only one of them is correct. Select the option that is correct. 

Question: Given $\cos(\alpha + \beta) = m$, and $\tan(\alpha) \tan(\beta) = 2$, we want to find $\cos(\alpha - \beta) =$

Option A: $-3m$

Option B: $-\frac{m}{3}$

Option C: $\frac{m}{3}$

Option D: $3m$

### Inference

As an aligned model, GRIN MoE is best suited for prompts using the chat format as follows:

```
<|system|>
You are a helpful assistant.<|end|>
<|user|>
$USER_MESSAGE<|end|>
<|assistant|>
```

The following code passes the question as a list of chat messages, which the pipeline renders with the tokenizer's chat template.

In [4]:
message = [
    {"role": "system", "content": "You are a helpful AI assistant."}, 
    {"role": "user", "content": "Please complete the following single-choice question. The question has four options, and only one of them is correct. Select the option that is correct. \n\nQuestion: Given $\\cos(\\alpha + \\beta) = m$, and $\\tan(\\alpha) \\tan(\\beta) = 2$, we want to find $\\cos(\\alpha - \\beta) =$\n\nOption A: $-3m$\nOption B: $-\\frac{m}{3}$\nOption C: $\\frac{m}{3}$\nOption D: $3m$"}
]

output = pipe(message, **generation_args) 

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
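For reference, the chat format described above can be reproduced by hand. The sketch below uses a hypothetical helper, `render_chat`, that builds the prompt string from a message list. The exact newline placement is an assumption read off the template shown earlier; the tokenizer's own chat template remains the authoritative source.

```python
def render_chat(messages):
    """Render {"role", "content"} dicts in the chat format shown above.

    Whitespace is assumed from the template in this notebook; in practice
    tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True) should be preferred.
    """
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    # An open assistant turn asks the model to generate the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


demo = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(render_chat(demo))
```

Comparing this string with the tokenizer's rendering of the same messages is a quick way to confirm which format the model actually receives.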


### Result

In [5]:
print("-----------Correct Answer RESPONSE-----------")
print("The correct answer to this question is A, -3m")
print("-----------Model RESPONSE-----------")
print(output[0]["generated_text"])

-----------Correct Answer RESPONSE-----------
The correct answer to this question is A, -3m
-----------Model RESPONSE-----------
 To find $\cos(\alpha - \beta)$, we can use the cosine difference identity:

$\cos(\alpha - \beta) = \cos(\alpha)\cos(\beta) + \sin(\alpha)\sin(\beta)$

We know that $\cos(\alpha + \beta) = m$, so we can use the cosine sum identity:

$\cos(\alpha + \beta) = \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta) = m$

Now, we can use the given information $\tan(\alpha) \tan(\beta) = 2$. We know that $\tan(\alpha) = \frac{\sin(\alpha)}{\cos(\alpha)}$ and $\tan(\beta) = \frac{\sin(\beta)}{\cos(\beta)}$. So,

$\frac{\sin(\alpha)}{\cos(\alpha)} \cdot \frac{\sin(\beta)}{\cos(\beta)} = 2$

$\sin(\alpha)\sin(\beta) = 2\cos(\alpha)\cos(\beta)$

Now, we can substitute this into the cosine difference identity:

$\cos(\alpha - \beta) = \cos(\alpha)\cos(\beta) + 2\cos(\alpha)\cos(\beta) = 3\cos(\alpha)\cos(\beta)$

We also know that $\cos(\alpha + \beta) = \cos(\alpha)\cos(\be
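
The printed response above is cut off before it reaches a final answer. As a sanity check of the reference answer (this is not part of the model output): substituting $\sin(\alpha)\sin(\beta) = 2\cos(\alpha)\cos(\beta)$ into the sum identity gives $\cos(\alpha)\cos(\beta) = -m$, so $\cos(\alpha - \beta) = 3\cos(\alpha)\cos(\beta) = -3m$. The sketch below checks this numerically for one pair of angles chosen so that $\tan(\alpha)\tan(\beta) = 2$; the helper name and the angles are illustrative.

```python
import math


def check_identity(alpha, beta):
    """Return (m, cos(alpha - beta)) for angles with tan(alpha) * tan(beta) == 2."""
    assert math.isclose(math.tan(alpha) * math.tan(beta), 2.0)
    m = math.cos(alpha + beta)
    return m, math.cos(alpha - beta)


# tan(alpha) = 1 and tan(beta) = 2, so the product is 2 as the question requires.
alpha = math.atan(1.0)
beta = math.atan(2.0)
m, diff = check_identity(alpha, beta)
print(math.isclose(diff, -3 * m))  # prints True
```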

## Example 2


### Question
If the tangent line to the curve $y = e^x + x$ at the point $(0, 1)$ is also the tangent line to the curve $y = \ln(x+1) + a$ at some point, find the value of the constant $a$.

### Inference

As an aligned model, GRIN MoE is best suited for prompts using the chat format as follows:

```
<|system|>
You are a helpful assistant.<|end|>
<|user|>
$USER_MESSAGE<|end|>
<|assistant|>
```

The following code passes the question as a list of chat messages, which the pipeline renders with the tokenizer's chat template.

In [6]:
message = [
    {"role": "system", "content": "You are a helpful AI assistant."}, 
    {"role": "user", "content": "If the tangent line to the curve $y = e^x + x$ at the point $(0, 1)$ is also the tangent line to the curve $y = ln(x+1) + a$ at some point, find the value of the constant $a$."}
]

output = pipe(message, **generation_args) 

### Result

In [7]:
print("-----------Correct Answer RESPONSE-----------")
print("The correct answer to this question is ln(2)")
print("-----------Model RESPONSE-----------")
print(output[0]["generated_text"])

-----------Correct Answer RESPONSE-----------
The correct answer to this question is ln(2)
-----------Model RESPONSE-----------
 First, let's find the slope of the tangent line to the curve $y = e^x + x$ at the point $(0, 1)$. To do this, we need to find the derivative of the function $y = e^x + x$ with respect to $x$.

The derivative of $e^x$ with respect to $x$ is $e^x$, and the derivative of $x$ with respect to $x$ is $1$. So, the derivative of $y = e^x + x$ is:

$y' = e^x + 1$

Now, we can find the slope of the tangent line at the point $(0, 1)$ by plugging in $x = 0$ into the derivative:

$y'(0) = e^0 + 1 = 1 + 1 = 2$

So, the slope of the tangent line to the curve $y = e^x + x$ at the point $(0, 1)$ is $2$.

Now, let's find the equation of the tangent line to the curve $y = e^x + x$ at the point $(0, 1)$. We can use the point-slope form of a linear equation:

$y - y_1 = m(x - x_1)$

where $(x_1, y_1)$ is the point $(0, 1)$ and $m$ is the slope $2$. Plugging in these values, we ge
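
The printed response above is cut off before the final answer. As a sanity check of the reference answer (this is not part of the model output), the sketch below follows the same steps numerically: it takes the tangent line $y = 2x + 1$ at $(0, 1)$, finds the point where $y = \ln(x+1) + a$ has the same slope, and solves for $a$. The function names are illustrative.

```python
import math


def tangent_at_origin():
    """Slope and intercept of the tangent to y = e^x + x at (0, 1)."""
    slope = math.exp(0.0) + 1.0      # y' = e^x + 1 evaluated at x = 0
    intercept = 1.0 - slope * 0.0    # the line passes through (0, 1)
    return slope, intercept


def solve_a(slope, intercept):
    """Find a so that y = slope*x + intercept is tangent to y = ln(x + 1) + a."""
    x0 = 1.0 / slope - 1.0           # from 1 / (x0 + 1) = slope
    y0 = slope * x0 + intercept      # the tangent point lies on the line
    return y0 - math.log(x0 + 1.0)   # a = y0 - ln(x0 + 1)


slope, intercept = tangent_at_origin()
a = solve_a(slope, intercept)
print(math.isclose(a, math.log(2.0)))  # prints True
```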

## Interactive Chat 

Run the following cell to chat with GRIN-MoE interactively.

In [None]:
# Conversation history, seeded with the system prompt.
message = [
    {"role": "system", "content": "You are a helpful AI assistant."},
]

while True:
    print('------------user-------------')
    content = input('user input (enter <|stop|> to stop): ')
    print('------------user-------------')
    if '<|stop|>' in content:
        break
    print(content)
    message.append({"role": "user", "content": content})
    output = pipe(message, **generation_args)
    print('------------assistant-------------')
    print(output[0]["generated_text"])
    # Store the model's reply (not the user's text) so later turns see the full dialogue.
    message.append({"role": "assistant", "content": output[0]["generated_text"]})


------------user-------------


user input (enter <|stop|> to stop):  Hello, I want to ask you some math questions. can you do that?


------------user-------------
Hello, I want to ask you some math questions. can you do that?
------------assistant-------------
 Of course! I'd be happy to help you with your math questions. Please go ahead and ask.
------------user-------------


user input (enter <|stop|> to stop):  Solve the equation: $\frac{2x + 3}{x - 1} = \frac{x + 2}{x + 1}$.


------------user-------------
Solve the equation: $\frac{2x + 3}{x - 1} = \frac{x + 2}{x + 1}$.
------------assistant-------------
 Sure, I can help you solve the equation. To solve the equation $\frac{2x + 3}{x - 1} = \frac{x + 2}{x + 1}$, we can start by cross-multiplying:

$(2x + 3)(x + 1) = (x + 2)(x - 1)$

Now, let's expand both sides:

$2x^2 + 2x + 3x + 3 = x^2 + 2x - x - 2$

Combine like terms:

$2x^2 + 5x + 3 = x^2 + x - 2$

Now, let's move all terms to one side to set the equation to zero:

$2x^2 - x^2 + 5x - x + 3 + 2 = 0$

Simplify:

$x^2 + 4x + 5 = 0$

This is a quadratic equation. We can solve it using the quadratic formula:

$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

In this case, $a = 1$, $b = 4$, and $c = 5$. Plugging these values into the formula, we get:

$x = \frac{-4 \pm \sqrt{4^2 - 4(1)(5)}}{2(1)}$

$x = \frac{-4 \pm \sqrt{16 - 20}}{2}$

Since the discriminant ($b^2 - 4ac$) is negative, there are no real solutions for this equation. The solutions are complex numbers

user input (enter <|stop|> to stop):  Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?


------------user-------------
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
------------assistant-------------
 Sally has 1 sister.

Here's the explanation:

Sally has 3 brothers. Each brother has 2 sisters, but since they all share the same sisters, we don't count them multiple times. We know Sally is one sister, so there must be one more sister to make the count of 2 sisters for each brother. Therefore, Sally has 1 sister.
------------user-------------


user input (enter <|stop|> to stop):  Can you help me revise the abstract as below? Focus on any potential grammar errors. Thanks!  Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develope a top-2 16x3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive 

------------user-------------
Can you help me revise the abstract as below? Focus on any potential grammar errors. Thanks!  Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develope a top-2 16x3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluati

user input (enter <|stop|> to stop):  i feel pursue is a better word.


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


------------user-------------
i feel pursue is a better word.
------------assistant-------------
 Certainly! Here's the revised abstract with the suggested change:

Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better **pursue** the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16x3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model 
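
The interactive loop keeps the whole conversation in `message`, so multi-turn chat depends on each assistant entry holding the model's reply. The sketch below isolates that bookkeeping so it can be tried without loading the model. `chat_turn` and `echo_stub` are hypothetical names, and the stub stands in for the real pipeline call.

```python
def chat_turn(history, user_text, generate):
    """Append one user/assistant exchange to history and return the reply.

    `generate` is any callable that maps the message list to the assistant
    text. With the pipeline from this notebook it could be
    lambda msgs: pipe(msgs, **generation_args)[0]["generated_text"].
    """
    history.append({"role": "user", "content": user_text})
    reply = generate(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_stub(messages):
    """Stand-in generator so the sketch runs without the model."""
    return f"(stub reply to: {messages[-1]['content']})"


history = [{"role": "system", "content": "You are a helpful AI assistant."}]
chat_turn(history, "Hello", echo_stub)
chat_turn(history, "Second question", echo_stub)
print([m["role"] for m in history])
```

After two turns the roles alternate between user and assistant following the system message, which is the structure the chat format expects.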