# XMAiNframe: A Large Language Model for Mainframe Modernization

## Dataset Summary

**MainframeBench** is a comprehensive benchmark designed to assess mainframe-related knowledge, consisting of three sub-tasks:
- Multiple-Choice Questions
- Question Answering
- COBOL Code Summarization

---

## Dataset Structure

### Data Instances for Question Answering

```json
{
    "id": 0,
    "prompt": "As a supportive AI assistant, you've been presented with a query related to a Cobol-related topic. Please furnish a reply to the question.",
    "question": "What is the future of COBOL in mainframe computing?",
    "answer": "As businesses increasingly migrate away from mainframes and update their legacy applications, the future of COBOL in mainframe computing is uncertain. However, it will likely continue to be used for maintaining existing systems and for specific business needs."
}
```

---

### Data Fields

#### Question-Answering Task

- **id** (integer): The unique identifier.
- **prompt** (string): Instruction to the language model.
- **question** (string): The mainframe-related question.
- **answer** (string): The corresponding answer.

#### Multiple-Choice Question Task

- **id** (integer): The unique identifier.
- **prompt** (string): Instruction to the language model.
- **question** (string): The mainframe-related question.
- **A, B, C, D** (string): Four possible answers.
- **answer** (string): The correct answer choice (A, B, C, or D).

#### COBOL Code Summarization Task

- **id** (integer): The unique identifier.
- **prompt** (string): Instruction to the language model.
- **source** (string): The COBOL code snippet.
- **summary** (string): The summary of the provided code.
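
The field lists above can be turned into a small schema check. The following is a minimal sketch, not part of the benchmark itself: the helper names `REQUIRED_FIELDS` and `missing_fields` are hypothetical, the task keys reuse the configuration names passed to `load_dataset` in the notebook, and the record is an abbreviated version of the question-answering instance shown earlier.

```python
from typing import Any

# Field names are taken from the "Data Fields" lists above; the helper
# names (REQUIRED_FIELDS, missing_fields) are hypothetical.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "question_answering": ("id", "prompt", "question", "answer"),
    "multiple_choice_question": ("id", "prompt", "question", "A", "B", "C", "D", "answer"),
    "COBOL_code_summarization": ("id", "prompt", "source", "summary"),
}


def missing_fields(task: str, record: dict[str, Any]) -> list[str]:
    """Return the documented fields that are absent from `record`."""
    return [name for name in REQUIRED_FIELDS[task] if name not in record]


# Abbreviated record, for illustration only.
qa_record = {
    "id": 0,
    "prompt": "Please furnish a reply to the question.",
    "question": "What is the future of COBOL in mainframe computing?",
    "answer": "It will likely continue to be used for maintaining existing systems.",
}
print(missing_fields("question_answering", qa_record))        # []
print(missing_fields("multiple_choice_question", qa_record))  # ['A', 'B', 'C', 'D']
```

Note that the raw data loaded in the notebook exposes the identifier as `Unnamed: 0` until the column is renamed to `id`, so a check like this would report `id` as missing before that step.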

---

## Data Splits

The benchmark is divided into three subsets, each corresponding to one of the sub-tasks:
- **Question-Answering**
- **Multiple-Choice Questions**
- **COBOL Code Summarization**

---

## Dataset Statistics

| **Type**                   | **Number of Samples** |
|----------------------------|-----------------------|
| Question Answering          | 2,598                 |
| Multiple-Choice Questions   | 1,931                 |
| COBOL Code Summarization    | 2,523                 |

[Fsoft-AIC's Collections/XMAiNframe](https://huggingface.co/datasets/Fsoft-AIC/MainframeBench)
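
As a quick sanity check on the table above, the counts can be summed and expressed as shares of the whole benchmark. This sketch only restates the published numbers; the variable names are illustrative.

```python
# Sample counts copied from the statistics table above.
counts = {
    "Question Answering": 2598,
    "Multiple-Choice Questions": 1931,
    "COBOL Code Summarization": 2523,
}

total = sum(counts.values())
print(f"Total samples: {total}")  # Total samples: 7052
for name, n in counts.items():
    # Share of each sub-task, formatted as a percentage with one decimal.
    print(f"{name}: {n} ({n / total:.1%})")
```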

In [11]:
import random
from datasets import load_dataset

# Load each subset of MainframeBench
QA_set = load_dataset("Fsoft-AIC/MainframeBench", 'question_answering')
MC_set = load_dataset("Fsoft-AIC/MainframeBench", 'multiple_choice_question')
Sum_set = load_dataset("Fsoft-AIC/MainframeBench", 'COBOL_code_summarization')

### Question-Answering Task

In [12]:
# Show dataset
QA_set

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'prompt', 'question', 'answer'],
        num_rows: 2598
    })
})

In [13]:
# Rename column 'Unnamed: 0' to id

In [14]:
QA_set = QA_set.rename_column(original_column_name='Unnamed: 0', new_column_name='id')
QA_set

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'question', 'answer'],
        num_rows: 2598
    })
})

In [21]:
# Randomly select a sample
total_qa = QA_set['train'].num_rows

random_idx = random.choice(range(total_qa))

print(QA_set['train'][random_idx])

{'id': 1705, 'prompt': "As a supportive AI assistant, you've been given a question regarding a mainframe-related topic. Please provide a response to the inquiry.", 'question': 'What is the role of a data center in the context of mainframes?', 'answer': 'A data center is a facility that houses, manages, and operates mainframes and other computer systems. It provides an environment where mainframes can be maintained, updated, and scaled to meet the demands of modern businesses.'}
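
The cell above draws a different sample on every run. If a repeatable pick is wanted, one option is a dedicated, seeded generator. This is a sketch under assumptions: the seed value `42` is arbitrary, and `total_qa` is hard-coded to the row count reported above so the example runs without loading the dataset.

```python
import random

# Stand-in for QA_set['train'].num_rows; only the length matters for index sampling.
total_qa = 2598

rng = random.Random(42)               # dedicated generator; the seed is arbitrary
random_idx = rng.randrange(total_qa)  # same distribution as random.choice(range(total_qa))

# Re-creating the generator with the same seed reproduces the same index.
assert random.Random(42).randrange(total_qa) == random_idx
print(random_idx)
```

Using a separate `random.Random` instance keeps the choice reproducible without reseeding the module-level generator that other cells may rely on.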


### Multiple-Choice Question Task

In [22]:
# Show dataset
MC_set

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'prompt', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 1931
    })
})

In [23]:
# Rename column 'Unnamed: 0' to id

In [24]:
MC_set = MC_set.rename_column(original_column_name='Unnamed: 0', new_column_name='id')
MC_set

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'question', 'A', 'B', 'C', 'D', 'answer'],
        num_rows: 1931
    })
})

In [26]:
# Randomly select a sample
total_mc = MC_set['train'].num_rows

random_idx = random.choice(range(total_mc))

print(MC_set['train'][random_idx])

{'id': 64, 'prompt': "As a supportive AI assistant, you've received multiple-choice questions related to a cobol-related topic. Each question is presented with four options: A, B, C, or D. Kindly indicate the correct answer for each question by selecting the corresponding option (A, B, C, or D).", 'question': 'Which of the following is a valid COBOL data declaration?', 'A': '01 VARIABLE PIC X(10)', 'B': '01 VARIABLE PIC X(10) VALUE 10', 'C': "01 VARIABLE PIC X(10) VALUE 'Hello'", 'D': '01 VARIABLE PIC X(10) LENGTH 10', 'answer': 'A'}
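
A record like the one above can be flattened into a single prompt, and predicted letters can be scored against the `answer` field. The sketch below is one possible way to do this: the prompt layout and the case-insensitive letter comparison are assumptions, not the benchmark's official evaluation protocol, and both function names are hypothetical.

```python
def format_mc_prompt(record: dict) -> str:
    """Join the instruction, question, and four options into one prompt string."""
    options = "\n".join(f"{key}. {record[key]}" for key in ("A", "B", "C", "D"))
    return f"{record['prompt']}\n\nQuestion: {record['question']}\n{options}\nAnswer:"


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to the reference letter, ignoring case and whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip().upper() == r.strip().upper() for p, r in zip(predictions, references))
    return hits / len(references)


record = {
    "prompt": "Kindly indicate the correct answer by selecting A, B, C, or D.",  # abbreviated
    "question": "Which of the following is a valid COBOL data declaration?",
    "A": "01 VARIABLE PIC X(10)",
    "B": "01 VARIABLE PIC X(10) VALUE 10",
    "C": "01 VARIABLE PIC X(10) VALUE 'Hello'",
    "D": "01 VARIABLE PIC X(10) LENGTH 10",
    "answer": "A",
}
print(format_mc_prompt(record))
print(accuracy(["A", "b", "C"], ["A", "B", "D"]))  # two of three letters match
```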


### COBOL Code Summarization Task

In [6]:
# Show dataset
Sum_set

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'prompt', 'source', 'summary'],
        num_rows: 2523
    })
})

In [None]:
# Rename column 'Unnamed: 0' to id

In [10]:
Sum_set = Sum_set.rename_column(original_column_name='Unnamed: 0', new_column_name='id')
Sum_set

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'source', 'summary'],
        num_rows: 2523
    })
})

In [31]:
# Randomly select a sample
total_sum = Sum_set['train'].num_rows

random_idx = random.choice(range(total_sum))

print(Sum_set['train'][random_idx])

{'Unnamed: 0': 727, 'prompt': "In your role as a helpful AI assistant, you've received a Cobol code snippet. Please present a summary of the code that is clear, brief, detailed, and logically structured.", 'source': "PERFORM 1000-PROCESS-DATA.                                     \n               DISPLAY '**JDHSB43Y DATA PROCESSING**' UPON SYSOUT.         \n               IF       END-FLG   NOT  =  1                                 \n                    COMPUTE  WK-CNT =  INPUT-CNT1 - 1                   \n               ELSE                                                         \n                    MOVE     INPUT-CNT1 TO  WK-CNT                      \n               END-IF.                                                      \n               DISPLAY ' INPUT =' WK-CNT UPON SYSOUT.                       \n       1000-PROCESS-DATA-X.", 'summary': "The given COBOL code performs the following functionalities:\n\n1. It performs a user-defined PERFORM routine 1000-PROCESS-DATA.\n2. It dis
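
The `source` field in the sample above carries fixed-width padding at the end of each line. If that padding is unwanted, for example to shorten model inputs, it can be stripped while leaving the leading indentation untouched. This is a minimal sketch: `strip_cobol_padding` is a hypothetical helper, and whether such cleaning helps downstream summarization is not something the source states.

```python
def strip_cobol_padding(source: str) -> str:
    """Remove trailing spaces from each line and drop trailing blank lines."""
    lines = [line.rstrip() for line in source.split("\n")]
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)


# Fragment of the sample shown above, with its fixed-width padding.
source = (
    "PERFORM 1000-PROCESS-DATA.                                     \n"
    "               DISPLAY ' INPUT =' WK-CNT UPON SYSOUT.                       \n"
    "       1000-PROCESS-DATA-X."
)
cleaned = strip_cobol_padding(source)
print(cleaned)
print(len(source), "->", len(cleaned))
```

Only trailing whitespace is removed, because leading columns are significant in fixed-format COBOL.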

### Write a function to translate English to Japanese

In [1]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

src_lang = "en"
tgt_lang = "ja"
sentence = "The sun set behind the mountains, casting a warm glow over the landscape as the evening breeze blew gently"

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M",device_map="auto")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Translate English to Japanese
tokenizer.src_lang = src_lang
encoded_sentence = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encoded_sentence, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]



'太陽は山の後ろに座り、夜の輝きが柔らかく輝くように、景色の上に温かい輝きを投げ込んだ。'

In [2]:
src_lang = "ja"
tgt_lang = "en"
sentence = "太陽は山の後ろに座り、夜の輝きが柔らかく輝くように、景色の上に温かい輝きを投げ込んだ。"

# Translate Japanese to English
tokenizer.src_lang = src_lang
encoded_sentence = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encoded_sentence, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

'The sun sat behind the mountain, and throwed a warm glow on the view as the brightness of the night glowed softly.'
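
The round trip above visibly drifts from the original sentence. One rough way to put a number on that drift is a word-level similarity ratio from the standard library. This is only a sketch: `word_similarity` is a hypothetical helper, and a surface-overlap score like this is a crude proxy, not a translation-quality metric.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching word sequences, ignoring case."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


original = (
    "The sun set behind the mountains, casting a warm glow over the landscape "
    "as the evening breeze blew gently"
)
round_trip = (
    "The sun sat behind the mountain, and throwed a warm glow on the view "
    "as the brightness of the night glowed softly."
)
print(f"{word_similarity(original, round_trip):.2f}")
```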

In [3]:
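def translate_en_jp(sentence):
    """Translate an English sentence to Japanese with the M2M100 model loaded above."""
    src_lang = "en"
    tgt_lang = "ja"
    tokenizer.src_lang = src_lang
    # Move the inputs to the model's device, since device_map="auto" may place the model on a GPU
    encoded_sentence = tokenizer(sentence, return_tensors="pt").to(model.device)
    generated_tokens = model.generate(**encoded_sentence, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]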

In [4]:
%%time
translate_en_jp("The cat sat on the windowsill, watching birds fly by while the gentle rain tapped softly on the glass.")

CPU times: user 1.08 s, sys: 63.9 ms, total: 1.15 s
Wall time: 889 ms


'猫は窓ガラスに座り、鳥が飛んでいると見ながら、柔らかい雨がガラスに柔らかくなった。'
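
At roughly one second per sentence, translating thousands of rows one at a time adds up. A common remedy is to process sentences in batches. The sketch below shows only the batching logic, with a stand-in translator so it runs without a model; `chunks`, `translate_all`, and `fake_translate_batch` are hypothetical names. With the real model, the batch function could, as an assumption, tokenize the whole list with `padding=True` and call `model.generate` once per batch.

```python
from typing import Callable, Iterator


def chunks(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(
    sentences: list[str],
    translate_batch: Callable[[list[str]], list[str]],
    batch_size: int = 16,
) -> list[str]:
    """Translate `sentences` batch by batch, preserving input order."""
    results: list[str] = []
    for batch in chunks(sentences, batch_size):
        results.extend(translate_batch(batch))
    return results


# Stand-in translator so the sketch runs without downloading a model.
def fake_translate_batch(batch: list[str]) -> list[str]:
    return [f"<ja>{text}</ja>" for text in batch]


print(translate_all(["one", "two", "three"], fake_translate_batch, batch_size=2))
# ['<ja>one</ja>', '<ja>two</ja>', '<ja>three</ja>']
```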

### Customize the current dataset

In [5]:
from datasets import load_dataset

QA_set = load_dataset("Fsoft-AIC/MainframeBench", 'question_answering')
QA_set = QA_set.rename_column(original_column_name='Unnamed: 0', new_column_name='id')
total_sample = QA_set['train'].num_rows

In [6]:
def add_column_answer_ja(example):
    return {"anwser_ja":translate_en_jp(example["answer"])}

QA_set = QA_set.map(add_column_answer_ja)

QA_set


DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'question', 'answer', 'anwser_ja'],
        num_rows: 2598
    })
})

In [10]:
import random

random_idx = random.choice(range(total_sample))

print("English Answer: \n",QA_set['train']['answer'][random_idx])
print("Japanese Answer: \n",QA_set['train']['anwser_ja'][random_idx])

English Answer: 
 COBOL plays a crucial role in modernizing legacy systems. It allows businesses to maintain existing systems while gradually migrating to new technologies and improving their functionality.
Japanese Answer: 
 COBOLは遺産システムの近代化において重要な役割を果たし、企業が既存のシステムを維持し、徐々に新しいテクノロジーに移行し、機能性を向上させることができる。


In [8]:
def add_column_question_ja(example):
    return {"question_ja":translate_en_jp(example["question"])}

QA_set = QA_set.map(add_column_question_ja)

QA_set

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'question', 'answer', 'anwser_ja', 'question_ja'],
        num_rows: 2598
    })
})
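
The two `map` passes above could also be folded into one function that returns both new columns. The following is a sketch under assumptions: `make_add_translations` is a hypothetical factory, the translator is injected so the example runs with a stand-in instead of the real model, and the small cache simply avoids translating an identical string twice. The column name `anwser_ja` keeps the spelling used in the notebook.

```python
from typing import Callable


def make_add_translations(translate: Callable[[str], str]) -> Callable[[dict], dict]:
    """Build a per-example function that adds both Japanese columns at once."""
    cache: dict[str, str] = {}

    def cached(text: str) -> str:
        # Repeated strings are translated only once.
        if text not in cache:
            cache[text] = translate(text)
        return cache[text]

    def add_translations(example: dict) -> dict:
        return {
            "question_ja": cached(example["question"]),
            "anwser_ja": cached(example["answer"]),  # spelling as in the notebook
        }

    return add_translations


# Stand-in translator that records its calls, for illustration only.
calls: list[str] = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return text.upper()


add_translations = make_add_translations(fake_translate)
rows = [{"question": "q1", "answer": "a1"}, {"question": "q1", "answer": "a2"}]
print([add_translations(row) for row in rows])
print(len(calls))  # 3, because the repeated question "q1" is served from the cache
```

With the real translator, a function built this way could be passed to `QA_set.map` in place of the two separate functions, since it also takes one example and returns a dictionary of new columns.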

In [9]:
random_idx = random.choice(range(total_sample))

print("English Question: \n",QA_set['train']['question'][random_idx])
print("Japanese Question: \n",QA_set['train']['question_ja'][random_idx])

English Question: 
 What is the role of transaction tracing tools like Tracer in mainframe debugging?
Japanese Question: 
 トラッカーのようなトランザクショントラッキングツールがメインフレームデブギングにおける役割は何ですか?
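
Spot-checking one random row is useful, but a cheap automated check can flag rows where the translation came back empty or still in English. The sketch below tests for the presence of Japanese script using Unicode ranges; `contains_japanese` and `suspicious_rows` are hypothetical helpers, and passing this check says nothing about translation quality.

```python
def contains_japanese(text: str) -> bool:
    """True if `text` has at least one hiragana, katakana, or CJK ideograph."""
    return any(
        "\u3040" <= ch <= "\u309f"      # hiragana
        or "\u30a0" <= ch <= "\u30ff"   # katakana
        or "\u4e00" <= ch <= "\u9fff"   # CJK unified ideographs
        for ch in text
    )


def suspicious_rows(translations: list[str]) -> list[int]:
    """Indices of translations that are blank or contain no Japanese script."""
    return [i for i, text in enumerate(translations) if not contains_japanese(text)]


translations = [
    "トラッカーのようなトランザクショントラッキングツールがメインフレームデブギングにおける役割は何ですか?",
    "What is the role of a data center?",  # untranslated output
    "",                                     # empty output
]
print(suspicious_rows(translations))  # [1, 2]
```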


In [11]:
# Upload to the Hugging Face Hub
dataset_name = "Mainframe-QA-en-ja"
QA_set.push_to_hub(dataset_name)

print("Upload Successful!")

Upload Successful!


In [12]:
QA_set

DatasetDict({
    train: Dataset({
        features: ['id', 'prompt', 'question', 'answer', 'anwser_ja', 'question_ja'],
        num_rows: 2598
    })
})