### Use cases
- Autocomplete
- Code generation
- Code correction
- Code summarization
- Code explanation

Select specific pages

In [2]:
#!pip install PyPDF2

In [5]:
! python extract_pdf_pages.py /home/loc/Documents/igy6lr40.pdf tmp.pdf 335 339

Extracted pages 335 to 339 into 'tmp.pdf'
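
The `extract_pdf_pages.py` script itself is not shown in this notebook. A minimal sketch of what such a script might look like, assuming the PyPDF2 `PdfReader`/`PdfWriter` API and 1-based, inclusive page numbers on the command line. The function names `page_indices` and `extract_pages` are hypothetical:

```python
import sys


def page_indices(first_page, last_page, total_pages):
    """Convert a 1-based inclusive page range to 0-based page indices."""
    if first_page < 1 or last_page < first_page or last_page > total_pages:
        raise ValueError(
            f"invalid range {first_page}-{last_page} for {total_pages} pages"
        )
    return list(range(first_page - 1, last_page))


def extract_pages(src_path, dst_path, first_page, last_page):
    """Copy pages first_page..last_page of src_path into dst_path."""
    # Imported lazily so the range helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in page_indices(first_page, last_page, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)


# Expected call: python extract_pdf_pages.py <src> <dst> <first> <last>
if __name__ == "__main__" and len(sys.argv) == 5:
    src, dst = sys.argv[1], sys.argv[2]
    first, last = int(sys.argv[3]), int(sys.argv[4])
    extract_pages(src, dst, first, last)
    print(f"Extracted pages {first} to {last} into '{dst}'")
```

Keeping the range check in its own helper makes the off-by-one conversion easy to verify without a PDF at hand.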


Read the file with `docling`

In [4]:
from docling.document_converter import DocumentConverter

source = "tmp.pdf"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
markdown_content = result.document.export_to_markdown()
print(markdown_content)  # prints the extracted pages as Markdown



## Chapter 28. PROCEDURE DIVISION statements

Statements, sentences, and paragraphs in the PROCEDURE DIVISION are executed sequentially except when a procedure branching statement such as EXIT, GO TO, PERFORM, GOBACK, or STOP is used.

## ACCEPT statement

The ACCEPT statement transfers data or system date-related information into the data area referenced by the specified identifier. There is no editing or error checking of the incoming data.

## Data transfer

Format 1 transfers data from an input source into the data item referenced by identifier-1 (the receiving area). When the FROM phrase is omitted, the system input device is assumed.

<!-- image -->

Format 1 is useful for exceptional situations in a program when operator intervention (to supply a given message, code, or exception indicator) is required. The operator must of course be supplied with the appropriate messages with which to reply.

## identifier-1

The receiving area. Can be:

- An alphanumeric group item
- A nat

In [5]:
# Save the Markdown content to a text file
with open("output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(markdown_content)

print("The document has been saved to output.txt.")

The document has been saved to output.txt.


Read content and generate synthetic data

In [13]:
import os
from openai import OpenAI


# Read the OpenAI API key from a local file
with open("/home/loc/Documents/keys/OPENAI_API_KEY.txt", "r") as f:
    api_key = f.read().strip()


client = OpenAI(api_key=api_key)

# Send a short test message to confirm that the key and client work
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message.content.strip())

Hello! How can I assist you today?
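
Reading the key from a fixed path ties the notebook to one machine. A possible alternative, sketched here as an assumption rather than the approach used above, checks the `OPENAI_API_KEY` environment variable first and falls back to a key file. `load_api_key` and its parameters are hypothetical names:

```python
import os
from pathlib import Path


def load_api_key(key_file=None, env_var="OPENAI_API_KEY", environ=None):
    """Return the API key from the environment, else from key_file."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None and Path(key_file).is_file():
        return Path(key_file).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"no API key found in ${env_var} or {key_file!r}")
```

The result can then be passed as `OpenAI(api_key=load_api_key("/home/loc/Documents/keys/OPENAI_API_KEY.txt"))`.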


In [17]:
# Read the saved output file
with open("output.txt", "r", encoding="utf-8") as f:
    content = f.read()

In [18]:
# Define the prompt to generate synthetic data
prompt = f"""
You are a helpful assistant. Based on the following COBOL content, generate synthetic data for training a language model:

{content}

1. Generate a set of questions and answers (QA) about the content.
2. Summarize the content in a few sentences.
3. Explain the function and usage of the `ACCEPT` statement.
4. Generate a COBOL code snippet that uses the `ACCEPT` statement.
5. Provide autocomplete suggestions for COBOL code.
6. Correct the following COBOL code for errors: `ACCEPT AGE GENDER`.
"""
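
Packing six tasks into a single prompt means one long reply has to cover all of them. One way to keep replies shorter and easier to label, sketched under the assumption that each task is sent as its own request, is to build one prompt per task. `TASKS` and `build_task_prompts` are hypothetical names, and the task wording is copied from the prompt above:

```python
TASKS = {
    "qa": "Generate a set of questions and answers (QA) about the content.",
    "summary": "Summarize the content in a few sentences.",
    "explain": "Explain the function and usage of the `ACCEPT` statement.",
    "codegen": "Generate a COBOL code snippet that uses the `ACCEPT` statement.",
    "autocomplete": "Provide autocomplete suggestions for COBOL code.",
    "correction": "Correct the following COBOL code for errors: `ACCEPT AGE GENDER`.",
}

HEADER = (
    "Based on the following COBOL content, generate synthetic data "
    "for training a language model:"
)


def build_task_prompts(content, tasks=TASKS):
    """Return {task_name: prompt}, one self-contained prompt per task."""
    return {
        name: f"{HEADER}\n\n{content}\n\nTask: {instruction}"
        for name, instruction in tasks.items()
    }
```

Each value can be sent as the `user` message of a separate chat completion request, and the dictionary key can be stored alongside the reply as its task label.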

In [19]:
# Call the OpenAI API to generate synthetic data
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
)

# Print the generated synthetic data
generated_content = completion.choices[0].message.content.strip()
print(generated_content)

Here is some synthetic data generated based on the provided COBOL content for training a language model:

1. Q: What type of items can identifier-1 be in COBOL? A: An alphanumeric group item, a national group item, or an elementary data item of usage DISPLAY, DISPLAY-1, or NATIONAL.

2. Q: What does the ACCEPT statement do in COBOL? A: The ACCEPT statement transfers data or system date-related information into the data area referenced by the specified identifier without editing or error checking.

3. Q: Explain the function and usage of the ACCEPT statement in COBOL. A: The ACCEPT statement transfers data from an input source into a specified data item in COBOL programs. It is commonly used to retrieve user input or system date-related information.

4. COBOL code snippet using the ACCEPT statement:

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. ACCEPT-EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 USER-INPUT PIC X(30).

PROCEDURE DIVISION.
ACCEPT USER-INPUT.
DISPLAY 'User input re
```

Based on the content you provided from the mainframe book, here are some prompt examples you can use to generate synthetic data for training a language model (LLM) in various formats:

### 1. **QA (Question and Answer)**
**Prompt:**
- **Question**: "What is the purpose of the `ACCEPT` statement in COBOL?"
- **Answer**: "The `ACCEPT` statement transfers data or system date-related information into the data area referenced by the specified identifier. It reads input from a specified source, such as the system input device or a console."

**Prompt:**
- **Question**: "What happens if no data is received when using the `ACCEPT` statement?"
- **Answer**: "If no data is received from the console, the target data item does not receive a new value and retains its previous value."

### 2. **Summary**
**Prompt:**
- **Summary**: "Summarize the section on the `ACCEPT` statement in COBOL."
- **Response**: "The `ACCEPT` statement in COBOL is used to transfer data from an input source, such as the system input device or console, into a data item. It handles system date-related information and user inputs, with no error checking or editing of the incoming data."

### 3. **Explain**
**Prompt:**
- **Instruction**: "Explain how the `ACCEPT` statement works in COBOL when the system input device is used."
- **Response**: "When the system input device is used with the `ACCEPT` statement, data is transferred into the receiving area, which can be an alphanumeric or national group item. The system reads records from the input device until the receiving area is filled or EOF is encountered. If the area is not completely filled, it is padded with spaces."

### 4. **Code Generation**
**Prompt:**
- **Instruction**: "Generate a COBOL code snippet that uses the `ACCEPT` statement to read user input for age and gender."
- **Response**: 
  ```cobol
  ACCEPT AGE.
  ACCEPT GENDER.
  EVALUATE TRUE
      WHEN AGE > 60 AND GENDER = 'M'
          DISPLAY 'THE MAN IS RETIRED'
      WHEN AGE > 60 AND GENDER = 'F'
          DISPLAY 'THE WOMAN IS RETIRED'
      WHEN AGE <= 60 AND GENDER = 'M'
          DISPLAY 'THE MAN IS NOT RETIRED'
      WHEN AGE <= 60 AND GENDER = 'F'
          DISPLAY 'THE WOMAN IS NOT RETIRED'
      WHEN OTHER
          DISPLAY 'INVALID INPUT'
  END-EVALUATE.
  ```

### 5. **Autocomplete**
**Prompt:**
- **Instruction**: "Provide an autocomplete suggestion to fill in the following COBOL code: `ACCEPT ____`."
- **Response**: "ACCEPT identifier-1."

### 6. **Auto-correct**
**Prompt:**
- **Instruction**: "Auto-correct the following COBOL code: `ACCEPT AGE GENDER`."
- **Response**: "`ACCEPT` takes a single receiving identifier, so each item needs its own statement: `ACCEPT AGE. ACCEPT GENDER.`"

### Additional Synthetic Data Examples:

#### 7. **Code Explanation**
**Prompt:**
- **Instruction**: "Explain what the `ADD` statement does in COBOL."
- **Response**: "The `ADD` statement sums two or more numeric operands and stores the result. In the `ADD ... TO` form, the operands before `TO` are summed, the sum is added to identifier-2, and the result is stored in identifier-2."

#### 8. **Edge Case Example**
**Prompt:**
- **Question**: "What happens if more than 114 characters of data are entered in the `ACCEPT` statement?"
- **Answer**: "If more than 114 characters of data are entered, only the first 114 characters will be recognized, and the rest will be ignored."

#### 9. **Data Type Clarification**
**Prompt:**
- **Question**: "What types of data items can the `ACCEPT` statement receive data into?"
- **Answer**: "The `ACCEPT` statement can transfer data into alphanumeric group items, national group items, or elementary data items (alphanumeric, numeric, national, and others), as long as they are not dynamic-length group items."

#### 10. **Best Practices**
**Prompt:**
- **Instruction**: "What is a best practice when using the `ACCEPT` statement for input?"
- **Response**: "Ensure that the receiving area has appropriate lengths and formats to handle the incoming data, and be aware of system-specific limitations, such as maximum input length and padding behavior."

These prompts provide diverse examples of synthetic data that could be used to train a language model in a variety of tasks related to COBOL and mainframe programming. Each example is designed to create different types of training data (QA, summaries, explanations, etc.), which helps diversify the model's learning.
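
To use pairs like these for training, they need to be stored in a machine-readable form. A minimal sketch, assuming a JSON Lines layout with one record per example and the hypothetical field names `task`, `prompt`, and `response`. The two records are copied from the autocomplete and edge-case examples above:

```python
import json

REQUIRED_FIELDS = {"task", "prompt", "response"}


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines (one JSON object per line)."""
    lines = []
    for record in records:
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        # ensure_ascii=False keeps any non-ASCII characters readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


examples = [
    {
        "task": "autocomplete",
        "prompt": "Provide an autocomplete suggestion to fill in the following COBOL code: `ACCEPT ____`.",
        "response": "ACCEPT identifier-1.",
    },
    {
        "task": "qa",
        "prompt": "What happens if more than 114 characters of data are entered in the `ACCEPT` statement?",
        "response": "If more than 114 characters of data are entered, only the first 114 characters will be recognized, and the rest will be ignored.",
    },
]

jsonl_text = to_jsonl(examples)
```

Writing `jsonl_text` to a file opened with `encoding="utf-8"` gives a dataset that can be read back line by line with `json.loads`.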

### References

https://stackoverflow.com/questions/75774873/openai-api-error-this-is-a-chat-model-and-not-supported-in-the-v1-completions