# 🌊🌊 **SWELLS** — *Training Set Generator*
[![Paper](http://img.shields.io/badge/Arxiv:2503.09454-B31B1B.svg)](https://arxiv.org/abs/2503.09454)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-181717?logo=github)](https://github.com/mmarmonier/SWELLS)


This notebook generates ciphered conlang translation training sets designed to improve an LLM's ability to learn grammar rules explicitly from the metalinguistic explanations found in grammar book excerpts.  

It processes a large set of plaintext training instances and enables users to select a tailored data mix based on:  
- 🔄 **Translation direction** (e.g., English → Conlang or Conlang → English)  
- 🎯 **Targeted linguistic phenomena** (e.g., noun pluralization, verb agreement)  
- 🧠 **Prompting modality** (e.g., with or without chain-of-thought reasoning)  

Additionally, the script provides training cost estimates for OpenAI’s fine-tuning API (although rates may be outdated ⚠️).
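
As a rough illustration of the arithmetic behind such a cost estimate, here is a minimal sketch. It assumes a cost model of tokens × price per million tokens × epochs, and it approximates token counts by character length. The function names, the price, and the epoch count are placeholders chosen for illustration, not actual OpenAI rates or values taken from this notebook.

```python
import json

# Placeholder values: substitute the current rate and your own epoch count.
PRICE_PER_MILLION_TOKENS_USD = 25.0  # hypothetical, NOT an actual rate
N_EPOCHS = 3                         # hypothetical

def approx_token_count(text, chars_per_token=4):
    """Crude estimate (about 4 characters per token); a real tokenizer is more accurate."""
    return max(1, len(text) // chars_per_token)

def estimate_finetuning_cost(jsonl_lines,
                             price_per_million=PRICE_PER_MILLION_TOKENS_USD,
                             n_epochs=N_EPOCHS):
    """Sum approximate tokens over all messages and convert them to a cost."""
    total_tokens = 0
    for line in jsonl_lines:
        entry = json.loads(line)
        for message in entry["messages"]:
            total_tokens += approx_token_count(message["content"])
    cost = total_tokens / 1_000_000 * price_per_million * n_epochs
    return total_tokens, cost

# Tiny in-memory example in the same chat format as the training files
sample_lines = [json.dumps({"messages": [
    {"role": "user", "content": "x" * 400},
    {"role": "assistant", "content": "y" * 200},
]})]
tokens, cost = estimate_finetuning_cost(sample_lines)
print(f"~{tokens} tokens, estimated cost: ${cost:.5f}")
```

The estimate is only as good as the token approximation and the rate plugged in, so it is best read as an order-of-magnitude check.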

# Dataset

In [1]:
!wget https://github.com/mmarmonier/SWELLS/releases/download/v1.0/SWELLS_dataset.7z

--2025-03-19 12:40:26--  https://github.com/mmarmonier/SWELLS/releases/download/v1.0/SWELLS_dataset.7z
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/951275150/ef3d2270-915a-4c9c-89ee-7733a6165e93?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250319%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250319T124026Z&X-Amz-Expires=300&X-Amz-Signature=9fd192032ba805e0af5526e5ae534d065d1d282eb53c3d8b9189b3607da49d05&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3DSWELLS_dataset.7z&response-content-type=application%2Foctet-stream [following]
--2025-03-19 12:40:26--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/951275150/ef3d2270-915a-4c9c-89ee-7733a6165e93?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Cr

In [2]:
!7z x /content/SWELLS_dataset.7z -p"SWELLS" -o/content/extracted_SWELLS


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan /content/                   1 file, 299744346 bytes (286 MiB)

Extracting archive: /content/SWELLS_dataset.7z
--
Path = /content/SWELLS_dataset.7z
Type = 7z
Physical Size = 299744346
Headers Size = 1754
Method = LZMA2:24 7zAES
Solid = +
Blocks = 12

  0% 14 - SWELLS_Dataset/Dev/dev_French.jsonl
  0% 15 - SWELLS_Dataset/Dev/dev_Latin.jsonl
  1% 15 - SWELLS_Dataset/Dev/dev_Latin.jsonl

# Conlanger object

In [None]:
import random
import re
from nltk.tokenize import sent_tokenize
import json
import nltk
#nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)


class Conlanger:
    def __init__(self, n_consonants=9, n_vowels=5, mixed_syllable=True):
        """
        Initialize the Conlanger class, which creates a random grid of syllables
        and an optional mapping for transformations.

        Parameters:
        - n_consonants (int): Number of consonants to include in the generated grid.
        - n_vowels (int): Number of vowels to include in the generated grid.
        - mixed_syllable (bool): Whether to allow mappings between digrams and extra characters.

        Attributes:
        - all_consonants (list): Full list of possible consonants.
        - all_vowels (list): Full list of possible vowels.
        - alphabet (list): Combined list of all possible characters (consonants + vowels).
        - consonants (list): Randomly sampled consonants for the grid.
        - vowels (list): Randomly sampled vowels for the grid.
        - grid (dict): Generated syllabic grid based on consonants and vowels.
        - mapping (dict): Optional mapping between digrams and extra characters.
        """
        # Define possible consonants (basic and extended sets)
        self.all_consonants_0 = ['b', 'c', 'ç', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z']
        self.all_consonants_1 = ['b', 'c', 'ç', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'ñ', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z', 'þ', 'ð']

        # Choose which consonant set to use (weighted random selection)
        self.all_consonants = random.choices(
            [self.all_consonants_0, self.all_consonants_1],
            weights=[3, 1],  # Prefer the basic set (weight 3:1)
            k=1
        )[0]

        # Define possible vowels (basic and extended sets)
        self.all_vowels_0 = ['a', 'e', 'i', 'o', 'u', 'é', 'è', 'à', 'y']
        self.all_vowels_1 = ['a', 'e', 'i', 'o', 'u', 'â', 'ê', 'î', 'ô', 'û', 'é', 'è', 'à', 'œ', 'y', 'ø', 'æ']

        # Choose which vowel set to use (weighted random selection)
        self.all_vowels = random.choices(
            [self.all_vowels_0, self.all_vowels_1],
            weights=[3, 1],  # Prefer the basic set (weight 3:1)
            k=1
        )[0]

        # Combine consonants and vowels into a single alphabet
        self.alphabet = self.all_consonants + self.all_vowels

        # Validate that the grid can accommodate the alphabet
        self.n_consonants = n_consonants
        self.n_vowels = n_vowels
        assert len(self.alphabet) <= n_vowels * n_consonants, (
            "Error: n_vowels * n_consonants must be >= len(alphabet)."
        )

        # Store additional settings
        self.mixed_syllable = mixed_syllable

        # Randomly select the specific consonants and vowels for the grid
        self.consonants = random.sample(self.all_consonants, n_consonants)
        self.vowels = random.sample(self.all_vowels, n_vowels)

        # Identify remaining characters not included in the grid
        self.remaining_consonants = [c for c in self.all_consonants if c not in self.consonants]
        self.remaining_vowels = [v for v in self.all_vowels if v not in self.vowels]

        # Randomly determine whether the grid is open (CV) or closed (VC)
        self.syllabic_structure = random.choice(["open", "closed"])

        # Generate the syllabic grid
        self.grid = self.create_grid()

        # Create an optional mapping if mixed syllables are enabled
        self.mapping = {}
        if self.mixed_syllable:
            self.create_mapping()


    def create_grid(self):
        """
        Creates a syllabic grid based on the syllabic structure (open or closed).

        An open grid (CV structure) uses consonants as rows and vowels as columns.
        A closed grid (VC structure) uses vowels as rows and consonants as columns.

        Returns:
        - grid (dict): A dictionary where rows (keys) map to columns (keys) with alphabet characters as values.
        """
        # Determine grid structure: open (CV) or closed (VC)
        if self.syllabic_structure == "open":
            rows = self.consonants
            columns = self.vowels
        elif self.syllabic_structure == "closed":
            rows = self.vowels
            columns = self.consonants
        else:
            raise ValueError("Invalid syllabic structure. Must be 'open' or 'closed'.")

        # Ensure the alphabet has enough elements to fill the grid
        alphabet = self.alphabet[:]
        total_cells = len(rows) * len(columns)

        # Pad the alphabet with None if there are not enough characters
        while len(alphabet) < total_cells:
            alphabet.append(None)

        # Shuffle the alphabet to randomize grid placement
        random.shuffle(alphabet)

        # Initialize the grid as a dictionary of dictionaries
        grid = {row: {col: None for col in columns} for row in rows}

        # Fill the grid with characters from the alphabet
        index = 0
        for row in rows:
            for col in columns:
                grid[row][col] = alphabet[index]
                index += 1

        return grid


    def create_mapping(self):
        """
        Creates a mapping between randomly selected bigrams (from the grid)
        and extra characters (remaining consonants and vowels).

        This mapping is optional and used for additional text transformations.

        Attributes:
        - self.mapping (dict): Maps bigrams to extra characters.
        """
        # Generate a list of valid bigrams (row+column combinations) from the grid
        digrams = []
        for row_key, row_values in self.grid.items():
            for col_key, value in row_values.items():
                if value is not None:
                    digrams.append(f"{row_key}{col_key}")

        # Combine remaining consonants and vowels as possible replacements
        first_list = digrams
        second_list = self.remaining_consonants + self.remaining_vowels

        # Determine the number of changes (percentage of available digrams to map)
        coefficient = random.choice([0.40, 0.50, 0.66, 0.75])  # Randomly choose mapping intensity
        n_changes = min(int(len(digrams) * coefficient), len(second_list))

        # Ensure there are enough replacement characters to create the mapping
        if n_changes > len(second_list):
            raise ValueError("Error! Not enough characters to map the selected bigrams.")

        # Randomly select bigrams and replacement characters for the mapping
        selected_bigrams = random.sample(first_list, n_changes)
        selected_chars = random.sample(second_list, n_changes)

        # Create a dictionary mapping the selected bigrams to replacement characters
        self.mapping = dict(zip(selected_bigrams, selected_chars))



    def replace_bigrams(self, text):
        """
        Replaces bigrams in the text with their corresponding mapped characters.

        Parameters:
        - text (str): The input text with bigrams.

        Returns:
        - str: Text with bigrams replaced by mapped characters.
        """
        for bigram, char in self.mapping.items():
            text = text.replace(bigram, char)
        return text

    def reconstruct_original(self, text):
        """
        Reverses the replacement process by reconstructing the original bigrams.

        Parameters:
        - text (str): The input text with mapped characters.

        Returns:
        - str: Text with mapped characters replaced by original bigrams.
        """
        # Create a reverse mapping (mapped char -> original bigram)
        reverse_mapping = {v: k for k, v in self.mapping.items()}
        for char, bigram in reverse_mapping.items():
            text = text.replace(char, bigram)
        return text

    def encode_character(self, char):
        """
        Encodes a single character into grid coordinates.

        Parameters:
        - char (str): The character to encode.

        Returns:
        - str or None: Encoded bigram (row+column) or None if the character is not in the grid.
        """
        is_upper = char.isupper()
        char = char.lower()
        for row_key, row_val in self.grid.items():
            for col_key, val in row_val.items():
                if val == char:
                    # Return coordinates with appropriate case
                    return (row_key.upper() if is_upper else row_key) + col_key
        return None

    def decode_coordinates(self, coords):
        """
        Decodes grid coordinates back into a character.

        Parameters:
        - coords (str): The grid coordinates (row+column).

        Returns:
        - str or None: Decoded character or None if coordinates are invalid.
        """
        row, col = coords[0], coords[1]
        if row.lower() in self.grid and col in self.grid[row.lower()]:
            decoded_char = self.grid[row.lower()][col]
            if decoded_char is None:
                # Empty (padding) cell: there is nothing to decode
                return None
            # Return with appropriate case
            return decoded_char.upper() if row.isupper() else decoded_char
        return None

    def encode_string(self, input_string):
        """
        Encodes an entire string using the grid and optional bigram replacement.

        Parameters:
        - input_string (str): The string to encode.

        Returns:
        - str: Encoded string with characters replaced by grid coordinates and bigrams replaced if enabled.
        """
        encoded = []
        for char in input_string:
            if char.lower() in self.alphabet:
                # Encode valid characters
                encoded_char = self.encode_character(char)
                if encoded_char:
                    encoded.append(encoded_char)
            else:
                # Preserve invalid characters as-is
                encoded.append(char)

        # Combine encoded characters into a single string
        encoded_string = ''.join(encoded)

        # Apply bigram replacements if enabled
        if self.mixed_syllable:
            return self.replace_bigrams(encoded_string)
        return encoded_string

    def decode_string(self, encoded_string):
        """
        Decodes an encoded string back into the original text.

        Parameters:
        - encoded_string (str): The string to decode.

        Returns:
        - str: Decoded string with original characters restored.
        """
        # Reverse bigram replacements if enabled
        if self.mixed_syllable:
            encoded_string = self.reconstruct_original(encoded_string)

        decoded = []
        i = 0

        # Process encoded string in pairs (bigram decoding)
        while i < len(encoded_string):
            # Check if current and next character form valid coordinates
            if (i + 1 < len(encoded_string) and
                encoded_string[i].lower() in self.grid and
                encoded_string[i+1] in self.grid[encoded_string[i].lower()]):
                # Decode the bigram
                coord = encoded_string[i:i+2]
                decoded_char = self.decode_coordinates(coord)
                if decoded_char:
                    decoded.append(decoded_char)
                i += 2  # Move to the next bigram
            else:
                # Preserve invalid or standalone characters as-is
                decoded.append(encoded_string[i])
                i += 1

        return ''.join(decoded)


    @staticmethod
    def reverse_letters_in_words(input_string):
        """
        Reverses only the alphabetic letters in each word, preserving the positions of non-alphabetic characters.

        Parameters:
        - input_string (str): The input string.

        Returns:
        - str: String with letters reversed in each word.
        """
        def reverse_word(word):
            # Extract letters and their positions
            letters = [char for char in word if char.isalpha()]
            non_letters = [(i, char) for i, char in enumerate(word) if not char.isalpha()]

            # Reverse the letters
            reversed_letters = letters[::-1]

            # Reinsert non-alphabetic characters at their original positions
            for i, char in non_letters:
                reversed_letters.insert(i, char)

            return ''.join(reversed_letters)

        # Replace specific punctuation with spaces before processing
        translation_table = str.maketrans("-–'", "   ")
        input_string = input_string.translate(translation_table)

        # Reverse letters in each word
        words = input_string.split()
        reversed_words = [reverse_word(word) for word in words]

        return ' '.join(reversed_words)

    @staticmethod
    def unreverse_letters_in_words(input_string):
        """
        Reverses the effect of reverse_letters_in_words, restoring the original order of letters in each word.

        Parameters:
        - input_string (str): The input string.

        Returns:
        - str: String with letters restored to their original order.
        """
        # The operation is essentially identical to `reverse_letters_in_words`
        # because reversing twice restores the original order.
        # A static method cannot see its sibling by bare name, so qualify it with the class.
        return Conlanger.reverse_letters_in_words(input_string)

    def reverse_and_encode_string(self, input_string):
        """
        Reverses letters in each word of the input string and encodes the result using the grid.

        Parameters:
        - input_string (str): The string to reverse and encode.

        Returns:
        - str: Encoded string after reversing letters in each word.
        """
        # Reverse letters in words
        reversed_string = self.reverse_letters_in_words(input_string)

        # Encode the reversed string
        return self.encode_string(reversed_string)

    def decode_and_unreverse_string(self, encoded_string):
        """
        Decodes an encoded string and restores the original letter order in words.

        Parameters:
        - encoded_string (str): The encoded string to decode and unreverse.

        Returns:
        - str: Decoded string with letters restored to their original order.
        """
        # Decode the string
        decoded_string = self.decode_string(encoded_string)

        # Restore original letter order
        return self.unreverse_letters_in_words(decoded_string)

    @staticmethod
    def reverse_entire_string(text):
        """
        Reverses the entire string by:
        - Reversing sentences.
        - Adjusting punctuation, parentheses, and numbers.

        Parameters:
        - text (str): The input string.

        Returns:
        - str: The fully reversed string.
        """
        def tokenize_into_sentences(text):
            """
            Tokenizes text into sentences.

            Parameters:
            - text (str): The input text.

            Returns:
            - list: List of sentences.
            """
            return sent_tokenize(text)

        def reverse_parentheses(text):
            """
            Reverses parentheses and brackets in the text.

            Parameters:
            - text (str): The input text.

            Returns:
            - str: Text with reversed parentheses and brackets.
            """
            return text.translate(str.maketrans("()[]{}", ")(][}{"))

        def reverse_numbers_in_string(text):
            """
            Reverses numbers in the string while leaving other content unchanged.

            Parameters:
            - text (str): The input text.

            Returns:
            - str: Text with reversed numbers.
            """
            return re.sub(r'\d+', lambda x: x.group(0)[::-1], text)

        # Tokenize text into sentences
        tokenized_sentences = tokenize_into_sentences(text)
        reversed_texts = []

        for sentence in tokenized_sentences:
            # Reverse the sentence
            reversed_text = sentence[::-1]

            # Adjust punctuation at the end of the sentence
            if reversed_text and reversed_text[0] in '.?!':
                reversed_text = reversed_text[1:] + reversed_text[0]

            # Fix formatting issues (e.g., misplaced spaces)
            reversed_text = re.sub(r'\.(?!$)', '', reversed_text)
            reversed_text = re.sub(r' ,', ', ', reversed_text)
            reversed_text = re.sub(r' ;', '; ', reversed_text)

            # Reverse parentheses/brackets
            reversed_text = reverse_parentheses(reversed_text)

            # Add the processed sentence to the result
            reversed_texts.append(reversed_text.lower())

        # Join sentences with single spaces and reverse numbers
        final_reversed_text = ' '.join(reversed_texts)
        return reverse_numbers_in_string(final_reversed_text)



    def save_to_json(self, file_path):
        """
        Saves the grid, mapping and alphabet to a single JSON file.

        Parameters:
        - file_path (str): Path to the JSON file where data will be saved.
        """
        # Combine grid, mapping and alphabet into a single dictionary
        data = {
            "grid": self.grid,
            "mapping": self.mapping,
            "alphabet": self.alphabet
        }

        # Write the combined dictionary to a JSON file
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    def load_from_json(self, file_path):
        """
        Loads the grid, mapping and alphabet from a single JSON file.

        Parameters:
        - file_path (str): Path to the JSON file where data is stored.
        """
        # Read the combined dictionary from the JSON file
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # Extract and assign grid, mapping and alphabet
        self.grid = data.get("grid", {})
        self.mapping = data.get("mapping", {})
        self.alphabet = data.get("alphabet", [])  # the alphabet is a list, not a dict
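
The core idea of the grid cipher above (each plaintext character is replaced by the row and column labels of its cell) can be illustrated in isolation. The following is a minimal sketch under simplifying assumptions: a fixed 5×5 open (CV) grid, lowercase input only, and no bigram-to-character mapping. The helper names `build_grid`, `encode` and `decode` are hypothetical and are not part of the `Conlanger` class.

```python
import random

def build_grid(rows, cols, alphabet, rng):
    """Place each alphabet character in a random cell; pad unused cells with None."""
    cells = alphabet + [None] * (len(rows) * len(cols) - len(alphabet))
    rng.shuffle(cells)
    it = iter(cells)
    return {r: {c: next(it) for c in cols} for r in rows}

def encode(text, grid):
    """Replace every character found in the grid by its row label + column label."""
    lookup = {ch: r + c
              for r, row in grid.items()
              for c, ch in row.items() if ch is not None}
    return "".join(lookup.get(ch, ch) for ch in text)

def decode(text, grid):
    """Read the text pairwise, turning valid coordinates back into characters."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if (len(pair) == 2 and pair[0] in grid
                and grid[pair[0]].get(pair[1]) is not None):
            out.append(grid[pair[0]][pair[1]])
            i += 2
        else:
            out.append(text[i])  # spaces, punctuation, unknown characters
            i += 1
    return "".join(out)

rng = random.Random(0)                        # seeded for reproducibility
rows = list("ptkmn")                          # consonants label the rows
cols = list("aeiou")                          # vowels label the columns
alphabet = list("abcdefghijklmnopqrstuvwxy")  # 25 characters fill the 5x5 grid
grid = build_grid(rows, cols, alphabet, rng)

ciphertext = encode("hello world", grid)
print(ciphertext)
print(decode(ciphertext, grid))
```

Because every encoded character becomes exactly one consonant-vowel pair, the ciphertext looks like a sequence of open syllables, which is what gives the ciphered text its conlang-like surface.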


In [None]:
# Conlanger @ functions
def lowercase_between_at(text):
    """
    Lowercases everything between '@' characters, except if the entire
    match is exactly '@Latin@' or '@French@'.
    """
    pattern = re.compile(r'@(.*?)@')

    def _replacer(m):
        full_match = m.group(0)  # The entire '@...@' including the @ signs
        inside_text = m.group(1) # The text between the @ signs
        if full_match in ('@Latin@', '@French@'):
            # Return as-is for these exact matches
            return full_match
        else:
            # Otherwise, lowercase the inside
            return f"@{inside_text.lower()}@"

    return pattern.sub(_replacer, text)


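# Conlanger @ functions
def encode_at_strings_no_rev(text, conlanger):
    """
    Lowercases the text between '@...@' markers (except '@Latin@' and '@French@'),
    then replaces each marked span with its conlanger.encode_string encoding.
    """
    text = lowercase_between_at(text)
    return re.sub(r'@(.*?)@', lambda m: conlanger.encode_string(m.group(1)), text)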

In [None]:
def encode_at_strings_rev(text, conlanger):
    """
    Finds all text between '@...@', reverses it using conlanger.reverse_entire_string,
    encodes it with conlanger.encode_string, and then *post-processes* the
    result by capitalizing it if the original substring was exactly "French".
    """
    def replacer(m):
        original_substring = m.group(1)
        # Reverse the entire matched substring
        reversed_text = conlanger.reverse_entire_string(original_substring)
        # Encode the reversed text
        encoded_text = conlanger.encode_string(reversed_text)

        # === Post-processing step: if the original was "French", capitalize ===
        if original_substring == "French":
            encoded_text = encoded_text.capitalize()

        return encoded_text

    # Apply the replacer to anything between @...@
    return re.sub(r'@(.*?)@', replacer, text)
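
The `@...@` convention used by the functions above can be tried out without a `Conlanger` instance. This is a minimal sketch in which `transform_at_spans` and `toy_cipher` are hypothetical names; `toy_cipher` is a plain string reversal standing in for `conlanger.encode_string`.

```python
import re

# Non-greedy match so that each @...@ pair is handled separately
AT_SPAN = re.compile(r'@(.*?)@')

def transform_at_spans(text, transform):
    """Apply `transform` to the inside of every @...@ span and drop the @ delimiters."""
    return AT_SPAN.sub(lambda m: transform(m.group(1)), text)

def toy_cipher(s):
    """Hypothetical stand-in for an encoder: reverse the string."""
    return s[::-1]

template = "The plural of @cou@ is @cous@."
print(transform_at_spans(template, toy_cipher))
```

Text outside the markers is left untouched, so the English instructions in a prompt template stay readable while only the marked conlang material is ciphered.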


# Instance mix selection


In [None]:
!pip install ijson

Collecting ijson
  Downloading ijson-3.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Downloading ijson-3.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (119 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 119.2/119.2 kB 6.4 MB/s eta 0:00:00
Installing collected packages: ijson
Successfully installed ijson-3.3.0


In [None]:
# Build list of relevant JSON files
import os
from pprint import pprint

# Define the root directory
root_dir = "/content/extracted_SWELLS/SWELLS_Dataset/Train/Base"

# List to store the paths of the relevant files
file_list = []

# Walk through the directory
for subdir, _, files in os.walk(root_dir):
    for file in files:
        if "Tralalam_" in file and file.endswith(".json"):
            file_path = os.path.join(subdir, file)
            file_list.append(file_path)

pprint(sorted(file_list))
print(f"{len(file_list)} relevant files were found.")

['/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_1_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_f_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_m_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_3_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_4_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_5_f_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_5_m_train_CoT_completions_no_rev_art_eng.json',
 '/content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_6_f_train_CoT_completions_no_rev_art_eng.json'

In [None]:
# Get per-file instance counts
import os
import ijson

# Define the root directory
root_dir = "/content/extracted_SWELLS/SWELLS_Dataset/Train/Base"

# Walk through the directory
processed_files = 0
total_instances = 0
for subdir, _, files in os.walk(root_dir):
    for file in sorted(files):
        if "Tralalam_" in file and file.endswith(".json"):
            file_path = os.path.join(subdir, file)
            instance_count = 0
            try:
                # Stream the JSON data and count instances
                # (binary mode: ijson reads bytes and handles the decoding itself)
                with open(file_path, "rb") as f:
                    # Assuming the JSON contains a top-level array of instances
                    for _ in ijson.items(f, "item"):
                        instance_count += 1
                        total_instances += 1
                print(f"File: {file_path} -> {instance_count} instances")
                processed_files += 1
            except Exception as e:
                print(f"Error processing {file_path}: {e}")

print(f"Finished processing {processed_files} files.")
print(f"Total number of instances: {total_instances}.")

File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_1_train_CoT_completions_no_rev_art_eng.json -> 103 instances
File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_f_train_CoT_completions_no_rev_art_eng.json -> 107 instances
File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_m_train_CoT_completions_no_rev_art_eng.json -> 394 instances
File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_3_train_CoT_completions_no_rev_art_eng.json -> 192 instances
File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_4_train_CoT_completions_no_rev_art_eng.json -> 15016 instances
File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_5_f_train_CoT_completions_no_rev_art_eng.json -> 14216 instances
File: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_5_m_train_CoT_completions_no_rev_art_eng.json -> 4

In [None]:
# Selecting CoT instances
import os
import ijson
import random
import json

files = sorted(file_list[:])

# Output file
output_file = "train.jsonl"

# System prompt
system_prompt = "You are an expert linguist and translator."

# Function to load a file's instances with ijson and draw a random sample
# (note: list(...) holds every instance of the file in memory before sampling)
def sample_streamed_data(file_path, sample_size):
    with open(file_path, 'rb') as f:
        items = list(ijson.items(f, "item"))
    return random.sample(items, min(len(items), sample_size))

# Open the output file for writing
with open(output_file, 'w', encoding='utf-8') as outfile:
    for file_path in files:
        print(f"Processing instances from file: {file_path}.")
        try:
            # Determine the sample size based on file type
            if "eng_art" in file_path:
                sample_size = 60 # Change as desired
            elif "art_eng" in file_path:
                sample_size = 3 # Change as desired
            else:
                # Skip if file doesn't match criteria (safety check)
                continue

            # Sample the data from the file
            sampled_instances = sample_streamed_data(file_path, sample_size)

            # Process each selected instance
            for instance in sampled_instances:
                user_content = instance["CoT_prompt_template"]
                assistant_content = instance["completion_template"]

                # Create the JSONL entry
                jsonl_entry = {
                    "messages": [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_content},
                        {"role": "assistant", "content": assistant_content},
                    ],
                    "source": os.path.basename(file_path),  # Add source variable
                }

                # Write to the JSONL file (UTF-8, no ASCII enforcement)
                outfile.write(json.dumps(jsonl_entry, ensure_ascii=False) + "\n")

        except Exception as e:
            print(f"Error processing file {file_path}: {e}")

print(f"Train dataset saved to {output_file}")


Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_1_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_f_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_m_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_3_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_4_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_5_f_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Tra
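
The sampling helper above builds a full list of a file's instances before drawing from it. If memory becomes a concern for the largest files, one alternative is reservoir sampling, which keeps at most `k` items while consuming an iterator once. The sketch below is a generic implementation (Algorithm R), not code from this notebook; `reservoir_sample` is a hypothetical name, and in this setting the iterable could be the generator returned by `ijson.items(f, "item")`.

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    reservoir = []
    for n, item in enumerate(iterable):
        if n < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (n + 1)
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage on a plain generator; a seeded Random makes the draw reproducible
sample = reservoir_sample((i for i in range(10_000)), 5, random.Random(0))
print(sample)
```

Unlike `random.sample`, the result is not shuffled relative to the stream for short inputs (an iterable with fewer than `k` items is returned in its original order), which does not matter when the training file is shuffled later.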

In [None]:
# Selecting CoT-free instances
import os
import ijson
import random
import json


# Output file
output_file = "train_no_CoT.jsonl"

# System prompt
system_prompt = "You are an expert linguist and translator."

# Function to load a file's instances with ijson and draw a random sample
# (note: list(...) holds every instance of the file in memory before sampling)
def sample_streamed_data(file_path, sample_size):
    with open(file_path, 'rb') as f:
        items = list(ijson.items(f, "item"))
    return random.sample(items, min(len(items), sample_size))

# Open the output file for writing
with open(output_file, 'w', encoding='utf-8') as outfile:
    for file_path in files:
        print(f"Processing instances from file: {file_path}.")
        try:
            # Determine the sample size based on file type
            if "eng_art" in file_path:
                sample_size = 15 # Change as desired
            elif "art_eng" in file_path:
                sample_size = 2 # Change as desired
            else:
                # Skip if file doesn't match criteria (safety check)
                continue

            # Sample the data from the file
            sampled_instances = sample_streamed_data(file_path, sample_size)

            # Process each selected instance
            for instance in sampled_instances:
                user_content = instance["no_CoT_prompt_template"]
                assistant_content = instance["no_CoT_completion_template"]

                # Create the JSONL entry
                jsonl_entry = {
                    "messages": [
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_content},
                        {"role": "assistant", "content": assistant_content},
                    ],
                    "source": os.path.basename(file_path),  # Add source variable
                }

                # Write to the JSONL file (UTF-8, no ASCII enforcement)
                outfile.write(json.dumps(jsonl_entry, ensure_ascii=False) + "\n")

        except Exception as e:
            print(f"Error processing file {file_path}: {e}")

print(f"Train dataset saved to {output_file}")



Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_1_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_f_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_2_m_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_3_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_4_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Train/Base/art_eng_no_rev/Tralalam_5_f_train_CoT_completions_no_rev_art_eng.json.
Processing instances from file: /content/extracted_SWELLS/SWELLS_Dataset/Tra
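
The sampling function above parses each file completely before drawing its sample. If memory becomes a concern with larger files, a single-pass reservoir sampler is one possible alternative. The sketch below is an assumption and not part of the original pipeline: `reservoir_sample` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random in one pass over items.

    Hypothetical helper: at most k items are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1)
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 15, seed=0)
print(len(sample))
```

It could be fed directly from `ijson.items(f, "item")` in place of the `list(...)` call, which would avoid holding the whole file in memory.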

In [None]:
!head train.jsonl

{"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": "user", "content": "@French@ is a recently devised conlang. You are to translate the following @French@ text segment into English with the help of a few dictionary entries and excerpts from a grammar book.\n\nHere is the text segment you must translate:\n@les cous@\n\nAnd here are a few dictionary entries that may be of use to you; note that each entry follows the format: lemma (grammatical gender and/or part of speech) : English equivalent.\n@cou@ (masc. n.): neck\n\nAnd here are relevant excerpts from a grammar book: \n\nBeginning of @French@ Grammar Book Excerpts\n\n\t\t\t\t\t\nNOUNS\n\t\t\t\t\t\n\n* Gender\nA defining feature of @French@ nouns is their grammatical gender, which can be masculine or feminine.\n\n* Number\nAs in English, @French@ nouns inflect for number.\n\nThe plural is generally formed from the singular by appending the morpheme @-s@ (e.g., @maison@ becomes @maisons@

In [None]:
!head train_no_CoT.jsonl

{"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": "user", "content": "@French@ is a recently devised conlang. You are to translate the following @French@ text segment into English with the help of a few dictionary entries and excerpts from a grammar book.\n\nHere is the text segment you must translate:\n@les récitals@\n\nAnd here are a few dictionary entries that may be of use to you; note that each entry follows the format: lemma (grammatical gender and/or part of speech) : English equivalent.\n@récital@ (masc. n.): recital\n\nAnd here are relevant excerpts from a grammar book: \n\nBeginning of @French@ Grammar Book Excerpts\n\n§§§§§§§§§§§§§§§§§§§§§§§§\nARTICLES AND DETERMINERS\n§§§§§§§§§§§§§§§§§§§§§§§§\n\nIn @French@, articles and determiners are almost always required with common nouns, much more so than in English. These words agree in gender (masculine or feminine) and number (singular or plural) with the noun they modify, though m

In [None]:
# Merging selected CoT and CoT-free instances
file_1 = "train.jsonl"
file_2 = "train_no_CoT.jsonl"
output_file = "train_merged.jsonl"

# Open the output file for writing
with open(output_file, 'w', encoding='utf-8') as outfile:
    # Add the contents of the first file
    with open(file_1, 'r', encoding='utf-8') as f1:
        for line in f1:
            outfile.write(line)

    # Add the contents of the second file
    with open(file_2, 'r', encoding='utf-8') as f2:
        for line in f2:
            outfile.write(line)

print(f"Merged dataset saved to {output_file}")


Merged dataset saved to train_merged.jsonl
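
Before enciphering, it can be useful to check that the merged file reflects the intended data mix. The sketch below counts instances by translation direction and cipher mode, using the same substrings of the `source` field that the notebook tests elsewhere (`eng_art`, `art_eng`, `completions_no_rev`, `completions_rev`). The function name `summarize_mix` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_mix(jsonl_lines: Iterable[str]) -> Counter:
    """Count instances per (direction, cipher mode), based on the 'source' field."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        source = json.loads(line).get("source", "")

        if "eng_art" in source:
            direction = "eng_art"
        elif "art_eng" in source:
            direction = "art_eng"
        else:
            direction = "unknown"

        if "completions_no_rev" in source:
            mode = "no_rev"
        elif "completions_rev" in source:
            mode = "rev"
        else:
            mode = "unknown"

        counts[(direction, mode)] += 1
    return counts


example = [
    json.dumps({"source": "Tralalam_4_train_CoT_completions_no_rev_art_eng.json"}),
    json.dumps({"source": "Tralalam_1_train_CoT_completions_no_rev_art_eng.json"}),
]
print(summarize_mix(example))
```

On the real data, the lines would come from `open("train_merged.jsonl", encoding="utf-8")`.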


In [None]:
!wc -l train_merged.jsonl

1760 train_merged.jsonl


In [None]:
!ls -laht train_merged.jsonl

-rw-r--r-- 1 root root 39M Mar 13 18:11 train_merged.jsonl


# Ciphering

In [None]:
import json
import re


# Input and output files
input_file = "train_merged.jsonl"
output_file = "train_enciphered.jsonl"

# Encipher the JSONL file
with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        try:
            # Parse the JSONL line
            data = json.loads(line)
            source = data.get("source", "")
            conlanger = Conlanger()  # Reinstantiate Conlanger for each line

            # Capture grid, alphabet, and mapping for quality checks
            data["grid"] = conlanger.grid
            data["alphabet"] = conlanger.alphabet
            data["mapping"] = conlanger.mapping

            # Determine which function to apply
            if "completions_no_rev" in source:
                # Apply encode_at_strings_no_rev
                data["messages"][1]["content"] = encode_at_strings_no_rev(data["messages"][1]["content"], conlanger)
                data["messages"][2]["content"] = encode_at_strings_no_rev(data["messages"][2]["content"], conlanger)
            elif "completions_rev" in source:
                # Apply encode_at_strings_rev
                data["messages"][1]["content"] = encode_at_strings_rev(data["messages"][1]["content"], conlanger)
                data["messages"][2]["content"] = encode_at_strings_rev(data["messages"][2]["content"], conlanger)
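            else:
                # Skip instances whose source gives no encipherment direction,
                # so that unenciphered text never reaches the output file
                print("ERROR: filename makes no mention of the encipherment direction.")
                continue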

            # Write the modified line to the output file
            outfile.write(json.dumps(data, ensure_ascii=False) + "\n")

        except Exception as e:
            print(f"Error processing line: {line.strip()} - {e}")

print(f"Enciphered dataset saved to {output_file}")


Enciphered dataset saved to train_enciphered.jsonl
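
The plaintext instances mark every conlang string with `@…@` delimiters, and the enciphering step rewrites only those spans while leaving the English instructions untouched. The toy sketch below illustrates that idea with a plain letter substitution. It is a stand-in only: the notebook's `Conlanger` class and `encode_at_strings_*` functions are defined elsewhere and, judging by the sample outputs, use a different scheme. `make_toy_mapping`, `toy_encode`, and `encode_at_spans` are hypothetical names.

```python
import random
import re
import string


def make_toy_mapping(seed: int) -> dict:
    """Build a random one-to-one mapping over a-z (toy cipher, not Conlanger)."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))


def toy_encode(text: str, mapping: dict) -> str:
    """Substitute mapped letters, preserving case; other characters pass through."""
    out = []
    for ch in text:
        low = ch.lower()
        if low in mapping:
            sub = mapping[low]
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            out.append(ch)
    return "".join(out)


def encode_at_spans(text: str, mapping: dict) -> str:
    """Encode each @...@ span and drop the delimiters; leave the rest as is."""
    return re.sub(r"@([^@]*)@", lambda m: toy_encode(m.group(1), mapping), text)


mapping = make_toy_mapping(seed=0)
print(encode_at_spans("Translate @les cous@ into English.", mapping))
```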


In [None]:
# Shuffling instances
import random

def shuffle_jsonl(file_path):
    """
    Shuffles the lines in a JSONL file in place.

    Parameters:
        file_path (str): The path to the JSONL file to shuffle.
    """
    # Read all lines from the file
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Shuffle the lines
    random.shuffle(lines)

    # Write the shuffled lines back to the file
    with open(file_path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line)

    print(f"Shuffled lines in {file_path}")

# Example usage
shuffle_jsonl("train_enciphered.jsonl")


Shuffled lines in train_enciphered.jsonl
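
The shuffle above uses the global random state, so each run produces a different order. If a reproducible order is wanted, one option is to shuffle with a dedicated, seeded generator. A minimal sketch, assuming a hypothetical helper `shuffled_copy` that works on a list of lines rather than on the file itself:

```python
import random
from typing import List, Optional


def shuffled_copy(lines: List[str], seed: Optional[int] = None) -> List[str]:
    """Return a shuffled copy of lines; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global state is untouched
    result = list(lines)
    rng.shuffle(result)
    return result


lines = [f"instance {i}\n" for i in range(5)]
print(shuffled_copy(lines, seed=42) == shuffled_copy(lines, seed=42))  # True
```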


In [None]:
!head train_enciphered.jsonl

{"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": "user", "content": "Vèzçécpèd is a recently devised conlang. You are to translate the following English segment into Vèzçécpèd with the help of a few dictionary entries and excerpts from a grammar book.\n\nHere is the text segment you must translate:\nthe championships\n\nHere are a few dictionary entries that may be of use to you; note that each entry follows the format: English lemma : Vèzçécpèd equivalent (grammatical gender and/or part of speech).\nchampionship : pèdqrsxvéccqg (masc. n.)\n\nAnd here are relevant excerpts from a grammar book: \n\nBeginning of Vèzçécpèd Grammar Book Excerpts\n\n\t\t\t\t\t\nNOUNS\n\t\t\t\t\t\n\n§ Case\nIn Vèzçécpèd, nouns are not inflected for other grammatical distinctions. (Case and person inflections apply only to personal pronouns.)\n\n§ Number\nAs in English, Vèzçécpèd nouns inflect for number.\n\nThe plural is generally formed from the singular by

In [None]:
# Sanity check and post-processing
import json
import re

# Input file
enciphered_file = "train_enciphered.jsonl"

# Open the enciphered JSONL file for verification
with open(enciphered_file, 'r', encoding='utf-8') as infile:
    for line in infile:
        try:
            # Parse the JSONL line
            data = json.loads(line)
            source = data.get("source", "")
            grid = data.get("grid")
            mapping = data.get("mapping")
            alphabet = data.get("alphabet")

            # Rebuild the Conlanger from the cipher parameters stored with this instance
            conlanger = Conlanger()
            conlanger.grid = grid
            conlanger.mapping = mapping
            conlanger.alphabet = alphabet

            # Extract the first word from user content
            user_content = data["messages"][1]["content"]
            first_word = re.match(r"(\w+)", user_content).group(1)

            # Determine the clear first word based on source
            if "completions_no_rev" in source:
                clear_first_word = conlanger.decode_string(first_word)
                assert clear_first_word == "French", f"Mismatch in no_rev: {clear_first_word} != French"
            elif "completions_rev" in source:
                clear_first_word = conlanger.reverse_entire_string(conlanger.decode_string(first_word.lower()))
                assert clear_first_word == "french", f"Mismatch in rev: {clear_first_word} != french"
            else:
                print("ERROR: filename makes no mention of the encipherment direction.")
                continue

            print(f"Line passed verification: {line.strip()[:100]}...")

        except AssertionError as e:
            print(f"Assertion failed: {e}")
        except Exception as e:
            print(f"Error processing line: {line.strip()} - {e}")

print("Verification completed.")


Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": ...
Line passed verification: {"messages": [{"role": "system", "content": "You are an expert l
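
The check above only decodes the first word of each prompt. A complementary check, sketched below, flags instances that still contain an `@` delimiter after enciphering. It rests on an assumption drawn from the sample outputs, namely that enciphering removes every `@` marker; any legitimate `@` in the text would show up as a false positive. `lines_with_residual_markers` is a hypothetical name.

```python
import json
from typing import Iterable, List


def lines_with_residual_markers(jsonl_lines: Iterable[str], marker: str = "@") -> List[int]:
    """Return 1-based line numbers whose message contents still contain marker."""
    flagged = []
    for number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        contents = [m.get("content", "") for m in record.get("messages", [])]
        if any(marker in content for content in contents):
            flagged.append(number)
    return flagged


example = [
    json.dumps({"messages": [{"role": "user", "content": "enciphered text"}]}),
    json.dumps({"messages": [{"role": "user", "content": "@French@ left in clear"}]}),
]
print(lines_with_residual_markers(example))  # [2]
```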

# Training cost estimation

In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 18.8 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [None]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

data_path = "/content/train_enciphered.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 65536 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 65,536 token limit; they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 65536

TARGET_EPOCHS = 1
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
cost_per_million_tokens = 3.000  # USD per million training tokens; update to the current rate
total_cost = (n_epochs * n_billing_tokens_in_dataset / 1_000_000) * cost_per_million_tokens
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens, costing approximately ${total_cost:.2f}")


Num examples: 1760
First example:
{'role': 'system', 'content': 'You are an expert linguist and translator.'}
{'role': 'assistant', 'content': 'To translate the English phrase "the old festivals that I watched" into Ynakendmaz, we need to break down the components of the phrase based on the provided grammar and dictionary entries, identify their translation and order these constituents correctly.\n\n1. **Identify the elements and their properties**:\n- The definite article in Ynakendmaz for plural nouns (regardless of gender) is "qenah."\n- The English word "festivals" corresponds to the Ynakendmaz noun "ynenahsuxrq," which is a masculine noun.\n- According to the grammar book, the plural form of "ynenahsuxrq" is "ynenahsuxrqah.\n- The word "old" is an English adjective, in this context, and translates to "xuenygèt."\n- The adjective "xuenygèt" (old) needs to be in plural form as well. According to the grammar book, the plural of adjectives such as \'xuenygèt\' is identical to the sing
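
The cost estimate in the cell above reduces to one formula: billable tokens (each example capped at the per-example limit) times the number of epochs, divided by one million, times the rate per million tokens. The sketch below isolates that arithmetic in a hypothetical helper, `estimate_training_cost`; the rate used in the example is the notebook's placeholder value and may not match current pricing.

```python
from typing import Iterable


def estimate_training_cost(
    token_counts: Iterable[int],
    n_epochs: int,
    usd_per_million_tokens: float,
    max_tokens_per_example: int = 65536,
) -> float:
    """Estimate the training cost in USD from per-example token counts."""
    # Tokens beyond the per-example cap are truncated and not billed
    billable = sum(min(max_tokens_per_example, n) for n in token_counts)
    return n_epochs * billable / 1_000_000 * usd_per_million_tokens


# Two examples of 500,000 tokens each are capped at 65,536 tokens apiece
print(estimate_training_cost([500_000, 500_000], n_epochs=1, usd_per_million_tokens=3.0))
```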

# Final formatting

In [None]:
import json

# Input and output files
input_file = "train_enciphered.jsonl"
output_file = "train_final.jsonl"

# Process the enciphered JSONL file to keep only the "messages" field
with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        try:
            # Parse the JSONL line
            data = json.loads(line)

            # Extract only the "messages" field
            final_data = {"messages": data["messages"]}

            # Write the simplified data to the output file
            outfile.write(json.dumps(final_data, ensure_ascii=False) + "\n")

        except Exception as e:
            print(f"Error processing line: {line.strip()} - {e}")

print(f"Final dataset saved to {output_file}")


Final dataset saved to train_final.jsonl
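
A quick structural check on the final file can confirm that the auxiliary fields (`source`, `grid`, `alphabet`, `mapping`) were dropped and that every instance keeps the system, user, assistant order written by the selection cells. A minimal sketch, with `is_clean_record` as a hypothetical helper name:

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def is_clean_record(line: str) -> bool:
    """Return True if the JSONL line holds only 'messages' with the expected roles."""
    record = json.loads(line)
    if set(record) != {"messages"}:
        return False  # extra or missing top-level keys
    roles = [message.get("role") for message in record["messages"]]
    return roles == EXPECTED_ROLES


good = json.dumps({"messages": [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "u"},
    {"role": "assistant", "content": "a"},
]})
print(is_clean_record(good))  # True
```

Applied to `train_final.jsonl`, `all(is_clean_record(line) for line in f)` would be expected to hold.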


In [None]:
!head train_final.jsonl

{"messages": [{"role": "system", "content": "You are an expert linguist and translator."}, {"role": "user", "content": "Vèzçécpèd is a recently devised conlang. You are to translate the following English segment into Vèzçécpèd with the help of a few dictionary entries and excerpts from a grammar book.\n\nHere is the text segment you must translate:\nthe championships\n\nHere are a few dictionary entries that may be of use to you; note that each entry follows the format: English lemma : Vèzçécpèd equivalent (grammatical gender and/or part of speech).\nchampionship : pèdqrsxvéccqg (masc. n.)\n\nAnd here are relevant excerpts from a grammar book: \n\nBeginning of Vèzçécpèd Grammar Book Excerpts\n\n\t\t\t\t\t\nNOUNS\n\t\t\t\t\t\n\n§ Case\nIn Vèzçécpèd, nouns are not inflected for other grammatical distinctions. (Case and person inflections apply only to personal pronouns.)\n\n§ Number\nAs in English, Vèzçécpèd nouns inflect for number.\n\nThe plural is generally formed from the singular by