- Train llama7B-chat-uncensored using Dataset -> Push Output (Ganymede) to HF
- Quantize Ganymede -> Push Ganymede.GGUF to HF
- Adjust Modelfile to use Ganymede.GGUF -> ollama create Ganymede -f Modelfile_Ganymede
- Check Discord implementation

In [2]:
from huggingface_hub import HfApi
from datasets import load_dataset
import subprocess
import json
import os

def convert_to_jsonl(input_directory, output_filename):
    os.makedirs(os.path.dirname(output_filename), exist_ok=True)
    
    with open(output_filename, 'w', encoding='utf-8') as jsonl_file:
        for input_filename in os.listdir(input_directory):
            if input_filename.endswith('.txt'):
                bot_name = input_filename[:-4]  # Remove '.txt' from filename to get bot_name
                file_path = os.path.join(input_directory, input_filename)
                
                with open(file_path, 'r', encoding='utf-8') as file:
                    conversation_blocks = file.read().split('\n\n')
                
                for block in conversation_blocks:
                    lines = block.split('\n')
                    question_lines = []
                    response_lines = []

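                    for line in lines:
                        # Slice off only the leading marker; str.replace would also
                        # delete any later "HUMAN:"/"RESPONSE:" inside the text itself
                        if line.startswith("HUMAN:"):
                            question_lines.append(line[len("HUMAN:"):].strip())
                        elif line.startswith("RESPONSE:"):
                            response_lines.append(line[len("RESPONSE:"):].strip())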

                    if question_lines and response_lines:
                        # Construct the entry
                        entry = {
                            "system": f"{bot_name}",
                            "question": " ".join(question_lines),
                            "response": " ".join(response_lines)
                        }
                        json.dump(entry, jsonl_file)
                        jsonl_file.write('\n')

# Configuration
base_dir = 'F:/discollama'
repo_dir = 'datasets/TRACHI'
input_directory = os.path.join(base_dir, 'datasets', 'raw')
output_filename = os.path.join(base_dir, repo_dir, 'train.jsonl')

# Convert txt files to a single jsonl file
convert_to_jsonl(input_directory, output_filename)

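# Log in with the token from the environment (the CLI expects it via --token)
subprocess.run(["huggingface-cli", "login", "--token", os.getenv('HUGGINGFACE_TOKEN')], check=True)
# Load the generated file explicitly so the result does not depend on the working directory
dataset = load_dataset("json", data_files={"train": output_filename})
dataset.push_to_hub("norygano/TRACHI")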

In [4]:
import json
import subprocess
import os
from huggingface_hub import HfApi

api = HfApi()

# List of bot names
bot_names = ['Daphne', 'Ganymede']


def convert_to_jsonl(input_filename, output_filename):
    os.makedirs(os.path.dirname(output_filename), exist_ok=True)
    with open(input_filename, 'r', encoding='utf-8') as file:
        conversation_blocks = file.read().split('\n\n')

    conversations = []

    for block in conversation_blocks:
        # Initialize formatted_block for accumulating the conversation
        formatted_block = ""
        lines = block.split('\n')

        for line in lines:
            # Prepend "### " to lines starting with "HUMAN:" or "RESPONSE:"
            if line.startswith("HUMAN:") or line.startswith("RESPONSE:"):
                formatted_block += f"### {line}\n\n"
            else:
                # Handle any additional text that might be part of the response
                formatted_block += line + "\n\n"

        # Strip first so whitespace-only blocks are skipped instead of
        # being written out as empty records
        formatted_block = formatted_block.strip()
        if formatted_block:
            conversations.append({"text": formatted_block + "\n\n"})

    with open(output_filename, 'w', encoding='utf-8') as jsonl_file:
        for conversation in conversations:
            json.dump(conversation, jsonl_file)
            jsonl_file.write('\n')

def git_commit_push(repo_path, file_name, commit_message):
    os.chdir(repo_path)  # Change working directory to repo path
    subprocess.run(["git", "pull"])  # Pull latest changes
    subprocess.run(["git", "add", file_name])  # Add the .jsonl file to staging
    subprocess.run(["git", "commit", "-m", commit_message])  # Commit changes
    subprocess.run(["git", "push"])  # Push changes to remote
    os.chdir(base_dir)

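def hf_push(repo_path):
    # Ensure we're in the correct directory
    os.chdir(repo_path)

    # Log in to HF (the CLI expects the token via --token)
    subprocess.run(["huggingface-cli", "login", "--token", os.getenv('HUGGINGFACE_TOKEN')])

    # Derive the bot name from the folder name instead of relying on the global loop variable
    bot_name = os.path.basename(os.path.normpath(repo_path))
    hf_repo_path = f'norygano/{bot_name}'
    print(f'Pushing {repo_path} to {hf_repo_path}')
    api.upload_folder(
        folder_path=repo_path,
        repo_id=hf_repo_path,
        repo_type="dataset",
    )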



#References
base_dir = 'F:/discollama'  # Adjust base directory as necessary
os.chdir(base_dir)
print(os.getcwd())

for bot_name in bot_names:
    repo_path = os.path.join(base_dir, 'datasets', bot_name)
    raw_input_path = os.path.join(base_dir, 'datasets', 'raw', f'{bot_name}.txt')
    output_filename = 'train.jsonl'
    output_path = os.path.join(repo_path, output_filename)
    commit_message = f'Updated dataset with new conversations for {bot_name}'

    # Convert txt to jsonl
    convert_to_jsonl(raw_input_path, output_path)
    
    # Commit and push the changes
    git_commit_push(repo_path, output_filename, commit_message)
    
    # Push to Hugging Face
    hf_push(repo_path)


F:\discollama
Pushing F:/discollama\datasets\Daphne to norygano/Daphne
Pushing F:/discollama\datasets\Ganymede to norygano/Ganymede


#--- Writing Template ---
HUMAN: Can you explain the significance of the Turing Test in artificial intelligence?
RESPONSE: The Turing Test, proposed by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. According to the test, if a human judge cannot reliably tell the machine from a human based on their responses to questions, the machine is considered to have passed the test. This test has been significant in discussions of artificial intelligence as it focuses on a machine's ability to simulate human-like intelligence, rather than just executing tasks. It raises questions about what it means to think and whether a machine can possess qualities like consciousness or understanding.

HUMAN: [Next Prompt]
RESPONSE: [Next Response]
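
Since both conversion cells silently skip or mangle blocks that do not follow this layout, a quick check of the raw files before converting may help. The following is a minimal sketch, not part of the original notebook: `find_malformed_blocks` is a hypothetical helper, and it assumes each block is separated by a blank line, starts with a `HUMAN:` line, and contains at least one later `RESPONSE:` line.

```python
def find_malformed_blocks(raw_text):
    """Return (index, block) pairs for blocks that break the HUMAN:/RESPONSE: layout."""
    malformed = []
    # Blocks are separated by a blank line; ignore empty leftovers
    blocks = [b for b in raw_text.split("\n\n") if b.strip()]
    for index, block in enumerate(blocks):
        lines = block.strip().split("\n")
        starts_ok = lines[0].startswith("HUMAN:")
        has_response = any(line.startswith("RESPONSE:") for line in lines[1:])
        if not (starts_ok and has_response):
            malformed.append((index, block))
    return malformed


# Illustrative sample: block 0 is fine, blocks 1 and 2 are not
sample = (
    "HUMAN: Hi\nRESPONSE: Hello!\n\n"
    "RESPONSE: Orphaned answer\n\n"
    "HUMAN: Missing answer"
)
problems = find_malformed_blocks(sample)
print([index for index, _ in problems])
```

Running it over each file in datasets/raw and fixing any reported blocks by hand keeps the generated train.jsonl free of half-formed conversations.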

The distinction between params and template, for a model like the one from the URL you've mentioned, comes down to this: params configure how the model behaves while generating responses, and the template defines how the input is formatted before the model sees it.
Params

The params specify settings that control the model's behavior during inference, i.e., when it's generating responses. In your case:

    stop is an array of strings that tells the model when to stop generating text. It's like saying, "If you encounter any of these strings ("### HUMAN:", "### RESPONSE:"), consider your response complete and stop generating more text." This is useful for ensuring that the model's output is bounded and doesn't go beyond the expected response format.
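
The effect of stop can be illustrated in a few lines. This is only a sketch of the idea: the real runtime halts generation as soon as a stop string appears rather than trimming afterwards, and `truncate_at_stop` is a hypothetical helper name, not part of any library.

```python
def truncate_at_stop(text, stop=("### HUMAN:", "### RESPONSE:")):
    """Return text cut at the earliest occurrence of any stop string."""
    cut = len(text)
    for marker in stop:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Without a stop condition the model might keep going and invent the next turn
raw_output = "Paris is the capital of France.\n\n### HUMAN: And Germany?"
print(repr(truncate_at_stop(raw_output)))
```

Everything from the first marker onward is dropped, which is why the training data should use exactly the same markers as the stop list.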

Template

The template is a format string that defines how the input (prompt) to the model should be structured. It's a way of telling the model, "Here's how I'll present the data to you, and here's where you come in to generate a response." The placeholders ({{ .System }}, {{ .Prompt }}) are replaced with actual values at runtime. In this case:

    {{ .System }} might be replaced with any system-level instructions or context needed for generating a response.
    ### HUMAN: indicates where the human's part of the conversation starts.
    {{ .Prompt }} is where the actual user query or statement goes.
    ### RESPONSE: marks where the model's reply begins; the model continues the text from this point.
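
To make the substitution concrete, here is a rough Python equivalent of filling in the template. The exact line layout in `ASSUMED_TEMPLATE` is an assumption based on the placeholders listed above, and `render_prompt` is a hypothetical helper; check the model's actual template before relying on it.

```python
# Assumed layout: system text, then the human turn, then the response marker.
# {system} and {prompt} stand in for {{ .System }} and {{ .Prompt }}.
ASSUMED_TEMPLATE = "{system}\n\n### HUMAN:\n{prompt}\n\n### RESPONSE:\n"


def render_prompt(system, prompt, template=ASSUMED_TEMPLATE):
    """Fill the placeholders the way the runtime would before calling the model."""
    return template.format(system=system, prompt=prompt)


rendered = render_prompt("You are Ganymede.", "What is the Turing Test?")
print(rendered)
```

The rendered string ends right after the response marker, so whatever the model generates next is its answer. Training records should follow the same shape so that fine-tuning and inference see identical formatting.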

To rewrite the initial script to accommodate the model's params/template, you would adjust the transformation function to format the dataset according to the template structure expected by the model. Here's how you might adjust the transform_conversation function: