<a href="https://colab.research.google.com/github/AnnaValentinaHirsch/Web3CodeLLM/blob/main/NEARlabelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook labels the training data used to fine-tune the StarCoder2 model for the NEAR dApps domain. The OpenAI API generates the labels (user prompts) corresponding to the GitHub repositories, their tree structures, and their README contents.**

In [None]:
%%capture
%pip install pandas numpy openai python-dotenv datasets transformers torch matplotlib

In [None]:
import os
import time
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from datasets import Dataset, load_dataset # huggingface
from transformers import pipeline # summarizer
from openai import OpenAI # client class from openai>=1.0
from dotenv import load_dotenv

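# Load OPENAI_API_KEY from a .env file if present; 'secret' is only a placeholder fallback
load_dotenv()
os.environ.setdefault('OPENAI_API_KEY', 'secret')
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])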

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Check GPU availability (device 0 is the first GPU, -1 is the CPU, as expected by transformers pipelines)
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

In [None]:
# Free up GPU memory
torch.cuda.empty_cache()

In [None]:
# Load environment variables from .env file
load_dotenv()

# Load Dataset
dataset = load_dataset('jcarbonnell/structTuningNEAR')

# Convert the train split of the dataset to a pandas DataFrame
train = dataset['train'].to_pandas()

In [None]:
# Order by README length in characters (shortest first)
train = train.sort_values(by='readme', key=lambda x: x.str.len(), ascending=True)

# Drop empty or near-empty READMEs (200 characters or fewer)
train = train[train['readme'].str.len() > 200]
# Drop very long READMEs (50,000 characters or more) that caused crashes
train = train[train['readme'].str.len() < 50000]

# Reset index and drop the old index
train = train.reset_index(drop=True)

In [None]:
# Calculate the word count of each README file
train['readme_word_count'] = train['readme'].apply(lambda x: len(str(x).split()))

# Plot the histogram of README word counts
plt.figure(figsize=(10, 6))
plt.hist(train['readme_word_count'], bins=30, edgecolor='black')
plt.title('Histogram of README File Word Counts')
plt.xlabel('Number of Words in README')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Display the 20 longest README files by word count
print(train.nlargest(20, 'readme_word_count')[['repoName', 'readme_word_count']])

In [None]:
# Define parameters
engine = "gpt-4o"
max_output_tokens = 500
example = """Create a project structure for a NEAR DApp based on the following requirements:

1. The project should be related to the Nearuko NFT, which can be converted into a character in the Etheruko game.
2. Use necessary files typically required for a NEAR Protocol mainnet DApp.
3. Include all relevant dependencies and packages for an NFT project on the NEAR Protocol.
4. The main coding language should be TypeScript.
5. Ensure the project includes configurations, tests, and NEAR-compatible contract files.
6. Capture any common NEAR DApp structure conventions while setting up the project.

Provide a well-organized directory structure and file list based on these requirements."""

# Function to generate labels (prompts)
def generate_prompt(repoName, tree, readme, example):
    # Create a user prompt for a coding assistant
    prompt = (
        f"You are provided with a GitHub repository called {repoName}.\n\n"
        f"This repository has the following directory structure:\n"
        f"{tree}\n\n"
        f"The README file contains the following information:\n{readme}\n\n"
        f"Step 1: Extract all the relevant information from the README file needed to predict the corresponding tree for a NEAR DApp, such as necessary files, dependencies, packages, and any particular coding languages or frameworks that should be used. "
        f"Step 2: Write a perfect user prompt asking a coding assistant to create a project structure based only on the extracted information from the README file. Only return the user prompt from Step 2. Do not return any information about the tree or file names. Here is an example:\n{example}\n\n"
    )

    response = client.chat.completions.create(
        model=engine,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_output_tokens
    )
    return response.choices[0].message.content
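
The fixed delay used later in this notebook spaces requests out but does not retry a request that fails. A minimal sketch of one alternative, retrying with exponential backoff, is shown below. The helper name `call_with_backoff` and all of its parameters are hypothetical; they are not part of this notebook or of the OpenAI client.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn() and retry on any exception, waiting longer after each failure.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Under these assumptions, the API call could be wrapped as `call_with_backoff(lambda: generate_prompt(repoName, tree, readme, example))`. Catching every `Exception` is deliberately broad for this sketch; in practice it would likely be narrowed to rate-limit and timeout errors.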

In [None]:
# Print example prompt
print(generate_prompt(train['repoName'][500], train['tree'][500], train['readme'][500], example))

**Create labels (perfect user prompts) by extracting crucial information from README files**


*Please note:*
*   A summarizer shortens large README files so that requests do not exceed the API token limits.
*   A time delay between requests keeps the number of requests per minute within the allowed rate.
*   A checkpoint saves intermediate outputs in case the system crashes.
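
As an illustration of the first point, one way to keep each summarizer input small is to split a long README into fixed-size word chunks and summarize each chunk separately. This is a sketch of a different technique from the single-pass summarization used in this notebook. The function name `split_into_word_chunks` and the 350-word default are assumptions, and word counts only approximate token counts.

```python
def split_into_word_chunks(text, max_words=350):
    """Split text into consecutive chunks of at most max_words whitespace-separated words.

    Hypothetical helper: the name and the default budget are assumptions.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Small demonstration with a tiny budget
chunks = split_into_word_chunks("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to the summarizer and the partial summaries joined, at the cost of one summarizer call per chunk.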


In [None]:
# Define the file path for checkpoint in Google Drive
checkpoint_file = '/content/drive/My Drive/checkpoint.csv'

In [None]:
# Initialize user_prompts, reusing any labels already present in the DataFrame
if 'user_prompt' not in train.columns:
    train['user_prompt'] = None
user_prompts = [p if pd.notna(p) else None for p in train['user_prompt']]

In [None]:
train = train.drop('readme_word_count', axis=1)
train.head(2)

In [None]:
# Define the summarizer
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", device=device)
max_input_tokens = 2000
delay_between_requests = 2 # in seconds

# Summarize README files longer than max_length whitespace-separated words (a rough proxy for tokens)
def summarize(text, max_length):
    if len(text.split()) > max_length:
        summary = summarizer(text, max_length=max_length, do_sample=False)
        return summary[0]['summary_text']
    return text

# Function to count tokens
def count_tokens(text, tokenizer):
    return len(tokenizer.encode(text))

# Save function
def save_checkpoint(dataframe, prompts, filename):
    dataframe['user_prompt'] = prompts
    dataframe.to_csv(filename, index=False, escapechar='\\')

# Load function (no index_col, because save_checkpoint writes the file without an index column)
def load_checkpoint(filename):
    return pd.read_csv(filename, escapechar='\\')
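
The processing loop skips rows whose prompt is already set, so labels saved in a previous session can be restored before it runs. A minimal sketch follows, assuming the checkpoint was written with `index=False` and `escapechar='\\'` and that `train` is rebuilt with the same filtering and sort order, so row positions match. The helper name `load_prompts_from_checkpoint` is hypothetical.

```python
import os

import pandas as pd


def load_prompts_from_checkpoint(filename, n_rows):
    """Return previously saved prompts as a list, with None for unlabeled rows.

    Hypothetical helper: falls back to an all-None list when no usable checkpoint exists.
    """
    if not os.path.exists(filename):
        return [None] * n_rows
    saved = pd.read_csv(filename, escapechar='\\')
    if 'user_prompt' not in saved.columns or len(saved) != n_rows:
        # The checkpoint does not match the current DataFrame, so start over
        return [None] * n_rows
    # Empty CSV fields are read back as NaN; map them to None so the loop retries them
    return [p if pd.notna(p) else None for p in saved['user_prompt']]
```

For example, `user_prompts = load_prompts_from_checkpoint(checkpoint_file, len(train))` could replace the all-`None` initialization when resuming after a crash.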

In [None]:
# Processing loop
for index, row in train.iterrows():
    if user_prompts[index] is not None:
        continue  # Skip already processed rows

    repoName = row['repoName']
    tree = row['tree']
    readme = row['readme']
    try:
        # Check if readme is too long and summarize if needed
        readme = summarize(readme, max_input_tokens)

        # Generate the user prompt
        user_prompt = generate_prompt(repoName, tree, readme, example)

        # Count tokens in the generated user prompt (T5 tokenizer, so only an estimate for the OpenAI model)
        input_tokens = count_tokens(user_prompt, summarizer.tokenizer)

        if input_tokens > max_input_tokens + max_output_tokens:
            raise ValueError(f"Prompt too long: {input_tokens} tokens (max allowed is {max_input_tokens + max_output_tokens})")

        user_prompts[index] = user_prompt  # rows that raise an error keep None

        # Log the token lengths of the (possibly summarized) README and the generated prompt for debugging
        print(f"Row {index}: Input length {len(summarizer.tokenizer.encode(readme))} tokens, Prompt length {input_tokens} tokens")

        # Free up GPU memory
        torch.cuda.empty_cache()

        # Implement delay between requests
        time.sleep(delay_between_requests)

    except Exception as e:
        print(f"Error processing row {index}: {e}")

    # Save checkpoint after each row
    save_checkpoint(train, user_prompts, checkpoint_file)