<img width="8%" alt="GitHub" src="https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/.github/assets/logos/GitHub.png" style="border-radius: 15%">

# GitHub - Create dataset from notebooks
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/GitHub/GitHub_Create_plugin_with_commands.ipynb" target="_parent"><img src="https://naasai-public.s3.eu-west-3.amazonaws.com/Open_in_Naas_Lab.svg"/></a><br><br><a href="https://bit.ly/3JyWIk6">Give Feedback</a> | <a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=bug&template=bug_report.md&title=GitHub+-+Create+plugin+with+commands:+Error+short+description">Bug report</a>

**Tags:** #github #finetuning #dataset #ai #chat #plugin

**Author:** [Florent Ravenel](https://www.linkedin.com/in/florent-ravenel)

**Last update:** 2023-11-09 (Created: 2023-11-09)

**Description:** This notebook builds a question-and-answer dataset from the awesome-notebooks templates, to be integrated into Naas ABI characters.

**References:**
- [Naas Chat Documentation](https://site.naas.ai/docs/platform/aI-powered-chat)
- [Naas Chat Plugin driver](https://github.com/jupyter-naas/drivers/blob/main/naas_drivers/tools/naas_chat_plugin.py)

## Input

### Import libraries

In [None]:
import os
import requests
import pandas as pd
import json
import openai
from naas_drivers import gsheet
import naas
import urllib.parse
pd.set_option('display.max_colwidth', None)

### Setup variables
- `github_url`: URL of the `templates.json` file hosted on GitHub.
- `openai.api_key`: OpenAI API key, read from the Naas secret `OPENAI_API_KEY`.
- `spreadsheet_url`: URL of the Google Sheets document that receives the dataset.
- `sheet_name`: Name of the sheet to write to within the Google Sheets document.
- `output_dir`: Directory where the generated question/answer JSON files are stored.

In [None]:
github_url = "https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/templates.json"
openai.api_key = naas.secret.get("OPENAI_API_KEY")
spreadsheet_url = "https://docs.google.com/spreadsheets/d/1wediMdHcq5WDqLMZ7ryNrcPxCmlX8BX4ZEl3JNWT8wg/edit#gid=0"
sheet_name = "ABI_V0.1"
output_dir = "title"

## Model

### Get templates from JSON

In [None]:
def get_templates(url):
    # Download the templates JSON and fail early on an HTTP error
    res = requests.get(url, timeout=30)
    res.raise_for_status()
    df = pd.DataFrame(res.json())
    return df

df_templates = get_templates(github_url)
print("Templates:", len(df_templates))
df_templates.head(1)

### Prep dataset for Fine-Tuning

In [None]:
def prep_data(df_init):
    df = df_init.copy()
    df.insert(loc=1, column="title", value=df["tool"] + " - " + df["notebook"])
    to_drop = [
        "objectID",
        "tool",
        "notebook",
        "action",
        "image_url",
        "imports",
    ]
    df = df.drop(to_drop, axis=1)
    return df.reset_index(drop=True)

df_finetuning = prep_data(df_templates)
df_finetuning.head(1)
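
To make the effect of `prep_data` concrete, here is a minimal sketch that runs the same logic on a single made-up row. The sample values and the `example.com` URL are hypothetical; only the column names are taken from the function above, and the real `templates.json` may contain more columns than shown here.

```python
import pandas as pd

# Hypothetical sample row: the column names mirror those used by prep_data,
# the values are invented for illustration only.
sample = pd.DataFrame(
    [
        {
            "objectID": "1",
            "tool": "AWS",
            "notebook": "Demo notebook",
            "action": "",
            "image_url": "",
            "imports": ["boto3"],
            "description": "A made-up description.",
            "notebook_url": "https://example.com/AWS/AWS_Demo_notebook.ipynb",
        }
    ]
)


def prep_data(df_init):
    df = df_init.copy()
    # Build the title from the tool and notebook names, placed as second column
    df.insert(loc=1, column="title", value=df["tool"] + " - " + df["notebook"])
    to_drop = ["objectID", "tool", "notebook", "action", "image_url", "imports"]
    return df.drop(to_drop, axis=1).reset_index(drop=True)


prepared = prep_data(sample)
print(list(prepared.columns))
print(prepared.loc[0, "title"])
```

With this sample, only `title`, `description` and `notebook_url` remain, and the title reads `AWS - Demo notebook`.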

### Generate question from text

In [None]:
def generate_question_from_text(text):
    # OpenAI API call (legacy `openai.ChatCompletion` interface, available in openai<1.0)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=1,
        messages=[
            {
                "role": "user",
                "content": f"Create a question from: {text}"
            }
        ]
    )
    # Return the text of the first completion
    return response['choices'][0]['message']['content']

# prompt = """
# Create a dataset of 20 questions and answers pairs based on notebooks.

# Data provided:
# title: The title of the notebook.
# tags: Tags or keywords associated with the notebook.
# author: The author of the notebook.
# author_url: The URL or link associated with the author.
# updated_at: The date and time when the notebook was last updated.
# created_at: The date and time when the notebook was created.
# description: A brief description or summary of the notebook.
# open_in_lab: A link to open the notebook or project in a lab environment.
# open_in_chat: A link to open the notebook or project in a chat environment.
# notebook_url: The GitHub URL associated with the notebook.
# imports: The libraries/packages used on the notebook.

# ```instructions
# WRITE IN THE LANGUAGE THE TEXT IS IN
# WRITE COMPLETE ANSWER IN NATURAL LANGUAGE AND NOT ONLY THE RESULT
# BE CURIOUS AND TRY MIMIC A HUMAN BEHAVIOUR
# RETURN RESULT IN A CORRECT JSON FORMAT
# ```
# """

result = generate_question_from_text(
    "AWS - Daily billing notification to slack",
)
print("Result:")
print(result)
print()
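
Because the question generator is called once per template, a transient API failure can interrupt a long run. The sketch below shows one possible way to wrap such a call with retries and exponential backoff. `call_with_retry` and `flaky_question` are hypothetical helpers that are not part of the original notebook, and the stub stands in for the real OpenAI call so the example runs offline.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for generate_question_from_text: fails twice, then succeeds
attempts = {"count": 0}


def flaky_question(text):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"How do I set up '{text}'?"


delays = []  # record the requested delays instead of actually sleeping
question = call_with_retry(flaky_question, "AWS - Demo notebook", sleep=delays.append)
print(question)
print(delays)
```

In the notebook, the same wrapper could be applied as `call_with_retry(generate_question_from_text, title)`. Catching every exception is a simplification; narrowing it to the errors raised by the installed OpenAI client would be safer.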

### Create dataset

In [None]:
final_df = pd.DataFrame()

for row in df_finetuning.itertuples():
    # Init
    index = row.Index
    title = row.title
    open_in_lab = row.open_in_lab
    open_in_chat = row.open_in_chat
    notebook_url = row.notebook_url
    if open_in_chat != "":
        open_in_chat = f"or use it directly in Naas Chat '{open_in_chat}'."
    else:
        open_in_chat = "."
    print(f"{index+1} - Starting with: '{title}'")

    # Create file path (make sure the output directory exists first)
    os.makedirs(output_dir, exist_ok=True)
    file_name = notebook_url.split("/")[-1].lower().replace(".ipynb", "_title_question.json")
    file_path = os.path.join(output_dir, file_name)
    answer = f"You can do this by using the template '{title}' from Naas awesome-notebooks catalog. Check out the notebook in our search here: 'https://naas.ai/search?q={urllib.parse.quote(title)}'"
    if not os.path.exists(file_path):
        # Call OpenAI API
        question = generate_question_from_text(title)
        # answer = f"You can do this by using the template '{title}' from Naas awesome-notebooks catalog. Check out the notebook on GitHub '{notebook_url}' or open it directly in Naas Lab '{open_in_lab}'{open_in_chat}"
        # Create data
        data = {
            "question": question,
            "answer": answer
        }

        # Save the extracted data as JSON
        with open(file_path, 'w') as json_file:
            json.dump(data, json_file)
    else:
        # Reuse the cached question and refresh the answer
        with open(file_path, 'r') as json_file:
            data = json.load(json_file)
        data["answer"] = answer
        print(f"✅ JSON '{file_name}' already exists.")
        
    # Concat df (ignore_index keeps a unique, sequential index)
    tmp_df = pd.DataFrame([data])
    final_df = pd.concat([final_df, tmp_df], ignore_index=True)

final_df
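
The loop above caches each generated question as a JSON file, so a rerun does not call the API again for notebooks that were already processed. The following sketch isolates that caching pattern under a few assumptions: `cache_path`, `load_or_create` and `fake_generate` are hypothetical names, the `example.com` URL is made up, and a stub replaces the OpenAI call so the example runs offline.

```python
import json
import os
import tempfile
import urllib.parse


def cache_path(notebook_url, output_dir):
    """Derive the JSON cache path from a notebook URL, as the loop does."""
    file_name = notebook_url.split("/")[-1].lower().replace(".ipynb", "_title_question.json")
    return os.path.join(output_dir, file_name)


def load_or_create(notebook_url, title, output_dir, generate):
    """Return the question/answer pair, calling generate only on a cache miss."""
    os.makedirs(output_dir, exist_ok=True)
    path = cache_path(notebook_url, output_dir)
    answer = (
        f"You can do this by using the template '{title}'. "
        f"Search: 'https://naas.ai/search?q={urllib.parse.quote(title)}'"
    )
    if os.path.exists(path):
        # Cache hit: reuse the stored question
        with open(path, "r") as json_file:
            data = json.load(json_file)
    else:
        # Cache miss: generate the question and store it
        data = {"question": generate(title), "answer": answer}
        with open(path, "w") as json_file:
            json.dump(data, json_file)
    data["answer"] = answer  # always refresh the answer text
    return data


calls = []  # records every call to the stub generator


def fake_generate(title):
    calls.append(title)
    return f"How can I use {title}?"


url = "https://example.com/AWS/AWS_Demo_notebook.ipynb"
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_create(url, "AWS - Demo notebook", tmp, fake_generate)
    second = load_or_create(url, "AWS - Demo notebook", tmp, fake_generate)

print(len(calls))
print(first == second)
```

The stub is called only once: the second call reads the question back from the JSON file written by the first.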

## Output

### Send to Google Sheets spreadsheet

In [None]:
gsheet.connect(spreadsheet_url).send(sheet_name=sheet_name, data=final_df, append=False)