In [1]:
import pandas as pd
import json
import random
import numpy as np
import yaml
from qa_dataset_manager import QADatasetManager
from extract import DocumentProcessor

# Sentence embeddings
from sentence_transformers import SentenceTransformer

# LLM inference
from huggingface_hub import InferenceClient

# Vector search
import faiss

In [2]:
# Load the Hugging Face API token from the config file
with open('../config.yaml', 'r') as config_file:
    config = yaml.safe_load(config_file)
hugging_face_api_key = config['huggingface']['token_api']

In [None]:
# Extract the Markdown documents and store them in "../data/documents.csv"
processor = DocumentProcessor(root_dir='../content', output_path='../data/documents.csv')
processor.process_documents()

Extracting files from folder: account-and-profile
Extracting files from folder: actions
Extracting files from folder: admin
Extracting files from folder: apps
Extracting files from folder: authentication
Extracting files from folder: billing
Extracting files from folder: code-security
Extracting files from folder: codespaces
Extracting files from folder: communities
Extracting files from folder: contributing
Extracting files from folder: copilot
Extracting files from folder: desktop
Extracting files from folder: discussions
Extracting files from folder: education
Extracting files from folder: get-started
Extracting files from folder: github-cli
Extracting files from folder: graphql
Extracting files from folder: index.md
Extracting files from folder: issues
Extracting files from folder: migrations
Extracting files from folder: organizations
Extracting files from folder: packages
Extracting files from folder: pages
Extracting files from folder: pull-requests
Extracting files from folder:

In [3]:
# Inference client and the candidate model IDs
inference = InferenceClient(token=hugging_face_api_key)
model_zephyr = "HuggingFaceH4/zephyr-7b-beta"
model_mistral = "mistralai/Mistral-7B-v0.1"
model_falcon = "tiiuae/falcon-7b-instruct"
model_open = "openchat/openchat_3.5"

In [4]:
df = pd.read_csv("../data/documents.csv")
data = df['content'].to_list()
df.head()

Unnamed: 0  content                                             title
0           \n\nChoosing how to unsubscribe\n\nTo unwatch ...   managing-your-subscriptions
1           \n\nDiagnosing why you receive too many notifi...   viewing-your-subscriptions
2           \n\nNotifications and subscriptions\n\nYou can...   about-notifications
3           \n\nNotification delivery options\n\nYou can r...   configuring-notifications
4           \n\nStarting your inbox triage\n\nBefore you s...   customizing-a-workflow-for-triaging-your-notif...


In [3]:
dataset_manager = QADatasetManager()

In [6]:
dataset_manager.metadata

{'max_index': 755,
 'last_updated': '2023-12-23 21:47',
 'creation_date': '2023-12-21 15:40'}

In [7]:
dataset_manager.create_qa_pairs(
    texts=data,
    client=InferenceClient(token=hugging_face_api_key),
    model=model_zephyr,
    max_new_tokens=200,
    num_questions_per_chunk=2,
    chunk_size=2048,
)

641 - Number of chunks in text 259: 2
642 - Number of chunks in text 1340: 2
643 - Number of chunks in text 281: 1
644 - Number of chunks in text 1126: 4
645 - Number of chunks in text 178: 7
Updating QA dataset
Saved questions to --> ../data/qa_dataset_intermed.json
Updated metadata
646 - Number of chunks in text 711: 2
647 - Number of chunks in text 554: 1
648 - Number of chunks in text 1257: 3
649 - Number of chunks in text 1111: 3
650 - Number of chunks in text 2113: 5
Updating QA dataset
Saved questions to --> ../data/qa_dataset_intermed.json
Updated metadata
651 - Number of chunks in text 946: 1
652 - Number of chunks in text 519: 2
653 - Number of chunks in text 2069: 27
654 - Number of chunks in text 1529: 1
655 - Number of chunks in text 382: 1
Updating QA dataset
Saved questions to --> ../data/qa_dataset_intermed.json
Updated metadata
656 - Number of chunks in text 1539: 1
657 - Number of chunks in text 991: 3
658 - Number of chunks in text 862: 8
659 - Number of chunks in te
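
Note: the log above shows intermediate results being saved after every five texts. A minimal sketch of that checkpointing pattern follows. It is not the actual `QADatasetManager` implementation: `process_with_checkpoints`, `process_text`, and the JSON checkpoint layout are hypothetical names and assumptions used only for illustration.

```python
import json
import os


def process_with_checkpoints(texts, process_text, checkpoint_path, save_every=5):
    """Process texts one by one, persisting results every `save_every` items.

    `process_text` is a hypothetical stand-in for the per-text LLM call.
    Texts whose index is already in the checkpoint file are skipped, so an
    interrupted run can simply be started again.
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)

    # JSON object keys are strings, so indices are stored as strings.
    pending = [i for i in range(len(texts)) if str(i) not in results]
    for count, i in enumerate(pending, start=1):
        results[str(i)] = process_text(texts[i])
        # Save after every `save_every` items and once more at the very end.
        if count % save_every == 0 or count == len(pending):
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(results, f)
    return results
```

With this layout, an interrupted run loses at most `save_every - 1` processed texts, at the cost of rewriting the checkpoint file on every save.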

In [7]:
dataset_manager.add_qa_to_dataset()

Added questions from ../data/qa_dataset_intermed.json to --> ../data/qa_dataset.json


In [7]:
qa_dataset = dataset_manager.get_qa_dataset()

In [8]:
len(qa_dataset['queries']), len(qa_dataset['answers'])

(7160, 7103)
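
Note: the two counts differ (7160 queries, 7103 answers), so some queries have no answer yet. A minimal sketch for listing them, assuming `qa_dataset['queries']` and `qa_dataset['answers']` are both dicts keyed by the same query ID. `find_unanswered` is a hypothetical helper, not part of `QADatasetManager`.

```python
def find_unanswered(qa_dataset):
    """Return the IDs of queries that have no entry in the answers mapping."""
    answers = qa_dataset["answers"]
    return [query_id for query_id in qa_dataset["queries"] if query_id not in answers]


# Toy data with the assumed layout (IDs and texts are made up for illustration)
example = {
    "queries": {"q1": "What is a fork?", "q2": "What is a default branch?"},
    "answers": {"q1": "A fork is a copy of a repository."},
}
print(find_unanswered(example))  # ['q2']
```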

In [12]:
dataset_manager.create_answers(
    qa_dataset=qa_dataset,
    client=InferenceClient(token=hugging_face_api_key),
    model=model_zephyr,
    max_new_tokens=500,
)

162 to be answered
Answering qery 0b654229-5a74-4d1f-9d7a-6cb9e46029d8
query:  How does the author's background in [insert field] contribute to the research presented in this paper
Answering qery 30de5ba7-8c08-47b3-8112-30eee5c66370
query:  What is the default branch in a GitHub repository and how can it be changed
Answering qery bc71ccdc-4a8c-4f67-8f38-d6c806572e3a
query:  What is the name of the GitHub workflow that triggered this build, as mentioned in the context information
Answering qery b8c03353-27a9-4213-ab9d-71c3bd218cda
query:  How can I ensure that Python is installed and added to the PATH in a {% data variables.product.prodname_dotcom %}-hosted runner
Answering qery dc24f24e-4672-4a83-b9e2-32a5542e499b
query:  What is the purpose of the following command in the context provided: "bundle exec appraisal install"
Updating answers dict
Saved questions to --> ../data/answers_intermed.json
Answering qery dec95dd0-0559-4860-8826-0a6390ef489d
query:  How can I run code scanning usi

KeyboardInterrupt: 
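
Note: some of the generated queries in the log above contain unfilled placeholders (`[insert field]`) or raw Liquid tags (`{% data ... %}`). A minimal sketch for flagging such queries before they are answered. The two patterns are assumptions based only on the examples visible in this log, and `is_suspect_query` is a hypothetical helper.

```python
import re

# Patterns suggesting a query came from boilerplate rather than real content:
# bracketed placeholders such as "[insert field]" and Liquid tags such as "{% data ... %}".
PLACEHOLDER_PATTERNS = [
    re.compile(r"\[insert [^\]]*\]", re.IGNORECASE),
    re.compile(r"\{%.*?%\}"),
]


def is_suspect_query(query):
    """Return True if the query contains a placeholder or a Liquid tag."""
    return any(pattern.search(query) for pattern in PLACEHOLDER_PATTERNS)


queries = [
    "How does the author's background in [insert field] contribute to the research presented in this paper",
    "What is the default branch in a GitHub repository and how can it be changed",
]
print([is_suspect_query(q) for q in queries])  # [True, False]
```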

In [4]:
dataset_manager.add_answers_to_dataset()

Added answers from ../data/answers_intermed.json to --> ../data/qa_dataset.json


In [5]:
qa_dataset = dataset_manager.get_qa_dataset()

In [6]:
len(qa_dataset['answers'])

7103

In [11]:
list(qa_dataset['answers'].values())[100]

'The purpose of using the "--extractor-options-file" command-line option in CodeQL database init or "codeql database begin-tracing" is to specify extractor option bundle files that set extractor options. These files are read in the order they are specified, and if different extractor option bundle files specify the same extractor option, the behavior depends on the type that the extractor option expects. String options will use the last value provided, and array options will use all the values provided, in order. This option is processed before extractor options given via "--extractor-option". When passed to CodeQL database init or "codeql database begin-tracing", the options will only be applied to the indirect tracing environment. If your workflow also makes calls to "codeql database trace-command", then the options also need to be passed there if desired. This option is useful for setting extractor options in a more organized and structured way, especially when dealing with multiple

In [80]:
nodes = dataset_manager.parse_documents(texts=data, chunk_size=2048)

In [56]:
len(nodes)

6014
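
Note: `parse_documents` is called with `chunk_size=2048`, but how it splits the texts is not shown here. A minimal sketch of one possible approach, assuming plain fixed-size splitting by characters. The real method may split by tokens or sentences instead, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(text, chunk_size=2048):
    """Split text into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "x" * 5000
chunks = split_into_chunks(sample, chunk_size=2048)
print([len(chunk) for chunk in chunks])  # [2048, 2048, 904]
```

Fixed-size character splitting is simple but can cut sentences in half; splitting on paragraph or sentence boundaries is a common alternative.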

In [57]:
list(nodes.values())[:5]

['\n\nAbout forks\n\n{% data reusables.repositories.fork-definition-long %}  For more information, see "AUTOTITLE."\n\n\n\nPropose changes to someone else\'s project\n\nFor example, you can use forks to propose changes related to fixing a bug. Rather than logging an issue for a bug you have found, you can:\n\n- Fork the repository.\n- Make the fix.\n- Submit a pull request to the project owner.\n\n\n\nUse someone else\'s project as a starting point for your own idea.\n\nOpen source software is based on the idea that by sharing code, we can make better, more reliable software. For more information, see the "About the Open Source Initiative" on the Open Source Initiative.\n\nFor more information about applying open source principles to your organization\'s development work on {% data variables.location.product_location %}, see {% data variables.product.prodname_dotcom %}\'s white paper "An introduction to innersource."\n\n{% ifversion fpt or ghes or ghec %}\n\nWhen creating your public r
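
Note: the node text above still contains Liquid tags (`{% data ... %}`, `{% ifversion ... %}`). A minimal sketch for cleaning them before chunking, assuming that `data` tags can be resolved from a lookup table and that all other tags can simply be dropped. `render_liquid` and the example mapping (including the value "GitHub") are hypothetical and not taken from this notebook.

```python
import re

LIQUID_TAG = re.compile(r"\{%(.*?)%\}")


def render_liquid(text, variables=None):
    """Replace '{% data <name> %}' tags via `variables`; drop every other tag.

    Unknown data tags are replaced with an empty string.
    """
    variables = variables or {}

    def replace(match):
        inner = match.group(1).strip()
        if inner.startswith("data "):
            return variables.get(inner[len("data "):].strip(), "")
        return ""

    return LIQUID_TAG.sub(replace, text)


# Hypothetical mapping for illustration only
mapping = {"variables.product.prodname_dotcom": "GitHub"}
raw = "see {% data variables.product.prodname_dotcom %}'s white paper"
print(render_liquid(raw, mapping))  # see GitHub's white paper
```

Dropping control tags such as `{% ifversion ... %}` keeps the text of every version branch, which may or may not be what is wanted for question generation.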

In [58]:
list(nodes.keys())[:5]

['Y2h1bmtfMF9pbmRleF8w',
 'Y2h1bmtfMV9pbmRleF8w',
 'Y2h1bmtfMl9pbmRleF8w',
 'Y2h1bmtfM19pbmRleF8w',
 'Y2h1bmtfNF9pbmRleF8w']
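
Note: the node keys above look like Base64-encoded strings. A minimal sketch that decodes the first two keys from the output above with the standard library to recover their readable form:

```python
import base64

node_ids = [
    "Y2h1bmtfMF9pbmRleF8w",
    "Y2h1bmtfMV9pbmRleF8w",
]

# Decode each key from Base64 to UTF-8 text
decoded = [base64.b64decode(node_id).decode("utf-8") for node_id in node_ids]
print(decoded)  # ['chunk_0_index_0', 'chunk_1_index_0']
```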