<a href="https://colab.research.google.com/github/huishingchong/agile_llm/blob/main/pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Pipeline

## Set up
### Import Packages and API keys

In [1]:
!pip install transformers datasets torch langchain-community faiss-cpu sentence-transformers langchain gradio mlflow evaluate




In [2]:
# get a token: https://huggingface.co/docs/api-inference/quicktour#get-your-api-token

from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

··········


In [3]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

### Model selection

In [64]:
from langchain_community.llms import HuggingFaceHub
from langchain_community.llms import HuggingFaceEndpoint
model_name = "tiiuae/falcon-7b-instruct"
llm = HuggingFaceEndpoint(
    repo_id=model_name,
    model=model_name,
    task="text-generation",
    temperature=0.5,
    # max_length:1024,
    max_new_tokens=200
)
# llm = HuggingFaceHub(repo_id=model_name, model_kwargs={"temperature":0.5, "max_length":1024, "max_new_tokens":200})


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## RAG from synthetic data set

### Set up documents, embeddings and retrievers

In [5]:
# Use LangChain packages to implement retrieval-augmented generation (RAG)
from datasets import load_dataset
from langchain.document_loaders.csv_loader import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from sentence_transformers import SentenceTransformer

In [75]:
# Load data for retrieval from a csv file
loader = CSVLoader(file_path="skills_build.csv")
documents = loader.load()

# Split documents into chunks
text_split = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
# text_split = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
d = text_split.split_documents(documents)
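
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraph and line breaks); the helper name `sliding_chunks` and the sample string are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each window repeats the last 3 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough. Since `CSVLoader` produces one document per row, rows shorter than 1000 characters should pass through the splitter unchanged.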

In [6]:
modelPath = "sentence-transformers/gtr-t5-base" # Using t5 sentence transformer model to generate embeddings
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': True} # Normalizing embeddings may help improve similarity metrics by ensuring that embeddings magnitude does not affect the similarity scores

# Initialise an instance of HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

  return self.fget.__get__(instance, owner)()
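
The comment on `normalize_embeddings` can be checked with a small pure-Python sketch. The toy 2-D vectors below are invented for illustration and are not real embeddings. It shows that two vectors pointing in the same direction look far apart before normalization and identical after it, and that for unit vectors the squared L2 distance equals `2 - 2 * cosine_similarity`.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, ten times longer
c = [4.0, 3.0]    # different direction

print(squared_l2(a, b))  # large, although a and b point the same way

na, nb, nc = normalize(a), normalize(b), normalize(c)
print(squared_l2(na, nb))  # approximately zero after normalization

cos_ac = dot(na, nc)  # cosine similarity of a and c
print(squared_l2(na, nc), 2 - 2 * cos_ac)  # the two values agree
```

With normalized embeddings, ranking by L2 distance and ranking by cosine similarity therefore give the same order.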


In [76]:
# Create a FAISS Index with specified document and embeddings
db = FAISS.from_documents(d, embedding=embeddings)

In [77]:
# just checking
query_example = "What is IBM SkillsBuild credentials?"
r = db.as_retriever()
docs = r.get_relevant_documents(query_example)


In [78]:
for doc in docs:
    print(f"\nContent: {doc.page_content}, Metadata: {doc.metadata}")


Content: IBM SkillsBuild: Digital credentials
Description: Digital credentials from IBM SkillsBuild are secure, web-enabled credentials that contain granular, verified information that employers can use to evaluate an individual's potential. Digital credential requirements can include online learning activities, quizzes or exams, experience that requires advisor review, or even interviews. Complete the required activities, accept the credential, and choose to share your achievements to your resume, LinkedIn, and other social media., Metadata: {'source': 'skills_build.csv', 'row': 7}

Content: IBM SkillsBuild: broader initiative
Description: IBM SkillsBuild is a free education program focused on underrepresented communities that helps adult learners, high school and university students, and faculty develop valuable new skills and access career opportunities. The program includes an online platform complemented by customised practical learning experiences that aim to respond to learners

Evaluate retrieval

In [79]:
docs_and_scores = db.similarity_search_with_score(query_example)
print(docs_and_scores)

[(Document(page_content="IBM SkillsBuild: Digital credentials\nDescription: Digital credentials from IBM SkillsBuild are secure, web-enabled credentials that contain granular, verified information that employers can use to evaluate an individual's potential. Digital credential requirements can include online learning activities, quizzes or exams, experience that requires advisor review, or even interviews. Complete the required activities, accept the credential, and choose to share your achievements to your resume, LinkedIn, and other social media.", metadata={'source': 'skills_build.csv', 'row': 7}), 0.21700807), (Document(page_content='IBM SkillsBuild: broader initiative\nDescription: IBM SkillsBuild is a free education program focused on underrepresented communities that helps adult learners, high school and university students, and faculty develop valuable new skills and access career opportunities. The program includes an online platform complemented by customised practical learni

In [80]:
for doc, score in docs_and_scores:
    print(f"\nContent: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")


Content: IBM SkillsBuild: Digital credentials
Description: Digital credentials from IBM SkillsBuild are secure, web-enabled credentials that contain granular, verified information that employers can use to evaluate an individual's potential. Digital credential requirements can include online learning activities, quizzes or exams, experience that requires advisor review, or even interviews. Complete the required activities, accept the credential, and choose to share your achievements to your resume, LinkedIn, and other social media., Metadata: {'source': 'skills_build.csv', 'row': 7}, Score: 0.2170080691576004

Content: IBM SkillsBuild: broader initiative
Description: IBM SkillsBuild is a free education program focused on underrepresented communities that helps adult learners, high school and university students, and faculty develop valuable new skills and access career opportunities. The program includes an online platform complemented by customised practical learning experiences that
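
Assuming the default LangChain FAISS index, the score returned by `similarity_search_with_score` is a distance, so a lower score means a closer match. The sketch below filters `(doc, score)` pairs by a distance cut-off. The helper name, the example pairs, and the 0.5 threshold are all hypothetical and would need tuning on real queries.

```python
def keep_close_matches(scored_docs, max_distance):
    """Keep (doc, score) pairs whose distance is at most max_distance, closest first."""
    kept = [(doc, score) for doc, score in scored_docs if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])

# Hypothetical (text, distance) pairs standing in for docs_and_scores.
example_scores = [
    ("broader initiative", 0.41),
    ("digital credentials", 0.22),
    ("unrelated job advert", 0.95),
]

close = keep_close_matches(example_scores, max_distance=0.5)
print(close)
```

The same function should work on the real `docs_and_scores` list, because it only relies on each item being a `(document, score)` pair.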

### Specify model: Falcon

In [None]:
from langchain_community.llms import HuggingFaceHub
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.prompts import PromptTemplate

model_name = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# llm = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

#Define a text generation pipeline using the model and tokenizer
# hf = pipeline(
#     "text-generation",
#     model=model_name,
#     tokenizer=tokenizer,
#     max_new_tokens = 200,
#     temperature = 0.1,
#     eos_token_id=tokenizer.eos_token_id,
#     do_sample=True
# )
# llm = HuggingFacePipeline(pipeline=hf)
llm = HuggingFaceHub(repo_id=model_name, model_kwargs={"temperature":0.5, "max_length":1024, "max_new_tokens":200})
# https://medium.com/international-school-of-ai-data-science/implementing-rag-with-langchain-and-hugging-face-28e3ea66c5f7

# TO DELETE CACHE: huggingface-cli delete-cache

### RetrievalQA

In [81]:
from langchain.prompts import PromptTemplate

template = """Use the following context to answer the question at the end.
If you don't know the answer, please think rationally and answer from your own knowledge base.
Context: {context}

Question: {question}
"""

QA_CHAIN_PROMPT = PromptTemplate(template=template, input_variables=["context", "question"])
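
As a rough picture of what the model receives, the sketch below fills the same template by hand. It assumes the retrieved chunks are joined with blank lines before being placed into `{context}`, which is how a "stuff"-style chain is generally understood to behave. The names `demo_template` and `build_prompt` and the two short chunk strings are illustrative only.

```python
demo_template = """Use the following context to answer the question at the end.
If you don't know the answer, please think rationally and answer from your own knowledge base.
Context: {context}

Question: {question}
"""

def build_prompt(chunks, question):
    """Join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return demo_template.format(context=context, question=question)

retrieved = [
    "IBM SkillsBuild: Digital credentials",
    "IBM SkillsBuild: broader initiative",
]
prompt = build_prompt(retrieved, "What is IBM SkillsBuild credentials?")
print(prompt)
```

Printing the assembled prompt like this is a quick way to see how much of the model's input budget the context consumes.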

In [82]:
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

chain_type_kwargs = {"prompt": QA_CHAIN_PROMPT}
qa = RetrievalQA.from_chain_type(llm=llm,
                                 retriever=db.as_retriever(),
                                 return_source_documents=True,
                                 chain_type_kwargs=chain_type_kwargs, verbose=True)


In [96]:
response = qa.invoke({"query": "What is the responsibility of the Penetration Tester?"})
print(response["result"])



> Entering new RetrievalQA chain...

> Finished chain.
A) To conduct security assessments, including penetration testing, vulnerability analysis, red teaming, and targeted attack simulations
B) To evaluate security solutions for compliance with standards and regulatory requirements
C) To continuously improve assessment capabilities through new tools, scripts, and techniques
D) To deliver effective Cyber Security outcomes for our customers
Answer: D) To deliver effective Cyber Security outcomes for our customers


In [97]:
# See the source documents
print(response.get("source_documents", []))

[Document(page_content='Job Title: Penetration Tester\nURL: https://www.linkedin.com/jobs/search/?currentJobId=3813771101&geoId=101165590&keywords=cybersecurity&location=United%20Kingdom&origin=JOB_SEARCH_PAGE_SEARCH_BUTTON&refresh=true#HYM\nDescription: The Penetration Tester position at TechForce Cyber offers an exciting opportunity for individuals passionate about cybersecurity and penetration testing. This role, available as a hybrid or fully remote opportunity, is ideal for those with at least one or two years of experience in penetration testing. The candidate will be engaged in various responsibilities, including leading or supporting penetration testing engagements, managing projects, producing detailed reports, and staying updated on the latest cybersecurity threats and best practices.', metadata={'source': 'jobs.csv', 'row': 9}), Document(page_content='Key Responsibilities:\nLead or support various penetration testing engagements, ensuring exceptional client delivery.\nEffect

In [None]:
import gradio as gr

def chat_interface(textbox, chat):
    input_dict = {'query': textbox}
    result = qa.invoke(input_dict)
    print(result)
    text = result['result']
    return text

gr.ChatInterface(
    fn=chat_interface,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Ask me a question", container=False, scale=7),
    title="Chatbot",
    description="Ask Chatbot any question",
    theme="soft",
    examples=["What does AI stand for?", "What is Software Engineering?", "What is Cybersecurity?"],
    cache_examples=False,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch(debug=True)

### With a QA chain

In [None]:
from langchain.chains import LLMChain
from langchain.chains.question_answering import load_qa_chain
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import pipeline
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Answer professionally, to the best of your ability, and where appropriate, in a Computer Science educational context.
Don't copy from the context.
Context: {context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate(
        template=template, input_variables=["context", "question"]
    )
# Pass the custom prompt to the "stuff" chain; otherwise the chain uses its default prompt
qa_chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT, verbose=True)


In [None]:
import gradio as gr

def chat_interface(textbox, chat):
    # Retrieve only the chunks relevant to this question
    retriever = db.as_retriever()
    docs = retriever.get_relevant_documents(textbox)

    # Pass the retrieved chunks (not the whole corpus `d`) to the chain
    input_dict = {'question': textbox, 'input_documents': docs}
    result = qa_chain.invoke(input_dict, return_only_outputs=True)
    print(result)
    text = result['output_text']
    # answer = text.split('\nHelpful Answer:')[1].strip()
    return text

gr.ChatInterface(
    fn=chat_interface,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Ask me a question", container=False, scale=7),
    title="Chatbot",
    description="Ask Chatbot any question",
    theme="soft",
    examples=["What does AI stand for?", "What is Software Engineering?", "What is Cybersecurity?"],
    cache_examples=False,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch(debug=True)

## Domain knowledge retrieval from pdf

In [12]:
!pip install pypdf pdfminer.six unstructured unstructured[pdf]


In [20]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PDFMinerLoader
from langchain_community.document_loaders import UnstructuredFileLoader

# Load data for retrieval from a PDF file
pdf_loader = PyPDFLoader("skills_build.pdf")
pdf_doc = pdf_loader.load_and_split()

# Split documents into chunks
pdf_split = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
# text_split = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
pdf_split_doc = pdf_split.split_documents(pdf_doc)

In [18]:
import re

def clean_text(text):
    # Split the text into lines using the newline character
    lines = text.split('\n')
    # Clean each line
    cleaned_lines = [re.sub(r' +', ' ', line.strip()) for line in lines]
    # Join the cleaned lines back together with newline characters
    cleaned_text = '\n'.join(cleaned_lines)
    return cleaned_text

cleaned_pdf_doc = [clean_text(page.page_content) for page in pdf_doc]
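
Note that `clean_text` keeps every newline, so if the PDF extractor puts each word on its own line, that layout survives the cleaning. A possible alternative, sketched here under the assumption that line breaks in the PDF carry no meaning, is to collapse all whitespace into single spaces. The helper name `collapse_whitespace` and the sample string are illustrative.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "About\nIBM\nSkillsBuild\nAt\nIBM\nwe\nbelieve"
print(collapse_whitespace(sample))
```

The trade-off is that genuine paragraph and list breaks are lost as well, so this suits PDFs where the extracted line structure is already unusable.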

In [21]:
cleaned_pdf_doc[0]

"About\nIBM\nSkillsBuild\nAt\nIBM\nwe\nbelieve\nthat\nentry-level\ntech\n(new,\nentry-level\ntech\njobs)\nrequire\nskills,\nnot\njust\ndegrees.\nIBM\nSkillsBuild\noffers\nfree\nlearning,\nsupport,\nand\nresources\nto\nfoster\nSTEM\nand\nnew\ncollar\nskills\nfrom\nsecondary\neducation\n(students\naged\n13-18)\nto\nentry-level\nemployment.\nIBM\nSkillsBuild\nis\na\nfree\neducation\nprogram\nfocused\non\nunderrepresented\ncommunities\nthat\nhelps\nadult\nlearners,\nhigh\nschool\nand\nuniversity\nstudents,\nand\nfaculty\ndevelop\nvaluable\nnew\nskills\nand\naccess\ncareer\nopportunities.\nThe\nprogram\nincludes\nan\nonline\nplatform\ncomplemented\nby\ncustomised\npractical\nlearning\nexperiences\nthat\naim\nto\nrespond\nto\nlearners’\nneeds\nas\nthey\nprogress\nthrough\ntheir\neducation\nand\ncareer\njourneys.\nThis\neffort\nis\nin\ncollaboration\nwith\na\nnetwork\nof\nworld-class\neducation\npartners,\nincluding\npublic\nhigh\nschools,\nnon-profit\norganisations,\ngovernments,\nand\ncorpo

In [56]:
# Create a FAISS Index with specified document and embeddings
pdf_db = FAISS.from_documents(pdf_doc, embedding=embeddings)

In [57]:
# just checking
skills_build_query_example = "What is digital credential?"
r = pdf_db.as_retriever()
docs = r.get_relevant_documents(skills_build_query_example)
docs_and_scores = pdf_db.similarity_search_with_score(skills_build_query_example)


In [58]:
for doc, score in docs_and_scores:
    print(f"\nContent: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")


Content: About
IBM
SkillsBuild
At
IBM
we
believe
that
entry-level
tech
(new,
entry-level
tech
jobs)
require
skills,
not
just
degrees.
IBM
SkillsBuild
offers
free
learning,
support,
and
resources
to
foster
STEM
and
new
collar
skills
from
secondary
education
(students
aged
13-18)
to
entry-level
employment.
IBM
SkillsBuild
is
a
free
education
program
focused
on
underrepresented
communities
that
helps
adult
learners,
high
school
and
university
students,
and
faculty
develop
valuable
new
skills
and
access
career
opportunities.
The
program
includes
an
online
platform
complemented
by
customised
practical
learning
experiences
that
aim
to
respond
to
learners’
needs
as
they
progress
through
their
education
and
career
journeys.
This
effort
is
in
collaboration
with
a
network
of
world-class
education
partners,
including
public
high
schools,
non-profit
organisations,
governments,
and
corporations.
Course
catalogue
IBM
SkillsBuild
provides
both
technical
skills
and
soft
skills:
●
Technical
skills
inc

In [70]:
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

chain_type_kwargs = {"prompt": QA_CHAIN_PROMPT}
qa = RetrievalQA.from_chain_type(llm=llm,
                                 retriever=pdf_db.as_retriever(),
                                 return_source_documents=True,
                                 chain_type_kwargs=chain_type_kwargs, verbose=True)

## Evaluation

In [25]:
!pip install torch tiktoken evaluate mlflow bert_score textstat




In [None]:
# RAG evaluation

from langchain.evaluation import load_evaluator, EvaluatorType



Prepare data set

In [83]:
import pandas as pd
data_skills_build = pd.DataFrame({
    "question": [
        "What is the goal of IBM SkillsBuild?",
        "What are the courses available at IBM SkillsBuild?",
        "What is IBM SkillsBuild Software downloads?",
        "What learning paths does IBM SkillsBuild offer to different age groups and how does it support their learning and professional development goals?",
        "How does IBM SkillsBuild support personalized learning experiences for users?",
        "What can learners do with earned digital credentials?"
    ],
    "ground_truth": [
        "The goal of IBM SkillsBuild is to address the growing demand for essential technical and professional skills in the face of digital acceleration. It aims to empower learners from underrepresented communities, including adult learners, high school and university students, and faculty, by providing them with access to free education programs. These programs are designed to help individuals develop valuable new skills and access career opportunities. IBM SkillsBuild offers an online platform supplemented by customized practical learning experiences tailored to meet the evolving needs of learners as they progress through their education and career journeys. This initiative is carried out in collaboration with a network of esteemed education partners, including public high schools, non-profit organizations, governments, and corporations, to ensure widespread access and impact.",
        "IBM SkillsBuild offers a comprehensive range of technical and soft skills to prepare individuals for the evolving job market. On the technical front, learners can delve into cutting-edge areas such as AI, Blockchain, Cloud Computing, Cybersecurity, Data Science, Quantum Computing, and Web Development, among others. Meanwhile, the platform also focuses on cultivating essential workplace skills like Agile methodologies, Design Thinking, Professional skills, Job Readiness, and more. With this diverse array of offerings, IBM SkillsBuild equips learners with the expertise and adaptability needed in today's workplace.",
        "IBM SkillsBuild Software downloads is an element within the IBM SkillsBuild program designed to provide learners and educators at degree-granting accredited academic institutions with access to select IBM assets at no-charge for classroom (teaching and learning) and non-commercial research purposes at the academic institution.",
        "Adult Learners is designed for learners (age 18+) seeking employment in the immediate future. It provides adult learners with free job role preparation specifically focused on entry-level tech roles to help them gain meaningful employment, while earning credentials. High School Students is designed for learners at the career exploration stage (aged 13 to 18). It’s designed to help students begin their career exploration with free digital learning on cutting-edge technology and workplace skills. Academia is designed for university students and faculty. Students from IBM’s partner Universities can earn digital credentials, access software, and explore a library of university guest lectures across data science, artificial intelligence, security and more. Learners also have access to expert conversations from IBM volunteers, access to IBM software, and project-based learning (practical exercises to apply learning they have acquired, including labs using IBM technology). Software Downloads is an element within the IBM SkillsBuild program designed to provide learners and educators at degree-granting accredited academic institutions with access to select IBM assets at no-charge for classroom (teaching and learning) and non-commercial research purposes at the academic institution.",
        "The customized practical learning experiences provided by IBM SkillsBuild are designed to empower learners with valuable skills and opportunities for career advancement. Through project-based learning, participants can apply their newfound knowledge in practical exercises and labs using IBM technology. Additionally, insights from IBM Mentors offer guidance on technical expertise, business acumen, and collaboration skills. Learners have access to premium content, including online courses in 22 languages and certifications equivalent to those earned by IBM employees. Moreover, the platform connects individuals with career opportunities such as job fairs, workshops, and seminars, providing support with job interviews, professional portfolio creation, and internship opportunities, ultimately helping learners navigate their educational and career journeys with confidence.",
        "Digital credentials from IBM SkillsBuild can be shared on the learner's resume, LinkedIn, and other social media for peers and employers to see."
    ]
})
# TODO: revise
data_cs_industry = pd.DataFrame({
    "question": [
        "What are the latest trends in artificial intelligence and machine learning?",
        "What are some emerging programming languages that are gaining popularity in the industry?",
        "I am a beginner that wants to get into Data Science, where should I start?",
        "I am a final-year Computer Science student wanting to find a graduate role in Cybersecurity. What are the practical skills required for a career in Cybersecurity that are currently in-demand?",
        "What are the essential skills required for a career in cybersecurity?",
        "What are some in-demand technical skills for aspiring data analysts?",
        "What are the career prospects for individuals with expertise in cybersecurity risk management?",
    ],
    "ground_truth": [
        "In the realm of artificial intelligence (AI) and machine learning (ML), several notable trends have emerged recently. Firstly, there's a growing focus on explainable AI (XAI), which aims to make AI models more transparent and understandable to humans, crucial for applications in fields like healthcare and finance where interpretability is paramount. Secondly, federated learning has gained traction, enabling training of ML models across decentralized devices while preserving data privacy, pivotal for IoT and edge computing scenarios. Additionally, reinforcement learning (RL) advancements, particularly in deep RL, have seen remarkable progress, empowering AI systems to make sequential decisions in dynamic environments, with applications spanning robotics, autonomous vehicles, and gaming. Lastly, the integration of AI with other technologies like blockchain for enhanced security and trustworthiness and with quantum computing for tackling complex optimization problems signifies promising directions for future research and innovation in the AI landscape.",
        "Several emerging programming languages are gaining traction in the industry due to their unique features and capabilities. One such language is Rust, known for its emphasis on safety, concurrency, and performance, making it suitable for systems programming where reliability and efficiency are critical. Another language on the rise is Julia, which specializes in numerical and scientific computing, offering high performance comparable to traditional languages like C and Fortran while maintaining a user-friendly syntax and extensive library support. Additionally, Kotlin, a statically typed language interoperable with Java, has become increasingly popular for Android app development, offering modern features and improved developer productivity. Lastly, Swift, developed by Apple, has gained momentum for iOS and macOS development, providing a concise and expressive syntax along with powerful features like optionals and automatic memory management. These emerging languages cater to specific niches and address evolving industry needs, showcasing their growing relevance and adoption in the programming landscape.",
        "Here are a few things to learn about Data Science to get you started: \nLearn Python or R: Choose one as your primary programming language. \nBasic Statistics: Understand mean, median, mode, standard deviation, and probability.\nData Manipulation: Learn Pandas (Python) or dplyr (R) for data cleaning and manipulation.\nData Visualization: Use Matplotlib, Seaborn (Python), or ggplot2 (R) for visualization.\nMachine Learning Basics: Start with linear regression, logistic regression, decision trees, and evaluation metrics.\nPractice: Work on projects using real-world datasets from sources like Kaggle.\nStay Updated: Follow online resources and communities for the latest trends and techniques.",
        "As a final-year Computer Science student aiming for a graduate role in cybersecurity, it's essential to focus on developing practical skills that are currently in high demand in the industry.\nSome of these key skills include:\n\n1. Knowledge of Networking: Understanding networking fundamentals, protocols (such as TCP/IP), and network architecture is crucial for identifying and mitigating security threats. Familiarize yourself with concepts like firewalls, routers, VPNs, and intrusion detection systems (IDS).\n\n2. Proficiency in Operating Systems: Gain proficiency in operating systems such as Linux and Windows, including command-line operations, system administration tasks, and security configurations. Being able to secure and harden operating systems is essential for protecting against common cybersecurity threats.\n\n3. Understanding of Cryptography: Cryptography is at the heart of cybersecurity, so having a solid understanding of encryption algorithms, cryptographic protocols, and cryptographic techniques is vital. Learn about symmetric and asymmetric encryption, digital signatures, hashing algorithms, and their applications in securing data and communications.\n\n4. Penetration Testing and Ethical Hacking: Develop skills in penetration testing and ethical hacking to identify vulnerabilities and assess the security posture of systems and networks. Familiarize yourself with tools and techniques used by ethical hackers, such as Kali Linux, Metasploit, Nmap, and Wireshark.\n\n5. Security Assessment and Risk Management: Learn how to conduct security assessments, risk assessments, and threat modeling to identify, prioritize, and mitigate security risks effectively. Understand risk management frameworks like NIST, ISO 27001, and COBIT, and how to apply them in real-world scenarios.\n\n6. Incident Response and Forensics: Acquire knowledge of incident response procedures, including detection, analysis, containment, eradication, and recovery from security incidents. Understand digital forensics principles and techniques for investigating and analyzing security breaches and cybercrimes.\n\n7. Security Awareness and Communication: Develop strong communication skills to effectively convey cybersecurity concepts, risks, and recommendations to technical and non-technical stakeholders. Being able to raise awareness about cybersecurity best practices and policies is essential for promoting a security-conscious culture within organizations.\n\n8. Continuous Learning and Adaptability: Cybersecurity is a rapidly evolving field, so it's essential to cultivate a mindset of continuous learning and adaptability. Stay updated with the latest threats, trends, technologies, and best practices through professional development, certifications, and participation in cybersecurity communities and events.\n\nBy focusing on developing these practical skills and staying abreast of industry trends and advancements, you'll be well-prepared to pursue a successful career in cybersecurity upon graduation. Additionally, consider obtaining relevant certifications such as CompTIA Security+, CEH (Certified Ethical Hacker), CISSP (Certified Information Systems Security Professional), or others to further enhance your credentials and marketability in the field.",
        "A career in cybersecurity requires a broad spectrum of practical skills, including proficiency in network security protocols and tools like firewalls and intrusion detection/prevention systems (IDS/IPS) for safeguarding network infrastructure. Secure coding practices and knowledge of common vulnerabilities are essential for developing secure software applications, with expertise in frameworks like OWASP Top 10 aiding in vulnerability mitigation. Encryption techniques and cryptographic protocols are vital for securing sensitive data, while incident response and digital forensics skills, alongside tools like SIEM systems, enable effective threat detection and response. Proficiency in penetration testing frameworks like Metasploit and security assessment tools is crucial for identifying and remediating security weaknesses, while knowledge of compliance frameworks such as GDPR ensures organizational adherence to cybersecurity regulations. Effective communication and collaboration skills are imperative for conveying cybersecurity risks and recommendations to stakeholders and collaborating with cross-functional teams to implement security measures. Continued learning and staying updated with the latest cybersecurity trends and technologies are key for navigating this ever-evolving field successfully.",
        "In-demand technical skills for data analysts include proficiency in programming languages like Python, R, or SQL for data manipulation, analysis, and visualization. Familiarity with statistical analysis techniques, such as regression analysis, hypothesis testing, and predictive modeling, is essential for deriving insights from data. Knowledge of data querying and database management systems like MySQL, PostgreSQL, or MongoDB is valuable for accessing and organizing large datasets. Expertise in data wrangling techniques, using tools like pandas, dplyr, or data.table, enables cleaning and transforming raw data into actionable insights. Proficiency in data visualization libraries like Matplotlib, ggplot2, or seaborn is crucial for creating informative and visually appealing charts, graphs, and dashboards to communicate findings effectively. Additionally, experience with machine learning frameworks like scikit-learn or TensorFlow, along with knowledge of data mining techniques, enhances the ability to build predictive models and extract patterns from data.",
        "Career opportunities for individuals in cybersecurity risk management include roles such as cybersecurity risk analysts, security consultants, risk managers, compliance officers, and cybersecurity architects. These professionals play a critical role in identifying, evaluating, and prioritizing cybersecurity risks, developing risk mitigation strategies, and ensuring compliance with regulatory requirements and industry standards. With the ever-evolving threat landscape and the increasing complexity of cybersecurity challenges, individuals with expertise in cybersecurity risk management can expect to have a wide range of career opportunities and advancement prospects in both the public and private sectors, including government agencies, financial institutions, healthcare organizations, and consulting firms. Additionally, obtaining relevant certifications such as Certified Information Systems Security Professional (CISSP), Certified Information Security Manager (CISM), or Certified Risk and Information Systems Control (CRISC) can further enhance career prospects and credibility in the field.",
    ]
})

In [87]:
# Helper function: To prevent having to run predictions again, save output to the dataset as a new column 'predictions'
import datasets

def chain_predictions_from_data(llm_chain, eval_data):
  data = eval_data.copy()
  predictions = []  # List to store predictions for all questions
  for question in data['question']:
    # Invoke the llm_chain model with the current question
    input_dict = {'query': question}
    response = llm_chain.invoke(input_dict)
    generated_text = response['result']
    predictions.append(generated_text)  # Append the generated text to the predictions list
    print(question)
    print(generated_text)
    print(response.get("source_documents", []))
  data['predictions'] = predictions  # Assign the predictions list to a new column
  return data

In [88]:
rag_data_skills_build = chain_predictions_from_data(qa, data_skills_build)
rag_data_cs_industry = chain_predictions_from_data(qa, data_cs_industry)



> Entering new RetrievalQA chain...

> Finished chain.
What is the goal of IBM SkillsBuild?
Answer: The goal of IBM SkillsBuild is to help individuals gain the necessary skills and knowledge to succeed in the modern workforce, with a focus on underrepresented communities.
[Document(page_content='IBM SkillsBuild: broader initiative\nDescription: IBM SkillsBuild is a free education program focused on underrepresented communities that helps adult learners, high school and university students, and faculty develop valuable new skills and access career opportunities. The program includes an online platform complemented by customised practical learning experiences that aim to respond to learners’ needs as they progress through their education and career journeys. This effort is in collaboration with a network of world-class education partners, including public high schools, non-profit organisations, governments, and corporations.', metadata={'source': 'skills_build.csv', 'ro

In [86]:
rag_data_skills_build

Unnamed: 0,question,ground_truth,predictions
0,What is the goal of IBM SkillsBuild?,The goal of IBM SkillsBuild is to address the ...,Answer: The goal of IBM SkillsBuild is to help...
1,What are the courses available at IBM SkillsBu...,IBM SkillsBuild offers a comprehensive range o...,Answer: IBM SkillsBuild offers a variety of co...
2,What is IBM SkillsBuild Software downloads?,IBM SkillsBuild Software downloads is an eleme...,Answer: IBM SkillsBuild Software downloads is ...
3,What learning paths do IBM SkillsBuild offer t...,Adult Learners is designed for learners (age 1...,IBM SkillsBuild offers different learning path...
4,How does IBM SkillsBuild support personalized ...,The customized practical learning experiences ...,IBM SkillsBuild supports personalized learning...
5,What can learners do with earned digital crede...,Digital credentials from IBM SkillsBuild can b...,Answer: Learners can use their digital credent...


Generate a BERTScore results table for the data set

In [89]:
import pandas as pd
from evaluate import load

def generate_bert_results_table(eval_data):
    ref_texts = eval_data['ground_truth']
    predictions = eval_data['predictions']

    # Compute BERTScore
    bertscore = load("bertscore")
    results = bertscore.compute(predictions=predictions, references=ref_texts, model_type="distilbert-base-uncased")

    # Create DataFrame from BERTScore results
    bert_results_table = pd.DataFrame(results, index=range(0, len(results['precision'])))
    bert_results_table['question'] = eval_data['question']

    bert_results_table.drop(columns=['hashcode'], inplace=True)
    bert_results_table = bert_results_table[['question', 'precision', 'recall', 'f1']]

    return bert_results_table

In [90]:
generate_bert_results_table(rag_data_cs_industry)

Unnamed: 0,question,precision,recall,f1
0,What are the latest trends in artificial intel...,0.817619,0.746059,0.780201
1,What are some emerging programming languages t...,0.752763,0.770148,0.761356
2,I am a beginner that wants to get into Data Sc...,0.81459,0.774689,0.794138
3,I am a final-year Computer Science student wan...,0.875302,0.79892,0.835368
4,What are the essential skills required for a c...,0.839886,0.762284,0.799206
5,What are some in-demand technical skills for a...,0.810697,0.796558,0.803565
6,What are the career prospects for individuals ...,0.774195,0.769974,0.772079


In [91]:
generate_bert_results_table(rag_data_skills_build)

Unnamed: 0,question,precision,recall,f1
0,What is the goal of IBM SkillsBuild?,0.867679,0.767614,0.814585
1,What are the courses available at IBM SkillsBu...,0.947694,0.854482,0.898678
2,What is IBM SkillsBuild Software downloads?,0.988668,0.99764,0.993133
3,What learning paths do IBM SkillsBuild offer t...,0.845496,0.80081,0.822547
4,How does IBM SkillsBuild support personalized ...,0.840228,0.815857,0.827863
5,What can learners do with earned digital crede...,0.774882,0.76784,0.771345
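
The `f1` column is the harmonic mean of `precision` and `recall` for each row, so the table can be sanity-checked and summarised without re-running BERTScore. A minimal sketch, with the values copied from the IBM SkillsBuild table above; the helper name `harmonic_f1` is an illustrative assumption and not part of the `evaluate` API:

```python
from statistics import mean

# (precision, recall, f1) per question, copied from the table above
skills_build_scores = [
    (0.867679, 0.767614, 0.814585),
    (0.947694, 0.854482, 0.898678),
    (0.988668, 0.997640, 0.993133),
    (0.845496, 0.800810, 0.822547),
    (0.840228, 0.815857, 0.827863),
    (0.774882, 0.767840, 0.771345),
]

def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Largest gap between the reported F1 and the recomputed one (rounding noise only)
max_gap = max(abs(harmonic_f1(p, r) - f1) for p, r, f1 in skills_build_scores)
mean_f1 = mean(f1 for _, _, f1 in skills_build_scores)

print(f"max |reported - recomputed| F1: {max_gap:.6f}")
print(f"mean F1 over the data set: {mean_f1:.4f}")
```

A single mean per data set makes it easier to compare runs, for example after changing the retriever's `k` or the prompt.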


In [None]:
# to connect to mlflow server, run in terminal: mlflow ui

In [121]:
import mlflow
from mlflow.models import infer_signature
import pandas as pd

input_columns = [{"question": "string"}]
output_columns = [{"text": "string"}]
signature = infer_signature(input_columns, output_columns)

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("/falcon-evaluation")

# eval_data: dataframe with questions, ground_truth and predictions
def mlflow_evaluation(eval_data):
  with mlflow.start_run() as run:
    mlflow_results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        model_type="question-answering",
    )
    # print(f"See aggregated evaluation results below: \n{mlflow_results.metrics}")

    eval_table = mlflow_results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

In [122]:
mlflow_evaluation(rag_data_skills_build)

2024/03/02 15:57:29 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/02 15:57:29 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


See evaluation table below: 
                                            question  \
0  How does IBM SkillsBuild help individuals deve...   
1  Can you explain the different learning paths a...   
2  What resources does IBM SkillsBuild offer to d...   
3  How does IBM SkillsBuild support personalized ...   
4  What resources does IBM SkillsBuild offer to h...   

                                        ground_truth  \
0  With IBM SkillsBuild, individuals can develop ...   
1  IBM SkillsBuild offers diverse paths for indiv...   
2  Adult Learners is designed for learners (age 1...   
3  IBM SkillsBuild supports personalized learning...   
4  IBM SkillsBuild offers a comprehensive range o...   

                                         predictions  token_count  \
0  IBM SkillsBuild is a comprehensive, online lea...          131   
1  \nAnswer: Yes, IBM SkillsBuild offers a range ...          193   
2  \nAnswer:\n\nIBM SkillsBuild offers a range of...          164   
3  \nAnswer: IBM Skil

In [123]:
mlflow_evaluation(rag_data_cs_industry)

2024/03/02 16:00:49 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2024/03/02 16:00:49 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


See evaluation table below: 
                                            question  \
0  What are the latest trends in artificial intel...   
1  What are some emerging programming languages t...   
2  I am a beginner that wants to get into Data Sc...   
3  I am a final-year Computer Science student wan...   
4  What are the essential skills required for a c...   
5  What are some in-demand technical skills for a...   
6  What are the career prospects for individuals ...   

                                        ground_truth  \
0  In the realm of artificial intelligence (AI) a...   
1  Several emerging programming languages are gai...   
2  Here is a few things to learn about Data Scien...   
3  As a final-year Computer Science student aimin...   
4  A career in cybersecurity require a broad spec...   
5  In-demand technical skills for data analysts i...   
6  Career opportunities for individuals in cybers...   

                                         predictions  token_count  \
0  A

Open question: collect and summarise the evaluation set and predictions into a CSV file?

In [124]:
rag_data_cs_industry

Unnamed: 0,question,ground_truth,predictions
0,What are the latest trends in artificial intel...,In the realm of artificial intelligence (AI) a...,Answer:\n\n1. Hybrid AI/ML: A combination of t...
1,What are some emerging programming languages t...,Several emerging programming languages are gai...,Answer: Some of the emerging programming langu...
2,I am a beginner that wants to get into Data Sc...,Here is a few things to learn about Data Scien...,"As a beginner, you should start with learning ..."
3,I am a final-year Computer Science student wan...,As a final-year Computer Science student aimin...,The practical skills required for a career in ...
4,What are the essential skills required for a c...,A career in cybersecurity require a broad spec...,Answer: The essential skills required for a ca...
5,What are some in-demand technical skills for a...,In-demand technical skills for data analysts i...,"As an AI language model, I can suggest that so..."
6,What are the career prospects for individuals ...,Career opportunities for individuals in cybers...,The career prospects for individuals with expe...


In [125]:
rag_data_skills_build

Unnamed: 0,question,ground_truth,predictions
0,How does IBM SkillsBuild help individuals deve...,"With IBM SkillsBuild, individuals can develop ...","IBM SkillsBuild is a comprehensive, online lea..."
1,Can you explain the different learning paths a...,IBM SkillsBuild offers diverse paths for indiv...,"\nAnswer: Yes, IBM SkillsBuild offers a range ..."
2,What resources does IBM SkillsBuild offer to d...,Adult Learners is designed for learners (age 1...,\nAnswer:\n\nIBM SkillsBuild offers a range of...
3,How does IBM SkillsBuild support personalized ...,IBM SkillsBuild supports personalized learning...,\nAnswer: IBM SkillsBuild is a learning platfo...
4,What resources does IBM SkillsBuild offer to h...,IBM SkillsBuild offers a comprehensive range o...,\nAnswer: IBM SkillsBuild offers a range of re...
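
The evaluation sets and predictions shown above could be written to a CSV file once and reloaded later, which avoids re-running the chain. A minimal sketch using only the standard library; the rows are shortened stand-ins for the real `question`, `ground_truth` and `predictions` columns, and the file name `rag_eval_cache.csv` is a hypothetical choice. With the pandas DataFrames used in this notebook, `DataFrame.to_csv(path, index=False)` and `pd.read_csv(path)` play the same role.

```python
import csv
import os
import tempfile

# Shortened stand-in rows; the real data has the same three columns
rows = [
    {"question": "What is the goal of IBM SkillsBuild?",
     "ground_truth": "The goal of IBM SkillsBuild is to address the ...",
     "predictions": "Answer: The goal of IBM SkillsBuild is to help..."},
    {"question": "What is IBM SkillsBuild Software downloads?",
     "ground_truth": "IBM SkillsBuild Software downloads is an eleme...",
     "predictions": "Answer: IBM SkillsBuild Software downloads is ..."},
]

# Hypothetical cache location; a temporary directory keeps the sketch side-effect free
cache_path = os.path.join(tempfile.mkdtemp(), "rag_eval_cache.csv")

# Save once after generating predictions
with open(cache_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "ground_truth", "predictions"])
    writer.writeheader()
    writer.writerows(rows)

# Reload later instead of invoking the chain again
with open(cache_path, newline="", encoding="utf-8") as f:
    cached_rows = list(csv.DictReader(f))

print(len(cached_rows), "rows reloaded from", cache_path)
```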


#### Hardware specs

In [None]:
# disk info
!df -h

In [None]:
# memory info
!cat /proc/meminfo

In [None]:
# cpu info
!cat /proc/cpuinfo

In [None]:
# TODO: graphs to illustrate the comparison between models

Fine-tuning alone does not solve hallucinations or the lack of up-to-date context!


### T5

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.llms import HuggingFacePipeline

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Reuse the model and tokenizer loaded above instead of loading the model a second time by name
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,
    temperature=0.1,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
)
llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
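
To see what the model actually receives, the template can be filled in by hand. A minimal sketch using plain `str.format` as a stand-in for `PromptTemplate`; the two context snippets are invented for illustration, and joining them with a blank line is an assumption about how the retrieved documents are combined:

```python
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:"""

# Hypothetical retrieved snippets (not taken from skills_build.csv)
retrieved_snippets = [
    "Snippet A: a course description returned by the retriever.",
    "Snippet B: a second description returned by the retriever.",
]

# Assumption: retrieved documents are concatenated with a blank line between them
context = "\n\n".join(retrieved_snippets)
prompt_text = template.format(
    context=context,
    question="What skills should an AI solution analyst have?",
)
print(prompt_text)
```

Printing the rendered prompt is a quick way to check that the context and question land where the template expects them before involving the model.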

In [None]:
from langchain.chains import RetrievalQA
retriever = db.as_retriever(search_kwargs={"k": 2})

# qa = RetrievalQA.from_chain_type(
#   llm=llm,
#   chain_type="refine",
#   retriever=retriever,
#   return_source_documents=True
# )
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=False, chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})
print(qa)

question = "What skills should an AI solution analyst have?"
result = qa.invoke({"query": question})
print(result["result"])
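
The `search_kwargs={"k": 2}` setting asks the retriever for the two stored chunks closest to the query embedding. A toy sketch of that top-k lookup in plain Python; the 3-dimensional vectors, the chunk labels and the use of Euclidean distance are illustrative assumptions, not the embeddings or index configuration used in this notebook:

```python
import math

# Hypothetical chunk embeddings (real sentence embeddings have hundreds of dimensions)
chunk_vectors = {
    "AI fundamentals course": [0.9, 0.1, 0.0],
    "Cybersecurity course": [0.1, 0.9, 0.1],
    "Data analysis course": [0.7, 0.2, 0.3],
    "Project management course": [0.0, 0.2, 0.9],
}

def top_k(query_vector, vectors, k=2):
    """Return the k labels whose vectors are nearest to the query (Euclidean distance)."""
    ranked = sorted(vectors, key=lambda label: math.dist(query_vector, vectors[label]))
    return ranked[:k]

query_vector = [0.8, 0.1, 0.1]  # hypothetical embedding of the user's question
print(top_k(query_vector, chunk_vectors, k=2))
```

A larger `k` gives the model more context at the cost of a longer prompt, which matters for small models such as `flan-t5-small`.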

In [None]:
import gradio as gr

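from langchain.chains import LLMChain

# This variant retrieves the context manually, so it needs a plain LLMChain over
# QA_CHAIN_PROMPT (inputs: 'context' and 'question') rather than the RetrievalQA chain `qa`.
# `manual_chain` is a name introduced here for that purpose.
manual_chain = LLMChain(llm=llm, prompt=QA_CHAIN_PROMPT)

def chat_interface(textbox, chat):
    retriever = db.as_retriever()
    docs = retriever.get_relevant_documents(textbox)
    context = docs[0].page_content  # use only the top-ranked document as context

    input_dict = {'question': textbox, 'context': context}
    result = manual_chain.invoke(input_dict)
    print(result)
    text = result['text']
    # Keep what follows the prompt marker; if the model does not echo the prompt, keep the whole text
    answer = text.split('Helpful Answer:')[-1].strip()
    return answer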

gr.ChatInterface(
    fn=chat_interface,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Ask me a question", container=False, scale=7),
    title="Chatbot",
    description="Ask Chatbot any question",
    theme="soft",
    examples=["What does AI stand for?", "What is Software Engineering?", "What is Cybersecurity?"],
    cache_examples=False,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch()
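
Splitting the generated text on the `Helpful Answer:` marker only works when the model echoes the prompt; indexing the second piece fails otherwise. A small sketch of a more tolerant helper; the name `extract_answer` and the extra stripping of a leading `Answer:` label (a prefix that appears in several predictions earlier in this notebook) are illustrative choices:

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last prompt marker, without a leading 'Answer:' label."""
    # rpartition returns the whole string as the last element when the marker is absent
    _, _, tail = generated_text.rpartition(marker)
    answer = tail.strip()
    if answer.startswith("Answer:"):
        answer = answer[len("Answer:"):].strip()
    return answer

print(extract_answer("Question: What is AI?\nHelpful Answer: Artificial intelligence."))
print(extract_answer("\nAnswer: An example reply."))
```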

In [None]:
import gradio as gr

def chat_interface(textbox, chat):
    # RetrievalQA performs the retrieval itself, so only the user's query is passed in
    result = qa.invoke({'query': textbox})
    # Empty unless the chain was built with return_source_documents=True
    docs = result.get("source_documents", [])
    print(docs)
    return result["result"]

gr.ChatInterface(
    fn=chat_interface,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Ask me a question", container=False, scale=7),
    title="Chatbot",
    description="Ask Chatbot any question",
    theme="soft",
    examples=["What does AI stand for?", "What is Software Engineering?", "What is Cybersecurity?"],
    cache_examples=False,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch()

In [None]:
# TODO: consider doing this first to show its downsides and that it is not required
# (fine-tuning does not solve hallucinations or the lack of timely context)
# Fine-tune with input and output example data sets

# Compare different models (one fine-tuned, one only pre-trained)

## Full adapted model (combination of all approaches)

In [None]:
# Knowledge retrieved
# Augmented Prompt
# Fine-tuned/pre-trained LLM

import gradio as gr
def chat_interface(textbox, chat):
    # docs = db.similarity_search(textbox)
    # input_dict = {'question': textbox, 'input_documents': docs }
    input_dict = {'question': textbox}
    response = llm_chain.run(input_dict)

    print("user:", textbox)
    print("bot:", response)
    return response

gr.ChatInterface(
    fn=chat_interface,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Ask me a question", container=False, scale=7),
    title="Chatbot",
    description="Ask Chatbot any question",
    theme="soft",
    examples=["What does AI stand for?", "What is Software Engineering?", "What is Cybersecurity?"],
    cache_examples=False,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).launch()