# Data Science Learning Accelerator- Langchain Workflow

The purpose of the notebook is to generate useful knowledge objects that will explain data science concepts and provide references and guidance for people who are interested in learning more about each concept. The idea here is that we will leverage langchain and large language models to generate the basic text so that we can take on the role of editor and refine the generated articles to produce high quality and accurate results at a scale that we would not be able to achieve when creating the knowledge objects from scratch on our own.

## Contents

1. [Initial Setup and API Key Management](#1-initial-setup)
2. [Few Shot Templates](#2-set-up-fewshot-templates) 
3. [Setup Large Language Models for Prompt Pipeline](#3-set-up-the-llms)
4. [Setup the Functions to Build Knowledge Object Articles](#4-setup-our-knowledge-object-building-function)
5. [5. Build a Dataframe and Output Markdown Rough Draft Articles](#5-setup-our-functions-and-dataframe-for-storing-output)

## 1) Initial Setup <a class="anchor" id="Initial Setup"></a>

Here we will setup our langchain api keys, and import our list of topics.

In [1]:
#This function allows for saving variables to a pickle backup file.
#To save use: save(backup_file_name, variable_name1, var_name_2, ...)
#To load a backup file use: load_pickle(backup_file_name)

import pickle

#save pickle files for variables
def save(filename, *args):
    # Get global dictionary
    glob = globals()
    d = {}
    for v in args:
        # Copy over desired values
        d[v] = glob[v]
    with open(filename, 'wb') as f:
        # Put them in the file
        pickle.dump(d, f)

def load_pickle(filename):
    # Get global dictionary
    glob = globals()
    with open(filename, 'rb') as f:
        for k, v in pickle.load(f).items():
            # Set each global variable to the value from the file
            glob[k] = v

In [2]:
# To setup a langchain API key see: https://docs.smith.langchain.com/setup
import os
from getpass import getpass

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY']  = getpass()

In [3]:
# To setup a Google Gemini key see: https://ai.google.dev/tutorials/setup

os.environ['GOOGLE_API_KEY']  = getpass()

In [4]:
os.environ["OPENAI_API_KEY"] = getpass()

In [3]:
# To use Hugginface Hub the API token is needed, for info see: https://huggingface.co/docs/hub/security-tokens

os.environ['HUGGINGFACEHUB_API_TOKEN']  = getpass()

In [7]:
os.environ["ANTHROPIC_API_KEY"]  = getpass()
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")

In [3]:
# Import our topics list CSV file
import pandas as pd

csv_path='topics_502.csv'
topic_list = pd.read_csv(csv_path)
topic_list.head()

Unnamed: 0,Subject
0,Ollama
1,Phishing and Social Engineering Detection
2,Vulnerability Assessment and Management
3,Word Embeddings
4,Splunk


## 2) Set Up Fewshot Templates

In this section we will create example questions and answers to show the LLM models what we would like the results to look like.

Let's define some embeddings, below are all the embeddings that were used for the project, but the one that we used in the final output is left uncommented.

In [4]:
from langchain_community.embeddings import HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
from langchain.embeddings import LlamaCppEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
import chromadb.utils.embedding_functions as embedding_functions
from langchain_openai import OpenAIEmbeddings

# open ai text embeddings
# gpt_key = os.environ.get("OPENAI_API_KEY")
# gpt_embeddings = OpenAIEmbeddings(openai_api_key=gpt_key,model='text-embedding-3-large')


# This embedding is using a model from huggingface hub-- this is experimental and may not work as is.
# HF_key = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
# huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
#     api_key = HF_key,
#     model_name="mixedbread-ai/mxbai-embed-large-v1"
# )

# This embedding is using BAA model from huggingface (a popular open source model)
BAA_model_name = "BAAI/bge-small-en"
BAA_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
baa = HuggingFaceBgeEmbeddings(
    model_name=BAA_model_name, model_kwargs=BAA_kwargs, encode_kwargs=encode_kwargs
)

# This is the official google embeddings module using embeddings-001 as the model.
# gg_key = os.environ.get("GOOGLE_API_KEY")
# gg_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=gg_key)

# # This is the basic Huggingface embedding module:
# hf_embeddings = HuggingFaceEmbeddings()

# This is the llamacpp embeddings module-- this is experimental and may not work yet as written.
# lcp_embeddings = LlamaCppEmbeddings(model_path="/home/pete/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf")


  from .autonotebook import tqdm as notebook_tqdm


Below we will create our examples with example questions and the resulting answers. This will give guidance to the LLM on what we are looking for.

In [5]:

from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate


examples = [
    {   "question": """Create an educational article that a data scientist with some college education can use 
        to teach themselves about Anomaly Detection. Please answer the following questions with each section
        of the response:
        Explain the core principles or components and brief history of Anomaly Detection..
        What are some applications where a data scientist can use Anomaly Detection?
        When Should Anomaly Detection be utilized?
        What specific technologies should a data scientist focus on in order to become an
        expert in Anomaly Detection? Please include specific technologies by name, along with where
        a data scientist can learn more about these specific technologies. 
        authoritative institutions that are important to Anomaly Detection.
        What are the strengths of Anomaly Detection?
        What are the Limitations of Anomaly Detection?
        What are Alternative Options to Anomaly Detection?
        List the most common terminology associated with Anomaly Detection and give a brief definition for 
        each.
        What are some Example Deployments of Anomaly Detection?
        Please list some beginner level and intermediate level resources for someone learning 
        about Anomaly Detection to use in order to gain expertise and background knowledge. 
    """,
        "answer": """# Anomaly Detection


        ## Summary of Anomaly Detection:
        
        Anomaly detection is a technique used by data scientists to identify unusual or rare events, outliers, or patterns 
        within a dataset. It involves identifying deviations from normal behavior or expected outcomes, which can be useful 
        for detecting fraud, diagnosing diseases, monitoring network security, and more. The field has its roots in statistical 
        methods like control charts, clustering, and outlier detection, with recent advancements driven by machine learning 
        techniques like one-class SVM, isolation forests, and autoencoders.

        ## Applications :
        
        Anomaly detection has a wide range of applications across various domains, including:
        - Finance: detecting fraudulent transactions, identifying insider trading, or predicting stock market crashes.
        - Healthcare: diagnosing diseases, monitoring patient health, or detecting medical equipment failures.
        - Retail: identifying shoplifting, preventing inventory theft, or detecting supply chain disruptions.
        - Cybersecurity: detecting intrusions, identifying malware, or monitoring network traffic for anomalies.
        - Manufacturing: detecting equipment failures, monitoring product quality, or predicting maintenance needs.
        - Environmental monitoring: detecting pollution, monitoring climate change, or predicting natural disasters.

        ## When Should Anomaly Detection be utilized?
        
        Anomaly detection should be utilized when there's a need to identify rare or unusual events within a dataset, or when 
        there's a need to monitor system behavior for deviations from expected norms. It's particularly useful when dealing 
        with high-dimensional data, where traditional statistical methods might not be applicable or efficient.

        ## Focus Areas
        
        A data scientist should focus on learning about machine learning techniques like one-class SVM, isolation forests, autoencoders, 
        as well as statistical methods like control charts, clustering, and outlier detection. Some specific technologies include:
        - One-class SVM (Support Vector Machine): A technique that can detect outliers by constructing a hyperplane that 
        separates data points from a "pseudo-outlier" class.
        - Isolation Forests: A method that detects anomalies by growing multiple decision trees on random subsets of data, 
        and identifying deviant instances based on their unique feature subsets.
        - Autoencoders: A neural network architecture that learns data representations by encoding input data into a lower-dimensional 
        space, and decoding it back to its original form. Anomalies can be detected by comparing the encoded data with its reconstruction.
        - Control Charts: A statistical method that monitors the behavior of a process variable over time, by constructing 
        moving averages and standard deviations, and identifying deviations from expected norms.
        - Clustering: A technique that groups data points based on their similarity, by minimizing within-cluster variance or maximizing 
        between-cluster separation. Outliers can be identified as data points that do not belong to any cluster or are far from any cluster.
        - Outlier Detection: A method that identifies outliers by computing summary statistics like mean, median, or standard deviation, 
        and identifying data points that deviate significantly from these norms.

        ## Strengths
        
        The strengths of anomaly detection include the ability to identify rare or unusual events, applicability to high-dimensional data, potential for real-time monitoring, and the wide range of applications across various domains. It can also be used for 
        unsupervised learning tasks, where there's no need for labeled data or explicit definitions of anomalies.

        ## Limitations
        
        The limitations of anomaly detection include its potential for false positives or false negatives, its sensitivity to data quality, its need for domain knowledge or feature engineering, and its potential for overfitting when using complex models. It might also struggle with concept drift or changing data distributions over time.

        ## Alternative and Complimentary Options
        
        Some alternative options to anomaly detection include:
        - Change Detection: A technique that identifies changes or shifts within a dataset over time, by comparing data points across 
        different time intervals or windows.
        - Clustering: A technique that groups data points based on their similarity, by minimizing within-cluster variance or maximizing 
        between-cluster separation. Outliers can be identified as data points that do not belong to any cluster or are far from any cluster.
        - Classification: A technique that assigns labels or categories to data points, by constructing a decision boundary that separates 
        different classes. Outliers can be identified as data points that fall outside this boundary or are misclassified.

        ## Learning Resources

        - Anomaly detection using Isolation Forest – A Complete Guide https://www.analyticsvidhya.com/blog/2021/07/anomaly-detection-using-isolation-forest-a-complete-guide/
        
        - PyOD- This open-source Python library offers a variety of anomaly detection algorithms, including statistical, distance-based, density-based, and clustering-based methods. It is a versatile Python library for detecting anomalies in multivariate data. https://pypi.org/project/pyod/
        
        - Anomaly Detection in Python: Best Practices and Techniques by Dmytro Iakubovskyi. https://medium.com/data-and-beyond/anomaly-detection-in-python-best-practices-and-techniques-9b93d37244dc

        - Coursera Course: Unsupervised Learning, Recommenders, Reinforcement Learning. This is a beginner level course focusing on unsupervised learning including clustering and anomaly detection. https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning

        - Microsoft article on using anomaly detection in cyber security: DETECTING CYBER ATTACKS USING ANOMALY DETECTION WITH
        EXPLANATIONS AND EXPERT FEEDBACK https://www.microsoft.com/en-us/research/uploads/prod/2019/06/ADwithGraderFeedback.pdf

        - Kaggle Competition: Anomaly Detection\*\* This Kaggle competition provides a real-world dataset for anomaly detection in financial transactions. Participants can develop and submit their own anomaly detection models and compete for prizes. https://www.kaggle.com/c/ieee-fraud-detection 

        - Scikit-learn: Anomaly Detection\*\* The scikit-learn library includes several anomaly detection algorithms, such as Isolation Forest, Local Outlier Factor (LOF), and One-Class Support Vector Machines (OCSVM). These algorithms are easy to implement and can be used for a wide range of anomaly detection tasks. https://scikit-learn.org/stable/modules/outlier_detection.html

        ## Key Terms

        - One-class SVM
        - Isolation Forest
        - Autoencoders
        - Control Charts
        - Network Security
        - Outlier Detection
        - Clustering
        - Change Detection
    """
    },
    {"question": """Create an educational article that a data scientist with some college education can use 
        to teach themselves about Automated Knowledge Graph Construction. Please answer the following questions with each section
        of the response:
        Explain the core principles or components and brief history of Automated Knowledge Graph Construction..
        What are some applications where a data scientist can use Automated Knowledge Graph Construction?
        When Should Automated Knowledge Graph Construction be utilized?
        What specific technologies should a data scientist focus on in order to become an
        expert in Automated Knowledge Graph Construction? Please include specific technologies by name, along with where
        a data scientist can learn more about these specific technologies. 
        authoritative institutions that are important to Automated Knowledge Graph Construction.
        What are the strengths of Automated Knowledge Graph Construction?
        What are the Limitations of Automated Knowledge Graph Construction?
        What are Alternative Options to Automated Knowledge Graph Construction?
        List the most common terminology associated with Automated Knowledge Graph Construction and give a brief definition for 
        each.
        What are some Example Deployments of Automated Knowledge Graph Construction?
        Please list some beginner level and intermediate level resources for someone learning 
        about Automated Knowledge Graph Construction to use in order to gain expertise and background knowledge. 
    
     """,
        "answer": """Automated Knowledge Graph Construction

        ## Overview

        Automated Knowledge Graph Construction (AKGC) is a subfield of Artificial Intelligence (AI) that focuses on automatically constructing knowledge graphs from various data sources. A knowledge graph is a graph-based data structure that represents entities and their relationships, making it easier to analyze, reason, and draw inferences from complex data sets. AKGC involves several components, including data preprocessing, feature extraction, graph generation, and evaluation.

        The history of AKGC can be traced back to the early days of AI, with researchers exploring techniques for automated reasoning, inference, and knowledge representation. However, it was not until the advent of big data and the internet that AKGC gained significant traction, enabling organizations to manage and make sense of vast amounts of information.

        ## Applications

        AKGC has numerous applications across various domains, including:

        - Cybersecurity: Constructing knowledge graphs from security data can help identify attack patterns, visualize threats, and automate incident response.
        - Healthcare: AKGC can be used to create knowledge graphs from electronic health records, enabling clinicians to make more informed decisions about patient care.
        - Finance: AKGC can help financial institutions detect fraud, assess credit risk, and automate compliance processes.
        - Natural Language Processing (NLP): AKGC can be used to extract meaning from unstructured text data, enabling applications like chatbots, sentiment analysis, and document classification.
        - E-commerce: AKGC can help retailers analyze customer behavior, personalize recommendations, and automate supply chain management.
        - Smart Cities: AKGC can be used to manage urban infrastructure, optimize resource allocation, and improve public safety.

        ## When to use AKGC

        AKGC should be utilized when dealing with large volumes of data that require complex analysis, reasoning, and inference. It is particularly useful for tasks that involve:
        - Identifying patterns and relationships across different data sources.
        - Automating repetitive tasks, such as data cleaning, and feature extraction.
        - Enhancing decision-making processes with context-rich visualizations.
        - Supporting real-time analysis and prediction tasks.
        - Facilitating interdisciplinary collaboration by integrating domain-specific knowledge with data-driven insights.

        ## Technologies

        To become an expert in AKGC, a data scientist should focus on the following technologies:
        - Graph Databases: Neo4j, OrientDB, and Titan are popular graph databases that support knowledge graph construction and analysis.
        - Semantic Web Technologies: RDF (Resource Description Framework) and SPARQL (SPARQL Protocol and RDF Query Language) are essential for representing entities, relationships, and queries within a knowledge graph.
        - Machine Learning (ML): Techniques like clustering, classification, and deep learning can be used to extract features from data, enabling more accurate knowledge graph generation.
        - Natural Language Processing (NLP): Techniques like named entity recognition, sentiment analysis, and text classification can be used to extract meaning from unstructured text data, enriching knowledge graphs with contextual information.
        - Programming Languages: Python, Java, and JavaScript are popular programming languages for developing AKGC applications, with libraries like Py2neo, Janus, and GraphQL providing support for graph database operations and semantic web technologies.

        ## Strengths and Limitations

        Strengths of AKGC include its ability to handle large volumes of data, support real-time analysis, and facilitate interdisciplinary collaboration. However, it also has some limitations, such as data quality issues, performance challenges, and the need for expertise in various domains. Additionally, AKGC may not be widely adopted in some organizations due to concerns about data privacy, security, and ownership.

        ## Alternative Options

        Alternative options to AKGC include manual knowledge graph construction, rule-based systems, and traditional data analysis techniques like regression, clustering, and classification. However, these methods may not scale well with big data or provide the same level of context-awareness as AKGC.

        ## Terminology

        Some common terminology associated with AKGC includes:
        - Knowledge Graph: A graph-based data structure that represents entities and their relationships.
        - Entity: A thing or concept within a knowledge graph, such as a person, organization, or location.
        - Relationship: A connection between two entities, representing a type of interaction or association (e.g., "works at," "attended," "located at").
        - Feature Extraction: The process of extracting relevant features from data sources, enabling more accurate knowledge graph generation.
        - Graph Generation: The process of creating a knowledge graph from data sources, often involving feature extraction, entity recognition, and relationship identification.
        - Evaluation: The process of assessing the quality, accuracy, and completeness of a knowledge graph, often using metrics like precision, recall, and F1 score.
        - Semantic Web: A vision for the web where data is represented using standardized formats like RDF, enabling machines to process and reason about information more effectively.
        - Big Data: Large volumes of data that require specialized tools and techniques for processing, analysis, and visualization.
        - Machine Learning: A subfield of AI that focuses on developing algorithms for pattern recognition, prediction, and decision-making.
        - Natural Language Processing: A subfield of AI that focuses on enabling machines to understand, interpret, and generate human language.
        - Cybersecurity: The practice of protecting computer systems, networks, and sensitive information from unauthorized access, attack, or damage.
        - Smart Cities: The field of urban planning focused on using technology to improve public services, enhance quality of life, and reduce environmental impact.

        ## Deployments

        Example deployments of AKGC include:
        - Merative (formerly IBM Watson Health) for analyzing electronic health records and improving patient outcomes: https://www.merative.com/company
        - Amazon Neptune for managing graph databases and supporting knowledge graph applications: https://aws.amazon.com/neptune/
        - Microsoft Azure Cosmos DB for indexing and querying graph-based data models: https://azure.microsoft.com/en-us/services/cosmos-db/

        ## Resources

        To learn more about Automated Knowledge Graph Construction:
        - Tutorials: Websites like Towards Data Science, Analytics Vidhya, and Machine Learning Mastery offer tutorials on AKGC, covering topics like feature extraction, graph generation, and evaluation.
        - Books: Books like "Graph Databases" by Ian Robinson provide comprehensive overviews of AKGC, including its history, principles, and applications.

        ## Additional Resources:

        1. DeepDive: A Data-Driven Knowledge Graph Construction System: https://www.cs.cornell.edu/~cdesa/papers/sigmodrecord2016_deepdive_highlight.pdf
        DeepDive is a system for automatically constructing knowledge graphs from unstructured text. It uses a variety of natural language processing and machine learning techniques to extract and link information from text.
        2. Neural Knowledge Graph Embeddings for Link Prediction: https://arxiv.org/pdf/2008.07723.pdf
        This paper introduces a neural network-based approach to knowledge graph embedding. It shows that neural embeddings can capture complex relationships between entities and improve the accuracy of link prediction tasks.
        3. A Comprehensive Survey on Automatic Knowledge Graph
        Construction: https://dl.acm.org/doi/pdf/10.1145/3618295
        This paper is a survey of more than 300 methods to summarize
        the latest developments in knowledge graph construction.
        4. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction: https://aclanthology.org/P19-1470.pdf
        This article covers the development of Commonsense Transformers
        for Automatic Knowledge Graph Construction which aim to create generative models that learn to generate rich and diverse commonsense descriptions in natural language.

        ## Key Terms
        - Py2neo
        - GraphQL
        - Neo4j
        - OrientDB
        - Knowledge Graph
        - Graph Databases
        - Watson
        - Artificial Intelligence
        - Java
        - Azure
        - Titan
        - Semantic Web Technologies
        - Graph Generation
        - SPARQL
        - F1
        - IBM Watson Health
        - Merative
        - Feature Extraction
        - Neptune
        - NLP
        - Ian Robinson
        - OrientDB
        - JavaScript
        - RDF
        - Microsoft Azure Cosmos
        - KDnuggets
"""
    },
    {
        "question":"""Create an educational article that a person with some college education can use 
        to teach themselves about cybersecurity knowledge graphs. Please answer the following questions with each section of the answer:
        What Is cybersecurity knowledge graphs?
        Who Should Use cybersecurity knowledge graphs?
        When Should cybersecurity knowledge graphs be utilized?
        How Does a person learn about cybersecurity knowledge graphs? Please include specific steps for a person who is starting to learn about the topic, such as information about technology solutions or authoritative institutions that are important to cybersecurity knowledge graphs.
        What are the strengths of cybersecurity knowledge graphs?
        What are the Limitations of cybersecurity knowledge graphs?
        What are Alternative Options to cybersecurity knowledge graphs?
        What are some Example Deployments of cybersecurity knowledge graphs?
        For each of the items above, Include two to three specific facts or statements that support the general points you make. Please provide the answer in markdown format.
""",
        "answer":"""# Cybersecurity Knowledge Graphs
        
        ## Overview:

        Cybersecurity Knowledge Graphs (CKGs) are a type of knowledge representation that uses graph structures to represent and organize information about cybersecurity threats, vulnerabilities, and countermeasures. They visually depict entities and relationships between them in a graphical structure, making it easier to understand and reason about cybersecurity incidents. By leveraging semantic technologies, CKGs enable context-aware analysis, helping security professionals make informed decisions. As with any knowledge graph, CKGs generally consist of entities (nodes) and relationships (edges). Common entities in CKGs are:
        Threats: Information about different cyber attacks, malware, and hacking techniques.
        Vulnerabilities: Weaknesses in systems and software that attackers can exploit.
        Systems: Devices, applications, and networks within an organization.
        Indicators of Compromise (IOCs): Clues that a system might be under attack.
        By connecting these elements, the knowledge graph can reveal hidden patterns and connections. Here are some examples of how this works:idents, especially those involving multiple entities or relationships. They are particularly helpful for:

        - Identifying attack paths: Analysts can see how attackers might move from one vulnerability in a system to another, ultimately reaching sensitive data.
        - Threat intelligence: The knowledge graph can connect information about new threats with similar attacks in the past, helping security teams anticipate and respond faster.
        - Incident investigation: Security professionals can use the knowledge graph to link specific IOCs with known threats, speeding up investigations.

        ## Applications and Use Cases

        CKGs are valuable for cybersecurity analysts who need to manage large volumes of security data, identify patterns, and perform root cause analysis. They can also benefit CTI (Cyber Threat Intelligence) teams by providing context-rich visualizations that aid in understanding evolving threats and their relationships with various entities in the cybersecurity landscape.

        ## When are they needed?
        
        idents, especially those involving multiple entities or relationships. They are particularly helpful for:

        - Analyzing large-scale data breaches
        - Uncovering sophisticated attack patterns
        - Performing root cause analysis of security incidents
        - Visualizing and understanding threat intelligence
        - Supporting security automation and orchestration initiatives

        ## Focus Areas

        Learning about CKGs involves understanding their underlying technologies, applications, and best practices. Key steps include:

        - Familiarize yourself with graph databases, such as Neo4j, and the principles of knowledge representation in graphs.
        - Study semantic web technologies, including RDF (Resource Description Framework) and SPARQL (SPARQL Protocol and RDF Query Language), to learn how to represent entities and relationships effectively.
        - Explore existing CKG solutions, like MITRE's ATT&CK knowledge graph, to understand real-world applications and data models.
        - Practice using tools and platforms that enable the creation, management, and analysis of CKGs. Online courses, workshops, and tutorials can help you gain hands-on experience.

        ## Strengths

        CKGs offer several advantages:

        - Context-aware analysis: By representing relationships between entities, CKGs provide a more comprehensive understanding of cybersecurity incidents.
        - Scalability: CKGs can handle large volumes of data and grow as new information becomes available.
        - They allow for the identification of relationships and dependencies between different security components.
        - Query flexibility: graph data systems like Neo4j and SPARQL offer powerful querying capabilities to extract insights from the graph structure.
        - Real-time analysis: CKGs can be updated in real time, making them suitable for incident response scenarios.
        - Enhanced Inference Capabilities: They enable the identification of new relationships and patterns not directly observed in the data due to their inherent interpretability of semantic connections.

        ## Weaknesses

        Despite their benefits, CKGs have some limitations:

        - Complexity: Building and maintaining a CKG can be challenging, requiring expertise in various domains (e.g., graph databases, semantic web technologies).
        - Data quality: The accuracy and completeness of the data used to build the CKG significantly impact its usefulness.
        - Performance: Querying large graphs may lead to performance issues if not properly optimized.
        - Adoption: Although gaining traction, CKGs are not yet widely adopted in cybersecurity operations, which could limit access to resources and best practices.
        - Skill Acquisition: The interdisciplinary nature of the subject requires proficiency in cybersecurity, data modeling, graph theory, and semantic web technologies.

        ## Alternative and Complimentary Options

        Other options for visualizing and analyzing cybersecurity data include:

        - Security Information and Event Management (SIEM) systems.
        - Security Orchestration, Automation, and Response (SOAR) platforms.
        - Threat intelligence platforms (TIPs)

        ## Example Deployments
        
        Example Deployments of Cybersecurity Knowledge Graphs CKGs have been successfully deployed in various cybersecurity applications:

        - Security Automation: KGs can be used to automate security tasks like incident response. By linking indicators of compromise (IOCs) to known threats within the knowledge graph, security systems can automatically trigger predefined responses when an IOC is detected, saving time and improving efficiency.


        - Security Configuration Management: KGs can be used to ensure secure configurations for complex systems like container orchestration platforms (e.g., Kubernetes). By storing information about secure configuration parameters and their relationships, the knowledge graph can identify misconfigurations and recommend fixes, preventing potential security vulnerabilities.

        - Threat Intelligence Analysis: KGs can be incredibly useful for threat analysts. They can connect information about new threats with past attacks and similar malware strains. This allows analysts to understand the broader context of a threat, predict its behavior, and develop more effective mitigation strategies.

        - MITRE's ATT&CK knowledge graph for mapping adversary tactics and techniques: https://attack.mitre.org

        - IBM's X-Force Threat Intelligence Platform, which uses a graph database to represent threat data: https://exchange.xforce.ibmcloud.com

        - NIST Cybersecurity Framework for understanding and managing cybersecurity risk: https://www.nist.gov/cyberframework
        """
    },]

Below define our question or prompt that we will feed to the LLM and we build a function for picking the most similar example from above.

In [6]:
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain_community.vectorstores import Chroma
import chromadb.utils.embedding_functions as embedding_functions

def example_picker(embeddings,n_samples):
    example_sel = SemanticSimilarityExampleSelector.from_examples(
                                examples,
                                embeddings,
                                Chroma,
                                # Number of examples
                                k=n_samples,
                                )
    return example_sel

def ko_question_func(subject):
    question = f"""
            Create an educational article that a data scientist with some college education can use 
            to teach themselves about {subject}. Please answer the following questions with each section
            of the response:
            Explain the core principles or components and brief history of {subject}.
            What are some applications where a data scientist can use {subject}?
            When Should {subject} be utilized?
            What specific technologies should a data scientist focus on in order to become an
            expert in {subject}? Please include specific technologies by name, along with where
            a data scientist can learn more about these specific technologies. 
            What authoritative institutions are important to {subject}.
            What are the strengths of {subject}?
            What are the Limitations of {subject}?
            What are Alternative Options to {subject}?
            List the most common terminology associated with {subject} and give a brief definition for 
            each.
            What are some Example Deployments of {subject}?
	        Please list some beginner level and intermediate level resources for someone learning 
            about {subject} to use in order to gain expertise and background knowledge.
            """

    return question


Configure a formatter that will format the few-shot examples into a string. This formatter should be a PromptTemplate object.

In [7]:
from langchain.prompts import ChatPromptTemplate

example_prompt = ChatPromptTemplate.from_messages(
    [("human","{question}"), ("ai","{answer}")],
)

print(example_prompt.format(**examples[0]))

Human: Create an educational article that a data scientist with some college education can use 
        to teach themselves about Anomaly Detection. Please answer the following questions with each section
        of the response:
        Explain the core principles or components and brief history of Anomaly Detection..
        What are some applications where a data scientist can use Anomaly Detection?
        When Should Anomaly Detection be utilized?
        What specific technologies should a data scientist focus on in order to become an
        expert in Anomaly Detection? Please include specific technologies by name, along with where
        a data scientist can learn more about these specific technologies. 
        authoritative institutions that are important to Anomaly Detection.
        What are the strengths of Anomaly Detection?
        What are the Limitations of Anomaly Detection?
        What are Alternative Options to Anomaly Detection?
        List the most common termi

More info on creating a few shot template: https://python.langchain.com/docs/modules/model_io/prompts/few_shot_examples


## 3) Set up the LLMs

This section contains the code to set up several LLMs to compare results and find the best option. We can use any of these options for the KO creation and for the resource link request step.

### Third Party Hosted LLMs

These LLMs are hosted by outside companies like Google, OpenAI and the like.

#### Google LLMs

As of Spring 2024 Google Gemini Pro could be used with out cost. I believe this is set to change soon so I'm not sure how this will work going forward. During Spring 24 I found this to be the most cost effective and the results were OK, not the best but workable.

In [11]:
#Gemini pro can be used with an API key without a premium account. There are some limitations.
from langchain_google_genai import GoogleGenerativeAI

gg_key = os.environ.get("GOOGLE_API_KEY")
gg_model = GoogleGenerativeAI(model="gemini-pro", 
                           google_api_key=gg_key,
                           temperature=0.2, 
                           num_words=1000, 
                           convert_system_message_to_human=True, )


More info on setting up google genai:

https://python.langchain.com/docs/integrations/text_embedding/google_generative_ai

#### Using Anthropic Claude 3

Overall I found Claude 3 to be the best paid option. The cost for the version listed below was about 10 cents per API call, and they did offer a $5 credit for first time users at that time.

In [43]:
#Anthropic Claude models are pay per use, they currently offer $5 credit free for a new account.
from langchain_anthropic import ChatAnthropic

chat_claude = ChatAnthropic(anthropic_api_key=anthropic_api_key, 
                            model_name="claude-3-opus-20240229",
                            temperature=0.1,
                            verbose=True,
                            max_tokens=3000
                            )


#### Using Chat GPT

GPT 4 as shown below cost about 10 cents per API call. The results were slightly better than Google Gemini Pro and below Claude 3 in my opinion.

In [44]:
from getpass import getpass
import os
from langchain_openai import ChatOpenAI

gpt_key = os.environ.get("OPENAI_API_KEY")
gpt_model = ChatOpenAI(#model_name="gpt-3.5-instruct", 
                        model_name="gpt-4-0125-preview",
                        openai_api_key=gpt_key,
                        temperature=0.2,
                        model_kwargs={
                            "frequency_penalty": 0.5
                        }
                        )

### Local LLMS

#### Huggingface LLM Models

Below we are running a local llm using huggingface. Running this code will download the model to the local machine which can take time as the model file is typically very large. The models that were most effective with a decently powered PC were under 13 billion parameter, quantized models.

For more info on how to set up Huggingface on your system here are some references:

[More info about the models](https://huggingface.co/models)

[Huggingface Docs](https://huggingface.co/docs/transformers/main/llm_tutorial)

[How to Blog](https://www.markhneedham.com/blog/2023/06/23/hugging-face-run-llm-model-locally-laptop/)

In [None]:


from transformers import pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline


# max context is 32768 

hf_pipe = pipeline("text-generation",
                    device_map="auto",
                    do_sample=True,
                    # device=0,
                    model="meta-llama/model_name_here",
                    trust_remote_code=True,
                    max_new_tokens=2000, 
                    temperature=0.8,
                    pad_token_id="eos_token_id",
                    )
hf = HuggingFacePipeline(pipeline=hf_pipe)



#### Local LLMs using LlamaCpp and GGUF models

There are many tutorials for setting up LlamaCpp, but overall this was the system I found to be the most efficient.  Ultimately I found this to be the most effective and simple to set up. Once it's running you will just need to download models in GGUF format and place them in the model folder.

Here are a couple good tutorials on setting this up on your system:

[Datacamp Tutorial](https://www.datacamp.com/tutorial/llama-cpp-tutorial)

[Medium Article](https://medium.com/@fradin.antoine17/3-ways-to-set-up-llama-2-locally-on-cpu-part-1-5168d50795ac)

In [None]:
#This model was effective on my local machine, your mileage may vary but typically an 8B to 13B parameter model seems to work fine.

# from langchain.llms import LlamaCpp


# llm_cpp = LlamaCpp(
#             streaming = False,
#             model_path="/home/pete/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf",
#             n_gpu_layers=1,
#             n_batch=512,
#             temperature=0.1,
#             verbose=True,
#             n_ctx=32768,
#             max_tokens=0,
#             max_seq_length = None,
#             top_p=0.9,
#             top_k=50,
#             repetition_penalty=1.2,
#             presence_penalty=1.0,
#             )



In [8]:
from langchain.llms import LlamaCpp


herm_cpp = LlamaCpp(
            streaming = False,
            model_path="/home/pete/models/nous-hermes-llama2-13b.Q4_0.gguf",
            n_gpu_layers=1,
            n_batch=512,
            temperature=0.2,
            verbose=True,
            n_ctx=32768,
            max_tokens=0,
            max_seq_length = None,
            top_p=0.9,
            top_k=50,
            repetition_penalty=1.2,
            presence_penalty=1.0,
            )



                max_seq_length was transferred to model_kwargs.
                Please confirm that max_seq_length is what you intended.
                repetition_penalty was transferred to model_kwargs.
                Please confirm that repetition_penalty is what you intended.
                presence_penalty was transferred to model_kwargs.
                Please confirm that presence_penalty is what you intended.
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /home/pete/models/nous-hermes-llama2-13b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:   

#### Using Ollama server (running on another device on local net)

For this method you first need to setup Ollama and then activate and load a model. In my case I was running Ollama on a different device and connecting remotly but if it's a local instance just replace the IP address in the base_url with localhost. Here are some resources for setting up an Ollama instance (I used a docker container and had good success with that.)

[Getting Started with Ollama and Docker](https://collabnix.com/getting-started-with-ollama-and-docker/)

[Dockerhub Ollama](https://hub.docker.com/r/ollama/ollama#!)

In [None]:
# from langchain_community.llms import Ollama
# from langchain_community.chat_models import ChatOllama
# # from langchain_experimental.llms.ollama_functions import OllamaFunctions
# from langchain_community.embeddings import OllamaEmbeddings

# ollama_model_2 = Ollama(base_url='http://192.168.68.125:11434',   #your local ollama server address here
# model="mixtral")


## 4) Setup our Knowledge Object Building Function

### Generate Knowledge Objects
Here we set up the functions to run our knowledge object prompts.



In [9]:
#The system message below will be added to our prompt and is designed to guide the LLM on how to answer the question.

from langchain.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage

def ko_model_run(embeddings, n_samples, example_prompt, subject, model):
    example_sel=example_picker(embeddings,n_samples)
    question = ko_question_func(subject)
    few_shot_prompt = FewShotChatMessagePromptTemplate(
        example_selector=example_sel,
        example_prompt=example_prompt,
        # suffix="Question: {input}",
        input_variables=["question"],
        )
    
    prompt = ChatPromptTemplate.from_messages(
        [
        SystemMessage(
            content=("""Return the response to the question in markdown format and written in paragraph 
                     form in the style of an educational blog post preferring complete paragraphs instead of
                     lists of bullet points. Please keep the response between 1000 and 1500 words in length.
                     When citing references only use real, verifiable references with links that are known to
                     exist on the internet now, do not use creativity to generate these items.
                     """
            )
        ),
        few_shot_prompt,
        HumanMessage(content=question)
        ]
        )
    chain = prompt | model
    output = chain.invoke({"question":question})
    return output



In [10]:
topic0 = topic_list['Subject'][0]

In [11]:
#For a one-off run of the above prompt this is sufficient. Below we will set up a function to run multiple outputs at once.

output = ko_model_run(baa, 1, example_prompt, topic0, herm_cpp)


llama_print_timings:        load time =   43081.19 ms
llama_print_timings:      sample time =     192.45 ms /  1460 runs   (    0.13 ms per token,  7586.39 tokens per second)
llama_print_timings: prompt eval time =  220605.72 ms /  2757 tokens (   80.02 ms per token,    12.50 tokens per second)
llama_print_timings:        eval time =  524869.32 ms /  1459 runs   (  359.75 ms per token,     2.78 tokens per second)
llama_print_timings:       total time =  750032.26 ms /  4216 tokens


In [12]:
output

'\nAI: Ollama\n        ## Overview\n        Ollama is a subfield of Artificial Intelligence (AI) that focuses on developing algorithms for natural language understanding, generation, and interaction. It involves several components, including speech recognition, language modeling, dialogue management, and machine learning. Ollama has numerous applications across various domains, including virtual assistants, chatbots, language translation, and sentiment analysis.\n        ## History\n        The history of Ollama can be traced back to the early days of AI, with researchers exploring techniques for natural language processing, understanding, and generation. However, it was not until the advent of big data and the internet that Ollama gained significant traction, enabling organizations to manage and make sense of vast amounts of information.\n        ## Applications\n        Ollama has numerous applications across various domains, including:\n        - Virtual Assistants: Ollama can be us

In [14]:
# Uncomment below to see result of one-off run:

from IPython.display import Markdown, display

display(Markdown(output))


AI: Ollama
        ## Overview
        Ollama is a subfield of Artificial Intelligence (AI) that focuses on developing algorithms for natural language understanding, generation, and interaction. It involves several components, including speech recognition, language modeling, dialogue management, and machine learning. Ollama has numerous applications across various domains, including virtual assistants, chatbots, language translation, and sentiment analysis.
        ## History
        The history of Ollama can be traced back to the early days of AI, with researchers exploring techniques for natural language processing, understanding, and generation. However, it was not until the advent of big data and the internet that Ollama gained significant traction, enabling organizations to manage and make sense of vast amounts of information.
        ## Applications
        Ollama has numerous applications across various domains, including:
        - Virtual Assistants: Ollama can be used to create virtual assistants that can understand and respond to natural language queries, enabling users to interact with devices more naturally.
        - Chatbots: Ollama can be used to develop chatbots that can understand and generate natural language responses, enabling businesses to automate customer service, sales, and marketing tasks.
        - Language Translation: Ollama can be used to develop machine translation systems that can translate text from one language to another, enabling people to communicate across linguistic barriers.
        - Sentiment Analysis: Ollama can be used to analyze text data and extract sentiment, enabling businesses to understand customer feedback, monitor social media, and improve brand reputation.
        - Speech Recognition: Ollama can be used to develop speech recognition systems that can transcribe spoken language into text, enabling people with disabilities to communicate more easily and enabling businesses to automate call centers.
        ## When to use Ollama
        Ollama should be utilized when dealing with natural language data that requires complex analysis, understanding, and generation. It is particularly useful for tasks that involve:
        - Identifying patterns and relationships across different languages.
        - Automating customer service tasks, such as answering FAQs, resolving issues, and providing product recommendations.
        - Enhancing language translation tasks with context-rich data.
        - Supporting real-time language processing tasks, such as sentiment analysis, language detection, and intent recognition.
        - Facilitating interdisciplinary collaboration by integrating domain-specific knowledge with language-specific insights.
        ## Technologies
        To become an expert in Ollama, a data scientist should focus on the following technologies:
        - Natural Language Processing (NLP) Libraries: Libraries like NLTK, spaCy, and Gensim provide support for text preprocessing, feature extraction, and language modeling tasks.
        - Speech Recognition Libraries: Libraries like SpeechRecognition, PyAudio, and Festival provide support for speech recognition tasks, enabling more natural language interactions with devices.
        - Dialogue Management Systems: Systems like Dialogflow, Rasa, and Tars provide support for dialogue management tasks, enabling more engaging and personalized conversations with users.
        - Machine Learning Frameworks: Frameworks like TensorFlow, PyTorch, and Scikit-learn provide support for machine learning tasks, enabling more accurate language processing models.
        ## Strengths and Limitations
        Strengths of Ollama include its ability to handle complex language tasks, support real-time language processing, and facilitate interdisciplinary collaboration. However, it also has some limitations, such as data quality issues, performance challenges, and the need for expertise in various domains. Additionally, Ollama may not be widely adopted in some organizations due to concerns about data privacy, security, and ownership.
        ## Alternative Options
        Alternative options to Ollama include rule-based systems, statistical language models, and traditional data analysis techniques like regression, clustering, and classification. However, these methods may not scale well with big data or provide the same level of context-awareness as Ollama.
        ## Terminology
        Some common terminology associated with Ollama includes:
        - Natural Language Processing (NLP): A subfield of AI that focuses on enabling machines to understand, interpret, and generate human language.
        - Speech Recognition: A subfield of NLP that focuses on developing algorithms for transcribing spoken language into text.
        - Dialogue Management: A subfield of NLP that focuses on developing algorithms for managing conversations with users, enabling more engaging and personalized interactions.
        - Language Modeling: A subfield of NLP that focuses on developing algorithms for predicting the next word or sentence in a sequence, enabling more accurate language generation tasks.
        - Machine Learning (ML): A subfield of AI that focuses on developing algorithms for pattern recognition, prediction, and decision-making.
        - Big Data: Large volumes of data that require specialized tools and techniques for processing, analysis, and visualization.
        - Sentiment Analysis: A subfield of NLP that focuses on analyzing text data and extracting sentiment, enabling businesses to understand customer feedback, monitor social media, and improve brand reputation.
        ## Deployments
        Example deployments of Ollama include:
        - Amazon Lex for building conversational interfaces that can understand and respond to natural language queries: https://aws.amazon.com/lex/
        - Google Cloud Natural Language API for analyzing text data and extracting features like sentiment, syntax, and entity recognition: https://cloud.google.com/natural-language/
        - Microsoft Azure Cognitive Services for developing language processing applications that can understand, generate, and interact with natural language data: https://azure.microsoft.com/en-us/services/cognitive-services/
        ## Resources
        To learn more about Ollama, a data scientist can refer to the following resources:
        - Tutorials: Websites like Towards Data Science, Analytics Vidhya, and Machine Learning Mastery offer tutorials on Ollama, covering topics like speech recognition, language modeling, and dialogue management.
        - Books: Books like "Natural Language Processing with Python" by Michael J. C. Cooper provide comprehensive overviews of Ollama, including its history, principles, and applications.
        - Authoritative Institutions: Institutions like ACL, EMNLP, and NAACL provide authoritative resources for Ollama, including conferences, workshops, and journals.

### Generate Resource Links

This intent with this section is to ask the LLMs to give us a list of resources to learn more on the topic. At this point Google Gemini is giving the best results but typically about 50% are invalid so the current iteration will require heavy editing.

In [15]:
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate


rl_examples = [
    {   "question": """Create a list of about 8 to 12 specific resources that a data science 
        student can look to for developing expertise around Cybersecurity Knowledge Graphs
        in order to develop expertise on the most relevant data tools related to this topic.
        Please ensure that the references are real and not created by you or inaccurate or 
        invalid URLs. Please focus on sources that will lead to directly to concrete 
        knowledge on this topic such as scholarly articles, free online courses, books or 
        video tutorials.""",
        "answer":"""## Additional Resources:
        - MITRE ATT&CK Knowledge Base:** https://attack.mitre.org/
        This remains a crucial resource for understanding adversary tactics, techniques, and knowledge (ATT&CK). It provides a standardized framework for classifying cyber threats, essential for populating knowledge graphs.
        - Neo4j: Graphs for Cybersecurity:** https://neo4j.com/blog/graphs-cybersecurity-knowledge-graph-digital-twin/
        This guide dives into the practical application of knowledge graphs for cybersecurity. Learn how Neo4j, a popular graph database platform, leverages these graphs for threat detection, incident analysis, and attack surface management.
        - Papers with Code: Knowledge Graph Embedding:** https://paperswithcode.com/task/knowledge-graph-embedding
        Papers with Code is a fantastic resource for staying updated on the latest research in knowledge graph embedding techniques. Knowledge embedding allows for efficient representation and analysis of knowledge graphs within data science models.
        - Amazon Web Services (AWS) Knowledge Graphs:** https://aws.amazon.com/neptune/knowledge-graphs-on-aws/
        AWS offers Neptune, which is touted as a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Follow the link above for more info on Neptune.
        - OpenCypher:** https://opencypher.org/
        openCypher is an open source implementation of Cypher - the most widely adopted, fully-specified, and open query language for property graph databases. Cypher was developed by Neo4j. OpenCypher is part of an initiative to create an open standard GQL (Graph Query Language) within the International Organization for Standardization (ISO).
        - ResearchGate: Cybersecurity knowledge graphs:** https://www.researchgate.net/publication/370401574_Cybersecurity_knowledge_graphs
        This research paper delves into the technical aspects of building cybersecurity knowledge graphs. It explores the use of graph-based data models and discusses the benefits, challenges, and existing research projects in this area.
        - **Open-Source Knowledge Graph Frameworks:**
        Several open-source knowledge graph frameworks are available, such as Apache Jena and OpenKE. Exploring their documentation and tutorials can provide hands-on experience in building and manipulating knowledge graphs. (You can find more information by searching online for these frameworks)
        """,
    },]

In [16]:
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain_community.vectorstores import Chroma
import chromadb.utils.embedding_functions as embedding_functions

def rl_example_picker(embeddings,n_samples):
    rl_example_sel = SemanticSimilarityExampleSelector.from_examples(
                                rl_examples,
                                embeddings,
                                Chroma,
                                # Number of examples
                                k=n_samples,
                                )
    return rl_example_sel

def rl_question_func(subject):
    question = f"""
        Create a list of about 8 to 12 specific resources that a data 
        science student can look to for developing expertise around 
        {subject} in order to develop expertise on the most relevant 
        data tools related to this topic. Ensure that the references 
        are real and not created by you or inaccurate or invalid URLs.
        Focus on sources that will lead to directly to concrete 
        knowledge on this topic such as scholarly articles, free 
        online courses, books or video tutorials.
        """
    return question


In [17]:
from langchain.prompts import ChatPromptTemplate

rl_example_prompt = ChatPromptTemplate.from_messages(
    [("human","{question}"), ("ai","{answer}")],
)

print(rl_example_prompt.format(**rl_examples[0]))

Human: Create a list of about 8 to 12 specific resources that a data science 
        student can look to for developing expertise around Cybersecurity Knowledge Graphs
        in order to develop expertise on the most relevant data tools related to this topic.
        Please ensure that the references are real and not created by you or inaccurate or 
        invalid URLs. Please focus on sources that will lead to directly to concrete 
        knowledge on this topic such as scholarly articles, free online courses, books or 
        video tutorials.
AI: ## Additional Resources:
        - MITRE ATT&CK Knowledge Base:** https://attack.mitre.org/
        This remains a crucial resource for understanding adversary tactics, techniques, and knowledge (ATT&CK). It provides a standardized framework for classifying cyber threats, essential for populating knowledge graphs.
        - Neo4j: Graphs for Cybersecurity:** https://neo4j.com/blog/graphs-cybersecurity-knowledge-graph-digital-twin/
     

In [18]:
def rl_model_run(embeddings, n_samples, example_prompt, subject, model):
    example_sel=rl_example_picker(embeddings,n_samples)
    question = rl_question_func(subject)
    few_shot_prompt = FewShotChatMessagePromptTemplate(
        example_selector=example_sel,
        example_prompt=example_prompt,
        # suffix="Question: {input}",
        input_variables=["question"],
        )
    
    rl_prompt = ChatPromptTemplate.from_messages(
        [
        SystemMessage(
            content=("""Return the response to the question in markdown format with a list of real and
                     verifiable resources with a brief description of each resource. When citing references
                     only use real references and exclude any generated sources that do not exist.
                     """
            )
        ),
        few_shot_prompt,
        HumanMessage(content=question)
        ]
        )
    chain = rl_prompt | model
    output = chain.invoke({"question":question})
    return output

In [39]:
#One-off resource link prompt:

rl_output = rl_model_run(baa, 1, rl_example_prompt, "Network Security Technologies", gpt_model)

In [40]:
rl_output

AIMessage(content='Certainly! Below is a curated list of resources that can help a data science student develop expertise around Network Security Technologies. These resources include scholarly articles, free online courses, books, and video tutorials.\n\n### Online Courses\n\n1. **Cybrary - Network Security:**\n   - A comprehensive course covering the fundamentals of network security, including firewalls, VPNs, and intrusion detection systems.\n   - URL: [https://www.cybrary.it/course/network-security/](https://www.cybrary.it/course/network-security/)\n\n2. **Coursera - Introduction to Cyber Security Specialization by NYU:**\n   - Offers a series of courses that provide a comprehensive overview of network security principles and practices.\n   - URL: [https://www.coursera.org/specializations/intro-cyber-security](https://www.coursera.org/specializations/intro-cyber-security)\n\n### Books\n\n3. **"Network Security Essentials" by William Stallings:**\n   - A textbook that covers the key

In [35]:
save('backup', 'output', 'rl_output')

## 5) Setup our Functions and Dataframe for Storing Output

Here we will set up a dataframe to store our outputs, and setup our function to iterate through each of our processes. We want to create the knowledge object for each subject and store them in the KO column, then create the list of resources and store in the resource column, then extract key terms and store in the key terms column.

#### Create the DF to store the output

In [19]:
#Dataframe creation
import pandas as pd

ko_df = topic_list.copy()
ko_df['KO'] = None
ko_df['Resources'] = None
ko_df['Terms'] = None
ko_df

Unnamed: 0,Subject,KO,Resources,Terms
0,Ollama,,,
1,Phishing and Social Engineering Detection,,,
2,Vulnerability Assessment and Management,,,
3,Word Embeddings,,,
4,Splunk,,,
5,LLM Performance Evaluation,,,
6,Behavioral Analytics,,,


#### Create Function to Make our KOs

In [20]:
# Function for KO creation
import pandas as pd

#Define the function to iterate through the df and add kos
def apply_ko_model_run(row, embeddings, n_samples, example_prompt, model):
    output = ""
    #Added while loop because some outputs were incorrect and length seemed the best test.
    while len(output.split()) < 350:
        ai_output = ko_model_run(embeddings, n_samples, example_prompt, row['Subject'], model)
        
        #try for open_ai output except for others
        try:
            output = ai_output.content
        except:
            output = ai_output
    
    # Return the output to be stored in the 'KO' column

    return output

# Apply the function to each row of the DataFrame
ko_df['KO'] = ko_df.apply(apply_ko_model_run, axis=1, args=(baa, 1, example_prompt, herm_cpp))

Llama.generate: prefix-match hit

llama_print_timings:        load time =   43081.19 ms
llama_print_timings:      sample time =    2003.87 ms / 15237 runs   (    0.13 ms per token,  7603.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time = 8585315.82 ms / 15237 runs   (  563.45 ms per token,     1.77 tokens per second)
llama_print_timings:       total time = 8768157.40 ms / 15238 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =   43081.19 ms
llama_print_timings:      sample time =     232.61 ms /  1765 runs   (    0.13 ms per token,  7587.84 tokens per second)
llama_print_timings: prompt eval time =  184720.30 ms /  2346 tokens (   78.74 ms per token,    12.70 tokens per second)
llama_print_timings:        eval time =  614214.49 ms /  1764 runs   (  348.19 ms per token,     2.87 tokens per second)
llama_print_timings:       to

In [21]:
#Check the output

print(ko_df.loc[3,'KO'])


AI: Word Embeddings
            ## Overview
            Word Embeddings are a type of natural language processing technique that maps words or phrases from a vocabulary to vectors of real numbers. These vectors capture semantic relationships between words, making it easier for machines to understand, interpret, and generate human language. Word Embeddings have gained significant traction in recent years, enabling applications like machine translation, sentiment analysis, and text classification.
            ## History
            The history of Word Embeddings can be traced back to the early days of AI, with researchers exploring techniques for natural language understanding, reasoning, and knowledge representation. However, it was not until the advent of big data and the internet that Word Embeddings gained significant traction, enabling organizations to manage and make sense of vast amounts of information.
            ## Applications
            Word Embeddings have numerous applica

In [22]:
save('backup2', 'ko_df')

#### Get Resource Lists 

In [59]:
# Function for resource list creation
import pandas as pd

#Define the function to iterate through the df and add the resorce texts
def apply_resource_model_run(row, embeddings, n_samples, example_prompt, model):
    output = rl_model_run(embeddings, n_samples, example_prompt, row['Subject'], model)
    # Return the output, openai usually returns as AIMessage type so added the try .content
    try:
        return output.content
    except:
        return output

# Apply the function to each row of the DataFrame
ko_df['Resources'] = ko_df.apply(apply_resource_model_run, axis=1, args=(baa, 1, rl_example_prompt, gpt_model))

In [61]:
#Check the output:

print(ko_df.loc[3,'Resources'])

Certainly! Hugging Face has become a pivotal platform in the data science and machine learning community, especially for those working with natural language processing (NLP). Below is a curated list of resources that can help a data science student develop expertise around Hugging Face and its ecosystem:

1. **Hugging Face's Official Website:** https://huggingface.co/
   - The primary source for all things related to Hugging Face, including documentation, model repositories, and community forums.

2. **Transformers Library Documentation:** https://huggingface.co/docs/transformers/index
   - Detailed documentation on the Transformers library by Hugging Face, which is crucial for understanding how to implement state-of-the-art NLP models.

3. **Hugging Face Course:** https://huggingface.co/course/chapter1
   - A free course offered by Hugging Face that covers the fundamentals of Transformers, how to use pre-trained models for various NLP tasks, and how to contribute back to the community

In [62]:
save('backup2', 'ko_df')

#### Functions to add Keyword List to KO

In [63]:
# Function for keyword extraction with yake:
import yake
import nltk
from nltk.corpus import stopwords
import spacy

# Download the stopwords from nltk
nltk.download('stopwords')
nlp = spacy.load("en_core_web_lg")

# Define a function to apply keyword extraction to the 'KO' column
def extract_keywords(row):
    
    kw_list = []
    text = row['KO']
    custom_words = row['Subject']
    english_stopwords = stopwords.words('english')
    english_stopwords.extend(custom_words)
    
    kw_extractor = yake.KeywordExtractor(
    n=3,
    top=5,
    lan='en',
    dedupLim=0.1,
    dedupFunc='seqm',
    stopwords=english_stopwords,
    windowsSize=1,
    features=None
    )
    
    if text:
        keywords = kw_extractor.extract_keywords(text)
        kw_list.extend((kw) for kw, v in keywords)
    else:
        pass
    
    #spacy entity extraction
    doc = nlp(text)
    proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]
    entities = [ent.text for ent in doc.ents]
    all_concepts = proper_nouns + entities
    unique_concepts = set(all_concepts)

    if unique_concepts:
        kw_list.extend(unique_concepts)
    else:
        pass
    
    return kw_list
        
    
# Apply the function to each row and store the results in the 'Terms' column
ko_df['Terms'] = ko_df.apply(extract_keywords, axis=1)


[nltk_data] Downloading package stopwords to /home/pete/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [64]:
#Check the output:

print(ko_df.loc[0,'Terms'])

['Brief History', 'Shot Learning', 'FSL models', 'new', 'adapt', 'Tutorials', 'Graph', 'https://arxiv.org/abs/1904.05046\n', 'Intermediate', 'Recognition', 'Snell et al.', '2021', 'Paperspace', 'Learning', '## Strengths', '## Core Principles', 'Imaging', 'Snell et al', 'Neural', 'Computer Vision', 'PyTorch', 'Vision', 'Language', 'Character', '##', 'Networks', 'https://www.youtube.com/watch?v=efL8S9udCxY', 'Tour', 'Courses', '## Alternative', 'K', 'Yannic', '1', 'Mini', '.', '## Example Deployments\n1', 'TensorFlow', 'Prototypical', 'Meta', '600', 'Meta-Learning', 'Finn', 'Alternative', 'Frameworks', 'Matching', 'Mini-ImageNet', 'Model-Agnostic Meta-Learning', 'Relation', 'Limitations', 'Meta Learning and Few Shot Learning', '2020', 'Principles', 'Core', 'Meta-Learning Libraries', 'Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks', 'Prototypical Networks', '## Limitations', 'al', 'CIFAR', 'Omniglot', 'Resources', 'https://blog.paperspace.com/few-shot-learning/', 'Image

#### Export rows to Markdown files

With this step, we'll bring together the rough knowledge object article, the list of resources and the list of keywords into a single rough markdown file. Once this is completed the articles will need to be proof read, fact checked and polished to create an accurate knowledge object to help DS students guide their learning process.

In [65]:
ko_df.head()

Unnamed: 0,Subject,KO,Resources,Terms
0,Few Shot Learning,# Few Shot Learning\n\n## Core Principles and ...,Certainly! Few-Shot Learning is a fascinating ...,"[Brief History, Shot Learning, FSL models, new..."
1,Reinforcement Learning from Human Feedback,Reinforcement Learning from Human Feedback (RL...,Certainly! Reinforcement Learning from Human F...,"[Reinforcement Learning, RLHF, systems, adapt,..."
2,LLM Fine-Tuning,LLM Fine-Tuning: Empowering Language Models fo...,Certainly! Fine-tuning Large Language Models (...,"[LLM Fine-Tuning, natural language, pre-traine..."
3,Hugging Face,# Hugging Face: A Comprehensive Guide for Data...,Certainly! Hugging Face has become a pivotal p...,"[Hugging Face, NLP, Face tools, pre-trained, I..."
4,Intrusion Detection and Prevention,# Intrusion Detection and Prevention\n\n## Cor...,Certainly! Below is a curated list of resource...,"[IDP, intrusion detection systems, data, logs,..."


In [66]:
import pandas as pd
from markdownify import markdownify as md

# Function to convert HTML to markdown (may not be necessary)
def html_to_markdown(html):
    return md(html)

# Function to create markdown list from a list of strings
def list_to_markdown(lst):
    return '\n'.join(f'- {item}' for item in lst)

# Iterate over the DataFrame and create markdown files
for index, row in ko_df.iterrows():
    
    combined_text = '# ' + row['Subject'] + '\n\n' + row['KO'] + '\n\n' + row['Resources'] + '\n\n' + '## Key Terms' + '\n\n' + list_to_markdown(row['Terms'])
    
    # Convert any possible HTML content to markdown (optional)
    combined_text = html_to_markdown(combined_text)
    
    #Define file name and convert to markdown file
    row_subj = row['Subject']
    name_nospace = row_subj.replace(' ', '_')
    file_name = f"KO_rough_{name_nospace}.md"
    
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write(combined_text)

    print(f"Exported {file_name}")

Exported KO_rough_Few_Shot_Learning.md
Exported KO_rough_Reinforcement_Learning_from_Human_Feedback_.md
Exported KO_rough_LLM_Fine-Tuning.md
Exported KO_rough_Hugging_Face.md
Exported KO_rough_Intrusion_Detection_and_Prevention.md
Exported KO_rough_Knowledge_Graphs.md
Exported KO_rough_ML_Classification_for_NLP.md
Exported KO_rough_Named_Entity_Recognition.md
Exported KO_rough_Natural_Language_ToolKit.md
Exported KO_rough_Neo4J_Database.md
