
# Knowlets: Vector Database and Network Visualization

This Jupyter notebook documents the implementation of the Knowlets system, which organizes knowledge hierarchically in a tree. Each Knowlet is a node that stores data, links to its child nodes, and supports analysis (topic modeling, sentiment analysis, clustering) and visualization.


In [509]:
pip install numpy scikit-learn pyvis nltk gensim matplotlib scipy networkx IPython openai

Collecting numpy==1.21.0
  Using cached numpy-1.21.0-cp38-cp38-macosx_10_9_x86_64.whl (16.9 MB)
Collecting scikit-learn==0.24.2
  Downloading scikit_learn-0.24.2-cp38-cp38-macosx_10_13_x86_64.whl (7.2 MB)
Collecting pyvis==0.1.9
  Downloading pyvis-0.1.9-py3-none-any.whl (23 kB)
Collecting nltk==3.6.2
  Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
Collecting matplotlib==3.4.3
  Downloading matplotlib-3.4.3-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB)
Collecting scipy==1.7.0
  Downloading scipy-1.7.0-cp38-cp38-macosx_10_9_x86_64.whl (31.9 MB)
Collecting networkx==2.6.3
  Downloading networkx-2.6.3-py3-none-any.whl (1.9 MB)

# Identifying potential use-cases for Knowlets as a Vector Database / Self-contained Network Visualization and Knowledge Containers:

This notebook extends my earlier one. A Knowlet has a name, children (other Knowlets), data (associated with the Knowlet), a registry (its parent Knowlet), a file (the path to an associated file), and other attributes.
Knowlets can be used to build a hierarchical structure of knowledge, with each Knowlet containing data and references to child Knowlets. The code includes methods for adding children, adding data, performing queries, visualizing the Knowlet structure, performing topic modeling, sentiment analysis, splitting the hierarchy based on text similarity, and more.
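
To make the hierarchy concrete, here is a minimal sketch of the tree shape only. `MiniKnowlet` and `walk` are hypothetical names used for illustration, not part of the notebook's class, and the sketch assumes `registry` points at the parent node, as the docstrings describe.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniKnowlet:
    """Hypothetical, stripped-down stand-in for the Knowlet class."""
    name: str
    children: list = field(default_factory=list)
    data: dict = field(default_factory=dict)
    registry: Optional["MiniKnowlet"] = None  # parent node, None for the root

    def add_child(self, child: "MiniKnowlet") -> None:
        # Link the pair in both directions: parent -> child and child -> parent
        self.children.append(child)
        child.registry = self

    def walk(self, depth: int = 0):
        # Yield (depth, name) pairs in depth-first order
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)


root = MiniKnowlet("Corpus")
doc = MiniKnowlet("austen-emma.txt")
root.add_child(doc)
doc.add_child(MiniKnowlet("marriage"))

# Print an indented outline of the tree
for depth, name in root.walk():
    print("  " * depth + name)
```

The full class below layers preprocessing, querying, and analysis on top of this same parent/child shape.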

# Switching from using Langchain to directly querying GPT:
This version queries GPT directly instead of going through Langchain. Responses to user queries are generated with the OpenAI Chat Completion API, using the GPT-3.5 Turbo model.
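
As a rough sketch of what a direct query looks like, the snippet below only assembles the request arguments and does not contact the API. `build_request` is a hypothetical helper introduced for illustration; the commented-out call assumes the legacy (pre-1.0) `openai` package that the notebook code uses.

```python
import json

MODEL_NAME = "gpt-3.5-turbo"  # model named in the text


def build_request(user_query: str) -> dict:
    """Assemble keyword arguments for a Chat Completion request (hypothetical helper)."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": user_query}],
    }


request = build_request("List the main topics of the following text as JSON.")
print(json.dumps(request, indent=2))

# Sending the request needs a valid API key, so it is left commented out:
# chat_completion = openai.ChatCompletion.create(**request)
# answer = chat_completion.choices[0].message.content
```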

# Implementing non-GPT-based methods for topic modeling, sentiment analysis, and preprocessing:
The code includes several non-GPT based libraries and models for various tasks:

- Topic Modeling: Latent Dirichlet Allocation (LDA) is used to perform topic modeling on the text data of the Knowlet and its children. The code optimizes the number of topics and provides the most probable words for each topic.
- Sentiment Analysis: The nltk.sentiment.vader package is used for sentiment analysis on the text data of the Knowlet and its children. Sentiment scores are calculated and stored in the Knowlet's data.
- Preprocessing: The code performs text preprocessing tasks such as tokenization, lemmatization, stop word removal, and sentence splitting using libraries like nltk, gensim, and sklearn.
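
The sentiment step reduces VADER's compound score to a coarse label. A minimal sketch of that mapping follows, using the same ±0.05 cut-offs as the class code; the function name and the sample scores are made up for illustration.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Invented scores; real ones would come from
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
for score in (0.62, 0.0, -0.31):
    print(score, label_from_compound(score))
```

Scores that fall exactly on a cut-off count as neutral, because both comparisons are strict.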

# Additional Features:
- Visualizing the Knowlet Structure: The code includes a method to visualize the Knowlet and its children using a network graph. The Pyvis library is used to create an interactive network visualization.
- Splitting the Hierarchy: The code uses TF-IDF vectorization, cosine similarity, and Agglomerative Clustering to split the hierarchy of the Knowlet based on text similarity. It generates a dendrogram visualization of the clustering.
- Autonomous Splitting: The code includes methods for autonomously splitting the Knowlet based on GPT-3.5 Turbo responses or non-GPT based topic modeling. It uses the OpenAI Chat Completion API to generate JSON-formatted topic words and sentiment information. The Knowlet's children and data are updated based on the generated information.
- Parsing JSON-formatted Responses: The code includes a method to parse JSON-formatted strings into Python dictionaries.
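
The similarity measure behind the hierarchy split can be sketched without any third-party library. The version below is a simplified, pure-Python TF-IDF with cosine similarity, not the scikit-learn code the notebook uses; whitespace tokenization and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf, so a term present in every document keeps a non-zero weight
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "emma woodhouse lived in highbury",
    "emma woodhouse lived near highbury",
    "the ship sailed at dawn",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # documents sharing most of their terms
sim_far = cosine(vecs[0], vecs[2])    # documents sharing no terms
print(round(sim_close, 3), round(sim_far, 3))
```

Agglomerative clustering then repeatedly merges the most similar documents or groups, which is what the dendrogram displays.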

I have tested the code with various scenarios, and it seems to be working as intended. If you have any suggestions or feedback on how to improve the code or if there are any specific aspects you would like me to focus on, please let me know. I am looking forward to discussing this further with you.


# LLM-based Implementation of a Knowlet

In [521]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pyvis.network import Network
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
from gensim.models import Phrases
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from scipy.cluster.hierarchy import dendrogram, linkage
import tempfile
from collections import deque
from pyvis import network as net
import networkx as nx
from IPython.display import display, HTML
import json
import openai
import multiprocessing

OPENAI_API_KEY="REPLACE WITH YOUR OPENAI API KEY"
openai.api_key = OPENAI_API_KEY

class Knowlet:
    def __init__(self, name):
        """
        Initialize a Knowlet object.

        Args:
            name (str): The name of the Knowlet.

        Attributes:
            name (str): The name of the Knowlet.
            children (list): List of child Knowlets.
            data (dict): Dictionary to store data associated with the Knowlet.
            registry (None or object): Reference to the Knowlet's parent.
            file (None or str): Path to a file associated with the Knowlet.
            stop_words (set): Set of stop words for preprocessing.
            lemmatizer (WordNetLemmatizer): Word lemmatizer for preprocessing.
            sia (SentimentIntensityAnalyzer): Sentiment analyzer for sentiment analysis.
        """
        self.name = name
        self.children = []
        self.data = {}
        self.registry = None
        self.file = None
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.sia = SentimentIntensityAnalyzer()
    
    def add_child(self, child):
        """
        Add a child Knowlet to the current Knowlet.

        Args:
            child (Knowlet): The child Knowlet to be added.
        """
        self.children.append(child)
        # Register this Knowlet as the child's parent
        child.set_registry(self)

    def add_data(self, data, tag):
        """
        Add data to the current Knowlet.

        Args:
            data (str or list): The data to be added.
            tag (str): The tag or label associated with the data.
        """
        if isinstance(data, str):
            data = self.preprocess(data)
        elif isinstance(data, list):
            data = [self.preprocess(sentence) for sentence in data]
        else:
            raise ValueError("Invalid data type. Expected str or list.")

        self.data[tag] = self.data.get(tag, []) + [data]

    def preprocess(self, text):
        """
        Preprocesses the given text.

        Args:
            text (str): The text to be preprocessed.

        Returns:
            list: The preprocessed sentences.
        """
        # Tokenize the text into sentences
        sentences = sent_tokenize(text)
        return sentences

    def get_data(self, key):
        """
        Get the data associated with the given key.

        Args:
            key (str): The key associated with the data.

        Returns:
            list or None: The data associated with the key, or None if key not found.
        """
        return self.data.get(key)

    def query(self, query_string):
        """
        Search for data based on the query string.

        Args:
            query_string (str): The query string to search for.

        Returns:
            list or None: The data matching the query, or None if no match found.
        """
        # Split query_string into words
        words = query_string.split()

        # Look for exact matches in self.data
        for word in words:
            if word in self.data:
                return self.data[word]

        # Recursively search children for matches
        for child in self.children:
            result = child.query(query_string)
            if result:
                return result

        # If no match is found, return None
        return None

    def complex_query(self, query_string):
        """
        Perform a complex query to search for data.

        Args:
            query_string (str): The query string to search for.

        Returns:
            Knowlet or None: The Knowlet matching the query, or None if no match found.
        """
        # Search all properties of self for matches
        if query_string in self.name:
            return self
        for key, value in self.data.items():
            if query_string in key or query_string in str(value):
                return self

        # Recursively search children for matches
        for child in self.children:
            result = child.complex_query(query_string)
            if result:
                return result

        # If no match is found, return None
        return None

    def add_file(self, file):
        """
        Add a file path to the current Knowlet.

        Args:
            file (str): The path to the file associated with the Knowlet.
        """
        self.file = file

    def get_file(self):
        """
        Get the path to the file associated with the Knowlet.

        Returns:
            str or None: The path to the file, or None if no file is associated.
        """
        return self.file

    def set_registry(self, registry):
        """
        Set the registry (parent) of the Knowlet.

        Args:
            registry (Knowlet or None): The parent Knowlet or None if the Knowlet is root.
        """
        self.registry = registry

    def get_registry(self):
        """
        Get the registry (parent) of the Knowlet.

        Returns:
            Knowlet or None: The parent Knowlet or None if the Knowlet is root.
        """
        return self.registry

    def visualize(self):
        """
        Visualize the Knowlet and its children using a network graph.

        Returns:
            Network: The Pyvis Network graph object.
        """
        G = Network(height="1080px", width="100%", notebook=True, cdn_resources="remote")
        self._build_graph(G, self.name)
        G.show("knowlet_structure.html")
        return G

    def perform_topic_modeling(self, max_topics, min_topics=2):
        """
        Perform topic modeling on the text data of the Knowlet and its children.

        Args:
            max_topics (int): The maximum number of topics to consider.
            min_topics (int): The minimum number of topics to consider. Default is 2.

        Returns:
            LdaModel: The optimized Latent Dirichlet Allocation (LDA) model.
        """
        # Get the text data from the Knowlet and its children
        text_data = self._get_text_data()

        # Enhance preprocessing:
        # Perform stemming/lemmatization and stop word removal
        lemmatizer = WordNetLemmatizer()
        stop_words = set(stopwords.words('english'))
        text_data = [
            [lemmatizer.lemmatize(word) for word in doc.lower().split() if word not in stop_words]
            for doc in text_data
        ]

        # Handle bigrams
        bigram = Phrases(text_data, min_count=5, threshold=10)
        for idx in range(len(text_data)):
            for token in bigram[text_data[idx]]:
                if '_' in token:  # '_' indicates a bigram
                    text_data[idx].append(token)

        # Create the Dictionary and Corpus needed for Topic Modeling
        dictionary = Dictionary(text_data)
        corpus = [dictionary.doc2bow(doc) for doc in text_data]

        # Optimize the number of topics
        coherence_scores = []
        models = []
        for num_topics in range(min_topics, max_topics + 1):
            model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
            models.append(model)
            coherence_model = CoherenceModel(model=model, texts=text_data, dictionary=dictionary, coherence='c_v')
            coherence_scores.append(coherence_model.get_coherence())

        # Choose the model with the highest coherence score
        optimal_model = models[coherence_scores.index(max(coherence_scores))]

        return optimal_model

    def _get_text_data(self):
        """
        Retrieve the text data from the Knowlet and its children.

        Returns:
            list: The text data as a list of strings.
        """
        text_data = []

        data_values = list(self.data.values())
        if data_values:
            text_data.extend([str(value) for value in data_values])

        for child in self.children:
            text_data.extend(child._get_text_data())

        return text_data

    def _get_topic_words(self, lda, num_words):
        """
        Get the most probable words for each topic in the LDA model.

        Args:
            lda (LdaModel): The LDA model.
            num_words (int): The number of words to retrieve for each topic.

        Returns:
            list: The topic words for each topic.
        """
        topic_words = []

        for topic_idx in range(lda.num_topics):
            top_words = lda.show_topic(topic_idx, num_words)
            top_words = [word for word, _ in top_words]
            topic_words.append(top_words)

        return topic_words

    def split_hierarchy(self):
        """
        Split the hierarchy of the Knowlet based on text similarity.

        Uses TF-IDF vectorization, cosine similarity, and Agglomerative Clustering.

        Displays a dendrogram visualization of the clustering.
        """
        # Get the sentences as text data
        text_data = self._get_text_data_sentences()

        # Vectorize the text data using TF-IDF
        vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        tfidf_matrix = vectorizer.fit_transform(text_data)

        # Compute pairwise cosine similarities
        similarities = cosine_similarity(tfidf_matrix)

        # Apply Agglomerative Clustering to form the hierarchy
        clustering = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='ward')
        clusters = clustering.fit_predict(similarities)

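        # Visualize the clustering. The linkage matrix is built from the TF-IDF
        # vectors, not from the cluster labels, which carry no distance information.
        plt.figure(figsize=(10, 7))
        plt.title('Hierarchical Clustering Dendrogram')
        dendrogram(linkage(tfidf_matrix.toarray(), method='ward'))
        plt.show()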

    def _get_text_data_sentences(self):
        """
        Retrieve the sentences from the Knowlet.

        Returns:
            list: The sentences as a list of strings.
        """
        sentences = []

        def collect(value):
            # add_data stores nested lists of sentences, so unpack them recursively
            if isinstance(value, str):
                sentences.append(value)
            elif isinstance(value, (list, tuple)):
                for item in value:
                    collect(item)

        for value in self.data.values():
            collect(value)

        return sentences

    def perform_sentiment_analysis(self):
        """
        Perform sentiment analysis on the text data of the Knowlet and its children.

        Adds sentiment information to the data of each Knowlet.
        """
        # Initialize Sentiment Analyzer
        sia = SentimentIntensityAnalyzer()

        def flatten(value):
            # add_data stores nested lists of sentences; yield the strings inside
            if isinstance(value, str):
                yield value
            elif isinstance(value, (list, tuple)):
                for item in value:
                    yield from flatten(item)

        data_items = list(self.data.items())  # Copy, because self.data is modified below
        for key, value in data_items:
            if key.endswith('_sentiment'):
                continue  # Skip labels written by an earlier run
            text = ' '.join(flatten(value))
            if text:  # Perform sentiment analysis on the stored text
                sentiment_score = sia.polarity_scores(text)
                if sentiment_score['compound'] > 0.05:
                    sentiment = 'positive'
                elif sentiment_score['compound'] < -0.05:
                    sentiment = 'negative'
                else:
                    sentiment = 'neutral'

                # Store the sentiment with the corresponding key
                self.data[key + '_sentiment'] = sentiment

        # Recursively perform sentiment analysis on children
        for child in self.children:
            child.perform_sentiment_analysis()

    def _build_graph(self, G, parent_name):
        """
        Build a network graph representation of the Knowlet and its children.

        Args:
            G (Network): The Pyvis Network graph object.
            parent_name (str): The name of the parent Knowlet.
        """
        # Initialize the graph and node properties
        stack = deque([(self, parent_name)])
        visited = set()
        child_counter = {}

        while stack:
            node, node_name = stack.pop()

            if node_name in visited:
                continue
            visited.add(node_name)
            basis = node.get_data("most_similar_word") or ""
            label = f"{node.name}\nNumber of children: {len(node.children)}"

            # Add topic information to the label
            topic_info = node.get_data("topic_info")
            if topic_info is not None:
                label += f"\nTopic: {topic_info['topic']}\nTop Words: {', '.join(topic_info['top_words'])}"

            # Node size and color
            size = child_counter.get(node_name, 5) * 2
            if 'sentiment' in node.data:
                if int(node.get_data("sentiment")[0][0]) > 3:
                    color = 'green'
                elif int(node.get_data("sentiment")[0][0]) < 2:
                    color = 'red'
                else:
                    color = 'gray'
            else:
                color = 'blue'

            G.add_node(node_name, label=label, title=str(node), color=color, size=size)

            for child in reversed(node.children):
                label = f"{child.name}\nNumber of children: {len(child.children)}"
                topic_info = child.get_data("topic_info")
                if topic_info is not None:
                    label += f"\nTopic: {topic_info['topic']}\nTop Words: {', '.join(topic_info['top_words'])}"

                # Child color
                if 'sentiment' in child.data:
                    if int(child.get_data("sentiment")[0][0]) > 3:
                        color = 'green'
                    elif int(child.get_data("sentiment")[0][0]) < 2:
                        color = 'red'
                    else:
                        color = 'gray'
                else:
                    color = 'blue'

                G.add_node(child.name, label=label, title=str(child), color=color, size=2)

                # Edge color and thickness (based on the number of children)
                G.add_edge(node_name, child.name)

                stack.append((child, child.name))
                child_counter[node_name] = child_counter.get(node_name, 0) + 1

    def autonomous_split(self):
        """
        Perform autonomous splitting of the Knowlet using GPT-3.5 Turbo.

        Uses the OpenAI Chat Completion API to generate JSON-formatted topic words and sentiment information.

        Updates the Knowlet's children and data based on the generated information.
        """
        chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[
            {"role": "user", "content": f'''Your job is to provide a JSON formatted object of the possible topic words from the following text and their corresponding sentiment (0 - 5, from good to bad). Only return the JSON object and nothing else. 
            The JSON format of the object expected is as follows: {{"topics":"Array of relevant topics here, each topic is separated", "sentiment": "Integer between 0 - 5", "more_information":"Any other relevant information you can provide about this topic."}} 
            {self.data['content']}'''}
        ])

        response = chat_completion.choices[0].message.content
        data = self.parse_json_string(response)
        print(f"Processed {self.name}")
        print(f"GPT-3.5 response: {response}")
        if data:
            for topic in data.get('topics', []):
                self.add_child(Knowlet(f"{topic}"))
            # Compare against None so that a sentiment of 0 is still stored
            if data.get('sentiment') is not None:
                self.add_data(str(data['sentiment']), "sentiment")
            if data.get('more_information'):
                self.add_data(data['more_information'], "summary")

    def autonomous_split_par(self):
        """
        Perform autonomous splitting of the Knowlet using GPT-3.5 Turbo in parallel.

        Uses the OpenAI Chat Completion API to generate JSON-formatted topic words and sentiment information.

        Updates the Knowlet's children and data based on the generated information.
        """
        # GPT-3.5 Turbo configuration
        model = "gpt-3.5-turbo"

        # Define the function to be executed in parallel
        def process_knowlet(knowlet):
            chat_completion = openai.ChatCompletion.create(
                model=model,
                messages=[
                    {
                        "role": "user",
                        "content": f'''Your job is to provide a JSON formatted object of the possible topic words from the following text and their corresponding sentiment (0 - 5, from good to bad). Only return the JSON object and nothing else. 
                        The JSON format of the object expected is as follows: {{"topics":"Array of relevant topics here, each topic is separated", "sentiment": "Only integers (int) is allowed. between 0 - 5", "more_information":"Any other relevant information you can provide about this topic."}} 
                        {knowlet.data['content']}'''
                    }
                ]
            )

            response = chat_completion.choices[0].message.content
            data = knowlet.parse_json_string(response)
            print(f"Processed {knowlet.name}")
            if data:
                for topic in data.get('topics', []):
                    knowlet.add_child(Knowlet(f"{topic}"))
                # Compare against None so that a sentiment of 0 is still stored
                if data.get('sentiment') is not None:
                    knowlet.add_data(str(data['sentiment']), "sentiment")
                if data.get('more_information'):
                    knowlet.add_data(data['more_information'], "summary")

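        # Process each child Knowlet concurrently. Threads are used because
        # process_knowlet is a local function (not picklable for a process pool)
        # and has to modify the Knowlet objects that live in this process.
        from multiprocessing.pool import ThreadPool
        with ThreadPool() as pool:
            pool.map(process_knowlet, self.children)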

    def autonomous_split_non_gpt(self, num_topics):
        """
        Perform autonomous splitting of the Knowlet using non-GPT based topic modeling.

        Uses Latent Dirichlet Allocation (LDA) to identify topics and their associated topic words.

        Updates the Knowlet's children and data based on the identified topics.
        """
        # Perform topic modeling
        lda_model = self.perform_topic_modeling(max_topics=num_topics)

        # Get the most probable words for each topic in the LDA model
        topic_words = self._get_topic_words(lda_model, num_words=5)
        # For each identified topic, create a new child Knowlet holding its top words
        for idx, words in enumerate(topic_words):
            child_knowlet = Knowlet(f"Topic {idx+1}")
            child_knowlet.add_data(words, "topic_words")
            self.add_child(child_knowlet)

    def parse_json_string(self, json_string):
        """
        Parse a JSON-formatted string into a Python dictionary.

        Args:
            json_string (str): The JSON-formatted string to parse.

        Returns:
            dict: The parsed dictionary.
        """
        try:
            data = json.loads(json_string)

            # If the topics key is a string, split it by commas
            if isinstance(data.get('topics', None), str):
                data['topics'] = [topic.strip() for topic in data['topics'].split(',')]

            return data

        except json.JSONDecodeError:
            return {}


The code defines a class called `Knowlet` which represents a node in a tree data structure. Each instance of the class has a name, a dictionary of data, a list of child nodes, and optional properties such as a file and a registry.

The class has methods to add child nodes, add data to the dictionary, retrieve data by key, and search for nodes based on a query string.

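The `query` method splits the query string into words, returns the data stored under the first word that exactly matches a key in the node's data dictionary, and otherwise recurses into the children. The `complex_query` method does a substring match against the node's name, its data keys, and the string form of its data values, then recurses into the children; it returns the matching Knowlet, not the data. The file and registry properties are not searched.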
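
The difference between the two lookups is easiest to see side by side. This is a self-contained sketch that mirrors the logic of the two methods on plain dictionaries; the dict-based nodes and the sample text are stand-ins for illustration, not the Knowlet class itself.

```python
def query(node, query_string):
    """Return the data stored under the first query word that is an exact key."""
    for word in query_string.split():
        if word in node["data"]:
            return node["data"][word]
    for child in node["children"]:
        result = query(child, query_string)
        if result:
            return result
    return None


def complex_query(node, query_string):
    """Return the first node whose name, data keys, or data values contain the string."""
    if query_string in node["name"]:
        return node
    for key, value in node["data"].items():
        if query_string in key or query_string in str(value):
            return node
    for child in node["children"]:
        result = complex_query(child, query_string)
        if result:
            return result
    return None


leaf = {
    "name": "austen-emma.txt",
    "data": {"content": ["Emma Woodhouse, handsome, clever, and rich."]},
    "children": [],
}
root = {"name": "Corpus", "data": {}, "children": [leaf]}

print(query(root, "show content"))               # the word "content" is a data key
print(query(root, "Woodhouse"))                  # None: values are not searched
print(complex_query(root, "Woodhouse")["name"])  # substring found in the stored text
```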

The `add_file` and `get_file` methods set and retrieve the file property of the node. The `set_registry` and `get_registry` methods set and retrieve the registry property of the node.

The class as listed does not define a `__str__` method, so `str(node)`, which `_build_graph` uses for node tooltips, falls back to Python's default object representation.

In [522]:
import nltk
from nltk.corpus import gutenberg, nps_chat, movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the required corpora and lexicon
nltk.download('gutenberg')
nltk.download('vader_lexicon')
nltk.download('nps_chat')
nltk.download('movie_reviews')

# Create a parent Knowlet to hold all documents
corpus_knowlet = Knowlet('Corpus')

# Iterate over each file in the Gutenberg corpus
for fileid in gutenberg.fileids()[0:10]:
    # Get the raw text
    text = gutenberg.raw(fileid)

    # Create a new Knowlet for this document
    doc_knowlet = Knowlet(fileid)

    # Add the document content as data
    doc_knowlet.add_data(text[0:2000], 'content')

    # Add the document Knowlet as a child to the corpus Knowlet
    corpus_knowlet.add_child(doc_knowlet)

# Iterate over each file in the NPS Chat corpus
for fileid in nps_chat.fileids()[0:15]:
    # Get the raw text
    text = nps_chat.raw(fileid)

    # Create a new Knowlet for this document
    doc_knowlet = Knowlet(fileid)

    # Add the document content as data
    doc_knowlet.add_data(text[0:2000], 'content')

    # Add the document Knowlet as a child to the corpus Knowlet
    corpus_knowlet.add_child(doc_knowlet)

# Iterate over each file in the Movie Reviews corpus
for fileid in movie_reviews.fileids()[0:15]:
    # Get the raw text
    text = movie_reviews.raw(fileid)

    # Create a new Knowlet for this document
    doc_knowlet = Knowlet(fileid)

    # Add the document content as data
    doc_knowlet.add_data(text[0:2000], 'content')

    # Add the document Knowlet as a child to the corpus Knowlet
    corpus_knowlet.add_child(doc_knowlet)

# Now the corpus Knowlet holds all the documents as its children
# Each child Knowlet represents a document and holds its content as data;
# sentiment is added by autonomous_split below

# Perform autonomous splitting based on topics detected
for child in corpus_knowlet.children:
    child.autonomous_split()

# Visualize the structure of the knowlets
graph = corpus_knowlet.visualize()

# visualize() writes knowlet_structure.html. In a Jupyter notebook, calling
# show() as the last expression of the cell renders the graph inline.
graph.show("knowlet_structure.html")


[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/niksrid/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/niksrid/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package nps_chat to
[nltk_data]     /Users/niksrid/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/niksrid/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


Processed austen-emma.txt
GPT-3.5 response: {"topics": ["Emma", "Jane Austen", "Miss Taylor", "marriage", "grief", "friendship"], "sentiment": 3, "more_information": "The text describes the life of Emma Woodhouse and her relationships with her family and friends, including her grief over the marriage of her friend Miss Taylor."}
Processed austen-persuasion.txt
GPT-3.5 response: {"topics":"Persuasion, Jane Austen, Sir Walter Elliot, Kellynch Hall, Somersetshire, Baronetage, patents, Elizabeth, James Stevenson, South Park, Gloucester, Anne, Mary, Charles Musgrove, Uppercross, Cheshire, Dugdale, high sheriff, loyalty, baronet, Marys, Elizabeths, arms, motto, history, rise, ancient, respectable family", "sentiment": "3", "more_information":"This is an excerpt from Persuasion by Jane Austen, a classic novel. The sentiment is neutral as there is no clear positive or negative connotation in the text. The topics relate to the characters, setting, and history mentioned in the excerpt."}
Process

## Knowlet Class

### Methods

#### `__init__(self, name)`
- Description: Initializes a Knowlet object.
- Parameters:
  - `name` (str): The name of the Knowlet.

#### `add_child(self, child)`
- Description: Adds a child Knowlet to the current Knowlet.
- Parameters:
  - `child` (Knowlet): The child Knowlet to add.

#### `add_data(self, data, tag)`
- Description: Adds data to the Knowlet.
- Parameters:
  - `data` (str or list): The data to add. If it's a string, it will be preprocessed as sentences. If it's a list, each item will be preprocessed as sentences.
  - `tag` (str): The tag or key associated with the data.
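
Because each call appends a preprocessed entry, the value stored under a tag is a list of sentence lists. The sketch below reproduces that shape in isolation, under one assumption: `split_sentences` is a naive regex stand-in for `nltk.sent_tokenize`, used here only so the snippet runs without NLTK data.

```python
import re


def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def add_data(store, data, tag):
    """Append preprocessed data under `tag`, mirroring the shape used by Knowlet.add_data."""
    if isinstance(data, str):
        processed = split_sentences(data)
    elif isinstance(data, list):
        processed = [split_sentences(item) for item in data]
    else:
        raise ValueError("Invalid data type. Expected str or list.")
    store[tag] = store.get(tag, []) + [processed]


store = {}
add_data(store, "It rained. We stayed in.", "content")
add_data(store, "Then the sun came out!", "content")
print(store["content"])
```

Code that consumes the stored data therefore has to unpack at least two levels of nesting before it reaches plain strings.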

#### `preprocess(self, text)`
- Description: Preprocesses the given text by tokenizing it into sentences.
- Parameters:
  - `text` (str): The text to preprocess.
- Returns:
  - list: The preprocessed sentences.

#### `get_data(self, key)`
- Description: Retrieves the data associated with the given key.
- Parameters:
  - `key` (str): The key to retrieve the data for.
- Returns:
  - list or None: The data associated with the key, or None if the key is not found.

#### `query(self, query_string)`
- Description: Looks up each whitespace-separated word of the query string as an exact key in the Knowlet's data, then recurses into the children.
- Parameters:
  - `query_string` (str): The query string to search for.
- Returns:
  - list or None: The data matching the query, or None if no match is found.

#### `complex_query(self, query_string)`
- Description: Searches the Knowlet's name, data keys, and stringified data values for the query string as a substring, then recurses into the children.
- Parameters:
  - `query_string` (str): The query string to search for.
- Returns:
  - Knowlet or None: The Knowlet matching the query, or None if no match is found.

#### `add_file(self, file)`
- Description: Adds a file path to the current Knowlet.
- Parameters:
  - `file` (str): The path to the file associated with the Knowlet.

#### `get_file(self)`
- Description: Gets the path to the file associated with the Knowlet.
- Returns:
  - str or None: The path to the file, or None if no file is associated.

#### `set_registry(self, registry)`
- Description: Sets the registry (parent) of the Knowlet.
- Parameters:
  - `registry` (Knowlet or None): The parent Knowlet or None if the Knowlet is root.

#### `get_registry(self)`
- Description: Gets the registry (parent) of the Knowlet.
- Returns:
  - Knowlet or None: The parent Knowlet or None if the Knowlet is root.
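
Because each Knowlet keeps a reference to its parent, a node can walk up to the root. The sketch below assumes that `add_child` also sets the child's registry, which this section does not confirm. `RegistryNode` and `path_to_root` are hypothetical names.

```python
class RegistryNode:
    """Hypothetical node that tracks its parent through a registry attribute."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.registry = None  # None marks the root

    def add_child(self, child):
        self.children.append(child)
        child.set_registry(self)  # assumption: the parent link is set here

    def set_registry(self, registry):
        self.registry = registry

    def get_registry(self):
        return self.registry


def path_to_root(node):
    """Collect node names from the given node up to the root."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.get_registry()
    return names


corpus = RegistryNode("corpus")
book = RegistryNode("book")
chapter = RegistryNode("chapter")
corpus.add_child(book)
book.add_child(chapter)

print(path_to_root(chapter))
```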

#### `visualize(self)`
- Description: Visualizes the Knowlet and its children using a network graph.
- Returns:
  - Network: The Pyvis Network graph object.

#### `perform_topic_modeling(self, max_topics, min_topics=2)`
- Description: Performs topic modeling on the text data of the Knowlet and its children.
- Parameters:
  - `max_topics` (int): The maximum number of topics to consider.
  - `min_topics` (int): The minimum number of topics to consider. Default is 2.
- Returns:
  - LdaModel: The optimized Latent Dirichlet Allocation (LDA) model.
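
The `min_topics` and `max_topics` parameters suggest a search over candidate topic counts. A minimal sketch of such a search follows. The scoring criterion is left as a pluggable function because this section does not say which one the notebook uses, and treating `max_topics` as inclusive is also an assumption. `select_topic_count` and `synthetic_score` are hypothetical.

```python
def select_topic_count(min_topics, max_topics, score_fn):
    """Return (best_k, best_score) over k in [min_topics, max_topics]."""
    best_k, best_score = None, float("-inf")
    for k in range(min_topics, max_topics + 1):
        score = score_fn(k)  # e.g. train a model with k topics and score it
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


def synthetic_score(k):
    # Synthetic scorer for illustration only: peaks at k = 3
    return -(k - 3) ** 2


best_k, best_score = select_topic_count(2, 6, synthetic_score)
print(best_k, best_score)
```

In a real run, `score_fn` would train a model for each candidate `k` and return a quality measure such as a coherence score.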

#### `split_hierarchy(self)`
- Description: Splits the hierarchy of the Knowlet based on text similarity using TF-IDF vectorization, cosine similarity, and Agglomerative Clustering. Displays a dendrogram visualization of the clustering.
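
The first two stages of this pipeline, TF-IDF vectorization and cosine similarity, can be sketched with the standard library alone. This is an illustrative stand-in, not the notebook's implementation: the tokenizer and the smoothed IDF weighting are assumptions, and the clustering and dendrogram steps are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "the ship sailed across the sea",
    "a ship on the open sea",
    "the baronet read the family history",
]
vectors = tfidf_vectors(docs)
sim = [[cosine(a, b) for b in vectors] for a in vectors]

for row in sim:
    print([round(value, 3) for value in row])
```

The two ship-related sentences score as more similar to each other than either does to the third. A hierarchical clustering step would then merge the most similar documents first.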

#### `perform_sentiment_analysis(self)`
- Description: Performs sentiment analysis on the text data of the Knowlet and its children. Adds sentiment information to the data of each Knowlet.

#### `_build_graph(self, G, parent_name)`
- Description: Builds a network graph representation of the Knowlet and its children.
- Parameters:
  - `G` (Network): The Pyvis Network graph object.
  - `parent_name` (str): The name of the parent Knowlet.
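
The recursive traversal behind `_build_graph` can be sketched as follows. To keep the example runnable without Pyvis, `GraphRecorder` is a hypothetical stand-in that only records calls to `add_node` and `add_edge`. `GraphNode` is likewise hypothetical, and the traversal order is an assumption.

```python
class GraphRecorder:
    """Stand-in graph object that records nodes and edges in call order."""

    def __init__(self):
        self.nodes = []
        self.edges = []

    def add_node(self, name):
        if name not in self.nodes:
            self.nodes.append(name)

    def add_edge(self, source, target):
        self.edges.append((source, target))


class GraphNode:
    """Hypothetical tree node with a recursive graph builder."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def build_graph(self, G, parent_name=None):
        # Register this node, link it to its parent, then recurse
        G.add_node(self.name)
        if parent_name is not None:
            G.add_edge(parent_name, self.name)
        for child in self.children:
            child.build_graph(G, self.name)


corpus = GraphNode("corpus")
emma = GraphNode("emma")
emma.add_child(GraphNode("topic_0"))
corpus.add_child(emma)
corpus.add_child(GraphNode("persuasion"))

G = GraphRecorder()
corpus.build_graph(G)
print(G.nodes)
print(G.edges)
```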

#### `autonomous_split(self)`
- Description: Performs autonomous splitting of the Knowlet using GPT-3.5 Turbo. Uses the OpenAI Chat Completion API to generate JSON-formatted topic words and sentiment information. Updates the Knowlet's children and data based on the generated information.

#### `autonomous_split_par(self)`
- Description: Performs autonomous splitting of the Knowlet using GPT-3.5 Turbo in parallel. Uses the OpenAI Chat Completion API to generate JSON-formatted topic words and sentiment information. Updates the Knowlet's children and data based on the generated information.
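
One way to run the per-child requests concurrently is a thread pool, sketched below. This section does not say how the notebook parallelizes the work, so the use of `ThreadPoolExecutor` is an assumption. `fake_model_call` is a hypothetical placeholder for the API request, and no network call is made.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_model_call(text):
    """Hypothetical placeholder for the chat completion request."""
    return {"topics": text.split()[:2], "sentiment": "3"}


def split_in_parallel(texts_by_name, call=fake_model_call, max_workers=4):
    """Apply `call` to every text concurrently and return {name: result}."""
    names = list(texts_by_name)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names
        results = list(pool.map(call, (texts_by_name[n] for n in names)))
    return dict(zip(names, results))


texts = {
    "austen-emma.txt": "Emma Woodhouse handsome clever and rich",
    "melville-moby_dick.txt": "Call me Ishmael",
}
results = split_in_parallel(texts)
print(results)
```

Threads suit this kind of workload because each worker spends most of its time waiting on a network response rather than computing.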

#### `autonomous_split_non_gpt(self, num_topics)`
- Description: Performs autonomous splitting of the Knowlet using non-GPT based topic modeling. Uses Latent Dirichlet Allocation (LDA) to identify topics and their associated topic words. Updates the Knowlet's children and data based on the identified topics.

#### `parse_json_string(self, json_string)`
- Description: Parses a JSON-formatted string into a Python dictionary.
- Parameters:
  - `json_string` (str): The JSON-formatted string to parse.
- Returns:
  - dict: The parsed dictionary.
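
Model replies sometimes wrap the JSON object in extra text, so a parser may need a fallback. The sketch below tries `json.loads` first and then retries on the outermost `{...}` span. This fallback is an assumption about what a robust parser could do, not a description of the notebook's method.

```python
import json


def parse_json_string(json_string):
    """Parse a JSON object, tolerating text before or after it."""
    try:
        return json.loads(json_string)
    except json.JSONDecodeError:
        start = json_string.find("{")
        end = json_string.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like a JSON object
        return json.loads(json_string[start:end + 1])


reply = 'Result: {"topics": "Persuasion, Jane Austen", "sentiment": "3"} Done.'
parsed = parse_json_string(reply)

# The sample response stores topics as one comma-separated string
topics = [t.strip() for t in parsed["topics"].split(",")]
print(topics, parsed["sentiment"])
```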

### Attributes

#### `name`
- Description: The name of the Knowlet.

#### `children`
- Description: The list of child Knowlets.

#### `data`
- Description: The data associated with the Knowlet.

#### `registry`
- Description: The parent Knowlet.

#### `file`
- Description: The file path associated with the Knowlet.

#### `stop_words`
- Description: The set of stopwords for text preprocessing.

#### `lemmatizer`
- Description: The WordNet Lemmatizer for text preprocessing.

#### `sia`
- Description: The SentimentIntensityAnalyzer for sentiment analysis.

## Other Functions

### `perform_topic_modeling(text_data, max_topics, min_topics=2)`
- Description: Performs topic modeling on the given text data.
- Parameters:
  - `text_data` (list): The text data to perform topic modeling on.
  - `max_topics` (int): The maximum number of topics to consider.
  - `min_topics` (int): The minimum number of topics to consider. Default is 2.
- Returns:
  - LdaModel: The optimized Latent Dirichlet Allocation (LDA) model.

### `split_hierarchy(text_data)`
- Description: Splits the text data hierarchy based on text similarity using TF-IDF vectorization, cosine similarity, and Agglomerative Clustering. Displays a dendrogram visualization of the clustering.
- Parameters:
  - `text_data` (list): The text data to split.
- Returns:
  - None

### `perform_sentiment_analysis(text_data)`
- Description: Performs sentiment analysis on the given text data.
- Parameters:
  - `text_data` (list): The text data to perform sentiment analysis on.
- Returns:
  - None

### `build_graph(G, parent_name, knowlet)`
- Description: Builds a network graph representation of the given Knowlet and its children.
- Parameters:
  - `G` (Network): The Pyvis Network graph object.
  - `parent_name` (str): The name of the parent Knowlet.
  - `knowlet` (Knowlet): The Knowlet object to build the graph for.
- Returns:
  - None

### `autonomous_split(knowlet, prompt)`
- Description: Performs autonomous splitting of the given Knowlet using GPT-3.5 Turbo. Uses the OpenAI Chat Completion API to generate JSON-formatted topic words and sentiment information. Updates the Knowlet's children and data based on the generated information.
- Parameters:
  - `knowlet` (Knowlet): The Knowlet object to perform autonomous splitting on.
  - `prompt` (str): The prompt to generate the chat completion.
- Returns:
  - None

### `autonomous_split_non_gpt(knowlet, num_topics)`
- Description: Performs autonomous splitting of the given Knowlet using non-GPT based topic modeling. Uses Latent Dirichlet Allocation (LDA) to identify topics and their associated topic words. Updates the Knowlet's children and data based on the identified topics.
- Parameters:
  - `knowlet` (Knowlet): The Knowlet object to perform autonomous splitting on.
  - `num_topics` (int): The number of topics to identify.
- Returns:
  - None

This documentation provides detailed explanations of the methods and functions implemented in the code, including their parameters, return values, and functionality.


In [358]:
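# Custom function to show Pyvis graphs in Jupyter notebooks
import tempfile

from IPython.display import HTML, display


def pyvis_deepnote_show(nt):
    """Save a Pyvis Network to a temporary HTML file and render it inline."""
    # delete=False keeps the file on disk so Pyvis can write to its path
    with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as tmp:
        tmp_output_filename = tmp.name
    nt.save_graph(tmp_output_filename)

    # Read the generated HTML back and display it in the cell output
    with open(tmp_output_filename, "r") as f:
        display(HTML(f.read()))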

In [None]:
# Experimenting with splitting without GPT

import nltk
from nltk.corpus import gutenberg
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the Gutenberg corpus and the VADER sentiment lexicon
nltk.download('gutenberg')
nltk.download('vader_lexicon')

# Create a parent Knowlet to hold all documents
k = Knowlet('Gutenberg Corpus')

# Iterate over each file in the Gutenberg corpus
for fileid in gutenberg.fileids()[0:20]:
    # Get the raw text
    text = gutenberg.raw(fileid)

    # Create a new Knowlet for this document
    doc_knowlet = Knowlet(fileid)

    # Add the document content as data
    doc_knowlet.add_data(text[0:2000], 'content')

    # Add the document Knowlet as a child to the corpus Knowlet
    k.add_child(doc_knowlet)

# Split each document Knowlet into 3 topics using the non-GPT (LDA) method
for child in k.children:
    child.autonomous_split_non_gpt(3)

# Visualize the structure of the knowlets (the parent Knowlet is `k`)
graph = k.visualize()

# If you're running this in a Jupyter notebook, you can show the graph using:
# graph.show("Your_browser.html")
pyvis_deepnote_show(graph)


## Conclusion

This notebook implements Knowlets as a tree of knowledge nodes. Each Knowlet stores tagged text data, keeps references to its parent and children, and supports queries, Pyvis network visualization, topic modeling, sentiment analysis, and splitting into child Knowlets with either GPT-3.5 Turbo or LDA.
