This notebook uses LLMs to extract insights from saved notes (represented here by URLs) and groups them into insight spaces of related topics with semantic clustering.

In [2]:
%%capture
!pip install llama-index-core
!pip install llama-index-embeddings-huggingface

In [3]:
%%capture
!pip install openai
!pip install llama-index-llms-together
!pip install llama-index-llms-openai
!pip install llama-index-llms-groq

In [4]:
%%capture
!pip install unstructured
!pip install tldextract
!pip install kneed

We define the LLM to be used for querying.

In [49]:
# Using TogetherLLM
from llama_index.llms.together import TogetherLLM
llm = TogetherLLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", api_key="your_api_key")

# # Using Groq
# from llama_index.llms.groq import Groq
# llm = Groq(model="mixtral-8x7b-32768", api_key="your_api_key")

# # Using OpenAI
# import os
# from llama_index.llms.openai import OpenAI as llama_openai
# import openai
# openai.api_key = "your_api_key"
# os.environ['openai_key'] = openai.api_key
# llm = llama_openai(model="gpt-3.5-turbo", api_key=os.environ.get("openai_key"))
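
Hardcoding `api_key="your_api_key"` works for a demo, but a key is easier to keep out of a shared notebook when it is read from the environment. A minimal sketch, assuming the key is stored in an environment variable; the variable name `TOGETHER_API_KEY` and the helper `read_api_key` are hypothetical and not part of llama-index.

```python
import os

def read_api_key(env_var: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the LLM.")
    return key

# Hypothetical usage with the cell above:
# llm = TogetherLLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", api_key=read_api_key())
```

Failing early with an explicit message is usually easier to debug than an authentication error raised on the first `llm.complete` call.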

We load the bookmarks saved from the user's Reddit account (exported by Extracting_bookmarks_from_Reddit.py) and store each one in a llama_index Document.

In [14]:
import pandas as pd
from llama_index.core import Document

saved_reddit_urls = pd.read_csv('reddit_saved.csv')
documents = []
for index, row in saved_reddit_urls.iterrows():
    doc = Document(text=row['Content'])
    doc.metadata['source'] = row['URL']
    documents.append(doc)
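
The loop above expects `reddit_saved.csv` to contain a `Content` column and a `URL` column. A minimal sketch of the same row-to-record mapping using only the standard library, with an in-memory CSV standing in for the real file. The sample rows and the `rows_to_records` helper are made up for illustration, and plain dicts stand in for llama_index `Document` objects.

```python
import csv
import io

def rows_to_records(csv_file):
    """Map each CSV row to a dict with the text and its source URL, skipping rows without content."""
    records = []
    for row in csv.DictReader(csv_file):
        content = (row.get("Content") or "").strip()
        if not content:
            continue  # nothing to summarize for this bookmark
        records.append({"text": content, "metadata": {"source": row.get("URL", "")}})
    return records

# Made-up sample data in the assumed column layout.
sample = io.StringIO("URL,Content\nhttps://example.com/a,First note\nhttps://example.com/b,\n")
records = rows_to_records(sample)
print(records)
```

Skipping empty rows may be worth considering in the pandas version too, since pandas reads an empty cell as `NaN` rather than as an empty string.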

We load additional notes with UnstructuredURLLoader so that the clustering step has more notes to work with.

In [15]:
from llama_index.core import download_loader

UnstructuredURLLoader = download_loader("UnstructuredURLLoader")


In [16]:
more_note_urls = ["https://www.techtarget.com/whatis/definition/large-language-model-LLM",
        "https://www.techtarget.com/searchenterpriseai/tip/Top-generative-AI-tool-categories",
        "https://www.techtarget.com/whatis/feature/Top-AI-jobs",
        "https://www.washingtonpost.com/opinions/2024/02/23/myanman-junta-weakening-collapse/",
        "https://edition.cnn.com/2024/01/02/economy/interest-rate-cuts-inflation-fed-2024/index.html",
        "https://explodingtopics.com/blog/economic-trends",
        "https://www.deeplearning.ai/the-batch/issue-231/",
        "https://www.nytimes.com/2023/12/08/briefing/ai-dominance.html?auth=login-google1tap&login=google1tap",
        "https://www.scientificamerican.com/article/what-apples-new-vision-pro-headset-might-do-to-our-brain/#:~:text=Apple's%20ads%20have%20shown%20people,hours%A%20end%E2%80%94and%20even",
        "https://www.redpoints.com/blog/ai-copyright-infringement/"]

loader = UnstructuredURLLoader(urls=more_note_urls, continue_on_failure=False, headers={"User-Agent": "value"})
more_documents = loader.load_data()

In [17]:
documents += more_documents

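We use the LLM to summarize each text into key insights and to extract metadata: a title, a category, and a one-sentence main idea. The source domain and the processing date are added to the metadata as well.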

In [37]:
import tldextract
from datetime import datetime

def text_to_insights(documents, llm=llm):

  categories = ["Artificial Intelligence", "Web Development", "Robotics", "Science", "Medicine", "Business", "Politics", "Entertainment", "Sports", "Mathematics"]

  for document in documents:

    title = llm.complete(f"Provide the title of this document.\n###\n{document.text}\n\
                          ###\nThe answer must be in this format: Title: some_title.\nGive a DIRECT answer.").text.strip()
    category = llm.complete(f"From the categories in this list: {categories}, provide the one that is best related to this document. The category must be from the ones listed. \n###\n{document.text}\n###\n\
                          \nThe answer must be in this format: Category: some_category.\nGive a DIRECT answer.").text.strip()
    topic = llm.complete(f"Provide the main idea of this document in one concise sentence. Give a DIRECT and SHORT answer. \n###\n{document.text}\n###\n\
                          \nThe answer must be in this format: Main idea: some_main_idea.").text.strip()

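    # Keep everything after the first "Label: " prefix; fall back to the raw answer if the LLM omits it.
    document.metadata['title'] = title.split(": ", 1)[-1]
    document.metadata['category'] = category.split(": ", 1)[-1]
    document.metadata['topic'] = topic.split(": ", 1)[-1]
    if document.metadata.get('source'):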
      document.metadata['URL'] = document.metadata['source']
      source = tldextract.extract(document.metadata['URL']).domain
      document.metadata['source'] = source
    document.metadata['date'] = datetime.today().strftime('%Y-%m-%d')
    query = f"Give me the summary of this document as the most relevant bullet points: \n \
          ###\n \
          {document.text}\n \
          ### \n \
          Be concise, avoid redundant ideas and provide the minimum number of bullet points without missing an important point."
    resp = llm.complete(query)
    document.text = resp.text.strip()
  return documents
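
The function relies on the LLM answering in a `Label: value` format, which models do not always follow. A minimal sketch of a tolerant parser that keeps colons inside the value and falls back to the full answer when the label is missing. The helper name `parse_labeled_answer` is hypothetical.

```python
def parse_labeled_answer(answer: str, label: str) -> str:
    """Strip a leading 'Label:' prefix (case-insensitive) from an LLM answer, if present."""
    text = answer.strip()
    prefix = f"{label}:"
    if text.lower().startswith(prefix.lower()):
        text = text[len(prefix):]
    return text.strip()

print(parse_labeled_answer("Title: GroqChip: fast inference", "Title"))
print(parse_labeled_answer("Artificial Intelligence", "Category"))
```

The first call keeps the colon inside the title, and the second returns the answer unchanged because the model left out the `Category:` prefix.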

In [38]:
documents = text_to_insights(documents, llm)
print(documents[1].text)
print(documents[1].metadata)

- 2024 marks the year for GroqChip's potential dominance among AI startups due to its affordability and speed.
- GroqChip's US manufacturing differentiates it from competitors, avoiding regulatory uncertainties tied to chips from China.
- GroqChip outperforms industry competitors in throughput (speed) versus price, being 18 times faster for LLM inference performance.
- With on-chip memory, GroqChip is quicker and has lower manufacturing costs than competitors using off-chip memory.
- Designed for open-source LLMs like Nistral, allowing scalability as models grow more powerful.
- Target market: small and middle-sized LLM startups, focusing on chatbots and customer service applications for its super fast latency.
- Groq aims to produce 1 million chips by the end of 2024 to address demand.
- Main concern: GroqChip may require revolutionary design changes to handle 10 trillion parameter models in the future.
- Additional points:
  - Potential for image-based models.
  - Investment opportun

A helper to filter documents by category, by earliest date (in YYYY-MM-DD format), or by both.

In [39]:
def filter_documents(documents, category=False, date=False):
  if category and date:
    date_as_datetime = datetime.strptime(date, '%Y-%m-%d')
    return [doc for doc in documents if (datetime.strptime(doc.metadata['date'], '%Y-%m-%d') >= date_as_datetime) and (doc.metadata['category'] == category)]
  elif category:
    return [doc for doc in documents if (doc.metadata['category'] == category)]
  elif date:
    date_as_datetime = datetime.strptime(date, '%Y-%m-%d')
    return [doc for doc in documents if (datetime.strptime(doc.metadata['date'], '%Y-%m-%d') >= date_as_datetime)]
  else:
    return documents
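
Dates are stored in the metadata as `YYYY-MM-DD` strings, so they have to be parsed before they can be compared against a `datetime`. A compact sketch of the same filtering logic under that assumption. The `filter_by` helper and the sample documents are hypothetical, and `SimpleNamespace` objects stand in for llama_index documents.

```python
from datetime import datetime
from types import SimpleNamespace

def filter_by(docs, category=None, since=None):
    """Keep docs matching `category` (if given) and dated on or after `since` (YYYY-MM-DD, if given)."""
    since_dt = datetime.strptime(since, "%Y-%m-%d") if since else None
    kept = []
    for doc in docs:
        if category and doc.metadata["category"] != category:
            continue
        if since_dt and datetime.strptime(doc.metadata["date"], "%Y-%m-%d") < since_dt:
            continue
        kept.append(doc)
    return kept

# Made-up documents for illustration.
docs = [
    SimpleNamespace(metadata={"category": "Politics", "date": "2024-02-23"}),
    SimpleNamespace(metadata={"category": "Business", "date": "2024-01-02"}),
    SimpleNamespace(metadata={"category": "Business", "date": "2024-03-01"}),
]
recent_business = filter_by(docs, category="Business", since="2024-02-01")
print([d.metadata["date"] for d in recent_business])  # ['2024-03-01']
```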

A function that synthesizes the per-document insights of a set of documents into global insights.

In [40]:
def extract_global_insights(documents, llm=llm):
  all_insights = "\n".join([doc.text for doc in documents])
  query = f"Synthesize this document and provide the most important ideas as bullet points: \n \
          ###\n \
          {all_insights}\n \
          ### \n \
          Be concise, avoid redundant ideas and provide the minimum number of bullet points without missing an important point."
  resp = llm.complete(query)
  return resp.text.strip()

In [51]:
print(extract_global_insights(documents[:4]))

- AI can be beneficial for school work by confirming work, helping with brainstorming and customizing lessons, but it's important to verify the accuracy of AI-generated information and avoid plagiarism.
- Experienced users recommend using AI for generating papers, summarizing after reading the material, and editing work.
- AI should be used as an aid to augment abilities, not as a replacement for critical thinking and problem-solving skills.
- GroqChip, an AI startup, is expected to dominate the market in 2024 due to its affordability, speed, and US manufacturing.
- GroqChip outperforms industry competitors in throughput versus price and is designed for open-source LLMs, allowing scalability as models grow more powerful.
- Groq's target market is small and middle-sized LLM startups, focusing on chatbots and customer service applications.
- Groq aims to produce 1 million chips by the end of 2024 to address demand, but may require revolutionary design changes to handle 10 trillion parame

**Clustering**

In [41]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
def embed_doc_text(document, embedding_model=embedding_model):
  # Embed the one-sentence main idea ('topic') extracted by the LLM, not the full text.
  embedding = embedding_model.get_text_embedding(document.metadata['topic'])
  return embedding

In [42]:
import numpy as np
documents_embeddings = []
document_to_embedding = {}
embedding_to_document = {}
for idx, document in enumerate(documents):
  documents_embeddings.append(embed_doc_text(document))
  document_to_embedding[document.id_] = idx
  embedding_to_document[idx] = document.id_

documents_embeddings = np.array(documents_embeddings)

Semantic KMeans, a K-Means variant that assigns each vector to the centroid with the highest cosine similarity. Adapted from https://github.com/avanwyk/semantic-document-clustering/blob/master/clustering.py


In [43]:
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def SemanticKMeans_clustering(embeddings: np.ndarray, centroids: int) -> tuple[np.ndarray, float]:
    kmeans = SemanticKMeans(centroids)
    predictions = kmeans.fit_predict(embeddings).reshape(-1, 1)
    inertia = kmeans.inertia_
    return predictions, inertia

class SemanticKMeans:
    def __init__(self, centroids):
        self.num_centroids = centroids
        self.inertia_ = None

    def fit(self, vectors: np.ndarray, max_iterations: int = 200, random_state: int = 2024) -> 'SemanticKMeans':
        np.random.seed(random_state)
        centroids = np.random.uniform(-1., 1.,
                                      (self.num_centroids, vectors.shape[1]))

        for _ in range(max_iterations):
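            # These are cosine similarities, not distances: higher means closer.
            similarities = cosine_similarity(vectors, centroids).clip(-1., 1.)
            prev_centroids = np.copy(centroids)
            # Assign each vector to its most similar centroid once per iteration.
            assignments = np.argmax(similarities, axis=1)
            for c in range(self.num_centroids):
                members = vectors[assignments == c]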
                if len(members) > 0:
                    centroids[c] = np.mean(members, axis=0)

            if np.allclose(centroids, prev_centroids):
                break
        self.centroids = centroids
        self.inertia_ = self.calculate_inertia(vectors, centroids)
        return self

    def centroids_(self):
        return self.centroids

    def calculate_inertia(self, vectors: np.ndarray, centroids: np.ndarray) -> float:
        similarities = cosine_similarity(vectors, centroids)
        labels = np.argmax(similarities, axis=1)
        inertia = 0.0
        for c in range(self.num_centroids):
            members = vectors[labels == c]
            if len(members) > 0:
                inertia += np.sum((members - centroids[c])**2)
        return inertia

    def predict(self, vectors: np.ndarray) -> np.ndarray:
        similarities = cosine_similarity(vectors, self.centroids)
        return np.argmax(similarities, axis=1)

    def fit_predict(self, vectors: np.ndarray, max_iterations: int = 100) -> np.ndarray:
        return self.fit(vectors, max_iterations).predict(vectors)
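
The main difference from standard K-Means is the assignment step: each vector goes to the centroid with the highest cosine similarity rather than the smallest Euclidean distance, so direction matters and magnitude does not. A toy sketch of that step in plain Python. The 2-D vectors and the `cosine` and `assign` helpers are illustrative only and not part of the class above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign(vectors, centroids):
    """Return, for each vector, the index of the centroid with the highest cosine similarity."""
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c])) for v in vectors]

# [1.0, 0.1] and [10.0, 1.0] point the same way but differ in length by a factor of 10.
vectors = [[1.0, 0.1], [10.0, 1.0], [0.1, 1.0], [0.2, 3.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = assign(vectors, centroids)
print(labels)  # [0, 0, 1, 1]
```

The first two vectors land in the same cluster despite their very different lengths, which is the behavior one typically wants for text embeddings.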

In [44]:
from kneed import KneeLocator

def cluster_documents(X):
  inertia = []
  k_range = range(1, min(20, X.shape[0]))
  for k in k_range:
      labels, inertia_ = SemanticKMeans_clustering(X, k)
      inertia.append(inertia_)

  # We use KneeLocator from the kneed library to perform the elbow method without human intervention.
  knee = KneeLocator(list(k_range), inertia, curve='convex', direction='decreasing')
  optimal_k = knee.knee
  n_clusters = 2 # Default value if the KneeLocator does not find the elbow.
  if optimal_k:
    n_clusters = optimal_k
  labels, _ = SemanticKMeans_clustering(X, n_clusters)
  return labels
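
The elbow method looks for the number of clusters after which adding more clusters stops reducing the inertia much. As a rough illustration of the idea, the sketch below picks the point that dips farthest below the straight line joining the first and last points of the curve. This is a simplified heuristic written for this example, not `kneed`'s implementation, and `find_elbow` and the inertia values are hypothetical.

```python
def find_elbow(ks, inertias):
    """Return the k whose point lies farthest below the chord joining the first and last points."""
    if len(ks) < 3:
        return None  # not enough points to form an elbow
    x0, x1 = ks[0], ks[-1]
    y0, y1 = inertias[0], inertias[-1]
    if y0 == y1:
        return None  # flat curve, no elbow
    best_k, best_gap = None, 0.0
    for k, y in zip(ks, inertias):
        # Normalize both axes to [0, 1] so the gap does not depend on the scale of the inertia.
        x_norm = (k - x0) / (x1 - x0)
        y_norm = (y - y1) / (y0 - y1)
        # For a decreasing curve the chord is y = 1 - x; measure how far the curve dips below it.
        gap = (1.0 - x_norm) - y_norm
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

ks = [1, 2, 3, 4, 5, 6]
inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # made-up values
print(find_elbow(ks, inertias))  # 3
```

Like the cell above, the sketch returns `None` when no elbow is found, so the caller can fall back to a default number of clusters.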

In [45]:
labels = cluster_documents(documents_embeddings)
labels

array([[2],
       [3],
       [2],
       [2],
       [1],
       [3],
       [3],
       [2],
       [0],
       [3],
       [3],
       [3],
       [2],
       [3],
       [3]])

We group the documents by cluster label, then ask the LLM to generate a unifying topic for each insight space.

In [46]:
unique_labels = np.unique(labels)
insight_spaces = {}
for label in unique_labels:
  embedding_indexes = np.where((labels == label))[0]
  documents_ids = [embedding_to_document[idx] for idx in embedding_indexes]
  insight_spaces[label] = {'documents_ids': documents_ids, 'space_topic': ""}
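
Because `labels` has shape `(n, 1)`, each row holds a single cluster id, and the row index maps back to a document through `embedding_to_document`. The same grouping can be sketched in plain Python as a single pass over the rows. The `group_by_label` helper and the sample ids are hypothetical, and a nested list stands in for the NumPy array.

```python
from collections import defaultdict

def group_by_label(labels, index_to_id):
    """Group document ids by cluster label; `labels` is a list of one-element rows."""
    spaces = defaultdict(lambda: {"documents_ids": [], "space_topic": ""})
    for idx, row in enumerate(labels):
        spaces[int(row[0])]["documents_ids"].append(index_to_id[idx])
    return dict(spaces)

# Made-up labels and ids for illustration.
labels = [[2], [3], [2], [0]]
index_to_id = {0: "doc-a", 1: "doc-b", 2: "doc-c", 3: "doc-d"}
spaces = group_by_label(labels, index_to_id)
print(spaces[2]["documents_ids"])  # ['doc-a', 'doc-c']
```

Casting each label with `int` keeps the dictionary keys as plain Python integers, which can make the result easier to serialize.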


In [47]:
for space_id in insight_spaces.keys():
  docs_topics = [doc.metadata['topic'] for doc in documents if (doc.id_ in insight_spaces[space_id]['documents_ids'])]
  query = f"Given this list of topics:\n###\n{docs_topics}\n###\nProvide a unifying topic in 15 words or less."
  resp = llm.complete(query).text.strip()
  insight_spaces[space_id]['space_topic'] = resp

In [48]:
insight_spaces

{0: {'documents_ids': ['45f828d4-9033-4f50-aaa3-862e308c8459'],
  'space_topic': '"Exploring the Crisis in Myanmar: Military Junta, Insurgents, and International Response."'},
 1: {'documents_ids': ['10bcd969-53c3-4664-838a-c4a5fa2bc7d5'],
  'space_topic': '"Elon Musk\'s Involvement and Concerns over AI Development and Safety."'},
 2: {'documents_ids': ['8d9f757a-daf3-4e02-a5a1-5169ce96a67f',
   'fb2bbd36-3df6-4dc7-bdaa-4750ab3e1919',
   '96692932-86ae-4765-9d2a-18761f814d90',
   '67454731-5f48-4e4d-b333-7f7e8fbd42b2',
   'ab1fc6be-f143-4246-a3d5-f36197bc3ce3'],
  'space_topic': 'Exploring the impact and ethical considerations of AI in various industries.'},
 3: {'documents_ids': ['a3da183b-8d9b-4163-a974-f50b09b0a009',
   '1e1b9f9f-7c21-412f-a643-654977134421',
   'bb630c8f-73a1-4c90-8573-61071f926cf1',
   '0cb5edba-8d3d-4004-8ce5-5f3fa284db75',
   '9c9bc931-7df7-4b7e-9754-f18f9b7d2417',
   '06b5613f-426d-4362-81a8-708cde6c0b13',
   '92057d9f-ccfa-4e83-ac60-714099cc87ca',
   '2d56adfc