
In [None]:
# =============================
#   Install Dependencies
# =============================

# 1) Install key libraries
!pip install spacy sentence_transformers networkx

# 2) Download and link spaCy's small English model
!python -m spacy download en_core_web_sm

# =============================
#   Import Libraries
# =============================
import spacy
from sentence_transformers import SentenceTransformer
import networkx as nx
import numpy as np

# =============================
#   Load Models
# =============================
# Load the spaCy English model for sentence segmentation and POS tagging
nlp = spacy.load("en_core_web_sm")

# Load a SentenceTransformer model for creating sentence embeddings
# (all-MiniLM-L6-v2 is a stand-in here; it is not the actual SONAR encoder)
model = SentenceTransformer('all-MiniLM-L6-v2')

# =============================
#   Example Paragraph
# =============================
paragraph = """
AI is transforming industries. Machines are learning human languages
because automation helps free humans from repetitive tasks, so employees
can focus on more innovative work. This development will likely create
new opportunities in tech and data science.
"""

# =============================
#   1) Sentence Segmentation
# =============================
doc = nlp(paragraph.strip())
sentences = [sent.text for sent in doc.sents if sent.text.strip()]

print("\n--- Step 1: Sentence Segmentation ---")
for i, s in enumerate(sentences, start=1):
    print(f"Sentence {i}: {s}")

# =============================
#   2) SONAR Embedding (Concept Representation)
# =============================
# We'll embed each sentence to capture overall meaning
sentence_embeddings = model.encode(sentences)

print("\n--- Step 2: SONAR Embeddings Shape ---")
print("Each sentence is turned into a vector of dimension:", sentence_embeddings.shape[1])

# =============================
#   3) Diffusion Process (Contextual / Relation Graph)
# =============================
# Create a simple graph of sentence relationships if they share common nouns
G = nx.Graph()

for i, s in enumerate(sentences):
    G.add_node(i, text=s)

# A trivial example: connect two sentences if they share any noun lemma.
# Noun sets are computed once per sentence instead of once per sentence pair.
noun_sets = [
    {token.lemma_.lower() for token in sent_doc if token.pos_ == "NOUN"}
    for sent_doc in nlp.pipe(sentences)
]

for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        # If the two sentences share any noun, create an edge
        if noun_sets[i] & noun_sets[j]:
            G.add_edge(i, j)

print("\n--- Step 3: Diffusion Graph Edges (Sentence Connections) ---")
for edge in G.edges:
    print(f"Edge between sentence {edge[0]+1} and {edge[1]+1}")

# =============================
#   4) Advanced Patterning (Cause-Effect, etc.)
# =============================
# A rudimentary approach: treat the word 'because' as a cause indicator
# (no other connectives are checked here)
cause_effect_pairs = []
for i, s in enumerate(sentences):
    if "because" in s.lower():
        cause_effect_pairs.append((i, s))

print("\n--- Step 4: Cause-Effect Pattern Detection ---")
if cause_effect_pairs:
    for idx, text_cause in cause_effect_pairs:
        print(f"Sentence {idx+1} has a cause-effect indicator: '{text_cause.strip()}'")
else:
    print("No direct cause-effect indicators found in this paragraph.")
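
# --- Optional sketch (not part of the original pipeline) ---
# The check above matches only the substring "because". A slightly broader
# variant might use a small hand-picked marker list with word-boundary
# matching, so that "so" does not fire inside "also". CAUSAL_MARKERS and
# find_causal_markers are hypothetical names, and the list is an assumption:
# words such as "since" can also be temporal, so hits are only candidates.
import re

CAUSAL_MARKERS = ("because", "since", "therefore", "so", "as a result")  # assumed list

def find_causal_markers(sentence, markers=CAUSAL_MARKERS):
    """Return the markers that occur in `sentence` as whole words or phrases."""
    lowered = sentence.lower()
    return [m for m in markers if re.search(rf"\b{re.escape(m)}\b", lowered)]

# Example: find_causal_markers(sentences[1]) would report "because" and "so".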

# =============================
#   5) Hidden Process (Pseudo Memory)
# =============================
# Here we simulate storing a "memory" of the paragraph under a key.
memory_store = {}
memory_key = "paragraph_context"
memory_store[memory_key] = paragraph.strip()

# In a real system, this memory could be used to provide context to future queries.
print("\n--- Step 5: Hidden Process (Memory) ---")
print(f"Stored paragraph in memory_store under key '{memory_key}'.")
print("Memory content:\n", memory_store[memory_key])

# =============================
#   6) Quantization (Efficient Embedding Storage)
# =============================
# Convert float embeddings to 8-bit integers to save space (example of quantization).
# Map values from [-1, 1] to [0, 255]; this assumes every component lies in that
# range, which holds for unit-normalized embeddings. Values outside it are clipped.
scaled = 127.5 * (np.asarray(sentence_embeddings) + 1.0)
# Round to the nearest integer before casting; a bare astype would truncate.
quantized_embeddings = np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

print("\n--- Step 6: Quantization ---")
print(f"Original embeddings shape: {sentence_embeddings.shape}")
print(f"Quantized embeddings shape: {quantized_embeddings.shape}")
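
# --- Optional sketch (not part of the original pipeline) ---
# To check how much information the 8-bit codes lose, the mapping can be
# inverted. dequantize_uint8 is a hypothetical helper that assumes the same
# [-1, 1] -> [0, 255] scaling used above.
import numpy as np  # already imported above; repeated so the sketch is self-contained

def dequantize_uint8(codes):
    """Map uint8 codes back to approximate float32 values in [-1, 1]."""
    return np.asarray(codes, dtype=np.float32) / 127.5 - 1.0

# Example: np.abs(dequantize_uint8(quantized_embeddings) - sentence_embeddings).max()
# should be at most about one quantization step (1 / 127.5) when all inputs
# lie in [-1, 1].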

# =============================
#   7) Output Generation
# =============================
# A simple example: the "LCM" output combines
# (a) sentences, (b) cause/effect flags, (c) graph edges, (d) memory reference,
# (e) a preview of the quantized embeddings
print("\n--- Step 7: Final Structured LCM Output ---")
final_structured_output = {
    "segmented_sentences": sentences,
    "cause_effect_sentences": [sent for (_, sent) in cause_effect_pairs],
    "graph_edges": list(G.edges),
    "memory_reference": memory_store.get(memory_key, ""),
    "quantized_embeddings_preview": quantized_embeddings[0][:10].tolist()  # first 10 dims of first sentence
}

print(final_structured_output)
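
The noun-overlap rule in Step 3 produces no edges for the example paragraph. One possible alternative, sketched here under assumptions, connects sentences whose embedding vectors have a cosine similarity above a threshold. The helper name `similarity_edges`, the default threshold of 0.3, and the toy vectors are illustrative choices and do not come from the notebook.

```python
import numpy as np

def similarity_edges(embeddings, threshold=0.3):
    """Return (i, j, similarity) for sentence pairs with cosine similarity >= threshold.

    `threshold` is an illustrative default, not a tuned value.
    """
    vectors = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T                          # pairwise cosine similarities
    n = len(unit)
    return [
        (i, j, float(sims[i, j]))
        for i in range(n)
        for j in range(i + 1, n)
        if sims[i, j] >= threshold
    ]

# Toy 2-D vectors standing in for real sentence embeddings
toy = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
for i, j, sim in similarity_edges(toy, threshold=0.5):
    print(f"Edge between sentence {i + 1} and {j + 1} (cosine {sim:.2f})")
```

With the notebook's own variables, each tuple returned by `similarity_edges(sentence_embeddings)` could be added to the graph with `G.add_edge(i, j, weight=sim)`. A suitable threshold depends on the embedding model and the text, so it would need to be chosen by inspection.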

--- Step 1: Sentence Segmentation ---
Sentence 1: AI is transforming industries.
Sentence 2: Machines are learning human languages 
because automation helps free humans from repetitive tasks, so employees 
can focus on more innovative work.
Sentence 3: This development will likely create 
new opportunities in tech and data science.

--- Step 2: SONAR Embeddings Shape ---
Each sentence is turned into a vector of dimension: 384

--- Step 3: Diffusion Graph Edges (Sentence Connections) ---

--- Step 4: Cause-Effect Pattern Detection ---
Sentence 2 has a cause-effect indicator: 'Machines are learning human languages 
because automation helps free humans from repetitive tasks, so employees 
can focus on more innovative work.'

--- Step 5: Hidden Process (Memory) ---
Stored paragraph in memory_store under key 'paragraph_context'.
Memory content:
 AI is transforming industries. Machines are learning human languages 
because automation helps free humans from repetitive tasks, so employees 
ca