# Understanding Large Language Models (LLMs)

Large Language Models (LLMs) are advanced AI systems trained on vast amounts of text data. They use deep learning architectures, primarily based on the Transformer model, to understand and generate human-like text.

## How Text Generation Works Internally

Text generation in LLMs follows these key steps:

1. **Tokenization**: The input text is split into tokens (words or subwords)
2. **Embedding**: Tokens are converted into numerical vectors
3. **Attention Mechanism**: The model uses self-attention to understand relationships between tokens
4. **Hidden State Processing**: Information flows through multiple transformer layers
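5. **Token Prediction**: The model outputs a probability distribution over the vocabulary, and the next token is sampled from it
6. **Generation Loop**: The new token is appended to the input and steps 2-5 repeat until an end-of-sequence token or a length limit is reached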
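
The generation loop can be sketched with a toy stand-in for the model. This is a minimal illustration, not how a Transformer works internally: `TOY_MODEL` and `generate` are hypothetical names, and the lookup table conditions only on the last token, whereas a real LLM conditions on the whole context.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
# A real LLM computes this distribution from the full context.
TOY_MODEL = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}

def generate(model, max_new_tokens=10, seed=0):
    """Sample one token at a time until <eos> or the length limit."""
    rng = random.Random(seed)
    tokens = ["<bos>"]
    for _ in range(max_new_tokens):
        dist = model[tokens[-1]]                 # token prediction
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":                # stop condition
            break
        tokens.append(next_token)                # feed the token back in
    return tokens[1:]                            # drop the <bos> marker

print(generate(TOY_MODEL))
```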

Key parameters that influence generation:
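- Temperature: Scales the logits before sampling; lower values make output more deterministic, higher values make it more random
- Top-p (nucleus sampling): Restricts sampling to the smallest set of tokens whose cumulative probability reaches p
- Context window: Maximum number of tokens (prompt plus generated output) the model can process at once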
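
To make temperature and top-p concrete, here is a small sketch in plain Python. The function names and the four example logits are made up for illustration, and inference engines implement these steps differently in detail. Temperature is assumed to be greater than zero.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample a token id after applying temperature and nucleus filtering."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
print(softmax_with_temperature(logits, 0.5))   # sharper distribution
print(softmax_with_temperature(logits, 1.5))   # flatter distribution
print(top_p_filter(softmax_with_temperature(logits), top_p=0.9))
```

With these example logits, the least likely token falls outside the nucleus at `top_p=0.9` and can never be sampled.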

# Retrieval-Augmented Generation (RAG)

RAG is an architecture that enhances LLM responses by combining them with relevant information retrieved from a knowledge base. This approach offers several benefits:

1. **Improved Accuracy**: Models can access specific, up-to-date information
2. **Reduced Hallucination**: Responses are grounded in retrieved facts
3. **Domain Adaptation**: Can be specialized for specific use cases
4. **Cost Efficiency**: Smaller models can perform well with good retrieval

## RAG Pipeline Components

1. **Document Processing**: Converting documents into chunks
2. **Embedding Generation**: Creating vector representations
3. **Vector Storage**: Efficient storage and retrieval of embeddings
4. **Retrieval**: Finding relevant context for queries
5. **Augmented Generation**: Combining retrieved context with LLM generation
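
The five components can be sketched end to end with the standard library alone. This is only an illustration under simplifying assumptions: the bag-of-words `embed` function stands in for a neural embedding model, an in-memory list stands in for a vector database, and the three sample documents are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a real pipeline uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Document processing: invented sample chunks
documents = [
    "Latency measures the round-trip delay of a packet in milliseconds.",
    "Packet loss is the fraction of packets that never reach their destination.",
    "Signal strength is reported in dBm; values closer to zero are stronger.",
]

# 2-3. Embedding generation and vector storage (an in-memory list here)
store = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """4. Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, k=1):
    """5. Augmented generation: place the retrieved context in the prompt sent to the LLM."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does packet loss mean?"))
```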

In [5]:
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.prompts import PromptTemplate
import pandas as pd
import json
from langchain_community.embeddings import HuggingFaceEmbeddings

# Model Configuration

Below we set up the Qwen 2.5 7B model using Ollama. The configuration includes:
- Temperature and top_p for controlling generation diversity
- Context window size
- Streaming callback for real-time output

In [15]:
# Model configuration
MODEL = "qwen2.5:7b"
llm = Ollama(
    model=MODEL,
    temperature=0.9,
    top_p=0.9,
    num_ctx=4096,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)


# Data Processing and Embedding Generation

This section demonstrates:
1. Loading and preprocessing JSON data
2. Text chunking for optimal retrieval
3. Generating embeddings using HuggingFace's sentence transformers
4. Storing vectors in ChromaDB for efficient retrieval

In [16]:
# Load JSON file locally (ensure this file is in the same folder or provide full path)
json_file = "json_dataset.json"

# Read and parse the JSON file
with open(json_file, "r", encoding="utf-8") as f:
    json_data = json.load(f)

# Extract records based on JSON structure
if isinstance(json_data, dict) and "data" in json_data:
    # If MongoDB structure
    records = json_data["data"]
elif isinstance(json_data, list):
    # If direct list of records
    records = json_data
else:
    # If single record
    records = [json_data]

# Convert records to DataFrame and clean
df = pd.DataFrame(records)

# Remove unwanted columns if they exist
columns_to_drop = ['_id', 'dataset_id']
df_original = df.drop(columns=columns_to_drop, errors='ignore')

# Get resulting columns
fields = df_original.columns.tolist()
print(fields)

# Create documents for embeddings (one "column: value" string per cleaned row)
documents = []
for _, row in df_original.iterrows():
    doc_text = " ".join(f"{col}: {val}" for col, val in row.items())
    documents.append(doc_text)

# Create splitter for dividing long texts into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared by consecutive chunks
)

# Split documents into chunks
split_docs = []
for doc in documents:
    split_docs.extend(text_splitter.split_text(doc))

# Define HuggingFace embeddings model
EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL)

# Create and save embeddings in ChromaDB
db = Chroma.from_texts(texts=split_docs, embedding=embeddings)

# Verify number of generated chunks
print(f"Generated {len(split_docs)} text chunks.")
df_original.head()

['device_id', 'timestamp', 'bandwidth_mbps', 'latency_ms', 'packet_loss', 'signal_strength_dbm', 'cell_id', 'connection_type']
Generated 1000 text chunks.


| | device_id | timestamp | bandwidth_mbps | latency_ms | packet_loss | signal_strength_dbm | cell_id | connection_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 5851 | 2024-12-13T14:46:24.173261 | 536.156239 | 6.55122 | 0.037045 | -101.327696 | 57 | MIMO |
| 1 | 4856 | 2024-12-13T15:46:24.173271 | 485.536208 | 9.675234 | 0.040759 | -99.914445 | 66 | MIMO |
| 2 | 3914 | 2024-12-13T16:46:24.173273 | 420.810843 | 8.836038 | 0.049106 | -98.675641 | 59 | MIMO |
| 3 | 9267 | 2024-12-13T17:46:24.173274 | 736.451439 | 8.65604 | 0.056578 | -80.390779 | 68 | Carrier Aggregation |
| 4 | 4908 | 2024-12-13T18:46:24.173275 | 524.27572 | 11.614536 | 0.022364 | -109.168636 | 98 | MIMO |
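
The effect of the `chunk_size` and `chunk_overlap` settings used above can be seen in a simplified splitter. This is a sketch, not the library's algorithm: `split_with_overlap` is a hypothetical helper that cuts fixed-size character windows, whereas `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph breaks, newlines, and spaces.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap        # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks

# Small sizes so the overlap is easy to see
sample = "".join(str(i % 10) for i in range(25))
for piece in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when it is shorter than the overlap.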


# RAG Chain Setup

Here we configure the RAG pipeline by:
1. Setting up the retriever
2. Defining the prompt template
3. Creating the QA chain that combines retrieval and generation
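
Conceptually, a "stuff" style chain concatenates every retrieved chunk into a single context string and substitutes it into the prompt. The sketch below shows that idea with plain string formatting. `stuff_prompt` is a hypothetical helper, the `"\n\n"` separator is an assumption, the chunk values are abbreviated for illustration, and none of this is LangChain's internal code.

```python
TEMPLATE = (
    "You are a cybersecurity expert. Based on this dataset and its fields\n"
    "{context}\n\n"
    "Answer the user's question:\n"
    "{question}\n"
)

def stuff_prompt(retrieved_chunks, question, template=TEMPLATE, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(retrieved_chunks)
    return template.format(context=context, question=question)

# Hypothetical retrieved chunks, shaped like the "column: value" row strings
chunks = [
    "device_id: 5851 latency_ms: 6.55 connection_type: MIMO",
    "device_id: 9267 latency_ms: 8.66 connection_type: Carrier Aggregation",
]
print(stuff_prompt(chunks, "Which connection types appear in the data?"))
```

Because everything is placed in one prompt, the retrieved chunks plus the question must fit inside the model's context window.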

In [21]:
retriever = db.as_retriever()
# Define base prompt
prompt_template = PromptTemplate(
    template="""
    You are a cybersecurity expert. Based on this dataset and its fields
    {context}

    Answer the user's question:
    {question}
    """,
    input_variables=["context", "question"]
)

# Configure RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=False,
    chain_type_kwargs={"prompt": prompt_template}
)

# Build dynamic question with fields
fields_str = ', '.join(fields)
num_samples = 3

# Example system query
question = f"Generate a table with {num_samples} rows, random values and the following fields {fields_str}. Respond ONLY with pure JSON, no markdown, no triple quotes, no explanations."
response = rag_chain.invoke(question)
display(response)

{
  "data": [
    {
      "device_id": 1234,
      "timestamp": "2025-01-04T10:00:00.000000",
      "bandwidth_mbps": 200.5,
      "latency_ms": 9.8,
      "packet_loss": 0.01,
      "signal_strength_dbm": -97.5,
      "cell_id": 30,
      "connection_type": "4G"
    },
    {
      "device_id": 5678,
      "timestamp": "2025-01-04T11:00:00.000000",
      "bandwidth_mbps": 220.3,
      "latency_ms": 11.2,
      "packet_loss": 0.02,
      "signal_strength_dbm": -95.8,
      "cell_id": 45,
      "connection_type": "5G"
    },
    {
      "device_id": 9101,
      "timestamp": "2025-01-04T12:00:00.000000",
      "bandwidth_mbps": 180.7,
      "latency_ms": 8.6,
      "packet_loss": 0.005,
      "signal_strength_dbm": -93.4,
      "cell_id": 25,
      "connection_type": "Carrier Aggregation"
    }
  ]
}

{'query': 'Generate a table with 3 rows, random values and the following fields device_id, timestamp, bandwidth_mbps, latency_ms, packet_loss, signal_strength_dbm, cell_id, connection_type. Respond ONLY with pure JSON, no markdown, no triple quotes, no explanations.',
 'result': '{\n  "data": [\n    {\n      "device_id": 1234,\n      "timestamp": "2025-01-04T10:00:00.000000",\n      "bandwidth_mbps": 200.5,\n      "latency_ms": 9.8,\n      "packet_loss": 0.01,\n      "signal_strength_dbm": -97.5,\n      "cell_id": 30,\n      "connection_type": "4G"\n    },\n    {\n      "device_id": 5678,\n      "timestamp": "2025-01-04T11:00:00.000000",\n      "bandwidth_mbps": 220.3,\n      "latency_ms": 11.2,\n      "packet_loss": 0.02,\n      "signal_strength_dbm": -95.8,\n      "cell_id": 45,\n      "connection_type": "5G"\n    },\n    {\n      "device_id": 9101,\n      "timestamp": "2025-01-04T12:00:00.000000",\n      "bandwidth_mbps": 180.7,\n      "latency_ms": 8.6,\n      "packet_loss": 0.005,

# Data Processing and Export

Finally, we process the generated data by:
1. Extracting and cleaning the JSON response
2. Converting to a pandas DataFrame
3. Exporting to CSV for further use

In [22]:
import json
import pandas as pd

# Extract the raw model output from the 'result' field
raw_result = response['result']

# Parse the JSON string in the 'result' field
parsed_result = json.loads(raw_result)

# Extract the "data" field, which contains the list of records
data = parsed_result.get("data", [])

# Convert the extracted data into a pandas DataFrame
df_generated = pd.DataFrame(data)

# Save the DataFrame to a CSV file
output_csv = "output.csv"
df_generated.to_csv(output_csv, index=False)

# Display the first few rows of the DataFrame
df_generated.head()


| | device_id | timestamp | bandwidth_mbps | latency_ms | packet_loss | signal_strength_dbm | cell_id | connection_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 1234 | 2025-01-04T10:00:00.000000 | 200.5 | 9.8 | 0.01 | -97.5 | 30 | 4G |
| 1 | 5678 | 2025-01-04T11:00:00.000000 | 220.3 | 11.2 | 0.02 | -95.8 | 45 | 5G |
| 2 | 9101 | 2025-01-04T12:00:00.000000 | 180.7 | 8.6 | 0.005 | -93.4 | 25 | Carrier Aggregation |
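
Calling `json.loads` directly works when the model returns pure JSON, as it did here, but models sometimes wrap the payload in prose or code fences despite the instruction. One possible cleaning step is sketched below. `extract_json` is a hypothetical helper that assumes the reply contains a single top-level JSON object, and the example reply is invented.

```python
import json

def extract_json(text):
    """Parse the span from the first '{' to the last '}' in an LLM reply."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

# Invented reply that ignores the "pure JSON" instruction
fence = "`" * 3  # built indirectly so the example does not contain a literal fence
reply = (
    "Here is the data:\n" + fence + "json\n"
    '{"data": [{"device_id": 1234, "connection_type": "4G"}]}\n' + fence
)
records = extract_json(reply).get("data", [])
print(records)
```

If the braces are missing, the helper raises `ValueError`. If the extracted span is not valid JSON, `json.loads` raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so a caller can catch a single exception type and retry the query.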


# Conclusion and Next Steps

This notebook demonstrates a basic RAG implementation. You can extend it by:
1. Integrating different LLMs and embedding models
2. Experimenting with various vector databases
3. Adding evaluation metrics
4. Implementing caching and optimization
5. Adding error handling and validation
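
As one example of the validation step, generated records can be checked against the expected field list before export. This is a minimal sketch: `validate_records` is a hypothetical helper, and it checks only field names, not value types or ranges.

```python
EXPECTED_FIELDS = [
    "device_id", "timestamp", "bandwidth_mbps", "latency_ms",
    "packet_loss", "signal_strength_dbm", "cell_id", "connection_type",
]

def validate_records(records, expected_fields):
    """Return (row_index, problem) tuples; an empty list means every row passed."""
    problems = []
    expected = set(expected_fields)
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((i, "not an object"))
            continue
        missing = expected - record.keys()
        extra = record.keys() - expected
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
        if extra:
            problems.append((i, f"unexpected fields: {sorted(extra)}"))
    return problems

# Invented rows: the first is complete, the second lacks most fields
good = dict.fromkeys(EXPECTED_FIELDS, 0)
bad = {"device_id": 1, "speed": 10}
for index, problem in validate_records([good, bad], EXPECTED_FIELDS):
    print(index, problem)
```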

Combining an LLM with retrieval grounds its output in your own data, which makes RAG a practical foundation for knowledge-intensive applications where accuracy and relevance matter.