In [1]:
#!csharp
Console.WriteLine("C# Kernel is working!");

C# Kernel is working!


In [3]:
#!import config/Settings.cs
// Configure AI backend used by the kernel
var (useAzureOpenAI, model, azureEndpoint, apiKey, orgId) = Settings.LoadFromFile();

### RAG
Retrieval-augmented generation (RAG) lets you extend the knowledge of the base LLM you're using in your agent. 

The way it works is: 
1. Add your own knowledge (in the form of documents or data snippets) to a vector database
2. Call that database to extract relevant data when the LLM is called
3. Feed your data into the LLM's prompt alongside the user query

The specifics of vector retrieval are outside the scope of this exercise, but broadly, it is not a simple string match. Searching in vector space finds _semantic_ similarities, meaning it matches by meaning rather than by exact text.
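
To make "searching by meaning" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors and the `CosineSimilarity` helper are hypothetical illustrations, not part of this notebook's pipeline; real embeddings have far more dimensions.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean "same direction".
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical toy "embeddings"
float[] query = { 1f, 0f, 1f };
float[] similar = { 2f, 0f, 2f };   // same direction as query, different length
float[] unrelated = { 0f, 1f, 0f }; // orthogonal to query

Console.WriteLine(CosineSimilarity(query, similar));   // very close to 1
Console.WriteLine(CosineSimilarity(query, unrelated)); // 0
```

Because the measure depends only on direction, the longer `similar` vector still scores as a near-perfect match for `query`.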

In [4]:
#r "nuget: Azure.AI.OpenAI"
#r "nuget: Azure.Identity"
#r "nuget: Newtonsoft.Json"
#r "nuget: Qdrant.Client, 1.6.0"
#r "nuget: System.Text.Json"

In [5]:
using System;
using Azure;
using Azure.Identity;
using Azure.AI.OpenAI;
using System.Linq;
using Newtonsoft.Json.Linq;

Let's see how we can vectorize a sentence using Azure OpenAI pre-trained embeddings.

1. Instantiate the embedding client

In [6]:
// Create the client from your Azure OpenAI endpoint and default Azure credentials
var openAIClient = new AzureOpenAIClient(new Uri(azureEndpoint), new Azure.Identity.DefaultAzureCredential());
// Create an embedding client for the model
var embeddingClient = openAIClient.GetEmbeddingClient("text-embedding-ada-002");

2. Embed the sentence

In [7]:
string input = "Microsoft Threat Protection Research (MTP-R) will become an AI powerhouse.";
// Generate the embedding (synchronous call)
var response = embeddingClient.GenerateEmbedding(input);
// Read the raw JSON body of the HTTP response
string jsonResponse = response.GetRawResponse().Content.ToString();
// Parse the JSON response
JObject parsedResponse = JObject.Parse(jsonResponse);
// Extract the embedding, which arrives here as a base64-encoded string
string embedding = parsedResponse["data"][0]["embedding"].ToString();
// Print the encoded embedding; a later cell decodes it into a numeric vector
Console.WriteLine("Embedding: " + embedding);

Embedding: 3mjevG2h2rw6gCS7kaJnvFYLATzv7ww9XymTvF8pk7u3mxk83SqkvAeG5jwDm5E8+2IbO6uAqbyS0w07uIG1O7KZgjyZCwQ8NZ0qOqadr7zzY6a77+8MPInUmLuzTvi7u4buu8Nmxrv0SUI8IAwKvQkUZLwmDqG7U6R7PLAqorzH0oS8F2UzvP/WNDtyo/G7AtQSPNkOqTsK22I8v9MPPFC5JjyA1aM6pGdQPJd9Brw3K6i8cg2ZOyzXNjx2oE86C/KkvGGG6rqWZsQ7lt1/u659h7z9SDe8cH+bvOzyLrw8ZsC8CismPJrSgjw83fs7PkSBPIC2hrxhhmq86wyTOwYPqzylfhI75bq4vFuE0zys7wk8BmdJPPYvXjw19Ug8lCiKPFBhiLyS0408mkk+vG0qH7zfmYS7AiwxPMqevDzwZkg8MWKSPHoMjro7vt65DdjAPJicIzz1aN87k5oMPI9+ETzDtok8kQyPPEAqHbus7wm89ColPG0qn7wBDZQ8KDJ3NTKBrzsg22M8DWEFvUm/ajxt+fi87igOvUAqHbzgf6C8X4ExPL6DzDvWaWk87TDpPG8v2Dtphd+7gwuDvPksPDt0EtK8axPdOgvypLoL04c6O0cjPG0qH7pQMOK6b7gcPSto1ju7hm488pwnvO4oDjxVMvk8Vtrau/rzOr3Hod48OthCvE/ypzyOhuy8tdx1PBspkLs+9L28xmMkPIxIMr0lZr+8ENUevXa/7Lvj1Bw8URZ+PE0MDLyOLs681UrMOU+aCT15vEo8EZydPBQQt7v7MXU7fiiJvIwpFb271rG8AQ2UPBUvVLvxfQo7MGptvOfwFzul9U28Q/ZUu7u3FDvKJ4G7rzL9OymCujvE1aY8TYPHPAy5Iztifg88V8D2ugPzL7zlYho8mkk+OyecnrwDaus68wuIO/AOKjzUg0074H8gvEFoV7nj1Jy66hTuPHbwkrzaheQ8MkiuvGucITzRt

The string you see is a base64 encoding of the raw numerical values of the embedding vector. Embeddings are high-dimensional numeric vectors, and they are often serialized as base64 for compact transmission over networks or storage. The model generates this embedding by processing the input text and outputting a vector of numbers.

In [8]:
// Required namespaces
using System.Text.Json;

public static float[] DecodeEmbedding(string base64String)
{
    // Step 1: Decode the base64 string into a byte array
    byte[] decodedBytes = Convert.FromBase64String(base64String);
    
    // Step 2: Convert the byte array into a float array (assuming 32-bit floats)
    int floatCount = decodedBytes.Length / 4; // Each float is 4 bytes
    float[] embedding = new float[floatCount];

    for (int i = 0; i < floatCount; i++)
    {
        embedding[i] = BitConverter.ToSingle(decodedBytes, i * 4);
    }

    return embedding;
}
// Decode the embedding
float[] embeddingVec = DecodeEmbedding(embedding);
Console.WriteLine(string.Join(", ", embeddingVec.Take(7))); // Display first 7 elements

-0.027149614, -0.026688302, -0.0025100843, -0.01413788, 0.007876238, 0.034408506, -0.017964063


In [9]:
Console.WriteLine($"Vector size: {embeddingVec.Length}");

Vector size: 1536
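
As a sanity check on the decoding logic, a round trip on known values shows why the byte count is divided by four. This is a sketch with hypothetical sample values, not real model output; it assumes the platform's byte order matches the payload, the same assumption the notebook's `BitConverter.ToSingle` call makes.

```csharp
using System;

// Hypothetical sample values, not real model output
float[] original = { 0.5f, -1.25f, 3f };

// Pack each 32-bit float into 4 bytes, then base64-encode the byte buffer
byte[] bytes = new byte[original.Length * sizeof(float)];
Buffer.BlockCopy(original, 0, bytes, 0, bytes.Length);
string encoded = Convert.ToBase64String(bytes);

// Reverse direction: base64 -> bytes -> floats
byte[] decodedBytes = Convert.FromBase64String(encoded);
float[] roundTripped = new float[decodedBytes.Length / sizeof(float)];
Buffer.BlockCopy(decodedBytes, 0, roundTripped, 0, decodedBytes.Length);

Console.WriteLine(encoded);
Console.WriteLine(string.Join(", ", roundTripped));
```

Three floats become 12 bytes, which base64 represents as 16 characters. By the same arithmetic, a 1536-dimension vector occupies 6144 bytes before encoding.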


Using the same principle, let's:
1. Read a document and split it into chunks
2. Generate an embedding per chunk
3. Upload the points (document chunks and their embeddings) to Qdrant

In [10]:
// Helper function to generate an embedding for a document
double[] GetEmbeddingVector(string document)
{
    // Call the embedding service
    var embeddingResult = embeddingClient.GenerateEmbedding(document);

    // Extract raw JSON response as string
    var embeddingData = embeddingResult.GetRawResponse().Content.ToString();

    // Parse the JSON
    var jsonDoc = JsonDocument.Parse(embeddingData);
    var embeddingArray = jsonDoc.RootElement
                                .GetProperty("data")[0]
                                .GetProperty("embedding")
                                .GetString();

    // Decode the string into float[]
    float[] embeddingVec = DecodeEmbedding(embeddingArray);

    // Convert float[] to double[]
    return Array.ConvertAll(embeddingVec, x => (double)x);
}

In [11]:
// Relative path to the file inside "config" folder
string filePath = Path.Combine("config", "long_document.txt");
// Read all text from the file
string my_document = File.ReadAllText(filePath);
// Print first 200 characters to confirm
Console.WriteLine($"First 200 chars of the document:\n\n{my_document.Substring(0, Math.Min(200, my_document.Length))}");

First 200 chars of the document:

SIGNATURE_TYPE_PEHSTR (0x61)
Location: Signature\Source\mavsigs\hstr
Extraction tool: Manual
Compiler: hstr.exe
Online view: http://avreports/engine/signaturetype.aspx?id=97 

Summary:

An hst


Let's split the long document into a list of chunks

In [12]:
// Function to split a string into chunks of maxChunkSize characters
List<string> SplitIntoChunks(string text, int maxChunkSize)
{
    var chunks = new List<string>();
    for (int i = 0; i < text.Length; i += maxChunkSize)
    {
        int length = Math.Min(maxChunkSize, text.Length - i);
        chunks.Add(text.Substring(i, length));
    }
    return chunks;
}
// Split document into chunks of 500 characters
var chunks = SplitIntoChunks(my_document, 500);
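
Fixed-size splitting can cut a sentence in half at a chunk boundary. One possible variation, sketched here and not used elsewhere in this notebook, is a sliding window in which consecutive chunks share some characters. The `SplitWithOverlap` helper and its `overlap` parameter are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Sliding-window splitter: each chunk starts (maxChunkSize - overlap) characters after the previous one
List<string> SplitWithOverlap(string text, int maxChunkSize, int overlap)
{
    if (maxChunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
    if (overlap < 0 || overlap >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(overlap));

    var result = new List<string>();
    int step = maxChunkSize - overlap;
    for (int i = 0; i < text.Length; i += step)
    {
        int length = Math.Min(maxChunkSize, text.Length - i);
        result.Add(text.Substring(i, length));
        if (i + length >= text.Length) break; // this window already reached the end
    }
    return result;
}

var sample = "abcdefghij"; // 10 characters
var overlapped = SplitWithOverlap(sample, 4, 2);
Console.WriteLine(string.Join(" | ", overlapped));
```

For the sample string this prints `abcd | cdef | efgh | ghij`: every chunk repeats the last two characters of the previous one, at the cost of storing and embedding more chunks.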

Generate embeddings for each chunk

In [13]:
// Prepare points list
var pointsList = new List<object>();

foreach (var chunk in chunks)
{
    // Get embedding vector for each chunk
    var embeddingAsDouble = GetEmbeddingVector(chunk);

    pointsList.Add(new
    {
        id = Guid.NewGuid().ToString(),
        vector = embeddingAsDouble,
        payload = new
        {
            document = chunk
        }
    });
}
// Create JSON for uploading all points
var json = new
{
    points = pointsList.ToArray()
};
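
To see what this anonymous object looks like once serialized, the sketch below builds a single hypothetical point with a three-dimensional vector and prints the resulting JSON. The id, vector values, and chunk text are placeholders; the notebook's real points carry 1536-dimension vectors.

```csharp
using System;
using System.Text.Json;

// Hypothetical point, shaped like the entries added to pointsList above
var samplePoint = new
{
    id = "11111111-2222-3333-4444-555555555555", // placeholder UUID
    vector = new[] { 0.1, 0.2, 0.3 },
    payload = new { document = "example chunk text" }
};

// Wrap the point in a "points" array, as the upload body does
var body = new { points = new[] { samplePoint } };
string bodyJson = JsonSerializer.Serialize(body, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(bodyJson);
```

`System.Text.Json` keeps anonymous-type property names as written, so the lowercase `id`, `vector`, and `payload` names appear unchanged in the JSON.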

Upload all chunks as separate points to your Qdrant collection

In [14]:
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Qdrant.Client;
using Qdrant.Client.Grpc;

// Your Qdrant setup
var qdrantUrl = "http://localhost:6333";
var collectionName = "my-collection";
var qdrantApiKey = ""; // Optional; named separately so it does not shadow the apiKey loaded from Settings

var httpClient = new HttpClient();
if (!string.IsNullOrEmpty(qdrantApiKey))
    httpClient.DefaultRequestHeaders.Add("api-key", qdrantApiKey);

Delete the existing collection before creating it (if you want to reset)

In [15]:
// Delete the collection
var deleteResponse = await httpClient.DeleteAsync($"{qdrantUrl}/collections/{collectionName}");
Console.WriteLine($"[Delete Collection] Status: {deleteResponse.StatusCode}");

[Delete Collection] Status: OK


In [None]:
// 1. Create collection

var createCollectionUrl = $"{qdrantUrl}/collections/{collectionName}";
// Size of each vector in the collection.
// It must match the dimension of your embeddings (1536, as printed earlier).
var embeddingSize = 1536;

var createCollectionJson = new
{
    vectors = new
    {
        size = embeddingSize, // Must match the dimension of your vectors
        distance = "Cosine" // Other options include "Euclid" and "Dot"
    }
};

var createCollectionString = JsonSerializer.Serialize(createCollectionJson);
var createCollectionContent = new StringContent(createCollectionString, Encoding.UTF8, "application/json"); 


// Make the PUT request to Qdrant to create the collection
var createResponse = await httpClient.PutAsync(createCollectionUrl, createCollectionContent);
var responseText = await createResponse.Content.ReadAsStringAsync();

// Output result
Console.WriteLine($"[Create Collection] Status: {createResponse.StatusCode}");
Console.WriteLine(responseText);

[Create Collection] Status: OK
{"result":true,"status":"ok","time":0.932855704}


In [17]:
// 2. Read document and split into chunks
string filePath = Path.Combine("config", "long_document.txt");
string my_document = File.ReadAllText(filePath);
var chunks = SplitIntoChunks(my_document, 500);

// 3. Prepare points list
var pointsList = new List<object>();
foreach (var chunk in chunks)
{
    var embeddingAsDouble = GetEmbeddingVector(chunk);
    pointsList.Add(new
    {
        id = Guid.NewGuid().ToString(),
        vector = embeddingAsDouble,
        payload = new { document = chunk }
    });
}

In [18]:
// 4. Upload points
var pointsJson = new { points = pointsList.ToArray() };
string pointsJsonString = JsonSerializer.Serialize(pointsJson);
var pointsContent = new StringContent(pointsJsonString, Encoding.UTF8, "application/json");
var uploadResponse = await httpClient.PutAsync($"{qdrantUrl}/collections/{collectionName}/points", pointsContent);
uploadResponse.EnsureSuccessStatusCode();

Console.WriteLine("Collection created and points uploaded successfully.");

Collection created and points uploaded successfully.


Retrieve chunks most similar to the user question

In [19]:
// 1. User question
string userQuestion = "What is an HSTR signature and how does it function?";

// 2. Embed the question
double[] questionVector = GetEmbeddingVector(userQuestion); // GetEmbeddingVector is defined in an earlier cell

// 3. Search in Qdrant for top matching document chunks
var searchBody = new
{
    vector = questionVector,
    limit = 3,
    with_payload = true,
    with_vector = true
};
var searchJson = JsonSerializer.Serialize(searchBody);
var searchContent = new StringContent(searchJson, Encoding.UTF8, "application/json");
var searchResponse = await httpClient.PostAsync($"{qdrantUrl}/collections/{collectionName}/points/search", searchContent);
var searchResult = await searchResponse.Content.ReadAsStringAsync();

In [20]:
// 4. Extract top chunks
var parsed = JsonDocument.Parse(searchResult);
var results = parsed.RootElement.GetProperty("result");

var sb = new StringBuilder();
foreach (var item in results.EnumerateArray())
{
    string doc = item.GetProperty("payload").GetProperty("document").GetString();
    sb.AppendLine(doc);
}
string retrievedContext = sb.ToString();
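
Each search hit also carries a similarity score, which the cell above ignores. The sketch below shows one way to drop weak matches before building the context. It parses a hypothetical, hand-built response rather than calling Qdrant, and the `minScore` cut-off of 0.5 is an assumed value you would tune for your own data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical response in the shape parsed above (result[].payload.document), plus a score per hit
string sampleResponse = JsonSerializer.Serialize(new
{
    result = new[]
    {
        new { id = "a", score = 0.91, payload = new { document = "HSTR chunk one" } },
        new { id = "b", score = 0.42, payload = new { document = "Unrelated chunk" } }
    },
    status = "ok"
});

double minScore = 0.5; // assumed cut-off, not a recommended value
var kept = new List<string>();

using (JsonDocument response = JsonDocument.Parse(sampleResponse))
{
    foreach (JsonElement hit in response.RootElement.GetProperty("result").EnumerateArray())
    {
        double score = hit.GetProperty("score").GetDouble();
        string text = hit.GetProperty("payload").GetProperty("document").GetString() ?? "";
        Console.WriteLine($"{score:F2}  {text}");
        if (score >= minScore) kept.Add(text); // keep only sufficiently similar chunks
    }
}

Console.WriteLine($"Kept {kept.Count} chunk(s)");
```

With these sample values only the first chunk passes the threshold, so a weakly related chunk would not reach the prompt.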

In [22]:
using Azure;
using Azure.Identity;
using OpenAI.Assistants;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using static System.Environment;

// Initialize Azure OpenAI
var (useAzureOpenAI, model, endpoint, apiKey, orgId) = Settings.LoadFromFile();
AzureOpenAIClient openAIClient = new AzureOpenAIClient(new Uri(endpoint), new DefaultAzureCredential());
ChatClient chatClient = openAIClient.GetChatClient("gpt-4o");
// Construct the prompt with the RAG context
string finalPrompt = $"Based only on the context below:\n{retrievedContext}\n\nAnswer the user question: {userQuestion}";

// Create the chat completion using your AzureOpenAIClient
ChatCompletion completion = chatClient.CompleteChat(
[
    new SystemChatMessage("You are a helpful assistant. Read the instructions carefully."),
    new UserChatMessage(finalPrompt)
]);

// Print the result
Console.WriteLine($"{completion.Role}: {completion.Content[0].Text}");

Assistant: An HSTR signature, also known as a heuristic string signature, is a type of detection mechanism used in antivirus software. It operates by searching through the memory that the emulator has processed after execution for specific strings. When these predefined strings are found, it triggers detection, indicating potential malware or suspicious behavior.

HSTR signatures are particularly useful because they examine memory, allowing them to detect threats that may not be visible through static analysis alone. This approach leverages heuristics, which means it can identify new or altered threats based on patterns rather than a predefined list of malware.

In practice, when multiple HSTR signatures with the same priority level are detected, the one with the lowest IndexValue is prioritized for reporting. This IndexValue helps manage the specificity and generic nature of the signature, such as distinguishing between a family-specific signature (Index 0) and a broader generic signa