# Assignment: Structuring Data for Effective Retrieval-Augmented Generation (RAG)

#### Objective:
Your task is to develop a unique data/embedding structure that organizes the information from the provided GStreamer tutorial in a way that enhances the performance of a Retrieval-Augmented Generation (RAG) model. The goal is to ensure that the data is well-linked and easily understandable, allowing the RAG model to fetch and utilize the most important information efficiently.

#### Resources:
- *Text Source*: [GStreamer Tutorial - Dynamic Pipelines](https://gstreamer.freedesktop.org/documentation/tutorials/basic/dynamic-pipelines.html?gi-language=c)

#### Instructions:
1. *Understand the Content*:
   - Thoroughly read and understand the provided GStreamer tutorial.
   - Identify key sections, concepts, and terminologies.

2. *Data Structuring*:
   - Break down the content into logical sections such as Introduction, Concepts, Examples, and Code Snippets.
   - Create a hierarchical structure that reflects the flow of the tutorial, linking related sections and sub-sections.

3. *Embedding Strategy (Optional)*:
   - Design an embedding strategy that captures the essence of each section.
   - Ensure that embeddings for similar or related sections are closely linked, facilitating easy navigation and retrieval.

4. *Linking Data*:
   - Develop a system for linking related pieces of information. For example, code snippets should be linked to the explanations and concepts they demonstrate.
   - Use metadata to tag sections with relevant keywords and concepts for better indexing and retrieval.
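
One way to picture the linking and tagging described in the instructions above is a flat list of chunk records, each carrying an ID, a parent ID, cross-links, and keywords. The sketch below is only an illustration: the `Chunk` class, its field names, the sample IDs, and the placeholder texts are hypothetical and not part of the assignment.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """One retrievable unit of the tutorial (hypothetical layout)."""
    chunk_id: str
    kind: str                        # e.g. "concept" or "code"
    text: str
    parent_id: Optional[str] = None  # position in the section hierarchy
    related_ids: List[str] = field(default_factory=list)  # cross-links
    keywords: List[str] = field(default_factory=list)     # metadata tags


def build_keyword_index(chunks):
    """Map each lowercase keyword to the IDs of the chunks tagged with it."""
    index = defaultdict(list)
    for chunk in chunks:
        for keyword in chunk.keywords:
            index[keyword.lower()].append(chunk.chunk_id)
    return dict(index)


def with_related(chunk_id, chunks):
    """Return the chunk followed by every chunk it links to, in link order."""
    by_id = {c.chunk_id: c for c in chunks}
    chunk = by_id[chunk_id]
    return [chunk] + [by_id[r] for r in chunk.related_ids if r in by_id]


# Placeholder texts stand in for the real tutorial content.
chunks = [
    Chunk("concepts.pads", "concept", "<explanation of pads>",
          parent_id="concepts", related_ids=["code.pad_added"],
          keywords=["Pads", "linking"]),
    Chunk("code.pad_added", "code", "<code snippet>",
          parent_id="walkthrough", related_ids=["concepts.pads"],
          keywords=["pads", "signals"]),
]

index = build_keyword_index(chunks)
context = with_related("concepts.pads", chunks)
```

With this layout, a retriever that matches the concept chunk can also pull in the code snippet that demonstrates it by following `related_ids`.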

#### Deliverables:
- A structured document/file containing the organized data from the GStreamer tutorial.
- A written explanation of your data structure and embedding strategy.
- (Optional) Results and feedback from testing with a RAG model.

---

If you have any questions or need further clarification, feel free to reach out.

Good luck and happy coding!

In [26]:
import requests
from bs4 import BeautifulSoup

# URL of the GStreamer Tutorial
url = "https://gstreamer.freedesktop.org/documentation/tutorials/basic/dynamic-pipelines.html?gi-language=c"

# Send a GET request to the URL
response = requests.get(url)
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Extract relevant sections
title = soup.find("h1").get_text()
content = ""

# Collect all the paragraphs and headings
for element in soup.find_all(["h1", "h2", "h3", "h4", "p", "pre"]):
    content += element.get_text() + "\n\n"

# Save the content to a file
with open("Vishwas.txt", "w") as file:
    file.write(content)

print("Content scraped and saved to 'Vishwas.txt'")



Content scraped and saved to 'Vishwas.txt'
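
The scraping loop above flattens headings, paragraphs, and code into one string, which discards the section boundaries that the structuring step relies on. A minimal sketch of one alternative is to keep each element's tag name and group the elements under the most recent heading. The function name `group_by_heading` and the record layout are hypothetical. The `(tag, text)` pairs could be built from the same `soup.find_all(...)` call as `(element.name, element.get_text())`.

```python
HEADING_TAGS = {"h1", "h2", "h3", "h4"}


def group_by_heading(elements):
    """Group (tag, text) pairs into sections keyed by the latest heading.

    `elements` is an iterable of (tag_name, text) pairs in document order.
    Paragraph text goes to "paragraphs" and <pre> text to "code", so each
    code snippet stays attached to the section that explains it.
    """
    sections = []
    current = None
    for tag, text in elements:
        # Keep indentation inside code; trim only surrounding newlines.
        text = text.strip("\n") if tag == "pre" else text.strip()
        if not text.strip():
            continue
        if tag in HEADING_TAGS:
            current = {"heading": text, "level": int(tag[1]),
                       "paragraphs": [], "code": []}
            sections.append(current)
            continue
        if current is None:
            # Content that appears before the first heading.
            current = {"heading": "", "level": 0,
                       "paragraphs": [], "code": []}
            sections.append(current)
        current["code" if tag == "pre" else "paragraphs"].append(text)
    return sections


# Toy input standing in for the scraped page (not the real tutorial text).
sample = [
    ("h1", "Title"),
    ("p", "Intro paragraph."),
    ("h2", "Walkthrough"),
    ("p", "Explanation."),
    ("pre", "int main (void) { return 0; }\n"),
]
sections = group_by_heading(sample)
```

Each resulting section could then be saved as its own record instead of being appended to one text file.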


In [29]:
# The OpenAI API key is loaded from a .env file (as OPENAI_API_KEY)
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()
client = OpenAI()

# Function to read content from file
def read_content(filename):
    with open(filename, "r") as file:
        content = file.read()
    return content

# Read content from "Vishwas.txt"
filename = "Vishwas.txt"
content = read_content(filename)

# Define the function schema for the OpenAI function call
functions = [
    {
        "name": "parse_gstreamer_tutorial",
        "description": "Structure GStreamer tutorial into hierarchical JSON format",
        "parameters": {
            "type": "object",
            "properties": {
                "GStreamerTutorial": {
                    "type": "object",
                    "properties": {
                        "metadata": {
                            "type": "object",
                            "properties": {
                                "keywords": {"type": "array", "items": {"type": "string"}},
                                "links": {"type": "array", "items": {"type": "string"}}
                            }
                        },
                        "Introduction": {
                            "type": "object",
                            "properties": {
                                "content": {"type": "string"},
                                "metadata": {
                                    "type": "object",
                                    "properties": {
                                        "keywords": {"type": "array", "items": {"type": "string"}}
                                    }
                                }
                            }
                        },
                        "Concepts": {
                            "type": "object",
                            "properties": {
                                "content": {"type": "string"},
                                "metadata": {
                                    "type": "object",
                                    "properties": {
                                        "keywords": {"type": "array", "items": {"type": "string"}}
                                    }
                                }
                            }
                        },
                        "Examples": {
                            "type": "object",
                            "properties": {
                                "content": {"type": "string"},
                                "metadata": {
                                    "type": "object",
                                    "properties": {
                                        "keywords": {"type": "array", "items": {"type": "string"}}
                                    }
                                }
                            }
                        },
                        "CodeSnippets": {
                            "type": "object",
                            "properties": {
                                "content": {"type": "string"},
                                "metadata": {
                                    "type": "object",
                                    "properties": {
                                        "keywords": {"type": "array", "items": {"type": "string"}}
                                    }
                                }
                            }
                        }
                    },
                    "required": ["metadata", "Introduction", "Concepts", "Examples", "CodeSnippets"]
                }
            },
            "required": ["GStreamerTutorial"],
        },
    }
]

# Define the prompt for OpenAI function call
prompt = f"""
    The following is the content of a GStreamer tutorial on dynamic pipelines. Please structure it into a hierarchical JSON format with sections such as Introduction, Concepts, Examples, and Code Snippets. Also, include metadata keywords and links to related sections.

    Content:
    {content}

    Please structure the content into the format specified in the function schema, ensuring that each section has appropriate content and metadata.
"""

try:
    # Call OpenAI function to structure content
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        functions=functions,
        function_call={"name": "parse_gstreamer_tutorial"},
        max_tokens=2000,
        temperature=0.5
    )

    # Check for response errors or incomplete data
    if response and response.choices and response.choices[0].message:
        message = response.choices[0].message
        if message.function_call:
            structured_content = json.loads(message.function_call.arguments)
            
            # Pretty print the JSON with indentation
            print(json.dumps(structured_content, indent=2))
            
            # Save the formatted output to a file
            with open("organized_data_from_GS.json", "w") as outfile:
                json.dump(structured_content, outfile, indent=2)
            print("\nFormatted output has been saved to 'organized_data_from_GS.json'")
        else:
            print("No function call in the response. Check API logs for more details.")
    else:
        print("Response was empty or incomplete. Check API logs for more details.")

except json.JSONDecodeError as json_error:
    print(f"JSON Decoding Error: {json_error}")
    print("Raw response:")
    print(message.function_call.arguments)

except Exception as e:
    print(f"Error occurred: {e}")
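
Function-call arguments come back as a JSON string generated by the model, so they can be cut off by `max_tokens` or arrive with a section missing. A check before saving may therefore help. The sketch below assumes the section names from the schema above; `missing_sections` is a hypothetical helper, not part of the OpenAI SDK.

```python
REQUIRED_SECTIONS = ["metadata", "Introduction", "Concepts",
                     "Examples", "CodeSnippets"]


def missing_sections(structured_content):
    """List required sections that are absent or have no usable content."""
    tutorial = structured_content.get("GStreamerTutorial", {})
    problems = []
    for name in REQUIRED_SECTIONS:
        section = tutorial.get(name)
        if not isinstance(section, dict):
            # Section is missing entirely or is not an object.
            problems.append(name)
        elif name != "metadata" and not str(section.get("content", "")).strip():
            # Section exists but its "content" field is empty.
            problems.append(name)
    return problems


# Toy input: "Concepts" is empty, and two sections are absent.
example = {
    "GStreamerTutorial": {
        "metadata": {"keywords": []},
        "Introduction": {"content": "text"},
        "Concepts": {"content": ""},
    }
}
problems = missing_sections(example)
print(problems)
```

A non-empty result could be used to retry the call or to skip saving the file.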

{
  "GStreamerTutorial": {
    "metadata": {
      "keywords": [
        "GStreamer",
        "dynamic pipelines",
        "tutorial",
        "coding",
        "multimedia"
      ],
      "links": [
        "https://gstreamer.freedesktop.org/documentation/tutorials/basic/dynamic-pipelines.html"
      ]
    },
    "Introduction": {
      "content": "This tutorial shows the rest of the basic concepts required to use GStreamer, which allow building the pipeline 'on the fly', as information becomes available, instead of having a monolithic pipeline defined at the beginning of your application. After this tutorial, you will have the necessary knowledge to start the Playback tutorials. The points reviewed here will be: How to attain finer control when linking elements. How to be notified of interesting events so you can react in time. The various states in which an element can be.",
      "metadata": {
        "keywords": [
          "GStreamer",
          "Introduction",
          "dynamic

In [36]:
from openai import OpenAI
from dotenv import load_dotenv
import os
import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

load_dotenv()
client = OpenAI()

# Step 2: Implement embedding strategy
def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding

def create_embeddings(structured_content):
    embeddings = {}
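    for section, content in structured_content['GStreamerTutorial'].items():
        # Skip the top-level metadata and any section left without text,
        # since the embeddings endpoint rejects empty input
        if section != 'metadata' and content.get('content'):
            embeddings[section] = get_embedding(content['content'])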
    return embeddings

# Step 3: Create a retrieval system
def find_most_relevant_section(query, embeddings):
    query_embedding = get_embedding(query)
    similarities = {}
    for section, embedding in embeddings.items():
        similarity = cosine_similarity([query_embedding], [embedding])[0][0]
        similarities[section] = similarity
    return max(similarities, key=similarities.get)

# Step 4: Implement RAG
def generate_response(query, structured_content, embeddings):
    relevant_section = find_most_relevant_section(query, embeddings)
    context = structured_content['GStreamerTutorial'][relevant_section]['content']
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant with expertise in GStreamer."},
            {"role": "user", "content": f"Context: {context}\n\nQuery: {query}"}
        ]
    )
    return response.choices[0].message.content

# Main execution
if __name__ == "__main__":
    # Load the structured content
    with open("organized_data_from_GS.json", "r") as file:
        structured_content = json.load(file)
    
    # Create embeddings
    embeddings = create_embeddings(structured_content)
    
    # Example usage of the RAG system
    query = "Explain the concept of dynamic pipelines in GStreamer."
    response = generate_response(query, structured_content, embeddings)
    print(f"Query: {query}")
    print(f"Response: {response}")
    # print(f"Embeddings: {embeddings}")
    
    # Save the results to result.txt
    with open("result.txt", "w") as file:
        file.write(f"Query: {query}\n\n")
        file.write(f"Response: {response}\n\n")
        file.write("Embeddings have been created for the following sections:\n")
        for section in embeddings.keys():
            file.write(f"- {section}\n")
    
    # Save the embeddings for future use
    with open("embeddings.json", "w") as file:
        json.dump(embeddings, file)
    
    print("Results have been saved to 'result.txt'")
    print("Embeddings have been saved to 'embeddings.json'")

# OUTPUT

# Query: Explain the concept of dynamic pipelines in GStreamer.
# Response: Dynamic pipelines in GStreamer refer to the ability to build and modify media processing pipelines 'on the fly'. This means that as new information becomes available or circumstances change, you can add, remove, or modify elements within your pipeline during runtime. This provides a great deal of flexibility and enables complex use-cases compared to static pipelines where all elements and their configurations are pre-defined when the pipeline is created.

# One significant aspect of dynamic pipeline handling in GStreamer is "Pad" and "Ghost Pad". Pads act as input/output ports for elements and can be created or linked on demand. Ghost pad is a special type of pad to facilitate connection across bin boundaries, helping in encapsulating logic inside bins while dynamically linking elements.

# Another important factor is understanding the different states of the elements and the pipelines. GStreamer pipelines have four different states - NULL, READY, PAUSED, and PLAYING. The transition between these states allows you to manage the system resources effectively and control the data flow between the elements. 

# However, working with dynamic pipelines requires correct handling of buffering, timestamps, event synchronization, and state changes. Understanding how to deal with these challenges is also crucial for successful implementation. 

# Monitoring or reacting to events is another vital part of GStreamer dynamic pipelines. Events like EOS (end-of-stream), errors, metadata updates, etc., can trigger actions in your application. You can listen to these events by connecting callback functions to the "message" signal emitted by the pipeline bus. 

# To conclude, dynamic pipelines offer a strong feature in GStreamer to build efficient, flexible, and complex multimedia applications. However, managing dynamic pipelines requires an understanding of various concepts like pad linking, states, event handling, etc.
# Results have been saved to 'result.txt'
# Embeddings have been saved to 'embeddings.json'
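
`find_most_relevant_section` returns only the single best section, so a query that touches both a concept and its code gets at most one of them as context. A minimal sketch of a top-k variant follows, assuming the same `{section: vector}` dictionary layout as `embeddings`. The helper names and the toy three-dimensional vectors are hypothetical, and the cosine is computed in plain Python instead of with scikit-learn so the block runs without API calls.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_sections(query_embedding, embeddings, k=2):
    """Return the k best (section, score) pairs, highest score first."""
    scored = [(cosine(query_embedding, vec), section)
              for section, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [(section, score) for score, section in scored[:k]]


# Toy vectors standing in for real embeddings.
toy_embeddings = {
    "Introduction": [1.0, 0.0, 0.0],
    "Concepts": [0.8, 0.6, 0.0],
    "CodeSnippets": [0.0, 0.0, 1.0],
}
ranked = top_k_sections([0.9, 0.4, 0.0], toy_embeddings, k=2)
```

The `content` fields of the returned sections could then be concatenated into the prompt context in place of a single section.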

Query: Explain the concept of dynamic pipelines in GStreamer.
Response: Dynamic pipelines in GStreamer refer to the ability to build and modify media processing pipelines 'on the fly'. This means that as new information becomes available or circumstances change, you can add, remove, or modify elements within your pipeline during runtime. This provides a great deal of flexibility and enables complex use-cases compared to static pipelines where all elements and their configurations are pre-defined when the pipeline is created.

One significant aspect of dynamic pipeline handling in GStreamer is "Pad" and "Ghost Pad". Pads act as input/output ports for elements and can be created or linked on demand. Ghost pad is a special type of pad to facilitate connection across bin boundaries, helping in encapsulating logic inside bins while dynamically linking elements.

Another important factor is understanding the different states of the elements and the pipelines. GStreamer pipelines have four di