<a href="https://colab.research.google.com/github/mkaramb/CloudWeaver/blob/retriever-draft/Custom_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --upgrade --quiet langchain-google-vertexai langchain-google-genai langchain-core langchain-community langchain unstructured lark chromadb

In [None]:
import os
from getpass import getpass

# Prompt for the key at runtime; never commit a real key to a notebook.
os.environ['GOOGLE_API_KEY'] = getpass('Google API key: ')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from langchain_core.documents import Document
import os

def read_terraform_files(base_path):
    """Load every .tf file under base_path as a Document with path-derived metadata."""
    documents = []

    # os.walk already descends into subdirectories, so each file is visited exactly once.
    # (An extra recursive glob per directory would add the same files again.)
    for root, _dirs, files in os.walk(base_path):
        for file in files:
            if file.endswith('.tf'):
                metadata = extract_metadata(root, file)
                with open(os.path.join(root, file), 'r') as content_file:
                    content = content_file.read()
                documents.append(Document(page_content=content, metadata=metadata))

    return documents

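def extract_metadata(file_path, file_name):
    """Derive metadata from the file name and its folders under terraform_code_samples."""
    # splitext drops only the extension; str.replace would also hit '.tf' inside the name.
    metadata = {'instance_type': os.path.splitext(file_name)[0]}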
    path_parts = file_path.split(os.sep)

    if 'terraform_code_samples' in path_parts:
        terraform_index = path_parts.index('terraform_code_samples')
        if len(path_parts) > terraform_index + 1:
            metadata['resource'] = path_parts[terraform_index + 1]
        if len(path_parts) > terraform_index + 2:
            metadata['instance'] = path_parts[terraform_index + 2]

    return metadata

base_path = '/content/drive/My Drive/terraform_code_samples'
documents = read_terraform_files(base_path)

# Print the metadata of one sample document, if that index exists
if len(documents) > 123:
    print(documents[123].metadata)

{'instance_type': 'rmig_stateful_policy_ips', 'resource': 'compute', 'instance': 'rmig_stateful_policy_ips'}
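
The loader derives `resource` and `instance` from the folder names under `terraform_code_samples`. A minimal, stdlib-only sketch of that mapping follows. It uses a throwaway directory tree and plain dicts in place of `Document`; the folder and file names (`compute/basic_vm/main.tf`) and the helper names are made up for illustration.

```python
import os
import tempfile

def path_metadata(dir_path, file_name, marker="terraform_code_samples"):
    """Map <marker>/<resource>/<instance>/<file>.tf to a metadata dict."""
    metadata = {"instance_type": os.path.splitext(file_name)[0]}
    parts = dir_path.split(os.sep)
    if marker in parts:
        i = parts.index(marker)
        if len(parts) > i + 1:
            metadata["resource"] = parts[i + 1]
        if len(parts) > i + 2:
            metadata["instance"] = parts[i + 2]
    return metadata

def collect_tf_metadata(base_path):
    """Walk base_path once; os.walk recurses, so each .tf file is seen one time."""
    found = []
    for root, _dirs, files in os.walk(base_path):
        for name in sorted(files):
            if name.endswith(".tf"):
                found.append(path_metadata(root, name))
    return found

# Hypothetical layout: terraform_code_samples/compute/basic_vm/main.tf
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = os.path.join(tmp, "terraform_code_samples", "compute", "basic_vm")
    os.makedirs(sample_dir)
    with open(os.path.join(sample_dir, "main.tf"), "w") as fh:
        fh.write('resource "google_compute_instance" "default" {}\n')
    sample_metadata = collect_tf_metadata(os.path.join(tmp, "terraform_code_samples"))

print(sample_metadata)
```

Because the walk visits every directory itself, one file yields exactly one metadata entry.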


In [None]:
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI

model = ChatGoogleGenerativeAI(model="gemini-pro", convert_system_message_to_human=True)
output = model.invoke(
    [
        SystemMessage(content="""When you receive a prompt from a user outlining a desired GCP project, your task is to meticulously analyze the prompt to identify all the GCP instances and resources required to construct the project. This involves both explicitly mentioned instances and resources, as well as those that are implicitly required due to dependencies or the nature of the project.

Decompose the User's Prompt: Begin by dissecting the user's prompt to understand the scope and requirements of the GCP project. Identify key elements such as the GCP products mentioned (e.g., Compute Engine, Cloud Storage, BigQuery) and the specific instances within those products (e.g., VM instances, storage buckets, BigQuery datasets).

Identify Implicit Requirements: Consider the dependencies and necessary components that might not be explicitly mentioned but are essential for the project's functionality. For example, if a VM instance is requested, consider the need for a network and a firewall rule.

Formulate Questions for the Retriever: For each instance or resource identified, formulate a precise question that can be passed to the retriever. The retriever has access to Terraform files organized by instance type within GCP product-named folders. Ensure each question is clear and specific to enable the retriever to find the exact Terraform code files needed. For example:

"Retrieve Terraform code for a Compute Engine VM instance with standard configuration."
"Retrieve Terraform code for a Cloud Storage bucket with public access."
"Retrieve Terraform code for a VPC network with custom subnets."
Ensure Completeness: Cross-reference your list of questions with the initial project requirements to ensure all necessary components are covered. If the project involves interdependent resources (e.g., a VM instance that requires a specific network configuration), make sure to include questions that cover these dependencies.

Output the Questions: Present the formulated questions in a structured format that can be easily passed to the retriever. This could involve listing the questions sequentially or grouping them by the GCP product for clarity.

Your ultimate goal is to generate a comprehensive set of retriever-ready questions that, when answered, will provide all the Terraform code files necessary to build the user's GCP project in its entirety. This approach ensures that no crucial component is overlooked and that the user can seamlessly compile the Terraform code to deploy their project on GCP.

"""),
        HumanMessage(content="Create a GCP Terraform project that connects a VM to a mysql database."),
    ]
)



In [None]:
print(str(output))

content='**Questions for the Retriever:**\n\n**Compute Engine:**\n* Retrieve Terraform code for a Compute Engine VM instance with standard configuration.\n\n**Cloud SQL:**\n* Retrieve Terraform code for a MySQL Cloud SQL instance with a specific database name.\n\n**VPC Network:**\n* Retrieve Terraform code for a VPC network with a custom subnet.\n\n**Firewall:**\n* Retrieve Terraform code for a firewall rule that allows traffic from the VM instance to the Cloud SQL instance.\n\n**Service Account:**\n* Retrieve Terraform code for a service account that grants the VM instance access to the Cloud SQL instance.'


In [None]:
# Use the message text itself. str(output) is a repr with escaped newlines,
# and its closing quote leaks into the last sentence.
content = output.content

retrieve_sentences_corrected = [line.strip() for line in content.split('\n') if line.strip().startswith('* Retrieve')]
retrieve_sentences_final = [sentence.replace('*', '') for sentence in retrieve_sentences_corrected]

retrieve_sentences_final

[' Retrieve Terraform code for a Compute Engine VM instance with standard configuration.',
 ' Retrieve Terraform code for a MySQL Cloud SQL instance with a specific database name.',
 ' Retrieve Terraform code for a VPC network with a custom subnet.',
 ' Retrieve Terraform code for a firewall rule that allows traffic from the VM instance to the Cloud SQL instance.',
 ' Retrieve Terraform code for a service account that grants the VM instance access to the Cloud SQL instance.']

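The model's reply groups the questions under bold product headings. If that grouping is worth keeping, a small parser can preserve it. This is a sketch that assumes the reply always follows the `**Heading:**` / `* question` layout seen in the printed output, which the model does not guarantee; `parse_questions` and `sample_reply` are hypothetical names.

```python
import re

HEADING = re.compile(r"^\*\*(.+?):?\*\*$")   # e.g. **Cloud SQL:**
BULLET = re.compile(r"^\*\s+(.*)$")          # e.g. * Retrieve Terraform code ...

def parse_questions(reply_text):
    """Group bullet questions under the most recent bold heading."""
    grouped = {}
    current = None
    for raw in reply_text.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            grouped.setdefault(current, [])
            continue
        bullet = BULLET.match(line)
        if bullet and current is not None:
            grouped[current].append(bullet.group(1).strip())
    # Drop headings that had no questions under them (such as a title line).
    return {name: items for name, items in grouped.items() if items}

sample_reply = (
    "**Questions for the Retriever:**\n\n"
    "**Compute Engine:**\n"
    "* Retrieve Terraform code for a Compute Engine VM instance with standard configuration.\n\n"
    "**Cloud SQL:**\n"
    "* Retrieve Terraform code for a MySQL Cloud SQL instance with a specific database name."
)
questions_by_product = parse_questions(sample_reply)
print(questions_by_product)
```

Keeping the heading alongside each question leaves room to pass the product name to the retriever as a hint later on.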
In [None]:
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

doc_embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001", task_type="retrieval_document"
)

vectorstore = Chroma.from_documents(documents=documents, embedding=doc_embeddings)

In [None]:
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="instance_type",
        description="Specifies the exact variant or configuration of the instance that the Terraform code represents. This allows for precise identification of Terraform files based on specific implementations, such as a MySQL version for a Cloud SQL instance, enabling targeted retrieval of code.",
        type="string",
    ),
    AttributeInfo(
        name="resource",
        description="Identifies the broader GCP product category to which an instance belongs. For example, a Compute Engine VM or a Cloud SQL database would fall under 'compute' and 'sql' resources, respectively. This categorization facilitates the organization and search of Terraform files within the context of GCP products.",
        type="string",
    ),
    AttributeInfo(
        name="instance",
        description="Denotes the specific instance within a GCP product that the Terraform code is designed to provision or manage. This could refer to a particular VM, database, or storage bucket, among others. The instance name aids in pinpointing Terraform files that apply to particular GCP service instances.",
        type="string",
    ),
]


In [None]:
from langchain.retrievers.self_query.base import SelfQueryRetriever

document_content_description = "Terraform code for GCP instances"
retriever = SelfQueryRetriever.from_llm(
    model,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
)
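
Conceptually, a self-querying retriever asks the LLM to split a question into a free-text query and a structured filter over the declared metadata fields, and that filter narrows the candidates for the similarity search. The sketch below imitates only the filtering half with plain Python. The `Doc` class, the `matches` helper, and the hand-written filter are stand-ins for illustration, not LangChain's API.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def matches(doc, metadata_filter):
    """True when every key in the filter equals the document's metadata value."""
    return all(doc.metadata.get(key) == value for key, value in metadata_filter.items())

corpus = [
    Doc('resource "google_compute_instance" ...', {"resource": "compute", "instance": "basic_vm"}),
    Doc('resource "google_sql_database_instance" ...', {"resource": "sql", "instance": "mysql_instance"}),
    Doc('resource "google_storage_bucket" ...', {"resource": "storage", "instance": "bucket"}),
]

# A filter a query constructor might emit for
# "Retrieve Terraform code for a MySQL Cloud SQL instance" (assumed, not actual output).
structured_filter = {"resource": "sql"}
filtered = [doc for doc in corpus if matches(doc, structured_filter)]
print([doc.metadata["instance"] for doc in filtered])
```

The quality of the `AttributeInfo` descriptions matters because the LLM relies on them to decide which field and value to filter on.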

In [None]:
code_docs = {}

# Run each generated question through the self-query retriever.
for sentence in retrieve_sentences_final:
    code_docs[sentence] = retriever.invoke(sentence)

In [None]:
print(code_docs)

{' Retrieve Terraform code for a Compute Engine VM instance with standard configuration.': [Document(page_content='# [START compute_basic_vm_parent_tag]\n# [START compute_instances_create]\n\n# Create a VM instance from a public image\n# in the `default` VPC network and subnet\n\nresource "google_compute_instance" "default" {\n  name         = "my-vm"\n  machine_type = "n1-standard-1"\n  zone         = "us-central1-a"\n\n  boot_disk {\n    initialize_params {\n      image = "ubuntu-minimal-2210-kinetic-amd64-v20230126"\n    }\n  }\n\n  network_interface {\n    network = "default"\n    access_config {}\n  }\n}\n# [END compute_instances_create]\n\n# [START vpc_compute_basic_vm_custom_vpc_network]\nresource "google_compute_network" "custom" {\n  name                    = "my-network"\n  auto_create_subnetworks = false\n}\n# [END vpc_compute_basic_vm_custom_vpc_network]\n\n# [START vpc_compute_basic_vm_custom_vpc_subnet]\nresource "google_compute_subnetwork" "custom" {\n  name          = "

In [None]:
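# code_docs maps each question to a list of Documents; Chroma expects one flat list.
retrieved_docs = [doc for docs in code_docs.values() for doc in docs]
# A separate collection name (arbitrary) keeps these apart from the full corpus above.
vectorstore1 = Chroma.from_documents(
    documents=retrieved_docs, embedding=doc_embeddings, collection_name="retrieved_terraform"
)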
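
`code_docs` maps each question to a list of documents, while `Chroma.from_documents` expects a flat list of `Document` objects; iterating a dict yields only its string keys. A minimal sketch of flattening such a mapping and dropping documents returned for more than one question follows. `Doc` is a stand-in for LangChain's `Document`, and deduplicating on `page_content` is an assumption about what counts as the same file.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def flatten_unique(docs_by_question):
    """Flatten {question: [docs]} into one list, keeping first occurrences only."""
    seen = set()
    flat = []
    for docs in docs_by_question.values():
        for doc in docs:
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                flat.append(doc)
    return flat

vm = Doc('resource "google_compute_instance" "default" {}')
net = Doc('resource "google_compute_network" "custom" {}')
toy_code_docs = {
    "question about VMs": [vm, net],
    "question about networks": [net],
}
flat_docs = flatten_unique(toy_code_docs)
print(len(flat_docs))
```

Deduplicating before embedding avoids paying for the same file twice and keeps one sample from crowding out others in the top-k results.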

In [None]:
# Retriever

retriever1 = vectorstore1.as_retriever(search_type="similarity", search_kwargs={"k": 6})
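
With `search_type="similarity"` and `k=6`, the retriever returns the six stored documents whose embeddings are closest to the query embedding. A toy illustration with hand-made 2-D vectors and cosine similarity follows. A real vector store may use a different distance metric and an approximate index, so this only illustrates the idea; the file names and vectors are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, stored, k):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _vec in ranked[:k]]

# Invented embeddings: axis 0 ~ "compute-like", axis 1 ~ "database-like".
toy_store = [
    ("compute_vm.tf", (1.0, 0.0)),
    ("sql_mysql.tf", (0.0, 1.0)),
    ("vpc_network.tf", (0.7, 0.7)),
]
nearest = top_k((0.9, 0.1), toy_store, k=2)
print(nearest)
```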

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

template = """You will be provided with a set of Terraform files as context. These files contain code for various instances and configurations within a Google Cloud Platform (GCP) project. Your task is to utilize these files as the foundation for building a complete Terraform project based on a specific project description provided by the user.

Steps to Follow:

Review the Context: Start by thoroughly examining the Terraform files you've been given. Understand the instances, resources, and configurations they define. Take note of any variables, modules, and outputs used in these files, as they will be critical in ensuring that your project is modular and reusable.

Understand the User's Project Description: Carefully read the user's project description. Identify all the components, features, and specific configurations they want to implement in their GCP project. This description might include requests for specific types of instances (e.g., VMs, databases), networking configurations (e.g., VPCs, subnets), access controls (e.g., IAM roles, service accounts), or any other GCP services and resources.

Identify Gaps and Overlaps: Compare the user's project requirements with the instances and configurations outlined in the provided Terraform files. Identify any gaps (i.e., required components not covered in the files) and overlaps (i.e., components already defined in the files that match the user's requirements).

Modify and Integrate: Use your knowledge of Terraform and GCP to modify existing code and add new code where necessary to fill in the gaps. Ensure that all components work together seamlessly. This might involve adjusting parameters, adding or modifying resource definitions, and ensuring that dependencies are correctly managed.

Ensure Best Practices: As you weave the code together, ensure that you follow Terraform and GCP best practices. This includes organizing resources into modules for reusability, using variables for customization, and defining outputs for critical information. Also, ensure that the project is secure, efficient, and cost-effective.

Compile the Complete Project: Combine the modified and new Terraform code into a comprehensive project. This project should reflect the user's description and meet all specified requirements. The final output should be a set of Terraform files that, when applied, will deploy the user's desired GCP project in its entirety.

Final Output: Your final output will be a detailed Terraform project, encompassing all necessary files to implement the user's GCP project as described. This project should be ready for deployment, with all resources correctly configured and integrated according to the project description and your expertise.


{context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | model
    | StrOutputParser()
)

final_code = rag_chain.invoke(content)
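
The chain reads left to right: the input is sent to the retriever, the retrieved documents are joined into `{context}`, the prompt template is filled in, and the model's reply is reduced to a string. The same data flow can be traced with plain functions. In this sketch `fake_retrieve` and `fake_model` are placeholders for the real retriever and the Gemini call, and the prompt is shortened to its two placeholders.

```python
def fake_retrieve(question):
    """Placeholder retriever: always returns the same two snippets."""
    return [
        'resource "google_compute_instance" "default" {}',
        'resource "google_sql_database_instance" "default" {}',
    ]

def format_context(snippets):
    """Join retrieved snippets the way format_docs joins page_content."""
    return "\n\n".join(snippets)

PROMPT = "{context}\n\nQuestion: {question}\n\nHelpful Answer:"

def fake_model(prompt_text):
    """Placeholder model: reports how many resource blocks reached it."""
    return f"received {prompt_text.count('resource')} resource blocks"

def run_chain(question):
    # Step 1: build both prompt inputs from the single incoming question.
    inputs = {"context": format_context(fake_retrieve(question)), "question": question}
    # Step 2: fill the template. Step 3: call the model and return its text.
    return fake_model(PROMPT.format(**inputs))

answer = run_chain("Connect a VM to a MySQL database.")
print(answer)
```

Tracing the flow this way makes it easier to check which retriever feeds `{context}` and what text is passed through as `{question}`.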

In [None]:
print(final_code)

**Compute Engine**

```
resource "google_compute_instance" "default" {
  name         = "my-instance"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }
}
```

**Cloud SQL**

```
resource "google_sql_database" "default" {
  name    = "my-database"
  instance = google_sql_instance.default.name
}

resource "google_sql_instance" "default" {
  name             = "my-instance"
  database_version = "MYSQL_8_0"
  machine_type     = "db-n1-standard-1"
  region           = "us-central1"

  settings {
    storage_auto_increase = true
  }
}
```

**VPC Network**

```
resource "google_compute_network" "default" {
  name = "my-network"
}

resource "google_compute_subnetwork" "default" {
  name          = "my-subnet"
  network       = google_compute_network.default.name
  region        = "us-central1"
  ip_cidr_range = "10.0.0.0

In [None]:
# Split the generated answer into stripped lines for inspection.
final_code_lines = [line.strip() for line in final_code.split('\n')]

final_code_lines

['**Compute Engine**',
 '',
 '```',
 'resource "google_compute_instance" "default" {',
 'name         = "my-instance"',
 'machine_type = "e2-standard-4"',
 'zone         = "us-central1-a"',
 '',
 'boot_disk {',
 'initialize_params {',
 'image = "debian-cloud/debian-11"',
 '}',
 '}',
 '',
 'network_interface {',
 'network = "default"',
 'access_config {}',
 '}',
 '}',
 '```',
 '',
 '**Cloud SQL**',
 '',
 '```',
 'resource "google_sql_database" "default" {',
 'name    = "my-database"',
 'instance = google_sql_instance.default.name',
 '}',
 '',
 'resource "google_sql_instance" "default" {',
 'name             = "my-instance"',
 'database_version = "MYSQL_8_0"',
 'machine_type     = "db-n1-standard-1"',
 'region           = "us-central1"',
 '',
 'settings {',
 'storage_auto_increase = true',
 '}',
 '}',
 '```',
 '',
 '**VPC Network**',
 '',
 '```',
 'resource "google_compute_network" "default" {',
 'name = "my-network"',
 '}',
 '',
 'resource "google_compute_subnetwork" "default" {',
 'nam