# Use-Case
GenAI what will be "trained" on the SEMPER Policy Language: https://github.com/nuvibit/semper-policy-repo-sample/wiki

The solution will accept prompts (samples below), generate an SEMPER policy, and validate the policy in a compiler.
After sucessfull compilation  there shall be an option, to check-in the policy to the right place in the policy repository.

###Sample Prompts

``` 
I want to detect the usage of 'root' user in all accounts, where the account tag 'Environment' starts with 'Production' or 'Core'.
```

``` 
I want to disable the control IDs 4.* of the Security Hub Standard CIS AWS Foundations Benchmark 1.4.
```

``` 
I want to archive all GuardDuty Findings that originate from a specific actor.
```

```
I want an enrichment policy to trigger an auto-remediation for the open security group port TCP80.
```

```
I want an enrichment policy to trigger an auto-remediation for the open security group port TCP80.
The policy should be applied to all accounts where the account tag 'Owner' starts with 'Donald Duck'.
```

```
I want an enrichment policy to trigger an auto-remediation for the open security group port TCP80.
The policy should be applied to all accounts where the account tag 'Owner' starts with 'Pluto' and that are in the Organization Unit "Sandbox".
```

``` 
I want an enrichanment policy to trigger an auto-remediation for the open security group port TCP80.
The policy should be applied to all accounts where the account tag 'Owner' starts with 'Pluto' or 'Donald Duck' in or below the Organization Unit "Department1".
```

AWS-Account: 678856817733  
Region: US-West-2

https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/  
https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/studio/open/d-warwybct01ai/default-20230911t173777

https://github.com/aws-samples/amazon-bedrock-workshop

Ensure this permission is added to the execution role of the SageMaker Studio Profile:
```json {linenos=table,hl_lines=[],linenostart=50}

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BedrockFullAccess",
            "Effect": "Allow",
            "Action": ["bedrock:*"],
            "Resource": "*"
        }
    ]
}
```




In [2]:
# Make sure you ran `download-dependencies.sh` from the root of the repository first!
%pip install --no-build-isolation --force-reinstall \
    ../dependencies/awscli-*-py3-none-any.whl \
    ../dependencies/boto3-*-py3-none-any.whl \
    ../dependencies/botocore-*-py3-none-any.whl

%pip install --quiet "faiss-cpu>=1.7,<2" langchain==0.0.249 "pypdf>=3.8,<4"
%pip install unstructured


[0mProcessing /root/michael-amazon-bedrock/dependencies/awscli-*-py3-none-any.whl
[31mERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/root/michael-amazon-bedrock/dependencies/awscli-*-py3-none-any.whl'
[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.
[0mNote: you may need to restart the kernel to use updated packages.
[0mNote: you may need to restart the kernel to use updated packages.


In [3]:
import json
import os
import sys

import boto3

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww


os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["BEDROCK_ASSUME_ROLE"] = "arn:aws:iam::678856817733:role/BedrockRole"
os.environ["BEDROCK_ENDPOINT_URL"] = "https://bedrock.us-west-2.amazonaws.com"  # E.g. "https://..."


boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    endpoint_url=os.environ.get("BEDROCK_ENDPOINT_URL", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None),
)

Create new client
  Using region: us-west-2
  Using role: arn:aws:iam::678856817733:role/BedrockRole ... successful!
boto3 Bedrock client successfully created!
bedrock(https://bedrock.us-west-2.amazonaws.com)


In [4]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

# - create the Anthropic Model
llm = Bedrock(model_id="anthropic.claude-v1", client=boto3_bedrock, model_kwargs={'max_tokens_to_sample':1000})
bedrock_embeddings = BedrockEmbeddings(client=boto3_bedrock)

## Data Preparation

In [27]:
from urllib.request import urlretrieve
FOLDERNAME_10_SOURCE = "10_source"
FOLDERNAME_20_PROCESSED = "20_processed"

os.makedirs(FOLDERNAME_10_SOURCE, exist_ok=True)
files = [
    "https://raw.githubusercontent.com/wiki/nuvibit/semper-policy-repo-sample/Home.md",
    "https://raw.githubusercontent.com/wiki/nuvibit/semper-policy-repo-sample/10-SEMPER-Policies.md",
    "https://raw.githubusercontent.com/wiki/nuvibit/semper-policy-repo-sample/90-JSON-Engine.md",
]
for url in files:
    file_path = os.path.join(FOLDERNAME_10_SOURCE, url.rpartition("/")[2])
    urlretrieve(url, file_path)

In [29]:
def load_md_from_directory(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
         if filename.endswith('.md'):  # Assuming you want to read .md files
            full_path = os.path.join(directory_path, filename)
            with open(full_path, 'r', encoding='utf-8') as f:
                documents.append({"filename": filename, "raw_text": f.read()})
    return documents

def write_chunk_to_file(target_folder, base_filename, counter, chunk):
    suffix = "_".join(map(str, counter))
    filename = f"{base_filename}_{suffix}.md"
    with open(os.path.join(target_folder, filename), 'w') as f:
        f.write(chunk)

def get_chunks_and_write_files(target_folder, filename, raw_text):
    chunks = []
    lines = raw_text.split("\n")
    
    chunk = ""
    counter = []  # List to store counters for each level
    for line in lines:
        stripped_line = line.strip()

        # Check if the line starts with one or more '#' characters
        hash_count = len(stripped_line) - len(stripped_line.lstrip('#'))
        if hash_count > 0:
            if chunk:
                # Write the current chunk to a file
                write_chunk_to_file(target_folder, filename, counter, chunk)

                # Reset the chunk
                chunks.append(chunk.strip())
                chunk = ""

            # Update counter list based on the current level
            counter = counter[:hash_count]  # Remove counters for deeper levels, if any
            if len(counter) < hash_count:
                counter.extend([0] * (hash_count - len(counter)))  # Initialize counters for new levels, if any
            counter[-1] += 1  # Increment counter for the current level

        chunk = chunk + line + "\n"

    if chunk:
        # Write the last chunk to a file
        write_chunk_to_file(target_folder, filename, counter, chunk)

        chunks.append(chunk.strip())
    
    return chunks

documents = load_md_from_directory(FOLDERNAME_10_SOURCE)
for document in documents:
    get_chunks_and_write_files(FOLDERNAME_20_PROCESSED, document["filename"], document["raw_text"])

In [6]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.document_loaders import DirectoryLoader

def load_md_from_directory(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.md'):  # Assuming you want to read .txt files
            loader = UnstructuredMarkdownLoader(os.path.join(directory_path, filename))
            documents.append(loader.load())
    return documents

loader = DirectoryLoader("./source/", glob="**/*.md")

documents = loader.load()
# - in our testing Character split works better with this PDF data set
text_splitter = CharacterTextSplitter(
    # Set a really small chunk size, just to show.
    separator = "\n",
    chunk_size = 750,
    chunk_overlap  = 0,
)
docs = text_splitter.split_documents(documents)

In [7]:
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents more than the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

Average length among 3 documents loaded is 14180 characters.
After the split we have 61 documents more than the original 3.
Average length among 61 documents (after split) is 687 characters.


In [8]:
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

In [9]:
query = "I want to get a enrichment policy to trigger an auto-remediation for the open security group port TCP80."

In [10]:
query_embedding = vectorstore_faiss.embedding_function(query)
np.array(query_embedding)

array([ 0.05932617,  0.04003906,  0.296875  , ..., -0.15625   ,
        0.02954102,  0.01116943])

In [11]:
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print_ww(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

4 documents are fetched which are relevant to the query.
----
## Document 1: | ...findingResource    | object | (optional) |
| ....id                | string | (optional, default = '') you can self-reference the JSON of the
processed Security Finding.
Example: 'raw.detail.requestParameters.keyId' |
| ....type              | string | (optional, default = 'AwsAccount') according to
AWS documentation. |
Samples of Extension-Policies  🔝
Sample 1 🔝
json {linenos=table,hl_lines=[],linenostart=50}
{
  "metaData": {
    "domain": "extension",
    "title": "Auto-Remediation of TCP-Ports 22 & 3389 for CIDR range /24",
    ...
  },
  "extension": {
    "policyScope": {
      ...
    },
    "findingPattern": {
      ...
    },
    "extensionBlock": {
      "sqsFanOut": [
        {.......
---
## Document 2: "findingComplianceStatus" : "FAILED",
      }
    }
  }
}
Processed Security Finding  🔝
A processed Security Finding will sent to one of two SNS topics.
The JSON of the processed Security Findin

In [12]:
answer = wrapper_store_faiss.query(question=query, llm=llm)
print_ww(answer)

 To trigger an Auto-Remediation for open TCP Port 80, you can use the following enrichment policy:
```json
{
  "metaData": {
    "domain": "extension",
    "title": "Auto-Remediation of TCP Port 80",

...
  },
  "extension": {
    "policyScope": {
      "accountIds": ["*"],
      "awsRegions": ["*"],
      "resourceTypes": ["AwsSecurityGroup"],
      "complianceTypes": ["FAILED"]
    },
    "findingPattern": {
      "detailType": ["AWS Security Finding - EC2 Security Group Finding"],
      "confidence": ["MEDIUM", "HIGH"],
      "Title": ["Port TCP 80 open"],
      "recommendation": ["Remediate open port 80 on security group"]
    },
    "extensionBlock": {
        "remediation": {
          "remediationInstructions": "sh sg-policy -sg ",
          "remediationResourceId": "air/{security_group.id}",
          "remediationParameters": {
            "security_group.id": "{findingResource.id}"
          }
        }
    }
  }
}
```

This policy will look for Security Hub findings of type "