In [1]:
import logging
import sys

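logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# basicConfig already attaches a stdout handler to the root logger; this second
# handler prints each record again without the "LEVEL:name:" prefix.
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))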

In [3]:
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader('./documents_2').load_data()

documents_text = []
for document in documents:
    # Keep printable ASCII only (codes 32..126); 127 is the DEL control character.
    # Note that this also drops newlines and any non-ASCII punctuation.
    single_document_text = ''.join(
        char for char in document.text if 32 <= ord(char) <= 126
    )
    print(single_document_text, "\n")
    documents_text.append(single_document_text)

print(documents_text)

Cloud security audit  issues and challenges  Livia Maria Brum  Economic Informatics Doctoral School The Bucharest University of Economic Studies Bucharest, Romania brumalivia@gmail.com  Abstract This paper analyzes the cyber security audit program of services, infrastructure and processes offered through the cloud computing technology. The first part of article presents the importance of performing the process of audit, general concepts of information security audit as well as the limitations of traditional methods for auditing complex systems, such as the cloud computing. In the second part there are presented frameworks for audit planning, that can be used for every cloud model.  Keywordscloud computing, security audit, information security, cyber security  I. INTRODUCTION  Cloud technology has become a component part of daily activities in the corporate, industrial, medical, educational and domestic use, evolving from concept to reality. NIST defines cloud computing as a model for e
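
The character filter above drops every character outside the printable ASCII range, which is likely why the printed text fuses words such as "Keywordscloud" (presumably a non-ASCII dash sat between them in the source). A minimal sketch of an alternative, assuming it is acceptable to substitute a space for each dropped character and then collapse runs of spaces; `clean_text` is a hypothetical helper name, not part of llama_index:

```python
import re


def clean_text(text: str) -> str:
    """Replace characters outside printable ASCII (32..126) with a space,
    then collapse runs of spaces so neighbouring words stay separated."""
    replaced = ''.join(ch if 32 <= ord(ch) <= 126 else ' ' for ch in text)
    return re.sub(r' +', ' ', replaced).strip()


# Illustrative input with an em dash and a newline (both outside 32..126).
print(clean_text("Keywords\u2014cloud computing,\nsecurity audit"))
```

Replacing rather than deleting keeps token boundaries intact, which may matter for the embedding step later in the notebook.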

In [15]:
len(documents_text)

16

In [5]:
from llama_index import Document

doc_chunks = []
for i, text in enumerate(documents_text):
    doc = Document(text, doc_id=f"doc_id_{i}")
    doc_chunks.append(doc)

In [6]:
len(doc_chunks)

1

In [7]:
import os

from llama_index import LLMPredictor
from langchain import HuggingFaceHub

# Read the Hugging Face token from the environment rather than hard-coding it in the notebook.
api_key = os.environ["HUGGINGFACEHUB_API_TOKEN"]

llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", huggingfacehub_api_token=api_key)

llm_predictor = LLMPredictor(llm=llm)  # input size is possibly 2048 tokens (unverified)

In [8]:
from llama_index.indices.service_context import ServiceContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.callbacks import LlamaDebugHandler
from llama_index import (LangchainEmbedding, PromptHelper)
from transformers import AutoTokenizer
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

prompt_helper = PromptHelper(max_input_size=900, num_output=100, max_chunk_overlap=20)

callback_manager_1 = LlamaDebugHandler()

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")

service_context_vectorIndex = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    embed_model=embed_model,
    # node_parser=SimpleNodeParser(text_splitter=TokenTextSplitter(tokenizer=tokenizer, chunk_size=512, chunk_overlap=0)),
    callback_manager=callback_manager_1,
)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
Use pytorch device: cpu
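
The `PromptHelper` settings in this cell imply a simple token budget: of the 900 tokens allowed as input, 100 are reserved for the generated answer, and the remainder is shared between the prompt template and the retrieved context. A rough sketch of that arithmetic, assuming the helper subtracts the template's tokens and the reserved output from the input limit and divides what is left across the chunks. The function name and the 60-token template size are illustrative assumptions, not values taken from llama_index:

```python
def available_chunk_tokens(max_input_size: int, num_output: int,
                           prompt_tokens: int, num_chunks: int = 1) -> int:
    """Tokens left for each retrieved chunk after reserving room for the
    prompt template and the generated answer."""
    budget = max_input_size - num_output - prompt_tokens
    if budget <= 0:
        raise ValueError("prompt and output reservation exceed the input limit")
    return budget // num_chunks


# max_input_size and num_output come from the cell above;
# the 60-token prompt template is a hypothetical figure.
print(available_chunk_tokens(max_input_size=900, num_output=100, prompt_tokens=60))
```

Token counts depend on which tokenizer does the counting, so the real budget may differ from this estimate.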


In [9]:
from llama_index import GPTVectorStoreIndex
index_set = {}
for i, doc in enumerate(doc_chunks):
    cur_index = GPTVectorStoreIndex.from_documents(
        [doc], 
        service_context=service_context_vectorIndex
    )
    #print(len(cur_index.index_struct.nodes_dict))
    index_set[doc.doc_id] = cur_index

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 5414 tokens
> [build_index_from_nodes] Total embedding token usage: 5414 tokens


In [10]:
from llama_index.retrievers import VectorIndexRetriever

query = "What is conclusion of this paper?"

response = ''

# Find the index whose top node best matches the query, then query the LLM once.
best_index = None
prev_MAX = 0
for index in index_set.values():
    retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=1,
    )

    nodes = retriever.retrieve(query)

    for node in nodes:
        print(node.score)
        if node.score >= prev_MAX:
            print("\nSelected node =", node.score)
            best_index = index
            prev_MAX = node.score

if best_index is not None:
    # Build the query engine only for the best-matching vector store index.
    query_engine = best_index.as_query_engine(similarity_top_k=1)
    response = query_engine.query(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 7 tokens
> [retrieve] Total embedding token usage: 7 tokens
0.20381790839299863

Selected node = 0.20381790839299863


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 7 tokens
> [retrieve] Total embedding token usage: 7 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1020 tokens
> [get_response] Total LLM token usage: 1020 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
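
The loop above keeps the index whose top-ranked node scores highest for the query. A minimal sketch of that selection step in isolation, using plain floats as stand-ins for `node.score`; `pick_best_index` is a hypothetical helper, and every score other than the one printed above is made up for illustration:

```python
from typing import Dict, Optional, Tuple


def pick_best_index(top_scores: Dict[str, Optional[float]]) -> Optional[Tuple[str, float]]:
    """Return (doc_id, score) for the highest-scoring entry, skipping
    entries without a score; return None if nothing is scored."""
    scored = [(doc_id, s) for doc_id, s in top_scores.items() if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])


# Stand-in scores; only the first value appears in the notebook output.
scores = {"doc_id_0": 0.20381790839299863, "doc_id_1": 0.05, "doc_id_2": None}
print(pick_best_index(scores))
```

Choosing the winner first and issuing a single query afterwards avoids an LLM call for every intermediate candidate.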


In [11]:
print(response.response)

Cloud-based systems need specific security methods; traditional methods are not enough.


In [13]:
callback_manager_1.get_llm_inputs_outputs()

[[CBEvent(event_type=<CBEventType.LLM: 'llm'>, payload={'context_str': '- personnel, systems, processes. The predominant use of virtualization to allocate resources complicates the audit process because the abstraction of resources makes it difficult to identify all assets for audit purposes. To address this issue, some CSPs provide consumers with audit reports from third-party companies that can confirm whether their infrastructure meets compliance standards [18].  ISACA proposes an audit / assurance program based on 3 steps - planning and scoping the audit, governing the cloud and operating in the cloud. The security of data and information is analyzed with the help of the controls from the third stage, according to table 3:  Table III  ISACA Audit/Assurance Program Step Category Description Step Incident Response, Notification and Remediation Incident notifications, responses, and remediation are documented, timely, address the risk of the incident, escalated as necessary and are fo

In [147]:
callback_manager_1.flush_event_logs()