#**SSF Document Analyzer Using PaLM 2 + Pinecone + LangChain**


---------------------------------------------------------------------------
## *API limits*

- Base model `textembedding-gecko`: 600 requests per minute
- Base model `chat-bison`: 60 requests per minute
- The embedding API accepts at most 3,072 input tokens and outputs 768-dimensional vector embeddings
- Max vector dimensionality is 20,000 (Pinecone index limit).
----------------------------------------------------------------------------
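
One way to stay under the per-minute quotas listed above is to space calls out on the client side. The sketch below is a minimal throttle under that assumption; `RateLimiter` is a hypothetical helper, not part of LangChain or the PaLM client, and the `clock`/`sleep` parameters exist only so it can be exercised without real waiting.

```python
import time


class RateLimiter:
    """Spaces out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # seconds between two calls
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage: call chat_limiter.wait() right before each chat-bison request.
chat_limiter = RateLimiter(max_per_minute=60)
```

A fixed spacing is the simplest policy; it does not allow bursts, which keeps it safely below the quota at the cost of some throughput.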

#**Step 01: Install All the Required Packages**

In [7]:
!pip install langchain
!pip install pinecone-client
!pip install pypdf
!pip install -q google-generativeai



#**Step 02: Import All the Required Libraries**

In [8]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import GooglePalmEmbeddings
from langchain.llms import GooglePalm
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import pinecone
import os
import sys
from google.colab import userdata


#**Step 03: Load the PDF Files**

In [9]:
!mkdir -p pdfs


Add all PDF files to the 'pdfs' directory before proceeding.

#**Step 04: Extract the Text from the PDFs**

In [10]:
loader = PyPDFDirectoryLoader("pdfs")
data = loader.load()
for document in data:
    document.page_content = document.page_content.replace('\n', '')  # removing all newline characters
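
Deleting newlines outright joins the last word of one line to the first word of the next (`"a\nb"` becomes `"ab"`). As a possible alternative, the sketch below collapses whitespace into single spaces instead; stripping the dot leaders found in tables of contents is an extra step assumed here, not something the notebook does. `clean_page_text` is a hypothetical helper name.

```python
import re


def clean_page_text(text):
    """Normalize extracted PDF text without gluing adjacent words together."""
    text = re.sub(r"\.{2,}", " ", text)  # drop runs of dots (table-of-contents leaders)
    text = re.sub(r"\s+", " ", text)     # collapse newlines, tabs and repeated spaces
    return text.strip()


print(clean_page_text("Overview  ........ ........  24\nIntended Audience"))
```

Note that the dot-leader rule also removes ordinary ellipses, so it may not suit every document.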


In [11]:
print(data[2].page_content)
print(data[2].metadata)


 Page 2 Table of Contents  Terms of Use  ................................ ................................ ................................ .................  1 Table of Contents  ................................ ................................ ................................ ..........  2 Overview  ................................ ................................ ................................ ......................  24 Intended Audience  ................................ ................................ ................................ ................ 24 Consensus Guidance  ................................ ................................ ................................ ........... 25 Typographical Conventions  ................................ ................................ ................................ .26 Recommendation Definitions  ................................ ................................ .....................  27 Title ................................ ..........

#**Step 05: Split the Extracted Data into Text Chunks**

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)

In [13]:
text_chunks = text_splitter.split_documents(data)

In [14]:
len(text_chunks)

7827
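
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, which this sketch ignores. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=20):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window start advances
    # Stop early so the last chunk is never made up only of overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears partly in both chunks.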

# Making a list of all documents and appending an encoded document and page number to every chunk

In [15]:

lst = []  # document names, in order of first appearance
for chunk in text_chunks:
    source = chunk.metadata['source']  # name of the document
    if source not in lst:  # first time we see this document: register it
        lst.append(source)
    doc_number = lst.index(source) + 1  # 1-based document number
    # append the "|doc:page|" tag to the chunk text so the LLM can cite it
    chunk.page_content += " |" + str(doc_number) + ":" + str(chunk.metadata['page']) + "|"


print(lst)


['pdfs/CIS_Microsoft_Windows_10_Enterprise_Benchmark_v2.0.0.pdf', 'pdfs/C5-Annex.pdf', 'pdfs/nistspecialpublication800-30r1.pdf', 'pdfs/CIS_Google_Android_Benchmark_v1.4.0.pdf', 'pdfs/CompTIA Security Plus SY0-601 Certification (motakhasessho.ir).pdf', 'pdfs/- International Standard ISO 31000_2018-02 (2018).pdf', 'pdfs/27005 2022.pdf']


In [16]:
text_chunks[101]

Document(page_content="Page 9 17.2.1 (L1) Ensure 'Audit Application Group Management' is set to 'Success and Failure' (Automated)  .. 437 17.2.2 (L1) Ensure 'Audit Security Group Management' is set to include 'Success' (Automated)  .............  439 17.2.3 (L1) Ensure 'Audit User Ac count Management' is set to 'Success and Failure' (Automated)  .........  441 17.3 Detailed Tracking  ................................ ................................ ................................ .......................  444 17.3.1 |1:9|", metadata={'source': 'pdfs/CIS_Microsoft_Windows_10_Enterprise_Benchmark_v2.0.0.pdf', 'page': 9})

In [17]:
text_chunks[100]

Document(page_content="17.1.1 (L1) Ensure 'Audit Credential Validation' is set to 'Success and Failure' (Automated)  ...................  434 17.2 Account Management  ................................ ................................ ................................ ...............  436 |1:8|", metadata={'source': 'pdfs/CIS_Microsoft_Windows_10_Enterprise_Benchmark_v2.0.0.pdf', 'page': 8})
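
The `|doc:page|` tag at the end of each chunk can be read back with a regular expression and mapped to a file name through the document list printed above. The sketch below assumes the tag format shown in these outputs; `parse_source_tag` is a hypothetical helper.

```python
import re

# Matches a trailing tag such as "|1:8|" (document number, page number).
TAG_RE = re.compile(r"\|(\d+):(\d+)\|\s*$")


def parse_source_tag(page_content, doc_names=None):
    """Return (doc_number, page, doc_name) from a tagged chunk, or None if untagged."""
    match = TAG_RE.search(page_content)
    if match is None:
        return None
    doc_number, page = int(match.group(1)), int(match.group(2))
    doc_name = None
    if doc_names and 1 <= doc_number <= len(doc_names):
        doc_name = doc_names[doc_number - 1]  # tags are 1-based
    return doc_number, page, doc_name


print(parse_source_tag("17.2 Account Management |1:8|", ["pdfs/example.pdf"]))
```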

In [24]:
os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_PALM_API_KEY')  # reads your PaLM API key from the Colab secrets

#**Step 06: Download the Embeddings**

In [25]:
embeddings=GooglePalmEmbeddings()

In [26]:
query_result = embeddings.embed_query("Hello World")


In [27]:
print("Length", len(query_result))

Length 768


#**Step 07: Initializing Pinecone**

In [28]:
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', userdata.get('PINECONE_API_KEY'))
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', 'gcp-starter')


In [31]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "grcbot" # put in the name of your pinecone index here
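
The index has to exist, with a dimension that matches the 768-dimensional embeddings, before chunks are written to it. A minimal sketch of a guard is shown below. It assumes the 2.x `pinecone-client` module API (the one that provides `pinecone.init`), where `list_indexes()` returns index names; the `cosine` metric is also an assumption. `ensure_index` is a hypothetical helper.

```python
def ensure_index(pc, name, dimension=768, metric="cosine"):
    """Create the index if it does not exist yet.

    `pc` is the already-initialized `pinecone` module (2.x client assumed).
    Returns True if the index was created, False if it was already there.
    """
    if name in pc.list_indexes():
        return False
    pc.create_index(name, dimension=dimension, metric=metric)
    return True


# Hypothetical usage, after pinecone.init(...):
# ensure_index(pinecone, index_name)
```

Passing the module in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real account.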


#**Step 08: Create Embeddings for Each of the Text Chunks**

In [32]:
docsearch = Pinecone.from_texts([t.page_content for t in text_chunks], embeddings, index_name=index_name)
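
Embedding 7,827 chunks against a 600 requests-per-minute quota takes a while. The sketch below gives a rough lower bound and a way to process the chunks in batches. It assumes one embedding request per chunk, which may not match how the client actually groups requests; `batched` and `minutes_needed` are hypothetical helpers.

```python
import math


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def minutes_needed(n_requests, requests_per_minute):
    """Lower bound on wall-clock minutes when limited to requests_per_minute."""
    return math.ceil(n_requests / requests_per_minute)


print(minutes_needed(7827, 600))            # chunks in this notebook vs. embedding quota
print(list(batched([1, 2, 3, 4, 5], 2)))
```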

#If you already have an index, you can load it like this


In [None]:
#docsearch = Pinecone.from_existing_index(index_name, embeddings)


#**Step 09: Similarity Search**

In [33]:
query = "is there information about sa user if so "


In [34]:
query = "is there information about 'farm service account'"

In [35]:
docs = docsearch.similarity_search(query, k=3)


In [36]:
docs

[Document(page_content="both users and devices?31. What type of account is a service account?32. How many accounts should a system administrator for a multinational corporation have and why?33. What do I need to do when I purchase a baby monitor and why?34. What is a privilege account?35. What is the drawback for security if the company uses shared accounts?36. What is a default account? Is it a security risk?37. The system administrator in a multination corporation creates a user account using an employee's first |5:135|"),
 Document(page_content='a directory or identity service.   ● ● v7 16.8 Disable Any Unassociated Accounts   Disable any account that cannot be associated with a business process or business owner.  ● ● ● |1:995|'),
 Document(page_content='and other useful elements.   ● ● v7 16.6 Maintain an Inventory of Accounts   Maintain an inventory of all accounts organized by authentication system.   ● ● |1:451|')]
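
Conceptually, a similarity search embeds the query and returns the `k` stored vectors closest to it. The brute-force sketch below illustrates the idea with cosine similarity on toy 2-dimensional vectors. It is not how Pinecone works internally (the service runs its own nearest-neighbor search), and the metric actually used depends on how the index was created.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=3):
    """Return (index, score) pairs for the k vectors most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-ins for 768-dim embeddings
print(top_k([1.0, 0.0], stored, k=2))
```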

#Custom Prompts

In [37]:
prompt_template  = """
Use the following pieces of context to answer the question at the end, using the text in its original form as much as possible. If you don't know the answer, just say that you don't know; don't try to make up an answer.
The context is an extract of a document that describes the vendor's security policy. The last part of each message contains "|XX:YY|", where XX is the document number and YY is the page number.
This is the only place you can take the values for XX and YY from. Don't make up values; only take them from the "|XX:YY|" at the end of the message.

{context}

Question: {question}

You need to format it like this:
Doc-XX , Pg YY
Answer that describes the content in detail
"""

In [38]:
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": prompt}

#**Step 10: Creating a Google PaLM Model Wrapper**

In [39]:
llm = GooglePalm(temperature=0.7)

In [40]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever(), chain_type_kwargs=chain_type_kwargs)
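
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the `{context}` slot of the prompt, and the user query fills `{question}`. The sketch below imitates that step with plain string formatting; the short template and the `stuff_prompt` helper are illustrative stand-ins, not the chain's real internals.

```python
DEMO_TEMPLATE = "Use the context to answer.\n\n{context}\n\nQuestion: {question}\n"


def stuff_prompt(template, chunk_texts, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunk_texts)
    return template.format(context=context, question=question)


print(stuff_prompt(
    DEMO_TEMPLATE,
    ["Minimum password length ... |1:41|", "Account lockout ... |1:45|"],
    "is there information about password length",
))
```

Because every chunk is passed in a single prompt, the number of retrieved chunks and the chunk size together have to fit within the model's input limit.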


#**Step 11: Q/A**

In [43]:
query = "is there information about 'CIS benchmark'"

In [44]:
print(qa.run(query))

Yes, there is information about 'CIS benchmark'.

Doc-1, Pg 24: 
"All CIS Benchmarks focus on technical configuration settings used to maintain and/or increase the security of the addressed technology, and they should be used in conjunction with other essential cyber hygiene tasks like:  • Monitoring the base operating system for vulnerabilities and quickly updating with the latest security patches  • Monitoring applications and libraries for vulnerabilities and quickly updating with the latest security patches  In the end, the CIS"

Doc-1, Pg 1293: 
"CIS Benchmark Recommendation  Set Correctly  Yes No 18.9.40  Service Control Manager Settings  18.9.41  Shutdown  18.9.42  Shutdown Options  18.9.43  Storage Health  18.9.44  Storage Sense  18.9.45  System Restore  18.9.46  Troubleshooting and Diagnostics  18.9.46.1  Application Compatibility Diagnostics  18.9.46.2  Corrupted File Recovery  18.9.46.3  Disk Diagnostic  18.9.46.4  Fault Tolerant Heap  18.9.46.5  Microsoft Support Diagnostic

In [55]:
query = "is there information about password length"

In [56]:
qa.run(query)

'Doc-1, Pg 41\n\nYes, the minimum password length is set to 14 or more characters.|1:41|'

In [47]:
query = "can you give me the registry location for audit of Make Hidden Markup Visible is set to Enabled"

In [48]:
qa.run(query)

'The registry location for audit of Make Hidden Markup Visible is set to Enabled is:\n\nHKEY_LOCAL_MACHINE\\SOFTWARE\\Policies\\Microsoft\\Messenger\\Client:CEIP\n\n|1:715|'

In [None]:
prev_input = ""  # previous question, appended to follow-ups for context
while True:
    new_input = input("Input Prompt: ").strip()
    if new_input.lower() == 'exit':
        print('Exiting')
        break
    if new_input == '':
        continue
    # append only the previous question (with a separating space) so that
    # follow-ups like "how to remediate this" keep their context
    user_input = (new_input + " " + prev_input).strip()
    prev_input = new_input
    result = qa({'query': user_input})
    print(f"Answer: {result['result']}")

Input Prompt: tell me about sql hardening 4.1
Answer: Doc-3 , Pg 52 
SQL Hardening 4.1: Enable CLR Strict Security

To enable CLR Strict Security, run the following commands:

```
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'clr strict security', 1;
RECONFIGURE;
GO
EXEC sp_configure 'show advanced options', 0;
RECONFIGURE;
```

This will enable CLR Strict Security, which will help to protect against malicious code being executed in the database.
Input Prompt: how to remediate this 
Answer: Doc-1, Pg 10

To remediate SQL hardening 4.1, the following steps can be taken:

1. Review TSQL and application code for SQL Injection.
2. Only permit minimally privileged accounts to send user input to the server.
3. Minimize the risk of SQL injection attack by using parameterized commands and stored procedures.
4. Reject user input containing binary data, escape sequences, and comment characters.
Input Prompt: how to harden word 3.2?
Answer: Doc-3321, Pg 95 | 6.1 En

KeyboardInterrupt: ignored
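
The last answer cites `Doc-3321`, although only seven documents were indexed. One possible guard is to compare the citations in an answer against the `|XX:YY|` tags of the chunks that were actually retrieved. The sketch below assumes the answer follows the `Doc-XX , Pg YY` format requested in the prompt; `unsupported_citations` is a hypothetical helper.

```python
import re

CITATION_RE = re.compile(r"Doc-(\d+)\s*,\s*Pg\s*(\d+)")  # "Doc-1, Pg 41" in an answer
TAG_RE = re.compile(r"\|(\d+):(\d+)\|")                   # "|1:41|" in a chunk


def unsupported_citations(answer, source_texts):
    """Return the (doc, page) pairs cited in `answer` that no retrieved chunk carries."""
    allowed = set()
    for text in source_texts:
        allowed.update(TAG_RE.findall(text))
    return [cite for cite in CITATION_RE.findall(answer) if cite not in allowed]


print(unsupported_citations("Doc-3321, Pg 95 | 6.1 En", ["some chunk text |1:95|"]))
```

To obtain the retrieved chunks alongside the answer, `RetrievalQA.from_chain_type` accepts `return_source_documents=True`, after which the result dictionary also contains `source_documents`.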