## projectX

#### Problem Statement

The AppSec team has done a good job of building a knowledge base of questions frequently asked by service teams. Today, when a service team member reaches out to a security team member, either directly or in a Slack group, with a question that already exists in the knowledge base (which the security team maintains), the security team manually looks up the question in the Quip doc and replies in Slack with the answer. This works well in that the security team does not have to spend time researching an answer that already exists. However, responding is still manual: it demands a security team member's attention and a context switch away from their current work, just to answer a question that has been answered before.



#### Proposed Solution

The security team is building a solution that automatically answers frequently asked questions from service teams. The solution relies on the knowledge base: if the answer to a question exists in the knowledge base, we assume the question has been asked before.

*Note*: Building the knowledge base is currently out of scope for this project; work to mature it is already under way. This project will leverage the KB to automate responses to frequently asked questions that service teams post in Slack messages.

#### STEPS:

1. Build, train and deploy the model from the Hugging Face pretrained model library.

2. Create a knowledge base to fine-tune a pretrained model from Hugging Face.

3. Use the fine-tuned model to generate text responses to questions from customers.

#### AI/ML solution by: Madhur Prashant (Alias: madhurpt, madhurpt@amazon.com)

## Retrieval Augmented Generation (RAG) with LangChain

1. LangChain: Framework for orchestrating the RAG workflow
2. FAISS: In-memory vector database for storing document embeddings
3. PyPDF: Python library for loading and processing the PDF documents

In [2]:
%pip install langchain==0.0.251 --quiet --root-user-action=ignore
%pip install faiss-cpu==1.7.4 --quiet --root-user-action=ignore
%pip install pypdf==3.15.1 --quiet --root-user-action=ignore

Note: you may need to restart the kernel to use updated packages.


### FETCHING AND PROCESSING THE AppSec Team Data

In [3]:
filenames = [
    'Data.pdf',
    'NYC Data.pdf',
]

data_root = "./job_data/"

In [4]:
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

documents = []

for filename in filenames:
    loader = PyPDFLoader(data_root + filename)
    loaded_documents = loader.load()  # Use a variable to store loaded documents
    documents.extend(loaded_documents)  # Extend the list with loaded documents

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
)

docs = text_splitter.split_documents(documents)

print(f'Number of Document Pages: {len(documents)}')
print(f'Number of Document Chunks: {len(docs)}')

Number of Document Pages: 193
Number of Document Chunks: 1070
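
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplification and not the `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as paragraph and line breaks. The function name `split_with_overlap` and the sample string are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

In this simplified scheme, the notebook's settings (`chunk_size=512`, `chunk_overlap=100`) would start each new window 412 characters after the previous one.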


### Now that we have processed the documents, let's embed them in a vector store so that RAG can retrieve the contextually relevant AppSec documents

## Deploying a Model for Embedding: all-MiniLM-L6-v2, and Llama-2-7b-chat for our LLM

In [5]:
!pip install -qU \
    sagemaker \
    pinecone-client==2.2.1 \
    ipywidgets==7.0.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.3 which is incompatible.
jupyterlab 3.4.4 requires jupyter-server~=1.16, but you have jupyter-server 2.7.3 which is incompatible.
jupyterlab-server 2.10.3 requires jupyter-server~=1.4, but you have jupyter-server 2.7.3 which is incompatible.
sagemaker-datawrangler 0.4.3 requires sagemaker-data-insights==0.4.0, but you have sagemaker-data-insights 0.3.3 which is incompatible.
spyder 5.3.3 requires ipython<8.0.0,>=7.31.1, but you have ipython 8.15.0 which is incompatible.
spyder 5.3.3 requires pylint<3.0,>=2.5.0, but you have pylint 3.0.0a7 which is incompatible.
spyder-kernels 2.3.3 requires ipython<8,>=7.31.1; py

In [11]:
!pip install urllib3==1.26.6

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting urllib3==1.26.6
  Downloading urllib3-1.26.6-py2.py3-none-any.whl (138 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.5/138.5 kB 322.2 kB/s eta 0:00:00
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.16
    Uninstalling urllib3-1.26.16:
      Successfully uninstalled urllib3-1.26.16
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.3 which is incompatible.
Successfully installed urllib3-1.26.6

To begin, we will initialize all of the SageMaker session variables we'll need to use throughout the walkthrough.

In [12]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


#### LLaMa chat LLM endpoint: arn:aws:sagemaker:us-east-1:110011534045:endpoint-config/llama-2-generator

## Deploying the model endpoint for Sentence Transformer embedding model

In [13]:
# hub_config = {
#     "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",  # model_id from hf.co/models
#     "HF_TASK": "feature-extraction",
# }

# huggingface_model = HuggingFaceModel(
#     env=hub_config,
#     role=role,
#     transformers_version="4.6",  # transformers version used
#     pytorch_version="1.7",  # pytorch version used
#     py_version="py36",  # python version of the DLC
# )

In [14]:
from sagemaker.jumpstart.model import JumpStartModel

embedding_model_id, embedding_model_version = "huggingface-textembedding-all-MiniLM-L6-v2", "*"
model = JumpStartModel(model_id=embedding_model_id, model_version=embedding_model_version)
embedding_predictor = model.deploy()

-------!

In [15]:
embedding_model_endpoint_name = embedding_predictor.endpoint_name
embedding_model_endpoint_name

'hf-textembedding-all-minilm-l6-v2-2023-09-12-14-03-33-828'

In [16]:
import boto3
aws_region = boto3.Session().region_name

print(aws_region)

us-east-1


## Creating and Populating our Vector Database:

In [17]:
from typing import Dict, List
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
import json

class CustomEmbeddingsContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    
    def transform_input(self, inputs: List[str], model_kwargs: Dict) -> bytes:
        # The JumpStart embedding endpoint expects the texts under "text_inputs"
        input_str = json.dumps({"text_inputs": inputs, **model_kwargs})
        return input_str.encode("utf-8")
    
    def transform_output(self, output: bytes) -> List[List[float]]:
        # `output` is the streaming response body, so it must be read before decoding
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json.get("embedding", [])  # Falls back to [] if the key is missing
        return embeddings  # One vector per input text
    

embeddings_content_handler = CustomEmbeddingsContentHandler()

embeddings = SagemakerEndpointEmbeddings(
    endpoint_name= embedding_model_endpoint_name,
    region_name=aws_region,
    content_handler=embeddings_content_handler,
)
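
The content handler can be exercised without a live endpoint. The sketch below mirrors its two transforms as plain functions and feeds them a fake response body. The `text_inputs` and `embedding` keys come from the handler above; the function names and the 3-dimensional vectors are hypothetical (the real all-MiniLM-L6-v2 model returns 384-dimensional vectors).

```python
import io
import json
from typing import Dict, List


def build_embedding_request(texts: List[str], model_kwargs: Dict) -> bytes:
    """Serialize the request body the same way transform_input does."""
    return json.dumps({"text_inputs": texts, **model_kwargs}).encode("utf-8")


def parse_embedding_response(body) -> List[List[float]]:
    """Read a file-like response body and pull out the 'embedding' field."""
    payload = json.loads(body.read().decode("utf-8"))
    return payload["embedding"]  # raises KeyError if the field is missing


# Hypothetical response: two tiny vectors standing in for real embeddings.
fake_body = io.BytesIO(
    json.dumps({"embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}).encode("utf-8")
)

request = build_embedding_request(["first chunk", "second chunk"], {})
vectors = parse_embedding_response(fake_body)
print(json.loads(request))
print(len(vectors), len(vectors[0]))
```

Indexing the key directly, instead of using `get()` with a default, is a deliberate choice in this sketch: an unexpected response shape fails immediately rather than producing an empty embedding list that breaks later when the index is built.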

Now that we have an embeddings client, we can convert our document chunks into vectors and store them. This project uses the following vector store:

#### FAISS: In-Memory vector database

In [18]:
from langchain.schema import Document

In [19]:
from langchain.vectorstores import FAISS

#### Now, we will build our FAISS index from the document chunks


In [20]:
db = FAISS.from_documents(docs, embeddings)


### NOW, RUNNING VECTOR QUERIES!!

In [23]:
query = "data jobs in NYC?"

In [24]:
results_with_scores = db.similarity_search_with_score(query)

for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}\nScore {score}\n\n")

Content: business
goals.
Responsibilities
Job
Description:
●
Designing
and
implementing
data
pipelines
●
Managing
data
st or age
and
r etrie v al
●
Building
data
war ehouses
and
data
lak es
●
Cr eating
and
maintaining
data
APIs
●
Ensuring
data
quality
and
security
●
Sta ying
up-t o-date
with
emer ging
technologies
K nowledge/experience,
Skills,
Ability
&
A ttitude
●
1-5
y ears’
experience
in
P ython,
SQL
ser v er ,
Data
modeling,
E TL
t ools
or
A WS
Stack
●
Experience
working
with
data
pr ocessing
t ools
and
Score 0.9199874401092529


Content: rule,
or
r egulation.
Data
Engineer
LMI
·
New
York
County,
NY
(Hybrid)
Reposted
1
week
ago
·
363
applicants
●
Full-time
·
Entry
level
●
1,001-5,000
employees
·
Business
Consulting
and
Services
●
4
school
alumni
work
here
●
Skills:
Data
Engineering,
Communication,
+8
more
●
View
verifications
related
to
this
job
post.
●
View
verifications
related
to
this
job
post.
●
Show
all
Apply
Save
Score 0.9341291189193726


Content: options
Data
Infr astructu
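
The scores printed above are distances rather than similarities: to our understanding, LangChain's FAISS wrapper returns an L2 distance by default, so a lower score means a closer match. The sketch below shows the same idea as a brute-force search over toy 2-D vectors. The function names, texts, and vectors are hypothetical, and FAISS performs this search far more efficiently.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k entries closest to query_vec, nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = better match
    return scored[:k]


# Hypothetical index: each text is paired with a made-up 2-D embedding.
toy_index = [
    ("Data Engineer, New York", (1.0, 0.0)),
    ("Product Manager, Boston", (0.0, 1.0)),
    ("Data Analyst, New York", (0.8, 0.2)),
]

results = search((1.0, 0.0), toy_index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```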

## PROMPT ENGINEERING FOR CUSTOM DATA

In [25]:
from langchain.prompts import PromptTemplate

prompt_template = """
<s>[INST] <<SYS>>
Use the context provided below to answer the question at the end. If you don't know the answer, please state that you don't know and do not attempt to make up an answer.
<</SYS>>

Context:
----------------
{context}
----------------

Question: {question} [/INST]
"""

PROMPT = PromptTemplate(
    template = prompt_template, 
    input_variables=["context", "question"]
)

#### Now that we have defined what our prompt template is going to look like, we will create and prepare our LLM

## PREPARING OUR CUSTOM LLM

In [26]:
from typing import Dict

from langchain import SagemakerEndpoint, PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
import json

class QAContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    
    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        input_str = json.dumps(
            {"inputs" : [
                [
                    {
                        "role": "system", 
                        "content": ""
                    },
                    {
                        "role": "user", 
                        "content": prompt
                    }
                ]], 
             "parameters": {**model_kwargs}
            })
        return input_str.encode('utf-8')
    
    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]
    
qa_content_handler = QAContentHandler()
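
As a sanity check, the request and response shapes used by `QAContentHandler` can be reproduced without calling the endpoint. This sketch copies the payload structure from the handler above; the function names and the fake response (including its `role` field) are assumptions for illustration.

```python
import io
import json


def build_chat_request(prompt: str, model_kwargs: dict) -> bytes:
    """Wrap the prompt in a one-dialog chat payload, as transform_input does."""
    dialog = [
        {"role": "system", "content": ""},
        {"role": "user", "content": prompt},
    ]
    body = {"inputs": [dialog], "parameters": dict(model_kwargs)}
    return json.dumps(body).encode("utf-8")


def parse_chat_response(body) -> str:
    """Extract the generated text for the first (and only) dialog."""
    payload = json.loads(body.read().decode("utf-8"))
    return payload[0]["generation"]["content"]


# Hypothetical response body shaped like the one transform_output expects.
fake_body = io.BytesIO(
    json.dumps(
        [{"generation": {"role": "assistant", "content": "I don't know."}}]
    ).encode("utf-8")
)

request = json.loads(build_chat_request("Hello", {"max_new_tokens": 64}))
answer = parse_chat_response(fake_body)
print(request["inputs"][0][1])
print(answer)
```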

Now that we have our content handler, we will deploy a SageMaker endpoint for our large language model, which will work together with the embedding model to generate answers.

## SageMaker LLaMa-2-7b-f LLM for our CUSTOM DATASET

In [27]:
# from sagemaker.jumpstart.model import JumpStartModel

llm_model_id, llm_model_version = "meta-textgeneration-llama-2-7b-f", "*"
llm_model = JumpStartModel(model_id=llm_model_id, model_version=llm_model_version)
llm_predictor = llm_model.deploy(
    initial_instance_count=1, instance_type="ml.g5.4xlarge")

-----------------!

In [28]:
llm_model_endpoint_name = llm_predictor.endpoint_name
llm_model_endpoint_name

'meta-textgeneration-llama-2-7b-f-2023-09-12-14-12-21-524'

In [29]:
llm = SagemakerEndpoint(
    endpoint_name=llm_model_endpoint_name, 
    region_name=aws_region, 
    model_kwargs={"max_new_tokens": 1000, "top_p":0.9, "temperature": 1e-11}, 
    endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    content_handler=qa_content_handler
)

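Now we can use our `llm` object to query the model directly. These calls do not go through the vector store, so the answers come from the model's own knowledge rather than from our dataset.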

In [30]:
query = "Hello"
llm.predict(query)

" Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?"

In [31]:
query = "What are data jobs in NYC for new grads?"
llm.predict(query)

' As a new graduate in New York City, there are several data-related job opportunities available to you. Here are some of the most in-demand data jobs in NYC for new grads:\n\n1. Data Analyst: Data analysts are responsible for collecting, organizing, and analyzing data to help organizations make informed decisions. They use tools such as Excel, SQL, and Tableau to analyze data and create visualizations.\n2. Data Scientist: Data scientists are responsible for developing and implementing machine learning models to solve complex problems. They use programming languages such as Python and R to analyze and interpret large datasets.\n3. Data Engineer: Data engineers are responsible for designing, building, and maintaining large-scale data systems. They use technologies such as Hadoop, Spark, and AWS to process and store large datasets.\n4. Business Intelligence Analyst: Business intelligence analysts are responsible for analyzing and interpreting data to help organizations make informed deci

## Not a bad answer, but we will create a LangChain chain using the RetrievalQA chain, which will:

1. Take a query as input
2. Generate query embeddings
3. Query the vector database for relevant chunks from the knowledge you supply
4. Inject the context and original query into the prompt template
5. Invoke the LLM with the completed prompt
6. Return the LLM response/completion:

In [33]:
qa_chain = RetrievalQA.from_chain_type(
    llm, 
    chain_type = 'stuff',
    retriever=db.as_retriever(), 
    return_source_documents=True, 
    chain_type_kwargs={"prompt":PROMPT}
)
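
The numbered steps above can be sketched in plain Python to show what a "stuff"-style chain does conceptually. This is not LangChain's implementation: word overlap stands in for embedding similarity, `echo_llm` stands in for the SageMaker endpoint, and all names and the shortened template are hypothetical.

```python
from typing import Callable, List

TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"


def word_overlap(query: str, chunk: str) -> int:
    """Toy relevance score: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: List[str], k: int) -> List[str]:
    """Return the k chunks with the highest toy relevance score."""
    ranked = sorted(chunks, key=lambda c: word_overlap(query, c), reverse=True)
    return ranked[:k]


def answer(query: str, chunks: List[str], llm: Callable[[str], str], k: int = 2) -> dict:
    """Retrieve context, stuff it into one prompt, and call the model once."""
    sources = retrieve(query, chunks, k)
    prompt = TEMPLATE.format(context="\n\n".join(sources), question=query)
    return {"query": query, "result": llm(prompt), "source_documents": sources}


def echo_llm(prompt: str) -> str:
    """Stand-in for the real model: reports how much prompt it received."""
    return f"(model saw {len(prompt)} characters of prompt)"


kb = [
    "data engineer role in new york",
    "office dog policy",
    "data analyst role in boston",
]
out = answer("data jobs in new york", kb, echo_llm)
print(out["result"])
print(out["source_documents"])
```

Because every retrieved chunk is placed in a single prompt, the number of chunks and their size together determine how much of the model's context window the prompt consumes.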

### Now that our chain has been created, we can supply queries to it and generate responses based on our source documents

In [34]:
query = "What are data jobs in NYC for new grads?"
result = qa_chain({"query": query})

print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

Query: What are data jobs in NYC for new grads?

Result:  Based on the provided context, there are several data jobs in NYC that are suitable for new grads. Here are some of the job postings that caught my attention:

1. Data Engineer at Fortune - New York, NY (On-site): This job posting is for a Data Engineer role at Fortune, a leading media company. The job requires 1-5 years of experience in Python, SQL server, data modeling, and ETL tools. The salary range is $90,000-$110,000 per year.
2. Data Scientist at Hewlett Packard Enterprise - Andover, MA (On-site): This job posting is for a Data Scientist role at Hewlett Packard Enterprise, a leading technology company. The job requires a graduate degree in a related field and 0-3 years of experience in data science. The salary range is $56,900-$130,600 per year.
3. Data Engineer at Data Science Graduate - New York, NY (On-site): This job posting is for a Data Engineer role at Data Science Graduate, a leading data science and machine learn

In [35]:
query = "What are jobs paying above 110000 dollars in NYC?"
result = qa_chain({"query": query})

print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

Query: What are jobs paying above 110000 dollars in NYC?

Result:  Based on the information provided in the context, jobs paying above $110,000 in New York City (NYC) are:

1. US Zone 2: $97,000 - $114,000
2. US Zone 3: $86,000 - $101,000

These salary ranges are for the successful applicant, and individual salaries within those ranges are determined through a wide variety of factors, including education, experience, knowledge, skills, and geography.

Context Documents: 
page_content='What\nWe\nOffer\nWe\noffer\na\ncomprehensive\ncompensation\nand\nbenefits\npackage\nwhere\nyou’ll\nbe\nrewarded\nbased\non\nyour\nperformance\nand\nrecognized\nfor\nthe\nvalue\nyou\nbring\nto\nthe\nbusiness.\nThe\nsalary\nrange\nfor\nthis\njob\nin\nmost\ngeographic\nlocations\nin\nthe\nUS\nis\n$126,300\nto\n$231,600.\nThe\nsalary\nrange\nfor\nNew\nYork\nCity\nMetro\nArea,\nWashington\nState\nand\nCalifornia\n(excluding\nSacramento)\nis\n$151,600\nto\n$263,100.\nIndividual\nsalaries\nwithin\nthose\nranges\

In [38]:
query = "Can you list all job postings in NYC?"
result = qa_chain({"query": query})

print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

Query: Can you list all job postings in NYC?

Result:  Based on the provided context, I can see that there are two job postings mentioned:

1. Product Manager - Data Walker & Dunlop (Hybrid) in New York, United States.
2. Experienced Associate, Data & Analytics at {:companyName} (New Grad-Rochester, NY).

Both of these job postings are located in New York City.

Context Documents: 
page_content='unlimited\npaid\ntime\noff,\ncompany\nholida ys\nand\nr echar ge\nda ys,\ncommuter\nbeneﬁts,\nlif estyle\nstipends,\nlearning\nand\nde v elopment\nstipends,\npatr onage,\npar enta\nProduct\nManager\n-\nData\nWalker\n&\nDunlop\n·\nNew\nYork,\nUnited\nStates\n(Hybrid)\nReposted\n2\nweeks\nago\n·\n165\napplicants\n●\nFull-time\n·\nMid-Senior\nlevel\n●\n1,001-5,000\nemployees\n·\nFinancial\nServices\n●\n3\nschool\nalumni\nwork\nhere\n●\n6\nof\n10\nskills\nmatch\nyour\nprofile\n-\nyou\nmay\nbe\na\ngood\nfit\n●\nView\nverifications\nrelated\nto\nthis\njob\npost.\n●\nView\nverifications' metadata={'so

## CLEAN UP YOUR ENDPOINT!

In [None]:
# sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# sagemaker_client.delete_endpoint(EndpointName=embedding_model_endpoint_name)
# sagemaker_client.delete_endpoint(EndpointName=llm_model_endpoint_name)
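
A small helper can make the cleanup above harder to forget or to leave half-done. The sketch below takes the client as a parameter, so it can be dry-run with a stand-in before being pointed at a real `boto3.client('sagemaker')`. `RecordingClient`, `delete_endpoints`, and the endpoint names are hypothetical; only `delete_endpoint(EndpointName=...)` is taken from the commented-out cell.

```python
from typing import Iterable, List


class RecordingClient:
    """Hypothetical stand-in for a SageMaker client; records calls instead of making them."""

    def __init__(self) -> None:
        self.deleted: List[str] = []

    def delete_endpoint(self, EndpointName: str) -> None:
        self.deleted.append(EndpointName)


def delete_endpoints(client, endpoint_names: Iterable[str]) -> List[str]:
    """Try to delete every endpoint; return the names that could not be deleted."""
    failed = []
    for name in endpoint_names:
        try:
            client.delete_endpoint(EndpointName=name)
        except Exception as exc:  # keep going so one failure does not leave the rest running
            print(f"Could not delete {name}: {exc}")
            failed.append(name)
    return failed


client = RecordingClient()
failed = delete_endpoints(client, ["embedding-endpoint", "llm-endpoint"])
print(client.deleted, failed)
```

In the notebook, the same function would be called with the real client and the `embedding_model_endpoint_name` and `llm_model_endpoint_name` variables.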