# Final Assignment - AI Engineer Assignment

# Requirements:

Use Gemini API instead of OpenAI API

Source:

https://aistudio.google.com/

https://ai.google.dev/pricing#1_5flash

The assignment includes two questions:

## Question 1: LLM integration (Score: 30%)
The task involves building an AI capable of language translation.

**1.1 Single Text Translation:** (Score: 15%)

Write Python code using the OpenAI API (replaced here by the Gemini API, per the requirements above) to translate a given text into Vietnamese. Check first whether the text is already in the destination language.

For example, translating "Hello" into Vietnamese should return "Xin chào", but "Xin chào" should be returned unchanged.

**1.2 Multiple Texts Translation:** (Score: 15%)

Similar to 1.1, but the input is a list of texts. The Python code should accept a list of strings and return their translations in the specified language. For instance, translating ["Hello", "I am John", "Tôi là sinh viên"] into Vietnamese should return ["Xin chào", "Tôi tên là John", "Tôi là sinh viên"].
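As an alternative to sending one request per string, the list could be translated in a single request. The sketch below is only an illustration of that idea: `build_batch_prompt` and `parse_batch_response` are hypothetical helper names, the numbered-line reply format is an assumption about how the model would answer, and the sample reply is written by hand rather than produced by the API. It also assumes that no input text contains a line break.

```python
import re


def build_batch_prompt(texts, target_language="Vietnamese"):
    """Build one prompt that asks for a numbered translation of every input."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(texts, start=1))
    return (
        f"Translate each numbered item into {target_language}. "
        f"If an item is already in {target_language}, return it unchanged. "
        "Reply with the same numbering, one item per line, and nothing else.\n"
        f"{numbered}\n"
    )


def parse_batch_response(response_text, expected_count):
    """Parse 'N. text' lines back into a list and check the numbering."""
    items = {}
    for line in response_text.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.*\S)\s*$", line)
        if match:
            items[int(match.group(1))] = match.group(2)
    if sorted(items) != list(range(1, expected_count + 1)):
        raise ValueError("Response numbering does not match the input list.")
    return [items[i] for i in range(1, expected_count + 1)]


prompt = build_batch_prompt(["Hello", "I am John", "Tôi là sinh viên"])

# Hypothetical model reply, written by hand for illustration only.
sample_reply = "1. Xin chào\n2. Tôi tên là John\n3. Tôi là sinh viên"
translations = parse_batch_response(sample_reply, expected_count=3)
print(translations)
```

Checking the numbering before returning guards against a reply that drops or merges items, in which case falling back to one request per string is the safer choice.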

In [5]:
import os
import time

import google.generativeai as genai
import numpy as np
import pandas as pd
from dotenv import load_dotenv

In [6]:
load_dotenv()
# Read the API key from the environment (loaded from the .env file)
API_KEY = os.getenv("API_KEY")
genai.configure(api_key=API_KEY)

In [7]:
class Translator:
    def __init__(self, model_name="gemini-1.5-flash"):
        self.model = genai.GenerativeModel(model_name)

    def make_prompt(self, text):
        prompt = (
            f"Translate the following text into Vietnamese. "
            f"If the sentence to be translated is already in Vietnamese, keep it unchanged. "
            f"Respond only with the translated text, no further explanations or additional information are allowed. "
            f"Input text:\n{text}\n"
        )
        return prompt

    def make_request(self, prompt):
        response = self.model.generate_content(prompt)
        return response.text.strip()

In [8]:
model = Translator()

In [9]:
def translate_single_text(input_json):
    if not isinstance(input_json, str):
        raise TypeError("Input text should be a string.")
    return model.make_request(model.make_prompt(input_json))

In [10]:
def translate_multiple_text(input_json):
    if not isinstance(input_json, list):
        raise TypeError("Input should be a list of texts.")
    results = []
    for text in input_json:
        results.append(model.make_request(model.make_prompt(text)))
    return results

In [11]:
json_1 = "Hello"
json_2 = ["Hello", "I am John", "Tôi là sinh viên"]
json_3 = ["내 선생님은 너무 아름답고 현명해요", "10점을 주시는 선생님을 좋아합니다"]

In [12]:
translate_single_text(json_1)

'Xin chào'

In [13]:
translate_multiple_text(json_2)

['Xin chào', 'Tôi là John', 'Tôi là sinh viên']

In [14]:
translate_multiple_text(json_3)

['Cô giáo của tôi rất xinh đẹp và thông minh.',
 'Em thích những thầy cô cho em điểm 10.']

## Question 2: Chatbot Development (Score: 70%)

### 2.1 Data Access and Indexing (Score: 40%)

Tasked with creating a chatbot, begin by web crawling the specified website to gather relevant data, then preprocess and structure this data into a searchable index, ready for query retrieval.

(Short version: crawl the data, then embed it. You can use Selenium or requests.)


#### Crawling data from website

Data is at https://www.presight.io/privacy-policy.html
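Generated class names such as `css-dhb2ck` can change whenever the site is rebuilt, so it may help to see an extraction that relies only on tag structure. The following is a minimal sketch using the standard library's `html.parser`, not the approach used in this notebook. `SectionParser` is a hypothetical name, and `sample_html` is a made-up fragment, not the real page markup.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Group <p> and <li> text under the most recent <h2>, ignoring CSS classes."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of {"title": str, "content": str}
        self._current_tag = None  # tag whose text is currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p", "li"):
            self._current_tag = tag
            if tag == "h2":
                # Every heading opens a new section
                self.sections.append({"title": "", "content": ""})

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current_tag is None or not self.sections:
            return
        section = self.sections[-1]
        if self._current_tag == "h2":
            section["title"] += text
        else:
            section["content"] = (section["content"] + " " + text).strip()


# Made-up HTML fragment for illustration; the real page differs.
sample_html = """
<div><h2 class="x">Data Security</h2><p>We protect your data.</p>
<ul><li>Encryption</li><li>Access control</li></ul></div>
<div><h2 class="y">Contact</h2><p>Email us.</p></div>
"""

parser = SectionParser()
parser.feed(sample_html)
sections = parser.sections
print(sections)
```

Text that appears before the first heading is dropped in this sketch, so a page introduction would need separate handling.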

In [18]:
import requests
from bs4 import BeautifulSoup

In [19]:
url = "https://www.presight.io/privacy-policy.html"

In [20]:
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
data = []

In [21]:
# page header
page_header = soup.find('h2', {'class': 'chakra-heading css-dhb2ck'}).get_text()
print(page_header)
update_date = soup.find('h2', {'class': 'chakra-heading css-18j379d'}).get_text()
print(update_date)
introduction = soup.find('p', {'class': 'chakra-text css-0'}).get_text()
print(introduction) 

data.append({'title': 'Page About', 'content': page_header})
data.append({'title': 'Update Date', 'content': update_date})
data.append({'title': 'Introduction', 'content': introduction})

PRIVACY POLICY
Last updated 15 Sep 2023
At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.


In [22]:
# sections of page
sections = soup.find_all('div', {'class': 'chakra-stack css-o5l3sd'})

for div in sections:
    section = {}
    section['title'] = div.find('h2', {'class': 'chakra-heading css-18j379d'}).get_text()

    has_sections_1 = div.find('div', {'class' : 'chakra-stack css-bel3sh'})
    has_sections_2 = div.find('div', {'class' : 'chakra-stack css-1rh933v'})
    if has_sections_1:
        sub_sections = has_sections_1.find_all('div', {'class': 'css-0'})
        
        subset_contents = []
        for sec in sub_sections:
            subsec_content = {}
            subsec_content['subtitle'] = sec.find('i', {'class': 'chakra-heading css-9f6g39'}).get_text()
            text = sec.find('p', {'class': 'chakra-text css-0'}).get_text()
            # Append any bullet items to the paragraph text
            for b in sec.find_all('li', {'class': 'css-0'}):
                text = text + ", " + b.get_text()
            subsec_content['content'] = text
            subset_contents.append(subsec_content)
        section['content'] = subset_contents
        
    elif has_sections_2:
        sub_sections = has_sections_2.find_all(['h2', 'p'])
        
        subset_contents = []
        subtitle = ""
        for sec in sub_sections:
            if sec.name == 'h2':
                # Remember the most recent heading as the subtitle
                subtitle = sec.get_text()
            elif sec.name == 'p':
                # Use a fresh dict per paragraph so entries are not shared
                subset_contents.append({'subtitle': subtitle, 'content': sec.get_text()})
        section['content'] = subset_contents

    else:
        # Plain section: one optional paragraph plus optional bullet items
        paragraph = div.find('p', {'class': 'chakra-text css-0'})
        text = paragraph.get_text() if paragraph else ""
        for b in div.find_all('li', {'class': 'css-0'}):
            text = text + ", " + b.get_text()
        section['content'] = text
    data.append(section)

In [23]:
data

[{'title': 'Page About', 'content': 'PRIVACY POLICY'},
 {'title': 'Update Date', 'content': 'Last updated 15 Sep 2023'},
 {'title': 'Introduction',
  'content': 'At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.'},
 {'title': 'Information Collection and Use',
  'content': 'We collect several different types of information for various purposes to provide and improve our Service to you.'},
 {'title': 'Types of Data Collected',
  'content': [{'subtitle': 'Personal Data',
    'content': 'While using our Service, we may ask you to provide us with certain personally identifiable information that can be used to contact or identify you ("Personal Data"). Personally identifiable information may include, but is not limited to:, Email address, First name and last name, Phone number, Address, State, Province, ZIP/Postal code, City, Cookies

#### Preprocessing and Structuring Data

https://ai.google.dev/gemini-api/docs/embeddings

In [26]:
preprocessed_data = []

for item in data:
    if isinstance(item['content'], list):
        combined_content = ""
        for subitem in item['content']:
            combined_content += f"-{subitem['subtitle']}: {subitem['content']};"
        restructure_item = {
            'title': item['title'],
            'content': combined_content.strip()
        }
        preprocessed_data.append(restructure_item)
    else:
        preprocessed_data.append(item)

In [27]:
for index, item in enumerate(preprocessed_data):
    print('Index', index, ':')
    print(' Title:', item['title'])
    print(' Content:', item['content'])

Index 0 :
 Title: Page About
 Content: PRIVACY POLICY
Index 1 :
 Title: Update Date
 Content: Last updated 15 Sep 2023
Index 2 :
 Title: Introduction
 Content: At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.
Index 3 :
 Title: Information Collection and Use
 Content: We collect several different types of information for various purposes to provide and improve our Service to you.
Index 4 :
 Title: Types of Data Collected
 Content: -Personal Data: While using our Service, we may ask you to provide us with certain personally identifiable information that can be used to contact or identify you ("Personal Data"). Personally identifiable information may include, but is not limited to:, Email address, First name and last name, Phone number, Address, State, Province, ZIP/Postal code, City, Cookies and Usage Data;-Usage Data: We may al

#### Embedding via the API

In [29]:
embedded_data = []

for item in preprocessed_data:
    result = genai.embed_content(
        model="models/text-embedding-004",
        content=item["content"],
        task_type="retrieval_document",
        title=item["title"]  # Provide the title for better embeddings
    )
    embedded_data.append({
        "title": item["title"],
        "content": item["content"],
        "embedding": result['embedding']
    })

In [30]:
embedded_data = pd.DataFrame(embedded_data)
embedded_data.head(5)

| | title | content | embedding |
|---|---|---|---|
| 0 | Page About | PRIVACY POLICY | [-0.048945945, 0.008555042, -0.03837295, -0.04... |
| 1 | Update Date | Last updated 15 Sep 2023 | [0.040753134, 0.059642516, -0.048014387, -0.04... |
| 2 | Introduction | At Presight, we are committed to protecting th... | [-0.0009376601, -0.0157132, -0.05022917, -0.05... |
| 3 | Information Collection and Use | We collect several different types of informat... | [-0.052642185, -0.011830902, -0.039871644, -0.... |
| 4 | Types of Data Collected | -Personal Data: While using our Service, we ma... | [-0.06256992, -0.010571767, -0.057725344, -0.0... |


### 2.2 Chatbot Development (Score: 30%)

Develop a chatbot that employs natural language processing to comprehend user questions, searches the indexed data from 2.1 for the best match, and delivers the most accurate response drawn from the website's information.

(Use any distance or similarity metric to find the best-matching paragraph, then feed it to the LLM to get the answer.)
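To make the retrieval step concrete, here is a small sketch of ranking documents by cosine similarity, written in plain Python so it runs without any dependencies. The three-dimensional vectors are invented for illustration and are not real embeddings; `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=2):
    """Return the titles of the k documents most similar to the query."""
    scored = [(cosine(query_vec, d["embedding"]), d["title"]) for d in docs]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Toy vectors chosen by hand; real embeddings have hundreds of dimensions.
docs = [
    {"title": "Update Date", "embedding": [1.0, 0.0, 0.0]},
    {"title": "Data Security", "embedding": [0.0, 1.0, 0.0]},
    {"title": "Types of Data Collected", "embedding": [0.0, 0.6, 0.8]},
]
query_vec = [0.0, 0.9, 0.1]

ranked = top_k(query_vec, docs, k=2)
print(ranked)  # ['Data Security', 'Types of Data Collected']
```

Cosine similarity compares direction only, so a long document and a short one with the same orientation in embedding space score equally. The sketch does not guard against zero vectors, which would cause a division by zero.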

https://ai.google.dev/gemini-api/docs/prompting-strategies

In [33]:
def best_match(query, df):
    """
    Embed the query, compute its cosine similarity with every document
    embedding in the dataframe, and return the contents of the five
    most similar documents, best match first.
    """
    query_embedding = genai.embed_content(model="models/text-embedding-004",
                                          content=query,
                                          task_type="retrieval_query")

    # Compute norms of the embeddings
    doc_embeddings = np.stack(df['embedding'])
    doc_norms = np.linalg.norm(doc_embeddings, axis=1)
    query_norm = np.linalg.norm(query_embedding["embedding"])
    
    # Compute cosine similarity
    cosine_similarities = np.dot(doc_embeddings, query_embedding["embedding"]) / (doc_norms * query_norm)
    results = df.iloc[np.argsort(cosine_similarities)[-5:][::-1]]
    
    return results['content'].tolist()  # Contents of the top five matches, best first

In [34]:
class QA_Chatbot:
    def __init__(self, model_name="gemini-1.5-flash"):
        self.model = genai.GenerativeModel(model_name)

    def make_prompt(self, query, context):
        prompt = (
            "Imagine yourself as an expert in customer service, "
            "tasked with providing detailed and accurate answers to customer inquiries about your organization. "
            "Your responses must be based solely on the stored information available, without heavily relying on assumptions. "
            "You are free to rephrase the sentence in your own words, but the factual content of the stored information must remain unchanged. "
            "Do not provide any information which is not directly relevant to the question. "
            "Provide comprehensive answers with thorough (but not too long) explanations unless the user specifically requests a lengthy response. "
            "If a term used in the answer is not commonly understood, include a brief explanation for clarity. "
            "You are not permitted to respond to questions based on assumptions. "
            "The answer must be in the same language as the question. "
            f"Here is the user's question: \"{query}\" "
            f"The related document is as follows: \"{context}\""
        )
        return prompt

    def make_request(self, query, retries=3):
        attempt = 0
        while attempt < retries:
            try:
                context = best_match(query, embedded_data)
                response = self.model.generate_content(self.make_prompt(query, context))
                return response.text.strip()
            except (TimeoutError, ConnectionError) as e:
                attempt += 1
                print(f"Attempt {attempt}/{retries} failed due to {type(e).__name__}: {str(e)}")
                time.sleep(60)
            except Exception as e:
                print(f"An unexpected error occurred: {type(e).__name__}: {str(e)}")
                break
        return "Sorry, an error occurred while processing your request. Please try again later."

In [35]:
model = QA_Chatbot()

In [36]:
start = time.time()
query = "What types of data does Presight collect from users?"
print(f"Answer: {model.make_request(query)}") 
end = time.time()
print(f"Running time: {end-start} seconds")

Answer: Presight collects data to provide and maintain its service, notify users of service changes, enable participation in interactive features (if chosen by the user), provide customer support, improve the service through analysis, monitor service usage, and address technical issues.  The type of data collected is not specified in the provided text.
Running time: 1.5659940242767334 seconds


In [37]:
start = time.time()
query = "when is the latest update?" 
print(f"Answer: {model.make_request(query)}") 
end = time.time()
print(f"Running time: {end-start} seconds")

Answer: The latest update was on September 15, 2023.
Running time: 1.2308411598205566 seconds


In [38]:
start = time.time()
query = "What measures does Presight take to ensure data security?" 
print(f"Answer: {model.make_request(query)}") 
end = time.time()
print(f"Running time: {end-start} seconds")

Answer: Presight employs automated edit checks during data entry to ensure accuracy and data integrity.  These checks verify that information fields are completed correctly.  Additionally, before submitting personal information, users are asked to confirm its accuracy.
Running time: 1.3438348770141602 seconds
