**Group members**

- 22280003 - Phạm Bá Hoàng Anh 

- 22280004 - Trương Bình Ba

- 22280013 - Phạm Lê Hồng Đức 

- 22280016 - Bạch Ngọc Lê Duy

In [51]:
#!pip install -q -U google-generativeai
#!pip install langdetect
#!pip install pycountry
#!pip install beautifulsoup4
#!pip install sentence-transformers
#!pip install faiss-cpu
#!pip install nltk

In [52]:
import requests
from bs4 import BeautifulSoup

import time
import random
import numpy as np

from langdetect import detect
import pycountry
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import google.generativeai as genai
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ThinkPad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Question 1: LLM integration (Score: 30%)**  
The task involves building an AI capable of language translation.  

- **1.1 Single Text Translation: (Score: 15%)**  
Write Python code that uses the OpenAI API to translate a given text into Vietnamese.  
(Check first whether the text is already in the destination language.)  
For example, translating "Hello" into Vietnamese should return "Xin chào",  
but "Xin chào" should be returned unchanged.

- **1.2 Multiple Texts Translation: (Score: 15%)**
Similar to 1.1, but the input is a list of texts.  
The Python code should accept a list of strings  
and return their translations in the specified language.  
For instance, translating ["Hello", "I am John", "Tôi là sinh viên"] into Vietnamese  
should return ["Xin chào", "Tôi tên là John", "Tôi là sinh viên"].
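The "check the language first, translate only if needed" rule from 1.1 can be sketched independently of any particular detector or LLM. The snippet below is a minimal sketch, not the notebook's implementation: `translate_if_needed`, `fake_detect`, and `fake_translate` are hypothetical names, and the two stand-in functions only imitate what a real language detector and a real translation call would return. It also shows one possible way to handle detection failures on very short or symbol-only input, by falling back to translation.

```python
from typing import Callable, Optional


def translate_if_needed(
    text: str,
    dest_language: str,
    detect_fn: Callable[[str], str],
    translate_fn: Callable[[str, str], str],
) -> str:
    """Return text unchanged if it is already in dest_language, else translate it."""
    if not text.strip():
        return text  # nothing to detect or translate
    current: Optional[str]
    try:
        current = detect_fn(text)
    except Exception:
        # Detection can fail on very short or symbol-only input;
        # in that case we fall back to asking the translator.
        current = None
    if current == dest_language:
        return text
    return translate_fn(text, dest_language).strip()


# Stand-ins for illustration only (assumptions, not real library calls).
def fake_detect(text: str) -> str:
    if text == "???":
        raise ValueError("no features in text")
    return "vi" if text == "Xin chào" else "en"


def fake_translate(text: str, dest: str) -> str:
    return {"Hello": "Xin chào "}.get(text, text)


print(translate_if_needed("Hello", "vi", fake_detect, fake_translate))
print(translate_if_needed("Xin chào", "vi", fake_detect, fake_translate))
```

Passing the detector and translator as arguments keeps the decision logic testable without network access or an API key.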

In [53]:
# Translate a single text
def translate_single_text(json_input):
    text = json_input['text']
    dest_language = json_input['dest_language']

    # Detect the language (use langdetect to check whether the text is already in the destination language)
    current_language = detect(text)
    if current_language == dest_language:
        return text

    # pycountry: convert dest_language from its ISO 639-1 code to the language name so the request is unambiguous
    dest_language_name = pycountry.languages.get(alpha_2=dest_language).name
    genai.configure(api_key='')  
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(f"Dịch '{text}' sang ngôn ngữ {dest_language_name} mà không thêm thông tin khác kể cả dấu chấm")
    return response.text.strip()  # strip(): remove leading and trailing whitespace

In [54]:
# Translate a list of texts
def translate_multiple_texts(json_input):
    texts = json_input['text']
    dest_language = json_input['dest_language']
    translations = []

    for text in texts:
        # Reuse the single-text function for each item in the list
        translation = translate_single_text({'text': text, 'dest_language': dest_language})
        translations.append(translation)

    return translations
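Because the list version calls the single-text function once per item, repeated strings trigger repeated API requests. A minimal sketch of one way to avoid that, assuming a per-call cache is acceptable: `translate_many` and `fake_translate_one` are hypothetical names, and the stand-in translator simply upper-cases its input so the example runs without an API key.

```python
from typing import Callable, Dict, List


def translate_many(texts: List[str], translate_one: Callable[[str], str]) -> List[str]:
    """Translate each text, calling translate_one only once per distinct string."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    for text in texts:
        if text not in cache:
            cache[text] = translate_one(text)
        results.append(cache[text])
    return results


# Stand-in translator for illustration: records its calls and upper-cases the input.
calls: List[str] = []


def fake_translate_one(text: str) -> str:
    calls.append(text)
    return text.upper()


out = translate_many(["Hello", "I am John", "Hello"], fake_translate_one)
print(out)
print(f"translator was called {len(calls)} times for {len(out)} inputs")
```

The output order matches the input order, so the function can be swapped in wherever a plain loop was used.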

In [55]:
# Input data
json_1 = {
    'text': 'Hello',
    'dest_language': 'vi'
}
json_2 = {
    'text': ['Hello', 'I am John','Tôi là sinh viên'],
    'dest_language': 'vi'
}

In [56]:
# Check the results
translate_single_text(json_1)

'Xin chào'

In [57]:
translate_multiple_texts(json_2)

['Xin chào', 'Tôi là John', 'Tôi là sinh viên']

**Question 2: Chatbot Development (Score: 70%)**  
Assignment Test: Chatbot Development from Website Data. The data is at https://www.presight.io/privacy-policy.html

- **2.1 Data Access and Indexing (Score: 40%)**  
Tasked with creating a chatbot, begin by web crawling the specified website to gather relevant data,  
then preprocess and structure this data into a searchable index, ready for query retrieval.  
(Short version: crawling then embedding data, you can use selenium or requests)  
 
- **2.2 Chatbot Development (Score: 30%)**  
Develop a chatbot that employs natural language processing to comprehend user questions,  
searches the indexed data from 2.1 for the best match,  
and delivers the most accurate response drawn from the website's information.  
(Use any distance/similarity metrics to get the best match paragraph then feed to LLM to get answer)
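The "best match by similarity" step in 2.2 reduces to scoring every indexed vector against the query vector and keeping the highest scores. The following is a minimal pure-Python sketch of cosine-similarity retrieval on toy two-dimensional vectors; `cosine` and `top_k` are hypothetical helper names, and real embeddings would have hundreds of dimensions and would normally be handled with a numerical library.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three documents and one query.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], docs, k=2)
for index, score in hits:
    print(f"doc {index}: {score:.4f}")
```

The indices returned by `top_k` are what would be used to look up the original paragraphs before passing them to the LLM as context.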

In [58]:
class Chatbot:
    def __init__(self, api_key, url):
        self.url = url
        self.data = []
        self.crawl_data()
        self.sentence_transformer_model = SentenceTransformer('all-mpnet-base-v2')
        self.embeddings = self.sentence_transformer_model.encode([item["cleaned"] for item in self.data])
        genai.configure(api_key=api_key)
        self.gemini_model = genai.GenerativeModel("gemini-1.5-flash")

    def crawl_data(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.content, "html.parser")

        # Tags to extract
        tags_to_extract = ["h2", "i", "p", "li"]
        segments = []
        current_segment = []

        for element in soup.find_all(tags_to_extract):
            if element.name in ["h2", "i"]:  
                if current_segment: 
                    segments.append(current_segment)
                current_segment = [element.text.strip()]  
            elif element.name == "li":  
                current_segment.append(element.text.strip().rstrip(".")  + ';')
            else:
                current_segment.append(element.text.strip())
        # Append the last segment if it has any content
        if current_segment:
            segments.append(current_segment)

        segments[0][0] = segments[1][0] + " - " + segments[2][0]
        segments[1][0], segments[2][0] = segments[2][1].split(". ")
        segments[5][0] = segments[4][0] + " - " + segments[5][0]  
        segments[6][0] = segments[4][0] + " - " + segments[6][0]
        segments[10][0] = segments[9][0] + " - " + segments[10][0]  
        segments[11][0] = segments[9][0] + " - " + segments[11][0]
        del segments[23][2:]
        del segments[9]
        del segments[4]
        del segments[2][1]
        del segments[0][1]

        

        # Join each segment, store it with its cleaned form, and print the extracted content
        for idx, segment in enumerate(segments):
            if idx > 2:
                segments[idx][0] = segments[idx][0].upper() + ":"
            segments[idx] = " ".join(segment).strip()
            
            self.data.append({
                "original": segments[idx],
                "cleaned": self.clean(segments[idx])
            })
            print(f"Segment {idx}:")
            print("- " + "\n- ".join(segment))
            print("Original:", self.data[-1]["original"])
            print("Cleaned :", self.data[-1]["cleaned"])
            print("-" * 50)

    def clean(self, text):
        """
        Clean a text segment before embedding.
        Args:
            text (str): Text to clean.
        Returns:
            str: The cleaned text.
        """
        # text = re.sub(r'http\S+|www\S+', '', text)    # Remove URLs
        # text = re.sub(r'\d+', '', text)  # Remove digits
        text = text.lower()  # Lowercase the text
        text = re.sub(r'\S+@\S+\.\S+', 'email', text)  # Replace email addresses with the token "email"
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters (whitespace is kept)
        # Remove stop words
        stop_words = set(stopwords.words('english'))
        # Pronouns that could be kept (currently disabled)
        '''pronouns_to_keep = {
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you're", "you've", "you'll", "you'd", "your",
        "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "she's", "her", "hers", "herself", "it",
        "it's", "its", "itself", "they", "them", "their", "theirs", "themselves"}'''
        # Exclude the pronouns from the stop-word list (currently disabled)
        stop_words = stop_words #- pronouns_to_keep
        text = ' '.join(word for word in text.split() if word not in stop_words)
        text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace

        return text

    def search_query(self, query, top_k):
        """
        Find the text segments most relevant to a question.

        Args:
            query (str): The user's question.
            top_k (int): Number of results to return.
        Returns:
            contexts: The most relevant original text segments.
            scores: The similarity score of each returned segment.
        """
        # Clean the question
        query_cleaned = self.clean(query)

        # Embed the cleaned question
        query_embedding = self.sentence_transformer_model.encode([query_cleaned])

        # Compute cosine similarity between the question and the segment embeddings
        similarities = cosine_similarity(query_embedding, self.embeddings)

        # Indices of the segments with the highest similarity scores
        top_indices = np.argsort(similarities[0])[::-1][:top_k]

        # Collect the original texts and their similarity scores
        contexts = [self.data[idx]["original"] for idx in top_indices]
        scores = [similarities[0][idx] for idx in top_indices]
        return contexts, scores
    
    # Gemini 1.5 Flash integration
    def generate_answer_with_gemini(self, context, question):
        prompt = (
            f"You are a highly knowledgeable and detail-oriented assistant. "
            f"Begin your answer by addressing the general concept or definition of the question directly. "
            f"Then, if necessary, use the provided context to elaborate or refine your answer. "
            f"Ensure your response is accurate, detailed, and concise. "
            f"Do not include irrelevant information or assumptions not supported by the context.\n\n"
            f"Question:\n{question}\n\n"
            f"Context (if relevant):\n{context}\n\n"
            f"Answer:"
        )

        try:
            response = self.gemini_model.generate_content(prompt)
            return response.text
        except Exception as e:
            return f"Error: {e}"

    # Handle a request
    def make_request(self, question):
        # Retrieve the relevant segments
        start = time.time()
        results, scores = self.search_query(question, 3)
        context = "\n".join([f"\t- (Score: {score:.4f}) {item}" for item, score in zip(results, scores)])
        answer = self.generate_answer_with_gemini(context, question)
        end = time.time()
        # Debug: print the retrieved context and the answer
        print(f"Context : \n{context}")
        print(f"Chatbot : {answer.rstrip()}")
        print(f"Running time: {end - start:.2f} seconds")
        # return answer
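The crawling step groups the page into segments that each start at a heading and collect the paragraphs and list items that follow. The snippet below is a simplified, standard-library-only sketch of that grouping idea on an inline HTML string. It is not the notebook's BeautifulSoup code: `SegmentParser` is a hypothetical class, only `h2` is treated as a heading here, and nested block tags (for example a `p` inside an `li`) are not handled.

```python
from html.parser import HTMLParser
from typing import List, Optional


class SegmentParser(HTMLParser):
    """Group text into segments: each h2 starts a segment, p/li text is appended to it."""

    HEADING_TAGS = {"h2"}
    BODY_TAGS = {"p", "li"}

    def __init__(self) -> None:
        super().__init__()
        self.segments: List[List[str]] = []
        self._tag: Optional[str] = None  # tag whose text is currently being collected
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS | self.BODY_TAGS:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())  # normalise whitespace
        self._tag = None
        if not text:
            return
        if tag in self.HEADING_TAGS or not self.segments:
            self.segments.append([text])      # a heading opens a new segment
        else:
            self.segments[-1].append(text)    # body text joins the current segment


html = (
    "<h2>Cookies</h2><p>We use <b>cookies</b>.</p>"
    "<ul><li>Session cookies</li><li>Preference cookies</li></ul>"
    "<h2>Contact Us</h2><p>Email us.</p>"
)
parser = SegmentParser()
parser.feed(html)
for segment in parser.segments:
    print(" | ".join(segment))
```

Keeping each heading together with its body text means every indexed segment carries its own section title, which tends to help retrieval when a question mentions that title.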
    

In [60]:
url = "https://www.presight.io/privacy-policy.html"
api_key = ''
chatbot = Chatbot(api_key, url)

Segment 0:
- PRIVACY POLICY - Last updated 15 Sep 2023
Original: PRIVACY POLICY - Last updated 15 Sep 2023
Cleaned : privacy policy last updated 15 sep 2023
--------------------------------------------------
Segment 1:
- At Presight, we are committed to protecting the privacy of our customers and visitors to our website
Original: At Presight, we are committed to protecting the privacy of our customers and visitors to our website
Cleaned : presight committed protecting privacy customers visitors website
--------------------------------------------------
Segment 2:
- This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.
Original: This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.
Cleaned : privacy policy explains collect use disclose information customers visitors
--------------------------------------------------
Segment 3:
- INFORMATION COLLECTION AND USE:
- We collect sever

In [64]:
questions = [
    {"section": "UNRELATED QUESTION", 
     "question": "What is policy?"},

    {"section": "PRIVACY POLICY - Last updated 15 Sep 2023", 
     "question": "When was the privacy policy last updated?"},
    
    {"section": "Commitment to Privacy", 
     "question": "What does Presight commit to protecting for its customers and visitors?"},
    
    {"section": "Privacy Policy Explanation", 
     "question": "What does this Privacy Policy explain regarding customers and visitors?"},
    
    {"section": "INFORMATION COLLECTION AND USE", 
     "question": "Why does Presight collect different types of information?"},
    
    {"section": "TYPES OF DATA COLLECTED - PERSONAL DATA", 
     "question": "What types of personally identifiable information does Presight collect?"},
    
    {"section": "TYPES OF DATA COLLECTED - USAGE DATA", 
     "question": "What is included in the usage data collected by Presight?"},
    
    {"section": "USE OF DATA", 
     "question": "For what purposes does Presight use the collected data?"},
    
    {"section": "CONSENT", 
     "question": "How does Presight ensure that personal information submitted is correct?"},
    
    {"section": "ACCESS TO PERSONAL INFORMATION - ACCESSING YOUR PERSONAL INFORMATION", 
     "question": "How can users access and update their personal information held by Presight?"},
    
    {"section": "ACCESS TO PERSONAL INFORMATION - AUTOMATED EDIT CHECKS", 
     "question": "What is the purpose of automated edit checks in collecting personal information?"},
    
    {"section": "DISCLOSURE OF INFORMATION", 
     "question": "Under what circumstances might Presight disclose application data to third parties?"},
    
    {"section": "SHARING OF PERSONAL DATA", 
     "question": "Does Presight share personal data with third parties or AI models?"},
    
    {"section": "GOOGLE USER DATA AND GOOGLE WORKSPACE APIS", 
     "question": "What restrictions does Presight place on the use of Google User Data and Google Workspace APIs?"},
    
    {"section": "DATA SECURITY", 
     "question": "What encryption and security measures does Presight use to protect customer data?"},
    
    {"section": "DATA RETENTION & DISPOSAL", 
     "question": "How long does Presight retain customer data after account closure?"},
    
    {"section": "QUALITY, INCLUDING DATA SUBJECTS' RESPONSIBILITIES FOR QUALITY", 
     "question": "What responsibilities do users have in maintaining the accuracy of their personal data?"},
    
    {"section": "MONITORING AND ENFORCEMENT", 
     "question": "What actions does Presight take to monitor data compliance and handle data breaches?"},
    
    {"section": "COOKIES", 
     "question": "How can users manage cookies on Presight’s website?"},
    
    {"section": "THIRD-PARTY WEBSITES", 
     "question": "What is Presight’s responsibility regarding third-party websites?"},
    
    {"section": "CHANGES TO PRIVACY POLICY", 
     "question": "Where will updates to the Privacy Policy be posted?"},
    
    {"section": "CONTACT US", 
     "question": "What is the email address provided for contacting Presight if I have questions about the Privacy Policy?"},
    
    {"section": "PURPOSEFUL USE ONLY", 
     "question": "For what purposes does Presight commit to using personal information?"}
]

# === Automated question loop ===
for item in questions:
    delay = random.uniform(0, 2)
    time.sleep(delay)
    print("\n===== Debug Information =====")
    print(f"Section: {item['section']}")  
    print(f"Question: {item['question']}")
    chatbot.make_request(item['question'])


===== Debug Information =====
Section: UNRELATED QUESTION
Question: What is policy?
Context : 
	- (Score: 0.5053) PRIVACY POLICY - Last updated 15 Sep 2023
	- (Score: 0.3356) CHANGES TO PRIVACY POLICY: We may update this Privacy Policy from time to time. The updated Privacy Policy will be posted on our website.
	- (Score: 0.3030) PURPOSEFUL USE ONLY: We commit to only use personal information for the purposes identified in the entity's privacy policy.
Chatbot : Policy is a course or principle of action adopted or proposed by a government, party, business, or individual.  The provided text exemplifies this definition through the mention of a "PRIVACY POLICY," outlining principles for the handling of personal information.  The context further clarifies that a policy can be updated and that adherence to its stated purposes is crucial.
Running time: 0.96 seconds

===== Debug Information =====
Section: PRIVACY POLICY - Last updated 15 Sep 2023
Question: When was the privacy policy last upda

**Chatbot evaluation:** The chatbot answers the test questions accurately. The more keywords a question shares with its corresponding section, the more likely the chatbot is to return a correct and relevant answer.
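One way to put a number on this observation is a retrieval hit rate: for each labelled question, check whether any of the top-k retrieved segments starts with the expected section title. The sketch below is a hypothetical harness, not part of the notebook. `hit_rate_at_k` and `toy_retrieve` are assumed names, the toy retriever ranks by plain word overlap instead of embeddings, and the check only works for labels that match the start of a segment's text.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    labelled_questions: List[Dict[str, str]],
    retrieve: Callable[[str, int], List[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected section appears among the top-k segments."""
    if not labelled_questions:
        return 0.0
    hits = 0
    for item in labelled_questions:
        retrieved = retrieve(item["question"], k)
        expected = item["section"].upper()
        if any(segment.upper().startswith(expected) for segment in retrieved):
            hits += 1
    return hits / len(labelled_questions)


# Toy data and a toy word-overlap retriever, for illustration only.
toy_segments = [
    "COOKIES: You can configure your browser to refuse cookies.",
    "CONTACT US: Reach us by email with any questions.",
]


def toy_retrieve(question: str, k: int) -> List[str]:
    q_words = set(question.lower().split())
    ranked = sorted(
        toy_segments,
        key=lambda seg: len(q_words & set(seg.lower().split())),
        reverse=True,
    )
    return ranked[:k]


toy_questions = [
    {"section": "COOKIES", "question": "how do i refuse cookies"},
    {"section": "CONTACT US", "question": "what email should we contact"},
]
score = hit_rate_at_k(toy_questions, toy_retrieve, k=1)
print(f"hit rate@1: {score:.2f}")
```

With the real chatbot, a thin wrapper that returns only the contexts from `search_query` could play the role of `retrieve`.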

In [50]:
# Run the chatbot (disabled: wrapped in a string so the cell does not block on input)
'''print("Chatbot is ready! Type 'exit' or 'quit' to end the session.")
while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Chatbot: Goodbye!")
        break
    chatbot.make_request(user_input)'''
