<div style="text-align: center;">
  <h2><b>Ho Chi Minh University of Science</b></h2>
  <h3><b>Mathematics & Computer Science Faculty</b></h3>
  <img src='https://th.bing.com/th/id/R.7298e6a7f00d000e22456c9977085835?rik=DqP9BjBZsGKPxw&pid=ImgRaw&r=0' width="250px">
  <h1><b>COURSE PROJECT: PYTHON FOR DATA SCIENCE</b></h1>
  <ul style="list-style-type: none; padding: 0;">
    <li>
      <table style="margin: 0 auto; border-collapse: collapse; width: 50%;">
        <thead>
          <tr>
            <th style="border: 1px solid black; padding: 8px;">Full Name</th>
            <th style="border: 1px solid black; padding: 8px;">Student ID</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td style="border: 1px solid black; padding: 8px;">Nguyễn Lê Lâm Phúc</td>
            <td style="border: 1px solid black; padding: 8px;">22280066</td>
          </tr>
          <tr>
            <td style="border: 1px solid black; padding: 8px;">Mạc Minh Phúc</td>
            <td style="border: 1px solid black; padding: 8px;"></td>
          </tr>
          <tr>
            <td style="border: 1px solid black; padding: 8px;">Trần Đại Lộc</td>
            <td style="border: 1px solid black; padding: 8px;"></td>
          </tr>
          <tr>
            <td style="border: 1px solid black; padding: 8px;">Nguyễn Lê Đăng Khoa</td>
            <td style="border: 1px solid black; padding: 8px;"></td>
          </tr>
        </tbody>
      </table>
    </li>
  </ul>
</div>


## **Question 1:** LLM Integration

### Import libraries

In [None]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [None]:
import google.generativeai as genai
import os
from typing import Union, List
import dotenv

### Load API key

In [None]:
# Load API key from environment variables
dotenv.load_dotenv()
api_key = os.getenv('GOOGLE_API_KEY')

if not api_key:
    raise ValueError("API key not found. Please set the 'GOOGLE_API_KEY' environment variable.")

genai.configure(api_key=api_key)

### Text translation and language detection

In [None]:
class PromptTemplates:
    """
    A utility class for generating prompts for language detection and translation.
    """

    @staticmethod
    def get_language_detection_prompt(text: str) -> str:
        """
        Generates a prompt to detect the language of the given text.

        Args:
            text (str): The input text for language detection.

        Returns:
            str: A prompt instructing the model to detect the language.
        """
        examples = """
        Example inputs and outputs:
        Text: "Hello world" -> en
        Text: "Bonjour le monde" -> fr
        Text: "Xin chào thế giới" -> vi
        Text: "こんにちは世界" -> ja

        Now detect the language for this text and respond only with the ISO 639-1 code:
        """
        return f"{examples} {text}"

    @staticmethod
    def get_translation_prompt(text: str, target_lang: str) -> str:
        """
        Generates a prompt to translate the given text into a target language.

        Args:
            text (str): The input text to be translated.
            target_lang (str): The ISO 639-1 code of the target language.

        Returns:
            str: A prompt instructing the model to translate the text.
        """
        examples = f"""
        Example translations:
        Input: "Hello", Target: es -> "Hola"
        Input: "Good morning", Target: fr -> "Bonjour"
        Input: "Thank you", Target: vi -> "Cảm ơn"

        Translate this text into {target_lang}:
        """
        return f"{examples} {text}"

class TranslationService:
    """
    A service class for detecting language and translating text using a generative AI model.
    """

    def __init__(self, api_key: str):
        """
        Initializes the TranslationService class, setting up configurations for the generative AI model
        and preparing it for language detection and translation tasks.

        Args:
            api_key (str): The API key required to authenticate and access the generative AI service.
        """

        # Configure the genai library with the provided API key.
        # This step authenticates the client with the generative AI service and enables secure access.
        genai.configure(api_key=api_key)

        # Initialize a generative AI model with the specified name and configuration parameters.
        # This instance will be used to generate responses for language detection and translation.
        self.model = genai.GenerativeModel(
            model_name="gemini-2.0-flash-exp",  # Specifies the version of the AI model to use.
            generation_config={
                "temperature": 1,  # Controls randomness in responses. A higher value increases diversity.
                "top_p": 0.95,  # Implements nucleus sampling. Tokens are chosen from the top 95% cumulative probability.
                "top_k": 40,  # Limits token selection to the top 40 most probable tokens for added coherence.
                "max_output_tokens": 1536,  # Sets the maximum length for generated responses, preventing excessively long outputs.
                "response_mime_type": "text/plain",  # Specifies the output format as plain text.
            },
        )

        # Instantiate the PromptTemplates class, which provides predefined prompts for the generative model.
        # These templates ensure consistent and effective communication with the model for tasks like
        # detecting the language of input text and translating text into a target language.
        self.prompt_templates = PromptTemplates()


    def detect_language(self, text: str) -> str:
        """
        Detects the language of the given text.

        Args:
            text (str): The input text for language detection.

        Returns:
            str: The ISO 639-1 code of the detected language.
        """
        prompt = self.prompt_templates.get_language_detection_prompt(text)
        response = self.model.generate_content(prompt)
        return response.text.strip().lower()

    def translate(self, text: str, target_lang: str) -> str:
        """
        Translates the given text into the target language.

        Args:
            text (str): The input text to be translated.
            target_lang (str): The ISO 639-1 code of the target language.

        Returns:
            str: The translated text. If the source and target languages are the same, returns the input text.
        """
        # Detect the source language
        source_lang = self.detect_language(text)

        # If source and target languages match, return the original text
        if source_lang == target_lang:
            return text

        # Generate a translation prompt and get the translated content
        prompt = self.prompt_templates.get_translation_prompt(text, target_lang)
        response = self.model.generate_content(prompt)
        return response.text.strip()
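
`translate` compares the detected code with `target_lang` by plain string equality, so a model reply such as `"EN"`, `"en-US"` or `"vi."` would not match and the text would be sent for translation anyway. The following is a minimal sketch of a normaliser, not part of the original notebook; the helper name `normalize_lang_code` is hypothetical, and it only checks the shape of the code, not whether it is a registered ISO 639-1 language.

```python
import re
from typing import Optional


def normalize_lang_code(raw: str) -> Optional[str]:
    """Return a lowercase two-letter code taken from a model reply, or None."""
    # Hypothetical helper: take the first run of ASCII letters in the reply.
    match = re.search(r"[A-Za-z]+", raw)
    if not match:
        return None
    code = match.group(0).lower()
    # Keep it only if it has the two-letter shape of an ISO 639-1 code.
    # This does not verify that the code is actually assigned to a language.
    return code if len(code) == 2 else None


print(normalize_lang_code(" EN\n"))
print(normalize_lang_code("en-US"))
print(normalize_lang_code("English"))
```

One possible use would be to apply it to both the detected code and `target_lang` before comparing them, and to skip the shortcut when either value is `None`.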

### Handle text translation

In [None]:
def translate_single_text(input_json: dict) -> str:
    """
    Translates a single text string from the input JSON to the specified destination language.

    Args:
        input_json (dict): A dictionary containing:
            - 'text' (str): The text to be translated.
            - 'dest_language' (str): The target language code for translation.

    Returns:
        str: The translated text.

    Raises:
        ValueError: If the 'text' field is not a string.
    """
    # Extract the 'text' field from the input JSON, defaulting to an empty string if missing.
    text = input_json.get('text', '')

    # Ensure the 'text' field is a string; raise an error if not.
    if not isinstance(text, str):
        raise ValueError("Text must be a string")

    # Extract the destination language code from the input JSON, defaulting to an empty string if missing.
    language = input_json.get('dest_language', '')

    # Create an instance of the translation service using a provided API key.
    translator = TranslationService(api_key)

    # Translate the text to the specified destination language and return the result.
    return translator.translate(text, language)

def translate_multiple_texts(input_json: dict) -> List[str]:
    """
    Translates multiple text strings from the input JSON to the specified destination language.

    Args:
        input_json (dict): A dictionary containing:
            - 'text' (List[str]): A list of text strings to be translated.
            - 'dest_language' (str): The target language code for translation.

    Returns:
        List[str]: A list of translated text strings.

    Raises:
        ValueError: If the 'text' field is not a list of strings.
    """
    # Extract the 'text' field from the input JSON, defaulting to an empty list if missing.
    texts = input_json.get('text', [])

    # Ensure the 'text' field is a list of strings; raise an error if not.
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        raise ValueError("Text must be a list of strings")

    # Extract the destination language code from the input JSON, defaulting to an empty string if missing.
    language = input_json.get('dest_language', '')

    # Create an instance of the translation service using a provided API key.
    translator = TranslationService(api_key)

    # Translate each text string in the list to the specified destination language.
    return [translator.translate(text, language) for text in texts]
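
The two functions above differ only in whether `'text'` is a string or a list of strings. As a rough sketch, they could be folded into one entry point that dispatches on the type. The names `translate_text` and `fake_translate` are assumptions made for this illustration; `fake_translate` stands in for `TranslationService.translate` so the example runs without an API key.

```python
from typing import Callable, List, Union


def translate_text(
    input_json: dict,
    translate_fn: Callable[[str, str], str],
) -> Union[str, List[str]]:
    """Translate 'text' whether it is a single string or a list of strings."""
    text = input_json.get("text", "")
    language = input_json.get("dest_language", "")
    if isinstance(text, str):
        return translate_fn(text, language)
    if isinstance(text, list) and all(isinstance(t, str) for t in text):
        return [translate_fn(t, language) for t in text]
    raise ValueError("Text must be a string or a list of strings")


def fake_translate(text: str, target_lang: str) -> str:
    """Stand-in translator (hypothetical) that only tags the text."""
    return f"[{target_lang}] {text}"


print(translate_text({"text": "Hello", "dest_language": "es"}, fake_translate))
print(translate_text({"text": ["Hello", "Bye"], "dest_language": "vi"}, fake_translate))
```

In the notebook itself, `translate_fn` would presumably be the bound method `TranslationService(api_key).translate`.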


### Example

#### Question 1.1

In [None]:
# Example 1: Translate a single text
json_1 = {
    'text': 'Hello',
    'dest_language': 'es'
}


translate_single_text(json_1)


'Input: "Hello", Target: es -> "Hola"'

#### Question 1.2

In [None]:
# Example 2: Translate multiple texts
json_2 = {
    'text': ['Hello', 'I am Peter', 'How are you?'],
    'dest_language': 'vi'
}
translate_multiple_texts(json_2)

['Input: "Hello", Target: vi -> "Xin chào"',
 'Input: "I am Peter", Target: vi -> "Tôi là Peter"',
 'Input: "How are you?", Target: vi -> "Bạn khỏe không?"']

In [None]:
json_3 = {
    'text':['私はあなたを愛しています'],
    'dest_language':'en'
}
translate_multiple_texts(json_3)

['Input: 私はあなたを愛しています, Target: en -> "I love you"']
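
The outputs above repeat the few-shot pattern (`Input: ..., Target: ... -> "..."`) instead of returning only the translation. One way to recover the bare translation is to post-process the reply. This is a sketch under the assumption that the reply either follows that pattern or is already plain text; `extract_translation` is a hypothetical helper, not part of the original code.

```python
import re


def extract_translation(response_text: str) -> str:
    """Return the quoted text that follows '->', or the stripped reply if the pattern is absent."""
    cleaned = response_text.strip()
    # Match: arrow, optional spaces, an opening quote, the translation, a closing quote at the end.
    match = re.search(r'->\s*"(.*)"\s*$', cleaned, flags=re.DOTALL)
    return match.group(1) if match else cleaned


print(extract_translation('Input: "Hello", Target: es -> "Hola"'))  # Hola
```

Tightening the prompt so the model answers with the translation only would be the more direct fix; the post-processing above is just a fallback.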

## **Question 2:** Chatbot Development

### Import libraries

In [None]:
!pip install bs4



In [None]:
import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
import os
from bs4 import BeautifulSoup
import requests
import dotenv

### Crawling data

In [None]:
# Fetch the privacy policy page and parse its HTML
url = 'https://www.presight.io/privacy-policy.html'
result = requests.get(url)
soup = BeautifulSoup(result.text, 'html.parser')

In [None]:
def parse_elements(elements):
    """
    Parse elements from HTML and structure them into a JSON format.

    Args:
        elements (list): List of HTML elements to parse.

    Returns:
        dict: Parsed data structured as a JSON.
    """
    data = {"PRIVACY POLICY": []}
    prev_element = None
    current_header = None

    # Skip the first three elements (unnecessary part) and parse only the relevant data
    for element in elements[3:]:
        if element.name == 'i':
            current_header = handle_italic_element(element, current_header, prev_element)
        elif element.name == 'h2':
            current_header = handle_header_element(element, prev_element, current_header, data)
        elif element.name in ['p', 'li']:
            handle_content_element(element, current_header)

        prev_element = element

    if current_header:
        data["PRIVACY POLICY"].append(current_header)

    return data

def handle_italic_element(element, current_header, prev_element):
    """
    Handle italic elements (<i> tags) by adding them as subheaders, which gives the JSON a more sensible structure.

    Args:
        element (bs4.element.Tag): The <i> element to process.
        current_header (dict): The current header being processed.
        prev_element (bs4.element.Tag): The previous element.

    Returns:
        dict: Updated current header.
    """
    if current_header:
        current_header["Subheaders"].append({
            "Title": element.text.strip(),
            "Content": ""
        })
    else:
        # No header yet: use the previous element's text as the header title
        current_header = {
            "Title": prev_element.text.strip() if prev_element else "",
            "Content": "",
            "Subheaders": [
                {"Title": element.text.strip(), "Content": ""}
            ]
        }

    # Return the header so a newly created one is not lost by the caller
    return current_header

def handle_header_element(element, prev_element, current_header, data):
    """
    Handle header elements (<h2> tags) and structure them accordingly.

    Args:
        element (bs4.element.Tag): The <h2> element to process.
        prev_element (bs4.element.Tag): The previous element.
        current_header (dict): The current header being processed.
        data (dict): The main data dictionary.

    Returns:
        dict: Updated current header.
    """
    if element.text == "Automated Edit Checks":
        if current_header:
            current_header["Subheaders"].append({
                "Title": element.text.strip(),
                "Content": ""
            })
    elif prev_element and prev_element.name == 'h2':
        if current_header:
            current_header["Subheaders"].append({
                "Title": element.text.strip(),
                "Content": ""
            })
        else:
            current_header = {
                "Title": prev_element.text.strip(),
                "Subheaders": [
                    {"Title": element.text.strip(), "Content": ""}
                ]
            }
    else:
        if current_header:
            data["PRIVACY POLICY"].append(current_header)
        current_header = {"Title": element.text.strip(), "Content": "", "Subheaders": []}

    return current_header

def handle_content_element(element, current_header):
    """
    Handle content elements (<p> and <li> tags) and add content to the appropriate section.

    Args:
        element (bs4.element.Tag): The content element to process.
        current_header (dict): The current header being processed.
    """
    if current_header:
        if current_header["Subheaders"]:
            current_header["Subheaders"][-1]["Content"] += element.text.strip() + ("; " if element.name == 'li' else " ")
        else:
            current_header["Content"] += element.text.strip() + ("; " if element.name == 'li' else " ")

def split_first_element(data):
    """
    Splits the first element in the PRIVACY POLICY into two distinct elements.

    Args:
        data (dict): The original JSON data.

    Returns:
        dict: Updated JSON data with the first element split.
    """
    result = {"PRIVACY POLICY": []}
    for entry in data["PRIVACY POLICY"]:
        if entry["Title"].startswith("Last updated"):
            result["PRIVACY POLICY"].append({
                "Title": "Update date",
                "Content": entry["Title"],
                "Subheaders": []
            })
            result["PRIVACY POLICY"].append({
                "Title": "Commitment",
                "Content": entry["Content"],
                "Subheaders": entry["Subheaders"]
            })
        else:
            result["PRIVACY POLICY"].append(entry)

    return result

elements = soup.find_all(['p', 'h2', 'li', 'i'])
parsed_data = parse_elements(elements)
result = split_first_element(parsed_data)
print(json.dumps(result, indent=4, ensure_ascii=False))



{
    "PRIVACY POLICY": [
        {
            "Title": "Update date",
            "Content": "Last updated 15 Sep 2023",
            "Subheaders": []
        },
        {
            "Title": "Commitment",
            "Content": "At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors. ",
            "Subheaders": []
        },
        {
            "Title": "Information Collection and Use",
            "Content": "We collect several different types of information for various purposes to provide and improve our Service to you. ",
            "Subheaders": []
        },
        {
            "Title": "Types of Data Collected",
            "Content": "",
            "Subheaders": [
                {
                    "Title": "Personal Data",
                    "Content": "While using our Service, we may ask you to provide us with 

In [None]:
# Write the JSON data to a file
output_file = "privacy_policy.json"

with open(output_file, "w", encoding="utf-8") as file:
    json.dump(result, file, indent=4, ensure_ascii=False)

print(f"Data has been written to {output_file}")

Data has been written to privacy_policy.json
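
The saved file nests subheaders inside headers. As a small illustration of how that structure can be walked, the sketch below flattens it into `(title, content)` pairs in document order. `iter_sections` is a hypothetical helper, and `sample` is a hand-written fragment in the same shape as the output above, with placeholder content rather than the real policy text.

```python
from typing import Dict, Iterator, Tuple


def iter_sections(policy: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (title, content) for every header and subheader, in document order."""
    for section in policy.get("PRIVACY POLICY", []):
        yield section.get("Title", ""), section.get("Content", "")
        for sub in section.get("Subheaders", []):
            yield sub.get("Title", ""), sub.get("Content", "")


# Hand-written fragment mirroring the crawled structure (placeholder content).
sample = {
    "PRIVACY POLICY": [
        {"Title": "Update date", "Content": "Last updated 15 Sep 2023", "Subheaders": []},
        {
            "Title": "Types of Data Collected",
            "Content": "",
            "Subheaders": [{"Title": "Personal Data", "Content": "placeholder content"}],
        },
    ]
}

for title, content in iter_sections(sample):
    print(f"{title!r}: {content!r}")
```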


### Embedding data and developing chatbot

#### Load API Key

In [None]:
# Load API key from environment variables
dotenv.load_dotenv()
api_key_2 = os.getenv('CHAT_API_KEY')

if not api_key_2:
    raise ValueError("API key not found. Please set the 'CHAT_API_KEY' environment variable.")

#### Develop chatbot

In [None]:
class PrivacyPolicyAssistant:
    def __init__(self, model_name="all-MiniLM-L6-v2", title_weight=2, thresh_hold=0.5, genai_api_key="", file_path="privacy_policy.json", max_file_size=1048576):
        """
        Initializes the PrivacyPolicyAssistant.

        Args:
            model_name (str): Name of the transformer model for embeddings.
            title_weight (float): Weight for title embeddings in relevance calculations.
            thresh_hold (float): Similarity threshold for relevance.
            genai_api_key (str): API key for the generative AI model.
            file_path (str): Path to the privacy policy JSON file.
            max_file_size (int): Maximum size for saved history files (in bytes).

        Returns:
            None
        """
        self.title_weight = title_weight
        self.thresh_hold = thresh_hold
        self.documents = []
        self.embeddings = None
        self.model = SentenceTransformer(model_name)
        genai.configure(api_key=genai_api_key)
        self.generative_model = genai.GenerativeModel("gemini-1.5-flash")
        self.file_path = file_path
        self.max_file_size = max_file_size
        self._prepare_and_embed_data()

    def _load_json(self):
        """
        Loads the privacy policy JSON data from the specified file.
        """
        with open(self.file_path, "r", encoding="utf-8") as file:
            return json.load(file)

    def _prepare_and_embed_data(self):
        """
        Processes the privacy policy sections into documents and computes their embeddings.
        Combines titles, contents, and subheaders to generate meaningful embeddings
        for similarity-based querying.
        """
        documents = []
        embeddings = []
        data = self._load_json()["PRIVACY POLICY"]

        for section in data:
            title = section.get("Title", "")
            content = section.get("Content", "")
            subheaders = section.get("Subheaders", [])
            if not subheaders:
                # Handles sections without subheaders
                combined_text = f"{title} {content}"
                documents.append({"text": combined_text, "context": title})
                # Adding weight to title to meet users' basic needs for concepts and definitions, avoiding information noise
                embeddings.append(self.model.encode(title) * self.title_weight + self.model.encode(content))
            else:
                # Handles sections with subheaders
                for subheader in subheaders:
                    sub_title = subheader.get("Title", "")
                    sub_content = subheader.get("Content", "")
                    documents.append({"text": sub_content, "context": sub_title})
                    embeddings.append(self.model.encode(sub_title) * self.title_weight + self.model.encode(sub_content))

                subheader_contents = " ".join([f"{sh['Title']}: {sh['Content']}" for sh in subheaders])
                header_combined_text = f"{title} {content} {subheader_contents}"
                documents.append({"text": header_combined_text, "context": title})
                embeddings.append(self.model.encode(title) * self.title_weight + self.model.encode(header_combined_text))

        self.documents = documents
        self.embeddings = np.array(embeddings)

    def save_history(self, file_path, user_message, bot_response):
        """
        Saves a user-bot conversation history into a JSON file.

        Args:
            file_path (str): Path to the file where history will be saved.
            user_message (str): User's query or message.
            bot_response (str): Generated response to the user's query.

        Returns:
            None
        """
        new_exchange = {"user": user_message, "bot": bot_response}
        new_content = json.dumps(new_exchange, indent=4)

        if os.path.exists(file_path):
            current_size = os.path.getsize(file_path)
            write_mode = "a"
        else:
            current_size = 0
            write_mode = "w"

        new_content_size = len(new_content.encode("utf-8"))
        if current_size + new_content_size > self.max_file_size:
            print(f"Alert: File size limit of {self.max_file_size} bytes exceeded. Save aborted.")
            return

        if write_mode == "a":
            # The file already holds a JSON array: append the new exchange to it
            with open(file_path, "r+", encoding="utf-8") as f:
                f.seek(0, os.SEEK_END)
                f.seek(f.tell() - 1, os.SEEK_SET)  # Move back one character to overwrite the closing bracket
                f.truncate()
                f.write(",\n")
                f.write(new_content)
                f.write("\n]")
        else:
            with open(file_path, write_mode, encoding="utf-8") as f:
                f.write("[\n")
                f.write(new_content)
                f.write("\n]")

    def find_most_relevant(self, query):
        """
        Finds the most relevant privacy policy section for the given query using cosine similarity.

        Args:
            query (str): The user's query for which relevance is evaluated.

        Returns:
            tuple: A dictionary of the most relevant document and its similarity score.
        """
        query_embedding = self.model.encode(query)
        similarities = cosine_similarity([query_embedding], self.embeddings).flatten()
        most_relevant_idx = np.argmax(similarities)
        return self.documents[most_relevant_idx], similarities[most_relevant_idx]

    def make_request(self, query):
        """
        Generates a response to the user's query based on the most relevant privacy policy section or fallback logic.

        Args:
            query (str): The user's query for the assistant.

        Returns:
            None
        """
        result, similarity = self.find_most_relevant(query)
        if similarity > self.thresh_hold:
            prompt = (f"You are an assistant about company privacy policy. Answer {query} based on this mock answer {result}.")
        elif "this" in query or "that" in query or "above" in query:
            # The query refers back to the previous exchange, so load it from the saved history
            history = []
            if os.path.exists("conversation_history.json"):
                with open("conversation_history.json", "r", encoding="utf-8") as f:
                    history = json.load(f)
            if not history:
                # Nothing to refer back to: stop here instead of calling the model without a prompt
                print("Please give me further information about your context!")
                return
            current_ques = history[-1]["user"]
            current_ans = history[-1]["bot"]
            prompt = (f"You are an assistant about company privacy policy. Your task is to answer {query}, which follows up on the previous question {current_ques} and your previous answer {current_ans}")
        else:
            # Provides a generic response with a link for more information
            prompt = (f"You are an assistant about company privacy policy. Your task is to answer {query} and then say \"If you want to get further information about {query}, please visit our website https://www.presight.io/privacy-policy.html\"")

        try:
            response = self.generative_model.generate_content(prompt)
            print(response.text)
            self.save_history("conversation_history.json",query,response.text)

        except Exception as e:
            print(f"Error generating response: {e}")
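
The retrieval step in `PrivacyPolicyAssistant` adds a weighted title embedding to the content embedding and then picks the document with the highest cosine similarity to the query, accepting it only above a threshold. The toy sketch below reproduces that arithmetic with hand-made 3-dimensional vectors in plain Python; the vectors are assumptions standing in for `SentenceTransformer` embeddings, and `combine` and `most_relevant` are illustrative names rather than methods of the class.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combine(title_vec: Sequence[float], content_vec: Sequence[float],
            title_weight: float = 2.0) -> List[float]:
    """Weighted sum: title_weight * title embedding + content embedding."""
    return [title_weight * t + c for t, c in zip(title_vec, content_vec)]


def most_relevant(query_vec: Sequence[float],
                  doc_vecs: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return the index and score of the document most similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors: both documents share the same content direction but differ in title.
docs = [
    combine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # title aligned with the query
    combine([0.0, 0.0, 1.0], [0.0, 1.0, 0.0]),  # title unrelated to the query
]
query = [1.0, 0.0, 0.0]
THRESHOLD = 0.5  # same default as thresh_hold in the class

idx, score = most_relevant(query, docs)
print(idx, round(score, 3), score > THRESHOLD)  # 0 0.894 True
```

Because cosine similarity ignores vector length, only the ratio between the title weight and the content weight matters; doubling both would leave the ranking unchanged.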


#### Example

In [None]:
# Usage example
assistant = PrivacyPolicyAssistant(genai_api_key=api_key_2)
query = "What is Personal Data?"
assistant.make_request(query)

Personal Data, as used in our Service's privacy policy, refers to information that can be used to contact or identify you.  This includes, but is not limited to, your email address, first and last name, phone number, address (including state, province, ZIP/Postal code, and city), and cookies and usage data.



In [None]:
# Query with previous context
query_2 = "Explain further about this?"
assistant.make_request(query_2)

Explain further about this?

My previous response defined Personal Data as information that can identify or contact you. Let's break that down further and clarify some points:

**1.  Information used to *contact* you:** This refers to data that allows someone to reach you directly.  Examples already given include email address and phone number.  It could also include social media handles (if provided) or instant messaging usernames, depending on how the service uses that information.

**2. Information used to *identify* you:** This goes beyond simply contacting you.  It involves information that uniquely pinpoints you as an individual. This includes:

* **Name:** Your first and last name are the most obvious identifiers.  Middle names, nicknames (if explicitly provided), and maiden names (if relevant and provided) could also fall under this category.

* **Address:**  A full address (street address, city, state/province, ZIP/postal code) allows for precise geographic identification.  Ev
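
The follow-up branch is triggered by a substring test (`"this" in query`), which is case-sensitive and would also fire on words that merely contain those letters. A possible refinement, sketched here under the assumption that whole-word, case-insensitive matching is what is wanted, uses a regular expression with word boundaries. `is_follow_up` and `FOLLOW_UP_WORDS` are hypothetical names.

```python
import re

# Words that suggest the query refers back to the previous exchange.
FOLLOW_UP_WORDS = ("this", "that", "above")
_FOLLOW_UP_RE = re.compile(
    r"\b(?:" + "|".join(FOLLOW_UP_WORDS) + r")\b",
    re.IGNORECASE,
)


def is_follow_up(query: str) -> bool:
    """Return True if the query contains one of the follow-up words as a whole word."""
    return _FOLLOW_UP_RE.search(query) is not None


print(is_follow_up("Explain further about this?"))  # True
print(is_follow_up("What is a thistle?"))           # False
```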

In [None]:
# Query about information not in the data
query_3 = "What is policy?"
assistant.make_request(query_3)

A privacy policy is a legal statement that explains how a company collects, uses, and protects the personal information of its users or customers.  It outlines what data is collected, why it's collected, who has access to it, and how users can control their information.  Compliance with privacy policies is crucial for maintaining trust and adhering to relevant data protection laws and regulations.

If you want to get further information about what our specific privacy policy entails, please visit our website https://www.presight.io/privacy-policy.html



In [None]:
# Yes/ No question
query_4 = "Do I need to provide my Email Address in Personal Data?"
assistant.make_request(query_4)

Based on the provided text, yes, you may need to provide your email address as part of your Personal Data when using the service.  The text explicitly lists "Email address" as an example of personally identifiable information that may be requested.
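
`save_history` appends to the history file by trimming the closing bracket and writing the new exchange in place. A simpler, if less efficient, alternative is to load the whole array, append to it, and rewrite the file. The sketch below assumes the same size cap as `max_file_size`; `append_exchange` is a hypothetical function, and it writes to a temporary directory so nothing in the working directory is touched.

```python
import json
import os
import tempfile


def append_exchange(file_path: str, user_message: str, bot_response: str,
                    max_file_size: int = 1_048_576) -> bool:
    """Append one exchange to a JSON array on disk.

    Returns False, leaving the file untouched, if the result would exceed max_file_size bytes.
    """
    history = []
    # Treat a missing or empty file as an empty history.
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        with open(file_path, "r", encoding="utf-8") as f:
            history = json.load(f)
    history.append({"user": user_message, "bot": bot_response})
    payload = json.dumps(history, indent=4, ensure_ascii=False)
    if len(payload.encode("utf-8")) > max_file_size:
        return False
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(payload)
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conversation_history.json")
    append_exchange(path, "What is Personal Data?", "first answer")
    append_exchange(path, "Explain further about this?", "second answer")
    with open(path, encoding="utf-8") as f:
        print(len(json.load(f)))  # 2
```

Rewriting the file costs more I/O as the history grows, but the file is always produced by `json.dumps`, so it stays valid JSON.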



In [None]:
# Print the conversation history
with open("conversation_history.json", "r", encoding="utf-8") as f:
    history = json.load(f)
history

[{'user': 'What is Personal Data?',
  'bot': "Personal Data, as used in our Service's privacy policy, refers to information that can be used to contact or identify you.  This includes, but is not limited to, your email address, first and last name, phone number, address (including state, province, ZIP/Postal code, and city), and cookies and usage data.\n"},
 {'user': 'Explain further about this?',
  'bot': 'Explain further about this?\n\nMy previous response defined Personal Data as information that can identify or contact you. Let\'s break that down further and clarify some points:\n\n**1.  Information used to *contact* you:** This refers to data that allows someone to reach you directly.  Examples already given include email address and phone number.  It could also include social media handles (if provided) or instant messaging usernames, depending on how the service uses that information.\n\n**2. Information used to *identify* you:** This goes beyond simply contacting you.  It invol