# GenAI Intensive Course Capstone Project: Terms of Service Translator

## Overview

Reading the long and complicated terms of service on websites can be overwhelming for most people. These documents are full of legal language that many users don't read or fully understand, which can lead to uncertainty about the specifics of the service they are using, what they are agreeing to, and the responsibilities of the provider. **This project tackles that problem by creating a smart "Terms of Service Translator."** It uses Gemini AI to turn complex legal text into simple, easy-to-understand summaries. The translated texts are written in plain English and made more approachable with helpful emoji.

To illustrate the capabilities of this translator, a sample Terms of Service document has been created for a fictitious company called "Dig-A-Hole." This whimsical example details a service where customers can engage in the activity of digging holes in designated areas of a field for recreational exercise and stress relief, with options for subscription. This sample terms of service document has been saved in a PDF file, which Gemini AI is then used to read and process. The goal is to translate the formal legal language of this document into a clear and friendly summary.

The project followed the steps below. The Gen AI capability that each step demonstrates is shown in parentheses.

* A sample Terms of Service document was created for the fictitious "Dig-A-Hole" company, detailing their stress-relief digging service, and then saved as a PDF.
* The necessary Google API key was set up to enable access to the Gen AI models.
* A suitable Gemini model was selected for effective language processing.
* The PDF document was then processed using Gemini AI with a carefully designed prompt to extract and summarize the key legal terms into plain, accessible English (**Document Understanding**).
* Few-shot prompting with illustrative examples was employed to guide Gemini in incorporating relevant and helpful emojis into the translations (**Few-shot Prompting**).
* The output was structured using JSON format to establish a direct and clear link between the original legal terms and their corresponding simplified translations (**JSON Mode**).
* Gemini AI was utilized again to rigorously evaluate the quality of the translated output based on predefined metrics, specific evaluation criteria, and a detailed rating rubric (**Gen AI Evaluation**).
* Interactivity was implemented in the translated Terms of Service by using Embeddings and Retrieval Augmented Generation (RAG). This allows users to ask specific questions about the terms and receive relevant, context-aware answers (**Embeddings** and **RAG**).

This project demonstrates a practical application of Gen AI in making legal documents more accessible and user-friendly.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sample-terms-of-service/sample_terms_of_service.pdf
/kaggle/input/sample-terms-of-service/sample_terms_of_service2.pdf


## Install the Python SDK

First things first, let's grab the necessary tools! I'll be installing the Gemini API Python SDK so I can chat with the AI, and also ChromaDB, which will come in handy later for making my terms of service interactive.


In [2]:
#!pip install -Uq "google-genai==1.7.0"
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"


In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown, display

genai.__version__

'1.7.0'

## Set up the API key

Alright, time to get my API key sorted! I'll need this so I can actually chat with Gemini. I've already tucked my Google API key away securely in a secret called `GOOGLE_API_KEY`.

In [4]:
from kaggle_secrets import UserSecretsClient

client = genai.Client(api_key=UserSecretsClient().get_secret("GOOGLE_API_KEY"))

## Automated retry

I'll be sending a lot of requests to Gemini. To avoid any hiccups when I hit the per-minute quota, I'm setting up an automatic retry. This way, if a request gets a "try again later" response (HTTP 429 or 503), my code will patiently wait and try again. Smart, right? 😉


In [5]:
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
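
To make the idea behind the retry wrapper concrete, here is a minimal, library-free sketch of retrying with exponential backoff. It is not how `google.api_core.retry` is implemented; `TransientError`, `call_with_retry`, and `flaky_request` are hypothetical names invented for this illustration, and the delays are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retriable API error (e.g. HTTP 429 or 503)."""


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Wait base_delay * 2^attempt seconds, plus a little random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


# Usage: a fake request that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("quota exceeded, try again later")
    return "ok"


# Pass a no-op sleep so the demo does not actually wait.
result = call_with_retry(flaky_request, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Injecting `sleep` as a parameter keeps the sketch quick to run and easy to test; real code would simply use the default.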

## Model Selection

Gemini comes in different flavors! I'm going to take a peek at the available models and see which one seems like the best fit for my task right here. It's like choosing the right expert for the job! 🤔

In [6]:
for model in client.models.list():
  print(model.name)

models/chat-bison-001
models/text-bison-001
models/embedding-gecko-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-001
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash-001
models/gemini-1.5-flash-001-tuning
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-1.5-flash-8b-exp-0827
models/gemini-1.5-flash-8b-exp-0924
models/gemini-2.5-pro-exp-03-25
models/gemini-2.5-pro-preview-03-25
models/gemini-2.5-flash-preview-04-17
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206
m

For summarizing the main document, I'm rolling with **gemini-1.5-pro**. It seems pretty reliable and stable for text-based tasks like this.

(Quick heads-up though! For the evaluation part later on, I had to switch gears and use **gemini-2.0-flash**.)

In [7]:
from pprint import pprint

for model in client.models.list():
  if model.name == 'models/gemini-1.5-pro':
    pprint(model.to_json_dict())
    break

{'description': 'Stable version of Gemini 1.5 Pro, our mid-size multimodal '
                'model that supports up to 2 million tokens, released in May '
                'of 2024.',
 'display_name': 'Gemini 1.5 Pro',
 'input_token_limit': 2000000,
 'name': 'models/gemini-1.5-pro',
 'output_token_limit': 8192,
 'supported_actions': ['generateContent', 'countTokens'],
 'tuned_model_info': {},
 'version': '001'}


## Document Understanding: Cracking Open the PDF! 🔓

So, I've got my super important (but kinda snooze-worthy) terms of service all written down and tucked away in a PDF file. Right here, I'm gonna pull that document in and have Gemini work its magic to give me a quick and easy rundown in plain English – no legal headaches! 🥳

In [8]:
document_file = client.files.upload(file='/kaggle/input/sample-terms-of-service/sample_terms_of_service.pdf')

## Few-Shot Prompt: Making it Friendly! 🎉

A plain-English summary on its own is okay, but let's dial up the friendliness!
To get Gemini to be a bit more casual and throw in some fun emoji, I'm going to use a "few-shot prompt." This involves showing Gemini a few examples of how I want the translations to sound, specifically super chill and emoji-packed! 😎
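
As a rough sketch of the idea, a few-shot prompt can also be assembled from a list of example pairs instead of being typed out by hand. The helper name `build_few_shot_prompt` and the header wording are assumptions made for this illustration; the example pair is the one used in this notebook.

```python
def build_few_shot_prompt(examples, instruction):
    """Assemble a few-shot prompt from (original, translated) example pairs."""
    lines = ["Here are some examples of legal terms and their friendly translations:", ""]
    for original, translated in examples:
        lines.append(f'Original: "{original}"')
        lines.append(f'Translated: "{translated}"')
        lines.append("")  # Blank line between examples.
    lines.append(instruction)
    return "\n".join(lines)


examples = [
    (
        "We reserve the right to accept or refuse membership in our discretion.",
        "We get to say yay 👍 or nay 👎 to memberships, just because we can💪!",
    ),
]
prompt = build_few_shot_prompt(
    examples,
    "Please summarize the following terms of service in plain, easy-to-understand English with emoji.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to add, remove, or swap examples and see how the tone of the output changes.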

In [9]:
model_config = types.GenerateContentConfig(
    temperature=0.0
)

few_shot_prompt = """
Here are some examples of how to translate legal terms of service and their friendly summaries using emoji:

Original: "We reserve the right to accept or refuse membership in our discretion." 
Translated: "We get to say yay 👍 or nay 👎 to memberships, just because we can💪!"

Original: "To the maximum extent permitted by law, you agree that the Company shall not be held liable for any injuries, damages, or losses incurred in connection with the use of our services or products. By using our services, you waive any right to bring a claim or lawsuit against us for such injuries."
Translated: "Oops! If you trip, slip, or fall dramatically and get hurt while using our stuff, please don’t sue us 😅🙏. By hanging out with us, you're saying, 'Okay cool, I won't blame you if I bonk myself💖'"

Please summarize the following terms of service in plain, easy-to-understand English with emoji.
"""

response = client.models.generate_content(
    model='gemini-1.5-pro',
    config=model_config,
    contents=[few_shot_prompt, document_file],
)

print(response.text)

Want to dig a hole? 🕳️ Here's the deal:

1. **The Agreement:** This is a contract. By digging with us, you agree to these rules. 🤝
2. **What We Do:** We offer a place to dig holes for fun and stress relief!  We have a free trial dig, then you need a monthly subscription.  You can also rent a shovel. 🥄
3. **Subscriptions & Payment:** Subscriptions are for a whole month (at least).  We'll tell you the price online or at our place.  You'll be billed each month.  You can cancel anytime, but you still have to pay for the current month.  No refunds for partial months. 🗓️💰
4. **Shovel Rental:**  Rent a shovel! 🥄  Take care of it and bring it back in good shape (normal wear and tear is okay).  We might charge you if it's broken or lost. 
5. **Safety First (But It's Your Responsibility):** Digging can be dangerous! You might get tired, hurt, trip, or bonk yourself on a rock. 🤕  You're responsible for your own safety.  By digging here, you're saying you understand the risks.
6. **Don't Sue Us:**

In [10]:
request = 'Please summarize the following terms of service in plain, easy-to-understand English:'

def translate_doc(request: str) -> str:
  """Execute the request on the uploaded document."""
  # Set the temperature low to stabilise the output.
  config = types.GenerateContentConfig(temperature=0.0)
  response = client.models.generate_content(
      model='gemini-1.5-pro',
      config=config,
      contents=[request, document_file],
  )

  return response.text

summary = translate_doc(request)
Markdown(summary)

Dig-A-Hole's Terms of Service basically say:

* **Digging Holes:**  We let you dig holes in our field for fun and stress relief. You can try it once for free, then you need a monthly subscription. You can also rent shovels.
* **Agreement:** By using our services, you agree to these rules.
* **Subscriptions:** Subscriptions are monthly. You can cancel anytime, but you have to pay for the full month even if you cancel mid-month. No refunds for partial months.
* **Shovel Rental:** You can rent shovels for an extra fee. You're responsible for any damage or loss beyond normal wear and tear.
* **Safety:** Digging has risks (getting tired, injuries, etc.). You're responsible for your own safety. We're not liable for anything that happens to you while digging.
* **Liability:** You agree not to sue us for anything related to using our services, and you agree to cover our legal costs if someone sues us because of something you did.
* **Medical:** We don't offer medical services. See a doctor before digging, especially if you have health issues. We're not responsible for your medical bills.
* **Behavior:** Behave respectfully. We can refuse service to anyone disruptive or unsafe.
* **Changes:** We can change these terms anytime.  Using our services after a change means you accept the new terms.
* **Law:** Ohio law applies to these terms. Any disputes will be handled in Columbus, Ohio courts.
* **Contact:**  Contact information is provided if you have questions.


In short, you pay to dig holes, you're responsible for your own safety and actions, and we're not liable for much.  Have fun digging!


## JSON: Keeping Things Linked

Good summaries so far! Now, let's structure things with JSON to keep each "Original" legal term directly linked to its "Translated" buddy. 

In [11]:
import json

few_shot_prompt = """
Please translate the following terms of service to more casual, friendly, easy-to-understand English with emoji. The output should be a JSON array and each element is
a JSON object with the "Original" and "Translated" version.
Here are some examples of how to translate legal terms of service and their friendly summaries using emoji:
```
{
Original: "We reserve the right to accept or refuse membership in our discretion." 
Translated: "We get to say yay 👍 or nay 👎 to memberships, just because we can💪!"
}
{
Original: "To the maximum extent permitted by law, you agree that the Company shall not be held liable for any injuries, damages, or losses incurred in connection with the use of our services or products. By using our services, you waive any right to bring a claim or lawsuit against us for such injuries."
Translated: "Oops!😳 If you trip, slip, or fall dramatically and get hurt🤕 while using our stuff, please don’t sue us 😅🙏. By hanging out with us, you're saying, 'Okay cool, I won't blame you if I bonk myself💖'"
}
```

"""

response = client.models.generate_content(
    model='gemini-1.5-pro',
    contents= [few_shot_prompt, document_file],
    config={
        'response_mime_type': 'application/json'
    },
)
     
#print(response.text)

response_load = json.loads(response.text)
print(json.dumps(response_load, indent=4, ensure_ascii=False))

[
    {
        "Original": "These Terms of Service constitute a legally binding agreement between you (\"Customer,\" \"you,\" or \"your\") and Dig-A-Hole (\"Dig-A-Hole,\" \"we,\" \"us,\" or \"our\"). By using our services, you acknowledge that you have read, understood, and agree to be bound by these Terms.",
        "Translated": "This is a real deal contract between you (that's \"Customer,\" \"you,\" or \"your\") and us (we're \"Dig-A-Hole\" also known as \"we,\" \"us,\" or \"our\"). By using our stuff, you're saying \"I get it, and I'm on board!\" 🤝"
    },
    {
        "Original": "Dig-A-Hole provides a facility where customers can engage in the activity of digging holes in designated areas of a field. Our services are intended for recreational exercise and stress relief.",
        "Translated": "We've got a place where you can dig holes to your heart's content! 🕳️ It's all about fun, exercise, and digging away your stress! 😄"
    },
    {
        "Original": "Free Trial Dig: New

My data is all neatly organized in JSON format. To make it easier to work with, here's how I'm parsing the summary.
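
Before relying on the parsed data, it can help to check that the model really returned the expected shape. The following is a minimal sketch of such a check, assuming the output is a JSON array of objects with string `Original` and `Translated` fields; `validate_translations` is a hypothetical helper, not part of any SDK.

```python
import json


def validate_translations(raw_text):
    """Parse model output and check it is a list of Original/Translated pairs."""
    data = json.loads(raw_text)  # Raises json.JSONDecodeError on invalid JSON.
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array at the top level.")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"Item {index} is not a JSON object.")
        for key in ("Original", "Translated"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"Item {index} is missing a string '{key}' field.")
    return data


# Usage with a hand-written sample in the same shape as the model output.
sample = json.dumps([
    {
        "Original": "We reserve the right to accept or refuse membership in our discretion.",
        "Translated": "We get to say yay 👍 or nay 👎 to memberships!",
    }
])
pairs = validate_translations(sample)
print(len(pairs), pairs[0]["Translated"])
```

Failing loudly on a malformed response is usually easier to debug than silently printing placeholder text.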

In [12]:
# Parse the JSON response and print each original/translated pair.
import json

try:
    # Parse inside the try block so that invalid JSON is actually caught below.
    response_load = json.loads(response.text)
    print("\nParsed JSON Summary:")

    if isinstance(response_load, list):
        for i, item in enumerate(response_load, start=1):
            original = item.get('Original', '[Missing original]')
            translated = item.get('Translated', '[Missing translated]')
            print(f"\nItem {i}:")
            print(f"  Original: {original}")
            print(f"  Translated: {translated}")
    else:
        print("Expected a JSON array but received a different structure.")
except json.JSONDecodeError as e:
    print("\nError decoding JSON:")
    print(f"  {e}")
    print("The output does not appear to be valid JSON.")


Parsed JSON Summary:

Item 1:
  Original: These Terms of Service constitute a legally binding agreement between you ("Customer," "you," or "your") and Dig-A-Hole ("Dig-A-Hole," "we," "us," or "our"). By using our services, you acknowledge that you have read, understood, and agree to be bound by these Terms.
  Translated: This is a real deal contract between you (that's "Customer," "you," or "your") and us (we're "Dig-A-Hole" also known as "we," "us," or "our"). By using our stuff, you're saying "I get it, and I'm on board!" 🤝

Item 2:
  Original: Dig-A-Hole provides a facility where customers can engage in the activity of digging holes in designated areas of a field. Our services are intended for recreational exercise and stress relief.
  Translated: We've got a place where you can dig holes to your heart's content! 🕳️ It's all about fun, exercise, and digging away your stress! 😄

Item 3:
  Original: Free Trial Dig: New customers are eligible for one (1) complimentary digging session 

## Define an Evaluator: Checking the Translation Quality

My friendly translations are looking good next to the original text. Now, the question is: how well did my translator actually do?

In this section, I'm going to put the translations to the test by focusing on a few key things that matter:

* **Clarity:** How easy is the translated text to understand for someone without a law degree?
* **Friendliness:** Did the translation nail that approachable and emoji-filled vibe I was going for?
* **Completeness:** Did it capture all the important info from the original legal stuff?

To figure this out, I've set up a simple rating system, giving a score from 1 to 5. I've also laid out the evaluation steps. Let's see how they measure up!

**Important:** To keep things running smoothly and avoid any hiccups with rate limits, I'm only going to evaluate the first three items from the terms of service for now. But I believe this will give me a good initial idea of how well the translator is performing.
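
Once each item has a rating, the scores can be rolled up into a single number. This is a small sketch of that aggregation, assuming ratings are stored by name (for example `VERY_GOOD`) as labels for the 1-to-5 rubric; `RATING_SCORES` and `average_rating` are hypothetical names, and the sample ratings are made up rather than taken from this notebook's results.

```python
from statistics import mean

# Numeric value for each rating name on the 1-to-5 rubric.
RATING_SCORES = {"VERY_GOOD": 5, "GOOD": 4, "OK": 3, "BAD": 2, "VERY_BAD": 1}


def average_rating(rating_names):
    """Average the numeric scores of the given rating names, skipping missing ones."""
    scores = [RATING_SCORES[name] for name in rating_names if name is not None]
    return mean(scores) if scores else None


# Usage with made-up ratings; None stands for an item the evaluator could not score.
print(average_rating(["VERY_GOOD", "GOOD", None, "VERY_GOOD"]))
```

With only three evaluated items the average is a rough signal at best, so it is worth reading the verbose explanations alongside it.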

In [13]:
import enum

# Define the evaluation prompt for evaluating the "Translated" text
TRANSLATION_EVAL_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the AI-generated translation of a sentence from a terms of service
document.
We will provide you with the original sentence and the AI-generated translation.
You should first read the original sentence carefully, then evaluate the quality of the translation based on the Criteria 
provided in the Evaluation section below.
You will assign the translation a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations 
for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing the translation's clarity, friendliness, and completeness.

## Criteria
Clarity: The translation prioritizes making the core meaning immediately accessible and easy to understand for everyone, 
even those unfamiliar with legal terms. It uses simple language and relatable analogies to convey the essential information effectively.
Friendliness: The translation adopts an extremely approachable, casual, and enthusiastic tone, using emojis and informal language 
to create a positive and engaging experience for the reader. The high level of friendliness is intentional to make the legal terms 
feel less intimidating and more welcoming.
Completeness: The translation accurately and fully conveys the core meaning and key pieces of information presented 
in the original sentence. It should capture all the essential facts, actions, entities, and relationships described, 
even if expressed in simpler language. The level of detail in the translation should be sufficient to understand the main points 
of the original without losing critical information.

## Rating Rubric
5: (Very good). The translation is accurate, clear, friendly, and complete.
4: (Good). The translation is mostly accurate, clear, friendly, and complete.
3: (Ok). The translation is understandable but may have minor issues with accuracy, clarity, or friendliness. Emojis might 
be missing or slightly awkward.
2: (Bad). The translation has significant issues with accuracy or clarity, or fails to adopt a friendly tone.
1: (Very bad). The translation is inaccurate, incomprehensible, or completely fails to address the original meaning.

## Evaluation Steps
STEP 1: Assess the translation in aspects of clarity, friendliness, and completeness according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## Original Sentence

{original}

## AI-generated Translation

{response}
"""

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
    VERY_GOOD = '5'
    GOOD = '4'
    OK = '3'
    BAD = '2'
    VERY_BAD = '1'

def eval_translation(original, ai_response):
    """Evaluate the generated translation against the original sentence."""

    # It doesn't look like I can use chat with gemini-1.5-pro. Using gemini-2.0-flash here.
    chat = client.chats.create(model='gemini-2.0-flash')
    
    # Generate the full text response.
    response = chat.send_message(
        message=TRANSLATION_EVAL_PROMPT.format(original=original, response=ai_response)
    )
    verbose_eval = response.text
    
    # Coerce into the desired structure.
    structured_output_config = types.GenerateContentConfig(
        response_mime_type="text/x.enum",
        response_schema=SummaryRating,
    )
    response = chat.send_message(
        message="Convert the final score.",
        config=structured_output_config,
    )
    structured_eval = response.parsed
    
    return verbose_eval, structured_eval

# Ideally, I want to check everything but I don't have enough quota. So, iterating only first three elements.
evaluation_results = []
for item in response_load[:3]:
    original_text = item['Original']
    translated_text = item['Translated']
    verbose_eval, structured_eval = eval_translation(original=original_text, ai_response=translated_text)

    evaluation_results.append({
        'original': original_text,
        'translated': translated_text,
        'verbose_evaluation': verbose_eval,
        'structured_evaluation': structured_eval.name if structured_eval else None
    })

for result in evaluation_results:
    print(f"Original: {result['original']}")
    print(f"Translated: {result['translated']}")
    print(f"Verbose Evaluation:\n{result['verbose_evaluation']}")
    print(f"Structured Evaluation: {result['structured_evaluation']}")
    print("-" * 20)

Original: These Terms of Service constitute a legally binding agreement between you ("Customer," "you," or "your") and Dig-A-Hole ("Dig-A-Hole," "we," "us," or "our"). By using our services, you acknowledge that you have read, understood, and agree to be bound by these Terms.
Translated: This is a real deal contract between you (that's "Customer," "you," or "your") and us (we're "Dig-A-Hole" also known as "we," "us," or "our"). By using our stuff, you're saying "I get it, and I'm on board!" 🤝
Verbose Evaluation:
STEP 1:
The translation's clarity is quite good; it uses simplified language to make the legal concept understandable. The friendliness is very high, employing a casual and enthusiastic tone with the use of emojis. The translation also accurately conveys the core meaning, explicitly stating that it's a real contract and that using the services implies agreement to the terms.

STEP 2:
Rating: 5
Explanation: The translation is accurate, clear, friendly, and complete, effectively 

## Ask Away! 🗣️ Getting Answers from the Document

So far, I've built a friendly translator using the magic of few-shot prompting and structured my results neatly with JSON. I also put my translator to the test with a custom AI evaluator! 🎉 Now, if I want to take this project to the next level and make it even more interactive, I can explore how to let users ask specific questions about the Terms of Service. This next section is about adding a smart search capability, allowing the AI to dig into the document and find just the answers users are looking for! 😎

### Read and Split the Sample Terms of Service Document

Let's get this document ready for some AI magic! First, I'll install PyPDF2 and use its `PdfReader` to pull all the text out of the sample Terms of Service. Then it's time to slice the text into smaller chunks so they can be embedded and used effectively in a RAG system. Onward! 🚀



In [14]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [15]:
import re
from PyPDF2 import PdfReader

def split_into_paragraphs(text):
    # Remove newlines that interrupt sentences
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)

    # Check if the text contains numbered sections like 1., 2., etc.
    if re.search(r'\b\d+\.\s', text):
        # Split based on numbered sections
        parts = re.split(r'(?=\b\d+\.\s)', text)
        paragraphs = [part.strip() for part in parts if part.strip()]
    else:
        # Fallback: Split based on sentence boundaries and group into paragraphs
        sentence_candidates = re.split(r'(?<=[.?!])\s+(?=[A-Z])', text)

        paragraphs = []
        temp = ""
        sentence_count = 0

        for part in sentence_candidates:
            temp += part.strip() + " "
            sentence_count += 1
            if sentence_count >= 2:
                paragraphs.append(temp.strip())
                temp = ""
                sentence_count = 0

        if temp.strip():
            paragraphs.append(temp.strip())

    return paragraphs

try:
    reader = PdfReader('/kaggle/input/sample-terms-of-service/sample_terms_of_service.pdf')
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"

    all_paragraphs = []
    paragraphs = split_into_paragraphs(text)
    for i, paragraph in enumerate(paragraphs):
        print(f"Paragraph {i+1}:\n{paragraph}\n{'='*20}")
        all_paragraphs.append(paragraph)

except FileNotFoundError:
    print("Error: The PDF file was not found.")
except Exception as e:
    print(f"An error occurred while reading the PDF: {e}")

Paragraph 1:
Dig-A-Hole - Terms of Service Last Updated: April 6, 2025 Please read these Terms of Service carefully before using the services provided by Dig-A-Hole. By accessing our facilities, participating in our digging activities, or subscribing to our services, you agree to be bound by these Terms.
Paragraph 2:
1. Acceptance of Terms These Terms of Service constitute a legally binding agreement between you ("Customer," "you," or "your") and Dig-A-Hole ("Dig-A-Hole," "we," "us," or "our"). By using our services, you acknowledge that you have read, understood, and agree to be bound by these Terms.
Paragraph 3:
2. Description of Services Dig-A-Hole provides a facility where customers can engage in the activity of digging holes in designated areas of a field. Our services are intended for recreational exercise and stress relief. We offer the following: ● Digging Access: Access to a designated field area for the purpose of digging holes. ● Free Trial Dig: New customers are eligible fo
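
The numbered-section split in `split_into_paragraphs` relies on a zero-width lookahead, which is easy to misread. Here is a small standalone sketch of that one step on hypothetical sample text (the strings below are invented for illustration, and splitting on an empty match needs Python 3.7 or later).

```python
import re

# Hypothetical sample text, flattened onto one line like the extracted PDF text.
sample = "Intro text. 1. Acceptance of Terms You agree. 2. Description of Services We dig holes."

# (?=...) is a lookahead: it matches the position just before "1. ", "2. ", ...
# without consuming any characters, so each section keeps its own number.
parts = [p.strip() for p in re.split(r'(?=\b\d+\.\s)', sample) if p.strip()]
for part in parts:
    print(part)

# Caveat: a sentence that merely ends in a number also matches the pattern.
tricky = "Rentals cost 5. Bring gloves."
tricky_parts = re.split(r'(?=\b\d+\.\s)', tricky)
print(tricky_parts)
```

The caveat matters for documents that mention prices or quantities at the end of a sentence; for such text a stricter pattern or a different chunking strategy may be a better fit.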

### Examine Available Embedding Model

Let's scout out the available embedding models! I'm choosing `text-embedding-004` to be the star of our embedding show! ✨ It's a well-regarded model that should give us good semantic representations of our Terms of Service chunks, leading to better answers when users ask questions. 🧠

In [16]:
for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


### Define a Custom Embedding Function

I am about to do some behind-the-scenes magic to make my document searchable! This little custom function is the key. It grabs each chunk of my Terms of Service and sends it on a quick trip to Google's Gemini AI. Gemini then turns the text into an embedding (a vector of numbers that captures its meaning) and sends it back. This is what ChromaDB uses to understand and later retrieve the right information when you ask questions. Pretty cool, huh? 😎

In [17]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        resp = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in resp.embeddings]

### Getting the ChromaDB Toolbox Ready

Here I create a ChromaDB collection named "googlecardb" and add my document chunks to it. ChromaDB embeds each chunk with the Gemini embedding function defined above as it is added. This crucial step makes our Terms of Service searchable and ready for those insightful questions I am going to throw at it! 😉

In [18]:
import chromadb

DB_NAME = "googlecardb"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

db.add(documents=all_paragraphs, ids=[str(i) for i in range(len(all_paragraphs))])




In [19]:
db.count()

13

### Asking a Question

Alright, let's see if my embedded document is smart! First, I'm switching to "query mode." Then, I'm asking: "What is the cancellation policy?" 🤔 Just like the document, my question gets its meaning from Gemini, and ChromaDB uses it to find the most similar chunks. Let's see what it finds! 

In [20]:
# Switch to query mode when generating embeddings.
embed_fn.document_mode = False

# Search the Chroma DB using the specified query.
query = "What is the cancellation policy?"

result = db.query(query_texts=[query], n_results=1) 
[all_passages] = result["documents"]

Markdown(all_passages[0])

3. Subscription Terms and Payment ● Minimum Subscription Period: All subscriptions have a minimum commitment of one (1) full month. ● Subscription Fees: The monthly subscription fee will be clearly communicated on our website or at our facility. ● Billing: Subscription fees will be billed on a recurring monthly basis, commencing on the date of your initial subscription. ● Payment Methods: We accept the payment methods specified on our website or at our facility. You agree to provide accurate and up-to-date payment information. ● Cancellation: You may cancel your subscription at any time. However, due to the minimum one-month commitment, you will be responsible for the full payment of the current billing cycle in which you cancel, and your access will continue until the end of that paid month. No refunds will be provided for partial months.
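
Under the hood, the retrieval step above is a nearest-neighbor search over embedding vectors. The sketch below shows the idea with the standard library only. The three-dimensional vectors and two of the chunk titles are made up for illustration (real embeddings have hundreds of dimensions), and cosine similarity is used as a common choice; ChromaDB's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy data: chunk titles and hand-picked "embeddings" (illustrative only).
chunks = [
    "Service description",
    "Subscription Terms and Payment",
    "Safety rules",
]
chunk_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.9],
]
# Pretend embedding of "What is the cancellation policy?"
query_vec = [0.2, 0.8, 0.1]

best = top_k(query_vec, chunk_vecs, k=1)
print(chunks[best[0]])
```

With `k=1` this mirrors `n_results=1` in the query cell: only the single closest chunk comes back.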

### Getting the Friendly Answer

In the following steps, the question and the relevant document snippets retrieved by ChromaDB are sent to the Gemini AI model. The prompt includes specific instructions for Gemini: to answer in a friendly and casual tone, use simple language with appropriate emojis, stick to the provided text, and be precise. If the passage doesn't contain the answer, Gemini is instructed to say so.

The code then takes the generated answer from Gemini and displays it in a readable format. This demonstrates the RAG pipeline in action, providing a user-friendly answer based on the information found in the Terms of Service document. 🥳

In [21]:
query_oneline = query.replace("\n", " ")

# This prompt is where you can specify any guidance on tone, or what topics the model should stick to, or avoid.
prompt = f"""You are a very friendly bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, but being casual and very friendly. Make sure to include some appropriate emoji.
You need to use easy-to-understand English instead of using legal terms. If the passage is irrelevant to the answer, you
may say that you can not find the information.

QUESTION: {query_oneline}
"""


# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"\nPASSAGE: {passage_oneline}\n"

print(prompt)

You are a very friendly bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, but being casual and very friendly. Make sure to include some appropriate emoji.
You need to use easy-to-understand English instead of using legal terms. If the passage is irrelevant to the answer, you
may say that you can not find the information.

QUESTION: What is the cancellation policy?

PASSAGE: 3. Subscription Terms and Payment ● Minimum Subscription Period: All subscriptions have a minimum commitment of one (1) full month. ● Subscription Fees: The monthly subscription fee will be clearly communicated on our website or at our facility. ● Billing: Subscription fees will be billed on a recurring monthly basis, commencing on the date of your initial subscription. ● Payment Methods: We accept the payment methods specified on our website or at our facility. You agree to provide accurate and up-to-date payment information. ● Cancellation: 
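
If the prompt needs to be rebuilt for many questions, the assembly logic above could be wrapped in a small helper. This is only a sketch: `build_prompt`, `one_line`, and the shortened `INSTRUCTIONS` text are hypothetical names and wording, not part of the notebook, and `one_line` collapses all runs of whitespace rather than only newlines.

```python
# Shortened, paraphrased instructions (the notebook's prompt is longer).
INSTRUCTIONS = (
    "Answer the question using only the reference passages below. "
    "Use plain, friendly English. If the passages are irrelevant, "
    "say that you cannot find the information."
)


def one_line(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return " ".join(text.split())


def build_prompt(question, passages):
    """Assemble instructions, then the question, then one PASSAGE line per chunk."""
    parts = [INSTRUCTIONS, "", f"QUESTION: {one_line(question)}"]
    for passage in passages:
        parts.append("")
        parts.append(f"PASSAGE: {one_line(passage)}")
    return "\n".join(parts) + "\n"


prompt = build_prompt(
    "What is the cancellation policy?",
    ["Cancellation: You may cancel\nyour subscription at any time."],
)
print(prompt)
```

Keeping the instructions, question, and passages in clearly labeled sections makes it easier to see exactly what the model was given when an answer looks off.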

In [22]:
model_config=types.GenerateContentConfig(temperature=0.0)
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    config=model_config,
    contents=prompt
)

Markdown(answer.text)

Hey there! 👋 You can cancel your subscription whenever you want, but since there's a one-month minimum, you'll still need to pay for the current month, and you can keep using it until the end of that month. No refunds for partial months, though! 😔


This "Ask Away!" feature aims to make static legal documents interactive. While a simple few-shot prompt might sometimes work for question answering, my experimentation suggests that Retrieval Augmented Generation (RAG) offers improved accuracy, particularly for questions not directly addressed in the document. RAG's grounding in the provided text reduces the likelihood of irrelevant or hallucinated answers, leading to a more trustworthy user experience.
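
For questions the document does not cover, one possible safeguard is to drop weakly matching passages before they ever reach the prompt, so the model falls back to "I can't find that." A minimal sketch, assuming the query result carries a `distances` list next to `documents` (nested one list per query, as in ChromaDB query results). The `filter_by_distance` helper, the sample distance, and the cutoff values are hypothetical and would need tuning against real queries.

```python
def filter_by_distance(result, max_distance):
    """Keep only passages whose distance to the query is at most max_distance."""
    documents = result["documents"][0]  # passages for the first (only) query
    distances = result["distances"][0]  # matching distances, smaller = closer
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


# Hand-written stand-in for a query result; the distance value is made up.
fake_result = {
    "documents": [["Cancellation: You may cancel your subscription at any time."]],
    "distances": [[0.42]],
}

relevant = filter_by_distance(fake_result, max_distance=0.6)
off_topic = filter_by_distance(fake_result, max_distance=0.3)

print(len(relevant), "passage(s) kept with a loose cutoff")
print(len(off_topic), "passage(s) kept with a strict cutoff")
```

When the filtered list comes back empty, the prompt can be sent with no passages at all, which leans on the existing instruction to admit the information is missing.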

## Conclusions and Recommendations
Awesome! 😎 Thanks to the cool techniques I picked up during the 5-Day Gen AI Intensive Course, I was able to build a solution that makes Terms of Service much easier to understand. Potential next steps include expanding support to other legal document types (e.g., privacy policies, contracts) and enabling the system to process documents directly from URLs. Additionally, incorporating user feedback mechanisms could further refine translation quality and tailor the output.