## Translate Medical FAQs with GCP

### Overview
This notebook translates 331 `Correct` answers from the RAG-evaluated dataset (`rag_faq_eval.csv`) into Spanish and Telugu using the Google Cloud Platform (GCP) Translation API. It builds on the Retrieval-Augmented Generation (RAG) pipeline, enabling multilingual support for healthcare FAQs to enhance patient accessibility in telehealth applications.

### Purpose
- Filter `rag_faq_eval.csv` to 331 `Correct` rows.
- Translate `predicted_answer` into Spanish and Telugu, creating `GCP_Spanish` and `GCP_Telugu` columns.
- Save results as an Excel file for stakeholder review and portfolio presentation.

### Business Value
- **Patient Accessibility**: Provides multilingual FAQs (Spanish, Telugu), supporting diverse patient populations and improving engagement.
- **Clinical Efficiency**: Enables quick responses in multiple languages from answers already judged `Correct`, which may reduce clinician workload (up to 60% is the figure cited from Azure economics; it is not measured in this notebook).
- **Scalability**: Demonstrates cloud-based translation for extensible language support (e.g., French, Hindi).

### Technical Approach
- **Input**: `rag_faq_eval.csv` with 500 rows (`input_text`, `predicted_answer`, `reference_answer`, `similarity_score`, `verdict`).
- **Filtering**: Retains 331 rows with `verdict` = "Correct".
- **Translation**: Uses GCP Translation API to translate `predicted_answer` to Spanish (`es`) and Telugu (`te`).
- **Output**: `rag_faq_eval_translated.xlsx` with `GCP_Spanish` and `GCP_Telugu` columns.
- **Runtime**: ~60-70 seconds per language for 331 translations on CPU (WSL Jupyter).
- **Tools**: Pandas, GCP Translate, `tqdm` for progress, `logging` for minimal output.

### Prerequisites
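- GCP service account key (`healthcare-faq-translation-key.json`), with its path set as `GOOGLE_APPLICATION_CREDENTIALS` in `.env`.
- Dependencies: `pip install pandas tqdm google-cloud-translate==2.0.1 python-dotenv openpyxl` (`openpyxl` is what pandas uses to write `.xlsx` files).
- Input file: `rag_faq_eval.csv` in the WSL Jupyter working directory.

---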

In [1]:
import os
import pandas as pd
from tqdm import tqdm
from google.cloud import translate_v2 as translate
from dotenv import load_dotenv
import logging

---
## Configure Logging

### Purpose
Set up logging to track key events (e.g., library imports, errors) with minimal output for a clean, professional notebook.

- Ensures transparent execution for debugging and stakeholder review, critical for healthcare applications.
- Reduces clutter in notebook output, enhancing GitHub portfolio presentation.

### Technical Details
- Uses `logging` with `INFO` level and timestamped format.
- Logs only essential events (e.g., initialization, file loading) to avoid excessive output.

In [2]:
# Setup Logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logging.info("Libraries imported and logging configured.")

2025-09-09 20:50:05,472 - INFO - Libraries imported and logging configured.
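
Keeping notebook output minimal can also mean quieting third-party loggers once the root logger is at `INFO`. The sketch below is one way to do that; the logger names `urllib3` and `google.auth` are assumptions about which libraries are chatty in a given environment, and `quiet_third_party_loggers` is a hypothetical helper, not part of this notebook.

```python
import logging

# Assumed logger names for the HTTP and auth layers under the GCP client;
# adjust the tuple if your environment logs under different names.
NOISY_LOGGERS = ("urllib3", "google.auth")

def quiet_third_party_loggers(names=NOISY_LOGGERS, level=logging.WARNING):
    """Raise the threshold of the named loggers so only warnings and errors show."""
    for name in names:
        logging.getLogger(name).setLevel(level)

quiet_third_party_loggers()
```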


---
## Load Environment Variables

### Purpose
Load GCP credentials from a `.env` file to securely authenticate the Translation API.

- Enables secure cloud integration, ensuring compliance with healthcare data privacy standards.
- Supports reproducible setups across environments (e.g., dev, prod).

### Technical Details
- Loads `GOOGLE_APPLICATION_CREDENTIALS` from `.env` (e.g., path to `healthcare-faq-translation-key.json`).
- Validates presence of credentials, raising clear errors if missing.

In [3]:
# Load environment variables from .env
load_dotenv()
if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
    logging.error("GOOGLE_APPLICATION_CREDENTIALS not set in .env file.")
    raise ValueError("GOOGLE_APPLICATION_CREDENTIALS must be set.")

logging.info("GCP credentials loaded from .env.")

2025-09-09 20:50:07,101 - INFO - GCP credentials loaded from .env.
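
The check above only confirms that the variable is set, not that it points to a usable key file. A minimal sketch of a stricter check follows; `check_credentials_file` is a hypothetical helper, and it assumes the key is a standard service account JSON file containing `type` and `client_email` fields.

```python
import json
import os

def check_credentials_file(path):
    """Return the service account email if `path` looks like a key file.

    Hypothetical helper: raises FileNotFoundError or ValueError otherwise.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")
    with open(path, encoding="utf-8") as fh:
        try:
            info = json.load(fh)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Credentials file is not valid JSON: {path}") from exc
    # Service account keys are JSON objects with these two fields.
    if not isinstance(info, dict) or info.get("type") != "service_account":
        raise ValueError("File is not a service account key.")
    if "client_email" not in info:
        raise ValueError("Service account key has no client_email field.")
    return info["client_email"]
```

In this notebook it could be called as `check_credentials_file(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])` right after `load_dotenv()`.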


---
## Initialize GCP Translation Client

### Purpose
Initialize the GCP Translation API client to enable translation of medical FAQ answers.

- Facilitates multilingual FAQ generation, improving accessibility for diverse patient populations.
- Leverages GCP’s robust translation capabilities for accurate, scalable results.

### Technical Details
- Uses `google.cloud.translate_v2.Client` for API access.
- Logs initialization success or errors for debugging.
- Requires valid credentials from `.env`.

In [4]:
# Initialize GCP Translation client
try:
    translate_client = translate.Client()
    logging.info("GCP Translation client initialized.")
except Exception as e:
    logging.error(f"Error initializing GCP Translation client: {e}")
    raise

2025-09-09 20:50:08,401 - INFO - GCP Translation client initialized.


---
## Load RAG-Evaluated Dataset

### Purpose
Load the RAG-evaluated dataset (`rag_faq_eval.csv`) containing 500 medical FAQs with predictions and verdicts.

- Provides the foundation for multilingual translation, enabling accessible healthcare FAQs for patients.
- Ensures data integrity before processing, critical for reliable outcomes.

### Technical Details
- Loads `rag_faq_eval.csv` (500 rows) with columns: `input_text`, `predicted_answer`, `reference_answer`, `similarity_score`, `verdict`.
- Uses `pandas` for efficient CSV handling in WSL Jupyter.

In [5]:
# Load rag_faq_eval.csv
df = pd.read_csv("rag_faq_eval.csv")
logging.info("Loaded rag_faq_eval.csv successfully.")

2025-09-09 20:50:09,649 - INFO - Loaded rag_faq_eval.csv successfully.
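
The load step itself does not verify the file's structure. One way to check data integrity before processing is to compare the loaded columns against the ones the later cells rely on. This is a sketch; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, with the column list taken from the Technical Details above.

```python
# Columns the notebook expects in rag_faq_eval.csv.
REQUIRED_COLUMNS = {
    "input_text",
    "predicted_answer",
    "reference_answer",
    "similarity_score",
    "verdict",
}

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, sorted."""
    return sorted(set(required) - set(columns))

# Example with a deliberately incomplete header.
print(missing_columns(["input_text", "predicted_answer", "verdict"]))
```

With the DataFrame loaded, `missing_columns(df.columns)` should return an empty list; anything else is a reason to stop before calling the API.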


---
## Filter to 'Correct' Answers

### Purpose
Remove unnecessary columns (`full_text`, `rag_answer`) and filter to the 331 rows with `verdict` = "Correct" for translation.

- Focuses translation on high-quality answers, ensuring only accurate FAQs are provided to patients.
- Reduces processing time and costs by limiting translations to verified answers.

### Technical Details
- Drops `full_text` and `rag_answer` columns (if present) using `df.drop`.
- Filters to 331 rows where `verdict` = "Correct".
- Logs column removal and row count for traceability.

In [6]:
# Remove unnecessary columns
df = df.drop(columns=['full_text', 'rag_answer'], errors='ignore')
logging.info("Dropped 'full_text' and 'rag_answer' columns.")

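# Keep only rows whose verdict = "Correct"; .copy() gives an independent
# DataFrame so later column assignments do not trigger SettingWithCopyWarning
df = df[df["verdict"] == "Correct"].copy()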
logging.info(f"Filtered to {len(df)} rows with verdict 'Correct'.")


2025-09-09 20:50:10,827 - INFO - Dropped 'full_text' and 'rag_answer' columns.
2025-09-09 20:50:10,829 - INFO - Filtered to 331 rows with verdict 'Correct'.
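
Since the API bills by characters sent, the cost saving from filtering can be estimated before translating. The sketch below is a rough estimate only: `estimate_translation_cost` is a hypothetical helper, and the default of 20 USD per million characters is an assumed list price that should be checked against current GCP pricing (free-tier allowances are ignored).

```python
def estimate_translation_cost(texts, n_languages, usd_per_million_chars=20.0):
    """Rough cost estimate: characters sent x target languages x unit price.

    The unit price is an assumption; pass the current rate explicitly.
    """
    total_chars = sum(len(t) for t in texts if isinstance(t, str))
    billed_chars = total_chars * n_languages
    return billed_chars, billed_chars / 1_000_000 * usd_per_million_chars

# Tiny illustrative input; in the notebook, pass the predicted answers instead.
sample_answers = ["Tinnitus is a condition.", "Diarrhea is a disease."]
chars, usd = estimate_translation_cost(sample_answers, n_languages=2)
print(f"{chars} characters, about ${usd:.5f}")
```

In this notebook the call would be `estimate_translation_cost(df["predicted_answer"].tolist(), n_languages=2)`, run once on all 500 rows and once on the 331 filtered rows to compare.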


In [7]:
df.head(5)

Unnamed: 0,input_text,reference_answer,predicted_answer,similarity_score,verdict
0,question: spastic paraplegia type 8 inherited ...,Spastic paraplegia type 8 is inherited in an a...,This condition is inherited in an autosomal do...,1.0,Correct
2,question: sprengel deformity answer:,Sprengel deformity is a congenital condition c...,Sprengel deformity is a condition that affects...,1.0,Correct
4,question: mitochondrial neurogastrointestinal ...,Mitochondrial neurogastrointestinal encephalop...,Mitochondrial neurogastrointestinal encephalop...,1.0,Correct
5,question: congenital varicella syndrome answer:,Congenital varicella syndrome is an extremely ...,Congenital varicella syndrome (CVS) is a rare ...,0.935256,Correct
6,question: tinnitus answer:,Tinnitus is often described as a ringing in th...,Tinnitus is a condition that affects the nervo...,1.0,Correct


---
## Define Translation Function

### Purpose
Define a reusable function to translate `predicted_answer` into a specified language, creating a new column (e.g., `GCP_Spanish`, `GCP_Telugu`).

- Enables dynamic multilingual support for healthcare FAQs, improving accessibility for non-English-speaking patients.
- Supports extensible language additions (e.g., French, Hindi) for future scalability.

### Technical Details
- Function: `translate_answers(df, language, lang_code)` translates `predicted_answer` using GCP API.
- Validates `lang_code` (e.g., `es`, `te`) with a test API call.
- Creates column (e.g., `GCP_Spanish`) and logs errors, successes, and skipped rows.
- Uses `tqdm` for progress tracking (~60-70 seconds for 331 rows, matching the runs recorded below).

In [8]:
def translate_answers(df, language, lang_code):
    """
    Translates predicted_answer for Correct rows into the specified language.
    
    Args:
        df (pd.DataFrame): DataFrame with Correct rows (verdict == 'Correct').
        language (str): Target language name (e.g., 'Spanish', 'Telugu') for column naming.
        lang_code (str): GCP language code (e.g., 'es', 'te').
    
    Returns:
        pd.DataFrame: Updated DataFrame with new column (e.g., 'GCP_Spanish').
    """
    # Validate the language code with a small test request
    try:
        translate_client.translate("test", target_language=lang_code)
        logging.info(f"Starting translations to {language} (code: {lang_code})")
    except Exception as e:
        logging.error(f"Invalid GCP language code: {lang_code}. Error: {e}")
        raise ValueError(f"Invalid GCP language code: {lang_code}")

    column_name = f"GCP_{language}"
    
    # Initialize (or reset) the translation column
    if column_name in df.columns:
        logging.warning(f"Column '{column_name}' already exists. Overwriting translations.")
    else:
        logging.info(f"Created column '{column_name}' for {language} translations.")
    df[column_name] = ""
    
    
    # Translate predicted_answer for each row
    for idx in tqdm(df.index, desc=f"Translating to {language}"):
        predicted_answer = df.at[idx, 'predicted_answer']
        if predicted_answer and pd.notna(predicted_answer) and predicted_answer != "Error: Generation failed.":
            try:
                result = translate_client.translate(predicted_answer, target_language=lang_code)
                df.at[idx, column_name] = result['translatedText']
            except Exception as e:
                df.at[idx, column_name] = f"Error: Translation failed - {str(e)}"
                logging.error(f"Error translating row {idx}: {e}")
        else:
            df.at[idx, column_name] = "No translation"
            logging.warning(f"Row {idx}: Skipped translation")
            
    # Log summary
    logging.info(f"Completed translations to {language}.")
    
    return df

In [9]:
# Translate to Spanish
df = translate_answers(df, "Spanish", "es")

2025-09-09 20:50:20,979 - INFO - Starting translations to Spanish (code: es)
2025-09-09 20:50:20,985 - INFO - Created column 'GCP_Spanish' for Spanish translations.
Translating to Spanish: 100%|█████████████████████████████████████████████████████████| 331/331 [01:07<00:00,  4.88it/s]
2025-09-09 20:51:28,889 - INFO - Completed translations to Spanish.
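
The run above makes one API request per row, at roughly five rows per second. The v2 client's `translate` method also accepts a list of strings and returns a list of result dictionaries, so a batched variant could cut the number of round trips. The sketch below uses a stand-in client so that it runs without credentials; `FakeTranslateClient`, `translate_in_batches`, and the batch size of 100 are assumptions for illustration, and the real per-request limits should be checked in the API documentation.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def translate_in_batches(client, texts, lang_code, batch_size=100):
    """Translate `texts` in order, sending up to `batch_size` strings per call."""
    translated = []
    for batch in chunked(texts, batch_size):
        results = client.translate(batch, target_language=lang_code)
        translated.extend(r["translatedText"] for r in results)
    return translated

class FakeTranslateClient:
    """Stand-in mimicking the list-in, list-of-dicts-out shape of the real client."""

    def __init__(self):
        self.calls = 0

    def translate(self, values, target_language=None):
        self.calls += 1
        return [{"translatedText": f"[{target_language}] {v}", "input": v} for v in values]

client = FakeTranslateClient()
out = translate_in_batches(client, [f"answer {i}" for i in range(331)], "es")
print(len(out), "translations in", client.calls, "calls")
```

The trade-off is error granularity: a failed request affects every row in that batch, whereas the per-row loop in `translate_answers` isolates failures to a single row.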


In [10]:
df.head(10)

Unnamed: 0,input_text,reference_answer,predicted_answer,similarity_score,verdict,GCP_Spanish
0,question: spastic paraplegia type 8 inherited ...,Spastic paraplegia type 8 is inherited in an a...,This condition is inherited in an autosomal do...,1.0,Correct,Esta afección se hereda con un patrón autosómi...
2,question: sprengel deformity answer:,Sprengel deformity is a congenital condition c...,Sprengel deformity is a condition that affects...,1.0,Correct,La deformidad de Sprengel es una afección que ...
4,question: mitochondrial neurogastrointestinal ...,Mitochondrial neurogastrointestinal encephalop...,Mitochondrial neurogastrointestinal encephalop...,1.0,Correct,La encefalopatía neurogastrointestinal mitocon...
5,question: congenital varicella syndrome answer:,Congenital varicella syndrome is an extremely ...,Congenital varicella syndrome (CVS) is a rare ...,0.935256,Correct,El síndrome de varicela congénita (SVC) es una...
6,question: tinnitus answer:,Tinnitus is often described as a ringing in th...,Tinnitus is a condition that affects the nervo...,1.0,Correct,El tinnitus es una afección que afecta el sist...
7,question: need know diarrhea answer:,"- Diarrhea is frequent, loose, and watery bowe...",Diarrhea is a disease in which the body's dige...,0.887206,Correct,La diarrea es una enfermedad en la que el sist...
11,question: outlook childhood ependymoma answer:,Certain factors affect prognosis (chance of re...,Key Points - The prognosis for children with c...,0.982512,Correct,Puntos clave - El pronóstico para los niños co...
12,question: many people affected oralfacialdigit...,Oral-facial-digital syndrome has an estimated ...,Oral-facial-digital syndrome is a rare conditi...,1.0,Correct,El síndrome orofacial-digital es una enfermeda...
13,question: liver cancer answer:,Your liver is the largest organ inside your bo...,Key Points - Liver cancer is a disease in whic...,0.835981,Correct,Puntos clave: El cáncer de hígado es una enfer...
14,question: treatment legius syndrome answer:,These resources address the diagnosis or manag...,These resources address the diagnosis or manag...,0.943522,Correct,Estos recursos abordan el diagnóstico o el tra...
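
One thing worth checking in the translated column: when the v2 API treats input as HTML (its default), the returned text may contain HTML entities such as `&#39;` in place of apostrophes. Two possible remedies are passing `format_="text"` to `translate`, or decoding the result afterwards. The sketch below shows the second option; `clean_translation` is a hypothetical helper.

```python
import html

def clean_translation(text):
    """Decode HTML entities (e.g. &#39; or &quot;) that may appear in API output."""
    return html.unescape(text)

print(clean_translation("the body&#39;s digestive system"))
```

Applied to this notebook, that would be `df["GCP_Spanish"].map(clean_translation)` before saving.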


In [11]:
# Translate to Telugu - adds column
df = translate_answers(df, "Telugu", "te")
df.head(10)

2025-09-09 20:51:29,051 - INFO - Starting translations to Telugu (code: te)
2025-09-09 20:51:29,056 - INFO - Created column 'GCP_Telugu' for Telugu translations.
Translating to Telugu: 100%|██████████████████████████████████████████████████████████| 331/331 [01:00<00:00,  5.47it/s]
2025-09-09 20:52:29,538 - INFO - Completed translations to Telugu.


Unnamed: 0,input_text,reference_answer,predicted_answer,similarity_score,verdict,GCP_Spanish,GCP_Telugu
0,question: spastic paraplegia type 8 inherited ...,Spastic paraplegia type 8 is inherited in an a...,This condition is inherited in an autosomal do...,1.0,Correct,Esta afección se hereda con un patrón autosómi...,ఈ పరిస్థితి ఆటోసోమల్ డామినెంట్ నమూనాలో వారసత్వ...
2,question: sprengel deformity answer:,Sprengel deformity is a congenital condition c...,Sprengel deformity is a condition that affects...,1.0,Correct,La deformidad de Sprengel es una afección que ...,స్ప్రెంజెల్ వైకల్యం అనేది చేతులు మరియు కాళ్ళ ఎ...
4,question: mitochondrial neurogastrointestinal ...,Mitochondrial neurogastrointestinal encephalop...,Mitochondrial neurogastrointestinal encephalop...,1.0,Correct,La encefalopatía neurogastrointestinal mitocon...,మైటోకాన్డ్రియల్ న్యూరోగ్యాస్ట్రోఇంటెస్టినల్ ఎన...
5,question: congenital varicella syndrome answer:,Congenital varicella syndrome is an extremely ...,Congenital varicella syndrome (CVS) is a rare ...,0.935256,Correct,El síndrome de varicela congénita (SVC) es una...,పుట్టుకతో వచ్చే వరిసెల్లా సిండ్రోమ్ (CVS) అనేద...
6,question: tinnitus answer:,Tinnitus is often described as a ringing in th...,Tinnitus is a condition that affects the nervo...,1.0,Correct,El tinnitus es una afección que afecta el sist...,టిన్నిటస్ అనేది నాడీ వ్యవస్థను ప్రభావితం చేసే ...
7,question: need know diarrhea answer:,"- Diarrhea is frequent, loose, and watery bowe...",Diarrhea is a disease in which the body's dige...,0.887206,Correct,La diarrea es una enfermedad en la que el sist...,అతిసారం అనేది శరీరంలోని జీర్ణవ్యవస్థ వ్యర్థాలు...
11,question: outlook childhood ependymoma answer:,Certain factors affect prognosis (chance of re...,Key Points - The prognosis for children with c...,0.982512,Correct,Puntos clave - El pronóstico para los niños co...,ముఖ్య అంశాలు - బాల్య ఎపెండిమోమా ఉన్న పిల్లలకు ...
12,question: many people affected oralfacialdigit...,Oral-facial-digital syndrome has an estimated ...,Oral-facial-digital syndrome is a rare conditi...,1.0,Correct,El síndrome orofacial-digital es una enfermeda...,ఓరల్-ఫేషియల్-డిజిటల్ సిండ్రోమ్ అనేది అరుదైన పర...
13,question: liver cancer answer:,Your liver is the largest organ inside your bo...,Key Points - Liver cancer is a disease in whic...,0.835981,Correct,Puntos clave: El cáncer de hígado es una enfer...,ముఖ్య అంశాలు - లివర్ క్యాన్సర్ అనేది కాలేయంలో ...
14,question: treatment legius syndrome answer:,These resources address the diagnosis or manag...,These resources address the diagnosis or manag...,0.943522,Correct,Estos recursos abordan el diagnóstico o el tra...,ఈ వనరులు లెజియస్ సిండ్రోమ్ నిర్ధారణ లేదా నిర్వ...
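
`translate_answers` records a failed row as an error string and moves on. For transient failures such as rate limits or network hiccups, retrying the call with exponential backoff may recover the row instead. This is a minimal sketch: `call_with_retry` is a hypothetical helper, and it catches every exception for brevity, where real code would likely narrow this to the API's error types.

```python
import time

def call_with_retry(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call `func`, retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts;
    the last failure is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a function that fails twice before succeeding.
state = {"calls": 0}

def flaky_upper(text):
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return text.upper()

# sleep is stubbed out here so the demo does not actually wait.
print(call_with_retry(flaky_upper, "ok", sleep=lambda seconds: None))
```

Inside the translation loop, the call would become `call_with_retry(translate_client.translate, predicted_answer, target_language=lang_code)`.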


In [12]:
# Save the translated results to Excel (pandas needs openpyxl to write .xlsx)
output_path = "rag_faq_eval_translated.xlsx"
df.to_excel(output_path, index=False)
logging.info(f"Translated results saved to {output_path}")

2025-09-09 20:52:29,840 - INFO - Translated results saved to rag_faq_eval_translated.xlsx
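
Because failures and skipped rows are written into the translation columns as marker strings, it can help to count them before sharing the file. The sketch below relies on the two markers used in `translate_answers` (`"Error: Translation failed"` and `"No translation"`); `summarize_translations` is a hypothetical helper.

```python
from collections import Counter

def summarize_translations(values):
    """Count translated, failed, skipped, and missing cells by marker string."""
    counts = Counter()
    for value in values:
        if not isinstance(value, str) or not value:
            counts["missing"] += 1          # NaN, None, or empty string
        elif value.startswith("Error: Translation failed"):
            counts["failed"] += 1           # marker written in the except branch
        elif value == "No translation":
            counts["skipped"] += 1          # marker written for unusable input
        else:
            counts["translated"] += 1
    return dict(counts)

sample = ["Hola", "Error: Translation failed - timeout", "No translation", ""]
print(summarize_translations(sample))
```

In this notebook it would be called as `summarize_translations(df["GCP_Spanish"].tolist())`, and likewise for `GCP_Telugu`.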


---
## Conclusion

This notebook translated 331 `Correct` answers from `rag_faq_eval.csv` into Spanish and Telugu using the GCP Translation API, creating `GCP_Spanish` and `GCP_Telugu` columns, and saved the results to `rag_faq_eval_translated.xlsx`. The pipeline builds on the RAG-evaluated dataset (500 rows, 331 `Correct`), enhancing the Healthcare FAQ Generator with multilingual support for telehealth applications.

### Key Results
- **Row Count**: 331 `Correct` rows translated from `rag_faq_eval.csv`.
- **Translations**: Added `GCP_Spanish` (e.g., “Herencia dominante autosómica”) and `GCP_Telugu` (e.g., “ఆటోసోమల్ డామినెంట్”) columns.
- **Runtime**: ~60-70 seconds per language for 331 translations on CPU (WSL Jupyter).
- **Output**: `rag_faq_eval_translated.xlsx` with `input_text`, `reference_answer`, `predicted_answer`, `similarity_score`, `verdict`, `GCP_Spanish`, `GCP_Telugu`.

### Business Impact
- **Patient Accessibility**: Enables multilingual FAQs for Spanish and Telugu-speaking patients, improving engagement and equity.
- **Clinical Efficiency**: Supports quick responses in multiple languages from answers already judged `Correct`, which may reduce clinician workload (up to 60% is the figure cited from Azure economics; it is not measured in this notebook).
- **Scalability**: Demonstrates cloud-based translation for extensible language support (e.g., French, Hindi).

### Next Steps
- **Expand Languages**: Add French (`fr`) and Hindi (`hi`) translations using `translate_answers`.
- **Deploy Application**: Build a Streamlit app to display multilingual FAQs, deployable on AWS SageMaker.
- **Improve Verdicts**: Add keyword-based rules (e.g., “beans and lentils” for CKD) to refine `Partially correct` verdicts in `rag_faq_eval.csv`.
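
For the language expansion, one option is to drive the existing function from a mapping of language names to codes. The sketch below uses a stand-in with the same signature as `translate_answers`, operating on a plain dictionary of lists rather than a DataFrame, so it runs without credentials; `TARGET_LANGUAGES`, `translate_all`, and `fake_translate_answers` are hypothetical names.

```python
# Language name (used for the column name) mapped to its GCP language code.
TARGET_LANGUAGES = {"Spanish": "es", "Telugu": "te", "French": "fr", "Hindi": "hi"}

def translate_all(data, languages, translate_fn):
    """Apply `translate_fn(data, language, lang_code)` once per target language."""
    for language, lang_code in languages.items():
        data = translate_fn(data, language, lang_code)
    return data

def fake_translate_answers(data, language, lang_code):
    """Stand-in for translate_answers: adds a GCP_<language> entry to a copy."""
    data = dict(data)
    data[f"GCP_{language}"] = [f"[{lang_code}] {text}" for text in data["predicted_answer"]]
    return data

table = {"predicted_answer": ["Tinnitus is a condition.", "Liver cancer is a disease."]}
result = translate_all(table, TARGET_LANGUAGES, fake_translate_answers)
print(sorted(key for key in result if key.startswith("GCP_")))
```

With the real function, the call would be `df = translate_all(df, TARGET_LANGUAGES, translate_answers)`.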

This step showcases cloud-based NLP and multilingual support, positioning the project as a strong Data Science portfolio piece for healthcare applications.