## 2. Translation of Reviews

Last Updated: 26 Sep 2025

Description: Detects and translates non-English user reviews from various platforms using the Google Cloud Translation API v2. The process is designed to limit API costs and to handle large datasets by processing them in chunks, with the ability to resume after an interruption.

Key Features
1. Cost-Optimized Workflow
- To minimize API costs, the script first makes a language detection call for each non-empty review field.
- It then makes a translation call only if the detected language is not English, which avoids translation calls for datasets with many English reviews.
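
As a rough illustration of the call pattern described above, the sketch below swaps the real API client for a hypothetical `FakeClient` that only counts calls, so the effect is visible without network access. The class, the helper name `detect_then_translate_with`, and the ASCII-based "language detection" rule are all assumptions made for this example, not part of the Google Cloud library.

```python
class FakeClient:
    """Hypothetical stand-in for the translation client; it only counts calls."""

    def __init__(self):
        self.detect_calls = 0
        self.translate_calls = 0

    def detect_language(self, text):
        self.detect_calls += 1
        # Toy rule (assumption): pure-ASCII text is treated as English.
        return {"language": "en" if text.isascii() else "xx"}

    def translate(self, text, target_language="en"):
        self.translate_calls += 1
        return {"translatedText": f"[{target_language}] {text}"}


def detect_then_translate_with(client, text):
    """Detect first; call translate only when the language is not English."""
    lang = client.detect_language(text)["language"]
    if lang == "en":
        return lang, text
    translated = client.translate(text, target_language="en")
    return lang, translated["translatedText"]


fake = FakeClient()
reviews = ["Great stay", "Très bien", "Loved the pool"]
results = [detect_then_translate_with(fake, r) for r in reviews]
print(fake.detect_calls, fake.translate_calls)
```

With two English reviews out of three, the fake client records three detection calls but only one translation call.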

2. Language & Translation Logic
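- Uses the Google Translate API to identify the source language of each review field (e.g., review_title, review_text).
- The full text of each field is processed, and two new columns are created per field: one for the detected language code (e.g., review_detected_lang) and one for the English text (e.g., review_translated). For reviews already in English, the second column holds the original text.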

3. Blank Review Handling
- Skips API calls for empty or whitespace-only reviews to save costs and avoid unnecessary processing.
- The detected language for such fields is marked as no_text.
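
The blank check can be sketched with the standard library alone. The notebook itself relies on `pd.isna`; the helper names `is_blank` and `label_field` below are hypothetical and exist only for this illustration.

```python
import math


def is_blank(value):
    """Return True for None, NaN, or text that contains only whitespace."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return not str(value).strip()


def label_field(value):
    """Return 'no_text' for blank fields, or None when an API call is needed."""
    return "no_text" if is_blank(value) else None
```

A field that returns `None` here would go on to the detection step; anything labelled `no_text` never reaches the API.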

4. Chunked Processing with Resume Logic
- Processes the input CSV file in manageable chunks (default: 200 rows) to keep memory usage low.
- Saves output incrementally after each chunk, creating a checkpoint.
- If the script is interrupted, it automatically detects the last saved row and resumes from that point, preventing data loss and re-processing.
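
The checkpoint-and-resume idea can be sketched with the standard `csv` module. This is a simplified illustration rather than the notebook's pandas implementation; the function names and the `transform` callback are assumptions introduced for the example.

```python
import csv
import os
from itertools import islice


def count_data_rows(path):
    """Count saved records (excluding the header); 0 if the file is missing."""
    if not os.path.exists(path):
        return 0
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def process_in_chunks(input_path, output_path, transform, chunk_size=200):
    """Append transformed rows chunk by chunk, resuming after the last saved row.

    Returns the number of rows that were already present in the output.
    """
    done = count_data_rows(output_path)
    with open(input_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = islice(reader, done, None)  # skip rows saved by an earlier run
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            # Write the header only when the output is new or empty.
            write_header = (
                not os.path.exists(output_path)
                or os.path.getsize(output_path) == 0
            )
            # Each append acts as a checkpoint for the rows processed so far.
            with open(output_path, "a", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst)
                if write_header:
                    writer.writerow(header)
                writer.writerows(transform(row) for row in chunk)
    return done
```

Because `csv.reader` counts records rather than physical lines, a review that contains embedded newlines still counts as one row when resuming.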

5. API Authentication
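- Connects to the Google Cloud API using a service account key.
- The notebook loads the JSON key file explicitly with translate.Client.from_service_account_json. Alternatively, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON key file and create the client with translate.Client().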
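
When the GOOGLE_APPLICATION_CREDENTIALS variable is used, it can help to validate it before creating the client, so a missing key file fails early with a clear message. The helper below is a minimal sketch; the name `resolve_credentials_path` and its `env` parameter are assumptions for illustration and are not part of any Google library.

```python
import os


def resolve_credentials_path(env=None):
    """Return the key-file path from GOOGLE_APPLICATION_CREDENTIALS.

    Raises RuntimeError if the variable is unset or empty, and
    FileNotFoundError if it does not point to an existing file.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "").strip()
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set.")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return path
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.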

6. Memory Management
- Uses Python's garbage collector (gc.collect()) after processing each chunk to release memory, ensuring stability during long runs.

#### Import Libraries

In [63]:
import os
import pandas as pd
from google.cloud import translate_v2 as translate
from tqdm import tqdm
import gc
from concurrent.futures import ThreadPoolExecutor

#### File Path Config

In [66]:
# Update paths and column names
# Google Maps 
INPUT_FILE = "../Data/google_maps_reviews.csv"
OUTPUT_FILE = "../Data/google_maps_reviews_translated_google_api.csv"

# Trip.com
# INPUT_FILE = "../Data/tripcom_reviews.csv"
# OUTPUT_FILE = "../Data/tripcom_reviews_translated_google_api.csv"

# Tripadvisor
# INPUT_FILE = "../Data/tripadvisor_reviews.csv"
# OUTPUT_FILE = "../Data/tripadvisor_reviews_translated_google_api.csv"

# Klook
# INPUT_FILE = "../Data/klook_reviews.csv"
# OUTPUT_FILE = "../Data/klook_reviews_translated_google_api.csv"

# Booking.com
# INPUT_FILE = "../Data/reviews_all_booking_com.csv"
# OUTPUT_FILE = "../Data/reviews_all_booking_com_translated_google_api.csv"


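# Note: because INPUT_FILE and OUTPUT_FILE start with "../", these joined paths
# resolve to ./Data/<file>; the "data" and "output" folders must exist.
INPUT_CSV = os.path.join("data", INPUT_FILE)
OUTPUT_CSV = os.path.join("output", OUTPUT_FILE)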

# Define the columns you want to translate
REVIEW_COL = "review_text"

# Google maps
# REVIEW_COL = "review_text"

# Trip.com
# REVIEW_COL = "review_text"

# Tripadvisor
# TITLE_COL = "review_title"
# REVIEW_COL = "review_text"

# Klook
# REVIEW_COL = "review"

# Booking.com
# TITLE_COL = "review_title"
# POSITIVE_REVIEW_COL = "positive_review"
# NEGATIVE_REVIEW_COL = "negative_review"

CHUNK_SIZE = 200

# Ensure the output directory exists
os.makedirs("output", exist_ok=True)

#### Authentication

In [69]:
# Use your credentials file for authentication
try:
    client = translate.Client.from_service_account_json("credentials/causal-cacao-473308-d4-f7b956a7b6e8.json")
    print("Google Translate API client authenticated successfully.")
except Exception as e:
    print("Authentication Error: Could not connect to Google Cloud.")
    print("Please make sure the path to your credentials JSON file is correct.")
    print(f"Error details: {e}")
    raise  # Re-raise so the notebook stops here; later cells need a working client

Google Translate API client authenticated successfully.


#### Translation Function

In [72]:
def detect_then_translate(text):
    """
    First detects language. Only translates if the language is not English.
    Returns a tuple: (detected_language, translated_text)
    """
    if pd.isna(text) or not str(text).strip():
        return 'no_text', text
    try:
        # Step 1: Make the cheap "detect_language" API call first
        detected_result = client.detect_language(text)
        lang_code = detected_result['language']

        # Step 2: Check if the language is English.
        if lang_code == 'en':
            # If it is English, we're done! Return the original text and save costs
            return 'en', text
        else:
            # Step 3: ONLY if it's not English, make the more expensive "translate" call
            translated_result = client.translate(text, target_language='en')
            return lang_code, translated_result['translatedText']
    except Exception as e:
        print(f"An API error occurred: {e}")
        return 'error', text

#### Single Record Test

In [75]:
# Settings for the single-record test
RUN_SINGLE_TEST = True  # Set to False to skip the test and run the full process
TEST_ROW_INDEX = 121    # The index of the row you want to test

In [77]:
df_test = pd.read_csv(INPUT_CSV)
test_record = df_test.iloc[TEST_ROW_INDEX]
print(test_record)

reviewer_name                                             Trip.com Member
rating                                                              4.6/5
review_text             It's an expensive hotel, but I would recommend...
date                                            5 months ago on\nTrip.com
review_source                                                    Trip.com
extraction_timestamp                                  2025-09-13 10:11:39
Name: 121, dtype: object


In [79]:
if RUN_SINGLE_TEST:
    print("-" * 50)
    print(f"Running a test on a single record (row index: {TEST_ROW_INDEX})...")
    try:
        # Read the full CSV to easily select one row
        df_test = pd.read_csv(INPUT_CSV)
        test_record = df_test.iloc[TEST_ROW_INDEX]

        # Test the TITLE_COL
        # title_text = test_record[TITLE_COL]
        # lang, translated = detect_then_translate(title_text)
        # print(f"\n--- Testing Title ---")
        # print(f"Original: {title_text}")
        # print(f"Detected Language: {lang}")
        # print(f"Translated: {translated}")
        
        # Test the REVIEW_COL
        review_text = test_record[REVIEW_COL]
        lang, translated = detect_then_translate(review_text)
        print("\n--- Testing Review ---")
        print(f"Original: {review_text}")
        print(f"Detected Language: {lang}")
        print(f"Translated: {translated}")

        
        # Test the positive review
        # pos_text = test_record[POSITIVE_REVIEW_COL]
        # lang, translated = detect_then_translate(pos_text)
        # print(f"\n--- Testing Positive Review ---")
        # print(f"Original: {pos_text}")
        # print(f"Detected Language: {lang}")
        # print(f"Translated: {translated}")
        
        # Test the negative review
        # neg_text = test_record[NEGATIVE_REVIEW_COL]
        # lang, translated = detect_then_translate(neg_text)
        # print(f"\n--- Testing Negative Review ---")
        # print(f"Original: {neg_text}")
        # print(f"Detected Language: {lang}")
        # print(f"Translated: {translated}")

        print("-" * 50)
        print("Test complete. The main process will now begin.")

    except Exception as e:
        print(f"\nError during single record test: {e}")
        print("Please check your file path, column names, and the test row index.")
        raise  # Re-raise so the notebook stops before the main process

--------------------------------------------------
Running a test on a single record (row index: 121)...

--- Testing Review ---
Original: It's an expensive hotel, but I would recommend to spend at least 1 night there. You can do an early check-in, which allowed you to use the pool, fitness or observation deck already (it's a temporary card till you can check-in your room) . …
Detected Language: en
Translated: It's an expensive hotel, but I would recommend to spend at least 1 night there. You can do an early check-in, which allowed you to use the pool, fitness or observation deck already (it's a temporary card till you can check-in your room) . …
--------------------------------------------------
Test complete. The main process will now begin.


#### Main Translation Processing

In [82]:
# start_row = 0
# is_first_chunk = True
# if os.path.exists(OUTPUT_CSV):
#     try:
#         start_row = len(pd.read_csv(OUTPUT_CSV, encoding='utf-8-sig'))
#         is_first_chunk = False
#         print(f"Found existing output file. Resuming from row {start_row + 1}...")
#     except Exception as e:
#         print(f"Could not read existing file: {e}. Starting fresh.")
# else:
#     print("No existing output file found. Starting fresh.")

# total_rows = sum(1 for row in open(INPUT_CSV, 'r', encoding='utf-8-sig')) - 1
# remaining_rows = total_rows - start_row

# if remaining_rows <= 0:
#     print("Processing is already complete.")
# else:
#     progress_bar = tqdm(total=remaining_rows, desc="Translating reviews", unit="row")
    
#     csv_iterator = pd.read_csv(
#         INPUT_CSV, 
#         chunksize=CHUNK_SIZE, 
#         encoding="utf-8-sig", 
#         on_bad_lines="skip", 
#         skiprows=range(1, start_row + 1)
#     )

#     # Use a ThreadPoolExecutor to run API calls in parallel
#     # max_workers=10 means up to 10 requests can be sent concurrently
#     with ThreadPoolExecutor(max_workers=10) as executor:
#         for chunk in csv_iterator:
#             try:

#                 # Tripadvisor
#                 # --- Translate Title Column ---
#                 title_texts = chunk[TITLE_COL].fillna("")
#                 title_results = list(executor.map(detect_then_translate, title_texts))
#                 chunk['title_detected_lang'], chunk['title_translated'] = zip(*title_results)

#                 # --- Translate Review Text Column ---
#                 review_texts = chunk[REVIEW_COL].fillna("")
#                 review_results = list(executor.map(detect_then_translate, review_texts))
#                 chunk['review_detected_lang'], chunk['review_translated'] = zip(*review_results)

#                 ##########################
                
#                 # # Booking.com
#                 # # --- Translate Title Column (in parallel) ---
#                 # title_texts = chunk[TITLE_COL].fillna("")
#                 # title_results = list(executor.map(detect_then_translate, title_texts))
#                 # chunk['title_detected_lang'], chunk['title_translated'] = zip(*title_results)

#                 # # --- Translate Positive Review Column (in parallel) ---
#                 # pos_texts = chunk[POSITIVE_REVIEW_COL].fillna("")
#                 # pos_results = list(executor.map(detect_then_translate, pos_texts))
#                 # chunk['positive_detected_lang'], chunk['positive_translated'] = zip(*pos_results)

#                 # # --- Translate Negative Review Column (in parallel) ---
#                 # neg_texts = chunk[NEGATIVE_REVIEW_COL].fillna("")
#                 # neg_results = list(executor.map(detect_then_translate, neg_texts))
#                 # chunk['negative_detected_lang'], chunk['negative_translated'] = zip(*neg_results)

#                 ##########################

                
#                 # --- Save Incrementally (Checkpointing) ---
#                 if is_first_chunk:
#                     chunk.to_csv(OUTPUT_CSV, mode="w", index=False, header=True, encoding='utf-8-sig')
#                     is_first_chunk = False
#                 else:
#                     chunk.to_csv(OUTPUT_CSV, mode="a", index=False, header=False, encoding='utf-8-sig')

#                 progress_bar.update(len(chunk))
#                 gc.collect()
                
#             except Exception as e:
#                 print(f"An error occurred while processing a chunk: {e}")
#                 break
                
#     progress_bar.close()
    
# print(f"Translation complete! Output saved to: {OUTPUT_CSV}")

In [None]:
# Single column - Klook, trip.com

start_row = 0
is_first_chunk = True
if os.path.exists(OUTPUT_CSV):
    try:
        start_row = len(pd.read_csv(OUTPUT_CSV, encoding='utf-8-sig'))
        is_first_chunk = False
        print(f"Found existing output file. Resuming from row {start_row + 1}...")
    except Exception as e:
        print(f"Could not read existing file: {e}. Starting fresh.")
else:
    print("No existing output file found. Starting fresh.")

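# Count lines for the progress bar total (approximate if a field contains embedded newlines)
with open(INPUT_CSV, 'r', encoding='utf-8-sig') as f:
    total_rows = sum(1 for _ in f) - 1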
remaining_rows = total_rows - start_row

if remaining_rows <= 0:
    print("Processing is already complete.")
else:
    progress_bar = tqdm(total=remaining_rows, desc="Translating reviews", unit="row")
    
    csv_iterator = pd.read_csv(
        INPUT_CSV, 
        chunksize=CHUNK_SIZE, 
        encoding="utf-8-sig", 
        on_bad_lines="skip", 
        skiprows=range(1, start_row + 1)
    )

    # Use a ThreadPoolExecutor to run API calls in parallel.
    # A value of 4-8 workers is a safe starting point to avoid rate limits.
    with ThreadPoolExecutor(max_workers=8) as executor:
        for chunk in csv_iterator:
            try:
                # --- Translate the single 'review' column ---
                # Get the original text from the specified column
                review_texts = chunk[REVIEW_COL].fillna("")
                
                # Get the results from the translation function
                results = list(executor.map(detect_then_translate, review_texts))

                # Split the (language, translation) tuples into two new columns
                chunk['review_detected_lang'], chunk['review_translated'] = zip(*results)

                # --- Save Incrementally (Checkpointing) ---
                if is_first_chunk:
                    chunk.to_csv(OUTPUT_CSV, mode="w", index=False, header=True, encoding='utf-8-sig')
                    is_first_chunk = False
                else:
                    chunk.to_csv(OUTPUT_CSV, mode="a", index=False, header=False, encoding='utf-8-sig')

                progress_bar.update(len(chunk))
                gc.collect()
                
            except Exception as e:
                print(f"An error occurred while processing a chunk: {e}")
                break
                
    progress_bar.close()
    
print(f"Translation complete! Output saved to: {OUTPUT_CSV}")
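
The main loop relies on two behaviors: `executor.map` returns results in the same order as its inputs, even when workers finish out of order, and `zip(*results)` transposes a list of tuples into one sequence per column. The sketch below demonstrates both with a hypothetical worker, `slow_label`, which stands in for the API call and deliberately makes shorter inputs finish later.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_label(text):
    """Hypothetical worker: shorter inputs sleep longer, so they finish later."""
    time.sleep(0.01 * (5 - len(text)))
    label = "short" if len(text) < 3 else "long"
    return label, text.upper()


texts = ["a", "bbbb", "cc", "ddddd"]
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, regardless of completion order.
    results = list(executor.map(slow_label, texts))

# Transpose [(label, upper), ...] into one tuple per output column.
labels, uppercased = zip(*results)
print(labels)
print(uppercased)
```

Since the order is preserved, each result lines up with the row it came from, which is what makes the column assignment in the main loop safe.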