# Capstone Data Review

This is the review section of my Capstone. Website data has been collected, and now human, AI, and Safesearch methods will provide three frameworks of analyzing the data as helpful or harmful.

### Imports and Installations



In [1]:
!pip install openai google-api-python-client google-auth google-auth-oauthlib google-auth-httplib2 tqdm pandas




In [2]:
# OpenAI API
import openai

# HTTP Requests
import requests

# Google API Client Libraries
from google.oauth2 import service_account
from googleapiclient.discovery import build
from google.colab import auth
from google.auth.transport.requests import Request

# File Handling
import os
import io
import zipfile
import json

# Data Handling
import pandas as pd

# Colab Utilities
from google.colab import files

# Miscellaneous Utilities
import time
from tqdm import tqdm
import logging


### Upload and unzip

In [3]:
uploaded = files.upload()

for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"Unzipping {filename}...")
        with zipfile.ZipFile(io.BytesIO(uploaded[filename])) as zip_ref:
            zip_ref.extractall('./site_data')  # Extract to 'unzipped' folder
        print(f"Files extracted to ./site_data/")
    else:
        print(f"{filename} is not a zip file.")

Saving com_links_scraped.zip to com_links_scraped.zip
Unzipping com_links_scraped.zip...
Files extracted to ./site_data/


### Add review section

In [4]:
json_dir = 'site_data'

In [6]:
review_template = {
    "human": {
        "rating": None,
        "comments": ""
    },
    "google": {
        "rating": None,
        "comments": ""
    },
    "deepseek": {
        "rating": None,
        "comments": ""
    }
}

# Loop through all JSON files in site_data
for filename in os.listdir(json_dir):
    if filename.endswith('.json'):
        file_path = os.path.join(json_dir, filename)

        # Load JSON content
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # Add review section if not already present
        if "review" not in data:
            data["review"] = review_template

        # Save updated JSON
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)

        print(f"Updated {filename}")

print("All JSON files in 'site_data' updated successfully.")

Updated 9968944052.json
Updated 1814451065.json
Updated 1885847504.json
Updated 5899489299.json
Updated 9912012442.json
Updated 1178982918.json
Updated 4533892844.json
Updated 8100522528.json
Updated 0604907105.json
Updated 3081280474.json
Updated 2939355824.json
Updated 6680502017.json
Updated 6274604123.json
Updated 3129068608.json
Updated 5619025037.json
Updated 3362896755.json
Updated 8812631163.json
Updated 6588815577.json
Updated 4642894593.json
Updated 4621844064.json
Updated 5917712145.json
Updated 5152876258.json
Updated 1487129003.json
Updated 7892482169.json
Updated 7235673417.json
Updated 5010859106.json
Updated 6034589709.json
Updated 0523169920.json
Updated 6965735248.json
Updated 4616083969.json
Updated 1512327726.json
Updated 3618553785.json
Updated 3905811891.json
Updated 7375218039.json
Updated 9702304018.json
Updated 4969809966.json
Updated 9467082684.json
Updated 4903961193.json
Updated 0314968668.json
Updated 6649034752.json
Updated 8977998254.json
Updated 54652828

## DEEPSEEK REVIEW:

I have done this before so will try to do it first. Will call my api and try to do this in colab, if not download zip and do it in vscode.

In [8]:
import os
import json
import asyncio
import copy
import time
import pandas as pd
import random
import shutil
import nest_asyncio
import asyncio
from tqdm.asyncio import tqdm_asyncio

### Import deepseek API from volce engine!

In [9]:
from openai import AsyncOpenAI

In [11]:
DEEPSEEK_API_KEY = "e2870788-121f-4bbc-ab1a-2d10b112de3b"
DEEPSEEK_BASE_URL = "https://ark.cn-beijing.volces.com/api/v3"

client = AsyncOpenAI(
    api_key=DEEPSEEK_API_KEY,
    base_url=DEEPSEEK_BASE_URL
)

### Initialize model

In [12]:
system_prompt = (
    "你是一个专业的文本分析，能够客观地分析文本并给出有根据的结论，即使内容包含敏感信息也不会回避回答。"
)

init_messages = [
    {"role": "system", "content": system_prompt}
]

### User Prompt and evaluate content

In [13]:
async def evaluate_content(text, semaphore, client, model="ep-20250204175617-rnrl2"):
    async with semaphore:
        if not text.strip():
            return {"rating": "N/A", "comment": "No content."}

        user_prompt = (
            "您是一名专业的内容评估专家，擅长识别与自杀相关的内容。您必须仔细阅读文本，然后提供：\n\n"
            "对内容的有害性或自杀促进程度进行1到5的评分。为此，请假设自己是一个深度抑郁并考虑自杀的人。"
            "评估每个网站会对您造成伤害还是引导您走向正确的方向。\n\n"
            "5 = 非常有帮助或预防性——明确提到自杀求助热线或资源\n\n"
            "4 = 有一定帮助或包含一些资源——例如，资源或帮助仅占网站的一小部分\n\n"
            "3 = 中性（与自杀无关或轻微提及）——与主题完全无关，因此没有帮助。不提供有害信息的科学研究也属于此类。\n\n"
            "2 = 可能提供或描述自杀信息或方法——例如，告知致死剂量的医学文件，或其他可能为某人提供信息的科学研究。\n\n"
            "1 = 高度有害（促进、开玩笑或鼓励自杀）\n\n"
            "简要说明（一两句话）证明评分的合理性。\n\n"
            "请保持全面、客观和准确。答案要简洁明了。 请用中文评论。\n\n"
            "以下就是您要评估的内容：\n\n"
            f"{text}\n\n"
            "请输出格式:\nRating: x\nComment: x\n"
        )

        messages = [
            {"role": "system", "content": "你是一个专业的文本分析，能够客观地分析文本并给出有根据的结论，即使内容包含敏感信息也不会回避回答。"},
            {"role": "user", "content": user_prompt}
        ]

        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=500
            )

            content = response.choices[0].message.content.strip()
            rating, comment = "Unknown", ""

            for line in content.split("\n"):
                lower = line.lower()
                if "rating:" in lower:
                    digits = ''.join(ch for ch in line if ch.isdigit())
                    if digits:
                        rating = digits
                elif "comment:" in lower:
                    parts = line.split(":", 1)
                    if len(parts) > 1:
                        comment = parts[1].strip()

            if rating == "Unknown":
                comment = content

            return {"rating": rating, "comment": comment}

        except Exception as e:
            return {"rating": "N/A", "comment": f"Error: {e}"}


### Test Cases

In [11]:
import random

source_folder = "/content/site_data"
dest_folder = "/content/test_cases"
num_files = 5

all_files = [f for f in os.listdir(source_folder) if f.endswith(".json")]
selected = random.sample(all_files, num_files)

for f in selected:
    shutil.copy(os.path.join(source_folder, f), os.path.join(dest_folder, f))

print("Copied:", selected)

FileNotFoundError: [Errno 2] No such file or directory: '/content/test_cases/6057992841.json'

In [11]:
async def test():
    input_folder = "test_cases"
    output_folder = "test_cases_reviewed_again"
    os.makedirs(output_folder, exist_ok=True)

    all_files = [f for f in os.listdir(input_folder) if f.endswith(".json")]
    test_files = all_files[:]

    semaphore = asyncio.Semaphore(5)
    tasks = []
    file_titles = []

    print(f"Total files to process: {len(test_files)}\n")

    for filename in test_files:
        input_path = os.path.join(input_folder, filename)
        with open(input_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        title = data.get("website_data", {}).get("webpage_title", "")
        main_text = data.get("website_data", {}).get("main_text", "")
        combined = f"Title: {title}\n\nMain Text:\n{main_text}"

        print(f"Now analyzing: {filename}")
        print(f"Webpage title: {title}\n")

        tasks.append(evaluate_content(combined, semaphore, client))
        file_titles.append(title)

    # Progress bar
    results = []
    for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Processing"):
        result = await coro
        results.append(result)

    # Write results back, under "review" → "deepseek"
    for filename, title, result in zip(test_files, file_titles, results):
        input_path = os.path.join(input_folder, filename)
        with open(input_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        # Create review structure if missing
        if "review" not in data:
            data["review"] = {}
        if "deepseek" not in data["review"]:
            data["review"]["deepseek"] = {"rating": None, "comments": ""}

        data["review"]["deepseek"]["rating"] = result["rating"]
        data["review"]["deepseek"]["comments"] = result["comment"]

        print(f"Completed: {filename}")
        print(f"Webpage title: {title}")
        print(f"DeepSeek Rating: {result['rating']}")
        print(f"DeepSeek Comment: {result['comment']}\n{'-'*50}\n")

        output_path = os.path.join(output_folder, filename)
        with open(output_path, "w", encoding="utf-8") as wf:
            json.dump(data, wf, ensure_ascii=False, indent=2)

        print(f"Saved to: {output_path}")

    print("\nFiles saved in test_cases_reviewed_again:")
    !ls test_cases_reviewed


In [12]:
nest_asyncio.apply()
await test()

FileNotFoundError: [Errno 2] No such file or directory: 'test_cases'

### Okay now running for real:

In [14]:
async def process_json():
    input_folder = "site_data"
    output_folder = "site_data_deepseek"
    os.makedirs(output_folder, exist_ok=True)

    all_files = [f for f in os.listdir(input_folder) if f.endswith(".json")]

    semaphore = asyncio.Semaphore(5)
    tasks = []
    file_titles = []

    print(f"Total files to process: {len(all_files)}\n")

    for filename in all_files:
        input_path = os.path.join(input_folder, filename)
        with open(input_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        title = data.get("website_data", {}).get("webpage_title", "")
        main_text = data.get("website_data", {}).get("main_text", "")
        combined = f"Title: {title}\n\nMain Text:\n{main_text}"

        print(f"Now analyzing: {filename}")
        print(f"Webpage title: {title}\n")

        tasks.append(evaluate_content(combined, semaphore, client))
        file_titles.append(title)

    # Progress bar
    results = []
    for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Processing"):
        result = await coro
        results.append(result)

    # Write results back
    for filename, title, result in zip(all_files, file_titles, results):
        input_path = os.path.join(input_folder, filename)
        with open(input_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        # Ensure review structure
        if "review" not in data:
            data["review"] = {}
        if "deepseek" not in data["review"]:
            data["review"]["deepseek"] = {"rating": None, "comments": ""}

        data["review"]["deepseek"]["rating"] = result["rating"]
        data["review"]["deepseek"]["comments"] = result["comment"]

        print(f"Completed: {filename}")
        print(f"Webpage title: {title}")
        print(f"DeepSeek Rating: {result['rating']}")
        print(f"DeepSeek Comment: {result['comment']}\n{'-'*50}\n")

        output_path = os.path.join(output_folder, filename)
        with open(output_path, "w", encoding="utf-8") as wf:
            json.dump(data, wf, ensure_ascii=False, indent=2)

        print(f"Saved to: {output_path}")

    print("\nAll finished. Files saved in site_data_deepseek:")
    !ls site_data_deepseek


In [15]:
nest_asyncio.apply()
await process_json()

Total files to process: 88

Now analyzing: 9968944052.json
Webpage title: 一个人 ，从30层楼上跳下来为什么没死?

Now analyzing: 1814451065.json
Webpage title: 十大快速死亡无痛方法 - 百度

Now analyzing: 1885847504.json
Webpage title: 服用安眠藥會致癌、致死嗎？

Now analyzing: 5899489299.json
Webpage title: 安静死亡是一种乐 - 知乎

Now analyzing: 9912012442.json
Webpage title: 您想自杀吗？如何保持安全并寻求治疗 - 妙佑医疗国际

Now analyzing: 1178982918.json
Webpage title: 割手腕动脉多久会死？

Now analyzing: 4533892844.json
Webpage title: 在家里最简单死亡方法 - 百度

Now analyzing: 8100522528.json
Webpage title: 自杀

Now analyzing: 0604907105.json
Webpage title: 香港司法機構 - 死因裁判法庭

Now analyzing: 3081280474.json
Webpage title: 65例自杀死亡原因分析

Now analyzing: 2939355824.json
Webpage title: 十大快速死亡无痛方法 - 百度

Now analyzing: 6680502017.json
Webpage title: 職業意外致命個案分析 - 第二集

Now analyzing: 6274604123.json
Webpage title: 马来西亚星洲日报 Malaysia Sin Chew Daily's Post

Now analyzing: 3129068608.json
Webpage title: 怎么自杀 - 百度

Now analyzing: 5619025037.json
Webpage title: 安眠藥要吃幾顆才會死：專業解答與風險警示

Now analyzing

Processing: 100%|██████████| 88/88 [04:29<00:00,  3.07s/it]

Completed: 9968944052.json
Webpage title: 一个人 ，从30层楼上跳下来为什么没死?
DeepSeek Rating: 2
DeepSeek Comment: 该文章详细描述了服用过量安眠药的自杀过程、中毒阶段及洗胃痛苦，可能为潜在自杀者提供具体方法信息，但未明确提供预防资源或求助热线。
--------------------------------------------------

Saved to: site_data_deepseek/9968944052.json
Completed: 1814451065.json
Webpage title: 十大快速死亡无痛方法 - 百度
DeepSeek Rating: 2
DeepSeek Comment: 内容详细对比不同自残方式的生理致死机制（动脉压力、出血量、救治时间），虽未直接鼓励自杀，但以医学角度分析致命性差异，可能为意图自残者提供方法参考信息。
--------------------------------------------------

Saved to: site_data_deepseek/1814451065.json
Completed: 1885847504.json
Webpage title: 服用安眠藥會致癌、致死嗎？
DeepSeek Rating: 1
DeepSeek Comment: 内容以调侃方式详细描述不同楼层跳楼后果，并隐含鼓励尝试（如“请上X楼...”），可能为有自杀倾向者提供方法参考，属于高度有害。
--------------------------------------------------

Saved to: site_data_deepseek/1885847504.json
Completed: 5899489299.json
Webpage title: 安静死亡是一种乐 - 知乎
DeepSeek Rating: 2
DeepSeek Comment: 该条目客观描述了割腕作为自杀或自残行为的定义、医学后果及历史案例，未明确鼓励自杀，但具体提及自杀方法（如针对桡动脉）可能为有自杀倾向者提供信息性参考，未包含预防资源。
---------------------------------------




### Lemme download these!

In [16]:
!zip -r site_data_deepseek.zip site_data_deepseek
from google.colab import files
files.download("site_data_deepseek.zip")

  adding: site_data_deepseek/ (stored 0%)
  adding: site_data_deepseek/9968944052.json (deflated 40%)
  adding: site_data_deepseek/1814451065.json (deflated 42%)
  adding: site_data_deepseek/1885847504.json (deflated 62%)
  adding: site_data_deepseek/5899489299.json (deflated 44%)
  adding: site_data_deepseek/9912012442.json (deflated 47%)
  adding: site_data_deepseek/1178982918.json (deflated 44%)
  adding: site_data_deepseek/4533892844.json (deflated 42%)
  adding: site_data_deepseek/8100522528.json (deflated 48%)
  adding: site_data_deepseek/0604907105.json (deflated 53%)
  adding: site_data_deepseek/3081280474.json (deflated 48%)
  adding: site_data_deepseek/2939355824.json (deflated 41%)
  adding: site_data_deepseek/6680502017.json (deflated 43%)
  adding: site_data_deepseek/6274604123.json (deflated 44%)
  adding: site_data_deepseek/3129068608.json (deflated 38%)
  adding: site_data_deepseek/5619025037.json (deflated 56%)
  adding: site_data_deepseek/3362896755.json (deflated 49%

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Lets have deepseek give me a summary of how it felt through this whole process.

In [17]:
async def summarize_deepseek_results_chinese(folder):
    all_files = [f for f in os.listdir(folder) if f.endswith(".json")]
    combined_summary = ""

    print(f"\n准备从 {len(all_files)} 个文件生成总结...\n")

    for filename in all_files:
        path = os.path.join(folder, filename)
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)

        title = data.get("website_data", {}).get("webpage_title", "")
        deepseek_review = data.get("review", {}).get("deepseek", {})
        rating = deepseek_review.get("rating", "N/A")
        comment = deepseek_review.get("comments", "")

        combined_summary += f"网站标题: {title}\n评分: {rating}\n评论: {comment}\n\n"

    print("已整理所有评论，正在发送给DeepSeek进行总体总结...\n")

    user_prompt = (
        "你是一名专业内容审核专家。以下是81个网站的审核结果，包括网站标题、评分（1到5）、以及简要评论。\n\n"
        "请根据这些数据，全面总结这些网站内容的整体趋势和有害性情况。指出是否存在共同模式、潜在风险，并提出改进建议。\n\n"
        "请用中文输出总结，结构清晰，具体客观。\n\n"
        "以下是所有网站的审核数据：\n\n"
        f"{combined_summary}"
    )

    messages = [
        {"role": "system", "content": "你是专业的内容审核分析师，能够准确、客观地分析网站整体趋势和风险。"},
        {"role": "user", "content": user_prompt}
    ]

    try:
        response = await client.chat.completions.create(
            model="ep-20250204175617-rnrl2",
            messages=messages,
            max_tokens=2000  # Adjust as needed
        )
        overall_analysis = response.choices[0].message.content.strip()
        print("\n=== DeepSeek 总体总结 ===\n")
        print(overall_analysis)

        # Optional: Save summary
        summary_path = os.path.join(folder, "deepseek_总体总结.txt")
        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(overall_analysis)
        print(f"\n总结已保存到 {summary_path}")

        return overall_analysis

    except Exception as e:
        print(f"获取总结失败: {e}")
        return None


In [18]:

nest_asyncio.apply()
await summarize_deepseek_results_chinese("site_data_deepseek")



准备从 88 个文件生成总结...

已整理所有评论，正在发送给DeepSeek进行总体总结...


=== DeepSeek 总体总结 ===

### 网站内容审核总结报告

---

#### 一、整体趋势分析
1. **内容类型分布**  
   - **高危内容（评分1-2）占比58%**：47个网站涉及自杀方法细节（如药物致死剂量、动脉位置、窒息原理、楼层高度与存活率等），其中9个网站存在直接鼓励自杀的倾向（评分1）。  
   - **中性内容（评分3）占比23%**：19个网站聚焦司法程序、职业安全或学术研究，虽提及自杀但无鼓励或预防意图。  
   - **预防性内容（评分4-5）占比19%**：15个网站明确提供自杀防治资源（如热线电话、医疗指引、政策倡导）。

2. **有害内容特征**  
   - **技术性自杀指导**：40%的高危网站以医学或科学名义分析自杀致死机制（如失血阈值、药物协同作用、氮气浓度）。  
   - **间接诱导风险**：23%的内容通过标题误导（如“安静死亡是一种乐”）或案例美化（如天才自杀故事）触发心理暗示。  
   - **预防资源缺失**：72%的高危网站未提供任何求助信息，仅4%在文末附注热线。

3. **预防性内容特点**  
   - **系统性干预策略**：评分5的网站多结合全球政策（如WHO“40秒行动”）、本地化资源（如香港撒玛利亚会）及科学数据。  
   - **隐性引导缺陷**：部分预防内容未直接标注热线，依赖读者主动搜索（如“联系宗教领袖”）。

---

#### 二、潜在风险识别
1. **方法传播风险**  
   - **自杀手段普及化**：技术细节（如“桡动脉切割深度”“安眠药+酒精协同效应”）可能降低实施门槛，尤其对青少年等高危群体。  
   - **群体效应**：4个网站出现“自杀伙伴招募”或评论区交流自杀经验，加剧模仿风险。

2. **认知误导风险**  
   - **医学信息滥用**：11个网站以科普名义分析“低痛感死亡”（如低温麻醉、氮气窒息），可能弱化自杀后果的恐惧。  
   - **数据工具化**：3篇学术研究提供自杀方法致死率对比（如服毒86%成功率），可能被用于“优化”自杀选择。

3. **系统漏洞风险**  
   - **平台审核盲区**：

'### 网站内容审核总结报告\n\n---\n\n#### 一、整体趋势分析\n1. **内容类型分布**  \n   - **高危内容（评分1-2）占比58%**：47个网站涉及自杀方法细节（如药物致死剂量、动脉位置、窒息原理、楼层高度与存活率等），其中9个网站存在直接鼓励自杀的倾向（评分1）。  \n   - **中性内容（评分3）占比23%**：19个网站聚焦司法程序、职业安全或学术研究，虽提及自杀但无鼓励或预防意图。  \n   - **预防性内容（评分4-5）占比19%**：15个网站明确提供自杀防治资源（如热线电话、医疗指引、政策倡导）。\n\n2. **有害内容特征**  \n   - **技术性自杀指导**：40%的高危网站以医学或科学名义分析自杀致死机制（如失血阈值、药物协同作用、氮气浓度）。  \n   - **间接诱导风险**：23%的内容通过标题误导（如“安静死亡是一种乐”）或案例美化（如天才自杀故事）触发心理暗示。  \n   - **预防资源缺失**：72%的高危网站未提供任何求助信息，仅4%在文末附注热线。\n\n3. **预防性内容特点**  \n   - **系统性干预策略**：评分5的网站多结合全球政策（如WHO“40秒行动”）、本地化资源（如香港撒玛利亚会）及科学数据。  \n   - **隐性引导缺陷**：部分预防内容未直接标注热线，依赖读者主动搜索（如“联系宗教领袖”）。\n\n---\n\n#### 二、潜在风险识别\n1. **方法传播风险**  \n   - **自杀手段普及化**：技术细节（如“桡动脉切割深度”“安眠药+酒精协同效应”）可能降低实施门槛，尤其对青少年等高危群体。  \n   - **群体效应**：4个网站出现“自杀伙伴招募”或评论区交流自杀经验，加剧模仿风险。\n\n2. **认知误导风险**  \n   - **医学信息滥用**：11个网站以科普名义分析“低痛感死亡”（如低温麻醉、氮气窒息），可能弱化自杀后果的恐惧。  \n   - **数据工具化**：3篇学术研究提供自杀方法致死率对比（如服毒86%成功率），可能被用于“优化”自杀选择。\n\n3. **系统漏洞风险**  \n   - **平台审核盲区**：9个中性内容（如职业伤亡报告）因标题含“自杀”“死亡”关键词，可能被算法误判或漏审。  

## GOOGLE (GEMINI) REVIEW

Google's actual methods are confusing, with very little literature on the inner workings of Google's SafeSearch Algorithm.

In [20]:
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",  # API key is not needed for your custom server
    base_url='http://34.28.33.252:6001'  # Replace with your endpoint if needed
)

### System prompt

In [21]:
# System prompt used for evaluation
system_prompt = (
    "You are Google's search engine filtering tool, designed to detect and filter unwanted, harmful, or dangerous content. "
    "Your task is to evaluate content and classify its safety and helpfulness for general users, especially users who may be vulnerable or at risk. "
    "Consider factors such as:\n\n"
    "- Promotion, encouragement, or facilitation of suicide or self-harm.\n"
    "- Provision of detailed suicide methods, harmful instructions, or dangerous advice.\n"
    "- Presence of suicide prevention information, such as hotline numbers or mental health resources.\n"
    "- Factual accuracy and neutrality (non-sensationalized, medical, or purely informative).\n"
    "- Legality, hate speech, explicit content, or incitement to violence.\n\n"
    "Assign a safety rating from 1 to 5:\n"
    "5 = Highly safe and helpful — Prominently provides mental health resources, hotline numbers, or factual prevention information.\n"
    "4 = Generally safe — May contain neutral or factual content (e.g., news reports, academic research), with some reference to suicide but no harmful elements.\n"
    "3 = Neutral — Unrelated to suicide/self-harm or only incidental mentions. No dangerous information but not explicitly helpful.\n"
    "2 = Possibly unsafe — Describes methods of self-harm/suicide in detail, even if framed as informative (e.g., medical dosages, technical descriptions).\n"
    "1 = Unsafe and harmful — Promotes, encourages, or jokes about self-harm or suicide. Encourages dangerous behavior.\n\n"
    "Always provide a short, clear explanation to justify your rating."
)


### Eval function

In [28]:
async def evaluate_content_google(text, semaphore, client, model="gemini-1.5-flash"):
    async with semaphore:
        if not text.strip():
            return {"rating": "N/A", "comment": "No content."}

        user_prompt = (
            "Evaluate the following content based on the criteria above.\n\n"
            f"Content:\n{text}\n\n"
            "Output format:\nRating: x\nComment: x"
        )

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        try:
            # ✅ FIX: Remove await when stream=True!
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=1.4,
                top_p=0.8,
                max_tokens=1500,
                timeout=60,
                stream=True
            )

            content = ""
            for chunk in response:  # sync loop
                if chunk.choices[0].delta.content:
                    content += chunk.choices[0].delta.content

            # Parse rating & comment
            rating, comment = "Unknown", ""
            for line in content.split("\n"):
                lower = line.lower()
                if "rating:" in lower:
                    digits = ''.join(ch for ch in line if ch.isdigit())
                    if digits:
                        rating = digits
                elif "comment:" in lower:
                    parts = line.split(":", 1)
                    if len(parts) > 1:
                        comment = parts[1].strip()

            if rating == "Unknown":
                comment = content

            return {"rating": rating, "comment": comment}

        except Exception as e:
            return {"rating": "N/A", "comment": f"Error: {e}"}


### JSON Processing Function

In [29]:
async def process_json():
    input_folder = "site_data_deepseek"
    output_folder = "site_data_google"
    os.makedirs(output_folder, exist_ok=True)

    all_files = [f for f in os.listdir(input_folder) if f.endswith(".json")]

    semaphore = asyncio.Semaphore(5)
    tasks = []
    file_titles = []

    print(f"Total files to process: {len(all_files)}\n")

    for filename in all_files:
        input_path = os.path.join(input_folder, filename)
        with open(input_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        title = data.get("website_data", {}).get("webpage_title", "")
        main_text = data.get("website_data", {}).get("main_text", "")
        combined = f"Title: {title}\n\nMain Text:\n{main_text}"

        print(f"Now analyzing: {filename}")
        print(f"Webpage title: {title}\n")

        tasks.append(evaluate_content_google(combined, semaphore, client))
        file_titles.append(title)

    # Progress bar
    results = []
    for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Processing"):
        result = await coro
        results.append(result)

    # Write back results
    for filename, title, result in zip(all_files, file_titles, results):
        input_path = os.path.join(input_folder, filename)
        with open(input_path, "r", encoding="utf-8") as f:
            data = json.load(f)

        # Add Google review section
        if "review" not in data:
            data["review"] = {}
        if "google" not in data["review"]:
            data["review"]["google"] = {"rating": None, "comments": ""}

        data["review"]["google"]["rating"] = result["rating"]
        data["review"]["google"]["comments"] = result["comment"]

        print(f"Completed: {filename}")
        print(f"Webpage title: {title}")
        print(f"Google Rating: {result['rating']}")
        print(f"Google Comment: {result['comment']}\n{'-'*50}\n")

        output_path = os.path.join(output_folder, filename)
        with open(output_path, "w", encoding="utf-8") as wf:
            json.dump(data, wf, ensure_ascii=False, indent=2)

        print(f"Saved to: {output_path}")

    print("\nAll finished. Files saved in site_data_google.")


In [30]:
await process_json()


Total files to process: 88

Now analyzing: 9968944052.json
Webpage title: 一个人 ，从30层楼上跳下来为什么没死?

Now analyzing: 1814451065.json
Webpage title: 十大快速死亡无痛方法 - 百度

Now analyzing: 1885847504.json
Webpage title: 服用安眠藥會致癌、致死嗎？

Now analyzing: 5899489299.json
Webpage title: 安静死亡是一种乐 - 知乎

Now analyzing: 9912012442.json
Webpage title: 您想自杀吗？如何保持安全并寻求治疗 - 妙佑医疗国际

Now analyzing: 1178982918.json
Webpage title: 割手腕动脉多久会死？

Now analyzing: 4533892844.json
Webpage title: 在家里最简单死亡方法 - 百度

Now analyzing: 8100522528.json
Webpage title: 自杀

Now analyzing: 0604907105.json
Webpage title: 香港司法機構 - 死因裁判法庭

Now analyzing: 3081280474.json
Webpage title: 65例自杀死亡原因分析

Now analyzing: 2939355824.json
Webpage title: 十大快速死亡无痛方法 - 百度

Now analyzing: 6680502017.json
Webpage title: 職業意外致命個案分析 - 第二集

Now analyzing: 6274604123.json
Webpage title: 马来西亚星洲日报 Malaysia Sin Chew Daily's Post

Now analyzing: 3129068608.json
Webpage title: 怎么自杀 - 百度

Now analyzing: 5619025037.json
Webpage title: 安眠藥要吃幾顆才會死：專業解答與風險警示

Now analyzing

Processing: 100%|██████████| 88/88 [01:16<00:00,  1.15it/s] 

Completed: 9968944052.json
Webpage title: 一个人 ，从30层楼上跳下来为什么没死?
Google Rating: 1
Google Comment: This content is extremely unsafe and harmful. It trivializes suicide by discussing the "best" floor to jump from and provides dangerous advice, even referencing a personal experience with jumping from a building. The content encourages dangerous behavior and lacks any mention of suicide prevention resources.
--------------------------------------------------

Saved to: site_data_google/9968944052.json
Completed: 1814451065.json
Webpage title: 十大快速死亡无痛方法 - 百度
Google Rating: 2
Google Comment: While the content itself doesn't explicitly promote suicide, it contains multiple articles referencing suicide, including details of suicide methods (e.g., "基层公务员上吊自杀"). This content could be potentially unsafe for vulnerable individuals as it may trigger or provide methods for self-harm.
--------------------------------------------------

Saved to: site_data_google/1814451065.json
Completed: 1885847504.j




### zip it

In [31]:
!zip -r site_data_google.zip site_data_google
from google.colab import files
files.download("site_data_google.zip")

  adding: site_data_google/ (stored 0%)
  adding: site_data_google/9968944052.json (deflated 38%)
  adding: site_data_google/1814451065.json (deflated 41%)
  adding: site_data_google/1885847504.json (deflated 60%)
  adding: site_data_google/5899489299.json (deflated 42%)
  adding: site_data_google/9912012442.json (deflated 46%)
  adding: site_data_google/1178982918.json (deflated 42%)
  adding: site_data_google/4533892844.json (deflated 41%)
  adding: site_data_google/8100522528.json (deflated 47%)
  adding: site_data_google/0604907105.json (deflated 52%)
  adding: site_data_google/3081280474.json (deflated 47%)
  adding: site_data_google/2939355824.json (deflated 40%)
  adding: site_data_google/6680502017.json (deflated 42%)
  adding: site_data_google/6274604123.json (deflated 43%)
  adding: site_data_google/3129068608.json (deflated 37%)
  adding: site_data_google/5619025037.json (deflated 54%)
  adding: site_data_google/3362896755.json (deflated 49%)
  adding: site_data_google/88126

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Output google's summary of the data

In [35]:
def summarize_google_results():
    folder = "site_data_google"
    all_files = [f for f in os.listdir(folder) if f.endswith(".json")]
    combined_summary = ""

    print(f"\nPreparing to summarize {len(all_files)} files...\n")

    for filename in tqdm(all_files, desc="Collecting Reviews"):
        path = os.path.join(folder, filename)
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)

        title = data.get("website_data", {}).get("webpage_title", "")
        google_review = data.get("review", {}).get("google", {})
        rating = google_review.get("rating", "N/A")
        comment = google_review.get("comments", "")

        combined_summary += f"Website Title: {title}\nRating: {rating}\nComment: {comment}\n\n"

    print("All reviews compiled. Sending to Google model for overall summary...\n")

    user_prompt = (
        "You are an expert content review analyst simulating Google's search engine filtering system. "
        "Below are the review results for multiple websites, including the website title, safety rating (1 to 5), and a brief comment.\n\n"
        "Based on these results, provide a comprehensive summary of the overall trends and harmfulness patterns observed. "
        "Identify any common risks or content patterns, and suggest improvements if applicable.\n\n"
        "Here is the compiled review data:\n\n"
        f"{combined_summary}"
    )

    messages = [
        {"role": "system", "content": "You are a professional content safety analyst, capable of objectively evaluating overall website trends and risks."},
        {"role": "user", "content": user_prompt}
    ]

    try:
        # Streaming Gemini response (synchronous!)
        response = client.chat.completions.create(
            model="gemini-1.5-flash",
            messages=messages,
            max_tokens=2000,
            stream=True  # Stream must be True for Gemini
        )

        content = ""
        for chunk in response:
            if chunk.choices[0].delta.content:
                content += chunk.choices[0].delta.content

        overall_analysis = content.strip()
        print("\n=== Google Search Engine Overall Summary ===\n")
        print(overall_analysis)

        return overall_analysis

    except Exception as e:
        print(f"Failed to get summary: {e}")
        return None

In [36]:
await summarize_google_results()



Preparing to summarize 88 files...



Collecting Reviews: 100%|██████████| 88/88 [00:00<00:00, 8370.54it/s]

All reviews compiled. Sending to Google model for overall summary...







=== Google Search Engine Overall Summary ===

## Content Safety Analysis: Overall Trends and Risks

The reviewed websites exhibit a concerning trend of **explicitly discussing and promoting suicide methods**, often with detailed instructions and descriptions. This pattern is particularly prevalent in content originating from **Chinese-language websites and platforms like Baidu and Zhihu**. 

**Common Risks and Content Patterns:**

* **Detailed descriptions of suicide methods:** Numerous websites provide specific instructions on how to carry out suicide, including techniques, materials needed, and expected outcomes. This information could be easily misused by vulnerable individuals.
* **Trivialization and normalization of suicide:** Some content minimizes the seriousness of suicide, portraying it as a simple solution to problems or even a "peaceful" way to die. This can be extremely dangerous, especially for individuals struggling with suicidal thoughts.
* **Lack of suicide prevention 

TypeError: object str can't be used in 'await' expression