In [1]:
import pandas as pd
from deep_translator import GoogleTranslator

# Load your full dataset
df = pd.read_csv("1900rows_data.csv")

# Rename for consistency
df = df.rename(columns={"text": "phrase", "LABEL": "label"})

# Sample a few rows for zero-shot testing
sample_df = df.sample(10, random_state=42).reset_index(drop=True)

In [2]:
# Create the translator once and reuse it for every row
translator = GoogleTranslator(source="en", target="zh-CN")
sample_df["translated_phrase"] = sample_df["phrase"].apply(translator.translate)

In [3]:
for _, row in sample_df.iterrows():
    print(f"\nEnglish: {row['phrase']}")
    print(f"中文翻译: {row['translated_phrase']}")


English: I've been feeling really weak in my muscles and my neck has been really stiff. My joints have been swelling up and it's hard for me to move around without feeling stiff. Walking has been really painful too.
中文翻译: 我的肌肉感觉真的很虚弱，脖子真的很僵硬。我的关节一直在肿胀，我很难四处走动而不会感到僵硬。步行也确实很痛苦。

English: when i extend my leg there is pain in knee joint
中文翻译: 当我伸出腿时，膝关节会疼痛

English: My bloody stools have caused me to lose a lot of things, including iron and bloos. I now have anaemia as a result, and I typically feel rather weak.
中文翻译: 我的血腥凳子使我失去了很多东西，包括铁和蓝色。结果，我现在患有贫血，通常我会感到很虚弱。

English: My skin is red and scratchy. These can occasionally flake. My cheeks and lips swell, which is really annoying. I occasionally have headaches and runny eyes because to the puffing.
中文翻译: 我的皮肤是红色的。这些偶尔会剥落。我的脸颊和嘴唇肿胀，这真的很烦人。我偶尔会头痛和流鼻涕，因为浮肿。

English: I have a cut on my foot that became infected from using the showers at the gym.
中文翻译: 我的脚割伤了，由于在健身房使用淋浴而被感染。

English: I have been experiencing symptoms such as a headache, che

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model. "path_to_your_model" is a placeholder: replace it with
# your fine-tuned model path, and load the tokenizer that matches that model.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("path_to_your_model")

# Tokenize Chinese input
inputs = tokenizer(
    sample_df["translated_phrase"].tolist(),
    return_tensors="pt",
    truncation=True,
    padding=True
)

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    preds = torch.argmax(outputs.logits, dim=1)


In [9]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load your fine-tuned model
model_path = "bert_output"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Tokenize Chinese text (zero-shot)
inputs = tokenizer(
    sample_df["translated_phrase"].tolist(),
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

In [10]:
# Your label mapping
id2label = {
    0: "Neurological & General Symptoms",
    1: "Dermatological & Skin Conditions",
    2: "Chronic Conditions",
    3: "Infections",
    4: "Pain-related Conditions",
    5: "Respiratory & Sensory Issues",
    6: "Gastrointestinal Conditions",
    7: "Allergic/Immunologic Reactions",
    8: "Hepatobiliary",
    9: "Trauma/Injuries",
}

# Add predictions to your dataframe
sample_df["predicted_label"] = [id2label[p.item()] for p in predictions]

# Show results
sample_df[["phrase", "translated_phrase", "label", "predicted_label"]]

Unnamed: 0,phrase,translated_phrase,label,predicted_label
0,I've been feeling really weak in my muscles an...,我的肌肉感觉真的很虚弱，脖子真的很僵硬。我的关节一直在肿胀，我很难四处走动而不会感到僵硬。步...,Chronic Conditions,Neurological & General Symptoms
1,when i extend my leg there is pain in knee joint,当我伸出腿时，膝关节会疼痛,Pain-related Conditions,Neurological & General Symptoms
2,My bloody stools have caused me to lose a lot ...,我的血腥凳子使我失去了很多东西，包括铁和蓝色。结果，我现在患有贫血，通常我会感到很虚弱。,Gastrointestinal Conditions,Neurological & General Symptoms
3,My skin is red and scratchy. These can occasio...,我的皮肤是红色的。这些偶尔会剥落。我的脸颊和嘴唇肿胀，这真的很烦人。我偶尔会头痛和流鼻涕，因...,Allergic/Immunologic Reactions,Neurological & General Symptoms
4,I have a cut on my foot that became infected f...,我的脚割伤了，由于在健身房使用淋浴而被感染。,Infections,Neurological & General Symptoms
5,I have been experiencing symptoms such as a he...,我一直在遇到症状，例如头痛，胸痛，头晕，平衡丧失和困难。,Chronic Conditions,Neurological & General Symptoms
6,My knee hurts when I walk,我走路时膝盖疼,Pain-related Conditions,Neurological & General Symptoms
7,It feels like I can't take a deep breath,感觉我不能深吸一口气,Respiratory & Sensory Issues,Neurological & General Symptoms
8,I have a cut that is red and swollen.,我有一个红色和肿胀的切口。,Infections,Neurological & General Symptoms
9,"I have a high temperature, vomiting, chills, a...",我有高温，呕吐，发冷和严重的瘙痒。此外，我一直在说话很多，头痛。我也因恶心和肌肉疼痛而困扰。,Infections,Neurological & General Symptoms
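
A quick way to quantify a table like the one above is to compute the accuracy and the distribution of predicted classes. The sketch below assumes a dataframe with the `label` and `predicted_label` columns built earlier; the helper name `summarize_predictions` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_predictions(df, true_col="label", pred_col="predicted_label"):
    """Return (accuracy, counts of each predicted class) for a results dataframe."""
    # Fraction of rows where the predicted label equals the reference label
    accuracy = float((df[true_col] == df[pred_col]).mean())
    # How often each class was predicted; a single dominant class hints at a collapse
    counts = df[pred_col].value_counts()
    return accuracy, counts


# Example usage with the notebook's dataframe (assumed to exist):
# accuracy, counts = summarize_predictions(sample_df)
# print(f"accuracy = {accuracy:.2f}")
# print(counts)
```

If `counts` contains a single class, the model is ignoring the input rather than merely making mistakes on it.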


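**All ten Chinese-translated inputs were classified as Neurological & General Symptoms. The likely cause is the English-only tokenizer (`bert-base-uncased`), which cannot represent most Chinese characters, so the model receives almost no usable signal from these inputs. This suggests that zero-shot inference across entirely different scripts (e.g., Latin → Chinese) is not feasible with a monolingual model, although a ten-row sample cannot confirm it on its own.**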
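
One way to check the tokenizer explanation is to measure how many tokens come out as the unknown token. The following is a minimal sketch, assuming a Hugging Face style tokenizer that exposes `tokenize()` and `unk_token`; the helper name `unk_fraction` is hypothetical and not part of the original notebook.

```python
from typing import Iterable


def unk_fraction(tokenizer, texts: Iterable[str]) -> float:
    """Return the share of produced tokens that equal the tokenizer's unknown token."""
    total = 0
    unknown = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total += len(tokens)
        unknown += sum(1 for tok in tokens if tok == tokenizer.unk_token)
    # Avoid division by zero when no text (or no token) was supplied
    return unknown / total if total else 0.0


# Example usage (downloads vocabularies from the Hugging Face Hub):
# from transformers import BertTokenizer
# en_tok = BertTokenizer.from_pretrained("bert-base-uncased")
# ml_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# texts = sample_df["translated_phrase"].tolist()
# print(unk_fraction(en_tok, texts), unk_fraction(ml_tok, texts))
```

A much higher unknown-token share for the English tokenizer than for the multilingual one would support the explanation above; similar shares would point to another cause.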

## bert-base-multilingual-cased

In [11]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load multilingual BERT
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
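# Note: this checkpoint ships without a classification head, so the 10-class head
# is randomly initialized and needs fine-tuning before its predictions are meaningful.
model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=10
)  # 10 = number of your medical classes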
model.eval()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1