**Topic:** Sentiment analysis of internet texts using Transformer-based networks

**Introduction:** Sentiment analysis is a natural language processing (NLP) technique that identifies the emotional tone of a text, classifying it as positive, negative, or neutral. It is used to study customer opinions, monitor brand reputation, and analyse social media content.

**Project goal:** The goal of the project is to develop and implement a sentiment analysis model that classifies user opinions based on texts from the Internet. This involves analysing the text data, preparing a suitable model, and presenting the results of the analysis.

In [14]:
!pip3 install datasets transformers torch 'numpy<2' accelerate --quiet

### Loading the data

In [15]:
from datasets import load_dataset

ds = load_dataset("clapAI/MultiLingualSentiment")

In [23]:
print(ds['train'][0]['text'])
print(ds['train'][0]['label'])

A good environment with good food. Price is reasonable.
positive
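Before modelling, it can help to check how the three classes are balanced. The sketch below counts labels with `collections.Counter`. The `rows` list is a hypothetical stand-in for a few dataset records; on the real data one could pass `ds['train']['label']` instead, assuming the column fits in memory.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} for an iterable of label strings."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Hypothetical toy records shaped like the dataset rows shown above.
rows = [
    {"text": "A good environment with good food.", "label": "positive"},
    {"text": "The delivery was late.", "label": "negative"},
    {"text": "Great value.", "label": "positive"},
    {"text": "It arrived on Tuesday.", "label": "neutral"},
]

distribution = label_distribution(row["label"] for row in rows)
for label, (count, share) in sorted(distribution.items()):
    print(f"{label}: {count} ({share:.0%})")
```

A strongly skewed distribution would suggest reporting per-class metrics rather than accuracy alone.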


In [9]:
# List the languages available in the training split
languages = ds['train'].unique('language')
print("Available languages:", languages)

# Dictionary holding one filtered dataset per language
datasets_by_language = {}

# Filter the training split once per language
for lang in languages:
    datasets_by_language[lang] = ds['train'].filter(
        lambda batch: [x == lang for x in batch['language']],
        batched=True,
        num_proc=4,
    )

Available languages: ['en', 'es', 'ja', 'ar', 'tr', 'fr', 'vi', 'zh', 'de', 'ru', 'ko', 'id', 'multilingual', 'pt', 'ms', 'hi', 'it']


Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1769567.49 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1784023.50 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1893929.40 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1864931.06 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1962052.74 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1930118.75 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1936049.42 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1909676.02 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1907027.18 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1898341.07 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1877616.93 examples/s]

In [10]:
datasets_by_language['ja'][0]

{'text': 'コードレス設計で車内の掃除もできます。\nコードレス設計で車内の掃除もできます。砂と土なども吸い込みます。掃除苦手の私でも快適に掃除ができます。',
 'label': 'positive',
 'source': 'https://huggingface.co/datasets/mteb/amazon_reviews_multi',
 'domain': 'amazon reviews',
 'language': 'ja'}
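The loop above scans the full training split once per language. A possible alternative, sketched here on plain Python data, is to collect the row indices for every language in a single pass; with the real dataset the index lists could then be passed to `Dataset.select`. The helper name `indices_by_language` and the toy `language_column` are illustrative assumptions, not part of the original notebook.

```python
from collections import defaultdict


def indices_by_language(language_column):
    """Map each language code to the list of row indices that use it."""
    groups = defaultdict(list)
    for index, lang in enumerate(language_column):
        groups[lang].append(index)
    return dict(groups)


# Hypothetical toy column standing in for ds['train']['language'].
language_column = ["en", "ja", "en", "de", "ja"]
groups = indices_by_language(language_column)
print(groups)

# With the real data (assumption: the 'language' column fits in memory):
# groups = indices_by_language(ds['train']['language'])
# datasets_by_language = {lang: ds['train'].select(idx) for lang, idx in groups.items()}
```

Whether this is faster in practice depends on the dataset size and on how many worker processes the filter approach can use, so it is worth timing both.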

In [4]:
import torch

# torch.__version__ is the version string (torch.version is a module)
print(torch.__version__)
# is_available is a function and must be called to return a boolean
print(torch.backends.mps.is_available())
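The availability check is typically used to decide where the model should run. A minimal sketch of that decision follows; `choose_device` is a hypothetical helper that takes the availability flags as plain booleans so the logic can be read and run without PyTorch installed.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


print(choose_device(cuda_available=False, mps_available=True))

# In the notebook this could be called as:
# device = choose_device(torch.cuda.is_available(), torch.backends.mps.is_available())
```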


## Zero-shot Prompting

In [55]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


model_name = 'Qwen/Qwen3-0.6B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# On Intel-based MacBooks, set device_map to the CPU and torch_dtype to torch.float32;
# otherwise the model does not load correctly
zero_shot_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map={"": "cpu"})
zero_shot_model.eval()

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwe
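The printed layer shapes are enough for a rough parameter count, which is a useful sanity check against the "0.6B" in the model name. The arithmetic below uses only the shapes visible above. The printout is cut off, so the shape of the post-attention norm, the presence of a final norm, and an output head that shares the embedding weights are all assumptions.

```python
# Shapes copied from the printed architecture above (all Linear layers have bias=False).
hidden, q_out, kv_out, mlp_hidden = 1024, 2048, 1024, 3072
head_dim, vocab_size, num_layers = 128, 151936, 28

# q, k, v and o projections
attention = hidden * q_out + 2 * hidden * kv_out + q_out * hidden
# gate, up and down projections
mlp = 3 * hidden * mlp_hidden
# q_norm and k_norm are shown above; the post-attention norm shape is cut off
# in the printout and is ASSUMED to match input_layernorm, i.e. (1024,).
norms = 2 * head_dim + 2 * hidden
per_layer = attention + mlp + norms

embedding = vocab_size * hidden
# ASSUMPTION: one final RMSNorm of size `hidden`, and an output head that shares
# the embedding weights (neither is visible in the truncated printout).
total = num_layers * per_layer + embedding + hidden

print(f"per decoder layer: {per_layer:,}")
print(f"embedding:         {embedding:,}")
print(f"estimated total:   {total:,}")
```

Under these assumptions the estimate lands just under 0.6 billion parameters, with roughly a quarter of them in the embedding table.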

In [97]:
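# Prompt template
def build_prompt(text):
    return (
        "Just define in one word the sentiment of this text as positive, negative or neutral:\n"
        f"\"{text}\"\nAnswer(positive/negative/neutral):\n"
    )

def predict_sentiment(text):
    prompt = build_prompt(text)
    zero_shot_model_inputs = tokenizer(prompt, return_tensors="pt").to(zero_shot_model.device)
    generated_ids = zero_shot_model.generate(**zero_shot_model_inputs, max_new_tokens=3)
    # Decode only the newly generated tokens; slicing the decoded string by len(prompt)
    # can misalign when decoding does not reproduce the prompt character for character
    prompt_length = zero_shot_model_inputs["input_ids"].shape[1]
    print(tokenizer.decode(generated_ids[0][prompt_length:], skip_special_tokens=True))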
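Because the model generates free text, the raw completion may contain extra words, mixed casing, or punctuation. A small normaliser such as the hypothetical `parse_label` below could map a completion to one of the three labels, returning `None` when no label is found. It should be applied to the completion only, since the prompt itself lists all three labels.

```python
import re

LABELS = ("positive", "negative", "neutral")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def parse_label(completion):
    """Return the first sentiment label mentioned in the completion, else None."""
    match = LABEL_PATTERN.search(completion.lower())
    return match.group(1) if match else None


print(parse_label("Answer:\npositive"))
print(parse_label("I am not sure."))
```

Returning `None` for unparseable completions keeps such cases visible during evaluation instead of silently counting them as one of the classes.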

In [98]:
predict_sentiment(ds['train'][0]['text'])
# Double quotes outside so the f-string also parses on Python versions before 3.12
print(f"\nReal answer:\n{ds['train'][0]['label']}")

Answer:
positive

Real answer:
positive
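A single example says little about quality. The sketch below shows how accuracy over a sample could be computed. Here `predict` is any function that returns a label string, and `keyword_predict` is a deliberately naive, hypothetical stand-in so that the block runs without loading a model. Using it with the notebook's model would require `predict_sentiment` to return its answer rather than print it.

```python
def accuracy(examples, predict):
    """Share of examples whose predicted label equals the gold label."""
    if not examples:
        return 0.0
    correct = sum(predict(ex["text"]) == ex["label"] for ex in examples)
    return correct / len(examples)


def keyword_predict(text):
    # Hypothetical placeholder classifier, used only to make the sketch runnable.
    lowered = text.lower()
    if "good" in lowered or "great" in lowered:
        return "positive"
    if "bad" in lowered or "late" in lowered:
        return "negative"
    return "neutral"


# Hypothetical toy sample; the first text is the dataset example shown earlier.
sample = [
    {"text": "A good environment with good food. Price is reasonable.", "label": "positive"},
    {"text": "The delivery was late.", "label": "negative"},
    {"text": "Not what I expected.", "label": "negative"},
    {"text": "It arrived on Tuesday.", "label": "neutral"},
]

print(f"accuracy: {accuracy(sample, keyword_predict):.2f}")
```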
