**Topic:** Sentiment analysis of online texts using Transformer-based networks

**Introduction:** Sentiment analysis is a natural language processing (NLP) technique that identifies the emotional tone of a text and classifies it as positive, negative, or neutral. It is used to study customer opinions, monitor brand reputation, and analyze social media content.

**Project goal:** The goal of this project is to develop and implement a sentiment analysis model that classifies user opinions based on texts collected from the Internet. This involves analyzing the text data, preparing a suitable model, and presenting the results of the analysis.

In [1]:
%pip install datasets transformers torch --quiet

Note: you may need to restart the kernel to use updated packages.


### Loading the data

In [1]:
from datasets import load_dataset

ds = load_dataset("clapAI/MultiLingualSentiment")



In [2]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 3147478
    })
    validation: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 393435
    })
    test: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 393436
    })
})
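
The row counts printed above imply roughly an 80/10/10 split between train, validation, and test. The short check below is a sketch that only reuses the three numbers from the printout; the variable names are illustrative and not part of the notebook.

```python
# Row counts copied from the DatasetDict printout above.
split_rows = {"train": 3_147_478, "validation": 393_435, "test": 393_436}

total_rows = sum(split_rows.values())
# Share of all rows that falls into each split.
split_share = {name: rows / total_rows for name, rows in split_rows.items()}

for name, share in split_share.items():
    print(f"{name:<10} {split_rows[name]:>9,} rows  {share:6.1%}")
```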


In [16]:
# List the languages available in the dataset
languages = ds['train'].unique('language')
ds_types = ['train', 'validation', 'test']
# Drop the columns that are not needed for training
drop_columns = ['source', 'domain']
ds = ds.remove_columns(drop_columns)
# Dictionary of per-language datasets: datasets[lang][split]
datasets = {}

# Split train, validation and test by language
for lang in languages:
    datasets[lang] = {}
    for ds_type in ds_types:
        datasets[lang][ds_type] = ds[ds_type].filter(
            lambda batch: [x == lang for x in batch['language']],
            batched=True,
            num_proc=4
        )

        # Keep a random 1% sample of each subset (seeded for reproducibility)
        rows_counter = datasets[lang][ds_type].num_rows
        new_num_rows = round(rows_counter * 0.01)
        datasets[lang][ds_type] = datasets[lang][ds_type].shuffle(seed=42)
        datasets[lang][ds_type] = datasets[lang][ds_type].select(range(new_num_rows))

Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1593933.63 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 970828.42 examples/s]
Filter (num_proc=4): 100%|██████████| 393436/393436 [00:00<00:00, 937174.09 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1672955.12 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 1042327.62 examples/s]
Filter (num_proc=4): 100%|██████████| 393436/393436 [00:00<00:00, 1036326.39 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1653757.68 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 1069735.66 examples/s]
Filter (num_proc=4): 100%|██████████| 393436/393436 [00:00<00:00, 1031221.27 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1677637.81 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 1081991.97 examples/s]
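
The progress bars above show that every `filter` call scans a full split (for example all 3,147,478 training rows) once per language. As a rough illustration of the alternative, the sketch below groups rows by language in a single pass and then keeps a seeded 1% sample of each group. It works on plain Python lists of dicts rather than on the `datasets` API, and `sample_per_language` and the toy rows are hypothetical names introduced only for this example.

```python
import random
from collections import defaultdict


def sample_per_language(rows, fraction=0.01, seed=42):
    """Group rows by 'language' in one pass, then keep a seeded random fraction of each group."""
    by_language = defaultdict(list)
    for row in rows:
        by_language[row["language"]].append(row)

    rng = random.Random(seed)  # local generator, so the global random state is untouched
    sampled = {}
    for lang in sorted(by_language):
        group = by_language[lang]
        keep = round(len(group) * fraction)  # same rounding rule as in the notebook cell
        sampled[lang] = rng.sample(group, keep)
    return sampled


# Toy rows standing in for one split of the real dataset (assumption: invented data).
toy_rows = (
    [{"text": f"en review {i}", "label": "positive", "language": "en"} for i in range(300)]
    + [{"text": f"es review {i}", "label": "negative", "language": "es"} for i in range(200)]
)

subsets = sample_per_language(toy_rows)
print({lang: len(rows) for lang, rows in subsets.items()})
```

One thing worth checking with either approach: `round(n * 0.01)` evaluates to zero for groups of roughly 50 rows or fewer, so very small language subsets can end up empty after sampling.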

In [17]:
datasets['ja']['train'][0]

{'text': 'タイトル画からは意外\nコメディタッチの出来かと思いきや、意外とシリアスで 山本周五郎作品のような、人情物 包丁侍という役どころを通した人情が織りなす所は面白いが 料理という点での話は、意外と薄かったような 最後の宴が見世物だが 品数だけは多いが、一個ずつの感想や評価は無いし 反応も殆ど無いという 現代と昔との違いはあるが、もっと料理を美味しく撮って欲しかった せっかくの豪華な加賀料理が、画面では伝わらないのは監督の技量だろう エンディングの歌が、何故か 英語ｗ せっかくのシリアスな作品を ぶち壊す おちゃらけは頂けない',
 'label': 'positive',
 'language': 'ja'}

### Tokenization

In [9]:
from transformers import BertTokenizer
from datasets import concatenate_datasets

In [20]:
languages_to_process = ['en', 'es', 'zh']
labels_id = {'negative': 0, 'neutral': 1, 'positive': 2}

# Convert labels to IDs
def convert_labels_to_ids(batch):
    batch['label_id'] = [labels_id[label] for label in batch['label']]
    return batch

# train ds
train_ds_list = [datasets[lang]['train'] for lang in languages_to_process]
# Concatenate datasets for selected languages
train_ds = concatenate_datasets(train_ds_list)
train_ds = train_ds.map(convert_labels_to_ids, batched=True, num_proc=4)
train_ds = train_ds.shuffle(seed=42)

# eval ds
eval_ds_list = [datasets[lang]['validation'] for lang in languages_to_process]
# Concatenate datasets for selected languages
eval_ds = concatenate_datasets(eval_ds_list)
eval_ds = eval_ds.map(convert_labels_to_ids, batched=True, num_proc=4)
eval_ds = eval_ds.shuffle(seed=42)

# test ds
test_ds_list = [datasets[lang]['test'] for lang in languages_to_process]
# Concatenate datasets for selected languages
test_ds = concatenate_datasets(test_ds_list)
test_ds = test_ds.map(convert_labels_to_ids, batched=True, num_proc=4)
test_ds = test_ds.shuffle(seed=42)

In [21]:
print(train_ds[0])

{'text': "lol, I'm so used to Spotify, i can't even thinking of switching it. This is great and even their chat support is awesome.", 'label': 'positive', 'language': 'en', 'label_id': 2}
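
The label conversion above can be checked on a small scale. The sketch below reuses the same `labels_id` mapping, adds an inverse mapping for turning predicted IDs back into names, and fails loudly on a label that is not in the mapping. `id_labels`, `encode_labels`, and the toy labels are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter

labels_id = {"negative": 0, "neutral": 1, "positive": 2}    # same mapping as in the cell above
id_labels = {idx: name for name, idx in labels_id.items()}  # inverse mapping (hypothetical name)


def encode_labels(labels):
    """Map string labels to integer IDs, rejecting any label missing from labels_id."""
    unknown = sorted(set(labels) - set(labels_id))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [labels_id[label] for label in labels]


toy_labels = ["positive", "negative", "positive", "neutral"]  # invented sample
encoded = encode_labels(toy_labels)
print(encoded)

# Class counts, decoded back to names, give a quick view of label balance.
print(Counter(id_labels[i] for i in encoded))
```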


In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize_and_encode(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128,  # pad/truncate to 128 tokens (BERT itself accepts up to 512)
        # return_tensors=None,  # default: plain Python lists ('pt' would return PyTorch tensors)
        # return_special_tokens_mask=True
    )



tokenized_datasets = {}
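# NOTE: `example_datasets` is not defined in the cells shown above; it is assumed
# to map a language code to a Dataset with 'text', 'label' and 'language' columns.
for lang, dataset in example_datasets.items():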
    print(f"Tokenizing {lang} dataset...")
    tokenized_datasets[lang] = dataset.map(
        tokenize_and_encode,
        batched=True,
        batch_size=1000,  # rows passed to the tokenizer per call
        num_proc=4,       # Use multiple CPU cores
        remove_columns=['text', 'language']  # Remove original columns we don't need
    )
    
    # Rename the 'label' column to 'labels'
    tokenized_datasets[lang] = tokenized_datasets[lang].map(
        lambda x: {'labels': x['label']},
        remove_columns=['label']
    )

print("\nExample of tokenized text:")
print(tokenized_datasets[list(tokenized_datasets.keys())[0]][0])

Tokenizing en dataset...


Map (num_proc=4): 100%|██████████| 1215709/1215709 [07:43<00:00, 2622.59 examples/s]
Map: 100%|██████████| 1215709/1215709 [01:51<00:00, 10866.46 examples/s]


Tokenizing es dataset...


Map (num_proc=4): 100%|██████████| 178434/178434 [00:25<00:00, 7055.55 examples/s]
Map: 100%|██████████| 178434/178434 [00:16<00:00, 11038.80 examples/s]


Tokenizing fr dataset...


Map (num_proc=4): 100%|██████████| 210298/210298 [00:40<00:00, 5220.27 examples/s]
Map: 100%|██████████| 210298/210298 [00:19<00:00, 10751.26 examples/s]


Example of tokenized text:
{'source': 'https://www.kaggle.com/datasets/choonkhonng/malaysia-restaurant-review-datasets', 'domain': 'restaurant reviews ', 'input_ids': [101, 138, 15198, 26069, 10169, 15198, 18301, 119, 23002, 10124, 27949, 13096, 119, 102, 0, 0, 0, ...], 'token_type_ids': [0, 0, 0, ... (output truncated)
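
The long run of zeros in the printed `input_ids` is padding: with `padding='max_length'` and `truncation=True`, every example ends up with exactly `max_length` positions. The sketch below mimics that behavior in plain Python, without calling the tokenizer. The special-token IDs are read off the printed example (101 at the start, 102 before the zeros), and `pad_or_truncate` is a hypothetical helper, not a `transformers` function.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as they appear in the printed example above


def pad_or_truncate(content_ids, max_length=128):
    """Wrap content token IDs in [CLS]/[SEP], cut to max_length, and pad with PAD_ID."""
    content_ids = list(content_ids)[: max_length - 2]  # leave room for the two special tokens
    input_ids = [CLS_ID, *content_ids, SEP_ID]
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        # 1 marks real tokens, 0 marks padding positions the model should ignore.
        "attention_mask": [1] * len(input_ids) + [0] * padding,
    }


short_example = pad_or_truncate([138, 15198, 26069], max_length=8)
long_example = pad_or_truncate(range(1000, 1020), max_length=8)

print(short_example)
print(long_example["input_ids"])
```

Because every sequence is padded to the full `max_length`, short reviews like the one printed above spend most of their 128 positions on padding.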


