**Topic:** Sentiment analysis of online texts with Transformer-based networks

**Introduction:** Sentiment analysis is a natural language processing (NLP) technique that identifies the emotional tone of a text and classifies it as positive, negative, or neutral. It is used to study customer opinions, monitor brand reputation, and analyze social media content.

**Project goal:** The goal of the project is to develop and implement a sentiment analysis model that classifies user opinions based on texts taken from the Internet. This involves analyzing the text data, preparing a suitable model, and presenting the results of the analysis.

In [1]:
%pip install datasets transformers torch --quiet

Note: you may need to restart the kernel to use updated packages.


### Loading the data

In [2]:
from datasets import load_dataset

ds = load_dataset("clapAI/MultiLingualSentiment")



In [3]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 3147478
    })
    validation: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 393435
    })
    test: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 393436
    })
})
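
The row counts printed above correspond to a split of roughly 80/10/10. A minimal sketch of that arithmetic, with the sizes copied by hand from the output (the variable names are illustrative and not part of the notebook):

```python
# Row counts copied from the DatasetDict printout above
split_sizes = {"train": 3147478, "validation": 393435, "test": 393436}

total = sum(split_sizes.values())

# Share of the whole corpus that falls into each split
fractions = {name: n / total for name, n in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({frac:.1%})")
```
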


In [None]:
# List the languages present in the training split
languages = ds['train'].unique('language')
print("Available languages:", languages)

# One filtered view of the training split per language
datasets_by_language = {}

# Each filter call scans the full training split once
for lang in languages:
    datasets_by_language[lang] = ds['train'].filter(
        lambda batch: [x == lang for x in batch['language']],
        batched=True,
        num_proc=4,
    )

Available languages: ['en', 'es', 'ja', 'ar', 'tr', 'fr', 'vi', 'zh', 'de', 'ru', 'ko', 'id', 'multilingual', 'pt', 'ms', 'hi', 'it']


Filter (num_proc=4):   0%|          | 0/3147478 [00:00<?, ? examples/s]
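
The loop above runs one full filter pass over the training split for every language. A possible alternative, sketched here on toy data, is to collect the row indices per language in a single pass and then build each subset from those indices. The helper `group_indices_by_language` is hypothetical and not part of the notebook; whether it is actually faster on the real dataset has not been measured here.

```python
from collections import defaultdict


def group_indices_by_language(language_column):
    """Map each language code to the list of row indices that carry it."""
    indices = defaultdict(list)
    for row_index, lang in enumerate(language_column):
        indices[lang].append(row_index)
    return dict(indices)


# Toy stand-in for ds['train']['language']
toy_languages = ["en", "es", "en", "ja", "es", "en"]
toy_groups = group_indices_by_language(toy_languages)
print(toy_groups)

# With the real dataset this could be applied as follows (not executed here):
# groups = group_indices_by_language(ds['train']['language'])
# datasets_by_language = {
#     lang: ds['train'].select(idx) for lang, idx in groups.items()
# }
```

The trade-off is that the whole `language` column is read into memory at once, which for about three million short strings should be manageable on most machines.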


In [32]:
datasets_by_language['ja'][0]

{'text': 'コードレス設計で車内の掃除もできます。\nコードレス設計で車内の掃除もできます。砂と土なども吸い込みます。掃除苦手の私でも快適に掃除ができます。',
 'label': 'positive',
 'source': 'https://huggingface.co/datasets/mteb/amazon_reviews_multi',
 'domain': 'amazon reviews',
 'language': 'ja'}
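
The record above shows that `label` is stored as a string (`'positive'`), while a classification head needs integer class ids. A minimal sketch of such a mapping follows. It assumes the label set is exactly `negative`, `neutral`, and `positive`, which matches the three classes named in the introduction but has not been verified against the dataset; the names `LABEL2ID`, `ID2LABEL`, and `encode_labels` are illustrative.

```python
# Assumed label vocabulary; only 'positive' is confirmed by the record above
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_labels(batch):
    """Turn a batch of string labels into integer class ids."""
    return {"labels": [LABEL2ID[label] for label in batch["label"]]}


# Toy batch in the same column layout as the dataset
toy_batch = {"label": ["positive", "negative", "neutral", "positive"]}
encoded = encode_labels(toy_batch)
print(encoded)
```

A function of this shape could be passed to `dataset.map(encode_labels, batched=True)`. An unexpected label string would raise a `KeyError`, which makes a wrong assumption about the label set visible immediately.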

### Tokenization

In [8]:
from transformers import BertTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize_and_encode(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128,  # truncation length chosen here; BERT itself accepts up to 512 tokens
        return_tensors=None,  # None returns plain Python lists, not PyTorch tensors
        return_special_tokens_mask=True
    )

languages_to_process = ['en', 'es', 'fr']  # example
example_datasets = {lang: datasets_by_language[lang] for lang in languages_to_process if lang in datasets_by_language}
tokenized_datasets = {}
for lang, dataset in example_datasets.items():
    print(f"Tokenizing {lang} dataset...")
    tokenized_datasets[lang] = dataset.map(
        tokenize_and_encode,
        batched=True,
        batch_size=1000,  # same as the default batch size of a batched map
        num_proc=4,       # use four worker processes
        remove_columns=['text', 'language']  # drop the original columns that are no longer needed
    )
    
    # Copy the string label into a 'labels' column; values stay strings, not tensors
    tokenized_datasets[lang] = tokenized_datasets[lang].map(
        lambda x: {'labels': x['label']},
        remove_columns=['label']
    )

print("\nExample of tokenized text:")
print(tokenized_datasets[list(tokenized_datasets.keys())[0]][0])

Tokenizing en dataset...


Map (num_proc=4): 100%|██████████| 1215709/1215709 [07:43<00:00, 2622.59 examples/s]
Map: 100%|██████████| 1215709/1215709 [01:51<00:00, 10866.46 examples/s]


Tokenizing es dataset...


Map (num_proc=4): 100%|██████████| 178434/178434 [00:25<00:00, 7055.55 examples/s]
Map: 100%|██████████| 178434/178434 [00:16<00:00, 11038.80 examples/s]


Tokenizing fr dataset...


Map (num_proc=4): 100%|██████████| 210298/210298 [00:40<00:00, 5220.27 examples/s]
Map: 100%|██████████| 210298/210298 [00:19<00:00, 10751.26 examples/s]


Example of tokenized text:
{'source': 'https://www.kaggle.com/datasets/choonkhonng/malaysia-restaurant-review-datasets', 'domain': 'restaurant reviews ', 'input_ids': [101, 138, 15198, 26069, 10169, 15198, 18301, 119, 23002, 10124, 27949, 13096, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
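
The example above consists mostly of zeros because `padding='max_length'` pads every text to 128 positions. The sketch below rebuilds the padded sequence from the 14 non-zero ids visible in the output and derives an attention mask from it. It assumes that id 0 is the padding token of this tokenizer; the tokenizer normally returns its own `attention_mask`, so this reconstruction is for illustration only.

```python
PAD_TOKEN_ID = 0   # assumed padding id for this tokenizer
MAX_LENGTH = 128   # same value as max_length in tokenize_and_encode

# Non-padding ids copied from the printed example above
real_tokens = [101, 138, 15198, 26069, 10169, 15198, 18301,
               119, 23002, 10124, 27949, 13096, 119, 102]

# Pad on the right up to MAX_LENGTH, as padding='max_length' does
input_ids = real_tokens + [PAD_TOKEN_ID] * (MAX_LENGTH - len(real_tokens))

# 1 marks a real token, 0 marks padding
attention_mask = [int(token_id != PAD_TOKEN_ID) for token_id in input_ids]

n_real = sum(attention_mask)
padding_share = 1 - n_real / MAX_LENGTH
print(f"{n_real} real tokens, {padding_share:.0%} of the sequence is padding")
```

For short reviews like this one, most of the stored sequence is padding. Padding each batch only to its longest member at training time, for example with `DataCollatorWithPadding` from `transformers`, is one way to avoid storing and processing those positions.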


