**Temat:** Analiza sentymentu w tekstach internetowych w oparciu o sieci typu Transformer

**Wprowadzenie:** Analiza sentymentu to technika przetwarzania języka naturalnego (NLP), która identyfikuje ton emocjonalny w tekście, klasyfikując go na pozytywny, negatywny lub neutralny. Wykorzystuje się ją do badania opinii klientów, monitorowania reputacji marki czy analizy treści mediów społecznościowych.

**Cel projektu:** Celem projektu jest opracowanie i implementacja modelu analizy sentymentu, który pozwoli na klasyfikację opinii użytkowników na podstawie tekstów pochodzących z Internetu. Należy przeanalizować dane tekstowe, przygotować odpowiedni model oraz zaprezentować wyniki analizy.

In [41]:
%pip install datasets transformers torch langdetect scikit-learn --quiet

Note: you may need to restart the kernel to use updated packages.


### Ładowanie danych

In [None]:
from datasets import load_dataset

ds = load_dataset("clapAI/MultiLingualSentiment")

Saving the dataset (3/3 shards): 100%|██████████| 3147478/3147478 [00:02<00:00, 1158612.77 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 393435/393435 [00:00<00:00, 1250675.28 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 393436/393436 [00:00<00:00, 1251788.48 examples/s]


In [2]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 3147478
    })
    validation: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 393435
    })
    test: Dataset({
        features: ['text', 'label', 'source', 'domain', 'language'],
        num_rows: 393436
    })
})


In [None]:
languages = ds['train'].unique('language')
ds_types = ['train', 'validation', 'test']
drop_columns = ['source', 'domain']
ds = ds.remove_columns(drop_columns)
# Create dictionary to store datasets for each language
datasets = {}

# Split train, validation and test for each language
for lang in languages:
    datasets[lang] = {}
    for ds_type in ds_types:
        datasets[lang][ds_type] = ds[ds_type].filter(
            lambda batch: [x == lang for x in batch['language']],
            batched = True,
            num_proc=4
            )
        
        # Reduce dataset by 100 times
        rows_counter = datasets[lang][ds_type].num_rows
        new_num_rows = round(rows_counter*0.01)
        datasets[lang][ds_type] = datasets[lang][ds_type].shuffle(seed=42)
        datasets[lang][ds_type] = datasets[lang][ds_type].select(range(new_num_rows))

Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1593933.63 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 970828.42 examples/s]
Filter (num_proc=4): 100%|██████████| 393436/393436 [00:00<00:00, 937174.09 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1672955.12 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 1042327.62 examples/s]
Filter (num_proc=4): 100%|██████████| 393436/393436 [00:00<00:00, 1036326.39 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1653757.68 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 1069735.66 examples/s]
Filter (num_proc=4): 100%|██████████| 393436/393436 [00:00<00:00, 1031221.27 examples/s]
Filter (num_proc=4): 100%|██████████| 3147478/3147478 [00:01<00:00, 1677637.81 examples/s]
Filter (num_proc=4): 100%|██████████| 393435/393435 [00:00<00:00, 1081991.97 examples/s]
Filter (num_pro

### Czyszczenie datasetów z błędnych języków

In [None]:
from langdetect import detect

def detect_language(text):
    try:
        return detect(text)
    except:
        return "unknown"
    
for ds_type in ds_types:
        for lang in languages:
            datasets['en'][ds_type] = datasets['en'][ds_type].filter(
                lambda batch: [detect_language(x) == 'en' for x in batch['text']],
                batched=True,
                num_proc=4
            )

Filter (num_proc=4): 100%|██████████| 10305/10305 [00:09<00:00, 1121.32 examples/s]
Filter (num_proc=4): 100%|██████████| 10281/10281 [00:09<00:00, 1134.04 examples/s]
Filter (num_proc=4): 100%|██████████| 10267/10267 [00:09<00:00, 1109.08 examples/s]
Filter (num_proc=4): 100%|██████████| 10261/10261 [00:09<00:00, 1033.45 examples/s]
Filter (num_proc=4): 100%|██████████| 10253/10253 [00:13<00:00, 756.87 examples/s]
Filter (num_proc=4): 100%|██████████| 10250/10250 [00:16<00:00, 615.81 examples/s]
Filter (num_proc=4): 100%|██████████| 10249/10249 [00:15<00:00, 640.84 examples/s]
Filter (num_proc=4): 100%|██████████| 10243/10243 [00:12<00:00, 823.90 examples/s] 
Filter (num_proc=4): 100%|██████████| 10240/10240 [00:10<00:00, 940.95 examples/s]
Filter (num_proc=4): 100%|██████████| 10236/10236 [00:10<00:00, 1001.95 examples/s]
Filter (num_proc=4): 100%|██████████| 10234/10234 [00:10<00:00, 1010.62 examples/s]
Filter (num_proc=4): 100%|██████████| 10232/10232 [00:10<00:00, 1010.85 examples

In [26]:
datasets['en']['train'][0]

{'text': 'Not allowed to withdraw my funds, and was told I can only cash out if I deactivate or close the account. Well if u say so.',
 'label': 'negative',
 'language': 'en'}

### Tokenizacja

In [None]:
from datasets import concatenate_datasets

languages_to_process = ['en', 'es', 'zh']
labels_id = {'negative': 0, 'neutral': 1, 'positive': 2}

# Convert labels to IDs
def convert_labels_to_ids(batch):
    batch['label_id'] = [labels_id[label] for label in batch['label']]
    return batch

# train ds
train_ds_list = [datasets[lang]['train'] for lang in languages_to_process]
# Concatenate datasets for selected languages
train_ds = concatenate_datasets(train_ds_list)
train_ds = train_ds.map(convert_labels_to_ids, batched=True, num_proc=4)
train_ds = train_ds.shuffle(seed=42)

# eval ds
eval_ds_list = [datasets[lang]['validation'] for lang in languages_to_process]
# Concatenate datasets for selected languages
eval_ds = concatenate_datasets(eval_ds_list)
eval_ds = eval_ds.map(convert_labels_to_ids, batched=True, num_proc=4)
eval_ds = eval_ds.shuffle(seed=42)

# test ds
test_ds_list = [datasets[lang]['test'] for lang in languages_to_process]
# Concatenate datasets for selected languages
test_ds = concatenate_datasets(test_ds_list)
test_ds = test_ds.map(convert_labels_to_ids, batched=True, num_proc=4)
test_ds = test_ds.shuffle(seed=42)

Map (num_proc=4): 100%|██████████| 15560/15560 [00:01<00:00, 8382.52 examples/s] 
Map (num_proc=4): 100%|██████████| 1959/1959 [00:00<00:00, 6640.81 examples/s]
Map (num_proc=4): 100%|██████████| 1942/1942 [00:00<00:00, 6541.98 examples/s]


In [32]:
print(train_ds[0])

{'text': '很实用，点赞。。。:)', 'label': 'positive', 'language': 'zh', 'label_id': 2}


In [38]:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=3)

def tokenize_and_encode(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128,  # BERT max sequence length
        # return_tensors=None,  # Returns PyTorch tensors
        # return_special_tokens_mask=True
    )



tokenized_train_ds = train_ds.map(
    tokenize_and_encode,
    batched=True,
    batch_size=1000,  # Increased batch size
    num_proc=4,       # Use multiple CPU cores
    remove_columns=['text', 'language', 'label']  # Remove original columns we don't need
)
tokenized_train_ds = tokenized_train_ds.rename_column("label_id", "label")
tokenized_eval_ds = eval_ds.map(
    tokenize_and_encode,
    batched=True,
    batch_size=1000,  # Increased batch size
    num_proc=4,       # Use multiple CPU cores
    remove_columns=['text', 'language', 'label']  # Remove original columns we don't need
)
tokenized_eval_ds = tokenized_eval_ds.rename_column("label_id", "label")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [52]:
print(tokenized_train_ds.num_rows)

15560


In [53]:
test_ds.to_csv("test_ds.csv", index=False)

Creating CSV from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 30.72ba/s]


782201

In [39]:
print(tokenized_train_ds[0])

{'label': 2, 'input_ids': [101, 3767, 3392, 5605, 10064, 5286, 7520, 1882, 1882, 1882, 131, 114, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [48]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score

model.to('cpu')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    return {
        'accuracy': accuracy,
        'f1': f1
    }

training_args = TrainingArguments(
    output_dir="./multilingual_bert_sentiment",
    overwrite_output_dir=True,
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_eval_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()
trainer.save_model("./multilingual_bert_sentiment")
trainer.evaluate()

  trainer = Trainer(


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 