# IMDB Reviews embedding

Convert the movie reviews in the IMDB dataset into sentence embeddings and save the result to `./data/imdb_embedding.csv`.

Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

In [1]:
# !pip install swifter

In [2]:
import re
import torch
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from transformers import BertModel, BertTokenizer
from swifter import swifter

import util

In [3]:
EN_BERT_PATH = './data/bert-base-uncased'
IMDB_FILE = './data/IMDB Dataset.csv'
EMBEDDING_CSV_FILE = './data/imdb_embedding.csv'

## 1. Text preprocessing

In [4]:
df = pd.read_csv(IMDB_FILE)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [5]:
# Strip HTML tags
def remove_html_label(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

df['review'] = df['review'].apply(remove_html_label)
df['review']



0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object
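
`get_text()` joins text nodes with an empty separator by default, so a review containing `acting.<br /><br />Weak` would come out as `acting.Weak`. Below is a minimal sketch of a variant that inserts a space at each tag boundary and then collapses repeated whitespace. The name `remove_html_keep_spaces` and the sample string are made up for illustration and are not part of the original notebook.

```python
import re
from bs4 import BeautifulSoup

def remove_html_keep_spaces(text: str) -> str:
    """Strip HTML tags, keeping a space where a tag separated two text runs."""
    soup = BeautifulSoup(text, "html.parser")
    # separator=" " places a space between adjacent text nodes
    spaced = soup.get_text(separator=" ")
    # Collapse the runs of whitespace that the separator can create
    return re.sub(r"\s+", " ", spaced).strip()

# Toy input (not taken from the dataset)
sample = "Great acting.<br /><br />Weak plot."
default_text = BeautifulSoup(sample, "html.parser").get_text()
spaced_text = remove_html_keep_spaces(sample)

print(repr(default_text))
print(repr(spaced_text))
```

Whether the missing space matters depends on the tokenizer: a WordPiece tokenizer still splits on punctuation, so the effect on BERT inputs is likely small, but the cleaned text is easier to read and reuse.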

In [6]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49581,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000
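
The summary above reports 49,581 unique reviews among 50,000 rows, so 419 rows repeat an earlier review. Whether to drop them is a modeling choice that this notebook does not make. If deduplication is wanted, here is a minimal sketch on a toy frame (the `toy` data is invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the real df; values are invented
toy = pd.DataFrame({
    "review": ["Loved it", "Hated it", "Loved it"],
    "sentiment": ["positive", "negative", "positive"],
})

# Count rows whose review text already appeared earlier
n_repeats = int(toy.duplicated(subset=["review"]).sum())

# Keep the first occurrence of each review and renumber the index
deduped = toy.drop_duplicates(subset=["review"]).reset_index(drop=True)

print(n_repeats, len(deduped))
```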


In [7]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

## 2. Computing sentence embeddings

Download the model files for [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased):

```bash
conda install pytorch -y
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download bert-base-uncased --local-dir ./data/bert-base-uncased
```

In [8]:
# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained(EN_BERT_PATH)
model = BertModel.from_pretrained(EN_BERT_PATH)

Some weights of the model checkpoint at ./data/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
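# Compute sentence embeddings
def get_avg_embeddings(corpus):
    """Return the mean of BERT's last-layer token embeddings for each input text."""
    # Pad or truncate every input to the tokenizer's maximum length
    encoded_inputs = tokenizer(corpus,
                               padding='max_length',
                               truncation=True,
                               return_tensors='pt')
    with torch.no_grad():
        outputs = model(**encoded_inputs)
        # Average over the sequence dimension (padding positions are included)
        embeddings = outputs.last_hidden_state.mean(dim=1)

    return embeddings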
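
Because `padding='max_length'` pads every review to the tokenizer's maximum length, the plain `mean(dim=1)` in `get_avg_embeddings` averages over padding positions as well as real tokens. A common alternative is to weight the average by the attention mask. The sketch below shows that idea on dummy tensors, without loading the model. `masked_mean_pool` is a hypothetical helper, not part of the notebook. In the notebook its inputs would be `outputs.last_hidden_state` and `encoded_inputs['attention_mask']`.

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real tokens only (mask == 1)."""
    # (batch, seq_len) -> (batch, seq_len, 1), so the mask broadcasts over hidden dims
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Dummy batch: one sequence, three positions, hidden size 2.
# The third position plays the role of a padding token.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(dim=1)
print(pooled)  # mean of the first two positions only
print(plain)   # mean that also includes the padded position
```

Switching to masked pooling would change the embedding values, so the outputs shown in this notebook correspond to the unmasked mean.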

In [10]:
# Compute a sentence embedding for each review
df['embedding'] = df['review'].swifter.apply(lambda e: get_avg_embeddings(e)[0])
df


Unnamed: 0,review,sentiment,embedding
0,One of the other reviewers has mentioned that ...,positive,"[tensor(0.0721), tensor(-0.1312), tensor(0.238..."
1,A wonderful little production. The filming tec...,positive,"[tensor(-0.0297), tensor(0.0778), tensor(0.290..."
2,I thought this was a wonderful way to spend ti...,positive,"[tensor(-0.1714), tensor(-0.2612), tensor(0.25..."
3,Basically there's a family where a little boy ...,negative,"[tensor(0.1463), tensor(-0.1541), tensor(0.470..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[tensor(-0.0845), tensor(0.0210), tensor(0.261..."
...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[tensor(0.1163), tensor(-0.1384), tensor(0.420..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[tensor(0.0776), tensor(-0.1896), tensor(0.329..."
49997,I am a Catholic taught in parochial elementary...,negative,"[tensor(-0.0230), tensor(-0.0619), tensor(0.17..."
49998,I'm going to have to disagree with the previou...,negative,"[tensor(-0.0065), tensor(-0.0204), tensor(0.16..."
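
The cell above calls the model once per review. Since the tokenizer also accepts a list of strings, several reviews can be encoded per forward pass. The sketch below shows only the batching logic, with a stand-in embedding function so that it runs without the model. `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names.

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts: Sequence[str],
                     embed_fn: Callable[[List[str]], Sequence],
                     batch_size: int = 32) -> list:
    """Apply embed_fn to each batch and collect one result row per text."""
    rows = []
    for batch in iter_batches(texts, batch_size):
        # embed_fn must return one row per input text, in order
        rows.extend(embed_fn(list(batch)))
    return rows

# Stand-in for a real embedding function: a 1-dimensional "embedding"
# equal to the text length, used only to check the plumbing.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(result)
```

With the notebook's names, a call might look like `embed_in_batches(df['review'].tolist(), get_avg_embeddings, batch_size=32)`. Because `get_avg_embeddings` returns a `(batch, hidden)` tensor, extending the list with it yields one row per review. This has not been run against the notebook's data, and the best batch size depends on available memory.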


In [11]:
# Save the result as a CSV file
util.embedding_df_to_csv(df,
                         csv_path=EMBEDDING_CSV_FILE,
                         ebd_cols=['embedding'])
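
`util` is a local module whose source is not shown here, so the on-disk format that `embedding_df_to_csv` uses is unknown. As an illustration only, one way to store a vector column in CSV is to serialize each vector as a JSON list string and parse it back on load. The helpers `embedding_to_cell` and `cell_to_embedding` below are hypothetical, and the rows are toy data.

```python
import csv
import io
import json

def embedding_to_cell(vector) -> str:
    """Serialize one embedding (any iterable of numbers) as a JSON list string."""
    # float() also accepts 0-dim tensors and NumPy scalars
    return json.dumps([float(x) for x in vector])

def cell_to_embedding(cell: str) -> list:
    """Parse a JSON list string back into a list of floats."""
    return json.loads(cell)

# Toy rows with invented values
rows = [
    {"sentiment": "positive", "embedding": [0.5, -0.25, 1.0]},
    {"sentiment": "negative", "embedding": [0.0, 2.0, -1.5]},
]

# Write to an in-memory buffer; a real file opened with newline="" works the same way
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["sentiment", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"sentiment": row["sentiment"],
                     "embedding": embedding_to_cell(row["embedding"])})

# Read the rows back and restore the vectors
buffer.seek(0)
restored = [cell_to_embedding(r["embedding"]) for r in csv.DictReader(buffer)]
print(restored)
```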