# IMDB Reviews embedding

Convert the movie reviews in the IMDB dataset into sentence embeddings and save them to `./data/imdb_embedding.csv`.

Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

In [2]:
import re
import torch
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from transformers import BertModel, BertTokenizer

import util

In [3]:
EN_BERT_PATH = './data/bert-base-uncased'
IMDB_FILE = './data/IMDB Dataset.csv'
EMBEDDING_CSV_FILE = './data/imdb_embedding.csv'

## 1. Text preprocessing

In [4]:
df = pd.read_csv(IMDB_FILE)
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [6]:
# Strip HTML tags
def remove_html_label(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

df['review'] = df['review'].apply(remove_html_label)
df['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object
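
By default `get_text()` concatenates text nodes with no separator, so markup such as `production.<br /><br />The` would collapse into `production.The`. A minimal sketch of one way around this, assuming a space-separated join followed by whitespace normalisation suits this dataset; the helper name `strip_html` is hypothetical and not part of the notebook:

```python
import re
from bs4 import BeautifulSoup

def strip_html(text):
    """Remove HTML tags, joining text nodes with a space (hypothetical helper)."""
    soup = BeautifulSoup(text, "html.parser")
    # separator=" " keeps the words on either side of a tag apart
    joined = soup.get_text(separator=" ")
    # Collapse the runs of whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", joined).strip()

sample = "A wonderful little production.<br /><br />The filming technique is great."
cleaned = strip_html(sample)
```

This could be swapped in for `remove_html_label` in the `apply` call if the missing spaces matter for tokenization.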

In [7]:
df.describe()

| | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49581 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [8]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## 2. Computing sentence embeddings

Download the model files for [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased):

```bash
conda install pytorch -y
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download bert-base-uncased --local-dir ./data/bert-base-uncased
```

In [None]:
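# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained(EN_BERT_PATH)
model = BertModel.from_pretrained(EN_BERT_PATH)

# Compute sentence embeddings
def get_avg_embeddings(sentences):
    """Return the mean of the last-layer token embeddings for the input text."""
    # BERT accepts at most 512 tokens, so longer reviews are truncated
    encoded_inputs = tokenizer(sentences, return_tensors='pt',
                               truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**encoded_inputs)
        embeddings = outputs.last_hidden_state.mean(dim=1)

    return embeddings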
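
`last_hidden_state.mean(dim=1)` averages over every token position, which is fine for a single unpadded sentence. If several sentences were batched with padding, the padded positions would be included in the average. A minimal sketch of mask-weighted averaging on dummy tensors follows; `masked_mean` is a hypothetical helper, and the dummy tensors stand in for `outputs.last_hidden_state` and `encoded_inputs['attention_mask']`:

```python
import torch

def masked_mean(hidden, attention_mask):
    """Average token vectors, ignoring padded positions (hypothetical helper).

    hidden: (batch, seq_len, dim) float tensor
    attention_mask: (batch, seq_len) tensor, 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                # guard against division by zero
    return summed / counts

# Dummy batch: 2 sentences, 3 positions, 2 dimensions.
# The last position of the first sentence is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1]])
pooled = masked_mean(hidden, attention_mask)
```

For a single sentence whose mask is all ones, this gives the same result as a plain `mean(dim=1)`.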

In [None]:
# Compute a sentence embedding for each review
df['embedding'] = df['review'].apply(get_avg_embeddings)
df
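
Running the model once per review over 50,000 rows is likely to be slow, especially on CPU. One common alternative is to feed the reviews in fixed-size batches. The sketch below only shows the batching step in plain Python; `iter_batches` and the batch size are illustrative assumptions, not part of the notebook:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

reviews = ["first review", "second review", "third review", "fourth", "fifth"]
batches = list(iter_batches(reviews, batch_size=2))
```

Each batch could then be passed to the tokenizer with `padding=True, truncation=True, return_tensors='pt'`. With padding enabled, the average should take the attention mask into account so that padded positions do not dilute the result.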

In [None]:
# Save the result as a CSV file
util.embedding_df_to_csv(df,
                         csv_path=EMBEDDING_CSV_FILE,
                         ebd_cols=['embedding'])
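
The `util` module is not shown here, so what `embedding_df_to_csv` does internally is unknown. As an illustration only, one possible way to store vectors in a CSV cell is to serialize each one as a JSON string. The function names, column names, and sample values below are hypothetical:

```python
import csv
import io
import json

def write_embeddings_csv(rows, fileobj):
    """Write (label, vector) pairs as CSV, storing each vector as a JSON string."""
    writer = csv.writer(fileobj)
    writer.writerow(["sentiment", "embedding"])
    for label, vector in rows:
        # Convert every component to a plain float so json can serialize it
        writer.writerow([label, json.dumps([float(x) for x in vector])])

def read_embeddings_csv(fileobj):
    """Read back the rows written by write_embeddings_csv."""
    reader = csv.DictReader(fileobj)
    return [(row["sentiment"], json.loads(row["embedding"])) for row in reader]

# Round trip through an in-memory buffer instead of a real file
rows = [("positive", [0.25, -1.5]), ("negative", [0.0, 2.0])]
buffer = io.StringIO()
write_embeddings_csv(rows, buffer)
buffer.seek(0)
restored = read_embeddings_csv(buffer)
```

Since `get_avg_embeddings` returns a tensor of shape `(1, hidden_size)`, it would need something like `.squeeze(0).tolist()` before being passed to a writer of this kind.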