# Word2Vec

Use Word2Vec to obtain word vectors for English and Chinese text.

In [1]:
# !pip install gensim
# !pip install nltk
# !pip uninstall scipy -y
# !pip install scipy==1.12.0
# !pip install jieba

In [2]:
!pip list | grep gensim
!pip list | grep scipy

gensim                                   4.3.2
scipy                                    1.12.0


In [3]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk import download

download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/changluo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 1. English word vectors

In [4]:
corpus = [
    "This is the first sentence for our word2vec example.",
    "Here is another sentence.",
    "Word2Vec is a great tool for word embeddings.",
    "This example is meant to show how to generate word vectors."
]

stop_words = set(stopwords.words('english'))
len(stop_words)

179

In [5]:
simple_preprocess('generate word vectors')

['generate', 'word', 'vectors']

In [6]:
processed_corpus = [
    [word for word in simple_preprocess(doc) if word not in stop_words]
    for doc in corpus
]
processed_corpus

[['first', 'sentence', 'word', 'vec', 'example'],
 ['another', 'sentence'],
 ['word', 'vec', 'great', 'tool', 'word', 'embeddings'],
 ['example', 'meant', 'show', 'generate', 'word', 'vectors']]
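Note that `word2vec` was split into `word` and `vec` in the output above, because the digit is dropped during tokenization. If you would rather keep such tokens whole, one option is a small regex tokenizer. The sketch below is only an illustration: `tokenize_keep_digits` is a hypothetical helper, not part of gensim.

```python
import re

# Hypothetical helper: keeps runs of ASCII letters and digits together,
# so "Word2Vec" survives as the single token "word2vec".
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize_keep_digits(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize_keep_digits("Word2Vec is a great tool for word embeddings.")
print(tokens)
# ['word2vec', 'is', 'a', 'great', 'tool', 'for', 'word', 'embeddings']
```

Stop-word filtering can then be applied to these tokens in the same way as in the cell above.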

In [7]:
model = Word2Vec(sentences=processed_corpus, vector_size=100, window=5, min_count=1, workers=4)
model

<gensim.models.word2vec.Word2Vec at 0x14cd1fdf0>

In [8]:
word_vectors = model.wv
word_vectors

<gensim.models.keyedvectors.KeyedVectors at 0x14cd1ee90>

In [9]:
vector = word_vectors['word']
vector.shape

(100,)

In [10]:
similar_words = word_vectors.most_similar('word')
similar_words

[('tool', 0.21618761122226715),
 ('first', 0.09310851246118546),
 ('meant', 0.0929148867726326),
 ('another', 0.07966356724500656),
 ('great', 0.06283358484506607),
 ('embeddings', 0.027093399316072464),
 ('show', 0.016160275787115097),
 ('example', -0.010845640674233437),
 ('vectors', -0.027724914252758026),
 ('vec', -0.052069321274757385)]
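`most_similar` ranks words by the cosine similarity of their vectors. With only four short sentences the scores above are likely dominated by random initialization, so they should not be read as meaningful. For reference, here is a minimal pure-Python sketch of the measure itself (the function name is chosen for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```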

In [11]:
type(word_vectors)

gensim.models.keyedvectors.KeyedVectors

## 2. Chinese word vectors

In [12]:
import re
import jieba

In [13]:
with open('./data/红楼梦.txt', 'r', encoding='utf-8') as f:
    content = f.read()
len(content)

858628

In [14]:
content[:300]

'第1章 甄士隐梦幻识通灵 贾雨村风尘怀闺秀\n\u3000\u3000此开卷第一回也。\n\u3000\u3000作者自云：因曾历过一番梦幻之后，故将真事隐去，而借通灵说撰此《石头记》一书也，故曰“甄士隐”云云。但书中所记何事何人？自己又云：“今风尘碌碌，一事无成，忽念及当日所有之女子，一一细考较去，觉其行止见识皆出我之上。我堂堂须眉，诚不若彼裙钗，我实愧则有馀，悔又无益，大无可如何之日也。当此日，欲将已往所赖天思祖德，锦衣纨裤之时，饫甘餍肥之日，背父兄教育之恩，负师友规训之德，以致今日一技无成，半生潦倒之罪，编述一集，以告天下。知我之负罪固多，然闺阁中历历有人，万不可因我之不肖，自护己短，一并使其泯灭也。所以蓬牖茅椽，绳床瓦灶，并不足'

In [15]:
# Remove newlines and full-width spaces (\u3000)
pattern = re.compile(r'[\n\u3000]')
content = pattern.sub('', content)

# Split the text into sentences on Chinese end-of-sentence punctuation
sentences = re.split(r'[。！？]', content)
len(sentences), sentences[:5]

(35077,
 ['第1章 甄士隐梦幻识通灵 贾雨村风尘怀闺秀此开卷第一回也',
  '作者自云：因曾历过一番梦幻之后，故将真事隐去，而借通灵说撰此《石头记》一书也，故曰“甄士隐”云云',
  '但书中所记何事何人',
  '自己又云：“今风尘碌碌，一事无成，忽念及当日所有之女子，一一细考较去，觉其行止见识皆出我之上',
  '我堂堂须眉，诚不若彼裙钗，我实愧则有馀，悔又无益，大无可如何之日也'])
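One detail worth knowing: `re.split` can leave empty strings in the result, for example after the final terminator or between two adjacent terminators. The sketch below shows the same cleaning and splitting steps on a tiny sample (the sample text is made up from phrases in the excerpt above) and filters the empty pieces out.

```python
import re

# Small sample in the same shape as the novel text: newlines,
# full-width spaces (\u3000), and Chinese sentence terminators.
sample = '此开卷第一回也。\n\u3000\u3000但书中所记何事何人？自己又云。'

# Remove newlines and full-width spaces.
cleaned = re.sub(r'[\n\u3000]', '', sample)

# Split on sentence terminators; without the filter, the list would end with ''.
parts = [s for s in re.split(r'[。！？]', cleaned) if s]
print(parts)
# ['此开卷第一回也', '但书中所记何事何人', '自己又云']
```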

In [16]:
# Load the Chinese stop words
with open('./data/cn_stopwords.txt', encoding='utf-8') as f:
    cn_stop_words = f.read()
cn_stop_words = cn_stop_words.split('\n')
len(cn_stop_words)

749
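The stop words are held in a list here, so every `word not in cn_stop_words` check scans the list. Over tens of thousands of sentences, a set is usually much faster for membership tests. A minimal sketch, using a made-up three-word stop list rather than the real file:

```python
# Made-up stand-in for the file contents; note the blank line and the duplicate.
raw = "的\n了\n是\n\n的\n"

# Build a set, dropping blank lines; duplicates collapse automatically.
stop_set = {line.strip() for line in raw.split('\n') if line.strip()}

tokens = ['贾雨村', '的', '风尘', '了']
filtered = [t for t in tokens if t not in stop_set]
print(filtered)
# ['贾雨村', '风尘']
```

Because blank lines and duplicates are dropped, the size of the set can be smaller than the length of the original list.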

In [17]:
cn_processed_corpus = [
    [word for word in jieba.lcut(text) if word not in cn_stop_words]
    for text in sentences
]
cn_processed_corpus[:1]

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/97/m67m_56s0dq5k20t3vp4_pgh0000gn/T/jieba.cache
Loading model cost 0.621 seconds.
Prefix dict has been built successfully.


[['章', '甄士隐', '梦幻', '识通灵', '贾雨村', '风尘', '怀', '闺秀', '开卷', '第一回']]

In [18]:
cn_model = Word2Vec(sentences=cn_processed_corpus, vector_size=100, window=15, min_count=1, workers=4)
cn_model

<gensim.models.word2vec.Word2Vec at 0x14d5405b0>

In [19]:
cn_word_vectors = cn_model.wv
cn_vector = cn_word_vectors['林黛玉']
cn_vector.shape

(100,)

In [20]:
cn_similar_words = cn_word_vectors.most_similar('林黛玉')
cn_similar_words

[('取乐', 0.9973452687263489),
 ('身', 0.9971345067024231),
 ('甚', 0.9969601631164551),
 ('惟有', 0.9969542622566223),
 ('尤三姐', 0.9969341158866882),
 ('薛宝钗', 0.9969031810760498),
 ('封肃', 0.9968977570533752),
 ('兼', 0.9968968033790588),
 ('尚未', 0.9968937039375305),
 ('金桂', 0.996833086013794)]
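All ten scores above sit near 0.997, so the ranking carries little information. One thing worth checking, offered as an assumption rather than a diagnosis, is how much of the vocabulary occurs only once: with `min_count=1` those tokens are kept even though they receive very little training signal. The helper below is hypothetical and runs on a toy corpus.

```python
from collections import Counter

def singleton_share(corpus):
    """Fraction of distinct tokens that occur exactly once in the corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus in the same shape as cn_processed_corpus (a list of token lists).
toy_corpus = [
    ['贾雨村', '风尘', '闺秀'],
    ['贾雨村', '甄士隐'],
    ['甄士隐', '梦幻'],
]
print(singleton_share(toy_corpus))  # 3 of 5 distinct tokens occur once
```

On the real data you would pass `cn_processed_corpus`. If the share is high, raising `min_count` or the `epochs` parameter of `Word2Vec` may be worth trying.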

A side note: SciPy changed the `scipy.linalg.triu` function, and as a result gensim crashes as of this writing (June 16, 2024).

> The scipy.linalg functions tri, triu & tril are deprecated and will be removed in SciPy 1.13. Users are recommended to use the NumPy versions of these functions with identical names.
>
> Source: [SciPy 1.11.0 Release Notes](https://scipy.github.io/devdocs/release/1.11.0-notes.html#deprecated-features)

The `gensim` repository has in fact already been patched for this problem ([issue 3525](https://github.com/piskvorky/gensim/issues/3525)), but the fix has not yet been published in a release, so the problem persists.

This should be resolved before long. If it still has not been fixed by the time you read this, consider one of the following two workarounds:

1. Downgrade SciPy to `1.12.0` (recommended; this is what I did):

   ```bash
   pip uninstall scipy -y
   pip install scipy==1.12.0
   ```

2. Download the latest code from the [gensim](https://github.com/piskvorky/gensim) repository and build the package from source:

   ```bash
   git clone https://github.com/piskvorky/gensim.git
   cd gensim
   pip uninstall gensim -y
   pip install -e .
   ```
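Before applying either workaround, it can help to check whether the installed SciPy is in the affected range. The sketch below assumes the cutoff is SciPy 1.13, as stated in the release note quoted above; `triu_removed` is a hypothetical helper name.

```python
import re
from importlib import metadata

def triu_removed(scipy_version):
    """Return True if the version string is 1.13 or later (assumed cutoff)."""
    match = re.match(r"(\d+)\.(\d+)", scipy_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {scipy_version!r}")
    return (int(match.group(1)), int(match.group(2))) >= (1, 13)

try:
    installed = metadata.version("scipy")
    print(installed, "affected:", triu_removed(installed))
except metadata.PackageNotFoundError:
    print("SciPy is not installed in this environment")
```

If the check reports an affected version and your gensim release still imports `triu` from `scipy.linalg`, either workaround above should apply.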