# Word2Vec

A quick aside before we start: SciPy removed the `scipy.linalg.triu` function in version 1.13, and as of this writing (June 16, 2024) that makes gensim crash.

> The scipy.linalg functions tri, triu & tril are deprecated and will be removed in SciPy 1.13. Users are recommended to use the NumPy versions of these functions with identical names.
>
> Source: [SciPy 1.11.0 Release Notes](https://scipy.github.io/devdocs/release/1.11.0-notes.html#deprecated-features)

The `gensim` repository has already fixed this (see [issue 3525](https://github.com/piskvorky/gensim/issues/3525)), but the fix has not shipped in a release yet, so the problem still exists today.

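This will surely be resolved before long. If you are reading this at a point in time where, as for me, it still has not been fixed, consider one of the following two workarounds: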

1. Roll SciPy back to `1.12.0` (recommended; this is what I did):

```bash
   pip uninstall scipy -y
   pip install scipy==1.12.0
```

2. Clone the latest code from the [gensim](https://github.com/piskvorky/gensim) repository and build the package from source:

```bash
   git clone https://github.com/piskvorky/gensim.git
   cd gensim
   pip uninstall gensim -y
   pip install -e .
```


In [1]:
# !pip install gensim
# !pip install nltk
# !pip uninstall scipy -y
# !pip install scipy==1.12.0



In [2]:
!pip list | grep gensim
!pip list | grep scipy

gensim                                   4.3.2
scipy                                    1.12.0


In [3]:
# Step 1: Import the necessary libraries
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import nltk
from nltk.corpus import stopwords

# Ensure you have the stopwords downloaded
nltk.download('stopwords')

# Step 2: Prepare your corpus (this is a simplified example)
corpus = [
    "This is the first sentence for our word2vec example.",
    "Here is another sentence.",
    "Word2Vec is a great tool for word embeddings.",
    "This example is meant to show how to generate word vectors."
]

# Step 3: Preprocess the corpus
stop_words = set(stopwords.words('english'))
processed_corpus = [
    [word for word in simple_preprocess(doc) if word not in stop_words]
    for doc in corpus
]

# Step 4: Train the Word2Vec model
model = Word2Vec(sentences=processed_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Step 5: Access the word vectors
word_vectors = model.wv

# Example: Get the vector for the word 'sentence'
vector = word_vectors['sentence']
print("Vector for 'sentence':", vector)

# Example: Find the most similar words to 'sentence'
similar_words = word_vectors.most_similar('sentence')
print("Words most similar to 'sentence':", similar_words)

Vector for 'sentence': [-8.2426779e-03  9.2993546e-03 -1.9766092e-04 -1.9672764e-03
  4.6036304e-03 -4.0953159e-03  2.7431143e-03  6.9399667e-03
  6.0654259e-03 -7.5107943e-03  9.3823504e-03  4.6718083e-03
  3.9661205e-03 -6.2435055e-03  8.4599797e-03 -2.1501649e-03
  8.8251876e-03 -5.3620026e-03 -8.1294188e-03  6.8245591e-03
  1.6711927e-03 -2.1985089e-03  9.5136007e-03  9.4938548e-03
 -9.7740470e-03  2.5052286e-03  6.1566923e-03  3.8724565e-03
  2.0227872e-03  4.3050171e-04  6.7363144e-04 -3.8206363e-03
 -7.1402504e-03 -2.0888723e-03  3.9238976e-03  8.8186832e-03
  9.2591504e-03 -5.9759365e-03 -9.4026709e-03  9.7643770e-03
  3.4297847e-03  5.1661171e-03  6.2823449e-03 -2.8042626e-03
  7.3227035e-03  2.8302716e-03  2.8710044e-03 -2.3803699e-03
 -3.1282497e-03 -2.3701417e-03  4.2764368e-03  7.6057913e-05
 -9.5842788e-03 -9.6655441e-03 -6.1481940e-03 -1.2856961e-04
  1.9974159e-03  9.4319675e-03  5.5843508e-03 -4.2906962e-03
  2.7831673e-04  4.9643586e-03  7.6983096e-03 -1.1442233e-03
 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/changluo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
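
`most_similar` ranks the other words in the vocabulary by cosine similarity to the query word. The sketch below illustrates that idea in plain Python on hand-made 2-dimensional vectors. It is not gensim's implementation, and `toy_vectors` and both function names are hypothetical, chosen only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=10):
    """Rank every word except the query by cosine similarity, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors, not the output of a trained model.
toy_vectors = {
    "sentence": [1.0, 0.0],
    "word": [0.8, 0.6],
    "vectors": [0.0, 1.0],
    "tool": [-1.0, 0.0],
}

ranking = most_similar("sentence", toy_vectors, topn=3)
for word, score in ranking:
    print(f"{word}: {score:.2f}")
```

With a corpus as small as the four sentences above, the trained vectors stay close to their random initialization, so the neighbours gensim returns for `'sentence'` should not be read as meaningful.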


In [4]:
word_vectors

<gensim.models.keyedvectors.KeyedVectors at 0x146ffffa0>