# Text Sentiment Classification with word2vec

word2vec is a tool that Google open-sourced in 2013. Its core idea is to map each word to a real-valued vector.

word2vec uses one of two models:

1. CBOW (Continuous Bag-of-Words) predicts the target (center) word given its context words.
![https://d2l.ai/_images/cbow.svg](https://d2l.ai/_images/cbow.svg)
2. Skip-Gram predicts the context words given the target (center) word.
![https://d2l.ai/_images/skip-gram.svg](https://d2l.ai/_images/skip-gram.svg)

Training either model yields a **word vector** for every word.
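
To make the difference between the two models concrete, the sketch below builds the training examples each one would see for a short sentence. The function names `skipgram_pairs` and `cbow_pairs`, the `window` parameter, and the example sentence are illustrative assumptions and not part of word2vec's own code, which also uses techniques such as negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """Return (context_words, center) pairs: the context predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


# Hypothetical example sentence, already split into tokens.
tokens = ["the", "man", "loves", "his", "son"]
print(skipgram_pairs(tokens, window=1)[:3])
print(cbow_pairs(tokens, window=1)[2])
```

Skip-Gram produces one training example per (center, neighbor) pair, while CBOW produces one example per position, with the whole context as input.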

We will not train word2vec here. Instead, we use pretrained word vectors directly for our task.

Pretrained Chinese word vectors: https://github.com/Embedding/Chinese-Word-Vectors

Install gensim, the natural language processing library we need:

In [None]:
!pip install gensim

Load the pretrained word vectors with gensim:

In [None]:
from gensim.models import KeyedVectors

In [None]:
model = KeyedVectors.load_word2vec_format("sgns.merge.word")

The word vectors reveal some interesting relationships between words:

In [None]:
model.most_similar("苹果")

In [None]:
model.most_similar("网球")

In [None]:
model.most_similar("多云")
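
The queries above look up "苹果" (apple), "网球" (tennis), and "多云" (cloudy). `most_similar` ranks the other words in the vocabulary by cosine similarity to the query vector. The sketch below reproduces that ranking on a tiny vocabulary; the three-dimensional vectors and the helper `most_similar_toy` are invented for illustration and are not taken from the pretrained model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional vectors, made up for this example.
toy_vectors = {
    "apple": np.array([0.9, 0.1, 0.0]),
    "banana": np.array([0.8, 0.2, 0.1]),
    "tennis": np.array([0.1, 0.9, 0.2]),
}


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


print(most_similar_toy("apple", toy_vectors))
```

With these toy values, "banana" ranks above "tennis" because its vector points in nearly the same direction as "apple".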

$最大 - 大 = 最小 - 小$ (that is, biggest - big = smallest - small)

In [None]:
model.most_similar(positive=["最大","小"],negative=["大"])

$中国 - 北京 = 法国 - 巴黎$ (that is, China - Beijing = France - Paris)

In [None]:
model.most_similar(positive=["中国","巴黎"],negative=["法国"])
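
Passing `positive` and `negative` lists asks for the word closest to the sum of the positive vectors minus the sum of the negative ones. The sketch below works this arithmetic through on hypothetical two-dimensional vectors chosen so that the analogy holds exactly. The `analogy` helper is an assumption made for illustration, and gensim's implementation differs in details such as normalizing the vectors first.

```python
import numpy as np

# Hypothetical 2-dimensional vectors, invented so that the analogy holds exactly.
toy_vectors = {
    "大": np.array([1.0, 0.0]),     # big
    "最大": np.array([1.0, 1.0]),   # biggest
    "小": np.array([-1.0, 0.0]),    # small
    "最小": np.array([-1.0, 1.0]),  # smallest
    "苹果": np.array([0.2, -1.0]),  # apple (distractor)
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = np.sum([vectors[w] for w in positive], axis=0)
    target = target - np.sum([vectors[w] for w in negative], axis=0)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# 最大 - 大 + 小 = [1, 1] - [1, 0] + [-1, 0] = [-1, 1]
print(analogy(positive=["最大", "小"], negative=["大"], vectors=toy_vectors))
```

The input words are excluded from the candidates because the nearest vector to the target would otherwise often be one of the inputs themselves.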

Reduce the dimensionality of the word vectors to see whether words from different categories separate in space:

In [None]:
import numpy as np

# Read one word list per category (fruit, weather, sports); undecodable bytes are skipped.
with open("147种水果的名字.txt", errors='ignore') as f:
    fruit_words = f.read().split()
with open("天气学专有词汇.txt", errors='ignore') as f:
    weather_words = f.read().split()
with open("运动、休闲词库.txt", errors='ignore') as f:
    sports_words = f.read().split()

# Keep only words in the model's vocabulary and stack their 300-dimensional vectors.
fruit_vecs = np.array([model[fw] for fw in fruit_words if fw in model])
weather_vecs = np.array([model[ww] for ww in weather_words if ww in model])
sports_vecs = np.array([model[sw] for sw in sports_words if sw in model])


In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

In [None]:
ts = TSNE(n_components=2)
reduced_vecs = ts.fit_transform(np.concatenate((fruit_vecs,weather_vecs,sports_vecs)))

In [None]:
# Color each point by category: fruit = blue, weather = red, sports = green.
for i in range(len(reduced_vecs)):
    if i < len(fruit_vecs):
        color = 'b'
    elif i < len(fruit_vecs) + len(weather_vecs):
        color = 'r'
    else:
        color = 'g'
    plt.plot(reduced_vecs[i, 0], reduced_vecs[i, 1], marker='o', color=color)

Write a helper function that reads a text file into a single string:

In [None]:
def file2str(file):
    with open(file,encoding='gbk',errors='ignore') as f:
        s = f.read().replace(" ","").replace("\n","")
    return s

Write a function that converts a whole string into a vector. We assume that the vector of a text is the mean of the vectors of its words.

In [None]:
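import jieba
import numpy as np

def str2vec(s):
    # Segment the text with jieba and keep only words in the model's vocabulary.
    vecs = [model[word] for word in jieba.cut(s) if word in model]
    if not vecs:
        # Assumption: fall back to a zero vector when no word is known (avoids dividing by zero).
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)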
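
As a small worked example of the averaging assumption, the sketch below computes a text vector from already segmented tokens. The two-dimensional vectors, the `mean_vector` helper, and the zero-vector fallback for texts with no known words are illustrative assumptions, not part of the pretrained model or of gensim.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors, made up for this example.
toy_vectors = {
    "好": np.array([1.0, 0.0]),    # good
    "电影": np.array([0.0, 1.0]),  # movie
}


def mean_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# The third token is out of vocabulary, so it is ignored.
print(mean_vector(["好", "电影", "未登录词"], toy_vectors, dim=2))  # [0.5 0.5]
```

Averaging discards word order, so two texts containing the same words in a different order receive the same vector.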

Load all the texts from the previous lecture:

In [None]:
import os
pos = [os.path.join("pos",p) for p in os.listdir("pos")]
neg = [os.path.join("neg",p) for p in os.listdir("neg")]

In [None]:
pos_sentences = [file2str(f) for f in pos]
neg_sentences = [file2str(f) for f in neg]

Convert every text into a vector and stack the vectors into one matrix for the machine learning algorithms that follow.

In [None]:
pos_vecs = np.array([str2vec(s) for s in pos_sentences])

In [None]:
neg_vecs = np.array([str2vec(s) for s in neg_sentences])

In [None]:
Vecs = np.vstack((pos_vecs,neg_vecs))

Build the label vector, labeling positive texts 1 and negative texts 0:

In [None]:
labels = np.concatenate((np.ones(len(pos_vecs)), np.zeros(len(neg_vecs))))

### Classification with Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(Vecs, labels)

In [None]:
clf.score(Vecs,labels)

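The score above is measured on the training data, so it overestimates how well the model generalizes. Hold out 20% of the data as a test set instead (a single train/test split, which is a simpler check than full k-fold cross-validation):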

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
Vecs_train, Vecs_test, labels_train, labels_test = train_test_split(Vecs,labels,test_size=0.2)

In [None]:
clf = LogisticRegression(random_state=0).fit(Vecs_train, labels_train)

In [None]:
clf.score(Vecs_test,labels_test)
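
A single split gives one accuracy estimate that depends on how the data happened to be divided. k-fold cross-validation averages over several splits, and scikit-learn's `cross_val_score` does this in one call. The sketch below runs it on synthetic data standing in for the document vectors; the arrays `X` and `y`, the cluster parameters, and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for document vectors: two well-separated clusters of 50 points each.
X = np.vstack((
    rng.normal(2.0, 1.0, size=(50, 5)),   # "positive" samples
    rng.normal(-2.0, 1.0, size=(50, 5)),  # "negative" samples
))
y = np.concatenate((np.ones(50), np.zeros(50)))

# Train and evaluate on 5 different train/test partitions of the data.
scores = cross_val_score(LogisticRegression(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```

With the real data, `X` and `y` would be replaced by `Vecs` and `labels`.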

### Classification with a Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(C=2, probability=True)
clf.fit(Vecs_train, labels_train)

In [None]:
clf.score(Vecs_test,labels_test)