Word2vec is a neural network implementation that learns distributed representations for words.

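Word2vec produces meaningful representations without using any labels. This is very useful, because most real-world data is unlabeled. Given enough text, the word vectors show a number of interesting properties. Words with similar meanings end up in the same cluster, and different clusters are separated from one another, so relationships between words can be expressed through vector arithmetic.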
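To make the "vector arithmetic" idea concrete, here is a minimal sketch using hand-made 2-dimensional vectors. The vectors and the words in `toy_vectors` are invented for illustration only; they are not produced by Word2vec, and real embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                       toy_vectors["man"],
                                       toy_vectors["woman"])]

# Pick the remaining word whose vector points in the most similar direction.
candidates = [w for w in toy_vectors if w not in ("king", "man", "woman")]
analogy_answer = max(candidates, key=lambda w: cosine(toy_vectors[w], target))
print(analogy_answer)
```

With these toy values the nearest remaining word is `queen`. Whether a trained model reproduces such analogies depends on the corpus and the training settings.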

Distributed word vectors are useful for word prediction and translation. Here we use them for sentiment analysis.

In [2]:
! unzip /kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip
! unzip /kaggle/input/word2vec-nlp-tutorial/testData.tsv.zip
! unzip /kaggle/input/word2vec-nlp-tutorial/unlabeledTrainData.tsv.zip

Archive:  /kaggle/input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip
  inflating: labeledTrainData.tsv    
Archive:  /kaggle/input/word2vec-nlp-tutorial/testData.tsv.zip
  inflating: testData.tsv            
Archive:  /kaggle/input/word2vec-nlp-tutorial/unlabeledTrainData.tsv.zip
  inflating: unlabeledTrainData.tsv  


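First, load the data with pandas. This time we also use unlabeledTrainData.tsv, which contains 50,000 reviews without labels. In Part 1, when we trained the bag-of-words model, a review without a label was of no use. Word2vec, however, can learn from unlabeled data.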

In [3]:
import pandas as pd

In [4]:
train = pd.read_csv("/kaggle/working/labeledTrainData.tsv", header=0, 
                     delimiter="\t", quoting=3)

test = pd.read_csv( "/kaggle/working/testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv("/kaggle/working/unlabeledTrainData.tsv", header=0, 
                              delimiter="\t", quoting=3 )

# Verify the number of reviews that were read (100,000 in total)
print("Read %d labeled train reviews, %d labeled test reviews, and %d unlabeled reviews\n" % (train["review"].size, test["review"].size, unlabeled_train["review"].size ))


Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews



In [5]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [6]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

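Next we clean the data. The process is similar to Part 1, with a few differences. First, when training word2vec it is better not to remove stop words, because the algorithm relies on the surrounding context words to produce high-quality word vectors. For that reason stop-word removal is optional in the function below. It may also be better not to remove numbers, although the function below still keeps letters only: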

In [7]:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    # Convert a review to a sequence of words, optionally removing
    # stop words. Returns a list of words.

    # 1. Remove HTML tags
    review_text = BeautifulSoup(review, 'lxml').get_text()

    # 2. Replace non-letter characters with spaces
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Convert to lower case and split into words
    words = review_text.lower().split()

    # 4. Optionally remove stop words (off by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    # 5. Return the list of words
    return words
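The text suggests that keeping numbers may help. As a minimal sketch of that variation, the hypothetical helper below (`text_to_wordlist_keep_digits` is not part of the original tutorial) widens the character class so digits survive. It skips the HTML-stripping step to stay dependency-free, so it assumes the input is already plain text.

```python
import re

def text_to_wordlist_keep_digits(text):
    """Lower-case and tokenize plain text, keeping letters and digits.

    Hypothetical variant of review_to_wordlist: the only change to the
    cleaning logic is the character class [^a-zA-Z0-9] instead of [^a-zA-Z].
    """
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return cleaned.lower().split()

tokens = text_to_wordlist_keep_digits("I rated it 10 out of 10!")
print(tokens)
```

Here the rating `10` stays in the token list, whereas the letters-only pattern would drop it.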

Now we need to settle on the input format. Word2vec expects individual sentences, each one a list of words. In other words, the input is a list of lists.

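Splitting a paragraph into sentences is not straightforward. An English sentence can end with "?", "!", a closing quotation mark, or ".", and neither whitespace nor capitalization is a reliable guide. We therefore use the punkt tokenizer from NLTK for sentence splitting.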
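To see why a simple rule is not enough, consider a naive splitter that breaks after ".", "!" or "?" followed by whitespace. This is only an illustration of the failure mode; `naive_split` and the sample text are made up for this sketch and are not how punkt works.

```python
import re

def naive_split(paragraph):
    """Split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

pieces = naive_split("Dr. Smith loved it. I did not!")
print(pieces)
```

The abbreviation "Dr." is treated as a sentence end, so the two sentences come back as three fragments. A trained tokenizer such as punkt is designed to handle abbreviations like this one.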

In [9]:
import nltk.data
# nltk.download()   

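# Load the punkt sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function that splits a review into parsed sentences
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Split a review into parsed sentences. Returns a list of sentences,
    # where each sentence is a list of words.

    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # Skip empty sentences
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))

    # Return the list of sentences. Each sentence is a list of words,
    # so this returns a list of lists.
    return sentences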

In [10]:
# The tokenizer only splits a paragraph into sentences;
# review_to_wordlist then cleans each individual sentence
s1 = review_to_sentences(train['review'][0], tokenizer)
# Show the word list for the first sentence
s1[0]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 've',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again']

Parse the review text from the training set and the unlabeled set into sentences, and collect those sentences in a single list:

In [11]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set


Parsing sentences from unlabeled set


Check the total number of sentences in the sentences list:

In [12]:
# Check how many sentences we have in total (about 795,000 in this run)
len(sentences)

795538

In [13]:
print(sentences[0])

print(sentences[1])

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']
['maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent']


**Training and saving the model**

In [14]:
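# Import the built-in logging module and configure it so that
# Word2Vec prints readable progress messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for the various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          vector_size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you do not plan to train the model any further, calling init_sims
# was the traditional way to make it more memory-efficient.
# Note: in gensim 4.x this call is deprecated and no longer needed
# for memory savings; it emits a deprecation warning.
model.init_sims(replace=True)

# Give the model a meaningful name and save it for later use.
# It can be loaded again later with Word2Vec.load()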
model_name = "/kaggle/working/300features_40minwords_10context"
model.save(model_name)

Training model...
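The two parameters `min_word_count` and `context` are easier to picture on a tiny example. The sketch below is a simplified, pure-Python illustration: `prune_vocab`, `context_pairs`, and the toy sentences are hypothetical and are not gensim functions. Real implementations also apply frequent-word downsampling and may vary the effective window size, both of which this sketch ignores.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times overall."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Yield (center, context) pairs within `window` positions of each word."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, sentence[j]

toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["the", "end"],
]

vocab = prune_vocab(toy_sentences, min_count=2)
pairs = list(context_pairs(["a", "b", "c"], window=1))
print(vocab)
print(pairs)
```

With `min_count=2` only "the" and "was" survive, which mirrors how a threshold of 40 discards rare words from the review corpus. A larger `window` produces more (center, context) pairs per word.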




**Exploring the model results**

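With training done, we can look at what the model learned from the 75,000 reviews. The doesnt_match function tries to infer which word in a set is least similar to the others: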

In [15]:
model.wv.doesnt_match('man woman child kitchen'.split())

'kitchen'

In [16]:
model.wv.doesnt_match('france england germany berlin'.split())

'berlin'

In [17]:
model.wv.doesnt_match('paris berlin london austria'.split())

'paris'
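As a rough picture of how such an "odd one out" decision can be made, the sketch below normalizes each vector, averages them, and returns the word least aligned with that average. This follows my understanding of the general approach and should be read as an assumption rather than a description of gensim internals; `odd_one_out` and the 2-dimensional vectors are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the group mean.

    `vectors` maps each word to a list of floats of equal length.
    """
    units = {word: unit(v) for word, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical toy vectors, not taken from the trained model.
toy = {
    "man":     [0.9, 0.1],
    "woman":   [0.85, 0.2],
    "child":   [0.8, 0.3],
    "kitchen": [0.1, 0.95],
}
outlier = odd_one_out(toy)
print(outlier)
```

Because the result depends only on vector geometry, a word can be flagged for reasons unrelated to the category a human has in mind, which may be why the third query above returns 'paris'.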

Use the most_similar function to look at word clusters:

In [18]:
model.wv.most_similar('man')

[('woman', 0.6229605674743652),
 ('lady', 0.601617693901062),
 ('lad', 0.5919666290283203),
 ('monk', 0.5295251607894897),
 ('millionaire', 0.5230624079704285),
 ('men', 0.514593780040741),
 ('soldier', 0.5063545107841492),
 ('guy', 0.49768635630607605),
 ('person', 0.4933617115020752),
 ('sailor', 0.4897652864456177)]

A test that is more relevant to sentiment analysis:

In [19]:
model.wv.most_similar('awful')

[('terrible', 0.7640948295593262),
 ('horrible', 0.7426056265830994),
 ('atrocious', 0.7376793622970581),
 ('dreadful', 0.7110174298286438),
 ('abysmal', 0.6884176731109619),
 ('appalling', 0.6729736924171448),
 ('horrid', 0.6491310000419617),
 ('horrendous', 0.6484755873680115),
 ('lousy', 0.6404727101325989),
 ('amateurish', 0.6080129742622375)]
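The scores in these lists are similarity values between word vectors. A minimal sketch of ranking words this way with cosine similarity is shown below; `most_similar_toy` and the toy vectors are hypothetical stand-ins and do not come from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def most_similar_toy(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical 2-dimensional vectors for illustration only.
toy = {
    "awful":    [1.0, 0.1],
    "terrible": [0.9, 0.2],
    "lousy":    [0.7, 0.5],
    "great":    [-0.8, 0.3],
}
ranking = most_similar_toy("awful", toy, topn=2)
print(ranking)
```

In this toy setup "terrible" ranks first and "lousy" second, while "great" points in nearly the opposite direction and gets a negative score.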