In [3]:
import pandas as pd

# 0. gensim in practice

1. Read the preprocessed data
2. Train the model
3. Done

# 1. Dataset path

In [4]:
merger_data_path = 'data/merged_train_test_seg_data.csv'

# 2. Load the data

In [5]:
merger_df = pd.read_csv(merger_data_path,header=None)
print('merger_data_path data size {}'.format(len(merger_df)))
merger_df.head()

merger_data_path data size 102871


Unnamed: 0,0
0,方向机 重 助力 泵 方向机 都 换 新 都 换 助力 泵 方向机 换 方向机 带 助力 重...
1,奔驰 ML500 排气 凸轮轴 调节 错误 有没有 电脑 检测 故障 代码 有发 一下 发动...
2,2010 款 宝马X1 2011 年 出厂 20 排量 通用 6L45 变速箱 原地 换挡 ...
3,30V6 发动机 号 位置 照片 最好 右侧 排气管 上方 缸体 上 靠近 变速箱 是不是 ...
4,2012 款 奔驰 c180 维修保养 动力 值得 拥有 家庭 用车 入手 维修保养 费用 ...


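# 3. Building the model

Gensim's Word2Vec model expects a list of tokenized sentences as input, that is, a list of token lists. A plain Python list works, but it holds the whole corpus in RAM, which becomes expensive for large datasets. Gensim only requires an iterable that yields the sentences in order, so in practice we can stream the corpus and keep a single sentence in memory at a time. The iterable must be restartable rather than a one-shot generator, because training makes several passes over the data. This notebook uses gensim's LineSentence, which streams one whitespace-tokenized sentence per line from a file.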
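The streaming idea can be sketched as a small restartable iterable. This is only an illustration, not gensim's own code: the class name `TokenizedLineCorpus` and the demo file contents are hypothetical, and the sketch assumes one sentence per line with tokens already separated by whitespace.

```python
import os
import tempfile


class TokenizedLineCorpus:
    """Restartable iterable that yields one tokenized sentence per line.

    Hypothetical helper for illustration; it assumes one sentence per line
    with tokens already separated by whitespace.
    """

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # more than once (vocabulary scan plus each training epoch).
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demo file standing in for the real corpus (made-up content).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_corpus.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("steering pump replaced\n\nengine fault code\n")

corpus = TokenizedLineCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass works because __iter__ re-opens the file
```

Only one line is held in memory at a time, and because `__iter__` opens the file anew on each call, the object can be traversed repeatedly, which a bare generator cannot.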

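## Word2Vec parameters
+ min_count

The right frequency cutoff depends on the size of the corpus. In a large corpus we usually want to ignore words that appear only once or twice; min_count controls this by dropping every word whose total count is below the threshold. Reasonable values generally lie between 0 and 100.

+ size

size sets the dimensionality of the word vectors (the width of the hidden layer), not the number of layers in the network. The default in Word2Vec is 100. Larger values require more training data but can yield more accurate models; reasonable values range from tens to a few hundred. In gensim 4.0 and later this parameter is named vector_size.

+ workers

workers sets the number of worker threads used for parallel training. It only has an effect when Cython is installed.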
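To make the effect of min_count concrete, here is a minimal sketch of frequency-based vocabulary pruning. It is an illustration of the idea only, not gensim's implementation; the function `build_vocab` and the toy sentences are made up for this example.

```python
from collections import Counter

# Made-up toy corpus: each inner list is one tokenized sentence.
sentences = [
    ["engine", "oil", "leak"],
    ["engine", "noise"],
    ["oil", "change", "engine"],
]


def build_vocab(tokenized_sentences, min_count):
    """Return a word -> count mapping for words seen at least min_count times."""
    counts = Counter(
        token for sentence in tokenized_sentences for token in sentence
    )
    return {word: n for word, n in counts.items() if n >= min_count}


# Words seen fewer than 2 times ("leak", "noise", "change") are dropped.
vocab = build_vocab(sentences, min_count=2)
```

With a threshold of 2, only "engine" and "oil" survive; rare words get no vector at all, which is why a query for a pruned word fails at lookup time.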

In [7]:
# Import word2vec
from gensim.models.word2vec import LineSentence
from gensim.models import word2vec
import gensim

# Configure logging
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Build and train

In [8]:
merger_data_path

'data/merged_train_test_seg_data.csv'

In [9]:
model = word2vec.Word2Vec(LineSentence(merger_data_path), workers=8,min_count=5,size=200)

2020-03-18 23:22:36,886 : INFO : collecting all words and their counts
2020-03-18 23:22:36,887 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-18 23:22:37,066 : INFO : PROGRESS: at sentence #10000, processed 941657 words, keeping 36796 word types
2020-03-18 23:22:37,265 : INFO : PROGRESS: at sentence #20000, processed 1897796 words, keeping 54149 word types
2020-03-18 23:22:37,448 : INFO : PROGRESS: at sentence #30000, processed 2842477 words, keeping 66984 word types
2020-03-18 23:22:37,634 : INFO : PROGRESS: at sentence #40000, processed 3759167 words, keeping 77921 word types
2020-03-18 23:22:37,832 : INFO : PROGRESS: at sentence #50000, processed 4736386 words, keeping 87832 word types
2020-03-18 23:22:38,058 : INFO : PROGRESS: at sentence #60000, processed 5775137 words, keeping 97810 word types
2020-03-18 23:22:38,268 : INFO : PROGRESS: at sentence #70000, processed 6837177 words, keeping 107437 word types
2020-03-18 23:22:38,450 : INFO : PROGRE

2020-03-18 23:23:00,141 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-03-18 23:23:00,152 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-03-18 23:23:00,153 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-03-18 23:23:00,159 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-03-18 23:23:00,161 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-03-18 23:23:00,162 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-03-18 23:23:00,163 : INFO : EPOCH - 5 : training on 9748591 raw words (8611558 effective words) took 4.2s, 2071884 effective words/s
2020-03-18 23:23:00,164 : INFO : training on a 48742955 raw words (43058405 effective words) took 20.8s, 2073782 effective words/s


# Find the most similar words

In [10]:
model.wv.most_similar(['奇瑞'],topn=10)

2020-03-18 23:23:57,303 : INFO : precomputing L2-norms of word weight vectors


[('东南', 0.8520945310592651),
 ('海马', 0.838854193687439),
 ('名爵', 0.8372061252593994),
 ('铃木', 0.8324585556983948),
 ('东风风行', 0.8304704427719116),
 ('江淮', 0.8286133408546448),
 ('猎豹', 0.8283058404922485),
 ('二代', 0.8249144554138184),
 ('瑞虎5', 0.8204711675643921),
 ('鹰', 0.8132823705673218)]
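The scores returned by most_similar are cosine similarities between word vectors, which is consistent with the "precomputing L2-norms" log line. The ranking can be sketched in plain Python as follows; the two-dimensional vectors and the helper names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-d vectors; real models use hundreds of dimensions.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],   # nearly parallel to "a"
    "c": [0.0, 1.0],   # orthogonal to "a"
    "d": [-1.0, 0.0],  # opposite to "a"
}

ranked = most_similar("a", toy_vectors, topn=2)
```

Here "b" ranks first because it points in almost the same direction as "a", followed by the orthogonal "c"; the opposite vector "d" falls outside the top two.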

# Save the model

In [11]:
save_model_path='data/wv/word2vec.model'
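Saving writes to the path above, so the data/wv directory has to exist first; assuming gensim does not create missing parent directories, a small guard like the one below avoids a failure on a fresh checkout. The helper name `ensure_parent_dir` is hypothetical, and the demo uses a temporary directory rather than the notebook's real path.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of path if it does not exist yet."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory is already there.
        os.makedirs(parent, exist_ok=True)
    return parent


# Demo under a temporary directory; in the notebook the call would be
# ensure_parent_dir(save_model_path) before model.save(save_model_path).
demo_path = os.path.join(tempfile.mkdtemp(), "data", "wv", "word2vec.model")
created = ensure_parent_dir(demo_path)
```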

In [12]:
model.save(save_model_path)

2020-03-18 23:24:21,878 : INFO : saving Word2Vec object under data/wv/word2vec.model, separately None
2020-03-18 23:24:21,878 : INFO : not storing attribute vectors_norm
2020-03-18 23:24:21,879 : INFO : not storing attribute cum_table
2020-03-18 23:24:22,574 : INFO : saved data/wv/word2vec.model


# Load the model

In [13]:
model = word2vec.Word2Vec.load(save_model_path)

2020-03-18 23:24:26,958 : INFO : loading Word2Vec object from data/wv/word2vec.model
2020-03-18 23:24:27,271 : INFO : loading wv recursively from data/wv/word2vec.model.wv.* with mmap=None
2020-03-18 23:24:27,271 : INFO : setting ignored attribute vectors_norm to None
2020-03-18 23:24:27,271 : INFO : loading vocabulary recursively from data/wv/word2vec.model.vocabulary.* with mmap=None
2020-03-18 23:24:27,271 : INFO : loading trainables recursively from data/wv/word2vec.model.trainables.* with mmap=None
2020-03-18 23:24:27,271 : INFO : setting ignored attribute cum_table to None
2020-03-18 23:24:27,271 : INFO : loaded data/wv/word2vec.model


In [14]:
model.wv.most_similar(['奇瑞'],topn=10)

2020-03-18 23:24:30,486 : INFO : precomputing L2-norms of word weight vectors


[('东南', 0.8520945310592651),
 ('海马', 0.838854193687439),
 ('名爵', 0.8372061252593994),
 ('铃木', 0.8324585556983948),
 ('东风风行', 0.8304704427719116),
 ('江淮', 0.8286133408546448),
 ('猎豹', 0.8283058404922485),
 ('二代', 0.8249144554138184),
 ('瑞虎5', 0.8204711675643921),
 ('鹰', 0.8132823705673218)]

# References

1. https://radimrehurek.com/gensim/models/word2vec.html 