<a href="https://colab.research.google.com/github/luojie1024/TextClassification/blob/main/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
import pandas as pd
import json
import jieba

# 0. Gensim in practice

1. Load the preprocessed data
2. Train
3. Done

# 1. Download the dataset

In [3]:
! wget https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip
! unzip tnews_public.zip

--2021-04-05 06:11:52--  https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.214.128, 172.253.114.128, 108.177.121.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.214.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4689325 (4.5M) [application/zip]
Saving to: ‘tnews_public.zip’


2021-04-05 06:11:52 (100 MB/s) - ‘tnews_public.zip’ saved [4689325/4689325]

Archive:  tnews_public.zip
  inflating: train.json              
  inflating: dev.json                
  inflating: test.json               
  inflating: labels.json             


## 1.1 Preprocess the data

In [22]:
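def get_sentence(data_file):
  # Read the sentences from a JSON-lines dataset file (one JSON object per line)
  sentence = []
  # Open the file that was passed in, not a hard-coded 'train.json'
  with open(data_file, 'r', encoding='utf-8') as f:
    for line in f:
      line = json.loads(line.strip())
      sentence.append(line['sentence'])
  return sentence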

In [25]:
# Load the sentence corpora
train_sentence=get_sentence('train.json')
test_sentence=get_sentence('test.json')
dev_sentence=get_sentence('dev.json')

# 2. Load the data

In [41]:
# Use all of the corpora as training data for the word vectors
train_data=train_sentence+test_sentence+dev_sentence

In [42]:
%%time
# Word segmentation
train_data=[list(jieba.cut(sentence)) for sentence in train_data]

# 3. Create the model

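Gensim's Word2Vec model expects a list of tokenized sentences as input, that is, a two-dimensional array (a list of token lists). Here we use Python's built-in list for now, although it takes up a lot of RAM when the input dataset is large. Gensim itself only requires an iterable that yields sentences in order, so in engineering practice we can use a custom iterable that keeps only one sentence in memory at a time. Because training makes more than one pass over the data, it should be a restartable iterable rather than a one-shot generator.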
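The streaming idea described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the class name `JsonLinesCorpus` and its `tokenize` and `field` parameters are hypothetical, and the files are assumed to be in the same JSON-lines layout as `train.json`, with the text under the `sentence` key.

```python
import json


class JsonLinesCorpus:
    """Restartable iterable that yields one tokenized sentence at a time."""

    def __init__(self, paths, tokenize=list, field="sentence"):
        self.paths = paths        # JSON-lines files, one record per line
        self.tokenize = tokenize  # callable: str -> iterable of tokens (hypothetical default: characters)
        self.field = field        # key holding the raw sentence text

    def __iter__(self):
        # Files are reopened on every call, so each pass starts from the beginning.
        for path in self.paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield list(self.tokenize(json.loads(line)[self.field]))
```

Under these assumptions, something like `JsonLinesCorpus(['train.json', 'dev.json', 'test.json'], tokenize=jieba.cut)` could stand in for the in-memory list, at the cost of re-tokenizing the text on every pass.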

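## Word2Vec parameters
+ min_count

Corpora of different sizes call for different minimum word frequencies. In a large corpus, for example, we usually want to ignore words that appear only once or twice, and the min_count parameter controls this threshold. In general, reasonable values lie between 0 and 100.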

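+ size

The size parameter sets the dimensionality of the word vectors (the width of the network's hidden layer), not the number of layers; the Word2Vec default is 100. A larger value requires more training data but can also improve overall accuracy. Reasonable values range from 10 to a few hundred. (Gensim 4.0 renamed this parameter to vector_size.)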

+ workers

The workers parameter sets the number of threads used for parallel training, although it only takes effect when Cython is installed:

In [36]:
# Import word2vec
from gensim.models.word2vec import LineSentence
from gensim.models import word2vec
import gensim

# Configure logging
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
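To make the `min_count` parameter described above more concrete, here is a small pure-Python sketch of the kind of frequency filtering it implies. The helper `filter_vocab` and the toy sentences are hypothetical and only mimic the effect; Gensim's own vocabulary building does more than this.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Return the words whose corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy corpus of already-tokenized sentences (made up for illustration)
toy_sentences = [["智能", "手机"], ["智能", "农业"], ["智能", "手机", "发布"]]

print(filter_vocab(toy_sentences, min_count=1))  # every word is kept
print(filter_vocab(toy_sentences, min_count=2))  # words seen only once are dropped
```

With `min_count=2`, only `智能` (3 occurrences) and `手机` (2 occurrences) survive, which is why a higher threshold shrinks the vocabulary on a small corpus.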

# Build and train

## FastText

In [56]:
from gensim.models import FastText

In [57]:
model = FastText(train_data, size=4, window=3, min_count=1, iter=10, min_n=3, max_n=6, word_ngrams=0)

2021-04-05 06:45:58,867 : INFO : collecting all words and their counts
2021-04-05 06:45:58,876 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-05 06:45:58,915 : INFO : PROGRESS: at sentence #10000, processed 129141 words, keeping 25698 word types
2021-04-05 06:45:58,957 : INFO : PROGRESS: at sentence #20000, processed 258779 words, keeping 39123 word types
2021-04-05 06:45:58,999 : INFO : PROGRESS: at sentence #30000, processed 387255 words, keeping 49324 word types
2021-04-05 06:45:59,039 : INFO : PROGRESS: at sentence #40000, processed 515598 words, keeping 57864 word types
2021-04-05 06:45:59,080 : INFO : PROGRESS: at sentence #50000, processed 644895 words, keeping 65382 word types
2021-04-05 06:45:59,122 : INFO : PROGRESS: at sentence #60000, processed 774201 words, keeping 67815 word types
2021-04-05 06:45:59,172 : INFO : PROGRESS: at sentence #70000, processed 903891 words, keeping 67815 word types
2021-04-05 06:45:59,220 : INFO : PROGRESS: at 
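The `min_n` and `max_n` arguments in the FastText call above bound the length of the character n-grams that FastText associates with each word. The sketch below only illustrates how such n-grams can be enumerated, using `<` and `>` as word-boundary markers in the style of the FastText paper; the helper `char_ngrams` is hypothetical and omits details such as hashing n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word, with < and > as boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("智能"))                      # ['<智能', '智能>', '<智能>']
```

Short Chinese words produce very few n-grams at `min_n=3`, so the subword range is worth tuning for a corpus segmented into two-character words.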

## Skip-gram and CBOW

In [45]:
# sg : {0, 1}, optional Training algorithm: 1 for skip-gram; otherwise CBOW.
model = word2vec.Word2Vec(train_data, sg=1, workers=8, min_count=5, size=200, iter=1)

2021-04-05 06:42:01,998 : INFO : collecting all words and their counts
2021-04-05 06:42:02,001 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-05 06:42:02,055 : INFO : PROGRESS: at sentence #10000, processed 129141 words, keeping 25698 word types
2021-04-05 06:42:02,104 : INFO : PROGRESS: at sentence #20000, processed 258779 words, keeping 39123 word types
2021-04-05 06:42:02,161 : INFO : PROGRESS: at sentence #30000, processed 387255 words, keeping 49324 word types
2021-04-05 06:42:02,209 : INFO : PROGRESS: at sentence #40000, processed 515598 words, keeping 57864 word types
2021-04-05 06:42:02,260 : INFO : PROGRESS: at sentence #50000, processed 644895 words, keeping 65382 word types
2021-04-05 06:42:02,306 : INFO : PROGRESS: at sentence #60000, processed 774201 words, keeping 67815 word types
2021-04-05 06:42:02,352 : INFO : PROGRESS: at sentence #70000, processed 903891 words, keeping 67815 word types
2021-04-05 06:42:02,403 : INFO : PROGRESS: at 
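The difference between the two training algorithms selected by `sg` can be illustrated by the training examples each one derives from a sentence. The following is a simplified sketch with hypothetical helper names: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. It uses a fixed window and ignores subsampling and negative sampling, so it is not a description of Gensim's actual implementation.

```python
def skipgram_pairs(sentence, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


def cbow_examples(sentence, window=2):
    """(context words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples


tokens = ["a", "b", "c"]
print(skipgram_pairs(tokens, window=1))  # [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
print(cbow_examples(tokens, window=1))   # [(['b'], 'a'), (['a', 'c'], 'b'), (['b'], 'c')]
```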

# Find the nearest words

In [60]:
model.wv.most_similar(['智能'],topn=10)

[('农业', 0.9993733763694763),
 ('美铝', 0.9988551139831543),
 ('科技', 0.9978682994842529),
 ('电商', 0.9966495037078857),
 ('零售', 0.9964438676834106),
 ('物流', 0.9961023330688477),
 ('鸿润', 0.9958018064498901),
 ('建设', 0.9957801699638367),
 ('金融', 0.9954661726951599),
 ('巴航', 0.9950317740440369)]
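The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of that ranking, assuming a plain dictionary of toy vectors rather than a trained model (the helper names and the numbers below are made up for illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(query, vectors, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical 2-dimensional vectors, not taken from the trained model
toy_vectors = {"智能": [1.0, 0.0], "科技": [0.9, 0.1], "小麦": [0.0, 1.0]}
print(most_similar_toy("智能", toy_vectors, topn=2))
```

With very low-dimensional vectors or very little training, most vectors point in nearly the same direction, so many scores cluster close to 1.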

# Save the model

In [51]:
save_model_path='word2vec.model'

In [52]:
model.save(save_model_path)

2021-04-05 06:42:58,312 : INFO : saving Word2Vec object under word2vec.model, separately None
2021-04-05 06:42:58,314 : INFO : not storing attribute vectors_norm
2021-04-05 06:42:58,316 : INFO : not storing attribute cum_table
2021-04-05 06:42:58,947 : INFO : saved word2vec.model


# Load the model

In [53]:
model = word2vec.Word2Vec.load(save_model_path)

2021-04-05 06:43:03,418 : INFO : loading Word2Vec object from word2vec.model
2021-04-05 06:43:03,859 : INFO : loading wv recursively from word2vec.model.wv.* with mmap=None
2021-04-05 06:43:03,862 : INFO : setting ignored attribute vectors_norm to None
2021-04-05 06:43:03,866 : INFO : loading vocabulary recursively from word2vec.model.vocabulary.* with mmap=None
2021-04-05 06:43:03,870 : INFO : loading trainables recursively from word2vec.model.trainables.* with mmap=None
2021-04-05 06:43:03,876 : INFO : setting ignored attribute cum_table to None
2021-04-05 06:43:03,877 : INFO : loaded word2vec.model


In [54]:
model.wv.most_similar(['奇瑞'],topn=10)

2021-04-05 06:43:06,461 : INFO : precomputing L2-norms of word weight vectors


[('公积金', 0.9978774189949036),
 ('股价', 0.9977446794509888),
 ('小麦', 0.997087299823761),
 ('自主', 0.9970433712005615),
 ('东风', 0.9969918727874756),
 ('坊', 0.9968271851539612),
 ('沃尔沃', 0.9967370629310608),
 ('幼儿园', 0.9965959787368774),
 ('本田', 0.9965536594390869),
 ('千亿', 0.9965298771858215)]

# References

1. https://radimrehurek.com/gensim/models/word2vec.html 