# LSI text preprocessing example

In [12]:
from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel
from Common.DataCenter import data_center
import pandas as pd

First, here is the example from the official documentation.

In [2]:
model = LsiModel(common_corpus, id2word=common_dictionary)
vectorized_corpus = model[common_corpus]

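It is that straightforward: pass in a corpus and a dictionary, and you get an LSA model. Once the model exists, indexing it with a corpus returns that corpus as vectors. Below we build everything from scratch to see how it works.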

Example: how to build a corpus

In [3]:
from collections import defaultdict
from gensim import corpora
# The document collection: a list in which each element is one document string
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# Lowercase each document, split it on whitespace, and remove stop words
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
# Remove rare words: drop every token that appears only once in the whole collection
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
# Store every remaining word in a dictionary, which assigns each unique word an integer id
dictionary = corpora.Dictionary(texts)
# Convert each document with doc2bow: a token list becomes [(id1, count), (id2, count), ...], one pair per unique word
corpus = [dictionary.doc2bow(text) for text in texts]

doc2bow converts a tokenized document into pairs of internal integer token ids and counts.

In [4]:
dictionary.doc2bow(['survey', 'graph', 'graph'])

[(4, 1), (10, 2)]

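Here 4 is the id of survey, and the 1 after it means survey appears once; 10 is the id of graph, and the 2 means it appears twice.  
To summarize: the document collection is a list whose elements are strings. We lowercase each string by hand and split it on whitespace, so each document becomes a list of words; the code above also drops stop words and words that occur only once. Dictionary then takes this list of lists and stores every word, assigning each one an index.  
Next, doc2bow converts every document in the collection to the form [(id1, count), (id2, count), ...].  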
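
The steps summarized above can be sketched in plain Python, without gensim. This is only an illustration: `build_token2id` and `to_bow` are hypothetical helpers, not gensim functions, and the id assignment (the new tokens of each document are numbered in sorted order) is an assumption chosen because it reproduces the ids printed above for this collection.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stoplist = set("for a of the and to in".split())

# Lowercase, split on whitespace, drop stop words
texts = [[w for w in d.lower().split() if w not in stoplist] for d in documents]

# Drop tokens that occur only once in the whole collection
frequency = Counter(token for text in texts for token in text)
texts = [[t for t in text if frequency[t] > 1] for text in texts]


def build_token2id(texts):
    """Give each unique token an integer id (hypothetical helper).

    Assumption: unseen tokens of each document are numbered in sorted order.
    """
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = build_token2id(texts)
corpus = [to_bow(text, token2id) for text in texts]
print(to_bow(["survey", "graph", "graph"], token2id))  # [(4, 1), (10, 2)]
```

Under that assumption the sketch yields 12 unique tokens and the same pairs as the doc2bow call above; with a different id assignment the ids would change, but the counts would not.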
Now we use LsiModel.

In [5]:
testModel = LsiModel(corpus=corpus, id2word=dictionary)
vec_corpus = testModel[corpus]  # use the model trained on this corpus, not the earlier `model`

List the topics in order of importance; by default each topic shows its top 10 words.

In [6]:
testModel.show_topics()

[(0,
  '0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"'),
 (1,
  '-0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system" + 0.141*"eps" + 0.113*"human" + -0.107*"response" + -0.107*"time" + 0.072*"interface"'),
 (2,
  '-0.426*"response" + -0.426*"time" + 0.361*"system" + -0.338*"user" + 0.330*"eps" + 0.289*"human" + 0.231*"trees" + 0.223*"graph" + -0.178*"survey" + -0.164*"computer"'),
 (3,
  '-0.595*"computer" + -0.552*"interface" + -0.415*"human" + 0.333*"system" + 0.188*"eps" + 0.099*"user" + 0.074*"time" + 0.074*"response" + -0.032*"survey" + 0.025*"trees"'),
 (4,
  '0.594*"trees" + -0.537*"survey" + 0.332*"user" + -0.300*"minors" + 0.282*"interface" + -0.159*"system" + 0.115*"eps" + -0.107*"computer" + -0.106*"human" + 0.080*"time"'),
 (5,
  '0.496*"interface" + -0.392*"trees" + 0.385*"user" + -0.341*"human" + 0.277*"minors" + 0.272*"e
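
Each topic is printed as a string of `weight*"word"` terms. If the weights are needed as numbers, the string can be parsed. The sketch below does this with a regular expression for topic 0, copied from the output above; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

# Topic 0 exactly as printed above
topic0 = (
    '0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + '
    '0.265*"response" + 0.240*"computer" + 0.221*"human" + '
    '0.206*"survey" + 0.198*"interface" + 0.036*"graph"'
)

# One term: an optionally negative decimal weight, '*', then a quoted word
TERM = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')


def parse_topic(topic):
    """Split a topic string into (word, weight) pairs, keeping the printed order."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic)]


pairs = parse_topic(topic0)
print(pairs[:3])
```

gensim's `show_topic(topicno, topn=10)` returns (word, weight) pairs directly, so parsing like this is mainly useful when only the printed output is at hand.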

Next, take a sentence (the query) and project it into the latent space.

In [7]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = testModel[vec_bow]  # convert the query to LSI space
print(vec_lsi)
testModel[corpus[0]]

[(0, 0.4618210045327161), (1, 0.07002766527900023), (2, 0.12452907551899081), (3, -1.0097125584438564), (4, -0.21303040605626267), (5, -0.5959384533820675), (6, 0.2204175354609439), (7, 0.0018778773554747955), (8, -0.08576685494995556)]


[(0, 0.6594664059797395),
 (1, 0.1421154440372992),
 (2, 0.2595687142084211),
 (3, -1.561952142099366),
 (4, 0.06873853289228493),
 (5, -0.1000604422714601),
 (6, 0.1499940942871652),
 (7, -0.008062159852297827),
 (8, 0.023163410616346095)]

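As we can see, the result is a list of tuples: the first element of each tuple is the topic id, and the second is the value along that topic. Together they form the vector in the latent space.  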
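
Because each vector is a list of (topic id, value) tuples, comparing two documents in the latent space needs only a little arithmetic. The sketch below computes the cosine similarity between the query and the first document, using the two vectors printed above. `cosine` is a hypothetical helper written for this illustration, and cosine similarity is an assumed choice of measure, not something the example above prescribes.

```python
import math

# Query vector and first-document vector, copied from the output above
vec_lsi = [
    (0, 0.4618210045327161), (1, 0.07002766527900023), (2, 0.12452907551899081),
    (3, -1.0097125584438564), (4, -0.21303040605626267), (5, -0.5959384533820675),
    (6, 0.2204175354609439), (7, 0.0018778773554747955), (8, -0.08576685494995556),
]
doc0_lsi = [
    (0, 0.6594664059797395), (1, 0.1421154440372992), (2, 0.2595687142084211),
    (3, -1.561952142099366), (4, 0.06873853289228493), (5, -0.1000604422714601),
    (6, 0.1499940942871652), (7, -0.008062159852297827), (8, 0.023163410616346095),
]


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as (topic_id, value) lists."""
    da, db = dict(a), dict(b)
    dot = sum(value * db.get(topic, 0.0) for topic, value in da.items())
    norm_a = math.sqrt(sum(value * value for value in da.values()))
    norm_b = math.sqrt(sum(value * value for value in db.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(cosine(vec_lsi, doc0_lsi))
```

Going through `dict` keeps the function correct even when a vector lists only some of the topic ids.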
Next we use the data from the data center to train LSA and then compute the vectors in the latent space.

In [10]:
def dc_format(D):
    # D[0] holds the messages and D[1] the matching sentiment labels
    data = {'message': D[0], 'sentiment': D[1]}
    return pd.DataFrame(data)

In [13]:
dc = data_center('./twitter_sentiment_data.csv', test_size=8000, noisy_size=8000, validation_size=5000)
test_df = dc_format(dc.get_test())
val_df = dc_format(dc.get_validation())

print(f"Test size: {test_df.shape[0]}")
print(f"Validation size: {val_df.shape[0]}")

Test size: 8000
Validation size: 5000
