# 3. Topics and transformations

In [18]:
import tempfile
import os.path

TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))


Folder "/var/folders/jn/wdgx85t54gl1m5byxrycxt280000gp/T" will be used to save temporary dictionary and corpus.


In [19]:
from gensim import corpora, models, similarities
if os.path.isfile(os.path.join(TEMP_FOLDER, 'deerwester.dict')):
    dictionary = corpora.Dictionary.load(os.path.join(TEMP_FOLDER, 'deerwester.dict'))
    corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'deerwester.mm'))
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")

Used files generated from first tutorial


In [8]:
print(dictionary[0])
print(dictionary[1])
print(dictionary[2])

human
interface
computer


## Creating a transformation

Transformations are standard Python objects, typically initialized from a training corpus:


In [9]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

In [10]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]
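
To make the numbers above less opaque, here is a minimal sketch of the weighting, assuming gensim's default TF-IDF settings (raw term count times a base-2 log of inverse document frequency, followed by scaling to unit length). The function name `tfidf_sketch` and the document frequencies are assumptions made for illustration; they are not read from the real dictionary.

```python
import math

def tfidf_sketch(doc_bow, doc_freqs, num_docs):
    """Weight a bag-of-words vector by count * log2(N / df), then scale to unit length."""
    weighted = [
        (term_id, count * math.log2(num_docs / doc_freqs[term_id]))
        for term_id, count in doc_bow
    ]
    norm = math.sqrt(sum(weight ** 2 for _, weight in weighted))
    # A vector whose terms all have zero weight cannot be normalized.
    return [(term_id, weight / norm) for term_id, weight in weighted] if norm else []

# Assumed corpus statistics: 9 documents, terms 0 and 1 each occur in 2 of them.
NUM_DOCS = 9
DOC_FREQS = {0: 2, 1: 2}

weights = tfidf_sketch([(0, 1), (1, 1)], DOC_FREQS, NUM_DOCS)
print(weights)
```

Because both terms are assumed to be equally rare, they end up with equal weights, and unit-length scaling puts each one near 0.707, in line with the output above.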


The same model can also be applied to a whole corpus:


In [11]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(2, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.3244870206138555), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.44424552527467476)]
[(1, 0.5710059809418182), (4, 0.4170757362022777), (5, 0.4170757362022777), (8, 0.5710059809418182)]
[(0, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(4, 0.45889394536615247), (6, 0.6282580468670046), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(3, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


Transformations can also be chained, one on top of another:


In [12]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
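
The "double wrapper" idea can be illustrated without gensim. The sketch below is only an analogy: `LazyTransformed`, `double_weights` and `negate_weights` are hypothetical names, not gensim classes or functions. It shows that wrapping a corpus does no work up front and that each document passes through the whole chain only when the wrapper is iterated.

```python
class LazyTransformed:
    """Wrap a corpus so that `transform` is applied to each document on iteration."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)

calls = []  # records the order in which the transformations actually run

def double_weights(doc):
    calls.append("double")
    return [(term_id, 2 * weight) for term_id, weight in doc]

def negate_weights(doc):
    calls.append("negate")
    return [(term_id, -weight) for term_id, weight in doc]

toy_corpus = [[(0, 1.0)], [(1, 2.0)]]

# Building the chain performs no computation yet.
chained = LazyTransformed(negate_weights, LazyTransformed(double_weights, toy_corpus))
calls_before_iteration = list(calls)

# Iterating pushes each document through both steps, one document at a time.
result = list(chained)
print(calls_before_iteration, result, calls)
```

One consequence of this design is that every pass over the wrapper recomputes the transformations, so a result that is iterated many times may be worth storing instead.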


In [13]:
lsi.print_topics(2)


[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

It appears that, according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four are more strongly related to the first:

In [14]:
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)


[(0, 0.066007833960903428), (1, -0.52007033063618413)]
[(0, 0.19667592859142452), (1, -0.76095631677000486)]
[(0, 0.089926399724464548), (1, -0.72418606267525054)]
[(0, 0.07585847652178164), (1, -0.63205515860034234)]
[(0, 0.10150299184980072), (1, -0.57373084830029597)]
[(0, 0.70321089393783143), (1, 0.16115180214025759)]
[(0, 0.87747876731198349), (1, 0.16758906864659387)]
[(0, 0.90986246868185816), (1, 0.14086553628719001)]
[(0, 0.61658253505692828), (1, -0.053929075663893655)]
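
The split between the two groups of documents can be checked mechanically. The sketch below copies the printed vectors and picks, for each document, the topic with the largest absolute weight. Comparing by absolute value is an assumption made for this illustration (the sign of an LSI dimension is arbitrary), and `dominant_topic` is a hypothetical helper, not part of gensim.

```python
# Document vectors copied from the output above: (topic_id, weight) pairs.
corpus_lsi_output = [
    [(0, 0.066007833960903428), (1, -0.52007033063618413)],
    [(0, 0.19667592859142452), (1, -0.76095631677000486)],
    [(0, 0.089926399724464548), (1, -0.72418606267525054)],
    [(0, 0.07585847652178164), (1, -0.63205515860034234)],
    [(0, 0.10150299184980072), (1, -0.57373084830029597)],
    [(0, 0.70321089393783143), (1, 0.16115180214025759)],
    [(0, 0.87747876731198349), (1, 0.16758906864659387)],
    [(0, 0.90986246868185816), (1, 0.14086553628719001)],
    [(0, 0.61658253505692828), (1, -0.053929075663893655)],
]

def dominant_topic(doc):
    """Return the id of the topic whose weight has the largest magnitude."""
    return max(doc, key=lambda pair: abs(pair[1]))[0]

dominant = [dominant_topic(doc) for doc in corpus_lsi_output]
print(dominant)  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Topic ids are zero-based here, so `1` is the second topic and `0` is the first.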


In [15]:
lsi.save(os.path.join(TEMP_FOLDER, 'model.lsi'))  # the same save() call works for tfidf, lda, ...
# To reload the model later:
# lsi = models.LsiModel.load(os.path.join(TEMP_FOLDER, 'model.lsi'))