### Transform one vector representation to another

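Here we convert the previously created `Bag of words` corpus to `TF-IDF`. A transformation always maps between two specific vector spaces, so the vectors you transform must come from the same input feature space the model was trained on: the same feature ids and the same preprocessing of inputs.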

In [9]:
import os
from gensim import corpora, models, similarities

# Reuse the dictionary and corpus saved by the first tutorial.
if os.path.exists('/tmp/deerwester.dict'):
    dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
    corpus = corpora.MmCorpus('/tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")

Used files generated from first tutorial


1. Initialize a model

In [10]:
tfidf = models.TfidfModel(corpus)

2. Use the model to transform a single vector or the entire corpus

In [11]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


In [12]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
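
To see where these weights come from, the sketch below recomputes them by hand in plain Python. It assumes the bag-of-words counts shown in `corpus` (inferred from the first tutorial and the output above, not loaded from disk) and assumes the default weighting is term count times `log2(N / df)`, followed by scaling each document to unit length. `tfidf_by_hand` is a hypothetical helper, not part of gensim.

```python
import math
from collections import Counter

# Assumed bag-of-words corpus from the first tutorial: (token_id, count) pairs.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

num_docs = len(corpus)

# Document frequency: in how many documents each token id occurs.
doc_freq = Counter(token_id for doc in corpus for token_id, _ in doc)

# Inverse document frequency with a base-2 logarithm (assumed default).
idf = {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}


def tfidf_by_hand(doc_bow):
    """Weight each count by its idf, then scale the vector to unit length."""
    weights = [(token_id, count * idf[token_id]) for token_id, count in doc_bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []
    return [(token_id, w / norm) for token_id, w in weights]


for doc in corpus:
    print(tfidf_by_hand(doc))
```

Under these assumptions the printed values agree with the `tfidf[corpus]` output above up to floating-point rounding. Note how token 5 in the fourth document gets a larger weight than its neighbours because it occurs twice there.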


Calling `model[corpus]` only creates a wrapper around the old corpus. The actual conversion is done on the fly, one document at a time, as the wrapper is iterated.
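
The on-the-fly behaviour can be pictured with a small stand-in class. This is only an illustration of the idea, not gensim's implementation: `LazyTransformedCorpus`, `double_weights` and `toy_corpus` are hypothetical names made up for the sketch.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for the wrapper returned by model[corpus]."""

    def __init__(self, transform, corpus):
        # Nothing is computed here; we only remember what to do later.
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # The transformation runs per document, every time we iterate.
        for doc in self.corpus:
            yield self.transform(doc)


calls = []  # records every document the transformation is applied to


def double_weights(doc):
    """Toy transformation: doubles each weight and logs the call."""
    calls.append(doc)
    return [(token_id, 2 * weight) for token_id, weight in doc]


toy_corpus = [[(0, 1)], [(1, 1), (2, 1)]]

wrapped = LazyTransformedCorpus(double_weights, toy_corpus)
calls_after_wrapping = len(calls)      # no work done yet

first_pass = list(wrapped)
calls_after_first_pass = len(calls)    # one call per document

second_pass = list(wrapped)
calls_after_second_pass = len(calls)   # the work is repeated on each pass

print(calls_after_wrapping, calls_after_first_pass, calls_after_second_pass)
```

Because the work is repeated on every pass, a costly transformation that is iterated many times may be worth storing first, for example with `corpora.MmCorpus.serialize`.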

##### Transformations can be serialized/chained

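Here we convert the `TF-IDF` corpus into a 2D latent space (`num_topics=2`) with `LSI (Latent Semantic Indexing)`. Since `lsi` is applied to the output of `tfidf`, the two transformations are chained: bag of words, then `TF-IDF`, then `LSI`.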

In [13]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

In [17]:
lsi.print_topics(2)

[(0, '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'), (1, '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

In [18]:
for doc in corpus_lsi:
    print(doc)

[(0, 0.06600783396090736), (1, -0.5200703306361845)]
[(0, 0.1966759285914284), (1, -0.7609563167700036)]
[(0, 0.08992639972446881), (1, -0.7241860626752503)]
[(0, 0.07585847652178614), (1, -0.6320551586003424)]
[(0, 0.10150299184980326), (1, -0.5737308483002951)]
[(0, 0.7032108939378301), (1, 0.16115180214026198)]
[(0, 0.8774787673119822), (1, 0.16758906864659917)]
[(0, 0.9098624686818568), (1, 0.14086553628719542)]
[(0, 0.6165825350569281), (1, -0.053929075663890005)]
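
One way to read this output is to ask which of the two topics each document loads on most strongly. The sketch below copies the values printed above into a plain list (`corpus_lsi_output`, a name introduced here) and groups documents by the topic with the largest absolute weight. Absolute values are used because the sign of an LSI dimension is arbitrary. `dominant_topic` is a hypothetical helper, not a gensim function.

```python
# Values copied from the corpus_lsi output above, one entry per document.
corpus_lsi_output = [
    [(0, 0.06600783396090736), (1, -0.5200703306361845)],
    [(0, 0.1966759285914284), (1, -0.7609563167700036)],
    [(0, 0.08992639972446881), (1, -0.7241860626752503)],
    [(0, 0.07585847652178614), (1, -0.6320551586003424)],
    [(0, 0.10150299184980326), (1, -0.5737308483002951)],
    [(0, 0.7032108939378301), (1, 0.16115180214026198)],
    [(0, 0.8774787673119822), (1, 0.16758906864659917)],
    [(0, 0.9098624686818568), (1, 0.14086553628719542)],
    [(0, 0.6165825350569281), (1, -0.053929075663890005)],
]


def dominant_topic(doc):
    """Return the topic id whose weight has the largest magnitude."""
    return max(doc, key=lambda pair: abs(pair[1]))[0]


# Map each topic id to the indices of the documents it dominates.
groups = {}
for doc_index, doc in enumerate(corpus_lsi_output):
    groups.setdefault(dominant_topic(doc), []).append(doc_index)

print(groups)
```

With these numbers the first five documents fall under topic 1 (the "system", "user", "interface" topic) and the last four under topic 0 (the "trees", "graph", "minors" topic), which matches the term weights shown by `lsi.print_topics(2)`.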
