In [1]:
%matplotlib inline


Topics and Transformations
===========================

This tutorial walks through transformations on a toy corpus.

In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

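In this tutorial, we learn how to transform documents from one vector representation into another.
This process serves two goals:

1. Bring out hidden structure in the corpus and discover relationships between words, then use them to describe the documents in a new, more semantic way.

2. Make the document representation more compact, which improves both
    efficiency (the new representation consumes fewer resources) and efficacy (unimportant data is ignored).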

Creating the Corpus
-------------------

In [10]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words (stop words)
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
dictionary

2021-09-06 20:53:54,590 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-09-06 20:53:54,591 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2021-09-06 20:53:54,592 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-09-06T20:53:54.592542', 'gensim': '4.0.1', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


<gensim.corpora.dictionary.Dictionary at 0x25916513190>

In [11]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

Creating a transformation
+++++++++++++++++++++++++

Transformations are standard Python objects, typically initialized from a training corpus.

In [12]:
from gensim import models

tfidf = models.TfidfModel(corpus)  # step 1 -- initialize a model

2021-09-06 20:54:21,674 : INFO : collecting document frequencies
2021-09-06 20:54:21,674 : INFO : PROGRESS: processing document #0
2021-09-06 20:54:21,675 : INFO : TfidfModel lifecycle event {'msg': 'calculated IDF weights for 9 documents and 12 features (28 matrix non-zeros)', 'datetime': '2021-09-06T20:54:21.675938', 'gensim': '4.0.1', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'initialize'}


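Different transformations may require different initialization parameters.
Tf-Idf, for example, uses each word's frequency within a document together with its frequency across the whole corpus.

Latent Semantic Analysis and Latent Dirichlet Allocation are more sophisticated, and training them consequently takes more time.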
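
As a rough illustration of how those two frequencies combine, the sketch below recomputes Tf-Idf weights in plain Python. It assumes gensim's default settings (raw term count as the local weight, log2(N / df) as the global weight, and normalization to unit Euclidean length). The helper name `tfidf_weights` is made up for this example and is not part of gensim.

```python
import math

# Bag-of-words corpus from the cell above: (token_id, count) pairs per document.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

# Document frequency: in how many documents does each token id occur?
doc_freq = {}
for doc in corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_weights(doc, doc_freq, num_docs):
    """Hypothetical helper: count * log2(N / df), then L2-normalize."""
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in raw))
    return [(tid, w / norm) for tid, w in raw] if norm else []


# Fourth document: "System and human system engineering testing of EPS"
weights = tfidf_weights(corpus[3], doc_freq, len(corpus))
print(weights)
```

For the fourth document this gives roughly 0.49, 0.72 and 0.49, in line with the values gensim prints for the same document later in this tutorial.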

<div class="alert alert-info"><h4>Note</h4><p>
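  Transformations always convert between two specific vector spaces.
  The same vector space (the same set of feature ids) must be used for training
  and for every subsequent vector transformation.
  If the input feature space differs, for example because of different string preprocessing
  or different feature ids, transformation calls will hit a feature mismatch and produce
  either garbage output or runtime exceptions.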
</p></div>
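
To make the mismatch concrete, here is a small sketch in plain Python. The `build_ids` and `to_bow` helpers are hypothetical stand-ins for a dictionary and its bag-of-words conversion, not gensim functions. The point is only that two differently built id mappings assign different ids to the same words.

```python
def build_ids(texts):
    """Hypothetical helper: assign integer ids to tokens in order of first appearance."""
    ids = {}
    for text in texts:
        for token in text:
            ids.setdefault(token, len(ids))
    return ids


def to_bow(tokens, ids):
    """Hypothetical helper: count known tokens and return sorted (id, count) pairs."""
    counts = {}
    for token in tokens:
        if token in ids:
            counts[ids[token]] = counts.get(ids[token], 0) + 1
    return sorted(counts.items())


# Two id mappings built from the same words, seen in a different order.
ids_train = build_ids([["human", "computer"], ["graph", "trees"]])
ids_other = build_ids([["graph", "trees"], ["human", "computer"]])

query = ["human", "computer"]
bow_train = to_bow(query, ids_train)
bow_other = to_bow(query, ids_other)
print(bow_train)  # ids as a model trained with ids_train expects them
print(bow_other)  # same words, but under ids_train these ids mean "graph" and "trees"
```

A model trained with `ids_train` would accept `bow_other` without complaint and silently interpret it as a different document.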


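Transforming vectors
++++++++++++++++++++

From now on, tfidf is treated as a read-only object that converts any vector from the old representation (bag-of-words integer counts) to the new representation (Tf-Idf real-valued weights):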

In [5]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])  # step 2 -- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


Or apply the transformation to the whole corpus:

In [6]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


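In this particular case, we transformed the same corpus that we used for training.
Once the model has been initialized, however, it can be applied to any vector from the same vector space,
even if that vector was not used in the training corpus at all.

For LSA this is achieved by a process called folding-in, and for LDA by topic inference.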

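<div class="alert alert-info"><h4>Note</h4>
    <p>Calling model[corpus] only creates a wrapper around the original corpus; the actual conversions are done on the fly, while iterating over the documents.<br>
    The whole corpus cannot be converted at the moment corpus_transformed = model[corpus] is called, because that would mean holding the result in main memory, which runs against gensim's goal of memory independence.<br>
    If you will iterate over the transformed corpus several times and the transformation is costly, serialize the resulting corpus to disk first and iterate over that instead.</p></div>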
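
The following sketch illustrates lazy, on-the-fly transformation in plain Python. `LazyTransformedCorpus` and `double_counts` are hypothetical names invented for this example. They only mimic the general idea of a wrapper that converts documents during iteration and are not gensim's actual classes.

```python
class LazyTransformedCorpus:
    """Hypothetical wrapper: applies `transform` to each document only when iterated."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)  # conversion happens here, one document at a time


calls = []  # records every document that actually gets transformed


def double_counts(doc):
    """Toy transformation standing in for a real model."""
    calls.append(doc)
    return [(token_id, 2 * count) for token_id, count in doc]


corpus = [[(0, 1), (1, 1)], [(1, 2)]]
wrapped = LazyTransformedCorpus(double_counts, corpus)
calls_before = len(calls)      # nothing has been transformed yet
first_pass = list(wrapped)     # iterating triggers the transformation
calls_after = len(calls)
print(calls_before, calls_after, first_pass)
```

Every additional pass over `wrapped` repeats the work, which is why saving a costly result to disk first can pay off.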

Transformations can also be chained, one on top of another:

In [7]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2021-09-06 20:36:30,238 : INFO : using serial LSI version on this node
2021-09-06 20:36:30,238 : INFO : updating model with new documents
2021-09-06 20:36:30,239 : INFO : preparing a new chunk of documents
2021-09-06 20:36:30,240 : INFO : using 100 extra samples and 2 power iterations
2021-09-06 20:36:30,240 : INFO : 1st phase: constructing (12, 102) action matrix
2021-09-06 20:36:30,246 : INFO : orthonormalizing (12, 102) action matrix
2021-09-06 20:36:30,265 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2021-09-06 20:36:30,275 : INFO : computing the final decomposition
2021-09-06 20:36:30,275 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2021-09-06 20:36:30,276 : INFO : processed documents up to #9
2021-09-06 20:36:30,278 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2021-09-06 20:36:30,279 : INFO : topic #

Here we transformed our Tf-Idf corpus via `Latent Semantic Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ into a latent 2-D space (2-D because we set num_topics=2).
Now you might wonder: what do these two latent dimensions stand for?
Let's inspect them with models.LsiModel.print_topics():


In [8]:
lsi_model.print_topics(2)

2021-09-06 20:36:30,298 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2021-09-06 20:36:30,299 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

In [9]:
# both the bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.06600783396090475), (1, -0.5200703306361845)] Human machine interface for lab abc computer applications
[(0, 0.19667592859142663), (1, -0.760956316770004)] A survey of user opinion of computer system response time
[(0, 0.08992639972446669), (1, -0.7241860626752508)] The EPS user interface management system
[(0, 0.07585847652178374), (1, -0.6320551586003426)] System and human system engineering testing of EPS
[(0, 0.10150299184980256), (1, -0.5737308483002953)] Relation of user perceived response time to error measurement
[(0, 0.7032108939378304), (1, 0.16115180214025948)] The generation of random binary unordered trees
[(0, 0.8774787673119824), (1, 0.16758906864659617)] The intersection graph of paths in trees
[(0, 0.9098624686818572), (1, 0.14086553628719245)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569281), (1, -0.05392907566389192)] Graph minors A survey
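
The topic #0 coordinates above can be checked by hand. The sketch below assumes that folding a document into LSI space amounts to a dot product between its Tf-Idf vector and each topic's term loadings (an unscaled projection). This is an assumption that the printed numbers are consistent with, not a statement about gensim's internals. The loadings are copied from the rounded print_topics output, so agreement is only expected to about three decimal places.

```python
# Term loadings of topic #0, copied (rounded) from the print_topics output.
# Token ids follow the dictionary: 4=survey, 9=trees, 10=graph, 11=minors.
topic0 = {9: 0.703, 10: 0.538, 11: 0.402, 4: 0.187}

# Tf-Idf vectors of the last four documents, copied from the corpus_tfidf output.
tfidf_docs = {
    "The generation of random binary unordered trees": [(9, 1.0)],
    "The intersection graph of paths in trees": [
        (9, 0.7071067811865475), (10, 0.7071067811865475)],
    "Graph minors IV Widths of trees and well quasi ordering": [
        (9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)],
    "Graph minors A survey": [
        (4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)],
}


def project(doc, loadings):
    """Dot product of a sparse document vector with one topic's term loadings."""
    return sum(weight * loadings.get(token_id, 0.0) for token_id, weight in doc)


coords = {text: project(doc, topic0) for text, doc in tfidf_docs.items()}
for text, value in coords.items():
    print(round(value, 3), text)
```

Only these four documents are used because all of their terms appear among the ten loadings that print_topics shows for topic #0.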


Models are saved and loaded with the save() and load() functions:

In [13]:
import os
import tempfile

with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_model.save(tmp.name)  # same for tfidf, lda, ...

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

2021-09-06 21:15:11,697 : INFO : Projection lifecycle event {'fname_or_handle': 'C:\\Users\\yongjae\\AppData\\Local\\Temp\\model-df0gwmn9.lsi.projection', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-09-06T21:15:11.697031', 'gensim': '4.0.1', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}
2021-09-06 21:15:11,698 : INFO : saved C:\Users\yongjae\AppData\Local\Temp\model-df0gwmn9.lsi.projection
2021-09-06 21:15:11,699 : INFO : LsiModel lifecycle event {'fname_or_handle': 'C:\\Users\\yongjae\\AppData\\Local\\Temp\\model-df0gwmn9.lsi', 'separately': 'None', 'sep_limit': 10485760, 'ignore': ['projection', 'dispatcher'], 'datetime': '2021-09-06T21:15:11.699032', 'gensim': '4.0.1', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}
2021-09-06 21:15:11

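The next question might be: exactly how similar are these documents to each other?
Is there a way to formalize similarity so that, for a given input document,
we can order a set of other documents by how similar they are? Similarity queries are covered in the next tutorial.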
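
As a small preview, cosine similarity is one common way to formalize how close two vectors are. It is used here as an assumption for illustration, and the `cosine` helper is hypothetical. The vectors are the 2-D LSI coordinates printed earlier, rounded to four decimals.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# LSI coordinates (topic 0, topic 1), rounded from the output above.
human_interface = (0.0660, -0.5201)  # "Human machine interface for lab abc computer applications"
random_trees = (0.7032, 0.1612)      # "The generation of random binary unordered trees"
graph_paths = (0.8775, 0.1676)       # "The intersection graph of paths in trees"

sim_trees_graph = cosine(random_trees, graph_paths)
sim_trees_human = cosine(random_trees, human_interface)
print(sim_trees_graph, sim_trees_human)
```

The two graph-theory documents point in almost the same direction, while the tree document and the human-computer interaction document do not.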

Available transformations
--------------------------

Gensim implements several popular Vector Space Model (VSM) algorithms: Tf-Idf, LSI, RP, LDA and HDP.

* `Term Frequency * Inverse Document Frequency, Tf-Idf <http://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_
  expects a bag-of-words (integer values) training corpus during initialization.
  During transformation, it will take a vector and return another vector of the
  same dimensionality, except that features which were rare in the training corpus
  will have their value increased.
  It therefore converts integer-valued vectors into real-valued ones, while leaving
  the number of dimensions intact. It can also optionally normalize the resulting
  vectors to (Euclidean) unit length.

  .. sourcecode:: pycon

    model = models.TfidfModel(corpus, normalize=True)

* `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
  transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into
  a latent space of a lower dimensionality. For the toy corpus above we used only
  2 latent dimensions, but on real corpora, target dimensionality of 200--500 is recommended
  as a "golden standard" [1]_.

  .. sourcecode:: pycon

    model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

  LSI training is unique in that we can continue "training" at any point, simply
  by providing more training documents. This is done by incremental updates to
  the underlying model, in a process called `online training`. Because of this feature, the
  input document stream may even be infinite -- just keep feeding LSI new documents
  as they arrive, while using the computed transformation model as read-only in the meanwhile!

  .. sourcecode:: pycon

    model.add_documents(another_tfidf_corpus)  # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
    lsi_vec = model[tfidf_vec]  # convert some new document into the LSI space, without affecting the model

    model.add_documents(more_documents)  # tfidf_corpus + another_tfidf_corpus + more_documents
    lsi_vec = model[tfidf_vec]

  See the :mod:`gensim.models.lsimodel` documentation for details on how to make
  LSI gradually "forget" old observations in infinite streams. If you want to get dirty,
  there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical
  precision of the LSI algorithm.

  `gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),
  which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm
  from Halko et al. [4]_ internally, to accelerate in-core part
  of the computations.
  See also `wiki` for further speed-ups by distributing the computation across
  a cluster of computers.

* `Random Projections, RP <http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf>`_ aim to
  reduce vector space dimensionality. This is a very efficient (both memory- and
  CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
  Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

  .. sourcecode:: pycon

    model = models.RpModel(tfidf_corpus, num_topics=500)

* `Latent Dirichlet Allocation, LDA <http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>`_
  is yet another transformation from bag-of-words counts into a topic space of lower
  dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA),
  so LDA's topics can be interpreted as probability distributions over words. These distributions are,
  just like with LSA, inferred automatically from a training corpus. Documents
  are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

  .. sourcecode:: pycon

    model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

  `gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
  modified to run in `distributed mode <distributed>` on a cluster of computers.

* `Hierarchical Dirichlet Process, HDP <http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf>`_
  is a non-parametric Bayesian method (note the missing number of requested topics):

  .. sourcecode:: pycon

    model = models.HdpModel(corpus, id2word=dictionary)

  `gensim` uses a fast, online implementation based on [3]_.
  The HDP model is a new addition to `gensim`, and still rough around its academic edges -- use with care.
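
To give a feel for the random projection idea from the list above, here is a plain-Python sketch. It is not gensim's `RpModel` implementation: it assumes a dense projection matrix with entries of +1 or -1 scaled by 1/sqrt(k), and `random_projection_matrix`, `project` and `norm` are hypothetical helpers.

```python
import math
import random


def random_projection_matrix(num_terms, k, seed=0):
    """Hypothetical helper: k x num_terms matrix of +/-1 entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.choice((-1.0, 1.0)) * scale for _ in range(num_terms)] for _ in range(k)]


def project(matrix, sparse_vec):
    """Multiply the projection matrix by a sparse (id, weight) vector."""
    return [sum(row[i] * w for i, w in sparse_vec) for row in matrix]


def norm(vec):
    """Euclidean length of a dense vector."""
    return math.sqrt(sum(x * x for x in vec))


num_terms, k = 1000, 200
matrix = random_projection_matrix(num_terms, k)

# A sparse toy document touching 50 of the 1000 term ids.
doc = [(i * 20, 1.0 + (i % 3)) for i in range(50)]
reduced = project(matrix, doc)

original_norm = norm([w for _, w in doc])
ratio = norm(reduced) / original_norm  # close to 1 when lengths are roughly preserved
print(len(reduced), round(ratio, 3))
```

The document shrinks from 1000 to 200 dimensions while its length stays roughly the same, which is the property that makes distances in the reduced space a usable approximation.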

Adding new :abbr:`VSM (Vector Space Model)` transformations (such as different weighting schemes) is rather trivial;
see the `apiref` or directly the `Python code <https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py>`_
for more info and examples.

It is worth repeating that these are all unique, **incremental** implementations,
which do not require the whole training corpus to be present in main memory all at once.
With memory taken care of, I am now improving distributed computing,
to improve CPU efficiency, too.
If you feel you could contribute by testing, providing use-cases or code, see the `Gensim Developer guide <https://github.com/RaRe-Technologies/gensim/wiki/Developer-page>`__.

What Next?
----------

Continue on to the next tutorial on similarity queries, `sphx_glr_auto_examples_core_run_similarity_queries.py`.

References
----------

.. [1] Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications.

.. [2] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation.

.. [3] Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process.

.. [4] Halko, Martinsson, Tropp. 2009. Finding structure with randomness.

.. [5] Řehůřek. 2011. Subspace tracking for Latent Semantic Analysis.

