In [None]:
%matplotlib inline


# Corpora and Vector Spaces

This tutorial walks through representing text in a vector space.<br>
It also covers corpus streaming and the various formats in which a corpus can be stored on disk.<br>

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

We will work with a corpus made up of nine documents.

## From Strings to Vectors

This time, each document is represented as a string.

In [2]:
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

This is a tiny corpus of nine documents, each consisting of a single sentence.<br>

First, let's tokenize the documents, then remove common stop words and words that appear only once in the corpus.

In [3]:
from pprint import pprint  # pretty-printer
from collections import defaultdict

# lowercase each document, split it on whitespace, and filter out stop words
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


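There are many ways to process documents.

They vary so much with the application and the language that gensim does not constrain them to any particular interface.

Instead, a document is represented by the features extracted from it, not by its surface string form.

How you extract those features is up to you.

Below we look at one common, general-purpose approach called bag-of-words.

Keep in mind, though, that different application domains call for different features.

To convert documents to vectors, we will use the bag-of-words representation.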

- Question: How many times does the word `system` appear in the document?
- Answer: Once.

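Each element of a bag-of-words vector represents a question-answer pair of the form above.

Each question is represented by an integer id, and the mapping between questions and ids is called a dictionary.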

In [4]:
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

2021-09-05 17:16:57,356 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-09-05 17:16:57,357 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2021-09-05 17:16:57,358 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-09-05T17:16:57.358268', 'gensim': '4.0.1', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
2021-09-05 17:16:57,359 : INFO : Dictionary lifecycle event {'fname_or_handle': '/tmp/deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-09-05T17:16:57.359268', 'gensim': '4.0.1', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


Here we used gensim's `Dictionary` class to assign a unique integer id to every word that appears in the corpus.<br>

In [5]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
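
For intuition, the mapping above can be reproduced in plain Python. This is only a sketch, not gensim's implementation: the hypothetical helper `build_token2id` assumes that ids are handed out document by document, in alphabetical order within each document, which happens to match the output shown above.

```python
def build_token2id(tokenized_docs):
    """Assign consecutive integer ids to tokens (illustrative helper, not a gensim API)."""
    token2id = {}
    for doc in tokenized_docs:
        # Visit each document's distinct tokens in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# The preprocessed texts printed earlier in this tutorial.
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

token2id = build_token2id(texts)
print(token2id)
```

Every distinct token gets exactly one id, so the twelve ids define the twelve dimensions of the vector space used in the rest of the tutorial.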


To convert tokenized documents to vectors:

In [6]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" is not in the dictionary and is ignored

[(0, 1), (1, 1)]


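The `doc2bow()` function simply counts the number of occurrences of each distinct word, converts the word to its integer id, and returns the result as a sparse vector. The sparse vector `[(0, 1), (1, 1)]` therefore reads: in the document "Human computer interaction", the words computer (id 0) and human (id 1) each appear once, and the other ten dictionary words appear (implicitly) zero times.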
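
The counting step can be sketched without gensim. The function `doc2bow_sketch` below is a hypothetical stand-in, not the library's code; it assumes the token-to-id mapping printed by `dictionary.token2id` above and silently skips unknown tokens.

```python
from collections import Counter

# Token-to-id mapping copied from the `dictionary.token2id` output above.
token2id = {
    'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
    'survey': 4, 'system': 5, 'time': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}


def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


print(doc2bow_sketch("Human computer interaction".lower().split(), token2id))
# [(0, 1), (1, 1)]
print(doc2bow_sketch(['system', 'human', 'system', 'eps'], token2id))
# [(1, 1), (5, 2), (8, 1)]
```

Only ids with a non-zero count are stored, which is what makes the vector sparse.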

In [7]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
print(corpus)

2021-09-05 17:23:47,418 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2021-09-05 17:23:47,419 : INFO : saving sparse matrix to /tmp/deerwester.mm
2021-09-05 17:23:47,420 : INFO : PROGRESS: saving document #0
2021-09-05 17:23:47,420 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2021-09-05 17:23:47,421 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index


[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
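
The log line above reports a 9x12 matrix with 28 non-zero entries out of 108. As a rough check, the sketch below expands the printed sparse vectors into dense rows and recomputes that density; `bow_corpus`, `to_dense`, and `num_terms` are illustrative names, not gensim objects.

```python
# Bag-of-words corpus copied from the printed output above.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_terms = 12  # size of the dictionary built earlier


def to_dense(bow, num_terms):
    """Expand sparse (token_id, count) pairs into a fixed-length list."""
    dense = [0] * num_terms
    for token_id, count in bow:
        dense[token_id] = count
    return dense


dense_rows = [to_dense(bow, num_terms) for bow in bow_corpus]
non_zero = sum(len(bow) for bow in bow_corpus)
density = non_zero / (len(bow_corpus) * num_terms)

print(dense_rows[3])
print(f"{non_zero} non-zero entries, density={density:.3%}")
```

Roughly three quarters of the dense matrix is zeros even for this toy corpus, which is why the sparse representation is preferred.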


## Corpus Streaming -- One Document at a Time

In [1]:
from smart_open import open  # for transparently opening remote files


class MyCorpus:
    def __iter__(self):
        for line in open('https://radimrehurek.com/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

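Gensim accepts any corpus that is an iterable object yielding one document at a time. A plain list works, but the corpus does not have to be a list, a NumPy array, or a pandas dataframe.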

In [None]:
# This flexibility allows you to create your own corpus classes that stream the
# documents directly from disk, network, database, dataframes... The models
# in Gensim are implemented such that they don't require all vectors to reside
# in RAM at once. You can even create the documents on the fly!

Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that
each document occupies one line in a single file is not important; you can mold
the `__iter__` function to fit your input format, whatever it is.
Walking directories, parsing XML, accessing the network...
Just parse your input to retrieve a clean list of tokens in each document,
then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.
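
As one possible shape for such a class, the sketch below streams a local text file. `LineCorpus` and the temporary sample file are hypothetical, and for brevity the class yields token lists; in practice you would pass each list through `dictionary.doc2bow` before yielding it.

```python
import os
import tempfile


class LineCorpus:
    """Stream one tokenized document per line from a text file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every iteration, so the corpus can be
        # traversed multiple times, unlike a one-shot generator.
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                yield line.lower().split()


# Write a tiny sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Human machine interface\nGraph minors A survey\n")
    sample_path = tmp.name

line_corpus = LineCorpus(sample_path)
first_pass = list(line_corpus)
second_pass = list(line_corpus)
os.remove(sample_path)

print(first_pass)
```

Defining `__iter__` on a class, rather than handing around a bare generator, means the corpus can be traversed more than once, which any algorithm that makes several passes over the data needs.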



In [None]:
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

Corpus is now an object. We didn't define any way to print it, so `print` just outputs the address
of the object in memory. Not very useful. To see the constituent vectors, let's
iterate over the corpus and print each document vector (one at a time):



In [None]:
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

Although the output is the same as for the plain Python list, the corpus is now much
more memory friendly, because at most one vector resides in RAM at a time. Your
corpus can now be as large as you want.

Similarly, to construct the dictionary without loading all texts into memory:



In [None]:
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
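
Note that `dictionary.dfs` holds document frequencies, that is, the number of documents containing each token. The filter above therefore drops tokens found in only one document, even if they occur several times within it. The plain-Python sketch below illustrates the distinction in a single streaming pass; the function name and the two sample sentences are made up for illustration.

```python
from collections import Counter


def stream_frequencies(token_stream):
    """Return (total occurrences, document frequencies) in a single pass."""
    total_counts = Counter()
    doc_counts = Counter()
    for tokens in token_stream:
        total_counts.update(tokens)     # every occurrence counts
        doc_counts.update(set(tokens))  # at most once per document
    return total_counts, doc_counts


# A generator, so only one document is held in memory at a time.
sample_docs = (
    line.lower().split()
    for line in [
        "system and human system engineering",
        "human interface testing",
    ]
)
total_counts, doc_counts = stream_frequencies(sample_docs)

print(total_counts['system'], doc_counts['system'])  # 2 1
print(sorted(t for t, df in doc_counts.items() if df > 1))  # ['human']
```

Here `system` occurs twice but in a single document, so a document-frequency filter would remove it while a total-count filter would keep it.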

And that is all there is to it! At least as far as bag-of-words representation is concerned.
Of course, what we do with such a corpus is another question; it is not at all clear
how counting the frequency of distinct words could be useful. As it turns out, it isn't, and
we will need to apply a transformation on this simple representation first, before
we can use it to compute any meaningful document vs. document similarities.
Transformations are covered in the next tutorial
(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),
but before that, let's briefly turn our attention to *corpus persistency*.


## Corpus Formats

There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.
`Gensim` implements them via the *streaming corpus interface* mentioned earlier:
documents are read from (resp. stored to) disk in a lazy fashion, one document at
a time, without the whole corpus being read into main memory at once.

One of the more notable file formats is the `Matrix Market format <http://math.nist.gov/MatrixMarket/formats.html>`_.
To save a corpus in the Matrix Market format, first create a toy corpus of two documents as a plain Python list, then serialize it:



In [None]:
corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it

corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
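
For orientation, a Matrix Market coordinate file consists of a header line, a size line (rows, columns, non-zero count), and one 1-based `row column value` triple per non-zero entry. The sketch below renders the toy corpus in that layout. `to_matrix_market` is a hypothetical helper, and the file gensim actually writes may differ in details such as number formatting.

```python
def to_matrix_market(bow_corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate text."""
    entries = []
    for doc_no, bow in enumerate(bow_corpus, start=1):
        for token_id, value in bow:
            # Matrix Market indices are 1-based; documents are rows here.
            entries.append(f"{doc_no} {token_id + 1} {value}")
    header = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(bow_corpus)} {num_terms} {len(entries)}"
    return "\n".join([header, size_line] + entries) + "\n"


toy_corpus = [[(1, 0.5)], []]  # same toy corpus as above
mm_text = to_matrix_market(toy_corpus, num_terms=2)
print(mm_text)
```

The empty second document contributes no triples; it is only visible through the row count on the size line.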

Other formats include `Joachims' SVMlight format <http://svmlight.joachims.org/>`_,
`Blei's LDA-C format <http://www.cs.princeton.edu/~blei/lda-c/>`_ and
`GibbsLDA++ format <http://gibbslda.sourceforge.net/>`_.



In [None]:
corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

Conversely, to load a corpus iterator from a Matrix Market file:



In [None]:
corpus = corpora.MmCorpus('/tmp/corpus.mm')

Corpus objects are streams, so typically you won't be able to print them directly:



In [None]:
print(corpus)

Instead, to view the contents of a corpus:



In [None]:
# one way of printing a corpus: load it entirely into memory
print(list(corpus))  # calling list() will convert any sequence to a plain Python list

or



In [None]:
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)

The second way is obviously more memory-friendly, but for testing and development
purposes, nothing beats the simplicity of calling ``list(corpus)``.

To save the same Matrix Market document stream in Blei's LDA-C format:



In [None]:
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:
just load a document stream using one format and immediately save it in another format.
Adding new formats is dead easy; check out the `code for the SVMlight corpus
<https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py>`_ for an example.

## Compatibility with NumPy and SciPy

Gensim also contains `efficient utility functions <http://radimrehurek.com/gensim/matutils.html>`_
to help convert from/to NumPy matrices:



In [None]:
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
# numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)
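
One detail worth knowing: by default `Dense2Corpus` treats each column of the matrix as a document (its `documents_columns` parameter), so the 5x2 matrix above becomes a corpus of two documents with five features each. The pure-Python sketch below mimics that behaviour on a small hand-written matrix; `dense_to_bow` is an illustrative helper, not gensim's implementation.

```python
def dense_to_bow(matrix, documents_columns=True):
    """Convert a dense list-of-lists matrix into sparse (feature_id, value) documents."""
    if documents_columns:
        # Transpose so that each original column becomes one document.
        matrix = [list(column) for column in zip(*matrix)]
    return [
        [(feature_id, value) for feature_id, value in enumerate(row) if value != 0]
        for row in matrix
    ]


dense = [
    [1, 0],
    [0, 2],
    [3, 0],
    [0, 0],
    [0, 4],
]  # 5 rows x 2 columns, the same shape as the random example above

bow_docs = dense_to_bow(dense)
print(len(bow_docs))  # 2
print(bow_docs)       # [[(0, 1), (2, 3)], [(1, 2), (4, 4)]]
```

If your matrix stores one document per row instead, transpose it first or pass `documents_columns=False`.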

and from/to `scipy.sparse` matrices:



In [None]:
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5, 2)  # random sparse matrix as example
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)

## What Next

Continue with the next tutorial, Topics and Transformations (`sphx_glr_auto_examples_core_run_topics_and_transformations.py`).

## References

For a complete reference (Want to prune the dictionary to a smaller size?
Optimize converting between corpora and NumPy/SciPy arrays?), see the API reference (`apiref`).

.. [1] This is the same corpus as used in
       `Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.



In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('run_corpora_and_vector_spaces.png')
imgplot = plt.imshow(img)
_ = plt.axis('off')