
Tutorial 1: Corpora and Vector Spaces


In [0]:
!pip install gensim



In [0]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [0]:
from gensim import corpora


2019-12-25 06:18:53,827 : INFO : 'pattern' package not found; tag filters are not available for English


In [0]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

In [0]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


In [0]:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "/tmp" will be used to save temporary dictionary and corpus.


In [0]:
dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  # store the dictionary, for future reference
print(dictionary)

2019-12-25 06:18:53,875 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-12-25 06:18:53,876 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2019-12-25 06:18:53,877 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None
2019-12-25 06:18:53,883 : INFO : saved /tmp/deerwester.dict


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


In [0]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [0]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

[(0, 1), (1, 1)]


In [0]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus)  # store to disk, for later use
for c in corpus:
    print(c)

2019-12-25 06:18:53,908 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2019-12-25 06:18:53,910 : INFO : saving sparse matrix to /tmp/deerwester.mm
2019-12-25 06:18:53,911 : INFO : PROGRESS: saving document #0
2019-12-25 06:18:53,913 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2019-12-25 06:18:53,914 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index


[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]


By now it should be clear that the vector feature with id=10 represents the number of times the word "graph" occurs in the document. The answer is “zero” for the first six documents and “one” for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the Quick Example. If you're running this notebook yourself, the word IDs may differ, but you can still check that the documents are consistent with each other by comparing their vectors.
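
The counting behind these vectors can be sketched in plain Python. This is a minimal illustration, not gensim's implementation: `to_bow` and `count_of` are hypothetical helper names, and the `token2id` mapping is copied from the dictionary printed earlier, so the ids may differ on your run.

```python
from collections import Counter

# token -> id mapping copied from the dictionary printed earlier (ids may differ per run)
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def count_of(bow, token_id):
    """Look up how often a token id occurs in a sparse vector (0 if absent)."""
    return dict(bow).get(token_id, 0)


bow = to_bow(['graph', 'minors', 'trees'], token2id)
print(bow)                                 # [(9, 1), (10, 1), (11, 1)]
print(count_of(bow, token2id['graph']))    # 1
```

Ids that are absent from a vector implicitly have a count of zero, which is why the first six documents carry no entry for id=10 at all.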

Corpus Streaming – One Document at a Time

Note that the corpus above resides fully in memory, as a plain Python list. In this simple example that doesn't matter much, but suppose the corpus held millions of documents: storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
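# smart_open.open replaces the deprecated smart_open.smart_open helper
from smart_open import open as so_open

class MyCorpus:
    """Stream the corpus from disk, yielding one bag-of-words vector per line."""
    def __iter__(self):
        with so_open('/content/drive/My Drive/NLP/datasets/mycorpus.txt', encoding='utf-8') as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())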

In [0]:
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x7f1b0c9b6908>


In [0]:
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)



[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
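
The streaming idea does not depend on Google Drive or smart_open. The sketch below uses only the standard library and makes several assumptions for illustration: a small throwaway file written to the temp folder, a hand-written `token2id` mapping instead of a gensim dictionary, and a hypothetical `LineCorpus` class.

```python
import os
import tempfile
from collections import Counter


class LineCorpus:
    """Stream bag-of-words vectors from a text file, one document per line."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # the file is reopened on every iteration, so only one line is held in memory
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                counts = Counter(self.token2id[t] for t in line.lower().split()
                                 if t in self.token2id)
                yield sorted(counts.items())


# write a two-document demo file (hypothetical file name)
path = os.path.join(tempfile.gettempdir(), 'linecorpus_demo.txt')
with open(path, 'w', encoding='utf-8') as handle:
    handle.write("Graph minors A survey\nThe intersection graph of paths in trees\n")

token2id = {'survey': 4, 'trees': 9, 'graph': 10, 'minors': 11}
stream = LineCorpus(path, token2id)
first_pass = list(stream)
second_pass = list(stream)   # iterating again works because __iter__ restarts the file
print(first_pass)            # [[(4, 1), (10, 1), (11, 1)], [(9, 1), (10, 1)]]
```

Wrapping the generator in a class with `__iter__`, rather than passing a bare generator around, lets the corpus be traversed more than once, which matters for any algorithm that makes several passes over the data.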


In [0]:
# smart_open.open replaces the deprecated smart_open.smart_open helper
from smart_open import open as so_open

# collect statistics about all tokens
with so_open('/content/drive/My Drive/NLP/datasets/mycorpus.txt', encoding='utf-8') as fin:
    dictionary = corpora.Dictionary(line.lower().split() for line in fin)

# collect the ids of stop words and of words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]

# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary)

2019-12-25 06:25:59,977 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-12-25 06:25:59,979 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)


Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
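
The filtering step above works on document frequencies, that is, the number of documents in which a token occurs. A rough plain-Python equivalent is sketched below. It assumes that mycorpus.txt holds the same nine documents used at the start of this tutorial (the streamed vectors suggest so, but the file itself is not shown), and `doc_freq` and `kept` are illustrative names.

```python
from collections import Counter

# assumed contents of mycorpus.txt: the nine documents from the start of the tutorial
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
stoplist = set('for a of the and to in'.split())

# document frequency: count each token at most once per document
doc_freq = Counter()
for document in documents:
    doc_freq.update(set(document.lower().split()))

# keep tokens that are not stop words and occur in more than one document
kept = sorted(t for t, df in doc_freq.items() if df > 1 and t not in stoplist)
print(len(kept))   # 12
print(kept)
```

Note that this criterion differs slightly from the in-memory version earlier, which dropped tokens whose total count across the corpus was one; for this small corpus both rules keep the same twelve tokens.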


Corpus Formats
There exist several file formats for serializing a Vector Space corpus (a sequence of vectors) to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (or stored to) disk lazily, one document at a time, without the whole corpus being read into main memory at once.

In [0]:
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it

corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.mm'), corpus)

2019-12-25 06:26:13,105 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2019-12-25 06:26:13,108 : INFO : saving sparse matrix to /tmp/corpus.mm
2019-12-25 06:26:13,109 : INFO : PROGRESS: saving document #0
2019-12-25 06:26:13,110 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2019-12-25 06:26:13,113 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
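
To get a feel for what ends up on disk: Matrix Market is a plain-text coordinate format. The sketch below renders the toy corpus in that layout, assuming documents are rows, terms are columns, and indices are 1-based. `to_matrix_market` is a hypothetical helper that builds the whole string in memory, so it illustrates the file layout only and is not gensim's streaming writer; the exact bytes gensim writes may differ.

```python
def to_matrix_market(corpus):
    """Render a small corpus as Matrix Market coordinate text (illustration only)."""
    entries = []
    num_docs = 0
    num_terms = 0
    for doc_no, doc in enumerate(corpus):
        num_docs = doc_no + 1
        for term_id, weight in doc:
            num_terms = max(num_terms, term_id + 1)
            # Matrix Market indices start at 1, not 0
            entries.append((doc_no + 1, term_id + 1, weight))
    lines = ['%%MatrixMarket matrix coordinate real general',
             '{} {} {}'.format(num_docs, num_terms, len(entries))]
    lines.extend('{} {} {}'.format(*entry) for entry in entries)
    return '\n'.join(lines) + '\n'


toy_corpus = [[(1, 0.5)], []]
mm_text = to_matrix_market(toy_corpus)
print(mm_text)
```

The second line of the output, `2 2 1`, carries the same figures as the log message above: 2 documents, 2 features, 1 non-zero entry.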


In [0]:
corpora.SvmLightCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.svmlight'), corpus)
corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)
corpora.LowCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.low'), corpus)


2019-12-25 06:26:16,730 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight
2019-12-25 06:26:16,732 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index
2019-12-25 06:26:16,735 : INFO : no word id mapping provided; initializing from corpus
2019-12-25 06:26:16,738 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2019-12-25 06:26:16,740 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2019-12-25 06:26:16,743 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index
2019-12-25 06:26:16,745 : INFO : no word id mapping provided; initializing from corpus
2019-12-25 06:26:16,749 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low
2019-12-25 06:26:16,751 : INFO : saving LowCorpus index to /tmp/corpus.low.index


In [0]:
corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'corpus.mm'))

2019-12-25 06:26:21,514 : INFO : loaded corpus index from /tmp/corpus.mm.index
2019-12-25 06:26:21,515 : INFO : initializing cython corpus reader from /tmp/corpus.mm
2019-12-25 06:26:21,516 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries


In [0]:
print(corpus)

MmCorpus(2 documents, 2 features, 1 non-zero entries)


In [0]:
# one way of printing a corpus: load it entirely into memory
print(list(corpus))  # calling list() will convert any iterable to a plain Python list

[[(1, 0.5)], []]




In [0]:
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)

[(1, 0.5)]
[]




In [0]:
corpora.BleiCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.lda-c'), corpus)

2019-12-25 06:26:34,923 : INFO : no word id mapping provided; initializing from corpus
2019-12-25 06:26:34,926 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2019-12-25 06:26:34,927 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2019-12-25 06:26:34,928 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index


Compatibility with NumPy and SciPy
Gensim also contains efficient utility functions for converting to and from NumPy matrices:

In [0]:
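import gensim
import numpy as np

# random 5x2 integer matrix; by default each column is a document and each row a term
numpy_matrix = np.random.randint(10, size=[5, 2])
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
# convert back; num_terms sets the number of rows, so 10 pads the 5 original terms with zero rows
numpy_matrix_dense = gensim.matutils.corpus2dense(corpus, num_terms=10)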



In [0]:
import scipy.sparse

# random sparse 5x2 matrix; columns are treated as documents by default
scipy_sparse_matrix = scipy.sparse.random(5, 2)
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
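
The orientation of these conversions is easy to get wrong: by default the matrix columns are the documents and the rows are the terms. The plain-Python sketch below mimics that convention on nested lists. `dense_to_corpus` and `corpus_to_dense` are hypothetical stand-ins written for this illustration, not the gensim functions themselves.

```python
def dense_to_corpus(matrix):
    """Treat each column of a terms-by-documents nested list as one sparse document."""
    num_docs = len(matrix[0]) if matrix else 0
    return [[(term_id, row[doc_no]) for term_id, row in enumerate(matrix)
             if row[doc_no] != 0]
            for doc_no in range(num_docs)]


def corpus_to_dense(corpus, num_terms):
    """Rebuild a num_terms x num_docs nested list; missing entries become 0."""
    dense = [[0] * len(corpus) for _ in range(num_terms)]
    for doc_no, doc in enumerate(corpus):
        for term_id, weight in doc:
            dense[term_id][doc_no] = weight
    return dense


matrix = [[1, 0],
          [0, 2],
          [3, 0]]                       # 3 terms (rows) x 2 documents (columns)
toy_corpus = dense_to_corpus(matrix)
print(toy_corpus)                       # [[(0, 1), (2, 3)], [(1, 2)]]
print(corpus_to_dense(toy_corpus, 3))   # [[1, 0], [0, 2], [3, 0]]
```

Asking for more terms than the corpus uses, for example `corpus_to_dense(toy_corpus, 4)`, simply appends all-zero rows.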

Tutorial 2: Topics and Transformations

In [0]:
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import tempfile
import os.path

TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "/tmp" will be used to save temporary dictionary and corpus.


In [0]:
from gensim import corpora, models, similarities
if os.path.isfile(os.path.join(TEMP_FOLDER, 'deerwester.dict')):
    dictionary = corpora.Dictionary.load(os.path.join(TEMP_FOLDER, 'deerwester.dict'))
    corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'deerwester.mm'))
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")

2019-12-25 06:26:55,128 : INFO : loading Dictionary object from /tmp/deerwester.dict
2019-12-25 06:26:55,132 : INFO : loaded /tmp/deerwester.dict
2019-12-25 06:26:55,134 : INFO : loaded corpus index from /tmp/deerwester.mm.index
2019-12-25 06:26:55,135 : INFO : initializing cython corpus reader from /tmp/deerwester.mm
2019-12-25 06:26:55,136 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries


Used files generated from first tutorial


In [0]:
print(dictionary[0])
print(dictionary[1])
print(dictionary[2])

computer
human
interface


In [0]:
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

2019-12-25 06:27:03,476 : INFO : collecting document frequencies
2019-12-25 06:27:03,478 : INFO : PROGRESS: processing document #0
2019-12-25 06:27:03,480 : INFO : calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)


In [0]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]


In [0]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
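
The weights above can be reproduced by hand. The sketch below assumes the weighting scheme is raw term frequency multiplied by log2(N / document frequency), followed by scaling each document to unit Euclidean length; this matches the numbers printed above, but `tfidf_transform` is a hypothetical helper written for illustration, not gensim's code.

```python
import math

# bag-of-words corpus as printed in the first tutorial
corpus_bow = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def tfidf_transform(corpus):
    """Weight each term by tf * log2(N / df), then L2-normalize every document."""
    num_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    idf = {term_id: math.log2(num_docs / df) for term_id, df in doc_freq.items()}

    transformed = []
    for doc in corpus:
        weights = [(term_id, tf * idf[term_id]) for term_id, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # a document whose weights are all zero is left empty
        transformed.append([(t, w / norm) for t, w in weights] if norm > 0 else [])
    return transformed


weighted = tfidf_transform(corpus_bow)
for doc in weighted:
    print(doc)
```

Rare terms get larger weights than common ones: in the third document, "interface" and "eps" (two documents each) outweigh "system" and "user" (three documents each).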




In [0]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2019-12-25 06:27:13,328 : INFO : using serial LSI version on this node
2019-12-25 06:27:13,330 : INFO : updating model with new documents
2019-12-25 06:27:13,333 : INFO : preparing a new chunk of documents
2019-12-25 06:27:13,336 : INFO : using 100 extra samples and 2 power iterations
2019-12-25 06:27:13,337 : INFO : 1st phase: constructing (12, 102) action matrix
2019-12-25 06:27:13,342 : INFO : orthonormalizing (12, 102) action matrix
2019-12-25 06:27:13,360 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2019-12-25 06:27:13,364 : INFO : computing the final decomposition
2019-12-25 06:27:13,365 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2019-12-25 06:27:13,370 : INFO : processed documents up to #9
2019-12-25 06:27:13,373 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"com

In [0]:
lsi.print_topics(2)

2019-12-25 06:27:17,329 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2019-12-25 06:27:17,331 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

In [0]:
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

[(0, 0.06600783396090554), (1, -0.5200703306361844)]
[(0, 0.19667592859142888), (1, -0.7609563167700034)]
[(0, 0.08992639972446784), (1, -0.7241860626752504)]
[(0, 0.07585847652178464), (1, -0.6320551586003422)]
[(0, 0.1015029918498045), (1, -0.573730848300295)]
[(0, 0.7032108939378295), (1, 0.16115180214026117)]
[(0, 0.8774787673119817), (1, 0.16758906864659823)]
[(0, 0.9098624686818568), (1, 0.14086553628719467)]
[(0, 0.6165825350569287), (1, -0.05392907566389046)]




In [0]:
lsi.save(os.path.join(TEMP_FOLDER, 'model.lsi')) # same for tfidf, lda, ...
#lsi = models.LsiModel.load(os.path.join(TEMP_FOLDER, 'model.lsi'))

2019-12-25 06:27:25,454 : INFO : saving Projection object under /tmp/model.lsi.projection, separately None
2019-12-25 06:27:25,457 : INFO : saved /tmp/model.lsi.projection
2019-12-25 06:27:25,458 : INFO : saving LsiModel object under /tmp/model.lsi, separately None
2019-12-25 06:27:25,460 : INFO : not storing attribute projection
2019-12-25 06:27:25,462 : INFO : not storing attribute dispatcher
2019-12-25 06:27:25,467 : INFO : saved /tmp/model.lsi
