## Gensim Core Concepts

In [7]:
import pprint
from collections import defaultdict
from gensim import corpora, models, similarities

The core concepts of gensim are:

Document: some text.

Corpus: a collection of documents.

Vector: a mathematically convenient representation of a document.

Model: an algorithm for transforming vectors from one representation to another.

### Document

In Gensim, a document is an object of the text sequence type (commonly known as `str` in Python 3). A document could be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book.

In [2]:
document = "Human machine interface for lab abc computer applications"

### Corpus

A corpus is a collection of Document objects. Corpora serve two roles in Gensim:

1. Input for training a Model. During training, the model uses this training corpus to look for common themes and topics, initializing its internal parameters. Gensim focuses on unsupervised models, so no human intervention, such as costly annotation or tagging documents by hand, is required.

2. Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for similarity queries, queried by semantic similarity, clustered, and so on.

In [3]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [11]:
# Create a set of frequent words
stopwords = set('for a of the and to in'.split())

# Lowercase each document, split it by white space and filter out stopwords
texts = [
    [word for word in document.lower().split() if word not in stopwords]
    for document in text_corpus
]
pprint.pprint(texts)

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]


In [23]:
# Count word frequencies
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

pprint.pprint(frequency)
# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

defaultdict(<class 'int'>,
            {'abc': 1,
             'applications': 1,
             'binary': 1,
             'computer': 2,
             'engineering': 1,
             'eps': 2,
             'error': 1,
             'generation': 1,
             'graph': 3,
             'human': 2,
             'interface': 2,
             'intersection': 1,
             'iv': 1,
             'lab': 1,
             'machine': 1,
             'management': 1,
             'measurement': 1,
             'minors': 2,
             'opinion': 1,
             'ordering': 1,
             'paths': 1,
             'perceived': 1,
             'quasi': 1,
             'random': 1,
             'relation': 1,
             'response': 2,
             'survey': 2,
             'system': 4,
             'testing': 1,
             'time': 2,
             'trees': 3,
             'unordered': 1,
             'user': 3,
             'well': 1,
             'widths': 1})
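[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]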


In [13]:
# Map each word in corpus to a unique integer ID
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
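As a rough mental model, the token-to-ID mapping can be sketched in plain Python. This is not gensim's implementation: `build_token2id` is a hypothetical helper, and numbering unseen tokens in sorted order within each document is an assumption, chosen because it yields the same 12 IDs that gensim reports for this corpus.

```python
# Pure-Python sketch of a token -> integer ID mapping (not gensim's code).
processed_corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def build_token2id(documents):
    """Give each token the next free integer ID the first time it is seen."""
    token2id = {}
    for tokens in documents:
        # Assumption: unseen tokens are numbered in sorted order per document.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


token2id = build_token2id(processed_corpus)
print(len(token2id), "unique tokens")
```

Only tokens that survive preprocessing receive an ID, which is why the vocabulary is much smaller than the number of distinct words in the raw text.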


### Vector

In [15]:
# Token : integer ID mapping
pprint.pprint(dictionary.token2id)

{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}


We can create a bag-of-words representation of a document with the dictionary's `doc2bow` method, which returns a sparse representation of the word counts:

In [16]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1)]


The first entry in each tuple is the ID of the token in the dictionary; the second is the count of that token in the document. The word "interaction" does not appear in the dictionary, so it is left out of the vector.
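The same sparse counting can be sketched without gensim. The helper `bag_of_words` below is hypothetical and only illustrates the idea: count the tokens that have an ID and skip the rest. The small `token2id` dictionary is a subset of the mapping printed above, copied so that the snippet runs on its own.

```python
from collections import Counter

# Subset of the token -> ID mapping printed above (copied for self-containment).
token2id = {"computer": 0, "human": 1, "interface": 2}


def bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens without an ID."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


new_vec = bag_of_words("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```

Because only non-zero counts are stored, the vector stays short even when the dictionary holds many thousands of tokens.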

We can convert the entire original corpus to a list of vectors:

In [17]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


### Model

We use "model" as an abstract term for a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation during training, when it reads the training corpus.

One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

In [18]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]
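The two weights printed above can be reproduced by hand. The sketch below assumes the weighting is the raw count multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the token, followed by scaling the vector to unit length. That assumption matches the output shown here; `document_frequencies` and `tfidf_vector` are hypothetical helpers, not gensim functions.

```python
import math

# Bag-of-words corpus from above: (token_id, count) pairs per document.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def document_frequencies(corpus):
    """Count in how many documents each token ID occurs."""
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    return df


def tfidf_vector(bow, df, num_docs):
    """Weight counts by log2(N / df), then scale to unit Euclidean length."""
    weights = [
        (tid, count * math.log2(num_docs / df[tid]))
        for tid, count in bow
        if tid in df
    ]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []


df = document_frequencies(bow_corpus)
# "system" has ID 5 (in 3 documents), "minors" has ID 11 (in 2 documents).
result = tfidf_vector([(5, 1), (11, 1)], df, len(bow_corpus))
print(result)
```

"minors" receives the larger weight because it occurs in fewer documents than "system", which is the "relative rarity" the tf-idf model rewards.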


Transform the whole corpus with the tf-idf model and index it in preparation for similarity queries:

In [19]:
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

In [20]:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


In [21]:
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
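The scores above are cosine similarities between the tf-idf vector of the query and that of each document. As a cross-check, the ranking can be recomputed in plain Python. The tf-idf weighting used here (count times log2(N / df), scaled to unit length) is an assumption that agrees with the printed scores, and `unit_tfidf` and `cosine` are hypothetical helpers rather than gensim functions.

```python
import math

# Bag-of-words corpus from above: (token_id, count) pairs per document.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_docs = len(bow_corpus)

# Number of documents in which each token ID occurs.
df = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        df[token_id] = df.get(token_id, 0) + 1


def unit_tfidf(bow):
    """Sparse tf-idf vector as {token_id: weight}, scaled to unit length."""
    raw = {tid: count * math.log2(num_docs / df[tid]) for tid, count in bow}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tid: w / norm for tid, w in raw.items()}


def cosine(u, v):
    """For unit-length sparse vectors, the dot product is the cosine."""
    return sum(w * v.get(tid, 0.0) for tid, w in u.items())


# "system engineering": only "system" (ID 5) is in the dictionary.
query = unit_tfidf([(5, 1)])
sims = [cosine(query, unit_tfidf(doc)) for doc in bow_corpus]
ranking = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
print(ranking[:3])
```

Documents that do not contain "system" share no token with the query, so their score is exactly zero.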


Summary: document 3 is the closest match to the query, with a similarity score of 0.718 (about 72%). Document 2 follows with about 42% and document 1 with about 32%. The remaining documents score 0.