
Core Concepts
=============

This tutorial introduces Documents, Corpora, Vectors and Models: the basic concepts and terms needed to understand and use gensim.

https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html

The core concepts of ``gensim`` are:

1. `core_concepts_document`: some text.
2. `core_concepts_corpus`: a collection of documents.
3. `core_concepts_vector`: a mathematically convenient representation of a document.
4. `core_concepts_model`: an algorithm for transforming vectors from one representation to another.

Let's examine each of these in slightly more detail.


Document
--------

In Gensim, a *document* is an object of the `text sequence type <https://docs.python.org/3.7/library/stdtypes.html#text-sequence-type-str>`_ (commonly known as ``str`` in Python 3).
A document could be anything from a short 140-character tweet or a single
paragraph (e.g., a journal article abstract) to a news article or a book.


Corpus
------

A *corpus* is a collection of `core_concepts_document` objects.
Corpora serve two roles in Gensim:

1. Input for training a `core_concepts_model`.
   During training, the models use this *training corpus* to look for common
   themes and topics, initializing their internal model parameters.

   Gensim focuses on *unsupervised* models so that no human intervention,
   such as costly annotations or tagging documents by hand, is required.

2. Documents to organize.
   After training, a topic model can be used to extract topics from new
   documents (documents not seen in the training corpus).

   Such corpora can be indexed for
   `sphx_glr_auto_examples_core_run_similarity_queries.py`,
   queried by semantic similarity, clustered etc.

The example corpus in this notebook is read from an XML requirements specification.
Each document is the text of one XML element, stored as a string.

.. Important::
  The example in this notebook loads the entire corpus into memory.
  In practice, corpora may be very large, so loading them into memory may be impossible.
  Gensim intelligently handles such corpora by *streaming* them one document at a time.
  See `corpus_streaming_tutorial` for details.

This is a particularly small example of a corpus for illustration purposes.
Another example could be a list of all the plays written by Shakespeare, a list
of all Wikipedia articles, or all tweets by a particular person of interest.

After collecting our corpus, there are typically a number of preprocessing
steps we want to undertake. We'll keep it simple and just remove some
commonly used English words (such as 'the') and words that occur only once in
the corpus. In the process of doing so, we'll tokenize our data.
Tokenization breaks up the documents into words (in this case using space as
a delimiter).

.. Important::
  There are better ways to perform preprocessing than just lower-casing and
  splitting by space.  Effective preprocessing is beyond the scope of this
  tutorial: if you're interested, check out the
  :py:func:`gensim.utils.simple_preprocess` function.
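
As a rough illustration of the steps described above, the following pure-Python
sketch lower-cases each document, splits it on whitespace, removes stop words,
and drops words that occur only once in the corpus. The three documents and the
stoplist are made up for this example; they are not part of the notebook's data.

```python
from collections import Counter

# Hypothetical three-document corpus and stoplist, for illustration only.
raw_documents = [
    "The system shall log the user actions",
    "The user shall access the system",
    "Reports are generated for the administrator",
]
stoplist = {"the", "for", "are", "shall"}

# Lower-case each document, split it on whitespace and drop stop words.
tokenized = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in raw_documents
]

# Count how often each token occurs across the whole corpus.
frequency = Counter(token for tokens in tokenized for token in tokens)

# Keep only tokens that occur more than once in the corpus.
processed = [
    [token for token in tokens if frequency[token] > 1]
    for tokens in tokenized
]
print(processed)  # [['system', 'user'], ['user', 'system'], []]
```

Note that a document can end up empty after filtering, as the third one does here.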


In [1]:
%matplotlib inline

In [2]:
from utils import parse_xml
import string
import re
import nltk
import pprint

# Specify the path to your XML file
xml_file_path = r'C:\dev\NLP-Sandbox\PURE\requirements-xml\0000 - cctns.xml'
# Define the namespace
namespace = {'ns': 'req_document.xsd'}

# Parse the XML into a DataFrame using the local parse_xml helper module
df = parse_xml.process_xml_with_namespace(xml_file_path, namespace)
df.head(10)


Unnamed: 0,tag,text,id,path
0,title,E-GOVERNANCE MISSION MODE PROJECT (MMP),,req_document/title/title
1,title,CRIME & CRIMINAL TRACKING NETWORK AND SYSTEMS ...,,req_document/title/title
2,title,FUNCTIONAL REQUIREMENTS SPECIFICATION V1.0 (DR...,,req_document/title/title
3,title,MINISTRY OF HOME AFFAIRS GOVERNMENT OF INDIA,,req_document/title/title
4,version,1.0,,req_document/version
5,title,INTRODUCTION,1.0,req_document/p/title
6,title,The Functional Requirements Specifications (FR...,,req_document/p/text_body
7,title,FUNCTIONAL OVERVIEW,2.0,req_document/p/title
8,title,CCTNS V1.0 functionality is designed to focus ...,,req_document/p/text_body
9,title,DESCRIPTION OF THE MODULES AND FUNCTIONAL REQU...,3.0,req_document/p/title


In [3]:
from utils import clean_data

# Convert the DataFrame's text column to a list of strings
text_corpus = df['text'].tolist()

# Clean each lower-cased string; clean_text is expected to return a list of tokens
texts = [clean_data.clean_text(document.lower()) for document in text_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(type(processed_corpus))
pprint.pprint(processed_corpus)

<class 'list'>
[['mode'],
 ['crime', 'network', 'systems', 'cctns'],
 ['functional', 'requirements', 'v10'],
 ['home'],
 ['10'],
 [],
 ['functional',
  'requirements',
  'provides',
  'detailed',
  'description',
  'functionalities',
  'required',
  'version',
  'cctns',
  'key',
  'guiding',
  'principle',
  'functional',
  'design',
  'cctns',
  'v10',
  'focus',
  'critical',
  'functionality',
  'provides',
  'value',
  'police',
  'personnel',
  'cutting',
  'edge',
  'turn',
  'improve',
  'investigation',
  'crime'],
 ['functional', 'overview'],
 ['cctns',
  'v10',
  'functionality',
  'designed',
  'focus',
  'value',
  'records',
  'citizens',
  'within',
  'broad',
  'crime',
  'investigation',
  'area',
  'based',
  'guiding',
  'principles',
  'different',
  'function',
  'blocks',
  'identified',
  'detailed',
  'functionality'],
 ['description', 'modules', 'functional', 'requirements'],
 ['functionality',
  'cctns',
  'application',
  'providing',
  'value',
  'police',
 

In [4]:
def corpus_list_to_dictionary(corpus):
    """
    Associate each word in the corpus with a unique integer ID.
    This dictionary defines the vocabulary of all words that our processing knows about.

    Parameters:
    - corpus (list of list of str): The tokenized documents to be indexed.

    Returns:
    - A gensim Dictionary mapping each unique token to an integer ID.
    """
    from gensim import corpora
    return corpora.Dictionary(corpus)

dictionary = corpus_list_to_dictionary(processed_corpus)
print(dictionary)

Dictionary<456 unique tokens: ['mode', 'cctns', 'crime', 'network', 'systems']...>



Vector
------

To infer the latent structure in our corpus we need a way to represent
documents that we can manipulate mathematically. One approach is to represent
each document as a vector of *features*.
For example, a single feature may be thought of as a question-answer pair:

1. How many times does the word *splonge* appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.

The question is usually represented only by its integer id (such as `1`, `2` and `3`).
The representation of this document then becomes a series of pairs like ``(1, 0.0), (2, 2.0), (3, 5.0)``.
This is known as a *dense vector*, because it contains an explicit answer to each of the above questions.

If we know all the questions in advance, we may leave them implicit
and simply represent the document as ``(0, 2, 5)``.
This sequence of answers is the **vector** for our document (in this case a 3-dimensional dense vector).
For practical purposes, only questions to which the answer is (or
can be converted to) a *single floating point number* are allowed in Gensim.

In practice, vectors often consist of many zero values.
To save memory, Gensim omits all vector elements with value 0.0.
The above example thus becomes ``(2, 2.0), (3, 5.0)``.
This is known as a *sparse vector* or *bag-of-words vector*.
The values of all missing features in this sparse representation can be unambiguously resolved to zero, ``0.0``.
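
The relationship between the dense and sparse forms can be sketched in a few
lines of plain Python. The helper names ``to_sparse`` and ``to_dense`` are
hypothetical and exist only for this illustration; they are not Gensim functions.

```python
def to_sparse(pairs):
    """Drop every (feature_id, value) pair whose value is zero."""
    return [(feature_id, value) for feature_id, value in pairs if value != 0.0]


def to_dense(sparse_pairs, feature_ids):
    """Restore the omitted features, filling them in with 0.0."""
    lookup = dict(sparse_pairs)
    return [(feature_id, lookup.get(feature_id, 0.0)) for feature_id in feature_ids]


# The three question-answer pairs from the text above.
dense = [(1, 0.0), (2, 2.0), (3, 5.0)]

sparse = to_sparse(dense)
print(sparse)  # [(2, 2.0), (3, 5.0)]

# Nothing is lost: the missing feature 1 resolves back to 0.0.
print(to_dense(sparse, [1, 2, 3]) == dense)  # True
```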

Assuming the questions are the same, we can compare the vectors of two different documents to each other.
For example, assume we are given two vectors ``(0.0, 2.0, 5.0)`` and ``(0.1, 1.9, 4.9)``.
Because the vectors are very similar to each other, we can conclude that the documents corresponding to those vectors are similar, too.
Of course, the correctness of that conclusion depends on how well we picked the questions in the first place.

Another approach to represent a document as a vector is the *bag-of-words
model*.
Under the bag-of-words model each document is represented by a vector
containing the frequency counts of each word in the dictionary.
For example, assume we have a dictionary containing the words
``['coffee', 'milk', 'sugar', 'spoon']``.
A document consisting of the string ``"coffee milk coffee"`` would then
be represented by the vector ``[2, 1, 0, 0]`` where the entries of the vector
are (in order) the occurrences of "coffee", "milk", "sugar" and "spoon" in
the document. The length of the vector is the number of entries in the
dictionary. One of the main properties of the bag-of-words model is that it
completely ignores the order of the tokens in the document that is encoded,
which is where the name bag-of-words comes from.
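
The coffee example can be reproduced with a short sketch in plain Python
(this is only an illustration of the counting idea, not how Gensim stores
vectors internally):

```python
from collections import Counter

# The dictionary and document from the example above.
vocabulary = ["coffee", "milk", "sugar", "spoon"]
document = "coffee milk coffee"

# Count each token, then read the counts off in dictionary order.
counts = Counter(document.split())
bow_vector = [counts[word] for word in vocabulary]
print(bow_vector)  # [2, 1, 0, 0]

# Token order is ignored: a reordered document yields the same vector.
reordered_counts = Counter("milk coffee coffee".split())
reordered_vector = [reordered_counts[word] for word in vocabulary]
print(reordered_vector == bow_vector)  # True
```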

Our processed corpus has 456 unique words in it, which means that each
document will be represented by a 456-dimensional vector under the
bag-of-words model. We can use the dictionary to turn tokenized documents
into these 456-dimensional vectors. We can see what these IDs correspond to:




In [5]:
pprint.pprint(dictionary.token2id)

{'10': 9,
 '2': 414,
 '20': 419,
 '58': 408,
 '924114': 371,
 '924117': 372,
 '9241171': 234,
 'ability': 73,
 'able': 163,
 'acceptable': 308,
 'access': 167,
 'accessed': 309,
 'accessibility': 235,
 'accessible': 139,
 'account': 269,
 'achieved': 159,
 'achieving': 251,
 'action': 131,
 'actions': 118,
 'activating': 293,
 'acts': 49,
 'adaptation': 384,
 'adapting': 385,
 'addition': 105,
 'additional': 279,
 'address': 174,
 'advanced': 74,
 'alerts': 99,
 'allow': 177,
 'also': 75,
 'alternative': 243,
 'andor': 244,
 'application': 45,
 'appropriate': 175,
 'architecture': 421,
 'area': 32,
 'attempted': 168,
 'attempts': 169,
 'attributes': 114,
 'audit': 142,
 'authentication': 436,
 'automatically': 143,
 'available': 160,
 'avoided': 275,
 'avoiding': 280,
 'back': 362,
 'based': 33,
 'behaviour': 386,
 'blocks': 34,
 'broad': 35,
 'browser': 140,
 'cache': 452,
 'capabilities': 357,
 'capacity': 409,
 'care': 358,
 'case': 144,
 'cases': 66,
 'cctns': 1,
 'centralized': 42

## Use the dictionary created from corpus to vectorize new documents
For example, suppose we wanted to vectorize the phrase "Human computer
interaction" (note that this phrase was not in our original corpus). We can
create the bag-of-words representation for a document using the ``doc2bow``
method of the dictionary, which returns a sparse representation of the word
counts:


The first entry in each tuple corresponds to the ID of the token in the
dictionary, the second corresponds to the count of this token.

Note that the output of the cell below contains only two entries for the
three-word phrase: a word that is not in the dictionary is left out of the
vectorization. Also note that this vector only contains
entries for words that actually appeared in the document. Because any given
document will only contain a few words out of the many words in the
dictionary, words that do not appear in the vectorization are represented as
implicitly zero as a space saving measure.
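
To make the behaviour concrete, here is a minimal pure-Python sketch of the
same idea. The function ``doc2bow_sketch`` and the three-word ``token2id``
mapping are hypothetical stand-ins for this illustration; the real method is
``doc2bow`` on the Gensim dictionary.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# Hypothetical vocabulary, for illustration only.
token2id = {"human": 0, "interface": 1, "system": 2}

vector = doc2bow_sketch("human computer interface human".split(), token2id)
print(vector)  # [(0, 2), (1, 1)]
```

The word "computer" is absent from the hypothetical vocabulary, so it
contributes nothing to the result, and "system" does not appear in the
document, so it has no entry either.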

We can convert our entire original corpus to a list of vectors:



In [6]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

print("----------------------")

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

[(54, 1), (360, 1)]
----------------------
[[(0, 1)],
 [(1, 1), (2, 1), (3, 1), (4, 1)],
 [(5, 1), (6, 1), (7, 1)],
 [(8, 1)],
 [(9, 1)],
 [],
 [(1, 2),
  (2, 1),
  (5, 2),
  (6, 1),
  (7, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 2),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1)],
 [(5, 1), (31, 1)],
 [(1, 1),
  (2, 1),
  (7, 1),
  (14, 1),
  (16, 1),
  (18, 2),
  (19, 1),
  (21, 1),
  (29, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1)],
 [(5, 1), (6, 1), (12, 1), (44, 1)],
 [(1, 1),
  (11, 1),
  (15, 1),
  (18, 1),
  (23, 1),
  (24, 2),
  (29, 1),
  (39, 1),
  (45, 1),
  (46, 2),
  (47, 1)],
 [(48, 1)],
 [(21, 1),
  (24, 4),
  (33, 1),
  (36, 2),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  

Note that while this list lives entirely in memory, in most applications you
will want a more scalable solution. Luckily, ``gensim`` allows you to use any
iterator that returns a single document vector at a time. See the
documentation for more details.
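
A minimal sketch of such a streamed corpus is shown below. It assumes one
document per line in a plain-text file; the class name ``StreamedBowCorpus``,
the tiny ``token2id`` mapping and the temporary file are all hypothetical and
serve only to keep the example self-contained.

```python
import os
import tempfile
from collections import Counter


class StreamedBowCorpus:
    """Yield one sparse bag-of-words vector per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Only one line is held in memory at any time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id
                )
                yield sorted(counts.items())


# Hypothetical vocabulary and a two-line corpus file, for illustration only.
token2id = {"system": 0, "user": 1, "report": 2}
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("System user system\nuser report\n")
    corpus_path = tmp.name

corpus = StreamedBowCorpus(corpus_path, token2id)
vectors = list(corpus)  # materialized here only to display the result
print(vectors)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
os.remove(corpus_path)
```

Because the class defines ``__iter__`` rather than being a one-shot generator,
the corpus can be iterated more than once, each pass re-reading the file.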

.. Important::
  The distinction between a document and a vector is that the former is text,
  and the latter is a mathematically convenient representation of the text.
  Sometimes, people will use the terms interchangeably: for example, given
  some arbitrary document ``D``, instead of saying "the vector that
  corresponds to document ``D``", they will just say "the vector ``D``" or
  the "document ``D``".  This achieves brevity at the cost of ambiguity.

  As long as you remember that documents exist in document space, and that
  vectors exist in vector space, the above ambiguity is acceptable.

.. Important::
  Depending on how the representation was obtained, two different documents
  may have the same vector representations.


Model
-----

### Creating a transformation


Now that we have vectorized our corpus we can begin to transform it using
*models*. We use model as an abstract term referring to a *transformation* from
one document representation to another. In ``gensim`` documents are
represented as vectors so a model can be thought of as a transformation
between two vector spaces. The model learns the details of this
transformation during training, when it reads the training
`core_concepts_corpus`.

One simple example of a model is `tf-idf
<https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_.  The tf-idf model
transforms vectors from the bag-of-words representation to a vector space
where the frequency counts are weighted according to the relative rarity of
each word in the corpus.
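
The weighting idea can be sketched by hand as follows. This sketch uses
``count * log2(N / document_frequency)`` followed by scaling to unit length,
which is one common tf-idf variant; treat its correspondence to Gensim's
default settings as an assumption rather than a guarantee. The toy corpus and
the function names are made up for this illustration.

```python
import math
from collections import Counter


def train_idf(bow_corpus):
    """Compute log2(N / document_frequency) for every token ID."""
    num_docs = len(bow_corpus)
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    return {
        token_id: math.log2(num_docs / freq)
        for token_id, freq in doc_freq.items()
    }


def tfidf_transform(bow, idf):
    """Weight counts by idf, then scale the vector to unit length."""
    weighted = [
        (token_id, count * idf[token_id])
        for token_id, count in bow
        if idf.get(token_id, 0.0) > 0.0
    ]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    return [(token_id, weight / norm) for token_id, weight in weighted] if norm else []


# Hypothetical corpus: token 0 occurs in 3 of 4 documents, token 2 in only 1.
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 2), (2, 1)],
    [(0, 1), (1, 1), (3, 1)],
    [(3, 1)],
]
idf = train_idf(toy_corpus)
weights = dict(tfidf_transform([(0, 1), (2, 1)], idf))

# The rarer token 2 receives a larger weight than the common token 0.
print(weights[2] > weights[0])  # True
```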

Here's a simple example. Let's initialize the tf-idf model, training it on
our corpus and transforming the string "system requirements":




In [7]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system requirements" string
words = "system requirements".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(6, 0.955192497981361), (117, 0.2959852898373965)]


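The ``tfidf`` model again returns a list of tuples, where the first entry is
the token ID and the second entry is the tf-idf weighting. Words that occur in
more documents of the training corpus are weighted lower: here ID 117
(``system``) receives a lower weight than ID 6 (``requirements``).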

You can save trained models to disk and later load them back, either to
continue training on new training documents or to transform new documents.

``gensim`` offers a number of different models/transformations.
For more, see `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.

Once you've created the model, you can do all sorts of cool stuff with it.
For example, to transform the whole corpus via TfIdf and index it, in
preparation for similarity queries:




In [8]:
from gensim import similarities

# num_features must match the size of the vocabulary (456 tokens here)
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

To query the similarity of a query document ``query_document`` against every document in the corpus, first convert it to a bag-of-words vector:



In [9]:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
pprint.pprint(query_bow)

[(117, 1)]


## Transforming vectors
From now on, ``tfidf`` is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights).
Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA, etc.

In [16]:
from gensim import models

# Train a tf-idf model and a 2-topic LSI model on the bag-of-words corpus
tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)




[(0, -0.032339315942719514), (1, 0.0857829675957448)]


Now suppose a user typed in the query “Human computer interaction”. We would like to sort our corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we only concentrate on a single aspect of possible similarities: the apparent semantic relatedness of their texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [None]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

In addition, we will be considering cosine similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity measures may be more appropriate.
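
For reference, cosine similarity on sparse ``(id, weight)`` vectors can be
sketched in plain Python as below. The function name is hypothetical, and the
two vectors are the ``(0.0, 2.0, 5.0)`` and ``(0.1, 1.9, 4.9)`` examples from
the Vector section written in sparse form; the similarity index classes in
Gensim do this work for you.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse (id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    # Only IDs present in both vectors contribute to the dot product.
    dot = sum(weight * b[token_id] for token_id, weight in a.items() if token_id in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


vec_1 = [(1, 2.0), (2, 5.0)]            # (0.0, 2.0, 5.0) with the zero omitted
vec_2 = [(0, 0.1), (1, 1.9), (2, 4.9)]  # (0.1, 1.9, 4.9)

similarity = cosine_similarity(vec_1, vec_2)
print(similarity > 0.99)  # True: the vectors point in almost the same direction

# Vectors with no feature in common have similarity 0.0.
print(cosine_similarity([(0, 1.0)], [(1, 1.0)]))  # 0.0
```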


### Initializing query structures
To prepare for similarity queries, we need to enter all documents which we want to compare against subsequent queries. In our case, they are the same documents used for training LSI, converted to 2-D LSA space. But that’s only incidental; we might also be indexing a different corpus altogether.



In [18]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[bow_corpus])  # transform corpus to LSI space and index it

pprint.pprint(index)

<gensim.similarities.docsim.MatrixSimilarity object at 0x00000227D3153690>


### Performing queries

To obtain similarities of our query document against the indexed documents:

In [21]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
# pprint.pprint(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, text_corpus[doc_position])
   


0.9673083 Observing principles of human perception When designing application pages, the general principles of human perception should be taken into account. The International Standards mentioned below shall be consulted for guidance. Practical guidelines for presenting information to the user are to be found in ISO 9241-12. Guidance on selecting and using different forms of interaction techniques is to be found in ISO 9241-14 to ISO 9241-17. ISO 9241-14 gives guidance about menus, ISO 9241-15 about command dialogues, ISO 9241-16 about direct manipulation and ISO 9241-17 about forms. In addition, when designing multimedia information presentations, the design principles and recommendations described in ISO 14915-1 to ISO 14915-3 should be taken into account. Appropriate content presentation also plays a key role in accessibility.
0.9638009 MINISTRY OF HOME AFFAIRS GOVERNMENT OF INDIA
0.9545109 Readability of text: Text presented on the pages should be readable taking into account the e

The thing to note here is that LSI can give a high score to a document that shares no words with the query. In the output above, "MINISTRY OF HOME AFFAIRS GOVERNMENT OF INDIA" scores about 0.96 even though it has no word in common with "Human computer interaction", so a standard boolean fulltext search would never return it. Whether such a match reflects a genuinely shared topic depends on the corpus and on the number of topics; with only two LSI topics, unrelated documents can end up close together. Still, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.