In [5]:
documents = ["(1) The Accepted Programme is the programme identified in the Contract Data or is the latest programme accepted by the Project Manager. The latest programme accepted by the Project Manager supersedes previous Accepted Programmes.",
             "(2) Completion is when the Contractor has done all the work which the Works Information states he is to do by the Completion Date and corrected notified Defects which would have prevented the Employer from using the works and Others from doing their work. If the work which the Contractor is to do by the Completion Date is not stated in the Works Information, Completion is when the Contractor has done all the work necessary for the Employer to use the works and for Others to do their work.",
             "(3) The Completion Date is the completion date unless later changed in accordance with this contract."]

In [2]:
# from Tutorial 1: Corpora and Vector Spaces
from gensim import corpora, models, similarities

In [None]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)

In [None]:
# create the dictionary, which maps each token to an integer id
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)

In [8]:
# convert a new tokenized document to a bag-of-words vector
new_doc = "The Contract Date is the date when this contract came into existence."
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # only words already in the dictionary are given a dimension

[(2, 1), (9, 2), (17, 1)]
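
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what `doc2bow` does conceptually: count the tokens that are in the vocabulary and ignore the rest. The helper names `build_token2id` and `doc2bow_sketch` and the toy vocabulary are made up for illustration; gensim's own `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_token2id(texts):
    """Give each distinct token an integer id, in order of first appearance (an assumption)."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# toy vocabulary, not the notebook's real dictionary
toy_texts = [["completion", "date", "completion"], ["date", "when"]]
toy_token2id = build_token2id(toy_texts)
toy_bow = doc2bow_sketch("the contract date is the date when".split(), toy_token2id)
print(toy_bow)
```

Words such as "contract" that never made it into the vocabulary simply vanish from the vector, which is why the new document above ends up with only three dimensions.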


In [9]:
corpus = [dictionary.doc2bow(text) for text in texts]
for c in corpus:
    print(c)

[(0, 4), (1, 2), (2, 2), (3, 2), (4, 4), (5, 2)]
[(1, 2), (2, 5), (6, 2), (7, 4), (8, 3), (9, 2), (10, 3), (11, 2), (12, 2), (13, 2), (14, 2), (15, 2), (16, 2), (17, 2), (18, 3), (19, 3), (20, 2), (21, 4)]
[(2, 1), (7, 2), (9, 2)]


From the Quick Start tutorial.
Now that we have vectorized our corpus, we can begin to transform it using models. We use "model" as an abstract term for a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.
One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.
Let's initialize the tf-idf model, train it on our corpus, and transform the string "Contract date":

In [11]:
from gensim import models
# train the model
tfidf = models.TfidfModel(corpus)
# transform the "Contract date" string
tfidf[dictionary.doc2bow("Contract date".lower().split())]


[(9, 1.0)]

In [12]:
# now moving on to the Topics and Transformations tutorial
# apply tf-idf to the whole training corpus
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.6281916145233933), (1, 0.11592341698824639), (3, 0.31409580726169667), (4, 0.6281916145233933), (5, 0.31409580726169667)]
[(1, 0.07726398907993715), (6, 0.2093476508269452), (7, 0.1545279781598743), (8, 0.3140214762404178), (9, 0.07726398907993715), (10, 0.3140214762404178), (11, 0.2093476508269452), (12, 0.2093476508269452), (13, 0.2093476508269452), (14, 0.2093476508269452), (15, 0.2093476508269452), (16, 0.2093476508269452), (17, 0.2093476508269452), (18, 0.3140214762404178), (19, 0.3140214762404178), (20, 0.2093476508269452), (21, 0.4186953016538904)]
[(7, 0.7071067811865476), (9, 0.7071067811865476)]
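
The weights printed above can be reproduced by hand. The sketch below assumes the weighting is raw term count times log2(N / df), followed by scaling each document to unit length; that assumption reproduces the numbers in this notebook, but it is not taken from gensim's source. The function name `tfidf_sketch` is made up for illustration.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each count by log2(N / df), then scale each document to unit length."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for term_id, _count in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    idf = {t: math.log2(n_docs / df) for t, df in doc_freq.items()}

    weighted_corpus = []
    for doc in bow_corpus:
        # terms that occur in every document have idf 0 and drop out
        weighted = [(t, count * idf[t]) for t, count in doc if idf[t] > 0.0]
        norm = math.sqrt(sum(w * w for _t, w in weighted))
        weighted_corpus.append([(t, w / norm) for t, w in weighted] if norm else [])
    return weighted_corpus

# the bag-of-words corpus printed earlier in this notebook
bow_corpus = [
    [(0, 4), (1, 2), (2, 2), (3, 2), (4, 4), (5, 2)],
    [(1, 2), (2, 5), (6, 2), (7, 4), (8, 3), (9, 2), (10, 3), (11, 2), (12, 2),
     (13, 2), (14, 2), (15, 2), (16, 2), (17, 2), (18, 3), (19, 3), (20, 2), (21, 4)],
    [(2, 1), (7, 2), (9, 2)],
]
hand_tfidf = tfidf_sketch(bow_corpus)
for doc in hand_tfidf:
    print(doc)
```

Term 2 appears in all three documents, so its idf is zero and it disappears from every row, just as in the output above.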


In [None]:
# OPTIONAL
# this model can now be applied to any other corpus, not just the training corpus or individual documents
# I have not loaded a second corpus (corpus2nd is not defined here), but this is how you would do it. Note that you pass in a corpus (processed as above), not raw docs.
corpus2nd_tfidf = tfidf[corpus2nd]
for doc in corpus2nd_tfidf:
    print(doc)

In [15]:
# now apply LSI to the first corpus, working on top of its tf-idf representation
# here we create a two-dimensional LSI space, as in the Deerwester et al. (1990) example
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi


In [16]:
#inspect the topics
lsi.print_topics(2)


[(0,
  '-0.564*"completion" + -0.513*"date" + -0.274*"works" + -0.206*"work" + -0.206*"which" + -0.206*"do" + -0.206*"contractor" + -0.137*"employer" + -0.137*"when" + -0.137*"their"'),
 (1,
  '-0.627*"accepted" + -0.627*"programme" + -0.314*"project" + -0.314*"latest" + -0.116*"by" + 0.039*"date" + 0.039*"completion" + -0.000*"works" + -0.000*"do" + -0.000*"contractor"')]

In [17]:
# a larger absolute value means the document is more strongly associated with that topic (the sign itself is arbitrary)
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

[(0, -0.04162997881084487), (1, -0.9985101982664886)]
[(0, -0.7629371590041821)]
[(0, -0.7618005339021363), (1, 0.054565409902413764)]


Moving on to the similarity search tutorial.
Now suppose a user typed in the query “completion date”. We would like to sort our corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we concentrate on a single aspect of possible similarities: the apparent semantic relatedness of the texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [18]:
doc = "completion date"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, -1.077348646867466), (1, 0.07716714272044073)]


In addition, we will be considering cosine similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity measures may be more appropriate.
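
As a rough illustration of the measure, the sketch below computes cosine similarity by hand between the 2-D query vector printed above and the three document vectors printed for `corpus_lsi`. The function name `cosine` is made up for illustration, and the missing second coordinate of the middle document is assumed to be 0. Because these document vectors come from the tf-idf-weighted corpus, the scores need not match an index built from a differently weighted corpus.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# query vector as printed for "completion date"
query_2d = (-1.077348646867466, 0.07716714272044073)

# document vectors as printed for corpus_lsi (missing coordinate assumed to be 0)
docs_2d = [
    (-0.04162997881084487, -0.9985101982664886),
    (-0.7629371590041821, 0.0),
    (-0.7618005339021363, 0.054565409902413764),
]

cosine_scores = [cosine(query_2d, d) for d in docs_2d]
for doc_number, score in enumerate(cosine_scores):
    print(doc_number, round(score, 4))
```

Only the direction of a vector matters to the cosine, not its length, so a short query can score close to 1 against a much longer document.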


To prepare for similarity queries, we need to enter all documents that we want to compare against subsequent queries. In our case, they are the same three documents used for training LSI, converted to 2-D LSI space. But that’s only incidental; we might also be indexing a different corpus altogether.

In [19]:
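# transform the corpus to LSI space and index it
# note: lsi was trained on corpus_tfidf, so lsi[corpus_tfidf] with a tf-idf-weighted query would match the training space;
# the raw bag-of-words corpus is used here, consistent with how vec_lsi was built above
index = similarities.MatrixSimilarity(lsi[corpus])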

In [20]:
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, -0.020164568), (1, 0.9974446), (2, 1.0)]


Cosine measure returns similarities in the range <-1, 1> (the greater, the more similar), so the first document has a score of about -0.02, the second about 0.997, and the third 1.0.

With some standard Python magic we sort these similarities into descending order and obtain the final answer to the query “completion date”:

In [21]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples

[(2, 1.0), (1, 0.9974446), (0, -0.020164568)]


In [22]:
print (documents[2])

(3) The Completion Date is the completion date unless later changed in accordance with this contract.


In [23]:
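# now trying the same thing with tf-idf only
# caution: this index lives in tf-idf space but vec_lsi is an LSI vector, so its topic ids 0 and 1 are read as term ids
# and the scores below are not meaningful; a query in the matching space would be tfidf[vec_bow]
index = similarities.MatrixSimilarity(corpus_tfidf)
sims = index[vec_lsi]
print(list(enumerate(sims)))
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print("\n")
print(sims)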


[(0, -0.6183043), (1, 0.005520038), (2, 0.0)]


[(1, 0.005520038), (2, 0.0), (0, -0.6183043)]
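
A query and an index have to live in the same vector space. The index in the cell above holds tf-idf vectors, whereas `vec_lsi` holds LSI topic weights, so a like-for-like comparison needs a tf-idf query instead. The sketch below does that comparison by hand in pure Python. The tf-idf rows are copied from the `corpus_tfidf` output earlier in this notebook; the query vector is an assumption derived by hand ("completion" and "date" are terms 7 and 9, each occurs once in the query and in two of the three documents, so both get the same weight). Since every vector is unit length, the cosine reduces to a sparse dot product. The names `sparse_dot`, `query_tfidf`, and `tfidf_rows` are made up for illustration.

```python
import math

def sparse_dot(u, v):
    """Dot product of two sparse vectors given as (id, weight) pairs."""
    v_map = dict(v)
    return sum(w * v_map.get(t, 0.0) for t, w in u)

# rows copied from the corpus_tfidf output (each row is already unit length)
tfidf_rows = [
    [(0, 0.6281916145233933), (1, 0.11592341698824639), (3, 0.31409580726169667),
     (4, 0.6281916145233933), (5, 0.31409580726169667)],
    [(1, 0.07726398907993715), (6, 0.2093476508269452), (7, 0.1545279781598743),
     (8, 0.3140214762404178), (9, 0.07726398907993715), (10, 0.3140214762404178),
     (11, 0.2093476508269452), (12, 0.2093476508269452), (13, 0.2093476508269452),
     (14, 0.2093476508269452), (15, 0.2093476508269452), (16, 0.2093476508269452),
     (17, 0.2093476508269452), (18, 0.3140214762404178), (19, 0.3140214762404178),
     (20, 0.2093476508269452), (21, 0.4186953016538904)],
    [(7, 0.7071067811865476), (9, 0.7071067811865476)],
]

# assumed tf-idf vector for the query "completion date" (equal weights, unit length)
query_tfidf = [(7, 1 / math.sqrt(2)), (9, 1 / math.sqrt(2))]

tfidf_sims = [sparse_dot(query_tfidf, row) for row in tfidf_rows]
tfidf_ranking = sorted(enumerate(tfidf_sims), key=lambda item: -item[1])
print(tfidf_ranking)
```

In this space the first document shares no terms with the query, so its similarity is exactly zero rather than a negative number.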


In [24]:
print (documents[0:2])

['(1) The Accepted Programme is the programme identified in the Contract Data or is the latest programme accepted by the Project Manager. The latest programme accepted by the Project Manager supersedes previous Accepted Programmes.', '(2) Completion is when the Contractor has done all the work which the Works Information states he is to do by the Completion Date and corrected notified Defects which would have prevented the Employer from using the works and Others from doing their work. If the work which the Contractor is to do by the Completion Date is not stated in the Works Information, Completion is when the Contractor has done all the work necessary for the Employer to use the works and for Others to do their work.']
