# Topic Modeling and Similarities

* [Latent Dirichlet Allocation (LDA) with Python](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html)

Topic modeling is the process of automatically identifying the topics present in a text corpus. It derives hidden patterns among the words of the corpus in an unsupervised manner. A topic is defined as “a repeating pattern of co-occurring terms in a corpus”, so topic modeling can be described as a method for finding groups of words (i.e., topics) in a collection of documents that best represent the information in the collection.

By surfacing the hidden patterns in a corpus, topic modeling can also assist decision making.

Topic modeling differs from rule-based text mining approaches that rely on regular expressions or dictionary-based keyword searches. It is an unsupervised approach for finding and observing groups of words (called “topics”) in large collections of text.

A good topic model yields words such as “health”, “doctor”, “patient”, and “hospital” for a Healthcare topic, and “farm”, “crops”, and “wheat” for a Farming topic.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. For example, the New York Times uses topic models to improve its article recommendation engine. In recruitment, topic models are used to extract latent features from job descriptions and map them to suitable candidates. They are also used to organize large datasets of emails, customer reviews, and social media profiles.

There are many approaches for obtaining topics from text, such as Term Frequency-Inverse Document Frequency (TF-IDF) weighting and non-negative matrix factorization. Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique, and it is the one discussed here.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
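
The generative story described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the two topics, their word probabilities, and the `generate_document` helper are made up for the example, and the topic mixture is fixed by hand rather than drawn from a Dirichlet prior as in full LDA.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each maps words to probabilities (illustrative values)
topics = {
    "healthcare": {"doctor": 0.4, "patient": 0.3, "hospital": 0.3},
    "farming": {"farm": 0.4, "crops": 0.3, "wheat": 0.3},
}

def generate_document(topic_mixture, n_words):
    """Sample a document: pick a topic for each word, then a word from that topic."""
    names = list(topic_mixture)
    weights = [topic_mixture[name] for name in names]
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=weights)[0]
        vocab = list(topics[topic])
        probs = [topics[topic][word] for word in vocab]
        words.append(random.choices(vocab, weights=probs)[0])
    return words

# A document that is mostly about healthcare with a little farming
doc = generate_document({"healthcare": 0.8, "farming": 0.2}, n_words=10)
print(doc)
```

LDA works in the opposite direction: it only sees documents like `doc` and tries to recover plausible topics and mixtures.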



In [1]:
import numpy as np
import sys, re
import nltk
import gensim

utils_dir = '/Users/gshyam/utils/'
data_dir = './datasets/bbc_sports/'

sys.path.append(utils_dir)

from nlp_utils import prepare_text

In [2]:
# lets see how it works with the following sentences.

doc1 = "I have big exam tomorrow and I need to study hard to get a good grade. This exam is harder than most of the exams."
doc2 = "My wife likes to go out with me but I prefer staying at home and studying."
doc3 = "Kids are playing football in the field and they seem to have fun"
doc4 = "Sometimes I feel depressed while driving and it's hard to focus on the road."
doc5 = "I usually prefer reading at home but my wife prefers watching a TV."

# array of documents aka corpus
corpus = [doc1, doc2, doc3, doc4, doc5]

## Processing and Tokenizing the text 

In [3]:
doc_tokenized = prepare_text(doc1, TOKENIZE=True)
doc_stemmed = prepare_text(doc1, STEM=True)

print (doc1)
print (doc_tokenized)
print (doc_stemmed)


I have big exam tomorrow and I need to study hard to get a good grade. This exam is harder than most of the exams.
['big', 'exam', 'tomorrow', 'need', 'study', 'hard', 'get', 'good', 'grade', 'exam', 'harder', 'exams']
['big', 'exam', 'tomorrow', 'need', 'studi', 'hard', 'get', 'good', 'grade', 'exam', 'harder', 'exam']


In [4]:
tokenized_data = [prepare_text(doc, TOKENIZE=True) for doc in corpus]
tokenized_data

[['big',
  'exam',
  'tomorrow',
  'need',
  'study',
  'hard',
  'get',
  'good',
  'grade',
  'exam',
  'harder',
  'exams'],
 ['wife', 'likes', 'go', 'prefer', 'staying', 'home', 'studying'],
 ['kids', 'playing', 'football', 'field', 'seem', 'fun'],
 ['sometimes', 'feel', 'depressed', 'driving', 'hard', 'focus', 'road'],
 ['usually', 'prefer', 'reading', 'home', 'wife', 'prefers', 'watching', 'tv']]

In [5]:
dictionary = gensim.corpora.Dictionary(tokenized_data)

print ("First 10 items in the dictionary: key is index and value are the words")
for item in list(dictionary.items())[:10]:
    print (item)

First 10 items in the dictionary: key is index and value are the words
(0, 'big')
(1, 'exam')
(2, 'exams')
(3, 'get')
(4, 'good')
(5, 'grade')
(6, 'hard')
(7, 'harder')
(8, 'need')
(9, 'study')


## Bag of Words (BoW) method 
Bag of words is a common and very popular method for converting a text document into numerical values that can be fed into a model. Each unique word in the corpus is assigned an integer id, and each document is represented by the ids of the words it contains together with the number of times each word appears in it.
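
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. The helpers `build_vocabulary` and `to_bow` are hypothetical names for this illustration; they assign ids in first-seen order, so the ids will not necessarily match the ones gensim assigns in the cells that follow.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each unique token, in first-seen order."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens not in vocab are skipped."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

docs = [["big", "exam", "exam", "hard"], ["wife", "home", "hard"]]
vocab = build_vocabulary(docs)

print(to_bow(docs[0], vocab))                         # "exam" is counted twice
print(to_bow(["wife", "plans", "tonight"], vocab))    # unseen words are dropped
```

Words that were never seen while building the vocabulary simply disappear from the result, which is worth keeping in mind when a new document is scored later.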


In [6]:
# Transform the collection of texts to a numerical form
numerical_corpus = [dictionary.doc2bow(text) for text in tokenized_data]

In [7]:
numerical_corpus

[[(0, 1),
  (1, 2),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)],
 [(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
 [(6, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)],
 [(12, 1), (14, 1), (17, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]

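Notice in the first document that the second entry `(1, 2)` represents the word `exam`, which has index 1 and appears twice in the doc. Every other word appears once: because the tokens are not stemmed here, `hard` `(6, 1)`, `harder` `(7, 1)`, `exams` `(2, 1)` and `exam` are counted as separate words.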

## LDA Model

The LDA model discovers the different topics that the documents represent and how much of each topic is present in a document. 

Python provides many libraries for text mining. “gensim” is one of them: a clean library for handling text data that is scalable, robust and efficient.

In [8]:
model = gensim.models.LdaModel(corpus=numerical_corpus, num_topics=10, id2word=dictionary)

all_topics = model.print_topics()

for i in range(10):
    # Print the 5 highest-weighted words of each of the 10 topics
    print(f"Topic #{i} : {model.print_topic(i, 5)}")


Topic #0 : 0.029*"prefer" + 0.029*"hard" + 0.029*"kids" + 0.029*"focus" + 0.029*"wife"
Topic #1 : 0.029*"wife" + 0.029*"home" + 0.029*"prefer" + 0.029*"playing" + 0.029*"hard"
Topic #2 : 0.114*"wife" + 0.114*"home" + 0.114*"prefer" + 0.059*"reading" + 0.059*"watching"
Topic #3 : 0.135*"exam" + 0.071*"hard" + 0.071*"big" + 0.071*"study" + 0.071*"tomorrow"
Topic #4 : 0.029*"home" + 0.029*"prefer" + 0.029*"exam" + 0.029*"playing" + 0.029*"wife"
Topic #5 : 0.029*"hard" + 0.029*"home" + 0.029*"prefer" + 0.029*"fun" + 0.029*"playing"
Topic #6 : 0.029*"prefer" + 0.029*"playing" + 0.029*"hard" + 0.029*"wife" + 0.029*"home"
Topic #7 : 0.029*"prefer" + 0.029*"home" + 0.029*"playing" + 0.029*"hard" + 0.029*"wife"
Topic #8 : 0.116*"field" + 0.116*"seem" + 0.116*"kids" + 0.116*"fun" + 0.116*"football"
Topic #9 : 0.105*"feel" + 0.105*"depressed" + 0.105*"sometimes" + 0.105*"driving" + 0.105*"hard"


We trained the LDA model on only five simple sentences, so several of the ten topics above (those in which every word has the same weight, 0.029) are essentially unused. To detect the topic of a new sentence or text, we first prepare the text in the same way as the training data and then pass it to the model.

## Testing the model

Let's find out a topic for a new doc using the previously trained model. 
`My wife plans to go out tonight.`

In [9]:
doc_new = "My wife plans to go out tonight."
doc_new_prepared = prepare_text(doc_new, TOKENIZE=True)
print ( doc_new_prepared )
doc_bow = dictionary.doc2bow(doc_new_prepared)
print (doc_bow)


['wife', 'plans', 'go', 'tonight']
[(11, 1), (17, 1)]


Notice that because our dictionary is small, the bag of words for the new doc is missing two words, `plans` and `tonight`. Only `go` (index 11) and `wife` (index 17) exist in the dictionary.

In [10]:
def sort_list(A, key=0):
    # Sort a list of tuples by the element at position `key` of each item.
    # reverse=True puts the item with the largest value first.
    return sorted(A, reverse=True, key=lambda x: x[key])

A = [(2, 1), (3, 4), (4, 1), (1, 3)]
A0 = sort_list(A, 0)
A1 = sort_list(A, 1)
print(f"original list:\t\t{A} \nsorted by first:\t{A0} \nsorted by second:\t{A1}")


original list:		[(2, 1), (3, 4), (4, 1), (1, 3)] 
sorted by first:	[(4, 1), (3, 4), (2, 1), (1, 3)] 
sorted by second:	[(3, 4), (1, 3), (2, 1), (4, 1)]


In [11]:
def print_top_k(topics, all_topics, k=2):
    # Print the k topics with the highest probability and return them
    topics_sorted = sort_list(topics, key=1)
    for i, (idx, _prob) in enumerate(topics_sorted[:k]):
        print(i, all_topics[idx])
    return topics_sorted[:k]


In [12]:
topics= model.get_document_topics( doc_bow )
top_k_topics = print_top_k(topics, all_topics, k=2)

0 (2, '0.114*"wife" + 0.114*"home" + 0.114*"prefer" + 0.059*"reading" + 0.059*"watching" + 0.059*"prefers" + 0.059*"tv" + 0.059*"go" + 0.059*"likes" + 0.059*"staying"')
1 (0, '0.029*"prefer" + 0.029*"hard" + 0.029*"kids" + 0.029*"focus" + 0.029*"wife" + 0.029*"playing" + 0.029*"home" + 0.029*"road" + 0.029*"fun" + 0.029*"exam"')


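The top two predicted topics for the new sentence `My wife plans to go out tonight.` are printed above. The first one, topic 2, is the `wife`/`home`/`prefer` topic, which is a sensible match.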

## Similarity between documents


In [13]:
lda_index = gensim.similarities.MatrixSimilarity(model[numerical_corpus])

doc_new = "We are going play soccer with the kids"
doc_new_prepared = prepare_text(doc_new, TOKENIZE=True)
print ( doc_new_prepared )
doc_bow = dictionary.doc2bow(doc_new_prepared)
print (doc_bow)

similarities = lda_index[model[doc_bow]]

print(similarities)

['going', 'play', 'soccer', 'kids']
[(21, 1)]
[0.08770609 0.11107764 0.9765236  0.11107766 0.10820496]


This means the new sentence is closest to `doc3`, with a cosine similarity of about `0.977` between their topic distributions (a similarity score, not a probability). It makes sense that the new doc `We are going play soccer with the kids` is closest in meaning to `Kids are playing football in the field and they seem to have fun`, although the only dictionary word the two share is `kids`.

## Things that can be added here

* N grams vocabulary
* Word Embeddings where `play` and `playing` mean the same thing.

# Document Similarity with BBC sport data


### Load, Preprocess and Tokenize the document

In [14]:
import glob, sys

data_dir = './datasets/bbc_sports/'


In [15]:
# Read all the files and store it in a doc

def read_data_files(data_dir):
    all_data_files = glob.glob(data_dir + '*')
    raw_doc = []
    for file in all_data_files:
        # Some files may not be valid UTF-8, so skip any that cannot be read
        try:
            with open(file, 'r', encoding='utf-8') as f:
                raw_doc.append(f.read())
        except (UnicodeDecodeError, OSError):
            print(f"skipping the unreadable file:  {file}")
    return raw_doc

raw_doc = read_data_files(data_dir)
print (f"Total # of documents: {len(raw_doc)}")


skipping the unreadable file:  ./datasets/bbc_sports/199.txt
Total # of documents: 510


In [16]:
print (f"The first 150 characters (before processing) in first doc: \n\n{raw_doc[0][:150]}" )

The first 150 characters (before processing) in first doc: 

Fuming Robinson blasts officials

England coach Andy Robinson insisted he was "livid" after his side were denied two tries in Sunday's 19-13 Six Natio


In [17]:


tokenized_doc = [prepare_text(doc, TOKENIZE=True, STEM=True) for doc in raw_doc]

print (f"The first 20 tokens (words) (after processing) in first doc: \n\n{tokenized_doc[0][:20]}" )


The first 20 tokens (words) (after processing) in first doc: 

['fume', 'robinson', 'blast', 'officialsengland', 'coach', 'andi', 'robinson', 'insist', 'livid', 'side', 'deni', 'two', 'tri', 'sunday', '1913', 'six', 'nation', 'loss', 'ireland', 'dublinmark']


In [18]:
dictionary = gensim.corpora.Dictionary(tokenized_doc)
print (f"There are {len(dictionary)} items in the dictionary")
print ("First 10 items in the dictionary: key is index and value are the words")
for item in list(dictionary.items())[:10]:
    print (item)

There are 10321 items in the dictionary
First 10 items in the dictionary: key is index and value are the words
(0, '1913')
(1, 'abl')
(2, 'absolut')
(3, 'african')
(4, 'ahead')
(5, 'andi')
(6, 'awesom')
(7, 'back')
(8, 'ball')
(9, 'bbc')


### Bag of words

Create a numerical corpus. A corpus is a list of bags of words. A bag-of-words representation for a document just lists the number of times each word occurs in the document.

In [19]:
# Transform the collection of texts to a numerical form
numerical_corpus = [dictionary.doc2bow(text) for text in tokenized_doc]
print ( numerical_corpus[0] )

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 2), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 3), (24, 1), (25, 1), (26, 1), (27, 1), (28, 4), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 1), (39, 1), (40, 1), (41, 1), (42, 2), (43, 2), (44, 2), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 4), (54, 1), (55, 1), (56, 1), (57, 2), (58, 1), (59, 1), (60, 3), (61, 2), (62, 1), (63, 1), (64, 2), (65, 1), (66, 2), (67, 1), (68, 1), (69, 3), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 3), (76, 1), (77, 1), (78, 2), (79, 2), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 3), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2), (98, 1), (99, 1), (100, 1), (101, 4), (102, 1), (103, 5), (104, 1), (105, 2), (106, 1), (107, 1), (108, 1), (109, 1), (110, 2),

## Model

Now we will create a similarity measure object in tf-idf space. tf-idf stands for term frequency-inverse document frequency. Term frequency is how often a word shows up in a document, and inverse document frequency scales that value by how rare the word is across the corpus.
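
The weighting can be worked through by hand on a toy corpus. The sketch below is an illustration, not gensim's implementation: it assumes raw counts for term frequency, a base-2 logarithm for the inverse document frequency, and L2 normalisation of each document vector (which, to my understanding, mirrors gensim's defaults). The function name `tfidf_vectors` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each token occurs in
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Weight = term frequency * log2(N / document frequency)
        weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

docs = [["robben", "injury", "chelsea"], ["chelsea", "win"], ["chelsea", "robben"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In this toy corpus `chelsea` occurs in every document, so its weight is zero, while the rarer `injury` gets the largest weight in the first document.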





In [20]:
tf_idf = gensim.models.TfidfModel(numerical_corpus, id2word=dictionary)
print(tf_idf)

TfidfModel(num_docs=510, num_nnz=67925)


In [21]:
similarity_object = gensim.similarities.Similarity('./datasets/', tf_idf[numerical_corpus], num_features=len(dictionary))
print(similarity_object)
print(type(similarity_object))

Similarity index with 510 documents in 0 shards (stored under ./datasets/)
<class 'gensim.similarities.docsim.Similarity'>


### Saving the model

In [22]:
# Dump the model object to a file
# can use pickle as well
# but joblib is faster for large number of numpy arrays

import joblib
model_file_joblib = './datasets/tfidf_model'
joblib.dump(similarity_object, model_file_joblib)

['./datasets/tfidf_model']

In [23]:
import pickle
model_file_pickle = './datasets/tfidf_model.p'
with open(model_file_pickle, 'wb') as f:
    pickle.dump(similarity_object, f)

### Loading and using the  stored model

Now create a query document and convert it to tf-idf.

The query document is the one for which we want to find similar documents.

We are going to use `raw_doc[8]` as the query doc and check whether the model finds that same document as the most similar one.


In [24]:
import joblib
similarity_object = joblib.load(model_file_joblib)

In [25]:
#query_text
q_text = raw_doc[8]
q_text_processed = prepare_text(q_text, TOKENIZE=True, STEM=True)
print ( "first 10 tokens:\n",q_text_processed[:10])
q_text_bow = dictionary.doc2bow(q_text_processed)
print ( "first 10 bow:\n",q_text_bow[:10])


first 10 tokens:
 ['robben', 'sidelin', 'broken', 'footchelsea', 'winger', 'arjen', 'robben', 'broken', 'two', 'metatars']
first 10 bow:
 [(23, 3), (45, 1), (46, 1), (50, 1), (57, 1), (71, 1), (94, 1), (106, 1), (110, 1), (111, 1)]


In [26]:
q_text_tfidf = tf_idf[q_text_bow]
print ( "first 10 tfidf:\n",q_text_tfidf[:10] )


first 10 tfidf:
 [(23, 0.060779464016390256), (45, 0.07075027144828656), (46, 0.014605368204018868), (50, 0.028797575313044728), (57, 0.015421487398221299), (71, 0.0404114823434502), (94, 0.015008913499203016), (106, 0.004434193825346702), (110, 0.020478063934517673), (111, 0.026530003463721384)]


In [27]:
similarity_scores=list(similarity_object[q_text_tfidf])
print ( "first 10 similarity scores:\n", similarity_scores[:10])

first 10 similarity scores:
 [0.024998276, 0.020571105, 0.009893991, 0.016575314, 0.014618116, 0.013003329, 0.020246074, 0.011348683, 1.0000001, 0.022904249]


In [28]:
max_score = max(similarity_scores)
max_score_index = similarity_scores.index(max_score)

print (f"max score {max_score} and max score index {max_score_index}")

max score 1.0000001192092896 and max score index 8


As expected, the similarity score for the doc at index 8 is 1 (the value is slightly above 1 only because of floating-point rounding).

In [29]:
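# Indices of the three highest scores, best first
# (argsort avoids the wrong index that list.index() returns when scores tie)
top_indices = np.argsort(similarity_scores)[::-1][:3]

for indx in top_indices:
    score = similarity_scores[indx]
    print ( f"score: {score} index:{indx}")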

score: 1.0000001192092896 index:8
score: 0.2723110318183899 index:220
score: 0.1732272207736969 index:113


In [30]:
raw_doc[8]

'Robben sidelined with broken foot\n\nChelsea winger Arjen Robben has broken two metatarsal bones in his foot and will be out for at least six weeks.\n\nRobben had an MRI scan on the injury, sustained during the Premiership win at Blackburn, on Monday. "Six weeks is the average time to heal this injury and then I need a few more weeks to be completely fit again," he told Dutch newspaper Algemeen Dagblad. "I had a feeling it was serious but because of the swelling it was impossible to make a final diagnosis." The 21-year-old missed the first three months of the season with a similar injury after a challenge with Roma\'s Olivier Dacourt. And he added: "It felt different then last summer when I had the same injury on my other foot. "Then I could walk already after three days but I stayed sidelined for a long period. I hope that it will now take me six to eight weeks." Chelsea physio Mike Banks was hopeful that Robben could return at some point in March. "The fractures are tiny and he coul

In [31]:
raw_doc[220]

'Robben plays down European return\n\nInjured Chelsea winger Arjen Robben has insisted that he only has a 10% chance of making a return against Barcelona in the Champions League.\n\nThe 21-year-old has been sidelined since breaking a foot against Blackburn last month. Chelsea face Barcelona at home on 8 March having lost 2-1 in the first leg. And Robben told the Daily Star: "It is not impossible that I will play against Barcelona but it is just a very, very small chance - about 10%."\n\nRobben has been an inspirational player for Chelsea this season following a switch from PSV Einhoven last summer. He added: "My recovery is going better than we expected a few weeks ago but I think the Barcelona game will come too soon. "I won\'t take any risks and come back too soon."\n'

In [32]:
raw_doc[113]

'Kenyon denies Robben Barca return\n\nChelsea chief executive Peter Kenyon has played down reports that Arjen Robben will return for the Champions League match against Barcelona.\n\n"He\'s been responding well to treatment and started running on Friday, but we\'ll have to wait and see," he told BBC Five Live\'s Sportsweek. "We\'re looking to getting him back as soon as possible, but he\'ll be back when it\'s right for him and for us. "There\'s no plans at the moment around the Barcelona game." His comments contradict those of chiropractor Jean Pierre Meersseman who treated the Dutchman after he fractured his foot at the start of February. Robben had been expected to be out for six weeks, but Meersseman hinted that the winger could be fit for the vital Stamford Bridge game on 8 March. "I hope he can be back and I will try to help him make that happen," Meersseman told the Mail on Sunday. "I put everything right with Arjen\'s foot the last time I saw him 12 days ago. It was an obvious co

## Word2Vec Model

In [33]:
raw_doc = read_data_files(data_dir)
tokenized_doc = [prepare_text(doc, TOKENIZE=True, STEM=True) for doc in raw_doc]


skipping the unreadable file:  ./datasets/bbc_sports/199.txt


In [34]:
# Build the vocabulary and train the model.
# See the official gensim documentation for the full list of parameters.
# Note: this cell uses the gensim 3.x API; gensim 4.x renamed `size` to `vector_size`.

w2v_model = gensim.models.Word2Vec(
            sentences=tokenized_doc,
            size=300, # Dimensionality of the dense vector that represents each token
            window=10, # Maximum distance between the target word and its neighboring words
            min_count=5, # Minimum frequency count; words that occur fewer times are ignored
            workers=10) # Number of worker threads used for training

# Passing `sentences` above already trains the model once; this call continues training for 15 more epochs
w2v_model.train(tokenized_doc, total_examples=len(tokenized_doc), epochs=15) 

(1177052, 1410330)

In [35]:
# gensim 3.x API; in gensim 4.x use `list(w2v_model.wv.key_to_index)` instead
words = list(w2v_model.wv.vocab)
print (f"There are {len(words)} words. First 10 words:\n{words[:10]}")

There are 2817 words. First 10 words:
['robinson', 'blast', 'coach', 'andi', 'insist', 'livid', 'side', 'deni', 'two', 'tri']


In [36]:
vectors = np.array([w2v_model.wv[word] for word in words])
print (f"vectors.shape:{vectors.shape}")

vectors.shape:(2817, 300)


## Printing the vectors

In [37]:
my_word = 'defend'
print (f"my word: {my_word}")
v=w2v_model.wv[my_word]
idx=dictionary.doc2idx([my_word])[0]
print ('dictionary index:', idx)
print (f"sanity check: {idx}^th item in the dictionary: {dictionary[idx]}")

my word: defend
dictionary index: 31
sanity check: 31^th item in the dictionary: defend


In [38]:
word="celtic"
v = w2v_model.wv[word].tolist()

Passing `sentences` to the `Word2Vec` constructor builds the vocabulary and runs an initial round of training; calling train(...) afterwards continues the training. Behind the scenes, we are training a neural network with a single hidden layer to predict the current word from its context (this is the default architecture, known as CBOW). However, we are not going to use the neural network after training! Instead, the goal is to learn the weights of the hidden layer. These weights are the word vectors we are trying to learn, also known as embeddings. You can think of the embeddings as features that describe the target word. For example, the word king may be described by gender, age, the type of people the king associates with, etc.
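
To see what "predicting the current word from its context" looks like as training data, the sketch below lists the (context, target) pairs that a context window produces for one sentence. It is a simplified illustration: the helper `cbow_training_pairs` is a made-up name, and details of the real training procedure such as frequent-word subsampling and randomly shrunk windows are ignored.

```python
def cbow_training_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style objective."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the target
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = ["robben", "broken", "foot", "six", "weeks"]
for context, target in cbow_training_pairs(sentence, window=1):
    print(context, "->", target)
```

Words that keep appearing with the same neighbours end up with similar vectors, which is what the similarity queries in the next cells rely on.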

Let's see similarity on some sport types. This first example shows a simple look up of words similar to the word ‘match’. All we need to do here is to call the most_similar function and provide the word ‘match’ as the positive example. This returns the top 10 similar words.

In [39]:
word="match"
w2v_model.wv.most_similar(positive=word)

[('lost', 0.909685492515564),
 ('defeat', 0.8464573621749878),
 ('seven', 0.8464318513870239),
 ('french', 0.8307112455368042),
 ('two', 0.8213168978691101),
 ('nine', 0.8158985376358032),
 ('arres', 0.805950403213501),
 ('huge', 0.7911980152130127),
 ('tournament', 0.7869153618812561),
 ('beaten', 0.7826323509216309)]

In [40]:
word="fractur"
w2v_model.wv.most_similar(positive=word)

[('crisi', 0.9595956802368164),
 ('physio', 0.9509631395339966),
 ('destroy', 0.9400373697280884),
 ('monday', 0.9334959983825684),
 ('bone', 0.9209457635879517),
 ('prefer', 0.9118180274963379),
 ('separ', 0.9071800112724304),
 ('doubt', 0.9070960283279419),
 ('thumb', 0.9057219624519348),
 ('session', 0.9042330980300903)]

In [41]:
# lets try the same with an adjective : "good"
word="good"

w2v_model.wv.most_similar(positive=word)

[('got', 0.9682800769805908),
 ('everi', 0.9312726855278015),
 ('get', 0.9273092746734619),
 ('look', 0.9247929453849792),
 ('hard', 0.9181417226791382),
 ('difficult', 0.9140352010726929),
 ('might', 0.9113308191299438),
 ('coupl', 0.9102075099945068),
 ('like', 0.9067820310592651),
 ('mental', 0.9051308631896973)]

In [42]:
word="celtic"
w2v_model.wv.most_similar(positive=word)

[('newcastl', 0.916409969329834),
 ('everton', 0.9107216596603394),
 ('old', 0.9064370393753052),
 ('alex', 0.9049527645111084),
 ('midfield', 0.8896152973175049),
 ('red', 0.888375997543335),
 ('wayn', 0.8713171482086182),
 ('ferguson', 0.864258348941803),
 ('blue', 0.8639053106307983),
 ('bolton', 0.8627519607543945)]

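Overall, the results are plausible: the words related to `celtic` are mostly other clubs and football terms. Some neighbours, such as `monday` for `fractur`, are noisier, which is not surprising for a corpus of only about 500 documents. In general, the related words tend to be used in similar contexts.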

Now you could even use Word2Vec to compute similarity between two words in the vocabulary by invoking the similarity(...) function and passing in the relevant words.



In [43]:
word_pairs = [["celtic", "everton"],
              ["good", "bad"],
              ["good", "celtic"],
              ["good", "good"],
              ["kid", "men"]
             ]

for (w1, w2) in word_pairs:
    simi_score = w2v_model.wv.similarity(w1=w1, w2=w2)
    print (f"Similarity score of {w1} and {w2} : {simi_score}")

Similarity score of celtic and everton : 0.9107215404510498
Similarity score of good and bad : 0.7108150124549866
Similarity score of good and celtic : 0.0958632230758667
Similarity score of good and good : 1.0000001192092896
Similarity score of kid and men : 0.23357325792312622


Under the hood, the snippet above computes the cosine similarity between the word vectors (embeddings) of the two specified words. From the scores above, celtic is highly similar to everton and barely related to good. Note that good and bad also score fairly high (0.71): antonyms tend to appear in similar contexts, so Word2Vec places them close together. The similarity between two identical words is 1.0 (up to floating-point rounding), which is the maximum; cosine similarity ranges from -1.0 to 1.0.
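
For reference, cosine similarity itself is easy to compute by hand. The sketch below uses small made-up vectors (not vectors from the trained model) to show the three reference cases: same direction, orthogonal, and opposite direction.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated, 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction, close to -1
print(same, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as maximally similar.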



### doc2vec model

The doc2vec model builds on the word2vec algorithm.

In word2vec there is no need to label the words, because every word carries its own semantic meaning in the vocabulary. In doc2vec, however, we need to specify which group of words or sentences conveys a single semantic meaning, so that the algorithm can treat it as one entity. For this reason, we assign labels (tags) to sentences or paragraphs, depending on the level at which the meaning is conveyed.

If we assign a single label to all the sentences in a paragraph, it means that all of those sentences together are required to convey the meaning. On the other hand, if we assign a different label to each sentence in a paragraph, it means that each sentence conveys its own meaning, and the sentences may or may not be similar to one another.

In simple terms, a label identifies a unit of semantic meaning.
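
The two labelling choices can be illustrated without any library. In the sketch below, `TaggedDoc` is a hypothetical stand-in record defined only for this example (gensim ships a similar `TaggedDocument` record with `words` and `tags` fields), and the sentences and tag names are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged-document record: a token list plus its tags
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

paragraph = [
    ["robben", "broke", "his", "foot"],
    ["he", "will", "miss", "six", "weeks"],
]

# Option 1: one tag for the whole paragraph (the sentences share one meaning)
shared = [TaggedDoc(words=sentence, tags=["para_0"]) for sentence in paragraph]

# Option 2: one tag per sentence (each sentence is its own unit of meaning)
separate = [TaggedDoc(words=sentence, tags=[f"sent_{i}"]) for i, sentence in enumerate(paragraph)]

print([doc.tags for doc in shared])
print([doc.tags for doc in separate])
```

With the shared tag, a single vector would be learned for the whole paragraph; with separate tags, each sentence would get its own vector.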