# NLP-Various Implementations | Word Embeddings Similarities & Analogies

**Overview:** In this part of the project, I implemented an algorithm that identifies similarities between words and predicts missing words in analogies, using the pre-trained word embeddings Word2vec and GloVe. For this purpose, I defined a function that finds the n most similar words for some user-defined input list of words, along with their scores. I also included a function that shows the common words between the two models for each word in that input list. Finally, the code calls the implemented functions with different input parameters to retrieve similar and common words for various words and analogies.

## 1. Import all the necessary modules

**Briefly:** the `gensim.downloader` module gives access to pre-trained word embeddings, and the `PrettyTable` class from the `prettytable` library displays data in a table format.

In [9]:
import gensim.downloader as api
from prettytable import PrettyTable

## 2. Load pre-trained word embeddings

The function `load_embeddings()` uses gensim's downloader to load two popular sets of pre-trained word embeddings: `word2vec-google-news-300` and `glove-wiki-gigaword-300`. It returns both as a tuple, which is unpacked into the variables `w2v_model` and `glove_model`. On first use, `api.load` has to download the data, so this cell can take a while.

In [10]:
def load_embeddings():
    return api.load("word2vec-google-news-300"), api.load("glove-wiki-gigaword-300")

w2v_model, glove_model = load_embeddings() # loads the pre-trained word embeddings for word2vec and GloVe

## 3. Find and compare similar words using word2vec and GloVe embeddings

### 3.1. Top-10 similar words for given targets: {car, jaguar, Jaguar, facebook}

**Find similar words:** The function get_similar_words takes pre-trained word embedding models along with some input words, and performs similarity search to return the top-N similar words to the input words:

* **Step.1:** the function takes six inputs: `n`, the number of similar words to retrieve; `models`, a dictionary of pre-trained word embedding models; `data`, a list of target words (or analogy labels); `pos`, a list of positive context words; `neg`, a list of negative context words; and `analogy`, a boolean that selects between an analogy task (`True`) and a simple similarity task (`False`).
* **Step.2:** the function initializes an empty list `sims`. It then iterates over the models and, for each model, retrieves the top-N similar words (with their scores) for each target word, arranges them in a table with one column per target word, prints the table, and appends the words (without scores) to `sims`. If any queried word is missing from a model's vocabulary, the column is filled with "N/A".
* **Step.3:** finally, it returns `sims`, which holds one flat list per model: the top-N words for the first target word, followed by the top-N words for the second, and so on.
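The similarity search in the steps above can be pictured with a toy example. This is a minimal sketch, not gensim's implementation: `toy_vectors` and `top_n_similar` are hypothetical names, and the 3-dimensional vectors are made up for illustration. It ranks every other word by cosine similarity to the target, which is the kind of score reported next to each word in the tables.

```python
import math

# Hypothetical 3-d embeddings, invented for illustration only.
toy_vectors = {
    "car":     [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "truck":   [0.7, 0.3, 0.0],
    "banana":  [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(target, vectors, n):
    """Return the n (word, score) pairs closest to `target`, excluding the target itself."""
    scores = [
        (word, cosine(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

for word, score in top_n_similar("car", toy_vectors, 2):
    print(f"{word}: {score:.4f}")
```

With these made-up vectors, "vehicle" and "truck" rank above "banana" because they point in nearly the same direction as "car".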

> The `extra` variable determines what is added to the positive list. If `analogy` is `True`, `extra` is an empty list: the query is built only from `pos` and `neg`, and the entry in `data` serves purely as a column label. If `analogy` is `False`, `extra` holds the current target word `d`, so the model retrieves words similar to `d`, adjusted by any context words in `pos` and `neg`.
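To make the role of `extra` concrete, here is a small sketch that isolates just that step. `build_query` is a hypothetical helper, not part of the notebook; it returns the positive and negative lists that would be handed to `most_similar` in each mode.

```python
def build_query(target, pos, neg, analogy):
    """Return the (positive, negative) word lists for one entry of `data`."""
    # In analogy mode the target is only a label, so nothing is added.
    extra = [] if analogy else [target]
    return pos + extra, neg

# Similarity task: the target word itself becomes the positive word.
print(build_query("student", [], ["university"], analogy=False))
# (['student'], ['university'])

# Analogy task: the label 'king - man + woman' is not queried.
print(build_query("king - man + woman", ["king", "woman"], ["man"], analogy=True))
# (['king', 'woman'], ['man'])
```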

`get_similar_words()` is called with `n = 10`, a dictionary holding the two models under the names 'Word2vec' and 'GloVe', the target words 'car', 'jaguar', 'Jaguar', and 'facebook', empty positive and negative lists, and `analogy = False` (a plain similarity search rather than an analogy). The result is stored in `sims`, which holds one list of similar words per model, covering every target word.

> An empty list is provided for both the `positive` and `negative` context words, indicating that the function should only retrieve similar words based on the target words themselves.

In [11]:
def get_similar_words(n, models, data, pos, neg, analogy):
    sims = []
    for model_name, model in models.items():
        temp = []
        pt = PrettyTable(field_names=[f"\033[1m{d}\033[0m" for d in data])
        for d in data:
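            # Plain similarity: query the target word itself. Analogy: `d` is only a column label.
            extra = [] if analogy else [d]
            # Query the model only if every word is in its vocabulary; otherwise fill the column with "N/A".
            if all(w in model for w in pos + extra + neg):
                results = model.most_similar(positive=pos + extra, negative=neg, topn=n)
                temp.extend(f"{word}: {score:.4f}" for word, score in results)
            else:
                temp.extend(["N/A"] * n)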
        for i in range(n):
            pt.add_row([temp[i + j*n] for j in range(len(data))])
        sims.append([elem.split(':')[0].strip() if ":" in elem else "N/A" for elem in temp])
        print('\033[1m' + f"\n{model_name} Model:" + '\033[0m')
        pt.align = 'l'
        print(pt)
    return sims

sims = get_similar_words(10, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['car', 'jaguar', 'Jaguar', 'facebook'], [], [], False)

Word2vec Model:
+----------------------+-------------------------+------------------------+--------------------------+
| car                  | jaguar                  | Jaguar                 | facebook                 |
+----------------------+-------------------------+------------------------+--------------------------+
| vehicle: 0.7821      | jaguars: 0.6738         | Land_Rover: 0.6484     | Facebook: 0.7564         |
| cars: 0.7424         | Macho_B: 0.6313         | Aston_Martin: 0.6437   | FaceBook: 0.7077         |
| SUV: 0.7161          | panther: 0.6086         | Mercedes: 0.6420       | twitter: 0.6989          |
| minivan: 0.6907      | lynx: 0.5815            | Porsche: 0.6233        | myspace: 0.6942          |
| truck: 0.6736        | rhino: 0.5754           | BMW: 0.6055            | Twitter: 0.6642          |
| Car: 0.6678          | lizard: 0.5607          | Bentley_Arnage: 0.6040 | twitter_facebook: 0.6572 |
| Ford_Focus: 0.

**Find common words:** The function get_common_words takes the number of similar words to be retrieved, the target words, and a list of similar words for each model as inputs, and returns a table that shows the common words in both models for each target word.

* **Step.1:** the function takes three inputs: n that specifies the number of similar words to be retrieved, words a list of target words, and sims a list containing the top-N similar words for each target word for both models.
* **Step.2:** it initializes an empty list coms to store the common words in both models for each target word. It then iterates over the retrieved similar words for each target word and finds the intersection of the top-N similar words for both models. It adds the common words to coms and creates a table using PrettyTable. The table shows the common words for each target word and highlights the target word in bold.
* **Step.3:** it finally prints the table showing the common words in both models for each target word.

`get_common_words()` is called with three arguments: `n = 10`, the number of similar words that were retrieved per target word (used to slice `sims`); the same list of target words, 'car', 'jaguar', 'Jaguar', and 'facebook'; and `sims`, the per-model lists returned by `get_similar_words()`. For each target word, the function intersects the two models' top-10 lists and prints the shared words as a table. It does not return a value, and because the intersection is computed with sets, the order of the words within a column is arbitrary.

> The `if/else` statement checks if there are any "N/A" values in the similarity results for the current group of similar words for both models. If both models return "N/A" for a particular group of similar words, it means that there are no similar words found for that particular target word in both models. In this case, an empty list is added to the list of common words (coms) for that target word. Otherwise, if there are similar words found for the target word in both models, the code creates a list of the common words between the two models by taking the intersection of the similar words retrieved from each model, and adds it to coms. The resulting coms list contains the common words between the two models for each target word.
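The intersection logic described above can be sketched in isolation. `common_words` is a hypothetical helper, and unlike the set-based version in the notebook it keeps the order of the first list, which is a design variation rather than the original behaviour. The first example list is taken from the Word2vec output for 'car' above; the second list is invented for illustration.

```python
def common_words(first, second):
    """Return the words found in both lists, keeping the order of `first`."""
    if "N/A" in first and "N/A" in second:
        return []  # neither model produced results for this target word
    second_set = set(second)
    return [word for word in first if word in second_set]

w2v_top = ["vehicle", "cars", "SUV", "minivan", "truck"]
other_top = ["cars", "truck", "vehicle", "driver", "automobile"]  # made-up list

print(common_words(w2v_top, other_top))        # ['vehicle', 'cars', 'truck']
print(common_words(["N/A"] * 3, ["N/A"] * 3))  # []
```

Keeping the order of the first list makes the output reproducible between runs, at the cost of treating the two models asymmetrically.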

In [12]:
def get_common_words(n, words, sims):
    coms = []
    pt = PrettyTable()
    for i in range(0, n*len(words), n):  # each target word owns a slice of n entries in sims
        if 'N/A' in sims[0][i:i+n] and 'N/A' in sims[1][i:i+n]:
            coms.append([])  # no results from either model for this target word
        else:
            coms.append(list(set(sims[0][i:i+n]) & set(sims[1][i:i+n])))
    # Pad shorter columns with empty strings so that all table columns have equal length.
    coms = [sublist + [""] * (max(map(len, coms)) - len(sublist)) for sublist in coms]
    for i, word in enumerate(words):
        pt.add_column(f"\033[1m{word}\033[0m", coms[i])
    print('\033[1m' + "Common words in both Models:" + '\033[0m')
    pt.align = 'l'
    print(pt)

get_common_words(10, ['car', 'jaguar', 'Jaguar', 'facebook'], sims)

Common words in both Models:
+---------+--------+--------+----------+
| car     | jaguar | Jaguar | facebook |
+---------+--------+--------+----------+
| truck   |        |        | myspace  |
| vehicle |        |        | twitter  |
| cars    |        |        | linkedin |
+---------+--------+--------+----------+


### 3.2. Top-10 similar words for user-defined targets: {country, crying, Rachmaninoff, espresso}

`get_similar_words()` is called with the same settings as in 3.1 (`n = 10`, both models, empty positive and negative lists, `analogy = False`), this time with the user-defined target words 'country', 'crying', 'Rachmaninoff', and 'espresso'. The result is again stored in `sims`.

In [13]:
sims = get_similar_words(10, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['country', 'crying', 'Rachmaninoff', 'espresso'], [], [], False)

Word2vec Model:
+-----------------------+-------------------------------+-----------------------------+-----------------------------+
| country               | crying                        | Rachmaninoff                | espresso                    |
+-----------------------+-------------------------------+-----------------------------+-----------------------------+
| nation: 0.7243        | sobbing: 0.7246               | Rachmaninov: 0.7945         | cappuccino: 0.6888          |
| continent: 0.6131     | bawling: 0.7187               | Liszt: 0.7910               | mocha: 0.6686               |
| region: 0.6015        | cried: 0.7152                 | Tchaikovsky: 0.7728         | coffee: 0.6617              |
| thecountry: 0.6002    | screaming: 0.7076             | Shostakovich: 0.7641        | latte: 0.6537               |
| world: 0.5980         | weeping: 0.6933               | concerto: 0.7544            | caramel_macchiato: 0.6491   |

`get_common_words()` is then called with `n = 10`, the same four target words, and `sims`. For each target word it intersects the two models' top-10 lists and prints the shared words as a table; it does not return a value.

In [14]:
get_common_words(10, ['country', 'crying', 'Rachmaninoff', 'espresso'], sims)

Common words in both Models:
+-----------+-----------+--------------+------------+
| country   | crying    | Rachmaninoff | espresso   |
+-----------+-----------+--------------+------------+
| nation    | cry       |              | cappuccino |
| continent | sobbing   |              | mocha      |
|           | weeping   |              | coffee     |
|           | cries     |              | latte      |
|           | cried     |              |            |
|           | screaming |              |            |
+-----------+-----------+--------------+------------+


## 4. Find and filter similar words by context using word2vec and GloVe embeddings

`get_similar_words()` is first called for the single target word 'student' with `n = 10`, empty positive and negative lists, and `analogy = False`. This gives a baseline: the 10 words most similar to 'student' in each model, without any context filtering. The result is stored in `sims`.

In [15]:
sims = get_similar_words(10, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['student'], [], [], False)

Word2vec Model:
+------------------------+
| student                |
+------------------------+
| students: 0.7295       |
| Student: 0.6707        |
| teacher: 0.6301        |
| stu_dent: 0.6241       |
| faculty: 0.6087        |
| school: 0.6056         |
| undergraduate: 0.6020  |
| university: 0.6005     |
| undergraduates: 0.5756 |
| semester: 0.5738       |
+------------------------+

GloVe Model:
+-----------------------+
| student               |
+-----------------------+
| students: 0.7691      |
| teacher: 0.6874       |
| graduate: 0.6738      |
| school: 0.6131        |
| college: 0.6090       |
| undergraduate: 0.6044 |
| faculty: 0.5999       |
| university: 0.5971    |
| academic: 0.5810      |
| campus: 0.5768        |
+-----------------------+


The call is then repeated with the same target word 'student', `n = 10`, an empty positive list, and 'university' in the negative list. The vector of a negative word is subtracted from the query, so the results are pushed away from 'university': words that are close to both 'student' and 'university' drop out of the top 10. The argument `False` again indicates a similarity task rather than an analogy, and the result is stored in `sims`.
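The effect of a negative word can be illustrated with toy vectors. The sketch below assumes, as far as I understand gensim's `most_similar`, that the query is the mean of the unit-normalised positive vectors (weight +1) and negative vectors (weight -1), that candidates are ranked by cosine similarity to that mean, and that the input words are excluded from the results. `most_similar_sketch` and the 2-d vectors are hypothetical and invented for illustration.

```python
import math

# Made-up 2-d embeddings: axis 0 ~ "school pupil", axis 1 ~ "higher education".
toy = {
    "student":       [1.0, 1.0],
    "undergraduate": [0.6, 1.0],
    "sixth_grader":  [1.0, 0.1],
    "university":    [0.1, 1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar_sketch(vectors, positive, negative, topn):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x / len(weighted)
    # Input words are never returned as results.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]

print(most_similar_sketch(toy, ["student"], [], 1))              # ['undergraduate']
print(most_similar_sketch(toy, ["student"], ["university"], 1))  # ['sixth_grader']
```

Subtracting 'university' rotates the query away from the "higher education" axis, which mirrors how the real output shifts from 'undergraduate' and 'faculty' to school-grade words.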

In [16]:
sims = get_similar_words(10, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['student'], [], ['university'], False)

Word2vec Model:
+-------------------------+
| student                 |
+-------------------------+
| sixth_grader: 0.4324    |
| seventh_grader: 0.4178  |
| 8th_grader: 0.4173      |
| eighth_grader: 0.4082   |
| grader: 0.3971          |
| kindergartner: 0.3918   |
| kindergartener: 0.3777  |
| Kindergartner: 0.3565   |
| teen: 0.3470            |
| middle_schooler: 0.3384 |
+-------------------------+

GloVe Model:
+---------------------+
| student             |
+---------------------+
| 15-year: 0.3830     |
| 16-year: 0.3815     |
| 17-year: 0.3785     |
| 14-year: 0.3766     |
| 13-year-old: 0.3730 |
| 14-year-old: 0.3676 |
| 9-year: 0.3667      |
| 16-year-old: 0.3615 |
| 15-year-old: 0.3510 |
| 12-year-old: 0.3490 |
+---------------------+


Finally, the call is repeated for 'student' with `n = 10`, an empty positive list, and the words 'elementary', 'middle', and 'high' in the negative list, which pushes the results away from these school levels. The argument `False` again indicates a similarity task rather than an analogy, and the result is stored in `sims`. Note that with three negative words against a single positive word, the subtracted vectors make up most of the query.

In [17]:
sims = get_similar_words(10, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['student'], [], ['elementary','middle','high'], False)

Word2vec Model:
+-------------------------------------------------------------------------+
| student                                                                 |
+-------------------------------------------------------------------------+
| ----------_-----------------------------------------------_GS##: 0.3401 |
| K.Kahne_###-###: 0.3000                                                 |
| Obiter_Dicta: 0.2770                                                    |
| Hannsen: 0.2738                                                         |
| NewsTrack_Sports: 0.2734                                                |
| Ministere: 0.2702                                                       |
| Pi_fraternity: 0.2690                                                   |
| Mario_Anzuoni_REUTERS: 0.2634                                           |
| Pharma_ceutical: 0.2584                                                 |
| M.Kenseth_###-###: 0.2571                            

## 5. Solve word analogies using word2vec and GloVe embeddings

### 5.1. Top-2 solutions for five given analogies

`get_similar_words()` is called with `n = 2`, the same two models, and the analogy 'king - man + woman'. The string in `data` only labels the table column; the query itself is defined by the positive list ['king', 'woman'] and the negative list ['man']. Setting `analogy = True` tells the function not to add the label to the positive list. The result is stored in `sims`, which holds the top-2 candidate words from each model.

> In this analogy, we want the word that relates to 'woman' in the same way that 'king' relates to 'man'. In vector terms, the query is king - man + woman: 'king' and 'woman' go in the positive list because their vectors are added, and 'man' goes in the negative list because its vector is subtracted, which removes the "male" component from 'king'.
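The vector arithmetic behind such an analogy can be shown with toy vectors. This is a simplified sketch using plain (un-normalised) addition and subtraction; gensim's own procedure differs in details such as normalisation. `solve_analogy` and the 3-d vectors (axes loosely meaning royalty, male, female) are hypothetical and invented for illustration.

```python
import math

# Made-up 3-d embeddings: [royalty, male, female].
toy = {
    "king":     [1.0, 1.0, 0.0],
    "man":      [0.0, 1.0, 0.0],
    "woman":    [0.0, 0.0, 1.0],
    "queen":    [1.0, 0.0, 1.0],
    "princess": [0.6, 0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c, topn):
    """Rank words by similarity to vectors[a] - vectors[b] + vectors[c]."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]

print(solve_analogy(toy, "king", "man", "woman", 2))  # ['queen', 'princess']
```

Here king - man + woman equals [1, 0, 1], which coincides with the toy vector for "queen"; real embeddings only approximate this, which is why the scores in the tables are below 1.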

In [18]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['king - man + woman'], ['king','woman'], ['man'], True)

Word2vec Model:
+--------------------+
| king - man + woman |
+--------------------+
| queen: 0.7118      |
| monarch: 0.6190    |
+--------------------+

GloVe Model:
+--------------------+
| king - man + woman |
+--------------------+
| queen: 0.6713      |
| princess: 0.5433   |
+--------------------+


**Man is to king as woman is to what?:** The expected answer is `queen`, and both models rank it first. Word2vec gives it a slightly higher cosine similarity (0.7118) than GloVe (0.6713). These scores are similarities to the query vector rather than probabilities, but in both models "queen" is clearly ahead of the runner-up, which suggests that both have captured the relationship between "man-king" and "woman-queen" from their training corpora.

The same call is made for the analogy 'france - paris + tokyo': `n` is 2, the positive list is ['france', 'tokyo'], the negative list is ['paris'], and `analogy` is `True`. The string in `data` only labels the table column. The result is stored in `sims`, which holds the top-2 candidate words from each model.

> In this analogy, we want the word that relates to 'tokyo' in the same way that 'france' relates to 'paris'. The query is france - paris + tokyo: 'france' and 'tokyo' go in the positive list because their vectors are added, and 'paris' goes in the negative list because its vector is subtracted.

In [19]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['france - paris + tokyo'], ['france','tokyo'], ['paris'], True)

Word2vec Model:
+------------------------+
| france - paris + tokyo |
+------------------------+
| japan: 0.5508          |
| hong_kong: 0.5012      |
+------------------------+

GloVe Model:
+------------------------+
| france - paris + tokyo |
+------------------------+
| japan: 0.8017          |
| japanese: 0.6111       |
+------------------------+


**Paris is to france as tokyo is to what?:** Both models rank `japan` first. GloVe scores it considerably higher (0.8017) than Word2vec (0.5508), and in both models it is ahead of the runner-up. This indicates that both models have learned the relationship between "paris-france" and "tokyo-japan" from their training corpora, with GloVe representing it more cleanly for these lowercase inputs.

The same call is made for the analogy 'trees - apples + grapes': `n` is 2, the positive list is ['trees', 'grapes'], the negative list is ['apples'], and `analogy` is `True`. The string in `data` only labels the table column. The result is stored in `sims`, which holds the top-2 candidate words from each model.

> In this analogy, we want the word that relates to 'grapes' in the same way that 'trees' relates to 'apples'. The query is trees - apples + grapes: 'trees' and 'grapes' go in the positive list because their vectors are added, and 'apples' goes in the negative list because its vector is subtracted.

In [20]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['trees - apples + grapes'], ['trees','grapes'], ['apples'], True)

Word2vec Model:
+-------------------------+
| trees - apples + grapes |
+-------------------------+
| oak_trees: 0.6750       |
| vines: 0.6702           |
+-------------------------+

GloVe Model:
+-------------------------+
| trees - apples + grapes |
+-------------------------+
| vines: 0.5909           |
| tree: 0.5843            |
+-------------------------+


**Apples are to trees as grapes are to what?:** The expected answer is `vines`. Only the GloVe model ranks it first (0.5909); the Word2vec model ranks "oak_trees" first (0.6750), with "vines" a very close second (0.6702). Both models have therefore learned some association between "apples-trees" and "grapes-vines" from their training corpora, but the narrow margins between the top two candidates show that this relationship is captured less cleanly than in the previous analogies.

The same call is made for the analogy 'swimming - walking + walked': `n` is 2, the positive list is ['swimming', 'walked'], the negative list is ['walking'], and `analogy` is `True`. The string in `data` only labels the table column. The result is stored in `sims`, which holds the top-2 candidate words from each model.

> In this analogy, we want the word that relates to 'walked' in the same way that 'swimming' relates to 'walking'. The query is swimming - walking + walked: 'swimming' and 'walked' go in the positive list because their vectors are added, and 'walking' goes in the negative list because its vector is subtracted.

In [21]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['swimming - walking + walked'], ['swimming','walked'], ['walking'], True)

Word2vec Model:
+-----------------------------+
| swimming - walking + walked |
+-----------------------------+
| swam: 0.6926                |
| swim: 0.6725                |
+-----------------------------+

GloVe Model:
+-----------------------------+
| swimming - walking + walked |
+-----------------------------+
| swam: 0.4978                |
| swimmers: 0.4852            |
+-----------------------------+


**Walking is to swimming as walked is to what?:** Both models rank `swam` first, with a higher score from Word2vec (0.6926) than from GloVe (0.4978). The runners-up differ: Word2vec returns "swim" (0.6725), while GloVe returns "swimmers" (0.4852). Both are related to swimming but are not the past tense. The top result indicates that both models have learned the tense relationship between "walking-walked" and "swimming-swam" from their training corpora.

The same call is made for the analogy 'doctor - father + mother': `n` is 2, the positive list is ['doctor', 'mother'], the negative list is ['father'], and `analogy` is `True`. The string in `data` only labels the table column. The result is stored in `sims`, which holds the top-2 candidate words from each model.

> In this analogy, we want the word that relates to 'mother' in the same way that 'doctor' relates to 'father'. The query is doctor - father + mother: 'doctor' and 'mother' go in the positive list because their vectors are added, and 'father' goes in the negative list because its vector is subtracted.

In [22]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['doctor - father + mother'], ['doctor','mother'], ['father'], True)

Word2vec Model:
+--------------------------+
| doctor - father + mother |
+--------------------------+
| nurse: 0.7128            |
| doctors: 0.6593          |
+--------------------------+

GloVe Model:
+--------------------------+
| doctor - father + mother |
+--------------------------+
| nurse: 0.6570            |
| doctors: 0.6172          |
+--------------------------+


**Father is to doctor as mother is to what?:** Unfortunately, both the Word2vec and GloVe models rank `nurse` first, ahead of "doctors". Word2vec gives it a score of 0.7128 and GloVe a score of 0.6570. These scores indicate that both models have learned an association between "father-doctor" and "mother-nurse" from their training corpora. This association is harmful because it reflects and reinforces a gender stereotype: it implies that fathers are more likely to be doctors and mothers more likely to be nurses, which does not follow from the meaning of the words.

### 5.2. Top-2 solutions for five user-defined analogies

`get_similar_words()` is called for the first user-defined analogy, 'russian - pelmeni + dumplings': `n` is 2, the positive list is ['russian', 'dumplings'], the negative list is ['pelmeni'], and `analogy` is `True`. The string in `data` only labels the table column. The result is stored in `sims`, which holds the top-2 candidate words from each model.

> In this analogy, we want the word that relates to 'dumplings' in the same way that 'russian' relates to 'pelmeni'. The query is russian - pelmeni + dumplings: 'russian' and 'dumplings' go in the positive list because their vectors are added, and 'pelmeni' goes in the negative list because its vector is subtracted.
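Since every analogy label has the form 'a - b + c', the positive and negative lists could also be derived from the label instead of being typed by hand. The sketch below is one possible way to do that; `parse_analogy` is a hypothetical helper that is not part of the notebook, and it assumes single-token words separated by spaces.

```python
def parse_analogy(expression):
    """Split a label such as 'a - b + c' into (positive, negative) word lists."""
    tokens = expression.split()
    positive, negative = [tokens[0]], []
    # Tokens alternate between operators (odd positions) and words (even positions).
    for sign, word in zip(tokens[1::2], tokens[2::2]):
        if sign == "+":
            positive.append(word)
        elif sign == "-":
            negative.append(word)
        else:
            raise ValueError(f"unexpected operator {sign!r}")
    return positive, negative

analogies = ["king - man + woman", "france - paris + tokyo", "russian - pelmeni + dumplings"]
for expr in analogies:
    pos, neg = parse_analogy(expr)
    print(expr, "->", pos, neg)
    # With the models loaded, the existing function could then be called as:
    # get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, [expr], pos, neg, True)
```

This keeps the table label and the query in sync, so a typo cannot make them disagree.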

In [23]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['russian - pelmeni + dumplings'], ['russian','dumplings'], ['pelmeni'], True)

Word2vec Model:
+-------------------------------+
| russian - pelmeni + dumplings |
+-------------------------------+
| chinese: 0.5360               |
| japanese: 0.5182              |
+-------------------------------+

GloVe Model:
+-------------------------------+
| russian - pelmeni + dumplings |
+-------------------------------+
| chinese: 0.5497               |
| russia: 0.5305                |
+-------------------------------+


**Pelmeni is to russian as dumplings is to what?:** In this example, both the Word2vec and GloVe models return `chinese` as the top answer. GloVe gives it a slightly higher score (0.5497) than Word2vec (0.5360). These scores are only moderate, so the evidence is weaker than for a high-scoring analogy, but they suggest that both models picked up the association between "pelmeni-russian" and "dumplings-chinese" from the training corpus. The second-ranked answers differ: Word2vec returns another nationality (`japanese`), whereas GloVe returns `russia`, which is close to the input word 'russian' rather than a real solution.
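To make the agreement between the two models explicit, the two result lists can be intersected. The sketch below assumes each result is a list of `(word, score)` tuples, which is the shape gensim's `most_similar` returns. The helper name `common_words` is hypothetical, and the scores are copied from the tables above.

```python
def common_words(results_a, results_b):
    """Return the words found in both result lists, in the order of results_a."""
    words_b = {word for word, _ in results_b}
    return [word for word, _ in results_a if word in words_b]

# Scores copied from the Word2vec and GloVe outputs for this analogy.
w2v_top = [("chinese", 0.5360), ("japanese", 0.5182)]
glove_top = [("chinese", 0.5497), ("russia", 0.5305)]

shared = common_words(w2v_top, glove_top)
print(shared)  # ['chinese']
```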

The function get_similar_words() is called with the same kinds of arguments as before: the value 2, which is the number of similar words to retrieve; the dictionary with the two pre-trained models, 'Word2vec' and 'GloVe'; a list holding the analogy label, which is 'piano - sonata + symphony' in this case; the positive words list, which includes 'piano' and 'symphony'; the negative words list, which includes 'sonata'; and True, which tells the function to treat the task as an analogy. The output is stored in the variable sims, which contains the two words that best fit the analogy according to the Word2vec and GloVe models.

> In this analogy, we want to find the word that relates to 'symphony' in the same way that 'piano' relates to 'sonata'. To achieve this, we put 'piano' and 'symphony' in the positive words list and 'sonata' in the negative words list. Subtracting 'sonata' from 'piano' isolates the relationship between a musical form and what performs it, and adding 'symphony' applies that relationship to the new form.
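Because it is easy to mix up which word goes in which list, the mapping from "a is to b as c is to ?" to positive and negative lists can be written down once. The helpers below are hypothetical and are not part of the notebook's get_similar_words(). They only assume that the model object exposes gensim's `most_similar(positive=..., negative=..., topn=...)` method.

```python
def analogy_args(a, b, c):
    """Map 'a is to b as c is to ?' onto positive and negative word lists."""
    return {"positive": [b, c], "negative": [a]}

def solve_analogy(model, a, b, c, topn=2):
    """Query any object that offers gensim's KeyedVectors.most_similar interface."""
    return model.most_similar(topn=topn, **analogy_args(a, b, c))

# 'sonata is to piano as symphony is to ?'
print(analogy_args("sonata", "piano", "symphony"))
# {'positive': ['piano', 'symphony'], 'negative': ['sonata']}
```

With the models loaded earlier, a call such as `solve_analogy(w2v_model, "sonata", "piano", "symphony")` should query the same positive and negative words as the cell below.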

In [24]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['piano - sonata + symphony'], ['piano','symphony'], ['sonata'], True)

Word2vec Model:
+----------------------------+
| piano - sonata + symphony  |
+----------------------------+
| orchestra: 0.7010          |
| symphony_orchestra: 0.5961 |
+----------------------------+

GloVe Model:
+---------------------------+
| piano - sonata + symphony |
+---------------------------+
| orchestra: 0.7714         |
| orchestras: 0.6377        |
+---------------------------+


**Sonata is to piano as symphony is to what?:** In this example, both the Word2vec and GloVe models return `orchestra` as the top answer. GloVe gives it a higher score (0.7714) than Word2vec (0.7010). These relatively high scores suggest that both models picked up the association between "sonata-piano" and "symphony-orchestra" from the training corpus. The second-ranked answers (`symphony_orchestra` and `orchestras`) are variants of the same word, which points in the same direction.

The function get_similar_words() is called once more with the same kinds of arguments: the value 2, which is the number of similar words to retrieve; the dictionary with the two pre-trained models, 'Word2vec' and 'GloVe'; a list holding the analogy label, which is 'bad - war + peace' in this case; the positive words list, which includes 'bad' and 'peace'; the negative words list, which includes 'war'; and True, which tells the function to treat the task as an analogy. The output is stored in the variable sims, which contains the two words that best fit the analogy according to the Word2vec and GloVe models.

> In this analogy, we want to find the word that relates to 'peace' in the same way that 'bad' relates to 'war'. To achieve this, we put 'bad' and 'peace' in the positive words list and 'war' in the negative words list. Subtracting 'war' from 'bad' isolates the relationship between a concept and its evaluation, and adding 'peace' applies that relationship to the new concept.

In [25]:
sims = get_similar_words(2, {'Word2vec': w2v_model, 'GloVe': glove_model}, ['bad - war + peace'], ['bad','peace'], ['war'], True)

Word2vec Model:
+-------------------+
| bad - war + peace |
+-------------------+
| good: 0.5381      |
| Bad: 0.4294       |
+-------------------+

GloVe Model:
+-------------------+
| bad - war + peace |
+-------------------+
| good: 0.5159      |
| things: 0.4778    |
+-------------------+


**War is to bad as peace is to what?:** In this example, both the Word2vec and GloVe models return `good` as the top answer. Word2vec gives it a slightly higher score (0.5381) than GloVe (0.5159). These moderate scores suggest that both models picked up the association between "war-bad" and "peace-good" from the training corpus. The second-ranked answers are less informative: Word2vec returns `Bad`, a capitalised form of an input word, and GloVe returns the generic word `things`.
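Word2vec's second answer, `Bad`, differs from the input word 'bad' only by letter case, so it adds nothing to the analogy. One possible post-processing step is to drop such hits before reading the results. The helper `drop_input_variants` below is a hypothetical sketch that assumes results are `(word, score)` tuples; it is not part of the notebook's code.

```python
def drop_input_variants(results, query_words):
    """Remove hits that match a query word once letter case is ignored."""
    lowered = {word.lower() for word in query_words}
    return [(word, score) for word, score in results if word.lower() not in lowered]

# Scores copied from the Word2vec output above.
w2v_top = [("good", 0.5381), ("Bad", 0.4294)]
filtered = drop_input_variants(w2v_top, ["bad", "peace", "war"])
print(filtered)  # [('good', 0.5381)]
```

Filtering after the fact shortens the list, so in practice one would request more than two candidates and keep the first two that survive.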