In [1]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora and pretrained models.



In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of Gensim's features is simple and easy access to common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a
variety of corpora and pretrained models.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
This module leverages a local cache (in user's home folder, by default) that
ensures data is downloaded at most once.

This tutorial:

* Downloads the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Let's start by importing the api module.




In [3]:
import gensim.downloader as api

Now, let's download the text8 corpus and load it as a Python object
that supports streamed access.




In [4]:
#corpus = api.load('text8')
corpus = api.load("text8")

In this case, our corpus is an iterable.
If you look under the covers, it has the following definition:



In [5]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



For more details, look inside the file that defines the Dataset class for your particular resource.




In [6]:
print(inspect.getfile(corpus.__class__))

/Users/joshuamailman/gensim-data/text8/__init__.py


With the corpus has been downloaded and loaded, let's use it to train a word2vec model.




In [7]:
from gensim.models.word2vec import Word2Vec
model_skp = Word2Vec(corpus, sg=1)

2021-03-21 02:28:00,174 : INFO : collecting all words and their counts
2021-03-21 02:28:00,177 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-21 02:28:04,767 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-21 02:28:04,767 : INFO : Loading a fresh vocabulary
2021-03-21 02:28:05,002 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-21 02:28:05,002 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-21 02:28:05,140 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-21 02:28:05,145 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-21 02:28:05,145 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-21 02:28:05,301 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-21 02:28:05,301 : 

2021-03-21 02:29:13,728 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-03-21 02:29:13,729 : INFO : EPOCH - 2 : training on 17005207 raw words (12506772 effective words) took 29.1s, 429412 effective words/s
2021-03-21 02:29:14,747 : INFO : EPOCH 3 - PROGRESS: at 3.41% examples, 421042 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:29:15,755 : INFO : EPOCH 3 - PROGRESS: at 6.94% examples, 425106 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:29:16,773 : INFO : EPOCH 3 - PROGRESS: at 10.46% examples, 426973 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:29:17,782 : INFO : EPOCH 3 - PROGRESS: at 13.82% examples, 423978 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:29:18,801 : INFO : EPOCH 3 - PROGRESS: at 17.28% examples, 424032 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:29:19,824 : INFO : EPOCH 3 - PROGRESS: at 20.69% examples, 422761 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:29:20,828 : INFO : EPOCH 3 - PROGRESS: at 24.04% examples, 422686 words/s, in_q

2021-03-21 02:30:20,719 : INFO : EPOCH 5 - PROGRESS: at 27.10% examples, 419176 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:21,728 : INFO : EPOCH 5 - PROGRESS: at 30.51% examples, 420200 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:22,759 : INFO : EPOCH 5 - PROGRESS: at 33.98% examples, 420788 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:23,768 : INFO : EPOCH 5 - PROGRESS: at 37.33% examples, 420591 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:24,807 : INFO : EPOCH 5 - PROGRESS: at 40.86% examples, 421024 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:25,810 : INFO : EPOCH 5 - PROGRESS: at 44.33% examples, 422012 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:26,811 : INFO : EPOCH 5 - PROGRESS: at 47.62% examples, 421410 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:27,816 : INFO : EPOCH 5 - PROGRESS: at 50.91% examples, 420852 words/s, in_qsize 5, out_qsize 0
2021-03-21 02:30:28,819 : INFO : EPOCH 5 - PROGRESS: at 54.32% examples, 421244 words/s, in_qsiz

In [8]:
model_bow = Word2Vec(corpus, sg=0)

2021-03-21 02:30:42,364 : INFO : collecting all words and their counts
2021-03-21 02:30:42,367 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-21 02:30:47,052 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-21 02:30:47,053 : INFO : Loading a fresh vocabulary
2021-03-21 02:30:47,262 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-21 02:30:47,263 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-21 02:30:47,396 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-21 02:30:47,402 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-21 02:30:47,402 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-21 02:30:47,569 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-21 02:30:47,570 : 

In [804]:
model_skp.wv.save_word2vec_format('data/model_skp', binary=False)
model_bow.wv.save_word2vec_format('data/model_bow', binary=False)

2021-03-21 20:53:37,146 : INFO : storing 71290x100 projection weights into data/model_skp
2021-03-21 20:53:40,647 : INFO : storing 71290x100 projection weights into data/model_bow


In [None]:
model_skp = gensim.models.Word2Vec.load('data/model_skp.bin')
model_bow = gensim.models.Word2Vec.load('data/model_bow.bin')

Now that we have our word2vec model, let's find words that are similar to 'tree'.




In [25]:
word = 'rug'
words = ['goose' ,'brant', 'honking', 'greylag']

In [26]:
print(model_skp.wv.most_similar(word))

[('muttered', 0.9140298366546631), ('latrine', 0.9127816557884216), ('nape', 0.9126898050308228), ('ponde', 0.9087767601013184), ('snatched', 0.907937228679657), ('strode', 0.9071558713912964), ('handmaidens', 0.9067798852920532), ('lures', 0.9058092832565308), ('sneaks', 0.9054685235023499), ('blackening', 0.9050582647323608)]


In [27]:
print(model_bow.wv.most_similar(word))

[('dammit', 0.7821866869926453), ('butts', 0.7684436440467834), ('hestia', 0.7587953209877014), ('relaxes', 0.7585235238075256), ('brazen', 0.757510781288147), ('choking', 0.7536804676055908), ('hellboy', 0.7532745599746704), ('gunfighter', 0.7508139610290527), ('rhadamanthys', 0.7474358081817627), ('stomping', 0.7474220991134644)]


In [28]:
#print(model_skp.wv.most_similar(words))

In [29]:
#print(model_bow.wv.most_similar(words))

You can use the API to download several different corpora and pretrained models.
Here's how to list all resources available in gensim-data:




In [30]:

model_skp.predict_output_word([word], topn=4)

[('her', 0.000117058196),
 ('she', 9.678167e-05),
 ('called', 7.754944e-05),
 ('leg', 7.688983e-05)]

In [31]:
model_bow.predict_output_word([word], topn=4)

[('who', 2.238965e-05),
 ('t', 2.1998887e-05),
 ('stone', 2.1704434e-05),
 ('al', 2.1177275e-05)]

In [805]:
#pos='dog'
#pos_list = ['academic', 'gown']
pos_list = ['owl']

In [806]:
model_skp.most_similar(pos_list, topn=4)


  model_skp.most_similar(pos_list, topn=4)


[('screech', 0.7620271444320679),
 ('elk', 0.7512877583503723),
 ('lizard', 0.7465932369232178),
 ('crocodile', 0.7463972568511963)]

In [807]:
model_bow.most_similar(pos_list, topn=4)

  model_bow.most_similar(pos_list, topn=4)


[('crab', 0.7441952228546143),
 ('humpback', 0.7424106001853943),
 ('lizard', 0.7364765405654907),
 ('cay', 0.7333977222442627)]

In [808]:

model_skp.most_similar(positive=[*pos_list, 'walks'], negative=['man'], topn=4)

  model_skp.most_similar(positive=[*pos_list, 'walks'], negative=['man'], topn=4)


[('stump', 0.5987354516983032),
 ('hula', 0.5817593932151794),
 ('goose', 0.5799045562744141),
 ('swims', 0.5794539451599121)]

In [809]:
model_bow.most_similar(positive=[*pos_list, 'walks'], negative=['man'], topn=4)

  model_bow.most_similar(positive=[*pos_list, 'walks'], negative=['man'], topn=4)


[('stump', 0.6686378717422485),
 ('chasing', 0.6664310693740845),
 ('poplar', 0.6654931902885437),
 ('floats', 0.6644579768180847)]

In [810]:
model_bow.most_similar(positive=[*pos_list, 'fingers'], negative=['man'], topn=5)

  model_bow.most_similar(positive=[*pos_list, 'fingers'], negative=['man'], topn=5)


[('hook', 0.6833391189575195),
 ('padded', 0.6609590649604797),
 ('paw', 0.6584718227386475),
 ('cones', 0.657898485660553),
 ('sticky', 0.6558718681335449)]

In [811]:
model_skp.most_similar(positive=[*pos_list, 'nose'], negative=['man'], topn=5)

  model_skp.most_similar(positive=[*pos_list, 'nose'], negative=['man'], topn=5)


[('tail', 0.6247210502624512),
 ('crab', 0.6005396246910095),
 ('sooty', 0.6003273725509644),
 ('whitish', 0.5975860357284546),
 ('claws', 0.5963559746742249)]

In [779]:
model_skp.most_similar(positive=[*pos_list, 'speaks'], negative=['man'], topn=5)

  model_skp.most_similar(positive=[*pos_list, 'speaks'], negative=['man'], topn=5)


[('ideographic', 0.5345851182937622),
 ('monolingual', 0.5307929515838623),
 ('colloquial', 0.5177149772644043),
 ('transliteration', 0.5169132947921753),
 ('mispronounced', 0.507492184638977)]

In [780]:
model_bow.most_similar(positive=[*pos_list, 'speaks'], negative=['man'], topn=5)

  model_bow.most_similar(positive=[*pos_list, 'speaks'], negative=['man'], topn=5)


[('spelled', 0.6013485193252563),
 ('pronounce', 0.5587227940559387),
 ('spelt', 0.552275538444519),
 ('dhe', 0.5515412092208862),
 ('cockney', 0.5448827147483826)]

In [685]:
model_skp.n_similarity(['train'], ['epicycle','southbound'])

  model_skp.n_similarity(['train'], ['epicycle','southbound'])


0.51274174

In [686]:
#model_skp.predict_output_word(words, topn=4)

In [687]:
#model_bow.predict_output_word(words, topn=4)

In [688]:
#model.wv.most_similar(word, topn=4)

In [122]:
import json
info = api.info()
print(json.dumps(info, indent=4))

{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

There are two types of data resources: corpora and models.



In [None]:
print(info.keys())

Let's have a look at the available corpora:



In [None]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

... and the same for models:



In [None]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

If you want to get detailed information about a model/corpus, use:




In [None]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

Sometimes, you do not want to load a model into memory. Instead, you can request
just the filesystem path to the model. For that, use:




In [None]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

If you want to load the model to memory, then:




In [151]:
model = api.load("glove-wiki-gigaword-50")


2021-03-20 01:29:26,405 : INFO : loading projection weights from /Users/joshuamailman/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2021-03-20 01:29:36,071 : INFO : loaded (400000, 50) matrix from /Users/joshuamailman/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2021-03-20 01:29:36,111 : INFO : precomputing L2-norms of word weight vectors


[('cobra', 0.708692729473114),
 ('gts', 0.7059628963470459),
 ('gaboon', 0.6975671052932739),
 ('gts-r', 0.67454993724823),
 ('longhorn', 0.6680278778076172),
 ('panther', 0.666718065738678),
 ('cc', 0.6618682742118835),
 ('ah-1z', 0.6288684606552124),
 ('scorpions', 0.6254101991653442),
 ('mustang', 0.6176854968070984)]

In [161]:
model.most_similar("pretzel")

[('pretzels', 0.6768024563789368),
 ('waffle', 0.668554425239563),
 ('itarsi', 0.6505174040794373),
 ('popsicle', 0.6484500765800476),
 ('screwdriver', 0.6474739909172058),
 ('lug', 0.6465184688568115),
 ('nut', 0.645367443561554),
 ('keg', 0.6438860893249512),
 ('snickers', 0.6391291618347168),
 ('swizzle', 0.6360814571380615)]

In [131]:
model.most_similar('fly')

[('flying', 0.8130614757537842),
 ('flies', 0.7638437151908875),
 ('sail', 0.760538637638092),
 ('cruise', 0.7593040466308594),
 ('landing', 0.7464283108711243),
 ('flights', 0.7390480041503906),
 ('catch', 0.7383242249488831),
 ('bound', 0.7368704676628113),
 ('flight', 0.7362315654754639),
 ('planes', 0.7273046970367432)]

For corpora, the corpus is never loaded to memory, all corpora are iterables wrapped in
a special class ``Dataset``, with an ``__iter__`` method.




In [697]:
"shell's".strip("'s'")

'hell'

In [700]:
"shell's".replace("\'s",'')

'shell'

  'lawn_mower' in model_bow


False

In [790]:
import pickle
import numpy as np
import pandas as pd

#w2v_output_df = pd.DataFrame(columns = ['label', 'sim_bow', 'walks_bow', 'walks_skp', 'fingers_bow'])

# with open("data/" + "img_label_word_list.pickle", 'rb') as to_read:  
#     img_label_word_list  =  pickle.load(to_read)  

import csv
img_label_list = []

def get_label_category(idx_):
    if idx_ in set(range(0, 218)).union( range(223, 226)).union( range(383, 502)).union( range(601,658)):
#         if idx_ == 392:
#             print( idx_, 'garter_snake')
        return 'animal'
    elif idx_ in  range(229, 296):
        #print('moving_object')
        return 'moving_object'  
    else:
        return 'object'
    
    
label_word_cat_dict = {}

with open('data/image_net_labels_with_original_index_numbers.txt', newline = '') as labels:                                                                                          
    label_reader = csv.reader(labels, delimiter=' ')
    for label_row in label_reader:
        img_label_list.append( label_row[2])     
        #print(label_row[1])
        id_num = int(label_row[1])
        #print(id_num)
        category =get_label_category( id_num )
        #print(label_row[2], category )
        label_word_cat_dict[label_row[2]] = category
        
#img_label_list   

In [751]:
499 in set(range(0, 188)).union( range(321, 408)).union( range(497,544))

True

In [789]:
get_label_category(int('218'))

'moving_object'

In [760]:
label_word_cat_dict

{'kit_fox': 'animal',
 'English_setter': 'animal',
 'Siberian_husky': 'animal',
 'Australian_terrier': 'animal',
 'English_springer': 'animal',
 'grey_whale': 'animal',
 'lesser_panda': 'animal',
 'Egyptian_cat': 'animal',
 'ibex': 'animal',
 'Persian_cat': 'animal',
 'cougar': 'animal',
 'gazelle': 'animal',
 'porcupine': 'animal',
 'sea_lion': 'animal',
 'malamute': 'animal',
 'badger': 'animal',
 'Great_Dane': 'animal',
 'Walker_hound': 'animal',
 'Welsh_springer_spaniel': 'animal',
 'whippet': 'animal',
 'Scottish_deerhound': 'animal',
 'killer_whale': 'animal',
 'mink': 'animal',
 'African_elephant': 'animal',
 'Weimaraner': 'animal',
 'soft-coated_wheaten_terrier': 'animal',
 'Dandie_Dinmont': 'animal',
 'red_wolf': 'animal',
 'Old_English_sheepdog': 'animal',
 'jaguar': 'animal',
 'otterhound': 'animal',
 'bloodhound': 'animal',
 'Airedale': 'animal',
 'hyena': 'animal',
 'meerkat': 'animal',
 'giant_schnauzer': 'animal',
 'titi': 'animal',
 'three-toed_sloth': 'animal',
 'sorre

In [723]:
"shell's".replace("'s",'')

'shell'

In [720]:
get_label_category(499)
#set(range(0, 188)).union( range(321, 408)).union( range(497,544))

'animal'

In [766]:
label_word_dict = {}

for label in img_label_list:
    #print(label)
    if label_word_cat_dict[ label ] is 'animal':
        label_w = label.lower().split('_')[-1]
        if label_w in model_bow:
            label_words = [label_w] 
    else:
        label_words =  [w.replace("'s",'') for w in label.lower().split('_') if w.replace("'s",'') in model_bow] 
    if len(label_words) >0:
        label_word_dict[label] = label_words

label_word_dict 

  if label_word_cat_dict[ label ] is 'animal':
  if label_w in model_bow:
  label_words =  [w.replace("'s",'') for w in label.lower().split('_') if w.replace("'s",'') in model_bow]


{'kit_fox': ['fox'],
 'English_setter': ['setter'],
 'Siberian_husky': ['husky'],
 'Australian_terrier': ['terrier'],
 'English_springer': ['springer'],
 'grey_whale': ['whale'],
 'lesser_panda': ['panda'],
 'Egyptian_cat': ['cat'],
 'ibex': ['ibex'],
 'Persian_cat': ['cat'],
 'cougar': ['cougar'],
 'gazelle': ['gazelle'],
 'porcupine': ['porcupine'],
 'sea_lion': ['lion'],
 'malamute': ['lion'],
 'badger': ['badger'],
 'Great_Dane': ['dane'],
 'Walker_hound': ['hound'],
 'Welsh_springer_spaniel': ['spaniel'],
 'whippet': ['spaniel'],
 'Scottish_deerhound': ['spaniel'],
 'killer_whale': ['whale'],
 'mink': ['mink'],
 'African_elephant': ['elephant'],
 'Weimaraner': ['elephant'],
 'soft-coated_wheaten_terrier': ['terrier'],
 'Dandie_Dinmont': ['terrier'],
 'red_wolf': ['wolf'],
 'Old_English_sheepdog': ['sheepdog'],
 'jaguar': ['jaguar'],
 'otterhound': ['jaguar'],
 'bloodhound': ['bloodhound'],
 'Airedale': ['bloodhound'],
 'hyena': ['hyena'],
 'meerkat': ['hyena'],
 'giant_schnauzer':

In [816]:




        
#print(label_word_dict)    
#list_of_dicts = []



label_record_list = []
for k, v in label_word_dict.items():
    #print(k, v)
    sim_bow_list = model_bow.most_similar(v, topn=4)
    sim_bow_list = [(w, round(num,2)) for w, num in sim_bow_list]
#   label_record = {'label': k, 'sim_bow': sim_bow_list}
#   label_record_list.append(label_record)

    walks_bow_list = model_bow.most_similar(positive = [*v, 'walks'], negative=['man'], topn=4)
    walks_bow_list = [(w, round(num,2)) for w, num in walks_bow_list]
    
    walks_skp_list = model_skp.most_similar(positive = [*v, 'walks'], negative=['man'], topn=4)
    walks_skp_list = [(w, round(num,2)) for w, num in walks_skp_list]

    fingers_bow_list = model_bow.most_similar(positive = [*v, 'fingers'], negative=['man'], topn=4)
    fingers_bow_list = [(w, round(num,2)) for w, num in fingers_bow_list]
    
    #category = label_word_cat_dict[ k.split('_')[-1]]
    category = label_word_cat_dict[ k ]
    
    label_record = {'label': k, 'category': category, 'sim_bow': sim_bow_list,'walks_bow': walks_bow_list, 'walks_skp': walks_skp_list,  'fingers_bow': fingers_bow_list}
    label_record_list.append(label_record)
    
    
w2v_output_df = pd.DataFrame(label_record_list    )

# for record in label_record_list:
#     label = record['label']
#     for row in label:
#         row[]

    
    #, 'walks_bow', 'walks_skp', 'fingers_bow'
    
#w2v_output_df.update({'sim_bow', sim_bow_dict_list})
#w2v_output_df
#pd.DataFrame(sim_bow_dict_list)
    
#     feature_names = vectorizer.get_feature_names()

# nearby_words_dict = {}

# num = 3
# for w in img_label_word_list:
#     ind = feature_names.index(w)
#     item_col = cor_coefficient_matrix [:,ind]
#     idx = np.argpartition(cor_coefficient_matrix , -num)[-num:]
#     indices = idx[np.argsort((-cor_coefficient_matrix )[idx])]
#     #nearby_words_list = map(feature_names.__getitem__, indices) 
#     nearby_words_list = list(itemgetter(*indices)(nearby_words_list))
#     nearby_words_dict[w] = nearby_words_list

  sim_bow_list = model_bow.most_similar(v, topn=4)
  walks_bow_list = model_bow.most_similar(positive = [*v, 'walks'], negative=['man'], topn=4)
  walks_skp_list = model_skp.most_similar(positive = [*v, 'walks'], negative=['man'], topn=4)
  fingers_bow_list = model_bow.most_similar(positive = [*v, 'nose'], negative=['man'], topn=4)


In [817]:
w2v_output_df

Unnamed: 0,label,category,sim_bow,walks_bow,walks_skp,fingers_bow
0,kit_fox,animal,"[(cbs, 0.72), (leno, 0.71), (nbc, 0.68), (ueck...","[(cbs, 0.63), (tonight, 0.63), (nbc, 0.61), (a...","[(airing, 0.58), (cbs, 0.57), (nightly, 0.56),...","[(hook, 0.63), (cherry, 0.6), (flashing, 0.6),..."
1,English_setter,animal,"[(pinscher, 0.87), (sheepdog, 0.87), (cur, 0.8...","[(knobs, 0.68), (cayenne, 0.68), (taps, 0.67),...","[(shakuhachi, 0.62), (tsk, 0.62), (coonhound, ...","[(fluke, 0.71), (paddles, 0.71), (opal, 0.71),..."
2,Siberian_husky,animal,"[(sheepdog, 0.89), (catahoula, 0.88), (collie,...","[(weasels, 0.71), (gulfs, 0.7), (rhodope, 0.69...","[(plunges, 0.65), (snowboarding, 0.64), (raft,...","[(opal, 0.73), (spiny, 0.72), (teflon, 0.72), ..."
3,Australian_terrier,animal,"[(spaniel, 0.92), (mastiff, 0.89), (bulldog, 0...","[(poplar, 0.7), (oyster, 0.64), (catahoula, 0....","[(foxhound, 0.61), (poplar, 0.6), (coonhound, ...","[(striped, 0.73), (hook, 0.73), (chili, 0.69),..."
4,English_springer,animal,"[(verlag, 0.96), (phys, 0.76), (pp, 0.75), (el...","[(verlag, 0.64), (alto, 0.56), (ontologies, 0....","[(verlag, 0.63), (prentice, 0.6), (mcgraw, 0.5...","[(transverse, 0.63), (cmp, 0.63), (alto, 0.62)..."
...,...,...,...,...,...,...
902,parallel_bars,object,"[(escalators, 0.7), (floors, 0.7), (vertically...","[(escalators, 0.71), (ramps, 0.68), (wires, 0....","[(aisles, 0.73), (escalators, 0.72), (intercha...","[(vertically, 0.74), (vertical, 0.73), (button..."
903,flagpole,object,"[(helsingborg, 0.74), (bluff, 0.71), (vestibul...","[(helsingborg, 0.67), (empties, 0.66), (balcon...","[(bleachers, 0.72), (talladega, 0.69), (stoke,...","[(fingerboard, 0.7), (sloped, 0.68), (hook, 0...."
904,coffee_mug,object,"[(steamed, 0.83), (tomato, 0.83), (cinnamon, 0...","[(cayenne, 0.79), (canned, 0.78), (pineapple, ...","[(doughnuts, 0.77), (burgers, 0.77), (servings...","[(powdered, 0.79), (soda, 0.79), (roasted, 0.7..."
905,rubber_eraser,object,"[(margarine, 0.84), (paste, 0.84), (plywood, 0...","[(cans, 0.81), (sticky, 0.78), (pots, 0.77), (...","[(shuffles, 0.8), (styrofoam, 0.79), (crepe, 0...","[(sticky, 0.82), (padded, 0.81), (paste, 0.8),..."


In [818]:
w2v_output_df.to_csv('w2v_analogies.csv')

In [788]:

"spotter's_wheel".replace("\'s",'')

'spotter_wheel'

In [591]:
'Dane' in model_bow

  'Dane' in model_bow


False

  'irish' in model_bow


True