In [1]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora and pretrained models.



In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of Gensim's features is simple and easy access to common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a
variety of corpora and pretrained models.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
This module leverages a local cache (in the user's home folder, by default) that
ensures data is downloaded at most once.
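
The "downloaded at most once" behaviour can be pictured with a small sketch. This is not the code of ``gensim.downloader``; ``fetch_once`` and ``fake_download`` are hypothetical names used only to illustrate the idea of checking a cache folder before fetching anything.

```python
import tempfile
from pathlib import Path


def fetch_once(name, cache_dir, download):
    """Return the cached file for `name`, calling `download(name)` only on a cache miss."""
    target = Path(cache_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(name))
    return target


calls = []


def fake_download(name):
    # Stand-in for a network request; records each call so we can count them.
    calls.append(name)
    return b"payload for " + name.encode()


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_once("text8", tmp, fake_download)
    second = fetch_once("text8", tmp, fake_download)  # served from the cache
    same_path = first == second
    cached_bytes = second.read_bytes()

print(len(calls))  # number of times the download actually ran
```

The second call finds the file already in place and skips the download, which is the property the cache is meant to provide.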

This tutorial:

* Downloads the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Let's start by importing the api module.




In [2]:
import gensim.downloader as api

Now, let's download the text8 corpus and load it as a Python object
that supports streamed access.




In [3]:
corpus = api.load("text8")

In this case, our corpus is an iterable.
If you look under the covers, it has the following definition:



In [4]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



For more details, look inside the file that defines the Dataset class for your particular resource.




In [5]:
print(inspect.getfile(corpus.__class__))

/Users/joshuamailman/gensim-data/text8/__init__.py


Now that the corpus has been downloaded and loaded, let's use it to train a word2vec model.
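
The cells that follow train two models: one with ``sg=1`` (skip-gram) and one with ``sg=0`` (CBOW). The sketch below only illustrates how the two objectives slice a sentence into training examples. It is a simplification under stated assumptions: the function names are hypothetical, the window is fixed, and real training also involves subsampling and other details that are left out here.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window):
    """(context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


sentence = ["the", "owl", "flies", "at", "night"]
sg = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)

print(sg[:3])
print(cbow[1])
```

Skip-gram produces one example per (center, neighbour) pair, while CBOW produces one example per position, which is one reason the skip-gram run tends to take longer on the same corpus.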




In [10]:
import gensim

In [11]:
from gensim.models.word2vec import Word2Vec
model_skp = Word2Vec(corpus, sg=1)

2021-03-25 12:07:55,014 : INFO : collecting all words and their counts
2021-03-25 12:07:55,017 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-25 12:07:59,886 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-25 12:07:59,886 : INFO : Loading a fresh vocabulary
2021-03-25 12:08:00,056 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-25 12:08:00,057 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-25 12:08:00,224 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-25 12:08:00,235 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-25 12:08:00,236 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-25 12:08:00,466 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-25 12:08:00,467 : 

2021-03-25 12:09:11,429 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-03-25 12:09:11,440 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-03-25 12:09:11,446 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-03-25 12:09:11,446 : INFO : EPOCH - 2 : training on 17005207 raw words (12507090 effective words) took 29.9s, 417931 effective words/s
2021-03-25 12:09:12,451 : INFO : EPOCH 3 - PROGRESS: at 2.70% examples, 339680 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:09:13,457 : INFO : EPOCH 3 - PROGRESS: at 5.88% examples, 364657 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:09:14,466 : INFO : EPOCH 3 - PROGRESS: at 9.17% examples, 377332 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:09:15,477 : INFO : EPOCH 3 - PROGRESS: at 12.46% examples, 384275 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:09:16,480 : INFO : EPOCH 3 - PROGRESS: at 15.87% examples, 392205 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:09:17,483 :

2021-03-25 12:10:16,705 : INFO : EPOCH 5 - PROGRESS: at 17.11% examples, 421036 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:17,714 : INFO : EPOCH 5 - PROGRESS: at 20.58% examples, 422531 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:18,732 : INFO : EPOCH 5 - PROGRESS: at 23.99% examples, 422768 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:19,755 : INFO : EPOCH 5 - PROGRESS: at 27.34% examples, 421306 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:20,781 : INFO : EPOCH 5 - PROGRESS: at 30.51% examples, 418094 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:21,791 : INFO : EPOCH 5 - PROGRESS: at 33.86% examples, 418288 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:22,800 : INFO : EPOCH 5 - PROGRESS: at 37.21% examples, 418387 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:23,818 : INFO : EPOCH 5 - PROGRESS: at 40.68% examples, 419005 words/s, in_qsize 5, out_qsize 0
2021-03-25 12:10:24,824 : INFO : EPOCH 5 - PROGRESS: at 43.97% examples, 418361 words/s, in_qsiz

In [7]:
model_bow = Word2Vec(corpus, sg=0)

2021-03-25 12:05:13,429 : INFO : collecting all words and their counts
2021-03-25 12:05:13,432 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-25 12:05:18,207 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-25 12:05:18,207 : INFO : Loading a fresh vocabulary
2021-03-25 12:05:18,427 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-25 12:05:18,428 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-25 12:05:18,565 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-25 12:05:18,572 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-25 12:05:18,573 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-25 12:05:18,764 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-25 12:05:18,764 : 

In [12]:
model_skp.wv.save_word2vec_format('data/model_skp', binary=False)
model_bow.wv.save_word2vec_format('data/model_bow', binary=False)
# save_word2vec_format exports only the vectors, as plain text. Word2Vec.load
# reads Gensim's native format, so also save the full models with save().
model_skp.save('data/model_skp.bin')
model_bow.save('data/model_bow.bin')

2021-03-25 12:10:41,500 : INFO : storing 71290x100 projection weights into data/model_skp
2021-03-25 12:10:45,096 : INFO : storing 71290x100 projection weights into data/model_bow


In [13]:
model_skp = gensim.models.Word2Vec.load('data/model_skp.bin')
model_bow = gensim.models.Word2Vec.load('data/model_bow.bin')

2021-03-25 12:10:48,666 : INFO : loading Word2Vec object from data/model_skp.bin
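
``Word2Vec.load`` reads Gensim's native, pickle-based files, whereas ``save_word2vec_format`` writes the word2vec exchange format, which in text mode starts with a ``<vocabulary size> <dimensions>`` header line. Handing such a file to a pickle-based loader fails on the very first byte. The sketch below reproduces that with the standard library alone; the words and vectors are made up for illustration.

```python
import pickle

# A toy file in the word2vec text layout: a "<vocab size> <dimensions>" header,
# then one "<word> <v1> <v2>" line per word. All values here are made up.
words = ["owl", "rug", "tree", "goose", "train", "nose", "man"]
lines = ["%d %d" % (len(words), 2)]
lines += ["%s %.1f %.1f" % (w, i, i + 0.5) for i, w in enumerate(words)]
text_format = ("\n".join(lines) + "\n").encode("utf-8")

try:
    pickle.loads(text_format)
    outcome = "loaded"
except pickle.UnpicklingError as err:
    # The header digit is not a valid pickle opcode, so loading stops immediately.
    outcome = "UnpicklingError: %s" % err

print(outcome)

# A real pickle round-trips without trouble.
restored = pickle.loads(pickle.dumps({"owl": [0.0, 0.5]}))
```

In practice this means pairing ``save()`` with ``load()``, and ``save_word2vec_format()`` with ``KeyedVectors.load_word2vec_format()``.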

Now that we have our word2vec models, let's find words that are similar to 'rug'.




In [None]:
word = 'rug'
words = ['goose', 'brant', 'honking', 'greylag']

In [None]:
print(model_skp.wv.most_similar(word))

In [None]:
print(model_bow.wv.most_similar(word))
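
``most_similar`` ranks the vocabulary by cosine similarity to the query word's vector. A minimal sketch of that ranking, using made-up three-dimensional vectors rather than anything learned from text8:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for this example only.
toy_vectors = {
    "rug": [0.9, 0.1, 0.0],
    "carpet": [0.8, 0.2, 0.1],
    "mat": [0.7, 0.3, 0.0],
    "goose": [0.0, 0.2, 0.9],
}
neighbours = most_similar(toy_vectors, "rug", topn=2)
print(neighbours)
```

Because only the direction of the vectors matters, two words can be close neighbours even when their vectors differ a lot in length.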

In [None]:
#print(model_skp.wv.most_similar(words))

In [None]:
#print(model_bow.wv.most_similar(words))

You can use the API to download several different corpora and pretrained models; the cells
further down list all resources available in gensim-data.
First, let's run a few more queries against the two models trained above:




In [None]:

model_skp.predict_output_word([word], topn=4)

In [None]:
model_bow.predict_output_word([word], topn=4)

In [None]:
#pos='dog'
#pos_list = ['academic', 'gown']
pos_list = ['owl']

In [None]:
# Query the word vectors through .wv; calling most_similar directly on the
# model is deprecated in Gensim 3.x and removed in 4.0.
model_skp.wv.most_similar(pos_list, topn=4)


In [None]:
model_bow.wv.most_similar(pos_list, topn=4)

In [None]:

model_skp.wv.most_similar(positive=[*pos_list, 'walks'], negative=['man'], topn=4)

In [None]:
model_bow.wv.most_similar(positive=[*pos_list, 'walks'], negative=['man'], topn=4)

In [None]:
model_bow.wv.most_similar(positive=[*pos_list, 'fingers'], negative=['man'], topn=5)

In [None]:
model_skp.wv.most_similar(positive=[*pos_list, 'nose'], negative=['man'], topn=5)

In [None]:
model_skp.wv.most_similar(positive=[*pos_list, 'speaks'], negative=['man'], topn=5)

In [None]:
model_bow.wv.most_similar(positive=[*pos_list, 'speaks'], negative=['man'], topn=5)

In [None]:
model_skp.wv.n_similarity(['train'], ['epicycle', 'southbound'])
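
The ``positive``/``negative`` queries above ask an analogy question: "man is to walks as owl is to ...?". Roughly, the positive vectors are added, the negative ones subtracted, and the result is compared with every other word by cosine similarity. The sketch below follows that recipe on invented vectors; it is an approximation for illustration, and Gensim's implementation may differ in its details.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Add unit vectors of `positive`, subtract those of `negative`, rank the rest by cosine."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    query = unit(query)
    used = set(positive) | set(negative)  # query words are never returned as answers
    scored = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in used
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors with invented axes (human, bird, motion); not learned from any corpus.
toy = {
    "man": [1, 0, 0],
    "owl": [0, 1, 0],
    "walks": [1, 0, 1],
    "flies": [0, 1, 1],
    "talks": [1, 0, 0.2],
    "rug": [0.2, 0.1, 0.0],
}
best = analogy(toy, positive=["owl", "walks"], negative=["man"])
print(best)
```

On real vectors the answer is rarely this clean, which is why the cells above compare the skip-gram and CBOW models on the same query.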

In [None]:
#model_skp.predict_output_word(words, topn=4)

In [None]:
#model_bow.predict_output_word(words, topn=4)

In [None]:
#model.wv.most_similar(word, topn=4)

In [None]:
import json
info = api.info()
print(json.dumps(info, indent=4))

There are two types of data resources: corpora and models.



In [None]:
print(info.keys())

Let's have a look at the available corpora:



In [None]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

... and the same for models:



In [None]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

If you want to get detailed information about a model/corpus, use:




In [None]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

Sometimes, you do not want to load a model into memory. Instead, you can request
just the filesystem path to the model. For that, use:




In [None]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

If you want to load the model to memory, then:




In [None]:
model = api.load("glove-wiki-gigaword-50")


In [None]:
model.most_similar("pretzel")

In [None]:
model.most_similar('fly')

Corpora, by contrast, are never loaded into memory in full. Every corpus is an iterable wrapped in
a special class ``Dataset``, with an ``__iter__`` method.
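
Wrapping the data in a class with ``__iter__`` matters because ``Word2Vec`` walks over the corpus several times (once to build the vocabulary, then once per epoch), and a plain generator would be exhausted after the first pass. A minimal sketch of such a restartable, streaming iterable, with a hypothetical ``LineCorpus`` class and a throwaway file:

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable: each pass re-opens the file and yields one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_corpus.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("the owl flies\nthe goose honks\n")

    toy_corpus = LineCorpus(path)
    first_pass = list(toy_corpus)
    second_pass = list(toy_corpus)  # a second pass works because __iter__ starts over

print(first_pass)
```

Only one line is held in memory at a time, so the same pattern scales to files far larger than RAM.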




In [None]:
"shell's".strip("'s'")

In [None]:
"shell's".replace("\'s",'')

In [15]:
import pickle
import numpy as np
import pandas as pd

#w2v_output_df = pd.DataFrame(columns = ['label', 'sim_bow', 'walks_bow', 'walks_skp', 'fingers_bow'])

# with open("data/" + "img_label_word_list.pickle", 'rb') as to_read:  
#     img_label_word_list  =  pickle.load(to_read)  

import csv
img_label_list = []

# Index ranges (in the original numbering of the labels file) treated as animals.
animal_ranges = (range(0, 218), range(223, 226), range(383, 502), range(601, 658))

def get_label_category(idx_):
    if any(idx_ in r for r in animal_ranges):
        return 'animal'
    elif idx_ in range(229, 296):
        return 'moving_object'
    else:
        return 'object'
    
    
label_word_cat_dict = {}

with open('data/image_net_labels_with_original_index_numbers.txt', newline='') as labels:
    label_reader = csv.reader(labels, delimiter=' ')
    for label_row in label_reader:
        # Column 1 holds the original index number, column 2 the label name.
        img_label_list.append(label_row[2])
        id_num = int(label_row[1])
        label_word_cat_dict[label_row[2]] = get_label_category(id_num)
        
#img_label_list   
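
The loop above relies on every row splitting into at least three space-separated fields, with the index in the second field and the label in the third. The sketch below shows that parsing step on an in-memory sample. The rows are invented, and the three-column layout is an assumption inferred from the code, not taken from the actual file.

```python
import csv
import io

# Invented sample rows in the assumed "<id> <index> <label>" layout.
sample = io.StringIO(
    "id_a 5 first_label\n"
    "id_b 250 second_label\n"
)

label_to_index = {}
for row in csv.reader(sample, delimiter=" "):
    # row[1] is the numeric index, row[2] the label name.
    label_to_index[row[2]] = int(row[1])

print(label_to_index)
```

Labels in this layout cannot contain spaces, which is presumably why multi-word labels such as ``kit_fox`` use underscores.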

In [16]:
499 in set(range(0, 188)).union( range(321, 408)).union( range(497,544))

True

In [17]:
get_label_category(int('218'))

'object'

In [18]:
label_word_cat_dict

{'kit_fox': 'animal',
 'English_setter': 'animal',
 'Siberian_husky': 'animal',
 'Australian_terrier': 'animal',
 'English_springer': 'animal',
 'grey_whale': 'animal',
 'lesser_panda': 'animal',
 'Egyptian_cat': 'animal',
 'ibex': 'animal',
 'Persian_cat': 'animal',
 'cougar': 'animal',
 'gazelle': 'animal',
 'porcupine': 'animal',
 'sea_lion': 'animal',
 'malamute': 'animal',
 'badger': 'animal',
 'Great_Dane': 'animal',
 'Walker_hound': 'animal',
 'Welsh_springer_spaniel': 'animal',
 'whippet': 'animal',
 'Scottish_deerhound': 'animal',
 'killer_whale': 'animal',
 'mink': 'animal',
 'African_elephant': 'animal',
 'Weimaraner': 'animal',
 'soft-coated_wheaten_terrier': 'animal',
 'Dandie_Dinmont': 'animal',
 'red_wolf': 'animal',
 'Old_English_sheepdog': 'animal',
 'jaguar': 'animal',
 'otterhound': 'animal',
 'bloodhound': 'animal',
 'Airedale': 'animal',
 'hyena': 'animal',
 'meerkat': 'animal',
 'giant_schnauzer': 'animal',
 'titi': 'animal',
 'three-toed_sloth': 'animal',
 'sorre

In [19]:
"shell's".replace("'s",'')

'shell'
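
The two experiments on ``"shell's"`` behave differently because ``str.strip`` treats its argument as a set of characters to trim from both ends, while ``str.replace`` removes an exact substring. A quick side-by-side check:

```python
label = "shell's"

# strip("'s'") trims any leading or trailing "'" and "s" characters,
# so the leading "s" of "shell" is lost as well.
stripped = label.strip("'s'")

# replace("'s", "") removes only the exact two-character sequence "'s".
replaced = label.replace("'s", "")

print(stripped)  # hell
print(replaced)  # shell
```

That is why the label-cleaning code in this notebook uses ``replace``.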

In [20]:
get_label_category(499)
#set(range(0, 188)).union( range(321, 408)).union( range(497,544))

'animal'

In [21]:
label_word_dict = {}

for label in img_label_list:
    # Reset for every label so that a miss never reuses the previous label's words.
    label_words = []
    if label_word_cat_dict[label] == 'animal':
        # For animals, keep only the head noun (the last token of the label).
        label_w = label.lower().split('_')[-1]
        if label_w in model_bow.wv:
            label_words = [label_w]
    else:
        # For everything else, keep every token found in the model's vocabulary.
        tokens = [w.replace("'s", '') for w in label.lower().split('_')]
        label_words = [w for w in tokens if w in model_bow.wv]
    if label_words:
        label_word_dict[label] = label_words

label_word_dict


{'kit_fox': ['fox'],
 'English_setter': ['setter'],
 'Siberian_husky': ['husky'],
 'Australian_terrier': ['terrier'],
 'English_springer': ['springer'],
 'grey_whale': ['whale'],
 'lesser_panda': ['panda'],
 'Egyptian_cat': ['cat'],
 'ibex': ['ibex'],
 'Persian_cat': ['cat'],
 'cougar': ['cougar'],
 'gazelle': ['gazelle'],
 'porcupine': ['porcupine'],
 'sea_lion': ['lion'],
 'badger': ['badger'],
 'Great_Dane': ['dane'],
 'Walker_hound': ['hound'],
 'Welsh_springer_spaniel': ['spaniel'],
 'killer_whale': ['whale'],
 'mink': ['mink'],
 'African_elephant': ['elephant'],
 'soft-coated_wheaten_terrier': ['terrier'],
 'red_wolf': ['wolf'],
 'Old_English_sheepdog': ['sheepdog'],
 'jaguar': ['jaguar'],
 'bloodhound': ['bloodhound'],
 'hyena': ['hyena'],
 'giant_schnauzer':

In [None]:




        
label_record_list = []
for k, v in label_word_dict.items():
    # Nearest neighbours of the label's words (CBOW model), scores rounded to 2 places.
    sim_bow_list = model_bow.wv.most_similar(v, topn=4)
    sim_bow_list = [(w, round(num, 2)) for w, num in sim_bow_list]

    # Analogy queries: label + 'walks' - 'man', for both models.
    walks_bow_list = model_bow.wv.most_similar(positive=[*v, 'walks'], negative=['man'], topn=4)
    walks_bow_list = [(w, round(num, 2)) for w, num in walks_bow_list]

    walks_skp_list = model_skp.wv.most_similar(positive=[*v, 'walks'], negative=['man'], topn=4)
    walks_skp_list = [(w, round(num, 2)) for w, num in walks_skp_list]

    # Analogy query: label + 'fingers' - 'man' (CBOW model only).
    fingers_bow_list = model_bow.wv.most_similar(positive=[*v, 'fingers'], negative=['man'], topn=4)
    fingers_bow_list = [(w, round(num, 2)) for w, num in fingers_bow_list]

    category = label_word_cat_dict[k]

    label_record = {'label': k, 'category': category, 'sim_bow': sim_bow_list,
                    'walks_bow': walks_bow_list, 'walks_skp': walks_skp_list,
                    'fingers_bow': fingers_bow_list}
    label_record_list.append(label_record)

w2v_output_df = pd.DataFrame(label_record_list)

# for record in label_record_list:
#     label = record['label']
#     for row in label:
#         row[]

    
    #, 'walks_bow', 'walks_skp', 'fingers_bow'
    
#w2v_output_df.update({'sim_bow', sim_bow_dict_list})
#w2v_output_df
#pd.DataFrame(sim_bow_dict_list)
    
#     feature_names = vectorizer.get_feature_names()

# nearby_words_dict = {}

# num = 3
# for w in img_label_word_list:
#     ind = feature_names.index(w)
#     item_col = cor_coefficient_matrix [:,ind]
#     idx = np.argpartition(cor_coefficient_matrix , -num)[-num:]
#     indices = idx[np.argsort((-cor_coefficient_matrix )[idx])]
#     #nearby_words_list = map(feature_names.__getitem__, indices) 
#     nearby_words_list = list(itemgetter(*indices)(nearby_words_list))
#     nearby_words_dict[w] = nearby_words_list

In [None]:
w2v_output_df

In [22]:
with open("data/" + "label_word_cat_dict.pickle", 'wb') as to_write:
    pickle.dump(label_word_cat_dict, to_write)
      

In [818]:
w2v_output_df.to_csv('w2v_analogies.csv')

In [788]:

"spotter's_wheel".replace("\'s",'')

'spotter_wheel'

In [591]:
'Dane' in model_bow.wv


False

  'irish' in model_bow


True