In [1]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora and pretrained models.



In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of Gensim's features is simple and easy access to common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a
variety of corpora and pretrained models.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
This module leverages a local cache (in the user's home folder, by default) that
ensures data is downloaded at most once.
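
The download-at-most-once behaviour can be pictured with a small sketch. This is not the actual ``gensim.downloader`` implementation; ``cached_fetch`` and ``fake_download`` are hypothetical names, and the "download" is simulated so that the example runs offline.

```python
import os
import tempfile


def cached_fetch(name, cache_dir, fetch):
    """Return the cached path for `name`, calling `fetch` only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # Cache miss: obtain the bytes once and store them locally.
        with open(path, "wb") as handle:
            handle.write(fetch(name))
    return path


download_calls = []


def fake_download(name):
    """Hypothetical stand-in for a network download; records each call."""
    download_calls.append(name)
    return b"placeholder bytes for " + name.encode("utf-8")


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_fetch("text8", cache_dir, fake_download)
    second = cached_fetch("text8", cache_dir, fake_download)
    same_path = first == second
```

The second call finds the file already on disk, so ``fake_download`` runs only once.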

This tutorial:

* Downloads the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Let's start by importing the api module.




In [2]:
import gensim.downloader as api

Now, let's download the text8 corpus and load it as a Python object
that supports streamed access.




In [706]:
corpus = api.load("text8")

In this case, our corpus is an iterable.
If you look under the covers, it has the following definition:



In [4]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc
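
The reason ``__iter__`` builds a fresh reader on every call is that training makes several passes over the data. As a rough sketch of the same pattern, assuming a plain text file with one document per line (``StreamedCorpus`` is a hypothetical class, not part of Gensim):

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical Dataset-like wrapper: yields one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every iteration, so the corpus can be
        # traversed repeatedly without ever being held in memory as a whole.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    demo_path = tmp.name

demo_corpus = StreamedCorpus(demo_path)
first_pass = list(demo_corpus)
second_pass = list(demo_corpus)  # a second full pass works as well
os.remove(demo_path)
```

A bare generator would be exhausted after the first pass; wrapping it in a class with ``__iter__`` avoids that.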



For more details, look inside the file that defines the Dataset class for your particular resource.




In [7]:
print(inspect.getfile(corpus.__class__))

/Users/joshuamailman/gensim-data/text8/__init__.py


Now that the corpus has been downloaded and loaded, let's use it to train a word2vec model.




In [18]:
from gensim.models.word2vec import Word2Vec
# sg=1 selects the skip-gram training algorithm
model_skp = Word2Vec(corpus, sg=1)

2021-03-20 01:06:36,381 : INFO : collecting all words and their counts
2021-03-20 01:06:36,384 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-20 01:06:40,194 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-20 01:06:40,194 : INFO : Loading a fresh vocabulary
2021-03-20 01:06:40,405 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-20 01:06:40,406 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-20 01:06:40,544 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-20 01:06:40,550 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-20 01:06:40,550 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-20 01:06:40,726 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-20 01:06:40,726 : 

2021-03-20 01:07:48,693 : INFO : EPOCH 3 - PROGRESS: at 3.94% examples, 477027 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:49,716 : INFO : EPOCH 3 - PROGRESS: at 8.05% examples, 486145 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:50,717 : INFO : EPOCH 3 - PROGRESS: at 12.05% examples, 489325 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:51,719 : INFO : EPOCH 3 - PROGRESS: at 15.93% examples, 487789 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:52,725 : INFO : EPOCH 3 - PROGRESS: at 19.93% examples, 489500 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:53,737 : INFO : EPOCH 3 - PROGRESS: at 23.87% examples, 489974 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:54,760 : INFO : EPOCH 3 - PROGRESS: at 27.81% examples, 489252 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:55,762 : INFO : EPOCH 3 - PROGRESS: at 31.69% examples, 489417 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:07:56,764 : INFO : EPOCH 3 - PROGRESS: at 35.45% examples, 487904 words/s, in_qsize 

2021-03-20 01:08:56,583 : INFO : EPOCH 5 - PROGRESS: at 33.98% examples, 420438 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:08:57,604 : INFO : EPOCH 5 - PROGRESS: at 37.39% examples, 420485 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:08:58,619 : INFO : EPOCH 5 - PROGRESS: at 40.92% examples, 421691 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:08:59,634 : INFO : EPOCH 5 - PROGRESS: at 44.39% examples, 422268 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:09:00,678 : INFO : EPOCH 5 - PROGRESS: at 47.91% examples, 422514 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:09:01,709 : INFO : EPOCH 5 - PROGRESS: at 51.44% examples, 422973 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:09:02,712 : INFO : EPOCH 5 - PROGRESS: at 54.91% examples, 423742 words/s, in_qsize 4, out_qsize 0
2021-03-20 01:09:03,729 : INFO : EPOCH 5 - PROGRESS: at 58.32% examples, 423550 words/s, in_qsize 5, out_qsize 0
2021-03-20 01:09:04,729 : INFO : EPOCH 5 - PROGRESS: at 61.73% examples, 423796 words/s, in_qsiz

In [19]:
# sg=0 selects CBOW (continuous bag of words), which is also the default
model_bow = Word2Vec(corpus, sg=0)

2021-03-20 01:09:15,871 : INFO : collecting all words and their counts
2021-03-20 01:09:15,873 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-20 01:09:20,092 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-20 01:09:20,092 : INFO : Loading a fresh vocabulary
2021-03-20 01:09:20,301 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-20 01:09:20,302 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-20 01:09:20,444 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-20 01:09:20,450 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-20 01:09:20,451 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-20 01:09:20,648 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-20 01:09:20,649 : 

Now that we have our word2vec models, let's find words that are similar to 'rug'.




In [281]:
word = 'rug'
words = ['goose', 'brant', 'honking', 'greylag']

In [282]:
print(model_skp.wv.most_similar(word))

[('thurs', 0.9178668260574341), ('dripped', 0.9154368042945862), ('nape', 0.9138864278793335), ('shuloch', 0.9128257036209106), ('buries', 0.9122443199157715), ('gagged', 0.9122151732444763), ('calmness', 0.9110994338989258), ('wrestles', 0.9105159044265747), ('latrine', 0.9051564335823059), ('defiled', 0.9051159620285034)]


In [283]:
print(model_bow.wv.most_similar(word))

[('butts', 0.7981554269790649), ('crabbe', 0.7922760248184204), ('rhadamanthys', 0.7860302329063416), ('stomping', 0.7854064106941223), ('locust', 0.7842223644256592), ('barnacles', 0.7840744256973267), ('choking', 0.7827549576759338), ('aurelia', 0.7816535234451294), ('grizzled', 0.780861496925354), ('pinus', 0.780792236328125)]
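
The scores returned by ``most_similar`` are cosine similarities between word vectors. The ranking can be illustrated with a minimal pure-Python sketch; the two-dimensional ``toy_vectors`` below are made up for illustration and do not come from the trained models, and ``toy_most_similar`` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
toy_vectors = {
    "rug": [1.0, 0.0],
    "carpet": [0.8, 0.6],
    "engine": [0.0, 1.0],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = toy_most_similar("rug", toy_vectors)
```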


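You can use the API to download several different corpora and pretrained models;
listing the resources available in gensim-data is covered further below.
First, here are a few more queries against the two models trained above, starting with
``predict_output_word``, which reports the most probable center words given the context words: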




In [286]:

model_skp.predict_output_word([word], topn=4)

[('her', 0.00011968485),
 ('she', 9.788734e-05),
 ('called', 8.559184e-05),
 ('who', 8.2942905e-05)]

In [287]:
model_bow.predict_output_word([word], topn=4)

[('al', 2.2131764e-05),
 ('who', 2.1848786e-05),
 ('called', 2.1380702e-05),
 ('john', 2.0971931e-05)]

In [809]:
pos = 'zebra'


In [810]:
model_skp.wv.most_similar([pos])

[('quagga', 0.8735522031784058),
 ('rhinoceros', 0.863740086555481),
 ('przewalski', 0.851159930229187),
 ('equus', 0.8478255271911621),
 ('humpback', 0.8394851684570312),
 ('saimiri', 0.8394293785095215),
 ('grasshopper', 0.8376014232635498),
 ('hyena', 0.8344372510910034),
 ('danio', 0.8339352607727051),
 ('spectacled', 0.8324639201164246)]

In [811]:
model_bow.wv.most_similar([pos])

[('equus', 0.8659811019897461),
 ('amaranth', 0.8481062650680542),
 ('toothed', 0.844868540763855),
 ('humpback', 0.8387032747268677),
 ('danio', 0.8349896669387817),
 ('beaked', 0.8306174874305725),
 ('catfish', 0.8293459415435791),
 ('ursus', 0.824088990688324),
 ('leopard', 0.8222962617874146),
 ('pigeon', 0.8187264204025269)]

In [812]:

model_skp.wv.most_similar(positive=[pos, 'walks'], negative=['man'], topn=20)

[('bushes', 0.6928241848945618),
 ('palms', 0.6802634596824646),
 ('otters', 0.6726007461547852),
 ('antelope', 0.6682460904121399),
 ('shrubs', 0.6664403676986694),
 ('pelicans', 0.6552487015724182),
 ('gulls', 0.6551059484481812),
 ('grasshoppers', 0.6497182250022888),
 ('bumblebee', 0.6476486921310425),
 ('herons', 0.6462903618812561),
 ('moths', 0.6440606713294983),
 ('orchids', 0.643992006778717),
 ('kingfishers', 0.6433441638946533),
 ('odontoceti', 0.64076167345047),
 ('badgers', 0.6404194831848145),
 ('wasps', 0.6402060985565186),
 ('hyenas', 0.6384462714195251),
 ('lizards', 0.6375194787979126),
 ('deciduous', 0.636717677116394),
 ('daisies', 0.6310447454452515)]
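
Queries with ``positive`` and ``negative`` words combine vectors before ranking: positive words are added, negative words are subtracted, and the remaining vocabulary is ranked by cosine similarity to the result, with the input words excluded. The following is a simplified sketch of that idea rather than Gensim's actual code; the three-dimensional ``toy_vectors`` are invented and ``analogy`` is a hypothetical helper.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented axes: (royal, male, female). For illustration only.
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

best = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
```

On real embeddings the top results are often noisier than in this toy setup, as the outputs in this section show.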

In [813]:
model_bow.wv.most_similar(positive=[pos, 'walks'], negative=['man'], topn=5)

[('poplar', 0.7040863037109375),
 ('bedrock', 0.6735628843307495),
 ('toothed', 0.6691359281539917),
 ('hiking', 0.6683255434036255),
 ('lizards', 0.6669633388519287)]

In [814]:
model_bow.wv.most_similar(positive=[pos, 'fingers'], negative=['man'], topn=5)

[('odontoceti', 0.6995912790298462),
 ('cones', 0.6942052841186523),
 ('striped', 0.687303900718689),
 ('pea', 0.6815760135650635),
 ('grooves', 0.6767802238464355)]

In [815]:
model_skp.wv.most_similar(positive=[pos, 'says'], negative=['man'], topn=5)

[('subfamilia', 0.6191520094871521),
 ('bovidae', 0.6139534711837769),
 ('carassius', 0.6130903959274292),
 ('brachydanio', 0.6126963496208191),
 ('proteles', 0.6119928956031799)]

In [816]:
model_bow.wv.most_similar(positive=[pos, 'says'], negative=['man'], topn=5)

[('conjectured', 0.5926094055175781),
 ('dagestani', 0.5866895914077759),
 ('manatee', 0.5680640339851379),
 ('surmised', 0.5677024126052856),
 ('manatus', 0.5585182905197144)]

In [817]:
model_skp.wv.n_similarity(['train'], ['epicycle', 'southbound'])

0.5088859

In [122]:
import json
info = api.info()
print(json.dumps(info, indent=4))

{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

There are two types of data resources: corpora and models.



In [None]:
print(info.keys())

Let's have a look at the available corpora:



In [None]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

... and the same for models:



In [None]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )
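
The same dictionary can also be filtered, for example to find resources small enough for a quick download. The sketch below runs on a hand-written ``sample_info`` that imitates the shape of ``api.info()``; only the SemEval entry's ``file_size`` is taken from the output above, while ``toy-corpus``, ``toy-model`` and their sizes are hypothetical placeholders, as is the helper ``resources_under``.

```python
# Hand-written sample imitating the structure of api.info().
sample_info = {
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "file_size": 6344358,
        },
        # Hypothetical entry with a placeholder size.
        "toy-corpus": {"num_records": 10, "file_size": 50_000_000},
    },
    "models": {
        # Hypothetical entry with a placeholder size.
        "toy-model": {"num_records": 100, "file_size": 1_000_000},
    },
}


def resources_under(info, max_bytes):
    """Return (kind, name, size) for every resource no larger than max_bytes."""
    rows = []
    for kind in ("corpora", "models"):
        for name, meta in sorted(info[kind].items()):
            size = meta.get("file_size")
            if size is not None and size <= max_bytes:
                rows.append((kind, name, size))
    return rows


small = resources_under(sample_info, 10_000_000)
```

With the real ``info`` object from ``api.info()``, the same function could be called as ``resources_under(info, 10_000_000)``, assuming each entry carries a ``file_size`` field.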

If you want to get detailed information about a model/corpus, use:




In [None]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

Sometimes, you do not want to load a model into memory. Instead, you can request
just the filesystem path to the model. For that, use:




In [None]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

If you want to load the model to memory, then:




In [151]:
model = api.load("glove-wiki-gigaword-50")


2021-03-20 01:29:26,405 : INFO : loading projection weights from /Users/joshuamailman/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2021-03-20 01:29:36,071 : INFO : loaded (400000, 50) matrix from /Users/joshuamailman/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2021-03-20 01:29:36,111 : INFO : precomputing L2-norms of word weight vectors


[('cobra', 0.708692729473114),
 ('gts', 0.7059628963470459),
 ('gaboon', 0.6975671052932739),
 ('gts-r', 0.67454993724823),
 ('longhorn', 0.6680278778076172),
 ('panther', 0.666718065738678),
 ('cc', 0.6618682742118835),
 ('ah-1z', 0.6288684606552124),
 ('scorpions', 0.6254101991653442),
 ('mustang', 0.6176854968070984)]

In [161]:
model.most_similar("pretzel")

[('pretzels', 0.6768024563789368),
 ('waffle', 0.668554425239563),
 ('itarsi', 0.6505174040794373),
 ('popsicle', 0.6484500765800476),
 ('screwdriver', 0.6474739909172058),
 ('lug', 0.6465184688568115),
 ('nut', 0.645367443561554),
 ('keg', 0.6438860893249512),
 ('snickers', 0.6391291618347168),
 ('swizzle', 0.6360814571380615)]

In [131]:
model.most_similar('fly')

[('flying', 0.8130614757537842),
 ('flies', 0.7638437151908875),
 ('sail', 0.760538637638092),
 ('cruise', 0.7593040466308594),
 ('landing', 0.7464283108711243),
 ('flights', 0.7390480041503906),
 ('catch', 0.7383242249488831),
 ('bound', 0.7368704676628113),
 ('flight', 0.7362315654754639),
 ('planes', 0.7273046970367432)]

For corpora, the data is never loaded into memory: every corpus is an iterable wrapped in
a special ``Dataset`` class with an ``__iter__`` method.
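
Because a corpus is streamed, a few documents can be inspected without reading the rest. A minimal sketch using ``itertools.islice``, where ``token_stream`` is a hypothetical stand-in for a streamed corpus:

```python
from itertools import islice


def token_stream():
    """Hypothetical stand-in for a streamed corpus: yields token lists lazily."""
    for index in range(1_000_000):
        yield ["doc", str(index)]


# Only the first two documents are ever produced; the rest are never generated.
preview = list(islice(token_stream(), 2))
```

The same call should work on any iterable corpus, for example ``islice(corpus, 2)`` for the ``text8`` corpus loaded earlier.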


