In [1]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora and pretrained models.



In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of Gensim's features is simple and easy access to common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a
variety of corpora and pretrained models.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
This module uses a local cache (in the user's home folder, by default) that
ensures data is downloaded at most once.
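
The download-at-most-once behaviour can be pictured with a small stand-alone sketch.
It is not gensim's actual implementation: ``fetch_once``, ``fake_download`` and the
temporary cache directory are hypothetical names used only for illustration, and the
"download" is a stand-in function rather than a network request.

```python
import os
import tempfile


def fetch_once(name, download, cache_dir):
    """Return the cached path for `name`, calling `download` only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # Cache miss: fetch the payload and store it for later calls.
        with open(path, "wb") as fout:
            fout.write(download(name))
    return path


calls = []


def fake_download(name):
    # Stand-in for a network request; records each invocation.
    calls.append(name)
    return b"payload for " + name.encode("utf8")


cache_dir = tempfile.mkdtemp()
first = fetch_once("text8", fake_download, cache_dir)
second = fetch_once("text8", fake_download, cache_dir)
print(first == second, len(calls))  # True 1
```

The second call finds the file already on disk, so the stand-in downloader runs only once.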

This tutorial:

* Downloads the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Let's start by importing the api module.




In [3]:
import gensim.downloader as api

Now, let's download the text8 corpus and load it as a Python object
that supports streamed access.




In [4]:
corpus = api.load('text8')

2021-03-19 00:00:26,643 : INFO : Creating /Users/joshuamailman/gensim-data




2021-03-19 00:00:41,083 : INFO : text8 downloaded


In this case, our corpus is an iterable.
If you look under the covers, it has the following definition:



In [5]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



For more details, look inside the file that defines the Dataset class for your particular resource.




In [6]:
print(inspect.getfile(corpus.__class__))

/Users/joshuamailman/gensim-data/text8/__init__.py
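
The ``Dataset`` wrapper defines ``__iter__`` instead of being a plain generator, which
means each loop over it starts a fresh pass. That matters for training code that needs
to read the corpus more than once. The sketch below contrasts the two using toy data;
``RestartableCorpus`` and ``one_shot`` are hypothetical names, not part of gensim.

```python
class RestartableCorpus:
    """Toy stand-in for the Dataset wrapper: every iteration starts a fresh pass."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc


def one_shot(docs):
    # A plain generator is exhausted after a single pass.
    for doc in docs:
        yield doc


toy_docs = [["anarchism", "originated"], ["as", "a", "term"]]

restartable = RestartableCorpus(toy_docs)
passes = [len(list(restartable)) for _ in range(2)]

gen = one_shot(toy_docs)
gen_passes = [len(list(gen)) for _ in range(2)]

print(passes, gen_passes)  # [2, 2] [2, 0]
```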


Now that the corpus has been downloaded and loaded, let's use it to train a word2vec model.




In [7]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2021-03-19 00:02:48,447 : INFO : collecting all words and their counts
2021-03-19 00:02:48,458 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-19 00:03:11,880 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2021-03-19 00:03:11,887 : INFO : Loading a fresh vocabulary
2021-03-19 00:03:13,035 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2021-03-19 00:03:13,048 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2021-03-19 00:03:13,580 : INFO : deleting the raw counts dictionary of 253854 items
2021-03-19 00:03:13,598 : INFO : sample=0.001 downsamples 38 most-common words
2021-03-19 00:03:13,600 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2021-03-19 00:03:14,225 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2021-03-19 00:03:14,230 : 

2021-03-19 00:04:59,145 : INFO : EPOCH 2 - PROGRESS: at 79.89% examples, 363390 words/s, in_qsize 5, out_qsize 0
2021-03-19 00:05:00,170 : INFO : EPOCH 2 - PROGRESS: at 83.07% examples, 364163 words/s, in_qsize 4, out_qsize 1
2021-03-19 00:05:01,172 : INFO : EPOCH 2 - PROGRESS: at 86.13% examples, 364789 words/s, in_qsize 5, out_qsize 0
2021-03-19 00:05:02,179 : INFO : EPOCH 2 - PROGRESS: at 89.30% examples, 365789 words/s, in_qsize 5, out_qsize 0
2021-03-19 00:05:03,183 : INFO : EPOCH 2 - PROGRESS: at 91.71% examples, 363654 words/s, in_qsize 6, out_qsize 0
2021-03-19 00:05:04,221 : INFO : EPOCH 2 - PROGRESS: at 94.53% examples, 362817 words/s, in_qsize 5, out_qsize 0
2021-03-19 00:05:05,222 : INFO : EPOCH 2 - PROGRESS: at 97.18% examples, 361803 words/s, in_qsize 5, out_qsize 0
2021-03-19 00:05:05,943 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-03-19 00:05:05,947 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-03-19 00:05:05,949 : I

Now that we have our word2vec model, let's find words that are similar to 'tree'.




In [8]:
print(model.wv.most_similar('tree'))

2021-03-19 00:06:02,681 : INFO : precomputing L2-norms of word weight vectors


[('leaf', 0.7059791088104248), ('trees', 0.6996309161186218), ('bark', 0.6464654803276062), ('flower', 0.6286193132400513), ('bird', 0.6203575730323792), ('fruit', 0.6173238754272461), ('avl', 0.6114338040351868), ('cactus', 0.5901271104812622), ('pond', 0.5722821950912476), ('leaves', 0.5659260153770447)]
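
The scores returned by ``most_similar`` are cosine similarities between word vectors.
As a rough illustration of that measure, here is a plain-Python sketch. The three
vectors are made-up toy values, not taken from the trained model, and
``cosine_similarity`` is a hypothetical helper rather than a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, chosen so that two point the same way and one is orthogonal.
tree = [1.0, 2.0, 0.0]
leaf = [2.0, 4.0, 0.0]
car = [0.0, 0.0, 3.0]

print(cosine_similarity(tree, leaf))  # close to 1: same direction
print(cosine_similarity(tree, car))   # 0.0: orthogonal
```

Because the measure depends only on direction, ``leaf`` scores as highly similar to
``tree`` even though it is twice as long.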


The trained model supports other queries too. For example, ``predict_output_word`` ranks
likely center words for a given context, and ``most_similar`` works for any word in the vocabulary:

In [74]:
model.predict_output_word(['bottle'], topn=4)

[('glass', 0.004556643),
 ('hot', 0.0027073936),
 ('powder', 0.0025544222),
 ('soft', 0.002444317)]

In [75]:
model.wv.most_similar('bottle')

[('bucket', 0.8313448429107666),
 ('bag', 0.810011088848114),
 ('lamp', 0.7992446422576904),
 ('pile', 0.7955589890480042),
 ('cake', 0.7852670550346375),
 ('bottles', 0.7827358245849609),
 ('jar', 0.7749974727630615),
 ('needle', 0.7721670269966125),
 ('toilet', 0.7720123529434204),
 ('glass', 0.7589759230613708)]

You can use the API to download several different corpora and pretrained models.
Here's how to list all resources available in gensim-data:

In [None]:
import json
info = api.info()
print(json.dumps(info, indent=4))

There are two types of data resources: corpora and models.



In [None]:
print(info.keys())

Let's have a look at the available corpora:



In [None]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

... and the same for models:



In [None]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )
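
The corpora and models loops have the same shape, so they could be folded into one
helper. The sketch below is one possible way to do that. ``summarize`` is a hypothetical
name, and ``fake_info`` is a made-up stand-in for the dictionary returned by
``api.info()``, so the block runs without a download.

```python
def summarize(section, width=40):
    """Yield one formatted line per resource in an info section."""
    for name, data in sorted(section.items()):
        yield '%s (%d records): %s' % (
            name,
            data.get('num_records', -1),  # -1 marks a missing record count
            data['description'][:width] + '...',
        )


# Hypothetical stand-in for the dictionary returned by api.info().
fake_info = {
    'corpora': {
        'toy-corpus': {
            'num_records': 3,
            'description': 'A made-up corpus used only for this sketch.',
        },
    },
    'models': {
        'toy-model': {
            'description': 'A made-up model without a record count.',
        },
    },
}

for kind in ('corpora', 'models'):
    for line in summarize(fake_info[kind]):
        print(line)
```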

If you want to get detailed information about a model/corpus, use:




In [None]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

Sometimes, you do not want to load a model into memory. Instead, you can request
just the filesystem path to the model. For that, use:




In [None]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

To load the model into memory instead:




In [None]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")

Corpora are never loaded into memory. Every corpus is an iterable wrapped in
a special class ``Dataset``, with an ``__iter__`` method.
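
Because a corpus is streamed, a few documents can be inspected without reading the rest.
The sketch below shows the idea with ``itertools.islice`` on a toy iterable.
``CountingCorpus`` is a hypothetical stand-in for a ``Dataset`` object, with a counter
added to show how many documents were actually produced.

```python
from itertools import islice


class CountingCorpus:
    """Toy streamed corpus that records how many documents were produced."""

    def __init__(self, n_docs):
        self.n_docs = n_docs
        self.produced = 0

    def __iter__(self):
        for i in range(self.n_docs):
            self.produced += 1
            yield ["doc%d" % i, "token"]


toy_corpus = CountingCorpus(n_docs=1000)

# Take only the first three documents; the remaining ones are never generated.
preview = list(islice(toy_corpus, 3))
print(preview)
print(toy_corpus.produced)  # 3
```

The same ``islice`` call should work on any iterable corpus, since it relies only on ``__iter__``.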


