In [1]:
%matplotlib inline


How to download pre-trained models and corpora
==============================================

Demonstrates simple and quick access to common corpora, models, and other data.


In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

One of Gensim's features is simple and easy access to some common data.
The `gensim-data <https://github.com/RaRe-Technologies/gensim-data>`_ project stores a variety of corpora, models and other data.
Gensim has a :py:mod:`gensim.downloader` module for programmatically accessing this data.
The module leverages a local cache that ensures data is downloaded at most once.
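
The "download at most once" behaviour can be illustrated with a minimal sketch. This is not Gensim's actual implementation: ``load_cached`` and ``fake_fetch`` are hypothetical names, and the "download" is simulated so the example runs offline.

```python
import tempfile
from pathlib import Path

def load_cached(name, fetch, cache_dir):
    """Return the cached path for `name`, calling `fetch` only on a cache miss."""
    target = Path(cache_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(name))
    return target

calls = []

def fake_fetch(name):
    # Hypothetical stand-in for a network download; records every call.
    calls.append(name)
    return b"placeholder corpus bytes"

with tempfile.TemporaryDirectory() as tmp:
    first = load_cached("text8", fake_fetch, tmp)
    second = load_cached("text8", fake_fetch, tmp)  # served from the cache
    same_path = first == second

print(len(calls))  # 1: the second request did not trigger a download
```

The second call finds the file already on disk and returns its path without fetching again.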

This tutorial:

* Retrieves the text8 corpus, unless it is already on your local machine
* Trains a Word2Vec model from the corpus (see `sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` for a detailed tutorial)
* Leverages the model to calculate word similarity
* Demonstrates using the API to load other models and corpora

Let's start by importing the api module.




In [3]:
import gensim.downloader as api

2020-10-06 13:02:00,760 : INFO : 'pattern' package not found; tag filters are not available for English


Now, let's download the text8 corpus and load it into memory (this happens automatically).




In [4]:
corpus = api.load('text8')

2020-10-06 13:02:00,780 : INFO : Creating /root/gensim-data




2020-10-06 13:02:06,498 : INFO : text8 downloaded


In this case, ``corpus`` is an iterable.
Under the covers, it has the following definition:



In [5]:
import inspect
print(inspect.getsource(corpus.__class__))

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        corpus = Text8Corpus(self.fn)
        for doc in corpus:
            yield doc



For more details, look inside the file that defines the Dataset class for your particular resource.




In [6]:
print(inspect.getfile(corpus.__class__))

/root/gensim-data/text8/__init__.py


Now that the corpus has been downloaded and loaded, let's train a Word2Vec model on it.




In [7]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2020-10-06 13:02:06,539 : INFO : collecting all words and their counts
2020-10-06 13:02:06,549 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-10-06 13:02:13,353 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2020-10-06 13:02:13,354 : INFO : Loading a fresh vocabulary
2020-10-06 13:02:13,566 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2020-10-06 13:02:13,567 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2020-10-06 13:02:13,810 : INFO : deleting the raw counts dictionary of 253854 items
2020-10-06 13:02:13,820 : INFO : sample=0.001 downsamples 38 most-common words
2020-10-06 13:02:13,821 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2020-10-06 13:02:14,014 : INFO : estimated required memory for 71290 

Now that we have our Word2Vec model, let's find the words most similar to 'tree'.




In [8]:
print(model.wv.most_similar('tree'))

2020-10-06 13:04:02,965 : INFO : precomputing L2-norms of word weight vectors


[('trees', 0.7049376368522644), ('bark', 0.6889162659645081), ('leaf', 0.6727646589279175), ('flower', 0.6248071193695068), ('fruit', 0.6198490262031555), ('vine', 0.6014647483825684), ('seed', 0.5978450775146484), ('leaves', 0.5887178778648376), ('bee', 0.5858387351036072), ('avl', 0.5811556577682495)]
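
The scores in the output above are cosine similarities between word vectors. As a rough sketch of the ranking idea (not Gensim's code), the snippet below uses made-up three-dimensional vectors; ``cosine``, ``rank_similar`` and ``toy_vectors`` are hypothetical names introduced only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, not taken from any trained model.
toy_vectors = {
    "tree": [0.9, 0.1, 0.0],
    "trees": [0.8, 0.2, 0.1],
    "leaf": [0.6, 0.5, 0.0],
    "car": [0.0, 0.1, 0.9],
}

ranking = rank_similar("tree", toy_vectors)
print([word for word, _ in ranking])  # ['trees', 'leaf', 'car']
```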




You can use the API to download many corpora and models. To list everything that is available, use the code below:




In [16]:
import json
info = api.info()
print(json.dumps(info, indent=4))

{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

There are two types of data: corpora and models.



In [17]:
print(info.keys())

dict_keys(['corpora', 'models'])


Let's have a look at the available corpora:



In [18]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
    print(
        '%s (%d records): %s' % (
            corpus_name,
            corpus_data.get('num_records', -1),
            corpus_data['description'][:40] + '...',
        )
    )

20-newsgroups (18846 records): The notorious collection of approximatel...
__testing_matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
__testing_multipart-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Synopsis of t...
fake-news (12999 records): News dataset, contains text and metadata...
patent-2017 (353197 records): Patent Grant Full Text. Contains the ful...
quora-duplicate-questions (404290 records): Over 400,000 lines of potential question...
semeval-2016-2017-task3-subtaskA-unannotated (189941 records): SemEval 2016 / 2017 Task 3 Subtask A una...
semeval-2016-2017-task3-subtaskBC (-1 records): SemEval 2016 / 2017 Task 3 Subtask B and...
text8 (1701 records): First 100,000,000 bytes of plain text fr...
wiki-english-20171001 (4924894 records): Extracted Wikipedia dump from October 20...


... and the same for models:



In [19]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai

To get detailed information about a specific model or corpus, pass its name to ``api.info``:




In [13]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))

{
    "num_records": 12999,
    "record_format": "dict",
    "file_size": 20102776,
    "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "fields": {
        "crawled": "date the story was archived",
        "ord_in_thread": "",
        "published": "date published",
        "participants_count": "number of participants",
        "shares": "number of Facebook shares",
        "replies_count": "number of replies",
        "main_img_url": "image from story",
        "spam_score": "data from webhose.io",
        "uuid": "unique identifier",
        "language": "data from webhose.io",
        "title": "title of story",
        "country": "data from webhose.io",
        "domain_rank": "data from webhose.io",
        "author": "author of story",
        "comments": "number of Facebook comments",
        "site_url": "site URL from BS detector",
        "text": "tex

Sometimes you do not want to load the model into memory; you only want the path to its file. For that, pass ``return_path=True``:




In [14]:
print(api.load('glove-wiki-gigaword-50', return_path=True))



2020-10-06 13:04:11,548 : INFO : glove-wiki-gigaword-50 downloaded


/root/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz


To load the model into memory instead, omit ``return_path``:




In [23]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")

2020-10-06 13:23:29,164 : INFO : loading projection weights from /root/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2020-10-06 13:23:52,038 : INFO : loaded (400000, 50) matrix from /root/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz
2020-10-06 13:23:52,103 : INFO : precomputing L2-norms of word weight vectors


[('plastic', 0.7942505478858948),
 ('metal', 0.770871639251709),
 ('walls', 0.7700636386871338),
 ('marble', 0.7638524174690247),
 ('wood', 0.7624281048774719),
 ('ceramic', 0.7602593302726746),
 ('pieces', 0.7589111924171448),
 ('stained', 0.7528817057609558),
 ('tile', 0.748193621635437),
 ('furniture', 0.746385931968689)]

For corpora, the data is never loaded into memory: every corpus is wrapped in a special ``Dataset`` class that provides an ``__iter__`` method.
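
To make the streaming idea concrete, here is a minimal sketch of a restartable iterable in the same spirit as the ``Dataset`` class printed earlier. ``LineCorpus`` is a hypothetical class, not part of Gensim or gensim-data, and it reads a small temporary file instead of a real corpus.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: re-opens the file on every pass and yields
    one tokenized line at a time, so the whole file is never held in memory."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("the quick brown fox\njumps over the lazy dog\n")

    toy_corpus = LineCorpus(path)
    first_pass = list(toy_corpus)
    second_pass = list(toy_corpus)  # a new pass starts from the beginning

print(first_pass == second_pass)  # True
print(len(first_pass))  # 2
```

Defining ``__iter__`` on a class, rather than handing over a one-shot generator, matters because Word2Vec training iterates over the corpus more than once (a vocabulary scan followed by the training passes).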




In [24]:
model.get_vector('car')

array([ 0.47685 , -0.084552,  1.4641  ,  0.047017,  0.14686 ,  0.5082  ,
       -1.2228  , -0.22607 ,  0.19306 , -0.29756 ,  0.20599 , -0.71284 ,
       -1.6288  ,  0.17096 ,  0.74797 , -0.061943, -0.65766 ,  1.3786  ,
       -0.68043 , -1.7551  ,  0.58319 ,  0.25157 , -1.2114  ,  0.81343 ,
        0.094825, -1.6819  , -0.64498 ,  0.6322  ,  1.1211  ,  0.16112 ,
        2.5379  ,  0.24852 , -0.26816 ,  0.32818 ,  1.2916  ,  0.23548 ,
        0.61465 , -0.1344  , -0.13237 ,  0.27398 , -0.11821 ,  0.1354  ,
        0.074306, -0.61951 ,  0.45472 , -0.30318 , -0.21883 , -0.56054 ,
        1.1177  , -0.36595 ], dtype=float32)

In [25]:
model.get_vector('banana')

array([-0.25522 , -0.75249 , -0.86655 ,  1.1197  ,  0.12887 ,  1.0121  ,
       -0.57249 , -0.36224 ,  0.44341 , -0.12211 ,  0.073524,  0.21387 ,
        0.96744 , -0.068611,  0.51452 , -0.053425, -0.21966 ,  0.23012 ,
        1.043   , -0.77016 , -0.16753 , -1.0952  ,  0.24837 ,  0.20019 ,
       -0.40866 , -0.48037 ,  0.10674 ,  0.5316  ,  1.111   , -0.19322 ,
        1.4768  , -0.51783 , -0.79569 ,  1.7971  , -0.33392 , -0.14545 ,
       -1.5454  ,  0.0135  ,  0.10684 , -0.30722 , -0.54572 ,  0.38938 ,
        0.24659 , -0.85166 ,  0.54966 ,  0.82679 , -0.68081 , -0.77864 ,
       -0.028242, -0.82872 ], dtype=float32)

In [27]:
model.get_vector('vocabulary')

array([-0.40975 ,  0.14091 , -1.1001  , -0.71934 ,  0.38255 ,  0.42821 ,
        0.27204 , -0.74735 , -0.80416 ,  0.60244 ,  0.52853 ,  0.43571 ,
        0.84135 , -0.4106  , -0.33996 , -0.034945, -0.51291 ,  0.58942 ,
        1.1016  , -0.31603 , -0.37732 ,  0.22156 , -0.46774 ,  0.46372 ,
        1.0298  , -0.32891 , -1.1721  ,  0.022795,  0.10548 , -0.27759 ,
        2.0419  , -0.53649 ,  0.45539 , -0.067378,  0.58818 ,  0.56012 ,
       -0.6788  ,  0.19493 , -0.37379 ,  0.18017 ,  0.75873 , -0.18599 ,
        0.1176  ,  0.61295 , -0.35261 ,  0.025468,  1.494   ,  1.4015  ,
       -0.6125  , -0.34257 ], dtype=float32)