# Tutorial for using Gensim's API for downloading corpora and models
Let's start by importing the api module.

In [1]:
import gensim.downloader as api

Now, let's download the text8 corpus and load it.

In [3]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus = api.load('text8')

2017-09-30 20:00:53,429 : INFO : Creating /home/chaitali/gensim-data/text8
2017-09-30 20:00:53,431 : INFO : Creation of /home/chaitali/gensim-data/text8 successful.
2017-09-30 20:00:53,433 : INFO : Downloading text8
2017-09-30 20:04:46,938 : INFO : text8 downloaded
2017-09-30 20:04:46,951 : INFO : Extracting files from /home/chaitali/gensim-data/text8
2017-09-30 20:04:48,888 : INFO : text8 installed


Now that the corpus has been installed, let's train a word2vec model on it.

In [4]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)

2017-09-30 20:04:59,672 : INFO : collecting all words and their counts
2017-09-30 20:04:59,677 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-09-30 20:05:06,425 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2017-09-30 20:05:06,426 : INFO : Loading a fresh vocabulary
2017-09-30 20:05:06,711 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2017-09-30 20:05:06,711 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2017-09-30 20:05:06,925 : INFO : deleting the raw counts dictionary of 253854 items
2017-09-30 20:05:06,935 : INFO : sample=0.001 downsamples 38 most-common words
2017-09-30 20:05:06,936 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2017-09-30 20:05:06,938 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2017-09-30 20:05:07,267 : INFO : resetting lay

2017-09-30 20:06:18,828 : INFO : PROGRESS: at 77.95% examples, 690734 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:19,838 : INFO : PROGRESS: at 79.07% examples, 690668 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:20,845 : INFO : PROGRESS: at 80.20% examples, 690773 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:21,848 : INFO : PROGRESS: at 81.33% examples, 690841 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:22,853 : INFO : PROGRESS: at 82.43% examples, 690748 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:23,861 : INFO : PROGRESS: at 83.55% examples, 690766 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:24,868 : INFO : PROGRESS: at 84.67% examples, 690835 words/s, in_qsize 6, out_qsize 0
2017-09-30 20:06:25,870 : INFO : PROGRESS: at 85.77% examples, 690917 words/s, in_qsize 5, out_qsize 0
2017-09-30 20:06:26,876 : INFO : PROGRESS: at 86.88% examples, 690986 words/s, in_qsize 4, out_qsize 1
2017-09-30 20:06:27,879 : INFO : PROGRESS: at 88.00% examples, 691091 wor
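The `min_count=5` lines in the log above report how many word types and how many running words survive vocabulary pruning. The following is a minimal sketch of that idea on a made-up toy corpus, not Gensim's actual implementation: `prune_vocab` is a hypothetical helper that counts tokens and keeps only those seen at least `min_count` times.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept

# Toy corpus, made up for illustration (this is not text8).
sentences = [
    ["the", "tree", "has", "bark"],
    ["the", "tree", "has", "a", "leaf"],
    ["the", "bird", "sits", "in", "the", "tree"],
]
counts, kept = prune_vocab(sentences, min_count=2)

total_types = len(counts)            # distinct words before pruning
total_tokens = sum(counts.values())  # running words before pruning
kept_tokens = sum(kept.values())     # running words after pruning

print(f"retains {len(kept)} unique words "
      f"({100 * len(kept) / total_types:.0f}% of original {total_types})")
print(f"leaves {kept_tokens} word corpus "
      f"({100 * kept_tokens / total_tokens:.0f}% of original {total_tokens})")
```

The same pattern shows up in the real log: pruning removes most word types (28% retained) but very few running words (98% retained), because rare words contribute little to the total token count.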

Now that we have our word2vec model, let's find the words most similar to 'tree':

In [5]:
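# Query the word vectors via model.wv; model.most_similar is deprecated and was removed in gensim 4.0.
model.wv.most_similar('tree')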

2017-09-30 20:08:12,509 : INFO : precomputing L2-norms of word weight vectors


[('trees', 0.7073001861572266),
 ('bark', 0.7032904028892517),
 ('leaf', 0.6881209015846252),
 ('bird', 0.6044381260871887),
 ('flower', 0.6009336709976196),
 ('fruit', 0.597153902053833),
 ('avl', 0.5837888717651367),
 ('cactus', 0.5712562799453735),
 ('bee', 0.5658263564109802),
 ('garden', 0.565678596496582)]
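The scores in the list above are cosine similarities between word vectors, which is why the log mentions precomputing L2-norms. As a rough illustration, here is a small pure-Python sketch of the same ranking on made-up 3-dimensional vectors. The vectors and the standalone `most_similar` helper are hypothetical and only mimic the idea; they are not Gensim's code.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration; real word2vec vectors here have 100 dimensions.
vectors = {
    "tree": [1.0, 2.0, 0.0],
    "trees": [2.0, 4.0, 1.0],
    "car": [0.0, -1.0, 3.0],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("tree", vectors)
print(ranking)
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 or below means the words appear in unrelated contexts.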

You can use the API to download many corpora and models. To list every model and corpus that is provided, use the code below:

In [9]:
import json
data_list = api.info()
print(json.dumps(data_list, indent=4))

{
    "model": {
        "Google_News_word2vec": {
            "desc": "Google has published pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.",
            "filename": "GoogleNews-vectors-negative300.bin.gz",
            "checksum": "4fa963d128fe65ec8cd5dd4d9377f8ed"
        },
        "fasttext_eng_model": {
            "desc": "fastText is a library for efficient learning of word representations and sentence classification.These vectors for english language in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.",
            "filename": "wiki.en.vec",
            "checksum": "2de532213d7fa8b937263337c6e9deeb"
        },
        "glove_common_crawl_42B": {
            "desc": "This model is trained on Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors). GloVe is an unsupervised learning algorithm for 
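Because the listing is a nested dictionary, it can be easier to scan as a short table than as raw JSON. The sketch below works on a hand-copied excerpt of the output above (descriptions omitted) rather than on a live `api.info()` call. `list_entries` is a hypothetical helper, and the assumption that every top-level category has the same shape as `"model"` is not confirmed by the truncated output; the keys may also differ in your Gensim version.

```python
import json

# Excerpt of the structure printed above, with the "desc" fields left out.
info_excerpt = json.loads("""
{
    "model": {
        "Google_News_word2vec": {
            "filename": "GoogleNews-vectors-negative300.bin.gz",
            "checksum": "4fa963d128fe65ec8cd5dd4d9377f8ed"
        },
        "fasttext_eng_model": {
            "filename": "wiki.en.vec",
            "checksum": "2de532213d7fa8b937263337c6e9deeb"
        }
    }
}
""")

def list_entries(info):
    """Yield (category, name, filename) for every entry in the nested dictionary."""
    for category, entries in info.items():
        for name, details in entries.items():
            yield category, name, details.get("filename")

rows = list(list_entries(info_excerpt))
for category, name, filename in rows:
    print(f"{category:8} {name:24} {filename}")
```

With the real listing you would pass `data_list` from the cell above in place of `info_excerpt`.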

If you want detailed information about a particular model or corpus, use:

In [2]:
glove_common_crawl_info = api.info('glove_common_crawl_42B')
print(glove_common_crawl_info)

This model is trained on Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors). GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.


Sometimes you do not want to load the corpus or model into memory and only need the path to it. For that, use:

In [8]:
text8_path = api.load('text8', return_path=True)
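One reason to ask for a path is to stream the data yourself instead of holding it all in memory. The sketch below shows one way to do that for a plain whitespace-separated text file. What actually sits at `text8_path` (a file or a directory, compressed or not) is not shown in this tutorial, so the example runs on a temporary demo file. `iter_token_chunks` and its parameters are hypothetical names introduced here for illustration.

```python
import os
import tempfile

def iter_token_chunks(path, chunk_words=10000, read_chars=1 << 16):
    """Lazily yield lists of at most chunk_words whitespace-separated tokens."""
    pending = []   # complete tokens waiting to be emitted
    leftover = ""  # possibly incomplete token at the end of the last read
    with open(path, encoding="utf-8") as handle:
        while True:
            block = handle.read(read_chars)
            if not block:
                break
            text = leftover + block
            tokens = text.split()
            # If the block ended mid-token, hold the last token back.
            if tokens and not text[-1].isspace():
                leftover = tokens.pop()
            else:
                leftover = ""
            pending.extend(tokens)
            while len(pending) >= chunk_words:
                yield pending[:chunk_words]
                pending = pending[chunk_words:]
    if leftover:
        pending.append(leftover)
    if pending:
        yield pending

# Demo on a temporary file with made-up content (not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write("one two three four five six seven eight nine")
    chunks = list(iter_token_chunks(demo_path, chunk_words=4, read_chars=7))

print(chunks)
```

The default chunk size of 10,000 words is an assumption inferred from the training log above, which reports 1701 sentences for 17,005,207 raw words; the source does not state it explicitly.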