# Tutorial: using Gensim's API to download corpora and models
Let's start by importing the api module.

In [2]:
import gensim.downloader as api

Now, let's download the text8 corpus using the load function. If the dataset has already been downloaded, the path to the corpus is returned right away. Otherwise, the dataset is downloaded first and then its path is returned.

In [3]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus_path = api.load('text8')
print("Path to corpus: ", corpus_path)

2017-10-25 00:54:48,792 : INFO : Downloading text8




2017-10-25 00:55:11,503 : INFO : text8 downloaded


Path to corpus:  /home/chaitali/gensim-data/text8/text8
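
The download-or-reuse behavior described above can be pictured with a small sketch. It is not Gensim's implementation: `get_or_fetch`, `fake_fetch`, and the directory layout are hypothetical names chosen for illustration, and the "download" is simulated by writing a local placeholder file.

```python
import os
import tempfile


def get_or_fetch(name, base_dir, fetch):
    """Return the path to dataset `name`, calling `fetch` only if it is missing.

    `fetch` is a hypothetical callable that writes the dataset to the given path.
    """
    path = os.path.join(base_dir, name, name)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        fetch(path)  # stands in for the real download step
    return path


calls = []


def fake_fetch(path):
    # Simulated download: record the call and write a tiny placeholder file.
    calls.append(path)
    with open(path, "w") as f:
        f.write("placeholder")


base_dir = tempfile.mkdtemp()
first = get_or_fetch("text8", base_dir, fake_fetch)   # triggers the fetch
second = get_or_fetch("text8", base_dir, fake_fetch)  # reuses the existing file
print(first == second, len(calls))
```

The second call finds the file already on disk, so the fetch step runs only once and both calls return the same path.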


Now that the corpus has been downloaded and extracted, let's train a word2vec model on it.

In [4]:
from gensim.models.word2vec import Text8Corpus
from gensim.models.word2vec import Word2Vec
corpus = Text8Corpus(corpus_path)
model = Word2Vec(corpus)

2017-10-25 00:55:47,594 : INFO : collecting all words and their counts
2017-10-25 00:55:47,600 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-25 00:55:55,687 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2017-10-25 00:55:55,688 : INFO : Loading a fresh vocabulary
2017-10-25 00:55:55,968 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2017-10-25 00:55:55,969 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2017-10-25 00:55:56,172 : INFO : deleting the raw counts dictionary of 253854 items
2017-10-25 00:55:56,181 : INFO : sample=0.001 downsamples 38 most-common words
2017-10-25 00:55:56,181 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2017-10-25 00:55:56,182 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2017-10-25 00:55:56,474 : INFO : resetting lay

2017-10-25 00:57:07,903 : INFO : PROGRESS: at 76.34% examples, 676871 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:08,911 : INFO : PROGRESS: at 77.54% examples, 677805 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:09,915 : INFO : PROGRESS: at 78.72% examples, 678531 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:10,923 : INFO : PROGRESS: at 79.89% examples, 679198 words/s, in_qsize 4, out_qsize 1
2017-10-25 00:57:11,929 : INFO : PROGRESS: at 81.09% examples, 680040 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:12,933 : INFO : PROGRESS: at 82.29% examples, 680809 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:13,939 : INFO : PROGRESS: at 83.47% examples, 681462 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:14,960 : INFO : PROGRESS: at 84.68% examples, 682292 words/s, in_qsize 6, out_qsize 1
2017-10-25 00:57:15,970 : INFO : PROGRESS: at 85.88% examples, 683163 words/s, in_qsize 5, out_qsize 0
2017-10-25 00:57:16,985 : INFO : PROGRESS: at 87.04% examples, 683714 wor
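
The vocabulary lines in the log above report how many word types and word occurrences survive the min_count=5 threshold. A rough sketch of that kind of frequency pruning follows, on a toy corpus with a threshold of 2. The function name `prune_vocab` and the sample sentences are made up for illustration, and this is not Gensim's code.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count words and keep only those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


# Invented toy sentences; each sentence is a list of tokens.
sentences = [
    ["the", "tree", "has", "a", "leaf"],
    ["the", "bird", "sat", "in", "the", "tree"],
]
counts, kept = prune_vocab(sentences, min_count=2)
retained_types = len(kept)             # distinct words kept
retained_tokens = sum(kept.values())   # word occurrences kept
print(retained_types, "of", len(counts), "word types kept")
print(retained_tokens, "of", sum(counts.values()), "word occurrences kept")
```

As in the real log, dropping rare words removes most word types but only a small share of the total word occurrences.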

Now that we have our word2vec model, let's find the words most similar to 'tree':

In [5]:
model.wv.most_similar('tree')

2017-10-25 01:00:28,412 : INFO : precomputing L2-norms of word weight vectors


[('trees', 0.6827899217605591),
 ('leaf', 0.6613928079605103),
 ('avl', 0.6607560515403748),
 ('bark', 0.6449544429779053),
 ('bird', 0.6373804807662964),
 ('flower', 0.6225261688232422),
 ('fruit', 0.6008641719818115),
 ('grass', 0.5877920985221863),
 ('cave', 0.5793893933296204),
 ('cactus', 0.5789991021156311)]
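
The scores in the list above are cosine similarities between word vectors. To make the ranking concrete, here is a minimal pure-Python sketch on invented 3-dimensional vectors. The helper names `cosine` and `most_similar` and all the numbers are assumptions for illustration, not the trained model's values or Gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for this example.
vectors = {
    "tree": [1.0, 0.0, 1.0],
    "trees": [0.9, 0.1, 1.0],
    "leaf": [0.5, 0.5, 0.5],
    "car": [0.0, 1.0, 0.0],
}
ranked = most_similar("tree", vectors)
print(ranked)
```

Words whose vectors point in nearly the same direction as the query vector score close to 1, while unrelated directions score near 0.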

You can use the API to download many corpora and models. To list everything that is available, use the code below:

In [3]:
import json
data_list = api.info()
print(json.dumps(data_list, indent=4))

{
    "corpora": {
        "text8": {
            "description": "Cleaned small sample from wikipedia",
            "checksum": "f407f5aed497fc3b0fb33b98c4f9d855",
            "file_name": "text8"
        },
        "fake-news": {
            "description": "It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.",
            "checksum": "a61f985190ba361defdfc3fef616b9cd",
            "file_name": "fake.csv",
            "source": "Kaggle"
        }
    },
    "models": {
        "glove-wiki-gigaword-50": {
            "description": "Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/",
            "parameters": "dimension = 50",
            "preprocessing": "Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`",
            "papers": "https://nlp.stanford.edu/pubs/glove.pdf"

To get detailed information about a single model or corpus, pass its name to info:

In [4]:
fake_news_info = api.info('fake-news')
print(fake_news_info)

{'description': "It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski.", 'checksum': 'a61f985190ba361defdfc3fef616b9cd', 'file_name': 'fake.csv', 'source': 'Kaggle'}
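
Because the information comes back as a nested dictionary (as the JSON dump above shows), the names alone can be pulled out with ordinary dictionary operations. The sketch below works on a hand-copied, trimmed stand-in for that dictionary so that it runs without Gensim or a network connection. With the real API you would use the value returned by `api.info()` instead.

```python
# A trimmed copy of the structure printed above (only a few fields per entry).
data_list = {
    "corpora": {
        "text8": {"file_name": "text8"},
        "fake-news": {"file_name": "fake.csv", "source": "Kaggle"},
    },
    "models": {
        "glove-wiki-gigaword-50": {"parameters": "dimension = 50"},
    },
}

# Iterating over a dict yields its keys, i.e. the dataset names.
corpus_names = sorted(data_list["corpora"])
model_names = sorted(data_list["models"])
print("corpora:", corpus_names)
print("models:", model_names)

# Look up one field for a single entry.
fake_news_file = data_list["corpora"]["fake-news"]["file_name"]
print(fake_news_file)
```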


Sometimes you do not want to load the model into memory and only need the path to the model file. In that case, pass return_path=True:

In [8]:
print(api.load('glove-wiki-gigaword-50', return_path=True))

/home/chaitali/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.txt
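
The path returned above points to a text file in word2vec format: a header line with the vocabulary size and vector dimension, followed by one word and its vector per line. In practice Gensim's `KeyedVectors.load_word2vec_format` is the usual way to read such a file. The sketch below only illustrates the layout with a hypothetical pure-Python reader, `read_word2vec_text`, run against a tiny stand-in file whose words and numbers are invented.

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse a word2vec text file: a 'count dim' header, then one word per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        count, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Write a tiny stand-in file; the words and numbers are invented.
tmp_path = os.path.join(tempfile.mkdtemp(), "toy-vectors.txt")
with open(tmp_path, "w", encoding="utf-8") as f:
    f.write("2 3\nglass 0.1 0.2 0.3\nmetal 0.4 0.5 0.6\n")

count, dim, vectors = read_word2vec_text(tmp_path)
print(count, dim, vectors["glass"])
```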


If you want to load the model into memory, call load without return_path:

In [10]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")

[('plastic', 0.7942505478858948),
 ('metal', 0.770871639251709),
 ('walls', 0.7700636386871338),
 ('marble', 0.7638523578643799),
 ('wood', 0.7624280452728271),
 ('ceramic', 0.7602593302726746),
 ('pieces', 0.7589111924171448),
 ('stained', 0.7528817057609558),
 ('tile', 0.748193621635437),
 ('furniture', 0.7463858723640442)]

Corpora are never loaded into memory. Regardless of whether return_path is True or False, load always returns the path to the dataset file or folder, and you have to load the data yourself.
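
Loading a corpus yourself does not have to mean reading the whole file at once. The training log earlier reports 17005207 raw words split into 1701 sentences, so text8 is consumed in fixed-size chunks of words. A minimal sketch of that streaming idea follows, assuming a file of space-separated words. The name `iter_word_chunks`, its parameters, and the sample text are hypothetical, and this is not the Text8Corpus implementation.

```python
import os
import tempfile


def iter_word_chunks(path, chunk_words=4, block_size=16):
    """Yield lists of at most `chunk_words` words without reading the whole file."""
    words, leftover = [], ""
    with open(path, encoding="utf-8") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            pieces = (leftover + block).split(" ")
            leftover = pieces.pop()  # last piece may be a word cut in half
            words.extend(p for p in pieces if p)
            while len(words) >= chunk_words:
                yield words[:chunk_words]
                words = words[chunk_words:]
    if leftover:
        words.append(leftover)
    if words:
        yield words


# Tiny stand-in corpus; the real text8 file is far larger.
corpus_file = os.path.join(tempfile.mkdtemp(), "toy-corpus")
with open(corpus_file, "w", encoding="utf-8") as f:
    f.write("one two three four five six seven eight nine")

chunks = list(iter_word_chunks(corpus_file))
print(chunks)
```

Because the function is a generator, only one block of text and a handful of words are held in memory at any time, whatever the size of the file.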