
Speed up word2vec / fasttext model loading #2642

Closed · piskvorky opened this issue Oct 21, 2019 · 17 comments · Fixed by #2671

Labels
- Hacktoberfest: Issues marked for Hacktoberfest
- help wanted
- impact HIGH: Show-stopper for affected users
- performance: Issue related to performance (in HW meaning)
- reach MEDIUM: Affects a significant number of users
- testing: Issue related with testing (code, documentation, etc.)

Comments

piskvorky (Owner) commented Oct 21, 2019

Loading a large word2vec model with load_word2vec_format(binary=True) is slow. Users complain that loading the "standard" models published by Facebook is too slow, and it also affects the speed of our own tests and tutorial autogeneration.

Some numbers:

time gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('./word2vec-google-news-300.gz', binary=True)
2019-10-21 22:24:08,326 : INFO : loading projection weights from ./word2vec-google-news-300.gz
2019-10-21 22:26:54,620 : INFO : loaded (3000000, 300) matrix from ./word2vec-google-news-300.gz
CPU times: user 2min 42s, sys: 3.64 s, total: 2min 46s
Wall time: 2min 46s

The I/O part by itself, i.e. just loading the bytes from the file without any interpretation, takes about 30 seconds:

time full = io.BytesIO(smart_open.open('./word2vec-google-news-300.gz', 'rb').read())
CPU times: user 20.9 s, sys: 8.13 s, total: 29.1 s
Wall time: 31.9 s

…which means our parsing code takes up the majority of the load_word2vec_format time. Ideally, we shouldn't need much more than those 30 seconds (= the raw I/O speed) for the full load_word2vec_format(binary=True). Nearly 3 minutes is too much.

Task: Optimize load_word2vec_format, especially the binary=True branch. The code seems to live here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L369

piskvorky added the testing, performance, help wanted, impact HIGH, reach MEDIUM and Hacktoberfest labels on Oct 21, 2019
piyush01123 commented:

Hi, I am interested in solving this issue. However, I was unable to reproduce what you got. Specifically, the following code block (based on the code you linked):

from gensim import utils
import numpy as np
import io
import smart_open
import time

tit_file_read = time.time()
fin = utils.open("~/Downloads/GoogleNews-vectors-negative300.bin", 'rb')
tat_file_read = time.time()

header = utils.to_unicode(fin.readline(), 'utf-8')
vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
binary_len = np.dtype(np.float32).itemsize * vector_size

word2vec = {}

tit_for = time.time()
for i in range(vocab_size):
    # mixed text and binary: read text first, then binary
    word = []
    while True:
        ch = fin.read(1)  # Python uses I/O buffering internally
        if ch == b' ':
            break
        if ch == b'':
            raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
        if ch != b'\n':  # ignore newlines in front of words (some binary files have them)
            word.append(ch)
    word = utils.to_unicode(b''.join(word), encoding='utf-8')
    with utils.ignore_deprecation_warning():
        # TODO use frombuffer or something similar
        weights = np.fromstring(fin.read(binary_len), dtype=np.float32)
    word2vec[word] = weights

tat_for = time.time()
print("File Read", tat_file_read - tit_file_read)
print("For Loop", tat_for - tit_for)

gives the report:

File Read 0.00021314620971679688
For Loop 99.58815383911133

If you change the file reader to this:

fin = io.BytesIO(smart_open.open('~/Downloads/GoogleNews-vectors-negative300.bin', 'rb').read())

I get this report:

File Read 7.0926032066345215
For Loop 166.45309591293335

So, what we have in the library seems fine to me.
Am I missing something here?

piskvorky (Owner, Author) commented Oct 23, 2019

That timing is strange. It is not possible that parsing from an in-memory buffer is 1.6x slower than reading and parsing a zipped file from disk. Maybe you ran out of RAM and you're swapping? How much free RAM do you have?

In any case, on your machine, we want to get closer to the raw I/O speed of 7 seconds, instead of 100 seconds. The parsing overhead is too much. The overhead is even worse on your machine than on mine, because you have a faster disk.

A good first step would be to line-profile the code, to see where the bottlenecks are.
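For reference, one way to get such a line profile is with the line_profiler package; the snippet below is only an illustrative sketch (it assumes line_profiler is installed and that the loader still lives in gensim.models.utils_any2vec, as linked above), not code from this thread:

import line_profiler

from gensim.models import keyedvectors, utils_any2vec

profiler = line_profiler.LineProfiler()
# register the internal helper we suspect is slow, so its lines get timed
profiler.add_function(utils_any2vec._load_word2vec_format)

# wrap the public entry point so profiling is active while it runs
load = profiler(keyedvectors.KeyedVectors.load_word2vec_format)
kv = load('./word2vec-google-news-300.gz', binary=True)

profiler.print_stats()  # per-line hit counts and timings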

piskvorky (Owner, Author) commented Oct 31, 2019

Ping @piyush-kgp, are you able to profile? This would be an awesome ticket to resolve, with great impact.

piyush01123 commented:

Hey, sorry, I have been busy lately. Maybe someone else could take this up.

lopusz added 4 commits to lopusz/gensim that referenced this issue Nov 7, 2019
lopusz (Contributor) commented Nov 7, 2019

Hello @piskvorky, Hello @mpenkov

I created pull request #2671 to improve on this issue.

Measurements on my laptop on standard Google news vectors:

Reading file to bytes 24.5 seconds

Reading file to KeyedVectors (Before) 127.8 seconds
Reading file to KeyedVectors (After) 44.3 seconds

No magic, but I believe a solid improvement.

I am curious what you think.

I used the following benchmark:

import timeit

import numpy as np

import gensim.models.keyedvectors
import gensim.utils

def read_plain(fname):
    with gensim.utils.open(fname, "rb") as fin:
        res = fin.read()
    return res

W2V_FNAME = './word2vec-google-news-300.gz'

print("Reading file to bytes")
t = timeit.default_timer()
x = read_plain(W2V_FNAME)
delta_t = timeit.default_timer() - t
print("Time for the operation %.1f seconds\n" % delta_t)

print("Reading file to KeyedVectors")
t = timeit.default_timer()
x = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(W2V_FNAME, binary=True)
delta_t = timeit.default_timer() - t
print("Time for the operation %.1f seconds\n" % delta_t)

print("Number of words in the obtained model  = %d" %len(x.vocab))

piskvorky (Owner, Author) commented Nov 7, 2019

Thanks @lopusz. Can you share the profiling numbers? Plus a high-level description of your "attack vector": which optimization choices you made, and why.

Let's go over that before delving into the code nitty-gritty. Cheers.

lopusz added a commit to lopusz/gensim that referenced this issue Nov 8, 2019
lopusz (Contributor) commented Nov 8, 2019

Hello Radim,

Thanks for your quick reaction. Indeed, I should probably be more descriptive in the ticket.

A helicopter view of the approach:

As you know, a w2v binary file consists of blocks, each containing a word, a space, and a vector (a block of float32 values).

The original approach reads each block character by character until it hits a space. On every read, the character is appended to a list. Finally, join is called on the list of characters to obtain the word, and a bulk read of the vector takes place. Obviously, this has quite a lot of overhead per word.

My take is to read one chunk of bytes (default size 100 kB) from the file, then parse as many complete word-space-vector blocks as the chunk contains, then read and append the next chunk, repeating until the file is exhausted. Parsing is done with minimal copying and bookkeeping. The patch is fairly short. Today I tried to make it even more readable, so it can perhaps speak better than I can.
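For concreteness, here is a minimal sketch of that chunked parsing idea; it is my own illustration rather than the actual code from PR #2671, and the helper name iter_vectors is made up:

import numpy as np

def iter_vectors(fin, vector_size, chunk_size=100 * 1024):
    """Yield (word, vector) pairs from an open binary word2vec stream, chunk by chunk."""
    binary_len = np.dtype(np.float32).itemsize * vector_size
    buf = b''
    while True:
        chunk = fin.read(chunk_size)  # one bulk read instead of thousands of tiny ones
        buf += chunk
        start = 0
        while True:
            space = buf.find(b' ', start)
            if space == -1:
                break  # no complete word left in the buffer
            end = space + 1 + binary_len
            if end > len(buf):
                break  # the vector is not fully buffered yet; wait for the next chunk
            word = buf[start:space].lstrip(b'\n').decode('utf-8')
            vector = np.frombuffer(buf, dtype=np.float32, count=vector_size, offset=space + 1)
            yield word, vector
            start = end
        buf = buf[start:]  # carry the unparsed tail over to the next chunk
        if not chunk:
            return  # end of file

The point is that the hot loop now issues one read per chunk and one bytes.find plus one numpy.frombuffer per word, instead of one read(1) call per character.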

Profiles:

Original (truncated below 0.5 sec tottime):

         351516849 function calls (351056029 primitive calls) in 187.918 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   27.915   27.915  187.293  187.293 utils_any2vec.py:149(_load_word2vec_format)
 47258510   19.851    0.000   78.680    0.000 gzip.py:271(read)
47703404/47258549   18.706    0.000   43.615    0.000 {method 'read' of '_io.BufferedReader' objects}
   444856   16.212    0.000   16.212    0.000 {method 'decompress' of 'zlib.Decompress' objects}
  3000000   12.010    0.000   12.010    0.000 {built-in method numpy.fromstring}
  3000000   11.692    0.000   14.636    0.000 utils_any2vec.py:206(add_word)
 47258511   10.521    0.000   15.215    0.000 _compression.py:12(_check_not_closed)
  3000005    6.663    0.000   19.612    0.000 warnings.py:119(filterwarnings)
  6000000    5.730    0.000   32.572    0.000 utils.py:1479(ignore_deprecation_warning)
 47258513    4.693    0.000    4.693    0.000 gzip.py:298(closed)
  3000013    4.177    0.000    7.640    0.000 warnings.py:159(_add_filter)
 44336046    3.302    0.000    3.302    0.000 {method 'append' of 'list' objects}
  3000001    3.259    0.000    3.964    0.000 utils.py:348(any2unicode)
   444857    3.008    0.000    3.008    0.000 {built-in method zlib.crc32}
  3000000    2.980    0.000    3.304    0.000 warnings.py:449(__enter__)
  3000001    2.946    0.000    2.946    0.000 {method 'astype' of 'numpy.ndarray' objects}
  3000000    2.648    0.000    3.269    0.000 contextlib.py:59(__init__)
  3000009    2.337    0.000    2.337    0.000 {method 'remove' of 'list' objects}
  3000000    2.318    0.000    7.694    0.000 contextlib.py:85(__exit__)
  6000665    2.249    0.000    2.314    0.000 re.py:286(_compile)
  3000000    2.181    0.000    2.466    0.000 warnings.py:468(__exit__)
   444856    1.917    0.000   26.821    0.000 _compression.py:66(readinto)
  3000000    1.898    0.000    1.898    0.000 {method 'join' of 'bytes' objects}
15027404/15027403    1.851    0.000    1.851    0.000 {built-in method builtins.isinstance}
  6000027    1.779    0.000   34.351    0.000 {built-in method builtins.next}
  3000000    1.526    0.000    2.204    0.000 keyedvectors.py:203(__init__)
  6000199    1.524    0.000    3.837    0.000 re.py:231(compile)
  3000000    1.460    0.000    1.460    0.000 warnings.py:428(__init__)
   444856    1.411    0.000   24.581    0.000 gzip.py:438(read)
  3000000    1.366    0.000    4.635    0.000 contextlib.py:157(helper)
  3000000    0.956    0.000   29.930    0.000 contextlib.py:79(__enter__)
   444893    0.902    0.000    2.814    0.000 gzip.py:80(read)
  9000013    0.875    0.000    0.875    0.000 {built-in method _warnings._filters_mutated}
  3000018    0.861    0.000    0.861    0.000 {method 'insert' of 'list' objects}
5714868/5714048    0.723    0.000    0.723    0.000 {built-in method builtins.len}
  3000596    0.679    0.000    0.679    0.000 {method 'update' of 'dict' objects}
   444856    0.626    0.000    3.690    0.000 gzip.py:489(_add_read_data)
  3011200    0.626    0.000    0.626    0.000 {built-in method builtins.getattr}

After optimization (truncated below 0.1 sec tottime):

         37251884 function calls (36986852 primitive calls) in 45.248 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   249066   14.593    0.000   14.593    0.000 {method 'decompress' of 'zlib.Decompress' objects}
  3000000    8.391    0.000   11.215    0.000 utils_any2vec.py:189(__add_word_to_result)
    35589    4.894    0.000   23.696    0.001 utils_any2vec.py:213(__add_words_from_binary_chunk_to_result)
   249068    2.842    0.000    2.842    0.000 {built-in method zlib.crc32}
  3000000    2.767    0.000    2.767    0.000 {built-in method numpy.frombuffer}
  3000000    1.790    0.000    2.285    0.000 keyedvectors.py:203(__init__)
  3000001    1.756    0.000    1.756    0.000 {method 'astype' of 'numpy.ndarray' objects}
284695/35628    1.188    0.000   20.557    0.001 {method 'read' of '_io.BufferedReader' objects}
  3000009    1.134    0.000    1.134    0.000 {method 'decode' of 'bytes' objects}
  3000000    1.038    0.000    1.300    0.000 utils_any2vec.py:207(__remove_initial_new_line)
   249067    0.667    0.000   20.270    0.000 _compression.py:66(readinto)
7468957/7468137    0.657    0.000    0.657    0.000 {built-in method builtins.len}
  3035588    0.629    0.000    0.629    0.000 {method 'find' of 'bytes' objects}
   249067    0.601    0.000   19.475    0.000 gzip.py:438(read)
  3000596    0.495    0.000    0.495    0.000 {method 'update' of 'dict' objects}
        1    0.314    0.314   44.665   44.665 utils_any2vec.py:149(_load_word2vec_format)
  3077536    0.260    0.000    0.260    0.000 {method 'append' of 'list' objects}
   249066    0.225    0.000    3.088    0.000 gzip.py:489(_add_read_data)
   249106    0.219    0.000    1.120    0.000 gzip.py:80(read)

I hope this clears up the details ;)

Best regards,
Michał

cc @piskvorky

piskvorky (Owner, Author) commented:

Thanks again. I meant a line-profile of that one function, _load_word2vec_format.

I'm still curious, because my understanding was that read() should already be doing buffering internally. But since your extra buffering on top improves performance so much, I guess it isn't. Some of my assumptions must have been wrong.

piskvorky (Owner, Author) commented Nov 8, 2019

And by the way, in connection with your PR, which already shows great improvements: we're also shipping automatically compiled versions of Gensim, for {Windows, Mac, Linux} x {various Python versions}. So if you feel any one function / loop / algorithm is a major bottleneck because of pure Python, feel free to unleash Cython on it. Or at least let us know and we will :)

lopusz (Contributor) commented Nov 8, 2019

Hi Radim, I believe you are right: the reader does buffering internally.

It seems to me the problem is more related to the sheer number of calls (one per letter, for each of the 3,000,000 vocabulary entries!).

If you look at the attached profiles, the _io.BufferedReader.read call was executed more than 47,000,000 times in the unoptimized case (the call occurs once per letter of each word, plus there seem to be some recursive calls). This took 43.615 seconds.

In the optimized case there are "only" 284,000 calls, taking 20.55 seconds.

I think there is some constant per-call cost, which blows up if you make one call per letter.
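A toy illustration of that per-call cost (my own example, not a measurement from this thread): reading the same in-memory data one byte at a time versus in large reads.

import io
import timeit

DATA = b'x' * (10 * 1024 * 1024)  # 10 MB of dummy bytes

def read_byte_by_byte():
    fin = io.BytesIO(DATA)
    while fin.read(1):  # one call per byte, like the character-by-character word parsing
        pass

def read_in_chunks():
    fin = io.BytesIO(DATA)
    while fin.read(100 * 1024):  # one call per 100 kB chunk
        pass

print('one byte per read call:', timeit.timeit(read_byte_by_byte, number=1))
print('100 kB per read call:  ', timeit.timeit(read_in_chunks, number=1))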

There are also some other places where precious wall time leaks; e.g., it seems that
almost 7 seconds go to suppressing deprecation warnings ;)

Anyway, a line profile would certainly be interesting. I am on it.

lopusz (Contributor) commented Nov 8, 2019

Here is the line profile of the original _load_word2vec_format. Indeed, a lot goes to read; the deprecation warnings are visible too ;)

Total time: 473.53 s                                                            
Function: _load_word2vec_format at line 148                                     
                                                                                
Line #      Hits         Time  Per Hit   % Time  Line Contents                  
==============================================================                  
   148                                           @profile                       
   149                                           def _load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict',
   150                                                                     limit=None, datatype=REAL):
   186         1          9.0      9.0      0.0      from gensim.models.keyedvectors import Vocab
   187         1          1.0      1.0      0.0      counts = None              
   188         1          1.0      1.0      0.0      if fvocab is not None:     
   189                                                   logger.info("loading word counts from %s", fvocab)
   190                                                   counts = {}            
   191                                                   with utils.open(fvocab, 'rb') as fin:
   192                                                       for line in fin:   
   193                                                           word, count = utils.to_unicode(line, errors=unicode_errors).strip().split()
   194                                                           counts[word] = int(count)
   195                                                                          
   196         1         10.0     10.0      0.0      logger.info("loading projection weights from %s", fname)
   197         1        197.0    197.0      0.0      with utils.open(fname, 'rb') as fin:
   198         1        171.0    171.0      0.0          header = utils.to_unicode(fin.readline(), encoding=encoding)
   199         1          7.0      7.0      0.0          vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
   200         1          1.0      1.0      0.0          if limit:              
   201                                                       vocab_size = min(vocab_size, limit)
   202         1         15.0     15.0      0.0          result = cls(vector_size)
   203         1          1.0      1.0      0.0          result.vector_size = vector_size
   204         1         15.0     15.0      0.0          result.vectors = zeros((vocab_size, vector_size), dtype=datatype)
   205                                                                          
   206         1          1.0      1.0      0.0          def add_word(word, weights):
   207                                                       word_id = len(result.vocab)
   208                                                       if word in result.vocab:
   209                                                           logger.warning("duplicate word '%s' in %s, ignoring all but first", word, fname)
   210                                                           return         
   211                                                       if counts is None: 
   212                                                           # most common scenario: no vocab file given. just make up some bogus counts, in descending order
   213                                                           result.vocab[word] = Vocab(index=word_id, count=vocab_size - word_id)
   214                                                       elif word in counts:
   215                                                           # use count from the vocab file
   216                                                           result.vocab[word] = Vocab(index=word_id, count=counts[word])
   217                                                       else:              
   218                                                           # vocab file given, but word is missing -- set count to None (TODO: or raise?)
   219                                                           logger.warning("vocabulary file is incomplete: '%s' is missing", word)
   220                                                           result.vocab[word] = Vocab(index=word_id, count=None)
   221                                                       result.vectors[word_id] = weights
   222                                                       result.index2word.append(word)
   223                                                                          
   224         1          1.0      1.0      0.0          if binary:             
   225         1          3.0      3.0      0.0              binary_len = dtype(REAL).itemsize * vector_size
   226   3000001    3592876.0      1.2      0.8              for _ in range(vocab_size):
   227                                                           # mixed text and binary: read text first, then binary
   228   3000000    3362574.0      1.1      0.7                  word = []      
   229   3000000    3189548.0      1.1      0.7                  while True:    
   230  44258510  116027581.0      2.6     24.5                      ch = fin.read(1)  # Python uses I/O buffering internally
   231  44258510   46767715.0      1.1      9.9                      if ch == b' ':
   232   3000000    3226081.0      1.1      0.7                          break  
   233  41258510   43162191.0      1.0      9.1                      if ch == b'':
   234                                                                   raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
   235  41258510   42878365.0      1.0      9.1                      if ch != b'\n':  # ignore newlines in front of words (some binary files have)
   236  41258510   46056608.0      1.1      9.7                          word.append(ch)
   237   3000000   12026255.0      4.0      2.5                  word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
   238   3000000   58346305.0     19.4     12.3                  with utils.ignore_deprecation_warning():
   239                                                               # TODO use frombuffer or something similar
   240   3000000   72620750.0     24.2     15.3                      weights = fromstring(fin.read(binary_len), dtype=REAL).astype(datatype)
   241   3000000   22272438.0      7.4      4.7                  add_word(word, weights)
   242                                                   else:                  
   243                                                       for line_no in range(vocab_size):
   244                                                           line = fin.readline()
   245                                                           if line == b'':
   246                                                               raise EOFError("unexpected end of input; is count incorrect or file otherwise damaged?")
   247                                                           parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
   248                                                           if len(parts) != vector_size + 1:
   249                                                               raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
   250                                                           word, weights = parts[0], [datatype(x) for x in parts[1:]]
   251                                                           add_word(word, weights)
   252         1          3.0      3.0      0.0      if result.vectors.shape[0] != len(result.vocab):
   253                                                   logger.info(           
   254                                                       "duplicate words detected, shrinking matrix size from %i to %i",
   255                                                       result.vectors.shape[0], len(result.vocab)
   256                                                   )                      
   257                                                   result.vectors = ascontiguousarray(result.vectors[: len(result.vocab)])
   258         1          2.0      2.0      0.0      assert (len(result.vocab), vector_size) == result.vectors.shape
   259                                                                          
   260         1          9.0      9.0      0.0      logger.info("loaded %s matrix from %s", result.vectors.shape, fname)
   261         1          1.0      1.0      0.0      return result              

piskvorky (Owner, Author) commented Nov 8, 2019

Nice! That super slow utils.ignore_deprecation_warning() is unexpected and surprising. CC @mpenkov.

I look forward to reviewing your PR. Do you see much potential for C (Cython) level optimizations left in the loading?

lopusz (Contributor) commented Nov 8, 2019

As far as Cython goes: I did two small tweaks yesterday (converted a custom function to a bytes.find call, and removed one unnecessary slicing operation). That gave the current version a nice 8-second boost. So now the comparison on my laptop reads:

Reading file to bytes 24.5 seconds
Reading file to KeyedVectors (Before) 127.8 seconds
Reading file to KeyedVectors (After) 36.3 seconds

This is not that far from the plain reading & decompression "baseline".

I do not see any more easy wins, even with Cython, but a single pair of eyes is easily fooled in such cases ;)
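As a toy illustration of the bytes.find tweak (my own example, not code from the PR), compare locating the space delimiter with a Python-level loop versus delegating it to the C-level bytes.find:

import timeit

BLOCK = b'some_fairly_long_token' * 4 + b' ' + b'\x00' * 1200  # fake word-space-vector block

def manual_scan(buf):
    # locate the first space byte by iterating in Python
    for i, byte in enumerate(buf):
        if byte == 0x20:  # ASCII space
            return i
    return -1

print('Python loop:', timeit.timeit(lambda: manual_scan(BLOCK), number=100_000))
print('bytes.find :', timeit.timeit(lambda: BLOCK.find(b' '), number=100_000))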

piskvorky pushed a commit that referenced this issue Nov 18, 2019
* Speed up word2vec binary model loading (#2642)

* Add correctness tests for optimized word2vec model loading (#2642)

* Include remarks of Radim to code speeding up vectors loading (#2671)

* Include remarks of Michael to code speeding up vectors loading (#2671)

* Refactor _load_word2vec_format into a few functions for better readability

* Clean-up _add_word_to_result function
piskvorky (Owner, Author) commented Nov 18, 2019

@lopusz for the record: what was your final benchmark timing, using the merged PR? (on the same dataset and HW as your timings above, apples-to-apples)

lopusz added a commit to lopusz/gensim that referenced this issue Nov 20, 2019
lopusz (Contributor) commented Nov 20, 2019

@piskvorky This is actually a very good question, leading to quite a sweet story ;)

The benchmark was done before refactoring.

Of course, I did not re-run the benchmark after the clean-up (it was really just a minor clean-up, right?).

And of course, when I did run the benchmark, prompted by your comment, it turned out that it now runs in ~39 seconds. OK, still not bad, but slightly slower than the 36-37ish seconds I was getting before the refactoring.

I got curious why, and it turned out to be easy to spot. The culprit was an import inside the innermost function (remember, 3,000,000 calls) ;)
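As a toy illustration (my own, not the gensim code) of why an import inside a hot function hurts when that function is called millions of times:

import timeit
from numpy import float32 as float32_at_module_level

def with_local_import():
    from numpy import float32  # re-executed on every call: sys.modules lookup + attribute binding
    return float32

def with_module_level_import():
    return float32_at_module_level  # plain global lookup

print('local import :', timeit.timeit(with_local_import, number=3_000_000))
print('module import:', timeit.timeit(with_module_level_import, number=3_000_000))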

Being 2 seconds slower is not a tragedy, but since it is easy to fix, I propose merging that slightly updated version.

Lesson learned: there are no innocuous clean-ups when it comes to performance, and always be profiling ;)

lopusz (Contributor) commented Nov 20, 2019

PR #2682

piskvorky (Owner, Author) commented:

Thanks! It makes me wonder how much we're losing due to Python (as opposed to Cython) here. You know, 3,000,000x "nothing" adds up quickly…

But if the theoretical minimum is 24s and we're at 37s now, it's likely diminishing returns. 50% overhead (in pure Python) is actually much better than I hoped for or expected.

mpenkov added a commit that referenced this issue Nov 1, 2020
* added release/check_wheels.py (#2610)

* added release/check_wheels.py

* added preamble

* Update release/check_wheels.py

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* respond to review comments

* Add hacktoberfest-related documentation (#2616)

* git add HACKTOBERFEST.md

* clarify contributions

* respond to review comments

* add link to HACKTOBERFEST.md from README.md

* typo

* include comments from Gordon

* Fixed #2554 (#2619)

* Properly install Pattern library for documentation build (#2626)

* Probably fixes #2534

* Uppercase P

* Added comment

* Disable Py2.7 builds under Travis, CircleCI and AppVeyor (#2601)

* Disable Py2.7 builds under Travis and AppVeyor

* use Py3.7.4 image under CircleCI

* tweak circleci config.yml

* patch tox.ini

* more fixes to get docs building under tox

* s/python3.7/python3/

* delay annoy ImportError until actual use

* bring back Pattern

* simplify invokation of pip command

* add install_numpy_scipy.py

* fixup

* use sys.executable

* adjust version in install_wheels.py

* adjust travis.yml

* adjust version in install_wheels.py back

* add logging statements

* use version_info instead of sys.version

* fixup

* Handling for iterables without 0-th element, fixes #2556 (#2629)

* Handling for iterables without 0-th element, fixes #2556

* Improved accessing the first element for the case of big datasets

* Move Py2 deprecation warning to top of changelog (#2627)

It belongs at the top. People should see it immediately without having to scroll down to an older release.

* Change find_interlinks return type to list of tuples (#2636)

* Change interlinks format to list of tuples. Fixes #2635

This commit fixes the issue in #2635

This commit changes the interlinks storage in the `segment_wiki` script from dictionary to a list of tuples.

We can process the test wikidata used in the test suite of gensim to inspect the new behavior.
```
python gensim/scripts/segment_wiki.py -i \
    -f ~/Downloads/enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2 \
    -o ~/Downloads/enwiki-latest.json.gz
```

We get the following output:

```
$ cat ~/Downloads/enwiki-latest.json.gz | zcat | head -1 | jq -r '.interlinks[] | [.[0], .[1]] | @TSV' | sort | head
-ism	-ism
1848 Revolution	1848 Revolution
1917 October Revolution	1917 October Revolution
6 February 1934 crisis	February 1934 riots
A. S. Neill	A. S. Neill
AK Press	AK Press
Abu Hanifa	Abu Hanifa
Adolf Brand	Adolf Brand
Adolf Brand	Adolf Brand
Adolf Hitler	Hitler
```

All tests pass for the related test file.

```
python -m unittest gensim.test.test_scripts
/Users/smishra/miniconda3/envs/TwitterNER/lib/python3.7/bz2.py:131: ResourceWarning: unclosed file <_io.BufferedReader name='/Users/smishra/workspace/codes/python/gensim/gensim/test/test_data/enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2'>
  self._buffer = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.....
----------------------------------------------------------------------
Ran 5 tests in 6.298s

OK
```

* Updated docstrings

* Fixed flake8 issue of long line in docsrtring

* Fixed comments and replaces assertTrue with assertEqual

* Fixed unittest comment and checks for wikicorpus

* Improve gensim documentation (numfocus) (#2591)

* Update makefile to point to new subdirectory

* Update layout.html to show new documentation sections

* introduce sphinx gallery

* reorganize gallery

* trim tut3.rst

* git add docs/to_python.py

* git add gallery/010_tutorials/run_doc2vec_lee.py

* minor layout tweak

* add downloader api howto

* add fasttext tutorial and howto

* use pprint in fasttext tutorial

* add summarization tutorial

* git add gallery/020_howtos/run_howto_compare_lda.py

* add fasttext thumbnails

* adding core concepts tutorial

* add summarization plot

* update notebook to use 20newsgroups

* update notebook

* improve notebook

* update howtos

* fix distance metrics tutorial

* improve distance_metrics.ipynb

* git add gallery/010_tutorials/run_distance_metrics.py

* git add gallery/020_howtos/run_news_classification.py

* move downloader API to tutorials section

* add docs/src/auto_examples so bindr can pick up the notebooks

* minor changes

* git add gallery/010_tutorials/run_lda.py

* more minor changes

* More minor changes

* git add gallery/010_tutorials/run_word2vec.py

* updated notebooks

* git add gallery/010_tutorials/run_wmd.py

* add image

* move parts of intro.rst to core concepts tutorial

* move README.txt to wiki

* get rid of fasttext wrapper tutorial

* update top-level heading

* more minor changes

* minor updates

* improve Doc2Vec tutorial, move explanations from IMDB

* git add gallery/020_howtos/run_doc2vec_imdb.py

* git st

* fix notebook paths for bindr

* rename gallery to documentation

* git add binder/requirements.txt

* git add auto_examples/000_core/requirements.txt

* adding requirements.txt for binder

* removing requirements files added in desperation

* update conf.py

* remove temporary files from git branch

* rm images

* merge "getting started" into "core concepts"

* add some clarifying text

* add Jupyter notebook

* Revert "get rid of fasttext wrapper tutorial"

This reverts commit 3ec0a46.

* get rid of fasttext wrapper guide

* git add auto_examples/

* minor fixes

* fix typo

* add listing of corpora and models

* get rid of binder

* git add gallery/020_howtos/run_doc.py

* more instructions for authorship

* improve linkage between core tutorials

* add highlighting

* move downloader to howto

* restore support and about sections

* sync toolbars

* Add installation instructions to top page

* clean up html

* add wordcloud-based thumbnails

* updated notebooks

* update script

* add sphinx-gallery to doc dependencies

* include memory_profiler in docs_testenv

* git add README.rst

* use proper temporary file

* reorganize tutorials section

* clarify version control in README.rst

* git rm 020_howtos/saved_model_wrapper

* move pivoted document normalization to tutorials section

* fix ordering in howto section

* add images

* add annoy to doc dependencies

* update gitignore

* disable tox spinner

* turn off progress bar for pip

* fix labels

* naming fixes

* git rm docs/notebooks/gensim\ Quick\ Start.ipynb

* git rm docs/notebooks/Corpora_and_Vector_Spaces.ipynb

* git rm gensim\ Quick\ Start.ipynb

* git rm docs/notebooks/Topics_and_Transformations.ipynb

* git rm docs/notebooks/Similarity_Queries.ipynb

* git rm docs/notebooks/summarization_tutorial.ipynb

* git rm docs/notebooks/distance_metrics.ipynb

* git rm docs/notebooks/word2vec.ipynb

* git rm docs/notebooks/doc2vec-lee.ipynb

* git rm docs/notebooks/gensim_news_classification.ipynb

* git rm docs/notebooks/lda_training_tips.ipynb

* git rm docs/notebooks/doc2vec-IMDB.ipynb

* git rm docs/notebooks/annoytutorial.ipynb

* git rm tutorial.rst tut1.rst tut2.rst tut3.rst

* minor update to layout.html

* git rm changes_080.rst

* minor tweaks to gallery and surrounding docs

* remove cruft from run_doc2vec_imdb.py

* update doc howto

* fixup

* git add requirements_docs.txt

* more dependencies in requirements_docs.txt

* re-enable LDA howto

* add missing images

* add built LDA howto

* port tutorials.md to gallery

* WIP: cleaning up docs

* language clean up + pin exact versions in doc requirements

* git add redirects.csv test_redirects.py

* remove gensim_numfocus namespace qualifier

* doc cleanup in Other resources

* fix redirects

* regenerated tutorials

* Added tools/check_gallery.py

* committing unsuccessful attempt to fix a tutorial before deleting it

* remove tutorials that don't work

* index page fixes

* add install anchor

* Update redirects.csv

* link fixes from local testing

* replace easy_install with pip

* renamed run_040_compare_lda.py to run_compare_lda.py

* minor fixes

* more fixes from website testing

* updating wordcloud images

* add pandas to requirements_docs.txt

* !!

* more dependency + code fixes

* update upload path to "live" website

* update test_redirects.py

* git rm redirects.csv test_redirects.py

* fix setup.py to get documentation to build under CircleCI (#2650)

* Fix links to documentation in README.md (#2646)

* Fix links to documentation in README.md

* Update README.md

* Delete requirements.txt (#2648)

* Remove native Python implementations of Cython extensions (#2630)

* Remove native Python implementations of Cython extensions

Fix #2511

* remove print statement in tox.ini

* remove print statement in tox.ini

* fix flake8 issues

* fix missing imports

* adjust exception message

* bring back FAST_VERSION variable

* fixup: missing parens

* disable progress bar for tox

* respond to review comments

* remove C/C++ sources generated from Cython files

* update setup.py

* remove duplicate line in setup.py

* fix numpy bootstrapping

* update tox.ini

* handle cython dependency in setup.py

* fixup in setup.py: lowercase c

* more cython sourcery

* fix tox.ini

* Fix merge artifact in setup.py

* fix merge artifact

* disable pip progress bar under CircleCI

* replacing deleted notebooks with placeholders (#2654)

* Document accessing model's vocabulary (#2661)

* document accessing model's vocabulary

* update images

* Improve explanation of top_chain_var parameter in Dynamic Topic Model (DTM) documentation

* improve & corrected gensim documentation (#2637)

* more descriptive explanation of top_chain_var

* Comment out Hacktober Fest from README (#2677)

- uncomment next year

* Update word2vec2tensor.py (#2678)

* Speed up word2vec model loading (#2671)

* Speed up word2vec binary model loading (#2642)

* Add correctness tests for optimized word2vec model loading (#2642)

* Include remarks of Radim to code speeding up vectors loading (#2671)

* Include remarks of Michael to code speeding up vectors loading (#2671)

* Refactor _load_word2vec_format into a few functions for better readability

* Clean-up _add_word_to_result function

* Fix local import degrading the performance of word2vec model loading (#2671) (#2682)

* [Issue-2670] Bug fix: Initialize doc_no2 because it is not set when corpus' is empty (#2672)

* [Issue-2670] Bug fix: Initialize doc_no2 because it is not set when 'corpus' is empty

* [Issue-2670] Add: unittests should fail on invalid input (generator and empty corpus)

* [Issue-2670] Add: Fix unittest for generator

* [Issue-2670] Fix unittest tox:flake8 errors

* [Issue-2670] Fix: empty corpus def in unittest

* [Issue-2670] Fix: empty corpus and generator unittests

* [Issue-2670] Fix: empty corpus and generator unittests

* Warn when BM25.average_idf < 0 (#2687)

Closes #2684

* Rerun Soft Cosine Measure tutorial notebook (#2691)

* Fix simple typo: voacab -> vocab (#2719)

Closes #2718

* Fix appveyor builds (#2706)

* move install_wheels script

* git add continuous_integration/check_wheels.py

* bump versions for numpy and scipy

* update old requirements.txt

* add file header

* get rid of install_wheels.py hack

* fixup: update travis.yml

* Update continuous_integration/check_wheels.py

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* Update continuous_integration/check_wheels.py

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* Change similarity strategy when finding n best (#2720)

* Find largest by absolute value

* Add helper function to simplify code & add unit test for it

* Initialize self.cfs in Dictionary.compatify method (#2618)

* Fix for #2574

* Fix for #2574

* Fix ValueError when instantiating SparseTermSimilarityMatrix (#2689)

* force python int before calling islice. islice don't accept numpy int

* add test to check islice error

* it makes test to fail

* make sure that islice receives a python int

* fix typo

* Refactor bm25 to include model parametrization (cont.) (#2722)

* Refactor bm25 to include model parametrization

* Refactor constants back and fix typo

* Refactor parameters order and description

* Add BM25 tests
This closes #2597 and closes #2606

* Simplify asserts in BM25 tests

* Refactor BM25.get_score

Co-authored-by: Marcelo d'Almeida <md@id.uff.br>

* Fix overflow error for `*Vec` corpusfile-based training (#2700)

* long long types for expected_examples & total_documents

* regenerate .cpp files

* Implement saving to Facebook format (#2712)

* Add writing header for binary FB format (#2611)

* Adding writing vocabulary, vectors, output layer for FB format (#2611)

* Clean up writing to binary FB format (#2611)

* Adding tests for saving FastText models to binary FB format (#2611)

* Extending tests for saving FastText models to binary FB format (#2611)

* Clean up (flake8) writing to binary FB format (#2611)

* Word count bug fix + including additional test (#2611)

* Removing f-strings for Python 3.5 compatibility + clean-up(#2611)

* Clean up the comments (#2611)

* Removing forgotten f-string for Python 3.5 compatibility (#2611)

* Correct tests failing @ CI (#2611)

* Another attempt to correct tests failing @ CI (#2611)

* Yet another attempt to correct tests failing @ CI (#2611)

* New attempt to correct tests failing @ CI (#2611)

* Fix accidentally broken test (#2611)

* Include Radim remarks to saving models in binary FB format (#2611)

* Correcting loss bug (#2611)

* Completed correcting loss bug (#2611)

* Correcting breaking doc building bug (#2611)

* Include first batch of Michael remarks

* Refactoring SaveFacebookFormatRoundtripModelToModelTest according to Michael remarks (#2611)

* Refactoring remaining tests according to Michael remarks (#2611)

* Cleaning up the test refactoring (#2611)

* Refactoring handling tuple result from struct.unpack (#2611)

* Removing unused import (#2611)

* Refactoring variable name according to Michael review (#2611)

* Removing redundant saving in test for Facebook binary saving (#2611)

* Minimizing context manager blocks span (#2611)

* Remove obsolete comment (#2611)

* Shortening method name (#2611)

* Moving model parameters to _check_roundtrip function (#2611)

* Finished moving model parameters to _check_roundtrip function (#2611)

* Clean-up FT_HOME behaviour (#2611)

* Simplifying vectors equality check (#2611)

* Unifying testing method names (#2611)

* Refactoring _create_and_save_fb_model method name (#2611)

* Refactoring test names (#2611)

* Refactoring flake8 errors (#2611)

* Correcting fasttext invocation handling (#2611)

* Removing _parse_wordvectors function (#2611)

* Correcting whitespace and simplifying test assertion (#2611)

* Removing redundant anonymous variable (#2611)

* Moving assertion outside of a context manager (#2611)

* Function rename (#2611)

* Cleaning doc strings and comments in FB binary format saving functionality (#2611)

* Cleaning doc strings in end-user API for FB binary format saving (#2611)

* Correcting FT_CMD execution in SaveFacebookByteIdentityTest (#2611)

* Use time.time instead of time.clock in gensim/models/hdpmodel.py (#2730)

* Use time.process_time() instead of time.clock()

* time.process_time() -> time.time()

* better replacement of deprecated .clock()

* drop py35, add py38 (travis), update explicit dependency versions

* better CI logs w/ gdb after core dump

* improved comments via piskvorky review

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* rm autogenerated *.cpp files that shouldn't be in source control

* Fix TypeError when using the -m flag (#2734)

Currently, if you attempt to use the script with the --min-article-character you get an error because it gets parsed a string and the functions expect an int. This fix addresses the issue.

```
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/gensim/scripts/segment_wiki.py", line 385, in <module>
    include_interlinks=args.include_interlinks
  File "/usr/local/lib/python3.6/dist-packages/gensim/scripts/segment_wiki.py", line 141, in segment_and_write_all_articles
    for idx, article in enumerate(article_stream):
  File "/usr/local/lib/python3.6/dist-packages/gensim/scripts/segment_wiki.py", line 100, in segment_all_articles
    for article in wiki_sections_text:
  File "/usr/local/lib/python3.6/dist-packages/gensim/scripts/segment_wiki.py", line 332, in get_texts_with_sections
    if sum(len(body.strip()) for (_, body) in sections) < self.min_article_character:
TypeError: '<' not supported between instances of 'int' and 'str'```

* del cython.sh

* Improve documentation in run_similarity_queries example (#2770)

* Fix fastText word_vec() for OOV words with use_norm=True (#2764)

* add a test for oov similarity

* fix a test for oov similarity

* fix it once more

* prepare the real fix

* remove a redundant variable

* less accurate comparison

Co-authored-by: David Dale <ddale@yandex-team.ru>

* remove mention of py27 (#2751)

on 25 oct 2019, setup.py was updated to require python 3.5. this change removes the suggestion of testing against py27.

* Fix KeyedVectors.add matrix type (#2761)

* add type test

* cast internal state to passed type

* ekv -> kv

* parametrize datatype & cast embeddings passed to `add` to KV datatype

* set f32 as default type

Co-authored-by: Ivan Menshikh <imenshikh@embedika.ru>
Co-authored-by: Michael Penkov <m@penkov.dev>

* use collections.abc for Mapping (#2750)

* use collections.abc.Mapping when available

* ignore py2, tox -e py27-linux revealed setup.py requires python 3.5

* use collections.abc.Iterable

* Fix out of range issue in gensim.summarization.keywords (#2738)

* Fixed out of range error in keywords.py

* Now using min() function to improve readability

* Added a test to make sure that keywords does not
fail when words param is greater than number
of words in string

* Fixing travisCI build error from not having 2  lines after class definition

* Fixed whitespace issue for flake8

Co-authored-by: Carter Olsen <olsencar@oregonstate.edu>

* fixed get_keras_embedding, now accepts word mapping (#2676)

* fixed get_keras_embedding, now accepts word mapping

* skip tests if keras not installed

* removed unnessecary comment from test_keyed_vectors

* fixed indentation

* fixed flake import error

* moved skip test decorator to class

* Update gensim/models/keyedvectors.py

Co-Authored-By: Michael Penkov <m@penkov.dev>

* Update gensim/models/keyedvectors.py

Co-Authored-By: Michael Penkov <m@penkov.dev>

* Update gensim/models/keyedvectors.py

Co-Authored-By: Michael Penkov <m@penkov.dev>

* renamed keras_installed flag to upper case, removed unneeded comment

Co-authored-by: Zhicharevich <Alex_Zhicharevich@intuit.com>
Co-authored-by: Michael Penkov <m@penkov.dev>

* Add downloads badge to README

- idea from piskvorky/smart_open#440

* Get rid of "wheels" badge

* link downloads badge to pepy instead of pypi

* fix broken english in tests (#2773)

* fix build, use KeyedVectors class (#2774)

* cElementTree has been deprecated since Python 3.3 and removed in Python 3.9.

* Fix FastText RAM usage in tests (+ fixes for wheel building) (#2791)

* pin `bucket` parameter (to avoid RAM issues on CI system) + get rid win32 skip

* fix flake8

* partially fix doc building

* better workaround for docs build

* fix sphinx-gallery

* avoid test error

* get back loading of old model (because large buckets)

* Update setup.py

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* Update gensim/test/test_fasttext.py

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* define missing buckets & fix formatting

Co-authored-by: Ivan Menshikh <imenshikh@embedika.ru>
Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* Fix typo in comments\nThe rows of the corpus are actually documents, fix the comment to reduce confusion

* Add osx+py38 case for avoid multiprocessing issue (#2800)

* add osx+py38 case for avoid multiprocessing issue

* add comment, fix warning

* extend comment

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* Update gensim/utils.py

* Update gensim/utils.py

Co-Authored-By: Michael Penkov <m@penkov.dev>

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
Co-authored-by: Michael Penkov <m@penkov.dev>

* Use nicer twitter badge

* Use downloads badge from shields.io

* Use blue in badges

* Remove conda-forge badge

* Make twitter badge blue, too

* Cache badges

- use google's caching proxy for img.shields.io badges
- fixes #2805

* Use HTML comments instead of Markdown comment

- simpler & easier to read and maintain

* [MRG] Update README instructions + clean up testing (#2814)

* update README instructions

* WIP: enable test deps

* unpin old tensorflow in tests
- old versions not present in newer Pythons

* looking into segfault in py3.6
- https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/681096362

* put back pyemd

* put keras back

* put back tensorflow

* investigate segfault in py3.6

* address review comments

* avoid py3.6 segfault in Travis tests

* Add basic yml file for setup pipeline (will fail)

* revert back travis

* Replace AppVeyor by Azure Pipelines (#2824)

* dummy change of pipeline file

* good bye appveyor

* no specific trigger

* attempt to trigger by PR

* ???

* [REVERT ME] Specify only tests that fails

* get platform & stay only single test

* try to debug

* fail fast

* no re-runs

* meh

* hack?

* Revert "good bye appveyor" (FOR COMPARISON WITH AZURE)

This reverts commit cd57175.

* try to understand, where \r comes from

* continue debug

* raise

* upd

* okay, try to avoid CRLF

* revert back

* bye again, appveyor

* delete appveyor stuff too

* move visdom to linux-only env (to avoid frequent failures on win)

* fix docs building

* Update CHANGELOG.md (#2829)

* Update CHANGELOG.md (#2831)

* Fix-2253: Remove docker folder since it fails to build   (#2833)

* Removed Docker from gensim since docker image fails to build and there's nobody to maintain docker

* Remove irrelevant comment about docker

* LdaModel documentation update -remove claim that it accepts CSC matrix as input (#2832)

* Update LDA model documentation to remove the claim that LDA accepts CSC matrices as an input

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>

* delete .gitattributes (#2836)

* delete .gitattributes

* disable certain tests on Azure pipelines

* tweak env var behavior

* disable one more test

* make the newest version of flake8 happy

* patch tox.ini to pin flake8 and flake8-rst versions

Co-authored-by: Michael Penkov <m@penkov.dev>

* Fix for Python 3.9/3.10: remove xml.etree.cElementTree (#2846)

* Update classifiers in setup.py

* Python 3.3+ uses a fast implementation whenever available

* Don't import ElementTree as ET

* Correct grammar in docs (#2573)

* Correct grammar in docs

* Update gensim/scripts/glove2word2vec.py

Co-Authored-By: Radim Řehůřek <me@radimrehurek.com>

* Update gensim/scripts/glove2word2vec.py

Co-Authored-By: Michael Penkov <m@penkov.dev>

* Update glove2word2vec.py

* Update glove2word2vec.py

* Update gensim/scripts/glove2word2vec.py

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
Co-authored-by: Michael Penkov <m@penkov.dev>

* Don't proxy-cache badges with Google Images (#2854)

* pin keras=2.3.1 because 2.4.3 causes KerasWord2VecWrappper test failure in Py 3.8 (#2868)

work around for #2865

* Expose max_final_vocab parameter in FastText constructor (#2867)

* Expose max_final_vocab parameter in FastText constructor

* Fix lint error

* respond to reviewer comments

* add unit test

Co-authored-by: Cristi Burca <mail@scribu.net>

* Replace numpy.random.RandomState with SFC64 - for speed (#2864)

Co-authored-by: Marcin Cylke <marcin.cylke@allegro.pl>

* Update CHANGELOG.md

* Clarify that license is LGPL-2.1 (#2871)

Use only the license attribute with an SPDX license id for V2.1 of the
LGPL.

The setup.py classifier was inconsistent with the license attribute as 
one pointed to LGPL-2.0-or-later and one to LGPL-2.1-only which seems 
to be otherwise the correct license based on header comments and
documentation. Classifiers do not support the LGPL v2.1 and are 
eventually being subsumed by the license field [1]

[1] https://discuss.python.org/t/improving-license-clarity-with-better-package-metadata/2154

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Fix travis issues for latest keras versions. (#2869)

* Fix travis issues for latest keras versions.

* Unpin keras version.

* verify "inputs" and "outputs" named params on keras 2.3.1

* Unpin keras version

* Put cell outputs back to the soft cosine measure benchmark notebook (#2808)

Revert bcee414

* KeyedVectors  & *2Vec API streamlining, consistency (#2698)

* slim low-value warnings

* clarify vectors/vectors_vocab relationship; fix lockf & nonsense ngram-norming confusion

* mv FT, KV tests to right place

* rm deprecations, obsolete refs/tests, delete_temporary_training_data, update usages

* update usages, tests, flake8 cleanup

* expand KeyedVectors to obviate Doc2VecKeyedVectors; upconvert old offset-style doctags

* fix docstring warnings; update usages

* rm unused old plain-python codepaths

* unify class comments under __init__ for consistncy w/ api doc presentation

* name/comment harmonization (rm 'entity', lessen 'word'-centricity)

* table formatting

* return pyemd to linux test env

* split backcompat tests for better resolution

* convert Vocab & related data items to use dataclasses

* rm obsolete Vocab/Trainable/abstract/Wrapper classes, persistent callbacks (bug #2136), outdated tests/warnings; update usages

* tune tests for stability, runtimes; rm auto reruns that hide flakiness

* fix numpy FutureWarning: arrays to stack must be sequence

* (commented-out) deoptimization option

* stronger FB model testing; no _unpack_copy test

* merge redundant methods; rm duplicated imports/defs

* rationalize _lockf, buckets_word behaviors

* rename .docvecs to .dv

* update usages; rm obsolete tests; restore gensim.utils import

* intensify FT tests (more epochs, more buckets)

* flake8-3.8.0 style fixes - but also pin flake8-3.7.9 vs 3.8.0 'output_file' error

* replace vectors_norm with 1d norms

* tighten testParallel

* rm .vocab & 'Vocab' classes; add expandable 'vecattrs'

* update usages (no vocabs)

* enable running inside '-m mtprof' (or cProfile) via explicit unittest.main(module=..)

* faster sample_int reads

* load_word2vec_format(.., no_header=True) to support GLoVe text vectors

* refactor & comment lockf feature; allow single-element lockf

* improve FT comment

* rm deprecated/unneded init_sims calls

* fixes to code style

* flake8: fix overlong lines

* rm stray merge error

* rm duplicated , old nonstandard hash workarounds

* use numpy-recommended PRNG constructor

* add sg to FastTextConfig & consult it; rm remaining broken-hash cruft

* reorg conditional packages for clarity

* comments, names, refactoring, randomization

* Apply suggestions from code review

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* fix cruft left from suggestion

* fix numpy-32bit-on-Windows; executable docs

* mv lee_corpus to utils; cleanup

* update poincare for latest KV __init__ signature

* restore word_vec method for proper overriding, but rm usages

* Apply suggestions from code review

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* adjust testParallel against failure risk

* intensify training for an occasionally failing test

* clarify word/char ngrams handling; rm outdated comments

* mostly avoid duplciating FastTextConfig fields into locals

* avoid copies/pointers for no-bucket (FT as W2V) case

* rm obsolete test (already skipped & somewhat originally misguided)

* simpler/faster .get(..., default) (avoids exception-catching in has_index_for)

* add default option to get_index; avoid exception in has_index_for

* chained range check

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* Update CHANGELOG.md

Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
Co-authored-by: Michael Penkov <m@penkov.dev>

* Delete .gitattributes

* test showing FT failure as W2V

* set .vectors even when ngrams off

* use _save_specials/_load_specials per type

* Make docs clearer on `alpha` parameter in LDA model

* Update Hoffman paper link

* rm whitespace

* Update gensim/models/ldamodel.py

* Update gensim/models/ldamodel.py

* Update gensim/models/ldamodel.py

* re-applying changes from #2821

* migrating + regenerating changed docs

* fix forgotten iteritems

* remove extra `model.wv`

* split overlong doc line

* get rid of six in doc2vec

* increase test timeout for Visdom server

* add 32/64 bits report

* add deprecations for init_sims()

* remove vectors_norm + add link to migration guide to deprecation warnings

* rename vectors_norm everywhere, update tests, regen docs

* put back no-op property setter of deprecated vectors_norm

* fix typo

* fix flake8

* disable Keras tests
- failing with weird errors on py3.7+3.8, see https://travis-ci.org/github/RaRe-Technologies/gensim/jobs/713448950#L862

* test showing FT failure as W2V

* set .vectors even when ngrams off

* Update gensim/test/test_fasttext.py

* Update gensim/test/test_fasttext.py

* refresh docs for run_annoy tutorial

* Reduce memory use of the term similarity matrix constructor, deprecate the positive_definite parameter, and extend normalization capabilities of the inner_product method (#2783)

* Deprecate SparseTermSimilarityMatrix's positive_definite parameter

* Reference paper on efficient implementation of soft cosine similarity

* Add example with Annoy indexer to SparseTermSimilarityMatrix

* Add example of obtaining word embeddings from SparseTermSimilarityMatrix

* Reduce space complexity of SparseTermSimilarityMatrix construction
Build matrix using arrays and bitfields rather than DOK sparse format

This work is based on the following blog post by @maciejkula:
https://maciejkula.github.io/2015/02/22/incremental-construction-of-sparse-matrices/
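
A minimal sketch of the idea from the linked post: accumulate coordinates in flat arrays and convert once, instead of filling a DOK dict entry by entry (the data below is made up):

import numpy as np
from scipy.sparse import coo_matrix

rows, cols, data = [], [], []
for row, col, value in [(0, 1, 0.5), (1, 0, 0.5), (2, 2, 1.0)]:
    rows.append(row)
    cols.append(col)
    data.append(value)

# One conversion at the end avoids the per-entry overhead of a DOK dictionary.
matrix = coo_matrix(
    (np.asarray(data, dtype=np.float32), (np.asarray(rows), np.asarray(cols))),
    shape=(3, 3),
).tocsr()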

* Fix a typo in the soft cosine similarity Jupyter notebook

* Add human-readable string representation for TermSimilarityIndex

* Avoid sparse term similarity matrix computation when nonzero_limit <= 0

* Extend normalization in the inner_product method

Support the `maintain` vector normalization scheme.
Support separate vector normalization schemes for queries and documents.
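
A toy sketch of the extended normalized argument, using a uniform term-similarity index so the example stays self-contained (see the inner_product docstring for the exact semantics of "maintain"):

from gensim.corpora import Dictionary
from gensim.similarities import SparseTermSimilarityMatrix, UniformTermSimilarityIndex

dictionary = Dictionary([["hello", "world"], ["hello", "there"]])
index = UniformTermSimilarityIndex(dictionary, term_similarity=0.5)
matrix = SparseTermSimilarityMatrix(index, dictionary)

query = dictionary.doc2bow(["hello", "world"])
document = dictionary.doc2bow(["hello", "there"])

# normalized is now a (query, document) pair; each side may be True, False,
# or the string "maintain".
print(matrix.inner_product(query, document, normalized=("maintain", True)))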

* Remove a note in the docstring of SparseTermSimilarityMatrix

* Rerun continuous integration tests

* Use ==/!= to compare constant literals

* Add human-readable string representation for TermSimilarityIndex (cont.)

* Prod flake8 with a coding style violation in a docstring

* Collapse two lambdas into one internal function

* Revert "Prod flake8 with a coding style violation in a docstring"

This reverts commit 6557b84.

* Avoid str.format()

* Slice SparseTermSimilarityMatrix.inner_product tests by input types

* Remove similarity_type_code local variable

* Remove starting underscore from local function name

* Save indentation level and define populate_buffers function

* Extract SparseTermSimilarityMatrix constructor body to _create_source

* Extract NON_NEGATIVE_NORM_ASSERTION_MESSAGE to a module-level constant

* Extract cell assignment logic to cell_full local function

* Split variable swapping into three separate statements

* Extract normalization from the body of SparseTermSimilarityMatrix.inner_product

* Wrap overlong line

* Add test_inner_product_zerovector_zerovector and test_inner_product_zerovector_vector tests

* Further split test_inner_product into 63 test cases

* Raise ValueError when dictionary is empty

* Fix doc2vec crash for large sets of doc-vectors (#2907)

* Fix AttributeError in WikiCorpus (#2901)

* bug fix: wikicorpus getstream from data file-path; replace fname with input

* refactor: use property decorator for input

Co-authored-by: jshah02 <jenisnehal.shah@factset.com>

* Corrected info about elements of the job queue

* Add unused args of `_update_alpha`

* intensify cbow+hs tests; bulk testing method

* use increment operator

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* Change num_words to topn in dtm_coherence (#2926)

* Integrate what is essentially the same process

* docstring fixes

* get rid of python2 constructs

* Remove Keras dependency (#2937)

* remove keras dependency
- relegate keras exports to FAQ: https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q13-how-do-i-export-a-trained-word2vec-model-to-keras

* remove forgotten notebook with keras

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>

* code style fixes while debugging pickle model sizes

* py2 to 3: get rid of forgotten range

* fix docs

* get rid of numpy.str_

* Fix deprecations in SoftCosineSimilarity (#2940)

* Remove deprecated Soft Cosine Measure parameters, functions, and tests.

Here is a detailed list of the deprecations:
- Parameter `positive_definite` of `SparseTermSimilarityMatrix` has been
  renamed to `dominant`. Test `test_positive_definite` has been removed.
- Parameter `similarity_matrix` of `SoftCosineSimilarity` no longer
  accepts unencapsulated sparse matrices.
- Parameter `normalized` of `SparseTermSimilarityMatrix.inner_product`
  no longer accepts booleans.
- Function `matutils.softcossim` has been superseded by method
  `SparseTermSimilarityMatrix.inner_product`. Tests in
  `TestSoftCosineSimilarity` have been removed.
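
A hedged before/after sketch of the affected call sites (the deprecated forms appear only as comments):

from gensim.corpora import Dictionary
from gensim.similarities import SparseTermSimilarityMatrix, UniformTermSimilarityIndex

dictionary = Dictionary([["soft", "cosine", "measure"]])
index = UniformTermSimilarityIndex(dictionary, term_similarity=0.3)

# Before: SparseTermSimilarityMatrix(index, dictionary, positive_definite=True)
matrix = SparseTermSimilarityMatrix(index, dictionary, dominant=True)

# Before: matutils.softcossim(vec1, vec2, similarity_matrix)
vec1 = dictionary.doc2bow(["soft", "cosine"])
vec2 = dictionary.doc2bow(["cosine", "measure"])
score = matrix.inner_product(vec1, vec2, normalized=(True, True))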

* Remove unused imports

* Fix additional warnings from the CI test suite

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>

* Fix "generator" language in word2vec docs (#2935)

* Fix docs about Word2Vec (fix #2934)

Docs say you can use a generator as the first argument, but you can't.

The tempfile path was also unused, so that's been removed.

* Fix language to make it clear streaming is supported

Technically a generator is a kind of iterator, so this clarifies that a
restartable iterator (as opposed to a consumable generator) is
necessary.
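
A small sketch of the difference, with a made-up two-sentence corpus:

from gensim.models import Word2Vec

def sentence_generator():
    # A generator is consumed after one pass, so Word2Vec cannot scan it for the
    # vocabulary and then iterate it again for each training epoch.
    yield ["first", "sentence"]
    yield ["second", "sentence"]

class SentenceCorpus:
    """A restartable iterable: every __iter__ call starts from the beginning."""
    def __iter__(self):
        yield ["first", "sentence"]
        yield ["second", "sentence"]

# Word2Vec(sentences=sentence_generator(), ...)  # exhausted after a single pass
model = Word2Vec(sentences=SentenceCorpus(), vector_size=10, min_count=1, epochs=2)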

* Update gensim/models/word2vec.py

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>

* Bump minimum Python version to 3.6 (#2947)

* remove claims of Python 3.5 support

brings `setup.py` into sync with #2713 & #2715 changes

* remove py2.7 and py3.5 from web index page

* Update CHANGELOG.md

Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
Co-authored-by: Michael Penkov <m@penkov.dev>

* fix index2entity, fix docs, hard-fail deprecated properties

* fix typos + more doc fixes + fix failing tests

* more index2word => index_to_key fixes

* finish method renaming
- add() => add_vectors()
- add_one() => add_vector()
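
A sketch of the renamed methods on KeyedVectors (keys and vectors below are made up):

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors(vector_size=3)

# add_vector() (formerly add_one()) inserts a single key.
kv.add_vector("apple", np.array([0.1, 0.2, 0.3], dtype=np.float32))

# add_vectors() (formerly add()) inserts several keys at once.
kv.add_vectors(
    ["banana", "cherry"],
    np.array([[0.3, 0.2, 0.1], [0.0, 0.1, 0.2]], dtype=np.float32),
)
print(kv["apple"])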

* Update gensim/models/word2vec.py

Co-authored-by: Michael Penkov <m@penkov.dev>

* a few more style fixes

* fix nonsensical word2vec path examples

* more doc fixes

* `it` => `itertools`, + code style fixes

* Refactor ldamulticore to serialize less data (#2300)

* fix

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>

* new docs theme

* redo copy on web index page

* fix docs in KeyedVectors

* clean up docs structure

* homepage header update, social panel and new favicon

* fix flake8

* reduce space under code section

* fix images in core tutorials

* WIP: migrating tutorials to 4.0

* fix doc2vec tutorial FIXMEs

* add autogenerated docs

* fixing flake8 errors

* remove gensim.summarization subpackage, docs and test data (#2958)

* remove gensim.summarization subpackage, docs and test data

* Update changelog

* remove old import

* Remove distance metrics and pivoted normalization tutorials

* reuse from test.utils

* test re-saving-native-FT after update-vocab (#2853)

* avoid buggy shared list use (#2943)

* pre-assert save_facebook_model anomaly

* unittest.skipIf instead of pytest.skipIf

* refactor init/update vectors/vectors_vocab; bulk randomization

* unify/correct Word2Vec & FastText corpus/train parameter checking

* suggestions from code review

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

* improve train() corpus_iterable parameter doc-comment

* disable pytest-rerunfailures due to pytest-dev/pytest-rerunfailures#128

* comment clarity from review

* specify dtype to avoid interim float64

* use inefficient-but-all-tests-pass 'uniform' for now, w/ big FIXME comment

* refactor phrases

* float32 random; diversified dv seed; disable bad test

* double-backticks

Co-authored-by: Michael Penkov <m@penkov.dev>

* inline seed diversifier; unittest.skip

* fix phrases tests

* clean up rendered docs for phrases

* fix sklearn_api.phrases tests + docs
- removed testing of loading of old models for backward compatibility, because the wrappers use plain pickle and so don't support SaveLoad overrides

* fix flake8 warnings in docstrings

* rename export_phrases to find_phrases + add actual export_phrases

* skip common english words by default in phrases

* sphinx doesn't allow custom section titles :(

* use FIXME for comments/doc-comments/names that must change pre-4.0.0

* ignore conjunctions in phrases

* make ENGLISH_COMMON_TERMS optional

* fix typo

* docs: use full version as the "short version"

* phrases: rename common_terms => connector_words
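
A sketch of the reworked Phrases API after these renames, using a toy corpus (parameter values chosen only so the toy phrase is detected):

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

sentences = [
    ["bank", "of", "america", "is", "a", "bank"],
    ["bank", "of", "america", "raised", "rates"],
]

# connector_words (formerly common_terms) lets detected phrases span filler
# words such as "of" without counting them as phrase components.
phrases = Phrases(sentences, min_count=1, threshold=0.1,
                  connector_words=ENGLISH_CONNECTOR_WORDS)

print(phrases.find_phrases(sentences))  # phrases found in the given corpus
print(phrases.export_phrases())         # all known phrases with their scores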

* fix typo

* ReST does not support nested markup

* make flake8 shut up

* improve HTML doc formatting for consecutive paragraphs

* fix typos

* add benchmark script

* silence flake8

* remove dependency on `six`

* regen tutorials

* Notification at the top of the page in the documentation

* Update notification.html

* Update changelog for 4.0.0 release (#2981)

* update changelog for 4.0.0 release

* fixup

* wip: cleaning up changelog

* extend + clean up changelog

* note the removal of deprecations in CHANGELOG

* finish CHANGELOG
- except removed modules, pending info from @mpenkov

* CHANGELOG formatting fixes

* fix outdated docs
- found while updating the migration guide

* update migration hyperlinks

* fixing fixable FIXMEs, in preparation for 4.0.0beta

* fixing iter + size in docstrings

* fix typo

* clean up logic & docs for KeyedVectors.save_word2vec_format

* flake8 fix

* py3k: `class X(object):` -> `class X:`

* work around issues with flake8-rst

* add issues without a PR

* improve changelog script

* simplify pagination

* more flake8-rst fixing

Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>

* bumped version to 4.0.0beta

* remove reference to cython.sh

* update link in readme

* clean up merge artifact

Co-authored-by: SanthoshBala18 <santhoshbala18@gmail.com>
Co-authored-by: Kirill Malev <playittodeath@gmail.com>
Co-authored-by: Shubhanshu Mishra <shubhanshumishra@gmail.com>
Co-authored-by: Joel Ong <ong.joel.94@gmail.com>
Co-authored-by: Radim Řehůřek <radimrehurek@seznam.cz>
Co-authored-by: Karthik Ravi <hikarthikravi@gmail.com>
Co-authored-by: lopusz <lopusz@users.noreply.github.com>
Co-authored-by: Paul Rigor <paulrigor@users.noreply.github.com>
Co-authored-by: Vít Novotný <witiko@mail.muni.cz>
Co-authored-by: Tim Gates <tim.gates@iress.com>
Co-authored-by: Radim Řehůřek <me@radimrehurek.com>
Co-authored-by: Rubanenko Evgeny <erubanenko@gmail.com>
Co-authored-by: Pablo Torres <pablo.torres.t@gmail.com>
Co-authored-by: Marcelo d'Almeida <md@id.uff.br>
Co-authored-by: Dmitry Persiyanov <persiyanov@phystech.edu>
Co-authored-by: Wataru Hirota <nobuyoshi2426@gmail.com>
Co-authored-by: Gordon Mohr <gojogit@gmail.com>
Co-authored-by: Tenoke <sviltodorov@gmail.com>
Co-authored-by: Martino Mensio <martinomensio@outlook.it>
Co-authored-by: David Dale <dale.david@mail.ru>
Co-authored-by: David Dale <ddale@yandex-team.ru>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
Co-authored-by: Ivan Menshikh <menshikh.iv@gmail.com>
Co-authored-by: Ivan Menshikh <imenshikh@embedika.ru>
Co-authored-by: carterols <38794079+carterols@users.noreply.github.com>
Co-authored-by: Carter Olsen <olsencar@oregonstate.edu>
Co-authored-by: Hamekoded <alex.zhicharevich@gmail.com>
Co-authored-by: Zhicharevich <Alex_Zhicharevich@intuit.com>
Co-authored-by: Karthikeyan Singaravelan <tir.karthi@gmail.com>
Co-authored-by: Chenxin-Guo <cg633@cornell.edu>
Co-authored-by: Faiyaz Hasan <faiyaz.hasan1@gmail.com>
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
Co-authored-by: Shiv Dhar <shivdhar@gmail.com>
Co-authored-by: Cristi Burca <mail@scribu.net>
Co-authored-by: Marcin Cylke <marcin.cylke+github@gmail.com>
Co-authored-by: Marcin Cylke <marcin.cylke@allegro.pl>
Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>
Co-authored-by: Devi Sandeep <sandeep0138@gmail.com>
Co-authored-by: S Mono <10430241+xh2@users.noreply.github.com>
Co-authored-by: jeni Shah <jenishah@users.noreply.github.com>
Co-authored-by: jshah02 <jenisnehal.shah@factset.com>
Co-authored-by: lunastera <lounastera@gmail.com>
Co-authored-by: Megan <megan.stodel@bbc.co.uk>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: horpto <__Singleton__@hackerdom.ru>
Co-authored-by: Vaclav Dvorak <admin@wdv.cz>