### Performance improvement in Phrases module

#### Author - Prakhar Pratyush (@prakhar2b)
[Google Summer of Code '17 live blog](https://rare-technologies.com/google-summer-of-code-2017-live-blog-performance-improvement-in-gensim-and-fasttext/)

| Optimization | Python 2.7 | Python 3.6 | PR |
| ------------- |:-------------:|:------------:|:------------:|
| original | ~36-38 sec | ~32-35 sec | |
| cython (static typing) | ~30-32 sec | | |
| any2utf8 (without cython) | ~20-22 sec | ~23-26 sec | #1413 |
| cython (with any2utf8) | ~15-18 sec | ~19-21 sec | #1385 |

In [1]:
! python --version

Python 3.6.1 :: Anaconda 4.4.0 (64-bit)


In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import profile
%load_ext autoreload

import gensim
from gensim.models.word2vec import Text8Corpus
%autoreload

2017-06-27 13:22:11,249 : INFO : 'pattern' package not found; tag filters are not available for English


In [3]:
#! git clone https://github.com/prakhar2b/gensim.git

In [4]:
!pwd

/home/prakhar


In [5]:
#! wget http://mattmahoney.net/dc/text8.zip 

In [6]:
#! unzip text8.zip

In [2]:
import os
text8_file = os.path.abspath('text8')
print(text8_file)

/home/prakhar/text8


In [8]:
%cd gensim

/home/prakhar/gensim


In [9]:
!git checkout develop
%autoreload

Already on 'develop'
Your branch is up-to-date with 'origin/develop'.


In [None]:
! python setup.py install
%autoreload 

In [11]:
# currently on develop --- original code
from gensim.models import Phrases
bigram = Phrases(Text8Corpus(text8_file))
%timeit bigram = Phrases(Text8Corpus(text8_file))

2017-06-27 13:12:26,162 : INFO : collecting all words and their counts
2017-06-27 13:12:26,166 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-27 13:12:59,505 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences
2017-06-27 13:12:59,506 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-27 13:12:59,510 : INFO : collecting all words and their counts
2017-06-27 13:12:59,513 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-27 13:13:33,815 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences
2017-06-27 13:13:33,816 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-27 13:13:33,909 : INFO : collecting all words and their counts
2017-06-27 13:13:33,911 : INFO : PROGRESS: at sentence #0, 

1 loop, best of 3: 34.5 s per loop
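
The "1 loop, best of 3" line means the statement ran once per loop, the loop was repeated three times, and the fastest run was reported. A minimal sketch of the same measurement outside IPython, using the standard `timeit` module, might look like the following. The helper `best_of` and the function `stand_in_workload` are hypothetical names introduced here for illustration; in the notebook's setting the callable would be something like `lambda: Phrases(Text8Corpus(text8_file))`.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the fastest per-loop time in seconds over `repeat` runs.

    Mirrors the 'N loop, best of R' summary printed by %timeit:
    each run calls `func` `number` times, and the minimum is kept.
    """
    run_times = timeit.repeat(func, repeat=repeat, number=number)
    return min(run_times) / number


def stand_in_workload():
    # Hypothetical placeholder for the real call being benchmarked,
    # e.g. Phrases(Text8Corpus(text8_file)) in this notebook.
    return sum(i * i for i in range(100000))


seconds = best_of(stand_in_workload, repeat=3, number=1)
print("1 loop, best of 3: %.4f s per loop" % seconds)
```

Taking the minimum rather than the mean reduces the influence of unrelated system load, which matters for runs that take tens of seconds each.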


In [12]:
! git checkout any2utf8
%autoreload

Switched to branch 'any2utf8'
Your branch is up-to-date with 'origin/any2utf8'.


In [None]:
! python setup.py install
%autoreload 

In [14]:
# currently on any2utf8 
from gensim.models import Phrases
bigram = Phrases(Text8Corpus(text8_file))
%timeit bigram = Phrases(Text8Corpus(text8_file))

2017-06-27 13:15:25,852 : INFO : collecting all words and their counts
2017-06-27 13:15:25,857 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-27 13:15:56,700 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences
2017-06-27 13:15:56,702 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-27 13:15:56,868 : INFO : collecting all words and their counts
2017-06-27 13:15:56,877 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-27 13:16:28,853 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences
2017-06-27 13:16:28,855 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-27 13:16:28,948 : INFO : collecting all words and their counts
2017-06-27 13:16:28,952 : INFO : PROGRESS: at sentence #0, 

1 loop, best of 3: 26 s per loop


In [15]:
! git checkout gsoc17_phrases
%autoreload 
# cython + any2utf8 optimization

Switched to branch 'gsoc17_phrases'
Your branch is up-to-date with 'origin/gsoc17_phrases'.


In [None]:
!python setup.py build_ext --inplace
%autoreload 
!python setup.py install
%autoreload 

In [3]:
# currently on gsoc17_phrases (cython) 
from gensim.models import Phrases
bigram = Phrases(Text8Corpus(text8_file))
%timeit bigram = Phrases(Text8Corpus(text8_file))

2017-06-27 13:22:19,394 : INFO : collecting all words and their counts
2017-06-27 13:22:19,401 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-27 13:22:39,148 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences
2017-06-27 13:22:39,149 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-27 13:22:39,244 : INFO : collecting all words and their counts
2017-06-27 13:22:39,247 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2017-06-27 13:22:59,694 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences
2017-06-27 13:22:59,695 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
2017-06-27 13:22:59,778 : INFO : collecting all words and their counts
2017-06-27 13:22:59,781 : INFO : PROGRESS: at sentence #0, 

1 loop, best of 3: 20.1 s per loop
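
To compare the three Python 3.6 runs above, the best-of-3 timings can be turned into speedup factors relative to the `develop` branch. The sketch below only uses the numbers printed by `%timeit` in this notebook (34.5 s, 26 s, 20.1 s); the variable names are chosen here for illustration.

```python
# Best-of-3 timings in seconds, copied from the %timeit outputs above.
timings = {
    "develop": 34.5,          # original code
    "any2utf8": 26.0,         # any2utf8 optimization, no cython
    "gsoc17_phrases": 20.1,   # cython + any2utf8
}

baseline = timings["develop"]

# Speedup factor of each branch relative to the original code.
speedups = {branch: baseline / seconds for branch, seconds in timings.items()}

for branch, seconds in timings.items():
    saved_percent = 100.0 * (1.0 - seconds / baseline)
    print("%-15s %5.1f s  %.2fx faster  (%.1f%% less time)"
          % (branch, seconds, speedups[branch], saved_percent))
```

On these measurements, the `any2utf8` branch comes out at roughly 1.33x the speed of `develop`, and `gsoc17_phrases` at roughly 1.72x. These are single-machine figures from one session, so they should be read as indicative rather than exact.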
