<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Change Set for 0.8.0 &mdash; gensim</title>
<link rel="stylesheet" href="_static/default.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '',
VERSION: '0.8.5',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<link rel="author" title="About these documents" href="about.html" />
<link rel="top" title="gensim" href="index.html" />
<!-- twitter search widget
<script type="text/javascript" src="_static/widget.js"></script>
-->
<meta property="og:title" content="#gensim" />
<meta property="og:description" content="Efficient topic modelling in Python" />
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-24066335-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li><a href="index.html">Gensim home</a>|&nbsp;</li>
<li><a href="tutorial.html">Tutorials</a>|&nbsp;</li>
<li><a href="http://groups.google.com/group/gensim">Support</a>|&nbsp;</li>
<li><a href="https://github.com/piskvorky/gensim/wiki">Contribute</a>|&nbsp;</li>
<li><a href="apiref.html">API reference</a>&raquo;</li>
</ul>
</div>
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">Change Set for 0.8.0</a><ul>
<li><a class="reference internal" href="#codestyle-changes">Codestyle Changes</a></li>
<li><a class="reference internal" href="#similarity-queries">Similarity Queries</a></li>
<li><a class="reference internal" href="#simplified-directory-structure">Simplified Directory Structure</a></li>
<li><a class="reference internal" href="#other-changes-that-you-re-unlikely-to-notice-unless-you-look">Other changes (that you&#8217;re unlikely to notice unless you look)</a></li>
</ul>
</li>
</ul>
<div id="searchbox" style="display: none">
<h3>Quick search</h3>
<form class="search" action="search.html" method="get">
<input type="text" name="q" size="24" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
<p class="searchtip" style="font-size: 90%">
Enter search terms or a module, class or function name.
</p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
<div class="section" id="change-set-for-0-8-0">
<span id="changes-080"></span><h1>Change Set for 0.8.0<a class="headerlink" href="#change-set-for-0-8-0" title="Permalink to this headline">¶</a></h1>
<p>Release 0.8.0 concludes the 0.7.x series, which was about API consolidation and performance.
In 0.8.x, I&#8217;d like to extend <cite>gensim</cite> with new functionality and features.</p>
<div class="section" id="codestyle-changes">
<h2>Codestyle Changes<a class="headerlink" href="#codestyle-changes" title="Permalink to this headline">¶</a></h2>
<p>The codebase was modified to comply with <a class="reference external" href="http://www.python.org/dev/peps/pep-0008/">PEP8: Style Guide for Python Code</a>.
This means the 0.8.0 API is <strong>backward incompatible</strong> with the 0.7.x series.</p>
<p>That&#8217;s not as tragic as it sounds: gensim was almost there anyway. The changes are few and straightforward:</p>
<ol class="arabic simple">
<li>the <cite>numTopics</cite> parameter is now <cite>num_topics</cite></li>
<li>the <cite>addDocuments()</cite> method is now <cite>add_documents()</cite></li>
<li>the <cite>toUtf8()</cite> function is now <cite>to_utf8()</cite></li>
<li>... you get the idea: replace <cite>camelCase</cite> with <cite>lowercase_with_underscores</cite>.</li>
</ol>
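<p>The renaming rule above can be sketched mechanically. The helper <cite>camel_to_snake</cite> below is a hypothetical illustration, not part of gensim, and it assumes simple identifiers without runs of capital letters:</p>

```python
import re

# Hypothetical helper for illustration only; gensim does not ship this function.
# Matches a lowercase letter or digit immediately followed by an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"([a-z0-9])([A-Z])")


def camel_to_snake(name):
    """Convert a camelCase identifier to lowercase_with_underscores."""
    return _CAMEL_BOUNDARY.sub(r"\1_\2", name).lower()


for old in ("numTopics", "addDocuments", "toUtf8"):
    print(old, "becomes", camel_to_snake(old))
```

<p>Running this over the three names listed above yields <cite>num_topics</cite>, <cite>add_documents</cite> and <cite>to_utf8</cite>.</p>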
<p>If you saved an affected model to disk with a 0.7.x release, you&#8217;ll need to rename its attributes manually:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">lsa</span> <span class="o">=</span> <span class="n">gensim</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">LsiModel</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">&#39;/some/path&#39;</span><span class="p">)</span> <span class="c"># load old &lt;0.8.0 model</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">lsa</span><span class="o">.</span><span class="n">num_terms</span><span class="p">,</span> <span class="n">lsa</span><span class="o">.</span><span class="n">num_topics</span> <span class="o">=</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTerms</span><span class="p">,</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTopics</span> <span class="c"># rename attributes</span>
<span class="gp">&gt;&gt;&gt; </span><span class="k">del</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTerms</span><span class="p">,</span> <span class="n">lsa</span><span class="o">.</span><span class="n">numTopics</span> <span class="c"># clean up old attributes (optional)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">lsa</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">&#39;/some/path&#39;</span><span class="p">)</span> <span class="c"># save again to disk, as 0.8.0 compatible</span>
</pre></div>
</div>
<p>Only attributes (variables) need to be renamed. Method names (functions) are not affected, because <cite>pickle</cite> stores only an object&#8217;s instance data; methods are looked up on the class when the object is loaded.</p>
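<p>If a stored model has many such attributes, the renaming could be automated along these lines. This is a minimal sketch: <cite>OldModel</cite> and <cite>rename_camel_attributes</cite> are hypothetical names, with the stand-in class used in place of a model loaded from disk.</p>

```python
import re


class OldModel(object):
    """Stand-in for a model saved under 0.7.x (hypothetical, not a gensim class)."""

    def __init__(self):
        self.numTerms = 1000
        self.numTopics = 50


def rename_camel_attributes(obj):
    """Rename every camelCase instance attribute of obj to snake_case, in place."""
    for old in list(vars(obj)):  # copy the names, since we mutate the object
        new = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", old).lower()
        if new != old:
            setattr(obj, new, getattr(obj, old))
            delattr(obj, old)
    return obj


model = rename_camel_attributes(OldModel())
print(sorted(vars(model)))  # ['num_terms', 'num_topics']
```

<p>Only instance attributes are touched, in line with the remark about <cite>pickle</cite> above. The object still has to be saved again afterwards.</p>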
</div>
<div class="section" id="similarity-queries">
<h2>Similarity Queries<a class="headerlink" href="#similarity-queries" title="Permalink to this headline">¶</a></h2>
<p>Improved speed and scalability of <a class="reference internal" href="tut2.html"><em>similarity queries</em></a>.</p>
<p>The <cite>Similarity</cite> class can now index corpora of arbitrary size more efficiently.
Internally, this is done by splitting the index into several smaller pieces (&#8220;shards&#8221;) that fit in RAM
and can be processed independently. In addition, documents can now be added to a <cite>Similarity</cite> index dynamically.</p>
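<p>The sharding idea can be illustrated with a toy index in pure Python. This is only a sketch of the concept: <cite>ShardedIndex</cite> and its <cite>shardsize</cite> parameter are hypothetical, everything is kept in memory, and none of it reflects how the <cite>Similarity</cite> class is actually implemented.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ShardedIndex(object):
    """Toy index (hypothetical) that stores documents in fixed-size shards."""

    def __init__(self, shardsize):
        self.shardsize = shardsize
        self.shards = [[]]

    def add_documents(self, docs):
        for doc in docs:
            if len(self.shards[-1]) == self.shardsize:
                self.shards.append([])  # current shard is full, open a new one
            self.shards[-1].append(doc)

    def __getitem__(self, query):
        sims = []
        for shard in self.shards:  # each shard is processed independently
            sims.extend(cosine(query, doc) for doc in shard)
        return sims


index = ShardedIndex(shardsize=2)
index.add_documents([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
index.add_documents([[2.0, 0.0]])  # documents can be added later
print(len(index.shards))
print(index[[1.0, 0.0]])
```

<p>In a real index each shard would live on disk and be loaded only while it is being queried, which is what keeps memory use bounded.</p>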
<p>There is also a new way to query the similarity indexes:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">index</span> <span class="o">=</span> <span class="n">MatrixSimilarity</span><span class="p">(</span><span class="n">corpus</span><span class="p">)</span> <span class="c"># create an index</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sims</span> <span class="o">=</span> <span class="n">index</span><span class="p">[</span><span class="n">document</span><span class="p">]</span> <span class="c"># get cosine similarity of query &quot;document&quot; against every document in the index</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sims</span> <span class="o">=</span> <span class="n">index</span><span class="p">[</span><span class="n">chunk_of_documents</span><span class="p">]</span> <span class="c"># new syntax!</span>
</pre></div>
</div>
<p>The advantage of the last line (querying multiple documents at once) is faster execution.</p>
<p>The same speed-up is applied <em>automatically for you</em> when you use the <tt class="docutils literal"><span class="pre">for</span> <span class="pre">sims</span> <span class="pre">in</span> <span class="pre">index:</span> <span class="pre">...</span></tt> syntax,
which returns pairwise similarities of the documents in the index.</p>
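<p>To make the difference between the two query forms concrete, here is a small pure-Python sketch of the result shapes. The function names are hypothetical, and the vectors are assumed to be unit length so that a dot product equals the cosine similarity. A plain loop like this gains no speed from chunking; it shows only what comes back.</p>

```python
def similarities(index_docs, query):
    """Dot-product similarity of one query against every indexed document."""
    return [sum(q * d for q, d in zip(query, doc)) for doc in index_docs]


def query_index(index_docs, query):
    """Accept either a single document or a chunk (list) of documents."""
    is_chunk = bool(query) and isinstance(query[0], (list, tuple))
    if is_chunk:
        # One row of similarities per query document in the chunk.
        return [similarities(index_docs, q) for q in query]
    return similarities(index_docs, query)


index_docs = [[1.0, 0.0], [0.0, 1.0]]
print(query_index(index_docs, [1.0, 0.0]))                # one list of similarities
print(query_index(index_docs, [[1.0, 0.0], [0.0, 1.0]]))  # one list per query document
```

<p>A single document yields one list of similarities; a chunk yields one such list per query document.</p>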
<p>To see the speed-up on your machine, run <tt class="docutils literal"><span class="pre">python</span> <span class="pre">-m</span> <span class="pre">gensim.test.simspeed</span></tt> (and compare to my results <a class="reference external" href="http://groups.google.com/group/gensim/msg/4f6f171a869e4fca?">here</a> to see how your machine fares).</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The current querying functionality is as far as I wanted to take gensim.
More optimizations and smarter indexing are certainly possible, but I&#8217;d like to
focus on other features now. Pull requests are still welcome though :)</p>
</div>
<p>Check out the <a class="reference internal" href="similarities/docsim.html#module-gensim.similarities.docsim" title="gensim.similarities.docsim: Document similarity queries"><tt class="xref py py-mod docutils literal"><span class="pre">updated</span> <span class="pre">documentation</span></tt></a> of the similarity classes for more info.</p>
</div>
<div class="section" id="simplified-directory-structure">
<h2>Simplified Directory Structure<a class="headerlink" href="#simplified-directory-structure" title="Permalink to this headline">¶</a></h2>
<p>Instead of the Java-esque <tt class="docutils literal"><span class="pre">ROOT_DIR/src/gensim</span></tt> directory structure,
the packages now reside directly in <tt class="docutils literal"><span class="pre">ROOT_DIR/gensim</span></tt> (no superfluous <tt class="docutils literal"><span class="pre">src</span></tt>). See the new structure <a class="reference external" href="https://github.com/piskvorky/gensim">on github</a>.</p>
</div>
<div class="section" id="other-changes-that-you-re-unlikely-to-notice-unless-you-look">
<h2>Other changes (that you&#8217;re unlikely to notice unless you look)<a class="headerlink" href="#other-changes-that-you-re-unlikely-to-notice-unless-you-look" title="Permalink to this headline">¶</a></h2>
<ul class="simple">
<li>Improved efficiency of <tt class="docutils literal"><span class="pre">lsi[corpus]</span></tt> transformations (documents are chunked internally for better performance).</li>
<li>Large matrices (numpy/scipy.sparse, in <cite>LsiModel</cite>, <cite>Similarity</cite> etc.) are now mmapped to/from disk when doing <cite>save/load</cite>. The <cite>cPickle</cite> approach used previously was too <a class="reference external" href="http://groups.google.com/group/gensim/browse_thread/thread/3c4c6c0f76c5938c#">buggy</a> and <a class="reference external" href="http://dieter.plaetinck.be/poor_mans_pickle_implementations_benchmark.html">slow</a>.</li>
<li>Renamed the <cite>chunks</cite> parameter to <cite>chunksize</cite> (e.g. <cite>LsiModel(corpus, num_topics=100, chunksize=20000)</cite>). This better reflects its purpose: the size of a chunk, i.e. the number of documents processed at once.</li>
<li>Also improved memory efficiency of LSI and LDA model generation (again).</li>
<li>Removed SciPy 0.6 from the list of supported SciPy versions (need &gt;=0.7 now).</li>
<li>Added more unit tests.</li>
<li>Several smaller fixes; see the <a class="reference external" href="https://github.com/piskvorky/gensim/commits/0.8.0">commit history</a> for a full account.</li>
</ul>
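<p>The meaning of <cite>chunksize</cite> (the number of documents processed at once) can be illustrated with a small generator. This is a sketch only: <cite>iter_chunks</cite> is a hypothetical name, and the code is not taken from gensim.</p>

```python
from itertools import islice


def iter_chunks(iterable, chunksize):
    """Yield successive lists of at most chunksize items from iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            break
        yield chunk


corpus = ("doc%d" % i for i in range(5))  # a streamed corpus, never fully in memory
for chunk in iter_chunks(corpus, chunksize=2):
    print(chunk)
```

<p>With five documents and <cite>chunksize=2</cite>, this yields two full chunks followed by a final chunk holding the one remaining document.</p>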
<div class="admonition-future-directions admonition">
<p class="first admonition-title">Future Directions?</p>
<p class="last">If you have ideas or proposals for new features for 0.8.x, now is the time to let me know:
<a class="reference external" href="http://groups.google.com/group/gensim">gensim mailing list</a>.</p>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li><a href="index.html">Gensim home</a>|&nbsp;</li>
<li><a href="tutorial.html">Tutorials</a>|&nbsp;</li>
<li><a href="http://groups.google.com/group/gensim">Support</a>|&nbsp;</li>
<li><a href="https://github.com/piskvorky/gensim/wiki">Contribute</a>|&nbsp;</li>
<li><a href="apiref.html">API reference</a>&raquo;</li>
</ul>
</div>
<div class="footer">
&copy; Copyright 2009-2012, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
Last updated on Sep 10, 2012.
</div>
</body>
</html>