## Using wrappers for the scikit-learn API

This tutorial shows how to use gensim models as part of a scikit-learn workflow, with the help of the wrappers found in ```gensim.sklearn_integration.SklearnWrapperGensimLdaModel```.

The wrappers available (as of now) are:
* LdaModel (```gensim.sklearn_integration.SklearnWrapperGensimLdaModel.SklearnWrapperLdaModel```), which exposes gensim's ```LdaModel``` through a scikit-learn interface

### LdaModel

To use LdaModel, begin by importing the LdaModel wrapper:

In [1]:
from gensim.sklearn_integration.SklearnWrapperGensimLdaModel import SklearnWrapperLdaModel

Next, we create a dummy set of texts and convert it into a corpus:

In [2]:
from gensim.corpora import Dictionary
texts = [['complier', 'system', 'computer'],
 ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
 ['graph', 'flow', 'network', 'graph'],
 ['loading', 'computer', 'system'],
 ['user', 'server', 'system'],
 ['tree','hamiltonian'],
 ['graph', 'trees'],
 ['computer', 'kernel', 'malfunction','computer'],
 ['server','system','computer']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
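
To make the corpus format concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind ```doc2bow```: each document becomes a list of (token id, count) pairs. The helper names `build_token_ids` and `to_bow` are hypothetical, and ids are assigned here in first-seen order, which is an assumption of this sketch and may differ from the ids gensim's ```Dictionary``` assigns.

```python
from collections import Counter

def build_token_ids(texts):
    """Assign each distinct token an integer id in first-seen order (sketch only)."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())

# Three of the dummy documents from the cell above.
sample_texts = [['graph', 'flow', 'network', 'graph'],
                ['tree', 'hamiltonian'],
                ['graph', 'trees']]

token2id = build_token_ids(sample_texts)
sample_corpus = [to_bow(text, token2id) for text in sample_texts]
print(sample_corpus)
# [[(0, 2), (1, 1), (2, 1)], [(3, 1), (4, 1)], [(0, 1), (5, 1)]]
```

Note that 'graph' appears twice in the first document, so its pair carries a count of 2, while word order within the document is discarded.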

Then we run LdaModel on it:

In [3]:
model = SklearnWrapperLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.print_topics(2)



[(0,
  u'0.164*computer + 0.117*system + 0.105*graph + 0.061*server + 0.057*tree + 0.046*malfunction + 0.045*kernel + 0.045*complier + 0.043*loading + 0.039*hamiltonian'),
 (1,
  u'0.102*graph + 0.083*system + 0.072*tree + 0.064*server + 0.059*user + 0.059*computer + 0.057*trees + 0.056*eulerian + 0.055*node + 0.052*flow')]
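
Each topic above is printed as a single string of `weight*word` terms joined by ` + `. If you want the weights as numbers, a small parser along these lines may help. The function `parse_topic` is a hypothetical helper, and it assumes exactly the string format shown in the output above; other gensim versions may format topics differently (for example, by quoting the words).

```python
def parse_topic(topic_string):
    """Split '0.164*computer + 0.117*system' into [('computer', 0.164), ...].

    Assumes terms are joined by ' + ' and each term is 'weight*word'.
    """
    pairs = []
    for term in topic_string.split(' + '):
        weight, word = term.split('*', 1)
        pairs.append((word, float(weight)))
    return pairs

# The first three terms of topic 0 from the output above.
topic_0 = '0.164*computer + 0.117*system + 0.105*graph'
top_words = parse_topic(topic_0)
print(top_words)
# [('computer', 0.164), ('system', 0.117), ('graph', 0.105)]
```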

### Integration with scikit-learn

To show how the wrapper fits into a scikit-learn workflow, let's use scikit-learn's CountVectorizer. For this example we use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We keep only the categories rec.sport.baseball and sci.crypt and generate topics from them.

In [4]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_integration.SklearnWrapperGensimLdaModel import SklearnWrapperLdaModel

In [5]:
rand = np.random.mtrand.RandomState(1)  # seeded generator so the shuffle is repeatable
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train',
                          categories=cats,
                          shuffle=True,
                          random_state=rand)

Next, we use CountVectorizer to convert the collection of text documents into a matrix of token counts.

In [6]:
vec = CountVectorizer(min_df=10, stop_words='english')

X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocab to be converted to id2word

id2word = dict(enumerate(vocab))  # maps each column index of X to its word
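
The relationship between the count matrix and ```id2word``` can be sketched in plain Python. This is only an illustration of the idea, not how CountVectorizer is implemented: it assumes naive whitespace tokenization, applies no stop-word or ```min_df``` filtering, and uses three made-up documents. The names `docs`, `count_row` and `matrix` are hypothetical.

```python
from collections import Counter

# Three made-up documents standing in for the newsgroup posts.
docs = ['the pitcher threw the ball', 'the cipher key', 'ball game']

# One column per distinct word; sorted order is an assumption of this sketch.
vocab = sorted({word for doc in docs for word in doc.split()})
id2word = dict(enumerate(vocab))

def count_row(doc):
    """Return the token counts of one document, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [count_row(doc) for doc in docs]

# Recover the (word, count) pairs of the first document through id2word.
first_doc = [(id2word[j], n) for j, n in enumerate(matrix[0]) if n]
print(first_doc)
# [('ball', 1), ('pitcher', 1), ('the', 2), ('threw', 1)]
```

Row `i` of the matrix describes document `i`, and column `j` counts the word `id2word[j]`, which is the mapping the LDA wrapper needs in order to print topics as words rather than column indices.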

Next, we just need to pass id2word to our LDA wrapper and fit it on X.

In [7]:
obj = SklearnWrapperLdaModel(id2word=id2word, num_topics=5, passes=20)
lda = obj.fit(X)
lda.print_topics()

[(0,
  u'0.031*fierkelab + 0.025*bitnet + 0.020*digex + 0.020*false + 0.018*cover + 0.014*disk + 0.014*fear + 0.013*effort + 0.012*amolitor + 0.012*brian'),
 (1,
  u'0.022*corporate + 0.017*accurate + 0.015*chance + 0.008*dawson + 0.007*assess + 0.007*afford + 0.006*administration + 0.006*denning + 0.006*broad + 0.006*fails'),
 (2,
  u'0.018*clark + 0.017*accurate + 0.014*decipher + 0.014*example + 0.013*authentication + 0.012*cases + 0.012*follow + 0.011*basically + 0.011*candidates + 0.011*decryption'),
 (3,
  u'0.116*abroad + 0.093*asking + 0.078*cryptography + 0.040*arithmetic + 0.033*argue + 0.033*ciphertext + 0.031*courtesy + 0.028*456 + 0.028*facts + 0.020*beastmaster'),
 (4,
  u'0.026*certain + 0.020*book + 0.020*69 + 0.019*demand + 0.019*87 + 0.019*cracking + 0.018*farm + 0.012*face + 0.011*constitutional + 0.009*cryptography')]