## Using wrappers for the scikit-learn API

This tutorial shows how to use gensim models as part of your scikit-learn workflow with the help of the wrappers found in ```gensim.sklearn_api```.

The wrappers currently available are:
* LdaModel (```gensim.sklearn_api.ldamodel.LdaTransformer```), which implements gensim's ```LDA Model``` in a scikit-learn interface

* LsiModel (```gensim.sklearn_api.lsimodel.LsiTransformer```), which implements gensim's ```LSI Model``` in a scikit-learn interface

* RpModel (```gensim.sklearn_api.rpmodel.RpTransformer```), which implements gensim's ```Random Projections Model``` in a scikit-learn interface

* LDASeq Model (```gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer```), which implements gensim's ```LdaSeqModel``` in a scikit-learn interface

### LDA Model

To use LdaModel, begin by importing the LdaModel wrapper:

In [1]:
from gensim.sklearn_api import LdaTransformer


Next, we create a dummy set of texts and convert it into a bag-of-words corpus:

In [2]:
from gensim.corpora import Dictionary
texts = [
    ['complier', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
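
To make the bag-of-words format concrete, here is a minimal pure-Python sketch of the kind of output `doc2bow` produces: one list of `(token_id, count)` pairs per document. The helper names `build_token_ids` and `to_bow` are hypothetical, and assigning ids in order of first appearance is an assumption; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter


def build_token_ids(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Two of the dummy documents from the tutorial.
sample_texts = [
    ['graph', 'flow', 'network', 'graph'],
    ['tree', 'hamiltonian'],
]
token2id = build_token_ids(sample_texts)
bow_corpus = [to_bow(doc, token2id) for doc in sample_texts]
print(bow_corpus)
```

The repeated token `'graph'` ends up as a single pair with a count of 2, which is the property the topic models rely on.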

Then we fit the LdaModel wrapper on the corpus and transform the corpus into per-document topic weights:

In [3]:
model = LdaTransformer(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)

array([[ 0.85275316,  0.14724687],
       [ 0.12390183,  0.87609816],
       [ 0.46129951,  0.53870052],
       [ 0.84924179,  0.15075824],
       [ 0.49180096,  0.50819904],
       [ 0.40086922,  0.59913075],
       [ 0.28454426,  0.71545571],
       [ 0.88776201,  0.11223802],
       [ 0.84210372,  0.15789627]], dtype=float32)
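
Each row of the array above holds the topic weights of one document, and in this output every row sums to roughly 1. As a small illustration of how such a matrix can be read, the sketch below picks the highest-weighted topic per document. The values are copied from the output above and rounded to four decimals; the variable names are made up for this example.

```python
# Topic weights from the transform() output above, rounded to 4 decimals.
doc_topics = [
    [0.8528, 0.1472],
    [0.1239, 0.8761],
    [0.4613, 0.5387],
    [0.8492, 0.1508],
    [0.4918, 0.5082],
    [0.4009, 0.5991],
    [0.2845, 0.7155],
    [0.8878, 0.1122],
    [0.8421, 0.1579],
]

# Index of the largest weight in each row, i.e. the dominant topic per document.
dominant_topics = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]
print(dominant_topics)
```

Documents with weights close to 0.5 for both topics (for example the fifth one) are only weakly assigned, so the dominant topic alone hides some uncertainty.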

#### Integration with Sklearn

To give a better example of how the wrapper can be used with scikit-learn, let's combine it with scikit-learn's CountVectorizer. For this example we will use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We will only use the categories rec.sport.baseball and sci.crypt and generate topics from them.

In [4]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_api.ldamodel import LdaTransformer

In [5]:
np.random.seed(1)  # seed NumPy's global random generator so results are repeatable
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)

Next, we use CountVectorizer to convert the collection of text documents into a matrix of token counts.

In [6]:
vec = CountVectorizer(min_df=10, stop_words='english')

X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocab to be converted to id2word 

id2word = dict([(i, s) for i, s in enumerate(vocab)])
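
Column `i` of the count matrix `X` corresponds to the word `id2word[i]`, which is why the vocabulary is converted into an id-to-word mapping. The sketch below shows that correspondence on a tiny hand-made example; the vocabulary and counts are invented for illustration, and a plain list stands in for one row of the sparse matrix.

```python
# Hypothetical four-word vocabulary, in column order.
toy_vocab = ['ball', 'cipher', 'key', 'pitcher']
toy_id2word = dict(enumerate(toy_vocab))

# One document as a dense row of token counts (one entry per vocabulary column).
toy_row = [2, 0, 0, 1]

# Keep only the non-zero columns: this is the bag-of-words view of the same row.
toy_bow = [(col, count) for col, count in enumerate(toy_row) if count > 0]

# Translate column ids back to words through the id2word mapping.
toy_readable = [(toy_id2word[col], count) for col, count in toy_bow]
print(toy_bow)
print(toy_readable)
```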

Next, we just need to pass id2word to our LDA wrapper and fit it on X.

In [7]:
obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=20)
lda = obj.fit(X)

#### Example for Using Grid Search

In [8]:
from sklearn.model_selection import GridSearchCV
from gensim.models.coherencemodel import CoherenceModel

In [9]:
def scorer(estimator, X, y=None):
    goodcm = CoherenceModel(model=estimator.gensim_model, texts=texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')
    return goodcm.get_coherence()

In [10]:
obj = LdaTransformer(id2word=dictionary, num_topics=5, iterations=20)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
model = GridSearchCV(obj, parameters, scoring=scorer, cv=5)
model.fit(corpus)

GridSearchCV(cv=5, error_score='raise',
       estimator=LdaTransformer(alpha='symmetric', chunksize=2000, decay=0.5, eta=None,
        eval_every=10, gamma_threshold=0.001,
        id2word=<gensim.corpora.dictionary.Dictionary object at 0x7fdc747688d0>,
        iterations=20, minimum_probability=0.01, num_topics=5, offset=1.0,
        passes=1, random_state=None, update_every=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=<function scorer at 0x7fdc72c7bde8>, verbose=0)

In [11]:
model.best_params_

{'iterations': 50, 'num_topics': 10}
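
A grid search tries every combination of the listed parameter values and scores each one with cross-validation. To give a feel for the amount of work involved, the sketch below enumerates the same grid with the standard library only. The names `param_grid`, `cv_folds`, and `candidates` are chosen for this illustration and are not part of the scikit-learn API.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV call above.
param_grid = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
cv_folds = 5

# Build one dict per parameter combination (the Cartesian product of the grid).
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Every candidate is fitted once per cross-validation fold.
total_cv_fits = len(candidates) * cv_folds
print(len(candidates), total_cv_fits)
```

With `refit=True`, as shown in the printed `GridSearchCV` object, one more fit on the full data follows for the best combination.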

#### Example of Using Pipeline

In [12]:
from sklearn.pipeline import Pipeline
from sklearn import linear_model

def print_features_pipe(clf, vocab, n=10):
    ''' Better printing for sorted list '''
    vocab = list(vocab)  # allow indexing when a dict view is passed in
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    # the n largest coefficients, keeping only the positive ones
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    # the n smallest coefficients, keeping only the negative ones
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))
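
The helper above ranks the entries of `coef_[0]` by value. A minimal pure-Python sketch of the same ranking follows; the function name `top_features`, the labels, and the coefficients are made up for illustration. One caveat worth keeping in mind: in the pipelines in this tutorial the classifier is trained on the transformer's output, so `coef_[0]` has one entry per topic, and indexing `vocab` with those positions pairs each topic index with whichever word sits at that position rather than with a word that characterizes the topic.

```python
def top_features(labels, coef, n=10):
    """Return the n most positive and n most negative (label, coefficient) pairs."""
    ascending = sorted(range(len(coef)), key=lambda j: coef[j])
    # Walk from the largest coefficient down, keeping positive values only.
    positive = [(labels[j], coef[j]) for j in ascending[::-1][:n] if coef[j] > 0]
    # Walk from the smallest coefficient up, keeping negative values only.
    negative = [(labels[j], coef[j]) for j in ascending[:n] if coef[j] < 0]
    return positive, negative


# Hypothetical labels and coefficients, one per topic.
toy_labels = ['topic_0', 'topic_1', 'topic_2', 'topic_3']
toy_coef = [0.30, -0.36, 0.98, -0.05]

toy_positive, toy_negative = top_features(toy_labels, toy_coef, n=2)
print(toy_positive)
print(toy_negative)
```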

In [13]:
id2word = Dictionary([_.split() for _ in data.data])
corpus = [id2word.doc2bow(i.split()) for i in data.data]

In [14]:
model = LdaTransformer(num_topics=15, id2word=id2word, iterations=10, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

[ 0.3032212   0.53114732 -0.3556002   0.05528797 -0.23462074  0.10164825
 -0.34895972 -0.07528751 -0.31437197 -0.24760965 -0.27430636 -0.05328458
  0.1792989  -0.11535102  0.98473296]
Positive features: >Pat:0.98 considered,:0.53 Fame.:0.30 internet...:0.18 comp.org.eff.talk.:0.10 Keach:0.06
Negative features: Fame,:-0.36 01101001B:-0.35 circuitry:-0.31 hanging:-0.27 red@redpoll.neoucom.edu:-0.25 comp.org.eff.talk,:-0.23 dome.:-0.12 *best*:-0.08 trawling:-0.05
0.648489932886
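
Conceptually, a pipeline fits and applies each transformer in turn and then fits the final estimator on the transformed data. The sketch below mimics that flow with toy classes. `ScaleBy`, `MajorityLabel`, and `TinyPipeline` are hypothetical stand-ins written for this illustration; they are not scikit-learn or gensim classes, and the real `Pipeline` does considerably more (parameter handling, validation, and so on).

```python
class ScaleBy:
    """Hypothetical transformer: multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        return self  # nothing to learn for this toy step

    def transform(self, X):
        return [[value * self.factor for value in row] for row in X]


class MajorityLabel:
    """Hypothetical final estimator: always predicts the most common training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class TinyPipeline:
    """Chains (name, step) pairs: transformers first, one estimator last."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # fit each transformer, pass its output on
        self.steps[-1][1].fit(X, y)  # the last step sees only transformed data
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


toy_pipe = TinyPipeline([('features', ScaleBy(0.5)), ('classifier', MajorityLabel())])
toy_pipe.fit([[2.0, 4.0], [6.0, 8.0], [1.0, 1.0]], [1, 1, 0])
toy_predictions = toy_pipe.predict([[10.0, 10.0], [0.0, 0.0]])
print(toy_predictions)
```

This is why the wrappers only need `fit` and `transform` to slot in as the `'features'` step ahead of a classifier.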


### LSI Model

To use LsiModel, begin by importing the LsiModel wrapper:

In [15]:
from gensim.sklearn_api import LsiTransformer

#### Example of Using Pipeline

In [16]:
model = LsiTransformer(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

[ 0.13652575  0.00381802  0.02641866 -0.08494218 -0.02367595 -0.60081222
  1.0711397   0.03908158  0.43832007 -0.54889399  0.20215952 -0.21837008
  1.3051509   0.08722772  0.17582433]
Positive features: internet...:1.31 01101001B:1.07 circuitry:0.44 hanging:0.20 >Pat:0.18 Fame.:0.14 dome.:0.09 *best*:0.04 Fame,:0.03 considered,:0.00
Negative features: comp.org.eff.talk.:-0.60 red@redpoll.neoucom.edu:-0.55 trawling:-0.22 Keach:-0.08 comp.org.eff.talk,:-0.02
0.865771812081


### Random Projections Model

To use RpModel, begin by importing the RpModel wrapper:

In [17]:
from gensim.sklearn_api import RpTransformer

#### Example of Using Pipeline

In [18]:
model = RpTransformer(num_topics=2)
np.random.seed(1)  # seed NumPy's global random generator so results are repeatable
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

[ 0.03590816  0.00244356]
Positive features: Fame.:0.04 considered,:0.00
Negative features: 
0.630033557047
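
Random projection reduces dimensionality by multiplying each document vector by a fixed random matrix. The sketch below illustrates the idea with the standard library only. It assumes a matrix of random +1/-1 entries scaled by `1/sqrt(num_topics)`; this is one common construction and gensim's `RpModel` may build its matrix differently. The function names are hypothetical.

```python
import math
import random


def make_projection(num_terms, num_topics, seed):
    """Build a num_topics x num_terms matrix of random +1/-1 entries."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the matrix
    return [[rng.choice((-1.0, 1.0)) for _ in range(num_terms)]
            for _ in range(num_topics)]


def project(bow, projection):
    """Project one bag-of-words document onto the rows of the random matrix."""
    scale = 1.0 / math.sqrt(len(projection))
    return [scale * sum(row[term_id] * count for term_id, count in bow)
            for row in projection]


# Hypothetical document over a 6-term vocabulary, as (term_id, count) pairs.
toy_doc = [(0, 2), (3, 1), (5, 1)]

projected_a = project(toy_doc, make_projection(num_terms=6, num_topics=2, seed=1))
projected_b = project(toy_doc, make_projection(num_terms=6, num_topics=2, seed=1))
print(projected_a)
```

Because the matrix is random, results are only repeatable when the generator is seeded, which is what the sketch does through its `seed` argument.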


### LDASeq Model

To use LdaSeqModel, begin by importing the LdaSeqModel wrapper:

In [19]:
from gensim.sklearn_api import LdaSeqTransformer

#### Example of Using Pipeline

In [20]:
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word = Dictionary(map(lambda x: x.split(), test_data))
corpus = [id2word.doc2bow(i.split()) for i in test_data]

model = LdaSeqTransformer(id2word=id2word, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline((('features', model,), ('classifier', clf)))
pipe.fit(corpus, test_target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, test_target))


[ 0.04877324 -0.04877324]
Positive features: What:0.05
Negative features: NLCS:-0.05
1.0
