## Using wrappers for the scikit-learn API

This tutorial shows how to use gensim models as part of a scikit-learn workflow with the help of the wrappers found in ```gensim.sklearn_integration```.

The wrappers available (as of now) are:
* LdaModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel.SklLdaModel```), which implements gensim's ```LDA Model``` in a scikit-learn interface

* LsiModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel.SklLsiModel```), which implements gensim's ```LSI Model``` in a scikit-learn interface

* RpModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_rpmodel.SklRpModel```), which implements gensim's ```Random Projections Model``` in a scikit-learn interface

* LDASeq Model (```gensim.sklearn_integration.sklearn_wrapper_gensim_ldaseqmodel.SklLdaSeqModel```), which implements gensim's ```LdaSeqModel``` in a scikit-learn interface

### LDA Model

To use LdaModel, begin by importing the LdaModel wrapper.

In [1]:
from gensim.sklearn_integration import SklLdaModel


Next, we create a dummy set of texts and convert it into a corpus.

In [2]:
from gensim.corpora import Dictionary
texts = [
    ['compiler', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
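To make the bag-of-words format concrete, here is a minimal pure-Python sketch of what a dictionary and a `doc2bow`-style conversion produce. The helpers `build_token2id` and `to_bow` are hypothetical stand-ins written for illustration, not gensim's implementation, and the integer ids they assign may differ from the ones gensim's `Dictionary` would choose.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token (illustrative id order)."""
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(text, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(text)
    return sorted((token2id[token], count) for token, count in counts.items() if token in token2id)


# Two of the dummy documents from the cell above.
toy_texts = [['graph', 'flow', 'network', 'graph'], ['graph', 'trees']]
toy_token2id = build_token2id(toy_texts)
toy_corpus = [to_bow(text, toy_token2id) for text in toy_texts]
print(toy_corpus)  # [[(0, 1), (1, 2), (2, 1)], [(1, 1), (3, 1)]]
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the shape of input the wrappers in this tutorial are fitted on.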

Then we fit the LdaModel wrapper on the corpus and transform the corpus into topic space.

In [3]:
model = SklLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)

array([[ 0.85275314,  0.14724686],
       [ 0.12390183,  0.87609817],
       [ 0.4612995 ,  0.5387005 ],
       [ 0.84924177,  0.15075823],
       [ 0.49180096,  0.50819904],
       [ 0.40086923,  0.59913077],
       [ 0.28454427,  0.71545573],
       [ 0.88776198,  0.11223802],
       [ 0.84210373,  0.15789627]])
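Each row of the array above corresponds to one document and each column to one topic, so a row can be read as that document's topic proportions. As a small sketch (plain Python, with the values copied from the output above; the variable names are only for illustration), the dominant topic of each document can be read off like this:

```python
# Document-topic proportions copied from the transform() output above.
doc_topics = [
    [0.85275314, 0.14724686],
    [0.12390183, 0.87609817],
    [0.4612995, 0.5387005],
    [0.84924177, 0.15075823],
    [0.49180096, 0.50819904],
    [0.40086923, 0.59913077],
    [0.28454427, 0.71545573],
    [0.88776198, 0.11223802],
    [0.84210373, 0.15789627],
]

# Index of the largest proportion in each row.
dominant_topics = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]
print(dominant_topics)  # [0, 1, 1, 0, 1, 1, 1, 0, 0]
```

In this run the documents about computers and systems mostly land in topic 0 and the graph-related documents in topic 1, although several rows (for example the fifth) are close to an even split.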

#### Integration with scikit-learn

To show how the wrapper fits into a scikit-learn workflow, we use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We load only the categories alt.atheism and comp.graphics and use them to generate topics.

In [4]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel

In [5]:
rand = np.random.mtrand.RandomState(1) # set seed for getting same result
cats = ['alt.atheism',
        'comp.graphics'
       ]
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)

Next, we use the loaded data to create our dictionary and corpus.

In [6]:
id2word = Dictionary([_.split() for _ in data.data])
corpus = [id2word.doc2bow(i.split()) for i in data.data]
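Note that `str.split()` only splits on whitespace, so punctuation stays attached to the tokens that end up in the dictionary. The sketch below contrasts this with a simple regex tokenizer; the sample sentence and the regex are assumptions chosen for illustration, not part of the original workflow.

```python
import re

# Hypothetical sample text, not taken from the data set.
sample = "In Tennessee. the course... was localized"

whitespace_tokens = sample.split()
regex_tokens = re.findall(r"[a-z]+", sample.lower())

print(whitespace_tokens)  # ['In', 'Tennessee.', 'the', 'course...', 'was', 'localized']
print(regex_tokens)       # ['in', 'tennessee', 'the', 'course', 'was', 'localized']
```

The tutorial keeps the whitespace split for simplicity; a cleaner tokenizer would mainly shrink the vocabulary by merging variants such as `course...` and `course`.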

Next, we pass `id2word` to our Lda wrapper and fit it on the corpus.

In [7]:
obj = SklLdaModel(id2word=id2word, num_topics=5, iterations=20)
lda = obj.fit(corpus)

#### Example for Using Grid Search

In [8]:
from sklearn.model_selection import GridSearchCV

The built-in `score` function of the Lda wrapper class provides two modes, `perplexity` and `u_mass`, for scoring the candidate models. Select the mode with the wrapper's `scorer` parameter, as follows:

In [9]:
obj = SklLdaModel(id2word=id2word, num_topics=2, iterations=5, scorer='u_mass') # here 'scorer' can be 'perplexity' or 'u_mass'
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
model = GridSearchCV(obj, parameters, cv=2, scoring=None)
model.fit(corpus)

model.best_params_

{'iterations': 20, 'num_topics': 2}
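To make explicit what the parameter grid above enumerates, the following sketch lists the candidate settings with `itertools.product`. It only illustrates the search space and the number of model fits; it is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
cv_folds = 2  # matches cv=2 in the cell above

# Every combination of the listed values is one candidate model.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)  # 12 24
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full data, so the total training cost grows quickly as the grid widens.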

You can also supply a custom scoring function through the `scoring` parameter of `GridSearchCV`. The example below uses the `c_uci` mode of the `CoherenceModel` class to score the candidate models.

In [14]:
from gensim.models.coherencemodel import CoherenceModel

# supplying a custom scoring function
def scoring_function(estimator, X, y=None):
    goodcm = CoherenceModel(model=estimator.gensim_model, texts=X, dictionary=estimator.gensim_model.id2word, coherence='c_uci')
    return goodcm.get_coherence()

obj = SklLdaModel(id2word=id2word, num_topics=5, iterations=5)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
model = GridSearchCV(obj, parameters, cv=5, scoring=scoring_function)  # specify `scoring` param
model.fit(corpus)

model.best_params_

{'iterations': 1, 'num_topics': 2}

#### Example of Using Pipeline

In [8]:
from sklearn.pipeline import Pipeline
from sklearn import linear_model

def print_features_pipe(clf, vocab, n=10):
    ''' Better printing for sorted list '''
    vocab = list(vocab)  # dict views are not indexable in Python 3
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))

In [9]:
model = SklLdaModel(num_topics=15, id2word=id2word, iterations=10, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

[ 1.32565743  0.75503221  0.70518754  0.50225871  0.18737378 -0.04328554
 -0.42893934 -0.4814934  -0.58204094 -0.51909754 -0.51790585 -0.41859839
 -0.35134981 -0.18368892 -0.09067942]
Positive features: weber@sipi.usc.edu:1.33 considered.:0.76 al-Qanawi,:0.71 360.0;:0.50 talk.origins:0.19
Negative features: course...:-0.58 Stoakley):-0.52 Western:-0.52 localized:-0.48 resisted:-0.43 >>this):-0.42 (M.J.:-0.35 Tennessee.:-0.18 spider?:-0.09 circuitry:-0.04
0.682330827068
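The pipeline above works because the wrapper exposes `fit` and `transform`, so its output can be fed straight into the classifier. The sketch below shows that chaining with a toy, duck-typed pipeline in plain Python. `TinyPipeline`, `DocLength`, and `ThresholdClassifier` are hypothetical classes invented for this illustration; the real `sklearn.pipeline.Pipeline` does considerably more (parameter handling, validation, and so on).

```python
class TinyPipeline(object):
    """Toy stand-in for a pipeline: transformers first, classifier last."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def _transform(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return X

    def fit(self, X, y):
        # Fit each transformer, then pass its output to the next step.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def score(self, X, y):
        return self.steps[-1][1].score(self._transform(X), y)


class DocLength(object):
    """Hypothetical feature step: total token count of a bag-of-words doc."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[sum(count for _, count in doc)] for doc in X]


class ThresholdClassifier(object):
    """Hypothetical classifier: predicts 1 above the mean training feature."""

    def fit(self, X, y):
        values = [row[0] for row in X]
        self.threshold_ = sum(values) / float(len(values))
        return self

    def predict(self, X):
        return [1 if row[0] > self.threshold_ else 0 for row in X]

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / float(len(y))


toy_corpus = [[(0, 1), (1, 1)], [(0, 3), (2, 2)], [(1, 1)], [(2, 4), (3, 2)]]
toy_target = [0, 1, 0, 1]
toy_pipe = TinyPipeline([('features', DocLength()), ('classifier', ThresholdClassifier())])
toy_pipe.fit(toy_corpus, toy_target)
accuracy = toy_pipe.score(toy_corpus, toy_target)
print(accuracy)  # 1.0
```

Any object with the same `fit`/`transform` shape can take the place of the feature step, which is what lets the gensim wrappers slot into the `'features'` position.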


### LSI Model

To use LsiModel, begin by importing the LsiModel wrapper.

In [10]:
from gensim.sklearn_integration import SklLsiModel

#### Example of Using Pipeline

In [11]:
model = SklLsiModel(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

[-0.0359862  -0.81408832 -0.01489664  0.08309666 -0.5422912   0.03222213
  0.00932597  0.35993054  0.17835018 -0.40297285  0.17579351 -0.21592558
  0.11585845  0.12755213  0.03434153]
Positive features: localized:0.36 course...:0.18 Western:0.18 Tennessee.:0.13 (M.J.:0.12 360.0;:0.08 spider?:0.03 circuitry:0.03 resisted:0.01
Negative features: considered.:-0.81 talk.origins:-0.54 Stoakley):-0.40 >>this):-0.22 weber@sipi.usc.edu:-0.04 al-Qanawi,:-0.01
0.832706766917


### Random Projections Model

To use RpModel, begin by importing the RpModel wrapper.

In [12]:
from gensim.sklearn_integration import SklRpModel

#### Example of Using Pipeline

In [13]:
model = SklRpModel(num_topics=2)
np.random.seed(1)  # set seed for getting same result
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

[-0.0011877  -0.01663632]
Positive features: 
Negative features: considered.:-0.02 weber@sipi.usc.edu:-0.00
0.614661654135


### LDASeq Model

To use LdaSeqModel, begin by importing the LdaSeqModel wrapper.

In [14]:
from gensim.sklearn_integration import SklLdaSeqModel

#### Example of Using Pipeline

In [15]:
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word = Dictionary(map(lambda x: x.split(), test_data))
corpus = [id2word.doc2bow(i.split()) for i in test_data]

model = SklLdaSeqModel(id2word=id2word, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, test_target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, test_target))

[-0.04876308  0.04876308]
Positive features: Organization::0.05
Negative features: all:-0.05
1.0
