## Using wrappers for the scikit-learn API

This tutorial shows how to use gensim models as part of your scikit-learn workflow with the help of the wrappers found in ```gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel```.

The wrappers available (as of now) are:
* LdaModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel.SklearnWrapperLdaModel```), which implements gensim's ```LdaModel``` in a scikit-learn interface

### LdaModel

To use LdaModel, begin by importing the LdaModel wrapper:

In [4]:
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklearnWrapperLdaModel

In [2]:
import sys
sys.path.insert(0, '/home/kris/Desktop/GsoC2K17/gensim/')

In [3]:
import gensim
gensim.__path__

['/home/kris/Desktop/GsoC2K17/gensim/gensim']

Next, we create a dummy set of texts and convert it into a bag-of-words corpus:

In [5]:
from gensim.corpora import Dictionary
texts = [['complier', 'system', 'computer'],
 ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
 ['graph', 'flow', 'network', 'graph'],
 ['loading', 'computer', 'system'],
 ['user', 'server', 'system'],
 ['tree','hamiltonian'],
 ['graph', 'trees'],
 ['computer', 'kernel', 'malfunction','computer'],
 ['server','system','computer']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
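To make the conversion step concrete, here is a plain-Python sketch of the bag-of-words idea behind the corpus. The helpers `build_token2id` and `to_bow` are hypothetical names used only for illustration; they are not gensim functions, and gensim may assign token ids in a different order.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(text, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


# Two of the dummy documents from the tutorial
sample_texts = [['graph', 'flow', 'network', 'graph'],
                ['graph', 'trees']]
token2id = build_token2id(sample_texts)
sample_corpus = [to_bow(text, token2id) for text in sample_texts]
print(sample_corpus)
```

Each document becomes a list of `(token_id, count)` pairs, so the repeated token `'graph'` in the first document shows up once with a count of 2.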

Then we fit the LdaModel wrapper on the corpus and transform it into topic distributions:

In [6]:
model=SklearnWrapperLdaModel(num_topics=2,id2word=dictionary,iterations=20, random_state=1)
model.fit(corpus)
model.print_topics(2)
model.transform(corpus)



array([[ 0.85275314,  0.14724686],
       [ 0.12390183,  0.87609817],
       [ 0.4612995 ,  0.5387005 ],
       [ 0.84924177,  0.15075823],
       [ 0.49180096,  0.50819904],
       [ 0.40086923,  0.59913077],
       [ 0.28454427,  0.71545573],
       [ 0.88776198,  0.11223802],
       [ 0.84210373,  0.15789627]])
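The array above has one row per document and one column per topic. As a small sketch of how it can be read, the snippet below copies the printed values into a plain list (no gensim or NumPy required) and picks the most probable topic for each document. The variable names are illustrative only.

```python
# Values copied from the transform() output shown above
doc_topics = [
    [0.85275314, 0.14724686],
    [0.12390183, 0.87609817],
    [0.4612995, 0.5387005],
    [0.84924177, 0.15075823],
    [0.49180096, 0.50819904],
    [0.40086923, 0.59913077],
    [0.28454427, 0.71545573],
    [0.88776198, 0.11223802],
    [0.84210373, 0.15789627],
]

# Each row is a probability distribution over the two topics
row_sums = [sum(row) for row in doc_topics]

# Index of the most probable topic for each document
dominant_topic = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]
print(dominant_topic)
```

In this run, the documents that mention `computer` fall mostly under topic 0 and the graph-related documents under topic 1, while rows such as the fifth one (close to 0.5 for both topics) are not clearly assigned to either.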

### Integration with Sklearn

To give a better example of how the wrapper can be used with scikit-learn, let's use scikit-learn's ```CountVectorizer```. For this example we will use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). We will use only the categories rec.sport.baseball and sci.crypt and generate topics from them.

In [7]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklearnWrapperLdaModel

In [8]:
rand = np.random.mtrand.RandomState(1) # set seed for getting same result
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train',
                        categories=cats,
                        shuffle=True)

Next, we use ```CountVectorizer``` to convert the collection of text documents to a matrix of token counts.

In [6]:
vec = CountVectorizer(min_df=10, stop_words='english')

X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocab to be converted to id2word

id2word = dict(enumerate(vocab))  # maps column index -> token

Next, we just need to pass id2word to our LDA wrapper and fit it on X.

In [None]:
obj=SklearnWrapperLdaModel(id2word=id2word,num_topics=5,passes=20)
lda=obj.fit(X)
lda.print_topics()

#### Using together with scikit-learn's Logistic Regression

Now let's try scikit-learn's logistic regression classifier to separate the two categories. Ideally, we should get positive weights for terms related to cryptography and negative weights for terms related to baseball.

In [12]:
from sklearn import linear_model

In [9]:
def print_features(clf, vocab, n=10):
    ''' Print the n strongest positive and negative features of a fitted linear classifier '''
    coef = clf.coef_[0]
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))

In [10]:
clf = linear_model.LogisticRegression(penalty='l1', C=0.1, solver='liblinear')  # l1 penalty; the liblinear solver supports it
clf.fit(X, data.target)
print_features(clf, vocab)

Positive features: clipper:1.50 code:1.24 key:1.04 encryption:0.95 chip:0.37 nsa:0.37 government:0.37 uk:0.36 org:0.23 cryptography:0.23
Negative features: baseball:-1.32 game:-0.71 year:-0.61 team:-0.38 edu:-0.27 games:-0.26 players:-0.23 ball:-0.17 season:-0.14 phillies:-0.11
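To illustrate how the two lines above are produced, the following sketch ranks a handful of the printed weights by sign and magnitude without NumPy. The `weights` dictionary is copied from the output above, and `top_features` is a hypothetical helper that mirrors the logic of `print_features`.

```python
# A few (term, coefficient) pairs copied from the output above
weights = {
    'clipper': 1.50, 'code': 1.24, 'key': 1.04,
    'baseball': -1.32, 'game': -0.71, 'year': -0.61,
}


def top_features(weights, n=3):
    """Return the n largest positive and n most negative (term, weight) pairs."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    positive = [(term, w) for term, w in ranked[:n] if w > 0]
    negative = [(term, w) for term, w in ranked[::-1][:n] if w < 0]
    return positive, negative


positive, negative = top_features(weights)
print('Positive features:', positive)
print('Negative features:', negative)
```

A positive coefficient pushes the prediction towards the class the classifier treats as positive (the cryptography group in the output above), and a negative coefficient pushes it towards the other class.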


### Example for Using Grid Search

In [13]:
from sklearn.model_selection import GridSearchCV
from gensim.models.coherencemodel import CoherenceModel

In [14]:
def scorer(estimator, X, y=None):
    # Score a fitted model by its c_v topic coherence on the dummy texts defined earlier
    goodcm = CoherenceModel(model=estimator, texts=texts, dictionary=estimator.id2word, coherence='c_v')
    return goodcm.get_coherence()

In [15]:
obj=SklearnWrapperLdaModel(id2word=dictionary,num_topics=5,passes=20)
parameters = {'num_topics':(2, 3, 5, 10), 'iterations':(1,20,50)}
model = GridSearchCV(obj, parameters, scoring=scorer, cv=5)
model.fit(corpus)

GridSearchCV(cv=5, error_score='raise',
       estimator=<gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel.SklearnWrapperLdaModel object at 0x7f583789ad50>,
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=<function scorer at 0x7f5837889938>, verbose=0)

In [16]:
model.best_params_

{'iterations': 50, 'num_topics': 2}
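As a rough sketch of what the grid search has to evaluate, the snippet below enumerates the parameter combinations from the `parameters` dictionary used above and counts the model fits implied by 5-fold cross-validation. It uses only the standard library and does not call scikit-learn; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the GridSearchCV call above
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}

# Enumerate every combination of parameter values
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_folds = 5  # cv=5 in the call above
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

With `refit=True`, as shown in the printed `GridSearchCV` representation, one additional fit on the full data follows using the best combination found.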

### Example of Using Pipeline

In [26]:
from sklearn.pipeline import Pipeline
def print_features_pipe(clf, vocab, n=10):
    ''' Print the coefficients of the pipeline's final classifier step '''
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))


In [10]:
id2word = Dictionary([doc.split() for doc in data.data])
corpus = [id2word.doc2bow(doc.split()) for doc in data.data]

In [27]:
model = SklearnWrapperLdaModel(num_topics=5, id2word=id2word, iterations=50, random_state=1)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))



[ 0.09038178 -0.10099753]
Positive features: Fame.:0.09
Negative features: considered,:-0.10
0.519295302013
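The pipeline above chains a transformer (the LDA wrapper) with a final classifier. The sketch below illustrates that chaining in plain Python: each intermediate step is fitted and used to transform the data, and the last step is fitted on the transformed features. `TokenCounter`, `ThresholdClassifier`, `fit_pipeline` and `predict_pipeline` are hypothetical stand-ins written for this illustration; they are not part of scikit-learn or gensim.

```python
class TokenCounter(object):
    """Hypothetical transformer: maps each document to [number of tokens]."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[len(doc)] for doc in X]


class ThresholdClassifier(object):
    """Hypothetical classifier: predicts 1 when the feature exceeds the training mean."""

    def fit(self, X, y):
        values = [row[0] for row in X]
        self.threshold_ = sum(values) / float(len(values))
        return self

    def predict(self, X):
        return [1 if row[0] > self.threshold_ else 0 for row in X]


def fit_pipeline(steps, X, y):
    """Fit and apply every intermediate step, then fit the final step."""
    for _, step in steps[:-1]:
        X = step.fit(X, y).transform(X)
    steps[-1][1].fit(X, y)
    return steps


def predict_pipeline(steps, X):
    """Transform X through the intermediate steps, then predict with the final one."""
    for _, step in steps[:-1]:
        X = step.transform(X)
    return steps[-1][1].predict(X)


docs = [['a'], ['a', 'b', 'c', 'd'], ['a', 'b'], ['a', 'b', 'c', 'd', 'e']]
labels = [0, 1, 0, 1]
steps = [('features', TokenCounter()), ('classifier', ThresholdClassifier())]
fit_pipeline(steps, docs, labels)
predictions = predict_pipeline(steps, docs)
print(predictions)
```

In the tutorial's pipeline the classifier therefore sees the topic distributions produced by the LDA step rather than raw token counts, so its coefficients correspond to topics.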
