This notebook is a tutorial on how to use the GSITK library.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

# Datasets

Dataset management is simple. Start by creating a `DatasetManager`, which gives access to the available datasets:

In [3]:
from gsitk.datasets.datasets import DatasetManager

dm = DatasetManager()

Data preparation is done only once. To prepare all datasets:

In [None]:
data = dm.prepare_datasets()

In [10]:
data.keys()

dict_keys(['imdb', 'vader', 'semeval13', 'sst', 'multidomain', 'imdb_unsup', 'pl04', 'semeval14', 'sentiment140'])

To prepare only a subset of the datasets, pass their names:

In [4]:
data = dm.prepare_datasets(['vader', 'pl05'])

In [5]:
print(type(data))
print(data.keys())

<class 'dict'>
dict_keys(['vader', 'pl05'])


`data` is a dict in which each key is a dataset name and each value is a pandas DataFrame holding that dataset.

In [6]:
data['vader'].head()

Unnamed: 0,polarity,text
0,1,"[somehow, i, was, blessed, with, some, really,..."
1,1,"[yay, ., another, good, phone, interview, .]"
2,1,"[we, were, number, deep, last, night, amp, the..."
3,1,"[lmao, allcaps, ,, amazing, allcaps, !]"
4,-1,"[two, words, that, should, die, this, year, :,..."


In [7]:
data['vader']['polarity'].value_counts()

 1    2901
-1    1299
Name: polarity, dtype: int64
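
Because the prepared data is a plain dict of DataFrames, summarizing every dataset takes a short loop. The sketch below uses two made-up toy frames (`dataset_a` and `dataset_b` are hypothetical names, not GSITK datasets) with the same `polarity` and `text` columns shown above.

```python
import pandas as pd

# Toy stand-ins for the dict of prepared datasets (made-up data)
toy_data = {
    'dataset_a': pd.DataFrame({
        'polarity': [1, 1, -1],
        'text': [['good'], ['great', '!'], ['bad']],
    }),
    'dataset_b': pd.DataFrame({
        'polarity': [-1, 1],
        'text': [['awful'], ['fine']],
    }),
}

# Collect the row count and class distribution of each dataset
summary = {}
for name, df in toy_data.items():
    counts = df['polarity'].value_counts().to_dict()
    summary[name] = {'rows': len(df), 'polarity_counts': counts}
    print(name, summary[name])
```

The same loop should work on the real `data` dict, assuming every dataset exposes a `polarity` column.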

# Preprocessing

GSITK has utilities for preprocessing:

In [8]:
from gsitk.preprocess import simple, pprocess_twitter, normalize

text = "My grandmother is an apple. Please, believe me!"
twitter_text = "@POTUS please let me enter to the USA #thanks"

print('simple', simple.preprocess(text))
print('twitter', pprocess_twitter.preprocess(twitter_text))
print('normalize', normalize.preprocess(text))

simple ['my', 'grandmother', 'is', 'an', 'apple', '.', 'please', ',', 'believe', 'me', '!']
twitter <user> please let me enter to the usa <allcaps> <hastag> thanks
normalize ['my', 'grandmother', 'is', 'an', 'apple', '.', 'please', ',', 'believe', 'me', '!']
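
To give an intuition for the `simple` output above, here is a rough regex-based approximation that lowercases the text and splits punctuation into separate tokens. The function `toy_tokenize` is hypothetical and is not GSITK's implementation, which may handle other cases differently.

```python
import re

def toy_tokenize(text):
    """Lowercase the text, then return words and punctuation marks as tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = toy_tokenize("My grandmother is an apple. Please, believe me!")
print(tokens)
```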


# Features

GSITK has a variety of feature extractors. For example, to use a word2vec model as a feature extractor, write:

In [39]:
from gsitk.features.word2vec import Word2VecFeatures

w2v_feat = Word2VecFeatures(w2v_model_path='/data/w2vmodel_500d_5mc')

Features are extracted with the `transform` method, which all feature extractors implement.

In [45]:
transformed = w2v_feat.transform(data['imdb']['text'].values)
transformed.shape

(50000, 500)

In [62]:
transformed[0].shape

(500,)
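
The shapes above show one 500-dimensional vector per document. A common way to get such fixed-size vectors from word embeddings is to average the vectors of a document's tokens. Whether `Word2VecFeatures` does exactly this is an assumption here, and the tiny three-dimensional embedding table below is made up for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table (a real model would use e.g. 500 dimensions)
embeddings = {
    'good': np.array([1.0, 0.0, 0.0]),
    'phone': np.array([0.0, 1.0, 0.0]),
    'interview': np.array([0.0, 0.0, 1.0]),
}
dim = 3

def average_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

docs = [
    ['good', 'phone'],
    ['another', 'good', 'phone', 'interview'],
    ['unknown'],
]

# One row per document, one column per embedding dimension
doc_matrix = np.vstack([average_vector(d) for d in docs])
print(doc_matrix.shape)
```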

If extracting the features is time-consuming, you can save them locally:

In [47]:
from gsitk.features import features

features.save_features(transformed, 'w2v__sentiment40')

And you can load them later:

In [49]:
features.load_features('w2v__sentiment40')

array([[ 0.04839503, -0.03920275,  0.01310699, ..., -0.01793178,
         0.01850573,  0.01894511],
       [ 0.02001294, -0.01502401, -0.0211135 , ..., -0.01764425,
        -0.00566167,  0.02577729],
       [ 0.01879481, -0.04025034, -0.02238391, ..., -0.01603499,
         0.00581812,  0.03437515],
       ..., 
       [ 0.01735126, -0.02752644, -0.02615537, ..., -0.00227182,
         0.00647882,  0.01969421],
       [ 0.01858013, -0.01519343, -0.01451839, ..., -0.00798909,
         0.00773863,  0.04368705],
       [ 0.03160627, -0.0360069 , -0.006861  , ..., -0.01662612,
         0.00133611,  0.0172867 ]])
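
Caching features amounts to writing the array to disk once and reading it back later. As a sketch of the same idea outside GSITK, NumPy's `np.save` and `np.load` round-trip an array through a file. How `save_features` stores its files is not shown in this notebook, so the file name and temporary directory below are only illustrative.

```python
import os
import tempfile

import numpy as np

# Made-up feature matrix: 4 documents, 5 features each
features_array = np.random.rand(4, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_features.npy')
    np.save(path, features_array)   # write the array to disk
    restored = np.load(path)        # read it back into memory

print(np.array_equal(features_array, restored))
```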

# Evaluation: difficult made easy

In [9]:
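# Join each token list back into a single string, since the
# scikit-learn vectorizers used below expect raw text.
data_ready = {}
for data_k, data_v in data.items():
    data_ready[data_k] = data_v.copy()
    data_ready[data_k]['text'] = data_v['text'].apply(' '.join).values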

Prepare the pipelines exactly as you would in scikit-learn.

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

pipeline.fit(data_ready['vader']['text'].values,
             data_ready['vader']['polarity'].values.astype(int))
pipeline.name = 'pipeline_trained'
pipeline.named_steps['vect'].name = 'myvect'
pipeline.named_steps['tfidf'].name = 'mytfidf'
pipeline.named_steps['clf'].name = 'mylogisticregressor'


pipeline2 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

pipeline2.fit(data_ready['pl05']['text'].values,
              data_ready['pl05']['polarity'].values.astype(int))
pipeline2.name = 'pipeline_trained2'

Let the `Evaluation` do its job: evaluate your pipelines!

In [15]:
from gsitk.evaluation.evaluation import Evaluation

datasets_evaluation = {
    'vader': data_ready['vader'],
    'pl05': data_ready['pl05']
}

ev = Evaluation(tuples=None,
                datasets=datasets_evaluation,
                pipelines=[pipeline, pipeline2])
ev.evaluate()
ev.results

Unnamed: 0,Dataset,Features,Model,CV,accuracy,precision_macro,recall_macro,f1_weighted,f1_micro,f1_macro,Description
0,vader,,pipeline_trained__vader,False,0.992143,0.992596,0.988998,0.992128,0.992143,0.990772,vect(myvect) --> tfidf(mytfidf) --> clf(mylogi...
1,vader,,pipeline_trained2__vader,False,0.596429,0.630961,0.649194,0.608576,0.596429,0.59155,vect --> tfidf --> clf
2,pl05,,pipeline_trained__pl05,False,0.578962,0.585842,0.579002,0.570405,0.578962,0.570422,vect(myvect) --> tfidf(mytfidf) --> clf(mylogi...
3,pl05,,pipeline_trained2__pl05,False,0.926788,0.926838,0.926787,0.926786,0.926788,0.926786,vect --> tfidf --> clf
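
As a reminder of what the metric columns mean, the following pure-Python sketch computes accuracy and macro-averaged precision and recall using their usual definitions. The labels are made up, and the helper functions are illustrative rather than the code GSITK runs.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Made-up labels using the same 1 / -1 polarity convention as the datasets
y_true = [1, 1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, 1]

acc = accuracy(y_true, y_pred)
prec, rec = macro_precision_recall(y_true, y_pred)
print(acc, prec, rec)
```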


# Classifiers

In [24]:
from gsitk.classifiers.vader import VaderClassifier

vc = VaderClassifier()
vc.predict(data_ready['vader']['text'].values)

array([ 1.,  1.,  1., ...,  0.,  0.,  1.])

# Evaluation: bonus

The evaluation process uses pipes. Pipes are a way of organizing the different elements of the evaluation. They are represented by `EvalTuple`s, which specify which datasets, features and classifiers to evaluate.

To evaluate a set of models that predict from a set of features, specify `EvalTuple`s. The next example evaluates a simple linear classifier trained with stochastic gradient descent (`SGDClassifier`) that uses word2vec features to predict the sentiment of the `IMDB` dataset.

In [72]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()

# `transformed` holds the features extracted from the IMDB dataset.
# For a proper evaluation, split them into train and test sets
# using the dataset's original folds.
train_indices = (data['imdb']['fold'] == 'train').values
test_indices = (data['imdb']['fold'] == 'test').values

transformed_train = transformed[train_indices]
transformed_test = transformed[test_indices]


sgd.fit(transformed_train, data['imdb']['polarity'][train_indices])

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
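
The split above relies on boolean masks: comparing the `fold` column to a string yields an array of True/False values, and indexing the feature matrix with it keeps only the matching rows. A toy illustration with made-up values:

```python
import numpy as np

# Made-up fold assignment, features and labels for 4 documents
fold = np.array(['train', 'test', 'train', 'test'])
feature_matrix = np.arange(8).reshape(4, 2)  # 4 documents, 2 features each
labels = np.array([1, -1, -1, 1])

# Boolean masks select the rows belonging to each fold
train_mask = fold == 'train'
test_mask = fold == 'test'

X_train, y_train = feature_matrix[train_mask], labels[train_mask]
X_test, y_test = feature_matrix[test_mask], labels[test_mask]
print(X_train.shape, X_test.shape)
```

Because the same mask indexes both the features and the labels, rows and labels stay aligned.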

Prepare the model, features and `EvalTuple` for the evaluation.

In [74]:
from gsitk.pipe import Model, Features, EvalTuple

models = [Model(name='sgd', classifier=sgd)]

feats = [Features(name='w2v__imdb_test', dataset='imdb', values=transformed_test)]

ets = [EvalTuple(classifier='sgd', features='w2v__imdb_test', labels='imdb')]
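
The three lists are connected only through names: the `EvalTuple` refers to the model `'sgd'`, the features `'w2v__imdb_test'` and the labels of `'imdb'`. The sketch below mimics that name-based lookup with plain namedtuples. The `Toy*` classes and the `resolve` helper are hypothetical stand-ins and do not reflect GSITK's internals.

```python
from collections import namedtuple

# Hypothetical stand-ins for the model, features and tuple containers
ToyModel = namedtuple('ToyModel', ['name', 'classifier'])
ToyFeatures = namedtuple('ToyFeatures', ['name', 'dataset', 'values'])
ToyEvalTuple = namedtuple('ToyEvalTuple', ['classifier', 'features', 'labels'])

toy_models = [ToyModel(name='sgd', classifier='<fitted classifier>')]
toy_feats = [ToyFeatures(name='w2v__imdb_test', dataset='imdb', values=[[0.1, 0.2]])]
toy_labels = {'imdb': [1]}  # made-up labels keyed by dataset name
toy_ets = [ToyEvalTuple(classifier='sgd', features='w2v__imdb_test', labels='imdb')]

def resolve(et):
    """Look up the model, features and labels that an evaluation tuple names."""
    model = {m.name: m for m in toy_models}[et.classifier]
    features = {f.name: f for f in toy_feats}[et.features]
    return model, features, toy_labels[et.labels]

model, features, y = resolve(toy_ets[0])
print(model.name, features.name, y)
```

A misspelled name in a tuple would surface here as a `KeyError`, which is why the names in `models`, `feats` and `ets` need to match exactly.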

Perform the evaluation!

In [122]:
from gsitk.evaluation.evaluation import Evaluation

ev = Evaluation(datasets=data, features=feats, models=models, tuples=ets)

In [123]:
# run the evaluation
ev.evaluate()

# view the results
ev.results

Unnamed: 0,Dataset,Features,Model,CV,accuracy,precision_macro,recall_macro,f1_weighted,f1_micro,f1_macro
0,imdb,w2v__imdb_test,sgd,False,0.76164,0.782904,0.76164,0.757075,0.76164,0.757075
