# **GENSIM Package Transformations**

In this notebook, we'll use transformations from the gensim package to show how to convert documents from one vector representation to another. Each transformed representation is then used to train an AdaBoost classifier, and the results are compared.

In [None]:
import pandas as pd
import gensim
import nltk
import string
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

For our purpose, we'll be using the spam.csv file, which contains messages labeled as spam or ham. 

In [None]:
spam = pd.read_csv('spam.csv', encoding='latin-1')

In [None]:
spam = spam.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
spam.columns = ["label", "text"]

Before turning our messages into a word corpus, we preprocess each one: lowercase it, strip accents, split it into tokens, and remove stopwords.

In [None]:
stopwords = nltk.corpus.stopwords.words('english')

In [None]:
message_corpus = []
for text in spam['text']:
  # tokenize into lowercase, accent-stripped tokens of 3 to 15 characters
  message = gensim.utils.simple_preprocess(text, deacc=True, min_len=3, max_len=15)
  # remove stopwords (simple_preprocess has already dropped punctuation)
  message = [word for word in message if word not in stopwords]
  message_corpus.append(message)
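
To make the preprocessing steps concrete, here is a minimal standard-library sketch that approximates them. It is not gensim's implementation: the `deaccent` and `preprocess` helpers, the regular expression, and the three-word stopword list are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical mini stopword list; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "and", "now"}


def deaccent(text):
    """Strip accent marks: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def preprocess(text, stopwords=DEMO_STOPWORDS, min_len=3, max_len=15):
    """Lowercase, deaccent, keep alphabetic tokens of allowed length, drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", deaccent(text.lower()))
    return [t for t in tokens if min_len <= len(t) <= max_len and t not in stopwords]


demo_tokens = preprocess("The Café is GREAT, call 0800 NOW!!")
print(demo_tokens)
```

In this sketch the digits and punctuation disappear during tokenization, "is" is dropped by the length filter, and "the" and "now" are dropped as stopwords.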

In [None]:
message_corpus

[['jurong',
  'point',
  'crazy',
  'available',
  'bugis',
  'great',
  'world',
  'buffet',
  'cine',
  'got',
  'amore',
  'wat'],
 ['lar', 'joking', 'wif', 'oni'],
 ['free',
  'entry',
  'wkly',
  'comp',
  'win',
  'cup',
  'final',
  'tkts',
  'may',
  'text',
  'receive',
  'entry',
  'question',
  'std',
  'txt',
  'rate',
  'apply'],
 ['dun', 'say', 'early', 'hor', 'already', 'say'],
 ['nah', 'think', 'goes', 'usf', 'lives', 'around', 'though'],
 ['freemsg',
  'hey',
  'darling',
  'week',
  'word',
  'back',
  'like',
  'fun',
  'still',
  'xxx',
  'std',
  'chgs',
  'send',
  'rcv'],
 ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent'],
 ['per',
  'request',
  'melle',
  'melle',
  'oru',
  'minnaminunginte',
  'nurungu',
  'vettam',
  'set',
  'callertune',
  'callers',
  'press',
  'copy',
  'friends',
  'callertune'],
 ['winner',
  'valued',
  'network',
  'customer',
  'selected',
  'receivea',
  'prize',
  'reward',
  'claim',
  'call',
  'claim',
  

In [None]:
len(message_corpus)

5572

In [None]:
from gensim import corpora
# creating dictionary of tokens
dictionary = corpora.Dictionary(message_corpus)
print(dictionary)

Dictionary(7305 unique tokens: ['amore', 'available', 'buffet', 'bugis', 'cine']...)


Most gensim transformations take a bag-of-words corpus as input, so we convert each tokenized message into a bag-of-words vector: a list of (token id, count) pairs.

In [None]:
# creating bag-of-words corpus
bow_corpus = [dictionary.doc2bow(text) for text in message_corpus]
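
The bag-of-words step can be sketched without gensim. The example below is a simplified stand-in for `Dictionary` and `doc2bow`, using a hypothetical two-message corpus; the order in which ids are assigned here is an assumption and may differ from gensim's.

```python
from collections import Counter

# Hypothetical two-message corpus, already tokenized.
demo_corpus = [["free", "entry", "win", "entry"], ["win", "prize", "call"]]

# Assign an integer id to each distinct token in order of first appearance.
token2id = {}
for doc in demo_corpus:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


demo_bow = [to_bow(doc) for doc in demo_corpus]
print(demo_bow)
```

Each document becomes a sparse list: only tokens that occur in it are stored, which is why a vocabulary of several thousand tokens stays cheap to represent.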

## Term Frequency * Inverse Document Frequency (Tf-Idf) 

Tf-Idf reweights the bag-of-words counts so that words that are rare in the training corpus receive higher values. The dimensionality stays the same as in the bag-of-words representation.
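
The weighting can be worked through by hand. The sketch below assumes the scheme we understand gensim to use by default: the raw count multiplied by log2(N / df), followed by scaling each document vector to unit length when `normalize=True`. The tiny corpus and the `tfidf` helper are hypothetical.

```python
import math

# Hypothetical bag-of-words corpus: each document is a list of (token_id, count).
demo_bow = [[(0, 1), (1, 2), (2, 1)], [(2, 1), (3, 1)], [(2, 1), (4, 3)]]

n_docs = len(demo_bow)
# Document frequency: number of documents containing each token id.
doc_freq = {}
for doc in demo_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf(doc):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weighted = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]  # drop zero weights
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []


demo_tfidf = [tfidf(doc) for doc in demo_bow]
print(demo_tfidf[0])
```

Token 2 appears in every document, so its weight is zero and it drops out of all three vectors, while the tokens unique to one document carry all of the weight.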

In [None]:
from gensim import models

In [None]:
tfidf_mod = models.TfidfModel(bow_corpus, normalize = True)

In [None]:
# transforming bag of words corpus to tfidf corpus
tfidf_vector = gensim.interfaces.TransformedCorpus(tfidf_mod, bow_corpus)

In [None]:
from gensim import matutils

In [None]:
# transforming gensim corpus to sparse matrix
X_tfidf = matutils.corpus2csc(tfidf_vector)

In [None]:
pd.DataFrame(X_tfidf.toarray().transpose())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7295,7296,7297,7298,7299,7300,7301,7302,7303,7304
0,0.39624,0.263461,0.364398,0.306848,0.306848,0.275006,0.146625,0.181151,0.39624,0.278411,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
1,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
2,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
3,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
4,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
5568,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
5569,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.623061,0.623061,0.000000
5570,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.432601


In [None]:
# transforming sparse matrix to dataframe
X_featurestfidf = pd.DataFrame(X_tfidf.toarray()).transpose()
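
The transpose is needed because `corpus2csc` returns a matrix with one column per document, while the classifier expects one row per document. The following sketch shows the same sparse-to-dense expansion in plain Python; the `to_dense_rows` helper and the sample data are hypothetical.

```python
# Hypothetical sparse corpus: one list of (feature_id, weight) per document.
demo_sparse = [[(0, 0.5), (2, 0.5)], [], [(1, 1.0)]]


def to_dense_rows(corpus, num_features):
    """Expand sparse documents into dense rows, one row per document."""
    rows = []
    for doc in corpus:
        row = [0.0] * num_features
        for feature_id, weight in doc:
            row[feature_id] = weight
        rows.append(row)
    return rows


demo_rows = to_dense_rows(demo_sparse, num_features=3)
print(demo_rows)
```

A document with no remaining features becomes an all-zero row. With gensim, passing `num_terms` to `corpus2csc` fixes the number of features explicitly instead of inferring it from the corpus.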

In [None]:
from sklearn.ensemble import AdaBoostClassifier
# Import the train_test_split function
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
# splitting into training and testing
X_traintfidf, X_testtfidf, y_traintfidf, y_testtfidf = train_test_split(X_featurestfidf, spam['label'], test_size=0.3, random_state = 101)

In [None]:
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=0.5, random_state = 101)
# Train the AdaBoost classifier
mod1 = abc.fit(X_traintfidf, y_traintfidf)

# Predict the labels for the test dataset
y_pred = mod1.predict(X_testtfidf)

In [None]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_testtfidf, y_pred))

Accuracy: 0.9647129186602871


In [None]:
print(metrics.classification_report(y_testtfidf, y_pred))

              precision    recall  f1-score   support

         ham       0.97      0.99      0.98      1464
        spam       0.94      0.76      0.84       208

    accuracy                           0.96      1672
   macro avg       0.95      0.88      0.91      1672
weighted avg       0.96      0.96      0.96      1672



## Latent Semantic Indexing (LSI)

LSI transforms documents from a bag-of-words or Tf-Idf representation into a lower-dimensional space whose dimensions are called topics. Each document is assigned a weight for every topic.
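
LSI is based on a truncated singular value decomposition of the term-document matrix. As a rough illustration, the sketch below finds only the first topic of a hypothetical 4x4 count matrix using power iteration. It is a toy version under those assumptions, not the algorithm gensim runs, and `top_topic` is a name invented here.

```python
import math

# Hypothetical term-document count matrix: rows are terms, columns are documents.
A = [
    [2, 1, 0, 0],  # "free"
    [1, 2, 0, 0],  # "prize"
    [0, 0, 1, 2],  # "lunch"
    [0, 1, 2, 1],  # "home"
]


def top_topic(matrix, iterations=200):
    """Approximate the leading left singular vector by power iteration on A A^T."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    u = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        # doc_scores = A^T u, then u = A doc_scores, renormalized to unit length.
        doc_scores = [sum(matrix[t][d] * u[t] for t in range(n_terms)) for d in range(n_docs)]
        u = [sum(matrix[t][d] * doc_scores[d] for d in range(n_docs)) for t in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u


topic = top_topic(A)
# A document's weight on the topic is the dot product of its column with the topic.
doc_coords = [sum(A[t][d] * topic[t] for t in range(len(A))) for d in range(len(A[0]))]
print([round(x, 3) for x in topic])
print([round(x, 3) for x in doc_coords])
```

With 300 topics, as in the notebook, each message is described by 300 such weights instead of one.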

In [None]:
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=300)

In [None]:
lsi_vector = lsi[bow_corpus]

In [None]:
len(lsi_vector)

5572

In [None]:
X_LSI = matutils.corpus2csc(lsi_vector)

In [None]:
X_featuresLSI = pd.DataFrame(X_LSI.toarray()).transpose()

In [None]:
X_featuresLSI

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.135817,0.213466,0.120078,0.184018,0.140104,0.389382,0.165992,-0.252707,-0.137950,-0.207838,...,0.036287,0.067984,0.060165,0.034592,0.032275,-0.006394,0.085294,-0.085869,-0.071515,0.229787
1,0.008239,0.018685,0.007676,0.026827,0.011824,0.060273,0.030725,-0.017850,-0.011899,-0.008839,...,-0.024398,0.074671,-0.115685,0.028521,0.040768,0.103688,0.018748,0.080379,0.112499,0.029095
2,0.837041,0.273232,-1.095298,-0.566076,0.449358,-0.206580,0.264157,-0.045365,-0.489145,-0.165866,...,0.022164,-0.038512,0.167684,-0.120036,0.027064,0.011120,-0.052591,0.113043,0.088004,0.066460
3,0.061093,0.151863,0.095601,0.140420,0.117956,0.135954,0.036404,-0.030144,-0.065124,0.019397,...,-0.103615,0.025102,-0.036519,0.015055,-0.030534,-0.018533,-0.058430,-0.041632,0.034455,0.103705
4,0.056748,0.147770,0.082239,0.071741,0.042888,-0.026019,-0.074541,-0.006723,-0.033641,0.006500,...,0.012140,-0.094125,0.001820,0.017781,0.092455,-0.016863,-0.102730,0.089390,-0.054593,-0.047009
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,1.156535,-0.667675,0.582350,0.197979,-0.113655,-0.270491,0.358460,-0.246078,-0.528459,0.149771,...,-0.199700,-0.082163,0.099363,-0.117927,0.002628,0.012277,-0.078950,-0.038033,-0.068012,0.037805
5568,0.067226,0.103803,0.072612,0.154803,0.075348,0.213555,0.083171,-0.152298,-0.038540,-0.136345,...,-0.006278,0.003445,-0.012224,-0.004894,0.018625,-0.014743,-0.002016,0.016985,0.013405,0.006955
5569,0.001754,0.006137,0.006084,-0.009829,-0.002173,0.006372,0.007561,0.005491,-0.006020,-0.004914,...,0.022111,-0.015711,-0.016156,-0.011250,-0.000941,0.000053,-0.015154,0.005366,-0.002169,0.012570
5570,0.531138,0.328690,-0.503457,-0.049735,0.403553,0.389459,-0.228506,0.425628,-0.502489,0.597453,...,-0.020299,0.182971,-0.079126,0.092954,-0.064044,-0.124320,0.073444,-0.092496,0.100776,-0.016639


In [None]:
# splitting into training and testing
X_trainLSI, X_testLSI, y_trainLSI, y_testLSI = train_test_split(X_featuresLSI, spam['label'], test_size=0.3, random_state = 101)

In [None]:
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=0.5, random_state = 101)
# Train the AdaBoost classifier
mod = abc.fit(X_trainLSI, y_trainLSI)

In [None]:
# Predict the labels for the test dataset
y_predLSI = mod.predict(X_testLSI)

In [None]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_testLSI, y_predLSI))

Accuracy: 0.9742822966507177


In [None]:
print(metrics.classification_report(y_testLSI, y_predLSI))

              precision    recall  f1-score   support

         ham       0.98      0.99      0.99      1464
        spam       0.95      0.84      0.89       208

    accuracy                           0.97      1672
   macro avg       0.96      0.92      0.94      1672
weighted avg       0.97      0.97      0.97      1672



## Random Projection (RP)

RP also reduces dimensionality. It projects documents onto a randomly chosen lower-dimensional space in a way that approximately preserves the distances between them.
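
A minimal sketch of the idea, assuming a projection matrix of random +1/-1 entries scaled by 1/sqrt(k), where k is the number of output dimensions. This is an assumption about the construction rather than a description of gensim's internals, although the multiples of 0.057735 (about 1/sqrt(300)) in the notebook's RP output are consistent with it. The sizes and the `project` helper are hypothetical.

```python
import math
import random

rng = random.Random(101)  # fixed seed so the sketch is repeatable
num_terms, num_topics = 50, 8  # hypothetical sizes, far smaller than the notebook's

# Random sign matrix: one row per output dimension, one column per term.
projection = [[rng.choice((-1.0, 1.0)) for _ in range(num_terms)] for _ in range(num_topics)]


def project(bow_doc):
    """Map sparse (token_id, count) pairs to a dense vector of length num_topics."""
    scale = 1.0 / math.sqrt(num_topics)
    return [scale * sum(row[tid] * cnt for tid, cnt in bow_doc) for row in projection]


demo_doc = [(3, 1), (17, 2), (42, 1)]
demo_vec = project(demo_doc)
print(demo_vec)
```

Unlike LSI, nothing is learned from the data here beyond the vocabulary size, which makes the transformation cheap but also less tailored to the corpus.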

In [None]:
rpMod = models.RpModel(bow_corpus,num_topics=300)  # fit model

# transforming bag of words corpus to rp corpus
rp_vector = gensim.interfaces.TransformedCorpus(rpMod, bow_corpus)


In [None]:
len(rp_vector)

5572

In [None]:
# transforming gensim corpus to sparse matrix
X_rp = matutils.corpus2csc(rp_vector)

In [None]:
# transforming sparse matrix to dataframe
X_featuresrp = pd.DataFrame(X_rp.toarray()).transpose()

In [None]:
X_featuresrp

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.115470,0.000000,0.000000,-0.346410,0.230940,0.230940,0.230940,0.230940,0.115470,0.000000,...,0.000000,-0.230940,0.115470,-0.115470,0.346410,0.000000,-0.461880,0.000000,0.000000,0.115470
1,-0.115470,0.000000,0.000000,0.230940,0.000000,-0.115470,0.000000,-0.230940,0.115470,0.115470,...,0.115470,0.115470,0.115470,-0.115470,0.000000,0.000000,0.115470,0.000000,-0.115470,0.000000
2,0.057735,0.173205,0.057735,0.057735,0.057735,0.288675,0.404145,0.057735,-0.288675,0.057735,...,-0.404145,0.057735,-0.173205,-0.288675,0.057735,-0.173205,-0.173205,0.173205,0.288675,0.057735
3,-0.230940,-0.115470,0.000000,0.230940,0.115470,-0.230940,-0.346410,-0.115470,0.000000,-0.230940,...,0.000000,-0.115470,-0.115470,0.230940,0.115470,0.000000,0.000000,-0.230940,0.115470,-0.230940
4,0.173205,-0.173205,0.057735,-0.057735,-0.288675,0.057735,-0.173205,0.057735,0.288675,-0.057735,...,0.173205,0.057735,-0.173205,-0.057735,-0.288675,-0.057735,0.057735,0.057735,0.404145,-0.173205
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.000000,0.115470,-0.115470,0.000000,-0.115470,0.115470,-0.346410,-0.230940,0.000000,0.115470,...,0.000000,0.115470,0.000000,-0.230940,0.230940,0.000000,-0.461880,0.230940,0.000000,0.577350
5568,-0.057735,-0.173205,0.057735,0.057735,-0.057735,-0.057735,-0.057735,0.057735,0.057735,-0.057735,...,0.057735,-0.057735,0.057735,-0.057735,-0.057735,-0.173205,0.057735,-0.057735,0.057735,-0.057735
5569,0.057735,-0.057735,-0.057735,-0.057735,0.057735,0.173205,0.057735,0.057735,0.057735,0.057735,...,-0.057735,-0.057735,-0.057735,0.057735,-0.173205,-0.173205,-0.057735,0.057735,0.173205,0.057735
5570,0.115470,-0.115470,-0.115470,-0.115470,0.346410,-0.230940,0.000000,0.230940,0.115470,0.000000,...,0.115470,-0.115470,0.000000,0.115470,0.000000,0.230940,0.000000,-0.115470,-0.230940,0.230940


In [None]:
# splitting into training and testing
X_trainrp, X_testrp, y_trainrp, y_testrp = train_test_split(X_featuresrp, spam['label'], test_size=0.3, random_state = 101)

In [None]:
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=0.5, random_state = 101)
# Train the AdaBoost classifier
mod = abc.fit(X_trainrp, y_trainrp)

# Predict the labels for the test dataset
y_predrp = mod.predict(X_testrp)

In [None]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_testrp, y_predrp))

Accuracy: 0.9425837320574163


In [None]:
print(metrics.classification_report(y_testrp, y_predrp))

              precision    recall  f1-score   support

         ham       0.95      0.99      0.97      1464
        spam       0.87      0.63      0.73       208

    accuracy                           0.94      1672
   macro avg       0.91      0.81      0.85      1672
weighted avg       0.94      0.94      0.94      1672



## Latent Dirichlet Allocation (LDA) Transformation

Like LSI, LDA transforms bag-of-words vectors into a lower-dimensional topic space. The difference is that LDA is probabilistic: each topic can be interpreted as a probability distribution over words.
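
The probabilistic reading can be illustrated with a small worked example. All numbers below are made up: two topics over a four-word vocabulary and one document's topic mixture. The sketch only shows how the pieces combine, not how LDA learns them.

```python
# Hypothetical model with two topics over a four-word vocabulary.
vocab = ["free", "prize", "lunch", "home"]
topic_word = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0: a distribution over words, sums to 1
    [0.05, 0.05, 0.4, 0.5],  # topic 1
]
doc_topic = [0.8, 0.2]  # one document's topic mixture, also sums to 1

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
word_probs = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
    for w in range(len(vocab))
]
print(dict(zip(vocab, (round(p, 3) for p in word_probs))))
```

The features used for classification in this section correspond to `doc_topic`, the per-document topic weights. gensim's `LdaModel` has a `minimum_probability` parameter, and topics whose weight falls below it are omitted from the transformed vector, so with many topics each document's vector is very sparse.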

In [None]:
# creating lda model with 600 topics
lda = models.LdaModel(bow_corpus, num_topics=600, alpha='auto', random_state = 101)



In [None]:
# transforming bag of words corpus to lda corpus
lda_vector = gensim.interfaces.TransformedCorpus(lda, bow_corpus)

In [None]:
len(lda_vector)

5572

In [None]:
# transforming gensim corpus to sparse matrix
X_lda = matutils.corpus2csc(lda_vector)

In [None]:
# transforming sparse matrix to dataframe
X_features_lda = pd.DataFrame(X_lda.toarray()).transpose()

In [None]:
# splitting into training and testing
X_trainlda, X_testlda, y_trainlda, y_testlda = train_test_split(X_features_lda, spam['label'], test_size=0.3, random_state = 101)

In [None]:
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=0.5, random_state = 101)
# Train the AdaBoost classifier
model = abc.fit(X_trainlda, y_trainlda)

# Predict the labels for the test dataset
y_predlda = model.predict(X_testlda)

In [None]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_testlda, y_predlda))

Accuracy: 0.9055023923444976


In [None]:
print(metrics.classification_report(y_testlda, y_predlda))

              precision    recall  f1-score   support

         ham       0.91      1.00      0.95      1464
        spam       0.90      0.27      0.41       208

    accuracy                           0.91      1672
   macro avg       0.90      0.63      0.68      1672
weighted avg       0.91      0.91      0.88      1672



## Hierarchical Dirichlet Process (HDP)

HDP is another transformation that reduces dimensionality. Unlike LSI and LDA, where the number of topics is fixed in advance, it infers the number of topics from the training corpus.
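
One way to picture a model without a fixed topic count is the stick-breaking construction that is commonly used to describe Dirichlet process priors. The sketch below is conceptual only: it draws topic weights from such a prior and says nothing about how gensim fits the model. The `stick_breaking` helper and its parameter values are hypothetical.

```python
import random

rng = random.Random(101)  # fixed seed so the sketch is repeatable


def stick_breaking(concentration, max_topics):
    """Break a unit-length stick into topic weights; stop at max_topics pieces."""
    weights, remaining = [], 1.0
    for _ in range(max_topics):
        fraction = rng.betavariate(1.0, concentration)  # share of what is left
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


demo_weights, leftover = stick_breaking(concentration=2.0, max_topics=20)
print([round(w, 3) for w in demo_weights[:5]], round(leftover, 6))
```

Each new topic takes a share of whatever weight is left over, so the mass remaining for further topics shrinks at every step. In practice only a limited number of topics carry meaningful weight, even though no exact count is specified up front.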

In [None]:
# creating HDP model
hdp = models.HdpModel(bow_corpus, dictionary)

In [None]:
# transforming bag of words corpus to hdp corpus
hdp_vector = gensim.interfaces.TransformedCorpus(hdp, bow_corpus)

In [None]:
# transforming gensim corpus to sparse matrix
X_hdp = matutils.corpus2csc(hdp_vector)

In [None]:
# transforming sparse matrix to dataframe
X_features_hdp = pd.DataFrame(X_hdp.toarray()).transpose()

In [None]:
# splitting into training and testing
X_trainhdp, X_testhdp, y_trainhdp, y_testhdp = train_test_split(X_features_hdp, spam['label'], test_size=0.3, random_state = 101)

In [None]:
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=0.5, random_state = 101)
# Train the AdaBoost classifier
model = abc.fit(X_trainhdp, y_trainhdp)

# Predict the labels for the test dataset
y_predhdp = model.predict(X_testhdp)

In [None]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_testhdp, y_predhdp))

Accuracy: 0.8971291866028708


In [None]:
print(metrics.classification_report(y_testhdp, y_predhdp))

              precision    recall  f1-score   support

         ham       0.90      0.99      0.94      1464
        spam       0.74      0.27      0.39       208

    accuracy                           0.90      1672
   macro avg       0.82      0.63      0.67      1672
weighted avg       0.88      0.90      0.88      1672



## Comparison

The following table shows the AdaBoost results for each transformation. Precision is the value reported for the spam class.

| Transformation | Accuracy | Spam precision |
| --- | --- | --- |
| TF-IDF | 96% | 94% |
| LSI | 97% | 95% |
| RP | 94% | 87% |
| LDA | 91% | 90% |
| HDP | 90% | 74% |

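Since we're classifying messages as spam or ham, we want high precision on the spam class: a false positive here is a legitimate message flagged as spam, and higher precision means fewer of them. As the table shows, LSI is the best option for our purposes, since it has the highest spam precision (95%) as well as the highest accuracy (97%). TF-IDF achieves very similar results (94% and 96%). Precision alone does not tell the whole story, though: LDA reaches 90% spam precision, but its classification report shows a spam recall of only 0.27.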
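
For reference, the metrics in the classification reports can be computed from four counts. The sketch below treats spam as the positive class. The counts are hypothetical: they are not printed anywhere in the notebook and were chosen only so that the rounded results match the TF-IDF report.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical counts with spam as the positive class (not printed by the notebook).
demo = binary_metrics(tp=159, fp=10, fn=49, tn=1454)
print([round(m, 2) for m in demo])
```

Precision depends only on `tp` and `fp`, so it rewards a classifier for rarely flagging ham as spam, while recall depends on `fn` and captures how much spam slips through.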
