# 2 latent methods for dimension reduction and topic modeling

![](https://cdn.pixabay.com/photo/2015/11/07/11/17/golden-gate-bridge-1030999_960_720.jpg)
Photo: https://pixabay.com/en/golden-gate-bridge-women-back-1030999/

Before state-of-the-art word embedding techniques arrived, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) were good approaches to NLP problems. Both take the same input, a bag-of-words matrix. LSA focuses on reducing the dimension of that matrix, while LDA solves topic modeling problems.

I will not go through the mathematical details, as there is a lot of great material on them already; you can find some in the references. To keep things easy to follow, I skipped pre-processing such as stopword removal, even though it is a critical step when you use LSA, LSI and LDA in practice. After reading this article, you will know:
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
- Take Away

In [1]:
from sklearn.datasets import fetch_20newsgroups
train_raw = fetch_20newsgroups(subset='train')
test_raw = fetch_20newsgroups(subset='test')

x_train = train_raw.data
y_train = train_raw.target

x_test = test_raw.data
y_test = test_raw.target

# Latent Semantic Analysis (LSA)

The idea is that words with similar meanings tend to occur in similar pieces of text. In the NLP field, Latent Semantic Indexing (LSI) is commonly used as an alternative name for LSA.

First of all, we have m documents and n words as input. An m × n matrix can be constructed in which each row represents a document and each column represents a word. Each cell can hold either a raw occurrence count or a TF-IDF score. TF-IDF works better than raw counts most of the time, because a high raw frequency does not necessarily make a word more useful for classification.

![](https://1.bp.blogspot.com/-tnzPA6dDtTU/Vw6EWm_PjCI/AAAAAAABDwI/JatHtUJb4fsce9E-Ns5t02_nakFtGrsugCLcB/s1600/%25E8%259E%25A2%25E5%25B9%2595%25E5%25BF%25AB%25E7%2585%25A7%2B2016-04-14%2B%25E4%25B8%258A%25E5%258D%25881.39.07.png)
Photo: http://mropengate.blogspot.com/2016/04/tf-idf-in-r-language.html

The idea of TF-IDF is that a high frequency alone may not provide much information gain. In other words, rare words contribute more weight to the model. A word's importance increases with the number of times it occurs within the same document (i.e. training record), and decreases with the number of other documents in the corpus (i.e. other training records) that contain it. For details, you may check this [blog](https://towardsdatascience.com/3-basic-approaches-in-bag-of-words-which-are-better-than-word-embeddings-c2cbc7398016).
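
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand on a made-up three-document corpus. It uses the textbook definition tf × log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The corpus and the function name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not taken from the 20 newsgroups data)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    top = max(w, key=w.get)
    print(f"{doc!r}: highest-weighted term = {top!r}")
```

A word such as "the" appears in every document, so its IDF is log(1) = 0 and it carries no weight, while a word that appears in only one document gets the largest boost.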

The challenge is that the matrix is very sparse and high-dimensional, and it is noisy because it includes lots of low-frequency words. Truncated SVD is therefore adopted to reduce the dimension.

The idea of SVD is to find the most valuable information and represent it in a lower dimension t.
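
A minimal NumPy sketch of this idea, assuming a small dense document-term matrix with made-up counts. It keeps only the t largest singular values, so each document is described by t numbers instead of one number per word. scikit-learn's `TruncatedSVD` works on sparse input and uses a randomized solver by default, so treat this as an illustration of the concept rather than of what that class does internally.

```python
import numpy as np

# Toy document-term matrix: 4 documents (rows) x 5 terms (columns).
# The counts are made up for illustration.
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

t = 2  # number of latent dimensions to keep

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the t largest singular values and their vectors
docs_reduced = U[:, :t] * s[:t]        # each document as t numbers
A_approx = docs_reduced @ Vt[:t, :]    # rank-t reconstruction of A

print("original shape:", A.shape)
print("reduced shape: ", docs_reduced.shape)
print("reconstruction error:", np.linalg.norm(A - A_approx))
```

The reconstruction error equals the energy of the discarded singular values, which is why dropping the smallest ones loses the least information.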

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold
from multiprocessing import cpu_count
import pandas as pd

In [3]:
def build_bow(x_train, x_test):
    # use_idf=False disables IDF weighting, leaving L2-normalised term frequencies
    tfidf_vec = TfidfVectorizer(use_idf=False, norm='l2')
    
    transformed_x_train = tfidf_vec.fit_transform(x_train)
    transformed_x_test = tfidf_vec.transform(x_test)
    
    print('BoW output shape:', transformed_x_train.shape)
    
    return tfidf_vec, transformed_x_train, transformed_x_test

tfidf_bow, x_train_bow, x_test_bow = build_bow(x_train, x_test)

BoW output shape: (11314, 130107)


In [4]:
def build_tf_idf(x_train, x_test):
    tfidf_vec = TfidfVectorizer(use_idf=True, norm='l2')
    
    transformed_x_train = tfidf_vec.fit_transform(x_train)
    transformed_x_test = tfidf_vec.transform(x_test)
    
    print('TF-IDF output shape:', transformed_x_train.shape)
    
    return tfidf_vec, transformed_x_train, transformed_x_test

tfidf_vec, x_train_tfidf, x_test_tfidf = build_tf_idf(x_train, x_test)

TF-IDF output shape: (11314, 130107)


In [None]:
def build_lsa(x_train, x_test, dim=50):
    tfidf_vec = TfidfVectorizer(use_idf=True, norm='l2')
    svd = TruncatedSVD(n_components=dim)
    
    transformed_x_train = tfidf_vec.fit_transform(x_train)
    transformed_x_test = tfidf_vec.transform(x_test)
    
    print('TF-IDF output shape:', transformed_x_train.shape)
    
    x_train_svd = svd.fit_transform(transformed_x_train)
    x_test_svd = svd.transform(transformed_x_test)
    
    print('LSA output shape:', x_train_svd.shape)
    
    explained_variance = svd.explained_variance_ratio_.sum()
    print("Sum of explained variance ratio: %d%%" % (int(explained_variance * 100)))
    
    return x_train_svd, x_test_svd

x_train_lsa_50, x_test_lsa_50 = build_lsa(x_train, x_test)
x_train_lsa_100, x_test_lsa_100 = build_lsa(x_train, x_test, dim=100)
x_train_lsa_200, x_test_lsa_200 = build_lsa(x_train, x_test, dim=200)
x_train_lsa_400, x_test_lsa_400 = build_lsa(x_train, x_test, dim=400)

TF-IDF output shape: (11314, 130107)
LSA output shape: (11314, 50)
Sum of explained variance ratio: 8%
TF-IDF output shape: (11314, 130107)
LSA output shape: (11314, 100)
Sum of explained variance ratio: 12%
TF-IDF output shape: (11314, 130107)
LSA output shape: (11314, 200)
Sum of explained variance ratio: 19%


The dimension drops from about 130k to just 50, 100, 200 or 400. The retained components explain only a small share of the variance, though: 19% with 200 components.

In [7]:
lr_model_bow = LogisticRegression(solver='newton-cg',n_jobs=cpu_count(), multi_class='auto')
lr_model_bow.fit(x_train_bow, y_train)

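cv = KFold(n_splits=5, shuffle=True, random_state=123)

# Note: cross_val_score clones lr_model_bow and refits it on folds of the test
# split, so the fit on x_train_bow above does not affect the reported accuracy.
scores = cross_val_score(lr_model_bow, x_test_bow, y_test, cv=cv, scoring='accuracy')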
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.7629 (+/- 0.0256)


In [9]:
lr_model = LogisticRegression(solver='newton-cg',n_jobs=cpu_count(), multi_class='auto')
lr_model.fit(x_train_tfidf, y_train)

cv = KFold(n_splits=5, shuffle=True, random_state=123)
    
scores = cross_val_score(lr_model, x_test_tfidf, y_test, cv=cv, scoring='accuracy')
print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.8784 (+/- 0.0216)


In [21]:
x_test_lsa_50.shape[1]

50

In [24]:
def test_lsa(x_train, x_test):
    lr_model = LogisticRegression(solver='newton-cg',n_jobs=cpu_count(), multi_class='auto')
    lr_model.fit(x_train, y_train)

    cv = KFold(n_splits=5, shuffle=True, random_state=123)

    scores = cross_val_score(lr_model, x_test, y_test, cv=cv, scoring='accuracy')
    print("Accuracy for dim %s: %0.4f (+/- %0.4f)" % (x_train.shape[1], scores.mean(), scores.std() * 2))

test_lsa(x_train_lsa_50, x_test_lsa_50)
test_lsa(x_train_lsa_100, x_test_lsa_100)
test_lsa(x_train_lsa_200, x_test_lsa_200)
test_lsa(x_train_lsa_400, x_test_lsa_400)

Accuracy for dim 50: 0.6557 (+/- 0.0369)
Accuracy for dim 100: 0.7176 (+/- 0.0301)
Accuracy for dim 200: 0.7574 (+/- 0.0371)
Accuracy for dim 400: 0.7909 (+/- 0.0284)
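
Because `cross_val_score` refits a clone of the estimator on folds of the data it is given, the scores above are cross-validated within the test split. If the goal is instead to train on one split and score on the other, one alternative is a pipeline like the sketch below. The tiny corpus and its labels are made up for illustration, and the number it prints is not comparable to the results above.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Made-up toy data: label 0 = space, label 1 = cars
train_docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle crew returned from the space station",
    "the engine and brakes of the car were replaced",
    "the dealer sold a used car with new tires",
    "the truck engine needs an oil change",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = ["a satellite in orbit", "car engine oil"]
test_labels = [0, 1]

# TF-IDF -> truncated SVD (LSA) -> logistic regression, fitted on train only
lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=123),
    LogisticRegression(),
)
lsa_clf.fit(train_docs, train_labels)

# Score on documents the pipeline has never seen
print("Held-out accuracy:", lsa_clf.score(test_docs, test_labels))
```

Wrapping the vectorizer and the SVD in the pipeline also keeps the vocabulary and the latent space from being fitted on test documents.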


# Take Away
- Both LSA and LDA take a __bag-of-words matrix as input__
- The challenge of SVD is that it is __hard to determine the optimal number of dimensions__. In general, a low dimension consumes fewer resources but may fail to distinguish words with opposite meanings, while a high dimension can distinguish them at the cost of more resources.
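
One common heuristic for the number of dimensions, sketched below with NumPy on synthetic data, is to look at the cumulative explained variance and pick the smallest number of components that reaches a chosen threshold. The 90% threshold, the synthetic matrix and the helper name `smallest_dim_for` are assumptions for illustration. With scikit-learn, the same cumulative sum can be taken over the `explained_variance_ratio_` of a fitted `TruncatedSVD`.

```python
import numpy as np

def smallest_dim_for(variance_ratio, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    cumulative = np.cumsum(variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic data: 200 samples, 30 features, most variance in 3 directions
rng = np.random.default_rng(123)
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 30)) * 5.0
noise = rng.normal(size=(200, 30))
X = signal + noise

# Share of variance explained by each singular direction of the centred matrix
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
variance_ratio = s**2 / np.sum(s**2)

dim = smallest_dim_for(variance_ratio, threshold=0.90)
print("components needed for 90% of the variance:", dim)
```

A variance threshold does not guarantee the best downstream accuracy, so it is worth checking the chosen dimension against a validation score as well.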

# About Me
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me on my [Medium Blog](https://medium.com/@makcedward) or [Github](https://github.com/makcedward).

# Reference
- [1] SVD Tutorial: https://cs.fit.edu/~dmitra/SciComp/Resources/singular-value-decomposition-fast-track-tutorial.pdf
- [2] CUHK LSI Tutorial: http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf
- [3] Stanford LSI Tutorial: https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf
- [4] LSA and LDA Explanation: https://cs.stanford.edu/~ppasupat/a9online/1140.html