# LSA : Latent Semantic Analysis

## References

* ETC
  - [Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/)

## DATA

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_colwidth", 200)

### Get Data

In [2]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)

11314

### Data cleansing

In [3]:
news_df = pd.DataFrame( {'doc': documents } )

# keep only letters (and '#'); regex=True must be explicit on pandas >= 2.0
news_df['clean_doc'] = news_df['doc'].str.replace( "[^a-zA-Z#]", " ", regex=True )
# remove short words (3 characters or fewer)
news_df['clean_doc'] = news_df['clean_doc'].apply( lambda x: ' '.join( [ w for w in x.split() if len(w) > 3 ] ) )
# lowercase all text
news_df['clean_doc'] = news_df['clean_doc'].apply( lambda x: x.lower() )
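
The three cleaning steps above can be checked on a single string before running them over the whole corpus. The following is a minimal sketch using only the standard library; `clean_text` and `sample` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import re

def clean_text(text, min_len=4):
    """Keep letters and '#', drop words shorter than min_len, then lowercase."""
    # Replace every character that is not a letter or '#' with a space
    letters_only = re.sub(r"[^a-zA-Z#]", " ", text)
    # Keep words with at least min_len characters (same as len(w) > 3)
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    return " ".join(kept).lower()

sample = "The SCSI-2 drive costs $120, e-mail me at 5pm!"
print(clean_text(sample))  # scsi drive costs mail
```

Note that the length filter also removes short but meaningful tokens, so it is worth inspecting a few cleaned documents before building the matrix.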

In [4]:
# remove stopwords
"""
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())

# remove stop-words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

news_df['clean_doc'] = detokenized_doc
"""


## Document-Term Matrix

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer( stop_words='english', 
                             max_features=1000, # keep top 1000 terms
                             max_df = 0.05, smooth_idf=True )

X = vectorizer.fit_transform( news_df['clean_doc'] )
print( X.shape )

(11314, 1000)
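
To make the `max_df` and `smooth_idf` settings more concrete, here is a small pure-Python sketch on a hypothetical four-document corpus. It assumes the smoothed IDF formula documented by scikit-learn, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and treats a float `max_df` as a proportion of documents. It leaves out term frequency and the row normalization that `TfidfVectorizer` also applies, so it illustrates the idea rather than reproducing the vectorizer.

```python
import math

# Hypothetical toy corpus (not from the 20 newsgroups data)
docs = [
    "space shuttle launch",
    "space station orbit",
    "hockey team season",
    "hockey game season",
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {}
for doc in docs:
    for term in set(doc.split()):
        df[term] = df.get(term, 0) + 1

def smooth_idf(term):
    """Smoothed inverse document frequency, as with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

# A float max_df drops terms that occur in more than max_df * n_docs documents
max_df = 0.25
kept = sorted(t for t, count in df.items() if count <= max_df * n_docs)

print(kept)
print(smooth_idf("shuttle") > smooth_idf("space"))  # rarer terms get a larger weight
```

With `max_df = 0.05` on 11314 documents, the same rule would discard any term that appears in more than roughly 565 documents.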


In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
print( " ".join( terms ) )

ability absolutely accept access according account action actions acts actual added addition additional address administration advance advice agencies agree algorithm allow allowed allows amendment america american americans analysis angeles anonymous answer answers anti anybody apparently appear appears apple application applications apply appreciate appreciated approach appropriate april arab archive area areas aren argument arguments armenia armenian armenians arms army article articles asked asking assume assuming atheism atheists attack attacks attempt author authority auto average avoid aware away background base baseball based basic basically basis begin beginning belief beliefs bible bike bios bits black block blood blue board body book books boston bought break bring brought build building built business cable california calling calls came canada card cards care carry cars cases cause caused center certain certainly chance change changed changes char character cheap check chic

## Topic Modeling via SVD

In [7]:
from sklearn.decomposition import TruncatedSVD

svd_model = TruncatedSVD( n_components=20, algorithm="randomized", n_iter=100, random_state=122 )
svd_model.fit( X )

# The components of svd_model are our topics
print( svd_model.components_.shape )

(20, 1000)


In [8]:
for i, comp in enumerate( svd_model.components_ ):
    terms_comp = zip( terms, comp )   # pair each term with its weight in this topic
    sorted_terms = sorted( terms_comp, key=lambda x: x[1], reverse=True )[:5]
    top5 = [  term for term, score in  sorted_terms ] 
    print( "# Topic {:02d} : {}".format( i+1, " ".join( top5 ) ) )

# Topic 01 : drive card program file government
# Topic 02 : drive card scsi disk file
# Topic 03 : game team games season players
# Topic 04 : drive scsi drives jesus controller
# Topic 05 : card video monitor drivers cards
# Topic 06 : chip government clipper encryption phone
# Topic 07 : email sale offer interested condition
# Topic 08 : bike window engine speed space
# Topic 09 : israel space card israeli file
# Topic 10 : space nasa data shuttle launch
# Topic 11 : window israel israeli display server
# Topic 12 : address advance email info list
# Topic 13 : game bike israel space chip
# Topic 14 : israel team bike israeli jews
# Topic 15 : window file list space bike
# Topic 16 : program files government anybody armenian
# Topic 17 : list bike software game version
# Topic 18 : armenian armenians version turkish software
# Topic 19 : monitor software color computer info
# Topic 20 : heard list games program chip
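
The top-term lookup above can also be written with `np.argsort`, which avoids building and sorting a list of pairs for every topic. This is a sketch on a made-up 3 x 6 component matrix; `top_terms` and the toy values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical vocabulary and topic-term weights (rows = topics, columns = terms)
terms = np.array(["drive", "scsi", "game", "team", "nasa", "launch"])
components = np.array([
    [0.70, 0.60, 0.05, 0.02, 0.10, 0.08],
    [0.03, 0.01, 0.75, 0.65, 0.02, 0.04],
    [0.09, 0.02, 0.01, 0.03, 0.72, 0.68],
])

def top_terms(components, terms, n=2):
    """Return the n highest-weighted terms for each topic, in descending order."""
    # argsort is ascending, so reverse each row and keep the first n columns
    order = np.argsort(components, axis=1)[:, ::-1][:, :n]
    return [terms[row].tolist() for row in order]

for i, names in enumerate(top_terms(components, terms), start=1):
    print("# Topic {:02d} : {}".format(i, " ".join(names)))
```

Like the loop above, this ranks by signed weight, so terms with large negative weights in a component are not shown.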


In [22]:
X_topics = svd_model.transform(X)
print( X_topics.shape )
print( X_topics[0, : ] )

(11314, 20)
[ 0.10576955 -0.08813723 -0.04810648  0.04913651  0.01133246  0.01591086
 -0.05865154 -0.00723201  0.13446685 -0.0818737   0.06789282  0.00350589
  0.04187729  0.06997614 -0.01820805 -0.02899951  0.01215114  0.00526533
 -0.00040477  0.02610957]
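
Each row of `X_topics` holds one document's coordinates in the 20-dimensional topic space. As a sketch of what the projection does, the snippet below uses an exact SVD from NumPy on a small random matrix. The names `X_toy`, `components` and `doc_topics` are hypothetical, and `TruncatedSVD` with `algorithm="randomized"` only approximates this decomposition, so values and signs may differ from scikit-learn's output.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((6, 5))   # stand-in for a small document-term matrix
k = 2                        # number of topics to keep

# Exact SVD: X_toy = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)

components = Vt[:k]                 # plays the role of svd_model.components_
doc_topics = X_toy @ components.T   # plays the role of svd_model.transform(X)

print(doc_topics.shape)
# Projecting onto the top-k right singular vectors gives U_k scaled by S_k
print(np.allclose(doc_topics, U[:, :k] * S[:k]))
```

The first column of `doc_topics` corresponds to the largest singular value, which is consistent with the first topic above mixing terms from several newsgroups.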


## Topics Visualization

In [16]:
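# Requires the umap-learn package: conda install -c conda-forge umap-learn
# "AttributeError: module 'umap' has no attribute 'UMAP'" usually means the
# unrelated `umap` package (or a local umap.py) is shadowing umap-learn;
# remove it and reinstall umap-learn if that error appears.
import umap

embedding = umap.UMAP( n_neighbors=150, min_dist=0.5, random_state=12 ).fit_transform( X_topics )

plt.figure( figsize=(7, 5) )
plt.scatter( embedding[:, 0], embedding[:, 1],
             c=dataset.target,   # color each point by its newsgroup label
             s=10,               # marker size
             edgecolor='none' )
plt.show()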