## Latent Semantic Analysis

In this notebook we're going to look at how we can 'mine' concepts from a corpus (collection) of text documents.

In the first week of class everyone wrote their own definition of data science.   This week I'm going to show you how to extract 'concepts' from that corpus mathematically.  The techinque we're going to use is called latent semantic analysis.  

In [1]:
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jakfar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

I exported the forum posts for module 0 into an XML file.  Each post is wrapped in <text></text> tags.   I'll use BeautifulSoup to process the XML

In [11]:
posts = open('raw_forum_posts.dat', 'r').read()

In [12]:
soup = BeautifulSoup(posts, 'lxml')
postTxt = soup.findAll('text')  #all posts <text> 
postDocs = [x.text for x in postTxt]
postDocs.pop(0)
postDocs = [x.lower() for x in postDocs]

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Stopwords are words that I don't want to convert to featurs,becuase they aren't especially useful.  Words like 'a', 'and', and 'the' are good stopwords in english.   I can use a built in list of stopwords from nltk to get me started.  Then, I'll add some custom stopwords that are 'html junk' that I need to clean out of my data.

In [5]:
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter'
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title'
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class',  ])


### TF-IDF Vectorizing

I'm going to use scikit-learn's TF-IDF vectorizer to take my corpus and convert each document into a sparse matrix of TFIDF Features...

In [6]:
#Before!
postDocs[0]

'<p>data science is about analyzing relevant data to obtain patterns of information in order to help achieve a goal. the main focus of the data analysis is the goal rather then the methodology on how it will achieved. this allows for creative thinking and allowing for the optimal solution or model to be found wihtout the constraint of a specific methodology.</p>'

In [7]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(postDocs)

In [8]:
X[0]

<1x3367 sparse matrix of type '<class 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>

In [9]:
#After
print(X[0])

  (0, 553)	0.10905143902
  (0, 3301)	0.10905143902
  (0, 1290)	0.10905143902
  (0, 1940)	0.10905143902
  (0, 2747)	0.10905143902
  (0, 2069)	0.10905143902
  (0, 107)	0.10905143902
  (0, 2969)	0.10905143902
  (0, 620)	0.10905143902
  (0, 110)	0.10905143902
  (0, 54)	0.10905143902
  (0, 1906)	0.10905143902
  (0, 2369)	0.10905143902
  (0, 1384)	0.10905143902
  (0, 147)	0.10905143902
  (0, 650)	0.10905143902
  (0, 1254)	0.10905143902
  (0, 1805)	0.10905143902
  (0, 1380)	0.10905143902
  (0, 49)	0.10905143902
  (0, 1463)	0.10905143902
  (0, 2079)	0.10905143902
  (0, 1589)	0.10905143902
  (0, 2146)	0.10905143902
  (0, 2030)	0.10905143902
  :	:
  (0, 1284)	0.0882799340268
  (0, 1932)	0.0761293825223
  (0, 2743)	0.0969008875155
  (0, 2067)	0.10905143902
  (0, 105)	0.10905143902
  (0, 2967)	0.0969008875155
  (0, 618)	0.0969008875155
  (0, 108)	0.10905143902
  (0, 52)	0.10905143902
  (0, 1904)	0.21810287804
  (0, 2367)	0.10905143902
  (0, 143)	0.071509957232
  (0, 1250)	0.0882799340268
  (0, 180

### LSA

Input:  X, a matrix where m is the number of documents I have, and n is the number of terms.

Process:   I'm going to decompose X into three matricies called U, S, and T.  When we do the decomposition, we have to pick a value k, that's how many concepts we are going to keep.  

$$X \approx USV^{T}$$

U will be a m x k matrix.  The rows will be documents and the columns will be 'concepts'

S will be a k x k diagnal matrix.   The elements will be the amount of variation captured from each concept.

V will be a n x k (mind the transpose) matrix.   The rows will be terms and the columns will be conepts.  

In [10]:
X.shape

(27, 3367)

In [11]:
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)



TruncatedSVD(algorithm='randomized', n_components=27, n_iter=100,
       random_state=None, tol=0.0)

In [12]:
#This is the first row for V
lsa.components_[0]

array([ 0.00524647,  0.00524647,  0.00524647, ...,  0.00435294,
        0.00435294,  0.00435294])

In [13]:
import sys
print (sys.version)

3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]


In [17]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print("Concept %d:" % i )
    for term in sortedTerms:
        print(term[0])
    print (" ")

Concept 0:
data
procedures
large amounts
large amounts data
science
amounts
amounts data
different
could
large
 
Concept 1:
procedures
large amounts
large amounts data
could
amounts
amounts data
large
used
according data
according data science
 
Concept 2:
white
converted
make
white converted
white white
20 white ata
according
47 20 white
methods
perspective
 
Concept 3:
white
questions
answer
using
part
canada
contacts
way
since
efficient
 
Concept 4:
business
statistics
fields
knowledge
complex
gained
opinion
statistics data
studies
use statistics
 
Concept 5:
dig
white
business
methods
perspective
archaeology
history
many
different
others
 
Concept 6:
different
provide
child
collect
lego
ways
scientist
collect data
data sets
feedback
 
Concept 7:
dig
archaeology
history
analysis
find
computer
organization
whether
read
decisions
 
Concept 8:
large
business
competitive edge
edge
especially
questions
competitive
data scientist
technologies
scientist
 
Concept 9:
digital
trends
knowledg

In [18]:
lsa.components_

array([[  5.24646626e-03,   5.24646626e-03,   5.24646626e-03, ...,
          4.35293786e-03,   4.35293786e-03,   4.35293786e-03],
       [ -9.94441560e-03,  -9.94443477e-03,  -9.94448909e-03, ...,
         -7.47511672e-03,  -7.47511672e-03,  -7.47511781e-03],
       [  3.24430859e-02,   3.36540548e-02,   5.96254890e-02, ...,
         -5.67205587e-03,  -5.67205587e-03,  -5.56974736e-03],
       ..., 
       [ -1.15545114e-02,  -3.33576033e-01,  -2.17246552e-01, ...,
         -3.25054705e-04,  -3.25054705e-04,  -2.14573508e-03],
       [ -2.55995170e-02,   1.89322338e-01,  -2.38088432e-01, ...,
          8.85681690e-04,   8.85681690e-04,  -1.13710361e-03],
       [  4.94312104e-01,  -2.37995590e-02,  -3.26411566e-01, ...,
         -8.96565955e-04,  -8.96565955e-04,  -1.28023812e-03]])