## Latent Semantic Analysis

In this notebook we're going to look at how we can 'mine' concepts from a corpus (collection) of text documents.

In the first week of class everyone wrote their own definition of data science.   This week I'm going to show you how to extract 'concepts' from that corpus mathematically.  The techinque we're going to use is called latent semantic analysis.  

In [1]:
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [2]:
#run this only once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/mike/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

I exported the forum posts for module 0 into an XML file.  Each post is wrapped in <text></text> tags.   I'll use BeautifulSoup to process the XML

In [3]:
posts = open('raw_forum_posts.dat', 'r').read()

In [4]:
soup = BeautifulSoup(posts)
postTxt = soup.findAll('text')  #all posts <text> 
postDocs = [x.text for x in postTxt]
postDocs.pop(0)
postDocs = [x.lower() for x in postDocs]

Stopwords are words that I don't want to convert to featurs,becuase they aren't especially useful.  Words like 'a', 'and', and 'the' are good stopwords in english.   I can use a built in list of stopwords from nltk to get me started.  Then, I'll add some custom stopwords that are 'html junk' that I need to clean out of my data.

In [5]:
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter'
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title'
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class',  ])


### TF-IDF Vectorizing

I'm going to use scikit-learn's TF-IDF vectorizer to take my corpus and convert each document into a sparse matrix of TFIDF Features...

In [6]:
#Before!
postDocs[0]

u'<p>data science is about analyzing relevant data to obtain patterns of information in order to help achieve a goal. the main focus of the data analysis is the goal rather then the methodology on how it will achieved. this allows for creative thinking and allowing for the optimal solution or model to be found wihtout the constraint of a specific methodology.</p>'

In [7]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(postDocs)

In [8]:
X[0]

<1x3390 sparse matrix of type '<type 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>

In [9]:
#After
print X[0]

  (0, 558)	0.10905143902
  (0, 3324)	0.10905143902
  (0, 1301)	0.10905143902
  (0, 1951)	0.10905143902
  (0, 2758)	0.10905143902
  (0, 2080)	0.10905143902
  (0, 107)	0.10905143902
  (0, 2980)	0.10905143902
  (0, 625)	0.10905143902
  (0, 110)	0.10905143902
  (0, 54)	0.10905143902
  (0, 1917)	0.10905143902
  (0, 2380)	0.10905143902
  (0, 1395)	0.10905143902
  (0, 147)	0.10905143902
  (0, 655)	0.10905143902
  (0, 1265)	0.10905143902
  (0, 1816)	0.10905143902
  (0, 1391)	0.10905143902
  (0, 49)	0.10905143902
  (0, 1474)	0.10905143902
  (0, 2090)	0.10905143902
  (0, 1600)	0.10905143902
  (0, 2157)	0.10905143902
  (0, 2041)	0.10905143902
  :	:
  (0, 1295)	0.0882799340268
  (0, 1943)	0.0761293825223
  (0, 2754)	0.0969008875155
  (0, 2078)	0.10905143902
  (0, 105)	0.10905143902
  (0, 2978)	0.0969008875155
  (0, 623)	0.0969008875155
  (0, 108)	0.10905143902
  (0, 52)	0.10905143902
  (0, 1915)	0.21810287804
  (0, 2378)	0.10905143902
  (0, 143)	0.071509957232
  (0, 1261)	0.0882799340268
  (0, 181

###LSA

Input:  X, a matrix where m is the number of documents I have, and n is the number of terms.

Process:   I'm going to decompose X into three matricies called U, S, and T.  When we do the decomposition, we have to pick a value k, that's how many concepts we are going to keep.  

$$X \approx USV^{T}$$

U will be a m x k matrix.  The rows will be documents and the columns will be 'concepts'

S will be a k x k diagnal matrix.   The elements will be the amount of variation captured from each concept.

V will be a n x k (mind the transpose) matrix.   The rows will be terms and the columns will be conepts.  

In [10]:
X.shape

(27, 3390)

In [11]:
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)



TruncatedSVD(algorithm='randomized', n_components=27, n_iter=100,
       random_state=None, tol=0.0)

In [12]:
#This is the first row for V
lsa.components_[0]

array([ 0.00523425,  0.00523425,  0.00523425, ...,  0.00431874,
        0.00431874,  0.00431874])

In [13]:
import sys
print (sys.version)

2.7.9 |Anaconda 2.1.0 (64-bit)| (default, Dec 15 2014, 10:33:51) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]


In [14]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print "Concept %d:" % i
    for term in sortedTerms:
        print term[0]
    print " "

Concept 0:
data
procedures
large amounts
large amounts data
science
amounts
amounts data
different
could
large
 
Concept 1:
data
science
data science
make
white
answer
decisions
using
questions
finding
 
Concept 2:
white
converted
white converted
white white
information
big
big data
20
20 white ata
47
 
Concept 3:
make
decisions
better
good
use data
use
make better
better decisions
make better decisions
knowledge
 
Concept 4:
business
help
ultimately
data help
scientist
fields
competitive edge
edge
especially
methods
 
Concept 5:
answer
part
finding
art
relevant
relevant data
problem
business
using
methods
 
Concept 6:
answer
good
questions
using
decisions
canada
contacts
practices
much
art
 
Concept 7:
statistics
answers
questions
gaining
information
science
days
everyone
hello everyone
hello
 
Concept 8:
people
predict
define
learn data
learn data science
concept
concept data
concept data science
find
problems
 
Concept 9:
greater
humanity
ultimately
techniques
faster
increase data
u

In [15]:
lsa.components_

array([[ 0.00523425,  0.00523425,  0.00523425, ...,  0.00431874,
         0.00431874,  0.00431874],
       [ 0.00997879,  0.0099788 ,  0.0099788 , ...,  0.00746359,
         0.00746359,  0.00746359],
       [ 0.05343461,  0.05274622,  0.05274622, ..., -0.00421642,
        -0.00421642, -0.00421642],
       ..., 
       [ 0.01367572,  0.00640647,  0.00640647, ...,  0.00482124,
         0.00482124,  0.00482124],
       [-0.02086219, -0.00217038, -0.00217038, ..., -0.00016045,
        -0.00016045, -0.00016045],
       [ 0.08823589, -0.00363022, -0.00363022, ..., -0.00160267,
        -0.00160267, -0.00160267]])