## Latent Semantic Analysis - Lab
 
In this lab I mine concepts from the 'rec.sport.baseball' newsgroup of the 20 Newsgroups collection. I first vectorize the corpus with TF-IDF and then apply Latent Semantic Analysis (LSA) to it.

In [291]:
from bs4 import BeautifulSoup  # Library for pulling data out of HTML and XML files
import nltk  # Natural Language Toolkit
# Stop words: high-frequency words such as "the", "to" and "also" that we filter out before further processing
from nltk.corpus import stopwords
# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD  # Dimensionality reduction using truncated SVD (aka LSA)
import re  # Regular expressions, used to clean the raw text

In [154]:
#run this only once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/JG/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Download the rec.sport.baseball dataset below.

In [295]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

Clean up the data set by replacing every non-letter character with a space.

In [297]:
for idx in range(0,len(corpus)):
    # Use regular expressions to do a find-and-replace
    corpus[idx] = re.sub("[^a-zA-Z]",           # The pattern to search for
                         " ",                   # The pattern to replace it with
                         corpus[idx])  # The text to search
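To make the effect of the cleanup concrete, here is a minimal sketch that applies the same pattern to one made-up line (the `sample` string is hypothetical, not taken from the corpus). It shows that digits and punctuation are each replaced by a space, so the cleaned text has the same length as the original.

```python
import re

# Hypothetical sample line, not taken from the corpus
sample = "Re: Bosio's no-hitter (4/22/93)"

# Same pattern as the cleanup loop: every non-letter becomes one space
cleaned = re.sub("[^a-zA-Z]", " ", sample)
print(repr(cleaned))

# Splitting on whitespace shows which letter-only fragments survive
tokens = cleaned.split()
print(tokens)  # ['Re', 'Bosio', 's', 'no', 'hitter']
```

Note that apostrophes and hyphens split words ("Bosio's" becomes "Bosio" and "s"), which explains the stray single letters visible in the cleaned documents.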

In [298]:
corpus[0]

u'From  writingctr leo bsuvc bsu edu Subject  Re  CUB fever  Organization  Ball State University  Muncie  In   Univ  Computing Svc s Lines       In article  kingoz           camelot   kingoz camelot bradley edu  Orin Roth  writes          CUB fever is hitting me again  I m beginning to think they have a       chance this year   what the heck am i thinking        Sorry  Just a moment of incompetence       I ll be ok  Really        Orin       Bradley U            I m really a jester in disguise                                     I hear ya   Then again  we must remember that we are indeed Cub fans  and that the Cubs will eventually blow it   After all  the Cubs are the easiest team in the National League to root for   No Pressure   You know they will lose eventually   Oh well  I suppose we must have faith   After all  they do look pretty good  and they don t even have Sandberg back yet     CUBS IN           CHA '

In [299]:
corpus[10]

u'From  hamkins geisel csl uiuc edu  Jon Hamkins  Subject  Re  Triva question on Bosio s No hitter Organization  Center for Reliable and High Performance Computing  University of Illinois at Urbana Champaign Lines     NNTP Posting Host  grinch csl uiuc edu  wall cc swarthmore edu  Matthew Wall  writes    I don t actually have the answer to this one    Bosio  after walking the first two batters  retired    straight for a   back end  perfect game   Well  there were    outs in a row with no hits or walks in between  but really  he only retired    batters in a row   The first out of the game was the front end of a double play   Still counts as a back end perfect game in my book  though    Congrats to Chris Bosio   Too bad the Brewers couldn t hold on to him            Jon Hamkins   hamkins uiuc edu           University of Illionois '

Create the stop word set from NLTK's standard English stop word list, then add custom stop words (mostly newsgroup header terms) so that the remaining terms are more informative.

In [320]:
stopset = set(stopwords.words('english'))
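# Entries such as '\n', 'Re:', '@', '>', '--' and 'vb30' contain characters that the
# cleanup step strips or that the tokenizer lowercases, so they never match a token.
stopset.update(['\n','com','edu','cs','nntp','Re:','@','>','--','vb30',
                'lafibm','posting','host','ibm','subject','reply','would',
                'come','university','ca','adobe','said'])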
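A rough way to check which custom stop words can ever take effect is sketched below. It assumes the vectorizer's default behaviour (lowercasing, then the default token pattern `(?u)\b\w\w+\b`, then stop word removal) and reuses the cleanup regex from earlier; `tokenize` and `can_match` are hypothetical helper names introduced only for this illustration.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase and tokenize roughly the way the vectorizer does by default."""
    return TOKEN_PATTERN.findall(text.lower())

def can_match(stop_word):
    """True if the stop word survives cleanup and tokenization unchanged."""
    cleaned = re.sub("[^a-zA-Z]", " ", stop_word)
    return tokenize(cleaned) == [stop_word]

# A subset of the custom stop words from the cell above
custom_stops = {"edu", "subject", "Re:", "@", "--", "vb30"}

# Hypothetical cleaned line in the style of corpus[0]
line = "From  writingctr leo bsuvc bsu edu Subject  Re  CUB fever"
kept = [t for t in tokenize(line) if t not in custom_stops]
never_matched = sorted(s for s in custom_stops if not can_match(s))

print(kept)
print(never_matched)  # ['--', '@', 'Re:', 'vb30']
```

Under these assumptions, only lowercase, letters-only entries of two or more characters can remove anything from the vocabulary.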

### TF-IDF Vectorizing

I'm going to use scikit-learn's TF-IDF vectorizer to convert the corpus into a sparse matrix of TF-IDF features, with one row per document and one column per term.

In [321]:
#Before!
corpus[0]

u'From  writingctr leo bsuvc bsu edu Subject  Re  CUB fever  Organization  Ball State University  Muncie  In   Univ  Computing Svc s Lines       In article  kingoz           camelot   kingoz camelot bradley edu  Orin Roth  writes          CUB fever is hitting me again  I m beginning to think they have a       chance this year   what the heck am i thinking        Sorry  Just a moment of incompetence       I ll be ok  Really        Orin       Bradley U            I m really a jester in disguise                                     I hear ya   Then again  we must remember that we are indeed Cub fans  and that the Cubs will eventually blow it   After all  the Cubs are the easiest team in the National League to root for   No Pressure   You know they will lose eventually   Oh well  I suppose we must have faith   After all  they do look pretty good  and they don t even have Sandberg back yet     CUBS IN           CHA '

In [322]:
# Convert the collection of raw documents into a matrix of TF-IDF features,
# using unigrams and bigrams and dropping the stop words defined above.
vectorizer = TfidfVectorizer(stop_words=stopset,
                             use_idf=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)  # Learn vocabulary and idf, return the document-term matrix.
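To show what the weights in this matrix represent, here is a small hand-rolled sketch of TF-IDF on a made-up, pre-tokenized corpus. It assumes the vectorizer's default settings (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization per document); the `docs` data and the `tfidf` helper are hypothetical and are not part of the lab.

```python
import math
from collections import Counter

# Hypothetical three-document toy corpus, already tokenized
docs = [
    ["cubs", "fever", "cubs"],
    ["cubs", "lose"],
    ["braves", "win"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0)
               for t, c in counts.items()}
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf(docs[0])
row1 = tfidf(docs[1])
print(row0)
print(row1)
```

In the second toy document, the rarer term "lose" outweighs "cubs" even though both occur once. Terms that share the same count and document frequency receive identical weights, which is one plausible reason many entries in a single row of `X` repeat the same value.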

In [323]:
X[0]

<1x82613 sparse matrix of type '<type 'numpy.float64'>'
	with 137 stored elements in Compressed Sparse Row format>

In [324]:
#After
print(X[0])

  (0, 15871)	0.0965237950924
  (0, 82246)	0.0965237950924
  (0, 5083)	0.0965237950924
  (0, 61669)	0.0910953309428
  (0, 21650)	0.0965237950924
  (0, 28321)	0.0910953309428
  (0, 55381)	0.0602677692641
  (0, 40643)	0.0965237950924
  (0, 22845)	0.0965237950924
  (0, 46151)	0.0965237950924
  (0, 69938)	0.0965237950924
  (0, 78888)	0.0965237950924
  (0, 48823)	0.0763868460347
  (0, 21717)	0.0965237950924
  (0, 40866)	0.0965237950924
  (0, 36969)	0.0965237950924
  (0, 55343)	0.0965237950924
  (0, 60450)	0.0965237950924
  (0, 38163)	0.0965237950924
  (0, 46461)	0.0589917337094
  (0, 71153)	0.0965237950924
  (0, 19919)	0.0965237950924
  (0, 15876)	0.0965237950924
  (0, 8340)	0.0965237950924
  (0, 21710)	0.0965237950924
  :	:
  (0, 12000)	0.0514416306135
  (0, 72181)	0.0295297613638
  (0, 6902)	0.0704714831393
  (0, 31589)	0.0449770952442
  (0, 80701)	0.0204306655198
  (0, 60573)	0.0725352894259
  (0, 50087)	0.145070578852
  (0, 9226)	0.130086037979
  (0, 10660)	0.137367465634
  (0, 36636)	0.

### LSA

Input:  X, an m x n matrix, where m is the number of documents and n is the number of terms.

Process:   I'm going to decompose X into three matrices called U, S, and $V^{T}$.  When we do the decomposition, we have to pick a value k, the number of concepts we are going to keep.  

$$X \approx USV^{T}$$

U will be an m x k matrix.  The rows will be documents and the columns will be 'concepts'.

S will be a k x k diagonal matrix.   Its diagonal elements are the singular values, which indicate how much variation each concept captures.

V will be an n x k matrix (mind the transpose).   The rows will be terms and the columns will be concepts.  
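The shapes described above can be checked on a tiny example. The sketch below uses NumPy's exact SVD and then truncates it to k components; this is a stand-in for illustration only, since `TruncatedSVD` uses a randomized solver by default rather than a full decomposition. The matrix values and the choice k = 2 are made up.

```python
import numpy as np

# Hypothetical 4 x 5 document-term matrix (m = 4 documents, n = 5 terms)
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])
k = 2  # number of concepts to keep

# Full SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# U_k is m x k, S_k is k x k, Vt_k is k x n
print("shapes: %s %s %s" % (U_k.shape, S_k.shape, Vt_k.shape))

# The rank-k product approximates X
X_k = U_k.dot(S_k).dot(Vt_k)
print(np.round(X_k, 2))
```

The rows of `Vt_k` play the same role as the rows of `lsa.components_`: one row per concept, one column per term.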

In [325]:
X.shape

(994, 82613)

In [326]:
lsa = TruncatedSVD(n_components=10, n_iter=100)
lsa.fit(X)



TruncatedSVD(algorithm='randomized', n_components=10, n_iter=100,
       random_state=None, tol=0.0)

In [327]:
#This is the first row of V^T: the weight of every term in the first concept
lsa.components_[0]

array([ 0.0078717 ,  0.00131177,  0.00059651, ...,  0.00206443,
        0.00058628,  0.00058628])

In [328]:
import sys
print (sys.version)

2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:43:17) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]


In [329]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_):
    # Pair each term with its weight in this concept and keep the 10 largest weights
    termsInComp = zip(terms, comp)
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sortedTerms:
        print(term[0])
    print(" ")

Concept 0:
year
team
game
writes
article
baseball
games
players
one
good
 
Concept 1:
morris
lost
think
sox
see
season
many
apr
year
one
 
Concept 2:
games
braves
season
league
organization
baseball
better
two
way
lost
 
Concept 3:
year
one
braves
season
well
say
article
news
jewish
last year
 
Concept 4:
year
team
braves
play
well
know
game
apr
john
hit
 
Concept 5:
hitter
years
article apr
let
hit
john
well
know
time
gant
 
Concept 6:
well
better
year
last
sox
two
make
win
many
players
 
Concept 7:
organization
team
new
aa atlanta
article
lines
like
cubs
sox
think
 
Concept 8:
year
baseball
aa freenet
aa formerly
lines
much
aa improve
back
article apr
season
 
Concept 9:
writes
time
games
year
good
aa formerly
article
baseball
aa improve
aa atlanta
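The listing above ranks terms by signed weight, so only the most positive terms of each concept appear. Since component rows can also contain negative weights (as the `lsa.components_` output shows), it may be worth comparing against a ranking by magnitude. The sketch below does this on made-up data; the vocabulary, the weights, and the `top_terms` helper are all hypothetical.

```python
# Hypothetical vocabulary and two component rows (weights are made up)
terms = ["braves", "cubs", "game", "sox", "year"]
components = [
    [0.10, 0.05, 0.60, 0.02, 0.70],
    [0.50, -0.65, 0.01, -0.40, 0.03],
]

def top_terms(weights, n=3, by_magnitude=False):
    """Return the n terms with the largest (or largest-magnitude) weights."""
    if by_magnitude:
        key = lambda tw: abs(tw[1])
    else:
        key = lambda tw: tw[1]
    ranked = sorted(zip(terms, weights), key=key, reverse=True)
    return [term for term, _ in ranked[:n]]

for i, comp in enumerate(components):
    signed = top_terms(comp)
    magnitude = top_terms(comp, by_magnitude=True)
    print("Concept %d: signed=%s magnitude=%s" % (i, signed, magnitude))
```

For the second toy component the two rankings differ: the signed ranking leads with "braves", while the magnitude ranking surfaces "cubs" and "sox", whose weights are large but negative.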
 


In [330]:
lsa.components_

array([[ 0.0078717 ,  0.00131177,  0.00059651, ...,  0.00206443,
         0.00058628,  0.00058628],
       [ 0.0012602 , -0.01029942,  0.01007711, ..., -0.00201067,
        -0.00015524, -0.00015524],
       [-0.01217207, -0.01625772,  0.04494135, ...,  0.00031474,
        -0.00108783, -0.00108783],
       ..., 
       [-0.02564566,  0.00427531,  0.06703105, ..., -0.0021611 ,
         0.000294  ,  0.000294  ],
       [ 0.00202877, -0.01223918, -0.09818833, ...,  0.0027089 ,
         0.00030748,  0.00030748],
       [ 0.03800456,  0.00392628, -0.03595597, ...,  0.0008463 ,
         0.00122033,  0.00122033]])