# Latent Semantic Analysis Lab
## Heather R. Sanders
## CSC 570: Data Science
## Spring 2016, Mike Bernico
This lab will mathematically mine concepts from a baseball newsgroup corpus.

Get the newsgroup data

In [1]:
from sklearn.datasets import fetch_20newsgroups

Store the baseball newsgroup category name into categories variable

In [2]:
categories = ['rec.sport.baseball']

Grab the baseball newsgroup data and store it into the dataset variable.

In [3]:
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42, categories=categories)

Create the corpus variable with the baseball newsgroup data.

In [4]:
corpus = dataset.data

In [5]:
# Import BeautifulSoup to clean up newsgroup posts.
# Import Natural Language Toolkit, Stopwords, TFID vectorizer, and Truncated SVD.
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [6]:
# Download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Heather\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
# Update stopwords
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','/n','[\w\.-]+@[\w\.-]+','edu','com','lafayette','re','ibm','aix','would','lafibm','vb30','amp','quot','field','font','normal','span','0px','rgb','style','51', 
                'spacing','text','helvetica','size','family', 'space', 'arial', 'height', 'indent', 'letter','mjones','mike jones','pegasus','uiuc','gt0523e',
                'line','none','sans','serif','transform','line','variant','weight','times', 'new','strong', 'video', 'title','dave','kingman','jack','mss','netcom'
                'white','word','letter', 'roman','0pt','16','color','12','14','21', 'neue', 'apple', 'class','jhu','nix','hcf','0','1','2','3,','4','5','6','7','8','9',
                'nntp','next','steph','john','ca','barman','cs','work','come','scott','dwarener','fls','subject','nextwork','bill',
                '01','02','03','04','05','06','07','08','09','warner','rose','post','posting','mark','david','hulman','singer','roger',
                'smith','steve','lustig','murray','garvey','alomar','lomar','obp','baerga','niguma','dale','stephenson','asd',
                'netcom','vax','cc','cv','dwarner','clarku','nd','jfr2','mail','adobe','ucdavis','reply','snichols','econ','00','jhunix',
                'hp','much','vma','list',])

## TF-IDF Vectorizing
Convert each document in the baseball newsgroup corpus into a sparse matrix of TFIDF Features.

In [17]:
# 
vectorizer = TfidfVectorizer(stop_words=stopset, use_idf=True, ngram_range=(1, 3))
C = vectorizer.fit_transform(corpus)

In [18]:
C[0]

<1x184684 sparse matrix of type '<type 'numpy.float64'>'
	with 224 stored elements in Compressed Sparse Row format>

In [19]:
print C[0]

  (0, 49907)	0.0740966252538
  (0, 183936)	0.0740966252538
  (0, 28153)	0.0740966252538
  (0, 142303)	0.0740966252538
  (0, 61413)	0.0740966252538
  (0, 75036)	0.0740966252538
  (0, 129618)	0.0740966252538
  (0, 100508)	0.0740966252538
  (0, 63801)	0.0740966252538
  (0, 111310)	0.0740966252538
  (0, 158195)	0.0740966252538
  (0, 176476)	0.0740966252538
  (0, 116245)	0.0740966252538
  (0, 61552)	0.0740966252538
  (0, 101002)	0.0740966252538
  (0, 92643)	0.0740966252538
  (0, 129528)	0.0740966252538
  (0, 139425)	0.0740966252538
  (0, 95266)	0.0740966252538
  (0, 111960)	0.0740966252538
  (0, 160602)	0.0740966252538
  (0, 58006)	0.0740966252538
  (0, 49947)	0.0740966252538
  (0, 34782)	0.0740966252538
  (0, 61542)	0.0740966252538
  :	:
  (0, 31887)	0.0540975318289
  (0, 81850)	0.0345267295813
  (0, 180458)	0.015683628739
  (0, 139631)	0.0556818155888
  (0, 118936)	0.111363631178
  (0, 36598)	0.0998607275822
  (0, 39677)	0.105450325626
  (0, 15726)	0.0740966252538
  (0, 91991)	0.111363631

## LSA
Input: C, a matrix where m is the number of documents I have, and n is the number of terms.

Process: Decompose C into three matricies called U, S, and T. For the decomposition, pick a value k, that's how many concepts we are going to keep.

X≈USV (Transposed)

U will be a m x k matrix. The rows will be documents and the columns will be 'concepts'

S will be a k x k diagnal matrix. The elements will be the amount of variation captured from each concept.

V will be a n x k (mind the transpose) matrix. The rows will be terms and the columns will be concepts.

In [20]:
C.shape

(994, 184684)

In [21]:
lsa = TruncatedSVD(n_components=994, n_iter=100)
lsa.fit(C)

TruncatedSVD(algorithm='randomized', n_components=994, n_iter=100,
       random_state=None, tol=0.0)

In [22]:
#This is the first row for V
lsa.components_[0]

array([ 0.02160058,  0.00215205,  0.00050374, ...,  0.00119645,
        0.00119645,  0.00119645])

In [23]:
import sys
print (sys.version)

2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]


In [24]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print "Concept %d:" % i
    for term in sortedTerms:
        print term[0]
    print " "

Concept 0:
year
team
game
writes
article
baseball
games
players
one
good
 
Concept 1:
jewish
jewish baseball
jewish baseball players
baseball players
lowenstein
koufax
players
sandy koufax
pablo
able except
 
Concept 2:
lost
won
idle
san
berkeley
york
sox
scores
chicago
angels
 
Concept 3:
bonds
williams
batting
stanford
batting 4th
giants
punjabi
leland
leland stanford
4th
 
Concept 4:
gatech
prism
torre
prism gatech
gilkey
lankford
georgia
gant
hitter power
hitter power pinch
 
Concept 5:
gant
hirschbeck
duke
umpire
box
games
braves
game
eric
strike
 
Concept 6:
clutch
sabo
performance
average
samuel
non clutch
clutch situations
hitting
batting average
hit
 
Concept 7:
gant
morris
hirschbeck
duke
maynard
won
viola
cornell
clemens
laurentian
 
Concept 8:
clutch
gant
hall
hirschbeck
fame
hall fame
lost
won
sabo
future
 
Concept 9:
games
clutch
game
pitcher
baseball
length
colorado
speed
sabo
indiana
 
Concept 10:
indiana
journalism
journalism indiana
clutch
cornell
duke
gant
hirschbeck