<p style="color:blue">Import the necessary toolkits and libraries namely: Natural Language Toolkit (nltk), Regular Expressions (re), Sci Toolkit (sklearn)</p>

In [1]:
import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups

<p style="color:blue">Fetch data from newsgroup 'sci.space' </p>("I am interested in space & astronomy")

In [2]:
categories = ['sci.space']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

<p style="color:blue">Let us glance the raw data retrieved from the newsgroup</p>

In [3]:
corpus[0]

u'From: aws@iti.org (Allen W. Sherzer)\nSubject: Re: Orbital RepairStation\nOrganization: Evil Geniuses for a Better Tomorrow\nLines: 20\n\nIn article <C5HCBo.Joy@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:\n\n>The biggest problem with this is that all orbits are not alike.  It can\n>actually be more expensive to reach a satellite from another orbit than\n>from the ground.  \n\nBut with cheaper fuel from space based sources it will be cheaper to \nreach more orbits than from the ground.\n\nAlso remember, that the presence of a repair/supply facility adds value\nto the space around it. If you can put your satellite in an orbit where it\ncan be reached by a ready source of supply you can make it cheaper and gain\nbenefit from economies of scale.\n\n  Allen\n-- \n+---------------------------------------------------------------------------+\n| Lady Astor:   "Sir, if you were my husband I would poison your coffee!"   |\n| W. Churchill: "Madam, if you were my wife, I would

In [4]:
len(corpus)

987

<h4>Observation</h4>
<p style="color:brown">
There are 987 posts. Each typical post has the format: <br>
    "From: {EmailId} \nSubject: {Text} \nOrganization: {Text} \nLines: {# of lines} {Content}" <br>
There are lots of \n, Special characters, numbers and other text that needs to be cleaned. We will be interested in the {Content} portion to extract meaningful information
</p>

<p style="color:blue">
Convert content into lower case
</p>

In [5]:
corpus = [x.lower() for x in corpus]

<p style="color:blue">
Clean 'corpus' as follows:
<ul>
<li>Extract substring that follows 'lines'</li>
<li>Replace \n with a space</li>
<li>Replace " with a space</li>
<li>Replace special characters with a space</li>
<li>Replace numbers with a space</li>
<li>Replace multiple spaces with a single space</li>
</ul>
</p>

In [6]:
cleancorpus = []
for x in corpus:
    cleanline = x[x.find("lines: ")+7:]
    cleanline = cleanline.replace("\n"," ")
    cleanline = cleanline.replace("\\"," ")
    cleanline = cleanline.replace("\""," ")
    cleanline = re.sub( '[(,!)<+=_*&>#%|{:;`~?/}-]', ' ', cleanline )
    cleanline = re.sub( '[0-9]', ' ', cleanline )    
    cleanline = re.sub( '\s+', ' ', cleanline )    
    cleancorpus.append(cleanline)

<p style="color:blue">
Let us check the progress on the cleanup. We will be using 'cleancorpus' for our further analysis
</p>

In [7]:
cleancorpus[0]

u' in article c hcbo.joy@zoo.toronto.edu henry@zoo.toronto.edu henry spencer writes the biggest problem with this is that all orbits are not alike. it can actually be more expensive to reach a satellite from another orbit than from the ground. but with cheaper fuel from space based sources it will be cheaper to reach more orbits than from the ground. also remember that the presence of a repair supply facility adds value to the space around it. if you can put your satellite in an orbit where it can be reached by a ready source of supply you can make it cheaper and gain benefit from economies of scale. allen lady astor sir if you were my husband i would poison your coffee w. churchill madam if you were my wife i would drink it. days to first flight of dcx '

<h4>Observation</h4>
<p style="color:brown">
Email addresses need to be removed
</p>

<p style="color:blue">
We now proceed with extracting a list of emails from 'cleancorpus'. Credit for this regex and corresponding function to extract email goes to 
Dennis Ideler at https://gist.github.com/dideler/5219706
</p>

In [8]:
regex = re.compile(("([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    "{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

In [9]:
def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

<p style="color:blue">
Extract the emails from 'cleancorpus' into 'emails' 
</p>

In [10]:
emails = []

for x in cleancorpus:
    for email in get_emails(x):
       emails.append(email)

In [11]:
emails

[u'hcbo.joy@zoo.toronto.edu',
 u'henry@zoo.toronto.edu',
 u'aws@iti.org',
 u'cspara.decnet@fedex.msfc.nasa.gov',
 u'lek@aip.org',
 u'dennisn@ecs.comm.mot.com',
 u'maverick@wpi.wpi.edu',
 u's@andrew.cmu.edu',
 u'henry@zoo.toronto.edu',
 u'prb@access.digex.com',
 u'dragon@access.digex.com',
 u'aws@iti.org',
 u'nsmca@aurora.alaska.edu',
 u'rinnett@mojo.eng.umd.edu',
 u'sysmgr@king.eng.umd.edu',
 u'higgins@fnalf.fnal.gov',
 u'sysmgr@cadlab.eng.umd.edu',
 u'clarke@acme.ucf.edu',
 u'henry@zoo.toronto.edu',
 u'jfc@athena.mit.edu',
 u'henry@zoo.toronto.edu',
 u'w@theporch.raider.net',
 u'gene@theporch.raider.net',
 u'gene@theporch.raider.net',
 u'sysmgr@king.eng.umd.edu',
 u'ikc@zoo.toronto.edu',
 u'henry@zoo.toronto.edu',
 u'sysmgr@cadlab.eng.umd.edu',
 u'dkelo@pepvax.pepperdine.edu',
 u'prb@access.digex.net',
 u'prb@access.digex.net',
 u'snydefj@eng.auburn.edu',
 u'henry@zoo.toronto.edu',
 u'teezee@netcom.com',
 u'tffreeba@indyvax.iupui.edu',
 u'gnb@bby.com.au',
 u'steinly@topaz.ucsc.edu',
 

<p style="color:blue">
Download stopwords
</p>

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Arkantos\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
stopset = set(stopwords.words('english'))

<p style="color:blue">
Add emails to stopwords. Add a few more words that can be filtered out (This list will be updated later)
</p>

In [14]:
stopset.update(emails)
stopset.update(['edu', 'com', 'gov','gmt', 'nntp','net','mil','aa','fri', 'apr','henry','ll',
                'aan','de','deze',
                'sunday', 'monday','tuesday', 'wednesday','thursday', 'friday, saturday'])

<h4>TF-IDF Vectorizing</h4>
<p style="color:blue">
Converting each document in 'cleancorpus' into a sparse matrix 'X' of TFIDF Features
</p>

In [15]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(cleancorpus)

In [16]:
X[0]

<1x228715 sparse matrix of type '<type 'numpy.float64'>'
	with 191 stored elements in Compressed Sparse Row format>

<p style="color:blue">
Let us peek at the first sparse martix
</p>

In [17]:
print X[0]

  (0, 69378)	0.0575151178149
  (0, 44938)	0.0575151178149
  (0, 53743)	0.0588071478322
  (0, 224767)	0.0588071478322
  (0, 222147)	0.0588071478322
  (0, 111801)	0.0588071478322
  (0, 31115)	0.0588071478322
  (0, 32964)	0.0588071478322
  (0, 148169)	0.0588071478322
  (0, 225194)	0.0588071478322
  (0, 88180)	0.0588071478322
  (0, 180512)	0.0588071478322
  (0, 13831)	0.0588071478322
  (0, 101477)	0.0588071478322
  (0, 5509)	0.0595113814479
  (0, 172440)	0.0836667920996
  (0, 55885)	0.0836667920996
  (0, 19959)	0.0836667920996
  (0, 75132)	0.0836667920996
  (0, 30519)	0.0836667920996
  (0, 113002)	0.0836667920996
  (0, 195239)	0.0836667920996
  (0, 184414)	0.0836667920996
  (0, 160851)	0.0836667920996
  (0, 160426)	0.0836667920996
  :	:
  (0, 163879)	0.0403880305219
  (0, 6125)	0.0270794736434
  (0, 184436)	0.04749927339
  (0, 18169)	0.04096179144
  (0, 184852)	0.0416589244684
  (0, 74049)	0.0447533111241
  (0, 30494)	0.143280974277
  (0, 81253)	0.0875531533149
  (0, 136937)	0.062154783331

<h4>Latent Semantic Analysis</h4>
<p style="color:blue">
Decomposing X into three matrics U, S & V
</p>

In [18]:
X.shape

(987, 228715)

<p style="color:blue">
Dimensionality reduction using Single Value Decomposition (SVD)
<ul>
<li> Dimensionality of output: 100(recommended to be 100 by Sklearn Documentation for LSA)</li>
<li> Number of iterations: 50</li>
</ul>
</p>

In [19]:
lsa = TruncatedSVD(n_components=100, n_iter=50)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=50,
       random_state=None, tol=0.0)

<p style="color:red">
The above step will take some time to run. Wait until it completes
</p>

In [20]:
lsa.components_[0]

array([ 0.00040268,  0.00040268,  0.00040268, ...,  0.00075217,
        0.00075217,  0.00075217])

<p style="color:blue">
Extract features from the concepts. Modified the original code to store the features in a list 'features'
</p>

In [21]:
features = []
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print "Concept %d:" % i
    for term in sortedTerms:
        print term[0]
        features.append(term[0])
    print " "

Concept 0:
space
would
nasa
writes
like
article
one
shuttle
alaska
orbit
 
Concept 1:
elements
afit
afit af
updated daily
af
celestial bbs
software also available
software also
daily
current
 
Concept 2:
jpl
baalke
kelvin jpl
kelvin jpl nasa
jpl nasa
kelvin
command
nasa
comet
spacecraft
 
Concept 3:
alaska
gene
nasa
raider
acad
acad alaska
theporch
theporch raider
jpl nasa
jpl
 
Concept 4:
space
nasa
earth
launch
research
satellite
part
center
venus
software
 
Concept 5:
space
shuttle
mission
hst
information
boost
going
station
years
data
 
Concept 6:
see
might
station
fuel
end
sky
different
us
power
uiuc
 
Concept 7:
nasa
station
dc
april
going
posting
umd
space
high
software
 
Concept 8:
alaska
high
system
zoo
zoo toronto
mars
spencer
earth
like
toronto
 
Concept 9:
overhead
allen
reston
isn
time
back
atmosphere
wrap
much
host
 
Concept 10:
space
us
access digex
digex
power
many
anyone
shuttle
station
sky
 
Concept 11:
nasa
overhead
alaska
university
program
systems
distribution
writ

In [22]:
features

[u'space',
 u'would',
 u'nasa',
 u'writes',
 u'like',
 u'article',
 u'one',
 u'shuttle',
 u'alaska',
 u'orbit',
 u'elements',
 u'afit',
 u'afit af',
 u'updated daily',
 u'af',
 u'celestial bbs',
 u'software also available',
 u'software also',
 u'daily',
 u'current',
 u'jpl',
 u'baalke',
 u'kelvin jpl',
 u'kelvin jpl nasa',
 u'jpl nasa',
 u'kelvin',
 u'command',
 u'nasa',
 u'comet',
 u'spacecraft',
 u'alaska',
 u'gene',
 u'nasa',
 u'raider',
 u'acad',
 u'acad alaska',
 u'theporch',
 u'theporch raider',
 u'jpl nasa',
 u'jpl',
 u'space',
 u'nasa',
 u'earth',
 u'launch',
 u'research',
 u'satellite',
 u'part',
 u'center',
 u'venus',
 u'software',
 u'space',
 u'shuttle',
 u'mission',
 u'hst',
 u'information',
 u'boost',
 u'going',
 u'station',
 u'years',
 u'data',
 u'see',
 u'might',
 u'station',
 u'fuel',
 u'end',
 u'sky',
 u'different',
 u'us',
 u'power',
 u'uiuc',
 u'nasa',
 u'station',
 u'dc',
 u'april',
 u'going',
 u'posting',
 u'umd',
 u'space',
 u'high',
 u'software',
 u'alaska',
 u'h

<p style="color:blue">
Convert into set to get unique features
</p>

In [23]:
uniquefeatures = set(features)

In [24]:
uniquefeatures

{u'aacs attitude',
 u'aams closed',
 u'aams closed bays',
 u'aangegeven',
 u'aangegeven men',
 u'aangegeven men lid',
 u'aantal',
 u'aantal voordrachten',
 u'aantal voordrachten met',
 u'able',
 u'acad',
 u'acad alaska',
 u'access',
 u'access digex',
 u'actually',
 u'af',
 u'afit',
 u'afit af',
 u'alaska',
 u'allen',
 u'also',
 u'anyone',
 u'april',
 u'around',
 u'article',
 u'atmosphere',
 u'au',
 u'aurora alaska',
 u'baalke',
 u'back',
 u'better',
 u'big',
 u'billion',
 u'boost',
 u'ca',
 u'celestial bbs',
 u'center',
 u'comet',
 u'command',
 u'commercial',
 u'cost',
 u'costs',
 u'could',
 u'current',
 u'daily',
 u'data',
 u'day',
 u'days',
 u'dc',
 u'dennis',
 u'design',
 u'different',
 u'digex',
 u'digex pat writes',
 u'distribution',
 u'done',
 u'earth',
 u'elements',
 u'end',
 u'eng',
 u'enough',
 u'etc',
 u'ether',
 u'even',
 u'every',
 u'far',
 u'find',
 u'first',
 u'flight',
 u'fred',
 u'ftp',
 u'fuel',
 u'gene',
 u'get',
 u'give',
 u'given',
 u'go',
 u'going',
 u'good',
 u'go

In [25]:
wordDictFeatures = dict.fromkeys(uniquefeatures, 0)

In [26]:
for word in features:
    wordDictFeatures[word]+=1

<p style="color:blue">
Additional analysis using Dictionary
</p>

In [27]:
wordDictFeatures

{u'aacs attitude': 7,
 u'aams closed': 3,
 u'aams closed bays': 2,
 u'aangegeven': 1,
 u'aangegeven men': 1,
 u'aangegeven men lid': 7,
 u'aantal': 1,
 u'aantal voordrachten': 4,
 u'aantal voordrachten met': 1,
 u'able': 1,
 u'acad': 1,
 u'acad alaska': 1,
 u'access': 6,
 u'access digex': 5,
 u'actually': 2,
 u'af': 1,
 u'afit': 1,
 u'afit af': 1,
 u'alaska': 14,
 u'allen': 1,
 u'also': 15,
 u'anyone': 5,
 u'april': 3,
 u'around': 4,
 u'article': 8,
 u'atmosphere': 2,
 u'au': 2,
 u'aurora alaska': 2,
 u'baalke': 2,
 u'back': 4,
 u'better': 2,
 u'big': 3,
 u'billion': 2,
 u'boost': 2,
 u'ca': 5,
 u'celestial bbs': 1,
 u'center': 1,
 u'comet': 1,
 u'command': 1,
 u'commercial': 3,
 u'cost': 8,
 u'costs': 1,
 u'could': 10,
 u'current': 2,
 u'daily': 1,
 u'data': 8,
 u'day': 1,
 u'days': 2,
 u'dc': 2,
 u'dennis': 1,
 u'design': 1,
 u'different': 2,
 u'digex': 4,
 u'digex pat writes': 1,
 u'distribution': 7,
 u'done': 2,
 u'earth': 16,
 u'elements': 1,
 u'end': 1,
 u'eng': 1,
 u'enough': 2,

<h4>Observation</h4>
<p style="color:brown">
Some really interesting words/phrases show up and they are all related to space and astronomy including shuttle, satellite, spacecraft, venus, jpl, launch, mission, moon, etc. 
</p>

<h4>Inference</h4>
<p style="color:blue">
Using LSA we have been able to identify that the conversation is primarily about space and astronomy. 
we have derived features from unstructured data 
</p>