<center>
<img src="http://www.bigdive.eu/wp-content/uploads/2012/05/logoBIGDIVE-01.png">
</center>

---

# Text Analysis

## André Panisson

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Loading the dataset
### The 20 Newsgroups data set

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian).

In [2]:
from sklearn import datasets
dataset = datasets.fetch_20newsgroups()
documents = dataset.data

In [3]:
print documents[0]

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [5]:
len(documents)

11314

### Data cleaning

Remove pieces of text that do not convey any semantic meaning (e.g., mail headers, email addresses, host names).

In [9]:
import re
from_re = re.compile(r"^From: .*\n")

print from_re.sub('', documents[0])

Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [18]:
for i in range(len(documents)):
    documents[i] = from_re.sub('', documents[i])
print documents[0]

Subject: WHAT car is this!?Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### Exercise: create a regular expression to remove the Nntp-Posting-Host header

In [19]:
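# Note: this pattern also consumes the newline before the header, so the previous
# line is joined to the next one, and it is case-sensitive ("NNTP-Posting-Host" is not matched)
nntp_Posting_Host_regEx = re.compile(r"\nNntp-Posting-Host: .*\n")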

print nntp_Posting_Host_regEx.sub('', documents[0])

Subject: WHAT car is this!?Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [24]:
for i in range(len(documents)):
    documents[i] = nntp_Posting_Host_regEx.sub('', documents[i])
print documents[1]

Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>
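
The cleaning step can be made less sensitive to capitalization and line boundaries. The following is a minimal sketch, not the notebook's solution: it assumes that matching at line starts (`re.MULTILINE`) and ignoring case (`re.IGNORECASE`) are acceptable, and the names `HEADER_RE`, `strip_headers` and `sample` are illustrative.

```python
import re

# Anchor at line starts and ignore case, so that "Nntp-Posting-Host:" and
# "NNTP-Posting-Host:" are both matched and the newline that ends the
# previous line is left untouched.
HEADER_RE = re.compile(r"^(?:From|Nntp-Posting-Host): .*\n",
                       re.MULTILINE | re.IGNORECASE)

def strip_headers(doc):
    """Remove the From and Nntp-Posting-Host header lines from one message."""
    return HEADER_RE.sub("", doc)

# Hypothetical message assembled for illustration only.
sample = ("From: someone@example.com (hypothetical sender)\n"
          "Subject: SI Clock Poll - Final Call\n"
          "NNTP-Posting-Host: carson.u.washington.edu\n"
          "Organization: University of Washington\n"
          "\n"
          "A fair number of brave souls ...\n")

cleaned = strip_headers(sample)
print(cleaned)
```

One caveat of this sketch: because the pattern is applied to the whole message, a body line that happens to start with "From: " would be removed as well.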



# From Text Messages to Feature Vectors
We need to transform our text data into feature vectors: numerical representations suitable for statistical analysis. The most common way to do this is the bag-of-words approach, where the number of occurrences of each word in a document becomes a feature for our classifier.

*Bag of words builds a matrix with the number of occurrences of each word. Rows are documents and columns are words. We will use a sparse matrix.*
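
To make the bag-of-words idea concrete, here is a small sketch on a toy corpus. It uses only the standard library and a naive whitespace tokenizer, which is an assumption for illustration; the names `docs`, `vocabulary` and `matrix` are hypothetical and the result is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

docs = ["the car is red",
        "the red car is a sports car",
        "water and air"]

# Naive whitespace tokenization (an assumption made for this sketch).
tokenized = [d.split() for d in docs]

# The vocabulary maps each distinct word to a column index.
all_words = sorted(set(w for tokens in tokenized for w in tokens))
vocabulary = {w: j for j, w in enumerate(all_words)}

# One row per document, one column per word; most entries stay at zero.
matrix = []
for tokens in tokenized:
    row = [0] * len(vocabulary)
    for word, n in Counter(tokens).items():
        row[vocabulary[word]] = n
    matrix.append(row)

print(vocabulary)
print(matrix)
```

Even in this tiny example most cells are zero, which is why a sparse representation pays off on a real corpus.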




## Term Frequency-Inverse Document Frequency

We want to consider the relative importance of particular words, so we'll use term frequency–inverse document frequency as a weighting factor. This down-weights words that appear in many documents and therefore say little about any single one of them.

## Mathematical details

tf–idf is the product of two statistics, term frequency and inverse document
frequency. Various ways for determining the exact values of both statistics
exist. In the case of the **term frequency** tf(*t*, *d*), the simplest
choice is to use the *raw frequency* of a term in a document, i.e. the
number of times that term *t* occurs in document *d*. If we denote the raw
frequency of *t* by f(*t*, *d*), then the simple tf scheme is
tf(*t*, *d*) = f(*t*, *d*). Other possibilities
include:

  * Boolean "frequencies": tf(*t*, *d*) = 1 if *t* occurs in *d* and 0 otherwise; 
  * logarithmically scaled frequency: tf(*t*, *d*) = log (f(*t*, *d*) + 1); 
  * augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the maximum raw frequency of any term in the document: $\mathrm{tf}(t,d) = 0.5 + \frac{0.5 \times \mathrm{f}(t, d)}{\max\{\mathrm{f}(w, d):w \in d\}}$

The **inverse document frequency** is a measure of whether the term is
common or rare across all documents. It is obtained by dividing the total
number of documents by the number of documents containing the
term, and then taking the logarithm of that quotient.

$$ \mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D: t \in d\}|} $$

with

  * $ |D| $: cardinality of D, or the total number of documents in the corpus 
  * $ |\{d \in D: t \in d\}| $: number of documents in which the term $ t $ appears (i.e., $ \mathrm{tf}(t,d) \neq 0$). If the term is not in the corpus, this leads to a division by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D: t \in d\}|$. 

Mathematically the base of the log function does not matter and constitutes a
constant multiplicative factor towards the overall result.

Then tf–idf is calculated as

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t, D)$$
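
As a worked example of the formulas above, the sketch below computes raw-frequency tf and unsmoothed idf by hand on a toy corpus. The corpus and the function names are made up for illustration, and the idf assumes the term occurs in at least one document. Note that scikit-learn's `TfidfTransformer` by default applies a smoothed idf and L2 row normalization, so its numbers will not match this plain version.

```python
import math

# Toy corpus: each document is already tokenized.
docs = [["the", "environment", "the", "air"],
        ["the", "congress"],
        ["the", "senate", "the", "vote"]]

def tf(term, doc):
    # Raw frequency f(t, d): number of times the term occurs in the document.
    return doc.count(term)

def idf(term, corpus):
    # log(|D| / |{d in D : t in d}|), without smoothing.
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / float(df))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

w_the = tfidf("the", docs[0], docs)          # "the" occurs in every document
w_env = tfidf("environment", docs[0], docs)  # "environment" occurs in only one

print(w_the, w_env)
```

Although "the" occurs twice in the first document, its weight is zero because it appears in every document, while "environment" receives a weight of log 3.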

## Intuition

Imagine a document that contains the phrase "The Environment". Across all the other documents, "the" appears many times, whereas "environment" appears far less often.
The point is that what identifies the document is "environment".
To handle this, the term frequency is multiplied by the inverse document frequency, so that the words that matter most for characterizing the document stand out (that is, "environment" emerges much more strongly than "the").

This is the formula:
$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t, D)$$

In [29]:
from sklearn.feature_extraction import text

vectorizer = text.CountVectorizer(max_df=0.8, max_features=10000, stop_words=text.ENGLISH_STOP_WORDS)
counts = vectorizer.fit_transform(documents) # turn the dataset into a sparse matrix of counts
tfidf = text.TfidfTransformer().fit_transform(counts) 

# CountVectorizer turns the documents into the term-frequency matrix. Its parameters configure the vectorization:
# for example, min_df=2 drops words that appear in fewer than two documents, and max_df=0.8 excludes words
# that appear in more than 80% of the documents

In [30]:
counts # n documents by d words

<11314x10000 sparse matrix of type '<type 'numpy.int64'>'
	with 951246 stored elements in Compressed Sparse Row format>

In [32]:
# example: the nonzero counts of the first document
counts[0].data

array([5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [33]:
tfidf # the same matrix, reweighted and normalized

<11314x10000 sparse matrix of type '<type 'numpy.float64'>'
	with 951246 stored elements in Compressed Sparse Row format>

In [35]:
tfidf[0].data # in practice, this is the matrix that provides the "corrected" weights

array([ 0.19298338,  0.13719973,  0.14524295,  0.08086716,  0.08819209,
        0.10133524,  0.10743744,  0.12236812,  0.15880473,  0.08891808,
        0.15906611,  0.142492  ,  0.12972588,  0.06279798,  0.12784899,
        0.11991732,  0.14476534,  0.19479681,  0.13985368,  0.11038861,
        0.08311487,  0.17019761,  0.10133524,  0.20351164,  0.12397835,
        0.13522098,  0.13254293,  0.15854576,  0.14689626,  0.097357  ,
        0.12412942,  0.19673539,  0.130286  ,  0.09300894,  0.14556592,
        0.1145887 ,  0.16041078,  0.05444059,  0.55957772])

In [36]:
vectorizer

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.8, max_features=10000, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', '...'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once']),
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [37]:
# the vectorizer exposes the vocabulary of the words it found as an attribute
vectorizer.vocabulary_

{u'woods': 9799,
 u'hanging': 4316,
 u'broward': 1733,
 u'bringing': 1717,
 u'wednesday': 9691,
 u'shows': 8192,
 u'cfa': 2010,
 u'frederick': 3943,
 u'270': 305,
 u'272': 306,
 u'273': 307,
 u'274': 308,
 u'275': 309,
 u'278': 310,
 u'targa': 8817,
 u'errors': 3480,
 u'usenet': 9377,
 u'designing': 2948,
 u'kids': 5160,
 u'controversy': 2510,
 u'dna': 3124,
 u'inevitable': 4736,
 u'benedikt': 1497,
 u'intake': 4831,
 u'morally': 6000,
 u'wang': 9632,
 u'want': 9634,
 u'beyer': 1516,
 u'travel': 9119,
 u'wrong': 9844,
 u'fit': 3813,
 u'wiretapping': 9764,
 u'fix': 3816,
 u'fij': 3775,
 u'effects': 3326,
 u'sheep': 8153,
 u'6ql': 565,
 u'estimate': 3504,
 u'sys': 8773,
 u'needed': 6170,
 u'master': 5716,
 u'genesis': 4070,
 u'0d': 20,
 u'ahmet': 903,
 u'feeling': 3742,
 u'affairs': 868,
 u'vga': 9506,
 u'sy_': 8757,
 u'tech': 8846,
 u'saying': 7925,
 u'chelios': 2063,
 u'plate': 6857,
 u'altogether': 977,
 u'nicely': 6225,
 u'patch': 6667,
 u'widget': 9727,
 u'news': 6207,
 u'lots': 553

In [39]:
vectorizer.get_feature_names()

[u'00',
 u'000',
 u'005',
 u'01',
 u'02',
 u'02238',
 u'02p',
 u'03',
 u'030',
 u'0358',
 u'04',
 u'040',
 u'0400',
 u'05',
 u'06',
 u'07',
 u'08',
 u'09',
 u'0b',
 u'0c',
 u'0d',
 u'0el',
 u'0em',
 u'0g',
 u'0i',
 u'0l',
 u'0m',
 u'0p',
 u'0q',
 u'0qax',
 u'0t',
 u'0tbxn',
 u'0tq',
 u'0u',
 u'0w',
 u'10',
 u'100',
 u'1000',
 u'101',
 u'102',
 u'1024',
 u'1024x768',
 u'102nd',
 u'103',
 u'104',
 u'105',
 u'106',
 u'107',
 u'108',
 u'109',
 u'10k',
 u'10th',
 u'11',
 u'110',
 u'1100',
 u'111',
 u'112',
 u'113',
 u'114',
 u'115',
 u'116',
 u'117',
 u'118',
 u'119',
 u'12',
 u'120',
 u'1200',
 u'121',
 u'122',
 u'123',
 u'124',
 u'125',
 u'126',
 u'127',
 u'128',
 u'1280x1024',
 u'129',
 u'13',
 u'130',
 u'131',
 u'132',
 u'133',
 u'134',
 u'135',
 u'136',
 u'137',
 u'138',
 u'139',
 u'13p',
 u'13q',
 u'13qs',
 u'13s',
 u'14',
 u'140',
 u'1400',
 u'141',
 u'142',
 u'143',
 u'144',
 u'145',
 u'146',
 u'147',
 u'148',
 u'149',
 u'15',
 u'150',
 u'1500',
 u'151',
 u'152',
 u'1542',
 u'155',


In [43]:
print(counts[0].nonzero()[1])
array(vectorizer.get_feature_names())[counts[0].nonzero()[1]]

[1897 9317 5705 2246 6628  104 9795 3434 7923 2793 3160 8447 5515 5280 3270
  580 1849 3161 7419 8300  822 1774 8077 7637 1617 5198 5957 3422 8416 9937
 7104 4448 4759 5516 5615 8927 4634 1732 6182]


array([u'car', u'university', u'maryland', u'college', u'park', u'15',
       u'wondering', u'enlighten', u'saw', u'day', u'door', u'sports',
       u'looked', u'late', u'early', u'70s', u'called', u'doors',
       u'really', u'small', u'addition', u'bumper', u'separate', u'rest',
       u'body', u'know', u'model', u'engine', u'specs', u'years',
       u'production', u'history', u'info', u'looking', u'mail', u'thanks',
       u'il', u'brought', u'neighborhood'], 
      dtype='<U80')

In [44]:
counts[0].data

array([5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

## One of the problems with bag of words is that it ignores word order

In [45]:
# CountVectorizer can also capture multi-word terms through ngram_range. In the following example it also
# extracts pairs of consecutive words (bigrams)
vectorizer = text.CountVectorizer(max_df=0.8, max_features=10000, stop_words=text.ENGLISH_STOP_WORDS, ngram_range=(1,2))
counts = vectorizer.fit_transform(documents) # turn the dataset into a sparse matrix of counts
tfidf = text.TfidfTransformer().fit_transform(counts) 


In [46]:
array(vectorizer.get_feature_names())[counts[0].nonzero()[1]]

array([u'car', u'university', u'maryland', u'college', u'park', u'15',
       u'wondering', u'saw', u'day', u'door', u'sports', u'looked',
       u'late', u'early', u'called', u'doors', u'really', u'small',
       u'addition', u'bumper', u'separate', u'rest', u'body', u'know',
       u'model', u'engine', u'specs', u'years', u'production', u'history',
       u'info', u'looking', u'mail', u'thanks', u'il', u'brought',
       u'neighborhood', u'organization university', u'university maryland',
       u'maryland college', u'college park', u'lines 15', u'mail thanks'], 
      dtype='<U80')
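
The effect of `ngram_range=(1, 2)` can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not scikit-learn's implementation; `word_ngrams` is a hypothetical helper and the token list is taken from the first document's header.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams with n in the inclusive range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens over the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["university", "maryland", "college", "park"]
grams = word_ngrams(tokens)
print(grams)
```

The bigrams are built from the tokens that remain after stop-word removal, which is consistent with features such as "university maryland" in the output above even though the original text reads "University of Maryland".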

In [27]:
# List of the stop words

from sklearn.feature_extraction import text
text.ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In this case the singular and plural forms of the same word are counted as different words (for example: sports and sport).

### Natural Language Processing with NLTK

NLTK (Natural Language ToolKit) is a library for symbolic and statistical natural language processing (NLP).

It supports a few functionalities for NLP, such as:

- Lexical analysis: Word and text tokenizer
- n-gram and collocations
- Part-of-speech tagger
- Tree model and Text chunker for capturing
- Named-entity recognition

These tools transform words by normalizing them.

Two approaches to normalization:
- Stemming (takes the word and strips its ending)
- Lemmatization (maps the word to its dictionary form)

In [47]:
from nltk.corpus import wordnet as wn
from nltk import stem
import re

pattern = re.compile('(?u)\\b[A-Za-z]{3,}')

stemmer = stem.SnowballStemmer('english')
def stemming(doc):
    l = [stemmer.stem(t) for t in pattern.findall(doc)]
    return [w for w in l if len(w) > 2]

wnl = stem.WordNetLemmatizer()
def lemmatize(doc):
    
    def lemma(w):
        # Try each part of speech in turn until the lemmatizer changes the word
        l = wnl.lemmatize(w, wn.NOUN)
        if l == w:
            l = wnl.lemmatize(w, wn.ADJ)
        if l == w:
            l = wnl.lemmatize(w, wn.ADV)
        if l == w:
            l = wnl.lemmatize(w, wn.VERB)
        return l
    
    l = [lemma(t) for t in pattern.findall(doc)]
    return [w for w in l if len(w) > 2]


In [52]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/bigdive/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [53]:
sentence = """For grammatical reasons, documents are going to use different forms of a word,
such as organize, organizes, and organizing.
Additionally, there are families of derivationally related words with similar meanings,
such as democracy, democratic, and democratization."""

print sentence
print
print stemming(sentence) # stemming reduces each word to its root: organize, organizes and organizing all become organ
print
print lemmatize(sentence) # lemmatizing reduces each word to its lemma: organize, organizes and organizing all become organize

For grammatical reasons, documents are going to use different forms of a word,
such as organize, organizes, and organizing.
Additionally, there are families of derivationally related words with similar meanings,
such as democracy, democratic, and democratization.

[u'for', u'grammat', u'reason', u'document', u'are', u'use', u'differ', u'form', u'word', u'such', u'organ', u'organ', u'and', u'organ', u'addit', u'there', u'are', u'famili', u'deriv', u'relat', u'word', u'with', u'similar', u'mean', u'such', u'democraci', u'democrat', u'and', u'democrat']

['For', 'grammatical', u'reason', u'document', 'use', 'different', u'form', 'word', 'such', 'organize', u'organize', 'and', u'organize', 'Additionally', 'there', u'family', 'derivationally', u'relate', u'word', 'with', 'similar', u'meaning', 'such', 'democracy', 'democratic', 'and', 'democratization']


In [54]:
vectorizer = text.CountVectorizer(max_df=0.95, max_features=10000, stop_words='english',
                                  encoding='latin1', tokenizer=lemmatize, ngram_range=(1, 2))
counts = vectorizer.fit_transform(dataset.data)
tfidf = text.TfidfTransformer().fit_transform(counts)

In [55]:
vectorizer.vocabulary_

{u'say say': 7639,
 u'space university': 8149,
 u'inevitable': 4233,
 u'benedikt': 811,
 u'fit': 3279,
 u'fix': 3280,
 u'fij': 3241,
 u'fin': 3259,
 u'article magnus': 493,
 u'syx': 8602,
 u'turkish plane': 9057,
 u'dna': 2492,
 u'grind wire': 3735,
 u'chain': 1348,
 u'exact': 3039,
 u'science carnegie': 7681,
 u'time use': 8873,
 u'cunyvm': 2068,
 u'advertisement': 138,
 u'ufl edu': 9101,
 u'norm': 5934,
 u'ico tek': 4104,
 u'hasn': 3849,
 u'hash': 3848,
 u'uprise': 9283,
 u'automate': 636,
 u'maryland': 5322,
 u'learn': 4899,
 u'remember read': 7251,
 u'treatment': 8986,
 u'russian army': 7542,
 u'behave': 785,
 u'influence': 4256,
 u'kansa': 4664,
 u'victor': 9478,
 u'line news': 5022,
 u'chicogo': 1406,
 u'bitnet write': 891,
 u'talk politics': 8612,
 u'organization cornell': 6175,
 u'integrity': 4350,
 u'michigan state': 5484,
 u'production': 6848,
 u'host access': 4024,
 u'illustrate': 4152,
 u'zeos': 9980,
 u'cereal': 1339,
 u'cache card': 1144,
 u'excite': 3053,
 u'fuse': 3449,

In [56]:
array(vectorizer.get_feature_names())[counts[0].nonzero()[1]]

array([u'car', u'organization', u'university', u'maryland', u'college',
       u'park', u'wonder', u'enlighten', u'saw', u'day', u'door', u'sport',
       u'look', u'late', u'early', u'really', u'small', u'addition',
       u'bumper', u'separate', u'rest', u'body', u'know', u'model',
       u'engine', u'spec', u'year', u'production', u'make', u'history',
       u'info', u'mail', u'thank', u'bring', u'neighborhood',
       u'subject car', u'organization university', u'university maryland',
       u'maryland college', u'college park', u'park line', u'line wonder',
       u'sport car', u'mail thank'], 
      dtype='<U26')

Here, the vectorizer and the TfidfTransformer together perform three important steps, and the variable tfidf holds the result:

- First, the vectorizer builds a dictionary of features where keys are terms and values are the indices of those terms in the feature matrix (that's the fit part of fit_transform).
- Second, it transforms our documents into numerical feature vectors according to the frequency of the words appearing in each message. Since any one message uses only a small part of the vocabulary, each feature vector is made up mostly of zeros, each of which indicates that a given word appeared zero times in that message.
- Lastly, the TfidfTransformer computes the tf-idf weights for our term frequency matrix.

In [57]:
# to do classification, at this point we would need to create a classifier,
# for example a multinomial naive Bayes
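
The notebook does not build the classifier, so the following is only a sketch of what a multinomial naive Bayes does, written in plain Python on a made-up toy corpus. The functions `train_nb` and `predict_nb`, the smoothing value `alpha=1.0` and the training data are all assumptions for illustration. On the real data one would more likely fit scikit-learn's `MultinomialNB` on `tfidf` with `dataset.target` as labels.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods."""
    vocab = sorted(set(w for d in docs for w in d))
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {}
    for y in class_counts:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[y] / float(len(docs)))
        log_lik = {w: math.log((word_counts[y][w] + alpha) / total)
                   for w in vocab}
        model[y] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the label with the highest posterior log-probability."""
    scores = {}
    for y, (log_prior, log_lik) in model.items():
        # Words never seen during training are skipped.
        scores[y] = log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(scores, key=scores.get)

train_docs = [["air", "water", "pollution"],
              ["pollution", "water", "pollution"],
              ["democrat", "republican"],
              ["republican", "democrat", "democrat"]]
train_labels = ["ENVIRONMENT", "ENVIRONMENT", "CONGRESS", "CONGRESS"]

model = train_nb(train_docs, train_labels)
pred_env = predict_nb(model, ["water", "pollution"])
pred_con = predict_nb(model, ["democrat", "vote"])
print(pred_env, pred_con)
```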

## Nonnegative Matrix Factorization for Topic extraction



*This is an unsupervised approach for clustering.*

Imagine having 5 documents: 2 about the environment, 2 about the U.S. Congress, and 1 about both, that is, about the government's legislative process for protecting the environment. We need to write a program that identifies the category of each document and also returns the degree to which each document belongs to each category. For this elementary example we limit our vocabulary to 5 words: AIR, WATER, POLLUTION, DEMOCRAT, REPUBLICAN. Category ENVIRONMENT and category CONGRESS may each contain all 5 words, but with different probabilities. The word POLLUTION is more likely to appear in an article about ENVIRONMENT than in one about CONGRESS, but it can in principle appear in both. Suppose that, after examining our data, we built the following document-term table:

<table border="" cellpadding="3" style="font-family:'Times New Roman'">
<tbody>
<tr>
<td>document/word</td>
<td>air</td>
<td>water</td>
<td>pollution</td>
<td>democrat</td>
<td>republican</td>
</tr>
<tr>
<td>doc 1</td>
<td>3</td>
<td>2</td>
<td>8</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>doc 2</td>
<td>1</td>
<td>4</td>
<td>12</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>doc 3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>10</td>
<td>11</td>
</tr>
<tr>
<td>doc 4</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>8</td>
<td>5</td>
</tr>
<tr>
<td>doc 5</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

We distinguish our categories by the groups of words assigned to them. We decide that category ENVIRONMENT should normally contain only the words AIR, WATER and POLLUTION, and category CONGRESS only the words DEMOCRAT and REPUBLICAN. We build another matrix in which each row represents a category and contains the counts, summed over all documents, of only the words assigned to that category. 

<table border="" cellpadding="3" style="font-family:'Times New Roman'">
<tbody>
<tr>
<td>categories</td>
<td>air</td>
<td>water</td>
<td>pollution</td>
<td>democrat</td>
<td>republican</td>
</tr>
<tr>
<td>ENVIRONMENT</td>
<td>5</td>
<td>7</td>
<td>21</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>CONGRESS</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>19</td>
<td>17</td>
</tr>
</tbody>
</table>

We convert frequencies to probabilities by dividing each value by the sum of its row, which turns each row into a probability distribution.

<table border="" cellpadding="3" style="font-family:'Times New Roman'">
<caption>Matrix&nbsp;<strong>H</strong></caption>
<tbody>
<tr>
<td>categories</td>
<td>air</td>
<td>water</td>
<td>pollution</td>
<td>democrat</td>
<td>republican</td>
</tr>
<tr>
<td>ENVIRONMENT</td>
<td>0.15</td>
<td>0.21</td>
<td>0.64</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>CONGRESS</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0.53</td>
<td>0.47</td>
</tr>
</tbody>
</table>

Now we create another matrix that contains, for each document, a probability distribution over the categories:

<table border="" cellpadding="3" style="font-family:'Times New Roman'">
<caption>Matrix&nbsp;<strong>W</strong></caption>
<tbody>
<tr>
<td>documents</td>
<td>ENVIRONMENT</td>
<td>CONGRESS</td>
</tr>
<tr>
<td>doc 1</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<td>doc 2</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<td>doc 3</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<td>doc 4</td>
<td>0.0</td>
<td>1.0</td>
</tr>
<tr>
<td>doc 5</td>
<td>0.6</td>
<td>0.4</td>
</tr>
</tbody>
</table>

It shows that the first two documents are about the environment, the next two about Congress, and the last one about both. The ratios 0.6 and 0.4 for the last document come from its 3 words in the environment category and 2 words in the congress category. Now we multiply the two matrices and compare the result with the original data in normalized form. Normalization here means dividing each row by the sum of its elements. The comparison is shown side by side below:

<table cellpadding="10" style="font-family:'Times New Roman'">
<tbody>
<tr>
<td>
<table border="" cellpadding="3">
<caption>Product of&nbsp;<strong>W * H</strong></caption>
<tbody>
<tr>
<td>0.15</td>
<td>0.21</td>
<td>0.64</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>0.15</td>
<td>0.21</td>
<td>0.64</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.53</td>
<td>0.47</td>
</tr>
<tr>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.53</td>
<td>0.47</td>
</tr>
<tr>
<td>0.09</td>
<td>0.13</td>
<td>0.38</td>
<td>0.21</td>
<td>0.19</td>
</tr>
</tbody>
</table>
</td>
<td>
<table border="" cellpadding="3">
<caption>Normalized data&nbsp;<strong>N</strong></caption>
<tbody>
<tr>
<td>0.23</td>
<td>0.15</td>
<td>0.62</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>0.06</td>
<td>0.24</td>
<td>0.70</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.48</td>
<td>0.52</td>
</tr>
<tr>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.61</td>
<td>0.39</td>
</tr>
<tr>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>

The correspondence is clear. The problem is to find constrained (nonnegative) matrices W and H, given the number of categories, whose product best matches the normalized data N. Once such an approximation is found, the rows of H describe the sought categories.

**Formally**, we are trying to minimize this:

$$ \|\mathbf{N} - \mathbf{WH}\|^2_F $$
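
The objective can be evaluated directly on the toy example. The sketch below first computes the squared Frobenius error of the hand-built W and H from the tables, then runs a few multiplicative updates in the style of Lee and Seung from a random start. This is a plain-Python illustration under stated assumptions (rank 2, 200 iterations, a fixed seed, hypothetical helper names) and is not necessarily the solver scikit-learn uses.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_sq(N, W, H):
    """Squared Frobenius norm of N - WH."""
    WH = matmul(W, H)
    return sum((n - p) ** 2 for rn, rp in zip(N, WH) for n, p in zip(rn, rp))

# Normalized data N and the hand-built W and H, copied from the tables above.
N = [[0.23, 0.15, 0.62, 0.0, 0.0],
     [0.06, 0.24, 0.70, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.48, 0.52],
     [0.0, 0.0, 0.0, 0.61, 0.39],
     [0.2, 0.2, 0.2, 0.2, 0.2]]
W_hand = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.6, 0.4]]
H_hand = [[0.15, 0.21, 0.64, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.53, 0.47]]
err_hand = frob_sq(N, W_hand, H_hand)

def nmf(N, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for min ||N - WH||_F^2 with W, H >= 0."""
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(k)] for _ in N]
    H = [[rng.random() for _ in N[0]] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T N) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, N), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (N H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(N, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

W_fit, H_fit = nmf(N, k=2)
err_fit = frob_sq(N, W_fit, H_fit)
print(err_hand, err_fit)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout. The rows of the fitted W and H are not normalized to sum to one, so they are comparable to the hand-built tables only up to scaling.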

In [58]:
from sklearn import decomposition # we use the NMF implementation in sklearn

# Fit the NMF model
nmf = decomposition.NMF(n_components=6) # 6 is the number of topics
nmf.fit(tfidf)
W = nmf.transform(tfidf) # the topic weights for each document
H = nmf.components_ # the matrix of categories (topics by terms)

In [60]:
W.shape # about 11,000 documents, each with six topic weights

(11314, 6)

In [61]:
H.shape # the weight of each term in each topic

(6, 10000)

In [62]:
H[0] # vector of weights, one per term, for the first topic

array([ 0.000398  ,  0.00831057,  0.00171016, ...,  0.        ,
        0.        ,  0.        ])

In [64]:
H[0].argsort() # gives the indices sorted by ascending weight

array([4999, 5466, 5465, ..., 5913, 4023, 2721])

In [65]:
H[0].argsort()[:-21:-1]

array([2721, 4023, 5913, 5914, 6694, 6689, 5024, 2090, 2091, 5009, 6162,
       2461, 1503, 9189, 9293, 2470, 9295, 9670, 9797, 7310])
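
The slice `[:-21:-1]` is easy to mistype, so here is a small sketch of what it does, using a plain list as a stand-in for the array returned by `argsort`. The weights are made up, and the slicing semantics shown are the same for Python lists and NumPy arrays.

```python
weights = [0.1, 0.7, 0.0, 0.4, 0.9, 0.2]

# Pure-Python stand-in for argsort: indices in ascending order of weight.
order = sorted(range(len(weights)), key=lambda i: weights[i])

# With a step of -1 the slice walks backwards from the end and stops
# before position -4, giving the indices of the three largest weights.
top3 = order[:-4:-1]

# Without the second colon, -3 - 1 is just arithmetic: this is order[:-4],
# i.e. everything except the last four entries, still in ascending order.
not_top = order[:-3 - 1]

print(order, top3, not_top)
```

By the same logic, `[:-21:-1]` returns the 20 largest entries in descending order, whereas `[:-21-1]` returns all but the 22 largest.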

In [66]:
# List of feature names, to map column indices back to terms
feature_names = vectorizer.get_feature_names()

In [70]:
[feature_names[i] for i in H[0].argsort()[:-21:-1]]

[u'edu',
 u'host',
 u'nntp',
 u'nntp post',
 u'post host',
 u'post',
 u'line nntp',
 u'cwru',
 u'cwru edu',
 u'line distribution',
 u'organization',
 u'distribution',
 u'cleveland',
 u'university',
 u'usa',
 u'distribution world',
 u'usa line',
 u'western',
 u'world nntp',
 u'reserve university']

In [67]:
for topic_idx, topic in enumerate(H):
    print "Topic #%d:" % topic_idx
    print ",".join([feature_names[i] for i in topic.argsort()[:-21:-1]])
    print

Topic #0:
edu,host,nntp,nntp post,post host,post,line nntp,cwru,cwru edu,line distribution,organization,distribution,cleveland,university,usa,distribution world,usa line,western,world nntp,reserve university

Topic #1:
god,christian,jesus,bible,say,believe,people,faith,christ,religion,atheist,church,belief,think,life,know,truth,christianity,sin,mean

Topic #2:
window,use,file,card,drive,program,problem,thank,driver,disk,run,scsi,work,help,graphic,version,video,color,mac,need

Topic #3:
com,people,write,article,car,edu,gun,don,think,say,just,right,make,like,good,state,time,year,line article,know

Topic #4:
game,team,player,play,win,year,hockey,season,score,baseball,edu,nhl,fan,good,league,playoff,pitch,toronto,university,think

Topic #5:
key,chip,clipper,encryption,escrow,clipper chip,government,use,algorithm,phone,crypto,security,key escrow,secret,nsa,encrypt,secure,public,wiretap,privacy



# Exercise

Compare the results of Nonnegative Matrix Factorization (NMF) with Latent Dirichlet Allocation (LDA).

In [79]:
from sklearn import decomposition # LatentDirichletAllocation is also in sklearn.decomposition

# Fit the LDA model
lda = decomposition.LatentDirichletAllocation(n_topics=6) # another unsupervised topic model; n_topics sets the number of topics (n_components in newer scikit-learn releases)
lda.fit(tfidf)
W = lda.transform(tfidf) # the topic weights for each document
H = lda.components_ # the matrix of categories (topics by terms)

In [80]:
W.shape

(11314, 6)

In [81]:
# List of feature names, to map column indices back to terms
feature_names = vectorizer.get_feature_names()

In [82]:
for topic_idx, topic in enumerate(H):
    print "Topic #%d:" % topic_idx
    print ",".join([feature_names[i] for i in topic.argsort()[:-21:-1]])
    print

Topic #0:
window,file,card,use,drive,thank,graphic,driver,program,mac,problem,disk,scsi,run,color,help,video,monitor,edu,organization

Topic #1:
gatech,gatech edu,prism,ncsu,georgia institute,maine,georgia,ncsu edu,eos,prism gatech,eos ncsu,organization georgia,moa,institute technology,catbyte,dtmedin,dsu edu,dsu,mail upenn,keller

Topic #2:
team,game,player,play,hockey,win,season,score,nhl,playoff,baseball,league,fan,pitch,pittsburgh,detroit,year,toronto,andrew cmu,leaf

Topic #3:
edu,com,write,article,organization,think,say,university,don,good,like,know,just,people,make,time,use,post,god,new

Topic #4:
gld,navy mil,kaldis,nec,behanna,captain,columbia edu,navy,gary dare,nec com,dare,cunixb columbia,cunixb,columbia,new brunswick,centerline com,nswc,dumb automotive,brunswick,automotive concept

Topic #5:
key,clipper,encryption,armenian,chip,cwru,cwru edu,caltech,caltech edu,government,turkish,israeli,cleveland,escrow,israel,clipper chip,nsa,case western,western reserve,reserve universit
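
One simple way to compare the two models is to measure how much the top terms of a topic in one model overlap with those of a topic in the other. The sketch below uses the Jaccard index on the top-10 terms copied from the outputs above; the choice of Jaccard, the cut-off of 10 terms and the function name are assumptions made for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))

# Top-10 terms copied from the topic listings above.
nmf_sport = "game,team,player,play,win,year,hockey,season,score,baseball".split(",")
lda_sport = "team,game,player,play,hockey,win,season,score,nhl,playoff".split(",")
nmf_crypto = ("key,chip,clipper,encryption,escrow,clipper chip,"
              "government,use,algorithm,phone").split(",")

overlap_sport = jaccard(nmf_sport, lda_sport)    # NMF topic #4 vs LDA topic #2
overlap_cross = jaccard(nmf_sport, nmf_crypto)   # two unrelated NMF topics

print(overlap_sport, overlap_cross)
```

The sports topics of the two models share 8 of 12 distinct terms, while the sports and encryption topics share none. Applying the same measure to every pair of topics gives a rough alignment between the NMF and LDA results.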