## Comparison of different methods for decomposition

In this notebook, we study two widely used methods for matrix decomposition: Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF).

1. SVD
    - using scipy.linalg
2. NMF
    - using sklearn.decomposition.NMF
    - using SGD, numpy


- Mentee: Manoj Pandey
- Mentor: Ivan, RaRe

In [173]:
import numpy as np
from scipy import linalg as la
from sklearn import decomposition
import time

[Data source](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html): Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes about 18,000 newsgroup posts on 20 topics.

In [6]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
# NMF is included with scikit-learn. We use it below through the `decomposition`
# module imported above, so this direct import is not needed:
# from sklearn.decomposition import NMF

In [9]:
%matplotlib inline
np.set_printoptions(suppress=True)

In [10]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

Downloading 20news dataset. This may take a few minutes.
INFO:sklearn.datasets.twenty_newsgroups:Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
INFO:sklearn.datasets.twenty_newsgroups:Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [11]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [12]:
newsgroups_train.target_names[:10]

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [13]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[:10]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space', 'alt.atheism',
       'sci.space', 'alt.atheism', 'sci.space', 'comp.graphics',
       'sci.space', 'comp.graphics'],
      dtype='<U18')

In [16]:
print("\n".join(newsgroups_train.data[:5]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

### The cell above prints the first five posts. Reading through them, it seems like:
- 1 -> Graphics
- 2 -> religion or atheism
- 3 -> ??
- 4 -> can see the word "theism", so maybe atheism
- 5 -> space

In [19]:
# Let's check it from the data
np.array(newsgroups_train.target_names)[newsgroups_train.target[:5]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space', 'alt.atheism',
       'sci.space'],
      dtype='<U18')

## Vectorization using CountVectorizer / TF-IDF

Let's fit a tf-idf model on our dataset below:

In [21]:
num_topics, num_top_words = 6, 8

In [22]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [129]:
# vectorizer = CountVectorizer(stop_words='english') # also can use tf-idf
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).todense() # (documents, vocab)
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape

(2034, 26576)
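To make the tf-idf weighting concrete, here is a minimal NumPy sketch on a made-up three-document corpus. It assumes the smoothed idf formula, $\ln\frac{1+n}{1+\mathrm{df}} + 1$, and the row-wise L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. Tokenization is plain whitespace splitting and stop-word removal is skipped, so treat it as an illustration rather than a reimplementation. The names `toy_docs`, `toy_vocab` and `index` are hypothetical.

```python
import numpy as np

# Hypothetical toy corpus, tokenized by whitespace.
toy_docs = ["space nasa launch", "graphics image file", "space image"]
toy_vocab = sorted({w for d in toy_docs for w in d.split()})
index = {w: j for j, w in enumerate(toy_vocab)}

# Raw term counts: one row per document, one column per vocabulary word.
counts = np.zeros((len(toy_docs), len(toy_vocab)))
for i, d in enumerate(toy_docs):
    for w in d.split():
        counts[i, index[w]] += 1

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by idf, then L2-normalize each document row.
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(tfidf.shape)  # (documents, vocab), same layout as `vectors`
```

Words that appear in fewer documents get a larger weight: in the first toy document, "nasa" (one document) outweighs "space" (two documents).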

In [130]:
len(newsgroups_train.data), vectors.shape

(2034, (2034, 26576))

In [131]:
# Note: on recent scikit-learn releases, use vectorizer.get_feature_names_out() instead
vocab = np.array(vectorizer.get_feature_names())

In [132]:
vocab.shape

(26576,)

In [133]:
vocab[7000:7100]

array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',
       'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',
       'couched', 'couldn', 'council', 'councils', 'counsel', 'counselees',
       'counselor', 'count', 'countdown', 'counted', 'counter',
       'counter_clockwise', 'counterargument', 'counterclockwise',
       'countered', 'counterexamples', 'counterfactual', 'counterpart',
       'counterproductive', 'counters', 'counting', 'countless',
       'countries', 'country', 'countryside', 'counts', 'county', 'coup',
       'couple', 'coupled', 'couples', 'courage', 'courageous', 'courant',
       'cournoyer', 'course', 'courses', 'court', 'courteous', 'courtesy',
       'courts', 'cousin', 'coutesy', 'cov', 'covalt', 'covenant',
       'covenent', 'coventry', 'cover', 'coverage', 'coverages', 'covered',
       'covering', 'coverings', 'covers', 'cow', 'coward', 'cowboy',
       'cowboys', 'cowdery', 'cowen', 'cowgirls', 'coy', 'coyote', 'cozy',
    

## Singular Value Decomposition
- using `scipy.linalg.svd`

![](http://halmusreeftank.com/images/IMG_1.png)

Source: http://halmusreeftank.com

![](https://csiu.github.io/blog//img/figure/2017-04-16/svd.png)

Source: https://csiu.github.io

In [134]:
vectors.shape

(2034, 26576)

In [135]:
%time U, s, Vh = la.svd(vectors, full_matrices=False)

CPU times: user 1min 30s, sys: 3.01 s, total: 1min 33s
Wall time: 48.1 s


In [136]:
U.shape, s.shape, Vh.shape

((2034, 2034), (2034,), (2034, 26576))

> Result

> `(2034, 26576) => ((2034, 2034), (2034,), (2034, 26576))`

> `.  vectors             U           s           Vh     .`

### Checking if the result is actually a decomposition

In [137]:
new_vectors = U @ np.diag(s) @ Vh

In [138]:
new_vectors.shape

(2034, 26576)

In [139]:
np.allclose(vectors, new_vectors)

True

In [140]:
np.allclose(U @ U.T, np.eye(U.shape[0]))

True

In [142]:
np.allclose(Vh @ Vh.T, np.eye(Vh.shape[0]))

True
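The check `U @ U.T == I` passes here because `vectors` has fewer rows than columns, so the reduced `U` is square. The sketch below, on small random matrices and using `numpy.linalg.svd` instead of `scipy.linalg.svd` (both accept `full_matrices=False`), shows which identities hold in general. The helper name `check_reduced_svd` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def check_reduced_svd(A):
    """Return (reconstructs, UtU_is_I, UUt_is_I) for the reduced SVD of A."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    k = len(s)
    # U * s scales the columns of U, which equals U @ np.diag(s).
    reconstructs = np.allclose(A, (U * s) @ Vh)
    utu = np.allclose(U.T @ U, np.eye(k))           # columns of U are orthonormal
    uut = np.allclose(U @ U.T, np.eye(A.shape[0]))  # only true when U is square
    return reconstructs, utu, uut

wide = check_reduced_svd(rng.normal(size=(4, 9)))  # like `vectors`: rows < columns
tall = check_reduced_svd(rng.normal(size=(9, 4)))  # rows > columns
print(wide)  # (True, True, True)
print(tall)  # (True, True, False)
```

For a tall matrix, `U @ U.T` is a projection onto the column space rather than the identity, so `U.T @ U` is the safer orthonormality check.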

In [143]:
# Helper method to get topics

num_top_words=8

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]
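The slice `[:-num_top_words-1:-1]` in `show_topics` is compact but easy to misread. A small sketch with made-up words and weights (`toy_vocab`, `weights` and `k` are hypothetical) shows that it picks the indices of the `k` largest values, in descending order:

```python
import numpy as np

toy_vocab = np.array(["moon", "god", "image", "nasa", "file"])
weights = np.array([0.7, 0.1, 0.3, 0.9, 0.2])
k = 3

order = np.argsort(weights)   # indices from smallest to largest weight
top_k = order[:-k - 1:-1]     # walk backwards from the end, taking k indices
print(toy_vocab[top_k])       # ['nasa' 'moon' 'image']

# Equivalent, slightly more explicit form: reverse first, then take k.
same = order[::-1][:k]
```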

In [144]:
show_topics(Vh[:10])

['ditto critus 141592654 point_node n4 do_sphere pnp asg',
 'space graphics thanks program files image nasa ftp',
 'space nasa launch shuttle moon orbit lunar station',
 'ico bobbe tek beauchaine bronx manhattan sank queens',
 'objective think morality don just people moral values',
 'objective morality values moral god science space subjective',
 'graphics comp god software group objective aspects edu',
 'image file cview graphics data use just images',
 'jesus objective christ christian software christians bible did',
 'edu space jesus ftp file nasa files pub']

## Non-Negative Matrix Factorization
- first, using NMF from scikit-learn
- then, using numpy

![](https://image.slidesharecdn.com/nlpmeetupsept2016derekgreene-160929091010/95/dynamic-topic-modeling-via-nonnegative-matrix-factorization-dr-derek-greene-4-638.jpg?cb=1475140310)

Source: Slideshare

![](https://mmolano.files.wordpress.com/2014/10/nmf.png)

Source: https://mmolano.files.wordpress.com/2014/10/nmf.png

In [145]:
num_topics = 5

In [146]:
clf = decomposition.NMF(n_components=num_topics, random_state=1)

In [147]:
%time W = clf.fit_transform(vectors)
H = clf.components_

CPU times: user 30.2 s, sys: 1.92 s, total: 32.2 s
Wall time: 12.6 s


In [148]:
vectors.shape

(2034, 26576)

In [149]:
W.shape, H.shape

((2034, 5), (5, 26576))

> Result

> `(2034, 26576) => ((2034, 5), (5, 26576))`

> `.     vectors        W          H      .`

In [150]:
np.allclose(vectors , W@H)

False

In [151]:
la.norm(vectors - W@H)
# The reconstruction is not exact because NMF with 5 components is a low-rank approximation of `vectors`

43.712926057951641
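A non-zero error is expected for any rank-5 factorization. By the Eckart-Young theorem, the truncated SVD gives the smallest possible Frobenius error at a given rank, and that error can be read off the discarded singular values. The sketch below illustrates this on a random non-negative toy matrix (`A`, `k` and the random factors are assumptions for the demo). With the `s` computed earlier, the corresponding lower bound for this dataset would be `np.sqrt((s[5:]**2).sum())`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((30, 50))   # non-negative toy matrix standing in for `vectors`
k = 5

U, s, Vh = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vh[:k]           # best rank-k approximation (Frobenius norm)
err_svd = np.linalg.norm(A - A_k)
err_from_s = np.sqrt(np.sum(s[k:] ** 2))    # same error, from the discarded singular values

# Any other rank-k factorization, here a random non-negative one, cannot do better.
W_rand = rng.random((30, k))
H_rand = rng.random((k, 50))
err_other = np.linalg.norm(A - W_rand @ H_rand)

print(np.isclose(err_svd, err_from_s), err_other >= err_svd)
```

NMF trades some of this reconstruction accuracy for non-negative, more interpretable factors.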

In [152]:
show_topics(H)

['people don think just like objective say morality',
 'graphics thanks files image file program windows know',
 'space nasa launch shuttle orbit moon lunar earth',
 'ico bobbe tek beauchaine bronx manhattan sank queens',
 'god jesus bible believe christian atheism does belief']

## NumPy to the rescue

**Goal**: Decompose $V_{(m \times n)}$ into $V \approx WH$,

   where $W_{(m \times d)}$ and $H_{(d \times n)}$ satisfy $W,\;H \geq 0$ and minimize the Frobenius norm of $V-WH$.

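**Approach**: We will pick random positive $W$ and $H$, then optimize them with gradient updates (called SGD here, although each step below uses the full matrix), adding a penalty that discourages negative entries.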

**Sources**:
- Optimality and gradients of NMF: http://users.wfu.edu/plemmons/papers/chu_ple.pdf
- Projected gradients: https://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf
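For reference, the gradients used in the update step can be written out as follows, assuming the objective is half the squared Frobenius norm (which has the same minimizer as the norm itself) and leaving out the non-negativity penalty:

$$
f(W, H) = \tfrac{1}{2}\,\lVert V - WH \rVert_F^2,
\qquad
\nabla_W f = (WH - V)\,H^{\top},
\qquad
\nabla_H f = W^{\top}(WH - V).
$$

With the residual $R = WH - V$, these are $R H^{\top}$ and $W^{\top} R$, which have the same shapes as $W$ $(m \times d)$ and $H$ $(d \times n)$ respectively.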

In [153]:
lam = 1e3 # lambda
lr = 1e-2 # learning rate = 0.01
m, n = vectors.shape

In [154]:
m, n

(2034, 26576)

In [155]:
mu = 1e-6 # µ
# gradients
def grads(M, W, H):
    R = W@H-M
    return R@H.T + penalty(W, mu)*lam, W.T@R + penalty(H, mu)*lam # dW, dH

In [156]:
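# calculate penalty: zero for entries >= mu, negative for entries below mu
# Note: np.min(M - mu, 0) reduces along axis 0, so each entry below mu gets the
# minimum of its column; np.minimum(M - mu, 0) would be the elementwise version.
def penalty(M, mu):
    return np.where(M>=mu,0, np.min(M - mu, 0))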

In [157]:
# update
def update(M, W, H, lr):
    dW,dH = grads(M,W,H)
    W -= lr*dW; H -= lr*dH
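A quick way to gain confidence in hand-written gradients is to compare them with finite differences on a tiny problem. The sketch below does this for the residual terms `R@H.T` and `W.T@R` only. It assumes the objective is half the squared Frobenius norm and leaves out the penalty term. The helper names `loss`, `grads_no_penalty` and `numeric_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 8))
W = rng.random((6, 3))
H = rng.random((3, 8))

def loss(M, W, H):
    # Half the squared Frobenius norm of the residual (assumed objective).
    return 0.5 * np.sum((W @ H - M) ** 2)

def grads_no_penalty(M, W, H):
    R = W @ H - M
    return R @ H.T, W.T @ R   # dW, dH

def numeric_grad(f, X, eps=1e-6):
    # Central finite differences, perturbing one entry at a time.
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

dW, dH = grads_no_penalty(M, W, H)
dW_num = numeric_grad(lambda X: loss(M, X, H), W)
dH_num = numeric_grad(lambda X: loss(M, W, X), H)
print(np.allclose(dW, dW_num, atol=1e-5), np.allclose(dH, dH_num, atol=1e-5))
```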

In [158]:
def report(M,W,H): 
    # Prints frobenius norm and other info
    print ((la.norm(M-W@H)), W.min(), H.min(), (W<0).sum(), (H<0).sum())


In [159]:
W = np.abs(np.random.normal(scale=0.01, size=(m,num_topics)))
H = np.abs(np.random.normal(scale=0.01, size=(num_topics,n)))

In [160]:
report(vectors, W, H)

44.4246561087 1.54643492996e-07 6.43139263611e-08 0 0


In [161]:
update(vectors, W, H, lr)

In [162]:
report(vectors, W, H)

44.4169857252 -0.00121131318517 -6.40467278636e-05 159 282


In [174]:
start = time.time()
for i in range(500): 
    update(vectors,W,H,lr)
    if i % 10 == 0: 
        print ("Iteration {}/500: ".format(i), end='')
        report(vectors,W,H)
print("\n--Done--")
print("Time took: {}s".format(time.time() - start))

Iteration 0/500: 43.8778272029 -0.00208808547441 -0.00170167658404 159 9409
Iteration 10/500: 43.8648893136 -0.00176954883833 -0.000862236048491 191 9076
Iteration 20/500: 43.8512642102 -0.00411962115736 -0.00184820736821 165 8966
Iteration 30/500: 43.8415948928 -0.00162798536787 -0.002165842963 166 8888
Iteration 40/500: 43.8339574782 -0.00149451050905 -0.00201874943895 178 9403
Iteration 50/500: 43.8261295056 -0.00240548813281 -0.0011585879677 175 8945
Iteration 60/500: 43.8207097623 -0.003328709076 -0.00170209761967 151 9157
Iteration 70/500: 43.8126931604 -0.00225727273457 -0.00198283600801 157 9239
Iteration 80/500: 43.8080371293 -0.00286052139051 -0.00113164132538 173 9293
Iteration 90/500: 43.8049970039 -0.000853666222654 -0.00233603374422 152 9114
Iteration 100/500: 43.8014331618 -0.00261673279944 -0.000652801971096 168 9050
Iteration 110/500: 43.8001897979 -0.00154614967032 -0.00194432212029 167 9469
Iteration 120/500: 43.7955471929 -0.000986127478639 -0.00119401821373 163 941

In [175]:
show_topics(H)

['space nasa launch shuttle orbit lunar moon earth',
 'ico bobbe tek bronx beauchaine manhattan sank queens',
 'god people jesus bible believe atheism christian objective',
 'just don like think know did say people',
 'thanks graphics files image file program windows format']

## Summary Table

| | SVD (`scipy.linalg`) | NMF (`sklearn`) | NMF (SGD & numpy) |
|---|---|---|---|
| Wall time | 48.1 s | 12.6 s | 664.67 s |

Note: NMF with SGD & numpy takes this long because of the large number of iterations (500). A decent result can also be obtained with fewer iterations.
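One way to avoid running more iterations than needed is to stop once the error stops improving. The sketch below is a toy illustration on a small random matrix, not the notebook's exact method: it replaces the penalty term with a projection step (negative entries are clipped to zero after each update), and the function name `nmf_projected_gd` and the values of `lr`, `tol` and `max_iters` are assumptions chosen for the demo.

```python
import numpy as np

def nmf_projected_gd(V, d, lr=0.01, max_iters=2000, tol=1e-5, seed=0):
    """Toy NMF: gradient step on 0.5*||V - WH||_F^2, then clip negatives to zero."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = np.abs(rng.normal(scale=0.1, size=(m, d)))
    H = np.abs(rng.normal(scale=0.1, size=(d, n)))
    errors = [np.linalg.norm(V - W @ H)]
    for _ in range(max_iters):
        R = W @ H - V
        dW, dH = R @ H.T, W.T @ R
        W = np.maximum(W - lr * dW, 0)   # projection replaces the penalty term
        H = np.maximum(H - lr * dH, 0)
        errors.append(np.linalg.norm(V - W @ H))
        # Stop once the relative improvement falls below tol.
        if errors[-2] - errors[-1] < tol * errors[-2]:
            break
    return W, H, errors

V_toy = np.random.default_rng(1).random((20, 12))
W_toy, H_toy, errors = nmf_projected_gd(V_toy, d=3)
print("iterations:", len(errors) - 1)
print("error: {:.4f} -> {:.4f}".format(errors[0], errors[-1]))
```

On the real `vectors` matrix the learning rate and tolerance would need tuning, so the iteration count printed here says nothing about the timings in the table.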