# Topic Modeling with NMF and SVD

* **NMF:** Non-Negative Matrix Factorization
* **SVD:** Singular Value Decomposition

## The problem:

Topic modeling starts with a **term-document matrix**:

![alt text](https://github.com/fastai/course-nlp/raw/e66cc0c5b393212d82aa548c4e1f4e54fcec824b/images/document_term.png)

This is a **bag of words** approach: it does not take word order or sentence structure into account.
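
As a rough illustration of the idea, the sketch below builds a small count matrix by hand using only the standard library. The three documents are made up for this example, and splitting on whitespace is a simplifying assumption; real tokenizers do more.

```python
from collections import Counter

# Three toy documents (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Bag of words: split on whitespace and count, ignoring word order.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is every distinct word, in sorted order.
vocab = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word.
doc_term = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in doc_term:
    print(row)

# Word order is discarded: a shuffled copy of the first document
# produces exactly the same counts.
shuffled = Counter("mat the on sat cat the".split())
print(shuffled == counts[0])
```

Here each row is a document and each column a term; transposing the matrix gives the term-by-document layout instead.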

Latent Semantic Analysis (LSA) uses SVD.

## Getting Started

We start with a dataset of documents from several categories and attempt to find topics (groups of words) for them. We will try both SVD and NMF.

### Libraries

In [0]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True)

### Looking at the data

The dataset is 20 Newsgroups, a collection of posts from Usenet discussion groups. It includes about 18,000 posts on 20 topics. Newsgroups were popular in the 1980s and 1990s.

In [0]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')

In [4]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                     remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                    remove=remove)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [6]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [7]:
print('\n'.join(newsgroups_train.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

The third post comes from the `sci.space` group.

In [8]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space'], dtype='<U18')

The `target` attribute holds, for each post, the integer index of its category in `target_names`.

In [9]:
newsgroups_train.target[:10]

array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])
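
To make the indexing concrete, here is a small standalone sketch that does not need the dataset. The `target` values are copied from the output above; the order of `target_names` is inferred from the earlier output (index 1 is `comp.graphics`, 3 is `talk.religion.misc`, 2 is `sci.space`), so treat it as an assumption.

```python
from collections import Counter

# Category names, in the order implied by the outputs above (an assumption).
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# The first ten targets, copied from the output above.
target = [1, 3, 2, 0, 2, 0, 2, 1, 2, 1]

# Each target is an index into target_names.
labels = [target_names[i] for i in target]
print(labels[:3])

# How many of the first ten posts fall into each category.
label_counts = Counter(labels)
print(label_counts)
```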

In [0]:
num_topics, num_top_words = 6,8

## Stop words, stemming, lemmatization

### Stop Words

Stop words are 'extremely common words which would appear to be of little value in helping' to pick out relevant documents, so they are often dropped from the vocabulary. Web search engines generally do not use stop lists.
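
A minimal sketch of stop-word removal, assuming a tiny hand-picked stop list and whitespace tokenization. Both are simplifications for illustration; `STOP_WORDS` and `remove_stop_words` are hypothetical names, not part of any library.

```python
# A tiny hand-picked stop list, purely illustrative.
STOP_WORDS = {"a", "about", "all", "the", "is", "of", "to", "and"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

tokens = remove_stop_words("The format of the file is all about texture rules")
print(tokens)
```

Only the content-bearing words survive, which is the intended effect, but the result depends entirely on which list is used.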

#### Natural Language Toolkit (NLTK)

In [0]:
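# Newer scikit-learn releases no longer ship sklearn.feature_extraction.stop_words;
# ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text. Importing that module
# under the old name keeps `stop_words.ENGLISH_STOP_WORDS` working in the cells below.
from sklearn.feature_extraction import text as stop_words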

In [12]:
sorted(list(stop_words.ENGLISH_STOP_WORDS))[:20]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']

***There is no single universal list of stop words***

#### Stemming and Lemmatization

Are the words below the same?

* Organize, Organizes, Organizing

* Democracy, Democratic, Democratization

Stemming and lemmatization both reduce words to a root form.

Lemmatization uses the **rules** of a language, so the resulting tokens are all actual words.

"Stemming is the poor-man's lemmatization" (Noah Smith, 2011)

Stemming chops the ends off words. The results may not be actual words, but stemming is faster.
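
The contrast can be sketched with two toy functions. This is deliberately not the Porter algorithm or WordNet: `toy_stem` only strips a few hard-coded suffixes, and `toy_lemmatize` uses a hand-written lookup table standing in for a real dictionary. All names here are hypothetical.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just the idea.
SUFFIXES = ("ing", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written table standing in for a real dictionary.
LEMMAS = {
    "feet": "foot",
    "foots": "foot",
    "organizes": "organize",
    "organizing": "organize",
}

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

words = ["feet", "foots", "organizes", "organizing"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer is cheap but produces the non-word `organiz` and cannot handle the irregular plural `feet`; the lemmatizer returns real words but only for entries it knows about.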

In [13]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk import stem

In [0]:
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [0]:
word_list = ['feet', 'foot', 'foots', 'footing']

In [17]:
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

In [18]:
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']

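Here we see the stemmer stripping suffixes: 'foots' loses its 's' and 'footing' loses 'ing', but the irregular plural 'feet' is left untouched. The lemmatizer maps 'feet' to 'foot' but keeps 'footing', which is a word in its own right.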

Try it yourself:

In [0]:
wl1 = ['fly', 'flies', 'flying']
wl2 = ['organize', 'organizes', 'organizing']
wl3 = ['universe', 'university']

In [20]:
[wnl.lemmatize(word) for word in wl1], [porter.stem(word) for word in wl1]

(['fly', 'fly', 'flying'], ['fli', 'fli', 'fli'])

In [21]:
[wnl.lemmatize(word) for word in wl2], [porter.stem(word) for word in wl2]

(['organize', 'organizes', 'organizing'], ['organ', 'organ', 'organ'])

The lemmatizer leaves the verb forms unchanged here because `lemmatize` treats every word as a noun unless a part of speech is passed.

In [22]:
[wnl.lemmatize(word) for word in wl3], [porter.stem(word) for word in wl3]

(['universe', 'university'], ['univers', 'univers'])

**Languages with more complex morphologies may show bigger benefits (more ways to express a word)**

## spaCy

Stemming and lemmatization are implementation dependent.

spaCy is a modern, fast NLP library.

In [0]:
!pip install -U spacy

In [0]:
import spacy

In [0]:
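# spaCy 2.x API: the spacy.lemmatizer module was removed in spaCy 3,
# where lemmatization runs as a pipeline component instead.
from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()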

In [30]:
[lemmatizer.lookup(word) for word in word_list]

['feet', 'foot', 'foots', 'footing']

spaCy does not offer a stemmer, since lemmatization is considered better. This is an example of the library being opinionated.

In [0]:
nlp = spacy.load('en_core_web_sm')

In [33]:
sorted(list(nlp.Defaults.stop_words))[:20]

["'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also']

**Exercise: What words appear in spaCy but not sklearn?**

Take the two lists and find the set difference.

In [0]:
sp = sorted(list(nlp.Defaults.stop_words))

In [0]:
sk = sorted(list(stop_words.ENGLISH_STOP_WORDS))

In [38]:
set(sp) - set(sk)

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'ca',
 'did',
 'does',
 'doing',
 'just',
 'make',
 "n't",
 'n‘t',
 'n’t',
 'quite',
 'really',
 'regarding',
 'say',
 'unless',
 'used',
 'using',
 'various',
 '‘d',
 '‘ll',
 '‘m',
 '‘re',
 '‘s',
 '‘ve',
 '’d',
 '’ll',
 '’m',
 '’re',
 '’s',
 '’ve'}

**Exercise: What words appear in sklearn but not spaCy?**

In [37]:
set(sk) - set(sp)

{'amoungst',
 'bill',
 'cant',
 'co',
 'con',
 'couldnt',
 'cry',
 'de',
 'describe',
 'detail',
 'eg',
 'etc',
 'fill',
 'find',
 'fire',
 'found',
 'hasnt',
 'ie',
 'inc',
 'interest',
 'ltd',
 'mill',
 'sincere',
 'system',
 'thick',
 'thin',
 'un'}
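
The same comparison can be tried without installing either library. The sketch below uses two short lists whose entries are taken from the outputs above; the lists themselves are heavily truncated and only illustrative.

```python
# Short excerpts of the two stop lists, picked from the outputs above.
list_a = ["a", "about", "ca", "did", "just"]    # stands in for spaCy
list_b = ["a", "about", "bill", "cry", "etc"]   # stands in for sklearn

only_a = sorted(set(list_a) - set(list_b))   # in A but not in B
only_b = sorted(set(list_b) - set(list_a))   # in B but not in A
shared = sorted(set(list_a) & set(list_b))   # in both lists
either = sorted(set(list_a) ^ set(list_b))   # in exactly one list

print(only_a)
print(only_b)
print(shared)
print(either)
```

Set difference is not symmetric, which is why the two exercises give different answers; the `^` operator combines both directions in one step.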

## When to use them?

These techniques will often hurt your performance if you are **using deep learning**, since they throw away information the model could otherwise use.

Another approach: **sub-word units**
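
One simple way to picture sub-word units is to break a word into overlapping character n-grams. The sketch below is only an illustration under that assumption: practical systems usually learn their sub-word vocabulary from data (for example with byte-pair encoding) rather than using fixed n-grams, and `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of word, padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

a = char_ngrams("organize")
b = char_ngrams("organizing")

# Related word forms share many pieces without any stemming step.
shared = sorted(set(a) & set(b))

print(a)
print(b)
print(shared)
```

Because `organize` and `organizing` share most of their pieces, a model working on these units can relate the two forms while still seeing the endings that stemming would discard.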