<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-Wrangling" data-toc-modified-id="Data-Wrangling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Wrangling</a></span><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Some-data-points" data-toc-modified-id="Some-data-points-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Some data points</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Stemming-and-Lemmatization-with-nltk" data-toc-modified-id="Stemming-and-Lemmatization-with-nltk-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Stemming and Lemmatization with nltk</a></span></li><li><span><a href="#Lemmatization-with-Spacy" data-toc-modified-id="Lemmatization-with-Spacy-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Lemmatization with Spacy</a></span></li><li><span><a href="#Term-document-matrix" data-toc-modified-id="Term-document-matrix-2.3.3"><span class="toc-item-num">2.3.3&nbsp;&nbsp;</span>Term-document matrix</a></span></li></ul></li></ul></li><li><span><a href="#Experiments" data-toc-modified-id="Experiments-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Experiments</a></span><ul class="toc-item"><li><span><a href="#Fake-news" data-toc-modified-id="Fake-news-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Fake news</a></span></li><li><span><a href="#Disaster-news" data-toc-modified-id="Disaster-news-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Disaster news</a></span></li></ul></li></ul></div>

# Introduction

This notebook is inspired by the [topic modeling lesson](https://www.youtube.com/watch?v=tG3pUwmGjsc&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=3&t=0s) of the fastai NLP course.
<br>
SVD (singular value decomposition) and NNMF (non-negative matrix factorization) are first explored on the newsgroup dataset, then applied experimentally to the [fake news](https://www.kaggle.com/mrisdal/fake-news) and [disaster news](https://www.kaggle.com/c/nlp-getting-started) datasets from [Kaggle](https://www.kaggle.com/).
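
For reference, the two factorizations can be written as follows. The notation is an assumption made here for illustration (a document-term matrix $A$ with $m$ documents and $n$ terms, and a chosen number of topics $k$); it is not taken from the course.

$$
A = U \Sigma V^{T}, \qquad U^{T}U = I, \quad V^{T}V = I, \quad \Sigma = \mathrm{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge 0)
$$

$$
A \approx W H, \qquad W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times n}
$$

In both cases the rows of the right-hand factor ($V^{T}$ or $H$) can be read as topics expressed over the vocabulary. SVD yields orthogonal factors that may contain negative entries, whereas NNMF constrains both factors to be non-negative.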

# Data Wrangling

## Load Data

Details on the newsgroup dataset are given by [Scikit-learn](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). The data contains 20 classes, but for clarity this notebook focuses on only three of them.
<br>
Note that even though the data is labeled, topic modeling is an unsupervised task, so the labels are not used to fit the models.

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [21]:
# fetch data for 3 categories only
categories = ['sci.space', 'talk.politics.guns', 'rec.sport.baseball']
remove = ('headers', 'footers', 'quotes') # parts of each post that are stripped to avoid overfitting on metadata
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

## Some data points

In [22]:
type(newsgroups_train)

sklearn.utils.Bunch

In [23]:
len(newsgroups_train.data)

1736

In [24]:
newsgroups_train.target_names

['rec.sport.baseball', 'sci.space', 'talk.politics.guns']

In [25]:
# baseball
print(newsgroups_train.target[10])
print('-------')
print(newsgroups_train.data[10])

0
-------

That's might be what it takes to beat the Braves this year.  


In [26]:
# space
print(newsgroups_train.target[7])
print('-------')
print(newsgroups_train.data[7])

1
-------



I hate to pour cold water on this, but currently seawater extracted
uranium, even using the new, improved fiber absorbers from Japan, is
about 20 times more expensive than uranium on the spot market.
Uranium is *very* cheap right now, around $10/lb.  Right now, there
are mines closing because they can't compete with places like Cigar
Lake in Canada (where the ore is so rich they present safety hazards
to the mines, who work in shielded vehicles).  Plenty of other sources
(for example, uranium from phosphate processing) would come on line before
uranium reached $200/lb.

"Demand and supply balance will collapse" is nonsense.  Supply and
demand always balance; what changes is the price.  Is uranium going
to increase in price by a factor of 20 by the end of the century?
Not bloody likely.  New nuclear reactors are not being built
at a sufficient rate.

Uranium from seawater is interesting, but it's a long term project, or
a project that the Japanese might justify on grounds o

In [27]:
# guns
print(newsgroups_train.target[116])
print('-------')
print(newsgroups_train.data[116])

2
-------


Fret not, you made it.


Not while we still have our guns.  <evil grin>  

Hey, gang, it's not about duck hunting, or about dark alleys,
it's about black-clad, helmeted and booted troops storming
houses and violating civil rights under color of law. 

Are YOU ready to defend YOUR Constitution?


## Preprocessing

These are usual preprocessing steps in many NLP applications, although their usefulness is debated when doing deep learning.

<img src="img/peterSko_tweet.png" alt="" style="width: 65%"/>

However, since SVD and NNMF are low-complexity models, preprocessing the data is worthwhile: it reduces noise that these models cannot capture anyway.

Some libraries that perform these preprocessing steps:
- NLTK
- Spacy
<br>

According to Rachel Thomas from fastai, Spacy is seen as: "A very modern & fast nlp library. Spacy is opinionated, in that it typically offers one highly optimized way to do something (whereas nltk offers a huge variety of ways, although they are usually not as optimized)".

### Stemming and Lemmatization with nltk

In [28]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/maxime/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
# instantiate Porter stemmer
stem_nltk = nltk.stem.porter.PorterStemmer()
# instantiate WordNet lemmatizer
lem_nltk = nltk.stem.WordNetLemmatizer()

In [30]:
words = ['math', 'mathematics', 'mathematician', 'mathematicians', 'fly', 'flight', 'flee', 'king', 'kingdom']

In [31]:
[stem_nltk.stem(word) for word in words]

['math',
 'mathemat',
 'mathematician',
 'mathematician',
 'fli',
 'flight',
 'flee',
 'king',
 'kingdom']

In [32]:
[lem_nltk.lemmatize(word) for word in words]

['math',
 'mathematics',
 'mathematician',
 'mathematician',
 'fly',
 'flight',
 'flee',
 'king',
 'kingdom']

### Lemmatization with Spacy

In [33]:
import spacy
nlp_spacy = spacy.load('en_core_web_sm')
# spaCy 2.x API: Defaults.create_lemmatizer() is not available in spaCy 3
lem_spacy = nlp_spacy.Defaults.create_lemmatizer()

In [34]:
[lem_spacy.lookup(word) for word in words]

['math',
 'mathematics',
 'mathematician',
 'mathematicians',
 'fly',
 'flight',
 'flee',
 'king',
 'kingdom']

### Term-document matrix

We'll stick to Scikit-learn to preprocess the data.

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts.
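
To make the shape of that matrix concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the variable names `docs`, `toy_vectorizer`, `X` and `terms` are illustrative assumptions, not part of the newsgroup data). Each row corresponds to a document and each column to a term, which matches the `(documents, terms)` shape reported for `tdm` further down.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
    "orbit orbit rocket",
]

# Default settings: lowercasing, word tokens of 2+ characters, no stop-word removal
toy_vectorizer = CountVectorizer()
X = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; order the terms by that index
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(terms)
print(X.toarray())
```

The last document contains "orbit" twice, so its row holds a 2 in the "orbit" column and a 1 in the "rocket" column.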

In [67]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

In [68]:
vectorizer = CountVectorizer(stop_words='english')
# vectorizer = TfidfVectorizer(stop_words='english')

In [69]:
tdm = vectorizer.fit_transform(newsgroups_train.data).todense()

In [70]:
# on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
vocab = np.array(vectorizer.get_feature_names())
vocab[6000:6020]

array(['challenge', 'challenged', 'challenger', 'challenges',
       'challenging', 'cham', 'chamber', 'chambering', 'chamberlain',
       'chamberlin', 'champer', 'champion', 'championed', 'champions',
       'championship', 'championships', 'champs', 'chance', 'chancellor',
       'chances'], dtype='<U79')

In [71]:
tdm.shape

(1736, 25382)

In [72]:
len(vocab)

25382

In [74]:
vectorizer.get_feature_names()[0]

'00'

In [88]:
tdm[:,0].sum()

231

In [81]:
tdm.sum(axis=0)

matrix([[231, 289,   2, ...,   2,   1,   1]])

In [76]:
a = tdm.sum(axis=0)

In [77]:
len(a)

1

In [121]:
a

matrix([[231, 289,   2, ...,   2,   1,   1]])

In [119]:
a = tdm.sum(axis=0)

In [124]:
a=a[0]

In [125]:
a.shape

(1, 25382)

In [128]:
a = a.tolist()[0]

In [129]:
a

[231,
 289,
 2,
 4,
 1,
 ...]


In [99]:
names = vectorizer.get_feature_names()

In [101]:
type(names)

list

In [107]:
len(names)

25382

In [130]:
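# map each term to its total count over the corpus
# (this rebinds `vocab`, which previously held the NumPy array of terms)
vocab = dict(zip(names, a))
vocab = {i: int(v) for i, v in vocab.items()}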
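
The cells above reach the per-term totals through several intermediate steps (`a[0]`, `.tolist()[0]`). A more direct route is sketched below on a made-up corpus; the names `cv`, `counts_matrix`, `totals` and `term_counts` are illustrative assumptions, and `collections.Counter` is swapped in for the plain dict only because it offers `most_common`.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
docs = ["space gun space", "gun law", "space launch"]

cv = CountVectorizer()
counts_matrix = cv.fit_transform(docs)

# Summing over axis 0 gives a 2-D result with a single row;
# np.asarray(...).ravel() flattens it into a plain 1-D array of totals.
totals = np.asarray(counts_matrix.sum(axis=0)).ravel()

# Order the terms by column index so they line up with `totals`
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
term_counts = Counter(dict(zip(terms, totals.tolist())))

print(term_counts.most_common(2))  # [('space', 3), ('gun', 2)]
```

`most_common` returns the terms already sorted by decreasing count, which is the ordering built by hand in the next cells.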

In [131]:
vocab

{'00': 231,
 '000': 289,
 '0000': 2,
 '00000': 4,
 '000000': 1,
 '000062david42': 1,
 '000152': 2,
 '00041032': 2,
 '0004136': 1,
 '0004246': 1,
 '0004422': 2,
 '00044513': 1,
 '0004847546': 1,
 '0005': 1,
 '00090711': 1,
 '000th': 1,
 '001': 2,
 '0012': 1,
 '001319': 1,
 '0018': 2,
 '002': 3,
 '0020': 1,
 '0022': 1,
 '0028': 1,
 '0029': 2,
 '003': 1,
 '0033': 1,
 '0034': 1,
 '004': 2,
 '006': 1,
 '0065': 1,
 '007': 1,
 '008': 1,
 '0096b294': 1,
 '0098': 1,
 '00xkv': 1,
 '01': 143,
 '010': 1,
 '013': 1,
 '014': 3,
 '015': 3,
 '016': 2,
 '018': 1,
 '01826': 1,
 '018b': 1,
 '019': 1,
 '02': 142,
 '020': 1,
 '020359': 1,
 '02115': 1,
 '02138': 1,
 '02139': 2,
 '02178': 1,
 '023': 1,
 '023b': 2,
 '024': 1,
 '0245': 1,
 '025': 2,
 '025258': 1,
 '027': 1,
 '029': 1,
 '03': 131,
 '030': 1,
 '0300': 1,
 '033': 1,
 '034': 2,
 '034101': 1,
 '035': 1,
 '037': 1,
 '038': 2,
 '039': 1,
 '04': 115,
 '040': 2,
 '041': 1,
 '04110': 1,
 '041493003715': 1,
 '042': 1,
 '043': 1,
 '044': 1,
 '045': 2,
 '0

In [132]:
# sort the terms by decreasing total count
t = dict(sorted(vocab.items(), key=lambda x: x[1], reverse=True))

In [None]:
# will hold the terms that occur between 6 and 9 times
r = {}

In [133]:
t

{'space': 992,
 'gun': 620,
 'people': 593,
 'don': 592,
 'like': 565,
 'year': 557,
 'just': 540,
 'think': 506,
 'time': 485,
 'good': 418,
 'know': 399,
 'nasa': 381,
 'right': 337,
 'use': 336,
 'new': 329,
 'file': 313,
 'make': 311,
 'years': 296,
 '000': 289,
 'did': 289,
 'guns': 288,
 'better': 271,
 'launch': 271,
 '10': 266,
 'does': 265,
 'way': 256,
 'used': 249,
 'control': 248,
 'data': 244,
 'team': 243,
 've': 242,
 'government': 236,
 'edu': 232,
 '00': 231,
 'firearms': 229,
 'really': 228,
 'earth': 227,
 'say': 221,
 'game': 219,
 'law': 219,
 'said': 217,
 'second': 217,
 'long': 212,
 'point': 208,
 'believe': 202,
 'going': 202,
 'orbit': 201,
 'll': 197,
 'day': 196,
 'national': 196,
 'got': 195,
 'weapons': 195,
 'didn': 193,
 'shuttle': 192,
 'information': 190,
 'satellite': 188,
 'want': 186,
 'things': 184,
 'defense': 183,
 'lunar': 182,
 'won': 182,
 '1993': 181,
 'state': 180,
 'high': 179,
 'power': 178,
 'hit': 173,
 'moon': 171,
 '15': 170,
 'little

In [None]:
# keep only the terms whose total count is strictly between 5 and 10
for k, v in t.items():
    if 5 < v < 10:
        r[k] = v

In [None]:
r

In [None]:
vocab

In [None]:
from scipy import linalg

In [None]:
U, s, Vh = linalg.svd(tdm, full_matrices=False)

In [None]:
print(U.shape, s.shape, Vh.shape)
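
With `full_matrices=False`, the inner dimension of the factors is the smaller of the two matrix dimensions. The sketch below checks this on a small random matrix standing in for a 6-document, 10-term count matrix (the matrix `A` and the name `A_rebuilt` are assumptions for illustration). By the same rule, the `(1736, 25382)` matrix above would be expected to give shapes `(1736, 1736)`, `(1736,)` and `(1736, 25382)`.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.random((6, 10))  # stand-in for a 6-document, 10-term matrix

U, s, Vh = linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vh.shape)  # (6, 6) (6,) (6, 10)

# Singular values are returned in non-increasing order
assert np.all(np.diff(s) <= 0)

# Multiplying the three factors back together recovers A
A_rebuilt = U @ np.diag(s) @ Vh
assert np.allclose(A, A_rebuilt)
```

Each row of `Vh` is a direction in term space, which is why the rows are later read as topics.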

In [None]:
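num_top_words = 8

def show_topics(a):
    # `vocab` was rebound to a {term: count} dict above, so index the term array instead
    terms = np.array(vectorizer.get_feature_names())
    # indices of the num_top_words largest entries of a topic vector, largest first
    top_words = lambda t: [terms[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]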

In [None]:
show_topics(Vh[:10])
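
The introduction mentions NNMF alongside SVD. The same topic-reading idea can be sketched with scikit-learn's `sklearn.decomposition.NMF` on a made-up corpus. Everything here is an assumption for illustration: the toy documents, the choice of two components, the `nndsvd` initialisation and the helper `top_terms`. It is not the course's or this notebook's actual experiment.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
toy_docs = [
    "rocket launch orbit satellite",
    "orbit satellite moon rocket",
    "pitcher batter inning baseball",
    "baseball inning pitcher team",
]

cv = CountVectorizer()
X = cv.fit_transform(toy_docs)
terms = np.array(sorted(cv.vocabulary_, key=cv.vocabulary_.get))

# Factorise X (documents x terms) into W (documents x topics) and H (topics x terms)
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_

def top_terms(H, terms, n=3):
    """Return the n highest-weighted terms of each topic as a space-joined string."""
    return [" ".join(terms[np.argsort(row)[::-1][:n]]) for row in H]

topics = top_terms(H, terms)
print(topics)
```

Because both factors are non-negative, each topic is a purely additive mix of terms, which tends to make the rows of `H` easier to interpret than the signed rows of `Vh`.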

# Experiments

## Fake news

## Disaster news