<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="cognitiveclass.ai logo">
</center>


# Machine Learning Foundation

## Course 4, Part e: Non-Negative Matrix Factorization DEMO


This exercise illustrates the use of Non-negative Matrix Factorization (NMF) and covers techniques related to sparse matrices and some basic work with Natural Language Processing. We will use NMF to look at the top words for given topics.


## Data


We'll be using the BBC dataset. These are articles collected from 5 different topics, with the data pre-processed.

These data are available in the data folder (or online [here](http://mlg.ucd.ie/files/datasets/bbc.zip?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01)). The data consist of a few files:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.
* *bbc.terms* is just a list of words
* *bbc.docs* is a list of articles listed by topic.

At a high level, we're going to

1. Turn the `bbc.mtx` file into a sparse matrix (a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) format can be useful for matrices with many values that are 0, and save space by storing the position and values of non-zero elements).
1. Decompose that sparse matrix using NMF.
1. Use the resulting components of NMF to analyze the topics that result.
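
As a small illustration of the idea in step 1, the sketch below stores a mostly-zero matrix in coordinate (COO) form and lists the positions and values that are actually kept. The matrix values are made up for the example and are not taken from the BBC data.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small, mostly-zero matrix (made-up values for illustration).
dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
])

sparse = coo_matrix(dense)

# COO format keeps three parallel arrays: row index, column index, value.
triples = list(zip(sparse.row.tolist(), sparse.col.tolist(), sparse.data.tolist()))
print(triples)  # one (row, col, value) entry per non-zero element
print(sparse.nnz, "stored values out of", dense.size, "cells")
```

Only three of the twelve cells are stored here; the saving grows with the share of zeros in the matrix.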


## Data Setup


Note: This lab has been updated to work in skillsnetwork for your convenience.


In [2]:
import urllib.request  # importing the submodule explicitly makes urllib.request available

In [3]:
mtx_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.mtx'
tms_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.terms'
doc_url =  'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.docs'

In [4]:
# Read "bbc.mtx" file; the first two lines are header lines, not data
with urllib.request.urlopen(mtx_url) as r:
  content = r.readlines()[2:]


In [5]:
# Peek at "bbc.terms": one term per line, with no header lines to skip
with urllib.request.urlopen(tms_url) as r:
  content = r.readlines()
content[:6]

[b'ad\n', b'sale\n', b'boost\n', b'time\n', b'warner\n', b'profit\n']

In [6]:
# Re-read "bbc.mtx", since `content` was overwritten by the terms preview above
with urllib.request.urlopen(mtx_url) as r:
    content = r.readlines()[2:]

## Part 1

Here, we will turn this into a list of tuples representing a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01). Remember the description of the file from above:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.

So, if word 1 appears in article 3, 2 times, one element of our list will be:

`(1, 3, 2)`
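
A minimal sketch of that conversion on a few hypothetical lines (the sample values are invented, and `parse_line` is a helper name introduced only for this example). The lines are bytes because `readlines()` on a URL response returns bytes.

```python
# Hypothetical sample lines in the same "wordID articleID count" layout,
# as bytes because urlopen(...).readlines() returns bytes.
sample_lines = [b"1 3 2.0\n", b"4 3 1.0\n", b"2 5 7.0\n"]

def parse_line(line):
    """Turn one 'wordID articleID count' line into a tuple of three ints."""
    word_id, article_id, count = line.split()
    # int(b"2.0") raises ValueError, so the count goes through float first.
    return int(word_id), int(article_id), int(float(count))

triples = [parse_line(line) for line in sample_lines]
print(triples)  # [(1, 3, 2), (4, 3, 1), (2, 5, 7)]
```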


In [7]:
[c.split() for c in content][:6]

[[b'1', b'1', b'1.0'],
 [b'1', b'7', b'2.0'],
 [b'1', b'11', b'1.0'],
 [b'1', b'14', b'1.0'],
 [b'1', b'15', b'2.0'],
 [b'1', b'19', b'2.0']]

In [8]:
# Converts bytes --> string --> float
[tuple(map(float, c.split())) for c in content][:6]

[(1.0, 1.0, 1.0),
 (1.0, 7.0, 2.0),
 (1.0, 11.0, 1.0),
 (1.0, 14.0, 1.0),
 (1.0, 15.0, 2.0),
 (1.0, 19.0, 2.0)]

In [9]:
[map(float, c.split()) for c in content][:6]

[<map at 0x7c17c1188610>,
 <map at 0x7c17c1188fa0>,
 <map at 0x7c17c1188340>,
 <map at 0x7c17c1189480>,
 <map at 0x7c17c1189360>,
 <map at 0x7c17c1189540>]

In [10]:
[map(int, map(float, c.split())) for c in content][:6]

[<map at 0x7c17c144a680>,
 <map at 0x7c17c144a560>,
 <map at 0x7c17c144abf0>,
 <map at 0x7c17c144ad40>,
 <map at 0x7c17c144ae30>,
 <map at 0x7c17c144af20>]
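
The two cells above print `map` objects rather than numbers because `map` is lazy: it converts nothing until it is consumed. The short sketch below, using one made-up line, shows why the final version wraps the call in `tuple(...)`.

```python
fields = b"1 7 2.0".split()

lazy = map(float, fields)   # nothing has been converted yet
print(lazy)                 # prints something like <map object at 0x...>

values = tuple(lazy)        # consuming the iterator performs the conversion
print(values)               # (1.0, 7.0, 2.0)

# A map object can only be consumed once; a second pass yields nothing.
leftover = tuple(lazy)
print(leftover)             # ()
```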

In [11]:
sparsemat = [tuple(map(int, map(float, c.split()))) for c in content]
sparsemat[:8]

[(1, 1, 1),
 (1, 7, 2),
 (1, 11, 1),
 (1, 14, 1),
 (1, 15, 2),
 (1, 19, 2),
 (1, 21, 1),
 (1, 29, 1)]

## Part 2: Preparing Sparse Matrix data for NMF


We will use SciPy's [coo matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) class to turn the list of tuples into a sparse matrix with one row per article and one column per word. The IDs in the file start at 1, so we subtract 1 to get 0-based row and column indices.


In [12]:
import numpy as np
from scipy.sparse import coo_matrix

In [47]:
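# The IDs in bbc.mtx are 1-based; subtract 1 to get 0-based indices
rows = [x[1]-1 for x in sparsemat]   # articleID -> row
cols = [x[0]-1 for x in sparsemat]   # wordID -> column
# Note: this also subtracts 1 from every count, so a word that appears once
# is stored as 0; use x[2] instead to keep the raw counts
values = [x[2]-1 for x in sparsemat]
coo = coo_matrix((values, (rows, cols))) # create sparse matrix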

In [14]:
display(coo)

<COOrdinate sparse matrix of dtype 'int64'
	with 286774 stored elements and shape (2225, 9635)>
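
To make the index handling concrete, here is a small sketch with hypothetical triples. It assumes the same `(wordID, articleID, count)` layout with 1-based IDs, keeps the counts unchanged, and passes `shape` explicitly, which is an optional choice for this example rather than something the lab does.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical (wordID, articleID, count) triples with 1-based IDs.
triples = [(1, 1, 2), (3, 1, 1), (2, 2, 4), (3, 3, 5)]
n_articles, n_words = 3, 3

rows = [article_id - 1 for _, article_id, _ in triples]  # articles become rows
cols = [word_id - 1 for word_id, _, _ in triples]        # words become columns
counts = [count for _, _, count in triples]              # counts are kept unchanged

# Passing shape explicitly keeps trailing all-zero rows or columns.
doc_word = coo_matrix((counts, (rows, cols)), shape=(n_articles, n_words))
print(doc_word.toarray())
# [[2 0 1]
#  [0 4 0]
#  [0 0 5]]
```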

In [34]:
# Read "bbc.terms" file
with urllib.request.urlopen(tms_url) as r:
  contents = r.readlines()
words = [c.split()[0].decode() for c in contents] # decode for bytes strings
words[:5]

['ad', 'sale', 'boost', 'time', 'warner']

In [35]:
# Read "bbc.docs" file
with urllib.request.urlopen(doc_url) as r:
  contents = r.readlines()
docs = [c.split()[0].decode() for c in contents] # decode for bytes strings
docs[:5]

['business.001',
 'business.002',
 'business.003',
 'business.004',
 'business.005']

In [45]:
len(docs), len(words), coo.shape

(2225, 9635, (2225, 9635))

In [49]:
import pandas as pd
pd.DataFrame(coo.toarray(), columns=words, index=docs).head(10)

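| | ad | sale | boost | time | warner | profit | quarterli | media | giant | jump | ... | £339 | denialofservic | ddo | seagrav | bot | wirelessli | streamcast | peripher | headphon | flavour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business.001 | 0 | 4 | 1 | 2 | 3 | 9 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.002 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.003 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.004 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.006 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.007 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.009 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| business.010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |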


## NMF


NMF decomposes a non-negative matrix of documents and words into two smaller non-negative matrices. One gives the weight of each topic in each document; the other can be interpreted as the "loadings" or "weights" of each word on each topic.


Check out [the NMF documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) and the [examples of topic extraction using NMF and LDA](http://scikit-learn.org/0.18/auto_examples/applications/topics_extraction_with_nmf_lda.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01).
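
Before applying NMF to the BBC data, a toy example may help to see the shapes involved. The sketch below factorizes a made-up 4 x 6 count matrix into a document-topic matrix `W` and a topic-word matrix `H`; the data, the variable names, and the `init="nndsvda"` and `max_iter=500` settings are choices made for this illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-word counts (made up): 4 documents, 6 words, two obvious themes.
X = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 0, 0, 3, 3, 2],
], dtype=float)

toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_model.fit_transform(X)   # document-topic weights, shape (4, 2)
H = toy_model.components_        # topic-word weights, shape (2, 6)

print(W.shape, H.shape)
print(np.round(W @ H, 1))        # approximate reconstruction of X
```

Both factors contain only non-negative values, which is what makes the rows of `H` readable as word weights for a topic.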


## Part 3

Here, we will import `NMF`, define a model object with 5 components, and `fit_transform` the data created above.


In [50]:
# Suppress warnings from using older version of sklearn
def warn(*args, **kwargs):
  pass

import warnings
warnings.warn = warn

In [51]:
from sklearn.decomposition import NMF
model = NMF(n_components=5, init='random', random_state=818)
doc_topic = model.fit_transform(coo)
# we should have 2225 observations (articles) and five latent features
print(f"from {coo.shape} to {doc_topic.shape}.")

from (2225, 9635) to (2225, 5).


In [52]:
# Find feature with highest value per doc
np.argmax(doc_topic, axis=1)

array([4, 1, 3, ..., 1, 1, 2])

## Part 4

Check out the `components_` of this model:


In [53]:
coo.shape

(2225, 9635)

In [54]:
model.components_

array([[0.16842663, 0.33867222, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.35426818, 0.51169176, 0.07210032, ..., 0.00449205, 0.        ,
        0.        ],
       [0.17917495, 0.42448256, 0.00397402, ..., 0.00070098, 0.        ,
        0.        ],
       [0.09696554, 0.21590291, 0.04927755, ..., 0.00102283, 0.        ,
        0.        ],
       [0.22268184, 0.00525106, 0.06118427, ..., 0.        , 0.        ,
        0.        ]])

In [55]:
model.components_.shape

(5, 9635)

In [57]:
topic_word = pd.DataFrame(model.components_.round(3),
             index = ["topic_1","topic_2","topic_3","topic_4","topic_5"],
             columns = words)
topic_word

| | ad | sale | boost | time | warner | profit | quarterli | media | giant | jump | ... | £339 | denialofservic | ddo | seagrav | bot | wirelessli | streamcast | peripher | headphon | flavour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| topic_1 | 0.168 | 0.339 | 0.0 | 0.419 | 0.0 | 0.058 | 0.0 | 0.039 | 0.0 | 0.048 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.003 | 0.0 | 0.0 | 0.0 |
| topic_2 | 0.354 | 0.512 | 0.072 | 1.485 | 0.013 | 0.148 | 0.001 | 0.569 | 0.031 | 0.014 | ... | 0.0 | 0.0 | 0.037 | 0.049 | 0.082 | 0.0 | 0.012 | 0.004 | 0.0 | 0.0 |
| topic_3 | 0.179 | 0.424 | 0.004 | 3.46 | 0.027 | 0.002 | 0.0 | 0.152 | 0.019 | 0.026 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.001 | 0.0 | 0.0 | 0.001 | 0.0 | 0.0 |
| topic_4 | 0.097 | 0.216 | 0.049 | 1.418 | 0.0 | 0.147 | 0.001 | 0.0 | 0.006 | 0.016 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001 | 0.0 | 0.0 |
| topic_5 | 0.223 | 0.005 | 0.061 | 1.075 | 0.082 | 0.055 | 0.0 | 0.102 | 0.0 | 0.148 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.006 | 0.0 | 0.0 | 0.0 |


The original data had 5 topics, as listed in `bbc.docs` (which these topic words relate to).

```
Business
Entertainment
Politics
Sport
Tech
```

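In "real life", we would have found a way to use these labels to inform the model. For this little demo, we can simply compare the recovered topics to the original ones. The order of the recovered topics is arbitrary, which is expected for this kind of model, so the comparison has to be made by content rather than by position. In the run shown here the match is only partial: some recovered topics line up with a single category, while others are shared between categories.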


In [63]:
np.unique([i.split('.')[0] for i in docs], return_counts=True)

(array(['business', 'entertainment', 'politics', 'sport', 'tech'],
       dtype='<U13'),
 array([510, 386, 417, 511, 401]))

In [58]:
topic_doc = pd.DataFrame(doc_topic.round(5),
                         index = [i.split('.')[0] for i in docs],
                         columns = ['topic_1','topic_2','topic_3','topic_4','topic_5'])
topic_doc

| | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 |
|---|---|---|---|---|---|
| business | 0.00000 | 0.02640 | 0.01557 | 0.00964 | 0.03487 |
| business | 0.00000 | 0.03848 | 0.00544 | 0.01710 | 0.01375 |
| business | 0.00000 | 0.02036 | 0.00069 | 0.02634 | 0.00000 |
| business | 0.02648 | 0.00835 | 0.00163 | 0.07533 | 0.01134 |
| business | 0.00000 | 0.02329 | 0.00000 | 0.00000 | 0.00048 |
| ... | ... | ... | ... | ... | ... |
| tech | 0.00000 | 0.15878 | 0.00000 | 0.00903 | 0.00000 |
| tech | 0.00000 | 0.13088 | 0.00000 | 0.00880 | 0.00000 |
| tech | 0.00000 | 0.21999 | 0.00000 | 0.04542 | 0.00974 |
| tech | 0.00533 | 0.08742 | 0.00000 | 0.00414 | 0.00000 |


In [66]:
topic_doc.reset_index()

| | index | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 |
|---|---|---|---|---|---|---|
| 0 | business | 0.00000 | 0.02640 | 0.01557 | 0.00964 | 0.03487 |
| 1 | business | 0.00000 | 0.03848 | 0.00544 | 0.01710 | 0.01375 |
| 2 | business | 0.00000 | 0.02036 | 0.00069 | 0.02634 | 0.00000 |
| 3 | business | 0.02648 | 0.00835 | 0.00163 | 0.07533 | 0.01134 |
| 4 | business | 0.00000 | 0.02329 | 0.00000 | 0.00000 | 0.00048 |
| ... | ... | ... | ... | ... | ... | ... |
| 2220 | tech | 0.00000 | 0.15878 | 0.00000 | 0.00903 | 0.00000 |
| 2221 | tech | 0.00000 | 0.13088 | 0.00000 | 0.00880 | 0.00000 |
| 2222 | tech | 0.00000 | 0.21999 | 0.00000 | 0.04542 | 0.00974 |
| 2223 | tech | 0.00533 | 0.08742 | 0.00000 | 0.00414 | 0.00000 |


In [69]:
topic_doc.reset_index().groupby('index').mean()

| index | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 |
|---|---|---|---|---|---|
| business | 0.008724 | 0.040825 | 0.006216 | 0.037109 | 0.003858 |
| entertainment | 0.066883 | 0.024052 | 0.0087 | 0.007597 | 0.101418 |
| politics | 0.002939 | 0.136302 | 0.00481 | 0.041765 | 0.002707 |
| sport | 0.009978 | 0.013324 | 0.058959 | 0.008326 | 0.008432 |
| tech | 0.017699 | 0.151173 | 0.097558 | 0.011137 | 0.012171 |


In [73]:
topic_doc.reset_index().groupby('index').mean().idxmax()

| | 0 |
|---|---|
| topic_1 | entertainment |
| topic_2 | tech |
| topic_3 | tech |
| topic_4 | politics |
| topic_5 | entertainment |
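
Averaging weights per category is one way to compare recovered topics with the known labels. Another option, sketched here on made-up numbers, is to take each document's strongest topic and cross-tabulate it against the label with `pd.crosstab`. The labels, weights, and names below are assumptions for the illustration, not results from the BBC data.

```python
import numpy as np
import pandas as pd

# Made-up document-topic weights and known labels, for illustration only.
labels = ["sport", "sport", "tech", "tech", "tech"]
weights = np.array([
    [0.9, 0.1],
    [0.7, 0.2],
    [0.1, 0.8],
    [0.6, 0.3],   # a tech article whose strongest topic is topic_1
    [0.0, 0.5],
])
topic_names = np.array(["topic_1", "topic_2"])

dominant = topic_names[weights.argmax(axis=1)]   # strongest topic per document
table = pd.crosstab(pd.Series(labels, name="label"),
                    pd.Series(dominant, name="dominant_topic"))
print(table)
```

With the lab's objects, the same idea would use `np.argmax(doc_topic, axis=1)` for the dominant topic and the category prefix of each entry in `docs` for the label.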


In [77]:
topic_word.T

| | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 |
|---|---|---|---|---|---|
| ad | 0.168 | 0.354 | 0.179 | 0.097 | 0.223 |
| sale | 0.339 | 0.512 | 0.424 | 0.216 | 0.005 |
| boost | 0.000 | 0.072 | 0.004 | 0.049 | 0.061 |
| time | 0.419 | 1.485 | 3.460 | 1.418 | 1.075 |
| warner | 0.000 | 0.013 | 0.027 | 0.000 | 0.082 |
| ... | ... | ... | ... | ... | ... |
| wirelessli | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| streamcast | 0.003 | 0.012 | 0.000 | 0.000 | 0.006 |
| peripher | 0.000 | 0.004 | 0.001 | 0.001 | 0.000 |
| headphon | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |


In [79]:
topic_word.T.sort_values(by='topic_1', ascending=False)

| | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 |
|---|---|---|---|---|---|
| song | 11.333 | 0.000 | 0.000 | 0.000 | 0.000 |
| music | 10.833 | 2.660 | 0.000 | 0.000 | 0.000 |
| best | 9.501 | 0.000 | 0.000 | 0.000 | 8.225 |
| year | 6.373 | 2.236 | 1.615 | 3.465 | 1.564 |
| 25 | 4.790 | 0.000 | 0.000 | 0.000 | 0.000 |
| ... | ... | ... | ... | ... | ... |
| investor | 0.000 | 0.069 | 0.000 | 0.046 | 0.003 |
| dab | 0.000 | 0.028 | 0.004 | 0.000 | 0.004 |
| sleek | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 512mb | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
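
Sorting by one topic at a time works, but the top words of every topic can also be pulled out in one pass. The sketch below does this with `np.argsort` on a small made-up weight matrix; `top_words`, `vocab`, and the numbers are hypothetical names and values introduced for this example.

```python
import numpy as np

# Made-up topic-word weights: 2 topics over a 5-word vocabulary.
vocab = np.array(["song", "music", "game", "win", "year"])
H = np.array([
    [11.3, 10.8, 0.0, 0.1, 6.4],
    [0.0, 0.2, 9.1, 7.5, 1.6],
])

def top_words(weights, vocabulary, n=3):
    """Return the n highest-weighted words for each topic, best first."""
    order = np.argsort(weights, axis=1)[:, ::-1][:, :n]   # descending indices
    return [vocabulary[idx].tolist() for idx in order]

for i, words_for_topic in enumerate(top_words(H, vocab), start=1):
    print(f"topic_{i}:", ", ".join(words_for_topic))
# topic_1: song, music, year
# topic_2: game, win, year
```

Applied to the lab's objects, the call would presumably be `top_words(model.components_, np.array(words), n=10)`.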


---
### Machine Learning Foundation (C) 2020 IBM Corporation
