<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="cognitiveclass.ai logo">
</center>


# Machine Learning Foundation

## Course 4, Part e: Non-Negative Matrix Factorization DEMO


This exercise illustrates the use of Non-negative Matrix Factorization (NMF) and covers techniques related to sparse matrices and some basic work with Natural Language Processing. We will use NMF to look at the top words for given topics.


## Data


We'll be using the BBC dataset. These are articles collected from 5 different topics, with the data pre-processed. 

These data are available in the data folder (or online [here](http://mlg.ucd.ie/files/datasets/bbc.zip?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01)). The data consists of a few files. The ones we'll be using are:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.
* *bbc.terms* is just a list of words
* *bbc.docs* is a list of articles listed by topic.

At a high level, we're going to 

1. Turn the `bbc.mtx` file into a sparse matrix (a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) format can be useful for matrices with many values that are 0, and save space by storing the position and values of non-zero elements).
1. Decompose that sparse matrix using NMF.
1. Use the resulting components of NMF to analyze the topics that result.
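
To make the space saving mentioned in step 1 concrete, here is a minimal pure-Python sketch that keeps only the non-zero entries of a small matrix as (row, column, value) triples. The matrix values and the helper name `to_triples` are made up for illustration and are not part of the BBC data.

```python
# A small dense matrix that is mostly zeros (made-up values for illustration)
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

def to_triples(matrix):
    """Return (row, col, value) for every non-zero entry of a list-of-lists matrix."""
    return [
        (i, j, v)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != 0
    ]

triples = to_triples(dense)
print(triples)  # [(0, 2, 3), (2, 0, 1), (2, 3, 2)]
print(len(triples), "stored entries instead of", len(dense) * len(dense[0]))
```

Only 3 of the 12 cells need to be stored here. The saving grows with the share of zeros, which is typically large for word-count data.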


## Data Setup


Note: This lab has been updated to work in skillsnetwork for your convenience.


In [None]:
# Import urllib.request for making HTTP requests
# (a bare `import urllib` does not reliably expose the `request` submodule)
import urllib.request

In [None]:
# Open the URL to read the bbc.mtx file from remote storage
# urlopen: opens the URL and returns a file-like object
with urllib.request.urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.mtx') as r:
    # Read all lines from the file and skip the first 2 header lines using slice [2:]
    content = r.readlines()[2:]

## Part 1

Here, we will turn this into a list of tuples representing a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01). Remember the description of the file from above:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.

So, if word 1 appears in article 3, 2 times, one element of our list will be:

`(1, 3, 2)`
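
As a minimal sketch of that conversion, the snippet below parses a few lines laid out in the same three columns. The sample lines are invented for illustration; the real ones come from `bbc.mtx`. They are written as bytes because that is what reading from a URL returns, and the values go through `float` first so that a count written like `2.0` still ends up as the integer `2`.

```python
# Made-up lines in the "wordID articleID count" layout (bytes, as read from a URL)
sample_lines = [b"1 3 2.0\n", b"1 7 1.0\n", b"2 3 5.0\n"]

# Split each line on whitespace, convert each token to float, then to int
parsed = [tuple(int(float(tok)) for tok in line.split()) for line in sample_lines]
print(parsed)  # [(1, 3, 2), (1, 7, 1), (2, 3, 5)]
```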


In [None]:
# Convert each line into a tuple of integers (wordID, articleID, count)
# c.split(): splits each line by whitespace
# map(float, ...): converts split strings to floats first
# map(int, ...): then converts floats to integers
# tuple(...): creates a tuple from the three values
sparsemat = [tuple(map(int,map(float,c.split()))) for c in content]
# Let's examine the first few elements using slice [:8]
sparsemat[:8]

## Part 2: Preparing Sparse Matrix data for NMF 


We will use the [coo matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) class to turn the list of tuples into a sparse matrix.
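
As a small illustration of the coordinate format, the sketch below builds a `coo_matrix` from three made-up triples whose IDs start at 1. When no shape is given, SciPy infers it as the largest index plus one, so row 0 and column 0 exist but stay empty.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up (wordID, articleID, count) triples with IDs starting at 1
triples = [(1, 3, 2), (1, 1, 1), (2, 3, 5)]
rows = [t[0] for t in triples]
cols = [t[1] for t in triples]
vals = [t[2] for t in triples]

toy = coo_matrix((vals, (rows, cols)))
print(toy.shape)      # (3, 4): max index + 1 along each axis
print(toy.toarray())  # dense view, only sensible for small matrices
```

The extra empty row and column matter later: any step that maps a column or row index back to a word or article has to account for the offset.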


In [None]:
# Import numpy for numerical operations
import numpy as np
# Import coo_matrix (coordinate format) for creating sparse matrices
from scipy.sparse import coo_matrix
# Extract row indices (wordID) from each tuple using list comprehension
rows = [x[0] for x in sparsemat]
# Extract column indices (articleID) from each tuple
cols = [x[1] for x in sparsemat]
# Extract the values (word count) from each tuple
values = [x[2] for x in sparsemat]
# Create a COO sparse matrix from the extracted data
# coo_matrix((data, (row_indices, col_indices))): creates sparse matrix with specified values at given positions
coo = coo_matrix((values, (rows, cols)))

## NMF


NMF decomposes a matrix of documents and words into two non-negative matrices, one of which can be interpreted as the "loadings" or "weights" of each word on a topic.


Check out [the NMF documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01) and the [examples of topic extraction using NMF and LDA](http://scikit-learn.org/0.18/auto_examples/applications/topics_extraction_with_nmf_lda.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork821-2023-01-01).


## Part 3

Here, we will import `NMF`, define a model object with 5 components, and `fit_transform` the data created above.
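
Before running this on the full data, it may help to see the same calls on a tiny example. The sketch below uses a made-up 4 × 6 count matrix (documents as rows, words as columns) and two components; the values, `random_state=0`, and `max_iter=500` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up counts: 4 documents (rows) x 6 words (columns)
V = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 1],
    [0, 0, 0, 4, 3, 2],
    [0, 1, 0, 3, 4, 3],
], dtype=float)

toy_model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = toy_model.fit_transform(V)   # documents x topics
H = toy_model.components_        # topics x words

print(W.shape, H.shape)          # (4, 2) (2, 6)
print(np.round(W @ H, 1))        # approximate reconstruction of V
```

`fit_transform` returns one row per row of its input, so the orientation of the input matrix decides whether the result is indexed by documents or by words.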


In [None]:
# Suppress warnings from using older version of sklearn
# Define a custom warn function that does nothing (passes)
def warn(*args, **kwargs):
    pass
# Import warnings module
import warnings
# Replace the default warn function with our custom one
warnings.warn = warn

# Import NMF (Non-negative Matrix Factorization) from sklearn
from sklearn.decomposition import NMF
# Create NMF model with 5 components (topics)
# n_components: number of topics to extract (5)
# init: initialization method ('random' uses random initialization)
# random_state: seed for reproducibility (818)
model = NMF(n_components=5, init='random', random_state=818)
# `coo` has wordIDs as rows and articleIDs as columns, and both IDs start at 1.
# NMF expects samples (documents) as rows and features (words) as columns,
# so drop the unused index 0 and transpose to get a documents x words matrix.
doc_word = coo.tocsr()[1:, 1:].T
# Fit the model and transform the data to get the document-topic matrix
# fit_transform: learns the model and returns the transformed data
doc_topic = model.fit_transform(doc_word)

# Get the shape of the document-topic matrix
doc_topic.shape
# we should have one row per article (2225) and five latent features

### NMF Formula Explanation

Non-negative Matrix Factorization decomposes the original matrix into two lower-rank matrices:

$$V \approx W \times H$$

Where:
- $V$ is the original matrix (documents × words) of size $m \times n$
- $W$ is the document-topic matrix (documents × topics) of size $m \times k$
- $H$ is the topic-word matrix (topics × words) of size $k \times n$
- $k$ is the number of components/topics (in our case, 5)

**Objective Function:**

NMF minimizes the reconstruction error using Frobenius norm:

$$\min_{W,H} \|V - WH\|_F^2 \quad \text{subject to} \quad W \geq 0, H \geq 0$$

The Frobenius norm is defined as:

$$\|V - WH\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}(V_{ij} - (WH)_{ij})^2}$$

**Non-negativity Constraint:**
All values in $W$ and $H$ must be non-negative ($\geq 0$), which makes the results interpretable as "parts-based" representations where topics are additive combinations of words.
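
The Frobenius norm above can be checked numerically. The sketch below uses small made-up non-negative factors, perturbs a single cell of their product by 2, and computes the norm both by writing out the formula and with `np.linalg.norm`.

```python
import numpy as np

# Made-up non-negative factors: 3 documents x 2 topics, and 2 topics x 4 words
W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
H = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0]])

# A made-up "observed" matrix that differs from W @ H in exactly one cell
V = W @ H
V[0, 3] += 2.0

# Square root of the summed squared errors, as in the formula
frob_manual = np.sqrt(((V - W @ H) ** 2).sum())
# The same quantity from NumPy's built-in Frobenius norm
frob_numpy = np.linalg.norm(V - W @ H, ord='fro')

print(frob_manual, frob_numpy)  # 2.0 2.0
```

Because only one cell is off, and by exactly 2, the norm equals 2.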

In [None]:
# Find the topic (feature) with the highest value for each document
# np.argmax: returns indices of maximum values
# axis=1: operates along rows (finds max across columns for each row)
np.argmax(doc_topic, axis=1)

## Part 4

Check out the `components_` attribute of this model:


In [None]:
# Get the shape of the components matrix (topic-word matrix)
# components_: the learned topic-word weight matrix
model.components_.shape

This is five rows, each of which is a "topic" containing the weights of each word on that topic. The exercise is to _get a list of the top 10 words for each topic_. We can just store this in a list of lists.
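
One way to pick the top words from a row of weights is `np.argsort`. The sketch below shows this on a made-up vocabulary and a made-up 2 × 5 weight matrix; the names `vocab`, `components`, and `top_n` are illustrative only.

```python
import numpy as np

# Made-up vocabulary and topic-word weights (2 topics x 5 words)
vocab = ["goal", "match", "market", "shares", "film"]
components = np.array([
    [0.9, 0.7, 0.0, 0.1, 0.2],
    [0.0, 0.1, 0.8, 0.6, 0.3],
])

top_n = 3
top_words = []
for weights in components:
    # argsort sorts ascending, so reverse it and keep the first top_n indices
    top_idx = np.argsort(weights)[::-1][:top_n]
    top_words.append([vocab[i] for i in top_idx])

print(top_words)  # [['goal', 'match', 'film'], ['market', 'shares', 'film']]
```

This only works if position `i` in the vocabulary list matches column `i` of the weight matrix, so it is worth checking that the two have the same length.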


**Note:** Just like we read in the data above, we'll have to read in the words from the `bbc.terms` file.


In [None]:
# Open the URL to read the bbc.terms file (list of words)
with urllib.request.urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.terms') as r:
    # Read all lines from the file
    content = r.readlines()
# Decode each line from bytes to text, then take the first whitespace-separated token as the word
words = [c.decode('utf-8').split()[0] for c in content]

In [None]:
# Initialize empty list to store top words for each topic
topic_words = []
# Iterate through each row (topic) in the components matrix
for r in model.components_:
    # Create list of (value, index) tuples using enumerate
    # enumerate(r): pairs each weight with its index
    # sorted(..., reverse=True): sorts by value in descending order
    # [:10]: selects the 10 words with the highest weights
    a = sorted([(v,i) for i,v in enumerate(r)],reverse=True)[:10]
    # Extract the actual words using their indices and append to topic_words
    # e[1]: gets the index from each (value, index) tuple
    # words[e[1]]: retrieves the word at that index
    topic_words.append([words[e[1]] for e in a])

In [None]:
# Display the top words for each of the 5 topics
# Each list holds the top words for one recovered topic; the topic order is arbitrary,
# so the first list does not necessarily correspond to 'Business'
topic_words[:5]

The original data had 5 topics, as listed in `bbc.docs`, and the topic words above should relate to them:

```
Business
Entertainment
Politics
Sport
Tech
```

In "real life", we would have found a way to use these to inform the model. But for this little demo, we can just compare the recovered topics to the original ones. And they seem to match reasonably well. The order is different, which is to be expected in this kind of model.
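
A simple way to make that comparison less impressionistic is to count how often each known label lands in each recovered topic. The sketch below does this on made-up data. In the notebook, the recovered indices could come from `np.argmax(doc_topic, axis=1)`, and the labels would have to be derived from `bbc.docs`; how to parse them is not shown here and is left as an assumption.

```python
from collections import Counter

# Made-up true labels and recovered topic indices for seven documents
true_labels = ["business", "business", "sport", "sport", "tech", "tech", "tech"]
recovered = [2, 2, 0, 0, 1, 1, 2]

# Count how often each (label, recovered topic) pair occurs
pair_counts = Counter(zip(true_labels, recovered))

# For each label, find the recovered topic it most often lands in
best_topic = {}
for label in sorted(set(true_labels)):
    counts = {t: c for (l, t), c in pair_counts.items() if l == label}
    best_topic[label] = max(counts, key=counts.get)
    print(label, "->", best_topic[label], counts)
```

If most documents with the same label share one recovered topic, and different labels map to different topics, the factorization has recovered the original grouping up to a reordering.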


In [None]:
# Open the URL to read the bbc.docs file (document labels)
with urllib.request.urlopen('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/bbc.docs') as r:
    # Read all lines from the file
    doc_content = r.readlines()

# Display the first 8 document labels using slice [:8]    
doc_content[:8]

---
### Machine Learning Foundation (C) 2020 IBM Corporation
