# Discovering interpretable features
In this chapter, you'll learn about a dimension reduction technique called "Non-negative matrix factorization" ("NMF") that expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics, and images in terms of commonly occurring visual patterns. You'll also learn to use NMF to build recommender systems that can find you similar articles to read, or musical artists that match your listening history!

# 1. Non-negative matrix factorization (NMF)
## 1.1 Non-negative data
Which of the following 2-dimensional arrays are examples of non-negative data?

1. A tf-idf word-frequency array.
2. An array of daily stock market price movements (up and down), where each row represents a company.
3. An array where rows are customers, columns are products and entries are 0 or 1, indicating whether a customer has purchased a product.

###### Possible Answers:
1. 1 only
2. 2 and 3
3. 1 and 3

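__Answer:__ 3 (1 and 3). Stock prices can go down as well as up, so an array of daily stock market price movements contains negative entries and is not an example of non-negative data. Tf-idf values and 0/1 purchase indicators are never negative.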
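To make the reasoning concrete, here is a minimal sketch of how one might check non-negativity in code. The helper `is_non_negative` and the three small arrays are hypothetical stand-ins invented for illustration; they are not part of the course data.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def is_non_negative(X):
    """Return True if every entry of a dense or sparse array is >= 0."""
    if issparse(X):
        # Implicit entries of a sparse matrix are 0, so only the
        # explicitly stored values can be negative.
        return bool((X.tocsr().data >= 0).all())
    return bool((np.asarray(X) >= 0).all())


# Hypothetical toy arrays mirroring the three options above
tfidf_like = csr_matrix(np.array([[0.0, 0.7, 0.0], [0.3, 0.0, 0.9]]))
price_moves = np.array([[0.8, -1.2, 0.3], [-0.1, 0.4, -0.6]])
purchases = np.array([[0, 1, 1], [1, 0, 0]])

print(is_non_negative(tfidf_like))   # True
print(is_non_negative(price_moves))  # False
print(is_non_negative(purchases))    # True
```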

## 1.2 NMF applied to Wikipedia articles
In the video, you saw NMF applied to transform a toy word-frequency array. Now it's your turn to apply NMF, this time using the tf-idf word-frequency array of Wikipedia articles, given as a csr matrix `articles`. Here, fit the model and transform the articles. In the next exercise, you'll explore the result.

### Instructions:
* Import `NMF` from `sklearn.decomposition`.
* Create an `NMF` instance called `model` with `6` components.
* Fit the model to the tf-idf word-frequency data `articles`.
* Use the `.transform()` method of `model` to transform `articles`, and assign the result to `nmf_features`.
* Print `nmf_features` to get a first idea what it looks like.

In [1]:
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.read_csv('_datasets/wikipedia-vectors.csv', index_col=0)
articles = csr_matrix(df.transpose())
# from sklearn import decomposition
# import functools
# import inspect

# def _init_wrapper(f):
#     @functools.wraps(f)
#     def wrapper(*args, **kwargs):
#         bound_args = inspect.signature(f).bind(*args, **kwargs)
#         if not bound_args.arguments.get('n_components', None):
#             raise BaseException("Be sure to correctly specify n_components.")
#         else: 
#             return f(*args, **kwargs)
#     return wrapper
        
# if not getattr(decomposition.NMF.__init__, 'decorated', None):
#     decomposition.NMF.__init__ = _init_wrapper(decomposition.NMF.__init__)
#     decomposition.NMF.__init__.decorated = True

In [3]:
articles

<60x13125 sparse matrix of type '<class 'numpy.float64'>'
	with 42091 stored elements in Compressed Sparse Row format>
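The summary above reports the shape and the number of stored (non-zero) entries. As a rough illustration of what those numbers mean, the sketch below builds a tiny CSR matrix from a made-up dense array and computes its density, then repeats the arithmetic for the figures printed above. The toy array is an assumption for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 array with three non-zero entries
dense = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.2],
                  [0.1, 0.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

n_rows, n_cols = sparse.shape
toy_density = sparse.nnz / (n_rows * n_cols)
print(sparse.shape, sparse.nnz, toy_density)  # (3, 4) 3 0.25

# Same calculation with the numbers reported for `articles`:
# 42091 stored elements in a 60 x 13125 matrix
articles_density = 42091 / (60 * 13125)
print(round(articles_density, 3))
```

Only about 5% of the entries of `articles` are stored, which is why a sparse format is a sensible choice here.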

In [4]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Print the NMF features
print(nmf_features)

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.40408746e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 5.66531579e-01]
 [3.82058178e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 3.98595338e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 3.81690831e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.85453916e-01]
 [1.29291751e-02 1.37898408e-02 7.76289619e-03 3.34479265e-02
  0.00000000e+00 3.34478785e-01]
 [0.00000000e+00 0.00000000e+00 2.06732391e-02 0.00000000e+00
  6.04345191e-03 3.59015072e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.90912793e-01]
 [1.54275247e-02 1.42826360e-02 3.76614820e-03 2.37106468e-02
  2.62559694e-02 4.80712637e-01]
 [1.11739438e-02 3.13697661e-02 3.09471672e-02 6.56985678e-02
  1.96632422e-02 3.38245148e-01]
 [0.00000000e+00 0.00000000e+00 5.30695509e-01 0.0

These NMF features don't make much sense at this point, but you will explore them in the next exercise!
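Before moving on, it can help to see the shapes involved. The sketch below runs NMF on a small random non-negative matrix (a hypothetical stand-in for `articles`) and uses `fit_transform()`, which fits and transforms in one call. The choices of `init`, `random_state` and `max_iter` are assumptions made here for reproducibility and are not part of the exercise.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Hypothetical toy data: 10 "documents" described by 20 "words"
rng = np.random.default_rng(0)
toy = csr_matrix(rng.random((10, 20)))

# Assumed settings for a reproducible run; not taken from the exercise
toy_model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
toy_features = toy_model.fit_transform(toy)

# One row of features per sample, one column per component
print(toy_features.shape)           # (10, 3)
# One row per component, one column per original feature (word)
print(toy_model.components_.shape)  # (3, 20)
# Both factors are non-negative
print((toy_features >= 0).all(), (toy_model.components_ >= 0).all())
```

For `articles`, the same pattern gives 60 rows of 6 features each, with 6 components of length 13125.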

## 1.3 NMF features of the Wikipedia articles
Now you will explore the NMF features you created in the previous exercise. A solution to the previous exercise has been pre-loaded, so the array `nmf_features` is available. Also available is a list `titles` giving the title of each Wikipedia article.

When investigating the features, notice that for both actors, NMF feature 3 (the column labeled `3`) has by far the highest value. This means that both articles are reconstructed mainly from the corresponding NMF component, `model.components_[3]`. In the next video, you'll see why: NMF components represent topics (for instance, acting!).

### Instructions:
* Import `pandas` as `pd`.
* Create a DataFrame `df` from `nmf_features` using `pd.DataFrame()`. Set the index to `titles` using `index=titles`.
* Use the `.loc[]` accessor of `df` to select the row with title `'Anne Hathaway'`, and print the result. These are the NMF features for the article about the actress Anne Hathaway.
* Repeat the last step for `'Denzel Washington'` (another actor).

In [6]:
_df = pd.read_csv('_datasets/wikipedia-vectors.csv', index_col=0)
articles = csr_matrix(_df.transpose())
titles = list(_df.columns)

# fix seed (must get same results now and in later exercise)
import numpy as np
np.random.seed(1)

# model solution to previous exercise
from sklearn.decomposition import NMF

model = NMF(n_components=6)
model.fit(articles)
nmf_features = model.transform(articles)

In [7]:
# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc[['Anne Hathaway']])

# Print the row for 'Denzel Washington'
print(df.loc[['Denzel Washington']])

                      0    1    2         3    4    5
Anne Hathaway  0.003845  0.0  0.0  0.575711  0.0  0.0
                     0         1    2        3    4    5
Denzel Washington  0.0  0.005601  0.0  0.42238  0.0  0.0


As the output confirms, column `3` dominates both rows, so the two actor articles are reconstructed mainly from the same NMF component.
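Reading off the largest feature by eye works for two rows, but it can also be done programmatically. The sketch below rebuilds the two rows printed above as a small DataFrame (named `actors` here, a name chosen for illustration) and uses `idxmax(axis=1)` to find the column with the highest value in each row.

```python
import pandas as pd

# The two rows of NMF features printed above, re-entered by hand
actors = pd.DataFrame(
    [[0.003845, 0.0, 0.0, 0.575711, 0.0, 0.0],
     [0.0, 0.005601, 0.0, 0.42238, 0.0, 0.0]],
    index=['Anne Hathaway', 'Denzel Washington'],
)

# Column label of the largest NMF feature in each row
dominant = actors.idxmax(axis=1)
print(dominant)
```

Both rows return column `3`. Applied to the full `df`, the same call would give the dominant feature for every article.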

## 1.4 NMF reconstructs samples
In this exercise, you'll check your understanding of how NMF reconstructs samples from its components using the NMF feature values. Below are the components of an NMF model. If the NMF feature values of a sample are `[2, 1]`, then which of the following is _most likely_ to represent the original sample?
```
[[ 1.   0.5  0. ]
 [ 0.2  0.1  2.1]]
```

###### Possible Answers:
1. `[2.2, 1.0, 2.0]`
2. `[0.5, 1.6, 3.1]`
3. `[-4.0, 1.0, -2.0]`

Solution:
```
 2 * | [[ 1.   0.5  0. ]
+1 * |  [ 0.2  0.1  2.1]]
------------------------
     | [  2.2  1.1  2.1 ]
```
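Answer: (1). The reconstruction `[2.2, 1.1, 2.1]` is only approximate, and option 1 is the closest candidate. Option 3 can be ruled out immediately, because non-negative features times non-negative components can never produce negative entries.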
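The same calculation can be sketched in NumPy: the reconstruction is the feature vector multiplied by the matrix of components. Comparing candidates by Euclidean distance is a choice made here for illustration, not something the exercise prescribes.

```python
import numpy as np

# Components and feature values from the exercise
components = np.array([[1.0, 0.5, 0.0],
                       [0.2, 0.1, 2.1]])
features = np.array([2, 1])

# Reconstruction: 2 * first component + 1 * second component
reconstruction = features @ components
print(reconstruction)

# Candidate answers, keyed by option number
candidates = {
    1: np.array([2.2, 1.0, 2.0]),
    2: np.array([0.5, 1.6, 3.1]),
    3: np.array([-4.0, 1.0, -2.0]),
}

# Pick the candidate closest to the reconstruction (Euclidean distance)
distances = {k: np.linalg.norm(reconstruction - v) for k, v in candidates.items()}
best = min(distances, key=distances.get)
print(best)  # 1
```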