## NMF

Non-negative matrix factorization (NMF) is a dimension reduction technique that expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics, and images as combinations of commonly occurring visual patterns. It requires every feature value to be non-negative, which holds for word-frequency arrays such as tf-idf.
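
As a rough illustration of "samples as combinations of parts", here is a minimal sketch on a made-up 4x3 array. The array `X`, the name `toy_model`, and the settings `init="nndsvda"`, `random_state=0` and `max_iter=1000` are assumptions chosen for this example only; they are not part of the article analysis.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy data: 4 samples, 3 non-negative features
X = np.array([[1.0, 0.5, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [1.0, 0.5, 2.0]])

# Factorize X into 2 parts (settings are illustrative assumptions)
toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)

# W holds the NMF features: one row per sample, one column per part
W = toy_model.fit_transform(X)

# H holds the parts (components): one row per part, one column per feature
H = toy_model.components_

print(W.shape, H.shape)
# The product of features and components approximates the original array
print(np.round(W @ H, 2))
```

Both `W` and `H` contain only non-negative entries, which is what makes the parts additive and easier to interpret.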

In [11]:
import pandas as pd

# Load the articles and inspect the columns
data = pd.read_csv("articles2.csv")
data.info()

# Keep the text of the first 1000 articles
documents = data.content[:1000]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49999 entries, 0 to 49998
Data columns (total 10 columns):
Unnamed: 0     49999 non-null int64
id             49999 non-null int64
title          49998 non-null object
publication    49999 non-null object
author         41401 non-null object
date           47373 non-null object
year           47373 non-null float64
month          47373 non-null float64
url            42988 non-null object
content        49999 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 3.8+ MB


### A tf-idf word-frequency array

In [12]:

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# Apply fit_transform to documents: csr_mat (a sparse csr_matrix)
csr_mat = tfidf.fit_transform(documents)

# Print result of toarray() method
print(csr_mat.toarray())

# Get the words: words
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
words = tfidf.get_feature_names_out()

# Print words
print(words)

# Note: TruncatedSVD can perform PCA-style dimension reduction on sparse
# arrays in csr_matrix format, such as this word-frequency array.


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [36]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=10)

# Fit the model to articles
model.fit(csr_mat)

# Transform the articles: nmf_features
nmf_features = model.transform(csr_mat)

# Print the NMF features
print(nmf_features)


[[1.33408559e-01 6.38851438e-03 2.74314253e-02 ... 0.00000000e+00
  0.00000000e+00 4.96344376e-03]
 [7.58912535e-02 0.00000000e+00 1.05989032e-01 ... 1.21412295e-04
  0.00000000e+00 0.00000000e+00]
 [3.98653900e-02 4.34130674e-02 8.25635784e-02 ... 0.00000000e+00
  0.00000000e+00 2.00968267e-02]
 ...
 [8.22373005e-02 0.00000000e+00 1.95435669e-02 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [1.22386053e-01 1.03090883e-02 4.24696456e-02 ... 0.00000000e+00
  0.00000000e+00 1.06788880e-02]
 [6.14912946e-02 0.00000000e+00 8.46097956e-02 ... 1.68414263e-02
  2.21490699e-02 0.00000000e+00]]


In [37]:
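# components_ has one row per NMF component and one column per word
print(model.components_.shape)
# Import pandas
import pandas as pd

# Create a pandas DataFrame indexed by article title: df
df = pd.DataFrame(nmf_features, index=data.title[:1000])

# Print the row for 'Patriots Day Is Best When It Digs Past the Heroism'
print(df.loc['Patriots Day Is Best When It Digs Past the Heroism'])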



(10, 38830)
0    0.133409
1    0.006389
2    0.027431
3    0.000000
4    0.014222
5    0.015565
6    0.000000
7    0.000000
8    0.000000
9    0.004963
Name: Patriots Day Is Best When It Digs Past the Heroism, dtype: float64


#### NMF reconstructs samples
NMF reconstructs a sample from its components, using the sample's NMF feature values as weights. Suppose an NMF model has the components below and a sample has NMF feature values [2, 1]. Which vector is most likely to represent the original sample?

Components:
[[1.  0.5 0. ]
 [0.2 0.1 2.1]]

Answer: multiply each component by its feature value and add the results:
2*[1.  0.5 0. ] + 1*[0.2 0.1 2.1] = [2.2, 1.1, 2.1]
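
The weighted sum in the answer above can be checked with NumPy. This is a minimal sketch: the names `components`, `features` and `reconstruction` are chosen here for illustration, and the numbers are the ones given in the question.

```python
import numpy as np

# Components of the NMF model, one row per component
components = np.array([[1.0, 0.5, 0.0],
                       [0.2, 0.1, 2.1]])

# NMF feature values of the sample, one weight per component
features = np.array([2, 1])

# Reconstruction = weighted sum of the component rows
reconstruction = features @ components
print(reconstruction)
```

For a whole dataset the same idea is a single matrix product: the array of NMF features multiplied by `model.components_` approximates the original array.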

In [38]:
# Import pandas
import pandas as pd

# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_,columns=words)



# Select row 5: component
component = components_df.iloc[5]

# Print result of nlargest
print("")
print(component.nlargest(10))



the             0.729942
flynn           0.600920
to              0.293191
russian         0.285626
intelligence    0.275566
that            0.250541
russia          0.249361
and             0.172711
was             0.172172
trump           0.171626
Name: 5, dtype: float64


### Building recommender systems using NMF

In [44]:
# Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)
# Create a DataFrame: df
df = pd.DataFrame(norm_features,index=data.title[:1000])

# Select the row corresponding to 'Donald Trump Meets, and Assails, the Press': article
article = df.loc['Donald Trump Meets, and Assails, the Press']

# Compute the dot products: similarities
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest(10))

title
Donald Trump Meets, and Assails, the Press                             1.000000
Why Trump Is Accusing Obama of Wiretapping                             0.968294
The White House Can’t Easily Repair Its Relationship With the Media    0.956198
President Trump’s Untruths Are Piling Up                               0.954377
The Obama-Trump Truce Is Already Over                                  0.928214
What Happens When a President Is Declared Illegitimate?                0.921217
Why Is Trump Returning to Birther-Style Attacks on Obama?              0.917681
’Alternative Facts’: The Needless Lies of the Trump Administration     0.916511
Trump Kicks Off His 2020 Reelection Campaign on Saturday               0.909959
The Formidable Checks and Balances Imposing on President Trump         0.897096
dtype: float64
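
The recommender cell above works because `normalize` scales each row to unit length (L2 norm by default), so the dot product of two rows equals their cosine similarity. The sketch below checks this on a made-up array; `toy_features` and the other names are assumptions for illustration, not values from the article data.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical NMF features for three documents
toy_features = np.array([[1.0, 2.0, 0.0],
                         [2.0, 4.0, 0.0],   # same direction as row 0
                         [0.0, 1.0, 3.0]])

# Scale every row to unit L2 norm
norm_toy = normalize(toy_features)

# Dot product of each normalized row with normalized row 0
sims = norm_toy @ norm_toy[0]

# Cosine similarity of rows 0 and 2 computed directly from the definition
a, b = toy_features[0], toy_features[2]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sims)
print(manual)
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though row 1 has larger values. This is why cosine similarity suits documents of different lengths that cover the same topics.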
