In the previous notebook, we saw that assessing the similarity of authors according to their "review profile" yielded remarkably intuitive groupings. In particular, this method was picking up on more than just the co-presence of one or two journals in two authors' review sets. Anne Rice, for example, was grouped with other mystery and suspense authors *even though* her top journals were not explicitly dedicated to that genre. This suggests that some particular combination of journals is associated with "detectiveness," at least in this data.

In this notebook, we will attempt to recover some of the most prominent of these underlying combinations that account for the structure of the data. We'll be using a method known as nonnegative matrix factorization (NMF), which decomposes a nonnegative matrix into two lower-rank nonnegative matrices whose product approximates the original. NMF is used in many applications, such as recommendation systems and topic modeling.
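
To make the factorization concrete, here is a minimal sketch on a toy author-by-journal matrix. It uses the classic multiplicative-update rules written directly in NumPy, which is an assumption made for illustration: scikit-learn's `NMF` uses a coordinate-descent solver by default, so this is not the code path used later in this notebook. The function name `nmf_multiplicative` and the toy counts are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=500, seed=0, eps=1e-10):
    """Factor a nonnegative matrix V (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay >= 0.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical counts: two groups of authors reviewed in disjoint sets of journals.
V_toy = np.array([
    [4, 4, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 1, 2],
], dtype=float)

W_toy, H_toy = nmf_multiplicative(V_toy, k=2)
print(W_toy.shape, H_toy.shape)      # author-by-component, component-by-journal
print(np.round(W_toy @ H_toy, 2))    # approximate reconstruction of V_toy
```

Each row of `H_toy` is a weighted combination of journals, and each row of `W_toy` says how strongly an author loads on each combination.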

In [6]:
import pandas as pd
import numpy as np
import math

from scipy.spatial import distance  # cosine distance, used in author_query below
from sklearn.decomposition import NMF

We'll load and preprocess the data in the same way that we did for the author-similarity comparisons.

In [7]:
# load data
df = pd.read_csv('../data/processed/book_reviews.tsv', sep='\t', index_col=0)
df['author_name'] = df.index.to_series().str.split('\\|\\|').str[1].str.strip()
author_total_books = df['author_name'].value_counts()
df = df.groupby('author_name').sum()
df = df[df.index.notnull()]
df = df.drop('#NAME?')

# prune sparsely-represented authors and journals
auth_min = 20
journal_min = 25
df = df[df.sum(axis=1) >= auth_min]
df = df[df.columns[df.sum() >= journal_min]]
df.shape

# weight data
weighting_scheme = 'TFIDF'
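if weighting_scheme == 'TFIDF':
    docs = df.shape[0]
    # idf = log(number of authors / number of authors with at least one review in the journal)
    idfs = [math.log(docs / (df[col] != 0).sum()) for col in df.columns]
    # scale each journal column of raw counts by its idf
    df = df * idfs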
elif weighting_scheme == 'PMI': # TO DO
    pass
df.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Harvard Law Review,Journal of Home Economics,Social Education,Library Journal,...,Journal of Negro Education,Foreign Affairs,Thought,Political Science Reviewer,Mankind,Black Scholar,Social Research,Religious Studies,Daedalus,Threepenny Review
"AARDEMA, Verna",0.0,1.588567,0.0,1.766097,0.0,0.0,0.0,0.0,5.524714,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Chester",0.0,0.907753,0.0,1.059658,0.0,0.0,0.0,0.0,2.762357,0.409565,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Daniel",0.0,0.453876,0.0,0.17661,0.0,0.0,0.0,0.0,0.0,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Henry J",0.0,0.0,0.0,0.353219,0.0,0.0,0.0,0.0,0.0,0.546086,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AASENG, Nathan",0.0,0.0,0.0,7.064389,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


NMF requires setting a value K for the number of latent features that will be discovered. As with many dimensionality reduction techniques, setting this value is sometimes more art than science. For this project, I used the cross-validation methods detailed in crossval_nmf.ipynb. That analysis put the optimal value of K somewhere between 15 and 25, so we use K = 20 here.
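
For readers without crossval_nmf.ipynb at hand, one common way to cross-validate K is to hide a random subset of matrix entries, fit on the rest, and score the hidden entries. The sketch below illustrates that general idea on synthetic data; it is an assumption about what such a procedure can look like, not a reproduction of the method in that notebook. The helpers `masked_nmf` and `heldout_error` are hypothetical, and the fit uses masked multiplicative updates in NumPy because scikit-learn's `NMF` has no option for ignoring entries.

```python
import numpy as np

def masked_nmf(V, mask, k, n_iter=300, seed=0, eps=1e-10):
    """Fit W, H using only the entries where mask == 1."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    MV = mask * V
    for _ in range(n_iter):
        # Multiplicative updates for the masked Frobenius loss ||mask * (V - WH)||^2.
        H *= (W.T @ MV) / (W.T @ (mask * (W @ H)) + eps)
        W *= (MV @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H

def heldout_error(V, k, holdout_frac=0.2, seed=0):
    """RMSE on a random set of entries that were hidden during fitting."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(V.shape) >= holdout_frac).astype(float)
    W, H = masked_nmf(V, mask, k, seed=seed)
    held = 1.0 - mask
    resid = held * (V - W @ H)
    return float(np.sqrt((resid ** 2).sum() / max(held.sum(), 1.0)))

# Synthetic nonnegative data with a known rank of 3 (illustration only).
rng = np.random.default_rng(42)
V_synth = rng.random((60, 3)) @ rng.random((3, 40))

errors = {k: heldout_error(V_synth, k) for k in range(1, 7)}
for k, e in errors.items():
    print(f'k={k}: held-out RMSE = {e:.4f}')
```

The usual reading is to look for the K beyond which the held-out error stops improving or starts to rise.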

In [99]:
# https://towardsdatascience.com/nmf-a-visual-explainer-and-python-implementation-7ecdd73491f8
k = 20
model = NMF(n_components=k, init='nndsvd', random_state=99, max_iter=1000)
W = model.fit_transform(df)
H = model.components_
err = model.reconstruction_err_
print(f'Shape of W: {W.shape}')
print(f'Shape of H: {H.shape}')
print(f'Reconstruction error: {err}')

Shape of W: (9043, 20)
Shape of H: (20, 352)
Reconstruction error: 1450.6853160318217
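
The raw reconstruction error is hard to read on its own. With the default Frobenius loss, scikit-learn reports the Frobenius norm of the difference between the data and the product of W and H, so its size depends on the scale of the data. A minimal sketch of a relative version follows; the helper name `relative_reconstruction_error` and the toy matrices are hypothetical.

```python
import numpy as np

def relative_reconstruction_error(V, W, H):
    """Frobenius norm of (V - W @ H) as a fraction of the Frobenius norm of V."""
    V = np.asarray(V, dtype=float)
    residual = V - W @ H
    return float(np.linalg.norm(residual) / np.linalg.norm(V))

# Toy check: a perfect factorization scores 0, an all-zero one scores 1.
V_toy = np.array([[2.0, 0.0], [0.0, 3.0]])
W_toy = np.eye(2)
rel_exact = relative_reconstruction_error(V_toy, W_toy, V_toy.copy())
rel_zero = relative_reconstruction_error(V_toy, W_toy, np.zeros_like(V_toy))
print(rel_exact, rel_zero)
```

On the real data, the same quantity could be computed as `relative_reconstruction_error(df.values, W, H)`, which makes runs with different K or different weighting schemes easier to compare.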


The resulting model gives us two matrices W and H. W contains one row for each author in the original data, with values for each of the K hidden components that we uncovered. H contains one row for each hidden component, with values for each of the 352 journals that went into generating those components.

This lets us do a couple of useful things.

1) We can sort each hidden component (a row of H) by its feature values to see which journals contribute most to that component.
2) We can sort the authors by their value on each component (a column of W) to see which authors are most prominent within that component.

In [100]:
num_journals = 10
num_authors = 10

for ix in range(k):
    print(f'COMPONENT {ix}:\n')
    print('Top journals:')
    for journal_ind in H[ix].argsort()[::-1][:num_journals]:
        print(df.columns[journal_ind])
    print()
    print('Prominent authors:')
    for author_ind in W[:, ix].argsort()[::-1][:num_authors]:
        print(df.index[author_ind])
    print()

COMPONENT 0:

Top journals:
Center for Children's Books, Bulletin
Horn Book Magazine
Childhood Education
Catholic Library World
Kirkus Reviews
Teachers College Record
Christian Science Monitor
Grade Teacher
Language Arts
Top of the News

Prominent authors:
SELSAM, Millicent E
ASIMOV, Isaac
ROCKWELL, Anne
MELTZER, Milton
YOLEN, Jane
CORBETT, Scott
BRANLEY, Franklyn M
KEATS, Ezra Jack
PRINGLE, Laurence
SIMON, Seymour

COMPONENT 1:

Top journals:
Time
Newsweek
New York Times (Daily)
Saturday Review
National Review
Wall Street Journal
Book World
America
Best Sellers
Christian Science Monitor

Prominent authors:
UPDIKE, John
MAILER, Norman
SOLZHENITSYN, Aleksandr
BUCKLEY, William F, Jr.
NABOKOV, Vladimir
SINGER, Isaac Bashevis
SIMENON, Georges
VIDAL, Gore
ROTH, Philip
MURDOCH, Iris

COMPONENT 2:

Top journals:
Science Books and Films
Appraisal: Children's Science Books
Appraisal: Science Books for Young People
Scientific American
Curriculum Review
Instructor
Childhood Education
Natural Hist

This method reproduced many of the journal clusters that we originally uncovered in Notebook #2. We have clusters for children's literature, science fiction, mainstream publishing, poetry, science education, political science, British journals, Christian publishing, Canadian journals, and more. 

Let's see if this reduced-dimensionality space can give us any insight into the author-similarity comparisons we made last time.

In [101]:
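W_df = pd.DataFrame(W, index=df.index)

def author_query(author: str, num_journals: int = 5, num_authors: int = 5):
    """Print an author's strongest NMF components and nearest neighbors in component space."""
    print(author)
    # Note: the columns of W_df are NMF components, so the values printed here are
    # component scores (labeled by component index), not individual journals.
    print('Top Journal Scores:')
    print(W_df.loc[author].sort_values(ascending=False)[:num_journals])
    print()

    author_vector = W_df.loc[author]
    # Cosine distance (scipy.spatial.distance.cosine): smaller means more similar.
    similarities = W_df.drop(author).apply(lambda x: distance.cosine(x, author_vector), axis=1)

    print('Most Similar Authors:')
    print(similarities.sort_values()[:num_authors])
    print()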

In [102]:
author_query('ASIMOV, Isaac')

ASIMOV, Isaac
Top Journal Scores:
5     5.662866
12    4.433196
2     4.388851
7     3.294927
0     2.178704
Name: ASIMOV, Isaac, dtype: float64

Most Similar Authors:
author_name
BOVA, Ben                 0.030258
COOPER, Henry S F, Jr.    0.076049
STINE, G Harry            0.121160
SILVERBERG, Robert        0.161357
SHOOK, Robert L           0.162051
dtype: float64
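
The numbers in the "Most Similar Authors" listing are cosine distances, i.e. one minus the cosine of the angle between two authors' component vectors, so values near 0 mean nearly identical component profiles regardless of how many reviews each author has. As a sketch, the same quantity can be computed for all authors at once with NumPy. The helper `cosine_distances_to` and the toy matrix are hypothetical, and treating all-zero rows as maximally distant is an assumption of this sketch.

```python
import numpy as np

def cosine_distances_to(W, query_ix):
    """Cosine distance from row `query_ix` of W to every row of W."""
    norms = np.linalg.norm(W, axis=1)
    # Guard against all-zero rows, which have no direction; they end up at distance 1.
    safe = np.where(norms == 0, 1.0, norms)
    unit = W / safe[:, None]
    return 1.0 - unit @ unit[query_ix]

# Hypothetical component scores for four authors on two components.
W_toy = np.array([
    [3.0, 0.0],   # query author
    [6.0, 0.0],   # same profile, twice the volume
    [0.0, 2.0],   # entirely different component
    [1.0, 1.0],   # split between the two
])
dists = cosine_distances_to(W_toy, 0)
print(np.round(dists, 4))
```

The second author comes out at distance 0 from the query despite having twice the volume, which is the property that makes cosine distance a reasonable choice for comparing review profiles.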



TO-DO:

1) DBSCAN clustering

2) how to locate authors at the boundary of several different components?
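
One possible starting point for the second item, offered only as a sketch and not as a settled approach: normalize each author's row of W to proportions and compute its entropy. Authors who load on a single component get entropy near 0, while authors spread across several components get higher values. The helper `component_entropy` and the toy scores are hypothetical.

```python
import numpy as np

def component_entropy(W, eps=1e-12):
    """Shannon entropy (in bits) of each row's normalized component weights."""
    totals = W.sum(axis=1, keepdims=True)
    P = W / np.where(totals == 0, 1.0, totals)
    return -(P * np.log2(P + eps)).sum(axis=1)

# Hypothetical component scores for three authors on three components.
W_toy = np.array([
    [4.0, 0.0, 0.0],   # loads on a single component
    [2.0, 2.0, 0.0],   # split evenly between two components
    [1.0, 1.0, 1.0],   # spread across all three
])
ent = component_entropy(W_toy)
print(np.round(ent, 3))
```

Sorting authors by this value in descending order would surface candidates for "boundary" authors, though a threshold would still need to be chosen by inspection.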