In the previous notebook, we saw that assessing the similarity of authors according to their "review profile" yielded remarkably intuitive groupings. In particular, this method picked up on more than just the co-presence of one or two journals in two authors' review sets. Anne Rice, for example, was grouped with other mystery and suspense authors *even though* her top journals were not explicitly dedicated to that genre. This suggests that some particular combination of journals is associated with "detectiveness," at least in this data.

In this notebook, we will attempt to recover some of the most prominent of these underlying combinations that account for the structure of the data. We'll be using a method known as nonnegative matrix factorization (NMF), which decomposes a matrix into two lower-rank matrices that approximate its structure. NMF is used in many applications, such as recommendation systems and topic modeling. 
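
As a rough illustration of what "decomposing a matrix into two lower-rank matrices" means, here is a minimal sketch on a made-up 4x4 matrix. It uses plain numpy with Lee-Seung multiplicative updates rather than the scikit-learn solver used later in this notebook, and the toy matrix and the helper name `nmf_multiplicative` are assumptions for illustration only.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, seed=0, eps=1e-9):
    """Factor nonnegative X (n x m) into W (n x k) and H (k x m)
    using multiplicative updates for the squared-error objective."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps  # random positive starting points
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # each update rescales entries by a nonnegative ratio,
        # so W and H can never become negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical "author x journal" counts with two obvious blocks
X = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
], dtype=float)

W, H = nmf_multiplicative(X, k=2)

# relative Frobenius error of the rank-2 approximation W @ H
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.shape, H.shape)
print(round(rel_error, 4))
```

Each row of `W` says how strongly a row of `X` loads on each component, and each row of `H` says which columns make up that component.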

In [10]:
import pandas as pd
import numpy as np
import math

from scipy.spatial import distance
from sklearn.decomposition import NMF
from sklearn.decomposition import TruncatedSVD

from sklearn.cluster import DBSCAN
import hdbscan

We'll load and preprocess the data in the same way that we did for the author-similarity comparisons.

In [26]:
# load data
df = pd.read_csv('../data/processed/book_reviews.tsv', sep='\t', index_col=0)
df['author_name'] = df.index.to_series().str.split('\\|\\|').str[1].str.strip()
author_total_books = df['author_name'].value_counts()
df = df.groupby('author_name').sum()
df = df[df.index.notnull()]
df = df.drop('#NAME?')

# prune sparsely-represented authors and journals
auth_min = 20
journal_min = 25
df = df[df.sum(axis=1) >= auth_min]
df = df[df.columns[df.sum() >= journal_min]]
author_total_books = author_total_books[df.index]

# weight data
weighting_scheme = 'PMI'
if weighting_scheme == 'TFIDF':
    docs = df.shape[0]
    # inverse document frequency: log(number of authors / number of authors reviewed in the journal)
    idfs = [math.log(docs / np.where(df[col] == 0, 0, 1).sum()) for col in df.columns]
    weighted = df * idfs # store in `weighted` so the cells below work for either scheme
elif weighting_scheme == 'PMI':
    p_joint = df / df.sum().sum() # P(author, journal)
    p_j = df.sum() / df.sum().sum() # P(journal)
    p_a = author_total_books / author_total_books.sum() # P(author)
    p_independent = p_a.apply(lambda a: a * p_j) # P(author) * P(journal)
    weighted = (p_joint / p_independent) + 1 # PMI ratio, plus 1 so that zero counts map to log(1) = 0
    weighted = pd.DataFrame(np.log(weighted.values), # take the log
                            index=weighted.index, 
                            columns=weighted.columns)

weighted.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Harvard Law Review,Journal of Home Economics,Social Education,Library Journal,...,Journal of Negro Education,Foreign Affairs,Thought,Political Science Reviewer,Mankind,Black Scholar,Social Research,Religious Studies,Daedalus,Threepenny Review
"AARDEMA, Verna",0.0,1.177646,0.0,1.422715,0.0,0.0,0.0,0.0,3.271238,0.263945,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Chester",0.0,1.073555,0.0,1.343786,0.0,0.0,0.0,0.0,2.99613,0.858347,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Daniel",0.0,0.768003,0.0,0.448975,0.0,0.0,0.0,0.0,0.0,0.434184,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Henry J",0.0,0.0,0.0,0.48838,0.0,0.0,0.0,0.0,0.0,0.79219,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AASENG, Nathan",0.0,0.0,0.0,1.25809,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
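
To make the PMI branch of the weighting concrete, here is a minimal sketch of the same arithmetic on a made-up 2x3 count matrix. The counts and the `books_per_author` array are hypothetical stand-ins for `df` and `author_total_books`. `np.log1p(x)` computes `log(1 + x)`, which is the same "add 1, then take the log" step as in the cell above.

```python
import numpy as np

# Hypothetical review counts: 2 authors (rows) x 3 journals (columns)
counts = np.array([
    [8, 2, 0],
    [1, 1, 8],
], dtype=float)

# Hypothetical number of books per author (plays the role of author_total_books)
books_per_author = np.array([4, 6], dtype=float)

p_joint = counts / counts.sum()                       # P(author, journal)
p_journal = counts.sum(axis=0) / counts.sum()         # P(journal)
p_author = books_per_author / books_per_author.sum()  # P(author)
p_independent = np.outer(p_author, p_journal)         # P(author) * P(journal)

# log(1 + P(a, j) / (P(a) * P(j))): zero counts stay at exactly 0
weighted = np.log1p(p_joint / p_independent)
print(np.round(weighted, 3))
```

Cells where an author is reviewed more often than independence would predict get the largest weights, while cells with no reviews stay at zero, which keeps the matrix nonnegative as NMF requires.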


NMF requires setting a value K for the number of latent features to be discovered. As with many dimensionality reduction techniques, setting this value is sometimes more art than science. For this project, I used the cross-validation methods detailed in crossval_nmf.ipynb. There, the optimal value of K was somewhere between 15 and 25. Note that the model in the next cell is nonetheless fit with k = 50.
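
The contents of crossval_nmf.ipynb are not reproduced here, so the following is only a sketch of one common way to cross-validate K, not necessarily the method used there: hide a random fraction of matrix entries, fit a masked NMF on the rest, and measure the error on the hidden entries. The helper names, the synthetic data, and the candidate values of K are all assumptions.

```python
import numpy as np

def masked_nmf(X, mask, k, n_iter=300, seed=0, eps=1e-9):
    """Fit W @ H to X using only entries where mask == 1
    (weighted multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    MX = mask * X
    for _ in range(n_iter):
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + eps)
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H

def heldout_error(X, k, holdout_frac=0.1, seed=0):
    """RMSE on randomly held-out entries after fitting on the rest."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(X.shape) >= holdout_frac).astype(float)  # 1 = training entry
    W, H = masked_nmf(X, mask, k, seed=seed)
    held_out = mask == 0
    return float(np.sqrt(np.mean((X - W @ H)[held_out] ** 2)))

# Synthetic nonnegative data generated from 3 underlying components
rng = np.random.default_rng(42)
X = rng.random((60, 3)) @ rng.random((3, 40))

errors = {k: heldout_error(X, k) for k in (1, 2, 3, 5, 8)}
for k, err in errors.items():
    print(k, round(err, 4))
```

With this kind of curve, a reasonable K is one where the held-out error stops improving. On real data the curve is usually much flatter than on synthetic data, which is part of why the choice remains a judgment call.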

In [27]:
k = 50
model = NMF(n_components=k, init='nndsvd', random_state=99, max_iter=1000)
W = model.fit_transform(weighted)
H = model.components_
err = model.reconstruction_err_
print(f'Shape of W: {W.shape}')
print(f'Shape of H: {H.shape}')
print(f'Reconstruction error: {err}')

Shape of W: (9043, 50)
Shape of H: (50, 352)
Reconstruction error: 760.1021982710823


The resulting model gives us two matrices W and H. W contains one row for each author in the original data, with values for each of the K hidden components that we uncovered. H contains one row for each hidden component, with values for each of the 352 journals that went into generating those components.

This lets us do a couple of useful things.

1) We can sort each hidden component in H by its feature values to see which journals contribute most to that component.
2) We can sort the authors by their values for a given component in W to see which authors are most prominent within that component.

In [28]:
num_journals = 10
num_authors = 10

for ix in range(k):
    print(f'COMPONENT {ix}:\n')
    print('Top journals:')
    for journal_ind in H[ix].argsort()[::-1][:num_journals]:
        print(weighted.columns[journal_ind])
    print()
    print('Prominent authors:')
    for author_ind in W[:, ix].argsort()[::-1][:num_authors]:
        print(weighted.iloc[author_ind].name)
    print()

COMPONENT 0:

Top journals:
Time
National Observer
Newsweek
New York Times (Daily)
Life
Saturday Review
Christian Science Monitor
Book World
Esquire
Saturday Review/World

Prominent authors:
STEINER, Jean-Francois
DECTER, Midge
MILFORD, Nancy
CRICHTON, Robert
WEESNER, Theodore
LUKAS, J Anthony
BUNTING, Josiah III
SHETZUNE, David
TEICHMANN, Howard
STEAD, Christine.

COMPONENT 1:

Top journals:
Childhood Education
Center for Children's Books, Bulletin
Language Arts
Reading Teacher
Horn Book Magazine
Catholic Library World
Teachers College Record
Instructor
Children's Book Review Service
School Library Journal

Prominent authors:
AARDEMA, Verna
CREWS, Donald
ANNO, Mitsumasa
LOBEL, Arnold
TURKLE, Brinton
REISS, Johanna
KUSKIN, Karia
ISADORA, Rachel
KEATS, Ezra Jack
BABBITT, Natalie

COMPONENT 2:

Top journals:
American Political Science Review
Journal of Politics
Political Science Quarterly
American Academy of Political and Social Science, Annals
Perspective
Current History
Public Administ

This method reproduced many of the journal clusters that we originally uncovered in Notebook #2. We have clusters for children's literature, science fiction, mainstream publishing, poetry, science education, political science, British journals, Christian publishing, Canadian journals, and more. 

Let's see if this reduced-dimensionality space can give us any insight into the author-similarity comparisons we made last time.

In [32]:
W_df = pd.DataFrame(W, index=df.index)
def author_query(author: str, num_journals: int = 5, num_authors: int = 5):
    """Print an author's top component scores and their nearest neighbors in component space."""

    print(author)
    print('Top Component Scores:')
    print(W_df.loc[author].sort_values(ascending=False)[:num_journals])
    print()

    author_vector = W_df.loc[author]
    # cosine *distance* (1 - cosine similarity): smaller values mean more similar authors
    distances = W_df.drop(author).apply(lambda x: distance.cosine(x, author_vector), axis=1)

    print('Most Similar Authors:')
    print(distances.sort_values()[:num_authors])
    print()
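
The numbers printed under "Most Similar Authors" are cosine distances, so lower means closer. A minimal numpy sketch of the same quantity that `scipy.spatial.distance.cosine` computes, using made-up component scores for three hypothetical authors:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical component scores (one value per component)
a = [0.30, 0.02, 0.00]
b = [0.15, 0.01, 0.00]  # same mixture as a, at half the magnitude
c = [0.00, 0.05, 0.40]  # loads on a different component

d_ab = cosine_distance(a, b)
d_ac = cosine_distance(a, c)
print(round(d_ab, 4), round(d_ac, 4))
```

Because cosine distance ignores vector length, an author with few reviews can still match an author with many, as long as their component mixtures point in the same direction.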

In [31]:
query_authors = [

    'LE GUIN, Ursula K',
    'MORRISON, Toni',
    'MERTON, Thomas', # Christian monk
    'KENNEDY, Eugene', # Catholic priest
    'UPDIKE, John',
    'ZINN, Howard', 
    'BENNETT, Lerone, Jr.', # social historian of race
    'CAUSLEY, Charles', # British children's poet, known for blurring lines between lit for kids/adults
    'SENDAK, Maurice',
    'RICE, Anne',
    'TYLER, Anne'
    
]

for author in query_authors:
    author_query(author)

LE GUIN, Ursula K
Top Journal Scores:
17    0.320008
28    0.224257
35    0.183802
14    0.180390
30    0.144695
Name: LE GUIN, Ursula K, dtype: float64

Most Similar Authors:
author_name
TEVIS, Walter         0.175220
MC CAFFREY, Anne      0.185508
WILDER, Cherry        0.228243
MC INTYRE, Vonda N    0.240695
VINGE, Joan D         0.241176
dtype: float64

MORRISON, Toni
Top Journal Scores:
35    0.482209
38    0.364533
36    0.332221
45    0.328923
32    0.327046
Name: MORRISON, Toni, dtype: float64

Most Similar Authors:
author_name
LURE, Alison          0.137649
BEATTIE, Ann          0.147989
DRABBLE, Margaret     0.160569
SPARK, Muriel         0.162081
JANEWAY, Elizabeth    0.175294
dtype: float64

MERTON, Thomas
Top Journal Scores:
7     0.323174
23    0.195930
42    0.177712
13    0.169534
33    0.073354
Name: MERTON, Thomas, dtype: float64

Most Similar Authors:
author_name
TAVARD, George H    0.126822
VAWTER, Bruce       0.130290
BEA, Augustin       0.132659
COOKE, Bernard J   

Each author's resulting "bag" of components can be quite informative. Isaac Asimov, for instance, is mostly "science fiction," but he is also made up of "children's literature" and "science writing." This decomposition gives us a succinct summary of the different journal clusters that reviewed his work. 

Note, however, that in some cases the author-similarity comparisons got worse, especially if the author in question was reviewed in journals that don't contribute very much to the primary components. This dimensionality-reduction method inevitably involves a loss of information.

With this information, we can find authors located at the "edges" of different components: those whose component mixture is notably split between two components in particular. 

I'll define an "edgy" author as someone with a small distance between their highest and second-highest valued components. 

In [64]:
num_authors = 100
def top_diff(x):
    top2 = x.sort_values(ascending=False)[:2]
    return top2.iloc[0] - top2.iloc[1]

edgy_authors = W_df.apply(top_diff, axis=1).sort_values()[:num_authors]
for author in edgy_authors.index:

    print(author)
    print('Top Components:')
    top_comp = W_df.loc[author].sort_values(ascending=False)[:2]
    print(top_comp)
    print()
    
    print('Component Journals:')
    for ix in top_comp.index:
        print(f'Component {ix}')
        for journal_ind in H[ix].argsort()[::-1][:7]:
            print(weighted.columns[journal_ind])
        print()
    print()

BUKOVSKY, Vladimir
Top Components:
18    0.305850
10    0.305847
Name: BUKOVSKY, Vladimir, dtype: float64

Component Journals:
Component 18
National Review
Wall Street Journal
Modern Age
Christian Science Monitor
Esquire
Economist
America

Component 10
Business Week
Washington Monthly
Human Events
American Spectator
Christian Science Monitor
Policy Review
Guardian Weekly


WEIL, Andrew
Top Components:
0    0.071990
8    0.071976
Name: WEIL, Andrew, dtype: float64

Component Journals:
Component 0
Time
National Observer
Newsweek
New York Times (Daily)
Life
Saturday Review
Christian Science Monitor

Component 8
American Reference Books Annual
Reference Services Review
Wilson Library Bulletin
RQ
College and Research Libraries
Booklist
Journal of Academic Librarianship


CADWALLADER, Sharon
Top Components:
27    0.087239
22    0.087224
Name: CADWALLADER, Sharon, dtype: float64

Component Journals:
Component 27
Virginia Quarterly Review
Sewanee Review
Southern Living
South Atlantic Quarterly

One shortcoming of this method is that it is naturally biased towards authors whose component values are low overall: the gap between the top two components can never exceed the top value itself, so two small values always produce a small difference. Nevertheless, it turns up some interesting cases:

--> Stephen Spender, a radical poet, at the boundary of literature and politics (Dissent/Commentary)
--> Gene DeWeese, SF author: SF mags and film/pop culture
--> Freeman Dyson, scientist also reviewed by political mags for a book he wrote about nuclear weapons
--> Basil Bunting, poetry and English lit
--> Spider Robinson, a Canadian-American SF author, at the boundary of Canadian mags and SF mags
--> Barbara Ward Jackson, economist and Christian thinker
--> Edward Lear, illustrator and poet, at the border of children's lit and art criticism
--> Bill C. Malone, a historian of music at the boundary of (you guessed it) music and history
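
One possible way to soften the bias towards low-valued authors noted above is to divide the gap by the top component value, so that it is measured relative to how strongly the author loads on any component at all. This was not done in the analysis above. It is only a sketch on a made-up matrix, and the helper name `top2_gaps` is hypothetical.

```python
import numpy as np

def top2_gaps(W):
    """Return (absolute gap, relative gap) between each row's two largest values."""
    top2 = np.sort(W, axis=1)[:, -2:]  # columns: second-largest, largest
    second, first = top2[:, 0], top2[:, 1]
    abs_gap = first - second
    # rows whose top value is 0 get a relative gap of 1 (treated as not "edgy")
    rel_gap = np.divide(abs_gap, first, out=np.ones_like(abs_gap), where=first > 0)
    return abs_gap, rel_gap

# Hypothetical component scores for three authors
W_toy = np.array([
    [0.300, 0.290, 0.01],  # strong in two components
    [0.020, 0.015, 0.00],  # weak everywhere
    [0.500, 0.050, 0.02],  # dominated by one component
])

abs_gap, rel_gap = top2_gaps(W_toy)
order_abs = np.argsort(abs_gap)  # ranking by raw difference
order_rel = np.argsort(rel_gap)  # ranking by relative difference
print(order_abs, order_rel)
```

In this toy case the raw difference ranks the weak-everywhere author first, while the relative gap ranks the author who is genuinely split between two strong components first.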

