In this notebook, we are going to consider the clustering of *authors* rather than journals (which is simply the inverse of the previous question). With over a hundred thousand authors (to say nothing of the full 35 years of data), hierarchical clustering is no longer an option. Instead, we're going to consider author similarity with two methods: 

1) a "cluster" analysis via NMF (nonnegative matrix factorization)
2) a 2-D t-SNE projection of the feature space

In [1]:
import pandas as pd
import numpy as np
import math
import time

# for t-SNE
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# for clustering
from sklearn.decomposition import NMF

# for plotting, we will use Bokeh for the excellent interactive options
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper, Legend
from bokeh.transform import factor_cmap

As in our EDA section, we need to group individual titles by author. On my machine, this cell takes 20-30 seconds.

In [2]:
df = pd.read_csv('../data/processed/book_reviews.tsv', sep='\t', index_col=0)
df['author_name'] = df.index.to_series().str.split('\\|\\|').str[1].str.strip()
author_total_books = df['author_name'].value_counts()
df = df.groupby('author_name').sum()
df = df[df.index.notnull()]
df = df.drop('#NAME?')
df.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Journal of Marketing,Harvard Law Review,Journal of Business Education,Journal of Home Economics,...,Black Warrior Review,Computers and the Humanities,American Arts,Essays on Canadian Writing`,Performing Arts Review,"Journal of Arts Management, Law, and Society","Studio International, Review",Journal of Black Studies,Lone Star Review,Aspen Journal of the Arts
,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AABERG, Jean",2,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AADLAND, Florence",0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AAFJES, Bertus",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AAGAARD, Orlena",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Note that we are representing our author-review relationships in much the same way as one would represent customer-item interactions in a (very simple) recommender system. As in that application, we're going to restrict the data to authors who have received a minimum number of reviews. I've selected 20 because this yields a manageable number of authors for visualization, though that threshold could be adjusted.

Additionally, I've dropped all journals with fewer than 25 reviews.

In [3]:
auth_min = 20
journal_min = 25
df = df[df.sum(axis=1) >= auth_min]
df = df[df.columns[df.sum() >= journal_min]]
df.shape

(9043, 352)
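One subtlety of filtering in two passes: dropping journals after filtering authors can push some authors back below the author threshold. Whether that matters for this dataset is untested here. The sketch below uses a hypothetical helper, `filter_until_stable`, and a toy count matrix (both are illustrative assumptions, not part of the original analysis) to show how the two filters could be repeated until neither removes anything.

```python
import pandas as pd

def filter_until_stable(counts, row_min, col_min):
    """Hypothetical helper: repeat the row and column filters until both hold."""
    while True:
        before = counts.shape
        # keep rows (authors) with enough total reviews
        counts = counts[counts.sum(axis=1) >= row_min]
        # keep columns (journals) with enough total reviews
        counts = counts.loc[:, counts.sum(axis=0) >= col_min]
        if counts.shape == before:
            return counts

# Toy data: three authors (rows) and three journals (columns).
toy = pd.DataFrame(
    {"J1": [3, 2, 0], "J2": [0, 1, 1], "J3": [2, 0, 0]},
    index=["A", "B", "C"],
)

filtered = filter_until_stable(toy, row_min=3, col_min=3)
print(filtered)
```

In the toy data, author B passes the first author filter with 3 reviews but falls to 2 once journal J2 is dropped, so a single pass would leave B in the matrix below the threshold.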

When we clustered journals, we needed to account for the disproportionate number of reviews published by the biggest journals. We face a similar problem here. Ideally, we would like to find authors with a high degree of review overlap: authors that tend to be reviewed in the same journals. The problem is that some journals review so many authors that many different authors will seem "similar" almost by coincidence. 

As such, we'll weight counts more heavily if they come from journals that review relatively few books. Since this is conceptually similar to using NMF for topic modeling, I've experimented with two weighting schemes sometimes used for that purpose: term frequency-inverse document frequency (TF-IDF) and pointwise mutual information (PMI).
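The PMI option is left as a to-do in this notebook. As a rough illustration only, here is a minimal sketch of one common variant, positive PMI (PPMI), which clips negative scores to zero and so stays compatible with NMF's nonnegativity requirement. The function name `ppmi` and the toy counts are assumptions made for this example; this is not the weighting used to produce the results here.

```python
import numpy as np
import pandas as pd

def ppmi(counts: pd.DataFrame) -> pd.DataFrame:
    """Positive PMI: max(0, log(P(author, journal) / (P(author) * P(journal))))."""
    values = counts.to_numpy(dtype=float)
    total = values.sum()
    p_joint = values / total
    p_author = values.sum(axis=1) / total    # row marginals
    p_journal = values.sum(axis=0) / total   # column marginals
    expected = np.outer(p_author, p_journal)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / expected)
    pmi[~np.isfinite(pmi)] = 0.0             # zero counts give -inf or nan
    return pd.DataFrame(np.maximum(pmi, 0.0), index=counts.index, columns=counts.columns)

# Toy data: "Big" reviews every author, "Niche" reviews only author C.
toy = pd.DataFrame({"Big": [5, 5, 5], "Niche": [0, 0, 5]}, index=["A", "B", "C"])
scores = ppmi(toy)
print(scores)
```

In the toy data, author C's reviews in the niche journal score higher than the same number of reviews in the journal that covers everyone, which is the behavior the weighting is meant to produce.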

In [4]:
weighting_scheme = 'TFIDF'

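if weighting_scheme == 'TFIDF':
    # IDF per journal: log(number of authors / number of authors the journal reviewed)
    n_authors = df.shape[0]
    idfs = [math.log(n_authors / (df[col] != 0).sum()) for col in df.columns]
    weighted = df * idfs
elif weighting_scheme == 'PMI': # TO DO
    # fail loudly instead of leaving `weighted` undefined
    raise NotImplementedError('PMI weighting is not implemented yet')
weighted.head()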

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Harvard Law Review,Journal of Home Economics,Social Education,Library Journal,...,Journal of Negro Education,Foreign Affairs,Thought,Political Science Reviewer,Mankind,Black Scholar,Social Research,Religious Studies,Daedalus,Threepenny Review
"AARDEMA, Verna",0.0,1.588567,0.0,1.766097,0.0,0.0,0.0,0.0,5.524714,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Chester",0.0,0.907753,0.0,1.059658,0.0,0.0,0.0,0.0,2.762357,0.409565,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Daniel",0.0,0.453876,0.0,0.17661,0.0,0.0,0.0,0.0,0.0,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Henry J",0.0,0.0,0.0,0.353219,0.0,0.0,0.0,0.0,0.0,0.546086,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AASENG, Nathan",0.0,0.0,0.0,7.064389,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With these values, we can see an author's review profile with each journal's counts down-weighted according to how many authors that journal reviews. This mitigates the problem of everybody's "top" journal being Booklist, Kirkus, or Publishers Weekly.

Additionally, we can see which authors are most similar to a chosen query author.

A few authors are listed below along with what might be called their "review profile," or the journals that characterize their reception, as well as the other authors who are most similar to them.

In [24]:
from scipy.spatial import distance

def author_query(author: str, num_journals: int = 5, num_authors: int = 5):
    """Print an author's top-weighted journals and their nearest neighbors."""
    print(author)
    print('Top Journal Scores:')
    print(weighted.loc[author].sort_values(ascending=False)[:num_journals])
    print()

    author_vector = weighted.loc[author]
    # distance.cosine returns 1 - cosine similarity, so smaller means more similar
    distances = weighted.drop(author).apply(lambda x: distance.cosine(x, author_vector), axis=1)

    print('Most Similar Authors:')
    print(distances.sort_values()[:num_authors])
    print()
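A note on reading the numbers printed by `author_query`: SciPy's `distance.cosine` returns a cosine *distance* (1 minus the cosine similarity), so the smallest values are the closest matches. The toy vectors below are made up purely to illustrate the two extremes.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the magnitude
c = np.array([0.0, 0.0, 3.0])  # shares no nonzero journals with a

d_ab = distance.cosine(a, b)   # approximately 0: identical review profile shape
d_ac = distance.cosine(a, c)   # approximately 1: no overlap at all
print(d_ab, d_ac)
```

Because the measure ignores vector length, a heavily reviewed author and a lightly reviewed one can still come out as near neighbors if their reviews are spread across the same journals in similar proportions.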

In [22]:
query_authors = [

    'LE GUIN, Ursula K',
    'MORRISON, Toni',
    'MERTON, Thomas', # Christian monk
    'KENNEDY, Eugene', # Catholic priest
    'UPDIKE, John',
    'ZINN, Howard', 
    'BENNETT, Lerone, Jr.', # social historian of race
    'CAUSLEY, Charles', # British children's poet, known for blurring lines between lit for kids/adults
    'SENDAK, Maurice',
    'RICE, Anne',
    'TYLER, Anne'
    
]

for author in query_authors:
    author_query(author)

LE GUIN, Ursula K
Top Journal Scores:
Magazine of Fantasy and Science Fiction    23.271826
English Journal                            17.578661
Analog Science Fiction and Fact            16.369866
Voice of Youth Advocates                   13.572423
Emergency Librarian                        11.886755
Name: LE GUIN, Ursula K, dtype: float64

Most Similar Authors:
author_name
ANDERSON, Poul       0.326544
NIVEN, Larry         0.334455
DICKSON, Gordon R    0.335798
ELLISON, Harlan      0.348994
ZELAZNY, Roger       0.349243
dtype: float64

MORRISON, Toni
Top Journal Scores:
Black Scholar          33.500314
Critique               19.560954
Ms.                    13.111455
American Literature     9.842402
Black World             9.611362
Name: MORRISON, Toni, dtype: float64

Most Similar Authors:
author_name
DUMAS, Henry              0.318922
BAMBARA, Toni Cade        0.389751
WALKER, Alice             0.443341
KELLEY, William Melvin    0.460104
REED, Ishmael             0.480544
dtype: float64

Feel free to try your own author queries here. Anything other than an exact match of an author's name will raise a KeyError. Note that author names are formatted "SURNAME, Given", with the surname in all capitals.

You can adjust the number of results to display with the parameters "num_authors" and "num_journals."
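Since a query needs an exact name, a small lookup step can help find the right spelling first. The sketch below uses the standard library's `difflib.get_close_matches`; the helper name `suggest_authors` and the short name list are hypothetical, and in the notebook one would presumably pass `weighted.index` as the list of known names.

```python
import difflib

def suggest_authors(query, known_names, n=3, cutoff=0.6):
    """Hypothetical helper: return up to n known names that closely match the query."""
    # Compare case-insensitively, then map back to the original spelling.
    lowered = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]

names = ["PYNCHON, Thomas", "BARTH, John", "MORRISON, Toni"]
suggestions = suggest_authors("Pynchon, Tom", names)
print(suggestions)
```

The best suggestion can then be passed to `author_query` unchanged.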

In [25]:
query = 'PYNCHON, Thomas'
author_query(query, num_journals=5, num_authors=5)

PYNCHON, Thomas
Top Journal Scores:
Harper's Magazine         7.366781
Critique                  4.890239
America                   4.324958
Modern Fiction Studies    3.702574
Nation                    3.664997
Name: PYNCHON, Thomas, dtype: float64

Most Similar Authors:
author_name
BARTH, John      0.260174
MURDOCH, Iris    0.310726
VIDAL, Gore      0.351542
PALEY, Grace     0.370701
ADLER, Renata    0.370921
dtype: float64



The natural next step is to visualize the entire space, so that authors with high TF-IDF scores in the same journals appear close together.

Before doing that, I'm going to create a simple dictionary that associates each author with their top 5 journals. These will appear in a tooltip when mousing over an author's point in the visualization. This is useful because I have no idea who most of these people are; having their top journals makes them easier to Google if I encounter them while browsing.

In [27]:
author_dict = {}
for author in weighted.index:
    # names of the five journals with the highest weighted scores
    top_5 = weighted.loc[author].sort_values(ascending=False)[:5]
    author_dict[author] = top_5.index

t-SNE is usually run on a lower-dimensional version of the data, so we'll reduce the dimensionality first. We'll use truncated SVD rather than PCA. PCA requires centering each feature first, and since the data is sparse (lots of 0 values), centering would turn most of those zeros into nonzero values; truncated SVD works on the uncentered matrix directly.
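To make the sparsity point concrete, the sketch below builds a small random count matrix (synthetic data, not the review matrix) and compares the share of nonzero entries before and after subtracting column means, which is the centering step PCA performs. It also shows that `TruncatedSVD` accepts a SciPy sparse matrix as-is.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse counts: 200 "authors" by 50 "journals", mostly zeros.
rng = np.random.default_rng(0)
dense = rng.poisson(0.05, size=(200, 50)).astype(float)
X = sparse.csr_matrix(dense)

# Share of nonzero entries before and after column-centering.
density_before = X.nnz / (X.shape[0] * X.shape[1])
centered = dense - dense.mean(axis=0)
density_after = np.count_nonzero(centered) / centered.size
print(f"nonzero share before: {density_before:.3f}, after centering: {density_after:.3f}")

# TruncatedSVD does not center, so it can consume the sparse matrix directly.
svd_demo = TruncatedSVD(n_components=5, random_state=0)
reduced = svd_demo.fit_transform(X)
print(reduced.shape)
```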

In this cell, we'll run an SVD on the data with different numbers of components to see if there are any "break points" beyond which we get diminishing returns in terms of variance explained.

In [28]:
for n in range(5, 100, 5):
    svd = TruncatedSVD(n_components=n)
    comps = svd.fit_transform(weighted)
    exp_var = sum(svd.explained_variance_ratio_)
    print(f'Components: {n}. Explained variance: {exp_var}')

Components: 5. Explained variance: 0.28461617027673697
Components: 10. Explained variance: 0.3816958210961464
Components: 15. Explained variance: 0.4454190593961824
Components: 20. Explained variance: 0.48946401906026604
Components: 25. Explained variance: 0.5251304957607378
Components: 30. Explained variance: 0.5544630838358661
Components: 35. Explained variance: 0.5807842918094387
Components: 40. Explained variance: 0.6041880145303471
Components: 45. Explained variance: 0.6257229156027264
Components: 50. Explained variance: 0.6447639161961387
Components: 55. Explained variance: 0.6625591970484203
Components: 60. Explained variance: 0.6789946960469747
Components: 65. Explained variance: 0.6944947669502707
Components: 70. Explained variance: 0.7089195688987124
Components: 75. Explained variance: 0.7224609522948877
Components: 80. Explained variance: 0.7354063447560552
Components: 85. Explained variance: 0.7473828211123055
Components: 90. Explained variance: 0.7586741946395863
Component

I'll use 60, since that already accounts for about 2/3 of the variance.
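The same kind of table can be approximated from a single fit: fit once with the largest number of components and take a running sum of `explained_variance_ratio_`. The sketch below does this on synthetic data (an assumption for the sake of a runnable example); with the default randomized solver, the figures may differ slightly from those of separate fits, as the two runs at 60 components in this notebook already suggest.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the weighted author-by-journal matrix.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(300, 120)).astype(float)

# Fit once at the largest size of interest, then accumulate the ratios.
svd_full = TruncatedSVD(n_components=95, random_state=0)
svd_full.fit(X)
cumulative = np.cumsum(svd_full.explained_variance_ratio_)

for n in range(5, 100, 5):
    print(f"Components: {n}. Explained variance: {cumulative[n - 1]:.4f}")

# Smallest number of components reaching a target share, if any does.
target = 2 / 3
reached = cumulative >= target
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print(n_needed)
```

Fixing `random_state` also makes the explained-variance figures repeatable from run to run.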

In [30]:
n=60
svd = TruncatedSVD(n_components=n)
comps = svd.fit_transform(weighted)
exp_var = sum(svd.explained_variance_ratio_)
print(f'Components: {n}. Explained variance: {exp_var}')

Components: 60. Explained variance: 0.6791182535862029


Now we fit our t-SNE projection on this transformed version of the data.

In [33]:
time_start = time.time()
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300, random_state=11, init='pca')
# fit on the SVD-reduced components rather than the full weighted matrix
tsne_svd_results = tsne.fit_transform(comps)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))



t-SNE done! Time elapsed: 9.077425956726074 seconds


Finally, we'll make a scatterplot with a mouseover that gives us the name of the author and their top-scoring journals. Authors will tend to be grouped near other authors reviewed by the same venues.

For reasons I don't fully understand, the default scale of the resulting plot is a little wonky, skewed by a handful of outliers. Luckily, Bokeh has great functionality for zooming in and out that gets around that problem somewhat. But I'll have to revisit that issue when compiling a publishable visualization.
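One possible workaround for the scale problem, offered as a sketch rather than a tested fix for this plot, is to set the initial axis ranges from percentiles of the coordinates so that a few extreme points do not dictate the default view. The helper name `robust_range` and the synthetic coordinates are assumptions; the outliers remain in the data and can still be reached by zooming out.

```python
import numpy as np

def robust_range(values, lower=1.0, upper=99.0, pad=0.05):
    """Hypothetical helper: axis bounds from percentiles, with a small margin."""
    lo, hi = np.percentile(values, [lower, upper])
    margin = (hi - lo) * pad
    return float(lo - margin), float(hi + margin)

# Synthetic coordinates with a few extreme outliers.
rng = np.random.default_rng(0)
coords = rng.normal(size=1000)
coords[:3] = [250.0, -300.0, 400.0]

x_lo, x_hi = robust_range(coords)
print(x_lo, x_hi)

# The bounds could then be handed to Bokeh when creating the figure, e.g.
# figure(x_range=(x_lo, x_hi), y_range=(y_lo, y_hi), ...)
```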


In [34]:
source = ColumnDataSource(data=dict(
    x=tsne_svd_results[:,0],
    y=tsne_svd_results[:,1],
    author=weighted.index,
    top_scores = [author_dict[author] for author in weighted.index],
    #label=author_cluster_list,
    #colors=[color_map[c] for c in author_cluster_list]
    
))
TOOLTIPS = [
    ("(x,y)", "($x, $y)"),
    ("author", "@author"),
    ("top scores", "@top_scores"),
]

p = figure(plot_width=1000, plot_height=800, tooltips=TOOLTIPS, toolbar_location='above',
           title=f"t-SNE Projection of {len(weighted.index)} Authors in Book Review Space")
p.scatter('x', 
          'y',
          size=7,
          source=source,
          fill_alpha=1,
          #fill_color='colors'
)

output_file("../images/tsne_interactive.html", title=f"t-SNE Projection of {len(weighted.index)} Authors in Book Review Space")

show(p)



Take some time to look it over. You will note that the clusters have a high degree of intuitive structure. Just browsing, I even found a cluster of 19th-century American authors: Twain, Fenimore Cooper, Melville, etc.

In the next notebook, we will extract the latent components underlying the clustering of these authors.