In this notebook, we are going to consider the clustering of *authors* rather than journals (which is simply the inverse of the previous question). With over a hundred thousand authors (to say nothing of the full 35 years of data), hierarchical clustering is no longer an option. Instead, we're going to consider author similarity with two methods: 

1) a 2-D t-SNE projection of the feature space
2) cluster analysis with K-Means and DBSCAN

In [1]:
import pandas as pd
import numpy as np
import math
import time

# for t-SNE
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# for clustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

# for plotting, we will use Bokeh for the excellent interactive options
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper, Legend
from bokeh.transform import factor_cmap

As in our EDA section, we need to group individual titles by author. On my machine, this cell takes 20-30 seconds.

In [5]:
df = pd.read_csv('../data/processed/book_reviews.tsv', sep='\t', index_col=0)
df['author_name'] = df.index.to_series().str.split('\\|\\|').str[1].str.strip()
author_total_books = df['author_name'].value_counts()
df = df.groupby('author_name').sum()
df = df[df.index.notnull()]
df = df.drop('#NAME?')
df.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Journal of Marketing,Harvard Law Review,Journal of Business Education,Journal of Home Economics,...,Black Warrior Review,Computers and the Humanities,American Arts,Essays on Canadian Writing`,Performing Arts Review,"Journal of Arts Management, Law, and Society","Studio International, Review",Journal of Black Studies,Lone Star Review,Aspen Journal of the Arts
,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AABERG, Jean",2,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AADLAND, Florence",0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AAFJES, Bertus",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AAGAARD, Orlena",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Note that we are representing our author-review relationships in much the same way as one would represent customer-item interactions in a (very simple) recommender system. As in that application, we're going to restrict the data to authors who have received a minimum number of reviews. I've selected 20 because this yields a manageable number of authors for visualization, but that threshold could be tinkered with.

Additionally, I've dropped all journals with fewer than 25 reviews.

In [6]:
auth_min = 20
journal_min = 25
df = df[df.sum(axis=1) >= auth_min]
df = df[df.columns[df.sum() >= journal_min]]
df.shape

(9043, 352)
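To make the order of the two filters concrete, here is a minimal sketch on a toy table. The frame `toy`, its values, and the thresholds `toy_auth_min` and `toy_journal_min` are all hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy author-by-journal review counts (hypothetical values, for illustration only)
toy = pd.DataFrame(
    {"Journal A": [3, 0, 1], "Journal B": [0, 1, 0], "Journal C": [2, 0, 4]},
    index=["AUTHOR, One", "AUTHOR, Two", "AUTHOR, Three"],
)

toy_auth_min = 3     # hypothetical threshold, playing the role of auth_min
toy_journal_min = 2  # hypothetical threshold, playing the role of journal_min

# Step 1: keep authors (rows) with enough total reviews
toy_filtered = toy[toy.sum(axis=1) >= toy_auth_min]

# Step 2: keep journals (columns) with enough reviews among the remaining authors
toy_filtered = toy_filtered[toy_filtered.columns[toy_filtered.sum() >= toy_journal_min]]

print(toy_filtered)
```

Because the journal filter runs second, journal totals only count reviews of authors who survived the first filter. In principle, dropping journals could also push an author's total back under the author threshold, since the filters are not re-applied.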

When we clustered journals, we needed to account for the disproportionate number of reviews published by the biggest journals. We face a similar problem here. Ideally, we would like to find authors with a high degree of review overlap: authors that tend to be reviewed in the same journals. The problem is that some journals review so many authors that many different authors will seem "similar" almost by coincidence. 

As such, it is useful to introduce a weighting scheme, since some magazines publish far more reviews than others.

In [14]:
# I'll use TF-IDF, where the IDF term is log(total authors / authors reviewed by this magazine)
docs = df.shape[0]
idfs = [math.log(docs / np.where(df[col] == 0, 0, 1).sum()) for col in df.columns]
tfidf = df * idfs
tfidf.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Harvard Law Review,Journal of Home Economics,Social Education,Library Journal,...,Journal of Negro Education,Foreign Affairs,Thought,Political Science Reviewer,Mankind,Black Scholar,Social Research,Religious Studies,Daedalus,Threepenny Review
"AARDEMA, Verna",0.0,1.588567,0.0,1.766097,0.0,0.0,0.0,0.0,5.524714,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Chester",0.0,0.907753,0.0,1.059658,0.0,0.0,0.0,0.0,2.762357,0.409565,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Daniel",0.0,0.453876,0.0,0.17661,0.0,0.0,0.0,0.0,0.0,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Henry J",0.0,0.0,0.0,0.353219,0.0,0.0,0.0,0.0,0.0,0.546086,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AASENG, Nathan",0.0,0.0,0.0,7.064389,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

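A small worked example may help show what this weighting does. It is a sketch on made-up counts, and the names `toy`, `toy_idf`, and `toy_tfidf` are hypothetical. A journal that reviews every author gets an IDF of log(1) = 0, while a journal that reviews only one of four authors gets log(4).

```python
import numpy as np
import pandas as pd

# Hypothetical review counts: one journal reviews everybody, one reviews a single author
toy = pd.DataFrame(
    {"Big Journal": [5, 2, 1, 3], "Niche Journal": [0, 0, 4, 0]},
    index=["A", "B", "C", "D"],
)

n_authors = toy.shape[0]

# Number of authors each journal has reviewed at least once
authors_per_journal = (toy > 0).sum()

# IDF per journal: log(total authors / authors reviewed by this journal)
toy_idf = np.log(n_authors / authors_per_journal)

# Multiplying a DataFrame by a Series aligns the Series index with the columns
toy_tfidf = toy * toy_idf

print(toy_tfidf)
```

Here every entry in "Big Journal" is zeroed out, and author C's four reviews in "Niche Journal" are scaled up by log(4). That is the intended effect: overlap in ubiquitous journals stops counting as similarity.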

In [97]:
# With these values, we can see an author's review profile when each journal is down-weighted
# by the number of authors it reviews

# This mitigates the problem of everybody's "top" journal being Booklist, Kirkus, or Publishers Weekly

# here, for example, is Ursula Le Guin
tfidf.loc['LE GUIN, Ursula K'].sort_values(ascending=False)[:10]

Magazine of Fantasy and Science Fiction    22.779794
English Journal                            16.851439
Analog Science Fiction and Fact            16.250383
Voice of Youth Advocates                   13.294380
Emergency Librarian                        11.407915
Horn Book Magazine                          9.493542
Observer (London)                           9.383036
Book World                                  9.363350
New Age Journal                             8.679830
Book Report                                 7.499680
Name: LE GUIN, Ursula K, dtype: float64

In [98]:
# or Toni Morrison
tfidf.loc['MORRISON, Toni'].sort_values(ascending=False)[:10]

Black Scholar          32.883834
Critique               18.899886
Ms.                    12.951489
American Literature     9.467173
Black World             9.332262
Newsweek                8.309319
Hudson Review           7.039645
Progressive             5.291060
Nation                  5.189991
Yale Review             4.966590
Name: MORRISON, Toni, dtype: float64

In [101]:
# or Thomas Merton, a famous American Christian monk
# side note: this is how I discovered that "America" is a Jesuit journal
tfidf.loc['MERTON, Thomas'].sort_values(ascending=False)[:10]

Christian Century           31.385226
America                     26.435117
Review for Religious        22.994971
Critic                      22.335174
Religious Studies Review    18.840447
Sewanee Review              12.647159
Commonweal                  11.568903
Catholic Library World       8.416696
American Book Review         7.388541
Best Sellers                 7.002259
Name: MERTON, Thomas, dtype: float64

In [103]:
# Eugene Kennedy, a psychologist and Catholic priest
tfidf.loc['KENNEDY, Eugene'].sort_values(ascending=False)[:10]

America                    37.009164
Review for Religious       31.618086
Christian Century          26.677442
Commonweal                 11.568903
Critic                      9.926744
Best Sellers                8.169302
Catholic Library World      6.733356
Educational Leadership      4.078344
New Catholic World          3.994645
Contemporary Psychology     3.631422
Name: KENNEDY, Eugene, dtype: float64

In [106]:
# we can see the more "elite mainstream" profile of somebody like John Updike
tfidf.loc['UPDIKE, John'].sort_values(ascending=False)[:10]

America                     29.078628
Newsweek                    21.604228
Hudson Review               21.118934
New York Review of Books    20.681935
National Review             20.633266
Time                        20.401461
Atlantic Monthly            19.861204
Commonweal                  18.799467
New Republic                18.113006
Saturday Review             17.873968
Name: UPDIKE, John, dtype: float64

In [108]:
# you can even see what might be called the leftist magazine sphere
tfidf.loc['ZINN, Howard'].sort_values(ascending=False)[:10]

Science and Society            11.050553
Negro Digest                    8.521332
Social Education                7.902691
Dissent                         7.593265
Nation                          6.919988
Partisan Review                 6.448533
Progressive                     5.291060
Journal of American History     5.145009
Saturday Review                 4.468492
Commonweal                      4.338339
Name: ZINN, Howard, dtype: float64

In [110]:
# social historian Lerone Bennett has a prominent place in Black journals
tfidf.loc['BENNETT, Lerone, Jr.'].sort_values(ascending=False)[:10]

Negro Digest                   17.042663
Black World                     9.332262
Black Scholar                   5.480639
Christian Century               4.707784
Saturday Review                 3.574794
Quarterly Journal of Speech     3.283414
Social Studies                  3.106484
Top of the News                 2.853169
Journal of American History     2.572505
Critic                          2.481686
Name: BENNETT, Lerone, Jr., dtype: float64

In [112]:
# British children's poet Charles Causley has an interesting place; note that he was known for blurring the line
# between children's and adult poetry
tfidf.loc['CAUSLEY, Charles'].sort_values(ascending=False)[:10]

Junior Bookshelf                         36.521990
Growing Point                            14.189852
Times Educational Supplement              7.782162
New Statesman                             6.527859
School Librarian                          5.743987
Books & Bookmen                           5.287753
Horn Book Magazine                        4.746771
Observer (London)                         4.691518
Center for Children's Books, Bulletin     4.543850
Listener                                  3.721420
Name: CAUSLEY, Charles, dtype: float64

In [124]:
# The natural next step is to want to see a visualization of the entire space in which authors that have high TF-IDF
# scores in the same journals are grouped together

# Before doing that, I'm going to create a simple dictionary that associates each author with their top 5 journals.
# This will be included as a tooltip for that author visible when mousing over their point in the visualization.
# This is useful just because I have no idea who most of these people are. 

author_dict = {}
for author in tfidf.index:
    top_5 = tfidf.loc[author].sort_values(ascending=False)[:5]
    author_dict[author] = top_5.index
author_dict

{'AARDEMA, Verna': Index(['Center for Children's Books, Bulletin', 'School Library Journal',
        'Childhood Education', 'Language Arts', 'Catholic Library World'],
       dtype='object'),
 'AARON, Daniel': Index(['Journal of American History',
        'Journal of Library History, Philosophy, and Comparative Libarianship',
        'Science and Society', 'Journal of Southern History',
        'American Historical Review'],
       dtype='object'),
 'AARON, Henry J': Index(['Journal of Economic Literature', 'Perspective',
        'Political Science Quarterly', 'Monthly Labor Review',
        'Wall Street Review of Books'],
       dtype='object'),
 'ABBEY, Edward': Index(['Living Wilderness', 'National Parks', 'Southwest Review',
        'English Journal', 'Audubon'],
       dtype='object'),
 'ABBOTT, Carl': Index(['Western Historical Quarterly', 'Journal of American History',
        'Business History Review', 'Pacific Historical Review',
        'Wall Street Review of Books'],
       

In [153]:
# to add a little color to our plot, I'm just going to arbitrarily assign colors to journal clusters discovered from 
# the hierarchical cluster analysis


def single_true(iterable):
    i = iter(iterable)
    return any(i) and not any(i)

clusters = {
    'sf' : ['Science Fiction Review', 
            'Analog Science Fiction and Fact', 
            'Magazine of Fantasy and Science Fiction',
            'Fantasy Review'],
    'children' : ['Reading Teacher',
                 'Language Arts',
                 'School Library Journal',
                 'Horn Book Magazine'],
    'science' : ['Science', 
                 'Sky and Telescope', 
                 'Nature', 
                 'Scientific American'],
    'christian' : ['America', 
                   'Christian Century',
                   'Review for Religious',
                   'Critic'],
    'british' : ['Observer (London)',
                'New Statesman',
                'Listener',
                'Spectator',
                'Times Literary Supplement',
                'Guardian Weekly'],
    'wide coverage' : ['Kirkus Reviews',
                      'Publishers Weekly',
                      'Library Journal',
                      'Booklist',
                      'New York Times Book Review'],
    'history' : ['Reviews in American History',
                'Journal of American History',
                'American Historical Review',
                'Historian'],
    'poetry' : ['Parnassus: Poetry in Review',
               'Poetry',
               'North American Review',
               'American Poetry Review']
    
}
author_cluster_list = []
for author in tfidf.index:
    top_score = author_dict[author][0]
    found_cluster = 'Other'
    for cluster in clusters:
        if top_score in clusters[cluster]:
            found_cluster = cluster
    author_cluster_list.append(found_cluster)
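The same labelling can be done with a reverse lookup from journal to cluster, which avoids the inner loop. This is a sketch of an alternative, not the code used to produce the plot; `toy_clusters`, `journal_to_cluster`, and `toy_top_journals` are hypothetical names with abbreviated data.

```python
# Abbreviated, hypothetical version of the clusters dictionary
toy_clusters = {
    'sf': ['Science Fiction Review', 'Fantasy Review'],
    'poetry': ['Poetry', 'American Poetry Review'],
}

# Invert the mapping: each journal points to the cluster it belongs to
journal_to_cluster = {
    journal: name
    for name, journals in toy_clusters.items()
    for journal in journals
}

# Hypothetical top journals for three authors
toy_top_journals = ['Poetry', 'Fantasy Review', 'Booklist']

# Journals that belong to no cluster fall back to 'Other'
toy_labels = [journal_to_cluster.get(journal, 'Other') for journal in toy_top_journals]

print(toy_labels)
```

One difference to keep in mind: if a journal were listed under two clusters, the inverted dictionary would keep only the last one, which matches what the loop above does, since it never breaks out early.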

In [155]:
color_map = {
    'Other' : 'grey',
    'sf' : 'darkmagenta',
    'children' : 'blue',
    'science': 'limegreen',
    'wide coverage': 'olive',
    'history': 'gold',
    'british': 'crimson',
    'christian': 'saddlebrown',
    'poetry' : 'orange'
}

t-SNE is usually run on data whose dimensionality has already been reduced, so we'll do that first. We'll use truncated SVD rather than PCA: since the data is sparse (lots of 0 values), it wouldn't make sense to mean-center it, which PCA does as a first step.
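The sparsity point can be checked on synthetic data. The sketch below assumes a random Poisson count matrix as a stand-in for the review counts (the real data is not used). It shows that subtracting column means leaves almost no zeros, while `TruncatedSVD` accepts the sparse matrix as-is.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical sparse count matrix: 200 "authors" x 50 "journals", mostly zeros
dense = rng.poisson(0.05, size=(200, 50)).astype(float)
X = sparse.csr_matrix(dense)

# Fraction of zero entries before centering
sparsity_before = 1 - X.nnz / (X.shape[0] * X.shape[1])

# Mean-centering turns nearly every zero into a small negative number
centered = dense - dense.mean(axis=0)
sparsity_after = np.mean(centered == 0)

# TruncatedSVD works directly on the scipy sparse matrix, with no centering
svd_toy = TruncatedSVD(n_components=5, random_state=0)
reduced = svd_toy.fit_transform(X)

print(f'zeros before centering: {sparsity_before:.2%}')
print(f'zeros after centering:  {sparsity_after:.2%}')
print(reduced.shape)
```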

In this cell, we'll run an SVD on the data with different numbers of components to see if there are any "break points" beyond which we get diminishing returns in terms of variance explained.

In [None]:
for n in range(5, 50, 5):
    svd = TruncatedSVD(n_components=n)
    comps = svd.fit_transform(df)
    exp_var = sum(svd.explained_variance_ratio_)
    print(f'Components: {n}. Explained variance: {exp_var}')

Components: 5. Explained variance: 0.6707966425249912
Components: 10. Explained variance: 0.7641581320966095
Components: 15. Explained variance: 0.8146105023788425
Components: 20. Explained variance: 0.8442384585117223
Components: 25. Explained variance: 0.8644455058513664
Components: 30. Explained variance: 0.8805949470164651
Components: 35. Explained variance: 0.8932371060155487
Components: 40. Explained variance: 0.904084626420168
Components: 45. Explained variance: 0.9133160986258515
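A cheaper way to get a similar table is to fit once with the largest number of components and take a cumulative sum of the per-component ratios. The sketch below uses a random matrix `X_toy` as a hypothetical stand-in for `df`. Because the default solver is randomized, the values can differ slightly from those obtained by refitting at each size.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical stand-in for the author-by-journal count matrix
X_toy = rng.poisson(0.3, size=(300, 60)).astype(float)

# Fit a single model with the largest number of components of interest
svd_full = TruncatedSVD(n_components=45, random_state=0)
svd_full.fit(X_toy)

# Running total of explained variance after 1, 2, ..., 45 components
cumulative = np.cumsum(svd_full.explained_variance_ratio_)

for n in range(5, 50, 5):
    print(f'Components: {n}. Explained variance: {cumulative[n - 1]:.4f}')
```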


I'll use 40, since that already accounts for ~90% of the variance.

In [8]:
n=40
svd = TruncatedSVD(n_components=n)
comps = svd.fit_transform(df)
exp_var = sum(svd.explained_variance_ratio_)
print(f'Components: {n}. Explained variance: {exp_var}')

Components: 40. Explained variance: 0.9040863177930241


In [121]:
# t-SNE on the SVD-transformed data (`comps` from the cell above)
time_start = time.time()
# note: newer scikit-learn releases name the n_iter argument max_iter
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300, random_state=11)
tsne_svd_results = tsne.fit_transform(comps)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

t-SNE done! Time elapsed: 6.537069082260132 seconds
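The two-stage pipeline, SVD followed by t-SNE, can be sketched end to end on toy data. Everything here (`X_toy`, the component count, the perplexity) is a hypothetical choice for illustration, and the iteration count is left at the library default so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(11)

# Hypothetical stand-in for the author-by-journal matrix: 120 rows, 30 columns
X_toy = rng.poisson(0.5, size=(120, 30)).astype(float)

# Step 1: reduce to a handful of SVD components
X_reduced = TruncatedSVD(n_components=10, random_state=11).fit_transform(X_toy)

# Step 2: project the reduced data to 2-D; perplexity should be smaller than the number of rows
embedding = TSNE(n_components=2, perplexity=30, random_state=11).fit_transform(X_reduced)

print(embedding.shape)
```

The embedding has one row per input row, in the same order, so it can be paired with the original index when plotting.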


In [157]:
# Now, we'll make a scatterplot with a mouseover that gives us the name of the author and their top-scoring journals
#output_file("tsne_interactive2.html")

source = ColumnDataSource(data=dict(
    x=tsne_svd_results[:,0],
    y=tsne_svd_results[:,1],
    author=tfidf.index,
    top_scores = [author_dict[author] for author in tfidf.index],
    label=author_cluster_list,
    colors=[color_map[c] for c in author_cluster_list]
    
))
TOOLTIPS = [
    ("(x,y)", "($x, $y)"),
    ("author", "@author"),
    ("top scores", "@top_scores"),
]

p = figure(plot_width=1000, plot_height=800, tooltips=TOOLTIPS, toolbar_location='above',
           title=f"t-SNE Projection of {len(tfidf)} Authors in Book Review Space")
p.scatter('x', 
          'y',
          size=7,
          source=source,
          fill_alpha=1,
          fill_color='colors'
)

output_file("tsne_interactive.html", title=f"t-SNE Projection of {len(tfidf)} Authors in Book Review Space")

show(p)

INFO:bokeh.io.state:Session output file 'tsne_interactive.html' already exists, will be overwritten.


Take some time to look it over. You will note that the clusters have a high degree of intuitive structure. Just browsing, I even found a cluster of 19th-century American authors: Twain, Fenimore Cooper, Melville, etc.

In [10]:
from sklearn.decomposition import NMF

In [51]:
# https://towardsdatascience.com/nmf-a-visual-explainer-and-python-implementation-7ecdd73491f8
c = 30
model = NMF(n_components=c, init='random', random_state=99, max_iter=500)
W = model.fit_transform(tfidf)
H = model.components_
err = model.reconstruction_err_
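For readers new to NMF: it approximates a non-negative matrix V as a product W @ H of two smaller non-negative matrices. In the cell above, rows of `W` give each author's weight on each component and rows of `H` give each component's weight on each journal. A minimal sketch on a random matrix (a hypothetical stand-in, not the review data) shows the shapes involved.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical non-negative matrix: 20 "authors" x 8 "journals"
V = rng.random((20, 8))

nmf_toy = NMF(n_components=3, init='random', random_state=99, max_iter=500)
W_toy = nmf_toy.fit_transform(V)   # shape (20, 3): author-by-component weights
H_toy = nmf_toy.components_        # shape (3, 8): component-by-journal weights

# The product of the two factors approximates the original matrix
approx = W_toy @ H_toy
rel_err = np.linalg.norm(V - approx) / np.linalg.norm(V)

print(W_toy.shape, H_toy.shape)
print(f'relative reconstruction error: {rel_err:.3f}')
```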

In [52]:
def display_topics(model, feature_names, num_top_words):
    '''Given an NMF model, feature_names, and number of top features, print hidden component 
    and its top feature names, up to specified number of top features.'''
    
    for ix, topic in enumerate(model.components_):
        print("\nTopic ", ix)
        for i in topic.argsort()[:-num_top_words - 1:-1]:
            print(feature_names[i])
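The slice `argsort()[:-num_top_words - 1:-1]` is compact but easy to misread. It walks the ascending sort order backwards from the end and stops after `num_top_words` items, which gives the indices of the largest values in descending order. A small sketch with made-up scores:

```python
import numpy as np

# Hypothetical component weights and matching feature names
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
names = np.array(['a', 'b', 'c', 'd', 'e'])
k = 3

# Indices of the k largest scores, largest first
top_k = scores.argsort()[:-k - 1:-1]

# Equivalent, arguably more readable form: reverse the full order, then take k
top_k_alt = scores.argsort()[::-1][:k]

print(names[top_k])
```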

In [53]:
display_topics(model=model, feature_names=tfidf.columns, num_top_words=10)


Topic  0
Best Sellers
New York Times Book Review
Publishers Weekly
Kirkus Reviews
Book World
West Coast Review of Books
Saturday Review
Observer (London)
New Yorker
Books & Bookmen

Topic  1
Sky and Telescope
Science Books and Films
Science
Nature
Geographical Journal
Times Literary Supplement
Natural History
Voice of Youth Advocates
Horn Book Magazine
Science Fiction and Fantasy Book Review

Topic  2
Times Educational Supplement
School Librarian
British Book News, Children's Supplement
Children's Book Review Service
Economist
New Statesman
Book World
History Today
Emergency Librarian
Los Angeles Times Book Review

Topic  3
Newsweek
Time
New York Times (Daily)
Saturday Review
National Observer
Hudson Review
New York Review of Books
New Republic
Book World
Nation

Topic  4
Horn Book Magazine
Childhood Education
Catholic Library World
Language Arts
Teachers College Record
Christian Science Monitor
School Library Journal
Top of the News
Social Education
New York Times Book Review

Topic 

In [54]:
# top 10 authors by weight on NMF component 1 (the science-journal topic above)
cluster = 1
top_author_inds = W[:,cluster].argsort()[::-1][:10]
for ind in top_author_inds:
    print(tfidf.index[ind])

MOORE, Patrick
KOPAL, Zdenek
RONAN, Colin A
ASIMOV, Isaac
MUIRDEN, James
JASTROW, Robert
ALTER, Dinsmore
SAGAN, Carl
BRANLEY, Franklyn M
DRAKE, Stillman


In [46]:
tfidf.iloc[3596].name

'KENNEDY, Eugene'