In this notebook, we are going to consider the clustering of *authors* rather than journals (which is simply the inverse of the previous question). With over a hundred thousand authors (to say nothing of the full 35 years of data), hierarchical clustering is no longer an option. Instead, we're going to consider author similarity with two methods: 

1) a 2-D t-SNE projection of the feature space
2) cluster analysis with K-Means and DBSCAN

In [1]:
import pandas as pd
import numpy as np
import math
import time

# for t-SNE
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# for clustering
from sklearn.decomposition import NMF

# for plotting, we will use Bokeh for the excellent interactive options
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper, Legend
from bokeh.transform import factor_cmap

As in our EDA section, we need to group individual titles by author. On my machine, this cell takes 20-30 seconds.

In [2]:
df = pd.read_csv('../data/processed/book_reviews.tsv', sep='\t', index_col=0)
df['author_name'] = df.index.to_series().str.split('\\|\\|').str[1].str.strip()
author_total_books = df['author_name'].value_counts()
df = df.groupby('author_name').sum()
df = df[df.index.notnull()]
df = df.drop('#NAME?')
df.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Journal of Marketing,Harvard Law Review,Journal of Business Education,Journal of Home Economics,...,Black Warrior Review,Computers and the Humanities,American Arts,Essays on Canadian Writing`,Performing Arts Review,"Journal of Arts Management, Law, and Society","Studio International, Review",Journal of Black Studies,Lone Star Review,Aspen Journal of the Arts
,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AABERG, Jean",2,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AADLAND, Florence",0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AAFJES, Bertus",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AAGAARD, Orlena",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Note that we are representing our author-review relationships in much the same way as one would represent customer-item interactions in a (very simple) recommender system. As in that application, we're going to restrict the data to only authors who have received a minimum number of reviews. I've selected 20, because this yields a manageable number for visualization. But that window could be tinkered with.

Additionally, I've dropped all journals with fewer than 25 reviews.

In [3]:
auth_min = 20
journal_min = 25
df = df[df.sum(axis=1) >= auth_min]
df = df[df.columns[df.sum() >= journal_min]]
df.shape

(9043, 352)

When we clustered journals, we needed to account for the disproportionate number of reviews published by the biggest journals. We face a similar problem here. Ideally, we would like to find authors with a high degree of review overlap: authors that tend to be reviewed in the same journals. The problem is that some journals review so many authors that many different authors will seem "similar" almost by coincidence. 

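As such, it is useful to introduce a weighting scheme that discounts journals which review a very large number of authors.

In [4]:
# some magazines simply publish far more reviews than others, so we down-weight them

# I'll use TF-IDF, where TF is an author's raw review count in a magazine
# and IDF is log(total authors / authors reviewed by this magazine)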
docs = df.shape[0]
idfs = [math.log(docs / np.where(df[col] == 0, 0, 1).sum()) for col in df.columns]
tfidf = df * idfs
tfidf.head()

author_name,AB Bookman's Weekly,Publishers Weekly,Esquire,Booklist,Journal of Aesthetics and Art Criticism,International Philosophical Quarterly,Harvard Law Review,Journal of Home Economics,Social Education,Library Journal,...,Journal of Negro Education,Foreign Affairs,Thought,Political Science Reviewer,Mankind,Black Scholar,Social Research,Religious Studies,Daedalus,Threepenny Review
"AARDEMA, Verna",0.0,1.588567,0.0,1.766097,0.0,0.0,0.0,0.0,5.524714,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Chester",0.0,0.907753,0.0,1.059658,0.0,0.0,0.0,0.0,2.762357,0.409565,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Daniel",0.0,0.453876,0.0,0.17661,0.0,0.0,0.0,0.0,0.0,0.136522,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AARON, Henry J",0.0,0.0,0.0,0.353219,0.0,0.0,0.0,0.0,0.0,0.546086,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"AASENG, Nathan",0.0,0.0,0.0,7.064389,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With these values, we can see an author's review profile with each magazine discounted according to how many authors it reviews. This mitigates the problem of everybody's "top" journal being Booklist, Kirkus, or Publishers Weekly.
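
To make the effect of the weighting concrete, here is a minimal sketch on made-up data. The journal and author names are hypothetical, and `(toy > 0).sum()` is just an equivalent way of counting the authors each journal reviews; it is not the exact expression used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are authors, columns are journals, values are review counts.
toy = pd.DataFrame(
    {
        "Big Journal": [3, 1, 2, 4],    # reviews every author
        "Niche Journal": [0, 0, 5, 0],  # reviews only one author
    },
    index=["author_a", "author_b", "author_c", "author_d"],
)

n_authors = toy.shape[0]
# Number of authors with at least one review in each journal.
authors_per_journal = (toy > 0).sum()
# IDF: log(total authors / authors reviewed by this journal).
toy_idfs = np.log(n_authors / authors_per_journal)
# Multiplying a DataFrame by a Series aligns on column names,
# so each journal's column is scaled by that journal's IDF.
toy_tfidf = toy * toy_idfs

print(toy_tfidf)
```

A journal that reviews every author gets an IDF of log(1) = 0, so it drops out of every profile entirely, while the niche journal's reviews are scaled up by log(4).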

A few authors are listed below along with what might be called their "review profile," or the journals that characterize their reception.

In [5]:
# here, for example, is Ursula Le Guin
tfidf.loc['LE GUIN, Ursula K'].sort_values(ascending=False)[:10]

Magazine of Fantasy and Science Fiction    23.271826
English Journal                            17.578661
Analog Science Fiction and Fact            16.369866
Voice of Youth Advocates                   13.572423
Emergency Librarian                        11.886755
Book World                                 10.542505
Observer (London)                          10.234553
Horn Book Magazine                          9.954513
New Age Journal                             8.969547
Book Report                                 7.683776
Name: LE GUIN, Ursula K, dtype: float64

In [6]:
# or Toni Morrison
tfidf.loc['MORRISON, Toni'].sort_values(ascending=False)[:10]

Black Scholar          33.500314
Critique               19.560954
Ms.                    13.111455
American Literature     9.842402
Black World             9.611362
Newsweek                8.890156
Hudson Review           7.417465
Progressive             5.538774
Nation                  5.497496
Listener                5.349659
Name: MORRISON, Toni, dtype: float64

In [7]:
# or Thomas Merton, a famous American Christian monk
# side note: this is how I discovered that "America" is a Jesuit journal
tfidf.loc['MERTON, Thomas'].sort_values(ascending=False)[:10]

Christian Century           33.450798
America                     28.833051
Review for Religious        23.898028
Critic                      23.355302
Religious Studies Review    19.392063
Sewanee Review              13.296379
Catholic Library World       8.873822
American Book Review         7.593081
Best Sellers                 7.533190
Classical World              6.928599
Name: MERTON, Thomas, dtype: float64

In [9]:
# Eugene Kennedy, a Catholic priest
tfidf.loc['KENNEDY, Eugene'].sort_values(ascending=False)[:10]

America                    40.366271
Review for Religious       32.859788
Christian Century          28.433178
Critic                     10.380134
Best Sellers                8.788721
Catholic Library World      7.099058
New Catholic World          4.160986
Educational Leadership      4.153919
Contemporary Psychology     3.702574
Classical World             3.464299
Name: KENNEDY, Eugene, dtype: float64

In [10]:
# we can see the more "elite mainstream" profile of somebody like John Updike
tfidf.loc['UPDIKE, John'].sort_values(ascending=False)[:10]

America                     31.716356
Newsweek                    23.114407
Hudson Review               22.252394
New York Review of Books    22.181995
National Review             21.989982
Time                        21.727498
Saturday Review             19.570194
New Republic                19.415015
Observer (London)           18.763347
Guardian Weekly             18.364124
Name: UPDIKE, John, dtype: float64

In [11]:
# you can even see what might be called the leftist magazine sphere
tfidf.loc['ZINN, Howard'].sort_values(ascending=False)[:10]

Science and Society            11.302236
Negro Digest                    8.764717
Social Education                8.287071
Dissent                         7.867193
Nation                          7.329994
Partisan Review                 6.765797
Progressive                     5.538774
Journal of American History     5.225943
Saturday Review                 4.892549
American Spectator              4.382358
Name: ZINN, Howard, dtype: float64

In [13]:
# social historian Lerone Bennett has a prominent place in journals dedicated to Black culture
tfidf.loc['BENNETT, Lerone, Jr.'].sort_values(ascending=False)[:10]

Negro Digest                   17.529434
Black World                     9.611362
Black Scholar                   5.583386
Christian Century               5.017620
Saturday Review                 3.914039
Quarterly Journal of Speech     3.399319
Social Studies                  3.201663
Top of the News                 2.976348
Journal of American History     2.612971
Critic                          2.595034
Name: BENNETT, Lerone, Jr., dtype: float64

In [14]:
# British children's poet Charles Causley has an interesting place; note that he was known for blurring the line between
# children's and adult poetry
tfidf.loc['CAUSLEY, Charles'].sort_values(ascending=False)[:10]

Junior Bookshelf                         37.368358
Growing Point                            14.491105
Times Educational Supplement              8.172035
New Statesman                             7.084110
School Librarian                          5.914027
Books & Bookmen                           5.719416
Observer (London)                         5.117276
Horn Book Magazine                        4.977256
Center for Children's Books, Bulletin     4.630306
Listener                                  4.012244
Name: CAUSLEY, Charles, dtype: float64

The natural next step is to want to see a visualization of the entire space in which authors that have high TF-IDF
scores in the same journals are grouped together.
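
Before plotting everything, one quick way to sanity-check whether two authors "have high scores in the same journals" is the cosine similarity of their weighted profiles. The sketch below uses hypothetical three-journal profiles. It is not what t-SNE computes internally (by default it works from Euclidean distances); it is only a spot check for a single pair.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors; 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Hypothetical weighted profiles over the same three journals.
sf_author_1 = [20.0, 15.0, 0.0]
sf_author_2 = [10.0, 9.0, 1.0]
religion_author = [0.0, 0.5, 30.0]

print(cosine_similarity(sf_author_1, sf_author_2))      # close to 1: similar profiles
print(cosine_similarity(sf_author_1, religion_author))  # close to 0: little overlap
```

With the real data, the same function could be called on two rows of the weighted table, e.g. `cosine_similarity(tfidf.loc['MERTON, Thomas'], tfidf.loc['KENNEDY, Eugene'])`.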

Before doing that, I'm going to create a simple dictionary that associates each author with their top 5 journals.
This will be included as a tooltip for that author visible when mousing over their point in the visualization.
This is useful just because I have no idea who most of these people are; having their top journals makes them easier to Google, if I encounter them while browsing the visualization.

In [16]:
author_dict = {}
for author in tfidf.index:
    top_5 = tfidf.loc[author].sort_values(ascending=False)[:5]
    # store plain lists of journal names so the tooltip data serializes cleanly
    author_dict[author] = list(top_5.index)

(TO DO) 
To add a little color to our plot, I'm just going to arbitrarily assign colors to authors based on which feature from the NMF stage they are most similar to.

In [17]:
color_map = {
    'Other' : 'grey',
    'sf' : 'darkmagenta',
    'children' : 'blue',
    'science': 'limegreen',
    'wide coverage': 'olive',
    'history': 'gold',
    'british': 'crimson',
    'christian': 'saddlebrown',
    'poetry' : 'orange'
}
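
Since this step is still marked as to-do, here is one possible sketch of the assignment, under stated assumptions: `W_toy` stands in for the author-by-component matrix produced by NMF later in the notebook, and `component_labels` is a hypothetical hand-made mapping from component index to one of the category names above. Neither comes from the original analysis.

```python
import numpy as np

# Hypothetical stand-in for the NMF author-by-component matrix (4 authors, 3 components).
W_toy = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.7],
    [0.1, 0.8, 0.1],
    [0.0, 0.0, 0.0],  # an author with no weight on any component
])

# Hypothetical hand-assigned labels for each component index.
component_labels = {0: 'sf', 1: 'children', 2: 'science'}

# Subset of a color map like the one above, repeated so this sketch runs on its own.
colors = {'Other': 'grey', 'sf': 'darkmagenta', 'children': 'blue', 'science': 'limegreen'}

def label_authors(W, labels):
    """Label each row by its strongest component; rows with no weight fall back to 'Other'."""
    strongest = W.argmax(axis=1)
    has_weight = W.max(axis=1) > 0
    return [labels.get(int(ix), 'Other') if ok else 'Other'
            for ix, ok in zip(strongest, has_weight)]

author_labels = label_authors(W_toy, component_labels)
author_colors = [colors[label] for label in author_labels]
print(author_labels)
print(author_colors)
```

The resulting lists line up with the rows of the matrix, so they could be passed as extra columns to the plotting data source.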

t-SNE is usually run on data whose dimensionality has already been reduced. We'll use SVD rather than PCA. PCA requires centering the data first, and since our data is sparse (lots of 0 values), centering would turn nearly every zero into a nonzero value. Truncated SVD works on the uncentered matrix directly.
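
The sparsity argument can be checked on toy data: subtracting each column's mean (the centering step in PCA) replaces the zeros with small negative values. The matrix below is randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sparse count matrix: 1000 authors x 50 journals, at most about 5% nonzero.
counts = rng.poisson(lam=2.0, size=(1000, 50)) * (rng.random((1000, 50)) < 0.05)

# Centering: subtract each journal's mean count from its column.
centered = counts - counts.mean(axis=0)

frac_zero_before = np.mean(counts == 0)
frac_zero_after = np.mean(centered == 0)
print(f'zeros before centering: {frac_zero_before:.2%}')
print(f'zeros after centering:  {frac_zero_after:.2%}')
```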

In this cell, we'll run an SVD on the data with different numbers of components to see if there are any "break points" beyond which we get diminishing returns in terms of variance explained.

In [20]:
for n in range(5, 100, 5):
    svd = TruncatedSVD(n_components=n)
    comps = svd.fit_transform(tfidf)
    exp_var = sum(svd.explained_variance_ratio_)
    print(f'Components: {n}. Explained variance: {exp_var}')

Components: 5. Explained variance: 0.2846187817747947
Components: 10. Explained variance: 0.38169581910101613
Components: 15. Explained variance: 0.4454361434668806
Components: 20. Explained variance: 0.48949232657724645
Components: 25. Explained variance: 0.5252475824375943
Components: 30. Explained variance: 0.5546447196233324
Components: 35. Explained variance: 0.5806937172498169
Components: 40. Explained variance: 0.6043325322548287
Components: 45. Explained variance: 0.6256494261628878
Components: 50. Explained variance: 0.6450278854755689
Components: 55. Explained variance: 0.6625389852338813
Components: 60. Explained variance: 0.6790877537801898
Components: 65. Explained variance: 0.6944480911703156
Components: 70. Explained variance: 0.7088178708857833
Components: 75. Explained variance: 0.7225291676956674
Components: 80. Explained variance: 0.735431055326152
Components: 85. Explained variance: 0.7472648140126623
Components: 90. Explained variance: 0.7589477514285882
Components

I'll use 60, since that already accounts for about 2/3 of the variance.

In [22]:
n=60
svd = TruncatedSVD(n_components=n)
comps = svd.fit_transform(tfidf)
exp_var = sum(svd.explained_variance_ratio_)
print(f'Components: {n}. Explained variance: {exp_var}')

Components: 60. Explained variance: 0.6791823929931932
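
As an aside, the sweep over component counts refits the model once per setting. A cheaper sketch, assuming an approximate curve is good enough, is to fit once with the largest number of components and take a running sum of `explained_variance_ratio_`. The matrix `X` here is random stand-in data, and `random_state=0` is an arbitrary choice made only for repeatability.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical stand-in for the weighted author-by-journal matrix.
X = rng.random((200, 40)) * (rng.random((200, 40)) < 0.1)

max_components = 30
svd_full = TruncatedSVD(n_components=max_components, random_state=0)
svd_full.fit(X)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(svd_full.explained_variance_ratio_)
for k in range(5, max_components + 1, 5):
    print(f'Components: {k}. Explained variance: {cumulative[k - 1]:.3f}')
```

Because the default solver is randomized, the numbers from a single large fit can differ slightly from those of separate smaller fits, much as the two 60-component runs above differ in the fourth decimal place.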


Now we fit our t-SNE projection on this transformed version of the data.

In [24]:
time_start = time.time()
# note: recent scikit-learn releases rename n_iter to max_iter
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300, random_state=11)
tsne_svd_results = tsne.fit_transform(comps)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))



t-SNE done! Time elapsed: 7.7194976806640625 seconds


Finally, we'll make a scatterplot with a mouseover that gives us the name of the author and their top-scoring journals. Authors will tend to be grouped near other authors reviewed by the same venues.


In [29]:
source = ColumnDataSource(data=dict(
    x=tsne_svd_results[:,0],
    y=tsne_svd_results[:,1],
    author=tfidf.index,
    top_scores = [author_dict[author] for author in tfidf.index],
    #label=author_cluster_list,
    #colors=[color_map[c] for c in author_cluster_list]
    
))
TOOLTIPS = [
    ("(x,y)", "($x, $y)"),
    ("author", "@author"),
    ("top scores", "@top_scores"),
]

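# note: newer Bokeh releases (3.0+) use width/height in place of plot_width/plot_height
p = figure(plot_width=1000, plot_height=800, tooltips=TOOLTIPS, toolbar_location='above',
           title=f"t-SNE Projection of {len(tfidf.index)} Authors in Book Review Space")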
p.scatter('x', 
          'y',
          size=7,
          source=source,
          fill_alpha=1,
          #fill_color='colors'
)

output_file("../images/tsne_interactive.html", title=f"t-SNE Projection of {len(tfidf.index)} Authors in Book Review Space")

show(p)

Start : This command cannot be run due to the error: The system cannot find the file specified.
At line:1 char:1
+ Start "file:///mnt/e/dissertation/ch3/images/tsne_interactive.html"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Start-Process], InvalidOperationException
    + FullyQualifiedErrorId : InvalidOperationException,Microsoft.PowerShell.Commands.StartProcessCommand
 


Take some time to look it over. You will note that the clusters have a high degree of intuitive structure. Just browsing, I even found a cluster of 19th-century American authors: Twain, Fenimore Cooper, Melville, etc.

In [10]:
from sklearn.decomposition import NMF

In [51]:
# https://towardsdatascience.com/nmf-a-visual-explainer-and-python-implementation-7ecdd73491f8
c = 30
model = NMF(n_components=c, init='nndsvd', random_state=99, max_iter=500)
W = model.fit_transform(tfidf)
H = model.components_
err = model.reconstruction_err_

In [52]:
def display_topics(model, feature_names, num_top_words):
    '''Given an NMF model, feature_names, and number of top features, print hidden component 
    and its top feature names, up to specified number of top features.'''
    
    for ix, topic in enumerate(model.components_):
        print("\nTopic ", ix)
        for i in topic.argsort()[:-num_top_words - 1:-1]:
            print(feature_names[i])

In [53]:
display_topics(model=model, feature_names=tfidf.columns, num_top_words=10)


Topic  0
Best Sellers
New York Times Book Review
Publishers Weekly
Kirkus Reviews
Book World
West Coast Review of Books
Saturday Review
Observer (London)
New Yorker
Books & Bookmen

Topic  1
Sky and Telescope
Science Books and Films
Science
Nature
Geographical Journal
Times Literary Supplement
Natural History
Voice of Youth Advocates
Horn Book Magazine
Science Fiction and Fantasy Book Review

Topic  2
Times Educational Supplement
School Librarian
British Book News, Children's Supplement
Children's Book Review Service
Economist
New Statesman
Book World
History Today
Emergency Librarian
Los Angeles Times Book Review

Topic  3
Newsweek
Time
New York Times (Daily)
Saturday Review
National Observer
Hudson Review
New York Review of Books
New Republic
Book World
Nation

Topic  4
Horn Book Magazine
Childhood Education
Catholic Library World
Language Arts
Teachers College Record
Christian Science Monitor
School Library Journal
Top of the News
Social Education
New York Times Book Review

Topic 

In [54]:
cluster = 1
top_author_inds = W[:,cluster].argsort()[::-1][:10]
for ind in top_author_inds:
    name = tfidf.iloc[ind].name
    print(name)

MOORE, Patrick
KOPAL, Zdenek
RONAN, Colin A
ASIMOV, Isaac
MUIRDEN, James
JASTROW, Robert
ALTER, Dinsmore
SAGAN, Carl
BRANLEY, Franklyn M
DRAKE, Stillman


In [46]:
tfidf.iloc[3596].name

'KENNEDY, Eugene'