# Visualizing Clusters of Clickbait Headlines Using R and Plotly

by Max Woolf (@minimaxir)

*This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)*

In [1]:
options(warn=1)

source("Rstart.R")


library(htmlwidgets)
library(tidyr)
library(tsne)
#library(crosstalk)   # Does not work unfortunately
library(plotly)

sessionInfo()


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Registering fonts with R

Attaching package: ‘scales’

The following objects are masked from ‘package:readr’:

    col_factor, col_numeric


Attaching package: ‘plotly’

The following object is masked _by_ ‘.GlobalEnv’:

    subplot

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:graphics’:

    layout



R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] plotly_3.6.0       tsne_0.1-3         tidyr_0.5.1        htmlwidgets_0.7   
 [5] stringr_1.0.0      digest_0.6.10      RColorBrewer_1.1-2 scales_0.4.0      
 [9] extrafont_0.17     ggplot2_2.1.0      dplyr_0.5.0        readr_0.2.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6      Rttf2pt1_1.3.3   magrittr_1.5     munsell_0.4.3   
 [5] uuid_0.1-2       colorspace_1.2-6 R6_2.1.2         httr_1.2.1      
 [9] plyr_1.8.4       tools_3.3.0      gtable_0.2.0     DBI_0.4-1       
[13] extrafontdb_1.0  htmltools_0.3.5  assertthat_0.1   tibble_1.1      
[17] gridExtra_2.2.1  IRdisplay_0.3    repr_0.4         viridis_0.3.4   
[21] base64enc_0.1-

In [12]:
df <- read_csv('fb_headlines_53D.csv')

df %>% head() %>% print()

# A tibble: 6 x 6
  page_id                    status_id
    <chr>                        <chr>
1     CNN 5550296508_10155163822816509
2     CNN 5550296508_10155163797056509
3     CNN 5550296508_10155163796576509
4     CNN 5550296508_10155163760831509
5     CNN 5550296508_10155163747646509
6     CNN 5550296508_10155163713601509
# ... with 4 more variables: link_name <chr>, status_published <time>,
#   num_reactions <int>, merged_vectors <chr>


In [13]:
vector_names = paste0('w2v_', 1:53)

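# Drop the first and last character of each string,
# i.e. whatever delimiters wrap the serialized vector.
vector_trim <- function(vector)
    substr(vector, 2, nchar(vector)-1)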

vector_names %>% head() %>% print()
vector_trim(df$merged_vectors[1])

[1] "w2v_1" "w2v_2" "w2v_3" "w2v_4" "w2v_5" "w2v_6"
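
To make the trimming step concrete, here is a minimal base-R sketch of what happens to one serialized vector. The sample string and its square-bracket delimiters are assumptions for illustration (the actual contents of `merged_vectors` are not printed above), and `parse_vector` is a hypothetical helper that is not part of this notebook.

```r
# Hypothetical example value; the real merged_vectors strings are assumed to look similar.
sample_vector <- "[0, 1, 0, -0.029536006]"

# Same trimming logic as vector_trim(): drop the first and last character.
trimmed <- substr(sample_vector, 2, nchar(sample_vector) - 1)

# Hypothetical helper: split on commas and convert each piece to a number.
# as.numeric() ignores the whitespace left over after each comma.
parse_vector <- function(s) as.numeric(strsplit(s, ",", fixed = TRUE)[[1]])

values <- parse_vector(trimmed)
print(trimmed)
print(values)
```

The notebook itself does the splitting with `tidyr::separate()`, which spreads the pieces across the `w2v_*` columns in one call; the sketch only shows the per-string logic.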


In [14]:
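# substr() and nchar() are vectorized, so the whole column can be trimmed at once
# (lapply() would turn the column into a list instead of a character vector).
df$merged_vectors <- vector_trim(df$merged_vectors)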

df <- separate(data = df, col = merged_vectors, into = vector_names, convert=T, sep = ",")

df %>% select(w2v_1:w2v_4) %>% head() %>% print()

# A tibble: 6 x 4
  w2v_1 w2v_2 w2v_3        w2v_4
  <dbl> <dbl> <dbl>        <dbl>
1     0     1     0 -0.029536006
2     0     1     0  0.008397121
3     0     1     0 -0.045302893
4     0     1     0 -0.039918742
5     0     1     0 -0.041991464
6     0     1     0 -0.046025169


Running t-SNE on the full 53-column matrix with the pure-R `tsne` package will take a very, very long time.

In [15]:
matrix <- df %>% select(w2v_1:w2v_53) %>% as.matrix()

system.time( cluster_coords <- tsne(matrix, initial_dims=53, perplexity=50, epoch=50) )

sigma summary: Min. : 2.98e-08 |1st Qu. : 0.4972 |Median : 0.5505 |Mean : 0.6062 |3rd Qu. : 0.6534 |Max. : 1.602 |
Epoch: Iteration #50 error is: 22.6931190265602
Epoch: Iteration #100 error is: 22.5340438633914
Epoch: Iteration #150 error is: 2.85954740967973
Epoch: Iteration #200 error is: 2.54754332470592
Epoch: Iteration #250 error is: 2.4017312366903
Epoch: Iteration #300 error is: 2.22656455410785
Epoch: Iteration #350 error is: 2.13140040113128
Epoch: Iteration #400 error is: 2.07159763903534
Epoch: Iteration #450 error is: 2.02963431041514
Epoch: Iteration #500 error is: 1.99813060715731
Epoch: Iteration #550 error is: 1.97330883995303
Epoch: Iteration #600 error is: 1.95320547606506
Epoch: Iteration #650 error is: 1.93655699300747
Epoch: Iteration #700 error is: 1.92246268080483
Epoch: Iteration #750 error is: 1.91031431549613
Epoch: Iteration #800 error is: 1.89978083538389
Epoch: Iteration #850 error is: 1.8905105349365
Epoch: Iteration #900 error is: 1.88227733304293
Epoch:

    user   system  elapsed 
14417.74 10045.16 29550.69 

8.2 hours! (The `elapsed` value is in seconds: 29550.69 / 3600 ≈ 8.2.)
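
For readers who cannot spare that much compute, one option is a faster t-SNE implementation. The sketch below swaps in the `Rtsne` package (a Barnes-Hut approximation), which is not what this notebook used, so the resulting layout will differ from the plots here. The random `toy_matrix` is a stand-in for the real 53-column matrix, and the block is skipped when `Rtsne` is not installed.

```r
set.seed(42)  # t-SNE is stochastic; fix the seed for repeatable layouts

# Stand-in data: 200 random rows with 53 columns (NOT the headline vectors).
toy_matrix <- matrix(rnorm(200 * 53), nrow = 200, ncol = 53)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  timing <- system.time(
    fit <- Rtsne::Rtsne(toy_matrix,
                        dims = 2,
                        perplexity = 50,           # same perplexity as the tsne() call above
                        pca = FALSE,               # keep all 53 input dimensions
                        check_duplicates = FALSE)  # real data may contain duplicate rows
  )
  toy_coords <- fit$Y  # n x 2 matrix, analogous to cluster_coords
  print(dim(toy_coords))
  print(timing["elapsed"] / 3600)  # elapsed seconds converted to hours
}
```

Timing a small subsample like this before committing to the full run is also a cheap way to estimate how long the complete dataset will take.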

In [16]:
cluster_coords %>% head() %>% print()

          [,1]       [,2]
[1,] -26.29485  -6.132532
[2,] -52.90633 -55.677870
[3,] -33.27932 -12.286431
[4,] -28.99210  23.270245
[5,] -17.15817  16.288898
[6,]  40.21769 -21.949683


In [17]:
df_transform = df %>% select(page_id, status_id, link_name, status_published, num_reactions) %>%
                mutate(x = cluster_coords[,1], y= cluster_coords[,2])

df_transform %>% select(status_id, x, y) %>% head() %>% print()

# A tibble: 6 x 3
                     status_id         x          y
                         <chr>     <dbl>      <dbl>
1 5550296508_10155163822816509 -26.29485  -6.132532
2 5550296508_10155163797056509 -52.90633 -55.677870
3 5550296508_10155163796576509 -33.27932 -12.286431
4 5550296508_10155163760831509 -28.99210  23.270245
5 5550296508_10155163747646509 -17.15817  16.288898
6 5550296508_10155163713601509  40.21769 -21.949683
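
Because `mutate()` attaches the coordinates purely by row position, `cluster_coords` must have the same number of rows as `df`, in the same order. A minimal base-R sketch of that alignment check, using made-up data and hypothetical names (`toy_df`, `toy_coords`):

```r
# Made-up stand-ins for df and cluster_coords (values are illustrative only).
toy_df <- data.frame(status_id = c("a_1", "a_2", "a_3"),
                     num_reactions = c(10L, 250L, 7L),
                     stringsAsFactors = FALSE)
toy_coords <- matrix(c(-26.3, -52.9, -33.3,
                        -6.1, -55.7, -12.3), ncol = 2)

# Fail fast if the embedding and the metadata are out of sync.
stopifnot(nrow(toy_coords) == nrow(toy_df), ncol(toy_coords) == 2)

# Attach by position: row i of the matrix belongs to row i of the data frame.
toy_df$x <- toy_coords[, 1]
toy_df$y <- toy_coords[, 2]
print(toy_df)
```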


In [18]:
write.csv(df_transform, "df_transform_53D.csv", row.names=F)
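
Since the embedding took hours to compute, writing it to disk lets later sessions reload the result instead of rerunning t-SNE. A minimal sketch of that caching pattern, with a hypothetical `load_or_compute()` helper and a cheap placeholder standing in for the real computation:

```r
# Hypothetical helper: reuse a cached CSV if present, otherwise compute and save.
load_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  result <- compute()
  write.csv(result, path, row.names = FALSE)
  result
}

cache_path <- tempfile(fileext = ".csv")  # stand-in for "df_transform_53D.csv"

# Placeholder for the expensive step; in the notebook this would be the t-SNE run.
expensive_step <- function() data.frame(status_id = c("a_1", "a_2"),
                                        x = c(-26.3, -52.9),
                                        y = c(-6.1, -55.7),
                                        stringsAsFactors = FALSE)

first  <- load_or_compute(cache_path, expensive_step)  # computes and writes the file
second <- load_or_compute(cache_path, expensive_step)  # reads the cached file
print(all.equal(first, second))
```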


## Make the plot

Prototype using ggplot2

In [90]:
df_plot <- read_csv("df_transform_53D.csv")

df_plot %>% select(link_name, x, y) %>% mutate(link_name = substr(link_name,1,20)) %>% head() %>% print()

# A tibble: 6 x 3
             link_name         x          y
                 <chr>     <dbl>      <dbl>
1 Joseph Schooling bea -26.29485  -6.132532
2 Bill Clinton: Email  -52.90633 -55.677870
3 Hacker releases cell -33.27932 -12.286431
4 Lionel Messi announc -28.99210  23.270245
5 Fighting the male bi -17.15817  16.288898
6 The face of the Olym  40.21769 -21.949683


In [93]:
plot <- ggplot(df_plot, aes(x=x, y=y, color=page_id)) +
            geom_point(alpha=0.75, stroke=0) + 
            theme_bw()

ggsave("fb-headlines-cluster-test-53D.png", plot, width=4, height=3, dpi=300)

![](fb-headlines-cluster-test-53D.png)

Prototype using plotly's scattergl

In [13]:
p <- plot_ly(df_plot,
             x = x,
             y = y,
             color=page_id,
             type = "scattergl",
             mode = "markers",
             marker = list(line = list(width = 0), opacity=0.75, size=6),
             text=link_name)

createWidget(name="plotly",x=plotly_build(p), sizingPolicy=sizingPolicy(browser.padding = 0, 
            browser.fill = F, defaultWidth = "100%", defaultHeight = 400)) %>%
saveWidget("fb-headlines-cluster-test-53D.html", selfcontained=T, libdir="plotly")
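
The `plot_ly()` calls in this notebook pass bare column names (`x = x`), which matches the plotly 3.6.0 release listed in `sessionInfo()`. From plotly 4.0 onward, columns are mapped with one-sided formulas (`x = ~x`). A hedged sketch of the equivalent call under that newer syntax, using a made-up `toy_plot` data frame and skipped when plotly is not installed:

```r
# Made-up stand-in for df_plot.
toy_plot <- data.frame(page_id   = c("CNN", "CNN", "OtherPage"),
                       link_name = c("Headline A", "Headline B", "Headline C"),
                       x = c(-26.3, -52.9, 40.2),
                       y = c(-6.1, -55.7, -21.9),
                       stringsAsFactors = FALSE)

if (requireNamespace("plotly", quietly = TRUE)) {
  p_new <- plotly::plot_ly(toy_plot,
                           x = ~x,
                           y = ~y,
                           color = ~page_id,
                           type = "scattergl",
                           mode = "markers",
                           marker = list(line = list(width = 0), opacity = 0.75, size = 6),
                           text = ~link_name)
  built <- plotly::plotly_build(p_new)  # resolve the formulas into concrete traces
  print(class(built))
}
```

In those newer versions the plot object is already an htmlwidget, so it can typically be passed straight to `htmlwidgets::saveWidget()` without the explicit `createWidget()` step.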

Generate custom text for tooltips (note: this was not used since it made charts harder to read)

In [76]:
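# Build an HTML tooltip from one row of df_plot (passed by apply() as a character vector):
# row[3] = link_name, row[4] = status_published, row[5] = num_reactions.
processText <- function(row) {
    sprintf("%s<br>%s Reactions<br>%s",
            row[3],
            format(as.numeric(row[5]), big.mark=","),
            format(as.Date(substr(row[4], 1, 10)), format = "%B %d, %Y"))
}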

apply(df_plot[1,], 1, processText)
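
To see what these tooltips look like, here is a small self-contained sketch that runs the same formatting steps on made-up rows. `format_tooltip` and `toy_rows` are hypothetical names, and the month name in the output depends on the R session's locale.

```r
# Made-up rows with the same column order as df_plot's first five columns.
toy_rows <- data.frame(page_id = c("CNN", "CNN"),
                       status_id = c("a_1", "a_2"),
                       link_name = c("Headline A", "Headline B"),
                       status_published = c("2016-08-12 10:15:00", "2016-08-11 08:00:00"),
                       num_reactions = c(12345L, 87L),
                       stringsAsFactors = FALSE)

# Same steps as processText(), but with fields looked up by column name.
format_tooltip <- function(row) {
  sprintf("%s<br>%s Reactions<br>%s",
          row["link_name"],
          format(as.numeric(row["num_reactions"]), big.mark = ","),
          format(as.Date(substr(row["status_published"], 1, 10)), format = "%B %d, %Y"))
}

# apply() converts each row to a named character vector before calling the function.
tooltips <- apply(toy_rows, 1, format_tooltip)
print(tooltips)
```

Looking fields up by name rather than by position keeps the function working if columns are later reordered or added.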

In [77]:
df_plot$text = apply(df_plot, 1, processText)

Plot the real plotly chart, with layout options to remove the axes.

In [86]:
# https://plot.ly/r/axes/

ax <- list(
  title = "",
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE
)

m = list(
  l = 0,
  r = 0,
  b = 0,
  t = 25,
  pad = 0
)


p <- plot_ly(df_plot,
             x = x,
             y = y,
             color=page_id,
             type = "scattergl",
             mode = "markers",
             marker = list(line = list(width = 0), opacity=0.75, size=6),
             text=link_name,
            hoverinfo="text+name") %>% layout(xaxis = ax, yaxis = ax, margin=m)

createWidget(name="plotly",x=plotly_build(p), sizingPolicy=sizingPolicy(browser.padding = 0, 
            browser.fill = F, defaultWidth = "100%", defaultHeight = 400)) %>%
saveWidget("fb-headlines-cluster-standalone.html", selfcontained=T, libdir="plotly")

Tweak the plot slightly for the blog post proper.

In [107]:
font <- list(
        family='Source Sans Pro, Arial, sans-serif'
    )

p <- plot_ly(df_plot,
             x = x,
             y = y,
             color=page_id,
             type = "scattergl",
             mode = "markers",
             marker = list(line = list(width = 0), opacity=0.75, size=6),
             text=link_name,
            hoverinfo="text+name") %>% layout(xaxis = ax,
                                              yaxis = ax,
                                              margin=m,
                                              font=font,
                                              plot_bgcolor ='#f7f8fa',
                                              paper_bgcolor='#f7f8fa')

createWidget(name="plotly",x=plotly_build(p), sizingPolicy=sizingPolicy(browser.padding = 0, 
            browser.fill = F, defaultWidth = "100%", defaultHeight = 400)) %>%
saveWidget("fb-headlines-cluster-web.html", selfcontained=T, libdir="plotly")

# The MIT License (MIT)

Copyright (c) 2016 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.