# How to Create a Network Graph Visualization of Reddit Subreddits

by Max Woolf (@minimaxir)

*This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)*

In [1]:
source("Rstart.R")

library(sna)
library(ggnetwork)
library(svglite)
library(igraph)
library(intergraph)   # convert igraph to network
library(rsvg)   # convert svg to pdf

sessionInfo()


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Registering fonts with R

Attaching package: ‘scales’

The following objects are masked from ‘package:readr’:

    col_factor, col_numeric

sna: Tools for Social Network Analysis
Version 2.3-2 created on 2014-01-13.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
 For citation information, type citation("sna").
 Type help(package="sna") to get started.


Attaching package: ‘igraph’

The following objects are masked from ‘package:sna’:

    %c%, betweenness, bonpow, closeness, components, degree,
    dyad.census, evcent, hierarchy, is.connected, neighborhood,
    triad.census

The following object is masked from ‘package:stringr’:

    %>%

The following objects are masked from ‘package:dplyr’:

    %>%, as_data_frame, groups, union

The following objects are masked from 

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] rsvg_0.5           intergraph_2.0-2   igraph_1.0.1       svglite_1.1.0     
 [5] ggnetwork_0.5.1    sna_2.3-2          stringr_1.0.0      digest_0.6.9      
 [9] RColorBrewer_1.1-2 scales_0.4.0       extrafont_0.17     ggplot2_2.1.0     
[13] dplyr_0.4.3        readr_0.2.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      plyr_1.8.3       base64enc_0.1-3  tools_3.3.0     
 [5] uuid_0.1-2       jsonlite_0.9.19  evaluate_0.9     gtable_0.2.0    
 [9] IRdisplay_0.3    DBI_0.4          IRkernel_0.5     ggrepel_0.5     
[13] parallel_3.3.0   rzmq_0.7.7       Rttf2pt1_1.3.3   repr_0.4        
[17] gdtools_0.0.7    R6_2.1.2         extrafontd

Load the edge list into R and sort it by source and target.

In [2]:
file_name <- "subreddit_edges.csv"

df <- read_csv(file_name) %>% arrange(Source, Target)
print(head(df))

Source: local data frame [6 x 3]

             Source        Target Weight
              (chr)         (chr)  (int)
1 10cloverfieldlane     askreddit    228
2      1200isplenty     askreddit    244
3      1200isplenty        loseit    228
4         2007scape adviceanimals    248
5         2007scape     askreddit   1514
6         2007scape         funny    372


In [3]:
defaults <- c("announcements","art","askreddit","askscience","aww","blog",
             "books","creepy","dataisbeautiful","diy","documentaries","earthporn",
             "explainlikeimfive","fitness","food","funny","futurology","gadgets",
             "gaming","getmotivated","gifs","history","iama","internetisbeautiful",
             "jokes","lifeprotips","listentothis","mildlyinteresting","movies","music",
             "news","nosleep","nottheonion","oldschoolcool","personalfinance",
             "philosophy","photoshopbattles","pics","science","showerthoughts",
             "space","sports","television","tifu","todayilearned","twoxchromosomes","upliftingnews",
             "videos","worldnews","writingprompts")

# %in% already returns TRUE/FALSE, so no ifelse() wrapper is needed
df <- df %>% mutate(connectDefault = Source %in% defaults | Target %in% defaults)
print(tail(df))

Source: local data frame [6 x 4]

     Source       Target Weight connectDefault
      (chr)        (chr)  (int)          (lgl)
1 worldnews youtubehaiku    298           TRUE
2       wow   woweconomy    308          FALSE
3       wow   wowservers    344          FALSE
4       wow          wtf    654          FALSE
5       wtf      xboxone    539          FALSE
6       wtf youtubehaiku    349          FALSE


In [4]:
net <- graph.data.frame(df, directed=F)

print(net)

IGRAPH UN-- 1131 7498 -- 
+ attr: name (v/c), Weight (e/n), connectDefault (e/l)
+ edges (vertex names):
 [1] 10cloverfieldlane--askreddit       1200isplenty     --askreddit      
 [3] 1200isplenty     --loseit          2007scape        --adviceanimals  
 [5] 2007scape        --askreddit       2007scape        --funny          
 [7] 2007scape        --gaming          2007scape        --gifs           
 [9] 2007scape        --globaloffensive 2007scape        --ice_poseidon   
[11] 2007scape        --leagueoflegends 2007scape        --pcmasterrace   
[13] 2007scape        --pics            2007scape        --politics       
[15] 2007scape        --runescape       2007scape        --the_donald     
+ ... omitted several edges


Calculate degree, and remove nodes with only 1 or 2 neighbors for graphing simplicity.

In [5]:
V(net)$degree <- centralization.degree(net)$res
net <- igraph::delete.vertices(net, V(net)[degree < 3])

print(net)

IGRAPH UN-- 517 6732 -- 
+ attr: name (v/c), degree (v/n), Weight (e/n), connectDefault (e/l)
+ edges (vertex names):
 [1] 2007scape--adviceanimals   2007scape--askreddit      
 [3] 2007scape--funny           2007scape--gaming         
 [5] 2007scape--gifs            2007scape--globaloffensive
 [7] 2007scape--leagueoflegends 2007scape--pcmasterrace   
 [9] 2007scape--pics            2007scape--politics       
[11] 2007scape--runescape       2007scape--the_donald     
[13] 2007scape--todayilearned   2007scape--videos         
[15] 2007scape--worldnews       2007scape--wtf            
+ ... omitted several edges
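
The same filter on a toy graph, as a minimal sketch. The edge list below is made up for illustration and is not taken from subreddit_edges.csv. It calls `igraph::degree()` with the namespace spelled out because this notebook loads both sna and igraph, and both export a `degree` function.

```r
library(igraph)

# Toy edge list (hypothetical data, only for illustration)
toy <- data.frame(
  Source = c("a", "a", "a", "b", "b", "c", "d"),
  Target = c("b", "c", "d", "c", "d", "d", "e"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(toy, directed = FALSE)

# Number of edges touching each vertex
V(g)$degree <- igraph::degree(g)

# Drop every vertex with fewer than 3 edges; here only "e" qualifies
g_small <- delete_vertices(g, which(igraph::degree(g) < 3))

print(V(g_small)$name)
print(V(g_small)$degree)
```

Note that the stored `degree` attribute is not recomputed after the removal, so it still describes the original graph. A node that survives the filter can end up with fewer than 3 neighbors in the reduced graph.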


Add more summary statistics to the nodes.

In [6]:
V(net)$group <- membership(cluster_walktrap(net, weights=E(net)$Weight))
V(net)$centrality <- eigen_centrality(net, weights=E(net)$Weight)$vector
V(net)$defaultnode <- V(net)$name %in% defaults

print(head(data.frame(V(net)$name, V(net)$degree, V(net)$centrality, V(net)$group, V(net)$defaultnode)))

    V.net..name V.net..degree V.net..centrality V.net..group V.net..defaultnode
1     2007scape            17       0.013282501            6              FALSE
2           3ds            18       0.007731622           10              FALSE
3         49ers             3       0.001340047           33              FALSE
4         4chan            62       0.055411772            5              FALSE
5        advice             4       0.005108509            7              FALSE
6 adviceanimals           226       0.449733804            5              FALSE
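
A small sketch of the same two statistics on a toy graph: two tightly connected triangles joined by one weak edge. The data and the weights are invented for illustration. The exact group ids that walktrap assigns are not guaranteed here, so the sketch only prints them.

```r
library(igraph)

# Two triangles (a-b-c and d-e-f) joined by a weak c-d bridge (hypothetical data)
toy <- data.frame(
  Source = c("a", "a", "b", "d", "d", "e", "c"),
  Target = c("b", "c", "c", "e", "f", "f", "d"),
  Weight = c(10, 10, 10, 10, 10, 10, 1),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(toy, directed = FALSE)

# Community detection by short random walks, using the edge weights
wt <- cluster_walktrap(g, weights = E(g)$Weight)
V(g)$group <- as.integer(membership(wt))

# Eigenvector centrality; by default the scores are scaled so the maximum is 1
V(g)$centrality <- eigen_centrality(g, weights = E(g)$Weight)$vector

print(data.frame(name = V(g)$name,
                 group = V(g)$group,
                 centrality = round(V(g)$centrality, 3)))
```

The scaling explains why askreddit shows a centrality of exactly 1 later in this notebook: it is the highest-scoring node, and every other score is relative to it.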


## Adding colors

Generate a color palette and assign colors to nodes and edges. Each group gets one color, sampled at random from the four darkest shades of the Blues, Reds, Greens, and Purples ColorBrewer palettes.

In [7]:
color_pool <- c(brewer.pal(9, "Blues")[6:9],
                brewer.pal(9, "Reds")[6:9],
                brewer.pal(9, "Greens")[6:9],
                brewer.pal(9, "Purples")[6:9])

n_colors <- max(V(net)$group)
set.seed(42)
palette <- data.frame(group=1:n_colors, colors=sample(color_pool, n_colors, replace=T), stringsAsFactors=FALSE)

V(net)$colornode <- palette[V(net)$group, 2]
                   
print(head(palette))

  group  colors
1     1 #54278F
2     2 #54278F
3     3 #EF3B2C
4     4 #6A51A3
5     5 #006D2C
6     6 #41AB5D
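
The lookup step `palette[V(net)$group, 2]` works because row i of the palette table belongs to group i, so indexing the rows by group id returns one color per node. A minimal base R sketch, with made-up group ids and a shortened color pool:

```r
# Hypothetical group ids for five nodes
group <- c(2, 1, 3, 2, 1)

# Shortened color pool (the notebook uses 16 ColorBrewer shades)
color_pool <- c("#54278F", "#EF3B2C", "#6A51A3", "#006D2C")

n_colors <- max(group)
set.seed(42)
lookup <- data.frame(group = 1:n_colors,
                     colors = sample(color_pool, n_colors, replace = TRUE),
                     stringsAsFactors = FALSE)

# Row i holds the color of group i, so row-indexing by group id
# yields one color per node
node_color <- lookup[group, "colors"]

print(lookup)
print(node_color)
```

Because the colors are sampled with replacement, two different groups can receive the same color, as groups 1 and 2 do in the output above.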


Prepare data frames for merging, in order to find the edges whose two endpoints are in the same group.

In [8]:
# http://stackoverflow.com/questions/21243965/igraph-get-edge-from-to-value

df_edges <- tbl_df(data.frame(get.edgelist(net), stringsAsFactors=FALSE))
df_vertices <- tbl_df(data.frame(name=V(net)$name, color=V(net)$colornode, group=V(net)$group, stringsAsFactors=FALSE))

print(head(df_edges))
print(head(df_vertices))

Source: local data frame [6 x 2]

         X1              X2
      (chr)           (chr)
1 2007scape   adviceanimals
2 2007scape       askreddit
3 2007scape           funny
4 2007scape          gaming
5 2007scape            gifs
6 2007scape globaloffensive
Source: local data frame [6 x 3]

           name   color group
          (chr)   (chr) (dbl)
1     2007scape #41AB5D     6
2           3ds #00441B    10
3         49ers #A50F15    33
4         4chan #006D2C     5
5        advice #00441B     7
6 adviceanimals #006D2C     5


In [9]:
default_edge_color <- "#cccccc"

df_edges <- df_edges %>% left_join(df_vertices, by=c("X1"="name")) %>% left_join(df_vertices, by=c("X2"="name"))
E(net)$coloredge <- ifelse(df_edges$group.x==df_edges$group.y, df_edges$color.x, default_edge_color)

print(head(df_edges))

Source: local data frame [6 x 6]

         X1              X2 color.x group.x color.y group.y
      (chr)           (chr)   (chr)   (dbl)   (chr)   (dbl)
1 2007scape   adviceanimals #41AB5D       6 #006D2C       5
2 2007scape       askreddit #41AB5D       6 #006D2C       5
3 2007scape           funny #41AB5D       6 #006D2C       5
4 2007scape          gaming #41AB5D       6 #08519C       8
5 2007scape            gifs #41AB5D       6 #006D2C       5
6 2007scape globaloffensive #41AB5D       6 #41AB5D       6
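
The same rule can be sketched without joins by looking up each endpoint in named vectors. This is an alternative illustration in base R, not the approach used in this notebook, and the three-node data is invented.

```r
# Hypothetical edge list and node attributes
edges <- data.frame(X1 = c("a", "a", "b"),
                    X2 = c("b", "c", "c"),
                    stringsAsFactors = FALSE)
group <- c(a = 1, b = 1, c = 2)
color <- c(a = "#41AB5D", b = "#41AB5D", c = "#006D2C")
default_edge_color <- "#cccccc"

# An edge is "within group" when both endpoints share a group id
same_group <- group[edges$X1] == group[edges$X2]

# Within-group edges take the group color, all others the neutral grey
edge_color <- unname(ifelse(same_group, color[edges$X1], default_edge_color))

print(edge_color)
# [1] "#41AB5D" "#cccccc" "#cccccc"
```

Only the a-b edge stays colored, since a and b are the only pair in the same group.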


In [10]:
df_net <- ggnetwork(net, layout = "fruchtermanreingold", weights="Weight", niter=50000)

write.csv(df_net, "df_net.csv", row.names=F)
print(head(df_net))

Loading required package: network
network: Classes for Relational Data
Version 1.13.0 created on 2015-08-31.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
                    Mark S. Handcock, University of California -- Los Angeles
                    David R. Hunter, Penn State University
                    Martina Morris, University of Washington
                    Skye Bender-deMoll, University of Washington
 For citation information, type citation("network").
 Type help("network-package") to get started.


Attaching package: ‘network’

The following objects are masked from ‘package:igraph’:

    %c%, %s%, add.edges, add.vertices, delete.edges, delete.vertices,
    get.edge.attribute, get.edges, get.vertex.attribute, is.bipartite,
    is.directed, list.edge.attributes, list.vertex.attributes,
    set.edge.attribute, set.vertex.attribute

The following object is masked from ‘package:sna’:

    %c%



          x         y  centrality colornode defaultnode degree group  na.x
1 0.2893745 0.4687382 0.013282501   #41AB5D       FALSE     17     6 FALSE
2 0.3214649 0.3795588 0.007731622   #00441B       FALSE     18    10 FALSE
3 0.7652908 0.2848083 0.001340047   #A50F15       FALSE      3    33 FALSE
4 0.4980805 0.4638789 0.055411772   #006D2C       FALSE     62     5 FALSE
5 0.7857507 0.6293891 0.005108509   #00441B       FALSE      4     7 FALSE
6 0.4729610 0.5305755 0.449733804   #006D2C       FALSE    226     5 FALSE
   vertex.names      xend      yend coloredge connectDefault na.y Weight
1     2007scape 0.2893745 0.4687382      <NA>             NA   NA     NA
2           3ds 0.3214649 0.3795588      <NA>             NA   NA     NA
3         49ers 0.7652908 0.2848083      <NA>             NA   NA     NA
4         4chan 0.4980805 0.4638789      <NA>             NA   NA     NA
5        advice 0.7857507 0.6293891      <NA>             NA   NA     NA
6 adviceanimals 0.4729610 0.5305755  

In [11]:
# Use the full column name rather than relying on partial matching of `$default`
df_net_defaults <- df_net[which(df_net$defaultnode),]
print(head(df_net_defaults))

            x         y centrality colornode defaultnode degree group  na.x
20  0.2610215 0.6775423 0.00457658   #006D2C        TRUE      6     5 FALSE
25  0.4901913 0.4970370 1.00000000   #006D2C        TRUE    887     5 FALSE
26  0.4174736 0.6233355 0.01409392   #006D2C        TRUE     18     5 FALSE
39  0.5055989 0.5204532 0.21326200   #006D2C        TRUE    116     5 FALSE
63  0.6083284 0.5272011 0.03790249   #006D2C        TRUE     37     5 FALSE
108 0.5682124 0.5922601 0.03120117   #006D2C        TRUE     29     5 FALSE
    vertex.names      xend      yend coloredge connectDefault na.y Weight
20           art 0.2610215 0.6775423      <NA>             NA   NA     NA
25     askreddit 0.4901913 0.4970370      <NA>             NA   NA     NA
26    askscience 0.4174736 0.6233355      <NA>             NA   NA     NA
39           aww 0.5055989 0.5204532      <NA>             NA   NA     NA
63         books 0.6083284 0.5272011      <NA>             NA   NA     NA
108       creepy 0.56821

In [21]:
default_colors=c("#3498db", "#e67e22")
default_labels=c("Not Default", "Default")

svglite("subreddit-1.svg", width=10, height=8)  
  ggplot(df_net, aes(x = x, y = y, xend = xend, yend = yend, size = centrality)) +
    geom_edges(aes(color = connectDefault), size=0.05) +
    geom_nodes(aes(fill = defaultnode), shape = 21, stroke=0.2, color="black") +
    geom_nodelabel_repel(data=df_net, aes(color = defaultnode, label = vertex.names),
                          fontface = "bold", size=0.5, box.padding = unit(0.05, "lines"),
                          label.padding= unit(0.1, "lines"), segment.size=0.1, label.size=0.2) +
    scale_color_manual(values=default_colors, labels=default_labels, guide=F) +
    scale_fill_manual(values=default_colors, labels=default_labels) +
    ggtitle("Network Graph of Reddit Subreddits (by @minimaxir)") +
    scale_size(range=c(0.1, 4)) + 
    theme_blank()
dev.off()

rsvg_pdf("subreddit-1.svg", "subreddit-1.pdf")

In [22]:
svglite("subreddit-2.svg", width=10, height=8)  
  ggplot(df_net, aes(x = x, y = y, xend = xend, yend = yend, size = centrality)) +
    geom_edges(aes(color = coloredge), size=0.05) +
    geom_nodes(aes(fill = colornode), shape = 21, stroke=0.2, color="black") +
    geom_nodelabel_repel(data=df_net, aes(color = colornode, label = vertex.names),
                          fontface = "bold", size=0.5, box.padding = unit(0.05, "lines"),
                          label.padding= unit(0.1, "lines"), segment.size=0.1, label.size=0.2) +
    scale_color_identity("colornode", guide=F) +
    scale_fill_identity("colornode", guide=F) +
    scale_size(range=c(0.2, 3), guide=F) +
    ggtitle("Network Graph of Reddit Subreddits (by @minimaxir)") +
    theme_blank()
dev.off()

rsvg_pdf("subreddit-2.svg", "subreddit-2.pdf")

In [23]:
subreddit_graph_subset <- function(group_number) {

  # Keep only the rows (nodes and edges) that belong to this group
  df_network <- df_net[which(df_net$group==group_number),]

  plot <- 
    ggplot(df_network, aes(x = x, y = y, xend = xend, yend = yend, size = centrality)) +
    # Draw only within-group edges, i.e. those that did not get the default grey
    geom_edges(data=df_network[which(df_network$coloredge!=default_edge_color),], aes(color = coloredge), size=0.05) +
    geom_nodes(aes(fill = colornode), shape = 21, stroke=0.5, color="black") +
    geom_nodelabel_repel(data=df_network, aes(color = colornode, label = vertex.names),
                          fontface = "bold", family="Open Sans Condensed", size=1.5,
                          box.padding = unit(0.10, "lines"), label.padding= unit(0.1, "lines"),
                          segment.size=0.1, label.size=0.5, label.r=unit(0.15, "lines")) +
    scale_color_identity("colornode", guide=F) +
    scale_fill_identity("colornode", guide=F) +
    scale_size(range=c(0.2, 6), guide=F) +
    ggtitle(sprintf("Network Subgraph of Group %s Subreddits",group_number)) +
    theme_blank(base_size=7, base_family="Source Sans Pro")

  # One PNG per group, with a zero-padded group number in the file name
  ggsave(sprintf("subreddit-groups/group-%03d.png", group_number), plot, width=4, height=3, dpi=300)

}

In [24]:
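# ggsave() expects the output folder to exist, so create it if needed
dir.create("subreddit-groups", showWarnings = FALSE)

x <- lapply(1:max(V(net)$group), subreddit_graph_subset)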

# The MIT License (MIT)

Copyright (c) 2016 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.