# Introduction

The [PERCEIVE](https://sailuh.github.io/perceive/) project seeks to proactively identify upcoming cybersecurity threats through textual similarity. Social network analysis can be used to partition a network and evaluate its textual content, and hence provide word-selection criteria for defining corpus documents. 

The subject of this notebook, the [Full Disclosure (FD) mailing list](http://seclists.org/fulldisclosure/), is a "public, vendor-neutral forum for detailed discussion of vulnerabilities and exploitation techniques, as well as tools, papers, news, and events of interest to the community."

## Problem Statement

Although cybersecurity email mailing lists provide a rich source to identify emerging concepts, they contain a large amount of content that is irrelevant to the project's purpose, including but not limited to:

 * conference invitations[[1](http://seclists.org/fulldisclosure/2017/Feb/6)]
 * vendor announcements[[2](http://seclists.org/fulldisclosure/2016/Dec/48)]
 * extensive conversations on security topics[[3](http://seclists.org/fulldisclosure/2004/Jul/1026)]
 * a significant amount of trolling[[4](http://seclists.org/fulldisclosure/2008/Apr/96)] 
 * and nonsense[[5](http://seclists.org/fulldisclosure/2009/Jul/289)] [[6](http://seclists.org/fulldisclosure/2004/Jul/796)].

As some of this irrelevant content can be tied directly to its producers, i.e., the **authors** of the **e-mail replies**, social network analysis offers an opportunity to isolate relevant discussion topics and **filter** out content unrelated to vulnerabilities.

## Network Definition

Earlier in this project, the FD email lists were developed into networks of edges and nodes, divided by year. These original csv files are available [online](https://mega.nz/#F!CUEByR5I!GY56GzTpYz68IlTqj4aQNQ!fR8jFLxL). 

In the FD graphs, nodes represent either documents (i.e., emails) or authors, as identified by a nodeType attribute. Edges are directed, representing authorship and replies; a higher edge weight indicates a larger number of replies.


## Tools used

### Gephi 

While [Gephi](http://gephi.org/) proved useful for data exploration, some difficulties arose. In particular, many of the core functions[[7](https://github.com/jaroslav-kuchar/Social-Network-Analysis/issues/2)] and plugins[[8](https://github.com/gephi/gephi/issues/1481)] once used for social network analysis are not compatible with current (0.9) software versions. Gephi specifications note that the 0.9.0 version (December 2015) "Removed [the] Clustering module"[[9](https://github.com/gephi/gephi/releases/tag/v0.9.0)] that these plugins relied upon.

### Python-igraph 

[Python-igraph](http://igraph.org/python/) (igraph hereafter) allows for programmatic manipulation of network data. Igraph can generate images via a `plot` function, but this is a slow process for large graphs. Igraph has a wide variety of import and export options for network data, including straightforward export of subgraphs.

### Interoperability of Gephi and igraph

While Gephi can export graphs, its export implementations include temporary node properties like position and color attributes. Igraph can import Gephi-exported [GraphML](http://graphml.graphdrawing.org/) files (those are available [here](https://mega.nz/#F!Dpdh2TjD!4Rd462mFXbdFn5Scs1WwUA)). However, it was more useful in the long run to generate graphs in igraph directly from the provided edge and node lists. Igraph's exported GraphML files functioned as expected in Gephi.

## Generating graphs from edge and node lists

In **Gephi**, graphs were generated via the `"import spreadsheet"` function, and [separate Gephi projects](https://github.com/jeffgerhard/perceive_personal/tree/master/gephi_projects) saved for each year. 

In **igraph**, graphs are generated by reading the csv data into lists of dictionaries with Python's `csv.DictReader`, then passing those lists to igraph's `Graph.DictList` constructor.


In [2]:
from igraph import Graph, summary
from csv import DictReader

# 2008 will be the sample year for this introductory analysis, but let's define it as a variable
# for future development of functions that will work for all years
year = '2008'  

# import edge list and vertex list as lists of dictionaries:
e = []
with open('data/author_threadID_edgelist_' + year + '.csv') as csvfile:
    reader = DictReader(csvfile, dialect='excel')
    reader.fieldnames = [name.lower() for name in reader.fieldnames]  # igraph expects lowercase attribute names
    for row in reader:
        row['weight'] = int(row['weight'])  # convert weight from string to int
        e.append(row)
v = []
with open('data/nodelist_aut_doc_' + year +'.csv') as csvfile:
    reader = DictReader(csvfile, dialect='excel')
    reader.fieldnames = [name.lower() for name in reader.fieldnames]  # igraph expects lowercase attribute names
    for row in reader:
        v.append(row)

# build a graph from the lists; see http://igraph.org/python/doc/igraph.Graph-class.html#DictList
ml = Graph.DictList(vertices=v, edges=e, directed=True, vertex_name_attr='id')  # specify the 'id' attribute here
                                                                                # because igraph anticipates a 'name' instead

ml['name'] = year + ' Full Disclosure network'  # provide a name for this graph

summary(ml)  # list properties and attributes

IGRAPH D-W- 9793 12073 -- 2008 Full Disclosure network
+ attr: name (g), color (v), id (v), label (v), label1 (v), label2 (v), nodetype (v), source (e), target (e), weight (e)


The summary command providing the output seen above is explained in the igraph documentation[[10](http://igraph.org/python/doc/igraph.summary'.GraphSummary-class.html)]. In this example, the four-character code "D-W-" indicates that the graph is directed and weighted. The graph for 2008 has 9793 vertices (nodes) and 12073 edges. 

The attributes listed in the summary ("name", "color", etc.) belong to the graph (g), the vertices (v), or the edges (e), as marked.

In the remainder of this notebook, **2008** will be used for single-year analysis.

# Preliminary overview
## Visual assessment via Gephi-generated images

Importing the entire graphs for each year into Gephi allowed for early visual inspection and exploration of the FD lists.  

The following table includes full-network Gephi graphics for each year of the email list's existence; blue nodes represent authors and red nodes represent documents. Click through for full-size renderings.

| | | | |
:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
[![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2002.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2002.png?raw=true)  |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2003.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2003.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2004.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2004.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2005.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2005.png?raw=true)
[![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2006.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2006.png?raw=true)  |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2007.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2007.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2008.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2008.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2009.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2009.png?raw=true)
[![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2010.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2010.png?raw=true)  |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2011.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2011.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2012.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2012.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2013.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2013.png?raw=true) 
[![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2014.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2014.png?raw=true)  |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2015.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2015.png?raw=true) |  [![](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2016.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/adb51a586a8d77c7b0da1f381ba530547a860846/images/fd2016.png?raw=true)  

## The expansion and contraction of Full Disclosure

It is clear from a quick look at the graphs that the overall pattern of the Full Disclosure network has shifted over time. In the early years of the list, there were tightly clustered conversations, particularly from about 2003 to 2006. As the years went by, the level of discussion seems to have dropped. Some of this pattern can be inferred simply from the file sizes of the edge and node lists, or by counting the nodes.
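
Counting the nodes per year can be done directly from the node-list files. The following is a minimal sketch, assuming the per-year files follow the `data/nodelist_aut_doc_<year>.csv` naming used earlier in this notebook and carry a `nodeType` column; the helper name `count_node_types` is hypothetical.

```python
from collections import Counter
from csv import DictReader
import os

def count_node_types(path):
    """Return a Counter of node types found in one node-list CSV."""
    counts = Counter()
    with open(path, newline='') as csvfile:
        reader = DictReader(csvfile, dialect='excel')
        # lowercase the header names, as the import cell above does
        reader.fieldnames = [name.lower() for name in reader.fieldnames]
        for row in reader:
            counts[row.get('nodetype', 'unknown')] += 1
    return counts

# Years covered by the Gephi renderings above (2002-2016); files that are
# not present locally are skipped rather than treated as errors.
for y in range(2002, 2017):
    path = 'data/nodelist_aut_doc_' + str(y) + '.csv'
    if os.path.exists(path):
        counts = count_node_types(path)
        print(y, sum(counts.values()), 'nodes', dict(counts))
```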

The visual impression is further confirmed by a documented change in the FD network in March, 2014, when one of the original managers of the list decided to halt the list entirely[[11](http://seclists.org/fulldisclosure/2014/Mar/332)] but another team soon revived it[[12](http://seclists.org/fulldisclosure/2014/Mar/333)]. 

Following the 2014 format change, the number of edges drops significantly, and the list displays a different network structure from the earlier years. Currently, the list remains active but few members respond to published posts.

## "Firework" patterns

Visual analysis of the graphs for each year showed distinct firework patterns throughout the lifecycle of the list. These are cases where one author posts to the list many times, with little or no response from other people on the list, resulting in a hub-and-spoke graph form. The working hypothesis is that these firework patterns generally represent security advisories. If so, these are discussions of known issues, and not useful for the project's overall predictive goals.

These fireworks were analyzed both visually, in Gephi, and programmatically, in igraph, to test the hypothesis.

In **Gephi**, manual examination included use of the [Linkfluence plugin](https://marketplace.gephi.org/plugin/linkfluence-plugin/) to launch URLs in a web browser directly from nodes studied. This node-by-node analysis demonstrated that firework nodes were security advisories[[13](http://seclists.org/fulldisclosure/2007/Sep/310)][[14](http://seclists.org/fulldisclosure/2007/Apr/291)][[15](http://seclists.org/fulldisclosure/2007/Aug/8)].

In **igraph**, it was informative to assess each author node and examine the in-degree of its connected (document) nodes. In cases where all (or most) of an author's neighbors have an in-degree of one, the author is the hub of a firework. 

In [3]:
print('List of authors and their connected (document) vertices, among first 45 in total list,\nlimited to those with neighbors with max in-degree < 2')
for node in ml.vs(nodetype_eq='author')[0:45]:
    indegrees = [ml.degree(_, mode='in') for _ in ml.neighbors(node)]
    if indegrees:  # a few authors have 0 neighbors... 
        if max(indegrees) < 2:
            print()
            print('-' * 55)
            print(node['label'], '\n')
            print('  In-degree\t Label')
            print('  ---------\t -----')
            for _ in ml.neighbors(node):
                print(' ', ml.degree(_, mode='in'), '\t\t', ml.vs[_]['label'][0:60] )
            print('  MAX:', max(indegrees))
            print('  AVERAGE:', sum(indegrees) / max(len(indegrees), 1))

List of authors and their connected (document) vertices, among first 45 in total list,
limited to those with neighbors with max in-degree < 2

-------------------------------------------------------
andur matrix <andurmatrix () gmail com> 

  In-degree	 Label
  ---------	 -----
  1 		 Re: [OOT] Thesis for master degree
  MAX: 1
  AVERAGE: 1.0

-------------------------------------------------------
Matousec - Transparent security Research <research () matousec com> 

  In-degree	 Label
  ---------	 -----
  1 		 Kerio Fake 'iphlpapi' DLL injection Vulnerability
  1 		 Outpost Bypassing Self-Protection using file	links Vulnerabi
  1 		 Comodo Multiple insufficient argument validation of hooked S
  1 		 Comodo DLL injection via weak hash function	exploitation Vul
  1 		 Comodo Bypassing settings protection using magic	pipe Vulner
  1 		 Norton Insufficient validation of 'SymTDI' driver	input buff
  1 		 Norton Multiple insufficient argument validation of hooked S
  1 		 ZoneAlarm Multiple

The output above (truncated here) reveals that the authors "Matousec" and "Vic Vandal" are centers of fireworks. Email subjects suggest that these are security advisories and conference announcements, respectively.

A few authors in the 2008 network have zero neighboring nodes. These are responses to email threads from earlier years. In general, we are not interested in these delayed responses. Security literature[[16](https://pdfs.semanticscholar.org/6431/5b4290353cf7e46d9cfa6cba566479ff275e.pdf)] suggests that vulnerabilities are only of interest in the short term, and **discussions within a single year** are a reasonable target for our analysis.

# Filtering the network

Igraph allows for removal of vertices from the graph based on useful criteria. Since the "fireworks" are not desired in future analysis, they can be removed.

### Saving some attributes before filtering

Before filtering, it may be useful to store some graph statistics in the node attributes.



In [4]:
pr = ml.pagerank(directed=True, weights='weight')
for idx, _ in enumerate(ml.vs):
    _['original_pagerank'] = pr[idx]
    _['original_enumeration'] = idx  # 0-based position in the vertex list; this can link us back to the node list if needed


A simple function to list total numbers of nodes and authors will be useful as we attempt to shrink the analyzed graph:

In [5]:
def nodeCount(g):
    print('Total vertices for', g['name'], ':\n    ', len(g.vs), 'including', len(g.vs(nodetype_eq='author')), 'authors')

nodeCount(ml)

Total vertices for 2008 Full Disclosure network :
     9793 including 2438 authors


## Removing fireworks
The earlier analysis of the in-degree of authors' neighbors suggested that authors whose neighbors mostly have an in-degree of 1 are centers of fireworks. A fuller examination of the data revealed that these types of list postings occasionally gain responses[[16](http://seclists.org/fulldisclosure/2008/Feb/440)]. Therefore, an author is treated as a firework hub, and removed from the graph, when the _mean_ in-degree of the author's neighboring documents is below 1.2.

In [6]:
# now let's start filtering...
# first remove authors with neighbors mostly of in-degree 1
print('Filtering out authors with neighbors mostly of in-degree 1...')

for node in ml.vs:
    if node['nodetype'] == 'author':
        indegrees = [ml.degree(_, mode='in') for _ in ml.neighbors(node)]
        if indegrees:  # a few authors have 0 neighbors... this is something to investigate later
            if sum(indegrees) / max(len(indegrees), 1) < 1.2: #  there are examples where most of the indegrees are 1 but
                                                     #  we have a random response, so let's use the mean
                ml.delete_vertices(node) #  deleting vertices also deletes all of its edges
ml['name'] = ml['name'] + ' (fireworks removed)'

nodeCount(ml)

Filtering out authors with neighbors mostly of in-degree 1...
Total vertices for 2008 Full Disclosure network (fireworks removed) :
     9253 including 1898 authors


The above reduction in network size is helpful, but the removal of firework hubs is likely to leave a number of isolated documents in the graph. Removing all nodes with degree 0 can complete the _firework cleanup_.

In [7]:
print('Filtering out documents or authors with degree 0...')
for node in ml.vs:
    if ml.degree(node) == 0: 
        ml.delete_vertices(node)
ml['name'] = ml['name'] + ' (isolated nodes removed)'
nodeCount(ml)

Filtering out documents or authors with degree 0...
Total vertices for 2008 Full Disclosure network (fireworks removed) (isolated nodes removed) :
     7179 including 1898 authors


The numbers after firework-filtering are still large. In order to focus more narrowly on relevant portions of this network, some deeper social network analysis techniques are required.

# Social network analysis

The network structure varies considerably across the years. This provides an opportunity to partition a network for a given year into several clusters and investigate if the visualized structure has any association to the textual content of a given **e-mail thread** discussion.

If such association exists, then we can leverage the **social network structure** in order to simplify the identification of emerging concepts through text alone. For instance, identifying a group of individuals who prefer certain subjects, or are spammers or trolls, may become a trivial pre-processing stage before textual content is analyzed for _emerging_ concepts.

We begin by considering two methods for partitioning the network for a given year:

 * Community Detection 
 * Betweenness Centrality

## Community Detection

In real-world social networks, there is a tendency towards clustering of nodes with strong ties: if I have a close friend who has another close friend, I am likely to make a connection with that person. There are mathematical methods for identifying clusters based on the existing edges of their neighbors in comparison to the possible connections among them.

Both Gephi and igraph have clustering functions, but igraph's are more robust and allow us to easily filter out subgraphs.

Igraph's community measures do not support directed graphs, as seen in the warning message below, but the directionality can be removed. It is worth noting that directionality is implied in the structure of the graph (authors write or respond to documents).

In [8]:
com = ml.community_leading_eigenvector(weights='weight', clusters=3)

  membership, _, q = GraphBase.community_leading_eigenvector(self, clusters, **kwds)


In [9]:
ml.to_undirected(combine_edges='max') # eliminate the directionality

# Attempt to identify communities

com = ml.community_leading_eigenvector(weights='weight', clusters=3)
print('clustering attempt, leading eigenvector:')
summary(com)

clustering attempt, leading eigenvector:
Clustering with 7179 elements and 1092 clusters


Igraph accepts a suggested number of communities, but on the entire network the setting has no effect: rather than the 3 clusters requested, the algorithm returns 1092.

It will be useful to define a method of saving cluster information in vertex attributes.

In [10]:
def saveClusterInfo(g, com, attr):
    '''add to graph 'g' a cluster number from community 'com' as 'attr' attribute'''
    for idx, c in enumerate(com):
        for _ in c:
            g.vs[_][attr] = idx

# Apply above function to our overall graph
saveClusterInfo(ml, com, 'filtered_clustering')


## Centrality measures

Igraph allows us to calculate _betweenness centrality_ and _pagerank_. These are related measures for identifying the most central vertices in a graph.

We can add a function to calculate and store these measures as attributes.

In [11]:
# add betweenness centrality and pagerank to a graph as vertex attributes prefixed with `name`
def saveCentrality(g, name):
    bc = g.betweenness()
    for idx, node in enumerate(g.vs):
        g.vs[idx][name + '_betweenness'] = bc[idx]

    pr = g.pagerank(directed=False)
    for idx, node in enumerate(g.vs):
        g.vs[idx][name + '_pagerank'] = pr[idx]
    return bc, pr

bc, pr = saveCentrality(ml, 'filtered')

### Community and centrality analysis

Examining the results in igraph can reveal some useful information. Beginning with the clusters:

In [12]:
# First, we will define a function to summarize the clustering attempt.
def summarizeClusters(com, n=5):
    print(len(com), 'clusters.')
    print('maximum size:', len(max(com, key=len)))
    print('minimum size:', len(min(com, key=len)))

    n = min(n, len(com))  # avoid an IndexError when fewer than n clusters exist
    print('\nSummary of first', n, 'clusters:')
    for i in range(n):
        etc = ''
        if len(com[i]) > 5:
            etc += '(and ' + str(len(com[i]) - 5) + ' more)'
        print('[{}] has size of {}. Vertices: {} {}'.format(i, len(com[i]), com[i][0:5], etc))

summarizeClusters(com, n=10)

1092 clusters.
maximum size: 5594
minimum size: 1

Summary of first 10 clusters:
[0] has size of 5594. Vertices: [0, 1, 2, 3, 4] (and 5589 more)
[1] has size of 8. Vertices: [37, 1922, 1970, 2242, 2730] (and 3 more)
[2] has size of 4. Vertices: [44, 1931, 2847, 6026] 
[3] has size of 2. Vertices: [80, 1973] 
[4] has size of 4. Vertices: [81, 83, 1974, 4715] 
[5] has size of 2. Vertices: [101, 1997] 
[6] has size of 3. Vertices: [107, 2009, 2686] 
[7] has size of 5. Vertices: [118, 136, 2017, 2020, 2021] 
[8] has size of 5. Vertices: [130, 2033, 2186, 2649, 3428] 
[9] has size of 4. Vertices: [133, 2036, 2037, 2094] 


The above output immediately demonstrates that one cluster is by far the largest in the network -- consisting here of 5594 nodes (out of 7179 in the total network). Most of the other clusters are tiny in comparison.

The initial impressions from betweenness measures are also interesting.


In [13]:
# sort the betweenness and then list nodes in order of centrality
bc.sort(reverse=True)
max_bc = max(bc)
bc_normalized = [x / max_bc for x in bc]

for idx,val in enumerate(bc_normalized[0:15]):
    print('Node', idx, ':', '{:.0%}'.format(bc_normalized[idx]))

Node 0 : 100%
Node 1 : 62%
Node 2 : 50%
Node 3 : 48%
Node 4 : 48%
Node 5 : 48%
Node 6 : 29%
Node 7 : 27%
Node 8 : 19%
Node 9 : 18%
Node 10 : 17%
Node 11 : 16%
Node 12 : 15%
Node 13 : 15%
Node 14 : 14%


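Betweenness centrality drops off quickly after the top few nodes. This may indicate that a handful of nodes contribute to e-mail discussions on several different subjects, bridging different "communities of threats". Note that the "Node" numbers in the listing above are ranks in the sorted list, not vertex indices, because the values were sorted before printing.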
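
To see which vertices hold the top positions, the ranking can be done over indices instead of over the values themselves. The following is a minimal sketch in plain Python; the helper name `rank_by_score` is hypothetical and the example values are stand-ins, not taken from the FD data.

```python
def rank_by_score(scores, top=15):
    """Return (vertex index, score / max score) pairs, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    max_score = scores[order[0]] if order else 0
    return [(i, scores[i] / max_score if max_score else 0.0) for i in order[:top]]

# stand-in betweenness values for five vertices
example_bc = [2.0, 10.0, 0.0, 5.0, 1.0]
for idx, share in rank_by_score(example_bc, top=3):
    print('Vertex', idx, ':', '{:.0%}'.format(share))
```

With the notebook's graph, `rank_by_score(ml.betweenness())` would yield indices that can be looked up with `ml.vs[idx]['label']`. The betweenness list should be recomputed for this purpose, since an in-place sort discards the original order.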

At this point it is helpful to look at some visualizations to assess the 2008 graph in its current state.

### Visualizing the current graph

In [14]:
# export the ml graph
filename = 'data/out_ml' + year + '_filtered.graphml'
ml.save(filename, format='graphml') # export graph

When the resulting graph is opened in Gephi, and colored by cluster, it is apparent that one large "blob" still comprises the bulk of the network.

![Filtered graph](https://github.com/jeffgerhard/perceive_personal/blob/master/images/2008_clusters_betweenness.png?raw=true "Filtered graph")

There are still too many communities for easy analysis. The next objective is to isolate the blob.

## Breaking down the network into subgraphs

### The blob
In the case of year 2008, the blob is easily identifiable, taking up the majority of the graph. In other years -- particularly later years -- this will not be the case. More sophisticated selection criteria will have to be developed based on the initial clustering results.

In [15]:
# The blob has been manually identified as cluster 0

blob = ml.induced_subgraph(com[0])
blob['name'] = year + ' subgraph for cluster 0 (blob)'
summary(blob)

IGRAPH U-W- 5594 8576 -- 2008 subgraph for cluster 0 (blob)
+ attr: name (g), color (v), filtered_betweenness (v), filtered_clustering (v), filtered_pagerank (v), id (v), label (v), label1 (v), label2 (v), nodetype (v), original_enumeration (v), original_pagerank (v), source (e), target (e), weight (e)


In [16]:
# Now working with the blob, we can re-attempt clustering and bc!

com = blob.community_leading_eigenvector(weights='weight', clusters=8)
print('clustering attempt, leading eigenvector:')
summary(com)
saveClusterInfo(blob, com, 'blob_clustering')

bc, pr = saveCentrality(blob, 'blob')

clustering attempt, leading eigenvector:
Clustering with 5594 elements and 8 clusters


Igraph's clustering algorithms are far more effective on the blob than they were on the whole graph. Here, we can specify the number of clusters we are seeking. Experimenting with results, though, demonstrated that there was little significant difference among different numbers of communities; in each case the next steps are the same.

### The "blob" colored by subcommunity
[8-cluster visualization](https://github.com/jeffgerhard/perceive_personal/blob/master/images/blob8clusters.png?raw=true) | [15-cluster visualization](https://github.com/jeffgerhard/perceive_personal/blob/master/images/blob15clusters.png?raw=true) |
:-------------------------:|:-------------------------:|
[![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/blob8clusters.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/blob8clusters.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/blob15clusters.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/blob15clusters.png?raw=true)

The blob displays distinctive subcommunities. A few of these new clusters _still_ include fireworks (green in these visualizations), but others (colored purple here) are strongly suggestive of discussion communities in which multiple authors interact with each other's messages.

Examining these subcommunities in turn provides an opportunity to discover real communities and begin to develop a **corpus** from the Full Disclosure discussion.

## Saving the blob subcommunities

Using 8 clusters as a baseline for analysis, the relevant subgraphs can include all useful attributes and be exported in turn.


In [17]:
# now let's try each cluster separately

subs = {}  # build all the clusters into a dictionary
for idx, c in enumerate(com):
    subs[idx] = blob.induced_subgraph(com[idx])
    subs[idx]['name'] = year + ' blob subgraph ' + str(idx) + ' of ' + str(len(com))
    # rerun the centrality, clustering, pagerank analyses
    subcom = subs[idx].community_leading_eigenvector(weights='weight', clusters=5)
    saveClusterInfo(subs[idx], subcom, 'local_clustering')
    bc, pr = saveCentrality(subs[idx], 'local')    
    filename = 'data/' + subs[idx]['name'].replace(' ', '_') + '_out.graphml'
    subs[idx].save(filename, format='graphml')

Loading each subcommunity into Gephi, and retaining the original coloring (red nodes for authors, black for documents), the different graph structures for each subcommunity become clear.

### Each subcommunity of the 2008 blob
[Sub-cluster 0](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph0.png?raw=true) | [Sub-cluster 1](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph1.png?raw=true) | [Sub-cluster 2](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph2.png?raw=true) | [Sub-cluster 3](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph3.png?raw=true) |
:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|
[![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph0.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph0.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph1.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph1.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph2.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph2.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph3.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph3.png?raw=true)
[**Sub-cluster 4**](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph4.png?raw=true) | [**Sub-cluster 5**](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph5.png?raw=true) | [**Sub-cluster 6**](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph6.png?raw=true) | [**Sub-cluster 7**](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph7.png?raw=true)
[![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph4.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph4.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph5.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph5.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph6.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph6.png?raw=true) | [![](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph7.png?raw=true)](https://github.com/jeffgerhard/perceive_personal/blob/master/images/subgraph7.png?raw=true)

A few of these subgraphs (i.e., sub-clusters 2 and 6) are too small to be worth analyzing, but the remaining groups represent real communities in dialogue. Thus, the sub-clusters represent opportunities to identify individual authors and attempt to identify conversational topics. Two questions remain:

- Do these groupings have distinctive characteristics? 
- How can we assess the remaining firework features (e.g. in subcluster 7 above)? Is it reasonable to generalize the original filtering process at the subgraph level?

### Cluster features and textual analysis

In [21]:
summary(subs[3])

IGRAPH U-W- 4151 4039 -- 2008 blob subgraph 3 of 8
+ attr: name (g), blob_betweenness (v), blob_clustering (v), blob_pagerank (v), color (v), filtered_betweenness (v), filtered_clustering (v), filtered_pagerank (v), id (v), label (v), label1 (v), label2 (v), local_betweenness (v), local_clustering (v), local_pagerank (v), nodetype (v), original_enumeration (v), original_pagerank (v), source (e), target (e), weight (e)


- **Cluster 0** (161 nodes, 183 edges) features a number of discussions of _exploit auctions or sales_[[17](http://seclists.org/fulldisclosure/2007/Jul/93)], [[18](http://seclists.org/fulldisclosure/2007/Jul/116)], [[19](http://seclists.org/fulldisclosure/2007/Nov/226)] and includes a _potential_ firework clustering around user `Nick FitzGerald <nick () virus-l demon co uk>`. A researcher more familiar with security matters should assess whether or not the Nick FitzGerald messages (for example [[20](http://seclists.org/fulldisclosure/2007/Jan/443)], [[21](http://seclists.org/fulldisclosure/2008/Jan/422)], [[22](http://seclists.org/fulldisclosure/2008/Dec/368)]) are valuable for further study; they are primarily general discussions of security topics, in conversation with a wider group of list members.

- **Cluster 1** (488 nodes, 983 edges) is heavily dominated by three authors, `Ureleet <ureleet () gmail com>`, `Valdis.Kletnieks () vt edu`, and above all `n3td3v <xploitable () gmail com>`, an infamous list member discussed in several forums even outside of Full Disclosure[[23](http://it.toolbox.com/blogs/managing-infosec/security-trolls-n3td3v-12460)], [[24](http://hackerfactor.com/papers/who_is_n3td3v.pdf)]. n3td3v responds to a wide range of topics, with over 300 separate emails in the overall 2008 graph. The fact that these users are not typical makes analysis difficult. The best practice is to treat them as trolls (n3td3v is referred to as a troll in the cited links, with some caveats). Conveniently, the blob clustering has linked together all of these troll-flavored postings into one giant subgraph of argumentation and flame wars. (Examples of such posts include [[25](http://seclists.org/fulldisclosure/2008/Dec/200)], [[26](http://seclists.org/fulldisclosure/2008/Jun/104)], [[27](http://seclists.org/fulldisclosure/2008/Oct/65)].) This cluster can be dropped from further discussion, as can **Cluster 2**, mentioned earlier, because of its small size of 26 nodes.

- **Cluster 3** is the largest in the blob and corresponds to the large purple communities in the blob overviews above. This cluster 

Remaining fireworks

Conclusions

Suggestions for further research