Deduplication of digital material is usually considered at the file level, i.e. this file is identical to that file, these files have identical titles, these files are similar.

The Greer Archive contains 421 disk images of various forms of media, all with varying levels of duplication. Duplicates within a digital collection are not necessarily a problem, as duplicates can mean different things in different contexts. Consider, for example, duplicate PowerPoint presentations on media labelled with the event at which the talk was delivered.

However, analysis of duplication at the file level indicates that under a third of the files recovered are unique. This notebook is an attempt to map similarities between groupings of files. It does this by comparing the checksums for the contents of each package (in a BagIt structure) and then mapping the similarity.

First we'll need a list of the bags. 

In [25]:
import os
import bagit
directory = r"C:\Users\lglanville\tester\corpus"
baglist = []
for root, _, files in os.walk(directory):
    for file in files:
        if file == 'bagit.txt':
            baglist.append(bagit.Bag(root))

In [26]:
def filter_hashes(p_entries):
    """
    Helper function to exclude hashes for submission data.
    Extracted files are stored in the 'objects' directory of each bag.
    """
    # Build the prefix with os.path.join so the check is not tied to
    # Windows-style backslashes.
    prefix = os.path.join('data', 'objects') + os.sep
    h = []
    for file, hashes in p_entries.items():
        if file.startswith(prefix):
            h.append(hashes.get('sha256'))
    return h
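
As a rough illustration of what the helper above is filtering, here is a minimal sketch on a hand-written dictionary. The dictionary shape (path mapped to a dict of algorithm and checksum) is assumed to mirror what `payload_entries()` returns; the file names and checksum strings are made up for the example.

```python
import os

# Hypothetical payload entries: path -> {algorithm: checksum}
entries = {
    os.path.join("data", "objects", "talk.ppt"): {"sha256": "aaa"},
    os.path.join("data", "metadata", "log.txt"): {"sha256": "bbb"},
}

# Keep only checksums for files under data/objects
prefix = os.path.join("data", "objects") + os.sep
kept = [h["sha256"] for path, h in entries.items() if path.startswith(prefix)]
print(kept)
```

Only the file under `data/objects` survives, so submission metadata does not inflate the similarity between bags.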

In [27]:
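def compare(a, b):
    """
    Compares the contents of two iterables.
    Score of 0 means b contains no items in a.
    Score of 1.0 means a is completely duplicated within b.
    """
    b_items = set(b)  # set membership avoids rescanning b for every item
    # Count matches and divide once; repeatedly adding 1/len(a) can
    # accumulate floating point error and land just under 1.0.
    matches = sum(1 for h in a if h in b_items)
    return matches / len(a)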
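
It may help to see that this score is not symmetric: it measures how much of the first collection is contained in the second. The sketch below uses a stand-in function with the hypothetical name `containment` and made-up checksum strings, so it can be run on its own.

```python
def containment(a, b):
    """Fraction of items in a that also appear in b (hypothetical stand-in)."""
    b_set = set(b)
    return sum(1 for h in a if h in b_set) / len(a)

small = ["aa", "bb"]
large = ["aa", "bb", "cc", "dd"]

small_in_large = containment(small, large)  # every item of small is in large
large_in_small = containment(large, small)  # only half of large is in small
print(small_in_large, large_in_small)
```

A small bag that is wholly contained in a larger one scores 1.0 in one direction but only 0.5 in the other, which is why the graph built later needs directed edges.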

Now we're going to construct a network graph using the networkx library. Each bag is a node, with a directed edge from one bag to another when more than half of the first bag's files also appear in the second (a similarity score over 0.5). We're also going to pull in some descriptive metadata from the bag-info.txt files, so we can interpret the visualised graph better.

In [None]:
import networkx as nx
g = nx.MultiDiGraph()
# Compute each bag's hashes once and skip bags with no extracted files
h_list = []
for bag in baglist:
    hashes = filter_hashes(bag.payload_entries())
    if hashes:
        h_list.append((bag, hashes))
for bag, hashes in h_list:
    ident = bag.info.get('identifier')
    kw = {'carrier': bag.info.get('carrier'), 'title': bag.info.get('title')}
    g.add_node(ident, **kw)
    for bagtwo, hashtwo in h_list:
        identtwo = bagtwo.info.get('identifier')
        if identtwo != ident:
            s = compare(hashes, hashtwo)
            if s > 0.5:
                g.add_edge(ident, identtwo, weight=s)
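
To make the pairwise logic easier to follow, here is a minimal standard-library sketch of the same idea on toy data, without networkx. The bag identifiers, checksum strings and the `containment` helper are all hypothetical; the edge list stands in for the graph.

```python
from itertools import permutations

# Hypothetical toy data: bag identifier -> list of payload checksums
bags = {
    "bag-a": ["h1", "h2"],
    "bag-b": ["h1", "h2", "h3", "h4"],
    "bag-c": ["h9"],
}

def containment(a, b):
    """Fraction of items in a that also appear in b."""
    b_set = set(b)
    return sum(1 for h in a if h in b_set) / len(a)

# Every ordered pair of distinct bags is scored; only strong matches become edges
edges = []
for src, dst in permutations(bags, 2):
    score = containment(bags[src], bags[dst])
    if score > 0.5:
        edges.append((src, dst, score))
print(edges)
```

Here `bag-a` points at `bag-b` because all of its files appear there, while the reverse score is exactly 0.5 and so falls below the strict threshold.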
          

Here we're going to construct a searchable index, where you can pick a package and it will tell you about similar packages.

In [None]:
import pandas as pd
from ipywidgets import interact
f = nx.to_pandas_edgelist(g)
# Look up descriptive metadata for both ends of every edge
sc = [g.nodes[key]['carrier'] for key in f['source']]
st = [g.nodes[key]['title'] for key in f['source']]
tc = [g.nodes[key]['carrier'] for key in f['target']]
tt = [g.nodes[key]['title'] for key in f['target']]
data = {'source_title': st, 'source_carrier': sc, 'target_title': tt, 'target_carrier': tc}
f = f.join(pd.DataFrame(data=data))
def similar_packages(ind):
    # Named to avoid shadowing the built-in filter()
    return f[f['source'] == ind]
ind = f['source'].drop_duplicates().sort_values().tolist()
interact(similar_packages, ind=ind)

Finally, we're going to visualise the graph using Bokeh (I tried this with Plotly, but it doesn't visualise network graphs as nicely). Darker edges are where packages are more similar. Select a node to bold its connected edges.

In [None]:
from bokeh.io import show, output_notebook
from bokeh.models import Plot, Range1d, MultiLine, Circle, HoverTool, BoxZoomTool, ResetTool, TapTool, DataTable
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
from bokeh.palettes import magma, viridis

edge_attrs = {}
cp = magma(10)
cp.reverse()
for start_node, end_node, key, data in g.edges(data=True, keys=True):
    # Weights fall in (0.5, 1.0], so this uses indices 4 to 9 of the reversed palette
    w = int(data['weight']*10)
    edge_attrs[(start_node, end_node, key)] = cp[w-1]
nx.set_edge_attributes(g, edge_attrs, name="edge_color")

# One colour per distinct carrier type
node_attrs = {}
c = set(data['carrier'] for node, data in g.nodes.data())
m = viridis(len(c))
colours = dict(zip(c, m))
for node, data in g.nodes.data():
    node_attrs[node] = colours[data['carrier']]
nx.set_node_attributes(g, node_attrs, name="node_color")
    
output_notebook(notebook_type='jupyter')
plot = Plot(plot_width=800, plot_height=800,
            x_range=Range1d(-1.1, 1.1), y_range=Range1d(-1.1, 1.1))
plot.title.text = "Bag similarity"

node_hover_tool = HoverTool(tooltips=[("index", "@index"), ("title", "@title"), ("carrier", "@carrier")])
plot.add_tools(node_hover_tool, BoxZoomTool(), ResetTool(), TapTool())

graph_renderer = from_networkx(g, nx.spring_layout, scale=1, center=(0, 0), weight="weight")
graph_renderer.node_renderer.glyph = Circle(size=15, fill_color="node_color")
graph_renderer.edge_renderer.glyph = MultiLine(line_alpha=0.8, line_width="weight", line_color="edge_color")
graph_renderer.edge_renderer.selection_glyph = MultiLine(line_width=5, line_color="edge_color")
graph_renderer.selection_policy = NodesAndLinkedEdges()
plot.renderers.append(graph_renderer)

show(plot)
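
The edge colouring above relies on turning a similarity score into a position in a ten-step palette. A minimal sketch of that mapping follows; the function name `palette_index` and the clamping are additions for illustration and are not part of the notebook's code.

```python
def palette_index(weight, steps=10):
    """Map a similarity score in (0, 1] to an index into a palette of `steps` colours."""
    bucket = int(weight * steps) - 1
    # Clamp so that unexpected scores cannot index outside the palette
    return max(0, min(steps - 1, bucket))

indices = [palette_index(w) for w in (0.51, 0.75, 1.0)]
print(indices)
```

With a palette ordered from light to dark, higher scores land on later and therefore darker entries.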