## A network analysis of the texts in books of hours

The research project aims to answer the following questions:

* Which texts co-occur in books? In other words, which texts were often combined in books of hours?
* Which texts were unique?



Which texts co-occur in books?

Collect all texts they share (through an analysis of the paths; betweenness centrality)

Which texts are unique to specific books?

Are there differences per century?



Manuscripts that occur together:
shortest path of 3
Count the number of paths
Create a Gephi file
For the total and per century

Visualisation in a heatmap

Identify all manuscripts that occur only once



In [None]:
import os
import re
import json
import requests
import bnm
from tdmh import *
from operator import itemgetter
from pyvis.network import Network
import pandas as pd

## Creating the data set

To address these questions, we first select the data we work with. The data have all been exported from the BNM-i, a database constructed by the Huygens ING. The texts have all been saved as separate JSON files; 41902 files have been downloaded in total. The code below selects the texts which have been assigned a category containing a word that starts with 'getijden' (hours) or 'gebeden' (prayers). This is the case for 5437 texts.

In [None]:
files = os.listdir('BNM_texts')
print(f'{len(files)} texts in total.')

selected_texts = []

for file in files:
    if re.search( r'\.json$' , file ):
        path = os.path.join('BNM_texts' , file)
        # Use a context manager so that the file handle is closed again
        with open( path , encoding = 'utf-8') as json_file:
            json_data = json.load(json_file)
        
        categories = bnm.get_categories(json_data)
        for c in categories:
            
            if re.search( r'\bgetijden' , c ) or re.search( r'\bgebeden' , c ):
                selected_texts.append( path )
            
selected_texts = list(set(selected_texts))
print( f'{len(selected_texts)} texts selected.')

A further selection takes place. We focus exclusively on the texts that have been assigned a standardised title.

The code below navigates across all the texts, and establishes the carriers (or the books) these texts are in. 

In [None]:
out = open('books_of_hours.tsv' , 'w' , encoding = 'utf-8')
out.write('text_id\ttext_title\tbook_id\tbook_title\tyear\n')

count = 0 
norm_titles = []


for text in selected_texts:
    # Use a context manager so that the file handle is closed again
    with open( text, encoding = 'utf-8') as json_file:
        json_data = json.load(json_file)
    #print(text)
    norm_title = bnm.get_norm_title(json_data)

    if len(norm_title) > 0:
        #print(norm_title)
        
        count += 1

        title = norm_title[0][1]
        # Strip generic prefixes from the title (raw strings keep '\s' intact)
        title = re.sub(r'Ongespecificeerde Mnl\. teksten op naam van\s+' , '' , str(title))
        title = re.sub(r'Mnl\. vertaling(en)? van\s+' , '' , str(title))
        title = re.sub(r'MNoordnederlandse vertaling van\s+' , '' , str(title))
        
        norm_titles.append(title)
        # Find information about the carrier (the book) this text is in.
        book = bnm.get_text_carrier(json_data)
        
        ## get the date of the book from the JSON file of the carrier
        path = os.path.join( 'BNM_carriers' , book[0][0] ) + '.json'
        with open( path , encoding = 'utf-8') as book_file:
            json_book = json.load(book_file)
        book_date = json_book['datering']
        
        
        # normalise book_date
        date_norm = 0
        if re.search( r'/' , str(book_date)):
            parts = re.split( r'/' , book_date )
            date_norm = ( int(parts[0]) + int(parts[1]) )/2
        elif re.search( r'[?]{2}' , str(book_date)):
            date_norm = re.sub( r'[?]{2}' , '50' , book_date )
        elif book_date is not None:
            date_norm = book_date
        
        if re.search( r'\d' , str(date_norm) ):
            date_norm = int(date_norm)
        else:
            date_norm = None
        
        
        out.write(f'{norm_title[0][0]}\t{title}\t{book[0][0]}\t{book[0][1]}\t{date_norm}\n')
        

out.close()

print( f'{count} texts have been assigned normalised titles.' )
print( f'There are {len(selected_texts)-count} without a normalised title.' )
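
The date normalisation in the cell above can also be expressed as a small function, which makes it easier to check on individual values. The block below is a minimal sketch, not the code used for the export: the function names `normalise_date` and `century` are hypothetical, and the two input formats (a range such as '1450/1475' and an uncertain date such as '14??') are assumptions inferred from the regular expressions in the cell above. Unlike the cell above, the sketch returns `None` for values it cannot parse instead of raising an error.

```python
import re

def normalise_date(book_date):
    """Return a single integer year for a raw date value, or None."""
    if book_date is None:
        return None
    raw = str(book_date).strip()
    # A range such as '1450/1475' is reduced to its midpoint.
    if '/' in raw:
        start, end = (part.strip() for part in raw.split('/', 1))
        if start.isdigit() and end.isdigit():
            return (int(start) + int(end)) // 2
        return None
    # Two question marks, as in '14??', are replaced by '50' (mid-century).
    raw = re.sub(r'[?]{2}', '50', raw)
    return int(raw) if raw.isdigit() else None

def century(year):
    """Return the century a year falls in (1401-1500 is the 15th century)."""
    return None if year is None else (year - 1) // 100 + 1
```

A `century` column derived in this way could be used to split `books_of_hours.tsv` into one subset per century, so that the analyses below can be repeated for each subset.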


To perform the network analysis, we create an edges file and a nodes file. The nodes in this two-mode (bipartite) network are the texts and the books these texts are in. Each edge is written with the text as source and the book as target: it represents the notion that a text occurs in a book.



In [None]:
nodes_file = open('nodes.tsv' , 'w' , encoding = 'utf-8')
edges_file = open('edges.csv' , 'w' , encoding = 'utf-8')

nodes_file.write('Id\tLabel\tType\n')
edges_file.write('Source,Target\n')

titles_dict = dict()
types_dict = dict()


nodes = []

df = pd.read_csv( 'books_of_hours.tsv' , sep = '\t' )

for i,row in df.iterrows():
    nodes.append(row['text_id'])
    titles_dict[ row['text_id'] ] = row['text_title'].strip()
    types_dict[ row['text_id'] ] = 'Text'
    
    nodes.append(row['book_id'])
    titles_dict[ row['book_id'] ] = row['book_title'].strip()
    types_dict[ row['book_id'] ] = 'Book'
    edges_file.write( f"{row['text_id']},{row['book_id']}\n" )

nodes = list(set(nodes))

for n in nodes:
    nodes_file.write( f"{n}\t{titles_dict[n]}\t{types_dict[n]}\n" )

edges_file.close()
nodes_file.close()

The collections of nodes and edges are read into Pandas data frames.

In [None]:
nodes_df = pd.read_csv('nodes.tsv' , sep = '\t' )
edges_df = pd.read_csv('edges.csv' )

## Network analysis

Now that we have all the nodes and the edges, we are ready to perform the network analysis. We first create a network of all the nodes. Texts are shown in orange, and books are shown in blue. The visualisation reveals that a number of texts appear in many different books. This is the case for 'Teksten op naam van Bernard Clairvaux' and 'Teksten op naam van Augustinus'.

In [None]:

net = Network(notebook=True , height="750px", width="100%" , bgcolor="#dce5f2" )

net.force_atlas_2based(
        gravity=-60,
        central_gravity=0.01,
        spring_length=100,
        spring_strength=0.08,
        damping=0.4,
        overlap= 0 )
               
for i,row in nodes_df.iterrows():
    node = row['Id']
    label= row['Label']
    if row['Type'] == 'Text':
        c ='#EE7733'
    else:
        c = '#007788'  
    net.add_node( node , title=label,  color= c , value = 15 )
                

for i,row in edges_df.iterrows():
    net.add_edge( row['Source'] , row['Target'] )
                              


net.show( f'network1.html')

We can analyse the networks in Python using the `networkx` package. 

In [None]:

import networkx as nx
from networkx.algorithms import community 

G = nx.Graph()

for i,row in nodes_df.iterrows():
    G.add_node( row['Id'] , type = row['Type'])
                
for i,row in edges_df.iterrows():
    G.add_edge( row['Source'] , row['Target'] )



In [None]:
# nx.info() was removed in NetworkX 3.0; printing the graph gives a similar summary
print(G)

We want to establish the texts that co-occur in a book. 

In [None]:
all_nodes = G.nodes()



cooccurring_texts = dict()
unique = []


for node1 in all_nodes:
    count = 0 
    if types_dict[node1] == 'Text': 
        for node2 in all_nodes:
            if types_dict[node2] == 'Text' and node1 != node2:
                nr_paths = nx.all_simple_paths( G,node1,node2 , 3 )
                for path in nr_paths:
                    count += 1
                    cooccurring_texts[(path[0],path[2]) ] = cooccurring_texts.get( (path[0],path[2]) , 0 )+1
        if count == 0:
            unique.append(node1)
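
The nested loop above runs a path search for every pair of texts, which can become slow for a few thousand nodes. Because the network is bipartite, the same counts can probably be obtained from a weighted projection onto the text nodes. The block below is a minimal sketch on a hypothetical toy graph (the node names `t1` to `t4` and `b1` to `b3` are invented for illustration); it uses `bipartite.weighted_projected_graph` from `networkx`, in which the weight of an edge is the number of neighbours (here: books) that two nodes share.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy network: four texts and three books
B = nx.Graph()
B.add_nodes_from(['t1', 't2', 't3', 't4'], type='Text')
B.add_nodes_from(['b1', 'b2', 'b3'], type='Book')
B.add_edges_from([('t1', 'b1'), ('t2', 'b1'),
                  ('t1', 'b2'), ('t2', 'b2'), ('t3', 'b2'),
                  ('t4', 'b3')])

text_nodes = [n for n, d in B.nodes(data=True) if d['type'] == 'Text']

# Project the two-mode network onto the texts;
# the edge weight is the number of books two texts share
P = bipartite.weighted_projected_graph(B, text_nodes)

shared = {tuple(sorted((u, v))): d['weight'] for u, v, d in P.edges(data=True)}

# Texts without any neighbour in the projection share no book with another text
isolated = sorted(n for n in P.nodes() if P.degree(n) == 0)
```

Note that this dictionary contains each pair of texts once, whereas `cooccurring_texts` above contains every pair in both directions.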


In [None]:
print('The following texts do not share a book with any other selected text:\n')

for t in unique:
    print(f'{titles_dict[t]} ({t})' )

This network can be plotted. The visualisation displays all the texts that cooccur in one or more books. It looks as if there are a number of 'cliques' consisting of texts that appear together. 

In [None]:

net = Network(notebook=True , height="750px", width="100%" , bgcolor="#dce5f2" )

net.force_atlas_2based(
        gravity=-60,
        central_gravity=0.01,
        spring_length=100,
        spring_strength=0.08,
        damping=0.4,
        overlap= 0 )
               
for node in cooccurring_texts:
    net.add_node( node[0]  )
    net.add_node( node[1]  )
                

for node in cooccurring_texts:
    net.add_edge( node[0] , node[1] , value = cooccurring_texts[node] )
                              

net.show( f'network2.html')

The information about the intensity of the cooccurrences (i.e. how often do two different texts cooccur?) can be visualised by varying the thickness of the edges. Such a visualisation can also be created in Gephi. The network should be imported as an undirected graph.

The cell below generates the CSV files that can be used for this purpose. 



In [None]:
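# Collect every text that takes part in at least one cooccurrence
all_nodes = []
for node in cooccurring_texts:
    all_nodes.append(node[0])
    all_nodes.append(node[1])

all_nodes = list(set(all_nodes))

# The context managers close both files once all rows have been written
with open('cooccurrences_nodes.csv' , 'w' , encoding = 'utf-8') as n:
    # The header needs its own line
    n.write('Id\n')
    for node in all_nodes:
        # Write the node id (not the file object)
        n.write(f'{node}\n')

with open('cooccurrences_edges.csv' , 'w' , encoding = 'utf-8') as e:
    e.write('Source,Target,Weight\n')
    for c in cooccurring_texts:
        e.write(f'{c[0]},{c[1]},{cooccurring_texts[c]}\n')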
    

In [None]:
from tdmh import sortedByValue

cooccurring_text_deduplicated = dict()
for c in cooccurring_texts:
    if (c[1],c[0]) not in cooccurring_text_deduplicated:
        cooccurring_text_deduplicated[c] = cooccurring_texts[c]


for c in sortedByValue(cooccurring_text_deduplicated , ascending = False ):
    if cooccurring_text_deduplicated[c] > 1:
        print( f'{titles_dict[c[0]]}({c[0]}) and {titles_dict[c[1]]}({c[1]}) occur together {cooccurring_text_deduplicated[c]} times \n' )
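
The planning notes at the top of this notebook mention a heatmap. A heatmap needs a square matrix with one row and one column per text. The block below is a minimal sketch of how such a matrix could be built with Pandas; the dictionary `pair_counts` and its values are hypothetical stand-ins for `cooccurring_text_deduplicated`.

```python
import pandas as pd

# Hypothetical counts: (text, text) -> number of books in which both occur
pair_counts = {('t1', 't2'): 3, ('t1', 't3'): 1}

labels = sorted({t for pair in pair_counts for t in pair})

# Start with a matrix of zeros and fill in both halves symmetrically
matrix = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = n
    matrix.loc[b, a] = n
```

The resulting data frame could then be rendered with, for example, `matplotlib.pyplot.imshow` or `seaborn.heatmap`.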


The cell below lists the nodes with the highest degree. For a text, the degree is the number of books it occurs in; for a book, it is the number of selected texts it contains.

In [None]:
degrees = dict(G.degree(G.nodes()))

for d in sortedByValue( degrees , ascending = False ):
    if degrees[d] > 1:
        print( f'{titles_dict[d]} ({d}) => {degrees[d] }' )
    


The following texts are unique in the network: they occur in only one book.

In [None]:
for d in sortedByValue( degrees , ascending = False ):
    if degrees[d] == 1 and types_dict[d] == 'Text':
        print( f'{titles_dict[d]} ({d})' )


Which books do these texts appear in?

In [None]:
for c in sortedByValue(cooccurring_text_deduplicated , ascending = False ):
    if cooccurring_text_deduplicated[c] > 1:
        nr_paths = nx.all_simple_paths( G,c[0],c[1] , 3 )
        for path in nr_paths:
            print(f'{path[0]} and {path[2]} both occur in ')
            print( f'{titles_dict[path[1]]} ({path[1]})\n')

In [None]:
print( f"Network density: {nx.density(G) }" )
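
The density reported by `nx.density` compares the number of edges with the number of possible edges between any two nodes, $2m / (n(n-1))$ for an undirected graph with $n$ nodes and $m$ edges. In a two-mode network, edges between two texts or between two books cannot exist, so this value understates how full the network is. The block below is a minimal sketch on a hypothetical toy graph that compares it with `bipartite.density`, which divides the number of edges by the number of possible text-book pairs.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy network: three texts and two books
texts = ['t1', 't2', 't3']
books = ['b1', 'b2']

B = nx.Graph()
B.add_nodes_from(texts, type='Text')
B.add_nodes_from(books, type='Book')
B.add_edges_from([('t1', 'b1'), ('t2', 'b1'), ('t2', 'b2'), ('t3', 'b2')])

# General density: 2m / (n(n-1)) = 8 / 20
general_density = nx.density(B)

# Bipartite density: m / (|texts| * |books|) = 4 / 6
two_mode_density = bipartite.density(B, texts)

print(f'General density: {general_density:.3f}')
print(f'Bipartite density: {two_mode_density:.3f}')
```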

In [None]:
all_nodes = G.nodes()


all_books = []
all_texts = []
for node in all_nodes:
    if types_dict[node] == 'Book': 
        all_books.append(node)
    else:
        all_texts.append(node)
        
books_dict = dict()


## Create a list of all the texts in each book
for book in all_books:
    texts_list = []
    for t in all_texts:
        nr_paths = nx.all_simple_paths( G,book,t , 2 )
        for path in nr_paths:
            texts_list.append( path[1] )
    books_dict[book] = texts_list
    
## next, create an overview of the number of texts the books have in common

books_edges = dict()

for book1 in books_dict:
    for book2 in books_dict:
        if book1 != book2:
            
            intersection = list(set(books_dict[book1]) & set(books_dict[book2]))
            if len(intersection) > 2:
                books_edges[(book1,book2)] = len(intersection)
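
Since every text is linked directly to the books it occurs in, the list of texts per book can also be read from the neighbours of each book node, without a path search. The block below is a minimal sketch of this alternative on a hypothetical toy graph; it keeps the threshold used above (more than two shared texts) and visits each pair of books once via `itertools.combinations`, so each pair appears in one direction only.

```python
from itertools import combinations
import networkx as nx

# Hypothetical toy network: b1 and b2 share three texts, b3 shares one
B = nx.Graph()
B.add_nodes_from(['t1', 't2', 't3', 't4'], type='Text')
B.add_nodes_from(['b1', 'b2', 'b3'], type='Book')
B.add_edges_from([('t1', 'b1'), ('t2', 'b1'), ('t3', 'b1'), ('t4', 'b1'),
                  ('t1', 'b2'), ('t2', 'b2'), ('t3', 'b2'),
                  ('t1', 'b3')])

books = [n for n, d in B.nodes(data=True) if d['type'] == 'Book']

# The neighbours of a book node are the texts it contains
texts_per_book = {b: set(B.neighbors(b)) for b in books}

shared_counts = {}
for b1, b2 in combinations(sorted(books), 2):
    common = texts_per_book[b1] & texts_per_book[b2]
    if len(common) > 2:
        shared_counts[(b1, b2)] = len(common)
```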
                


In [None]:
sbn = open( 'similar_books_nodes.csv' , 'w' ) 
sbe = open( 'similar_books_edges.csv' , 'w' )

sbn.write('Id\n')
sbe.write('Source,Target\n')


nodes = []

for be in books_edges:
    sbe.write(f'{be[0]},{be[1]}\n')
    if be[0] not in nodes:
        nodes.append(be[0])
    if be[1] not in nodes:
        # Add the second book of the pair (not the first one again)
        nodes.append(be[1])

for n in nodes:
    sbn.write(f'{n}\n')
    
sbn.close()
sbe.close()

In [None]:

net = Network(notebook=True , height="750px", width="100%" , bgcolor="#dce5f2" )

net.force_atlas_2based(
        gravity=-60,
        central_gravity=0.01,
        spring_length=100,
        spring_strength=0.08,
        damping=0.4,
        overlap= 0 )
               
for node in books_edges:
    net.add_node( node[0]  )
    net.add_node( node[1]  )
                

for node in books_edges:
    net.add_edge( node[0] , node[1] , value = books_edges[node] )
                              

net.show( f'network3.html')

## Some other analyses

In [None]:
betweenness_dict = nx.betweenness_centrality(G) 
eigenvector_dict = nx.eigenvector_centrality(G) 

nx.set_node_attributes(G, betweenness_dict, 'betweenness')
nx.set_node_attributes(G, eigenvector_dict, 'eigenvector')

sorted_betweenness = sorted(betweenness_dict.items(), key=itemgetter(1), reverse=True)

print("5 nodes with the highest betweenness centrality:")
for b in sorted_betweenness[:5]:
    print( f'{titles_dict[b[0]]} ({b[0]}), {b[1]} ' ) 
    
print('\n')    
    
sorted_eigen = sorted(eigenvector_dict.items(), key=itemgetter(1), reverse=True)

print("5 nodes with the highest eigenvector centrality:")
for e in sorted_eigen[:5]:
    print( f'{titles_dict[e[0]]} ({e[0]}), {e[1]} ' )   
