# AUTHOR COLLABORATION NETWORK

The objective of this notebook is to build an author collaboration graph and extract some features from it.

## 1. Processing of the authors

We will encode the author names as integer identifiers so that we can manipulate them easily.

In [1]:
all_authors = set() # set of all the authors in the dataset
node_authors = dict() # dictionary {node/paper : list of author names}
with open("../data/initial_data/authors.txt", 'r') as f:
    for line in f:
        node, authors = line.rstrip('\n').split('|--|')
        authors = authors.split(',')
        node_authors[int(node)] = authors
        all_authors |= set(authors) # '|' is the union operator

In [2]:
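id2author = dict() # dictionary {author id : author name}
author2id = dict() # dictionary {author name : author id}
# Sort the names so that ids are reproducible across runs (set iteration order is not guaranteed)
for i, author in enumerate(sorted(all_authors)):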
    id2author[i] = author
    author2id[author] = i
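
As a small illustration of this encoding, the sketch below builds both mappings from a made-up set of names (the names are placeholders, not taken from the dataset). Sorting the names is a choice made here so that the ids come out the same on every run.

```python
# Toy author names (placeholders, not from the dataset)
toy_authors = {"Ada Lovelace", "Alan Turing", "Grace Hopper"}

# Sort so that the same name always gets the same id across runs
toy_id2author = dict(enumerate(sorted(toy_authors)))
toy_author2id = {name: id_ for id_, name in toy_id2author.items()}

# Encode one paper's author list as a list of ids
paper = ["Grace Hopper", "Ada Lovelace"]
encoded = [toy_author2id[name] for name in paper]
print(encoded)  # [2, 0]
```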

In [3]:
# Save the (id,name) pairs as a text file
with open("../data/authors_processed/id2author.txt", 'w+') as f: 
    for id_, author in id2author.items(): 
        f.write(f"{id_},{author}\n")

In [4]:
# Create a new file similar to authors.txt but with authors ids instead of their names
with open("../data/authors_processed/authors_ids.txt", 'w+') as f: 
    for node, authors in node_authors.items():
        authors_id = list(map(author2id.get, authors))
        authors_id = list(map(str, authors_id))
        f.write(f"{node}|--|{','.join(authors_id)}\n")
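
The `paper|--|author,author,...` line format shared by `authors.txt` and `authors_ids.txt` can be checked on a single made-up line. The helper names `parse_line` and `format_line` are hypothetical and only serve this illustration.

```python
SEP = "|--|"  # separator between the paper id and its author list

def parse_line(line):
    """Split 'paper|--|a,b,c' into (paper id, list of author fields)."""
    node, authors = line.rstrip("\n").split(SEP)
    return int(node), authors.split(",")

def format_line(node, author_ids):
    """Inverse of parse_line for a list of integer author ids."""
    return f"{node}{SEP}{','.join(map(str, author_ids))}\n"

# Made-up example line: paper 42 with two authors
node, names = parse_line("42|--|Alan Turing,Grace Hopper\n")
line_out = format_line(node, [1, 2])
print(repr(line_out))  # '42|--|1,2\n'
```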

## 2. Building the author collaboration network

We want to create an undirected, weighted graph of authors, where two authors are connected by an edge if they co-authored at least one paper. The edge weight, defined below, grows with the number of papers they co-authored.

In [None]:
import itertools
import networkx as nx
import numpy as np
import pickle 
import gc

In [None]:
# paper - authors ids dict
paper_authors = dict()
with open('../data/authors_processed/authors_ids.txt', 'r') as f:
    for line in f:
        # 'authors_ids' avoids shadowing the node_authors dict from section 1
        node, authors_ids = line.rstrip('\n').split('|--|')
        paper_authors[int(node)] = list(map(int, authors_ids.split(',')))

# Number of authors
with open("../data/authors_processed/id2author.txt", 'r') as f:
    n_authors = len(f.readlines())

In [None]:
print("Number of authors:", n_authors)

Number of authors: 149682


We will first build a weighted collaboration matrix $W \in \mathbb{R}^{n\times n}$ (n := number of authors) such that for two authors $i$ and $j$:

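$$
W_{ij} = \sum_{p \in \text{papers},\; n_p \geq 2} \frac{\delta^p_i \, \delta^p_j}{n_p - 1}
\quad \text{if} \quad i \neq j
\qquad \text{and} \qquad
W_{ii} = 0
$$

where $n_p$ is the number of authors of paper $p$ and $\delta^p_i = \mathbf{1}(i \in \{\text{authors of } p\})$ indicates whether author $i$ is an author of $p$. Single-author papers ($n_p = 1$) contribute nothing to $W$.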
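
To make the weighting concrete, here is a minimal sketch on three made-up papers (the toy ids and the name `toy_papers` are assumptions for illustration). It uses plain Python lists instead of NumPy so that it runs without dependencies.

```python
from itertools import combinations

# Toy data: paper id -> list of author ids (made up for illustration)
toy_papers = {0: [0, 1, 2], 1: [0, 1], 2: [3]}
n = 4  # number of authors in the toy example

W = [[0.0] * n for _ in range(n)]
for authors in toy_papers.values():
    # A single-author paper yields no pairs, so nothing is added for it
    for i, j in combinations(authors, 2):
        share = 1 / (len(authors) - 1)
        W[i][j] += share
        W[j][i] += share

for row in W:
    print(row)
# [0.0, 1.5, 0.5, 0.0]
# [1.5, 0.0, 0.5, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0]
```

Authors 0 and 1 share a three-author paper (contribution 1/2) and a two-author paper (contribution 1), hence $W_{01} = 1.5$. With this normalization, row $i$ of $W$ sums to the number of multi-author papers of author $i$.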

In [None]:
author_collab_weights = np.zeros((n_authors, n_authors)) 

In [None]:
all_author_collabs = set()
for paper in paper_authors:
    # Create tuples of author collaborations for one paper
    # itertools.combinations(p, r) creates r-length tuples, in the order of the input, with no repeated elements
    # e.g. : list(itertools.combinations('ABC', 2)) >>> [('A', 'B'), ('A', 'C'), ('B', 'C')] 
    authors = paper_authors[paper]
    author_collabs = list(itertools.combinations(authors, r=2))
    all_author_collabs |= set(author_collabs)
    for author_1, author_2 in author_collabs:
        author_collab_weights[author_1, author_2] += 1/(len(authors)-1)
        author_collab_weights[author_2, author_1] += 1/(len(authors)-1)
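
A dense `float64` array of shape (149682, 149682) has about 2.24e10 entries, i.e. roughly 179 GB if fully materialized, while only a few hundred thousand author pairs actually collaborate. As a possible alternative (not the approach used in this notebook), the same weights can be accumulated in a dictionary keyed by the sorted pair. The function name `collab_weights` is hypothetical, and the sketch assumes that the author ids within a paper are distinct.

```python
from collections import defaultdict
from itertools import combinations

def collab_weights(paper_authors):
    """Return {(a, b): weight} with a < b, using the 1/(n_p - 1) weighting."""
    weights = defaultdict(float)
    for authors in paper_authors.values():
        if len(authors) < 2:
            continue  # single-author papers create no collaboration
        share = 1 / (len(authors) - 1)
        # Sorting first means each pair is always emitted as (small id, large id)
        for pair in combinations(sorted(authors), 2):
            weights[pair] += share
    return dict(weights)

# Toy data: paper id -> list of author ids (made up for illustration)
toy = {0: [2, 0, 1], 1: [1, 0], 2: [3]}
sparse_w = collab_weights(toy)
print(sparse_w)  # {(0, 1): 1.5, (0, 2): 0.5, (1, 2): 0.5}
```

Because every key is already a sorted pair, the keys of the dictionary are directly the set of distinct undirected collaborations.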

In [None]:
print(len(all_author_collabs))
# We sort each pair of authors because we consider that 
# a collaboration (author_1, author_2) is the same as a collaboration (author_2, author_1)
all_author_collabs = list(map(sorted, all_author_collabs))
# The result of sorted is a list, so we convert it back to a (hashable) tuple
all_author_collabs = set(map(tuple, all_author_collabs))
print(len(all_author_collabs))

556344
529595


In [None]:
# Write the collaborations to a file where each line 'author_1,author_2,weight'
# gives the collaboration weight W[author_1, author_2], rounded to 2 decimals
with open("../data/authors_processed/author_collab_edgelist.txt", 'w+') as f:
    for (author_1, author_2) in all_author_collabs:
        weight = author_collab_weights[author_1, author_2]
        f.write(f"{author_1},{author_2},{round(weight,2)}\n")

## 3. Features of the author collaboration network

In [None]:
# Read from the same directory the files were written to in sections 1 and 2
with open("../data/authors_processed/id2author.txt", 'r') as f:
    n_authors = len(f.readlines())

# author collaboration graph
G_author_collab = nx.read_weighted_edgelist(
    '../data/authors_processed/author_collab_edgelist.txt',
    delimiter=',', 
    nodetype=int,
)

# Some authors never co-authored a paper:
# they have no edges (no collaboration), so they are missing from the edge list
# and we have to add them explicitly.
# (We pass all author ids as a parameter; the ones that don't exist yet are added as isolated nodes.)
G_author_collab.add_nodes_from(range(n_authors))
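
The loading step can be tried on a tiny in-memory edge list. The sketch below uses `nx.parse_edgelist`, which takes lines instead of a file path, and then computes two simple node features. Which features the notebook goes on to extract is not shown here, so degree and weighted degree are only illustrative choices, and the toy edges are made up.

```python
import networkx as nx

# Toy edge list in the same 'author_1,author_2,weight' format (made-up values)
lines = ["0,1,1.5", "0,2,0.5", "1,2,0.5"]
G_toy = nx.parse_edgelist(
    lines,
    delimiter=",",
    nodetype=int,
    data=(("weight", float),),
)

# Author 3 never collaborated: add it as an isolated node.
# Nodes that already exist are left untouched.
G_toy.add_nodes_from(range(4))
print(G_toy.number_of_nodes(), G_toy.number_of_edges())  # 4 3

# Two example features per author
degree = dict(G_toy.degree())                    # number of distinct co-authors
strength = dict(G_toy.degree(weight="weight"))   # sum of collaboration weights
print(degree[0], strength[0], degree[3])  # 2 2.0 0
```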

In [3]:
print("Number of nodes:", G_author_collab.number_of_nodes())
print("Number of edges:", G_author_collab.number_of_edges())

NameError: name 'G_author_collab' is not defined