# AUTHOR NETWORKS

The objectives of this notebook are:
- assign an ID to each author for easier manipulation
- build a dictionary with authors as keys and the list of their papers as values
- build an **author collaboration graph** where nodes represent authors and two authors share an edge **if they co-authored at least one paper**
- build an **author citation graph** where nodes represent authors and two authors share an edge **if at least one of them cited the other in a paper**

In [1]:
import os 
import gc
import itertools 
import numpy as np 
import pandas as pd

import paths # script with all data paths

## 1. Processing of the authors

We encode the author names as integer identifiers so that we can manipulate them easily.

The following files will be added in the folder `data/authors_processed`:
- `id2author.txt` : contains a line *"id,name"* for each author
- `paper_2_authors_id.txt` : contains a line *"paper|--|auth_id_1,auth_id_2,..."*
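
As a minimal sketch of this encoding step, the snippet below parses a few made-up lines in the *"paper|--|name_1,name_2,..."* format and assigns an integer id to each name. The names and the `lines` list are invented for illustration, and sorting the names before numbering them is an assumption made here for reproducibility. The sketch also assumes that author names never contain a comma.

```python
# Toy lines in the "paper|--|name_1,name_2,..." format (made-up data)
lines = [
    "1|--|Alice,Bob",
    "2|--|Bob,Carol",
    "3|--|Alice",
]

paper_authors = {}   # {paper : list of author names}
all_authors = set()  # every distinct author name
for line in lines:
    paper, authors = line.rstrip("\n").split("|--|")
    names = authors.split(",")
    paper_authors[int(paper)] = names
    all_authors.update(names)

# Sorting makes the id assignment identical from one run to the next
author2id = {name: i for i, name in enumerate(sorted(all_authors))}
id2author = {i: name for name, i in author2id.items()}

# Same mapping as paper_authors, with ids instead of names
paper_author_ids = {
    paper: [author2id[name] for name in names]
    for paper, names in paper_authors.items()
}

print(author2id)
print(paper_author_ids)
```

The two dictionaries `id2author` and `paper_author_ids` hold the information that the notebook writes to `id2author.txt` and `paper_2_authors_id.txt`.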

In [2]:
if not os.path.isfile(paths.ID_2_AUTHOR_PATH) or \
    not os.path.isfile(paths.PAPER_2_AUTHORS_ID_PATH):
    
    all_authors = set() # set of all the authors in the dataset
    paper_authors = dict() # dictionary {paper : list of author names}
    with open(paths.AUTHORS_PATH, 'r') as f:
        for line in f:
            paper, authors = line.rstrip('\n').split('|--|')
            authors = authors.split(',')
            paper_authors[int(paper)] = authors
            all_authors |= set(authors) # '|' is the union operator


    id2author = dict() # dictionary {author id : author name}
    author2id = dict() # dictionary {author name : author id}
    # Sorting the names makes the id assignment reproducible across runs
    for i, author in enumerate(sorted(all_authors)):
        id2author[i] = author
        author2id[author] = i

    # Save the (id,name) pairs as a text file
    with open(paths.ID_2_AUTHOR_PATH, 'w+') as f: 
        for id_, author in id2author.items(): 
            f.write(f"{id_},{author}\n") 

    # Create a new file similar to authors.txt but with authors ids instead of their names
    with open(paths.PAPER_2_AUTHORS_ID_PATH, 'w+') as f: 
        for paper, authors in paper_authors.items():
            authors_id = list(map(author2id.get, authors))
            authors_id = list(map(str, authors_id))
            f.write(f"{paper}|--|{','.join(authors_id)}\n")
else:
    print("The files already exist !")

The files already exist !


## 2. Get all the papers of each author

The following file will be added in the folder `data/authors_processed`:
- `author_id_2_papers.txt` : contains a line *"author_id|--|paper_1,paper_2,..."*
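
Building this file amounts to inverting the *{paper : authors}* mapping. A minimal sketch on made-up data, using `collections.defaultdict` rather than an explicit membership test (a stylistic choice made here):

```python
from collections import defaultdict

# Toy {paper : author ids} mapping (made-up data)
paper_authors = {1: [0, 1], 2: [1, 2], 3: [0]}

# Invert it into {author id : list of papers}
author_papers = defaultdict(list)
for paper, authors in paper_authors.items():
    for author in authors:
        author_papers[author].append(paper)

author_papers = dict(author_papers)
print(author_papers)
```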

In [3]:
if not os.path.isfile(paths.AUTHOR_ID_2_PAPERS_PATH):
    # {paper : authors ids} dict
    paper_authors = dict()
    with open(paths.PAPER_2_AUTHORS_ID_PATH, 'r') as f:
        for line in f:
            paper, co_authors = line.rstrip('\n').split('|--|')
            paper_authors[int(paper)] = list(map(int,co_authors.split(',')))

    # Build {author id : papers} 
    author_papers = dict()
    for paper, authors in paper_authors.items():
        for author in authors:
            if author in author_papers:
                author_papers[author] += [paper]
            else:
                author_papers[author] = [paper]

    # Create a new file with each line as "author|--|paper1,paper2,..."
    with open(paths.AUTHOR_ID_2_PAPERS_PATH, 'w+') as f: 
        for author, papers in author_papers.items():
            papers = list(map(str, papers))
            f.write(f"{author}|--|{','.join(papers)}\n")
else:
    print("The file already exists !")
    # We just read the file
    author_papers = dict()
    with open(paths.AUTHOR_ID_2_PAPERS_PATH, 'r') as f:
        for line in f:
            author, papers = line.rstrip('\n').split('|--|')
            author_papers[int(author)] = list(map(int,papers.split(',')))

The file already exists !


## 3. Author collaboration graph

The following file will be added in the folder `data/authors_processed`:
- `author_collab_edgelist.txt` : contains a line *"author_id_1,author_id_2,weight"* for each edge of the author collaboration graph

#### How to build the graph?

We want to create a weighted undirected graph of authors, where two authors are connected by an edge if they co-authored at least one paper. Note that this graph is independent of the paper citation graph.

We first build the adjacency matrix of this graph as a weighted collaboration matrix $W \in \mathbb{R}^{n\times n}$ (n := number of authors) such that for two authors $i$ and $j$:

$$
W_{ij} = \sum_{p\ \in\ papers} \frac{\delta^p_i \delta^p_j}{n_p - 1} 
\quad \text{if} \quad  i \neq j \quad\quad \text{and} \quad\quad
W_{ii} = 0 
$$

where $n_p$ is the number of authors of paper $p$ and $\delta^p_i = \mathbf{1}(i \in \{\text{authors of } p\})$.

With this formula, the weight of a collaboration on a paper decreases as the number of authors of the paper grows: each pair of co-authors of paper $p$ contributes $1/(n_p - 1)$ to its edge.
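
To make the formula concrete, here is a small worked example on made-up data. It is only a sketch: it stores the weights in a dictionary keyed by sorted author pairs instead of a dense matrix, which is an assumption of this example and not how the cell below proceeds. The motivation is memory: a dense `float64` matrix for the 149,682 authors reported later in this notebook would nominally take on the order of 180 GB.

```python
import itertools
from collections import defaultdict

# Toy {paper : author ids} mapping (made-up data)
paper_authors = {1: [0, 1, 2], 2: [0, 1], 3: [3]}

# Sparse storage: {(author_i, author_j) with i < j : weight}
collab_weights = defaultdict(float)
for authors in paper_authors.values():
    unique = sorted(set(authors))  # guard against duplicated ids
    n_p = len(unique)
    # A single-author paper yields no pair, so n_p - 1 is never 0 here
    for a, b in itertools.combinations(unique, r=2):
        collab_weights[(a, b)] += 1 / (n_p - 1)

print(dict(collab_weights))
```

Paper 1 has three authors, so each of its three pairs receives $1/2$. Paper 2 has two authors, so the pair $(0, 1)$ receives an additional $1$, for a total of $1.5$. Author 3 never collaborated and does not appear in any edge.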

In [13]:
# NOTE: example of the use of itertools.combinations
# we will use that to create pairs of co-authors of a paper
a = [1, 2, 3, 4]

co_auths = list(itertools.combinations(a, r=2))
co_auths

[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

In [4]:
if not os.path.isfile(paths.AUTHCOLL_EDGELIST_PATH):
    # {paper : authors ids} dict
    paper_authors = dict()
    with open(paths.PAPER_2_AUTHORS_ID_PATH, 'r') as f:
        for line in f:
            paper, co_authors = line.rstrip('\n').split('|--|')
            paper_authors[int(paper)] = list(map(int,co_authors.split(',')))

    # Total number of authors
    with open(paths.ID_2_AUTHOR_PATH, 'r') as f:
        n_authors = len(f.readlines())
    print("Number of authors:", n_authors)

    # Adjacency matrix of our future graph
    author_collab_weights = np.zeros((n_authors, n_authors))

    all_author_collabs = set()
    for paper in paper_authors:
        # Create all pairs of co-authors for one paper
        # NOTE: Look at the previous cell to understand the use of itertools.combinations
        authors = paper_authors[paper]
        author_collabs = list(itertools.combinations(authors, r=2))
        all_author_collabs |= set(author_collabs)
        for author_1, author_2 in author_collabs:
            author_collab_weights[author_1, author_2] += 1/(len(authors)-1)
            author_collab_weights[author_2, author_1] += 1/(len(authors)-1)


    print("# of collabs before the sort:", len(all_author_collabs))
    # We sort each pair of collab because we consider  that 
    # a collab (author_1, author_2) is the same as a collab (author_2, author_1)
    all_author_collabs = list(map(sorted, all_author_collabs))
    # The result of sorted is a list so we put it back as a tuple
    all_author_collabs = set(map(tuple, all_author_collabs))
    print("# of collabs after the sort:", len(all_author_collabs))


    # Write the collaborations in a file where each line 'author_1,author_2,weight'
    # gives the collaboration weight W[author_1, author_2] defined above
    # NOTE: the graph will not contain authors who never collaborated with anyone
    print("Saving the edgelist ...")
    with open(paths.AUTHCOLL_EDGELIST_PATH, 'w+') as f:
        for (author_1, author_2) in all_author_collabs:
            weight = author_collab_weights[author_1, author_2]
            f.write(f"{author_1},{author_2},{round(weight,2)}\n")
    print("Done")
else:
    print("The author collaboration graph was already built !")

The author collaboration graph was already built !


## 4. Author citation graphs

The following file will be added in the folder `data/authors_citations`:
- `authcit_edgelist.txt` : contains a line *"author_id_1,author_id_2,weight"* for each edge of the author citation graph corresponding to the full paper graph
- `train_authcit_edgelist.txt` : contains a line *"author_id_1,author_id_2,weight"* for each edge of the author citation graph corresponding to the train paper graph
- `test_authcit_edgelist.txt` : contains a line *"author_id_1,author_id_2,weight"* for each edge of the author citation graph corresponding to the test paper graph

#### How to build the graph?

We want to create a weighted undirected graph of authors, where two authors are connected by an edge if at least one of them cited the other in a paper. Thus we need the paper citation graph. Since we have three paper citation graphs (full, train and test), we will build three author citation graphs.

Consider the paper citation graph $G=(A,E)$. We first build the adjacency matrix of the corresponding author citation graph as a weighted citation matrix $W \in \mathbb{R}^{n\times n}$ (n := number of authors) such that for two authors $i$ and $j$:

$$
W_{ij} = \sum_{(p_1,p_2)\ \in\ E} 
\frac{\delta^{p_1}_i \delta^{p_2}_j}{n_{p_1} n_{p_2}} 
$$

where $n_p$ is the number of authors of paper $p$ and $\delta^p_i = \mathbf{1}(i \in \{\text{authors of } p\})$.

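With this formula, the weight of a citation is inversely proportional to the number of authors of both papers: for an edge $(p_1, p_2)$, each pair made of an author of $p_1$ and an author of $p_2$ contributes $1/(n_{p_1} n_{p_2})$. Because the graph is undirected, the code adds each contribution to both $W_{ij}$ and $W_{ji}$.

Note that we do not set $W_{ii}=0$ because we take self-citations into account.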
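
The following sketch applies this weighting to a tiny made-up example, with the same symmetric update as the function defined further down. The data and the dictionary-based storage are assumptions of the example; the notebook itself uses a dense NumPy matrix.

```python
import itertools
from collections import defaultdict

# Toy data (made up): {paper : author ids} and paper citation edges
paper_authors = {1: [0, 1], 2: [2], 3: [0]}
paper_edges = [(1, 2), (3, 1)]

# Sparse symmetric weights: {(author_i, author_j) : weight}
W = defaultdict(float)
for p1, p2 in paper_edges:
    authors_1, authors_2 = paper_authors[p1], paper_authors[p2]
    w = 1 / (len(authors_1) * len(authors_2))
    for i, j in itertools.product(authors_1, authors_2):
        W[(i, j)] += w
        W[(j, i)] += w  # undirected graph: mirror the contribution

# Keep one entry per undirected edge (i <= j keeps self-citations)
edges = {pair: W[pair] for pair in W if pair[0] <= pair[1]}
print(edges)
```

The edge $(3, 1)$ is a self-citation for author 0, who wrote both papers. With the symmetric update, the diagonal entry receives the contribution twice ($2 \times 1/2 = 1$), whereas an off-diagonal pair such as $(0, 1)$ receives it once on each side of the diagonal.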

In [5]:
# {paper : authors ids} dict
paper_authors = dict()
with open(paths.PAPER_2_AUTHORS_ID_PATH, 'r') as f:
    for line in f:
        paper, co_authors = line.rstrip('\n').split('|--|')
        paper_authors[int(paper)] = list(map(int,co_authors.split(',')))

print("Number of papers:", len(paper_authors))

# Total number of authors
with open(paths.ID_2_AUTHOR_PATH, 'r') as f:
    n_authors = len(f.readlines())
print("Number of authors:", n_authors)


Number of papers: 138499
Number of authors: 149682


In [6]:
# NOTE: example of the use of itertools.product
# we will use that to create pairs of authors of two papers with an edge
a = [1, 2, 3]
b = [5, 6]

prods = list(itertools.product(a, b))
prods

[(1, 5), (1, 6), (2, 5), (2, 6), (3, 5), (3, 6)]

In [7]:
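def build_author_citation_graph(paper_edges_path, path_to_save):
    """
    Compute and save the author citation edgelist of the corresponding paper citation graph.

    paper_edges_path: CSV file with one "paper_1,paper_2" citation edge per line (no header).
    path_to_save: output file with one "author_1,author_2,weight" line per edge.
    Relies on the global variables `paper_authors` and `n_authors` defined above.
    """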
    paper_citation_edges = pd.read_csv(
        paper_edges_path,
        header=None
    ).to_numpy()

    # Adjacency matrix of our future graph
    author_citation_weights = np.zeros((n_authors, n_authors))

    all_author_citations = set()

    for paper_1, paper_2 in paper_citation_edges:
        # Get the authors of each paper
        authors_1 = paper_authors[paper_1]
        authors_2 = paper_authors[paper_2]

        # Create pairs of author citations
        # NOTE: Look at the previous cell to understand the use of itertools.product
        author_citations = list(itertools.product(authors_1, authors_2))
        all_author_citations |= set(author_citations)

        for author_1, author_2 in author_citations:
            author_citation_weights[author_1, author_2] += 1 / (len(authors_1) * len(authors_2))
            author_citation_weights[author_2, author_1] += 1 / (len(authors_1) * len(authors_2))


    print("# of author citations before the sort:", len(all_author_citations))
    # We sort each citation pair because we consider that
    # a citation (author_1, author_2) is the same as a citation (author_2, author_1)
    all_author_citations = list(map(sorted, all_author_citations))
    # The result of sorted is a list so we put it back as a tuple
    all_author_citations = set(map(tuple, all_author_citations))
    print("# of author citations after the sort:", len(all_author_citations))

    # Save the edgelist to path
    print("Saving the edgelist ...")
    with open(path_to_save, 'w+') as f:
        for (author_1, author_2) in all_author_citations:
            weight = author_citation_weights[author_1, author_2]
            f.write(f"{author_1},{author_2},{round(weight,2)}\n")
    print("Done")

    del all_author_citations, author_citation_weights
    gc.collect()

In [8]:
# Full paper citation graph
if not os.path.isfile(paths.FULL_AUTHCIT_EDGELIST_PATH):
    build_author_citation_graph(
        paths.FULL_GRAPH_EDGELIST_PATH, 
        paths.FULL_AUTHCIT_EDGELIST_PATH
    )
else:
    print("The author citation graph corresponding to the full paper graph was already built !")

The author citation graph corresponding to the full paper graph was already built !


In [9]:
# Train paper citation graph
if not os.path.isfile(paths.TRAIN_AUTHCIT_EDGELIST_PATH):
    build_author_citation_graph(
        paths.TRAIN_EDGELIST_PATH,
        paths.TRAIN_AUTHCIT_EDGELIST_PATH
    )
else:
    print("The author citation graph corresponding to the train paper graph was already built !")

The author citation graph corresponding to the train paper graph was already built !


In [10]:
# Test paper citation graph
if not os.path.isfile(paths.TEST_AUTHCIT_EDGELIST_PATH):
    build_author_citation_graph(
        paths.TEST_EDGELIST_PATH,
        paths.TEST_AUTHCIT_EDGELIST_PATH
    )
else:
    print("The author citation graph corresponding to the test paper graph was already built !")

The author citation graph corresponding to the test paper graph was already built !
