# Data Analytics 
## Course Assignment N. 12: Mathematicians Network

Zhe Huang, 2020.3

---
### The goal:
> - Explore and describe the data (preprocess the data, visualize the variables with different graphs, distribution of the variables).
> - While exploring the data, define research questions and answer them, for example: who are the top authors by number of co-authors? Which authors are highly connected, and which are isolated from the others?
> - Plot the graph that shows the links between the different authors, i.e., how the authors are connected.
> - Use graphics to enlarge the authors that have the highest centrality, etc.
---

This notebook explores and analyzes the Erdös collaboration graph.


The interactive graph visualizations rely on JavaScript. Jupyter provides a viewer that can load and run it; the viewer fetches the ipynb file from GitHub.

For this assignment, the link is [here](https://nbviewer.jupyter.org/github/onlyacat/Mathematicians_Network/blob/master/main.ipynb).

In this project I used `Python 3.7` as the programming language, `pyecharts` to draw the interactive charts, and `networkx` to analyze the graph.

In [1]:
import random

import networkx as nx
from networkx.algorithms import approximation

from pyecharts import options as opts
from pyecharts.charts import Graph, Bar, Line, Radar

import prettytable

#### 1. Simple preparations.

- `randomcolor()`: generates a random hex color string such as `#3A7F1C`.

- `preprocessing()`: reads the given file and builds the inputs for the graph. Every author becomes a node (`id_name_pairs`), and the edges come from the co-author relationships (`id_id_list`). An author's ID is represented by its list index. Each ID from the file is shifted by **-1** because Python lists start at **0**.

In [2]:
def randomcolor():
    # Build a random "#RRGGBB" color string from six hexadecimal digits.
    return "#" + "".join(random.choice("0123456789ABCDEF") for _ in range(6))


def preprocessing(path):
    with open(path, 'r') as book:
        lines = book.readlines()

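    # Skip lines starting with '%' or '*'; the first 6927 remaining lines are the authors.
    lines = [x.strip() for x in lines if x[0] not in ['%', '*']]
    id_name_pairs = [x.replace('"', '').split(' ', 1) for x in lines[:6927]]
    id_name_pairs = [x[1] for x in id_name_pairs]

    # Each remaining line holds an author ID followed by co-author IDs; shift to 0-based.
    id_id_pairs = [x.split() for x in lines[6927:]]
    id_id_pairs = [[int(y) - 1 for y in records] for records in id_id_pairs]
    id_id_list = [None] * (id_id_pairs[-1][0] + 1)
    # Merge the co-author lists of lines that start with the same author ID.
    for x in id_id_pairs:
        id_id_list[x[0]] = x[1:] if id_id_list[x[0]] is None else id_id_list[x[0]] + x[1:]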

    return id_name_pairs, id_id_list
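
To make the index shift concrete, here is a minimal sketch that runs the same parsing steps on a toy input. The sample lines, the author names and the helper name `parse_lines` are hypothetical; the layout is only what the parsing code above implies, and the vertex count is passed as a parameter instead of the fixed 6927.

```python
def parse_lines(lines, n_vertices):
    """Toy version of the parsing steps; `n_vertices` replaces the fixed 6927."""
    # Skip lines that start with '%' or '*', as the notebook code does.
    lines = [x.strip() for x in lines if x[0] not in ['%', '*']]
    # A line such as '1 "NAME"' becomes 'NAME'.
    names = [x.replace('"', '').split(' ', 1)[1] for x in lines[:n_vertices]]
    # Remaining lines: the first ID is the author, the rest are co-authors (1-based).
    records = [[int(y) - 1 for y in x.split()] for x in lines[n_vertices:]]
    adjacency = [None] * (records[-1][0] + 1)
    for rec in records:
        head, tail = rec[0], rec[1:]
        adjacency[head] = tail if adjacency[head] is None else adjacency[head] + tail
    return names, adjacency


# Hypothetical three-author sample in the assumed layout.
sample = [
    '*Vertices 3',
    '1 "AUTHOR, A."',
    '2 "AUTHOR, B."',
    '3 "ERDOS PAUL"',
    '*Edgeslist',
    '1 3',
    '2 3 1',
]
names, adjacency = parse_lines(sample, n_vertices=3)
print(names)
print(adjacency)
```

Author 3 in the file becomes index 2 in both lists, which is why the real graph stores ERDOS PAUL (author 6927) at index 6926.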

#### 2. Create the network.
- `create_network()`: Each node has three attributes: **ID**, **name** and **E_numbers**. The first 507 authors (indices 0 to 506) get E_number 1 and the remaining authors get 2. Erdös's E_number is 0.

In [3]:
def create_network():
    G = nx.Graph()

    for index, name in enumerate(id_name):
        G.add_node(index, name=name, E_numbers=1 if index < 507 else 2)

    G.nodes[6926]['E_numbers'] = 0

    for x, y in enumerate(id_id):
        for z in y:
            G.add_edge(x, z)

    return G

Here we generate the basic components.

In [4]:
id_name, id_id = preprocessing('12_Mathematicians_Network.txt')
G = create_network()

#### 3. Basic data analysis.

- `analyze_basics()`: uses functions provided by `networkx`.

The Erdös collaboration graph contains **6927** nodes and **11850** edges. The density is **0.0005** and the average degree is **3.42**. 

Assortativity measures the similarity of connections in the graph with respect to the node degree. The degree assortativity coefficient is **-0.116**, showing that the network is disassortative: high-degree authors tend to be linked to low-degree authors.

In [27]:
def analyze_basics():
    density = nx.density(G)
    average_degree = 2 * G.number_of_edges() / G.number_of_nodes()  # each edge adds 2 to the degree sum
    number_of_nodes = len(G.nodes)
    number_of_edges = len(G.edges)
    degree_assortativity = nx.degree_assortativity_coefficient(G)
    table = prettytable.PrettyTable(
        ['Number of Nodes', 'Number Of Edges', 'Density', 'Average Degree', 'Degree Assortativity'])
    table.add_row([number_of_nodes, number_of_edges, density, average_degree, degree_assortativity])
    print(table)

analyze_basics()

+-----------------+-----------------+-----------------------+-------------------+----------------------+
| Number of Nodes | Number Of Edges |        Density        |   Average Degree  | Degree Assortativity |
+-----------------+-----------------+-----------------------+-------------------+----------------------+
|       6927      |      11850      | 0.0004939928592394236 | 3.421394543092248 | -0.11557739696891721 |
+-----------------+-----------------+-----------------------+-------------------+----------------------+
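
The density and the average degree in the table follow directly from the node and edge counts. As a quick check, here is a short sketch that uses the standard formulas for an undirected simple graph; only the two counts are taken from the table.

```python
n, m = 6927, 11850  # nodes and edges reported in the table above

# Undirected simple graph: density = 2m / (n(n-1)), average degree = 2m / n.
density = 2 * m / (n * (n - 1))
average_degree = 2 * m / n

print(round(density, 6), round(average_degree, 3))
```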


- `analyze_clustering()`: uses functions provided by `networkx`.

`Average clustering` and `Average clustering coefficient` both measure how strongly the nodes tend to cluster together. The first is the exact value from `nx.average_clustering`; the second is an estimate from `approximation.average_clustering`.

`Average shortest path length` is the average number of steps along the shortest paths over all pairs of nodes. It measures how efficiently information can travel through the network. The value is 3.776, so on average about 3.8 edges separate any two authors in this graph.

`Efficiency` is inversely related to the `Shortest path length`: the efficiency of a pair of nodes is the reciprocal of the shortest-path distance between them. The average global efficiency of a graph is the average efficiency over all pairs of nodes, and the average local efficiency is the average of the local efficiencies of the individual nodes.

In [29]:
def analyze_clustering():
    average_clustering_coefficient = approximation.average_clustering(G)
    average_clustering = nx.average_clustering(G)
    average_shortest_path_length = nx.average_shortest_path_length(G)
    local_efficiency = nx.local_efficiency(G)
    global_efficiency = nx.global_efficiency(G)
    table = prettytable.PrettyTable(
        ['Average clustering', 'Average clustering coefficient', 'Average shortest path length'])
    table.add_row([average_clustering, average_clustering_coefficient, average_shortest_path_length])
    print(table)
    table = prettytable.PrettyTable(['Local efficiency','Global efficiency'])
    table.add_row([local_efficiency,global_efficiency])
    print(table)
analyze_clustering()

+---------------------+--------------------------------+------------------------------+
|  Average clustering | Average clustering coefficient | Average shortest path length |
+---------------------+--------------------------------+------------------------------+
| 0.12390011501874702 |             0.134              |       3.77644059260634       |
+---------------------+--------------------------------+------------------------------+
+---------------------+--------------------+
|   Local efficiency  | Global efficiency  |
+---------------------+--------------------+
| 0.13696963461442008 | 0.2704027423224077 |
+---------------------+--------------------+
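
The link between path length and efficiency is easiest to see on a tiny graph. The sketch below is a pure-Python illustration on a hypothetical three-node path, not the `networkx` implementation: it averages the pairwise distances for the shortest path length and averages their reciprocals for the global efficiency.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Toy path graph 0 - 1 - 2 (connected, so every pair has a finite distance).
adj = {0: [1], 1: [0, 2], 2: [1]}
pair_dist = [bfs_distances(adj, u)[v] for u, v in combinations(adj, 2)]

average_shortest_path_length = sum(pair_dist) / len(pair_dist)
global_efficiency = sum(1 / d for d in pair_dist) / len(pair_dist)
print(average_shortest_path_length, global_efficiency)
```

The pair at distance 2 contributes only 0.5 to the efficiency sum, so longer paths pull the efficiency down.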


The eccentricity of a node is its greatest distance to any other node. The `center` is the set of nodes whose eccentricity equals the radius. Node 6926, **ERDOS PAUL**, is both the center and the barycenter of this graph. The `diameter` is the maximum eccentricity: the two farthest nodes in this graph are at distance 4.


In [31]:
def analyze_distance():
    center = nx.center(G)
    barycenter = nx.barycenter(G)
    diameter = nx.diameter(G)
    table = prettytable.PrettyTable(['center', 'barycenter', 'diameter'])
    table.add_row([center, barycenter, diameter])
    print(table)

analyze_distance()

+--------+------------+----------+
| center | barycenter | diameter |
+--------+------------+----------+
| [6926] |   [6926]   |    4     |
+--------+------------+----------+
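
The distance measures can be reproduced by hand on a small graph. The following sketch uses a hypothetical six-node graph with one hub and plain breadth-first search; it is meant to illustrate the definitions, not to replace the `networkx` calls.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical toy graph: node 0 is a hub, nodes 4 and 5 hang off two of its neighbours.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}
dist = {u: bfs_distances(adj, u) for u in adj}

# Eccentricity: greatest distance from a node to any other node.
eccentricity = {u: max(dist[u].values()) for u in adj}
radius = min(eccentricity.values())
diameter = max(eccentricity.values())
center = [u for u in adj if eccentricity[u] == radius]

# Barycenter: nodes that minimise the sum of distances to all other nodes.
total = {u: sum(dist[u].values()) for u in adj}
barycenter = [u for u in adj if total[u] == min(total.values())]

print(center, barycenter, diameter)
```

As in the collaboration graph, the hub is both center and barycenter, and the diameter is the distance between two nodes on opposite sides of it.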


According to the `networkx` documentation, the functions `sigma` and `omega` calculate the coefficients **sigma** and **omega**, the two parameters commonly used to measure small-worldness. If sigma > 1 and omega is close to 0, the graph can be classified as small-world. However, the calculation is costly and I could not obtain a final result.
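
For reference, the two coefficients are defined in the `networkx` documentation as shown below. Here $C$ and $L$ are the average clustering coefficient and the average shortest path length of the graph, $C_r$ and $L_r$ are the same quantities for an equivalent random graph, and $C_l$ is the average clustering coefficient of an equivalent lattice graph.

$$\sigma = \frac{C / C_r}{L / L_r}, \qquad \omega = \frac{L_r}{L} - \frac{C}{C_l}$$

Both coefficients need randomized reference graphs, and each reference graph needs its own clustering and shortest-path computation, which is a likely reason for the long running time on a graph of this size.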

In [8]:
def analyze_small_world():
    # Small-world
    sigma = nx.sigma(G)
    omega = nx.omega(G)
    table = prettytable.PrettyTable(['sigma', 'omega'])
    table.add_row([sigma, omega])
    print(table)
    
# analyze_small_world()

Here I ranked the influence of each node in the graph, using the [`Voterank` algorithm](https://www.nature.com/articles/srep27823). 

During the process, each node $u$ keeps a tuple ($s_u$, $va_u$) holding its voting score and its voting ability. The voting score is the number of votes the node obtains from its neighbors, and the voting ability is the number of votes it can give to its neighbors. The final voting score of every listed author is 0 because the node was already elected in an earlier turn. ERDOS PAUL is elected first, and the magnitude of his final $va_u$ is far larger than that of any other author in the table, so ERDOS PAUL has more influence than the rest of the authors.

Finally, the function prints the top 15 authors with the highest influence, in ranking order.
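
The voting procedure can be sketched in a few lines of plain Python. This is a simplified illustration on a hypothetical eight-node graph, not the `networkx` implementation: the function name `voterank_sketch` is made up, and clamping the voting ability at zero is an assumption of the sketch.

```python
def voterank_sketch(adj, k):
    """Elect up to k influential nodes from an undirected adjacency dict."""
    avg_degree = sum(len(nb) for nb in adj.values()) / len(adj)
    ability = {u: 1.0 for u in adj}  # every node starts with one vote to give
    elected = []
    for _ in range(k):
        # Voting score: sum of the neighbours' voting abilities (0 once elected).
        score = {
            u: 0.0 if u in elected else sum(ability[v] for v in adj[u])
            for u in adj
        }
        best = max(score, key=score.get)
        if score[best] == 0:
            break  # nobody can collect votes any more
        elected.append(best)
        ability[best] = 0.0  # an elected node no longer votes
        for v in adj[best]:
            # Neighbours of the winner lose 1 / (average degree) of their ability.
            ability[v] = max(ability[v] - 1 / avg_degree, 0.0)
    return elected


# Hypothetical toy graph: node 0 has four neighbours, node 4 has three.
adj = {
    0: [1, 2, 3, 7], 1: [0], 2: [0], 3: [0, 4],
    4: [3, 5, 6], 5: [4], 6: [4], 7: [0],
}
print(voterank_sketch(adj, 2))
```

Node 0 wins the first round with four votes. Its neighbours are then weakened, so node 4, whose other neighbours still have full voting ability, wins the second round.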


In [7]:
def ranking():
    voterank = nx.voterank(G)
    table = prettytable.PrettyTable(
        ['Rank', 'Author ID', 'Name', 'Final voting score', 'Final voting ability'])
    for index, aid in enumerate(voterank[:15]):
        table.add_row([index, aid, G.nodes[aid]['name'], G.nodes[aid]['voterank'][0], G.nodes[aid]['voterank'][1]])
    print(table)

ranking()

+------+-----------+-----------------------+--------------------+----------------------+
| Rank | Author ID |          Name         | Final voting score | Final voting ability |
+------+-----------+-----------------------+--------------------+----------------------+
|  0   |    6926   |       ERDOS PAUL      |         0          |   -23.966835443038   |
|  1   |    185    |     HARARY, FRANK     |         0          |  -2.922784810126582  |
|  2   |     9     |       ALON, NOGA      |         0          |  -4.384177215189872  |
|  3   |    416    |    SHELAH, SAHARON    |         0          | -0.2922784810126582  |
|  4   |     85    |  COLBOURN, CHARLES J. |         0          | -1.1691139240506327  |
|  5   |    332    |   ODLYZKO, ANDREW M.  |         0          | -2.0459493670886073  |
|  6   |    248    |  KLEITMAN, DANIEL J.  |         0          | -2.3382278481012655  |
|  7   |    163    |   GRAHAM, RONALD L.   |         0          |  -2.922784810126582  |
|  8   |    474    | 

#### 4. Draw the graphs

Note: Some notebook viewers do not run JavaScript, so the graphs below may be invisible. The link [here](https://nbviewer.jupyter.org/github/onlyacat/Mathematicians_Network/blob/master/main.ipynb) can display them.

- `draw_the_whole_graph()`

Here the graph shows the relations between the nodes of this network in a **circular** layout. Nodes with E_number 1 contribute most of the edges.

In [5]:
def draw_the_whole_graph():
    nodes = [opts.GraphNode(
        name=G.nodes[x]['name'],
        value=G.degree[x],
        symbol_size=G.degree[x] / 10,
        category=G.nodes[x]['E_numbers']
    )
        for x in G.nodes]

    links = [opts.GraphLink(source=G.nodes[x]['name'], target=G.nodes[y]['name']) for x, y in G.edges]

    categories = [{'name': 'Erdos_number:' + str(x)} for x in range(3)]
    c = (
        Graph()
            .add(
            series_name="",
            nodes=nodes,
            links=links,
            layout='circular',
            is_roam=True,
            is_focusnode=True,
            label_opts=opts.LabelOpts(is_show=False),
            is_draggable=True,
            categories=categories,
            # repulsion=100
            # linestyle_opts=opts.LineStyleOpts(width=0.5, curve=0.3, opacity=0.7),
        )
            .set_global_opts(title_opts=opts.TitleOpts(title="Graph with \n authors degrees"))
    )
    return c

c = draw_the_whole_graph()
c.render_notebook()

- `draw_degree()`

Here the bar graph shows the top 50 authors by degree, grouped by E_number.

ERDOS PAUL, HARARY FRANK and Lesniak Linda M have the highest degrees in their three groups: 507, 297 and 18, respectively.

The averages for the green bars (E_number 1) and the red bars (E_number 2) are 89 and 10.

In [6]:
def draw_degree():
    degree = nx.degree(G)
    degree_sort = sorted(degree, key=lambda x: x[1], reverse=True)
    e0 = [opts.BarItem(name=G.nodes[degree_sort[0][0]]['name'], value=degree_sort[0][1])]
    e1 = [opts.BarItem(name=G.nodes[x[0]]['name'], value=x[1]) for x in degree_sort if G.nodes[x[0]]['E_numbers'] == 1][
         :50]
    e2 = [opts.BarItem(name=G.nodes[x[0]]['name'], value=x[1]) for x in degree_sort if G.nodes[x[0]]['E_numbers'] == 2][
         :50]

    xaxis = [x + 1 for x in range(50)]
    c = (
        Bar()
            .add_xaxis(xaxis)
            .add_yaxis("Erdos_number is 0", e0, category_gap=0, itemstyle_opts=opts.ItemStyleOpts(color='#d48265'),
                       gap="0%")
            .add_yaxis("Erdos_number is 1", e1, category_gap=0, itemstyle_opts=opts.ItemStyleOpts(color='#749f83'),
                       gap="0%")
            .add_yaxis("Erdos_number is 2", e2, category_gap=0, gap="0%")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False), axisline_opts=opts.AxisOpts(interval=1),
                             markline_opts=opts.MarkLineOpts(
                                 data=[
                                     opts.MarkLineItem(type_="average", name="Average"),
                                 ]
                             ), )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="Top 50 degree authors"),
            datazoom_opts=opts.DataZoomOpts(),
            yaxis_opts=opts.AxisOpts(
                axistick_opts=opts.AxisTickOpts(is_show=True),
                splitline_opts=opts.SplitLineOpts(is_show=True),
                ),
            )
        )
    return c

c = draw_degree()
c.render_notebook()

- `draw_degree_histogram()`

Here the graph shows the degree histogram distribution.

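4772 nodes have degree 1, and the count drops sharply as the degree grows. Most authors in this graph are therefore linked to only one other author.

The range from 112 to 507 contains only 9 authors, showing that a small minority holds most of the influence.

There are no zero-degree nodes, so no author is isolated. That alone does not guarantee a connected graph; the connectivity is confirmed by `average_shortest_path_length` and `diameter`, which require a connected graph and ran successfully above.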
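
A degree histogram rules out isolated nodes, while reachability needs a traversal. The sketch below shows both checks on a hypothetical four-node graph made of two separate edges; the helper names are made up for illustration.

```python
from collections import Counter, deque


def degree_histogram(adj):
    """hist[d] is the number of nodes with degree d."""
    counts = Counter(len(nb) for nb in adj.values())
    return [counts.get(d, 0) for d in range(max(counts) + 1)]


def is_connected(adj):
    """True if every node is reachable from an arbitrary start node."""
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


# Two disjoint edges: every node has degree 1, yet the graph is split in two.
adj = {0: [1], 1: [0], 2: [3], 3: [2]}
hist = degree_histogram(adj)
print(hist, is_connected(adj))
```

Here `hist[0]` is zero although the graph is not connected, so the two properties have to be checked separately.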

In [7]:
def draw_degree_histogram():
    degree_histogram = nx.degree_histogram(G)
    c = (
        Line()
            .add_xaxis([x for x in range(len(degree_histogram))])
            .add_yaxis("Number of authors", degree_histogram, is_smooth=True)
            .set_series_opts(
            areastyle_opts=opts.AreaStyleOpts(opacity=0.5),
            label_opts=opts.LabelOpts(is_show=False),
            markpoint_opts=opts.MarkPointOpts(
                data=[
                    opts.MarkPointItem(type_="max", name="Maximum"),
                ]
            ),
        )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="Degree Histogram"),
            datazoom_opts=opts.DataZoomOpts(),
            xaxis_opts=opts.AxisOpts(
                name="Degree",
                axistick_opts=opts.AxisTickOpts(is_align_with_label=True),
                is_scale=False,
                boundary_gap=False,
                ),
            )
        )
    return c
    
c = draw_degree_histogram()
c.render_notebook()

- `draw_centrality()`

Here the radar graph illustrates the distribution of centralities. Degree centrality, betweenness centrality and closeness centrality are three ways to measure how central a node is.

Degree centrality is measured by the number of edges connected to the node.

Betweenness centrality is measured by how often the node appears on the shortest paths between other nodes. Assuming that nodes communicate along shortest paths, a node that lies on many of them has a high betweenness centrality.

Closeness centrality is measured by the shortest-path distances from the node to all other nodes. A node that can reach the others through short paths can be considered a centre of the graph and has a high closeness centrality.

The function first calculates the three centralities for every node and keeps the top 15 nodes for each centrality. It then intersects the three sets and keeps the common nodes.

The graph shows that the top 7 authors are ERDOS PAUL, GRAHAM RONALD L., ALON NOGA, BOLLOBAS BELA, KLEITMAN DANIEL J., HARARY FRANK and TUZA ZSOLT. ERDOS PAUL has the highest values for all three centralities.
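
The selection step, ranking by each centrality and intersecting the top sets, can be illustrated on a small graph. This sketch is a pure-Python approximation of the idea on a hypothetical six-node graph: it covers only degree and closeness centrality (betweenness is left out to keep it short) and uses a top 3 instead of a top 15.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def top_k(scores, k):
    """Set of the k nodes with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Hypothetical connected toy graph with hub 0.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}
n = len(adj)

# Degree centrality: degree divided by the maximum possible degree n - 1.
degree_centrality = {u: len(adj[u]) / (n - 1) for u in adj}
# Closeness centrality (connected graph): (n - 1) / sum of distances to all others.
closeness_centrality = {u: (n - 1) / sum(bfs_distances(adj, u).values()) for u in adj}

common = top_k(degree_centrality, 3) & top_k(closeness_centrality, 3)
print(common)
```

Only nodes that rank highly under every measure survive the intersection, which is why the radar chart shows fewer than 15 authors.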

In [10]:
def draw_centrality():
    degree_centrality = nx.degree_centrality(G)
    closeness_centrality = nx.closeness_centrality(G)
    betweenness_centrality = nx.betweenness_centrality(G)
    degree_centrality_sort = sorted(degree_centrality, key=lambda x: degree_centrality[x], reverse=True)
    closeness_centrality_sort = sorted(closeness_centrality, key=lambda x: closeness_centrality[x], reverse=True)
    betweenness_centrality_sort = sorted(betweenness_centrality, key=lambda x: betweenness_centrality[x], reverse=True)
    h_authors = set.intersection(
        set(betweenness_centrality_sort[:15]),
        set(closeness_centrality_sort[:15]),
        set(degree_centrality_sort[:15]))

    data = [[G.nodes[x]["name"], [betweenness_centrality[x], closeness_centrality[x], degree_centrality[x]]] for x in
            h_authors]

    c = (
        Radar()
            .add_schema(
            schema=[
                opts.RadarIndicatorItem(name="Betweenness centrality [0,1]", max_=1, min_=0),
                opts.RadarIndicatorItem(name="Closeness centrality [0,0.6]", max_=0.6, min_=0),
                opts.RadarIndicatorItem(name="Degree centrality [0,0.1]", max_=0.1, min_=0),
            ],
            shape="circle",
            center=["50%", "50%"],
            radius="80%",
            splitarea_opt=opts.SplitAreaOpts(
                is_show=True, areastyle_opts=opts.AreaStyleOpts(opacity=1)
            ),
            textstyle_opts=opts.TextStyleOpts(color="#000"),
        ).set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    )

    for x in data:
        color = randomcolor()
        c.add(
            series_name=x[0],
            data=[x[1]],
            areastyle_opts=opts.AreaStyleOpts(opacity=0.1, color=color),
            linestyle_opts=opts.LineStyleOpts(width=1, color=color),
            label_opts=opts.LabelOpts(is_show=False)
        )

    return c

c = draw_centrality()
c.render_notebook()

- `draw_chain_decomposition()`

Here the relation graph shows the longest chain in the graph.

The function first calls `chain_decomposition` to obtain the chains, then picks the longest one and visualizes it.

The chain contains 144 nodes and 143 edges and runs between author `ABBOTT HARVEY L.` and author `SUBBARAO M. V.`.

In [8]:
def draw_chain_decomposition():
    chain_decomposition = list(nx.chain_decomposition(G))
    longest_chain = sorted(chain_decomposition, key=lambda x: (len(x)), reverse=True)[0]
    nodes = [opts.GraphNode(name=G.nodes[x[0]]['name']) for x in longest_chain]
    nodes.append(opts.GraphNode(name=G.nodes[longest_chain[-1][1]]['name'], label_opts=opts.LabelOpts(color='#d48265')))
    nodes[0] = opts.GraphNode(name=G.nodes[longest_chain[0][0]]['name'], label_opts=opts.LabelOpts(color='#749f83'))
    links = [opts.GraphLink(source=G.nodes[x]['name'], target=G.nodes[y]['name']) for x, y in longest_chain]
    c = (
        Graph()
            .add(
            series_name="",
            nodes=nodes,
            links=links,
            layout='force',
            is_roam=True,
            is_focusnode=True,
            label_opts=opts.LabelOpts(is_show=False),
            is_draggable=True,
            repulsion=100,
            linestyle_opts=opts.LineStyleOpts(width=0.5, curve=0.3, opacity=0.7),
        )
    )
    return c

c = draw_chain_decomposition()
c.render_notebook()

- `draw_k_cores()`

It uses the function `k_core` from networkx, which repeatedly removes the nodes whose degree is smaller than k. Because no k is passed, it returns the main core, i.e. the core with the largest k, which is a densely connected subgraph.

It contains 38 nodes and 281 edges.

From the graph, we can see that almost every node has an edge to the centre node `ERDOS PAUL`. The exception is `Lesniak Linda M`, which has 10 neighbours and is therefore the only node with E_number 2 in this core. We can guess that this author has a high possibility of collaborating with `ERDOS PAUL`.
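
The peeling step behind a k-core can be sketched in plain Python. The helper name `k_core_nodes` and the six-node graph are hypothetical, and the sketch only returns the surviving node set, so it should be read as an illustration of the idea and not as the `networkx` implementation.

```python
def k_core_nodes(adj, k):
    """Nodes left after repeatedly removing nodes with fewer than k live neighbours."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for u in list(alive):
            if sum(1 for v in adj[u] if v in alive) < k:
                alive.discard(u)
                changed = True  # a removal may push neighbours below k
    return alive


# Hypothetical toy graph: a 4-clique (0-3), node 4 tied to 0 and 1, node 5 hanging off 4.
adj = {
    0: [1, 2, 3, 4], 1: [0, 2, 3, 4], 2: [0, 1, 3],
    3: [0, 1, 2], 4: [0, 1, 5], 5: [4],
}
for k in (2, 3, 4):
    print(k, sorted(k_core_nodes(adj, k)))
```

For k = 3 the pendant node 5 is removed first, which drops node 4 below the threshold as well and leaves the clique. No node survives k = 4, so the 3-core is the main core of this toy graph.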

In [9]:
def draw_k_cores():
    k_cores = nx.k_core(G)
    nodes = [opts.GraphNode(name=k_cores.nodes[x]['name'], value=k_cores.degree[x], symbol_size=k_cores.degree[x]) for x
             in k_cores.nodes]
    links = [opts.GraphLink(source=k_cores.nodes[x]['name'], target=k_cores.nodes[y]['name']) for x, y in k_cores.edges]
    c = (
        Graph()
            .add(
            series_name="",
            nodes=nodes,
            links=links,
            layout='force',
            is_roam=True,
            is_focusnode=True,
            label_opts=opts.LabelOpts(is_show=False),
            is_draggable=True,
            repulsion=10000,
            # linestyle_opts=opts.LineStyleOpts(width=0.5, curve=0.3, opacity=0.7),
        )
    )
    return c

c = draw_k_cores()
c.render_notebook()

### Reference

1. [The Erdös Number Project](https://oakland.edu/enp/thedata/)
2. [Networks / Pajek](http://vlado.fmf.uni-lj.si/pub/networks/doc/erdos/)
3. [NetworkX](https://networkx.github.io/documentation/stable/index.html)
4. [Pyecharts](https://github.com/pyecharts/pyecharts/blob/master/README.en.md)