# Social networks, social signatures, and weight-topology correlations

In this exercise, we study a social network data set of private messages on a Facebook-like web page (data originally from http://toreopsahl.com/datasets/). In this weighted network, each node corresponds to a user, and the weight of a link is the total number of messages exchanged between the two connected users.

The edge list file `OClinks_w_undir.edg` has three entries per row:
`(node_i node_j w_ij)`, 
where `w_ij` is the weight of the link between nodes `node_i` and `node_j`.
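
To get a feel for this format, here is a minimal sketch that parses a few made-up rows of the same three-column form. The rows and the names `toy_lines` and `toy_net` are illustrative assumptions, not part of the data set; `nx.parse_edgelist` is used only so that the example does not need a file on disk.

```python
import networkx as nx

# Hypothetical rows in the (node_i node_j w_ij) format; not real data
toy_lines = ["1 2 3", "1 3 1", "2 3 5"]

# Parse the third column as a float edge attribute named 'weight'
toy_net = nx.parse_edgelist(toy_lines, data=(("weight", float),))

for node_i, node_j, w_ij in toy_net.edges(data="weight"):
    print(node_i, node_j, w_ij)
```

Note that node labels are read as strings unless a `nodetype` is given.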

Below, you'll find some predefined functions that are useful in this exercise. In most cases, you do not need to modify these functions (but do inspect them to get an idea of what they do so that you can use them in your code). 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
from scipy.stats import binned_statistic

In [None]:
def create_linbins(start, end, n_bins): # NO NEED TO MODIFY THIS
    """
    Creates a set of linear bins.

    Parameters
    -----------
    start: minimum value of data, int
    end: maximum value of data, int
    n_bins: number of bins, int

    Returns
    --------
    bins: a list of linear bin edges
    """
    bins = np.linspace(start, end, n_bins)
    return bins

In [None]:
def create_logbins(start, end, n_log, n_lin=0): # NO NEED TO MODIFY THIS
    """
    Creates a combination of linear and logarithmic bins: n_lin linear bins 
    of width 1 starting from start, followed by n_log logarithmic bins up to
    end.

    Parameters
    -----------
    start: starting point for linear bins, float
    end: maximum value of data, int
    n_log: number of logarithmic bins, int
    n_lin: number of linear bins, int

    Returns
    -------
    bins: a list of bin edges
    """
    if n_lin == 0:
        bins = np.geomspace(start, end, num=n_log+1)
    elif n_lin > 0:
        bins = np.array([start + i for i in range(n_lin)] 
                        + list(np.geomspace(start + n_lin, end, n_log+1)))
    return bins

In [None]:
# Select data directory
import os
if os.path.isdir('/coursedata'):
    course_data_dir = '/coursedata'
elif os.path.isdir('../data'):
    course_data_dir = '../data'
else:
    # Specify course_data_dir on your machine
    course_data_dir = '.'

print('The data directory is %s' % course_data_dir)

network_path = os.path.join(course_data_dir, 'OClinks_w_undir.edg')

### a) Degrees, weights, and node strengths (2 pts)

Let us first read the weighted edgelist into a NetworkX graph and extract the node degrees, strengths, and edge weights because we'll need these below. Modify the code below so that it extracts the three lists. The code then computes their minimal and maximal values. Use these to answer the MyCourses quiz. 

**Coding hints**:

- If `network` is an `nx.Graph` object, you can iterate over its nodes through the iterable object `network.nodes()` and over its edges through `network.edges()`. 

- `network.degree(node)` yields the degree of `node` (which can come from the above iterable), and `network.degree(node,weight='weight')` yields the node's strength (=sum of the weights of its links).  
- If you iterate over the edges with the following parameter: `network.edges(data='weight')`, each iteration yields three values, `node_i`,`node_j`,`weight_ij`. 
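
As an illustration of the calls mentioned in these hints, the following sketch uses a made-up three-node graph (the graph and the `toy_` names are assumptions for this example only) to show the difference between degree and strength:

```python
import networkx as nx

# A small hypothetical weighted triangle
toy_net = nx.Graph()
toy_net.add_weighted_edges_from([("a", "b", 2.0), ("a", "c", 5.0), ("b", "c", 1.0)])

# Degree: number of links attached to node 'a'
degree_a = toy_net.degree("a")
# Strength: sum of the weights of the links attached to node 'a'
strength_a = toy_net.degree("a", weight="weight")
# Edge weights: third element of each (node_i, node_j, weight_ij) triple
toy_weights = [w for _, _, w in toy_net.edges(data="weight")]

print(degree_a, strength_a, toy_weights)
```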



In [None]:
# First read in the data (modify the path in the code snippet above)
network = nx.read_weighted_edgelist(network_path)

# TODO: Get the node degrees, node strengths, and link weights
degrees = []
strengths = []
weights = []

# YOUR CODE HERE
for node in network.nodes():
    degrees.append(network.degree(node))
    strengths.append(network.degree(node, weight='weight'))

for edge in network.edges(data='weight'):
    weights.append(edge[2])

print('Minimum degree: {}'.format(min(degrees)))
print('Maximum degree: {}'.format(max(degrees)))
print('Minimum strength: {}'.format(min(strengths)))
print('Maximum strength: {}'.format(max(strengths)))
print('Minimum weight: {}'.format(min(weights)))
print('Maximum weight: {}'.format(max(weights)))

### b) Complementary cumulative distribution functions (2 pts)

Next, we'll look at how the degrees, strengths, and edge weights are distributed. To this end, we will plot their complementary cumulative distribution functions (CCDFs). Recall that the CCDF $P(X > x)$ is the fraction of data points whose values are larger than $x$, as a function of $x$. 

Your next task is to write the function, `plot_ccdf`, that lets you generate plots of CCDFs. Its key input is `datavec`, which is simply a list of the data points whose CCDF you want to plot. For the rest, see the documentation, and also the code block below.

When your plotting function is complete, fill in the block of code below it. Your goal is to plot the CCDFs of all three variables (degrees, strengths, and edge weights) in a single plot. For comparison, also plot exponential distributions with the same means. Recall that the CCDF of an exponential distribution with mean $\mu$ is given by $\exp(-x / \mu)$.
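
For reference, the exponential CCDF follows directly from the density of the exponential distribution with mean $\mu$, $f(t) = \frac{1}{\mu} e^{-t/\mu}$ for $t \geq 0$:
$$
P(X > x) = \int_x^{\infty} \frac{1}{\mu} e^{-t/\mu} \, dt = \left[ -e^{-t/\mu} \right]_x^{\infty} = e^{-x/\mu}.
$$
With a logarithmic $y$-axis and a linear $x$-axis this would be a straight line of slope $-1/\mu$; on log-log axes it instead appears as a curve that bends downward once $x$ exceeds $\mu$.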

Inspect your plot and answer the MyCourses quiz questions.

**Coding hints**:

For the CCDF, first sort the list containing the data points from the smallest to the largest. Then, for each data point $x$, compute the fraction of data points larger than it (this is quite simple, just consider how long your list is and how long the remaining list with values higher than $x$ is). Your CCDF is now just two lists: the sorted values and the fractions of values above each sorted value. 

As a sanity check, please note that $P(X > x)$ is always a non-increasing function of $x$ because if you compare two values $x_1$ and $x_2$ where $x_1 > x_2$, you never find more data points above $x_1$ than above $x_2$.
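
The recipe above can be sketched on a small made-up list (the values and the `toy_` names are assumptions for illustration). This version counts, for each sorted value, how many points are not larger than it, so tied values get the same fraction:

```python
import numpy as np

toy_data = [2, 1, 5, 2, 3]        # made-up values
sorted_vals = np.sort(toy_data)   # smallest to largest
n = len(sorted_vals)

# Number of points <= x for each sorted value x (handles ties)
n_not_larger = np.searchsorted(sorted_vals, sorted_vals, side="right")
# Fraction of points strictly larger than x
toy_ccdf = (n - n_not_larger) / n

print(sorted_vals)
print(toy_ccdf)
```

The resulting fractions never increase along the sorted values, in line with the sanity check above.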

In [None]:
def plot_ccdf(datavec, label, linestyle, color, ax):
    """
    Plots the complementary cumulative distribution function (CCDF) of the input data.
    
    Parameters
    -----------
    datavec: a list of data points
    label: str, label for the legend
    linestyle: str, style of the line
    color: str, color of the line
    ax: axis object where to plot using ax.plot(xdata,ydata,linestyle=linestyle,color=color,label=label)
    """
    # TODO: Calculate the CCDF of the datavec and plot it
    # YOUR CODE HERE
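    datavec = np.sort(datavec)
    n = len(datavec)
    # Fraction of data points strictly larger than each sorted value:
    # the i-th smallest value (i = 1, ..., n) has n - i points above it
    # (for tied values, the lowest of the plotted fractions is the exact one)
    ccdf = 1 - np.arange(1, n + 1) / n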

    ax.plot(datavec, ccdf, linestyle=linestyle, color=color, label=label)
    return ax


In [None]:
# Let us plot the empirical CCDFs
datavecs = [degrees, strengths, weights]
styles = ['-'] * 3
colors = ['tab:blue', 'tab:orange', 'tab:green']
labels = ['degree', 'strength', 'weight']
xlabel = 'degree/strength/weight'
ylabel = 'CCDF'

fig_ccdf, ax_ccdf = plt.subplots(figsize=(8, 6))
for datavec, label, style, color in zip(datavecs, labels, styles, colors):
    ax_ccdf = plot_ccdf(datavec, label, style, color, ax_ccdf)

max_degree = 0
max_strength = 0
max_weight = 0

# TODO: Get the maximum degree, maximum strength, and maximum weight
# YOUR CODE HERE
max_degree = max(degrees)
max_strength = max(strengths)
max_weight = max(weights)

mean_degree = 0
mean_strength = 0
mean_weight = 0

# TODO: Get the mean degree, mean strength, and mean weight
# YOUR CODE HERE
mean_degree = np.mean(degrees)
mean_strength = np.mean(strengths)
mean_weight = np.mean(weights)

# Now let us plot the exponential counterparts
labels_for_exp = ['exponential degree', 'exponential strength', 'exponential weight']
maxima = [max_degree, max_strength, max_weight]
means = [mean_degree, mean_strength, mean_weight]

# TODO: Define the CCDF of an exponential distribution with mean mu
def ccdf_exp(x, mu):
    # YOUR CODE HERE
    return np.exp(-x / mu)

for mean, maximum, label, color in zip(means, maxima, labels_for_exp, colors):
    x = np.geomspace(1, maximum, num=100)
    ax_ccdf.plot(x, ccdf_exp(x, mean), linestyle='--', color=color, label=label)

# Use the loglog scale
ax_ccdf.set_xscale('log')
ax_ccdf.set_yscale('log')

ax_ccdf.set_xlabel(xlabel)
ax_ccdf.set_ylabel(ylabel)
ax_ccdf.legend()
ax_ccdf.set_ylim(bottom=2E-5, top=2)
ax_ccdf.grid()

In [None]:
# Save the figure in the current directory
fig_ccdf.savefig('ccdfs.pdf', bbox_inches='tight')

### c) Average link weight per node (4 pts)
Next, we will study how the average link weight per node $\langle w \rangle =\frac{s}{k}$ behaves as a function of the node degree $k$. 

i) **Compute** $s$, $k$, and $\langle w \rangle = \frac{s}{k}$ for each node.

ii) Generate a **scatter plot** of all the data points of $\langle w \rangle$ as a function of $k$. Create two versions of the plots: one with linear and one with logarithmic $x$-axis.

iii) The large variance of the data can make the scatter plots a bit messy. To make the relationship between $\langle w \rangle$ and $k$ more visible, **generate** bin-averaged versions of the plots, i.e., divide nodes into bins based on their degree and calculate the average $\langle w \rangle$ in each bin. Plot the bin-averaged versions on top of the scatter plots.

Use the results of i)-iii) to answer the questions in the MyCourses quiz.

**Hints**:
- A good choice for the number of bins in this case would be around 20.

- In the bin-averaged plot, the size of the points represents the number of nodes in each bin. An unbalanced distribution of observations, where some bins have far more points than others, may obscure the results.

- The `scipy.stats.binned_statistic` function is especially useful for this and the following questions.
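
As a quick illustration of bin averaging, here is a sketch with made-up numbers and hand-picked bin edges (all `toy_` names are assumptions for this example). It computes the mean of `toy_y` and the number of points in each bin of `toy_x`:

```python
import numpy as np
from scipy.stats import binned_statistic

toy_x = np.array([1, 2, 3, 10, 20])            # e.g. degrees
toy_y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])    # e.g. average link weights
toy_edges = [0, 5, 25]                         # two bins: [0, 5) and [5, 25]

# binned_statistic returns (statistic, bin_edges, binnumber)
toy_means, _, _ = binned_statistic(toy_x, toy_y, statistic="mean", bins=toy_edges)
toy_counts, _, _ = binned_statistic(toy_x, toy_y, statistic="count", bins=toy_edges)

print(toy_means, toy_counts)
```

Bins that contain no data points yield `nan` for the mean, so such bins simply do not show up in a scatter plot.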

In [None]:
def bin_averaged_plot(x, y, bins, ax):
    """
    Plots the bin-averaged values of y as a function of x.

    Parameters
    -----------
    x: x values, list or numpy array
    y: y values, list or numpy array
    bins: bin edges of x, list or numpy array
    ax: axis object where to plot
    """
    # bin-averaged values of x
    x_bin_averages, _, _ = binned_statistic(x, x, statistic='mean', bins=bins)

    # TODO: Use binned_statistic to calculate the bin-averaged values of y. 
    # see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html
    
    # YOUR CODE HERE
    y_bin_averages, _, _ = binned_statistic(x, y, statistic='mean', bins=bins)

    # Count the number of values in each bin.
    # This is used to calculate the size of the scatter points
    counts, _, _ = binned_statistic(x, x, statistic='count', bins=bins)
    
    # Plot the bin-averaged values
    ax.scatter(x_bin_averages, y_bin_averages, marker='o', color='r', 
               s=2*np.sqrt(counts), label='bin averages')
    return ax


In [None]:
# average link weight per node (list or numpy array)
av_weight = []
#TODO: generate a list whose entries are average link weights (=strength/degree) of each node
# YOUR CODE HERE
av_weight = [strengths[i] / degrees[i] for i in range(len(degrees))]

print("The largest average weight is "+str(max(av_weight)))

n_bins = 20
n_lin = 5
min_deg = min(degrees)
max_deg = max(degrees)

# Create linear and log bins 
linbins = create_linbins(min_deg, max_deg, n_bins)
logbins = create_logbins(min_deg, max_deg, n_log=n_bins-n_lin, n_lin=n_lin)

# Take a look at the logarithmic bins (comment out the print if not needed)
print(logbins)

In [None]:
alpha = 0.1 # transparency of data points in the scatter

# For each of the linear and logarithmic versions, the full scatter plot and 
# the bin-averaged plot should be shown in one figure.

for bins, scale in zip([linbins, logbins], ['linear', 'log']):
    fig, ax = plt.subplots(figsize=(6, 4))

    # Scatter plot all data points on ax
    ax.scatter(degrees, av_weight, marker='o', color='k', s=1.5, alpha=alpha)

    # Plot the bin-averaged values on ax
    ax = bin_averaged_plot(degrees, av_weight, bins, ax)
    
    ax.set_xscale(scale)
    # TODO: Set appropriate labels using ax.set_xlabel('your_label_here'), ax.set_ylabel('your label here')

    # YOUR CODE HERE
    ax.set_xlabel('degree')
    ax.set_ylabel('average link weight')
    
    ax.set_ylim(bottom=0.0)
    ax.grid()

    ax.legend(loc='best')
    ax.set_title('avg. link weight vs. degree')


In [None]:
# Save the last figure created in the loop above (the log-scale version) in the current directory
fig.savefig('avg_weight_vs_degree.pdf', bbox_inches='tight')

### d) Social signatures of egocentric networks

Next, we will study the social signatures of egocentric networks. The social signature of a node is a vector that describes the distribution of link weights in the egocentric network of the node. We define the social signature of node $i$ with degree $k_i$ as
$$
\mathbf{p}_i = \left( p_{i,1}, p_{i,2}, \ldots, p_{i,k_i} \right), \quad p_{i, 1} \geq p_{i, 2} \geq \ldots \geq p_{i, k_i},
$$
where $p_{i,j}$ is the weight of the $j$-th link incident to node $i$ (with links sorted in descending order of weight), normalized by the sum of the weights of all links incident to $i$ (i.e., the strength of node $i$):
$$
p_{i,j} = \frac{w_{i,j}}{\sum_{l=1}^{k_i} w_{i,l}}.
$$

Here, your task is to **compute** the social signatures of the nodes in the network that have degree $k_i = 25$ and **plot** the signatures $p_{i, j}$ as a function of rank $j$.
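
To make the definition concrete, the sketch below computes the signature for a single hypothetical node with three links; the weights and the `toy_` names are made up for illustration.

```python
import numpy as np

# Made-up weights of the links incident to one node (strength = 10)
toy_link_weights = np.array([2.0, 5.0, 3.0])

# Normalize by the strength, then sort from largest to smallest
toy_signature = np.sort(toy_link_weights / toy_link_weights.sum())[::-1]
# Ranks 1, 2, ..., k_i for the x-axis
toy_rank = np.arange(1, len(toy_signature) + 1)

print(toy_rank, toy_signature)
```

By construction, the entries of a signature sum to one and never increase with rank.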

In [None]:
def social_signature(G, node):
    """Calculate the social signature of a node in a network.

    Parameters
    ----------
    G : NetworkX graph
        The network of interest
    node : int
        The node of interest

    Returns
    -------
    signature : np.ndarray
        The social signature of the node, i.e., the tie strengths between 
        the node and its neighbors, normalized by the sum and sorted in 
        descending order, from the largest tie strength to the smallest tie strength.
    """
    signature = np.array([]) # REPLACE!

    # hint: to get the link weights of "node" in graph G, use G[node].values() and the attribute 'weight' ([x['weight'] for x in G[node].values()])
    # for sorting, see, e.g., https://docs.python.org/3/howto/sorting.html
    
    # YOUR CODE HERE
    neighbors = G.neighbors(node)
    signature = np.array([G[node][neighbor]['weight'] for neighbor in neighbors])
    signature = signature / np.sum(signature)
    signature = np.sort(signature)[::-1]
    return signature

In [None]:
nodes_k = []
# TODO: Get all the nodes with degree 25 and store them in nodes_k
# YOUR CODE HERE
nodes_k = [node for node in network.nodes() if network.degree(node) == 25]

print("Found "+str(len(nodes_k))+" nodes of degree 25")

# Plot social signatures of nodes with degree 25

largest_top_fraction=-1

fig_signature, ax_signature = plt.subplots(figsize=(6, 6))
for node in nodes_k:
    signature = social_signature(network, node)
    rank = np.arange(1, len(signature) + 1)
    ax_signature.plot(rank, signature, 'o-', alpha=0.6)

    if signature[0] > largest_top_fraction:
        largest_top_fraction = signature[0]
        
ax_signature.set_yscale('log')
ax_signature.set_ylim(top=1)
ax_signature.set_xlabel('Neighbor rank')
ax_signature.set_ylabel('Fraction of tie strength')

print('Largest fraction of weight at rank 1: '+str(largest_top_fraction))

In [None]:
# Save the figure in the current directory
fig_signature.savefig('social_signatures_rnd.png', dpi=300, bbox_inches='tight')

### e) Link neighborhood overlap (4 pts)
Let's consider a link between nodes $i$ and $j$. For this link, the *link neighborhood overlap* $O_{ij}$ is defined as the fraction of common neighbors of $i$ and $j$ out of all their neighbors: $O_{ij}=\frac{n_{ij}}{\left(k_i-1\right)+\left(k_j-1\right)-n_{ij}}.$

Your task is to calculate the link overlap for all links in the data and investigate how it behaves as a function of link weight. 

- **Calculate** the link neighborhood overlap for each link. To do this, you must modify the `get_link_overlap` function.

- Create a **scatter plot** showing the overlaps as a function of link weight.

- As in c), **produce** a bin-averaged version of the plot on top of the scatter plot. Use the binning strategy (linear or logarithmic) most suitable for this case. A reasonable number of bins would be between 10 and 30. 

**Coding hints**:

- To extract the common neighbors of two nodes, you can use the `set` datatype and the `intersection()` method.

- A naive implementation of the above definition of $O_{ij}$ will result in ZeroDivisionError for some links. Examine for what kind of links this happens. What would be the appropriate way to define $O_{ij}$ for those links?
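
The definition can be checked by hand on a tiny made-up graph: a triangle `a`, `b`, `c` with a pendant node `d` attached to `c`, plus an isolated link `e`-`f`. The graph and the helper `toy_overlap` are assumptions for illustration, and returning 0 when the denominator vanishes is just one possible convention, assumed here.

```python
import networkx as nx

toy_net = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")])

def toy_overlap(net, i, j):
    """Overlap of link (i, j); returns 0 if the denominator is zero (assumed convention)."""
    # Common neighbors of i and j
    n_ij = len(set(net[i]) & set(net[j]))
    denominator = (net.degree(i) - 1) + (net.degree(j) - 1) - n_ij
    return n_ij / denominator if denominator > 0 else 0.0

for i, j in toy_net.edges():
    print(i, j, toy_overlap(toy_net, i, j))
```

For the link `a`-`b`, the only other neighbor of either endpoint is `c`, which is shared, so the overlap is 1. For `e`-`f`, neither endpoint has any other neighbor, which is exactly the case where the denominator is zero.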
 

In [None]:
def get_link_overlap(net):
    """
    Calculates link overlap: 
    O_ij = n_ij / [(k_i - 1) + (k_j - 1) - n_ij]

    Parameters
    -----------
    net: a networkx.Graph() object

    Returns
    --------
    overlaps: list of link overlaps in net
    """

    # TODO: write a function to calculate link neighborhood overlap
    overlaps = []
    # YOUR CODE HERE
    for edge in net.edges():
        node_i = edge[0]
        node_j = edge[1]
        neighbors_i = set(net.neighbors(node_i))
        neighbors_j = set(net.neighbors(node_j))
        n_ij = len(neighbors_i.intersection(neighbors_j))
        k_i = len(neighbors_i)
        k_j = len(neighbors_j)
        if ((k_i - 1) + (k_j - 1) - n_ij) == 0:
            overlaps.append(0)
        else:
            overlaps.append(n_ij / ((k_i - 1) + (k_j - 1) - n_ij))
    return overlaps

In [None]:
overlaps = get_link_overlap(network)

print('The largest link overlap is '+str(max(overlaps)))

alpha = 0.1 # transparency of data points in the scatter

n_bins = 0 # REPLACE!
# TODO: Change the number of bins to a number you find reasonable
# YOUR CODE HERE
n_bins = 10
n_lin = 5
min_w = np.min(weights)
max_w = np.max(weights)

linbins = create_linbins(min_w, max_w, n_bins)
logbins = create_logbins(min_w, max_w, n_log=n_bins-n_lin, n_lin=n_lin)

# TODO: try both linear and logarithmic bins, select the better one (bins=logbins and scale='log' or bins=linbins and scale='linear')
bins = []
scale = ''
# YOUR CODE HERE

bins = logbins  # logarithmic bins to match the logarithmic x-axis
scale = 'log'

# creating link neighborhood overlap scatter
fig_lno, ax_lno = plt.subplots()

ax_lno.scatter(weights, overlaps, marker='o', color='k', s=1.5, alpha=alpha)
ax_lno = bin_averaged_plot(weights, overlaps, bins, ax_lno)
ax_lno.set_xscale(scale)
ax_lno.set_yscale('log')
ax_lno.set_ylim(bottom=0.01)
ax_lno.grid()

#TODO: Set appropriate labels
ax_lno.set_xlabel(' ')
ax_lno.set_ylabel(' ')

# YOUR CODE HERE
ax_lno.set_xlabel('weight')
ax_lno.set_ylabel('overlap')

ax_lno.text(0.95, 0.95, "number of bins: {}".format(n_bins),
        horizontalalignment='right', verticalalignment='top', 
        backgroundcolor='w', transform=ax_lno.transAxes)

ax_lno.set_title('Overlap vs. weight')


In [None]:
fname_lno = 'o_vs_w.png'
fig_lno.savefig(fname_lno, dpi=300, bbox_inches='tight')
print('Link neighborhood overlap scatter saved as ' + fname_lno)