# Exercise Set 13: Network formation


In this exercise set we investigate network formation among high school pupils.

## Part 1: Network formation


Load the data using the script below. Read a bit about the dataset [here](http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/) to get an understanding of what is in each variable. 

The script gives you two dataframes to work with: 
 > `el`, which is an edge-list 
 >
 > `ind` which contains individual characteristics

In [2]:
import networkx as nx
import numpy as np
import pandas as pd

url_base = 'http://www.sociopatterns.org/wp-content/uploads/2015/'

# edgelist
url_el = url_base + '07/High-School_data_2013.csv.gz'
col_names_el = ['timestamp', 'u1', 'u2', 'class1', 'class2']
el = pd.read_csv(url_el, header=None, names=col_names_el, delimiter=' ')

# individual characteristics
url_ind = url_base + '09/metadata_2013.txt'
col_names_ind = ['u', 'class', 'gender']
ind = pd.read_csv(url_ind, header=None, names=col_names_ind, delimiter='\t')\
            .set_index('u')

# remove observation with missing gender
has_gender = ind[ind.gender!='Unknown'].index

# DataFrames
ind = ind.loc[has_gender].copy()
el = el[el.u1.isin(has_gender) &  el.u2.isin(has_gender)].copy()

In [6]:
# el: contact edge list, restricted to rows where both u1 and u2 have a known gender
# ind: individual characteristics (class, gender), restricted to individuals with a known gender
el

Unnamed: 0,timestamp,u1,u2,class1,class2
0,1385982020,454,640,MP,MP
1,1385982020,1,939,2BIO3,2BIO3
2,1385982020,185,258,PC*,PC*
3,1385982020,55,170,2BIO3,2BIO3
4,1385982020,9,453,PC,PC
...,...,...,...,...,...
188503,1386345560,120,285,PC,PC
188504,1386345580,61,160,2BIO2,2BIO2
188505,1386345580,272,939,2BIO3,2BIO3
188506,1386345580,311,496,PC,PC


> **Ex. 13.1.1**: Describe the content of the edge-list columns. Parse the timestamp. What is the resolution of meetings? Use the parsed timestamp to count the meetings by hour in local time.

In [96]:
# [Answer to ex. 13.1.1 here]

# Column 1, timestamp: the contact was active during the interval [t - 20s, t].
# If several contacts are active in the same interval, several rows share the same value of t.
# Time is measured in seconds and expressed as Unix time.

# Columns 2 and 3, u1 and u2: ID numbers of the two persons who are in contact
# during the given 20-second window (cf. column 1).

# Columns 4 and 5, class1 and class2: the class of u1 and the class of u2.

# Parse the timestamp (Unix time in seconds). The result is timezone-naive UTC.
el['timestamp'] = pd.to_datetime(el['timestamp'], unit='s')

# Resolution of meetings: 20 seconds. Each row records one 20-second interval of
# contact, and the timestamp marks the end of that interval.

# Count the meetings by hour. The bins below are in UTC; the data were collected
# in France (UTC+1 in December), so local hours are one hour later than shown.
# Grouping on el['timestamp'].dt.hour alone would pool the same hour of day
# across days instead of counting each calendar hour separately.

contacts = el.groupby(pd.Grouper(key='timestamp',freq="h")).size().reset_index(name="contacts")
contacts

Unnamed: 0,timestamp,contacts
0,2013-12-02 11:00:00,5556
1,2013-12-02 12:00:00,4259
2,2013-12-02 13:00:00,6617
3,2013-12-02 14:00:00,5715
4,2013-12-02 15:00:00,5972
...,...,...
96,2013-12-06 11:00:00,4106
97,2013-12-06 12:00:00,3247
98,2013-12-06 13:00:00,1785
99,2013-12-06 14:00:00,2026
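The hourly bins above are in UTC, while the exercise asks for local time. A minimal sketch of the conversion, assuming the school's local time zone is `Europe/Paris` (an assumption based on the data being collected in France) and using three made-up timestamps rather than the full edge list:

```python
import pandas as pd

# Three illustrative Unix timestamps (seconds); the first equals the first row of `el`.
ts = pd.Series([1385982020, 1385982040, 1385985620])

# Parse as timezone-aware UTC, then convert to the assumed local zone.
local = pd.to_datetime(ts, unit='s', utc=True).dt.tz_convert('Europe/Paris')

# Count meetings by local hour of day.
by_hour = local.dt.hour.value_counts().sort_index()
print(by_hour)
```

Applied to `el`, the same two calls on the raw integer `timestamp` column would shift every bin by one hour relative to the UTC counts.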


> **Ex. 13.1.2**: Count the number of meetings for each edge and save this as a DataFrame called `el_agg`. Filter out edges with less than 5 minutes of meetings. Attach the gender and class of both nodes.

In [102]:
# [Answer to ex. 13.1.2 here]
# count the number of meetings (20-second contact records) for each edge
count_edge_occurces = el.groupby(['u1', 'u2']).size()
count_edge_occurces


u1    u2  
1     55       8
      63       2
      101      1
      106      4
      117     18
              ..
1805  1870     1
      1894     5
1819  1894     1
1828  1894     2
1870  1894    29
Length: 5583, dtype: int64
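Grouping on `['u1', 'u2']` treats the pair as ordered, so it counts an undirected edge correctly only if each pair always appears in the same orientation. The rows shown above all have `u1 < u2`, but if that is not guaranteed, a sketch like the following (with a made-up toy edge list) normalises the orientation before counting:

```python
import numpy as np
import pandas as pd

# Toy edge list in which the pair (1, 2) appears in both orientations.
toy = pd.DataFrame({'u1': [1, 2, 1, 3], 'u2': [2, 1, 3, 1]})

# Put the smaller id first so that (2, 1) and (1, 2) become the same edge.
lo = np.minimum(toy['u1'], toy['u2'])
hi = np.maximum(toy['u1'], toy['u2'])
counts = toy.assign(u1=lo, u2=hi).groupby(['u1', 'u2']).size()
print(counts)
```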

In [110]:
# create a new DataFrame with one row per edge (u1, u2) and the number of meetings on that edge
el_agg = count_edge_occurces.to_frame(name = 'meet_count').reset_index()
el_agg

Unnamed: 0,u1,u2,meet_count
0,1,55,8
1,1,63,2
2,1,101,1
3,1,106,4
4,1,117,18
...,...,...,...
5578,1805,1870,1
5579,1805,1894,5
5580,1819,1894,1
5581,1828,1894,2


In [113]:
# keep only edges with at least 5 minutes of meetings; at 20 seconds per record, 5 minutes = 15 records
el_agg = el_agg[el_agg['meet_count'] > 14]  
el_agg

Unnamed: 0,u1,u2,meet_count
4,1,117,18
7,1,196,38
10,1,205,47
13,1,494,123
21,1,939,85
...,...,...,...
5560,1518,1784,165
5568,1543,1784,29
5572,1594,1819,129
5573,1594,1828,1285


In [114]:
# merge with ind df to get node characteristics such as gender and class for each node
el_agg = (el_agg.merge(ind, left_on='u1', right_index=True)
                .rename(columns={'class': 'u1_class', 'gender': 'u1_gender'})
                .reset_index(drop=True))
el_agg = (el_agg.merge(ind, left_on='u2', right_index=True)
                .rename(columns={'class': 'u2_class', 'gender': 'u2_gender'})
                .reset_index(drop=True))


In [106]:
el_agg

Unnamed: 0,u1,u2,meet_count,u1_class,u1_gender,u2_class,u2_gender
0,1,117,18,2BIO3,M,2BIO3,M
1,39,117,27,2BIO3,F,2BIO3,M
2,55,117,26,2BIO3,F,2BIO3,M
3,101,117,44,2BIO3,F,2BIO3,M
4,106,117,104,2BIO3,F,2BIO3,M
...,...,...,...,...,...,...,...
1370,1332,1819,120,MP*2,F,MP*2,M
1371,1423,1819,62,MP*2,M,MP*2,M
1372,1594,1819,129,MP*2,F,MP*2,M
1373,1332,1870,21,MP*2,F,MP*2,M


> **Ex. 13.1.3**: Answer the questions in the function `fraction_triangles` below. Explain how `fraction_triangles` is related to computing the clustering coefficient (using `nx.average_clustering`).
>
>> *Hint:* The following code does the same thing as `fraction_triangles`, but at a scale where you can understand what is going on. If you have a hard time understanding the code in the function, try playing around with this simpler example.
>>
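>> ```python
>> import networkx as nx
>> import numpy as np
>> from scipy.special import binom
>>
>> # adjacency matrix of a small undirected graph with one triangle (0-1-2)
>> A  = np.array(
>>     [[0, 1, 1, 0],
>>      [1, 0, 1, 0],
>>      [1, 1, 0, 1],
>>      [0, 0, 1, 0]]
>> )
>>
>> G = nx.from_numpy_array(A)
>> nx.draw(G, with_labels=True)
>>
>> def nth(A, n):
>>     '''Return the n-th matrix power of A.'''
>>     A_ = A.copy()
>>     for _ in range(1, n):
>>         A = A.dot(A_)
>>     return A
>>
>> # every triangle is counted 6 times on the diagonal of A^3
>> a_t = nth(A, 3).diagonal().sum() / 6
>> n = len(A[:, 0])
>> p_t = binom(n, 3)  # number of potential triangles
>> ```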


In [118]:
def make_net(el_, nodes):
    '''
    Convert edgelist to networkx graph which is 
    binary and undirected.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''    
    
    nx_input = el_, 'u1', 'u2', 'meet_count', nx.Graph()
    g = nx.from_pandas_edgelist(*nx_input)
    g.add_nodes_from(nodes)
    return g

In [119]:
from scipy.special import binom

def fraction_triangles(el_, nodes):
    '''
    Compute fraction of actual triangles out 
    of the potential triangles.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''
    
    g = make_net(el_, nodes)
    
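    # Answers already given ;-) 
    # Q.1: what is `A`? The adjacency matrix, which is symmetric and binary.
    # Q.2: what does `A**3` do? It is the third matrix power of A; entry (i, j)
    #      is the number of walks of length 3 from node i to node j.
    # Q.3: what is the diagonal of A_t? Entry (i, i) is the number of closed walks
    #      of length 3 that start and end at node i, i.e. twice the number of
    #      triangles that include i. Summed over all nodes, every triangle is
    #      counted 6 times (3 starting nodes x 2 directions), hence the division by 6.
    
    # count actual triangles
    # (`nx.to_scipy_sparse_matrix` was removed in NetworkX 3.0, and `**` is
    # elementwise on SciPy sparse arrays, so the cube is written with `@`)
    A = nx.adjacency_matrix(g)
    A_t = A @ A @ A
    a_t = A_t.diagonal().sum()/6
    
    # Q.4: what does `binom(n,3)` compute? The number of distinct triples of nodes,
    #      i.e. the number of potential triangles in a network with n nodes.
    
    # count potential triangles
    n = len(g.nodes())
    p_t = binom(n, 3)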
        
    return a_t/p_t
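
For the second part of Ex. 13.1.3, it can help to compare the quantities on the small graph from the hint. The sketch below is an illustration rather than a full answer: it computes the triangle fraction by hand and sets it beside NetworkX's `average_clustering` and `transitivity`. All three count closed triangles in the numerator; they differ in the denominator (all node triples, versus pairs of neighbours per node, versus connected triples).

```python
import networkx as nx
import numpy as np
from scipy.special import binom

# Same 4-node graph as in the hint: one triangle (0-1-2) plus a pendant node 3.
A = np.array(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
)
G = nx.from_numpy_array(A)

# Triangles via the trace of A^3 (every triangle is counted 6 times).
n_triangles = np.trace(np.linalg.matrix_power(A, 3)) / 6

# Triangle fraction: triangles divided by ALL node triples.
frac = n_triangles / binom(len(A), 3)

# Average local clustering: per node, triangles divided by pairs of neighbours.
avg_clust = nx.average_clustering(G)

# Transitivity: 3 * triangles divided by connected triples.
trans = nx.transitivity(G)

print(n_triangles, frac, avg_clust, trans)
```

On this graph the triangle fraction is the smallest of the three, because its denominator also includes triples that are not even connected.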

> **Ex. 13.1.4**: Apply the function `fraction_triangles` to `el_agg` and print the triangle fraction in the network. Next remove all edges that go between classes. Compute triangle fraction within each class and store it. Compute the mean within class triangles and bootstrap the standard error of the mean. Comment on the output.
>
>> *Hint:* To bootstrap an estimate, draw $k \gg 0$ samples with replacement from the data. Compute the estimate on each of these samples and average them in the end to get the bootstrapped estimate. 

In [134]:
# [Answer to ex. 13.1.4 here]
# Node array to feed into fraction_triangles. Nodes that appear only as u2 are
# still added to the graph through the edge list in make_net, so n counts every
# individual with at least one remaining edge.
u = np.sort(el_agg.u1.unique())
print(f"The triangle fraction in the network is: {fraction_triangles(el_agg, u)}")

The triangle fraction in the network is: 0.0003160278606511087


In [225]:
# Remove all edges that go between classes, keeping only intra-class edges
intraclass_all = el_agg[el_agg.u1_class == el_agg.u2_class]

# there are 9 classes
classes = intraclass_all.u1_class.unique()

# compute and store the triangle fraction for each class
tfs = []
for c in classes:
    class_el = intraclass_all[intraclass_all.u1_class == c]
    # nodes that appear only as u2 are added through the edges in make_net
    class_nodes = class_el.u1.unique()
    tfs.append(fraction_triangles(class_el, class_nodes))

# mean of the class triangle fractions
mean_classes = np.mean(tfs)

# TODO: bootstrap the standard error of the mean of the class triangle fractions
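
The bootstrap step described in the hint to Ex. 13.1.4 could be sketched as follows. This is a minimal illustration, not the notebook's answer: the helper name `bootstrap_se_mean` and the nine input values are made up, and in practice the list of per-class triangle fractions would take their place.

```python
import numpy as np

def bootstrap_se_mean(values, k=10_000, seed=0):
    """Bootstrap the mean of `values` (hypothetical helper).

    Draws k resamples with replacement, computes the mean of each, and
    returns the average of those means and their standard deviation,
    which estimates the standard error of the mean.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # index matrix: one row of resampled positions per bootstrap draw
    idx = rng.integers(0, len(values), size=(k, len(values)))
    boot_means = values[idx].mean(axis=1)
    return boot_means.mean(), boot_means.std(ddof=1)

# Made-up per-class triangle fractions, for illustration only.
example_tfs = [0.02, 0.05, 0.03, 0.04, 0.06, 0.01, 0.03, 0.05, 0.04]
boot_mean, boot_se = bootstrap_se_mean(example_tfs)
print(boot_mean, boot_se)
```

With only nine classes the resamples are small, so the bootstrapped standard error should be read as a rough indication rather than a precise estimate.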


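Recall from class that we can define the following measures of homophily. We define the **homophily index**, inspired by [Currarini et al. (2009)](https://doi.org/10.2139/ssrn.1021650), as:
- the share of edges that connect nodes of the same type: $H = \frac{s}{s+d}$, where $s$ is the number of same-type edges and $d$ the number of different-type edges
- possible range: $[0,1]$


We define **baseline homophily** as: 
- the fraction of potential edges in the population of nodes that connect nodes of the same type, where $n_t$ is the number of nodes of type $t$ and $n$ is the total number of nodes: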

\begin{equation}B=\frac{\sum_t\#potential(n_t)}{\#potential(n)}, \qquad \#potential(k)=\frac{k\cdot(k-1)}{2}\end{equation}

- Interpretation: Expected homophily from random link formation.     

We define **inbreeding homophily** as:      

\begin{equation}IH=\frac{H-B}{1-B}\end{equation}
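
The three measures can be traced on a toy example. The sketch below is an illustration under stated assumptions: the function names `n_potential` and `inbreeding_homophily`, the column names `u1` and `u2`, and the four-node data are all made up for the example.

```python
import pandas as pd

def n_potential(k):
    """Number of potential edges among k nodes: k(k-1)/2."""
    return k * (k - 1) / 2

def inbreeding_homophily(edges, types):
    """Return (IH, H, B) for an edge list and a node-type Series (hypothetical helper).

    edges : DataFrame with columns `u1` and `u2`
    types : Series mapping node id -> type (e.g. gender)
    """
    t1 = edges['u1'].map(types)
    t2 = edges['u2'].map(types)
    H = (t1 == t2).mean()                      # share of same-type edges, s / (s + d)
    counts = types.value_counts()              # n_t for each type t
    B = n_potential(counts).sum() / n_potential(len(types))
    return (H - B) / (1 - B), H, B

# Toy network: two F nodes, two M nodes, two same-type edges and one mixed edge.
toy_types = pd.Series({1: 'F', 2: 'F', 3: 'M', 4: 'M'})
toy_edges = pd.DataFrame({'u1': [1, 3, 1], 'u2': [2, 4, 3]})
IH, H, B = inbreeding_homophily(toy_edges, toy_types)
print(IH, H, B)
```

Here $H = 2/3$ and $B = 2/6 = 1/3$, so $IH = (2/3 - 1/3)/(1 - 1/3) = 1/2$: same-type edges are more common than random link formation would predict.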


> **Ex. 13.1.5**: Compute the inbreeding homophily for each class. Use the class measures to compute the mean. Use a bootstrap to test whether there is inbreeding homophily.

In [15]:
# [Answer to ex. 13.1.5 here]

> **Ex. 13.1.6** (BONUS): Describe what an unsupported edge is. Construct a test of whether there is a preference for forming triangles within the same gender rather than across genders.
>
>> *Hint:*  You can find inspiration in the approach of [Chandrasekhar, Jackson (2018)](https://web.stanford.edu/~arungc/CJ_sugm.pdf) pp. 31-35. They construct an almost identical test for triangle formation across castes in Indian villages.

In [None]:
# [Answer to ex. 13.1.6 here]