# Exercise Set 13: Network formation


In Exercise Set 13 we investigate network formation among high school pupils.

## Part 1: Network formation


Load the data using the script below. Read a bit about the dataset [here](http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/) to get an understanding of what is in each variable. 

The script gives you two dataframes to work with: 
- `el`, which is an edge list
- `ind`, which contains individual characteristics

In [1]:
import networkx as nx
import numpy as np
import pandas as pd

url_base = 'http://www.sociopatterns.org/wp-content/uploads/2015/'

# edgelist
url_el = url_base + '07/High-School_data_2013.csv.gz'
col_names_el = ['timestamp', 'u1', 'u2', 'class1', 'class2']
el = pd.read_csv(url_el, header=None, names=col_names_el, delimiter=' ')

# individual characteristics
url_ind = url_base + '09/metadata_2013.txt'
col_names_ind = ['u', 'class', 'gender']
ind = pd.read_csv(url_ind, header=None, names=col_names_ind, delimiter='\t')\
            .set_index('u')

# remove observations with missing gender
has_gender = ind[ind.gender!='Unknown'].index

# DataFrames
ind = ind.loc[has_gender].copy()
el = el[el.u1.isin(has_gender) &  el.u2.isin(has_gender)].copy()

> **Ex. 13.1.1**: Describe the content of the edgelist columns. Parse the timestamp. What is the resolution of meetings? Use the parsed timestamp to count the meetings by hour in local time.

In [2]:
# Each row is one contact between two students in a high school.
# `timestamp` (Unix seconds) is the end of the 20-second interval [t - 20s, t] during which
# the contact was active; `u1`, `u2` are the students and `class1`, `class2` are their classes.
el.head() 

Unnamed: 0,timestamp,u1,u2,class1,class2
0,1385982020,454,640,MP,MP
1,1385982020,1,939,2BIO3,2BIO3
2,1385982020,185,258,PC*,PC*
3,1385982020,55,170,2BIO3,2BIO3
4,1385982020,9,453,PC,PC


In [3]:
el['parse_time']=pd.to_datetime(el.timestamp,unit='s')

In [4]:
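# Note: the parsed timestamps are timezone-naive UTC, so `x.hour` is the UTC hour, not local time.
# The key is not zero-padded: '201312211' means 2013-12-02, hour 11.
el['date_time']=el.parse_time.apply(lambda x: str(x.year)+str(x.month)+str(x.day)+str(x.hour))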

In [5]:
el.groupby('date_time').size()

date_time
201312211    5556
201312212    4259
201312213    6617
201312214    5715
201312215    5972
201312310    5096
201312311    4675
201312312    4193
201312313    5172
201312314    3772
201312315    4316
20131237     6048
20131238     5286
20131239     7104
201312410    4013
201312411    3998
201312412    4555
201312413    3109
201312414    2567
201312415    2117
20131247     5100
20131248     6218
20131249     7309
201312510    4230
201312511    3063
201312512    3039
201312513    3680
201312514    3461
201312515    2595
20131257     4603
20131258     4851
20131259     6146
201312610    5051
201312611    4106
201312612    3247
201312613    1785
201312614    2026
201312615    1352
20131267     3877
20131268     4872
20131269     6898
dtype: int64
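
The hourly counts above are keyed by UTC hour. A minimal sketch of the same count in local time follows. It assumes the school is in the `Europe/Paris` time zone (an assumption, not stated in this notebook) and uses three toy timestamps rather than the full `el` table; the names `ts`, `hour_key` and `meetings_by_hour` are illustrative.

```python
import pandas as pd

# Toy Unix timestamps in seconds; the first one is taken from `el.head()`.
ts = pd.Series([1385982020, 1385982040, 1385985640])

# Parse as UTC, then convert to the assumed local time zone.
local = pd.to_datetime(ts, unit='s', utc=True).dt.tz_convert('Europe/Paris')

# Zero-padded keys are unambiguous and sort chronologically.
hour_key = local.dt.strftime('%Y-%m-%d %H:00')
meetings_by_hour = hour_key.value_counts().sort_index()
print(meetings_by_hour)
```

On the real data, replacing `ts` with `el.timestamp` should give the per-hour counts in local time.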

> **Ex. 13.1.2**: Count the number of meetings for each edge and save this as a DataFrame called `el_agg`. Filter out edges with less than 5 minutes of meetings. Attach the gender and class of both nodes.

In [6]:
# [Answer to ex. 13.1.2 here]
el_agg=el.groupby(['u1','u2']).size().to_frame()

In [12]:
ind=ind.reset_index()

In [10]:
el_agg=el_agg.reset_index()

In [8]:
el_agg.columns=['u1','u2','count']
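# each record covers 20 seconds, so 5 minutes corresponds to 15 records;
# note that `> 15` also drops edges with exactly 5 minutes of contact
el_agg=el_agg[el_agg['count']>15].reset_index() # contiguous row labels; old labels go to column `index`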

In [13]:
# look up class and gender of both endpoints; `ind` has `u` as a regular column here
attrs=ind.set_index('u')
el_agg['u1_class']=el_agg['u1'].map(attrs['class'])
el_agg['u2_class']=el_agg['u2'].map(attrs['class'])
el_agg['u1_gender']=el_agg['u1'].map(attrs['gender'])
el_agg['u2_gender']=el_agg['u2'].map(attrs['gender'])

In [14]:
el_agg=el_agg.drop(columns=['index'])

In [15]:
el_agg

Unnamed: 0,u1,u2,count,u1_class,u2_class,u1_gender,u2_gender
0,1,117,18,2BIO3,2BIO3,M,M
1,1,196,38,2BIO3,2BIO3,M,M
2,1,205,47,2BIO3,2BIO3,M,M
3,1,494,123,2BIO3,2BIO3,M,M
4,1,939,85,2BIO3,2BIO3,M,M
...,...,...,...,...,...,...,...
1315,1518,1784,165,MP*2,MP*2,M,M
1316,1543,1784,29,MP*2,MP*2,M,M
1317,1594,1819,129,MP*2,MP*2,F,M
1318,1594,1828,1285,MP*2,MP*2,F,M


> **Ex. 13.1.3**: Answer the questions in the function `fraction_triangles` below. Explain how `fraction_triangles` is related to computing the clustering coefficient (using `nx.average_clustering`).
>
>> *Hint:* The following code does the same thing as `fraction_triangles`, but at a scale where you can understand what's going on. If you have a hard time understanding the code in the function you can try to play around with this simpler example
>>
>> ```python
>> import networkx as nx
>> import numpy as np
>> from scipy.special import binom
>>
>> A  = np.array(
>>     [[0, 1, 1, 0],
>>      [1, 0, 1, 0],
>>      [1, 1, 0, 1],
>>      [0, 0, 1, 0]]
>> )
>>
>> G = nx.from_numpy_array(A)
>> nx.draw(G,with_labels=True)
>>
>> def nth(A, n):
>>     A_ = A.copy()    
>>     for _ in range(1,n):
>>         A = A.dot(A_)
>>     return A
>>
>> a_t = nth(A,3).diagonal().sum()/6
>> n = len(A[:,0])
>> p_t = binom(n, 3)
>> ```


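The average clustering coefficient and the fraction of triangles both measure how strongly nodes in a network tend to form triangles: the more triangles the network contains, the larger both become. They differ in how they normalize. `fraction_triangles` divides the total number of triangles by the number of possible node triples, `binom(n, 3)`, so it is a single network-wide ratio. `nx.average_clustering` first computes, for each node, the share of pairs of its neighbors that are themselves connected, and then averages these local values with equal weight on every node. In the hint example, node 2 has three neighbors but only one of the three possible triangles around it exists, so its local coefficient is 1/3.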
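
For the four-node graph in the hint, both measures can be computed directly. The sketch below is only an illustration on that toy graph; the variable names `n_triangles`, `frac`, `local_cc` and `avg_cc` are not part of the exercise code.

```python
import networkx as nx
import numpy as np
from scipy.special import binom

# Adjacency matrix from the hint: one triangle (0, 1, 2) plus a pendant node 3.
A = np.array(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
)
G = nx.from_numpy_array(A)

# Closed walks of length 3 count each triangle 6 times (3 corners x 2 directions).
n_triangles = np.linalg.matrix_power(A, 3).trace() / 6
frac = n_triangles / binom(len(A), 3)   # triangles / possible node triples

local_cc = nx.clustering(G)             # per-node clustering coefficients
avg_cc = nx.average_clustering(G)       # unweighted mean of the local values

print(n_triangles, frac, local_cc, avg_cc)
```

Here the single triangle out of four possible triples gives a fraction of 0.25, while the local coefficients 1, 1, 1/3 and 0 average to 7/12.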


In [16]:
def make_net(el_, nodes):
    '''
    Convert edgelist to networkx graph which is 
    binary and undirected.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''    
    
    nx_input = el_, 'u1', 'u2', 'count', nx.Graph()
    g = nx.from_pandas_edgelist(*nx_input) #unpack argument
    g.add_nodes_from(nodes)
    return g

In [17]:
from scipy.special import binom

def fraction_triangles(el_, nodes):
    '''
    Compute fraction of actual triangles out 
    of the potential triangles.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''
    
    g = make_net(el_, nodes)
    
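    #Q.1: what is `A`? the adjacency matrix of `g`, which is symmetric and binary
    #Q.2: what does `A**3` do? it is the matrix product A·A·A; entry (i, j) counts
    # the walks of length 3 from node i to node j
    #Q.3: what is the diagonal of A_t? the number of closed walks of length 3 that
    # start and end at each person, i.e. the triangles that include the person,
    # each counted twice (once per direction). Summing over all nodes counts
    # every triangle 6 times, hence the division by 6
    
    # count actual triangles
    # (`nx.to_scipy_sparse_matrix` was removed in networkx 3.0; with
    # `nx.to_scipy_sparse_array` use `A @ A @ A`, since `**` is then elementwise)
    A = nx.to_scipy_sparse_matrix(g)
    A_t = A**3
    a_t = A_t.diagonal().sum()/6
    
    #Q.4: what does `binom(n,3)` compute? the number of ways to choose 3 of the
    # n nodes, i.e. the number of potential triangles in the network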
    
    # count potential triangles
    n = len(g.nodes())
    p_t = binom(n, 3)
        
    return a_t/p_t
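
As a cross-check that does not depend on the sparse-matrix API, the same ratio can be sketched with `nx.triangles`. The function name `fraction_triangles_check` is hypothetical and takes a graph rather than an edgelist; it is an alternative route to the same number, not the exercise's intended solution.

```python
import networkx as nx
from scipy.special import binom

def fraction_triangles_check(g):
    # nx.triangles returns, per node, the number of triangles through it;
    # every triangle is counted once at each of its 3 corners.
    n_actual = sum(nx.triangles(g).values()) / 3
    n_potential = binom(g.number_of_nodes(), 3)
    return n_actual / n_potential

# Toy graph: a complete graph on 4 nodes (4 triangles) plus one isolated node.
g = nx.complete_graph(4)
g.add_node('isolated')
print(fraction_triangles_check(g))
```

The isolated node adds no triangles but raises the number of potential triples from 4 to 10, which is why `make_net` adds all nodes explicitly.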

> **Ex. 13.1.4**: Apply the function `fraction_triangles` to `el_agg` and print the triangle fraction in the network. Next remove all edges that go between classes. Compute triangle fraction within each class and store it. Compute the mean within class triangles and bootstrap the standard error of the mean. Comment on the output.
>
>> *Hint:* To bootstrap an estimate, draw $k \gg 0$ samples with replacement from the data and compute the estimate on each sample. The average of these estimates is the bootstrapped estimate, and their standard deviation is the bootstrapped standard error.

In [18]:
# [Answer to ex. 13.1.4 here]
fraction_triangles(el_agg, set(el_agg['u1'].to_list()+el_agg['u2'].to_list()))

0.00028807593795814057

In [19]:
el_agg=el_agg[el_agg['u1_class']==el_agg['u2_class']] #only within classes

In [21]:
def classes(el):
    return el.apply(lambda x:fraction_triangles(x, set(x['u1'].to_list()+x['u2'].to_list())))

In [22]:
class_triangle=classes(el_agg.groupby(['u1_class']))

In [23]:
class_triangle #the fraction of triangles is much larger within classes than in the whole network

u1_class
2BIO1    0.012529
2BIO2    0.033065
2BIO3    0.032591
MP       0.028188
MP*1     0.014347
MP*2     0.026079
PC       0.020311
PC*      0.016988
PSI*     0.009897
dtype: float64

In [24]:
np.mean(class_triangle) #mean 

0.021554986718998446

In [25]:
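def bootstrap(el_):
    '''Resample the edges of each class with replacement; `el_` is `el_agg` grouped by class.'''
    s=[]
    # one array of resampled row labels per class
    sample=el_.apply(lambda x: np.random.choice(x.index,size=len(x)))
    for subsample in sample:
        s.append(el_agg.loc[subsample,:])
    return s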

In [85]:
def nodes(el): #number of distinct nodes in an edgelist
    return len(np.unique(el['u1'].to_list()+el['u2'].to_list()))

In [41]:
def std(el_,k):
    fts=[]
    for i in range(k):
        ft=[]
        sample=bootstrap(el_)
        for s in sample:
            ft.append(fraction_triangles(s, set(s['u1'].to_list()+s['u2'].to_list())))
        fts.append(np.mean(ft))
    return np.std(fts)

In [42]:
std(el_agg.groupby(['u1_class']),100) #bootstrap std

0.0004598694355824293
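
A simpler variant of the bootstrap resamples the nine class-level fractions themselves instead of the edges within each class. The sketch below uses the values printed above; the seed and the number of replications `k` are arbitrary choices. Because it treats classes, not edges, as the sampling unit, it answers a slightly different question and need not reproduce the number above.

```python
import numpy as np

# Within-class triangle fractions copied from the output above.
class_fracs = np.array([0.012529, 0.033065, 0.032591, 0.028188, 0.014347,
                        0.026079, 0.020311, 0.016988, 0.009897])

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
k = 2000                        # number of bootstrap samples

# Draw k samples of 9 classes with replacement and store each sample mean.
boot_means = np.array([
    rng.choice(class_fracs, size=len(class_fracs), replace=True).mean()
    for _ in range(k)
])

# The spread of the bootstrap means estimates the standard error of the mean.
boot_se = boot_means.std(ddof=1)
print(class_fracs.mean(), boot_se)
```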

Recall from class that we can define the following measures of homophily. We define the **homophily index**, inspired by [Currarini et al. (2009)](https://doi.org/10.2139/ssrn.1021650), as:
- share of edges that are same type: $H = \frac{s}{s+d}$, where $s$ and $d$ are the numbers of same-type and different-type edges
- possible range: $[0,1]$


We define **baseline homophily** as: 
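- The fraction of potential edges in the population of nodes that connect nodes of the same type, where $n_t$ is the number of nodes of type $t$ and $n$ is the total number of nodes: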

\begin{equation}B=\frac{\sum_t\#potential(n_t)}{\#potential(n)}, \qquad \#potential(k)=\frac{k\cdot(k-1)}{2}\end{equation}

- Interpretation: Expected homophily from random link formation.     

We define **inbreeding homophily** as:      

\begin{equation}IH=\frac{H-B}{1-B}\end{equation}
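
The three definitions can be sketched on made-up numbers. The helper names `potential` and `homophily_measures`, and the example counts (15 same-type edges, 5 different-type edges, 6 and 4 nodes of the two types), are illustrative assumptions and not taken from the data.

```python
def potential(k):
    # number of potential edges among k nodes
    return k * (k - 1) / 2

def homophily_measures(same, diff, type_counts):
    # H: share of observed edges that connect same-type nodes
    H = same / (same + diff)
    # B: share of potential edges that connect same-type nodes
    n = sum(type_counts)
    B = sum(potential(n_t) for n_t in type_counts) / potential(n)
    # IH: homophily in excess of the baseline, rescaled by its maximum 1 - B
    IH = (H - B) / (1 - B)
    return H, B, IH

H, B, IH = homophily_measures(same=15, diff=5, type_counts=[6, 4])
print(H, B, IH)
```

With these numbers $H = 0.75$, $B = 21/45$ and $IH = 0.53125$: same-type edges are more common than random link formation would predict.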


> **Ex. 13.1.5**: Compute the inbreeding homophily for each class. Use the class measures to compute the mean. Use a bootstrap to compute whether there is inbreeding homophily.

In [44]:
def inbreeding_homophily(h,b):
    # IH = (H - B) / (1 - B)
    return (h-b)/(1-b)
def baseline_homophily(k,ks):
    # k: number of nodes, ks: number of nodes of each type;
    # the factor 1/2 in #potential cancels between numerator and denominator
    return sum([i*(i-1) for i in ks])/(k*(k-1))
def class_gender(el): #number of nodes of each gender; `el` holds row labels of `el_agg`
    el_i=el_agg.loc[el,:]
    people=np.unique(np.concatenate([el_i.u1.unique(),el_i.u2.unique()]))
    gender=[ind[ind['u']==n]['gender'].to_list() for n in people]
    m,f=gender.count(['M']),gender.count(['F'])
    return [m,f]
def homophily(el): #share of edges between same-gender nodes; `el` holds row labels of `el_agg`
    el_i=el_agg.loc[el,:]
    return len(el_i[el_i['u1_gender']==el_i['u2_gender']])/(len(el_i))

In [45]:
el_k=el_agg.groupby(['u1_class']).apply(lambda s:nodes(s)) #number of people in each class
el_k_index=el_agg.groupby(['u1_class']).apply(lambda s:s.index) #row labels of each class's edges

In [47]:
#average inbreeding homophily
base=[]
homo=[]
for i in range(len(el_k)):
    ks=class_gender(el_k_index.iloc[i]) #number of male and female
    base.append(baseline_homophily(el_k.iloc[i],ks))
    homo.append(homophily(el_k_index.iloc[i]))
i_h=[]
for i in range(len(base)):
    i_h.append(inbreeding_homophily(homo[i],base[i])) #within class
np.mean(i_h)

0.11533737795720495

In [90]:
el_agg.groupby(['u1_class']).apply(lambda s:nodes(s))

u1_class
2BIO1    35
2BIO2    32
2BIO3    40
MP       29
MP*1     28
MP*2     38
PC       44
PC*      37
PSI*     33
dtype: int64

In [112]:
mean=[]
for _ in range(10): #number of bootstrap replications
    sample=bootstrap(el_agg.groupby(['u1_class']))
    base=[]
    homo=[]
    i_h=[]
    for s in sample:
        el_k=nodes(s)
        el_k_index=s.index
        ks=class_gender(list(el_k_index))
        baseline=baseline_homophily(el_k,ks)
        base.append(baseline)
        homo.append(homophily(el_k_index))
    for h,b in zip(homo,base):
        i_h.append(inbreeding_homophily(h,b))
    mean.append(np.mean(i_h))

In [113]:
np.std(mean) #bootstrap standard error of the mean

0.04903725838399845
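
One rough way to read the two numbers above is to compare the mean inbreeding homophily with its bootstrap standard error. The sketch below assumes a normal approximation and a two-sided 5% level (critical value 1.96); with only 10 bootstrap replications the standard error itself is noisy, so the conclusion is tentative.

```python
mean_ih = 0.11533737795720495  # mean inbreeding homophily reported above
se_ih = 0.04903725838399845    # bootstrap standard deviation reported above

# Ratio of the estimate to its standard error, testing IH = 0.
z = mean_ih / se_ih
reject_at_5pct = abs(z) > 1.96
print(round(z, 2), reject_at_5pct)
```

Under these assumptions the ratio is about 2.35, which points towards positive inbreeding homophily by gender within classes.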

> **Ex. 13.1.6** (BONUS): Describe what an unsupported edge is. Construct a test of whether there is a stronger preference for forming triangles within the same gender than across genders.
>
>> *Hint:*  You can find inspiration in the approach of [Chandrasekhar, Jackson (2018)](https://web.stanford.edu/~arungc/CJ_sugm.pdf) pp. 31-35. They construct an almost identical test for triangle formation across castes in Indian villages.

In [None]:
# [Answer to ex. 13.1.6 here]
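
As a starting point for the first half of the exercise: an edge is usually called unsupported when its two endpoints have no common neighbor, so the edge is not part of any triangle. The sketch below lists such edges in a toy graph; the function name `unsupported_edges` is hypothetical, and the sketch does not implement the test for gender-specific triangle formation.

```python
import networkx as nx

def unsupported_edges(g):
    # Keep the edges whose endpoints share no neighbor (no triangle contains them).
    out = []
    for u, v in g.edges():
        common = (set(g[u]) & set(g[v])) - {u, v}
        if not common:
            out.append((u, v))
    return out

# Toy graph: triangle 0-1-2 plus the pendant edge 2-3.
g = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])
print(unsupported_edges(g))
```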
