# Exercise Set 13: Network formation


In Exercise Set 13 we investigate network formation among high school pupils.

## Part 1: Network formation


Load the data using the script below. Read a bit about the dataset [here](http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/) to get an understanding of what is in each variable. 

The script gives you two dataframes to work with: 
 > `el`, which is an edge-list 
 >
 > `ind` which contains individual characteristics

In [253]:
import warnings
warnings.filterwarnings("ignore")

In [103]:
import networkx as nx
import numpy as np
import pandas as pd

url_base = 'http://www.sociopatterns.org/wp-content/uploads/2015/'

# edgelist
url_el = url_base + '07/High-School_data_2013.csv.gz'
col_names_el = ['timestamp', 'u1', 'u2', 'class1', 'class2']
el = pd.read_csv(url_el, header=None, names=col_names_el, delimiter=' ')

# individual characteristics
url_ind = url_base + '09/metadata_2013.txt'
col_names_ind = ['u', 'class', 'gender']
ind = pd.read_csv(url_ind, header=None, names=col_names_ind, delimiter='\t')\
            .set_index('u')

# remove observation with missing gender
has_gender = ind[ind.gender!='Unknown'].index

# DataFrames
ind = ind.loc[has_gender].copy()
el = el[el.u1.isin(has_gender) &  el.u2.isin(has_gender)].copy()

> **Ex. 13.1.1**: Describe the content of the edgelist columns. Parse the timestamp. What is the resolution of meetings? Use the parsed timestamp to count the meetings by hour in local time.

The columns of the edgelist dataframe: 

`timestamp` is the time of the meeting. It is given in epoch (Unix) time, i.e. seconds since 1970-01-01 00:00 UTC, which pandas converts to datetimes directly. Contacts are recorded at a resolution of 20 seconds.

`u1` and `u2` are the IDs of the two persons meeting.

`class1` and `class2` are their respective classes.

In [104]:
el['timestamp'] = pd.to_datetime(el['timestamp'],unit='s')
el = el.set_index(el['timestamp'])
el['timestamp'].resample('H', kind='period').count()

timestamp
2013-12-02 11:00    5556
2013-12-02 12:00    4259
2013-12-02 13:00    6617
2013-12-02 14:00    5715
2013-12-02 15:00    5972
                    ... 
2013-12-06 11:00    4106
2013-12-06 12:00    3247
2013-12-06 13:00    1785
2013-12-06 14:00    2026
2013-12-06 15:00    1352
Freq: H, Name: timestamp, Length: 101, dtype: int64
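
The hourly counts above are computed on the parsed timestamps without a time zone. A minimal sketch of how the resolution and the local-time hours could be obtained, on a few synthetic epoch values rather than the real data. The time zone `Europe/Paris` is an assumption (the data were collected at a French school), and `ts`, `utc`, `local`, `resolution` and `meetings_by_hour` are illustrative names.

```python
import pandas as pd

# Synthetic epoch timestamps (seconds), 20 seconds apart, plus one an hour later
ts = pd.Series([1385982020, 1385982040, 1385982040, 1385985620])

# Parse as UTC, then convert to the assumed local time zone
utc = pd.to_datetime(ts, unit='s', utc=True)
local = utc.dt.tz_convert('Europe/Paris')  # assumption: school is in France

# Resolution: smallest gap between distinct consecutive timestamps
resolution = utc.drop_duplicates().sort_values().diff().min()

# Number of meetings by local hour of the day
meetings_by_hour = local.dt.hour.value_counts().sort_index()

print(resolution)
print(meetings_by_hour)
```

On the real edgelist the same steps would be applied to `el['timestamp']` before it is overwritten by the parsed values.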

> **Ex. 13.1.2**: Count the number of meetings for each edge and save this as a DataFrame called `el_agg`. Filter out edges with less than 5 minutes of meetings. Attach the gender and class of both nodes.

In [117]:
el_agg = pd.DataFrame(el.groupby(['u1','u2']).size()).reset_index()
el_agg = el_agg.merge(ind, left_on='u1', right_on='u').rename(columns={'class':'class1','gender':'gender1'})
el_agg = el_agg.merge(ind, left_on='u2', right_on='u').rename(columns={'class':'class2','gender':'gender2'})
el_agg = el_agg.rename(columns={0:'meet_count'})
print('n =',el_agg.shape[0])
el_agg.head()

n = 5583


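|    | u1 | u2 | meet_count | class1 | gender1 | class2 | gender2 |
|---:|---:|---:|-----------:|:-------|:--------|:-------|:--------|
| 0 | 1 | 55 | 8 | 2BIO3 | M | 2BIO3 | F |
| 1 | 1 | 63 | 2 | 2BIO3 | M | 2BIO3 | F |
| 2 | 3 | 63 | 2 | 2BIO2 | M | 2BIO3 | F |
| 3 | 27 | 63 | 19 | 2BIO2 | M | 2BIO3 | F |
| 4 | 39 | 63 | 1 | 2BIO3 | F | 2BIO3 | F |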
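
The aggregation above keeps every edge. A minimal sketch of the five-minute filter asked for in the exercise, assuming each row of the raw edgelist represents one 20-second contact interval, so that 5 minutes corresponds to 15 rows. The toy frame and the names `min_count` and `toy_filtered` are illustrative only.

```python
import pandas as pd

RESOLUTION_SECONDS = 20          # assumption: one row = one 20-second contact
MIN_SECONDS = 5 * 60             # threshold from the exercise: 5 minutes
min_count = MIN_SECONDS // RESOLUTION_SECONDS   # 15 rows

# Toy aggregated edgelist with the same column names as `el_agg`
toy = pd.DataFrame({'u1': [1, 1, 3],
                    'u2': [55, 63, 63],
                    'meet_count': [8, 19, 15]})

# Keep only edges with at least 5 minutes of recorded contact
toy_filtered = toy[toy['meet_count'] >= min_count]
print(toy_filtered)
```

Applied to `el_agg`, the same boolean mask on `meet_count` would reduce the number of edges, so the counts reported in the rest of this notebook would change.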


> **Ex. 13.1.3**: Answer the questions in the function `fraction_triangles` below. Explain how `fraction_triangles` is related to computing the clustering coefficient (using `nx.average_clustering`).
>
>> *Hint:* The following code does the same thing as `fraction_triangles`, but at a scale where you can understand what's going on. If you have a hard time understanding the code in the function, you can try to play around with this simpler example.
>>
>> ```python
>> import networkx as nx
>> import numpy as np
>> from scipy.special import binom
>>
>> A  = np.array(
>>     [[0, 1, 1, 0],
>>      [1, 0, 1, 0],
>>      [1, 1, 0, 1],
>>      [0, 0, 1, 0]]
>> )
>>
>> G = nx.from_numpy_array(A)
>> nx.draw(G,with_labels=True)
>>
>> def nth(A, n):
>>     A_ = A.copy()    
>>     for _ in range(1,n):
>>         A = A.dot(A_)
>>     return A
>>
>> a_t = nth(A,3).diagonal().sum()/6
>> n = len(A[:,0])
>> p_t = binom(n, 3)
>> ```


In [115]:
def make_net(el_, nodes):
    '''
    Convert edgelist to networkx graph which is 
    binary and undirected.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''    
    
    nx_input = el_, 'u1', 'u2', 'meet_count', nx.Graph()
    g = nx.from_pandas_edgelist(*nx_input)
    g.add_nodes_from(nodes)
    return g

In [116]:
from scipy.special import binom

def fraction_triangles(el_, nodes):
    '''
    Compute fraction of actual triangles out 
    of the potential triangles.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''
    
    g = make_net(el_, nodes)
    
    # Q.1: what is `A`? The adjacency matrix, which is symmetric and binary.
    # Q.2: what does `A**3` do? It is the matrix power A*A*A; entry (i, j)
    # counts the walks of length 3 from node i to node j.
    # Q.3: what is the diagonal of A_t? Entry (i, i) counts the closed walks
    # of length 3 that start and end at node i, i.e. the triangles through i
    # (each traversed in both directions).
    
    # count actual triangles; every triangle appears 6 times on the diagonal
    # (3 starting nodes x 2 directions), hence the division by 6
    A = nx.to_scipy_sparse_matrix(g)
    A_t = A**3
    a_t = A_t.diagonal().sum()/6
    
    # Q.4: what does `binom(n,3)` compute? The number of distinct node
    # triples, i.e. the number of potential triangles in the network.
    
    # count potential triangles
    n = len(g.nodes())
    p_t = binom(n, 3)
        
    return a_t/p_t

***Answers:***

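The answers were already written in the comments of the function. Here is what the function computes, in my own words:

*The function builds a network from the data and computes its adjacency matrix `A`. The network is simple and undirected, so `A` is symmetric and contains only `0` and `1`. The function then computes `A_t = A**3`, whose entry (i, j) counts the walks of length 3 from node i to node j; a diagonal entry counts the closed walks of length 3 through that node. Summed over the diagonal, each triangle is counted six times (three starting nodes, two directions), so `a_t` is the number of actual triangles. `p_t = binom(n, 3)` is the binomial coefficient "n choose 3", i.e. the number of node triples that could form a triangle. The function returns the number of actual triangles divided by the number of potential triangles.*

*The clustering coefficient also measures how many triangles are closed, but with a different denominator: `nx.average_clustering` averages, over nodes, the share of pairs of a node's neighbours that are themselves linked, whereas `fraction_triangles` divides by all node triples, whether or not they are connected.*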
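
To make the difference in denominators concrete, here is a small sketch on the four-node graph from the hint. It uses a dense NumPy matrix instead of the sparse matrix in `fraction_triangles`; the variable names are illustrative.

```python
import networkx as nx
import numpy as np
from scipy.special import binom

# Four-node example from the hint: one triangle (0, 1, 2) plus a pendant node 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
G = nx.from_numpy_array(A)

# Closed walks of length 3 = trace of the third matrix power
closed_walks = np.trace(np.linalg.matrix_power(A, 3))
n_triangles = closed_walks / 6           # each triangle is counted 6 times
n_triples = binom(len(A), 3)             # all possible node triples

frac_triangles = n_triangles / n_triples  # denominator: all triples
transitivity = nx.transitivity(G)         # denominator: connected triples
avg_clustering = nx.average_clustering(G) # mean of local clustering

print(frac_triangles, transitivity, avg_clustering)
```

Here the single triangle is one of four possible triples, while three of the five connected triples are closed, so the two measures differ even on the same graph.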
 

> **Ex. 13.1.4**: Apply the function `fraction_triangles` to `el_agg` and print the triangle fraction in the network. Next, remove all edges that go between classes. Compute the triangle fraction within each class and store it. Compute the mean within-class triangle fraction and bootstrap the standard error of the mean. Comment on the output.
>
>> *Hint:* To bootstrap an estimate, draw $k \gg 0$ samples with replacement from the data. Compute the estimate on each of these samples and average them in the end to get the bootstrapped estimate. 

In [131]:
fraction_triangles(el_agg,el_agg['u1'])

0.005962397231915774

In [136]:
el_agg['same_class'] = el_agg['class1'].eq(el_agg['class2'])
el_agg = el_agg[el_agg['same_class']]  # keep only within-class edges
print('n without edges between classes =',el_agg.shape[0])

n without edges between classes = 3873


In [197]:
class_values = {}
for class_name in el_agg['class1'].unique():
    df = el_agg.loc[el_agg['class1'] == class_name]
    nodes = df['u1']
    class_values.update( {'{}'.format(class_name) : fraction_triangles(df,nodes)} )   
class_t = pd.DataFrame(pd.Series(class_values)).rename(columns={0:'{}'.format(fraction_triangles.__name__)})
class_t

|        | fraction_triangles |
|:-------|-------------------:|
| 2BIO3 | 0.465385 |
| MP | 0.507115 |
| MP\*2 | 0.379445 |
| 2BIO2 | 0.401943 |
| 2BIO1 | 0.317647 |
| PC\* | 0.422909 |
| PSI\* | 0.274733 |
| PC | 0.449713 |
| MP\*1 | 0.179256 |


In [194]:
def bootstrap(df,pd_function):
    l = []
    n = int(df.shape[0]/3)
    for i in range(100):
        if pd_function == 'mean':
            l.append(df.sample(n=n, replace=True).mean())
        if pd_function == 'sem':
            l.append(df.sample(n=n, replace=True).sem())
    return sum(l)/len(l)

In [195]:
print('Bootstrapped mean: %.3f' %bootstrap(class_t, 'mean'))
print('Bootstrapped st.error of mean: %.3f' %bootstrap(class_t,'sem'))

Bootstrapped mean: 0.366
Bootstrapped st.error of mean: 0.050
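
A common alternative formulation of the bootstrap draws as many observations as the data contain and takes the standard deviation of the resampled means as the standard error. The following is a minimal sketch of that variant using the nine class values reported above. It is not the `bootstrap` function used in this notebook, and `rng`, `n_boot`, `boot_means`, `boot_mean` and `boot_se` are illustrative names.

```python
import numpy as np

# Within-class triangle fractions as reported in the table above
values = np.array([0.465385, 0.507115, 0.379445, 0.401943, 0.317647,
                   0.422909, 0.274733, 0.449713, 0.179256])

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n_boot = 1000                    # number of bootstrap samples

# Resample the classes with replacement and record the mean of each sample
boot_means = np.array([
    rng.choice(values, size=len(values), replace=True).mean()
    for _ in range(n_boot)
])

boot_mean = boot_means.mean()       # bootstrapped estimate of the mean
boot_se = boot_means.std(ddof=1)    # bootstrapped standard error of the mean

print('Bootstrapped mean: %.3f' % boot_mean)
print('Bootstrapped st.error of mean: %.3f' % boot_se)
```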


Recall from class that we can define the following measures of homophily. We define the **homophily index**, inspired by [Currarini et al. (2009)](https://doi.org/10.2139/ssrn.1021650), as:
- the share of edges that connect nodes of the same type: $H = \frac{s}{s+d}$, where $s$ is the number of same-type edges and $d$ the number of different-type edges
- possible range: $[0,1]$


We define **baseline homophily** as: 
- the fraction of potential edges in the population of nodes that connect nodes of the same type, where $n_t$ is the number of nodes of type $t$ and $n$ is the total number of nodes:

\begin{equation}B=\frac{\sum_t\#potential(n_t)}{\#potential(n)}, \qquad \#potential(k)=\frac{k\cdot(k-1)}{2}\end{equation}

- Interpretation: expected homophily under random link formation.

We define **inbreeding homophily** as:      

\begin{equation}IH=\frac{H-B}{1-B}\end{equation}
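
The three definitions can be traced on a small example. The sketch below uses a made-up network of five nodes (three of type `M`, two of type `F`) and four edges; the function `homophily_measures` and all data in it are hypothetical and only illustrate the formulas above.

```python
import pandas as pd

def potential(k):
    """Number of potential edges among k nodes."""
    return k * (k - 1) / 2

def homophily_measures(edges, node_type):
    """Return (H, B, IH) for an edge list and a node -> type Series."""
    # H: share of edges whose endpoints have the same type
    type1 = node_type.loc[edges['u1']].to_numpy()
    type2 = node_type.loc[edges['u2']].to_numpy()
    H = (type1 == type2).mean()

    # B: same-type potential edges over all potential edges (node counts)
    counts = node_type.value_counts()
    B = potential(counts).sum() / potential(len(node_type))

    return H, B, (H - B) / (1 - B)

# Toy data: 5 nodes, 4 edges, 3 of them within the same type
node_type = pd.Series({1: 'M', 2: 'M', 3: 'M', 4: 'F', 5: 'F'})
edges = pd.DataFrame({'u1': [1, 1, 4, 3], 'u2': [2, 3, 5, 4]})

H, B, IH = homophily_measures(edges, node_type)
print(H, B, IH)
```

In this example $s=3$ and $d=1$, so $H=0.75$; the node counts give $B=(3+1)/10=0.4$, and therefore $IH=0.35/0.6\approx 0.58$. Note that $B$ is built from the number of nodes of each type, not from the number of edges.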


> **Ex. 13.1.5**: Compute the inbreeding homophily for each class. Use the class measures to compute the mean. Use a bootstrap to test whether there is inbreeding homophily.

In [228]:
def inbreed_homophily(df, type1, type2):
    '''
    df : the dataframe
    type1 : type of node 1 in edge
    type2 : type of node 2 in edge
    '''
    df['same_type'] = type1.eq(type2)
    s = df['same_type'][df['same_type'] == True].count()
    d = df['same_type'][df['same_type'] == False].count()
    
    H = s / (s + d)
    
    def potential(k): return (k*(k-1)) / 2
    
    B = potential(s) / potential(df.shape[0])
    
    return (H - B) / (1 - B)

In [254]:
class_values = {}
for class_name in el_agg['class1'].unique():
    df = el_agg.loc[el_agg['class1'] == class_name]
    class_values.update( {'{}'.format(class_name) : inbreed_homophily(df,df['gender1'], df['gender2'])})   
class_t = pd.DataFrame(pd.Series(class_values)).rename(columns={0:'{}'.format(inbreed_homophily.__name__)})

print(class_t)
print('\n')
print('Bootstrapped mean: %.3f' %bootstrap(class_t,'mean'))
print('Bootstrapped st.error of mean: %.3f' %bootstrap(class_t,'mean'))

       inbreed_homophily
2BIO3           0.406814
MP              0.354772
MP*2            0.418380
2BIO2           0.349291
2BIO1           0.395173
PC*             0.360265
PSI*            0.377049
PC              0.348412
MP*1            0.403409


Bootstrapped mean: 0.377
Bootstrapped st.error of mean: 0.382


> **Ex. 13.1.6** (BONUS): Describe what an unsupported edge is. Construct a test of whether there is a stronger preference for forming triangles within the same gender than across genders.
>
>> *Hint:*  You can find inspiration in the approach of [Chandrasekhar, Jackson (2018)](https://web.stanford.edu/~arungc/CJ_sugm.pdf) pp. 31-35. They construct an almost identical test for triangle formation across castes in Indian villages.

In [None]:
# [Answer to ex. 13.1.6 here]
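
As a starting point for the first part of the exercise: an edge is usually called unsupported when its two endpoints have no neighbour in common, so that it is not part of any triangle. The sketch below finds such edges on a toy graph. `unsupported_edges` is a hypothetical helper, not a networkx function, and it does not implement the test asked for in the second part.

```python
import networkx as nx

# Toy graph: triangle (0, 1, 2) plus the pendant edge (2, 3)
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

def unsupported_edges(graph):
    """Return the edges whose endpoints share no common neighbour."""
    return [(u, v) for u, v in graph.edges()
            if not (set(graph[u]) & set(graph[v]))]

print(unsupported_edges(G))
```

On this graph only the edge between nodes 2 and 3 is unsupported, since the other three edges all belong to the triangle.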