# Exercise Set 13: Network formation


In this Exercise Set 13 we will investigate network formation among high school pupils. 

## Part 1: Network formation


Load the data using the script below. Read a bit about the dataset [here](http://www.sociopatterns.org/datasets/high-school-contact-and-friendship-networks/) to get an understanding of what is in each variable. 

The script gives you two dataframes to work with: 
 > `el`, which is an edge list
 >
 > `ind`, which contains individual characteristics

In [686]:
import networkx as nx
import numpy as np
import pandas as pd

url_base = 'http://www.sociopatterns.org/wp-content/uploads/2015/'

# edgelist
url_el = url_base + '07/High-School_data_2013.csv.gz'
col_names_el = ['timestamp', 'u1', 'u2', 'class1', 'class2']
el = pd.read_csv(url_el, header=None, names=col_names_el, delimiter=' ')

# individual characteristics
url_ind = url_base + '09/metadata_2013.txt'
col_names_ind = ['u', 'class', 'gender']
ind = pd.read_csv(url_ind, header=None, names=col_names_ind, delimiter='\t')\
            .set_index('u')

# remove observations with missing gender
has_gender = ind[ind.gender!='Unknown'].index

# DataFrames
ind = ind.loc[has_gender].copy()
el = el[el.u1.isin(has_gender) &  el.u2.isin(has_gender)].copy()

> **Ex. 13.1.1**: Describe the content of the edgelist columns. Parse the timestamp. What is the resolution of meetings? Use the parsed timestamp to count the meetings by hour in local time.

The columns of the edgelist `el` contain:
* timestamp: time at which the contact was recorded
* u1: ID of the first person
* u2: ID of the second person
* class1: class of the first person
* class2: class of the second person

Printing `el["timestamp"].unique()` shows that consecutive timestamps are 20 seconds apart, so the resolution of the meetings is 20 seconds, which matches the description of the dataset on the website.

In [687]:
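el["timestamp"] = pd.to_datetime(el["timestamp"],unit="s") # seconds since the Unix epoch (the default origin)
# origin="julian" is not an option here, since it requires unit="D".
# Note: the parsed timestamps, and hence the hours below, are in UTC, not local time.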

el["hours"] = el["timestamp"].dt.hour
el["days"]  = el["timestamp"].dt.day

In [688]:
el[["u1","u2","hours","days"]].groupby(["days","hours"]).count()

days,hours,u1,u2
2,11,5556,5556
2,12,4259,4259
2,13,6617,6617
2,14,5715,5715
2,15,5972,5972
3,7,6048,6048
3,8,5286,5286
3,9,7104,7104
3,10,5096,5096
3,11,4675,4675
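
The exercise asks for counts in local time and for the resolution of the meetings. The following is a minimal sketch of both steps on a few illustrative timestamps. It assumes that the raw values are Unix seconds in UTC and that the school's local time zone is `Europe/Paris`; the toy series `ts` and the other variable names are hypothetical and not part of the original script.

```python
import pandas as pd

# Illustrative Unix timestamps in seconds (assumption: same format as el["timestamp"])
ts = pd.Series([1385982020, 1385982020, 1385982040, 1385985620])

# Resolution: smallest gap between distinct, sorted timestamps
resolution = ts.drop_duplicates().sort_values().diff().min()
print(f"resolution in seconds: {resolution}")

# Parse as UTC, then convert to local time (assumed to be Europe/Paris)
parsed_utc = pd.to_datetime(ts, unit="s", utc=True)
parsed_local = parsed_utc.dt.tz_convert("Europe/Paris")

# Count records by local hour
counts = parsed_local.dt.hour.value_counts().sort_index()
print(counts)
```

On the real data, the same two calls (`pd.to_datetime(..., utc=True)` followed by `.dt.tz_convert(...)`) could replace the naive parsing before the `groupby`.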


> **Ex. 13.1.2**: Count the number of meetings for each edge and save this as a DataFrame called `el_agg`. Filter out edges with less than 5 minutes of meetings. Attach the gender and class of both nodes.

In [689]:
el.head()

,timestamp,u1,u2,class1,class2,hours,days
0,2013-12-02 11:00:20,454,640,MP,MP,11,2
1,2013-12-02 11:00:20,1,939,2BIO3,2BIO3,11,2
2,2013-12-02 11:00:20,185,258,PC*,PC*,11,2
3,2013-12-02 11:00:20,55,170,2BIO3,2BIO3,11,2
4,2013-12-02 11:00:20,9,453,PC,PC,11,2


In [690]:
el_agg = el.groupby(["u1","u2"])["hours"].count().reset_index()
# Any column can be used for the count, not just "hours"; reset_index() turns u1 and u2 back into columns.
el_agg = el_agg[el_agg["hours"]>=5*(60/20)].rename(columns = {"hours":"meet_count"})
# Each record covers 20 seconds, so 5 minutes correspond to 5*(60/20) = 15 records; pairs with fewer are dropped.
# The count column is renamed to "meet_count" because make_net (used by fraction_triangles) expects that name.

el_agg.head()

,u1,u2,meet_count
4,1,117,18
7,1,196,38
10,1,205,47
13,1,494,123
21,1,939,85


In [691]:
ind.head()

u,class,gender
650,2BIO1,F
498,2BIO1,F
627,2BIO1,F
857,2BIO1,F
487,2BIO1,F


Fortunately, the index of the `ind` frame is the student ID, so we can look up each student's characteristics with `loc`.

In [692]:
el_agg["class1"]  = el_agg["u1"].apply(lambda x: ind.loc[x]["class"])
el_agg["class2"]  = el_agg["u2"].apply(lambda x: ind.loc[x]["class"])
el_agg["gender1"] = el_agg["u1"].apply(lambda x: ind.loc[x]["gender"])
el_agg["gender2"] = el_agg["u2"].apply(lambda x: ind.loc[x]["gender"])  # look up u2, not u1

In [693]:
el_agg.shape

(1375, 7)
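
The row-wise `apply` with `loc` can also be written as a vectorized lookup with `Series.map`, which makes it harder to mix up `u1` and `u2`. The sketch below uses two small hypothetical frames, `ind_toy` and `el_toy`, that only mimic the structure of `ind` and `el_agg`.

```python
import pandas as pd

# Toy stand-ins for `ind` (indexed by student ID) and `el_agg` (hypothetical data)
ind_toy = pd.DataFrame(
    {"class": ["MP", "MP", "PC"], "gender": ["F", "M", "F"]},
    index=pd.Index([1, 2, 3], name="u"),
)
el_toy = pd.DataFrame({"u1": [1, 1], "u2": [2, 3], "meet_count": [20, 15]})

# For each side of the edge, map the node ID to the node's class and gender
for side in ("1", "2"):
    for col in ("class", "gender"):
        el_toy[col + side] = el_toy["u" + side].map(ind_toy[col])

print(el_toy)
```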

In [694]:
# [Answer to ex. 13.1.2 here]

> **Ex. 13.1.3**: Answer question in the function `fraction_triangles` below. Explain how `fraction_triangles` is related to  computing the clustering coefficient (using `nx.average_clustering`).
>
>> *Hint:* The following code does the same thing as `fraction_triangles`, but at a scale where you can understand what's going on. If you have a hard time understanding the code in the function you can try to play around with this simpler example
>>
>> ```python
>> import networkx as nx
>> import numpy as np
>> from scipy.special import binom
>>
>> A  = np.array(
>>     [[0, 1, 1, 0],
>>      [1, 0, 1, 0],
>>      [1, 1, 0, 1],
>>      [0, 0, 1, 0]]
>> )
>>
>> G = nx.from_numpy_array(A)
>> nx.draw(G,with_labels=True)
>>
>> def nth(A, n):
>>     A_ = A.copy()    
>>     for _ in range(1,n):
>>         A = A.dot(A_)
>>     return A
>>
>> a_t = nth(A,3).diagonal().sum()/6
>> n = len(A[:,0])
>> p_t = binom(n, 3)
>> ```


In [857]:
def make_net(el_, nodes):
    '''
    Convert edgelist to networkx graph which is 
    binary and undirected.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''    
    
    nx_input = el_, 'u1', 'u2', 'meet_count', nx.Graph()
    g = nx.from_pandas_edgelist(*nx_input)
    #print("make_net ",g.nodes())
    g.add_nodes_from(nodes)
    return g

In [856]:
from scipy.special import binom

def fraction_triangles(el_, nodes):
    '''
    Compute fraction of actual triangles out 
    of the potential triangles.
    
    Parameters
    ----------
    el_ : DataFrame
        Table containing an edgelist with columns 
        `u1` and `u2` which are the nodes in the edge.
        
    nodes : array-like
        1d array containing the node identities.
    '''
    
    g = make_net(el_, nodes)
    #print("fraction_triangles : ",nodes)
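    #Q.1: what is `A`? The adjacency matrix of the network, which is symmetric and binary.
    #Q.2: what does `A**3` do? It computes the matrix power A*A*A; entry (i,j) is the
    # number of walks of length 3 from node i to node j.
    #Q.3: what is the diagonal of A_t? Entry (i,i) is the number of closed walks of length 3
    # that start and end at node i. Each triangle is counted 6 times in the diagonal sum
    # (3 starting nodes x 2 directions), hence the division by 6.
    
    # count actual triangles
    # (requires NetworkX < 3.0; newer versions provide nx.to_scipy_sparse_array,
    #  where A @ A @ A is needed because `**` is element-wise for sparse arrays)
    A = nx.to_scipy_sparse_matrix(g)
    A_t = A**3
    a_t = A_t.diagonal().sum()/6
    
    #Q.4: what does `binom(n,3)` compute? The number of ways to choose 3 nodes out of n,
    # i.e. the number of potential triangles in the network.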
    
    # count potential triangles
    n = len(g.nodes())
    p_t = binom(n, 3)
        
    return a_t/p_t

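**The questions in the code are answered in the comments. `fraction_triangles` computes the number of actual triangles as a fraction of the potential triangles, i.e. of all $\binom{n}{3}$ triples of nodes. This is related, but not identical, to the clustering coefficient: both have the number of triangles in the numerator, but the clustering coefficient divides by the number of connected triples (paths of length 2) rather than by all triples of nodes. Note also that `nx.average_clustering` returns the average of the local (node-level) clustering coefficients; the global, triplet-based coefficient is `nx.transitivity`.**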
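
How the triangle fraction compares with the two clustering measures in NetworkX can be checked on the small graph from the hint. This is a minimal sketch; it uses a dense matrix power (`np.linalg.matrix_power`) instead of the sparse matrix in `fraction_triangles`, and the variable names are chosen for illustration only.

```python
import networkx as nx
import numpy as np
from scipy.special import binom

# Toy graph from the hint: one triangle (0, 1, 2) plus a pendant node 3
A = np.array(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
)
G = nx.from_numpy_array(A)

# Triangles: closed walks of length 3, each triangle counted 6 times
n_triangles = np.linalg.matrix_power(A, 3).trace() / 6

# Denominator 1: all triples of nodes (as in fraction_triangles)
frac_triangles = n_triangles / binom(len(A), 3)

# Denominator 2: connected triples (global clustering coefficient)
transitivity = nx.transitivity(G)

# Average of node-level clustering coefficients
avg_clustering = nx.average_clustering(G)

print(frac_triangles, transitivity, avg_clustering)
```

The three numbers differ on this graph (1/4, 3/5 and 7/12), which illustrates that the measures share the triangle count but normalize it differently.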

> **Ex. 13.1.4**: Apply the function `fraction_triangles` to `el_agg` and print the triangle fraction in the network. Next remove all edges that go between classes. Compute triangle fraction within each class and store it. Compute the mean within class triangles and bootstrap the standard error of the mean. Comment on the output.
>
>> *Hint:* To bootstrap an estimate, draw $k \gg 0$ samples with replacement from the data. Compute the estimate on each of these samples and average them in the end to get the bootstrapped estimate. 

**Part 1**

First, we need to construct the list of node IDs that the function expects as its second argument.

In [697]:
array1 = el_agg.loc[:,"u1"]
array2 = el_agg.loc[:,"u2"]
nodes  = np.unique(np.concatenate((array1,array2),axis=0)).tolist()

In [698]:
print(f"The triangle fraction in the network = {fraction_triangles(el_agg,nodes)}")

The triangle fraction in the network = 0.0003160278606511087


**Part 2**

In [753]:
el_agg_same = el_agg[el_agg["class1"] == el_agg["class2"]]
class1      = list(el_agg_same.loc[:,"class1"].unique())
class2      = list(el_agg_same.loc[:,"class2"].unique())

class_list = class1

In [892]:
def triangle_of_klasse(el_agg_):
    a1     = el_agg_.loc[:,"u1"]
    a2     = el_agg_.loc[:,"u2"]

    nodes_ = np.unique(np.concatenate((a1,a2),axis=0)).tolist()
    return fraction_triangles(el_agg_,nodes_)

In [893]:
values = []

for i in range(len(class_list)):
    result = triangle_of_klasse(el_agg_same[el_agg_same["class1"] == class_list[i]])
    print(f"fraction_triangles of {class_list[i]} = {result}")
    values.append(result)

fraction_triangles of 2BIO3 = 0.0347165991902834
fraction_triangles of 2BIO2 = 0.034274193548387094
fraction_triangles of PSI* = 0.010813782991202347
fraction_triangles of PC = 0.023406825732407127
fraction_triangles of PC* = 0.01981981981981982
fraction_triangles of MP*1 = 0.015873015873015872
fraction_triangles of MP = 0.02983032293377121
fraction_triangles of 2BIO1 = 0.013292589763177999
fraction_triangles of MP*2 = 0.02809388335704125


The class-level triangle fractions above are stored in `values`. We can now turn to the bootstrap part of the question.

In [824]:
def bootstrap_sample(el_agg_same_):
    '''
    Draw a bootstrap sample of edges (with replacement), half the size of the input.
    Note that the input has to be the el_agg where across-class edges have been excluded.
    '''
    df_boot = el_agg_same_.sample(n           = int(el_agg_same_.shape[0]/2),
                                 replace      = True)
    return df_boot

In [901]:
def bootstrap_mean_std(toboot,iterations):
    ones = np.ones(iterations)
    
    for i in range(iterations):
        boot    = bootstrap_sample(toboot)
        ones[i] = triangle_of_klasse(boot)
    
    return ones.mean(),ones.std()

In [925]:
for i in range(len(class_list)):
    df = el_agg_same[el_agg_same["class1"] == class_list[i]]
    result = bootstrap_mean_std(df,100)[1]
    print(f"std of {class_list[i]} = {result.round(4)}")

std of 2BIO3 = 0.0006
std of 2BIO2 = 0.0009
std of PSI* = 0.0008
std of PC = 0.0005
std of PC* = 0.0008
std of MP*1 = 0.0011
std of MP = 0.0012
std of 2BIO1 = 0.0007
std of MP*2 = 0.0008


**We cannot trust these standard errors, because the bootstrap means (printed below) are not consistent with the class values computed on the full data: they are roughly an order of magnitude smaller. The reason is intuitive. A triangle is only counted if all three of its edges are drawn, and almost every resample leaves out some edges, so the triangle fraction in a resample is systematically too low. The smaller the resample, the larger the bias.**

It is conceivable that the standard errors remain valid even though the mean is biased, but I do not know of such a result. Since the biased mean enters the formula for the variance, the standard errors are likely biased as well, which would make this bootstrap uninformative.

In [927]:
for i in range(len(class_list)):
    df = el_agg_same[el_agg_same["class1"] == class_list[i]]
    result = bootstrap_mean_std(df,100)[0]
    print(f"mean of {class_list[i]} = {result.round(4)}")

mean of 2BIO3 = 0.0024
mean of 2BIO2 = 0.0026
mean of PSI* = 0.001
mean of PC = 0.0017
mean of PC* = 0.0016
mean of MP*1 = 0.0016
mean of MP = 0.0024
mean of 2BIO1 = 0.0011
mean of MP*2 = 0.0021


A bootstrap scheme that might work would be to add links that do not exist in our data (possibly according to the connectedness of a given node), to compensate for the near-zero probability that every original link survives a resample.
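
A simpler alternative is to resample the classes rather than the edges: treat the nine class-level fractions printed above as the data, draw them with replacement, and use the spread of the resampled means as the standard error of the mean. The sketch below does this under that assumption; the names `class_values`, `k` and `boot_means` and the fixed seed are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Class-level triangle fractions, copied from the output above
class_values = np.array([
    0.0347165991902834, 0.034274193548387094, 0.010813782991202347,
    0.023406825732407127, 0.01981981981981982, 0.015873015873015872,
    0.02983032293377121, 0.013292589763177999, 0.02809388335704125,
])

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
k = 10_000                      # number of bootstrap replications

# Each row is one resample (with replacement) of the nine class values
resamples = rng.choice(class_values, size=(k, len(class_values)), replace=True)
boot_means = resamples.mean(axis=1)

print(f"mean of class fractions = {class_values.mean():.4f}")
print(f"bootstrapped std. error = {boot_means.std(ddof=1):.4f}")
```

Because every resample keeps the class-level networks intact, this version does not suffer from the missing-edge bias of the edge-level resampling, at the cost of relying on only nine observations.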

**Part 3: Comment**

**Regardless of the bootstrap detour, the class calculations show clear differences between classes: the triangle fraction ranges from about 0.011 (PSI\*) to about 0.035 (2BIO3).**

In [13]:
# [Answer to ex. 13.1.4 here]

Recall from class that we can define the following measures of homophily. We define the **homophily index**, inspired by [Currarini et al. (2009)](https://doi.org/10.2139/ssrn.1021650), as:
- the share of edges that connect nodes of the same type: $H = \frac{s}{s+d}$, where $s$ is the number of same-type edges and $d$ the number of different-type edges
- possible range: $[0,1]$


We define **baseline homophily** as: 
- the fraction of potential edges in the population of nodes that connect nodes of the same type, where $n_t$ is the number of nodes of type $t$ and $n$ the total number of nodes:

\begin{equation}B=\frac{\sum_t\#potential(n_t)}{\#potential(n)}, \qquad \#potential(k)=\frac{k\cdot(k-1)}{2}\end{equation}

- Interpretation: the homophily we would expect if links formed at random.     

We define **inbreeding homophily** as:      

\begin{equation}IH=\frac{H-B}{1-B}\end{equation}
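
The three definitions can be combined into one small helper. The following is a sketch under stated assumptions: the function `inbreeding_homophily`, its arguments and the toy data are hypothetical, and the edge table is assumed to carry the type of both endpoints in columns `type1` and `type2` (for the exercise these could be the gender columns).

```python
import pandas as pd

def n_potential(k):
    """Number of potential edges among k nodes: k*(k-1)/2."""
    return k * (k - 1) / 2

def inbreeding_homophily(edges, node_types):
    """
    edges      : DataFrame with columns `type1` and `type2` (one row per edge)
    node_types : Series with the type of every node in the population
    """
    # H: share of edges that connect nodes of the same type
    s = (edges["type1"] == edges["type2"]).sum()
    H = s / len(edges)
    # B: share of potential edges that are same-type
    n_t = node_types.value_counts()
    B = n_potential(n_t).sum() / n_potential(len(node_types))
    # IH: homophily in excess of the baseline, rescaled by 1 - B
    return (H - B) / (1 - B)

# Toy example: two F and two M nodes; two same-type edges, one mixed edge
node_types = pd.Series(["F", "F", "M", "M"])
edges = pd.DataFrame({"type1": ["F", "M", "F"], "type2": ["F", "M", "M"]})
ih = inbreeding_homophily(edges, node_types)
print(ih)
```

In the toy example $H = 2/3$ and $B = 2/6$, so $IH = (2/3 - 1/3)/(1 - 1/3) = 0.5$.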


> **Ex. 13.1.5**: Compute the inbreeding homophily for each class. Use the class measures to compute the mean. Use a bootstrap to compute whether there is inbreeding homophily.

In [15]:
# [Answer to ex. 13.1.5 here]

> **Ex. 13.1.6** (BONUS): Describe what an unsupported edge is. Construct a test of whether there is a preference for forming  triangles within same gender than across.
>
>> *Hint:*  You can find inspiration in the approach of [Chandrasekhar, Jackson (2018)](https://web.stanford.edu/~arungc/CJ_sugm.pdf) pp. 31-35. They construct an almost identical test for triangle formation across castes in Indian villages.

In [None]:
# [Answer to ex. 13.1.6 here]