*Note: In this workbook, we try to replicate the results from the classic paper "Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth", Goldenberg, Libai and Muller (2001). This is a self-didactic attempt.*

In [1]:
using Distributions
using MetaGraphs
using DataFrames
using GLM

In [2]:
srand(20130810)

MersenneTwister(UInt32[0x01332bfa], Base.dSFMT.DSFMT_state(Int32[-1772545288, 1073534108, 1077066014, 1072915095, -2146195133, 1072843413, 301764553, 1073404181, 750472136, 1073628106  …  -1491411563, 1073194977, 716119449, 1072893711, 1632331784, 758890923, 1433693833, -13012230, 382, 0]), [1.69558, 1.71546, 1.04527, 1.31036, 1.89552, 1.02364, 1.02657, 1.08113, 1.40426, 1.11619  …  1.01145, 1.26206, 1.83416, 1.11714, 1.57422, 1.77415, 1.06611, 1.98561, 1.84126, 1.25505], 382)

# 1. Introduction 

In [Talk of the Network](https://www0.gsb.columbia.edu/mygsb/faculty/research/pubfiles/3391/TalkofNetworks.pdf), the authors explore the pattern of personal communication between an individual's core group of friends (strong ties) and a wider set of acquaintances (weak ties). This study is one of the first in marketing to explore the influence of social networks on the diffusion of marketing messages. The key questions investigated in the context of information dissemination are:

- What matters more - strong ties or weak ties?
- What effect does the size of an average individual's network have?
- How does advertising interact with diffusion through weak ties and through strong ties?

In this workbook, we focus on the first question: which matters more for the speed of information dissemination in a network, strong ties or weak ties?

# 2. Initializing the network

Since this study employs a set of synthetic networks in which each node has a fixed number of strong ties ($s$) and weak ties ($w$), we use the `MetaGraph` type to build these networks. We initialize the network as an empty graph and then build the neighborhood of each node so that it has exactly $s$ strong ties and $w$ weak ties. The ties are stored as node properties (`:strong_ties` and `:weak_ties`) rather than as graph edges, which is why the graph is still reported as empty after initialization.

In [5]:
function initialize_network(n_nodes::Int, n_strong_ties::Int, n_weak_ties::Int)
    
    # Initialize an empty network
    
    mg = MetaGraph(n_nodes)
    nodes = 1:n_nodes
    
    for node in nodes
        set_props!(mg, node, Dict(:weak_ties => Int[],
                                  :strong_ties => Int[],
                                  :status => false,
                                  :activation_prob => 0.0))
    end
    
    # Wire the network according to the number of strong and weak ties
    # When wiring with random nodes, take care that the subject node and
    # already existing neighbors are not sampled again
    
    for node in nodes
        while length(get_prop(mg, node, :weak_ties)) < n_weak_ties
            rand_nbr = sample(nodes[1:end .!= node])
            if !(rand_nbr in get_prop(mg, node, :weak_ties) || rand_nbr in get_prop(mg, node, :strong_ties))
                push!(get_prop(mg, node, :weak_ties), rand_nbr)
            end
        end
        while length(get_prop(mg, node, :strong_ties)) < n_strong_ties
            rand_nbr = sample(nodes[1:end .!= node])
            if !(rand_nbr in get_prop(mg, node, :weak_ties) || rand_nbr in get_prop(mg, node, :strong_ties))
                push!(get_prop(mg, node, :strong_ties), rand_nbr)
            end
        end
    end
    
    return mg
end

initialize_network (generic function with 1 method)

In [4]:
g = initialize_network(3000, 5, 5)

empty undirected Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)

# 3. Model

## 3.1 Assumptions

Each individual in the substrate network (referred to as a node) is connected to the same number of strong ties (varied from 5 to 29) and weak ties (varied from 5 to 29). An uninformed individual can become informed (that is, the node is activated) in three ways: through an informed strong tie with probability $\beta_s$, through an informed weak tie with probability $\beta_w$, or through external marketing efforts with probability $\alpha$. In line with conventional wisdom, we assume $\alpha < \beta_w < \beta_s$.

At time step $t$, if $m$ of an individual's strong ties and $j$ of its weak ties are already informed, the probability that the individual becomes informed in this time step is:

$$
p(t) = 1 - (1- \alpha)(1 - \beta_w)^j(1 - \beta_s)^m
$$
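
As a quick numerical check of this formula, here is a minimal sketch. The function name `activation_probability` and the parameter values are illustrative assumptions; they are not taken from the paper or from the cells in this workbook.

```julia
# Hypothetical helper: probability that an uninformed node becomes informed in
# one time step, given j informed weak ties and m informed strong ties.
function activation_probability(alpha, beta_w, beta_s, j, m)
    return 1 - (1 - alpha) * (1 - beta_w)^j * (1 - beta_s)^m
end

# With no informed neighbors, only advertising can inform the node.
p_isolated = activation_probability(0.01, 0.005, 0.07, 0, 0)

# Two informed weak ties and one informed strong tie.
p_mixed = activation_probability(0.01, 0.005, 0.07, 2, 1)

println("p with no informed ties: ", p_isolated)
println("p with j = 2, m = 1:     ", p_mixed)
```

With $j = m = 0$ the expression reduces to $\alpha$, and every additional informed tie can only increase $p(t)$.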

We are interested in two outcome variables:
1. The number of time steps elapsed till 15% of the network engages 
2. The number of time steps elapsed till 95% of the network engages
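
Both outcomes can be read off the trajectory of informed counts from a single run. The sketch below assumes a hypothetical helper `time_to_fraction` and a made-up trajectory for a 100-node network; neither comes from the paper.

```julia
# Hypothetical helper: first time step at which the informed count reaches
# `fraction` of the network. counts[t] is the number of informed nodes after
# step t. Returns 0 if the threshold is never reached.
function time_to_fraction(counts::Vector{Int}, n_nodes::Int, fraction::Float64)
    for (t, c) in enumerate(counts)
        if c >= fraction * n_nodes
            return t
        end
    end
    return 0
end

# Made-up trajectory, for illustration only.
counts = [2, 6, 15, 40, 80, 96, 100]

t15 = time_to_fraction(counts, 100, 0.15)
t95 = time_to_fraction(counts, 100, 0.95)

println("steps to 15%: ", t15)
println("steps to 95%: ", t95)
```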

## 3.2 Execution

*Step 1:* At $t = 0$, the status of all nodes is set to `false`

*Step 2:* For each node, the probability of being informed is calculated as per the above equation. A random draw $U$ is made from a standard uniform distribution and compared with this probability. If $U < p(t)$, the status of the node is changed to `true`

*Step 3:* In each successive time step, Step 2 is repeated till 95% of the total network (of size 3000) engages
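
The three steps can be sketched compactly with plain vectors instead of a `MetaGraph`. This is a simplified illustration under stated assumptions, not the implementation used in this workbook: `draw_ties` and `simulate_diffusion` are hypothetical names, nodes are visited in index order, and a status change is visible to nodes visited later in the same time step.

```julia
# Draw k distinct neighbors for `node` from 1:n, never the node itself.
# Assumes k <= n - 1, otherwise the loop cannot terminate.
function draw_ties(n::Int, node::Int, k::Int)
    ties = Int[]
    while length(ties) < k
        candidate = rand(1:n)
        if candidate != node && !(candidate in ties)
            push!(ties, candidate)
        end
    end
    return ties
end

# Run one diffusion until a fraction `target` of the network is informed.
# Returns the informed count after each time step. Assumes alpha > 0 so that
# the process can start from a fully uninformed network.
function simulate_diffusion(n, s, w, alpha, beta_w, beta_s; target = 0.95)
    # The first s entries of each neighbor list are strong ties, the rest weak.
    ties = [draw_ties(n, node, s + w) for node in 1:n]
    informed = fill(false, n)               # Step 1: nobody is informed
    counts = Int[]
    while sum(informed) < target * n        # Step 3: repeat until the target
        for node in 1:n                     # Step 2: one update per node
            informed[node] && continue      # informed nodes stay informed
            m = count(i -> informed[i], ties[node][1:s])
            j = count(i -> informed[i], ties[node][s+1:end])
            p = 1 - (1 - alpha) * (1 - beta_w)^j * (1 - beta_s)^m
            if rand() < p
                informed[node] = true
            end
        end
        push!(counts, sum(informed))
    end
    return counts
end

counts = simulate_diffusion(200, 5, 5, 0.01, 0.015, 0.07)
println("steps to 95%: ", length(counts))
```

The small network of 200 nodes keeps the sketch fast; the number of steps varies from run to run because the draws are random.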

We now look at the helper functions that execute the above logic.

### 3.2.1 Reset node status

At the beginning of each simulation, we call the following function to set the status of all the nodes to `false`. 

In [None]:
function reset_node_status!(G::MetaGraphs.MetaGraph, n_nodes::Int)
    for node in 1:n_nodes
        set_prop!(G, node, :status, false)
    end
    return nothing
end

### 3.2.2 Activation probability

At each time step, the probability of activation for each node is calculated using the following function. We count the number of activated strong and weak ties for each node and use the above formula to compute the activation probability.

In [None]:
function update_activation_prob!(G::MetaGraphs.MetaGraph, node::Int, alpha::Float64, beta_w::Float64, beta_s::Float64)
    n_active_weak_ties, n_active_strong_ties = 0, 0

    for weak_tie in get_prop(G, node, :weak_ties)
        if get_prop(G, weak_tie, :status) == true
            n_active_weak_ties += 1
        end
    end

    for strong_tie in get_prop(G, node, :strong_ties)
        if get_prop(G, strong_tie, :status) == true
            n_active_strong_ties += 1
        end
    end

    set_prop!(G, node, :activation_prob, 
              1 - (1 - alpha) * (1 - beta_w)^n_active_weak_ties * (1 - beta_s)^n_active_strong_ties)
    
    return nothing
end

### 3.2.3 Update node status

At each time step the status of all the nodes is updated according to the calculated probability of activation. 

In [None]:
function update_status!(G::MetaGraphs.MetaGraph, n_nodes::Int, alpha::Float64, beta_w::Float64, beta_s::Float64)
    nodes = 1:n_nodes
    
    # assuming that the nodes update in random order
    
    for node in shuffle(nodes)
        update_activation_prob!(G, node, alpha, beta_w, beta_s)
        
        if rand(Uniform()) < get_prop(G, node, :activation_prob)
            set_prop!(G, node, :status, true)
        end
    end
    
    return nothing
end

### 3.2.4 Simulation on the parameter space

The function `execute_simulation` sets up the parameter space $(s, w, \alpha, \beta_w, \beta_s)$ and runs the diffusion on the network at each setting. From what I can gather from the paper, one simulation was carried out at each point in the parameter space. No further details regarding the execution are mentioned, except that since each parameter has 7 levels, a total of $7^5 = 16{,}807$ simulations were executed in a factorial design. In this workbook, we work with a smaller parameter space with 3 levels per parameter for illustration.

Also, I am assuming that the network is drawn at random for each run of the simulation.
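
The size of a full factorial design is the product of the number of levels per parameter. The sketch below checks the counts for 7 and 3 levels. The helper `grid_levels` is hypothetical, and evenly spaced levels are an assumption made here for illustration; the exact levels are not spelled out above.

```julia
# Hypothetical helper: n evenly spaced levels between lo and hi (inclusive).
grid_levels(lo, hi, n) = [lo + (i - 1) * (hi - lo) / (n - 1) for i in 1:n]

# Tie counts must be integers, so the levels are floored.
s_levels_small = [floor(Int, x) for x in grid_levels(5, 29, 3)]
s_levels_full = [floor(Int, x) for x in grid_levels(5, 29, 7)]

# Five parameters, each with the same number of levels.
n_settings_small = length(s_levels_small)^5
n_settings_full = length(s_levels_full)^5

println("3 levels per parameter: ", n_settings_small, " settings")
println("7 levels per parameter: ", n_settings_full, " settings")
```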

One more interesting detail: the authors mention that their simulations were written in C, so it would be interesting to compare execution times with Julia. This is a non-standard problem that tests both the robustness of Julia's types and its execution speed (maybe this will prompt someone to make a pull request!).

In [None]:
println("Number of strong ties per node (s): ", floor.(Int, linspace(5, 29, 3)))
println("Number of weak ties per node(w): ", floor.(Int, linspace(5, 29, 3)))
println("Effect of advertising (α): ", collect(linspace(0.0005, 0.01, 3)))
println("Effect of weak ties (β_w): ", collect(linspace(0.005, 0.015, 3)))
println("Effect of strong ties (β_s): ", collect(linspace(0.01, 0.07, 3)))

In [None]:
parameter_space = [(s, w, alpha, beta_w, beta_s) for s in floor.(Int, linspace(5, 29, 3)), 
                                                     w in floor.(Int, linspace(5, 29, 3)),
                                                     alpha in linspace(0.0005, 0.01, 3),
                                                     beta_w in linspace(0.005, 0.015, 3),
                                                     beta_s in linspace(0.01, 0.07, 3)]

size(parameter_space), length(parameter_space)

In [None]:
function execute_simulation(parameter_space, n_nodes::Int)
    
    # n_nodes dictates how big the network will be
    # We cannot pre-allocate the output since we do not know for how many time steps the simulation will
    # run at each setting
    
    output = DataFrame(s = Int[], w = Int[], alpha = Float64[], 
                       beta_w = Float64[], beta_s = Float64[], 
                       t = Int[], num_engaged = Int[])


    # Rewiring the network each time is expensive. We can cut down repeats of the same rewiring process
    # by building the network only when the parameters used to build the network have changed.
    
    old_s, old_w = parameter_space[1][1:2]
    G = initialize_network(n_nodes, old_s, old_w)
    
    for (s, w, alpha, beta_w, beta_s) in parameter_space[1:end]
        
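        # Rewire the network only if the network creation parameters have changed.
        # Note: with the parameter_space built above, s is the fastest-varying
        # index, so consecutive settings differ in s and the network is rebuilt
        # at every new setting.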
  
        if !(old_s == s && old_w == w)
            G = initialize_network(n_nodes, s, w)
        end
        reset_node_status!(G, n_nodes)
        
        println("Beginning simulation on setting $((s, w, alpha, beta_w, beta_s)) at : ", Dates.format(now(), "HH:MM"))
        
        num_engaged = sum([get_prop(G, node, :status) for node in 1:n_nodes])
        t = 1
        
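        # Continue updates at each setting till 95% of the network engages.
        # ceil(Int, ...) avoids the InexactError that Int(...) throws when
        # 0.95 * n_nodes is not a whole number.
        
        while num_engaged < ceil(Int, 0.95 * n_nodes)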
            update_status!(G, n_nodes, alpha, beta_w, beta_s)
            num_engaged = sum([get_prop(G, node, :status) for node in 1:n_nodes])
            push!(output, [s, w, alpha, beta_w, beta_s, t, num_engaged])
            t += 1
        end
    
        old_s, old_w = s, w
    end
    
    return output
end

In [None]:
@time results = execute_simulation(parameter_space, 3000)

# 4. Discussion

To answer the research questions, the authors fit a linear regression to the simulated outcomes.

Since our focus in this workbook is on highlighting the strengths of the JuliaGraphs ecosystem, we keep the regression modeling at the most basic level and replicate the linear model in the paper only for the time until 95% of the network engages.

The features used to predict this outcome are $s$, $w$, $\alpha$, $\beta_w$ and $\beta_s$.

In [None]:
head(results)

To build the data required for the linear modeling, we group the data by each parameter setting and calculate the time the network takes to reach 95% activation.

In [None]:
all_engaged = by(results, [:s, :w, :alpha, :beta_w, :beta_s], df -> DataFrame(T95 = maximum(df[:t])));
head(all_engaged)
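
The same grouping can be illustrated without DataFrames, using a dictionary keyed by the parameter setting. This is only a sketch of the idea; the rows below are made up and the variable names are hypothetical.

```julia
# Each row: (s, w, alpha, beta_w, beta_s, t, num_engaged).
# Made-up rows for two parameter settings, for illustration only.
rows = [
    (5, 5, 0.01, 0.005, 0.01, 1, 40),
    (5, 5, 0.01, 0.005, 0.01, 2, 90),
    (5, 5, 0.01, 0.005, 0.01, 3, 96),
    (29, 5, 0.01, 0.005, 0.01, 1, 70),
    (29, 5, 0.01, 0.005, 0.01, 2, 97),
]

# Map each parameter setting to the largest recorded time step. Since a run
# stops as soon as 95% of the network engages, this is the time to 95%.
t95_by_setting = Dict{Tuple{Int,Int,Float64,Float64,Float64},Int}()
for row in rows
    key = row[1:5]
    t95_by_setting[key] = max(get(t95_by_setting, key, 0), row[6])
end

for (key, t95) in t95_by_setting
    println(key, " => T95 = ", t95)
end
```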

We then fit a linear model to the data:

In [None]:
ols = lm(@formula(T95 ~ s + w + alpha + beta_s + beta_w), all_engaged)

These results indicate that strong ties and weak ties play comparably important roles in the speed of information diffusion. As the authors note, this happens even though the model assumptions give weak ties the lower transmission probability ($\beta_w < \beta_s$).