<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1-Network-Metrics" data-toc-modified-id="1-Network-Metrics-1">1 Network Metrics</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Hint:-How-to-extract-node-characteristics-from-a-sample-graph" data-toc-modified-id="Hint:-How-to-extract-node-characteristics-from-a-sample-graph-1.0.1">Hint: How to extract node characteristics from a sample graph</a></span></li></ul></li></ul></li><li><span><a href="#2-Power-Law-Distributions" data-toc-modified-id="2-Power-Law-Distributions-2">2 Power Law Distributions</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Hint:" data-toc-modified-id="Hint:-2.0.1">Hint:</a></span></li></ul></li></ul></li><li><span><a href="#3-Influential-Nodes" data-toc-modified-id="3-Influential-Nodes-3">3 Influential Nodes</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Hint:-How-to-aggregate-over-a-dataframe" data-toc-modified-id="Hint:-How-to-aggregate-over-a-dataframe-3.0.1">Hint: How to aggregate over a dataframe</a></span></li></ul></li></ul></li><li><span><a href="#4-Distinguishing-Homophily-from-Influence" data-toc-modified-id="4-Distinguishing-Homophily-from-Influence-4">4 Distinguishing Homophily from Influence</a></span><ul class="toc-item"><li><span><a href="#a)-Conceptual-overview" data-toc-modified-id="a)-Conceptual-overview-4.1">a) Conceptual overview</a></span><ul class="toc-item"><li><span><a href="#The-adopter-files-(worldcup.csv,-love.csv,-selfie.csv,tbt.csv)" data-toc-modified-id="The-adopter-files-(worldcup.csv,-love.csv,-selfie.csv,tbt.csv)-4.1.1">The adopter files (<code>worldcup.csv</code>, <code>love.csv</code>, <code>selfie.csv</code>,<code>tbt.csv</code>)</a></span></li><li><span><a href="#The-file-of-all-users-(all_users.csv)" data-toc-modified-id="The-file-of-all-users-(all_users.csv)-4.1.2">The file of all users (<code>all_users.csv</code>)</a></span></li><li><span><a 
href="#The-propensity-score-workflow" data-toc-modified-id="The-propensity-score-workflow-4.1.3">The propensity score workflow</a></span></li></ul></li><li><span><a href="#b)-Programming-overview" data-toc-modified-id="b)-Programming-overview-4.2">b) Programming overview</a></span><ul class="toc-item"><li><span><a href="#Hint:-How-to-select-a-subset-of-a-dataframe" data-toc-modified-id="Hint:-How-to-select-a-subset-of-a-dataframe-4.2.1">Hint: How to select a subset of a dataframe</a></span></li><li><span><a href="#Hint:-How-to-loop-over-a-dataframe-by-row-index" data-toc-modified-id="Hint:-How-to-loop-over-a-dataframe-by-row-index-4.2.2">Hint: How to loop over a dataframe by row index</a></span></li><li><span><a href="#Hint:-Initialize-a-dataframe-and-add-rows" data-toc-modified-id="Hint:-Initialize-a-dataframe-and-add-rows-4.2.3">Hint: Initialize a dataframe and add rows</a></span></li><li><span><a href="#Hint:-How-to-fit-a-logistic-regression-in-R,-and-use-it-for-prediction" data-toc-modified-id="Hint:-How-to-fit-a-logistic-regression-in-R,-and-use-it-for-prediction-4.2.4">Hint: How to fit a logistic regression in R, and use it for prediction</a></span></li></ul></li></ul></li></ul></div>

In [1]:
library(igraph)

"package 'igraph' was built under R version 3.3.3"
Attaching package: 'igraph'

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union



# 1 Network Metrics

### Hint: How to extract node characteristics from a sample graph

For a list of different characteristics we can extract, see [the documentation](http://igraph.org/r/doc/).

<img src="fig/graph.png" style="width: 150px;"/>

In [2]:
# Step 1: Load a sample graph (pictured above). For your assignment, you'll want to load the graph from a file instead.
uv <- matrix(c('a','b','b','c','c','a','c','d','d','a'), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(uv, directed = TRUE)
g

IGRAPH 5ddf537 DN-- 4 5 -- 
+ attr: name (v/c)
+ edges from 5ddf537 (vertex names):
[1] a->b b->c c->a c->d d->a

In [3]:
# Step 2: We can calculate a metric, e.g. closeness, over the whole graph 
closeness(g)

In [4]:
# Step 3: We can use indexing to retrieve the value for a set of nodes, e.g. a and c
closeness(g)[c('a','c')]
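
The same pattern extends to loading a graph from a file and collecting several metrics at once. The following is a minimal sketch, not the assignment solution: the temporary CSV file and its `from`/`to` column names are assumptions made for illustration, and the edges simply repeat the sample graph above.

```r
library(igraph)

# Write the sample edge list to a temporary CSV file to stand in for an
# assignment file (the file and its column names are hypothetical).
edges <- data.frame(from = c("a", "b", "c", "c", "d"),
                    to   = c("b", "c", "a", "d", "a"),
                    stringsAsFactors = FALSE)
edge_file <- tempfile(fileext = ".csv")
write.csv(edges, edge_file, row.names = FALSE)

# Load the edge list from the file and build a directed graph
edge_df <- read.csv(edge_file, stringsAsFactors = FALSE)
g2 <- graph_from_data_frame(edge_df, directed = TRUE)

# Collect several node characteristics in one dataframe
node_stats <- data.frame(
  node       = V(g2)$name,
  in_degree  = degree(g2, mode = "in"),
  out_degree = degree(g2, mode = "out"),
  stringsAsFactors = FALSE
)

# Retrieve the rows for nodes a and c
node_stats[node_stats$node %in% c("a", "c"), ]
```

Indexing by node name, rather than by position, avoids relying on the order in which igraph stores the vertices.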

# 2 Power Law Distributions

### Hint:
You may want to use the `factor` function together with `table` (or `summary`) to count the number of times that each possible value of `ntweets` appears.
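
As a minimal sketch of that counting step, the example below tabulates a small made-up vector. The values are invented for illustration, and the output column name `n_users` is an assumption, not something from the assignment data.

```r
# Hypothetical tweet counts, one entry per user
ntweets <- c(1, 1, 2, 5, 1, 2, 10)

# Count how many users have each distinct value of ntweets
counts <- table(factor(ntweets))

# Convert the counts to a dataframe with numeric columns
freq <- data.frame(
  ntweets = as.numeric(names(counts)),
  n_users = as.vector(counts)
)
freq
```

A power law appears as a roughly straight line on log-log axes, so `plot(freq$ntweets, freq$n_users, log = "xy")` is a natural next step.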

# 3 Influential Nodes

### Hint: How to aggregate over a dataframe
You may find it useful to use the `aggregate` function to get the average number of retweets per user.

In [5]:
# Step 1: For a simple example, let's use a subset of the mtcars sample data
data(mtcars)
test_df <- mtcars[1:5, c("cyl", "wt")]
test_df

| | cyl | wt |
|---|---|---|
| Mazda RX4 | 6 | 2.62 |
| Mazda RX4 Wag | 6 | 2.875 |
| Datsun 710 | 4 | 2.32 |
| Hornet 4 Drive | 6 | 3.215 |
| Hornet Sportabout | 8 | 3.44 |


In [6]:
# Step 2: Let's get the maximum weight by cylinder count
maxwt <- aggregate(wt ~ cyl, test_df, max)

# Step 3: Sort the dataframe
maxwt[order(-maxwt$wt), ]

| | cyl | wt |
|---|---|---|
| 3 | 8 | 3.44 |
| 2 | 6 | 3.215 |
| 1 | 4 | 2.32 |
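
The hint mentions averages, so here is the same `aggregate` pattern with `mean` in place of `max`. This is a sketch on invented data: the dataframe `tweets` and its columns `user` and `retweets` are hypothetical stand-ins for whatever the assignment files contain.

```r
# Hypothetical data: one row per tweet, with the number of retweets it received
tweets <- data.frame(
  user     = c("u1", "u1", "u2", "u3", "u3", "u3"),
  retweets = c(10, 20, 3, 0, 6, 0),
  stringsAsFactors = FALSE
)

# Average number of retweets per user
avg_rt <- aggregate(retweets ~ user, tweets, mean)

# Sort users from highest to lowest average
avg_rt <- avg_rt[order(-avg_rt$retweets), ]
avg_rt
```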


# 4 Distinguishing Homophily from Influence

**Are users more likely to tweet with a hashtag if their friends have already tweeted with that hashtag?** This problem attempts to answer this question by studying the spread of four hashtags in a network of Twitter users.
In other words, we want to estimate the effect of the treatment on the outcome of interest, where:
- **Treatment** = friends tweeted with a given hashtag before time $t^*_h$ and user has tweeted since $t^*_h$
- **Outcome of interest** = adoption, defined as whether or not someone tweeted with the hashtag. 

## a) Conceptual overview
### The adopter files (`worldcup.csv`, `love.csv`, `selfie.csv`,`tbt.csv`)

Essentially, we can consider every adopter who tweeted before time $t^*_{h}$ to have "treated" all of his/her followers. 

> <span style="color:gray">
    **Example:**   
We see that $309$ users adopted the hashtag `#worldcup`. We can get the median `timeStamp` from the dataframe of `#worldcup` adopters, and see that $t^*_{worldcup}=154$. The timestamp is just a sequential ordering of users starting at 0, so we can confirm that 154 users adopted before $t^*_{worldcup}$ and had the potential to treat others.
    </span>
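
The arithmetic in the example can be checked with a short sketch. It assumes only what the example states: 309 adopters and a `timeStamp` column that numbers them sequentially from 0. The dataframe here is synthetic, not the real `worldcup.csv`.

```r
# Synthetic stand-in for the #worldcup adopter file: 309 adopters,
# with timeStamp running sequentially from 0
adopters <- data.frame(timeStamp = 0:308)

# t* is the median timestamp
t_star <- median(adopters$timeStamp)

# Adopters who tweeted before t* could have treated their followers
n_early <- sum(adopters$timeStamp < t_star)

c(t_star = t_star, n_early = n_early)
```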

### The file of all users (`all_users.csv`)

The people in `all_users.csv` fall into three categories:
1. Those who were treated
2. Those who were untreated
3. Those who don't follow anyone and can't possibly have been treated ($\rightarrow$ discard)

<img src="fig/allusers.png" style="width: 700px;"/>

> <span style="color:gray"> 
    **Check your work:** 
    After dropping users with no friends and calculating adoption and treatment status for everyone in `all_users.csv`, we wind up with the following distributions for `#worldcup`:   
    
| | didn't adopt | adopted | adoption rate |
|---|---|---|---|
| untreated | 200 | 30 | 13% |
| treated | 214 | 61 | 22% |
</span>
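
The adoption rates in the check above follow directly from the four counts. A minimal sketch of that calculation, using the `#worldcup` counts as given:

```r
# Counts for #worldcup after dropping users with no friends
counts <- matrix(
  c(200, 30,
    214, 61),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("untreated", "treated"), c("didnt_adopt", "adopted"))
)

# Adoption rate per group = adopters / all users in the group
adoption_rate <- counts[, "adopted"] / rowSums(counts)
round(100 * adoption_rate)
```

The raw gap between the two rates is the naive estimate of the treatment effect, before any correction for selection bias.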

### The propensity score workflow

Ideally, treated and untreated users would be similar. Then, we could just compare the adoption rates of the two groups. However, this is unlikely.

Due to homophily, the treatment group might suffer from selection bias. Specifically, this group of people might already be more likely to adopt, regardless of treatment, since they follow (and are likely similar to) the early adopters. In this case, they would be *both* more likely to be treated, *and* more likely to adopt.

We need to take extra steps so that we compare the *treated* users to the "right" *untreated* users who are just like them. We will do so with propensity score matching. A sample solution could adopt the following workflow:


<img src="fig/workflow.png" style="width: 800px;"/>
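
One way propensity score matching can be sketched in R is shown below. This is a generic illustration under stated assumptions, not a transcription of the workflow figure: the data are simulated, the covariate `n_friends` is hypothetical, and nearest-neighbour matching with replacement is just one of several possible matching rules.

```r
set.seed(42)

# Simulated users (all column names and values are hypothetical)
n <- 200
users <- data.frame(
  n_friends = rpois(n, 5),          # covariate that may drive treatment
  treated   = rbinom(n, 1, 0.5),    # 1 = treated, 0 = untreated
  adopted   = rbinom(n, 1, 0.2)     # 1 = adopted the hashtag
)

# Step 1: Model the probability of treatment given the covariates
ps_model <- glm(treated ~ n_friends, family = binomial(link = "logit"), data = users)

# Step 2: The fitted probabilities are the propensity scores
users$pscore <- predict(ps_model, type = "response")

# Step 3: For each treated user, find the untreated user with the closest score
treated_users   <- users[users$treated == 1, ]
untreated_users <- users[users$treated == 0, ]
match_idx <- sapply(treated_users$pscore,
                    function(p) which.min(abs(untreated_users$pscore - p)))
matched <- untreated_users[match_idx, ]

# Step 4: Compare adoption rates between treated users and their matches
effect <- mean(treated_users$adopted) - mean(matched$adopted)
effect
```

Because treatment and adoption are simulated independently here, the estimated effect carries no meaning; the point is only the sequence of steps.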

## b) Programming overview

### Hint: How to select a subset of a dataframe

In [7]:
# Let's find all cars with 6 cylinders in mtcars
cars6cyl <- subset(mtcars, cyl==6)
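
Conditions can also be combined, and bracket indexing gives the same result as `subset`. A short sketch on `mtcars`, where the weight cutoff of 3 is an arbitrary value chosen for illustration:

```r
data(mtcars)

# 6-cylinder cars lighter than 3 (wt is in 1000 lbs), using bracket indexing
light_six <- mtcars[mtcars$cyl == 6 & mtcars$wt < 3, c("cyl", "wt")]

# The same selection with subset(), keeping only two columns
light_six_subset <- subset(mtcars, cyl == 6 & wt < 3, select = c(cyl, wt))

light_six
```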

### Hint: How to loop over a dataframe by row index

In [8]:
# Let's find the qsec variable for every even-numbered row

for (row in 1:nrow(mtcars)) {           # Loop over dataframe
    if (row %% 2 == 0) {                # Select only even-numbered rows
        print(mtcars[row, 'qsec'])      # Print the "qsec" variable for that row
    }
}

[1] 17.02
[1] 19.44
[1] 20.22
[1] 20
[1] 18.3
[1] 17.4
[1] 18
[1] 17.82
[1] 19.47
[1] 19.9
[1] 16.87
[1] 15.41
[1] 18.9
[1] 16.9
[1] 15.5
[1] 18.6
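
For comparison, the same values can be retrieved without an explicit loop. This is a sketch of the vectorized alternative; it returns the values as one vector instead of printing them one at a time.

```r
data(mtcars)

# Indices of the even-numbered rows: 2, 4, 6, ...
even_rows <- seq(2, nrow(mtcars), by = 2)

# qsec for those rows, as a single numeric vector
even_qsec <- mtcars$qsec[even_rows]
even_qsec
```

A loop is still the more natural choice when each row needs several dependent steps, as in the treatment calculation for this problem.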


### Hint: Initialize a dataframe and add rows

In [9]:
# Initialize empty dataframe
df <- data.frame(matrix(ncol = 2, nrow = 0))

# Name the dataframe columns
names(df) <- c("a", "b")

# Add rows
df[nrow(df) + 1, ] <- list('test1', 'test2')
df[nrow(df) + 1, ] <- list('test3', 'test4')

df

| a | b |
|---|---|
| test1 | test2 |
| test3 | test4 |
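
When many rows are added inside a loop, an alternative is to collect the rows in a list and bind them once at the end. The sketch below uses made-up values and hypothetical column names `a` and `b`, and is only one possible pattern.

```r
# Collect one single-row dataframe per iteration
rows <- list()
for (i in 1:3) {
    rows[[length(rows) + 1]] <- data.frame(
        a = paste0("test", i),
        b = i^2,
        stringsAsFactors = FALSE
    )
}

# Bind all rows into one dataframe in a single step
df2 <- do.call(rbind, rows)
df2
```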


### Hint: How to fit a logistic regression in R, and use it for prediction

In [10]:
# Step 1: Let's use our mtcars data again
data(mtcars)

# Step 2: Fit a logit model to predict 'am' using 'cyl' and 'wt' variables
fittedlogit <- glm(am~cyl+wt, family=binomial(link='logit'), data=mtcars)

# Step 3: Get predicted values and assign them to a new column
mtcars$predicted <- predict(fittedlogit, newdata=mtcars, type='response')
head(mtcars)

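| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 | 0.95589722 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 0.74475046 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 | 0.94223476 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 0.16756939 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 | 0.3254379 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 | 0.02848076 |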
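
The fitted model can also score rows it has not seen, by passing a dataframe with the same predictor columns as `newdata`. In this sketch the new car's values and the 0.5 cutoff are assumptions chosen purely for illustration.

```r
data(mtcars)

# Refit the same logit model so this example is self-contained
fit <- glm(am ~ cyl + wt, family = binomial(link = 'logit'), data = mtcars)

# A hypothetical new car: 4 cylinders, weight 2.5 (in 1000 lbs)
new_car <- data.frame(cyl = 4, wt = 2.5)

# Predicted probability that am == 1 for the new car
p_new <- predict(fit, newdata = new_car, type = 'response')

# Turn the probability into a 0/1 prediction with an assumed cutoff of 0.5
predicted_class <- ifelse(p_new >= 0.5, 1, 0)
predicted_class
```

With `type='response'` the prediction is a probability between 0 and 1; without it, `predict` returns values on the log-odds scale.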
