# Twitter Retweet Percentage
The Twitter API provides the number of retweets a tweet got, but not the percentage of users who saw the tweet and did not retweet it. Assume that, for a given user, the followers who check Twitter arrive at rate lambda:
$$ rate = \lambda $$
$$ s.t. E[N(t)] = \lambda t $$

Also assume that, because of the researched power law of Twitter followers, the average follower has 90.9 followers.

### Questions

* What is the probability that someone who saw the tweet retweeted it, given the total number of retweets, the original poster's total number of followers, the time since it was tweeted, and the follower arrival rate lambda?
* Given Tweet A, tweeted 30 minutes ago with 1000 first order followers and 25 retweets, did it have a higher retweet rate than Tweet B, tweeted 30 minutes ago with 2000 first order followers and 25 retweets?

In [2]:
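import random
import math

def poisson_arrival(alpha):
    # Draw a Poisson-distributed integer with mean alpha by multiplying
    # uniform draws until the product falls below exp(-alpha).
    x = 0
    p = random.random()
    while p >= math.exp(-1*alpha):
        p = random.random() * p
        x += 1
    return x

print(poisson_arrival(4))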
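
As a quick sanity check on the sampler above, the sketch below draws many values and compares the sample mean and variance with the target mean (for a Poisson distribution the two should match). The name `poisson_sample`, the fixed seed, and the sample size are assumptions made for illustration. The last lines show a different technique that the notebook does not use: exponential gaps drawn with `random.expovariate`, which is the usual inter-arrival model for a Poisson process.

```python
import math
import random

def poisson_sample(mean, rng):
    """Poisson draw: multiply uniforms until the product drops below exp(-mean)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product >= threshold:
        product *= rng.random()
        count += 1
    return count

rng = random.Random(0)  # fixed seed (an assumption) so the check is repeatable
draws = [poisson_sample(4, rng) for _ in range(20000)]
sample_mean = sum(draws) / float(len(draws))
sample_var = sum((d - sample_mean) ** 2 for d in draws) / float(len(draws))
print(sample_mean, sample_var)  # both should land close to 4

# Alternative gap model (not the notebook's): exponential gaps with mean 4
gaps = [rng.expovariate(1.0 / 4) for _ in range(20000)]
mean_gap = sum(gaps) / float(len(gaps))
print(mean_gap)  # should also land close to 4
```

Both samplers give gaps with the same mean; they differ in shape, since the Poisson draw is an integer and the exponential draw is continuous.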

## Discrete Event Simulation
I've set up a quick Python function that models a tweet getting posted and receiving retweets from followers of order 1..d. According to [Twitter's own research](http://www.ambuehler.ethz.ch/CDstore/www2010/www/p591.pdf), d is almost never greater than 6. Additionally, [other research](http://www2014.kr/wp-content/uploads/2014/05/companion_p493.pdf) they've conducted indicates that follower counts follow a power-law distribution and that the average account should have about `90.9` followers.

In [4]:
from collections import defaultdict

def simulate(alpha, max_seconds, n, n_d, probability, max_order=6):
    # alpha: mean seconds between follower arrivals; n: first-order followers;
    # n_d: approximate follower count used for every higher-order user
    upcoming = [(poisson_arrival(alpha), 1)] # tuple of next event time and order label
    total_retweets = 0
    totals = defaultdict(int) # retweets counted so far per order label
    while upcoming[0][0] <= max_seconds:
        t, d = upcoming.pop(0) #current time and order of follower
        if random.random() < probability:
            max_n = n if d == 1 else n_d # use approx if not first order follower
            if totals[str(d)] < max_n and len(str(d).split(".")) < max_order:
                totals[str(d)] += 1
                child = str(totals[str(d)]) + ".1"
                total_retweets += 1
                upcoming.append((t + poisson_arrival(alpha), child)) # set the next event for followers of this user
        upcoming.append((t + poisson_arrival(alpha), d)) # set the next event on order d
        upcoming.sort(key=lambda tup: tup[0]) # keep the earliest event at the front
    return total_retweets
simulate(1, 10, 5, 3, .4)
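
The event loop above re-sorts the whole list after every event. A minimal sketch of the same idea with a priority queue is shown below. It is a simplified stand-in, not a drop-in replacement: the name `simulate_heap` is hypothetical, follower caps are ignored, gaps are exponential (an assumption), and depth is tracked as a plain integer.

```python
import heapq
import itertools
import random

def simulate_heap(mean_gap, max_seconds, probability, max_depth=2, rng=None):
    """Count retweets with a heap-based event queue (simplified: no follower caps)."""
    rng = rng or random.Random()
    tie = itertools.count()  # tie-breaker so equal times are still ordered
    events = [(rng.expovariate(1.0 / mean_gap), next(tie), 1)]
    retweets = 0
    while events[0][0] <= max_seconds:
        t, _, depth = heapq.heappop(events)
        if rng.random() < probability:
            retweets += 1
            if depth < max_depth:
                # followers of the retweeter start arriving one level deeper
                gap = rng.expovariate(1.0 / mean_gap)
                heapq.heappush(events, (t + gap, next(tie), depth + 1))
        # schedule the next follower arrival on the current stream
        gap = rng.expovariate(1.0 / mean_gap)
        heapq.heappush(events, (t + gap, next(tie), depth))
    return retweets

print(simulate_heap(1.0, 10, 0.4, rng=random.Random(0)))
```

The heap keeps pop and push at logarithmic cost, which matters only once the number of pending events grows large; for short windows the sort-based loop is just as serviceable.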

I'll also create a helper function `find_closest_prob`. This does a greedy, coarse-to-fine search over probabilities, looking for the probability p most likely to give us the retweet count we saw. I'm taking advantage of the fact that the average retweets = f(p) should be a monotonically increasing function of p given a sufficiently large number of simulation runs n.

In [6]:
import numpy as np

Nd = 90.9 # average follower count used for higher-order followers

def find_closest_prob(rate, t, first_order_followers, second_order_followers, retweets, seed=0, step=.1, margin=.3, n=1000):
    expected = 0
    idx = 0
    # step p upward from seed until the simulated mean reaches the observed retweets
    while expected < retweets and idx < 11:
        p = seed + (step * (idx))
        result = [simulate(rate, t, first_order_followers, second_order_followers, p) for i in range(n)]
        expected = sum(result)/ float(n)
        idx += 1
    if abs(expected - retweets) <= margin or idx == 11:
        std = np.std(result)
        std_error = std / math.sqrt(n)
        return (expected, p, 1.96*std_error) # mean, probability, 95% half-width
    else:
        # overshot the target: restart one step back with a 10x finer step
        return find_closest_prob(rate, t, first_order_followers, second_order_followers, retweets, seed + (step * (idx - 2)), step/10.0, margin, n)
find_closest_prob(86.4, 60*60, 1000, Nd, 25, n=2000)   
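
To see the search strategy of `find_closest_prob` without simulation noise, here is a minimal sketch on a deterministic stand-in. The function `refine_search`, the toy curve `f(p) = 40 * p`, and the depth limit are all assumptions made for illustration; only the step-then-refine pattern mirrors the helper above.

```python
def refine_search(f, target, start=0.0, step=0.1, margin=0.01, max_steps=11, max_depth=6):
    """Step p upward until f(p) reaches target, then restart from the
    previous p with a 10x finer step (f is assumed to be increasing)."""
    p = start
    value = f(p)
    steps = 0
    while value < target and steps < max_steps - 1:
        steps += 1
        p = start + step * steps
        value = f(p)
    close_enough = abs(value - target) <= margin
    cannot_refine = steps == 0 or value < target or max_depth == 0
    if close_enough or cannot_refine:
        return p, value
    previous_p = start + step * (steps - 1)  # last p that was still below target
    return refine_search(f, target, previous_p, step / 10.0, margin, max_steps, max_depth - 1)

# Toy monotone curve standing in for "expected retweets as a function of p"
p_hat, value_hat = refine_search(lambda p: 40.0 * p, 25.0)
print(p_hat, value_hat)  # p_hat should land near 0.625
```

With a noisy simulated mean in place of the toy curve, monotonicity only holds approximately, which is why the real helper needs a large n and a tolerance `margin`.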

Let's start with Tweet A, which ran for an hour with 1000 followers and got 25 retweets. Let's assume that a user's followers arrive with uniform probability over the day, meaning one follower arrives every 86.4 seconds on average, so we'll set lambda to 86.4:

$$ 86.4  = \frac{60 \times 60 \times 24}{1000} $$

In [8]:
find_closest_prob(86.4, 60*60, 1000, Nd, 25, n=2000)   

Alright, time to do B, which ran for 30 minutes with 2000 followers and got 25 retweets. Making the same assumption as with Tweet A, the interval between arrivals is halved since we have double the followers, so a follower appears every 43.2 seconds.

$$ 43.2 = \frac{60 \times 60 \times 24}{2000} $$
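
The arithmetic behind the two intervals can be sketched as below, along with the expected number of first-order follower arrivals in each tweet's window. The helper names `mean_gap` and `expected_arrivals` are hypothetical, and the uniform-arrival assumption is the one already made for Tweet A.

```python
SECONDS_PER_DAY = 60 * 60 * 24

def mean_gap(followers):
    """Mean seconds between follower arrivals if all followers check in once a day."""
    return SECONDS_PER_DAY / float(followers)

def expected_arrivals(followers, window_seconds):
    """Expected number of follower arrivals within the window."""
    return window_seconds / mean_gap(followers)

gap_a, gap_b = mean_gap(1000), mean_gap(2000)
arrivals_a = expected_arrivals(1000, 60 * 60)  # Tweet A: one hour
arrivals_b = expected_arrivals(2000, 30 * 60)  # Tweet B: thirty minutes
print(gap_a, gap_b)
print(arrivals_a, arrivals_b)  # both come out at roughly 41.7
```

Under this assumption, doubling the followers while halving the window leaves the expected number of first-order arrivals unchanged.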

In [10]:
find_closest_prob(43.2, 30*60, 2000, Nd, 25, n=2000)   

So A and B essentially have the same retweet rate. I'll have to edit this later when I think of a significance test that can capture this scenario. Essentially we have

$$ f(p) \rightharpoonup X \sim \mathcal{N}(\mu_X, \sigma_X) $$
$$ g(p) \rightharpoonup Y \sim \mathcal{N}(\mu_Y, \sigma_Y) $$

and we want to know whether to reject the hypothesis that f and g are the same function at a fixed p (somewhere around .8), given what we know about X and Y. When I figure that out I'll update this notebook. Until then I'm going to assume significance because :shrugs:
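
One candidate worth trying for this comparison is a two-sample z-test on the simulated means. This is a sketch under stated assumptions, not a settled answer for the notebook: it assumes the two sets of simulation runs are independent and that each simulated mean is approximately normal. The function name `two_sample_z` and the numbers in the usage line are hypothetical.

```python
import math

def two_sample_z(mean_x, se_x, mean_y, se_y):
    """z statistic and two-sided p-value for the difference of two
    independent, approximately normal means with known standard errors."""
    z = (mean_x - mean_y) / math.sqrt(se_x ** 2 + se_y ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical inputs, chosen only to show the call
z, p_value = two_sample_z(25.0, 0.5, 26.0, 0.5)
print(z, p_value)
```

A small p-value would argue against f and g producing the same mean at the chosen p; a large one only means the simulations cannot tell them apart.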

In [12]:
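import matplotlib.pyplot as plt
import numpy as np
import math

def normpdf(x, mu, sigma):
    # normal density; stands in for matplotlib.mlab.normpdf, which was removed in Matplotlib 3.1
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu_a, mu_b = 25.024, 25.1065
var_a, var_b = 1.0303547039826624, 1.0397981568584356
sigma_a, sigma_b = math.sqrt(var_a), math.sqrt(var_b)
fig, ax = plt.subplots()
x = np.linspace(mu_a-3*sigma_a,mu_a+3*sigma_a, 100) # three standard deviations either side of the mean
y = np.linspace(mu_b-3*sigma_b,mu_b+3*sigma_b, 100)
ax.plot(x,normpdf(x, mu_a, sigma_a))
ax.plot(y,normpdf(y, mu_b, sigma_b)) # plot B's density against its own grid
display(fig) #different if not using databricks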