# Poisson processes for fun and profit

Nelson Uhan<br>
October 2016

## A really important question

Do Taylor Swift's tweets follow a Poisson process? 

## How do we determine if an arrival process is Poisson?

One approach: __look at the interarrival times__.

* Are the interarrival times exponentially distributed?
* Are the interarrival times independent?
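
To make these two checks concrete, here is a minimal sketch that simulates a Poisson process by drawing independent exponential interarrival times and accumulating them into arrival times. The rate `rate`, the sample size `n`, the seed, and the `sim_` variable names are illustrative assumptions, not values taken from any Twitter data.

```python
import numpy as np

# Hypothetical parameters chosen only for illustration
rate = 2.0      # arrivals per hour (lambda)
n = 10000       # number of arrivals to simulate

rng = np.random.RandomState(0)

# Independent exponential interarrival times with mean 1 / rate
sim_interarrival_times = rng.exponential(scale=1 / rate, size=n)

# Arrival times are the running sums of the interarrival times
sim_arrival_times = np.cumsum(sim_interarrival_times)

# The sample mean interarrival time should be close to 1 / rate
print("Mean interarrival time: {0:.3f} hours".format(sim_interarrival_times.mean()))
print("Theoretical mean: {0:.3f} hours".format(1 / rate))
```

Data generated this way passes both checks by construction, so it gives a useful baseline for what "Poisson-looking" interarrival times look like.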

## Preliminaries

This is a [Jupyter Notebook](http://www.jupyter.org), which lets you mix live code, equations, text, and images into one interactive document. The code in this notebook is written in [Python](http://www.python.org).

To execute a code cell:

1. Click inside a code cell
2. Either
    * press <key><i class="fa fa-step-forward" aria-hidden="true"></i></key> in the toolbar, or
    * press Shift + Enter

First, we need to import several libraries, including [Tweepy](http://www.tweepy.org), which lets us access Twitter programmatically from Python.

In [None]:
# Setup - import libraries, initialize plotly for Jupyter
import tweepy
import numpy as np
import scipy.stats as stats
import IPython.display
import plotly.graph_objs as go
import plotly.offline as pl
pl.init_notebook_mode(connected=True)

Next, we need to authenticate into Twitter.

In [None]:
# Authenticate into Twitter
consumer_key = 'CONSUMER_KEY'
consumer_secret = 'CONSUMER_SECRET'
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)

## Getting someone's tweets

Let's grab some information about the Twitter user we want to study.

In [None]:
# Enter Twitter user name
username = 'taylorswift13'

In [None]:
# Get information about this Twitter user
user = api.get_user(screen_name=username)
print('Name: {0}'.format(user.name))
print('Location: {0}'.format(user.location))
IPython.display.Image(user.profile_image_url)

Let's grab this user's last 500 tweets.

In [None]:
# Get user's last 500 tweets
public_tweets = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name=username).items(500):
    public_tweets.append(tweet)

Just to make sure we're doing this right, let's examine this user's last 10 tweets.

In [None]:
# Print user's last 10 tweets: date/time, text
for tweet in public_tweets[:10]:
    print("{0} {1}".format(tweet.created_at, tweet.text))

OK, looks good! Now let's create a list of just the tweet arrival times.

In [None]:
# Grab just the arrival times
arrival_times = []
for tweet in public_tweets:
    arrival_times.append(tweet.created_at)

Next, we can compute the interarrival times by

* sorting the arrival times, and then 
* computing the difference in consecutive arrival times.

The differences between consecutive arrival times are measured in seconds, so we divide them by $60 \times 60$ to obtain interarrival times in hours.

In [None]:
# Sort arrival times
arrival_times.sort()

# Compute interarrival times in hours
interarrival_times = []
for a, b in zip(arrival_times, arrival_times[1:]):
    # total_seconds() includes whole days; .seconds would silently drop them
    interarrival_times.append((b - a).total_seconds() / (60 * 60))
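
As a small worked example of the sort-then-difference steps, here is a sketch on three made-up timestamps. The dates and the `example_` names are hypothetical and chosen so that one gap is longer than a day.

```python
from datetime import datetime

# Hypothetical arrival times, deliberately out of order
example_arrivals = [
    datetime(2016, 10, 3, 9, 0, 0),
    datetime(2016, 10, 1, 12, 0, 0),
    datetime(2016, 10, 1, 13, 30, 0),
]
example_arrivals.sort()

# Differences of consecutive arrivals, converted to hours.
# total_seconds() counts whole days too, whereas the .seconds
# attribute holds only the leftover seconds within the last day.
example_gaps = [
    (b - a).total_seconds() / (60 * 60)
    for a, b in zip(example_arrivals, example_arrivals[1:])
]
print(example_gaps)  # [1.5, 43.5]
```

The second gap spans one day and 19.5 hours, which is why the full duration has to be used when a user goes more than a day between tweets.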

Another sanity check: do the interarrival times look reasonable? Let's print out the first 10:

In [None]:
print(interarrival_times[:10])

The exponential distribution has a single parameter, the rate $\lambda$. The maximum likelihood estimator of $\lambda$ is the reciprocal of the sample mean interarrival time, which we call the mean interarrival rate. Let's compute this next.

In [None]:
# Mean interarrival time
mean_interarrival_time = np.mean(interarrival_times)
print("Mean interarrival time: {0} hours per tweet".format(mean_interarrival_time))

# Mean interarrival rate
mean_interarrival_rate = 1 / mean_interarrival_time
print("Mean interarrival rate: {0} tweets per hour".format(mean_interarrival_rate))
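
For reference, the estimate computed above can be derived as follows, assuming the interarrival times $x_1, \dots, x_n$ are independent draws from an exponential distribution with rate $\lambda$. The symbols $L$, $x_i$, $\bar{x}$, and $\hat{\lambda}$ are notation introduced here for the sketch:

$$
\begin{aligned}
L(\lambda) &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_i}, \\
\ln L(\lambda) &= n \ln \lambda - \lambda \sum_{i=1}^{n} x_i, \\
\frac{d}{d\lambda} \ln L(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}.
\end{aligned}
$$

Here $\bar{x}$ is the sample mean interarrival time, so $\hat{\lambda}$ is its reciprocal.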

Let's see if the histogram of interarrival times looks like the pdf of the exponential distribution, using the mean interarrival rate as an estimate for $\lambda$:

In [None]:
# Create histogram trace
hist_trace = go.Histogram(x=interarrival_times, histnorm='probability density')

# Create pdf trace
x_max = max(interarrival_times)
x_range = np.arange(0, x_max, x_max / 1000)
pdf = [stats.expon.pdf(x, scale=mean_interarrival_time) for x in x_range]
pdf_trace = go.Scatter(x=x_range, y=pdf)

# Plot the histogram and pdf
data = [hist_trace, pdf_trace]
layout = go.Layout(
    title='Histogram and theoretical pdf',
    xaxis=dict(title='interarrival time (hours)'),
    yaxis=dict(title='frequency/density'),
    showlegend=False
)
fig = go.Figure(data=data, layout=layout)
pl.iplot(fig)

__What do you think? Do you think the interarrival times are from an exponential distribution?__

Ideally, we would perform some goodness-of-fit tests to statistically determine whether the exponential distribution is a good fit for the interarrival times.
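
One possible goodness-of-fit check is the Kolmogorov-Smirnov test, available as `scipy.stats.kstest`. The sketch below runs it on synthetic stand-in data so that it is self-contained; the `sample` array, its scale of 3 hours, and the seed are assumptions for illustration. In this notebook, `interarrival_times` would take the place of `sample`.

```python
import numpy as np
import scipy.stats as stats

# Stand-in data (hypothetical): in the notebook, use interarrival_times instead
rng = np.random.RandomState(1)
sample = rng.exponential(scale=3.0, size=500)

# Estimate the exponential scale (the mean) by maximum likelihood
scale_hat = np.mean(sample)

# Kolmogorov-Smirnov test against an exponential with loc=0, scale=scale_hat.
# Caveat: because scale_hat is estimated from the same sample, the reported
# p-value is only approximate and tends to be too large.
ks_statistic, p_value = stats.kstest(sample, 'expon', args=(0, scale_hat))
print("KS statistic: {0:.4f}, p-value: {1:.4f}".format(ks_statistic, p_value))
```

A small p-value would be evidence against the exponential fit. A large one does not prove the fit is correct, especially given the caveat about the estimated parameter.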

We also need to check independence of the interarrival times. One easy visual test is to plot the interarrival times as a time series:

In [None]:
# Plot interarrival times as time series
trace = go.Bar(y=interarrival_times)
data = [trace]
layout = go.Layout(
    title='Interarrival times',
    xaxis=dict(title='arrival (tweet)'),
    yaxis=dict(title='interarrival time (hours)'),
    showlegend=False
)
fig = go.Figure(data=data, layout=layout)
pl.iplot(fig)

__What do you think? Are the interarrival times independent?__
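
A rough numerical companion to the visual check is the lag-1 autocorrelation: the sample correlation between each interarrival time and the next one. The helper `lag1_autocorrelation` and the synthetic `independent_gaps` data below are hypothetical names and values introduced for this sketch. In this notebook, the function would be applied to `interarrival_times`.

```python
import numpy as np

def lag1_autocorrelation(values):
    """Sample correlation between consecutive values."""
    values = np.asarray(values, dtype=float)
    return np.corrcoef(values[:-1], values[1:])[0, 1]

# Stand-in data (hypothetical): independent exponential draws
rng = np.random.RandomState(2)
independent_gaps = rng.exponential(scale=3.0, size=500)

print("Lag-1 autocorrelation: {0:.3f}".format(
    lag1_autocorrelation(independent_gaps)))
```

For independent data, this value should be close to zero, roughly within $\pm 2 / \sqrt{n}$ for $n$ observations. A value near zero is consistent with independence but does not establish it, since dependence can also show up at longer lags or in nonlinear ways.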

# #poisson?

We can also do the same thing with hashtags. Let's grab the last 500 tweets with a certain hashtag.

In [None]:
# Enter hashtag to search
search_text = "#apple"

In [None]:
# Get last 500 tweets with this hashtag
cursor = tweepy.Cursor(api.search, q=search_text)
hashtag_tweets = []
for tweet in cursor.items(500):
    hashtag_tweets.append(tweet)

In [None]:
# Print last 10 tweets with this hashtag: date/time, text
for tweet in hashtag_tweets[:10]:
    print("{0} {1}".format(tweet.created_at, tweet.text))

Now, we can go through the same process as we did above.

Since hashtags appear more frequently than one user's tweets, let's change the time scale to minutes instead of hours.

In [None]:
# Grab just the arrival times
ht_arrival_times = []
for tweet in hashtag_tweets:
    ht_arrival_times.append(tweet.created_at)

# Sort arrival times
ht_arrival_times.sort()

# Compute interarrival times in minutes
ht_interarrival_times = []
for a, b in zip(ht_arrival_times, ht_arrival_times[1:]):
    # total_seconds() includes whole days; .seconds would silently drop them
    ht_interarrival_times.append((b - a).total_seconds() / 60)
    
# Mean interarrival time
ht_mean_interarrival_time = np.mean(ht_interarrival_times)
print("Mean interarrival time: {0} minutes per tweet".format(ht_mean_interarrival_time))

# Mean interarrival rate
ht_mean_interarrival_rate = 1 / ht_mean_interarrival_time
print("Mean interarrival rate: {0} tweets per minute".format(ht_mean_interarrival_rate))

# Create histogram trace
ht_hist_trace = go.Histogram(x=ht_interarrival_times, histnorm='probability density')

# Create pdf trace
ht_x_max = max(ht_interarrival_times)
ht_x_range = np.arange(0, ht_x_max, ht_x_max / 1000)
ht_pdf = [stats.expon.pdf(x, scale=ht_mean_interarrival_time) for x in ht_x_range]
ht_pdf_trace = go.Scatter(x=ht_x_range, y=ht_pdf)

# Plot the histogram and pdf
data = [ht_hist_trace, ht_pdf_trace]
layout = go.Layout(
    title='Histogram and theoretical pdf',
    xaxis=dict(title='interarrival time (minutes)'),
    yaxis=dict(title='frequency/density'),
    showlegend=False
)
fig = go.Figure(data=data, layout=layout)
pl.iplot(fig)

# Plot interarrival times as time series
trace = go.Bar(y=ht_interarrival_times)
data = [trace]
layout = go.Layout(
    title='Interarrival times',
    xaxis=dict(title='arrival (tweet)'),
    yaxis=dict(title='interarrival time (minutes)'),
    showlegend=False
)
fig = go.Figure(data=data, layout=layout)
pl.iplot(fig)

__What do you think? Are the interarrival times from an exponential distribution? Are the interarrival times independent?__