In [None]:
from bokeh.plotting import figure
from bokeh.io import output_notebook,show
import numpy as np
output_notebook()

# Bayes' Theorem Mini-Lab

This lab is a chance to work with Bayes' Theorem.  The underlying dataset is a collection of SMS (text) messages
that were labelled as either 'junk' or 'real' as part of an attempt to build a classifier that could filter out
junk text messages.

The full dataset is in the 'complete.tsv' file -- this is a "tab-separated" file, rather than a "comma-separated"
file.  But we won't be using this file directly.  Instead, we will just work with the simpler file 'data.csv'
which is a comma separated file with two columns.  The first column is 0 or 1 depending on whether the corresponding
text is real (0) or junk (1).  The second column is the length of the associated text message.

In [None]:
data = np.genfromtxt('data.csv',delimiter=',',skip_header=1)

There are 5572 messages in the data.  

In [None]:
data.shape

Let's separate out the junk and real messages to compare them.  One way to do that is to create
an index array.  This command creates an array of True/False values based on whether that condition
is true row-by-row.

In [None]:
data[:,0]==0
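
As a small illustration of how such an index array behaves, here is a sketch on a made-up five-row array (the values in `toy` are hypothetical and are not taken from data.csv). It shows the True/False array itself, the rows it selects, and the fact that summing it counts the matching rows.

```python
import numpy as np

# Hypothetical toy data in the same layout as data.csv: [label, length]
toy = np.array([[0, 25], [0, 40], [1, 150], [0, 12], [1, 98]])

mask = toy[:, 0] == 0      # True where the label is 0 (real)
real_rows = toy[mask, :]   # keep only the rows where mask is True

print(mask)
print(real_rows)
print(mask.sum())          # True counts as 1, so this counts the real rows
```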

Notice that the first two entries in data are real and the third is junk:

In [None]:
data[:3,:]

Now we use our index array to extract the real rows.

In [None]:
real = data[data[:,0]==0,:]

In [None]:
real

There are 4825 real messages in the dataset.

In [None]:
real.shape

The average length of the real messages is computed like this:

In [None]:
real[:,1].mean(axis=0)

Now use a similar strategy to extract the junk rows and compute the mean length of the junk messages.
What do you notice?

In [None]:
junk = data[data[:,0]==1,:]
print(junk.shape)
print('Mean Length of Junk Messages=',junk[:,1].mean(axis=0))

One way to use this information about the lengths is to set a threshold value of, say, 100 characters, and
divide the messages into "long" and "short" messages using this threshold.  It seems that long messages
are more likely to be junk.  We can use Bayes' Theorem to try to quantify this.

Think of checking the length of a message like administering a test.  Getting a positive result -- finding a long
message -- should increase the odds that our message is junk.  

From the point of view of Bayes' Theorem, we are interested in

$$P(\text{junk} \mid \text{long})$$

which we can compute as

$$
P(\text{junk} \mid \text{long}) = \frac{P(\text{long} \mid \text{junk})\,P(\text{junk})}{P(\text{long})}
$$

And while we don't really know these probabilities, we can estimate them by looking at the frequency counts in our data.  (Classifiers built this way from several features at once are called "Naive" Bayes classifiers; the name refers to the naive assumption that the features are independent given the class, not to the use of observed frequencies as probability estimates.)
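
To see how the frequency estimates fit together, here is a minimal sketch using made-up counts (the four numbers below are hypothetical, not the counts from this dataset). It estimates each term on the right-hand side of Bayes' Theorem from the counts and checks that the result agrees with the fraction of long messages that are junk.

```python
# Hypothetical counts, chosen only for illustration
junk_long_toy, junk_short_toy = 30, 20
real_long_toy, real_short_toy = 10, 140

total = junk_long_toy + junk_short_toy + real_long_toy + real_short_toy
n_junk = junk_long_toy + junk_short_toy
n_long = junk_long_toy + real_long_toy

# Frequency estimates of the three terms in Bayes' Theorem
p_long_given_junk = junk_long_toy / n_junk
p_junk = n_junk / total
p_long = n_long / total

p_junk_given_long = p_long_given_junk * p_junk / p_long

# The same quantity read directly from the "long" column of the table
p_direct = junk_long_toy / n_long

print(p_junk_given_long, p_direct)
```

The two printed values agree up to floating-point rounding, because the totals cancel when the three frequency estimates are combined.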

To get started, we need a $2 \times 2$ table of counts like this:

|  | long  | short  | total |
|---|---|---|---|
| junk  |   |   |   |
| real  |    |   |   |
| total |     |   |  |

from which we can compute the conditional probabilities.

These statements compute the number of messages in the (junk, long) and (junk, short) cells, with a threshold of 100 characters defining "long".

In [None]:
junk_long = junk[junk[:,1]>=100].shape[0]
junk_short=junk[junk[:,1]<100].shape[0]

In [None]:
#real_long =
#real_short =

The conditional probability P(junk|long) is the fraction of long texts that are junk.

In [None]:
# P(junk|long) = 

The probability of being junk unconditionally is about 13%.

In [None]:
(junk_long+junk_short)/5572

This function computes the conditional probability as a function of a threshold, which can vary.

In [None]:
def cp(threshold):
    # Estimate P(junk|long), where "long" means length >= threshold
    junk_long = junk[junk[:,1]>=threshold].shape[0]
    real_long = real[real[:,1]>=threshold].shape[0]
    return junk_long/(junk_long+real_long)
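
One thing to watch for: if the threshold is larger than every message length, both counts are zero and the division fails. The sketch below is one possible guarded variant, not part of the lab itself; the name `cp_safe` and the toy arrays are hypothetical, and returning NaN in the empty case is just one reasonable choice.

```python
import numpy as np

# Hypothetical toy arrays in the same [label, length] layout as junk and real
toy_junk = np.array([[1, 150], [1, 98], [1, 160]])
toy_real = np.array([[0, 25], [0, 40], [0, 120], [0, 12]])

def cp_safe(threshold, junk_rows, real_rows):
    """Estimate P(junk | length >= threshold); NaN if no message is that long."""
    junk_long = np.count_nonzero(junk_rows[:, 1] >= threshold)
    real_long = np.count_nonzero(real_rows[:, 1] >= threshold)
    total_long = junk_long + real_long
    if total_long == 0:
        return float("nan")   # the conditional probability is undefined here
    return junk_long / total_long

print(cp_safe(100, toy_junk, toy_real))   # two of the three long toy messages are junk
print(cp_safe(500, toy_junk, toy_real))   # no toy message is this long
```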

Setting the threshold to 130 makes P(junk|long) maximal. 

In [None]:
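x=np.arange(200)
y=np.array([cp(i) for i in x])
# Plot the estimated P(junk|long) against the length threshold
f=figure(x_axis_label='threshold (characters)',y_axis_label='P(junk|long)')
f.line(x=x,y=y)
show(f)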
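
Rather than reading the maximum off the plot, the maximizing threshold can be located numerically. The sketch below uses short made-up arrays (`thresholds` and `probs` are hypothetical stand-ins for the `x` and `y` arrays above) and `np.argmax`, which returns the index of the first largest value.

```python
import numpy as np

# Hypothetical stand-ins for the x and y arrays plotted above
thresholds = np.arange(5)
probs = np.array([0.10, 0.35, 0.60, 0.55, 0.20])

best = np.argmax(probs)   # index of the largest estimate (first one if tied)
print(thresholds[best], probs[best])
```

Applied to the real arrays, `x[np.argmax(y)]` would give the threshold at which the plotted curve peaks.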

Of course we are actually interested in detecting *real* messages.  About 87% of our messages are real (4825 of 5572), so if we just say everything is real, we are right about 87% of the time.  Suppose we get a short message.  

**What is the probability that it is real?**