# Naïve Bayes and Confusion Matrices

In [3]:
'''
Name:
ISTA 331
Date:
Collaborator(s):
'''
import pandas as pd

Let's do a simple naive Bayes calculation. Consider the following data matrix. The target variable $y$ is 1 if approving the loan was a good decision and 0 otherwise.

| Index | age | income | gender | **y** |
|---|---|---|---|---|
| 0 | >= 40 | > 75 | M | 1 |
| 1 | < 40 | 50-75 | F | 1 |
| 2 | < 40 | < 50 | M | 0 |
| 3 | >= 40 | 50-75 | F | 1 |
| 4 | >= 40 | 50-75 | M | 1 |
| 5 | < 40 | < 50 | F | 0 |
| 6 | >= 40 | 50-75 | M | 0 |
| 7 | < 40 | 50-75 | F | 1 |
| 8 | >= 40 | 50-75 | M | 0 |
| 9 | < 40 | > 75 | F | 1 |

We have three new applications, described in the following table:

| age | income | gender |
|---|---|---|
| >= 40 | 50-75 | M |
| >= 40 | 50-75 | F |
| < 40 | 50-75 | M |

Based on naive Bayes, should we approve these loan applications? Let's break this down into steps.

In [4]:
ages = ['>= 40', '< 40', '< 40', '>= 40', '>= 40', '< 40', '>= 40', '< 40', '>= 40', '< 40']
incomes = ['> 75', '50-75', '< 50','50-75','50-75', '< 50', '50-75', '50-75', '50-75', '> 75']
genders = ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
outcomes = [1,1,0,1,1,0,0,1,0,1]
df = pd.DataFrame({'age': ages, 'income':incomes, 'gender':genders, 'y':outcomes})
df

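| | age | income | gender | y |
|---|---|---|---|---|
| 0 | >= 40 | > 75 | M | 1 |
| 1 | < 40 | 50-75 | F | 1 |
| 2 | < 40 | < 50 | M | 0 |
| 3 | >= 40 | 50-75 | F | 1 |
| 4 | >= 40 | 50-75 | M | 1 |
| 5 | < 40 | < 50 | F | 0 |
| 6 | >= 40 | 50-75 | M | 0 |
| 7 | < 40 | 50-75 | F | 1 |
| 8 | >= 40 | 50-75 | M | 0 |
| 9 | < 40 | > 75 | F | 1 |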


1. Write a function called `get_single_likelihood` which implements the calculation of $P(X_i = x | Y = y)$. Given a feature name, a value of that feature, an outcome value (either 0 or 1), and a data frame, calculate the probability that, given that outcome, the named feature will equal the given value.

    Example: the call `get_single_likelihood('gender', 'F', 1, df)` should return the probability that, among applicants for whom `y == 1`, their `gender` attribute is `F`.

In [7]:
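def get_single_likelihood(feature_name, value, outcome, df):
    # Estimate P(feature_name == value | y == outcome) from relative frequencies.
    # Count rows where the feature has the given value AND y equals the outcome.
    freq_x_and_y = len(df[(df[feature_name] == value) & (df['y'] == outcome)])
    # Count all rows with this outcome (the conditioning event).
    freq_y = len(df[df['y'] == outcome])
    return freq_x_and_y / freq_y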
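
As a quick sanity check, the example call from the prompt can be verified by hand: six applicants have `y == 1`, and four of those have `gender == 'F'`, so the result should be 4/6. The sketch below rebuilds the data frame so that it runs on its own. The second computation, which takes the mean of a boolean mask, is an alternative formulation added here for comparison and is not part of the assignment.

```python
import pandas as pd

# Rebuild the training data from the table above.
df = pd.DataFrame({
    'age': ['>= 40', '< 40', '< 40', '>= 40', '>= 40',
            '< 40', '>= 40', '< 40', '>= 40', '< 40'],
    'income': ['> 75', '50-75', '< 50', '50-75', '50-75',
               '< 50', '50-75', '50-75', '50-75', '> 75'],
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'y': [1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})

def get_single_likelihood(feature_name, value, outcome, df):
    freq_x_and_y = len(df[(df[feature_name] == value) & (df['y'] == outcome)])
    freq_y = len(df[df['y'] == outcome])
    return freq_x_and_y / freq_y

# P(gender = F | y = 1): 4 of the 6 rows with y == 1 are 'F'.
p_f_given_1 = get_single_likelihood('gender', 'F', 1, df)

# Alternative (assumption, not from the assignment): restrict to y == 1,
# then take the fraction of rows whose gender is 'F'.
p_alt = (df.loc[df['y'] == 1, 'gender'] == 'F').mean()

print(p_f_given_1, p_alt)
```

Both expressions estimate the same conditional relative frequency, so they should agree.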

2. Write a function called `get_joint_likelihood` which implements the calculation of $P(X_0 = x_0, X_1 = x_1, \ldots | Y = y)$. Given a `Series` representing a feature vector, an outcome value (0 or 1), and a data frame, calculate the probability that, given that outcome, all features will equal the values in the feature vector. (Hint: loop over `features.index`, which contains the feature names, and call `get_single_likelihood`. Multiply together the values you get and return the product.)

In [8]:
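def get_joint_likelihood(features, outcome, df):
    # Naive Bayes treats features as conditionally independent given the outcome,
    # so the joint likelihood is the product of the single-feature likelihoods.
    joint_likelihood = 1
    for feature in features.index:
        joint_likelihood *= get_single_likelihood(feature, features.loc[feature], outcome, df)
    return joint_likelihood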
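
The product in the hint comes from the naive Bayes assumption that the features are conditionally independent given the outcome, which lets the joint likelihood factor as

$$ P(X_0 = x_0, X_1 = x_1, \ldots | Y = y) = \prod_i P(X_i = x_i | Y = y) $$

As a worked illustration, take the second new application (age >= 40, income 50-75, gender F) with $y = 1$. Counting in the data matrix above, 3 of the 6 rows with $y = 1$ have age >= 40, 4 of 6 have income 50-75, and 4 of 6 have gender F, so

$$ P(\pmb X = \pmb x | Y = 1) = \frac{3}{6} \cdot \frac{4}{6} \cdot \frac{4}{6} = \frac{2}{9} $$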

3. Write a function called `get_prior` which calculates $P(Y = y)$. That is, given an outcome value (0 or 1) and a data frame, `get_prior(outcome, df)` returns the estimated probability in that data frame that `y == outcome`.

In [13]:
def get_prior(outcome, df):
    return len(df[df['y'] == outcome]) / len(df)
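
For the data above, six of the ten outcomes are 1, so the priors should come out to 0.6 and 0.4. The following sketch checks this. The comparison against `value_counts(normalize=True)` is an addition for illustration, assuming only the `y` column is needed.

```python
import pandas as pd

# Only the outcome column matters for the prior.
df = pd.DataFrame({'y': [1, 1, 0, 1, 1, 0, 0, 1, 0, 1]})

def get_prior(outcome, df):
    return len(df[df['y'] == outcome]) / len(df)

# Prior for each outcome value.
priors = {outcome: get_prior(outcome, df) for outcome in (0, 1)}

# pandas can compute the same relative frequencies in one call.
relative_freqs = df['y'].value_counts(normalize=True)

print(priors)
print(relative_freqs)
```

Because every row has exactly one outcome, the two priors should sum to 1.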

4. Finally, put all the pieces of Bayes' theorem together. Write a function called `get_prob` that takes a new feature vector (in the form of a `Series`) and returns the estimated probability

$$ P(Y = 1 | \pmb X = \pmb x) $$

where the vector equation $\pmb X = \pmb x$ stands for $X_0 = x_0, X_1 = x_1, \ldots$. Remember this is calculated by Bayes' theorem as

$$ P(Y = 1 | \pmb X = \pmb x) = \frac{P(\pmb X = \pmb x | Y = 1) P(Y = 1)}{ P(\pmb X = \pmb x | Y = 1) P(Y = 1) + P(\pmb X = \pmb x | Y = 0) P(Y = 0) } $$

and everything on the right-hand side can be calculated by either `get_prior` or `get_joint_likelihood`.

In [14]:
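def get_prob(features, df):
    # Likelihood and prior for the "good decision" outcome (y == 1).
    likelihood_y1 = get_joint_likelihood(features, 1, df)
    prior_y1 = get_prior(1, df)
    # Likelihood and prior for the "bad decision" outcome (y == 0).
    likelihood_y0 = get_joint_likelihood(features, 0, df)
    prior_y0 = get_prior(0, df)
    # Bayes' theorem: normalize the y == 1 term by the sum over both outcomes.
    return likelihood_y1 * prior_y1 / (likelihood_y1 * prior_y1 + likelihood_y0 * prior_y0)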

In [15]:
applications = pd.DataFrame({'age':['>= 40', '>= 40', '< 40'], 'income': ['50-75', '50-75', '50-75'], 'gender': ['M', 'F', 'M']})
get_prob(applications.iloc[1], df)

0.8421052631578948
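
The cell above scores only the second application. To address the original question for all three, one possible approach is to score each row and compare against a cutoff. The sketch below is self-contained. The 0.5 threshold and the names `probs` and `decisions` are assumptions for illustration, since the assignment does not specify a decision rule.

```python
import pandas as pd

df = pd.DataFrame({
    'age': ['>= 40', '< 40', '< 40', '>= 40', '>= 40',
            '< 40', '>= 40', '< 40', '>= 40', '< 40'],
    'income': ['> 75', '50-75', '< 50', '50-75', '50-75',
               '< 50', '50-75', '50-75', '50-75', '> 75'],
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'y': [1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})

def get_single_likelihood(feature_name, value, outcome, df):
    freq_x_and_y = len(df[(df[feature_name] == value) & (df['y'] == outcome)])
    freq_y = len(df[df['y'] == outcome])
    return freq_x_and_y / freq_y

def get_joint_likelihood(features, outcome, df):
    joint_likelihood = 1
    for feature in features.index:
        joint_likelihood *= get_single_likelihood(feature, features.loc[feature], outcome, df)
    return joint_likelihood

def get_prior(outcome, df):
    return len(df[df['y'] == outcome]) / len(df)

def get_prob(features, df):
    numerator = get_joint_likelihood(features, 1, df) * get_prior(1, df)
    other = get_joint_likelihood(features, 0, df) * get_prior(0, df)
    return numerator / (numerator + other)

applications = pd.DataFrame({
    'age': ['>= 40', '>= 40', '< 40'],
    'income': ['50-75', '50-75', '50-75'],
    'gender': ['M', 'F', 'M'],
})

# Score every application; each row is passed as a Series of feature values.
probs = [get_prob(applications.iloc[i], df) for i in range(len(applications))]

# Hypothetical decision rule: approve when P(y = 1 | x) exceeds 0.5.
THRESHOLD = 0.5
decisions = [p > THRESHOLD for p in probs]

for i, (p, ok) in enumerate(zip(probs, decisions)):
    print(f"application {i}: P(y=1 | x) = {p:.4f}, approve = {ok}")
```

Under this assumed cutoff, only the second application (probability about 0.84) would be approved. The first and third both come out to 8/17, about 0.47, which falls just below 0.5.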

6. In the U.S., prevalence of HIV (according to the web) is **0.0033 cases per capita**. Statistics about accuracy of HIV tests for cases outside the 3-month window after initial infection seem to all be from companies selling the tests (and are therefore immediately suspect, in my mind). Here is a typical number: assume an HIV test is **99.68% accurate (and assume the same accuracy for both positive and negative cases)**. Construct a confusion matrix with the numbers you would expect for **100000** people who received this test given the above statistics. Call cases of infection positive, healthy negative.

Use this matrix to estimate the probability that a person who receives a positive test actually has HIV.  

Estimate the probability that a person who has HIV receives a negative test.
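
One way to set up the expected counts is sketched below. It reads "99.68% accurate for both positive and negative cases" as sensitivity = specificity = 0.9968, which is an interpretation of the prompt, and it keeps the expected counts fractional rather than rounding to whole people. The variable names are illustrative.

```python
import pandas as pd

population = 100_000
prevalence = 0.0033   # cases per capita, from the prompt
accuracy = 0.9968     # assumed to apply to both infected and healthy people

# Expected numbers of infected (positive) and healthy (negative) people.
actual_pos = population * prevalence
actual_neg = population - actual_pos

# Split each group into correct and incorrect test results.
true_pos = actual_pos * accuracy
false_neg = actual_pos - true_pos
true_neg = actual_neg * accuracy
false_pos = actual_neg - true_neg

# Confusion matrix: rows are the actual condition, columns the test result.
confusion = pd.DataFrame(
    [[true_pos, false_neg], [false_pos, true_neg]],
    index=['actual positive', 'actual negative'],
    columns=['test positive', 'test negative'],
)
print(confusion.round(1))

# P(HIV | positive test): true positives among all positive tests.
p_hiv_given_pos = true_pos / (true_pos + false_pos)

# P(negative test | HIV): false negatives among all infected people.
p_neg_given_hiv = false_neg / actual_pos

print(f"P(HIV | positive test) = {p_hiv_given_pos:.4f}")
print(f"P(negative test | HIV) = {p_neg_given_hiv:.4f}")
```

Under these assumptions, roughly 329 true positives sit alongside roughly 319 false positives, so a positive test corresponds to infection only about half the time (about 0.51) even though the test is highly accurate. The probability of a negative test given infection is simply 1 minus the accuracy, 0.0032.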