In [None]:
from google.colab import drive
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.special import xlogy
from scipy.stats.contingency import expected_freq
import plotly.express as px

### Exercise 1
We flip a coin that comes up heads with probability $p$ and tails with probability $1-p$.

1. What is the information entropy of the coin?
2. What value of $p$ gives the largest entropy?

In [None]:
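def coin_entropy(p: np.ndarray) -> np.ndarray:
    # Entropy in nats (xlogy uses the natural log); xlogy(0, 0) = 0 handles p = 0 and p = 1.
    return -xlogy(p, p) - xlogy(1-p, 1-p)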

In [None]:
x = np.linspace(0, 1, 1001)  # grid of probabilities including both endpoints, p = 0 and p = 1

In [None]:
px.line(x=x, y=coin_entropy(x), labels={'x': 'probability of heads', 'y': 'entropy (nats)'}).show()
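
The plot suggests where the maximum sits; a minimal numerical check is sketched below. The helper `coin_entropy_bits` and the grid are assumptions introduced for this illustration (the function above reports nats, this one divides by $\ln 2$ to report bits).

```python
import numpy as np
from scipy.special import xlogy


def coin_entropy_bits(p):
    """Hypothetical helper: entropy of a coin with heads-probability p, in bits."""
    p = np.asarray(p, dtype=float)
    # xlogy(0, 0) is defined as 0, so the endpoints p = 0 and p = 1 are safe.
    return -(xlogy(p, p) + xlogy(1 - p, 1 - p)) / np.log(2)


# Evaluate the entropy on a grid and pick the probability with the largest value.
grid = np.linspace(0, 1, 1001)
best_p = grid[np.argmax(coin_entropy_bits(grid))]
best_entropy = coin_entropy_bits(best_p)
print(best_p, best_entropy)
```

A grid search only locates the maximum up to the grid spacing; setting the derivative of the entropy to zero gives the exact value.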

### Exercise 2
Luna and I happen upon a coin. After inspecting it, I believe it is fair, while Luna thinks heads is twice as likely to come up as tails.

We flip the coin 1000 times, and it comes up heads 600 times.
Who is more surprised by this result?

In [None]:
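def coin_relative_entropy(p, q):
    # Relative entropy (KL divergence) D(p || q) in nats between the observed heads
    # frequency p and the believed heads probability q.
    return -xlogy(p, q/p) - xlogy(1-p, (1-q)/(1-p))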

In [None]:
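q_me = 1/2       # I believe the coin is fair
q_luna = 2/3     # Luna: heads twice as likely as tails, so q / (1 - q) = 2
p = 600/1000     # observed frequency of heads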

In [None]:
coin_relative_entropy(p, q_me)

In [None]:
coin_relative_entropy(p, q_luna)
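
One way to read "surprise" is as the negative log-probability each of us assigns to the observed outcome. The sketch below, which assumes that reading and uses `scipy.stats.binom` (not used elsewhere in this notebook), checks that the gap between the two surprises equals the number of flips times the gap between the two relative entropies. The helper `kl_bits` is hypothetical.

```python
import numpy as np
from scipy.stats import binom

n, heads = 1000, 600
q_me, q_luna = 1 / 2, 2 / 3
p = heads / n

# Surprise = negative log-probability of seeing exactly 600 heads, in bits.
surprise_me = -binom.logpmf(heads, n, q_me) / np.log(2)
surprise_luna = -binom.logpmf(heads, n, q_luna) / np.log(2)


def kl_bits(p, q):
    """Hypothetical helper: D(p || q) for two coins, in bits (needs 0 < p, q < 1)."""
    return (p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))) / np.log(2)


# The binomial coefficient cancels in the difference, leaving n times the KL gap.
gap_from_surprise = surprise_me - surprise_luna
gap_from_kl = n * (kl_bits(p, q_me) - kl_bits(p, q_luna))
print(gap_from_surprise, gap_from_kl)
```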

### Exercise 3
Given the following joint distribution of covid status and sentiment:

In [None]:
df = pd.DataFrame(
    [
        (.01, .03, .06),
        (.09, .77, .04)
    ],
    columns=['positive', 'negative', 'neutral'],
    index=['covid', 'no_covid']
)

In [None]:
df

1. Are the covid and sentiment variables independent?
2. What are the marginal density functions of covid and sentiment?
3. What are the various possible conditional density functions?
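
The three questions can be explored with a few pandas operations. This is a minimal sketch, not the only way to do it; the variable names (`joint`, `covid_marginal`, and so on) are chosen here for illustration, and the table is repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd

joint = pd.DataFrame(
    [(.01, .03, .06), (.09, .77, .04)],
    columns=['positive', 'negative', 'neutral'],
    index=['covid', 'no_covid'],
)

# Marginals: sum the joint table over the other variable.
covid_marginal = joint.sum(axis=1)
sentiment_marginal = joint.sum(axis=0)

# Independence holds exactly when the joint equals the product of the marginals.
product_of_marginals = pd.DataFrame(
    np.outer(covid_marginal, sentiment_marginal),
    index=joint.index,
    columns=joint.columns,
)
is_independent = np.allclose(joint, product_of_marginals)

# Conditionals: divide each row (or column) by its marginal.
sentiment_given_covid = joint.div(covid_marginal, axis=0)      # each row sums to 1
covid_given_sentiment = joint.div(sentiment_marginal, axis=1)  # each column sums to 1
print(is_independent)
```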

### Exercise 4

Let's play around with mutual information a bit.
First, we need an implementation that works on a dataframe:

In [None]:
def mutual_information(joint_density: pd.DataFrame) -> float:
    '''
    Input: DataFrame representing the joint density of a pair of random variables
           (rows index one variable, columns the other; entries sum to 1)
    Output: Mutual information of the two random variables, in bits
    '''
    independent_density = pd.DataFrame(
        expected_freq(joint_density),
        columns=joint_density.columns,
        index=joint_density.index
    )
    return -(joint_density * np.log2(independent_density / joint_density)).sum().sum()
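
As a quick sanity check, the sketch below applies the same computation to the table from Exercise 3 and to a table built as the product of its marginals, where the mutual information should vanish. The function is repeated with its import so the cell runs on its own; `covid_sentiment` and `product_table` are names introduced here.

```python
import numpy as np
import pandas as pd
from scipy.stats.contingency import expected_freq


def mutual_information(joint_density: pd.DataFrame) -> float:
    # Same computation as above, repeated so this cell is self-contained.
    independent_density = pd.DataFrame(
        expected_freq(joint_density),
        columns=joint_density.columns,
        index=joint_density.index,
    )
    return -(joint_density * np.log2(independent_density / joint_density)).sum().sum()


covid_sentiment = pd.DataFrame(
    [(.01, .03, .06), (.09, .77, .04)],
    columns=['positive', 'negative', 'neutral'],
    index=['covid', 'no_covid'],
)

# A joint table that is independent by construction: the outer product of the marginals.
product_table = pd.DataFrame(
    np.outer(covid_sentiment.sum(axis=1), covid_sentiment.sum(axis=0)),
    columns=covid_sentiment.columns,
    index=covid_sentiment.index,
)

mi_dependent = mutual_information(covid_sentiment)
mi_independent = mutual_information(product_table)
print(mi_dependent, mi_independent)
```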

Now let's use this to compute the mutual information of some random variables that come from text.
Let's load in some Reddit data (about 10,000 posts):

In [None]:
drive.mount('/content/gdrive')

In [None]:
df = pd.read_parquet('/content/gdrive/My Drive/Colab Notebooks/reddit_calculus.parquet')
# df = pd.read_parquet('/Users/paul/data/reddit/reddit_calculus.parquet')

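We'll play around with the words and phrases appearing in the text of the posts, so we should start by cleaning up and tokenizing that text.
Here is a crude approach: extract the subreddit name from each post's URL, then keep only the 100 most frequent terms that are not English stop words.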

In [None]:
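# Assumes URLs of the form https://<host>/r/<subreddit>/..., so the subreddit is the fifth '/'-separated piece
df['subreddit'] = df['url'].apply(lambda url: url.split('/')[4].lower())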
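
Indexing into `url.split('/')` depends on every URL having the same shape. A slightly more defensive variant is sketched below; `subreddit_from_url` and the example URL are hypothetical and only illustrate the idea, and the real data may need different handling.

```python
from urllib.parse import urlparse


def subreddit_from_url(url: str) -> str:
    """Hypothetical helper: lowercased name following '/r/' in the URL path, or '' if absent."""
    parts = [piece for piece in urlparse(url).path.split('/') if piece]
    if len(parts) >= 2 and parts[0].lower() == 'r':
        return parts[1].lower()
    return ''


# Made-up URL in the shape the positional split expects.
example_url = 'https://www.reddit.com/r/Calculus/comments/abc123/some_title/'
print(subreddit_from_url(example_url))
```

With a DataFrame like the one above, this could be applied as `df['url'].apply(subreddit_from_url)`.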

In [None]:
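# Keep the 100 most frequent terms after removing English stop words
cv = CountVectorizer(stop_words='english', max_features=100)
count_matrix = cv.fit_transform(df['text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_df = pd.DataFrame(count_matrix.toarray(), columns=cv.get_feature_names_out())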

In [None]:
count_df

Compute the mutual information for one of the following pairs of variables (or another pair of your choosing!):
1. Occurrence of two terms of your choosing
2. Occurrence of a term of your choosing in a randomly chosen document
3. Occurrence of terms by subreddit

What is the intuitive meaning of mutual information in each of these examples?

For whichever example you chose, compute all pointwise mutual information scores and all TF-IDF scores.  Which gives a better notion of relevance?