In [None]:
# from google.colab import drive
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.special import xlogy
import plotly.express as px

In [None]:
# drive.mount('/content/gdrive')

In [None]:
# df = pd.read_parquet('gdrive/My Drive/Colab Notebooks/reddit_calculus.parquet')
train = pd.read_parquet('/Users/paul/data/reddit/reddit_math_ds_train.parquet').reset_index(drop=True)
test = pd.read_parquet('/Users/paul/data/reddit/reddit_math_ds_test.parquet').reset_index(drop=True)

In [None]:
train

In [None]:
cv = CountVectorizer(stop_words='english', max_features=250)
count_model = cv.fit(train['text'])
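# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
train_counts = pd.DataFrame(cv.transform(train['text']).toarray(), columns=cv.get_feature_names_out())
test_counts = pd.DataFrame(cv.transform(test['text']).toarray(), columns=cv.get_feature_names_out())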

## Decision tree classifiers

So far we have used information theory to measure the relevance of an input feature to a target variable.
With a little nudge we can use this to create a simple but powerful machine learning classifier called a _decision tree_.

### Warmup
The starting point is to view a single feature as a simple classifier.
In the dataset above, we can use the presence or absence of a given term to try to predict the subreddit label.
To train the classifier, we just have to decide which label we should apply if the term is present.
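
As a minimal sketch of what "training" means here, the cell below fits a one-term classifier on a small made-up table rather than the real dataset. The names `toy`, `has_term`, and `fit_one_term` are hypothetical, and predicting the majority label when the term is absent is an assumption, not something fixed by the text above.

```python
import pandas as pd

# Made-up data: whether a hypothetical term is present, and the post's label.
toy = pd.DataFrame({
    'has_term': [True, True, True, False, False, False, False, False],
    'subreddit': ['datascience', 'datascience', 'math',
                  'math', 'math', 'datascience', 'math', 'math'],
})

def fit_one_term(has_term, labels):
    """Pick the majority label among rows with the term and among rows without it."""
    return {
        True: labels[has_term].mode().iloc[0],
        False: labels[~has_term].mode().iloc[0],
    }

rule = fit_one_term(toy['has_term'], toy['subreddit'])

# Apply the rule and measure accuracy (here on the same toy rows, for illustration only).
predictions = toy['has_term'].map(rule)
accuracy = (predictions == toy['subreddit']).mean()
print(rule, accuracy)
```

On the real data the same idea would use a column of `train_counts` compared with zero in place of `has_term`, with accuracy measured on the test set.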

In [None]:
pd.crosstab(train_counts['engineering'] != 0, train['subreddit'])

In [None]:
pd.crosstab(train_counts['number'] != 0, train['subreddit'])

#### Exercises
1. Pick a term, and decide whether you would predict the subreddit "math" or "datascience" if that term is present.
2. Use the test set to measure the accuracy of this classifier.
3. Compute the mutual information of the term you chose and the subreddit label.
4. Repeat the above for lots of terms.  What is the relationship between mutual information and classifier accuracy?

### Ensembling

The classifiers above have no hope of being very accurate because no single term appears in most of the posts.
To build a better classifier we'll need to ensemble them, so that the classifier considers more evidence.

A decision tree does this by assembling features into a chain of if/then statements; for instance:

- If "number" is present, output "math"
- If "number" is not present:
    - If "engineering" is present, output "datascience"
    - If "engineering" is not present, output "math"

Here we split the "number is not present" condition on the "engineering" feature; this allows our model to tell a story like "If the post doesn't contain the term 'number' and it does contain 'engineering' then it's probably in the datascience subreddit".
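
The if/then chain above translates almost directly into code. The sketch below writes it as a plain Python function and applies it to a few made-up rows of term counts; `predict_tree` and `toy_counts` are hypothetical names, and the counts are invented for illustration.

```python
import pandas as pd

def predict_tree(counts):
    """Apply the two-level tree to one row of term counts."""
    if counts['number'] != 0:
        return 'math'
    # "number" is absent, so split on "engineering".
    if counts['engineering'] != 0:
        return 'datascience'
    return 'math'

# Made-up term counts for four posts.
toy_counts = pd.DataFrame({
    'number': [2, 0, 0, 1],
    'engineering': [0, 1, 0, 3],
})

toy_predictions = toy_counts.apply(predict_tree, axis=1)
print(toy_predictions.tolist())
```

Note that the last row contains both terms and is labeled "math", because the tree checks "number" first and never looks at "engineering" on that branch.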

We could of course keep splitting, adding in more and more features as we go.
We can also split on the "number is present" branch of the tree, though that branch doesn't have much data so we'll get diminishing returns.

In [None]:
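# Restrict to posts without "number", then cross-tabulate "engineering" against the label.
not_number = train_counts[train_counts['number'] == 0]
pd.crosstab(not_number['engineering'] != 0, train.loc[not_number.index, 'subreddit'])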

#### Exercises
1. Compute the accuracy of the classifier described above on the test set.
2. Create your own decision tree, and calculate its accuracy.  How high can you get?

### Information gain

Choosing splits by hand is neither scalable nor particularly principled, but we can use information theory to do better.

Idea: given a tree $T$, choose the feature $F$ whose split maximizes the information gain:

$$I(T,F) = H(T) - H(T \vert F)$$
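
One way to make this formula concrete is sketched below. It assumes that $H(T)$ is the entropy (in bits) of the labels reaching the node being split, and that $H(T \vert F)$ is the average entropy of the labels after splitting on $F$, weighted by the fraction of rows in each branch. The helper names `entropy` and `information_gain` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.special import xlogy

def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    p = labels.value_counts(normalize=True).to_numpy()
    return -xlogy(p, p).sum() / np.log(2)

def information_gain(feature, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    conditional = sum(
        (feature == value).mean() * entropy(labels[feature == value])
        for value in feature.unique()
    )
    return entropy(labels) - conditional

# Made-up labels and two made-up binary features.
labels = pd.Series(['datascience', 'datascience', 'math', 'math'])
perfect = pd.Series([True, True, False, False])   # separates the labels exactly
useless = pd.Series([True, False, True, False])   # tells us nothing about the label

gain_perfect = information_gain(perfect, labels)
gain_useless = information_gain(useless, labels)
print(gain_perfect, gain_useless)
```

A feature that separates the labels exactly recovers the full entropy of the node, while a feature independent of the labels gains nothing.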

Next time: how and why do we maximize information gain?