# Structure of the data

In [None]:
!ls ../input

In [None]:
TRAIN_DATA = '../input/en_train.csv'
TEST_DATA = '../input/en_test.csv'

## Columns

In [None]:
import pandas as pd

In [None]:
train_data = pd.read_csv(TRAIN_DATA)
test_data = pd.read_csv(TEST_DATA)

### Columns in training data

1. `sentence_id` - identifies groups of tokens occurring together in sentences

2. `token_id` - marks the position of the token in the sentence corresponding to the given `sentence_id`

3. `class` - semantic type of the token

4. `before` - the token itself

5. `after` - the normalized form of the token

In [None]:
train_data.iloc[:5]

In [None]:
train_data.sample(5)

### Columns in test data

1. `sentence_id`

2. `token_id`

3. `before`

These columns form a subset of the ones present in the training data. We certainly do not get the labels, but we also don't get the semantic hints in the form of `class`.

In [None]:
test_data.iloc[:5]

In [None]:
test_data.sample(5)

## Questions

The samples above naturally suggest some questions:

1. What proportion of tokens in the training set are their own normalizations? How would the trivial normalization perform?

2. What are the different semantic classes represented in the training data?

3. How does the above statistic distribute over the different classes?

4. What is the distribution of classes over the training data?

5. If classes contain information about whether or not to normalize, can semantic class be inferred from things like character distributions, bigram distributions, etc.?

## Answers

### Lazy baseline

Let us start by identifying the rows of the training data in which some non-trivial normalization is required:

In [None]:
nontrivial_train_data = train_data[train_data.before != train_data.after]

The proportion of rows which require normalization is:

In [None]:
proportion_nontrivial = nontrivial_train_data.shape[0]/train_data.shape[0]
print(proportion_nontrivial)

Based on this training data, if we simply "normalized" every token to itself, we should expect an accuracy of:

In [None]:
1 - proportion_nontrivial

This statistic will serve as a sanity check going forward. If we're not doing at least this well, there is something seriously wrong with the approach!
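
As a small illustration of this baseline, the sketch below scores the identity "normalization" on a handful of `(before, after)` pairs. The pairs and the helper name `identity_accuracy` are assumptions made up for the example; they are not taken from the dataset.

```python
# Toy (before, after) pairs; invented for illustration only.
toy_pairs = [
    ("the", "the"),
    ("cat", "cat"),
    ("12", "twelve"),
    (".", "."),
]

def identity_accuracy(pairs):
    """Fraction of pairs for which the token is already its own normalization."""
    correct = sum(before == after for before, after in pairs)
    return correct / len(pairs)

print(identity_accuracy(toy_pairs))
```

Any model that scores below this number on held-out data is doing worse than leaving every token untouched.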

### Semantic classes

A lot of our analysis in this notebook will focus on the semantic classes of tokens in the training data.

Semantic classes are very important because they provide us with invaluable contextual information that we *have* to use in our normalization.

A very simple example is the way alphabetical characters are processed in class `PLAIN` and in class `LETTERS`.

Consider the following examples:

In [None]:
train_data[train_data['class'] == 'PLAIN'].sample(1)

In [None]:
train_data[(train_data['class'] == 'LETTERS')].sample(1)

Without the context provided by semantic class, it would be very hard to tell which type of normalization to apply.

#### Distinct classes

We can simply ask `pandas` to tell us the unique elements of the `class` column of the `train_data` dataframe:

In [None]:
CLASSES = sorted(list(train_data['class'].unique()))
print(CLASSES)

In [None]:
len(CLASSES)

#### Nontriviality

In [None]:
grouped_by_class = train_data.groupby('class')

In [None]:
def proportion_nontrivial(df):
    """
    Args:
    1. df - Dataframe with 'before' and 'after' columns
    
    Returns:
    Proportion of rows in dataframe for which 'before' is not equal to 'after'
    """
    nontrivial = df[df.before != df.after]
    return nontrivial.shape[0]/df.shape[0]

In [None]:
class_nontriviality = [(key, proportion_nontrivial(group)) for key, group in grouped_by_class]

In [None]:
print('Proportion of rows for which nontrivial normalization is required (by class):\n')
for key, s in class_nontriviality:
    print('{} - {}'.format(key, s))

#### Distribution of classes over training data

The previous section suggests that there is a heavily skewed distribution of classes over the training data. Let us verify.

In [None]:
class_weights = [(key, group.shape[0]/train_data.shape[0]) for key, group in grouped_by_class]

In [None]:
sorted_class_weights = sorted(class_weights, key=lambda p: -p[1])

In [None]:
print('Proportion of the training data made up by each class (sorted in descending order of weight):\n')
for key, weight in sorted_class_weights:
    print('{} - {}'.format(key, weight))

Let us verify as a sanity check that this is consistent with the nontriviality of the entire training set:

In [None]:
nontriviality_dict = dict(class_nontriviality)
weight_dict = dict(class_weights)

In [None]:
total_nontriviality = sum([nontriviality_dict[k]*weight_dict[k] for k in nontriviality_dict])

In [None]:
total_nontriviality

In [None]:
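# Compare with a tolerance: the weighted sum accumulates floating-point error,
# so exact equality may fail even when the two quantities agree.
abs(total_nontriviality - proportion_nontrivial(train_data)) < 1e-9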

We are still sane!
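
The check above is an instance of the law of total probability: the overall nontriviality equals the sum over classes of (class weight) × (within-class nontriviality). Below is a minimal sketch of that identity on toy rows, using only the standard library. The rows are invented for illustration, and `math.isclose` is used because the two sides are floating-point numbers computed along different paths.

```python
import math
from collections import defaultdict

# Toy rows of (class, before, after); invented for illustration only.
toy_rows = [
    ("PLAIN", "the", "the"),
    ("PLAIN", "cat", "cat"),
    ("PLAIN", "colour", "color"),
    ("CARDINAL", "12", "twelve"),
    ("PUNCT", ".", "."),
    ("PUNCT", ",", ","),
]

totals = defaultdict(int)   # number of rows per class
changed = defaultdict(int)  # number of rows per class with before != after
for cls, before, after in toy_rows:
    totals[cls] += 1
    changed[cls] += before != after

n = len(toy_rows)
overall = sum(changed.values()) / n
# Weighted sum: (share of rows in class c) * (nontriviality within class c)
weighted = sum((totals[c] / n) * (changed[c] / totals[c]) for c in totals)

print(math.isclose(overall, weighted))
```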

### Characters

Before looking at how things work over the semantic classes, let us do something very simple.

Let us identify all characters appearing in the dataset.

In [None]:
from collections import Counter

In [None]:
def generate_recorder_fn(counter):
    def recorder_fn(iterable):
        for c in iterable:
            counter[c] += 1
    return recorder_fn

In [None]:
char_counter = Counter()

_ = train_data['before'].astype(str).apply(generate_recorder_fn(char_counter))

In [None]:
len(char_counter)

In [None]:
char_counter.most_common(20)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

To get a sense of this distribution, let us plot the frequency of occurrence of the 50 most common characters.

(Note: Plotting code shamelessly stolen from [here](https://stackoverflow.com/a/19199002).)

In [None]:
labels, values = zip(*char_counter.most_common(50))

In [None]:
indexes = np.arange(len(labels))
width = 0.2

plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels)
plt.xlabel('characters')
plt.ylabel('counts')
plt.show()

Let's wrap up this counting functionality inside an abstraction:

In [None]:
def n_grams(string, n):
    return zip(*(string[k:] for k in range(n)))

In [None]:
def n_gram_frequency(strings, n=1):
    counter = Counter()
    record = generate_recorder_fn(counter)
    for string in strings:
        record(n_grams(string, n))
    return counter

In [None]:
class_char_counters = {k:n_gram_frequency(df['before'].astype(str), 1) for k,df in grouped_by_class}
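
A short sketch of what `n_grams` produces may help here. The function zips the string with its own suffixes, so it yields tuples even when `n` is 1. This is why the keys of the per-class counters are tuples such as `('e',)`, while the keys of `char_counter` are plain characters. The definition is repeated so the example runs on its own; the input strings are arbitrary.

```python
def n_grams(string, n):
    # Same definition as in the notebook: zip the string with its own
    # suffixes; zip stops at the shortest one, giving len(string) - n + 1 tuples.
    return zip(*(string[k:] for k in range(n)))

unigrams = list(n_grams("data", 1))
bigrams = list(n_grams("data", 2))
too_short = list(n_grams("a", 2))  # a 1-character string has no bigrams

print(unigrams)
print(bigrams)
print(too_short)
```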

##### Sanity check

The sum of the number of occurrences of the character `e` in each class should be equal to the number of its occurrences over the entire corpus.

In [None]:
sum(class_char_counters[k][('e',)] for k in class_char_counters)

In [None]:
sum(class_char_counters[k][('e',)] for k in class_char_counters) == char_counter['e']

#### Bigrams

Let us also create bigram frequency counters:

In [None]:
bigram_counter = n_gram_frequency(train_data['before'].astype(str), 2)

In [None]:
len(bigram_counter)

In [None]:
bigram_counter.most_common(20)

In [None]:
def plot_frequencies(counter, n, width=0.5, font_size=5):
    labels, values = zip(*counter.most_common(n))
    indexes = np.arange(len(labels))
    
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels, rotation='vertical', fontsize=font_size)
    plt.xlabel('grams')
    plt.ylabel('counts')
    plt.show()

In [None]:
plot_frequencies(bigram_counter, 80)

In [None]:
class_bigram_counters = {k:n_gram_frequency(df['before'].astype(str), 2) for k,df in grouped_by_class}

##### Sanity check

The sum of occurrences of the bigram `('a','l')` within each of the classes should be equal to the number of occurrences over the entire corpus.

In [None]:
sum(class_bigram_counters[k][('a','l')] for k in class_bigram_counters)

In [None]:
sum(class_bigram_counters[k][('a','l')] for k in class_bigram_counters) == bigram_counter[('a','l')]

#### How different are these frequencies over each class?

We will answer this question by deriving within-class character and bigram probability distributions from these frequency counts and comparing those distributions to each other.

##### Character distributions

In [None]:
def freq_to_dist(counter):
    total = sum(counter[k] for k in counter)
    return {k:counter[k]/total for k in counter}

In [None]:
class_char_dists = {c:freq_to_dist(class_char_counters[c]) for c in CLASSES}

##### Sanity check

The values in each item of `class_char_dists` should add up to (roughly) 1.

In [None]:
class_char_dists_sums = {c:sum(class_char_dists[c][k] for k in class_char_dists[c]) for c in CLASSES}

In [None]:
sum(abs(class_char_dists_sums[c] - 1) < 0.000001 for c in CLASSES) == len(CLASSES)

##### Bigram distributions

In [None]:
class_bigram_dists = {c:freq_to_dist(class_bigram_counters[c]) for c in CLASSES}

#### Pairwise $L^1$ distances

To begin with, let us calculate the pairwise $L^1$ distances between the character and bigram distributions for each of the classes.

In [None]:
def l1_distance(dist1, dist2):
    """
    Args:
    1. dist1 - a probability distribution represented as a Python dictionary
    2. dist2 - a probability distribution represented as a Python dictionary
    (Note: Dictionary representation of probability distributions is as {value:probability for value in universe})
    
    Returns:
    L^1 distance between dist1 and dist2
    """
    keys = set(dist1) | set(dist2)
    return sum(abs(dist1.get(k,0) - dist2.get(k, 0)) for k in keys)
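
To calibrate how these distances read: the $L^1$ distance between two probability distributions always lies between 0 (identical) and 2 (no shared support). The sketch below checks this on three toy distributions, which are invented for illustration. The function body repeats the notebook's definition so the example is self-contained.

```python
def l1_distance(dist1, dist2):
    # Same computation as in the notebook: sum of absolute differences over
    # the union of keys, treating missing keys as probability 0.
    keys = set(dist1) | set(dist2)
    return sum(abs(dist1.get(k, 0) - dist2.get(k, 0)) for k in keys)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.5, "c": 0.5}
r = {"x": 1.0}

same = l1_distance(p, p)      # identical distributions
partial = l1_distance(p, q)   # half of the mass sits on different keys
disjoint = l1_distance(p, r)  # disjoint supports

print(same, partial, disjoint)
```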

##### Sanity check

The $L^1$ distance between any distribution and itself should be 0.

In [None]:
distribution = class_char_dists['PLAIN']

In [None]:
l1_distance(distribution, distribution) == 0

The $L^1$ distance should be (roughly) symmetric.

In [None]:
dist1 = class_char_dists['PLAIN']

In [None]:
dist2 = class_char_dists['VERBATIM']

In [None]:
abs(l1_distance(dist1, dist2) - l1_distance(dist2, dist1)) < 0.000001

##### Tables of pairwise distances

In [None]:
char_dist_distances_dict = {c1:{c2:l1_distance(class_char_dists[c1], class_char_dists[c2]) for
                           c2 in CLASSES} for
                       c1 in CLASSES}

In [None]:
char_dist_distances = pd.DataFrame.from_dict(char_dist_distances_dict)

In [None]:
bigram_dist_distances_dict = {c1:{c2:l1_distance(class_bigram_dists[c1], class_bigram_dists[c2]) for
                             c2 in CLASSES} for
                         c1 in CLASSES}

In [None]:
bigram_dist_distances = pd.DataFrame.from_dict(bigram_dist_distances_dict)

In [None]:
import seaborn as sns

In [None]:
sns.heatmap(char_dist_distances)
plt.show()

In [None]:
sns.heatmap(bigram_dist_distances)
plt.show()

These heatmaps are quite instructive, as they suggest classes that can be treated similarly to each other for normalization purposes.

For example:

+ The `CARDINAL`, `DIGIT`, and `TELEPHONE` classes are very similar in both character and bigram distribution, and so there will be a substantial degree of overlap in how we normalize those.

+ `PUNCT` (punctuation) is very, very different from every other class.

+ There are similarities in character distribution between `FRACTION` and `CARDINAL`/`DIGIT`/`TELEPHONE`, but these similarities get de-emphasized by considering the bigram distribution.

+ There are similarities in both distribution types between `PLAIN` and `ELECTRONIC`.

We can use this information to define compound semantic classes.
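
One possible way to derive compound classes from a pairwise distance table is to link any two classes whose distance falls below a cut-off and take the connected components of the resulting graph. This is an assumption for illustration, not something the notebook implements. In the sketch below, the distances, the `THRESHOLD` value, and the helper name `compound_classes` are all hypothetical.

```python
# Toy pairwise distances between classes; the numbers are invented for
# illustration and are NOT read off the heatmaps.
toy_distances = {
    ("CARDINAL", "DIGIT"): 0.2,
    ("CARDINAL", "TELEPHONE"): 0.3,
    ("DIGIT", "TELEPHONE"): 0.25,
    ("CARDINAL", "PUNCT"): 1.9,
    ("DIGIT", "PUNCT"): 1.9,
    ("TELEPHONE", "PUNCT"): 1.8,
}
THRESHOLD = 0.5  # hypothetical cut-off

def compound_classes(distances, threshold):
    """Group classes into connected components of the 'distance < threshold' graph."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)  # also registers both classes
        if d < threshold:
            parent[root_a] = root_b

    groups = {}
    for cls in parent:
        groups.setdefault(find(cls), set()).add(cls)
    return sorted(sorted(g) for g in groups.values())

groups = compound_classes(toy_distances, THRESHOLD)
print(groups)
```

The choice of threshold controls how aggressively classes are merged, so it would need to be tuned against the actual distance tables.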

##### ELECTRONIC

It's not clear to me what the `ELECTRONIC` class contains, so let's have a look:

In [None]:
electronic = [df for k,df in grouped_by_class if k == 'ELECTRONIC'][0]

In [None]:
electronic.sample(5)

This class seems to denote electronic identifiers like URLs.

Note that, although the character and bigram distributions between `ELECTRONIC` and `PLAIN` are similar for the actual raw content, they are unlikely to be so for the normalized content (given how spaces are inserted to normalize `ELECTRONIC`).

Our analysis in this section concerns the semantics of the raw tokens only. It does not address the semantics of the normalization, which would require analyzing the normalized tokens.

##### Inferring semantic class

Let us explore the possibility of using a nearest-neighbor scheme to identify the semantic class of a raw token.

The idea is that, for a given token, we can associate it with the semantic class whose character and/or bigram probability distribution is closest (in terms of the $L^1$ distance) to that of the token itself.

Note that this is unlikely to work well. This is not because the idea is bad, but because each token is very short relative to the number of distinct characters in the corpus (and so each token also contains very few bigrams compared to the total number of distinct bigrams represented in the training data). These size considerations may take us down the road of applying a syntactic classification to the characters comprising each token before doing the analysis that we have done thus far in this notebook.
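
To make the size concern concrete, consider a token consisting of a single character $c$. Its distribution puts all of its mass on $c$, so its $L^1$ distance to a class distribution $P$ is $(1 - P(c)) + \sum_{k \neq c} P(k) = 2(1 - P(c))$. This is close to the maximum of 2 whenever $P(c)$ is small. The sketch below checks the identity on a toy class distribution whose values are invented for illustration.

```python
import math

def l1_distance(dist1, dist2):
    # Same computation as in the notebook.
    keys = set(dist1) | set(dist2)
    return sum(abs(dist1.get(k, 0) - dist2.get(k, 0)) for k in keys)

# Toy class-level character distribution; values invented for illustration.
# Keys are 1-tuples, matching what n_grams(..., 1) produces.
toy_class_dist = {("e",): 0.5, ("t",): 0.25, ("a",): 0.25}

# A one-character token puts all of its probability mass on that character.
token_dist = {("t",): 1.0}

d = l1_distance(token_dist, toy_class_dist)
expected = 2 * (1 - toy_class_dist[("t",)])
print(d, math.isclose(d, expected))
```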

Regardless, let us see where the idea takes us with the unmodified tokens. We can investigate the approach mentioned above in a later notebook if it seems promising based on the experiment that follows.

We will begin by sampling a number of tokens from the corpus. For each token:

1. We will calculate the character/bigram distribution.

2. We will identify the within-class character/bigram distribution that is closest to the corresponding distribution for the given token.

3. We will record the class corresponding to the distribution identified in step 2.

We will then calculate the accuracy of the assignments over all sampled tokens.
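
The steps above can be sketched end to end on toy data before running them on the corpus. Everything in this example is an assumption for illustration: the class names `TOY_ALPHA` and `TOY_NUMERIC`, their distributions, and the tokens and labels are invented. The keys are plain characters rather than the 1-tuples used by the notebook's counters.

```python
from collections import Counter

def char_distribution(token):
    """Step 1: relative frequency of each character in the token."""
    counts = Counter(token)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def l1_distance(dist1, dist2):
    keys = set(dist1) | set(dist2)
    return sum(abs(dist1.get(k, 0) - dist2.get(k, 0)) for k in keys)

def nearest_class(token, class_dists):
    """Steps 2 and 3: return the class whose distribution is closest in L^1."""
    token_dist = char_distribution(token)
    return min(class_dists, key=lambda c: l1_distance(token_dist, class_dists[c]))

# Toy class distributions; invented for illustration only.
toy_class_dists = {
    "TOY_ALPHA": {"a": 0.4, "b": 0.3, "c": 0.3},
    "TOY_NUMERIC": {"1": 0.5, "2": 0.5},
}

tokens = ["abc", "12", "a1"]
true_labels = ["TOY_ALPHA", "TOY_NUMERIC", "TOY_ALPHA"]

inferred = [nearest_class(t, toy_class_dists) for t in tokens]
accuracy = sum(i == t for i, t in zip(inferred, true_labels)) / len(tokens)
print(inferred, accuracy)
```

The mixed token `"a1"` is assigned to `TOY_NUMERIC` even though its toy label is `TOY_ALPHA`, which shows how little evidence a two-character token carries.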

In [None]:
SAMPLE_SIZE = 1000

In [None]:
sample_df = train_data.sample(SAMPLE_SIZE)[['before','class']]

In [None]:
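# Cast to str, as elsewhere in this notebook, so that missing values (NaN)
# do not break the character-level processing below.
sample = list(sample_df['before'].astype(str))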

In [None]:
labels = list(sample_df['class'])

In [None]:
def char_inferred_class(token):
    token_dist = freq_to_dist(n_gram_frequency([token], 1))
    distances = [(c, l1_distance(token_dist, class_char_dists[c])) for c in CLASSES]
    return min(distances, key=lambda p: p[1])[0]

In [None]:
inferences = [char_inferred_class(token) for token in sample]

In [None]:
accuracy = sum(inferred == actual for inferred, actual in zip(inferences, labels))/len(labels)

In [None]:
accuracy

Of course, in this case, the `PLAIN` class is over-represented given that it dwarfs all the other classes in terms of representation in the training dataset.

It is more meaningful for us to consider the accuracy of this inference within each class.

In [None]:
class_samples = {c:list(train_data[train_data['class'] == c].sample(SAMPLE_SIZE, replace=True)['before'].astype(str)) for
                 c in CLASSES}

In [None]:
class_inferences = {c:[char_inferred_class(token) for token in class_samples[c]] for c in CLASSES}

In [None]:
class_accuracies = {c:sum(i==c for i in class_inferences[c])/SAMPLE_SIZE for c in CLASSES}

In [None]:
class_accuracies

The above statistics estimate the accuracy conditioned on the true labels. Even *more* meaningful than that are the estimates for accuracy conditioned on the *inferred* labels:

In [None]:
inferred_class_confusion = {c:[] for c in CLASSES}

for c in CLASSES:
    for i in class_inferences[c]:
        inferred_class_confusion[i].append(c)

In [None]:
inferred_class_accuracies = {c:(sum(l==c for l in inferred_class_confusion[c])/len(inferred_class_confusion[c]),
                                len(inferred_class_confusion[c])) for
                             c in inferred_class_confusion if len(inferred_class_confusion[c]) > 0}

In [None]:
inferred_class_accuracies
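
In standard terminology, accuracy conditioned on the true label is per-class recall, and accuracy conditioned on the inferred label is per-class precision. Below is a minimal sketch of both quantities computed from `(true, inferred)` pairs. The pairs are invented for illustration and do not reflect the notebook's results.

```python
from collections import Counter

# Toy (true_label, inferred_label) pairs; invented for illustration only.
pairs = [
    ("PLAIN", "PLAIN"),
    ("PLAIN", "PLAIN"),
    ("PLAIN", "LETTERS"),
    ("LETTERS", "LETTERS"),
    ("LETTERS", "PLAIN"),
    ("PUNCT", "PUNCT"),
]

true_counts = Counter(t for t, _ in pairs)
inferred_counts = Counter(i for _, i in pairs)
hits = Counter(t for t, i in pairs if t == i)

# Accuracy conditioned on the true label (recall).
recall = {c: hits[c] / true_counts[c] for c in true_counts}
# Accuracy conditioned on the inferred label (precision). Only classes that
# were inferred at least once appear here, so there is no division by zero.
precision = {c: hits[c] / inferred_counts[c] for c in inferred_counts}

print(recall)
print(precision)
```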

Actually, a better way to visualize this information is as follows:

In [None]:
confusion = {c:Counter(class_inferences[c]) for c in CLASSES}

In [None]:
for c in confusion:
    for k in CLASSES:
        if k not in confusion[c]:
            confusion[c][k] = 0

In [None]:
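# The outer keys of `confusion` are the true classes and become the columns of
# the DataFrame; the inner keys (inferred classes) become the rows. seaborn
# draws columns along the x-axis and rows along the y-axis.
sns.heatmap(pd.DataFrame.from_dict(confusion))
plt.xlabel('True labels')
plt.ylabel('Inferred labels')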
plt.show()

Any way you slice it, there is promise in this approach to identifying classes.