# Build a Bag-of-Words (BoW) Sentiment Classifier

This notebook uses the same data and supporting code as the rule-based classifier.
Please see that notebook for details on the data.

Unlike the rule-based classifier, this BoW classifier uses automatically extracted features and _learns_ the weights on those features.
Specifically, our features are count vectors in which each index holds the number of times a given word occurs in the input string.
This is referred to as a "bag of words" because it is like throwing all of the words into a bag and counting them. The method is simple, but its main disadvantage is that we lose all structural information present in the sentence(s), such as word order.
We can then use our training data to learn weights between the individual words and our positive, negative, and neutral labels.
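
As a minimal sketch of the idea, the snippet below builds count vectors for two toy sentences using only the standard library. The sentences, the vocabulary, and the `count_vector` helper are illustrative assumptions, not part of the notebook's data or code. Both sentences contain the same words in a different order, so they map to the same vector, which is the structural information a bag of words discards.

```python
from collections import Counter

def count_vector(text, vocab):
    """Return a list where position i holds how often vocab[i] occurs in text."""
    counts = Counter(text.split(" "))
    return [counts[word] for word in vocab]

# Toy vocabulary and sentences, made up for illustration.
vocab = ["bites", "dog", "man", "the"]
vec_a = count_vector("the dog bites the man", vocab)
vec_b = count_vector("the man bites the dog", vocab)

print(vec_a)
print(vec_b)
print(vec_a == vec_b)  # the two sentences are indistinguishable as count vectors
```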

## Data Reading

Read in the data from the training and dev (or, for the final evaluation, test) sets.

In [1]:
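def read_XY_data(filename):
    # Each line has the form "<label> ||| <text>".
    X_data = []
    Y_data = []
    with open(filename, 'r') as f:
        for line in f:
            # Split on the first separator only, in case the text contains ' ||| '.
            label, text = line.strip().split(' ||| ', 1)
            X_data.append(text)
            Y_data.append(int(label))
    return X_data, Y_data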
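
The reader expects one example per line in the form `label ||| text`. The sketch below parses two lines of that shape from an in-memory buffer rather than a file. The sample lines, the label values, and the `parse_line` helper are hypothetical and are not taken from the dataset.

```python
import io

def parse_line(line):
    # Split on the first ' ||| ' only, so the text itself may contain the separator.
    label, text = line.strip().split(" ||| ", 1)
    return int(label), text

# Hypothetical lines in the 'label ||| text' format; not real dataset rows.
sample_file = io.StringIO(
    "1 ||| a made-up positive sentence .\n"
    "-1 ||| a made-up negative sentence .\n"
)
parsed = [parse_line(line) for line in sample_file]
print(parsed)
```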

In [31]:
X_train, Y_train = read_XY_data('../data/sst-sentiment-text-threeclass/train.txt')
# Evaluate on the dev set for now; switch to the test set only for the final evaluation.
X_test, Y_test = read_XY_data('../data/sst-sentiment-text-threeclass/dev.txt')
len(X_train)

8544

In [None]:
def tokenize(datum):
    # Split string into words
    return datum.split(" ")

def build_feature_map(X):
    # We need to assign an index to each word in order to build the count vector.
    # We start by gathering a set of all word types in the training data.
    word_types = set()
    for datum in X:
        for word in tokenize(datum):
            word_types.add(word)
    # Create a dictionary keyed by word mapping it to an index
    return {word: idx for idx, word in enumerate(word_types)}
            

from scipy.sparse import dok_matrix

def extract_features(word_to_idx, X):
    # Use a scipy sparse matrix so that the training data does not require a
    # dense matrix of roughly 8,500 x 18,000 entries, almost all of them zero.
    features = dok_matrix((len(X), len(word_to_idx)))
    for i, datum in enumerate(X):
        for word in tokenize(datum):
            if word in word_to_idx:
                # Increment the word count if it is present in the map.
                # Unknown words are discarded because we would not have
                # a learned weight for them anyway.
                features[i, word_to_idx[word]] += 1
    return features


In [36]:
sample_data = [
    "When is the homework due ?",
    "When are the TAs' office hours ?",
    "How hard is the homework ?",
]

word_to_idx = build_feature_map(sample_data)
print(word_to_idx)
print()

features = extract_features(word_to_idx, sample_data)
print(features)


{"TAs'": 0, '<unk>': 1, 'When': 2, 'office': 3, 'are': 4, 'How': 5, 'hard': 6, 'the': 7, 'homework': 8, '?': 9, 'due': 10, 'hours': 11, 'is': 12}

  (0, 2)	1.0
  (0, 12)	1.0
  (0, 7)	1.0
  (0, 8)	1.0
  (0, 10)	1.0
  (0, 9)	1.0
  (1, 2)	1.0
  (1, 4)	1.0
  (1, 7)	1.0
  (1, 0)	1.0
  (1, 3)	1.0
  (1, 11)	1.0
  (1, 9)	1.0
  (2, 5)	1.0
  (2, 6)	1.0
  (2, 12)	1.0
  (2, 7)	1.0
  (2, 8)	1.0
  (2, 9)	1.0
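
Each line of the sparse output is a `(row, column)` pair followed by the count stored there; entries that are not listed are zero. As a reading aid, the sketch below maps the entries of row 1 back to words. The mapping and the entries are copied from the output above, while `idx_to_word` and `decode_row` are hypothetical helpers introduced here.

```python
# Mapping and row-1 entries copied from the printed output above.
word_to_idx = {"TAs'": 0, '<unk>': 1, 'When': 2, 'office': 3, 'are': 4, 'How': 5,
               'hard': 6, 'the': 7, 'homework': 8, '?': 9, 'due': 10, 'hours': 11, 'is': 12}
row_1_entries = {(1, 2): 1.0, (1, 4): 1.0, (1, 7): 1.0, (1, 0): 1.0,
                 (1, 3): 1.0, (1, 11): 1.0, (1, 9): 1.0}

# Invert the map so a column index can be turned back into its word.
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

def decode_row(entries, row):
    """Return {word: count} for the nonzero entries of one row."""
    return {idx_to_word[col]: count for (r, col), count in entries.items() if r == row}

print(decode_row(row_1_entries, 1))
```

The decoded words are the tokens of the second sample sentence, "When are the TAs' office hours ?", each with a count of 1.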


Now let's run the feature extractor on the actual data.

In [34]:
# Build the map based on the training data
word_to_idx = build_feature_map(X_train)

print(f"Unique word types in X_train: {len(word_to_idx)}")
print("Sample words:")
print(list(word_to_idx.keys())[:20])

Unique word types in X_train: 18281
Sample words:
['tinseltown', 'naturalism', 'understated', 'sports', 'boundless', 'factory', 'oppositions', 'suffer', 'original', 'Alternates', 'self-glorified', 'moan', 'strays', 'Berkley', '8-10', 'oppressively', 'alien', 'Alone', 'reaches', 'richer']


In [30]:
# Convert our strings into count vectors
X_train_vec = extract_features(word_to_idx, X_train)
X_test_vec = extract_features(word_to_idx, X_test)

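from sklearn.linear_model import LogisticRegression

# tol=1e1 is a very loose stopping tolerance (scikit-learn's default is 1e-4).
classifier = LogisticRegression(tol=1e1)
classifier.fit(X_train_vec, Y_train)
# score() returns mean accuracy: first on the training set, then on the dev set.
print(classifier.score(X_train_vec, Y_train))
print(classifier.score(X_test_vec, Y_test))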

0.9701544943820225
0.5967302452316077
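
Conceptually, a multinomial logistic regression model scores a count vector by taking, for each class, a weighted sum of the word counts plus a bias, and then normalizing the scores with a softmax. The sketch below does this by hand for a three-word vocabulary. All weights, biases, class names, and the `predict_proba` helper are invented for illustration; they are not values learned by the classifier above.

```python
import math

def predict_proba(counts, weights, biases):
    """Softmax over per-class scores, where score = bias + sum(weight * count)."""
    scores = {
        label: biases[label] + sum(w * c for w, c in zip(weights[label], counts))
        for label in weights
    }
    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exps = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical vocabulary order: ["good", "bad", "movie"]; the weights are made up.
weights = {
    "positive": [2.0, -2.0, 0.0],
    "neutral": [0.0, 0.0, 0.1],
    "negative": [-2.0, 2.0, 0.0],
}
biases = {"positive": 0.0, "neutral": 0.0, "negative": 0.0}

counts = [1, 0, 1]  # count vector for the toy input "good movie"
probs = predict_proba(counts, weights, biases)
print(max(probs, key=probs.get))
```

Because each word contributes independently to the score, the model can weigh individual words but cannot capture how words interact within a sentence.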
