In [1]:
import data_utils as du
import itertools

This is a basic implementation of index-based featurization. We first read in the .csv data file and clean up the text. We then create a dictionary of all the word types in our corpus, assigning an integer index to each word. A sentence vector is generated as the list of indices corresponding to the words in the sentence, in the order they appear.

In [2]:
# getdata returns a 2D list of cleaned text: the first element of each sublist is the label and the second is the text
x = du.getdata('isear-train.csv')
# count_labels returns a dictionary with the count of each label in the training data
labels = du.count_labels(x)
print(labels)

{'joy': 762, 'fear': 737, 'anger': 742, 'shame': 731, 'guilt': 737, 'sadness': 741, 'disgust': 736}
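
The implementation of du.count_labels is not shown in this notebook. As a rough sketch only, a function with the same observable behavior could be written with collections.Counter. The name count_labels_sketch and the toy rows below are hypothetical and are not part of data_utils.

```python
from collections import Counter

def count_labels_sketch(rows):
    """Hypothetical stand-in for du.count_labels.

    Assumes each row is a [label, text] pair, as described for getdata.
    """
    # Count the first element (the label) of every row
    return dict(Counter(row[0] for row in rows))

# Toy data, invented for illustration only
toy_rows = [
    ["joy", "i passed my exam"],
    ["fear", "i heard a noise at night"],
    ["joy", "we won the game"],
]
toy_counts = count_labels_sketch(toy_rows)
print(toy_counts)
```

The label counts printed above for the real training data are close to uniform (731 to 762 per class), so no class dominates the corpus.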


Now that we have our training data in x, we can generate the dictionary of indices with the build_vocab function.

In [4]:
# Collect the text (second element) of every [label, text] row
words = [row[1] for row in x]
vocab_idcs = du.build_vocab(words)
print(dict(itertools.islice(vocab_idcs.items(), 25)))

{'field': 0, 'wiped': 1, 'maps': 2, 'sorrow': 3, 'coins': 4, 'deeper': 5, 'bribed': 6, 'spontaneous': 7, 'friendship': 8, 'cloth': 9, 'chips': 10, 'guanggho': 11, 'your': 12, 'inflammation': 13, 'entire': 14, 'unit': 15, 'custody': 16, 'coast': 17, 'septic': 18, 'thoughtless': 19, 'bloopers': 20, 'clock': 21, 'obstacles': 22, 'signed': 23, 'commit': 24}
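
The source of du.build_vocab is not included here, so the following is only a minimal sketch of how such a word-to-index dictionary might be built, assuming whitespace tokenization of already cleaned text. The name build_vocab_sketch is hypothetical. This sketch assigns indices in order of first appearance, which may differ from the ordering the real function uses.

```python
def build_vocab_sketch(texts):
    """Hypothetical stand-in for du.build_vocab.

    Maps every distinct word to a unique integer index.
    Assumption: words are separated by whitespace.
    """
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                # The next free index equals the current vocabulary size
                vocab[word] = len(vocab)
    return vocab

# Toy corpus, invented for illustration only
demo_vocab = build_vocab_sketch(["i am happy", "i am going to cry"])
print(demo_vocab)
```

Whatever the ordering, the property that matters for featurization is that each word type gets exactly one index and no two word types share an index.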


The next step is to generate a sentence vector by looking up each word of a sentence in vocab_idcs.

In [6]:
s = 'i am going to cry'
s = s.split()

s_vec = []
for i in s:
    try:
        s_vec.append(vocab_idcs[i])
    except KeyError:
        # Skip words that are not in the vocabulary
        pass

print(s_vec)

[7441, 272, 1242, 7678, 2205]
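
The same lookup can be wrapped in a small reusable function. This is a sketch rather than part of data_utils: the name sentence_to_indices and the toy vocabulary are hypothetical. Like the cell above, it silently drops out-of-vocabulary words.

```python
def sentence_to_indices(sentence, vocab):
    """Convert a sentence to a list of vocabulary indices.

    Words missing from vocab are skipped, mirroring the
    try/except KeyError pattern used in the notebook cell.
    """
    return [vocab[word] for word in sentence.split() if word in vocab]

# Toy vocabulary, invented for illustration only
small_vocab = {"i": 0, "am": 1, "going": 2, "to": 3, "cry": 4}
demo_vec = sentence_to_indices("i am going to xyzzy cry", small_vocab)
print(demo_vec)
```

Because unknown words are dropped, the resulting vector can be shorter than the sentence; here "xyzzy" contributes nothing.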


Now, let's check the results by looking up each word individually; the indices should match s_vec above:

In [7]:
for w in ['i', 'am', 'going', 'to', 'cry']:
    print(f"'{w}' is at idx: {vocab_idcs[w]}")

'i' is at idx: 7441
'am' is at idx: 272
'going' is at idx: 1242
'to' is at idx: 7678
'cry' is at idx: 2205
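
Another way to check a sentence vector is to invert the dictionary and decode the indices back into words. The sketch below uses a toy vocabulary and a hypothetical helper named decode; it assumes every index maps to exactly one word.

```python
def decode(indices, vocab):
    """Map a list of indices back to words using the inverted vocabulary."""
    # Invert word -> index into index -> word
    idx_to_word = {idx: word for word, idx in vocab.items()}
    return [idx_to_word[i] for i in indices]

# Toy vocabulary, invented for illustration only
check_vocab = {"i": 0, "am": 1, "going": 2, "to": 3, "cry": 4}
decoded = decode([0, 1, 2, 3, 4], check_vocab)
print(" ".join(decoded))
```

If the decoded words match the original sentence (minus any skipped out-of-vocabulary words), the encoding step behaved as expected.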
