# Representations

This notebook explores different representations of the input data, which is pre-processed text, with the goal of finding the representation that works best. The representations explored are:

- Bag of Words (BoW)
- One-Hot Encoding
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word Embeddings (Word2Vec, custom trained) of different sizes

We also tried an n-gram representation (see the attempt further down), but the resulting matrix needed more memory than we had available, so it is not used in our project.

## Importing Libraries and Data

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from collections import defaultdict
from nltk import ngrams
import random
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import spacy
import gensim
import logging
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [3]:
df = pd.read_pickle('data/data_processed.pkl')
df = df.reset_index(drop=True)

df.head()

Unnamed: 0,text,emotions
0,feel aw job get posit succeed not_happen,sadness
1,im alon feel aw,sadness
2,ive probabl mention realli feel proud actual k...,joy
3,feel littl low day back,sadness
4,beleiv much sensit peopl feel tend compassion,love


In [4]:
# print the first text row

print(df['text'][0])
df.describe()

feel aw job get posit succeed not_happen


Unnamed: 0,text,emotions
count,416809,416809
unique,379880,6
top,feel accept,joy
freq,65,141067


## 1 - Bag of Words (BoW)

In [8]:
def model_bow(corpus, max_features = 1500):
    vectorizer = CountVectorizer(max_features = max_features)
    x = vectorizer.fit_transform(corpus).toarray()
    return x
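
As a quick sanity check of what `model_bow` produces, the sketch below runs `CountVectorizer` on two rows taken from the dataset previews. The `toy_corpus` and the variable names are only illustrative. Each column corresponds to a vocabulary term and each value is the number of times that term occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two short documents from the dataset previews (illustrative subset only)
toy_corpus = ["feel irrit kinda hate feel", "im alon feel aw"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each term to its column index; sort the terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(terms)   # column order of the matrix
print(counts)  # one row per document, one column per term
```

Because `feel` appears twice in the first document, its column holds a 2 in the first row.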

In [9]:
# Apply the BOW model to the text column

x = model_bow(df['text'])

# Replace the text column with the new BOW representation
df_bow = df.copy()
df_bow['text'] = x.tolist()

print(f"Original text: {df['text'][0]}")
print(f"BOW representation: {df_bow['text'][0]}")
print(f"Length of the BOW representation: {len(df_bow['text'][0])}")
print(f"Number of non-zeros: {np.count_nonzero(df_bow['text'][0])}")

df_bow.head()

Original text: feel irrit kinda hate feel
BOW representation: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Unnamed: 0,text,emotions
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger
1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger
2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger
3,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger
4,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger


In [10]:
df_bow.to_pickle('data/reps/1_bow.pkl') # Save the dataframe to file

## 2 - One-Hot Encoding

In [4]:
def model_one_hot(corpus, max_features = 1500):
    vectorizer_binary = CountVectorizer(binary=True, max_features = max_features)
    x = vectorizer_binary.fit_transform(corpus).toarray()    
    return x
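
Note that `CountVectorizer(binary=True)` yields a binary bag-of-words (one multi-hot vector per document) rather than a separate one-hot vector per token. A minimal sketch on an illustrative document shows the only difference from the count model: repeated terms are clipped to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["feel irrit kinda hate feel"]  # "feel" occurs twice

# Same document, vectorized with raw counts and with presence/absence flags
count_vec = CountVectorizer().fit_transform(doc).toarray()[0]
binary_vec = CountVectorizer(binary=True).fit_transform(doc).toarray()[0]

print(count_vec)   # raw term counts
print(binary_vec)  # 1 if the term is present, 0 otherwise
```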

In [5]:
# Apply the one-hot model to the text column

x = model_one_hot(df['text'])

# Replace the text column with the new one-hot representation
df_one_hot = df.copy()
df_one_hot['text'] = x.tolist()

print(f"Original text: {df['text'][0]}")
print(f"One-hot representation: {df_one_hot['text'][0]}")
print(f"Length of the one-hot representation: {len(df_one_hot['text'][0])}")
print(f"Number of ones: {np.count_nonzero(df_one_hot['text'][0])}")

df_one_hot.head()

Original text: feel aw job get posit succeed not_happen
One-hot representation: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Unnamed: 0,text,emotions
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",sadness
1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",sadness
2,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",joy
3,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",sadness
4,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",love


In [None]:
# Save the dataframe to file
df_one_hot.to_pickle('data/reps/2_one_hot.pkl')

## 3 - TF-IDF

In [4]:
def model_tf_idf(corpus):
    vectorizer_tfidf = TfidfVectorizer(max_features=1500)
    x = vectorizer_tfidf.fit_transform(corpus).toarray()
    return x
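
To make the TF-IDF weights less of a black box, the sketch below recomputes them by hand on a toy corpus and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The corpus and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["feel aw", "feel happi", "feel aw aw"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(toy_corpus).toarray()

# Raw term frequencies, using the same column order as the vectorizer
vocab = vec.vocabulary_
n_docs = len(toy_corpus)
tf = np.zeros((n_docs, len(vocab)))
for i, doc in enumerate(toy_corpus):
    for tok in doc.split():
        tf[i, vocab[tok]] += 1

# Smoothed IDF, then L2-normalise each document row
df_counts = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df_counts)) + 1
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(manual, tfidf))
```

A term such as `feel`, which occurs in every document, receives the lowest IDF, so it contributes less to each vector than rarer terms with the same count.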

The cell below requires a lot of RAM and may crash the kernel if not enough is available. Reducing the `max_features` parameter (hard-coded to 1500 in `model_tf_idf`) can help avoid this.

In [None]:
# Apply the TF-IDF model to the text column

# """ 

x = model_tf_idf(df['text'])

# Replace the text column with the new TF-IDF representation
df_tf_idf = df.copy()
df_tf_idf['text'] = x.tolist()

print(f"Original text: {df['text'][0]}")
print(f"TF-IDF representation: {df_tf_idf['text'][0]}")
print(f"Length of the TF-IDF representation: {len(df_tf_idf['text'][0])}")

print(df_tf_idf.head())

df_tf_idf.to_pickle('data/reps/3_tf_idf.pkl')

# """
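
The memory warning for the TF-IDF cell can be made concrete with a back-of-the-envelope estimate. The sketch below assumes 8 bytes per value (`float64`, the default output type of `TfidfVectorizer`) and uses the row count reported by `df.describe()` above. `dense_bytes` is a hypothetical helper, not part of the notebook.

```python
def dense_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value

n_docs = 416_809     # row count reported by df.describe()
n_features = 1500    # max_features used in this notebook

size = dense_bytes(n_docs, n_features)
print(f"{size / 1e9:.1f} GB")  # size of the dense array alone
```

This covers only the array returned by `.toarray()`. Converting it with `.tolist()` creates a Python object per value, which takes considerably more memory on top.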

## N-grams - attempt

In [4]:
def model_ngram(corpus, ngram_range = (1,2)):
    vectorizer_bigram = CountVectorizer(ngram_range = ngram_range)
    x = vectorizer_bigram.fit_transform(corpus).toarray()
    return x


In [None]:
# Apply the n-gram model to the text column

x = model_ngram(df['text'])

# Replace the text column with the new n-gram representation
df_ngram = df.copy()
df_ngram['text'] = x.tolist()

print(f"Original text: {df['text'][0]}")
print(f"n-gram representation: {df_ngram['text'][0]}")
print(f"Length of the n-gram representation: {len(df_ngram['text'][0])}")

df_ngram.head()

df_ngram.to_pickle('data/reps/5_ngram.pkl')
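
One option that might make the n-gram experiment feasible is to skip `.toarray()` and keep the SciPy sparse matrix that `CountVectorizer` returns, possibly together with a `max_features` or `min_df` cap. This is a sketch on a toy corpus under those assumptions; we have not verified it on the full dataset.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "feel irrit kinda hate feel",
    "im alon feel aw",
    "feel littl low day back",
]

# Adding bigrams grows the vocabulary compared with unigrams alone
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_corpus)
uni_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(toy_corpus)
print(len(unigrams.vocabulary_), len(uni_bigrams.vocabulary_))

# Keep the result sparse: only the non-zero counts are stored
x_sparse = uni_bigrams.transform(toy_corpus)
print(sparse.issparse(x_sparse), x_sparse.shape, x_sparse.nnz)
```

Most scikit-learn estimators accept sparse input directly, so the dense conversion is often unnecessary.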

## 4 - Word Embeddings (Custom)

In [4]:
df_og = pd.read_pickle("data/data_processed.pkl")

df_og.head()

Unnamed: 0,text,emotions
19132,feel irrit kinda hate feel,anger
51533,id rather home feel violent lone im not_tri so...,anger
44351,suggest wait discuss feel less resent,anger
51299,wrong feel royal piss,anger
55778,im tierd talk like there hope hell care unders...,anger


In [5]:
import gensim

# Apply gensim.utils.simple_preprocess(line) to each line in the text column
df_og['text'] = df_og['text'].apply(lambda x: gensim.utils.simple_preprocess(x))

df_og.head()

Unnamed: 0,text,emotions
19132,"[feel, irrit, kinda, hate, feel]",anger
51533,"[id, rather, home, feel, violent, lone, im, no...",anger
44351,"[suggest, wait, discuss, feel, less, resent]",anger
51299,"[wrong, feel, royal, piss]",anger
55778,"[im, tierd, talk, like, there, hope, hell, car...",anger


In [6]:
df_og.shape

(41681, 2)

In [7]:
documents = df_og['text'].tolist()

print(documents[0])

def train_embeddings(documents, vec_size=150):
    model = gensim.models.Word2Vec(documents, vector_size=vec_size, window=10, min_count=2, workers=10, sg=1)
    model.wv.save_word2vec_format('data/model_word_embedding_original.bin', binary=True)
    return model
    
def load_embedding():
    wv = gensim.models.KeyedVectors.load_word2vec_format("data/model_word_embedding_original.bin", binary=True)
    return wv

['feel', 'irrit', 'kinda', 'hate', 'feel']


We train a custom Word2Vec model on our own data using the `gensim` library. For this, we use the original data with fewer pre-processing steps.

In [8]:
model = train_embeddings(documents)
print(len(model.wv.key_to_index)) # Print the vocabulary size
print(model.wv.key_to_index) # Print the vocabulary

9442
{'feel': 0, 'like': 1, 'im': 2, 'get': 3, 'time': 4, 'know': 5, 'realli': 6, 'make': 7, 'go': 8, 'love': 9, 'want': 10, 'littl': 11, 'think': 12, 'peopl': 13, 'would': 14, 'thing': 15, 'day': 16, 'still': 17, 'not_feel': 18, 'one': 19, 'ive': 20, 'life': 21, 'way': 22, 'even': 23, 'much': 24, 'someth': 25, 'need': 26, 'bit': 27, 'could': 28, 'start': 29, 'work': 30, 'say': 31, 'see': 32, 'look': 33, 'back': 34, 'pretti': 35, 'tri': 36, 'alway': 37, 'friend': 38, 'also': 39, 'come': 40, 'good': 41, 'right': 42, 'use': 43, 'year': 44, 'today': 45, 'person': 46, 'never': 47, 'take': 48, 'around': 49, 'hate': 50, 'though': 51, 'help': 52, 'thought': 53, 'someon': 54, 'made': 55, 'live': 56, 'well': 57, 'lot': 58, 'hope': 59, 'find': 60, 'happi': 61, 'mani': 62, 'quit': 63, 'got': 64, 'less': 65, 'week': 66, 'everi': 67, 'home': 68, 'write': 69, 'kind': 70, 'long': 71, 'felt': 72, 'read': 73, 'left': 74, 'enough': 75, 'actual': 76, 'anyth': 77, 'new': 78, 'last': 79, 'sometim': 80, 'gi
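
`train_embeddings` passes `sg=1` (skip-gram) and `window=10` to Word2Vec. As a rough illustration of what those two settings mean, the sketch below lists the (center, context) pairs a skip-gram model learns from for one document. `skipgram_pairs` is a hypothetical helper and a simplification: it is not gensim's implementation, which also involves sampling steps that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

doc = ['feel', 'irrit', 'kinda', 'hate', 'feel']
pairs = skipgram_pairs(doc, window=2)
print(len(pairs))
print(pairs[:4])
```

With `window=10` and documents as short as the ones here, nearly every word in a document ends up as context for every other word.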

In [22]:
wv = load_embedding()

In [23]:
# print the vocabulary of the model
print(wv.key_to_index)

{'feel': 0, 'like': 1, 'im': 2, 'get': 3, 'time': 4, 'know': 5, 'realli': 6, 'make': 7, 'go': 8, 'love': 9, 'want': 10, 'littl': 11, 'think': 12, 'peopl': 13, 'would': 14, 'thing': 15, 'day': 16, 'still': 17, 'not_feel': 18, 'one': 19, 'ive': 20, 'life': 21, 'way': 22, 'even': 23, 'much': 24, 'someth': 25, 'need': 26, 'bit': 27, 'could': 28, 'start': 29, 'work': 30, 'say': 31, 'see': 32, 'look': 33, 'back': 34, 'pretti': 35, 'tri': 36, 'alway': 37, 'friend': 38, 'also': 39, 'come': 40, 'good': 41, 'right': 42, 'use': 43, 'year': 44, 'today': 45, 'person': 46, 'never': 47, 'take': 48, 'around': 49, 'hate': 50, 'though': 51, 'help': 52, 'thought': 53, 'someon': 54, 'made': 55, 'live': 56, 'well': 57, 'lot': 58, 'hope': 59, 'find': 60, 'happi': 61, 'mani': 62, 'quit': 63, 'got': 64, 'less': 65, 'week': 66, 'everi': 67, 'home': 68, 'write': 69, 'kind': 70, 'long': 71, 'felt': 72, 'read': 73, 'left': 74, 'enough': 75, 'actual': 76, 'anyth': 77, 'new': 78, 'last': 79, 'sometim': 80, 'give': 

In [24]:
print(len(wv['feel']))
print(wv.get_vector('feel'))

150
[-0.0486216  -0.3163228   0.3623145  -0.06800893 -0.3854535  -0.18433796
  0.28263554  0.08622984 -0.2759842  -0.254357    0.18156776 -0.28893664
 -0.40552175  0.28991073  0.1364522  -0.01297955 -0.29678842  0.04851127
 -0.1612786  -0.04161817 -0.17282018  0.15228981  0.2376068   0.29768112
 -0.02419732  0.07012758 -0.02635229  0.40613672 -0.0172801  -0.12285158
 -0.22543481 -0.09658933  0.34894082 -0.22661304  0.03727106  0.31226444
  0.4201239  -0.04531339 -0.01720235  0.19597638 -0.06165056  0.1678493
  0.03229427 -0.10176083 -0.08484645 -0.03094389  0.06789957  0.01867272
 -0.04809374  0.01364644 -0.1137508   0.12472186  0.1196994   0.07763399
  0.37735182 -0.05329221  0.26490948 -0.28472725 -0.19511798  0.2878918
  0.19934723 -0.08381692  0.11897624 -0.28800526 -0.05894611  0.30524054
  0.12072757 -0.2739678  -0.0583734  -0.15375237  0.14455894  0.19060007
 -0.04274751 -0.38681978  0.20569876 -0.2804796  -0.05063558  0.09774282
 -0.15822056 -0.02099248  0.15908436 -0.06382011 

Now that the model has been trained, we can use it to generate embeddings for the text data.

In [8]:
# Generate embeddings for the dataset using mean of word embeddings
def generate_embeddings(documents, wv):
    embeddings = []
    for doc in documents:
        doc_embedding = []
        for word in doc:
            if word in wv:
                doc_embedding.append(wv.get_vector(word))
        if len(doc_embedding) > 0:
            embeddings.append(np.mean(doc_embedding, axis=0))
        else:
            embeddings.append(np.zeros(wv.vector_size))
    return embeddings
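
The averaging step in `generate_embeddings` can be checked on a toy example. The sketch below uses a plain dictionary of hypothetical 3-dimensional vectors in place of a trained model; `mean_pool` and `toy_vectors` are illustrative names only.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for a trained model
toy_vectors = {
    "feel": np.array([1.0, 0.0, 2.0]),
    "hate": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def mean_pool(doc, vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in doc if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(mean_pool(["feel", "hate", "unknownword"], toy_vectors, dim))
print(mean_pool(["unknownword"], toy_vectors, dim))
```

Unknown words are skipped rather than counted as zeros, so they do not pull the average toward the origin. A document with no known words falls back to the zero vector.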

In [25]:
embeddings = generate_embeddings(documents, wv)

print(f"Embedding for the first document: {embeddings[0]}")
print(f"Length of the embedding: {len(embeddings[0])}")

# Save the embeddings to file
df_embeddings = df_og.copy()
df_embeddings['text'] = embeddings
df_embeddings.to_pickle('data/reps/4_embeddings_original.pkl')


Embedding for the first document: [-0.12554993 -0.13101527  0.26901686 -0.07821853 -0.15671797 -0.06083159
  0.1847798   0.20661625 -0.11461768 -0.28449285  0.13587268 -0.1536037
 -0.22497125  0.16792075 -0.04479312  0.06780216 -0.24790005  0.03467832
 -0.05297899  0.09582764 -0.17792876  0.0876202   0.25996074  0.27132052
  0.02152597 -0.05402856 -0.06537221  0.22617503 -0.02125627 -0.12765017
 -0.26115435  0.01211254  0.11859896 -0.17083617  0.06805599  0.1812742
  0.37529612  0.05075525 -0.05151397 -0.01014401  0.01968651  0.09169078
  0.04092222 -0.07113075  0.06592005  0.09252273  0.15531221  0.00505344
 -0.1474311  -0.0370223  -0.22325747  0.04498313  0.05987433  0.01468256
  0.37638146 -0.00330744  0.20345096 -0.06688371 -0.18058369  0.18248321
  0.15597697 -0.10377561  0.17278524 -0.24501336  0.05429054  0.22197144
 -0.06349276 -0.2151761  -0.18256348 -0.1386909   0.13258424  0.17475578
 -0.08174501 -0.32392806  0.19637159 -0.32053885  0.05327859  0.2031691
 -0.19568074  0.0374

## 5 - Word Embeddings (Pre-trained)

We can also represent the input text with pre-trained word embeddings instead of training our own. Below we load GloVe vectors trained on Twitter data (`glove-twitter-100`) through Gensim's downloader API. Other pre-trained embeddings are available through the same interface (see [Gensim](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html)).

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe vectors (Twitter, 100 dimensions) via the gensim downloader
w2v_model = api.load('glove-twitter-100')

# Generate embeddings for the dataset using mean of word embeddings
embeddings = generate_embeddings(documents, w2v_model)

print(f"Embedding for the first document: {embeddings[0]}")
print(f"Length of the embedding: {len(embeddings[0])}")

# Save the embeddings to file
df_embeddings = df_og.copy()
df_embeddings['text'] = embeddings
df_embeddings.to_pickle('data/reps/5_embeddings_twitter.pkl')
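
Our texts are stemmed (for example `irrit`, `realli`), while pre-trained vocabularies are typically built from unstemmed text, so some tokens may have no pre-trained vector. Since `generate_embeddings` silently skips such tokens, it may be worth measuring how many are actually found. The sketch below is a hypothetical helper with a toy vocabulary; for the real model, `w2v_model.key_to_index` could be passed as `vocab`.

```python
def vocabulary_coverage(documents, vocab):
    """Fraction of tokens in `documents` that are present in `vocab`."""
    total = sum(len(doc) for doc in documents)
    if total == 0:
        return 0.0
    found = sum(1 for doc in documents for tok in doc if tok in vocab)
    return found / total

toy_docs = [["feel", "irrit", "kinda", "hate", "feel"]]
toy_vocab = {"feel", "hate", "kinda"}  # hypothetical stand-in for a model vocabulary

print(vocabulary_coverage(toy_docs, toy_vocab))
```

A low coverage value would suggest that the document embeddings are averages over only a few words each.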

Next, we generate the embeddings using pre-trained FastText vectors.

In [12]:
# Using the FastText model (by Facebook Research) trained on the Common Crawl corpus
import fasttext.util

documents = df_og['text'].tolist()
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

# Generate embeddings for the dataset using mean of word embeddings
# (call the helper once; a second, discarded call would only repeat the work)
embeddings_fasttext = generate_embeddings(documents, ft)

print(f"Embedding for the first document: {embeddings_fasttext[0]}")
print(f"Length of the embedding: {len(embeddings_fasttext[0])}")

# Save the embeddings to file
df_embeddings_fasttext = df_og.copy()
df_embeddings_fasttext['text'] = embeddings_fasttext
df_embeddings_fasttext.to_pickle('data/reps/6_embeddings_fasttext.pkl')



Embedding for the first document: [ 3.25609818e-02  1.19558135e-02  8.88022706e-02  1.30403519e-01
 -9.40019339e-02 -9.19098184e-02 -6.98399097e-02  1.30608631e-02
  3.05741169e-02  1.80588150e-03  3.13137397e-02 -2.05776617e-02
  1.86222941e-02  1.36141358e-02  5.95527701e-02  7.09418133e-02
  8.04259032e-02  7.34392256e-02 -4.83792275e-02  3.69153842e-02
  1.98652651e-02  4.01078239e-02  3.20375785e-02 -9.29942913e-03
 -1.01217583e-01 -2.62807291e-02 -2.89488714e-02 -4.81858589e-02
  4.08793762e-02  1.15669787e-01 -6.58827648e-02  4.63480465e-02
  2.77572200e-02  8.34309161e-02  4.65868190e-02  1.14921597e-04
  7.37055093e-02  5.03478087e-02 -3.97197753e-02 -4.29389067e-03
  3.30605060e-02 -6.19458780e-03  5.68229780e-02  9.40148830e-02
 -4.30895668e-03 -1.60394683e-02  8.10505450e-03  1.06105115e-02
 -3.41614746e-02  1.02989692e-02  5.38369119e-02 -2.64815092e-02
 -3.86923328e-02 -5.62955020e-03 -4.63240668e-02 -2.02059448e-02
  4.22170796e-02  2.12134775e-02 -1.24166295e-01 -6.4870

In [14]:
ft.get_dimension()

300
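
The `generate_embeddings` helper defined earlier relies on `word in wv`, `wv.get_vector(word)` and `wv.vector_size`, which are members of gensim's `KeyedVectors`. The fastText model object exposes `get_dimension()` (used above) and, as far as we know, `get_word_vector(word)` instead. If the helper raises an `AttributeError` with the fastText model, a thin adapter along the lines of the hypothetical one below may bridge the two interfaces. The stub class only exists so that the sketch runs without downloading a model.

```python
import numpy as np

class FastTextAdapter:
    """Expose a fastText-style model through the interface generate_embeddings uses."""

    def __init__(self, ft_model):
        self._ft = ft_model
        self.vector_size = ft_model.get_dimension()

    def __contains__(self, word):
        # fastText composes vectors from subword n-grams, so every word gets a vector
        return True

    def get_vector(self, word):
        return self._ft.get_word_vector(word)


class _StubFastText:
    """Stand-in for a loaded model (assumption: mirrors only the two methods used)."""

    def get_dimension(self):
        return 4

    def get_word_vector(self, word):
        return np.full(4, float(len(word)))


wv_like = FastTextAdapter(_StubFastText())
print("feel" in wv_like, wv_like.vector_size, wv_like.get_vector("feel"))
```

With a real model, the call would then read `generate_embeddings(documents, FastTextAdapter(ft))`.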