# Representations

This notebook explores different representations of the input data, which is pre-processed text, with the goal of finding the representation that works best. The representations explored are:

- Plain Text
- Bag of Words (BoW)
- One-Hot Encoding
- TF-IDF (Term Frequency-Inverse Document Frequency)
- N-grams (different n-gram sizes)
- Word Embeddings (Word2Vec, custom trained) of different sizes and aggregation methods (append, mean, max, min, etc.)
- Custom Representation (using word sentiment and word frequency)

## Importing Libraries and Data

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

df = pd.read_pickle('data/data_processed.pkl')

## Bag of Words (BoW)

In [7]:
def model_bow(corpus, max_features=1500):
    """Return a dense document-term count matrix limited to the `max_features` most frequent terms."""
    vectorizer = CountVectorizer(max_features=max_features)
    x = vectorizer.fit_transform(corpus).toarray()
    return x
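
To make the count matrix concrete, here is a minimal pure-Python sketch of what a bag-of-words representation contains. It is an illustration, not the `CountVectorizer` implementation: `bow_by_hand`, `tokenize`, and `toy_corpus` are hypothetical names, and the tokenizer only approximates the vectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenization:
# lowercase the text and keep tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def bow_by_hand(corpus):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in corpus]
    # Sorted vocabulary so that column order is deterministic
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus, not taken from the notebook's data
toy_corpus = ["good movie good plot", "bad movie"]
vocabulary, matrix = bow_by_hand(toy_corpus)
print(vocabulary)
print(matrix)
```

Each row is one document and each column one vocabulary term, so word order is discarded and only frequencies remain.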

## One-Hot Encoding

In [8]:
def model_one_hot(corpus):
    """Return a dense binary matrix: 1 if a term occurs in a document, 0 otherwise."""
    vectorizer_binary = CountVectorizer(binary=True)
    x = vectorizer_binary.fit_transform(corpus).toarray()
    return x
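
The only difference from the bag-of-words counts is that repeated occurrences collapse to 1. The sketch below shows this on already tokenized documents. `binary_presence` and the toy documents are hypothetical and stand in for what `binary=True` does conceptually.

```python
def binary_presence(tokenized_docs):
    """Return (vocabulary, rows) where rows[i][j] is 1 if term j occurs in document i."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        present = set(doc)  # duplicates are irrelevant for presence
        rows.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, rows


# Hypothetical toy documents: "good" appears twice in the first one
toy_docs = [["good", "good", "movie"], ["bad", "movie"]]
vocab, rows = binary_presence(toy_docs)
print(vocab)
print(rows)
```

Strictly speaking, a document-level vector like this has several ones per row, so it is sometimes called a multi-hot encoding.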

## TF-IDF

In [9]:
def model_tf_idf(corpus):
    """Return a dense TF-IDF matrix using TfidfVectorizer's default settings."""
    vectorizer_tfidf = TfidfVectorizer()
    x = vectorizer_tfidf.fit_transform(corpus).toarray()
    return x
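
As a rough guide to what the weights mean, the sketch below recomputes TF-IDF by hand under the assumption of `TfidfVectorizer`'s default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. `tfidf_by_hand` and the toy documents are hypothetical, and tokenization is assumed to have happened already.

```python
import math
from collections import Counter


def tfidf_by_hand(tokenized_docs):
    """Return (vocabulary, rows) with smoothed-idf, L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        raw = [term_freq[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
        rows.append([v / norm for v in raw])
    return vocabulary, rows


# Hypothetical toy documents: "movie" occurs in both, so it gets the lowest idf
tfidf_vocab, tfidf_rows = tfidf_by_hand([["good", "movie"], ["bad", "movie"]])
print(tfidf_vocab)
print(tfidf_rows)
```

Terms that occur in every document receive the smallest weight, which is the property that distinguishes TF-IDF from plain counts.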

## N-grams

In [10]:
def model_ngram(corpus, ngram_range=(1, 2)):
    """Return dense counts of word n-grams; the default (1, 2) covers unigrams and bigrams."""
    vectorizer_ngram = CountVectorizer(ngram_range=ngram_range)
    x = vectorizer_ngram.fit_transform(corpus).toarray()
    return x
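
To see which features an `ngram_range` produces, the following sketch lists the word n-grams of a single tokenized text. `word_ngrams` is a hypothetical helper written for illustration. It mirrors the inclusive `(min_n, max_n)` convention of `ngram_range` but does not sort or count the features.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """List all word n-grams of `tokens` for every n in the inclusive range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token sequence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Hypothetical example: the bigram "not very" keeps context a unigram model loses
unigrams_and_bigrams = word_ngrams(["not", "very", "good"])
bigrams_only = word_ngrams(["not", "very", "good"], ngram_range=(2, 2))
print(unigrams_and_bigrams)
print(bigrams_only)
```

Larger ranges grow the vocabulary quickly, which matters here because the functions above convert the result to a dense array.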

## Word Embeddings (Word2Vec)

In [None]:
# TODO
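
Until this section is filled in, the aggregation methods named in the overview (append, mean, max, min) can be sketched independently of any trained model. The example below is a hedged illustration only: `aggregate` is a hypothetical helper, and `toy_vectors` is a hand-written lookup table standing in for Word2Vec vectors, not real model output.

```python
def aggregate(vectors, method="mean"):
    """Combine a list of equal-length word vectors into one document vector."""
    if not vectors:
        raise ValueError("no word vectors to aggregate")
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    if method == "append":
        # Concatenation: output length depends on the number of words
        return [value for vec in vectors for value in vec]
    raise ValueError(f"unknown aggregation method: {method}")


# Hypothetical 2-dimensional vectors; real Word2Vec vectors are much larger
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 2.0]}
tokens = ["good", "movie", "unseen"]
# Words missing from the lookup table are skipped
word_vecs = [toy_vectors[t] for t in tokens if t in toy_vectors]

doc_mean = aggregate(word_vecs, "mean")
doc_max = aggregate(word_vecs, "max")
doc_min = aggregate(word_vecs, "min")
doc_append = aggregate(word_vecs, "append")
print(doc_mean, doc_max, doc_min, doc_append)
```

Mean, max, and min yield a fixed-size vector regardless of text length, whereas append produces a variable-length vector that would need padding or truncation before it can feed a model expecting fixed-size input.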

## Word Embeddings (Custom)

In [None]:
# TODO

## Custom Representation

In [12]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def model_vader(corpus):
    # VADER is a rule-based, lexicon-driven sentiment analyzer tuned for social media text.
    # Returns one label per text in the corpus: 1 if the compound score is positive,
    # 0 otherwise (neutral or negative).
    analyzer = SentimentIntensityAnalyzer()
    x = []
    for rev in corpus:
        x.append(1 if analyzer.polarity_scores(rev)['compound'] > 0 else 0)
    return x

# TODO
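
`model_vader` labels a text as positive whenever the compound score is strictly above 0. The VADER README suggests treating scores of at least 0.05 as positive and scores between -0.05 and 0.05 as neutral, so a configurable cut-off may be worth comparing. The helper below is a hypothetical sketch that works on a compound score directly, so it runs without the `vaderSentiment` package.

```python
def label_from_compound(compound, threshold=0.05):
    """Return 1 if the compound score reaches `threshold`, else 0.

    `threshold` is an assumed parameter; 0.05 follows the convention
    suggested in the VADER README for positive sentiment.
    """
    return 1 if compound >= threshold else 0


# Hypothetical compound scores, not produced by running VADER
weak_positive = label_from_compound(0.03)
clear_positive = label_from_compound(0.6)
negative = label_from_compound(-0.4)
print(weak_positive, clear_positive, negative)
```

With this cut-off, weakly positive texts fall into class 0 together with neutral and negative ones, which changes the class balance relative to the `> 0` rule used above.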

In [31]:


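from sentence_transformers import SentenceTransformer

def convert_text_to_embeddings(df, text_column, model_name='all-MiniLM-L6-v2'):
    """Encode df[text_column] with a SentenceTransformer model.

    Adds an 'embeddings' column to df in place and returns the embedding array.
    """
    # Load the SentenceTransformer model
    model = SentenceTransformer(model_name)

    # Get the texts from the DataFrame
    texts = df[text_column].tolist()

    # Compute one embedding vector per text
    embeddings = model.encode(texts)

    # Store each embedding as a list in a single new column
    df['embeddings'] = embeddings.tolist()

    return embeddings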

# Example usage:
# Assuming you have a DataFrame df with a 'text' column
embeddings = convert_text_to_embeddings(df, 'text', 'all-MiniLM-L6-v2')
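
Sentence embeddings are commonly compared with cosine similarity. The sketch below shows the computation on hand-written toy vectors, which are assumptions for illustration and not output of `all-MiniLM-L6-v2`. `cosine_similarity` is a hypothetical pure-Python helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two rows of the embedding matrix
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

With real embeddings the same function could be applied to two rows of `embeddings`, for example `cosine_similarity(embeddings[0], embeddings[1])`.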

In [32]:
# Print the embedding vector of the first text
print(embeddings[0])


[-5.81458732e-02 -5.60483709e-03  8.68722498e-02  4.17182781e-02
  1.92652512e-02  3.85539085e-02  6.65507019e-02 -6.02931455e-02
 -8.92053638e-03 -2.67916750e-02  4.97117154e-02 -4.83248979e-02
 -3.81650962e-02 -4.73941900e-02  6.84752455e-03 -5.11829481e-02
 -5.13229333e-03 -3.70376334e-02 -2.56631244e-02  1.07601456e-01
  6.60977606e-03  1.83822692e-03 -6.51664510e-02 -3.26131284e-02
 -2.13531661e-03 -5.26386537e-02 -3.27641070e-02 -3.78051512e-02
  6.38306439e-02 -9.13201123e-02  5.57275452e-02  1.32038862e-01
 -3.58072631e-02  3.95253068e-03 -4.15708199e-02  1.30974427e-01
 -3.78422923e-02 -5.21705188e-02 -1.13910604e-02  5.93322851e-02
 -1.29474699e-01  7.89017358e-04  1.51101360e-02  4.78607789e-02
 -4.01755702e-03 -4.56100218e-02  2.49709561e-02  1.00279246e-02
  4.62591685e-02 -7.68105462e-02  1.22880042e-02 -1.31569011e-02
 -6.09980784e-02 -2.93911854e-03  1.77230891e-02  2.48986315e-02
 -8.93952232e-03 -8.63799676e-02 -1.81567650e-02  3.69794816e-02
 -1.28638847e-02 -7.54874