# 3rd Iteration
In this iteration we embed text and emoji separately to preserve their distinct signals. The emoji embeddings are projected to the dimension of the text embeddings, and the two are then fused with a linear layer.
The main purpose is to check whether separate processing improves emotional understanding compared with demojizing (replacing each emoji with its text name).

Main steps:

1. Choose emoji-rich dataset

2. Create text and emoji embeddings independently

3. Project emoji embeddings to same dimension as text embeddings

4. Combine embeddings via a linear layer

5. Search most relevant quotes using FAISS index

6. Compare results from Emoji2Vec and Emojinal, and evaluate quality vs. previous iterations
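
Steps 3 and 4 can be sketched with plain NumPy to make the shapes explicit. This is only an illustration under stated assumptions: the embeddings and weight matrices below are random stand-ins (not real model outputs), and the names `W_proj` and `W_fuse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
TEXT_DIM, EMOJI_DIM = 768, 300

# Assumption: random stand-ins for a real text embedding and a real emoji embedding
text_emb = rng.standard_normal(TEXT_DIM)
emoji_emb = rng.standard_normal(EMOJI_DIM)

# Step 3: project the emoji vector from 300 to 768 dimensions (untrained weights)
W_proj = rng.standard_normal((TEXT_DIM, EMOJI_DIM)) / np.sqrt(EMOJI_DIM)
emoji_proj = W_proj @ emoji_emb

# Step 4: concatenate both vectors (1536 values) and fuse them back to 768
W_fuse = rng.standard_normal((TEXT_DIM, 2 * TEXT_DIM)) / np.sqrt(2 * TEXT_DIM)
fused = W_fuse @ np.concatenate([text_emb, emoji_proj])

print(emoji_proj.shape, fused.shape)
```

The fused vector has the same dimension as the text embedding, which is what allows it to be used as a query against a 768-dimensional index.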

In [26]:
!pip install emoji
!pip install faiss-cpu
!pip install numpy
!pip install scipy
!pip install gensim




In [27]:
import pandas as pd
import numpy as np
import emoji
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
from tabulate import tabulate
import torch
import torch.nn as nn
from tqdm import tqdm
import re
import ast
from gensim.models import KeyedVectors

## 1. Prepare csv files

In [28]:
all_dfs = []
for file_name in os.listdir("archive"):
    if file_name.endswith('.csv'):
        file_path = os.path.join("archive", file_name)
        try:
            # Read the CSV with pandas defaults (no explicit encoding is set)
            df = pd.read_csv(file_path)
            all_dfs.append(df)
            print(f"Loaded {file_name} with {len(df)} rows")
        except Exception as e:
            print(f"Error loading {file_name}: {e}")

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined dataset size: {len(combined_df)} rows")
print()
combined_df.head()
combined_df.to_csv("combined_dataset.csv", index=False)

Loaded rabbit.csv with 20000 rows
Loaded thinking_face.csv with 20000 rows
Loaded hot_face.csv with 20000 rows
Loaded smiling_face_with_heart-eyes.csv with 20000 rows
Loaded fire.csv with 20000 rows
Loaded folded_hands.csv with 20000 rows
Loaded fearful_face.csv with 20000 rows
Loaded rolling_on_the_floor_laughing.csv with 20000 rows
Loaded saluting_face.csv with 20000 rows
Loaded white_heart.csv with 20000 rows
Loaded grinning_face_with_sweat.csv with 20000 rows
Loaded thumbs_up.csv with 20000 rows
Loaded cooking.csv with 20000 rows
Loaded face_with_steam_from_nose.csv with 20000 rows
Loaded rabbit_face.csv with 20000 rows
Loaded sparkles.csv with 20000 rows
Loaded smiling_face_with_halo.csv with 20000 rows
Loaded ghost.csv with 20000 rows
Loaded hatching_chick.csv with 20000 rows
Error loading backhand_index_pointing_right.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Error loading red_heart.csv: Error tokenizing data. C error: Buffer o
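
Two files failed to load with a tokenizing error. One possible workaround, shown here as a sketch rather than a verified fix for these particular files, is to use the slower Python parser and skip rows that cannot be split into the expected columns. The helper name `load_csv_tolerant` and the sample data are hypothetical; `on_bad_lines` requires pandas 1.3 or newer.

```python
import io
import pandas as pd

def load_csv_tolerant(source):
    """Read a CSV with the Python parser, skipping rows with too many fields."""
    return pd.read_csv(source, engine="python", on_bad_lines="skip")

# Hypothetical sample: the third line has more fields than the header
sample = io.StringIO(
    "Text,Label\n"
    "hello,1\n"
    "broken,row,with,extra,fields\n"
    "world,2\n"
)
df_sample = load_csv_tolerant(sample)
print(len(df_sample))
```

Skipping rows silently changes the dataset size, so it is worth comparing the row count against the expected 20000 rows per file.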

## 2. Load dataset and filter rows with emojis

In [29]:
df = pd.read_csv('combined_dataset.csv')
df['has_emoji'] = df['Text'].apply(lambda x: bool(emoji.emoji_count(str(x))))
emoji_rich_df = df[df['has_emoji']].copy()

# Select 100 random records
df_subset = emoji_rich_df.sample(n=100, random_state=100).reset_index(drop=True)

In [30]:
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
text_model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")

def get_text_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = text_model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1][0, 0, :].detach().numpy().astype(np.float32)


## 3. Create emoji embeddings and project them to the text dimension

In [38]:
# Load pretrained Emoji2Vec and Emojinal vectors from local word2vec-format binaries
emoji2vec = KeyedVectors.load_word2vec_format("emoji2vec.bin", binary=True)
emojinal = KeyedVectors.load_word2vec_format("emojional.bin", binary=True)


# Projection layers (randomly initialized and not trained in this iteration)
emoji2vec_proj = nn.Linear(300, 768)
emojinal_proj = nn.Linear(300, 768)
combine_proj = nn.Linear(768 + 768, 768)

def extract_emojis(text):
    return [c for c in text if c in emoji.EMOJI_DATA]

def get_emoji_embedding(text, embedding_dict, projection_layer, input_dim):
    # Keep only characters present in the embedding vocabulary.
    # Note: this per-character scan misses multi-codepoint emoji (e.g. ZWJ sequences).
    emojis = [ch for ch in text if ch in embedding_dict]
    # If no known emoji is found, return the zero vector
    if not emojis:
        return np.zeros((768,), dtype=np.float32)
    # Average the emoji vectors
    avg_emb = np.mean([embedding_dict[e] for e in emojis], axis=0)
    # Convert to torch and project from input_dim up to 768 dimensions
    avg_emb_tensor = torch.tensor(avg_emb, dtype=torch.float32).unsqueeze(0)  # (1, input_dim)
    with torch.no_grad():
        projected = projection_layer(avg_emb_tensor).squeeze(0).numpy()  # (768,)
    return projected


def combine_embeddings(text_emb, emoji_emb):
    text_tensor = torch.tensor(text_emb, dtype=torch.float32)
    emoji_tensor = torch.tensor(emoji_emb, dtype=torch.float32)
    combined = torch.cat([text_tensor, emoji_tensor], dim=0)
    return combine_proj(combined).detach().numpy()
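
The emoji lookup above scans the text one character at a time, so an emoji made of several code points (for example a heart followed by a variation selector) is never matched as a whole. A minimal sketch of a greedy longest-match scan is shown below. The function name `extract_known_emojis` and the toy vocabulary are hypothetical; with gensim 4 the vocabulary could presumably be built as `set(emoji2vec.index_to_key)`, which is an assumption about the installed version.

```python
def extract_known_emojis(text, vocab):
    """Greedy longest-match scan for emoji keys contained in `vocab`."""
    max_len = max((len(key) for key in vocab), default=0)
    found, i = [], 0
    while i < len(text):
        # Try the longest candidate first so multi-codepoint emoji win
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                found.append(candidate)
                i += size
                break
        else:
            i += 1  # no emoji starts at this position
    return found

# Toy vocabulary (assumption): red heart with variation selector, thumbs up
toy_vocab = {"\u2764\ufe0f", "\U0001F44D"}
toy_text = "love \u2764\ufe0f \U0001F44D"

matched = extract_known_emojis(toy_text, toy_vocab)
per_char = [ch for ch in toy_text if ch in toy_vocab]
print(len(matched), len(per_char))
```

On the toy text the longest-match scan finds both emoji, while the per-character scan finds only the single-codepoint one.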


In [32]:
emoji2vec["😢"].shape

(300,)

## 4. Load quotes and build the FAISS index

In [33]:
quotes_df = pd.read_csv('selected_quotes_embeddings.csv')

def parse_embedding(emb):
    # Embeddings are stored in the CSV as stringified lists; return None if parsing fails
    try:
        if isinstance(emb, str):
            return np.array(ast.literal_eval(emb), dtype=np.float32)
        return np.array(emb, dtype=np.float32)
    except Exception:
        return None

index = faiss.IndexFlatIP(768)
valid_rows = []  # positions of the quotes that actually enter the index
n_chunks = (len(quotes_df) + 9999) // 10000
for i in range(0, len(quotes_df), 10000):
    chunk = quotes_df.iloc[i:i + 10000]
    parsed = [parse_embedding(emb) for emb in chunk['embeddings']]
    keep = [j for j, emb in enumerate(parsed) if emb is not None and emb.shape == (768,)]
    if keep:
        chunk_array = np.vstack([parsed[j] for j in keep])
        faiss.normalize_L2(chunk_array)
        index.add(chunk_array)
        valid_rows.extend(i + j for j in keep)
    print(f"Processed chunk {i // 10000 + 1}/{n_chunks}")

# Drop skipped rows so that FAISS ids line up with quotes_df positions
quotes_df = quotes_df.iloc[valid_rows].reset_index(drop=True)



Processed chunk 1/11
Processed chunk 2/11
Processed chunk 3/11
Processed chunk 4/11
Processed chunk 5/11
Processed chunk 6/11
Processed chunk 7/11
Processed chunk 8/11
Processed chunk 9/11
Processed chunk 10/11
Processed chunk 11/11


In [34]:
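def search_similar_quotes(query_embedding, k=5):
    # Work on a float32 copy so the caller's array is not normalized in place
    query = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
    faiss.normalize_L2(query)
    # With L2-normalized vectors, inner-product scores are cosine similarities
    similarities, indices = index.search(query, k)
    return similarities[0], indices[0]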
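
The search relies on the fact that an inner product between L2-normalized vectors equals their cosine similarity. The brute-force NumPy sketch below reproduces that computation on a small random corpus. It is an illustration only: the function name `cosine_top_k` and the toy corpus are hypothetical and no FAISS call is involved.

```python
import numpy as np

def cosine_top_k(query, corpus, k=5):
    """Return the k highest cosine similarities and the matching row indices."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = corpus_n @ query_n          # inner product of unit vectors
    top = np.argsort(-sims)[:k]        # indices sorted by descending similarity
    return sims[top], top

rng = np.random.default_rng(0)
toy_corpus = rng.standard_normal((100, 768))

# A rescaled copy of row 42 should still match row 42 with similarity close to 1
top_sims, top_ids = cosine_top_k(toy_corpus[42] * 3.0, toy_corpus, k=3)
print(top_ids[0], round(float(top_sims[0]), 4))
```

Because the vectors are normalized first, rescaling the query does not change the ranking.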

## 5. Evaluation

In [39]:
results = []

for idx, row in tqdm(df_subset.iterrows(), total=len(df_subset)):
    input_text = row['Text']

    text_emb = get_text_embedding(input_text)

    emoji_emb_2v = get_emoji_embedding(input_text, emoji2vec, emoji2vec_proj, 300)
    final_emb_2v = combine_embeddings(text_emb, emoji_emb_2v)
    dists_2v, idxs_2v = search_similar_quotes(final_emb_2v)
    quotes_2v = [quotes_df.iloc[i]['quote'] for i in idxs_2v]

    emoji_emb_ej = get_emoji_embedding(input_text, emojinal, emojinal_proj, 300)
    final_emb_ej = combine_embeddings(text_emb, emoji_emb_ej)
    dists_ej, idxs_ej = search_similar_quotes(final_emb_ej)
    quotes_ej = [quotes_df.iloc[i]['quote'] for i in idxs_ej]

    results.append({
        'text': input_text,
        'avg_similarity_emoji2vec': np.mean(dists_2v),
        'quotes_emoji2vec': quotes_2v,
        'avg_similarity_emojinal': np.mean(dists_ej),
        'quotes_emojinal': quotes_ej
    })


100%|██████████| 100/100 [00:31<00:00,  3.21it/s]
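
The per-query entries collected in `results` can be summarized before reading individual examples. The sketch below averages the similarity per method and counts how often each method scores higher. The helper name `summarize` is hypothetical and the numbers in `toy_results` are made-up placeholders, not values from this run.

```python
from statistics import mean

def summarize(results):
    """Aggregate mean similarity and per-query wins for both emoji embeddings."""
    e2v = [float(r["avg_similarity_emoji2vec"]) for r in results]
    ejn = [float(r["avg_similarity_emojinal"]) for r in results]
    return {
        "mean_emoji2vec": mean(e2v),
        "mean_emojinal": mean(ejn),
        "emoji2vec_wins": sum(a > b for a, b in zip(e2v, ejn)),
        "emojinal_wins": sum(b > a for a, b in zip(e2v, ejn)),
    }

# Placeholder values for illustration only (assumption, not measured results)
toy_results = [
    {"avg_similarity_emoji2vec": 0.10, "avg_similarity_emojinal": 0.12},
    {"avg_similarity_emoji2vec": 0.08, "avg_similarity_emojinal": 0.05},
    {"avg_similarity_emoji2vec": 0.06, "avg_similarity_emojinal": 0.09},
]
summary = summarize(toy_results)
print(summary)
```

Calling `summarize(results)` on the real list would give one compact comparison of Emoji2Vec and Emojinal across all 100 queries.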


In [40]:
print("\n=== Detailed Results ===")
for idx, res in enumerate(results[:10], 1):
    print(f"\nQuery {idx}: {res['text']}")
    print(f"[Emoji2Vec] Avg Cosine Similarity: {res['avg_similarity_emoji2vec']:.4f}")
    for i, q in enumerate(res['quotes_emoji2vec']):
        print(f"  {i+1}. {q}")
    print(f"[Emojinal ] Avg Cosine Similarity: {res['avg_similarity_emojinal']:.4f}")
    for i, q in enumerate(res['quotes_emojinal']):
        print(f"  {i+1}. {q}")


=== Detailed Results ===

Query 1: @z388z @IFLTV Yh I had to search my house and take a picture of it 🤡
[Emoji2Vec] Avg Cosine Similarity: 0.1118
  1. I can't not put humor in a book.
  2. In an unforgiving world, chaos rules.
  3. I wasn't going to have fun doing a teen movie again.
  4. No one will laugh at how great things are for somebody.
  5. I do not have a sense of humor of any recognizable sort.
[Emojinal ] Avg Cosine Similarity: 0.1217
  1. I can't not put humor in a book.
  2. It's hard to force creativity and humor.
  3. No great thing is created suddenly.
  4. No one will laugh at how great things are for somebody.
  5. I wasn't going to have fun doing a teen movie again.

Query 2: just booked hotel for 5sos in prague even when I'm not sure if i get my ticket 
SO hoping for good luck while ticketing 🤞🤞🤞
[Emoji2Vec] Avg Cosine Similarity: 0.0573
  1. Even in the midst of the storm the sun is still shining.
  2. Leadership is a skill learned through many venues.
  3. The mo

## Results Analysis
In this iteration, text and emoji were embedded separately and then combined through linear layers that were randomly initialized and never trained.

Key Findings:

* Cosine similarities dropped significantly (often < 0.15) compared with Iterations 1 and 2.

* Some emotional context (e.g., sarcasm or joy) was captured better than in the demojized version. In Query 1, for example, the 🤡 emoji led mostly to humor-related quotes.

* Neither Emoji2Vec nor Emojinal was consistently better.

* The untrained projection layers likely hurt performance: the fusion was shallow and unoptimized.
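
One plausible explanation for the low similarities is that an untrained linear layer moves the query out of the space in which the quote embeddings live. The NumPy sketch below illustrates the effect under stated assumptions: `x` is a random stand-in for an embedding, and `W` is a random matrix drawn uniformly with bound `1/sqrt(dim)`, roughly mimicking the default initialization of a linear layer. It does not use the notebook's actual layers or data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 768

x = rng.standard_normal(dim)                      # stand-in embedding (assumption)
bound = 1.0 / np.sqrt(dim)
W = rng.uniform(-bound, bound, size=(dim, dim))   # untrained random linear map

cos_before = cosine(x, x)        # a vector compared with itself
cos_after = cosine(x, W @ x)     # the same vector after the random map

print(round(cos_before, 3), round(cos_after, 3))
```

After the random map the vector is almost orthogonal to its original direction, so even a perfect match in the index would score near zero. Training the projection and fusion layers, for instance with a contrastive objective, is one way such an alignment could be restored.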