# 1st Iteration

As a baseline, let us start with the simplest approach: do not handle emojis at all (treat them as UNK). The main purpose of this iteration is to check whether emojis add any emotional value to the text.

Main steps:
1. Choose an emoji-rich dataset
2. Delete the emojis from it
3. Create embeddings from the input text
4. Search the database for the most relevant text
5. Draw conclusions about actual relevance using metrics

In [16]:
import pandas as pd
import numpy as np
import emoji
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
from tabulate import tabulate
import torch

## 1. Prepare CSV files

In [36]:
all_dfs = []
for file_name in os.listdir("archive"):
    if file_name.endswith('.csv'):
        file_path = os.path.join("archive", file_name)
        try:
            # Read the CSV with pandas defaults (no explicit encoding is set)
            df = pd.read_csv(file_path)
            all_dfs.append(df)
            print(f"Loaded {file_name} with {len(df)} rows")
        except Exception as e:
            print(f"Error loading {file_name}: {e}")

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined dataset size: {len(combined_df)} rows")

Loaded face_savoring_food.csv with 20000 rows
Loaded egg.csv with 20001 rows
Loaded fearful_face.csv with 20000 rows
Loaded sun.csv with 20000 rows
Loaded eyes.csv with 20000 rows
Error loading backhand_index_pointing_right.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Loaded smiling_face_with_hearts.csv with 20000 rows
Loaded smiling_face_with_tear.csv with 20000 rows
Error loading red_heart.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Loaded rolling_on_the_floor_laughing.csv with 20000 rows
Loaded check_mark.csv with 20000 rows
Loaded face_holding_back_tears.csv with 20000 rows
Loaded pile_of_poo.csv with 20000 rows
Loaded enraged_face.csv with 20000 rows
Loaded loudly_crying_face.csv with 20000 rows
Loaded partying_face.csv with 20001 rows
Loaded winking_face.csv with 20000 rows
Loaded face_with_tears_of_joy.csv with 20000 rows
Loaded grinning_face_with_sweat.csv with 20000 rows
Loaded t

In [38]:
combined_df.head()

Unnamed: 0,Text
0,"@PastorAlexLove Thank you, pastor. My mouth sh..."
1,"So horny right now, sending pics of my thick h..."
2,😋 I will be quiet cause she already know.
3,tonights supper is fake bake and chips. tomorr...
4,Bout to make my linguini 😋


In [39]:
combined_df.to_csv("combined_dataset.csv", index=False)

## 2. Load dataset and filter rows with emojis

In [2]:
# Function to remove emojis
def remove_emoji(text):
    return emoji.replace_emoji(text, replace='')  # Replace emojis with empty string

# Function to generate embeddings
def get_embedding(text, tokenizer, model, device='cpu'):
    # Tokenization
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: val.to(device) for key, val in inputs.items()}
    
    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # Extract the last hidden state (batch_size, seq_len, hidden_size)
    hidden_states = outputs.hidden_states[-1]  # Last layer
    # Use the [CLS] token embedding (first position)
    cls_embedding = hidden_states[:, 0, :].squeeze().cpu().numpy()
    return cls_embedding.astype(np.float32)

In [3]:
# Loading the model
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")

In [4]:
# Load the dataset
df = pd.read_csv('combined_dataset.csv')

In [5]:
# Filter if text contains emoji
df['has_emoji'] = df['Text'].apply(lambda x: bool(emoji.emoji_count(str(x))))
emoji_rich_df = df[df['has_emoji']].copy()

In [6]:
emoji_rich_df['Text'].head()

0    @PastorAlexLove Thank you, pastor. My mouth sh...
1    So horny right now, sending pics of my thick h...
2           😋  I will be quiet cause she already know.
3    tonights supper is fake bake and chips. tomorr...
4                           Bout to make my linguini 😋
Name: Text, dtype: object

In [7]:
# Select a subset
df_subset = emoji_rich_df.sample(n=100, random_state=100).reset_index(drop=True)

In [8]:
df_subset.head()

Unnamed: 0,Text,has_emoji
0,@BEIDOUlSM i hope you have lovely dreams and n...,True
1,@ErfderEmpires I enjoy people who think not ha...,True
2,"@TheRealJosh05 @PFF All Pro LT, Top 3 Tackle i...",True
3,@girllikeglory Do shout-out for me na 🥲😭,True
4,i don't even know how to describe him anymore 🫠,True


## 3. Create Faiss index

In [9]:
# Load quotes embeddings
quotes_df = pd.read_csv('selected_quotes_embeddings.csv')

In [10]:
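import ast

# Function to safely parse embeddings stored as strings, e.g. "[0.1, 0.2, ...]"
def parse_embedding(emb):
    try:
        if isinstance(emb, str):
            # literal_eval accepts only Python literals, whereas eval would run arbitrary code
            return np.array(ast.literal_eval(emb), dtype=np.float32)
        return np.array(emb, dtype=np.float32)
    except Exception as e:
        print(f"Error parsing embedding: {e}")
        return None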

In [11]:
# Create a flat (exact) Faiss index for similarity search
index = faiss.IndexFlatIP(768)  # Inner product; equals cosine similarity once vectors are L2-normalized
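
Why an inner-product index can serve as a cosine-similarity index is easy to check numerically. The following is a minimal NumPy-only sketch, not part of the original notebook: the helper `l2_normalize` and the random toy vectors are assumptions made for illustration, and no Faiss call is involved.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a copy of `vectors` with every row scaled to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy data standing in for quote embeddings and one query embedding (assumption).
rng = np.random.default_rng(0)
database = rng.normal(size=(4, 768)).astype(np.float32)
query = rng.normal(size=(1, 768)).astype(np.float32)

# Cosine similarity computed directly from its definition.
cosine = (database @ query.T).ravel() / (
    np.linalg.norm(database, axis=1) * np.linalg.norm(query)
)

# Inner product of the normalized vectors, the quantity an inner-product index scores.
inner = (l2_normalize(database) @ l2_normalize(query).T).ravel()

print(np.allclose(cosine, inner, atol=1e-5))
```

The equality only holds when both the indexed vectors and the query are normalized, which is why normalization appears on both sides of the search.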

In [12]:
chunk_size = 10000
num_chunks = -(-len(quotes_df) // chunk_size)  # Ceiling division
for i in range(0, len(quotes_df), chunk_size):
    chunk = quotes_df.iloc[i:i + chunk_size]
    chunk_embeddings = [parse_embedding(emb) for emb in chunk['embeddings']]
    # Only keep valid embeddings. Caveat: any row dropped here shifts the index
    # positions, so they no longer line up with quotes_df row numbers.
    chunk_embeddings = [emb for emb in chunk_embeddings if emb is not None and emb.shape == (768,)]
    if chunk_embeddings:
        chunk_array = np.vstack(chunk_embeddings)  # Combine only valid embeddings
        faiss.normalize_L2(chunk_array)  # Normalize in place for cosine similarity search
        index.add(chunk_array)  # Add to index
    print(f"Processed chunk {i // chunk_size + 1}/{num_chunks}")

Processed chunk 1/11
Processed chunk 2/11
Processed chunk 3/11
Processed chunk 4/11
Processed chunk 5/11
Processed chunk 6/11
Processed chunk 7/11
Processed chunk 8/11
Processed chunk 9/11
Processed chunk 10/11
Processed chunk 11/11
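
If any embedding is skipped as invalid while the index is built, positions returned by the index stop matching row numbers in `quotes_df`. One possible safeguard is to record the source row of every vector that is actually added. The sketch below illustrates the bookkeeping with toy data and NumPy only; the names `valid_row_ids` and `row_lookup` and the sample vectors are hypothetical and do not appear in the notebook.

```python
import numpy as np

DIM = 4  # Toy dimension for illustration; the notebook uses 768

# Hypothetical parsed embeddings: None marks a row that failed to parse.
parsed = [
    np.ones(DIM, dtype=np.float32),
    None,
    np.zeros(3, dtype=np.float32),        # Wrong shape, will be skipped
    np.full(DIM, 2.0, dtype=np.float32),
]

valid_vectors = []
valid_row_ids = []  # valid_row_ids[p] is the source row stored at index position p
for row_id, emb in enumerate(parsed):
    if emb is not None and emb.shape == (DIM,):
        valid_vectors.append(emb)
        valid_row_ids.append(row_id)

matrix = np.vstack(valid_vectors)   # What would be passed to index.add
row_lookup = np.array(valid_row_ids)

# A search result is a position in `matrix`; translate it before calling .iloc
index_position = 1
print(row_lookup[index_position])  # 3
```

With such a lookup, a retrieved position would be mapped through `row_lookup` before indexing into the DataFrame.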


In [13]:
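# Function to search top-k similar quotes
def search_similar_quotes(query_embedding, k=5):
    # Work on a contiguous float32 copy of shape (1, dim) so the caller's array is not modified
    query = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
    faiss.normalize_L2(query)  # In-place normalization, matching the indexed vectors
    distances, indices = index.search(query, k)
    return distances[0], indices[0]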

## 4. Evaluation

In [17]:
# Baseline run: retrieve quotes for each text with its emojis removed
results = []
for idx, row in df_subset.iterrows():
    input_text = row['Text']
    
    # Emojis Removed
    clean_text = remove_emoji(input_text)
    clean_embedding = get_embedding(clean_text, tokenizer, model)
    clean_distances, clean_indices = search_similar_quotes(clean_embedding)
    clean_avg_similarity = np.mean(clean_distances)
    # Map index positions back to quote texts (assumes index positions match quotes_df rows)
    clean_quotes = [quotes_df.iloc[quote_idx]['quote'] for quote_idx in clean_indices]

    
    results.append({
        'text': input_text,
        'clean_avg_similarity': clean_avg_similarity,
        'clean_quotes': clean_quotes
    })
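
The loop above records only the emoji-free run. To judge whether emojis change what is retrieved, one option is to compare the top-k indices of two runs for the same text. The sketch below uses Jaccard overlap for that purpose; it is an illustration rather than the notebook's metric. The function `topk_jaccard` is hypothetical, and the two index lists are made-up placeholders for results that would come from searching with and without emojis.

```python
def topk_jaccard(indices_a, indices_b) -> float:
    """Jaccard overlap between two top-k result lists (1.0 means identical sets)."""
    set_a = {int(i) for i in indices_a}
    set_b = {int(i) for i in indices_b}
    union = set_a | set_b
    if not union:
        return 1.0  # Two empty result lists are treated as identical
    return len(set_a & set_b) / len(union)

# Placeholder indices (assumption): top-5 results without and with emojis.
clean_indices = [12, 40, 7, 99, 3]
raw_indices = [12, 7, 55, 3, 81]

print(round(topk_jaccard(clean_indices, raw_indices), 4))  # 0.4286
```

A low overlap would indicate that emojis shift the retrieved set; it says nothing by itself about which set is more relevant.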

In [18]:
# Pretty-print results for a few queries
print("\n=== Detailed Results ===")
for idx, res in enumerate(results[:10], 1):  # Show first 10 queries
    print(f"\nQuery {idx}: {res['text']}")
    print(f"Average Cosine Similarity (Emojis Removed): {res['clean_avg_similarity']:.4f}")
    
    clean_table = pd.DataFrame(res)
    
    print("\nTop-K Quotes (Emojis Removed):")
    print(tabulate(clean_table, headers='keys', tablefmt='psql', showindex=True, floatfmt='.4f'))


=== Detailed Results ===

Query 1: @BEIDOUlSM i hope you have lovely dreams and not terrible angsty kaveh nighmares tonight!! 😔😔
Average Cosine Similarity (Emojis Removed): 0.9366

Top-K Quotes (Emojis Removed):
+----+-------------------------------------------------------------------------------------------------+------------------------+------------------------------------------------------+
|    | text                                                                                            |   clean_avg_similarity | clean_quotes                                         |
|----+-------------------------------------------------------------------------------------------------+------------------------+------------------------------------------------------|
|  0 | @BEIDOUlSM i hope you have lovely dreams and not terrible angsty kaveh nighmares tonight!! 😔😔 |                 0.9366 | A blessed hope, a blessed life.                      |
|  1 | @BEIDOUlSM i hope you have lovely dreams a

## Results analysis

The results show high cosine similarities (`0.8959–0.9761`) when emojis are removed, indicating strong syntactic matching with Quotes-500K. However, the retrieved quotes often lack semantic and emotional relevance. For example, "@BEIDOUlSM i hope you have lovely dreams..." (0.9366) returns "A blessed hope, a blessed life," missing the sadness conveyed by 😔; in other cases the sarcasm hidden behind emojis is lost. Emojis are critical for emotional context in social media, and removing them discards that context. When an input query consists only of emojis, nothing is left to embed, so no meaningful result can be obtained. Possible improvements include converting emojis to text (e.g., `emoji.demojize`) or using emoji-aware models such as emojinal.
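
To illustrate the idea of converting emojis to text, here is a minimal standard-library sketch. It does not call `emoji.demojize`; instead it swaps in `unicodedata.name` as a rough stand-in so that the example has no third-party dependency. The function `emojis_to_text` is hypothetical, and treating every character of Unicode category `So` as an emoji is a simplifying assumption: it also catches symbols such as ©, and it does not handle skin-tone modifiers or joined emoji sequences.

```python
import unicodedata

def emojis_to_text(text: str) -> str:
    """Replace symbol characters with a :name: token built from their Unicode names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":  # Heuristic emoji test (assumption)
            name = unicodedata.name(ch, "")
            if name:
                pieces.append(" :" + name.lower().replace(" ", "_") + ": ")
            # Symbols without a known name are dropped
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around the tokens
    return " ".join("".join(pieces).split())

print(emojis_to_text("nightmares tonight!! 😔😔"))
# nightmares tonight!! :pensive_face: :pensive_face:
print(emojis_to_text("😔"))
# :pensive_face:
```

Unlike plain removal, an emoji-only query still yields non-empty text that can be embedded.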