# 1st Iteration

As a first step, let us try the simplest approach, our baseline: do not handle emojis at all (treat them as UNK). The main purpose of this iteration is to check whether emojis add any emotional value to the text.

Main steps:
1. Choose an emoji-rich dataset
2. Delete emojis from it
3. Create embeddings from the input text
4. Search the database for the most relevant text
5. Draw conclusions about actual relevance using metrics
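The steps above can be sketched end to end on toy data. This is only an illustration under loud assumptions: the regex is a hypothetical stand-in for `emoji.replace_emoji` and covers just a few emoji blocks, the bag-of-words `embed` replaces the transformer model used in this notebook, and the three-sentence `database` is invented.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for emoji.replace_emoji: covers only a few emoji blocks.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_emoji(text):
    """Step 2: delete emojis from the text."""
    return EMOJI_PATTERN.sub("", text).strip()

def embed(text):
    """Step 3: toy bag-of-words 'embedding' (the notebook uses a transformer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, database, k=2):
    """Step 4: return the k database texts most similar to the emoji-free query."""
    q = embed(strip_emoji(query))
    scored = [(cosine(q, embed(doc)), doc) for doc in database]
    return sorted(scored, reverse=True)[:k]

# Invented mini database, for illustration only.
database = ["that was a cold move", "the sun is warm today", "cold pizza for breakfast"]
results = top_k("Nah that was cold 😭💀", database)
for score, doc in results:
    print(f"{score:.4f}  {doc}")
```

Step 5 (judging relevance) is deliberately left out here, because it needs labeled data rather than similarity scores.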

In [21]:
import pandas as pd
import numpy as np
import emoji
import os
from sentence_transformers import SentenceTransformer
import faiss
from tabulate import tabulate

## 1. Prepare csv files

In [36]:
all_dfs = []
for file_name in os.listdir("archive"):
    if file_name.endswith('.csv'):
        file_path = os.path.join("archive", file_name)
        try:
            # Read the CSV with pandas defaults; files that fail to parse are reported and skipped
            df = pd.read_csv(file_path)
            all_dfs.append(df)
            print(f"Loaded {file_name} with {len(df)} rows")
        except Exception as e:
            print(f"Error loading {file_name}: {e}")

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined dataset size: {len(combined_df)} rows")

Loaded face_savoring_food.csv with 20000 rows
Loaded egg.csv with 20001 rows
Loaded fearful_face.csv with 20000 rows
Loaded sun.csv with 20000 rows
Loaded eyes.csv with 20000 rows
Error loading backhand_index_pointing_right.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Loaded smiling_face_with_hearts.csv with 20000 rows
Loaded smiling_face_with_tear.csv with 20000 rows
Error loading red_heart.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Loaded rolling_on_the_floor_laughing.csv with 20000 rows
Loaded check_mark.csv with 20000 rows
Loaded face_holding_back_tears.csv with 20000 rows
Loaded pile_of_poo.csv with 20000 rows
Loaded enraged_face.csv with 20000 rows
Loaded loudly_crying_face.csv with 20000 rows
Loaded partying_face.csv with 20001 rows
Loaded winking_face.csv with 20000 rows
Loaded face_with_tears_of_joy.csv with 20000 rows
Loaded grinning_face_with_sweat.csv with 20000 rows
Loaded t

In [38]:
combined_df.head()

Unnamed: 0,Text
0,"@PastorAlexLove Thank you, pastor. My mouth sh..."
1,"So horny right now, sending pics of my thick h..."
2,😋 I will be quiet cause she already know.
3,tonights supper is fake bake and chips. tomorr...
4,Bout to make my linguini 😋


In [39]:
combined_df.to_csv("combined_dataset.csv", index=False)

## 2. Load dataset and filter rows with emojis

In [2]:
# Function to remove emojis
def remove_emoji(text):
    return emoji.replace_emoji(text, replace='')  # Replace emojis with empty string

# Function to generate embeddings
def get_embedding(text):
    return model.encode([text])[0]

In [3]:
# Loading the model
model = SentenceTransformer('SamLowe/roberta-base-go_emotions')

No sentence-transformers model found with name SamLowe/roberta-base-go_emotions. Creating a new one with mean pooling.
Some weights of RobertaModel were not initialized from the model checkpoint at SamLowe/roberta-base-go_emotions and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Load the dataset
df = pd.read_csv('combined_dataset.csv')

In [5]:
# Filter if text contains emoji
df['has_emoji'] = df['Text'].apply(lambda x: bool(emoji.emoji_count(str(x))))
emoji_rich_df = df[df['has_emoji']].copy()

In [6]:
emoji_rich_df['Text'].head()

0    @PastorAlexLove Thank you, pastor. My mouth sh...
1    So horny right now, sending pics of my thick h...
2           😋  I will be quiet cause she already know.
3    tonights supper is fake bake and chips. tomorr...
4                           Bout to make my linguini 😋
Name: Text, dtype: object

In [7]:
# Select a subset
df_subset = emoji_rich_df.sample(n=100, random_state=42).reset_index(drop=True)

In [8]:
df_subset.head()

Unnamed: 0,Text,has_emoji
0,@vini_ball Nah that was Cold ngl 😭💀,True
1,@PStyle0ne1 give out of yours 🖕🤦 peacemaker,True
2,Y’all ever went met some of y’all friends peop...,True
3,🐳 WHALE TRADE ALERT (CEX) \n\n4 $BTC has been ...,True
4,My thighs still hurt from doing 3 sets of 10 d...,True


## 3. Create Faiss index

In [9]:
# Load quotes embeddings
quotes_df = pd.read_csv('selected_quotes_embeddings.csv')

In [10]:
import ast

# Function to safely parse embeddings
def parse_embedding(emb):
    try:
        if isinstance(emb, str):
            # literal_eval only accepts Python literals, unlike eval, which executes arbitrary code
            return np.array(ast.literal_eval(emb), dtype=np.float32)
        return np.array(emb, dtype=np.float32)
    except Exception as e:
        print(f"Error parsing embedding: {e}")
        return None

In [11]:
# Create Faiss index for efficient similarity search
index = faiss.IndexFlatIP(768)  # Inner product; equals cosine similarity once vectors are L2-normalized
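An inner-product index only yields cosine similarity if both the stored vectors and the query are L2-normalized first. The following minimal sketch checks that identity on two made-up vectors in plain Python; it does not use Faiss, and the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Made-up vectors with lengths 5 and 3.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

raw_ip = inner_product(a, b)  # depends on vector lengths
normalized_ip = inner_product(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)

print(raw_ip, normalized_ip, cos)
```

Without the normalization step, longer vectors would score higher regardless of direction, which is why the notebook normalizes before both `index.add` and `index.search`.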

In [12]:
chunk_size = 10000
num_chunks = -(-len(quotes_df) // chunk_size)  # ceiling division
for i in range(0, len(quotes_df), chunk_size):
    chunk = quotes_df.iloc[i:i + chunk_size]
    chunk_embeddings = [parse_embedding(emb) for emb in chunk['embeddings']]
    # Only keep valid embeddings. Note: every dropped row shifts later index positions
    # relative to quotes_df, so position-based lookups assume nothing was dropped.
    chunk_embeddings = [emb for emb in chunk_embeddings if emb is not None and emb.shape == (768,)]
    if chunk_embeddings:
        chunk_array = np.vstack(chunk_embeddings)  # Combine only valid embeddings
        faiss.normalize_L2(chunk_array)  # Normalize for cosine similarity search
        index.add(chunk_array)  # Add to index
    print(f"Processed chunk {i // chunk_size + 1}/{num_chunks}")

Processed chunk 1/11
Processed chunk 2/11
Processed chunk 3/11
Processed chunk 4/11
Processed chunk 5/11
Processed chunk 6/11
Processed chunk 7/11
Processed chunk 8/11
Processed chunk 9/11
Processed chunk 10/11
Processed chunk 11/11
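One caveat of the loop above: it filters out invalid embeddings before adding them to the index, so if any row is dropped, index positions no longer equal row positions in `quotes_df`. A minimal sketch of one way to keep the mapping follows, using plain lists as hypothetical stand-ins for the DataFrame and the Faiss index. Whether any rows were actually dropped in this run is not known from the output.

```python
# Hypothetical miniature of the indexing loop: strings stand in for rows of
# quotes_df, and None marks an embedding that failed to parse.
quotes = ["quote A", "quote B", "quote C", "quote D"]
embeddings = [[1.0, 0.0], None, [0.0, 1.0], [0.6, 0.8]]

index_vectors = []  # stands in for the vectors stored in the Faiss index
index_to_row = []   # index_to_row[i] = original row of the i-th stored vector

for row, emb in enumerate(embeddings):
    if emb is None:
        continue  # a skipped row shifts every later position in the index
    index_vectors.append(emb)
    index_to_row.append(row)

def lookup(index_position):
    """Translate a position returned by the index back to the original row."""
    return quotes[index_to_row[index_position]]

naive = quotes[1]   # treats index position 1 as a row number
mapped = lookup(1)  # position 1 actually holds the vector of row 2

print(naive, "|", mapped)
```

In the notebook, the same idea would mean collecting the original row positions alongside `chunk_embeddings` and using that list when reading quotes back.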


In [13]:
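# Function to search top-k similar quotes
def search_similar_quotes(query_embedding, k=5):
    # Work on a float32 copy so the caller's array is not normalized in place
    query = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
    faiss.normalize_L2(query)
    # With normalized vectors, IndexFlatIP returns cosine similarities (higher is closer)
    distances, indices = index.search(query, k)
    return distances[0], indices[0]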

## 4. Evaluation

In [16]:
# Evaluate retrieval with emojis removed (baseline)
results = []
for _, row in df_subset.iterrows():
    input_text = row['Text']

    # Emojis removed
    clean_text = remove_emoji(input_text)
    clean_embedding = get_embedding(clean_text)
    clean_distances, clean_indices = search_similar_quotes(clean_embedding)
    clean_avg_similarity = np.mean(clean_distances)
    # Separate loop variable so the retrieved positions are not confused with the row index
    clean_quotes = [quotes_df.iloc[i]['quote'] for i in clean_indices]

    results.append({
        'text': input_text,
        'clean_avg_similarity': clean_avg_similarity,
        'clean_quotes': clean_quotes
    })

In [22]:
# Pretty-print results for a few queries
print("\n=== Detailed Results ===")
for idx, res in enumerate(results[:10], 1):  # Show first 10 queries
    print(f"\nQuery {idx}: {res['text']}")
    print(f"Average Cosine Similarity (Emojis Removed): {res['clean_avg_similarity']:.4f}")
    
    clean_table = pd.DataFrame(res)
    
    print("\nTop-K Quotes (Emojis Removed):")
    print(tabulate(clean_table, headers='keys', tablefmt='psql', showindex=True, floatfmt='.4f'))


=== Detailed Results ===

Query 1: @vini_ball Nah that was Cold ngl 😭💀
Average Cosine Similarity (Emojis Removed): 0.9799

Top-K Quotes (Emojis Removed):
+----+---------------------------------------+------------------------+-----------------------------------------------------------------+
|    | text                                  |   clean_avg_similarity | clean_quotes                                                    |
|----+---------------------------------------+------------------------+-----------------------------------------------------------------|
|  0 | @vini_ball Nah that was Cold ngl 😭💀 |                 0.9799 | [C]apitalism--democracy's sidekick                              |
|  1 | @vini_ball Nah that was Cold ngl 😭💀 |                 0.9799 | Rencana Allah selalu indah...                                   |
|  2 | @vini_ball Nah that was Cold ngl 😭💀 |                 0.9799 | Booty Butt, Booty Butt, Booty Butt Cheeks                       |
|  3 | @vini_ball Nah t

## Results analysis

With emojis removed, the input texts reach high cosine similarities (0.7995–0.9893) against quotes from the Quotes-500K database. At face value this suggests close matches, but the retrieved quotes often lack relevance in meaning or emotion. For instance, "@vini_ball Nah that was Cold ngl 😭💀" (average similarity 0.9799) returns "[C]apitalism--democracy's sidekick", which misses the original's informal, emotional tone. The high scores may partly be an artifact of the embedding model, which was loaded with default mean pooling rather than trained for sentence similarity (see the warning printed when the model was loaded). Removing emojis reduces noise but also strips emotional context, since emojis like 😭 and 💀 convey key nuances. Possible improvements are to retain emojis, convert them to text (e.g., via emoji.demojize), or use emoji-aware models like emojinal. Cosine similarity alone does not measure relevance; relevance would be better assessed with metrics like Precision@K or MRR on a labeled validation set.
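As a rough illustration of the metrics mentioned above, the sketch below computes Precision@K and MRR in plain Python. The quote ids and relevance labels are invented for the example; this notebook has no labeled validation set yet, so the function names and data layout are assumptions.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for item in top if item in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

# Hypothetical labeled queries: (ranked quote ids, set of relevant quote ids).
runs = [
    (["q7", "q2", "q9"], {"q2"}),
    (["q4", "q1", "q3"], {"q4", "q3"}),
    (["q5", "q6", "q8"], set()),
]

p_at_3 = [precision_at_k(retrieved, relevant, 3) for retrieved, relevant in runs]
mrr = mean_reciprocal_rank(runs)

print(p_at_3, mrr)
```

Unlike average cosine similarity, both metrics drop to zero for a query whose retrieved quotes are all irrelevant, which is the failure mode observed in this iteration.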