# Iteration 2 vs Iteration 4
In this section, we determine which iteration finds suitable quotes best.
Iteration 1 did not take emoji into account at all, so we exclude it from the comparison.
Iteration 3 scored too low because its model could not be trained. Although it does take emoji into account when forming embeddings, we treat it as an automatic loser in this comparison.

In [None]:
!pip install emoji
!pip install faiss-cpu


In [3]:
import pandas as pd
import numpy as np
import emoji
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
from tabulate import tabulate
import torch

## 1. Prepare csv files

In [4]:
all_dfs = []
for file_name in os.listdir("archive"):
    if file_name.endswith('.csv'):
        file_path = os.path.join("archive", file_name)
        try:
            # Read the CSV with pandas defaults; files that fail to parse are reported and skipped
            df = pd.read_csv(file_path)
            all_dfs.append(df)
            print(f"Loaded {file_name} with {len(df)} rows")
        except Exception as e:
            print(f"Error loading {file_name}: {e}")

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined dataset size: {len(combined_df)} rows")
print()
combined_df.head()
combined_df.to_csv("combined_dataset.csv", index=False)


Loaded rabbit.csv with 20000 rows
Loaded thinking_face.csv with 20000 rows
Loaded hot_face.csv with 20000 rows
Loaded smiling_face_with_heart-eyes.csv with 20000 rows
Loaded fire.csv with 20000 rows
Loaded folded_hands.csv with 20000 rows
Loaded fearful_face.csv with 20000 rows
Loaded rolling_on_the_floor_laughing.csv with 20000 rows
Loaded saluting_face.csv with 20000 rows
Loaded white_heart.csv with 20000 rows
Loaded grinning_face_with_sweat.csv with 20000 rows
Loaded thumbs_up.csv with 20000 rows
Loaded cooking.csv with 20000 rows
Loaded face_with_steam_from_nose.csv with 20000 rows
Loaded rabbit_face.csv with 20000 rows
Loaded sparkles.csv with 20000 rows
Loaded smiling_face_with_halo.csv with 20000 rows
Loaded ghost.csv with 20000 rows
Loaded hatching_chick.csv with 20000 rows
Error loading backhand_index_pointing_right.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Error loading red_heart.csv: Error tokenizing data. C error: Buffer o

## 2. Load dataset and filter rows with emojis

In [5]:
df = pd.read_csv('combined_dataset.csv')
df['has_emoji'] = df['Text'].apply(lambda x: bool(emoji.emoji_count(str(x))))
emoji_rich_df = df[df['has_emoji']].copy()

# Select 100 random records
df_subset = emoji_rich_df.sample(n=100, random_state=100).reset_index(drop=True)


## 3. Create embeddings of the text with emoji description

In [22]:
# Convert Emoji to Text Description
def demojize_text(text):
    text = emoji.demojize(str(text), language='en')  # 😢 → ":crying_face:"
    text = text.replace(":", "").replace("_", " ")
    return text


In [None]:
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")

def get_embedding(text, tokenizer, model, device='cpu'):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    hidden_states = outputs.hidden_states[-1]
    cls_embedding = hidden_states[:, 0, :].squeeze().cpu().numpy()
    return cls_embedding.astype(np.float32)


## 4. Create Faiss index

In [10]:
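import ast

quotes_df = pd.read_csv('selected_quotes_embeddings.csv')

def parse_embedding(emb):
    """Parse a stored embedding (list literal as a string, or a sequence) into float32."""
    try:
        if isinstance(emb, str):
            # literal_eval accepts only Python literals, so it cannot run arbitrary code like eval
            return np.array(ast.literal_eval(emb), dtype=np.float32)
        return np.array(emb, dtype=np.float32)
    except Exception as e:
        print(f"Error parsing embedding: {e}")
        return None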

index = faiss.IndexFlatIP(768)

for i in range(0, len(quotes_df), 10000):
    chunk = quotes_df.iloc[i:i + 10000]
    chunk_embeddings = [parse_embedding(emb) for emb in chunk['embeddings']]
    chunk_embeddings = [emb for emb in chunk_embeddings if emb is not None and emb.shape == (768,)]
    if chunk_embeddings:
        chunk_array = np.vstack(chunk_embeddings)
        faiss.normalize_L2(chunk_array)
        index.add(chunk_array)
    print(f"Processed chunk {i // 10000 + 1}/{len(quotes_df) // 10000 + 1}")


Processed chunk 1/11
Processed chunk 2/11
Processed chunk 3/11
Processed chunk 4/11
Processed chunk 5/11
Processed chunk 6/11
Processed chunk 7/11
Processed chunk 8/11
Processed chunk 9/11
Processed chunk 10/11
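
The index above is an `IndexFlatIP` (inner product) filled with L2-normalized vectors, which is why the scores it returns can be read as cosine similarities. The following is a minimal pure-Python sketch of that equivalence. The helper names and the 3-dimensional toy vectors are illustrative assumptions and are not part of the notebook, which works with 768-dimensional embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the per-row effect of faiss.normalize_L2)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product, the score an inner-product index computes."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Toy vectors standing in for a query embedding and a quote embedding.
query = [1.0, 2.0, 2.0]
quote = [2.0, 0.0, 1.0]

ip_after_normalization = inner_product(l2_normalize(query), l2_normalize(quote))
direct_cosine = cosine_similarity(query, quote)
print(ip_after_normalization, direct_cosine)  # the two values agree up to rounding
```

Because both the stored vectors and the query are normalized, the scores are bounded by 1, and a higher score means a smaller angle between the two embeddings.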


In [11]:
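def search_similar_quotes(query_embedding, k=5):
    # Normalize a (1, dim) float32 copy so the caller's array is left untouched
    query = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
    faiss.normalize_L2(query)
    # With normalized vectors, the inner-product scores are cosine similarities
    distances, indices = index.search(query, k)
    return distances[0], indices[0]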


In [12]:
def extract_emojis(text):
    # emoji_list keeps multi-codepoint emoji (variation selectors, ZWJ sequences) intact
    return ''.join(match['emoji'] for match in emoji.emoji_list(str(text)))

## 5. Comparison

In [23]:
results = []

for idx, row in df_subset.iterrows():
    input_text = row['Text']
    embedding = get_embedding(input_text, tokenizer, model)
    distances, indices = search_similar_quotes(embedding)
    avg_similarity = np.mean(distances)
    quotes = [quotes_df.iloc[i]['quote'] for i in indices]

    demojized_text = demojize_text(input_text)
    demojized_embedding = get_embedding(demojized_text, tokenizer, model)
    demojized_distances, demojized_indices = search_similar_quotes(demojized_embedding)
    demojized_avg_similarity = np.mean(demojized_distances)
    demojized_quotes = [quotes_df.iloc[i]['quote'] for i in demojized_indices]


    results.append({
        'text': input_text,
        'avg_similarity': avg_similarity,
        'quotes': quotes,
        'demojized_avg_similarity': demojized_avg_similarity,
        'demojized_quotes': demojized_quotes
    })


results_only_emo = []

for idx, row in df_subset.iterrows():
    input_text = row['Text']
    input_text = extract_emojis(input_text)
    embedding = get_embedding(input_text, tokenizer, model)
    distances, indices = search_similar_quotes(embedding)
    avg_similarity = np.mean(distances)
    quotes = [quotes_df.iloc[i]['quote'] for i in indices]

    demojized_text = demojize_text(input_text)
    demojized_embedding = get_embedding(demojized_text, tokenizer, model)
    demojized_distances, demojized_indices = search_similar_quotes(demojized_embedding)
    demojized_avg_similarity = np.mean(demojized_distances)
    demojized_quotes = [quotes_df.iloc[i]['quote'] for i in demojized_indices]


    results_only_emo.append({
        'text': input_text,
        'avg_similarity': avg_similarity,
        'quotes': quotes,
        'demojized_avg_similarity': demojized_avg_similarity,
        'demojized_quotes': demojized_quotes
    })

In [25]:
print("\n=== Detailed Results ===")
for idx, res in enumerate(results[10:15], 1):
    print(f"\nQuery {idx}: {res['text']}")
    print(f"Average Cosine Similarity (Iteration 2): {res['demojized_avg_similarity']:.4f}")
    print("Top-K Quotes:")
    for i, quote in enumerate(res['demojized_quotes']):
        print(f"{i+1}. {quote}")
    print(f"Average Cosine Similarity (Iteration 4): {res['avg_similarity']:.4f}")
    print("Top-K Quotes:")
    for i, quote in enumerate(res['quotes']):
        print(f"{i+1}. {quote}")

print("\n=== Detailed Results ===")
for idx, res in enumerate(results_only_emo[15:20], 1):
    print(f"\nQuery {idx}: {res['text']}")
    print(f"Average Cosine Similarity (Iteration 2): {res['demojized_avg_similarity']:.4f}")
    print("Top-K Quotes:")
    for i, quote in enumerate(res['demojized_quotes']):
        print(f"{i+1}. {quote}")
    print(f"Average Cosine Similarity (Iteration 4): {res['avg_similarity']:.4f}")
    print("Top-K Quotes:")
    for i, quote in enumerate(res['quotes']):
        print(f"{i+1}. {quote}")



=== Detailed Results ===

Query 1: Happy Easter beloved Souls. We Now Enlighten All of Earth with Radiant New WAVES that will further elevate your souls rise unto greater ARCHphathoms OF LIGHT. Just As a Souls LIVE♎️Mind, Rising up thru waters' PHATHOMIC CURRENTCIES, REQUIRES SUFFICIENT TIMEto Decompress Safely😇,
Average Cosine Similarity (Iteration 2): 0.9023
Top-K Quotes:
1. Happy Independence Day! Let Freedom Stream!!! NetworkEtiquette.net
2. Every New Year brings its sacred blessings.
3. Celebrate one soul, touch one heart, light one lamp; and the whole universe moves.
4. Welcome to Sex Media, Where Fantasy Becomes Reality!
5. Every year on your birthday, you get a chance to start new.
Average Cosine Similarity (Iteration 4): 0.8862
Top-K Quotes:
1. Every New Year brings its sacred blessings.
2. New Year's Day. A fresh start. A new chapter in life waiting to be written. New questions to be asked, embraced, and loved. Answers to be discovered and then lived in this transform

## Results Analysis: Iteration 2 vs Iteration 4
We compared Iteration 2 (emoji replaced by their text descriptions before embedding) and Iteration 4 (raw text with emoji embedded as-is) on a random sample of 100 emoji-rich queries. The goal was to identify which approach better captures the emotional and contextual meaning of the emoji input and retrieves semantically relevant quotes from the embedding database.

Summary of Findings:
* Iteration 2 outperformed Iteration 4 in the majority of test cases.

* Iteration 2 produced quote selections that were more emotionally aligned, contextually relevant, and coherent.

* Average cosine similarities were higher for Iteration 2 in almost all cases, and the retrieved quotes better reflected the emotion or theme of the emoji input.

* In contrast, Iteration 4 frequently retrieved off-topic or overly abstract quotes, sometimes missing the tone (e.g., joy, irony, intimacy) intended by the emoji input.
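
The claim about average cosine similarities can be checked by aggregating the per-query records collected in `results`. The following is a minimal sketch of such a summary, assuming each record carries the `avg_similarity` (Iteration 4) and `demojized_avg_similarity` (Iteration 2) keys used in the comparison loop. The `summarize` helper is hypothetical, and the records below are placeholders for illustration only, not measured values.

```python
from statistics import mean

def summarize(results):
    """Aggregate per-query scores into mean similarities and a win rate."""
    # float() also converts NumPy scalars such as np.float32 to plain floats
    it2 = [float(r["demojized_avg_similarity"]) for r in results]  # Iteration 2
    it4 = [float(r["avg_similarity"]) for r in results]            # Iteration 4
    wins_it2 = sum(a > b for a, b in zip(it2, it4))
    return {
        "mean_iteration_2": mean(it2),
        "mean_iteration_4": mean(it4),
        "iteration_2_win_rate": wins_it2 / len(results),
    }

# Placeholder records; replace with the `results` list built by the notebook.
toy_results = [
    {"avg_similarity": 0.50, "demojized_avg_similarity": 0.75},
    {"avg_similarity": 0.75, "demojized_avg_similarity": 0.50},
    {"avg_similarity": 0.25, "demojized_avg_similarity": 0.75},
    {"avg_similarity": 0.50, "demojized_avg_similarity": 1.00},
]
summary = summarize(toy_results)
print(summary)
```

In the notebook, the same function could be applied to both `results` and `results_only_emo`. Note that a higher cosine similarity only indicates closer embeddings, so the qualitative reading of the retrieved quotes remains necessary.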