# 2nd Iteration
This iteration uses `emoji.demojize` to replace each emoji with its text description. The goal is to evaluate whether this simple tool is enough to capture the emotional richness of emoji-laden text.

Main steps:
1. Choose an emoji-rich dataset
2. Convert emojis to text descriptions (instead of removing them)
3. Create embeddings from the enriched text
4. Search for the most relevant quotes using a FAISS index
5. Compare the quality of the retrieved quotes to the baseline

In [None]:
!pip install emoji
!pip install faiss-cpu


Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [None]:
import pandas as pd
import numpy as np
import emoji
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
from tabulate import tabulate
import torch


## 1. Prepare csv files

In [None]:
all_dfs = []
for file_name in os.listdir("archive"):
    if file_name.endswith('.csv'):
        file_path = os.path.join("archive", file_name)
        try:
            # Read with pandas defaults; files the parser cannot tokenize are reported and skipped
            df = pd.read_csv(file_path)
            all_dfs.append(df)
            print(f"Loaded {file_name} with {len(df)} rows")
        except Exception as e:
            print(f"Error loading {file_name}: {e}")

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined dataset size: {len(combined_df)} rows")
print()
combined_df.head()
combined_df.to_csv("combined_dataset.csv", index=False)

Loaded rabbit.csv with 20000 rows
Loaded thinking_face.csv with 20000 rows
Loaded hot_face.csv with 20000 rows
Loaded smiling_face_with_heart-eyes.csv with 20000 rows
Loaded fire.csv with 20000 rows
Loaded folded_hands.csv with 20000 rows
Loaded fearful_face.csv with 20000 rows
Loaded rolling_on_the_floor_laughing.csv with 20000 rows
Loaded saluting_face.csv with 20000 rows
Loaded white_heart.csv with 20000 rows
Loaded grinning_face_with_sweat.csv with 20000 rows
Loaded thumbs_up.csv with 20000 rows
Loaded cooking.csv with 20000 rows
Loaded face_with_steam_from_nose.csv with 20000 rows
Loaded rabbit_face.csv with 20000 rows
Loaded sparkles.csv with 20000 rows
Loaded smiling_face_with_halo.csv with 20000 rows
Loaded ghost.csv with 20000 rows
Loaded hatching_chick.csv with 20000 rows
Error loading backhand_index_pointing_right.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Error loading red_heart.csv: Error tokenizing data. C error: Buffer o

## 2. Load dataset and filter rows with emojis

In [None]:
df = pd.read_csv('combined_dataset.csv')
df['has_emoji'] = df['Text'].apply(lambda x: bool(emoji.emoji_count(str(x))))
emoji_rich_df = df[df['has_emoji']].copy()

# Select 100 random records
df_subset = emoji_rich_df.sample(n=100, random_state=100).reset_index(drop=True)


## 3. Create embeddings of the text with emoji description

In [None]:
# Convert Emoji to Text Description
def demojize_text(text):
    text = emoji.demojize(str(text), language='en')  # 😢 → ":crying_face:"
    text = text.replace(":", "").replace("_", " ")
    return text
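
One caveat with the helper above: `replace(":", "")` and `replace("_", " ")` act on the whole string, so colons in times or URLs and underscores in @handles are rewritten as well. Below is a minimal sketch of a narrower post-processing step. It assumes the text has already been passed through `emoji.demojize` with its default `:name:` delimiters. The helper `shortcodes_to_words` and its regex are illustrative assumptions, not part of the `emoji` library, and the pattern only covers short codes made of ASCII letters, digits, `_` and `-`.

```python
import re

# Matches :short_code: tokens such as :crossed_fingers: or :1st_place_medal:.
# Assumption: a name may start with digits, "_" or "-", but a letter must
# follow that prefix, so timestamps like "10:30:45" are left alone.
SHORTCODE_RE = re.compile(r":((?=[0-9_\-]*[a-z])[a-z0-9_\-]+):", re.IGNORECASE)

def shortcodes_to_words(demojized_text):
    """Turn :short_code: tokens into plain words, leaving other text intact."""
    def to_words(match):
        # Pad with spaces so adjacent short codes do not fuse together.
        return " " + match.group(1).replace("_", " ") + " "
    spaced = SHORTCODE_RE.sub(to_words, demojized_text)
    # Collapse the extra whitespace introduced by the padding.
    return re.sub(r"\s+", " ", spaced).strip()

print(shortcodes_to_words("good luck :crossed_fingers::crossed_fingers:"))
print(shortcodes_to_words("Meet at 10:30 @user_name :fire:"))
```

Note that the final whitespace collapse also merges line breaks, which may or may not be desirable for multi-line tweets.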

In [None]:
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")

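def get_embedding(text, tokenizer, model, device='cpu'):
    """Return the final-layer first-token (<s>, RoBERTa's CLS) vector as float32."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    # Only the inputs are moved here; the model must already be on `device`
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Last hidden layer has shape (batch, seq_len, 768); keep the first token
    hidden_states = outputs.hidden_states[-1]
    cls_embedding = hidden_states[:, 0, :].squeeze(0).cpu().numpy()
    return cls_embedding.astype(np.float32)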


## 4. Create Faiss index

In [None]:
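import ast

quotes_df = pd.read_csv('selected_quotes_embeddings.csv')

def parse_embedding(emb):
    """Parse one stored embedding into a float32 array, or None on failure."""
    try:
        if isinstance(emb, str):
            # literal_eval accepts only Python literals, so unlike eval()
            # it cannot execute arbitrary code read from the CSV
            return np.array(ast.literal_eval(emb), dtype=np.float32)
        return np.array(emb, dtype=np.float32)
    except Exception as e:
        print(f"Error parsing embedding: {e}")
        return None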

index = faiss.IndexFlatIP(768)  # inner product; equals cosine similarity on unit vectors

chunk_size = 10000
n_chunks = -(-len(quotes_df) // chunk_size)  # ceiling division

for i in range(0, len(quotes_df), chunk_size):
    chunk = quotes_df.iloc[i:i + chunk_size]
    chunk_embeddings = [parse_embedding(emb) for emb in chunk['embeddings']]
    # NOTE: any row dropped here shifts FAISS ids relative to quotes_df positions
    chunk_embeddings = [emb for emb in chunk_embeddings if emb is not None and emb.shape == (768,)]
    if chunk_embeddings:
        chunk_array = np.vstack(chunk_embeddings)
        faiss.normalize_L2(chunk_array)
        index.add(chunk_array)
    print(f"Processed chunk {i // chunk_size + 1}/{n_chunks}")


Processed chunk 1/11
Processed chunk 2/11
Processed chunk 3/11
Processed chunk 4/11
Processed chunk 5/11
Processed chunk 6/11
Processed chunk 7/11
Processed chunk 8/11
Processed chunk 9/11
Processed chunk 10/11
Processed chunk 11/11


In [None]:
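def search_similar_quotes(query_embedding, k=5):
    # Copy into a float32 matrix of shape (1, dim) so the caller's array is not modified
    query = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
    faiss.normalize_L2(query)  # in place; inner product then equals cosine similarity
    distances, indices = index.search(query, k)
    return distances[0], indices[0]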
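
The index is an `IndexFlatIP`, which ranks by inner product. Because both the stored vectors and the query are L2-normalized first, the returned scores are cosine similarities. The following pure-Python sketch illustrates that identity on two toy vectors. The helper names are illustrative and independent of FAISS.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    denom = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / denom

# Toy vectors standing in for a query embedding and a quote embedding.
query = [3.0, 4.0, 0.0]
quote = [1.0, 2.0, 2.0]

ip_of_normalized = inner_product(l2_normalize(query), l2_normalize(quote))
print(math.isclose(ip_of_normalized, cosine_similarity(query, quote)))
```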


## 5. Evaluation

In [None]:
results = []

for idx, row in df_subset.iterrows():
    input_text = row['Text']

    # Replace emojis with their text descriptions
    demojized_text = demojize_text(input_text)

    demojized_embedding = get_embedding(demojized_text, tokenizer, model)
    demojized_distances, demojized_indices = search_similar_quotes(demojized_embedding)
    demojized_avg_similarity = np.mean(demojized_distances)
    demojized_quotes = [quotes_df.iloc[i]['quote'] for i in demojized_indices]

    results.append({
        'text': input_text,
        'demojized_text': demojized_text,
        'demojized_avg_similarity': demojized_avg_similarity,
        'demojized_quotes': demojized_quotes
    })


In [None]:
print("\n=== Detailed Results ===")
for idx, res in enumerate(results[:10], 1):
    print(f"\nQuery {idx}: {res['text']}")
    print(f"Demojized: {res['demojized_text']}")
    print(f"Average Cosine Similarity: {res['demojized_avg_similarity']:.4f}")
    print("Top-K Quotes:")
    for i, quote in enumerate(res['demojized_quotes']):
        print(f"{i+1}. {quote}")



=== Detailed Results ===

Query 1: @z388z @IFLTV Yh I had to search my house and take a picture of it 🤡
Demojized: @z388z @IFLTV Yh I had to search my house and take a picture of it :clown_face:
Average Cosine Similarity: 0.8694
Top-K Quotes:
1. Find me, my thief.
2. Taking a dump...blackout
3. They X-rayed my head and found nothing.
4. Literature is a microscope
5. ...a murder of crows gormandized until they were satiated.

Query 2: just booked hotel for 5sos in prague even when I'm not sure if i get my ticket 
SO hoping for good luck while ticketing 🤞🤞🤞
Demojized: just booked hotel for 5sos in prague even when I'm not sure if i get my ticket 
SO hoping for good luck while ticketing :crossed_fingers::crossed_fingers::crossed_fingers:
Average Cosine Similarity: 0.8909
Top-K Quotes:
1. With hope, we can endure any hardship.
2. \with hope, you can survive any shock.
3. To travel hopefully is better than to have arrived.
4. With high hope and optimism, start swimming with time.
5. There 

## Results Analysis

This iteration explores whether converting emojis to descriptive text (using `emoji.demojize`) helps retain emotional context in vector embeddings.

Demojized tokens often made the tweet easier to interpret and led to better emotional alignment with the retrieved quotes.
For example, each 🤞 in 🤞🤞🤞 became `:crossed_fingers:`, and the tweet matched hopeful quotes such as
“With hope, we can endure any hardship.” However, some matches were still semantically shallow. For instance, the 😋-tagged tweet describing delicious food led to generic quotes like “My passionate leisure pursuit...”, missing the sensory joy or humor implied by the emoji.

Results show that demojizing:
- Preserves some of the emotional nuance lost in the baseline
- Improves average similarity for certain emojis
- Still fails to fully capture sarcasm or context-dependent emotion
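
The similarity comparison in the list above can be checked per query. The following is a minimal sketch, assuming a hypothetical list of baseline scores computed in the same way as `demojized_avg_similarity` but from the emoji-stripped text. The function name and the numbers below are made-up placeholders for illustration, not measurements from this notebook.

```python
from statistics import mean

def compare_to_baseline(demojized_scores, baseline_scores):
    """Summarise paired per-query average similarities."""
    if len(demojized_scores) != len(baseline_scores):
        raise ValueError("score lists must be paired per query")
    deltas = [d - b for d, b in zip(demojized_scores, baseline_scores)]
    return {
        # Mean change in average similarity across queries
        "mean_delta": mean(deltas),
        # Fraction of queries whose score went up after demojizing
        "share_improved": sum(delta > 0 for delta in deltas) / len(deltas),
    }

# Placeholder values for illustration only (not results from this notebook).
demojized_scores = [0.75, 0.5, 0.25, 0.5]
baseline_scores = [0.5, 0.5, 0.5, 0.25]

summary = compare_to_baseline(demojized_scores, baseline_scores)
print(summary)
```

With the notebook's data, the first argument would be `[r['demojized_avg_similarity'] for r in results]`. A higher cosine score only means the embeddings are closer, so a summary like this would complement manual inspection of the retrieved quotes rather than replace it.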

**Conclusion**: `emoji.demojize` is a lightweight but limited improvement. Future iterations might benefit from more expressive models (e.g. emoji2vec or emojional).
