# 4th Iteration
In this iteration, we intentionally keep text and emojis together. The goal is to evaluate how well a general-purpose model, `SamLowe/roberta-base-go_emotions`, handles mixed inputs (natural language plus emojis) without any preprocessing or transformation.

Main steps:

1. Keep emojis in place and combine them with raw text

2. Tokenize this mixed input directly

3. Generate embeddings from the last hidden layer of a text classification model

4. Measure how well the model captures the emotional meaning, using the cosine similarity between each input and its retrieved quotes
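
Step 2 is feasible because RoBERTa-style tokenizers use byte-level BPE: they operate on UTF-8 bytes rather than a fixed character vocabulary, so an emoji is broken into byte-level pieces instead of collapsing into an unknown token. The sketch below only illustrates that byte view with the standard library. It does not reproduce the tokenizer's actual merges, and `utf8_byte_view` is a hypothetical helper name.

```python
def utf8_byte_view(text: str) -> list[tuple[str, int]]:
    """Return each character paired with the number of UTF-8 bytes it occupies."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


sample = "luck 🤞"
view = utf8_byte_view(sample)

# ASCII characters take one byte each; the emoji takes several.
for ch, n_bytes in view:
    print(repr(ch), n_bytes)
```

How many tokens a given emoji ends up as depends on the merges learned during pretraining, so rare emojis may cost more tokens than common ones.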

In [None]:
!pip install emoji
!pip install faiss-cpu

In [2]:
import pandas as pd
import numpy as np
import emoji
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
from tabulate import tabulate
import torch

## 1. Prepare CSV files

In [3]:
all_dfs = []
for file_name in os.listdir("archive"):
    if file_name.endswith('.csv'):
        file_path = os.path.join("archive", file_name)
        try:
            # Read CSV with pandas defaults (C parser, UTF-8); malformed files are reported and skipped
            df = pd.read_csv(file_path)
            all_dfs.append(df)
            print(f"Loaded {file_name} with {len(df)} rows")
        except Exception as e:
            print(f"Error loading {file_name}: {e}")

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined dataset size: {len(combined_df)} rows")
print()
combined_df.to_csv("combined_dataset.csv", index=False)
combined_df.head()  # kept as the last expression so the notebook displays it

Loaded rabbit.csv with 20000 rows
Loaded thinking_face.csv with 20000 rows
Loaded hot_face.csv with 20000 rows
Loaded smiling_face_with_heart-eyes.csv with 20000 rows
Loaded fire.csv with 20000 rows
Loaded folded_hands.csv with 20000 rows
Loaded fearful_face.csv with 20000 rows
Loaded rolling_on_the_floor_laughing.csv with 20000 rows
Loaded saluting_face.csv with 20000 rows
Loaded white_heart.csv with 20000 rows
Loaded grinning_face_with_sweat.csv with 20000 rows
Loaded thumbs_up.csv with 20000 rows
Loaded cooking.csv with 20000 rows
Loaded face_with_steam_from_nose.csv with 20000 rows
Loaded rabbit_face.csv with 20000 rows
Loaded sparkles.csv with 20000 rows
Loaded smiling_face_with_halo.csv with 20000 rows
Loaded ghost.csv with 20000 rows
Loaded hatching_chick.csv with 20000 rows
Error loading backhand_index_pointing_right.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Error loading red_heart.csv: Error tokenizing data. C error: Buffer o

## 2. Load dataset and filter rows with emojis

In [4]:
df = pd.read_csv('combined_dataset.csv')
df['has_emoji'] = df['Text'].apply(lambda x: bool(emoji.emoji_count(str(x))))
emoji_rich_df = df[df['has_emoji']].copy()

# Select 100 random records
df_subset = emoji_rich_df.sample(n=100, random_state=100).reset_index(drop=True)

## 3. Create embeddings of the text together with emoji

In [None]:
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")

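def get_embedding(text, tokenizer, model, device='cpu'):
    """Return the final-layer <s> (CLS) embedding of `text` as a float32 vector."""
    # Emojis are passed through untouched; inputs longer than 512 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states[-1] has shape (batch, seq_len, hidden); position 0 is the <s> token.
    hidden_states = outputs.hidden_states[-1]
    cls_embedding = hidden_states[:, 0, :].squeeze().cpu().numpy()
    return cls_embedding.astype(np.float32)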

## 4. Create FAISS index

In [6]:
quotes_df = pd.read_csv('selected_quotes_embeddings.csv')

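import ast

def parse_embedding(emb):
    # Embeddings are stored in the CSV as stringified lists of floats.
    try:
        if isinstance(emb, str):
            # literal_eval parses the list literal without executing arbitrary code
            return np.array(ast.literal_eval(emb), dtype=np.float32)
        return np.array(emb, dtype=np.float32)
    except Exception as e:
        print(f"Error parsing embedding: {e}")
        return None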

index = faiss.IndexFlatIP(768)

for i in range(0, len(quotes_df), 10000):
    chunk = quotes_df.iloc[i:i + 10000]
    chunk_embeddings = [parse_embedding(emb) for emb in chunk['embeddings']]
    chunk_embeddings = [emb for emb in chunk_embeddings if emb is not None and emb.shape == (768,)]
    if chunk_embeddings:
        chunk_array = np.vstack(chunk_embeddings)
        faiss.normalize_L2(chunk_array)
        index.add(chunk_array)
    print(f"Processed chunk {i // 10000 + 1}/{len(quotes_df) // 10000 + 1}")

Processed chunk 1/11
Processed chunk 2/11
Processed chunk 3/11
Processed chunk 4/11
Processed chunk 5/11
Processed chunk 6/11
Processed chunk 7/11
Processed chunk 8/11
Processed chunk 9/11
Processed chunk 10/11
Processed chunk 11/11
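
`IndexFlatIP` assigns ids sequentially in insertion order, so as soon as any row is dropped for a missing or wrongly shaped embedding, FAISS position `i` no longer corresponds to `quotes_df.iloc[i]`. Below is a minimal sketch of one way to keep the two aligned, using plain lists in place of the DataFrame and the index. `filter_with_positions` and `kept_positions` are hypothetical names, not part of the notebook.

```python
from typing import Optional, Sequence


def filter_with_positions(
    embeddings: Sequence[Optional[Sequence[float]]], dim: int
) -> tuple[list[Sequence[float]], list[int]]:
    """Keep valid embeddings and remember which source row each one came from."""
    kept_vectors = []
    kept_positions = []
    for row_pos, emb in enumerate(embeddings):
        if emb is not None and len(emb) == dim:
            kept_vectors.append(emb)
            kept_positions.append(row_pos)
    return kept_vectors, kept_positions


# Toy data: row 1 failed to parse, row 2 has the wrong dimension.
parsed = [[0.1, 0.2], None, [0.3], [0.4, 0.5]]
vectors, kept_positions = filter_with_positions(parsed, dim=2)

# A hit at index position 1 maps back to source row 3, not row 1.
source_row = kept_positions[1]
```

With such a mapping, the quote lookup in the evaluation loop would read `quotes_df.iloc[kept_positions[i]]['quote']`. If no rows were filtered out, the mapping is the identity and the existing lookup is already correct.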


In [7]:
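def search_similar_quotes(query_embedding, k=5):
    # Normalize a contiguous float32 copy: normalize_L2 works in place, and the
    # original code relied on reshape() returning a view of the caller's array.
    query = np.ascontiguousarray(query_embedding, dtype=np.float32).reshape(1, -1).copy()
    faiss.normalize_L2(query)
    # With unit-length vectors, IndexFlatIP scores are cosine similarities.
    similarities, indices = index.search(query, k)
    return similarities[0], indices[0]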
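
The index is an `IndexFlatIP` (inner product), and both the stored vectors and the query are L2-normalized, so the scores the notebook stores in `distances` are in fact cosine similarities (higher means closer). A small standard-library check of that identity, using made-up vectors:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Illustrative vectors only; real embeddings here have 768 dimensions.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cosine = cosine_similarity(a, b)
print(ip_of_normalized, cosine)
```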

## 5. Evaluation

In [8]:
results = []

for idx, row in df_subset.iterrows():
    input_text = row['Text']
    embedding = get_embedding(input_text, tokenizer, model)
    distances, indices = search_similar_quotes(embedding)
    avg_similarity = np.mean(distances)
    quotes = [quotes_df.iloc[i]['quote'] for i in indices]

    results.append({
        'text': input_text,
        'avg_similarity': avg_similarity,
        'quotes': quotes
    })

In [9]:
print("\n=== Detailed Results ===")
for idx, res in enumerate(results[:10], 1):
    print(f"\nQuery {idx}: {res['text']}")
    print(f"Average Cosine Similarity: {res['avg_similarity']:.4f}")
    print("Top-K Quotes:")
    for i, quote in enumerate(res['quotes']):
        print(f"{i+1}. {quote}")


=== Detailed Results ===

Query 1: @z388z @IFLTV Yh I had to search my house and take a picture of it 🤡
Average Cosine Similarity: 0.7074
Top-K Quotes:
1. Yes, I kidnapped that Lindberg baby.
2. Yeah and purple monkeys fly from my ass at dawn.
3. I make movies for teenage boys. Oh dear, what a crime.
4. I ransack public libraries, and find them full of sunk treasure.
5. I smell blood and an era of prominent madmen.

Query 2: just booked hotel for 5sos in prague even when I'm not sure if i get my ticket 
SO hoping for good luck while ticketing 🤞🤞🤞
Average Cosine Similarity: 0.8628
Top-K Quotes:
1. I'm optimistic about the possibility of having a positive attitude
2. Whatever my path, I have faith I will end up where I need to be.
3. With high hope and optimism, start swimming with time.
4. To travel hopefully is better than to have arrived.
5. I am prepared for the worst, but hope for the best.

Query 3: Stateside tips are live 🇺🇸

✔️ Tampa Bay Downs
✔️ Mahoning Valley
✔️ Philadelphia


## Results Analysis
With emojis left in place and untransformed, the retrieved quotes generally matched the emotional tone of the queries. For most examples, the average cosine similarity across the top-5 quotes was above 0.8, which suggests that the model handles mixed text-and-emoji input well.

Emojis that express emotion (🤞, 😋, ❤️) clearly strengthened the emotional signal: the model retrieved quotes with a similar mood. Even for ironic or absurd examples (🤡), the results stayed thematically relevant. For more neutral, factual inputs, the emotional match was weaker but still acceptable.

Overall, the model copes with emojis in the text without any special processing.
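
The observation that most examples score above 0.8 can be checked directly from `results`. Below is a minimal sketch, assuming each entry carries an `avg_similarity` key as in the evaluation loop. `share_above` is a hypothetical helper, and the two sample values are the averages printed for Query 1 and Query 2, not the full set of 100.

```python
from statistics import mean, median


def share_above(results, threshold=0.8):
    """Fraction of queries whose average similarity exceeds the threshold."""
    scores = [float(r["avg_similarity"]) for r in results]
    return sum(s > threshold for s in scores) / len(scores)


# Only the two averages visible in the printed output above.
sample_results = [
    {"avg_similarity": 0.7074},
    {"avg_similarity": 0.8628},
]

share = share_above(sample_results)
scores = [r["avg_similarity"] for r in sample_results]
print(f"share > 0.8: {share:.2f}, mean: {mean(scores):.4f}, median: {median(scores):.4f}")
```

Running the same helper on the full `results` list would give the actual proportion behind the claim.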