Tweets With Emoji
A dataset of tweets covering 43 emojis, with 20k tweets per emoji 🥰

About Dataset
The data was collected with snscrape, using one query per emoji. Each retrieved tweet was then checked to confirm that it contains an emoji and that its text is English. The language detection step used pycld3 and was inspired by the paper "The WiLI benchmark dataset for written language identification." Each CSV file contains 20,000 distinct entries. File names are derived from the emoji names in the Python emoji package (emoji.EMOJI_DATA).

Because pycld3 occasionally misclassifies a tweet's language, and because a single tweet can contain several emojis, some CSV files may include non-English tweets, and the same tweet may appear in more than one CSV file.
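
Since a tweet with several emojis can be returned by more than one query, the same text may end up under more than one label. The following is a minimal sketch of how such overlaps could be counted using only the standard library. The helper name `texts_with_multiple_labels` and the sample rows are illustrative assumptions and are not part of the dataset or the notebook.

```python
from collections import defaultdict


def texts_with_multiple_labels(rows):
    """Return {text: sorted labels} for texts that appear under more than one label.

    `rows` is any iterable of (text, label) pairs.
    """
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}


# Made-up rows: the first tweet is assumed to contain two emojis,
# so it shows up under two labels (labels are written as file-name stems).
sample_rows = [
    ("so happy today", "smiling_face_with_hearts"),
    ("so happy today", "party_popper"),
    ("good morning", "sun"),
]
overlaps = texts_with_multiple_labels(sample_rows)
print(overlaps)
```

With the DataFrame built in the loading cell, the pairs could be supplied as `full_df[["text", "emoji"]].itertuples(index=False, name=None)`.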

In [21]:
import os, re, unicodedata, csv, random
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
DATASET_DIR = "/kaggle/input/tweets-with-emoji"
OUTPUT_DIR = "/kaggle/working"
TEST_SIZE = 0.1
RNG_SEED = 42
random.seed(RNG_SEED)

In [23]:
FILENAME_TO_EMOJI = {
    "backhand_index_pointing_right.csv": "👉",
    "check_mark.csv": "✔️",
    "check_mark_button.csv": "✅",
    "clown_face.csv": "🤡",
    "cooking.csv": "🍳",
    "egg.csv": "🥚",
    "enraged_face.csv": "😡",
    "eyes.csv": "👀",
    "face_holding_back_tears.csv": "🥹",
    "face_savoring_food.csv": "😋",
    "face_with_steam_from_nose.csv": "😤",
    "face_with_tears_of_joy.csv": "😂",
    "fearful_face.csv": "😨",
    "fire.csv": "🔥",
    "folded_hands.csv": "🙏",
    "ghost.csv": "👻",
    "grinning_face_with_sweat.csv": "😅",
    "hatching_chick.csv": "🐣",
    "hot_face.csv": "🥵",
    "loudly_crying_face.csv": "😭",
    "melting_face.csv": "🫠",
    "middle_finger.csv": "🖕",
    "party_popper.csv": "🎉",
    "partying_face.csv": "🥳",
    "pile_of_poo.csv": "💩",
    "rabbit.csv": "🐇",
    "rabbit_face.csv": "🐰",
    "red_heart.csv": "❤️",
    "rolling_on_the_floor_laughing.csv": "🤣",
    "saluting_face.csv": "🫡",
    "skull.csv": "💀",
    "smiling_face.csv": "☺️",
    "smiling_face_with_halo.csv": "😇",
    "smiling_face_with_heart-eyes.csv": "😍",
    "smiling_face_with_hearts.csv": "🥰",
    "smiling_face_with_sunglasses.csv": "😎",
    "smiling_face_with_tear.csv": "🥲",
    "sparkles.csv": "✨",
    "sun.csv": "☀️",
    "thinking_face.csv": "🤔",
    "thumbs_up.csv": "👍",
    "white_heart.csv": "🤍",
    "winking_face.csv": "😉",
}
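
Before loading, it may be worth confirming that every key in the mapping corresponds to a file on disk, since a mistyped name would otherwise only surface as an error inside the loading loop. The following is a minimal sketch. `missing_files` is a hypothetical helper, and the temporary directory merely stands in for `DATASET_DIR`.

```python
import os
import tempfile


def missing_files(dataset_dir, filename_to_emoji):
    """Return the mapped file names that do not exist in `dataset_dir`."""
    return sorted(
        fname
        for fname in filename_to_emoji
        if not os.path.isfile(os.path.join(dataset_dir, fname))
    )


# Demo on a throwaway directory that contains only one of two mapped files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "fire.csv"), "w").close()
    demo_missing = missing_files(tmp, {"fire.csv": "fire", "egg.csv": "egg"})

print(demo_missing)  # ['egg.csv']
```

In the notebook, the call would be `missing_files(DATASET_DIR, FILENAME_TO_EMOJI)`, and an empty list would indicate that all 43 files are present.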

In [25]:
dfs = []
for fname, emo in FILENAME_TO_EMOJI.items():
    fpath = os.path.join(DATASET_DIR, fname)
    # newline="" lets csv.reader handle line breaks per the CSV standard (including inside quotes)
    with open(fpath, "r", encoding="utf-8", errors="ignore", newline="") as f:
        reader = csv.reader(
            f,
            delimiter=",",
            quotechar='"',
            escapechar="\\",
            doublequote=True,
            quoting=csv.QUOTE_MINIMAL,
            skipinitialspace=False,
            strict=False,
        )
        rows = []
        for row in reader:
            if not row:
                continue
            # each file is expected to have one column; if a "dirty" row has several, rejoin them with commas
            text = row[0] if len(row) == 1 else ",".join(row)
            # strip a BOM if one is attached to the start
            if text.startswith("\ufeff"):
                text = text[1:]
            rows.append(text)

    if not rows:
        continue
    df = pd.DataFrame({"text": rows})
    df["emoji"] = emo
    dfs.append(df)

full_df = pd.concat(dfs, ignore_index=True)
full_df = full_df.drop_duplicates(subset=["text", "emoji"]).reset_index(drop=True)
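
The loader relies on two safeguards: opening the file with `newline=""` so that line breaks inside quoted fields survive, and rejoining rows that an unquoted comma split into several columns. The sketch below illustrates both on a made-up in-memory sample. It uses the default `csv.reader` dialect rather than the exact options passed in the cell above, and `io.StringIO` stands in for the opened file.

```python
import csv
import io

# Three records: plain text, a quoted field containing a line break,
# and a text with an unquoted comma.
raw = 'first line\r\n"line one\nline two"\r\nhello, world\r\n'

# newline="" passes the embedded "\n" through untouched so csv.reader can parse it.
reader = csv.reader(io.StringIO(raw, newline=""))

# Same rule as the loader: one column is used as-is, several are rejoined with commas.
texts = [row[0] if len(row) == 1 else ",".join(row) for row in reader if row]
print(texts)
```

The quoted record comes back as one text with its line break intact, and the comma-split record is restored to `hello, world`.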

In [26]:
print(full_df.shape)

(852757, 2)


In [27]:
full_df.sample(5)

Unnamed: 0,text,emoji
476005,@RoseK24206400 @MariaGodinez25 @redbullracing ...,💩
536743,"@auntmaggiep Good Morning, Maggie! ☕️❤️🐇💐",❤️
829376,"To this wonderful pair of children of God, tha...",🤍
306984,Just me being cute👻 HAHAHHAHAH https://t.co/uh...,👻
677903,@AnnePParis @angie241325 Hope you’ve had a lov...,🥰


In [55]:
print("\nNumber of classes (emoji):", full_df['emoji'].nunique())


Number of classes (emoji): 43


In [56]:
# Regexes and cleaning function
URL_RE        = re.compile(r"(https?://\S+|www\.\S+)", flags=re.IGNORECASE)
MENTION_RE    = re.compile(r"@\w+")
HASHTAG_RE    = re.compile(r"#\w+")
MULTISPACE_RE = re.compile(r"\s+")
EMOJI_RE      = re.compile(
    "["                     
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001F77F"
    "\U0001F780-\U0001F7FF"
    "\U0001F800-\U0001F8FF"
    "\U0001F900-\U0001F9FF"
    "\U0001FA00-\U0001FAFF"
    "\U00002700-\U000027BF"
    "\U00002600-\U000026FF"
    "\U00002B00-\U00002BFF"
    "\U00002300-\U000023FF"
    "]+",
    flags=re.UNICODE
)
EMOJI_EXTRA_RE = re.compile(r"[\u200d\ufe0e\ufe0f\U0001F3FB-\U0001F3FF]", flags=re.UNICODE)

def clean_text(s: str) -> str:
    t = unicodedata.normalize("NFKC", str(s))
    t = URL_RE.sub(" ", t)
    t = MENTION_RE.sub(" ", t)
    t = HASHTAG_RE.sub(" ", t)
    t = EMOJI_RE.sub(" ", t)
    t = EMOJI_EXTRA_RE.sub(" ", t)
    t = t.lower()
    t = MULTISPACE_RE.sub(" ", t).strip()
    return t
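
Some emojis are a base character followed by an invisible modifier, which is presumably why the cleaning function applies a second pattern after `EMOJI_RE`. The sketch below shows this for the red heart, which is U+2764 followed by the variation selector U+FE0F. Only two of the notebook's ranges are copied into `EMOJI_RE` here to keep the example short, and the sample text is made up.

```python
import re

# Two of the notebook's ranges are enough for this illustration.
EMOJI_RE = re.compile("[\U0001F300-\U0001F5FF\U00002700-\U000027BF]+")
EMOJI_EXTRA_RE = re.compile(r"[\u200d\ufe0e\ufe0f\U0001F3FB-\U0001F3FF]")

# Red heart in emoji presentation: U+2764 followed by variation selector U+FE0F.
text = "love \u2764\ufe0f"

after_base = EMOJI_RE.sub(" ", text)               # U+2764 replaced, U+FE0F remains
after_extra = EMOJI_EXTRA_RE.sub(" ", after_base)  # the selector is replaced as well

print(repr(after_base))
print(repr(after_extra.strip()))
```

After the first substitution the invisible selector is still in the string. Only the second pass leaves plain text that collapses to `love` once whitespace is trimmed.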

In [57]:
# Apply the cleaning function to full_df
full_df["text"] = full_df["text"].map(clean_text)

# Drop rows that are empty after cleaning
full_df = full_df[full_df["text"] != ""].reset_index(drop=True)

In [58]:
print("After cleaning:", full_df.shape)
full_df.sample(5)

After cleaning: (835428, 2)


Unnamed: 0,text,emoji
432448,"the latest opportunities in dance study, perfo...",🎉
167210,tangina what if ot7 live after kcon,🥹
144629,maybe give the girls a snippet of one of the r...,👀
558868,humans finding neanderthals attractive and vic...,🤣
444326,have a nice weekend killa and congrats again i...,🎉
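
The row counts printed so far can be put side by side to see how much each step removed. The short calculation below uses the nominal size from the dataset description (43 files of 20,000 entries) and the two shapes reported above. The gap to the nominal size is only indicative, since the notebook does not report the raw row count before `drop_duplicates`.

```python
nominal_rows = 43 * 20_000   # per the dataset description
loaded_rows = 852_757        # full_df.shape[0] after loading and drop_duplicates
cleaned_rows = 835_428       # full_df.shape[0] after dropping empty texts

gap_to_nominal = nominal_rows - loaded_rows        # 7,243
emptied_by_cleaning = loaded_rows - cleaned_rows   # 17,329
share_emptied = emptied_by_cleaning / loaded_rows

print(gap_to_nominal, emptied_by_cleaning, f"{share_emptied:.2%}")
# 7243 17329 2.03%
```

Because the only filter applied after cleaning is `text != ""`, the 17,329 rows are tweets that contained nothing but URLs, mentions, hashtags, emojis, or whitespace.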


In [73]:
# Split into train and test sets
train_df, test_df = train_test_split(
    full_df, 
    test_size=TEST_SIZE, 
    random_state=RNG_SEED, 
    shuffle=True, 
    stratify=full_df['emoji']
)
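
With `stratify` set, each emoji should make up nearly the same share of both subsets. The following is a minimal standard-library sketch for checking this. `class_shares` and `max_share_gap` are hypothetical helpers, and the toy label lists stand in for `train_df["emoji"]` and `test_df["emoji"]`.

```python
from collections import Counter


def class_shares(labels):
    """Map each label to its fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def max_share_gap(train_labels, test_labels):
    """Largest absolute difference in class share between two label collections."""
    train = class_shares(train_labels)
    test = class_shares(test_labels)
    return max(
        abs(train.get(label, 0.0) - test.get(label, 0.0))
        for label in train.keys() | test.keys()
    )


# Toy labels: a 90/10 split that preserves a 50/50 class balance exactly.
toy_train = ["a"] * 9 + ["b"] * 9
toy_test = ["a"] * 1 + ["b"] * 1
gap = max_share_gap(toy_train, toy_test)
print(gap)  # 0.0
```

On the real split, a value close to zero would suggest that stratification worked as intended. Small non-zero gaps are expected because class counts rarely divide evenly.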

In [75]:
print("Train set size: ", train_df.shape)
print("Test set size: ", test_df.shape)

Train set size:  (751885, 2)
Test set size:  (83543, 2)
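
The reported sizes are consistent with the test share being rounded up to a whole number of rows and the remainder going to the training set. The check below reproduces them under that assumption. It is an arithmetic sanity check on the printed shapes, not a description of scikit-learn internals.

```python
import math

n_samples = 835_428   # rows in full_df after cleaning
test_size = 0.1       # TEST_SIZE

n_test = math.ceil(test_size * n_samples)   # test share rounded up
n_train = n_samples - n_test                # everything else goes to training

print(n_train, n_test)  # 751885 83543
```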


In [76]:
# Save
train_path = os.path.join(OUTPUT_DIR, "train_data.csv")
test_path  = os.path.join(OUTPUT_DIR, "test_data.csv")
train_df.to_csv(train_path, index=False, encoding="utf-8")
test_df.to_csv(test_path, index=False, encoding="utf-8")
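
One thing to keep in mind when the saved files are read back: by default, `pd.read_csv` treats strings such as `nan` and `null` as missing values, and a lowercased tweet could consist of exactly such a word. The sketch below illustrates the effect on a toy frame written to an in-memory buffer. The toy data and the choice of `keep_default_na=False` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Toy frame: one cleaned text happens to be the literal string "nan".
toy = pd.DataFrame({"text": ["nan", "good morning"], "emoji": ["x", "y"]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing converts "nan" to a missing value.
default_read = pd.read_csv(io.StringIO(buffer.getvalue()))
# keep_default_na=False keeps every cell as the literal string that was written.
literal_read = pd.read_csv(io.StringIO(buffer.getvalue()), keep_default_na=False)

print(default_read["text"].isna().sum())   # 1
print(literal_read["text"].isna().sum())   # 0
```

Reading `train_data.csv` and `test_data.csv` with `keep_default_na=False` would be one way to keep the row counts identical to those printed above.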