# Emoji Sentiment

Are popular emojis generally associated with positive or negative sentiments?

The file `"emoji-sentiment.csv"` provides data on the sentiment associated with various emojis.

Researchers examined 1.6 million tweets across 13 European languages. Each tweet was labeled by annotators as positive (+1), negative (-1), or neutral (0). About 4% of these tweets included emojis.

Columns include:
- `Occurrences [5...max]`: Number of times the emoji appears in the dataset (at least 5).
- `Position [0...1]`: Average position of the emoji in tweets, from start (0) to end (1).
- `Neg [0...1]`: Proportion of tweets with the emoji that are 'negative'.
- `Neut [0...1]`: Proportion of tweets with the emoji that are 'neutral'.
- `Pos [0...1]`: Proportion of tweets with the emoji that are 'positive'.
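
Since the three sentiment columns describe shares of the same set of tweets, they should add up to roughly 1 for each emoji. A minimal sketch of that sanity check, using the first three rows of the file (as previewed later in this notebook) as inline data. The 0.01 tolerance is an assumption to allow for rounding.

```python
import pandas as pd

# Inline sample: the first three rows of emoji-sentiment.csv as previewed in this notebook.
sample = pd.DataFrame({
    "Char": ["😂", "❤", "♥"],
    "Neg [0...1]": [0.247, 0.044, 0.035],
    "Neut [0...1]": [0.285, 0.166, 0.272],
    "Pos [0...1]": [0.468, 0.790, 0.693],
})

share_cols = ["Neg [0...1]", "Neut [0...1]", "Pos [0...1]"]
row_totals = sample[share_cols].sum(axis=1)

# Assumed tolerance of 0.01 to absorb rounding in the published values.
shares_ok = (row_totals - 1).abs() < 0.01
print(shares_ok.all())
```

Running the same check on the full DataFrame would flag any rows whose shares do not add up.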



In [1]:
# FOR GOOGLE COLAB ONLY.
# Uncomment and run the code below. A dialog will appear to upload files.
# Upload 'emoji-sentiment.csv'.

# from google.colab import files
# uploaded = files.upload()

In [5]:
import pandas as pd
df = pd.read_csv('emoji-sentiment.csv')
df.head(3)

| Unnamed: 0 | Char | Image [twemoji] | Unicode codepoint | Occurrences [5...max] | Position [0...1] | Neg [0...1] | Neut [0...1] | Pos [0...1] | Sentiment bar (c.i. 95%) | Unicode name | Unicode block |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 😂 | 😂 | 0x1f602 | 14622 | 0.805 | 0.247 | 0.285 | 0.468 | | FACE WITH TEARS OF JOY | Emoticons |
| 1 | ❤ | ❤ | 0x2764 | 8050 | 0.747 | 0.044 | 0.166 | 0.79 | | HEAVY BLACK HEART | Dingbats |
| 2 | ♥ | ♥ | 0x2665 | 7144 | 0.754 | 0.035 | 0.272 | 0.693 | | BLACK HEART SUIT | Miscellaneous Symbols |


### Project Ideas:

Data Cleaning: 
- Remove unnecessary columns that are not useful for your analysis.

- Rename the remaining columns using `snake_case` (all lowercase letters with underscores between words).
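
One way to approach the renaming step is with a small helper rather than typing each new name by hand. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper (not part of pandas), and it assumes the bracketed ranges such as `[0...1]` should be dropped from the names.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: drop a trailing bracketed range, then snake_case the rest."""
    name = re.sub(r"\s*\[.*?\]\s*$", "", name)   # "Position [0...1]" -> "Position"
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)   # spaces and punctuation -> underscore
    return name.strip("_").lower()

columns = [
    "Char", "Occurrences [5...max]", "Position [0...1]",
    "Neg [0...1]", "Neut [0...1]", "Pos [0...1]",
]
renamed = {col: to_snake_case(col) for col in columns}
print(renamed)
```

The helper could be applied to a DataFrame with `df.rename(columns=to_snake_case)`. Note that it turns `Char` into `char`; choosing a more descriptive name such as `emoji` still has to be done by hand.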

New Variables:
- Add a new column called `sentiment`, where sentiment = (proportion of positive tweets) - (proportion of negative tweets). Values range from -1 (all negative) to +1 (all positive).

- Add a `positive_flag` column that is `True` if `sentiment > 0` (or above a set threshold), otherwise `False`.
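
The two new variables can be sketched on a small inline sample, assuming the columns have already been renamed to `pos` and `neg`. The values are the three preview rows shown earlier, and the `THRESHOLD` of 0.25 is a hypothetical cutoff chosen only to show how a stricter flag behaves.

```python
import pandas as pd

# Inline sample using assumed snake_case column names; values are the three preview rows.
sample = pd.DataFrame({
    "emoji": ["😂", "❤", "♥"],
    "neg": [0.247, 0.044, 0.035],
    "pos": [0.468, 0.790, 0.693],
})

THRESHOLD = 0.25  # hypothetical cutoff, for illustration only; use 0 for the plain rule

sample["sentiment"] = sample["pos"] - sample["neg"]
sample["positive_flag"] = sample["sentiment"] > THRESHOLD
print(sample[["emoji", "sentiment", "positive_flag"]])
```

With a threshold of 0 all three sample emojis would be flagged as positive; raising the cutoff to 0.25 excludes 😂, whose sentiment is about 0.22.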

Types of questions you can now answer more easily:
- What percentage of emojis in the dataset have a positive sentiment?

- What percentage of the top 20 most popular emojis are positive?

- Which emoji (with more than 500 mentions) is the most positive?

- Which emoji (with more than 500 mentions) is the most negative?

- Where in the tweets are most emojis located (i.e. at the beginning or the end)?

- Is there a difference in the placement of positive versus negative emojis within a tweet?

In [None]:
# Data Cleaning: Remove & Rename Columns

# Load the CSV
df = pd.read_csv("emoji-sentiment.csv")

# Keep relevant columns
df = df[["Char", "Occurrences [5...max]", "Position [0...1]", "Neg [0...1]", "Neut [0...1]", "Pos [0...1]"]]

# Rename columns to snake_case
df.columns = ["emoji", "occurrences", "position", "neg", "neut", "pos"]


In [10]:
# New Variables: Sentiment and Positive Flag

# Calculate sentiment
df["sentiment"] = df["pos"] - df["neg"]

# Create positive_flag: True if sentiment > 0
df["positive_flag"] = df["sentiment"] > 0

In [11]:
# What percentage of emojis have positive sentiment?

percent_positive = df["positive_flag"].mean() * 100
print(f"{percent_positive:.2f}% of emojis have positive sentiment.")

82.42% of emojis have positive sentiment.


In [12]:
# What percentage of the top 20 most popular emojis are positive?

top20 = df.sort_values("occurrences", ascending=False).head(20)
percent_top20_positive = top20["positive_flag"].mean() * 100
print(f"{percent_top20_positive:.2f}% of top 20 emojis are positive.")

90.00% of top 20 emojis are positive.


In [13]:
# Most positive emoji with > 500 mentions

most_positive = df[df["occurrences"] > 500].sort_values("sentiment", ascending=False).iloc[0]
print("Most positive emoji over 500 mentions:", most_positive["emoji"])

Most positive emoji over 500 mentions: ❤


In [14]:
# Most negative emoji with > 500 mentions

most_negative = df[df["occurrences"] > 500].sort_values("sentiment").iloc[0]
print("Most negative emoji over 500 mentions:", most_negative["emoji"])

Most negative emoji over 500 mentions: 😒


In [15]:
# Where are emojis most likely located in tweets?

mean_position = df["position"].mean()
location = "end" if mean_position > 0.5 else "beginning"
print(f"Most emojis appear toward the {location} of tweets (avg. position: {mean_position:.2f}).")

Most emojis appear toward the end of tweets (avg. position: 0.67).
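
The average above gives every emoji the same weight, whether it appears 5 times or 14,000 times. An alternative, sketched below as one possible refinement rather than the required approach, weights each emoji's position by its number of occurrences. The inline sample uses the three preview rows, and the column names assume the renaming done earlier.

```python
import numpy as np
import pandas as pd

# Inline sample: occurrences and positions from the three preview rows.
sample = pd.DataFrame({
    "emoji": ["😂", "❤", "♥"],
    "occurrences": [14622, 8050, 7144],
    "position": [0.805, 0.747, 0.754],
})

# Plain mean: each emoji counts once.
unweighted = sample["position"].mean()

# Weighted mean: each emoji counts in proportion to how often it is used.
weighted = np.average(sample["position"], weights=sample["occurrences"])

print(f"Unweighted: {unweighted:.3f}, weighted by occurrences: {weighted:.3f}")
```

On the full dataset, the same `np.average` call with `df["position"]` and `df["occurrences"]` would describe where a typical emoji usage sits, rather than where a typical emoji sits.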


In [16]:
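# Difference in placement: Positive vs. Negative
# Note: positive_flag is False for any emoji with sentiment <= 0,
# so the "negative" group also includes emojis with exactly zero sentiment.

pos_mean = df[df["positive_flag"]]["position"].mean()
neg_mean = df[~df["positive_flag"]]["position"].mean()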

print(f"Avg. position of positive emojis: {pos_mean:.2f}")
print(f"Avg. position of negative emojis: {neg_mean:.2f}")

Avg. position of positive emojis: 0.66
Avg. position of negative emojis: 0.68
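
Because `positive_flag` is `False` for emojis with a sentiment of exactly zero, the comparison above lumps them in with the negative group. A possible refinement, sketched here with a hypothetical helper named `position_by_sign`, splits emojis by the sign of `sentiment` instead. It assumes a DataFrame with `sentiment` and `position` columns, as built in the cells above.

```python
import numpy as np
import pandas as pd

def position_by_sign(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: mean position per sentiment sign.

    The result is indexed by -1.0 (negative), 0.0 (zero), and 1.0 (positive);
    a sign with no emojis is simply absent from the result.
    """
    sign = np.sign(frame["sentiment"])
    return frame.groupby(sign)["position"].mean()
```

Calling `position_by_sign(df)` on the cleaned DataFrame would return up to three averages, one per sign.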
