# Emoji Prediction Project




### Overview of Emoji Prediction Notebook 😊

This notebook builds an LSTM-based neural network that predicts an emoji for a given text input. It covers loading the data, cleaning and tokenizing the text, encoding the emoji labels, and training the model. The trained model is then evaluated and used to suggest emojis for new text inputs. 📊🤖



# Connect with me:  
[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/nazishjaveed) 

[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/nazishjaveed) 

![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)

### Step 1: Data Collection

In [1]:
import pandas as pd

# Load dataset
data = pd.read_csv('Emoji Predition data.csv')




In [2]:
data

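| Unnamed: 0.1 | Unnamed: 0 | text | emoji |
|---|---|---|---|
| 0 | 0 | 😜 | 0 |
| 1 | 1 | 📸 | 1 |
| 2 | 2 | 😍 | 2 |
| 3 | 3 | 😂 | 3 |
| 4 | 4 | 😉 | 4 |
| 5 | 5 | 🎄 | 5 |
| 6 | 6 | 📷 | 6 |
| 7 | 7 | 🔥 | 7 |
| 8 | 8 | 😘 | 8 |
| 9 | 9 | ❤ | 9 |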


In [3]:
# Display the first few rows of the dataset
data.head()


| Unnamed: 0.1 | Unnamed: 0 | text | emoji |
|---|---|---|---|
| 0 | 0 | 😜 | 0 |
| 1 | 1 | 📸 | 1 |
| 2 | 2 | 😍 | 2 |
| 3 | 3 | 😂 | 3 |
| 4 | 4 | 😉 | 4 |
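
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above usually appear when a DataFrame's row index is written to CSV and read back. A minimal sketch, assuming that is what happened here, of dropping such columns after loading. The inline CSV text and the names `raw` and `tidy` are hypothetical stand-ins for the real file and variables; the rows are copied from the output above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file, mirroring the columns shown above.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,text,emoji\n"
    "0,0,😜,0\n"
    "1,1,📸,1\n"
    "2,2,😍,2\n"
)

raw = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
tidy = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(list(tidy.columns))
```

Filtering by name prefix keeps the sketch independent of how many leftover index columns the file contains.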


### Step 2: Data Preprocessing

In [4]:
import pandas as pd
import re
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
data = pd.read_csv('Emoji Predition data.csv')  # Ensure you have the correct path to your dataset

# Debug: Print column names to verify they exist
print("Columns in dataset:", data.columns)

# Check if the necessary columns are present
if 'text' not in data.columns or 'emoji' not in data.columns:
    raise ValueError("Dataset must contain 'text' and 'emoji' columns")

# Function to clean the text data
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Keep only ASCII letters, digits, and whitespace (this also strips emoji)
    text = text.lower().strip()  # Convert to lowercase and strip whitespace
    return text

# Clean the text column
data['text'] = data['text'].apply(clean_text)

# Tokenize the text data
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(data['text'])
sequences = tokenizer.texts_to_sequences(data['text'])
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=100, padding='post', truncating='post')

# Encode the emojis
label_encoder = LabelEncoder()
data['emoji'] = label_encoder.fit_transform(data['emoji'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, data['emoji'], test_size=0.2, random_state=42)

# Check the shapes of the splits to ensure they are correct
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


Columns in dataset: Index(['Unnamed: 0', 'text', 'emoji'], dtype='object')
X_train shape: (16, 100), y_train shape: (16,)
X_test shape: (4, 100), y_test shape: (4,)
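
In the rows shown in Step 1, the `text` column holds emoji characters rather than words. The sketch below reuses the same cleaning steps as `clean_text` to show what they do to such input. The sample strings are illustrative assumptions, not rows from the dataset (apart from the first emoji).

```python
import re


def clean_text(text):
    """Same cleaning steps as the notebook's clean_text."""
    text = re.sub(r'http\S+', '', text)         # Remove URLs
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Keep ASCII letters, digits, whitespace
    return text.lower().strip()


# Illustrative inputs: an emoji-only string, mixed text, and text with a URL.
samples = ["😜", "I love this! ❤", "Check https://example.com now"]
cleaned = [clean_text(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

An emoji-only string is reduced to an empty string, so the tokenizer has no words to learn from and the padded sequence for that row would contain only zeros. If every row of the dataset looks like the ones shown in Step 1, this is a plausible contributor to the accuracy reported in Step 4.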


### Step 3: Model Selection and Training

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define the model
model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(64, return_sequences=True),
    Dropout(0.5),
    LSTM(64),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(len(label_encoder.classes_), activation='softmax')
])

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model (the test split doubles as validation data here, so val_* metrics are not an independent check)
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
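
To put the model size in perspective, the trainable parameter count of the architecture above can be worked out by hand. This is a rough sketch using the standard formulas for Keras `Embedding`, `LSTM`, and `Dense` layers with default settings (bias enabled). The value `num_classes = 20` is an assumption based on the 20 rows in the split; the notebook itself uses `len(label_encoder.classes_)`. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


num_classes = 20  # Assumption: one class per row of the 20-row dataset.

layer_counts = {
    "embedding": embedding_params(10000, 64),
    "lstm_1": lstm_params(64, 64),
    "lstm_2": lstm_params(64, 64),
    "dense": dense_params(64, 64),
    "output": dense_params(64, num_classes),
}
total = sum(layer_counts.values())

for name, count in layer_counts.items():
    print(f"{name}: {count}")
print(f"total: {total}")
```

Under these assumptions the model has roughly 700,000 trainable parameters, while `X_train` contains 16 samples, so the network has far more capacity than the data can constrain.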


### Step 4: Evaluation

In [6]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Loss: {loss}, Accuracy: {accuracy}')

Loss: 3.0564653873443604, Accuracy: 0.0
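
A useful reference point for the loss above is the chance-level value: a classifier that assigns equal probability to each of K classes has a cross-entropy loss of ln(K). The sketch below computes it, assuming 20 classes (a guess based on the 20 rows in the split, not a value confirmed by the notebook). The function name is hypothetical.

```python
import math


def chance_level_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes labels."""
    return math.log(num_classes)


for k in (10, 20):
    print(k, round(chance_level_loss(k), 4))
```

With 20 classes the chance-level loss is about 3.00, close to the reported 3.06, which suggests the model has not learned a usable mapping from text to label.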


### Step 5: Making Predictions

In [7]:
import numpy as np

# Function to predict the label for a given text
def predict_emoji(text):
    cleaned_text = clean_text(text)
    sequence = tokenizer.texts_to_sequences([cleaned_text])
    padded_sequence = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=100, padding='post', truncating='post')
    prediction = model.predict(padded_sequence)  # Shape (1, num_classes): one probability per label
    # inverse_transform returns the original value from the 'emoji' column,
    # which in this dataset is a numeric id rather than an emoji character
    emoji = label_encoder.inverse_transform([np.argmax(prediction)])
    return emoji[0]

# Test the function
print(predict_emoji("I love this!"))


2
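
The function prints a numeric label because the `emoji` column stores ids, while the emoji characters sit in the `text` column. A minimal sketch of turning such an id back into a character, assuming the id-to-emoji pairs from the first ten rows shown in Step 1. The names `id_to_emoji` and `label_to_emoji` are hypothetical and not part of the notebook.

```python
# Label ids and emoji characters copied from the first ten rows shown in Step 1.
id_to_emoji = {
    0: "😜", 1: "📸", 2: "😍", 3: "😂", 4: "😉",
    5: "🎄", 6: "📷", 7: "🔥", 8: "😘", 9: "❤",
}


def label_to_emoji(label_id, default="?"):
    """Hypothetical helper: look up the emoji character for a numeric label."""
    return id_to_emoji.get(int(label_id), default)


print(label_to_emoji(2))
```

In the notebook itself, a lookup like this would have to be built before `clean_text` overwrites the `text` column, for example with `dict(zip(data['emoji'], data['text']))`. Given the accuracy reported in Step 4, the predicted label should not be read as a meaningful suggestion.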
