<a href="https://colab.research.google.com/github/kimo26/Emojification/blob/main/Emojification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ACE Workshop

The ACE workshop is built around the idea that the best way to learn something new is through hands-on experience. Learning something new is a journey from where you are now to where you want to be, and this workshop is just a vehicle to get you there. We hope that you find it to be a productive and enjoyable learning experience. In this workshop we will walk you through how to build a deep learning model that adds fun emojis to sentences.


# Fetching Data

We need to train our model on sentences with emojis so that it learns when to use which emoji. For this task we will be using Kaggle to collect the data. There are multiple options; the best ones, in my opinion, are the datasets that are collections of English-language tweets. You can either filter all tweets to keep only the ones containing at least one emoji, or just use an already filtered dataset: [EmojifyData-EN: English tweets, with emojis](https://www.kaggle.com/datasets/rexhaif/emojifydata-en).



---



# Linking your Kaggle account to your Colab or Vertex Workbench

First create a Kaggle account by clicking [here](https://www.kaggle.com/account/login?phase=startRegisterTab&returnUrl=%2F), or sign in by clicking [here](https://www.kaggle.com/account/login?phase=startSignInTab&returnUrl=%2F). Then download a token for Kaggle's beta API by going to your account settings and clicking on "Create a New API Token". This will download a file called "kaggle.json" to your computer.

We must now make sure that we mount our Google Drive files by running the following code:

In [None]:
# ONLY NEEDED FOR COLAB / SKIP FOR GCP & VERTEX AI
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Now we can upload the "kaggle.json" file from our computer to our notebook.

In [None]:
# ONLY NEEDED FOR COLAB / SKIP FOR GCP & VERTEX AI
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"kimo26","key":"<redacted>"}'}

In [None]:
!pip install -q kaggle #install Kaggle API client
!mkdir -p ~/.kaggle #create the kaggle directory (no error if it already exists)
!cp kaggle.json ~/.kaggle/kaggle.json #copy the kaggle.json file into the directory
!chmod 600 ~/.kaggle/kaggle.json #restrict the file so that only you can read it

# Downloading the Dataset

We will now download the tweets from Kaggle to our notebook and unzip the archive.

In [None]:
!kaggle datasets download -q rexhaif/emojifydata-en



In [None]:
!unzip -q emojifydata-en.zip



We will only be using dev.txt so we can delete all other txt files.

In [None]:
!rm -rf sample_data
!rm test.txt
!rm train.txt
!rm emojitweets-01-04-2018.txt
!rm emojifydata-en.zip



# Raw Text to Dataframe

At this point we will extract the data from the .txt file and put it into a pandas dataframe so that we can preprocess it.

We will first import all the libraries we will be using and read the text.

In [None]:
import pandas as pd
import re
import nltk 
from nltk.corpus import stopwords

In [None]:
with open('dev.txt') as f:
  t = f.read()

## What does our data tell us?
We know approximately what the data looks like thanks to its page on Kaggle. Every new tweet starts with a START tag and ends with a STOP tag. Every word is followed by "O\n" (the letter O, which is the tag that the cleaning code below strips). Furthermore, we see that the emojis of the tweets are in CLDR Short Name format, e.g. a red heart is denoted as :red_heart:

## Our Goal

We want to create a dataframe with 2 columns: one holding a tweet without its original emoji and the other holding the corresponding emoji. So we must separate all emojis from their corresponding tweets, e.g. we want to start from

> Congratulations Mo has been named the Players Player of the Year :clapping_hands:

and end with 



> Congratulations Mo has been named the Players Player of the Year

> :clapping_hands:
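
As a minimal sketch of this goal, here is the split applied to the example tweet above using only Python's built-in `re` module. The helper name `split_tweet` is made up for illustration and is not part of the notebook, and the pattern assumes that emoji names consist of lowercase letters, digits, underscores and hyphens.

```python
import re

# CLDR short names are wrapped in colons, e.g. :clapping_hands:
EMOJI_RE = re.compile(r":([a-z0-9_\-]+):")

def split_tweet(tweet):
    """Return (text without emoji names, list of emoji names)."""
    emojis = EMOJI_RE.findall(tweet)
    text = EMOJI_RE.sub("", tweet)
    text = " ".join(text.split())  # collapse leftover whitespace
    return text, emojis

text, emojis = split_tweet(
    "Congratulations Mo has been named the Players Player of the Year :clapping_hands:"
)
print(text)
print(emojis)
```

The notebook itself reaches the same result with pandas in step 3; this version only shows the idea on a single string.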

## Step by Step

### 1) Remove all unnecessary tags

In [None]:
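# Strip newlines, <STOP> tags and the "O" tags; none of them carry information about the tweet.
# Caveat: replace('O','') also deletes every capital O inside words (e.g. "Oscar" becomes "scar").
t = t.replace('\n','').replace('<STOP>','').replace('O','')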

In [None]:
t = [i for i in t.split(' ') if i != '']  # remove all unnecessary spaces
t = ' '.join(t)  # add back only the crucial ones

### 2) Separate text into individual tweets and place them into a dataframe

In [None]:
tweets = t.split('<START> ')[1:] #we know that all tweets start with the START tag

In [None]:
tweets[:10]

['No object is so beautiful that under certain conditions it will not look ugly scar Wilde ↺ RT :red_heart:… ',
 'Cant expect different results doing the same thingdoing stuff different from now on :person_shrugging:🏻 \u200d :female_sign:️ ',
 '“ Lets go Marcus ” “ Shiiit where we goin Home ” Marcus Peters :face_with_tears_of_joy: ',
 'Asahd really is a grown man in the body of a 1 year old :face_with_tears_of_joy: ',
 'Yoongi Tweet Hello Im Min fell on Butt What the :face_with_tears_of_joy:Min ',
 'we cannot afford İSJK :backhand_index_pointing_down:n AQ play havoc with our lives in Kashmir Yes these are the independent Kash … ',
 'ranks 6th in January Idol Group Brand Reputation :party_popper:1Keep using 2Search GT 7 on Naver :backhand_index_pointing_down:htt … ',
 'k people are really trying to kill themselves with this Tide Pod challenge Who tf and why tf :person_facepalming:🏽 \u200d :female_sign:️ we had the Cinnamon Ch … ',
 'Cant wait to meet my everything right after meeting my

In [None]:
df = pd.DataFrame()
df['tweets']=tweets
df.head()

| | tweets |
|---|---|
| 0 | No object is so beautiful that under certain c... |
| 1 | Cant expect different results doing the same t... |
| 2 | “ Lets go Marcus ” “ Shiiit where we goin Home... |
| 3 | Asahd really is a grown man in the body of a 1... |
| 4 | Yoongi Tweet Hello Im Min fell on Butt What th... |


### 3) Separate the emojis from the tweets

In [None]:
EMOJI_RE = r":.*?:"  # a CLDR Short Name wrapped in colons, matched non-greedily
df['text'] = df.tweets.apply(lambda x: re.sub(EMOJI_RE, "", x))  # remove every emoji name from the text
df['emoji'] = df.tweets.apply(lambda x: re.findall(EMOJI_RE, x)[0].replace(':', ''))  # find all emoji names in the tweet and keep only the first
df = df.drop(columns=['tweets'])  # the raw tweets column takes up space and is no longer needed
df.head()

| | text | emoji |
|---|---|---|
| 0 | No object is so beautiful that under certain c... | red_heart |
| 1 | Cant expect different results doing the same t... | person_shrugging |
| 2 | “ Lets go Marcus ” “ Shiiit where we goin Home... | face_with_tears_of_joy |
| 3 | Asahd really is a grown man in the body of a 1... | face_with_tears_of_joy |
| 4 | Yoongi Tweet Hello Im Min fell on Butt What th... | face_with_tears_of_joy |


### 4) Delete all unwanted Emojis from Dataframe

In [None]:
remove = ['heavy_check_mark','female_sign','male_sign','white_heavy_check_mark','right_arrow','double_exclamation_mark','yellow_heart','purple_heart','blue_heart','speaking_head','face_with_rolling_eyes',
          'backhand_index_pointing_down','trophy']
df = df[~df.emoji.isin(remove)]  # drop every row whose emoji is in the list

### 5) Balance and Shuffle the Data

In [None]:
g = df.groupby('emoji')
n = g.size().min()  # size of the rarest emoji class
df = g.apply(lambda x: x.sample(n).reset_index(drop=True))  # downsample every class to that size
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows
df.head()

| | text | emoji |
|---|---|---|
| 0 | i like this one Keep it up | clapping_hands |
| 1 | My Little Brother He Is nly 12 Years Doing Gra... | loudly_crying_face |
| 2 | You ready to drop out and start a skate shop i... | smiling_face_with_sunglasses |
| 3 | happy birthday my dude hope youve had a great ... | smiling_face_with_smiling_eyes |
| 4 | Lester Holt amp NBC BIG BB | flushed_face |


### 6) Clean the tweets homogeneously but don't change the semantics

In [None]:
!pip install -U -q autocorrect

In [None]:
from autocorrect import Speller

In [None]:
nltk.download('stopwords')
stop_words = stopwords.words("english")  # common words that say little about the topic, such as: a, an, of, in
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"  # mentions, links and non-alphanumeric characters
spell = Speller(fast=True)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
def preprocess(text):
    text = str(text).lower()  # lower-case the tweet
    # Remove the retweet marker. Caveat: this also strips "rt" inside words ("start" becomes "sta").
    text = text.replace('rt','')
    text = re.sub(TEXT_CLEANING_RE, ' ', text).strip()  # remove links, mentions and punctuation
    tokens = [token for token in text.split() if token not in stop_words]  # remove stop words
    string = " ".join(tokens)
    string = spell(string)  # autocorrect the spelling
    return string

In [None]:
df.text = df.text.apply(lambda x : preprocess(x))
df.head()

| | text | emoji |
|---|---|---|
| 0 | like one keep | clapping_hands |
| 1 | little brother only 12 years grade 6 fighting ... | loudly_crying_face |
| 2 | ready drop sta skate shop queens | smiling_face_with_sunglasses |
| 3 | happy bihday dude hope youve great day today | smiling_face_with_smiling_eyes |
| 4 | letter holt amp nbc big bb | flushed_face |
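
In the output above, "start" has become "sta" and "birthday" has become "bihday". This happens because `str.replace('rt', '')` removes the letters wherever they occur, including inside words. A hedged alternative, assuming the intent is only to drop the standalone retweet marker, is a word-boundary regex. The helper name `strip_retweet_marker` is hypothetical and not part of the notebook.

```python
import re

def strip_retweet_marker(text):
    """Remove 'rt' only when it stands alone as a word."""
    text = re.sub(r"\brt\b", " ", text.lower())
    return " ".join(text.split())  # collapse the gap left behind

plain = "rt ready to start".replace("rt", "")        # plain replace also hits 'start'
careful = strip_retweet_marker("rt ready to start")  # only the standalone marker goes
print(repr(plain))
print(repr(careful))
```

Swapping this into `preprocess` would change the cleaned text, so the model would need to be retrained on the new output.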


# Dataframe to Neural Net Input

At this point we've cleaned and organised our data. Now we have to transform our data such that our neural network will be able to use it to train. 



## Transforming Tweets
Our first challenge is to transform the tweets into a form our model can read. We will be using the Keras text tokenizer, which vectorizes a text corpus by turning each text into a sequence of integers (each integer being the index of a token in a dictionary). We will then pad these sequences so that all our input data has the same length. Finally, we will transform the emojis into one-hot vectors.
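
To make these three steps concrete, here is a rough pure-Python sketch of what tokenizing, padding and one-hot encoding do. It is a simplified stand-in written for illustration, not the Keras or scikit-learn implementation, and all function names are made up. It assumes the same conventions as the Keras defaults: index 0 is reserved for padding, more frequent words get smaller indices, and padding is added at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(word for t in texts for word in t.split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_sequence(text, word_index):
    """Replace each known word by its index and drop unknown words."""
    return [word_index[w] for w in text.split() if w in word_index]

def pad(seq, maxlen):
    """Keep at most the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

def one_hot(label, labels):
    """Vector of zeros with a single 1 at the label's position."""
    return [1 if l == label else 0 for l in labels]

texts = ["happy birthday dude", "happy day"]
word_index = fit_word_index(texts)
padded = [pad(to_sequence(t, word_index), maxlen=4) for t in texts]
encoded = one_hot("red_heart", ["clapping_hands", "red_heart"])
print(word_index)
print(padded)
print(encoded)
```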

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
max_features = 2000
max_length=100
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(df['text'].values)
x = tokenizer.texts_to_sequences(df['text'].values)
x = pad_sequences(x,maxlen=max_length)

In [None]:
labels = df['emoji'].unique().tolist()
# Note: scikit-learn 1.2 and later rename this argument to sparse_output
encoder = OneHotEncoder(sparse=False)
y = np.array(df['emoji'].tolist()).reshape(-1,1)
encoder.fit(y)
y = encoder.transform(y)

In [None]:
print('x',x.shape)
print('y',y.shape)

x (375984, 100)
y (375984, 36)


You can save your encoder and tokenizer for future use of this model.

In [None]:
import pickle
with open("tokenizer.pkl", "wb") as f:  # the with-block closes the file for us
    pickle.dump(tokenizer, f, protocol=0)
with open("encoder.pkl", "wb") as f:
    pickle.dump(encoder, f, protocol=0)
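
To reuse the saved objects later, they can be read back with `pickle.load`. The sketch below uses a plain dictionary as a stand-in for the fitted tokenizer and writes to a temporary folder so that it runs on its own; in the notebook you would open "tokenizer.pkl" and "encoder.pkl" instead.

```python
import os
import pickle
import tempfile

stand_in = {"happy": 1, "birthday": 2}  # placeholder for the fitted tokenizer

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pkl")
with open(path, "wb") as f:
    pickle.dump(stand_in, f)

with open(path, "rb") as f:  # note "rb": pickle files are binary
    restored = pickle.load(f)

print(restored == stand_in)
```

Only unpickle files that you created yourself or otherwise trust, since loading a pickle can run arbitrary code.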

# Model Architecture

The first layer of our model will be an embedding layer. An embedding layer converts each word into a fixed-length vector of a chosen size. The resulting vector is dense, holding real values instead of just 0's and 1's, so words are represented more expressively and in far fewer dimensions than with one-hot vectors.
An LSTM generally works well for a text classification problem like this one, but it is slow to train. One way to speed up training is to add a convolutional layer to the network. Convolutional Neural Networks (CNNs) come from image processing: they pass a "filter" over the data and calculate a higher-level representation. They have been shown to work surprisingly well for text, even though they have none of the sequence processing ability of LSTMs. Finally, to get more out of the data, we make our LSTM layer bidirectional. A bidirectional LSTM reads the input sequence both forwards and backwards, so each word is interpreted using the words that come before it as well as the words that come after it, which gives the model fuller context.
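
To make the idea of a filter passing over the data concrete, here is a toy one-dimensional convolution in plain Python. It is only a sketch with made-up numbers, not what Keras runs internally, but it shows why a filter of width 5 turns a sequence of length 100 into one of length 96.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence without padding ('valid' mode)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

signal = [0, 0, 1, 2, 1, 0, 0]
edges = conv1d_valid(signal, [1, 0, -1])              # a small edge-detecting filter
out_len = len(conv1d_valid([0.0] * 100, [1.0] * 5))   # output length: 100 - 5 + 1
print(edges)
print(out_len)
```

Each output value summarises a small window of neighbouring positions, which is the "higher-level representation" mentioned above.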

## Building the Embedding Layer

TensorFlow enables you to train word embeddings. However, this process not only requires a lot of data but can also be time- and resource-intensive. To tackle these challenges you can use pre-trained word embeddings. Let's illustrate how to do this using GloVe (Global Vectors) word embeddings by Stanford. These embeddings place words with similar meanings close to each other in the same vector space. That is to say, negative words would be clustered close to each other, and so would positive ones.
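
Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors purely for illustration; they are not real GloVe values, and the glove.6B vectors come in 50, 100, 200 and 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, NOT real GloVe values
toy = {
    "great": [0.9, 0.8, 0.1],
    "awesome": [0.8, 0.9, 0.2],
    "terrible": [-0.7, -0.9, 0.1],
}
similar = cosine_similarity(toy["great"], toy["awesome"])
opposite = cosine_similarity(toy["great"], toy["terrible"])
print(similar)   # close to 1
print(opposite)  # negative
```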

In [None]:
!wget --no-check-certificate \
     http://nlp.stanford.edu/data/glove.6B.zip \
     -O /tmp/glove.6B.zip

--2022-10-13 12:24:10--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-10-13 12:24:11--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-10-13 12:24:11--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘/tmp/glove.6B.zip’


In [None]:
import os
import zipfile
with zipfile.ZipFile('/tmp/glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp/glove')

The first step is to read the word embeddings into a dictionary. After that, you'll need to create an embedding matrix with one row for each word in the training set. Let's start by loading the GloVe word embeddings we just downloaded.

In [None]:
embeddings_index = {}
with open('/tmp/glove/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # the first field is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is its vector
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


The next step is to create an embedding matrix with one row for each word in the word index that you obtained earlier. If a word doesn't have an embedding in GloVe, it will be represented by a row of zeros.

In [None]:
word_index = tokenizer.word_index
embedding_dim = 100  # must match the GloVe file loaded above (glove.6B.100d)
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in the embedding index keep their all-zeros row

The next step is to use the embedding matrix you obtained above as the weights of a Keras embedding layer. You also have to set the trainable parameter of this layer to False so that it is not trained. There are a couple of other things to note:



*   The Embedding layer takes the first argument as the size of the vocabulary. 1 is added because 0 is usually reserved for padding
*   The input_length is the length of the input sequences
*   The output_dim is the dimension of the dense embedding

In [None]:
from tensorflow.keras.layers import Embedding
embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=embedding_matrix.shape[1],  # dimension of the GloVe vectors
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)

## Building the Model

In [None]:
from tensorflow.keras.layers import Dense, LSTM,GlobalAveragePooling1D, SpatialDropout1D,Conv1D,Bidirectional,Dropout
from tensorflow.keras.models import Sequential

In [None]:
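model = Sequential()
model.add(embedding_layer)  # frozen GloVe embeddings
model.add(SpatialDropout1D(0.6))  # drops whole embedding channels
model.add(Conv1D(100,5,activation='relu'))  # 100 filters of width 5
model.add(Bidirectional(LSTM(100,dropout=0.6,recurrent_dropout=0.3)))  # reads the sequence in both directions
model.add(Dropout(0.6))
model.add(Dense(y.shape[1],activation='softmax'))  # one output per emoji class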

In [None]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          8532600   
                                                                 
 spatial_dropout1d_4 (Spatia  (None, 100, 100)         0         
 lDropout1D)                                                     
                                                                 
 conv1d_4 (Conv1D)           (None, 96, 100)           50100     
                                                                 
 lstm_4 (LSTM)               (None, 100)               80400     
                                                                 
 dropout_4 (Dropout)         (None, 100)               0         
                                                                 
 dense_4 (Dense)             (None, 36)                3636      
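
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for each layer type. The vocabulary size of 85,326 is inferred from the embedding row (8,532,600 / 100) and is not stated anywhere in the notebook. Note also that the summary lists a plain LSTM layer, so the LSTM formula is applied once rather than doubled for a bidirectional wrapper.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one vector per vocabulary entry

def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units  # weights + biases

print(embedding_params(85326, 100))
print(conv1d_params(5, 100, 100))
print(lstm_params(100, 100))
print(dense_params(100, 36))
```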
                                                      

In [None]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.callbacks import TensorBoard

In [None]:
#Watch our model train
log_dir = "logs/fit/"
call = TensorBoard(log_dir=log_dir,histogram_freq=1)

In [None]:
model.compile(optimizer=Adam(learning_rate=1e-4),loss=CategoricalCrossentropy(),metrics=['accuracy'])


In [None]:
hist = model.fit(x,y, batch_size = 64, epochs=10,callbacks=[call])

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir /content/logs

# Test Model

## Adding the Emoji to the string
We have a trained model now. To diversify our outputs a bit, instead of always selecting the emoji to which our model assigned the greatest confidence, we will pick a random emoji, weighted by the confidences in the model's output array. Our final decision is therefore determined not only by our model but also by an element of chance, so that, for example, the emoji with the second highest confidence is sometimes chosen. We will be using the emoji package, which displays the emojis for us.
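
As a minimal illustration of weighted random selection, the sketch below draws labels with Python's built-in `random.choices`. The labels and confidence values are hypothetical stand-ins for the model's output, and `sample_label` is a made-up helper name.

```python
import random

def sample_label(labels, confidences, rng=random):
    """Pick one label with probability proportional to its confidence."""
    return rng.choices(labels, weights=confidences, k=1)[0]

labels = ["red_heart", "fire", "clapping_hands"]  # hypothetical classes
confidences = [0.6, 0.3, 0.1]                     # hypothetical model output

rng = random.Random(0)  # fixed seed so the demo is repeatable
draws = [sample_label(labels, confidences, rng) for _ in range(10_000)]
for label in labels:
    print(label, draws.count(label) / len(draws))
```

Over many draws each emoji appears roughly as often as its confidence suggests, so the top prediction still dominates without being the only possible answer.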

In [None]:
!pip install -q emoji


In [None]:
import emoji

In [None]:
def addEmoji(text):
    text = preprocess(text)
    # pad to the same length the model was trained on
    x = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=max_length)
    probs = model.predict(x, verbose=0)[0].astype('float64')
    probs = probs / probs.sum()  # renormalise so the weights sum to exactly 1

    # sample a class index with probability proportional to the model's confidence
    idx = np.random.choice(len(probs), p=probs)

    # turn the sampled index back into an emoji name via the one-hot encoder
    y_in = np.zeros(len(probs))
    y_in[idx] = 1
    label = encoder.inverse_transform([y_in])[0][0]

    return emoji.emojize(f'{text} :{label}:')

In [None]:
addEmoji("hey you")

### (Load your model)

If you want to use your model in another context you can save and load your model just like this:

In [None]:
from tensorflow.keras.models import load_model

In [None]:
model.save("model.h5")

In [None]:
model = load_model("model.h5")

# Example Use with ACE Chatbot

Here we will show you what our chatbot looks like with and without the cool added emojis!

In [None]:
!pip install -q ACETH



In [None]:
from ACETH import chatbot as cb

## Boring Chatbot with no Emojis

In [None]:
curr = cb.chatbot(emoji=False)

## Chatbot Everyone Wants to Talk To

In [None]:
cool = cb.chatbot()