# Discussion 8: Embedding models and LSTMs with Keras

### Why all this is important: Natural Language Processing!

Natural language processing (NLP) is a computer program's ability to understand human language as it is written and spoken. Applying AI to NLP and natural language understanding has become one of the most important areas of AI.

Different applications of NLP:

1. Machine Translation: translating text from one language to another.
2. Speech Recognition: converting spoken language into text.
3. Sentiment Analysis: detecting whether the opinion expressed in a text is positive, negative or neutral.
4. Named Entity Recognition (NER): a sub-task of information extraction that classifies named entities into predefined categories.

One big application of NLP, and very much at the peak of ML research today, is $\textbf{Large Language Models}$ (LLMs) such as GPT, LLaMA and others.

LLMs are:
1. trained on extremely large document corpora (e.g. much of the internet)
2. able to take text inputs and generate coherent outputs in real time.

To start working on any NLP task, there are two basic things that you need to implement:

1. An embedding technique that converts sentences into embeddings, which can be fed directly into ML models.
2. A strong, context-aware ML model, such as an LSTM or a transformer, that learns from those embeddings.

We looked at CountVectorizer and TF-IDF last week. Both of these representations are built from the actual data that we are working with.

The disadvantage of a representation like this is that it is not broad enough: it only knows the words that occur in our own data, so it cannot capture information about words it has never seen.
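
To illustrate this limitation, here is a minimal pure-Python sketch of a count-based representation. It is not scikit-learn's CountVectorizer; the helper names `fit_vocabulary` and `count_vector` and the two-sentence corpus are hypothetical and exist only for this example.

```python
from collections import Counter

def fit_vocabulary(corpus):
    """Collect the sorted set of lowercase tokens seen in the training corpus."""
    return sorted({token for doc in corpus for token in doc.lower().split()})

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`; unseen words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

train_corpus = ["the movie was great", "the movie was terrible"]
vocabulary = fit_vocabulary(train_corpus)

print(vocabulary)                                       # ['great', 'movie', 'terrible', 'the', 'was']
print(count_vector("the movie was great", vocabulary))  # [1, 1, 0, 1, 1]
print(count_vector("an awesome film", vocabulary))      # [0, 0, 0, 0, 0]
```

The last sentence is clearly positive, but every one of its words is outside the vocabulary, so its vector carries no information at all.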

Another very powerful technique in this scenario is to use pretrained embedding models provided by packages like gensim and spaCy.

In [1]:
!pip install tensorflow gensim spacy
!python -m spacy download en_core_web_sm


The following is an example of using gensim to load a pretrained embedding model, GoogleNews-vectors-negative300, for encoding words.

The great thing about this model is that it can convert most words into an embedding of shape (300,).

In [8]:
from gensim.models import KeyedVectors

# Load pretrained word vectors (This path might need to be updated based on where you've saved your model)
word_vectors = KeyedVectors.load_word2vec_format("C:\\Users\\premk\\Downloads\\GoogleNews-vectors-negative300.bin\\GoogleNews-vectors-negative300.bin", binary=True)  

# Example of how to get a word vector
vector = word_vectors['computer']  # Get the vector for 'computer'
print(vector.shape)

vector = word_vectors['donut']
print(vector.shape)


(300,)
(300,)


The following example uses spaCy (installed above) to tokenize and lemmatize a sentence, and then averages the gensim word vectors of the remaining tokens into a single sentence embedding.

In [10]:
import spacy
import numpy as np

# Load the small English model
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    # Tokenize and lemmatize text
    doc = nlp(text)
    lemmatized = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    
    # Convert tokens to vectors, ignoring those not in our word_vectors model
    vectors = [word_vectors[token] for token in lemmatized if token in word_vectors]
    
    if len(vectors) == 0:
        return np.zeros(300)  # Return a zero vector of shape (300,), matching np.mean below, if no token has a vector in the model
    else:
        return np.mean(vectors, axis=0)

# Example usage
text = "This is an example sentence for processing."
vector = preprocess_text(text)
print(vector.shape)


(300,)
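
The averaging step can be tried without downloading the full model. The sketch below uses a hypothetical dictionary of 4-dimensional toy vectors in place of the 300-dimensional Word2Vec vectors; `toy_vectors` and `mean_embedding` are illustrative names, not part of gensim or spaCy.

```python
import numpy as np

# Hypothetical toy vectors standing in for the pretrained word_vectors model
toy_vectors = {
    "example": np.array([1.0, 0.0, 2.0, 0.0]),
    "sentence": np.array([3.0, 2.0, 0.0, 0.0]),
}
EMBEDDING_DIM = 4

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "processing" has no toy vector, so it is ignored in the average
sentence_vec = mean_embedding(["example", "sentence", "processing"], toy_vectors, EMBEDDING_DIM)
print(sentence_vec)  # [2. 1. 1. 0.]

# A sentence with no known tokens still yields a vector of the same shape
empty_vec = mean_embedding(["unknown"], toy_vectors, EMBEDDING_DIM)
print(empty_vec.shape)  # (4,)
```

Averaging discards word order, which is one reason sequence models such as LSTMs are used on top of per-word embeddings.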


Using existing pretrained embedding models can be very helpful. Refer to the gensim and spaCy documentation, and to other pretrained models, to explore what is available and how to use it.

Gensim: https://radimrehurek.com/gensim/

spaCy: https://spacy.io/


### LSTM: the basis of memory implemented in neural networks

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) that have been specifically designed to address the limitations of traditional RNNs, particularly in handling long-term dependencies. Traditional RNNs struggle to maintain information in their memory for long periods of time, which is a critical aspect when dealing with sequence data like natural language processing, time series analysis, and more. LSTMs overcome this challenge through their unique architecture, which includes memory cells and gates (input, output, and forget gates) that regulate the flow of information.

These gates control whether to retain or discard information, making LSTMs capable of learning which data in a sequence is important to keep and which can be thrown away. This ability to remember information for long durations and to effectively manage the vanishing gradient problem makes LSTMs highly advantageous for sequence data, leading to improved performance in tasks like text generation, speech recognition, and more.

![alt text](LSTM_cell.png)

Above is a diagram of an LSTM cell, the building block of LSTM networks. Here is a step-by-step explanation of how the cell works:

Memory $(C_{t-1})$ and Hidden state $(H_{t-1})$: The cell takes in two pieces of information from the previous time step – the memory $(C_{t-1})$ and the hidden state $(H_{t-1})$. These are the cell's 'memory' of what it has seen in previous steps in the sequence.

Input $(X_t)$: Along with the previous memory and hidden state, the cell receives the current input $(X_t)$.

Forget Gate $(F_t)$: This gate decides which information is irrelevant and can be discarded from the cell state. It looks at the previous hidden state $(H_{t-1})$ and the current input $(X_t)$ and outputs a number between 0 and 1 for each number in the cell state $(C_{t-1})$. A 1 means “completely keep this” while a 0 means “completely get rid of this”. The sigma (σ) denotes the sigmoid function, which squashes the output to be between 0 and 1.

Input Gate $(I_t)$ and Candidate Memory $(\tilde{C}_t)$: Simultaneously, the input gate decides which new information we're going to store in the cell state. The candidate memory $(\tilde{C}_t)$ is a vector of new candidate values that could be added to the state. It is computed with a tanh function, which squashes values to be between -1 and 1.

Update Cell State $(C_t)$: The old cell state $(C_{t-1})$ is updated to the new cell state $(C_t)$. The old cell state is first scaled elementwise by $F_t$, which discards the information the forget gate chose to drop. Then $I_t * \tilde{C}_t$ (the input gate times the candidate memory) is added to give the new cell state.

Output Gate $(O_t)$: The output gate decides what the next hidden state $(H_t)$ should be. It looks at the previous hidden state and the current input and decides which parts of the cell state will be output. Then, it applies tanh to the cell state (to push the values to be between -1 and 1) and multiplies it by the output of the sigmoid gate, so that we only output the parts we decided to.

Next Hidden State $(H_t)$: The result is the new hidden state $(H_t)$. This new hidden state and the new cell state $(C_t)$ are then carried over to the next time step.
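
The steps above can be summarized in equations. The weight matrices $W_f, W_i, W_c, W_o$ and biases $b_f, b_i, b_c, b_o$ are notation assumed here (they are not labeled in the text above), $[H_{t-1}, X_t]$ denotes concatenation, and $\odot$ is elementwise multiplication:

$$
\begin{aligned}
F_t &= \sigma\left(W_f\,[H_{t-1}, X_t] + b_f\right) \\
I_t &= \sigma\left(W_i\,[H_{t-1}, X_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_c\,[H_{t-1}, X_t] + b_c\right) \\
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
O_t &= \sigma\left(W_o\,[H_{t-1}, X_t] + b_o\right) \\
H_t &= O_t \odot \tanh(C_t)
\end{aligned}
$$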

The combination of these gates and memory updates allows the LSTM to effectively capture long-term dependencies and handle the vanishing gradient problem that can occur in standard RNNs. This makes LSTMs particularly well-suited for tasks such as language modeling, machine translation, and speech recognition, where understanding context and maintaining information over time is critical.
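
To make the gate arithmetic concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration under stated assumptions, not how Keras implements the layer: the function name `lstm_step`, the stacked weight matrix `W`, the bias `b`, and the gate ordering (forget, input, candidate, output) are all choices made for this example, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + inputs) and b has shape (4 * hidden,),
    stacking the forget, input, candidate and output blocks in that order.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b

    f_t = sigmoid(z[0:hidden])                   # forget gate, values in (0, 1)
    i_t = sigmoid(z[hidden:2 * hidden])          # input gate, values in (0, 1)
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])  # candidate memory, values in (-1, 1)
    o_t = sigmoid(z[3 * hidden:4 * hidden])      # output gate, values in (0, 1)

    c_t = f_t * c_prev + i_t * c_tilde           # keep part of the old state, add new information
    h_t = o_t * np.tanh(c_t)                     # expose a gated view of the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
inputs, hidden = 3, 2
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)

# Run the cell over a sequence of 5 random input vectors
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)  # (2,) (2,)
```

The same `h` and `c` are threaded through every step, which is how information from early inputs can still influence later outputs.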

#### Building an LSTM model in Keras

Step-by-step guide to constructing an LSTM model for text classification:

1. Input Layer: This is the entry point for your data into the model. For an LSTM, the input layer must be specifically formatted to represent sequences. This often involves padding or truncating text sequences to ensure uniformity in sequence length.

2. Embedding Layer: This layer transforms the input sequence of word indices into dense vectors of fixed size, typically much more compact than the one-hot encoding representations. Using pretrained embeddings like GloVe or Word2Vec is optional but can significantly boost the model's performance by leveraging prior knowledge of word associations.

3. LSTM Layer(s): Here, one or more LSTM layers are added to process the sequence of word embeddings. The LSTM layers learn to identify and utilize long-term dependencies in the data, essential for understanding the context and semantics in text classification tasks.

4. Dense Layer(s) for Classification: After processing the sequences with LSTM layers, the output is flattened or pooled and fed into one or more dense layers. These layers serve to map the learned sequence representations to the desired output format, such as the classes in a classification task.

5. Compilation (loss function, optimizer): Finally, the model is compiled, specifying a loss function and an optimizer. For a multi-class classification problem, the loss function is often categorical crossentropy, while binary crossentropy is the usual choice for two classes. The optimizer could be Adam, RMSprop, or SGD. This step also typically includes specifying any metrics to monitor during training, such as accuracy.

### LSTM implementation on Twitter Sentiment Analysis dataset.

Data adapted from Kaggle: https://huggingface.co/datasets/carblacac/twitter-sentiment-analysis

In [2]:
import pandas as pd
import numpy as np

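# Randomly keep about 1% of the rows: row 0 is always kept, every other row is skipped with probability 0.99
skip = lambda i: i > 0 and np.random.rand() > 0.01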

df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding_errors='ignore', names=['sentiment', 'id', 'time', 'query', 'username', 'tweet'], skiprows=skip)

# Display the first few rows to confirm it's loaded correctly
print(df.head())

   sentiment          id                          time     query  \
0          0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1          0  1467813992  Mon Apr 06 22:20:38 PDT 2009  NO_QUERY   
2          0  1467814883  Mon Apr 06 22:20:52 PDT 2009  NO_QUERY   
3          0  1467858869  Mon Apr 06 22:32:20 PDT 2009  NO_QUERY   
4          0  1467872181  Mon Apr 06 22:35:50 PDT 2009  NO_QUERY   

          username                                              tweet  
0  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1       swinspeedx  one of my friend called me, and asked to meet ...  
2            gagoo                             im sad now  Miss.Lilly  
3       Jaderade14   is watching the hill . . .and its making me sad   
4           admdrw  @charlietm I know right. I dunno what is going...  
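
The `skiprows` callable receives each row index and returns `True` for rows that should be skipped. The following sketch applies the same lambda to a range of indices, without pandas or the CSV file, to show roughly how many rows survive. The seed and the row count of 100,000 are arbitrary choices for this illustration.

```python
import numpy as np

np.random.seed(0)  # seed chosen only to make this sketch repeatable

# Same rule as above: never skip row 0, skip any other row with probability 0.99
skip = lambda i: i > 0 and np.random.rand() > 0.01

n_rows = 100_000
kept = [i for i in range(n_rows) if not skip(i)]

print(kept[0])             # 0, because the first row is always kept
print(len(kept) / n_rows)  # close to 0.01
```

Because the sampling is random, a different set of tweets is loaded on each run unless the NumPy seed is fixed before calling `read_csv`.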


In [3]:
import tensorflow as tf

tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

In [4]:
import re

def preprocess_text(text):
    # Remove hashtags and mentions
    text = re.sub(r'(@\w+|#\w+)', '', text)
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove numbers and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Lowercase all text
    text = text.lower()
    
    return text


In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, Dropout

# Preprocess the text data with the tweet-cleaning preprocess_text function defined above
X = df['tweet'].apply(preprocess_text).tolist()
# The Sentiment140 CSV labels tweets 0 (negative) or 4 (positive); map them to 0/1 for the sigmoid output
y = (df['sentiment'] == 4).astype(int).values

# Tokenize and pad sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
max_length = 100  # Adjust based on your dataset
X_pad = pad_sequences(sequences, maxlen=max_length, padding='post')

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.2, random_state=42)

# Define the LSTM model architecture
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100, input_length=max_length),
    SpatialDropout1D(0.2),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),  # First LSTM layer
    Dropout(0.2),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),  # Second LSTM layer
    Dense(64, activation='relu'),  # Additional Dense layer to increase model capacity
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

# Compile and train the model (evaluate() raises an error on a model that has not been compiled)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy*100:.2f}%')


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 100)          1937800   
                                                                 
 spatial_dropout1d_2 (Spatia  (None, 100, 100)         0         
 lDropout1D)                                                     
                                                                 
 lstm_3 (LSTM)               (None, 100, 128)          117248    
                                                                 
 dropout_2 (Dropout)         (None, 100, 128)          0         
                                                                 
 lstm_4 (LSTM)               (None, 64)                49408     
                                                                 
 dense_3 (Dense)             (None, 64)                4160      
                                                      

(3198, 100)
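
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard counting formulas for embedding, LSTM and dense layers. The helper names are hypothetical, and the vocabulary size of 19378 is inferred from the summary (1937800 / 100) rather than stated in the notebook.

```python
def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per vocabulary index
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # Four blocks (forget, input, candidate, output), each with
    # input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

print(embedding_params(19378, 100))  # 1937800
print(lstm_params(100, 128))         # 117248
print(lstm_params(128, 64))          # 49408
print(dense_params(64, 64))          # 4160
```

Most of the parameters sit in the embedding layer, which is one reason pretrained embeddings can be attractive when the dataset is small.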

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, Dropout

# Preprocess the text data with the tweet-cleaning preprocess_text function defined above
X = df['tweet'].apply(preprocess_text).tolist()
# The Sentiment140 CSV labels tweets 0 (negative) or 4 (positive); map them to 0/1 for the sigmoid output
y = (df['sentiment'] == 4).astype(int).values

In [6]:
# Tokenize and pad sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
max_length = 100  # Adjust based on your dataset
X_pad = pad_sequences(sequences, maxlen=max_length, padding='post')

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.2, random_state=42)

In [8]:
X[1], sequences[1]

('one of my friend called me and asked to meet with her at mid valley todaybut ive no time sigh',
 [54,
  13,
  5,
  271,
  485,
  14,
  6,
  819,
  2,
  410,
  22,
  100,
  24,
  3619,
  4737,
  7071,
  124,
  37,
  51,
  737])
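
The output above shows each word replaced by an integer index, where more frequent words get smaller indices. The following pure-Python sketch mimics that idea together with post-padding. It is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their actual implementation; `build_word_index` and `texts_to_padded` are hypothetical names.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by descending frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, maxlen):
    """Map words to indices, keep the last maxlen tokens, pad with zeros at the end."""
    padded = []
    for text in texts:
        seq = [word_index[word] for word in text.split() if word in word_index]
        seq = seq[-maxlen:]  # like pad_sequences' default, drop tokens from the front
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

texts = ["i am sad", "i am so so happy"]
word_index = build_word_index(texts)
print(word_index)                               # {'i': 1, 'am': 2, 'so': 3, 'sad': 4, 'happy': 5}
print(texts_to_padded(texts, word_index, 6))    # [[1, 2, 4, 0, 0, 0], [1, 2, 3, 3, 5, 0]]
```

Since index 0 is used only for padding, the vocabulary size passed to the embedding layer is the number of distinct words plus one.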

In [None]:
# Define the LSTM model architecture
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100, input_length=max_length),
    SpatialDropout1D(0.2),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),  # First LSTM layer
    Dropout(0.2),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),  # Second LSTM layer
    Dense(64, activation='relu'),  # Additional Dense layer to increase model capacity
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])