<a href="https://colab.research.google.com/github/jkchandalia/nlpower/blob/main/notebooks/1.0%20NLP_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Natural Language Processing (NLP)**


# Part I: Motivation -- Structuring Unstructured Data
<figure>
<center>
<p align="center">
<img src='https://drive.google.com/uc?export=view&id=1jhbAMekmaR8t8ZWpHN0Rf9y3rDKfzKr_' alt="Addition and subtraction operations on different data types" width="900" height="420"/>
</p>
<figcaption>Addition and Subtraction Operations on Different Data Types</figcaption></center>
</figure>

Deep learning lets us represent all kinds of unstructured data, such as text, images, and videos, and even more abstract things like dog breeds or emotions, numerically as **embeddings**. An embedding is something a computer can work with and that we can feed into models, yet it also captures our understanding of the data.

For the rest of this tutorial, we can think of embeddings as vectors: arrays of floating-point numbers.
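
To make the idea of "embeddings as vectors" concrete, here is a minimal sketch using plain Python lists. The three-dimensional values are invented purely for illustration (real embeddings come from a trained model and are much longer), and `cosine_similarity` is a helper defined here, not part of any library used in this tutorial.

```python
import math

# Toy 3-dimensional "embeddings". The numbers are made up for illustration;
# a real model would produce much longer vectors.
cat = [0.9, 0.1, 0.3]
kitten = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.7]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

Because the vectors are just numbers, "how similar are these two things?" becomes an arithmetic question. With these made-up values, `cat` scores closer to `kitten` than to `car`.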



# Part II: Comparing Sentences

## A. Let's use our knowledge of language to gauge sentence similarity:

## 1.   **The feline slept in the sunshine.**
<figure>
<img src='https://drive.google.com/uc?export=view&id=1gtHWnVm-fPSruzxjMS1GHK0bgLqYBYyo' alt="Cat sleeping in sunshine", width="150" height="150"/>
</figure>


## 2.   **The cat took a nap on the rug.**
<figure>
<img src='https://drive.google.com/uc?export=view&id=1KCjrHAU1V7P73OFjzNiabIR0nfKfpen7' alt="Cat taking nap", width="150" height="150"/>
</figure>



#### Do these two sentences express a similar idea? What about the sentence below?



## 3.   **The cat took a bite of the rug.**

<figure>
<img src='https://drive.google.com/uc?export=view&id=1DtnL-sFkhmFKnOAXPJ4q5n9z-eoZMIP0' alt="Cat biting rug", width="150" height="150"/>
</figure>





## B. Use a Bag of Words Model for sentence similarity

Is there a way to quantify our feelings about the differences/similarities between these three sentences? Let's try something simple and represent each sentence as a bag of words:

In [None]:
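# Represent each sentence as a set of its words (a set ignores order and repeats).
# Note that 'The' and 'the' are case-sensitive, so they count as two different words.
s1 = {'The', 'feline', 'slept', 'in', 'the', 'sunshine'}
s2 = {'The', 'cat', 'took', 'a', 'nap', 'on', 'the', 'rug'}
s3 = {'The', 'cat', 'took', 'a', 'bite', 'of', 'the', 'rug'}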


Compare the overlap between the bags of words:

In [None]:
print("Overlapping words between sentences 1 and 2: ")
print(s1.intersection(s2))
print()
print("Overlapping words between sentences 2 and 3: ")
print(s2.intersection(s3))

Overlapping words between sentences 1 and 2: 
{'the', 'The'}

Overlapping words between sentences 2 and 3: 
{'The', 'rug', 'a', 'cat', 'the', 'took'}


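Compute the similarity scores between the sentences as the number of shared words divided by the number of distinct words in both sentences combined (intersection over union, also known as Jaccard similarity):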

In [None]:
similarity_1_2 = len(s1.intersection(s2))/len(s1.union(s2))
similarity_1_3 = len(s1.intersection(s3))/len(s1.union(s3))
similarity_2_3 = len(s2.intersection(s3))/len(s2.union(s3))

print("Similarity between s1 and s2: ", similarity_1_2)
print("Similarity between s1 and s3: ", similarity_1_3)
print("Similarity between s2 and s3: ", similarity_2_3)

Similarity between s1 and s2:  0.16666666666666666
Similarity between s1 and s3:  0.16666666666666666
Similarity between s2 and s3:  0.6
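
The hand-written sets above treat 'The' and 'the' as different words. As a minimal sketch, the same calculation can be wrapped in helpers that build the bag of words from the raw sentence and lowercase it first. The names `bag_of_words` and `jaccard` are chosen here for illustration and are not used elsewhere in this notebook.

```python
def bag_of_words(sentence):
    """Lowercase, drop the trailing period, and split on whitespace."""
    return set(sentence.lower().rstrip(".").split())

def jaccard(a, b):
    """Intersection over union of two sets."""
    return len(a & b) / len(a | b)

sentences = [
    "The feline slept in the sunshine.",
    "The cat took a nap on the rug.",
    "The cat took a bite of the rug.",
]
b1, b2, b3 = (bag_of_words(s) for s in sentences)

print("Similarity between s1 and s2: ", jaccard(b1, b2))
print("Similarity between s1 and s3: ", jaccard(b1, b3))
print("Similarity between s2 and s3: ", jaccard(b2, b3))
```

Normalizing case changes the numbers somewhat, but not what the method measures: it still only counts shared surface words.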


### *Discussion*

Using a bag of words approach, which sentences are most similar? Does this match our intuition for which sentences are most similar/dissimilar out of our examples?

Because ‘The’, ‘cat’, ‘took’, ‘a’, ‘the’, and ‘rug’ are common to sentences 2 and 3, those two appear more alike than sentences 1 and 2, even though sentences 1 and 2 describe nearly the same scene. What are the limitations of this word-overlap approach?

To quantify differences between sentences, we need to represent them in a numerical way. However, we also need this numerical representation to reflect what the sentences actually mean, i.e., capture the semantic content of these sentences and words. How can we do that?


# Part III: Large Language Models (LLM)

Large language models have the ability to represent sentences in a numerical way (as vectors or embeddings) that reflects our semantic understanding of language.

## A. History of LLMs
<figure>
<center>
<img src='https://drive.google.com/uc?export=view&id=1x0w2nrDUcuAUOqwbjpM8NKrH2T3m5gLH' alt="History of LLMs", width="900" height="600"/>
<figcaption>Recent History of Large Language Models (credit: Hugging Face, https://huggingface.co/blog/large-language-models)</figcaption></center>
</figure>

Advances that have happened since the creation of this graph include PaLM by Google, LLaMA by Meta, and GPT-4 by OpenAI.

The explosion of advances in large language models comes at the confluence of four things: (1) large amounts of computing power, (2) large amounts of data, (3) training techniques that don't require explicit labels (self-supervised learning), and (4) the transformer architecture with multiple attention layers.

## B. Large Datasets

Training a model like BERT, whose base version has about 110M parameters, takes a lot of data. BERT was trained on two corpora: BooksCorpus (800M words), a large collection of free novels by unpublished authors, and English Wikipedia (2,500M words).

## C. Self-supervised learning (training without labels)

As some of you in the field may know, data is the fuel that powers our models, and it can be expensive to label. BERT is trained in a self-supervised way using masked language modeling and next sentence prediction: the training targets come from the text itself, so no human-written labels are required.

### 1. Masked Language Modeling (MLM)

Let’s look at the following sentence and try to predict what word should be in the sentence instead of the **< MASK >** placeholder. 

**The clever < MASK > got the cheese without springing the trap.**




<figure>
<img src='https://drive.google.com/uc?export=view&id=180wxOLtWWJlalHV43KnCXhDYYxNhZDwf' alt="History of LLMs", width="200" height="200"/>
<img src='https://drive.google.com/uc?export=view&id=1MVwAf3_rqJ8Wfwn8SnpCceAHjwoKHj7R' alt="History of LLMs", width="200" height="200"/>
</figure>
What do we think the word is? By training BERT to predict the missing word across hundreds of millions of sentences, we teach the model the relationships between words, which makes it a better language model.
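
To show where the "labels" come from without any human annotation, here is a minimal sketch of how masked training examples could be built from raw text. It is a simplification: the helper name `mask_tokens` is hypothetical, tokens are just whitespace-separated words, and every selected token is replaced by the mask token (BERT selects about 15% of tokens and sometimes substitutes a random word or leaves the word unchanged instead). The demo uses a higher rate so that a short sentence is likely to show a mask.

```python
import random

MASK = "[MASK]"  # BERT's mask token; the text above writes it as < MASK >

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one sentence.

    labels[i] holds the original token wherever a mask was applied and
    None elsewhere, so a model would only be scored on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "the cat took a nap on the rug".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

The original sentence supplies both the input (with gaps) and the answers for those gaps, which is what makes the training self-supervised.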


### 2. Next Sentence Prediction (NSP)

BERT was also trained by taking a pair of sentences and predicting whether they are related, i.e., whether sentence 2 follows logically from sentence 1. As an example, let’s take this first sentence:

**The cat climbed up a tree and got stuck.**

Now, let’s look at two possible next sentences: 

**1. Letters can be posted in person during business hours.**

**2. The firefighter came with a ladder and climbed up to rescue the cat.**

Which one makes more sense? Training BERT on next sentence prediction as well lets the model capture more semantic content.
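
As with masking, the training pairs for next sentence prediction can be generated from raw text. The sketch below is a simplified illustration, not BERT's actual data pipeline: `make_nsp_examples` is a hypothetical helper, and the `unrelated` pool stands in for sentences drawn from elsewhere in the corpus. About half of the pairs use the true next sentence and half use a random one.

```python
import random

def make_nsp_examples(document, unrelated, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples.

    For each consecutive pair in `document`, keep the true next sentence
    about half the time; otherwise swap in a sentence from `unrelated`.
    """
    rng = random.Random(seed)
    examples = []
    for first, true_next in zip(document, document[1:]):
        if rng.random() < 0.5:
            examples.append((first, true_next, True))
        else:
            examples.append((first, rng.choice(unrelated), False))
    return examples

document = [
    "The cat climbed up a tree and got stuck.",
    "The firefighter came with a ladder and climbed up to rescue the cat.",
]
unrelated = ["Letters can be posted in person during business hours."]

for sentence_a, sentence_b, is_next in make_nsp_examples(document, unrelated):
    print(is_next, "|", sentence_a, "->", sentence_b)
```

The `is_next` flag is the label the model learns to predict, and it comes for free from the order of sentences in the text.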


## D. Transformer Encoder and Attention

<figure>
<center>
<p align="center">
<img src='https://drive.google.com/uc?export=view&id=1-vIEv6GOo5EZ_XD9goAyVh52xUyWhGmo' alt="Transformer encoder" width="450" height="300"/>
</p>
<figcaption>Transformer Encoder (from https://jalammar.github.io/illustrated-transformer/)</figcaption></center>
</figure>

The diagram above gives an idea of the architecture of the transformer encoder layers. I will talk a little about the attention layer of the encoder in the next section, but I will treat both the transformer encoder and the attention layer as black boxes and focus on the intuition behind this architecture.

### 1. Attention Mechanism

#### **Attention is an efficient way of processing a sequence of data.**

Modeling sequence data, even something as simple as a sentence, is challenging. Historically, sequence data has been modeled element-by-element while also keeping track of a representation of all the previously seen elements. At the end of the sequence, this representation becomes the final output to be used in downstream tasks like classifying the sentiment of the original sentence, translating it from English to Spanish, or figuring out if two sentences are similar or dissimilar. If this approach sounds complicated, that's because it is :)



The technical details of attention are out of scope, but it is a way of combining the numerical representations of words with information about their positions within the sentence. Intuitively, each item in our sequence asks, or queries, every other item in the sequence to see how important or relevant that item is to it.

As an example, in the following sentence:


**“The cat purred in happiness”**

the words **cat** and **purred** would attend to each other because a cat purring is much more likely than, say, an elephant purring.

After going through an attention layer, each item in the transformed sequence is a mixture of itself and all the other items that contribute to its meaning. We can see this visually in the figure below.

<figure>
<center>
<p align="center">
<img src='https://drive.google.com/uc?export=view&id=1apSmJEnJh_jDI8CJmpmk47YdOFTfmvyW' alt="Visualization of attention" width="400" height="275"/>
</p>
<figcaption>Visualization of Attention (credit: https://github.com/jessevig/bertviz)</figcaption></center>
</figure>
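
The "mixture" idea can be sketched in a few lines of plain Python. This is a deliberately stripped-down version of self-attention: it skips the learned query, key, and value projections, the multiple heads, and the position information that a real transformer uses, and the two-dimensional word vectors are made up for illustration. The function names are chosen here and do not come from any library.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention without learned projections.

    Each output vector is a weighted average of all input vectors, with
    weights derived from scaled dot products between the vectors.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # How relevant is every item (key) to the current item (query)?
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Mix all items together according to those relevance weights.
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional vectors for three words; real models learn these.
words = ["cat", "purred", "the"]
vectors = [[1.0, 0.2], [0.9, 0.3], [0.0, 0.1]]
outputs, weights = self_attention(vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word draws on every word in the sequence. With these toy vectors, "cat" places more weight on "purred" than on "the".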

By passing inputs through many such attention layers, we produce the almost magical ability of LLMs to understand language. 

# Part IV: Transfer Learning

<figure>
<center>
<p align="center">
<img src='https://drive.google.com/uc?export=view&id=1sJj_9pwxqKqD8EBpiYbRRJjTtDVx1kiN' alt="Transfer learning" width="550" height="300"/>
</p>
<figcaption>Transfer Learning is the act of initializing a model with another model's weights (credit: Adapted from Hugging Face)</figcaption></center>
</figure>


Transfer learning means reusing the knowledge contained in large models, which have been trained for a long time on expensive hardware with vast amounts of data, for other tasks that may have much less data and more limited compute resources.

This has been hugely successful with LLMs as well as with large computer vision models. In the next section, we will use transfer learning to adapt a large pretrained model to build a sentiment classifier for our specific dataset.

# Part V: Intro to [Hugging Face](https://huggingface.co/)

Let's explore the Documentation, Models, and Datasets on Hugging Face.