# Module 0 - Introduction to LLMs & NLP Fundamentals

> **What is an LLM?**

A language model is a statistical model that tries to predict words in text, in some kind of language. So the way to set this up, it's basically a statistical problem: you have a collection of text and you say, hey if I show you some words, like say the beginning of a sentence but not the end, can you predict what the missing words are? And this is sort of a prediction problem. This is the kind of thing that we've been doing in machine learning and statistics for a long time. There are lots of ways to try to fit a model to the data to predict these words, but it turns out that in the past few years, we've
developed ways of doing it that can get
better and better accuracy, especially as
you feed in more data. And this
ability ultimately gives you models
that learn useful things about the world
or can be used to sort of query
interesting information about the world.
So just to give you a sense of the scale
that these operate on,
a typical person might read
a few hundred books in their lifetime,
like say 700 books on average.
That's a lot of books; it takes a long
time. But it's not much compared to what
we feed into a large language model
today. One book is about 80,000 words, which, when
encoded into the form that a language
model takes, is about 110,000 tokens.
A token is sort of a
piece of a word; we'll talk a lot about
them later. So, your typical model
like say ChatGPT is often trained on
trillions of tokens today, which means
tens of millions of books if you do the
math. So that's a lot more books than a
person would ever look at, and if you
have a way of turning these into
a useful statistical model, you often end
up with something that can be
applied to real world tasks.
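
The "if you do the math" step above can be written out as a short calculation. This is only a back-of-the-envelope sketch using the figures quoted in the text (80,000 words and 110,000 tokens per book); the function name `books_equivalent` and the example token counts are assumptions chosen for illustration.

```python
# Figures quoted in the text above.
WORDS_PER_BOOK = 80_000
TOKENS_PER_BOOK = 110_000

# Implied ratio of tokens to words (about 1.375 tokens per word).
tokens_per_word = TOKENS_PER_BOOK / WORDS_PER_BOOK


def books_equivalent(training_tokens):
    """Return how many average-sized books a token count corresponds to."""
    return training_tokens / TOKENS_PER_BOOK


# Hypothetical training-set sizes, from one to five trillion tokens.
for trillions in (1, 2, 5):
    books = books_equivalent(trillions * 10**12)
    print(f"{trillions} trillion tokens is roughly {books / 1e6:.1f} million books")
```

One trillion tokens already corresponds to about nine million books, so "trillions of tokens" lands in the tens of millions, far beyond the roughly 700 books a person might read.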

**So, a language model is really just a
computational model that takes in a
sequence of some kind, for example the
sequence of tokens we spoke about
just before,
and then produces a probability
distribution over its vocabulary (the full
set of tokens it knows) to find the most
likely word. Now, language models in general,
LLMs included, can be split into two
categories: they can be generative, or
they can be classification based.**

- **Typically, for a classification model, the
prediction that the model is looking for
is a masked word that it tries to
uncover**.

- **For generative models and for generative
AI, which is the topic of most of the
research at the moment, what the model
is trying to do is predict the
likely next word after the sequence
that it's been shown.
Internally, it computes a
probability distribution spread over
the entire vocabulary that the model has
to use, and uses it to find the word that fits best
to come next.**
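
To make "a probability distribution over the vocabulary" concrete, here is a minimal sketch. It assumes the model has already produced one raw score (a logit) per vocabulary entry; the four-word vocabulary and the scores are made up for illustration and do not come from any real model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical tiny vocabulary and made-up scores for the
# prompt "the cat sat on the ...".
vocab = ["mat", "dog", "moon", "sat"]
logits = [3.2, 0.5, -1.0, 1.1]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]

for token, p in zip(vocab, probs):
    print(f"{token:>5}: {p:.3f}")
print("most likely next word:", next_token)
```

A real model does the same thing over a vocabulary of tens of thousands of tokens, and may sample from the distribution rather than always taking the top entry.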

> **What does this mean for developers and users?**

LLMs can automate many tasks
that previously required a person to do
them in detail, that usually involve
imprecise
language or knowledge about the world,
that's kind of hard to codify and
put into a computer. And this can help
both accelerate innovation and build new
features, new interfaces to
the software, and just increase ROI and
efficiency of a business. Here are just some of
the examples of things you can do with
it. You can speed up software development. You can democratize AI itself: users can now use LLMs by just talking
to them and asking them to do things,
instead of doing a heavy duty machine
learning project. You can open up
rich use cases like assistants,
analyzing documents for business,
generating ad copy, and all kinds of
things like that. And of course you can reduce
development costs of applications, and
you can reduce monotonous tasks that
people have to do: give the easy stuff to
a model, and get people to focus on the
hard stuff. So we're just beginning to
scratch the surface of the use cases.

> **Factors to take into consideration when thinking about LLMs & their trade-offs**

Now unfortunately with LLMs, there's no
perfect or magic bullet model. So once
you get past the idea
of, what could I use this for,
and try to actually use it in an
application, the right model will
depend a lot on the requirements for
your application. And there'll be a lot
of trade-offs required and potentially
quite a bit of custom development, so
here are some of the factors you want to
take into account when thinking about an
application using LLMs.

1) **Model Quality** - The first one is
the model's quality. What
kind of quality can I get from it? Can I
find a model that I can use out there
that's high quality, or do I have
a way to improve it myself? Or do I need
to wrap the LLM in a
bigger application that uses this
unreliable component to do something
reliable at the end? So that's an
important piece.

2) **Serving Cost** - The second one is the
serving cost. This depends a lot on what
you're trying to do.
This is the cost of just running the
model and getting a prediction. For
example if your application is
reading a couple of documents per day
and extracting some information from
them, it's okay if it costs a
dollar, ten dollars, whatever for the
LLM to process one document. It's
totally fine. But if your application is
placing ads on a web page that
are customized to each user, or selecting
recommendations for them as they
click through the site for your store,
you can't afford to spend
many dollars each time someone looks at
a page. You need to serve
millions, maybe hundreds of millions,
of instances of this per day
at a moderate cost. So you'll really
need to optimize for cost.

3) **Serving Latency** - Serving latency is
another important factor. Again if you're
doing some offline analysis, it's okay if
it takes minutes for your application
to run. But if you're doing something in an
interactive web page, it had better
respond
within a few milliseconds to feel like
a high-quality application.

4) **Customizability** - And then the final one that can
be tricky, and is important to think
about, is customizability. If you have an
important application, you're going to
want to keep improving its quality over
time, debugging it when something goes
wrong, and generally sort of making it
better and customizing it. So you need to
think about, in different solutions, what
knobs do I have to control
how well this is doing,
to monitor it, to make it better,
to prevent certain kinds of bad behavior
that I don't want it to have,
and so on. So this is an important
thing to think about as you design your
LLM application.
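
The serving-cost trade-off described in point 2 can be made concrete with a little arithmetic. This is a rough sketch: the per-request prices and request volumes echo the examples in the text, while the daily budget figure and the function name `daily_cost` are hypothetical.

```python
def daily_cost(cost_per_request_usd, requests_per_day):
    """Total spend per day for a given per-request cost and volume."""
    return cost_per_request_usd * requests_per_day


# Offline document extraction: a couple of documents per day at
# ten dollars each is perfectly affordable.
doc_pipeline = daily_cost(10.0, 2)

# Ad or recommendation serving: the same per-request price at
# one hundred million requests per day is not.
ad_serving = daily_cost(10.0, 100_000_000)

# Working backwards from a hypothetical budget of $1,000 per day
# gives the per-request cost the application can actually afford.
budget_usd = 1_000.0
max_cost_per_request = budget_usd / 100_000_000

print(f"document pipeline: ${doc_pipeline:,.2f} per day")
print(f"ad serving:        ${ad_serving:,.2f} per day")
print(f"affordable cost per request: ${max_cost_per_request:.6f}")
```

The same model can be a sensible choice for one workload and far too expensive for another, which is why cost has to be weighed alongside quality, latency, and customizability.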

# 1.1 Natural Language Processing - NLP

> **Use cases of NLP**

NLP is useful because it enables us to
solve a number of different tasks.
These range from deciding whether the
sentiment of a review that someone is
giving about a particular product is
positive or negative, to
translating from one language to another,
to creating a chatbot or
some other interactive system that
relies on natural language as its form
of input.
Other use cases include similarity
search, in which we take
a natural language input and return a
natural language output. Summarization is another important
component of NLP: we may want to summarize
complex documents or find just the
important pieces in a very long document.
Finally, text classification can be more
than just a positive or negative review.
It can also tell us what the
genres are, what the moods are, and what
is contained within the text.

> **NLP Fundamentals - Tokens**

If we think about what a
sentence is made up of, we could think of
phrases, words, or characters. Those
individual building blocks are what are
known as "tokens" in natural language
processing. Tokens don't have to be just words,
characters, or subwords; they're a choice
that we make when we create our models.
Tokens you can think of as the building
blocks or the atoms of NLP problems.
A sequence, therefore, is a collection of
tokens that's meant to imply a
sequential listing of those tokens. If we
consider tokens to be words, then a
sequence might be an entire sentence or
a fragment of a sentence.
If a token is considered to be a
character, then a sequence might be "t", "h",
and "e" for the single word "the".
The vocabulary, then, is the entire set of
tokens that we have at our disposal for
our particular model.
This could be a vocabulary of words in
the English language, or it could be a
vocabulary of all the characters that
we're using to define our problem.
**This means we can then classify
different problems in NLP using these
definitions. Being able to classify these different
types of tasks will be relevant further
on, because evaluating whether a
model is good or bad can be quite
subjective and dependent on the task
that you're solving.**
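
The three definitions above (token, sequence, vocabulary) can be illustrated in a few lines. This is a minimal sketch; the example sentence and variable names are made up for illustration.

```python
sentence = "the cat sat on the mat"

# Word-level tokens: the sequence is the sentence itself.
word_tokens = sentence.split()

# Character-level tokens: the sequence is a single word.
char_tokens = list("the")

# The vocabulary is the set of distinct tokens available to the model.
word_vocab = sorted(set(word_tokens))
char_vocab = sorted(set(sentence.replace(" ", "")))

print("word sequence:  ", word_tokens)
print("char sequence:  ", char_tokens)
print("word vocabulary:", word_vocab)
print("char vocabulary:", char_vocab)
```

Note that the sequence keeps repeats and order ("the" appears twice), while the vocabulary contains each token only once.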

> **Tokenization - Transforming Words into word-pieces**

The process
of tokenization is meant to take the
words, or whatever units we choose to break
our text into, and convert them into a format
that we can use in computation. Computers operate on numbers rather than
on symbols such as words, so we need to convert our
words or our text into a
numerical format.

One of the most common and typical
places to start when thinking about how
to tokenize would be to cut a sentence
up, or a sequence of words, into
individual words themselves.

So, the process of tokenization is
twofold. Firstly, we create a vocabulary of all of
the different tokens that we can see in
our training data set. If we have the
English dictionary, for example, we could
take every word in that dictionary and
make it part of our vocabulary, so that
every word has an associated number. We
could start with the first word, give it
the number zero, and then go all the way
through the rest of the dictionary.
This would build up our index. Secondly,
anytime that we see a new sequence of
tokens,
we could convert it to a list of
indices (numbers), so that we could
codify it as a series of numbers that
our models could then work with. **Embedding
vectors work very well to encapsulate
meaning for every token**.
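
The two steps described above (build an index, then encode new sequences) might look like the following. This is a simple word-level sketch with a made-up two-sentence corpus; the helper names `build_vocab` and `encode` are assumptions, and real tokenizers are considerably more involved.

```python
def build_vocab(corpus):
    """Step 1: assign each distinct word the next free integer, starting at 0."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Step 2: convert a new sequence of words into a list of indices."""
    return [vocab[word] for word in sentence.lower().split()]


# Hypothetical training data.
corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(corpus)
ids = encode("the dog sat on the mat", vocab)

print(vocab)
print(ids)
```

The resulting list of integers is what the model works with; each index is later looked up in an embedding table to obtain a vector.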


> **Tokenization Limitation - Misspelling Words**

There are some limitations and
problems with using word tokenization.
Think about the training set that
we're using to build our vocabulary. If
it misses some words, common or
uncommon, and we see them later on in our
usage of our language model, the model will come
up with an error; this is
technically an out-of-vocabulary (OOV)
error.
This is a limitation for word-based
tokenizations as you have to associate
every individual word with a particular
token value.
This also means that if we have
misspellings, or if we want to create new
words, we can't handle them, as these
tokenization schemes won't allow for
this kind of
flexible behavior. They're very brittle.
It also means we end up with very large
vocabularies, as we have to account for
every form of every word. So, if we have
"fast", "faster", and "fastest", we need
three different tokens in order to store
those three words. And then if we have
"slow", "slower", and "slowest", we again need
another three tokens. Whereas, if you think
about it, we could take just the stem of
each of those words and then add the
suffixes as separate pieces.
Another solution to make our vocabulary
much smaller would be to just look
at individual characters.
If we're picking, say, the English
language, we would have 26 characters for
the lower case, 26 for the upper case, and
then maybe some punctuation and
numerical
characters as well.
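
The brittleness of word-level tokenization, and the character-level alternative, can be seen in a small sketch. The toy word vocabulary and the misspelling "fastr" are made up for illustration, and the character set (ASCII letters, digits, punctuation, and a space) is one assumed choice among many.

```python
import string

# Toy word-level vocabulary: every form of every word needs its own entry.
word_vocab = {"fast": 0, "faster": 1, "fastest": 2,
              "slow": 3, "slower": 4, "slowest": 5}


def encode_words(text, vocab):
    """Word-level encoding that fails on any unseen word."""
    ids = []
    for word in text.split():
        if word not in vocab:
            raise KeyError(f"out-of-vocabulary word: {word!r}")
        ids.append(vocab[word])
    return ids


# Character-level vocabulary: small and fixed in size.
alphabet = (string.ascii_lowercase + string.ascii_uppercase
            + string.digits + string.punctuation + " ")
char_vocab = {ch: i for i, ch in enumerate(alphabet)}


def encode_chars(text, vocab):
    """Character-level encoding; a misspelling is just another sequence."""
    return [vocab[ch] for ch in text]


try:
    encode_words("fastr", word_vocab)  # a misspelling
    word_level_failed = False
except KeyError as err:
    word_level_failed = True
    print("word-level:", err)

char_ids = encode_chars("fastr", char_vocab)
print("char-level:", char_ids)
print("character vocabulary size:", len(char_vocab))
```

The character vocabulary never hits an out-of-vocabulary error for ASCII text, but each character carries very little meaning on its own, which motivates the subword middle ground.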

The middle ground between these two extremes
would be to work in terms of
subwords.
This would break up words like
"subject", for example, into
"sub" and "ject", and then you'd be
able to build up other words like "object",
"subjective",
"subordinate", and "submarine"
using these pieces of
words. Now, there are a number of
different strategies. Byte pair encoding (BPE)
is a popular scheme to build up these
kinds of vocabularies. There are
others, like SentencePiece and WordPiece,
which are also very commonly used in
modern large language models. These
tend to offer a good
trade-off between vocabulary size and
flexibility in handling words that
are outside of the vocabulary, while
retaining
sufficient meaning from the words that
you're describing.

Once we have these tokens, we want
to figure out how we can
incorporate meaning and context. This is done with word embeddings.
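
As an illustration of how subword pieces combine, here is a small sketch that splits words using greedy longest-match against a hand-written piece vocabulary. This is a simplification and not byte pair encoding itself: real schemes learn their piece vocabulary from data, whereas the pieces below are chosen by hand to mirror the "sub" and "ject" example, and the function name `subword_tokenize` is an assumption.

```python
def subword_tokenize(word, pieces):
    """Split a word into the longest known pieces, left to right."""
    result = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a known piece.
        while end > start and word[start:end] not in pieces:
            end -= 1
        if end == start:
            return ["<unk>"]  # no known piece fits at this position
        result.append(word[start:end])
        start = end
    return result


# Hand-picked toy piece vocabulary (an assumption for this example).
pieces = {"sub", "ob", "ject", "ive", "marine", "ordinate"}

for word in ["subject", "object", "subjective", "submarine", "subordinate"]:
    print(word, "->", subword_tokenize(word, pieces))
```

Six pieces are enough to cover five different words here, which is the vocabulary-size benefit that subword schemes aim for.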

> **Word Embeddings - The Power of Similar Context & Vectors**

The goal of word embeddings is to try and
preserve the context that a particular
token has in the text where it is used. The context
might be its relationship to other words
or it might be the intrinsic meaning
that a particular word might have. Words with similar meaning
often tend to occur in similar contexts.

What would be fantastic is if we could
build up some kind of scheme or some
kind of model that would let us express
this numerically. **Therefore, the goal then is to try and build up
some vectors that we can use for context
mapping and for embedding**.


> **Vectors Limitations - Sparsity**

However, if we represent text by counting
which vocabulary words it contains, and
extend this to a realistic vocabulary, we would
have 99% or more of the vector
filled with zeros, as most
sentences of any reasonable length are
not going to contain anywhere near the
number of different words that you might
have in a typical language.
If the English language has something
like 250,000 unique words, you're never
going to find any sentence, or really any
document, that's going to come close to
containing all of those words.

The problem of sparsity here means
that this is not really going to work
for us as we build this and scale it up
into larger problems and more complex
documents. It also loses the sense
of what each of the words means: while we
can see that the word "the" appears more
commonly than any other word, that doesn't
really give us a sense of what that word
actually means.
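
The sparsity problem described above is easy to quantify. This sketch assumes a count-based representation with one slot per vocabulary word, uses the rough 250,000-word figure from the text, and a made-up example sentence.

```python
from collections import Counter

VOCAB_SIZE = 250_000  # rough figure for English quoted above

sentence = "the cat sat on the mat"
counts = Counter(sentence.split())

# Only the slots for words that actually occur are non-zero.
nonzero_slots = len(counts)
sparsity = 1 - nonzero_slots / VOCAB_SIZE

print("word counts:", dict(counts))
print("non-zero slots:", nonzero_slots, "out of", VOCAB_SIZE)
print(f"fraction of zeros: {sparsity:.5%}")
```

The counts also show the second problem: "the" has the highest count, yet that number says nothing about what the word means.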

> **Embedding Vectors**

This is where the **embedding
functions or vectorizing functions for
words come into play**. We won't go
into the details of any particular word embedding
method, but in essence we take
every word in our training data set and
we look at the words that surround it. We
use a window that looks at the words to
the left and the words to the right of
every single word.
What this does is build a map
of how one word appears with a few other
words, and how another word might appear
with the same few other words in
particular contexts.
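
The windowing idea can be sketched with plain co-occurrence counts. This is not any particular embedding method (trained embeddings are learned, dense, and much lower-dimensional); it only illustrates how words that share neighbours end up with similar vectors. The four-sentence corpus, the window size, and the helper names are assumptions made for the example.

```python
import math
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up corpus in which "cat" and "dog" share most of their neighbours.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cars",
]
counts = cooccurrence(corpus, window=2)

sim_cat_dog = cosine(counts["cat"], counts["dog"])
sim_cat_milk = cosine(counts["cat"], counts["milk"])
print(f"cat vs dog:  {sim_cat_dog:.2f}")
print(f"cat vs milk: {sim_cat_milk:.2f}")
```

"cat" and "dog" never appear in the same sentence, but they appear with the same surrounding words, so their vectors come out more similar than "cat" and "milk".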

## Summary

- Natural language processing is a field
that focuses on natural language and how
to study it, particularly text. However,
natural language processing is much
wider than just
modeling text-based problems; we
also look at speech, text-to-video, image-to-text,
and all of the other areas where
natural language is important.

- Natural language processing is incredibly useful
for things like translating from one
language to another, summarizing long
pieces of text, and classification, as well as problems
where we want a natural language input
and a natural language output.
These are done by language models, which
are essentially just tools to create a
probability distribution over the
vocabulary of the tokens that we have to
use.

- Large language models are language
models based on the transformer
architecture with millions or billions
of parameters.

- Tokens are the smallest
building blocks of our language models.
Tokenization converts our text to indices,
which we then convert to n-dimensional
word embedding vectors, so that we can
capture more detail about the context
and the meaning behind each token or
each word.