![image.png](attachment:452eb318-8c8f-4401-aaa2-9037edc9247b.png)

## Stages for Building and Finetuning a Large Language Model (LLM)

### Stage 1: **Building a Large Language Model**
- **Data Preparation and Sampling**:  
  This step involves collecting, cleaning, and sampling data. It ensures that the data fed into the model is of high quality and relevant for training purposes.
  
- **Attention Mechanism**:  
  The attention mechanism is a core building block of the LLM. It lets the model weigh how relevant each part of the input is to the other parts, so that it can focus on the most important context while learning.
  
- **LLM Architecture**:  
  The architecture of the large language model (LLM) is built in this step. This involves defining the structure of the neural network and ensuring it can efficiently handle the data provided.

> **Goal**: Implement data sampling and understand the basic mechanism of the LLM.
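To make the idea of "focusing on important parts of the input" concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. This is one common form of attention and is an assumption on our part; the overview above does not commit to a specific variant. The function names `softmax` and `attend` and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each key by its dot product with the query, scaled by sqrt(dimension).
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Turn the scores into weights that sum to 1.
    weights = softmax(scores)
    # The context vector is the weighted sum of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Toy example: three input positions with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, context = attend(query, keys, values)
print("attention weights:", weights)
print("context vector:", context)
```

The positions whose keys are most similar to the query receive the largest weights, so their values contribute most to the context vector.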


### Stage 2: **Foundational Model**
- **Training Loop**:  
  This involves running the LLM through multiple iterations of training. The model is exposed to the data repeatedly, gradually improving its performance.

- **Model Evaluation**:  
  After each round of training, the model’s performance is evaluated. Metrics are used to determine how well the model is learning and adjusting.

- **Load Pre-trained Weights**:  
  Instead of training the model from scratch, pre-trained weights can be loaded. These weights are parameters learned in an earlier training run, so starting from them shortens training and usually improves performance.

> **Goal**: Pre-train the LLM on unlabeled data.
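The three parts of this stage can be illustrated with a deliberately tiny stand-in. The sketch below is not an LLM: it assumes a "model" with a single weight `w` that should learn `y = 2x`, trained with plain gradient descent. All names (`train`, `mse`, `train_data`, `val_data`) and the hyperparameters are hypothetical and chosen only to show the shape of a training loop, a per-epoch evaluation, and a restart from previously learned weights.

```python
# Toy data: the target relationship is y = 2 * x.
train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val_data = [(4.0, 8.0), (5.0, 10.0)]

def mse(w, data):
    # Model evaluation: mean squared error of the one-weight model on a dataset.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, epochs, lr=0.05):
    # Training loop: repeatedly expose the model to the training data.
    val_losses = []
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in train_data) / len(train_data)
        w -= lr * grad
        # Evaluate on held-out data after each epoch.
        val_losses.append(mse(w, val_data))
    return w, val_losses

# Train from scratch, starting at w = 0.
w_scratch, losses_scratch = train(w=0.0, epochs=20)

# "Load pre-trained weights": start from the learned weight instead of 0.
w_resumed, losses_resumed = train(w=w_scratch, epochs=1)

print("weight after training from scratch:", w_scratch)
print("first validation loss from scratch:", losses_scratch[0])
print("first validation loss from pre-trained weight:", losses_resumed[0])
```

Starting from the previously learned weight gives a much lower validation loss in the first epoch than starting from zero, which is the practical benefit of loading pre-trained weights.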

### Stage 3: **Finetuning**
- **Finetuning for Specific Tasks**:  
  At this stage, the foundational LLM is fine-tuned for specific tasks, such as:
  - **Classifier**: Using the LLM for classification tasks, where the model categorizes inputs into predefined labels.
  - **Personal Assistant**: The model is adapted to function as a personal assistant, responding intelligently to queries and performing tasks based on the user's instructions.

> **Goal**: Customize the foundational model for specific use cases via fine-tuning.


Now let's go through the stages one by one to understand and build the Large Language Model.

### 1. Data Preparation and Sampling

We cannot feed raw data, such as text or images, directly into the model, because a model only understands numbers. To make our data usable, we have to convert it into numbers. As a first step, we divide the collected text into small parts. This process is called tokenization, and it is part of data preparation and sampling.


`How do you prepare input text for training LLMs?`

* Step I: Split the text into individual word and subword tokens
* Step II: Convert the tokens into token IDs
* Step III: Encode the token IDs into vector representations
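The three steps can be sketched end to end in a few lines. This is a simplified illustration under stated assumptions: the helper names (`tokenize`, `build_vocab`, `embed`) are hypothetical, the vocabulary is built from the sample sentence itself, and the embedding table holds deterministic placeholder values, whereas a real model learns these vectors during training.

```python
import re

def tokenize(text):
    # Step I: split on commas, periods, and whitespace, keeping punctuation as tokens.
    pieces = re.split(r'([,.]|\s)', text)
    # Drop empty strings and whitespace-only entries.
    return [p for p in pieces if p.strip()]

def build_vocab(tokens):
    # Step II (setup): map each unique token to an integer ID, in sorted order.
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

def embed(token_ids, vocab_size, dim=3):
    # Step III: look up one vector per token ID in a toy embedding table.
    # The values are placeholders; a trained model would have learned them.
    table = [[(row * dim + col) / (vocab_size * dim) for col in range(dim)]
             for row in range(vocab_size)]
    return [table[i] for i in token_ids]

text = "Hello, world. Hello, test."
tokens = tokenize(text)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]
vectors = embed(token_ids, len(vocab))

print(tokens)     # ['Hello', ',', 'world', '.', 'Hello', ',', 'test', '.']
print(token_ids)  # [2, 0, 4, 1, 2, 0, 3, 1]
print(vectors[0]) # the 3-dimensional vector for the first token
```

Note that the same token always maps to the same ID and therefore to the same vector, which is why both occurrences of `Hello` share one representation.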


#### Step I: Splitting text into individual word and subword tokens

 1. We use the short story 'The Verdict' by Edith Wharton as the dataset.
 2. Download the text and load it in Python.
 3. Convert the text into tokens (words or subwords).

In [4]:
# Step 1: The text file was downloaded beforehand and placed in the working directory
# Step 2: Load the text of 'The Verdict' by Edith Wharton in Python

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("The total number of characters:", len(raw_text))
print(raw_text[:99])

The total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [6]:
# Step 3: Convert the text into tokens (words or subwords)
import re  # regular expressions

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)  # split on whitespace; the capture group keeps the separators
print(result)  # a list of words (punctuation still attached) and whitespace characters

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


<div class="alert alert-block alert-warning">

Let's modify the regular expression so that it splits on whitespace (\s) as well as on commas and periods
([,.]):</div>

In [7]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<div class="alert alert-block alert-info">
We can see that the words and punctuation characters are now separate list entries, just as
we wanted. The list still contains whitespace and empty strings, which we remove next.
</div>


In [8]:
result = [item for item in result if item.strip()]
print(result) # whitespace-only and empty entries are removed

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
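The split and filter steps can be wrapped in one helper so they are easy to reuse on longer text. The sketch below is one possible version, not a definitive tokenizer: the function name `simple_tokenize` is hypothetical, and the wider set of separators (colons, semicolons, question marks, quotes, parentheses, and double dashes) is an assumption added because the opening of the story contains `--`.

```python
import re

def simple_tokenize(text):
    # Split on punctuation, double dashes, or whitespace.
    # The capture group keeps the separators in the result.
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Strip each piece and drop entries that are empty after stripping.
    return [p.strip() for p in pieces if p.strip()]

sample = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough"
print(simple_tokenize(sample))
```

With this pattern, `genius--though` is split into three tokens (`genius`, `--`, `though`) instead of staying one long string.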
