# [ML-10] Text generation

## What is natural language processing?

**Natural language processing** (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is concerned with providing computers the ability to process text written in a natural language. 

The adjective *natural* is used here to refer to the languages we humans use, such as English, Spanish, etc, as opposed to computer languages like Python or Java. Nevertheless, the language models developed since 2020 have been trained in a way that they can process code written in a number of computer languages, so the distinction is fading away.

NLP has advanced in giant strides in recent years, as **large language models** (LLM's) took the stage. Unfortunately, this means that reading anything published more than five years ago may be a waste of time.

The NLP toolkit used in this course is based on LLM's. The input data unit is typically a string (which can be a long document). A data set contains a collection of these inputs. Some classic collections are called **corpora**. A famous one is the Wikipedia corpus.

## NLP tasks

For many years, natural language processing revolved around a few specific tasks. The classics are:

* **Text classification**. The input is a text and the output a set of class probabilities. For instance, in the preceding lecture we trained a model for fake news detection. Another popular application, which appears in the example of this lecture, is **sentiment analysis**. 

* **Text generation**. The input is a text and the output another text. This covers many specific tasks:

    + **Summarization**. The output text is a summarized version of the input text. 

    + **Translation**. The output text is a translation of the input text to another language. Automatic translation from Russian to English is one of the very classics of artificial intelligence.

    + **Question answering** (QA). The input text is a question and the output text is an answer to that question, either extracted from a context text inputted together with the question, or retrieved from a data source.

    + **Named entity recognition** (NER). The input is a text together with a collection of entities. The output is a list of the entities found in the text. For instance, the model can extract a dish name from a restaurant review.

Though this view of natural language processing is still found in many places, the perspective has changed. Right now, we have, essentially, two types of models: the **embedding models**, that we have already seen in this course, and **text generation models**, which appear in this lecture. 

## Tokens

Even if the data unit in NLP is a string, that string is not processed as a whole, but first split into a list of substrings called **tokens**. The extraction of tokens from the input text is called **tokenization**. Tokenization is one of the oldest problems in NLP, and different approaches have been discussed for years: words, sub-words, pairs of words, etc. Also, tokenization is a harder problem in some languages, like Spanish and French, with their many verb forms, or German, with its declensions, prefixes and suffixes, than in English.

Nowadays, the debate about the tokenization level has lost interest for practitioners. First, there is a plentiful supply of tokenization models, called **tokenizers**. Second, all the relevant LLM's come with their own tokenizers, so you don't have to think about them once you have chosen your model. In these models, most of the tokens are words, and a small proportion are subwords. Also, punctuation marks can be tokens. In addition to all these, there is an "unknown" token (the vocabulary of the model is limited), a token for the start of the text and another token for the end. So, when we input a string to one of these models, it is converted to a list of tokens, though we never see them in practice.
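The conversion from string to tokens can be illustrated with a toy sketch. The vocabulary, the special token names (`<s>`, `</s>`, `<unk>`) and the functions `tokenize()` and `encode()` below are made up for illustration. Real tokenizers use much larger vocabularies and split rare words into subwords, which this sketch does not attempt.

```python
import re

# Hypothetical toy vocabulary (token -> ID), including the special tokens
vocab = {'<s>': 0, '</s>': 1, '<unk>': 2, 'i': 3, 'got': 4, 'money': 5,
         'from': 6, 'the': 7, 'bank': 8, '.': 9}

def tokenize(text):
    """Split a string into lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text):
    """Map a string to token IDs, adding the start and end tokens."""
    tokens = ['<s>'] + tokenize(text) + ['</s>']
    # Tokens that are not in the vocabulary get the ID of the unknown token
    return [vocab.get(tok, vocab['<unk>']) for tok in tokens]

print(tokenize('I got money from the bank.'))
print(encode('I got cash from the bank.'))
```

In the second call, the word 'cash' is not in the toy vocabulary, so it is mapped to the "unknown" token.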

## What are large language models?

Large language models are not only large (they often go beyond 1B parameters), but also a new generation of neural network models. They are based on a novel architecture, the **transformer**, published in 2017. An LLM can be used directly, or taken as a **pre-trained model** for transfer learning, which in this context is called **fine-tuning**. Transfer learning is very common in LLM's. The original pre-trained models are called **foundation models**, and the task for which they are fine-tuned is called the **downstream task**.

There are just a few relevant foundation models, but thousands of fine-tuned versions of them. You can find most of these in Hugging Face. The classic embedding model was **BERT** (Bidirectional Encoder Representations from Transformers), developed by Google, and the classic text generation model was **GPT** (Generative Pre-trained Transformer), developed by OpenAI. 

The models behind the chat apps (ChatGPT, Gemini, etc) are the text generation models. The input text is called the **prompt**. To get the best output, you have to carefully craft your prompts. This is called **prompt engineering**. Two simple examples: 

* To get a summary, the user includes in the prompt instructions such as *summarize the following text, in no more than 100 words* $\dots$. 

* To get a translation, he/she includes something like *translate to French the following text* $\dots$.

## What makes the transformers special?

The first segment of a transformer is always a tokenizer, specific for the model. The tokenizer comes with a **token dictionary** and a list of embedding vectors of a given length, one for each token. When we prompt a string, the string is split into a list of tokens, and the corresponding vectors are packed as a matrix, with a token embedding vector in each row. So, the input in a transformer is, properly speaking, a 2D array, not a text. This implies that, even if the original transformer was designed with an NLP perspective, in the transformer, as in any model based on a neural network architecture, the input and the output are tensors. As a matter of fact, transformers have also been applied in computer vision (they are a serious challenge to CNN models).
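A minimal sketch of this lookup step follows. The token dictionary, the embedding length and the random numbers standing in for learned embedding values are all assumptions made for illustration, not taken from any actual model.

```python
import random

random.seed(0)

# Hypothetical token dictionary (token -> row number in the embedding table)
token_dict = {'i': 0, 'got': 1, 'money': 2, 'from': 3, 'the': 4, 'bank': 5}
dim = 4   # embedding length (real models use hundreds of dimensions)

# One embedding vector per token; random numbers stand in for learned values
embedding_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in token_dict]

# The tokenized prompt is turned into a matrix, one row per token
tokens = ['i', 'got', 'money', 'from', 'the', 'bank']
X = [embedding_table[token_dict[tok]] for tok in tokens]

print(len(X), len(X[0]))   # number of tokens, embedding length
```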

The three differential components of the transformer are briefly explained below.

* At the beginning, a **positional embedding** is used to give the model information about the position of each token in the input text or, more specifically, the row number in the input 2D array. The positional embedding generates a different vector for every row. These vectors are added to the input token vector. This is a way of encoding the order of the tokens (and punctuation) in the text, which is one of the ingredients of the "meaning". 

* The mid part of the transformer is a combo of common **feedforward layers** (those that compose an MLP network) and **attention layers**. The attention layer is a transformation that replaces every token vector by a weighted average of all token vectors. The weights are based on the similarity among the token vectors. A simple example may help to understand the role of attention. Suppose that the word 'bank' appears in the sentence 'I got money from the bank'. The attention mechanism changes the vector that stands for 'bank' by pushing it toward the vector that stands for 'money'. So, 'bank' gets a more "financial" meaning. This is the way in which the attention mechanism captures the meaning of a word from the context.

* At the end, the output layer is the same as in the classification neural networks that we have seen in this course, but the predicted token is not always the one with the maximum probability. Instead it is randomly chosen according to the predicted class (token) probabilities. For instance, if the model predicts a probability 0.4 for a certain token, that token will be chosen 40% of the time. The softmax activation is also slightly modified, with an additional parameter called **temperature**. This controls the randomness of the prediction of the next token. In mathematical terms, the token probabilities are calculated as

$$p_i = \frac{\exp(z_i/t)}{\exp(z_1/t) + \cdots + \exp(z_m/t)}.$$

With $t = 1$, the next token probabilities are calculated with the ordinary softmax function. As $t$ approaches 0, the probability tends to 1 for the token for which $z_i$ is maximum, and to 0 for the rest (the formula itself is undefined at $t = 0$, so, in practice, that setting is treated as picking the token with the maximum $z_i$). So, there is no randomness at all, and, for a given input, the model will always predict the same token.
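The three components can be sketched in plain Python, under strong simplifications. All the numbers are made up, and the function names (`softmax()`, `add_positions()`, `attention()`) are hypothetical. In particular, the attention function below uses the dot product of the token vectors as the similarity, while real attention layers first apply learned transformations to those vectors, which are omitted here.

```python
import math
import random

def softmax(z, t=1.0):
    """Softmax with temperature t > 0."""
    scaled = [zi / t for zi in z]
    top = max(scaled)                      # subtracted for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def add_positions(X, P):
    """Add a positional vector to each token vector, row by row."""
    return [[x + p for x, p in zip(row, pos)] for row, pos in zip(X, P)]

def attention(X):
    """Replace every row by a similarity-weighted average of all rows."""
    out = []
    for query in X:
        scores = [sum(q * k for q, k in zip(query, key)) for key in X]
        weights = softmax(scores)
        out.append([sum(w * row[d] for w, row in zip(weights, X))
                    for d in range(len(query))])
    return out

# Made-up 2-dimensional vectors for three tokens, plus made-up positional vectors
X = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
P = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
H = attention(add_positions(X, P))

# Temperature: same scores, flatter (t = 2) or sharper (t = 0.1) probabilities
z = [2.0, 1.0, 0.0]
for t in (2.0, 1.0, 0.1):
    print(t, [round(p, 3) for p in softmax(z, t)])

# The next token is drawn at random according to the predicted probabilities
random.seed(0)
next_token = random.choices(range(len(z)), weights=softmax(z, t=1.0))[0]
```

With the low temperature, almost all the probability goes to the token with the highest score, so the random choice becomes practically deterministic.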

## Example - Tweet sentiment analysis

### Introduction

**Sentiment analysis** aims at capturing the writer's attitude towards a particular topic, product, etc. The simplest versions classify texts as Positive/Negative, or Neutral/Positive/Negative, while the most complex versions consider the many dimensions of the writer's attitude, such as joy, sadness, anger, or fear, and their intensity. 

**Large language models** (LLM's) have been applied to sentiment analysis since the very beginning, with Twitter/X providing training data sets. The data set of this example, available from Hugging Face, has been used many times for **fine-tuning** foundation models.

### The data set

The data set `sentiment.csv` contains 14,640 tweets from airlines' customers for sentiment analysis. The columns are:

* `text`, the text of the tweet.

* `label`, a label indicating the sentiment class: 0 = 'neutral', 1 = 'positive', and 2 = 'negative'.

* `idx`, a counter that can be used as an ID.

Source: Hugging Face.

### Questions

Q1. Encode the tweets provided with the model `all-minilm`, as in the preceding lecture.

Q2. Pack the embedding vectors generated and the labels as a features matrix and a target vector.

Q3. Train a **logistic regression model** on the training data set, and validate it on the validation data set.

Q4. LLM's can perform sentiment analysis by themselves, without extra post-training. Could the results of the logistic regression model be matched with that approach?

### Importing the data

We import the data from the GitHub repo, as usual.

In [1]:
import pandas as pd
path = 'https://raw.githubusercontent.com/lab30041954/Data/main/'
df = pd.read_csv(path + 'sentiment.csv')


Next, we explore the data set with the method `.info()`. The data are complete.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    14640 non-null  object
 1   label   14640 non-null  int64 
 2   idx     14640 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 343.3+ KB


The column `text` contains the text that has to be classified. Printing a few rows shows that some of the text entries may be challenging.

In [3]:
df.head()

                                                text  label  idx
0                @VirginAmerica What @dhepburn said.      0    0
1  @VirginAmerica plus you've added commercials t...      1    1
2  @VirginAmerica I didn't today... Must mean I n...      0    2
3  "@VirginAmerica it's really aggressive to blas...      2    3
4  @VirginAmerica and it's a really big bad thing...      2    4


Finally, we explore the potential **class imbalance**. The method `.value_counts()` shows that the class proportions vary from 62.7% (for the negatives) to 16.1% (for the positives). This level of imbalance could affect the ability of the model to deal with the two minority classes. This is not addressed in the analysis presented here, and is left for the homework.

In [4]:
df['label'].value_counts()/len(df)

label
2    0.626913
0    0.211680
1    0.161407
Name: count, dtype: float64

### Q1. Encoding the tweets

We encode the tweets in the same way as the news in the preceding lecture. First, we import the function `embed()`, from the `ollama` package.

In [5]:
from ollama import embed

Now, we write the text data as a list, which is needed by `embed()`.

In [6]:
text = df['text'].tolist()

We generate the embedding vectors with the same model as in the preceding lecture.

In [7]:
embeds = embed(model='all-minilm:33m', input=text).embeddings

### Q2. Packing

The embedding vectors are packed as the rows of a features matrix, while the column `label` provides the target vector. We check the shape of the features matrix, which is correct.

In [8]:
import numpy as np
y, X = df['label'], np.array(embeds)
X.shape

(14640, 384)

### Q3. Logistic regression model

We develop the logistic regression model just as we have done in other lectures. There is no evidence of overfitting, and the accuracy is about 83%, quite respectable for this type of data. Anyway, don't forget that the accuracy may be poorer in the minority classes.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression()
clf.fit(X_train, y_train)
round(clf.score(X_train, y_train), 3), round(clf.score(X_test, y_test), 3)

(0.836, 0.829)

### Q4. Text generation model 

Pulling a text generation model from the Ollama platform works the same as pulling an embedding model. You can do it in the shell or in a Jupyter app. In the latter case, include the exclamation mark (`!`) to warn the app that the command is a shell command, not a Python command (`! ollama pull gemma3n:e4b`).

For a text generation model, the appropriate `ollama` function is `chat()`:

In [10]:
from ollama import chat

The inputs to the model are given as a list of messages. Each message is a **JSON object**, which in Python is managed as a dictionary, with (at least) two keys:

* The **role**, which is 'user' for the inputs and 'assistant' for the outputs. 

* The **content**, which, for an input, is what we usually call a **prompt**.

Let us illustrate this with the first tweet from our data set.

In [17]:
prompt = '''
You are an AI assistant whose task is to perform a sentiment analysis of tweets related to US airline companies.
You are expected to assign to the input text one of the following three labels: 0 = neutral, 1 = positive, 2 = negative.
Respond with a single numeric label, without providing an explanation.
text: @VirginAmerica What @dhepburn said.
label: 
'''
input = [{'role': 'user', 'content': prompt}]

The function `chat()` has (at least) two arguments, for the model and the messages, respectively. For a first approach, we use the model `gemma3n:e4b`. Google's **Gemma 3n models** are designed for efficient execution on everyday devices such as laptops, tablets or phones. Mind that this model takes 7.5 GB, so it may run slowly if you are short of RAM.

In [18]:
resp = chat(model='gemma3n:e4b', messages= input)

The object returned by the function `chat()` contains the response of the model (a string) plus metadata, not discussed here. The response can be extracted by applying `.message.content`.

In [19]:
resp.message.content

'0\n'

The tweet has been classified as neutral. So, it seems that this approach may work. We can get it cleaner by converting the response to an integer.

In [20]:
int(resp.message.content)

0

To apply this approach massively, to a whole tweet collection, we separate the instruction from the text to be classified.

In [15]:
instruction = '''
You are an AI assistant whose task is to assign to the input text one of the following three labels: 0 = neutral, 1 = positive, 2 = negative.
Respond with a single numeric label, without providing an explanation.
'''

Next, we create an appropriate function, which we call `sentiment()`, for this job.

In [16]:
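def sentiment(text):
    # Build the full prompt: the instruction first, then the tweet to classify
    prompt = instruction + '\n\n' + text
    messages = [{'role': 'user', 'content': prompt}]
    resp = chat(model='gemma3n:e4b', messages=messages)
    # int() ignores surrounding whitespace, so '0\n' becomes 0
    label = int(resp.message.content)
    return label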

Let us check it:

In [21]:
sentiment('@VirginAmerica What @dhepburn said.')

0
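A direct conversion with `int()` raises an error whenever the model does not follow the instruction to the letter, for instance by adding a word to the label. A more defensive parsing step could look like the sketch below. The helper `parse_label()` is hypothetical, not part of the `ollama` package, and returning `None` for unusable responses is just one possible design choice.

```python
import re

def parse_label(response, valid=(0, 1, 2)):
    """Return the first number in the response if it is a valid label, else None."""
    match = re.search(r'\d+', response)
    if match and int(match.group()) in valid:
        return int(match.group())
    return None

print(parse_label('0\n'))
print(parse_label('Label: 2 (negative)'))
print(parse_label('I cannot tell.'))
```

The unusable responses can then be counted or dropped before evaluating the predictions, instead of stopping a long run halfway.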

As a hint of more serious testing, we apply this function to a random selection of 100 data units. Again, mind that this can be slow if you are short of RAM.

In [24]:
df1 = df.sample(n=100)
y1 = df1['label']
y1_pred = df1['text'].apply(sentiment)

Now, the confusion matrix tells us how good our approach is with this small sample.

In [23]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y1, y1_pred)

array([[15,  6,  3],
       [ 1, 21,  1],
       [ 3,  1, 49]])

So far, the accuracy is 85%. Also, note that the model seems to fail more often with the neutral messages. We suggest some ways to continue the analysis in the homework.
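Both statements can be checked directly on the confusion matrix printed above, in which the rows correspond to the actual labels and the columns to the predicted labels. The sketch below copies the matrix by hand. The variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm = [[15, 6, 3],
      [1, 21, 1],
      [3, 1, 49]]
labels = ['neutral', 'positive', 'negative']

# Overall accuracy: correct predictions (diagonal) over the total
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total
print(f'accuracy = {accuracy:.3f}')

# Proportion of tweets of each actual class that are correctly classified
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
for name, r in zip(labels, recall):
    print(f'{name}: {r:.3f}')
```

The neutral class gets 15 out of 24 right (62.5%), well below the other two classes, which are both above 90%.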

### Homework

1. Use a confusion matrix to examine how homogeneous the accuracy of the logistic model is across the three classes. What do you think?

2. No attention was paid in this lecture to the obvious **class imbalance** of the data sets. Can you correct what you found in the previous question by either **undersampling** or **oversampling** the training data set?

3. Can you get a better logistic regression model by replacing the embedding model `all-minilm:33m` by a bigger one (*e.g*. the model `granite-embedding:278m` suggested in the preceding lecture)?

4. Can you get better results by replacing the logistic regression model by an MLP model with one hidden layer?

5. Can you get better results with a bigger text generation model? If you have enough RAM available (16 GB), you can explore the use of OpenAI's `gpt-oss:20b`, or Alibaba's `qwen3:14b` (mind that these are thinking models, so they will be slower and more verbose). In a different direction, you can also explore something smaller, such as Google's `gemma3:4b` or Meta's `llama3.2:3b`.