# Exercise 3: Language models and large language models 

Attribution: Kolhatkar, Varada (2024) [DSCI 575](https://ubc-mds.github.io/DSCI_575_adv-mach-learn/README.html)

## Imports <a name="im"></a>

In [None]:
import os
import re
import sys
from collections import Counter, defaultdict
from urllib.request import urlopen
from hashlib import sha1

import numpy as np
import numpy.random as npr
import pandas as pd

from transformers import pipeline, AutoTokenizer

pd.set_option('display.max_colwidth', 200)

## Getting Started with Kaggle Kernels
<hr>

We are going to run this notebook on the cloud using [Kaggle](https://www.kaggle.com). To get started, follow these steps:

1. Go to https://www.kaggle.com/kernels
2. Make an account if you don't have one, and verify your phone number (to get access to GPUs)
3. Select `+ New Notebook`
4. Go to `File -> Import Notebook`
5. Upload this notebook
6. On the right-hand side of your Kaggle notebook, make sure `Internet` is enabled.

Once you've done all your work on Kaggle, you can download the notebook from Kaggle. That way any work you did on Kaggle won't be lost. 

## Exercise 1: Text Generation using Markov Models

### Character-based Markov model of language
<hr>

In this exercise, you will use a class `MarkovModel` to `fit` an n-gram model of language and `generate` text from it.

The starter code below uses the hyperparameter `n`, which represents the state of the Markov model as the last `n` characters of a given string. In Exercise 1, we explored `n=1` (bigram model), where each state of the Markov model was a single character, and the generation of each character was dependent only on the previous character. We aim to incorporate more context to produce more intelligible text. For instance, with `n=3`, the probability distribution for the next character will be based on the preceding three characters. 

> Note that `n` in the term n-gram does not exactly correspond to the variable `n` in our implementation below. Instead `n` refers to the number of previous time steps to use as a context. For 2-gram (bigram) the value of the variable `n` is 1 which means considering one previous time step as context. For 4-gram the value of `n` is 3 which means considering three previous time steps as context.    
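To make the distinction concrete, here is a minimal sketch that lists each context of length `n` together with the character that follows it. The helper name `context_pairs` and the toy string are illustrative assumptions, not part of the starter code, and unlike the `fit` method below this sketch does not wrap around the end of the text.

```python
def context_pairs(text, n):
    """Return (context, next_char) pairs, where each context is n characters long."""
    return [(text[i : i + n], text[i + n]) for i in range(len(text) - n)]


# n = 1 corresponds to a bigram model: one character of context.
print(context_pairs("soup", 1))

# n = 3 corresponds to a 4-gram model: three characters of context.
print(context_pairs("soup", 3))
```

With `n = 1` every pair covers two characters in total (a bigram), and with `n = 3` every pair covers four characters (a 4-gram).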

To train our model, we record every occurrence of each n-gram and the subsequent character. Then, similar to the approach in naive Bayes, we normalize these counts to probabilities for each n-gram. The `fit` function implements these steps. 
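The count-then-normalize idea can be sketched on a tiny example. The toy string and the variable names below are assumptions chosen for illustration; the `fit` method in the starter code does the same thing on the recipe corpus, with the extra step of making the text circular.

```python
from collections import Counter, defaultdict

toy_text = "abababac"  # illustrative toy corpus, not the recipe data
n = 1

# Count how often each character follows each context of length n.
counts = defaultdict(Counter)
for i in range(len(toy_text) - n):
    counts[toy_text[i : i + n]][toy_text[i + n]] += 1

# Normalize each context's counts so that they sum to 1.
probabilities = {
    context: {ch: c / sum(followers.values()) for ch, c in followers.items()}
    for context, followers in counts.items()
}
print(probabilities)
```

Here the context `"a"` is followed by `"b"` three times and by `"c"` once, so its distribution puts 0.75 on `"b"` and 0.25 on `"c"`.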

To generate a new sequence, we start with an initial seed of length at least `n`. You can pass this seed explicitly when you call the `generate` method. By default, we use the first `n` characters of the training text as the seed; these are saved in the `fit` function. Then, for the current n-gram, we look up the probability distribution over next characters and sample a character according to this distribution. We repeat this step with the updated n-gram until the sequence reaches the desired length.
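The sampling loop can be sketched in isolation. The transition table `toy_probs` below is hand-written for illustration (it is not learned from data), and the function name `toy_generate` is hypothetical. The sketch uses Python's built-in `random.choices` to stay dependency-free, whereas the starter code uses `np.random.choice`.

```python
import random

# Hypothetical hand-written transition table for n = 1 (not learned from data).
toy_probs = {
    "a": {"symbols": ["b", "c"], "probs": [0.75, 0.25]},
    "b": {"symbols": ["a"], "probs": [1.0]},
    "c": {"symbols": ["a"], "probs": [1.0]},
}


def toy_generate(probs, seed, seq_len, n=1, rng=None):
    """Extend seed one character at a time until it has seq_len characters."""
    rng = rng or random.Random()
    s = seed
    while len(s) < seq_len:
        entry = probs[s[-n:]]  # distribution for the current context
        s += rng.choices(entry["symbols"], weights=entry["probs"], k=1)[0]
    return s


sample = toy_generate(toy_probs, seed="a", seq_len=10, rng=random.Random(0))
print(sample)
```

Passing a seeded `random.Random` instance makes the output reproducible, which helps when comparing models with different values of `n`.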

Attribution: assignment adapted with permission from Princeton COS 126, [_Markov Model of Natural Language_](http://www.cs.princeton.edu/courses/archive/fall15/cos126/assignments/markov.html). The original assignment was developed by Bob Sedgewick and Kevin Wayne. If you are interested in more background info, you can take a look at the original version. The original paper by Shannon, [A Mathematical Theory of Communication](http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf), essentially created the field of information theory and is one of the best scientific papers ever written (in terms of both impact and readability).

To use the recipe [dataset](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions), follow these steps:

1. Click `+ Add Input` at the top right of the notebook. 

2. Select "Datasets" and search for 'food-com-recipes-and-user-interactions.' Several datasets will appear. Locate and add the one with a size of 280MB.

3. Run the following cells to prepare the data and set up model training.

In [None]:
class MarkovModel:
    def __init__(self, n):
        """
        Initialize the Markov model object.

        Parameters:
        ----------
        n : int
            the number of previous characters used as context (the state length)
        """
        self.n = n
        self.probabilities_ = None
        self.frequencies_ = None
        self.starting_chars = None

    def fit(self, text):
        """
        Fit a Markov model and create a transition matrix.

        Parameters
        ----------
        text : str
            a corpus of text
        """

        # Store the first n characters of the training text, as we will use these
        # to seed our `generate` function
        self.starting_chars = text[: self.n]

        # Make text circular so markov chain doesn't get stuck
        circ_text = text + text[: self.n]

        # Step 1: Compute frequencies
        # count the number of occurrences of each letter following a given n-gram
        frequencies = defaultdict(Counter)
        for i in range(len(text)):
            frequencies[circ_text[i : i + self.n]][circ_text[i + self.n]] += 1.0

        # Step 2: Normalize the frequencies into probabilities
        self.probabilities_ = defaultdict(dict)
        for ngram, counts in frequencies.items():
            self.probabilities_[ngram]["symbols"] = list(counts.keys())
            probs = np.array(list(counts.values()))
            probs /= np.sum(probs)
            self.probabilities_[ngram]["probs"] = probs

        self.frequencies_ = frequencies  # you never know when this might come in handy
    
    def generate(self, seq_len, seed=""):
        """
        Generate a sequence of length seq_len using the transition
        probabilities computed in the fit method.

        Parameters
        ----------
        seq_len : int
            the desired length of the sequence
        seed : str
            the seed for text generation; if empty, self.starting_chars is used

        Returns
        -------
        str
            the generated sequence
        """
        if not seed:
            s = self.starting_chars
        else:
            s = seed

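        while len(s) < seq_len:
            current_ngram = s[-self.n :]
            # Fail with a clear message if the context never occurred in training
            # (for example, a seed that is too short or contains unseen text)
            if current_ngram not in self.probabilities_:
                raise ValueError(f"n-gram {current_ngram!r} was not seen in the training text")
            probs = self.probabilities_[current_ngram]
            s += np.random.choice(probs["symbols"], p=probs["probs"])
        return s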

In [None]:
# Set up data
recipes_file = "/kaggle/input/food-com-recipes-and-user-interactions/RAW_recipes.csv"
orig_recipes_df = pd.read_csv(recipes_file)
orig_recipes_df = orig_recipes_df.dropna()
corpus = "\n".join(orig_recipes_df["name"].tolist())
print(f"Corpus length: {len(corpus)}")
print(f"Corpus sample: {corpus[:100]}")

### Novel recipe name generation

**Your tasks:**

In this exercise, you will train Markov models on the recipe titles in the `corpus` variable above for different values of `n` in the range 1 to 10. For each value of `n`, show generated text of at least `100` characters by running the following cells.

Feel free to select any `n` and `seed` of your choice. Please note that the length of your `seed` should be at least `n`.
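Before generating, it can help to check that a seed is usable. The sketch below is a conservative check under two assumptions: the helper name `seed_is_usable` is hypothetical, and `toy_corpus` is a stand-in for the real `corpus` variable. It tests that the seed has at least `n` characters and that its last `n` characters occur somewhere in the training text, since the model only has a distribution for contexts it has seen.

```python
def seed_is_usable(seed, n, corpus):
    """Check that the seed is long enough and that its last n characters occur in the corpus."""
    return len(seed) >= n and seed[-n:] in corpus


toy_corpus = "egg salad\negg drop soup"  # illustrative stand-in for `corpus`

print(seed_is_usable("egg", 3, toy_corpus))  # long enough and seen in the text
print(seed_is_usable("eg", 3, toy_corpus))   # shorter than n
print(seed_is_usable("xyz", 3, toy_corpus))  # never appears in the text
```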

In [None]:
# Settings
n = 3
num_characters = 100
seed = 'egg'

char_model = MarkovModel(n=n)
char_model.fit(corpus)
print(f"Text generated with n = {n}:")
print(f"Generate recipe titles: \n{char_model.generate(num_characters, seed=seed)}")

### Exercise 1.1 

<div class="alert alert-info">

**Discussion questions**

1. How does the value of `n` affect the quality of generated recipe titles?
2. Experiment with different seeds. How does the choice of seed affect the generated text?
3. What changes would be required to adapt the model from character-based generation to word-based generation?

</div>

<div class="alert alert-warning">

Type your answer below.
    
</div>

_Type your answer here, replacing this text._

## Exercise 2: Sentiment Analysis with LLMs
<hr>

In this exercise, you'll apply a large language model (LLM) to a real-life use case. Specifically, you'll perform sentiment analysis on a Kaggle dataset [on emotion analysis based on text](https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset) without training any machine learning model of your own.

To use the dataset, follow these steps:

1. Click `+ Add Input` at the top right of the notebook and select "Datasets".

2. Search for 'emotion-analysis-based-on-text'. Several datasets will appear. Locate and add the one with a size of 33MB.

3. Run the following cell to prepare the data.

In [None]:
# Set up data
text_data_file = "/kaggle/input/emotion-analysis-based-on-text/emotion_sentimen_dataset.csv"
text_df = pd.read_csv(text_data_file, index_col=0)
text = text_df['text'].tolist()
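# Display the data without the true `Emotion` labels.
# `drop` returns a copy, so `text_df` itself is unchanged.
text_df.drop(columns=['Emotion'])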

### Predicting sentiment (binary classification)

Let's predict whether a given text is positive or negative for a sample using a pre-trained large language model (LLM).

In [None]:
# Set up model
model_name = "facebook/bart-large-mnli"
classifier = pipeline("zero-shot-classification", model=model_name)
candidate_labels = ["positive", "negative"]

def sentiment_analyzer(classifier, text, candidate_labels):
    results = classifier(text, candidate_labels)
    return pd.DataFrame(results)

In [None]:
results = sentiment_analyzer(classifier, text[100:120], candidate_labels)
results

Observe the output above. The labels are sorted based on their corresponding scores. A higher score in the scores column indicates that the model predicts a higher probability of the text belonging to the corresponding sentiment label.
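To pull out only the top prediction for each text, you can pair the labels with their scores. The sketch below runs on a hand-written stand-in for the pipeline output, so the sentences and scores are made up; it assumes each result is a dictionary with `sequence`, `labels`, and `scores` keys, which is the shape the zero-shot classification pipeline returns. The helper name `top_prediction` is hypothetical.

```python
# Hand-written stand-in for the pipeline output; the scores are made up.
mock_results = [
    {"sequence": "I love this!", "labels": ["positive", "negative"], "scores": [0.97, 0.03]},
    {"sequence": "This is awful.", "labels": ["negative", "positive"], "scores": [0.91, 0.09]},
]


def top_prediction(result):
    """Return the highest-scoring (label, score) pair for one result."""
    # Pair labels with scores and take the max, so this does not rely on sort order.
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])


for r in mock_results:
    label, score = top_prediction(r)
    print(f"{r['sequence']!r}: {label} ({score:.2f})")
```

The same helper could be applied to the real output by calling `classifier(text[100:120], candidate_labels)` directly and looping over the returned list.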

### Exercise 2.1 

<div class="alert alert-info">

**Discussion questions**

1. To what extent do you agree with the model's predictions?
2. Do you find the corresponding scores helpful?   
</div>

<div class="alert alert-warning">

Type your answer below.
    
</div>

_Type your answer here, replacing this text._

### Predicting emotions (multiclass classification)

It’s impressive that we were able to perform sentiment analysis without having to build a machine learning model from scratch.

But what if our goal isn’t sentiment analysis, but emotion classification instead? For example, we might want to assign emotion labels like “joy,” “anger,” “fear,” or “surprise” to a given text.

In this exercise, you’ll use a pre-trained LLM to classify the emotions expressed in text into different categories based on their content. To start, we’ll focus on four emotions: sadness, joy, anger, and fear. Feel free to experiment by adding more emotions to the list below.

In [None]:
candidate_labels = ["sadness", "joy", "anger", "fear"]

results = sentiment_analyzer(classifier, text[100:120], candidate_labels)
results

### Exercise 2.2

<div class="alert alert-info">

**Discussion questions**

1. To what extent do you agree with the model's predictions?
2. Do you find the corresponding scores helpful?   
</div>

<div class="alert alert-warning">

Type your answer below.
    
</div>

_Type your answer here, replacing this text._

### Your Free Time (Optional)

**Your tasks**:

You can use any text dataset you like and customize the labels by replacing the items marked with `...` to build your own classification model!

Feel free to share your ideas and insights with your teammates and the teaching team.

In [None]:
candidate_labels = ...
text = ...

results = sentiment_analyzer(classifier, text, candidate_labels)
results