<a href="https://colab.research.google.com/github/manavmittal05/Large-Language-Models/blob/main/A1/Q2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers huggingface_hub
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install bitsandbytes

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-cjvtg9v5
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-cjvtg9v5
  Resolved https://github.com/huggingface/transformers.git to commit eb5b968c5d80271ecb29917dffecc8f4c00247a8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/accelerate.git
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-pu2m_yoq
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-pu2m_yoq
  Resolved https://github.com/huggingface/accelerate.git to commit 3fcc9461c4fcb7228df5e5246809ba09cfbb232e
  Installing build dependencies ... don

In [2]:
import torch
from transformers import AutoModel, AutoTokenizer
import pandas as pd
from huggingface_hub import login
from google.colab import userdata

In [3]:
df = pd.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")
df.dropna(inplace=True)

In [4]:
# columns we need - album_name, track_name, artists, popularity, track_genre
df = df[["album_name", "track_name", "artists", "popularity", "track_genre"]]

# keep only the 5 most frequent genres
top_genres = df["track_genre"].value_counts().head(5).index
df = df[df["track_genre"].isin(top_genres)]

unique_genres = df["track_genre"].unique()
num_genres = len(unique_genres)
unique_genres, num_genres

# randomly sample 40 tracks from each genre
df = df.groupby("track_genre").apply(lambda x: x.sample(40)).reset_index(drop=True)


# create a new column containing prompt - "tell me about the genre and popularity score of the song - track name: {track_name} | album name: {album_name} | artists: {artists}"
df["track_info"] = df.apply(lambda x: f"tell me about the genre and popularity score of the song - track name: {x['track_name']} | album name: {x['album_name']} | artists: {x['artists']}", axis=1)

df.head()

Unnamed: 0,album_name,track_name,artists,popularity,track_genre,track_info
0,When Marnie Was There Song Album - Just Know T...,I Am Not Alone,Priscilla Ahn,46,acoustic,tell me about the genre and popularity score o...
1,No Matter Where You Are,No Matter Where You Are,Us The Duo,55,acoustic,tell me about the genre and popularity score o...
2,The Way It Was,Kiss Me Slowly,Parachute,61,acoustic,tell me about the genre and popularity score o...
3,Kaleidoscope Heart,King of Anything,Sara Bareilles,61,acoustic,tell me about the genre and popularity score o...
4,Alternative Christmas 2022,I Heard The Bells On Christmas Day,The Civil Wars,0,acoustic,tell me about the genre and popularity score o...


In [5]:
login(userdata.get('HF_TOKEN'))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [7]:
# Load the LLaMA 3 model and tokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModel.from_pretrained(model_id, output_hidden_states=True, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

In [3]:
# For every track_info prompt, extract final-token embeddings at three depths and store them
# in new columns: first_layer_embedding, middle_layer_embedding, last_layer_embedding
def get_embeddings(track_info):
    inputs = tokenizer(track_info, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding output; hidden_states[i] is the output of layer i
    hidden_states = outputs.hidden_states
    first_layer_embeddings = hidden_states[1]
    middle_layer_index = len(hidden_states) // 2
    middle_layer_embeddings = hidden_states[middle_layer_index]
    last_layer_embeddings = hidden_states[-1]
    # keep only the final token, which has attended to the whole prompt
    final_token_first_layer = first_layer_embeddings[:, -1, :]
    final_token_middle_layer = middle_layer_embeddings[:, -1, :]
    final_token_last_layer = last_layer_embeddings[:, -1, :]
    return final_token_first_layer, final_token_middle_layer, final_token_last_layer

# reuse cached embeddings if spotify_tracks_embeddings.pkl exists; otherwise compute and cache them
try:
    df = pd.read_pickle("spotify_tracks_embeddings.pkl")
except FileNotFoundError:
    df[["first_layer_embedding", "middle_layer_embedding", "last_layer_embedding"]] = df["track_info"].apply(get_embeddings).apply(pd.Series)
    # save the dataframe to a pickle file
    df.to_pickle("spotify_tracks_embeddings.pkl")
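
The layer selection in `get_embeddings` can be illustrated without loading the model. The sketch below uses random NumPy arrays as stand-ins for `outputs.hidden_states`. The sizes (32 transformer blocks, hidden width 4096) match Llama 3 8B, while `pick_layers` and `fake_hidden_states` are hypothetical names used only for this illustration.

```python
import numpy as np

def pick_layers(hidden_states):
    """Return final-token vectors for the first, middle, and last layer.

    hidden_states is a tuple of arrays shaped (batch, seq_len, hidden):
    index 0 is the embedding output, index i is the output of block i.
    """
    middle = len(hidden_states) // 2
    chosen = (hidden_states[1], hidden_states[middle], hidden_states[-1])
    # [:, -1, :] keeps only the last token of each sequence
    return tuple(h[:, -1, :] for h in chosen), middle

rng = np.random.default_rng(0)
num_layers, seq_len, hidden = 32, 7, 4096  # seq_len is arbitrary here
fake_hidden_states = tuple(
    rng.standard_normal((1, seq_len, hidden)).astype(np.float16)
    for _ in range(num_layers + 1)  # +1 for the embedding output
)

vectors, middle = pick_layers(fake_hidden_states)
print(middle)                      # index of the "middle" layer
print([v.shape for v in vectors])  # one (1, hidden) vector per chosen layer
```

Because the model is causal, the final token is the only position that has attended to the whole prompt, which is why it serves as the sequence summary here.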

In [4]:
df.head()

Unnamed: 0,album_name,track_name,artists,popularity,track_genre,track_info,first_layer_embedding,middle_layer_embedding,last_layer_embedding
0,When Marnie Was There Song Album - Just Know T...,I Am Not Alone,Priscilla Ahn,46,acoustic,tell me about the genre and popularity score o...,"[[tensor(0.0078, dtype=torch.float16), tensor(...","[[tensor(-0.0496, dtype=torch.float16), tensor...","[[tensor(-1.2578, dtype=torch.float16), tensor..."
1,No Matter Where You Are,No Matter Where You Are,Us The Duo,55,acoustic,tell me about the genre and popularity score o...,"[[tensor(0.0237, dtype=torch.float16), tensor(...","[[tensor(-0.0155, dtype=torch.float16), tensor...","[[tensor(-3.4727, dtype=torch.float16), tensor..."
2,The Way It Was,Kiss Me Slowly,Parachute,61,acoustic,tell me about the genre and popularity score o...,"[[tensor(-0.0032, dtype=torch.float16), tensor...","[[tensor(0.0364, dtype=torch.float16), tensor(...","[[tensor(-2.4414, dtype=torch.float16), tensor..."
3,Kaleidoscope Heart,King of Anything,Sara Bareilles,61,acoustic,tell me about the genre and popularity score o...,"[[tensor(-0.0144, dtype=torch.float16), tensor...","[[tensor(0.1203, dtype=torch.float16), tensor(...","[[tensor(-1.6094, dtype=torch.float16), tensor..."
4,Alternative Christmas 2022,I Heard The Bells On Christmas Day,The Civil Wars,0,acoustic,tell me about the genre and popularity score o...,"[[tensor(-0.0132, dtype=torch.float16), tensor...","[[tensor(-0.0587, dtype=torch.float16), tensor...","[[tensor(-2.3691, dtype=torch.float16), tensor..."


In [5]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# define the features and target
features = ["first_layer_embedding", "middle_layer_embedding", "last_layer_embedding"]
target = "popularity"

# define the models
models = {
    "first_layer_embedding": LinearRegression(),
    "middle_layer_embedding": LinearRegression(),
    "last_layer_embedding": LinearRegression(),
}

# train one probe per layer; the loop variable is `reg` so it does not overwrite the LLM stored in `model`
for layer, reg in models.items():

    # flatten each (1, hidden_size) embedding into a single feature row
    X = pd.DataFrame(df[layer].apply(lambda x: x.reshape(-1)).tolist())
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    y_train_pred = reg.predict(X_train)
    mse = mean_squared_error(y_test, y_pred)
    # print training error
    print(f"{layer} Training MSE: {mean_squared_error(y_train, y_train_pred)}")
    # print test error
    print(f"{layer} Test MSE: {mse}")

first_layer_embedding Training MSE: 4.4467268811967e-25
first_layer_embedding Test MSE: 4778.288728600571
middle_layer_embedding Training MSE: 1.0223085129030658e-26
middle_layer_embedding Test MSE: 324.91917257035783
last_layer_embedding Training MSE: 1.085693486637574e-26
last_layer_embedding Test MSE: 230.72164799163102


In [7]:
# train classifier to predict genre from embeddings, do this for all 3 layers
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings('ignore', category=ConvergenceWarning)

# define the features and target
features = ["first_layer_embedding", "middle_layer_embedding", "last_layer_embedding"]
target = "track_genre"

# define the models
models = {
    "first_layer_embedding": LogisticRegression(),
    "middle_layer_embedding": LogisticRegression(),
    "last_layer_embedding": LogisticRegression(),
}

# train one classifier per layer; the loop variable is `clf` so it does not overwrite the LLM stored in `model`
for layer, clf in models.items():

    # flatten each (1, hidden_size) embedding into a single feature row
    X = pd.DataFrame(df[layer].apply(lambda x: x.reshape(-1)).tolist())
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    acc = accuracy_score(y_test, y_pred)
    # print training accuracy
    print(f"{layer} Training Accuracy: {accuracy_score(y_train, y_pred_train)}")
    # print test accuracy
    print(f"{layer} Test Accuracy: {acc}")

first_layer_embedding Training Accuracy: 0.98125
first_layer_embedding Test Accuracy: 0.625
middle_layer_embedding Training Accuracy: 1.0
middle_layer_embedding Test Accuracy: 0.925
last_layer_embedding Training Accuracy: 1.0
last_layer_embedding Test Accuracy: 0.975


# Evaluation and Probing Results

## Linear Regression Head
### First Layer Embeddings
* Training MSE - 4.4467268811967e-25
* Testing MSE - 4778.288728600571
### Middle Layer Embeddings
* Training MSE - 1.0223085129030658e-26
* Testing MSE - 324.91917257035783
### Last Layer Embeddings
* Training MSE - 1.085693486637574e-26
* Testing MSE - 230.72164799163102

### Analysis
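* Training MSE is effectively zero at every layer (on the order of 1e-25 to 1e-26). With 160 training rows and 4096-dimensional embeddings, ordinary least squares can fit the training set exactly, so the training error does not distinguish the layers. Only the test MSE is informative.
* Test MSE decreases with depth: about 4778 at the first layer, 325 at the middle layer, and 231 at the last layer.
* Popularity is therefore more linearly decodable from deeper layers. In absolute terms, the last-layer probe has a test RMSE of about 15 points on the 0-100 popularity scale, while the first-layer probe (RMSE about 69) is worse than always predicting 50, whose MSE cannot exceed 2500.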
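
The near-zero training MSE values reported above are what unregularized least squares produces whenever there are more features than training rows, regardless of the data. As a minimal sketch of that effect, the example below fits pure random noise with the same shapes as the probe setup (160 training rows, 40 test rows, 4096 features). The data is synthetic and `np.linalg.lstsq` stands in for `LinearRegression`; no intercept is fitted, for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 160, 40, 4096  # sizes matching the probe setup

# Pure noise: the targets have no relationship to the features
X_train = rng.standard_normal((n_train, n_features))
y_train = rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, n_features))
y_test = rng.standard_normal(n_test)

# Minimum-norm least-squares solution
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ w - y_train) ** 2)
test_mse = np.mean((X_test @ w - y_test) ** 2)
print(f"train MSE: {train_mse:.3e}")  # numerically zero even though y is noise
print(f"test MSE:  {test_mse:.3f}")   # no better than guessing
```

The training fit is exact even though there is nothing to learn, so a training MSE near zero should be read as a property of the setup, not of the embeddings.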

## Classification Head
### First Layer Embeddings
* Training Accuracy - 0.98125
* Testing Accuracy - 0.625
### Middle Layer Embeddings
* Training Accuracy - 1.0
* Testing Accuracy - 0.925
### Last Layer Embeddings
* Training Accuracy - 1.0
* Testing Accuracy - 0.975

### Analysis
The trend matches the one seen for the linear regression head: held-out performance improves with depth.
* Training accuracy is at or near 1.0 for every layer (0.98125, 1.0, 1.0), so it does not separate the layers. Test accuracy does: 0.625 at the first layer, 0.925 at the middle layer, and 0.975 at the last layer.
* With 5 genres sampled at 40 tracks each, chance accuracy is about 0.2, so even the first-layer probe is well above chance.
* The test split holds 40 tracks, so these accuracies correspond to 25, 37, and 39 correct predictions. The gap between the middle and last layers is two tracks.
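
Since the test split contains 40 tracks (20% of the 200 sampled), each accuracy above carries noticeable sampling uncertainty. A rough way to quantify it is a Wilson score interval, sketched below. The helper `wilson_interval` is a hypothetical name, and the 95% level (`z=1.96`) is an assumption chosen for illustration.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

n_test = 40  # 20% of 200 sampled tracks
for layer, acc in [("first", 0.625), ("middle", 0.925), ("last", 0.975)]:
    correct = round(acc * n_test)
    low, high = wilson_interval(correct, n_test)
    print(f"{layer:>6}: {correct}/{n_test} correct, 95% interval {low:.2f} to {high:.2f}")
```

With this few test tracks the middle-layer and last-layer intervals overlap, so the ordering of those two layers is suggestive rather than established.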

# Discussion
## Findings
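* Linear probes on the LLM's embeddings predict genre well: the last-layer classifier reaches 0.975 test accuracy on 40 held-out tracks. Popularity is predicted less precisely: the best test MSE of about 231 corresponds to an RMSE of roughly 15 points on the 0-100 scale.
* The model is given only the track name, the name of the album the song belongs to, and the artist name. The genre and popularity signal recovered by the probes comes from these three fields alone.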
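
To judge how good a popularity MSE is, it helps to compare it with a trivial baseline that always predicts the mean popularity of the training set. The sketch below shows the comparison. The function name `compare_to_mean_baseline` is hypothetical, and every number in the example call is a placeholder, not a result from this notebook.

```python
import math

def compare_to_mean_baseline(y_train, y_test, model_test_mse):
    """Compare a model's test MSE against always predicting the training mean."""
    mean_pred = sum(y_train) / len(y_train)
    baseline_mse = sum((y - mean_pred) ** 2 for y in y_test) / len(y_test)
    return {
        "baseline_mse": baseline_mse,
        "model_rmse": math.sqrt(model_test_mse),
        # fraction of baseline error removed; <= 0 means no better than the mean
        "skill": 1 - model_test_mse / baseline_mse,
    }

# Placeholder values for illustration only (not taken from the dataset or the results)
y_train = [46, 55, 61, 61, 0, 30, 72, 18]
y_test = [50, 10, 65, 40]
report = compare_to_mean_baseline(y_train, y_test, model_test_mse=200.0)
print(report)
```

In the notebook, `y_train` and `y_test` would be the popularity splits from the regression cell and `model_test_mse` the test MSE of a given layer's probe.
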
## Patterns
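* Across the three layers probed (first, middle, last), held-out performance improves with depth for both tasks. This suggests that genre and popularity information becomes more linearly accessible in deeper layers. Because only three layers were probed on a single 80/20 split of 200 tracks, this is a trend, not evidence about every individual layer.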
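
One way to make the depth comparison less sensitive to a single train/test split is to score each layer with k-fold cross-validation and a regularized probe. The sketch below is one possible setup, not the method used in this notebook: it swaps `LinearRegression` for closed-form ridge regression written in NumPy, runs on synthetic data, and the names `ridge_fit_predict` and `cross_val_mse` as well as `alpha=10.0` and 5 folds are assumptions.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha):
    """Closed-form ridge regression in dual form (cheap when rows << columns)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    # dual solution: w = Xc^T (Xc Xc^T + alpha I)^-1 (y - y_mean)
    gram = Xc @ Xc.T + alpha * np.eye(len(Xc))
    dual = np.linalg.solve(gram, y_train - y_mean)
    return (X_test - x_mean) @ (Xc.T @ dual) + y_mean

def cross_val_mse(X, y, alpha=10.0, n_folds=5, seed=0):
    """Mean and standard deviation of test MSE over shuffled k folds."""
    order = np.random.default_rng(seed).permutation(len(X))
    scores = []
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)  # every row not in the held-out fold
        pred = ridge_fit_predict(X[train], y[train], X[fold], alpha)
        scores.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in for one layer's embeddings: 200 rows, 512 columns
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))
y = X[:, :5].sum(axis=1) + 0.1 * rng.standard_normal(200)
mean_mse, std_mse = cross_val_mse(X, y)
print(f"5-fold test MSE: {mean_mse:.3f} +/- {std_mse:.3f}")
```

Running `cross_val_mse` once per layer would give a mean and spread for each, which makes it easier to tell a real difference between layers from split-to-split noise.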