# **GPT Model**

- GPT is a family of pre-trained transformer language models for natural language understanding and generation; this notebook uses GPT-2.
- For recommendation, GPT-2 can be fine-tuned on a domain dataset, such as TED Talk transcripts, so that its output reflects a user's query or preferences.
- Its pre-trained knowledge of language and context lets it relate a query to the content of a talk, which makes it a useful basis for a content recommender.

### **Import DataFrame and Data Libraries**

In [40]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
from tqdm import tqdm
from tqdm.notebook import tqdm_notebook

In [16]:
tedtalks=pd.read_csv("/Users/brittanyharding/LHL-Projects/Ted-Talks-Recommender-System/data_output/ted_talk_fulldata_v2.csv")

In [17]:
tedtalks.head(2)

Unnamed: 0,author,talk,description,transcript,likes,views,url,date,topic,transcript_length,individual_topic
0,Machine Dazzle,How to unleash your inner maximalist through c...,Tapping into the transformational power of cos...,"Hello, I am Machine Dazzle, and I am an emotio...",81000,270192,ted.com/talks/machine_dazzle_how_to_unleash_yo...,Jun 2023,"art, creativity, design, fashion, performance",1901,art
1,Jioji Ravulo,A liberating vision of identity that transcend...,How can we move past society's inclination to ...,Can you paint with all the colors of the wind?...,92000,309952,ted.com/talks/jioji_ravulo_a_liberating_vision...,Jun 2023,"diversity, identity, inclusion, indigenous+peo...",1779,design


### **Data Wrangling and Cleaning**

In [None]:
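# Combine the textual columns: talk, description, and transcript (topics are excluded and treated separately as an additional input feature).
# Missing values are replaced with empty strings first, because concatenating NaN with a string yields NaN for the whole row.
text_columns = ['talk', 'description', 'transcript']
tedtalks['combined_textual_columns'] = tedtalks[text_columns].fillna('').agg(' '.join, axis=1)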

# Drop the individual textual features that were combined
tedtalks.drop(columns=['talk', 'description', 'transcript', 'individual_topic'], inplace=True)

In [22]:
def preprocess_all(data_frame, column):
    # Create new column names for the preprocessed text
    new_column = 'preprocessed_' + column
    
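    # Fill null values with empty strings (assign back instead of using inplace on a column selection, which may not modify the DataFrame)
    data_frame[column] = data_frame[column].fillna('')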
    
    # Remove punctuation
    data_frame[new_column] = data_frame[column].apply(lambda x: x.translate(str.maketrans("", "", string.punctuation)) if isinstance(x, str) else x)
    
    # Convert to lowercase
    data_frame[new_column] = data_frame[new_column].apply(lambda x: x.lower() if isinstance(x, str) else x)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    data_frame[new_column] = data_frame[new_column].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]) if isinstance(x, str) else x)
    
    # Stem the tokens
    stemmer = PorterStemmer()
    data_frame[new_column] = data_frame[new_column].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]) if isinstance(x, str) else x)
    
    return data_frame

In [26]:
preprocess_all(tedtalks, "combined_textual_columns")

Unnamed: 0,author,likes,views,url,date,topic,transcript_length,combined_textual_columns,preprocessed_combined_textual_columns
0,Machine Dazzle,81000,270192,ted.com/talks/machine_dazzle_how_to_unleash_yo...,Jun 2023,"art, creativity, design, fashion, performance",1901,How to unleash your inner maximalist through c...,unleash inner maximalist costum tap transform ...
1,Jioji Ravulo,92000,309952,ted.com/talks/jioji_ravulo_a_liberating_vision...,Jun 2023,"diversity, identity, inclusion, indigenous+peo...",1779,A liberating vision of identity that transcend...,liber vision ident transcend label move past s...
2,Rebecca Darwent,10000,341218,ted.com/talks/rebecca_darwent_how_to_fund_real...,Jun 2023,"business, community, equality, humanity, money...",1661,How to fund real change in your community Is ...,fund real chang commun way give back benefit e...
3,Susanne Buckley-Zistel,37000,126376,ted.com/talks/susanne_buckley_zistel_what_caus...,Jun 2023,"africa, animation, education, history, identit...",838,What caused the Rwandan Genocide? For one hun...,caus rwandan genocid one hundr day 1994 africa...
4,Conor Russomanno,11000,374259,ted.com/talks/conor_russomanno_a_powerful_new_...,Jun 2023,"biotech, brain, disability, health, invention,...",1784,A powerful new neurotech tool for augmenting y...,power new neurotech tool augment mind astonish...
...,...,...,...,...,...,...,...,...,...
5350,Hans Rosling,467000,15592756,ted.com/talks/hans_rosling_the_best_stats_you_...,Jun 2006,"africa, asia, demo, economics, global+issues, ...",3174,The best stats you've ever seen You've never ...,best stat youv ever seen youv never seen data ...
5351,Sir Ken Robinson,22000000,75235356,ted.com/talks/sir_ken_robinson_do_schools_kill...,Jun 2006,"creativity, culture, dance, education, kids, p...",3170,Do schools kill creativity? Sir Ken Robinson ...,school kill creativ sir ken robinson make ente...
5352,Majora Carter,92000,3072786,ted.com/talks/majora_carter_greening_the_ghetto,Jun 2006,"activism, business, cities, environment, equal...",3071,Greening the ghetto In an emotionally charged...,green ghetto emot charg talk macarthurwin acti...
5353,David Pogue,60000,2020628,ted.com/talks/david_pogue_simplicity_sells,Jun 2006,"computers, entertainment, media, music, perfor...",3373,Simplicity sells New York Times columnist Dav...,simplic sell new york time columnist david pog...
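
The cleaning steps in `preprocess_all` can be seen on a single string with the minimal sketch below. It uses only the standard library: `DEMO_STOPWORDS` is a tiny hand-written stand-in for NLTK's English stopword list, `clean_text` is a hypothetical helper name, and stemming is left out. It also illustrates a side effect of the step order: because punctuation is stripped before stopwords are filtered, a contraction such as "you've" becomes "youve" and no longer matches the stopword entry, which would be consistent with the `youv` token visible in the output above.

```python
import string

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"the", "of", "to", "a", "and", "is", "your", "through", "you've"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Strip punctuation, lowercase, and drop stopwords (stemming omitted)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

sample = "How to unleash your inner maximalist, through costume!"
cleaned = clean_text(sample)
print(cleaned)

# The apostrophe is removed first, so "you've" survives as "youve".
print(clean_text("You've never seen data"))
```

If contractions should be dropped as stopwords, one option is to filter stopwords before removing punctuation.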


### **Run Vectorisation with GPT-2 on preprocessed_combined_textual_columns**

In [32]:
import torch
from transformers import GPT2Tokenizer, GPT2Model

In [33]:
# Load pre-trained GPT model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2Model.from_pretrained(model_name)


In [44]:
# Inference only: switch off dropout once, outside the per-row function
model.eval()

def get_gpt_embeddings(text):
    # Tokenize the text, truncating to the model's maximum context length (1024 tokens for GPT-2)
    max_tokens = model.config.max_position_embeddings
    input_ids = tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=max_tokens, return_tensors="pt")

    # An empty string produces no tokens, which the model cannot process
    if input_ids.size(1) == 0:
        return None

    # Get GPT-2 embeddings without tracking gradients
    with torch.no_grad():
        outputs = model(input_ids)
        embeddings = outputs.last_hidden_state  # Shape: (1, sequence_length, hidden_size)

    return embeddings

tedtalks["gpt_embeddings"] = tedtalks["preprocessed_combined_textual_columns"].apply(get_gpt_embeddings)
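
Each value in `gpt_embeddings` holds one vector per token, so talks of different lengths produce tensors of different shapes. A common way to compare talks, sketched below under stated assumptions, is to average the token vectors into one fixed-size vector per talk and rank by cosine similarity. This is an illustration, not a step the notebook performs: the numbers are made up, plain Python lists stand in for tensors, and `mean_pool` and `cosine_similarity` are hypothetical helper names. With PyTorch the same operations would be `last_hidden_state.mean(dim=1)` and `torch.nn.functional.cosine_similarity`.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors (seq_len x hidden) into one hidden-sized vector."""
    seq_len = len(token_vectors)
    hidden = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / seq_len for d in range(hidden)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy token-level embeddings with hidden size 2 (made-up values).
talk_tokens = {
    "talk_a": [[1.0, 0.0], [1.0, 2.0]],
    "talk_b": [[0.0, 3.0], [0.0, 1.0], [0.0, 2.0]],
    "talk_c": [[2.0, -1.0]],
}

# One fixed-size vector per talk, regardless of token count.
pooled = {name: mean_pool(vectors) for name, vectors in talk_tokens.items()}

# Rank the other talks by similarity to talk_a, most similar first.
query = pooled["talk_a"]
ranked = sorted(
    (name for name in pooled if name != "talk_a"),
    key=lambda name: cosine_similarity(query, pooled[name]),
    reverse=True,
)
print(ranked)
```

Mean pooling discards word order, so it is a coarse summary of a talk; it is shown here only because it yields vectors of equal size that can be compared directly.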

### **Build Recommender Model with GPT-2**

In [None]:
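from torch.utils.data import Dataset, DataLoader

class RecommendationDataset(Dataset):
    """Serves tokenized talk text for causal language-model fine-tuning."""

    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = list(texts)
        self.tokenizer = tokenizer
        self.max_length = max_length  # 512 is an arbitrary cap chosen to limit memory use

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # GPT2LMHeadModel expects integer token IDs, not hidden-state embeddings
        enc = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_length,
                             padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        # Labels are the inputs themselves; padded positions are set to -100 so the loss ignores them
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Prepare the dataset
recommendation_dataset = RecommendationDataset(tedtalks["preprocessed_combined_textual_columns"], tokenizer)

# Define DataLoader
batch_size = 8
recommendation_dataloader = DataLoader(recommendation_dataset, batch_size=batch_size, shuffle=True)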

In [None]:
# Fine-tuning parameters
num_epochs = 3  

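# GPT2Config and GPT2LMHeadModel were not imported earlier in the notebook
from transformers import GPT2Config, GPT2LMHeadModel

# Load the pre-trained GPT-2 model and tokenizer
model_name = "gpt2"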
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
config = GPT2Config.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, config=config)

# Fine-tune the GPT-2 model on the recommendation task
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
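    for batch in recommendation_dataloader:
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        # Use the attention mask when the dataset provides one so padding is ignored
        attention_mask = batch.get("attention_mask")
        if attention_mask is not None:
            attention_mask = attention_mask.to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)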
        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(recommendation_dataloader)
    print(f"Epoch {epoch + 1}/{num_epochs} - Loss: {avg_loss:.4f}")

# Save the fine-tuned model
model.save_pretrained("fine_tuned_recommender_model")
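
For reference, when `labels` is passed to `GPT2LMHeadModel`, the returned `outputs.loss` is the usual next-token cross-entropy, with the labels shifted by one position inside the model. As a summary under that assumption (not something derived in this notebook), for a single sequence of tokens $x_1, \dots, x_T$ with no ignored positions the loss is:

$$
\mathcal{L}(\theta) = -\frac{1}{T-1} \sum_{t=1}^{T-1} \log p_\theta\left(x_{t+1} \mid x_1, \dots, x_t\right)
$$

Positions whose label is `-100` are left out of the average, which is why padded positions are typically given that label.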

In [None]:
# Load the fine-tuned model
fine_tuned_model = GPT2LMHeadModel.from_pretrained("fine_tuned_recommender_model")
fine_tuned_model.to(device)

# Input query
query_text = "How to unleash your creativity"

# Tokenize the query text
input_ids = tokenizer.encode(query_text, add_special_tokens=True, return_tensors="pt").to(device)

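# Generate recommendations. Sampling is enabled because greedy decoding can return only one sequence.
with torch.no_grad():
    outputs = fine_tuned_model.generate(input_ids, max_length=50, do_sample=True, top_k=50, num_return_sequences=3, pad_token_id=tokenizer.eos_token_id)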

# Decode and print the recommendations
for i, output in enumerate(outputs):
    recommendation = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Recommendation {i + 1}: {recommendation}")
