# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

## Steps:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

The dataset is available at: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

### Step 3: Text Preprocessing

* Remove punctuation
* Convert text to lowercase
* Remove stop words

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

### Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import torch
import re
import nltk
from nltk.corpus import stopwords

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

In [4]:
df = pd.read_csv('../data/podcastdata_dataset.csv')

In [5]:
print(df.columns)

Index(['id', 'guest', 'title', 'text'], dtype='object')


In [6]:
print(df.head())

   id            guest                    title  \
0   1      Max Tegmark                 Life 3.0   
1   2    Christof Koch            Consciousness   
2   3    Steven Pinker  AI in the Age of Reason   
3   4    Yoshua Bengio            Deep Learning   
4   5  Vladimir Vapnik     Statistical Learning   

                                                text  
0  As part of MIT course 6S099, Artificial Genera...  
1  As part of MIT course 6S099 on artificial gene...  
2  You've studied the human mind, cognition, lang...  
3  What difference between biological neural netw...  
4  The following is a conversation with Vladimir ...  


### Step 3: Text Preprocessing
* Remove punctuation
* Convert text to lowercase
* Remove stop words

In [7]:
# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text

In [8]:
df['preprocessed_text'] = df['text'].apply(preprocess_text)
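
One detail of the step order above is worth checking: punctuation is stripped before stop words are filtered, so a contraction such as "don't" becomes "dont" and no longer matches a stop-word entry. The sketch below illustrates this with a tiny hand-written stop-word set (`TOY_STOP_WORDS` is a hypothetical stand-in, not NLTK's list) and compares it with an alternative order, `filter_then_strip`, which is an assumption for illustration and not the notebook's method.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses NLTK's English list.
TOY_STOP_WORDS = {"i", "the", "is", "a", "don't", "it's"}

def strip_then_filter(text):
    """Same order as preprocess_text: strip punctuation, lowercase, then filter."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)

def filter_then_strip(text):
    """Alternative order: filter stop words first, then strip punctuation."""
    kept = [
        w for w in text.lower().split()
        # Ignore leading/trailing punctuation when matching against the stop-word set
        if w.strip(".,!?;:") not in TOY_STOP_WORDS
    ]
    return re.sub(r"[^\w\s]", "", " ".join(kept))

sample = "I don't think it's the answer."
print(strip_then_filter(sample))  # contractions survive as "dont" and "its"
print(filter_then_strip(sample))  # contractions are removed
```

Whether the leftover tokens matter depends on the corpus; for TF-IDF they mostly add low-value vocabulary entries.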

### Step 4: Vector Space Representation - TF-IDF
Create TF-IDF vector representations of the transcripts.

In [9]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['preprocessed_text'])
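
To see what the TF-IDF weights represent, here is a minimal pure-Python sketch on a three-document toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, and a plain whitespace tokeniser, so it is an illustration and not a drop-in replacement for `TfidfVectorizer`. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

toy_docs = [
    "neural networks learn",
    "neural language models",
    "rocket engines burn fuel",
]
vocab, vectors = tfidf_vectors(toy_docs)
weights = dict(zip(vocab, vectors[0]))
# "neural" appears in two documents, "networks" in one,
# so "networks" gets the larger weight in the first document.
print(weights["neural"], weights["networks"])
```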

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

In [10]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [11]:
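def get_bert_embedding(text):
    # Tokenize; BERT accepts at most 512 tokens, so longer transcripts are truncated
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    # Inference only: skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors of the last hidden layer into one 768-dimensional vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Embed every transcript and stack the vectors into an (n_episodes, 768) matrix
bert_embeddings = df['preprocessed_text'].apply(get_bert_embedding)
bert_matrix = np.vstack(bert_embeddings)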
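
Because of `max_length=512` with truncation, each episode embedding above reflects only the opening of the transcript. One common workaround is to split the text into overlapping windows, embed each window, and average the window vectors. The sketch below shows only the windowing and averaging logic in plain Python; `chunk_words` and `mean_vector` are hypothetical helpers, and the default window of 200 words is an assumption meant to stay under 512 WordPiece tokens, not a tuned value.

```python
def chunk_words(words, size=200, overlap=50):
    """Split a list of words into windows of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once a window has reached the end of the text
        if start + size >= len(words):
            break
    return chunks

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (lists of floats)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy demonstration with integers standing in for words
windows = chunk_words(list(range(10)), size=4, overlap=1)
print(windows)
print(mean_vector([[1.0, 2.0], [3.0, 4.0]]))
```

In the notebook, each window would be joined back into a string, passed to `get_bert_embedding`, and the resulting vectors averaged to obtain one vector per episode. This multiplies the number of forward passes, so it is considerably slower than the truncated version.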

### Step 6: Query Processing

In [12]:
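def process_query(query, tfidf_vectorizer, bert_tokenizer, bert_model):
    # Note: bert_tokenizer and bert_model are accepted for interface symmetry only;
    # get_bert_embedding uses the module-level tokenizer and model.
    # Apply the same preprocessing as for the transcripts
    preprocessed_query = preprocess_text(query)

    # TF-IDF: project the query into the fitted vocabulary and compare with every episode
    tfidf_query = tfidf_vectorizer.transform([preprocessed_query])
    tfidf_similarities = cosine_similarity(tfidf_query, tfidf_matrix).flatten()

    # BERT: embed the query and compare with every episode embedding
    bert_query = get_bert_embedding(preprocessed_query)
    bert_similarities = cosine_similarity([bert_query], bert_matrix).flatten()

    # One similarity score per episode, in the same order as df
    return tfidf_similarities, bert_similarities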
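
Both rankings rely on cosine similarity, the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula for intuition; the helper name `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a choice made here, not a statement about any library's behaviour.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally sized vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # no shared components
print(cosine([1.0, 1.0], [2.0, 2.0]))  # same direction, different length
```

For TF-IDF vectors, which have no negative entries, two texts with no terms in common score exactly 0. Dense BERT vectors rarely behave this way, so scores from the two methods are not directly comparable in magnitude; only the rankings are.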

### Step 7: Retrieve and Compare Results

In [13]:
def retrieve_results(query):
    tfidf_similarities, bert_similarities = process_query(query, tfidf_vectorizer, tokenizer, model)

    # Build one DataFrame per method, sorted by descending similarity
    tfidf_df = pd.DataFrame({
        'sim': tfidf_similarities,
        'id': df['id'],
        'episodio': df['title']
    }).sort_values('sim', ascending=False).reset_index(drop=True)

    bert_df = pd.DataFrame({
        'sim': bert_similarities,
        'id': df['id'],
        'episodio': df['title']
    }).sort_values('sim', ascending=False).reset_index(drop=True)

    return tfidf_df, bert_df
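
`retrieve_results` returns every episode in ranked order. When only the best few matches are needed, a full sort is unnecessary. The sketch below selects the top k with `heapq.nlargest` from the standard library; the helper `top_k` and the toy scores and ids are hypothetical and are not taken from the dataset.

```python
import heapq

def top_k(scores, ids, k=5):
    """Return the k (id, score) pairs with the highest scores, best first."""
    # Pairs are compared by score first; ties fall back to the larger id
    pairs = heapq.nlargest(k, zip(scores, ids))
    return [(doc_id, score) for score, doc_id in pairs]

toy_scores = [0.10, 0.03, 0.00, 0.25]
toy_ids = [1, 2, 3, 4]
best = top_k(toy_scores, toy_ids, k=2)
print(best)
```

With the DataFrames returned above, calling `.head(k)` on `tfidf_df` or `bert_df` gives the same selection, since they are already sorted.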

### Step 8: Test the IR System

In [18]:
query = "gpt"
tfidf_results, bert_results = retrieve_results(query)
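
A simple way to quantify how much the two rankings agree is the overlap of their top-k episode ids. The sketch below computes the shared ids and their Jaccard index (intersection size divided by union size); `overlap_at_k` is a hypothetical helper and the id lists are toy values, not results from this dataset.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Return (shared ids, Jaccard index) for the top-k entries of two ranked id lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard

# Toy rankings: two of the four top ids appear in both lists
shared_ids, jaccard = overlap_at_k([1, 2, 3, 4], [3, 4, 5, 6], k=4)
print(shared_ids, jaccard)
```

With the results above, the inputs would be `tfidf_results['id'].tolist()` and `bert_results['id'].tolist()`. A low overlap does not by itself say which method is better; it only indicates that the rankings should be inspected by hand.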

In [19]:
print("TF-IDF results:")
print(tfidf_results)

print("\nBERT results:")
print(bert_results)

TF-IDF results:
          sim   id                                           episodio
0    0.099371  215  OpenAI Codex, GPT-3, Robotics, and the Future ...
1    0.032537   17                                     OpenAI and AGI
2    0.028676   94                                      Deep Learning
3    0.028510  121                    Friendship with an AI Companion
4    0.025214  118  Math, Manim, Neural Networks & Teaching with 3...
..        ...  ...                                                ...
314  0.000000  106       Neuroscience, Psychology, and AI at DeepMind
315  0.000000  105                                 Edison of Medicine
316  0.000000  104             Computer Architecture and Data Storage
317  0.000000  103                    Artificial General Intelligence
318  0.000000  325  Biology, Life, Aliens, Evolution, Embryogenesi...

[319 rows x 3 columns]

BERT results:
          sim   id                                           episodio
0    0.086577   87     Evolut