# Workshop: Building an Information Retrieval System for Podcast Episodes

## Objective:
Create an Information Retrieval (IR) system that processes a dataset of podcast transcripts and, given a query, returns the episodes where the host and guest discuss the query topic. Use TF-IDF and BERT for vector space representation and compare the results.

Instructions:

### Step 1: Import Libraries
Import necessary libraries for data handling, text processing, and machine learning.

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

Find the dataset in: https://www.kaggle.com/datasets/rajneesh231/lex-fridman-podcast-transcript

### Step 3: Text Preprocessing

Lowercase the transcripts, then remove punctuation and stop words.

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.

### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.
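As a hedged illustration of what "similarity score" means in this step, the sketch below computes cosine similarity by hand with NumPy. The function name `cosine_scores` is hypothetical and is not part of the workshop code; it only assumes that documents and the query are already dense vectors of the same length.

```python
import numpy as np

def cosine_scores(doc_matrix, query_vector):
    """Cosine similarity between each row of doc_matrix and query_vector.

    Illustrative helper (hypothetical name). Rows or queries with zero norm
    get a score of 0 instead of raising a division error.
    """
    docs = np.asarray(doc_matrix, dtype=float)
    q = np.asarray(query_vector, dtype=float).ravel()
    # Denominator: product of each document norm with the query norm.
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q)
    scores = np.zeros(len(docs))
    nonzero = denom > 0
    # Numerator: dot product of each document with the query.
    scores[nonzero] = (docs[nonzero] @ q) / denom[nonzero]
    return scores

example_docs = [[1, 0], [0, 1], [1, 1], [0, 0]]
example_scores = cosine_scores(example_docs, [1, 0])
print(example_scores)
```

A score of 1 means the document points in the same direction as the query, and 0 means they share no weighted terms (or one of them is an all-zero vector).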

### Step 7: Retrieve and Compare Results

Define a function to retrieve the top results based on similarity scores for both TF-IDF and BERT representations.
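One possible shape for this function is sketched below. It assumes a DataFrame with a `sim` column (similarity score) and an `ep` column (episode title); the name `top_k` and the default `k=5` are illustrative choices, not something the workshop prescribes.

```python
import pandas as pd

def top_k(similarities_df, k=5):
    """Return the k rows with the highest 'sim' score, best first.

    Hypothetical helper: expects columns 'sim' (float) and 'ep' (title).
    """
    return (
        similarities_df
        .sort_values('sim', ascending=False)  # highest similarity first
        .head(k)
        .reset_index(drop=True)               # rank 0 is the best match
    )

example_df = pd.DataFrame({
    'sim': [0.10, 0.90, 0.50],
    'ep': ['Episode A', 'Episode B', 'Episode C'],
})
example_top = top_k(example_df, k=2)
print(example_top)
```

The same helper can be applied to the output of either representation, so the two rankings can be placed side by side.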

### Step 8: Test the IR System

Test the system with a sample query.

Retrieve and display the top results using both TF-IDF and BERT representations.

### Step 9: Compare Results

Analyze and compare the results obtained from TF-IDF and BERT representations.

Discuss the differences, strengths, and weaknesses of each method based on the retrieval results.
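One simple way to quantify how much the two rankings agree is the Jaccard overlap of their top-k titles. The sketch below is only a suggestion; `jaccard_overlap` is a hypothetical name, and the inputs are assumed to be lists of episode titles ordered from best to worst match.

```python
def jaccard_overlap(ranked_a, ranked_b, k=10):
    """Jaccard overlap of the top-k items of two ranked lists.

    Returns a value between 0.0 (no shared items) and 1.0 (same set of items).
    Ordering inside the top k is ignored.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)

# Illustrative rankings, not real results from the dataset.
tfidf_ranking = ['Python', 'Deep Learning', 'Google']
bert_ranking = ['Google', 'Poker', 'Python']
overlap = jaccard_overlap(tfidf_ranking, bert_ranking, k=3)
print(overlap)
```

A low overlap is not a verdict on either method; it is a prompt to read the disagreeing episodes and judge which ranking better matches the query.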

## Instructions:

* Follow the steps outlined above to implement the IR system.
* Run the provided code snippets to understand how each part of the system works.
* Test the system with various queries to observe the results from both TF-IDF and BERT representations.
* Compare and analyze the results. Discuss the pros and cons of each method.
* Document your findings and any improvements you make to the system.

In [None]:
import pandas as pd
import string

### Step 2: Load the Dataset

Load the dataset of podcast transcripts.

In [None]:
df = pd.read_csv('data/podcastdata_dataset.csv')
print(df.head())

In [114]:
print(df.shape)

(319, 6)


### Step 3: Text Preprocessing
* Delete punctuation
* Delete stop words

In [None]:
corpus = df['text']
print(corpus.head())

In [None]:
# First, lowercase each transcript and delete punctuation.
# The translation table is built once instead of once per document.
punct_table = str.maketrans('', '', string.punctuation)
corpus_nopunct = [doc.lower().translate(punct_table) for doc in corpus]

In [None]:
print(corpus_nopunct[:10])

In [None]:
df['text_nopunct'] = corpus_nopunct
print(df.head())

In [None]:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # uncomment on the first run to fetch the stop word list
stopw = set(stopwords.words('english'))

In [None]:
print(len(stopw))

In [None]:
# Drop stop words from each document.
# split() with no argument also discards the empty strings that
# split(' ') leaves behind when a transcript contains repeated spaces.
corpus_nostopw = [
    ' '.join(word for word in doc.split() if word not in stopw)
    for doc in corpus_nopunct
]

In [None]:
corpus_nostopw[300]

In [None]:
df['text_nostopw'] = corpus_nostopw
print(df.head())

### Step 4: Vector Space Representation - TF-IDF

Create TF-IDF vector representations of the transcripts.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_mtx = vectorizer.fit_transform(df['text_nostopw'])

In [None]:
query = 'Computer Science' 

In [None]:
query_vector = vectorizer.transform([query])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tfidf_mtx, query_vector)

In [None]:
type(similarities)

In [None]:
df

In [None]:
similarities_df = pd.DataFrame(similarities, columns=['sim'])
similarities_df['ep'] = df['title']
print(similarities_df.head())

In [None]:
similarities_df

### Step 5: Vector Space Representation - BERT

Create BERT vector representations of the transcripts using a pre-trained BERT model.
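Note that `bert-base-uncased` accepts at most 512 tokens per input, so a full transcript cannot be embedded in one pass. A common workaround is to split the text into overlapping windows, embed each window, and average the vectors. The sketch below is a rough illustration under several assumptions: `split_into_windows` and `embed_long_text` are hypothetical names, `embed_fn` stands in for any function that maps a string to a 1-D vector, and window sizes are counted in whitespace-separated words, which only approximates BERT's WordPiece token count.

```python
from typing import Callable, List, Sequence

import numpy as np

def split_into_windows(tokens: Sequence[str], size: int = 400, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size` items that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows

def embed_long_text(text: str, embed_fn: Callable[[str], np.ndarray],
                    size: int = 400, overlap: int = 50) -> np.ndarray:
    """Mean-pool the embeddings of all windows of a long text."""
    windows = split_into_windows(text.split(), size, overlap)
    if not windows:
        raise ValueError("text contains no tokens")
    vectors = np.stack([np.asarray(embed_fn(" ".join(w)), dtype=float) for w in windows])
    return vectors.mean(axis=0)

# Toy stand-in for a real encoder: (number of words, constant 1.0).
def toy_embed(s: str) -> np.ndarray:
    return np.array([len(s.split()), 1.0])

example_windows = split_into_windows(list("abcdefgh"), size=4, overlap=1)
example_vector = embed_long_text("a b c d e f g h", toy_embed, size=4, overlap=1)
print(example_windows)
print(example_vector)
```

Averaging is only one pooling choice; taking the maximum similarity over windows at query time is another option worth comparing.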

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')

In [89]:
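import numpy as np

def generate_bert_embeddings(texts):
    """Return one [CLS] embedding per text, shaped (len(texts), 768, 1)."""
    embeddings = []
    for text in texts:
        # truncation=True keeps only the first 512 tokens, so a long
        # transcript is represented by its beginning only.
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])  # Use [CLS] token representation
    # Stacked shape is (n, 1, 768); the transpose turns it into (n, 768, 1).
    return np.array(embeddings).transpose(0,2,1)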

corpus_bert = generate_bert_embeddings(corpus[:50])

In [94]:
corpus_bert.shape

(50, 768, 1)

In [102]:
query = ['Computer Science']
query_bert = generate_bert_embeddings(query)

In [103]:
query_bert.shape

(1, 768, 1)

In [105]:
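# cosine_similarity needs 2-D inputs: (n_docs, 768) for the corpus and (1, 768) for the query
similarities = cosine_similarity(corpus_bert.reshape(len(corpus_bert), -1), query_bert.reshape(1, -1))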

# similarities_bert = cosine_similarity(query_bert.reshape(1, -1), corpus_bert.squeeze())
similarities

array([[0.64740765],
       [0.65648293],
       [0.62955046],
       [0.5798758 ],
       [0.6637971 ],
       [0.68807083],
       [0.652893  ],
       [0.5962678 ],
       [0.6157255 ],
       [0.6364522 ],
       [0.6226719 ],
       [0.65230983],
       [0.6951232 ],
       [0.6479816 ],
       [0.6479816 ],
       [0.69003546],
       [0.67366695],
       [0.6663608 ],
       [0.5504489 ],
       [0.6264469 ],
       [0.69075406],
       [0.5804626 ],
       [0.62647045],
       [0.6380265 ],
       [0.59840155],
       [0.66201544],
       [0.65600586],
       [0.60880053],
       [0.61693186],
       [0.6207942 ],
       [0.6466563 ],
       [0.66204685],
       [0.66466063],
       [0.69745606],
       [0.70639443],
       [0.6530784 ],
       [0.60017955],
       [0.662514  ],
       [0.6738858 ],
       [0.6957747 ],
       [0.65353596],
       [0.6237168 ],
       [0.632741  ],
       [0.65763825],
       [0.68119484],
       [0.64490247],
       [0.61485004],
       [0.582

In [106]:
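def retrieve_bert(query):
    """Score every embedded episode against the query using BERT [CLS] embeddings."""
    # Accept a plain string as well as a one-element list.
    if isinstance(query, str):
        query = [query]
    query_bert = generate_bert_embeddings(query)
    n_docs = len(corpus_bert)  # only the episodes embedded above, not the full dataset
    similarities = cosine_similarity(corpus_bert.reshape(n_docs, -1), query_bert.reshape(1, -1))
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = df['title']  # index alignment keeps only the embedded episodes
    return similarities_df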

In [110]:
retrieve_bert(['gpt'])

|   | sim | ep |
|---|-----|----|
| 0 | 0.60803 | Life 3.0 |
| 1 | 0.606848 | Consciousness |
| 2 | 0.568143 | AI in the Age of Reason |
| 3 | 0.528032 | Deep Learning |
| 4 | 0.615966 | Statistical Learning |
| 5 | 0.631584 | Python |
| 6 | 0.601826 | Stack Overflow and Coding Horror |
| 7 | 0.546332 | Google |
| 8 | 0.54986 | Long-Term Future of AI |
| 9 | 0.583512 | Deep Reinforcement Learning |


### Step 6: Query Processing

Define a function to process the query and compute similarity scores using both TF-IDF and BERT embeddings.

In [111]:
def retrieve_tfidf(query):
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(tfidf_mtx, query_vector)
    similarities_df = pd.DataFrame(similarities, columns=['sim'])
    similarities_df['ep'] = df['title']
    return similarities_df

In [113]:
retrieve_tfidf('gpt')

|   | sim | ep |
|---|-----|----|
| 0 | 0.0 | Life 3.0 |
| 1 | 0.0 | Consciousness |
| 2 | 0.0 | AI in the Age of Reason |
| 3 | 0.0 | Deep Learning |
| 4 | 0.0 | Statistical Learning |
| ... | ... | ... |
| 314 | 0.0 | Singularity, Superintelligence, and Immortality |
| 315 | 0.0 | Emotion AI, Social Robots, and Self-Driving Cars |
| 316 | 0.0 | Comedy, MADtv, AI, Friendship, Madness, and Pro Wrestling |
| 317 | 0.0 | Poker |
