# Process

1. Dataset:

    * Import `ted_talk_clean_merged_bert`

    * Extract only required columns
    
    * Create a huggingface dataset
    
    * Save the dataset to `t5_dataset`
    
2. Semantic search with FAISS

    * create new text column that concatenates title, description and transcript.
    
    * create `get_embeddings` function that:
    
        * tokenizes the `text` column
        * passes the token tensors through the model to get `output`
        * applies CLS pooling (the first-position vector) to the model `output`
        
    * create embeddings with `get_embeddings` function
    
    * add FAISS index

    * save the dataset as `t5_embedded_dataset` and the index as `t5_faiss_index.faiss`.
    
3. Testing

    * embed sample query with `get_embeddings`
    
    * use `.get_nearest_examples()` method to get similar embeddings
    
    * create `get_recommendations` to give results in pandas dataframe
    

# 1 Dataset
1.1 Import ted_talk_clean_merged_bert

1.2 Extract only required columns

1.3 Create a huggingface dataset

1.4 Save the dataset to t5_dataset

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive')
    SOURCE_DIR = "/content/drive/MyDrive/MLOPs_Projects/TED_Project/Data_output/T5/"
except ImportError:
    # google.colab is only importable inside Colab
    print("Not in Colab environment. Some of the functionality will not work.")
    SOURCE_DIR = ""


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# import the df
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/MLOPs_Projects/TED_Project/Data_output/ted_talk_clean_merged_bert.csv',
                 parse_dates=['date']).drop(columns=['Unnamed: 0'])
print(len(df))
df.info()

5140
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5140 entries, 0 to 5139
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   author       5140 non-null   object        
 1   title        5140 non-null   object        
 2   description  5140 non-null   object        
 3   likes        5140 non-null   int64         
 4   views        5140 non-null   int64         
 5   transcript   5140 non-null   object        
 6   date         5140 non-null   datetime64[ns]
 7   tags         5140 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 321.4+ KB


In [3]:
df.head(2)

| | author | title | description | likes | views | transcript | date | tags |
|---|---|---|---|---|---|---|---|---|
| 0 | Machine Dazzle | how to unleash your inner maximalist through c... | tapping into the transformational power of cos... | 8100 | 270192 | Hello, I am Machine Dazzle, and I am an emotio... | 2023-06-01 | art, creativity, design, fashion, performance |
| 1 | Jioji Ravulo | a liberating vision of identity that transcend... | how can we move past societys inclination to b... | 9200 | 309952 | Can you paint with all the colors of the wind?... | 2023-06-01 | diversity, identity, inclusion, indigenous_peo... |


In [4]:
# extract only the required columns - keeping all for now
# create a huggingface dataset
try:
    from datasets import Dataset
except:
    !pip -q install datasets
    from datasets import Dataset

t5_dataset = Dataset.from_pandas(df, split="train")
t5_dataset

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags'],
    num_rows: 5140
})

## 1.5 Create `text` "soup" column

In [5]:
# create new text column that concatenates title, description and transcript
def concat_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["description"]
        + " \n "
        + examples["transcript"]
        # + " \n Talk by "
        # + examples["author"]
    }

t5_dataset = t5_dataset.map(concat_text)

In [6]:
t5_dataset

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text'],
    num_rows: 5140
})

## Save the dataset to t5_dataset

In [7]:
# Save the dataset to t5_dataset

t5_dataset.save_to_disk(SOURCE_DIR+'t5_dataset')

# confirm by loading
from datasets import load_from_disk

dataset_reloaded = load_from_disk(SOURCE_DIR+'t5_dataset')
dataset_reloaded

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text'],
    num_rows: 5140
})

# 2 Semantic search with FAISS
2.1 Reuse the `text` column (title, description and transcript concatenated) created in section 1.

2.2 Create a `get_embeddings` function that:

 * uses the tokenizer and model loaded from the `sentence-transformers/sentence-t5-base` checkpoint
 * tokenizes the `text` column and returns tensors
 * moves the tensors to CUDA (GPU) when available for faster processing
 * passes the tensors through the model to get `output`
 * applies CLS pooling (the first-position vector) to the model `output`

2.3 Create embeddings with the `get_embeddings` function

2.4 Add a FAISS index
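
The pooling step in the list above reduces one vector per token to a single vector per text. T5 has no dedicated `[CLS]` token, so "CLS pooling" in this notebook means keeping the vector at position 0. The sketch below uses plain Python lists in place of tensors to contrast that with masked mean pooling, a common alternative for sentence embeddings. The function names and the toy values are hypothetical and are not part of the notebook.

```python
from typing import List

Vector = List[float]


def first_token_pooling(token_vectors: List[Vector]) -> Vector:
    """Return the vector at position 0 (what `[:, 0]` selects for one sequence)."""
    return list(token_vectors[0])


def masked_mean_pooling(token_vectors: List[Vector], attention_mask: List[int]) -> Vector:
    """Average only the vectors whose mask entry is 1, so padding is ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical hidden states: a 3-token sequence padded to length 4.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(first_token_pooling(hidden))        # [1.0, 2.0]
print(masked_mean_pooling(hidden, mask))  # [3.0, 4.0]
```

First-position pooling depends on a single token's state, while mean pooling uses every non-padding token. Which one suits a given checkpoint depends on how that checkpoint was trained.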

## 2.1 Use t5 sentence transformer to generate text `embeddings`

In [8]:
from datasets import load_from_disk
t5_dataset = load_from_disk(SOURCE_DIR+'t5_dataset')
t5_dataset

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text'],
    num_rows: 5140
})

In [9]:
# import the sentence-t5-base model and tokenizer

try:
    from transformers import AutoTokenizer, AutoModel
except ImportError:
    %pip install -q transformers
    from transformers import AutoTokenizer, AutoModel

model_ckpt = 'sentence-transformers/sentence-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# AutoModel resolves this checkpoint to the full T5Model (encoder + decoder)
model = AutoModel.from_pretrained(model_ckpt)

import torch
# use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

Some weights of T5Model were not initialized from the model checkpoint at sentence-transformers/sentence-t5-base and are newly initialized: ['decoder.block.6.layer.1.EncDecAttention.q.weight', 'decoder.block.10.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.0.layer_norm.weight', 'decoder.block.8.layer.0.SelfAttention.k.weight', 'decoder.block.4.layer.1.EncDecAttention.q.weight', 'decoder.block.3.layer.1.EncDecAttention.q.weight', 'decoder.block.11.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.9.layer.2.DenseReluDense.wo.weight', 'decoder.block.6.layer.2.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.0.layer_norm.weight', 'decoder.block.11.layer.2.DenseReluDense.wi.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.11.layer.1.layer_norm.weight', 'decoder.block.9.layer.0.SelfAttention.k.weight', 'decoder.bloc

T5Model(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dropout(p=0.1, inplace

In [10]:
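# create a get_embeddings function

def cls_pooling(model_output):
    # keep the hidden state at the first position of each sequence
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text):
    # tokenize the text; truncation=True cuts inputs at the tokenizer's maximum length
    encoded_input = tokenizer(
        text,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )

    # move the tensors to the selected device (GPU when available)
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # T5Model also needs decoder inputs: one start token (id 0) per input
    # sequence, so the call also works for a batch of texts
    batch_size = encoded_input['input_ids'].shape[0]
    decoder_input_ids = torch.zeros((batch_size, 1), dtype=torch.long, device=device)

    # Forward pass through the model to obtain embeddings
    # NOTE: last_hidden_state is the decoder output here, and the load warning
    # above reports the decoder weights as newly initialized
    with torch.no_grad():
        model_output = model(**encoded_input,
                             decoder_input_ids=decoder_input_ids,
                             return_dict=True)
    # feed model output to cls pooling
    return cls_pooling(model_output)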

In [11]:
# testing with the first entry
embedding = get_embeddings(t5_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [12]:
# create embeddings with get_embeddings function
t5_dataset = t5_dataset.map(
    lambda x: {"embeddings" : get_embeddings(x['text']).to('cpu').numpy()[0]}
)

In [13]:
t5_dataset

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text', 'embeddings'],
    num_rows: 5140
})

## 2.2. Create FAISS index

In [11]:
try:
    import faiss
except ImportError:
    # install a single build; faiss-cpu and faiss-gpu both provide the `faiss` module
    %pip install --upgrade faiss-cpu -q
    import faiss
# create faiss index
t5_dataset.add_faiss_index(column="embeddings")
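
The call above passes no `metric_type` or `string_factory`, so the index is assumed here to be a flat (exhaustive) index using L2 distance. As a rough mental model, such an index compares the query with every stored vector and returns the `k` smallest squared distances. The pure-Python sketch below illustrates that idea only. It is not how FAISS is implemented, and the names and toy vectors are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: Sequence[float],
            vectors: Sequence[Sequence[float]],
            k: int) -> List[Tuple[float, int]]:
    """Return (distance, row index) pairs for the k closest vectors, closest first."""
    scored = [(squared_l2(query, vec), idx) for idx, vec in enumerate(vectors)]
    return sorted(scored)[:k]


# Hypothetical 2-dimensional "embeddings" standing in for the 768-dimensional ones.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
hits = nearest([0.9, 0.1], vectors, k=2)

for distance, idx in hits:
    print(f"row {idx}: squared L2 distance {distance:.2f}")
```

Under this metric a smaller score means a closer match, which matters when the scores are sorted later.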


## 2.3 Save the FAISS index

In [15]:
t5_dataset

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text', 'embeddings'],
    num_rows: 5140
})

In [16]:
t5_dataset.save_faiss_index('embeddings', SOURCE_DIR+'t5_faiss_index.faiss')
t5_dataset.drop_index('embeddings')
t5_dataset.save_to_disk(SOURCE_DIR+'/t5_embedded_dataset')

In [17]:
# load dataset and index
try:
    from datasets import load_from_disk
except:
    %pip install datasets
    from datasets import load_from_disk
# load the dataset
# t5_dataset = load_from_disk(SOURCE_DIR+'/t5_dataset')
# load the index
# t5_dataset.load_faiss_index('embeddings', SOURCE_DIR+'/t5_faiss_index.faiss')

In [18]:
t5_dataset

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text', 'embeddings'],
    num_rows: 5140
})

# 3. Build recommender

3.1 embed sample query with get_embeddings

3.2 use .get_nearest_examples() method to get similar embeddings

3.3 use pandas to sort out top results
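
Step 3.3 can be pictured without pandas. `.get_nearest_examples()` returns the scores alongside a dictionary of columns, so ranking amounts to pairing each score with its row and sorting. The sketch below assumes the scores are distances (smaller is closer), which holds if the index uses L2 distance. The helper name and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def top_results(scores: Sequence[float],
                samples: Dict[str, List[Any]],
                num: int = 3) -> List[Dict[str, Any]]:
    """Pair each score with its row, sort by distance, and keep the closest `num`."""
    rows = [
        {"score": score, **{col: values[i] for col, values in samples.items()}}
        for i, score in enumerate(scores)
    ]
    rows.sort(key=lambda row: row["score"])  # ascending: smallest distance first
    return rows[:num]


# Hypothetical search output in the same column-oriented layout.
scores = [0.30, 0.05, 0.12]
samples = {"title": ["talk c", "talk a", "talk b"], "author": ["C", "A", "B"]}

for row in top_results(scores, samples, num=2):
    print(row["title"], row["author"], row["score"])
```

If the index were built with an inner-product metric instead, larger scores would be better and the sort direction would flip.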

In [19]:
t5_dataset = load_from_disk(SOURCE_DIR+'/t5_embedded_dataset')
t5_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['author', 'title', 'description', 'likes', 'views', 'transcript', 'date', 'tags', 'text', 'embeddings'],
    num_rows: 5140
})

In [20]:
# testing a query with get_embeddings
query = "canadian directors"
query_embedding = get_embeddings([query]).cpu().detach().numpy()
query_embedding.shape

(1, 768)

In [21]:
# build a pandas DataFrame from the embedded dataset for title lookups
import pandas as pd
import numpy as np
df = pd.DataFrame(t5_dataset[:])

def get_embeddings_with_topic(topic):
    """
    input: topic string
    output: (embedding, label) when exactly one title contains the topic,
            otherwise (None, topic) so the caller can fall back to a query search
    """

    # find titles containing the topic (literal, case-insensitive match)
    title = df[df['title'].str.contains(topic, case=False, regex=False)].index
    num_results = len(title)

    # fall back to a query search unless exactly one title matched
    if num_results != 1:
        if num_results:
            print(f"Multiple search results were found for topic '{topic}'.")
        else:
            print(f"No search results for topic '{topic}'.")
        print("Performing query search instead.")
        return None, topic

    result = df.loc[title[0]].to_dict()
    topic = f"{result['title']} by {result['author']}"

    # FAISS works on float32 arrays; use shape (1, dim) for a single query
    embedding = result['embeddings']
    return np.array(embedding, dtype=np.float32).reshape(1, -1), topic


def get_recommendations(topic=None, query=None, num=3):
    """
    input: a query asking for a topic recommendation
    OR
    input: one of the recommender topics
    output: prints the `num` most relevant talks (returns None)
    """
    query_embedding = None
    if topic:
        # look up the stored embedding of the talk whose title matches the topic
        query_embedding, topic = get_embeddings_with_topic(topic)
        if query_embedding is None:
            # no unique title match: treat the topic as a free-text query
            query = topic
            topic = None
    if query:
        print(f"Searching the Ted Talk Database for recommendations based on the query '{query}'.")
        # embed the query with get_embeddings & convert to numpy
        query_embedding = get_embeddings([query]).to('cpu').numpy()
    if query_embedding is None:
        raise ValueError("Provide a topic or a query.")

    # use get_nearest_examples to get similar embeddings
    scores, samples = t5_dataset.get_nearest_examples(
        'embeddings',
        query_embedding,
        k=num)

    # create a df with the results
    samples_df = pd.DataFrame.from_dict(samples)
    samples_df['scores'] = scores
    # the index was added with default settings, assumed here to mean L2
    # distance: smaller scores are closer, so sort ascending
    samples_df = samples_df.sort_values('scores', ascending=True)[:num]

    # print results
    print("\n\nRecommendations based on",
          f"the query '{query}'\n" if query else f"the topic: '{topic}'\n"
          )
    for _, row in samples_df.iterrows():
        print(f"TITLE: {row.title}")
        print(f"AUTHOR: {row.author}")
        print(f"SCORE: {row.scores}")
        print(f"DESCRIPTION: {row.description}")
        print(f"TAGS: {row.tags}")
        print(f'TRANSCRIPT: {" ".join(row.transcript.split(" ")[:20])}')
        print("============")
        print("")

In [None]:
get_recommendations(topic="a liberating vision of identity", num=5)
get_recommendations(topic="stage fright", num=3)

In [None]:
get_recommendations(topic="stage fright", num=10)

In [None]:
get_recommendations(topic="climate change", num=5)

In [None]:
get_recommendations("Nollywood movies", num=10)

# Run from here to test recommender engine

In [1]:
# load dataset and index
try:
    from datasets import load_from_disk
except:
    %pip install -q datasets
    from datasets import load_from_disk

try:
    from google.colab import drive
    drive.mount('/content/drive')
    # get source file
    !cp /content/drive/MyDrive/MLOPs_Projects/TED_Project/Data_output/T5/t5.py .
    %pip install --upgrade faiss-gpu -q
except:
    print("Not in Colab environment. Some of the functionality may not work.")
    %pip install --upgrade faiss-cpu -q

# try:
#     import faiss
# except:
#     %pip install --upgrade faiss-cpu faiss-gpu -q
#     import faiss


from t5 import *

Not in Colab environment. Some of the functionality will not work.
Note: you may need to restart the kernel to use updated packages.


In [2]:
get_recommendations("how to fund real change in your community", num=10)



Recommendations based on the topic: 'how to fund real change in your community by Rebecca Darwent'

TITLE: how to quickly scale up contact tracing across the us
AUTHOR: Joia Mukherjee
SCORE: 0.0006183923687785864
DESCRIPTION: contact tracing  the process of identifying people who may have been exposed to the coronavirus in order to slow its spread  is a fundamental tool in the fight against covid19 how can we scale this critical work across the entire united states joia mukherjee chief medical officer of partners in health discusses how her team is working with public health agencies to ramp up contact tracing for the countrys most vulnerable communities  and shows why it will take a compassionate approach to be truly effective this ambitious plan is part of the audacious project teds initiative to inspire and fund global change the conversation hosted by head of ted chris anderson and current affairs curator whitney pennington rodgers was recorded may 27 2020
TAGS: community, corona