[Go to Home Page](https://weaviate.oneblink.ai)


# Semantic Search on Podcast Transcripts

In this project, we use Weaviate to run semantic search over podcast transcripts. The text2vec-transformers module vectorizes the text, so a query can match passages by meaning rather than by exact keywords.

This project is adapted from the [podcast-semantic example](https://github.com/weaviate/weaviate-examples/tree/main/podcast-semantic) in the weaviate-examples repository. More information on the vectorization module can be found in the [text2vec-transformers documentation](https://weaviate.io/developers/weaviate/current/retriever-vectorizer-modules/text2vec-transformers.html#pre-built-images).

The dataset consists of 300 podcast transcripts from [Changelog](https://github.com/thechangelog/transcripts).
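
To make the idea of searching over vectors concrete, here is a minimal sketch that ranks a few texts against a query by cosine similarity. It assumes cosine similarity as the measure, and the three-dimensional vectors and their labels are made up for illustration; in the notebook, the vectors come from the text2vec-transformers module and the comparison happens inside Weaviate.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors, not real embeddings.
toy_docs = {
    "open source licensing": [0.9, 0.1, 0.0],
    "database indexing": [0.1, 0.9, 0.2],
    "community governance": [0.7, 0.2, 0.3],
}
toy_query = [0.8, 0.1, 0.1]

for label, score in rank_by_similarity(toy_query, toy_docs):
    print(f"{score:.3f}  {label}")
```

Texts whose vectors point in nearly the same direction as the query vector rank first, even when they share no words with the query.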

In [None]:
import json

import weaviate
import weaviate.classes as wvc
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm import tqdm

import helper  # local module shipped with this project (logging and tqdm output helpers)

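# Connect to a local Weaviate instance on the HTTP and gRPC ports used by this project
client = weaviate.connect_to_local(
    port=8083,
    grpc_port=50053,
)

# Start from a clean slate by dropping any existing "Podcast" collection
client.collections.delete("Podcast")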

Now, let's create a new collection named 'Podcast' with two text properties, `title` and `transcript`, and confirm that it was created.

In [None]:
# Creating a new collection named Podcast
client.collections.create(
    name="Podcast",
    properties=[
        wvc.Property(
            name="title",
            data_type=wvc.DataType.TEXT,
        ),
        wvc.Property(
            name="transcript",
            data_type=wvc.DataType.TEXT,
        ),
    ],
    # Vectorize objects with the text2vec-transformers module
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(),
)

# Checking if the collection is created successfully
print(client.collections.exists("Podcast"))

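Next, we read the podcast dataset and process the transcripts. Each transcript is long, so we split it into chunks of up to 1,000 characters with a 20-character overlap. A single vector for a whole transcript would blend many topics together, whereas chunk-level vectors let a query match the specific passage that discusses it.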

In [None]:
with open("./data/podcast_ds.json", 'r', encoding = 'utf-8') as f:
    datastore = json.load(f)

podcast = client.collections.get("Podcast")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 20,
    length_function = len,
    add_start_index = True,
)

# Chunk each transcript and insert its chunks in one batch;
# whole transcripts are too long to search well as single vectors
with helper.std_out_err_redirect_tqdm() as orig_stdout:
    for data in tqdm(datastore, desc='Importing transcripts', file = orig_stdout, unit = 'transcript'):
        transcripts_to_add = []
        chunked_transcript = text_splitter.create_documents([data["transcript"]])
        for chunk in chunked_transcript:
            transcripts_to_add.append(
                wvc.DataObject(
                    properties={
                        "title": data["title"],
                        "transcript": chunk.page_content,
                    }
                )
            )
        response = podcast.data.insert_many(transcripts_to_add)
        helper.log(f'{data["title"]} imported')
        # Print only when something failed; response.errors is empty on full success
        if response.errors:
            print(response.errors)
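
To see what `chunk_size` and `chunk_overlap` control, the sketch below splits a string into fixed-size character windows that overlap. This is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs, lines, and words, so its chunks will differ. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split text into (start_index, chunk) pairs of overlapping character windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample: 2,500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))

for start, chunk in chunk_text(sample):
    print(start, len(chunk))
```

The last 20 characters of one chunk reappear as the first 20 of the next, so a sentence cut at a boundary still appears partly intact in at least one chunk. The returned start index plays the same role as the offset recorded by `add_start_index = True`.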

Finally, let's verify by fetching a few objects from our 'Podcast' collection.

In [None]:
print(podcast.query.fetch_objects(limit=2))

