# Topics Extractor
A proof of concept for extracting topics from video transcripts using LangChain and Google's Gemini model

In [1]:
import os
from pathlib import Path
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate
)

Create a `.env` file and store your Gemini API key in it under the name `GOOGLE_API_KEY`. `load_dotenv()` reads that file and exposes the key as an environment variable.

In [2]:
load_dotenv()
llm = GoogleGenerativeAI(model="gemini-pro")
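
A missing key otherwise only surfaces when the model is first called, so it can help to check for it right after `load_dotenv()`. The following is a minimal sketch; `require_env` is a hypothetical helper, not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling `require_env("GOOGLE_API_KEY")` before creating the `llm` object would fail early with a readable message if the `.env` file is missing or incomplete.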

The system template tells the LLM how to process the input and how to format its response. It becomes the system message of the prompt.

In [3]:
sys_template = """
You are an assistant that is mainly responsible for retrieving topics talked about in a video transcript. You only do topic modeling.
- Respond only with the topics, separated by commas.
- Categories of a transcript include:
    - Trends: Current developments or shifts in various fields.
    - Industries: Specific sectors such as technology, healthcare, finance, etc.
    - Themes: Subjects or ideas presented in the content.
    - Business Ideas: Innovative concepts or strategies for entrepreneurship.
    - Highlights: Key points or significant takeaways from the transcript.
    - Innovations: New technologies or methods impacting a field.
    - Subjects: Key components that are implied rather than stated.
    - Opportunities: Potential areas for growth or exploration.
- You can use the words and terminologies that are mentioned in the transcript.
- You can define categories and functions as topics.
- Only pull topics from the transcript. Do not use the examples.
- Only add most accurate topics in descending order.
- Only return up to 3 topics.
- Do not respond with irrelevant topics to the transcript.
- Do not include any other characters, like hyphens or numbers.

% START OF EXAMPLES
    - Tech
    - Social media
    - Gaming
    - Cybersecurity
    - Virtual machines
    - Cloud computing
    - Weight lifting
    - Calisthenics
    - Cuisines
    - Interview tips
    - Life hacks
    - Marketing Strategies
    - Health trends
    - Sports
% END OF EXAMPLES
"""
system_message_prompt_map = SystemMessagePromptTemplate.from_template(sys_template)

The input data. The contents of the three transcript files are read into the `transcripts` list.

In [None]:
transcripts = [Path(f"data/transcript-{num}.txt").read_text() for num in range(1, 4)]
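
The list comprehension above assumes exactly three files with fixed names. If the number of transcripts may change, a glob-based loader is one possible alternative. This is a sketch under assumptions: `load_transcripts` is a hypothetical helper, and the UTF-8 encoding is assumed rather than stated anywhere in this notebook.

```python
from pathlib import Path


def load_transcripts(data_dir: str, pattern: str = "transcript-*.txt") -> list[str]:
    """Read every file in data_dir that matches pattern, in sorted name order.

    Hypothetical helper. Sorting is lexicographic, so "transcript-10.txt"
    would come before "transcript-2.txt".
    """
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir!r}")
    # Assumption: the transcripts are stored as UTF-8 text.
    return [path.read_text(encoding="utf-8") for path in paths]
```

With the layout used here, `load_transcripts("data")` would return the same three strings in the same order.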

The prompt template for the human (user) message. The `{input}` placeholder is filled with the transcript text.

In [5]:
human_template = "Transcript: {input}"
human_message_prompt_map = HumanMessagePromptTemplate.from_template(human_template)
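
To see what the placeholder substitution produces, the same template string can be filled with plain Python string formatting. This only illustrates the resulting text, not how LangChain builds the message object. The sample transcript is made up.

```python
human_template = "Transcript: {input}"

# Made-up sample text standing in for a real transcript.
sample_transcript = "Today we review three budget graphics cards."

filled = human_template.format(input=sample_transcript)
print(filled)
# Transcript: Today we review three budget graphics cards.
```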

Combine `system_message_prompt_map` and `human_message_prompt_map` into a single conversational flow using `ChatPromptTemplate`

In [6]:
chat_prompt = ChatPromptTemplate.from_messages(messages=[system_message_prompt_map, human_message_prompt_map])

Use an output parser so that the model's comma-separated response is returned as a Python list

In [7]:
output_parser = CommaSeparatedListOutputParser()
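
Conceptually, the parser turns a string such as `"Art, Gaming, Hardware"` into a list of items. The plain-Python function below is a rough approximation of that behavior for illustration. It is not the library's implementation, and `split_comma_separated` is a hypothetical name.

```python
def split_comma_separated(text: str) -> list[str]:
    """Split a comma-separated string into trimmed, non-empty items.

    Rough approximation of a comma-separated list parser, for illustration.
    """
    return [item.strip() for item in text.split(",") if item.strip()]


print(split_comma_separated("Art, Gaming, Hardware"))
# ['Art', 'Gaming', 'Hardware']
```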

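Chain the prompt, the model, and the output parser together with the `|` operator. The output of each step becomes the input of the next.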

In [8]:
chain = chat_prompt | llm | output_parser
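
The idea behind the `|` composition can be sketched with a toy class that overloads `__or__`. Everything here is hypothetical: `Step` and the three fake components are stand-ins written for this example, the fake model returns a canned string, and none of it reflects LangChain's actual internals.

```python
class Step:
    """Toy pipeline step: wraps a function and supports `a | b` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Run this step first, then feed its result into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for the prompt, the model, and the parser.
fake_prompt = Step(lambda data: f"Transcript: {data['input']}")
fake_llm = Step(lambda prompt: "Art, Gaming, Hardware")  # canned reply, no API call
fake_parser = Step(lambda text: [item.strip() for item in text.split(",")])

toy_chain = fake_prompt | fake_llm | fake_parser
print(toy_chain.invoke({"input": "some transcript text"}))
# ['Art', 'Gaming', 'Hardware']
```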

Invoke the chain with the first transcript as input

In [9]:
response = chain.invoke({"input": transcripts[0]})
print(response)

['Art', 'Gaming', 'Hardware']
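
The system prompt asks for at most three topics and no list markers, but a model may not always follow such instructions. A small post-processing step is one way to enforce them after parsing. This is a sketch; `clean_topics` is a hypothetical helper, and the exact cleanup rules are assumptions.

```python
import re

# Leading list markers such as "- ", "* ", "1. " or "2) ".
_LIST_MARKER = re.compile(r"^\s*(?:[-*•]\s*|\d+[.)]\s*)")


def clean_topics(topics, max_topics=3):
    """Strip list markers, drop empty and duplicate items, cap the length.

    Hypothetical helper; duplicates are compared case-insensitively.
    """
    cleaned = []
    seen = set()
    for topic in topics:
        text = _LIST_MARKER.sub("", str(topic)).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned[:max_topics]
```

For example, `clean_topics(response)` could be applied to the list returned by `chain.invoke(...)` before it is stored or displayed.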


Check the type of the output

In [10]:
print(type(response))

<class 'list'>


Invoke the chain with the other two files

In [11]:
response = chain.invoke({"input": transcripts[1]})
print(response)

['Exercises', 'Health', 'Fitness']


In [12]:
response = chain.invoke({"input": transcripts[2]})
print(response)

['Salary Negotiation', 'Career', 'Employment']
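
The three invocations above repeat the same call with a different index. A loop is one way to run the chain over every transcript. In this sketch `extract_all` is a hypothetical helper that accepts any object with an `invoke` method, so it is shown here with a stand-in instead of the real chain.

```python
def extract_all(chain, transcripts):
    """Invoke the chain once per transcript and collect (number, topics) pairs.

    Hypothetical helper; `chain` only needs an `invoke(dict)` method.
    """
    results = []
    for number, text in enumerate(transcripts, start=1):
        topics = chain.invoke({"input": text})
        results.append((number, topics))
    return results


class EchoChain:
    """Stand-in for the real chain: returns the first word of the input."""

    def invoke(self, data):
        return [data["input"].split()[0]]


for number, topics in extract_all(EchoChain(), ["art talk", "fitness tips"]):
    print(number, topics)
```

With the real objects from this notebook, the call would be `extract_all(chain, transcripts)`. LangChain runnables also provide a `batch` method, which may be worth considering for larger inputs.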
