In [1]:
#|default_exp app

Let's start by loading the environment variables we need to use.

In [9]:
#|export
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# This is the YouTube video we're going to use: "State of Competitive Intelligence 2023".
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=GTZAoRiZnpQ&t=96s"
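
`os.getenv` returns `None` when a variable is missing, so a missing key only surfaces later as an authentication error. As a minimal sketch (the helper name `require_env` is hypothetical, not part of `dotenv`), we could fail early with a clear message instead:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical message; adjust it to match how you store secrets.
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file.")
    return value
```

After `load_dotenv()`, calling `require_env("OPENAI_API_KEY")` would then either return the key or stop the notebook right away.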

## Setting up the model
Let's define the LLM that we'll use as part of the workflow.

In [10]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

We can test the model by asking a simple question.

In [11]:
model.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

AIMessage(content='The Los Angeles Dodgers won the World Series during the COVID-19 pandemic, defeating the Tampa Bay Rays in the 2020 World Series.')

## String Parsing

In [12]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What is the capital of India?")

'The capital of India is New Delhi.'

## Introducing prompt templates

We want to provide the model with some context and the question. [Prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/quick_start) are a simple way to define and reuse prompts.

In [13]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Klue refers to a competitive intelligence software", question="What is Klue?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: Klue refers to a competitive intelligence software\n\nQuestion: What is Klue?\n'
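
Under the hood, the placeholders behave much like Python's own string formatting. The sketch below uses plain `str.format` rather than LangChain (the `render` helper is hypothetical) to show the substitution step; `ChatPromptTemplate` additionally wraps the result in a human message, which is why the output above starts with `Human:`.

```python
template = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def render(template: str, **values: str) -> str:
    """Fill every {placeholder} in the template; a missing value raises KeyError."""
    return template.format(**values)


rendered = render(
    template,
    context="Klue refers to a competitive intelligence software",
    question="What is Klue?",
)
print(rendered)
```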

## Creating the chain

In [14]:
chain = prompt | model | parser
chain.invoke({
    "context": "Klue refers to a competitive intelligence software, crayon is their biggest competitor",
    "question": "What is Klue?"
})

'Klue is a competitive intelligence software.'

## Transcribing the YouTube Video

The context we want to send the model comes from a YouTube video. Let's download the video and transcribe it using [OpenAI's Whisper](https://openai.com/research/whisper).

In [10]:
YOUTUBE_VIDEO

'https://www.youtube.com/watch?v=GTZAoRiZnpQ&t=96s'

Whisper needs `ffmpeg` to decode the audio. If it isn't installed, one option is `conda install -c conda-forge ffmpeg`.

### Transcription is needed only once

In [12]:
import tempfile
import whisper
from pytube import YouTube


# Let's do this only if we haven't created the transcription file yet.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO)
    audio = youtube.streams.filter(only_audio=True).first()

    # Let's load the base model. This is not the most accurate
    # model but it's fast.
    whisper_model = whisper.load_model("base")

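    with tempfile.TemporaryDirectory() as tmpdir:
        audio_path = audio.download(output_path=tmpdir)
        # fp16=False runs in full precision, which avoids a warning on CPU.
        transcription = whisper_model.transcribe(audio_path, fp16=False)["text"].strip()

        # Cache the text so we don't download and transcribe again.
        with open("transcription.txt", "w") as out_file:
            out_file.write(transcription)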

Let's read the transcription and display the first few characters to ensure everything works as expected.

In [13]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

'Welcome everyone. Thank you for for joining us today for the State of competitive intelligence in 20'

In [14]:
len(transcription)

59810

## Using the entire transcription as context

If we try to invoke the chain using the transcription as context, the model will return an error because the context is too long.

Large Language Models support limited context sizes. The video we are using is too long for the model to handle, so we need to find a different solution.

In [16]:
try:
    chain.invoke({
        "context": transcription,
        "question": "What is the best tip for a company trying to enable their sales orgs around compete?"
    })
except Exception as e:
    print(e)
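
To get a feel for the problem before calling the model, we can estimate the token count. The sketch below assumes the common rule of thumb of roughly 4 characters per token for English text; the real count depends on the model's tokenizer, and `context_window` is a placeholder you would look up for your model. Both helper names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; real counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserved_for_answer: int = 500) -> bool:
    """Check whether the text plus room for the answer fits in the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


sample = "x" * 59810  # same length as the transcription above
print(estimate_tokens(sample))
```

A check like this is only a heuristic, but it is cheap enough to run before every request.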

## Splitting the transcription

In [15]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

[Document(page_content="Welcome everyone. Thank you for for joining us today for the State of competitive intelligence in 2023. My name is Connor. I'm on the content team here at crayon and I am joined today. I'm very happy today to be joined by Mimi in August who run competitive intelligence at Akamai and Deltech respectively. I am also joined today by Sheila Leihar who is our senior director of content here at crayon. She will be monitoring the the Q&A section. So periodically throughout today's session, I will I'll throw it over to Sheila and she will show chime in with some questions that we're getting from all of you. So please throughout this session, any questions that come around please put them in that Q&A section and you know if it aligns with what we're talking about, we will we'll discuss it in real time and then we also have some time set aside the end for questions that we aren't able to get to. We also have a poll question that we will push lives in just a minute. But fi

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size. Check [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) for more information about different approaches to splitting documents.

For illustration purposes, let's split the transcription into chunks of 1000 characters with an overlap of 20 characters and display the first few chunks:

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)
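
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size splitting in plain Python. It is not how `RecursiveCharacterTextSplitter` is implemented (that splitter tries to break on natural separators such as paragraphs and spaces first); the function name `split_fixed` is hypothetical.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_fixed("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means the last characters of one chunk are repeated at the start of the next, so a sentence cut at a boundary still appears partly intact in a neighboring chunk.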

In [17]:
documents

[Document(page_content="Welcome everyone. Thank you for for joining us today for the State of competitive intelligence in 2023. My name is Connor. I'm on the content team here at crayon and I am joined today. I'm very happy today to be joined by Mimi in August who run competitive intelligence at Akamai and Deltech respectively. I am also joined today by Sheila Leihar who is our senior director of content here at crayon. She will be monitoring the the Q&A section. So periodically throughout today's session, I will I'll throw it over to Sheila and she will show chime in with some questions that we're getting from all of you. So please throughout this session, any questions that come around please put them in that Q&A section and you know if it aligns with what we're talking about, we will we'll discuss it in real time and then we also have some time set aside the end for questions that we aren't able to get to. We also have a poll question that we will push lives in just a minute. But fi

## Finding the relevant chunks

Let's generate embeddings for an arbitrary query:

In [18]:
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("What is competitive intelligence?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[-0.02676220181819881, -0.0055145550288614635, 0.01305554314016356, -0.0170751880981857, -0.00973352210576431, 0.024915158223655663, -0.025260649089236004, 0.005285335593320399, -0.018882367362084963, -0.05511897313686283]


To illustrate how embeddings work, let's first generate the embeddings for two different sentences:

In [19]:
sentence1 = embeddings.embed_query("Its very hard to grow a start up")
sentence2 = embeddings.embed_query("Sales enablement is one of the primary use cases of competitive enablement")

We can now compute the similarity between the query and each of the two sentences. The closer the embeddings are, the more similar the sentences will be.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to calculate the similarity between the query and each of the sentences:

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity

(0.741941282185766, 0.83047198279227)
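
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal plain-Python sketch of the same calculation (the function name `cosine` is ours, not scikit-learn's) looks like this:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values close to 1 mean the vectors point in nearly the same direction, which is why the second sentence scores higher against the query than the first.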

## Setting up a Vector Store

We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use a **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches. 

<img src='images/system4.png' width="1200">

To understand how a vector store works, let's create one in memory and add a few embeddings to it:
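
As a minimal sketch of the idea in plain Python, the toy store below keeps each text next to its vector and answers queries by brute-force cosine similarity. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model such as `OpenAIEmbeddings`, and `ToyVectorStore` is not a LangChain class.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and searches them exhaustively."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def similarity_search(self, query, k=2):
        query_vec = toy_embed(query)
        scored = [(cosine(query_vec, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add_texts([
    "Klue is a competitive intelligence software",
    "Battle cards help sales teams compete",
    "The Dodgers won the World Series in 2020",
])
print(store.similarity_search("what is competitive intelligence software", k=1))
# → ['Klue is a competitive intelligence software']
```

A production vector store does the same job with learned embeddings and an index that avoids comparing the query against every stored vector.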

Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) and [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."
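
Conceptually, the map step calls every branch with the same input and collects the results under the given keys. The sketch below mimics that behavior in plain Python without using LangChain; `run_parallel_map`, `passthrough`, and `fake_retriever` are hypothetical names, and the branches run one after another here for simplicity.

```python
def run_parallel_map(branches: dict, value):
    """Call every branch with the same input and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def passthrough(value):
    """Return the input unchanged, like a pass-through step."""
    return value


def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever that returns canned chunks for any question."""
    return ["Klue refers to a competitive intelligence software"]


prompt_inputs = run_parallel_map(
    {"context": fake_retriever, "question": passthrough},
    "What is Klue?",
)
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys the prompt template expects, so it can be handed to the prompt as the next step of the chain.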

## Loading transcription into the vector store

We initialized the vector store with a few random strings. Let's create a new vector store using the chunks from the video transcription.

In [35]:
len(documents)

62

In [48]:
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

Let's set up a new chain using the correct vector store. This time we are using a different equivalent syntax to specify the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) portion of the chain:

## Setting up Pinecone

So far we've used an in-memory vector store. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use [Pinecone](https://www.pinecone.io/).

The first step is to create a Pinecone account, set up an index, get an API key, and set it as an environment variable `PINECONE_API_KEY`.

Then, we can load the transcription documents into Pinecone:

In [46]:
from langchain_pinecone import PineconeVectorStore

index_name = "competitive-intelligence-index"

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

Let's now run a similarity search on Pinecone to make sure everything works:

In [47]:
pinecone.similarity_search("What is the best way for a company to get started with competitive intelligence?")[:3]

[Document(page_content="16% of CI leaders said yes and this year 36% of CI leaders said yes that is a 125% increase you know since 2018 so folks are getting better and better at measuring the impact of competitive intelligence so maybe we'll we'll start with you if that's okay I'm curious for folks on the phone who do not have KPIs for their CI program any advice you have for those folks to to get started yeah I'm start simple and our our KPIs frankly are very very simple too at least we're about two years into the journey here at Alka Mai so a couple of KPIs some are hard numbers and others are a little bit more aspirational the hard numbers is number of battle cards right we started with technically zero there were existing battle cards that predated my my arrival that we converted but we wanted to have a good baseline again bait using the tears we wanted the tier one competitors covered and over time we will build more right to address all of the others so that's one KPI that you kn

Let's set up the new chain using Pinecone as the vector store:

In [49]:
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What is the best way for a company to get started with competitive intelligence?")

'Start simple with a few key performance indicators (KPIs) that are easy to track and measure, such as the number of battle cards. Build from there over time to cover more competitors and focus on building trust with stakeholders.'
