In [1]:
from IPython.display import Markdown, display

display(Markdown("transcripts.md"))

###  Board and Committee Meetings   
   
We start with the [videos recorded by the local public access cable channel](https://acmi.tv/programs/government/) for government meetings and create a transcript using [OpenAI Whisper](https://github.com/openai/whisper).  We then ask a series of questions, one for each published meeting agenda item plus follow-up questions that depend on the summary of each item, using LangChain's text splitters and QA retrieval pipelines.  The raw transcript is further organized by agenda item and speaker in markdown format; this step is to be automated.  The raw transcript and markdown are saved as text columns in the `governance.meetings` table, while the Q&A is stored as a dictionary.


This [notebook](transcripts%20ETL.ipynb) shows the components and assumes the *polis* PostgreSQL database is available.  Adapting it for standalone use should be straightforward.


#### Raw Transcript

The raw transcript produced by [OpenAI Whisper](https://github.com/openai/whisper) is about 25,000 words (125,000 characters) for a 3-hour meeting.  The OpenAI API calls to create the transcripts cost about \$0.90 to \$1.00 per meeting.
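
The per-meeting figure can be sanity-checked with simple arithmetic. The sketch below assumes per-minute billing at 0.006 USD per audio minute; that rate is our assumption (the 2023 list price), not a number taken from this notebook, and `whisper_cost` is a hypothetical helper.

```python
# Assumption: transcription is billed per minute of audio at this rate (USD).
WHISPER_USD_PER_MINUTE = 0.006

def whisper_cost(minutes: float, rate: float = WHISPER_USD_PER_MINUTE) -> float:
    """Estimated transcription cost in USD for `minutes` of audio."""
    return round(minutes * rate, 2)

# Meetings of roughly two and a half to three hours
for minutes in (150, 165, 180):
    print(f"{minutes} min -> {whisper_cost(minutes):.2f} USD")
```

Under that assumption, meetings of two and a half to three hours land close to the range quoted above.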

Meetings held via Zoom are transcribed less accurately than in-person meetings, and remote participants are sometimes garbled.

The raw transcript is one long paragraph, so our choice of separators for LangChain's `RecursiveCharacterTextSplitter` bears scrutiny.
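
Because there are no paragraph or line breaks, the sentence boundary is the first separator that can actually apply. The following is a minimal standard-library sketch of that idea, not LangChain's implementation: `split_sentences` and `pack` are hypothetical helpers, and chunk overlap is left out for brevity.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split after every '. ' with a zero-width lookbehind, keeping the period."""
    return [s for s in re.split(r"(?<=\. )", text) if s]

def pack(sentences: list[str], chunk_size: int) -> list[str]:
    """Greedily join whole sentences into chunks of about chunk_size characters.

    A single sentence longer than chunk_size becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current:
        chunks.append(current.strip())
    return chunks

sample = (
    "The meeting is called to order. "
    "The first item is a special permit. "
    "We will now vote."
)
print(pack(split_sentences(sample), chunk_size=40))
print(pack(split_sentences(sample), chunk_size=80))
```

With a small chunk size every sentence ends up alone; with a larger one, neighbouring sentences are kept together, which is what we want for retrieval.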

We store the raw transcripts for further processing, either internally or by agents.


#### 20 Questions

We use the OpenAI chat LLM and LangChain's RetrievalQA to ask questions.  We prepare the questions by asking for a summary of the entire meeting, a summary of each agenda item, and follow-up questions where appropriate.  The chat model does poorly on questions such as "list everyone who spoke in the meeting" because of context limitations, but does very well at answering detailed questions about a topic.  We also ask predetermined questions, such as the results of any expected votes.  

These questions can all be prepared in advance of the meeting.
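
As a sketch of how such a list could be prepared from a published agenda, the hypothetical `build_questions` helper below produces one meeting summary question, then a summary question and a vote question per item. The agenda items and the question wording are illustrative assumptions only.

```python
def build_questions(authority: str, agenda_items: list[str]) -> list[str]:
    """One meeting summary, then a summary and a vote question per agenda item."""
    questions = [f"Please, summarize the {authority} meeting."]
    for item in agenda_items:
        questions.append(f"Summarize the discussion of agenda item '{item}'.")
        questions.append(f"Was a vote taken on '{item}'? If so, what was the result?")
    return questions

# Hypothetical agenda items, for illustration only
agenda = ["Approval of minutes", "Special permit hearing"]
for q in build_questions("zoning-board-of-appeals", agenda):
    print(q)
```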

Answers to questions put to the OpenAI gpt-4 chat model cost about \$0.20 each as of summer 2023.  


#### Attribution Markdown

Currently, attribution is done manually, using the published meeting agenda items as the top headers and the speakers as the sub-headers, which makes a collapsible presentation possible.  Attribution also enables analytics on the speakers and on the independent topics of a meeting, so the effort is worthwhile.  A 25,000-word wall of text is useful as input to other processes, but expanding each agenda item (topic) and arranging the speakers in order, each with their portion of the transcript, is easier for individuals to consume.  In addition, the raw transcript is improved by associating text with speakers.
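
To make the target format concrete, here is a small sketch that turns attributed segments into markdown and counts words per speaker. The `(agenda item, speaker, text)` layout, the header levels and both helper functions are assumptions for illustration; the notebook does not prescribe them.

```python
from collections import Counter

# Hypothetical attributed segments: (agenda item, speaker, text)
segments = [
    ("Approval of minutes", "Chair", "Is there a motion to approve the minutes?"),
    ("Approval of minutes", "Member A", "So moved."),
    ("Special permit hearing", "Chair", "We will now open the hearing."),
]

def to_markdown(segments):
    """Agenda items become top headers, speakers become sub-headers."""
    lines, current_item = [], None
    for item, speaker, text in segments:
        if item != current_item:
            lines.append(f"## {item}")
            current_item = item
        lines.append(f"### {speaker}")
        lines.append(text)
    return "\n\n".join(lines)

def words_by_speaker(segments):
    """Simple analytics: number of words attributed to each speaker."""
    counts = Counter()
    for _, speaker, text in segments:
        counts[speaker] += len(text.split())
    return counts

print(to_markdown(segments))
print(words_by_speaker(segments))
```

Once the transcript is in this shape, per-speaker and per-topic statistics come from a single pass over the segments.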



### Set-up

In [None]:
from os import environ
from sqlalchemy import create_engine
import openai

##set-up
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(usecwd=True),override=True) # read local .env file

openai.api_key  = environ.get('OPENAI_API_KEY','')

username     =  environ.get("POSTGRES_USERNAME", "postgres")
password     =  environ.get("POSTGRES_PASSWORD", "postgres")
ipaddress    =  environ.get("POSTGRES_IPADDRESS", "localhost")
port         =  environ.get("POSTGRES_PORT", "5432")
dbname       =  environ.get("POSTGRES_DBNAME", "ArlingtonMA")

cnx= create_engine(f'postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}')

### Extract

    1. Get YouTube video links from the hosting agent
    2. Use Whisper to create the video transcript

In [None]:
## For ArlingtonMA the local public cable company is ACMI

def extract_acmi_video_urls():
    
    from pandas import DataFrame, to_datetime
    import requests
    import re
    
    stub = "https://acmi.tv/programs/government/"
    
    urls = {
        "school-committee":"School Committee Meeting - ",
        "select-board-meetings":"Select Board Meeting - ",
        "redevelopment-board-meetings":"Redevelopment Board Meeting - ",
        "zoning-board-of-appeals":"Zoning Board of Appeals Meeting - ",
        "finance-committee":"Finance Committee Meeting - ",
        "town-meeting":"Annual Town Meeting - ",
    }
    
    rows = []
    for url in urls.keys():
        
        response = requests.get(stub+url)
    
        if response.status_code == 200:
            # Extract hyperlinks and labels using regular expressions
            links_and_labels = re.findall(r'href=["\'](https?://[^"\']+)["\'][^<]*<span>(.*?)<\/span>', response.text)
    
            for link, label in links_and_labels:
                date = to_datetime(label.replace(urls[url],'').replace("Special & ","")).date().strftime('%Y-%m-%d')
                rows.append({
                    "dor" : 10,
                    "authority": url, 
                    "date": date,
                    "video": link.replace("https://www.youtube.com/watch?v=",""),
                })
        else:
            print(f"Failed to fetch {stub+url} (status {response.status_code}).")

    df  =  DataFrame(rows)[['dor','authority','date','video']]

    ## substitute integer keys for string value
    x   =  list(df.authority.sort_values().unique())
    x   =  dict(zip(x,range(len(x))))
    df['authority'] = df['authority'].map(x)  # map only this column, not the whole frame
    df  =  df.sort_values(['authority','date'])

    ## update to common.int_value_pairs; s/b only done once
    ivp = DataFrame([x]).T.reset_index().rename(columns={"index":"value",0:"key"})
    ivp['item']='authority'

    return df, ivp

df, int_value_pairs = extract_acmi_video_urls()
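
The link-and-label pattern can be tried without a network call. The sketch below runs the same regular expression against a made-up HTML fragment; the fragment, the video id and the `%B %d, %Y` label format are assumptions, and the real ACMI markup may differ.

```python
import re
from datetime import datetime

# Hypothetical HTML fragment; the real page markup may differ.
html = (
    '<a href="https://www.youtube.com/watch?v=abc123XYZ_0" target="_blank">'
    '<span>Zoning Board of Appeals Meeting - August 1, 2023</span></a>'
)
pattern = r'href=["\'](https?://[^"\']+)["\'][^<]*<span>(.*?)</span>'
prefix = "Zoning Board of Appeals Meeting - "

rows = []
for link, label in re.findall(pattern, html):
    # Strip the meeting-name prefix, then parse what is left as a date
    date = datetime.strptime(label.replace(prefix, ""), "%B %d, %Y").date()
    video = link.replace("https://www.youtube.com/watch?v=", "")
    rows.append((date.isoformat(), video))

print(rows)
```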

In [None]:
## Read video links from db
from pandas import read_sql_query, to_datetime

query = """
            select * from governance.meetings m 
            left join common.int_value_pairs ivp 
                on ivp.item='authority' and ivp.key=m.authority 
            where transcript is null;
        """
meetings = read_sql_query(query,cnx)

data_dir   =   "./meetings/"
authority  =   "zoning-board-of-appeals"
date       =   "2023-08-01"

mask       =  (
    meetings.value == authority
) & ( 
    meetings.date == to_datetime(date).date()
)

url_video       =  "https://www.youtube.com/watch?v=" + meetings[mask]['video'].iloc[0]
save_directory  =  data_dir + authority

In [None]:
## Whisper transcript
import time
start = time.time()

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

loader = GenericLoader(
    YoutubeAudioLoader([url_video],save_directory),
    OpenAIWhisperParser()
)
docs = loader.load()

print('elapsed',time.time()-start)

### Transform

    1. Save raw transcript and split documents
    2. Apply LangChain's RecursiveCharacterTextSplitter to the raw transcript
    3. Create embeddings, store in Chroma vectorstore
    4. Connect to LLM (gpt-4), create Q&A prompt template
    5. RetrievalQA, ask questions

In [None]:
from os import makedirs
from json import dump, loads

working_dir = save_directory+'/'+date
makedirs(working_dir,exist_ok=True)

txt = ' '.join([d.page_content for d in docs])

with open(working_dir+'/transcript.txt', 'w') as f:
    f.write(txt)

makedirs(working_dir+'/docs',exist_ok=True)

for idx, d in enumerate(docs):
    with open(working_dir+f'/docs/d_{idx}.json', "w") as json_file:
        dump(loads(d.json()), json_file)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

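r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=7000,
    chunk_overlap=1000,
    # raw string avoids the invalid "\." escape; LangChain releases that accept
    # is_separator_regex treat separators as literal text unless it is set to True
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)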

loader = TextLoader(working_dir+'/transcript.txt')
documents = loader.load_and_split(r_splitter)
len(documents)
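
As a rough cross-check on the count printed above, the number of chunks can be estimated from the transcript length alone. This is a lower-bound sketch that assumes every chunk is completely full, which the splitter does not guarantee; `estimate_chunks` is a hypothetical helper.

```python
from math import ceil

def estimate_chunks(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Lower-bound estimate of the number of chunks for a text of n_chars characters."""
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters covered by each later chunk
    return 1 + ceil((n_chars - chunk_size) / stride)

# About 125,000 characters for a 3-hour meeting, with the settings used above
print(estimate_chunks(125_000, 7_000, 1_000))
```

Assuming roughly four characters per token, a 7,000-character chunk is on the order of 1,750 tokens, so the number of chunks the retriever returns per question matters for the model's context limit.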

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

from langchain.vectorstores import Chroma

persist_directory = f'chroma/transcripts/{authority}'

!rm -rf ./{persist_directory}  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=documents,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

In [None]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# Build prompt
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten sentences maximum. Keep the answer as concise as possible. 
Try to include the name of everyone who spoke.
{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever   =  vectordb.as_retriever(),
    chain_type  =  "stuff",
    chain_type_kwargs = { "prompt" : QA_CHAIN_PROMPT }
)

In [None]:
from IPython.display import display, Markdown

question  =  f"Please, summarize the {authority} meeting."
result    =  qa_chain({"query": question})
display(Markdown(result['result']))
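
The same call can be repeated for a prepared list of questions, collecting the answers in a dictionary. In the sketch below, `ask_all` and `fake_ask` are hypothetical; `fake_ask` stands in for the retrieval chain so the example runs without API calls.

```python
import json

def ask_all(questions, ask):
    """Return {question: answer}; `ask` is any callable mapping a question to text."""
    return {q: ask(q) for q in questions}

# Stand-in for the retrieval chain so the sketch runs without API calls.
# With the chain above, one could pass: lambda q: qa_chain({"query": q})["result"]
def fake_ask(question):
    return f"(answer to: {question})"

questions = [
    "Please, summarize the zoning-board-of-appeals meeting.",
    "Were any votes taken, and what were the results?",
]
qa = ask_all(questions, fake_ask)
print(json.dumps(qa, indent=2))
```

Keeping the answers keyed by question means the dictionary serializes directly to JSON.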

### Load