# Hawthorne Media Analyzer

This program ingests short-form media and analyzes the content.

### Contents:

Setup<br>
Data Management<br>
Audio Transcription<br>
Video Sampling<br>
Analysis<br>

## Setup

We recommend using a dedicated environment for this program. See the README file for instructions on preparing the environment. Package installations are controlled via the requirements.txt file.
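As a quick sanity check after preparing the environment, the sketch below reports which packages cannot be imported. The helper name `missing_packages` and the checklist are our own illustration (the names are taken from the import cell in this notebook), not something defined in the README. Note that a module's import name can differ from its pip name.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

# Hypothetical checklist: top-level modules imported by this notebook
required = ["pandas", "pydub", "torch", "torchaudio", "transformers",
            "yt_dlp", "openai", "tiktoken", "langchain", "langchain_openai"]
print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, re-run the installation from requirements.txt inside the active environment.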

In [1]:
# If using Anaconda, uncomment and run this code.
# conda install -c conda-forge libsndfile

In [2]:
#$ conda update -n base -c defaults conda

In [3]:
# The ffmpeg package is included in the folder. Alternatively, you may download FFmpeg from the official site (https://ffmpeg.org/download.html), extract it, and update the path in ffmpegPath.txt to point to the folder where you placed it.

In [4]:
# Package imports:

import os
import re
import pandas as pd

from pydub import AudioSegment
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

import tempfile
# import streamlit as st
# from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled # Replace with other transcription package
import yt_dlp
import openai
import tiktoken
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document



In [5]:
with open('ffmpegPath.txt', 'r') as file:
    ffmpeg_path = file.read().strip()  # drop any trailing newline so the path can be joined safely
    print(ffmpeg_path)

C:\Users\matth\Desktop\MSBA\Capstone Project\ffmpeg-7.1.1-full_build\bin


In [6]:
# Add ffmpeg to PATH manually
os.environ["PATH"] += os.pathsep + ffmpeg_path

In [7]:
AudioSegment.converter = os.path.join(ffmpeg_path, "ffmpeg.exe")
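If pydub later fails to find the converter, a wrong or stale entry in ffmpegPath.txt is a likely cause. The sketch below checks the folder before it is used. `resolve_ffmpeg` is a hypothetical helper of ours, and it assumes the executable is named `ffmpeg.exe` on Windows and `ffmpeg` elsewhere.

```python
import os
import platform

def resolve_ffmpeg(ffmpeg_dir: str) -> str:
    """Return the full path to the ffmpeg executable inside ffmpeg_dir.

    Raises FileNotFoundError with a readable message when the file is absent.
    """
    # Assumption: standard executable names for each platform
    exe_name = "ffmpeg.exe" if platform.system() == "Windows" else "ffmpeg"
    # strip() guards against a trailing newline read from the text file
    candidate = os.path.join(ffmpeg_dir.strip(), exe_name)
    if not os.path.isfile(candidate):
        raise FileNotFoundError(
            f"ffmpeg not found at {candidate!r}; check the folder listed in ffmpegPath.txt"
        )
    return candidate

# Possible usage in this notebook:
# AudioSegment.converter = resolve_ffmpeg(ffmpeg_path)
```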

### Setting up your API Key
This program uses the OpenAI API. You will need a key to access the API. This key is linked to the organization's account, and API calls will be charged against the account. To protect the key, we recommend storing it as an environment variable using the steps below. This step only needs to be done once for the computer or environment in which you are working; you do not need to repeat it unless you set up a new account with the LLM provider. Note that the name of the variable is case sensitive and must match exactly.<br><br>

##### If you use Anaconda:
&emsp; Launch Anaconda Prompt (this is different from the Windows command prompt)<br>
&emsp; run: conda activate base <br>
&emsp; then run: setx AI_API_Key "your_API_Key"<br>

##### For Windows:
&emsp; From the Windows start menu, select "Settings" <br>
&emsp; Go to "System" \> "Advanced System Settings" <br>
&emsp; From the "Advanced" tab, select the "Environment Variables" button <br>
&emsp; Select "New" <br>
&emsp; Name the variable "AI_API_Key" <br>
&emsp; Enter the key as the variable value; click "OK"

In [8]:
# --- Read the OpenAI API key from the environment variable set up above ---
api_key = os.environ.get('AI_API_Key')

In [9]:
os.environ["OPENAI_API_KEY"] = api_key

In [10]:
# Uncomment for troubleshooting API key errors
# print(api_key)
# os.getenv('OPENAI_API_KEY')
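When the variable is missing, `os.environ.get` returns `None` and the assignment to `os.environ["OPENAI_API_KEY"]` fails with a `TypeError` that does not point to the cause. The sketch below fails earlier with a clearer message. `load_api_key` is a hypothetical helper of ours; the variable name `AI_API_Key` comes from the setup steps above.

```python
import os

def load_api_key(var_name: str = "AI_API_Key") -> str:
    """Read the API key from an environment variable, or raise a clear error."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        # setx only affects processes started afterwards, so a running
        # Jupyter session will not see a newly created variable.
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set. "
            "Create it, then restart the terminal and Jupyter."
        )
    return value

# Possible usage in this notebook:
# api_key = load_api_key()
# os.environ["OPENAI_API_KEY"] = api_key
```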

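#### Alternate method
# Note: this block runs only inside a Streamlit app and requires `import streamlit as st` (commented out in the package imports above).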
uploaded_file = st.file_uploader(
    label="Upload a .txt file containing your OpenAI API key:",
    type=["txt"],
    help="File should contain only the API key text."
)
if uploaded_file:
    try:
        api_key = uploaded_file.read().decode("utf-8").strip()
    except Exception:
        st.error("Failed to read API key from the uploaded file.")
if not api_key:
    api_key = st.text_input(
        "Or enter your OpenAI API key manually:",
        type="password"
    ).strip() or None
if not api_key:
    st.warning("Please provide your OpenAI API key by upload or manual entry to proceed.")
    st.stop()

##### Set the OpenAI key for the library and API
openai.api_key = api_key
os.environ["OPENAI_API_KEY"] = api_key

## Data Management

In this section, we ingest the media from a list and set up a dataframe to store the dimension values generated during the analysis.
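One possible layout for the stored dimension values is one row per creative and one column per dimension. The sketch below collects values in a plain dictionary that can be converted to a dataframe at the end. The helper `record_dimension` and the dimension name `Orientation` are hypothetical; the notebook does not define them.

```python
def record_dimension(results: dict, creative_url: str, dimension: str, value) -> dict:
    """Store one dimension value for one creative, keyed by the creative's URL."""
    # setdefault creates the per-creative record the first time a URL is seen
    results.setdefault(creative_url, {})[dimension] = value
    return results

results = {}
record_dimension(
    results,
    "https://s3.amazonaws.com/YM_Ads/RWQy_qJW4LArera54bPNtg.mp4",
    "Orientation",   # hypothetical dimension name
    "Solution",      # placeholder value for illustration only
)
print(results)

# Conversion to a dataframe (one row per creative), assuming pandas is imported as pd:
# dimensions = pd.DataFrame.from_dict(results, orient="index")
```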

In [11]:
# Read in the list of creatives to analyze
media_list = pd.read_csv("WindowNation_Pathmatics_2025.csv")

In [12]:
# Choose a specific creative from the list.
# Internal team note: Replace this with a loop for batch processing later in the project.
media_list.loc[6,] # choose a video from the list

Advertiser                                       Andersen Corporation
Date                                                         1/1/2025
Device                                                  Desktop Video
Type                                                            Video
First Seen                                                  7/17/2023
Last Seen                                                    1/9/2025
Link to Creative    https://s3.amazonaws.com/YM_Ads/RWQy_qJW4LArer...
Spend                                                        309.5686
Impressions                                                     62594
Name: 6, dtype: object

In [13]:
video_url = media_list.loc[6,'Link to Creative']

In [14]:
video_url # (optional) check that the url has been correctly identified

'https://s3.amazonaws.com/YM_Ads/RWQy_qJW4LArera54bPNtg.mp4'
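The internal team note above mentions replacing the single selection with a loop for batch processing. A minimal sketch of that loop follows. It assumes the column names `Type` and `Link to Creative` shown in the cell output, and `iter_video_urls` is a hypothetical helper that works on plain row dictionaries.

```python
def iter_video_urls(records):
    """Yield the creative URL of every row whose Type is 'Video'."""
    for row in records:
        url = row.get("Link to Creative")
        # Missing links are read by pandas as NaN (a float), so check the type
        if row.get("Type") == "Video" and isinstance(url, str) and url.strip():
            yield url.strip()

# Small stand-in for the rows of media_list
sample_rows = [
    {"Type": "Video", "Link to Creative": "https://example.com/a.mp4"},
    {"Type": "Display", "Link to Creative": "https://example.com/b.png"},
    {"Type": "Video", "Link to Creative": float("nan")},
]
for url in iter_video_urls(sample_rows):
    print(url)

# Possible usage in this notebook:
# for video_url in iter_video_urls(media_list.to_dict("records")):
#     ...download, transcribe, and analyze each creative...
```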

## Audio Transcription

In [15]:
# This step downloads the audio from the creative.

def download_audio(video_url: str) -> str:
    """
    Use yt-dlp to download the best audio stream to a local file.
    """
    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': 'audio.%(ext)s'
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=True)
        filename = ydl.prepare_filename(info)
    return filename

In [16]:
raw_audio_path = download_audio(video_url)

[generic] Extracting URL: https://s3.amazonaws.com/YM_Ads/RWQy_qJW4LArera54bPNtg.mp4
[generic] RWQy_qJW4LArera54bPNtg: Downloading webpage
[info] RWQy_qJW4LArera54bPNtg: Downloading 1 format(s): mp4
[download] Destination: audio.mp4
[download] 100% of    3.71MiB in 00:00:07 at 526.63KiB/s   


In [17]:
# Convert file to wav

def convert_to_wav(audio_path: str) -> str:
    """
    Convert any audio file (mp4/m4a/webm) to WAV (mono, 16kHz) for transcription.
    """
    audio = AudioSegment.from_file(audio_path)
    audio = audio.set_channels(1).set_frame_rate(16000)

    wav_path = os.path.join(tempfile.gettempdir(), "audio.wav")
    audio.export(wav_path, format="wav")

    os.remove(audio_path)
    return wav_path


In [18]:
wav_path = convert_to_wav(raw_audio_path)
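Before transcription, it can help to confirm that the exported file really is mono at 16 kHz. The sketch below reads the WAV header with the standard-library `wave` module. `describe_wav` is a hypothetical helper, and it assumes the file is an uncompressed PCM WAV.

```python
import wave

def describe_wav(path: str) -> dict:
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_s": frames / sample_rate,
    }

# Possible usage in this notebook:
# info = describe_wav(wav_path)
# assert info["channels"] == 1 and info["sample_rate"] == 16000
```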

#### Alternative to pydub (conversion to WAV)

import subprocess  # needed for the ffmpeg call below

@st.cache_data(show_spinner=False)
def convert_to_wav(audio_path: str) -> str:
    """
    Convert any audio file (mp4/m4a/webm) to WAV (mono, 16kHz) for transcription.
    """
    wav_path = "audio.wav"
    subprocess.run([
        "ffmpeg", "-y", "-i", audio_path,
        "-ac", "1", "-ar", "16000", wav_path
    ], check=True)
    os.remove(audio_path)
    return wav_path


#### Alternative to TorchAudio (for transcription)
def transcribe_with_openai(wav_path: str) -> str:
    """
    Use the OpenAI Whisper API to transcribe the given WAV audio file.
    """
    # openai>=1.0 removed openai.Audio.transcribe; use a client object instead.
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    # Open file in binary mode
    with open(wav_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    os.remove(wav_path)
    return transcription.text


transcribe_with_openai("C:\\Users\\ttesno\\AppData\\Local\\Temp\\audio.wav")

In [19]:
# transcription with torchaudio and Wav2Vec

def transcribe_with_torchaudio(wav_path: str) -> str:
    """
    Transcribe a WAV audio file using torchaudio and a pretrained Wav2Vec2 model.
    """
    # Load pretrained model and processor
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Load audio
    waveform, sample_rate = torchaudio.load(wav_path)

    # Resample if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Preprocess
    input_values = processor(waveform.squeeze().numpy(), return_tensors="pt", sampling_rate=16000).input_values

    # Inference
    with torch.no_grad():
        logits = model(input_values).logits

    # Decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    
    # Clean up
    os.remove(wav_path)
    return transcription
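The decode step turns per-frame argmax IDs into text. As we understand it, this is greedy CTC decoding: collapse consecutive repeats, drop the blank token, then map the remaining IDs to characters. The sketch below shows that logic on a toy vocabulary. The vocabulary, the blank ID of 0, and the `|` word delimiter are assumptions for illustration, not values read from the model.

```python
from itertools import groupby

def ctc_greedy_decode(ids, vocab, blank_id=0, word_delimiter="|"):
    """Collapse repeated IDs, remove blanks, and join characters into words."""
    # 1. Collapse runs of the same ID (one sound spans several audio frames)
    collapsed = [token_id for token_id, _ in groupby(ids)]
    # 2. Drop the blank token, which separates genuine repeated letters
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Replace the word delimiter with spaces and tidy the whitespace
    return " ".join("".join(chars).replace(word_delimiter, " ").split())

# Toy vocabulary (hypothetical; the real model has its own mapping)
toy_vocab = {0: "<pad>", 1: "|", 2: "A", 3: "B", 4: "E"}
frame_ids = [2, 2, 0, 1, 1, 3, 3, 0, 4, 4, 0, 4]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # prints "A BEE"
```

The blank between the two runs of `4` is what lets the double "E" survive the collapse.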

In [20]:
transcript = transcribe_with_torchaudio(wav_path)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
transcript

'TRANSFORM YOUR HOME WITH REPLACEMENT WINDOWS FROM RUNOLBAI ANDERSAN BEAUTIFUL NEW WINDOWS BRIGHTEN ANY ROOM AND REDUCE EETING AND COOLING CAUSE THEIR GORGEOUS ENERGY EFFICIENT AND HAVE UNMATCHED YOUR ABILITY RUNOLBI ANDERSON A BETTER WAY TO A BETTER WINDOW'

#### Chunking and Embedding

In [22]:

def chunk_transcript(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """
    Split transcript into overlapping chunks for embedding.
    """
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", " ", ""]
    )
    return splitter.split_text(text)


def build_faiss_index(chunks: list[str]) -> FAISS:
    """
    Create a FAISS index from text chunks using OpenAI embeddings.
    """
    embeddings = OpenAIEmbeddings()
    return FAISS.from_texts(chunks, embeddings)


def make_qa_chain(store: FAISS) -> RetrievalQA:
    """
    Build a RetrievalQA chain for question-answering over the vector store.
    """
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=store.as_retriever(search_kwargs={"k": 5})
    )


def make_summary_chain():
    """
    Build a summarization chain (map-reduce style).
    """
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    return load_summarize_chain(llm, chain_type="map_reduce")
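To make the `size` and `overlap` parameters concrete, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); it only illustrates how consecutive chunks share `overlap` characters. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, size: int = 1000, overlap: int = 200) -> list:
    """Split text into fixed-size character windows that overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

print(sliding_chunks("abcdefghij", size=4, overlap=1))  # prints ['abcd', 'defg', 'ghij']
```

A short-form ad transcript such as the one above is well under 1,000 characters, so with the default settings it fits in a single chunk.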

In [23]:
chunks = chunk_transcript(transcript)

In [24]:
store = build_faiss_index(chunks)
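Conceptually, the vector store answers a query by returning the chunks whose embeddings lie closest to the query embedding. The brute-force sketch below shows that idea with Euclidean distance on tiny hand-written vectors. It assumes Euclidean (L2) distance, which we believe is the default for this store, and it is not the FAISS implementation. `top_k_l2` and the vectors are hypothetical.

```python
import math

def top_k_l2(query, vectors, k=5):
    """Return indices of the k vectors nearest to query by Euclidean distance."""
    distances = [(math.dist(query, vector), index) for index, vector in enumerate(vectors)]
    distances.sort()  # nearest first; ties are broken by the lower index
    return [index for _, index in distances[:k]]

# Hypothetical 2-D "embeddings" standing in for real ones
chunk_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(top_k_l2([0.9, 0.9], chunk_vectors, k=2))  # prints [1, 0]
```

When the index holds fewer vectors than `k`, fewer results come back. With a single-chunk transcript, a retriever configured with `k=5` can return at most that one chunk.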

In [25]:
qa_chain = make_qa_chain(store)

In [26]:
question = "Is this ad problem-oriented or solution-oriented? Choose from Problem or Solution"

In [None]:
answer = qa_chain.invoke(question)
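The chain's response is free text, so it needs to be mapped to one of the two allowed labels before it is stored. The sketch below assumes the response is either a plain string or a dictionary with a `result` key (our understanding of what `RetrievalQA` returns). `parse_orientation` is a hypothetical helper.

```python
import re

def parse_orientation(answer):
    """Map a model response to 'Problem' or 'Solution', or None if ambiguous."""
    # Accept either the raw string or a dict holding the text under "result"
    text = answer.get("result", "") if isinstance(answer, dict) else str(answer)
    matches = re.findall(r"\b(problem|solution)\b", text, flags=re.IGNORECASE)
    labels = {match.capitalize() for match in matches}
    # Exactly one distinct label is a usable answer; anything else is ambiguous
    return labels.pop() if len(labels) == 1 else None

print(parse_orientation({"query": "...", "result": "Solution"}))  # prints Solution
print(parse_orientation("It could be Problem or Solution."))      # prints None
```

Returning `None` for ambiguous responses keeps unclear answers out of the results table so they can be reviewed manually.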

In [None]:
'''
chunks = chunk_transcript(transcript)
store = build_faiss_index(chunks)
qa_chain = make_qa_chain(store)
#summary_chain = make_summary_chain()

#summary = summary_chain.run([Document(page_content=c) for c in chunks])

question = "Is this ad problem-oriented or solution-oriented? Choose from Problem or Solution"
answer = qa_chain.invoke(question)
answer
'''