# Transform unstructured data into structured data in real time

Media companies want to extract key information from livestreamed events for subtitles, translations, and content summaries, but doing this manually or with batch processing causes delays. This project shows how to use GlassFlow to extract, transform, and translate YouTube video data in real time. The handler extracts key topics from the video transcript, generates derived metrics, and translates the transcript into a specified language.

### Features
1. Extract video transcript from YouTube.
2. Process the transcript to extract topics and derive key metrics, such as the number of speakers and the total duration of the spoken content.
3. Translate the transcript into the user's preferred language (for example, from English to Spanish).
4. Return structured data and derived metrics.
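
As a rough illustration of the derived metrics mentioned in feature 2, the sketch below computes the spoken duration and a word count from transcript segments. It assumes each segment is a dict with `text`, `start`, and `duration` keys; the function name `derive_metrics` and the `word_count` key are hypothetical, and this is not the handler's actual implementation.

```python
def derive_metrics(segments):
    """Compute simple metrics from transcript segments.

    Assumption: each segment is a dict with "text" (str) and
    "duration" (seconds, float).
    """
    total_seconds = sum(seg["duration"] for seg in segments)
    word_count = sum(len(seg["text"].split()) for seg in segments)
    return {
        "transcript_length_minutes": round(total_seconds / 60, 2),
        "word_count": word_count,
    }


# Hand-written sample segments, for illustration only
segments = [
    {"text": "We're no strangers to love", "start": 0.0, "duration": 90.0},
    {"text": "You know the rules", "start": 90.0, "duration": 30.0},
]
print(derive_metrics(segments))
```

Summing segment durations may overcount when caption segments overlap, so treat the result as an approximation.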


## Prerequisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Get your OpenAI API key from the [OpenAI platform](https://platform.openai.com/).

## Step 1: Install and import GlassFlow

In [1]:
%pip install "glassflow>=2.0.5"

Note: you may need to restart the kernel to use updated packages.


In [2]:
import glassflow
import time

## Step 2: Create GlassFlow Pipeline

In [3]:
# Set personal access token from your GlassFlow account and OpenAI API key
personal_access_token = ""
OPENAI_API_KEY = ""
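
As an alternative to pasting secrets into the notebook, the values can be read from environment variables. This is a minimal sketch: the helper `read_secret` and the variable name `GLASSFLOW_PERSONAL_ACCESS_TOKEN` are hypothetical choices, not names defined by GlassFlow.

```python
import os


def read_secret(name, default=""):
    """Return the environment variable `name` (stripped), or `default` if unset."""
    return os.environ.get(name, default).strip()


# Hypothetical variable name for the GlassFlow token; adjust to your setup
personal_access_token = read_secret("GLASSFLOW_PERSONAL_ACCESS_TOKEN")
OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
```

Both values fall back to an empty string when the variables are not set, so check them before creating the client.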

In [4]:
# Create a GlassFlow client
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [5]:
# Get the space named "unstructured-to-structured" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "unstructured-to-structured"
for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"], 
            name=s["name"]
        )
        break
else:
    space = client.create_space(name=space_name)

print(f"Using space {space.name} with ID: {space.id}")

Using space unstructured-to-structured with ID: c5e01c71-7bc5-4990-a76a-a15c35156cb3


### Transformation Function

In [6]:
%pycat transform.py

import openai
from youtube_transcript_api import YouTubeTranscriptApi
import os

# You will need an API key from OpenAI for this to work
openai.api_key = os.getenv("OPENAI_API_KEY")


# GlassFlow mandatory handler function
def handler(data, log):
    """
    GlassFlow handler function for extracting key insights from YouTube video transcripts,
    translating the transcript into another language, and generating derived metrics.

    Parameters:


### Requirements.txt

Define the external dependencies for the transformation function.

In [7]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
print(requirements_txt)

openai
youtube_transcript_api
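
Transcript lookups with `youtube_transcript_api` are keyed by video ID rather than by full URL, so a handler that receives a `youtube_link` field needs to extract the ID first. The sketch below shows one way to do that with the standard library; `extract_video_id` is a hypothetical helper, and it only covers `watch?v=` and `youtu.be` links.

```python
from urllib.parse import urlparse, parse_qs

# Hosts treated as regular YouTube watch pages (an assumption, not exhaustive)
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(youtube_link):
    """Return the video ID from a watch or youtu.be URL, or None if not found."""
    parsed = urlparse(youtube_link)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        # Watch links carry the ID in the "v" query parameter
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```

Returning `None` for unrecognized links lets the caller decide whether to skip the event or log an error.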


### Environment variables

In [8]:
env_vars = [{
  "name": "OPENAI_API_KEY",
  "value": OPENAI_API_KEY
}]
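
Because `OPENAI_API_KEY` starts out as an empty string, it can help to fail early if a value was never filled in. The sketch below builds the same list-of-dicts shape from a mapping and rejects empty values; `build_env_vars` is a hypothetical helper, not part of the GlassFlow SDK.

```python
def build_env_vars(values):
    """Convert a {name: value} mapping into a list of {"name", "value"} dicts.

    Raises ValueError if any value is empty, so a missing key is caught
    before the pipeline is created.
    """
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing values for: {', '.join(missing)}")
    return [{"name": name, "value": value} for name, value in values.items()]


# Placeholder value for illustration only
example_env_vars = build_env_vars({"OPENAI_API_KEY": "sk-placeholder"})
print(example_env_vars)
```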

### Create Pipeline

Create a pipeline for the video processing.

In [9]:
pipeline_name = "video-transcript-analysis"

pipeline = client.create_pipeline(
    name=pipeline_name, 
    transformation_file='transform.py',
    env_vars=env_vars,
    space_id=space.id,
    requirements=requirements_txt
)
print(f"Pipeline created successfully with ID: {pipeline.id}")
print(f"Pipeline URL on GlassFlow UI to discover https://app.glassflow.dev/pipelines/{pipeline.id}")

Pipeline created successfully with ID: a8ce1e13-6e55-4acb-b4ea-bd0958449e20
Pipeline URL on GlassFlow UI to discover https://app.glassflow.dev/pipelines/a8ce1e13-6e55-4acb-b4ea-bd0958449e20 


## Step 3: Send events to the pipeline

In [14]:
data_source = pipeline.get_source()

# Sample event data for testing: a YouTube link and a target language
test_events = [
    {"youtube_link": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "target_language": "Spanish"}
]

n_events = len(test_events)
# Publish each test event
for i, event in enumerate(test_events):
    print(f"Publishing event {i+1}: {event}")
    data_source.publish(event)
    time.sleep(1)  # Optional delay to simulate real-time event publishing


Publishing event 1: {'youtube_link': 'https://www.youtube.com/watch?v=dQw4w9WgXcQ', 'target_language': 'Spanish'}
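
Each event above carries a `youtube_link` and a `target_language`. A small check before publishing can catch malformed events early; the sketch below is one possible approach, and `validate_event` and `REQUIRED_FIELDS` are hypothetical names rather than part of the GlassFlow SDK.

```python
REQUIRED_FIELDS = ("youtube_link", "target_language")


def validate_event(event):
    """Return a list of problems found in an event dict (empty list if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = event.get(field)
        # Each required field must be a non-blank string
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


sample_event = {
    "youtube_link": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "target_language": "Spanish",
}
print(validate_event(sample_event))
```

Events with a non-empty problem list could be skipped or logged instead of being published.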


## Step 4: Consume structured data from the pipeline

Get the pipeline's data sink to consume the transformed events.

In [15]:
data_sink = pipeline.get_sink()

In [16]:
output_events = []
while True:
    resp = data_sink.consume()
    if resp.status_code == 200:
        event = resp.json()
        output_events.append(event)
        print(event)
    if len(output_events) == n_events:
        # all events have been consumed
        break

# for event in output_events:
#     print(event)

{'youtube_link': 'https://www.youtube.com/watch?v=dQw4w9WgXcQ', 'translated_transcript': '[Music] No somos extraños al amor, conoces las reglas y yo también. Un compromiso total es lo que estoy pensando. No obtendrías esto de ningún otro chico. Solo quiero decirte cómo me siento, tengo que hacerte entender. Nunca te voy a abandonar, nunca te voy a decepcionar, nunca voy a correr y dejarte. Nunca te haré llorar, nunca te diré adiós, nunca te mentiré ni te haré daño.\n\nNos conocemos', 'transcript_length_minutes': 2.14, 'number_of_speakers': 3, 'average_words_per_speaker': 128.67}
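
The consume loop above polls without a pause and runs until every event arrives, so it never exits if an event fails in the pipeline. A bounded variant is sketched below. It relies only on the `consume()`, `status_code`, and `json()` usage shown in this notebook; `consume_events`, `FakeSink`, and `FakeResponse` are hypothetical stand-ins so the example runs without a GlassFlow account, and the 204 status is an arbitrary "no event" placeholder.

```python
import time


def consume_events(sink, expected, timeout_s=60.0, poll_interval_s=0.5):
    """Poll `sink` until `expected` events arrive or `timeout_s` elapses."""
    events = []
    deadline = time.monotonic() + timeout_s
    while len(events) < expected and time.monotonic() < deadline:
        resp = sink.consume()
        if resp.status_code == 200:
            events.append(resp.json())
        else:
            # Nothing available yet: wait briefly before polling again
            time.sleep(poll_interval_s)
    return events


class FakeResponse:
    """Stand-in for a sink response (demonstration only)."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


class FakeSink:
    """Stand-in sink that replays canned responses (demonstration only)."""

    def __init__(self, responses):
        self._responses = list(responses)

    def consume(self):
        if self._responses:
            return self._responses.pop(0)
        return FakeResponse(204)


sink = FakeSink([FakeResponse(204), FakeResponse(200, {"id": 1})])
events = consume_events(sink, expected=1, timeout_s=5.0, poll_interval_s=0.01)
print(events)
```

With the real pipeline, the call would look like `consume_events(data_sink, n_events)`; a result shorter than `n_events` signals that the timeout was reached.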


## Monitor the pipeline

Open the logs of the pipeline you created to monitor events in real time.

In [18]:
# Explore the pipeline logs in the web UI
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}/logs"
print(pipeline_url)

https://app.glassflow.dev/pipelines/a8ce1e13-6e55-4acb-b4ea-bd0958449e20/logs
