# Transform unstructured data into structured data in real time

Media companies want to extract key information from livestreamed events for subtitles, translations, and content summaries, but doing this manually or with batch processing causes delays. This project shows how to use GlassFlow to extract, transform, and translate YouTube video data in real time. The transformation handler extracts key topics from the video transcript, generates derived insights, and translates the transcript into a specified language.

### Features
1. Extract video transcript from YouTube.
2. Process the transcript to extract topics and key metrics, such as the number of speakers and the total duration of the spoken content.
3. Translate the transcript into the user's preferred language (for example, from English to Spanish).
4. Return structured data and derived metrics.
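
The derived metrics in the list above can be illustrated with a minimal sketch. This is not the project's `transform.py`: the segment fields (`text`, `start`, `duration`, `speaker`), the function name `derive_metrics`, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List


def derive_metrics(segments: List[Dict]) -> Dict:
    """Summarize transcript segments into a small structured record."""
    # Total spoken time is the sum of each segment's duration in seconds.
    total_duration = sum(seg["duration"] for seg in segments)
    # Distinct speaker labels; segments without a label are ignored.
    speakers = {seg["speaker"] for seg in segments if seg.get("speaker")}
    return {
        "transcript": " ".join(seg["text"] for seg in segments),
        "num_speakers": len(speakers),
        "total_duration_seconds": round(total_duration, 2),
    }


# Hypothetical sample input, not taken from a real video.
sample_segments = [
    {"text": "Welcome to the show.", "start": 0.0, "duration": 2.5, "speaker": "A"},
    {"text": "Thanks for having me.", "start": 2.5, "duration": 1.5, "speaker": "B"},
    {"text": "Let's get started.", "start": 4.0, "duration": 2.0, "speaker": "A"},
]
metrics = derive_metrics(sample_segments)
print(metrics)
```

Topic extraction and translation are left out here because they depend on the OpenAI calls inside the actual transformation function.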


## Prerequisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Get your OpenAI API key from the [OpenAI platform](https://platform.openai.com/).

## Step 1: Install and import GlassFlow

In [None]:
%pip install "glassflow>=2.0.8"

In [None]:
import glassflow
import time

## Step 2: Create GlassFlow Pipeline

In [None]:
# Set personal access token from your GlassFlow account and OpenAI API key
personal_access_token = ""
OPENAI_API_KEY = ""
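
Instead of pasting secrets into the notebook, one option is to read them from environment variables. The sketch below assumes the hypothetical variable name `GLASSFLOW_PERSONAL_ACCESS_TOKEN` alongside `OPENAI_API_KEY`; the helper `read_secret` is also illustrative and not part of the GlassFlow SDK.

```python
import os


def read_secret(name: str, default: str = "") -> str:
    """Return the environment variable `name`, stripped, or `default`."""
    return os.environ.get(name, default).strip()


# Hypothetical variable names; use whatever names you export in your shell.
personal_access_token = read_secret("GLASSFLOW_PERSONAL_ACCESS_TOKEN")
OPENAI_API_KEY = read_secret("OPENAI_API_KEY")

# Report which credentials are still missing without printing their values.
missing = [
    label
    for label, value in [
        ("GLASSFLOW_PERSONAL_ACCESS_TOKEN", personal_access_token),
        ("OPENAI_API_KEY", OPENAI_API_KEY),
    ]
    if not value
]
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

This keeps tokens out of the saved notebook file, which matters if the notebook is shared or committed to version control.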

In [None]:
# Create a GlassFlow client
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [None]:
# Get the space named "unstructured-to-structured" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "unstructured-to-structured"
for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"], 
            name=s["name"]
        )
        break
else:
    space = client.create_space(name=space_name)

print(f"Using space {space.name} with ID: {space.id}")

### Transformation Function

In [None]:
%pycat transform.py

### Requirements file

The `requirements.txt` file defines the external dependencies of the transformation function.

In [None]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
print(requirements_txt)

### Environment variables

In [None]:
env_vars = [{
  "name": "OPENAI_API_KEY",
  "value": OPENAI_API_KEY
}]
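
If the transformation needs more than one variable, the same list of `name`/`value` records can be built from a plain dictionary. This is a small convenience sketch; the helper name `to_env_vars` and the placeholder value are assumptions, not part of the GlassFlow SDK.

```python
from typing import Dict, List


def to_env_vars(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a {name: value} mapping into a list of name/value records."""
    return [{"name": name, "value": value} for name, value in values.items()]


# Placeholder value for illustration only; never hardcode a real key.
env_vars_example = to_env_vars({"OPENAI_API_KEY": "placeholder-key"})
print(env_vars_example)
```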

### Create Pipeline

Create a pipeline for the video processing.

In [None]:
pipeline_name = "video-transcript-analysis"

pipeline = client.create_pipeline(
    name=pipeline_name, 
    transformation_file='transform.py',
    env_vars=env_vars,
    space_id=space.id,
    requirements=requirements_txt
)
print(f"Pipeline created successfully with ID: {pipeline.id}")
print(f"Pipeline URL on the GlassFlow UI: https://app.glassflow.dev/pipelines/{pipeline.id}")

## Step 3: Send events to the pipeline

In [None]:
data_source = pipeline.get_source()

# Sample event for testing: a YouTube link and the target translation language
test_events = [
    {"youtube_link": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "target_language": "Spanish"}
]

n_events = len(test_events)
# Publish each test event
for i, event in enumerate(test_events):
    print(f"Publishing event {i+1}: {event}")
    data_source.publish(event)
    time.sleep(1)  # Optional delay to simulate real-time event publishing
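
Before publishing, it can help to check that each event carries a recognizable YouTube link and a target language, so malformed input fails early rather than inside the pipeline. The following is a sketch under assumptions: the helper names `extract_video_id` and `is_valid_event` are hypothetical, and only `watch?v=` and `youtu.be` URL shapes are handled.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the YouTube video ID from a watch or youtu.be URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v")
            return ids[0] if ids else None
        return None
    if host == "youtu.be":
        # Short links carry the ID as the path itself.
        return parsed.path.lstrip("/") or None
    return None


def is_valid_event(event: dict) -> bool:
    """Check that an event has a recognizable link and a non-empty language."""
    link = event.get("youtube_link", "")
    language = event.get("target_language", "")
    return extract_video_id(link) is not None and bool(language.strip())


event = {
    "youtube_link": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "target_language": "Spanish",
}
print(extract_video_id(event["youtube_link"]), is_valid_event(event))
```

A filter such as `[e for e in test_events if is_valid_event(e)]` could then be applied before the publishing loop.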


## Step 4: Consume structured data from the pipeline

Get the pipeline's data sink to consume the transformed events.

In [None]:
data_sink = pipeline.get_sink()

In [None]:
output_events = []
while True:
    resp = data_sink.consume()
    if resp.status_code == 200:
        event = resp.json()
        output_events.append(event)
        print(event)
    if len(output_events) == n_events:
        # all events have been consumed
        break

# for event in output_events:
#     print(event)
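
The loop above runs until every event has arrived, so it never exits if a transformation fails. A bounded variant with a timeout and a short pause between empty polls might look like the sketch below. The helper `consume_until` and its parameters are hypothetical; it takes any zero-argument callable so it can be tried without a live pipeline.

```python
import time
from typing import Callable, List, Optional


def consume_until(
    fetch: Callable[[], Optional[dict]],
    expected: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
) -> List[dict]:
    """Call `fetch` until `expected` events arrive or `timeout_s` elapses."""
    events: List[dict] = []
    deadline = time.monotonic() + timeout_s
    while len(events) < expected and time.monotonic() < deadline:
        event = fetch()
        if event is None:
            # Nothing available yet; wait briefly instead of busy-looping.
            time.sleep(poll_interval_s)
        else:
            events.append(event)
    return events


# Stand-in source for demonstration: two empty polls, then one event.
_responses = iter([None, None, {"ok": True}])
demo = consume_until(
    lambda: next(_responses, None), expected=1, timeout_s=5, poll_interval_s=0.01
)
print(demo)
```

With the sink from this notebook, `fetch` could wrap `data_sink.consume()` and return `resp.json()` when `resp.status_code == 200`, and `None` otherwise.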

## Monitor the pipeline

Open the logs of the pipeline you created to monitor events in real time.

In [None]:
# Explore the pipeline logs in the web UI
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}/logs"
print(pipeline_url)