# **Project Description: StoryCraft – Multimodal Story Generator**

> Build an AI-powered storytelling pipeline that generates narrated visual stories from a single prompt. The system transforms a basic idea into a full multimedia experience (images, narration, and transcript) using LLMs and generative APIs.

### **Project Goals**

- Take a **story title and message** as input
- Use **LLMs** (via LangChain) to generate a **multi-scene storyline**
- Convert scene descriptions into **images** using **DALL·E**
- Generate **audio narration** for each scene using **ElevenLabs**
- Transcribe audio using **Whisper** for accessibility and recordkeeping
- Organize all outputs into a structured, replayable **story package**



###  **End-to-End Pipeline**

1. **User Input**: Story title and one-line message (e.g., *“A lonely robot finds a friend”*)
2. **LLM (LangChain)**: Expands this into 3–5 story scenes with detailed descriptions
3. **Text-to-Image (DALL·E)**: Each scene’s description → image
4. **Text-to-Speech (ElevenLabs)**: Scene text → audio narration
5. **Speech-to-Text (Whisper)**: Audio → transcript
6. **Output**: A storybook-like experience with:
    - Scene image
    - Audio player
    - Transcript
    - Option to export/share/download
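One way to picture the final "story package" is as a small record that bundles every artifact for a story. The sketch below is only an illustration: the `StoryPackage` class, its field names, and the `story.json` manifest file are assumptions, not part of the notebook.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class StoryPackage:
    """Hypothetical container for one generated story (not defined in the notebook)."""
    title: str
    message: str
    story_text: str
    image_url: str
    audio_path: str
    transcript: str

    def save_manifest(self, out_dir: str) -> Path:
        """Write all fields to <out_dir>/story.json and return the manifest path."""
        folder = Path(out_dir)
        folder.mkdir(parents=True, exist_ok=True)
        manifest = folder / "story.json"
        manifest.write_text(
            json.dumps(asdict(self), indent=2, ensure_ascii=False),
            encoding="utf-8",
        )
        return manifest
```

A manifest like this would make a story replayable later without calling any API again, provided the image and audio files are kept alongside it.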




## Section 1: Setting Up the Environment

In this section, we install necessary libraries, authenticate APIs (OpenAI, ElevenLabs), and prepare the environment for the storytelling pipeline. This includes handling keys securely and verifying access to each external service.


In [2]:
!pip install openai elevenlabs pydub --quiet
!pip install git+https://github.com/openai/whisper.git --quiet

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


### API Key Setup

We need API keys for:
- **OpenAI** (for both LLM and DALL·E)
- **ElevenLabs** (for text-to-speech)
- **Whisper** is called through the OpenAI API (`whisper-1`) in this notebook, so it reuses the OpenAI key and needs no separate one

Enter your keys in the Secrets panel in Colab's left sidebar. Do **not** share this notebook with keys still visible.
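If the notebook may also run outside Colab, a small loader can fall back to environment variables and fail early when a key is missing. This is a minimal sketch; `load_api_key` is a hypothetical helper, not something the notebook defines.

```python
import os


def load_api_key(name: str) -> str:
    """Return the secret `name`, trying Colab secrets first, then environment variables."""
    value = None
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not on Colab, or the secret is missing / not shared with this notebook.
        value = None
    value = value or os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing API key: {name}")
    return value
```

With this helper, `OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")` would behave the same on Colab and on a local machine where the key is exported in the shell.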


In [92]:
import os
import openai
from elevenlabs import ElevenLabs
from google.colab import userdata
import re
from IPython.display import display, Image, Audio
import pathlib

In [8]:
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [20]:
openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)

In [28]:
ELEVENLABS_API_KEY = userdata.get('ELEVENLABS_API_KEY')

In [29]:
from elevenlabs.client import ElevenLabs
elevenlabs_client = ElevenLabs(
  api_key=ELEVENLABS_API_KEY
)

## Section 2: Tools & Utilities

This section defines modular helper functions to access different AI tools used in the pipeline. Each tool is wrapped in a function so it can be reused in the agent later.

We cover:
-  Story generation using an OpenAI chat model (`gpt-3.5-turbo-0125`)
-  Image generation using DALL·E
-  Audio narration using ElevenLabs
-  Transcription using Whisper


### Tool 1: LLM Scene Generator

This function uses an OpenAI chat model (`gpt-3.5-turbo-0125`) to write a single story of about 100 words from a title and a one-line message, followed by an image prompt for the story's most iconic moment.

The output is plain text with a `Story:` block and an `Image prompt:` block, which we parse later.


In [104]:
def generate_scenes(title: str, message: str) -> str:
    prompt = f"""
    You are a creative story writer. Given the story title and a one-line message, write a single engaging story of about 100 words.
    Use natural punctuation marks (., '' "" ! ... and line breaks \\n) to make the story flow smoothly for audio narration.

    Then, provide a short, vivid prompt to generate an animated image of the most iconic moment of this story, suitable for DALL·E.



    Title: {title}
    Message: {message}

    Format the output exactly as:

    Story:
    <the full story here>

    Image prompt:
    <a detailed and precise description for a story cover image>
    """
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()



### Tool 2: DALL·E Image Generator

This function uses OpenAI's DALL·E model (`dall-e-2`) to generate a single 1024x1024 image from a text prompt (the image prompt produced by the LLM). The output is a URL to the generated image.


In [24]:
def generate_image(prompt: str) -> str:
    response = openai_client.images.generate(
        model="dall-e-2",
        prompt=prompt,
        size="1024x1024",
        n=1
    )
    return response.data[0].url
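Because `generate_image` returns a hosted URL rather than image bytes, and such URLs are typically only valid for a limited time, we may want a local copy for the story package. The sketch below uses only the standard library; `download_image` and the default file name `story_cover.png` are hypothetical.

```python
import shutil
import urllib.request


def download_image(url: str, output_path: str = "story_cover.png") -> str:
    """Download the image at `url` to `output_path` and return the local path."""
    with urllib.request.urlopen(url) as response, open(output_path, "wb") as f:
        # Stream the response body to disk instead of loading it all into memory.
        shutil.copyfileobj(response, f)
    return output_path
```

Usage would look like `cover_path = download_image(generate_image(image_prompt))`, after which `display(Image(filename=cover_path))` shows the saved file.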


### Tool 3: ElevenLabs Text-to-Speech (TTS)

This function converts scene text into speech audio using ElevenLabs. It saves the audio as an MP3 file locally and returns the file path.


In [76]:
def generate_audio(text, output_path="output_audio.mp3"):
    voice_id = "XB0fDUnXU5powFXDhCwa"  # Charlotte's voice ID
    model_id = "eleven_multilingual_v2"
    audio_gen = elevenlabs_client.text_to_speech.convert(
        voice_id=voice_id,
        text=text,
        model_id=model_id,
        output_format="mp3_44100_128"
    )
    audio_bytes = b"".join(audio_gen)
    with open(output_path, "wb") as f:
        f.write(audio_bytes)
    return output_path
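`generate_audio` joins all audio chunks in memory before writing them. For longer narrations, the chunks could instead be written as they arrive. The sketch below assumes, as the `b"".join(audio_gen)` call above suggests, that the client yields an iterable of `bytes`; `save_audio_chunks` is a hypothetical helper.

```python
from typing import Iterable


def save_audio_chunks(chunks: Iterable[bytes], output_path: str) -> int:
    """Write an iterable of byte chunks to `output_path`; return the bytes written."""
    total = 0
    with open(output_path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total
```

Returning the byte count gives a cheap sanity check: a result of `0` means the narration came back empty.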


### Tool 4: Transcript Generator (Whisper)

This tool transcribes the audio narration into text using OpenAI’s Whisper model. The transcript can serve as subtitles or closed captions for the narrated story.

We open the audio file (e.g., `output_audio.mp3`) and pass it to `openai_client.audio.transcriptions.create()` with `model="whisper-1"` to get the plain-text transcription.

Whisper is robust against background noise and accents, making it ideal for spoken storytelling.


In [90]:
def transcribe_audio(audio_path):
    audio_file = pathlib.Path(audio_path)
    with audio_file.open("rb") as f:
        transcript = openai_client.audio.transcriptions.create(
            model="whisper-1",
            file=f
        )
    return transcript.text
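Since the transcript is derived from audio that was itself generated from the story text, comparing the two gives a rough check that narration and transcription both went well. The following is a minimal sketch using the standard library's `difflib`; `transcript_similarity` is a hypothetical helper, and any threshold for "good enough" is left to the reader.

```python
import difflib
import re


def _words(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def transcript_similarity(original: str, transcript: str) -> float:
    """Return a word-level similarity ratio in [0, 1] between two texts."""
    matcher = difflib.SequenceMatcher(None, _words(original), _words(transcript))
    return matcher.ratio()
```

A ratio close to 1.0 suggests the transcript matches the story apart from punctuation and casing; a low ratio is a hint to listen to the audio before publishing.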

## Full Story Generation Pipeline

This function orchestrates the entire storytelling workflow:

1. Takes a story title and a one-line message as input.
2. Uses the LLM to generate a 100-word story and an image prompt for the most iconic moment.
3. Sends the image prompt to DALL·E to generate a scene illustration.
4. Sends the story text to ElevenLabs to create an audio narration.
5. Transcribes the audio narration back to text using Whisper for accessibility.
6. Outputs the story title, message, the generated image, audio playback, and the transcription — all displayed inline.

This end-to-end pipeline enables a seamless multimedia storytelling experience directly within the notebook.
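Step 2 depends on the LLM following the `Story:` / `Image prompt:` format exactly. Pulling the parsing into its own function makes that step testable without any API calls. This is a sketch: `parse_llm_output` is a hypothetical name, and the case-insensitive match is an added tolerance that the notebook's own regex does not have.

```python
import re


def parse_llm_output(llm_output: str) -> tuple[str, str]:
    """Split the LLM response into (story, image_prompt)."""
    match = re.search(
        r"Story:\s*(.+?)\s*Image prompt:\s*(.+)",
        llm_output,
        re.DOTALL | re.IGNORECASE,
    )
    if not match:
        raise ValueError("Failed to parse LLM output into story and image prompt.")
    return match.group(1).strip(), match.group(2).strip()
```

For example, `parse_llm_output("Story:\nOnce upon a time.\n\nImage prompt:\nA robot and a bird.")` yields the story text and the image prompt as two separate strings, and a response missing either label raises `ValueError`.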


In [98]:
def full_story_flow(title: str, message: str):
    # 1. Run LLM generator to get combined output
    llm_output = generate_scenes(title, message)

    # 2. Parse LLM output into story and image prompt
    story_match = re.search(r"Story:\s*(.+?)\s*Image prompt:", llm_output, re.DOTALL)
    image_prompt_match = re.search(r"Image prompt:\s*(.+)", llm_output, re.DOTALL)

    if not story_match or not image_prompt_match:
        raise ValueError("Failed to parse LLM output into story and image prompt.")

    story = story_match.group(1).strip()
    image_prompt = image_prompt_match.group(1).strip()

    # 3. Generate image with DALL·E
    image_url = generate_image(image_prompt)

    # 4. Generate audio narration with ElevenLabs
    audio_path = generate_audio(story)

    # 5. Transcribe audio with Whisper
    transcript = transcribe_audio(audio_path)

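    # 6. Display all outputs inline
    print(f"Story Title:\n{title}\n")
    print(f"Story Message:\n{message}\n")
    #print(f"### Story Text:\n{story}\n")

    #print(f"### Image Prompt:\n{image_prompt}\n")
    print('Story Cover')
    display(Image(url=image_url))

    print("Audio Narration:")
    display(Audio(audio_path))

    print("Transcript:")
    print(transcript)

    # Return the outputs as well, so the caller's result is not None
    return {
        "title": title,
        "message": message,
        "story": story,
        "image_prompt": image_prompt,
        "image_url": image_url,
        "audio_path": audio_path,
        "transcript": transcript,
    }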




## Checking our System

In [105]:
story_title = "Rabbit races Turtoise"
story_message = "Slow but steady wins the race"

result = full_story_flow(story_title, story_message)


Story Title:
Rabbit races Turtoise

Story Message:
Slow but steady wins the race

Story Cover


Audio Narration:


Transcript:
The rabbit was known for its speed, while the turtoise was mocked for its slowness. So, they decided to have a race. The rabbit dashed ahead confidently, leaving the turtoise far behind. But the turtoise kept plodding along steadily. As the finish line approached, the rabbit, tired and overconfident, took a nap. The turtoise, slow but determined, crossed the finish line first, proving that slow and steady wins the race.


In [None]:
# Full flow without the story cover (note: this redefines full_story_flow from above)
def full_story_flow(title: str, message: str):
    # 1. Run LLM generator to get combined output
    llm_output = generate_scenes(title, message)

    # 2. Parse LLM output into story and image prompt (image prompt not used here)
    story_match = re.search(r"Story:\s*(.+?)\s*Image prompt:", llm_output, re.DOTALL)

    if not story_match:
        raise ValueError("Failed to parse LLM output into story.")

    story = story_match.group(1).strip()

    # 3. Generate audio narration with ElevenLabs
    audio_path = generate_audio(story)

    # 4. Transcribe audio with Whisper
    transcript = transcribe_audio(audio_path)

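    # 5. Display all outputs inline
    print(f"Story Title:\n{title}\n")
    print(f"Story Message:\n{message}\n")
    print(f"Story Text:\n{story}\n")

    print("Audio Narration:")
    display(Audio(audio_path))

    print("Transcript:")
    print(transcript)

    # Return the outputs as well, so the caller's result is not None
    return {
        "title": title,
        "message": message,
        "story": story,
        "audio_path": audio_path,
        "transcript": transcript,
    }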