# Podcastfy 
Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI:

In this Notebook, we will go through the main features of Podcastfy via its Python package:

- Support multiple input sources (text, images, websites, YouTube, and PDFs).
- Generate short (2-5 minutes) or longform (30+ minutes) podcasts.
- Customize transcript and audio generation (e.g., style, language, structure).
- Generate transcripts using 100+ cloud-based LLM models (OpenAI, Anthropic, Google, etc.).
- Leverage local LLMs for transcript generation for increased privacy and control.
- Integrate with advanced text-to-speech models (OpenAI, ElevenLabs, Microsoft Edge, Google single and multispeaker).
- Provide multi-language support for global content creation.

## Table of Contents

- Setup
- Getting Started
- Generate a podcast from text content
  - Generate podcast from URLs
  - Selecting TTS models
  - Generate longform podcasts
  - Generate transcript only
  - Generate audio from transcript
  - Generate podcast from pdf
  - Raw text as input
  - Podcast from topic / grounded web search
- Generate podcast from images
- Conversation Customization
- Multilingual Support
  - French (fr)
  - Portuguese (pt-br)
- Custom LLM Support

## Setup

First, make sure you have installed the podcastfy module and its dependencies, and have configured the associated API keys. [See Setup](README.md#setup).

## Getting Started

Podcast generation is done through the `generate_podcast` function.

In [1]:
from podcastfy.client import generate_podcast



Below we write a custom function we will use to embed audio in this Python notebook.

In [4]:
%pip install ipython
from IPython.display import Audio, display

def embed_audio(audio_file):
	"""
	Embeds an audio file in the notebook, making it playable.

	Args:
		audio_file (str): Path to the audio file.
	"""
	try:
		display(Audio(audio_file))
		print(f"Audio player embedded for: {audio_file}")
	except Exception as e:
		print(f"Error embedding audio: {str(e)}")




## Generate podcast from URL

This code demonstrates the process of generating a podcast from a single URL, in this case Wikipedia's page on "Podcast":
1. Extract content from the URL
2. Generate a Q&A transcript from the extracted content
3. Convert the transcript to speech using a Text-to-Speech model
4. Save the generated audio file to data/audio

The cell below passes `transcript_only=True`, so it stops after step 2 and returns the transcript path; omit that argument to run all four steps.

In [2]:
transcript_file = generate_podcast(urls=["https://en.wikipedia.org/wiki/Podcast"], 
                              llm_model_name='llama-3.2-3b-instruct-q4_k_m', 
                              api_key_label="OPENAI_API_KEY",
                              transcript_only=True)

Generating podcast...
2025-01-06 07:22:37,109 - podcastfy.client - INFO - Processing 1 links
2025-01-06 07:50:31,787 - podcastfy.client - INFO - Transcript generated successfully: ./data/transcripts/transcript_2eee3a23c90a43cab8cbcc0e1f697ee1.txt


Transcript saved to ./data/transcripts/transcript_2eee3a23c90a43cab8cbcc0e1f697ee1.txt


In [None]:
# Embed the generated audio file; `audio_file` must hold an audio path returned by generate_podcast (i.e. from a run without transcript_only=True)
embed_audio(audio_file)

Default configuration:
- LLM model for transcript generation: Gemini 1.5 Pro
- TTS model for audio generation: OpenAI
- Audio output directory: "./data/audio/"
- Transcript output directory: "./data/transcripts/"

All the above parameters can be customized as we will see later in the "Conversation Customization" section.


The `urls` parameter takes a list of URLs, which can be:
- Websites
- YouTube videos
- PDFs (local path)

If you have a list of URLs to process, you can instead pass the path of a file containing that list, one URL per line, using the `file_path` parameter.

In [None]:
audio_file = generate_podcast(file_path="./data/urls.txt")
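A minimal sketch of preparing such a file. The helper name `write_url_list` and the example URLs are hypothetical and not part of Podcastfy; the only Podcastfy feature relied on is the `file_path` parameter described above.

```python
from pathlib import Path


def write_url_list(urls, file_path):
    """Write one URL per line to file_path and return the path as a string.

    Hypothetical helper (not part of Podcastfy): it only prepares the text
    file that the `file_path` parameter expects.
    """
    # Drop blank entries and surrounding whitespace so each line holds one URL.
    cleaned = [url.strip() for url in urls if url.strip()]
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return str(path)


# Usage (commented out because it calls external LLM and TTS services):
# url_file = write_url_list(
#     ["https://en.wikipedia.org/wiki/Podcast"], "./data/urls.txt"
# )
# audio_file = generate_podcast(file_path=url_file)
```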

## Selecting TTS Models

TTS model selection is one of the most important parameters in podcast generation, as it can greatly impact the quality of the generated audio. Users can select a particular TTS model by passing the `tts_model` parameter to the `generate_podcast` function. Available TTS models and their respective parameter values are:
- [_default_] OpenAI (`openai`)
- ElevenLabs (`elevenlabs`)
- Microsoft Edge (`edge`)
- Google's single speaker - Journey - model (`gemini`)
- [_recommended_] Google's multispeaker model (`geminimulti`)

We recommend using Google's multispeaker model (`geminimulti`) for the best quality. In our experience, it sounds more natural and human-like. So why isn't Google's multispeaker model the default? 

Google's TTS models require an additional setup step compared to other models. See instructions [here](usage/config.md#setting-up-google-tts-model).
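Since `tts_model` is a plain string, a typo is easy to miss until a run fails. The sketch below checks a value against the list above before calling `generate_podcast`. The `TTS_MODELS` dictionary and `check_tts_model` helper are hypothetical conveniences, not part of Podcastfy.

```python
# Values copied from the list above; the helper itself is hypothetical.
TTS_MODELS = {
    "openai": "OpenAI (default)",
    "elevenlabs": "ElevenLabs",
    "edge": "Microsoft Edge",
    "gemini": "Google single-speaker (Journey)",
    "geminimulti": "Google multispeaker (recommended)",
}


def check_tts_model(name):
    """Return name unchanged if it is a listed tts_model value, else raise ValueError."""
    if name not in TTS_MODELS:
        valid = ", ".join(sorted(TTS_MODELS))
        raise ValueError(f"Unknown tts_model {name!r}; expected one of: {valid}")
    return name


# Usage (commented out because it calls external LLM and TTS services):
# audio_file = generate_podcast(
#     urls=["https://en.wikipedia.org/wiki/Podcast"],
#     tts_model=check_tts_model("geminimulti"),
# )
```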


## Generate longform podcasts


By default, Podcastfy generates shortform podcasts (2-5 minutes). However, users can generate longform podcasts (20-30+ minutes) by setting the `longform` parameter to `True`. Note: Images are not yet supported for longform podcast generation.



In this example, we generate a longform podcast from the book "The Autobiography of Benjamin Franklin":

In [None]:
audio_file = generate_podcast(urls=["https://www.gutenberg.org/cache/epub/20203/pg20203.txt"], 
                              longform=True,
                              tts_model="geminimulti")

In [None]:
print(audio_file)
#./data/audio/podcast_df7fe0314c1848dfb558b8d3d103441b.mp3

LLMs have a limited ability to output long text responses. Most LLMs have a `max_output_tokens` limit of roughly 4096 to 8192 tokens. Hence, long-form podcast transcript generation is challenging. We have implemented a technique we call "Content Chunking with Contextual Linking" to enable long-form podcast generation: the input content is broken down into smaller chunks and a conversation is generated for each chunk, while ensuring the combined transcript is coherent and linked to the original input.

### Adjusting longform podcast length

Users may adjust longform podcast length by setting the following parameters in their customization parameters (see the later section "Conversation Customization"):
- `max_num_chunks` (default: 7): Sets maximum number of rounds of discussions.
- `min_chunk_size` (default: 600): Sets minimum number of characters to generate a round of discussion.

We define "round of discussion" as the output transcript obtained from a single LLM call. The higher the `max_num_chunks` and the lower the `min_chunk_size`, the longer the generated podcast will be.
Today, this technique allows users to generate long-form podcasts of any length if the input content is long enough. However, conversation quality may decrease, and podcast length may converge to a maximum, if `max_num_chunks` is too high or `min_chunk_size` is too low, particularly when the input content is short.
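To make the roles of `max_num_chunks` and `min_chunk_size` concrete, here is a simplified sketch of chunked generation. It is an illustration under stated assumptions, not Podcastfy's actual implementation: `split_into_chunks`, `generate_longform`, and the `generate_round` callable (a stand-in for a single LLM call) are hypothetical names, and the even character-based split is an assumed strategy.

```python
def split_into_chunks(text, max_num_chunks=7, min_chunk_size=600):
    """Split text into at most max_num_chunks roughly equal pieces.

    Illustrative only: the chunk count is capped by how many chunks of
    min_chunk_size characters the text can supply; the last chunk may
    be shorter than the others.
    """
    if not text:
        return []
    num_chunks = max(1, min(max_num_chunks, len(text) // min_chunk_size))
    chunk_size = -(-len(text) // num_chunks)  # ceiling division
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def generate_longform(text, generate_round, max_num_chunks=7, min_chunk_size=600):
    """Produce one round of discussion per chunk and join the rounds.

    generate_round(chunk, previous_round) stands in for a single LLM call;
    handing it the previous round is the "contextual linking" step.
    """
    rounds = []
    for chunk in split_into_chunks(text, max_num_chunks, min_chunk_size):
        previous_round = rounds[-1] if rounds else ""
        rounds.append(generate_round(chunk, previous_round))
    return "\n".join(rounds)
```

In this sketch, raising `max_num_chunks` or lowering `min_chunk_size` yields more rounds only until the text length becomes the limiting factor, which mirrors the convergence behavior described above.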

Recommendation:
- If input content is short (1-10 paragraphs), generate shortform podcast (`longform=False`, which is the default).
- If input content is long (10+ paragraphs), generate longform podcast (`longform=True`).
- If input content is very long (e.g. long PDFs, books, series of websites), consider increasing `max_num_chunks` from the default of 7 to e.g. 10 or 15.
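The recommendations above can be sketched as a small helper that picks keyword arguments for `generate_podcast`. Everything here is hypothetical: the helper name, the blank-line paragraph count, the `very_long_paragraphs` threshold (the text above does not define "very long"), and the assumption that `max_num_chunks` is passed inside `conversation_config`.

```python
def suggest_generation_kwargs(text, very_long_paragraphs=200):
    """Map the recommendations above onto generate_podcast keyword arguments.

    Hypothetical helper. The very_long_paragraphs threshold is an
    illustrative guess, and placing max_num_chunks inside
    conversation_config is an assumption.
    """
    # Count non-empty blocks separated by blank lines as paragraphs.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) <= 10:
        return {"longform": False}
    kwargs = {"longform": True}
    if len(paragraphs) >= very_long_paragraphs:
        kwargs["conversation_config"] = {"max_num_chunks": 10}
    return kwargs


# Usage (commented out because it calls external LLM and TTS services):
# audio_file = generate_podcast(text=my_text, **suggest_generation_kwargs(my_text))
```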


## Generate transcript

Users have the option to generate only the transcript from input URLs, i.e. without audio generation. That way, users may edit or process transcripts before downstream audio generation.

In [None]:
# Generate transcript only
transcript_file = generate_podcast(
	urls=["https://github.com/souzatharsis/podcastfy/blob/main/README.md"],
	transcript_only=True
)

In [None]:

print(f"Transcript generated and saved as: {transcript_file}")
# Read and print the first 100 characters of the transcript file
with open(transcript_file, 'r') as file:
	transcript_content = file.read(100)
	print(f"First 100 characters of the transcript: {transcript_content}")
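As one way to edit a transcript before audio generation, the sketch below applies a text transformation and saves the result next to the original file. `save_edited_transcript` is a hypothetical helper, not a Podcastfy function, and the `_edited` suffix is an arbitrary choice.

```python
from pathlib import Path


def save_edited_transcript(transcript_file, edit, suffix="_edited"):
    """Apply edit(text) to a transcript and save the result next to the original.

    Hypothetical helper: edit is any function that takes and returns a string.
    Returns the path of the new file; the original file is left untouched.
    """
    source = Path(transcript_file)
    target = source.with_name(f"{source.stem}{suffix}{source.suffix}")
    target.write_text(edit(source.read_text(encoding="utf-8")), encoding="utf-8")
    return str(target)


# Usage, assuming transcript_file comes from a transcript_only=True run:
# edited_file = save_edited_transcript(
#     transcript_file, lambda text: text.replace("Podcastify", "Podcastfy")
# )
```

The returned path can then be supplied through the `transcript_file` parameter of `generate_podcast`.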

## Generate audio from transcript

Users can generate audio from a given transcript. Here, we generate a podcast from a previously generated transcript. This allows users to reuse previously generated transcripts or provide their own custom transcript for podcast generation.

In [5]:
# Generate podcast from existing transcript file
transcript_file = "./data/transcripts/transcript_e1828e3bfe4b4f2aa6fef10c83ee4770.txt"
audio_file_from_transcript = generate_podcast(
    transcript_file=transcript_file,
    tts_model="openai"
)

2025-01-06 15:59:02,707 - podcastfy.client - INFO - Using transcript file: ./data/transcripts/transcript_e1828e3bfe4b4f2aa6fef10c83ee4770.txt


Generating podcast...

2025-01-06 16:02:35,752 - podcastfy.client - INFO - Podcast generated successfully using openai TTS model



In [6]:
print(audio_file_from_transcript)

./data/audio/podcast_a75ac637d5064c2cacbb5b8cedbfed90.mp3


In [8]:
# Embed the audio file generated from transcript
embed_audio(audio_file_from_transcript)

Audio player embedded for: ./data/audio/podcast_a75ac637d5064c2cacbb5b8cedbfed90.mp3


Keep in mind that input transcripts must have the default structure of alternating rounds of discussions tagged with `<Person1>` and `<Person2>`. See previously generated transcripts for reference.
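A quick structural check can catch malformed transcripts before they reach the TTS step. The sketch below is hypothetical and rests on two assumptions not stated above: that each turn is closed with a matching `</Person1>` or `</Person2>` tag, and that the conversation opens with `<Person1>`. It does not inspect text outside the tags.

```python
import re

# Assumes XML-style turns such as <Person1>...</Person1>.
TURN_PATTERN = re.compile(r"<(Person[12])>.*?</\1>", re.DOTALL)


def check_alternating_speakers(transcript):
    """Return True if tagged turns start with Person1 and strictly alternate."""
    speakers = [match.group(1) for match in TURN_PATTERN.finditer(transcript)]
    if not speakers or speakers[0] != "Person1":
        return False
    # No two consecutive turns may come from the same speaker.
    return all(a != b for a, b in zip(speakers, speakers[1:]))
```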

## Generate podcast from pdf

One or many PDFs can be processed in the same way as URLs by simply passing the corresponding file paths.

In [None]:
audio_file_from_pdf = generate_podcast(urls=["./data/pdf/s41598-024-58826-w.pdf"], 
                                       tts_model="geminimulti")

This is a Scientific Reports article about climate change in France. Let's listen to this short-form podcast:

In [None]:
# Embed the audio file generated from the PDF
embed_audio(audio_file_from_pdf)
#./data/audio/podcast_8047c0e9a6c241a38399080cc0d3a084.mp3
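Because several PDFs can be passed at once, a folder of documents can be collected into the `urls` list. A minimal sketch follows; `list_pdfs` is a hypothetical helper and the `./data/pdf` folder in the usage comment is only an example path.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return the paths of all .pdf files directly inside directory, sorted by name."""
    return sorted(str(path) for path in Path(directory).glob("*.pdf"))


# Usage (commented out because it calls external LLM and TTS services):
# audio_file_from_pdfs = generate_podcast(urls=list_pdfs("./data/pdf"))
```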

## Generate podcast from raw text

Users can generate a podcast from raw text input.


In [None]:
raw_text = "The wonderful world of LLMs."
audio_file_from_raw_text = generate_podcast(text=raw_text,
                                            tts_model="geminimulti")
# Embed the audio file generated from raw text
embed_audio(audio_file_from_raw_text)
#./data/audio/podcast_093cfede64834497851030aea68b8cc1.mp3

Note that if input text is short, the generated podcast may be too short to be interesting. Further, generating a longform podcast from short input text may lead to low-quality conversations.

## Generate podcast from topic

Users can also generate a podcast from a specific topic of interest, e.g. "Latest News in U.S. Politics" or "Modern art in the 1920s". Podcastfy will generate a podcast based on *grounded* real-time information about the most recent content published on the web about the topic.

In [None]:
audio_file_from_topic = generate_podcast(topic="Latest news about OpenAI",
                                          tts_model="geminimulti")

In [None]:
embed_audio(audio_file_from_topic)
#./data/audio/podcast_e5424eae32ba428f87f808dda101b4eb.mp3

The generated conversation captures the rapid pace of OpenAI's developments as of today (11/16/2024) including leaked e-mail between Elon Musk and Sam Altman.

The difference between generating a podcast from a topic and from raw text is that a topic-based podcast is more likely to be grounded in real-time events and news, whereas a raw text-based podcast may not be as current or relevant. Note that a topic-based podcast won't necessarily generate a conversation about the most recent events; instead, it will consider the most relevant results from a web search. To generate a conversation about recent events, add that information to the query to properly instruct the LLM, for instance by adding "Latest News on..." to it.

## Generate podcast from images

Images can be provided as input to generate a podcast. This feature is currently only supported for shortform podcasts. 

This can be useful when users want to generate a podcast from images such as works of art, physical spaces, historical events, etc. One or many images can be provided as input. The following example generates a podcast from two images: Senecio, 1922 (Paul Klee) and Connection of Civilizations (2017) by Gheorghe Virtosu.


In [None]:
# Generate podcast from input images
image_paths = [
        "https://raw.githubusercontent.com/souzatharsis/podcastfy/refs/heads/main/data/images/Senecio.jpeg",
        "https://raw.githubusercontent.com/souzatharsis/podcastfy/refs/heads/main/data/images/connection.jpg",
]

audio_file_from_images = generate_podcast(image_paths=image_paths, 
                                          tts_model="geminimulti")

print("Podcast generated from images:", audio_file_from_images)

Here is the generated podcast, which we have pre-saved in the data directory.

In [None]:
# Embed the audio file generated from images
embed_audio(audio_file_from_images)

## Conversation Customization

Podcastfy offers a range of customization options to tailor your AI-generated podcasts. Whether you're creating educational content, storytelling experiences, or anything in between, these configuration options allow you to fine-tune your podcast's tone, style, and format.
See [Conversation Configuration](usage/conversation_custom.md) for more details.


In [None]:
# Example: In-depth Tech Debate Podcast

# Define a custom conversation config for a tech debate podcast
tech_debate_config = {
    'conversation_style': ['Engaging', 'Fast-paced', 'Enthusiastic', 'Educational'], 
    'roles_person1': 'Interviewer', 
    'roles_person2': 'Subject matter expert', 
    'dialogue_structure': ['Topic Introduction', 'Summary of Key Points', 'Discussions', 'Q&A Session', 'Farewell Messages'], 
    'podcast_name': 'Supernova Podcast', 
    'podcast_tagline': 'The future of intelligence', 
    'output_language': 'English', 
    'user_instructions': 'Make it fun and engaging', 
    'engagement_techniques': ['Rhetorical Questions', 'Personal Testimonials', 'Quotes', 'Anecdotes', 'Analogies', 'Humor'], 
    'creativity': 0.75
}

# Generate a tech debate podcast about artificial intelligence
tech_debate_podcast = generate_podcast(
    urls=["https://en.wikipedia.org/wiki/Artificial_intelligence", 
          "https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence"],
    conversation_config=tech_debate_config,
    tts_model="geminimulti"
)

print("Tech Debate Podcast generated:", tech_debate_podcast)


In [None]:
embed_audio(tech_debate_podcast)

## Multilingual Support

Multilingual transcript generation is supported by default simply by setting the `output_language` parameter to the desired language in the `conversation_config` dictionary. ElevenLabs is the recommended TTS model for non-English languages. Description of how to generate non-English content can be found in [How to generate non-English content](usage/how-to.md#how-to-generate-non-english-content). See [Notes of Caution](usage/conversation_custom.md#notes-of-caution) before starting to customize to avoid unexpected results. Here are a couple of audio examples:

### French (fr)

Generates a podcast about the [AgroClim website](https://agroclim.inrae.fr/), a French Government service unit that studies the climate and its impacts on agroecosystems.

In [None]:
embed_audio("./data/audio/podcast_FR_AGRO.mp3")

### Portuguese (pt-br)

Generates a podcast in Brazilian Portuguese from a news article on the most recent voting polls on [Sao Paulo's 2024 Elections](https://noticias.uol.com.br/eleicoes/2024/10/03/nova-pesquisa-datafolha-quem-subiu-e-quem-caiu-na-disputa-de-sp-03-10.htm).

In [None]:
embed_audio("./data/audio/podcast_thatupiso_BR.mp3")
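To produce non-English output yourself, the `output_language` key of `conversation_config` is the setting to change. The sketch below derives a per-language config from a base config without mutating it. `with_output_language` is a hypothetical helper, and spelling the language out as "French" is an assumption based on the 'English' value used in the customization example.

```python
def with_output_language(conversation_config, language):
    """Return a copy of conversation_config with output_language set to language."""
    updated = dict(conversation_config)
    updated["output_language"] = language
    return updated


base_config = {"podcast_name": "Supernova Podcast", "output_language": "English"}
french_config = with_output_language(base_config, "French")

# Usage (commented out because it calls external LLM and TTS services):
# audio_file = generate_podcast(
#     urls=["https://agroclim.inrae.fr/"],
#     conversation_config=french_config,
#     tts_model="elevenlabs",
# )
```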

## Custom LLM Support

Podcastfy offers a range of LLM models for generating transcripts including OpenAI, Anthropic, Google as well as local LLM models.

### Cloud-based LLMs

Podcastfy uses Gemini 1.5 Pro as the default LLM model for transcript generation. To select a particular cloud-based LLM model, users can pass the `llm_model_name` and `api_key_label` parameters to the `generate_podcast` function.

For example, to use OpenAI's `gpt-4-turbo` model, users can pass `llm_model_name="gpt-4-turbo"` and `api_key_label="OPENAI_API_KEY"`. Note that the `api_key_label` is not the API key value itself, but rather its label as contained in your `.env` file.

In [None]:
"""Test generating a podcast with a custom LLM model."""
urls = ["https://en.wikipedia.org/wiki/Artificial_intelligence"]

audio_file = generate_podcast(
    urls=urls,
    tts_model="edge",
    llm_model_name="gpt-4-turbo",
    api_key_label="OPENAI_API_KEY" # Note that this is not the API KEY value itself, but instead the label as contained in your .env file
)

Remember to have the correct API key label and value in your environment variables (`.env` file).
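A small pre-flight check can surface a missing key before any generation starts. The helper below is a hypothetical sketch, not part of Podcastfy. It only inspects `os.environ`, so it assumes the values from your `.env` file have already been loaded into the process environment.

```python
import os


def require_api_key(api_key_label):
    """Return api_key_label if an environment variable with that name is set.

    Hypothetical pre-flight check. It inspects os.environ only, so values
    kept in a .env file must already be loaded into the environment.
    """
    if not os.environ.get(api_key_label):
        raise RuntimeError(
            f"Environment variable {api_key_label!r} is missing or empty"
        )
    return api_key_label


# Usage (commented out because it calls external LLM and TTS services):
# audio_file = generate_podcast(
#     urls=urls,
#     llm_model_name="gpt-4-turbo",
#     api_key_label=require_api_key("OPENAI_API_KEY"),
# )
```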

Note: The package has been tested with the default Gemini 1.5 Pro model. Quality of results may vary when using other LLM models. Additionally, other LLM models may not follow the given instructions properly and/or may not generate valid conversations, which may lead to failures in audio generation, for instance if they inadvertently generate wrong TTS tags. Please report any issues on [GitHub](https://github.com/souzatharsis/podcastfy/issues). We'd appreciate any feedback to improve multi-LLM support in the product!


### Local LLM Support

We enable serving local LLMs with llamafile. In the API, local LLM support is available through the `is_local` parameter. If `is_local=True`, then a local (llamafile) LLM model is used to generate the podcast transcript. Llamafiles of LLM models can be found on [HuggingFace, which today offers 156+ models](https://huggingface.co/models?library=llamafile).

All you need to do is:

1. Download a llamafile from HuggingFace
2. Make the file executable
3. Run the file

Here's a simple bash script that shows all 3 setup steps for running TinyLlama-1.1B locally:

```bash
# Download a llamafile from HuggingFace
wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Make the file executable. On Windows, instead just rename the file to end in ".exe".
chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile

# Start the model server. Listens at http://localhost:8080 by default.
./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser
```
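Before calling `generate_podcast`, it can help to confirm that the server is actually accepting connections. The sketch below is a hypothetical helper built on the standard library; the default host and port simply mirror the `http://localhost:8080` address mentioned in the script comment. It only tests that the TCP port is open, not that a model is loaded and responding.

```python
import socket


def is_server_listening(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage:
# if not is_server_listening():
#     print("Start the llamafile server first.")
```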

Now you can use the local LLM to generate a podcast transcript (or audio) by setting the `is_local` parameter to `True`.

In [None]:
# Generate a podcast using a local LLM (requires the llamafile server started above)
generate_podcast(
    urls=["www.souzatharsis.com"],
    is_local=True  # Using a local LLM
)

We have pre-saved the generated podcast in the data directory.

In [None]:
# Open and print the first 500 characters of the local model transcript
with open('./data/transcripts/transcript_local_model.txt', 'r') as file:
    content = file.read(500)
    print(content)

### Notes of caution

When using local LLM models versus widely known private large language models:

1. Performance: Local LLMs often have lower performance compared to large private models due to size and training limitations.

2. Resource requirements: Running local LLMs can be computationally intensive, requiring significant CPU/GPU resources.

3. Limited capabilities: Local models may struggle with complex tasks or specialized knowledge that larger models handle well.

4. Reduced multimodal abilities: Local LLMs are assumed to be text-only.

5. Potential instability: Local models may produce less consistent or stable outputs than well-tested private models, often producing transcripts that cannot be used for podcast generation (TTS) out of the box.

6. Limited context window: Local models often have smaller context windows, limiting their ability to process long inputs.

Always evaluate the trade-offs between using local LLMs and private models based on your specific use case and requirements. We highly recommend extensively testing your local LLM before productionizing end-to-end podcast generation and/or manually checking the transcript before passing it to the TTS model.
