# Creating an Audiobook from a PDF
Author: Mohamed Oussama NAJI

Date: March 27, 2024

## Introduction

This notebook demonstrates how to create an audiobook from a PDF file. The process involves extracting the text from the PDF, cleaning it, converting it to speech, and saving the result as an MP3 file that can then be played. The task exercises two skills: extracting text from PDF files and text-to-speech conversion.


## Table of Contents
1. [Extracting Text from PDF](#extracting-text-from-pdf)
   - [Installing PyPDF2 Library](#installing-pypdf2)
   - [Importing PyPDF2](#importing-pypdf2)
   - [Extracting the Text](#extracting-text)
   - [Printing the Extracted Text](#printing-extracted-text)
2. [Converting Text into Speech](#converting-text-into-speech)
   - [Installing gTTS Library](#installing-gtts)
   - [Importing gTTS](#importing-gtts)
   - [Initializing a Speaker Object](#initializing-speaker)
   - [Converting the Text](#converting-text)
   - [Saving the Audio](#saving-audio)
3. [Conclusion](#conclusion)


## Extracting Text from PDF <a id="extracting-text-from-pdf"></a>

### Installing PyPDF2 Library <a id="installing-pypdf2"></a>


In [None]:
!pip install PyPDF2

### Importing PyPDF2 <a id="importing-pypdf2"></a>

In [None]:
import PyPDF2

### Extracting the Text <a id="extracting-text"></a>

In [None]:
# Open the PDF file in binary mode
pdf = open('book.pdf', 'rb')

# Create a PDF file reader object
pdf_reader = PyPDF2.PdfReader(pdf)

# Extract the text of every page, replacing line breaks within each page with spaces
page_texts = []
for page in pdf_reader.pages:
    page_text = page.extract_text() or ''
    page_texts.append(page_text.replace('\n', ' ').strip())

# Join all pages so that the whole document, not only the last page, is converted to speech
text = ' '.join(page_texts)

### Printing the Extracted Text <a id="printing-extracted-text"></a>

In [None]:
# Print the extracted text
print(text)
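
The cleaning step above only replaces newlines with spaces. As a rough sketch of slightly more thorough cleaning, the helper below (the name `clean_extracted_text` is hypothetical, not part of PyPDF2) also rejoins words hyphenated across line breaks and collapses repeated whitespace. It assumes that a hyphen at the end of a line always marks a split word, which is not true for genuine compounds such as "well-known" that happen to break at the hyphen.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Tidy text extracted from a PDF page before sending it to a speech engine."""
    # Rejoin words hyphenated across a line break: "conver-\nsion" -> "conversion"
    joined = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', raw)
    # Treat the remaining line breaks as ordinary spaces
    joined = joined.replace('\n', ' ')
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r'\s+', ' ', joined).strip()

sample = "Text-to-speech conver-\nsion turns   written\ntext into audio. "
print(clean_extracted_text(sample))
```

The helper could be applied to each page's text in place of the `replace('\n', ' ').strip()` call.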

## Converting Text into Speech <a id="converting-text-into-speech"></a>

### Installing gTTS Library <a id="installing-gtts"></a>


In [None]:
!pip install gTTS

### Importing gTTS <a id="importing-gtts"></a>

In [None]:
from gtts import gTTS

### Initializing a Speaker Object <a id="initializing-speaker"></a>

In [None]:
# Initialize the speaker object (speech engine)
speaker = gTTS(text)

### Converting the Text <a id="converting-text"></a>

In [None]:
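# The text was already passed to the constructor, so assigning it again is optional.
# gTTS performs the actual conversion when the audio is saved in the next step.
speaker.text = text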

### Saving the Audio <a id="saving-audio"></a>

In [None]:
# Save the audio to a file
output_file = "audiobook.mp3"
speaker.save(output_file)
print(f"Speech saved as {output_file}")

# Close the PDF file
pdf.close()
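
For a long book, it may be more practical to save several smaller audio files than one large one, since gTTS relies on a network service and a failure partway through would otherwise lose the whole run. The sketch below splits text into chunks at sentence boundaries. The helper name `split_into_chunks` and the default limit of 5000 characters are illustrative assumptions, not gTTS requirements.

```python
import re

def split_into_chunks(text: str, max_chars: int = 5000) -> list[str]:
    """Group sentences into chunks of at most max_chars characters where possible."""
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

parts = split_into_chunks("First sentence. Second one! A third?", max_chars=20)
print(parts)
```

Each chunk could then be passed to `gTTS` and saved under its own file name, for example one MP3 per chunk. A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.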

## Conclusion <a id="conclusion"></a>
In this notebook, we successfully created an audiobook from a PDF file. The process involved the following steps:
1. Extracting text from the PDF using the PyPDF2 library. We opened the PDF file, created a PDF reader object, and looped through each page to extract the text. The extracted text was cleaned by replacing newline characters with spaces and stripping surrounding whitespace.
2. Converting the extracted text into speech using the gTTS (Google Text-to-Speech) library. We initialized a speaker object, set the text to be converted, and then saved the generated speech as an audio file in MP3 format.

This notebook shows how text extraction from PDFs can be combined with text-to-speech conversion to create audiobooks, making written content available in audio form to a wider audience. Possible improvements include:
- Handling complex PDF layouts and extracting text more accurately.
- Applying additional text cleaning and preprocessing techniques to improve the quality of the extracted text.
- Customizing the speech synthesis parameters to achieve better audio quality and more natural-sounding speech.
- Implementing a user-friendly interface for selecting PDF files and customizing audiobook settings.
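
As one example of customizing the speech synthesis parameters, the sketch below wraps gTTS in a small helper. The function name `synthesize` is hypothetical. The `lang`, `slow`, and `tld` arguments are gTTS constructor parameters, and the values shown are only illustrative choices.

```python
def synthesize(text: str, output_file: str, lang: str = 'en',
               slow: bool = False, tld: str = 'com') -> str:
    """Save text as an MP3 using gTTS with explicit voice settings."""
    # Imported inside the function so the helper can be defined before gTTS is installed
    from gtts import gTTS

    # lang selects the language, slow reads at a reduced speed,
    # and tld picks the Google domain, which influences the accent
    speaker = gTTS(text=text, lang=lang, slow=slow, tld=tld)
    speaker.save(output_file)
    return output_file

# Example call (requires gTTS and network access), assuming `text` holds the extracted text:
# synthesize(text, "audiobook_uk.mp3", lang='en', tld='co.uk')
```

Keeping the settings as function arguments makes it easy to try different voices without editing the conversion code itself.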

Overall, this notebook serves as a starting point for creating audiobooks from PDF files and can be extended and refined based on specific requirements and preferences.