<a href="https://colab.research.google.com/github/kazimali07/ChatPDF/blob/main/Copy_of_URLs_to_PDF_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [SEO Notebook](https://seonotebook.com/) HTML to Merged PDF Converter


This notebook converts a list of URLs into a single merged PDF file. You upload a text file containing the URLs; the notebook then fetches each URL, extracts its textual content, and compiles everything into one PDF.

## Setup
### Install Necessary Packages

In [None]:
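# Install the wkhtmltopdf system binary that pdfkit calls
!apt-get install --fix-missing -y wkhtmltopdf

# Install the Python libraries used below (pdfkit only needs the system binary above)
!pip install pdfkit html2text beautifulsoup4 lxml requests tqdm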

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  avahi-daemon bind9-host bind9-libs geoclue-2.0 glib-networking glib-networking-common
  glib-networking-services gsettings-desktop-schemas iio-sensor-proxy libavahi-core7 libavahi-glib1
  libdaemon0 libevdev2 libfontenc1 libgudev-1.0-0 libhyphen0 libinput-bin libinput10
  libjson-glib-1.0-0 libjson-glib-1.0-common liblmdb0 libmaxminddb0 libmbim-glib4 libmbim-proxy
  libmd4c0 libmm-glib0 libmtdev1 libnl-genl-3-200 libnotify4 libnss-mdns libproxy1v5 libqmi-glib5
  libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5 libqt5network5 libqt5positioning5
  libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5 libqt5svg5
  libqt5webchannel5 libqt5webkit5 libqt5widgets5 libsoup2.4-1 libsoup2.4-common libudev1
  libwacom-bin libwacom-common libwacom9 libwoff1 libxcb-icccm4 libxcb-image0 libxcb-keysyms1
  libxcb-render-util0 libx
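
Because the later PDF step depends on the `wkhtmltopdf` binary installed above, it can help to confirm the install succeeded before processing a long URL list. The following is a minimal sketch using only the standard library; the helper name `find_wkhtmltopdf` is hypothetical and not part of the notebook.

```python
import shutil


def find_wkhtmltopdf(binary_name="wkhtmltopdf"):
    """Return the absolute path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


# Report whether the binary is available in this runtime
path = find_wkhtmltopdf()
if path is None:
    print("wkhtmltopdf was not found on PATH; re-run the install cell.")
else:
    print(f"wkhtmltopdf found at {path}")
```

If a path is found, it could be passed to `pdfkit.configuration(wkhtmltopdf=path)` rather than relying on a hard-coded location.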

### Import Libraries

In [None]:
import os
from google.colab import files
import time
import requests
import pdfkit
import html2text
from bs4 import BeautifulSoup
from tqdm import tqdm
from io import BytesIO

## File Input
### Upload Your URLs File Here
Upload a text file containing the list of URLs you want to process. The file can have any name, but it should contain one URL per line.

In [None]:
# Upload the input file
print("Please upload your text file containing the list of URLs.")
input_data = files.upload()

# Collect the URLs from every uploaded file (one URL per line, blanks skipped)
input_urls = []
for filename, content in input_data.items():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=filename, length=len(content)))

    # files.upload() returns each file's raw bytes, so decode them directly
    text = content.decode('utf-8-sig')
    input_urls.extend(line.strip() for line in text.splitlines() if line.strip())

Please upload your text file containing the list of URLs.


TypeError: 'NoneType' object is not subscriptable
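
The upload step expects one URL per line. As a rough illustration of how that format could be validated outside Colab, the sketch below parses raw bytes into a clean URL list. The function name `parse_url_list` is hypothetical. The duplicate removal and the strict `http`/`https` check are assumptions that go beyond what the notebook itself does.

```python
from urllib.parse import urlparse


def parse_url_list(raw_bytes, encoding="utf-8-sig"):
    """Decode an uploaded file and return its unique http(s) URLs in order."""
    text = raw_bytes.decode(encoding, errors="replace")
    seen = set()
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parts = urlparse(candidate)
        # Keep only absolute http(s) URLs that include a host name
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls


sample = (
    b"https://example.com/a\n\nnot a url\n"
    b"https://example.com/a\nhttp://example.org/b\n"
)
print(parse_url_list(sample))
# → ['https://example.com/a', 'http://example.org/b']
```

The `utf-8-sig` codec is used so that a byte-order mark at the start of the file, which some text editors add, does not end up glued to the first URL.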

## Processing
From this point onwards, your file is processed without any further user input or intervention. Monitor the progress bar under the processing cell to follow along.

In [None]:
# Initialize html2text converter
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text_maker.ignore_images = True

In [None]:
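def fetch_url_content(url):
    """Return the HTML of `url` as text, or None if the request fails."""
    # Some sites reject requests that lack a browser-like User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return response.text
    except Exception as e:
        # Report the problem and let the caller skip this URL
        print(f'Error fetching {url}: {e}')
        return None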

def convert_html_to_text(html_content):
    return text_maker.handle(html_content)
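
`fetch_url_content` gives up after a single failed request. For flaky sites, one possible extension is a small retry wrapper with exponential backoff. This is only a sketch: `fetch_with_retries`, its parameters, and the `flaky_fetch` stand-in are hypothetical names introduced here, and the wrapper works with any fetch function that returns `None` on failure.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns something other than None.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    Returns None if every attempt fails.
    """
    for attempt in range(attempts):
        content = fetch(url)
        if content is not None:
            return content
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Demonstration with a stand-in fetcher that fails twice, then succeeds
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    return "<p>ok</p>" if calls["n"] >= 3 else None


result = fetch_with_retries(
    flaky_fetch, "https://example.com", base_delay=0, sleep=lambda s: None
)
print(result)  # → <p>ok</p>
```

In the notebook, the call would look like `fetch_with_retries(fetch_url_content, url)`, leaving the original function unchanged.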


In [None]:
descriptions = []

print("Processing URLs...")
for url in tqdm(input_urls, desc='Processing', unit='url'):
    if not url.startswith('http'):
        continue

    html_content = fetch_url_content(url)
    if html_content is None:
        continue

    # Parse HTML content
    soup = BeautifulSoup(html_content, 'lxml')
    text_content = convert_html_to_text(str(soup))

    # Prepare the description
    descriptions.append('{{<br>' + f'#{url} for article below:<br>')
    for line in text_content.split('\n'):
        line = ' '.join(line.split())
        if line and line.strip('#').strip():
            descriptions.append(line)
    descriptions.append('}}<br><br>')
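
The loop above places extracted text directly into an HTML string. If a page's text contains characters such as `<` or `&`, the PDF renderer may interpret them as markup. The sketch below shows one way to guard against that with the standard library's `html.escape`. The helper `build_article_block` is a hypothetical refactoring of the loop body, not the notebook's own code.

```python
import html


def build_article_block(url, text_content):
    """Return the HTML lines for one article, with text escaped for HTML."""
    lines = ['{{<br>' + f'#{html.escape(url)} for article below:<br>']
    for line in text_content.split('\n'):
        line = ' '.join(line.split())  # collapse runs of whitespace
        # Skip empty lines and lines made only of '#' characters
        if line and line.strip('#').strip():
            lines.append(html.escape(line))
    lines.append('}}<br><br>')
    return lines


block = build_article_block(
    "https://example.com/?a=1&b=2",
    "# Title\n\nUse a < b & c > d\n###\n",
)
for item in block:
    print(item)
```

With this helper, the loop body would reduce to `descriptions.extend(build_article_block(url, text_content))`.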

## File Output
This part of the code takes care of combining the extracted text into a single PDF file and downloading it.

In [None]:
# Combine all descriptions
full_description = '<br>'.join(descriptions)

# Configure pdfkit with the path to wkhtmltopdf
config = pdfkit.configuration(wkhtmltopdf='/usr/bin/wkhtmltopdf')

# Treat the input as UTF-8 so non-ASCII characters are not garbled in the PDF
options = {'encoding': 'UTF-8'}

# Generate the PDF file with progress bar
print("Generating PDF...")
with tqdm(total=1, desc='PDF Generation') as pbar:
    output_pdf = 'merged_pdf_file.pdf'
    pdfkit.from_string(full_description, output_pdf, options=options, configuration=config)
    pbar.update(1)

# Download the PDF file
print("Downloading the PDF file...")
files.download(output_pdf)
print('ALL DONE! If your PDF file does not download automatically, click the "Files" Icon in the sidebar and find the file named "merged_pdf_file.pdf", then click on the three dots > Download.')
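
The string handed to pdfkit in the cell above is an HTML fragment joined with `<br>` tags. One optional refinement is to wrap it in a complete HTML document that declares its character set and sets a readable base font. The sketch below assumes a hypothetical helper named `wrap_html_document`; the title and CSS values are illustrative choices only.

```python
import html


def wrap_html_document(body_html, title="Merged articles"):
    """Wrap an HTML fragment in a minimal, self-describing HTML document."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "<style>body { font-family: sans-serif; font-size: 12pt; "
        "line-height: 1.4; }</style>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n</html>\n"
    )


document = wrap_html_document("{{<br>#https://example.com for article below:<br>Hello<br>}}")
print(document)
```

The wrapped string could then be passed to `pdfkit.from_string` in place of `full_description`.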