# Appending Many HTML Files Into One in Python (w/ BeautifulSoup)

By: [Paul Jeffries](https://twitter.com/ByPaulJ) 

## Example Use Case:

For this example, I simply appended two copies of the example HTML document [from RStudio's website](https://rmarkdown.rstudio.com/gallery.html), which happens to be a [series of vignettes involving charts made by the NYT](http://timelyportfolio.github.io/rCharts_nyt_home_price/). In practice, this code, or an adaptation of it, might prove helpful wherever you need to go from many .html files to one. One example is compiling many .html files that were output by an RMarkdown-centric process where using Bookdown was not feasible. The benefit of this approach over a Unix scripting solution or some other alternative is that if you have a table of contents, this process will preserve it.

Whatever your desired use case, I hope you find this code helpful or informative in some way!

In [1]:
import datetime
# prints the present date and time as a form of log
print("This notebook was last run: ", datetime.datetime.now())

This notebook was last run:  2019-03-31 16:58:26.882320


In [2]:
# needed packages
from bs4 import BeautifulSoup
import copy
import glob

In [3]:
# pulls in all files at specified location ending in .html
# you can sort many ways, I just chose alphabetical (default)
list_of_files_to_append = sorted(glob.glob('*.html'))

# prints the list of paths to test that the glob call worked
print(list_of_files_to_append)

['html_example_file1.html', 'html_example_file2.html.html', 'test_html_master_file.html']
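
The comment in the cell above mentions that you can sort the files many ways. As one possible alternative to plain alphabetical order, here is a minimal sketch of a numeric-aware ("natural") sort, which may be useful if your files are numbered without zero-padding. The helper `natural_sort_key` and the file names are hypothetical and are not part of the original notebook.

```python
import re

def natural_sort_key(path):
    """Split a path into text and integer chunks so 'file10' sorts after 'file2'."""
    # re.split with a capturing group keeps the digit runs in the result
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', path)]

# hypothetical file names, used only for illustration
example_paths = ['chapter10.html', 'chapter2.html', 'chapter1.html']

alphabetical = sorted(example_paths)                     # default string ordering
natural = sorted(example_paths, key=natural_sort_key)    # numeric-aware ordering

print(alphabetical)
print(natural)
```

With plain alphabetical sorting, `chapter10.html` lands before `chapter2.html`; the key function above is one way to avoid that if the append order matters for your final document.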


In [4]:
# building a for loop to iteratively append all of the HTML files in the target directory

# creating an empty list to store the indices of files that fail in the for loop below
bad_list = []

# kicks off the main for loop
for i, path in enumerate(list_of_files_to_append):
    # prints out what iteration we're on, mostly for troubleshooting
    print(i)
    # wrapping the rest in a try/except block for error-catching
    try:
        # read and parse the i-th file; the with block closes the file handle afterwards
        with open(path) as html_file:
            present_soup = BeautifulSoup(html_file.read(), 'lxml')
        # if it's the first run through, initialize final_soup as the soup from the first html file
        if i == 0:
            final_soup = present_soup
        # otherwise, copy each element in the body of present_soup onto the end of final_soup's body
        else:
            for element in present_soup.body:
                final_soup.body.append(copy.copy(element))
    # if anything breaks, we append the iteration number to the bad_list so we can see where we failed
    except Exception:
        bad_list.append(i)
        continue

0
1
2
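
To make the core of the loop easier to see in isolation, here is a small sketch of the same copy-and-append idea applied to two in-memory HTML strings instead of files. The strings and variable names are made up for illustration, and I use the built-in `'html.parser'` here only so the snippet does not depend on `lxml`; the notebook itself uses `'lxml'`.

```python
import copy
from bs4 import BeautifulSoup

# hypothetical stand-ins for the contents of two .html files
first_html = "<html><head><title>Doc 1</title></head><body><h1>One</h1><p>first</p></body></html>"
second_html = "<html><head><title>Doc 2</title></head><body><h1>Two</h1><p>second</p></body></html>"

# assumption: 'html.parser' (ships with Python) instead of the notebook's 'lxml'
final_soup_demo = BeautifulSoup(first_html, 'html.parser')
second_soup = BeautifulSoup(second_html, 'html.parser')

# copy each top-level child of the second body onto the end of the first body;
# copying leaves second_soup itself untouched while we iterate over it
for element in second_soup.body:
    final_soup_demo.body.append(copy.copy(element))

merged_headings = [h1.get_text() for h1 in final_soup_demo.find_all('h1')]
print(merged_headings)
print(final_soup_demo.title.get_text())
```

Note that only the `<body>` contents of the later documents are carried over; the `<head>` (title, styles, scripts) of the merged result comes from the first document alone.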


In [5]:
# iterating over all captured errors and printing the names of the files that failed to append
# otherwise, we get a message saying all is well

# checks if list object is empty
if not bad_list:
    print("Congratulations! No files failed to append.")
else:
    for bad_index in bad_list:
        print("One file that failed to be appended is: ",
              # splits on '/' and takes the last segment, i.e. the file name
              list_of_files_to_append[bad_index].split('/')[-1])

Congratulations! No files failed to append.
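
The error list above only records which files failed, not why. If you also want the reason, one option is to store the exception message alongside the path. The sketch below is an assumption about how that could look, not something the original notebook does; `try_each` and `read_text` are hypothetical helpers, and the file name is deliberately one that should not exist.

```python
def try_each(paths, action):
    """Apply action to each path, collecting (path, error message) pairs for failures."""
    failures = []
    for path in paths:
        try:
            action(path)
        except Exception as error:
            # keep the exception type and message so the cause is visible later
            failures.append((path, f"{type(error).__name__}: {error}"))
    return failures

def read_text(path):
    """Read a file and return its contents; raises if the file is missing."""
    with open(path) as handle:
        return handle.read()

# hypothetical path that is assumed not to exist, to trigger a failure
failures = try_each(['this_file_should_not_exist_12345.html'], read_text)
for path, message in failures:
    print("Failed:", path, "->", message)
```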


In [6]:
# write the resulting single master html file locally
with open("test_html_master_file.html", 'w') as file:
    file.write(str(final_soup))
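
One thing to watch for: the file list printed earlier includes `test_html_master_file.html`, which is the same name written in the cell above. That suggests that if the output lives in the same directory as the inputs, a re-run of the notebook will pick up the previous master file as one more input. A minimal sketch of one way to guard against that is below; `exclude_output_file` is a hypothetical helper, not part of the original code.

```python
import os

def exclude_output_file(paths, output_path):
    """Return paths without any entry that points at output_path."""
    output_abs = os.path.abspath(output_path)
    return [p for p in paths if os.path.abspath(p) != output_abs]

# the list printed earlier in the notebook
found = ['html_example_file1.html',
         'html_example_file2.html.html',
         'test_html_master_file.html']

inputs_only = exclude_output_file(found, 'test_html_master_file.html')
print(inputs_only)
```

Another option would be to write the master file to a separate output directory so the `glob` pattern never matches it.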