# python-docx

python-docx is a Python package for creating and editing Microsoft Word (.docx) documents.
GitHub Codespaces cannot display a .docx file properly (try opening one!), so to view a file you will need to download it and open it on your local machine.
To download, simply right-click the file and select "Download".

First, we need to install the python-docx package. Normally this would be done in the terminal, but in a notebook we can use the `%pip` magic command instead.

In [30]:
%pip install python-docx


[notice] A new release of pip is available: 24.1.2 -> 24.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Then we need to import the libraries we are going to use. Although the package is installed as python-docx, it is imported as `docx`, and we only need its `Document` class. The `os` module is helpful for building file paths and making sure folders exist. We also import `sql` from pyspark for the Spark session used later.

In [31]:
from docx import Document
from pyspark import sql
import os

As in workshop 4, we need to start our own Spark session.

In [32]:
spark = (sql.SparkSession
    .builder
    .appName("pyspark_intro")
    .getOrCreate()
)

# Creating a document

To create a new Word document, we first need to create an instance of the `Document` class. This is what we imported above.

In [33]:
doc = Document()

To add a heading to the document, use the `add_heading` method. Here `level=1` gives a top-level heading; `level=0` would use the Title style instead.

In [34]:
doc.add_heading('Document Title', level=1)

<docx.text.paragraph.Paragraph at 0x7c2258f09ae0>


To add a paragraph to the document, use the `add_paragraph` method.

In [35]:
doc.add_paragraph('This is the first paragraph in the document.')

<docx.text.paragraph.Paragraph at 0x7c2258f0b5e0>

A run is a contiguous stretch of text within a paragraph that shares the same formatting. You can add multiple runs to a paragraph to apply different formatting to different parts of the text.

In [36]:
# Add a paragraph with runs
paragraph = doc.add_paragraph()
run = paragraph.add_run('This is a run of text. ')
run.bold = True
run = paragraph.add_run('This is another run of text. ')
run.italic = True

You can add tables to your document using the `add_table` method.

In [37]:
table = doc.add_table(rows=2, cols=2)
cell = table.cell(0, 0)
cell.text = 'Cell 1,1'
cell = table.cell(0, 1)
cell.text = 'Cell 1,2'
cell = table.cell(1, 0)
cell.text = 'Cell 2,1'
cell = table.cell(1, 1)
cell.text = 'Cell 2,2'
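
The cell-by-cell assignments above can also be written as a loop over a nested list, which scales better to larger tables. This is a minimal sketch: `fill_table` and `StandInTable` are hypothetical names, not part of python-docx, and the stand-in class only mimics the `cell(row, col)` method so the example runs without creating a document.

```python
from types import SimpleNamespace


def fill_table(table, data):
    """Write a list of row lists into an existing table, cell by cell."""
    for row_index, row_values in enumerate(data):
        for col_index, value in enumerate(row_values):
            # Cell text must be a string, so convert numbers and other values
            table.cell(row_index, col_index).text = str(value)


class StandInTable:
    """Hypothetical stand-in exposing only cell(row, col), like a python-docx table."""

    def __init__(self, rows, cols):
        self._cells = [
            [SimpleNamespace(text="") for _ in range(cols)] for _ in range(rows)
        ]

    def cell(self, row, col):
        return self._cells[row][col]


data = [
    ["Cell 1,1", "Cell 1,2"],
    ["Cell 2,1", "Cell 2,2"],
]
demo_table = StandInTable(rows=2, cols=2)
fill_table(demo_table, data)
```

With a real document, the same function should work on the object returned by `doc.add_table(rows=2, cols=2)`, provided the table has at least as many rows and columns as `data`.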

To save the document, use the `save` method. After running this cell, download and open the file.

In [38]:
doc.save('example.docx')

# Accessing an existing document

You can read an existing Word document by passing the file path to the `Document` class.

In [39]:
doc = Document('example.docx')

You can access the paragraphs in a document using the `paragraphs` attribute.

In [40]:
for paragraph in doc.paragraphs:
    print(paragraph.text)

Document Title
This is the first paragraph in the document.
This is a run of text. This is another run of text. 



You can access the tables in a document using the `tables` attribute.

In [41]:
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

Cell 1,1
Cell 1,2
Cell 2,1
Cell 2,2


# Using a template and replacing values in it

We can use an existing document, and replace placeholder values with our own variables.

We first need to load the file. Download and open it so you can see the 'before' state.

In [42]:
template_path = 'inputs/simple_template.docx'
doc = Document(template_path)

To replace text, we need to search through the paragraphs and runs in the document. The following two functions do that, using a combination of for loops and if statements. Note that a placeholder is only replaced when it sits entirely within a single run.

In [43]:
def replace_text_in_paragraph(paragraph, replacements):
    """
    Replace text in a paragraph based on a dictionary of replacements.
    """
    for old_text, new_text in replacements.items():
        if old_text not in paragraph.text:
            continue
        for run in paragraph.runs:
            if old_text in run.text:
                run.text = run.text.replace(old_text, new_text)

In [44]:
def replace_text_in_doc(doc, replacements):
    """
    Replace text in the entire document based on a dictionary of replacements.
    """
    for paragraph in doc.paragraphs:
        replace_text_in_paragraph(paragraph, replacements)
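
One limitation of replacing text run by run is that Word can split text that looks continuous into several runs, so a placeholder such as `NAME` may be stored as `NA` and `ME` and never match. The sketch below shows one possible workaround. `replace_across_runs` is a hypothetical helper, not part of python-docx, and the demo uses `SimpleNamespace` stand-ins that only mimic the `.runs` and `.text` attributes. The trade-off is stated in the comments: when a placeholder spans runs, the whole paragraph text is moved into the first run, so the formatting of later runs is lost.

```python
from types import SimpleNamespace


def replace_across_runs(paragraph, old_text, new_text):
    """Replace old_text in a paragraph, even if it is split across runs."""
    runs = paragraph.runs
    joined = "".join(run.text for run in runs)
    if not old_text or old_text not in joined:
        return

    hits_inside_runs = sum(run.text.count(old_text) for run in runs)
    if hits_inside_runs == joined.count(old_text):
        # Every occurrence sits inside a single run: keep per-run formatting
        for run in runs:
            run.text = run.text.replace(old_text, new_text)
    else:
        # At least one occurrence spans runs: collapse the text into the
        # first run (later runs are emptied and their formatting is lost)
        runs[0].text = joined.replace(old_text, new_text)
        for run in runs[1:]:
            run.text = ""


# Stand-in paragraph whose placeholder NAME is split across two runs
demo_paragraph = SimpleNamespace(
    runs=[SimpleNamespace(text="Dear NA"), SimpleNamespace(text="ME, welcome.")]
)
replace_across_runs(demo_paragraph, "NAME", "John Doe")
print([run.text for run in demo_paragraph.runs])
```

If the template is under your control, typing each placeholder in one go with a single format is usually the simpler way to keep it in one run.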

In [45]:
replacements = {
    'NAME': 'John Doe',
    'FAV_COLOUR': 'blue',
    'COUNTRY': 'Canada'
}

# Replace text in the document
replace_text_in_doc(doc, replacements)

In [46]:
modified_doc_path = 'modified_template.docx'
doc.save(modified_doc_path)
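
`doc.paragraphs` only returns paragraphs in the document body, so placeholders inside table cells are not reached by `replace_text_in_doc`. The sketch below shows one way to visit both. `iter_all_paragraphs` is a hypothetical helper, and the demo document is built from `SimpleNamespace` stand-ins that mimic only the attributes used here (`paragraphs`, `tables`, `rows`, `cells`). Nested tables, headers and footers are not covered.

```python
from types import SimpleNamespace


def iter_all_paragraphs(doc):
    """Yield body paragraphs first, then the paragraphs inside every table cell."""
    yield from doc.paragraphs
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs


def _paragraph(text):
    # Stand-in for a paragraph: only the .text attribute is mimicked
    return SimpleNamespace(text=text)


# Stand-in document: one body paragraph and a 1x1 table
demo_doc = SimpleNamespace(
    paragraphs=[_paragraph("Dear NAME")],
    tables=[
        SimpleNamespace(
            rows=[
                SimpleNamespace(
                    cells=[SimpleNamespace(paragraphs=[_paragraph("COUNTRY")])]
                )
            ]
        )
    ],
)

texts = [paragraph.text for paragraph in iter_all_paragraphs(demo_doc)]
print(texts)
```

In `replace_text_in_doc`, looping over `iter_all_paragraphs(doc)` instead of `doc.paragraphs` would apply the same replacements to table cells as well.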

# Using a csv to replace values and create multiple files

We might want to import a CSV of data and use the information in it to replace the placeholder values. This can be done with a few simple tweaks.

First, we need to read in the data as a Spark DataFrame.

In [None]:
path_to_data = "inputs/data.csv"
spark_df = (spark.read
    .option('header', 'true')
    .csv(path_to_data)
)

spark_df.show()

Then we need functions that replace our placeholders with the data from the CSV. They are very similar to the functions above, but take a DataFrame and create one file per row, whereas before we produced a single file.

In [47]:
def replace_placeholder_in_paragraph(paragraph, placeholder, replacement_text):
    # Iterate through runs in the paragraph
    for run in paragraph.runs:
        if placeholder in run.text:
            # Replace placeholder with replacement text
            run.text = run.text.replace(placeholder, replacement_text)


In [49]:
def generate_documents_from_spark_df(spark_df, template_docx, output_folder, placeholder_mapping):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Collect the DataFrame rows into a list of Row objects
    rows = spark_df.collect()

    # Iterate through each row in the DataFrame
    for row in rows:
        # Reload the template for every row: reusing one Document would keep
        # the first row's values, leaving no placeholders for later rows
        doc = Document(template_docx)

        # Replace placeholders with text
        for para in doc.paragraphs:
            for placeholder, column in placeholder_mapping.items():
                replacement_text = str(row[column])
                replace_placeholder_in_paragraph(para, placeholder, replacement_text)

        # Save the modified document
        output_file = os.path.join(output_folder, f"{row['org_name']}_document.docx")
        doc.save(output_file)
        print(f"Saved: {output_file}")

As before, this is a dictionary, here mapping each placeholder to the column that holds the matching data for each row.

In [50]:
placeholder_mapping = {
    'ORGNAME': 'org_name',
    'MONTH': 'month',
    'YEAR': 'year',
    'TOTAL_ATTENDANCE': 'total_attendance'
}

template_docx = 'inputs/template_document.docx'
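
Because the mapping refers to columns by name, a typo only shows up as an error part-way through generating files. A small check beforehand can catch it early. This is a sketch: `find_missing_columns` is a hypothetical helper, and the column list in the demo is made up for illustration. With a real DataFrame, `spark_df.columns` provides the list of column names.

```python
def find_missing_columns(placeholder_mapping, available_columns):
    """Return the mapped column names that are not in available_columns."""
    available = set(available_columns)
    return sorted(
        {column for column in placeholder_mapping.values() if column not in available}
    )


demo_mapping = {
    'ORGNAME': 'org_name',
    'MONTH': 'month',
    'YEAR': 'year',
    'TOTAL_ATTENDANCE': 'total_attendance',
}

# Illustrative column list with one column deliberately left out
demo_columns = ['org_name', 'month', 'year']

missing = find_missing_columns(demo_mapping, demo_columns)
print(missing)
```

An empty list means every placeholder has a matching column.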

In [51]:
generate_documents_from_spark_df(spark_df, template_docx, 'output_folder', placeholder_mapping)

Saved: output_folder/Leeds General Hospital_document.docx
Saved: output_folder/York Infirmary_document.docx
Saved: output_folder/Queen Mary Hospital_document.docx
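
The file names above are built directly from `org_name`, which works for these rows but would break if a name contained a character such as `/`. The sketch below shows one conservative way to clean a name first. `safe_filename` is a hypothetical helper, and the set of allowed characters is an assumption you may want to adjust.

```python
import re


def safe_filename(name, replacement="_"):
    """Turn an arbitrary string into a conservative file name stem."""
    # Replace each stretch of characters outside letters, digits, '.', '_' and '-'
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, str(name))
    # Drop leading/trailing separators and dots (avoids names like '..')
    cleaned = cleaned.strip(replacement + ".")
    return cleaned or "untitled"


print(safe_filename("Leeds General Hospital"))
print(safe_filename("A/B Trust"))
```

In `generate_documents_from_spark_df`, the output path could then be built from `safe_filename(row['org_name'])` rather than the raw value.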
