# Editing Excel files with Python

You can create, load, modify and save Excel files with a Python library called openpyxl.

# Import Required Libraries
Import the necessary libraries, including openpyxl and pandas.

You'll need to have installed these into your Python environment first with:

`pip install openpyxl`

`pip install pandas`

In [None]:
import openpyxl
import pandas as pd

# Loading an Existing Workbook
Use openpyxl to load an existing Excel file.

In [None]:
# Load an existing workbook
wb = openpyxl.load_workbook('./report_template.xlsx')

The variable `wb` is an openpyxl Workbook object as we can see here:

In [None]:
type(wb)

Just as DataFrames have methods and properties that we can use to query the data inside them (such as `.select()` and `.where()`), openpyxl Workbooks have methods and properties that let us manipulate the Excel file.

Let's start with an easy one - listing the names of the sheets in the workbook.

In [None]:
# Display the sheet names in the workbook
sheet_names = wb.sheetnames
print(sheet_names)

# Accessing Sheets
To edit a sheet in the Excel file, we first need a Worksheet object. Now that we know the names of the sheets in this file, we can use one of them to get the matching Worksheet from the workbook.

You may recognise this syntax - it's the same square bracket syntax we use to select items from a dictionary!

For example, to select the "Title" sheet, we would do the following:

In [None]:
ws_title = wb['Title']

Worksheet objects all have a property called `title` that returns the name of the worksheet as a string. We can use it like any other string, for example to print it out:

In [None]:
print(ws_title.title)
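
Because sheets are looked up by name, asking for a sheet that doesn't exist raises a `KeyError`, just like a dictionary would. Here is a minimal sketch of checking the name first. It uses a small in-memory workbook (`demo_wb`, a stand-in made up for this example) rather than the report template, so it runs on its own:

```python
import openpyxl

# Assumption: demo_wb is a throwaway workbook standing in for the template
demo_wb = openpyxl.Workbook()        # a new workbook starts with one sheet
demo_wb.active.title = 'Title'       # rename that first sheet
demo_wb.create_sheet('Table 1')      # add a second sheet

print(demo_wb.sheetnames)

# Check the name is in the workbook before selecting the sheet
wanted = 'Table 2'
if wanted in demo_wb.sheetnames:
    demo_ws = demo_wb[wanted]
    found = True
else:
    found = False
    print(f'No sheet called {wanted!r} in this workbook')
```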

## Exercise

In the cell below, create a Worksheet object based on the "Table 1" sheet, and print out its title:

# Reading and Writing Cell Values

Once we have created a Worksheet object, we can access the cells within it using the same square bracket syntax.

Cell objects have a property called `value` that we can use to change what the cell contains.

For example, our report's main header is currently "Report title". This is in cell A3 of the "Title" worksheet. If we want to change this, we need to reassign it, like this:

In [None]:
ws_title['A3'].value = 'Hello World!'
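
The same `value` property also reads what a cell currently contains. As a rough sketch, again on an in-memory stand-in workbook (`demo_ws` is a made-up name, not part of the template):

```python
import openpyxl

# Assumption: a blank in-memory worksheet stands in for the "Title" sheet
demo_ws = openpyxl.Workbook().active

# Write a value, then read it back
demo_ws['A3'].value = 'Report title'
before = demo_ws['A3'].value
print(before)

# Overwrite it and read it again
demo_ws['A3'].value = 'Hello World!'

# ws.cell(row, column) reaches the same cell using numbers instead of 'A3'
after = demo_ws.cell(row=3, column=1).value
print(after)
```

The numeric `ws.cell(row=..., column=...)` form is handy inside loops, where building strings like `'A3'` would be awkward.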

After we've made a change, we need to save the file. There is a Workbook method we can use for this.

But of course, we should save it with a different file name, so that we don't overwrite our template.

Note: If you're doing this in Codespaces, you won't be able to open the Excel file directly from here. You'll have to right-click the file in the browser on the left and download it. Alternatively, you can install an extension called "Excel Viewer", although it isn't perfect and doesn't show all the formatting correctly.

In [None]:
wb.save('report_modified.xlsx')
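
If you can't easily open the saved file, one way to check that a change was written is to load the saved file back in and read the cell. This is only a sketch: it saves a stand-in workbook to a temporary folder (an assumption made so the example cleans up after itself), not the real report.

```python
import os
import tempfile

import openpyxl

# Assumption: a blank in-memory workbook stands in for the template
demo_wb = openpyxl.Workbook()
demo_wb.active['A3'].value = 'Hello World!'

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'report_modified.xlsx')

    # Save under a new name, then load the saved file as a new Workbook
    demo_wb.save(out_path)
    reloaded_wb = openpyxl.load_workbook(out_path)
    saved_value = reloaded_wb.active['A3'].value
    reloaded_wb.close()

print(saved_value)
```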

## Exercise

The header for the table at A4 in the "Table 1" worksheet is "Table 1: Here is some data". In the cell below, change this to something else and save the file.

Note: If you get an error when saving, you might not have permissions to overwrite the file. Just delete report_modified.xlsx manually from the file browser, and then try again.

# Formatting Cells

We can use openpyxl to apply formatting to cells, such as font styles, colours and borders.

To apply formatting we need to import some classes from the openpyxl package. Each class applies a different type of formatting. For example, to format fonts, we import the `Font` class. To add borders, we import the `Border` class. And so on.

When automating report creation with Python, we'll be starting with a template most of the time. So we usually won't have to do any formatting in Python.

But it's useful to know about. So here's a quick example - how to change font styles.

To learn more, see the [openpyxl guidance](https://openpyxl.readthedocs.io/en/stable/styles.html).

In [None]:
from openpyxl.styles import Font

# Let's apply some styles to our report title using the Font class:
ws_title['A3'].font = Font(
    bold=True,
    color='005eb8',
    size=27,
    name='Arial',
)

# Save the workbook with the new formatting changes
wb.save('report_modified.xlsx')
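
The other style classes work the same way: build a style object, then assign it to a property of the cell. Here is a small sketch using `PatternFill`, `Border`, `Side` and `Alignment` from `openpyxl.styles`. The cell, colours and in-memory workbook are made up for illustration and are not taken from the template:

```python
import openpyxl
from openpyxl.styles import Alignment, Border, PatternFill, Side

# Assumption: a blank in-memory worksheet stands in for a report sheet
demo_ws = openpyxl.Workbook().active
cell = demo_ws['A1']
cell.value = 'Region'

# Solid light grey background
cell.fill = PatternFill(fill_type='solid', start_color='DDDDDD')

# Thin black line along the bottom edge of the cell
cell.border = Border(bottom=Side(style='thin', color='000000'))

# Centre the text horizontally
cell.alignment = Alignment(horizontal='center')

print(cell.fill.fill_type, cell.border.bottom.style, cell.alignment.horizontal)
```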

## Exercise

We should make the "Official Sensitive: Management Information Reporting Only" subheader a bit more prominent. It's in cell A4 of the title sheet.

In the cell below, make it red, and change the font size to 20.

# Inserting a DataFrame into the spreadsheet

We can insert DataFrames into a worksheet, but they have to be Pandas DataFrames. If you're using PySpark, you'll have to convert them to Pandas first with:

```py
df_pandas = df_spark.toPandas()
```

Here we'll create a Pandas DataFrame manually and insert it into our template. So first let's create the df:

In [None]:
# Here's how to create a Pandas DataFrame manually:

data_example_1 = {
    'Region': ["East of England", "London", "Midlands", "North East and Yorkshire", "North West", "South East", "South West"],
    'Statistic 1': [25, 30, 35, 40, 45, 50, 55],
    'Statistic 2': [11, 12, 55, 13, 14, 15, 16],
}
df_example_1 = pd.DataFrame(data_example_1)

OK! We have some data. Here's a function we prepared earlier (*aka stole from Stack Overflow...*) that we can use to insert the DataFrame into a specified location in a given sheet:

In [None]:
from openpyxl.utils.dataframe import dataframe_to_rows

def insert_pandas_df_into_excel(df, ws, header=True, startrow=1, startcol=1, index=True):
    """
    Inserts a pandas DataFrame into an Excel worksheet.

    Parameters:
    df: (pandas DataFrame): The DataFrame to insert
    ws: (openpyxl Worksheet): The worksheet to insert the DataFrame into (e.g. wb['Data'])
    header: (bool): Whether to write the column names as the first row (default True)
    startrow: (int): The row to start writing at, counting from 1 (default 1)
    startcol: (int): The column to start writing at, counting from 1 (default 1)
    index: (bool): Whether to write the DataFrame index as well (default True)
    """
    rows = dataframe_to_rows(df, header=header, index=index)
    for r_idx, row in enumerate(rows, startrow):
        for c_idx, value in enumerate(row, startcol):
            ws.cell(row=r_idx, column=c_idx).value = value

Let's use this to add our example data to the "Table 1" sheet:

In [None]:
insert_pandas_df_into_excel(
    df = df_example_1,
    ws = wb['Table 1'],
    header = False,
    startrow = 7,
    startcol = 1,
    index = False,
)

# Save the workbook with the DataFrame inserted
wb.save('report_modified.xlsx')
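
To check where the data landed without opening Excel, we can read the cells back with `iter_rows`. The sketch below repeats the same row-and-column loop on a tiny made-up DataFrame and a blank in-memory worksheet (both are assumptions for illustration), using the same settings as the call above: no header, no index, starting at row 7, column 1.

```python
import openpyxl
import pandas as pd
from openpyxl.utils.dataframe import dataframe_to_rows

# Assumption: a two-row DataFrame and a blank sheet stand in for the real ones
demo_df = pd.DataFrame({
    'Region': ['London', 'Midlands'],
    'Statistic 1': [30, 35],
})
demo_ws = openpyxl.Workbook().active

# Write each DataFrame row into the sheet, starting at row 7, column 1
rows = dataframe_to_rows(demo_df, header=False, index=False)
for r_idx, row in enumerate(rows, 7):
    for c_idx, value in enumerate(row, 1):
        demo_ws.cell(row=r_idx, column=c_idx).value = value

# Read the same block of cells back as plain tuples of values
written = list(demo_ws.iter_rows(min_row=7, max_row=8, max_col=2, values_only=True))
for row in written:
    print(row)
```

Rows 1 to 6 stay empty here because the loop starts counting at `startrow`, which is how the template's own headers above the table are left untouched.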

## Exercise

Now it's your turn! Here's another example DataFrame:

In [None]:
data_example_2 = {
    'Procedure': ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"],
    'Statistic 1': [124, 644, 84, 119, 644, 33, 90],
    'Statistic 2': [20, 30, 40, 50, 60, 70, 80],
}
df_example_2 = pd.DataFrame(data_example_2)

In the cell below, insert this data into the Table 2 sheet.

Note that unlike Table 1, our Table 2 sheet does not have headers. How can we modify the code we used earlier to get the headers into the sheet?