<a href="https://colab.research.google.com/github/goteguru/kmooc_python/blob/main/notebooks/en/kmooc_08_2_openpyxl_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spreadsheet (Excel) data handling

Even today, companies manage a large portion of their data in spreadsheet programs, sometimes even when they probably shouldn't. Spreadsheets are familiar to almost everyone and most people can handle them to some extent, so if our program can create and read such files, data exchange becomes much easier.

Of course, spreadsheets understand the CSV format, so we could move data that way as well, but CSV is quite limited (for example, it has no worksheets) and it can be awkward to explain to someone how to import a CSV file correctly.

In short, if we can create and read XLSX files, that's very useful. This is where the openpyxl package comes in!

## Openpyxl

If openpyxl is not installed yet (in Colab it already is), install it with:
`pip install openpyxl`

### Creating a new XLSX file and worksheets

In [None]:
# the basis of work is the Workbook object
from openpyxl import Workbook

# so let's create one:
wb = Workbook()

# by default there will be one sheet and it's active.
# let's rename it to 'Adatok'!
ws = wb.active
ws.title = "Adatok"

# Done! We can already write to cells!

ws["A1"] = "Anyag"
ws["B1"] = "Mennyiség"
ws["C1"] = "Ár"

ws["A2"] = "Cement"
ws["B2"] = 100
ws["C2"] = 12.5

# Want a new worksheet?
new_ws = wb.create_sheet(title="Eredmények")
new_ws["A1"] = "Ide valami más kerül..."

# Finally save as a new file
wb.save("adataim.xlsx")


If you open your file manager now, you'll find a file named `adataim.xlsx`, which you can open with Excel, Google Sheets, or LibreOffice.

If you created it here in Colab, click the file icon on the left to see the filesystem. Hover over the file and use the three-dot menu to download it so you can inspect it.
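
You can also check the result from code instead of downloading the file. The following is a minimal sketch: it builds a small workbook similar to the one above, saves it to a temporary path (the file name `check.xlsx` is only an assumption for this example), then reloads it and lists the worksheets.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Build a small workbook like the one above.
# The temporary path is only for this sketch, so nothing in the notebook is overwritten.
path = os.path.join(tempfile.mkdtemp(), "check.xlsx")
wb = Workbook()
ws = wb.active
ws.title = "Adatok"
ws["A1"] = "Anyag"
wb.create_sheet(title="Eredmények")
wb.save(path)

# Reload the file and look at what was actually written
loaded = load_workbook(path)
sheet_names = loaded.sheetnames            # worksheet titles, in order
first_cell = loaded["Adatok"]["A1"].value  # value of a single cell
print(sheet_names, first_cell)
```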

### Opening an existing XLSX file

Of course it's possible the file already exists and the data is already there. Create an Excel file, upload it here and try to modify something in it!

As an example, we'll modify the file we created earlier:

In [None]:
# we need the workbook loading function now
from openpyxl import load_workbook

wb = load_workbook("adataim.xlsx")

# Select worksheet by name
ws = wb["Adatok"]

# Read a cell
material = ws["A2"].value # ws["A2"] is a cell object
quantity = ws["B2"].value # .value is the actual value inside

print(material, quantity)

# Modify a value
ws["C2"] = 13.0  # new price

# Append a new row to the end
ws.append(["Acél", 80, 22.1])


Cement 100


In [None]:
from random import randint
# let's write another 100 random rows
for _ in range(100):
    material = "Cement" if randint(1, 2) == 1 else "Acél"
    ws.append([material, randint(1, 100), randint(1, 100) / 10])

wb.save("modositott.xlsx")

### Navigating between cells, iteration

The "A2"-style notation is very useful in a spreadsheet, but in code it's often more convenient to address cells by numeric row and column coordinates. It is also useful to be able to iterate through rows and extract their values one by one.
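
If you need to switch between the two notations, openpyxl ships small helper functions in `openpyxl.utils`. A short sketch (the sample coordinates are arbitrary):

```python
from openpyxl.utils import (
    column_index_from_string,
    coordinate_to_tuple,
    get_column_letter,
)

# Column number -> column letter
letter = get_column_letter(3)

# Column letter -> column number (columns are counted from 1)
index = column_index_from_string("AA")

# "B12"-style coordinate -> (row, column) tuple of numbers
row_col = coordinate_to_tuple("B12")

print(letter, index, row_col)
```

Note that `coordinate_to_tuple` returns the row first and the column second, the same order that `ws.cell(row=..., column=...)` uses.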

In [None]:
wb = load_workbook("adataim.xlsx")

ws = wb["Adatok"]
max_row = ws.max_row # ask where the last data row is

for row in ws.iter_rows(min_row=1, max_row=max_row, values_only=True):
    print(row)


The `values_only=True` argument makes `iter_rows` yield tuples of the actual values instead of cell objects. Super convenient! Of course, if you wanted to color the sheet cell by cell, you would need the cell objects.
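
A common use of value-only iteration is turning a sheet with a header row into a list of dictionaries. The sketch below assumes the first row holds the column names; it builds its own small in-memory workbook so that it runs on its own.

```python
from openpyxl import Workbook

# Small in-memory sheet with a header row (sample data for this sketch)
wb = Workbook()
ws = wb.active
ws.append(["Anyag", "Mennyiség", "Ár"])
ws.append(["Cement", 100, 12.5])
ws.append(["Acél", 80, 22.1])

rows = ws.iter_rows(values_only=True)  # each row is a tuple of plain values
header = next(rows)                    # first row: column names
records = [dict(zip(header, row)) for row in rows]

for record in records:
    print(record)
```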

In [None]:
# create a new "random" sheet
wsnew = wb.create_sheet(title="random")

# and use coordinates instead of letters:
cell = wsnew.cell(row=2, column=3)
cell.value = 42

# write a small 3x3 matrix below:
for r in range(5, 5+3):
    for c in range(1, 1+3):
        wsnew.cell(row=r, column=c, value=c*r)

# if we want it to persist, we must save it
wb.save("táblás.xlsx")

In [None]:
# when finished, close it (this only has an effect for workbooks
# opened in read-only or write-only mode, where a file stays open)
wb.close()
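
Closing matters most when a workbook is opened in read-only mode, which is meant for large files and keeps the file open while you iterate. A minimal sketch, assuming a small temporary file (the name `numbers.xlsx` is made up for this example):

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Create a small file to read back (temporary path, only for this sketch)
path = os.path.join(tempfile.mkdtemp(), "numbers.xlsx")
wb = Workbook()
ws = wb.active
for i in range(1, 6):
    ws.append([i, i * i])
wb.save(path)

# read_only=True loads rows lazily and keeps the file handle open
ro = load_workbook(path, read_only=True)
total = sum(row[1] for row in ro.active.iter_rows(values_only=True))
ro.close()  # release the file handle

print(total)
```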

### Styling, coloring

Openpyxl is a fairly powerful package with many features that we don't have room to cover here, but let's look at a very short example of how to color a cell (for example, to highlight a value for the user).

In [None]:
from openpyxl.styles import PatternFill
wb = load_workbook("adataim.xlsx")
ws = wb.active

red = PatternFill(fill_type="solid", start_color="FF0000", end_color="FF0000")
ws["A1"].fill = red

wb.save("szinezett.xlsx")
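
The same idea extends to conditional highlighting. The sketch below makes the header row bold and colors every price above a limit; the `threshold` value and the sample rows are assumptions for illustration only, and the workbook is kept in memory (call `wb.save(...)` to write it out).

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

# Sample data for this sketch
wb = Workbook()
ws = wb.active
ws.append(["Anyag", "Mennyiség", "Ár"])
ws.append(["Cement", 100, 12.5])
ws.append(["Acél", 80, 22.1])

# Make every cell of the header row (row 1) bold
for cell in ws[1]:
    cell.font = Font(bold=True)

yellow = PatternFill(fill_type="solid", start_color="FFFF00", end_color="FFFF00")
threshold = 20  # hypothetical price limit

# Walk through column C (the price) below the header and highlight large values
highlighted = []
for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):
    for cell in row:
        if cell.value > threshold:
            cell.fill = yellow
            highlighted.append(cell.coordinate)

print(highlighted)
```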

## Pandas DataFrame integration

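If you already use pandas, it's often simpler to think of an Excel file as a DataFrame. (For fine-grained, cell-level edits you can still use openpyxl.) Pandas uses openpyxl in the background to read and write `.xlsx` files, but only as an optional dependency: installing pandas does not install openpyxl automatically, so outside Colab you may need to install it yourself.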

In [None]:
import pandas as pd

# Single worksheet
df = pd.read_excel("modositott.xlsx", sheet_name="Adatok")
print(df.head())
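
To load every worksheet at once, `read_excel` accepts `sheet_name=None` and returns a dictionary that maps sheet names to DataFrames. A minimal sketch, assuming a small two-sheet file created on the spot (the file name and the sample contents are made up for this example):

```python
import os
import tempfile

import pandas as pd
from openpyxl import Workbook

# Create a two-sheet workbook to read back (temporary path, only for this sketch)
path = os.path.join(tempfile.mkdtemp(), "two_sheets.xlsx")
wb = Workbook()
ws = wb.active
ws.title = "Adatok"
ws.append(["Anyag", "Mennyiség"])
ws.append(["Cement", 100])
other = wb.create_sheet(title="Eredmények")
other.append(["Megjegyzés"])
other.append(["ok"])
wb.save(path)

# sheet_name=None: read all worksheets into a dict {sheet name: DataFrame}
sheets = pd.read_excel(path, sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)
```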


And now you can perform all kinds of data manipulation magic that pandas (and numpy) can do!


In [None]:
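# Extremely complicated Pandas computations:
# ....
# as_index=False keeps "Anyag" as a regular column; otherwise it would become
# the index and be dropped when we save with index=False below
result = df.groupby("Anyag", as_index=False).mean()

# and finally write it to a new file (with a single worksheet):
result.to_excel("eredmények.xlsx", index=False, sheet_name="Eredmenyek")
# we did not request pandas indexes, so they are not written to the file.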

You may instead want to save the result into the original xlsx file as a new sheet. The approach above is not suitable for that, because passing a file name to `to_excel` overwrites the whole file.
Instead, open the original file in "append" mode and add a sheet.

In [None]:
# tell pandas to write into this existing file in "append" mode
with pd.ExcelWriter("modositott.xlsx", engine="openpyxl", mode="a", if_sheet_exists="new") as writer:
    result.to_excel(writer, sheet_name="Statisztikák", index=False)
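
To see the effect of append mode end to end, here is a self-contained sketch: it writes one sheet to a fresh temporary file, appends a second sheet, and then lists the sheets with openpyxl. The sample data and the file name are assumptions for this example.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook

path = os.path.join(tempfile.mkdtemp(), "append_demo.xlsx")

# Sample data for this sketch
data = pd.DataFrame({"Anyag": ["Cement", "Acél", "Cement"], "Ár": [12.5, 22.1, 13.5]})
stats = data.groupby("Anyag", as_index=False).mean()

# First write: creates the file with a single sheet
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    data.to_excel(writer, sheet_name="Adatok", index=False)

# Second write in append mode: the existing sheet is kept, a new one is added
with pd.ExcelWriter(path, engine="openpyxl", mode="a", if_sheet_exists="new") as writer:
    stats.to_excel(writer, sheet_name="Statisztikák", index=False)

sheet_names = load_workbook(path).sheetnames
print(sheet_names)
```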
