# Requirements

To run this demo:

- copy this notebook into the demo folder
- copy the file excel_demo.xlsx into the data folder (or adjust the path when reading the file)

```Python
# adjust the path as needed
wb = load_workbook('../data/excel_demo.xlsx', data_only=True)
```


 <img src='https://nico.nexgate.ch/images/openpyxl_demo.png' width=300>

# Reading data from an Excel file

Let's open the Excel file excel_demo.xlsx and read some data!

The Excel file contains 2 tabs:
- Movies
- Directors

<img src='https://nico.nexgate.ch/images/excel_demo.png' width='50%'>


## Imports

In [254]:
from openpyxl import load_workbook
import pandas as pd

## Reading tab names

In [255]:
wb = load_workbook('../data/excel_demo.xlsx', data_only=True)

for sheet in wb:
    print(sheet.title)



Movies
Directors
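
As a small aside, the tab names are also available as a plain list through `wb.sheetnames`, which is handy for checking that a tab exists before indexing the workbook. The sketch below builds a tiny in-memory workbook (the name `demo_wb` is an assumption for illustration, not the demo file) so it runs without `excel_demo.xlsx`.

```python
from openpyxl import Workbook

# Build a small in-memory workbook so the sketch runs without the demo file.
demo_wb = Workbook()
demo_wb.active.title = "Movies"
demo_wb.create_sheet("Directors")

# sheetnames returns the tab names as a list, in workbook order.
tab_names = demo_wb.sheetnames
print(tab_names)

# Check that a tab exists before indexing the workbook with its name.
has_directors = "Directors" in demo_wb.sheetnames
```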


## Reading data from a specific tab

In [256]:
# let's create a variable for each sheet
worksheet_movie = wb["Movies"]
worksheet_director = wb["Directors"]

### Reading a specific cell

For instance, to read the value in cell A1:

```Python
A1 = worksheet_movie['A1'].value
```

<img src='https://nico.nexgate.ch/images/cellA1.png' width='40%'>

In [257]:
# let's get the values in cells A1 and B1 from the sheet "Movies"

A1 = worksheet_movie['A1'].value
B1 = worksheet_movie['B1'].value

print(A1, " is ", B1)
print("the variable B1 is of type ", type(B1))

Date  is  2023-03-31 23:37:09.696000
the variable B1 is of type  <class 'datetime.datetime'>


### Reading a range of cells

Instead of a single cell, we can also access a range of cells:

```Python
my_range = my_worksheet["A1:C10"]
```

The range variable is a tuple of rows, and each row is a tuple of cells (one per column).

Once we have a range variable, we can iterate through the range rows to extract the values.

```Python
for row in my_range:
    # retrieve the values 
    list_of_values = [cell.value for cell in row]
    ...
```
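
To make the shape of a range concrete, here is a minimal sketch on an in-memory worksheet (the name `demo_ws` and its three rows are assumptions for illustration). It also shows `iter_rows(..., values_only=True)`, which yields plain values instead of cell objects.

```python
from openpyxl import Workbook

# In-memory worksheet standing in for a tab of the demo file.
demo_ws = Workbook().active
demo_ws.append(["director_id", "director_name"])
demo_ws.append(["d1", "Asif Kapadia"])
demo_ws.append(["d2", "Christopher Nolan"])

my_range = demo_ws["A1:B3"]

# A range is a tuple of rows; each row is a tuple of Cell objects.
first_cell = my_range[0][0]
coordinates = [[cell.coordinate for cell in row] for row in my_range]

# With values_only=True, iter_rows yields tuples of values rather than cells.
values = list(
    demo_ws.iter_rows(min_row=1, max_row=3, min_col=1, max_col=2, values_only=True)
)
```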

The function below creates a DataFrame from a range in an Excel worksheet, using the first row of the range as the column headers.


In [258]:

def create_dataframe_from_worksheet_range(range_string, ws):

    data_rows = []
    first_row = True
    headers = []

    for row in ws[range_string]:

        # we use the 1st row in the range to read the headers
        if first_row:
            headers = [cell.value for cell in row] 
            first_row = False
            continue

        # for all subsequent rows, we append the data to the data_rows list
        data_rows.append([cell.value for cell in row])

    # we return a DataFrame constructed from the list of rows and we specify the headers
    return pd.DataFrame(data_rows, columns=headers)
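
For comparison, a possible variant of the same idea is sketched below. It is not the function used in the rest of this notebook: the name `dataframe_from_range_values` is hypothetical, and it relies on `range_boundaries` from `openpyxl.utils` plus `iter_rows(values_only=True)` to avoid the explicit first-row flag.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils import range_boundaries


def dataframe_from_range_values(range_string, ws):
    """Hypothetical variant: build a DataFrame from a range, first row as headers."""
    # range_boundaries returns (min_col, min_row, max_col, max_row), 1-based.
    min_col, min_row, max_col, max_row = range_boundaries(range_string)
    rows = ws.iter_rows(
        min_row=min_row, max_row=max_row,
        min_col=min_col, max_col=max_col,
        values_only=True,
    )
    headers = next(rows)  # the first row of the range holds the headers
    return pd.DataFrame(list(rows), columns=list(headers))


# Usage on an in-memory worksheet (illustrative data, not the demo file).
demo_ws = Workbook().active
demo_ws.append(["director_id", "director_name"])
demo_ws.append(["d1", "Asif Kapadia"])
demo_ws.append(["d2", "Christopher Nolan"])

df_demo = dataframe_from_range_values("A1:B3", demo_ws)
```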


### Movies dataframe

Let's create a df_movies dataframe from the data in the "Movies" sheet of the Excel file.


<img src='https://nico.nexgate.ch/images/tab1.png' width='40%'>

In [259]:
df_movies = create_dataframe_from_worksheet_range("A4:E35",worksheet_movie)
df_movies.head()

Unnamed: 0,imdb_id,title,year,director,imdb_rating
0,tt0120815,Saving Private Ryan,1998,d13,8.6
1,tt0137523,Fight Club,1999,d4,8.8
2,tt0172495,Gladiator,2000,d11,8.5
3,tt0209144,Memento,2000,d2,8.5
4,tt0120737,The Lord of the Rings: The Fellowship of the Ring,2001,d9,8.8


### Directors dataframe

Now let's create a dataframe from the range of cells A3:B16 in the second worksheet of the Excel file.


<img src='https://nico.nexgate.ch/images/tab2.png' width='40%'>

In [260]:
df_directors = create_dataframe_from_worksheet_range("A3:B16",worksheet_director)
df_directors.head()

Unnamed: 0,director_id,director_name
0,d1,Asif Kapadia
1,d2,Christopher Nolan
2,d3,Damien Chazelle
3,d4,David Fincher
4,d5,Florian Henckel von Donnersmarck


### Merging 2 dataframes 

Let's look at two ways of doing a VLOOKUP-style merge with pandas.

#### Option 1: pd.merge(df1,df2,...)

In [261]:
# let's create a new dataframe df_merged by merging the two dataframes df_movies and df_directors

df_merged = pd.merge(df_movies, df_directors, left_on='director', right_on='director_id', how='left')

df_merged.head()

Unnamed: 0,imdb_id,title,year,director,imdb_rating,director_id,director_name
0,tt0120815,Saving Private Ryan,1998,d13,8.6,d13,Steven Spielberg
1,tt0137523,Fight Club,1999,d4,8.8,d4,David Fincher
2,tt0172495,Gladiator,2000,d11,8.5,d11,Ridley Scott
3,tt0209144,Memento,2000,d2,8.5,d2,Christopher Nolan
4,tt0120737,The Lord of the Rings: The Fellowship of the Ring,2001,d9,8.8,d9,Peter Jackson


#### Option 2: df1.merge(df2, ...)

In [262]:
df_movies = df_movies.merge(df_directors, left_on='director', right_on='director_id', how='left')

df_movies.head()

Unnamed: 0,imdb_id,title,year,director,imdb_rating,director_id,director_name
0,tt0120815,Saving Private Ryan,1998,d13,8.6,d13,Steven Spielberg
1,tt0137523,Fight Club,1999,d4,8.8,d4,David Fincher
2,tt0172495,Gladiator,2000,d11,8.5,d11,Ridley Scott
3,tt0209144,Memento,2000,d2,8.5,d2,Christopher Nolan
4,tt0120737,The Lord of the Rings: The Fellowship of the Ring,2001,d9,8.8,d9,Peter Jackson
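
Unlike VLOOKUP, a left merge does not flag lookups that fail; it simply leaves the looked-up columns empty. The sketch below shows one way to check this, using the `indicator` and `validate` parameters of `merge`. The two small dataframes are illustrative only, and the row with director `d99` is a made-up entry added to trigger a failed lookup.

```python
import pandas as pd

# Illustrative data; "Hypothetical Film" / "d99" are made up to show a failed lookup.
movies = pd.DataFrame({
    "title": ["Fight Club", "Memento", "Hypothetical Film"],
    "director": ["d4", "d2", "d99"],
})
directors = pd.DataFrame({
    "director_id": ["d2", "d4"],
    "director_name": ["Christopher Nolan", "David Fincher"],
})

checked = movies.merge(
    directors,
    left_on="director",
    right_on="director_id",
    how="left",
    validate="many_to_one",  # raises if a director_id appears more than once
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Rows marked "left_only" found no matching director.
unmatched = checked.loc[checked["_merge"] == "left_only", "title"].tolist()
```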


### Pivot table

Let's use groupby to show the number of movies per director

In [263]:
# groupby director and count the number of movies per director

movies_per_director = df_movies.groupby('director_name').size()

movies_per_director

director_name
Asif Kapadia                        1
Christopher Nolan                   4
Damien Chazelle                     2
David Fincher                       4
Florian Henckel von Donnersmarck    1
Hayao Miyazaki                      1
Marco Tullio Giordana               1
Martin Scorsese                     5
Peter Jackson                       3
Quentin Tarantino                   2
Ridley Scott                        1
Roman Polanski                      1
Steven Spielberg                    5
dtype: int64

In [264]:
type(movies_per_director)

pandas.core.series.Series

In [265]:
# convert the series to a dataframe

movies_per_director = movies_per_director.to_frame()

movies_per_director.columns = ['movies_count']

movies_per_director.reset_index(inplace=True)

movies_per_director.head()

Unnamed: 0,director_name,movies_count
0,Asif Kapadia,1
1,Christopher Nolan,4
2,Damien Chazelle,2
3,David Fincher,4
4,Florian Henckel von Donnersmarck,1
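
The conversion from series to dataframe can also be written in one step: `reset_index(name=...)` on the series names the count column and turns the index into a regular column. A minimal sketch, using a few illustrative rows rather than the contents of the demo file:

```python
import pandas as pd

# Illustrative rows only; these are not read from the demo file.
sample = pd.DataFrame({
    "title": ["Memento", "Inception", "Fight Club"],
    "director_name": ["Christopher Nolan", "Christopher Nolan", "David Fincher"],
})

# size() counts rows per group; reset_index(name=...) names the count column.
counts = sample.groupby("director_name").size().reset_index(name="movies_count")
```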


### Creating a new worksheet

In [266]:
worksheet_movies_per_director = wb.create_sheet("Movies per director")

### Writing a dataframe to a worksheet

In [267]:
from openpyxl.utils.dataframe import dataframe_to_rows

for r in dataframe_to_rows(movies_per_director, index=False, header=True):
    worksheet_movies_per_director.append(r)


### Adding Table styling

In [268]:
from openpyxl.worksheet.table import Table, TableStyleInfo

excel_table = Table(displayName="Directors_tbl", ref="A1:B14")

   
style = TableStyleInfo(name="TableStyleMedium9", 
                       showRowStripes=True)
excel_table.tableStyleInfo = style
worksheet_movies_per_director.add_table(excel_table)


# add an autofilter to the range covered by the table

worksheet_movies_per_director.auto_filter.ref = "A1:B14"
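
The range "A1:B14" is hard-coded above, so it has to be updated by hand if the number of directors changes. One possible alternative, sketched below on an in-memory worksheet, is to take the range from `ws.dimensions`. The names `demo_ws` and `Demo_tbl` are assumptions for illustration, and this only works when the sheet contains nothing but the table data.

```python
from openpyxl import Workbook
from openpyxl.worksheet.table import Table, TableStyleInfo

# In-memory worksheet holding only the table data (header + 3 rows).
demo_ws = Workbook().active
demo_ws.append(["director_name", "movies_count"])
demo_ws.append(["Asif Kapadia", 1])
demo_ws.append(["Christopher Nolan", 4])
demo_ws.append(["Damien Chazelle", 2])

# dimensions is the smallest range containing all cells that hold data.
table_ref = demo_ws.dimensions

demo_table = Table(displayName="Demo_tbl", ref=table_ref)
demo_table.tableStyleInfo = TableStyleInfo(name="TableStyleMedium9",
                                           showRowStripes=True)
demo_ws.add_table(demo_table)
```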



### Adding a bar chart


In [269]:
# let's add a bar chart to the worksheet "Movies per director"

from openpyxl.chart import BarChart, Reference

chart = BarChart()
chart.type = "bar"
chart.style = 11
chart.title = "Movies per director"
chart.y_axis.title = 'Movies'
chart.x_axis.title = 'Director'

data = Reference(worksheet_movies_per_director, min_col=2, min_row=1, max_row=14, max_col=2)
categories = Reference(worksheet_movies_per_director, min_col=1, min_row=2, max_row=14)
chart.add_data(data, titles_from_data=True)
chart.set_categories(categories)
chart.height = 10
chart.width = 25
worksheet_movies_per_director.add_chart(chart, "C1")
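
The chart references above also use a fixed last row (14). As a sketch under the same assumption as before (an in-memory worksheet named `demo_ws` that contains only the table data), the last row can be read from `ws.max_row` so the chart follows the data.

```python
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

demo_ws = Workbook().active
demo_ws.append(["director_name", "movies_count"])
demo_ws.append(["Asif Kapadia", 1])
demo_ws.append(["Christopher Nolan", 4])
demo_ws.append(["Damien Chazelle", 2])

last_row = demo_ws.max_row  # last row that contains data

demo_chart = BarChart()
demo_chart.type = "bar"  # horizontal bars, as in the cell above
demo_chart.title = "Movies per director"

# The data reference includes the header row so it can be used as the series title.
data = Reference(demo_ws, min_col=2, min_row=1, max_row=last_row)
# Categories start on row 2 to skip the header.
categories = Reference(demo_ws, min_col=1, min_row=2, max_row=last_row)

demo_chart.add_data(data, titles_from_data=True)
demo_chart.set_categories(categories)
demo_ws.add_chart(demo_chart, "D1")
```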


### Saving updated excel file

In [270]:
# let's save the workbook

wb.save('../data/excel_demo_u4.xlsx')
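
A quick way to confirm that a save worked is to load the file again and inspect it. The sketch below does this with a small in-memory workbook written to a temporary folder; the file name `excel_demo_check.xlsx` and the variable names are assumptions for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Small stand-in workbook with the same tab name as the one created above.
demo_wb = Workbook()
demo_wb.active.title = "Movies"
demo_wb.create_sheet("Movies per director").append(["director_name", "movies_count"])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "excel_demo_check.xlsx")
    demo_wb.save(out_path)

    # Reload the saved file and read back the tab names and the header row.
    reloaded = load_workbook(out_path)
    saved_tabs = reloaded.sheetnames
    header = [cell.value for cell in reloaded["Movies per director"][1]]
    reloaded.close()
```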