::: {#fig-etl}
![](images/etl.png){.lightbox width=110%}

ETL Phases
:::

## Overview

Spreadsheets and slideshows persist in office workflows for several reasons:

- Familiarity and Ease of Use
- Flexibility and Control
- Storytelling and Communication
- Collaboration and Sharing
- Ad-hoc Analysis and Exploration
- Cost and Accessibility

While BI tools may be transforming the way we analyze data, it’s clear that spreadsheets and slideshows aren’t going anywhere anytime soon. They serve a different purpose, filling a gap that BI tools often miss.

### Building a Python-Powered Data Pipeline

Business Intelligence (BI) tools are powerful, but they can also be expensive and complex. What if you could build a custom, flexible, and potentially more cost-effective solution using Python?

This post explores how you can use Python to collect, transform, and deliver data to Google Sheets and Google Slides for compelling presentations.

### Why Python?

Python has become a powerhouse in data science and automation. Its rich ecosystem of libraries makes it well suited to data manipulation and to feeding spreadsheets and slides. This combination offers a powerful alternative to traditional BI tools for certain use cases.

The proposed stack is Python (to collect, cleanse, and transform data), Google Sheets (to store the transformed data), and Google Slides (to showcase visualizations).

### Proposed Workflow

Imagine you need to generate a weekly sales report, and all you have to do is run the following command:

```python
%%bash
jupyter-execute ./projects/weekly-report.ipynb
```

And voilà! Your weekly report is updated and ready to present in Google Slides.
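If you would rather trigger the run from a scheduler or another Python script, a minimal sketch could wrap the same command with `subprocess`. The helper names `build_command` and `run_report` are hypothetical, and the sketch assumes the `jupyter-execute` command is available on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_command(notebook):
    """Return the argument list for executing a notebook with jupyter-execute."""
    return ["jupyter-execute", str(Path(notebook))]


def run_report(notebook):
    """Run the notebook and raise CalledProcessError if execution fails."""
    subprocess.run(build_command(notebook), check=True)


# Example (not executed here):
# run_report("./projects/weekly-report.ipynb")
```

Passing `check=True` makes a failed notebook run surface as an exception, which is usually what a scheduled job needs.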

## Environment settings

```python
# Import authenticator and gspread to manage g-sheets
from oauth2client.service_account import ServiceAccountCredentials
import gspread

# Import other libraries
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import json
import warnings
warnings.filterwarnings('ignore')
```

```python
# get token
filename = 'credentials.json'

# read json file
with open(filename) as f:
    keys = json.load(f)

# read credentials
token = keys['md_token']
```
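For scheduled runs it can be convenient to let an environment variable override the file. The following is a minimal sketch, not the code used in this post: `load_token` is a hypothetical helper, and the variable name `MOTHERDUCK_TOKEN` is a placeholder chosen for illustration.

```python
import json
import os
from pathlib import Path


def load_token(path="credentials.json", key="md_token", env_var="MOTHERDUCK_TOKEN"):
    """Return the token from the environment if set, else from the JSON file."""
    token = os.environ.get(env_var)
    if token:
        return token
    # Fall back to the same JSON layout used above: {"md_token": "..."}
    with Path(path).open() as f:
        return json.load(f)[key]
```

Either way, keep `credentials.json` out of version control.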

## Extract Phase

```python
# connect to motherduck cloud
conn = db.connect(f'md:?motherduck_token={token}')
```

```python
#| label: tbl-databases
#| tbl-cap: Databases

conn.sql('show databases')
```

┌───────────────────────┐
│     database_name     │
│        varchar        │
├───────────────────────┤
│ md_information_schema │
│ my_db                 │
│ my_portfolio          │
│ sample_data           │
└───────────────────────┘

```python
# select specific database
conn.sql('use my_portfolio')
```

```python
#| label: tbl-tables
#| tbl-cap: Tables in database

# show tables in database
conn.sql('show tables')
```

┌──────────────────┐
│       name       │
│     varchar      │
├──────────────────┤
│ airports         │
│ appl_stock       │
│ cdmx_subway      │
│ colors           │
│ contains_null    │
│ houses           │
│ people           │
│ prevalencia      │
│ restaurants      │
│ retail_sales     │
│ sales            │
│ sales_info       │
│ sets             │
│ water_collection │
├──────────────────┤
│     14 rows      │
└──────────────────┘

```python
#| label: tbl-dataset
#| tbl-cap: Dataset Preview

# dataset
dataset = conn.sql('select * from restaurants').df()

(
    dataset.head()
        .style
        .hide()
        .format({'rating_count': '{:,.0f}', 'cost': '${:.2f}'})
)
```

| name | rating_count | cost | city | cuisine | rating |
|---|---|---|---|---|---|
| The Golden Wok | 1477 | $33.62 | Berlin | American | 5 |
| Greek Gyros | 770 | $68.39 | New York | French | 1 |
| Taste of Italy | 4420 | $88.23 | Amsterdam | Chinese | 0 |
| Midnight Diner | 2155 | $12.97 | Lisbon | Mexican | 1 |
| Taste of Italy | 3375 | $52.79 | Sydney | Chinese | 1 |


## Transform Phase

### Which restaurant chain has the maximum number of restaurants?

```python
#| label: tbl-query1
#| tbl-cap: Data grouped by restaurant chains

chains = (
    conn.sql('''
    select name, count(name) as no_of_chains
    from restaurants
    group by name
    order by no_of_chains DESC
    limit 10
    ''').df()
)
chains
```

|   | name | no_of_chains |
|---|---|---|
| 0 | The Burger Joint | 721 |
| 1 | Pizza Palace | 703 |
| 2 | Greek Gyros | 696 |
| 3 | Cafe Delight | 692 |
| 4 | French Delights | 681 |
| 5 | The BBQ Shack | 671 |
| 6 | The Golden Wok | 667 |
| 7 | Ocean Breeze | 665 |
| 8 | Spice & Bloom | 665 |
| 9 | Midnight Diner | 657 |


### Which restaurant chain has generated maximum revenue?

```python
#| label: tbl-query2
#| tbl-cap: Data grouped by restaurant and revenue

revenue = (
    conn.sql('''
    select name, sum(rating_count * cost) as revenue
    from restaurants
    group by name
    order by revenue DESC
    limit 10
    ''').df()
)

(
    revenue
        .style
        .hide()
        .format({'revenue': '{:,.2f}'})
)
```

| name | revenue |
|---|---|
| The Burger Joint | 108820424.64 |
| Pizza Palace | 100382853.92 |
| Cafe Delight | 98037063.93 |
| Greek Gyros | 96403445.25 |
| The BBQ Shack | 96286414.02 |
| Ocean Breeze | 95664612.77 |
| The Golden Wok | 94860125.75 |
| Spice & Bloom | 91824854.56 |
| French Delights | 91236701.38 |
| Midnight Diner | 91170162.3 |
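To see what the aggregation does on a small scale, here is a minimal sketch that runs the same `group by` on three made-up rows. It uses the standard-library `sqlite3` module rather than DuckDB so it runs without a MotherDuck account; the rows and values are invented for illustration and do not come from the `restaurants` table.

```python
import sqlite3

# Hypothetical sample rows: (name, rating_count, cost)
rows = [
    ("Pizza Palace", 10, 20.0),
    ("Pizza Palace", 5, 10.0),
    ("Cafe Delight", 8, 12.5),
]

con = sqlite3.connect(":memory:")
con.execute("create table restaurants (name text, rating_count integer, cost real)")
con.executemany("insert into restaurants values (?, ?, ?)", rows)

# Same shape as the revenue query above: one row per name, largest total first
result = con.execute(
    """
    select name, sum(rating_count * cost) as revenue
    from restaurants
    group by name
    order by revenue desc
    """
).fetchall()
con.close()

print(result)  # [('Pizza Palace', 250.0), ('Cafe Delight', 100.0)]
```

Each row contributes `rating_count * cost` to its group, so the two Pizza Palace rows add up to 200 + 50 = 250.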


### Which city has generated maximum revenue?

```python
#| label: tbl-query3
#| tbl-cap: Data grouped by city and revenue

cities = (
    conn.sql('''
    select city, sum(rating_count * cost) as revenue
    from restaurants
    group by city
    order by revenue DESC
    limit 10
    ''').df()
)

(
    cities
        .style
        .hide()
        .format({'revenue': '${:,.2f}'})
)
```

| city | revenue |
|---|---|
| Amsterdam | $148,839,878.62 |
| Tokyo | $148,035,421.32 |
| Madrid | $141,487,618.97 |
| Paris | $141,219,374.56 |
| London | $140,876,613.54 |
| Rome | $139,622,129.63 |
| New York | $138,621,609.28 |
| Lisbon | $136,814,247.27 |
| Berlin | $136,434,163.25 |
| Sydney | $131,656,513.97 |


## Load Phase

```python
# Create scope to authenticate
SCOPES = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive',
]

# Read credentials
GOOGLE_SHEETS_KEY_FILE = 'arkham-538.json'
credentials = ServiceAccountCredentials.from_json_keyfile_name(GOOGLE_SHEETS_KEY_FILE, SCOPES)
gc = gspread.authorize(credentials)
```

```python
import pytz
import datetime

tz = pytz.timezone('America/Mexico_City')
update = datetime.datetime.now(tz).strftime('%b %d, %Y')
period = update
```
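On Python 3.9 or later, the same timestamp can be built with the standard-library `zoneinfo` module instead of `pytz`. This is an alternative sketch, not the code used for the report; `format_period` is a hypothetical helper name.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def format_period(moment=None, tz_name="America/Mexico_City"):
    """Format a datetime like 'Feb 24, 2025'; defaults to now in tz_name."""
    if moment is None:
        # Time zone data is looked up only when no datetime is supplied
        moment = datetime.now(ZoneInfo(tz_name))
    return moment.strftime("%b %d, %Y")
```

Accepting an explicit `moment` makes it easy to regenerate a report for a past date.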

```python
def save_to_gsheets(df, sheet_name, worksheet_name, period):
    """Write df to a worksheet starting at row 4 and stamp the update date in A1:A2."""
    creds = ServiceAccountCredentials.from_json_keyfile_name(GOOGLE_SHEETS_KEY_FILE, SCOPES)
    client = gspread.authorize(creds)
    sheet = client.open(sheet_name)
    worksheet = sheet.worksheet(worksheet_name)

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Convert datetimes to strings in advance (they are not JSON serializable)
    for column in df.columns[df.dtypes == 'datetime64[ns]']:
        df[column] = df[column].astype(str)

    # Header row followed by the data rows, with missing values as empty strings
    data = [df.columns.values.tolist()] + df.fillna('').values.tolist()

    # Freeze the first four rows and write the table starting at A4.
    # Keyword arguments avoid the positional-order change in gspread 6.
    worksheet.freeze(rows=4)
    worksheet.update(range_name='A4:M', values=data)

    # Record the query/update date in A1:A2
    worksheet.update(range_name='A1:A2', values=[['Last update'], [period]])

    print(f'DataFrame uploaded to: workbook: {sheet_name}, sheet: {worksheet_name}')
```
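The conversion step inside `save_to_gsheets` can be exercised without touching the Sheets API. A minimal sketch, assuming rows arrive as plain Python lists rather than a DataFrame: the hypothetical helper `to_sheet_values` builds the header-plus-rows payload, replacing missing values with empty strings and dates with ISO text.

```python
from datetime import date, datetime


def to_sheet_values(columns, rows):
    """Build a list of lists (header first) that is safe to send as JSON."""
    def clean(value):
        if value is None:
            return ""
        # NaN is the only float that is not equal to itself
        if isinstance(value, float) and value != value:
            return ""
        if isinstance(value, (date, datetime)):
            return value.isoformat()
        return value

    return [list(columns)] + [[clean(v) for v in row] for row in rows]
```

Keeping this logic in a pure function makes it possible to unit-test the payload before any network call is made.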

```python
save_to_gsheets(dataset, 'restaurants', 'data', period)
save_to_gsheets(chains, 'restaurants', 'chains', period)
save_to_gsheets(revenue, 'restaurants', 'revenue', period)
save_to_gsheets(cities, 'restaurants', 'cities', period)
```

```
DataFrame uploaded to: workbook: restaurants, sheet: data
DataFrame uploaded to: workbook: restaurants, sheet: chains
DataFrame uploaded to: workbook: restaurants, sheet: revenue
DataFrame uploaded to: workbook: restaurants, sheet: cities
```
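`save_to_gsheets` writes to the fixed range `A4:M`, which covers 13 columns. If a DataFrame could be wider than that, one option is to compute the range from its shape. A minimal sketch with hypothetical helper names:

```python
def column_letter(index):
    """Convert a 1-based column index to A1 letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def a1_range(n_rows, n_cols, start_row=4):
    """Return the A1 range that holds n_rows x n_cols starting at column A."""
    end_row = start_row + n_rows - 1
    return f"A{start_row}:{column_letter(n_cols)}{end_row}"


# A header plus five data rows across six columns
print(a1_range(6, 6))  # A4:F9
```

gspread also ships `gspread.utils.rowcol_to_a1`, which may be preferable to hand-rolled conversion in real code.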


## Close connection

```python
# close connection
conn.close()
```

## Retrieve data from gsheets

```python
# Open the spreadsheet by its key
df_id = '1JNAWb2QkFwh61v7QwEEVZnNhTPS0csbdMdll9y1csEg'
df_workbook = gc.open_by_key(df_id)
# Select the worksheet named 'data'
df = df_workbook.worksheet('data')
# Read every cell as a list of rows
df = df.get_all_values()
# Build a DataFrame, using the first sheet row as column names
df = pd.DataFrame(df[1:], columns=df[0])
```

```python
#| label: tbl-gsheets
#| tbl-cap: Data Saved on Google Sheets

df.head()
```

|   | Last update |   |   |   |   |   |
|---|---|---|---|---|---|---|
| 0 | Feb 24, 2025 |  |  |  |  |  |
| 1 |  |  |  |  |  |  |
| 2 | name | rating_count | cost | city | cuisine | rating |
| 3 | The Golden Wok | 1477 | 33.62048759 | Berlin | American | 5 |
| 4 | Greek Gyros | 770 | 68.38887409 | New York | French | 1 |
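Because `save_to_gsheets` writes the column names on row 4, treating the first sheet row as the header produces the misaligned preview above. A minimal sketch of one way to handle this, assuming the layout shown (update stamp in rows 1 and 2, header on row 4); `split_header` is a hypothetical helper and the sample values are abbreviated:

```python
def split_header(values, header_row=4):
    """Split raw sheet values into (header, data_rows).

    header_row is 1-based, matching how rows are numbered in the sheet.
    """
    header = values[header_row - 1]
    data_rows = values[header_row:]
    return header, data_rows


# Values shaped like the worksheet written earlier (fewer columns for brevity)
raw = [
    ["Last update", "", ""],
    ["Feb 24, 2025", "", ""],
    ["", "", ""],
    ["name", "rating_count", "city"],
    ["The Golden Wok", "1477", "Berlin"],
]
header, data_rows = split_header(raw)
```

The result can then be passed to `pd.DataFrame(data_rows, columns=header)` to get properly named columns.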


## Google Sheets Report Data

::: {#fig-gsheets}

![](images/gsheets.png){width=80%}

Google Sheets Data for Presentation Report
:::

## Sync between Google Sheets and Google Slides

Copy each table and chart from Google Sheets, paste it into Google Slides with the "Link to spreadsheet" option so it stays in sync, and then customize the slides.

::: {#fig-slides}
![](images/sync.png)

Synchronization between Google Sheets and Slides
:::

::: {.callout-note appearance="default" .callout-tip title="Google Slides"}

You can see the final report on [Google Slides](https://docs.google.com/presentation/d/e/2PACX-1vSNbr2SozdFPzYGjDmd7_b_BvFRpCinWuD7fbRWDJd9hByA0-jKCTDsh1pX4gRmQFPQOdqkTcPcIG21/pub?start=false&loop=false&delayms=3000#slide=id.p)

![](images/gslides.png){.lightbox width=110%}
:::

## Conclusions

While BI tools are valuable, Python offers a compelling alternative for building custom data pipelines.
With libraries such as polars and duckdb for data collection and transformation, plotly for visualization, and gspread for writing to Google Sheets, you can build a flexible, cost-effective, and automated solution. Linking those sheets to Google Slides then keeps your presentations in sync with the data.

This approach lets you take control of your data and create highly tailored reporting solutions while avoiding BI license costs.

## Contact

**Jesus L. Monroy**
<br>
*Economist & Data Scientist*

[Medium](https://medium.com/@jesuslm) | [Linkedin](https://www.linkedin.com/in/j3sus-lm) | [Twitter](https://x.com/j3suslm)