# Which Python string formatting method should you be using in your data science project?

**3rd July 2021**

[Engineering for Data Science post](https://engineeringfordatascience.com/posts/python_string_formatting_for_data_science/)

## Python String Formatting

String formatting (also known as string interpolation) is the process of inserting a custom string or variable into a predefined 'template' string.

In Python, there are four built-in methods for formatting strings:
- % operator
- format
- f-strings
- Templates

This is a little surprising given Python's own manifesto, the Zen of Python:

> [Zen of Python](https://www.python.org/dev/peps/pep-0020/) - "There should be one-- and preferably only one --obvious way to do it."

So why are there four native methods for formatting strings in Python?

Each has its own trade-offs in simplicity, flexibility and/or extensibility. But what are the differences, and which one should you use for which purpose?

In the context of data science there are three common use cases for formatting strings:
1. Print statements
2. User inputs
3. SQL queries

In this post, we will go through each use case and describe which string formatting method might be most appropriate.

## 1. Print Statements

String interpolation in data science is particularly useful for logging (e.g. during model training), creating dynamic chart titles and printing statistics.

In Python, the three most common methods for this purpose are the `%` operator, `str.format()` and f-strings.

Let's briefly discuss each of these methods.

### % operator - 'Old method' 

We will start with the `%` operator method to get it out of the way.

String placeholders are denoted by a `%` symbol, followed by a character or characters which specify the desired formatting.

**Example:**

In [1]:
rows = 10
columns = 4

# print string representation
print("My data has %s rows and %s columns" % (rows, columns))

My data has 10 rows and 4 columns


It is also possible to use named placeholders and supply a dictionary which can make the statement more readable.

In [2]:
data = {"rows": rows, "columns": columns}

# print with named placeholders
print("My data has %(rows)s rows and %(columns)s columns" % data)

My data has 10 rows and 4 columns


The `%` operator method is generally seen as a legacy method for string interpolation and should be avoided in favour of the `format` or `f-string` methods described next.

Common grievances with this method include:
- The `%` notation can be hard to read
- `%` notation can be confused with the modulus operator. 
- The syntax can also lead to common errors such as [failing to display tuples and dictionaries correctly](https://docs.python.org/3/library/stdtypes.html?highlight=sprintf#printf-style-string-formatting).
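
The tuple pitfall can be illustrated with a minimal sketch. The `point` variable is a made-up value for illustration: a bare tuple on the right-hand side of `%` is treated as the argument list rather than as a single value.

```python
point = (3, 4)  # hypothetical value, for illustration only

# A bare tuple is unpacked as the argument list, so the single %s
# placeholder receives two arguments and formatting fails.
try:
    message = "The point is %s" % point
except TypeError as err:
    message = None
    print(f"TypeError: {err}")

# Wrapping the value in a one-element tuple displays it as intended.
fixed = "The point is %s" % (point,)
print(fixed)
```

The first attempt raises a `TypeError`, while the second prints `The point is (3, 4)`.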

Unless you are stuck on a Python version older than 2.6, which predates `str.format()`, you should avoid using this method.

So, let's quickly move on...

### str.format() - 'Newer method'

Since Python 3.0 (and backported to Python 2.6), you can format strings by calling the `.format()` method on the string object.

The functionality is very similar to the previous `%` operator formatting. However, the string placeholders are denoted by `{}`, which can be more readable.

A full list of formatting functionality is available at [pyformat.info](https://pyformat.info/) which provides a great 'cheat sheet' for all the various ways to format a string (e.g. rounding, date time formatting etc.) - I would highly recommend checking it out.

**Example:**

In [3]:
# print string representation
print("My data has {} rows and {} columns".format(rows, columns))

My data has 10 rows and 4 columns


In [4]:
# print with named placeholders
print("My data has {rows} rows and {columns} columns".format(**data))

My data has 10 rows and 4 columns
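
Beyond plain substitution, placeholders can carry a format specifier after a colon. Here is a short sketch of a few specifiers that tend to be handy for reporting; the variable names and values are made up for illustration.

```python
from datetime import datetime

# Hypothetical values, for illustration only
accuracy = 0.87654
n_rows = 1234567
run_date = datetime(2021, 7, 3)

rounded = "Accuracy: {:.2f}".format(accuracy)     # two decimal places
percent = "Accuracy: {:.1%}".format(accuracy)     # percentage with one decimal
thousands = "Rows: {:,}".format(n_rows)           # thousands separator
dated = "Run date: {:%Y-%m-%d}".format(run_date)  # strftime-style date format

for line in (rounded, percent, thousands, dated):
    print(line)
```

This prints `Accuracy: 0.88`, `Accuracy: 87.7%`, `Rows: 1,234,567` and `Run date: 2021-07-03`. The same specifiers work inside f-string placeholders.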


`str.format()` is an improvement on `%`, however, the syntax can be a bit verbose, particularly if you have a lot of variables to substitute.

### f-strings - 'Newest method'

Finally, since Python 3.6, there is a third method: formatted string literals, or 'f-strings', which let you embed Python expressions inside string literals.

This can be really useful as it removes some of the verbose syntax overhead of the previous methods which reduces the amount of code you need to write.

With this method you only need to precede the string with the letter `f` or `F`.

**Example**

In [5]:
print(f"My data has {rows} rows and {columns} columns")

My data has 10 rows and 4 columns


### Which method should you use?

That was a *very* brief intro to the three main methods of string formatting in Python. I recommend checking out [RealPython](https://realpython.com/python-string-formatting/) and [pyformat](https://pyformat.info) for more detailed information on each method and the various different ways to customise the formatting.

**For print statements I personally prefer to use f-strings for most use cases.**

The syntax is very easy to remember and is less verbose than the `str.format()` method which makes it easier to read. You can also include expressions within the string which can be useful for making on the fly calculations. For example:


In [6]:
input_list = [1.3, 4.98, 32, 5.32, 3.98, 6.1, 2.4, 10.4]
print(f"The average value of the input list is {sum(input_list)/len(input_list):.2f}")

The average value of the input list is 8.31
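
The same idea carries over to logging during model training. Below is a minimal sketch with made-up metric values, combining variables with format specifiers in a single f-string.

```python
# Hypothetical training metrics, for illustration only
epoch, n_epochs = 7, 20
loss = 0.123456
accuracy = 0.9123

# :>2 right-aligns the epoch number, :.4f rounds the loss to four decimals,
# and :.1% renders the accuracy as a percentage.
log_line = f"epoch {epoch:>2}/{n_epochs} | loss {loss:.4f} | accuracy {accuracy:.1%}"
print(log_line)
```

This prints `epoch  7/20 | loss 0.1235 | accuracy 91.2%`.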


However, there are a couple of cases where `str.format()` can be more practical. The main example is when you are using a dictionary as the input source for your substitution variables.

For example, you might want to pass a dictionary containing the configuration or metadata for a particular model into a string which logs the training to the console.

Using an f-string, you have to specify the name of the dictionary each time you want to access a key. This involves a lot of repeated typing. It also reduces the flexibility of your statement if you want to pass a dictionary with a different name into the statement. You can also get in a mess with single and double quotes when referencing the dictionary key inside the wider print statement.

In [7]:
metadata = {"model": "xgboost", "model_dir": "models/", "data_dir": "data/"}

# interpolation using f-strings
print(
    f"Training {metadata['model']} model on data in the "
    f"'{metadata['data_dir']}' directory..."
)

Training xgboost model on data in the 'data/' directory...


A better and more flexible approach in this scenario would be to use the `str.format()` method and unpack the input dictionary containing the metadata.

In [8]:
print(
    "Training {model} model on data in the '{data_dir}' directory...".format(**metadata)
)

Training xgboost model on data in the 'data/' directory...
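
One practical consequence is that the template can be defined once and reused with any dictionary that has the right keys. A minimal sketch follows; `LOG_TEMPLATE` and the second configuration are hypothetical and only serve to illustrate the point.

```python
# Hypothetical template and configurations, for illustration only
LOG_TEMPLATE = "Training {model} model on data in the '{data_dir}' directory..."

xgb_config = {"model": "xgboost", "model_dir": "models/", "data_dir": "data/"}
rf_config = {"model": "random_forest", "model_dir": "models/rf/", "data_dir": "data/v2/"}

# Keys that the template does not use (here model_dir) are simply ignored.
messages = [LOG_TEMPLATE.format(**config) for config in (xgb_config, rf_config)]
for message in messages:
    print(message)
```

The template does not need to know the name of the dictionary it receives, which is what makes it easy to reuse.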


## 2. User Inputs

Formatting user inputs is perhaps the least common for data science purposes and is more common in web development. However, handling user inputs could be relevant for simple interactive programs or used as input for interactive dashboards and charts.

For this use case I will introduce the fourth method for interpolating strings: `Template`.

### Template

The Python [standard library](https://docs.python.org/3/tutorial/stdlib2.html#templating) includes a module called `string`, which provides a useful class called `Template`.

The format uses placeholder names formed by `$` followed by a valid Python identifier (alphanumeric characters and underscores):

In [9]:
from string import Template

temp = Template("There are $rows rows and $columns columns in the data")
temp.substitute(rows=rows, columns=columns)

'There are 10 rows and 4 columns in the data'

This is overkill for simple print statements. However, it can be particularly useful for protecting your application from malicious actors if you require user input.

Consider the scenario where the user supplies the template itself (e.g. a custom greeting) and you format it using the `str.format()` method. Because format strings can access attributes and keys of the objects passed to them, a malicious template can reach data it was never meant to see:

In [34]:
# set earlier in the application to connect to a db
DB_PASSWORD = "super-secret-password"


class NameHandler:
    def __init__(self, name):
        self.name = name

    def commit_to_db(self):
        pass


handler = NameHandler("Bob")

# they were supposed to just input a greeting e.g. 'Hello {user.name}!'
user_input = "{user.__init__.__globals__[DB_PASSWORD]}"

user_input.format(user=handler)

'super-secret-password'

In many cases where you trust the users of your application (e.g. for your own use or colleagues) using f-strings or `str.format()` will suffice. However, it is important to think of security implications of malicious user input if your application is available to the wider public.
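
To make the contrast concrete, here is a minimal sketch that assumes the user controls the template text. `SECRET` and the `User` class are hypothetical names used only for illustration.

```python
from string import Template

SECRET = "super-secret-password"  # hypothetical module-level secret


class User:  # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name


user = User("Bob")

# str.format() follows attribute and key lookups written in the template,
# so a user-supplied template can reach the module's global variables.
leaked = "{user.__init__.__globals__[SECRET]}".format(user=user)
print(leaked)

# Template only substitutes plain identifiers, so the same trick is rejected.
try:
    Template("${user.__init__.__globals__[SECRET]}").substitute(user=user)
    rejected = False
except ValueError as err:
    rejected = True
    print(f"ValueError: {err}")

# A well-behaved template still works as expected.
greeting = Template("Hello $name!").substitute(name=user.name)
print(greeting)
```

`str.format()` leaks the secret, whereas `Template.substitute()` raises a `ValueError` for the invalid placeholder and only fills in simple names such as `$name`.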

## 3. SQL Queries



The final use case we will discuss is string interpolation for SQL queries. This is probably the least trivial use case as there can be added complexity, especially if you want to generate long queries dynamically.

There are two general cases where you will be working with SQL queries in Python:
1. 'In-line' in a Notebook
2. Importing from a .sql file

Both scenarios can be treated in a similar way, because when you import from a `.sql` file you are essentially just reading a string.

It is common to deal with dynamic SQL queries by developing a 'base' SQL query with placeholders, then substituting the placeholders with the required values for your particular analysis. This is an example of string interpolation.

For example if we wanted to get the daily value of orders for a particular city we might have a base query defined as follows:

In [None]:
base_sql_query = """
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = '{city}'
GROUP BY date
"""

We could then apply string formatting using the `str.format()` method and build the query for a particular city dynamically.

The function below takes the base sql query and inserts the specified city into the query.

In [None]:
def build_orders_by_city_query(city: str, base_sql_query: str = base_sql_query) -> str:
    return base_sql_query.format(city=city)


print(build_orders_by_city_query(city="London"))

We could make this function even more generalisable to build any query from an input dictionary. We can unpack the variables dictionary to populate the string placeholders using the `str.format()` method.

In [None]:
def build_query(variables: dict, base_sql_query: str = base_sql_query) -> str:
    return base_sql_query.format(**variables)


base_sql_query = """
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = '{city}' AND date > '{start_date}'
GROUP BY date
"""

variables = {"city": "London", "start_date": "2020-01-01"}
print(build_query(variables, base_sql_query))

Here we have extended the initial base query by adding an additional filter (start_date) to the input dictionary.
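
The same pattern applies when the base query lives in a `.sql` file, since reading the file simply yields a string. The sketch below writes a temporary file to stand in for an existing one; the file name `orders_by_city.sql` is a hypothetical choice.

```python
import tempfile
from pathlib import Path

# Hypothetical contents of a .sql file kept alongside the analysis code
sql_file_contents = """SELECT
    date,
    SUM(order_value)
FROM orders
WHERE city = '{city}' AND date > '{start_date}'
GROUP BY date
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a .sql file that would already exist in a real project
    sql_path = Path(tmp_dir) / "orders_by_city.sql"
    sql_path.write_text(sql_file_contents)

    # Reading the file gives an ordinary string with the placeholders intact
    base_query_from_file = sql_path.read_text()

query = base_query_from_file.format(city="London", start_date="2020-01-01")
print(query)
```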

Note here that the `str.format()` method might be preferable to the f-string method as it allows us to easily unpack many variables from a dictionary input.

This works fine for small queries where the structure of the query is static, i.e. you always want to filter by the same columns or always want to apply the same arithmetic operations.

However, what happens if we want to make a longer and more complex query? For example, depending on the situation, we might want to filter by multiple fields or by no fields at all. Or we might want to dynamically unpivot certain rows depending on their value.

With the current approach we have to specify a fairly rigid base query ahead of time which is inflexible to any change in the query logic.

Luckily, there is a fifth approach to string interpolation - Jinja templates.

### Jinja

[Jinja](https://jinja.palletsprojects.com/en/3.0.x/intro/#introduction) is a fast, expressive and extensible templating engine which allows us to incorporate simple logic into our string expressions. 

Jinja's main use case is for rendering HTML templates for web applications, however, it comes in handy for building SQL queries as well.

I won't go into the syntax details too much in this post, rather, just demonstrate how it is a very powerful templating engine which allows you to program simple loops and if statements into your strings.

Going back to the previous example we can create the following Jinja template which will generalise to our needs.

In [None]:
jinja_base_sql_query = """
SELECT
    date,
    SUM(order_value)
FROM orders
WHERE
    {%- for city in filter_cities %}
    city = '{{city}}'
    {% if not loop.last -%}
    OR
    {%- endif -%}
    {%- endfor %}
GROUP BY date
"""

In [None]:
from jinja2 import Template

filter_cities = ["London", "Cardiff", "Edinburgh"]

print(Template(jinja_base_sql_query).render(filter_cities=filter_cities))

We have improved from the previous examples as we now have the ability to filter by an arbitrary list of cities - imagine if we had to write this query manually with a long list of cities.

We can take this further by applying logic to the columns we want to select as well as the cities we want to filter by.

In [None]:
jinja_base_sql_query2 = """
SELECT
    date
    {%- for product in target_products %}
    , SUM(CASE WHEN product_name = '{{product}}' THEN order_value END) AS sum_{{product}}_value
    {%- endfor %}
FROM orders
{% if cities_filter -%}
WHERE
    {%- for city in cities_filter %}
    city = '{{city}}'
    {% if not loop.last -%}
    OR
    {%- endif -%}
    {%- endfor %}
{% endif -%}
GROUP BY date
"""  # noqa: E501

In [None]:
query_data = {
    "target_products": ["book", "pen", "paper"],
    "cities_filter": ["London", "Cardiff", "Edinburgh"],
}

print(Template(jinja_base_sql_query2).render(query_data))

Here we have pivoted the product_name column to get the daily value of three products we are most interested in and also applied some city filters.

If we don't want to filter by any cities in this example, we can simply omit the `cities_filter` field from our input dictionary.

In [None]:
# removed cities_filter
query_data = {"target_products": ["book", "pen", "paper"]}

print(Template(jinja_base_sql_query2).render(query_data))

These examples are slightly contrived, but I hope they demonstrate the power of Jinja templating for your SQL queries to make them more expressive and generalisable.

The great thing about Jinja templates is that they are portable. You could save the Jinja templates as `.sql` files and reuse them across multiple projects. An alternative would be to create your own custom Python function to build up the complex query string dynamically. However, you would have to transport that function around with the SQL file. With Jinja, you just need to import `jinja2` and away you go.
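
For comparison, the hand-written alternative might look something like the sketch below. `build_orders_query` is a hypothetical helper, not part of any library, and it uses an `IN` clause for the optional city filter.

```python
from typing import Optional, Sequence


def build_orders_query(cities: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: build the daily orders query with an optional city filter."""
    lines = ["SELECT", "    date,", "    SUM(order_value)", "FROM orders"]
    if cities:
        # Values are interpolated as-is, so only pass trusted input here.
        quoted = ", ".join(f"'{city}'" for city in cities)
        lines.append(f"WHERE city IN ({quoted})")
    lines.append("GROUP BY date")
    return "\n".join(lines)


filtered_query = build_orders_query(["London", "Cardiff"])
unfiltered_query = build_orders_query()
print(filtered_query)
```

Even for this small query, the logic that decides the shape of the SQL lives in Python rather than next to the SQL itself, which is the portability cost described above.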