In [24]:
import pandas as pd
import wmfdata as wmf

from wmfdata.utils import (
    pd_display_all,
    sql_tuple
)

How do we query our databases for strings with special characters?

To make this concrete, let's take some English Wikipedia users with special characters in their names:
* `Mr. "Turra"`
* `JKl'`
* `Percy%`
* `Zyonnoray123\`

With MariaDB, double quotes don't need to be escaped as long as single quotes enclose the string. Single quotes are preferable anyway, since Presto and other engines that strictly follow the ANSI SQL standard interpret double-quoted strings as identifiers. Percent signs don't need to be escaped (unless in a `LIKE` clause). Single quotes do need to be escaped, using either `\\'` or `''`. Backslashes need to be escaped using `\\\\`: in a regular Python string, `\\` produces a single literal backslash, so `\\\\` reaches MariaDB as two backslashes, which it reads as one.

In [80]:
wmf.mariadb.run("""
SELECT
    user_id,
    user_name
FROM user
WHERE
    user_name IN (
        'Mr. "Turra"',
        'JKl\\'',
        'Percy%',
        'Zyonnoray123\\\\'
    )
""", "enwiki")

Unnamed: 0,user_id,user_name
0,53999,JKl'
1,27684213,"Mr. ""Turra"""
2,388583,Percy%
3,39297522,Zyonnoray123\
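
The doubled backslashes in the query above exist because the text passes through two layers of escaping: Python first, then MariaDB. As a minimal sketch (the variable names `regular` and `raw` are just for illustration), a Python raw string shows exactly what the database receives, and comparing it with the regular string confirms that the two spellings are equivalent:

```python
# Regular string: Python consumes one level of backslashes before MariaDB sees the text.
regular = """
WHERE
    user_name IN (
        'JKl\\'',
        'Zyonnoray123\\\\'
    )
"""

# Raw string: Python leaves backslashes alone, so this is the text MariaDB receives.
raw = r"""
WHERE
    user_name IN (
        'JKl\''
,
        'Zyonnoray123\\'
    )
""".replace("'\n,", "',")  # rejoin the comma; the split only keeps this example readable

# Both spellings produce the same SQL text.
assert regular == raw
print(regular)
```

Writing the query as a raw string is one way to avoid counting backslashes, at the cost of having to remember that the SQL-level escapes are still required.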


In [82]:
wmf.mariadb.run("""
SELECT
    user_id,
    user_name
FROM user
WHERE
    user_name = 'JKl'''
""", "enwiki")

Unnamed: 0,user_id,user_name
0,53999,JKl'


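With Presto, double quotes and percent signs also don't need to be escaped. Single quotes can _only_ be escaped with `''`. Backslashes have no special meaning in Presto string literals, so they need only the Python-level `\\`, which produces a single literal backslash.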

In [101]:
wmf.presto.run("""
SELECT
    user_id,
    user_name
FROM wmf_raw.mediawiki_user
WHERE
    user_name IN (
        'Mr. "Turra"',
        'JKl''',
        'Percy%',
        'Zyonnoray123\\'
    )
    AND wiki_db = 'enwiki'
    AND snapshot = '2022-09'
""")

Unnamed: 0,user_id,user_name
0,53999,JKl'
1,388583,Percy%
2,39297522,Zyonnoray123\
3,27684213,"Mr. ""Turra"""
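
When the names come from a Python list rather than being typed by hand, the Presto rule above can be applied programmatically. The following is a minimal sketch, not part of `wmfdata`: `presto_string_literal` is a hypothetical helper that only doubles single quotes, which is the one escape the rule above calls for.

```python
def presto_string_literal(value: str) -> str:
    """Quote `value` as a Presto string literal (hypothetical helper, not a wmfdata function)."""
    # Only the single quote is special in a Presto string literal; it is escaped by doubling.
    # Double quotes, percent signs, and backslashes pass through unchanged.
    return "'" + value.replace("'", "''") + "'"


# The four usernames from this notebook, written as ordinary Python strings.
names = ['Mr. "Turra"', "JKl'", "Percy%", "Zyonnoray123\\"]

# Build the contents of an IN (...) list.
in_list = ", ".join(presto_string_literal(name) for name in names)
print(in_list)
# 'Mr. "Turra"', 'JKl''', 'Percy%', 'Zyonnoray123\'
```

The resulting text could be interpolated into the `IN (...)` clause of the query above. This is only an illustration of the quoting rule; it is not meant as a general defense against untrusted input.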


With Spark, there's also no need to escape double quotes and percent signs. Single quotes can _only_ be escaped with `\\'` and backslashes must be escaped with `\\\\`.

In [107]:
wmf.spark.run("""
SELECT
    user_id,
    user_name
FROM wmf_raw.mediawiki_user
WHERE
    user_name IN (
        'Mr. "Turra"',
        'JKl\\'',
        'Percy%',
        'Zyonnoray123\\\\'
    )
    AND wiki_db = 'enwiki'
    AND snapshot = '2022-09'
""")

Unnamed: 0,user_id,user_name
0,53999,JKl'
1,388583,Percy%
2,27684213,"Mr. ""Turra"""
3,39297522,Zyonnoray123\
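
The Spark rule can be applied programmatically in the same spirit. The sketch below is not part of `wmfdata`: `spark_string_literal` is a hypothetical helper, and it assumes Spark's default string-literal parsing (that is, `spark.sql.parser.escapedStringLiterals` left at `false`), under which the backslash acts as the escape character.

```python
def spark_string_literal(value: str) -> str:
    """Quote `value` as a Spark SQL string literal (hypothetical helper, not a wmfdata function)."""
    # Escape backslashes first; otherwise the backslash added in front of
    # each single quote in the next step would itself be doubled.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


# The four usernames from this notebook, written as ordinary Python strings.
names = ['Mr. "Turra"', "JKl'", "Percy%", "Zyonnoray123\\"]

# Build the contents of an IN (...) list.
in_list = ", ".join(spark_string_literal(name) for name in names)
print(in_list)
# 'Mr. "Turra"', 'JKl\'', 'Percy%', 'Zyonnoray123\\'
```

Because the helper works on the actual string values, the printed text contains only the SQL-level escapes; the extra Python-level backslashes appear only when such a query is typed as a regular Python string literal.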

