
query succeeds with correct answer, but a botocore.errorfactory.InvalidRequestException error message is logged #30

Closed
mhlr opened this issue Mar 27, 2018 · 4 comments

@mhlr

mhlr commented Mar 27, 2018

I am using PyAthena to query the recently released Common Crawl Parquet archives as described in
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

I set up and tested my index as described in the above article.
I then ran the same query as in the article using PyAthena:

import pandas as pd
from urllib.parse import quote_plus  # PY2: from urllib import quote_plus
from sqlalchemy.engine import create_engine
from sqlalchemy.sql.expression import select
from sqlalchemy.sql.functions import func
from sqlalchemy.sql.schema import Table, MetaData

conn_str = 'awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region_name}.amazonaws.com:443/'\
           '{schema_name}?s3_staging_dir={s3_staging_dir}'

# KEY and SECRET hold the user's AWS credentials (not shown above)
engine = create_engine(conn_str.format(
    aws_access_key_id=quote_plus(KEY),
    aws_secret_access_key=quote_plus(SECRET),
    region_name='us-east-1',
    schema_name='ccindex',
    s3_staging_dir=quote_plus('s3://athenatmp2/xyz/')))


pd.read_sql(
    """
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
    """, engine)

When I run the above code, the following error message gets printed out:

Failed to execute query.
Traceback (most recent call last):
  File "/home/dm/anaconda3/lib/python3.6/site-packages/pyathena/common.py", line 165, in _execute
    **request)
  File "/home/dm/anaconda3/lib/python3.6/site-packages/pyathena/util.py", line 45, in retry_api_call
    return retry(func, *args, **kwargs)
  File "/home/dm/anaconda3/lib/python3.6/site-packages/tenacity/__init__.py", line 313, in call
    start_time=start_time)
  File "/home/dm/anaconda3/lib/python3.6/site-packages/tenacity/__init__.py", line 269, in iter
    return fut.result()
  File "/home/dm/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/dm/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/dm/anaconda3/lib/python3.6/site-packages/tenacity/__init__.py", line 316, in call
    result = fn(*args, **kwargs)
  File "/home/dm/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 324, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/dm/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 622, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Conversion = '''

The program keeps running and returns the same answer as obtained in the Athena web console:

Out[1]: 
       count   url_host_registered_domain
0     278171                     blogg.no
1     177143               aftenposten.no
2     172755                  blogspot.no
3     166156            platekompaniet.no
4     156718                       nrk.no
5     115110                      urlm.no
6     104641                    fruugo.no
7      90093                       uio.no
8      65259                    zoover.no
9      56452                       uib.no
10     56250              anbudstorget.no
11     52171                  multicom.no
12     50460                     tanum.no
13     49668                       snl.no
14     41147               kursagenten.no
15     40221               slideplayer.no
16     35110                     kayak.no
17     31495                   skruvat.no
18     29917                       mtv.no
19     29659                nettavisen.no
20     27738               regjeringen.no
21     27455                 haugenbok.no
22     27350                   footway.no
...

PyAthena or SQLAlchemy must be reporting and then swallowing the error from deeper down.
While the answer is correct, the error message is concerning.

@mhlr mhlr changed the title from "error message printed, but query succeeds with correct answer" to "query succeeds with correct answer, but a botocore.errorfactory.InvalidRequestException error message is logged" on Mar 27, 2018
@laughingman7743
Owner

I think this is a problem related to quotation escaping.
I would like to know the definition of the ccindex table.
Thanks,

@mhlr
Author

mhlr commented Mar 27, 2018

I cut and pasted it into the Athena web interface from the Common Crawl article:
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
It is:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

The same query runs in the Athena web interface without complaint.

@laughingman7743
Owner

laughingman7743 commented Apr 1, 2018

SQLAlchemy seems to call the get_columns method before executing the query.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/sqlalchemy_athena.py#L134

With pandas's read_sql method, what gets passed as the table_name argument of get_columns is apparently not a table name but the whole query to be executed.
The get_columns method then fails when it tries to execute the following query:

SELECT
  table_schema,
  table_name,
  column_name,
  data_type,
  is_nullable,
  column_default,
  ordinal_position,
  comment
FROM information_schema.columns
WHERE table_schema = 'ccindex'
AND table_name = '
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
'

Even if an error occurs in the get_columns method, the query execution still completes normally.
This looks like a problem with the implementation of pandas's read_sql method.
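The failure mode can be sketched with a small stand-in (class and function names here are illustrative, not pandas or PyAthena internals): read_sql first asks the backend whether the SQL string names an existing table, and with the Athena dialect that probe is what raises and logs the InvalidRequestException before pandas falls back to query mode.

```python
# Minimal sketch of pandas.read_sql's table-vs-query dispatch.
# FakeAthenaBackend and read_sql_dispatch are hypothetical names
# used only to illustrate the control flow described above.
import logging

class FakeAthenaBackend:
    """Stand-in for the SQLAlchemy/PyAthena layer."""

    def __init__(self, tables):
        self.tables = tables

    def has_table(self, name):
        # PyAthena's get_columns runs an information_schema query here;
        # a multi-line SELECT passed as the "table name" makes Athena
        # reject the request.
        try:
            if name not in self.tables:
                raise RuntimeError("InvalidRequestException: invalid table name")
            return True
        except RuntimeError:
            # The error is logged (the traceback the reporter saw)...
            logging.getLogger(__name__).exception("Failed to execute query.")
            # ...and then swallowed: the probe just answers "no such table".
            return False

    def read_query(self, sql):
        return "query result"

def read_sql_dispatch(sql, backend):
    # pandas tries "table" mode first and falls back to "query" mode,
    # which is why the query still succeeds despite the logged error.
    if backend.has_table(sql):
        return "table result"
    return backend.read_query(sql)

backend = FakeAthenaBackend(tables={"ccindex"})
print(read_sql_dispatch("ccindex", backend))        # table result
print(read_sql_dispatch("SELECT COUNT(*) ...", backend))  # query result
```

Running the second call prints the "Failed to execute query." traceback to stderr and still returns the query result, mirroring the behavior reported above.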

As a workaround, it is better to pass a DB-API connection instead of the SQLAlchemy engine to the read_sql method:

df = pd.read_sql("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
""", engine.connect().connection)
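An equivalent workaround (a sketch only: the function name is hypothetical, and the credentials and staging bucket are the placeholders from the report) is to skip the SQLAlchemy dialect entirely and build the DB-API connection with pyathena.connect, so the get_columns probe is never made:

```python
def load_norwegian_domains(key, secret):
    """Run the Common Crawl query over a plain DB-API connection.

    Sketch only: the credentials, staging bucket, and this function
    name are placeholders, not part of the original report's code.
    """
    import pandas as pd
    from pyathena import connect  # imported lazily so the sketch loads without pyathena installed

    conn = connect(aws_access_key_id=key,
                   aws_secret_access_key=secret,
                   s3_staging_dir="s3://athenatmp2/xyz/",
                   region_name="us-east-1")
    return pd.read_sql("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
""", conn)
```

Because pyathena.connect returns a DB-API connection directly, pandas treats the string as a query and never probes information_schema for a matching table name.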

With the read_sql_query method, it seems to be fine to pass the SQLAlchemy engine:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html#pandas.read_sql_query

df = pd.read_sql_query("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
""", engine)

@mhlr
Author

mhlr commented Apr 4, 2018

@laughingman7743 Thanks!!
