query succeeds with correct answer, but a botocore.errorfactory.InvalidRequestException error message is logged #30
Comments
I think that it is a problem related to quotation escaping. |
I cut and pasted it into the Athena web interface from the CommonCrawl article:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
url_surtkey STRING,
url STRING,
url_host_name STRING,
url_host_tld STRING,
url_host_2nd_last_part STRING,
url_host_3rd_last_part STRING,
url_host_4th_last_part STRING,
url_host_5th_last_part STRING,
url_host_registry_suffix STRING,
url_host_registered_domain STRING,
url_host_private_suffix STRING,
url_host_private_domain STRING,
url_protocol STRING,
url_port INT,
url_path STRING,
url_query STRING,
fetch_time TIMESTAMP,
fetch_status SMALLINT,
content_digest STRING,
content_mime_type STRING,
content_mime_detected STRING,
warc_filename STRING,
warc_record_offset INT,
warc_record_length INT,
warc_segment STRING)
PARTITIONED BY (
crawl STRING,
subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

The same query runs in the Athena web interface without complaint. |
SQLAlchemy seems to call the get_columns method before executing the query. With pandas' read_sql method, it appears that what gets passed as the table_name argument of get_columns is not a table name but the query to be executed.
Even when an error occurs in get_columns, the query execution still completes normally. As a workaround, it is better to pass a DB-API connection, rather than a SQLAlchemy engine, to read_sql.
With read_sql_query, it seems fine to pass a SQLAlchemy engine.
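A runnable sketch of the suggested workaround: pandas' read_sql accepts either a SQLAlchemy engine or a raw DB-API connection, and with a DB-API connection it falls back to read_sql_query and never asks the dialect for column metadata. sqlite3 stands in for a pyathena.connect() connection here so the snippet runs without AWS credentials; against Athena you would pass the PyAthena connection object in its place.

```python
# Illustration of passing a DB-API connection (not a SQLAlchemy engine)
# to pandas.read_sql. With a DB-API connection, pandas dispatches to
# read_sql_query and skips the get_columns metadata lookup that logged
# the spurious InvalidRequestException. sqlite3 is a stand-in for
# pyathena.connect(...) so this runs without AWS credentials.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for a pyathena.connect(...) connection
df = pd.read_sql("SELECT 1 AS url_port", conn)  # query string + DB-API connection
print(df["url_port"].iloc[0])
```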
|
@laughingman7743 Thanks!! |
I am using PyAthena to query the recently released CommonCrawl parquet archives as described in
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
I set up and tested my index as described in the above article.
I then tested the same query as in the article using PyAthena.
When I run the above code, the following error message gets printed out:
The program keeps running and returns the same answer as obtained in the Athena web console:
PyAthena or SQLAlchemy must be reporting and then swallowing the error from deeper down.
While the answer is correct, the error message is concerning.