Extracts MySQL general logs for long-term storage, which can be used for audit purposes.
MySQL removes general logs to keep from exhausting storage, as dictated by the log storage policy. On an instance with high-volume traffic, storage can be exhausted quickly, in which case MySQL rotates the general_log table. Studying the general logs is very helpful when investigating business issues that might be caused by data or schema changes, and database access patterns can also be determined from these logs. To increase log retention, the mysql-general-log-extractor extracts logs from the MySQL instance and stores them in an AWS S3 bucket for a longer duration. An Athena table is built on top of this bucket, so the data can be queried efficiently using AWS Athena.
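Note that the extractor reads from the general_log table, so table-based general logging must be enabled on the MySQL instance before anything can be extracted. A minimal example (on managed services such as Amazon RDS, set these through the parameter group instead):

-- enable general logging to the mysql.general_log table
SET GLOBAL log_output = 'TABLE';
SET GLOBAL general_log = 'ON';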
0. As a prerequisite, install Groovy version 2.4.7 or higher.
1. Create the following table in MySQL. This serves as a checkpoint and is used for coordinating subsequent runs of the extractor in case of failure, ensuring no data/logs are missed.
CREATE TABLE registry
(
    registry_key VARCHAR(128) NOT NULL,
    registry_value TEXT,
    PRIMARY KEY (registry_key)
)
ENGINE=InnoDB;
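For illustration, the checkpointing pattern this table supports looks roughly like the sketch below; the registry key name, connection details, and date format are assumptions, and the script's actual values may differ.

import groovy.sql.Sql

// minimal sketch of the checkpoint pattern, assuming a hypothetical 'last_run_time' key
def sql = Sql.newInstance('jdbc:mysql://localhost:3306/yourdb', 'user', 'password', 'com.mysql.jdbc.Driver')
def row = sql.firstRow("SELECT registry_value FROM registry WHERE registry_key = 'last_run_time'")
def lastRunTime = row?.registry_value // null on the very first run
// ... extract general logs newer than lastRunTime and upload them to S3 ...
sql.executeUpdate(
    "INSERT INTO registry (registry_key, registry_value) VALUES ('last_run_time', ?) " +
    "ON DUPLICATE KEY UPDATE registry_value = VALUES(registry_value)",
    [new Date().format('yyyy-MM-dd HH:mm:ss')])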
2. Create an S3 bucket where the extracted logs will be stored (the examples below use s3://mysql-general-logs with the prefix prod/general; change as needed).
3. Create the Athena table named mysql_general_logs. The following query can be used to create the table:
CREATE EXTERNAL TABLE `mysql_general_logs`(
  `event_time` string,
  `user_host` string,
  `thread_id` bigint,
  `server_id` int,
  `command_type` string,
  `statement` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '^'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mysql-general-logs/prod/general'
TBLPROPERTIES (
  'has_encrypted_data'='false');
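New dt partitions must be registered before they are visible to queries. If the script does not register them itself, they can be loaded manually; the example below assumes Hive-style dt=YYYY-MM-DD prefixes under the table location:

-- register one day's partition explicitly
ALTER TABLE mysql_general_logs ADD IF NOT EXISTS
  PARTITION (dt='2018-07-06') LOCATION 's3://mysql-general-logs/prod/general/dt=2018-07-06/';

-- or discover all Hive-style partitions at once
MSCK REPAIR TABLE mysql_general_logs;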
4. Clone this repository to the instance where the log extraction will run, and modify the following constants in com/roy/mysql/StoreMysqlGeneralLogsToS3.groovy with the respective values for your environment:
static final def AWS_ACCOUNT_ID = 'PUT_YOUR_AWS_ACCOUNT_ID'
static final def ATHENA_DB_NAME = 'YOUR_ATHENA_DATABASE_NAME'
static final def AWS_REGION = 'us-east-1'
static final def ATHENA_TABLE_NAME = 'mysql_general_logs'
5. Make sure all permissions and credentials are properly set up. The script needs read and write permissions for MySQL, Athena, and the S3 bucket.
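On the MySQL side, a dedicated user can be granted just what the extractor needs. A minimal sketch, assuming a hypothetical log_extractor user and that the registry table lives in a database named yourdb:

-- hypothetical user name and database; adjust to your environment
CREATE USER 'log_extractor'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT SELECT ON mysql.general_log TO 'log_extractor'@'%';
GRANT SELECT, INSERT, UPDATE ON yourdb.registry TO 'log_extractor'@'%';
-- if the script also rotates or truncates general_log, additional privileges on mysql.* will be required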
6. Set up a cron job for this script with an hourly frequency. The command must be executed from the project root directory. The following crontab entry runs hourly and appends the application logs to the /var/log/mysql directory (change as needed):
0 * * * * cd /path/to/mysql-general-log-extractor && groovy com/roy/mysql/StoreMysqlGeneralLogsToS3.groovy >> /var/log/mysql/mysql-to-s3-copy.log 2>&1
If the extraction interval needs to be shortened to minutes because the data volume is too large, modify the following snippet in com/roy/mysql/StoreMysqlGeneralLogsToS3.groovy to handle the higher frequency.
use(groovy.time.TimeCategory) {
    if (!lastRunTime) { // if no last run is recorded, start extracting from the last 1 hour
        lastRunTime = now - 1.hour // <-- CHANGE to xx.minutes for minute-level frequency
    }
    def duration = now - lastRunTime
    if (duration.days > 0 || duration.hours > 2) {
        LOGGER.info("Falling behind log generation by ${duration.days} days and ${duration.hours} hours")
        now = lastRunTime + 1.hour + 30.minutes // <-- CHANGE to xx.minutes for minute-level frequency; also reduce 30.minutes to an appropriate value for catching up
    }
}
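If you shorten the interval, align the cron schedule with it as well; for example, a 15-minute schedule (same placeholder path as above):

*/15 * * * * cd /path/to/mysql-general-log-extractor && groovy com/roy/mysql/StoreMysqlGeneralLogsToS3.groovy >> /var/log/mysql/mysql-to-s3-copy.log 2>&1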
7. After the first extraction completes (monitor the application logs), execute the following query in Athena to see the general logs:
SELECT * FROM "<your_database_name>"."mysql_general_logs" LIMIT 100;
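As a further illustration of the audit use case, a query along these lines surfaces schema changes for a given day; the dt value format and the command_type filter are assumptions based on the MySQL general log schema:

-- example audit query: schema changes recorded on one day
SELECT event_time, user_host, statement
FROM "<your_database_name>"."mysql_general_logs"
WHERE dt = '2018-07-06'
  AND command_type = 'Query'
  AND upper(statement) LIKE 'ALTER TABLE%';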