## Check for data collection and processing errors

### Kingfisher Collect Log

Print the crawler statistics from the log file specified in the setup section. If `downloader/response_status_count/{code}` is non-zero for any `{code}` that is an HTTP error code (400-599), the collection may be incomplete. Where possible, check the total number of releases and/or contracting processes against the data source's front-end.

In [None]:
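if log_url:
    # Download the crawl log. Stop on an HTTP error, so that an error page isn't parsed as a log.
    response = requests.get(log_url, auth=('scrape', scrapy_password))
    response.raise_for_status()

    with open('log_file', 'wb') as f:
        f.write(response.content)

    # Parse the log and print the crawler statistics.
    log = ScrapyLogFile('log_file').logparser
    pprint(dict(log['crawler_stats']))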
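
The check described above can be sketched as a small filter over the statistics. This is a minimal sketch, not part of Kingfisher Collect: the `http_error_counts` helper and the sample values are hypothetical, and it assumes the statistics behave like a dictionary whose keys follow the `downloader/response_status_count/{code}` pattern.

```python
import re

# Matches keys like "downloader/response_status_count/404" and captures the code.
STATUS_KEY = re.compile(r"^downloader/response_status_count/(\d{3})$")


def http_error_counts(crawler_stats):
    """Return {status code: count} for HTTP error codes (400-599) with non-zero counts."""
    errors = {}
    for key, value in crawler_stats.items():
        match = STATUS_KEY.match(key)
        if match:
            code = int(match.group(1))
            if 400 <= code <= 599 and value:
                errors[code] = value
    return errors


# Made-up statistics for illustration. In the notebook, pass dict(log['crawler_stats']).
sample_stats = {
    "downloader/response_count": 120,
    "downloader/response_status_count/200": 115,
    "downloader/response_status_count/404": 3,
    "downloader/response_status_count/503": 2,
    "item_scraped_count": 115,
}
print(http_error_counts(sample_stats))
```

An empty dictionary means no HTTP error responses were recorded. It does not by itself confirm that the collection is complete.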

### Collection notes

Generate a list of notes for each collection from the `collection_note` table.
Users can add notes when starting a spider or via the CLI. Transforms can also add notes.

In [None]:
%%sql

SELECT
    collection_id,
    note
FROM
    collection_note
WHERE
    collection_id IN :collection_ids;


### Collection file errors

Generate a summary of errors and warnings from the `collection_file` table.
Kingfisher Collect and the `local-load` command report errors when they cannot retrieve a file.
Kingfisher Process reports warnings when it needs to modify the contents of a file in order to store it.
Presently, the only warning is about the removal of control characters.

In [None]:
%%sql collection_file_error_summary <<

SELECT
    collection_id,
    warnings,
    errors,
    count(*),
    (array_agg(filename ORDER BY random()))[1:3] AS example_filenames,
    (array_agg(url ORDER BY random()))[1:3] AS example_urls
FROM
    collection_file
WHERE
    collection_id IN :collection_ids
    AND (errors IS NOT NULL
        OR warnings IS NOT NULL)
GROUP BY
    1,
    2,
    3
ORDER BY
    4 DESC;



In [None]:
collection_file_error_summary
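
To make the control-character warning described above more concrete, the sketch below strips such characters from a string and reports how many were removed. It is an illustration only, not Kingfisher Process's implementation: the `strip_control_characters` helper and the choice of characters (ASCII control characters other than tab, line feed and carriage return) are assumptions.

```python
import re

# Assumption: ASCII control characters, excluding tab (\x09), line feed (\x0a)
# and carriage return (\x0d). Kingfisher Process may target a different set.
CONTROL_CHARACTERS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_characters(text):
    """Return the text without control characters, and the number removed."""
    cleaned, count = CONTROL_CHARACTERS.subn("", text)
    return cleaned, count


# A string with a NUL character and a unit separator; the tab is kept.
cleaned, removed = strip_control_characters("a\x00b\tc\x1f")
print(repr(cleaned), removed)
```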

Generate a full list of errors and warnings from the `collection_file` table.

In [None]:
%%sql collection_file_errors <<

SELECT
    collection_id,
    filename,
    warnings,
    url,
    errors
FROM
    collection_file
WHERE
    collection_id IN :collection_ids
    AND (errors IS NOT NULL
        OR warnings IS NOT NULL);



In [None]:
collection_file_errors
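
When the full list is long, a per-collection tally can help decide where to look first. The sketch below is one possible approach: the `tally_problems` helper and the sample rows are hypothetical, and it assumes each row can be read as a dictionary with `collection_id`, `errors` and `warnings` keys (ipython-sql result sets offer a `dicts()` method for this).

```python
from collections import Counter


def tally_problems(rows):
    """Count the rows with errors and the rows with warnings, per collection."""
    tally = Counter()
    for row in rows:
        if row.get("errors") is not None:
            tally[(row["collection_id"], "errors")] += 1
        if row.get("warnings") is not None:
            tally[(row["collection_id"], "warnings")] += 1
    return dict(tally)


# Made-up rows for illustration. In the notebook, the rows might come from
# collection_file_errors.dicts().
sample_rows = [
    {"collection_id": 1, "filename": "a.json", "errors": {"http_code": 500}, "warnings": None},
    {"collection_id": 1, "filename": "b.json", "errors": None, "warnings": ["control characters"]},
    {"collection_id": 2, "filename": "c.json", "errors": {"http_code": 404}, "warnings": None},
]
print(tally_problems(sample_rows))
```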

### Collection file item errors

Generate a summary of errors and warnings from the `collection_file_item` table.
Kingfisher Process reports errors when it cannot load a file item.

In [None]:
%%sql collection_file_item_error_summary <<

SELECT
    cf.collection_id,
    cfi.warnings,
    cfi.errors,
    count(*)
FROM
    collection_file_item AS cfi
    JOIN collection_file AS cf ON cfi.collection_file_id = cf.id
WHERE
    cf.collection_id IN :collection_ids
    AND (cfi.errors IS NOT NULL
        OR cfi.warnings IS NOT NULL)
GROUP BY
    1,
    2,
    3
ORDER BY
    4 DESC;



In [None]:
collection_file_item_error_summary

Generate a full list of errors and warnings from the `collection_file_item` table.

In [None]:
%%sql collection_file_item_errors <<

SELECT
    cf.collection_id,
    cfi.number,
    cfi.warnings,
    cfi.errors
FROM
    collection_file_item AS cfi
    JOIN collection_file AS cf ON cfi.collection_file_id = cf.id
WHERE
    cf.collection_id IN :collection_ids
    AND (cfi.errors IS NOT NULL
        OR cfi.warnings IS NOT NULL);



In [None]:
collection_file_item_errors

### Check errors

Generate a summary of errors from the `release_check_error` and `record_check_error` tables.
CoVE reports errors when it cannot check a release or record.

In [None]:
%%sql check_error_summary <<

WITH errors AS (
    SELECT
        collection_id,
        'release' AS type,
        release.id AS release_id,
        release_check_error.error
    FROM
        release_check_error
        JOIN release ON release_check_error.release_id = release.id
    WHERE
        release.collection_id IN :collection_ids
    UNION
    SELECT
        collection_id,
        'record' AS type,
        record.id AS record_id,
        record_check_error.error
    FROM
        record_check_error
        JOIN record ON record_check_error.record_id = record.id
    WHERE
        record.collection_id IN :collection_ids
)
SELECT
    collection_id,
    type,
    error,
    count(*),
    (array_agg(release_id ORDER BY random()))[1:3] AS example_release_ids
FROM
    errors
GROUP BY
    1,
    2,
    3
ORDER BY
    4 DESC;



In [None]:
check_error_summary

Generate a full list of errors from the `release_check_error` and `record_check_error` tables.

In [None]:
%%sql check_errors <<

SELECT
    collection_id,
    'release' AS type,
    release.id AS release_id,
    release_check_error.error
FROM
    release_check_error
    JOIN release ON release_check_error.release_id = release.id
WHERE
    release.collection_id IN :collection_ids
UNION
SELECT
    collection_id,
    'record' AS type,
    record.id AS record_id,
    record_check_error.error
FROM
    record_check_error
    JOIN record ON record_check_error.record_id = record.id
WHERE
    record.collection_id IN :collection_ids;



In [None]:
check_errors