Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data extract #59

Closed
garethrees opened this issue Oct 13, 2016 · 14 comments
Closed

Data extract #59

garethrees opened this issue Oct 13, 2016 · 14 comments
Assignees

Comments

@garethrees
Copy link
Member

garethrees commented Oct 13, 2016

See https://groups.google.com/a/mysociety.org/forum/#!topic/alaveteli/ZTR_7sd_nw0

I was wondering how easy it is/if it is possible to extract data from AsktheEU.org. Specifically, we are interested in knowing the number of (working) days from start to finish of each request.
We would want to be able to say that ‘for request ABC, it took institutions XX number of days to respond‘. I guess that would mean knowing the date the reply was sent, to the last date of a reply from the institutions.

TL;DR so far is to make a CSV with the following headings:

  • Info Request Title
  • Public Body Name
  • Request Created Timestamp
  • First Response Timestamp (could be an auto-reply though)
  • Last Response Timestamp
  • Last Success/Failure Status Timestamp
  • Last Status Update Timestamp
@garethrees
Copy link
Member Author

garethrees commented Nov 22, 2016

Last Success/Failure Status Timestamp

Want the first one?

@lizconlan
Copy link

I can get all of that fairly simply from a single SQL query except for the success/failure timestamp because we're not caching delivery_status here yet :(

SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       ir.created_at,
       MIN(response.created_at) AS first_response_at,
       MAX(response.created_at) AS last_response_at,
       MAX(status.created_at) AS last_status_update_at
  FROM info_requests ir
    INNER JOIN public_bodies pb ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response ON ir.id = response.info_request_id
    INNER JOIN info_request_events status ON ir.id = status.info_request_id
  WHERE response.event_type = 'response' AND status.event_type = 'status_update'
  GROUP BY ir.id, ir.title, pb.name, ir.created_at
  ORDER BY ir.id DESC
  LIMIT 10;

@garethrees
Copy link
Member Author

Sorry, that was referring to user-supplied state description ("This was successful" / "they didn't have what I wanted")

  • A successful status (or failure) indicates when the requester updated the request. This might be long after the authority provided the information. You could infer that if there is a status update of success/failure, then the timestamp of the last response is when the authority “responded”. This is not always true, as there can be some correspondence after the response with information, but its probably a good enough judgement.

https://groups.google.com/a/mysociety.org/d/msg/alaveteli/ZTR_7sd_nw0/B0JKmGJuAwAJ

@lizconlan
Copy link

lizconlan commented Dec 1, 2016

Sorry, that was referring to user-supplied state description

Aha! (Thanks)

In which case, something like...

SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       ir.created_at,
       MIN(response.created_at) AS first_response_at,
       MAX(response.created_at) AS last_response_at,
       MAX(status.created_at) AS last_status_update_at,
       MAX(event.created_at) AS requester_updated_at,
       CASE
         WHEN event.described_state = 'successful' OR event.described_state = 'partially_successful'
         THEN 'Success'
         ELSE 'Fail'
       END AS outcome
  FROM info_requests ir
    INNER JOIN public_bodies pb ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response ON ir.id = response.info_request_id
    INNER JOIN info_request_events status ON ir.id = status.info_request_id
    JOIN info_request_events event ON ir.id = event.info_request_id
  WHERE response.event_type = 'response'
    AND status.event_type = 'status_update'
    AND (event.described_state = 'successful' OR 
         event.described_state = 'partially_successful' OR
         event.described_state = 'rejected' OR
         event.described_state = 'not_held')
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, event.described_state
  ORDER BY ir.id DESC
  LIMIT 10;

@garethrees
Copy link
Member Author

Nice SQL skills 😎

Some comments inline. Looks pretty close though!

SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       ir.created_at,
       MIN(response.created_at) AS first_response_at,
       MAX(response.created_at) AS last_response_at,
       MAX(status.created_at) AS last_status_update_at,
       --
       -- The idea behind "Last Success/Failure Status Timestamp" and "Last
       -- Status Update Timestamp" was that a user may mark a request as
       -- successful, and then ask for more information, changing the state of
       -- the request to e.g. "waiting_response".
       --
       -- I was imagining we'd have a timestamp for each, so that you can spot
       -- the above case when those timestamps differ. Actually, its probably
       -- too confusing and an edge case, so I think we just stick with the
       -- last_status_update_at line above and do away with requester_updated_at
       -- completely
       --
       MAX(event.created_at) AS requester_updated_at,
       --
       -- I think we should just return the last described state here. It didn't
       -- necessarily fail if e.g. the authority said they don't have the info – they
       -- did their job and responded.
       --
       --    event.described_state AS current_status
       --
       CASE
         WHEN event.described_state = 'successful' OR event.described_state = 'partially_successful'
         THEN 'Success'
         ELSE 'Fail'
       END AS outcome
  FROM info_requests ir
    INNER JOIN public_bodies pb ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response ON ir.id = response.info_request_id
    INNER JOIN info_request_events status ON ir.id = status.info_request_id
    JOIN info_request_events event ON ir.id = event.info_request_id
  WHERE response.event_type = 'response'
    AND status.event_type = 'status_update'
    --- Why are we ignoring a bunch of other possible described states?
    --- I don't think we want this filter at all
    AND (event.described_state = 'successful' OR 
         event.described_state = 'partially_successful' OR
         event.described_state = 'rejected' OR
         event.described_state = 'not_held')
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, event.described_state
  ORDER BY ir.id DESC
  ---
  --- I know you know this, but don't forget to remove the LIMIT.
  ---
  LIMIT 10;

@lizconlan
Copy link

lizconlan commented Dec 6, 2016

Nice SQL skills 😎

Thanks! I once shared an office with 3 DBAs, after the first couple of years they let me be an honorary DBA 😃 (er, which they'll probably revoke after looking at the execution plan for this)

--- Why are we ignoring a bunch of other possible described states?
--- I don't think we want this filter at all

That was an attempt to only have user generated states included when attempting to isolate the most recent user update (but I was concerned I'd cut the list down too much). I'll take it out.

--- I know you know this, but don't forget to remove the LIMIT.

Best not to assume!

Updated (simplified) query...

SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       ir.created_at,
       MIN(response.created_at) AS first_response_at,
       MAX(response.created_at) AS last_response_at,
       MAX(status.created_at) AS last_status_update_at,
       event.described_state
  FROM info_requests ir
    INNER JOIN public_bodies pb ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response ON ir.id = response.info_request_id
    INNER JOIN info_request_events status ON ir.id = status.info_request_id
    JOIN info_request_events event ON ir.id = event.info_request_id
  WHERE response.event_type = 'response'
    AND status.event_type = 'status_update'
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, event.described_state
  ORDER BY ir.id DESC;

@lizconlan
Copy link

@garethrees got time for a quick sanity check? I think I've implemented your changes correctly

@garethrees
Copy link
Member Author

That returns multiple rows for the same requests (looks like it might be one for each state its been in)…

@lizconlan
Copy link

wait, it's worse than that - forgot to change the join type when I took the restricting clause out

@lizconlan
Copy link

lizconlan commented Dec 6, 2016

Simplify all the things (we're not getting anything useful from event now anyway) ...

SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       ir.created_at,
       MIN(response.created_at) AS first_response_at,
       MAX(response.created_at) AS last_response_at,
       MAX(status.created_at) AS last_status_update_at,
       ir.described_state
  FROM info_requests ir
    INNER JOIN public_bodies pb ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response ON ir.id = response.info_request_id
    INNER JOIN info_request_events status ON ir.id = status.info_request_id
  WHERE response.event_type = 'response'
    AND status.event_type = 'status_update'
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, status.described_state
  ORDER BY ir.id DESC

edit: adding described_state from info_request_events was causing multiple entries for some requests and, as we only wanted the current status, I've updated this to use the info_request described_state instead

@lizconlan
Copy link

lizconlan commented Dec 6, 2016

...then make them complicated again:

\copy (SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       to_char(ir.created_at, 'YYYY-MM-DD HH24:MI:SS') AS created_at,
       to_char(MIN(response.created_at), 'YYYY-MM-DD HH24:MI:SS') AS first_response_at,
       to_char(MAX(response.created_at), 'YYYY-MM-DD HH24:MI:SS') AS last_response_at,
       to_char(MAX(status.created_at), 'YYYY-MM-DD HH24:MI:SS') AS last_status_update_at,
       ir.described_state
  FROM info_requests ir
    INNER JOIN public_bodies pb ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response ON ir.id = response.info_request_id
    INNER JOIN info_request_events status ON ir.id = status.info_request_id
  WHERE response.event_type = 'response'
    AND status.event_type = 'status_update'
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, status.described_state
  ORDER BY ir.id DESC) TO '/tmp/asktheeu-20161206.csv' WITH CSV HEADER DELIMITER ',';

(you may have to squash this onto one line to make it work because \copy uses line break terminators)

@lizconlan
Copy link

lizconlan commented Dec 6, 2016

Hopefully final version:

SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       to_char(ir.created_at, 'YYYY-MM-DD HH24:MI:SS') AS created_at,
       to_char(MIN(response.created_at), 'YYYY-MM-DD HH24:MI:SS') AS first_response_at,
       to_char(MAX(response.created_at), 'YYYY-MM-DD HH24:MI:SS') AS last_response_at,
       to_char(MAX(status.created_at), 'YYYY-MM-DD HH24:MI:SS') AS last_status_update_at,
       ir.described_state
  FROM info_requests ir
    INNER JOIN public_bodies pb
      ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response
      ON ir.id = response.info_request_id AND response.event_type = 'response'
    INNER JOIN info_request_events status 
      ON ir.id = status.info_request_id AND status.event_type = 'status_update'
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, ir.described_state
  ORDER BY ir.id DESC;

And to export to csv:

\copy (SELECT ir.id,
       ir.title,
       pb.name AS public_body_name,
       to_char(ir.created_at, 'YYYY-MM-DD HH24:MI:SS') AS created_at,
       to_char(MIN(response.created_at), 'YYYY-MM-DD HH24:MI:SS') AS first_response_at,
       to_char(MAX(response.created_at), 'YYYY-MM-DD HH24:MI:SS') AS last_response_at,
       to_char(MAX(status.created_at), 'YYYY-MM-DD HH24:MI:SS') AS last_status_update_at,
       ir.described_state
  FROM info_requests ir
    INNER JOIN public_bodies pb
      ON ir.public_body_id = pb.id
    INNER JOIN info_request_events response
      ON ir.id = response.info_request_id AND response.event_type = 'response'
    INNER JOIN info_request_events status 
      ON ir.id = status.info_request_id AND status.event_type = 'status_update'
  GROUP BY ir.id, ir.title, pb.name, ir.created_at, ir.described_state
  ORDER BY ir.id DESC) TO '/tmp/asktheeu-20161206.csv' WITH CSV HEADER DELIMITER ','

@garethrees
Copy link
Member Author

Looks fine to me.

@lizconlan
Copy link

file emailed so this can probably be closed, but leaving it around for a few days in case of feedback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants