Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display CSV table(s) as search-able HTML on GitHub Pages #8

Closed
rajivsinclair opened this issue Jun 5, 2016 · 18 comments
Closed

Display CSV table(s) as search-able HTML on GitHub Pages #8

rajivsinclair opened this issue Jun 5, 2016 · 18 comments

Comments

@rajivsinclair
Copy link
Member

rajivsinclair commented Jun 5, 2016

@alexsoble
Copy link
Contributor

This is a fantastic idea, will make it easier for more people to dive into the data.

@jayqi
Copy link
Collaborator

jayqi commented Jun 5, 2016

I tried implementing this for shootings-append.csv using the template by @derekeder. Unfortunately, it seems like the dataset is too big for at least my browser to display all at once in this way.

Here's a version with only the first 9 entries from the data file. It works fine.

Here's a version that tries to load the full thing. My browser isn't able to load it (Chrome 50.0.2661.1 on OS X 10.11.3 on a mid 2015 rMBP).

@rajivsinclair
Copy link
Member Author

rajivsinclair commented Jun 8, 2016

Any simple alternatives? or should we not bother?

Anyone who wants to can import the CSV into Excel or Google Sheets or your preferred table-searching tool.

I’m just going to close this issue now, but feel free to open it up if you want to try another approach.

@derekeder
Copy link

Hey all, yah this file is 52k rows long (and 13mb), which is way too much to render on one page with my javascript template.

A google sheet would probably be the easiest approach.

@alexsoble
Copy link
Contributor

I think a Google Sheet is a fantastic idea. Plenty of people don't know what a .csv is, get intimidated when it loads up thousands and thousands of rows, etc. Making an interface into the data that user friendly for non-technical folks seems like a good idea to me.

@alexsoble
Copy link
Contributor

Re-opening!

Instead of a searchable table for all of shootings-append.csv, we can do a table for https://github.com/DGalt/shootings-data/blob/dev/summary_ipra.csv (only the June Dump).

This is much smaller and more manageable and would be useful. This would be a mirror of the IPRA portal's table, but with many more columns created by linking with the May + April data. Which I believe is what @rajivsinclair may have meant in the first place?

Reference: #6 (comment)

@alexsoble alexsoble reopened this Jun 14, 2016
@alexsoble
Copy link
Contributor

^^ @DGalt

@DGalt
Copy link
Collaborator

DGalt commented Jun 14, 2016

@alexsoble will likely start working on this at hacknight tonight. Just a heads up, though, there's probably still a little cleaning that needs to go on with that csv. I have not looked at the cleaned February data yet, so I don't know what, if anything, should be pulled out of that and added to this.

@alexsoble
Copy link
Contributor

Sounds good!
On Tue, Jun 14, 2016 at 2:36 PM DGalt notifications@github.com wrote:

@alexsoble https://github.com/alexsoble will likely start working on
this at hacknight tonight. Just a heads up, though, there's probably still
a little cleaning that needs to go on with that csv. I have not looked at
the cleaned February data yet, so I don't know what, if anything, should be
pulled out of that and added to this.


You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
#8 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ADD5HecIMqlcLq-LVeiygAx2-yXjx2_Oks5qLwK8gaJpZM4IuYVV
.

@alexsoble
Copy link
Contributor

@DGalt I plugged your work-in-progress CSV into the template: https://github.com/invinst/ipra-table. It works fine. Takes a couple of seconds to load, and there's a lot of [nan] data, but it's a very serviceable first draft. Made it clear in the commit + repo that this a work in progress and not a final product. There's no public internet URL yet. We can look at this together at HackNight tonight if you want.

@DGalt
Copy link
Collaborator

DGalt commented Jun 15, 2016

After our discussion last night I think we need to decide whether we want to continue merging the april and may dumps, or whether we want to keep them separate. Looking at the list of 102 CRIDs in the June ipra dataset, 39 of them do not exist in the May data set, 8 of them do not exist in the April data set, and 4 of them do not exist in the data set composed of the combination of May and April. If we're just trying to provide as much information as possible then we should continue merging the two together, but as Chacyln explained last night these are data sets covering two different things.

As for your other question @alexsoble about percentage of nans in the columns, see below. This is using the combined April and May data set, merged with the files generated by @banoonoo2. The numbers displayed are percentage of rows in the corresponding column that are composed only of nan:

ipra_closed_date                100.000000
ipra_investigate_begin_date     100.000000
recommended_number_of_days       99.019608
discipline_code                  99.019608
penalty_status                   99.019608
finding_id                       96.078431
accused_detail                   94.117647
finding_code                     92.156863
accused_star                     74.509804
accused_appointment_date         72.549020
accused_officer_fname            72.549020
accused_officer_lname            72.549020
accused_position                 72.549020
accused_unit                     72.549020
involved_officer_detail          68.627451
involved_officer_description     60.784314
ipra_investigate_end_date        53.921569
involved_officer_type            38.235294
fileyr                           38.235294
iad_ops                          38.235294
cr_required                      38.235294
converted_rec_i                  38.235294
complaint_date                   38.235294
ipra_investigator_type           38.235294
ipra_assign_date_time            38.235294
police_shooting                  38.235294
police_shooting_no               38.235294
report_status                    38.235294
source                           38.235294
involved_officer_unit            14.705882
FILE_PROBLEMS                    11.764706
IPRA_URL                         11.764706
IPRA_PORTAL_LABEL                11.764706
MEDIA_FILE_TYPE                  11.764706
PUBLIC_URL                       11.764706
PUBLIC_FILE_NAME                 11.764706
ipra_investigator_lastname        8.823529
ipra_investigator_firstname       8.823529
involved_officer_age              8.823529
notification_date                 7.843137
initial_category_code             4.901961
initial_category                  4.901961
current_status                    3.921569
location_code                     3.921569
street                            3.921569
address                           3.921569
beat                              3.921569
current_category                  3.921569
current_category_code             3.921569
incident_time_start               3.921569
involved_officer_sex              3.921569
involved_officer_race             3.921569
involved_officer_position         3.921569
incident_time_end                 3.921569
involved_officer_lname            3.921569
involved_officer_fname            3.921569
SUBJECT_2                         0.000000
NOTIFICATION_DATE                 0.000000
DOCUMENT_COUNT                    0.000000
AUDIO_COUNT                       0.000000
VIDEO_COUNT                       0.000000
SUBJECT_3                         0.000000
JUVENILE_WITHHELD                 0.000000
HIPAA_WITHHELD                    0.000000
INCIDENT_TIME_START               0.000000
SUBJECT_1                         0.000000
DOCS_WITHHELD_NOTES               0.000000
INCIDENT_TYPE                     0.000000
SUBJECT_4                         0.000000

Edit: sorted above by percentage

@alexsoble
Copy link
Contributor

Thanks @DGalt!

@alexsoble
Copy link
Contributor

@DGalt A bunch of the rows have nan in ways I wouldn't necessarily expect, for example [nan 'RONALD'] in row 101 accused_officer_fname. Would you mind bringing me up to date when you have a minute on the best way to interpret these kinds of values? Is this a Python byproduct?

@DGalt
Copy link
Collaborator

DGalt commented Jun 15, 2016

Right, so, in the full data set there are likely several rows for any one CRID. The way I've organized this is that each CRID has a row, and each column contains a list of the unique values that were in all of the rows for that CRID (so basically the multiple rows are collapsed into a list for any one column). So in the example you give, that row corresponds to complaint_number 1080018.0. You're looking at accused_officer_fname, so in the uncollapsed data set there must have been two rows corresponding to complaint_number 1080018.0 that had in the accused_officer_fname column values of nan and RONALD - meaning that, for whatever reason, one of the entries in the original data set did not have an accused_officer_fname filled in. When I collapse everything for that CRID together you then get that list of [nan, 'RONALD'].

You'll see nan values in a lot of the lists within the various rows/columns - this is again simply because there was a missing value for that particular CRID in that particular column.

We should discuss, though, what to do with those nan values. We can simply drop them to make things look cleaner - I left them in 1. because it was just simpler and 2. because it gives a more accurate representation of how many unique rows are associated with any particular CRID. That may not be something we care about, though - it's relatively trivial to remove them, we just need to decide if that's what we want to do.

@alexsoble
Copy link
Contributor

OK great, thanks for the explanation @DGalt!

@alexsoble
Copy link
Contributor

Still WIP, working on a trimmed slice of the data with 5 columns, 28 rows (rows where we have accused officer names):

invinst/ipra-table@892cdc9

@DGalt
Copy link
Collaborator

DGalt commented Jun 16, 2016

how many columns can reasonably be presented before it becomes illegible?

i also wonder if these some way to truncate certain columns that are excessively long (e.g. the ones with urls)

@alexsoble
Copy link
Contributor

Great question, not sure. More columns, more work for reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants