# Overview

An issue of personal interest to me is gun violence in the United States. The U.S. ranked 31st worldwide in gun violence, with 3.85 gun deaths per 100,000 people in 2016 (per the Institute for Health Metrics and Evaluation). This rate far exceeds that of other developed countries.

This is a visualization exercise exploring gun violence data published by FiveThirtyEight at https://github.com/fivethirtyeight/guns-data, which covers American gun deaths between 2012 and 2014.

## Preprocessing

To begin, we load the CSV file into a pandas DataFrame and look at the first 20 entries.

In [1]:
import pandas as pd

df = pd.read_csv("full_data.csv")
df.head(20)

Unnamed: 0.1,Unnamed: 0,year,month,intent,police,sex,age,race,hispanic,place,education
0,1,2012,1,Suicide,0,M,34.0,Asian/Pacific Islander,100,Home,BA+
1,2,2012,1,Suicide,0,F,21.0,White,100,Street,Some college
2,3,2012,1,Suicide,0,M,60.0,White,100,Other specified,BA+
3,4,2012,2,Suicide,0,M,64.0,White,100,Home,BA+
4,5,2012,2,Suicide,0,M,31.0,White,100,Other specified,HS/GED
5,6,2012,2,Suicide,0,M,17.0,Native American/Native Alaskan,100,Home,Less than HS
6,7,2012,2,Undetermined,0,M,48.0,White,100,Home,HS/GED
7,8,2012,3,Suicide,0,M,41.0,Native American/Native Alaskan,100,Home,HS/GED
8,9,2012,2,Accidental,0,M,50.0,White,100,Other specified,Some college
9,10,2012,2,Suicide,0,M,,Black,998,Home,


From this, we get an idea of what the data set contains. Most columns are self-explanatory, but what about the hispanic column? Why is a numeric value associated with an ethnicity? Let's take a closer look by examining the unique values in the column.

In [2]:
df["hispanic"].unique()

array([100, 998, 281, 211, 261, 210, 222, 282, 260, 270, 231, 237, 200,
       223, 226, 275, 250, 234, 280, 227, 224, 286, 233, 271, 220, 225,
       235, 242, 212, 221, 239, 299, 232, 291, 217, 252, 209, 238, 218])

Since the codes alone don't tell me what they mean, let's drop the column for now, along with the redundant Unnamed: 0 column.

In [3]:
df = df.drop(["Unnamed: 0", "hispanic"], axis = 1)

In [4]:
df.head(20)

Unnamed: 0,year,month,intent,police,sex,age,race,place,education
0,2012,1,Suicide,0,M,34.0,Asian/Pacific Islander,Home,BA+
1,2012,1,Suicide,0,F,21.0,White,Street,Some college
2,2012,1,Suicide,0,M,60.0,White,Other specified,BA+
3,2012,2,Suicide,0,M,64.0,White,Home,BA+
4,2012,2,Suicide,0,M,31.0,White,Other specified,HS/GED
5,2012,2,Suicide,0,M,17.0,Native American/Native Alaskan,Home,Less than HS
6,2012,2,Undetermined,0,M,48.0,White,Home,HS/GED
7,2012,3,Suicide,0,M,41.0,Native American/Native Alaskan,Home,HS/GED
8,2012,2,Accidental,0,M,50.0,White,Other specified,Some college
9,2012,2,Suicide,0,M,,Black,Home,


Let's summarize the remaining columns of interest.

In [5]:
col_info = {}

# Count how often each value appears in every column and keep the result for plotting.
for col in df.columns:
    print(col + ":")
    info = df[col].value_counts()
    col_info[col] = info
    print(info)
    print("\n")

year:
2013    33636
2014    33599
2012    33563
Name: year, dtype: int64


month:
7     8989
8     8783
6     8677
5     8669
9     8508
4     8455
12    8413
10    8406
3     8289
1     8273
11    8243
2     7093
Name: month, dtype: int64


intent:
Suicide         63175
Homicide        35176
Accidental       1639
Undetermined      807
Name: intent, dtype: int64


police:
0    99396
1     1402
Name: police, dtype: int64


sex:
M    86349
F    14449
Name: sex, dtype: int64


age:
22.0     2712
21.0     2504
23.0     2472
24.0     2437
26.0     2231
25.0     2230
20.0     2219
27.0     2070
19.0     2065
28.0     1986
29.0     1955
30.0     1869
31.0     1833
32.0     1824
51.0     1755
18.0     1753
52.0     1715
53.0     1708
33.0     1700
34.0     1699
54.0     1684
50.0     1674
49.0     1669
35.0     1631
56.0     1625
48.0     1621
55.0     1596
47.0     1532
43.0     1527
36.0     1512
         ... 
87.0      312
89.0      245
13.0      229
90.0      208
91.0      176
92.0      12

From this, we get a general summary of how values are distributed in each column. These counts can be converted into bar graphs fairly easily.

In [6]:
from bokeh.plotting import figure
from bokeh.palettes import Spectral10
from bokeh.io import output_notebook, push_notebook, show
import math
output_notebook()

def plot_intent():
    x = col_info["intent"].index
    y = col_info["intent"].values
    return x, y, "Gun Deaths by Intent (2012-2014)"
    
def plot_race():
    x = col_info["race"].index
    y = col_info["race"].values
    return x, y, "Gun Deaths by Race (2012-2014)"
    
def plot_education():
    x = col_info["education"].index
    y = col_info["education"].values
    return x,y, "Gun Deaths by Education (2012-2014)"

def plot_place():
    x = col_info["place"].index
    y = col_info["place"].values
    return x, y, "Gun Deaths by Place (2012-2014)"
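
The four helpers above differ only in the column name and the title, so they could be folded into a single function. The following is a sketch, not the notebook's own code: plot_data is a hypothetical helper, and example_info holds made-up counts standing in for the col_info dictionary built earlier.

```python
import pandas as pd

def plot_data(col_info, column):
    """Return categories, counts, and a title for one summarized column."""
    counts = col_info[column]
    title = "Gun Deaths by {} (2012-2014)".format(column.capitalize())
    return list(counts.index), list(counts.values), title

# Made-up counts standing in for the col_info dictionary built earlier.
example_info = {"sex": pd.Series({"M": 3, "F": 1})}
x, y, title = plot_data(example_info, "sex")
print(x, y, title)
```

With a helper like this, a callback that receives a label such as "Intent" could call plot_data(col_info, item.lower()) instead of branching on each label.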


In [14]:
import ipywidgets as widgets
from ipywidgets import interact
from bokeh.io import push_notebook, reset_output

def update(item):
    x = None
    y = None
    t = None
    if item == "Intent":
        x,y,t = plot_intent()
    elif item == "Race":
        x,y,t = plot_race()
    elif item == "Education":
        x,y,t = plot_education()
    elif item == "Place":
        x,y,t = plot_place()
    
    f = figure(title = t, x_range = list(x), plot_width = 800, plot_height = 800)
    f.xaxis.major_label_orientation = math.pi/3
    f.vbar(x, 0.4, y, color = Spectral10[:len(x)])
    show(f)
    

interact(update, item = ["Intent", "Race", "Education", "Place"])

A Jupyter Widget

<function __main__.update>

Click the drop-down box to change graphs. What might be surprising here, if you don't listen to the FiveThirtyEight podcast, is that a majority of gun deaths (about 63% in this data set) are due to suicide rather than homicide. The data on where gun deaths occur tells a similar story: an overwhelming majority occur at home, which is consistent with the high rate of suicide.
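
To put a number on that majority, the intent counts printed earlier can be turned into shares. This is a small sketch that re-enters those four counts by hand; the names intent_counts and intent_share are only for illustration. On the real data, df["intent"].value_counts(normalize=True) should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output for the intent column above.
intent_counts = pd.Series(
    {"Suicide": 63175, "Homicide": 35176, "Accidental": 1639, "Undetermined": 807}
)

# Divide by the total to get each intent's share of recorded gun deaths.
intent_share = intent_counts / intent_counts.sum()
print(intent_share.round(3))
```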

# PCA

This is a good overview of the data, but it lacks some context between data points. We can visualize the data set using PCA, decomposing the features into a two-dimensional space that can be graphed, so that we can observe details such as sparsity or clustering.

In [15]:
import numpy as np
columns = df.columns

In [16]:
def numerize():
    for col in columns:
        unique = df[col].unique()
        mask = dict(zip(unique, np.arange(len(unique))))
        new_col = col + "_masked"
        df[new_col] = [mask[value] for value in df[col]]

df["age"] = df["age"].fillna(0)
numerize()
df.head()

Unnamed: 0,year,month,intent,police,sex,age,race,place,education,year_masked,...,month_masked_masked,intent_masked_masked,police_masked_masked,sex_masked_masked,age_masked_masked,race_masked_masked,place_masked_masked,education_masked_masked,pca_x_masked,pca_y_masked
0,2012,1,Suicide,0,M,34.0,Asian/Pacific Islander,Home,BA+,0,...,0,0,0,0,0,0,0,0,0,0
1,2012,1,Suicide,0,F,21.0,White,Street,Some college,0,...,0,0,0,1,1,1,1,1,1,1
2,2012,1,Suicide,0,M,60.0,White,Other specified,BA+,0,...,0,0,0,0,2,1,2,0,2,2
3,2012,2,Suicide,0,M,64.0,White,Home,BA+,0,...,1,0,0,0,3,1,0,0,3,3
4,2012,2,Suicide,0,M,31.0,White,Other specified,HS/GED,0,...,1,0,0,0,4,1,2,2,4,4
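
The numerize() loop above builds each mapping by hand. Because it appends a _masked column for every column it is given, running it again once those columns exist yields the _masked_masked columns visible in the output. As a possible alternative, pandas.factorize produces the same kind of integer codes in one call. The sketch below uses a made-up three-row frame (toy) and a hypothetical helper, encode_columns. Unlike the dictionary approach, factorize codes missing values as -1.

```python
import pandas as pd

def encode_columns(frame, columns):
    """Return a new frame of integer codes, one column per input column."""
    encoded = pd.DataFrame(index=frame.index)
    for col in columns:
        # factorize assigns codes in order of first appearance; NaN becomes -1.
        codes, _ = pd.factorize(frame[col])
        encoded[col + "_masked"] = codes
    return encoded

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    "intent": ["Suicide", "Homicide", "Suicide"],
    "education": ["BA+", None, "HS/GED"],
})
encoded = encode_columns(toy, ["intent", "education"])
print(encoded)
```

Because the codes live in a separate frame, re-running the cell does not add further columns to the original data.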


In [26]:
sample = df.sample(frac=0.01)

In [27]:
from sklearn.manifold import TSNE

tsne = TSNE().fit_transform(sample.iloc[:, 9:])
x = tsne[:, 0]
y = tsne[:, 1]

KeyboardInterrupt: 

In [None]:
sample["tsne_x"] = x
sample["tsne_y"] = y

t = figure(title = "t-SNE of Gun Deaths", plot_width = 800, plot_height = 800)
t.scatter("tsne_x", "tsne_y", source = sample)
show(t)

In [11]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 2).fit(df.iloc[:, 9:].values)
vals = pca.transform(df.iloc[:, 9:].values)
print(pca.components_)

[[  5.74133413e-04   1.94247122e-04  -8.89608391e-03  -1.49126691e-04
   -2.73302349e-04   9.99925674e-01  -6.36097299e-03  -5.32408402e-03
    4.83197071e-04]
 [ -1.62666299e-03  -9.99842439e-01  -1.42186034e-02   6.45857795e-05
    3.10540812e-04   5.19590125e-05  -8.13358427e-03   6.40497943e-03
   -1.73212992e-03]]
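
In the components printed above, each row is dominated by a single entry close to ±1 (the sixth in the first component, the second in the second). PCA tends to produce this pattern when one unscaled column has a much wider numeric range than the rest. One common remedy is to standardize the columns first. The following is a sketch on synthetic data rather than the gun deaths table; the array X and its column ranges are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in: one wide-range column (like an age code) and two narrow ones.
X = np.column_stack([
    rng.randint(0, 100, size=500),
    rng.randint(0, 12, size=500),
    rng.randint(0, 2, size=500),
]).astype(float)

# Without scaling, the first component loads almost entirely on the wide column.
raw = PCA(n_components=2).fit(X)
print(np.abs(raw.components_[0]).round(3))

# After standardizing, every column has unit variance before PCA is fitted.
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(np.abs(scaled.components_[0]).round(3))
print(scaled.explained_variance_ratio_.round(3))
```

Whether scaling is appropriate for arbitrary integer codes like the ones used here is a separate question; the sketch only shows why the unscaled components end up aligned with individual columns.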


In [12]:
pca_x = vals[:, 0]
pca_y = vals[:, 1]

In [13]:
df["pca_x"] = pca_x
df["pca_y"] = pca_y

In [20]:
p = figure(title = "PCA of Gun Deaths", plot_width = 800, plot_height = 800)
p.scatter("pca_x", "pca_y", source = df.sample(frac = .1))
show(p)