<a href="https://www.kaggle.com/code/mikedelong/race-and-education-matter?scriptVersionId=152166117" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='/kaggle/input/gun-deaths-in-america-cdc/gun_deaths.csv')
df.head()

Unnamed: 0,year,month,intent,police,sex,age,race,place,education
0,2012,1,Suicide,0,M,34.0,Asian/Pacific Islander,Home,BA+
1,2012,1,Suicide,0,F,21.0,White,Street,Some college
2,2012,1,Suicide,0,M,60.0,White,Other specified,BA+
3,2012,2,Suicide,0,M,64.0,White,Home,BA+
4,2012,2,Suicide,0,M,31.0,White,Other specified,HS/GED


This is a per-instance dataset (one row per death), so making aggregations should be simple.
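
As a minimal sketch of why aggregation is easy here: with one row per event, counting rows per group is the whole job. The `toy` frame below is a made-up stand-in that only borrows the column names `intent` and `sex` from the real file; its values are not real data.

```python
import pandas as pd

# Toy stand-in for the real file: one row per death, values invented for illustration.
toy = pd.DataFrame({
    'intent': ['Suicide', 'Suicide', 'Homicide', 'Suicide', 'Homicide', 'Accidental'],
    'sex': ['M', 'F', 'M', 'M', 'M', 'F'],
})

# Every row is a single event, so the size of each group is the aggregate we want.
by_intent = toy.groupby(by='intent').size()

# The same idea with two keys; name='count' labels the resulting column.
by_intent_and_sex = toy.groupby(by=['intent', 'sex']).size().reset_index(name='count')

print(by_intent)
print(by_intent_and_sex)
```

The cells further down use the same `groupby(...).size().reset_index()` pattern, only without naming the count column, which is why they refer to it as `0`.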

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100798 entries, 0 to 100797
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   year       100798 non-null  int64  
 1   month      100798 non-null  int64  
 2   intent     100797 non-null  object 
 3   police     100798 non-null  int64  
 4   sex        100798 non-null  object 
 5   age        100780 non-null  float64
 6   race       100798 non-null  object 
 7   place      99414 non-null   object 
 8   education  99376 non-null   object 
dtypes: float64(1), int64(3), object(5)
memory usage: 6.9+ MB


In [3]:
from plotly.express import histogram
histogram(data_frame=df.sort_values(by='intent'), x='year', color='intent', nbins=3)

We have only three years of data. They are in the fairly distant past in the sense that they are pre-COVID but in the middle of the original opioid crisis, and the totals change very little from year to year. If we raise the number of bins from 3 to 5 the years separate visually, but there is still almost no visible change from year to year.
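
Because `year` is a discrete value, one way to check the "very little change" claim without depending on histogram bin edges is to count rows per year and look at the percentage change. This is only a sketch: the years and totals in `toy` are hypothetical, not the figures from `gun_deaths.csv`.

```python
import pandas as pd

# Hypothetical per-year totals (100, 101, 99 rows), NOT the real counts.
toy = pd.DataFrame({'year': [2012] * 100 + [2013] * 101 + [2014] * 99})

# Count rows per year directly; sort_index puts the years in chronological order.
per_year = toy['year'].value_counts().sort_index()

# Percentage change relative to the previous year; the first year has no predecessor.
change_pct = per_year.pct_change().mul(100).round(1)

print(per_year)
print(change_pct)
```

On the real frame the same two lines would be applied to `df['year']`.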

In [4]:
histogram(data_frame=df.sort_values(by='intent'), x='year', color='intent', nbins=5)

This breakdown by intent fits our prior: homicides, particularly mass shootings, tend to make the papers, but suicides outnumber them about two to one, and suicides don't make the papers, partly because they're not generally a threat to public safety.

In [5]:
from plotly.express import bar
bar(data_frame=df[['intent', 'education']].groupby(by=['intent', 'education']).size().reset_index(), x='intent', color='education', y=0)

Clearly the homicide/suicide split varies substantially with education. Maybe we can make this chart clearer by dropping the other two intents (Accidental and Undetermined).

In [6]:
bar(data_frame=df[~df['intent'].isin({'Accidental', 'Undetermined'})][['intent', 'education']].groupby(by=['intent', 'education']).size().reset_index(), y='intent', color='education', x=0)

The impact of education on the homicide/suicide split is stark. College graduates are rarely victims of homicide by gunshot: among their gun deaths, suicides outnumber homicides about 8 to 1. Among people who haven't finished high school, homicides outnumber suicides about 11 to 9.
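
Ratios like these can be read off a cross-tabulation rather than estimated from bar lengths. The sketch below uses invented rows, sized so the arithmetic mirrors the 8:1 and 11:9 figures quoted above; it is not computed from the real data, and `suicide_per_homicide` is a column name made up for this example.

```python
import pandas as pd

# Made-up rows chosen only to mimic the shape of the pattern described above.
toy = pd.DataFrame({
    'education': ['BA+'] * 9 + ['Less than HS'] * 20,
    'intent': ['Suicide'] * 8 + ['Homicide'] * 1 + ['Homicide'] * 11 + ['Suicide'] * 9,
})

# Rows are education levels, columns are intents, cells are death counts.
table = pd.crosstab(index=toy['education'], columns=toy['intent'])

# Suicides per homicide within each education level.
table['suicide_per_homicide'] = table['Suicide'] / table['Homicide']

print(table)
```

A value above 1 means suicides dominate for that education level; a value below 1 means homicides do.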

In [7]:
bar(data_frame=df[['intent', 'sex']].groupby(by=['intent', 'sex']).size().reset_index(), x='intent', color='sex', y=0)

Men and women are different: their gun deaths do not split across the intents in the same way.
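
To compare the sexes on the mix of intents rather than on raw counts, one option is a row-normalized cross-tabulation. This is a sketch on made-up rows, so the shares it prints say nothing about the real distribution.

```python
import pandas as pd

# Made-up rows for illustration; not the real distribution.
toy = pd.DataFrame({
    'sex': ['M'] * 8 + ['F'] * 4,
    'intent': ['Suicide'] * 4 + ['Homicide'] * 4 + ['Suicide'] * 3 + ['Homicide'] * 1,
})

# normalize='index' turns each row (one sex) into shares that sum to 1.
shares = pd.crosstab(index=toy['sex'], columns=toy['intent'], normalize='index')

print(shares)
```

Normalizing by row removes the effect of one group simply being larger than the other.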

In [8]:
# Build the homicide/suicide counts once, then view the same table two ways.
counts = df[~df['intent'].isin({'Accidental', 'Undetermined'})][['intent', 'education', 'sex']].groupby(by=['intent', 'education', 'sex']).size().reset_index()
# Facet by sex, color by education.
bar(data_frame=counts, y='intent', x=0, color='education', facet_col='sex').show()
# Facet by education, color by sex.
bar(data_frame=counts, y='intent', x=0, facet_col='education', color='sex').show()

This last chart is probably the nut graf; college graduates are in very little danger of dying from homicide involving guns; they're in much greater danger of suicide by gun. Roughly as many college graduates die by suicide involving guns as non-high-school-graduates die by homicide involving guns.

In [9]:
histogram(data_frame=df[df['education'] == 'Less than HS'].sort_values(by='sex'), x='age', color='sex')

In [10]:
histogram(data_frame=df[(df['education'] == 'Less than HS') & (df['intent'].isin({'Homicide', 'Suicide'}))].sort_values(by='sex'), x='age', color='intent')

There are really three groups of people without a high school education who die from gun violence:
* Minors
* Young adults through middle age
* Older adults

The minors tend to die from homicide before they can finish high school. The older adults are more likely to die from suicide. The middle cohort are the ones we tend to think of as "working class," and their pattern of life and death is the unusual one.

In [11]:
histogram(data_frame=df[(df['education'] == 'Less than HS') & (df['intent'].isin({'Homicide', 'Suicide'}))][['age', 'intent']].value_counts().to_frame().reset_index(), nbins=20,
         x='age', color='intent', y='count')

This comes pretty close to showing us what we want to see. On the left we have two bins of people who are mostly too young to have graduated from high school, and on the right we have seven bins where suicides outnumber homicides. The two bins in between are the ones that differ from the rest of the dataset: these are adults who did not finish high school and who are more likely to die by homicide than by suicide.
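
Reading "which bins have more homicides than suicides" off a chart is error-prone, so here is a sketch that makes the comparison explicit with `pd.cut` and a cross-tabulation. The ages, intents, bin edges, and group labels are all assumptions made for this example; plotly chooses its own edges for `nbins=20`.

```python
import pandas as pd

# Invented ages and intents; a real run would start from the filtered df.
toy = pd.DataFrame({
    'age': [15, 16, 17, 22, 25, 28, 31, 35, 55, 60, 65, 70, 72, 80],
    'intent': ['Homicide', 'Homicide', 'Suicide',
               'Homicide', 'Homicide', 'Suicide', 'Homicide', 'Homicide',
               'Suicide', 'Suicide', 'Homicide', 'Suicide', 'Suicide', 'Suicide'],
})

# Explicit, hypothetical bin edges; right=False makes each bin [low, high).
edges = [0, 18, 40, 100]
labels = ['under 18', '18-39', '40+']
toy['age_group'] = pd.cut(toy['age'], bins=edges, labels=labels, right=False)

# Count deaths per age group and intent, then flag groups where homicide leads.
counts = pd.crosstab(index=toy['age_group'], columns=toy['intent'])
counts['homicide_majority'] = counts['Homicide'] > counts['Suicide']

print(counts)
```

With real data, the flagged rows would identify the age ranges that behave differently from the rest of the dataset.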

In [12]:
bar(data_frame=df[~df['intent'].isin({'Accidental', 'Undetermined'})][['intent', 'race']].groupby(by=['intent', 'race']).size().reset_index(), y='intent', color='race', x=0)

Remember that white people were roughly 63% of the US population during the period of interest, while Black and Hispanic people were roughly 13% and 17% respectively.

In [13]:
from plotly.express import treemap
t_df = df[(df['race'].isin({'Black', 'Hispanic', 'White'})) & (df['intent'].isin({'Homicide', 'Suicide'}))][['intent', 'race', 'education']].groupby(by=['intent', 'race', 'education']).size().reset_index()
treemap(data_frame=t_df, path=['intent', 'race', 'education'], values=0, color='race')

If we just focus on three racial groups and two intents we can see the difference pretty clearly; these numbers would probably be even more stark on a per-hundred-thousand basis.
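
A per-hundred-thousand rate is just deaths divided by population, scaled by 100,000. The sketch below shows the arithmetic on hypothetical groups; the counts and population sizes are invented for easy checking, and real rates would need census population estimates for the same years.

```python
import pandas as pd

# Hypothetical groups, death counts, and population sizes (not real figures).
toy = pd.DataFrame({
    'group': ['A', 'B', 'C'],
    'deaths': [600, 300, 100],
    'population': [6_000_000, 1_000_000, 2_000_000],
})

# Deaths per 100,000 people puts groups of different sizes on the same scale.
toy['per_100k'] = toy['deaths'] / toy['population'] * 100_000

print(toy.sort_values(by='per_100k', ascending=False))
```

In this toy example group A has the most deaths but group B has the highest rate, which is the kind of reordering a per-capita view can produce.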