In this project, we'll work in a Jupyter notebook and analyze data on gun deaths in the US.

The dataset came from FiveThirtyEight, and can be found [here](https://github.com/fivethirtyeight/guns-data). The dataset is stored in the guns.csv file. It contains information on gun deaths in the US from 2012 to 2014. Each row in the dataset represents a single fatality. The columns contain demographic and other information about the victim. 

Key Findings:



In [1]:
import csv

# Read every row of the CSV into a list of lists; the with block closes the file.
with open("guns.csv") as f:
    csvreader = csv.reader(f)
    data = list(csvreader)
data[0:5]

[['',
  'year',
  'month',
  'intent',
  'police',
  'sex',
  'age',
  'race',
  'hispanic',
  'place',
  'education'],
 ['1',
  '2012',
  '01',
  'Suicide',
  '0',
  'M',
  '34',
  'Asian/Pacific Islander',
  '100',
  'Home',
  '4'],
 ['2', '2012', '01', 'Suicide', '0', 'F', '21', 'White', '100', 'Street', '3'],
 ['3',
  '2012',
  '01',
  'Suicide',
  '0',
  'M',
  '60',
  'White',
  '100',
  'Other specified',
  '4'],
 ['4', '2012', '02', 'Suicide', '0', 'M', '64', 'White', '100', 'Home', '4']]

In [2]:
# Separate the header row from the data rows.
headers = data[0]
data = data[1:]
data

[['1',
  '2012',
  '01',
  'Suicide',
  '0',
  'M',
  '34',
  'Asian/Pacific Islander',
  '100',
  'Home',
  '4'],
 ['2', '2012', '01', 'Suicide', '0', 'F', '21', 'White', '100', 'Street', '3'],
 ['3',
  '2012',
  '01',
  'Suicide',
  '0',
  'M',
  '60',
  'White',
  '100',
  'Other specified',
  '4'],
 ['4', '2012', '02', 'Suicide', '0', 'M', '64', 'White', '100', 'Home', '4'],
 ['5',
  '2012',
  '02',
  'Suicide',
  '0',
  'M',
  '31',
  'White',
  '100',
  'Other specified',
  '2'],
 ['6',
  '2012',
  '02',
  'Suicide',
  '0',
  'M',
  '17',
  'Native American/Native Alaskan',
  '100',
  'Home',
  '1'],
 ['7',
  '2012',
  '02',
  'Undetermined',
  '0',
  'M',
  '48',
  'White',
  '100',
  'Home',
  '2'],
 ['8',
  '2012',
  '03',
  'Suicide',
  '0',
  'M',
  '41',
  'Native American/Native Alaskan',
  '100',
  'Home',
  '2'],
 ['9',
  '2012',
  '02',
  'Accidental',
  '0',
  'M',
  '50',
  'White',
  '100',
  'Other specified',
  '3'],
 ['10', '2012', '02', 'Suicide', '0', 'M', 'NA',

In [3]:
data[0:5]

[['1',
  '2012',
  '01',
  'Suicide',
  '0',
  'M',
  '34',
  'Asian/Pacific Islander',
  '100',
  'Home',
  '4'],
 ['2', '2012', '01', 'Suicide', '0', 'F', '21', 'White', '100', 'Street', '3'],
 ['3',
  '2012',
  '01',
  'Suicide',
  '0',
  'M',
  '60',
  'White',
  '100',
  'Other specified',
  '4'],
 ['4', '2012', '02', 'Suicide', '0', 'M', '64', 'White', '100', 'Home', '4'],
 ['5',
  '2012',
  '02',
  'Suicide',
  '0',
  'M',
  '31',
  'White',
  '100',
  'Other specified',
  '2']]

In [4]:
year_counts = {}
for row in data:
    years = row[1]
    if years in year_counts:
        year_counts[years] = year_counts[years] + 1
    else:
        year_counts[years] = 1

print(year_counts)

{'2012': 33563, '2013': 33636, '2014': 33599}


In [5]:
import datetime

dates = [datetime.datetime(year=int(row[1]), month=int(row[2]), day=1) for row in data]
dates[0:5]

[datetime.datetime(2012, 1, 1, 0, 0),
 datetime.datetime(2012, 1, 1, 0, 0),
 datetime.datetime(2012, 1, 1, 0, 0),
 datetime.datetime(2012, 2, 1, 0, 0),
 datetime.datetime(2012, 2, 1, 0, 0)]

In [6]:
date_counts = {}

for row in dates:
    if row in date_counts:
        date_counts[row] += 1
    else:
        date_counts[row] = 1
date_counts

{datetime.datetime(2012, 1, 1, 0, 0): 2758,
 datetime.datetime(2012, 2, 1, 0, 0): 2357,
 datetime.datetime(2012, 3, 1, 0, 0): 2743,
 datetime.datetime(2012, 4, 1, 0, 0): 2795,
 datetime.datetime(2012, 5, 1, 0, 0): 2999,
 datetime.datetime(2012, 6, 1, 0, 0): 2826,
 datetime.datetime(2012, 7, 1, 0, 0): 3026,
 datetime.datetime(2012, 8, 1, 0, 0): 2954,
 datetime.datetime(2012, 9, 1, 0, 0): 2852,
 datetime.datetime(2012, 10, 1, 0, 0): 2733,
 datetime.datetime(2012, 11, 1, 0, 0): 2729,
 datetime.datetime(2012, 12, 1, 0, 0): 2791,
 datetime.datetime(2013, 1, 1, 0, 0): 2864,
 datetime.datetime(2013, 2, 1, 0, 0): 2375,
 datetime.datetime(2013, 3, 1, 0, 0): 2862,
 datetime.datetime(2013, 4, 1, 0, 0): 2798,
 datetime.datetime(2013, 5, 1, 0, 0): 2806,
 datetime.datetime(2013, 6, 1, 0, 0): 2920,
 datetime.datetime(2013, 7, 1, 0, 0): 3079,
 datetime.datetime(2013, 8, 1, 0, 0): 2859,
 datetime.datetime(2013, 9, 1, 0, 0): 2742,
 datetime.datetime(2013, 10, 1, 0, 0): 2808,
 datetime.datetime(2013, 11,

In [7]:
sex_counts = {}

for row in data:
    sex = row[5]
    if sex in sex_counts:
        sex_counts[sex] += 1
    else:
        sex_counts[sex] = 1
        
sex_counts

{'F': 14449, 'M': 86349}

In [8]:
race_counts = {}

for row in data:
    race = row[7]
    if race in race_counts:
        race_counts[race] += 1
    else:
        race_counts[race] = 1

race_counts

{'Asian/Pacific Islander': 1326,
 'Black': 23296,
 'Hispanic': 9022,
 'Native American/Native Alaskan': 917,
 'White': 66237}
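
The counting loops above all follow the same pattern, which the standard library's collections.Counter can express in one line. A minimal sketch, assuming rows shaped like those in data; the helper name count_column and the three sample rows are illustrative and not part of the original notebook:

```python
from collections import Counter

def count_column(rows, index):
    """Count how often each value appears in column `index` of `rows`."""
    return Counter(row[index] for row in rows)

# Three rows copied from the data output above (an illustrative subset only).
sample_rows = [
    ['1', '2012', '01', 'Suicide', '0', 'M', '34', 'Asian/Pacific Islander', '100', 'Home', '4'],
    ['2', '2012', '01', 'Suicide', '0', 'F', '21', 'White', '100', 'Street', '3'],
    ['3', '2012', '01', 'Suicide', '0', 'M', '60', 'White', '100', 'Other specified', '4'],
]

sample_sex_counts = count_column(sample_rows, 5)    # column 5 is sex
sample_race_counts = count_column(sample_rows, 7)   # column 7 is race
print(dict(sample_sex_counts))
print(dict(sample_race_counts))
```

Calling count_column(data, 5) on the full dataset should give the same tallies as the sex_counts loop, since a Counter is a dictionary subclass.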

Gun deaths with a female victim are far fewer than those with a male victim. Female deaths amount to about 17% of the male figure (14449 against 86349).

Deaths of Black Americans number about 35% of deaths of White Americans (23296 against 66237).

Annual gun deaths were remarkably stable across the three years, at between 33563 and 33636 per year. It is difficult to tell whether this is because detection rates have remained broadly the same, or whether gun deaths have remained at a constant rate, or whether one of these variables is counterbalancing the other.

Further things to look into would be:

- whether rates of gun deaths are in proportion to the population
- whether suicide rates are higher among certain populations
- what accounts for the higher figure among white americans
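
The comparisons between groups above can be checked directly from the counts. A short sketch, with the dictionaries re-entered from the outputs of cells 7 and 8 so that it runs on its own; the variable names for the ratios are illustrative:

```python
# Counts copied from the sex_counts and race_counts outputs above.
sex_counts = {'F': 14449, 'M': 86349}
race_counts = {
    'Asian/Pacific Islander': 1326,
    'Black': 23296,
    'Hispanic': 9022,
    'Native American/Native Alaskan': 917,
    'White': 66237,
}

total_deaths = sum(sex_counts.values())
female_to_male = sex_counts['F'] / sex_counts['M']        # relative to male deaths
female_share = sex_counts['F'] / total_deaths             # relative to all deaths
black_to_white = race_counts['Black'] / race_counts['White']

print(f"Female deaths as a fraction of male deaths: {female_to_male:.1%}")
print(f"Female deaths as a share of all deaths: {female_share:.1%}")
print(f"Black deaths as a fraction of White deaths: {black_to_white:.1%}")
```

Note that these are ratios of raw counts. They say nothing yet about the size of each population.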

In [9]:
intent_counts = {}

for row in data:
    intent = row[3]
    race = row[7]
    if intent in intent_counts:
        intent_counts[intent] += 1
    else:
        intent_counts[intent] = 1

print(intent_counts)

{'Suicide': 63175, 'Undetermined': 807, 'Accidental': 1639, 'Homicide': 35176, 'NA': 1}


In [10]:
# Load the census population data the same way as guns.csv.
with open("census.csv", "r") as g:
    csvreader = csv.reader(g)
    census = list(csvreader)
census

[['Id',
  'Year',
  'Id',
  'Sex',
  'Id',
  'Hispanic Origin',
  'Id',
  'Id2',
  'Geography',
  'Total',
  'Race Alone - White',
  'Race Alone - Hispanic',
  'Race Alone - Black or African American',
  'Race Alone - American Indian and Alaska Native',
  'Race Alone - Asian',
  'Race Alone - Native Hawaiian and Other Pacific Islander',
  'Two or More Races'],
 ['cen42010',
  'April 1, 2010 Census',
  'totsex',
  'Both Sexes',
  'tothisp',
  'Total',
  '0100000US',
  '',
  'United States',
  '308745538',
  '197318956',
  '44618105',
  '40250635',
  '3739506',
  '15159516',
  '674625',
  '6984195']]

To get from the raw counts of gun deaths by race to a rate of gun deaths per 100000 people in each race, we need to divide the number of gun deaths in each race by the population of that race. From the census dataset, we know that the number of people in the White racial category is 197318956, so we divide 66237 by 197318956.

This gives us the proportion of people in the White census race category who were killed by a gun in the US from 2012 to 2014. If you do this computation, you'll see that the result is a very small number, 0.0003356849303419181. It's for this reason that it's typical to express crime statistics as the "rate per 100000". This tells you the number of people in a given group out of every 100000 that were killed by guns in the US. To get this, we just multiply by 100000:

rate_per_hundredk = 0.0003356849303419181 * 100000

This gives us 33.57, which we can interpret as "33.57 out of every 100000 people in the White census race category in the US were killed by guns between 2012 and 2014".
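
The arithmetic for the White category can be reproduced in a few lines. This is only a worked check of the figures quoted in the text; the variable names white_deaths and white_population are chosen for illustration:

```python
white_deaths = 66237           # from the race_counts output
white_population = 197318956   # 'Race Alone - White' in the census output

# Proportion of the White census population killed by guns, 2012 to 2014.
proportion = white_deaths / white_population

# The same figure expressed per 100000 people.
rate_per_hundredk = proportion * 100000

print(proportion)
print(round(rate_per_hundredk, 2))
```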

We'll need to calculate these same rates for each racial category. The only stumbling block is that the racial categories are named slightly differently in census and in data. We'll need to manually construct a dictionary that allows us to map between them, and perform the division.

Here's a list of the race name in data, and the corresponding race name in census:

- Asian/Pacific Islander -- Race Alone - Asian plus Race Alone - Native Hawaiian and Other Pacific Islander
- Black -- Race Alone - Black or African American
- Hispanic -- Race Alone - Hispanic
- Native American/Native Alaskan -- Race Alone - American Indian and Alaska Native
- White -- Race Alone - White

We'll need to create a dictionary that has each race name from data as a key, and has the population count for the races from census as the values.
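
One way to build that dictionary without retyping the population figures is to look them up by column name. A sketch under the assumption that census has exactly the header row and single data row shown above; both rows are re-entered here so the example runs on its own, and the names population and mapping_from_census are illustrative:

```python
# Header and data row copied from the census output above.
census_header = ['Id', 'Year', 'Id', 'Sex', 'Id', 'Hispanic Origin', 'Id', 'Id2',
                 'Geography', 'Total', 'Race Alone - White', 'Race Alone - Hispanic',
                 'Race Alone - Black or African American',
                 'Race Alone - American Indian and Alaska Native',
                 'Race Alone - Asian',
                 'Race Alone - Native Hawaiian and Other Pacific Islander',
                 'Two or More Races']
census_row = ['cen42010', 'April 1, 2010 Census', 'totsex', 'Both Sexes', 'tothisp',
              'Total', '0100000US', '', 'United States', '308745538', '197318956',
              '44618105', '40250635', '3739506', '15159516', '674625', '6984195']

# Columns 9 onward hold the numeric population figures; the earlier columns
# repeat the 'Id' header, so they are skipped to avoid key collisions.
population = {name: int(value)
              for name, value in zip(census_header[9:], census_row[9:])}

mapping_from_census = {
    'Asian/Pacific Islander': population['Race Alone - Asian']
        + population['Race Alone - Native Hawaiian and Other Pacific Islander'],
    'Black': population['Race Alone - Black or African American'],
    'Hispanic': population['Race Alone - Hispanic'],
    'Native American/Native Alaskan': population['Race Alone - American Indian and Alaska Native'],
    'White': population['Race Alone - White'],
}
print(mapping_from_census)
```

In the notebook itself, census[0] and census[1] could stand in for the two copied rows.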

In [11]:
mapping = {'Asian/Pacific Islander': 15834141, 'Black': 40250635,'Hispanic': 44618105,
  'Native American/Native Alaskan': 3739506, 'White': 197318956}

race_per_hundredk = {}

for keys, values in race_counts.items():
    race_per_hundredk[keys] = (values/mapping[keys]) * 100000

race_per_hundredk
    

{'Asian/Pacific Islander': 8.374309664161762,
 'Black': 57.8773477735196,
 'Hispanic': 20.220491210910907,
 'Native American/Native Alaskan': 24.521955573811088,
 'White': 33.56849303419181}

### Race and Likelihood of Gun Death

We can filter our results, and restrict them to the Homicide intent. This will tell us what the gun-related murder rate per 100000 people in each racial category is. In order to do this, we'll need to redo our work in generating race_counts, but only count rows where the intent was Homicide.

We can do this by first extracting the intent column, then using the enumerate() function to loop through each index and value in the race column. If the value in the same position in intents is Homicide, we'll count the value in the race column.

Finally, we'll use the mapping dictionary to convert from raw counts to rates.

In [12]:
homicide_race_counts = {}
intents = [row[3] for row in data]
races = [row[7] for row in data]
for i,race in enumerate(races):
    if race not in homicide_race_counts:
        homicide_race_counts[race] = 0
    if intents[i] == "Homicide":
        homicide_race_counts[race] += 1
        
race_per_hundredk = {}
for keys,values in homicide_race_counts.items():
    race_per_hundredk[keys] = values/mapping[keys] * 100000
    
race_per_hundredk

{'Asian/Pacific Islander': 3.530346230970155,
 'Black': 48.471284987180944,
 'Hispanic': 12.627161104219914,
 'Native American/Native Alaskan': 8.717729026240365,
 'White': 4.6356417981453335}

Measured by homicide rate, Black Americans are far more likely to die from a gunshot wound than members of any other racial category. Despite the high overall rate of White gun deaths, the share due to homicide is comparatively slight, suggesting that other factors account for most gun deaths among the White population.

The rate of Hispanic gun deaths from homicide is also large, though not as pronounced. So what can account for the fact that the overall rate of White gun deaths per 100000 is quite high?
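
One way to quantify this observation is the share of each category's overall gun death rate that comes from homicide. Because both rates use the same population as the denominator, the ratio of rates equals the ratio of counts. A sketch with the rates re-entered from the outputs of cells 11 and 12; the names overall_rate, homicide_rate and homicide_share are illustrative:

```python
# Overall gun deaths per 100000, copied from the output of cell 11.
overall_rate = {
    'Asian/Pacific Islander': 8.374309664161762,
    'Black': 57.8773477735196,
    'Hispanic': 20.220491210910907,
    'Native American/Native Alaskan': 24.521955573811088,
    'White': 33.56849303419181,
}
# Homicide gun deaths per 100000, copied from the output of cell 12.
homicide_rate = {
    'Asian/Pacific Islander': 3.530346230970155,
    'Black': 48.471284987180944,
    'Hispanic': 12.627161104219914,
    'Native American/Native Alaskan': 8.717729026240365,
    'White': 4.6356417981453335,
}

# Fraction of each category's gun deaths that were homicides.
homicide_share = {race: homicide_rate[race] / overall_rate[race]
                  for race in overall_rate}

# Print the categories from highest to lowest homicide share.
for race, share in sorted(homicide_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{race}: {share:.1%}")
```

On these figures, homicide accounts for roughly 84% of gun deaths in the Black category and roughly 14% in the White category.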

What about Suicides? Let's look at the data in a similar way to that which we did for homicides.

In [13]:
intents = [row[3] for row in data]
suicide_race_counts = {}
for i,race in enumerate(races):
    if race not in suicide_race_counts:
        suicide_race_counts[race] = 0
    if intents[i] == "Suicide":
        suicide_race_counts[race] += 1

suicide_per_hundredk = {}
for k,v in suicide_race_counts.items():
    suicide_per_hundredk[k] = (v / mapping[k]) * 100000

suicide_per_hundredk

{'Asian/Pacific Islander': 4.705023152187416,
 'Black': 8.278130270491385,
 'Hispanic': 7.106980451097149,
 'Native American/Native Alaskan': 14.841532544673013,
 'White': 28.06217969245692}

### Undetermined Deaths

What about Undetermined deaths? They are relatively small in number. Hispanic gun deaths of Undetermined intent are fewer per hundred thousand than those in the Black, Native American/Native Alaskan, and White categories, and only the Asian/Pacific Islander category is lower. It would be interesting to find out why that is.

In [14]:
intents = [row[3] for row in data]
undetermined_race_counts = {}
for i,race in enumerate(races):
    if race not in undetermined_race_counts:
        undetermined_race_counts[race] = 0
    if intents[i] == "Undetermined":
        undetermined_race_counts[race] += 1

undetermined_per_hundredk = {}
for k,v in undetermined_race_counts.items():
    undetermined_per_hundredk[k] = (v / mapping[k]) * 100000

undetermined_per_hundredk

{'Asian/Pacific Islander': 0.0631546731837237,
 'Black': 0.3130385396404305,
 'Hispanic': 0.16136947098044616,
 'Native American/Native Alaskan': 0.3743810011268868,
 'White': 0.2964743032595409}
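
The homicide, suicide and undetermined cells repeat the same steps with a different intent value, so they could be folded into one helper. A minimal sketch, not the notebook's own code: the function name rates_for_intent is hypothetical, the population figures are copied from mapping in cell 11, and the four sample rows are copied from the data output, so the printed rates are for illustration only:

```python
def rates_for_intent(rows, intent, populations):
    """Deaths per 100000 people in each race, counting only rows with `intent`."""
    counts = {}
    for row in rows:
        race = row[7]
        counts.setdefault(race, 0)   # keep races that have no matching deaths
        if row[3] == intent:
            counts[race] += 1
    return {race: count / populations[race] * 100000
            for race, count in counts.items()}

# Population figures copied from the mapping dictionary in cell 11.
mapping = {'Asian/Pacific Islander': 15834141, 'Black': 40250635, 'Hispanic': 44618105,
           'Native American/Native Alaskan': 3739506, 'White': 197318956}

# Four rows copied from the data output above (an illustrative subset only).
sample_rows = [
    ['1', '2012', '01', 'Suicide', '0', 'M', '34', 'Asian/Pacific Islander', '100', 'Home', '4'],
    ['2', '2012', '01', 'Suicide', '0', 'F', '21', 'White', '100', 'Street', '3'],
    ['7', '2012', '02', 'Undetermined', '0', 'M', '48', 'White', '100', 'Home', '2'],
    ['9', '2012', '02', 'Accidental', '0', 'M', '50', 'White', '100', 'Other specified', '3'],
]

sample_suicide_rates = rates_for_intent(sample_rows, "Suicide", mapping)
sample_undetermined_rates = rates_for_intent(sample_rows, "Undetermined", mapping)
print(sample_suicide_rates)
print(sample_undetermined_rates)
```

Run against the full data list, rates_for_intent(data, "Suicide", mapping) should reproduce the suicide_per_hundredk output above.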

### Education

So what role does education play? First let's do a rough cut of counts for different levels of attainment. Handily, the education column is already coded as a number, although csv.reader gives us each code as a string.

In [15]:
education_counts = {}
education = [row[10] for row in data]
# Tally how many deaths fall under each education code.
for ed in education:
    if ed not in education_counts:
        education_counts[ed] = 0
    education_counts[ed] += 1

education_counts

{'1': 21823, '2': 42927, '3': 21680, '4': 12946, '5': 1369, 'NA': 53}

So we see that the number of gun deaths varies markedly with educational attainment, but the counts do not fall steadily as education rises. Taken at face value, they could lead us to the ludicrous conclusion that those who graduated from high school are a menace to society. We'd need to compare these counts with educational attainment in the wider population in order to understand the precise relationship, if any, between gun deaths and educational attainment. For example, we cannot tell whether college education is responsible for the lower number of gun deaths at that level of attainment, since college graduates may simply make up a smaller share of the population.
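
Without population figures by education level, the most the notebook's data supports is the share of all gun deaths that falls under each education code. A short sketch of that calculation, with education_counts re-entered from the output of cell 15; education_share is an illustrative name, and these shares are not rates per 100000:

```python
# Counts copied from the education_counts output above.
education_counts = {'1': 21823, '2': 42927, '3': 21680, '4': 12946, '5': 1369, 'NA': 53}

total = sum(education_counts.values())

# Share of all gun deaths recorded under each education code.
education_share = {level: count / total for level, count in education_counts.items()}

for level, share in education_share.items():
    print(f"{level}: {share:.1%}")
```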

One way that gives us further insight is to perform 

### Geography

What about where people died? It does seem from a simple count that home is by far the most common place for a gun death to occur.

In [16]:
place_counts = {}
places = [row[9] for row in data]
# Tally how many deaths occurred at each kind of place.
for place in places:
    if place not in place_counts:
        place_counts[place] = 0
    place_counts[place] += 1

place_counts

{'Farm': 470,
 'Home': 60486,
 'Industrial/construction': 248,
 'NA': 1384,
 'Other specified': 13751,
 'Other unspecified': 8867,
 'Residential institution': 203,
 'School/instiution': 671,
 'Sports': 128,
 'Street': 11151,
 'Trade/service area': 3439}

In [17]:

month_counts = {}
# Month is column 2 of each row (column 3 is intent).
months = [row[2] for row in data]
for month in months:
    if month in month_counts:
        month_counts[month] = month_counts[month] + 1
    else:
        month_counts[month] = 1

month_counts

In [31]:
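race_by_death = {}
# Assumed intent of this cell: count deaths by intent for victims whose
# education code is 3, 4 or 5. The codes are strings, so compare to strings.
for row in data:
    intents = row[3]
    if row[10] in ("3", "4", "5"):
        if intents in race_by_death:
            race_by_death[intents] += 1
        else:
            race_by_death[intents] = 1

race_by_death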